I have done extensive multi-GPU FLUX Full Fine-Tuning / DreamBooth training experiments on RunPod using 2x A100 80 GB GPUs (PCIe), since this was commonly requested of me.
Image 1
Image 1 shows that just the first part of the Kohya GUI installation took 30 minutes on such a powerful machine, on a very expensive Secure Cloud pod (3.28 USD per hour). There was also a part 2, so the installation alone took a very long time. On Massed Compute, it would take about 2–3 minutes. This is why I suggest you use Massed Compute over RunPod: RunPod machines have terrible hard disk speeds, and getting a good one is a lottery.
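Since getting a fast disk on RunPod is a lottery, a quick sanity check before installing anything can save you 30 minutes. Below is a minimal sketch (my own helper, not part of Kohya GUI or RunPod) that measures sequential disk write throughput:

```python
import os
import time

def disk_write_speed(path="speedtest.bin", total_mb=1024, chunk_mb=64):
    """Rough sequential write benchmark: writes total_mb in chunk_mb
    pieces and returns throughput in MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # force data to disk so the timing is honest
    elapsed = time.perf_counter() - start
    os.remove(path)  # clean up the test file
    return total_mb / elapsed

if __name__ == "__main__":
    print(f"Sequential write: {disk_write_speed():.0f} MB/s")
```

Run it in the pod's workspace directory (where Kohya would be installed); a pod that reports very low MB/s here will also be painfully slow at installation and checkpoint saving.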
Images 2, 3 and 4
Image 2 shows the speed of our very best FLUX Fine-Tuning config, shared below, when doing 2x multi-GPU training: https://www.patreon.com/posts/kohya-flux-fine-112099700
The used config name is: Quality_1_27500MB_6_26_Second_IT.json
Image 3 shows the VRAM usage of this config when doing 2x multi-GPU training.
Image 4 shows the GPUs of the pod.
Images 5 and 6
Image 5 shows the speed of the same best config when doing single-GPU training: https://www.patreon.com/posts/kohya-flux-fine-112099700
The used config name is: Quality_1_27500MB_6_26_Second_IT.json
Image 6 shows the VRAM amount this setup used.
Images 7 and 8
Image 7 shows the speed of the same best config when doing single-GPU training with Gradient Checkpointing disabled: https://www.patreon.com/posts/kohya-flux-fine-112099700
The used config name is: Quality_1_27500MB_6_26_Second_IT.json
Image 8 shows the VRAM amount this setup used.
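When comparing the single-GPU and 2x-GPU screenshots, remember that seconds/it alone is misleading: with data-parallel training each GPU processes its own batch per step, so effective throughput is what matters. A small sketch for computing scaling efficiency; the timing values in the example are placeholders, NOT my measured results (read your real numbers off the Kohya logs in the images):

```python
def images_per_second(sec_per_it, batch_size, num_gpus):
    # Data-parallel training: each GPU handles its own batch every step,
    # so effective throughput = num_gpus * batch_size / step_time.
    return num_gpus * batch_size / sec_per_it

def scaling_efficiency(single_sec_it, multi_sec_it, num_gpus, batch_size=1):
    """1.0 = perfect linear scaling; lower means inter-GPU overhead."""
    single = images_per_second(single_sec_it, batch_size, 1)
    multi = images_per_second(multi_sec_it, batch_size, num_gpus)
    return multi / (single * num_gpus)

# Placeholder timings for illustration only:
eff = scaling_efficiency(single_sec_it=6.26, multi_sec_it=7.0, num_gpus=2)
print(f"Scaling efficiency: {eff:.0%}")
```

A per-step time that rises slightly when adding a second GPU can still mean a large net speedup, since twice as many images are processed per step.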
Conclusions
As expected, when you train fewer parameters (e.g. LoRA vs. Full Fine-Tuning, or Single Blocks LoRA vs. All Blocks LoRA), your quality gets reduced. Of course, you gain some extra VRAM savings and also a smaller size on disk. Moreover, fewer parameters reduce the overfitting and the realism of the FLUX model, so if you are into stylized outputs like comics, it may work better. Furthermore, when you reduce the LoRA Network Rank, keep the original Network Alpha unless you are going to do new Learning Rate research.
Finally, the very best quality and least overfitting are achieved with Full Fine-Tuning. Full fine-tuning configs and instructions > https://www.patreon.com/posts/112099700
The second best is extracting a LoRA from the fine-tuned model, if you need a LoRA. Check the last columns of figure 3 and figure 4 — I set the extracted LoRA Strength / Weight to 1.1 instead of 1.0. Extract LoRA guide (public article): https://www.patreon.com/posts/112335162
Third is doing an all-layers regular LoRA training. Full guide, configs and instructions > https://www.patreon.com/posts/110879657
And the worst quality is training fewer blocks / layers with LoRA. Full configs are included in > https://www.patreon.com/posts/110879657
So how much VRAM and speed does single-block LoRA training save? All layers 16-bit is 27700 MB (4.85 second / it) while 1 single block is 25800 MB (3.7 second / it). All layers 8-bit is 17250 MB (4.85 second / it) while 1 single block is 15700 MB (3.8 second / it).
Image Raw Links
Figure 0 : MonsterMMORPG/FLUX-Fine-Tuning-Grid-Tests
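To put the single-block numbers in perspective, here is a quick calculation of what single-block LoRA training actually saves, using exactly the measurements quoted above (VRAM in MB, step time in seconds/iteration):

```python
# (vram_mb, sec_per_it) pairs quoted in the conclusions above.
configs = {
    "16-bit": {"all_layers": (27700, 4.85), "single_block": (25800, 3.7)},
    "8-bit":  {"all_layers": (17250, 4.85), "single_block": (15700, 3.8)},
}

for precision, c in configs.items():
    vram_all, sec_all = c["all_layers"]
    vram_one, sec_one = c["single_block"]
    vram_saved = vram_all - vram_one          # MB freed by training 1 block
    speedup = sec_all / sec_one               # per-iteration speedup factor
    print(f"{precision}: saves {vram_saved} MB VRAM, "
          f"{speedup:.2f}x faster per iteration")
# → 16-bit: saves 1900 MB VRAM, 1.31x faster per iteration
# → 8-bit: saves 1550 MB VRAM, 1.28x faster per iteration
```

So the single-block savings are modest (under 2 GB of VRAM and roughly 30% faster steps), which is why I rank quality-preserving options like Full Fine-Tuning and all-layers LoRA above it.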