PyTorch Training Optimization Techniques and Their Application to Diffusion Models
Diffusion Generative Models
Yiming Liu, Yisong Li (NVIDIA)

Agenda: Stable Diffusion Training Optimization
- Overview of optimization ideas
- Perf breakdown
- Details of optimization approaches

Stable Diffusion Training Pipeline (v1.4)
- Components: Text Encoder, Image Information Creator (UNet + scheduler), Image Decoder.
- The text prompts are turned into text embeddings; the training images feed the VAE encoder.
- Training runs the text encoder forward pass, the VAE encoder forward pass, and the UNet forward and backward passes.
- (Example: text_to_image fine-tuning on the Pokemon dataset; sample results for the prompts "yoda" and "iron man".)

Stable Diffusion Training Pipeline: Nsys timeline breakdown

| module | time (ms) | ratio |
| --- | --- | --- |
| vae.encoder fwd | 33.173 | 7.60% |
| text_encoder fwd | 13.946 | 3.20% |
| unet fwd | 81.677 | 18.72% |
| unet bwd | 125.197 | 28.69% |
| clip_grad_norm | 32.124 | 7.36% |
| update | 96.457 | 22.11% |
| ema_unet | 37.659 | 8.63% |
| other | 16.109 | 3.69% |
| total | 436.342 | 100.00% |

(Nsys timeline screenshots: TF32 vs. FP16 runs.)

Overview of Optimization Ideas

| config | batch_size | avg time/iter (s) | throughput (samples/s) | memory | speedup | comments |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline (TF32) | 8 | 0.9865 | 8.1098 | 69479 MiB | 1.0000 | Default config of the HuggingFace training recipe |
| Baseline (FP16) | 8 | 0.8375 | 9.5520 | 71695 MiB | 1.1778 | Default config of the HuggingFace training recipe |
| + dataloader | 8 | 0.6983 | 11.4564 | 71695 MiB | 1.4127 | The dataloader can become a bottleneck with multiple GPUs / DeepSpeed |
| + fused layer + fused Adam | 8 | 0.6319 | 12.6600 | 66949 MiB | 1.5611 | Fused Linear, MLP, LayerNorm and Adam |
| + flash attn | 8 | 0.4428 | 18.0670 | 25481 MiB | 2.2278 | FlashAttention instead of the standard PyTorch attention implementation |
| + flash attn (bs=48) | 48 | 2.0363 | 23.5727 | 72979 MiB | 2.9067 | Larger batch size enabled by the reduced memory footprint |
| + multi-stream + fused EMA | 48 | 2.0209 | 23.7511 | 72929 MiB | 2.9287 | Baseline EMA shows bubbles between iterations due to frequent malloc/free; overlapping and fusing the EMA removes them |
| + ZeRO-2 | 56 | 2.3302 | 24.0327 | 72439 MiB | 2.9634 | ZeRO-2 reduces GPU memory usage; the larger batch size improves training throughput |

Note: measured on a single 80 GB A100.

Optimization 1: Dataloader

Ideas
- Enable pin_memory and tune num_workers for the dataloader; it is always worth checking that these two parameters are set sensibly.
- Check in the Nsys timeline whether image decoding and preprocessing are a significant overhead (try DALI if they are).

Notes
- With pin_memory enabled and num_workers tuned, less work is left on the CPU critical path, e.g. host-to-device copies and image decoding.

| DataLoader setting | time for one training epoch |
| --- | --- |
| num_workers: 1, pin_memory: True | 2.93 s |
| num_workers: 2, pin_memory: True | 2.87 s |
| num_workers: 4, pin_memory: True | 2.85 s |
| num_workers: 8, pin_memory: True | 2.80 s |
| num_workers: 16, pin_memory: True | 2.92 s |

Note: measured on 8x 80 GB A100, batch size 56.
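As a concrete illustration of Optimization 1, here is a minimal sketch of the pin_memory / num_workers tuning. The dataset, tensor shapes, and batch size are placeholders standing in for the recipe's image-caption data, not the original training script.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the text_to_image fine-tuning dataset (illustrative only).
train_dataset = TensorDataset(
    torch.randn(256, 3, 64, 64),         # "images"
    torch.randint(0, 49408, (256, 77)),  # tokenized captions
)

# Optimization 1: pinned host memory plus a tuned worker count.
# pin_memory=True allows asynchronous host-to-device copies (non_blocking=True),
# and num_workers moves image decoding / preprocessing off the main process.
train_dataloader = DataLoader(
    train_dataset,
    batch_size=56,
    shuffle=True,
    num_workers=8,            # sweep 1/2/4/8/16 as in the table above; 8 was fastest there
    pin_memory=True,
    persistent_workers=True,  # keep workers alive across epochs
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, token_ids in train_dataloader:
    images = images.to(device, non_blocking=True)
    token_ids = token_ids.to(device, non_blocking=True)
    # ... the Stable Diffusion training step would go here ...
    break
```

If decoding and augmentation still dominate the Nsys timeline after this, moving them to DALI (as suggested above) is the next step.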
Optimization 2: Kernel Fusion

Leverage building blocks from APEX for layer fusion and optimizer fusion. NVIDIA APEX (repository and documentation) offers optimized, reusable building blocks; it can be pip-installed from GitHub and is pre-installed in the PyTorch NGC Docker containers.

Components
- Fused layers: apex.fused_dense.FusedDense, apex.fused_dense.FusedDenseGeluDense, apex.normalization.FusedLayerNorm
- Fused optimizers: apex.optimizers.FusedAdam, apex.optimizers.FusedLAMB, apex.optimizers.FusedNovoGrad, apex.optimizers.FusedSGD

Ideas
- Replace layers in the text_encoder/unet and the Adam optimizer with the optimized building blocks from APEX.

Notes
- Replacing modules may change parameter names (e.g. in the MLP blocks), which causes NaN loss if the state-dict keys are not matched up; make sure the keys are remapped correctly. (A code sketch of this swap appears after Optimization 4.)

Details
- Replace nn.Linear with apex.fused_dense.FusedDense
- Replace MLP blocks with apex.fused_dense.FusedDenseGeluDense
- Replace nn.LayerNorm with apex.normalization.FusedLayerNorm
- Replace torch.optim.AdamW with apex.optimizers.FusedAdam

Results: fused Adam optimizer
- Before fusion: 125.622 ms, 6860 vectorized_elementwise_kernel launches
- After fusion: 25.022 ms, 47 multi_tensor_apply_kernel launches
- Besides reducing kernel execution time, fusion reduces kernel launch overhead and memory I/O.

Results: elementwise kernels, per iteration
- Before kernel fusion the profile is dominated by elementwise kernels: vectorized_elementwise_kernel (one float4-aligned element per thread, better memory efficiency), unrolled_elementwise_kernel (not float4-aligned, loops within a thread), and elementwise_kernel (no such optimization).
- After kernel fusion the proportion of all elementwise kernels is reduced.

(Nsys before/after screenshots in the original slides.)

Optimization 3: Flash Attention (HazyResearch)

Key insight: on modern GPUs such as A100, the Tensor Cores are so fast that attention ends up bottlenecked by reading and writing GPU memory.
- FlashAttention speeds up attention by reducing GPU memory reads and writes.
- FlashAttention fuses the multi-head attention computation into a single kernel.
- It stores only the softmax normalization coefficients instead of the full N^2 attention matrix.

Currently supported
- Turing or Ampere GPUs
- fp16 and bf16 (bf16 requires Ampere GPUs)
- Head dimensions that are multiples of 8, up to 128 (e.g. 8, 16, 24, ..., 128); head dim 64 backward requires A100

Profiling observations (native attention vs. FlashAttention)
- Elementwise kernels are fused by FlashAttention, so their proportion is reduced.
- With FlashAttention, a separate fused MHA kernel is no longer needed.
- Kernel launch time and kernel count are aggressively reduced.
- FlashAttention reduces the memory footprint from 67 GB to 25 GB at the same batch size of 8, so the batch size can be increased.

| batch_size | native attention | FlashAttention | speedup |
| --- | --- | --- | --- |
| 8 | 12.531 | 17.651 | 1.408 |
| 16 | OOM | 19.497 | 1.556 |
| 48 | OOM | 21.162 | 1.589 |

(A calling sketch for FlashAttention appears after Optimization 4.)

Optimization 4: Multi-stream EMA & Fused EMA

What is EMA? An exponential moving average can be computed from just the previous EMA value and the current value; it does not require retaining the last N data points, which makes it quite memory efficient.

EMA in Stable Diffusion: Stable Diffusion keeps an exponential moving average of the model weights to improve the quality of the resulting images and to avoid overfitting to the most recently trained images.

Async EMA: the EMA update is independent of UNet training; it only maintains weights that have to be ready before the next iteration, so a separate stream can overlap the EMA update with other computation.

Implementation steps
- Create a dedicated EMA CUDA stream.
- Make sure the EMA update has completed before the next iteration's weight update.
- Add a final EMA step so the result stays equivalent to the original implementation.
- Compute the EMA on the new stream, overlapping it with model computation.

(Nsys comparison of the native and fused EMA implementations in the original slides.)
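Returning to Optimization 2, below is a minimal, hedged sketch of the APEX module swap. It shows only the nn.LayerNorm to FusedLayerNorm and AdamW to FusedAdam replacements, applied to a small dummy module so it can run standalone; in the real recipe the same helper would be applied to the unet and text_encoder, and the FusedDense / FusedDenseGeluDense swaps follow the same copy-the-weights pattern.

```python
import torch
import torch.nn as nn
from apex.normalization import FusedLayerNorm
from apex.optimizers import FusedAdam


def swap_layernorms(module: nn.Module) -> None:
    """Recursively replace nn.LayerNorm with APEX FusedLayerNorm, preserving weights."""
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm):
            fused = FusedLayerNorm(
                child.normalized_shape,
                eps=child.eps,
                elementwise_affine=child.elementwise_affine,
            )
            if child.elementwise_affine:
                # Copy parameters explicitly so state-dict keys and values still line up;
                # mismatched keys are what caused the NaN-loss issue noted above.
                fused.weight.data.copy_(child.weight.data)
                fused.bias.data.copy_(child.bias.data)
            setattr(module, name, fused)
        else:
            swap_layernorms(child)


# Dummy module standing in for the recipe's unet / text_encoder.
model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64), nn.LayerNorm(64))
swap_layernorms(model)

# Replace torch.optim.AdamW with the fused multi-tensor Adam from APEX.
optimizer = FusedAdam(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-2,
    adam_w_mode=True,  # decoupled weight decay, i.e. AdamW behaviour
)
```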
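For Optimization 3, this is a hedged sketch of calling the HazyResearch flash-attn kernel directly and checking it against a naive attention. It assumes the flash-attn 2.x Python API (flash_attn_func), fp16 inputs on an Ampere-class GPU, and small placeholder shapes; it illustrates the constraints listed above rather than the exact integration used in the talk, which patched the attention modules inside the UNet.

```python
import torch
from flash_attn import flash_attn_func  # flash-attn 2.x API (assumption)

# Shapes respect the constraints above: fp16, head_dim a multiple of 8 (<= 128).
batch, seqlen, nheads, headdim = 4, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)


def naive_attention(q, k, v):
    # Materializes the (seqlen x seqlen) score matrix in HBM: this N^2 read/write
    # traffic is exactly what FlashAttention avoids.
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bshd,bthd->bhst", q, k) * scale
    probs = scores.softmax(dim=-1)
    return torch.einsum("bhst,bthd->bshd", probs, v)


# One fused kernel; only the softmax normalization statistics are kept, not the matrix.
out_flash = flash_attn_func(q, k, v, dropout_p=0.0, causal=False)
out_naive = naive_attention(q, k, v)
print(torch.allclose(out_flash, out_naive, atol=1e-2, rtol=1e-2))
```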
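And for Optimization 4, a minimal sketch of the multi-stream EMA idea: the EMA update runs on its own CUDA stream so it overlaps with the next iteration's work, and the per-parameter loop is replaced by multi-tensor torch._foreach_* ops as a stand-in for the talk's fused EMA kernel, which is not reproduced here. The update rule is the usual shadow = decay * shadow + (1 - decay) * param.

```python
import torch


class AsyncEMA:
    """EMA of model weights, updated on a dedicated CUDA stream."""

    def __init__(self, params, decay=0.9999):
        self.decay = decay
        self.params = [p for p in params if p.requires_grad]
        self.shadow = [p.detach().clone() for p in self.params]
        self.stream = torch.cuda.Stream()  # dedicated EMA stream

    @torch.no_grad()
    def update(self):
        # The EMA kernels must see the weights produced by the optimizer step that
        # just ran on the default stream ...
        self.stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.stream):
            # shadow = decay * shadow + (1 - decay) * param, one multi-tensor op each
            torch._foreach_mul_(self.shadow, self.decay)
            torch._foreach_add_(self.shadow, self.params, alpha=1.0 - self.decay)

    def wait(self):
        # ... and the next weight update must wait for the EMA, so the weights the
        # EMA is reading are not overwritten mid-update ("EMA completed before the
        # next iteration's update").
        torch.cuda.current_stream().wait_stream(self.stream)


# Sketch of the training-loop placement:
#   loss.backward()
#   ema.wait()            # EMA from the previous iteration must be done
#   optimizer.step(); optimizer.zero_grad()
#   ema.update()          # overlaps with the next iteration's dataloading / forward
# After the final iteration, call ema.wait() once more (the "last EMA" step) so the
# shadow weights end up equivalent to the original synchronous implementation.
```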
Optimization 5: ZeRO

Ideas
- Reduce gradient, activation, and fragmented memory via ZeRO, and increase the batch size as much as possible. (A minimal ZeRO-2 config sketch follows the summary.)

Summary

Combining dataloader tuning, APEX kernel fusion, FlashAttention, multi-stream + fused EMA, and ZeRO-2 takes the training throughput from 8.11 samples/s (TF32 HuggingFace baseline) to 24.03 samples/s on a single 80 GB A100, a 2.96x overall speedup; the full breakdown is in the table in the overview above.

Thanks!
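As a closing appendix for Optimization 5, here is a minimal sketch of enabling ZeRO Stage 2 through a DeepSpeed config dict. The HuggingFace recipe reaches DeepSpeed through accelerate; the bucket sizes, batch size, learning rate, and the dummy model below are placeholders rather than the configuration behind the numbers above.

```python
import torch.nn as nn
import deepspeed

# Stand-in for the UNet being fine-tuned.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

ds_config = {
    "train_micro_batch_size_per_gpu": 56,  # the freed memory allows a larger batch
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # shard optimizer states and gradients
        "overlap_comm": True,          # overlap gradient reduction with backward
        "contiguous_gradients": True,  # reduce gradient-memory fragmentation
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8,
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4, "weight_decay": 1e-2}},
}

# Run under the deepspeed (or accelerate) launcher so distributed init is handled.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```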