I'm still confused by the proliferation of bf16. It certainly doesn't hurt compared to fp16, but in my testing, even on A100 GPUs that are optimized for it, both training speed and inference quality are the same between bf16 and fp16.
Sometimes during training, networks that would converge in fp32 will explode to Infs or NaNs in fp16 because of its limited range. bf16, generally speaking, fixes that.
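A quick way to see that range difference (PyTorch sketch, nothing model-specific): fp16 tops out around 65504, while bf16 keeps fp32's 8-bit exponent.

    import torch

    x = torch.tensor(70000.0)    # fine in fp32
    print(x.to(torch.float16))   # inf (past fp16's ~65504 max)
    print(x.to(torch.bfloat16))  # finite, just coarsely rounded
    print(torch.finfo(torch.float16).max)   # 65504.0
    print(torch.finfo(torch.bfloat16).max)  # ~3.39e38, same ballpark as fp32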
It's also true that fp16 is often manageable with enough batch/layer norm and gradient clipping.
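For what it's worth, those band-aids look roughly like this (toy sketch with a made-up model, not anyone's actual training code):

    import torch
    import torch.nn as nn

    # Toy model with a LayerNorm to keep activations in a sane range.
    model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16), nn.ReLU(), nn.Linear(16, 1))
    loss = model(torch.randn(8, 16)).pow(2).mean()
    loss.backward()
    # Clip the gradient norm before the optimizer step so one bad batch can't blow up fp16.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)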
Yeah, I spent a few months comparing the two, and empirically I had a lot more issues with normalized entropy (exploding, not converging, converging more slowly) with fp16 than with bf16.
The transfer pipeline I wrote for fp32->fp16 also took a lot more work than the one for fp32->bf16.
My understanding is that for certain types of networks BF16 will train better than FP16: the extended range of BF16 gives additional protection against exploding gradients and loss values, at the cost of some precision.
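The precision side of that trade-off is easy to demo (quick sketch): fp16 keeps 10 mantissa bits, bf16 only 7, so small relative differences just disappear in bf16.

    import torch

    x = torch.tensor(1.001)
    print(x.to(torch.float16).item())   # ~1.0010 (10 mantissa bits)
    print(x.to(torch.bfloat16).item())  # 1.0 (7 mantissa bits; the step near 1 is ~0.0078)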
bf16 is generally easier to train neural networks in than fp16, since there's no need for loss scaling. And most model training and inference performs the same with fp32 as with bf16.
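Concretely, "no need for scaling" means you can drop the GradScaler dance. A rough PyTorch sketch (the model, data, and hyperparameters are toy placeholders, and it assumes a CUDA device):

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 1).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    criterion = nn.MSELoss()
    x = torch.randn(32, 16, device="cuda")
    y = torch.randn(32, 1, device="cuda")

    # fp16 mixed precision: small gradients can underflow fp16, so the usual
    # recipe scales the loss up before backward and unscales before stepping.
    scaler = torch.cuda.amp.GradScaler()
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.float16):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

    # bf16 mixed precision: same exponent range as fp32, so no scaler is needed.
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()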
Despite the other answers, I will tell you the grim truth: Your mileage might vary.
It's an empirical question and depends on the nature of your problem and data. You should try all three (fp32, fp16, and bf16) as part of your model selection / hyperparameter tuning.
For example, in audio generative models (where the typical output is 16-bit), I've sometimes found that fp16 and bf16 just don't produce output as good as fp32 weights do.
(Not an ML guy.) bf16 and fp16 should be comparable if the weights are of the same magnitude, but what happens in a network where the weights are poorly regularized?
Someone commented below that with enough batchnorm/layernorm/etc. and/or gradient clipping you can manage it, but BF16 just makes life easier if you can live without some precision.