I do not expect meaningful differences between FP16, BF16, and FP32 or the models derived from them and so far I have not seen any evidence to the contrary either.
There is a difference when running test prompts (i.e. on Q4_K_M quants), at temp 0 for all three, depending on:
1 - The original source's "fp" (floating-point format)
2 - The outfile setting: fp16, fp32, or bf16.
Although the PPL differences are minor, the effect does show up when using a test prompt.
There are word changes, sentence changes, and the like.
On longer generations, the conclusions change as well.
It is not a big contrast, but it does show when testing this way.
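
For anyone who wants to check this themselves, here is a rough sketch of one way to list the word-level changes between two temp 0 outputs. The helper name `word_changes` and the two sample strings are made up for illustration; they are placeholders, not actual model output. Paste in the text generated by each quant variant instead.

```python
import difflib


def word_changes(a: str, b: str):
    """Return (operation, words_from_a, words_from_b) for each span that differs."""
    words_a, words_b = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    return [
        (tag, " ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replaced, inserted, or deleted spans
    ]


# Placeholder strings (NOT real model output); substitute the generations
# from, say, the fp16-sourced and bf16-sourced quants.
out_a = "The cat sat on the mat and slept."
out_b = "The cat sat on the rug and slept."

for tag, old, new in word_changes(out_a, out_b):
    print(f"{tag}: {old!r} -> {new!r}")
```

An empty result means the two generations are word-for-word identical. Since everything is run at temp 0, any entries that do show up should come from the model files rather than from sampling, assuming the same prompt, seed, and backend are used for each run.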