Low MMLU Pro and MMLU Benchmark Replication Scores
#8 opened by uygarkurt
Hi. I tried to replicate the results you reported.
You report an MMLU Pro score of 0.74, but when I try to replicate it with the LM Evaluation Harness I get 0.27. With the regular Phi-4 model I get 71.4, which matches what's reported. On MMLU, Phi-4-reasoning gives me 0.23, which is basically random chance. What's going wrong here? I'm running the LM Evaluation Harness with default parameters.
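For reference, this is roughly the invocation I'm using. The model repo id and the exact task name are my assumptions here; everything else is left at the harness defaults:

```shell
# Illustrative lm-evaluation-harness run (repo id and task name are guesses,
# not a confirmed reproduction recipe)
lm_eval \
  --model hf \
  --model_args pretrained=microsoft/Phi-4-reasoning \
  --tasks mmlu_pro \
  --batch_size auto
```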
Thank you.