Low MMLU Pro and MMLU Benchmark Replication Scores

#8
by uygarkurt - opened

Hi. I tried to replicate the results you have provided.

You report an MMLU Pro score of 0.74, but when I try to replicate it with the LM Evaluation Harness, I get 0.27. With your regular Phi-4 model I get 71.4, which matches the reported number. On MMLU, Phi-4-reasoning gives me 0.23, which is essentially random chance. What's going wrong here? I'm running the LM Evaluation Harness with default parameters.
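For reference, this is roughly the command I'm running (a minimal sketch of the lm-evaluation-harness CLI with defaults; the model ID is my assumption, substitute the actual checkpoint path if it differs):

```bash
# Sketch of the evaluation run: lm-evaluation-harness with default
# parameters, only the model and task specified.
# The pretrained path is an assumption; replace with the actual repo ID.
lm_eval --model hf \
  --model_args pretrained=microsoft/Phi-4-reasoning \
  --tasks mmlu_pro \
  --batch_size auto
```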

Thank you.
