Low MMLU Pro and MMLU Benchmark Replication Scores

#8
by uygarkurt - opened

Hi. I tried to replicate the results you have provided.

You report an MMLU Pro score of 0.74, but when I try to replicate it with the LM Evaluation Harness, I get 0.27. With your regular Phi-4 model I get 71.4, which matches the reported number. On MMLU, Phi-4-reasoning gives me 0.23, which is essentially random chance. What's going wrong here? I'm running the LM Evaluation Harness with default parameters.
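For reference, this is roughly the command I'm running (a minimal sketch of the lm-evaluation-harness CLI with defaults; the model ID is my assumption, substitute the actual checkpoint path if it differs):

```bash
# Sketch of the evaluation run: lm-evaluation-harness with default
# parameters, only the model and task specified.
# The pretrained path is an assumption; replace with the actual repo ID.
lm_eval --model hf \
  --model_args pretrained=microsoft/Phi-4-reasoning \
  --tasks mmlu_pro \
  --batch_size auto
```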

Thank you.
