SimpleQA score

#1
by frappuccino - opened

Hey, thank you for uploading the model.

Is the Factual Knowledge | SimpleQA score really just 3.0? Are these your numbers?

The README is taken directly from Microsoft's AzureML Model Catalog

Including the SimpleQA score shows a lot of integrity by Microsoft. And yes, the score must be accurate, because I ran Phi4 through my personal general knowledge test and it performed MUCH worse than smaller models like Llama 3.1 8b and Gemma 2 9b, despite Phi4's larger size and much higher MMLU score.
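For anyone who wants to run that kind of spot check themselves, something like this minimal sketch is enough (the questions.jsonl file of question/answer pairs is hypothetical, and it uses the transformers text-generation pipeline with greedy decoding; the model IDs are the public Hugging Face repos):

```python
# Minimal sketch of a general-knowledge spot check across a few models.
# "questions.jsonl" is a hypothetical file of {"question": ..., "answer": ...} lines.
import json
from transformers import pipeline

MODELS = [
    "microsoft/phi-4",
    "meta-llama/Llama-3.1-8B-Instruct",
    "google/gemma-2-9b-it",
]

def exact_match_rate(model_id, path="questions.jsonl"):
    gen = pipeline("text-generation", model=model_id, device_map="auto")
    hits, total = 0, 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            messages = [{"role": "user", "content": item["question"]}]
            out = gen(messages, max_new_tokens=64, do_sample=False)
            reply = out[0]["generated_text"][-1]["content"]
            # Count a hit if the expected answer string appears in the reply.
            hits += item["answer"].lower() in reply.lower()
            total += 1
    return hits / total

for m in MODELS:
    print(m, exact_match_rate(m))
```

Substring matching is crude compared to SimpleQA's grading, but it's enough to show the relative gap between models on simple factual questions.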

This means Microsoft actively trained on the subset of humanity's most popular knowledge that overlaps with the domains covered by the MMLU, including virology.

This not only left Phi4 (and the entire Phi series) profoundly ignorant of humanity's most popular knowledge, but also severely crippled its instruction following. For example, it was trained on so much math and code data that when I ask questions about completely unrelated things like movies, music, books, sports... it often outputs the answers in JSON. Plus it keeps returning nearest matches rather than respecting the nuances of the prompts.

Sadly, this is becoming the norm. Ever since the peak of ~7b performance with Llama 3.1 8b and Gemma 2 9b, a flood of overfitters has been released, starting with Qwen2.5 and followed by Ministral, EXAONE3.5, Cohere 7b, Falcon3 7/10b, and others. All of them scored ~5+ points higher on the MMLU than L3.1 8b & G2 9b, yet have VASTLY less general knowledge. Plus they're also plagued with instruction-following issues like ignoring prompt nuance and outputting in JSON.

I hope open source AI makes a course correction soon, but I'm not counting on it. More than 95% of the population has no interest in AI models that heavily prioritize math, coding, and what's covered by standardized multiple-choice tests like the MMLU.
