Notably better than Phi3.5 in many ways, but something is wrong.

#5
by phil111 - opened

Thanks for the sneak peek. This model is certainly more powerful than Phi3.5, but it's far too focused on things like coding to work as a general purpose LLM.

For example, it keeps answering basic questions with json formatting, such as the one below (although the answer is correct).

What is the condition called when blood sugar briefly falls too far after eating?

{
  "answer": "reactive hypoglycemia",
  "relevant_passages": [
    {
      "title": "Reactive Hypoglycemia",
      "score": 1,
      "passage_id": 2,
      "text": "Reactive hypoglycemia is a condition in which blood sugar levels drop too low after eating. It's also called postprandial hypoglycemia, or post-meal hypoglycemia."
    }
  ]
}

And while performance on various tasks like creative story writing is improved, plus the alignment is less absurd (e.g. fewer denials and moralizations about perfectly legal, common, and harmless things), its general world knowledge is really bad for its size. Smaller models like Llama 3.1 8b and Gemma 9b can functionally respond to a much wider variety of prompts.

In short, Phi4, like Phi3.5, is just an AI tool, not a general purpose AI model like Llama 3.1, Gemma2 or ChatGPT. It's so overtrained on select tasks that its output format commonly makes no sense (e.g. json), and it can't function as an AI for the general population because of extensive data filtering. That is, Microsoft acted as the gatekeepers of information, deciding which of humanity's most popular information to add to the corpus, leaving it almost completely ignorant of too many of the things the general population cares most about. Again, that makes it an AI tool/agent (e.g. a coder or math solver), not a general purpose chat/instruct AI model.

Thanks for your review, phil. I know you calculate exact scores for the models you test; I'd be interested to know what score Phi 4 gets compared to other models on your knowledge test.

The skewed distribution from heavy tuning probably makes it harder for the model to answer more general questions, but it might know more than it is willing to say. If it responds to basic questions in json format, then the fine-tuning process, which primarily shapes the output format and style of responses, may be the cause of the skewed behavior rather than the pretraining data itself; I doubt the distribution resulting from pretraining alone could produce this kind of response. So re-tuning it could maybe fix it? That's just my idea though.
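If that hypothesis is right, even a light supervised re-tune on plain-prose Q&A pairs might shift the output style back. Purely as an illustration of what I mean (the repo id, hyperparameters, and data here are placeholder assumptions, not a tested recipe):

```python
# Illustrative sketch only: a small LoRA re-tune aimed at output style,
# not at adding knowledge. Repo id and hyperparameters are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-4")  # assumed repo id

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",  # let peft target every linear layer
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

# Then train on (question, plain-text answer) pairs with any standard SFT
# loop (e.g. trl's SFTTrainer) so the adapter learns prose answers over json.
```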

I think there is something wrong with your setup. It seems that phi4 is regurgitating some training data instead of answering properly.
The same prompt returned a properly formatted answer when I tried it.

Make sure that you are using the correct chat template. Phi4 uses a modified ChatML format. Also try reducing the temperature settings.
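For example, with the Hugging Face transformers library you can let the tokenizer apply the model's own template instead of hand-building the prompt. A minimal sketch (the repo id is an assumption; check the model card for the exact settings):

```python
# Minimal sketch: build the Phi4 prompt with the tokenizer's own chat
# template (a modified ChatML format) rather than formatting it by hand.
# The repo id below is an assumption; adjust to the actual checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user", "content": "What is the condition called when blood sugar briefly falls too far after eating?"},
]

# apply_chat_template inserts the model's special tokens so the prompt
# matches what the model saw during fine-tuning.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

With the wrong template the model can fall back on formats it saw in its training data, which would explain the json answers.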

@nlpguy @matteogeniaccio You guys must be right. Something is clearly configured wrong on my end, and the chat template is the most likely culprit. But I do use low temps (0 and 0.3) and minimize hallucinations further with a high min-P, so there are clearly pockets of very popular knowledge missing from the corpus, regardless of any configuration issue.
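For reference, in transformers terms the sampling settings I described would look something like this (a sketch assuming a recent transformers version with min-p support; the exact min-p value here is illustrative):

```python
# Sketch of the sampling settings described above. The min_p value is
# illustrative; temperature 0 is just greedy decoding (do_sample=False).
from transformers import GenerationConfig

gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.3,  # low temperature
    min_p=0.3,        # high min-p: drop tokens under 30% of the top token's probability
    max_new_tokens=256,
)
# outputs = model.generate(inputs, generation_config=gen_config)
```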

Thanks for testing the question. About half of the answers to my questions are in json format, so if this was inherent to the model you would undoubtedly have noticed it too. Normally I would test odd outputs against a full-float version online, but I can't find any (e.g. at LMsys).

But despite my configuration issue there's clearly something special about this model. It handled a complex, and frankly absurd, story prompt much better than previous Phi models. Plus it correctly answered some unusually difficult questions smaller models typically get wrong (e.g. the Thorne–Żytkow Object, and the more obscure literary reference "it was the age of wisdom, it was the age of foolishness" rather than just the far more recognizable "It was the best of times, it was the worst of times" from Charles Dickens' A Tale of Two Cities). I look forward to seeing how this performs once configured properly.

Seems to be noticeably smarter than any model I tried up to 14b. Answered all my test questions correctly, and never in json.

@urtuuu Thanks for checking. I clearly configured something wrong so I'm closing this discussion.

phil111 changed discussion status to closed

Apparently I was right.

Phi4 14b was getting most of my simple questions wrong across popular domains of knowledge, and sure enough it only scored 3.0/100 on SimpleQA, which is the gold standard of world knowledge tests because it's extensive, diverse, hard, not multiple choice, and new, so not yet contaminated.

By comparison Phi3 14b, despite scoring notably lower than Phi4 14b on the MMLU, scored notably higher on SimpleQA (7.6). So Microsoft is not only heavily favoring the small subset of popular knowledge that maximizes test scores, but is doing so to a progressively greater degree.

Qwen recently did the same, with Qwen2 72b scoring much higher on my general knowledge test than Qwen2.5 72b. Sure enough, Qwen2.5 72b only scored 10 on SimpleQA, much lower than Llama 3.1 70b. And the smaller Qwen2.5 models, including the 32b & 14b, scored as high as or higher than Llama 3.1 70b on the English MMLU despite scoring only ~5 on SimpleQA.

I was hoping that the open source AI community would produce viable alternatives to proprietary models, but most are just training for test scores and bragging rights. Qwen2.5, Falcon3, Cohere 7b, EXAONE3.5, Phi 1-4, Ministral... are all less than useless to >95% of the population. They're basically just overfit to the tests and a single demographic (coding-focused early adopters here on HF and LMsys).

Meta and Google are currently open source AI's best hope for quality general purpose models. Mistral is starting to slip (Nemo & Ministral grossly overfit the tests and have much less world knowledge than the earlier, smaller Mistral 7b), and IBM Granite is only OK, but shows potential. Pretty much all the rest are useless, just training for select tasks and tests.
