🚩 Report
Reflection 70B benchmarks are not real
The whole drama is described here:
https://x.com/shinboson/status/1832933753837982024
Matt Schumer is a fraud.
The claim is that the requests sent to "Reflection 70B" through his hosted API were being routed to a model other than the one whose weights are hosted in this HF repo. The fact that you're unable to reproduce any of the responses other users saw from their API when you run it locally is further evidence that what was benchmarked is not what is in this repo.
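One way to run that reproduction check, sketched below under assumptions: the hosted endpoint URL, API key, and model name are hypothetical placeholders, the local run assumes the repo's chat template is usable, and a 70B model needs multi-GPU or quantized loading. The idea is simply to send the same prompt with greedy decoding to both the hosted API and the weights in this repo and compare.

```python
from openai import OpenAI
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = "How many r's are in the word 'strawberry'?"

# (a) the hosted "Reflection 70B" API -- endpoint and model name are hypothetical placeholders
client = OpenAI(base_url="https://api.example-host.invalid/v1", api_key="PLACEHOLDER")
api_out = client.chat.completions.create(
    model="reflection-70b",
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0,
).choices[0].message.content

# (b) the open weights in this repo, greedy decoding
tok = AutoTokenizer.from_pretrained("mattshumer/ref_70_e3")
model = AutoModelForCausalLM.from_pretrained(
    "mattshumer/ref_70_e3", torch_dtype=torch.bfloat16, device_map="auto"
)
input_ids = tok.apply_chat_template(
    [{"role": "user", "content": PROMPT}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
out_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
local_out = tok.decode(out_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)

print("API   :", api_out)
print("Local :", local_out)
print("Match :", api_out.strip() == local_out.strip())
```

If the hosted API and the local weights disagree on greedy-decoded outputs for the same prompts, that supports the claim that something other than this repo was serving the API.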
@nisten You are missing the point here. The model uploaded here is not the same as the API access that others evaluated (e.g. Artificial Analysis). You will not be able to reproduce those issues with an open-weight model, at least not until Matt himself uploads the model, the evals, and the exact prompt used, so the testing can be done against his open weights (not a private model).
Right, so TL;DR: Matt Schumer is a fraud using the media as pawns to promote his company. TikTok Matt.
Let them upload it to GitHub Models if it's so good. Let's keep this space clear of scams, please.
Why don't you reply to my posts first before calling out others as FRAUD ACCOUNTS here?
https://huggingface.co./mattshumer/ref_70_e3/discussions/5#66defbe383b31d8cf891724b
https://huggingface.co./mattshumer/ref_70_e3/discussions/7#66dee423cccbad2a02574834
@nisten , reply to this before you claim anything.
Can you read the titles of these posts? They're talking about the official "Reflection 70B" APIs. Did you test against those APIs?
I see you're posting results related to this Twitter thread.
Do you really understand what he is trying to prove? He's trying to prove that the LLM behind the "Reflection 70B" API is using the same tokenizer as Claude 3, GPT-4o, or whatever. The images he posted support that point (see the tokenizer-comparison sketch below).
What are you trying to prove here by posting this image? If anything, you're proving that what they uploaded here and what they host behind the API are totally different. You should explain in detail what you want to prove.
Also, I see you're using local models, so you're testing a different model from the one these posts' claims are about. A natural question: can you reproduce the evaluation results @mattshumer provided? Why not post your independent evaluation results here so everyone can decide whether they're genuine or overclaimed?
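For context, here is a minimal sketch of the kind of tokenizer comparison that Twitter thread relies on, assuming tiktoken's published GPT-4o encoding (o200k_base) and the Llama-3-style tokenizer shipped in this repo; Anthropic's Claude tokenizer is not public, so it can only be probed indirectly through the API.

```python
import tiktoken
from transformers import AutoTokenizer

# Different model families split the same text at different token boundaries,
# so an API whose behavior tracks one family's boundaries hints at which
# backend actually serves it.
probe = "The quick brown fox jumps over the lazy dog. 1234567890"

gpt4o_enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's public encoding
llama_tok = AutoTokenizer.from_pretrained("mattshumer/ref_70_e3")

gpt4o_ids = gpt4o_enc.encode(probe)
llama_ids = llama_tok.encode(probe, add_special_tokens=False)

print("GPT-4o pieces:", [gpt4o_enc.decode([t]) for t in gpt4o_ids])
print("Llama pieces :", llama_tok.convert_ids_to_tokens(llama_ids))
print("Counts       :", len(gpt4o_ids), "vs", len(llama_ids))
```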
This is a local model.
You are coming from r/LocalLLaMA to complain about a model which you're NOT running locally.
Please, RUN IT LOCALLY, then post screenshots of WHAT YOU LOCALLY RAN!
COMPRENDE, CAPISCI, KUPTON?
Can you fix the chat_template HERE, not on Reddit, not on uncle Elon's Twitter, but HERE, and then run it BEFORE yapping?
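If someone does want to patch the template locally, here is a minimal sketch. The Llama-3-style template below is an assumption (Reflection 70B is described as a Llama 3.1 finetune), not the author's confirmed format, and the system prompt is a placeholder rather than the recommended one.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mattshumer/ref_70_e3")

# Assumed Llama-3-style chat template; verify the special tokens against the
# repo's tokenizer_config before relying on it.
LLAMA3_STYLE_TEMPLATE = (
    "{{ bos_token }}"
    "{% for message in messages %}"
    "{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}"
    "{{ message['content'] | trim }}{{ '<|eot_id|>' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}"
    "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"
    "{% endif %}"
)
tok.chat_template = LLAMA3_STYLE_TEMPLATE

prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "You are a helpful assistant."},  # placeholder system prompt
        {"role": "user", "content": "How many r's are in 'strawberry'?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # inspect the rendered prompt before generating
# tok.save_pretrained("./ref_70_e3-fixed-template")  # persist the fix locally
```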
Before you report your independent evaluation results, please disclose whether you and @mattshumer have a conflict of interest.
In particular, any relationship such as friends, business partners, or knowing each other personally.
No, I don't actually have one, for real, but I think we all need a new open-source license that's Apache for everyone except Reddit users.
So go back to r/LocalLLaMA and tell them that Enigrand's yapping has inspired Nisten to make an open-source license that bans Reddit users.
"No I don't actually have for real but I think we all need a new opensource license that's apache for everyone except reddit users."
Quoting from your Twitter post:
i dont know what he's on about with torrents, he hasnt slept in 4 days,
checkpoint 3 is working fine as far as I tested, albeit not great (it goes in loops), but IT WAS PASSING MOST OF THE TESTS ya'll claimed it didnt
Ok please just tell me what prompts to try?
EXPLAIN WHO HE IS or CHANGE YOUR DISCLOSURE. Also, please don't delete your Twitter posts.
Here are the evaluation results from Kristoph on Twitter.
These are the final notes from my work on the Reflection model. I tested the latest version of the model hosted by @hyperbolic_labs. I attempted a variety of strategies, including varying the temperature and system prompt; ultimately these had only a modest impact on the results. The final numbers I am presenting here use the prompt the Reflection team recommended. I did have to modify the question format somewhat to ensure Reflection properly generated the response (the instruction to output a letter choice was moved to the end of the prompt).
The TL;DR is that on virtually every benchmark the Reflection model was on par with the Llama 3.1 70B it is based on.
I ultimately ran through the entire MMLU Pro corpus for biology, chemistry, physics, engineering, health, law, philosophy, and math, all 0-shot. In all but one case Reflection was within 1-2% of Llama 3.1 70B 0-shot and 1-3% below its 5-shot score. In all cases Llama 70B was called with no system prompt.
The one area where Reflection performed better was math, where it scored 3% higher.
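For anyone who wants to sanity-check numbers like these themselves, here is a rough sketch of that 0-shot letter-choice setup. It assumes the TIGER-Lab/MMLU-Pro dataset on the Hub (with fields like `question`, `options`, `answer`, `category`) and a hypothetical `generate_text(prompt)` helper wrapping whichever backend is under test.

```python
import re
from datasets import load_dataset

LETTERS = "ABCDEFGHIJ"  # MMLU-Pro questions have up to 10 options

def build_prompt(example):
    # Options first, letter-choice instruction at the END of the prompt,
    # mirroring the format change described above.
    options = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(example["options"]))
    return (
        f"Question: {example['question']}\n{options}\n\n"
        "Think step by step, then answer with only the letter of the correct option."
    )

def accuracy(rows, generate_text, limit=200):
    correct = total = 0
    for ex in rows.select(range(min(limit, len(rows)))):
        reply = generate_text(build_prompt(ex))
        letters = re.findall(r"\b([A-J])\b", reply)
        pred = letters[-1] if letters else None  # take the last standalone letter as the answer
        correct += int(pred == ex["answer"])
        total += 1
    return correct / total

# Usage (hypothetical backend call):
# ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
# biology = ds.filter(lambda ex: ex["category"] == "biology")
# print(accuracy(biology, generate_text=my_backend_call))
```

Running the same loop against both the repo weights and the hosted API would make "on par with Llama 3.1 70B" directly checkable instead of a matter of screenshots.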