Do not even try to download lower than 8-bit ;P
The model is very small, so perplexity will be huge with heavier quantization.
Anyway, that model seems pretty dumb. Either its reasoning ability is genuinely very low or the implementation under llama.cpp is wrong ... no idea.
> Instruct: There are three killers in a room. Someone enters the room and kills one of them. Nobody leaves the room. How many killers are left in the room? Explain your reasoning step by step.
\nOutput:
Assistant: I think there is only one killer left in the room, because when someone enters the room and kills one of the killers, it means that all three were killed except for one who managed to escape or fight back. Therefore, there can be no other possible scenario where two or more killers are still alive. This reasoning follows from a principle of logic called modus tollens, which says that if a conditional statement is true and its consequent is false, then its antecedent must also be false. In this case, the conditional statement is: If there were three killers in the room, then two or more would still be alive. The consequent is false, because we know that one of them was killed. Therefore, the antecedent (that there were three killers) must also be false. This can be written as:
\begin{align*}
\text{If } P \rightarrow Q \quad & \text{(If there were three killers in the room, then two or more would still be alive.)}\\
\text{Q \neg} \rightarrow \neg P \quad & \text{(Not having two or more alive implies that there was only one killer left.)}\\
\end{align*}
So we can conclude:
\begin{align*}
(P \rightarrow Q) \land (Q \neg) \Rightarrow \neg P
\end{align*}
which means:
\begin{align*}
(If there were three killers in the room, then two or more would still be alive.) & \text{And } (Not having two or more alive implies that there was only one killer left.)\\
\neg P \\
\end{align*}
which is the same as:
\begin{align*}
\neg (There were three killers in the room) \\
\end{align*}
So we can say for sure:
Indeed, this model is small enough. Sharing my FP16 for those interested in more precision: https://huggingface.co./kroonen/phi-2-GGUF/blob/main/ggml-model-f16.gguf
What about seeing if CoT (chain of thought) and/or ToT (tree of thought) prompting would help? We may have to supply examples, something like the sketch below.
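For instance, a one-shot CoT prompt along these lines might be worth a try (just a rough sketch, the exact wording is mine and I have not validated it against phi-2):
'Instruct: There are five apples in a basket. Two are taken out. How many apples remain? Think step by step before answering.
Output: Start with 5 apples. Taking 2 out leaves 5 - 2 = 3. The answer is 3.
Instruct: There are three killers in a room. Someone enters the room and kills one of them. Nobody leaves the room. How many killers are left in the room? Think step by step before answering.
Output:
'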
I'm wondering if the word "killer" may also be triggering a censorship-related issue with the model. I think Microsoft tried to censor the model, if my memory serves me correctly, which is known to hurt a model's capabilities/capacity. (Kind of like how something similar affects sheltered kids in real life. lol.)
I tried a modified form of your question with the 8-bit quant from TheBloke. I was just curious if I could get it to reason correctly. It was hit and miss, to be honest. So you guys may be onto something.
I did get it to work, but I had to do some serious thinking about how to word it. Also, strangely, I had to put a line as a pre-prompt/system prompt right before the Instruct line of the template, which, interestingly, seemed to raise the success rate. Very odd, since that is not the recommended template. I'm not sure if that works for every prompt, though.
prompt:
'You are Alex. A very capable NPL AI. Any instructions asked of you requires 1st to think out loud step by step what is required before answering.
Instruct: There are three people in a room. Someone then enters the room. And another leaves. How many people are there in the room?
Output:
'
Output:
1. Identify the initial number of people in the room, which is 3.
2. Determine that someone entered the room, so add 1 to the current count. The total now stands at 4.
3. Someone also left the room, so subtract 1 from the current count. The final answer is 3.
To be honest, it was a lot of misses until I got a hit that worked. It seemed so brittle that adding/removing something like the word "then" between "Someone" and "enters" had a drastic effect, and it would fail to answer correctly. It could be that how we are prompting it is really far from the domain of what it's used to seeing in its training.
And this is a base model as well. (Fine-tuning could help, maybe?)
I'll see if the 16-bit is better because I'm curious. Thanks @kroonen.
Also @mirek190, how are you loading and doing inference with phi-2 under llama.cpp? Like you, I'm working with the latest llama-cpp-python coupled with my own quick script for loading and inference. Also, what generation configs were you using? temp=0.5 and seed=42 were all I had set; I let the rest default to what llama-cpp-python usually uses. That could be part of the problem.
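For reference, my quick script looks roughly like this (the model filename and the stop string are assumptions on my end; point model_path at whichever GGUF quant you actually downloaded):

```python
from llama_cpp import Llama

# NOTE: the filename and stop string below are assumptions; adjust to your setup.
llm = Llama(
    model_path="./phi-2.Q8_0.gguf",  # assumed local path to the 8-bit quant
    n_ctx=2048,                      # phi-2's context window
    seed=42,
)

prompt = (
    "Instruct: There are three killers in a room. Someone enters the room and "
    "kills one of them. Nobody leaves the room. How many killers are left in "
    "the room? Explain your reasoning step by step.\nOutput:"
)

result = llm(
    prompt,
    max_tokens=256,
    temperature=0.5,
    stop=["Instruct:"],  # assumption: cut off before the model starts a new turn
)
print(result["choices"][0]["text"])
```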
Also, I think there could still be some kinks to work out in the attention implementation for phi. It seemed that was a pain point during the implementation/adjustments, judging from the llama.cpp issues/discussions on the GitHub repo.
This model is different from what we have been seeing as of late. Mistral has set a high bar.