Version recommendation.
So, for the last couple of years, I've been using DuckDuckGo's Llama LLM to write stories. I recently heard about something called DeepSeek. I looked up Mutahar talking about it and found that the best way to try it would be to host it locally, and I decided to do that on my phone. So I looked up a guide on how to host locally on a phone, and I was told to use PocketPal and download the right version. The only problem is there are about a billion different versions and dozens of sub-versions of those versions. My hair will probably turn gray before I figure out what I'm doing, so I figured it would be significantly easier and faster to simply ask which version of DeepSeek R1 to get for the Razr 2nd gen. Mradermacher seems to have the most versions, at least of the uncensored variant. So I thought I'd ask here.
@Sin-Shadow-Fox
I looked up your phone and assume you have a Motorola Razr 2 5G 2020 gen2 (8 GB RAM, 256 GB ROM, 6.2").
For that I recommend the i1-Q4_K_M version of https://huggingface.co./mradermacher/DeepSeek-R1-Distill-Qwen-7B-Uncensored-i1-GGUF, which is 4.68 GB and should easily fit into your phone's 8 GB of RAM, with enough left over for a medium-sized context.
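As a rough sanity check, the "will it fit" reasoning above can be sketched like this. The KV-cache and OS overhead figures below are rule-of-thumb assumptions, not measured values for this phone:

```python
# Rough sketch: does a GGUF quant fit in phone RAM?
# The per-1k-token KV-cache size and OS overhead are ballpark assumptions.
def fits_in_ram(model_file_gb: float, ram_gb: float,
                ctx_tokens: int = 4096,
                kv_gb_per_1k_tokens: float = 0.12,
                os_overhead_gb: float = 2.5) -> bool:
    """Model weights + KV cache + OS/app overhead must fit in RAM."""
    kv_cache_gb = (ctx_tokens / 1000) * kv_gb_per_1k_tokens
    needed = model_file_gb + kv_cache_gb + os_overhead_gb
    return needed <= ram_gb

# i1-Q4_K_M of the 7B distill is about 4.68 GB; the Razr has 8 GB RAM
print(fits_in_ram(4.68, 8.0))  # True -> fits with a medium context to spare
```

Under these assumptions a ~4.7 GB quant squeezes in, while anything near 8 GB clearly would not.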
Well, it's actually spitting out a response this time, but it is glacially slow, like five words a minute slow.
If you want something fast you could use Q4_K_M from https://huggingface.co./mradermacher/DeepSeek-R1-Distill-Qwen-1.5B-Fully-Uncensored-i1-GGUF
Keep in mind that this is a 1.5B instead of a 7B model and a smaller brain means it knows less and is slightly less intelligent. But depending what you use it for this might not matter. To just have a virtual friend to chat around it's more than good enough and over 4 times faster compared to 7B.
I also recommend trying the Q4_0 versions of both the 1.5B and 7B models to see if they are faster on your phone. llama.cpp applies some ARM-specific optimizations to Q4_0 models that make them run much faster on some phones.
Methinks that five words per minute sounds awfully slow. I've tried it on my el-cheapo Chinese MediaTek-based phone and it does more than that, so I suspect something else slows it down more than it should. Admittedly, I also didn't really see a speed-up from Q4_0, but that might have been the inference engine I used.
Q4_K_M quants of 7B models on a 2020 phone are not going to be fast enough to read along while generating. It will probably be faster than 5 words per minute after prompt processing, but I don't think it will run anywhere close to being usable. The Motorola phone has a Snapdragon 765G, whose multicore performance is about 55% of the i5-8250U CPU in my previous laptop, and even that wasn't really good enough to run a Q4_K_M 7B model fast enough to read along with the tokens being generated. It was at least somewhat usable if you are patient enough, but I don't see how a phone with a significantly less powerful chip is going to be fast enough to make sense to use frequently. It's not a surprise, though, considering the chip is almost half a decade old, and that is a lot for ARM processors. You could, for example, buy a midrange-priced (349€) phone like the Xiaomi Poco X7 Pro with a MediaTek 8400 Ultra chip and 12 GB of RAM, which would easily beat the 765G, with something like 200% better multicore performance, and it even has dedicated AI NPU cores as part of the chip. If you want to run a model locally on your current phone, though, you could give a 3B model a try and experiment around that size; there are enough models between 1B and 5B that you can find the best size/speed ratio for your specific phone.
@WesPro you completely missed my point, namely that from experimental evidence, five words per minute is slower than even a low-tier phone can do, so there is likely another issue.
I don't think it will run anywhere close to being usable.
Well, I've used 7Bs on my phone, and it's just as usable as running a 120B on my desktop. It is very subjective, but I'm used to doing something else until I get a full reply, and that works in both scenarios.
I didn't miss the point; I mentioned it should run faster than 5 words per minute. But when I read "it is glacially slow, like five words a minute slow," the way it was phrased made me think it wasn't really a serious measurement of the speed, just an exaggeration to emphasize that it's too slow to be usable. If it's actually that slow, though, I guess you are right that the settings are somehow keeping it from being as fast as it could be. One thing that could cause worse performance than the phone's capabilities suggest is that some Android phones can allocate virtual RAM from internal storage, and maybe that's happening here when loading a 7B model into 8 GB of RAM. Some phones use so much RAM just for the Android OS and some background apps that it would take more than 8 GB to run the 7B quant with some context alongside the rest of the apps and the OS.
So I just did some testing, and first and foremost I would like to say that I was not exaggerating. When I said it was glacially slow, from my perspective it is. However, when I said five words per minute, I actually timed it.
Secondly, I tested out the Q4_0 version of DeepSeek-R1-Distill-Qwen-7B-Uncensored-i1-GGUF, and while it is a little bit faster, it's not faster by much. Maybe 10 words per minute if I'm being generous.
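For reference, those timed speeds can be converted into the tokens-per-second units most inference apps report. The ~1.3 tokens per English word figure is a rough rule-of-thumb assumption, not a measurement:

```python
# Convert a timed "words per minute" reading into approximate tokens/second.
# tokens_per_word ~1.3 is a rough rule of thumb for English text.
def wpm_to_tokens_per_second(wpm: float, tokens_per_word: float = 1.3) -> float:
    return wpm * tokens_per_word / 60

print(round(wpm_to_tokens_per_second(5), 3))   # 0.108 tok/s (Q4_K_M, as timed)
print(round(wpm_to_tokens_per_second(10), 3))  # 0.217 tok/s (Q4_0, as timed)
```

Both figures are an order of magnitude below the token rates a 7B Q4 quant typically manages on modest ARM hardware, which fits the suspicion earlier in the thread that something else is slowing it down.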
DeepSeek-R1-Distill-Qwen-1.5B-Fully-Uncensored-i1-GGUF seems to run much faster, equaling DuckDuckGo's Llama, but when given a simple story prompt to write, it butchered it to hell and back.
I tried giving the same prompt to the Q4_0 version of DeepSeek-R1-Distill-Qwen-7B-Uncensored-i1-GGUF despite the slower speeds, and it had equally disastrous results.
it butchered it to hell and back.
Using an uncensored reasoning model is likely not the best idea for story writing. Reasoning models mainly make sense for Q&A or chat-style interactions and are, in my opinion, terrible for story writing. I have no idea if the mobile app you use even properly filters out the reasoning steps.
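If the app doesn't filter the reasoning, the "story" the user sees includes the model's raw chain-of-thought. DeepSeek-R1-style distills wrap it in `<think>...</think>` tags, so a minimal client-side filter (a sketch of the idea, not PocketPal's actual behavior) could look like:

```python
import re

# Strip DeepSeek-R1-style <think>...</think> reasoning blocks from model output,
# leaving only the final answer/story text.
def strip_reasoning(text: str) -> str:
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>Plan the plot first...</think>Once upon a time, the fox set out."
print(strip_reasoning(raw))  # Once upon a time, the fox set out.
```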
Can you try Q4_0 of https://huggingface.co./mradermacher/Llama-3.2-1B-Instruct-Uncensored-i1-GGUF instead? That is a general-purpose 1B model, so it should be even faster, and because it is not a reasoning model it is much better suited for story writing. If your main use case is story writing, maybe consider using a model made especially for story writing instead.
I believe 3B might be a good tradeoff for the given hardware.
Some other uncensored general-purpose models I can recommend are:
3B: https://huggingface.co./mradermacher/Llama-3.2-3B-Instruct-uncensored-i1-GGUF
7B: https://huggingface.co./mradermacher/Qwen2.5-7B-Instruct-Uncensored-i1-GGUF
3B roleplay-finetuned model: https://huggingface.co./mradermacher/Qwen-2.5-3b-RP-i1-GGUF
The 3B is a nice compromise, but it's not uncensored (despite the fact that it says it is), at least with the test I used. Also, I noticed that one doesn't say DeepSeek. I thought DeepSeek was the new cool one that everyone was talking about.