# Llama.cpp
| Feature    | Available |
| ---------- | --------- |
| Tools      | No        |
| Multimodal | No        |
Chat UI supports the llama.cpp API server directly without the need for an adapter. You can do this using the `llamacpp` endpoint type.
If you want to run Chat UI with llama.cpp, you can do the following, using `microsoft/Phi-3-mini-4k-instruct-gguf` as an example model:

```bash
# install llama.cpp
brew install llama.cpp
# start llama.cpp server
llama-server --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf --hf-file Phi-3-mini-4k-instruct-q4.gguf -c 4096
```
Note: you can swap the `hf-repo` and `hf-file` values for your favorite GGUF on the Hub. For example, use `--hf-repo TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF` for the repository and `--hf-file tinyllama-1.1b-chat-v1.0.Q4_0.gguf` for the file.
A local llama.cpp HTTP server will start on `http://localhost:8080` (to change the port or any other default options, please refer to the llama.cpp HTTP server README).
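
Before wiring the server into Chat UI, it can help to confirm it is reachable. The snippet below is a minimal sanity check, assuming a recent llama.cpp build that exposes the `/health` route and the OpenAI-compatible `/v1/chat/completions` route on the default port:

```bash
# Check that the server is up (recent llama-server builds expose /health)
curl http://localhost:8080/health

# Optionally send a tiny chat completion to confirm the model is loaded
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'
```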
Add the following to your `.env.local`:
```env
MODELS=`[
  {
    "name": "Local microsoft/Phi-3-mini-4k-instruct-gguf",
    "tokenizer": "microsoft/Phi-3-mini-4k-instruct-gguf",
    "preprompt": "",
    "chatPromptTemplate": "<s>{{preprompt}}{{#each messages}}{{#ifUser}}<|user|>\n{{content}}<|end|>\n<|assistant|>\n{{/ifUser}}{{#ifAssistant}}{{content}}<|end|>\n{{/ifAssistant}}{{/each}}",
    "parameters": {
      "stop": ["<|end|>", "<|endoftext|>", "<|assistant|>"],
      "temperature": 0.7,
      "max_new_tokens": 1024,
      "truncate": 3071
    },
    "endpoints": [{
      "type" : "llamacpp",
      "baseURL": "http://localhost:8080"
    }],
  },
]`
```
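
With the model configured, you can then start Chat UI itself. The sketch below assumes the standard npm workflow from the Chat UI README and that the rest of your `.env.local` (such as `MONGODB_URL`) is already set up:

```bash
# From the root of the chat-ui repository
# (assumes Node.js is installed and the rest of .env.local is configured)
npm install
npm run dev
```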