duyntnet committed on
Commit
92df6c3
1 Parent(s): 01b6e3a

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +98 -0
README.md ADDED
@@ -0,0 +1,98 @@
+ ---
+ license: other
+ language:
+ - en
+ pipeline_tag: text-generation
+ inference: false
+ tags:
+ - transformers
+ - gguf
+ - imatrix
+ - Megrez-3B-Instruct
+ ---
+ Quantizations of https://huggingface.co/Infinigence/Megrez-3B-Instruct
+
+ **Note**: you will need llama.cpp [b4381](https://github.com/ggerganov/llama.cpp/releases/tag/b4381) or later to run the model.
+
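The minimum-build requirement in the note can be checked programmatically. The sketch below assumes the build is given as a release tag such as `b4381` (or the bare number); the helper names `parse_build` and `is_supported` are hypothetical and not part of llama.cpp.

```python
import re

# Minimum llama.cpp build named in the note above.
MIN_BUILD = 4381


def parse_build(tag: str) -> int:
    """Extract the build number from a tag like 'b4381' or '4381'."""
    match = re.fullmatch(r"b?(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized llama.cpp build tag: {tag!r}")
    return int(match.group(1))


def is_supported(tag: str, minimum: int = MIN_BUILD) -> bool:
    """Return True if the given build is at least the required minimum."""
    return parse_build(tag) >= minimum


if __name__ == "__main__":
    for tag in ("b4380", "b4381", "b4500"):
        print(tag, "supported" if is_supported(tag) else "too old")
```

Any other version format is rejected with a `ValueError` rather than guessed at.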
+ ### Inference Clients/UIs
+ * [llama.cpp](https://github.com/ggerganov/llama.cpp)
+ * [KoboldCPP](https://github.com/LostRuins/koboldcpp)
+ * [ollama](https://github.com/ollama/ollama)
+ * [jan](https://github.com/janhq/jan)
+ * [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
+ * [GPT4All](https://github.com/nomic-ai/gpt4all)
+ ---
+
+ # From original readme
+
+ Megrez-3B-Instruct is a large language model trained by [Infinigence AI](https://cloud.infini-ai.com/platform/ai). Megrez-3B aims to provide a fast, compact, and powerful edge-side intelligence solution through software-hardware co-design. Megrez-3B has the following advantages:
+ 1. High Accuracy: Megrez-3B compresses the capabilities of the previous 14-billion-parameter model into 3 billion parameters and achieves excellent performance on mainstream benchmarks.
+ 2. High Speed: A smaller model is not necessarily faster. Through software-hardware co-design, Megrez-3B remains highly compatible with mainstream hardware, yielding an inference speedup of up to 300% over previous models of the same accuracy.
+ 3. Easy to Use: Early on, we debated the model design: should we design a unique but efficient model structure, or use a classic structure for ease of use? We chose the latter and adopted the most basic LLaMA structure, which lets developers deploy the model on various platforms without any modifications and minimizes the complexity of future development.
+ 4. Rich Applications: We provide a full-stack WebSearch solution. The model is trained on web search tasks, so it can automatically decide when to invoke a search and produce better summaries. The complete deployment code is released on [GitHub](https://github.com/infinigence/InfiniWebSearch).
+
+ ### Inference Parameters
+ - For chat, text generation, and other tasks that benefit from diversity, we recommend temperature=0.7.
+ - For mathematical and reasoning tasks, we recommend temperature=0.2 for more deterministic output.
+
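To illustrate why a lower temperature gives more deterministic output, here is a minimal, dependency-free sketch. The task labels, the helper names, and the fallback default of 0.7 are assumptions made for illustration; only the two temperature values come from the recommendations above.

```python
import math

# Temperatures recommended above; the task labels are hypothetical keys.
RECOMMENDED_TEMPERATURE = {
    "chat": 0.7,
    "text-generation": 0.7,
    "math": 0.2,
    "reasoning": 0.2,
}


def pick_temperature(task: str, default: float = 0.7) -> float:
    """Look up the recommended temperature; the default is an assumption."""
    return RECOMMENDED_TEMPERATURE.get(task, default)


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
    for task in ("chat", "math"):
        t = pick_temperature(task)
        probs = softmax_with_temperature(toy_logits, t)
        print(task, t, [round(p, 3) for p in probs])
```

At temperature 0.2 the probability mass concentrates on the highest-scoring token, so sampling rarely deviates from it. At 0.7 the distribution is flatter, which allows more varied output.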
+ ### Hugging Face
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ path = "Infinigence/Megrez-3B-Instruct"
+ device = "cuda"
+
+ # Load the tokenizer and the model (bfloat16 weights) onto the target device.
+ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
+
+ messages = [
+     {"role": "user", "content": "How to make braised chicken in brown sauce?"},
+ ]
+ # Render the chat template and tokenize it into input IDs.
+ model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)
+
+ model_outputs = model.generate(
+     model_inputs,
+     do_sample=True,
+     max_new_tokens=1024,
+     top_p=0.9,
+     temperature=0.2,
+ )
+
+ # Keep only the newly generated tokens by dropping the prompt prefix.
+ output_token_ids = [
+     model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
+ ]
+
+ responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
+ print(responses)
+ ```
+
+ ### vLLM Inference
+ - Installation
+ ```bash
+ # Install vLLM with CUDA 12.1.
+ pip install vllm
+ ```
+ - Example code
+ ```python
+ # Shell alternative (not Python; run from a terminal):
+ #   python inference/inference_vllm.py --model_path <hf_repo_path> --prompt_path prompts/prompt_demo.txt
+ from transformers import AutoTokenizer
+ from vllm import LLM, SamplingParams
+
+ model_name = "Infinigence/Megrez-3B-Instruct"
+ prompt = [{"role": "user", "content": "How to make braised chicken in brown sauce?"}]
+
+ # Render the chat template to plain text; vLLM tokenizes it internally.
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
+
+ llm = LLM(
+     model=model_name,
+     trust_remote_code=True,
+     tensor_parallel_size=1,
+ )
+ sampling_params = SamplingParams(top_p=0.9, temperature=0.2, max_tokens=1024, repetition_penalty=1.02)
+
+ outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
+
+ # Print the text of the first completion for the first prompt.
+ print(outputs[0].outputs[0].text)
+ ```