---
license: gemma
language:
- en
base_model: prithivMLmods/GWQ2b
pipeline_tag: text-generation
library_name: transformers
tags:
- gemma
- 2b
- llama-cpp
- gguf-my-repo
---

# Triangle104/GWQ2b-Q4_K_M-GGUF

This model was converted to GGUF format from [`prithivMLmods/GWQ2b`](https://huggingface.co./prithivMLmods/GWQ2b) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co./spaces/ggml-org/gguf-my-repo) space.

Refer to the [original model card](https://huggingface.co./prithivMLmods/GWQ2b) for more details on the model.

---

Model details:

GWQ2b is a family of lightweight, state-of-the-art open models from Google, built using the same research and technology employed to create the Gemini models. These models are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained and instruction-tuned variants. GWQ2b models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. GWQ2b is fine-tuned on the Chain of Continuous Thought Synthetic Dataset and is built upon the Gemma2ForCausalLM architecture.

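
If you want to confirm the architecture class programmatically before downloading the full weights, a minimal sketch using transformers' `AutoConfig` (it fetches only the repository's config.json, and the printed values reflect whatever that file declares):

```python
# Inspect the model configuration without downloading the weights.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("prithivMLmods/GWQ2b")
print(config.model_type)      # Gemma 2 checkpoints report "gemma2"
print(config.architectures)   # e.g. ["Gemma2ForCausalLM"]
```
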
Running GWQ2b Demo

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("prithivMLmods/GWQ2b")
model = AutoModelForCausalLM.from_pretrained(
    "prithivMLmods/GWQ2b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```
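
For longer generations it can be convenient to stream tokens to the console as they are produced. A minimal sketch using transformers' `TextStreamer`, reusing `model`, `tokenizer`, and `input_ids` from the snippet above:

```python
from transformers import TextStreamer

# Print tokens as they are generated, omitting the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**input_ids, max_new_tokens=128, streamer=streamer)
```
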
You can ensure the correct chat template is applied by using tokenizer.apply_chat_template as follows:

```python
messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True).to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```
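
If you also want the template to end with the assistant turn marker and to print only the newly generated text, a hedged variant of the snippet above (`add_generation_prompt` is a standard `apply_chat_template` argument; the slicing simply drops the prompt tokens from the decoded output):

```python
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # end the prompt with the assistant turn marker
    return_tensors="pt",
    return_dict=True,
).to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=256)
# Decode only the tokens produced after the prompt.
new_tokens = outputs[0][input_ids["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
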
Key Architecture

- Transformer-Based Design: GWQ2b leverages the transformer architecture, utilizing self-attention mechanisms to process input text and capture contextual relationships effectively.
- Lightweight and Efficient: It is designed to be computationally efficient, with fewer parameters compared to larger models, making it ideal for deployment on resource-constrained devices or environments.
- Modular Layers: The architecture consists of modular decoder layers, allowing flexibility in adapting the model for specific tasks like text generation, summarization, or classification.
- Attention Mechanisms: GWQ2b employs multi-head self-attention to focus on relevant parts of the input text, improving its ability to handle long-range dependencies and complex language structures (a minimal illustration follows this list).
- Pre-training and Fine-Tuning: The model is pre-trained on large text corpora and can be fine-tuned for specific tasks, such as markdown processing in ReadM.Md, to enhance its performance on domain-specific data.
- Scalability: The architecture supports scaling up or down based on the application's requirements, balancing performance and resource usage.
- Open-Source and Customizable: Being open-source, GWQ2b allows developers to modify and extend its architecture to suit specific use cases, such as integrating it into tools like ReadM.Md for markdown-related tasks.

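
To make the attention bullet above concrete, here is a minimal, illustrative sketch of single-head scaled dot-product self-attention in PyTorch. It is not GWQ2b's actual implementation (that lives in transformers' Gemma 2 modeling code); the shapes and names are chosen purely for clarity:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x.

    x: (seq_len, d_model); w_q / w_k / w_v: (d_model, d_head) projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v         # project tokens into query/key/value spaces
    scores = q @ k.T / (k.shape[-1] ** 0.5)     # scaled similarity of every token to every other
    weights = F.softmax(scores, dim=-1)         # attention weights per query token
    return weights @ v                          # weighted sum of value vectors

# Toy example: 4 tokens, 8-dimensional embeddings and head.
x = torch.randn(4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)   # torch.Size([4, 8])
```

Multi-head attention runs several such projections in parallel and concatenates the results.
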
Intended Use of GWQ2b (Gemma with Questions2b)

- Question Answering: The model excels in generating concise and relevant answers to user-provided queries across various domains.
- Summarization: It can be used to summarize large bodies of text, making it suitable for news aggregation, academic research, and report generation (a short example follows this list).
- Reasoning Tasks: GWQ2b is fine-tuned on the Chain of Continuous Thought Synthetic Dataset, which enhances its ability to perform reasoning, multi-step problem solving, and logical inferences.
- Text Generation: The model is ideal for creative writing tasks such as generating poems, stories, and essays. It can also be used for generating code comments, documentation, and markdown files.
- Instruction Following: GWQ2b's instruction-tuned variant is suitable for generating responses based on user instructions, making it useful for virtual assistants, tutoring systems, and automated customer support.
- Domain-Specific Applications: Thanks to its modular design and open-source nature, the model can be fine-tuned for specific tasks like legal document summarization, medical record analysis, or financial report generation.

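
Most of these use cases can be expressed with the chat-template pattern shown earlier; for example, a small summarization sketch reusing `model` and `tokenizer` from the demo above (the document text and generation settings are placeholders):

```python
document_text = "GWQ2b is a lightweight open model fine-tuned for reasoning and summarization tasks."

messages = [
    {"role": "user", "content": "Summarize the following text in two sentences:\n" + document_text},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
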
Limitations of GWQ2b

- Resource Requirements: Although lightweight compared to larger models, the 2B parameter size still requires significant computational resources, including GPUs with large memory for inference.
- Knowledge Cutoff: The model's pre-training data may not include recent information, making it less effective for answering queries on current events or newly developed topics.
- Bias in Outputs: Since the model is trained on publicly available datasets, it may inherit biases present in those datasets, leading to potentially biased or harmful outputs in sensitive contexts.
- Hallucinations: Like other large language models, GWQ2b can occasionally generate incorrect or nonsensical information, especially when asked for facts or reasoning outside its training scope.
- Lack of Common-Sense Reasoning: While GWQ2b is fine-tuned for reasoning, it may still struggle with tasks requiring deep common-sense knowledge or nuanced understanding of human behavior and emotions.
- Dependency on Fine-Tuning: For optimal performance on domain-specific tasks, fine-tuning on relevant datasets is required, which demands additional computational resources and expertise.
- Context Length Limitation: The model's ability to process long documents is limited by its maximum context window size. If the input exceeds this limit, truncation may lead to loss of important information (see the sketch after this list for a simple length check).

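
A simple guard against silent truncation is to compare the tokenized prompt length with the model's context window before generating. A minimal sketch reusing `model` and `input_ids` from the demo above (the context size is read from the repository's config; `max_position_embeddings` is the standard transformers field for it):

```python
max_ctx = model.config.max_position_embeddings      # context window declared by the config
prompt_len = input_ids["input_ids"].shape[-1]       # tokens in the current prompt

if prompt_len > max_ctx:
    print(f"Prompt is {prompt_len} tokens but the context window is {max_ctx}; "
          "consider shortening or chunking the input.")
```
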
---
## Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux).

```bash
brew install llama.cpp
```
Invoke the llama.cpp server or the CLI.

### CLI:
```bash
llama-cli --hf-repo Triangle104/GWQ2b-Q4_K_M-GGUF --hf-file gwq2b-q4_k_m.gguf -p "The meaning to life and the universe is"
```

### Server:
```bash
llama-server --hf-repo Triangle104/GWQ2b-Q4_K_M-GGUF --hf-file gwq2b-q4_k_m.gguf -c 2048
```
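
Once llama-server is running, it exposes an OpenAI-compatible HTTP API. A minimal sketch of querying it from Python with only the standard library (this assumes the default address `http://localhost:8080`; adjust it if you start the server with a different host or port):

```python
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Write me a poem about Machine Learning."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["choices"][0]["message"]["content"])
```
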
Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.
```
git clone https://github.com/ggerganov/llama.cpp
```

Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag along with other hardware-specific flags (for example, `LLAMA_CUDA=1` for Nvidia GPUs on Linux).
```
cd llama.cpp && LLAMA_CURL=1 make
```

Step 3: Run inference through the main binary.
```
./llama-cli --hf-repo Triangle104/GWQ2b-Q4_K_M-GGUF --hf-file gwq2b-q4_k_m.gguf -p "The meaning to life and the universe is"
```
or
```
./llama-server --hf-repo Triangle104/GWQ2b-Q4_K_M-GGUF --hf-file gwq2b-q4_k_m.gguf -c 2048
```