---
tags:
- text-generation
- pytorch
inference: false
license: llama2
language:
- pt
pipeline_tag: text-generation
library_name: transformers
datasets:
- dominguesm/CC-MAIN-2023-23
---
# Canarim-7B
Canarim-7B is a Portuguese large language model developed by Maicon Domingues.
## Model description

The model was pretrained on 16 billion tokens from the Portuguese subset of CommonCrawl 2023-23, starting from the weights of LLaMA2-7B. The pretraining data has a cutoff of mid-2023.
## Key Features
- Language: Specialized in understanding and generating Portuguese text, making it ideal for applications targeting Portuguese-speaking audiences.
- Architecture: Inherits the robust architecture from LLaMA2-7B, ensuring efficient performance and accurate results.
- Diverse Dataset: The pretraining dataset includes a wide range of topics and writing styles, enhancing the model's ability to understand various contexts and nuances in Portuguese.
## Applications

Canarim-7B was trained solely on a language modeling objective and has not been fine-tuned for instruction following. It is therefore better suited to few-shot tasks than zero-shot tasks: the model tends to perform better when given a few examples of the desired output (see the short prompt sketch after this list). Here are some practical applications:
- Natural Language Understanding (NLU): Efficient in tasks such as sentiment analysis, topic classification, and entity recognition in Portuguese text, especially when relevant examples are provided.
- Natural Language Generation (NLG): Capable of generating coherent and contextually relevant text, useful for content creation, chatbots, and more, with improved results when provided examples of the desired style or format.
- Language Translation: Suitable for high-quality translation between Portuguese and other languages, especially when examples of desired translations are included during model training or fine-tuning.
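Below is a minimal sketch of a few-shot prompt for sentiment classification in Portuguese. The sentences and labels are purely illustrative, not drawn from the model or its training data; the model is expected to complete the final `Sentimento:` line.

```python
# Illustrative few-shot prompt for sentiment classification (examples are made up).
few_shot_prompt = (
    "Classifique o sentimento da frase como Positivo ou Negativo.\n\n"
    "Frase: Adorei o atendimento, foi rápido e educado.\n"
    "Sentimento: Positivo\n\n"
    "Frase: O produto chegou quebrado e ninguém respondeu meus e-mails.\n"
    "Sentimento: Negativo\n\n"
    "Frase: A comida estava fria e o ambiente era barulhento.\n"
    "Sentimento:"
)
```

Passing a prompt like this to the generation pipeline shown in Getting Started typically yields the missing label as the continuation.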
## Tips for Efficient Use
- Few-shot Learning: When using Canarim-7B for specific tasks, it is beneficial to provide a few relevant examples. This helps the model better understand the context and purpose of the task.
- Contextualization: Including additional context in the input can significantly improve the quality of the model’s predictions and text generation; a sketch combining both tips follows this list.
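As an illustration of both tips, the hypothetical helper below assembles a prompt from an optional context, a task instruction, and a handful of (input, output) examples. The function name and prompt layout are assumptions for demonstration, not part of the model's API.

```python
def build_prompt(task_instruction, examples, query, context=None):
    """Assemble a few-shot prompt: optional context, instruction, examples, then the query.

    `examples` is a list of (input, output) pairs; all names here are illustrative.
    """
    parts = []
    if context:
        parts.append(f"Contexto: {context}")
    parts.append(task_instruction)
    for example_input, example_output in examples:
        parts.append(f"Entrada: {example_input}\nSaída: {example_output}")
    parts.append(f"Entrada: {query}\nSaída:")
    return "\n\n".join(parts)
```

The resulting string can be passed directly as the `prompt` in the example below.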
## Getting Started

To start using Canarim-7B with the Transformers library, first install it if you haven't already:

```bash
pip install transformers
```

You can then load the model and generate text with the `pipeline` function:
```python
from transformers import AutoTokenizer, pipeline
import torch

model_id = "dominguesm/canarim-7b"

# Load the tokenizer and build a text-generation pipeline
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Example prompt; for best results, include a few examples of the desired output (few-shot)
prompt = "Pergunta: Qual é a capital do Brasil?\nResposta:"

sequences = pipe(
    prompt,
    do_sample=True,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=2048,
    temperature=0.9,
    top_p=0.6,
    repetition_penalty=1.15,
)

print(sequences[0]["generated_text"])
```
This code snippet demonstrates how to generate text with Canarim-7B. You can customize the input text and adjust parameters like `max_length` according to your requirements.
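If you need lower-level control than the `pipeline` helper, the sketch below loads the model directly with `AutoModelForCausalLM` and calls `generate()`. Loading in `float16` with `device_map="auto"` assumes a GPU and the `accelerate` package are available; adjust for your hardware.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "dominguesm/canarim-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision; assumes a GPU with enough memory
    device_map="auto",          # requires the `accelerate` package
)

prompt = "Pergunta: Qual é a capital do Brasil?\nResposta:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling parameters mirror the pipeline example above
output = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=256,
    temperature=0.9,
    top_p=0.6,
    repetition_penalty=1.15,
    eos_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```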
## Citation

If you want to cite Canarim-7B, you can use:
```bibtex
@misc{maicon_domingues_2023,
  author    = { {Maicon Domingues} },
  title     = { canarim-7b (Revision 08fdd2b) },
  year      = 2023,
  url       = { https://huggingface.co./dominguesm/canarim-7b },
  doi       = { 10.57967/hf/1356 },
  publisher = { Hugging Face }
}
```
## License
Canarim-7B is released under the LLAMA 2 COMMUNITY LICENSE AGREEMENT.