From original readme

Model Overview

Nemotron-Mini-4B-Instruct is a model for generating responses for roleplaying, retrieval augmented generation, and function calling. It is a small language model (SLM) optimized through distillation, pruning and quantization for speed and on-device deployment. It is a fine-tuned version of nvidia/Minitron-4B-Base, which was pruned and distilled from Nemotron-4 15B using our LLM compression technique. This instruct model is optimized for roleplay, RAG QA, and function calling in English. It supports a context length of 4,096 tokens. This model is ready for commercial use.

Try this model on build.nvidia.com.

For more details about how this model is used for NVIDIA ACE, please refer to this blog post and this demo video, which showcases how the model can be integrated into a video game. You can download the model checkpoint for NVIDIA AI Inference Manager (AIM) SDK from here.

Model Developer: NVIDIA

Model Dates: Nemotron-Mini-4B-Instruct was trained between February 2024 and Aug 2024.

License

NVIDIA Community Model License

Model Architecture

Nemotron-Mini-4B-Instruct uses a model embedding size of 3072, 32 attention heads, and an MLP intermediate dimension of 9216. It also uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).

Architecture Type: Transformer Decoder (auto-regressive language model)

Network Architecture: Nemotron-4

Prompt Format:

We recommend using the following prompt template, which was used to fine-tune the model. The model may not perform optimally without it.

Single Turn

<extra_id_0>System
{system prompt}

<extra_id_1>User
{prompt}
<extra_id_1>Assistant\n

Tool use

<extra_id_0>System
{system prompt}

<tool> ... </tool>
<context> ... </context>

<extra_id_1>User
{prompt}
<extra_id_1>Assistant
<toolcall> ... </toolcall>
<extra_id_1>Tool
{tool response}
<extra_id_1>Assistant\n

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer  = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

# Use the prompt template
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
 ]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

outputs = model.generate(tokenized_chat, max_new_tokens=128) 
print(tokenizer.decode(outputs[0]))

You can also use pipeline but you need to create a tokenizer object and assign it to the pipeline manually.

from transformers import AutoTokenizer
from transformers import pipeline

tokenizer  = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="nvidia/Nemotron-Mini-4B-Instruct")
pipe.tokenizer = tokenizer  # You need to assign tokenizer manually
pipe(messages)