---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- Tulu3
- Smollm
- SLMs
- Small
- Huggingface
- Allenai
- SFT
- DPO
- GGUF
- RLVR
- RL
base_model:
- SultanR/SmolTulu-1.7b-Instruct
datasets:
- allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
pipeline_tag: text-generation
---

# SmolLM2 1.7b Aligned and Reinforced Through Tulu 3!

![SmolTulu Banner](smoltulubanner.png)

SmolTulu-1.7b-Reinforced is the reinforcement learning with verifiable rewards (RLVR) version of [SmolTulu-1.7b-Instruct](https://huggingface.co./SultanR/SmolTulu-1.7b-Instruct), which leverages [AllenAI's Tulu 3 post-training pipeline](https://arxiv.org/abs/2411.15124) 

This model scores the highest current score in both IFEval and GSM8k while maintaining the extremely low contamination levels in Tulu 3 and SmolLM2! I've listed the datasets used to do both the RLVR stage, which is the same one mentioned used in the Tulu 3 paper.

## Evaluation

I ran these evaluations using [SmolLM2's evaluation code](https://github.com/huggingface/smollm/tree/main/evaluation) for a more fair comparison.


| Metric | SmolTulu-1.7b-Instruct | SmolTulu-1.7b-Reinforced | SmolLM2-1.7B-Instruct | Llama-1B-Instruct | Qwen2.5-1.5B-Instruct | SmolLM1-1.7B-Instruct |
|:----------------------------|:---------------------:|:---------------------:|:---------------------:|:---------------------:|:---------------------:|:---------------------:|
| ARC (Average) | 51.5 | 51.1 | **51.7** | 41.6 | 46.2 | 43.7 |
| BBH (3-shot) | 33.8 | 33.4 | 32.2 | 27.6 | **35.3** | 25.7 |
| GSM8K (5-shot) | 51.6 | **61.0** | 48.2 | 26.8 | 42.8 | 4.6 |
| HellaSwag | 61.1 | 60.4 | **66.1** | 56.1 | 60.9 | 55.5 |
| IFEval (Average prompt/inst) | 67.7 | **69.3** | 56.7 | 53.5 | 47.4 | 23.1 |
| MMLU-Pro (MCF) | 17.4 | 17.3 | 19.3 | 12.7 | **24.2** | 11.7 |
| PIQA | 72.2 | 72.1 | **74.4** | 72.3 | 73.2 | 71.6 |

## Training Details

The reinforced model used PPO with verifiable rewards:
- Base model: SmolTulu-1.7b-Instruct
- Learning rate: 3e-6
- Total training episodes: 10M
- PPO KL penalty coefficient (beta): 0.05
- Maximum sequence/prompt length: 2048 tokens
- Response length: 2048 tokens 
- Rollout batch size: 32
- Minibatch size: 32
- Temperature: 1.0
- Penalty reward: -10.0 for incomplete generations
- DeepSpeed Stage 3 optimization
- Gradient checkpointing enabled
- Training data: RLVR-GSM-MATH-IF-Mixed-Constraints
- Reward model multiplier: 0.0 (pure verifiable rewards)

## Usage

Just like any Huggingface model, just run it using the transformers library:

```python
# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "SultanR/SmolTulu-1.7b-Reinforced"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

## Citation

```
@misc{alrashed2024smoltuluhigherlearningrate,
      title={SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs}, 
      author={Sultan Alrashed},
      year={2024},
      eprint={2412.08347},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.08347}, 
}
```

The training methodology follows the Tulu 3 paper:

```
@article{lambert2024tulu3,
  title={TÜLU 3: Pushing Frontiers in Open Language Model Post-Training},
  author={Lambert, Nathan and Morrison, Jacob and Pyatkin, Valentina and others},
  year={2024},
  journal={arXiv preprint arXiv:2411.15124}
}
```