SageLite-s
Model Description
SageLite is a new family of open embedding models with an encoder architecture that supports a wide range of tasks in both code and text. SageLite went through three stages of training:
- MLM Pretraining: Standard masked language model (MLM) pretraining on mixed code and text data (The-Stack-v2 and Falcon-refinedweb).
- Contrastive Pre-Finetuning: Learning from a large number of positive pairs mined from web data and GitHub.
- Contrastive Fine-Tuning: Fine-tuning on a small amount of synthetic data.
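The contrastive stages above learn from positive pairs. The card does not state the exact objective, so the following is only a minimal sketch assuming an InfoNCE-style loss with in-batch negatives, a common choice for embedding models. The function names and the temperature value of 0.05 are hypothetical, and the toy 2-dimensional vectors stand in for real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(queries, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (assumed objective).

    queries[i] and positives[i] form a positive pair; every other
    positives[j] in the batch acts as a negative for queries[i].
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true positive
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy batch: the aligned pairing should score a lower loss than the swapped one
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
```

The loss is small when each query is closest to its own positive and grows when a query is closer to another item in the batch, which is what pushes matching pairs together in embedding space.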
Training Data
This checkpoint is trained on both The-Stack-v2 and Falcon-refinedweb. Supported languages (10 in total) are: English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.
How to Use
This checkpoint consists of an encoder (80M parameters) that extracts 768-dimensional code embeddings. It can be loaded with the Hugging Face Transformers library and uses the StarCoder tokenizer.
import torch
from transformers import AutoModel, AutoTokenizer
# Specify the checkpoint
checkpoint = "SageLite/SageLite-s"
# Use the GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
# Example usage
code_snippet = "def print_hello_world():\tprint('Hello World!')"
inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
embedding = model(inputs)[0] # Extract the embedding
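Once embeddings are available, retrieval reduces to ranking candidates by similarity to a query. The sketch below assumes each snippet has already been reduced to a single vector and that cosine similarity is the scoring function; the card does not specify either, and `rank_candidates` is a hypothetical helper. Toy 3-dimensional vectors stand in for the 768-dimensional embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_candidates(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, c)) for i, c in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings
query = [0.2, 0.9, 0.1]
candidates = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.1, 0.9],
]
ranking = rank_candidates(query, candidates)
best_index = ranking[0][0]  # index of the most similar candidate
```

With real model output, the vectors would first need to be converted to plain lists (or the same logic expressed with tensor operations) before ranking.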
Code Retrieval Performance
1. Code2Code Search
Model Name | # Params | Embed. Dim | Python | Java | JS | TS | C# | C | Ruby | PHP | Go | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|
OpenAI-Code-01 | NA | 3072 | 21.92 | 8.90 | 4.90 | 5.70 | 3.15 | 11.58 | 26.25 | 16.60 | 9.40 | 12.04 |
OpenAI-Text-3-Small | NA | 1536 | 25.18 | 12.61 | 8.00 | 9.44 | 5.46 | 15.86 | 30.70 | 23.33 | 11.20 | 15.57 |
OpenAI-Text-3-Large | NA | 3072 | 40.57 | 25.33 | 20.09 | 22.00 | 11.84 | 31.90 | 42.54 | 41.84 | 21.75 | 28.65 |
CodeSage-v2-Small | 130M | 1024 | 45.60 | 33.65 | 39.96 | 47.78 | 19.19 | 30.55 | 40.12 | 55.39 | 30.96 | 38.13 |
CodeSage-v2-Base | 356M | 1024 | 55.86 | 42.89 | 45.29 | 54.58 | 23.90 | 38.52 | 56.02 | 64.56 | 42.88 | 47.17 |
CodeSage-v2-Large | 1.3B | 2048 | 61.11 | 47.09 | 51.18 | 60.67 | 28.04 | 43.40 | 60.74 | 67.87 | 43.86 | 51.55 |
SageLite-s | 80M | 768 | 47.93 | 30.83 | 35.15 | 37.64 | 18.14 | 30.53 | 42.89 | 50.70 | 21.69 | 35.06 |
SageLite-l | 850M | 1536 | 64.46 | 45.53 | 50.80 | 54.71 | 30.66 | 47.46 | 61.01 | 68.68 | 39.25 | 51.40 |
2. NL2Code Search
Model Name | # Params | CoSQA | AdvTest | Python | Java | JS | PHP | Go | Ruby | Avg |
---|---|---|---|---|---|---|---|---|---|---|
OpenAI-Code-01 | NA | 52.20 | 36.03 | 63.13 | 67.85 | 62.30 | 57.47 | 85.22 | 69.28 | 61.69 |
OpenAI-Text-3-Small | NA | 52.48 | 34.10 | 62.62 | 65.87 | 60.28 | 54.85 | 81.96 | 67.57 | 59.97 |
OpenAI-Text-3-Large | NA | 55.21 | 46.83 | 70.81 | 72.89 | 68.12 | 59.58 | 87.60 | 75.22 | 67.03 |
CodeSage-v2-Small | 130M | 52.39 | 47.28 | 68.79 | 68.13 | 65.77 | 60.20 | 80.26 | 72.46 | 64.41 |
CodeSage-v2-Base | 356M | 50.74 | 52.00 | 70.46 | 70.89 | 69.61 | 62.81 | 82.37 | 73.71 | 66.57 |
CodeSage-v2-Large | 1.3B | 53.18 | 56.31 | 74.18 | 72.33 | 72.49 | 65.26 | 84.67 | 76.61 | 69.38 |
SageLite-s | 80M | 56.49 | 42.32 | 67.59 | 66.62 | 62.32 | 58.87 | 79.36 | 70.75 | 63.04 |
SageLite-l | 850M | 59.76 | 55.55 | 74.25 | 71.76 | 69.35 | 61.62 | 84.09 | 77.14 | 69.19 |
Text Retrieval Performance (MTEB Retrieval)
Dataset | SageLite-s | SageLite-l |
---|---|---|
ArguAna | 57.75 | 60.71 |
CQADupstackWordpressRetrieval | 32.42 | 38.63 |
FiQA2018 | 34.85 | 46.73 |
NFCorpus | 29.97 | 33.70 |
QuoraRetrieval | 85.35 | 87.50 |
SCIDOCS | 18.99 | 21.38 |
SciFact | 68.43 | 69.05 |
Touche2020 | 24.41 | 21.43 |
TRECCOVID | 70.88 | 76.08 |
FEVER | 71.72 | 73.64 |
HotpotQA | 58.81 | 62.96 |
NQ | 48.26 | 54.48 |
DBPedia | 34.83 | 40.69 |
ClimateFEVER | 25.69 | 26.20 |
MSMARCO | 35.01 | 36.55 |
Average | 46.49 | 49.98 |
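The average row appears to be the unweighted mean of the 15 dataset scores. A quick check, using only the numbers from the table above, reproduces both values:

```python
# (SageLite-s, SageLite-l) scores copied from the table above
scores = {
    "ArguAna": (57.75, 60.71),
    "CQADupstackWordpressRetrieval": (32.42, 38.63),
    "FiQA2018": (34.85, 46.73),
    "NFCorpus": (29.97, 33.70),
    "QuoraRetrieval": (85.35, 87.50),
    "SCIDOCS": (18.99, 21.38),
    "SciFact": (68.43, 69.05),
    "Touche2020": (24.41, 21.43),
    "TRECCOVID": (70.88, 76.08),
    "FEVER": (71.72, 73.64),
    "HotpotQA": (58.81, 62.96),
    "NQ": (48.26, 54.48),
    "DBPedia": (34.83, 40.69),
    "ClimateFEVER": (25.69, 26.20),
    "MSMARCO": (35.01, 36.55),
}

# Unweighted mean over datasets, rounded to two decimals as in the table
average_s = round(sum(s for s, _ in scores.values()) / len(scores), 2)
average_l = round(sum(l for _, l in scores.values()) / len(scores), 2)
```

Touche2020 is the only dataset where the smaller model scores higher than the larger one.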