arxiv:2502.01637

Scaling Embedding Layers in Language Models

Published on Feb 3 · Submitted by akhaliq on Feb 4
Abstract

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide a contextualized representation for each input token and are learned with a separate model during training. For inference, they are precomputed and stored in off-accelerator memory, with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached n-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B-parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
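To make the mechanism above concrete, here is a minimal sketch (not the authors' released code) of the inference path the abstract describes: each token keeps its standard vocabulary embedding, and a precomputed embedding for the longest matching frequent n-gram ending at that token is fetched from a table kept in off-accelerator (e.g. CPU) memory and combined with it. The class names, the additive combination, the longest-match lookup, and the maximum n-gram length are all illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn


class NGramEmbeddingCache:
    """Precomputed n-gram embeddings held in off-accelerator (CPU) memory."""

    def __init__(self, ngram_to_row: dict[tuple[int, ...], int], table: torch.Tensor):
        self.ngram_to_row = ngram_to_row  # maps token-id n-grams to table rows
        self.table = table                # shape (num_ngrams, d_model), kept on CPU

    def lookup(self, context: list[int], max_n: int = 3) -> torch.Tensor | None:
        # Longest-match lookup for a frequent n-gram ending at the current token.
        for n in range(min(max_n, len(context)), 1, -1):
            row = self.ngram_to_row.get(tuple(context[-n:]))
            if row is not None:
                return self.table[row]
        return None  # no cached n-gram ends here; use the token embedding alone


class SconeStyleInputEmbedding(nn.Module):
    """Token embedding plus a cached contextualized n-gram embedding."""

    def __init__(self, vocab_size: int, d_model: int, cache: NGramEmbeddingCache):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # original vocabulary is kept
        self.cache = cache

    def forward(self, token_ids: list[int]) -> torch.Tensor:
        device = self.tok_emb.weight.device
        embs = []
        for i, tok in enumerate(token_ids):
            e = self.tok_emb(torch.tensor(tok, device=device))
            ngram_emb = self.cache.lookup(token_ids[: i + 1])
            if ngram_emb is not None:
                e = e + ngram_emb.to(device)  # fetched from off-accelerator memory
            embs.append(e)
        return torch.stack(embs)  # (seq_len, d_model), fed to the transformer


# Tiny usage example with made-up sizes and a single cached bigram.
cache = NGramEmbeddingCache({(5, 7): 0}, torch.randn(1, 16))
layer = SconeStyleInputEmbedding(vocab_size=100, d_model=16, cache=cache)
print(layer([5, 7, 3]).shape)  # torch.Size([3, 16])
```

Because the table is populated ahead of time by a separate embedding model, that model's size never enters the inference-time FLOPS; scaling it, or growing the number of cached n-gram embeddings, only affects training cost and off-accelerator storage.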

Community

Paper submitter


Thanks a lot for sharing!

A concurrent work, Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling, also decouples the input embedding layer from the decoding layer. That said, we take a completely different approach, leading to notable differences in pros and cons. See our related work section for a more detailed discussion!
