arxiv:2502.01637

Scaling Embedding Layers in Language Models

Published on Feb 3 · Submitted by akhaliq on Feb 4
Abstract

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide a contextualized representation for each input token and are learned with a separate model during training. For inference, they are precomputed and stored in off-accelerator memory, with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached n-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B-parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
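To make the mechanism above concrete, here is a minimal sketch (not the authors' released code) of the inference path the abstract describes: each token keeps its standard vocabulary embedding, and a precomputed embedding for the longest matching frequent n-gram ending at that token is fetched from a table kept in off-accelerator (e.g. CPU) memory and combined with it. The class names, the additive combination, the longest-match lookup, and the maximum n-gram length are all illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn


class NGramEmbeddingCache:
    """Precomputed n-gram embeddings held in off-accelerator (CPU) memory."""

    def __init__(self, ngram_to_row: dict[tuple[int, ...], int], table: torch.Tensor):
        self.ngram_to_row = ngram_to_row  # maps token-id n-grams to table rows
        self.table = table                # shape (num_ngrams, d_model), kept on CPU

    def lookup(self, context: list[int], max_n: int = 3) -> torch.Tensor | None:
        # Longest-match lookup for a frequent n-gram ending at the current token.
        for n in range(min(max_n, len(context)), 1, -1):
            row = self.ngram_to_row.get(tuple(context[-n:]))
            if row is not None:
                return self.table[row]
        return None  # no cached n-gram ends here; use the token embedding alone


class SconeStyleInputEmbedding(nn.Module):
    """Token embedding plus a cached contextualized n-gram embedding."""

    def __init__(self, vocab_size: int, d_model: int, cache: NGramEmbeddingCache):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # original vocabulary is kept
        self.cache = cache

    def forward(self, token_ids: list[int]) -> torch.Tensor:
        device = self.tok_emb.weight.device
        embs = []
        for i, tok in enumerate(token_ids):
            e = self.tok_emb(torch.tensor(tok, device=device))
            ngram_emb = self.cache.lookup(token_ids[: i + 1])
            if ngram_emb is not None:
                e = e + ngram_emb.to(device)  # fetched from off-accelerator memory
            embs.append(e)
        return torch.stack(embs)  # (seq_len, d_model), fed to the transformer


# Tiny usage example with made-up sizes and a single cached bigram.
cache = NGramEmbeddingCache({(5, 7): 0}, torch.randn(1, 16))
layer = SconeStyleInputEmbedding(vocab_size=100, d_model=16, cache=cache)
print(layer([5, 7, 3]).shape)  # torch.Size([3, 16])
```

Because the table is populated ahead of time by a separate embedding model, that model's size never enters the inference-time FLOPS; scaling it, or growing the number of cached n-gram embeddings, only affects training cost and off-accelerator storage.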

Community

Paper submitter


Thanks a lot for sharing!

A concurrent work, Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling, also decouples the input embedding layer from the decoding layer. That said, we take a completely different approach, leading to notable differences in pros and cons. See our related work section for a more detailed discussion!
