arXiv:2412.09871

Byte Latent Transformer: Patches Scale Better Than Tokens

Published on Dec 13
· Submitted by artidoro on Dec 17
#1 Paper of the day
Abstract

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.

Community

Paper author · Paper submitter

Introducing the Byte Latent Transformer (BLT) – An LLM architecture that scales better than Llama 3 using byte patches instead of tokens.

BLT encodes bytes into dynamic patches using lightweight local models and processes them with a large latent transformer.
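
For intuition, here is a minimal sketch of that pipeline. Module names and sizes are illustrative assumptions, causal masking is omitted for brevity, and the mean pooling over bytes stands in for the cross-attention the paper uses between bytes and patches:

```python
import torch
import torch.nn as nn

class BLTSketch(nn.Module):
    """Rough shape of the BLT pipeline (not the authors' implementation):
    a lightweight byte-level encoder pools bytes into patch vectors, a large
    latent transformer runs over the shorter patch sequence, and a lightweight
    decoder maps patch states back to next-byte predictions."""

    def __init__(self, d_local=512, d_latent=2048):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_local)          # raw bytes, no BPE vocabulary
        self.local_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_local, nhead=8, batch_first=True), num_layers=2)
        self.to_latent = nn.Linear(d_local, d_latent)
        self.latent_transformer = nn.TransformerEncoder(      # the bulk of the FLOPs lives here
            nn.TransformerEncoderLayer(d_latent, nhead=16, batch_first=True), num_layers=24)
        self.local_decoder = nn.Linear(d_latent, 256)         # stand-in for the small byte decoder

    def forward(self, byte_ids, patch_spans):
        # byte_ids: (1, T) raw bytes; patch_spans: list of (start, end) patch boundaries.
        h = self.local_encoder(self.byte_embed(byte_ids))                       # (1, T, d_local)
        patch_vecs = torch.stack(
            [h[0, s:e].mean(dim=0) for s, e in patch_spans]).unsqueeze(0)       # (1, P, d_local)
        latent = self.latent_transformer(self.to_latent(patch_vecs))            # (1, P, d_latent)
        return self.local_decoder(latent)                                       # (1, P, 256) logits
```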

Entropy patching dynamically adjusts patch sizes based on data complexity, allowing BLT to allocate more compute to hard predictions and use larger patches for simpler ones. This results in fewer, larger processing steps to cover the same data.
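
A minimal sketch of entropy-based patching, assuming a small causal byte LM (`small_byte_lm`, hypothetical here) that returns next-byte logits and a single global entropy threshold; the threshold value and the patching variants actually used are described in the paper:

```python
import torch
import torch.nn.functional as F

def entropy_patch_spans(byte_ids, small_byte_lm, threshold=2.0):
    """Cut a new patch wherever the small byte LM is uncertain about the next
    byte (entropy above `threshold`, in nats). Returns a list of (start, end) spans."""
    with torch.no_grad():
        logits = small_byte_lm(byte_ids.unsqueeze(0))                  # (1, T, 256)
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)[0]    # (T,)

    spans, start = [], 0
    for t in range(1, byte_ids.numel()):
        if entropy[t - 1] > threshold:   # the upcoming byte was hard to predict
            spans.append((start, t))     # so close the current patch before it
            start = t
    spans.append((start, byte_ids.numel()))
    return spans
```

Predictable runs (e.g. the tail of a common word) yield long spans and therefore fewer latent-transformer steps, while high-entropy positions get short patches and more compute.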

BLT unlocks a new scaling dimension by simultaneously growing patch and model size without changing training or inference cost. Patch length scaling quickly overtakes BPE transformer scaling, and the trends look even better at larger scales!
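
A back-of-the-envelope way to see the trade-off, assuming the usual rough estimate of ~2 FLOPs per parameter per forward step and ignoring the lightweight local models (numbers are illustrative, not the paper's):

```python
def latent_flops_per_byte(latent_params, avg_patch_bytes):
    # The latent transformer runs once per patch, so its cost per byte
    # shrinks as the average patch gets longer.
    return 2 * latent_params / avg_patch_bytes

# Doubling the average patch size leaves room to roughly double the latent
# model at the same inference cost per byte:
print(latent_flops_per_byte(4e9, avg_patch_bytes=4))   # 2.0e9 FLOPs/byte
print(latent_flops_per_byte(8e9, avg_patch_bytes=8))   # 2.0e9 FLOPs/byte
```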

Parameter-matched training runs up to 8B parameters and 4T bytes show that BLT performs well on standard benchmarks and can trade minor losses in evaluation metrics for up to 50% reductions in inference FLOPs.

Figure credit: https://x.com/garrethleee/status/1868702376754135154

Amazing work! I am especially interested in follow-ups on fine-tuning the entropy model, since the robustness probably depends quite heavily on it. Or am I overestimating that?

Incredible work. I wonder if we could add another layer, a patch of patches, and use it for fine-tuning.


A video summary is available here: https://aipapersacademy.com/byte-latent-transformer/

