Exciting breakthrough in AI: @Meta's new Byte Latent Transformer (BLT) revolutionizes language models by eliminating tokenization!
The BLT architecture introduces a groundbreaking approach that processes raw bytes instead of tokens, achieving state-of-the-art performance while being more efficient and robust. Here's what makes it special:
>> Key Innovations Dynamic Patching: BLT groups bytes into variable-sized patches based on entropy, allocating more compute power where the data is more complex. This results in up to 50% fewer FLOPs during inference compared to traditional token-based models.
Three-Component Architecture: • Lightweight Local Encoder that converts bytes to patch representations • Powerful Global Latent Transformer that processes patches • Local Decoder that converts patches back to bytes
>> Technical Advantages • Matches performance of Llama 3 at 8B parameters while being more efficient • Superior handling of non-English languages and rare character sequences • Remarkable 99.9% accuracy on spelling tasks • Better scaling properties than token-based models
>> Under the Hood The system uses an entropy model to determine patch boundaries, cross-attention mechanisms for information flow, and hash n-gram embeddings for improved representation. The architecture allows simultaneous scaling of both patch and model size while maintaining fixed inference costs.
This is a game-changer for multilingual AI and could reshape how we build future language models. Excited to see how this technology evolves!
🕰️ Llama-3.1-405B took 39 million GPU-hours to train, i.e. about 4.5 thousand years.
👴🏻 If they had needed all this time, we would have GPU stories from the time of Pharaoh 𓂀: "Alas, Lord of Two Lands, the shipment of counting-stones arriving from Cathay was lost to pirates, this shall delay the building of your computing temple by many moons "
🛠️ But instead, they just parallelized the training on 24k H100s, which made it take just a few months. This required parallelizing across 4 dimensions: data, tensor, context, pipeline. And it is infamously hard to do, making for bloated code repos that hold together only by magic.
🤏 𝗕𝘂𝘁 𝗻𝗼𝘄 𝘄𝗲 𝗱𝗼𝗻'𝘁 𝗻𝗲𝗲𝗱 𝗵𝘂𝗴𝗲 𝗿𝗲𝗽𝗼𝘀 𝗮𝗻𝘆𝗺𝗼𝗿𝗲! Instead of building mega-training codes, Hugging Face colleagues cooked in the other direction, towards tiny 4D parallelism libs. A team has built Nanotron, already widely used in industry. And now a team releases Picotron, a radical approach to code 4D Parallelism in just a few hundred lines of code, a real engineering prowess, making it much easier to understand what's actually happening!
⚡ 𝗜𝘁'𝘀 𝘁𝗶𝗻𝘆, 𝘆𝗲𝘁 𝗽𝗼𝘄𝗲𝗿𝗳𝘂𝗹: Counting in MFU (Model FLOPs Utilization, how much the model actually uses all the compute potential), this lib reaches ~50% on SmolLM-1.7B model with 8 H100 GPUs, which is really close to what huge libs would reach. (Caution: the team is leading further benchmarks to verify this)