MiniMax's new MoE LLM reaches Claude-Sonnet level with 4M tokens context length 💥
This work from Chinese startup @MiniMax-AI introduces a novel architecture that achieves state-of-the-art performance while handling context windows up to 4 million tokens - roughly 20x longer than current models. The key was combining lightning attention, mixture of experts (MoE), and a careful hybrid of linear and softmax attention layers.
Key insights:
🏗️ MoE with novel hybrid attention:
┣ Mixture of Experts with 456B total parameters (45.9B activated per token)
┣ Combines Lightning attention (linear complexity) for most layers and traditional softmax attention every 8 layers
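To make the hybrid layout concrete, here is a minimal sketch of the layer schedule described above, assuming the softmax layer sits at every 8th position (after 7 lightning layers). The function name `layer_schedule` and the 16-layer depth are illustrative assumptions, not taken from the model's code.

```python
def layer_schedule(num_layers: int, softmax_every: int = 8) -> list[str]:
    """Return the attention type per layer.

    Assumption: every `softmax_every`-th layer uses softmax attention,
    all other layers use lightning (linear) attention.
    """
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "lightning"
        for i in range(num_layers)
    ]


# Hypothetical depth of 16 layers, chosen only to keep the output short.
schedule = layer_schedule(16)
print(schedule.count("lightning"), schedule.count("softmax"))  # 14 2

# Share of parameters activated per token, from the figures in the post.
active_fraction = 45.9 / 456
print(f"{active_fraction:.1%}")  # 10.1%
```

So roughly 7 out of 8 layers scale linearly with sequence length, and only about a tenth of the weights are used for any given token.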
Outperforms leading models across benchmarks while offering vastly longer context:
┣ Competitive with GPT-4/Claude-3.5-Sonnet on most tasks
┣ Can efficiently handle 4M token contexts (vs 256K for most other LLMs)
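A quick back-of-the-envelope sketch of why linear attention matters at this scale: it compares how attention cost grows when going from a 256K to a 4M context, under the textbook assumption that softmax attention scales quadratically and linear attention linearly with sequence length. The helper `cost_growth` is hypothetical, "K" and "M" are read as decimal thousands and millions, and constant factors are ignored.

```python
def cost_growth(old_len: int, new_len: int) -> tuple[float, float]:
    """Return (linear_growth, quadratic_growth) of attention cost.

    Assumption: cost is proportional to n for linear attention and
    to n**2 for softmax attention; constants are ignored.
    """
    ratio = new_len / old_len
    return ratio, ratio ** 2


linear, quadratic = cost_growth(256_000, 4_000_000)
print(f"linear attention cost grows ~{linear:.1f}x")      # ~15.6x
print(f"softmax attention cost grows ~{quadratic:.0f}x")  # ~244x
```

Under these assumptions, the few softmax layers that remain are what dominates attention cost at very long contexts, which is presumably why they are kept to one in eight.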
🔬 Technical innovations enable efficient scaling:
┣ Novel expert parallel and tensor parallel strategies cut communication overhead in half
┣ Improved linear attention sequence parallelism, multi-level padding and other optimizations achieve 75% GPU utilization (that's really high; utilization is generally around 50%)
🎯 Thorough training strategy:
┣ Careful data curation and quality control by using a smaller preliminary version of their LLM as a judge!
Overall, not only is the model impressive, but the technical paper is also really interesting!
It has lots of insights, including a great comparison showing how a MoE with 2B activated parameters (24B total) far outperforms a 7B dense model for the same amount of FLOPs.
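To see what "same amount of FLOPs" implies for the MoE-versus-dense comparison, here is a rough sketch using the common rule of thumb that training compute is about 6 × (active parameters) × (tokens). This approximation is my assumption, not the paper's accounting: it ignores attention and routing overhead, and the 1T-token budget is a made-up example.

```python
def training_flops(active_params: float, tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6 * active_params * tokens


def tokens_for_budget(active_params: float, flops: float) -> float:
    """Tokens a model can be trained on within a fixed FLOPs budget."""
    return flops / (6 * active_params)


# Hypothetical budget: a 7B dense model trained on 1T tokens.
budget = training_flops(7e9, 1e12)

# Tokens a MoE with 2B activated parameters can see for the same compute.
moe_tokens = tokens_for_budget(2e9, budget)
print(f"{moe_tokens / 1e12:.1f}T tokens")  # 3.5T tokens
```

Under this approximation, the MoE sees about 3.5x more tokens for the same compute while holding 24B parameters in total.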
Read it in full here 👉 MiniMax-01: Scaling Foundation Models with Lightning Attention (2501.08313)
Model here, licensed for commercial use below 100M monthly users 👉 MiniMaxAI/MiniMax-Text-01