kaizuberbuehler's Collections
LM Inference
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Paper
•
2402.17764
•
Published
•
608
BitNet: Scaling 1-bit Transformers for Large Language Models
Paper
•
2310.11453
•
Published
•
96
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Paper
•
2404.02258
•
Published
•
104
TransformerFAM: Feedback attention is working memory
Paper
•
2404.09173
•
Published
•
44
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
Paper
•
2404.08801
•
Published
•
65
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Paper
•
2404.07143
•
Published
•
106
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
Paper
•
2404.05892
•
Published
•
33
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Paper
•
2404.05726
•
Published
•
21
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Paper
•
2402.13753
•
Published
•
115
How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
Paper
•
2404.14047
•
Published
•
45
SnapKV: LLM Knows What You are Looking for Before Generation
Paper
•
2404.14469
•
Published
•
24
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Paper
•
2404.16710
•
Published
•
77
Octopus v4: Graph of language models
Paper
•
2404.19296
•
Published
•
117
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
Paper
•
2404.18911
•
Published
•
30
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report
Paper
•
2405.00732
•
Published
•
120
Imp: Highly Capable Large Multimodal Models for Mobile Devices
Paper
•
2405.12107
•
Published
•
27
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Paper
•
2405.21060
•
Published
•
64
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
Paper
•
2406.07522
•
Published
•
38
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B
Paper
•
2406.07394
•
Published
•
27
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Paper
•
2407.21787
•
Published
•
12
ThinK: Thinner Key Cache by Query-Driven Pruning
Paper
•
2407.21018
•
Published
•
31
A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
Paper
•
2409.11055
•
Published
•
17
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
Paper
•
2409.17422
•
Published
•
25
Thinking LLMs: General Instruction Following with Thought Generation
Paper
•
2410.10630
•
Published
•
18
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
Paper
•
2409.17066
•
Published
•
28
Efficiently Serving LLM Reasoning Programs with Certaindex
Paper
•
2412.20993
•
Published
•
35
Token-Budget-Aware LLM Reasoning
Paper
•
2412.18547
•
Published
•
45
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
Paper
•
2411.19943
•
Published
•
57
Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
Paper
•
2411.17691
•
Published
•
11
Star Attention: Efficient LLM Inference over Long Sequences
Paper
•
2411.17116
•
Published
•
49
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
Paper
•
2411.10958
•
Published
•
52
BitNet a4.8: 4-bit Activations for 1-bit LLMs
Paper
•
2411.04965
•
Published
•
64
1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
Paper
•
2410.16144
•
Published
•
3
FlatQuant: Flatness Matters for LLM Quantization
Paper
•
2410.09426
•
Published
•
13
PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs
Paper
•
2410.05265
•
Published
•
30
Tensor Product Attention Is All You Need
Paper
•
2501.06425
•
Published
•
75