Linear Transformers with Learnable Kernel Functions are Better In-Context Models Paper • 2402.10644 • Published Feb 16, 2024 • 79
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints Paper • 2305.13245 • Published May 22, 2023 • 5
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition Paper • 2402.15220 • Published Feb 23, 2024 • 19
Sequence Parallelism: Long Sequence Training from System Perspective Paper • 2105.13120 • Published May 26, 2021 • 5
Ring Attention with Blockwise Transformers for Near-Infinite Context Paper • 2310.01889 • Published Oct 3, 2023 • 10
Striped Attention: Faster Ring Attention for Causal Transformers Paper • 2311.09431 • Published Nov 15, 2023 • 4
DeBERTa: Decoding-enhanced BERT with Disentangled Attention Paper • 2006.03654 • Published Jun 5, 2020 • 3
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing Paper • 2111.09543 • Published Nov 18, 2021 • 2
Rescoring Sequence-to-Sequence Models for Text Line Recognition with CTC-Prefixes Paper • 2110.05909 • Published Oct 12, 2021 • 2
3D Medical Image Segmentation based on multi-scale MPU-Net Paper • 2307.05799 • Published Jul 11, 2023 • 2
Attention Swin U-Net: Cross-Contextual Attention Mechanism for Skin Lesion Segmentation Paper • 2210.16898 • Published Oct 30, 2022 • 2
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows Paper • 2107.00652 • Published Jul 1, 2021 • 2
MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition Paper • 2209.01620 • Published Aug 31, 2022 • 2
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows Paper • 2103.14030 • Published Mar 25, 2021 • 4
Using Multi-scale SwinTransformer-HTC with Data augmentation in CoNIC Challenge Paper • 2202.13588 • Published Feb 28, 2022 • 2
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Paper • 2211.00593 • Published Nov 1, 2022 • 2
BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences Paper • 2403.09347 • Published Mar 14, 2024 • 20
Lightweight Image Inpainting by Stripe Window Transformer with Joint Attention to CNN Paper • 2301.00553 • Published Jan 2, 2023 • 2
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers Paper • 2311.10642 • Published Nov 17, 2023 • 23
Code Completion using Neural Attention and Byte Pair Encoding Paper • 2004.06343 • Published Apr 14, 2020 • 2
Recurrent Drafter for Fast Speculative Decoding in Large Language Models Paper • 2403.09919 • Published Mar 14, 2024 • 20
Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers Paper • 2403.12943 • Published Mar 19, 2024 • 14
VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis Paper • 2403.13501 • Published Mar 20, 2024 • 9
Efficient Memory Management for Large Language Model Serving with PagedAttention Paper • 2309.06180 • Published Sep 12, 2023 • 25
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention Paper • 2404.07143 • Published Apr 10, 2024 • 104
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length Paper • 2404.08801 • Published Apr 12, 2024 • 63
Hydragen: High-Throughput LLM Inference with Shared Prefixes Paper • 2402.05099 • Published Feb 7, 2024 • 19
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs Paper • 2402.15627 • Published Feb 23, 2024 • 34
MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation Paper • 2404.11565 • Published Apr 17, 2024 • 14
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification Paper • 2305.09781 • Published May 16, 2023 • 4
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation Paper • 2404.19427 • Published Apr 30, 2024 • 71
What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation Paper • 2404.07129 • Published Apr 10, 2024 • 3
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality Paper • 2405.21060 • Published May 31, 2024 • 63
VideoFACT: Detecting Video Forgeries Using Attention, Scene Context, and Forensic Traces Paper • 2211.15775 • Published Nov 28, 2022 • 1
Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures Paper • 2407.09468 • Published Jul 12, 2024 • 1
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters Paper • 2408.04093 • Published Aug 7, 2024 • 4
HAT: Hybrid Attention Transformer for Image Restoration Paper • 2309.05239 • Published Sep 11, 2023 • 1
Byte Latent Transformer: Patches Scale Better Than Tokens Paper • 2412.09871 • Published Dec 13, 2024 • 75