
Huiqiang Jiang PRO

iofu728

https://www.microsoft.com/en-us/research/people/hjiang/

AI & ML interests

None yet

Recent Activity

liked a model 1 day ago

Qwen/Qwen2.5-14B-Instruct-1M

upvoted a paper 4 days ago

Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models

liked a model 7 days ago

deepseek-ai/DeepSeek-R1-Distill-Qwen-32B


Articles

How to Optimize TTFT of 8B LLMs with 1M Tokens to 20s

Jul 21, 2024

• 2

Organizations

Posts 2

Post

1096

Welcome to MInference, which leverages the dynamic sparse nature of LLMs' attention (which also exhibits some static patterns) to speed up pre-filling for million-token LLMs. It first determines offline which sparse pattern each attention head belongs to, then approximates the sparse index online and dynamically computes attention with the optimal custom kernels. This approach achieves up to a 10x speedup for pre-filling on an A100 while maintaining accuracy at 1M tokens.

For more details, please check:
project page: https://aka.ms/MInference
code: https://github.com/microsoft/MInference
paper: MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (2407.02490)
hf demo: microsoft/MInference
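
To make the "approximate the sparse index online, then compute attention sparsely" idea concrete, here is a toy sketch in plain Python. It is not the MInference implementation and does not use its kernels or API. It assumes a single head with a simplified "important columns" pattern, and the names `estimate_columns`, `sparse_causal_attention`, `last_q`, and `top_k` are hypothetical. The last few queries are used to guess which key positions matter, and every query then attends only to those positions (plus its own position).

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def estimate_columns(queries, keys, last_q, top_k):
    """Cheap online estimate: score key columns using only the last `last_q` queries."""
    n = len(keys)
    scale = 1.0 / math.sqrt(len(keys[0]))
    scores = [0.0] * n
    for i in range(max(0, n - last_q), n):
        # Causal: query i may only look at keys 0..i.
        probs = softmax([dot(queries[i], keys[j]) * scale for j in range(i + 1)])
        for j, p in enumerate(probs):
            scores[j] += p
    ranked = sorted(range(n), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:top_k])

def sparse_causal_attention(queries, keys, values, columns):
    """Attend only to the selected key columns (and the diagonal), causally."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for i, q in enumerate(queries):
        cols = [j for j in columns if j <= i]
        if i not in cols:
            cols.append(i)  # keep the diagonal so every row has at least one key
        probs = softmax([dot(q, keys[j]) * scale for j in cols])
        out.append([sum(p * values[j][d] for p, j in zip(probs, cols)) for d in range(dim)])
    return out

# Tiny made-up example: 6 tokens, head dimension 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0], [0.0, 0.5]]
queries = [row[:] for row in keys]
values = [[float(i), 1.0] for i in range(6)]

columns = estimate_columns(queries, keys, last_q=2, top_k=3)
output = sparse_causal_attention(queries, keys, values, columns)
```

In this sketch the estimation step costs `last_q` rows of attention instead of all of them, and the second pass touches at most `top_k + 1` keys per query. The real system selects among several per-head patterns offline and relies on GPU kernels, which this example does not attempt to reproduce.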

Post

1506

Welcome to LLMLingua-2, a small yet powerful prompt compression method. Trained via data distillation from GPT-4 for token classification with a BERT-level encoder, it excels in task-agnostic compression. It surpasses LLMLingua in handling out-of-domain data while running 3x-6x faster. @qianhuiwu

website: https://llmlingua.com/llmlingua2.html
code: https://github.com/microsoft/LLMLingua
demo: microsoft/llmlingua-2
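
As a rough illustration of compression by token classification, the sketch below scores each token with a "keep" probability and retains the highest-scoring tokens in their original order until a target rate is met. It is only a toy: `keep_probability` is a hypothetical hand-written heuristic standing in for the trained BERT-level classifier, and `compress`, `rate`, and `STOPWORDS` are illustrative names, not the LLMLingua API.

```python
STOPWORDS = {"the", "a", "an", "is", "are", "for", "of", "in", "on", "to", "and"}

def keep_probability(token):
    """Hypothetical stand-in for a trained classifier's p(preserve | token)."""
    word = token.lower().strip(".,!?")
    if not word:
        return 0.0
    if word in STOPWORDS:
        return 0.1
    if any(ch.isdigit() for ch in word):
        return 0.95
    # Assumption for the toy: longer words carry more information.
    return min(0.9, 0.4 + 0.05 * len(word))

def compress(prompt, rate):
    """Keep roughly `rate` of the tokens, chosen by keep probability, in original order."""
    tokens = prompt.split()
    budget = max(1, int(len(tokens) * rate))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: keep_probability(tokens[i]),
                    reverse=True)
    kept = sorted(ranked[:budget])  # restore the original token order
    return " ".join(tokens[i] for i in kept)

prompt = "The meeting is scheduled for the 5 of March in the main office"
compressed = compress(prompt, rate=0.5)
```

Because each token is scored independently of any downstream question, the same compressed prompt can be reused across tasks, which is the sense in which this style of compression is task-agnostic.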

Papers 6

models

None public yet

datasets

None public yet