Huiqiang Jiang PRO

iofu728

AI & ML interests

None yet

Recent Activity

updated a dataset about 18 hours ago
microsoft/SCBench
upvoted a paper 5 days ago
Qwen2.5 Technical Report

Articles

Organizations

Microsoft, MInference

Posts 2

Welcome to MInference, which speeds up pre-filling for million-token LLMs by exploiting the dynamic sparsity of their attention, which nonetheless exhibits some static patterns. It first determines offline which sparse pattern each head belongs to, then approximates the sparse index online and dynamically computes attention with the optimal custom kernels. This approach achieves up to a 10x speedup for pre-filling on an A100 while maintaining accuracy at 1M tokens.
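The online step described above can be illustrated with a toy NumPy sketch. This is not MInference's code or API: every function name and parameter here is hypothetical, the per-head offline pattern assignment is omitted, and the pattern used (a few globally important key columns plus a small local window) is only an assumed example of a sparse pattern. The sketch also computes dense scores and then masks them, so it shows which entries are kept rather than any speedup, which in practice depends on custom kernels.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_causal_attention(q, k, v):
    # Reference: full causal attention for a single head.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def estimate_key_columns(q, k, last_q=8, top_k=16):
    # Hypothetical online estimate: score all keys against only the last
    # `last_q` queries and keep the `top_k` keys with the most attention mass.
    n, d = q.shape
    probe = q[-last_q:] @ k.T / np.sqrt(d)            # (last_q, n)
    q_pos = np.arange(n - last_q, n)[:, None]
    k_pos = np.arange(n)[None, :]
    probe = np.where(k_pos <= q_pos, probe, -np.inf)  # causal mask
    importance = softmax(probe).sum(axis=0)           # (n,)
    return np.sort(np.argsort(importance)[-top_k:])

def sparse_causal_attention(q, k, v, cols, local=4):
    # Attend only to the selected key columns plus a local causal window.
    n, d = q.shape
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    allowed = np.zeros((n, n), dtype=bool)
    allowed[:, cols] = True
    allowed |= (i - j) < local   # recent keys (future ones removed next)
    allowed &= j <= i            # enforce causality
    scores = np.where(allowed, q @ k.T / np.sqrt(d), -np.inf)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n, d = 64, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

cols = estimate_key_columns(q, k, last_q=8, top_k=16)
out = sparse_causal_attention(q, k, v, cols, local=4)
ref = dense_causal_attention(q, k, v)
print("selected key columns:", cols)
print("mean abs difference vs dense:", np.abs(out - ref).mean())
```

With random inputs the attention is not actually sparse, so the difference from the dense result is not representative of a trained model; the sketch is only meant to make the "estimate the index cheaply, then compute on the selected entries" idea concrete.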

For more details, see:
project page: https://aka.ms/MInference
code: https://github.com/microsoft/MInference
paper: MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (2407.02490)
hf demo: microsoft/MInference

models

None public yet

datasets

None public yet