
Lin Tan

lin-tan

AI & ML interests

AI-Software Synergy. LLM4Code (binary and source code). Mary J. Elmore New Frontiers Professor, Purdue University.

Recent Activity

reacted to their post with 🔥 1 day ago
posted an update 1 day ago
liked a dataset 24 days ago
lt-asset/REPOCOD_Lite_Unified

Organizations

Purdue ASSET Research Group

Posts 2

Introducing Nova (ICLR’25), foundation models for binary/assembly code. We have also released fine-tuned models for binary code decompilation. Preprint: arxiv.org/pdf/2311.13721 This is our follow-up work on binary analysis after our CCS'24 distinguished paper (https://www.linkedin.com/posts/lintan_resym-harnessing-llms-to-recover-variable-activity-7231749452154159105-sEgj)

Highlights:
1. Nova is built with hierarchical attention specially designed for binary code, together with contrastive learning.
2. Nova is pre-trained on 3B tokens of binary and source code.
3. Models: lt-asset/nova-6.7b, lt-asset/nova-6.7b-bcr
4. Smaller 1.3B models: lt-asset/nova-1.3b, lt-asset/nova-1.3b-bcr (loading sketch below)
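If you just want to try the checkpoints, here is a minimal loading sketch using the Hugging Face transformers library. The prompt format, generation settings, and the trust_remote_code flag are assumptions on my part; see the model cards for the exact intended usage.

```python
# Minimal sketch: loading a Nova checkpoint as a causal LM with transformers.
# Assumptions: the checkpoint may ship custom modeling code (hence trust_remote_code=True),
# and the assembly prompt below is purely illustrative -- the real input format for
# binary code recovery is defined by the model card, not here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lt-asset/nova-1.3b-bcr"  # smaller decompilation-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical input: assembly for a tiny function.
asm = "push rbp\nmov rbp, rsp\nmov eax, edi\nimul eax, esi\npop rbp\nret\n"
inputs = tokenizer(asm, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```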

Binaries are a form of code. Do not forget about binaries when you work on #LLM4Code.

Why binaries and binary models? Binary code plays an irreplaceable role in crucial tasks, including vulnerability detection, malware detection, binary recovery, and legacy software maintenance. For example, when identifying attacks and malware, security analysts often only have access to assembly, i.e., the human-readable representation of binary code, which is extremely difficult to understand. Combined with the increasing sophistication of cybercrime, which poses significant threats worldwide (cybercrime is predicted to cost the world $10.5 trillion annually by 2025 (Sausalito, 2020)), this puts effective binary analysis techniques in high demand.

#LLM4Code #LLM #BinaryAnalysis #Security

@jiang719 Chengxiao Wang, Kevin Liu, Xiangzhe Xu, Xiangyu Zhang, @pbabkin
Can language models replace developers? #RepoCod says “Not Yet”, because GPT-4o and other LLMs have <30% accuracy/pass@1 on real-world code generation tasks.
- Leaderboard https://lt-asset.github.io/REPOCOD/
- Dataset: lt-asset/REPOCOD
@jiang719 @shanchao @Yiran-Hu1007
Compared to #SWEBench, RepoCod tasks:
- Are general code generation tasks, while SWE-Bench tasks resolve pull requests from GitHub issues.
- Have 2.6X more tests per task on average (313.5 vs. SWE-Bench’s 120.8).

Compared to #HumanEval, #MBPP, #CoderEval, and #ClassEval, RepoCod has 980 instances from 11 Python projects, with
- Whole function generation
- Repository-level context
- Validation with test cases, and
- Real-world complex tasks: the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00)

Introducing #RepoCod-Lite 🐟 for faster evaluations: 200 of the toughest tasks from RepoCod, with:
- 67 repository-level, 67 file-level, and 66 self-contained tasks
- Detailed problem descriptions (967 tokens) and long canonical solutions (918 tokens)
- GPT-4o and other LLMs have < 10% accuracy/pass@1 on RepoCod-Lite tasks.
- Dataset: lt-asset/REPOCOD_Lite
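
A minimal sketch for pulling both benchmarks from the Hub with the datasets library; split and column names are not spelled out here, so inspect the printed structure and the dataset cards before relying on any specific field.

```python
# Minimal sketch: loading the RepoCod benchmarks with the `datasets` library.
# Assumption: split and column names are whatever the dataset cards define;
# print the dataset objects to see the actual schema.
from datasets import load_dataset

repocod = load_dataset("lt-asset/REPOCOD")            # full benchmark (980 instances)
repocod_lite = load_dataset("lt-asset/REPOCOD_Lite")  # 200 hardest tasks

print(repocod)       # shows available splits and columns
print(repocod_lite)
```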

#LLM4code #LLM #CodeGeneration #Security

models

None public yet

datasets

None public yet