tasks:
  - text-generation
tags:
  - transformer
  - Codefuse
  - CodeLlama
studios:
  - codefuse-ai/TestGPT-7B-demo

Introduction

TestGPT-7B is a large language model for the software testing domain, developed by Ant Group. It is built on CodeLlama-7B and fine-tuned for downstream testing tasks, including multi-language test case generation and test case assertion completion.

  • Multi-language test case generation

Test case generation has long been an active area in both academia and industry, and tools such as EvoSuite, Randoop, and SmartUnit have emerged in recent years. Traditional generation tools, however, have pain points that are hard to resolve. Test case generation based on large models improves on them in test case readability, test scenario completeness, and multi-language support. TestGPT-7B focuses on multi-language test case generation: this open-source release covers Java, Python, and JavaScript, and the next version will gradually add languages such as Go and C++.

  • Test case assertion completion

While analyzing the current state of test cases, we found that a certain proportion of existing test cases in code repositories contain no Assert statements. Such test cases pass during regression runs but cannot detect any defects. We therefore added automatic completion of Assert statements as a supported scenario. Combined with some engineering support, this capability allows batch completion of assertions across all test cases in a repository, raising the project's quality level.
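
The "engineering support" mentioned above would need a way to find test methods that lack assertions before sending them to the model. The following is a minimal sketch of such a scan, not part of the TestGPT-7B release. The function name `find_tests_without_assert` is hypothetical, and the approach is a regex and brace-counting heuristic that assumes JUnit-style `@Test` methods returning `void` and ignores braces inside string literals or comments.

```python
import re
from typing import List

# Heuristic patterns (assumptions for illustration, not a full Java parser).
TEST_ANNOTATION = re.compile(r"@Test\b")
METHOD_NAME = re.compile(r"\bvoid\s+(\w+)\s*\(")
ASSERT_PATTERN = re.compile(r"\bassert\w*")  # assert, assertTrue, assertEquals, ...


def find_tests_without_assert(java_source: str) -> List[str]:
    """Return the names of @Test methods whose body contains no assertion."""
    missing = []
    for match in TEST_ANNOTATION.finditer(java_source):
        rest = java_source[match.end():]
        name_match = METHOD_NAME.search(rest)
        body_start = rest.find("{")
        if name_match is None or body_start == -1:
            continue
        # Count braces to locate the end of the method body.
        depth = 0
        body_end = None
        for pos in range(body_start, len(rest)):
            if rest[pos] == "{":
                depth += 1
            elif rest[pos] == "}":
                depth -= 1
                if depth == 0:
                    body_end = pos
                    break
        if body_end is None:
            continue  # unbalanced braces; skip rather than guess
        body = rest[body_start:body_end + 1]
        if not ASSERT_PATTERN.search(body):
            missing.append(name_match.group(1))
    return missing
```

Methods reported by such a scan could then be sent to the model together with the code under test. A production pipeline would more likely use a real Java parser than regular expressions.
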
We will continue to iterate on the model's capabilities: 1) adding more testing-domain application scenarios, such as domain knowledge Q&A and test scenario analysis; 2) building on 7B, gradually expanding to 13B and 34B models. Stay tuned for updates!

Requirements

  • python>=3.8
  • pytorch>=2.0.0
  • CUDA 11.4
  • transformers==4.33.2

Test Case Evaluation

  • TestGPT-7B multi-language test case generation

The model currently supports test case generation for three languages: Java, Python, and JavaScript. The Pass@1 evaluation results are as follows:

| Model | Java pass@1 | Java average number of test scenarios | Python pass@1 | Python average number of test scenarios | JavaScript pass@1 | JavaScript average number of test scenarios |
|---|---|---|---|---|---|---|
| TestGPT-7B | 48.6% | 4.37 | 35.67% | 3.56 | 36% | 2.76 |
| CodeLlama-13B-Instruct | 40.54% | 1.08 | 30.57% | 1.65 | 31.7% | 3.13 |
| Qwen-14B-Chat | 10.81% | 2.78 | 15.9% | 1.32 | 9.15% | 4.22 |
| Baichuan2-13B-Chat | 13.5% | 2.24 | 12.7% | 2.12 | 6.1% | 3.31 |

Note: The currently open-sourced base models (such as CodeLlama-13B, Qwen-14B, and Baichuan2-13B) cannot generate test cases, so the comparison uses their official aligned chat models instead (CodeLlama-13B-Instruct, Qwen-14B-Chat, and Baichuan2-13B-Chat).
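
This README does not state how many samples were drawn per problem or how Pass@1 was aggregated. As a reference point only, the sketch below shows the widely used unbiased pass@k estimator for code generation benchmarks, which for k = 1 reduces to the fraction of correct samples. The function names and the `(n, c)` input format are assumptions made for illustration.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number of samples that pass, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: List[Tuple[int, int]]) -> float:
    """Average pass@1 over problems, given (n, c) for each problem."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

With a single sample per problem, `benchmark_pass_at_1` is simply the share of problems whose one generated test passes.
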

  • TestGPT-7B test case assertion completion

The model currently supports assertion completion for Java test cases. The Pass@1 evaluation results are as follows:

| Model | pass@1 | Percentage of strong validation |
|---|---|---|
| TestGPT-7B | 71.1% | 100% |

We have also open-sourced the evaluation datasets for test case generation and test case assertion completion, so that model results can be compared and reproduced. The datasets are in the eval_data folder.

QuickStart

The following example uses the TestGPT-7B model for test case generation and test case assertion completion:

from modelscope import AutoModelForCausalLM, AutoTokenizer, snapshot_download, AutoConfig
import torch

HUMAN_ROLE_START_TAG = "<s>human\n"
BOT_ROLE_START_TAG = "<s>bot\n"

if __name__ == '__main__':
    # Model location; can be replaced with a local model path
    model_dir = snapshot_download('codefuse-ai/TestGPT-7B', revision = 'v1.0.0')

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True, use_fast=False, legacy=False)

    eos_token = '</s>'
    pad_token = '<unk>'

    try:
        tokenizer.eos_token = eos_token
        tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids(eos_token)
    except Exception:
        # Could not override EOS; show the tokenizer's existing values
        print(tokenizer.eos_token, tokenizer.eos_token_id)

    try:
        tokenizer.pad_token = pad_token
        tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(pad_token)
    except Exception:
        # Could not override PAD; show the tokenizer's existing values
        print(tokenizer.pad_token, tokenizer.pad_token_id)

    tokenizer.padding_side = "left"
    print(f"tokenizer's eos_token: {tokenizer.eos_token}, pad_token: {tokenizer.pad_token}")
    print(f"tokenizer's eos_token_id: {tokenizer.eos_token_id}, pad_token_id: {tokenizer.pad_token_id}")

    # Configuration
    config, unused_kwargs = AutoConfig.from_pretrained(
        model_dir,
        use_flash_attn=True,
        use_xformers=True,
        trust_remote_code=True,
        return_unused_kwargs=True)

    # Load the model
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        config=config,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        use_safetensors=False,
    ).eval()

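    # Run inference to generate test cases
    # Prompt containing the code under test; there are two formats: test case generation and assert completion
    # Test case generation format (the instruction means "Generate unit tests for the following Python code")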
    prompt = '为以下Python代码生成单元测试\n' \
             '```Python\ndef add(lst):\n    return sum([lst[i] for i in range(1, len(lst), 2) if lst[i]%2 == 0])\n```\n'

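    # Assert completion format; currently only Java is supported
    # (the first instruction means "Below is the code under test"; the second asks the model to complete the test cases with assert checks)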
    # prompt = '下面是被测代码\n' \
    #          '```java\n' \
    #          'public class BooleanUtils {\n    ' \
    #          'public static boolean and(final boolean... array) {\n        ' \
    #          'ObjectUtils.requireNonEmpty(array, "array");\n        ' \
    #          'for (final boolean element : array) {\n            ' \
    #          'if (!element) {\n                return false;\n            }\n        }\n        ' \
    #          'return true;\n    }\n}\n```\n' \
    #          '下面代码是针对上面被测代码生成的用例,请补全用例,生成assert校验\n' \
    #          '```java\n' \
    #          '@Test\npublic void testAnd_withAllTrueInputs() {\n    ' \
    #          'boolean[] input = new boolean[] {true, true, true};\n    ' \
    #          'boolean result = BooleanUtils.and(input);\n}\n\n@Test\npublic void testAnd_withOneFalseInput() {\n    ' \
    #          'boolean[] input = new boolean[] {true, false, true};\n    ' \
    #          'boolean result = BooleanUtils.and(input);\n}\n' \
    #          '```\n'

    # Format the input with the role tags
    prompt = f"{HUMAN_ROLE_START_TAG}{prompt}{BOT_ROLE_START_TAG}"
    inputs = tokenizer(prompt, return_tensors='pt', padding=True, add_special_tokens=False).to(model.device)

    # Inference
    outputs = model.generate(
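        inputs=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # pass the mask explicitly so padding is handled correctly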
        max_new_tokens=2048,
        top_p=0.95,
        temperature=0.2,
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        num_return_sequences=1,
    )

    # Process the results (the decoded text includes the prompt)
    outputs_len = len(outputs)
    print(f"output len is: {outputs_len}")
    for index in range(0, outputs_len):
        print(f"generate index: {index}")
        gen_text = tokenizer.decode(outputs[index], skip_special_tokens=True)
        print(gen_text)
        print("===================")
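
The two prompt formats used in the example above can be wrapped in small helpers so that callers only supply source code. This is a sketch based solely on the strings shown in the QuickStart script. The helper names are hypothetical, and the instruction strings are copied from the example as displayed here, so they should be checked against the upstream repository before use.

```python
HUMAN_ROLE_START_TAG = "<s>human\n"
BOT_ROLE_START_TAG = "<s>bot\n"
FENCE = "`" * 3  # Markdown code fence marker used inside the prompts


def wrap_for_chat(instruction: str) -> str:
    """Surround an instruction with the human/bot role tags."""
    return f"{HUMAN_ROLE_START_TAG}{instruction}{BOT_ROLE_START_TAG}"


def build_generation_prompt(code: str, language: str = "Python") -> str:
    """Prompt asking the model to generate unit tests for `code`."""
    code = code.strip("\n")
    instruction = (
        f"为以下{language}代码生成单元测试\n"
        f"{FENCE}{language}\n{code}\n{FENCE}\n"
    )
    return wrap_for_chat(instruction)


def build_assert_completion_prompt(code_under_test: str, test_code: str) -> str:
    """Prompt asking the model to add asserts to Java tests (Java only)."""
    code_under_test = code_under_test.strip("\n")
    test_code = test_code.strip("\n")
    instruction = (
        "下面是被测代码\n"
        f"{FENCE}java\n{code_under_test}\n{FENCE}\n"
        "下面代码是针对上面被测代码生成的用例,请补全用例,生成assert校验\n"
        f"{FENCE}java\n{test_code}\n{FENCE}\n"
    )
    return wrap_for_chat(instruction)
```

The returned string can be passed directly to the tokenizer with `add_special_tokens=False`, as in the script above, because the role tags already contain the `<s>` markers.
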