Dria Pythonic Agent Benchmark (DPAB)
Introduction
The overwhelming majority, if not all (as far as we know), of large language model (LLM) function calling benchmarks work through JSON-based structured output from models, which contains metadata such as the function name and the argument(s) to be passed to the function [1]. This approach is straightforward and easy to implement, and it makes evaluation deterministic, which is great for creating reproducible benchmarks. However, function calling with structured output is not the only (nor, in our view, the best) way to do function calling. Earlier this week, we released the first edition of Dria-Agent models, Dria-Agent-α-3B and Dria-Agent-α-7B. These models employ Pythonic Function Calling [2], which prompts the model to output a block of Python code that can be executed to produce the desired output. The motivations for this approach are explained in detail in the Dria-Agent-a blog post.
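To make the contrast concrete, here is a minimal sketch of the two styles. The tools `get_weather` and `send_message`, the JSON call format, and the sample model outputs are all hypothetical and exist only for this illustration; they are not taken from DPAB or from any specific model API.

```python
import json

# Hypothetical tools, defined only for this illustration.
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21}

def send_message(to: str, body: str) -> str:
    return f"sent to {to}: {body}"

TOOLS = {"get_weather": get_weather, "send_message": send_message}

# JSON-based function calling: the model emits one structured call,
# and the harness looks up the function and applies the arguments.
json_output = '{"name": "get_weather", "arguments": {"city": "Ankara"}}'
call = json.loads(json_output)
json_result = TOOLS[call["name"]](**call["arguments"])

# Pythonic function calling: the model emits a block of Python code
# that can chain several calls and use ordinary control flow.
python_output = """
weather = get_weather("Ankara")
if weather["temp_c"] > 20:
    status = send_message("alice", f"It is {weather['temp_c']}C in {weather['city']}")
else:
    status = "no message sent"
"""
namespace = dict(TOOLS)  # make only the tools available by name
exec(python_output, namespace)  # plain exec is NOT a sandbox; illustration only

print(json_result)
print(namespace["status"])
```

In the JSON style, chaining the second call on the result of the first would need another model turn, whereas the Python block expresses the whole plan at once.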
The DPAB-α Benchmark
As a follow-up to our Dria-Agent-α models, we have created a new benchmark, DPAB-α, which is a collection of 100 problems synthetically generated & validated with a pipeline very similar to the one used to create the training data for the Dria-Agent-α models. Each dataset row contains the following fields:
- difficulty: The difficulty of the problem, which is either easy or hard.
- function_schema_python: The function definitions, with no implementation, in Python.
- function_schema_json: The function schemas in JSON format.
- mock_functions: The mock functions, implemented with return values, in Python. These are used to generate and validate the checklist.
- user_query: The user query, which is a natural language question that the model needs to answer/solve.
- checklist: The checklist, which is a list of function names and values that need to be in the output of the code execution.
An example checklist is shown below:
"checklist": {
"functions": [
"identify_large_files"
],
"values": [
[
"/dev/projects/project_a/large_file_1.zip",
"/dev/projects/project_b/large_dataset.csv"
]
]
}
This checklist enforces that the model must call the identify_large_files function and that the values ["/dev/projects/project_a/large_file_1.zip", "/dev/projects/project_b/large_dataset.csv"] must appear in the execution output.
How do we produce the execution output? We use the execution engine defined in exec-python, a Python package that allows us to execute any Python code with any number of predefined functions and return the output. The package was developed hand-in-hand with the DPAB-α benchmark.
Now, a question you might have is: how do we generate and validate the checklist? We used the methodology described in the Data Validations section of the Dria-Agent-a blog post, in which we used a 3-step pipeline to generate a valid checklist:
- Decision: The validator model decides whether the checklist is valid or not.
- Justification: The validator model provides a justification for its decision, given the checklist, mock functions, and user query.
- Revision: The validator model revises the checklist if it is not valid, given the justification.
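Once a checklist has been validated, checking a model's answer against it comes down to running the generated code with the mock functions and inspecting what happened. The sketch below illustrates that idea with Python's built-in `exec`; it does not use or reproduce the exec-python API. The mock's signature, the `run_and_check` helper, and the choice to treat variables left behind by the code as the "execution output" are all assumptions made for this example.

```python
import functools

def identify_large_files(directory: str, size_threshold: int) -> list:
    """Mock function with a fixed return value (hypothetical signature)."""
    return [
        "/dev/projects/project_a/large_file_1.zip",
        "/dev/projects/project_b/large_dataset.csv",
    ]

def run_and_check(code: str, functions: dict, checklist: dict) -> bool:
    """Run `code` with the given functions and test it against `checklist`."""
    called = []  # names of the functions the code actually called

    def track(name, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            called.append(name)
            return fn(*args, **kwargs)
        return wrapper

    namespace = {name: track(name, fn) for name, fn in functions.items()}
    exec(code, namespace)  # plain exec is NOT a sandbox; illustration only

    # Assumption: the variables the code leaves behind are its output.
    outputs = [
        value for key, value in namespace.items()
        if key not in functions and not key.startswith("__")
    ]
    functions_ok = all(name in called for name in checklist["functions"])
    values_ok = all(value in outputs for value in checklist["values"])
    return functions_ok and values_ok

# The example checklist from this post.
checklist = {
    "functions": ["identify_large_files"],
    "values": [[
        "/dev/projects/project_a/large_file_1.zip",
        "/dev/projects/project_b/large_dataset.csv",
    ]],
}
mocks = {"identify_large_files": identify_large_files}

# Hypothetical model output for a query about large files.
model_code = 'large_files = identify_large_files("/dev/projects", 100)'
passed = run_and_check(model_code, mocks, checklist)
print(passed)
```

Tracking calls through a wrapper means the function requirement is checked against what the code actually did at runtime, not against names that merely appear in the source text.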
Initial Results
Pythonic function calling performance often outstrips JSON-based function calling in scenarios that require creative or multi-step solutions, reinforcing the premise that Pythonic function calling can be more natural and powerful.
We have run the first edition of the DPAB-α benchmark on many open and closed-source models in strict mode, and the results are shown below:
Model Name | Pythonic | JSON |
---|---|---|
Closed Models | ||
Claude 3.5 Sonnet | 87 | 45 |
o1-preview-2024-09-12 | 55 | 39 |
o1-mini-2024-09-12 | 59 | 35 |
gpt-4o-2024-11-20 | 60 | 30 |
Open Models | ||
> 100B Parameters | ||
DeepSeek V3 (685B) | 63 | 33 |
MiniMax-01 | 62 | 40 |
Llama-3.1-405B-Instruct | 60 | 38 |
> 30B Parameters | ||
Qwen-2.5-Coder-32b-Instruct | 68 | 32 |
Qwen-2.5-72b-instruct | 65 | 39 |
Llama-3.3-70b-Instruct | 59 | 40 |
QwQ-32b-Preview | 47 | 21 |
< 20B Parameters | ||
Dria-Agent-a-7B | 70 | 38 |
Qwen2.5-Coder-7B-Instruct | 44 | 39 |
Dria-Agent-a-3B | 72 | 31 |
Qwen2.5-Coder-3B-Instruct | 26 | 37 |
Qwen-2.5-7B-Instruct | 47 | 34 |
Phi-4 (14B) | 55 | 35 |
Clone the DPAB repo to run the evaluations.
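As a quick way to read the results table, the short script below computes the gap between the two modes for each model. The scores are copied from the table; the variable names are our own, and nothing here is part of the DPAB evaluation code.

```python
# (model, Pythonic score, JSON score), copied from the results table.
RESULTS = [
    ("Claude 3.5 Sonnet", 87, 45),
    ("o1-preview-2024-09-12", 55, 39),
    ("o1-mini-2024-09-12", 59, 35),
    ("gpt-4o-2024-11-20", 60, 30),
    ("DeepSeek V3 (685B)", 63, 33),
    ("MiniMax-01", 62, 40),
    ("Llama-3.1-405B-Instruct", 60, 38),
    ("Qwen-2.5-Coder-32b-Instruct", 68, 32),
    ("Qwen-2.5-72b-instruct", 65, 39),
    ("Llama-3.3-70b-Instruct", 59, 40),
    ("QwQ-32b-Preview", 47, 21),
    ("Dria-Agent-a-7B", 70, 38),
    ("Qwen2.5-Coder-7B-Instruct", 44, 39),
    ("Dria-Agent-a-3B", 72, 31),
    ("Qwen2.5-Coder-3B-Instruct", 26, 37),
    ("Qwen-2.5-7B-Instruct", 47, 34),
    ("Phi-4 (14B)", 55, 35),
]

# Models where each mode scores higher.
pythonic_wins = [name for name, py, js in RESULTS if py > js]
json_wins = [name for name, py, js in RESULTS if js > py]

# Per-model gap (Pythonic minus JSON), largest first.
gaps = sorted(((py - js, name) for name, py, js in RESULTS), reverse=True)

mean_pythonic = sum(py for _, py, _ in RESULTS) / len(RESULTS)
mean_json = sum(js for _, _, js in RESULTS) / len(RESULTS)

print(f"Pythonic higher for {len(pythonic_wins)} of {len(RESULTS)} models")
print(f"JSON higher for: {json_wins}")
print(f"Mean Pythonic: {mean_pythonic:.1f}, mean JSON: {mean_json:.1f}")
for gap, name in gaps[:3]:
    print(f"{name}: +{gap}")
```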
Future Work
Alongside the Dria-Agent series of models, we will also improve upon the first edition of DPAB, and release DPAB-β with a new agentic setup and harder problems.
References
- [1] Yan, Fanjia, et al. Berkeley Function Calling Leaderboard. 2024, https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html.
- [2] andthattoo, ‘Atakan Tekparmak’. Dria-Agent-a. https://huggingface.co/blog/andthattoo/dria-agent-a.