---
library_name: transformers
license: mit
task_categories:
- text-generation
language:
- en
tags:
- agent
- Agentic Learning
- tool use
- BFCL
---
[Model Collection](https://huggingface.co/collections/prem-research/funcdex) · [Dataset](https://huggingface.co/datasets/prem-research/Funcdex-MT-Function-Calling) · [Synthesizer Code](https://github.com/prem-research/Funcdex-Synthesizer) · [Prem AI](https://www.premai.io/)
# Funcdex-0.6B-whatsapp_todoist
Funcdex-0.6B is a research preview model by Prem Labs. It was trained on a mix of [Funcdex-MT-Function-Calling](https://huggingface.co/datasets/prem-research/Funcdex-MT-Function-Calling), instruction-following, and single-turn function-calling datasets. It is a LoRA finetune of Qwen3-0.6B (with thinking disabled).
This model excels at Multi-turn Function Calling with tools from `whatsapp` and `todoist`.
The code used to generate the dataset can be found [here](https://github.com/prem-research/Funcdex-Synthesizer).
# Evaluation
## Results
### BFCL v3
- We filtered BFCL v3 down to the examples relevant to these toolkits/bundles and report performance on that subset.
- The filtered set contains only 83 examples, further underscoring the need for workflow/toolkit-specialized models.
| LLM | Acc % |
| --- | --- |
| GPT-5 Mini (medium) | 0.71 |
| Qwen3-1.7B | 0.82 |
| Funcdex-1.7B | 0.86 |
### Funcdex-MT: Overall Performance
| LLM | Exact Match | String Ratio | Total Cost ($) |
| --- | --- | --- | --- |
| GPT-OSS-120B (medium) | 0.35 | 0.51 | 9.32 |
| GPT-5 Mini (medium) | 0.35 | 0.58 | 99.71 |
| GPT-5 (minimal) | 0.18 | 0.59 | 205.45 |
| Qwen3-0.6B | 0.27 | 0.59 | 2.83 |
| Qwen3-1.7B | 0.27 | 0.69 | 5.73 |
| Funcdex-0.6B | 0.39 | 0.70 | 0.19 |
| Funcdex-1.7B | 0.43 | 0.81 | 5.64 |
### Funcdex-MT: Toolkit-Level Performance
| Toolkit | GPT-OSS-120B (medium) EM / SR | GPT-5 (minimal) EM / SR | GPT-5 Mini (medium) EM / SR | Qwen3-0.6B EM / SR | Funcdex-0.6B EM / SR | Funcdex-0.6B LoRA Checkpoint | Qwen3-1.7B EM / SR | Funcdex-1.7B EM / SR | Funcdex-1.7B LoRA Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Asana | 0.38 / 0.47 | 0.12 / 0.68 | 0.49 / 0.71 | 0.33 / 0.63 | 0.46 / 0.69 | 🤗 | 0.30 / 0.79 | 0.52 / 0.82 | 🤗 |
| Calendly | 0.47 / 0.56 | 0.41 / 0.63 | 0.41 / 0.56 | 0.44 / 0.66 | 0.54 / 0.78 | 🤗 | 0.47 / 0.74 | 0.54 / 0.86 | |
| Gmail | 0.48 / 0.70 | 0.24 / 0.69 | 0.50 / 0.73 | 0.27 / 0.61 | 0.47 / 0.72 | 🤗 | 0.31 / 0.73 | 0.53 / 0.83 | |
| Calendar | 0.27 / 0.52 | 0.20 / 0.50 | 0.21 / 0.51 | 0.21 / 0.53 | 0.39 / 0.74 | 🤗 | 0.23 / 0.64 | 0.47 / 0.83 | |
| Docs | 0.19 / 0.38 | 0.07 / 0.49 | 0.18 / 0.46 | 0.07 / 0.58 | 0.13 / 0.64 | 🤗 | 0.11 / 0.62 | 0.18 / 0.79 | |
| Drive | 0.34 / 0.52 | 0.19 / 0.61 | 0.38 / 0.58 | 0.26 / 0.65 | 0.40 / 0.75 | 🤗 | 0.26 / 0.73 | 0.48 / 0.82 | |
| Jira | 0.47 / 0.53 | 0.17 / 0.65 | 0.47 / 0.66 | 0.51 / 0.69 | 0.58 / 0.76 | 🤗 | 0.47 / 0.76 | 0.59 / 0.83 | |
| Stripe | 0.15 / 0.37 | 0.10 / 0.46 | 0.12 / 0.39 | 0.08 / 0.50 | 0.17 / 0.71 | 🤗 | 0.09 / 0.56 | 0.16 / 0.80 | |
| Todoist | 0.65 / 0.74 | 0.19 / 0.72 | 0.64 / 0.79 | 0.57 / 0.87 | 0.65 / 0.88 | 🤗 | 0.55 / 0.91 | 0.72 / 0.94 | |
| Whatsapp | 0.23 / 0.39 | 0.13 / 0.47 | 0.24 / 0.43 | 0.20 / 0.43 | 0.28 / 0.64 | 🤗 | 0.26 / 0.55 | 0.31 / 0.71 | |
- Funcdex-0.6B and Funcdex-1.7B are specialized models; the reported number for each is the average performance of the specific model on its respective subset.
### Funcdex-MT: Bundle/Multi-toolkit Performance
| Bundle | GPT-OSS-120B (medium) EM / SR | GPT-5 (minimal) EM / SR | GPT-5 Mini (medium) EM / SR | Qwen3-0.6B EM / SR | Funcdex-0.6B EM / SR | Funcdex-0.6B LoRA Checkpoint | Qwen3-1.7B EM / SR | Funcdex-1.7B EM / SR | Funcdex-1.7B LoRA Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gmail Calendar | 0.28 / 0.53 | 0.15 / 0.54 | 0.22 / 0.56 | 0.19 / 0.51 | 0.26 / 0.54 | 🤗 | 0.17 / 0.61 | 0.32 / 0.71 | 🤗 |
| Drive Calendly Calendar | 0.32 / 0.45 | 0.17 / 0.52 | 0.35 / 0.47 | 0.19 / 0.49 | 0.35 / 0.60 | 🤗 | 0.15 / 0.66 | 0.40 / 0.78 | |
| Drive Docs | 0.28 / 0.37 | 0.12 / 0.50 | 0.33 / 0.47 | 0.18 / 0.54 | 0.34 / 0.70 | 🤗 | 0.19 / 0.68 | 0.43 / 0.76 | |
| Jira Gmail | 0.42 / 0.60 | 0.18 / 0.66 | 0.36 / 0.66 | 0.29 / 0.61 | 0.39 / 0.71 | 🤗 | 0.28 / 0.72 | 0.44 / 0.82 | |
| Whatsapp Todoist | 0.32 / 0.58 | 0.19 / 0.66 | 0.35 / 0.69 | 0.26 / 0.50 | 0.41 / 0.70 | 🤗 | 0.27 / 0.68 | 0.39 / 0.77 | |
## Inference
- Given a conversation, we extract all tuples `(context_messages, function_calls)` and use them to generate predictions. We ignore the `content` field and evaluate only the `function_calls` generated by the LLM.
- We use vLLM deployment with `tool_choice="auto"`.
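The tuple extraction above can be sketched as follows, assuming an OpenAI-style message schema with an optional `tool_calls` field (the actual harness and schema may differ):

```python
def extract_eval_tuples(conversation):
    """Collect (context_messages, function_calls) pairs from one conversation.

    Each assistant turn that contains tool calls yields one tuple: the
    messages before that turn form the context, and its tool calls are the
    reference; the turn's `content` field is ignored for evaluation.
    """
    tuples = []
    for i, message in enumerate(conversation):
        if message.get("role") == "assistant" and message.get("tool_calls"):
            tuples.append((conversation[:i], message["tool_calls"]))
    return tuples
```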
## Metrics
Given a list of predicted and reference function calls, we report two metrics:
- **String Ratio (SR)**: We greedily match each reference call to its best-matching predicted call and report the string ratio using `difflib.SequenceMatcher.ratio`. The reported number is the average string ratio.
- **Exact Match (EM)**: Same as above, but using exact string match instead. The reported number is the EM F1 score.
EM is a strict metric: it penalizes string arguments in function calls that may be acceptable, e.g. `"email_content": "This is an example."` vs. `"email_content": "This is an Example."`, which differ by only one letter.
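A minimal sketch of how these two metrics could be computed; the exact pairing and aggregation in our evaluation harness may differ:

```python
from difflib import SequenceMatcher

def greedy_match_ratios(predicted, reference):
    """Greedily pair each reference call string with its best-matching
    remaining predicted call string; return the list of match ratios."""
    remaining = list(predicted)
    ratios = []
    for ref in reference:
        if not remaining:
            ratios.append(0.0)
            continue
        best = max(remaining, key=lambda p: SequenceMatcher(None, p, ref).ratio())
        ratios.append(SequenceMatcher(None, best, ref).ratio())
        remaining.remove(best)
    return ratios

def string_ratio(predicted, reference):
    """SR: average best-matched string ratio."""
    ratios = greedy_match_ratios(predicted, reference)
    return sum(ratios) / len(ratios) if ratios else 0.0

def exact_match_f1(predicted, reference):
    """EM: F1 score over exact string matches."""
    matched = sum(1 for r in greedy_match_ratios(predicted, reference) if r == 1.0)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```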
# Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load model and tokenizer
base_model_name = "ojus1/Qwen3-0.6B-Instruct"
model_name = "prem-research/Funcdex-0.6B-whatsapp_todoist"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype="auto",
device_map="auto"
)
model = PeftModel.from_pretrained(
base_model,
model_name,
torch_dtype="auto",
device_map="auto"
)
# Define tools (WhatsApp + Todoist combined)
tools = [
{
"type": "function",
"function": {
"name": "GET_BACKUPS",
"description": "Get Todoist backups",
"parameters": {
"type": "object",
"properties": {}
}
}
},
{
"type": "function",
"function": {
"name": "GET_SECTION",
"description": "Get a Todoist section",
"parameters": {
"type": "object",
"properties": {
"section_id": {"type": "string", "description": "Section ID"}
},
"required": ["section_id"]
}
}
},
{
"type": "function",
"function": {
"name": "SEND_MESSAGE",
"description": "Send WhatsApp message",
"parameters": {
"type": "object",
"properties": {
"phone_number_id": {"type": "string", "description": "Phone ID"},
"to_number": {"type": "string", "description": "Recipient"},
"text": {"type": "string", "description": "Message text"}
},
"required": ["phone_number_id", "to_number", "text"]
}
}
}
]
# Define conversation
messages = [
{"role": "system", "content": "You are a helpful assistant that can help with tasks by using tools."},
{"role": "user", "content": "Get the Todoist section with ID 'viewing_template'."}
]
# Apply chat template with tools
formatted_input = tokenizer.apply_chat_template(
messages,
tools=tools,
tokenize=False,
add_generation_prompt=True
)
# Tokenize and generate
input_tokens = tokenizer(formatted_input, return_tensors="pt").to(model.device)
output = model.generate(**input_tokens, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(output[0][input_tokens['input_ids'].shape[1]:], skip_special_tokens=True)
print("Response:", response)
```
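Qwen3's chat template wraps generated tool calls in `<tool_call>` tags containing a JSON object (this is why the `hermes` tool-call parser is used when serving with vLLM). A minimal parser sketch for the raw response above, assuming that tag format:

```python
import json
import re

# Matches the JSON object between <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Extract tool calls from a raw generation as a list of dicts."""
    return [json.loads(block) for block in TOOL_CALL_RE.findall(text)]

raw = '<tool_call>\n{"name": "GET_SECTION", "arguments": {"section_id": "viewing_template"}}\n</tool_call>'
for call in parse_tool_calls(raw):
    print(call["name"], call["arguments"])
```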
## Deployment with vLLM
`vllm serve ojus1/Qwen3-0.6B-Instruct --enable-lora --lora-modules prem-research/Funcdex-0.6B=prem-research/Funcdex-0.6B-whatsapp_todoist --enable-auto-tool-choice --tool-call-parser hermes`
For best results, provide a detailed system prompt to steer the tool-use behaviour.
# License
The models, code and the dataset are licensed under MIT License.