---
library_name: transformers
license: mit
task_categories:
- text-generation
language:
- en
tags:
- agent
- Agentic Learning
- tool use
- BFCL
---

[![Funcdex-Collection](https://img.shields.io/badge/Hugging%20Face-Model-yellow?logo=huggingface)](https://huggingface.co/collections/prem-research/funcdex) [![Dataset](https://img.shields.io/badge/Hugging%20Face-Dataset-yellow?logo=huggingface)](https://huggingface.co/datasets/prem-research/Funcdex-MT-Function-Calling) [![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github)](https://github.com/prem-research/Funcdex-Synthesizer) [![PremAI](https://img.shields.io/badge/Project-PremAI-green)](https://www.premai.io/)

# Funcdex-0.6B-whatsapp_todoist
Funcdex-0.6B is a research preview model by Prem Labs. It has been trained on a mix of [Funcdex-MT-Function-Calling](https://huggingface.co/datasets/prem-research/Funcdex-MT-Function-Calling), instruction-following, and single-turn function-calling datasets. It is a LoRA finetune of Qwen3-0.6B (with thinking disabled). This model excels at multi-turn function calling with tools from `whatsapp` and `todoist`. The code used to generate the dataset can be found [here](https://github.com/prem-research/Funcdex-Synthesizer).

# Evaluation
## Results

### BFCL v3

- We filtered the BFCL v3 examples relevant to our toolkits/bundles and report performance on that subset.
- The filtered set contains only 83 examples, further emphasizing the need for workflow/toolkit-specialized models.
| LLM | Acc % |
|---|---|
| GPT-5 Mini (medium) | 0.71 |
| Qwen3-1.7B | 0.82 |
| Funcdex-1.7B | 0.86 |
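The toolkit-relevance filtering described above can be sketched as a tool-name overlap check. This is an illustrative sketch, not the project's actual filtering code; the `TOOLKIT_TOOLS` mapping and the example schema are hypothetical:

```python
# Hypothetical sketch: keep only evaluation examples whose declared tools
# overlap with the tool set of at least one target toolkit.
TOOLKIT_TOOLS = {
    "todoist": {"GET_SECTION", "GET_BACKUPS", "CREATE_TASK"},
    "whatsapp": {"SEND_MESSAGE"},
}

def filter_examples(examples, toolkits):
    """Return examples that use at least one tool from the given toolkits."""
    relevant = set().union(*(TOOLKIT_TOOLS[t] for t in toolkits))
    kept = []
    for ex in examples:
        tool_names = {t["function"]["name"] for t in ex["tools"]}
        if tool_names & relevant:
            kept.append(ex)
    return kept

examples = [
    {"id": 1, "tools": [{"function": {"name": "SEND_MESSAGE"}}]},
    {"id": 2, "tools": [{"function": {"name": "GET_WEATHER"}}]},
]
print([ex["id"] for ex in filter_examples(examples, ["whatsapp", "todoist"])])  # → [1]
```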
### Funcdex-MT: Overall Performance
| LLM | Exact Match | String Ratio | Total Cost ($) |
|---|---|---|---|
| GPT-OSS-120B (medium) | 0.35 | 0.51 | 9.32 |
| GPT-5 Mini (medium) | 0.35 | 0.58 | 99.71 |
| GPT-5 (minimal) | 0.18 | 0.59 | 205.45 |
| Qwen3-0.6B | 0.27 | 0.59 | 2.83 |
| Qwen3-1.7B | 0.27 | 0.69 | 5.73 |
| Funcdex-0.6B | 0.39 | 0.70 | 0.19 |
| Funcdex-1.7B | 0.43 | 0.81 | 5.64 |
### Funcdex-MT: Toolkit-Level Performance
| Toolkit | GPT-OSS-120B (medium) EM / SR | GPT-5 (minimal) EM / SR | GPT-5 Mini (medium) EM / SR | Qwen3-0.6B EM / SR | Funcdex-0.6B EM / SR | LoRA Checkpoint | Qwen3-1.7B EM / SR | Funcdex-1.7B EM / SR | LoRA Checkpoint |
|---|---|---|---|---|---|---|---|---|---|
| Asana | 0.38 / 0.47 | 0.12 / 0.68 | 0.49 / 0.71 | 0.33 / 0.63 | 0.46 / 0.69 | 🤗 | 0.30 / 0.79 | 0.52 / 0.82 | 🤗 |
| Calendly | 0.47 / 0.56 | 0.41 / 0.63 | 0.41 / 0.56 | 0.44 / 0.66 | 0.54 / 0.78 | 🤗 | 0.47 / 0.74 | 0.54 / 0.86 | |
| Gmail | 0.48 / 0.70 | 0.24 / 0.69 | 0.50 / 0.73 | 0.27 / 0.61 | 0.47 / 0.72 | 🤗 | 0.31 / 0.73 | 0.53 / 0.83 | |
| Calendar | 0.27 / 0.52 | 0.20 / 0.50 | 0.21 / 0.51 | 0.21 / 0.53 | 0.39 / 0.74 | 🤗 | 0.23 / 0.64 | 0.47 / 0.83 | |
| Docs | 0.19 / 0.38 | 0.07 / 0.49 | 0.18 / 0.46 | 0.07 / 0.58 | 0.13 / 0.64 | 🤗 | 0.11 / 0.62 | 0.18 / 0.79 | |
| Drive | 0.34 / 0.52 | 0.19 / 0.61 | 0.38 / 0.58 | 0.26 / 0.65 | 0.40 / 0.75 | 🤗 | 0.26 / 0.73 | 0.48 / 0.82 | |
| Jira | 0.47 / 0.53 | 0.17 / 0.65 | 0.47 / 0.66 | 0.51 / 0.69 | 0.58 / 0.76 | 🤗 | 0.47 / 0.76 | 0.59 / 0.83 | |
| Stripe | 0.15 / 0.37 | 0.10 / 0.46 | 0.12 / 0.39 | 0.08 / 0.50 | 0.17 / 0.71 | 🤗 | 0.09 / 0.56 | 0.16 / 0.80 | |
| Todoist | 0.65 / 0.74 | 0.19 / 0.72 | 0.64 / 0.79 | 0.57 / 0.87 | 0.65 / 0.88 | 🤗 | 0.55 / 0.91 | 0.72 / 0.94 | |
| Whatsapp | 0.23 / 0.39 | 0.13 / 0.47 | 0.24 / 0.43 | 0.20 / 0.43 | 0.28 / 0.64 | 🤗 | 0.26 / 0.55 | 0.31 / 0.71 | |
- The Funcdex-0.6B checkpoints are toolkit-specialized models; the reported number is the performance of each specialized model on its respective subset.

### Funcdex-MT: Bundle/Multi-toolkit Performance
| Bundle | GPT-OSS-120B (medium) EM / SR | GPT-5 (minimal) EM / SR | GPT-5 Mini (medium) EM / SR | Qwen3-0.6B EM / SR | Funcdex-0.6B EM / SR | LoRA Checkpoint | Qwen3-1.7B EM / SR | Funcdex-1.7B EM / SR | LoRA Checkpoint |
|---|---|---|---|---|---|---|---|---|---|
| Gmail Calendar | 0.28 / 0.53 | 0.15 / 0.54 | 0.22 / 0.56 | 0.19 / 0.51 | 0.26 / 0.54 | 🤗 | 0.17 / 0.61 | 0.32 / 0.71 | 🤗 |
| Drive Calendly Calendar | 0.32 / 0.45 | 0.17 / 0.52 | 0.35 / 0.47 | 0.19 / 0.49 | 0.35 / 0.60 | 🤗 | 0.15 / 0.66 | 0.40 / 0.78 | |
| Drive Docs | 0.28 / 0.37 | 0.12 / 0.50 | 0.33 / 0.47 | 0.18 / 0.54 | 0.34 / 0.70 | 🤗 | 0.19 / 0.68 | 0.43 / 0.76 | |
| Jira Gmail | 0.42 / 0.60 | 0.18 / 0.66 | 0.36 / 0.66 | 0.29 / 0.61 | 0.39 / 0.71 | 🤗 | 0.28 / 0.72 | 0.44 / 0.82 | |
| Whatsapp Todoist | 0.32 / 0.58 | 0.19 / 0.66 | 0.35 / 0.69 | 0.26 / 0.50 | 0.41 / 0.70 | 🤗 | 0.27 / 0.68 | 0.39 / 0.77 | |
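The EM and SR columns in these tables come from greedily matching each predicted function-call string against its best-scoring remaining reference. Below is a minimal sketch of such a scorer, assuming calls are compared as plain strings with `difflib.SequenceMatcher.ratio`; this is an illustration, not the exact evaluation code (which reports EM as an F1 score):

```python
from difflib import SequenceMatcher

def greedy_match_scores(predicted, reference):
    """Greedily pair each predicted call string with its best remaining reference.

    Returns (average string ratio, exact-match count) over the matched pairs.
    Sketch only: the actual evaluation additionally reports EM as an F1 score.
    """
    remaining = list(reference)
    ratios, exact = [], 0
    for pred in predicted:
        if not remaining:
            break
        best = max(remaining, key=lambda ref: SequenceMatcher(None, pred, ref).ratio())
        ratios.append(SequenceMatcher(None, pred, best).ratio())
        exact += int(pred == best)
        remaining.remove(best)
    avg_ratio = sum(ratios) / len(ratios) if ratios else 0.0
    return avg_ratio, exact

preds = ['GET_SECTION(section_id="viewing_template")']
refs = ['GET_SECTION(section_id="viewing_template")']
print(greedy_match_scores(preds, refs))  # → (1.0, 1)
```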
## Inference

- Given a conversation, we extract all tuples `(context_messages, function_calls)` and use them to generate predictions. We ignore the `content` field and evaluate only the `function_calls` generated by the LLM.
- We use a vLLM deployment with `tool_choice="auto"`.

## Metrics

Given a list of predicted and reference function calls, we report two metrics:

- **Function Call String Match (SR)**: We perform a greedy match and report the best-matched string ratio using `difflib.SequenceMatcher.ratio`. The reported number is the average string ratio.
- **Exact Match (EM)**: Same as above, but with exact string matching instead. The reported number is the EM F1 score. EM is a strict metric and penalizes string arguments that may be acceptable, e.g. `"email_content": "This is an example."` vs. `"email_content": "This is an Example."`, which differ by only one letter.

# Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model, LoRA adapter, and tokenizer
base_model_name = "ojus1/Qwen3-0.6B-Instruct"
model_name = "prem-research/Funcdex-0.6B-whatsapp_todoist"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name, torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(
    base_model, model_name, torch_dtype="auto", device_map="auto"
)

# Define tools (WhatsApp + Todoist combined)
tools = [
    {
        "type": "function",
        "function": {
            "name": "GET_BACKUPS",
            "description": "Get Todoist backups",
            "parameters": {"type": "object", "properties": {}},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "GET_SECTION",
            "description": "Get a Todoist section",
            "parameters": {
                "type": "object",
                "properties": {
                    "section_id": {"type": "string", "description": "Section ID"}
                },
                "required": ["section_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "SEND_MESSAGE",
            "description": "Send WhatsApp message",
            "parameters": {
                "type": "object",
                "properties": {
                    "phone_number_id": {"type": "string", "description": "Phone ID"},
                    "to_number": {"type": "string", "description": "Recipient"},
                    "text": {"type": "string", "description": "Message text"},
                },
                "required": ["phone_number_id", "to_number", "text"],
            },
        },
    },
]

# Define conversation
messages = [
    {"role": "system", "content": "You are a helpful assistant that can help with tasks by using tools."},
    {"role": "user", "content": "Get the Todoist section with ID 'viewing_template'."},
]

# Apply chat template with tools
formatted_input = tokenizer.apply_chat_template(
    messages, tools=tools, tokenize=False, add_generation_prompt=True
)

# Tokenize and generate
input_tokens = tokenizer(formatted_input, return_tensors="pt").to(model.device)
output = model.generate(**input_tokens, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(
    output[0][input_tokens["input_ids"].shape[1]:], skip_special_tokens=True
)
print("Response:", response)
```

## Deployment with vLLM

`vllm serve ojus1/Qwen3-0.6B-Instruct --enable-lora --lora-modules prem-research/Funcdex-0.6B=prem-research/Funcdex-0.6B-whatsapp_todoist --enable-auto-tool-choice --tool-call-parser hermes`

For best results, provide a detailed system prompt to steer the model's tool-use behaviour.

# License

The models, code, and dataset are licensed under the MIT License.
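As a usage note for the vLLM deployment above: the server exposes an OpenAI-compatible `/v1/chat/completions` endpoint, and the LoRA adapter is addressed by the alias given to `--lora-modules`. The sketch below only constructs the request payload (the tool schema is illustrative; actually sending it requires the server to be running, by default at `http://localhost:8000`):

```python
import json

# Sketch: build an OpenAI-style chat-completions request for the vLLM server above.
# The "model" field must match the alias registered with --lora-modules.
payload = {
    "model": "prem-research/Funcdex-0.6B",  # LoRA alias registered at serve time
    "messages": [
        {"role": "system", "content": "You are a helpful assistant that can help with tasks by using tools."},
        {"role": "user", "content": "Get the Todoist section with ID 'viewing_template'."},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "GET_SECTION",
                "description": "Get a Todoist section",
                "parameters": {
                    "type": "object",
                    "properties": {"section_id": {"type": "string"}},
                    "required": ["section_id"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}
# POST this JSON body to http://localhost:8000/v1/chat/completions
print(json.dumps(payload, indent=2))
```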