---
library_name: transformers
license: mit
task_categories:
- text-generation
language:
- en
tags:
- agent
- Agentic Learning
- tool use
- BFCL
---
[Model Collection](https://huggingface.co/collections/prem-research/funcdex) · [Dataset](https://huggingface.co/datasets/prem-research/Funcdex-MT-Function-Calling) · [Synthesizer Code](https://github.com/prem-research/Funcdex-Synthesizer) · [Prem AI](https://www.premai.io/)
# Funcdex-0.6B-whatsapp_todoist
Funcdex-0.6B is a research preview model by Prem Labs. It was trained on a mix of [Funcdex-MT-Function-Calling](https://huggingface.co/datasets/prem-research/Funcdex-MT-Function-Calling), instruction-following, and single-turn function-calling datasets. It is a LoRA finetune of Qwen3-0.6B (with thinking disabled).
This model excels at Multi-turn Function Calling with tools from `whatsapp` and `todoist`.
The code used to generate the dataset can be found [here](https://github.com/prem-research/Funcdex-Synthesizer).
# Evaluation
## Results
### BFCL v3
- We filtered BFCL v3 down to the examples relevant to these toolkits/bundles and report performance on that subset.
- The filtered set contains only 83 examples, further underscoring the need for workflow/toolkit-specialized models.
| LLM | Acc % |
| --- | --- |
| GPT-5 Mini (medium) | 0.71 |
| Qwen3-1.7B | 0.82 |
| Funcdex-1.7B | 0.86 |
### Funcdex-MT: Overall Performance
| LLM | Exact Match | String Ratio | Total Cost ($) |
| --- | --- | --- | --- |
| GPT-OSS-120B (medium) | 0.35 | 0.51 | 9.32 |
| GPT-5 Mini (medium) | 0.35 | 0.58 | 99.71 |
| GPT-5 (minimal) | 0.18 | 0.59 | 205.45 |
| Qwen3-0.6B | 0.27 | 0.59 | 2.83 |
| Qwen3-1.7B | 0.27 | 0.69 | 5.73 |
| Funcdex-0.6B | 0.39 | 0.70 | 0.19 |
| Funcdex-1.7B | 0.43 | 0.81 | 5.64 |
### Funcdex-MT: Toolkit-Level Performance
| Toolkit | GPT-OSS-120B (medium) EM / SR | GPT-5 (minimal) EM / SR | GPT-5 Mini (medium) EM / SR | Qwen3-0.6B EM / SR | Funcdex-0.6B EM / SR | Funcdex-0.6B LoRA Checkpoint | Qwen3-1.7B EM / SR | Funcdex-1.7B EM / SR | Funcdex-1.7B LoRA Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Asana | 0.38 / 0.47 | 0.12 / 0.68 | 0.49 / 0.71 | 0.33 / 0.63 | 0.46 / 0.69 | 🤗 | 0.30 / 0.79 | 0.52 / 0.82 | 🤗 |
| Calendly | 0.47 / 0.56 | 0.41 / 0.63 | 0.41 / 0.56 | 0.44 / 0.66 | 0.54 / 0.78 | 🤗 | 0.47 / 0.74 | 0.54 / 0.86 | |
| Gmail | 0.48 / 0.70 | 0.24 / 0.69 | 0.50 / 0.73 | 0.27 / 0.61 | 0.47 / 0.72 | 🤗 | 0.31 / 0.73 | 0.53 / 0.83 | |
| Calendar | 0.27 / 0.52 | 0.20 / 0.50 | 0.21 / 0.51 | 0.21 / 0.53 | 0.39 / 0.74 | 🤗 | 0.23 / 0.64 | 0.47 / 0.83 | |
| Docs | 0.19 / 0.38 | 0.07 / 0.49 | 0.18 / 0.46 | 0.07 / 0.58 | 0.13 / 0.64 | 🤗 | 0.11 / 0.62 | 0.18 / 0.79 | |
| Drive | 0.34 / 0.52 | 0.19 / 0.61 | 0.38 / 0.58 | 0.26 / 0.65 | 0.40 / 0.75 | 🤗 | 0.26 / 0.73 | 0.48 / 0.82 | |
| Jira | 0.47 / 0.53 | 0.17 / 0.65 | 0.47 / 0.66 | 0.51 / 0.69 | 0.58 / 0.76 | 🤗 | 0.47 / 0.76 | 0.59 / 0.83 | |
| Stripe | 0.15 / 0.37 | 0.10 / 0.46 | 0.12 / 0.39 | 0.08 / 0.50 | 0.17 / 0.71 | 🤗 | 0.09 / 0.56 | 0.16 / 0.80 | |
| Todoist | 0.65 / 0.74 | 0.19 / 0.72 | 0.64 / 0.79 | 0.57 / 0.87 | 0.65 / 0.88 | 🤗 | 0.55 / 0.91 | 0.72 / 0.94 | |
| Whatsapp | 0.23 / 0.39 | 0.13 / 0.47 | 0.24 / 0.43 | 0.20 / 0.43 | 0.28 / 0.64 | 🤗 | 0.26 / 0.55 | 0.31 / 0.71 | |
- Funcdex-0.6B and Funcdex-1.7B are specialized models; the reported number for each is the average performance of the specific model on its respective subset.
### Funcdex-MT: Bundle/Multi-toolkit Performance
| Bundle | GPT-OSS-120B (medium) EM / SR | GPT-5 (minimal) EM / SR | GPT-5 Mini (medium) EM / SR | Qwen3-0.6B EM / SR | Funcdex-0.6B EM / SR | Funcdex-0.6B LoRA Checkpoint | Qwen3-1.7B EM / SR | Funcdex-1.7B EM / SR | Funcdex-1.7B LoRA Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gmail Calendar | 0.28 / 0.53 | 0.15 / 0.54 | 0.22 / 0.56 | 0.19 / 0.51 | 0.26 / 0.54 | 🤗 | 0.17 / 0.61 | 0.32 / 0.71 | 🤗 |
| Drive Calendly Calendar | 0.32 / 0.45 | 0.17 / 0.52 | 0.35 / 0.47 | 0.19 / 0.49 | 0.35 / 0.60 | 🤗 | 0.15 / 0.66 | 0.40 / 0.78 | |
| Drive Docs | 0.28 / 0.37 | 0.12 / 0.50 | 0.33 / 0.47 | 0.18 / 0.54 | 0.34 / 0.70 | 🤗 | 0.19 / 0.68 | 0.43 / 0.76 | |
| Jira Gmail | 0.42 / 0.60 | 0.18 / 0.66 | 0.36 / 0.66 | 0.29 / 0.61 | 0.39 / 0.71 | 🤗 | 0.28 / 0.72 | 0.44 / 0.82 | |
| Whatsapp Todoist | 0.32 / 0.58 | 0.19 / 0.66 | 0.35 / 0.69 | 0.26 / 0.50 | 0.41 / 0.70 | 🤗 | 0.27 / 0.68 | 0.39 / 0.77 | |
## Inference
- Given a conversation, we extract all tuples `(context_messages, function_calls)` and use them to generate predictions. We ignore the `content` field and evaluate only the `function_calls` generated by the LLM.
- We use vLLM deployment with `tool_choice="auto"`.
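The tuple extraction above can be sketched as follows, assuming an OpenAI-style message schema with an optional `tool_calls` field (the actual harness and schema may differ):

```python
def extract_eval_tuples(conversation):
    """Collect (context_messages, function_calls) pairs from one conversation.

    Each assistant turn that contains tool calls yields one tuple: the
    messages before that turn form the context, and its tool calls are the
    reference; the turn's `content` field is ignored for evaluation.
    """
    tuples = []
    for i, message in enumerate(conversation):
        if message.get("role") == "assistant" and message.get("tool_calls"):
            tuples.append((conversation[:i], message["tool_calls"]))
    return tuples
```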
## Metrics
Given a list of predicted and reference function calls, we report two metrics:
- **String Ratio (SR)**: We greedily match each reference call to its best-matching predicted call and report the string ratio using `difflib.SequenceMatcher.ratio`. The reported number is the average string ratio.
- **Exact Match (EM)**: Same as above, but using exact string match instead. The reported number is the EM F1 score.
EM is a strict metric: it penalizes string arguments in function calls that may be acceptable, e.g. `"email_content": "This is an example."` vs. `"email_content": "This is an Example."`, which differ by only one letter.
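A minimal sketch of how these two metrics could be computed; the exact pairing and aggregation in our evaluation harness may differ:

```python
from difflib import SequenceMatcher

def greedy_match_ratios(predicted, reference):
    """Greedily pair each reference call string with its best-matching
    remaining predicted call string; return the list of match ratios."""
    remaining = list(predicted)
    ratios = []
    for ref in reference:
        if not remaining:
            ratios.append(0.0)
            continue
        best = max(remaining, key=lambda p: SequenceMatcher(None, p, ref).ratio())
        ratios.append(SequenceMatcher(None, best, ref).ratio())
        remaining.remove(best)
    return ratios

def string_ratio(predicted, reference):
    """SR: average best-matched string ratio."""
    ratios = greedy_match_ratios(predicted, reference)
    return sum(ratios) / len(ratios) if ratios else 0.0

def exact_match_f1(predicted, reference):
    """EM: F1 score over exact string matches."""
    matched = sum(1 for r in greedy_match_ratios(predicted, reference) if r == 1.0)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```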
# Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load model and tokenizer
base_model_name = "ojus1/Qwen3-0.6B-Instruct"
model_name = "prem-research/Funcdex-0.6B-whatsapp_todoist"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype="auto",
device_map="auto"
)
model = PeftModel.from_pretrained(
base_model,
model_name,
torch_dtype="auto",
device_map="auto"
)
# Define tools (WhatsApp + Todoist combined)
tools = [
{
"type": "function",
"function": {
"name": "GET_BACKUPS",
"description": "Get Todoist backups",
"parameters": {
"type": "object",
"properties": {}
}
}
},
{
"type": "function",
"function": {
"name": "GET_SECTION",
"description": "Get a Todoist section",
"parameters": {
"type": "object",
"properties": {
"section_id": {"type": "string", "description": "Section ID"}
},
"required": ["section_id"]
}
}
},
{
"type": "function",
"function": {
"name": "SEND_MESSAGE",
"description": "Send WhatsApp message",
"parameters": {
"type": "object",
"properties": {
"phone_number_id": {"type": "string", "description": "Phone ID"},
"to_number": {"type": "string", "description": "Recipient"},
"text": {"type": "string", "description": "Message text"}
},
"required": ["phone_number_id", "to_number", "text"]
}
}
}
]
# Define conversation
messages = [
{"role": "system", "content": "You are a helpful assistant that can help with tasks by using tools."},
{"role": "user", "content": "Get the Todoist section with ID 'viewing_template'."}
]
# Apply chat template with tools
formatted_input = tokenizer.apply_chat_template(
messages,
tools=tools,
tokenize=False,
add_generation_prompt=True
)
# Tokenize and generate
input_tokens = tokenizer(formatted_input, return_tensors="pt").to(model.device)
output = model.generate(**input_tokens, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(output[0][input_tokens['input_ids'].shape[1]:], skip_special_tokens=True)
print("Response:", response)
```
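Qwen3's chat template wraps generated tool calls in `<tool_call>` tags containing a JSON object (this is why the `hermes` tool-call parser is used when serving with vLLM). A minimal parser sketch for the raw response above, assuming that tag format:

```python
import json
import re

# Matches the JSON object between <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Extract tool calls from a raw generation as a list of dicts."""
    return [json.loads(block) for block in TOOL_CALL_RE.findall(text)]

raw = '<tool_call>\n{"name": "GET_SECTION", "arguments": {"section_id": "viewing_template"}}\n</tool_call>'
for call in parse_tool_calls(raw):
    print(call["name"], call["arguments"])
```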
## Deployment with vLLM
`vllm serve ojus1/Qwen3-0.6B-Instruct --enable-lora --lora-modules prem-research/Funcdex-0.6B=prem-research/Funcdex-0.6B-whatsapp_todoist --enable-auto-tool-choice --tool-call-parser hermes`
For best results, provide a detailed system prompt to steer the tool-use behaviour.
# License
The models, code and the dataset are licensed under MIT License.