---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.3
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
- neuralmagic
- redhat
- speculators
- eagle3
---

# Llama-3.3-70B-Instruct-speculator.eagle3

## Model Overview
- **Verifier:** meta-llama/Llama-3.3-70B-Instruct
- **Speculative Decoding Algorithm:** EAGLE-3
- **Model Architecture:** Eagle3Speculator
- **Release Date:** 09/15/2025
- **Version:** 1.0
- **Model Developers:** Red Hat

This is a speculator model designed for use with [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), based on the [EAGLE-3](https://arxiv.org/abs/2503.01840) speculative decoding algorithm. It was trained using the [speculators](https://github.com/vllm-project/speculators) library on a combination of the [Aeala/ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered) dataset and the `train_sft` split of [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k). This model should be used with the [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) chat template, specifically through the `/chat/completions` endpoint.

## Use with vLLM

```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  -tp 4 \
  --speculative-config '{
    "model": "RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3",
    "num_speculative_tokens": 3,
    "method": "eagle3"
  }'
```
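Once the server is up, speculative decoding is transparent to clients: requests go through the standard OpenAI-compatible API. Below is a minimal sketch of querying the `/chat/completions` endpoint with the `openai` Python client; the port, API key, and prompt are illustrative assumptions, not part of this card.

```python
# Minimal sketch: query the vLLM server through its OpenAI-compatible
# /chat/completions endpoint. Assumes the serve command above is running
# on localhost:8000; vLLM ignores the API key by default, so a placeholder
# value is used.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
```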

## Evaluations

### Use cases

| Use Case            | Dataset        | Number of Samples |
|---------------------|----------------|-------------------|
| Coding              | HumanEval      | 168               |
| Math Reasoning      | gsm8k          | 80                |
| Text Summarization  | CNN/Daily Mail | 80                |
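For reference, the sketch below loads these benchmarks with the Hugging Face `datasets` library. The hub IDs and configs (`openai/openai_humaneval`, `openai/gsm8k`, `abisee/cnn_dailymail`) are the standard public identifiers and are an assumption; this card does not pin exact dataset versions or how the sample counts above were drawn.

```python
# Minimal sketch: load the three evaluation datasets from the Hugging Face
# Hub. The hub IDs and configs below are assumptions (standard public
# identifiers), not pinned by this model card.
from datasets import load_dataset

humaneval = load_dataset("openai/openai_humaneval", split="test")     # Coding
gsm8k = load_dataset("openai/gsm8k", "main", split="test")            # Math Reasoning
cnn_dm = load_dataset("abisee/cnn_dailymail", "3.0.0", split="test")  # Text Summarization

print(len(humaneval), len(gsm8k), len(cnn_dm))
```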

### Acceptance lengths

Acceptance lengths are reported per number of speculative tokens `k` (the `num_speculative_tokens` setting in the vLLM config above).

| Use Case            | k=1  | k=2  | k=3  | k=4  | k=5  | k=6  | k=7  |
|---------------------|------|------|------|------|------|------|------|
| Coding              |      |      |      |      |      |      |      |
| Math Reasoning      | 1.80 | 2.44 | 2.89 | 3.15 | 3.33 | 3.44 | 3.52 |
| Text Summarization  | 1.72 | 2.21 | 2.53 | 2.74 | 2.86 | 2.93 | 2.98 |
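Acceptance length is the average number of tokens committed per verifier forward pass (accepted draft tokens plus the bonus token), so it bounds the achievable decode speedup. The sketch below turns a table entry into a rough speedup estimate; the per-token drafting cost `draft_cost` is a hypothetical placeholder, not a measured value from this card.

```python
# Rough sketch: decode-speedup estimate from acceptance length.
# Per verifier step, speculative decoding commits L tokens (the acceptance
# length) at a relative cost of one verifier forward pass plus k drafting
# passes, each costing `draft_cost` of a verifier pass. Baseline decoding
# commits 1 token per verifier pass, so the estimated speedup is:
#     L / (1 + k * draft_cost)
def estimated_speedup(acceptance_length: float, k: int, draft_cost: float) -> float:
    return acceptance_length / (1.0 + k * draft_cost)

# Example: math reasoning at k=3 has acceptance length 2.89 (table above);
# draft_cost=0.02 is an assumed, not measured, drafting overhead.
print(f"~{estimated_speedup(2.89, k=3, draft_cost=0.02):.2f}x")
```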

### Performance benchmarking (4xA100)

*Figure: performance benchmarking plots for the Coding workload (the plots did not survive extraction; only the panel title "Coding" remains).*
<details>
<summary>Details</summary>

**Configuration**
- temperature: 0
- repetitions: 5
- time per experiment: 4min
- hardware: 4xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

**Command**

```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/SpeculativeDecoding" \
  --rate-type sweep \
  --max-seconds 240 \
  --output-path "Llama-3.3-70B-Instruct-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature": 0.0}}}'
```

</details>
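The sweep writes its results to the file given by `--output-path`. The report schema is not documented in this card, so the sketch below only loads the JSON and lists its top-level structure rather than assuming field names.

```python
# Minimal sketch: inspect a GuideLLM benchmark report. The report schema is
# not specified here, so we only list the top-level keys and value types
# instead of assuming particular field names.
import json

with open("Llama-3.3-70B-Instruct-HumanEval.json") as f:
    report = json.load(f)

for key, value in report.items():
    print(f"{key}: {type(value).__name__}")
```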