---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.3
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
- neuralmagic
- redhat
- speculators
- eagle3
---
# Llama-3.3-70B-Instruct-speculator.eagle3
## Model Overview
- **Verifier:** meta-llama/Llama-3.3-70B-Instruct
- **Speculative Decoding Algorithm:** EAGLE-3
- **Model Architecture:** Eagle3Speculator
- **Release Date:** 09/15/2025
- **Version:** 1.0
- **Model Developers:** Red Hat
This is a speculator model designed for use with [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), based on the [EAGLE-3](https://arxiv.org/abs/2503.01840) speculative decoding algorithm.
It was trained using the [speculators](https://github.com/vllm-project/speculators) library on a combination of the [Aeala/ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered) dataset and the `train_sft` split of [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k).
This model should be used with the [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) chat template. In vLLM, this means sending requests through the OpenAI-compatible `/chat/completions` endpoint, which applies the chat template automatically.
## Use with vLLM
```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    -tp 4 \
    --speculative-config '{
        "model": "RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3",
        "num_speculative_tokens": 3,
        "method": "eagle3"
    }'
```
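Here, `num_speculative_tokens` is the number of draft tokens proposed per step and corresponds to `k` in the acceptance-length table below. Once the server is up, requests go through the chat completions API; a minimal sketch, assuming the default address of `localhost:8000`:

```bash
# Query the server through /v1/chat/completions so the Llama-3.3
# chat template is applied; the model name is the verifier's name.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
            {"role": "user", "content": "Write a quicksort function in Python."}
        ],
        "temperature": 0.0
    }'
```

Speculative decoding is transparent to the client: responses have the same format as standard decoding, only latency changes.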
## Evaluations
<h3>Use cases</h3>
<table>
<thead>
<tr>
<th>Use Case</th>
<th>Dataset</th>
<th>Number of Samples</th>
</tr>
</thead>
<tbody>
<tr>
<td>Coding</td>
<td>HumanEval</td>
<td>168</td>
</tr>
<tr>
<td>Math Reasoning</td>
<td>gsm8k</td>
<td>80</td>
</tr>
<tr>
<td>Text Summarization</td>
<td>CNN/Daily Mail</td>
<td>80</td>
</tr>
</tbody>
</table>
<h3>Acceptance lengths</h3>

Acceptance length is the average number of tokens produced per verifier forward pass (accepted draft tokens plus the verifier's own token), measured with `num_speculative_tokens` set to `k`; higher is better.
<table>
<thead>
<tr>
<th>Use Case</th>
<th>k=1</th>
<th>k=2</th>
<th>k=3</th>
<th>k=4</th>
<th>k=5</th>
<th>k=6</th>
<th>k=7</th>
</tr>
</thead>
<tbody>
<tr>
<td>Coding</td>
<td>1.84</td>
<td>2.53</td>
<td>3.07</td>
<td>3.42</td>
<td>3.71</td>
<td>3.89</td>
<td>4.00</td>
</tr>
<tr>
<td>Math Reasoning</td>
<td>1.81</td>
<td>2.43</td>
<td>2.88</td>
<td>3.17</td>
<td>3.30</td>
<td>3.42</td>
<td>3.53</td>
</tr>
<tr>
<td>Text Summarization</td>
<td>1.71</td>
<td>2.21</td>
<td>2.52</td>
<td>2.74</td>
<td>2.83</td>
<td>2.87</td>
<td>2.89</td>
</tr>
</tbody>
</table>
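As a rough way to read these numbers (a standard speculative-decoding estimate, not a measurement from this card), the acceptance length bounds the achievable speedup:

```latex
% Rough speedup model (standard speculative-decoding analysis; the
% cost ratio c is an assumption, not reported in this card).
% \tau(k): acceptance length with k speculative tokens per step.
% c: cost of one draft forward pass relative to one verifier pass.
\[
  \text{speedup} \approx \frac{\tau(k)}{1 + c\,k}
\]
% Since the EAGLE-3 draft head is tiny relative to a 70B verifier,
% c << 1, and the speedup approaches \tau(k): e.g. for coding at
% k=3, \tau(3) = 3.07 caps the speedup at roughly 3x over vanilla
% decoding, before accounting for drafting overhead.
```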
<h3>Performance benchmarking (4xA100)</h3>
<div style="display: flex; justify-content: center; gap: 20px;">
<figure style="text-align: center;">
<img src="assets/Llama-3.3-70B-Instruct-HumanEval.png" alt="Coding" width="100%">
</figure>
<figure style="text-align: center;">
<img src="assets/Llama-3.3-70B-Instruct-math_reasoning.png" alt="Coding" width="100%">
</figure>
<figure style="text-align: center;">
<img src="assets/Llama-3.3-70B-Instruct-summarization.png" alt="Coding" width="100%">
</figure>
</div>
<details>
<summary>Details</summary>

<strong>Configuration</strong>

- temperature: 0
- repetitions: 5
- time per experiment: 4 min
- hardware: 4xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

<strong>Command</strong>
```bash
# Example for the HumanEval (coding) use case; the other use cases
# swap in the corresponding data file and output path.
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 240 \
  --output-path "Llama-3.3-70B-Instruct-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature":0.0}}}'
```

</details>