---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.3
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
- neuralmagic
- redhat
- speculators
- eagle3
---
# Llama-3.3-70B-Instruct-speculator.eagle3
## Model Overview
- **Verifier:** meta-llama/Llama-3.3-70B-Instruct
- **Speculative Decoding Algorithm:** EAGLE-3
- **Model Architecture:** Eagle3Speculator
- **Release Date:** 09/15/2025
- **Version:** 1.0
- **Model Developers:** Red Hat
This is a speculator model designed for use with [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), based on the [EAGLE-3](https://arxiv.org/abs/2503.01840) speculative decoding algorithm.
It was trained using the [speculators](https://github.com/vllm-project/speculators) library on a combination of the [Aeala/ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered) dataset and the `train_sft` split of [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k).
This model should be used with the [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) chat template. In vLLM, this means sending requests through the OpenAI-compatible `/chat/completions` endpoint, which applies the chat template automatically.
## Use with vLLM
```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    -tp 4 \
    --speculative-config '{
        "model": "RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3",
        "num_speculative_tokens": 3,
        "method": "eagle3"
    }'
```
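Here, `num_speculative_tokens` is the number of draft tokens proposed per step and corresponds to `k` in the acceptance-length table below. Once the server is up, requests go through the chat completions API; a minimal sketch, assuming the default address of `localhost:8000`:

```bash
# Query the server through /v1/chat/completions so the Llama-3.3
# chat template is applied; the model name is the verifier's name.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
            {"role": "user", "content": "Write a quicksort function in Python."}
        ],
        "temperature": 0.0
    }'
```

Speculative decoding is transparent to the client: responses have the same format as standard decoding, only latency changes.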
## Evaluations
<h3>Use cases</h3>
<table>
<thead>
<tr>
<th>Use Case</th>
<th>Dataset</th>
<th>Number of Samples</th>
</tr>
</thead>
<tbody>
<tr>
<td>Coding</td>
<td>HumanEval</td>
<td>168</td>
</tr>
<tr>
<td>Math Reasoning</td>
<td>gsm8k</td>
<td>80</td>
</tr>
<tr>
<td>Text Summarization</td>
<td>CNN/Daily Mail</td>
<td>80</td>
</tr>
</tbody>
</table>
<h3>Acceptance lengths</h3>

Acceptance length is the average number of tokens produced per verifier forward pass (accepted draft tokens plus the verifier's own token), measured with `num_speculative_tokens` set to `k`; higher is better.
<table>
<thead>
<tr>
<th>Use Case</th>
<th>k=1</th>
<th>k=2</th>
<th>k=3</th>
<th>k=4</th>
<th>k=5</th>
<th>k=6</th>
<th>k=7</th>
</tr>
</thead>
<tbody>
<tr>
<td>Coding</td>
<td>1.84</td>
<td>2.53</td>
<td>3.07</td>
<td>3.42</td>
<td>3.71</td>
<td>3.89</td>
<td>4.00</td>
</tr>
<tr>
<td>Math Reasoning</td>
<td>1.81</td>
<td>2.43</td>
<td>2.88</td>
<td>3.17</td>
<td>3.30</td>
<td>3.42</td>
<td>3.53</td>
</tr>
<tr>
<td>Text Summarization</td>
<td>1.71</td>
<td>2.21</td>
<td>2.52</td>
<td>2.74</td>
<td>2.83</td>
<td>2.87</td>
<td>2.89</td>
</tr>
</tbody>
</table>
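As a rough way to read these numbers (a standard speculative-decoding estimate, not a measurement from this card), the acceptance length bounds the achievable speedup:

```latex
% Rough speedup model (standard speculative-decoding analysis; the
% cost ratio c is an assumption, not reported in this card).
% \tau(k): acceptance length with k speculative tokens per step.
% c: cost of one draft forward pass relative to one verifier pass.
\[
  \text{speedup} \approx \frac{\tau(k)}{1 + c\,k}
\]
% Since the EAGLE-3 draft head is tiny relative to a 70B verifier,
% c << 1, and the speedup approaches \tau(k): e.g. for coding at
% k=3, \tau(3) = 3.07 caps the speedup at roughly 3x over vanilla
% decoding, before accounting for drafting overhead.
```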
<h3>Performance benchmarking (4xA100)</h3>
<div style="display: flex; justify-content: center; gap: 20px;">
<figure style="text-align: center;">
<img src="assets/Llama-3.3-70B-Instruct-HumanEval.png" alt="Coding" width="100%">
</figure>
<figure style="text-align: center;">
<img src="assets/Llama-3.3-70B-Instruct-math_reasoning.png" alt="Coding" width="100%">
</figure>
<figure style="text-align: center;">
<img src="assets/Llama-3.3-70B-Instruct-summarization.png" alt="Coding" width="100%">
</figure>
</div>
<details>
<summary>Details</summary>

<strong>Configuration</strong>

- temperature: 0
- repetitions: 5
- time per experiment: 4 min
- hardware: 4xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

<strong>Command</strong>
```bash
# Example for the HumanEval (coding) use case; the other use cases
# swap in the corresponding data file and output path.
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 240 \
  --output-path "Llama-3.3-70B-Instruct-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature":0.0}}}'
```

</details>