---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-Coder-30B-A3B-Instruct
tags:
- eagle
- eagle3
- speculative-decoding
- draft-model
- sglang
- qwen3
- code
language:
- en
- zh
pipeline_tag: text-generation
---

# SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct

This is an **EAGLE3 draft model** for speculative decoding with [Qwen/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct).

## Model Description

EAGLE3 is the third generation of EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative decoding technique that uses a lightweight draft model to predict several future tokens, which the target model then verifies in parallel. Because every token is checked by the target model, this typically accelerates inference by 2-3x with no loss in output quality.

### Key Features

- **Target Model**: Qwen3-Coder-30B-A3B-Instruct (30B parameters, 3B active)
- **Draft Model Size**: ~350 MB (single transformer layer)
- **Training Data**: OpenPromptContainer (OPC) regenerated dataset
- **Training Steps**: 295,000 (epoch 1)
- **Framework**: Trained with [SpecForge](https://github.com/sgl-project/SpecForge)

### Training Metrics

| Metric | Value |
|--------|-------|
| First Token Accuracy (acc_0) | 88.19% |
| Average Accuracy (7 positions) | 85.19% |
| Training Epochs | 1+ (295k steps) |

## Usage

### With SGLang

```python
import sglang as sgl

# Launch the target model with EAGLE3 speculative decoding
llm = sgl.Engine(
    model_path="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct",
    speculative_num_steps=5,
    speculative_eagle_topk=8,
    speculative_num_draft_tokens=64,
)

# Generate text
output = llm.generate("Write a Python function to sort a list:")
print(output)
```

### With SGLang Server

```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --tp 8
```

Once the server is running, it can be queried through SGLang's OpenAI-compatible API; see the client example at the end of this card.

## Model Architecture

The EAGLE3 draft model is a lightweight transformer that:

- Shares embeddings with the target model
- Uses a single transformer layer (hidden_size=2048, intermediate_size=12288)
- Predicts multiple future tokens autoregressively
- Uses the target model's hidden states as input

```json
{
  "architectures": ["LlamaForCausalLMEagle3"],
  "hidden_size": 2048,
  "intermediate_size": 12288,
  "num_attention_heads": 32,
  "num_key_value_heads": 4,
  "num_hidden_layers": 1,
  "vocab_size": 151936
}
```

## Training Details

- **Framework**: SpecForge with SGLang backend
- **Hardware**: 4x NVIDIA H200 GPUs (TP=4)
- **Batch Size**: 1 per GPU
- **Learning Rate**: 1e-4 with cosine annealing
- **Max Sequence Length**: 4096
- **Attention Backend**: FlexAttention

## Citation

If you use this model, please cite:

```bibtex
@article{li2024eagle,
  title={EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2401.15077},
  year={2024}
}

@misc{sglang2024,
  title={SGLang: Efficient Execution of Structured Language Model Programs},
  author={Zheng, Lianmin and others},
  year={2024},
  url={https://github.com/sgl-project/sglang}
}
```

## License

This model is released under the Apache 2.0 License, following the base model's license.
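
## Querying the Server (Example)

A minimal client sketch for the server launched in the Usage section, using SGLang's OpenAI-compatible endpoint. It assumes the default port 30000 and the `openai` Python package; adjust the base URL if you start the server with a different `--port`. Speculative decoding is transparent to the client, so no extra request parameters are needed.

```python
# Minimal client sketch for the SGLang server launched above.
# Assumes the default port 30000; SGLang accepts any API key by default.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    messages=[
        {"role": "user", "content": "Write a Python function to sort a list."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```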
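
## Appendix: Draft-and-Verify in a Toy Sketch

To make the draft-then-verify idea from the Model Description concrete, the toy below walks through greedy speculative decoding with plain Python functions standing in for the draft and target models. It is purely illustrative and is not SGLang's or EAGLE3's implementation; a real system scores all drafted positions in a single batched forward pass of the target model, and the hidden-state conditioning used by EAGLE3 is omitted here.

```python
# Toy illustration of greedy speculative decoding: plain Python functions
# stand in for the draft and target models. NOT SGLang's or EAGLE3's code.

def draft_model(tokens):
    """Cheap draft model: proposes last_token + 1 (toy rule)."""
    return tokens[-1] + 1

def target_model(tokens):
    """Expensive target model: disagrees with the draft on every 5th token (toy rule)."""
    nxt = tokens[-1] + 1
    return nxt if nxt % 5 != 0 else 0

def speculative_step(tokens, num_draft=5):
    """One step: draft num_draft tokens, then verify them against the target."""
    # 1) Draft phase: the small model proposes a chain of tokens.
    drafted, ctx = [], list(tokens)
    for _ in range(num_draft):
        t = draft_model(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2) Verify phase: accept drafted tokens while they match the target's
    #    greedy choice; on the first mismatch, keep the target's token instead.
    #    The accepted sequence is therefore exactly what the target alone
    #    would have produced greedily -- no loss in output quality.
    accepted, ctx = [], list(tokens)
    for t in drafted:
        target_t = target_model(ctx)
        accepted.append(target_t)
        ctx.append(target_t)
        if t != target_t:
            break
    else:
        # Every draft was accepted; the verify pass yields one bonus token.
        accepted.append(target_model(ctx))

    return tokens + accepted

if __name__ == "__main__":
    seq = [1]
    for step in range(3):
        seq = speculative_step(seq)
        print(f"after step {step + 1}: {seq}")
```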