---
license: apache-2.0
pipeline_tag: text-to-speech
library_name: transformers
---
## Step-Audio-EditX

 ✨ [Demo Page](https://stepaudiollm.github.io/step-audio-editx/) 
| 🌟 [GitHub](https://github.com/stepfun-ai/Step-Audio-EditX) 
| 📑 [Paper](https://arxiv.org/abs/2511.03601)

Check our open-source repository https://github.com/stepfun-ai/Step-Audio-EditX for more details!

## 🔥🔥🔥 News!!!
* Nov 28, 2025: 🚀 New Model Release: Now supporting **`Japanese`** and **`Korean`** languages.
* Nov 23, 2025: 📊 [Step-Audio-Edit-Benchmark](https://github.com/stepfun-ai/Step-Audio-Edit-Benchmark) Released!
* Nov 19, 2025: ⚙️ We release a **new version** of our model, which **supports polyphonic pronunciation control** and improves the performance of emotion, speaking style, and paralinguistic editing.

We are open-sourcing **Step-Audio-EditX**, a powerful **3B-parameter** LLM-based audio model specialized in expressive and **iterative audio editing**.
It excels at **editing emotion**, **speaking style**, and **paralinguistics**, and also features robust **zero-shot text-to-speech (TTS)** capabilities.

## Features
- **Zero-Shot TTS**
  - Excellent zero-shot TTS voice cloning for `Mandarin`, `English`, `Sichuanese`, `Cantonese`, `Japanese`, and `Korean`.
  - To synthesize in a dialect, or in Japanese or Korean, just prepend a **`[Sichuanese]`**, **`[Cantonese]`**, **`[Japanese]`**, or **`[Korean]`** tag to your text (see the tagged-text example after the feature list).
 
- **Emotion and Speaking Style Editing**
  - Remarkably effective iterative control over emotions and styles, supporting **dozens** of options for editing.
    - Emotion Editing: [ *Angry*, *Happy*, *Sad*, *Excited*, *Fearful*, *Surprised*, *Disgusted*, etc. ]
    - Speaking Style Editing: [ *Act_coy*, *Older*, *Child*, *Whisper*, *Serious*, *Generous*, *Exaggerated*, etc. ]
    - Editing with even more emotions and speaking styles is on the way. **Get Ready!** 🚀
    
- **Paralinguistic Editing**:
  - Precise control over 10 types of paralinguistic features for more natural, human-like, and expressive synthetic audio.
  - Supporting Tags:
    - [ *Breathing*, *Laughter*, *Surprise-oh*, *Confirmation-en*, *Uhm*, *Surprise-ah*, *Surprise-wa*, *Sigh*, *Question-ei*, *Dissatisfaction-hnn* ]

For more examples, see [demo page](https://stepaudiollm.github.io/step-audio-editx/).
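
For example, tagged target texts look like the following (a minimal sketch; the sentences are placeholders, and such a string is what you would pass as the `--generated-text` argument of `tts_infer.py`, described under Model Usage below):

```bash
# Illustrative only: the language/dialect tag is prepended directly to the target text.
TEXT_CANTONESE="[Cantonese] 今日天氣好好。"        # "The weather is great today."
TEXT_JAPANESE="[Japanese] 今日はいい天気ですね。"   # "Nice weather today, isn't it?"
TEXT_KOREAN="[Korean] 오늘 날씨가 정말 좋네요."     # "The weather is really nice today."
```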

## Model Usage
### 📜 Requirements
The following table shows the requirements for running the Step-Audio-EditX model:

| Model | Parameters | Setting<br/>(sample frequency) | Optimal GPU Memory |
|-------|------------|--------------------------------|--------------------|
| Step-Audio-EditX | 3B | 41.6 Hz | 32 GB |

* An NVIDIA GPU with CUDA support is required.
  * The model is tested on a single L40S GPU.
* Tested operating system: Linux

### 🔧 Dependencies and Installation
- Python >= 3.10.0 (we recommend [Anaconda](https://www.anaconda.com/download/#linux) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html))
- [PyTorch >= 2.4.1-cu121](https://pytorch.org/)
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)

```bash
# Clone the inference code and set up a conda environment
git clone https://github.com/stepfun-ai/Step-Audio-EditX.git
conda create -n stepaudioedit python=3.10
conda activate stepaudioedit

cd Step-Audio-EditX
pip install -r requirements.txt

# Download the model weights from Hugging Face (requires git-lfs)
git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-Tokenizer
git clone https://huggingface.co/stepfun-ai/Step-Audio-EditX
```

After downloading the models, `where_you_download_dir` should have the following structure:
```
where_you_download_dir
β”œβ”€β”€ Step-Audio-Tokenizer
β”œβ”€β”€ Step-Audio-EditX
```
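
Alternatively, if you prefer not to clone via git-lfs, the same weights can be fetched with the Hugging Face CLI (a sketch assuming a recent `huggingface_hub` installation; the target directories mirror the layout above):

```bash
pip install -U "huggingface_hub[cli]"

# Download both repositories into where_you_download_dir, matching the layout above.
huggingface-cli download stepfun-ai/Step-Audio-Tokenizer --local-dir where_you_download_dir/Step-Audio-Tokenizer
huggingface-cli download stepfun-ai/Step-Audio-EditX --local-dir where_you_download_dir/Step-Audio-EditX
```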

#### Run with Docker

You can set up the environment required for running Step-Audio-EditX using the provided Dockerfile.

```bash
# build docker
docker build . -t step-audio-editx

# run docker
docker run --rm --gpus all \
    -v /your/code/path:/app \
    -v /your/model/path:/model \
    -p 7860:7860 \
    step-audio-editx
```
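
The snippet above does not fix the container command; as a hedged sketch, you can also pass the demo command explicitly (assuming the container's working directory contains `app.py` and the models are mounted at `/model`):

```bash
# Launch the web demo inside the container; port 7860 is published to the host.
docker run --rm --gpus all \
    -v /your/code/path:/app \
    -v /your/model/path:/model \
    -p 7860:7860 \
    step-audio-editx \
    python app.py --model-path /model --model-source local
```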


#### Launch Web Demo
Start a local server for online inference.
This assumes one GPU with at least 32 GB of memory available and all models already downloaded.

```bash
# Step-Audio-EditX demo
python app.py --model-path where_you_download_dir --model-source local
```
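
Once the server is running, open the URL it prints in a browser (with the Docker setup above, typically http://localhost:7860, since that is the published port).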

#### Local Inference Demo
> [!TIP]
> For optimal performance, keep audio under 30 seconds per inference.

```bash
# zero-shot cloning
python3 tts_infer.py \
    --model-path where_you_download_dir \
    --output-dir ./output \
    --prompt-text "your prompt text"\
    --prompt-audio your_prompt_audio_path \
    --generated-text "your target text" \
    --edit-type "clone"

# edit
# Note: for paralinguistic editing, --generated-text must contain the target text.
python3 tts_infer.py \
    --model-path where_you_download_dir \
    --output-dir ./output \
    --prompt-text "your prompt text" \
    --prompt-audio your_prompt_audio_path \
    --generated-text "" \
    --edit-type "emotion" \
    --edit-info "sad" \
    --n-edit-iter 2
```
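
For paralinguistic editing specifically, `--generated-text` must carry the target text. A hedged sketch follows; the exact `--edit-type`/`--edit-info` strings and the inline tag placement are assumptions, so check `tts_infer.py --help` for the values actually supported:

```bash
# Paralinguistic editing (hypothetical values): append a laughter event to the sentence.
python3 tts_infer.py \
    --model-path where_you_download_dir \
    --output-dir ./output \
    --prompt-text "your prompt text" \
    --prompt-audio your_prompt_audio_path \
    --generated-text "your prompt text [Laughter]" \
    --edit-type "paralinguistic" \
    --edit-info "Laughter" \
    --n-edit-iter 1
```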


## Citation

```
@misc{yan2025stepaudioeditxtechnicalreport,
      title={Step-Audio-EditX Technical Report}, 
      author={Chao Yan and Boyong Wu and Peng Yang and Pengfei Tan and Guoqiang Hu and Yuxin Zhang and Xiangyu Zhang and Fei Tian and Xuerui Yang and Xiangyu Zhang and Daxin Jiang and Gang Yu},
      year={2025},
      eprint={2511.03601},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.03601}, 
}

```