Commit d3a65c3 · 1 Parent(s): 18acc5e
Update README.md

README.md CHANGED
@@ -6,7 +6,7 @@
 
 # AudioLDM
 
-AudioLDM is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input. It is available in the 🧨 Diffusers library from v0.
+AudioLDM is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input. It is available in the 🧨 Diffusers library from v0.15.0 onwards.
 
 # Model Details
 

@@ -29,7 +29,7 @@ sound effects, human speech and music.
 First, install the required packages:
 
 ```
-pip install --upgrade
+pip install --upgrade diffusers transformers
 ```
 
 ## Text-to-Audio

@@ -46,7 +46,7 @@ pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
 pipe = pipe.to("cuda")
 
 prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
-audio = pipe(prompt, num_inference_steps=10,
+audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
 ```
 
 The resulting audio output can be saved as a .wav file:

@@ -65,10 +65,13 @@ Audio(audio, rate=16000)
 
 ## Tips
 
-
+Prompts:
+* Descriptive prompt inputs work best: you can use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g., "water stream in a forest" instead of "stream").
 * It's best to use general terms like 'cat' or 'dog' instead of specific names or abstract objects that the model may not be familiar with.
+
+Inference:
 * The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: higher steps give higher quality audio at the expense of slower inference.
-* The _length_ of the predicted audio sample can be controlled by varying the `
+* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
 
 # Citation
 
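For reference, the ".wav file" saving step mentioned in the hunks above can be sketched as follows. This is a minimal, self-contained illustration rather than the README's actual code: it substitutes a silent NumPy array for the pipeline output (running `AudioLDMPipeline` itself requires downloading the model and, ideally, a GPU), takes the 16 kHz sample rate from the `Audio(audio, rate=16000)` context line, and the stdlib `wave` module is an assumed choice for writing the file.

```python
import wave
import numpy as np

# Stand-in for `audio = pipe(...).audios[0]`: 5.0 s of silence at 16 kHz,
# matching audio_length_in_s=5.0 and the model's 16 kHz output rate.
sample_rate = 16000
audio = np.zeros(int(5.0 * sample_rate), dtype=np.float32)

# Convert float32 samples in [-1, 1] to 16-bit PCM and write a mono .wav file.
pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
with wave.open("techno.wav", "wb") as f:
    f.setnchannels(1)            # mono
    f.setsampwidth(2)            # 16-bit samples
    f.setframerate(sample_rate)  # 16 kHz
    f.writeframes(pcm.tobytes())
```

At 16 kHz, `audio_length_in_s=5.0` corresponds to 80,000 samples, which is why the stand-in array has that length.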