Audiocraft 2: Sound Generation

In this tutorial we'll use Audiocraft to generate sound: you pass a description of the sound you want to hear on the command line, and the script generates an audio file from that description.

machine learning, audiocraft, ai, stable audio, generative audio


Step 1: Generate some static samples.

Create a file in the root of the repo called '' and paste in the following:

import torchaudio
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # generate 5 seconds.
descriptions = ['dog barking', 'siren of an emergency vehicle', 'footsteps in a corridor']
wav = model.generate(descriptions)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)

This imports the AudioGen class from Audiocraft, loads the audiogen-medium model (details are on its Hugging Face model page), and generates a 5-second clip for each entry in the 'descriptions' list: a dog barking, a siren, and footsteps. The clips are saved to the root of the repo as 0.wav, 1.wav, and 2.wav, respectively.

You can run this script with 'python' followed by the file name. Keep in mind that the model, which is almost 4GB, has to load before anything is generated, so expect a wait, and generation itself will be slow if you don’t have a GPU!
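
If you are not sure whether a GPU is visible to PyTorch, a quick check along these lines may help before you start a long run. This is only a minimal sketch: 'describe_device' is a hypothetical helper name, not part of Audiocraft, and it reports nothing more than what PyTorch itself sees.

```python
def describe_device() -> str:
    """Return 'cuda' if PyTorch can see a GPU, 'cpu' if not, or a note if torch is missing."""
    try:
        import torch  # normally installed alongside Audiocraft
    except ImportError:
        return "torch is not installed"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Device check: {describe_device()}")
```

If this prints 'cpu', the script should still work, just more slowly.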

Step 2: Add params

Import the argparse library at the top of the file: 'import argparse'.

Create a named function after you instantiate your model:

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # generate [duration] seconds.

def generate_audio(descriptions):

Then delete the hard-coded descriptions list, since the descriptions are now passed in as an argument, and move the model.generate call and the audio_write loop into the new function:

def generate_audio(descriptions):
    wav = model.generate(descriptions)  # generates samples for all descriptions in array.

    for idx, one_wav in enumerate(wav):
        # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
        audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
        print(f'Generated {idx}th sample.')
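
Because the output names depend only on each description's position in the list, it can be handy to see which file belongs to which description, and to remember that a later run reuses the same names. The sketch below runs without the model; 'planned_outputs' is a hypothetical helper name used purely for illustration.

```python
def planned_outputs(descriptions):
    """Pair each description with the file name the audio_write loop would use for it."""
    return [(f"{idx}.wav", text) for idx, text in enumerate(descriptions)]


for filename, text in planned_outputs(["dog barking", "footsteps in a corridor"]):
    print(f"{filename}: {text}")
```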

Set up the ability to accept arguments via the command line at the bottom of the script:

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate audio based on descriptions.")
    parser.add_argument("descriptions", nargs='+', help="List of descriptions for audio generation")
    args = parser.parse_args()

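On the last line of the script, inside the 'if __name__ == "__main__":' block, call the function with the parsed descriptions:

    generate_audio(args.descriptions)
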
Run it to generate audio (replace bracketed text with your desired sounds):

          python "[audio you want to generate 1]" "[audio you want to generate 2]"
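
The quotes matter: each quoted string reaches the script as one description, while an unquoted phrase is split into separate words. The sketch below illustrates this with the same parser definition and no model; it uses 'shlex.split' as a stand-in for how a POSIX shell splits the argument portion of the command, which is an assumption about your shell.

```python
import argparse
import shlex

parser = argparse.ArgumentParser(description="Generate audio based on descriptions.")
parser.add_argument("descriptions", nargs="+", help="List of descriptions for audio generation")

# shlex.split approximates POSIX shell word splitting for the arguments after the script name.
quoted = shlex.split('"dog barking" "rain on a tin roof"')
unquoted = shlex.split("dog barking")

print(parser.parse_args(quoted).descriptions)    # ['dog barking', 'rain on a tin roof']
print(parser.parse_args(unquoted).descriptions)  # ['dog', 'barking']
```

In the unquoted case the model would receive two prompts, 'dog' and 'barking', and write two files.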

Final code:

import argparse

import torchaudio
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # generate [duration] seconds.

def generate_audio(descriptions):
    wav = model.generate(descriptions)  # generates samples for all descriptions in array.

    for idx, one_wav in enumerate(wav):
        # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
        audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
        print(f'Generated {idx}th sample.')

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate audio based on descriptions.")
    parser.add_argument("descriptions", nargs='+', help="List of descriptions for audio generation")
    args = parser.parse_args()
    generate_audio(args.descriptions)
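
The clip length is fixed at 5 seconds in the script. One possible extension, sketched here as an assumption rather than as part of the tutorial's code, is an optional '--duration' flag. Both the flag and the 'build_parser' helper are hypothetical names; only the argparse part is shown so the example runs without loading the model.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the tutorial's parser plus a hypothetical --duration option."""
    parser = argparse.ArgumentParser(description="Generate audio based on descriptions.")
    parser.add_argument("descriptions", nargs="+", help="List of descriptions for audio generation")
    # Hypothetical addition: not present in the tutorial's script.
    parser.add_argument("--duration", type=float, default=5.0, help="Clip length in seconds")
    return parser


args = build_parser().parse_args(["--duration", "8", "dog barking"])
print(args.duration, args.descriptions)

# In the script, the parsed value could then replace the hard-coded 5:
# model.set_generation_params(duration=args.duration)
```

Leaving the flag out falls back to the default of 5 seconds, so existing invocations would behave as before.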
