Audiocraft 2: Sound Generation

In this tutorial we'll use Audiocraft for sound generation: you'll pass a description of a sound you want to hear to the command line, and the script will generate an audio file of that sound from the description.


Step 1: Generate some static samples.

Create a file in the root of the repo called 'audiogen-demo.py' and paste in the following:

import torchaudio
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # generate 5 seconds.
descriptions = ['dog barking', 'siren of an emergency vehicle', 'footsteps in a corridor']
wav = model.generate(descriptions)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)

This imports the AudioGen model from the Audiocraft library, loads the audiogen-medium model (which you can see details of here, on Hugging Face), and generates a set of 5-second clips based on the descriptions in the 'descriptions' array: a dog barking, a siren, and footsteps. The clips are saved to the root of the repo as 0.wav, 1.wav, and 2.wav, respectively.
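The "loudness" strategy in audio_write normalizes each clip toward a target loudness of -14 dB LUFS. The core idea is a decibel-to-linear-gain conversion, sketched below as a minimal illustration; 'loudness_gain' is a hypothetical helper for this tutorial, not Audiocraft's actual implementation (which also measures LUFS and handles clipping).

```python
def loudness_gain(measured_lufs, target_lufs=-14.0):
    """Linear amplitude factor that moves a clip from its measured
    loudness to the target loudness (illustrative sketch only)."""
    gain_db = target_lufs - measured_lufs  # dB of gain needed
    return 10 ** (gain_db / 20.0)          # dB -> linear amplitude factor

# A clip measured at -20 LUFS needs roughly 2x amplitude to reach -14 LUFS.
print(round(loudness_gain(-20.0), 2))  # → 2.0
```

A quieter clip gets a gain above 1.0, a louder one gets a gain below 1.0, so all outputs end up at a comparable perceived volume.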

You can run this script with 'python audiogen-demo.py'. Keep in mind that the model, which is almost 4GB, has to load first, so generation will be slow, especially if you don't have a GPU!

Step 2: Add params

Import the argparse library at the top of the file: 'import argparse'.

Create a named function after you instantiate your model:


model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # generate [duration] seconds.

def generate_audio(descriptions):

and move everything from the model.generate call through the audio_write loop into the new function, deleting the hard-coded 'descriptions' array (it becomes the function's parameter):


def generate_audio(descriptions):
    wav = model.generate(descriptions)  # generates samples for all descriptions in the array.

    for idx, one_wav in enumerate(wav):
        # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
        audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
        print(f'Generated sample {idx}.')

Set up the ability to accept arguments via the command line at the bottom of the script:


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate audio based on descriptions.")
    parser.add_argument("descriptions", nargs='+', help="List of descriptions for audio generation")
    args = parser.parse_args()
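To see what nargs='+' does before wiring it into the script, here's a stand-alone sketch of the same parser. It collects every positional argument into one list, and errors out when no arguments are given, which is what you want here, since the script needs at least one description.

```python
import argparse

# Same parser as in the tutorial, exercised directly with a sample argv.
parser = argparse.ArgumentParser(description="Generate audio based on descriptions.")
parser.add_argument("descriptions", nargs='+', help="List of descriptions for audio generation")

args = parser.parse_args(['dog barking', 'footsteps in a corridor'])
print(args.descriptions)  # → ['dog barking', 'footsteps in a corridor']
```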

On the last line of the script, still inside the __main__ block, call the function:


    generate_audio(args.descriptions)

Run it to generate audio (replace bracketed text with your desired sounds):


python audiogen-demo.py "[audio you want to generate 1]" "[audio you want to generate 2]"
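The quotes matter: each quoted string arrives at the script as a single description, while unquoted words would each become a separate description. You can verify this without touching the model by splitting a sample command line the way a POSIX shell would (the command string below is just an illustration):

```python
import shlex

# A hypothetical invocation, mirroring the tutorial's run command.
cmd = 'python audiogen-demo.py "dog barking" "rain on a tin roof"'

# shlex.split applies shell-style quoting rules; skip 'python' and the script name.
argv = shlex.split(cmd)[2:]
print(argv)  # → ['dog barking', 'rain on a tin roof']
```

So 'python audiogen-demo.py "dog barking" "rain on a tin roof"' generates two clips, not five.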

Final code (audiogen-demo.py):


import torchaudio
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write
import argparse

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # generate [duration] seconds.

def generate_audio(descriptions):
    wav = model.generate(descriptions)  # generates samples for all descriptions in the array.

    for idx, one_wav in enumerate(wav):
        # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
        audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
        print(f'Generated sample {idx}.')

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate audio based on descriptions.")
    parser.add_argument("descriptions", nargs='+', help="List of descriptions for audio generation")
    args = parser.parse_args()

    generate_audio(args.descriptions)