Stability AI Introduces Cutting-Edge ‘Stable Audio’ Model for Precise Audio Generation


Stability AI has unveiled its latest innovation, the “Stable Audio” model, designed to revolutionize the world of audio generation. This groundbreaking development represents a significant leap forward in the field of generative AI, offering unparalleled control over audio content and duration, including the ability to create entire songs. 

Traditional audio diffusion models can typically generate only fixed-length clips, often resulting in abrupt, incomplete musical phrases. This limitation stems from the models being trained on random audio segments cropped from longer files and forced into predetermined durations.

Stable Audio addresses this challenge directly, allowing audio of a specified duration to be generated, up to the size of the training window.

One standout feature of Stable Audio is its use of a heavily downsampled latent representation of audio, which makes inference significantly faster than working on raw audio. Combined with advanced diffusion sampling techniques, the flagship Stable Audio model can generate 95 seconds of stereo audio at a 44.1 kHz sample rate in under one second on an NVIDIA A100 GPU.
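
To get a sense of why a downsampled latent space matters, the short back-of-the-envelope sketch below compares how many values a diffusion model would have to denoise for 95 seconds of 44.1 kHz stereo in raw-sample space versus in a latent space. The downsampling factor and latent channel count used here are illustrative assumptions, not figures Stability AI has published for Stable Audio.

```python
# Rough size comparison: raw stereo samples vs. a downsampled latent sequence.
# The latent downsampling factor and channel count below are illustrative
# assumptions, not published Stable Audio hyperparameters.

SAMPLE_RATE = 44_100          # Hz, as stated for the flagship model
DURATION_S = 95               # seconds of generated audio
CHANNELS = 2                  # stereo

raw_values = SAMPLE_RATE * DURATION_S * CHANNELS

LATENT_DOWNSAMPLE = 1024      # assumed time-axis compression of the VAE
LATENT_CHANNELS = 64          # assumed latent channel count

latent_frames = (SAMPLE_RATE * DURATION_S) // LATENT_DOWNSAMPLE
latent_values = latent_frames * LATENT_CHANNELS

print(f"raw audio values:   {raw_values:,}")       # ~8.4 million
print(f"latent values:      {latent_values:,}")    # ~0.26 million
print(f"compression factor: {raw_values / latent_values:.0f}x")
```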

The core architecture of Stable Audio comprises a variational autoencoder (VAE), a text encoder, and a U-Net-based conditioned diffusion model. The VAE plays a crucial role in compressing stereo audio into a noise-resistant, lossy latent encoding, greatly expediting both the generation and training processes. This approach, based on the Descript Audio Codec encoder and decoder architectures, ensures high-fidelity output while allowing for encoding and decoding of audio of arbitrary lengths.
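
A minimal sketch of that idea in PyTorch is shown below. Because the encoder and decoder are fully convolutional, the same model can encode and decode audio of arbitrary length, and the reparameterized sampling step is what makes the latent encoding lossy. The channel counts and strides are illustrative assumptions; the actual Stable Audio VAE follows the Descript Audio Codec encoder and decoder architectures.

```python
import torch
import torch.nn as nn

class ToyAudioVAE(nn.Module):
    """Illustrative fully convolutional audio VAE.

    Channel counts and strides are made up for this sketch; the real model
    is based on the Descript Audio Codec encoder/decoder architectures.
    """

    def __init__(self, latent_channels: int = 64):
        super().__init__()
        # Encoder: stereo waveform -> downsampled latent sequence (mean + log-variance)
        self.encoder = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=7, stride=4, padding=3),
            nn.GELU(),
            nn.Conv1d(32, 128, kernel_size=7, stride=4, padding=3),
            nn.GELU(),
            nn.Conv1d(128, 2 * latent_channels, kernel_size=7, stride=4, padding=3),
        )
        # Decoder: latent sequence -> stereo waveform
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent_channels, 128, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(128, 32, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(32, 2, kernel_size=8, stride=4, padding=2),
        )

    def encode(self, audio: torch.Tensor) -> torch.Tensor:
        # Split the encoder output into mean and log-variance, then sample (lossy step).
        mean, logvar = self.encoder(audio).chunk(2, dim=1)
        return mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)

    def decode(self, latent: torch.Tensor) -> torch.Tensor:
        return self.decoder(latent)

vae = ToyAudioVAE()
audio = torch.randn(1, 2, 44_100 * 10)     # 10 s of stereo audio at 44.1 kHz
latent = vae.encode(audio)                  # (1, 64, ~6891): ~64x shorter along time
reconstruction = vae.decode(latent)
print(latent.shape, reconstruction.shape)
```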

To incorporate text prompts, Stability AI uses a text encoder taken from a CLAP model trained specifically on their dataset. Using CLAP means the text features carry information about the relationships between words and sounds. The text features are taken from the second-to-last layer of the CLAP text encoder and fed into the diffusion U-Net through cross-attention layers.
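
The cross-attention conditioning can be sketched roughly as follows: a sequence of latent audio tokens queries a sequence of text feature vectors standing in for the penultimate-layer CLAP outputs described above. The dimensions and the use of PyTorch's nn.MultiheadAttention are illustrative choices, not Stable Audio's actual implementation.

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Illustrative cross-attention block: latent audio tokens attend to text features."""

    def __init__(self, latent_dim: int = 512, text_dim: int = 768, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim)
        # kdim/vdim let the text features keep their own width (placeholder sizes).
        self.attn = nn.MultiheadAttention(
            latent_dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )

    def forward(self, latent_tokens: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(self.norm(latent_tokens), text_features, text_features)
        return latent_tokens + attended            # residual connection

# Stand-ins: 256 latent audio tokens, 77 text feature vectors from a text encoder.
latent_tokens = torch.randn(1, 256, 512)
text_features = torch.randn(1, 77, 768)            # e.g. penultimate-layer encoder outputs
block = TextCrossAttention()
print(block(latent_tokens, text_features).shape)   # torch.Size([1, 256, 512])
```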

During training, two properties are attached to each audio chunk: the time at which the chunk starts within the source file (“start_time”) and the overall length of the source audio file (“audio_duration”). These values are converted into discrete learned embeddings at per-second resolution and concatenated with the text prompt tokens. At inference time, this conditioning is what lets users specify the desired duration of the generated audio.
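
A rough sketch of that timing conditioning is given below, assuming the two values are simply bucketed into whole seconds, looked up in learned embedding tables, and appended to the text conditioning. The table size and embedding width are assumptions; only the two property names come from the description above.

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Illustrative per-second embeddings for "start_time" and "audio_duration"."""

    def __init__(self, max_seconds: int = 512, dim: int = 768):
        super().__init__()
        # One learned embedding per whole second, for each of the two properties.
        self.start_emb = nn.Embedding(max_seconds, dim)
        self.duration_emb = nn.Embedding(max_seconds, dim)

    def forward(self, start_time_s: torch.Tensor, audio_duration_s: torch.Tensor) -> torch.Tensor:
        start = self.start_emb(start_time_s.long())             # (batch, dim)
        duration = self.duration_emb(audio_duration_s.long())   # (batch, dim)
        return torch.stack([start, duration], dim=1)            # (batch, 2, dim)

conditioner = TimingConditioner()
text_features = torch.randn(1, 77, 768)                         # stand-in text conditioning
# At inference time the user effectively asks for, e.g., a 95-second piece starting at 0 s.
timing = conditioner(torch.tensor([0.0]), torch.tensor([95.0]))
conditioning = torch.cat([text_features, timing], dim=1)        # (1, 79, 768)
print(conditioning.shape)
```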

The diffusion model at the core of Stable Audio has 907 million parameters and combines residual layers, self-attention layers, and cross-attention layers to denoise its input while conditioning on the text and timing embeddings. To improve memory efficiency and scale to longer sequences, the model uses memory-efficient implementations of its attention layers.
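
One level of such a U-Net might combine the three layer types roughly as in the sketch below. The dimensions are placeholders, and the memory-efficient attention is represented here by PyTorch's fused scaled_dot_product_attention kernel rather than whatever implementation Stable Audio actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, heads: int) -> torch.Tensor:
    """Multi-head attention via the fused scaled_dot_product_attention kernel."""
    b, n, d = q.shape

    def split_heads(t: torch.Tensor) -> torch.Tensor:
        return t.reshape(b, -1, heads, d // heads).transpose(1, 2)

    out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
    return out.transpose(1, 2).reshape(b, n, d)

class DenoiserBlock(nn.Module):
    """Illustrative residual + self-attention + cross-attention block of a diffusion U-Net."""

    def __init__(self, dim: int = 512, cond_dim: int = 768, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.res_conv = nn.Sequential(
            nn.GroupNorm(8, dim), nn.SiLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )
        self.self_qkv = nn.Linear(dim, 3 * dim)
        self.cross_q = nn.Linear(dim, dim)
        self.cross_kv = nn.Linear(cond_dim, 2 * dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) noisy latent; cond: (batch, tokens, cond_dim) text + timing.
        x = x + self.res_conv(x.transpose(1, 2)).transpose(1, 2)   # residual conv layer
        q, k, v = self.self_qkv(self.norm1(x)).chunk(3, dim=-1)
        x = x + attention(q, k, v, self.heads)                     # self-attention over time
        q = self.cross_q(self.norm2(x))
        k, v = self.cross_kv(cond).chunk(2, dim=-1)
        x = x + attention(q, k, v, self.heads)                     # cross-attention on conditioning
        return x

block = DenoiserBlock()
noisy_latent = torch.randn(1, 256, 512)
conditioning = torch.randn(1, 79, 768)              # text + timing embeddings
print(block(noisy_latent, conditioning).shape)      # torch.Size([1, 256, 512])
```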

To train the flagship Stable Audio model, Stability AI curated a comprehensive dataset of more than 800,000 audio files spanning music, sound effects, and single-instrument stems. The dataset, built in partnership with AudioSparx, a prominent stock music provider, totals roughly 19,500 hours of audio.

Stable Audio represents the forefront of audio generation research, emerging from Stability AI’s generative audio research lab, Harmonai. The team remains committed to advancing model architectures, refining datasets, and improving training procedures. Their ongoing efforts aim to enhance output quality, fine-tune controllability, optimize inference speed, and expand the range of achievable output lengths.

Stability AI has also hinted at forthcoming releases from Harmonai, suggesting the possibility of open-source models based on Stable Audio and accessible training code.
