Meet the Cool Cousin of Stable Diffusion, ‘Riffusion’

The model can generate endless variations from a text prompt or an uploaded sound clip, and the result can be modified further by entering more prompts.

Since Stable Diffusion was open-sourced, a wave of new projects has surfaced. The newest is Riffusion, which generates AI music in real time. Rather than working on audio directly, it creates music from images of audio: Riffusion is built by fine-tuning Stable Diffusion to create images of spectrograms, which are essentially visualisations of audio.


Click here to try it out.


You can check out the code of the model here.


Process and Features

Spectrograms are visual representations of audio that display the amplitude of each frequency over time. A spectrogram is computed from audio with the short-time Fourier transform (STFT), which approximates the audio as a combination of sine waves with varying amplitudes and phases. The spectrogram images the model generates can then be converted back into audio clips.
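To make the idea concrete, here is a minimal sketch of a magnitude spectrogram computed with a naive STFT in plain Python. It is an illustration only, not Riffusion's code: the function name `stft_magnitudes`, the frame size, the hop length and the test tone are all assumptions chosen to keep the example small.

```python
import cmath
import math


def stft_magnitudes(samples, frame_size=64, hop=32):
    """Return one list of frequency-bin magnitudes per overlapping frame."""
    # A Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    spectrogram = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        bins = []
        # Naive DFT over the non-negative frequency bins only.
        for k in range(frame_size // 2 + 1):
            coeff = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            bins.append(abs(coeff))  # keep the amplitude, drop the phase
        spectrogram.append(bins)
    return spectrogram


# Illustrative input: a 1 kHz sine wave sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = stft_magnitudes(tone)

# The loudest bin of the first frame, converted back to a frequency in Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
print(len(spec), "frames,", len(spec[0]), "bins, peak near", peak_hz, "Hz")
```

Each row of `spec` is one column of the spectrogram image: time runs across frames and frequency runs across bins. Because only the amplitudes are kept, turning such an image back into sound requires estimating the phase that was discarded.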

Apart from text-to-audio, Stable Diffusion-based models can also use their image-to-image ability. This is useful for modifying sounds: changes are made to the spectrogram image, while the denoising strength parameter controls how much of the original audio content is preserved.
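As a rough sketch of what a denoising strength does, the snippet below assumes the convention used by common Stable Diffusion image-to-image pipelines: strength is a value between 0 and 1, and it decides how many of the scheduled denoising steps are actually run on the noised input. The helper `steps_for_strength` is hypothetical and is not taken from the Riffusion code.

```python
def steps_for_strength(num_steps, strength):
    """Hypothetical helper: map a denoising strength to the steps that run.

    Returns (first_step, steps_to_run). A low strength skips most steps,
    so the output stays close to the input image; a strength of 1.0 runs
    every step and largely ignores the input.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(num_steps * strength), num_steps)
    first_step = num_steps - steps_to_run
    return first_step, steps_to_run


for strength in (0.0, 0.5, 1.0):
    first, run = steps_for_strength(50, strength)
    print(f"strength={strength}: skip {first} steps, run {run}")
```

Under this assumption, a low strength keeps the melody or rhythm of the original clip largely intact, while a high strength lets the prompt reshape it more aggressively.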

To create endlessly varying AI-generated music, the developers interpolated between prompts and seeds in the latent space of the diffusion model. In latent space, similar items lie close to each other, which allows buttery smooth transitions even between disparate prompts.
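The interpolation between seeds can be sketched as follows. This example assumes spherical linear interpolation (slerp), a common choice for blending Gaussian noise vectors; the article does not say which interpolation the developers used, and the helper names, vector size and seeds here are illustrative.

```python
import math
import random


def seeded_noise(seed, size):
    """Draw a reproducible Gaussian noise vector for a given seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def slerp(t, a, b):
    """Spherically interpolate between vectors a and b for t in [0, 1]."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_angle = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    angle = math.acos(cos_angle)
    if math.sin(angle) < 1e-6:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * angle) / math.sin(angle)
    weight_b = math.sin(t * angle) / math.sin(angle)
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Five evenly spaced points on the path from one seed's noise to another's.
start = seeded_noise(seed=1, size=16)
end = seeded_noise(seed=2, size=16)
path = [slerp(i / 4, start, end) for i in range(5)]
```

Feeding each point of `path` to the model as its starting noise, one after another, would give a sequence of related spectrograms rather than unrelated ones, which is what makes the transitions sound continuous.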

In September, Dance Diffusion, another diffusion-based model that can generate music clips, was released. It was trained on hundreds of hours of songs, which is why it was seen as an ethically borderline choice for Stability AI.

Mohit Pandey
Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.
