
Resemble AI Open-Sources Its Voice-Cloning Model, Chatterbox

In a recent test, 63.75% of listeners preferred Chatterbox’s audio samples over ElevenLabs.

US-based voice cloning platform Resemble AI has open-sourced Chatterbox, a model with both text-to-speech and voice conversion capabilities, the company announced on X.

A recent test conducted through Podonos assessed how naturally, and at what quality, Resemble AI’s Chatterbox and ElevenLabs generate speech. Both systems produced audio samples ranging from 7 to 20 seconds in duration from the same text inputs (zero-shot, with no prompt engineering or audio post-processing).

Participants listened to audio samples from both models, and 63.75% of listeners preferred Chatterbox over ElevenLabs. The results also support Chatterbox’s position as a competitive open-source model offering features like emotion control and rapid voice cloning.

Resemble AI claims Chatterbox is the first open-source model with emotion exaggeration control, allowing users to adjust intensity from monotone to dramatically expressive with a single parameter.

In February of this year, Resemble AI launched Rapid Voice Clone 2.0, a tool that allows users to create high-quality voice content from just 20 seconds of audio. The tool supports voice generation, editing, and localisation, letting users make instant modifications, such as swapping words, fine-tuning tone, or adjusting delivery, without re-recording.

Open-source AI voice cloning allows users to mimic voices with remarkable precision. A prime example is OpenVoice, developed through a collaboration between researchers from MIT, Tsinghua University, and the Canadian startup MyShell, the project’s website states.

Similarly, another AI startup, Zyphra, launched its open-source text-to-speech models in February. These models can clone a voice from as little as five seconds of sample audio, producing realistic results with under 30 seconds of recorded speech.

Reports show that the models, each measuring 1.6 billion parameters, were trained on over 200,000 hours of speech data, which includes both neutral-toned speech, such as audiobook narration, and highly expressive speech.


Smruthi Nadig
Smruthi brings over two years of experience in reporting on the global energy industry. They hold a Master's Degree from the University of Leeds in International Journalism and a Bachelor's Degree from Christ University in Media Studies, Economics and Political Science.