Google, which has been making rapid strides in the generative AI domain, has introduced AudioPaLM, a new multimodal language model built by combining the capabilities of PaLM-2, the large language model Google unveiled at Google I/O 2023, with AudioLM, its generative audio model released last year. AudioPaLM provides a unified multimodal framework that can process and generate both text and speech.
The applications of AudioPaLM are diverse, encompassing areas such as speech recognition and speech-to-speech translation. Leveraging AudioLM, AudioPaLM inherits the capacity to capture paralinguistic cues such as speaker identity and intonation, while also integrating the linguistic knowledge embedded in text-based language models like PaLM-2. Moreover, AudioPaLM showcases distinctive features of audio language models, such as the ability to transfer a voice from one language to another based on a short spoken prompt.
AudioPaLM is built on a large-scale Transformer model. It extends a pre-existing text-based LLM by augmenting its vocabulary with specialised audio tokens. This, together with a basic task description, allows a single decoder-only model to be trained on a mix of tasks involving both speech and text in various combinations, including speech recognition, text-to-speech synthesis, and speech-to-speech translation. In this way, the researchers consolidate traditionally separate models into a unified architecture and training process.
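The core idea of a joint vocabulary can be illustrated with a minimal sketch: text tokens keep their original IDs, and discrete audio tokens are offset past them, so one decoder-only model sees a single flat sequence of task description plus audio. The vocabulary sizes and helper names below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of an AudioPaLM-style joint vocabulary:
# a text LLM's token space is extended with discrete audio tokens,
# so one decoder-only model can consume mixed text/audio sequences.
# Sizes and names are illustrative, not taken from the paper.

TEXT_VOCAB_SIZE = 32_000   # assumed size of the base text vocabulary
NUM_AUDIO_TOKENS = 1_024   # assumed number of discrete audio codes


def audio_token_id(code: int) -> int:
    """Map a discrete audio code (0..NUM_AUDIO_TOKENS-1) into the joint
    vocabulary by offsetting it past all text token IDs."""
    if not 0 <= code < NUM_AUDIO_TOKENS:
        raise ValueError("audio code out of range")
    return TEXT_VOCAB_SIZE + code


def is_audio_token(token_id: int) -> bool:
    """Audio tokens occupy the ID range above the text vocabulary."""
    return token_id >= TEXT_VOCAB_SIZE


def build_prompt(task_text_ids: list[int], audio_codes: list[int]) -> list[int]:
    """Concatenate a tokenized task description (e.g. 'transcribe English')
    with the audio tokens of the input utterance into one flat sequence
    that a decoder-only model can process."""
    return list(task_text_ids) + [audio_token_id(c) for c in audio_codes]


# A task prefix (hypothetical text token IDs) followed by audio tokens:
seq = build_prompt([17, 42, 99], [5, 900, 12])
print(seq)  # → [17, 42, 99, 32005, 32900, 32012]
```

Because both modalities live in one token space, the same next-token prediction objective covers speech recognition (audio in, text out), text-to-speech (text in, audio out), and speech-to-speech translation (audio in, audio out).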
AudioPaLM achieves exceptional performance on speech translation benchmarks and delivers competitive results in speech recognition tasks. It can also perform zero-shot speech-to-text translation for language pairs it was never trained on. Beyond generating speech, AudioPaLM can produce transcripts, either in the original language or directly as a translation.
The model can also preserve paralinguistic information such as speaker identity and intonation, which is often lost in traditional speech-to-text translation systems. The system is expected to outperform existing solutions in terms of speech quality, based on automatic and human evaluation.
“Further research opportunities exist in audio tokenisation, aiming to identify desirable audio token properties, develop measurement techniques, and optimise accordingly. Additionally, there is a need for more established benchmarks and metrics in generative audio tasks to make progress in research, as current benchmarks primarily focus on speech recognition and translation,” read the paper.
Battle of Tech Giants in Music Generation Has Just Begun
However, this is not the first time that Google has launched something in the audio generation space. Back in January, it released MusicLM, a high-fidelity generative model, also built on AudioLM, that creates music from text descriptions. It uses a hierarchical sequence-to-sequence approach to generate consistent music at 24 kHz. Google also introduced MusicCaps, a curated dataset of 5.5k music-text pairs designed for evaluating text-to-music generation.
Google’s rivals are not far behind in this space, either.
Microsoft recently launched Pengi, an audio language model that leverages transfer learning by framing audio tasks as text-generation tasks. By integrating both audio and text inputs, Pengi can generate free-form text output without additional fine-tuning.
Moreover, Meta, spearheaded by Mark Zuckerberg, has introduced MusicGen, which harnesses the transformer architecture to create music based on textual prompts, and can align the generated music with existing melodies. Similar to language models, MusicGen predicts the next section of a musical piece, resulting in coherent and structured compositions. It efficiently processes tokens in parallel using Meta's EnCodec audio tokeniser, and was trained on a dataset of 20,000 hours of licensed music, ensuring access to diverse musical styles and compositions. Meta also released Voicebox, a multilingual generative AI model that can perform various speech generation tasks through in-context learning, even tasks it was not explicitly trained for.
However, Microsoft-backed OpenAI, currently regarded as the leader of the generative AI space, appears to be sitting out the music generation race. The ChatGPT creator has made no recent announcements in this space.