There is mounting evidence that AI will form the backbone of the metaverse. Its role in the metaverse spans several related techniques, including computer vision, natural language processing, blockchain and digital twins.
In February, Meta chief Mark Zuckerberg showcased a demo of what the metaverse would look like at the company’s first virtual event, Inside The Lab. He said the company was working on a new range of generative AI models that would allow users to generate a virtual reality of their own simply by describing it. Zuckerberg announced a slew of upcoming launches, such as Project CAIRaoke, “a fully end-to-end neural model for building on-device assistants”, which would help users communicate more naturally with voice assistants. Meta was also working on a universal speech translator that could offer direct speech-to-speech translation across all languages. A few months later, Meta made good on that promise. However, Meta isn’t the only tech company with skin in the game; companies like NVIDIA have also released AI models for a richer metaverse experience.
No Language Left Behind or NLLB-200
Last week, Meta released a research paper along with the codebase for NLLB-200, a new large language model capable of translating across 200 languages. The model is a definitive step toward building a universal speech translator. Titled ‘No Language Left Behind’, the project covers low-resource languages with fewer than a million publicly available translated sentence pairs.
Compared with older models, NLLB-200 scores 44 percent higher on translation quality. For many African and Indian languages, which are far less resourced than English or European languages, its translations were more than 70 percent accurate. Meta said in its blog that the project will help “democratise access to immersive experiences in virtual worlds.”
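Translation quality gains like the 44 percent figure above are typically measured with BLEU-style metrics, which score n-gram overlap between a machine translation and a human reference. Here is a minimal, simplified sketch of that idea (clipped n-gram precision plus a brevity penalty; real BLEU adds smoothing and 4-gram matching, and all names here are illustrative):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-token subsequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: clipped n-gram precision plus a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each n-gram's count by how often it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalise candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

score = bleu("the cat sat on the mat", "the cat sat on the mat")  # perfect match
```

A perfect match scores 1.0, zero overlap scores 0.0, and a correct but truncated translation lands in between because of the brevity penalty.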
GANverse 3D

Developed by NVIDIA AI Research, GANverse 3D is a model that uses deep learning to turn 2D images into animated 3D versions. Introduced in research papers presented at ICLR and CVPR last year, the tool produces simulations faster and at lower cost. The model uses StyleGAN to automatically produce multiple views from a single image. The application can be imported as an extension into NVIDIA Omniverse to render 3D objects accurately in the virtual world.
The production of 3D models has become essential for the metaverse. Retailers like Nike and Forever21 have built their virtual stores in the metaverse to drive eCommerce sales.
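GANverse 3D's key trick is inferring multiple viewpoints from a single image. As a purely geometric backdrop (this is not NVIDIA's model, just the rendering idea it feeds), here is how one 3D asset yields different 2D views under camera rotation; the cube and function names are illustrative:

```python
import math

def rotate_y(points, angle_deg):
    """Rotate 3D points around the vertical (y) axis."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [(x * cos_a + z * sin_a, y, -x * sin_a + z * cos_a)
            for x, y, z in points]

def project(points):
    """Orthographic projection: drop the depth (z) coordinate."""
    return [(round(x, 3), round(y, 3)) for x, y, z in points]

# A unit cube's corners as a stand-in 3D asset.
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]

# "Render" the same object from several camera angles.
views = {angle: project(rotate_y(cube, angle)) for angle in (0, 45, 90)}
```

Each angle produces a different 2D silhouette of the same object; a model like GANverse 3D learns to run this pipeline in reverse, recovering 3D structure from a single 2D view.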
Visual Acoustic Matching Model or AViTAR
Meta’s Reality Labs team collaborated with the University of Texas to build an AI model that improves sound quality in the metaverse. The model matches the audio to the video in a scene: it transforms an audio clip to make it sound as though it was recorded in a specific environment. The model was trained with self-supervised learning on data from random online videos.
Ideally, a user should be able to watch a favourite memory on their AR glasses and hear the exact sound as it occurred during the actual experience. Meta AI open-sourced AViTAR along with two other acoustic models, a rarity considering that sound is an often-ignored part of the metaverse experience.
Visually-Informed Dereverberation or VIDA
The second acoustic model Meta AI released removes reverberation from audio. The model was trained on a large-scale dataset with a wide variety of realistic audio renderings from 3D models of homes. Reverberation doesn’t just reduce the quality of audio and make it harder to understand; it also degrades the accuracy of automatic speech recognition, so removing it improves recognition results.
What makes VIDA unique is that it uses visual cues alongside the audio modality. Improving on typical audio-only methods, VIDA can enhance speech and identify both the speech and the speaker.
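As a toy illustration of dereverberation's audio-only core (not VIDA itself, which also exploits visual cues), a single delayed echo y[n] = x[n] + a·x[n−d] can be inverted exactly when the delay and gain are known; the learned model's challenge is doing this blindly, for real rooms:

```python
def add_echo(x, delay, gain):
    """Simulate reverberation as one delayed, attenuated copy of the signal."""
    y = list(x)
    for n in range(delay, len(x)):
        y[n] += gain * x[n - delay]
    return y

def remove_echo(y, delay, gain):
    """Invert the echo recursively: x[n] = y[n] - gain * x[n - delay]."""
    x = list(y)
    for n in range(delay, len(y)):
        x[n] = y[n] - gain * x[n - delay]
    return x

clean = [1.0, 0.5, 0.25, 0.0, 0.0, 0.0]
reverberant = add_echo(clean, delay=2, gain=0.6)
recovered = remove_echo(reverberant, delay=2, gain=0.6)
```

The recursion works because each recovered sample only depends on samples already recovered earlier in the signal.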
VisualVoice

The third acoustic model Meta AI released, VisualVoice, extracts speech from video. Like VIDA, VisualVoice was trained on audio-visual cues from unlabelled videos. The model automates speech separation and has important applications such as building technology for people with hearing impairments, enhancing sound in wearable AR devices and transcribing speech from noisy online videos.
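A common backbone in speech-separation systems is predicting a soft mask over a mixture's time-frequency representation. Here is a toy sketch of ratio-mask separation on made-up magnitude spectra (illustrative only; VisualVoice's actual architecture additionally conditions on the speaker's face):

```python
def ratio_mask(source_mag, other_mag):
    """Ideal ratio mask: each source's share of the mixture energy per bin."""
    return [s / (s + o) if (s + o) > 0 else 0.0
            for s, o in zip(source_mag, other_mag)]

def apply_mask(mixture_mag, mask):
    """Recover one source by scaling each mixture bin by its mask value."""
    return [m * w for m, w in zip(mixture_mag, mask)]

# Toy magnitude spectra (illustrative numbers, not real audio).
speech = [4.0, 0.0, 2.0, 1.0]
noise = [0.0, 3.0, 2.0, 1.0]
mixture = [s + n for s, n in zip(speech, noise)]

mask = ratio_mask(speech, noise)      # oracle mask, for illustration
separated = apply_mask(mixture, mask) # recovers the speech component
```

In a real system the mask is predicted by a neural network from the mixture (and, for VisualVoice, the video); the oracle mask here just shows what the network is trained to approximate.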
Omniverse Audio2Face

NVIDIA released the open beta version of Omniverse Audio2Face last year to generate AI-driven facial animation that matches any voiceover. The tool simplifies the long and tedious process of animating for gaming and visual effects. The app also lets users give instructions in multiple languages.
Earlier this year, NVIDIA released an update to the tool with added features such as BlendShape Generation, which helps users create a set of blendshapes from a neutral head mesh. A streaming audio player was also added, allowing audio data to be streamed in from text-to-speech applications.
Audio2Face comes set up with a 3D character model that can be animated from an audio track: the audio is fed into a deep neural network, which drives the character’s facial animation. Users can also edit the character in post-processing to alter its performance.
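A drastically simplified stand-in for this audio-to-animation mapping: per-frame loudness (RMS) driving a jaw-open blendshape weight clamped to [0, 1]. The function names and values are illustrative and not part of Audio2Face, which uses a learned network rather than a loudness rule:

```python
import math

def rms_per_frame(samples, frame_size):
    """Per-frame loudness: root-mean-square amplitude of each audio frame."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples), frame_size)]
    return [math.sqrt(sum(s * s for s in f) / len(f)) for f in frames]

def jaw_weights(samples, frame_size, max_rms=1.0):
    """Map loudness to a jaw-open blendshape weight clamped to [0, 1]."""
    return [min(r / max_rms, 1.0) for r in rms_per_frame(samples, frame_size)]

# Toy audio: one silent frame, then a louder burst.
audio = [0.0] * 4 + [0.8, -0.8, 0.8, -0.8]
weights = jaw_weights(audio, frame_size=4)  # one weight per animation frame
```

The silent frame yields a closed jaw (weight 0) and the loud frame an open one, which is the crudest possible version of what the learned model does per facial control.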