
Meta AI releases “data2vec”, a self-supervised algorithm that works for speech, vision, and text

Meta AI applied it separately to speech, text and images, where it outperformed the previous best single-purpose algorithms for computer vision and speech.


Meta AI has released data2vec, calling it “the first high-performance self-supervised algorithm that works for multiple modalities.” The company applied it separately to speech, text and images, where it outperformed the previous best single-purpose algorithms for computer vision and speech, and proved competitive on NLP tasks.

Meta added that data2vec does not rely on contrastive learning or on reconstructing the input example. The tech giant has also released open-source code and pretrained models.

Method

In the paper titled “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language”, the research team said that data2vec is trained by predicting the model representations of the full input data given a partial view of the input. 

Image: data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

The team added in the paper, “We first encode a masked version of the training sample (model in student mode) and then construct training targets by encoding the unmasked version of the input sample with the same model but when parameterized as an exponentially moving average of the model weights (model in teacher mode.)”

The target representations encode all of the information in the training sample; the student must predict these full representations given only a partial view of the input.
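The teacher’s weights are not trained directly; they track the student. A minimal pure-Python sketch of the exponential moving average (EMA) update described above (the function name and the decay value `tau` are illustrative, not taken from Meta’s released code, which also anneals the decay over training):

```python
def ema_update(teacher_weights, student_weights, tau=0.999):
    """Teacher <- tau * teacher + (1 - tau) * student, elementwise.

    In data2vec the teacher is the same network as the student, but its
    parameters are an exponentially moving average of the student's.
    """
    return [tau * t + (1 - tau) * s
            for t, s in zip(teacher_weights, student_weights)]

# Toy example: the teacher drifts slowly toward the student.
teacher = [0.0, 0.0]
student = [1.0, 2.0]
teacher = ema_update(teacher, student, tau=0.9)
# teacher is now approximately [0.1, 0.2]
```

Because the teacher lags the student, its outputs provide stable regression targets even as the student changes rapidly early in training.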

Architecture

As per the paper, the team uses:

  • A standard Transformer architecture with modality-specific encoding of the input data, taken from previous work.
  • For images, the ViT strategy of encoding an image as a sequence of patches, each spanning 16×16 pixels, fed to a linear transformation.
  • For speech, a multi-layer 1-D convolutional neural network that maps the 16 kHz waveform to 50 Hz representations.
  • For text, pre-processing into sub-word units, which are embedded in distributional space via learned embedding vectors.
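To make the encoder figures above concrete, here is a small sketch (my own arithmetic, not code from the paper) of how many tokens each modality-specific encoder produces:

```python
def num_image_patches(height, width, patch=16):
    """ViT-style patching: the image is cut into non-overlapping
    patch x patch blocks, each flattened and linearly projected."""
    return (height // patch) * (width // patch)

def num_speech_frames(num_samples, rate_in=16_000, rate_out=50):
    """The 1-D conv stack downsamples a 16 kHz waveform to 50 Hz
    representations, i.e. one frame per 320 input samples."""
    return num_samples * rate_out // rate_in

# A standard 224x224 crop yields 14 * 14 = 196 patch tokens.
print(num_image_patches(224, 224))   # 196
# One second of 16 kHz audio yields 50 frames.
print(num_speech_frames(16_000))     # 50
```

Whatever the modality, the encoder’s job is the same: turn raw input into a sequence of token embeddings the shared Transformer can consume.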

Masking

The paper adds that after the input sample has been embedded as a sequence of tokens, the team masks part of these units by replacing them with a learned MASK embedding token and feeds the sequence to the Transformer network.
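A toy sketch of that masking step, using strings to stand in for embedding vectors and a fixed per-token masking probability (the real model replaces embedding vectors with a learned MASK vector, and the masking scheme differs per modality):

```python
import random

def apply_mask(tokens, mask_token="[MASK]", p=0.15, seed=0):
    """Replace each token with the MASK token with probability p.
    Returns the masked sequence and the list of masked positions."""
    rng = random.Random(seed)
    masked, positions = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < p:
            masked[i] = mask_token
            positions.append(i)
    return masked, positions

tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, positions = apply_mask(tokens, p=0.5)
# e.g. masked == ["the", "[MASK]", "sat", ...] depending on the seed
```

The student only ever sees the masked sequence; the unmasked original is reserved for the teacher.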

Computer vision and language

For computer vision, the block-wise masking strategy of Bao et al. (2021) is used; for language, individual tokens are masked.

The model is trained to predict the model representations of the original unmasked training sample based on an encoding of the masked sample, and it predicts representations only for the time-steps that are masked. These are contextualised representations: each encodes its particular time-step along with information from the rest of the sample, owing to the use of self-attention in the Transformer network.

Results

Computer vision: outperforms existing popular models

Image: Meta

The method was pretrained on the ImageNet-1K training set, and the resulting model was fine-tuned for image classification using the labelled data of the same benchmark. It outperformed existing methods at popular model sizes.

Speech: outperforms Meta’s wav2vec 2.0 and HuBERT

Image: Meta

It outperformed wav2vec 2.0 and HuBERT, Meta’s earlier self-supervised algorithms for speech.

Text

Image: Meta AI

It was tested on the GLUE benchmark, where it performed as well as RoBERTa.


Sreejani Bhattacharyya

I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at sreejani.bhattacharyya@analyticsindiamag.com