Meta AI releases “data2vec”, a self-supervised algorithm that works for speech, vision, and text

Applied separately to speech, text and images, it outperformed the previous best single-purpose algorithms for computer vision and speech.

Meta AI has released data2vec, calling it “the first high-performance self-supervised algorithm that works for multiple modalities.” The company applied it separately to speech, text and images, where it outperformed the previous best single-purpose algorithms for computer vision and speech and was competitive on NLP tasks.

Meta added that data2vec does not rely on contrastive learning or on reconstructing the input example. The tech giant has also released open-source code and pretrained models.


In the paper titled “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language”, the research team said that data2vec is trained by predicting the model representations of the full input data given a partial view of the input. 

Image: data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

The team added in the paper, “We first encode a masked version of the training sample (model in student mode) and then construct training targets by encoding the unmasked version of the input sample with the same model but when parameterized as an exponentially moving average of the model weights (model in teacher mode.)”

The target representations encode all of the information in the training sample. The student's task is to predict these representations given only a partial view of the input.
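The teacher's weights, as described in the quote above, follow the student's through an exponential moving average. A minimal sketch of that update is shown here, assuming flat lists of floats in place of real parameter tensors; the function name `ema_update` and the decay value `tau` are illustrative and not taken from the paper.

```python
def ema_update(teacher, student, tau):
    """Move each teacher weight toward the matching student weight.

    Implements teacher <- tau * teacher + (1 - tau) * student.
    """
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


# Toy weights: flat lists of floats stand in for real parameter tensors.
student_weights = [1.0, 2.0, 3.0]
teacher_weights = [0.0, 0.0, 0.0]

# tau = 0.5 is chosen only for readability (an assumption);
# in practice a decay much closer to 1 would be typical.
teacher_weights = ema_update(teacher_weights, student_weights, tau=0.5)
print(teacher_weights)
```

Because the teacher is never updated by gradients, only by this averaging step, it changes more slowly than the student and provides steadier targets.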


As per the paper, the team uses:

  • A standard Transformer architecture, with modality-specific encoding of the input data taken from previous work.
  • For images, the ViT strategy of encoding an image as a sequence of patches, each spanning 16×16 pixels, which are fed to a linear transformation.
  • For speech, a multi-layer 1-D convolutional neural network that maps 16 kHz waveforms to 50 Hz representations.
  • For text, pre-processing into sub-word units, which are embedded in distributional space through learned embedding vectors.
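The figures in the list above imply some simple arithmetic about sequence lengths, sketched below. The 224×224 image size and the seven convolution strides are assumptions (the strides follow the wav2vec 2.0 layout as commonly described); neither is stated in this article.

```python
import math


def num_patches(height, width, patch=16):
    """Count the non-overlapping patch x patch tiles in an image."""
    return (height // patch) * (width // patch)


# 224x224 is an assumed input resolution, not given in the article.
patches = num_patches(224, 224)  # 14 tiles per side

# Going from 16 kHz samples to 50 Hz frames means one frame per N samples.
samples_per_frame = 16_000 // 50

# Assumed strides for a 7-layer 1-D conv stack; their product is the
# overall downsampling factor of the speech encoder.
strides = (5, 2, 2, 2, 2, 2, 2)
total_stride = math.prod(strides)

print(patches, samples_per_frame, total_stride)
```

Under these assumptions, the product of the strides matches the 16 kHz to 50 Hz ratio, which is how a stack of strided convolutions can produce the stated frame rate.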


The paper adds that after the input sample has been embedded as a sequence of tokens, the team masks part of these units by replacing them with a learned MASK embedding token, and then feeds the sequence to the Transformer network.
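The masking step can be sketched roughly as follows. This is a toy illustration only: the helper `mask_tokens`, the masking probability of 0.5, and the fixed `mask_embedding` values are hypothetical, and in a real model the MASK embedding would be a learned parameter.

```python
import random


def mask_tokens(embeddings, mask_embedding, mask_prob, rng):
    """Replace a random subset of token embeddings with the MASK embedding.

    Returns the masked sequence and the list of masked positions.
    """
    masked, positions = [], []
    for i, emb in enumerate(embeddings):
        if rng.random() < mask_prob:
            masked.append(list(mask_embedding))
            positions.append(i)
        else:
            masked.append(list(emb))
    return masked, positions


rng = random.Random(0)  # seeded so the example is repeatable
tokens = [[float(i), float(i) + 0.5] for i in range(6)]  # 6 tokens, dim 2
mask_embedding = [-1.0, -1.0]  # placeholder for a learned vector
masked, positions = mask_tokens(tokens, mask_embedding, mask_prob=0.5, rng=rng)
print(positions)
```

The masked sequence keeps the same length as the original, so the network still sees every position and only loses the content at the masked ones.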

Computer vision and language

For images, the block-wise masking strategy of Bao et al. (2021) is used. For language, tokens are masked.

The model is trained to predict the model representations of the original unmasked training sample based on an encoding of the masked sample. Representations are predicted only for the time-steps that are masked. These are contextualised representations: each one encodes its particular time-step along with other information from the sample, due to the use of self-attention in the Transformer network.
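Restricting the prediction to masked time-steps can be sketched as a regression loss computed over those positions alone. The article does not state which loss is used, so mean squared error here is a stand-in, and the function name `masked_regression_loss` is hypothetical.

```python
def masked_regression_loss(student_out, teacher_out, masked_positions):
    """Mean squared error between student and teacher, masked steps only."""
    total, count = 0.0, 0
    for i in masked_positions:
        for s, t in zip(student_out[i], teacher_out[i]):
            total += (s - t) ** 2
            count += 1
    return total / count if count else 0.0


# Three time-steps, each with a 2-dimensional representation.
teacher_out = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
student_out = [[9.0, 9.0], [2.0, 4.0], [3.0, 3.0]]

# Only steps 1 and 2 were masked, so the large error at step 0 is ignored.
loss = masked_regression_loss(student_out, teacher_out, masked_positions=[1, 2])
print(loss)
```

In this toy case the student is badly wrong at step 0, yet that step contributes nothing, because it was visible in the input and is not a prediction target.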


Computer vision: outperforms existing popular models

Image: Meta

The method is pretrained on the images of the ImageNet-1K training set. The resulting model is then fine-tuned for image classification using the labelled data of the same benchmark. It outperformed existing methods for popular model sizes.

Speech: outperforms Meta’s wav2vec 2.0 and HuBERT

Image: Meta

On speech, it outperformed wav2vec 2.0 and HuBERT, two earlier self-supervised speech algorithms that also came from Meta.


Image: Meta AI
For text, it was tested on the GLUE benchmark, where it performed as well as RoBERTa.

Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at
