Meta AI releases “data2vec”, a self-supervised algorithm that works for speech, vision, and text

Applied separately to speech, text and images, the algorithm outperformed the previous best single-purpose algorithms for computer vision and speech, and was competitive on NLP tasks.


Meta AI has released data2vec, calling it “the first high-performance self-supervised algorithm that works for multiple modalities.” Meta applied it separately to speech, text and images, where it outperformed the previous best single-purpose algorithms for computer vision and speech. It also proved competitive on NLP tasks.

Meta added that data2vec does not rely on contrastive learning or reconstructing the input example. The tech giant has also released open source code and pretrained models.




In the paper titled “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language”, the research team said that data2vec is trained by predicting the model representations of the full input data given a partial view of the input. 

Image: data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

The team added in the paper, “We first encode a masked version of the training sample (model in student mode) and then construct training targets by encoding the unmasked version of the input sample with the same model but when parameterized as an exponentially moving average of the model weights (model in teacher mode.)”

The target representations encode all of the information in the training sample; the student must predict them given only a partial view of the input.
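The teacher described in the quote above is not a second network: it shares the student's architecture, but its weights are an exponentially moving average (EMA) of the student's weights. A minimal sketch of that update, using plain-Python lists in place of real weight tensors (the decay rate `tau` is an illustrative hyperparameter, not a value from the paper):

```python
# Sketch of the EMA teacher update: the teacher's weights slowly track
# the student's weights instead of being trained directly.
def ema_update(teacher_weights, student_weights, tau=0.999):
    """teacher <- tau * teacher + (1 - tau) * student, element-wise."""
    return [tau * t + (1 - tau) * s
            for t, s in zip(teacher_weights, student_weights)]

teacher = [0.0, 0.0]
student = [1.0, 2.0]
teacher = ema_update(teacher, student, tau=0.9)
# With tau=0.9 the teacher moves only 10% of the way toward the student.
```

A high `tau` keeps the teacher's targets stable across training steps, which is what makes them usable as regression targets for the student.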


As per the paper, the team uses:

  • A standard Transformer architecture with modality-specific encoding of the input data, taken from previous work.
  • For vision, the ViT strategy of encoding an image as a sequence of 16×16-pixel patches, each fed to a linear transformation.
  • For speech, a multi-layer 1-D convolutional neural network that maps 16 kHz waveforms to 50 Hz representations.
  • For text, pre-processing into sub-word units, which are embedded in a distributional space through learned embedding vectors.
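The ViT-style image encoding in the list above can be sketched as follows. This is a toy illustration with assumed dimensions (224×224 RGB input, a 512-dimensional token size); in a real model the projection matrix `W` would be learned:

```python
import numpy as np

# Sketch: split an (H, W, C) image into a sequence of flattened
# 16x16 patches, then project each patch with a linear transformation.
def patchify(img, patch=16):
    """Return a (num_patches, patch*patch*C) sequence of flat patches."""
    H, W, C = img.shape
    p = img.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4)          # group by patch position
    return p.reshape(-1, patch * patch * C)

img = np.zeros((224, 224, 3))
seq = patchify(img)            # (196, 768): 14x14 patches of 16*16*3 values
W_proj = np.zeros((768, 512))  # stand-in for the learned linear projection
tokens = seq @ W_proj          # (196, 512) token sequence for the Transformer
```

The speech and text encoders play the same role, producing a token sequence the shared Transformer can consume regardless of modality.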


The paper adds that after the input sample has been embedded as a sequence of tokens, the team masks part of these units by replacing them with a learned MASK embedding token and feeds the sequence to the Transformer network.
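A minimal sketch of that masking step, with illustrative values (the masking probability and the MASK embedding itself are assumptions for the example; the real MASK embedding is a learned vector):

```python
import random

# Sketch: replace a random subset of token embeddings with a learned
# MASK embedding before the sequence enters the Transformer.
def mask_tokens(tokens, mask_embedding, mask_prob=0.15, seed=0):
    """Return the masked sequence and the positions that were masked."""
    rng = random.Random(seed)
    masked, positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_embedding)
            positions.append(i)
        else:
            masked.append(tok)
    return masked, positions
```

The list of masked positions matters because, as the paper notes, the training loss is computed only at those time-steps.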

Masking for computer vision and language

Here, the block-wise masking strategy of Bao et al. (2021) is used for images. For language, individual tokens are masked.

The model is trained to predict the representations of the original, unmasked training sample given an encoding of the masked sample. Representations are predicted only for the time-steps that are masked. These are contextualised representations: each encodes its particular time-step together with information from the rest of the sample, owing to the use of self-attention in the Transformer network.
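The training objective described above can be sketched as a regression restricted to the masked positions. For simplicity this example uses scalar representations and a plain squared error; the paper itself uses a Smooth L1 loss over representation vectors:

```python
# Sketch: regress the student's outputs at the masked positions onto the
# teacher's contextualised targets at those same positions. Outputs at
# unmasked positions do not contribute to the loss.
def masked_prediction_loss(student_out, teacher_targets, masked_positions):
    """Mean squared error over masked time-steps only."""
    diffs = [(student_out[i] - teacher_targets[i]) ** 2
             for i in masked_positions]
    return sum(diffs) / len(diffs)
```

Because the teacher saw the unmasked input, its targets at the masked positions carry information the student never received, which is what forces the student to model context.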


Computer vision: outperforms existing popular models

Image: Meta

The method is tested on the ImageNet-1K training set. The resulting model is fine-tuned for image classification using the labelled data of the same benchmark. It outperformed existing methods for popular model sizes.

Speech: outperforms Meta’s wav2vec 2.0 and HuBERT

Image: Meta

It outperformed wav2vec 2.0 and HuBERT, both earlier Meta self-supervised algorithms for speech.


NLP: performs on par with RoBERTa

Image: Meta AI

For NLP, the model was tested on the GLUE benchmark, where it performed as well as RoBERTa.

Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at
