Meta AI has released data2vec, calling it “the first high-performance self-supervised algorithm that works for multiple modalities.” The team applied it separately to speech, text and images, where it outperformed the previous best single-purpose algorithms for computer vision and speech, and proved competitive on NLP tasks.
In the paper titled “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language”, the research team said that data2vec is trained by predicting the model representations of the full input data given a partial view of the input.
The team added in the paper, “We first encode a masked version of the training sample (model in student mode) and then construct training targets by encoding the unmasked version of the input sample with the same model but when parameterized as an exponentially moving average of the model weights (model in teacher mode.)”
The target representations encode all of the information in the training sample; the student must predict them given only a partial view of the input.
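The teacher described above is not a separate network: its weights are an exponentially moving average (EMA) of the student's weights. A minimal sketch of that update, with a hypothetical decay rate `tau` (the paper actually schedules this value during training):

```python
import numpy as np

def ema_update(teacher_params, student_params, tau=0.999):
    """Update teacher weights as an exponentially moving average of the
    student weights: teacher <- tau * teacher + (1 - tau) * student.
    tau=0.999 is an illustrative value, not the paper's schedule."""
    return {name: tau * teacher_params[name] + (1.0 - tau) * student_params[name]
            for name in teacher_params}

# Toy example: a single weight matrix shared by name.
student = {"w": np.ones((2, 2))}
teacher = {"w": np.zeros((2, 2))}
teacher = ema_update(teacher, student, tau=0.9)
# teacher["w"] is now 0.1 everywhere: 0.9 * 0 + 0.1 * 1
```

Because the teacher lags behind the student, its outputs provide stable training targets without a second set of learned parameters.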
As per the paper, the team uses:
- A standard Transformer architecture with modality-specific encoding of the input data, taken from previous work.
- For images, the ViT strategy of encoding an image as a sequence of patches, each spanning 16×16 pixels, fed to a linear transformation.
- For speech, a multi-layer 1-D convolutional neural network that maps 16 kHz waveforms to 50 Hz representations.
- For text, pre-processing into sub-word units, which are embedded in a distributional space through learned embedding vectors.
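The image encoding in the list above can be sketched as follows: split the image into non-overlapping 16×16 patches, flatten each, and apply a learned linear projection. This is a simplified NumPy illustration of the ViT patch embedding, with a random matrix standing in for the learned projection:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an image of shape (H, W, C) into non-overlapping
    patch x patch patches and flatten each into one token."""
    H, W, C = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return patches

rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))   # a typical ImageNet-sized input
tokens = patchify(img)                 # 14 x 14 = 196 patches of dim 16*16*3 = 768
proj = rng.normal(size=(768, 768))     # stand-in for the learned linear transformation
embedded = tokens @ proj               # token sequence fed to the Transformer
print(embedded.shape)                  # (196, 768)
```

The speech path is analogous in spirit: strided 1-D convolutions reduce the 16 kHz waveform by a factor of 320, yielding one representation every 20 ms (50 Hz).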
The paper adds that after the input sample has been embedded as a sequence of tokens, the team masks part of these units by replacing them with a learned MASK embedding token and feeds the sequence to the Transformer network.
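A minimal sketch of that masking step, assuming a simple per-token masking probability (the actual paper uses modality-specific strategies, such as block masking for images and span masking for speech):

```python
import numpy as np

def mask_tokens(tokens, mask_embedding, mask_prob=0.15, rng=None):
    """Replace a random subset of token embeddings with a learned MASK
    embedding before feeding the sequence to the Transformer.
    mask_prob=0.15 is a hypothetical rate for illustration."""
    rng = rng or np.random.default_rng()
    T, D = tokens.shape
    mask = rng.random(T) < mask_prob       # which time-steps are masked
    masked = tokens.copy()
    masked[mask] = mask_embedding          # overwrite with the MASK embedding
    return masked, mask

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 4))          # 10 tokens of dimension 4
mask_emb = np.zeros(4)                     # stand-in for the learned MASK vector
masked, mask = mask_tokens(tokens, mask_emb, mask_prob=0.5, rng=rng)
```

The boolean `mask` is kept around because the training loss is later computed only at these positions.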
Training targets
The model is trained to predict the representations of the original, unmasked training sample from an encoding of the masked sample, and only for the time-steps that are masked. These targets are contextualised representations: thanks to self-attention in the Transformer network, each one encodes its particular time-step together with information from the rest of the sample.
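The objective above can be sketched as a regression between student predictions and teacher targets, restricted to masked positions. This simplified version uses mean squared error, whereas the paper uses a smooth L1 loss and averages teacher representations over the top K layers:

```python
import numpy as np

def masked_regression_loss(student_out, teacher_targets, mask):
    """Mean-squared-error stand-in for the data2vec objective: regress
    the teacher's contextualised representations, but only at the
    time-steps that were masked in the student's input."""
    diff = student_out[mask] - teacher_targets[mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
student_out = rng.normal(size=(10, 4))     # student predictions per time-step
teacher_targets = student_out.copy()
teacher_targets[::2] += 1.0                # targets differ at even (unmasked) steps
mask = np.zeros(10, dtype=bool)
mask[1::2] = True                          # loss is computed only at odd steps
loss = masked_regression_loss(student_out, teacher_targets, mask)
print(loss)  # 0.0: predictions match the targets at every masked position
```

Restricting the loss to masked time-steps forces the student to infer missing content from context rather than copy its visible input.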
Computer vision: outperforms existing popular models
The method is tested on the ImageNet-1K training set. The resulting model is fine-tuned for image classification using the labelled data of the same benchmark. It outperformed existing methods for popular model sizes.
Speech: outperforms Meta’s wav2vec 2.0 and HuBERT