Meta AI has released data2vec, calling it “the first high-performance self-supervised algorithm that works for multiple modalities.” The team applied it separately to speech, text and images, where it outperformed the previous best single-purpose algorithms for computer vision and speech, and proved competitive on NLP tasks.
In the paper titled “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language”, the research team said that data2vec is trained by predicting the model representations of the full input data given a partial view of the input.
The team added in the paper, “We first encode a masked version of the training sample (model in student mode) and then construct training targets by encoding the unmasked version of the input sample with the same model but when parameterized as an exponentially moving average of the model weights (model in teacher mode.)”
The target representations encode all of the information in the training sample; the student must predict them given only a partial view of the input.
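The teacher described above is not a separate network: its weights are an exponentially moving average (EMA) of the student's weights. A minimal sketch of that update, with a hypothetical decay rate `tau` (the paper actually schedules this value during training):

```python
import numpy as np

def ema_update(teacher_params, student_params, tau=0.999):
    """Update teacher weights as an exponentially moving average of the
    student weights: teacher <- tau * teacher + (1 - tau) * student.
    tau=0.999 is an illustrative value, not the paper's schedule."""
    return {name: tau * teacher_params[name] + (1.0 - tau) * student_params[name]
            for name in teacher_params}

# Toy example: a single weight matrix shared by name.
student = {"w": np.ones((2, 2))}
teacher = {"w": np.zeros((2, 2))}
teacher = ema_update(teacher, student, tau=0.9)
# teacher["w"] is now 0.1 everywhere: 0.9 * 0 + 0.1 * 1
```

Because the teacher lags behind the student, its outputs provide stable training targets without a second set of learned parameters.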
As per the paper, the team uses:
- A standard Transformer architecture with modality-specific encoding of the input data, taken from previous work.
- For images, the ViT strategy of encoding an image as a sequence of patches, each spanning 16×16 pixels, fed to a linear transformation.
- For speech, a multi-layer 1-D convolutional neural network that maps 16 kHz waveforms to 50 Hz representations.
- For text, pre-processing into sub-word units, which are embedded in a distributional space through learned embedding vectors.
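The image encoding in the list above can be sketched as follows: split the image into non-overlapping 16×16 patches, flatten each, and apply a learned linear projection. This is a simplified NumPy illustration of the ViT patch embedding, with a random matrix standing in for the learned projection:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an image of shape (H, W, C) into non-overlapping
    patch x patch patches and flatten each into one token."""
    H, W, C = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return patches

rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))   # a typical ImageNet-sized input
tokens = patchify(img)                 # 14 x 14 = 196 patches of dim 16*16*3 = 768
proj = rng.normal(size=(768, 768))     # stand-in for the learned linear transformation
embedded = tokens @ proj               # token sequence fed to the Transformer
print(embedded.shape)                  # (196, 768)
```

The speech path is analogous in spirit: strided 1-D convolutions reduce the 16 kHz waveform by a factor of 320, yielding one representation every 20 ms (50 Hz).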
The paper adds that after the input sample has been embedded as a sequence of tokens, the team masks part of these units by replacing them with a learned MASK embedding token and feeds the sequence to the Transformer network.
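A minimal sketch of that masking step, assuming a simple per-token masking probability (the actual paper uses modality-specific strategies, such as block masking for images and span masking for speech):

```python
import numpy as np

def mask_tokens(tokens, mask_embedding, mask_prob=0.15, rng=None):
    """Replace a random subset of token embeddings with a learned MASK
    embedding before feeding the sequence to the Transformer.
    mask_prob=0.15 is a hypothetical rate for illustration."""
    rng = rng or np.random.default_rng()
    T, D = tokens.shape
    mask = rng.random(T) < mask_prob       # which time-steps are masked
    masked = tokens.copy()
    masked[mask] = mask_embedding          # overwrite with the MASK embedding
    return masked, mask

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 4))          # 10 tokens of dimension 4
mask_emb = np.zeros(4)                     # stand-in for the learned MASK vector
masked, mask = mask_tokens(tokens, mask_emb, mask_prob=0.5, rng=rng)
```

The boolean `mask` is kept around because the training loss is later computed only at these positions.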
Training targets
The model is trained to predict the representations of the original, unmasked training sample from an encoding of the masked sample, and only for the time-steps that are masked. These targets are contextualised representations: thanks to self-attention in the Transformer network, each one encodes its particular time-step together with information from the rest of the sample.
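The objective above can be sketched as a regression between student predictions and teacher targets, restricted to masked positions. This simplified version uses mean squared error, whereas the paper uses a smooth L1 loss and averages teacher representations over the top K layers:

```python
import numpy as np

def masked_regression_loss(student_out, teacher_targets, mask):
    """Mean-squared-error stand-in for the data2vec objective: regress
    the teacher's contextualised representations, but only at the
    time-steps that were masked in the student's input."""
    diff = student_out[mask] - teacher_targets[mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
student_out = rng.normal(size=(10, 4))     # student predictions per time-step
teacher_targets = student_out.copy()
teacher_targets[::2] += 1.0                # targets differ at even (unmasked) steps
mask = np.zeros(10, dtype=bool)
mask[1::2] = True                          # loss is computed only at odd steps
loss = masked_regression_loss(student_out, teacher_targets, mask)
print(loss)  # 0.0: predictions match the targets at every masked position
```

Restricting the loss to masked time-steps forces the student to infer missing content from context rather than copy its visible input.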
Computer vision: outperforms existing popular models
The method is tested on the ImageNet-1K training set. The resulting model is fine-tuned for image classification using the labelled data of the same benchmark. It outperformed existing methods for popular model sizes.
Speech: outperforms Meta’s wav2vec 2.0 and HuBERT