Google recently introduced a Natural Language Processing (NLP) benchmark – known as XTREME – to encourage the development of multilingual AI models. It is a massively multilingual, multi-task benchmark for evaluating cross-lingual generalisation in NLP models.
Building an NLP system that works not only in English but across all of the world’s 6,900+ languages is more complicated than it sounds. According to the researchers, most of the world’s languages are data-sparse and do not have enough data available to train robust models on their own. Still, many languages share a considerable amount of underlying structure – for instance, words that stem from the same origin, or the use of postpositions to mark temporal and spatial relations.
Why Use Xtreme
Several advances have been witnessed in deep learning techniques over the last few years. With an attempt to learn general-purpose multilingual representations, researchers have developed popular models like mBERT, XLM, XLM-R, among others. However, the evaluation of these methods has been mostly focused on a small set of tasks and for linguistically similar languages.
To mitigate such issues and encourage more research on multilingual learning, the researchers from Google introduced this benchmark. XTREME covers 40 typologically diverse languages and includes nine tasks that collectively require reasoning about different levels of syntax or semantics.
Cross-lingual TRansfer Evaluation of Multilingual Encoders or XTREME is a multi-task benchmark which can be used to evaluate the cross-lingual generalization capabilities of multilingual representations across 40 languages.
This benchmark focuses on the zero-shot cross-lingual transfer scenario: annotated training data is provided only in English, and systems must transfer what they learn to other languages without any target-language training data.
According to the researchers, the languages in this multilingual benchmark were selected according to three main criteria – maximising language diversity, coverage in existing tasks, and availability of training data.
How It Works
The goal of this multilingual benchmark is to provide an accessible way to evaluate cross-lingual transfer learning on a diverse and representative set of tasks and languages. The benchmark consists of nine tasks that fall into four different categories. These are – classification, structured prediction, question-answering and sentence retrieval.
The nine tasks are XNLI (Cross-lingual Natural Language Inference), PAWS-X (Cross-lingual Paraphrase Adversaries from Word Scrambling), POS (Part-Of-Speech Tagging), NER (Named Entity Recognition), XQuAD (Cross-lingual Question Answering Dataset), MLQA (Multilingual Question Answering), TyDiQA-GoldP (Typologically Diverse Question Answering, gold-passage version), BUCC (Building and Using Parallel Corpora) and the Tatoeba dataset. For the two sentence-retrieval tasks, BUCC and Tatoeba, systems find each sentence’s nearest neighbour using cosine similarity, and performance is reported as the error rate.
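The nearest-neighbour retrieval evaluation can be sketched in a few lines. This is a minimal illustration with toy data, not the benchmark’s actual evaluation code; the function name and the synthetic embeddings are illustrative assumptions.

```python
import numpy as np

def nearest_neighbour_error_rate(src_emb, tgt_emb, gold):
    """For each source sentence embedding, find its nearest target
    embedding by cosine similarity, then compute the error rate
    against the gold alignment (gold[i] = index of the correct match)."""
    # Normalise rows so that dot products equal cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    predictions = np.argmax(src @ tgt.T, axis=1)  # nearest neighbour per row
    return float(np.mean(predictions != np.asarray(gold)))

# Toy example: three "sentence embeddings" per language, aligned 0->0, 1->1, 2->2.
rng = np.random.default_rng(0)
src = rng.normal(size=(3, 8))
tgt = src + 0.01 * rng.normal(size=(3, 8))  # near-identical "translations"
print(nearest_neighbour_error_rate(src, tgt, gold=[0, 1, 2]))  # 0.0
```

In the real benchmark the embeddings come from a multilingual encoder such as mBERT, and the error rate is computed over thousands of sentence pairs per language.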
According to them, evaluating performance with XTREME involves three steps. First, the model is pre-trained on multilingual text, using objectives that encourage cross-lingual learning. Next, it is fine-tuned on task-specific English data. Finally, it is evaluated for zero-shot cross-lingual transfer on the task in non-English languages.
One of the advantages of this zero-shot setup is its computational efficiency: a pre-trained model only needs to be fine-tuned on English data for each task and can then be evaluated directly on other languages.
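The “fine-tune once on English, evaluate everywhere” structure of this recipe can be sketched as follows. All names, data and the trivial stand-in for fine-tuning below are illustrative assumptions, not XTREME’s actual API; a real setup would use a multilingual Transformer encoder in place of the toy dictionary.

```python
# A structural sketch of the three-step XTREME recipe with toy stand-ins.

def fine_tune(pretrained, english_examples):
    """Step 2: one fine-tuning run, on English data only.
    The toy "fine-tuning" here just records the task's label set."""
    return {**pretrained, "labels": sorted({y for _, y in english_examples})}

def evaluate(model, examples):
    """Step 3: zero-shot evaluation in a target language (toy accuracy)."""
    hits = sum(y in model["labels"] for _, y in examples)
    return hits / len(examples)

pretrained = {"name": "toy-multilingual-encoder"}  # step 1 stand-in
train_en = [("A soccer game.", "entailment"), ("A man sleeps.", "contradiction")]
test_by_lang = {
    "de": [("Ein Fußballspiel.", "entailment")],
    "sw": [("Mtu analala.", "contradiction")],
}

model = fine_tune(pretrained, train_en)  # fine-tune once on English...
scores = {lang: evaluate(model, xs) for lang, xs in test_by_lang.items()}
print(scores)  # ...then evaluate directly on each target language
```

The point is the cost structure: one fine-tuning run per task, however many languages are evaluated, rather than one run per task-language pair.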
The researchers released the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks. They said, “We hope that XTREME will catalyze research in multilingual transfer learning, similar to how benchmarks such as GLUE and SuperGLUE have spurred the development of deep monolingual models, including BERT, RoBERTa, XLNet, ALBERT, and others.”
Furthermore, the researchers introduced pseudo test sets as diagnostics, covering all 40 languages, by automatically translating the English test sets of the natural language inference and question-answering datasets into the remaining languages.