The giant transformer language models of the past few years, be it GPT-3, Gopher, Jurassic-1, GLaM or MT-NLG, have pushed parameter counts toward the trillion mark, and Mixture of Experts (MoE) techniques power the largest among them. Microsoft leverages the technique frequently for various applications and recently upgraded its translator API, Z-code, with large MoE language models. In this article, Analytics India Magazine explores why MoE is so popular and how Microsoft is leveraging it for Z-code.
The Mixture of Experts approach
A mixture of experts is a deep learning architecture in which multiple expert networks divide a problem space into homogeneous regions. It is an ensemble machine learning technique that scales well. An MoE model comprises small clusters of “neurons” that are activated only under specific conditions. Lower layers of the model extract features, which the expert clusters then evaluate. For instance, an MoE-based translation system could have each expert cluster learn to handle a distinct part of speech or grammatical rule.
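The idea above can be sketched in a few lines of code: a gating network scores each expert for a given input, and the layer's output is the gate-weighted combination of the expert outputs. This is a hypothetical toy illustration with made-up sizes, not Microsoft's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT, N_EXPERTS = 8, 4, 3  # illustrative dimensions

# Each "expert" is its own small linear network.
experts = [rng.standard_normal((D_IN, D_OUT)) for _ in range(N_EXPERTS)]
# The gating network learns which expert suits which input.
gate_w = rng.standard_normal((D_IN, N_EXPERTS))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(x):
    gates = softmax(x @ gate_w)                    # one weight per expert, sums to 1
    outputs = np.stack([x @ w for w in experts])   # (N_EXPERTS, D_OUT)
    return gates @ outputs                         # gate-weighted combination

y = moe_forward(rng.standard_normal(D_IN))
print(y.shape)  # (4,)
```

In a real model the gate and experts are trained jointly, so the gating network learns to route each input to the experts that handle it best.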
These advantages have made MoE the only approach so far demonstrated to scale deep learning models beyond a trillion parameters. It paves the way for capable ML models that can absorb vast amounts of information and power computer vision, speech recognition, natural language processing and machine translation systems.
Project Turing
The Z-code models were developed as part of Microsoft’s AI at Scale and Project Turing initiatives, which seek to develop large models pretrained on vast amounts of textual data to understand nuances of language. In 2020, Microsoft detailed the Turing Universal Language Representation model (T-ULRv2) and announced that the AI model had achieved the top rank on the Google XTREME public leaderboard.
Language problems
A recurring machine learning problem is the inability to use NLP technologies in many local languages. As a result, most large language models work well only when the tasks are in English. This challenge is partly due to the scarcity of datasets in regional languages, which prevents models from being trained in less popular languages. Nonetheless, the recent focus on the importance of having NLP in native languages has sparked efforts in this direction. Microsoft’s Z-code is one of them.
Microsoft’s Z-code
Z-code, Microsoft’s translation API, supports creating AI systems that can speak, see, hear, and understand. It leverages shared linguistic elements in various languages via transfer learning to improve the quality of machine translation and extend the capabilities beyond common languages to underrepresented languages with less available training data.
Z-Code models empower Microsoft Translator, Azure Cognitive Language Services and several products and customers at Microsoft with multilingual capabilities at scale. (Image: Microsoft)
Z-code is claimed to match the performance and quality of other large-scale language models while being more efficient. “Our goal is to help everyone and every organisation on the planet communicate better,” said Xuedong Huang, Microsoft technical fellow and Azure AI chief technology officer. Two important dimensions of meeting this goal are ensuring the best possible translation quality and supporting multiple languages.
The Z-code model is said to improve common language understanding tasks, including named entity recognition, text summarisation, custom text classification and keyphrase extraction. This also marks the first time the company has publicly demonstrated its use of Mixture of Experts models to power machine translation products.
How the model works
These models use a sparse MoE approach, which is more efficient because only a portion of the model is engaged to complete a task, rather than the entire model being activated for every request. This architecture lets Z-code scale massively in model parameters while keeping the amount of compute per request constant.
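A minimal sketch of sparse routing, assuming top-k gating (the common approach in sparse MoE models, not necessarily Z-code's exact scheme): for each input, only the k highest-scoring experts are evaluated, so per-request compute stays fixed even as the total number of experts, and hence parameters, grows. All names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_EXPERTS, TOP_K = 8, 16, 2  # 16 experts, but only 2 run per input

experts = [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]
gate_w = rng.standard_normal((D, N_EXPERTS))

def sparse_moe(x):
    scores = x @ gate_w
    top = np.argsort(scores)[-TOP_K:]   # indices of the k best-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()            # softmax over the selected k only
    # Only TOP_K expert matmuls execute, regardless of N_EXPERTS:
    # doubling the expert count doubles parameters, not compute.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = sparse_moe(rng.standard_normal(D))
print(y.shape)  # (8,)
```

Because the unselected experts are never evaluated, adding experts grows the model's capacity without growing the cost of serving a single request.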
Microsoft collaborated with NVIDIA to deploy the MoE models in production on NVIDIA GPUs through the NVIDIA Triton Inference Server, allowing the developers to achieve a 27x speedup over non-optimised GPU runtimes. In addition, it leveraged CUTLASS and FasterTransformer to optimise the new type of model. The Z-code team also worked with Microsoft’s DeepSpeed researchers to train a massive Mixture of Experts model effectively for production.
The model’s performance
Compared to multilingual transfer learning approaches, which mainly show quality gains in languages with fewer direct translation examples available for training, the Z-code MoE models have shown consistent gains, even in the largest languages. In a blind test with human evaluators, the model’s translations improved by an average of 4% across languages.
The company claims these translation statistics:
* English to French translations by 3.2%
* English to Turkish by 5.8%
* Japanese to English by 7.6%
* English to Arabic by 9.3%
* English to Slovenian by 15%.
The underlying model can be further fine-tuned to perform different language understanding tasks, such as translating between languages, summarising, completing sentences or generating tweets, all in one model. “Those are the pieces, the building blocks that we are using to build a truly differentiated intelligence…and to form production systems that are cost-efficient,” Huang said.