Last week, NVIDIA announced NeMo, a toolkit for developing speech and language models and building conversational AI. NeMo is open source and uses PyTorch as its backend. Neural modules form the building blocks of NeMo models, and with them users can compose and train state-of-the-art neural network architectures.
How Can NeMo Help
NVIDIA NeMo allows users to quickly build, train, and fine-tune conversational AI models. It consists of NeMo core and NeMo collections. NeMo core gives all models a common look and feel, while NeMo collections are groups of domain-specific modules and models.
NeMo has three main parts: the model, the neural module, and the neural type.
The models contain all necessary information regarding training, fine-tuning, data augmentation, and infrastructure details.
A NeMo model consists of:
- The neural network implementation, in which all the neural modules are connected for training and evaluation
- All pre- and post-processing activities such as tokenisation and augmentation
- The dataset classes to be used with this model
- The optimisation algorithm and the learning rate schedule
- Other infrastructure details
Neural modules are the conceptual building blocks of encoder-decoder architectures, each responsible for a different task. At its core, a neural module is a logical part of the neural network that takes a set of inputs and computes a set of outputs.
The inputs and outputs are typed with neural types, which are pairs that describe the tensor's axes layout (axis order and dimensions) and the semantics of its elements. This typing enables semantic safety checks between NeMo modules. The kinds of input a neural module accepts and the outputs it returns are described by its input_types and output_types properties respectively.
For comparison, a neural module can be thought of as an abstraction that sits between a layer and a full neural network: it corresponds to a conceptual piece of the network, for example, an encoder, a decoder, or a language model.
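To make the idea of typed inputs and outputs concrete, here is a minimal sketch in plain Python. It does not use NeMo itself: ToyNeuralType, ToyModule, ToyEncoder, ToyDecoder, and check_connection are hypothetical names invented for illustration, and the axis labels and element names are assumptions. It only shows how declaring input_types and output_types lets a mismatch be caught before any tensors flow.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ToyNeuralType:
    """Hypothetical stand-in for a neural type: axis layout plus element semantics."""
    axes: Tuple[str, ...]  # e.g. ("B", "T") for batch x time
    elements: str          # what the values mean, e.g. "audio_signal"


class ToyModule:
    """Hypothetical module that declares what it accepts and what it returns."""
    input_types: Dict[str, ToyNeuralType] = {}
    output_types: Dict[str, ToyNeuralType] = {}


class ToyEncoder(ToyModule):
    input_types = {"audio": ToyNeuralType(("B", "T"), "audio_signal")}
    output_types = {"encoded": ToyNeuralType(("B", "D", "T"), "encoded_representation")}


class ToyDecoder(ToyModule):
    input_types = {"encoded": ToyNeuralType(("B", "D", "T"), "encoded_representation")}
    output_types = {"log_probs": ToyNeuralType(("B", "T", "D"), "log_probabilities")}


def check_connection(producer: ToyModule, consumer: ToyModule) -> None:
    """Raise TypeError unless every consumer input matches a same-named producer output."""
    for name, expected in consumer.input_types.items():
        actual = producer.output_types.get(name)
        if actual is None:
            raise TypeError(
                f"{type(consumer).__name__} expects an input named {name!r}, "
                f"which {type(producer).__name__} does not produce"
            )
        if actual != expected:
            raise TypeError(f"port {name!r}: got {actual}, expected {expected}")


# Encoder output matches decoder input, so this passes silently.
check_connection(ToyEncoder(), ToyDecoder())
```

Reversing the arguments, check_connection(ToyDecoder(), ToyEncoder()), raises a TypeError because the decoder produces no output named "audio". That is the kind of mistake a semantic check between modules is meant to catch.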
Conversational AI encompasses three main areas of artificial intelligence research: automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS, or speech synthesis). NeMo helps practitioners access, re-use, and build on pre-trained models in this field.
NeMo comes with an extendable set of collections covering ASR, NLP, and TTS.
The NeMo speech collection (nemo_asr) has models and building blocks for speech and command recognition, speaker identification and verification, and voice activity detection. The NLP collection (nemo_nlp) has models for question answering, punctuation, and named entity recognition, among others. The text-to-speech collection (nemo_tts) contains spectrogram generators and vocoders, which together generate synthetic speech.
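As a rough illustration of re-using a pre-trained model from the speech collection, the sketch below wraps the load-and-transcribe steps in a helper. It assumes the NeMo toolkit with its ASR collection is installed and that the checkpoint name "QuartzNet15x5Base-En" is available to download. The helper name transcribe_with_pretrained is hypothetical, and exact method signatures may differ between NeMo releases.

```python
from typing import List


def transcribe_with_pretrained(audio_paths: List[str],
                               model_name: str = "QuartzNet15x5Base-En") -> list:
    """Load a pre-trained ASR checkpoint and transcribe the given audio files."""
    # Imported lazily so this sketch can be loaded without NeMo installed.
    import nemo.collections.asr as nemo_asr

    # from_pretrained fetches the named checkpoint the first time it is used.
    model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=model_name)

    # The audio paths are passed positionally; the keyword name for this
    # argument has changed across NeMo versions (assumption to verify).
    return model.transcribe(audio_paths)
```

A call such as transcribe_with_pretrained(["sample.wav"]) would return the transcriptions, where "sample.wav" stands for any audio file in a format the model accepts.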
NeMo models are built on PyTorch and PyTorch Lightning, and NeMo also integrates with Hydra; both Lightning and Hydra come from the PyTorch ecosystem. Integrating with PyTorch Lightning makes it possible to invoke actions quickly through the trainer API, and it brings features such as logging, checkpointing, and overfit checking, among others. Hydra, a configuration framework, gives the user flexibility and error-checking capabilities.
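The kind of error checking a typed configuration provides can be sketched with the Python standard library alone. This is not Hydra's API: ToyTrainConfig and apply_overrides are hypothetical names, and the fields and default values are assumptions chosen for illustration. The point is only that a misspelled key or a wrongly typed value is rejected before training starts.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict


@dataclass(frozen=True)
class ToyTrainConfig:
    """Hypothetical training settings with typed defaults."""
    learning_rate: float = 1e-3
    max_epochs: int = 10
    optimizer: str = "adam"


def apply_overrides(base: ToyTrainConfig, overrides: Dict[str, Any]) -> ToyTrainConfig:
    """Return a copy of base with overrides applied; reject unknown keys and wrong types."""
    expected = {f.name: f.type for f in fields(base)}
    for key, value in overrides.items():
        if key not in expected:
            raise KeyError(f"unknown config key {key!r}; valid keys: {sorted(expected)}")
        if not isinstance(value, expected[key]):
            raise TypeError(
                f"{key!r} expects {expected[key].__name__}, got {type(value).__name__}"
            )
    return replace(base, **overrides)


# A valid override produces a new config and leaves the defaults untouched.
config = apply_overrides(ToyTrainConfig(), {"max_epochs": 3})
```

Passing {"max_epoch": 3} raises a KeyError and {"max_epochs": "three"} raises a TypeError, so both mistakes surface immediately instead of partway through a training run.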
During the recent NVIDIA GTC 2020 event, NVIDIA announced the release of Jarvis, a GPU-accelerated application framework that uses NeMo. The company claims that it will allow video and speech data to be used to build state-of-the-art conversational AI services. As per the company release, Jarvis addresses challenges such as large data volumes and the computational resources needed to train models by offering an end-to-end learning pipeline for conversational AI. Several organisations have already adopted it, including Voca, which offers an AI agent for call centre support and counts Toshiba and AT&T among its clients, and Kensho, which provides automatic speech transcription services for finance and business.
More companies are expected to adopt NeMo for developing conversational AI in the future.