Natural language processing (NLP) is a transformative technology that has generated a lot of buzz in the last few years for its large-scale impact. But most of the research and most of the models built so far focus on the English language. Even where models have been built for other languages, they have mainly covered the widely spoken ones.
There are around 7,000 languages spoken in the world, with Asia accounting for the largest share of them. If we do not cater to this huge spectrum of languages, we leave a large section of the world out of the benefits of technological advancement. NLP and speech recognition models need to be developed for other languages to make technology more inclusive.
Difficult to build
Though researchers and tech companies have realised that bringing NLP to other languages will be very useful from a business as well as a societal standpoint, building models in those languages is difficult because suitable and sufficient data is hard to come by. Building an NLP model requires a large dataset to train and test the algorithm. Even when a language is spoken by a large population, obtaining such a dataset may still be difficult.
If only a small dataset is available, it needs to contain distinct patterns for a model to be built from it. The data also needs to be cleaned. Many languages use symbols and other characters that not all computer systems recognise without proper modification, and making the data suitable for such systems may be time-consuming and costly. If a company develops a model for other languages, it must open-source it: this is still an emerging area, and others can learn from the model and build on it.
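The cleaning step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not a method from any of the projects mentioned here: the function name `clean_text` is hypothetical, and the choice of Unicode NFC normalisation is one common convention. It shows how the same letter typed in two different ways can be brought to a single form, and how stray control characters can be dropped before training.

```python
import unicodedata


def clean_text(raw: str) -> str:
    """Hypothetical helper: normalise text so one letter has one encoding."""
    # NFC merges a base letter and its combining mark into a single
    # code point where Unicode defines one (e.g. "n" + U+0303 -> "ñ").
    text = unicodedata.normalize("NFC", raw)

    kept = []
    for ch in text:
        # "Cc" is the Unicode category for control characters.
        # Newlines and tabs are kept so that words stay separated.
        if unicodedata.category(ch) == "Cc" and ch not in "\n\t":
            continue
        # Format characters (category "Cf") are deliberately kept:
        # joiners such as U+200C and U+200D carry meaning in several scripts.
        kept.append(ch)

    # Collapse any run of whitespace into a single space.
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    decomposed = "n\u0303"  # "n" followed by a combining tilde
    print(len(decomposed), len(clean_text(decomposed)))
```

Whether NFC is the right target form depends on the language and on the tools used downstream, so a real pipeline would need to check this choice against the script in question.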
Some progress has been made in the last few years in building models across a diverse spectrum of languages.
- In 2020, Meta introduced M2M-100, a multilingual machine translation (MMT) model that translates between any pair of 100 languages without relying on English data. It said that M2M-100 is trained on a total of 2,200 language directions. The goal of building such a model is to improve the quality of translations worldwide, especially for those who speak low-resource languages, claimed Meta.
- MIT recently introduced PARP (Prune, Adjust and Re-Prune) for self-supervised speech recognition. It said that PARP is a new technique that reduces the computational complexity of an advanced machine learning model. This makes it practical to apply speech recognition to rare or uncommon languages like Wolof, spoken in West Africa.
- The University of Waterloo’s David R. Cheriton School of Computer Science introduced AfriBERTa, which uses deep-learning techniques to achieve state-of-the-art results for low-resource languages. It said that AfriBERTa works specifically with 11 African languages, such as Amharic, Hausa, and Swahili, spoken collectively by more than 400 million people. The model achieves output comparable to the best existing models despite learning from just one gigabyte of text, as per the university.
- In September, IIT Bombay launched Project Udaan, which helps translate textbooks and other study material in engineering and other streams from English to Hindi and other Indian languages. It is a donation-based, artificial intelligence-based translation ecosystem.
How will it help?
Natural language processing is used in a diverse range of areas, such as summarisation, question answering, sentence similarity, translation, token classification and many more. If it can reach less popular languages, it will be immensely beneficial in several ways:
- Understanding and analysing emotions in comments on social media platforms and e-commerce websites, where a large section of people express themselves in their mother tongue rather than in English. This can give businesses valuable feedback for improvement.
- Better customer service and engagement, as most customers prefer to talk to chatbots or virtual assistants in their native language.
- Expansion to more diverse languages will improve the outcomes and accuracy of the technology.
- The right content can be made available to users in their mother tongue, based on their choices and past patterns.
- The spread of technology to less popular languages will benefit society.
For society to progress, we have to make sure that the benefits of technology are accessible to everyone. Extending new-age technologies across language boundaries would be a great start.