Aindriya Barua, a 24-year-old Software Engineer from Tripura, has built a bot that many tech giants have struggled to: an Anti-Hate-India Reddit Bot that can detect hate speech in the code-mixed 'Hinglish' language.
Fed up with hate speech on social media, and after seeing its impact on the mental health of people around her, Barua decided to build a bot herself that can detect hate speech in the widely used 'Hinglish' and tackle the issue.
Barua currently works as a Software Engineer with Bangalore-based healthcare technology company Cerner Corporation. A B Tech graduate of Amrita Vishwa Vidyapeetham, Barua is also an artist. She expresses her views about the world through paintings, but when things got out of control, she turned to her software knowledge and built the bot.
In an interview with Analytics India Magazine, she shared the reasons and the way she built the Anti-Hate-India Reddit Bot. Excerpts:
AIM: Why did you develop a Reddit bot to fight Hindi-English code-mixed hate speech? What was the need for such a tool?
Aindriya Barua: Social media and cheaper internet post 2016-17 have brought the world to our fingertips, quite literally. Just as they have decentralised the power of expression to every person out there, they have also enabled strangers hidden behind fancy usernames to spew hate in the guise of opinions, often forgetting that there are real, vulnerable humans on the receiving end.
I was randomly scrolling through my Reddit feed and came across a post. It didn't take me long to notice how the comments went from "dank jokes" to offensive to downright hurtful, real quick, with the hate often directed at oppressed communities: caste, sexual, gender, or religious minorities. As an opinionated woman, I often use social media to express my views through art, and hence I often end up on the receiving end of violent trolling and cyberbullying myself. It made me angry and sent me rummaging for a solution using my software and data science knowledge.
To solve this, I started my research and eventually found a paper that had collected social media data from Facebook and annotated it as Hindi-English code-mixed text: a dataset of over 12,000 'Hinglish' sentences. I decided to try it, and processed and normalised the data. I used BERT embeddings with binary classification, since only two classes are required: hate speech or not. I converted 'yes' labels to 1 and 'no' to 0 and trained a BERT-based model, which became the hate-speech detection model.
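The data-preparation step she describes can be sketched in a few lines. This is a minimal, illustrative version (the function names, normalisation rules, and sample rows are assumptions, not her actual code); the BERT fine-tuning itself would then run on the cleaned `(text, label)` pairs.

```python
import re

def normalise(text: str) -> str:
    """Basic normalisation for noisy code-mixed social-media text:
    lowercase, drop URLs, strip punctuation, collapse whitespace.
    (Illustrative steps; the actual pipeline is not shown in the interview.)"""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # drop links
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # keep alphanumerics only
    return re.sub(r"\s+", " ", text).strip()

def encode_label(label: str) -> int:
    """Map the dataset's 'yes'/'no' hate-speech labels to 1/0."""
    return 1 if label.strip().lower() == "yes" else 0

# Hypothetical sample rows standing in for the 12,000-sentence dataset.
raw = [
    ("Yaar ye post toh mast hai!! https://redd.it/abc", "no"),
    ("some hateful comment here", "yes"),
]
data = [(normalise(text), encode_label(label)) for text, label in raw]
```

The resulting pairs can be fed to any Hugging Face `Trainer`-style fine-tuning loop with a sequence-classification head on top of BERT.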
AIM: What does the Anti-Hate-India Reddit Bot do?
Aindriya Barua: This bot can be used by the admin of any subreddit. They can download the code, which is open-source on my GitHub, follow the tutorial, and detect hate speech in comments and posts in real time. Once active on a subreddit, the bot constantly monitors the comments being posted; if hate speech is detected from a user, the bot replies with a warning. The bot gives three such warnings; the fourth time the same user posts hate speech, it permanently bans the user ID from posting on the subreddit. The person also receives a message saying they have been banned, along with the reason: the use of hate speech.
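The three-warnings-then-ban rule she describes reduces to simple per-user bookkeeping. Here is a sketch of just that decision logic (class and method names are my own; the actual Reddit API calls via PRAW for replying and banning are omitted):

```python
from collections import defaultdict

MAX_WARNINGS = 3  # three warnings; the fourth offence is a permanent ban

class StrikeTracker:
    """Tracks hate-speech strikes per user and decides warn vs. ban.
    (Illustrative sketch, not the bot's actual source code.)"""

    def __init__(self):
        self.strikes = defaultdict(int)

    def record_hate_speech(self, user: str) -> str:
        self.strikes[user] += 1
        if self.strikes[user] > MAX_WARNINGS:
            return "ban"   # in the bot: permanently ban the user ID
        return "warn"      # in the bot: reply with a warning comment

tracker = StrikeTracker()
actions = [tracker.record_hate_speech("troll42") for _ in range(4)]
# three warnings, then a ban on the fourth offence
```

In the real bot, a "warn" would trigger a `comment.reply(...)` and a "ban" a subreddit-level ban plus a message explaining the reason.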
AIM: Why did you build the model using BERT?
Aindriya Barua: I used BERT because it is bidirectional and understands the context of a word before embedding it. BERT uses transformer models, which are a game-changer in text embedding. BERT reads a sequence of text from both left and right simultaneously and captures context. For example, in "they walked by the river bank" and "they went to the bank to deposit money," the same word 'bank' means different things. As humans, we understand that only by reading the words to the left and right of it. BERT can simulate this understanding.
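The limitation she is pointing at can be illustrated without any model: a non-contextual (static) embedding assigns one fixed vector per word, so 'bank' gets the identical vector in both sentences, while a contextual model like BERT would produce different representations. A toy demonstration (the lookup table and vectors are invented for illustration):

```python
# Toy static embedding: one fixed vector per word, regardless of context.
static_emb = {
    "bank": [0.2, 0.7],
    "river": [0.1, 0.9],
    "deposit": [0.8, 0.3],
}

def embed(sentence: str) -> dict:
    """Look up a fixed vector for each known word (non-contextual)."""
    return {w: static_emb[w] for w in sentence.split() if w in static_emb}

v1 = embed("they walked by the river bank")["bank"]
v2 = embed("they went to the bank to deposit money")["bank"]
# A static model cannot tell the two senses of 'bank' apart:
same = (v1 == v2)
```

BERT, by contrast, conditions each token's vector on its left and right neighbours, so the two occurrences of 'bank' come out different.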
During my college days, I wrote a paper analysing the performance of different word-embedding techniques on the Named Entity Recognition task for Indian languages. From that analysis, I concluded that contextual embedding models are definitely better than non-contextual models like Word2Vec and FastText. I used DistilBERT simply because of a RAM limitation: I did not have the resources to run a better-performing model like XLM-RoBERTa. I worked on my personal laptop, without any additional GPUs to run more resource-intensive algorithms.
AIM: Did you face any technical hurdles while developing it? What are the industry insights and analysis around this problem?
Aindriya Barua: Companies like Twitter and Instagram are trying to do hate-speech recognition that automatically hides such content, but their solutions do not work in the Indian scenario. Even after being 'reported,' a lot of content is never taken down and simply stays up. I believe the major reason for this is that Indian languages are resource-poor. English, French, and a handful of other European languages are the only resource-rich ones, which has helped companies build ML models to detect hate speech. Like the country itself, Indian languages are very diverse, with multiple dialects. The reason Indian languages are resource-poor is that, until recently, they had almost no digital presence. It has only been a few years since cheaper internet brought many people online to speak in their own languages. There was simply not enough data available to train ML models to detect hate speech. And now the data being produced is immense, and we no longer have any control over it.
Beyond being resource-poor, another big problem is that the keyboards Indians use on computers, laptops, tablets, and smartphones are mostly English. We are essentially typing, for example, Punjabi in the Roman script, generating huge amounts of unstructured code-mixed data. One of the biggest hurdles is finding usable datasets for such complex Indian languages in code-mixed form. NLP research on Indian languages is not as developed as it is for the resource-rich languages mentioned above.
AIM: Why develop a bot only for Reddit? In terms of scalability, what is the potential of the bot that you have developed?
Aindriya Barua: Reddit is a forum where the admin has certain rights and acts as a moderator. The moderator can remove content, ban content, or even ban users. On Instagram or Twitter, there is no moderator to do this job. The bot I have made can only be used by an admin/moderator of a forum that grants them such rights. It can also be used with any platform, such as Discord or Telegram, that provides APIs that can be integrated with ML models. For pages with over 1,000 followers, it has to be multi-threaded to handle a large influx of comments in real time.
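The multi-threading she mentions for high-traffic pages can be sketched with a worker pool draining a shared comment queue. This is an assumed design, not her implementation; `is_hate_speech` here is a trivial stand-in for the BERT classifier:

```python
from concurrent.futures import ThreadPoolExecutor
import queue

def is_hate_speech(text: str) -> bool:
    """Stand-in for the real BERT classifier call (illustrative only)."""
    return "hate" in text.lower()

# Incoming comments, as they might stream in from the platform's API.
comments = queue.Queue()
for c in ["nice post!", "so much hate here", "lovely art", "I hate everyone"]:
    comments.put(c)

flagged = []  # list.append is atomic under CPython's GIL

def worker():
    """Drain the queue, classifying each comment."""
    while True:
        try:
            c = comments.get_nowait()
        except queue.Empty:
            return
        if is_hate_speech(c):
            flagged.append(c)  # in the bot: warn or ban via the platform API

# Several threads process the comment influx concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(4):
        pool.submit(worker)
```

With a real model, each worker would batch comments before inference, since the classifier call, not the queue handling, dominates the cost.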
While this is only a small step, I really believe this kind of work and research can contribute towards building a safer and kinder online space for everyone, irrespective of their gender, sexual orientation, religious beliefs, etc., in the near future.