Behind ChatGPT’s Wisdom: 300 Bn Words, 570 GB Data

As ChatGPT continues to enthral the world, users share their experiences with the human-like chatbot whose responses have taken the internet by storm.

Users have put ChatGPT to a host of tasks, ranging from solving mathematical problems to generating code and writing essays. The chatbot has also donned the cap of a confidante that offers suggestions for improving relationships, gives health tips and can even draft jokes for your next stand-up performance.

Ever wondered how it is able to pull this off so seamlessly? The answer to this lies in its speed and understanding of complex topics.

OpenAI recently explained on its website how ChatGPT works. It said that ChatGPT is a sibling model to InstructGPT, which is trained to follow an instruction in a prompt and provide a detailed response.

According to an article published on BBC Science Focus, the model was trained on text databases from the internet amounting to a massive 570 GB of data, sourced from books, Wikipedia, research articles, web texts, websites and other forms of writing on the net. Approximately 300 billion words were fed into the system.

As a large language model, ChatGPT works on probability: given the words so far, it predicts the word most likely to come next in a sentence. To get to this stage, the model went through a supervised training phase.
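
To make the idea of probability-based next-word prediction concrete, here is a minimal sketch using simple word-pair counts. It is only an illustration: ChatGPT uses a large neural network rather than a count table, and the toy corpus and function names below are assumptions made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    follower_counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follower_counts[current_word][next_word] += 1
    return follower_counts


def next_word_probabilities(model, word):
    """Turn the raw follower counts for `word` into probabilities."""
    counts = model.get(word.lower(), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # the word never appeared with a follower in training
    return {candidate: count / total for candidate, count in counts.items()}


def predict_next_word(model, word):
    """Return the most probable next word, or None for an unseen word."""
    probabilities = next_word_probabilities(model, word)
    if not probabilities:
        return None
    return max(probabilities, key=probabilities.get)


# Toy corpus (hypothetical); real systems train on hundreds of billions of words.
corpus = "the tomato is a fruit . the tomato is red . the potato is a vegetable ."
model = train_bigram_model(corpus)

print(next_word_probabilities(model, "is"))
print(predict_next_word(model, "the"))
```

In this toy corpus, "the" is followed by "tomato" twice and "potato" once, so "tomato" is the predicted next word.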


The model was fed inputs like “Is tomato a fruit or a vegetable?” by a team that already had the correct answer, or output, in mind. The model's own answer is not guaranteed to be correct, as it depends on the prompt and the nature of the query. If the model gets it wrong, the correct answer is fed back into the system, training it towards the right responses and helping it build on its knowledge bank.

In the next stage, the model offers several different responses and a human annotator ranks them from most appropriate to least, which trains the system to compare answers.
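
As a rough sketch of how a human ranking can become a training signal for comparison, the snippet below splits a ranked list into (preferred, rejected) pairs and scores each pair with a pairwise logistic loss. That loss is a common choice for learning from comparisons; the article does not specify the exact objective OpenAI uses, and the response labels and scores here are hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Given responses ordered best to worst, return (preferred, rejected) pairs."""
    # combinations() keeps the input order, so the first item of each pair
    # is always the one the annotator ranked higher.
    return list(combinations(ranked_responses, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Logistic loss: small when the preferred response scores higher."""
    difference = score_preferred - score_rejected
    return math.log(1.0 + math.exp(-difference))  # equals -log(sigmoid(difference))


# Hypothetical annotator ranking of three responses, best first.
ranking = ["response_A", "response_B", "response_C"]
pairs = ranking_to_pairs(ranking)

# Hypothetical scores that a scoring model currently assigns to each response.
scores = {"response_A": 1.5, "response_B": 0.2, "response_C": 0.9}

for preferred, rejected in pairs:
    loss = pairwise_loss(scores[preferred], scores[rejected])
    print(f"{preferred} over {rejected}: loss = {loss:.3f}")
```

The pair where the scores disagree with the annotator (response_B ranked above response_C but scored lower) gets the largest loss, which is the signal that would push the scoring model towards the human ordering.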

This training is what puts ChatGPT a step ahead of other existing models: it builds on its knowledge, learns to recognise the nature of prompts and questions, and responds accordingly, which lets it handle a very wide range of questions.

Reinforcement learning to the rescue 

What sets this technology apart is that, during training, it keeps learning while guessing what the next word should be, steadily improving its understanding of prompts and questions to become the ultimate know-it-all.

Because it is trained with reinforcement learning from human feedback, the model is tuned towards appropriate responses based on the nature of prompts. ChatGPT can also be thought of as a smarter version of autocomplete software: when you start typing a sentence, it predicts what comes next.

Limitations 

The model, however, still fails on many fronts. In one example, the response to a prompt failed to answer how it relates to GANs, which suggests the model needs more layers of verification to source information better.

In addition, aware that AI can be manipulated into producing biased or harmful content, OpenAI says it has trained the chatbot to recognise such biases and to restrict its responses to prompts that appear inappropriate.

As for whether ChatGPT has the potential to replace developers, a user in a Twitter thread on the question explains that while the model is capable of producing human-like text, it is still limited in its ability to understand and manipulate complex systems the way a human developer can. In addition, a language model like ChatGPT is not capable of independent thought or creativity, which are important skills for a developer to have. In short, while large language models like ChatGPT may be able to assist developers in certain tasks, they will not be able to replace them completely.

The discussion around ChatGPT has not escaped the crypto community either, where it was among the top trending topics. The hype around the chatbot led crypto punters to buy AI-related tokens, sending token prices up by as much as 77%, according to CoinGecko, a digital currency price and information data platform.

Among the tokens that benefited the most, DeepBrain Chain (DBC) posted the biggest gain with a 76.7% jump in token price within a week of ChatGPT's launch, followed by Numeraire (NMR), the largest AI token by market capitalisation, whose price rose by 54.5% in the same period, from $11.26 to $17.40.
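
The Numeraire figure can be checked against the two quoted prices with a quick percentage-change calculation. This is a small sketch; the function name is an assumption made up for this example.

```python
def percent_change(old_price, new_price):
    """Percentage change from old_price to new_price."""
    return (new_price - old_price) / old_price * 100.0


# NMR prices quoted above, in US dollars.
nmr_change = percent_change(11.26, 17.40)
print(f"{nmr_change:.1f}%")  # prints "54.5%"
```

The result matches the 54.5% rise reported for the token.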

Aparna Iyer
Aparna Iyer has covered various sectors spanning education, wildlife, culture and law for close to a decade. She now writes on technology and is keen to unearth its capability for public good.
