Behind ChatGPT’s Wisdom: 300 Bn Words, 570 GB Data

As ChatGPT continues to enthral the world, users share their experiences with the human-like chatbot whose responses have taken the internet by storm.
Users have put ChatGPT to a host of tasks, ranging from solving mathematical problems to generating code and writing essays. The chatbot has also donned the cap of a confidante that can suggest ways to improve relationships, offer health tips and even draft jokes for your next stand-up performance.

Ever wondered how it is able to pull this off so seamlessly? The answer to this lies in its speed and understanding of complex topics.

Recently, OpenAI explained on its website how ChatGPT actually works. It said that ChatGPT is a sibling model to InstructGPT, which is trained to follow an instruction in a prompt and provide a detailed response.

According to an article published on BBC Science Focus, the model was trained using databases from the internet that included a massive 570 GB of data sourced from books, Wikipedia, research articles, web text, websites and other forms of content and writing on the net. Approximately 300 billion words were fed into the system.

Being a large language model, it works on probability, which is how it is able to predict the next word in a sentence. This was made possible by a supervised training phase.
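To make the idea of predicting the next word by probability concrete, here is a minimal sketch in Python. It is a toy bigram counter, far simpler than the neural network behind ChatGPT, and the corpus and function names are invented for illustration only.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration. A real model sees hundreds of
# billions of words, not three short sentences.
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."


def build_bigram_counts(text):
    """Count how often each word is followed by each other word."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follow-counts for `word` into probabilities."""
    followers = counts[word]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}


def predict_next(counts, word):
    """Return the most probable next word, or None for an unseen word."""
    probs = next_word_probabilities(counts, word)
    if not probs:
        return None
    return max(probs, key=probs.get)


counts = build_bigram_counts(corpus)
print(next_word_probabilities(counts, "the"))
print(predict_next(counts, "sat"))
```

In this toy corpus, "cat" follows "the" two times out of six, so it gets the highest probability. A large language model does the same kind of ranking, but conditions on the whole preceding text rather than on a single word.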

During supervised training, the model was fed inputs like “Is tomato a fruit or a vegetable?” along with the correct answer, which the team feeding the inputs already had. This does not guarantee a correct response, however, as the output depends on the prompt and the nature of the query. If the model got it wrong, the correct answer was fed back into it, training it towards the right responses and helping it build its knowledge bank.

It then goes through the next stage, where it offers several different responses and a human annotator ranks them from most appropriate to least, training the system to compare.
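One way to picture how a ranking trains a system to compare is sketched below in Python. The article does not say how the rankings are used internally, so this rests on assumptions: each ranking is expanded into "preferred versus rejected" pairs, and a standard pairwise logistic loss penalises a scorer that rates the rejected response higher. The response labels and scores are hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps the input order, so the first item of each
    # pair is always the one the annotator ranked higher.
    return list(combinations(ranked_responses, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Logistic loss: small when the preferred response scores higher."""
    return math.log(1.0 + math.exp(-(score_preferred - score_rejected)))


# Hypothetical annotator ranking, best first.
ranking = ["response A", "response B", "response C"]
pairs = ranking_to_pairs(ranking)

# Hypothetical scores from a scorer that has not been trained yet.
scores = {"response A": 0.2, "response B": 1.5, "response C": -0.3}

for preferred, rejected in pairs:
    loss = pairwise_loss(scores[preferred], scores[rejected])
    print(f"{preferred} > {rejected}: loss = {loss:.3f}")
```

The pair where the scorer disagrees with the annotator (A ranked above B, but B scored higher) produces the largest loss, which is the signal a training step would use to correct the scorer.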

The model is a step ahead of other existing models because ChatGPT continues to learn and build on its knowledge, understanding the nature of prompts and questions and responding accordingly, which enables it to take on a wide range of questions.

Reinforcement learning to the rescue 

What sets this technology apart is that it continues to learn while guessing what the next word should be, constantly improving its understanding of prompts and questions to become the ultimate know-it-all. 

As it is trained using reinforcement learning, the model is constantly learning and updating itself to give appropriate responses based on the nature of the prompt. ChatGPT can also play the role of, say, a smarter version of autocomplete software: when you start typing a sentence, it predicts the next course of action.
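The core idea of reinforcement learning, nudging a model towards responses that earn a higher reward, can be sketched in a few lines of Python. This is a deliberately simple stand-in and not a description of OpenAI's training pipeline: it assumes just three candidate responses with hypothetical reward values, and it applies a basic policy-gradient update to a softmax over them.

```python
import math


def softmax(logits):
    """Convert raw preference values into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, rewards, learning_rate=0.5):
    """One exact policy-gradient ascent step on the expected reward."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    # Responses that beat the current average reward are pushed up,
    # the others are pushed down.
    return [
        theta + learning_rate * p * (r - expected_reward)
        for theta, p, r in zip(logits, probs, rewards)
    ]


# Hypothetical rewards for three candidate responses to one prompt.
rewards = [0.1, 1.0, 0.4]
logits = [0.0, 0.0, 0.0]  # start with no preference

for _ in range(200):
    logits = policy_gradient_step(logits, rewards)

probs = softmax(logits)
print([round(p, 3) for p in probs])
```

After repeated updates, almost all of the probability sits on the second response, the one with the highest reward. In a real system the reward would come from a learned scorer rather than a fixed list.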


The model, however, still fails on many fronts. In one example prompt, for instance, its response failed to explain how the subject of the prompt relates to GANs (generative adversarial networks), and it needs more layers of verification to source information better.

In addition, aware that AI can be manipulated to produce biased or harmful content, OpenAI has, in an effort to be responsible, ensured that the chatbot is trained to recognise such biases and restricts its responses to prompts that appear inappropriate.

In a Twitter thread discussing whether ChatGPT has the potential to replace developers, one user explained that while the model is capable of producing human-like text, it is still limited in its ability to understand and manipulate complex systems the way a human developer can. In addition, a language model like ChatGPT is not capable of independent thought or creativity, which are important skills for a developer to have. In short, while large language models like ChatGPT may be able to assist developers in certain tasks, they will not be able to replace them completely.

The buzz around ChatGPT has not escaped the crypto community either, where it was among the most trending topics. The hype around the chatbot in turn led crypto punters to buy AI-related tokens, sending token prices up by as much as 77%, according to CoinGecko, a digital currency price and information data platform.

Among the tokens that benefited the most, DeepBrain Chain (DBC) posted the biggest gain, with a 76.7% jump in token price within a week of ChatGPT’s launch. It was followed by Numeraire (NMR), the largest AI token by market capitalisation, whose price rose 54.5% in the same period, from $11.26 to $17.40.

Aparna Iyer
Aparna Iyer has covered various sectors spanning education, wildlife, culture and law for close to a decade. She now writes on technology and is keen to unearth its capability for public good.
