OpenAI Finally Open Sources Its Controversial Language Model That Imitates Human Writing

Earlier this year, OpenAI unveiled a breakthrough language model that could generate natural language text nearly indistinguishable from human writing. The model, called GPT-2, was the successor to GPT (Generative Pre-Training). GPT-2 can generate words and build sentences and paragraphs that are hard to distinguish from human-written content.


The model seemed promising, but OpenAI did not open source the fully-trained model due to concerns over misuse of the technology.

“Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with, as well as a technical paper.”


As stated in the official release blog, OpenAI’s decision to keep the fully trained state-of-the-art model closed was criticised by the AI research community. 

Instead of releasing the fully trained model with 1.5 billion parameters in February, OpenAI opted for a staged release, publishing smaller versions of the model at intervals.

Four variants of the model were created based on parameter count, ranging from Small, with 124 million parameters, to Extra Large, with roughly 1.5 billion parameters.

The Small version was released first, in February, followed by the Medium version with 355 million parameters in May and the Large version with 774 million parameters in August.

As stated in the final report, the nine-month gap between the release of the small model and the fully trained model gave OpenAI time to conduct risk and benefit analyses as model sizes increased.

OpenAI believed that the staged release would give researchers a chance to study potential misuse of the technology and develop mitigations before the full model became available.

Finally, after nine months of waiting, researchers and AI enthusiasts can get their hands on the fully trained GPT-2 model.

“As the final model release of GPT-2’s staged release, we’re releasing the largest version (1.5B parameters) of GPT-2 along with code and model weights to facilitate detection of outputs of GPT-2 models.”

The code and model weights are available in the official openai/gpt-2 repository on GitHub.
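For readers who want to try the released weights, the repository's setup follows the usual clone-and-download pattern; the commands below are a sketch based on the repo's README (model names like 1558M refer to the parameter counts of the four released variants):

```shell
# Clone the official GPT-2 repository
git clone https://github.com/openai/gpt-2.git
cd gpt-2

# Install the Python dependencies listed by the repo
pip install -r requirements.txt

# Fetch the weights for the full 1.5B-parameter model
# (smaller variants: 124M, 355M, 774M)
python download_model.py 1558M
```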

The XL Model

GPT-2, unlike its predecessor, was trained on a new dataset drawn from over 8 million web pages. The model trained on the complete dataset has 1.5 billion parameters.
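To give a sense of how GPT-2-style models turn their 1.5 billion parameters into text, the sketch below illustrates top-k sampling, the decoding strategy commonly used with GPT-2: at each step the model produces a logit per vocabulary token, and the next token is drawn only from the k most likely candidates. This is a minimal, self-contained NumPy illustration with a toy vocabulary, not OpenAI's released (TensorFlow) code:

```python
import numpy as np

def top_k_sample(logits, k=40, temperature=1.0, rng=None):
    """Sample one token id from logits, restricted to the k most likely tokens."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    top_idx = np.argsort(logits)[-k:]        # indices of the k largest logits
    top_logits = logits[top_idx]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()                     # softmax over the top-k candidates only
    return int(rng.choice(top_idx, p=probs))

# Toy vocabulary of 5 tokens; token 4 has by far the largest logit,
# so with k=1 sampling always picks it.
logits = [0.1, 0.2, 0.3, 0.4, 5.0]
print(top_k_sample(logits, k=1))  # → 4
```

In a real decoding loop this sampling step would be repeated, feeding each sampled token back into the model to produce the next distribution; larger k and higher temperature give more diverse but less predictable text.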


SOTA In Zero-shot

GPT-2 was tested against a variety of language modelling tasks and outperformed many domain-specific models. Despite not being trained on data specific to any of these tasks or domains, the model achieved state-of-the-art (SOTA) results.


OpenAI has been conducting studies on the potential impact of the new language model.

An Impressive Credibility Score

In a survey conducted at Cornell University, people were asked to assign a credibility score to texts generated by the model. The XL model received a credibility score of 6.91 out of 10.

Hard to Detect

As language models become more capable, detecting synthetic content becomes challenging. OpenAI also developed a detection model that identifies text generated by the 1.5B GPT-2 model with roughly 95% accuracy.

Potential For Misuse

The very reason OpenAI opted for a long, staged release is that the model is highly prone to misuse. The Center on Terrorism, Extremism, and Counterterrorism (CTEC) demonstrated that GPT-2 can be fine-tuned to generate synthetic propaganda for ideologies such as white supremacy, Marxism, jihadist Islamism, and anarchism.

They also showed that detection methods, despite their limited accuracy, can still be useful for raising suspicion that an actor is generating synthetic text.

OpenAI states that although there have been discussions around the potential danger and misuse of the model, one factor that motivated the release was the lack of strong evidence of misuse so far.

Approach To The Challenge Of Bias

OpenAI has published a model card alongside the models on GitHub to give people a sense of the issues inherent in language models such as GPT-2. OpenAI also performed a qualitative, in-house evaluation of some of the biases in GPT-2.

In a Nutshell

Despite all the buzz around security and misuse concerns, GPT-2 is a breakthrough in natural text generation that can deliver benefits in many ways. Language is a hallmark of intelligence, and with the rise of such powerful models, machines may soon be able to converse fluently.

Amal Nair
A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies.
