OpenAI Finally Open Sources Its Controversial Language Model That Imitates Human Writing

Earlier this year, OpenAI introduced a breakthrough language model that could generate natural language text similar to that written by humans. The model, called GPT-2, is the successor to GPT (Generative Pre-Training). GPT-2 can generate words and build sentences and paragraphs that are hard to distinguish from human-written content.

Source: OpenAI

The model seemed promising, but OpenAI did not open source the fully-trained model due to concerns over misuse of the technology.

“Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with, as well as a technical paper.”


OpenAI’s decision, announced in its official release blog, to keep the fully trained state-of-the-art model closed was criticised by the AI research community.

Instead of releasing the fully trained model with 1.5 billion parameters in February, OpenAI opted for a staged release, publishing progressively larger versions of the model at intervals.

Four variants of the model were created, distinguished by their number of parameters: from Small, with 124 million parameters, to Extra Large, with ~1.5 billion parameters.

The Small version was released first, in February, followed by Medium (355 million parameters) in May and Large (774 million parameters) in August.
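To put the staged release in perspective, the parameter counts quoted above can be compared directly. The following is a minimal Python sketch; the names `RELEASES`, `growth_factor` and `summarize` are illustrative and not part of any OpenAI code, and the Extra Large entry uses the rounded ~1.5 billion figure rather than an exact count.

```python
# Parameter counts of the four GPT-2 variants, as quoted in this article.
# Assumption: the Extra Large figure is the rounded "~1.5 billion" value.
RELEASES = [
    ("Small", 124_000_000),
    ("Medium", 355_000_000),
    ("Large", 774_000_000),
    ("Extra Large", 1_500_000_000),
]


def growth_factor(smaller: int, larger: int) -> float:
    """Return how many times larger one model is than another."""
    return larger / smaller


def summarize(releases):
    """Build one line per variant, with its size relative to the first entry."""
    base_name, base_params = releases[0]
    return [
        f"{name}: {params / 1e6:.0f}M parameters "
        f"({growth_factor(base_params, params):.1f}x {base_name})"
        for name, params in releases
    ]


if __name__ == "__main__":
    for line in summarize(RELEASES):
        print(line)
```

On these rounded figures, the final model is roughly twelve times the size of the one released in February.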

According to the final report, the nine-month gap between the initial release of the Small model and the release of the fully trained model gave OpenAI time to conduct risk and benefit analyses as model sizes increased.

OpenAI planned the staged release as a way to analyse potential misuse of the technology, believing that it would give researchers a chance to mitigate the risks.

Finally, after 9 months of waiting, researchers and AI enthusiasts can now lay their hands on the fully trained GPT-2 model.

“As the final model release of GPT-2’s staged release, we’re releasing the largest version (1.5B parameters) of GPT-2 along with code and model weights to facilitate detection of outputs of GPT-2 models.”

Find the official GitHub repository here.

The XL Model

GPT-2, unlike its predecessor, was trained on a new dataset built from over 8 million web pages. The largest model trained on this dataset has 1.5 billion parameters.

Source: OpenAI

SOTA In Zero-shot

GPT-2 was tested on a variety of language modelling tasks and was found to outperform many domain-specific models. Without being trained on any task-specific or domain-specific data, the model achieved state-of-the-art (SOTA) results.

Source: OpenAI

OpenAI has been conducting studies on the potential of the new language model.

An impressive credibility score

In a survey conducted by Cornell University, people were asked to assign a credibility score to texts generated by the model. The XL model received a credibility score of 6.91 out of 10.

Hard to Detect

As language models become more capable, detecting synthetic content becomes more challenging. OpenAI also developed a detection model that identifies text generated by the 1.5B GPT-2 model with ~95% accuracy.
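As a rough illustration of what an accuracy figure like this measures, the sketch below computes the share of texts a detector labels correctly. It is a generic example, not OpenAI's detector: the function name `accuracy` and the toy labels are hypothetical, with 1 standing for machine-generated text and 0 for human-written text.

```python
def accuracy(labels, predictions):
    """Fraction of texts whose predicted label matches the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(1 for truth, guess in zip(labels, predictions) if truth == guess)
    return correct / len(labels)


if __name__ == "__main__":
    # Hypothetical toy data: 10 human-written texts (0) and 10 generated texts (1).
    labels = [0] * 10 + [1] * 10
    # The imaginary detector misses one generated text and gets the rest right.
    predictions = [0] * 10 + [1] * 9 + [0]
    print(f"accuracy: {accuracy(labels, predictions):.2f}")
```

At an accuracy of around 95%, about one text in twenty is still misclassified, which matters when a detector is applied to large volumes of content.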

Perfection for misuse

The very reason OpenAI opted for a long, staged release is that the model is highly prone to misuse. The Center on Terrorism, Extremism, and Counterterrorism (CTEC) demonstrated that it is possible to create models that can generate synthetic propaganda for ideologies such as white supremacy, Marxism, Jihadist Islamism, and anarchism.

CTEC also showed that detection methods, despite their low accuracy, can give experts reasonable grounds to suspect that an actor is generating synthetic text.

OpenAI states that although there have been discussions about the potential danger and misuse of the model, one thing that motivated it to release the model was the lack of any evidence of misuse so far.

Approach To The Challenge Of Bias

OpenAI has published a model card alongside the models on GitHub to give people a sense of the issues inherent to language models such as GPT-2. OpenAI also performed a qualitative, in-house evaluation of some of the biases in GPT-2.

In a Nutshell

Despite all the buzz around security and misuse concerns, GPT-2 is a breakthrough in natural language generation that can be beneficial in many ways. Language itself is a criterion for intelligence, and with the rise of such powerful models, machines will soon be able to converse effectively.

Amal Nair
A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies.
