Last updated December 28, 2020

OpenAI’s Strategy Questioned As Grad Students Recreate The Infamous GPT-2 At A Fraction Of The Cost

Published on September 3, 2019

by Ram Sagar

Earlier this year, OpenAI gained a lot of attention for all the wrong reasons when it produced a language model so good at generating fake news that the organisation decided not to release it. In fact, a study conducted by collaborators at Cornell University found that readers on average believed GPT-2’s outputs to be genuine news articles nearly as often as New York Times ones.

OpenAI’s strategy of delaying the release of the model relies on these models being difficult to replicate and requiring a high degree of specialised domain knowledge.

However, in what may come as surprising news, two grad students used an estimated $50,000 worth of free cloud computing from Google, which hands out credits to academic institutions, to make a decent attempt at replicating OpenAI’s GPT-2.

Why It Was Released In The First Place

OpenAI developed GPT-2 to investigate new benchmarks for natural language processing tasks. The authors of the original paper reported that GPT-2 achieves state-of-the-art performance in a zero-shot setting on 7 out of 8 tested language modelling datasets.

They claimed that the diversity of tasks the model is able to perform in a zero-shot setting suggests that high-capacity models trained to maximise the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.

OpenAI has insisted that it has not yet found any attempts at malicious use but has seen multiple beneficial applications, including code autocompletion, grammar help, and question-answering systems for medical assistance. As a result, the lab felt that releasing the most recent code was ultimately more beneficial. Other researchers argue that several successful efforts to replicate GPT-2 have already been made.

How Two Grad Students Replicated GPT-2

The authors based their implementation on the Grover model and modified its codebase to match the language modelling training objective of GPT-2.

In their blog, the authors gave a detailed account of how they built this model. Briefly, it can be described as follows:

They started with the Pushshift Reddit scrape, a dataset containing a continuously updated collection of Reddit posts, comments, and related metadata.
The links in this dataset were then filtered to remove direct links to file types unlikely to contain usable text or HTML.
The text was extracted from HTML pages using the Newspaper Python library and then filtered to keep only English text using the fastText Python library.
As a cleaning heuristic, documents with fewer than 128 tokens were removed from the dataset.
For encoding the dataset, byte pair encoding (BPE) was used.
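
The two filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the extension list and the function names are hypothetical, and tokens are approximated by splitting on whitespace. Only the 128-token threshold comes from the description above.

```python
from urllib.parse import urlparse

# Hypothetical list of file types assumed unlikely to contain usable text or HTML.
SKIPPED_EXTENSIONS = (".pdf", ".mp4", ".jpg", ".png", ".gif", ".css", ".zip")

# Threshold stated in the article: documents with fewer tokens are dropped.
MIN_TOKENS = 128


def is_probably_text(url: str) -> bool:
    """Return False for direct links to file types assumed to carry no article text."""
    # Look only at the path so that query strings do not hide the extension.
    path = urlparse(url).path.lower()
    return not path.endswith(SKIPPED_EXTENSIONS)


def long_enough(document: str, min_tokens: int = MIN_TOKENS) -> bool:
    """Keep documents with at least `min_tokens` tokens.

    Whitespace splitting is an assumption made for this sketch; the
    tokeniser used in the actual project may count tokens differently.
    """
    return len(document.split()) >= min_tokens


def filter_links(urls):
    """Drop direct links to skipped file types, preserving order."""
    return [u for u in urls if is_probably_text(u)]


def filter_documents(documents):
    """Drop documents shorter than the token threshold, preserving order."""
    return [d for d in documents if long_enough(d)]
```

The text extraction and English-language filtering steps, which the authors handled with the Newspaper and fastText libraries, would sit between these two filters and are left out here to keep the sketch free of external dependencies.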

Here’s a sample from their experiment:

Prompt: “Recycling is good for the world. NO! YOU COULD NOT BE MORE WRONG!!”

Output: Recycling is good for the world. NO! YOU COULD NOT BE MORE WRONG!! That’s because recycling is all about selling to customers who cannot afford to recycle.” If you don’t have any interest in recycling, in whatever form, then you could well be doing your own global environmental benefit if you just got rid of it. Let me explain. Recycling takes time, money and energy, and it doesn’t really solve the problem…

Aaron Gokaslan, 23, and Vanya Cohen, 24, the duo behind this project, insist that their release was intended to show that you don’t have to be an elite lab rich in dollars and PhDs to create this kind of software. They also argue that setting their creation free can help others explore and prepare for future advances, good or bad.

Because replication efforts like theirs are not unique, and large language models are currently the most effective means of countering generated text, the students believed that releasing their model was a reasonable first step towards countering the potential future abuse of these kinds of models.

The cost of training the model from scratch is about $50k. It’s important to note this figure is the estimated value of the cloud compute, and does not reflect the much smaller intrinsic costs involved.

“This allows everyone to have an important conversation about security and researchers to help secure against future potential abuses,” says Cohen.

Why Should We Care About Another ML Model?

Machine learning practitioners have long been divided over the reliability of AI. This is partly due to the black-box nature of the models: the inner workings of a deep learning model are still unclear. Attempts such as activation atlases have been made to investigate how a model learns. However, evaluating a model based on just its end results has made people sceptical of AI.

So whenever a new idea like GPT-2 is introduced, its most extreme outcome is often highlighted. In the case of GPT-2, the uncanny way in which the model spun stories out of thin air made many uncomfortable. People started speculating about dire consequences such as fake news.

For example, a malicious entity could sit in a remote place, script fake presidential speeches, and aggravate tensions within a nation or across the world. The world witnessed the rise of social engineering during the 2016 presidential election in the US. So, it is quite understandable why people are paranoid about this new text generation model.

The replication of this model by grad students using an estimated $50,000 in free cloud credits sparks debate about the need for regulating AI. Both models, the larger 1.5 billion parameter version and the latest cut-down OpenAI version, still have a long way to go. However, because of the ominous and sometimes exaggerated claims, the discussion has centred on arguments against such models rather than on the enhancement of NLP models as a whole.

We also cannot blame OpenAI for trying out new things. It is a new company, and applied AI is a new field. This makes even policy making difficult because there is no precedent, and experts and think tanks have a lot to do in the coming years to ensure a safe human-AI symbiosis.

PS: The story was written using a keyboard.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.
