
EleutherAI’s GPT-J vs OpenAI’s GPT-3


OpenAI’s not-so-open GPT-3 has an open-source cousin, GPT-J, from the house of EleutherAI. Check out the source code in the Colab notebook and the free web demo here.

EleutherAI, founded by Connor Leahy, Leo Gao, and Sid Black, is a research group focused on AI alignment, scaling and open-source AI research. In March 2021, the group released two GPT-Neo models, with 1.3 billion and 2.7 billion parameters respectively.

Microsoft has exclusive access to GPT-3’s source code as part of a larger agreement between the two companies, under which Microsoft has invested $1 billion in OpenAI. Interestingly, OpenAI’s GPT-1 and GPT-2 remain open-source projects.

EleutherAI

The EleutherAI project began on July 3, 2020, with the quest to replicate OpenAI’s GPT-family models. The research group’s north star is to replicate the 175-billion-parameter GPT-3 and ‘break the OpenAI-Microsoft monopoly’ on transformer-based language models.

However, building such powerful models requires a massive amount of computing power. EleutherAI is currently supported by the cloud computing providers Google and CoreWeave; CoreWeave has offered high-performance GPU compute to develop future models with GPT-NeoX.

GPT-NeoX is an in-development codebase based on Megatron-LM and DeepSpeed, designed for GPUs. GPT-Neo, on the other hand, is a codebase built on Mesh TensorFlow, designed for training on TPUs.

Besides this, the research group has built The Pile, an 825-gigabyte (GB) language-modelling dataset curated from sources including arXiv, GitHub, Wikipedia, StackExchange and Hacker News.

Now, EleutherAI has launched GPT-J, one of the largest models it has released to date. GPT-J is a 6-billion-parameter model trained on The Pile, comparable in performance to the GPT-3 version of similar size, at 6.7 billion parameters. “Because GPT-J was trained on GitHub (7 percent) and StackExchange (5 percent) data, it is better than GPT-3 175B at writing code. However, in other tasks, it is significantly worse,” wrote artificial intelligence expert Alberto Romero in his blog.

GPT-J: JAX-based (Mesh) Transformer LM 

The name GPT-J comes from its use of JAX-based (Mesh) Transformer LM, developed by EleutherAI’s volunteer researchers Ben Wang and Aran Komatsuzaki. JAX is a Python library used extensively in machine learning experiments.
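
For readers unfamiliar with it, JAX pairs NumPy-style array code with composable function transformations such as grad (automatic differentiation) and jit (XLA compilation for CPUs, GPUs and TPUs). A minimal illustrative sketch, not taken from the GPT-J codebase:

    import jax
    import jax.numpy as jnp

    # A toy loss: mean squared error of a linear model.
    def loss(w, x, y):
        return jnp.mean((x @ w - y) ** 2)

    # grad() derives a new function computing d(loss)/d(w);
    # jit() compiles it with XLA for the available backend.
    grad_loss = jax.jit(jax.grad(loss))

    w = jnp.zeros(3)
    x = jnp.ones((4, 3))
    y = jnp.array([1.0, 2.0, 3.0, 4.0])
    print(grad_loss(w, x, y))  # gradient of the loss with respect to w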

GPT-J is the best-performing publicly available transformer LM in terms of zero-shot performance on various downstream tasks.

Komatsuzaki said it allows more flexible and faster inference than TensorFlow + TPU counterparts. More than anything, the project required substantially less time than other large-scale models; JAX, xmap and TPUs proved the right set of tools for the quick development of large-scale models, he added.
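
The xmap Komatsuzaki refers to is JAX’s (then experimental) named-axis mechanism for sharding computation across a TPU mesh. Its simpler sibling pmap conveys the underlying idea of mapping one function across many accelerators; the sketch below is purely illustrative and not GPT-J’s training code:

    import jax
    import jax.numpy as jnp

    # pmap runs the function once per device, splitting the leading axis;
    # xmap generalises this with named axes laid out over device meshes.
    @jax.pmap
    def partial_sum_of_squares(x):
        return jnp.sum(x ** 2)

    n = jax.device_count()
    batch = jnp.arange(n * 4, dtype=jnp.float32).reshape(n, 4)
    print(partial_sum_of_squares(batch))  # one partial result per device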

The model design and hyperparameter choices closely follow those of the 6.7-billion-parameter GPT-3, with some differences, including:

  • The model was trained on 400 billion tokens from The Pile, the 825 GB text dataset described above.
  • Efficient attention (linear, local, sliding-window, etc.) was not used for simplicity, as it would not have significantly improved throughput at this scale.
  • The dimension of each attention head was set to 256, larger than that of GPT-3 of comparable size. “This noticeably improved the throughput with minimal performance degradation,” said Komatsuzaki.

The team made two minor architectural improvements in GPT-J: rotary embeddings, for slightly better performance, and placing the attention layer and the feedforward layer in parallel, for decreased communication.
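
To make the parallel-layer change concrete, here is a minimal sketch of a GPT-J-style block in JAX, with toy stand-in sublayers, a made-up width, and rotary position handling omitted for brevity; it is illustrative only, not EleutherAI’s implementation. In a standard GPT block the feedforward sublayer runs on the attention output; in the parallel formulation both sublayers read the same normalised input and their outputs are summed into the residual stream, which reduces cross-device communication:

    import jax
    import jax.numpy as jnp

    D = 16  # toy model width (the real GPT-J uses 4096)

    def layer_norm(x, eps=1e-5):
        mu = x.mean(-1, keepdims=True)
        return (x - mu) / jnp.sqrt(x.var(-1, keepdims=True) + eps)

    # Stand-in sublayers: any (tokens, D) -> (tokens, D) maps would do.
    def toy_attention(h):
        weights = jax.nn.softmax(h @ h.T / jnp.sqrt(D))  # toy self-attention
        return weights @ h

    def toy_feedforward(h):
        return jax.nn.gelu(0.01 * (h @ jnp.ones((D, D))))

    # Sequential GPT block: x = x + attn(ln(x)); x = x + ff(ln(x)).
    # Parallel GPT-J block: both sublayers read the same normalised input
    # and their outputs are summed into a single residual update.
    def parallel_block(x):
        h = layer_norm(x)
        return x + toy_attention(h) + toy_feedforward(h)

    x = jnp.ones((8, D))            # eight toy tokens
    print(parallel_block(x).shape)  # (8, 16)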

Performance

As shown in the table below, GPT-J’s zero-shot performance is on a par with that of a comparably sized GPT-3, and its gap from GPT-3 is smaller than that of the GPT-Neo models.

Performance across the GPT family of models (Source: Aran Komatsuzaki)

Also, the training throughput of the 6-billion-parameter GPT-J (151K tokens/s) is higher than that of the 2.7-billion-parameter GPT-Neo (148K tokens/s) on the same hardware (a TPU v3-256 pod), which, given the difference in model size, amounts to a nearly 125 percent improvement in efficiency.

The hardware has a theoretical maximum of 13.4 PFLOPs (peta floating-point operations per second), and GPT-J achieved 5.4 PFLOPs as measured with the method of the GPT-3 paper (which excludes attention computation and ignores compute-memory trade-offs such as gradient checkpointing). “When taking these additional factors into account, approximately 60% of the theoretical maximum is utilised,” mentioned Komatsuzaki, adding that training GPT-J took roughly five weeks on the TPU v3-256 pod.
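
These figures can be sanity-checked with back-of-the-envelope arithmetic (our own working, not from Komatsuzaki’s post): per-token training compute scales roughly linearly with parameter count, so the relative efficiency is the throughput ratio times the parameter ratio:

    # Rough sanity check of the quoted figures (our arithmetic, not EleutherAI's).
    gptj_tok_s, gptneo_tok_s = 151_000, 148_000   # tokens/s on a TPU v3-256 pod
    gptj_params, gptneo_params = 6.0e9, 2.7e9

    # Per-token compute scales roughly with parameter count, so
    # relative efficiency = throughput ratio * parameter ratio.
    rel = (gptj_tok_s / gptneo_tok_s) * (gptj_params / gptneo_params)
    print(f"{(rel - 1) * 100:.0f}% improvement")  # ~127%, i.e. "nearly 125%"

    # Raw utilisation before adding back attention FLOPs and recomputation.
    print(f"{5.4 / 13.4 * 100:.0f}% of the 13.4 PFLOPs peak")  # ~40%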

GPT-J vs GPT-3 

Max Woolf, a data scientist at BuzzFeed, recently tested GPT-J’s coding abilities, running it against the test prompts he had used to test GPT-3 a year ago. “The exception is code generation, where GPT-J performed very well, and GPT-3 had performed very poorly,” he wrote in his blog post, showcasing multiple examples and use cases.
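
Readers who want to rerun such prompt tests can load the model from the Hugging Face hub; the checkpoint name below is the published EleutherAI/gpt-j-6B, while the prompt and sampling settings are illustrative (the full-precision model needs roughly 24 GB of memory):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative sketch: sample a code completion from GPT-J.
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

    inputs = tokenizer("def fizzbuzz(n):", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,   # length of the completion
        do_sample=True,      # sample instead of greedy decoding
        temperature=0.8,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))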

Wrapping up 

Romero said the results are impressive. At first glance, GPT-J is just another GPT model, but on closer look, clear differences emerge. For instance, GPT-J, at 6 billion parameters, is roughly 30 times smaller than the 175-billion-parameter GPT-3. “Despite the large difference, GPT-J produces better code, just because it was slightly more optimised to do the task,” he added.

Further, Romero said, such optimisation could give rise to systems far better than GPT-3, and not only at coding: GPT-3 would become a jack of all trades, whereas the specialised systems would be the true masters.

Recently, the Chinese government-backed BAAI introduced Wu Dao 2.0, the largest language model to date with 1.75 trillion parameters, surpassing Google’s Switch Transformer and OpenAI’s GPT-3 in size.

PS: The story was written using a keyboard.