EleutherAI’s GPT-J vs OpenAI’s GPT-3

OpenAI’s not-so-open GPT-3 has an open-source cousin, GPT-J, from the house of EleutherAI. The source code is available as a Colab notebook, along with a free web demo.

EleutherAI, founded by Connor Leahy, Leo Gao, and Sid Black, is a research group focused on AI alignment, scaling and open-source AI research. In March 2021, the group released two GPT-Neo models with 1.3 billion and 2.7 billion parameters respectively.

Microsoft, which has invested $1 billion in OpenAI, has exclusive access to GPT-3’s source code as part of a larger agreement between the two companies. Interestingly, OpenAI’s GPT-1 and GPT-2 are open-source projects.




The EleutherAI project began on July 3, 2020, with the quest to replicate OpenAI’s GPT-family models. The north star of the research group is to replicate the 175-billion-parameter GPT-3 and ‘break OpenAI-Microsoft monopoly’ on transformer-based language models.

However, to build such powerful models, you need a massive amount of computing power. EleutherAI is currently supported by Google and CoreWeave (cloud computing providers). CoreWeave has offered high-performance GPU compute to develop future models with GPT-NeoX. 


GPT-NeoX is an in-development codebase based on Megatron-LM and DeepSpeed, designed for GPUs. GPT-Neo, on the other hand, is a codebase built on Mesh TensorFlow and designed for training on TPUs.

Besides this, the research group has built The Pile, an 825-gigabyte (GB) language-modelling dataset curated from a set of datasets including arXiv, GitHub, Wikipedia, StackExchange, HackerNews, etc.

Now, it has launched GPT-J, one of the largest models EleutherAI has released to date. GPT-J is a 6-billion-parameter model trained on The Pile, comparable in performance to the GPT-3 version of similar size, which has 6.7 billion parameters. “Because GPT-J was trained on GitHub (7 percent) and StackExchange (5 percent) data, it is better than GPT3 175B at writing code. However, in other tasks, it is significantly worse,” wrote artificial intelligence expert Alberto Romero in his blog.

GPT-J: JAX-based (Mesh) Transformer LM 

The name GPT-J comes from its use of the JAX-based (Mesh) Transformer LM, developed by EleutherAI’s volunteer researchers Ben Wang and Aran Komatsuzaki. JAX is a Python library used extensively in machine learning experiments.

GPT-J is the best-performing publicly available Transformer LM in terms of zero-shot performance on various downstream tasks.

Komatsuzaki said the model allows more flexible and faster inference than its TensorFlow and TPU counterparts. More than anything, the project required substantially less time than other large-scale models. JAX, xmap and TPUs are the right set of tools for the quick development of large-scale models, he added.

According to Komatsuzaki, the model design and hyperparameter choices closely follow those of the 6.7B GPT-3, with some differences, including:

  • The model was trained on 400 billion tokens from The Pile, a dataset of roughly 800 GB of text.
  • Efficient attention (like linear, local or sliding window, etc.) was not used for simplicity, as it would not have significantly improved ‘throughput’ at this scale.
  • The dimension of each ‘attention head’ was set to 256, which is more than that of GPT-3 of comparable size. “This noticeably improved the ‘throughput’ with minimal performance degradation,” said Komatsuzaki. 
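To make the head-dimension point concrete, here is a minimal arithmetic sketch of how head size trades off against head count. It assumes a hidden size of 4096 for both models and a head dimension of 128 for the 6.7B GPT-3; neither figure is stated in this article, so treat both as assumptions taken from the published model descriptions.

```python
# Assumed values (not stated in this article): both models use a hidden
# size of 4096, and the 6.7B GPT-3 uses 128-dimensional attention heads.
D_MODEL = 4096


def num_heads(d_model: int, d_head: int) -> int:
    """Number of attention heads when d_model is split evenly into heads."""
    if d_model % d_head != 0:
        raise ValueError("d_model must be divisible by d_head")
    return d_model // d_head


gptj_heads = num_heads(D_MODEL, 256)  # head dimension quoted in the article
gpt3_heads = num_heads(D_MODEL, 128)  # assumed GPT-3 6.7B head dimension

print(f"GPT-J: {gptj_heads} heads of size 256")
print(f"GPT-3 6.7B (assumed): {gpt3_heads} heads of size 128")
```

Under these assumptions, GPT-J ends up with half as many heads, each twice as wide. Fewer, larger matrix multiplications per layer is one plausible reason for the throughput gain Komatsuzaki describes.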

The team made two minor architectural improvements for GPT-J: it used rotary embeddings for slightly better performance, and it placed the attention layer and the feedforward layer in parallel for decreased communication.
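The two changes can be illustrated with a toy sketch in plain Python. This is not EleutherAI's code: the function names, the frequency base of 10000 and the stand-in sublayers are all assumptions chosen for illustration. The first part shows the general idea behind rotary embeddings (rotating pairs of vector components by position-dependent angles, so that a query-key dot product depends only on relative position); the second contrasts a parallel attention/feedforward layout with the conventional sequential one.

```python
import math


def rotary(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `position`.

    Generic illustration of rotary position embeddings; `base` is an
    assumed default, not a value taken from the article.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# The score between a rotated query and key depends only on their offset.
q, k = [0.3, -1.2, 0.7, 0.5], [1.1, 0.4, -0.6, 0.9]
score_a = dot(rotary(q, 5), rotary(k, 3))    # positions 5 and 3, offset 2
score_b = dot(rotary(q, 12), rotary(k, 10))  # positions 12 and 10, offset 2


def parallel_block(x, attn, ffn, norm):
    """Parallel layout: both sublayers read the same normalised input."""
    h = norm(x)
    return [xi + ai + fi for xi, ai, fi in zip(x, attn(h), ffn(h))]


def sequential_block(x, attn, ffn, norm):
    """Conventional layout: the feedforward layer waits for attention."""
    x = [xi + ai for xi, ai in zip(x, attn(norm(x)))]
    return [xi + fi for xi, fi in zip(x, ffn(norm(x)))]


# Hypothetical stand-ins for the real sublayers, just to make this runnable.
toy_attn = lambda h: [2 * v for v in h]
toy_ffn = lambda h: [v + 1 for v in h]
identity = lambda h: list(h)

parallel_out = parallel_block([1, 2], toy_attn, toy_ffn, identity)
sequential_out = sequential_block([1, 2], toy_attn, toy_ffn, identity)
```

In the parallel layout the two sublayers do not depend on each other's output, so their computation can be scheduled together, which is the intuition behind the reduced communication mentioned above.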


As shown in the table below, the zero-shot performance is on par with GPT-3 of comparable size, and the gap to GPT-3 is smaller than that of the GPT-Neo models.

Performance across GPT-family of models (Source: Aran Komatsuzaki)

Also, the training throughput of the 6-billion-parameter GPT-J (151k tokens/s) is higher than that of the 2.7-billion-parameter GPT-Neo (148k tokens/s) on the same hardware (TPU v3-256 pod), showcasing a nearly 125 percent improvement in efficiency.
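The article does not spell out how the 125 percent figure is derived. One plausible reading, assumed here, is that throughput is normalised by model size, so efficiency is measured as parameters multiplied by tokens per second. A quick check of that reading:

```python
# Figures quoted in the article.
GPTJ_PARAMS = 6.0e9
GPTJ_TOKENS_PER_S = 151_000
NEO_PARAMS = 2.7e9
NEO_TOKENS_PER_S = 148_000


def param_tokens_per_second(params: float, tokens_per_s: float) -> float:
    """Size-normalised throughput (an assumed efficiency measure)."""
    return params * tokens_per_s


ratio = param_tokens_per_second(GPTJ_PARAMS, GPTJ_TOKENS_PER_S) / (
    param_tokens_per_second(NEO_PARAMS, NEO_TOKENS_PER_S)
)
improvement_pct = (ratio - 1) * 100

print(f"GPT-J / GPT-Neo efficiency ratio: {ratio:.2f}")
print(f"Improvement: {improvement_pct:.0f} percent")
```

Under this assumption the ratio comes out a little above 2.2, which is in line with the quoted improvement of roughly 125 percent.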

The hardware has a ‘theoretical maximum’ of 13.4 PFLOPS (peta floating-point operations per second), and GPT-J achieved 5.4 PFLOPS as measured in the GPT-3 paper (excluding attention computation and ignoring compute-memory tradeoffs like gradient checkpointing). “When taking these additional factors into account, approximately 60% of the theoretical maximum is utilised,” mentioned Komatsuzaki, adding that training GPT-J took roughly five weeks on a TPU v3-256 pod.
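These numbers can be sanity-checked with simple arithmetic. The sketch below computes the utilisation before the additional factors are counted, and estimates the training time from the 400 billion tokens and 151k tokens/s quoted earlier. The time estimate assumes constant throughput with no downtime, which is an idealisation.

```python
# Figures quoted in the article.
PEAK_PFLOPS = 13.4
ACHIEVED_PFLOPS = 5.4
TRAINING_TOKENS = 400e9
TOKENS_PER_SECOND = 151_000

SECONDS_PER_WEEK = 7 * 24 * 60 * 60

# Utilisation before attention and gradient checkpointing are accounted for.
utilisation = ACHIEVED_PFLOPS / PEAK_PFLOPS

# Idealised training time: constant throughput, no restarts or evaluation.
training_seconds = TRAINING_TOKENS / TOKENS_PER_SECOND
training_weeks = training_seconds / SECONDS_PER_WEEK

print(f"Baseline utilisation: {utilisation:.0%}")
print(f"Idealised training time: {training_weeks:.1f} weeks")
```

The baseline utilisation works out to about 40 percent, which the extra factors raise to the quoted figure of roughly 60 percent. The idealised training time lands between four and five weeks, consistent with the "roughly five weeks" reported once real-world overhead is included.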

GPT-J vs GPT-3 

Max Woolf, a data scientist at BuzzFeed, recently tested GPT-J’s coding abilities. He said he ran GPT-J against the test prompts he had used to test GPT-3 a year ago, and code generation stood out. “The exception is code generation, where GPT-J performed very well, and GPT-3 had performed very poorly,” he wrote in his blog post, showcasing multiple examples and use cases.

Wrapping up 

Romero said the results are impressive. At first glance, he said, GPT-J is just another GPT model, but on a closer look, clear differences emerge. For instance, GPT-J is about 30 times smaller than the 175-billion-parameter GPT-3. “Despite the large difference, ‘GPT-J’ produces better code, just because it was slightly more optimised to do the task,” he added.

He also said further optimisation could give rise to platforms far better than GPT-3, and not only at coding. GPT-3 would become a jack of all trades, whereas the specialised systems would be the true masters, added Romero.

Recently, the Chinese government-backed BAAI introduced Wu Dao 2.0, the largest language model to date, with 1.75 trillion parameters. It has surpassed Google’s Switch Transformer and OpenAI’s GPT-3 in size. 


Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
