EleutherAI, founded by Connor Leahy, Leo Gao and Sid Black, is a research group focused on AI alignment, scaling and open-source AI research. In March 2021, the group released two GPT-Neo models, with 1.3 billion and 2.7 billion parameters respectively.
Microsoft has exclusive access to GPT-3’s source code as part of a larger agreement between the two companies; Microsoft has invested $1 billion in OpenAI. Interestingly, OpenAI’s GPT-1 and GPT-2 are open-source projects.
The EleutherAI project began on July 3, 2020, with the quest to replicate OpenAI’s GPT-family models. The north star of the research group is to replicate the 175-billion-parameter GPT-3 and ‘break the OpenAI-Microsoft monopoly’ on transformer-based language models.
However, building such powerful models requires a massive amount of computing power. EleutherAI is currently supported by the cloud computing providers Google and CoreWeave. CoreWeave has offered high-performance GPU compute for developing future models with GPT-NeoX.
GPT-NeoX is an in-development codebase based on Megatron-LM and DeepSpeed, designed for GPUs. GPT-Neo, on the other hand, is a codebase built on Mesh TensorFlow and designed for training on TPUs.
Besides this, the research group has built The Pile, an 825-gigabyte (GB) language modelling dataset curated from a set of sources including arXiv, GitHub, Wikipedia, StackExchange and HackerNews.
Now, it has launched GPT-J, one of the largest models EleutherAI has released to date. GPT-J is a 6-billion-parameter model trained on The Pile, comparable in performance to the GPT-3 version of similar size — 6.7 billion parameters. “Because GPT-J was trained on GitHub (7 percent) and StackExchange (5 percent) data, it is better than GPT-3 175B at writing code. However, in other tasks, it is significantly worse,” wrote artificial intelligence expert Alberto Romero in his blog.
GPT-J: JAX-based (Mesh) Transformer LM
The name GPT-J comes from its use of the JAX-based (Mesh) Transformer LM, developed by EleutherAI volunteer researchers Ben Wang and Aran Komatsuzaki. JAX is a Python library used extensively in machine learning experiments.
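To illustrate what JAX offers (this is a minimal sketch for readers unfamiliar with the library, not code from the GPT-J codebase), the snippet below composes NumPy-style array code with automatic differentiation (`jax.grad`) and XLA compilation (`jax.jit`) — the combination that makes JAX well suited to TPU-based language-model training:

```python
# Illustrative JAX sketch: autodiff + compilation of a toy loss function.
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Simple squared error for a linear model.
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)

# jax.grad builds the gradient function; jax.jit compiles it with XLA,
# so the same code runs efficiently on CPU, GPU or TPU.
grad_fn = jax.jit(jax.grad(loss))

w = jnp.zeros(3)
x = jnp.ones((4, 3))
y = jnp.ones(4)
g = grad_fn(w, x, y)  # gradient of the loss with respect to w
```

The `xmap` transformation Komatsuzaki mentions extends this model to named-axis parallelism across TPU pods, which is how the 6B model is sharded in practice.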
GPT-J is the best-performing publicly available Transformer LM in terms of zero-shot performance on various downstream tasks.
Komatsuzaki said it allows more flexible and faster inference than TensorFlow-plus-TPU counterparts. More than anything, the project required substantially less time than other large-scale models. JAX, xmap and TPUs are the right set of tools for the quick development of large-scale models, he added.
The model design and hyperparameter choices closely follow those of the 6.7B GPT-3, with some differences, including:
- The model was trained on 400 billion tokens from The Pile dataset (825 GB of text).
- Efficient attention (linear, local/sliding window, etc.) was not used for simplicity, as it would not have significantly improved throughput at this scale.
- The dimension of each attention head was set to 256, larger than that of GPT-3 of comparable size. “This noticeably improved the throughput with minimal performance degradation,” said Komatsuzaki.
The team made two minor architectural improvements to GPT-J: rotary position embeddings, for slightly better performance, and placing the attention layer and the feedforward layer in parallel, for decreased communication.
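Both ideas can be sketched in a few lines of plain Python (a simplified illustration of the techniques, not GPT-J’s actual implementation). Rotary embeddings rotate each (even, odd) feature pair by an angle proportional to the token’s position, so relative position shows up in attention dot products without altering the magnitude of the features; the parallel block feeds one normalised input to both sub-layers and sums their outputs:

```python
import math

def rotary_embed(vec, pos, base=10000.0):
    """Apply rotary position embedding to one head vector (list of floats).

    Each (even, odd) feature pair is rotated by pos * base**(-i/d),
    so lower pairs rotate faster; the rotation preserves each pair's norm.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s,
                vec[i] * s + vec[i + 1] * c]
    return out

def parallel_block(x, attn, ffn, ln):
    # GPT-J-style parallel residual: attention and feedforward both read the
    # same normalised input and are summed, instead of running one after the
    # other — the two sub-layers can then be computed concurrently.
    h = ln(x)
    return x + attn(h) + ffn(h)

rotated = rotary_embed([1.0, 0.0, 0.5, -0.5], pos=7)
```

Because the rotation is norm-preserving, the token’s content survives intact while its position becomes visible to attention — which is the “slightly better performance” trade-off the team refers to.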
As the table below shows, the zero-shot performance is on par with GPT-3 of comparable size, and the performance gap from GPT-3 is narrower than for the GPT-Neo models.
Performance across GPT-family of models (Source: Aran Komatsuzaki)
Also, the training throughput of the 6-billion-parameter GPT-J (151k tokens/s) is faster than that of the 2.7-billion-parameter GPT-Neo (148k tokens/s) on the same hardware (a TPU v3-256 pod) — a model more than twice the size training at a similar token rate, showcasing nearly 125 percent improvement in efficiency.
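One plausible back-of-envelope reading of that figure (an interpretation, since the article does not spell out the calculation): if efficiency is taken as useful work per second — parameters processed times tokens per second — the larger model at a similar token rate more than doubles it.

```python
# Hedged sketch of the "nearly 125 percent" efficiency claim.
gptj_params, gptj_tps = 6.0e9, 151e3   # GPT-J: 6B params, 151k tokens/s
neo_params, neo_tps = 2.7e9, 148e3     # GPT-Neo: 2.7B params, 148k tokens/s

# Ratio of (parameters x tokens/s) between the two runs.
work_ratio = (gptj_params * gptj_tps) / (neo_params * neo_tps)
improvement_pct = (work_ratio - 1) * 100  # roughly 127%, close to the quoted figure
```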
The hardware has a theoretical maximum of 13.4 PFLOPS (peta floating-point operations per second), and GPT-J achieved 5.4 PFLOPS as measured with the accounting used in the GPT-3 paper (excluding attention computation and ignoring compute-memory tradeoffs like gradient checkpointing). “When taking these additional factors into account, approximately 60% of the theoretical maximum is utilised,” mentioned Komatsuzaki, adding that training GPT-J took roughly five weeks on a TPU v3-256 pod.
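The two utilisation figures are easy to reconcile: the raw ratio of achieved to peak compute is about 40 percent, and the ~60 percent figure results from crediting the extra work (attention FLOPs, recomputation from gradient checkpointing) that the GPT-3 paper’s accounting leaves out.

```python
# Raw hardware utilisation from the figures quoted in the article.
theoretical_pflops = 13.4   # TPU v3-256 pod theoretical maximum
achieved_pflops = 5.4       # measured with the GPT-3 paper's accounting

raw_utilisation = achieved_pflops / theoretical_pflops  # about 0.40
# Counting attention FLOPs and recomputation raises the effective figure
# to roughly 0.60, per Komatsuzaki's estimate.
```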
GPT-J vs GPT-3
Max Woolf, a data scientist at BuzzFeed, recently tested GPT-J’s coding abilities, running it against the test prompts he had used on GPT-3 a year ago. “The exception is code generation, where GPT-J performed very well, and GPT-3 had performed very poorly,” he wrote in his blog post, showcasing multiple examples and use cases.
Romero said the results are impressive. At first glance, GPT-J is just another GPT model, but on closer look, clear differences emerge. For instance, GPT-J is roughly 30 times smaller than the 175-billion-parameter GPT-3. “Despite the large difference, ‘GPT-J’ produces better code, just because it was slightly more optimised to do the task,” he added.
He said further optimisation could give rise to platforms far better than GPT-3, and not only at coding: GPT-3 would become a jack of all trades, whereas the specialised systems would be the true masters, added Romero.
Recently, the Chinese government-backed BAAI introduced Wu Dao 2.0, the largest language model to date, with 1.75 trillion parameters. It has surpassed Google’s Switch Transformer and OpenAI’s GPT-3 in size.