Former Tesla AI head Andrej Karpathy recently released NanoGPT, an updated version of minGPT: a new, fast repository for training and finetuning medium-sized GPTs. Prior to this, in 2020, he unveiled the minGPT library as a minimal, readable PyTorch implementation of the GPT language model, in response to the sprawl of existing implementations.
Check out the NanoGPT GitHub repository.
Currently working on reproducing GPT-2 on the OpenWebText dataset, NanoGPT strives to be plain and readable:
- train.py is a ~300-line boilerplate training loop
- model.py is a ~300-line GPT model definition, which can optionally load the GPT-2 weights from OpenAI
- pip install datasets for Hugging Face datasets
- pip install tiktoken for OpenAI’s fast BPE tokeniser
- pip install wandb for optional logging
- pip install tqdm for progress bars
To render a dataset, the documents are tokenised into one long 1D array of token indices. For example, for OpenWebText:
$ cd data/openwebtext
$ python prepare.py
This will generate two files, train.bin and val.bin, each containing a raw sequence of uint16 values, where each value is a GPT-2 BPE token id. The training script currently attempts to replicate the smallest version of GPT-2 made available by OpenAI, the 124M model. To train with PyTorch Distributed Data Parallel (DDP), we have to run the script with torchrun.
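As a sketch of this format (the token ids below are illustrative; in the real prepare.py they come from tiktoken's GPT-2 BPE encoder), the .bin files are just flat arrays of uint16 token ids that the training loop can memory-map and slice into shifted (input, target) pairs:

```python
import os
import tempfile

import numpy as np

# Illustrative GPT-2 BPE token ids; uint16 suffices because GPT-2's
# vocabulary has 50,257 entries, which fits below 65,536.
ids = [15496, 11, 995, 0, 50256]

path = os.path.join(tempfile.mkdtemp(), "train.bin")
np.array(ids, dtype=np.uint16).tofile(path)  # write the raw uint16 stream

# The training script can memory-map the file rather than load it all
# into RAM, then slice out an input block and its next-token targets:
data = np.memmap(path, dtype=np.uint16, mode="r")
block_size = 3
x = data[0:block_size].astype(np.int64)      # model input
y = data[1:block_size + 1].astype(np.int64)  # targets, shifted by one
print(x.tolist(), y.tolist())  # [15496, 11, 995] [11, 995, 0]
```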
To finetune a GPT on new text, go to data/shakespeare and run prepare.py to download the Shakespeare dataset and render it into a train.bin and val.bin. This takes far less time than OpenWebText, and finetuning a single GPT takes only a few minutes. Run the example finetuning as follows:
$ python train.py finetune_shakespeare
This will load the config parameter overrides in config/finetune_shakespeare.py
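Such an override file is plain Python that reassigns defaults defined in train.py. The sketch below is illustrative (the names and values are assumptions, not the repo's actual settings) and shows how a minimal configurator could execute the override file over a dictionary of defaults:

```python
# Defaults as a training script might define them (illustrative):
config = {"init_from": "scratch", "learning_rate": 6e-4, "max_iters": 600000}

# A config override file would contain plain assignments, such as:
override = """
init_from = 'gpt2'    # start from a pretrained checkpoint
learning_rate = 3e-5  # smaller learning rate for finetuning
max_iters = 2000      # a short finetuning run
"""

# Applying the overrides is just executing the assignments
# with the defaults dict as the namespace:
exec(override, config)
print(config["init_from"], config["max_iters"])  # gpt2 2000
```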
The OpenAI GPT-2 checkpoints enable us to establish some baselines on OpenWebText. We can get the numbers as follows:
$ python train.py eval_gpt2
$ python train.py eval_gpt2_medium
$ python train.py eval_gpt2_large
$ python train.py eval_gpt2_xl
and observe the resulting losses on the train and val splits.
For model benchmarking, bench.py may be useful. It is identical to the core of the training loop in train.py but strips away the other complexities.
By default, the code now uses PyTorch 2.0, whose release makes torch.compile() available. The improvement from that single line of code is noticeable.
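The idiom really is a single wrapper call. The sketch below assumes PyTorch 2.0 or later is installed and uses a toy module as a stand-in for the repo's GPT model:

```python
import torch

# A stand-in module; in the training script the wrapped object
# would be the GPT model itself.
model = torch.nn.Linear(8, 8)

# The one line in question: torch.compile returns a module whose
# forward pass is JIT-compiled into optimized kernels on its first
# invocation (requires PyTorch >= 2.0).
model = torch.compile(model)
```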
NanoGPT is set to be updated with new developments that include the following:
- Additional optimizations to the running time
- Report and track other metrics like PPL
- Eval zero-shot perplexities on PTB, WikiText, and other related benchmarks
- Add finetuning datasets and a guide to using them for demonstration
- Reproduce GPT-2 results; roughly three years ago, the training cost of the 1.5B model was estimated at ~$50K
Two files make up the “library” of minGPT:
- mingpt/model.py contains the actual Transformer model definition and
- mingpt/trainer.py is the (GPT-independent) PyTorch boilerplate that trains the model.
A Jupyter notebook is also attached to this repo, showing how the “library” can train sequence models.