As machine learning models grow larger, training them is becoming an increasingly challenging task. Tuning in particular, when the parameters run into the billions (even trillions now), is a cumbersome and resource-intensive process. Recently, researchers from Microsoft – Edward Hu (PhD student), Greg Yang (senior researcher) and Jianfeng Gao (distinguished scientist and vice president) – introduced µ-Parametrization, which offers maximal feature learning even in the infinite-width limit. The researchers further collaborated with OpenAI to demonstrate its practical advantages, recorded in this paper.
We caught up with Edward Hu and Greg Yang to learn more about their research.
Edited excerpts:
AIM: How did you recognise this as a problem area? Could you walk us through the development cycle?
Authors:
The Origin of Tensor Programs: The First Epiphany
Around 2017 and 2018, I (Greg Yang) was working on initialisation schemes for deep neural networks. At the time, the default initialisation methods, such as those of Glorot & Bengio and He et al., were derived from rough heuristics to ensure the activation scale of each layer stayed consistent as width and depth varied. These heuristics were only derived for simple neural networks, and even there, they were never made rigorous, so I felt a bit uneasy applying them to the general neural networks used in practice. The Tensor Programs framework was born as a rigorous system for understanding neural network behaviour at initialisation, i.e. when the weights of the network are random.
As I tried to unify existing initialisation heuristics across different neural architectures such as ResNets and Transformers, I had the first epiphany: there is a low-level “programming language” composed of just matrix multiplications and coordinatewise nonlinearities such that, if the neural network function is re-expressed in that language, there is an automatic and rigorous way of doing the initialisation analysis. The programs in this language are called Tensor Programs (TP).
So, to summarise, this programmatic way of thinking theoretically about neural networks was the first epiphany.
Going Beyond Initialisation: The Second Epiphany
The second epiphany came very quickly after TP was invented: this low-level “programming language” was motivated by the need to express arbitrary neural architectures, but the language is, in fact, already powerful enough to express the entire neural network training algorithm as well! This epiphany is akin to setting out to create a format like ONNX for describing neural architectures, only to discover you have just created PyTorch, which can be used not just to describe networks but also to train them and run inference with them.
With this epiphany, I was able to analyse how wide neural networks behave not only at initialisation (as was my original motivation) but also after training, a really hard thing to do at the time.
Via the second epiphany (that I can analyse wide neural networks after training), I successfully obtained a classification of all possible infinite-width neural network limits (these are the “phases”). In particular, this classification singles out Maximal Update Parametrization, or μP, as inducing the unique infinite-width limit that maximises feature learning. This is how μP was found.
Now, intuitively, maximising feature learning surely sounds like a good thing. But at that point, when I first derived μP in 2019, the “neural networks are just kernels” narrative was very strong due to an explosion of works showing how, theoretically, wide neural networks behave like kernel machines under the default PyTorch/TensorFlow style of parametrisation. This explosion was primarily because, in the kernel perspective, neural network optimisation, a very nonconvex problem, reduces to a convex problem when the width is large, thus bypassing thorny theoretical questions of proving training convergence. Nevertheless, the kernel perspective precludes any sort of feature learning, which any practitioner would likely say is quite important for neural networks, especially with the rise of large pretrained models and finetuning. To use an analogy with the phases of water, such kernel limits of a wide neural network are like ice: very rigid, with a regular structure that one can theoretically understand easily, corresponding to a fixed rather than learned set of features. The feature learning limit of μP, by contrast, is like liquid: very adaptive to its environment (i.e., the data).
Manifestation of Correctness: The Final Epiphany
Suffice it to say, I was very dissatisfied with this deep disconnect between the theoretical and empirical communities. Therefore, after deriving μP, I asked myself, “OK, so, why would anyone care that μP is optimal in this theoretical sense of maximising feature learning? If it’s really the correct parametrisation, how would it manifest this correctness in practice?”
This is when I arrived at my final epiphany leading to TP5: the correct parametrisation should be the unique one that preserves the optimal learning rate, initialisation, and other training hyperparameters when model size varies, whereas incorrect parametrisations would see them potentially diverging to infinity or decreasing to zero in larger and larger models.
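To make this criterion concrete, the kind of experiment it suggests is easy to sketch: sweep the learning rate at several widths and check whether the best-performing value stays put as the width grows. Below is a minimal illustrative sketch (not the authors’ code); the toy task, widths and learning-rate grid are our own assumptions, and it uses the default PyTorch parametrisation, under which one would expect the optimum to drift, whereas under μP it should stay roughly fixed.

```python
# Hedged sketch: does the best learning rate stay put as width grows?
# The toy regression task, widths and learning-rate grid are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1024, 32)
y = X @ torch.randn(32, 1) + 0.1 * torch.randn(1024, 1)

def final_loss(width, lr, steps=200):
    """Train a small MLP of the given width and return its final training loss."""
    model = nn.Sequential(nn.Linear(32, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

for width in (64, 256, 1024):
    losses = {lr: final_loss(width, lr) for lr in (1e-3, 3e-3, 1e-2, 3e-2)}
    best = min(losses, key=losses.get)
    # In the default parametrisation the best lr tends to drift with width;
    # under muP it should stay (approximately) fixed.
    print(f"width={width:5d}  best lr={best:.0e}")
```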
1/ You can't train GPT-3 on a single GPU, much less tune its hyperparameters (HPs).
But what if I tell you…
…you *can* tune its HPs on a single GPU thanks to new theoretical advances?
paper https://t.co/urEUY7O3yQ
code https://t.co/5S0YAghCYx
blog https://t.co/QPqinMwOXj
— Greg Yang (@TheGregYang) March 8, 2022
AIM: You speak more than once about the challenge of training large neural networks. Could you give more details?
Authors: It’s been shown that the performance of neural networks is critically determined by their size. As a result, an important research direction is training neural networks that are as large as possible given the available hardware. Therefore, almost by definition, we will be training models so large that we can only afford to train them from start to finish once; otherwise, we would simply train an even larger model.
As such, because a team would more or less “put all their eggs” in the one basket of a single large model, anything that can derail its pretraining will incur high costs in time and money wasted. These can be simple oversights, like the one in Section 2.2 of OpenAI’s GPT-3 paper, where they wrote, “Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training, it was not feasible to retrain the model.” Or it could be a wrong hyperparameter choice, such as a learning rate that causes the model to diverge after a month of training. Prior to our work, there was quite a lot of uncertainty in how to choose the right learning rate and other hyperparameters for models so large that nobody had tried them before. But our work gives a simple prescription for scaling such hyperparameters, dramatically reducing this uncertainty.
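The prescription is implemented in the open-source `mup` package released alongside the paper. Here is a rough sketch of how it is typically wired up; the model, widths and learning rate are illustrative assumptions, and the exact API should be checked against the package’s documentation.

```python
# Illustrative sketch only; consult the mup package's documentation for the
# authoritative interface. The MLP, widths and learning rate are assumptions.
import torch.nn as nn
from mup import MuReadout, MuAdam, set_base_shapes

class MLP(nn.Module):
    def __init__(self, width):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(32, width), nn.ReLU(),
                                  nn.Linear(width, width), nn.ReLU())
        self.head = MuReadout(width, 10)  # the output layer muP treats specially

    def forward(self, x):
        return self.head(self.body(x))

# 1. Tune hyperparameters (e.g. the learning rate) on a small proxy model.
proxy = MLP(width=256)
set_base_shapes(proxy, MLP(width=64), delta=MLP(width=128))
# ... run a cheap learning-rate sweep on the proxy here ...
tuned_lr = 3e-3  # hypothetical result of that sweep

# 2. Reuse the same hyperparameters on the full-size model.
big = MLP(width=8192)
set_base_shapes(big, MLP(width=64), delta=MLP(width=128))
optimizer = MuAdam(big.parameters(), lr=tuned_lr)  # width-aware per-layer scaling
# (Initialisation also needs care under muP; see the package documentation.)
```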
AIM: What are the shortcomings of your research? How do you plan to resolve them?
Authors: We showed that μTransfer solves the hyperparameter transfer problem across model width both theoretically and empirically, but the same problem across other dimensions of scale, like depth, is still open.
Additionally, Fig. 4 in our paper shows that the optimal HP still shifts slightly for smaller models. Perhaps by considering finite-width corrections to µP, one can fix this shift. Finally, it will be interesting to study whether there is a way to transfer regularisation HPs as a function of both model size and data size, especially in the context of finetuning pretrained models.
AIM: Why is scaling up neural networks such a challenge?
Authors: There are many aspects to answering this question. For example, when training large neural networks on a cluster of many GPUs, a lot of work needs to be done on the systems side to ensure efficient communication between GPUs, as well as the robustness of the training process to GPU nodes sporadically going offline, which can happen when training goes on for months.
The aspect most relevant to our work is the challenge of setting the right hyperparameters for large neural networks. When training small neural networks, researchers can freely try many hyperparameters at once and pick the best one. But this is far from feasible for large networks like GPT-3 that can only be trained once. So, before our work, researchers tended to make educated guesses about the right hyperparameters. However, such guesses are at best “hacks”, and in practice, there is a lot of uncertainty as to whether they will actually work well.
With our work, we understand, for the first time, very precisely how large neural networks behave during training and how to vary hyperparameters with width, such as the learning rate for every layer.
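As a simplified caricature of what varying the hyperparameters with width looks like in practice, the sketch below builds Adam parameter groups whose learning rates for weight matrices shrink relative to a tuned base width, while vector-like parameters such as biases keep the base learning rate. The numbers and the grouping are our own illustrative assumptions and not the full µP prescription (which also adjusts initialisation and the output multiplier); see the paper for the exact rules.

```python
# Hedged sketch of width-dependent per-layer learning rates, not the full muP recipe.
import torch
import torch.nn as nn

# Illustrative numbers: a base (proxy) width at which the learning rate was tuned,
# and the larger target width we scale up to.
width, base_width, base_lr = 1024, 256, 3e-3
d_in, d_out = 32, 10

model = nn.Sequential(
    nn.Linear(d_in, width),    # input layer
    nn.ReLU(),
    nn.Linear(width, width),   # hidden layer
    nn.ReLU(),
    nn.Linear(width, d_out),   # output ("readout") layer
)

scale = base_width / width     # roughly 1/width relative to the tuned base model

def groups(layer, weight_lr):
    # Biases are vector-like, so (roughly) their learning rate is not scaled down.
    return [{"params": [layer.weight], "lr": weight_lr},
            {"params": [layer.bias], "lr": base_lr}]

param_groups = (
    groups(model[0], base_lr)            # input weights: learning rate unchanged
    + groups(model[2], base_lr * scale)  # hidden matrix: learning rate ~ 1/width
    + groups(model[4], base_lr * scale)  # output matrix: learning rate ~ 1/width
)
optimizer = torch.optim.Adam(param_groups)
```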
As an additional note, our technique can also be used in reverse, transferring a divergence issue from a large model to a small model to help debug the issue more quickly (see Appendix I: Reverse-µTransfer for Diagnosing Training Instability in Large Models in our paper).
AIM: How has µP advanced the research community’s understanding of large models?
Authors: To practitioners, our work, in particular Tensor Programs, really pulled back the curtain on the mystery of large neural networks. We now have an accurate, universal way of predicting the behaviour of wide models by doing certain mathematical calculations. As a specific corollary, we have the μTransfer technique, a surprising way of tuning a large model by tuning a small one. This allows practitioners to train very large models without resorting to guesswork for hyperparameters. We also think μTransfer will change how the community thinks about hyperparameters in general. But we should acknowledge that different scenarios will likely call for different hyperparameters, and no fixed set of hyperparameters is optimal for every setting.
To theoretical researchers, we hope our work settles the debate on the right way of thinking about wide neural networks. μP is the unique parametrisation allowing hyperparameter transfer, so we believe its limit, the feature learning limit, is the “correct” limit, whereas the kernel limits are “incorrect,” in the same way that one would say quantum mechanics is “correct” where classical mechanics is not.
In the bigger picture, we believe that the study of large neural networks is a rare area where theoretical insights have outsized leverage on the cutting edge of AI. This is because the most advanced models in the near future will always be among the largest ones, and empirical insights are expensive to obtain in such large models. We hope our work will inspire more fundamental theoretical works on the limits of large neural networks that can change the way humanity’s best AI systems are trained.