“Training BERT costs about $7,000, and for the largest models like GPT-3, this number can be as high as $12 million.”
Typically, training a deep learning model starts with a forward pass, where the loss function is evaluated, followed by a backward pass, where the gradients of the loss are computed. These gradients are then pushed to servers, which aggregate the updates from all participants and apply them to the global machine learning model. This procedure repeats until the model reaches a target accuracy. State-of-the-art models are large and compute-heavy, and as models grow, training remains an expensive affair. Distributed training is one way to avoid restricting research to well-funded labs. Volunteer computing (VC) is already popular in other domains, such as bioinformatics and physics, where people donate the idle time of their desktops, smartphones, and other personal devices to solve computationally hard problems. Imagine borrowing your friend’s PC to train your deep learning model remotely while they are away.
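To make the forward pass, backward pass, and server-side aggregation loop described above concrete, here is a minimal sketch in plain Python. It fits a single weight `w` so that `y = w * x`, with three workers each holding a shard of data. The function names, the data, the learning rate, and the stopping rule (a small change in `w` rather than an accuracy target) are all assumptions made for illustration, not anything prescribed by the article.

```python
# Toy parameter-server loop for fitting y = w * x.
# All names and values here are hypothetical illustration choices.

def worker_gradient(w, shard):
    """One worker: forward pass (prediction) and backward pass (gradient) on its shard."""
    grad = 0.0
    for x, y in shard:
        pred = w * x                   # forward pass
        grad += 2.0 * (pred - y) * x   # gradient of (pred - y)^2 with respect to w
    return grad / len(shard)

def server_step(w, gradients, lr):
    """Server: average the gradients pushed by the workers and update the global weight."""
    avg = sum(gradients) / len(gradients)
    return w - lr * avg

def train(shards, lr=0.05, tol=1e-6, max_rounds=1000):
    w = 0.0
    for _ in range(max_rounds):
        grads = [worker_gradient(w, shard) for shard in shards]
        new_w = server_step(w, grads, lr)
        if abs(new_w - w) < tol:       # stand-in for "until it hits a certain accuracy"
            return new_w
        w = new_w
    return w

# Three workers, each with its own slice of data generated from y = 3 * x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)], [(4.0, 12.0), (0.5, 1.5)]]
w = train(shards)
```

Every round requires each worker to send its gradient to the server and wait for the new weight, which is why communication, rather than arithmetic, tends to dominate once the workers sit behind ordinary internet connections.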
Landscape of collaborative computation
Folding@home or FAH is a distributed computing project for simulating protein dynamics, including protein folding and the movements of proteins implicated in a variety of diseases. FAH brings together volunteers (citizen scientists) to run simulations of protein dynamics on their personal computers. Insights from these data help scientists better understand biology and provide new opportunities for developing therapeutics. For example, over 700,000 Folding@home volunteers collectively contributed 2.43 exaFLOPs of compute to COVID-19 research in April 2020.
The Berkeley Open Infrastructure for Network Computing (BOINC) app downloads scientific computing jobs to a user’s personal computer and runs the workload in the background. For instance, Rosetta@home is a distributed computing project for protein structure prediction on the BOINC platform. Rosetta can tap into the computational power of idle computers to help with projects related to designing new proteins and predicting their three-dimensional shapes.
MLC@Home is a distributed computing project dedicated to understanding and interpreting complex machine learning models, with an emphasis on neural networks. It uses the BOINC distributed computing platform. [email protected] is another project on BOINC that provides an open, collaborative platform for ML researchers. It allows them to train thousands of networks in parallel, with tightly controlled inputs, hyperparameters, and network structures. However, distributed training still has a few problems:
- Distributed training of a single model requires significantly more communication and does not allow a natural way to “restart” failed jobs.
- Distributed training of neural networks is bounded by the throughput of parameter servers and the memory available on the weakest GPU.
“Is there really no alternative to using pre-trained models for the broader ML community?”
According to Hugging Face (HF), whose NLP libraries are used by companies such as Apple, data transfer in distributed deep learning is still a bottleneck. It arises from the need to aggregate gradients from multiple workers, and because most participants don’t have high-speed connections, they run the risk of being dropped from the network. “So how on Earth can you train anything with a household data plan?” asks the team at HF.
Now, a team of researchers from Yandex, HF and other institutions has come up with a new method that lets machine learning models train over the internet more effectively. The new training algorithm is called Distributed Deep Learning in Open Collaborations (DeDLOC).
Data parallelism across GPUs is a popular technique. DeDLOC tries to combine the strengths of existing parallelism strategies while adapting popular distributed training techniques. It uses synchronous data-parallel training with hyperparameters that stay fixed regardless of the number of volunteers. Training is done with extremely large batches to compensate for slow communication. According to the researchers, each device accumulates gradients at its own pace until the collaboration as a whole reaches the target batch size. Once it does, the collaborators exchange their gradients and perform one optimiser step.
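The accumulate-then-step scheme described above can be sketched as a small simulation. This is an illustration under stated assumptions, not the DeDLOC implementation: the `Peer` class, the `TARGET_BATCH_SIZE` value, the per-device speeds, and the stand-in loss `(w - 3)^2` are all hypothetical. The point is that faster devices contribute more samples to each step, while the step itself is taken only once the combined sample count reaches the target.

```python
# Toy simulation of "accumulate until the collaboration reaches the target batch size".
# Peer, TARGET_BATCH_SIZE and the device speeds are hypothetical illustration values.
from dataclasses import dataclass

TARGET_BATCH_SIZE = 64

@dataclass
class Peer:
    name: str
    samples_per_tick: int    # samples this device can process per unit of time
    grad_sum: float = 0.0    # sum of per-sample gradients accumulated so far
    n_samples: int = 0

    def accumulate(self, w):
        # Stand-in for backpropagation: the per-sample gradient of (w - 3)^2 is 2 * (w - 3).
        self.grad_sum += self.samples_per_tick * 2.0 * (w - 3.0)
        self.n_samples += self.samples_per_tick

def collaborative_step(peers, w, lr):
    """Let peers accumulate at their own pace, then average and take one optimiser step."""
    while sum(p.n_samples for p in peers) < TARGET_BATCH_SIZE:
        for p in peers:
            p.accumulate(w)
    total = sum(p.n_samples for p in peers)
    avg_grad = sum(p.grad_sum for p in peers) / total   # sample-weighted average gradient
    for p in peers:                                     # reset accumulators for the next step
        p.grad_sum, p.n_samples = 0.0, 0
    return w - lr * avg_grad, total

peers = [Peer("gpu-desktop", 16), Peer("laptop", 4), Peer("colab", 8)]
w = 0.0
for _ in range(50):
    w, batch = collaborative_step(peers, w, lr=0.1)
```

Because the gradients are averaged over the total number of samples, the update does not depend on how many peers took part or how fast each one was. That is one way to read the claim that hyperparameters can stay fixed as volunteers come and go. In this toy version the batch can overshoot the target slightly, since every peer finishes its current tick before the check.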
DeDLOC operates similarly to BitTorrent and I2P, where individual peers coordinate by forming a Distributed Hash Table (DHT). To test DeDLOC’s performance, the researchers trained the sahajBERT language model. The experiment had 40 volunteers, 30 of whom were Bengali-speaking. Volunteers were asked to open the provided notebook (Colab/Kaggle) locally, run one code cell, and watch the training loss decrease on the shared dashboards. The cumulative runtime for the experiment was 234 days.
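For readers unfamiliar with Distributed Hash Tables, the sketch below shows the core idea: every key is hashed into the same ID space as the peers, and the peer whose ID is "closest" to the key stores its value, so no central server is needed. The XOR distance used here is the Kademlia-style metric found in BitTorrent's DHT; the article does not say which DHT design DeDLOC uses, so treat that choice, along with the `ToyDHT` class, the peer names, and the example key, as assumptions.

```python
# Toy distributed hash table: keys are assigned to the peer with the closest ID.
# ToyDHT, the peer names and the example key are hypothetical illustration choices.
import hashlib

def node_id(name):
    """Map a peer name or a key to a 160-bit integer ID using SHA-1."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")

def responsible_peer(key, peers):
    """Return the peer whose ID is closest to the key's ID under XOR distance."""
    k = node_id(key)
    return min(peers, key=lambda p: node_id(p) ^ k)

class ToyDHT:
    def __init__(self, peers):
        # Each peer holds only the slice of the table it is responsible for.
        self.storage = {p: {} for p in peers}

    def put(self, key, value):
        owner = responsible_peer(key, list(self.storage))
        self.storage[owner][key] = value
        return owner

    def get(self, key):
        owner = responsible_peer(key, list(self.storage))
        return self.storage[owner].get(key)

dht = ToyDHT(["peer-a", "peer-b", "peer-c", "peer-d"])
owner = dht.put("samples_accumulated/peer-a", 16)
value = dht.get("samples_accumulated/peer-a")
```

This toy version has a global view of all peers. A real DHT spreads that knowledge across the network, with each peer knowing only a few others and forwarding lookups hop by hop, and it also replicates entries so that they survive peers leaving.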
At the end of training, sahajBERT was compared with three other pretrained language models: XLM-R Large, IndicBERT, and bnRoBERTa. The results showed that sahajBERT, pretrained with DeDLOC, achieves nearly state-of-the-art quality, with results comparable to much larger models that were trained on hundreds of high-tier accelerators. This is the first distributed deep learning training at scale, and the results are encouraging for individual researchers looking to take up expensive ML training tasks. “The community for any language can train their own models without the need for significant computational resources concentrated in one place,” wrote the Hugging Face team.