What is the solution for faster training of deep neural networks? Building faster processors? We already have GPUs and TPUs. What if that speed is still not enough? Should we develop processors that are even faster?
No, says a team of researchers from Google AI.
According to the researchers, accelerators sit idle much of the time, waiting for inputs. So, instead of tweaking the hardware endlessly, they introduced a simpler technique at the algorithmic level, which they call data echoing, described in their paper ‘Faster Neural Network Training with Data Echoing’.
Data echoing, as the name suggests, is a technique that reuses the output data of previous steps instead of keeping the processors waiting for fresh data.
How Does Data Echoing Work?
Not all operations in the training pipeline run on accelerators, so one cannot simply rely on faster accelerators to continue driving training speedups. Earlier stages in the training pipeline like disk I/O and data preprocessing involve operations that do not benefit from GPUs and TPUs.
As accelerator improvements outpace the developments in CPUs and disks, these earlier stages will increasingly become a bottleneck, wasting accelerator capacity and limiting training speed.
The technique the Google researchers proposed involves duplicating data into a shuffle buffer somewhere in the training pipeline.
They then implement data echoing by inserting a stage in the training pipeline that repeats (echoes) the outputs of the previous stage. Using TensorFlow’s `tf.data` API, an echoing stage is as simple as:

```python
dataset.flat_map(lambda t: tf.data.Dataset.from_tensors(t).repeat(e))
```

where `e` is the data echoing factor, the number of times each data item is repeated.
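For readers without TensorFlow at hand, the same stage can be sketched in plain Python as a generator that repeats each upstream item `e` times (the names `echo` and `upstream` are illustrative, not from the paper):

```python
def echo(upstream, e):
    """Repeat (echo) each item from the upstream pipeline stage e times,
    mirroring the flat_map + repeat pattern in the tf.data snippet above."""
    for item in upstream:
        for _ in range(e):
            yield item

# Each preprocessed batch reaches the training loop twice.
batches = ["batch0", "batch1", "batch2"]
echoed = list(echo(batches, e=2))
# echoed == ["batch0", "batch0", "batch1", "batch1", "batch2", "batch2"]
```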
After the first optimisation step on the preprocessed batch, the researchers reused the batch and performed a second step before the next batch was ready. In the best-case scenario, where repeated data is as useful as fresh data, the authors claimed a two-fold speedup in training.
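The idealised two-fold speedup can be seen with a toy timing model (the numbers and the function below are illustrative, not from the paper). If the upstream stages take `t_up` seconds per fresh batch and one accelerator step takes `t_step`, the accelerator otherwise sits idle whenever `t_up > t_step`:

```python
def steps_per_second(t_up, t_step, e):
    """Optimisation steps per second when each fresh batch is echoed e times.
    One fresh batch (t_up seconds upstream) yields e steps (e * t_step seconds
    on the accelerator); assuming upstream and accelerator work overlap, the
    pipeline is limited by whichever side is slower."""
    time_per_fresh_batch = max(t_up, e * t_step)
    return e / time_per_fresh_batch

# Upstream twice as slow as one accelerator step:
baseline = steps_per_second(t_up=2.0, t_step=1.0, e=1)  # 0.5 steps/s
echoed = steps_per_second(t_up=2.0, t_step=1.0, e=2)    # 1.0 steps/s
# Best case: an echoing factor of 2 doubles step throughput.
```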
In reality, admitted the authors, data echoing provides a slightly smaller speedup because repeated data is not as useful as fresh data – but it can still provide a significant speedup compared to leaving the accelerator idle.
Since every operation in the pipeline takes some time to execute, the placement of the echoing stage matters: the earlier it is inserted, the more idle downstream time echoing can exploit, and hence the larger the potential speedup.
For the experiments, the researchers tried data echoing on five neural network training pipelines spanning 3 different tasks – image classification, language modelling, and object detection – and measured the number of fresh examples needed to reach a particular performance target.
They found that data echoing can help reach the target performance with fewer fresh examples, demonstrating that reusing data is useful for reducing disk I/O across a variety of tasks. In some cases, repeated data is nearly as useful as fresh data: echoing before augmentation reduces the number of fresh examples required by almost the repetition factor ‘e’.
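An intuition for why echoing before augmentation works well is that each repeat still receives its own random augmentation, so the copies are not identical. A minimal sketch, using a made-up jitter augmentation purely for illustration:

```python
import random

def echo_then_augment(examples, e, rng):
    """Echo each raw example e times, then augment each copy independently,
    so repeated examples differ after augmentation."""
    for x in examples:
        for _ in range(e):
            # Illustrative augmentation: add small random jitter.
            yield x + rng.uniform(-0.1, 0.1)

rng = random.Random(0)
out = list(echo_then_augment([1.0, 2.0], e=2, rng=rng))
# The two echoed copies of each raw example get different jitter,
# so out[0] != out[1] even though both came from the same example.
```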
As improvements in GPUs and TPUs continue to outpace general-purpose computation, the authors expect data echoing to become an important part of the neural network training toolkit.
Key Takeaways
Data echoing as a concept sounds very promising. The speedup it offers in practice may not be dramatic, but it is a simpler way of improving training speed than redesigning a processor. This work can be summarised as follows:
- Data echoing is an effective alternative solution to optimising the training pipeline
- Echoing after augmentation is effective for image datasets that employ expensive data augmentation that runs on CPUs
- Data echoing does not degrade solution quality
Read more about this work here.