With the third round of MLPerf benchmarking results out, graphics giant NVIDIA announced that it had broken AI performance records, with the fastest commercially available products for AI training. Google, however, has also proclaimed acing the MLPerf tests with what it calls the world's fastest ML training supercomputer.
Although both companies have showcased significant achievements in speeding up the training of ML models, which would indeed be critical for research breakthroughs, there is a hidden element that deserves attention.
Like the Standard Performance Evaluation Corporation (SPEC) and Transaction Processing Performance Council (TPC) benchmarks, MLPerf is an industry-standard benchmark designed to measure the time to train ML models on a set of specific tasks. The consortium behind it comprises some 80 companies and universities from all over the world, and other prominent vendors competing in this round included Intel, with its Xeon processors, and Huawei.
Is Google Lagging Behind NVIDIA’s GPU?
The latest results announced by MLPerf echoed the last two rounds: NVIDIA held the top position across all 16 benchmarks for commercially available hardware and software products on a variety of ML tasks. In the research category, however, Google's Tensor Processing Unit (TPU) surpassed NVIDIA's results on most tasks.
In a recent tweet, Google AI lead Jeff Dean shared his excitement over setting records in six out of eight MLPerf benchmarks. According to the results, Google topped the training of DLRM, Transformer, BERT, SSD, ResNet-50 and Mask R-CNN, leveraging its new machine learning supercomputer and TPU chips. This supercomputer is believed to be four times larger than the Cloud TPU v3 Pod that set records in the previous round, when Google became the first cloud provider to outperform on-premise systems.
While Google has powered up its Cloud TPUs to deliver faster training times for ML models, NVIDIA's advanced A100 GPU, the engine of its five-petaflop DGX A100 system, also excelled across all eight MLPerf benchmarks. The A100 is the first processor built on NVIDIA's Ampere architecture, which allows the GPU to address acceleration needs of different sizes, from small jobs to big multi-node workloads.
Strangely enough, given the advanced capabilities of the A100 that could easily dominate the field, the only other companies to submit results for commercially available systems were Google and Huawei, and only for two categories: image classification and natural language processing. NVIDIA's results thereby remained on top in system design and training, beating Huawei, Google's TPU, as well as Intel, which has recently switched to Habana's AI chips.
According to the results, NVIDIA's A100-based DGX system took about 49 seconds to train BERT, far ahead of Google's 57 minutes. Understanding its position, Google therefore submitted its TPU v4 chip in the research category to shore up its standing there, with astounding performance on many training tasks. That said, the company hasn't yet released the chip on Google Cloud, which again could put it well behind in the race against the A100, which is already in commercial production.
These points highlight how NVIDIA single-handedly dominates the commercial category, with other vendors like Dell EMC, Alibaba and even Google submitting performance results using the A100. And even though Google's TPU will not be available in the market for some time, it indeed showcased impressive performance on many MLPerf tasks. Interestingly, Intel has also joined the fray with a soon-to-be-released CPU, though it is too early to predict its success in applications.
On another note, MLPerf added a new test, the Deep Learning Recommendation Model (DLRM), to reflect the growing use of machine learning in production settings. Although NVIDIA performed brilliantly on the newly added benchmark, it might still lag in building recommendation engines, which demand massive amounts of supercomputer-class memory. Google, on the other hand, with its supercomputing capabilities, has recently launched a beta version of Recommendations AI, making it much easier for developers to build recommendation engines.
That said, even though MLPerf assesses most aspects of AI performance, two vital parameters are left out of the benchmark: the price of the chips and systems, and their energy consumption, both critical considerations today. Machines with more chips and higher accuracy are likely to consume more energy and cost more, which might put Google in the lead once again.
With all that information in hand, a clear trend emerges: ever-bigger hardware is gaining traction, and it will drastically reduce the training time of machine learning models. However, NVIDIA is not the only one dominating the market; Google's supercomputer also showed tremendous results in training ML models.
While NVIDIA and Google will continue their race for the highest AI performance records, NVIDIA noted in its company blog that "the real winners are the customers", who will now be able to leverage these advances to transform their businesses faster with AI.
Check full results here.