“Using Dask, RAPIDS, BlazingSQL, and NVIDIA GPUs, researchers are leveraging Summit supercomputers from their laptops.”
Data-intensive work such as protein folding research, drug discovery, or deep space research generates several terabytes of data, and querying that data on CPUs can take days. Time is a key constraint when fighting a global pandemic. Research labs and governments around the world have committed money and manpower to speed up drug discovery, but this isn’t sufficient. What is needed is a smart, efficient solution that combines existing technologies rather than reinventing the wheel.
At Oak Ridge National Laboratory, which has been at the forefront of the fight against COVID-19, researchers have been using the powerful Summit supercomputer to sift through large datasets in search of solutions. Summit, the world’s second-fastest supercomputer, is powered by NVIDIA’s Tesla V100 GPUs, and the team at OLCF (Oak Ridge Leadership Computing Facility) has been looking for solutions that would fit well into their technology stack.
The team at OLCF reviewed how commercial fields such as finance and marketing approach the analysis of large structured datasets. They found that integrating BlazingSQL into the RAPIDS/Dask ecosystem would provide a GPU-accelerated, open-source platform for running extremely fast and scalable SQL queries alongside other data analytics.
This flexible toolset enabled engineers to get a custom workflow up and running in less than two weeks.
The Secret Sauce Behind Summit
Back in 2014, the United States awarded a $325 million contract to IBM, NVIDIA and Mellanox to construct two supercomputers, Summit and Sierra. Summit went on to be the world’s fastest supercomputer until it was eclipsed by Japan’s Fugaku a couple of months ago. Summit is also the first supercomputer to reach exascale speed (a quintillion operations per second), achieving 1.88 exaops with mixed-precision arithmetic during a genomic analysis.
Today, using Dask, RAPIDS, BlazingSQL, and NVIDIA GPUs, researchers can tap the power of the Summit supercomputer from their laptops. In June this year, NVIDIA announced that, using the RAPIDS suite of open-source data science software libraries on 16 DGX A100 systems, it had run the TPCx-BB big data analytics benchmark in just 14.5 minutes, compared to the previous best of 4.7 hours on a CPU system. The DGX A100 systems had a total of 128 NVIDIA A100 GPUs and used NVIDIA Mellanox networking.
The RAPIDS data science framework is a collection of libraries for executing end-to-end data science pipelines entirely on the GPU. RAPIDS uses optimised NVIDIA CUDA primitives and high-bandwidth GPU memory to accelerate data preparation and machine learning. The goal of RAPIDS is to accelerate not only the individual parts of a typical data science workflow but the complete end-to-end workflow.
BlazingSQL, meanwhile, is a fully open-source, free-to-use standard SQL engine built entirely on top of RAPIDS. It lets users ETL raw data directly into GPU memory as a GPU DataFrame (GDF). While BlazingSQL and RAPIDS have multiplied NVIDIA’s role in the advancement of AI-based research, one more key component has helped meet these demands: Dask.
When its development began back in 1989, Python was not intended to handle TB-scale production workloads. But the way it bridges high-performance languages and APIs like Fortran and CUDA to lightweight, user-friendly APIs has made it a mainstay of data science. It would not be an exaggeration to say that there are no data scientists who haven’t heard of NumPy, scikit-learn and pandas. However, these successful libraries didn’t offer solutions to parallelism problems. Enter Dask.
Dask supports native code, which makes it easy to work with for both Python users and C/C++/CUDA developers. Dask is also a critical component of the RAPIDS ecosystem, making it even easier to take advantage of accelerated computing through a comfortable Python-based user experience.
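To make the parallelism gap concrete, here is a minimal sketch of the chunk-and-combine pattern that tools like Dask automate. It deliberately uses only Python’s standard library rather than Dask’s own API, and the names `chunked`, `partial_sum` and `parallel_mean` are hypothetical helpers made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum(chunk):
    """Per-chunk work: return (sum, count) so partial results can be merged."""
    return sum(chunk), len(chunk)


def parallel_mean(values, chunk_size=1000, workers=4):
    """Compute a mean by reducing each chunk independently, then combining."""
    chunks = chunked(values, chunk_size)
    # Each chunk is an independent task, so a scheduler can run them anywhere.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    data = list(range(10_000))
    print(parallel_mean(data))
```

The thread pool here only illustrates the structure of the computation; because of Python’s global interpreter lock it will not speed up this CPU-bound sum. The point is that once work is expressed as independent per-chunk tasks plus a combine step, a scheduler is free to spread those tasks across cores, machines or GPUs.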
Today, many scientific research centres, including Oak Ridge, are adopting both Dask and RAPIDS to scale some of their most important operations. Some of NVIDIA’s biggest partners, leaders in their industries, are using Dask and RAPIDS to power their data analytics.
Demand is rising for highly usable distributed computing, for more computational power and for open-source software. At the intersection of these trends is NVIDIA, offering integrated solutions through RAPIDS, BlazingSQL and Dask.