Training bigger neural networks is challenging when accelerator memory is limited. The datasets used by machine learning models are now very large. For example, a standard image classification dataset such as hashtagged Instagram contains millions of images. As image quality increases, so does the memory required. Today, a high-end NVIDIA GPU offers only 32 GB of memory.
Therefore, there has to be a tradeoff between the memory allocated to the features in a model and the memory taken up by the network's activations. It is understandable, then, why the accelerator memory limit needs to be breached.
A deep neural network benefits from larger datasets because they alleviate the problem of overfitting. To run these ever-growing networks, we need deep learning supercomputers such as Google's TPU or NVIDIA's DGX, which enable parallelism by providing faster interconnections between the accelerators.
Today, the average resolution of an ImageNet image is 469 x 387, and it has been shown that increasing the size of the input image increases the final accuracy of a classifier. To fit within current accelerator memory limits, most models are built to process images of size 299 x 299 or 331 x 331.
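To get a feel for why input resolution matters for memory, here is a rough back-of-the-envelope sketch. It assumes RGB images stored as 32-bit floats and a hypothetical batch size of 256, neither of which comes from the figures above. It counts only the input tensor, not the activations of any layer, so treat the numbers as an illustrative lower bound.

```python
BYTES_PER_FLOAT32 = 4
CHANNELS = 3  # assumption: RGB input


def input_batch_bytes(width, height, batch_size=256):
    """Bytes needed to hold one batch of float32 images (input tensor only)."""
    return width * height * CHANNELS * BYTES_PER_FLOAT32 * batch_size


# Resolutions mentioned in the text: two common model input sizes
# and the average ImageNet resolution.
for width, height in [(299, 299), (331, 331), (469, 387)]:
    mib = input_batch_bytes(width, height) / 2**20
    print(f"{width} x {height}: {mib:.1f} MiB per batch of 256")
```

An image at the average ImageNet resolution has about twice as many pixels as a 299 x 299 input, and every convolutional activation map early in the network grows with the input in the same way.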
The computing power of machines has grown in bursts over the past few decades. High-performance computing can mean either improving the performance of a single application or harnessing thousands of cores to get speedups of many orders of magnitude.
Parallelization The Intel Way
Reducing Workload With Xeon
For instance, graph algorithms contain a lot of parallelism. A graph is a structured representation of a dataset that captures the relationships between its elements: each element is a vertex, and each relationship is an edge between two vertices.
With the increasing availability of larger datasets, graph analytics has become quite significant in data center applications.
Since the topology of a graph is irregular, fetching vertex attributes is challenging, and how quickly they are fetched depends on the architecture of the processor. Traditional methods lead to underutilisation of compute and memory resources. This memory intensiveness and irregularity make graph-based applications challenging for current processors.
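The irregularity is easier to see with a concrete layout. The sketch below stores a small, made-up graph in compressed sparse row (CSR) form, one common layout for graph analytics. The choice of CSR, the example edges, and the helper names build_csr and neighbors_of are assumptions made for illustration and do not come from Intel's work.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, neighbors) for an undirected edge list."""
    degree = [0] * num_vertices
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # offsets[v] .. offsets[v + 1] delimits the neighbors of vertex v.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]

    neighbors = [0] * offsets[-1]
    cursor = offsets[:-1].copy()  # next free slot for each vertex
    for u, v in edges:
        neighbors[cursor[u]] = v
        cursor[u] += 1
        neighbors[cursor[v]] = u
        cursor[v] += 1
    return offsets, neighbors


def neighbors_of(v, offsets, neighbors):
    return neighbors[offsets[v]:offsets[v + 1]]


# Hypothetical 5-vertex graph with one attribute per vertex.
edges = [(0, 1), (0, 4), (1, 2), (1, 3), (1, 4), (3, 4)]
attributes = [10, 20, 30, 40, 50]
offsets, neighbors = build_csr(5, edges)

# Neighbor lists differ in length, and each attribute lookup jumps to an
# index that is only known once the neighbor list has been read.
for v in range(5):
    nbrs = neighbors_of(v, offsets, neighbors)
    print(v, nbrs, sum(attributes[n] for n in nbrs))
```

The neighbor lists themselves are contiguous, but the attribute reads they trigger are scattered, which is the kind of access pattern that caches and prefetchers handle poorly.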
This is where the parallelism of graph algorithms can be exploited to fetch attributes faster. It can be achieved with processors that provide better access to sparsely populated memory while making good use of the cache.
Intel’s Xeon processors achieved speedups of up to 5x for vector generations in graph-based algorithms.
As an example, instead of looking at the neighbors of the current front and checking whether they have already been visited (the forward algorithm), all non-visited vertices are considered, and each one is checked to see whether it is a neighbor of a vertex in the front. Since there are more non-visited vertices (attributes) than vertices in the front, there is more parallelism to make use of.
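The two strategies described above can be sketched as follows, assuming a breadth-first traversal over an undirected graph stored as adjacency lists. The function names and the example graph are hypothetical, and the code is sequential. It only illustrates which loop becomes the parallel one and is not Intel's implementation.

```python
def forward_step(adj, front, visited):
    """Forward: scan neighbors of the front, keep the unvisited ones."""
    next_front = set()
    for u in front:
        for v in adj[u]:
            if v not in visited:
                next_front.add(v)
    return next_front


def backward_step(adj, front, visited):
    """Backward: scan every unvisited vertex, keep it if it touches the front."""
    next_front = set()
    for v in adj:  # iterations are independent, so this loop parallelizes well
        if v in visited:
            continue
        if any(u in front for u in adj[v]):
            next_front.add(v)
    return next_front


def bfs_levels(adj, source, step):
    """Run the traversal with the given step function; return depth per vertex."""
    visited = {source}
    front = {source}
    levels = {source: 0}
    depth = 0
    while front:
        depth += 1
        front = step(adj, front, visited)
        for v in front:
            levels[v] = depth
        visited |= front
    return levels


# Hypothetical undirected graph as adjacency lists.
adj = {0: [1, 4], 1: [0, 2, 3, 4], 2: [1], 3: [1, 4], 4: [0, 1, 3]}
print(bfs_levels(adj, 0, forward_step))
print(bfs_levels(adj, 0, backward_step))
```

Both step functions discover the same vertices at each depth. The backward version does more checks per step, but each unvisited vertex can be tested independently and can stop at the first neighbor found in the front.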
Parallelization With Numba
While continuously tweaking its hardware, Intel has also released Python-based frameworks and libraries to accelerate the deployment of deep neural nets.
Parallelism in Python is difficult, and Intel plans to achieve it with Numba.
The Numba framework uses just-in-time (JIT) and Low Level Virtual Machine (LLVM) compilation engines to create native-speed code.
The first requirement for using Numba is that the code targeted for JIT or LLVM compilation must be enclosed in a function. After the initial pass of the Python interpreter, which converts the source to bytecode, Numba looks for the decorator that marks a function for a Numba interpreter pass. Next, it runs the Numba interpreter to generate an intermediate representation (IR). It then generates a context for the target hardware and proceeds to JIT or LLVM compilation.
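As a minimal sketch of that workflow, the snippet below wraps a numeric loop in a function and marks it with Numba's jit decorator. The Monte Carlo estimate of pi is a stand-in chosen for illustration and is not the code this article links to. The fallback decorator is an assumption added so the sketch still runs, uncompiled, on a machine without Numba.

```python
import random

try:
    from numba import jit
except ImportError:
    # Assumption for illustration: without Numba, fall back to a no-op
    # decorator so the function simply runs as ordinary Python.
    def jit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@jit(nopython=True)  # the decorator marks this function for compilation
def monte_carlo_pi(num_samples):
    """Estimate pi by sampling points in the unit square."""
    inside = 0
    for _ in range(num_samples):
        x = random.random()
        y = random.random()
        if x * x + y * y < 1.0:
            inside += 1
    return 4.0 * inside / num_samples


# The first call triggers compilation; later calls reuse the native code.
print(monte_carlo_pi(200_000))
```

The loop lives entirely inside the decorated function, which is what lets Numba compile it as a unit. Code left at module level would still be run by the regular interpreter.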
Numba works with both NumPy and SciPy and can also accelerate transcendental ufuncs using Intel's Short Vector Math Library (SVML).
import array
import random
from numba import jit
Check the rest of the code here
The next generation of AI technologies should be able to comprehend commands by drawing on a huge background of information in a fast-paced environment. To make these machines smart, there is a great need for innovation on the hardware side while also making them more energy efficient.