Abstraction is a common trait among today's widely used machine learning libraries and frameworks. Sweeping the nitty-gritty details under the rug and concentrating on implementing algorithms with greater ease is what any data scientist would like to do.
TensorFlow rose to prominence for the very same reason: abstraction.
Now, with its latest library, TensorFlow Graphics, it aims to address key computer vision challenges by incorporating knowledge from computer graphics, which in turn results in robust neural network architectures.
TensorFlow is a widely-used Machine Learning framework in the deep learning arena, demanding efficient utilization of computational resources.
While efforts are being made to make Deep Learning more accessible through platforms like TensorFlow, companies like Intel® are tweaking TensorFlow to extract high performance.
The TensorFlow framework has been optimized using Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) primitives, a popular performance library for deep learning applications.
TensorFlow follows a data flow paradigm for computations, which lends itself well to parallel execution.
TensorFlow is designed to be flexible, scalable and deployable. For example, new developers can get a quick start because the framework hides the complexity of distributed training, so they do not need to understand low-level APIs.
How Key Is Hardware For Deep Learning Inference
Deep learning inference can follow two different strategies, each with its own performance metrics and recommendations. The first is Max Throughput (MxT), which aims to process as many images per second as possible by passing in batches of size > 1. For Max Throughput, the best performance is achieved by exercising all the physical cores on a socket.
Real-time Inference (RTI) is an altogether different regime, where we typically want to process a single image as fast as possible. Here the aim is to avoid penalties from excessive thread launching and orchestration between concurrent processes. The strategy is to confine and execute quickly. The following best known methods (BKMs) differ between MxT and RTI where noted.
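The two regimes are measured differently: MxT is judged by images per second over a batch, RTI by the time taken for a single image. A minimal sketch of that measurement follows. The `measure` helper and the `dummy_infer` stand-in are hypothetical names used only for illustration; a real model's inference call would take the place of `dummy_infer`.

```python
import time


def measure(infer, batch, repeats=10):
    """Time `infer(batch)` and return (images_per_second, seconds_per_call)."""
    start = time.perf_counter()
    for _ in range(repeats):
        infer(batch)
    seconds_per_call = (time.perf_counter() - start) / repeats
    return len(batch) / seconds_per_call, seconds_per_call


def dummy_infer(batch):
    # Hypothetical stand-in for a real model: trivial work per image.
    return [sum(image) for image in batch]


image = list(range(10_000))  # fake "image" used only to give the loop some work

# MxT: batch size > 1, the metric of interest is images per second.
mxt_rate, _ = measure(dummy_infer, [image] * 32)

# RTI: a single image, the metric of interest is latency per image.
_, rti_latency = measure(dummy_infer, [image])

print(f"MxT: {mxt_rate:.0f} images/s")
print(f"RTI: {rti_latency * 1e3:.3f} ms per image")
```

The numbers printed depend entirely on the machine and on the stand-in workload, so they say nothing about TensorFlow itself; the point is only which quantity each strategy optimizes.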
How Intel® Affects Performance
Much of the success of modern AI, especially deep learning, is due to its impressive results in image classification, where near human-level performance has been observed.
To explain the advantages that Intel® brings to the table, let’s take the example of document classification.
This capability can be used for document authentication, which is a common task when opening a bank account, checking in at the airport, or showing a driver’s license to a police officer. Today most document authentication tasks are done by humans, but AI is proving effective and is increasingly employed for this activity.
Any typical document classification task would contain the following steps:
- Binary Classifier: Label a given image as a Document or Not Document
- Multiclass Classifier: Label an image classified as a Document into either Front, Back, or Unfolded
- OCR: This module receives an image and turns it into text
- Image Authentication: This module looks for a match between the picture in the document and the real person's picture stored in a database
- Text Authentication: This module looks for a match between the text in the document and the real person's data stored in a database
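The steps above can be sketched as a simple pipeline in which each stage is a pluggable function. This is an illustrative sketch, not the implementation used in the experiments: `AuthResult`, `authenticate`, and every stage function passed in are hypothetical names, and the lambdas in the usage example are trivial stubs standing in for real models and database lookups.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthResult:
    is_document: bool
    side: Optional[str] = None          # "Front", "Back", or "Unfolded"
    text: Optional[str] = None          # OCR output
    image_match: Optional[bool] = None  # picture matches the database record
    text_match: Optional[bool] = None   # text matches the database record


def authenticate(image, is_document, classify_side, ocr, match_image, match_text):
    """Run the five stages in order, stopping early for non-documents."""
    if not is_document(image):          # binary classifier
        return AuthResult(is_document=False)
    side = classify_side(image)         # multiclass classifier
    text = ocr(image)                   # OCR
    return AuthResult(True, side, text, match_image(image), match_text(text))


# Usage with hypothetical stubs in place of real models and lookups.
result = authenticate(
    image="scan.png",
    is_document=lambda img: True,
    classify_side=lambda img: "Front",
    ocr=lambda img: "JANE DOE",
    match_image=lambda img: True,
    match_text=lambda txt: txt == "JANE DOE",
)
print(result)
```

Passing the stages in as functions keeps the control flow separate from the models, so the two classifiers can be swapped or benchmarked without touching the rest of the pipeline.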
The binary and multiclass classifiers used in the experiments of this paper were implemented using the Keras* high-level API available in TensorFlow.
On the CPU, with Intel® Distribution for Python* and Intel® Optimization for TensorFlow in use, an improvement of around 70% to 80% was observed simply by installing Intel® Optimization for TensorFlow.
Parallelism is configured by setting the number of threads to execute in parallel for inter and intra operations in TensorFlow and Keras.
To do this, set intra_op_parallelism_threads and OMP_NUM_THREADS equal to the number of physical cores, set inter_op_parallelism_threads equal to the number of sockets, and set KMP_BLOCKTIME to zero.
Intel® Xeon® Platinum CPU 8153 has 32 physical cores and 2 sockets, therefore intra_op_parallelism_threads is set to 32 and inter_op_parallelism_threads to 2 as shown in the code snippet below:
import tensorflow as tf
from tensorflow.keras import backend as K
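Only the imports of the snippet survive above. A minimal sketch of how the described settings might be applied is given here, assuming a TensorFlow build that exposes the 1.x-style `tf.compat.v1.ConfigProto`. The helper names `thread_settings`, `set_openmp_env`, and `build_config` are hypothetical and introduced only for illustration.

```python
import os


def thread_settings(physical_cores, sockets):
    """Collect the settings described in the text in one dictionary."""
    return {
        "intra_op_parallelism_threads": physical_cores,
        "inter_op_parallelism_threads": sockets,
        "OMP_NUM_THREADS": physical_cores,
        "KMP_BLOCKTIME": 0,
    }


def set_openmp_env(settings):
    """Export the OpenMP variables; do this before TensorFlow starts its threads."""
    os.environ["OMP_NUM_THREADS"] = str(settings["OMP_NUM_THREADS"])
    os.environ["KMP_BLOCKTIME"] = str(settings["KMP_BLOCKTIME"])


def build_config(settings):
    """Build a ConfigProto, or return None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.compat.v1.ConfigProto(
        intra_op_parallelism_threads=settings["intra_op_parallelism_threads"],
        inter_op_parallelism_threads=settings["inter_op_parallelism_threads"],
    )


# Values from the text: 32 physical cores across 2 sockets.
settings = thread_settings(physical_cores=32, sockets=2)
set_openmp_env(settings)

# With TensorFlow installed, the config would then be handed to a session:
#   config = build_config(settings)
#   session = tf.compat.v1.Session(config=config)
```

In session-based code, the resulting session is typically registered with the Keras backend so that Keras models run under the same thread limits.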
Runtime options heavily affect TensorFlow performance. Understanding them will help get the best performance out of the Intel® Optimization for TensorFlow.
- Data layout
Recommended settings (RTI) → intra_op_parallelism = #physical cores
Recommended settings → inter_op_parallelism = 2
tf_cnn_benchmarks usage (shell)
python tf_cnn_benchmarks.py --num_intra_threads=cores --num_inter_threads=2
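In the command above, `cores` is a placeholder for the number of physical cores. A small sketch of filling it in from Python is shown here; `benchmark_command` is a hypothetical helper, and the core count is passed in explicitly because Python's standard library reports logical rather than physical cores.

```python
import shlex


def benchmark_command(physical_cores, inter_threads=2, script="tf_cnn_benchmarks.py"):
    """Build the argument list for the benchmark invocation shown in the text."""
    return [
        "python",
        script,
        f"--num_intra_threads={physical_cores}",
        f"--num_inter_threads={inter_threads}",
    ]


cmd = benchmark_command(physical_cores=32)
print(shlex.join(cmd))
# python tf_cnn_benchmarks.py --num_intra_threads=32 --num_inter_threads=2

# To actually launch it (requires the benchmark script to be present):
#   import subprocess
#   subprocess.run(cmd, check=True)
```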
intra_op_parallelism_threads and inter_op_parallelism_threads are runtime variables defined in TensorFlow's ConfigProto, which is used for configuration when creating a session.
These two variables control the number of cores to use.
This runtime setting controls parallelism inside an operation. For instance, if matrix multiplication or reduction is intended to be executed in several threads, this variable should be set. TensorFlow will schedule tasks in a thread pool which contains intra_op_parallelism_threads threads.
These optimizations can result in orders of magnitude higher performance. For example, measurements are showing up to 70x higher performance for training and up to 85x higher performance for inference on Intel® Xeon Phi™ processor 7250.
The comparison was made between a default environment with libraries from the official pip channel (baseline) and an Intel® optimized environment where Intel® Distribution for Python* and Intel® Optimization for TensorFlow* were installed.
The results show a 3.1x speedup when training a binary image classifier and a 3.6x speedup when training a multiclass image classifier.
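A speedup factor can also be read as a reduction in training time, which is sometimes easier to reason about. The short calculation below converts the two reported speedups; `time_reduction` is a hypothetical helper name, and the only inputs are the figures quoted above.

```python
def time_reduction(speedup):
    """Fraction of the baseline time saved for a given speedup factor."""
    return 1.0 - 1.0 / speedup


for name, speedup in [("binary", 3.1), ("multiclass", 3.6)]:
    print(f"{name}: {speedup}x speedup -> {time_reduction(speedup):.1%} less training time")
# binary: 3.1x speedup -> 67.7% less training time
# multiclass: 3.6x speedup -> 72.2% less training time
```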
For even better performance, batch size was increased in the optimized environment. Increasing batch size boosted performance but led to an accuracy drop on both classifiers.
Validation accuracy dropped from 98% to 85% on the binary classifier and from 95% to 44% on the multiclass classifier.
We can also take advantage of large memory size available on Intel® Xeon® Scalable processors and increase the batch size to process more images at the same time while computing the gradients of a Neural Network. Increasing the batch size can reduce the execution time for training on CPUs, but it may also have an impact on testing accuracy, therefore this step should be taken carefully to decide if the gain in execution time is worth the loss in accuracy.
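One way to make this decision explicit is to compare the accuracy drop against a tolerance chosen in advance. The sketch below uses the validation accuracies reported above; the helper names and the 2-point tolerance are hypothetical and would need to be set per application.

```python
def accuracy_drop(before_pct, after_pct):
    """Accuracy lost, in percentage points, after increasing the batch size."""
    return before_pct - after_pct


def is_acceptable(before_pct, after_pct, max_drop_pct):
    """True if the accuracy drop stays within the chosen tolerance."""
    return accuracy_drop(before_pct, after_pct) <= max_drop_pct


# Validation accuracies from the text, in percent: (before, after larger batch).
reported = {"binary": (98, 85), "multiclass": (95, 44)}
MAX_DROP_PCT = 2  # hypothetical tolerance, not taken from the source

for name, (before, after) in reported.items():
    drop = accuracy_drop(before, after)
    verdict = "acceptable" if is_acceptable(before, after, MAX_DROP_PCT) else "too costly"
    print(f"{name}: -{drop} points -> {verdict}")
# binary: -13 points -> too costly
# multiclass: -51 points -> too costly
```

Under this tolerance neither reported configuration would be kept, which matches the caution in the text: the faster run is only worth it if the accuracy loss is one the application can absorb.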
TensorFlow’s machine learning platform has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications.
Now, with the release of TensorFlow 2.0, the focus shifts toward developer productivity, simplicity, and ease of use. There are multiple changes in TensorFlow 2.0 to make TensorFlow users more productive. TensorFlow 2.0 removes redundant APIs, makes APIs more consistent (Unified RNNs, Unified Optimizers), and improves integration with the Python runtime through Eager execution.
Optimizing TensorFlow means deep learning applications built using this widely available and widely applied framework can now run much faster on Intel® processors to increase flexibility, accessibility, and scale.
The Intel® Xeon Phi processor, for example, is designed to scale out in a near-linear fashion across cores and nodes to dramatically reduce the time to train machine learning models.
The collaboration between Intel® and Google to optimize TensorFlow is part of ongoing efforts to make AI more accessible to developers and data scientists, and to enable AI applications to run wherever they’re needed on any kind of device, from the edge to the cloud. Intel® believes this is the key to creating the next generation of AI algorithms and models to solve the most pressing problems in business, science, engineering, medicine, and society.