When data sets become very large, the performance of machine learning models becomes a concern. In such situations the models need to be scaled, which saves time and memory and can also improve model performance. TensorFlow 2.x provides features for scaling machine learning models easily and effectively. In this article, we will discuss how scalable machine learning models can be built with TensorFlow 2.x. First, we will look at the shared memory model and the distributed memory model, and then we will see how TensorFlow 2.x supports both. The major points covered in this article are listed in the table of contents below.
Table of Contents
- What is Scalable Machine Learning?
- The Shared Memory Model (SMM)
- The Distributed Memory Model (DMM)
- The transition from SMM to DMM
- The GPU and TPU Accelerators
- Scalability in TensorFlow 2.x
Let us proceed with understanding scalable machine learning.
What is Scalable Machine Learning?
In the real world, where the amount of data is very large, the scalability of machine learning models becomes a primary concern. Most learners start data science with data sets small enough for the whole process and model to fit on a single machine. Many real-world applications, however, require scaling ML models across multiple machines. Stock market data is one example: models need to adapt to new data and produce predictions quickly, because a prediction can become useless within milliseconds. This is where scalable machine learning comes in. It aims to combine statistics, systems, machine learning, and data mining into a procedure that is flexible enough for any environment.
More formally, scalable here means building a model that can handle any amount of data without a disproportionate increase in resources such as memory and time. When we try to speed up machine learning computation on big data, a lack of scalability causes problems: fitting a model to a large data set becomes difficult, and even once we have a model, its computation can be slow. As a result, we may have to sacrifice accuracy, because optimizing models on large-scale data becomes infeasible or too time-consuming.
In scalable machine learning, we try to build a system in which each component has its own task, so that the system as a whole reaches a solution quickly, without wasting memory and with better performance. This brings us to the shared memory model and the distributed memory model, which are two hardware architectures on which models can be built.
The Shared Memory Model
When we develop a model on a single machine, information can be passed easily across the whole machine: all the cores have access to the same memory, and the model's threads can read the same data. This is called the shared memory model, because variables and memory are shared among multiple processes or threads in one system. Sharing is also what allows those processes to communicate with each other.
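The idea can be illustrated with a minimal sketch in plain Python. The function name parallel_sum and the way the input is chunked are assumptions made for this example only. Several threads compute partial sums and write them into one shared total, protected by a lock. In CPython this shows the sharing model rather than a real speed-up.

```python
import threading

def parallel_sum(values, num_threads=4):
    """Sum `values` with several threads that update one shared total."""
    total = [0]                  # shared state, visible to every thread
    lock = threading.Lock()      # guards updates to the shared total

    def worker(chunk):
        partial = sum(chunk)     # thread-local work, no coordination needed
        with lock:               # publish the result through shared memory
            total[0] += partial

    # Give each thread a strided slice of the input.
    chunks = [values[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]

result = parallel_sum(list(range(1000)))
print(result)  # 499500
```

No data is copied between the threads: they communicate simply by reading and writing the same variable, which is why a lock is needed.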
The Distributed Memory Model
When a problem cannot be handled by a single machine, we need a network of multiple machines that communicate with each other to complete the work. Each machine has its own memory, and the network allows machines to pass messages to one another as required.
The above image is an illustration of a distributed memory system with three machines.
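As a rough sketch of message passing, the example below simulates three machines inside one Python process. This is only a simulation: each "machine" is a thread that sees nothing except the messages in its own inbox, and the queues stand in for network links. The names machine and distributed_sum are hypothetical.

```python
import queue
import threading

def machine(rank, inbox, outbox):
    """Simulated machine: it sees only what arrives in its own inbox."""
    shard = inbox.get()               # receive a data shard as a message
    outbox.put((rank, sum(shard)))    # send the partial result back

def distributed_sum(values, num_machines=3):
    inboxes = [queue.Queue() for _ in range(num_machines)]
    results = queue.Queue()
    workers = [
        threading.Thread(target=machine, args=(rank, inboxes[rank], results))
        for rank in range(num_machines)
    ]
    for w in workers:
        w.start()
    # Scatter: every machine gets its own shard, nothing is shared.
    for rank, inbox in enumerate(inboxes):
        inbox.put(values[rank::num_machines])
    # Gather: collect one partial result per machine and combine them.
    partials = dict(results.get() for _ in range(num_machines))
    for w in workers:
        w.join()
    return sum(partials.values()), partials

total, partials = distributed_sum(list(range(10)))
print(total)     # 45
print(partials)  # partial sums keyed by machine rank
```

In a real distributed system the queues would be replaced by a network protocol, but the pattern of scattering data, computing locally, and gathering results stays the same.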
The Transition from SMM to DMM
Scalability can also be considered as the transition of processes from a shared memory model to a distributed memory model. Some basic concepts of any distributed memory model are:
- The machines are the nodes of the network, and they are fully connected.
- The links of the network between machines are bidirectional.
Given these concepts, we need to keep track of the following points:
- Which computing thread of the model runs on which machine.
- How to move data across the network so that the procedure completes without excessive delay or errors.
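These concepts can be made concrete with a toy sketch. The machine names, the helper functions, and the round-robin placement are all assumptions chosen for illustration. The sketch lists the bidirectional links of a fully connected network and records which computing thread is placed on which machine.

```python
from itertools import combinations

def fully_connected_links(machines):
    """Every unordered pair of machines shares one bidirectional link."""
    return list(combinations(machines, 2))

def place_threads(thread_ids, machines):
    """Round-robin placement: record which computing thread runs where."""
    return {t: machines[i % len(machines)] for i, t in enumerate(thread_ids)}

machines = ["m0", "m1", "m2"]
links = fully_connected_links(machines)
placement = place_threads(range(5), machines)

print(links)      # [('m0', 'm1'), ('m0', 'm2'), ('m1', 'm2')]
print(placement)  # {0: 'm0', 1: 'm1', 2: 'm2', 3: 'm0', 4: 'm1'}
```

With n machines a fully connected network has n(n-1)/2 links, so three machines need three links.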
Developing models for the shared memory model is easier, but it has drawbacks: the number of cores, the number of computing threads, and the amount of memory are all limited. With big data, models often become larger than the RAM of a single machine, whereas a distributed system lets us increase the available memory by combining two or more machines. It can also speed up the model by running different computations on different machines.
The GPU and TPU Accelerators
In the shared memory model, we can use a GPU (Graphics Processing Unit) to increase memory and computing power up to a limit, and in a distributed memory system we can use a TPU (Tensor Processing Unit). An NVIDIA GPU is a good option for getting the most out of the shared memory model, but it only goes so far. If we need more memory to share or more cores for computation, we have to move to the distributed memory model, where a Google TPU is a good option as an accelerator. A Google TPU is designed for the distributed memory model: it has its own IP address and communication protocol. Using it also saves us from dealing with the communication layer, which is handled by the Accelerated Linear Algebra (XLA) layer exposed through the TensorFlow API.
Scalability in TensorFlow 2.x
A typical TPU workflow relies on serializing the data in batches, and TensorFlow provides the TFRecord format to make this serialization easier. In a distributed memory model, data serialization is both very important and one of the toughest tasks. TFRecord eases it and helps with large-scale deployment. The image below represents the architecture of the shared and distributed memory models when using GPU and TPU accelerators.
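To give a feel for TFRecord serialization, here is a minimal sketch that writes four toy records to a temporary file and reads them back in batches. The feature names x and y, the file name, and the toy values are assumptions made for this example.

```python
import os
import tempfile

import tensorflow as tf

def serialize_example(feature_value, label):
    """Pack one (feature, label) pair into a serialized tf.train.Example."""
    example = tf.train.Example(features=tf.train.Features(feature={
        "x": tf.train.Feature(float_list=tf.train.FloatList(value=[feature_value])),
        "y": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))
    return example.SerializeToString()

# Hypothetical output location for the toy file.
path = os.path.join(tempfile.mkdtemp(), "toy.tfrecord")

# Write four toy records to disk.
with tf.io.TFRecordWriter(path) as writer:
    for i in range(4):
        writer.write(serialize_example(float(i), i % 2))

# Describe the record layout so the bytes can be parsed back into tensors.
feature_spec = {
    "x": tf.io.FixedLenFeature([], tf.float32),
    "y": tf.io.FixedLenFeature([], tf.int64),
}

def parse(record):
    return tf.io.parse_single_example(record, feature_spec)

dataset = tf.data.TFRecordDataset(path).map(parse).batch(2)
batches = [(b["x"].numpy().tolist(), b["y"].numpy().tolist()) for b in dataset]
print(batches)
```

Because every record is a self-contained byte string, such files can be split into shards and streamed to different machines without each machine needing the whole data set in memory.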
The TensorFlow API makes many of the intermediate steps of training a model easy to perform. TensorFlow 2.x supports custom training loops, with which we can train any number of models synchronously. The previous version did not offer this option in the same way, so the API was geared towards training one model at a time.
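A minimal sketch of such a custom training loop is shown below. It updates two small Keras models in the same step on toy regression data. The data, the model sizes, and the learning rate are assumptions picked to keep the example short, not recommendations.

```python
import tensorflow as tf

tf.random.set_seed(0)

# Toy regression data following y = 3x + 1.
x = tf.random.normal((64, 1))
y = 3.0 * x + 1.0

# Two independent models, each with its own optimizer.
models = [tf.keras.Sequential([tf.keras.layers.Dense(1)]) for _ in range(2)]
optimizers = [tf.keras.optimizers.SGD(learning_rate=0.1) for _ in range(2)]
loss_fn = tf.keras.losses.MeanSquaredError()

def train_step(x_batch, y_batch):
    """Run one gradient update for every model on the same batch."""
    losses = []
    for model, optimizer in zip(models, optimizers):
        with tf.GradientTape() as tape:
            loss = loss_fn(y_batch, model(x_batch, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        losses.append(float(loss))
    return losses

first = train_step(x, y)
for _ in range(50):
    last = train_step(x, y)
print("first step:", first)
print("last step:", last)
```

Because the loop is ordinary Python, the same pattern extends to setups where several models must be updated together, such as a generator and a discriminator.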
In TensorFlow 2.x, eager execution is turned on by default. It is a programming environment that evaluates operations immediately: operations return concrete values instead of building a computational graph to run later. This helps in debugging models and makes TensorFlow a flexible platform for experimentation, because we can call an operation directly to test a running model. It also allows the use of Python control flow instead of graph control flow.
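The short sketch below shows both points: an operation that returns a concrete value right away, and a plain Python while loop driven by tensor values. The collatz_steps function is a made-up example chosen only because it needs data-dependent control flow.

```python
import tensorflow as tf

print(tf.executing_eagerly())  # True by default in TensorFlow 2.x

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)            # runs immediately and returns a concrete tensor
print(b.numpy())

def collatz_steps(n):
    """Count Collatz steps using ordinary Python control flow on tensors."""
    n = tf.constant(n)
    steps = 0
    while n != 1:                                # tensor used as a Python bool
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

print(collatz_steps(6))  # 8
```

Since every intermediate value is available immediately, it can be printed or inspected in a debugger like any other Python object.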
TensorFlow 2.x also lets us compile functions with tf.function, which turns our Python programs into graphs. Compiled functions give TPUs flexibility comparable to that of GPUs. There are several other new features, such as the tf.distribute.Strategy API, which allows distributed training of models across multiple devices and machines. The major goal of this API is to provide an easy-to-use interface with good performance and easy switching between strategies. TensorFlow 2.x also has extensions to tf.data that make it easy to specify how data is distributed among the cores.
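The three pieces can be combined in one small sketch: a step compiled with tf.function, dispatched through a tf.distribute strategy, over a tf.data dataset that the strategy splits across replicas. To keep it short, the sketch only runs a forward pass that measures the error of a fixed linear model, not a full training step. The choice of MirroredStrategy, the toy data, and the variable values are assumptions.

```python
import tensorflow as tf

# Assumption: MirroredStrategy, which uses all visible GPUs or the CPU if none.
strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH = 4 * strategy.num_replicas_in_sync
NUM_EXAMPLES = 16

# Toy data following y = 3x + 1.
x = tf.reshape(tf.range(NUM_EXAMPLES, dtype=tf.float32), (NUM_EXAMPLES, 1))
y = 3.0 * x + 1.0
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(GLOBAL_BATCH)

# The strategy decides how each global batch is split among the replicas.
dist_dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    w = tf.Variable(3.0)   # correct slope
    b = tf.Variable(0.0)   # bias deliberately off by one

def replica_step(batch):
    """Sum of squared errors on this replica's part of the batch."""
    xb, yb = batch
    return tf.reduce_sum(tf.square(w * xb + b - yb))

@tf.function  # traced once into a graph, then reused for every batch
def distributed_step(batch):
    per_replica = strategy.run(replica_step, args=(batch,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)

total = 0.0
for batch in dist_dataset:
    total += float(distributed_step(batch))
mse = total / NUM_EXAMPLES
print(mse)  # 1.0, since every prediction is off by exactly one
```

The step function itself never mentions devices. Which hardware runs it is decided entirely by the strategy object created at the top.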
Summing up these features, eager mode is what we need in the shared memory model. Compiled functions, on the other hand, use delayed execution, which is what remote execution of the model requires. Debugging a delayed-execution model is harder because it depends on tracing. Here the strategy API helps, because it keeps the code the same whichever interface it runs on.
The common code path also makes it easier for developers to write and understand their code when switching from a shared memory model to a distributed memory model. The model code can be run first on a shared memory model and then, when required, switched to a distributed memory model. Being able to debug in GPU mode first also makes the process faster.
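As a rough illustration of the common code path, the sketch below runs the same model code under two different strategies. The helper names make_strategy and run_model are hypothetical, and only two strategies are shown. A TPU strategy would be created differently but could be passed to run_model in the same way.

```python
import tensorflow as tf

def make_strategy(kind):
    """Hypothetical helper that maps a name to a tf.distribute strategy."""
    if kind == "single":
        return tf.distribute.OneDeviceStrategy("/cpu:0")
    if kind == "mirrored":
        return tf.distribute.MirroredStrategy()
    raise ValueError(f"unknown strategy kind: {kind}")

def run_model(strategy):
    """The model code is identical for every strategy."""
    with strategy.scope():
        w = tf.Variable([[2.0], [1.0]])

    @tf.function
    def forward(x):
        return tf.matmul(x, w)

    x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    per_replica = strategy.run(forward, args=(x,))
    # Take the result of the first local replica for comparison.
    return strategy.experimental_local_results(per_replica)[0].numpy().tolist()

single_out = run_model(make_strategy("single"))
mirrored_out = run_model(make_strategy("mirrored"))
print(single_out)
print(mirrored_out)
```

Only the line that creates the strategy changes between the two runs, which is what makes it practical to debug on one device first and scale out afterwards.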
As we have seen in this article, developing models for the shared memory model is easier than developing them for the distributed memory model. Since improving performance is a major goal for any developer, we often need to build DMM models. TensorFlow not only lets us develop for the DMM easily but also provides features that help with debugging, and its common code path makes models flexible enough to move between SMM and DMM.