Deep learning is a subset of machine learning. Its architecture is loosely modeled on the human brain. The human brain is composed of neural networks that connect billions of neurons. Similarly, a deep learning architecture comprises artificial neural networks that connect a number of mathematical units called neurons.
Deep learning is capable of modeling complex problems and, in some cases, exceeds human performance. With the successes in the deep learning domain, a few frameworks have emerged that aim to generalize the deep learning workflow from data pre-processing to model deployment. Though there are several deep learning frameworks in practice, TensorFlow and PyTorch have remained the preferred ones among practitioners and researchers over the years.
This article discusses the fundamentals of deep learning along with hands-on implementation using TensorFlow Keras. Keras is a high-level API adopted into TensorFlow, meant exclusively for deep learning tasks. Let’s dive deeper.
A Neuron or Unit
A neuron is the fundamental building block of deep learning architecture. It is a simple mathematical operator that performs a weighted summation of its inputs. The inputs to a neuron can be either features of an input data point or the outputs of neurons of its previous neural layer. A neuron is also called a unit.
Inputs (X1, X2, X3) are multiplied by their corresponding weights (w1, w2, w3) and added together, along with a bias (b), to form the output (y). That is, y = w1·X1 + w2·X2 + w3·X3 + b. The bias is a learnable offset that lets the neuron shift its output independently of the inputs, capturing what the weighted inputs alone cannot. Hence, the output is a linear function of the inputs.
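The weighted summation described above can be sketched in a few lines of NumPy. The input, weight, and bias values below are made-up numbers chosen only for illustration; they do not come from the article's dataset.

```python
import numpy as np

# Hypothetical values for illustration only
x = np.array([1.0, 2.0, 3.0])    # inputs X1, X2, X3
w = np.array([0.5, -1.0, 2.0])   # weights w1, w2, w3
b = 0.5                          # bias

# Weighted sum of the inputs plus the bias
y = np.dot(w, x) + b
print(y)
```

Changing any weight or the bias changes y proportionally, which is what makes the neuron's raw output linear.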
Role of An Activation Function
The output of a neuron is a linear function of its inputs. But the true pattern in data usually cannot be captured with a linear function alone. Hence, in most neurons, the linear output is transformed with an activation function to obtain a non-linear output.
There are many non-linear activation functions in practical use. ReLU, tanh, sigmoid and softmax are the most widely used. For instance, ReLU, short for Rectified Linear Unit, sets negative values to zero and leaves positive values unchanged.
Therefore, a neuron can be viewed as an integration of a linear activation (weighted summation) function and a non-linear activation function if a non-linear activation function is employed.
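As a rough illustration of the activation functions named above, here is a minimal NumPy sketch. The function names and the sample vector z are chosen for this example only; Keras ships its own implementations, which this sketch does not reproduce.

```python
import numpy as np

def relu(z):
    # Negative values become zero; positive values pass through unchanged
    return np.maximum(0.0, z)

def sigmoid(z):
    # Squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Turns a vector of scores into probabilities that sum to 1
    shifted = z - np.max(z)  # subtracting the max improves numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

z = np.array([-2.0, 0.0, 3.0])  # hypothetical linear outputs of a neuron layer
print(relu(z))
print(np.tanh(z))               # tanh is available directly in NumPy
print(sigmoid(z))
print(softmax(z))
```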
Bias and weights are known as the parameters of a neuron. Each neuron has one bias and as many weights as it has inputs. Weights are assigned random initial values when the model is built (in Keras, biases start at zero by default). During training, these values are updated in iterative steps; one full pass over the training data is called an epoch. Finding the most suitable weights is called learning. At the end of training, the neural architecture and these weights are together called the trained deep learning model.
A Neural Layer
A collection of neurons that receive inputs from the same source is called a neural layer. Though each neuron in a neural layer receives the same inputs, the neurons differ in their weights. Thus each neuron attempts to capture a different pattern hidden in the data.
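One way to picture a layer is as a single matrix multiplication in which each column of the weight matrix belongs to one neuron. The sketch below assumes 4 made-up input features and 3 neurons; the sizes, seed, and values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed, only for reproducibility

x = np.array([1.0, 2.0, 3.0, 4.0])   # one data point with 4 hypothetical features
W = rng.normal(size=(4, 3))          # one column of weights per neuron (3 neurons)
b = np.zeros(3)                      # one bias per neuron

# Every neuron sees the same inputs but applies its own column of weights
layer_output = x @ W + b
print(layer_output.shape)            # (3,)
```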
Let’s try to understand the implementation of a neural layer with a code example. First, import the necessary libraries and modules.
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
We can proceed with our discussion using a regression problem on structured data. This example is loaded from Google Colab’s in-built datasets. Readers may opt for their own data.
Explore the in-built datasets in Google Colab using the following command.
The sample data includes california_housing_train.csv for training and california_housing_test.csv for validation. Load the data using the following commands.
train = pd.read_csv('sample_data/california_housing_train.csv')
test = pd.read_csv('sample_data/california_housing_test.csv')
train.head()
How many examples are there in the train and validation sets?
There are 17000 training examples and 3000 validation examples. There are nine columns in total, including the target column. Let’s check the usability of the raw train data.
Each feature is of float64 data type and there are no missing values in the train data. The data is clean and can be used as such.
Similarly, validation data has no missing values. Let’s split the features and target.
# train features and target
X_train = train.copy()
y_train = X_train.pop('median_house_value')

# test features and target
X_test = test.copy()
y_test = X_test.pop('median_house_value')
Let’s define a Keras dense layer with 3 units (or neurons) and a relu activation. Since there are 8 features in the train data, input_shape is [8].
# dense layer 3 units; relu; 8 input features
layer_1 = keras.layers.Dense(3, activation='relu', input_shape=[8])
This layer can be applied to data without training. However, the weights will be randomly initialized during its call.
# prepare a single row of data
example = np.array(X_train.iloc[:1, :])
example.shape
Example data is ready and has the correct input shape as the layer expects.
Output has three entries each corresponding to the three neuron units. The first value is a positive number, whereas the next two values are zeros. This may be because of the ReLU activation that forces negative numbers to zero.
Let’s load the neural layer’s weights.
The top array refers to weights and the bottom array refers to bias. 8 inputs to each of 3 units give 8×3 weights, and three units give 3 biases. The weights are random initial values (in Keras, biases start at zero by default), and all of them will be updated during training.
A neural layer without ReLU activation may have negative outputs. Let’s define another layer without any activation function.
# dense layer 3 units; no activation; 8 input features
layer_2 = keras.layers.Dense(3, input_shape=[8])
print(layer_2(example))
As expected the output array contains one positive and two negative values. (This may vary, as the weights are purely random)
It is clear that the presence of an activation function has no role in determining the number of parameters.
A Neural Network
A neural network is a stack of more than one neural layer. By convention, a neural network is termed a deep neural network if it is composed of three or more neural layers. The first layer that takes input from data is called the input layer, and the last layer that gives the required output is called the output layer. The remaining layers are generally called the hidden layers. They are called so because the outputs of hidden layers are not directly observed.
Let’s build a neural network with 4 layers, each with 512 units and ReLU activation functions, followed by an output layer. The output layer gives a single continuous value (regression problem). Hence, it must have a single unit without any activation.
regressor = keras.Sequential([
    # input layer
    keras.layers.Dense(512, activation='relu', input_shape=[8]),
    # 3 hidden layers
    keras.layers.Dense(512, activation='relu'),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.Dense(512, activation='relu'),
    # output layer
    keras.layers.Dense(1)
])
The number of parameters in each neural layer can be calculated using the code,
Number of parameters in each layer can be manually verified by considering the number of units in each layer and the number of inputs they receive.
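The manual check can be sketched in plain Python. The helper dense_params is a hypothetical name introduced for this example, and the layer sizes are read off the model code above (8 input features, four layers of 512 units, one output unit).

```python
def dense_params(n_inputs, n_units):
    # Each unit has one weight per input plus one bias
    return n_inputs * n_units + n_units

# (inputs, units) for each layer of the network defined above
layer_shapes = [(8, 512), (512, 512), (512, 512), (512, 512), (512, 1)]
per_layer = [dense_params(n_in, n_out) for n_in, n_out in layer_shapes]

print(per_layer)             # [4608, 262656, 262656, 262656, 513]
print(sum(per_layer))        # 793089
print(dense_params(8, 3))    # 27, the 3-unit layer from the earlier example
```

The activation function does not appear anywhere in the calculation, which is consistent with the earlier observation that it adds no parameters.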
The neural network is built and is ready for training. It is customary to apply data to an untrained model to check for any shape compatibility issues.
# test the untrained model
regressor.predict(example)
The neural network produces a numerical output value as expected.
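To see why a single row with 8 features ends up as a single number, here is a rough NumPy sketch of the same stack of layers. It is not how Keras implements the model: the helper dense, the seed, and the weight scale are assumptions made for this illustration, and the weights are random, as in an untrained network.

```python
import numpy as np

rng = np.random.default_rng(42)

def dense(x, n_units, activation=None):
    # Hypothetical helper: a dense layer with freshly drawn random weights
    W = rng.normal(scale=0.05, size=(x.shape[1], n_units))
    b = np.zeros(n_units)
    z = x @ W + b
    return np.maximum(0.0, z) if activation == "relu" else z

x = rng.normal(size=(1, 8))      # one made-up example with 8 features
h = x
for _ in range(4):               # input layer plus 3 hidden layers, 512 units each
    h = dense(h, 512, activation="relu")
out = dense(h, 1)                # output layer: single unit, no activation
print(out.shape)                 # (1, 1)
```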
Training A Neural Network
Training is the process of updating weights iteratively so that the model can predict the output with minimal error. This needs two important functions: a loss function and an optimizer. A loss function measures the deviation of the output from the ground truth value. An optimizer determines how to update the weights to reduce the loss in the next iteration (or for the next batch). Commonly used loss functions in regression problems are mean absolute error (MAE) and mean squared error (MSE). Commonly used optimizers in deep learning are stochastic gradient descent (SGD) and its variants such as Adam and RMSProp.
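The two ideas above can be sketched with NumPy on made-up numbers. The first part computes MAE and MSE directly. The second part performs one step of plain gradient descent on a one-weight model; this is the simplest relative of SGD, not Adam, and the learning rate of 0.1 is an arbitrary choice for the illustration.

```python
import numpy as np

# --- Loss functions on hypothetical values ---
y_true = np.array([3.0, 5.0, 2.0])       # ground-truth values
y_pred = np.array([2.5, 5.0, 4.0])       # model predictions
mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)    # mean squared error
print(mae, mse)

# --- One plain gradient-descent step for the model y_hat = w * x ---
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])            # generated with w = 2
w = 0.0                                  # initial weight
lr = 0.1                                 # learning rate (arbitrary)

loss_before = np.mean((w * x - y) ** 2)
grad = np.mean(2 * (w * x - y) * x)      # derivative of MSE with respect to w
w = w - lr * grad                        # move against the gradient
loss_after = np.mean((w * x - y) ** 2)
print(loss_before, loss_after)           # the loss drops after the update
```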
Let’s start training by implementing an MAE loss function and an Adam optimizer.
It is a good habit to normalize or scale the input data before feeding it to a neural network in order to have a unified scale.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
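What standard scaling does to each feature can be sketched with NumPy on a tiny made-up array: subtract the per-feature mean and divide by the per-feature standard deviation. The array X below is illustrative and unrelated to the housing data.

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])     # made-up data: 3 rows, 2 features on different scales

mean = X.mean(axis=0)            # per-feature mean
std = X.std(axis=0)              # per-feature population standard deviation
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))     # approximately 0 for each feature
print(X_scaled.std(axis=0))      # approximately 1 for each feature
```

Note that in the article's code the scaler is fitted on the training data only and then reused on the test data, so the test set is scaled with the training statistics.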
The neural network is trained with batches of data (as SGD expects), and the weights are updated after each batch. The batch size should be large enough for a batch to be reasonably representative of the whole dataset. Here, we use a batch size of 256 and train the model for 100 epochs.
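# configure the MAE loss and the Adam optimizer before training
regressor.compile(optimizer='adam', loss='mae')

# train, validating on the test split after each epoch
history = regressor.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    batch_size=256,
    epochs=100)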
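The number of batches per epoch follows from the dataset size and the batch size. A quick check in plain Python, using the 17000 training examples reported earlier and assuming the final, smaller batch is kept rather than dropped:

```python
import math

n_train = 17000      # training examples, as reported earlier in the article
batch_size = 256

batches_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (batches_per_epoch - 1) * batch_size
print(batches_per_epoch, last_batch_size)   # 67 104
```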
There are 67 batches of training data in the above training. ‘loss’ refers to the training loss, and ‘val_loss’ refers to the validation loss. Each epoch has taken 2 seconds approximately. The training performance can be clearly understood with the help of training history.
hist = pd.DataFrame(history.history)
hist.plot()
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.show()
For better visual clarity, plot the loss values after the fifth epoch.
hist.iloc[5:].plot()
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.show()
The plot makes the training behaviour easy to read. Here, the loss keeps falling even at the 100th epoch, which suggests that the model could be trained for more epochs as long as the loss continues to fall. Further, the validation loss is noticeably higher than the training loss for every epoch after the 10th. This suggests that the model may be overfitting the training data, and approaches such as dropout and batch normalization are worth attempting. Note that if a batch normalization layer is placed at the start of the model, it can take over the role of scaling the raw inputs, so separate scaling of the raw data may become unnecessary.
In the case of classification problems, the output layer will have a suitable activation function. Popular activation functions are sigmoid for binary classification and softmax for multi-class classification. Furthermore, the loss functions should be properly chosen. Popular loss functions are binary cross-entropy for binary classification and sparse categorical cross-entropy for multi-class classification. Except for the above changes, the implementation of a classification problem in TensorFlow Keras is similar to the above-discussed regression problem implementation.
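The classification losses named above can be sketched with NumPy for a single example. The function names mirror the Keras loss names for readability, but these are simplified stand-ins written for this illustration, and the logits and probabilities are made-up values.

```python
import numpy as np

def softmax(z):
    # Convert raw scores into class probabilities
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()

def sparse_categorical_crossentropy(label, probs):
    # "Sparse" means the label is an integer class index, not a one-hot vector
    return -np.log(probs[label])

def binary_crossentropy(label, p):
    # p is the predicted probability of class 1 (for example, a sigmoid output)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw outputs for 3 classes
probs = softmax(logits)

loss_if_class_0 = sparse_categorical_crossentropy(0, probs)
loss_if_class_2 = sparse_categorical_crossentropy(2, probs)
loss_binary = binary_crossentropy(1, 0.9)

print(probs)
print(loss_if_class_0, loss_if_class_2)   # lower loss when the likely class is the true one
print(loss_binary)
```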
In this article, we have discussed the fundamentals of a deep learning neural network. Further, we have addressed the Python implementation of a deep neural network for a regression task using TensorFlow Keras. We have discussed some key approaches that can be incorporated with the built model to improve its performance.
Find here the Colab Notebook with the above code implementation.