In the traditional machine learning paradigm, we usually work offline: we start with data preprocessing and end by fitting an algorithm to stored data to satisfy the requirements. This makes the process storage-dependent and time-consuming. To overcome this, we can use streaming data for predictive analysis or any other modelling process, so that the data does not need to be stored before it is modelled. This can be accomplished with stream learning and online learning. In this article, we will see how streaming data can be made useful in machine learning, and how to implement online machine learning to learn from it. The major points that we will cover are listed below.
Table of Contents
- What is Streaming Data?
- Online Machine Learning
- About Creme
- Implementing Online Learning Using Creme
What is Streaming Data?
Data that is generated continuously and incrementally from different sources can be considered streaming data. This type of data is produced frequently and flows across websites and platforms that are designed to provide information as a function of time. Streaming data is usually discussed in the context of big data, where it is generated by many sources at a high speed that may be stable or unstable. In the context of technology, streaming also refers to a system that delivers information to devices over the internet, allowing users to access content immediately instead of downloading it first and using it offline.
As we know, big data mainly focuses on storing the data. This incurs a large storage cost, and stored data tends to become unstructured, which is not helpful for machine learning algorithms. A data stream is a sequence of digitally encoded, coherent signals used to carry data that is in the process of being transferred. It can be considered a set of information that is supplied by a data provider and extracted by a receiver. Using various methods, we can put streaming data to use, for example for making reports, supporting on-demand data-driven decisions, and feeding machine learning algorithms.
We can extract useful information from one or many data streams for modelling purposes. Since offline machine learning models are trained on stored, offline data, online machine learning comes into the picture when we want to model streaming data.
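To make the idea of extracting information from a stream without storing it more concrete, here is a minimal sketch in plain Python. It assumes a simulated feed (the `sensor_stream` generator and its source names are hypothetical) and keeps only a running count and mean per source, so memory use stays constant however long the stream runs.

```python
import random
from collections import defaultdict

def sensor_stream(n_events, seed=0):
    """Yield (source, value) events one at a time, like a live feed would."""
    rng = random.Random(seed)
    for _ in range(n_events):
        source = rng.choice(["web", "mobile", "iot"])  # hypothetical sources
        yield source, rng.gauss(10.0, 2.0)

def summarize(stream):
    """Keep a running count and mean per source without storing any event."""
    counts = defaultdict(int)
    means = defaultdict(float)
    for source, value in stream:
        counts[source] += 1
        # Incremental mean: move the old mean toward the new value.
        means[source] += (value - means[source]) / counts[source]
    return dict(counts), dict(means)

counts, means = summarize(sensor_stream(3000))
print(counts)
print(means)
```

Each event is seen once and then discarded, which is the property that lets stream processing avoid the storage cost described above.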
Online Machine Learning
Online machine learning is a method in which a model, or predictor, is combined with data that arrives in sequential order, so that the predictor is updated at every step and is ready for future data. A traditional machine learning program is fitted on a fixed dataset and then predicts the values it was designed for. Online machine learning programs, on the other hand, are designed to adapt to new patterns in the data as it is generated over time. Online models have their own benefits: the model can be updated automatically when the data changes, which makes them a preferred choice in fields such as financial markets and economics, where new data emerges frequently.
The above image represents the flow chart of a typical online machine learning system: streaming data is used to train an online machine learning algorithm, and when the precision level of the model is not sufficient, we can analyze the problem to make the algorithm perform better on the new data.
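The predict-then-update cycle of online learning can be sketched in a few lines of plain Python. This is an illustrative toy, not the implementation of any particular library: the class `OnlineLogisticRegression`, the method names, the learning rate and the synthetic `toy_stream` are all assumptions made for the example.

```python
import math
import random

class OnlineLogisticRegression:
    """Logistic regression trained by one SGD step per observation."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}   # feature name -> weight, grown lazily
        self.bias = 0.0

    def predict_proba_one(self, x):
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())
        z = max(-35.0, min(35.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # Gradient of the log loss for a single example is (p - y) * x.
        error = self.predict_proba_one(x) - y
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.learning_rate * error * v
        self.bias -= self.learning_rate * error

def toy_stream(n, seed=42):
    """Synthetic stream: the label is 1 when a + b > 0."""
    rng = random.Random(seed)
    for _ in range(n):
        x = {"a": rng.uniform(-1, 1), "b": rng.uniform(-1, 1)}
        yield x, int(x["a"] + x["b"] > 0)

model = OnlineLogisticRegression()
correct = total = 0
for x, y in toy_stream(2000):
    y_pred = int(model.predict_proba_one(x) >= 0.5)  # predict first ...
    correct += int(y_pred == y)
    total += 1
    model.learn_one(x, y)                            # ... then update
online_accuracy = correct / total
print(f"accuracy: {online_accuracy:.3f}")
```

Because every observation is used for evaluation before it is used for training, the accuracy is measured on data the model has not yet seen, and no separate test set is needed.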
There are various frameworks available to perform online machine learning. A few of them are:
This article is mainly focused on the creme library which provides the functionality for performing online machine learning on streaming data. So let us understand creme in detail with its implementation.
About Creme
Creme is a Python library for online machine learning that provides most of the tools designed specifically for this setting. Using the library, we can learn from a stream of data with different approaches. Its models learn from one data point at a time, so they can be updated whenever required. This approach makes it possible to learn from big data that does not fit in main memory. The library is also suited to integrating machine learning into settings where new data is constantly arriving.
The following benefits can be achieved using creme:
- Real-time updated models – the basic feature of the models in the creme package is that they are incremental, so they stay up to date as time passes.
- Models in creme can adapt to concept drift, that is, to changes over time in the relationship between the input data and the target variable.
- During model development, creme can reproduce the production scenario, because the models are developed on data streams as well.
- Models in creme are designed to require little computation, and they do not need to be retrained from scratch.
- Models are designed to learn from one observation at a time.
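The concept drift point in the list above can be illustrated with a small plain-Python sketch. It does not show how creme handles drift internally; it only assumes a simulated stream whose centre jumps from 0 to 5 halfway through, and a hypothetical smoothing factor `alpha`, to contrast an estimate that weighs all history equally with one that favours recent data.

```python
import random

def drifting_stream(n, seed=1):
    """Values centred on 0 for the first half, then on 5 (an abrupt drift)."""
    rng = random.Random(seed)
    for i in range(n):
        centre = 0.0 if i < n // 2 else 5.0
        yield rng.gauss(centre, 1.0)

cumulative_mean = 0.0   # weighs every past observation equally
ewm = 0.0               # exponentially weighted mean, favours recent data
alpha = 0.05            # hypothetical smoothing factor
for i, value in enumerate(drifting_stream(2000), start=1):
    cumulative_mean += (value - cumulative_mean) / i
    ewm += alpha * (value - ewm)

print(f"cumulative mean: {cumulative_mean:.2f}")
print(f"exponentially weighted mean: {ewm:.2f}")
```

The cumulative mean ends up between the old and the new regime, while the exponentially weighted mean tracks the new one, which is the kind of behaviour an incrementally updated model needs when the data changes.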
The following are the features of the creme library:
- Creme includes various Nearest Neighbors, Decision Tree, and Naive Bayes models.
- The library provides model validation methods such as k-fold cross-validation and progressive validation.
- There are models for recommendation systems and time series forecasting, as well as linear models with a variety of optimizers.
- For unsupervised learning, the library provides clustering.
- The package has built-in features for learning from imbalanced classes.
- Alongside these facilities, there are options for feature extraction and feature selection.
- It ships with various built-in datasets, such as airline passengers, chick weights, fraud detection, and many others, so that we can learn how to use the library easily.
- The library also has an anomaly detection feature.
Implementing Online Learning Using Creme
Now, let us implement a linear model, namely logistic regression for binary classification, using the creme library.
The basic requirement for working with creme is Python 3.6 or above. We can install the library using pip with the following command:
!pip install creme
Now let’s start building a logistic regression model for classifying the website phishing dataset. Before modelling, let’s check what the dataset consists of:
Importing dataset from the creme library
from creme import datasets
A_b = datasets.Phishing()
print(A_b)
The above output represents some basic details of the phishing dataset. Let’s divide the dataset into dependent and independent variables.
from pprint import pprint
# Look at the first observation only: A holds the features, b the target.
for A, b in A_b:
    pprint(A)
    print("dependent variable =", b)
    break
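Each observation in the loop above is a pair made of a collection of named features and a label. A rough sketch of how such pairs could be produced lazily from a raw feed with the standard library is given below. The column names and values in `RAW` are invented for illustration and are not the columns of the phishing dataset.

```python
import csv
import io

# Hypothetical raw feed; in practice this could be a file or a socket.
RAW = """url_length,has_ip,is_phishing
54,0,0
131,1,1
78,0,0
"""

def iter_observations(text_file, target):
    """Yield (features, label) pairs lazily, one row at a time."""
    for row in csv.DictReader(text_file):
        label = bool(int(row.pop(target)))
        features = {name: float(value) for name, value in row.items()}
        yield features, label

for features, label in iter_observations(io.StringIO(RAW), target="is_phishing"):
    print(features, label)
```

Because the function is a generator, only one row is held in memory at a time, which matches the one-observation-at-a-time style of the loop above.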
Now we can train a logistic regression model on the data in a streaming fashion.
Building the Model
Importing the packages for data preprocessing, models, and accuracy metrics.
from creme import compose, linear_model, metrics, preprocessing
Defining a pipeline for a model instance and scaling the data.
from pprint import pprint
# Chain feature scaling and logistic regression into a single model.
lm = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)
pprint(lm)
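Scaling in a streaming pipeline cannot rely on the mean and standard deviation of a full dataset, because the full dataset is never available. The sketch below shows one common way to standardize a single feature online, using Welford's running mean and variance. It illustrates the idea only and is not creme's own implementation; the class `RunningScaler` and its method names are hypothetical.

```python
import math
import random

class RunningScaler:
    """Standardize one feature online using Welford's running mean/variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def learn_one(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def transform_one(self, value):
        if self.n < 2:
            return 0.0  # not enough data to estimate a spread yet
        std = math.sqrt(self.m2 / self.n)
        return (value - self.mean) / std if std > 0 else 0.0

rng = random.Random(7)
scaler = RunningScaler()
for _ in range(5000):
    scaler.learn_one(rng.gauss(50.0, 10.0))

print(round(scaler.mean, 1), round(math.sqrt(scaler.m2 / scaler.n), 1))
print(scaler.transform_one(60.0))  # roughly one standard deviation above the mean
```

The scaler needs only three numbers of state per feature, so it can sit in front of an online model without buffering any past observations.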
Making an accuracy metrics instance:
metric = metrics.Accuracy()
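For each observation, we make a prediction with the model instance, update the accuracy metric using the metric.update method, and then update the model, so that everything is learned sequentially.
for A, b in A_b:
    pred = lm.predict_one(A)         # predict before seeing the label
    metric = metric.update(b, pred)  # update the running accuracy
    lm = lm.fit_one(A, b)            # let the model learn from this observation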
After running the above-given code we can check for the accuracy of the model which we have updated in the loop.
Here we have created and trained a logistic regression model by interleaving predictions and model updates. The model has performed well: its accuracy is around 89%, which is pretty good.
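An accuracy metric that is updated inside such a loop only needs two counters. The following plain-Python sketch shows the idea; the class `RunningAccuracy` is hypothetical and written for illustration, and the four label pairs are made-up values rather than results from the phishing dataset.

```python
class RunningAccuracy:
    """Accuracy that is updated one (truth, prediction) pair at a time."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, y_true, y_pred):
        self.correct += int(y_true == y_pred)
        self.total += 1
        return self  # returning self allows `metric = metric.update(...)`

    def get(self):
        return self.correct / self.total if self.total else 0.0

acc = RunningAccuracy()
for y_true, y_pred in [(True, True), (False, True), (True, True), (False, False)]:
    acc = acc.update(y_true, y_pred)
print(f"Accuracy: {acc.get():.2%}")  # 3 of 4 predictions are correct
```

Since the metric is updated before the model learns from the same observation, the reported figure reflects performance on data the model had not yet been trained on.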
There are various tasks we can perform using the creme library as we have explained in the features.
In this article, we had an overview of streaming data and online machine learning. Online machine learning has many application domains, such as time series forecasting, spam filtering, recommender systems, CTR prediction, and IoT applications. The creme library provides most of the stream learning features that online learning requires, and we can use them according to our requirements.