What if you want to do machine learning with data that is in motion? What if you wish to train machine learning models on real-time data? Online machine learning, also known as incremental machine learning, is a method of executing machine learning on data that is in motion. In this post, we will look at how streaming data can be leveraged to perform state-of-the-art machine learning tasks. We will also discuss River, a Python library for online machine learning, along with a hands-on implementation. The following outlined points will be discussed.
Table of Contents
- Batch Learning Vs Online Learning
- Python Libraries for Online Machine Learning
- Stream Learning with River
Let’s start the discussion by knowing the difference between batch learning and online learning.
Batch Learning Vs Online Learning
Batch learning approaches are incapable of gradual learning. They usually build models from the entire training set, which are then put into production. If we want batch learning algorithms to learn from new data as it arrives, we must build a new model from scratch on the complete training set plus the new data. Offline learning is another term for this. If the amount of data is large, training on the entire dataset can be expensive in terms of compute resources (CPU, RAM, storage, disk I/O, and so on).
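To make the retraining cost concrete, here is a minimal, framework-free sketch. The "model" is just a mean predictor, chosen only to keep the example tiny; the point is that every refit reprocesses the entire accumulated dataset.

```python
# Minimal sketch: a "batch" model that must be refit on ALL accumulated
# data whenever new observations arrive. The model here is just a mean
# predictor, chosen only to keep the example small.

def fit_batch(history):
    """Refit from scratch on the entire accumulated dataset: O(n) every time."""
    return sum(history) / len(history)

history = [2.0, 4.0]
model = fit_batch(history)      # initial training run

# New data arrives: batch learning keeps everything and refits on all of it.
history.append(6.0)
model = fit_batch(history)      # full retraining pass again

print(model)  # mean of [2.0, 4.0, 6.0] -> 4.0
```

As the history grows, each retraining pass gets more expensive, which is exactly the cost online learning avoids.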
Sign up for your weekly dose of what's up in emerging technology.
If our system does not need to respond to quickly changing data, the batch learning method may suffice. If we don’t need to update our model too frequently, we can make use of the batch learning strategy. In other words, the entire process of training, evaluation, and testing are quite basic and uncomplicated, and it frequently yields better results than online techniques. For knowledge graph embedding projects, I’ve created batch learning algorithms. In the future, I hope to create an online approach for these projects to adapt to constantly changing knowledge graphs.
In online learning, the training is done in small groups or in an incremental fashion by continuously feeding data as it arrives. Each learning phase is quick and inexpensive, allowing the system to learn about new data as it comes in.
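The incremental loop above can be sketched with plain Python and stochastic gradient descent. This is an illustrative toy, not any particular library's implementation: a one-weight linear model is updated after every single observation, so the training cost per example is constant.

```python
# Minimal sketch of online learning with stochastic gradient descent:
# a one-weight linear model y ~ w * x, updated after every single
# observation. Each learning step is quick and inexpensive.

def learn_one(w, x, y, lr=0.1):
    """One incremental training step on a single (x, y) pair."""
    error = w * x - y
    return w - lr * error * x   # gradient step on the squared error

w = 0.0
stream = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 50  # true relation: y = 2x
for x, y in stream:                                  # data arriving one at a time
    w = learn_one(w, x, y)

print(round(w, 2))  # converges close to 2.0
```

Notice that no example is ever revisited: once an observation has updated the weight, it can be discarded.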
Machine learning systems that receive data in a continuous stream (e.g., stock prices) and must adapt to change quickly or autonomously benefit from online learning. It is also a smart alternative if you only have limited computational resources: after an online learning system has learnt from new data instances, it no longer needs them, so you may discard them (unless you wish to be able to “replay” the data). This can help you save a lot of space. Online learning is depicted in the figure above.
Online learning methods can also be used to train systems on massive datasets that are too large to fit in a single machine’s main memory (this is also called out-of-core learning). The algorithm loads a portion of the data, performs a training step on that data, and then continues the process until it has run on all of the data.
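A framework-free sketch of this out-of-core pattern: the dataset is consumed chunk by chunk, and only a small model state (here, a running sum and count for a mean) is ever held in memory. The generator below is a stand-in for a file or stream too large to load at once.

```python
# Out-of-core learning sketch: load a portion of the data, perform a
# training step on it, and repeat until the whole dataset has been seen.
# Only the small model state (total, count) lives in memory.

from itertools import islice

def data_stream():
    """Stand-in for a dataset too large for main memory."""
    for i in range(1_000_000):
        yield float(i % 10)

def chunks(iterable, size):
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

total, count = 0.0, 0
for chunk in chunks(data_stream(), size=4096):  # one training step per chunk
    total += sum(chunk)
    count += len(chunk)

print(total / count)  # mean of 0..9 repeated -> 4.5
```

At no point are more than 4096 values held in memory, yet the final statistic covers all one million observations.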
The learning rate is an important parameter of online learning: it controls how quickly the system adapts to new data. A system with a high learning rate will adapt rapidly but will also swiftly forget what it has learned. A system with a low learning rate is more akin to batch learning.
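A small sketch of this trade-off, using an exponentially weighted mean as a stand-in for a model. The data distribution shifts mid-stream; a high learning rate tracks the new values and quickly "forgets" the old ones, while a low learning rate changes slowly, closer to batch behaviour.

```python
# Sketch of how the learning rate trades adaptation against memory.
# An exponentially weighted estimate is updated with two different rates.

def update(estimate, x, lr):
    return estimate + lr * (x - estimate)

stream = [0.0] * 50 + [10.0] * 50   # the data distribution shifts mid-stream

fast, slow = 0.0, 0.0
for x in stream:
    fast = update(fast, x, lr=0.5)   # adapts quickly, forgets quickly
    slow = update(slow, x, lr=0.01)  # adapts slowly, retains the past

print(fast, slow)  # fast is near 10, slow still remembers the zeros
```

Neither setting is universally right: the choice depends on how quickly the underlying data actually changes.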
One significant downside of an online learning system is that if it is fed incorrect data, the system will perform poorly, and the user will see the impact immediately. As a result, it is critical to implement proper filters to ensure that the data fed is of good quality. Furthermore, it is critical to closely monitor the functioning of the machine learning system.
Python Libraries for Online Machine Learning
To execute online machine learning, many frameworks are available. Several of them are:
Scikit-multiflow (also known as skmultiflow) is a Python machine learning library that supports multi-output/multi-label and streaming data. Scikit-multiflow makes it simple to create and run experiments, as well as to extend stream learning algorithms. It has a number of methods for classification, regression, concept drift detection, and anomaly detection. A suite of data stream generators and evaluators is also included. scikit-multiflow is compatible with Jupyter Notebooks and is designed to work with Python’s numerical and scientific libraries NumPy and SciPy.
Jubatus is an open-source online machine learning and distributed computing system created by Nippon Telegraph & Telephone and Preferred Infrastructure. It has classification, recommendation, regression, anomaly detection, and graph mining capabilities. Many client languages are supported, including C++, Java, Ruby, and Python. It uses Iterative Parameter Mixture for distributed machine learning.
Creme takes a different approach: learning from a stream of data continually. The model processes one observation at a time and may be updated on the fly. This enables learning from datasets that are too large to fit in main memory. Online machine learning works well in circumstances where new data is continually arriving. It excels in a variety of applications, including time series forecasting, spam filtering, recommender systems, CTR prediction, and IoT.
This article focuses on the River library, which combines the scikit-multiflow and creme libraries to provide functionality for executing online machine learning on streaming data. So, let’s take a closer look at River and its implementation.
River is a machine learning library for continuous learning and dynamic data streams. For various stream learning challenges, it includes many state-of-the-art learning methods, data generators/transformers, performance metrics, and evaluators. It is the outcome of merging two of Python’s most popular stream learning packages: Creme and scikit-multiflow.
In River, machine learning models are classes extending specialized mixins that vary based on the learning task, such as classification, regression, clustering, and so on. This maintains consistency across the library and makes it easier to extend or modify current models, as well as to create new models that are compatible with River.
All predictive models have two main operations: learning and predicting. The `learn_one` method is used for learning (it updates the internal state of the model). Depending on the learning goal, predictions are provided by:
- `predict_one` (classification, regression, and clustering),
- `predict_proba_one` (classification), and
- `score_one` (anomaly detection).
It is worth noting that River also includes transformers, which are stateful objects that convert an input via the `transform_one` method.
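This interface is easy to picture with a from-scratch toy that mirrors the method names described above. Everything here is illustrative (it does not use River itself): a trivial majority-class classifier with the same learn-one-example-at-a-time shape.

```python
# Hedged sketch of the learn_one / predict_one interface, implemented
# from scratch (no river dependency) as a toy majority-class classifier.
# Only the method names mirror the API; the model itself is illustrative.

from collections import Counter

class MajorityClass:
    def __init__(self):
        self.counts = Counter()

    def learn_one(self, x, y):
        """Update the model's internal state from one labelled example."""
        self.counts[y] += 1
        return self

    def predict_one(self, x):
        """Predict using only the state accumulated so far."""
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

model = MajorityClass()
for text, label in [("good", 1), ("bad", -1), ("great", 1)]:
    model.learn_one(text, label)

print(model.predict_one("anything"))  # -> 1 (the majority label)
```

The key property is that both methods work on a single observation, so the model is usable (and testable) at any point in the stream.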
Let’s implement River.
River offers a scikit-learn-like API and is sometimes described as scikit-learn for streaming or online machine learning. It is designed for streaming data and supports practically all familiar ML estimators and transformers.
Let’s look at how to use River to create a simple text classifier model that can categorize the sentiment of text as Positive (1) or Negative (-1). The dataset used in this post is taken from this Kaggle repository, which contains strings of text associated with their respective sentiments. To convert our text into features, we’ll use BagOfWords() as our transformer (vectorizer) and the Naive Bayes MultinomialNB as our machine learning estimator.
While installing the river, make sure you are using the latest version of the NumPy package.
```
pip install -U numpy
pip install river
```

```python
import river
from river.naive_bayes import MultinomialNB
from river.feature_extraction import BagOfWords, TFIDF
from river.compose import Pipeline
import pandas as pd
```
In terms of data, in our scenario we’ll just use a list of tuples containing the text and the label (Positive or Negative). However, data from a streaming engine or a CSV file can be ingested as well. If you are dealing with the popular Pandas package, you’ll have to convert a CSV file to a dictionary or a list of tuples.
```python
df = pd.read_csv("/content/stock_data.csv")

# Convert to a dictionary
df.to_dict()

# Convert to tuples
data = df.to_records(index=False)
data
```
Next, we’ll construct a pipeline that includes two stages: a transformer/vectorizer for converting text to features and an estimator.
```python
# Build Pipeline
pipe_nb = Pipeline(('vectorizer', BagOfWords(lowercase=True)), ('nb', MultinomialNB()))

# Inspect Steps
pipe_nb.steps
```
Because the data is coming one item at a time, we’ll have to fit our model to it one example at a time during training, using our pipeline’s learn_one(x, y) method. We can emulate this by using a for loop to iterate through our data. (Note that the equivalent method is fit_one(x, y) in Creme.)
```python
# Train
for text, label in data:
    pipe_nb = pipe_nb.learn_one(text, label)
```
Now let’s check the prediction using predict_one, along with the probability of the two classes.
```python
# Make a Prediction
test = 'Mr AAP is going to have to stop hanging out by the pool if he is to make 435 by close. All she needs is one fat finger buyer'
pred = pipe_nb.predict_one(test)

# Prediction Probabilities
proba_ = pipe_nb.predict_proba_one(test)

print('Predicted test sentence as', pred, ' and probability of classes for test as', proba_)
```
We can employ functions from River to determine the reliability and performance of our model. Accuracy metrics and classification reports are available from the metrics sub-module.
```python
# Update the model on the test data & check accuracy
metric = river.metrics.Accuracy()

for text, label in data:
    y_pred_before = pipe_nb.predict_one(text)   # predict before learning
    metric = metric.update(label, y_pred_before)
    pipe_nb = pipe_nb.learn_one(text, label)    # has already learnt the pattern

print(metric)
```
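The predict-then-learn pattern used here (sometimes called progressive or prequential validation) can be sketched without any dependencies. Each example is first scored against the current model and only then used for training, so the accuracy estimate never tests the model on data it has already learned from. The toy majority-label "model" below is purely illustrative.

```python
# Sketch of the predict-then-learn ("progressive validation") pattern:
# score each example against the current model state first, THEN learn
# from it. A toy majority-label model keeps the example dependency-free.

from collections import Counter

stream = [("t1", 1), ("t2", 1), ("t3", -1), ("t4", 1), ("t5", 1)]

counts = Counter()
correct, seen = 0, 0
for text, label in stream:
    if counts:                                   # predict first ...
        pred = counts.most_common(1)[0][0]
        correct += int(pred == label)
        seen += 1
    counts[label] += 1                           # ... then learn

print(correct / seen)  # -> 0.75 (3 of 4 scored examples correct)
```

This ordering matters: learning before predicting on the same example would inflate the reported accuracy, which is why the loop above predicts before it updates.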
Through this article, we have seen the fundamental difference between batch learning and stream learning (online learning), the limitations of both techniques, and when one is preferable to the other. We then surveyed some popular frameworks and libraries used to handle and model streaming data. Lastly, we walked through a hands-on implementation with River, one of the most popular libraries in the field. I encourage you to visit the official documentation to get an idea of what can be done when you are working with continuous or streaming data.