A Guide to River – A Python Tool For Online Learning

What if you want to do machine learning on data that is in motion? What if you wish to train machine learning models on real-time data? Online machine learning, also known as incremental machine learning, is a method of doing exactly that. In this post, we will look at how streaming data can be leveraged for machine learning tasks, and we will discuss a Python library, River, along with its implementation of online machine learning. The following points will be discussed.

Table of Contents

  1. Batch Learning Vs Online Learning
  2. Python Libraries for Online Machine Learning
  3. Stream Learning with River

Let’s start the discussion with the difference between batch learning and online learning.


Batch Learning Vs Online Learning 

Batch learning approaches are incapable of gradual learning. They build models from the entire training set, which are then put into production. If we want a batch learning algorithm to learn from new data as it arrives, we must build a new model from scratch on the complete training set plus the new data. This is also called offline learning. If the amount of data is large, training on the entire dataset can be expensive in terms of computing resources (CPU, RAM, storage, disk I/O, and so on).
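To make the contrast concrete, here is a minimal sketch of batch retraining with scikit-learn; the random arrays are hypothetical placeholders standing in for the old training set and the newly arrived data:

# Sketch: a batch model must be refit from scratch on old + new data
import numpy as np
from sklearn.linear_model import LogisticRegression

X_old, y_old = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)  # placeholder "old" data
X_new, y_new = np.random.rand(100, 5), np.random.randint(0, 2, 100)    # placeholder "new" data

# Incorporating new data means one full training run over everything
X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])
model = LogisticRegression().fit(X_all, y_all)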

If our system does not need to respond to quickly changing data, the batch learning method may suffice. If we don’t need to update our model too frequently, we can make use of the batch learning strategy. In other words, the entire process of training, evaluation, and testing is quite basic and uncomplicated, and it frequently yields better results than online techniques. For example, I have created batch learning algorithms for knowledge graph embedding projects; in the future, I hope to create an online approach for these projects so they can adapt to constantly changing knowledge graphs.

In online learning, training is done in mini-batches or in an incremental fashion, by continuously feeding data as it arrives. Each learning step is quick and inexpensive, allowing the system to learn from new data as it comes in.

Machine learning systems that receive data in a continuous stream (e.g., stock prices) and must adapt to change quickly or autonomously benefit from online learning. It is also a smart alternative if you only have limited computational resources: once an online learning system has learnt from new data instances, it no longer needs them, so you can discard them (unless you wish to be able to “replay” the data). This can save a lot of space.

Online learning methods can also be used to train systems on massive datasets that are too large to fit in a single machine’s main memory (this is also called out-of-core learning). The algorithm loads a portion of the data, performs a training step on that data, and then continues the process until it has run on all of the data.
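As a rough sketch of out-of-core learning, one could read a large CSV in chunks with pandas and update a River model one observation at a time. The file name and label column below are hypothetical placeholders, and numeric feature columns are assumed:

# Sketch: out-of-core training by streaming a CSV in chunks
import pandas as pd
from river import linear_model

model = linear_model.LogisticRegression()
for chunk in pd.read_csv('big_dataset.csv', chunksize=10_000):  # hypothetical file
    for row in chunk.to_dict(orient='records'):
        y = row.pop('label')      # hypothetical label column
        model.learn_one(row, y)   # one observation at a time; features assumed numeric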

The learning rate is an important parameter of online learning systems: it is the rate at which you want your machine learning system to adapt to new data. A system with a high learning rate adapts quickly but also quickly forgets what it has learned. A system with a low learning rate has more inertia and is more akin to batch learning.
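In River, for instance, the learning rate is a parameter of the optimizer rather than of the model itself. A minimal sketch (the lr values here are arbitrary, illustrative choices):

# Sketch: two linear models with different adaptation speeds
from river import linear_model, optim

fast_model = linear_model.LinearRegression(optimizer=optim.SGD(lr=0.1))    # adapts (and forgets) quickly
slow_model = linear_model.LinearRegression(optimizer=optim.SGD(lr=0.001))  # closer to batch behaviour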

One significant downside of an online learning system is that if it is fed incorrect data, its performance will degrade, and users will see the impact immediately. As a result, it is critical to implement proper filters to ensure that the incoming data is of good quality, and to closely monitor the performance of the machine learning system.
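As an illustration only, a simple quality gate could drop malformed records before they ever reach the model; the is_valid check below is a hypothetical placeholder for real, data-specific validation:

# Sketch: filter bad records before updating an online model
from river import linear_model

model = linear_model.LogisticRegression()

def is_valid(x):
    # hypothetical check: reject missing or NaN feature values
    return all(isinstance(v, (int, float)) and v == v for v in x.values())

stream = [({'f1': 1.0, 'f2': 0.2}, True),
          ({'f1': None, 'f2': 0.4}, False)]  # second record is malformed
for x, y in stream:
    if is_valid(x):
        model.learn_one(x, y)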

Python Libraries for Online Machine Learning

Many frameworks are available for executing online machine learning. Several of them are described below.

Scikit-Multiflow

Scikit-multiflow (also known as skmultiflow) is a Python-based machine learning library that supports multi-output/multi-label learning and streaming data. Scikit-multiflow makes it simple to design and run experiments, as well as to enhance stream learning algorithms. It has a number of methods for classification, regression, concept drift detection, and anomaly detection, and a suite of data stream generators and evaluators is also included. scikit-multiflow is compatible with Jupyter Notebooks and is designed to work with Python’s numerical and scientific libraries NumPy and SciPy.
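As a hedged sketch of the typical workflow (assuming a recent skmultiflow release), a data stream generator can feed an incremental classifier in a test-then-train loop:

# Sketch: Hoeffding tree on a synthetic SEA stream (test-then-train)
from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier

stream = SEAGenerator(random_state=42)
model = HoeffdingTreeClassifier()

# Prime the model on one sample so early predictions are defined
X, y = stream.next_sample()
model.partial_fit(X, y, classes=stream.target_values)

correct, n = 0, 2000
for _ in range(n):
    X, y = stream.next_sample()
    correct += int(model.predict(X)[0] == y[0])  # test first...
    model.partial_fit(X, y)                      # ...then train
print('Prequential accuracy:', correct / n)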

Jubatus

Jubatus is an open-source online machine learning and distributed computing framework created by Nippon Telegraph & Telephone and Preferred Infrastructure. It has classification, recommendation, regression, anomaly detection, and graph mining capabilities. Many client languages are supported, including C++, Java, Ruby, and Python. It uses Iterative Parameter Mixture for distributed machine learning.

Creme Framework

Creme takes a different approach: learning from a stream of data continually. The model processes one observation at a time and can be updated on the fly, which enables learning from datasets too large to fit in main memory. Online machine learning works well in circumstances where new data is continually arriving, and it excels in a variety of applications, including time series forecasting, spam filtering, recommender systems, CTR prediction, and IoT.

This article focuses on the River library, which merges the scikit-multiflow and Creme libraries to provide functionality for executing online machine learning on streaming data. So, let’s take a closer look at River and its implementation.

River

River is a machine learning library for continual learning and dynamic data streams. For various stream learning challenges, it includes many state-of-the-art learning methods, data generators/transformers, performance metrics, and evaluators. It is the outcome of merging two of Python’s most popular stream learning packages: Creme and scikit-multiflow.

In River, machine learning models extend specialized mixin classes that vary with the learning task, such as classification, regression, or clustering. This keeps the library consistent and makes it easier to extend or modify existing models, as well as to create new models that are compatible with River.

Learning and predicting are the two main operations of all predictive models. The learn_one method performs learning (it updates the internal state of the model). Depending on the learning goal, predictions come from predict_one (classification, regression, and clustering), predict_proba_one (classification), or score_one (anomaly detection). It is worth noting that River also includes transformers, which are stateful objects that convert an input via the transform_one method.
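A minimal sketch of this per-sample API, using a toy logistic regression on dictionary features (the feature names are made up for illustration):

# Sketch: River models consume one observation (a dict) at a time
from river import linear_model

model = linear_model.LogisticRegression()
x, y = {'x1': 1.0, 'x2': 0.5}, True
model.learn_one(x, y)               # updates the model's internal state
print(model.predict_one(x))         # hard prediction
print(model.predict_proba_one(x))   # class probabilities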

Let’s implement River.

River offers a scikit-learn-like API and is often described as scikit-learn for streaming or online machine learning. It is designed for streaming data and supports practically all common ML estimators and transformers.

Let’s look at how to use River to create a simple text classifier model that can categorize the sentiment of a text as Positive (1) or Negative (-1). The dataset used in this post is taken from this Kaggle repository, which contains text strings with their associated sentiments. To convert our text into features, we’ll use BagOfWords() as our transformer or vectorizer, and the Naive Bayes classifier MultinomialNB as our machine learning estimator.


While installing River, make sure you are using the latest version of the NumPy package.

pip install -U numpy 
pip install river
 
import river
from river.naive_bayes import MultinomialNB           # online Naive Bayes classifier
from river.feature_extraction import BagOfWords, TFIDF  # streaming text vectorizers
from river.compose import Pipeline                    # chains transformers and estimators

import pandas as pd

In terms of data, we’ll simply use a list of tuples containing the text and the label (Positive or Negative) in our scenario. However, data from a streaming engine or a CSV file can be ingested as well. If you are working with the popular Pandas package, you will have to convert the DataFrame to a dictionary or a list of tuples.

df = pd.read_csv("/content/stock_data.csv")

# Convert to a dictionary
df.to_dict()

# Convert to a sequence of (text, label) tuples
data = df.to_records(index=False)
data

Next, we’ll construct a pipeline that includes two stages: a transformer/vectorizer for converting text to features and an estimator.

# Build the pipeline: vectorizer followed by the estimator
pipe_nb = Pipeline(('vectorizer', BagOfWords(lowercase=True)), ('nb', MultinomialNB()))

# Inspect the pipeline steps
pipe_nb.steps

Because the data arrives one observation at a time, we have to fit our model to it one observation at a time during training, using our pipeline’s learn_one(x, y) method. We can emulate a stream by iterating through our data with a for loop. (Note that this method is called fit_one(x, y) in Creme.)

# Train the model one observation at a time
for text, label in data:
    pipe_nb = pipe_nb.learn_one(text, label)

Now let’s check the prediction using predict_one and the probabilities of the two classes using predict_proba_one.

# Make a prediction on an unseen text
test = 'Mr AAP is going to have to stop hanging out by the pool if he is to make 435 by close. All she needs is one fat finger buyer'
pred = pipe_nb.predict_one(test)

# Prediction probabilities
proba_ = pipe_nb.predict_proba_one(test)
print('Predicted sentiment:', pred, '| class probabilities:', proba_)


We can employ functions from River to determine the reliability and performance of our model. Accuracy metrics and classification reports are available in the metrics submodule.

# Progressive validation: predict first, then update the model
metric = river.metrics.Accuracy()
for text, label in data:
    y_pred_before = pipe_nb.predict_one(text)
    metric = metric.update(label, y_pred_before)
    # The model has already seen this data once, so accuracy is optimistic
    pipe_nb = pipe_nb.learn_one(text, label)
print(metric)

Output:

Accuracy: 90.07%

Final Words 

Through this article, we have seen the fundamental difference between batch learning and stream learning (online learning), the limitations of both techniques, and where one takes over from the other. Later, we looked at some popular frameworks and libraries mostly used to handle and model streaming data. Lastly, we walked through a hands-on implementation of the most popular library in the field, River. I encourage you to visit the official documentation to get an idea of how and what can be done when you are working with continuous or streaming data.
