Guide To Prediction Engineering With Compose

Compose is a Python module for automating prediction engineering. It enables developers to generate labels by writing just one labelling function.

This article gives an overview of Compose, the prediction engineering tool from Alteryx’s open-source suite of machine learning solutions. We have previously covered two of their tools for automating and standardizing feature engineering and machine learning model search. You can learn more about them in the articles below:

Compose and the rest of Alteryx’s machine learning ecosystem
Source: Compose Docs

One aspect of the machine learning pipeline that new data scientists often overlook is prediction engineering. The concept of creating our own labels seems foreign because most of us started with Kaggle or UCI datasets where the answers are already given. However, when working with real-world data, we often need to define the problem before building machine learning models, which means we have to create our own labels. Generally, this is done using historical examples of what we want to predict.  

Prediction Engineering With Compose
Source: Compose Docs

Prediction engineering is not a new concept; however, it doesn’t have a standardized process and is done by data scientists in an ad hoc manner. This leads to a lot of unnecessary coding, as a new script is created for each new problem, even with the same dataset. A better approach would be to write reusable functions that can adapt to changes in the business problem.

What is Compose?

Compose is a Python library that can be used to automate prediction engineering. It provides a standardized way of structuring prediction problems: the end-user defines the outcome of interest by creating a labelling function. Compose then runs a search and automatically extracts the relevant training examples from historical data.

The output is a LabelTimes table: a set of positive and negative examples built from historical data, each with an associated cutoff time indicating the time at which the prediction is made. The cutoff time depends on the task at hand. For our example, we will consider ‘2014-01-01’ to be the cutoff time, so all of the training examples will be made with data from before ‘2014-01-01’. Cutoff times are an important aspect to consider when doing feature engineering for time-series problems, because they prevent data leakage.
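To make the idea of a cutoff time concrete, here is a minimal sketch in plain Python. It is not part of Compose: the record layout and the helper name `data_before_cutoff` are assumptions made only for illustration. It shows that anything recorded at or after the cutoff is withheld from the data used to build a training example.

```python
from datetime import datetime

# Hypothetical transaction records; the field names are illustrative only.
transactions = [
    {"customer_id": 1, "time": datetime(2013, 12, 30, 9, 0), "amount": 40.0},
    {"customer_id": 1, "time": datetime(2013, 12, 31, 18, 30), "amount": 25.0},
    {"customer_id": 1, "time": datetime(2014, 1, 1, 8, 15), "amount": 90.0},
    {"customer_id": 2, "time": datetime(2014, 1, 2, 12, 0), "amount": 10.0},
]

cutoff_time = datetime(2014, 1, 1)


def data_before_cutoff(records, cutoff):
    """Keep only the records observed strictly before the cutoff time."""
    return [record for record in records if record["time"] < cutoff]


# Only these rows may be used to build features for a prediction made at the cutoff.
usable = data_before_cutoff(transactions, cutoff_time)
print(len(usable), "of", len(transactions), "records are usable")
```

Features computed from `usable` cannot leak information from the period the label describes, which is the point of attaching a cutoff time to every label.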

Prediction Engineering with Compose

  1. Install Compose from PyPI:
pip install composeml
  1. Import necessary libraries and load the data.
 import matplotlib.pyplot as plt
 import composeml as cp
 df = cp.demos.load_transactions()
 df.head() 
  1. Write the labelling function. 

The labelling function will calculate the total of a customer’s transactions over a span of an hour. It will be passed groups of data points corresponding to different windows of one hour; all it needs to do is add them up.

def amount_spent(data):
    # `data` holds one customer's transactions that fall inside a single window.
    total = data['amount'].sum()
    return total
  1. Create a LabelMaker object for the prediction problem. We intend to calculate the hourly transactions of each customer, so we set the target_entity to the customer ID, the window_size to one hour and pass our labelling function.
label_maker = cp.LabelMaker(
     target_entity="customer_id",
     time_index="transaction_time",
     labeling_function=amount_spent,
     window_size="1h",
 ) 
  1. Use the search() method on the LabelMaker object to automatically search for and extract labels.
 labels = label_maker.search(
     df.sort_values('transaction_time'),
     num_examples_per_instance=-1,
     gap=1,
     verbose=True,
 )
 labels.head() 
labels.plot.dist()
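For intuition about what such a search produces, the following is a simplified sketch in plain Python. It is not Compose's implementation. The helper `make_labels`, the record layout, and the choices to start each customer's first window at their first transaction and to skip empty windows are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def total_amount(window):
    """Labelling function: sum the amounts of the records in one window."""
    return sum(record["amount"] for record in window)


def make_labels(records, window_size, labeling_function):
    """Return (customer_id, window_start, label) tuples for non-empty windows."""
    # Group the records by customer, in chronological order.
    by_customer = defaultdict(list)
    for record in sorted(records, key=lambda r: r["time"]):
        by_customer[record["customer_id"]].append(record)

    labels = []
    for customer_id, rows in by_customer.items():
        start = rows[0]["time"]  # assumption: first window starts at the first event
        end = rows[-1]["time"]
        while start <= end:
            window = [r for r in rows if start <= r["time"] < start + window_size]
            if window:  # assumption: windows without transactions are skipped
                labels.append((customer_id, start, labeling_function(window)))
            start += window_size
    return labels


# Hypothetical records for two customers.
records = [
    {"customer_id": 1, "time": datetime(2014, 1, 1, 10, 0), "amount": 20.0},
    {"customer_id": 1, "time": datetime(2014, 1, 1, 10, 30), "amount": 35.0},
    {"customer_id": 1, "time": datetime(2014, 1, 1, 11, 15), "amount": 50.0},
    {"customer_id": 2, "time": datetime(2014, 1, 1, 10, 5), "amount": 300.0},
]

labels = make_labels(records, timedelta(hours=1), total_amount)
for customer_id, window_start, label in labels:
    print(customer_id, window_start, label)
```

Each output row pairs a customer and a window start time with the value returned by the labelling function. Changing the prediction problem then only requires swapping the labelling function or the window size.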
  1. Various transformations can be applied to the LabelTimes table to modify the label as per the problem. 

Let’s say you want binary labels that indicate whether a customer’s hourly transaction total is greater than $200. This can be done using the threshold() method:

 binary_labels = labels.threshold(200)
 binary_labels.head() 
binary_labels.plot.distribution()
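Conceptually, thresholding turns each numeric label into a boolean. The sketch below uses plain Python with made-up label values. The helper `to_binary` is hypothetical, and the strict greater-than comparison is an assumption based on the "greater than $200" wording above, so check the Compose documentation for how a value exactly equal to the threshold is treated.

```python
# Made-up hourly totals standing in for the label column.
label_values = [55.0, 50.0, 300.0, 200.0]


def to_binary(values, threshold):
    """Return True for each value strictly above the threshold (assumed rule)."""
    return [value > threshold for value in values]


binary = to_binary(label_values, 200)
print(binary)
```

Under this assumption, only the 300.0 total becomes a positive example, and the total of exactly 200.0 stays negative.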

Or maybe you want to shift the label times one hour earlier so that predictions are made in advance. This can be achieved using the apply_lead() method:

 shifted_labels = labels.apply_lead('1h')
 shifted_labels.head() 
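The effect of a lead can be pictured as subtracting a fixed duration from every label time while leaving the label values untouched. This is a plain-Python sketch with hypothetical data, not the library's code. The helper `shift_earlier` is an illustrative name, and the direction of the shift (earlier) is an assumption that follows from the goal of predicting in advance.

```python
from datetime import datetime, timedelta

# Hypothetical (time, label) pairs.
label_times = [
    (datetime(2014, 1, 1, 10, 0), 55.0),
    (datetime(2014, 1, 1, 11, 0), 50.0),
]


def shift_earlier(pairs, lead):
    """Move every label time earlier by `lead`; label values are unchanged."""
    return [(time - lead, value) for time, value in pairs]


shifted = shift_earlier(label_times, timedelta(hours=1))
for time, value in shifted:
    print(time, value)
```

With a one-hour lead, features for the first label must be built from data available before 09:00 rather than 10:00, so the model learns to predict the outcome an hour ahead.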

You can learn more about the available methods here.

  1. Once you’re satisfied with the labels, you can use the describe() method to print out the distribution of the labels and the settings and transformations that were used to create them.

binary_labels.describe()
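The label distribution part of such a summary can be approximated by hand, which is useful for spotting class imbalance before training. The sketch below uses only the standard library and made-up binary labels. It does not reproduce the output format of describe().

```python
from collections import Counter

# Made-up binary labels standing in for a thresholded label column.
binary_values = [False, False, True, False, True]

counts = Counter(binary_values)
total = len(binary_values)

# Share of each class among all labels.
shares = {value: counts[value] / total for value in (True, False)}

for value in (True, False):
    print(f"{value}: {counts[value]} examples ({shares[value]:.0%})")
```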

Last Epoch

This article discussed Compose, a Python module for automating prediction engineering. It enables developers to structure prediction problems and generate labels by writing one labelling function. Combined with EvalML and Featuretools, Compose makes up Alteryx’s machine learning ecosystem, which standardizes different parts of the machine learning pipeline. Taking such an approach to machine learning enables data scientists to experiment, debug and adapt to changes in the business problem more rapidly. To learn more about prediction engineering with Compose, refer to the following resources:

Aditya Singh
A machine learning enthusiast with a knack for finding patterns. In my free time, I like to delve into the world of non-fiction books and video essays.
