Guide To Prediction Engineering With Compose

Compose is a Python module for automating prediction engineering. It enables developers to generate labels by writing just one labelling function.

This article gives an overview of Compose, the prediction engineering tool from Alteryx’s open-source suite of machine learning solutions. We have previously covered two of their tools for automating and standardizing feature engineering and machine learning model search. You can learn more about them in the articles below:

Compose and the rest of Alteryx’s machine learning ecosystem
Source: Compose Docs

One aspect of the machine learning pipeline that new data scientists often overlook is prediction engineering. The concept of creating our own labels seems foreign because most of us started with Kaggle or UCI datasets where the answers are already given. However, when working with real-world data, we often need to define the problem before building machine learning models, which means we have to create our own labels. Generally, this is done using historical examples of what we want to predict.  

Prediction Engineering With Compose
Source: Compose Docs

Prediction engineering is not a new concept; however, it lacks a standardized process, and data scientists tend to do it in an ad hoc manner. This leads to a lot of unnecessary coding, as a new script is written for each new problem, even when the dataset stays the same. A better approach is to write reusable functions that can adapt to changes in the business problem.

What is Compose?

Compose is a Python library that can be used to automate prediction engineering. It provides a standardized way for structuring prediction problems; the end-user defines the outcome of interest by creating a labelling function. Compose then runs a search and automatically extracts the relevant training examples from historical data. 

The output is a LabelTimes table, a set of labels with negative and positive examples made from historical data with associated cutoff times indicating the time at which predictions are made. The cutoff time depends on the task at hand. For our example, we will consider ‘2014-01-01’ to be the cutoff time, so all of the training examples will be made with data from before ‘2014-01-01’.  Cutoff times are an important aspect to consider when doing feature engineering for time-series problems to prevent data leakage.
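To make the role of a cutoff time concrete, here is a minimal plain-Python sketch. It is not Compose code: the records, the field names and the rows_before_cutoff helper are made up for illustration. The only idea taken from the text is that a training example may use data from before its cutoff time; treating the boundary as strictly "before" is an assumption that follows the wording above.

```python
from datetime import datetime

# Hypothetical transaction records; the field names are illustrative only.
transactions = [
    {"customer_id": 1, "time": datetime(2013, 12, 30, 9, 0), "amount": 40.0},
    {"customer_id": 1, "time": datetime(2013, 12, 31, 18, 30), "amount": 25.0},
    {"customer_id": 1, "time": datetime(2014, 1, 1, 0, 15), "amount": 90.0},
    {"customer_id": 2, "time": datetime(2014, 1, 2, 12, 0), "amount": 10.0},
]

cutoff = datetime(2014, 1, 1)


def rows_before_cutoff(rows, cutoff_time):
    """Keep only rows recorded strictly before the cutoff time."""
    return [row for row in rows if row["time"] < cutoff_time]


# Only these rows may feed features for an example with this cutoff;
# anything later would leak information from the future.
usable = rows_before_cutoff(transactions, cutoff)
print(len(usable), "of", len(transactions), "rows are usable for features")
```

Rows recorded at or after the cutoff are excluded, which is what keeps information about the outcome from leaking into the features.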

Prediction Engineering with Compose

  1. Install Compose from PyPI:
pip install composeml
  2. Import the necessary libraries and load the data.
import matplotlib.pyplot as plt
import composeml as cp
df = cp.demos.load_transactions()
df.head()
  3. Write the labelling function.

The labelling function will calculate the total of a customer’s transactions over a span of an hour. It will be passed groups of data points corresponding to different windows of one hour; all it needs to do is add them up.

def amount_spent(data):
    # data holds the transactions that fall inside a single window
    total = data['amount'].sum()
    return total
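The sketch below illustrates, in plain Python, the idea of passing one-hour groups of data points to a labelling function. It is a simplified stand-in under stated assumptions, not Compose's implementation: the sample rows, the label_windows helper, and the choice to start each customer's windows at that customer's first transaction are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical rows: (customer_id, transaction_time, amount).
rows = [
    (1, datetime(2014, 1, 1, 0, 5), 20.0),
    (1, datetime(2014, 1, 1, 0, 40), 35.0),
    (1, datetime(2014, 1, 1, 1, 10), 50.0),
    (2, datetime(2014, 1, 1, 0, 30), 15.0),
]


def amount_spent(amounts):
    """Plain-Python analogue of the labelling function: sum one window."""
    return sum(amounts)


def label_windows(rows, labelling_function, window=timedelta(hours=1)):
    """Apply labelling_function to each customer's fixed-size time windows."""
    by_customer = defaultdict(list)
    for customer_id, time, amount in rows:
        by_customer[customer_id].append((time, amount))

    labels = {}
    for customer_id, entries in by_customer.items():
        entries.sort()
        start = entries[0][0]  # assumption: windows begin at the first row
        windows = defaultdict(list)
        for time, amount in entries:
            index = (time - start) // window  # which window the row falls in
            windows[start + index * window].append(amount)
        for window_start, amounts in windows.items():
            labels[(customer_id, window_start)] = labelling_function(amounts)
    return labels


window_labels = label_windows(rows, amount_spent)
for (customer_id, window_start), total in window_labels.items():
    print(customer_id, window_start, total)
```

The labelling function never needs to know about customers or window boundaries. It only sees one group of amounts at a time, which is why a single short function is enough to define the outcome of interest.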
  4. Create a LabelMaker object for the prediction problem. We want to calculate each customer’s hourly transaction total, so we set target_entity to the customer ID column, time_index to the transaction timestamp column and window_size to one hour, and pass in our labelling function.
label_maker = cp.LabelMaker(
    target_entity="customer_id",
    time_index="transaction_time",
    labeling_function=amount_spent,
    window_size="1h",
)
  5. Use the search() method on the LabelMaker object to automatically search for and extract labels.
labels = label_maker.search(
    df.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    gap=1,
    verbose=True,
)
labels.head()
labels.plot.dist()
  6. Various transformations can be applied to the LabelTimes table to adapt the labels to the problem.

Let’s say you want binary labels that indicate whether the amount spent is greater than $200. This can be done using the threshold() method:

binary_labels = labels.threshold(200)
binary_labels.head()
binary_labels.plot.distribution()
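Conceptually, thresholding turns each continuous label into a yes/no answer. The plain-Python sketch below shows that idea on made-up hourly totals. The to_binary helper is hypothetical, and the strict "greater than" comparison follows the wording above, so check how your Compose version treats values exactly equal to the threshold.

```python
# Hypothetical hourly totals standing in for continuous labels.
hourly_totals = [120.5, 200.0, 310.75, 0.0]


def to_binary(values, threshold):
    """Return True where a value is strictly greater than the threshold."""
    return [value > threshold for value in values]


binary = to_binary(hourly_totals, 200)
print(binary)  # [False, False, True, False]
```

Changing the business question (say, from $200 to $500) only changes the threshold argument; the underlying labels do not need to be regenerated.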

Or maybe you want to shift the label times by one hour for predicting in advance. This can be achieved using the apply_lead() method: 

shifted_labels = labels.apply_lead('1h')
shifted_labels.head()
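As a rough picture of what a lead does, the sketch below moves each label time one hour earlier, so that the prediction would be made an hour ahead of the window it describes. This is plain Python on made-up timestamps, with a hypothetical shift_earlier helper. It assumes that a lead shifts times earlier, not later, which is worth confirming against the Compose documentation.

```python
from datetime import datetime, timedelta

# Hypothetical label times, one per training example.
label_times = [
    datetime(2014, 1, 1, 1, 0),
    datetime(2014, 1, 1, 2, 0),
]


def shift_earlier(times, lead):
    """Move every label time earlier by the given lead."""
    return [time - lead for time in times]


shifted = shift_earlier(label_times, timedelta(hours=1))
for original, moved in zip(label_times, shifted):
    print(original, "->", moved)
```

The label values stay the same; only the time at which each prediction is assumed to be made changes, which in turn limits the data available for feature engineering.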

You can learn more about the available methods in the Compose documentation.

  7. Once you’re satisfied with the labels, you can use the describe() method to print out the distribution of the labels and the settings and transformations that were used to create them.

binary_labels.describe()

Last Epoch

This article discussed Compose, a Python module for automating prediction engineering. It enables developers to structure prediction problems and generate labels by writing one labelling function. Together with EvalML and Featuretools, Compose makes up Alteryx’s machine learning ecosystem, which standardizes different parts of the machine learning pipeline. Taking such an approach to machine learning enables data scientists to experiment, debug and adapt to changes in the business problem more rapidly. To learn more about prediction engineering with Compose, refer to the following resources: 

Aditya Singh

A machine learning enthusiast with a knack for finding patterns. In my free time, I like to delve into the world of non-fiction books and video essays.