Guide To Prediction Engineering With Compose

Compose is a Python module for automating prediction engineering. It enables developers to generate labels by writing just one labelling function.


Published on May 5, 2021

by Aditya Singh

This article gives an overview of Compose, the prediction engineering tool from Alteryx’s open-source suite of machine learning solutions. We have previously covered two of their other tools: Featuretools, for automating and standardizing feature engineering, and EvalML, for machine learning model search.

Compose and the rest of Alteryx’s machine learning ecosystem (Source: Compose Docs)

One aspect of the machine learning pipeline that new data scientists often overlook is prediction engineering. The concept of creating our own labels seems foreign because most of us started with Kaggle or UCI datasets where the answers are already given. However, when working with real-world data, we often need to define the problem before building machine learning models, which means we have to create our own labels. Generally, this is done using historical examples of what we want to predict.

Prediction Engineering With Compose (Source: Compose Docs)

Prediction engineering is not a new concept; however, it doesn’t have a standardized process and is usually done by data scientists in an ad hoc manner. This leads to a lot of unnecessary coding, as a new script is written for each new problem, even when the dataset stays the same. A better approach is to write reusable functions that can adapt to changes in the business problem.

What is Compose?

Compose is a Python library that can be used to automate prediction engineering. It provides a standardized way of structuring prediction problems: the end-user defines the outcome of interest by writing a labelling function, and Compose then runs a search that automatically extracts the relevant training examples from historical data.

The output is a LabelTimes table, a set of labels with negative and positive examples made from historical data with associated cutoff times indicating the time at which predictions are made. The cutoff time depends on the task at hand. For our example, we will consider ‘2014-01-01’ to be the cutoff time, so all of the training examples will be made with data from before ‘2014-01-01’. Cutoff times are an important aspect to consider when doing feature engineering for time-series problems to prevent data leakage.
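To make the role of the cutoff time concrete, here is a minimal sketch in plain Python that does not use Compose at all. The records, the field names and the helper `rows_before_cutoff` are hypothetical; the sketch also assumes that "before" means strictly earlier than the cutoff.

```python
from datetime import datetime

# Hypothetical transaction records; field names are illustrative only.
transactions = [
    {"customer_id": 1, "time": datetime(2013, 12, 30, 9, 0), "amount": 20.0},
    {"customer_id": 1, "time": datetime(2013, 12, 31, 18, 30), "amount": 35.5},
    {"customer_id": 1, "time": datetime(2014, 1, 1, 0, 0), "amount": 12.0},
    {"customer_id": 1, "time": datetime(2014, 1, 2, 10, 15), "amount": 80.0},
]


def rows_before_cutoff(rows, cutoff):
    """Keep only rows recorded strictly before the cutoff time (assumed rule)."""
    return [row for row in rows if row["time"] < cutoff]


cutoff = datetime(2014, 1, 1)
usable = rows_before_cutoff(transactions, cutoff)

# Only the two December records may be used to build features for this example.
print(len(usable))
```

Anything recorded at or after the cutoff is information the model would not have had at prediction time, which is why using it would leak the answer into the features.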

Prediction Engineering with Compose

Install Compose from PyPI:

pip install composeml

Import necessary libraries and load the data.

import matplotlib.pyplot as plt
import composeml as cp
df = cp.demos.load_transactions()
df.head()

Write the labelling function.

The labelling function will calculate the total of a customer’s transactions over a span of an hour. It will be passed groups of data points corresponding to different windows of one hour; all it needs to do is add them up.

def amount_spent(data):
    # data holds one customer's transactions for a single one-hour window
    total = data['amount'].sum()
    return total
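The windowing idea can be sketched without Compose. The example below is a simplification under stated assumptions: the records and the helper `hourly_labels` are hypothetical, and windows are aligned to clock hours purely for illustration, whereas Compose decides the actual window boundaries itself during the search.

```python
from collections import defaultdict
from datetime import datetime


def amount_spent(rows):
    """Plain-Python counterpart of the labelling function: sum the amounts."""
    return sum(row["amount"] for row in rows)


# Hypothetical transactions; field names mirror the demo data for readability.
rows = [
    {"customer_id": 1, "transaction_time": datetime(2014, 1, 1, 0, 5), "amount": 10.0},
    {"customer_id": 1, "transaction_time": datetime(2014, 1, 1, 0, 40), "amount": 15.0},
    {"customer_id": 1, "transaction_time": datetime(2014, 1, 1, 1, 10), "amount": 7.5},
    {"customer_id": 2, "transaction_time": datetime(2014, 1, 1, 0, 20), "amount": 50.0},
]


def hourly_labels(rows, labeling_function):
    """Group rows by customer and clock hour, then label each group."""
    windows = defaultdict(list)
    for row in rows:
        # Truncate the timestamp to the start of its hour (assumed alignment).
        hour_start = row["transaction_time"].replace(minute=0, second=0, microsecond=0)
        windows[(row["customer_id"], hour_start)].append(row)
    return {key: labeling_function(group) for key, group in sorted(windows.items())}


labels = hourly_labels(rows, amount_spent)
for (customer, start), total in labels.items():
    print(customer, start, total)
```

The point of the design is that the labelling function never needs to know about customers or time windows; it only sees one group of rows at a time, so the same function can be reused when the window size or target changes.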

Create a LabelMaker object for the prediction problem. We intend to calculate the hourly transactions of each customer, so we set the target_entity to the customer ID, the window_size to one hour and pass our labelling function.

label_maker = cp.LabelMaker(
    target_entity="customer_id",
    time_index="transaction_time",
    labeling_function=amount_spent,
    window_size="1h",
)

Use the search() method on the LabelMaker object to automatically search for and extract labels.

labels = label_maker.search(
    df.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    gap=1,
    verbose=True,
)
labels.head()

labels.plot.dist()

Various transformations can be applied to the LabelTimes table to modify the label as per the problem.

Let’s say you want to create binary labels for the threshold of transaction amounts greater than $200. This can be done using the threshold() method:

binary_labels = labels.threshold(200)
binary_labels.head()

binary_labels.plot.distribution()
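Conceptually, thresholding just replaces each numeric label with the result of a comparison. A minimal sketch in plain Python, where the totals and the helper `to_binary` are hypothetical and the comparison is assumed to be strictly "greater than", as described above:

```python
# Hypothetical hourly totals standing in for the numeric labels.
totals = [150.0, 200.0, 230.5, 75.25]


def to_binary(values, threshold):
    """Return True where a value is strictly greater than the threshold."""
    return [value > threshold for value in values]


binary = to_binary(totals, 200)
print(binary)  # [False, False, True, False]
```

Note that a total of exactly 200 is labelled False under this assumption.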

Or maybe you want to shift the label times by one hour for predicting in advance. This can be achieved using the apply_lead() method:

shifted_labels = labels.apply_lead('1h')
shifted_labels.head()
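The effect of a lead can be pictured as moving each label's timestamp while leaving the label value alone. The sketch below is plain Python with hypothetical data and a hypothetical helper `shift_earlier`; it assumes the shift moves each label time earlier, so that the prediction has to be made one hour ahead of the outcome.

```python
from datetime import datetime, timedelta

# Hypothetical (label time, label value) pairs.
label_times = [
    (datetime(2014, 1, 1, 1, 0), 25.0),
    (datetime(2014, 1, 1, 2, 0), 7.5),
]


def shift_earlier(pairs, lead):
    """Move every label time earlier by `lead`; label values are unchanged."""
    return [(time - lead, label) for time, label in pairs]


shifted = shift_earlier(label_times, timedelta(hours=1))
for time, label in shifted:
    print(time, label)
```

With the earlier timestamps acting as cutoff times, features for each example can only use data from at least one hour before the outcome was observed.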

You can learn more about the available methods in the Compose documentation.

Once you’re satisfied with the labels, you can use the describe() method to print out the distribution of the labels and the settings and transformations that were used to create them.

binary_labels.describe()

Last Epoch

This article discussed Compose, a Python module for automating prediction engineering. It enables developers to structure prediction problems and generate labels by writing one labelling function. Together with EvalML and Featuretools, Compose makes up Alteryx’s machine learning ecosystem, which standardizes different parts of the machine learning pipeline. Taking such an approach to machine learning enables data scientists to experiment, debug and adapt to changes in the business problem more rapidly. To learn more about prediction engineering with Compose, refer to the following resources:
