This article gives an overview of Compose, the prediction engineering tool from Alteryx’s open-source suite of machine learning solutions. We have previously covered two of their tools for automating and standardizing feature engineering and machine learning model search. You can learn more about them in the articles below:
One aspect of the machine learning pipeline that new data scientists often overlook is prediction engineering. The concept of creating our own labels seems foreign because most of us started with Kaggle or UCI datasets where the answers are already given. However, when working with real-world data, we often need to define the problem before building machine learning models, which means we have to create our own labels. Generally, this is done using historical examples of what we want to predict.
Prediction engineering is not a new concept; however, it doesn’t have a standardized process and is done by data scientists in an ad hoc manner. This leads to a lot of unnecessary coding, as a new script is created for each new problem, even when the dataset is the same. A better approach is to write reusable functions that can adapt to changes in the business problem.
What is Compose?
Compose is a Python library that can be used to automate prediction engineering. It provides a standardized way for structuring prediction problems; the end-user defines the outcome of interest by creating a labelling function. Compose then runs a search and automatically extracts the relevant training examples from historical data.
The output is a LabelTimes table, a set of labels with negative and positive examples made from historical data with associated cutoff times indicating the time at which predictions are made. The cutoff time depends on the task at hand. For our example, we will consider ‘2014-01-01’ to be the cutoff time, so all of the training examples will be made with data from before ‘2014-01-01’. Cutoff times are an important aspect to consider when doing feature engineering for time-series problems to prevent data leakage.
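To make the idea of a cutoff time concrete, here is a minimal plain-Python sketch. It is not part of Compose; the records and the helper name rows_before_cutoff are invented for illustration. It only shows the rule that training data must come from strictly before the cutoff.

```python
from datetime import datetime

# Hypothetical transaction records, invented for this illustration.
transactions = [
    {"customer_id": 1, "transaction_time": datetime(2013, 12, 30, 9, 0), "amount": 50.0},
    {"customer_id": 1, "transaction_time": datetime(2014, 1, 2, 10, 0), "amount": 75.0},
    {"customer_id": 2, "transaction_time": datetime(2013, 11, 15, 14, 0), "amount": 20.0},
]


def rows_before_cutoff(rows, cutoff):
    """Keep only the rows observed strictly before the cutoff time."""
    return [row for row in rows if row["transaction_time"] < cutoff]


cutoff_time = datetime(2014, 1, 1)
training_rows = rows_before_cutoff(transactions, cutoff_time)

# Only the two 2013 records remain; the 2014-01-02 record is excluded,
# so nothing recorded on or after the cutoff can leak into the features.
print(len(training_rows))
```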
Prediction Engineering with Compose
- Install Compose from PyPI:
pip install composeml
- Import necessary libraries and load the data.
import matplotlib.pyplot as plt
import composeml as cp

df = cp.demos.load_transactions()
df.head()
- Write the labelling function.
The labelling function will calculate the total of a customer’s transactions over a span of an hour. It will be passed groups of data points corresponding to different windows of one hour; all it needs to do is add them up.
def amount_spent(data):
    total = data['amount'].sum()
    return total
- Create a LabelMaker object for the prediction problem. We intend to calculate the hourly transactions of each customer, so we set the target_entity to the customer ID, the window_size to one hour and pass our labelling function.
label_maker = cp.LabelMaker(
    target_entity="customer_id",
    time_index="transaction_time",
    labeling_function=amount_spent,
    window_size="1h",
)
- Use the search() method on the LabelMaker object to automatically search for and extract labels.
labels = label_maker.search(
    df.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    gap=1,
    verbose=True,
)
labels.head()
labels.plot.dist()
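The search step can be pictured with a plain-Python sketch: group the rows by customer, cut each customer's history into one-hour windows, and apply the labelling function to every non-empty window. This is only an illustration of the idea, not Compose's implementation. The helper sketch_search, the sample rows, and the choice to start windows at each customer's first transaction are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def amount_spent(window_rows):
    """Labelling function: total amount spent within one window."""
    return sum(row["amount"] for row in window_rows)


def sketch_search(rows, window_size, labeling_function):
    """Return (customer_id, window_start, label) for each non-empty window."""
    # Group rows by customer, in chronological order.
    by_customer = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["transaction_time"]):
        by_customer[row["customer_id"]].append(row)

    labels = []
    for customer_id, customer_rows in by_customer.items():
        start = customer_rows[0]["transaction_time"]
        last = customer_rows[-1]["transaction_time"]
        # Slide a fixed-size, non-overlapping window over the customer's history.
        while start <= last:
            stop = start + window_size
            window = [r for r in customer_rows if start <= r["transaction_time"] < stop]
            if window:
                labels.append((customer_id, start, labeling_function(window)))
            start = stop
    return labels


# Hypothetical rows, invented for this illustration.
rows = [
    {"customer_id": 1, "transaction_time": datetime(2014, 1, 1, 0, 0), "amount": 10.0},
    {"customer_id": 1, "transaction_time": datetime(2014, 1, 1, 0, 30), "amount": 15.0},
    {"customer_id": 1, "transaction_time": datetime(2014, 1, 1, 1, 10), "amount": 5.0},
    {"customer_id": 2, "transaction_time": datetime(2014, 1, 1, 0, 45), "amount": 40.0},
]

sketch_labels = sketch_search(rows, timedelta(hours=1), amount_spent)
for customer_id, window_start, label in sketch_labels:
    print(customer_id, window_start, label)
```

Each tuple pairs a label with the time at which its window begins, which mirrors the role of the time column in the LabelTimes table.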
- Various transformations can be applied to the LabelTimes table to modify the label as per the problem.
Let’s say you want binary labels that flag hourly transaction totals greater than $200. This can be done using the threshold() method:
binary_labels = labels.threshold(200)
binary_labels.head()
binary_labels.plot.distribution()
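As a rough picture of what thresholding does to the label values, the following plain-Python sketch turns numeric totals into True/False labels. It assumes a strict "greater than" comparison, following the wording above; the helper to_binary and the sample amounts are hypothetical and not part of Compose.

```python
def to_binary(label_values, threshold):
    """Map each numeric label to True if it is strictly above the threshold."""
    return [value > threshold for value in label_values]


# Hypothetical hourly totals, invented for this illustration.
hourly_totals = [150.0, 200.0, 320.5]
binary = to_binary(hourly_totals, 200)
print(binary)
```

Under the strict comparison assumed here, a total of exactly 200 is labelled False.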
Or maybe you want to shift the label times by one hour for predicting in advance. This can be achieved using the apply_lead() method:
shifted_labels = labels.apply_lead('1h')
shifted_labels.head()
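The effect of a lead can be sketched in plain Python. This assumes, as "predicting in advance" suggests, that the lead moves each label time earlier while leaving the label value unchanged. The helper shift_earlier and the sample rows are hypothetical and not part of Compose.

```python
from datetime import datetime, timedelta


def shift_earlier(label_rows, lead):
    """Move every label time earlier by the lead, keeping the label value."""
    return [(time - lead, label) for time, label in label_rows]


# Hypothetical (time, label) pairs, invented for this illustration.
label_rows = [
    (datetime(2014, 1, 1, 1, 0), 25.0),
    (datetime(2014, 1, 1, 2, 0), 5.0),
]

shifted = shift_earlier(label_rows, timedelta(hours=1))
for time, label in shifted:
    print(time, label)
```

With an earlier time attached to the same label, any features built up to that time must predict the outcome one hour ahead.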
You can learn more about the available methods here.
- Once you’re satisfied with the labels, you can use the describe() method to print out the distribution of the labels and the settings and transformations that were used to create them.
binary_labels.describe()
Last Epoch
This article discussed Compose, a Python library for automating prediction engineering. It enables developers to structure prediction problems and generate labels by writing a single labelling function. Combined with EvalML and Featuretools, Compose makes up Alteryx’s machine learning ecosystem, which standardizes different parts of the machine learning pipeline. Taking such an approach to machine learning enables data scientists to experiment, debug and adapt to changes in the business problem more rapidly. To learn more about prediction engineering with Compose, refer to the following resources: