A/B testing, also called split testing, is a method for measuring how a change in a single variable affects audience or user engagement. It is widely used in marketing, web design, product development, and user experience design to improve campaigns and conversion rates. A test changes one variable at a time, such as a headline, an image, or the layout of an element. In a 50/50 split, half of the sampled traffic is shown the existing (control) version and the other half is shown the modified version. After a set period of engagement, or once a defined goal has been completed, the chosen metric is compared between the two versions.
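The final comparison step can be sketched with a two-proportion z-test, one common way to check whether a difference in conversion rate is larger than chance would explain. This is a minimal illustration rather than a prescribed method: the function name and the conversion counts are hypothetical, and the test assumes independent users and reasonably large samples.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p) for the difference in conversion rate B - A.

    conv_a / n_a: conversions and visitors for the control version.
    conv_b / n_b: conversions and visitors for the modified version.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 200 of 5000 control visitors convert, 250 of 5000 variant visitors.
z, p = two_proportion_z_test(200, 5000, 250, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference is unlikely to be noise alone; the significance threshold should be chosen before the test starts.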
A/B testing ML Models
Data scientists now apply A/B testing to ML models as well. Businesses develop machine learning models to improve their results, and they use KPIs to track progress toward specific business objectives. When data scientists and machine learning engineers develop models on their own machines, however, they do not track progress against these KPIs. Instead, they evaluate model performance on historical datasets. A model that performs well on offline tests and metrics will not necessarily move the KPIs that matter to the company, because offline tests cannot establish causality.
Machine learning models are usually trained on a fixed training set in an offline or local environment and only later deployed to live, dynamic data, which is a significant change in conditions. A/B testing, on the other hand, is carried out on real-time, online data. Because the data in the live environment can change or evolve away from the original training data, models frequently experience concept drift or covariate drift, both forms of machine learning drift, and become less accurate or efficient over time. When drift is detected, models can be retrained or refitted to restore their performance. For this reason, data scientists have suggested A/B testing as an optimisation strategy to improve the creation and deployment of machine learning models.
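One way to check a single numeric feature for covariate drift is the Population Stability Index (PSI), which compares how training data and live data are spread across the same set of buckets. The sketch below is an illustration under stated assumptions: the function name, the equal-width bucketing, and the sample data are hypothetical, and PSI is only one of several possible drift measures.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample (e.g. training data) and a live sample.

    Buckets are equal-width over the range of the reference sample; live
    values outside that range are counted in the first or last bucket.
    """
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at a tiny value so log() is always defined.
        return [max(c / len(values), 1e-6) for c in counts]

    e = bucket_shares(expected)
    a = bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: the live data is shifted upwards by 3 units.
training = [i / 100 for i in range(1000)]
live = [x + 3 for x in training]
print(population_stability_index(training, training))  # identical samples
print(population_stability_index(training, live))      # shifted sample
```

A PSI near zero means the two distributions look alike, and larger values indicate a bigger shift. The cut-off that should trigger retraining is a judgment call for each feature and use case.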
A/B testing for ML deployment
A/B testing can be used to test and improve machine learning models, in particular to determine whether a new model is superior to the existing one. For this purpose, the organisation should first select a metric on which to compare the control model and the new model. This metric is used both to judge the success of the deployment and to distinguish between the two models. Both models then serve a sample of live traffic simultaneously for a set amount of time: half of the users are served by the control model, while the other half are served by the new model.
In a dynamic setting, A/B testing for machine learning models is a helpful experiment for evaluating user preference. However, the strategy has drawbacks to consider. Because half of the users are exposed to the control version and the other half to the newer version, half of the users are served the weaker alternative for the duration of the trial. The audience's preferences may also be fairly evenly divided: even when the majority prefers option B, 40% of the audience may still prefer option A.
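The 50/50 assignment of users to models can be sketched with a hash of the user ID, so that the same user always sees the same model during the test. This is a minimal illustration, not a prescribed implementation: the function name, the experiment label, and the variant names are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment="model-ab-test", treatment_share=0.5):
    """Deterministically map a user to "control" or "new_model".

    Hashing the experiment label together with the user ID keeps the
    assignment stable across requests and independent between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Turn the first 32 bits of the hash into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "new_model" if bucket < treatment_share else "control"


# Hypothetical user IDs: count how the split falls out over 10,000 users.
assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
share_new = assignments.count("new_model") / len(assignments)
print(f"share routed to the new model: {share_new:.3f}")
```

Assigning by user rather than by request avoids showing one person both models, which would blur the comparison of the chosen metric.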
Use Cases: A/B Testing with Amazon SageMaker
Data scientists and engineers in functional ML operations routinely aim to enhance their models in a variety of ways:
- Hyperparameter tuning
- Training on new or more recent data
- Better feature selection
A/B testing the new model against the old one can be an effective final step in validating it. The test runs several model variants side by side and assesses how they perform against one another. If the new version performs better, users can replace the older model with it.
Using production variants, Amazon SageMaker allows users to test multiple models or model versions behind the same endpoint. Each production variant identifies an ML model and the resources used to host it. Users can distribute endpoint invocation requests across the production variants either by specifying a traffic distribution for the variants or by calling a specific variant directly in each request.
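A setup along these lines might look as follows with the boto3 SDK. This is a sketch under assumptions, not a complete deployment: the model names, endpoint name, instance type, and content type are hypothetical placeholders, the two models are assumed to already exist in SageMaker, and the AWS calls need valid credentials to run.

```python
def build_production_variants(control_model, new_model, new_weight=0.5,
                              instance_type="ml.m5.large"):
    """Build the ProductionVariants list for two models behind one endpoint.

    SageMaker treats the weights as relative: each variant receives its
    weight divided by the sum of all weights as its share of traffic.
    """
    return [
        {"VariantName": "control", "ModelName": control_model,
         "InstanceType": instance_type, "InitialInstanceCount": 1,
         "InitialVariantWeight": 1.0 - new_weight},
        {"VariantName": "new-model", "ModelName": new_model,
         "InstanceType": instance_type, "InitialInstanceCount": 1,
         "InitialVariantWeight": new_weight},
    ]


def create_ab_endpoint(endpoint_name, variants):
    """Create an endpoint config and endpoint hosting all variants."""
    import boto3  # imported here so the helper above runs without AWS access
    sm = boto3.client("sagemaker")
    config_name = f"{endpoint_name}-config"
    sm.create_endpoint_config(EndpointConfigName=config_name,
                              ProductionVariants=variants)
    sm.create_endpoint(EndpointName=endpoint_name,
                       EndpointConfigName=config_name)


def invoke(endpoint_name, payload, variant_name=None):
    """Invoke the endpoint; pass variant_name to call one variant directly."""
    import boto3
    runtime = boto3.client("sagemaker-runtime")
    kwargs = {"EndpointName": endpoint_name, "ContentType": "text/csv",
              "Body": payload}
    if variant_name is not None:
        kwargs["TargetVariant"] = variant_name
    response = runtime.invoke_endpoint(**kwargs)
    # The response reports which variant actually served the request.
    return response["Body"].read(), response.get("InvokedProductionVariant")


# Hypothetical model names; only the config is built here, no AWS call is made.
variants = build_production_variants("churn-model-v1", "churn-model-v2")
print([(v["VariantName"], v["InitialVariantWeight"]) for v in variants])
```

Omitting `variant_name` leaves routing to the configured weights, while setting it sends the request to a named variant, matching the two invocation modes described above.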