A/B testing on ML models: A primer

A/B testing, also known as split testing, is a strategy for determining how a change in a single variable, such as a headline, image, or element layout, affects audience or user engagement. It is commonly used to improve campaigns and conversion rates in marketing, web design, product development, and user experience design. In a 50/50 split, a sample of the audience is divided between the control version and the adjusted version: half of the traffic interacts with the existing version, while the other half interacts with the newer one. After a set engagement period, or once a defined goal has been completed, the chosen metric is compared between the two versions.
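
The mechanics are simple enough to sketch in a few lines of Python. The snippet below is a minimal illustration, not a production setup: it assumes a hash-based 50/50 assignment, a conversion rate as the chosen metric, and made-up conversion counts for the two versions.

```python
import hashlib
import math

def assign_group(user_id: str) -> str:
    """Deterministically assign a user to 'A' (control) or 'B' (new version)
    with a 50/50 split, based on a hash of the user id."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "A" if bucket < 50 else "B"

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-proportion z-test on conversion counts; returns z and a two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

print(assign_group("user-123"))  # same user always lands in the same group

# Hypothetical results after the trial period: (conversions, users) per group.
z, p = two_proportion_z(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # a small p-value would suggest the difference is not due to chance
```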

A/B testing ML Models

Data scientists now apply A/B testing to machine learning models as well. Businesses develop machine learning models to improve their business results, and KPIs are used to track progress toward specific business objectives. However, when data scientists and machine learning engineers develop models on their own machines, they do not track progress against these KPIs; instead, they compare model performance on historical datasets. A model that performs well on offline tests and metrics will not necessarily move the KPIs that matter to the company, because causality cannot be established through offline tests alone.
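
As a rough illustration of what offline evaluation looks like (the dataset, model, and metric below are hypothetical stand-ins), a model is typically scored against a held-out slice of historical data; a business KPI such as conversion rate can only be observed once the model is serving live traffic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical historical dataset standing in for the team's real data.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# Offline metric: AUC on the historical holdout. A good score here does not
# prove the model will move a live KPI such as conversion rate.
print("offline AUC:", round(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]), 3))
```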

Machine learning models are usually constructed in an offline environment and then deployed against live, dynamic data, which is a significant shift in conditions; A/B testing, by contrast, is carried out on real-time or online data. Because most models are trained on static training data in an offline or local environment, they frequently experience concept drift or covariate drift, both forms of machine learning drift: as the data in the dynamic, real environment evolves away from the original training data, the model can become less accurate or efficient over time. When drift is detected, models can be retrained or refitted regularly to keep them current. Hence, data scientists have suggested A/B testing as an optimisation strategy to improve the development and deployment of machine learning models.
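
One common way to check a single numeric feature for covariate drift is to compare its training-time distribution with recent live data using a two-sample Kolmogorov–Smirnov test. The sketch below is illustrative only; the significance threshold, the feature, and the synthetic data are all assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_covariate_drift(train_col: np.ndarray, live_col: np.ndarray,
                           alpha: float = 0.01) -> bool:
    """Flag drift in one numeric feature by comparing the training
    distribution with recent live data via a two-sample KS test."""
    statistic, p_value = ks_2samp(train_col, live_col)
    return p_value < alpha  # a low p-value suggests the distributions differ

# Hypothetical data: the live feature has shifted away from the training data.
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_feature = rng.normal(loc=0.4, scale=1.2, size=5_000)

if detect_covariate_drift(train_feature, live_feature):
    print("Drift detected -- schedule retraining on recent data")
```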

A/B testing for ML deployment 

The A/B testing technique can also be used to test and improve machine learning models, for example to determine whether a new model is superior to the existing one. For this purpose, the organisation should select a metric on which to compare the control model and the new model; this metric determines the success of the deployment and distinguishes between the two. Both models are then served on a sample of live data simultaneously for a set amount of time, with half of the users routed to the control model and the other half to the new model.
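
A minimal sketch of such a deployment split is shown below, assuming a deterministic hash of the user ID routes each request to either the control model or the new model; the model interface and the stand-in models are hypothetical.

```python
import hashlib
from typing import Any, Callable, Dict

class ABModelRouter:
    """Route scoring requests 50/50 between a control model and a new model.

    `control` and `challenger` are any callables that map a feature dict to a
    prediction (a hypothetical interface, for illustration only)."""

    def __init__(self, control: Callable, challenger: Callable):
        self.models = {"control": control, "challenger": challenger}

    def predict(self, user_id: str, features: Dict[str, Any]) -> Dict[str, Any]:
        # Deterministic 50/50 assignment so a given user always sees the same model.
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 2
        variant = "control" if bucket == 0 else "challenger"
        prediction = self.models[variant](features)
        # Record the variant with the prediction so the chosen business metric
        # can later be compared between the two models.
        return {"variant": variant, "prediction": prediction}

# Hypothetical usage with stand-in models:
router = ABModelRouter(control=lambda f: 0.3, challenger=lambda f: 0.7)
print(router.predict("user-42", {"recency_days": 3}))
```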

Drawback

In a dynamic setting, A/B testing machine learning models is a helpful experiment for evaluating user preference, but the strategy has several drawbacks to consider. With half of the users exposed to the control version and the other half to the newer version, half of the users are presented with a less-than-optimal alternative during the trial. Moreover, the overall preference can be fairly evenly split: even if the majority of the audience prefers option B, 40% may still prefer option A.

Use Cases: A/B Testing with Amazon SageMaker 

Data scientists and engineers in functional ML operations routinely aim to enhance their models in a variety of ways:

  • Hyperparameter tuning (a minimal tuning sketch follows this list)
  • Training on new or more recent data
  • Better feature selection
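
As a rough example of the first item, a hyperparameter search is typically run offline, and the winning candidate is then validated against the current production model with an A/B test; the estimator, parameter grid, and synthetic data below are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical data standing in for the team's training set.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

# Offline hyperparameter search; the best candidate would still be
# validated against the current production model in an A/B test.
search = GridSearchCV(
    LogisticRegression(max_iter=1_000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```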

A/B testing the new model against the old one can be an effective final step in validating it: several model variants are served and assessed against one another, and the older model is replaced if the new version performs better.

Using production variants, Amazon SageMaker allows users to test multiple models or model versions behind the same endpoint. Each production variant identifies an ML model and the resources used to host it. Users can distribute endpoint invocation requests across production variants by specifying a traffic distribution for each variant, or they can invoke a specific variant directly for each request.
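
A minimal boto3 sketch of this setup follows. It assumes two models, hypothetically named churn-model-v1 and churn-model-v2, have already been created in SageMaker; the endpoint name, instance type, and request payload are placeholders, and the variant weights split traffic 50/50.

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Endpoint configuration with two production variants behind one endpoint.
sm.create_endpoint_config(
    EndpointConfigName="churn-ab-config",
    ProductionVariants=[
        {
            "VariantName": "control",
            "ModelName": "churn-model-v1",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.5,  # 50% of traffic
        },
        {
            "VariantName": "challenger",
            "ModelName": "churn-model-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.5,  # remaining 50%
        },
    ],
)
sm.create_endpoint(EndpointName="churn-ab-endpoint",
                   EndpointConfigName="churn-ab-config")

# Once the endpoint is InService, invocations are split according to the
# variant weights; a specific variant can also be targeted per request.
response = runtime.invoke_endpoint(
    EndpointName="churn-ab-endpoint",
    ContentType="text/csv",
    Body="34,2,0.7",
    TargetVariant="challenger",
)
print(response["Body"].read())
```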
