A/B testing on ML models: A primer

A/B testing, also known as split testing, measures how a change to a single variable affects audience or user engagement. It is a commonly used strategy for improving campaigns and conversion rates in marketing, web design, product development, and user experience design. In a typical test, one element, such as a headline, image, or layout, is changed, and the audience is split 50/50: half of the traffic sees the control version while the other half sees the modified version. After a defined engagement period, or once a defined goal is reached, the chosen metric is compared between the two versions.
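The 50/50 split described above can be sketched in code: users are deterministically assigned to a variant by hashing their IDs, and a conversion metric is then compared between the two groups. The user IDs, outcome data, and function names below are illustrative, not part of any production framework.

```python
import hashlib

def assign_variant(user_id: str) -> str:
    """Deterministically assign a user to variant 'A' (control) or 'B' (treatment)."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def conversion_rate(outcomes) -> float:
    """outcomes: list of 0/1 conversion flags observed for one variant."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Illustrative data: which sampled users converted under each variant.
results = {"A": [1, 0, 0, 1, 0], "B": [1, 1, 0, 1, 1]}
print(conversion_rate(results["A"]), conversion_rate(results["B"]))
```

Hashing the user ID (rather than randomising per request) keeps each user in the same variant for the whole trial, which is what makes the engagement comparison meaningful.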

A/B testing ML Models

Data scientists now apply A/B testing to ML models as well. Businesses develop machine learning models to improve business outcomes, and KPIs are used to track progress toward specific business objectives. However, when data scientists and machine learning engineers develop models on their own machines, they do not track progress against these KPIs; instead, they compare model performance on historical datasets. A model that performs well on offline tests and metrics will not necessarily move the KPIs that matter to the company, because causality cannot be established through offline tests.

Machine learning models are usually constructed in an offline environment and then deployed against live, dynamic data, which is a significant shift in conditions. A/B testing, by contrast, is carried out on real-time, online data. Because most models are trained on historical data in an offline or local environment, they frequently experience concept drift or covariate drift, both forms of machine learning drift: the data in the dynamic, real environment has changed or evolved away from the original training data, so the model becomes less accurate or efficient over time. When machine learning drift is detected, models can be retrained or refitted regularly to maintain performance. Data scientists have therefore suggested A/B testing as an optimisation strategy to improve the development and deployment of machine learning models.
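Covariate drift of the kind described above can be checked by comparing the live distribution of a feature against the distribution seen at training time. The sketch below uses a hand-rolled two-sample Kolmogorov-Smirnov statistic; the feature values and the drift threshold are illustrative assumptions, not standard settings.

```python
import bisect

def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully separated)."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a + b))

# Feature values seen at training time vs. in the live system (illustrative).
train = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5]
live = [0.6, 0.7, 0.7, 0.8, 0.9, 1.0]
drift = ks_statistic(train, live) > 0.5  # 0.5 is an arbitrary illustrative cutoff
print("covariate drift detected:", drift)
```

In practice a library routine such as `scipy.stats.ks_2samp` would be used, and the alert threshold would be tuned to the feature and the tolerated false-alarm rate.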



A/B testing for ML deployment 

The A/B testing technique can also be used to test and improve machine learning models, in particular to determine whether a new model is superior to the existing one. For this purpose, the organisation selects a metric on which to compare the control model and the new model; this metric defines what counts as a successful deployment and distinguishes between the two. Both models are then served on a sample of live data simultaneously for a set period, with half of the users routed to the control model and the other half to the new model.
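Once a comparison metric is chosen, deciding whether the new model genuinely beats the control usually calls for a significance test rather than eyeballing two rates. The sketch below applies a standard two-proportion z-test to a rate metric such as click-through; the counts are illustrative.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-proportion z-test: is variant B's success rate different from A's?
    Returns the z score and the two-sided p-value (normal approximation)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value

# Illustrative counts: conversions out of requests served by each model.
z, p = two_proportion_z(success_a=120, n_a=1000, success_b=150, n_b=1000)
print(round(z, 2), round(p, 4))
```

A small p-value (conventionally below 0.05) suggests the difference between the control model and the new model is unlikely to be noise; the exact threshold is the organisation's choice.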


In a dynamic setting, A/B testing machine learning models is a useful experiment for evaluating user preference. However, the strategy has drawbacks to consider. With half of the users exposed to the control version and the other half to the newer version, half of the users are served a potentially suboptimal alternative for the duration of the trial. Moreover, the audience's preferences can be fairly evenly split: even if the majority prefers option B, 40% of the audience may still prefer option A.


Use Cases: A/B Testing with Amazon SageMaker 

Data scientists and engineers running ML operations in production routinely aim to improve their models in several ways:

  • Hyperparameter tuning
  • Training on new or more recent data
  • Better feature selection

A/B testing the new model against the old one can be an effective final step in validating it: several model variants are tested and assessed against one another, and the older model is replaced if the new version performs better.

Amazon SageMaker allows users to test multiple models or model versions behind the same endpoint using production variants. Each production variant identifies an ML model and the resources deployed to host it. Users can distribute endpoint invocation requests across production variants by specifying a traffic distribution for each variant, or can invoke a specific variant directly for each request.
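A minimal sketch of the traffic-splitting setup described above, expressed as the request dictionary that boto3's `create_endpoint_config` accepts. The config name, model names, instance settings, and the helper function itself are illustrative assumptions; only the field names follow the SageMaker API.

```python
def make_endpoint_config(config_name, variants):
    """Build a SageMaker endpoint-config request with weighted production variants.

    variants: list of (variant_name, model_name, traffic_weight) tuples.
    """
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": name,
                "ModelName": model,
                "InstanceType": "ml.m5.large",
                "InitialInstanceCount": 1,
                "InitialVariantWeight": weight,
            }
            for name, model, weight in variants
        ],
    }

# 50/50 traffic split between the existing model and the candidate model.
config = make_endpoint_config(
    "ab-test-config",
    [("control", "model-v1", 0.5), ("candidate", "model-v2", 0.5)],
)
# With boto3, this dict would be passed as keyword arguments to
# boto3.client("sagemaker").create_endpoint_config(**config); a single
# variant can also be invoked directly via the runtime client's
# invoke_endpoint(..., TargetVariant="candidate").
```

SageMaker routes traffic to each variant in proportion to its `InitialVariantWeight`, so shifting more traffic to the winner later only requires updating the weights, not redeploying the models.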


Abhishree Choudhary
Abhishree is a budding tech journalist with a UGD in Political Science. In her free time, Abhishree can be found watching French new wave classic films and playing with dogs.
