Why Should A Robotics Researcher Care About A/B Testing?

In machine learning research, it is standard practice to establish a baseline for a model through experiments and then iterate to beat that baseline over time: new terms might be added to the model, more trial runs performed, and so on. But according to researchers at Google, there is an oft-overlooked yet fundamental problem in this routine: an experiment, for example in robotics, can produce inconsistent results even when carried out in a controlled environment. Inconsistent performance relative to the baseline leaves researchers in the dark about what exactly is influencing the results. To address this fundamental challenge, the researchers at Google proposed a classical randomised approach: A/B testing.

A/B testing is a popular statistical technique for comparing two variants, widely used to find the better product, the more profitable website design, and so on.
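At its core, an A/B test compares the success rates of two variants and asks whether the observed difference could plausibly be chance. A minimal sketch with made-up conversion numbers, using a standard two-proportion z-test:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Pooled rate under the null hypothesis that both variants are equal.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the standard normal CDF (math.erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical numbers: variant B converts 120/1000 against A's 100/1000.
z, p = two_proportion_z(100, 1000, 120, 1000)
```

With these figures the p-value comes out around 0.15, so the apparent lift would not be declared significant at the usual 5% level.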

So, what has A/B testing got to do with robotics? Classical experimental methods such as A/B testing are not the default in robotics research. But the researchers argue that such methods are critical to producing meaningful and measurable scientific results for robotics in real-world scenarios.



Significance of A/B testing for Robotics

Source: Google AI

A perfectly reproducible robotics experiment is a myth. Even in a controlled environment, robotic arms are subject to wear and tear, vulnerable to blind spots in perception as the lighting changes, and can even be affected by a low battery driving the motors. These factors compound as the number of trials grows. To verify this, the Google researchers ran 33,000 task trials on two robots over a period of five months using the same software and machine learning model. The robots were tasked with moving identical foam cubes from one basket to the other, as shown above. Their performance was measured against a baseline: the overall success rate over the last two weeks of the experiment.

As shown below, the Y-axis represents the 95% confidence interval of the percentage change in success rate relative to the baseline, which sits at zero in the plot. Any confidence interval that contains zero indicates a success rate statistically indistinguishable from the baseline. The researchers computed the confidence intervals using the jackknife, with a Cochran-Mantel-Haenszel (CMH) correction to remove operator bias. The jackknife is a resampling technique commonly used for variance and bias estimation, while CMH is often used in observational studies where subjects cannot be randomly assigned to treatments but confounding covariates can be measured.
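As a rough illustration of the jackknife idea (not Google's actual pipeline, and omitting the CMH correction), a leave-one-out estimate of a success rate's standard error, with a normal-approximation 95% interval, might look like:

```python
import math
import random

def jackknife_ci(outcomes, z=1.96):
    """Leave-one-out jackknife estimate of a success rate's standard
    error, with a normal-approximation confidence interval."""
    n = len(outcomes)
    total = sum(outcomes)
    full = total / n
    # Success rate computed with each single trial left out in turn.
    loo = [(total - x) / (n - 1) for x in outcomes]
    loo_mean = sum(loo) / n
    # Standard jackknife variance: (n-1)/n * sum of squared deviations.
    var = (n - 1) / n * sum((m - loo_mean) ** 2 for m in loo)
    se = math.sqrt(var)
    return full, (full - z * se, full + z * se)

# Simulated log of 500 pass/fail trials with a true success rate of 0.8.
random.seed(0)
trials = [1 if random.random() < 0.8 else 0 for _ in range(500)]
mean, (lo, hi) = jackknife_ci(trials)
```

For a simple mean like this the jackknife reproduces the usual standard error; its value is that the same recipe extends to more complicated statistics.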


Comparison of sequential vs randomised assignment (Source: Google AI)

The researchers were in for a surprise when they plotted the results. In the sequential experiment, the robots performed better around weeks 12 and 13, and except for week 19, none of the confidence intervals contained the baseline. This was a controlled experiment, and such results would have discouraged many researchers from continuing the project. To probe these glaring inconsistencies, the Google researchers followed up the sequential analysis by grouping the same data into random sub-samples and plotting their confidence intervals. The randomised assignment, however, produced a far more consistent picture: every sub-sample's interval was statistically consistent with the baseline. When the same analysis was repeated on other robotic tasks, the results were similar.
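The effect of sequential versus randomised grouping can be reproduced on synthetic data. The sketch below (an illustration, not the researchers' code) simulates a trial log whose success rate drifts upward over time, then splits it into ten sub-samples either in time order or at random. The sequential groups' intervals frequently miss the overall baseline, while the randomised groups' intervals rarely do:

```python
import math
import random

def group_intervals(outcomes, k, randomise, z=1.96):
    """Split a trial log into k equal sub-samples, sequentially or at
    random, and return each sub-sample's 95% CI on success rate."""
    order = list(range(len(outcomes)))
    if randomise:
        random.shuffle(order)
    size = len(outcomes) // k
    intervals = []
    for g in range(k):
        chunk = [outcomes[i] for i in order[g * size:(g + 1) * size]]
        p = sum(chunk) / len(chunk)
        se = math.sqrt(p * (1 - p) / len(chunk))
        intervals.append((p - z * se, p + z * se))
    return intervals

random.seed(1)
n = 3000
# Simulated drift: the true success rate climbs from 0.6 to 0.9.
log = [1 if random.random() < 0.6 + 0.3 * i / n else 0 for i in range(n)]
baseline = sum(log) / n  # overall success rate, the "0" line in the plot
seq = group_intervals(log, 10, randomise=False)
rnd = group_intervals(log, 10, randomise=True)
# Count sub-samples whose interval misses the overall baseline.
seq_miss = sum(1 for lo, hi in seq if not lo <= baseline <= hi)
rnd_miss = sum(1 for lo, hi in rnd if not lo <= baseline <= hi)
```

Random grouping spreads the environmental drift evenly across sub-samples, which is exactly why the randomised plot in the Google experiment looked so much calmer.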

The whole experiment was a large-scale simulation of A/B testing with randomness in the mix. It led the researchers to conclude that most of the apparent differences in their experiments were not statistically significant. A/B testing with random assignment can serve as a powerful tool for controlling the unexplained variance of the real world in robotics. With its significance in a robotics setup established, here are some key takeaways:

  • Baselines must be run in parallel with the experimental conditions for ease of comparison; this also avoids a stale baseline.
  • The absolute performance metric (in baseline or experiment) depends to an unknown degree on the state of the world, so it is usually not informative on its own.
  • Data efficiency improves with scale.
  • Under random assignment, environmental biases affect both conditions roughly equally and cancel out, so experimental resets can be postponed.
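The first takeaway, running the baseline in parallel with the experiment, amounts to building a balanced, randomised trial schedule. A hypothetical sketch:

```python
import random

def interleaved_schedule(n_trials, arms=("baseline", "experiment"), seed=42):
    """Build a balanced, randomised schedule: each arm gets an equal
    number of trials, shuffled so both arms face the same slowly
    drifting environment (hypothetical sketch, not Google's tooling)."""
    per_arm = n_trials // len(arms)
    schedule = [arm for arm in arms for _ in range(per_arm)]
    random.Random(seed).shuffle(schedule)
    return schedule

schedule = interleaved_schedule(1000)
```

Because both arms are interleaved through the same five months of wear, lighting changes, and battery drain, any environmental drift hits them equally and drops out of the comparison.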

So far, such hypothesis tests have mostly been confined to ad campaigns and public-administration surveys, with occasional appearances in experiments where an algorithmic learner is tasked with picking the best option using similar methodology. Finding a classical statistical instrument like A/B testing in an environment driven by cutting-edge deep reinforcement learning algorithms is a first. It offers a fresh perspective on running these exhaustive experiments in a sophisticated setup that can stretch over months. The aforementioned confidence intervals can act as a barometer for a researcher's confidence in continuing these expensive experiments.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
