In machine learning research, it is standard practice to establish a baseline for a model through experiments and then tune variables over time to beat it. New mathematical expressions might be added, more trial runs performed, and so on. But according to researchers at Google, there is an oft-overlooked yet fundamental aspect of this experiment routine: an experiment, for example in robotics, can produce inconsistent results even when carried out in a controlled environment. Inconsistent performance relative to the baseline keeps researchers in the dark about what exactly is influencing the results. To address this fundamental challenge, the Google researchers proposed a randomised approach familiar from other fields: A/B testing.
A/B testing is a popular statistical technique for deciding between alternatives, such as finding the better-performing product variant or the more profitable website design.
So, what has A/B testing got to do with robotics? Classical research methods such as A/B testing are not a default option in robotics research, but the researchers argue that they are critical to producing meaningful, measurable scientific results for robotics in real-world scenarios.
Significance of A/B testing for Robotics
A perfectly reproducible robotics experiment is a myth. Even in a controlled environment, robotic arms are subject to wear and tear, vulnerable to blind spots in perception as the lighting changes, and can even be influenced by the low-power battery that runs the motor. These factors magnify as the number of trial runs increases. To verify this, the Google researchers ran 33,000 task trials on two robots over a period of five months using the same software and machine learning model. The robots were tasked with moving identical foam cubes from one basket to another, as shown above. Their performance was measured against a baseline: the overall success rate over the last two weeks of the experiment.
As shown below, the Y-axis represents the 95% confidence interval of the percentage change in success rate relative to the baseline, with '0' marking the baseline itself. Any confidence interval that contains zero indicates that the success rate is statistically indistinguishable from the baseline. The researchers computed the confidence intervals using the jackknife, with a Cochran-Mantel-Haenszel (CMH) correction to remove operator bias. The jackknife is a resampling technique commonly used for variance and bias estimation, while CMH is often used in observational studies where subjects cannot be randomly assigned to treatments but confounding covariates can be measured.
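To make the interval computation concrete, here is a minimal sketch of a delete-one jackknife confidence interval for the percentage change in success rate relative to a baseline. The function name and the simulated trial data are illustrative assumptions, and the CMH operator-bias correction used in the actual study is omitted.

```python
import math
import random

def jackknife_pct_change_ci(trials, baseline_rate, z=1.96):
    """Delete-one jackknife 95% CI for the % change in success rate
    relative to a fixed baseline rate. `trials` is a list of 0/1 outcomes."""
    n = len(trials)
    total = sum(trials)
    # Leave-one-out estimates of the % change vs. the baseline.
    loo = [((total - t) / (n - 1) - baseline_rate) / baseline_rate * 100
           for t in trials]
    mean = sum(loo) / n
    # Standard jackknife variance estimate.
    var = (n - 1) / n * sum((x - mean) ** 2 for x in loo)
    se = math.sqrt(var)
    return mean - z * se, mean + z * se

# Simulated trials drawn at the baseline success rate of 85%.
rng = random.Random(0)
trials = [1 if rng.random() < 0.85 else 0 for _ in range(1000)]
lo, hi = jackknife_pct_change_ci(trials, baseline_rate=0.85)
# An interval containing 0 means the run is statistically similar to baseline.
```

Because a proportion is a linear statistic, the jackknife point estimate coincides with the plain percentage change; the resampling mainly supplies the variance for the interval.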
The researchers were in for a surprise when they plotted the results. In the sequential grouping, the robots performed better towards weeks 12 and 13, and except for week 19, none of the weekly confidence intervals contained the baseline. This was a controlled experiment, and such results would have discouraged many researchers from continuing the project. To probe these glaring inconsistencies, the Google researchers followed up the sequential analysis by randomly grouping the same data into sub-samples and plotting the confidence intervals again. The randomised assignment generated far more consistent results: every sub-sample's interval contained the baseline. When the same analysis was repeated on other robotic tasks, the results were similar.
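The effect of sequential versus random grouping can be illustrated with a short simulation (hypothetical numbers, not the study's data): when the underlying success rate drifts over time, consecutive chunks show a wide spread of success rates, while randomly assigned sub-samples of the very same trials do not.

```python
import random

def success_rate(sample):
    return sum(sample) / len(sample)

def spread(groups):
    """Range between the best and worst group success rates."""
    rates = [success_rate(g) for g in groups]
    return max(rates) - min(rates)

rng = random.Random(1)
# Simulated drift: the true success probability creeps from 70% to 90%.
trials = [1 if rng.random() < 0.7 + 0.2 * t / 5000 else 0 for t in range(5000)]

# Sequential grouping: 20 consecutive chunks of 250 trials each.
sequential = [trials[i * 250:(i + 1) * 250] for i in range(20)]

# Random assignment: shuffle first, then split into the same-sized chunks.
shuffled = trials[:]
rng.shuffle(shuffled)
randomised = [shuffled[i * 250:(i + 1) * 250] for i in range(20)]

# Time-correlated drift inflates the spread of the sequential groups only.
seq_spread, rand_spread = spread(sequential), spread(randomised)
```

Random assignment spreads the drift evenly across sub-samples, which is why the randomised intervals in the study clustered around the baseline while the weekly ones did not.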
The whole exercise was A/B testing run at a larger scale, with randomness in the mix. It helped the researchers conclude that most of their experiments lack statistically significant differences. A/B testing with random assignment can serve as a powerful tool for controlling the unexplainable variance of the real world in robotics. With the significance of A/B testing in a robotics setup established, here are some key takeaways:
- Baselines must be run in parallel with the experimental conditions for ease of comparison; this also avoids a stale baseline.
- The absolute performance metric (in baseline or experiment) depends to an unknown degree on the state of the world, so it is usually not informative.
- Data efficiency improves with scale.
- Environmental biases cancel out when they have no systematic impact, so experimental resets can be postponed.
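As a sketch of how a parallel baseline makes comparison straightforward, the snippet below applies a standard two-proportion z-test to hypothetical success counts from a baseline arm and an experimental arm run side by side (the counts are made up for illustration).

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-statistic for comparing the success rates of
    two arms run in parallel, using the pooled standard error."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical counts: 1,000 trials per arm, baseline vs. experiment.
z = two_proportion_z(820, 1000, 850, 1000)
significant = abs(z) > 1.96  # 95% two-sided threshold
```

With the baseline measured at the same time, world-state drift affects both arms alike, so the test isolates the change actually under study rather than the mood of the environment.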
Until now, such hypothesis tests have mostly been restricted to ad campaigns and public-administration surveys, with a few appearances in experiments where an algorithmic learner is tasked with picking the right choice using similar methodologies. Finding a classical statistical instrument like A/B testing in an environment driven by cutting-edge deep reinforcement learning algorithms is a first. It offers a fresh perspective on running these exhaustive experiments in a sophisticated setting that can last for months, and the aforementioned confidence intervals in the plots can act as a barometer for a researcher's confidence in continuing these expensive experiments.