It is humbling to think of the number of tools, languages, techniques and applications the machine learning ecosystem has nurtured. Choosing the best fit out of these hundreds of options and then making them work together seamlessly is a data scientist's nightmare. The hidden technical debts in a machine learning (ML) pipeline can incur massive maintenance costs.
According to a report presented by the researchers at Google, there are several ML-specific risk factors to account for in system design:
- Boundary erosion
- Hidden feedback loops
- Undeclared consumers
- Data dependencies
- Configuration issues
Technical debt, a metaphor popularised by Ward Cunningham in 1992, represents the long-term costs incurred by moving quickly in software engineering. Hidden debt is especially dangerous and can deliver a fatal blow to a system.
In their review, the authors draw parallels between software engineering principles and those of machine learning to investigate the recurring issues. Here are a few key takeaways:
Ripple Effect: Everything Is Connected
Changing Anything Changes Everything, or the CACE principle as the researchers call it, refers to the fact that no change made to an ML pipeline is ever isolated. The principle extends to hyper-parameters, learning settings, sampling methods, convergence thresholds, data selection, and essentially every other possible tweak.
So, isolating models and serving ensembles is recommended. This approach comes in handy where sub-problems decompose naturally, such as in disjoint multi-class settings.
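The decomposition above can be made concrete with a minimal sketch in plain Python. The toy centroid "model" and the class labels are illustrative, not from the paper; the point is structural: each class gets its own isolated model, so retraining or tweaking one cannot silently change the others, and the ensemble is served by taking the highest-scoring class.

```python
from statistics import mean

class CentroidModel:
    """A toy per-class model: scores an input by (negative) squared
    distance to the mean of that class's training examples."""
    def __init__(self):
        self.centroid = None

    def fit(self, examples):
        self.centroid = [mean(dim) for dim in zip(*examples)]
        return self

    def score(self, x):
        return -sum((a - b) ** 2 for a, b in zip(x, self.centroid))

def train_isolated(data_by_class):
    # One independent model per class: a natural decomposition for
    # disjoint multi-class problems, limiting the reach of CACE.
    return {label: CentroidModel().fit(xs)
            for label, xs in data_by_class.items()}

def predict(models, x):
    # Serve the ensemble: the class whose model scores highest wins.
    return max(models, key=lambda label: models[label].score(x))

models = train_isolated({
    "small": [(1.0, 1.2), (0.8, 1.1)],
    "large": [(9.0, 8.5), (8.7, 9.2)],
})
print(predict(models, (1.1, 0.9)))  # → small
```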
One current solution for keeping track of changes is a high-dimensional visualisation tool that lets researchers quickly see effects across many dimensions and slicings. Quite a few tools have been developed for this, such as Google's TensorBoard and Facebook's HiPlot.
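To show the core idea behind such tooling without pulling in TensorBoard or HiPlot themselves, here is a hedged, dependency-free sketch: break one metric down along several dimensions at once, so the effect of a change is visible in every slice rather than only in the global average. The record fields and dimension names are hypothetical.

```python
from collections import defaultdict

def metric_by_slice(records, dims):
    """Accuracy broken down by every value of each given dimension."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        for d in dims:
            key = (d, r[d])          # e.g. ("country", "IN")
            totals[key] += 1
            hits[key] += r["correct"]
    return {key: hits[key] / totals[key] for key in totals}

# Illustrative evaluation records: one per prediction.
records = [
    {"country": "IN", "device": "mobile",  "correct": 1},
    {"country": "IN", "device": "desktop", "correct": 0},
    {"country": "US", "device": "mobile",  "correct": 1},
]
for key, acc in sorted(metric_by_slice(records, ["country", "device"]).items()):
    print(key, round(acc, 2))
```

A global accuracy of 0.67 hides that desktop traffic is at 0.0; the per-slice view surfaces it immediately, which is exactly what the visualisation tools do at higher dimensionality.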
Weeding Out Stale Code And Pipeline Jungles
Any ML pipeline begins with data collection and preparation. During this preparation, operations like scraping, sampling, joining and plenty of other steps tend to accumulate in a haphazard way resembling a jungle: a pipeline jungle. Things get even worse in the presence of experimental code that has been forgotten in the code archives. Such stale code can cause the system to malfunction, and a malfunctioning algorithm can crash stock markets or self-driving cars. The risk in the ML context is just too high.
Feature flags help keep track of code paths so that dead code can be identified and removed regularly. Since the whole process is tedious, Uber, the multinational ride-hailing company, has come up with an automated tool called Piranha. This tool helps developers bring down the hammer on obsolete code.
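A minimal sketch of the feature-flag idea, with hypothetical flag names and registry: keeping every flag in one place, with an owner and an introduction date, makes stale flags and the dead code behind them easy to hunt down, which is the kind of cleanup tools like Piranha automate.

```python
FLAGS = {
    # name: (enabled, owner, introduced) -- metadata helps spot stale flags
    "use_new_ranker": (True,  "search-team", "2023-01"),
    "legacy_scoring": (False, "search-team", "2019-06"),
}

def flag_enabled(name):
    return FLAGS[name][0]

def rank(items):
    if flag_enabled("use_new_ranker"):
        return sorted(items, reverse=True)
    # Behind a flag that has been disabled since 2019:
    # a prime candidate for deletion, along with the flag itself.
    return sorted(items)

print(rank([3, 1, 2]))  # → [3, 2, 1]
```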
Customisation Is Not Always Cool
With so many languages and tools available, one can pick a method of choice and stitch several languages together in a pipeline. This causes problems during testing and makes it difficult to share the model across the organisation. Another often-ignored aspect, warn the authors, is excessive prototyping. New ideas are usually implemented as prototype models. However, having too many small-scale prototypes can be costly and can even blind one to the pitfalls of large-scale deployment.
Configuration Can Be Costly
As the systems mature, they usually end up with a wide range of configurable options such as features used, how data is selected, algorithm-specific learning settings, verification methods, etc.
The number of lines of configuration can far exceed the number of lines of the traditional code
A sample configuration may specify where a certain feature is logged from and whether it is logged correctly, whether a feature is available in production, whether some training jobs should be allocated extra memory, and a million other such details.
When these tiny details pile up in a messy system, the configuration becomes almost impossible to deal with. The authors suggest that configurations should be reviewed carefully and stored in a repository. A good configuration system should be easy to visualise and verify, and should make it hard for errors to slip through unnoticed.
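As a hedged sketch of that advice, a configuration file checked into a repository can be validated before any training job runs. The keys and rules below are illustrative, not from the paper, but they capture the idea: make mistakes (a missing feature list, an impossible memory request) fail loudly and early.

```python
REQUIRED = {"features", "data_selection", "learning_rate", "memory_gb"}

def validate(config):
    """Return a list of human-readable configuration errors (empty if valid)."""
    errors = []
    missing = REQUIRED - config.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if not config.get("features"):
        errors.append("no features configured")
    if not 0 < config.get("learning_rate", 0) < 1:
        errors.append("learning_rate must be in (0, 1)")
    if config.get("memory_gb", 0) <= 0:
        errors.append("memory_gb must be positive")
    return errors

good = {"features": ["clicks"], "data_selection": "last_30d",
        "learning_rate": 0.01, "memory_gb": 8}
bad = {"features": [], "learning_rate": 5}

print(validate(good))  # → []
print(validate(bad))   # several errors, caught before training starts
```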
Knowing Where To Look
Any model in production needs to be continuously monitored. There is always a lot of talk about cross-validation of systems. But are there any simple diagnostics, a good starting point that gives a fair, if not comprehensive, idea of what is going on?
A useful heuristic, the authors recommend, is to check whether the distribution of predicted labels matches the distribution of observed labels. This rule of thumb can help detect black swan scenarios where real-world data no longer resembles the historical data on which the model was trained. It can also be leveraged to design an automatic alert system.
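The heuristic above can be sketched in a few lines: compare the two label distributions with a simple distance (total variation here, an illustrative choice, as is the 0.1 threshold) and fire an alert when they drift apart.

```python
from collections import Counter

def total_variation(pred_labels, true_labels):
    """Total variation distance between two empirical label distributions."""
    p, q = Counter(pred_labels), Counter(true_labels)
    n_p, n_q = len(pred_labels), len(true_labels)
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p[l] / n_p - q[l] / n_q) for l in labels)

def drift_alert(pred_labels, true_labels, threshold=0.1):
    """True when predictions no longer track what is actually observed."""
    return total_variation(pred_labels, true_labels) > threshold

# The model keeps predicting "spam" at its historical 20% rate,
# but the world has shifted to 60% spam: the alert fires.
print(drift_alert(["spam"] * 20 + ["ham"] * 80,
                  ["spam"] * 60 + ["ham"] * 40))  # → True
```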
The above-mentioned scenarios are just a few of the many technical debts that can creep into an ML system: configuration debt, data dependency debt, monitoring debt, management debt and many more. These debts compound as ecosystems grow to support multiple models together. So, it is advisable to be aware of all possible vulnerabilities and to keep checking for them regularly.
For starters, the researchers list the following questions, answering which might help you build robust ML systems:
- How easily can an entirely new algorithmic approach be tested at full scale?
- What is the transitive closure of all data dependencies?
- How precisely can the impact of a new change to the system be measured?
- Does improving one model or signal degrade others?
- How quickly can new members of the team be brought up to speed?
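The second question above, about the transitive closure of data dependencies, has a direct computational reading: walk the dependency graph and collect every upstream source a signal ultimately relies on. A minimal sketch follows; the feature names and edges are hypothetical.

```python
def transitive_closure(deps, start):
    """All upstream data sources a given signal ultimately depends on."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for parent in deps.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical dependency graph: signal -> its direct inputs.
deps = {
    "ctr_model":     ["click_logs", "user_features"],
    "user_features": ["profile_db", "click_logs"],
    "click_logs":    ["raw_events"],
}

print(sorted(transitive_closure(deps, "ctr_model")))
# → ['click_logs', 'profile_db', 'raw_events', 'user_features']
```

If any answer surprises you (say, a model silently depending on a deprecated table three hops away), that is data dependency debt made visible.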
Read the original NeurIPS paper here.