As AI products and services are increasingly deployed into the real world, ML DataOps teams have had to iterate rapidly to meet the challenges of handling data for model training and continuous testing. Sudeep George, VP of Engineering at iMerit, shared practical insights on handling training data for complex AI systems operating at scale in his talk, ‘Training data for effective AI deployment’, at the Data Engineering Summit 2022.
Sudeep defined data annotation as “a process of structuring and classifying data so that an ML model can distinguish between the human and the background, the people and vehicles, and understand the actions and intents by analysing its cause.” The process is important because it provides highly accurate ground truth data and helps machines parse and understand the input data.
The annotation journey
Nearly 60 to 70% of the model development time is spent on data-related processes like gathering, cleaning and structuring. The data should be maintained at a very high quality to optimise the model’s performance. To build training data, companies go through three types of vendors. Initially, companies opt for tool providers who can help them annotate complex data objects. Later, the company reaches a point where tons of data needs to be classified, and they bring in workforce providers who offer trained annotators. Soon, the companies reach another inflection point and start working with solution providers.
Model deployment has several stages. First, the model is evaluated against a held-out test dataset. It is then tested against fresh production data and, post-deployment, continuously monitored to ensure it keeps performing. When performance dips, the model has to be retrained.
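The monitoring-and-retrain loop described above can be sketched with a simple drift check. This is a minimal illustration, not iMerit's actual pipeline: the function name, the rolling-accuracy input, and the tolerance threshold are all assumptions made for the example.

```python
def needs_retraining(baseline_acc, recent_accs, tolerance=0.05):
    """Flag a model for retraining when its recent accuracy on
    production data drops below the baseline minus a tolerance.

    baseline_acc -- accuracy measured on the test set before deployment
    recent_accs  -- accuracies from recent monitored production batches
    tolerance    -- how much degradation is acceptable (assumed value)
    """
    recent = sum(recent_accs) / len(recent_accs)
    return recent < baseline_acc - tolerance


# A model deployed at 92% accuracy, now averaging ~80% in production:
needs_retraining(0.92, [0.81, 0.79, 0.80])  # -> True, retrain
needs_retraining(0.92, [0.91, 0.90, 0.92])  # -> False, still healthy
```

Real monitoring systems track many more signals (input distribution shift, label drift, latency), but the principle is the same: compare live behaviour against the pre-deployment baseline and trigger retraining on a sustained dip.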
The traditional approach to building AI systems has been to train a model to reach a baseline performance and tweak and tune it to improve performance. Unfortunately, these models usually fail after deployment. The data-centric approach brings the model to baseline performance and trains it against datasets that represent the environment within which the model has to operate.
For models, it is important to have
- Data at scale
- Trustworthy data
Augmentation is a popular way to scale data. Here, two key approaches are used: deep learning and data domain manipulation. The manipulation techniques are specific to the data modality. For instance, for computer vision, image filters or transformation tools (such as image geometry transformation or image colour space transformation) are used to transform the image for training. On the other hand, deep learning techniques like GANs generate synthetic images based on original data. This ensures the dataset is evenly distributed among all classes and minimises the bias.
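The data-domain manipulation techniques mentioned above (geometric and colour-space transforms for computer vision) can be sketched with plain NumPy. This is an illustrative example of the general idea, not a tool referenced in the talk; the function names and the brightness delta are assumptions.

```python
import numpy as np

def horizontal_flip(image):
    # Geometric transform: mirror the image along its width axis.
    return image[:, ::-1]

def adjust_brightness(image, delta):
    # Colour-space transform: shift pixel intensities uniformly,
    # clipping to the valid [0, 255] range for 8-bit images.
    return np.clip(image.astype(np.int16) + delta, 0, 255).astype(np.uint8)

def augment(image):
    # Yield a few transformed variants of one labelled image,
    # multiplying the effective size of the training set.
    yield horizontal_flip(image)
    yield adjust_brightness(image, 30)
    yield adjust_brightness(image, -30)

# One annotated image (height x width x RGB) becomes four training samples.
img = np.zeros((64, 64, 3), dtype=np.uint8)
variants = list(augment(img))
```

Libraries such as torchvision or Albumentations offer richer versions of these transforms; GAN-based synthesis is the deep-learning counterpart, generating entirely new images rather than transforming existing ones.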
Dataset management is important to ensure the collected data can be reused across various models. It also helps teams understand how the data is distributed and how their databases are organised.
Edge cases affect the performance of the model. Some unique classes have only limited data available, and when the model encounters these in real-life scenarios, it leads to errors and failures. An edge case has different implications depending on the stage of the ML life cycle. In the collection phase, it is about insufficient data. In the labelling phase, it is about incomplete attributes being classified. In training, it is about incomplete testing against scenarios. Lastly, in the deployment phase, it is about the actual performance of the ML model on real-world data. Sudeep discussed some of the common challenges companies face, including guidelines, traceable feedback loops, scenario replication, and end-to-end system validation.
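The collection-phase problem above, insufficient data for rare classes, can be surfaced with a simple distribution audit. This is a generic sketch, not a tool from the talk; the threshold and function name are assumptions for illustration.

```python
from collections import Counter

def underrepresented_classes(labels, min_fraction=0.05):
    """Return the classes whose share of the dataset falls below a
    threshold -- a simple proxy for edge cases likely to fail in
    production due to insufficient collected data.

    labels       -- list of class labels, one per annotated sample
    min_fraction -- minimum acceptable share per class (assumed value)
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return sorted(c for c, n in counts.items() if n / total < min_fraction)


# An autonomous-driving dataset where ambulances barely appear:
labels = ["car"] * 90 + ["truck"] * 9 + ["ambulance"] * 1
underrepresented_classes(labels)  # -> ["ambulance"]
```

Classes flagged this way are candidates for targeted collection or for the augmentation and synthetic-data techniques discussed earlier, so the model meets them in training before it meets them on the road.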
Lastly, he discussed the example of a customer who wanted to train a model to distinguish between a cloudy and a sunny outdoor environment. The iMerit team used the shadows cast by cars and other objects as a feature point for the ML training process, and the model’s performance improved drastically.