Training data for effective AI deployment

Nearly 60 to 70% of the model development time is spent on data-related processes.

As AI products and services are increasingly deployed in the real world, ML DataOps teams have had to iterate rapidly to meet the challenges of handling data for model training and continuous testing. In his talk, ‘Training data for effective AI deployment’, at the Data Engineering Summit 2022, Sudeep George, VP of Engineering at iMerit, shared practical insights on handling training data for complex AI systems operating at scale.

Sudeep defined data annotation as “a process of structuring and classifying data so that an ML model can distinguish between the human and the background, the people and vehicles, and understand the actions and intents by analysing its cause.” The process is important because it provides highly accurate ground truth data and helps machines to parse and understand the text.

The annotation journey

Nearly 60 to 70% of the model development time is spent on data-related processes like gathering, cleaning and structuring. The data must be maintained at a very high quality to optimise the model’s performance. To build training data, companies typically work with three types of vendors in turn. Initially, they opt for tool providers who can help them annotate complex data objects. Later, they reach a point where large volumes of data need to be classified, and they bring in workforce providers who offer trained annotators. Eventually, they reach another inflection point and start working with solution providers.

Model deployment has several stages. First, the model is evaluated using the test dataset. Later, it is tested against fresh production data and, post-deployment, continuously monitored to ensure performance. The model has to be retrained when the performance dips.
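
The post-deployment monitoring step described above can be illustrated with a minimal sketch. It assumes that ground-truth labels for production predictions eventually become available; the class name, window size and accuracy threshold are hypothetical choices for illustration and were not part of the talk.

```python
from collections import deque


class PerformanceMonitor:
    """Track rolling accuracy on production data and flag a possible retrain.

    Hypothetical helper: the window size and threshold are illustrative values.
    """

    def __init__(self, window_size=100, min_accuracy=0.9):
        # Keep only the most recent correctness flags.
        self.window = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy

    def record(self, prediction, ground_truth):
        """Store whether one production prediction matched its label."""
        self.window.append(prediction == ground_truth)

    def accuracy(self):
        """Return rolling accuracy, or None if nothing has been recorded."""
        if not self.window:
            return None
        return sum(self.window) / len(self.window)

    def needs_retraining(self):
        """Flag a dip only once the window is full, to avoid noisy early alarms."""
        if len(self.window) < self.window.maxlen:
            return False
        return self.accuracy() < self.min_accuracy
```

In practice the threshold would likely be tied to the accuracy measured on the test dataset before deployment, so that a "dip" is defined relative to a known baseline.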

Data-centric AI

The traditional approach to building AI systems has been to train a model to reach a baseline performance and tweak and tune it to improve performance. Unfortunately, these models usually fail after deployment. The data-centric approach brings the model to baseline performance and trains it against datasets that represent the environment within which the model has to operate.
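
One way to check whether a model holds up across the environments it has to operate in is to break evaluation results down by environment. The sketch below assumes each evaluation record carries an environment tag alongside the prediction and the true label; the function name and record layout are hypothetical.

```python
from collections import defaultdict


def accuracy_by_environment(records):
    """Compute accuracy separately for each environment tag.

    `records` is an iterable of (environment, prediction, truth) tuples.
    This layout is an assumption made for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for environment, prediction, truth in records:
        total[environment] += 1
        correct[environment] += int(prediction == truth)
    return {env: correct[env] / total[env] for env in total}


# Example: the model does well on sunny scenes but worse on cloudy ones.
results = accuracy_by_environment([
    ("sunny", "car", "car"),
    ("sunny", "person", "person"),
    ("cloudy", "car", "car"),
    ("cloudy", "car", "person"),
])
```

A large gap between environments suggests that the training data under-represents the weaker one, which is the kind of signal a data-centric workflow acts on.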

For models, it is important to have:

  • Data at scale
  • Trustworthy data

Augmentation is a popular way to scale data. Two key approaches are used: data domain manipulation and deep learning. The manipulation techniques are specific to the data modality. For instance, in computer vision, image filters or transformation tools (such as image geometry transformation or image colour space transformation) are used to transform images for training. Deep learning techniques like GANs, on the other hand, generate synthetic images based on the original data. Augmentation can help balance the dataset across classes and reduce bias.
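
The data domain manipulation approach can be sketched in a few lines. The example below is a simplified illustration, not the tooling discussed in the talk: it treats a grayscale image as a list of pixel rows with values from 0 to 255, and uses brightness scaling as a simple stand-in for colour space transformations. All function names are hypothetical.

```python
def flip_horizontal(image):
    """Geometry transformation: mirror each row left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Geometry transformation: rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def adjust_brightness(image, factor):
    """Intensity transformation: scale pixels and clamp to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def augment(image):
    """Return several transformed variants of one training image."""
    return [
        flip_horizontal(image),
        rotate_90_clockwise(image),
        adjust_brightness(image, 1.2),
    ]


# Example: one 2x2 image yields three additional training samples.
variants = augment([[10, 200], [30, 40]])
```

Each variant keeps the label of the original image, so applying this to an under-represented class is one way to increase its share of the dataset.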

Dataset management is important to ensure that collected data can be reused across various models. It also helps in understanding data distribution and databases.

Edge cases affect the performance of the model. Some unique classes have only limited data available, and when the model encounters them in real-life scenarios, it leads to errors and failures. An edge case has different implications depending on the stage of the ML life cycle. In the collection phase, it is about insufficient data. In the labelling phase, it is about incomplete attributes being classified. In training, it is about incomplete testing against scenarios. Lastly, in the deployment phase, it is about the actual performance of the ML model when working on real-world data. Sudeep discussed some of the common challenges companies face, including guidelines, traceable feedback loops, scenario replication, and end-to-end system validation.
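
In the collection phase, classes with too little data can be spotted by looking at the label distribution. The following is a minimal sketch under the assumption that each sample has a single class label; the function name and the 5% cut-off are hypothetical and would need tuning per project.

```python
from collections import Counter


def find_underrepresented_classes(labels, min_fraction=0.05):
    """Return classes whose share of the dataset falls below `min_fraction`.

    The result maps each flagged class to its sample count.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n for cls, n in counts.items() if n / total < min_fraction}


# Example: "scooter" makes up only 2% of the labels, so it is flagged.
labels = ["car"] * 90 + ["pedestrian"] * 8 + ["scooter"] * 2
rare = find_underrepresented_classes(labels)
```

Classes flagged this way are candidates for targeted data collection or augmentation before the model meets them in production.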

Lastly, he discussed an example of a customer who wanted to train a model to distinguish between cloudy and sunny outdoor environments. The iMerit team used the shadows cast by cars and other objects as a feature point for the ML training process, and the model’s performance improved drastically.

Avi Gopani

Avi Gopani is a technology journalist who seeks to analyse industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories that are curated with a focus on the evolving technologies of artificial intelligence and data analytics.