An algorithm (code) that learns common features from large volumes of data is the core component of an artificial intelligence (AI) system. From a development perspective, AI/ML has been model-centric for many years. However, despite its dominance over the past three decades, model-centric AI has recently been criticised for its limits in corporate and industrial settings, even though consumer platforms with hundreds of millions of users can easily rely on generalist solutions.
Data-centric AI is an emerging approach that places data at the forefront of AI development. Rather than focusing only on developing sophisticated models or endlessly refining such models, data-centric AI emphasises the importance of quality data in training the models effectively. This approach can be particularly useful in the field of unstructured document processing, where ensuring the quality of training data could be a huge challenge. Data-centric AI can significantly boost model performance in the document processing space without having to rely solely on complex algorithms.
In practice, this means following several essential steps when adopting a data-centric approach to document processing:
Ensure training data is representative
It’s interesting to note that significant advancements in AI development are often driven less by better algorithms, feature engineering, or model architecture than by the calibre of the training data, which lets AI models iterate quickly and transparently. It’s critical to ensure that the dataset used for training is representative of real-world data. For example, textual and tabular data can be challenging to interpret, so it’s important to pay close attention to how complete and representative the data is and to the quality of the training samples.
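One simple way to check representativeness is to compare the mix of document types in the training set with the mix seen in real-world traffic. The sketch below is illustrative only: the function names, the document-type labels, and the choice of total variation distance as the gap measure are assumptions, not something prescribed by the approach described here.

```python
from collections import Counter


def category_shares(labels):
    """Return each category's share of the given labels (shares sum to 1)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(train_labels, production_labels):
    """Half the summed absolute difference between the two share tables.

    0.0 means the two mixes are identical; 1.0 means they do not overlap.
    """
    train = category_shares(train_labels)
    production = category_shares(production_labels)
    categories = set(train) | set(production)
    return 0.5 * sum(
        abs(train.get(c, 0.0) - production.get(c, 0.0)) for c in categories
    )


# Hypothetical document-type labels, for illustration only.
train_docs = ["invoice"] * 8 + ["receipt"] * 2
production_docs = ["invoice"] * 5 + ["receipt"] * 3 + ["contract"] * 2

gap = total_variation_distance(train_docs, production_docs)
missing = set(production_docs) - set(train_docs)
print(f"distribution gap: {gap:.2f}, unseen in training: {sorted(missing)}")
```

A large gap, or any document type that appears in production but never in training, suggests the training set needs more samples of the under-represented types before the model is refined further.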
Involve domain expertise
Creating datasets with domain knowledge is crucial for a data-centric approach to AI development. Domain experts can provide the ground truth for the specific business use case and determine whether the dataset accurately represents the problem at hand. The analytical products are unlikely to have much, if any, value if it is unclear where the information comes from and how it will be used. In brief, domain expertise is used to assess the inputs, direct the process, and evaluate the final outputs for value and validity.
Acknowledge bias in the training data
It’s essential to recognise that datasets created by humans often contain bias because they reflect varied human interests and beliefs. To address this issue, programmatic labelling can be used to automate the labelling process and make it more efficient. Because the same labelling rules are applied consistently to every sample, programmatic labelling can help reduce bias in the training data and so improve the accuracy and fairness of AI systems.
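As a minimal sketch of what programmatic labelling can look like, the example below applies a few keyword rules to each document and takes a majority vote. Everything in it is hypothetical: the rule names, the keywords, the labels, and the majority-vote scheme are assumptions chosen for illustration, and in practice the rules would be written with domain experts.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion


# Hypothetical labelling rules, for illustration only.
def mentions_invoice_number(text):
    return "invoice" if "invoice no" in text.lower() else ABSTAIN


def mentions_amount_due(text):
    return "invoice" if "amount due" in text.lower() else ABSTAIN


def mentions_agreement(text):
    return "contract" if "agreement" in text.lower() else ABSTAIN


LABELLING_RULES = [mentions_invoice_number, mentions_amount_due, mentions_agreement]


def programmatic_label(text, rules=LABELLING_RULES):
    """Apply every rule and return the majority label, or ABSTAIN.

    Ties and documents that no rule matches are left unlabelled so that
    a person can review them.
    """
    votes = Counter(
        label for label in (rule(text) for rule in rules) if label is not ABSTAIN
    )
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting rules with equal support
    return ranked[0][0]


documents = [
    "Invoice No. 1042. Amount due: 300 EUR",
    "This agreement is made between the two parties",
    "Meeting notes from Tuesday",
]
labels = [programmatic_label(doc) for doc in documents]
print(labels)
```

Keeping the rules explicit makes them easy to audit and correct, and routing ties and unmatched documents to human review keeps the automated labels from hiding uncertain cases.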
Manage with available data
In industries that are less technology-driven, the availability of training data is a considerable limitation. Data-centric AI solutions designed to smooth out such data quality issues, through pre-processing and post-processing, are essential. With such technology, a document processing solution can intelligently process documents despite small sample sizes and data quality issues. Without a data-centric AI solution, the success rate (accuracy) is lower, which calls the business case itself into question. Adopting a data-centric approach throughout can offer significant benefits to these industries, such as improving operations, reducing costs, and enabling businesses to operate without extensive technological expertise.
Improve the speed of deployment
Because data-centric solutions focus on good data, they need comparatively smaller training datasets and less time to train, so new solutions can be deployed at least 2x to 4x faster. This is a considerable advantage because customers spend less on resources and can start realising the benefits sooner.
Overall, data-centric AI can provide a more efficient and effective approach to AI development, particularly in core industries. By focusing on quality data, businesses can improve their operations and reduce costs, deploying solutions faster and with less dependence on large volumes of training data.
To summarise, data-centric AI can help document processing solutions leapfrog to the next frontier by:
- Improving their accuracy to above 80% even with one-tenth of the training data, and cutting cycle time (2x to 4x faster) compared with traditional solutions.
- Enhancing their ability to deal with template-free input documents.
- Intelligently dealing with semi- and unstructured text, including images.
- Making them context-aware and giving them composite processing capabilities.
“Data-centric AI is the discipline of systematically engineering the data used to build an AI system.” – Andrew Ng, leading AI expert.
This article is written by a member of the AIM Leaders Council. AIM Leaders Council is an invitation-only forum of senior executives in the Data Science and Analytics industry.