Good training data is a prerequisite for well-performing machine learning models, which calls for a systematic analysis of data quality before building an AI model. Analytics India Magazine got in touch with Sameep Mehta – IBM Distinguished Engineer and Lead – Data and AI Platforms, IBM Research India. Sameep completed his master's and PhD in Computer Science at The Ohio State University and has been working at IBM for more than 15 years.
“AI for enterprises is about enabling organisations to modernise at ease, better predict outcomes, automate at scale and secure their organisations. Organisations are looking to infuse AI in the form of NLP, automation with trust and security as a key foundation across various processes and business functions. We are witnessing the demand for AI across the board, from SMBs to large enterprises and across all sectors,” said Sameep.
AIM: Can you talk about how IBM Research India is devising innovative approaches to ensure quality data for having ready-to-deploy ML models?
Sameep: As we know, the AI team spends the majority of its time on data collection, data cleaning, and data preparation. There are many data preparation tools for Business Intelligence (BI) and Management Information Systems (MIS) that provide traditional data cleaning methods like missing value imputation, data normalisation, etc. Even though these methods are helpful, they do not meet the requirements of AI teams. Moreover, data quality for AI is fundamentally different from that required for BI; e.g., how to detect and correct wrong labels in the data is a pressing problem.
Wrong labels will result in poor-quality models. At IBM Research, we are building novel algorithms and toolkits for data assessment and remediation so that the downstream AI model can be more accurate, fair, and robust. While the data quality toolkit is available as a commercial offering, we have recently released a part of the toolkit as APIs on the IBM Developer Hub. Since most data assessment algorithms are complex and require non-trivial effort to build, these APIs will enable developers to experiment with the data quality metrics and include them in their AI pipelines without spending too much development effort. Apart from algorithmic innovation, we also advocate for the importance of data management in the AI lifecycle. Last year, we developed a course on Data Lifecycle Management for Computer Science engineering students of the Indraprastha Institute of Information Technology, Delhi (IIIT-Delhi) to cover topics such as handling data to build better ML pipelines.
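The wrong-label problem Sameep describes can be illustrated with a simple neighbourhood-consistency heuristic. The sketch below is purely illustrative and is not IBM's toolkit algorithm: it flags a point as a suspect label error when its label disagrees with the majority label of its nearest neighbours.

```python
# Illustrative sketch of label-noise detection (not IBM's actual algorithm):
# flag a point whose label disagrees with the majority of its k nearest neighbours.
from collections import Counter

def flag_suspect_labels(points, labels, k=3):
    """Return indices whose label disagrees with their neighbourhood majority."""
    suspects = []
    for i, p in enumerate(points):
        # squared Euclidean distance to every other point
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points) if j != i
        )
        neighbour_labels = [labels[j] for _, j in dists[:k]]
        majority, _ = Counter(neighbour_labels).most_common(1)[0]
        if majority != labels[i]:
            suspects.append(i)
    return suspects

# Toy data: two tight clusters; index 4 carries the wrong label.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["a", "a", "a", "b", "a", "b", "b"]  # (5, 6) mislabelled as "a"
print(flag_suspect_labels(points, labels))  # → [4]
```

Real systems use far more robust techniques (e.g., cross-validated model disagreement), but the principle is the same: use the data's own structure to surface labels worth a second look.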
AIM: What kind of initiatives have IBM Research India taken to infuse trust, transparency, and fairness in AI platforms and algorithms?
Sameep: Trust is essential to AI adoption. It allows organisations to understand and explain recommendations and outcomes and manage AI-led decisions in their business while maintaining full ownership and protection of data and insights. A recent Morning Consult study conducted for IBM on AI adoption revealed that nearly 95% of IT professionals in India believe it is critical or very important to their business to trust that the AI's output is fair, safe and reliable.
IBM Research India is actively working to build toolkits that infuse trust in AI algorithms. We contributed to the development of the open-source toolkit AI Fairness 360, which allows developers to detect and mitigate bias in AI models. AI Explainability 360 provides technology that enables developers and other personas, such as risk officers and end customers, to open up an AI model and understand the reasoning behind its decisions. These open-source toolkits help our developer community build fair and explainable AI models. At the same time, the team works very closely with IBM Products and Services to include these capabilities in the IBM portfolio for our enterprise customers. We have also introduced AI FactSheets, which provide information about a model's important characteristics, much like nutrition labels for food. We are also part of various external think-tanks and working groups to shape this critical topic at a broader level.
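One of the simplest fairness metrics implemented in toolkits like AI Fairness 360 is disparate impact: the ratio of favourable-outcome rates between an unprivileged and a privileged group. The sketch below computes the metric directly in plain Python rather than through the AIF360 API, so the group names and data are hypothetical:

```python
# Disparate impact: ratio of favourable-outcome rates between an unprivileged
# and a privileged group. A value near 1.0 suggests parity; the common
# "80% rule" flags values below 0.8 as potentially discriminatory.
def disparate_impact(outcomes, groups, unprivileged, privileged):
    """outcomes: 1 = favourable decision; groups: group label per person."""
    def favourable_rate(g):
        members = [o for o, grp in zip(outcomes, groups) if grp == g]
        return sum(members) / len(members)
    return favourable_rate(unprivileged) / favourable_rate(privileged)

# Hypothetical loan decisions for two groups, "f" (unprivileged) and "m".
outcomes = [1, 0, 1, 0, 1, 1, 1, 0]
groups   = ["f", "f", "f", "f", "m", "m", "m", "m"]
print(disparate_impact(outcomes, groups, "f", "m"))  # 0.5 / 0.75 ≈ 0.667
```

A result of roughly 0.667 falls below the 0.8 threshold, which is the kind of signal such a toolkit would surface for remediation before deployment.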
AIM: What challenges do you generally face in the AI testing cycle? How do you overcome those challenges?
Sameep: Typically, a good AI model is evaluated by metrics like accuracy and scale. However, just testing the model performance on these metrics may not be enough. Before deploying AI models in production, they must be thoroughly tested for other metrics like fairness, robustness, generalisability, privacy, etc., to identify any shortcomings. While the importance of testing is well accepted and practised for traditional software, it is still in a very nascent stage with respect to AI model testing.
One of the core problems in testing is the non-availability of test cases. For example, the standard hold-out set may not be rich enough to test for the desired metrics. To tackle this challenge, we build algorithms that generate millions of realistic, metric-driven test cases to validate the model across a wide range of properties. These test cases are then prioritised and executed to find failure points. We are also investing heavily in techniques that help developers improve the model by providing focused recommendations through deeper analysis of the failed test cases.
AIM: With more than 15 years of experience, where do you think most data companies falter?
Sameep: I wouldn’t say they are faltering, but data companies today are pursuing innovation with renewed focus, looking for new ways to increase investment and the associated returns. There are two dimensions to this:
- First, data companies invest in state-of-the-art tools to collect, curate, and manage the data. The regulatory framework around data privacy and security is constantly changing, and data companies are always looking to keep up with the changing demands. Otherwise, data breaches and security leaks will be a big dampener for these companies.
- The second part is around value unlocking and monetisation of the data. Organisations are looking to be more flexible and open (modulo privacy) with the data. They are looking at ways to work closely with their AI team, academia, developer community, etc., to discover potential uses of enterprise data and then demonstrate how this data can solve pressing business or societal problems.
So, in essence, more investment in tools and platforms to prepare high-quality data, and collaboration with partners who can use the data securely, are the two biggest factors that will help data companies succeed.
AIM: How do you see the landscape of Enterprise AI evolving in India?
Sameep: AI has moved from consideration to wider mainstream adoption due to the pandemic. A recent Morning Consult study conducted for IBM on AI adoption revealed that 53% of Indian IT professionals stated that their company had accelerated its rollout of AI due to the COVID-19 pandemic. We are witnessing the demand for AI across the board, from SMBs to large enterprises and across all sectors. Different patterns are emerging.
- The first or the most common is to improve the existing process by infusing AI. For instance, automating a manual process, generating sales leads by connecting diverse data sets, developing chat assistants and voice assistants for customer engagement, etc.
- The second trend, though confined to relatively small pockets, is more impactful: using AI to unlock the value of existing enterprise data and establish new lines of business.
- The third trend is customers wanting to modernise their traditional data and applications using AI. They are using AI tools to discover and analyse traditional data systems and recommend modernised configurations.
With so much excitement in the AI ecosystem from students, academia, developers, industry & government, these are exciting times for AI in India.