
The ‘Unsolved’ Problems in Machine Learning

Uncertainty, probability, ever-larger datasets, and the lack of causality are only a few of the many challenges in machine learning.

While artificial intelligence and machine learning are solving a lot of real-world problems, many of the “unsolved” problems in these fields remain out of reach because of fundamental limitations that are yet to be conclusively resolved. Developers dive deep into various domains of machine learning and come up with small incremental improvements, but challenges to further advancement persist.

A recent discussion on Reddit brought together several developers from across the AI/ML landscape to talk about some of these “important” and “unsolved” problems which, when solved, are likely to pave the way for significant improvements in these fields.

Uncertainty prediction 

Arguably, the most important aspect of creating a machine learning model is gathering information from reliable and abundant sources. Newcomers to machine learning who formerly worked as computer scientists or software engineers face the difficulty of working with imperfect or incomplete information, which is inevitable in the field.

“Given that many computer scientists and software engineers work in a relatively clean and certain environment, it can be surprising that machine learning makes heavy use of probability theory,” write Ian Goodfellow, Yoshua Bengio, and Aaron Courville in Deep Learning, part of the ‘Adaptive Computation and Machine Learning’ book series.

Three major sources of uncertainty in machine learning are:

  • Presence of noise in data: Observations in machine learning, referred to as “samples” or “instances”, often contain variability and randomness that ultimately impact the output.
  • Incomplete coverage of the domain: Models are trained on observations that are incomplete by default, since they consist of only a sample of the larger, unattainable dataset.
  • Imperfect models: “All models are wrong but some are useful,” said George Box. There is always some error in every model.
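
One practical response is to predict an interval instead of a single number, making the model’s uncertainty explicit rather than hidden behind a point estimate. Below is a minimal sketch using quantile regression in scikit-learn; the synthetic dataset and every parameter choice are illustrative assumptions, not something prescribed in the article.

    # Minimal sketch: quantile regression as an uncertainty estimate.
    # Assumes scikit-learn; data and parameters are illustrative only.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(500, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)  # noisy observations

    # One model per quantile: the 5th and 95th percentiles bracket the noise.
    models = {q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
              for q in (0.05, 0.5, 0.95)}

    X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
    lo, mid, hi = (models[q].predict(X_test) for q in (0.05, 0.5, 0.95))
    for x, l, m, h in zip(X_test[:, 0], lo, mid, hi):
        print(f"x={x:+.2f}  median={m:+.2f}  90% interval=[{l:+.2f}, {h:+.2f}]")

The interval widens wherever the data are noisier, which is exactly the uncertainty a single point prediction would conceal.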

Check out a research paper by Francesca Tavazza on uncertainty prediction for machine learning models here.

Convergence time and low-resource learning systems

Optimising the process of training a model and then running inference requires a large amount of resources. Reducing the convergence time of neural networks and keeping systems low-resource are goals that work against each other: developers might be able to build tech that is groundbreaking in its applications but requires huge amounts of hardware, storage, and power.

For example, language models require vast amounts of data. The ultimate goal of reaching human-level interaction requires training at a massive scale, which means a longer convergence time and higher resource requirements for training.

A key factor in the development of machine learning algorithms is scaling the amount of input data, which, arguably, increases a model’s accuracy. But the recent success of deep learning models shows that achieving this depends on ever-stronger processors and more resources, resulting in a continuous juggling of the two problems.
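
One common lever on the convergence side of this trade-off is a learning-rate schedule. The sketch below uses PyTorch’s one-cycle policy, which ramps the learning rate up and back down to reach a given loss in fewer steps; the tiny model and random data are toy placeholders chosen for illustration, not a recipe from the article.

    # Minimal sketch: one-cycle learning-rate scheduling in PyTorch.
    # The tiny model and random data are placeholders for illustration.
    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    steps = 100
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=0.1, total_steps=steps)

    X, y = torch.randn(512, 10), torch.randn(512, 1)
    loss_fn = nn.MSELoss()

    for step in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
        scheduler.step()  # raise, then anneal, the learning rate each step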

Click here to learn how to converge neural networks faster.

Overfitting

Recent text-to-image generators like DALL-E and Midjourney showcase what overfitting to training data can look like.

https://twitter.com/hausman_k/status/1511732395011194885?lang=en

Overfitting, often a result of noise in the data, occurs when a learning model picks up random fluctuations in the training data and treats them as concepts, producing errors and impairing the model’s ability to generalise.

To counter this problem, most non-parametric and non-linear models include techniques and guiding parameters that limit how much detail the model can learn. Even then, in practice, fitting a model to a dataset cleanly is a difficult task. Two suggested techniques to limit overfitting are listed below, followed by a short sketch of both:

  • Using resampling techniques to gauge model accuracy: ‘K-fold cross-validation’ is the most popular resampling technique, allowing developers to train and test a model several times on different subsets of the training data.
  • Holding back a validation dataset: After tuning the machine learning algorithm on the initial dataset, developers evaluate it on a held-back validation dataset to check how the model would perform on previously unseen data.
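
Here is a minimal sketch of both techniques with scikit-learn; the toy classification dataset and the choice of a random forest are illustrative assumptions of ours, not part of the original discussion.

    # Minimal sketch: k-fold cross-validation plus a held-back validation set.
    # Dataset and model choice are illustrative assumptions.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Hold back a validation set before any tuning takes place.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = RandomForestClassifier(random_state=0)

    # 5-fold cross-validation: train and test on different subsets.
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"cross-val accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

    # The held-back set is touched only once, as an honest final check.
    model.fit(X_train, y_train)
    print(f"held-out accuracy:  {model.score(X_val, y_val):.3f}")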

Estimating causality instead of correlations

Causal inference comes naturally to humans. Machine learning algorithms like deep neural networks are great at analysing patterns in huge datasets but struggle to make causal inferences. This shows up in fields like computer vision, robotics, and self-driving cars, where models, though capable of recognising patterns, do not comprehend the physical properties of objects in their environment; they make predictions from remembered patterns rather than actively reasoning about novel situations.
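
The gap is easy to demonstrate. In the hypothetical sketch below, a hidden confounder drives both the feature and the label, so a standard regression learns a strong “effect” of one on the other even though intervening on the feature would change nothing; all variable names are invented for illustration.

    # Minimal sketch: correlation without causation via a hidden confounder.
    # All variables are invented for illustration (the classic ice cream vs
    # drownings example, both driven by season).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    season = rng.normal(size=10_000)                          # hidden confounder
    ice_cream = season + rng.normal(scale=0.5, size=10_000)   # feature
    drownings = season + rng.normal(scale=0.5, size=10_000)   # label

    model = LinearRegression().fit(ice_cream.reshape(-1, 1), drownings)
    print(f"learned coefficient: {model.coef_[0]:.2f}")  # strongly non-zero

    # A causal model would distinguish observing ice_cream from setting it:
    # forcing ice cream sales up would leave drownings unchanged, but the
    # purely correlational fit above cannot express that difference.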

Researchers from the Max Planck Institute for Intelligent Systems, along with Google Research, published a paper, ‘Towards Causal Representation Learning’, which discusses the challenges machine learning algorithms face due to the lack of causal representation. According to the researchers, developers try to counter the absence of causality by training models on ever-larger datasets, failing to recognise that this only produces models that recognise patterns rather than “think” independently.

The introduction of “inductive bias” into models is believed to be a step towards building causality into machines. But that, arguably, can be counterproductive in building AI that is free of bias.

Reproducibility

Because AI/ML is seen as the most promising tool in almost every field, many newcomers dive straight into it without fully grasping the intricacies of the subject. While poor reproducibility, or replication, is a combined outcome of the problems mentioned above, it still poses a great challenge for newly developed models.

Due to a lack of resources and a reluctance to conduct extensive trials, many algorithms fail when tested and implemented by other expert researchers. Big companies offering hi-tech solutions do not always publicly release their code, leaving new researchers to experiment on their own and propose solutions to large problems without rigorous testing, which undermines reliability.
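
Reproducibility spans data, code, and environment, but even the most basic step, pinning every source of randomness before training, is frequently skipped. Below is a minimal PyTorch-flavoured sketch; the helper name is our own invention, though the framework calls themselves are standard.

    # Minimal sketch: pin every common source of randomness (the helper
    # name set_seed is our own; the framework calls are standard).
    import random
    import numpy as np
    import torch

    def set_seed(seed: int = 42) -> None:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade a little speed for deterministic cuDNN kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

    set_seed(42)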

Click here to find out about how lack of reproducibility in machine learning models is making the healthcare industry risky.
