Here’s What You Need To Know About Confidence Intervals

Testing a proposal or the workability of a design with hypothesis testing is standard practice in the corporate world. Whether the task is changing the user interface of a mobile application or checking a model used to diagnose patients for psychotherapy, the cheapest and most accessible non-human decision maker is a coin flip. But with all the confounding variables associated with real-world problems, relying on a coin flip can make things go for a toss, literally.

Clients and stakeholders may or may not understand the intricacies involved in the model. They don’t care about the type of activation function used or the optimisation technique followed. It always comes down to one question: How does the model work in the worst case scenario? This is where the Confidence Interval (CI) estimate comes into the picture.

A CI is built from two components: a range and a probability. The range gives the lower and upper limits on the skill that can be expected of the model. The probability (the confidence level) states how often intervals constructed this way would contain the model’s true skill.

Source: Construction Of Confidence Interval

The half-width of a CI is often referred to as the margin of error, and a CI may be used to depict the uncertainty of an estimate on graphs through the use of error bars.

For Classification Accuracy In Machine Learning

A machine learning algorithm is well understood by the data scientists and engineers who develop it, but when the product needs to be pitched, the only parameter that counts is its performance. So, a metric to gauge the performance of a model is necessary.

Classification accuracy is used to assess the efficacy of a classification algorithm. However, reporting the classification accuracy of the model alone is not best practice, because a single number says nothing about how uncertain the estimate is.

Classification Accuracy = correct predictions/ total predictions

It is common to use classification accuracy or classification error (the complement of accuracy, i.e. 1 - accuracy) to describe the skill of a classification predictive model. For example, a model that makes correct predictions of the class outcome variable 75% of the time has a classification accuracy of 75%, calculated as:

accuracy = total correct predictions / total predictions made * 100
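As a minimal sketch of the calculation above, the following Python function counts the predictions that match the true labels. The function name `classification_accuracy` and the toy labels are illustrative assumptions, not part of any library.

```python
def classification_accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one prediction is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy example: 3 of the 4 predictions are correct
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
accuracy = classification_accuracy(y_true, y_pred)
error = 1 - accuracy  # classification error is the complement of accuracy
print(f"accuracy = {accuracy * 100:.0f}%, error = {error * 100:.0f}%")
# prints: accuracy = 75%, error = 25%
```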

Classification accuracy or classification error is a proportion or a ratio. It describes the proportion of correct or incorrect predictions made by the model. Each prediction is a binary decision that could be correct or incorrect. Technically, this is called a Bernoulli trial, named for Jacob Bernoulli. The number of correct predictions across n independent Bernoulli trials follows a specific distribution called a binomial distribution.

Source: The Psychologist

When the sample is large enough, the binomial distribution of the proportion (i.e. the classification accuracy or error) is well approximated by a Gaussian distribution, and we can use this assumption to easily calculate the confidence interval.
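One way to check how reasonable the Gaussian assumption is for a given sample size is to simulate many test sets and compare the spread of the observed accuracies with sqrt(p * (1 - p) / n). The sketch below does this with the standard library only. The true accuracy of 0.75, the test set size of 200 and the number of repetitions are arbitrary values chosen for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

true_accuracy = 0.75   # assumed "real" skill of a hypothetical model
n = 200                # predictions per simulated test set
repeats = 5000         # number of simulated test sets


def simulated_accuracy(p, n):
    """Accuracy observed on one test set of n Bernoulli trials."""
    correct = sum(1 for _ in range(n) if random.random() < p)
    return correct / n


observed = [simulated_accuracy(true_accuracy, n) for _ in range(repeats)]

empirical_sd = statistics.pstdev(observed)
theoretical_sd = math.sqrt(true_accuracy * (1 - true_accuracy) / n)

print(f"mean of observed accuracies:  {statistics.mean(observed):.3f}")
print(f"empirical standard deviation: {empirical_sd:.4f}")
print(f"sqrt(p * (1 - p) / n):        {theoretical_sd:.4f}")
```

If the two standard deviations are close, the Gaussian approximation is likely adequate at that sample size.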

In the case of classification error, the radius of the interval can be calculated as:

interval = z * sqrt( (error * (1 - error)) / n)

In the case of classification accuracy, the radius of the interval can be calculated as:

interval = z * sqrt( (accuracy * (1 - accuracy)) / n)

Where interval is the radius of the confidence interval, error and accuracy are classification error and classification accuracy respectively, n is the size of the sample, sqrt is the square root function, and z is a critical value from the Gaussian distribution. Technically, this is called the Binomial proportion confidence interval.
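The formula above can be sketched in Python as follows. The function name `proportion_confidence_interval` is a hypothetical helper, and the critical value z is taken from the standard library's `statistics.NormalDist` rather than from a lookup table. Clipping the bounds to the range 0 to 1 is an added convenience, not part of the formula.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(proportion, n, confidence=0.95):
    """Normal-approximation interval for an accuracy or error rate."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    radius = z * math.sqrt(proportion * (1 - proportion) / n)
    # A proportion cannot leave [0, 1], so clip the bounds
    lower = max(0.0, proportion - radius)
    upper = min(1.0, proportion + radius)
    return lower, upper, radius


# A model with 75% accuracy measured on 100 validation samples
lower, upper, radius = proportion_confidence_interval(0.75, n=100)
print(f"accuracy 75% +/- {radius * 100:.1f}% -> [{lower:.3f}, {upper:.3f}]")
# prints: accuracy 75% +/- 8.5% -> [0.665, 0.835]
```

Because the interval relies on the Gaussian approximation, it becomes unreliable for small samples or for proportions very close to 0 or 1.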

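A code snippet that estimates a predictive mean and a standard deviation for a regression model (base_model and error_model are assumed to be scikit-learn style regressors defined elsewhere):

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Variant 1: one constant standard deviation for every prediction
# split the data into train and validation sets
X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.5)
base_model.fit(X1, y1)
base_prediction = base_model.predict(X2)
# the RMSE on the validation half serves as the standard deviation
error = mean_squared_error(y2, base_prediction) ** 0.5
mean = base_model.predict(X_test)
st_dev = error

# Variant 2: a second model learns how the error depends on the input
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5)
base_model.fit(X1, y1)
base_prediction = base_model.predict(X2)
validation_error = (base_prediction - y2) ** 2
error_model.fit(X2, validation_error)
mean = base_model.predict(X_test)
# the error model predicts squared error, so take the square root
st_dev = error_model.predict(X_test).clip(min=0) ** 0.5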
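Given a predicted mean and a standard deviation such as those in the snippet above, an interval around each prediction can be sketched as mean plus or minus z times st_dev. This assumes the prediction errors are roughly Gaussian, which the snippet above does not verify. The function name `interval_from_mean_and_sd` and the numbers below are made up for illustration.

```python
from statistics import NormalDist


def interval_from_mean_and_sd(mean, st_dev, confidence=0.95):
    """Return (lower, upper) as mean -/+ z * st_dev."""
    if st_dev < 0:
        raise ValueError("st_dev must be non-negative")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return mean - z * st_dev, mean + z * st_dev


# Hypothetical predictions and a single validation RMSE used as st_dev
predictions = [10.2, 11.7, 9.4]
rmse = 1.5
for value in predictions:
    lower, upper = interval_from_mean_and_sd(value, rmse)
    print(f"prediction {value:.1f}: [{lower:.2f}, {upper:.2f}]")
```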

Check the idea behind this method here

Common Misconceptions About Confidence Intervals

A 95% confidence interval does not mean that for a given realised interval there is a 95% probability that the population parameter lies within the interval. The 95% probability relates to the reliability of the estimation procedure, not to a specific calculated interval.

A confidence interval is not a definitive range of plausible values for the sample parameter, though it may be understood as an estimate of plausible values for the population parameter.

A particular confidence interval of 95% calculated from an experiment does not mean that there is a 95% probability of a sample parameter from a repeat of the experiment falling within this interval. So, it is essential to remember that:

  • 95% confidence is confidence that in the long-run 95% of the CIs will include the population mean. It is a confidence in the algorithm and not a statement about a single CI.
  • In frequentist terms, the CI either contains the population mean or it does not.
  • There is no relationship between a sample’s variance and its mean. Therefore we cannot infer that a single narrow CI is more accurate. In this context “accuracy” refers to the long-run coverage of the population mean. Look at the visualisation above and note how much the widths of the CIs vary. A CI can be narrow and still lie far away from the true mean.
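The long-run reading of the points above can be illustrated with a small simulation: repeat an experiment many times, build a 95% CI for the mean each time, and count how often the interval contains the true mean. The population parameters, sample size and number of experiments below are arbitrary assumptions. The sketch uses the Gaussian critical value rather than a t value, so with 30 observations the coverage is expected to land a little below 95%.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

true_mean, true_sd = 50.0, 10.0  # assumed population, known only to the simulation
n = 30                           # observations per experiment
experiments = 2000
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96

covered = 0
widths = []
for _ in range(experiments):
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    sample_mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    lower = sample_mean - z * standard_error
    upper = sample_mean + z * standard_error
    widths.append(upper - lower)
    # Each individual interval either contains the true mean or it does not
    if lower <= true_mean <= upper:
        covered += 1

coverage = covered / experiments
print(f"share of intervals containing the true mean: {coverage:.3f}")
print(f"narrowest width: {min(widths):.2f}, widest width: {max(widths):.2f}")
```

The spread between the narrowest and widest intervals shows that width alone says little about whether a particular interval has captured the true mean.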


A confidence interval is different from a tolerance interval, which describes the bounds of data sampled from the distribution. A CI instead provides bounds on a population parameter, such as a mean or standard deviation, and is used to deal with the uncertainty inherent in results derived from data that are themselves only a randomly selected subset of a population.

It is said that preferring hypothesis testing to confidence intervals and estimation will lead to fewer statistical misinterpretations. Confidence intervals can be unintuitive and are sometimes as misunderstood as p-values and null hypothesis significance testing. Moreover, CIs are often used to perform hypothesis tests and are therefore prone to the same misuses as p-values.

Real-world data is noisy, inconsistent and non-linear. So, a single “significant” CI can be very useful for drawing conclusions that would otherwise be cumbersome to reach.


Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
