# A Comprehensive Guide to Maximum Likelihood Estimation and Bayesian Estimation

An estimation function is a function that helps in estimating the parameters of a statistical model from data that contains random values. Estimation is the process of extracting parameters from observations that are randomly distributed. In this article, we give an overview of two estimation functions: Maximum Likelihood Estimation and Bayesian Estimation. Before looking at these two, we will try to understand the probability distribution, on which both estimation functions depend. The major points to be discussed in this article are listed below.

1. Probability Distribution
2. Maximum Likelihood Estimation (MLE)
3. Bayesian Estimation
4. Maximum Likelihood Estimation (MLE) vs Bayesian Estimation

Probability Distribution

In statistics, a probability distribution is a function that gives the probability of occurrence of each possible outcome of a process or experiment. In other words, it is a numerical description of how likely each event is. As an example, the sample space of a coin flip contains only two outcomes: heads or tails.

More formally, probability distributions can be divided into two forms:

• Discrete Probability Distribution – for discrete random variables, the distribution is defined by a probability mass function p that assigns a probability to each possible outcome. For example, when throwing a fair die there are 6 possible outcomes and each has probability ⅙. The probability of any event is the sum of the probabilities of the outcomes that satisfy the event. In our example, the probability of the event “the die rolls an odd value” is

p(1) + p(3) + p(5) = ⅙ + ⅙ + ⅙ = ½

As another example, let the probability mass function p(s) specify the probability distribution of the sum S of the numbers shown by two fair dice. Here p(12) = 1/36, since only the outcome (6, 6) gives a sum of 12. The probability mass function allows the calculation of probabilities of events such as P(S > 10) = p(11) + p(12) = 1/18 + 1/36 = 1/12.
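The die calculations above can be checked with a few lines of Python. This is a minimal sketch that assumes fair, independent dice; the names `die_pmf` and `sum_pmf` are illustrative only.

```python
from fractions import Fraction
from itertools import product

# PMF of a single fair die: every face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# Probability of an event = sum of the probabilities of its outcomes.
p_odd = sum(p for face, p in die_pmf.items() if face % 2 == 1)

# PMF of the sum S of two independent fair dice (36 equally likely pairs).
sum_pmf = {}
for a, b in product(range(1, 7), repeat=2):
    sum_pmf[a + b] = sum_pmf.get(a + b, Fraction(0)) + Fraction(1, 36)

p_sum_gt_10 = sum(p for s, p in sum_pmf.items() if s > 10)

print(p_odd)        # 1/2
print(sum_pmf[12])  # 1/36
print(p_sum_gt_10)  # 1/12
```

Exact fractions are used instead of floats so that the results match the hand calculation digit for digit.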

• Continuous Probability Distribution – for continuous random variables, any individual outcome has probability zero; only events that include infinitely many outcomes, such as intervals, can have a positive probability. For example, the measured weight of a person can take any value in a continuous range. The probability that the weight equals any single exact value is zero, whereas the event that the weight lies between 65 kg and 70 kg can have a positive probability, for example 98%. In such cases, probabilities are calculated by integrating the probability density function over the interval or by means of the cumulative distribution function.

[Figure: probability density function (left) and cumulative distribution function (right) of a continuous random variable.]
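The two routes to an interval probability, integrating the density or differencing the cumulative distribution function, can be compared numerically. The sketch below assumes, purely for illustration, that weight follows a normal distribution with mean 67.5 kg and standard deviation 1.1 kg; these parameters and the helper `integrate` are not from any real data set.

```python
from statistics import NormalDist

# Hypothetical model: weight in kg ~ Normal(mean=67.5, sd=1.1).
weight = NormalDist(mu=67.5, sigma=1.1)

def integrate(f, a, b, steps=10_000):
    """Trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

# Route 1: integrate the probability density function over the interval.
p_by_pdf = integrate(weight.pdf, 65.0, 70.0)

# Route 2: difference of the cumulative distribution function.
p_by_cdf = weight.cdf(70.0) - weight.cdf(65.0)

# A single point is an interval of zero width, so its probability is zero.
p_point = weight.cdf(67.5) - weight.cdf(67.5)

print(round(p_by_pdf, 4), round(p_by_cdf, 4), p_point)
```

Both routes should agree up to the numerical integration error, while the probability of any single exact weight is zero.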

Maximum Likelihood Estimation

As the name suggests, maximum likelihood estimation is a method for estimating the parameters of an assumed probability distribution. The likelihood function measures the goodness of fit of a statistical model to the data for given values of the parameters. The parameters are estimated by maximizing the likelihood function, so that the observed data is most probable under the model. The likelihood function for a discrete random variable can be given by

$$L(\theta \mid x) = p_\theta(x) = P_\theta(X = x)$$

where x is the observed outcome of the random variable X and the likelihood is a function of θ. In other words, the likelihood equals the probability of observing the outcome x when the parameter of the model is θ.
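To make the discrete likelihood concrete, the following sketch evaluates it for a coin-flip model. It assumes a binomial model and a hypothetical observation of 7 heads in 10 flips; neither comes from the article.

```python
from math import comb

def likelihood(theta, heads, flips):
    """L(theta | x) = P(X = heads) under a Binomial(flips, theta) model."""
    return comb(flips, heads) * theta**heads * (1 - theta) ** (flips - heads)

heads, flips = 7, 10  # hypothetical observation: 7 heads in 10 flips

# The data is fixed; the likelihood varies with the parameter theta.
for theta in (0.3, 0.5, 0.7, 0.9):
    print(f"theta={theta:.1f}  L={likelihood(theta, heads, flips):.4f}")
```

Among the values tried, θ = 0.7 gives the largest likelihood, which matches the intuition that a coin producing 7 heads in 10 flips is best explained by a heads probability near 0.7.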

The likelihood function for a continuous random variable can be given by

$$L(\theta \mid x) = f_\theta(x)$$

where f_θ is the probability density function of X when the parameter of the model is θ. The likelihood function can also be used in hypothesis testing, to find the probability of various outcomes using the set of parameters defined in the null hypothesis.

The main goal of maximum likelihood estimation is to make inferences about the population that is most likely to have generated the sample, by evaluating the joint density at the observed data set. Writing L_n(θ) = L(θ | x_1, …, x_n) for the likelihood of n observations, the estimate is obtained by maximizing it over the parameter space Θ:

$$\hat{\theta} = \underset{\theta \in \Theta}{\arg\max}\; L_n(\theta)$$

The aim of the estimation is to select the parameter value that best fits the model, that is, the one that makes the observed data most probable. The specific value that maximizes the likelihood function L_n is called the maximum likelihood estimate.
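As a rough illustration of the maximization step, the sketch below fits a normal model to a small made-up sample. It assumes the observations are independent and normally distributed; the data values and the grid bounds are hypothetical. For the normal model the maximizers are known in closed form (the sample mean and the standard deviation that divides by n), and a crude grid search over the log-likelihood is used as a sanity check.

```python
from math import log, pi
from statistics import fmean, pstdev

# Hypothetical observations, assumed i.i.d. from a normal distribution.
data = [66.2, 68.1, 67.4, 69.0, 66.8, 67.9, 68.5, 67.1]

def log_likelihood(mu, sigma, xs):
    """Log of the joint normal density of xs; same maximizer as the likelihood."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * log(2 * pi * sigma**2) - squared_error / (2 * sigma**2)

# Closed-form maximizers for the normal model.
mu_hat = fmean(data)
sigma_hat = pstdev(data)  # divides by n, not n - 1

# Crude grid search over (mu, sigma) as a check on the closed form.
grid = [(60 + 0.05 * i, 0.2 + 0.05 * j) for i in range(201) for j in range(60)]
mu_grid, sigma_grid = max(grid, key=lambda p: log_likelihood(p[0], p[1], data))

print(f"closed form: mu={mu_hat:.3f}, sigma={sigma_hat:.3f}")
print(f"grid search: mu={mu_grid:.2f}, sigma={sigma_grid:.2f}")
```

The log-likelihood is used instead of the likelihood itself because the logarithm is monotonic, so both have the same maximizer, and sums are numerically safer than products of small densities.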

Bayesian Estimation

In statistics, Bayesian estimation is a method of estimating parameters by minimizing the posterior expected value of a loss function. Here the posterior probability is a conditional probability that is assigned after the relevant evidence is taken into account; “posterior” means after taking into account the evidence relevant to the case being examined. Equivalently, Bayesian estimation maximizes the posterior expectation of a utility function.
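The idea of minimizing a posterior expected loss can be sketched with a toy example. The posterior below is hypothetical (three candidate parameter values with made-up probabilities), and squared error is assumed as the loss; under that loss, the minimizer is expected to coincide with the posterior mean.

```python
# Hypothetical posterior over three candidate parameter values.
posterior = {0.2: 0.1, 0.5: 0.6, 0.8: 0.3}

def expected_squared_loss(estimate):
    """Posterior expected value of the loss (theta - estimate)^2."""
    return sum(p * (theta - estimate) ** 2 for theta, p in posterior.items())

posterior_mean = sum(theta * p for theta, p in posterior.items())

# Search a fine grid of candidate estimates for the smallest expected loss.
candidates = [i / 1000 for i in range(1001)]
best = min(candidates, key=expected_squared_loss)

print(f"posterior mean = {posterior_mean:.3f}")
print(f"grid minimizer = {best:.3f}")
```

Other loss functions lead to other estimators; squared error is chosen here only because its minimizer is easy to verify.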

Estimation of the parameters can be done with Bayes' rule as follows

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

where D represents the dataset and θ represents the set of parameters. Here p(θ | D) is the posterior, p(D | θ) is the likelihood, p(θ) is the prior, and p(D) is the evidence. To explain these terms, consider an example with two events: A = “I woke up earlier today” and B = “I am feeling sleepy”.

Likelihood: The conditional probability p(B | A) is the probability of “I am feeling sleepy” given that “I woke up earlier today”. So the likelihood is the probability that I will feel sleepy, given that I woke up earlier.

Prior: This is the probability of event A regardless of event B. In our case, it is the probability of “I woke up earlier today” whether I am feeling sleepy or not, that is, prior to (before) knowing the state of the feeling. It acts as a kind of weight on the likelihood: a lower prior probability of “I woke up earlier today” leads to a lower probability of “I woke up earlier today” given that “I am feeling sleepy”.

Evidence: This is the probability of event B, which in our case is “I am feeling sleepy”. This event works as evidence for the fact that I woke up earlier today.
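The three terms can be combined numerically for the sleepy example. All probabilities below are invented for illustration; only the structure (likelihood times prior, divided by evidence) follows Bayes' rule.

```python
# Hypothetical probabilities for the two events in the example.
p_a = 0.3              # prior: P(A), "I woke up earlier today"
p_b_given_a = 0.8      # likelihood: P(B | A), sleepy given an early start
p_b_given_not_a = 0.2  # P(B | not A), sleepy without an early start

# Evidence: P(B), obtained with the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior: P(A | B) = P(B | A) * P(A) / P(B).
p_a_given_b = p_b_given_a * p_a / p_b

print(f"P(B) = {p_b:.2f}")
print(f"P(A | B) = {p_a_given_b:.3f}")
```

Lowering `p_a` while keeping the other two numbers fixed lowers `p_a_given_b`, which is the weighting effect of the prior described above.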

We can see that Bayesian estimation also includes the likelihood function, and that it calculates the full posterior distribution. Bayesian inference treats the parameters as random variables. In Bayesian estimation we put a probability density function (the prior) into the estimator and get a probability density function (the posterior) back, whereas in MLE we get a single point.

It is then our job to compare the values of the parameter θ under the posterior distribution and choose among them. For example, we may choose the expected value of θ, provided its variance is small enough. The posterior distribution calculated for θ tells us how confident we can be in using that value as an estimate: if the variance is too large, we can conclude that we do not have a good estimate for θ.
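The contrast between a single point and a full posterior can be seen in a small coin-flip sketch. It assumes a binomial likelihood with a Beta(1, 1) (uniform) prior, a standard conjugate pairing chosen here for convenience; the data of 7 heads in 10 flips is hypothetical.

```python
heads, flips = 7, 10     # hypothetical data: 7 heads in 10 coin flips
prior_a, prior_b = 1, 1  # Beta(1, 1) prior, i.e. uniform on [0, 1]

# Conjugacy: a Beta prior with a binomial likelihood gives a Beta posterior.
post_a = prior_a + heads
post_b = prior_b + (flips - heads)

# Mean and variance of the Beta(post_a, post_b) posterior.
post_mean = post_a / (post_a + post_b)
post_var = (post_a * post_b) / ((post_a + post_b) ** 2 * (post_a + post_b + 1))

mle = heads / flips  # the single point that MLE would return

print(f"MLE point estimate: {mle:.3f}")
print(f"posterior: Beta({post_a}, {post_b})")
print(f"posterior mean = {post_mean:.3f}, variance = {post_var:.4f}")
```

The posterior variance shrinks as more flips are observed, which is one way to judge whether the expected value is a trustworthy estimate.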

To apply Bayes' rule for estimation we need to deal with the denominator, the probability of the evidence, which can be represented by

$$p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$$

Calculating this integral requires us to express our prior beliefs about θ in the form of a prior distribution p(θ).
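For a one-dimensional parameter, the evidence integral can be approximated numerically. The sketch below assumes a binomial likelihood and a uniform prior on [0, 1], with hypothetical data of 7 heads in 10 flips; for this particular pairing the integral has the closed form 1 / (flips + 1), which serves as a check.

```python
from math import comb

heads, flips = 7, 10  # hypothetical data

def likelihood(theta):
    """p(D | theta) for a Binomial(flips, theta) model."""
    return comb(flips, heads) * theta**heads * (1 - theta) ** (flips - heads)

def prior(theta):
    """Uniform prior density on [0, 1]."""
    return 1.0

def evidence(steps=100_000):
    """Midpoint-rule approximation of the integral of likelihood * prior."""
    h = 1.0 / steps
    midpoints = ((i + 0.5) * h for i in range(steps))
    return sum(likelihood(t) * prior(t) for t in midpoints) * h

p_d = evidence()
print(p_d, 1 / (flips + 1))
```

With many parameters this kind of direct integration quickly becomes impractical, which is why the denominator is often the hard part of Bayesian estimation.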

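Maximum Likelihood Estimation (MLE) vs Bayesian Estimation

Having seen the intuition behind both estimation functions, we can note some basic differences between them. MLE uses only the likelihood function and returns a single point estimate of θ, whereas Bayesian estimation combines the likelihood with the prior and the evidence and returns a full posterior distribution over θ.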

Final words

In this article, we have seen why we need an estimation function and how it can be used. We had a general overview of Maximum Likelihood Estimation and Bayesian Estimation and understood some basic differences between them, which follow from their meanings and formulations. We saw that MLE focuses only on the likelihood function, whereas Bayesian estimation uses the likelihood, the prior, and the evidence to calculate a posterior distribution.

Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.