Bayesian hierarchical model for regression with uncertainty modelling

Probabilistic ML offers a strong set of techniques for modelling uncertainty
Listen to this story

The world is an unpredictable environment, and intelligent systems must derive conclusions from noisy or confusing input. Probability theory (particularly Bayes’ theorem) provides a systematic framework for integrating previous knowledge with actual information. Probabilistic machine learning offers a strong set of techniques for modelling uncertainty, executing probabilistic inference, and generating predictions or judgments. This article focuses on building a Bayesian hierarchical model for a regression problem with PyMC3. Following are the topics to be covered.

Table of contents

  1. About Bayesian Inference
  2. Benefits and Drawbacks of probabilistic programming
  3. A bayesian hierarchical model for regression using PyMC

Bayesian inference summarises evidence about the likelihood of a prediction using Bayesian probability. Let’s have a look at a high-level understanding of Bayesian inference.


Sign up for your weekly dose of what's up in emerging technology.

About Bayesian Inference

Bayes’ theorem serves as the foundation for Bayesian inference. Events are replaced by observations and parameter sets, and probabilities with densities in Bayesian inference. The joint posterior distribution of a parameter set for observation is now defined as the product of the parameter set’s prior distributions and the probability of the observation divided by marginal likelihood.

The “marginal likelihood” of observation, also known as the “prior predictive distribution,” can be set to an unknown constant c. The prior predictive distribution describes what an observation should look like before it is observed, given the model. For the marginal likelihood of y, just the collection of prior probabilities and the model’s likelihood function are employed. The marginal likelihood of observation normalises the joint posterior distribution, guaranteeing that it is a suitable distribution that integrates into one. 

Bayes’ theorem is used in model-based Bayesian inference to estimate the unnormalized joint posterior distribution in order to assess and infer from marginal posterior distributions.

Components of Bayesian inference

  • Prior distributions for the parameter set use probability to quantify uncertainty about parameters prior to taking data into account.
  • In a comprehensive probability model, the likelihood or likelihood function describes how all variables are connected.
  • After considering both the prior and the data, the joint posterior distribution indicates uncertainty about the parameter selection. If the parameter set is partitioned into a single parameter of interest and the remaining parameters are considered nuisance parameters, then the density of that single parameter of interest is the marginal posterior distribution.

Are you looking for a complete repository of Python libraries used in data science, check out here.

Benefits and Drawbacks of probabilistic programming

There are three main benefits:

Not dependent on the amount of data

In probabilistic programming, an individual may include domain knowledge into the model and then let the model learn as it goes from the data. That is something that a deep neural network cannot achieve. This means that an individual may begin with far less data than would be required in typical machine learning.

Uncertainty changes to certainty

Another significant advantage is that probabilistic models typically include uncertainty distributions. So, instead of getting a chance of anything, you will receive a probability distribution. 

Let’s understand this with an example. Assume we are developing a self-driving automobile. Our AI has a confidence score of 0.99 for the signal to be a green light ahead. How certain are we that the 99% figure is correct? Normally, we just don’t know. We get a distribution from probabilistic programming. That is, you know, how certain you are, that the 99% likelihood is right.

High interpretability of algorithm

Explainability is required in machine learning development, although it is frequently a rare resource. Many machine learning models are black-boxed from start to finish, and you will not know why the model reached a certain conclusion. That is an issue in many circumstances. For example, Consider you have a financial institution that provides loans. If a loan application is rejected by your machine learner then it would be really easy to explain the reason behind the rejection. Probabilistic programming is considerably easier to explain.

There are some downsides to probabilistic programming:

Technological challenges

Because this is a new sector, there are still certain technical issues that have yet to be resolved. For example, the inference is typically quite sluggish, causing model training to be incredibly slow. The industry is rapidly evolving, and new strategies are being developed on a daily basis, but you will still have to jump through some hoops for the time being.

This is not a typical programming

As a result, this paradigm includes programming, machine learning, and statistical techniques. A tremendously helpful combo, but also one that is taxing on the creators. Most developers were not educated in statistics and would need to gain new skills in this area. When looking for developers, this makes it a little more difficult to discover the proper folks.

A bayesian hierarchical model for regression using PyMC

PyMC is a Python library for Bayesian statistical modelling that focuses on sophisticated Monte Carlo Markov Chain and Variational inference methods. Its adaptability and extensibility make it suitable for a wide range of challenges.

In this article, we will be building a hierarchical bayesian model for a regression problem with two MC simulation chains.

The data used in this article is related to a baseball team club. This data has 23 features including the target variable which is “Salary”. The problem is a regression and needs to predict the salary of the players based on their historical data. 

Let’s check whether the PyMC is present 

! pip show pymc3
Analytics India Magazine

If not, it could be installed by the following code.

! pip install pymc3

Importing libraries 

import pymc3 as pm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import arviz as az

Reading and preprocessing the data

Analytics India Magazine

When the data is loaded, there are a few missing values needed to be treated and encoding categorical columns.

from sklearn.preprocessing import LabelEncoder

Since there are 23 features including the target variable, let’s check if we could do some feature selection or feature elimination.

import seaborn as sns
Analytics India Magazine

With the help of a heatmap, it could be observed that there are some features which are highly correlated with each other. So, performing feature elimination and eliminating column correlation greater than 0.8.

df=data_utils.drop(['AtBat', 'Hits','Runs', 'RBI','Years', 'CAtBat',
                    'CHits','CRuns', 'CRBI', 'CWalks'],axis=1)

Building the Hierarchical Bayesian model

with pm.Model() as hierarchical_model:
    mu_a = pm.Normal('mu_alpha', mu=0., sigma=1)
    sigma_a = pm.HalfCauchy('sigma_alpha', beta=1)
    mu_b = pm.Normal('mu_beta', mu=0., sigma=1)
    sigma_b = pm.HalfCauchy('sigma_beta', beta=1)
    a = pm.Normal('alpha', mu=mu_a, sigma=sigma_a, shape=len(X))
    b = pm.Normal('beta', mu=mu_b, sigma=sigma_b, shape=len(X))
    eps = pm.HalfCauchy('eps', beta=1)
    est = a[X['Division_enc']] + b[X['Division_enc']] * X['Errors'].values
    y_like = pm.Normal('y', mu=est, sigma=eps, observed=y)

For using the model of a PyMC it has to be defined as an “with” statement and can be given an alias using “as”. 

In the “est” variable the formula for linear regression is defined by using the independent variable as “Errors” and for slope and constant the “Division_enc” is being utilized.

Since the problem is multivariate linear regression, that is the reason for using hierarchical modelling.

with hierarchical_model:
    step = pm.NUTS()
    hierarchical_trace = pm.sample(2000, tune=1000, init=None, step=step, cores=2, return_inferencedata=True)

In this, we are sampling to have an inference of the statistics of the model. In the above creation 2 sample chains with 1000 tunes and using parallel processing to reduce the time.

Analytics India Magazine

Let’s deep dive into the inference obtained from the sample function. Since the result is stored in a variable it could easily be accessed, don’t forget to save.

Analytics India Magazine

If clicking on the dropdowns (the little arrowhead) it would expand and we could understand more about the inference. Perhaps there is another way to access the inference, the data is in the form of a data frame so posterior,log_likelihood,sample_stats, and observed_data are different columns. These could be treated as like pandas dataframe.

Analytics India Magazine

Let’s check the tree depth of the chain formed.

hierarchical_trace.sample_stats["tree_depth"].plot(col="chain", ls="none", marker="o", alpha=0.3)
Analytics India Magazine

The more trees there are, the prediction here is that the first chain has dense binary trees but the second chain doesn’t, which means the performance could be improved. But more trees means more time too.

Let’s have a look at the posterior distribution.

    hierarchical_trace, group="sample_stats", var_names="acceptance_rate", hdi_prob="hide", kind="hist"
Analytics India Magazine

This distortion in the distribution means there are outliers in the observation data. It could also be concluded that the chain formed isn’t stable. 

We can also plot the alpha and beta of the multivariate linear regression algorithm.

with hierarchical_model:
    az.plot_trace(hierarchical_trace,var_names=['alpha','beta','mu_alpha','mu_beta'],figsize=(15, 20)) 
Analytics India Magazine


The Bayesian model is built on our knowledge of past events. The prior represents your understanding of the parameters prior to receiving data. The likelihood is the probability of the data given the parameter values. Given the data, the posterior is the probability of the parameters. The prior, likelihood and posterior distributions are linked by Bayes’ theorem. With this article, we have understood Bayesian inference and its implementation to solve a regression problem with PyMC. 


More Great AIM Stories

Sourabh Mehta
Sourabh has worked as a full-time data scientist for an ISP organisation, experienced in analysing patterns and their implementation in product development. He has a keen interest in developing solutions for real-time problems with the help of data both in this universe and metaverse.

Our Upcoming Events

Masterclass, Virtual
How to achieve real-time AI inference on your CPU
7th Jul

Masterclass, Virtual
How to power applications for the data-driven economy
20th Jul

Conference, in-person (Bangalore)
Cypher 2022
21-23rd Sep

Conference, Virtual
Deep Learning DevCon 2022
29th Oct

3 Ways to Join our Community

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Telegram Channel

Discover special offers, top stories, upcoming events, and more.

Subscribe to our newsletter

Get the latest updates from AIM

What can SEBI learn from casinos?

It is said that casino AI technology comes with superior risk management systems compared to traditional data analytics that regulators are currently using.

Will Tesla Make (it) in India?

Tesla has struggled with optimising their production because Musk has been intent on manufacturing all the car’s parts independent of other suppliers since 2017.

Now Reliance wants to conquer the AI space

Many believe that Reliance is aggressively scouting for AI and NLP companies in the digital space in a bid to create an Indian equivalent of FAANG – Facebook, Apple, Amazon, Netflix, and Google.