The world is an unpredictable environment, and intelligent systems must draw conclusions from noisy or incomplete input. Probability theory, and Bayes’ theorem in particular, provides a systematic framework for combining prior knowledge with new evidence. Probabilistic machine learning offers a strong set of techniques for modelling uncertainty, performing probabilistic inference, and generating predictions or decisions. This article focuses on building a Bayesian hierarchical model for a regression problem with PyMC3. Following are the topics to be covered.
Table of contents
- About Bayesian Inference
- Benefits and Drawbacks of probabilistic programming
- A Bayesian hierarchical model for regression using PyMC
Bayesian inference summarises evidence about the likelihood of a prediction using Bayesian probability. Let’s have a look at a high-level understanding of Bayesian inference.
About Bayesian Inference
Bayes’ theorem serves as the foundation for Bayesian inference. In Bayesian inference, events are replaced by observations and parameter sets, and probabilities by densities. The joint posterior distribution of a parameter set given an observation is then defined as the product of the parameter set’s prior distribution and the likelihood of the observation, divided by the marginal likelihood.
The “marginal likelihood” of an observation, also known as the “prior predictive distribution,” can be set to an unknown constant c. The prior predictive distribution describes what an observation should look like before it is observed, given the model. For the marginal likelihood of y, only the set of prior distributions and the model’s likelihood function are needed. The marginal likelihood of the observation normalises the joint posterior distribution, guaranteeing that it is a proper distribution that integrates to one.
Bayes’ theorem is used in model-based Bayesian inference to estimate the unnormalized joint posterior distribution in order to assess and infer from marginal posterior distributions.
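In symbols, for a parameter set θ and an observation y, this relationship is:

p(θ | y) = p(y | θ) p(θ) / p(y) ∝ p(y | θ) p(θ)

where p(θ) is the prior, p(y | θ) is the likelihood, p(y) is the marginal likelihood (the constant c mentioned above), and p(θ | y) is the posterior.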
Components of Bayesian inference
- Prior distributions for the parameter set use probability to quantify uncertainty about the parameters before the data are taken into account.
- In a comprehensive probability model, the likelihood or likelihood function describes how all the variables are connected.
- The joint posterior distribution expresses uncertainty about the parameter set after both the prior and the data are taken into account. If the parameter set is partitioned into a single parameter of interest and the remaining nuisance parameters, the marginal posterior distribution of that parameter is obtained by integrating the nuisance parameters out of the joint posterior.
Benefits and Drawbacks of probabilistic programming
There are three main benefits:
Less dependence on the amount of data
In probabilistic programming, an individual can encode domain knowledge into the model as priors and then let the model learn from the data, something a standard deep neural network cannot do directly. This means an individual may begin with far less data than would be required in typical machine learning; the sketch below illustrates the idea.
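As a minimal, hypothetical sketch of encoding domain knowledge (the variable names and numbers are illustrative, not from this article’s dataset): suppose we believe a regression slope is close to 2 before seeing any data. In PyMC3, that belief becomes an informative prior, so even a small dataset can produce a sensible posterior.

import pymc3 as pm
import numpy as np

x = np.random.randn(20)                          # hypothetical small dataset
y_obs = 2.0 * x + 0.5 * np.random.randn(20)

with pm.Model() as informed_model:
    slope = pm.Normal("slope", mu=2.0, sigma=0.5)   # domain knowledge: slope is near 2
    noise = pm.HalfNormal("noise", sigma=1.0)       # weakly informative prior on the noise
    y = pm.Normal("y", mu=slope * x, sigma=noise, observed=y_obs)
    idata = pm.sample(1000, tune=1000, return_inferencedata=True)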
Uncertainty changes to certainty
Another significant advantage is that probabilistic models come with uncertainty estimates built in. So, instead of a single point estimate, you receive a full probability distribution.
Let’s understand this with an example. Assume we are developing a self-driving car. Our AI has a confidence score of 0.99 that the signal ahead is a green light. How certain are we that the 99% figure itself is correct? Normally, we just don’t know. Probabilistic programming gives us a distribution, that is, a measure of how certain we are that the 99% likelihood is right. A minimal sketch of the idea follows.
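Here is a small, hypothetical sketch of that point: given 100 past detections of which 99 were correct (made-up numbers), a Beta-Bernoulli model in PyMC3 returns a full posterior over the detection accuracy instead of a single 0.99 score.

import pymc3 as pm
import numpy as np

detections = np.array([1] * 99 + [0])             # hypothetical: 99 correct out of 100
with pm.Model() as confidence_model:
    p_correct = pm.Beta("p_correct", alpha=1, beta=1)            # flat prior on accuracy
    obs = pm.Bernoulli("obs", p=p_correct, observed=detections)
    idata = pm.sample(1000, tune=1000, return_inferencedata=True)
# The posterior of p_correct shows how much the 99% figure could plausibly vary.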
High interpretability of algorithm
Explainability is required in machine learning development, although it is frequently in short supply. Many machine learning models are black boxes from start to finish, and you will not know why the model reached a certain conclusion. That is an issue in many circumstances. For example, consider a financial institution that provides loans. If a loan application is rejected by a black-box model, it is not easy to explain the reason behind the rejection. Probabilistic programming models are considerably easier to explain.
There are some downsides to probabilistic programming:
Technological challenges
Because this is a new sector, there are still certain technical issues that have yet to be resolved. For example, the inference is typically quite sluggish, causing model training to be incredibly slow. The industry is rapidly evolving, and new strategies are being developed on a daily basis, but you will still have to jump through some hoops for the time being.
This is not typical programming
This paradigm combines programming, machine learning, and statistical techniques. It is a tremendously helpful combination, but also one that is demanding on developers. Most developers were not trained in statistics and will need to gain new skills in this area, which makes it a little more difficult to find the right people when hiring.
A Bayesian hierarchical model for regression using PyMC
PyMC is a Python library for Bayesian statistical modelling that focuses on sophisticated Markov chain Monte Carlo (MCMC) and variational inference methods. Its adaptability and extensibility make it suitable for a wide range of problems.
In this article, we will be building a hierarchical Bayesian model for a regression problem, sampled with two MCMC chains.
The data used in this article relates to a baseball club. It has 23 features, including the target variable, “Salary”. The task is a regression problem: predict the salary of the players based on their historical statistics.
Let’s check whether PyMC3 is installed.
! pip show pymc3

If not, it can be installed with the following command.
! pip install pymc3
Importing libraries
import pymc3 as pm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import arviz as az
Reading and preprocessing the data
data = pd.read_csv("Hitters.csv")
data[:5]

Once the data is loaded, a few missing values need to be treated and the categorical columns need to be encoded.
data_utils = data.dropna(axis=0)

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
data_utils['League_enc'] = encoder.fit_transform(data_utils['League'])
data_utils['Division_enc'] = encoder.fit_transform(data_utils['Division'])
data_utils['NewLeague_enc'] = encoder.fit_transform(data_utils['NewLeague'])
Since there are 23 features including the target variable, let’s check if we could do some feature selection or feature elimination.
import seaborn as sns
plt.figure(figsize=(15, 8))
sns.heatmap(data_utils.corr(), annot=True)
plt.show()

From the heatmap, it can be observed that some features are highly correlated with each other. So we perform feature elimination, dropping columns whose pairwise correlation is greater than 0.8; a programmatic way to find such columns is sketched after the code below.
df = data_utils.drop(['AtBat', 'Hits', 'Runs', 'RBI', 'Years', 'CAtBat', 'CHits', 'CRuns', 'CRBI', 'CWalks'], axis=1)
X = df.drop(['League', 'Division', 'NewLeague', 'Salary'], axis=1)
y = df['Salary']
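For reference, the 0.8 threshold can also be applied programmatically. This is a small sketch (not part of the original walkthrough) that lists one column from each highly correlated pair:

corr = data_utils.corr().abs()
# keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
print(to_drop)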
Building the Hierarchical Bayesian model
with pm.Model() as hierarchical_model:
    # Hyperpriors for the intercepts and slopes
    mu_a = pm.Normal('mu_alpha', mu=0., sigma=1)
    sigma_a = pm.HalfCauchy('sigma_alpha', beta=1)
    mu_b = pm.Normal('mu_beta', mu=0., sigma=1)
    sigma_b = pm.HalfCauchy('sigma_beta', beta=1)
    # Intercepts and slopes, indexed below by the division encoding
    a = pm.Normal('alpha', mu=mu_a, sigma=sigma_a, shape=len(X))
    b = pm.Normal('beta', mu=mu_b, sigma=sigma_b, shape=len(X))
    # Observation noise
    eps = pm.HalfCauchy('eps', beta=1)
    # Linear model: intercept and slope chosen by division, predictor is 'Errors'
    est = a[X['Division_enc']] + b[X['Division_enc']] * X['Errors'].values
    # Likelihood of the observed salaries
    y_like = pm.Normal('y', mu=est, sigma=eps, observed=y)
To use a PyMC model, it has to be defined inside a “with” statement (a context manager) and can be given a name using “as”.
In the “est” variable, the linear regression formula is defined: “Errors” is the independent variable, and the intercept and slope are indexed by “Division_enc”, so each division gets its own intercept and slope.
Hierarchical modelling is used here because the intercepts and slopes are allowed to vary by group (division) while sharing common hyperpriors, so information is partially pooled across groups rather than each group being fit independently.
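As an optional sanity check (not part of the original walkthrough), the priors can be probed with a prior predictive simulation before running MCMC:

with hierarchical_model:
    # draws simulated salaries using only the priors, before seeing the data
    prior_pred = pm.sample_prior_predictive(samples=500)
print(prior_pred['y'].shape)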
with hierarchical_model:
    step = pm.NUTS()
    hierarchical_trace = pm.sample(2000, tune=1000, init=None, step=step, cores=2, return_inferencedata=True)
Here we sample from the model to obtain inference about its statistics. The call above runs two chains of 2,000 samples each after 1,000 tuning steps, using parallel processing (cores=2) to reduce the runtime.

Let’s take a deeper look at the inference obtained from the sample function. Since the result is stored in a variable, it can easily be accessed later.
hierarchical_trace

Clicking on the dropdowns (the little arrowheads) expands each group so we can learn more about the inference. There is also a programmatic way to access it: the result is an ArviZ InferenceData object whose groups, posterior, log_likelihood, sample_stats, and observed_data, can be accessed by name, much like columns of a pandas dataframe.
hierarchical_trace['sample_stats']
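For a compact numerical view of the posterior, together with convergence diagnostics such as r_hat and effective sample size, ArviZ’s summary function can be used (restricted here to the hyper-parameters for brevity):

az.summary(hierarchical_trace, var_names=['mu_alpha', 'mu_beta', 'sigma_alpha', 'sigma_beta', 'eps'])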

Let’s check the tree depth of the chains formed.
hierarchical_trace.sample_stats["tree_depth"].plot(col="chain", ls="none", marker="o", alpha=0.3)

From the plot, the first chain tends to build deeper (denser) binary trees than the second, which suggests that the sampler’s performance could be improved. Deeper trees, however, also mean longer sampling times.
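For a quick numerical comparison of the two chains, the same sample_stats group can be averaged over the draws (a small sketch, not part of the original walkthrough):

print(hierarchical_trace.sample_stats["tree_depth"].mean(dim="draw").values)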
Let’s have a look at the posterior distribution.
az.plot_posterior(
    hierarchical_trace,
    group="sample_stats",
    var_names="acceptance_rate",
    hdi_prob="hide",
    kind="hist"
)

The distortion in this distribution suggests that there are outliers in the observed data. It can also be concluded that the chains have not stabilised; a couple of further sampler diagnostics are sketched below.
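To probe the sampler’s health a little further, one can look at the energy plot and count the divergent transitions recorded in sample_stats (a hedged suggestion, not part of the original walkthrough):

az.plot_energy(hierarchical_trace)
n_divergent = int(hierarchical_trace.sample_stats['diverging'].sum())
print(f"Number of divergent transitions: {n_divergent}")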
We can also plot the traces of the alpha and beta parameters of the regression model.
with hierarchical_model:
    az.plot_trace(hierarchical_trace, var_names=['alpha', 'beta', 'mu_alpha', 'mu_beta'], figsize=(15, 20))
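Finally, as an optional extension of the walkthrough, posterior predictive samples of the salaries can be generated and compared with the observed values, assuming a PyMC3 version recent enough to accept the InferenceData object directly:

with hierarchical_model:
    ppc = pm.sample_posterior_predictive(hierarchical_trace)
print(ppc['y'].shape)   # (number of posterior draws, number of players)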

Conclusion
A Bayesian model is built on our knowledge of past events. The prior represents our understanding of the parameters before observing the data, the likelihood is the probability of the data given the parameter values, and the posterior is the probability of the parameters given the data. The prior, likelihood and posterior distributions are linked by Bayes’ theorem. With this article, we have understood Bayesian inference and its implementation to solve a regression problem with PyMC3.