Probabilistic Graphical Models (PGMs) are a powerful way of representing joint probability distributions over a set of random variables, and they allow users to perform inference in a computationally efficient way. A PGM exploits the conditional independencies between the random variables to build a graph structure representing their relationships, and the joint probability distribution can then be recovered by combining the parameters attached to the graph.
There are two main types of graphical models:
Bayesian Graph Models: These models are Directed Acyclic Graphs (DAGs), and each random variable has a conditional probability distribution associated with it. These models can represent causal relationships between the random variables.
Markov Graph Models: These models are undirected graphs and represent non-causal relationships between the random variables.
pgmpy is a Python framework for working with these types of graphical models. Several graph models and inference algorithms are implemented in pgmpy. It also allows users to create their own inference algorithms without having to dig into the library's source code. Let's get started with the implementation.
Installation
Install pgmpy via PyPI:
!pip install pgmpy
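As a quick sanity check (a minimal sketch; the exact version string will depend on your install), confirm the package imports cleanly:

# Verify the installation by importing the package and printing its version
import pgmpy
print(pgmpy.__version__)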
pgmpy Demo – Create Bayesian Network
In this demo, we are going to create a Bayesian network. In a Bayesian network, each node is parameterized by a conditional probability distribution of the form P(node | Pa(node)), where Pa(node) is the set of parents of that node in the network.
An example of a student model is shown below; we are going to implement it using the pgmpy Python library.
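For this structure (with Difficulty D, Intelligence I, Grade G, Letter L and SAT S), the network encodes the factorization P(D, I, G, L, S) = P(D) · P(I) · P(G | D, I) · P(L | G) · P(S | I), which is exactly the set of CPDs we will define below.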
- Import the required classes from pgmpy.
from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD
- Initialize the model by passing the edge list as shown below.
# Defining the model structure. We can define the network by just passing a list of edges.
model = BayesianModel([('D', 'G'), ('I', 'G'), ('G', 'L'), ('I', 'S')])
Define all the conditional probability tables (CPDs) as shown in the diagram above. In pgmpy these CPDs are created with the TabularCPD class.
# Defining individual CPDs.
cpd_d = TabularCPD(variable='D', variable_card=2, values=[[0.6], [0.4]])
cpd_i = TabularCPD(variable='I', variable_card=2, values=[[0.7], [0.3]])

# The representation of a CPD in pgmpy is a bit different from the CPD shown in the
# picture above: in pgmpy the columns are the evidence and the rows are the states
# of the variable.

# Represents P(grade | diff, intel)
cpd_g = TabularCPD(variable='G', variable_card=3,
                   values=[[0.3, 0.05, 0.9, 0.5],
                           [0.4, 0.25, 0.08, 0.3],
                           [0.3, 0.7, 0.02, 0.2]],
                   evidence=['I', 'D'],
                   evidence_card=[2, 2])

cpd_l = TabularCPD(variable='L', variable_card=2,
                   values=[[0.1, 0.4, 0.99],
                           [0.9, 0.6, 0.01]],
                   evidence=['G'],
                   evidence_card=[3])

cpd_s = TabularCPD(variable='S', variable_card=2,
                   values=[[0.95, 0.2],
                           [0.05, 0.8]],
                   evidence=['I'],
                   evidence_card=[2])

Add the CPDs defined above to the initialized model.

# Associating the CPDs with the network
model.add_cpds(cpd_d, cpd_i, cpd_g, cpd_l, cpd_s)

Verify the network with the check_model() method, which checks the network structure and confirms that the CPDs are correctly defined and sum to 1.

# check_model checks the network structure and the CPDs, and verifies that the
# CPDs are correctly defined and sum to 1.
model.check_model()
- In the above step, we haven't provided state names, so pgmpy automatically initializes the states as 0, 1, 2, and so on. It also provides a way of explicitly setting the state names, as shown in the example below. The whole code snippet is available here.
cpd_g_sn = TabularCPD(variable='G', variable_card=3,
                      values=[[0.3, 0.05, 0.9, 0.5],
                              [0.4, 0.25, 0.08, 0.3],
                              [0.3, 0.7, 0.02, 0.2]],
                      evidence=['I', 'D'],
                      evidence_card=[2, 2],
                      state_names={'G': ['A', 'B', 'C'],
                                   'I': ['Dumb', 'Intelligent'],
                                   'D': ['Easy', 'Hard']})
- Print the CPDs by simply using the print command, both for the automatically assigned integer states and for the explicitly defined state names; a CPD attached to the model can also be retrieved with the get_cpds method. The code is available here, and an example is sketched below.
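A minimal sketch of printing the CPDs defined above (cpd_g uses the default integer states, cpd_g_sn the explicit state names):

# CPD with auto-assigned integer states
print(cpd_g)

# The same CPD fetched back from the model
print(model.get_cpds('G'))

# CPD with explicitly named states
print(cpd_g_sn)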
- Next is to find the independencies in the given Bayesian network. There are two types of independencies defined by a Bayesian network.
Local Independencies: A variable is independent of its non-descendants given its parents. This can be written as (X ⊥ NonDesc(X) | Pa(X)), where NonDesc(X) is the set of variables which are not descendants of X and Pa(X) is the set of parents of X.
# Getting the local independencies of a variable.
model.local_independencies('G')
Or,
# Getting all the local independencies in the network.
model.local_independencies(['D', 'I', 'S', 'G', 'L'])
Global Independencies: Many different structures can give rise to global independencies. For two nodes, there are only two ways they can be directly connected: X → Y or Y → X. In both cases, a change in either node clearly affects the other. Similar arguments can be made for structures over three nodes.
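pgmpy can enumerate these implied independencies directly. A minimal sketch using methods available on the student model defined above: list all independence assertions implied by the structure, and check which nodes a change in D can still influence once G is observed.

# All conditional independence assertions implied by the network structure
print(model.get_independencies())

# Nodes reachable from 'D' through an active trail when 'G' is observed
print(model.active_trail_nodes('D', observed='G'))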
- Inference from Bayesian models. In this step, we will predict values from the Bayesian model discussed above using Variable Elimination, a very basic method for exact inference. For example, we will compute the probability of G by marginalizing over all the other variables. The Python code for this is given below.
from pgmpy.inference import VariableElimination

infer = VariableElimination(model)
g_dist = infer.query(['G'])
print(g_dist)
For computing a conditional distribution such as P(G | D='Easy', I='Intelligent'), we need to pass the evidence as an extra argument.
print(infer.query(['G'], evidence={'D': 'Easy', 'I': 'Intelligent'}))
- In this step, we will predict values for new data points. The difference from the previous step is that we are now interested in the most probable state of a variable instead of its full probability distribution. In pgmpy this is known as a MAP query. Here's an example:
infer.map_query(['G'])
Or,
infer.map_query(['G'], evidence={'D': 'Easy', 'I': 'Intelligent'})
You can check the full demo here.
pgmpy Demo – Extensibility
As discussed above, pgmpy provides a way to create your own inference algorithm. In this demo, we are going to do exactly that. pgmpy exposes base classes such as:
- BaseInference for inference
- BaseFactor for model parameters
- BaseEstimators for parameter and model learning
- To add a new feature, create a new class which inherits from one of these base classes; the rest of pgmpy's functionality can then be used with this new class.
Following are the steps:
- Import all the required methods and packages.
# A simple exact inference algorithm
import itertools

from pgmpy.inference.base import Inference
from pgmpy.factors import factor_product
- Define your own inference class by inheriting from pgmpy's base class. For this particular algorithm, we will multiply all the factors/CPDs of the network and marginalize over the remaining variables to get the desired query.
class SimpleInference(Inference):
    # By inheriting Inference we can use self.model, self.factors and
    # self.cardinality in our class.
    def query(self, var, evidence):
        # self.factors is a dict of the form {node: [factors_involving_node]}
        factors_list = set(itertools.chain(*self.factors.values()))
        product = factor_product(*factors_list)
        reduced_prod = product.reduce(evidence, inplace=False)
        reduced_prod.normalize()
        var_to_marg = (set(self.model.nodes()) - set(var) -
                       set([state[0] for state in evidence]))
        marg_prod = reduced_prod.marginalize(var_to_marg, inplace=False)
        return marg_prod
- Now, as in the demo above, we will initialize the Bayesian model, define the conditional probability distributions for all variables, and add them to the initialized model.
# Defining a model
from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD

model = BayesianModel([('A', 'J'), ('R', 'J'), ('J', 'Q'), ('J', 'L'), ('G', 'L')])

cpd_a = TabularCPD('A', 2, values=[[0.2], [0.8]])
cpd_r = TabularCPD('R', 2, values=[[0.4], [0.6]])
cpd_j = TabularCPD('J', 2,
                   values=[[0.9, 0.6, 0.7, 0.1],
                           [0.1, 0.4, 0.3, 0.9]],
                   evidence=['A', 'R'], evidence_card=[2, 2])
cpd_q = TabularCPD('Q', 2,
                   values=[[0.9, 0.2],
                           [0.1, 0.8]],
                   evidence=['J'], evidence_card=[2])
cpd_l = TabularCPD('L', 2,
                   values=[[0.9, 0.45, 0.8, 0.1],
                           [0.1, 0.55, 0.2, 0.9]],
                   evidence=['J', 'G'], evidence_card=[2, 2])
cpd_g = TabularCPD('G', 2, values=[[0.6], [0.4]])

model.add_cpds(cpd_a, cpd_r, cpd_j, cpd_q, cpd_l, cpd_g)
- Now, run the query with your customized inference algorithm and compare the result with the VariableElimination method, as sketched after the code below.
# Doing inference with our SimpleInference
infer = SimpleInference(model)
a = infer.query(var=['A'], evidence=[('J', 0), ('R', 1)])
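To sanity-check the result, the same query can be run with pgmpy's built-in VariableElimination (which takes evidence as a dict rather than a list of tuples); both should produce the same distribution over A. A minimal sketch:

# Comparing with pgmpy's built-in exact inference
from pgmpy.inference import VariableElimination

ve_infer = VariableElimination(model)
b = ve_infer.query(['A'], evidence={'J': 0, 'R': 1})
print(a)
print(b)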
You can check the full demo here.
Conclusion
In this article, we have discussed the pgmpy Python library, which provides a simple API for working with graphical models (Bayesian models, Markov models, etc.). It is highly modular and quite extensible.
Official code, docs & tutorials are available at: