PyKEEN is a Python package that generates knowledge graph embeddings while abstracting away the training loop and evaluation. The embeddings obtained with PyKEEN are reproducible, and they capture the semantics of the entities and relations in the knowledge graph.
The knowledge graph is a graph data structure that captures multimodal and multilateral information in terms of relationships between concepts. The concepts are represented as entities, and relationships between concepts are represented as edges of the knowledge graph.
This graph can be used for various tasks like search and retrieval of information. We can also predict new relations between two concepts, which makes knowledge graphs an excellent choice for augmenting sparse data for ML and DL algorithms.
Knowledge Graph Embeddings
KGEs are vector space representations of entities and relationships in a knowledge graph.
These embeddings are obtained from a KGE model. Such models essentially try to preserve the pairwise distance between entities, commensurate with their relation. PyKEEN ships with implementations of many such models.
Let’s look closely at TransE, one of the simplest and best-known models for generating KGEs.
Translation Based Embedding
This model generates vectors for relations and entities in the same vector space. Following is the pseudocode for the algorithm behind this model.
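The distance mentioned in the algorithm is a vector norm of the difference between its arguments. The original TransE formulation uses the L1 or the L2 norm; for vectors, the Frobenius norm is the same as the L2 norm.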
- Here h is the head or source entity of a relationship in the knowledge graph.
- Here l is the link or relation between entities of a relationship in the knowledge graph.
- Here t is the tail or destination entity of a relationship in the knowledge graph.
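To make the roles of h, l and t concrete, here is a minimal sketch of the TransE idea in plain Python: a triple is plausible when h + l lands close to t, and training pushes corrupted triples at least a margin farther away. The toy vectors, the function names transe_distance and margin_loss, and the margin value are illustrative assumptions, not PyKEEN code.

```python
import math

def transe_distance(h, l, t, p=2):
    """Distance d(h + l, t): the p-norm of (h + l - t). Lower means more plausible."""
    diffs = [hi + li - ti for hi, li, ti in zip(h, l, t)]
    if p == 1:
        return sum(abs(d) for d in diffs)
    return math.sqrt(sum(d * d for d in diffs))

def margin_loss(pos_dist, neg_dist, margin=1.0):
    """Margin ranking loss: zero once the corrupted triple is at least `margin` farther away."""
    return max(0.0, margin + pos_dist - neg_dist)

# Toy 2-d embeddings (made-up numbers, for illustration only).
head = [1.0, 0.0]
link = [0.0, 1.0]
true_tail = [1.0, 1.0]      # exactly head + link
corrupt_tail = [4.0, 5.0]   # far away from head + link

pos = transe_distance(head, link, true_tail)
neg = transe_distance(head, link, corrupt_tail)
print(pos, neg, margin_loss(pos, neg))
```

Here the true triple sits at distance zero and the corrupted one is far away, so the loss for this pair is already zero and the pair would contribute no gradient.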
This plot shows the structure of embeddings obtained using the TransE model.
Now let’s see how to use pykeen to extract these embeddings.
Installation of pykeen is quite simple. You can just do a pip install.
! pip install pykeen==1.0.5
This package runs on top of PyTorch, so install PyTorch as well.
PyKEEN provides many open-source datasets as classes for seamless integration with the rest of the module. Let’s check out the OpenBioLink knowledge graph in this article.
from pykeen.datasets import OpenBioLink

dataset = OpenBioLink()
dataset.training.triples
Each triple contains (head, link, tail), in that order. The first tuple in the image is a gene-phenotype relation. A phenotype is an observable trait, such as the colour of the eyes, hair or skin. In the first tuple, the gene NCBIGENE:11200 is responsible for the phenotype HP:0009919 (retina tumor). Use this link to look up the meaning of the gene, phenotype and anatomy identifiers in these tuples.
Model, Optimizer and Training Approach
Next, we need to pick an embedding model to extract embeddings from the OpenBioLink Knowledge graph. Following is the code to load TransE model in pykeen:
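# Pick a model
from pykeen.models import TransE

# The training split of the dataset is the triples factory the model needs
training_triples_factory = dataset.training
model = TransE(triples_factory=training_triples_factory)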
We can choose optimizers from torch to train the model.
# Pick an optimizer from Torch
from torch.optim import Adam

optimizer = Adam(params=model.get_grad_params())
We also need to select a training approach for the model:
# Pick a training approach (sLCWA or LCWA)
from pykeen.training import SLCWATrainingLoop

training_loop = SLCWATrainingLoop(model=model, optimizer=optimizer)
Whenever we have a knowledge graph, we need to make certain assumptions to draw inferences from it. The Closed World Assumption (CWA) is one such assumption: if a link is not present between two entities, that link is taken to be false, i.e. the probability of a relationship between these entities is always zero. The problem is immediate: once we assume this, we cannot predict any new links in the knowledge graph. Collecting real-world data is challenging, and many relationships are never captured in the knowledge graph. The CWA turns all of this missing data into false values.
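The Local Closed World Assumption (LCWA) addresses this problem by treating the knowledge graph as complete only locally rather than globally. In PyKEEN's LCWA training, for an observed (head, relation) pair, every tail that does not appear with that pair in the training set is treated as a negative. The Stochastic Local Closed World Assumption (sLCWA) is a stochastic version of the LCWA: instead of using all such triples, it samples a random subset of corrupted triples as negatives.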
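The difference between the two training approaches can be sketched on a toy graph. In the LCWA-style function, every tail never observed with a given (head, relation) pair counts as a negative; in the sLCWA-style function, a few negatives are sampled by randomly replacing the head or the tail of a known triple. The toy triples, the function names and the choice to filter out known triples are assumptions made for illustration; this is not PyKEEN's implementation.

```python
import random

# Toy knowledge graph (made-up triples, for illustration only).
triples = [
    ("gene_a", "causes", "trait_x"),
    ("gene_a", "causes", "trait_y"),
    ("gene_b", "causes", "trait_x"),
]
entities = sorted({e for h, _, t in triples for e in (h, t)})
known = set(triples)

def lcwa_negatives(head, relation):
    """LCWA-style: every tail never observed with (head, relation) counts as a negative."""
    observed = {t for h, r, t in triples if h == head and r == relation}
    return [(head, relation, e) for e in entities if e not in observed]

def slcwa_negatives(triple, num_negs, rng):
    """sLCWA-style: sample corrupted triples by replacing the head or the tail at random."""
    h, r, t = triple
    negatives = []
    while len(negatives) < num_negs:
        e = rng.choice(entities)
        candidate = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if candidate not in known:  # sketch-level choice: skip known true triples
            negatives.append(candidate)
    return negatives

rng = random.Random(0)
print(lcwa_negatives("gene_a", "causes"))
print(slcwa_negatives(triples[0], num_negs=2, rng=rng))
```

The LCWA-style list grows with the number of entities, while the sampled variant keeps the number of negatives per positive fixed, which is the practical reason to sample on large graphs.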
Training and Evaluation
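We are all set to train the model now. Training is started by calling the training loop's train method, passing the number of epochs and the batch size as num_epochs and batch_size.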
Following is the code to evaluate the trained model using a test set.
# Pick an evaluator
from pykeen.evaluation import RankBasedEvaluator

evaluator = RankBasedEvaluator()

# Get triples to test
mapped_triples = dataset.testing.mapped_triples

# Evaluate
results = evaluator.evaluate(model, mapped_triples, batch_size=1024)
print(results)
To evaluate the embedding model, we use a test set of triples.
Each of these triples is assigned a score by the model based on the plausibility of the triple.
We want the test set’s triples to be highly plausible, so we expect higher scores for them. But the score given by the model is not bounded, so we cannot say what a good value is on its own. Instead, each test triple is ranked by descending score against its corrupted alternatives (the same triple with the head or tail replaced), and the ranks are used to evaluate the model.
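The step from unbounded scores to ranks can be sketched as follows. Each test triple is compared with the scores of alternative triples, and the resulting ranks are summarised with two common rank-based metrics, mean reciprocal rank and hits@k. The scores are made-up numbers, the function names are illustrative, and ties are resolved optimistically here, which is a simplifying assumption.

```python
def optimistic_rank(true_score, other_scores):
    """Rank of the test triple (1 = best); only strictly higher scores push it down."""
    return 1 + sum(1 for s in other_scores if s > true_score)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank; 1.0 means every test triple was ranked first."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of test triples ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Each entry: (score of the test triple, scores of its alternatives).
test_cases = [
    (-0.5, [-3.0, -2.0, -4.0]),  # no alternative scores higher
    (-2.5, [-1.0, -3.0, -2.0]),  # two alternatives score higher
    (7.0, [9.0, 1.0, 0.0]),      # one alternative scores higher
]
ranks = [optimistic_rank(score, others) for score, others in test_cases]
print(ranks)
print(mean_reciprocal_rank(ranks), hits_at_k(ranks, 1), hits_at_k(ranks, 3))
```

Note that the first and third cases have very different raw scores but similar ranks, which is why ranks rather than scores are compared across triples.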
PyKEEN provides a high-level entry point to access the models, called the pipeline. We give the pipeline all the information about the model, and it takes care of everything required for training and evaluation.
from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    dataset='Nations',
    model='TransE',
    evaluator='RankBasedEvaluator',
    training_loop='sLCWA',
    negative_sampler='basic',
    model_kwargs=dict(
        scoring_fct_norm=2,
    ),
)
pipeline_result.save_to_directory('nations_transe')
Hyper Parameter Optimization
PyKEEN provides a hyperparameter optimization pipeline function, pykeen.hpo.hpo_pipeline(). It uses Optuna as its backend. The following code snippet shows how to optimize the hyperparameters.
from pykeen.hpo import hpo_pipeline

hpo_pipeline_result = hpo_pipeline(
    n_trials=30,
    dataset='Nations',
    model='TransE',
    loss='MarginRankingLoss',
    model_kwargs_ranges=dict(
        embedding_dim=dict(type=int, low=100, high=500, q=100),
    ),
    loss_kwargs_ranges=dict(
        margin=dict(type=float, low=1.0, high=2.0),
    ),
)
hpo_pipeline works more or less like the simple pipeline: it searches over the specified parameter ranges using one of Optuna's samplers (random search and grid search are among them) and returns an hpo_pipeline_result.
Ranges for the model's hyperparameters are provided through the model_kwargs_ranges argument.
Other components can be tuned in the same way; the snippet above, for example, uses loss_kwargs_ranges to search over the margin of the loss.
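To show what a range specification like the ones above expresses, here is a small sketch that draws trial configurations from such dictionaries by plain random search. It assumes that q is the step size between candidate integer values, so that embedding_dim is drawn from 100, 200, ..., 500. The helper sample_from_range is hypothetical, and Optuna's own samplers are more sophisticated than this.

```python
import random

def sample_from_range(spec, rng):
    """Draw one value from a range spec shaped like the dicts passed to hpo_pipeline."""
    low, high = spec["low"], spec["high"]
    if spec["type"] is int:
        step = spec.get("q", 1)  # assumption: q is the step between candidates
        return rng.choice(range(low, high + 1, step))
    return rng.uniform(low, high)

model_kwargs_ranges = dict(embedding_dim=dict(type=int, low=100, high=500, q=100))
loss_kwargs_ranges = dict(margin=dict(type=float, low=1.0, high=2.0))

rng = random.Random(42)
for trial in range(3):
    config = {
        "embedding_dim": sample_from_range(model_kwargs_ranges["embedding_dim"], rng),
        "margin": sample_from_range(loss_kwargs_ranges["margin"], rng),
    }
    print(trial, config)
```

In a real search, each sampled configuration would be trained and evaluated, and the best-scoring one kept.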
Saving and Restoring Model
PyKEEN models are torch models with utility functions on top, so we can use torch’s own functionality to save and reload a model.
import torch

torch.save(model, 'trained_model.pkl')
my_pykeen_model = torch.load('trained_model.pkl')
We can also save model checkpoints during training so that training can be resumed if it fails due to a crash. This functionality is enabled through the training_kwargs argument:
training_kwargs=dict(
    num_epochs=2000,
    checkpoint_name='my_checkpoint.pt',
    checkpoint_directory='doctests/checkpoint_dir',
    checkpoint_frequency=5,
)
To start the training process from a checkpoint, we simply need to use the same checkpoint name in the code.
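The resume-by-name behaviour can be illustrated with a toy training loop: if a checkpoint file with the given name exists, the loop continues from the saved epoch, otherwise it starts from scratch. This is a sketch using only the standard library. The function train_with_checkpoints, the pickle format and the per-epoch checkpoint interval are assumptions for illustration; PyKEEN's checkpoint_frequency may be defined differently, and its checkpoints hold model and optimizer state rather than a toy dictionary.

```python
import os
import pickle
import tempfile

def train_with_checkpoints(num_epochs, checkpoint_path, checkpoint_every, fail_at=None):
    """Toy loop: resume from checkpoint_path if it exists, save state every few epochs."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"epoch": 0, "loss_history": []}

    for epoch in range(state["epoch"] + 1, num_epochs + 1):
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError("simulated crash")
        state["loss_history"].append(1.0 / epoch)  # stand-in for a real training step
        state["epoch"] = epoch
        if epoch % checkpoint_every == 0:
            with open(checkpoint_path, "wb") as f:
                pickle.dump(state, f)
    return state

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_checkpoint.pkl")
    try:
        train_with_checkpoints(10, path, checkpoint_every=2, fail_at=7)
    except RuntimeError:
        pass  # the crash happens after the epoch-6 checkpoint was written
    # Same checkpoint name: training picks up at epoch 7 instead of epoch 1.
    resumed = train_with_checkpoints(10, path, checkpoint_every=2)
    print(resumed["epoch"], len(resumed["loss_history"]))
```

Any work done after the last checkpoint and before the crash is repeated on resume, so the checkpoint interval is a trade-off between I/O cost and lost work.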
We have taken a knowledge graph and converted all the entities and relations into embeddings. Let’s see some of the interesting information we can extract from these embeddings.
What are the possible phenotypes observed due to the presence of the gene NCBIGENE:534?
predicted_tails_df = model.predict_tails('NCBIGENE:534', 'GENE_PHENOTYPE')
predicted_tails_df
The top predicted phenotype for the gene is HP:0001337, which corresponds to tremors. This information was not present in the original knowledge graph. It was inferred from the phenotypes of closely related genes, where closeness is determined only from information present in the KG.
We can also ask other questions, such as: What is the head, given a relation and a tail? Or: What are the most plausible triples?