AmpliGraph is an open-source, TensorFlow-based library developed by Accenture Labs for predicting links between concepts in knowledge graphs. It is a collection of neural ML models for statistical relational learning (SRL, also called relational machine learning), a subdiscipline of AI/ML that deals with supervised learning on knowledge graphs.
Before going into the details of AmpliGraph, let us have a quick look at what a knowledge graph means.
What is a knowledge graph?
A knowledge graph is a graph-based representation showing how the various entities of a system (e.g. objects, individuals, abstract concepts, events) are interlinked. There is no single precise definition; in simple terms, it is a graph representing distinct entities and the relationships among them. It enables data integration and analysis by providing context to a system’s associated data. Visit this page to understand knowledge graphs in detail. The following is an example of a knowledge graph:
Image source: GitHub
A graph is represented by a set of nodes representing entities and connecting edges showing the relationships among them. It can be homogeneous (e.g. a social network having people and their connections, where all entities are of a common type) or heterogeneous (e.g. a graph of a university having different types of entities like students, professors, departments, etc. and relations like ‘studies-at’, ‘teaches-at’ and so on). In addition, a graph can be a ‘multigraph’, in which there can be multiple directed edges between one or more pairs of nodes, some of which can even form loops.
A university graph, mentioned as a heterogeneous graph in the example above, conveys meaningful information (known as ‘semantics’) about its entities and their associated relations.
Now that we know terminologies like ‘heterogeneous graph’, ‘multigraph’ and ‘semantics’, we can define a knowledge graph as “a heterogeneous multigraph in which entities and relations have semantics specific to a particular domain”.
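In practice, such a graph is usually stored as a list of (subject, predicate, object) triples. Here is a minimal illustration mirroring the university example above; the entity and relation names are invented for this sketch:

import numpy as np

#A toy knowledge graph as (subject, predicate, object) triples.
#AmpliGraph consumes exactly this kind of N x 3 array.
triples = np.array([
    ['Alice', 'studies-at', 'Example University'],
    ['Bob', 'teaches-at', 'Example University'],
    ['Alice', 'enrolled-in', 'Computer Science'],
    ['Computer Science', 'department-of', 'Example University'],
])
print(triples.shape)  #(4, 3)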
Overview of AmpliGraph
The AmpliGraph library provides ML models that can create knowledge graph embeddings (KGEs), which are low-dimensional vector representations of the entities, and the relations among them, that constitute a knowledge graph.
Consider the following knowledge graph and its corresponding KGE to understand what AmpliGraph does:
Image source: GitHub
In the above knowledge graph, certain entities have no direct link through some relations, e.g. there is no information about whether ‘Acme Inc’ and ‘Liverpool’ are connected through the ‘basedIn’ relation. AmpliGraph combines the above KGE with a scoring function and makes predictions about such missing links.
For example, it may predict that there is an 85% probability of Acme Inc being based in Liverpool, which can be represented as:
Image source: GitHub
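To make the idea concrete, here is a minimal sketch of how a link-prediction score becomes a probability. The embeddings and the TransE-style scoring function below are invented for illustration; they are not AmpliGraph internals:

import numpy as np
from scipy.special import expit  #logistic sigmoid

#Toy 3-dimensional embeddings (made up for illustration)
e_subject = np.array([0.1, 0.7, 0.3])    #'Acme Inc'
e_relation = np.array([0.2, -0.1, 0.4])  #'basedIn'
e_object = np.array([0.3, 0.6, 0.7])     #'Liverpool'

#TransE-style score: the closer subject + relation is to object,
#the more plausible the triple
score = -np.linalg.norm(e_subject + e_relation - e_object)

#Squash the raw score into a probability in the range 0-1
probability = expit(score)
print(probability)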
Modules of AmpliGraph
The library is organised into modules such as ampligraph.datasets (data loading), ampligraph.latent_features (embedding models like ComplEx), ampligraph.evaluation (metrics and evaluation protocols), ampligraph.discovery and ampligraph.utils, several of which are used in the walkthrough below.
Key features of AmpliGraph
- It can run on CPUs as well as GPUs to speed up the training process
- Its APIs reduce the amount of code required for link prediction in knowledge graphs
- AmpliGraph base estimators are extensible
Practical implementation
Here’s a demonstration of using AmpliGraph to discover novel relations in a Game of Thrones (GoT) knowledge graph, the dataset for which can be downloaded from here; the graph is available at GitHub.
The condensed dataset looks something like this:
While the graph appears as follows:
Image source: GitHub
The code here has been implemented in Google Colab with Python 3.7.10 and AmpliGraph 1.3.2. We have used the ComplEx (Complex Embeddings) model for the KGE. A step-wise explanation of the code follows:
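For context, ComplEx represents each entity and relation as a complex-valued vector and scores a triple as the real part of a trilinear product. Below is a hedged NumPy sketch of that scoring function; the vectors are random stand-ins, not trained embeddings:

import numpy as np

rng = np.random.default_rng(0)
k = 4  #embedding dimensionality (150 in the model trained below)

#Random complex-valued stand-ins for trained embeddings
e_s = rng.normal(size=k) + 1j * rng.normal(size=k)  #subject
w_r = rng.normal(size=k) + 1j * rng.normal(size=k)  #relation
e_o = rng.normal(size=k) + 1j * rng.normal(size=k)  #object

#ComplEx score: Re(<e_s, w_r, conj(e_o)>)
score = np.real(np.sum(e_s * w_r * np.conj(e_o)))
print(score)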
- Install Ampligraph library
!pip install ampligraph
- Import required libraries
import ampligraph
import numpy as np
import pandas as pd
import requests  #module for making HTTP requests
from scipy.special import expit  #logistic sigmoid, used later to turn scores into probabilities
from ampligraph.datasets import load_from_csv
from ampligraph.evaluation import train_test_split_no_unseen
from ampligraph.latent_features import ComplEx
from ampligraph.evaluation import evaluate_performance
from ampligraph.utils import create_tensorboard_visualizations
- Download the dataset
#Define the URL from which to download the data
data_url = 'https://ampligraph.s3-eu-west-1.amazonaws.com/datasets/GoT.csv'
#Open a file called 'GoT.csv' in binary write mode and write the contents of
#the downloaded dataset into it
open('GoT.csv', 'wb').write(requests.get(data_url).content)
#Load the knowledge graph from the GoT.csv file using load_from_csv()
data = load_from_csv('.', 'GoT.csv', sep=',')
- Get the unique entities present in the dataset
ent = np.unique(np.concatenate([data[:, 0], data[:, 2]]))
ent  #display those entities
Output:
array(['Abelar Hightower', 'Acorn Hall', 'Addam Frey', ..., 'the Antlers','the Paps', 'unnamed tower'], dtype=object)
Similarly, get the names of unique relations among the entities
rel = np.unique(data[:, 1])
rel  #display the names of those relations
Output:
array(['ALLIED_WITH', 'BRANCH_OF', 'FOUNDED_BY', 'HEIR_TO', 'IN_REGION', 'LED_BY', 'PARENT_OF', 'SEAT_OF', 'SPOUSE', 'SWORN_TO'], dtype=object)
- Perform train-test split to create training and test sets from the dataset
#We split the data in a 70-30 train-test ratio. Compute the number of test samples accordingly
num_test_samples = int(len(data) * (30 / 100))
#Split 'data' into train and test sets such that the test set has
#'num_test_samples' samples and there are no duplicate entries
X = {}
X['train'], X['test'] = train_test_split_no_unseen(data, test_size=num_test_samples, seed=0, allow_duplication=False)
train_test_split_no_unseen() creates a test set whose triples contain no unseen entities or relations, i.e. it involves only those entities and relations that also appear in the training set.
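As an optional sanity check (not part of the original tutorial), you can verify that every test-set entity indeed appears in the training set:

#Optional sanity check: all test entities should also occur in the training set
train_entities = set(X['train'][:, 0]) | set(X['train'][:, 2])
test_entities = set(X['test'][:, 0]) | set(X['test'][:, 2])
assert test_entities.issubset(train_entities)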
#Check the sizes of the training and test sets
print('Train set size: ', X['train'].shape)
print('Test set size: ', X['test'].shape)
Output:
Train set size:  (2223, 3)
Test set size:  (952, 3)
- Instantiate the ComplEx model
ce_model = ComplEx(batches_count=100,
                   seed=0,
                   epochs=200,
                   k=150,  #dimensionality of the embedding space
                   eta=5,  #number of negative triples generated for each positive triple during training
                   optimizer='adam',  #Adam optimizer
                   optimizer_params={'lr': 1e-3},  #learning rate
                   loss='multiclass_nll',  #loss function
                   regularizer='LP',  #Lp regularization; here we specify p=2 for L2 regularization
                   regularizer_params={'p': 2, 'lambda': 1e-5},
                   verbose=True)
- Fit the model to training data
ce_model.fit(X['train'], early_stopping=False)
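fit() also supports early stopping. Below is a hedged sketch of what such a call could look like; the parameter keys follow the AmpliGraph 1.x API, but the validation slice and thresholds are assumptions, not values from the original tutorial:

#Optional: stop training when validation MRR stops improving (sketch)
ce_model.fit(X['train'],
             early_stopping=True,
             early_stopping_params={
                 'x_valid': X['test'][:100],  #triples used for validation (assumed slice)
                 'criteria': 'mrr',           #early-stopping criterion
                 'burn_in': 100,              #epochs before checks begin
                 'check_interval': 20,        #epochs between checks
                 'stop_interval': 4           #consecutive worse checks before stopping
             })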
- Evaluate the embedding model on test data
test_rank = evaluate_performance(X['test'],
                                 model=ce_model,
                                 use_default_protocol=True,  #corrupt subject and object separately while evaluating
                                 verbose=True)
The evaluate_performance() method computes the rank at which each test-set triple was found when the model performed link prediction.
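From these ranks you can compute the standard knowledge-graph metrics; AmpliGraph ships helpers for this. A short example (not part of the original walkthrough):

from ampligraph.evaluation import mr_score, mrr_score, hits_at_n_score

print('MR:', mr_score(test_rank))                    #mean rank
print('MRR:', mrr_score(test_rank))                  #mean reciprocal rank
print('Hits@10:', hits_at_n_score(test_rank, n=10))  #fraction of test triples ranked in the top 10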
- Create some unseen statements for new links prediction
unseen_links = np.array([
    ['Jorah Mormont', 'SPOUSE', 'Daenerys Targaryen'],
    ["King's Landing", 'SEAT_OF', 'House Lannister of Casterly Rock'],
    ['Daenerys Targaryen', 'SPOUSE', 'Jon Snow'],
    ['House Stark of Winterfell', 'IN_REGION', 'The North'],
    ['House Tyrell of Highgarden', 'IN_REGION', 'Beyond the Wall'],
    ['Brandon Stark', 'ALLIED_WITH', 'House Lannister of Casterly Rock'],
    ['House Hutcheson', 'SWORN_TO', 'House Tyrell of Highgarden'],
    ['Daenerys Targaryen', 'ALLIED_WITH', 'House Lannister of Casterly Rock'],
    ['Robert I Baratheon', 'PARENT_OF', 'Myrcella Baratheon'],
    ['Cersei Lannister', 'PARENT_OF', 'Brandon Stark'],
    ['Missandei', 'SPOUSE', 'Grey Worm'],
])
- Rank the unseen triples by applying the embedding model
ranks_unseen = evaluate_performance(unseen_links,
                                    model=ce_model,
                                    corrupt_side='s+o',  #corrupt subject and object separately while evaluating
                                    use_default_protocol=False,
                                    verbose=True)
- Make predictions for the unseen links
sc = ce_model.predict(unseen_links)
- Convert the predicted scores for the unseen statements into probabilities in the range 0-1 using the logistic sigmoid (expit() from scipy.special, imported earlier)
probability = expit(sc)
- Display predicted score and probability of each of the unseen links.
pd.DataFrame(list(zip([' '.join(i) for i in unseen_links],
                      ranks_unseen,
                      np.squeeze(sc),
                      np.squeeze(probability))),
             columns=['new link', 'rank', 'score', 'probability']).sort_values('score')
Output:
- Visualize the knowledge graph embedding using Tensorboard
create_tensorboard_visualizations(ce_model, 'Knowledge_Graph_Embeddings')
The ‘Knowledge_Graph_Embeddings’ directory should now have several files as follows:
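To actually view the embeddings, point TensorBoard at that directory; in Colab this can be done with the notebook magics (assuming the TensorBoard notebook extension is available):

%load_ext tensorboard
%tensorboard --logdir Knowledge_Graph_Embeddings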
Embeddings Visualization Output:
- Code source: Official tutorial
- Google Colab notebook of the above implementation