Guide To AmpliGraph: A Machine Learning Library For Knowledge Graphs

AmpliGraph

AmpliGraph is a TensorFlow-based open-source library developed by Accenture Labs for predicting links between concepts in knowledge graphs. It is a collection of neural ML models for statistical relational learning (SRL) (also called Relational Machine Learning) – a subdiscipline of AI/ML which deals with supervised learning on knowledge graphs.

Before going into the details of AmpliGraph, let us have a quick look at what a knowledge graph means.

What is a knowledge graph?

A knowledge graph is a representation of how the various entities of a system (e.g. objects, individuals, abstract concepts, events) are interlinked. There is no single precise definition of a knowledge graph. In simple terms, as one GitHub repository puts it, it is a graph representing distinct entities and the relationships among them. It enables data integration and analysis by providing context to a system’s associated data. Following is an example of a knowledge graph:

knowledge graph

Image source: GitHub

A graph is represented by a set of nodes representing entities and connecting edges showing the relationships among them. It can be homogeneous (e.g. a social network of people and their connections, where all entities are of a common type) or heterogeneous (e.g. the graph of a university with different types of entities like students, professors and departments, and relations like ‘studies-at’, ‘teaches-at’ and so on). Besides, a graph can be a ‘multigraph’, in which there can be multiple directed edges between one or more pairs of nodes, some of which can even form loops.

The university graph, mentioned as a heterogeneous graph in the example above, conveys meaningful information (known as ‘semantics’) through its entities and their associated relations.

Now that we know terms like ‘heterogeneous graph’, ‘multigraph’ and ‘semantics’, we can define a knowledge graph as “a heterogeneous multigraph in which entities and relations have semantics specific to a particular domain”.
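In practice, a knowledge graph is commonly stored as a list of (subject, relation, object) triples, which is also the shape of the dataset used later in this article. The following is a minimal sketch in plain Python; the university facts and names in it are hypothetical and only illustrate the definition above.

```python
from collections import defaultdict

# Hypothetical university facts, each stored as a (subject, relation, object) triple.
triples = [
    ("Alice", "studies-at", "Physics Department"),
    ("Bob", "teaches-at", "Physics Department"),
    ("Bob", "advises", "Alice"),
    ("Bob", "co-authors-with", "Alice"),  # second edge between the same pair: a multigraph
]

# Adjacency view: node -> list of (relation, neighbour) pairs for outgoing edges.
outgoing = defaultdict(list)
for subject, relation, obj in triples:
    outgoing[subject].append((relation, obj))

# Distinct entities (nodes) and relation types (edge labels) in the graph.
entities = sorted({t[0] for t in triples} | {t[2] for t in triples})
relations = sorted({t[1] for t in triples})

print(entities)
print(relations)
print(outgoing["Bob"])
```

The graph is heterogeneous because its nodes are of different types (people and a department), and the relation names carry the domain-specific semantics.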

Overview of AmpliGraph

The AmpliGraph library provides ML models that can create knowledge graph embeddings (KGEs), which are low-dimensional vector representations of the entities and the relations among them that constitute a knowledge graph.

Consider the following knowledge graph and its corresponding KGE to understand what AmpliGraph does:

AmpliGraph img1

Image source: GitHub

Some pairs of entities in the above knowledge graph have no direct link through a given relation, e.g. nothing in the graph shows whether ‘Acme Inc’ and ‘Liverpool’ are connected through the ‘basedIn’ relation. AmpliGraph combines the above KGE with a scoring function to make predictions about such new links.

For example, it predicts that there is an 85% probability of Acme Inc being based in Liverpool, which can be represented as:

AmpliGraph img2

Image source: GitHub
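To make the idea of a scoring function more concrete, here is a minimal sketch of a ComplEx-style score (ComplEx is the model used later in this article) in plain Python. The embedding values and the entity ‘London’ are made up for illustration; a trained model learns such vectors from data, and this is not AmpliGraph's internal implementation.

```python
def complex_score(subj, rel, obj):
    """ComplEx-style score: real part of sum_k subj[k] * rel[k] * conj(obj[k])."""
    return sum(s * r * o.conjugate() for s, r, o in zip(subj, rel, obj)).real

# Made-up 2-dimensional complex embeddings (illustrative values only).
acme = [1 + 1j, 0.5 - 0.5j]
based_in = [1 + 0j, 2 + 0j]
liverpool = [1 + 1j, 0.5 - 0.5j]
london = [-1 + 0j, 0 + 1j]

score_liverpool = complex_score(acme, based_in, liverpool)
score_london = complex_score(acme, based_in, london)

print(score_liverpool, score_london)
```

With these toy vectors, the triple (Acme Inc, basedIn, Liverpool) receives a higher score than (Acme Inc, basedIn, London), so a model holding these embeddings would consider the first link more plausible.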

Modules of AmpliGraph

Highlighting features of AmpliGraph

  • It can run on CPUs as well as GPUs to speed-up the training process
  • Its APIs reduce the amount of code required to make predictions on knowledge graphs
  • AmpliGraph base estimators are extensible

Practical implementation

Here’s a demonstration of using AmpliGraph to discover novel relations in a Game of Thrones (GoT) knowledge graph. The dataset is downloaded in the code below, and the graph is available on GitHub.

The condensed dataset looks something like this:

AmpliGraph dataset

While the graph appears as follows:

AmpliGraph KG

Image source: GitHub

The code here has been implemented using Google Colab with Python 3.7.10 and AmpliGraph 1.3.2. We have used the ComplEx (Complex Embeddings) model for KGE. A step-wise explanation of the code is as follows:

  1. Install Ampligraph library

!pip install ampligraph

  2. Import required libraries
import ampligraph
import numpy as np
import pandas as pd
import requests  # module for making HTTP requests
from ampligraph.datasets import load_from_csv
from ampligraph.evaluation import train_test_split_no_unseen
from ampligraph.latent_features import ComplEx
from ampligraph.evaluation import evaluate_performance
from ampligraph.utils import create_tensorboard_visualizations
  3. Download the dataset
# Define the URL from which to download the data
data_url = 'https://ampligraph.s3-eu-west-1.amazonaws.com/datasets/GoT.csv'
# Open a file called 'GoT.csv' in binary write mode and write the contents of
# the downloaded dataset into it
with open('GoT.csv', 'wb') as f:
    f.write(requests.get(data_url).content)
# Load the knowledge graph from the GoT.csv file using load_from_csv()
data = load_from_csv('.', 'GoT.csv', sep=',')
  4. Get the unique entities present in the dataset
ent = np.unique(np.concatenate([data[:, 0], data[:, 2]]))
ent      # display those entities

Output:

array(['Abelar Hightower', 'Acorn Hall', 'Addam Frey', ..., 'the Antlers','the Paps', 'unnamed tower'], dtype=object)

Similarly, get the names of the unique relations among the entities

rel = np.unique(data[:, 1])
rel      # display names of those relations

Output:

 array(['ALLIED_WITH', 'BRANCH_OF', 'FOUNDED_BY', 'HEIR_TO', 'IN_REGION',
        'LED_BY', 'PARENT_OF', 'SEAT_OF', 'SPOUSE', 'SWORN_TO'],
        dtype=object) 
  5. Perform train-test split to create training and test sets from the dataset
# Split the data in a 70-30 train-test ratio; compute the number of test samples accordingly
num_test_samples = int(len(data) * (30 / 100))
# Split 'data' into training and test sets such that the test set has
# 'num_test_samples' samples and there are no duplicate entries
X = {}
X['train'], X['test'] = train_test_split_no_unseen(data,
                                                   test_size=num_test_samples,
                                                   seed=0,
                                                   allow_duplication=False)

train_test_split_no_unseen() creates a test set that contains no unseen items, i.e. it includes only those entities and relations that also appear in the training set.
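The idea behind such a split can be sketched in plain Python as follows. This is a simplified greedy illustration, not AmpliGraph's actual implementation; the function name split_no_unseen and the toy triples are hypothetical.

```python
import random
from collections import Counter

def split_no_unseen(triples, test_size, seed=0):
    """Greedy sketch: move a triple to the test set only if every entity and
    relation in it still occurs in at least one remaining training triple."""
    entity_counts = Counter()
    relation_counts = Counter()
    for s, p, o in triples:
        entity_counts.update([s, o])
        relation_counts[p] += 1

    order = list(range(len(triples)))
    random.Random(seed).shuffle(order)

    test_idx = set()
    for i in order:
        if len(test_idx) == test_size:
            break
        s, p, o = triples[i]
        needed = Counter([s, o])  # also handles s == o (self-loops)
        removable = relation_counts[p] > 1 and all(
            entity_counts[e] > n for e, n in needed.items()
        )
        if removable:
            test_idx.add(i)
            entity_counts.subtract(needed)
            relation_counts[p] -= 1

    if len(test_idx) < test_size:
        raise ValueError("Not enough removable triples for the requested test size")

    train = [t for i, t in enumerate(triples) if i not in test_idx]
    test = [triples[i] for i in sorted(test_idx)]
    return train, test

# Toy graph: entity 'D' occurs only once, so its triple must stay in training.
toy = [("A", "r", "B"), ("B", "r", "C"), ("A", "r", "C"),
       ("C", "q", "A"), ("C", "q", "B"), ("D", "r", "A")]
train, test = split_no_unseen(toy, test_size=2, seed=0)
print(train)
print(test)
```

Such a constraint matters because an embedding model has no vector for an entity or relation it never saw during training, so it could not score test triples that contain one.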

 #Check sizes of training and test sets
 print('Train set size: ', X['train'].shape)
 print('Test set size: ', X['test'].shape) 

Output:

 Train set size:  (2223, 3)
 Test set size:  (952, 3) 
  6. Instantiate the ComplEx model
ce_model = ComplEx(batches_count=100,
                   seed=0,
                   epochs=200,
                   k=150,                          # dimensionality of the embedding space
                   eta=5,                          # number of negative triples generated for each positive triple during training
                   optimizer='adam',               # Adam optimizer
                   optimizer_params={'lr': 1e-3},  # learning rate
                   loss='multiclass_nll',          # loss function
                   regularizer='LP',               # Lp regularization; p=2 below selects L2 regularization
                   regularizer_params={'p': 2, 'lambda': 1e-5},
                   verbose=True)
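The eta parameter above controls how many negative (corrupted) triples are generated per positive triple. A rough sketch of what such corruption looks like is given below in plain Python; the helper generate_corruptions is hypothetical, and AmpliGraph's own sampling strategy may differ in its details.

```python
import random

def generate_corruptions(triple, entities, eta, rng):
    """Return `eta` corrupted copies of `triple`, each with either the subject
    or the object replaced by a randomly chosen different entity."""
    s, p, o = triple
    negatives = []
    for _ in range(eta):
        if rng.random() < 0.5:
            replacement = rng.choice([e for e in entities if e != s])
            negatives.append((replacement, p, o))
        else:
            replacement = rng.choice([e for e in entities if e != o])
            negatives.append((s, p, replacement))
    return negatives

# A small entity list for illustration, using names that appear in this article.
entities = ['Jorah Mormont', 'Daenerys Targaryen', 'Jon Snow',
            'Cersei Lannister', 'Brandon Stark']
positive = ('Missandei', 'SPOUSE', 'Grey Worm')
negatives = generate_corruptions(positive, entities, eta=5, rng=random.Random(0))
for n in negatives:
    print(n)
```

Note that a randomly corrupted triple may occasionally be a true fact; this simple sketch does not filter such cases out.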
  7. Fit the model to training data

ce_model.fit(X['train'], early_stopping = False)

  8. Evaluate the embedding model on test data
test_rank = evaluate_performance(X['test'], model=ce_model,
                                 # default protocol: corrupt subject and object separately while evaluating
                                 use_default_protocol=True,
                                 verbose=True)

The evaluate_performance() method computes the rank at which each test set triple was found when the model performed link prediction.
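Ranks are usually summarised with metrics such as mean reciprocal rank (MRR) and Hits@N. AmpliGraph provides mrr_score and hits_at_n_score in ampligraph.evaluation for this purpose; the plain-Python sketch below only illustrates what these metrics compute, and the ranks in it are made-up values rather than results of the model above.

```python
def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank over all ranked triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_n(ranks, n):
    """Fraction of triples that were ranked within the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

# Made-up ranks for five test triples (1 is the best possible rank).
example_ranks = [1, 2, 4, 10, 50]

print("MRR:", mrr(example_ranks))
print("Hits@3:", hits_at_n(example_ranks, 3))
print("Hits@10:", hits_at_n(example_ranks, 10))
```

Both metrics lie between 0 and 1, and higher values indicate that true triples tend to be ranked nearer the top.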

  9. Create some unseen statements for new links prediction
unseen_links = np.array([
    ['Jorah Mormont', 'SPOUSE', 'Daenerys Targaryen'],
    ["King's Landing", 'SEAT_OF', 'House Lannister of Casterly Rock'],
    ['Daenerys Targaryen', 'SPOUSE', 'Jon Snow'],
    ['House Stark of Winterfell', 'IN_REGION', 'The North'],
    ['House Tyrell of Highgarden', 'IN_REGION', 'Beyond the Wall'],
    ['Brandon Stark', 'ALLIED_WITH', 'House Lannister of Casterly Rock'],
    ['House Hutcheson', 'SWORN_TO', 'House Tyrell of Highgarden'],
    ['Daenerys Targaryen', 'ALLIED_WITH', 'House Lannister of Casterly Rock'],
    ['Robert I Baratheon', 'PARENT_OF', 'Myrcella Baratheon'],
    ['Cersei Lannister', 'PARENT_OF', 'Brandon Stark'],
    ['Missandei', 'SPOUSE', 'Grey Worm'],
])
  10. Rank the unseen triples by applying the embedding model
ranks_unseen = evaluate_performance(
    unseen_links,
    model=ce_model,
    # 's+o' corrupts both the subject and the object side and yields one rank per triple
    corrupt_side='s+o',
    use_default_protocol=False,
    verbose=True
)
  11. Make predictions for the unseen links

sc = ce_model.predict(unseen_links)

  12. Convert the predicted scores for unseen statements into probabilities in the range 0-1

from scipy.special import expit  # logistic sigmoid function
probs = expit(sc)
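The expit function is the logistic sigmoid, 1 / (1 + exp(-x)), which squashes any real-valued score into the range 0-1. A minimal plain-Python sketch of it is shown below; the function name logistic and the sample scores are ours, chosen only for illustration.

```python
import math

def logistic(x):
    """Numerically stable logistic sigmoid: 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow for large negative x
    return z / (1.0 + z)

# Made-up scores: negative, zero and positive.
for score in (-2.0, 0.0, 2.0):
    print(score, round(logistic(score), 4))
```

A score of 0 maps to 0.5, and larger scores map closer to 1. Values obtained this way are best read as a rough, uncalibrated indication of how plausible a link is.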

  13. Display the rank, predicted score and probability of each of the unseen links
pd.DataFrame(list(zip([' '.join(i) for i in unseen_links],
                      ranks_unseen,
                      np.squeeze(sc),
                      np.squeeze(probs))),
             columns=['new link', 'rank', 'score',
                      'probability']).sort_values("score")

Output:

AmpliGraph output
  14. Visualize the knowledge graph embedding using TensorBoard

create_tensorboard_visualizations(ce_model, 'Knowledge_Graph_Embeddings')

The ‘Knowledge_Graph_Embeddings’ directory should now have several files as follows:

AmpliGraph output2

Embeddings Visualization Output:

AmpliGraph output3
