Guide To PyMongo: A Python Wrapper For MongoDB

A comprehensive guide to working with MongoDB databases using Python and PyMongo.

MongoDB is a NoSQL database that stores data in JSON-like documents with flexible schemas instead of the traditional table-based database structure. The document model maps naturally to objects in application code, which makes it easier to work with. It has a rich query language that supports dynamic queries on documents. MongoDB also has its own aggregation pipeline and support for map-reduce, eliminating the need for complex external data pipelines. PyMongo is a Python library that contains tools for interacting with MongoDB databases. 

To install PyMongo from PyPI:


python -m pip install pymongo

For this article, we will be working with a local MongoDB instance. Instructions for downloading and installing can be found in the official MongoDB documentation. Additionally, I recommend installing MongoDB Compass to have a GUI to explore the data and see the changes made by the code.

Making a Connection with the MongoDB instance

When working with MongoDB databases, or any database for that matter, the first thing we need to do is to make a connection. You can do so using the MongoClient() method:

 import pymongo
 client = pymongo.MongoClient() 

This establishes a connection to the default host and port. We can also specify the host and port explicitly:

client = pymongo.MongoClient('localhost', 27017)

Or use the MongoDB URI format:

 DEFAULT_CONNECTION_URL = "mongodb://localhost:27017/"
 client = pymongo.MongoClient(DEFAULT_CONNECTION_URL) 

You can use this default URI to connect to the local instance in Compass.


Now that we have established a connection to the local instance, let’s list the existing databases using the list_database_names() method:

 client.list_database_names()
 ['Population', 'admin', 'config', 'dblp', 'local'] 

A single MongoDB instance can have multiple databases, as can be seen above. To access one of these databases, you can either use attribute-style access on the client object:

local_db = client.local

Or the dictionary-style access:

local_db = client['local']


Collections are exactly that: collections of documents stored in MongoDB. They can be thought of as the MongoDB equivalent of tables. You can list the collections of a database using the list_collection_names() method.


As with databases, you can use either attribute-style or dictionary-style access to get a collection:

 collection = local_db.startup_log 
 # or
 # collection = local_db['startup_log'] 
Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'local'), 'startup_log')

One thing to note about the databases and collections in MongoDB is that they are created lazily, i.e., they are created when the first document is inserted into them. Before we illustrate this, let’s briefly go through what a document is in MongoDB.


MongoDB stores data using JSON-style documents made up of key-value pairs. Documents are the MongoDB equivalent of rows, just more flexible. PyMongo uses dictionaries to represent documents. As an example, the following dictionary may be used to store a LinkedIn connection:

 document = {
     "firstname": "Di",
     "lastname": "Croix",
     "city": "Copenhagen",
     "country": "Luxembourg",
     "countryCode": "WS",
     "companyName": "XYZ",
     "email": "",
     "connections": ["Gabi", "Yolane", "Molli", "Elmira", "Yvonne"]
 }

MongoDB CRUD Operations

Create Operations

Insert operations add document(s) to a collection, and if the collection does not exist the insert operation will create it. You can also create a collection by explicitly using the create_collection() method.

Let’s go back to the lazy creation of collections and databases. 

Firstly we’ll need to create functions for verifying the existence of databases and collections.

 def check_existence_DB(db_name, client):
     """Verifies the existence of a database."""
     list_of_dbs = client.list_database_names()
     if db_name in list_of_dbs:
         print(f"'{db_name}' exists")
     else:
         print(f"'{db_name}' does not exist")

 def check_existence_collection(collection_name, db_name, client):
     """Verifies the existence of a collection in a database."""
     db = client[db_name]
     collection_list = db.list_collection_names()
     if collection_name in collection_list:
         print(f"Collection: '{collection_name}' in Database: '{db_name}' exists")
     else:
         print(f"Collection: '{collection_name}' in Database: '{db_name}' does not exist") 

Now let’s try to create a database and a collection.

 dataBase = client["exampleDB"]
 check_existence_DB("exampleDB", client)
 collection = dataBase["ExampleCollection"]
 check_existence_collection("ExampleCollection", "exampleDB" , client) 
 'exampleDB' does not exist
 Collection: 'ExampleCollection' in Database: 'exampleDB' does not exist 

None of the above commands has actually performed any operation on the local MongoDB instance. Inserting a document using insert_one() will create both the database and the collection:

 collection.insert_one(document)
 check_existence_DB("exampleDB", client)
 check_existence_collection("ExampleCollection", "exampleDB" , client) 
 'exampleDB' exists
 Collection: 'ExampleCollection' in Database: 'exampleDB' exists 

In addition to inserting a single document, you can also perform bulk creation operations using insert_many():

 documents = [{
     "firstname": "Tani",
     "lastname": "Waite",
     "city": "Rio de Janeiro",
     "country": "Barbados",
     "companyName": "XYZ",
     "email": "",
     "connections": ["Dagmar", "Deane", "Esmeralda", "Bertine", "Flo"]
 }, {
     "firstname": "Trudie",
     "lastname": "Kermit",
     "city": "Road Town",
     "country": "Bahrain",
     "companyName": "XYZ",
     "email": "",
     "connections": ["Fidelia", "Letizia", "Winifred", "Odessa", "Talya"]
 }, {
     "firstname": "Calla",
     "lastname": "Junie",
     "city": "Santo Domingo",
     "country": "Ghana",
     "companyName": "XYZ",
     "email": "",
     "connections": ["Selma", "Marita", "Lauryn", "Max", "Dorene"]
 }]
 res = collection.insert_many(documents) 

Each of the documents is assigned a unique ObjectId that acts as its primary key. You can access the _id values of the inserted documents using the inserted_ids attribute.

 inserted_IDs = res.inserted_ids
 for unique_id in inserted_IDs:
     print(unique_id)

You can override the default unique ObjectId by defining your own _id:

 team = dataBase["Team"]
 list_of_records_with_id = [{
     "_id": 511000000000,
     "firstname": "Alejandra",
     "lastname": "Loeb",
     "city": "Horta (Azores)",
     "age": 20,
     "country": "Cameroon"
 }, {
     "_id": 831023809584,
     "firstname": "Blinni",
     "lastname": "Jacqui",
     "city": "Cardiff",
     "age": 42,
     "country": "Uganda"
 }, {
     "_id": 741224814953,
     "firstname": "Kimmy",
     "middlename": "Walter",
     "lastname": "Florina",
     "city": "New Orleans",
     "age": 46,
     "country": "Hungary"
 }]
 team_records = team.insert_many(list_of_records_with_id) 

Ensure the _id of the records you insert is unique; otherwise, you’ll encounter a BulkWriteError.
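Since MongoDB enforces _id uniqueness on the server, a simple client-side check can catch duplicates before the round trip. The helper below is a hypothetical sketch, not part of PyMongo:

```python
# Hypothetical helper: detect duplicate _id values in a batch before
# calling insert_many (MongoDB itself enforces uniqueness server-side
# and raises BulkWriteError when a duplicate _id is inserted).
def has_duplicate_ids(records):
    ids = [record["_id"] for record in records if "_id" in record]
    return len(ids) != len(set(ids))

batch = [
    {"_id": 511000000000, "firstname": "Alejandra"},
    {"_id": 511000000000, "firstname": "Blinni"},  # duplicate _id
]
print(has_duplicate_ids(batch))  # True
```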

Read Operations

The most basic read operation in MongoDB is the find_one() method; it returns a single document matching a query. Let’s use this to print the first document from the team collection:

 print(team.find_one())
 {'_id': 511000000000, 'firstname': 'Alejandra', 'lastname': 'Loeb', 'city': 'Horta (Azores)', 'age': 20, 'country': 'Cameroon'} 

To query more than one document, you can use the find() method. The find() method returns an iterable cursor object with all documents matching the query. Let’s say you are interested in documents from the collection team with an age greater than 25.

 query = {"age": {"$gt": 25}}
 results = team.find(query)
 for data in results:
     print(data)
 {'_id': 831023809584, 'firstname': 'Blinni', 'lastname': 'Jacqui', 'city': 'Cardiff', 'age': 42, 'country': 'Uganda'}
 {'_id': 741224814953, 'firstname': 'Kimmy', 'middlename': 'Walter', 'lastname': 'Florina', 'city': 'New Orleans', 'age': 46, 'country': 'Hungary'} 
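In PyMongo, query filters are plain Python dictionaries, so comparison operators compose naturally. A few common filter shapes (the field names are just examples drawn from this article's collections):

```python
# MongoDB query filters are ordinary Python dicts in PyMongo.
gt_query = {"age": {"$gt": 25}}                         # age > 25
range_query = {"age": {"$gte": 25, "$lt": 45}}          # 25 <= age < 45
in_query = {"country": {"$in": ["Uganda", "Hungary"]}}  # country in a list
and_query = {"city": "Cardiff", "age": {"$gt": 25}}     # implicit AND
or_query = {"$or": [{"city": "Cardiff"}, {"age": {"$lt": 30}}]}  # explicit OR

# Any of these can be passed to find()/find_one(), e.g. team.find(range_query).
print(range_query)
```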

You can use the limit() method to specify the number of documents you want to display.

 import pprint
 for record in team.find().limit(3):
     pprint.pprint(record)
 {'_id': 511000000000,
  'age': 20,
  'city': 'Horta (Azores)',
  'country': 'Cameroon',
  'firstname': 'Alejandra',
  'lastname': 'Loeb'}
 {'_id': 831023809584,
  'age': 42,
  'city': 'Cardiff',
  'country': 'Uganda',
  'firstname': 'Blinni',
  'lastname': 'Jacqui'}
 {'_id': 741224814953,
  'age': 46,
  'city': 'New Orleans',
  'country': 'Hungary',
  'firstname': 'Kimmy',
  'lastname': 'Florina',
  'middlename': 'Walter'} 

Update Operations

Like the create and read operations, MongoDB provides two methods for update operations: update_one() for updating a single document and update_many() for updating all documents matching the criteria.

 import pprint
 present_data = {'firstname': 'Di'}
 new_data = {"$set": {'firstname': 'Diana'}}
 dataBase["ExampleCollection"].update_one(present_data, new_data)
 all_record = collection.find_one()
 pprint.pprint(all_record)
 {'_id': ObjectId('60863129fb38100e97ee5a06'),
  'city': 'Copenhagen',
  'companyName': 'XYZ',
  'connections': ['Gabi', 'Yolane', 'Molli', 'Elmira', 'Yvonne'],
  'country': 'Luxembourg',
  'countryCode': 'WS',
  'email': '',
  'firstname': 'Diana',
  'lastname': 'Croix'} 
 collection = dataBase["ExampleCollection"]
 present_data = {'companyName': 'XYZ'}
 new_data = {"$set": {'companyName': ''}}
 collection.update_many(present_data, new_data)
 all_record = collection.find().limit(2)
 for record in all_record:
     pprint.pprint(record)
 {'_id': ObjectId('60863129fb38100e97ee5a06'),
  'city': 'Copenhagen',
  'companyName': '',
  'connections': ['Gabi', 'Yolane', 'Molli', 'Elmira', 'Yvonne'],
  'country': 'Luxembourg',
  'countryCode': 'WS',
  'email': '',
  'firstname': 'Diana',
  'lastname': 'Croix'}
 {'_id': ObjectId('6086312ffb38100e97ee5a07'),
  'city': 'Rio de Janeiro',
  'companyName': '',
  'connections': ['Dagmar', 'Deane', 'Esmeralda', 'Bertine', 'Flo'],
  'country': 'Barbados',
  'email': '',
  'firstname': 'Tani',
  'lastname': 'Waite'} 
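$set is only one of several update operators; the documents passed to update_one()/update_many() are again plain dicts. A few common shapes (field names follow this article's examples):

```python
# Common update documents for update_one()/update_many().
set_update = {"$set": {"companyName": "ABC"}}      # set (or add) a field
inc_update = {"$inc": {"age": 1}}                  # increment a numeric field
unset_update = {"$unset": {"middlename": ""}}      # remove a field
push_update = {"$push": {"connections": "Nadia"}}  # append to an array

# e.g. team.update_many({"age": {"$gte": 40}}, inc_update)
print(inc_update)
```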

Delete Operations

To delete documents from a collection you can either use delete_one() or delete_many(). Let’s say you want to delete Kimmy’s data from the collection:

 for record in team.find():
     print(record)
 {'_id': 511000000000, 'firstname': 'Alejandra', 'lastname': 'Loeb', 'city': 'Horta (Azores)', 'age': 20, 'country': 'Cameroon'}
 {'_id': 831023809584, 'firstname': 'Blinni', 'lastname': 'Jacqui', 'city': 'Cardiff', 'age': 42, 'country': 'Uganda'}
 {'_id': 741224814953, 'firstname': 'Kimmy', 'middlename': 'Walter', 'lastname': 'Florina', 'city': 'New Orleans', 'age': 46, 'country': 'Hungary'} 

 query_to_delete = {"firstname": "Kimmy"}
 team.delete_one(query_to_delete)

To delete the data for people over the age of 40: 

 query_to_delete_multiple = {"age": {"$gte": 40}}
 team.delete_many(query_to_delete_multiple)
 for record in team.find():
     print(record)
 {'_id': 511000000000, 'firstname': 'Alejandra', 'lastname': 'Loeb', 'city': 'Horta (Azores)', 'age': 20, 'country': 'Cameroon'} 

To delete all the documents present in the collection you can just pass an empty dictionary:

 team.delete_many({})

To drop an entire collection, you can use the drop() method:

 team.drop()


Aggregation Operations

MongoDB supports three ways of performing aggregation: single-purpose aggregation methods, the aggregation pipeline, and map-reduce functions.

The following example illustrates how to create an aggregation pipeline using the aggregate() method. We will calculate the total occurrences of each pet in the pets array across all documents in the Students collection and sort them by count. You can’t directly perform operations on arrays, so you’ll need to unwind them first using the $unwind stage. A list of all aggregation stages can be found in the MongoDB documentation. After unwinding the array, the documents are grouped by pet, summed up, and finally sorted by count.

 students = dataBase.create_collection("Students")
 result = students.insert_many([
     {"Roll no": 1, "pets": ["hamster", "dog"]},
     {"Roll no": 2, "pets": ["dog"]},
     {"Roll no": 3, "pets": ["cat"]},
     {"Roll no": 4, "pets": ["cat", "lizard"]}
 ])

 from bson.son import SON
 pipeline = [{"$unwind": "$pets"},
             {"$group": {"_id": "$pets", "count": {"$sum": 1}}},
             {"$sort": SON([("count", -1), ("_id", -1)])}]
 res = list(students.aggregate(pipeline))
 print(res)
 [{'_id': 'dog', 'count': 2}, {'_id': 'cat', 'count': 2}, {'_id': 'lizard', 'count': 1}, {'_id': 'hamster', 'count': 1}] 

Another way of doing this aggregation is by using the map_reduce() method. You’ll have to write map and reduce functions to count the occurrences of each pet across the collection.

The map function emits a (pet, 1) pair for each pet; the reduce function adds up all the emitted values for a particular pet. 

 from bson.code import Code
 map = Code("""
                function () {
                  this.pets.forEach(function(z) {
                    emit(z, 1);
                  });
                }
                """)
 reduce = Code("""
                 function (key, values) {
                   var total = 0;
                   for (var i = 0; i < values.length; i++) {
                     total += values[i];
                   }
                   return total;
                 }
                 """)
 result = students.map_reduce(map, reduce, "myresults")
 for doc in result.find().sort("_id"):
     print(doc)
 {'_id': 'cat', 'value': 2.0}
 {'_id': 'dog', 'value': 2.0}
 {'_id': 'hamster', 'value': 1.0}
 {'_id': 'lizard', 'value': 1.0} 

Aggregation pipelines provide better performance than map-reduce operations, and all map-reduce expressions can be rewritten using aggregation operators. For map-reduce operations that require custom functionality, MongoDB provides the $accumulator and $function operators. 
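As a rough sketch (not run against a server here; $accumulator requires MongoDB 4.4+ with server-side JavaScript enabled), the pet count from the map-reduce example could be expressed with an $accumulator inside a $group stage. The pipeline itself is just nested Python dicts:

```python
# Sketch: pet count via $accumulator (MongoDB 4.4+; "lang" must be "js").
# The stage definitions are plain dicts; executing them still needs a
# running server, e.g. list(students.aggregate(pipeline)).
pipeline = [
    {"$unwind": "$pets"},
    {"$group": {
        "_id": "$pets",
        "count": {"$accumulator": {
            "init": "function() { return 0; }",
            "accumulate": "function(state) { return state + 1; }",
            "accumulateArgs": [],
            "merge": "function(a, b) { return a + b; }",
            "lang": "js",
        }},
    }},
    {"$sort": {"count": -1}},
]
print(pipeline[1]["$group"]["count"]["$accumulator"]["lang"])  # js
```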


To learn more about MongoDB and PyMongo, refer to the official MongoDB and PyMongo documentation.

Aditya Singh
A machine learning enthusiast with a knack for finding patterns. In my free time, I like to delve into the world of non-fiction books and video essays.
