3.2. Building a MongoDB Database

This tutorial will guide you through the process of taking an HDF5 file and generating the MongoDB database for that file.

3.2.1. Learning Objectives

Learn how to generate a MongoDB database from an HDF5 file.
Understand the format of the MongoDB entries.
Understand how to add information about the molecules to the database.

3.2.2. Required Files

The files for this tutorial are located in examples/DatabaseGeneration and examples/DatabaseGeneration/inputs.

t8_with_smiles_suggested_charge.hdf5 : This is the HDF5 file that you will be using to generate the MongoDB database. It contains molecular data, including SMILES strings and other relevant information.

3.2.3. Tutorial

Building the database works much like the previous example, Adding SMILES codes to HDF5 Files.

First, we will import the libraries and start the MongoDB client.

from pymongo import MongoClient
from pprint import pprint

from pharmaforge.database import DataBase
from pharmaforge.recipes.GeneralDatabase import GeneralRecipe
from pharmaforge.dbutils.create_mongodb_collections import process_hdf5_folder 
from pharmaforge.dbutils.mongo_utils import add_field_to_all_documents


client = MongoClient('mongodb://localhost:27017/')

Next, we need to make sure that we start from a clean state. This is done by dropping the t8_qdpi collection from test_db if it already exists.

db = client["test_db"]
db.drop_collection("t8_qdpi")  # Drop the collection if it exists

In this example, we show how to build the database from a single HDF5 file; however, the same tools can be used to generate a database from a collection of HDF5 files.

process_hdf5_folder(
    folder_path="./inputs",
    database_name="test_db",
    level_of_theory="wB97M-D3(BJ)/def2-TZVPPD",
    data_source="tautobase"
)

This command does a few things. First, it looks in the folder given by folder_path (in this case "./inputs") and processes each HDF5 file it finds, pulling the documents from it. It then assigns these documents to the database named by database_name (test_db) and records the level_of_theory, which is the QM method used to produce the data. Lastly, the data_source key tells you where the data came from, in this case the tautobase.
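
To make the idea of "one set of metadata shared by every file in the folder" concrete, here is a minimal sketch. The helper tag_hdf5_files is hypothetical: it is not part of pharmaforge, it does not open the files, and it only illustrates how each HDF5 file might be paired with the level_of_theory and data_source values passed in.

```python
from pathlib import Path


def tag_hdf5_files(paths, level_of_theory, data_source):
    """Pair each HDF5 file in `paths` with the shared metadata.

    Hypothetical helper for illustration only: it is not part of
    pharmaforge and it does not read the file contents.
    """
    records = []
    for path in sorted(Path(p) for p in paths):
        if path.suffix != ".hdf5":
            continue  # skip anything that is not an HDF5 file
        records.append(
            {
                "source_file": path.name,
                "level_of_theory": level_of_theory,
                "data_source": data_source,
            }
        )
    return records
```

For the folder used in this tutorial, you could call it as tag_hdf5_files(Path("./inputs").iterdir(), "wB97M-D3(BJ)/def2-TZVPPD", "tautobase") and print the returned records.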

Now, we will access the collections and count the number of documents. To do this, we first load the database from the locally hosted client (which we just added in the previous step!) and then access the collection.

# Access the client
db = client["test_db"]


print("Now we build a collection, and add the data to it.")
print("Note - right now there is one collection, but you COULD have multiple.")
collections = {"t8_qdpi": db["t8_qdpi"],}




print("Now we can check the number of documents in the collection.")
for collection_name, collection in collections.items():
    count = collection.count_documents({})
    print(f"Collection '{collection_name}' contains {count} documents.")
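
If you do not want to list the collections by hand, the same count can be taken over every collection in the database. The sketch below assumes that db behaves like a pymongo Database (it uses list_collection_names and count_documents); the function name count_all_collections is our own and is not part of pharmaforge.

```python
def count_all_collections(db):
    """Map every collection name in `db` to its document count.

    `db` is expected to behave like a pymongo Database.
    """
    return {
        name: db[name].count_documents({})  # {} matches every document
        for name in sorted(db.list_collection_names())
    }
```

With the client from above, count_all_collections(client["test_db"]) returns a dictionary keyed by collection name.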

Now, we want to look at an entry. We can do this by accessing the collection and pulling the first entry.

# Accessing Entries in the Collection
print("Now let's check out a collection.")
collection = collections["t8_qdpi"]
print("First entry in the collection.")
entry = collection.find_one()
print(entry)
print("Note that this was a really messy output.")
print("pprint can help with that.")
pprint(entry)
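
Even with pprint, an entry that holds long arrays can fill the terminal. As a rough sketch, the hypothetical helper below (not part of pharmaforge) reduces one document to a short description per field, so you can see which keys are present without printing every value.

```python
def summarize_entry(entry, max_len=40):
    """Return {key: short description} for one document."""
    summary = {}
    for key, value in entry.items():
        if isinstance(value, (list, tuple)):
            # Long arrays are reported by length only.
            summary[key] = f"{type(value).__name__} of length {len(value)}"
        elif isinstance(value, dict):
            summary[key] = f"dict with keys {sorted(value)}"
        else:
            text = repr(value)
            if len(text) > max_len:
                text = text[:max_len] + "..."
            summary[key] = text
    return summary
```

Calling pprint(summarize_entry(entry)) on the entry retrieved above gives a compact overview of its fields.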

The database starts with relatively limited information, as you can see when you pull up an entry.

We want to add some additional information to the database. This is done by adding fields to the entries.

# Adding Data Fields
print(db.list_collection_names())
print("Now let's modify a field, by setting the spin to 0.")
result = add_field_to_all_documents("test_db", "t8_qdpi", "spin", 0)
print(result)
entry = collection.find_one()
print("Note that the spin is now set to 0.")
pprint(entry)
print(f"See, the spin is {entry['spin']} ")
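
For reference, the same kind of bulk update can be written directly against pymongo. This is a minimal sketch of the general technique (update_many with a $set operator); we have not confirmed that add_field_to_all_documents is implemented this way, and the function name below is our own.

```python
def set_field_on_all_documents(collection, field, value):
    """Set `field` to `value` on every document in `collection`.

    `collection` is expected to behave like a pymongo Collection.
    Returns the number of documents reported as modified.
    """
    # An empty filter {} matches every document in the collection.
    result = collection.update_many({}, {"$set": {field: value}})
    return result.modified_count
```

With the collection from this tutorial, set_field_on_all_documents(db["t8_qdpi"], "spin", 0) would set spin to 0 on every entry.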

And there you go! You’ve taken your first steps towards using the database software.

You can also do all of these steps at once by using a recipe and keyword arguments.

For instance:

print("Alternatively, we can do this automatically with a recipe.")

NewRecipe = GeneralRecipe("test_db", "mongodb://localhost:27017/", 
                          low_level=None,
                          high_level="wB97M-D3(BJ)/def2-TZVPPD",
                          base_model="DFTB3",
                          data_source="tautobase",
                          note="This was built from a recipe.")
NewRecipe.pprint_one_entry()

Note - the outputs for this example are all printed to the terminal.

3.2.4. Full Code

from pymongo import MongoClient
from pprint import pprint

from pharmaforge.database import DataBase
from pharmaforge.recipes.GeneralDatabase import GeneralRecipe
from pharmaforge.dbutils.create_mongodb_collections import process_hdf5_folder 
from pharmaforge.dbutils.mongo_utils import add_field_to_all_documents


client = MongoClient('mongodb://localhost:27017/')

db = client["test_db"]
db.drop_collection("t8_qdpi")  # Drop the collection if it exists


process_hdf5_folder(
    folder_path="./inputs",
    database_name="test_db",
    level_of_theory="wB97M-D3(BJ)/def2-TZVPPD",
    data_source="tautobase"
)

# This script processes HDF5 files in the specified folder and creates MongoDB collections.
# It makes it accessible on the locally hosted database on port 27017.


# Access the client
db = client["test_db"]


print("Now we build a collection, and add the data to it.")
print("Note - right now there is one collection, but you COULD have multiple.")
collections = {"t8_qdpi": db["t8_qdpi"],}




print("Now we can check the number of documents in the collection.")
for collection_name, collection in collections.items():
    count = collection.count_documents({})
    print(f"Collection '{collection_name}' contains {count} documents.")

# Accessing Entries in the Collection
print("Now let's check out a collection.")
collection = collections["t8_qdpi"]
print("First entry in the collection.")
entry = collection.find_one()
print(entry)
print("Note that this was a really messy output.")
print("pprint can help with that.")
pprint(entry)

# Adding Data Fields
print(db.list_collection_names())
print("Now let's modify a field, by setting the spin to 0.")
result = add_field_to_all_documents("test_db", "t8_qdpi", "spin", 0)
print(result)
entry = collection.find_one()
print("Note that the spin is now set to 0.")
pprint(entry)
print(f"See, the spin is {entry['spin']} ")


print("Alternatively, we can do this automatically with a recipe.")

NewRecipe = GeneralRecipe("test_db", "mongodb://localhost:27017/", 
                          low_level=None,
                          high_level="wB97M-D3(BJ)/def2-TZVPPD",
                          base_model="DFTB3",
                          data_source="tautobase",
                          note="This was built from a recipe.")
NewRecipe.pprint_one_entry()