3.5. Querying the MongoDB Database

In this tutorial, you will learn how to develop basic queries for entries in the MongoDB database. The tutorial also covers how to use a query to produce a new HDF5 file, which can later be used for DeePMD-kit training.

3.5.1. Learning Objectives

  • How to query the MongoDB database

  • How to construct compound queries

  • How to use a query to obtain a new HDF5 file

3.5.2. Required Files

  • MongoDB database initialized through previous tutorials (specifically, Generating QDPi2 MongoDB Database)

  • The files for this tutorial are in examples/DatabaseQuerying

3.5.3. Tutorial

To start, we import the necessary libraries and connect to the MongoDB database, which must already be running.

from pymongo import MongoClient
from pymongo.errors import PyMongoError

from pharmaforge.queries.query import Query

print("First, we connect to the database created in the previous tutorial.")
print("Note - if you haven't run the DataBaseGeneration script, this will fail.")
try:
    # MongoClient connects lazily, so ping the server to surface
    # connection problems here rather than at the first query.
    client = MongoClient('mongodb://localhost:27017/', serverSelectionTimeoutMS=5000)
    client.admin.command("ping")
    db = client["QDPi2_Database"]
except PyMongoError:
    print("MongoDB is not running, or you haven't created the database in the previous example. Please start MongoDB and try again.")
    raise SystemExit(1)

Now that we are connected to the database, we can start querying it. (If the connection failed, make sure that you started the MongoDB server before this tutorial and that you completed the previous tutorial.) The next step is to define a set of example queries. Every query follows the general format <field> <operator> <value>. The available operators are eq, neq, gt, gte, lt, lte, and any. You can also combine clauses with the logical operators and, or, and not to build compound queries. Some examples are defined below.

# Start Examples
example_queries=[
    "nmols eq 5",
    "not nmols eq 5",
    "nmols eq 1",
    "contains_elements any [H,N,O,C]",
    "contains_elements any [H,O] and contains_elements any [C]",
    "contains_elements any [H,O] and not contains_elements any [C]",
    "contains_elements any [H,N,O,C] or nmols gt 1",
    "molecular_charge eq -1",
    "not molecular_charge eq 0",
]
# End Examples

Before running this list of queries, note what happens behind the scenes: the pharmaforge package converts each of the strings above into a JSON-like dictionary, which is what pymongo actually interprets. The code below first counts the documents in each collection, then runs two equivalent hand-written pymongo queries for nmols equal to 5.


for collection in db.list_collection_names():
    print(f"Collection: {collection}")
    count = db[collection].count_documents({})
    print(f"Collection '{collection}' contains {count} documents.")

test_query = {
    'nmols':  5
    }
results = db['ani_qdpi'].find(test_query)

print("Found", len(list(results)))
test2_query = {
    'nmols': {'$eq': 5}
}

results = db['ani_qdpi'].find(test2_query)
print("Found", len(list(results)))
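To make the string-to-dictionary translation more concrete, here is a minimal sketch of how a single <field> <operator> <value> clause could be turned into a pymongo filter. The helper names clause_to_filter and parse_value and the operator table are hypothetical and written only for illustration. They are not pharmaforge's actual parser, and mapping any to MongoDB's $in is an assumption.

```python
import re

# Hypothetical operator table: tutorial keyword -> MongoDB operator.
# Mapping "any" to "$in" is an assumption, not confirmed pharmaforge behavior.
OPERATORS = {
    "eq": "$eq", "neq": "$ne",
    "gt": "$gt", "gte": "$gte",
    "lt": "$lt", "lte": "$lte",
    "any": "$in",
}

def parse_value(text):
    """Turn '5', '-1' or '[H,O]' into an int or a list of strings."""
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        return [item.strip() for item in text[1:-1].split(",") if item.strip()]
    if re.fullmatch(r"-?\d+", text):
        return int(text)
    return text

def clause_to_filter(clause):
    """Translate one '<field> <operator> <value>' clause into a filter dict."""
    field, operator, value = clause.strip().split(maxsplit=2)
    if operator not in OPERATORS:
        raise ValueError(f"Unknown operator: {operator}")
    return {field: {OPERATORS[operator]: parse_value(value)}}

print(clause_to_filter("nmols eq 5"))
print(clause_to_filter("contains_elements any [H,O]"))
```

The first result has the same shape as the hand-written test2_query dictionary, which is why the string form and the dictionary form can be used interchangeably for simple clauses.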

Now let's run the queries. We loop over them, and for each query the output shows the parsed query and the SMILES string of the first matching document (each document corresponds to one molecule), or a message if nothing matches.

for query in example_queries:
    q = Query(query)
    print("*****************************************")
    print(f"Query: {q.querystring}")
    print(f"Parsed query: {q.parsed_query}")
    # Run the query against the ani_qdpi collection.
    results = q.apply(db, "ani_qdpi")
    # Print the SMILES string of the first match, if there is one.
    try:
        print(results[0]['smiles'])
    except (IndexError, KeyError, TypeError):
        print("No results found.")
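Several of the queries in the loop above combine clauses with and, or, and not. In raw MongoDB filter syntax these correspond to the $and, $or, and $nor (or $not) operators. The sketch below builds such a compound filter by hand and checks it against two made-up documents using a toy in-memory evaluator. The helpers and_, or_, not_, matches, and check are hypothetical names introduced here; whether pharmaforge emits $nor or $not for negation is not shown in this tutorial, so treat the exact dictionary shape as an assumption.

```python
def and_(*filters):
    """All sub-filters must match."""
    return {"$and": list(filters)}

def or_(*filters):
    """At least one sub-filter must match."""
    return {"$or": list(filters)}

def not_(flt):
    """Negate a filter: $nor with a single entry matches documents that fail it."""
    return {"$nor": [flt]}

def matches(doc, flt):
    """Toy evaluator for a small subset of MongoDB filter syntax (illustration only)."""
    for key, cond in flt.items():
        if key == "$and":
            ok = all(matches(doc, sub) for sub in cond)
        elif key == "$or":
            ok = any(matches(doc, sub) for sub in cond)
        elif key == "$nor":
            ok = not any(matches(doc, sub) for sub in cond)
        else:
            ok = all(check(doc.get(key), op, target) for op, target in cond.items())
        if not ok:
            return False
    return True

def check(value, op, target):
    """Evaluate one field-level operator; only $eq, $gt and $in are supported."""
    if op == "$eq":
        return value == target
    if op == "$gt":
        return value is not None and value > target
    if op == "$in":
        # For array fields, $in matches when any array element is in the target list.
        values = value if isinstance(value, list) else [value]
        return any(v in target for v in values)
    raise ValueError(f"Unsupported operator: {op}")

# Hand-built equivalent of:
# "contains_elements any [H,O] and not contains_elements any [C]"
flt = and_(
    {"contains_elements": {"$in": ["H", "O"]}},
    not_({"contains_elements": {"$in": ["C"]}}),
)

# Made-up documents using the field names from this tutorial.
water = {"smiles": "O", "nmols": 1, "contains_elements": ["H", "O"]}
methanol = {"smiles": "CO", "nmols": 1, "contains_elements": ["C", "H", "O"]}

print(matches(water, flt))     # True: has H/O and no C
print(matches(methanol, flt))  # False: contains C
```

Against a live server, the same flt dictionary could be passed directly to a collection's find method; the toy matches function exists only so the logic can be checked without a running database.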


Being able to query the database is useful, but it is more useful to do something with the results. In the code block below, we use a query to generate a new HDF5 file. This file is used in the next tutorial, where we relabel the data at a different level of theory.

# Now let's write a new HDF5 file for training from the documents that match a query.
print("Let's write a new HDF5 file for training based on a query; specifically, let's select entries with more than 4 and fewer than 7 molecules.")

query = "nmols gt 4 and nmols lt 7"
q = Query(query)
print("*****************************************")
print(f"Query: {q.querystring}")
print(f"Parsed query: {q.parsed_query}")
q.display_query()
results = q.apply(db, "ani_qdpi")

q.results_to_deepmdkit("saved_model.hdf5", level_of_theory="wB97XM-D3(BJ)/def2-TZVPPD")


print("Let's also set up a small example for QDPi1, which will be used in CalculateMSE")

query = "contains_elements any [H,N,O,C] and not contains_elements any [F,Li,Na,P,S,Cl,K,Br,I]"
q = Query(query)
print("*****************************************")
print(f"Query: {q.querystring}")
print(f"Parsed query: {q.parsed_query}")
q.display_query()
results = q.apply(db, "ani_qdpi")
q.results_to_deepmdkit("saved_model_qdpi1.hdf5", level_of_theory="wB97XM-D3(BJ)/def2-TZVPPD")

# This is equivalent to the query above, but uses the contains_elements only operator, which is much simpler.
query = "contains_elements only [H,N,O,C]"
q = Query(query)
print("*****************************************")
print(f"Query: {q.querystring}")
print(f"Parsed query: {q.parsed_query}")
q.display_query()
results = q.apply(db, "ani_qdpi")
q.results_to_deepmdkit("saved_model_qdpi1.hdf5", level_of_theory="wB97XM-D3(BJ)/def2-TZVPPD")
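The equivalence between the long any / not any query and the shorter only query can be checked with plain set logic. The sketch below assumes that any means "shares at least one element with the list" and that only means "every element is in the list"; this reading is inferred from the operator names and is not confirmed by the pharmaforge source. The helper names and the sample element lists are made up for illustration.

```python
ALLOWED = ["H", "N", "O", "C"]
EXCLUDED = ["F", "Li", "Na", "P", "S", "Cl", "K", "Br", "I"]

def contains_any(elements, candidates):
    """Assumed meaning of 'any': at least one element is in candidates."""
    return bool(set(elements) & set(candidates))

def contains_only(elements, allowed):
    """Assumed meaning of 'only': every element is in allowed."""
    return set(elements) <= set(allowed)

def long_form(elements):
    """contains_elements any ALLOWED and not contains_elements any EXCLUDED."""
    return contains_any(elements, ALLOWED) and not contains_any(elements, EXCLUDED)

# Made-up element lists standing in for the contains_elements field.
samples = [
    ["H", "O"],
    ["C", "H", "Cl"],
    ["C", "H", "N", "O"],
    ["Na", "Cl"],
]

for elements in samples:
    print(elements, long_form(elements), contains_only(elements, ALLOWED))
```

Under these assumptions the two forms agree for every non-empty element list drawn from the thirteen elements named in the two queries. They would differ for a document containing an element outside both lists, which the long form accepts as long as it also contains H, N, O, or C.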

3.5.4. Full Code

from pymongo import MongoClient
from pymongo.errors import PyMongoError

from pharmaforge.queries.query import Query

print("First, we connect to the database created in the previous tutorial.")
print("Note - if you haven't run the DataBaseGeneration script, this will fail.")
try:
    # MongoClient connects lazily, so ping the server to surface
    # connection problems here rather than at the first query.
    client = MongoClient('mongodb://localhost:27017/', serverSelectionTimeoutMS=5000)
    client.admin.command("ping")
    db = client["QDPi2_Database"]
except PyMongoError:
    print("MongoDB is not running, or you haven't created the database in the previous example. Please start MongoDB and try again.")
    raise SystemExit(1)

# Start Examples
example_queries=[
    "nmols eq 5",
    "not nmols eq 5",
    "nmols eq 1",
    "contains_elements any [H,N,O,C]",
    "contains_elements any [H,O] and contains_elements any [C]",
    "contains_elements any [H,O] and not contains_elements any [C]",
    "contains_elements any [H,N,O,C] or nmols gt 1",
    "molecular_charge eq -1",
    "not molecular_charge eq 0",
]
# End Examples

for collection in db.list_collection_names():
    print(f"Collection: {collection}")
    count = db[collection].count_documents({})
    print(f"Collection '{collection}' contains {count} documents.")

test_query = {
    'nmols':  5
    }
results = db['ani_qdpi'].find(test_query)

print("Found", len(list(results)))
test2_query = {
    'nmols': {'$eq': 5}
}

results = db['ani_qdpi'].find(test2_query)
print("Found", len(list(results)))

for query in example_queries:
    q = Query(query)
    print("*****************************************")
    print(f"Query: {q.querystring}")
    print(f"Parsed query: {q.parsed_query}")
    # Run the query against the ani_qdpi collection.
    results = q.apply(db, "ani_qdpi")
    # Print the SMILES string of the first match, if there is one.
    try:
        print(results[0]['smiles'])
    except (IndexError, KeyError, TypeError):
        print("No results found.")


# Now let's write a new HDF5 file for training from the documents that match a query.
print("Let's write a new HDF5 file for training based on a query; specifically, let's select entries with more than 4 and fewer than 7 molecules.")

query = "nmols gt 4 and nmols lt 7"
q = Query(query)
print("*****************************************")
print(f"Query: {q.querystring}")
print(f"Parsed query: {q.parsed_query}")
q.display_query()
results = q.apply(db, "ani_qdpi")

q.results_to_deepmdkit("saved_model.hdf5", level_of_theory="wB97XM-D3(BJ)/def2-TZVPPD")


print("Let's also set up a small example for QDPi1, which will be used in CalculateMSE")

query = "contains_elements any [H,N,O,C] and not contains_elements any [F,Li,Na,P,S,Cl,K,Br,I]"
q = Query(query)
print("*****************************************")
print(f"Query: {q.querystring}")
print(f"Parsed query: {q.parsed_query}")
q.display_query()
results = q.apply(db, "ani_qdpi")
q.results_to_deepmdkit("saved_model_qdpi1.hdf5", level_of_theory="wB97XM-D3(BJ)/def2-TZVPPD")

# This is equivalent to the query above, but uses the contains_elements only operator, which is much simpler.
query = "contains_elements only [H,N,O,C]"
q = Query(query)
print("*****************************************")
print(f"Query: {q.querystring}")
print(f"Parsed query: {q.parsed_query}")
q.display_query()
results = q.apply(db, "ani_qdpi")
q.results_to_deepmdkit("saved_model_qdpi1.hdf5", level_of_theory="wB97XM-D3(BJ)/def2-TZVPPD")