3.1. Adding SMILES codes to HDF5 Files

In this tutorial, you will learn how to take an existing file in the DeePMD-kit HDF5 format and add a SMILES string to each molecule in the file.

3.1.1. Learning Objectives

Understand how to read an HDF5 file in the DeePMD-kit format.
Learn how to add SMILES strings to the HDF5 file.
Understand how to save the modified HDF5 file.

3.1.2. Required Files

The files for this tutorial are located in examples/AddSmiles and examples/AddSmiles/inputs.

t8.hdf5 : This is the HDF5 file that you will be modifying. It comes from Tautobase and is relatively small, which makes it easier to work with.
setup_smiles.py : This is the script that will be used to add the SMILES strings to the HDF5 file.

3.1.3. Tutorial

In this example, we demonstrate how to add SMILES strings to the HDF5 file. The SMILES strings are generated using the RDKit library, which is a popular library for cheminformatics.

SMILES strings are a way to represent chemical structures in a text format. They are widely used in cheminformatics and molecular modeling, and can be easily converted to other formats (e.g., InChI or SDF). Sometimes the SMILES strings are not immediately available, and you may need to generate them from the molecular structure when first building the database. This is what we will do in this example.
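To make the idea of a text representation concrete, here is a minimal sketch that uses RDKit directly. It is independent of pharmaforge and of the tutorial files. The helper name canonical_smiles is hypothetical, and the import is guarded on the assumption that RDKit may not be installed. It shows that two different spellings of the same molecule reduce to a single canonical SMILES string.

```python
# Sketch only: RDKit is treated as optional here, so the import is guarded.
try:
    from rdkit import Chem
except ImportError:  # RDKit is not installed
    Chem = None


def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or None if RDKit is missing or parsing fails.

    Hypothetical helper for illustration; it is not part of pharmaforge.
    """
    if Chem is None:
        return None
    mol = Chem.MolFromSmiles(smiles)  # returns None for an unparsable string
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)  # canonical form by default


# Two different ways of writing ethanol map to the same canonical string.
for text in ("OCC", "C(O)C"):
    print(text, "->", canonical_smiles(text))
```

Canonical strings are convenient as database fields because the same molecule always produces the same text, regardless of how the atoms were ordered in the input.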

To get started, we will import the library and load the HDF5 file to show its current contents.

from pharmaforge.database import DataBase
from pharmaforge.io import new_hdf5_file_with_smiles

filename = "./inputs/t8.hdf5"


# Create a database object and add the data to it.
print("Creating database...")
db = DataBase()
db.add_data(filename)

# Print the keys from the database.
print("Keys in the database, listed as chemical formulas.")
print(db.data["t8"].keys())

# Take the first key and print the entry.
print("First entry in the hdf5 file.")
print(db.data["t8"][list(db.data["t8"].keys())[0]].keys())
print("Note that there are four keys in the entry.")
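If you would like to look at the file without going through pharmaforge, the following sketch walks an HDF5 file with h5py and lists every group and dataset it contains. It assumes h5py is installed and makes no assumption about the names stored in t8.hdf5. The helper list_hdf5_contents is hypothetical and is not part of pharmaforge.

```python
import os

# Sketch only: h5py is treated as optional here, so the import is guarded.
try:
    import h5py
except ImportError:  # h5py is not installed
    h5py = None


def list_hdf5_contents(path):
    """Return (name, kind, shape) tuples for every group and dataset in the file.

    Hypothetical helper for illustration; it is not part of pharmaforge.
    """
    if h5py is None:
        raise RuntimeError("h5py is required for this sketch")
    found = []

    def visit(name, obj):
        # visititems calls this once per object; returning None continues the walk.
        if isinstance(obj, h5py.Dataset):
            found.append((name, "dataset", obj.shape))
        else:
            found.append((name, "group", None))

    with h5py.File(path, "r") as handle:
        handle.visititems(visit)
    return found


# Only inspect the tutorial file if it is actually present.
path = "./inputs/t8.hdf5"
if h5py is not None and os.path.exists(path):
    for name, kind, shape in list_hdf5_contents(path)[:10]:
        print(kind, name, shape)
```

Running the same helper on the output file later in the tutorial is a quick way to see which entries were added.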

Now we will add SMILES strings in two ways. The first is to add them directly, assuming the charge of every structure is zero.

# This call is commented out; uncomment it to run the zero-charge variant.
#new_hdf5_file_with_smiles('inputs/t8.hdf5', 'outputs/t8_with_smiles.hdf5', exist_ok=True)

print("Note that two molecules fail to get a SMILES string in this mode because their charge is non-zero.")
print("This can be fixed by using the use_suggested_charge option.")

The second is to add them while allowing RDKit to predict the charge. This is done by passing the optional argument use_suggested_charge=True.

new_hdf5_file_with_smiles("inputs/t8.hdf5", "outputs/t8_with_smiles_suggested_charge.hdf5", exist_ok=True, use_suggested_charge=True)
print("Note that you should CHECK the charge, as it is not guaranteed to be correct.")

print("Now there is a new HDF5 file, t8_with_smiles_suggested_charge.hdf5, that has the SMILES strings added to it.")


print("Now, we will add this to a database collection so that we can query it.")
print("For now, we will just do this without deeper explanation.")
print("The later examples will show how each of the following steps works.")
from pharmaforge.recipes import GeneralRecipe
# Now we can read the new file and create a collection. 
db = GeneralRecipe("t8_with_smiles",
                    "mongodb://localhost:27017/",
                    input_dir="outputs/",
                    spin=0,
                    level_of_theory= "wB97XM-D3(BJ)/def2-TZVPPD",
                    basis_set= "def2-TZVPPD",
                    functional= "wB97XM-D3(BJ)",
                    data_source= "t8_with_smiles")

# Print a single entry from the database.
print("Now you can see the form of a single entry in the database, reformatted to be more readable.")
db.pprint_one_entry()

The HDF5 file now contains SMILES strings!
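Because the suggested charges are not guaranteed to be correct, it is worth spot-checking them. One possible approach, sketched below, is to read the net formal charge back out of a SMILES string with RDKit and compare it against the charge you expect. The helper formal_charge_from_smiles and the example molecules are hypothetical; they are not taken from t8.hdf5 or from pharmaforge.

```python
# Sketch only: RDKit is treated as optional here, so the import is guarded.
try:
    from rdkit import Chem
except ImportError:  # RDKit is not installed
    Chem = None


def formal_charge_from_smiles(smiles):
    """Return the net formal charge encoded in a SMILES string, or None.

    Hypothetical helper for illustration; it is not part of pharmaforge.
    """
    if Chem is None:
        return None
    mol = Chem.MolFromSmiles(smiles)  # returns None for an unparsable string
    if mol is None:
        return None
    return Chem.GetFormalCharge(mol)  # sum of the atomic formal charges


# Illustrative molecules only; replace these with SMILES read from your own file.
examples = {"acetate": "CC(=O)[O-]", "ammonium": "[NH4+]", "ethanol": "CCO"}
for name, smi in examples.items():
    print(name, formal_charge_from_smiles(smi))
```

A mismatch between this value and the charge used in the underlying calculation is a signal to inspect that molecule by hand.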

3.1.4. Full Code

from pharmaforge.database import DataBase
from pharmaforge.io import new_hdf5_file_with_smiles

filename = "./inputs/t8.hdf5"


# Create a database object and add the data to it.
print("Creating database...")
db = DataBase()
db.add_data(filename)

# Print the keys from the database.
print("Keys in the database, listed as chemical formulas.")
print(db.data["t8"].keys())

# Take the first key and print the entry.
print("First entry in the hdf5 file.")
print(db.data["t8"][list(db.data["t8"].keys())[0]].keys())
print("Note that there are four keys in the entry.")



# Now add the SMILES strings to the file. Note: you don't technically have to do the step above, but it's useful for comparing the differences between the data entries.
# This will take a few minutes.
print("Adding smiles...")
# This call is commented out; uncomment it to run the zero-charge variant.
#new_hdf5_file_with_smiles('inputs/t8.hdf5', 'outputs/t8_with_smiles.hdf5', exist_ok=True)

print("Note that two molecules fail to get a SMILES string in this mode because their charge is non-zero.")
print("This can be fixed by using the use_suggested_charge option.")

new_hdf5_file_with_smiles("inputs/t8.hdf5", "outputs/t8_with_smiles_suggested_charge.hdf5", exist_ok=True, use_suggested_charge=True)
print("Note that you should CHECK the charge, as it is not guaranteed to be correct.")

print("Now there is a new HDF5 file, t8_with_smiles_suggested_charge.hdf5, that has the SMILES strings added to it.")


print("Now, we will add this to a database collection so that we can query it.")
print("For now, we will just do this without deeper explanation.")
print("The later examples will show how each of the following steps works.")
from pharmaforge.recipes import GeneralRecipe
# Now we can read the new file and create a collection. 
db = GeneralRecipe("t8_with_smiles",
                    "mongodb://localhost:27017/",
                    input_dir="outputs/",
                    spin=0,
                    level_of_theory= "wB97XM-D3(BJ)/def2-TZVPPD",
                    basis_set= "def2-TZVPPD",
                    functional= "wB97XM-D3(BJ)",
                    data_source= "t8_with_smiles")

# Print a single entry from the database.
print("Now you can see the form of a single entry in the database, reformatted to be more readable.")
db.pprint_one_entry()