import numpy as np
import json

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

from model import MoleculeVAE
from utils import encode_smiles, decode_latent_molecule, interpolate, get_unique_mols

# number of dimensions to represent the molecules
# as the model was trained with this number, any operation made with the model must share the dimensions.
latent_dim = 292

# trained_model 0.99 validation accuracy
# trained with 80% of ALL chembl molecules, validated on the other 20.
trained_model = 'chembl_23_model.h5'
charset_file = 'charset.json'

aspirin_smiles = 'CC(=O)Oc1ccccc1C(=O)O'

Most of the 1k latent representations won't decode to a valid molecule; this is completely normal given the complexity of chemical space. Also note that this is NOT a perfect autoencoder: it was trained to a validation accuracy of 0.99, so some molecules won't be decoded correctly after the encoding phase.

In [7]:

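from rdkit import Chem
from rdkit import RDLogger

# silence RDKit warnings and errors in the notebook
# (there are lots of them because many decoded strings are not valid molecules)
lg = RDLogger.logger()
lg.setLevel(RDLogger.CRITICAL)

working_mols = []
for smiles in decoded_molecules:
    try:
        # MolFromSmiles returns None when the SMILES string cannot be parsed
        mol = Chem.MolFromSmiles(smiles)
        if mol:
            working_mols.append(mol)
    except Exception:
        # skip anything that fails outright (e.g. a non-string entry)
        continue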
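To put a number on how many decoded strings survive the filter, the same loop can be wrapped in a small helper that also reports the valid fraction. This is only a sketch: `filter_valid` and `toy_parse` are hypothetical names that are not part of the notebook, and `toy_parse` is a crude stand-in (it only checks that parentheses are balanced) so the snippet runs without RDKit. In the notebook you would pass `Chem.MolFromSmiles` as the parser instead.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def filter_valid(
    smiles_list: Iterable[str],
    parse: Callable[[str], Optional[object]],
) -> Tuple[List[str], float]:
    """Keep the SMILES strings that `parse` accepts and report the valid fraction.

    `parse` is expected to return None for an unparseable string,
    which is how Chem.MolFromSmiles behaves.
    """
    kept = []
    total = 0
    for smiles in smiles_list:
        total += 1
        try:
            mol = parse(smiles)
        except Exception:
            # treat a parser failure the same as an invalid molecule
            mol = None
        if mol is not None:
            kept.append(smiles)
    fraction = len(kept) / total if total else 0.0
    return kept, fraction


def toy_parse(smiles: str) -> Optional[str]:
    # ASSUMPTION: placeholder parser for illustration only, NOT a real
    # SMILES validity check. Rejects empty strings and unbalanced parentheses.
    balanced = smiles.count("(") == smiles.count(")")
    return smiles if smiles and balanced else None


decoded = ["CC(=O)Oc1ccccc1C(=O)O", "CC(=O", "", "CCO"]
kept, fraction = filter_valid(decoded, toy_parse)
print(kept, fraction)
```

With the real parser, a call such as `filter_valid(decoded_molecules, Chem.MolFromSmiles)` would give the share of the sampled latent points that decode to a parseable molecule, which is the quantity the note above says to expect to be low.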