The aim of this paper is to present the results obtained with an original approach, the Molecular Descriptors Family on Structure-Property (MDF-SPR) and Structure-Activity Relationships (MDF-SAR), applied to classes of chemical compounds, and to show its usefulness as a starting point for building models of new compounds with better properties and/or activities and low production costs. The MDF-SPR/MDF-SAR methodology integrates the complex information obtained from a compound’s structure into unitary, efficient models in order to explain properties/activities. The methodology was applied to thirty sets of chemical compounds. The best subsets of molecular descriptors family members able to estimate and predict the property/activity of interest were identified and analyzed statistically and visually. The MDF-SPR/MDF-SAR models were validated through internal and/or external validation methods. The estimation and prediction abilities of the MDF-SPR/MDF-SAR models were compared with those of previously reported models by correlated correlation analysis, which revealed that the MDF-SPR/MDF-SAR methodology is reliable. The MDF-SPR/MDF-SAR methodology opens a new pathway in understanding the relationships between a compound’s structure and its property/activity, in property/activity prediction, and in the discovery, investigation and characterization of new chemical compounds that are more competitive in cost and property/activity, while being less expensive than experimental methods.

Observations on the relationships between activity/property and compound structure were reported before the QSAR/QSPR concepts appeared. In 1868, Crum-Brown and Fraser proposed that the activity of a compound is a function of its chemical composition and structure [2]. In 1893, Richet and Seancs showed, for a set of organic molecules, that cytotoxicity was inversely related to water solubility [3]. In 1899, Mayer suggested that the narcotic action of a group of organic compounds is related to their solubility in olive oil [4]. In 1939, Ferguson introduced a thermodynamic generalization of the correlation of depressant action with the relative saturation of volatile compounds in the vehicle in which they were administered [5]. Hammett [6] and Taft [7] laid the mechanistic basis for QSAR/QSPR development.

Ten years after the QSAR/QSPR methods were defined, these paradigms had found practical application in agrochemistry, pharmaceutical chemistry, toxicology and other chemistry-related fields [8].

In QSAR/QSPR analysis, electronic [9,10], hydrophobic [11,12], steric [13,14] and topological [15,16] descriptors are the most frequently used. Pure topological indices used in QSAR/QSPR analysis include the Wiener index [17], the Szeged index [18], and the Cluj index [19,20]. Moreover, QSAR is nowadays used in drug research, where it is seen as a useful tool in the design of new compounds [21,22], in the characterization of activity by means of gene expression programming [23], and in the analysis of the relationships between compound structure and the associated biological activity [24,25].

An original approach called Molecular Descriptors Family on Structure-Property/Activity Relationships (MDF-SPR/MDF-SAR) has been developed [26]. The MDF-SPR/MDF-SAR methodology, a unitary approach based on minimal complex knowledge obtained from the compound’s structure, was applied to different classes of compounds. The models obtained proved to have estimation and prediction abilities and are presented here. Starting from the MDF-SPR/MDF-SAR models, an open system was developed in order to provide a virtual experimental environment applicable to the analysis and characterization of the properties/activities of chemical compounds.

2. Materials and Methods

Thirty sets of chemical compounds were investigated with the MDF-SPR/MDF-SAR methodology. Twenty of the thirty sets (66.67%, 95% CI [46.78 – 83.22]) had an activity of interest, while the other ten (33.33%, 95% CI [16.78 – 53.22]) had a property of interest. The abbreviation of each set, the type of the observed or measured property/activity, and the class of the compounds are presented in Table 1.
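The method used to obtain the confidence intervals quoted above is not stated in this section. As an illustration only, the sketch below computes a Wilson score interval for 20 out of 30 sets. The function name `wilson_interval` and the choice of the Wilson method are assumptions made here, so the bounds it prints will differ somewhat from the reported [46.78 – 83.22].

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    Assumed method for illustration; the paper does not say which
    interval formula produced its reported bounds.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# 20 of the 30 sets had an activity of interest.
low, high = wilson_interval(20, 30)
print(f"20/30 = {20 / 30:.2%}, 95% CI [{low:.2%}, {high:.2%}]")
```

The interval for the complementary proportion (10 out of 30) is the mirror image of this one, which matches the way the two reported intervals mirror each other.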

The property or activity of each sample of compounds was modeled with the MDF methodology [26]. The steps of the modeling process, described in detail in [26], were:

Compound preparation: the three-dimensional representation of each compound was built with the HyperChem software [27]. The property or activity of each compound in the sample of interest was stored in a *.txt file.

Molecular descriptors family generation and computation: all compounds belonging to the sample of interest were used in the construction and generation of the molecular descriptors family. The algorithm is based strictly on the complex information obtained from the compounds’ structure. Each calculated descriptor has an individual seven-letter name that expresses how it was constructed:

○ Compound characteristic relative to its geometry (g) or topology (t) - the 7th letter;

Search and identification of MDF-SPR/MDF-SAR models: the criteria imposed in searching for and identifying the model were the model significance, the values of the correlation and squared correlation coefficients, the standard error, and the significance of the coefficients associated with the molecular descriptors.

MDF-SPR/MDF-SAR model validation: two methods were applied for the internal validation of the models obtained. These were the leave-one-out procedure (the property or activity of each compound was predicted by the regression equation calculated from all the other compounds) and the leave-many-out procedure (the property or activity of a number of compounds discarded from the sample was predicted by the regression equation calculated from all the other compounds).

MDF-SPR/MDF-SAR analysis and comparison with previously reported models (where applicable): the chosen MDF-SPR/MDF-SAR models were analyzed statistically, by computing and interpreting seven statistical parameters, and visually, by plotting the model.
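The leave-one-out procedure described in the validation step above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper names (`fit_ols`, `predict`, `r_squared`, `loo_score`), the toy descriptor values, and the definition of the leave-one-out score as 1 - PRESS/SS_tot are all assumptions made here.

```python
def fit_ols(X, y):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    rows = [[1.0] + list(x) for x in X]           # prepend intercept column
    k = len(rows[0])
    # Augmented system (A^T A | A^T y).
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

def predict(coef, x):
    """Apply the regression equation to one compound's descriptor values."""
    return coef[0] + sum(b * v for b, v in zip(coef[1:], x))

def r_squared(y, y_hat):
    """1 - SS_res / SS_tot for observed y and estimated/predicted y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def loo_score(X, y):
    """Predict each compound from a model fitted on all the other compounds."""
    preds = []
    for i in range(len(y)):
        coef = fit_ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(predict(coef, X[i]))
    return r_squared(y, preds)

# Hypothetical data: 8 compounds, 2 descriptor values each (not from the paper).
X_demo = [[1.0, 2.1], [2.0, 1.9], [3.0, 4.2], [4.0, 3.8],
          [5.0, 6.1], [6.0, 5.7], [7.0, 8.0], [8.0, 7.9]]
y_demo = [2.4, 3.4, 5.5, 6.3, 8.4, 9.3, 11.2, 12.6]

coef = fit_ols(X_demo, y_demo)
r2 = r_squared(y_demo, [predict(coef, x) for x in X_demo])
print(f"r2 = {r2:.4f}, r2cv-loo = {loo_score(X_demo, y_demo):.4f}")
```

The leave-many-out procedure follows the same pattern, with several compounds removed at once instead of one.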

Building on previous experience in the development of online systems [30,31], PHP (PHP: Hypertext Preprocessor) and MySQL (My Structured Query Language) were used in the development of the open system.

The characteristics of the previously reported models and of the MDF-SPR/MDF-SAR models were summarized with the Statistica software. The correlation coefficients obtained by the previously reported models were compared with those obtained by the MDF-SPR/MDF-SAR models (Fisher’s Z test at a significance level of 5% was applied [32]).
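A comparison of two correlation coefficients with Fisher's Z transformation can be sketched as below. This is an illustration under stated assumptions, not the procedure of [32]: it treats the two coefficients as coming from independent samples, whereas a correlated correlation analysis would also account for the dependence between two models fitted to the same compounds. The function name `fisher_z_compare` and the example values are hypothetical.

```python
import math

def fisher_z_compare(r1, n1, r2, n2):
    """Compare two correlation coefficients (independent-samples assumption).

    Returns (Z, one-sided p for rho1 > rho2, two-sided p).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)       # Fisher r-to-z transform
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_one_sided, p_two_sided

# Hypothetical values: a new model with r = 0.99 against a previous one
# with r = 0.95, both obtained on 20 compounds.
z, p1, p2 = fisher_z_compare(0.99, 20, 0.95, 20)
print(f"Z = {z:.3f}, one-sided p = {p1:.4f}, two-sided p = {p2:.4f}")
```

At the 5% level the one-sided critical value of the standard normal distribution is about 1.645 and the two-sided one about 1.960, so which alternative is tested matters for Z values between the two.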

3. Results

Summaries of the characteristics of the MDF-SPR/MDF-SAR models and of the previously reported models are presented in Table 2.

The characteristics of the previously reported models (where available) and of the best performing MDF-SPR and MDF-SAR models are presented in Table 3. Table 3 also includes the Z parameter resulting from the comparison between the correlation coefficient of the previous model and that of the best performing MDF-SPR or MDF-SAR model.

As specified in the Materials and Methods section, the MDF-SPR and MDF-SAR results were integrated into an open system. The open system incorporates distinct programs useful in the analysis and characterization of compound properties/activities. The system is hosted on the AcademicDirect domain and is available at the following URL: http://l.academicdirect.org/Chemistry/SARs/MDF_SARs/

The names and the functions of the programs are presented in Table 4.

4. Discussion

This paper presented the estimation and prediction abilities of the Molecular Descriptors Family in the characterization of the property/activity of chemical compounds. Based on the results obtained, a virtual environmental library was created.

Four observations can be made when looking at the entire ensemble of compound sets. First, the squared correlation coefficients and the associated correlation coefficients, which measure the statistical fit of the regression, were generally greater for the MDF-SPR/MDF-SAR models than for the previously reported models. With one exception (the relative response factor of polychlorinated biphenyls, PCB_rrf set), the MDF-SPR/MDF-SAR models obtained squared correlation coefficients greater than 0.9 (see Table 3). The squared correlation coefficients and leave-one-out scores, which were greater than 0.8 in the majority of cases, support the predictive abilities of the models in accordance with the criteria of Basak et al. [73]. With one exception (the alkanes set with boiling point as the property, 33504 set), the squared correlation coefficient obtained by the best performing MDF-SPR model was greater than the previously reported squared correlation coefficient (see Table 3). Note that the lowest performance was obtained for the relative response factor of polychlorinated biphenyls (PCB_rrf set, see Table 3). The relative response factor is a complex property that depends on many factors, not only on the compound’s structure (all external factors of the gas chromatography and/or HRGC/ECD methods). In seventy percent of cases, the squared correlation coefficient was statistically significantly greater than that of the previously reported models (see Table 3).

The second observation concerns the number of descriptors used in the MDF-SPR/MDF-SAR models. It is well known that the fitting power of a model increases with the number of descriptors, and it is generally accepted that a regression model with v descriptors for a sample of size n is acceptable only if the following criterion is satisfied: n ≥ 4–5·v [74], that is, n ≥ 4·v under the looser condition and n ≥ 5·v under the stricter one. With three exceptions (RRC_lbr, Dipep, and 19654 sets), the number of descriptors used by the MDF-SPR/MDF-SAR models was smaller than the number of variables used in the previously reported models. Moreover, with two exceptions (41521 and 52344 sets, both with 8 compounds), the condition imposed by Hawkins [74] was respected by the MDF-SPR/MDF-SAR models. In five cases, the previously reported models did not respect the condition described above (compound sets: 36638 - 5·v condition, 41521 - both conditions, 52344 - both conditions, 26449 - both conditions, and 23151 - both conditions).
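The sample-size criterion above can be checked mechanically. The sketch below is illustrative only: the function name `meets_sample_size_rule` is hypothetical, and the descriptor counts in the example are made up, since the numbers of descriptors of the individual models are given in Table 3 rather than in the text.

```python
def meets_sample_size_rule(n_compounds, n_descriptors):
    """Check both readings of n >= 4-5*v for a regression model."""
    return {
        "4v": n_compounds >= 4 * n_descriptors,   # looser condition
        "5v": n_compounds >= 5 * n_descriptors,   # stricter condition
    }

# A set of 8 compounds, as in the two exceptions mentioned above,
# with hypothetical descriptor counts v = 1, 2, 3.
for v in (1, 2, 3):
    print(v, meets_sample_size_rule(8, v))
```

With only 8 compounds, a model satisfies both conditions with one descriptor, only the looser one with two descriptors, and neither with three.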

The third observation concerns the sample sizes used by the previously reported models and by the MDF-SPR/MDF-SAR models. There were seven sets in which the previously reported model was obtained after the exclusion of one (Tox395 set), two (Ta395 set), three (23151 set), four (40846_4 and 23167 sets), twenty (22583 set), or forty-four (23110 set) compounds, while the MDF-SPR/MDF-SAR models used the whole sample in all cases (see Table 3).

The last, but not least, observation concerns the stability of the MDF-SPR/MDF-SAR models (defined as the difference between the squared correlation coefficient and the cross-validation leave-one-out score), which supported their prediction abilities (see Table 3).
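As a small worked illustration of this definition, the sketch below computes the difference for the alkyl metal compounds set (52730 set), using the two values quoted for it later in this section. The function name `stability_gap` is hypothetical.

```python
def stability_gap(r2, r2_cv_loo):
    """Difference between the squared correlation coefficient and the
    leave-one-out score; a smaller gap suggests a more stable model."""
    return r2 - r2_cv_loo

# Values reported in the text for the 52730 set.
gap = stability_gap(0.9998, 0.9993)
print(f"52730 set: r2 - r2cv-loo = {gap:.4f}")   # 52730 set: r2 - r2cv-loo = 0.0005
```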

The MDF-SPR/MDF-SAR methodology performed well in the estimation and prediction of the properties of different chemical classes. For example, the retention chromatography index (IChr set) and the molar refraction (MR10 set) are two properties that can be estimated and predicted with high accuracy (squared correlation coefficients and leave-one-out cross-validation scores greater than 0.99, see Table 3).

These results suggest that the physicochemical properties of compounds are related to compound structure and that the information obtained from the structure can be useful in property characterization.

Regarding the toxicity of chemical compounds, it can be said that the MDF-SAR models estimate and predict almost perfectly the toxicity of alkyl metal compounds (52730 set, r² = 0.9998, r²cv-loo = 0.9993, see Table 3) and perform well in the estimation of the cytotoxicity of the studied quinolines (Ta395 set, Table 3). Good performance, significantly greater than that of the previously reported model (Tox395 set, Z = 1.893, Table 3), was also obtained in the estimation and prediction of the mutagenicity of the studied quinolines.

Looking at the performance of the MDF-SAR models obtained for insecticidal (41521 set) and herbicidal (Triaz set) activities, it can be observed that, even though the previously reported models had squared correlation coefficients close to one (see Table 3), the MDF-SAR models obtained statistically significantly greater correlation coefficients (Z = 2.144 for the 41521 set and Z = 1.766 for the Triaz set, Table 3).

High performance was obtained by the MDF-SAR models in the estimation and prediction of the antioxidant efficacy of the studied 3-indolyl derivatives (52344 set), where the values obtained for the squared correlation coefficient and the cross-validation leave-one-out score are greater than 0.999 (see Table 3).

The abilities of the MDF-SAR methodology in drug investigations were revealed in the study of the antituberculotic activity of some polyhydroxyxanthones (26449 set), the antimalarial activity of some 2,4-diamino-6-quinazoline sulfonamide derivatives (23151 set), and carbonic anhydrase inhibitors (40846_1, 40846_2, and 40846_4 sets). The squared correlation coefficients were greater than 0.9 and the cross-validation leave-one-out scores were greater than 0.88 (see Table 3).

Even though the MDF-SAR model obtained a greater squared correlation coefficient in the investigation of the anti-HIV-1 potencies of HEPTA and TIBO derivatives (22583 set), no statistically significant difference was identified between the MDF-SAR correlation coefficient and the correlation coefficients obtained by the previously reported models (p ≥ 0.05, see Table 3).

The MDF-SPR/MDF-SAR methodology proved to offer reliable and valid models for characterizing the property/activity of chemical compounds. The results indicate that important information regarding a compound’s property/activity can be obtained by analyzing its structure.

Based on the above results, the developed open system provides an environment for computer-assisted modeling of the property/activity of chemical compounds, offering researchers the alternative of risk-free experiments. The system can be assessed through its advantages and disadvantages.

The advantages offered by the system can be summarized as follows:

○ Provides the values of the molecular descriptors family, calculated from information obtained strictly from the compounds’ structure, for the studied classes of compounds;

○ Identifies the best performing models based on the generated molecular descriptors family;

○ Displays a summary report of the statistical characteristics of the best performing models;

○ Provides parameters that measure the goodness-of-fit, the robustness and the predictivity of the models obtained;

○ Allows the user to view a demo of how the program calculates a molecular descriptor;

○ Predicts the property/activity of new compounds from a previously studied class, based on the best performing MDF-SPR/MDF-SAR model.

Note that the costs of virtual experiments are lower than those of real experiments. In addition, the risks of experimentation are eliminated. Compared with the experimental approach, the proposed online system provides a stable and valid alternative for studying the relationships between compound structure and the associated activity/property.

In order to use the system facilities, the user must have a computer connected to the Internet and browsing skills. This can be considered a disadvantage of the system, at least for researchers from developing countries.

The open system provides effective models that can be used to study the property/activity of new compounds in real time, without any experiments and at low cost. All that is required is the three-dimensional structure of the new compound, built as a *.hin file, and a previous study on the same class of compounds. Future development of the system will allow access to as many sets of compounds as possible, opening a new pathway in the study of the relationship between the property/activity and the structure of chemical compounds.

The MDF-SPR/MDF-SAR methodology opens a new pathway in understanding the relationships between compound structure and property/activity, and in the characterization, investigation and development of new compounds that are more competitive in production cost and property/activity.

Displays, for a set of data, the MDF-SPR/MDF-SAR equations together with some statistical parameters (the squared correlation coefficient, the number of descriptors, and the sample size).

Query

Displays the following characteristics of the MDF-SPR/MDF-SAR investigation: the size of the molecular descriptors family, the MDF-SPR/MDF-SAR equations, the number of descriptors used by the models, the sample size for each model, the values of each descriptor, the squared correlation coefficient, the leave-one-out score, and the squared correlation coefficient between each descriptor and the measured/observed property/activity.

DC Predictor (DC: demo calculator)

Provides a demo calculation of the Molecular Descriptors Family for a specified compound (a *.hin file) based on characteristics chosen by the user.

SARs (SAR: Structure-Activity Relationship)

One or more previously obtained MDF-SPR/MDF-SAR models are used (learning set). The property/activity of a new compound from the same class as the learning compounds can be predicted from its structure. A *.hin file of the compound of interest is required.

LOO Analysis (LOO: leave one out)

Based on the data resulting from the MDF-SPR/MDF-SAR investigation, the program computes the leave-one-out score and displays the statistical characteristics of the estimated and predicted property/activity of interest (number of descriptors used by the model, degrees of freedom, standard error, standard deviation, squared correlation coefficient, Fisher parameter and associated significance). The program works only with tabulated data (with labels on columns and rows). The columns must be organized as follows: independent variables (first set of columns), estimated dependent variable, measured/observed dependent variable, and predicted variable.

Investigator

Displays the characteristics of the sets of molecules under analysis. The system administrator can delete the MDF-SPR/MDF-SAR models that are considered not to meet the imposed conditions and requirements.

TvT Experiment (TvT: Training vs. Test)

Based on a previously analyzed set of compounds, the program randomly splits the compounds into training and test sets (the user can impose the number of compounds in the training set). The MDF-SPR/MDF-SAR equation is calculated on the training set and applied to the test set. The program displays the molecular descriptors and associated values for the compounds in the training and test sets, the MDF-SPR/MDF-SAR equations, the squared correlation coefficient, and the Fisher parameter and associated significance for the training and test sets.
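The training vs. test experiment described in the last entry above can be sketched as follows. This is an illustration under stated assumptions, not the code of the online program: the function names `split_training_test` and `fit_line`, the compound labels, and the single-descriptor toy data are hypothetical, and a one-descriptor line stands in for a full MDF-SPR/MDF-SAR equation.

```python
import random

def split_training_test(compound_ids, n_training, seed=None):
    """Randomly assign compounds to a training set of the requested size
    and a test set holding the remaining compounds."""
    if not 0 < n_training < len(compound_ids):
        raise ValueError("training size must leave at least one compound per set")
    rng = random.Random(seed)
    shuffled = list(compound_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_training], shuffled[n_training:]

def fit_line(x, y):
    """Least-squares intercept and slope for a single descriptor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Hypothetical set: compound label -> (descriptor value, measured property).
data = {"c1": (1.0, 2.1), "c2": (2.0, 3.9), "c3": (3.0, 6.2), "c4": (4.0, 7.8),
        "c5": (5.0, 10.1), "c6": (6.0, 12.2), "c7": (7.0, 13.8)}

training, test = split_training_test(sorted(data), n_training=5, seed=0)
# Calculate the equation on the training set only.
b0, b1 = fit_line([data[c][0] for c in training], [data[c][1] for c in training])
# Apply it to the compounds kept aside in the test set.
for c in test:
    x, y = data[c]
    print(f"{c}: measured = {y:.2f}, predicted = {b0 + b1 * x:.2f}")
```

Fixing the seed makes a split reproducible, which is convenient when results for the training and test sets need to be compared across runs.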

Acknowledgements

We would like to thank Prof. Dr. I. Černušák and Prof. Dr. J. Noga from Comenius University, Bratislava, Slovak Republic, for the discussions and suggestions on the topic presented in this article. The research was partly supported by UEFISCSU Romania through project ET/36/2005.