Chemometrics and Intelligent Laboratory Systems 109 (2011) 146–161
Chemometrics and Intelligent Laboratory Systems journal homepage: www.elsevier.com/locate/chemolab
Comparative QSARs for antimalarial endochins: Importance of descriptor-thinning and noise reduction prior to feature selection
Probir Kumar Ojha, Kunal Roy ⁎
Drug Theoretics and Cheminformatics Laboratory, Division of Medicinal and Pharmaceutical Chemistry, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700 032, India
Article info
Article history: Received 25 May 2011; Received in revised form 27 July 2011; Accepted 22 August 2011; Available online 1 September 2011
Keywords: QSAR; Antimalarial; OECD; Validation
Abstract
The emergence of multidrug resistance to the currently available antimalarial drugs has created the need to discover and develop new antimalarial compounds. In the present study, we selected a series of 53 endochin analogs with antimalarial activity against the clinically relevant multidrug resistant malarial strain TM-90-C2B to develop robust QSAR models using different chemometric tools such as stepwise regression, factor analysis followed by multiple linear regression (FA-MLR), factor analysis followed by partial least squares (FA-PLS) analysis, genetic function approximation (GFA) and genetic partial least squares (G/PLS) techniques. We have tried to emphasize the importance of descriptor-thinning prior to the feature selection step. For this purpose, we partitioned the total pool of descriptors into three categories and ran stepwise regression separately for each category to select variables. We then applied genetic methods of model development to the selected pool of descriptors. We validated the models using different validation parameters to check their statistical quality and reliability. The models developed from variable selection by stepwise regression followed by GFA and G/PLS were found to be the best two models. This study reflects the importance of descriptor-thinning and noise reduction prior to feature selection for the development of QSAR models. According to the best two models, substitutions with groups containing rotatable bonds (e.g., long alkyl chains), limited volume of the C-R group, presence of electronegative fragments (like -Br) and absence of cycles or rings in the molecules are important for the antimalarial activity. The developed models are believed to comply with the OECD guidelines for QSAR models. The results also confirmed that the QSAR models developed from the reduced pool of descriptors were better than those obtained from the entire pool of descriptors. © 2011 Elsevier B.V. All rights reserved.
1. Introduction
Malaria has long been an important topic of discussion in the global public health forum [1]. Reports indicate that malaria infects an estimated 247 million people annually among 3.3 billion, causing one million deaths, most of them among children [2]. Malaria still remains one of the major health problems in the world, especially in sub-Saharan Africa [3]. The disease is caused by parasites of the genus Plasmodium and is transmitted by female mosquitoes of the genus Anopheles. There are four species of Plasmodium parasites (Plasmodium falciparum, Plasmodium vivax, Plasmodium malariae, and Plasmodium ovale), among which P. falciparum is responsible for the majority of fatalities (95%) [4]. The massive spread of resistant parasites has hindered the successful treatment of the
⁎ Corresponding author. Tel.: +91 98315 94140; fax: +91 33 2837 1078. URL: http://sites.google.com/site/kunalroyindia/ (K. Roy).
0169-7439/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2011.08.007
patients to the point that some drugs have been rendered virtually useless in many parts of the world. It has been shown that once resistance to a new drug emerges, it spreads rapidly and causes cross-resistance to the whole class. Therefore, the identification and development of novel chemotypes are extremely important. The World Health Organization (WHO) has promoted the widespread use of artemisinin combination therapy (ACT) in malaria endemic countries [5]. However, the emergence of ACT-resistant malaria in Asian countries, together with multidrug resistance to the conventional antimalarial compounds, underlines the need for the development of new antimalarial drugs [6,7]. Quantitative structure-activity relationship (QSAR) methods are used extensively in the design and development of new antimalarial compounds [8–10]. QSARs are among the important applications of chemometric tools, with the objective of developing predictive models that can be used in different areas of chemistry including medicinal, agricultural, environmental and materials chemistry [11–13]. QSAR attempts to correlate the structural/molecular properties in the form of descriptors (e.g. topological,
physicochemical, thermodynamic, electronic, structural etc.) with biological activities (e.g. EC50, IC50, ED50, Ki etc.) or toxicities for a set of similar compounds by using different chemometric tools. Thus, QSAR models can be used to predict the biological activity of new compounds and to design novel potential molecules. Many studies have indicated that descriptor-based QSAR models, along with the information derived from structural bioinformatics tools, can provide timely and very useful insights for both basic research and drug development [14–16]. Recently, Roy and Ojha [17] reviewed QSAR models of different classes of antimalarials based on both chemical classes and specific targets. Gupta et al. [18] analyzed two series of closely related antimalarial agents, 7-chloro-4-(3′,5′-disubstituted anilino)quinolines, using Combinatorial Protocol in Multiple Linear Regression (CP-MLR) for the structure–activity relations with more than 450 topological descriptors for each set. Agrawal et al. [19] reported a QSAR model using topological descriptors (Wiener (W) and Szeged (Sz) indices) for a series of antimalarial sulfonamide derivatives (2,4-diamino-6-quinazoline sulfonamides). Ojha and Roy reported QSAR studies of antimalarial compounds acting on different targets (dihydroorotate dehydrogenase (DHOD) [20], plasmepsin-II [21]) and belonging to different chemical classes (aryltriazolylhydroxamates [22] and cycloguanil [23] derivatives) using physicochemical, electronic, topological, structural and thermodynamic descriptors. The emergence of drug resistance and the relative reluctance of the pharmaceutical industry to invest massively in the development of drugs that would offer only limited marketing prospects are the major constraints in antimalarial drug discovery. Endochin (a 4(1H)-quinolone derivative) is a previously discovered lead (identified in the 1940s) which was used as a causal prophylactic and potent erythrocytic stage agent in avian malaria models [24].
The mechanism of action of endochin analogs is not clear. However, preliminary evidence based on oxygen consumption by P. yoelii infected erythrocytes indicates that the 4(1H)-quinolones target the cytochrome bc1 complex, which is responsible for the parasite's respiration [25]. In this study, we have developed QSAR models for antimalarial endochin analogs using five types of chemometric tools: stepwise regression, factor analysis followed by multiple linear regression (FA-MLR), factor analysis followed by partial least squares (FA-PLS), genetic function approximation (GFA) and genetic partial least squares (G/PLS). We have also developed QSAR models using another approach, in which we selected the variables by partitioning the total pool of descriptors into three categories [(i) topological descriptors, (ii) structural and thermodynamic descriptors and (iii) atom type descriptors] followed by stepwise regression for each category. Final models were developed from the selected descriptors (obtained from stepwise regression) using the GFA and G/PLS techniques. We did so in order to select important descriptors from the initial pool (“descriptor-thinning”), thus reducing noise in the input. Selection of descriptors is a very important step in QSAR analysis. In this paper, we have tried to emphasize the importance of descriptor-thinning prior to the feature selection step. Finally, we have compared the statistical quality of the models obtained using the different chemometric tools. Although many researchers have reported various variable selection strategies such as genetic algorithms, k-nearest neighbor based methods and recursive partitioning in the literature [26–30], in the present work we have tried to select an optimal subset of variables using a novel strategy of partitioning descriptors based on their original classes.
According to a recent comprehensive review [31], establishing a really useful statistical predictor model for a biomedical system requires attention to several aspects, such as selection of a benchmark data set, development of an unambiguous algorithm, extensive validation and mechanistic interpretation of the developed models, all of which have been attended to in the present report.
2. Material and methods
2.1. The data set
For the development of QSAR models in this study, we selected a series of 53 endochin analogs (4(1H)-quinolone derivatives) with antimalarial activity against the clinically relevant multidrug resistant malarial strain TM-90-C2B (chloroquine, mefloquine, pyrimethamine and atovaquone resistant) (Table 1) [32]. Concentration-response data were analyzed by a non-linear regression logistic dose-response model and the EC50 value (the concentration of a drug that gives half-maximal response) for each compound was determined [33]. The effective concentration [EC50 (nM)] values were converted to the logarithmic scale pEC50 (M) [pEC50 = log10(10^9/EC50 (nM))] for the present QSAR work. The logarithmic conversion of the biological activity data (EC50, ED50, etc.) is customary (and also a requirement) in QSAR analysis because the log dose-response curve is sigmoidal, whereas the dose-response curve on a linear scale is hyperbolic; a significant portion of the former is therefore approximately linear, which is easier to model [34,35]. The logarithmic scale is also preferred because the activity values cover several orders of magnitude, so that the use of logarithms gives more manageable numbers.
2.2. QSAR
2.2.1. Descriptors
Descriptors are numeric quantities containing chemical information about molecular structures. Depending on the algorithm of calculation, the method of experimental determination or the concept of origin, descriptors may be classified into several classes. In the present work, we developed QSAR models using a set of well-known topological descriptors (E-state index, kappa shape index, molecular connectivity index, subgraph count) along with thermodynamic (ALogP, ALogP98, MolRef, MR, LogP, atom type logP contribution) and structural (H-bond donor, H-bond acceptor, Rotlbonds) descriptors. All the descriptors were calculated using the Descriptor+ module of the Cerius2 version 4.10 software [36].
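The conversion of EC50 (nM) to pEC50 (M) described in Section 2.1 can be illustrated with a minimal Python sketch. The function name `pec50_from_nm` and the EC50 values below are hypothetical and serve only as an illustration; they are not taken from the data set.

```python
import math

def pec50_from_nm(ec50_nm: float) -> float:
    """Convert an EC50 given in nM to pEC50 on the molar scale."""
    if ec50_nm <= 0:
        raise ValueError("EC50 must be positive")
    # EC50 (M) = EC50 (nM) * 1e-9, hence
    # pEC50 = -log10(EC50 in M) = log10(1e9 / EC50 in nM)
    return math.log10(1e9 / ec50_nm)

# Hypothetical EC50 values (nM), chosen only for illustration
for ec50 in (1.0, 100.0, 10000.0):
    print(f"EC50 = {ec50:g} nM -> pEC50 = {pec50_from_nm(ec50):.3f}")
```

A tenfold decrease in EC50 raises pEC50 by one unit, which is why the logarithmic scale keeps activities that span several orders of magnitude within a narrow numeric range.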
The categorical list of the descriptors used in the development of QSAR models is reported in Table 2. Definitions of some selected descriptors are given in the Supplementary materials section. Variables with zero or near-zero variance were omitted from the initial pool of descriptors, since they carry no information for regression. The pool of descriptors on which the variable selection strategies were applied contained 79 descriptors. The Excel sheets of the whole pool of descriptors as well as the reduced pool of descriptors have been uploaded in the Supplementary materials section.
2.2.2. Selection of training and test sets
In this study, the whole data set (n = 53) was divided into an internal evaluation (training) set (n = 39) and an external evaluation (test) set (n = 14) (approximately 75% and 25%, respectively, of the total number of compounds) based on a factor analysis of the descriptor matrix leading to a PCA score plot (Fig. 1) obtained using SPSS software [37]. The test set compounds were selected manually from the PCA score plot so that each test set compound lies near at least one of the training set compounds, in keeping with the similarity principle of QSAR [38]. For regression based QSARs, normal distribution of the response variable of the data set is a requirement [34,35]. The normality of the response values of the training set compounds was checked using the Shapiro-Wilk test [39] and the Kolmogorov-Smirnov test [40]. These tests check whether the data are compatible with a normally distributed population, and the decision is based on the p-value obtained from each test. If the p-value is greater than 0.05, the hypothesis that the response values of the training set compounds are normally distributed cannot be rejected at the 5% level. The obtained results are summarized in Table 3.
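As an illustration of the Kolmogorov-Smirnov check mentioned above, the following Python sketch computes only the test statistic d, i.e. the largest distance between the empirical distribution of the responses and a normal distribution fitted to them. It is a minimal sketch under stated assumptions: the function name `ks_statistic_normal` is hypothetical, the normal parameters are estimated from the sample itself, the sample is a small subset of training set pEC50 values from Table 1, and the p-value (which the authors obtained from statistical software) is not computed here.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(values):
    """Kolmogorov-Smirnov distance between a sample and a fitted normal."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

# Observed pEC50 of compounds 1, 2, 4, 5, 6, 8, 9 and 11 (Table 1);
# a subset used for illustration only, not the full training set
sample = [7.332, 4.903, 7.400, 5.652, 6.270, 7.110, 7.824, 6.728]
print(f"KS distance d = {ks_statistic_normal(sample):.4f}")
```

A small d means the sample follows the fitted normal curve closely; turning d into a p-value requires the appropriate reference distribution, which is best left to a statistics package.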
2.2.3. Development of different QSAR models
In this study, we developed QSAR models using five different chemometric tools: stepwise regression, factor analysis followed by MLR (FA-MLR), factor analysis followed by PLS (FA-PLS), GFA and G/PLS. We also developed QSAR models using another approach, in which descriptor-thinning was performed prior to feature selection. In this case, we first partitioned the whole set of descriptors into three categories: (i) topological descriptors, (ii) structural and thermodynamic descriptors and (iii) atom type descriptors. We then performed stepwise regression separately for each of the three categories to select variables. The set of variables thus selected contains representatives of all three categories. The final models were developed from the pool of selected descriptors (after “descriptor-thinning”) using the GFA and G/PLS techniques. The reduced pool of descriptors on which GFA or G/PLS was applied contained 10 descriptors (compared with 79 in the initial pool). Finally, we compared the statistical quality of the models obtained with the different chemometric tools. The flow chart of the model development from variable selection by stepwise regression followed by GFA and G/PLS is given in Fig. 2.
2.2.3.1. Stepwise regression. In the stepwise regression method [41], a multiple-term linear equation is built step by step: an initial model is identified and then repeatedly altered by adding or removing a predictor variable according to the “stepping criteria”. The main approaches of stepwise regression are forward selection and backward elimination.
Forward selection starts with no variables in the model, tries out the variables one by one and includes those that are ‘statistically significant’. Backward elimination starts with all the candidate variables and deletes, one by one, those that are not statistically significant. The stepwise regression method combines the two approaches, testing at each stage for variables to be included or excluded. In this study, we used the “stepping criteria” F = 4 to enter and F = 3.9 to remove. The criteria “F to Enter” and “F to Remove” determine how significant or insignificant the contribution of a variable to the regression equation must be for the variable to be added to or removed from the equation, respectively. The F value used in this test is the square of the t value of the incoming variable and thus indicates the significance of the corresponding regression coefficient.
2.2.3.2. Partial least squares (PLS). PLS is a generalization of regression which is particularly suited to cases in which the matrix of predictors has more variables than observations and in which there is multicollinearity among the X values. PLS is used to find the fundamental relations between the X and Y matrices, i.e. it is a latent variable approach to modeling the covariance structures in these two spaces. Application of PLS allows the construction of larger QSAR equations while still avoiding overfitting and eliminating most variables. In such cases PLS is statistically more robust than MLR, because standard regression fails [42]. However, MLR is more straightforward in calculation than PLS, as it does not require the calculation of latent variables or the optimization of the number of components. PLS is normally used in combination with cross-validation to obtain the optimum number of components. This ensures that the QSAR equations are selected based on their ability to predict the data rather than to fit the data [43]. In this study, the leave-one-out method was used to select the optimum number of components [44]. Based on the standardized regression coefficients, the variables with smaller coefficients were removed from the PLS regression until there was no further improvement in the $Q^2$ value, irrespective of the number of components.
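The leave-one-out $Q^2$ used above as the selection criterion can be sketched in a few lines of Python. This is a minimal illustration that assumes a single descriptor and an ordinary least squares line in place of a PLS model; the function names `fit_line` and `q2_loo` and the data are hypothetical. $Q^2$ is computed here as 1 − PRESS/SS, where PRESS is the sum of squared leave-one-out prediction errors and SS is the sum of squared deviations of the observed responses from their mean.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit y = a + b*x for one descriptor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def q2_loo(xs, ys):
    """Leave-one-out cross-validated Q2 = 1 - PRESS / SS_total."""
    n = len(ys)
    mean_y = sum(ys) / n
    press = 0.0
    for i in range(n):
        # Refit without compound i, then predict the held-out compound
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total

# Hypothetical descriptor values and responses, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [5.1, 5.9, 6.4, 7.2, 7.6, 8.5]
print(f"Q2 (LOO) = {q2_loo(x, y):.3f}")
```

Because every prediction is made by a model that never saw the held-out compound, $Q^2$ rewards predictive ability rather than goodness of fit, which is the rationale given in the text for using it to choose the number of components.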
2.2.3.3. Factor analysis (FA). Factor analysis [45,46] is a data reduction technique. The principal objectives of factor analysis are to display multidimensional data in a space of lower dimensionality with minimum loss of information (explaining >95% of the variance of the data matrix) and to extract the basic features behind the data with the ultimate goal of interpretation and/or prediction. The main applications of factor analysis are therefore to reduce the number of variables and to detect structure in the relationships between variables. Factor analysis can be used as a data-preprocessing step to identify the important predictor variables contributing to the response variable and to avoid collinearities among them. The information obtained about the collinearities among the predictor variables can be used to reduce the number of variables in the descriptor pool. In a typical factor analysis procedure, the data matrix is first standardized, and the correlation matrix and subsequently the reduced correlation matrix are constructed. An eigenvalue problem is then solved and the factor pattern is obtained from the corresponding eigenvectors. Factor analysis was performed on the dataset(s) containing the biological activity and all the descriptor variables to be considered. Only factors explaining >5% of the variance of the data matrix were considered. The factors were extracted by the principal component method and then rotated by VARIMAX rotation (a rotation used in principal component analysis in which the axes are rotated to a position where the sum of the variances of the loadings is the maximum possible) to obtain Thurstone's simple structure. The simple structure is characterized by the property that as many variables as possible fall on the coordinate axes when presented in common factor space, so that the largest possible number of factor loadings becomes zero. This is done to obtain a numerically comprehensive picture of the relatedness of the variables. Only variables with non-zero loadings in factors where the biological activity also has a non-zero loading were considered important in explaining the variance of the activity. Further, variables with non-zero loadings in different factors were combined in a multivariate equation.
2.2.3.4. FA-MLR. Factor analysis was performed here as a preprocessing step to identify important variables for MLR. It reduces the large number of variables to a few factors from which important variables can be identified for multiple linear regression analysis. In this study, the principal component method was used to extract the factors, which were then rotated by VARIMAX rotation to obtain Thurstone's simple structure as described above. Multiple linear regression was performed using the effective variables selected from the rotated component matrix.
2.2.3.5. FA-PLS. FA-PLS is the combination of factor analysis (FA) and partial least squares (PLS), where FA is used for the initial selection of descriptors, followed by PLS for model development. FA reduces the variables to a few latent factors from which important variables are selected for PLS regression. In this study, the leave-one-out (LOO) method was used for the selection of the optimum number of components.
2.2.3.6. Genetic function approximation (GFA).
The GFA algorithm (Rogers and Hopfinger) [47] is a statistical tool which offers a relatively new approach to the problem of building QSAR and quantitative structure-property relationship (QSPR) models. The GFA algorithm allows the construction of multiple models instead of a single model as is obtained from a stepwise regression run and it is possible to select the best model from among the comparable models based on statistical
Table 1 Structural features, observed [32] and calculated antimalarial activity values of endochin analogs.
[General structure of the endochin analogs showing substituent positions R1 to R6]
Obs. = observed activity (pEC50) (M) [32]; M1 to M7 = calculated activity (pEC50) (M).

Sl No. | R1 | R2 | R3 | R4 | R5 | R6 | Obs. | M1 | M2 | M3 | M4 | M5 | M6 | M7
1 | −H | −H | −OCH3 | −H | −C7H15 | −H | 7.332 | 7.548 | 7.658 | 7.641 | 7.088 | 7.015 | 7.267 | 7.121
2 | −H | −H | −OCH3 | −H | −H | −H | 4.903 | 6.049 | 5.870 | 6.085 | 5.265 | 5.269 | 5.540 | 5.701
3* | −H | −H | −H | −H | −C7H15 | −H | 6.979 | 6.832 | 6.836 | 6.818 | 6.615 | 6.553 | 7.074 | 7.009
4 | −H | −H | −H | −H | −C9H19 | −H | 7.400 | 7.327 | 7.457 | 7.350 | 7.457 | 7.385 | 7.694 | 7.539
5 | −H | −H | −H | −H | [structure] | −H | 5.652 | 5.383 | 5.952 | 6.092 | 5.279 | 5.424 | 5.097 | 5.125
6 | −H | −H | −H | −H | [structure] | −H | 6.270 | 5.524 | 5.323 | 5.247 | 5.432 | 5.614 | 5.881 | 6.055
7* | −H | −H | −H | −H | [structure] | −H | 5.207 | 5.348 | 5.530 | 5.426 | 5.136 | 5.073 | 5.352 | 5.318
8 | −H | −H | −H | −H | [structure] | −H | 7.110 | 6.244 | 6.168 | 6.196 | 6.536 | 6.642 | 6.926 | 6.813
9 | −H | −H | −H | −H | [structure] | −H | 7.824 | 7.429 | 7.323 | 7.248 | 7.589 | 7.540 | 7.744 | 7.742
10* | −H | −H | −H | −H | [structure] | −H | 7.472 | 7.376 | 7.123 | 7.067 | 7.251 | 7.201 | 7.744 | 7.742
11 | −H | −H | −H | −H | [structure] | −H | 6.728 | 7.400 | 7.328 | 7.253 | 7.240 | 7.165 | 7.220 | 7.061
12 | −H | −H | −H | −H | [structure] | −H | 5.827 | 5.855 | 6.059 | 6.031 | 6.348 | 6.324 | 6.143 | 6.348
13 | −H | −OCH3 | −H | −H | −Br | −H | 5.934 | 6.549 | 6.918 | 6.964 | 5.983 | 6.728 | 5.314 | 6.493
14 | −H | −CH3 | −H | −H | −Br | −H | 6.583 | 6.455 | 6.355 | 6.409 | 6.779 | 6.574 | 7.202 | 6.620
15 | −H | −Br | −H | −H | −Br | −H | 6.470 | 6.644 | 7.281 | 7.192 | 6.814 | 6.785 | 6.678 | 6.563
16 | −H | −Cl | −H | −H | −Br | −H | 6.900 | 6.534 | 6.443 | 6.491 | 6.661 | 6.713 | 6.498 | 6.597
17 | −H | −F | −H | −H | −Br | −H | 6.903 | 6.360 | 6.336 | 6.436 | 6.538 | 6.628 | 6.541 | 6.738
18 | −H | −Cl | −H | −H | [structure] | −H | 4.366 | 4.537 | 4.471 | 4.567 | 4.324 | 4.351 | 4.297 | 4.184
19* | −OCH3 | −H | −H | −H | −Bn | −H | 5.675 | 5.930 | 6.299 | 6.218 | 5.510 | 5.436 | 5.544 | 5.430
20* | −H | −OCH3 | −H | −H | −Bn | −H | 4.624 | 5.813 | 6.293 | 6.213 | 5.510 | 5.436 | 5.544 | 5.430
21* | −H | −H | −OCH3 | −H | −Bn | −H | 5.859 | 5.771 | 6.293 | 6.214 | 5.510 | 5.436 | 5.544 | 5.430
22 | −OCH3 | −H | −H | −H | −Ph | −H | 5.295 | 6.021 | 6.077 | 6.029 | 5.783 | 5.758 | 6.073 | 6.256
23 | −H | −H | −OCH3 | −H | −Ph | −H | 6.570 | 5.827 | 6.072 | 6.025 | 5.783 | 5.755 | 6.073 | 6.256
24 | −H | −H | −H | −OCH3 | −Ph | −H | 5.779 | 5.965 | 6.078 | 6.031 | 5.783 | 5.758 | 6.073 | 6.256
25 | −Cl | −H | −H | −H | −Bn | −H | 5.279 | 5.737 | 5.801 | 5.720 | 5.359 | 5.289 | 5.191 | 5.108
26 | −H | −Cl | −H | −H | −Bn | −H | 5.485 | 5.669 | 5.810 | 5.731 | 5.359 | 5.289 | 5.191 | 5.108
27 | −H | −H | −Cl | −H | −Bn | −H | 4.805 | 5.645 | 5.806 | 5.727 | 5.359 | 5.289 | 5.191 | 5.108
28 | −Cl | −H | −H | −H | −Ph | −H | 5.324 | 5.880 | 5.587 | 5.538 | 5.642 | 5.763 | 5.719 | 5.934
29* | −H | −Cl | −H | −H | −Ph | −H | 6.110 | 5.794 | 5.595 | 5.548 | 5.642 | 5.684 | 5.719 | 5.934
30 | −H | −H | −Cl | −H | −Ph | −H | 5.731 | 5.765 | 5.591 | 5.545 | 5.642 | 5.684 | 5.719 | 5.934
31 | −Cl | −Cl | −H | −H | −Ph | −H | 5.493 | 6.199 | 5.854 | 5.822 | 6.504 | 6.598 | 6.398 | 6.055
32 | −H | −Cl | −Cl | −H | −Ph | −H | 6.456 | 6.095 | 5.858 | 5.829 | 6.504 | 6.522 | 6.398 | 6.055
33 | −Cl | −H | −Cl | −H | −Ph | −H | 5.464 | 6.168 | 5.854 | 5.826 | 6.504 | 6.522 | 6.398 | 6.055
34* | −H | −OCH3 | −H | −H | −Bn | −CH3 | 5.094 | 6.171 | 5.118 | 4.944 | 5.660 | 5.584 | 5.487 | 5.356
35 | −H | −H | −OCH3 | −H | −Bn | −CH3 | 5.106 | 6.139 | 5.120 | 4.948 | 5.660 | 5.584 | 5.487 | 5.356
36* | −H | −F | −OCH3 | −H | [structure] | −H | 7.081 | 6.030 | 6.212 | 6.230 | 6.556 | 6.573 | 6.837 | 6.489
37 | −H | −Cl | −OCH3 | −H | [structure] | −H | 7.815 | 6.204 | 6.344 | 6.316 | 6.650 | 6.660 | 6.837 | 6.489
38 | −H | −Br | −OCH3 | −H | [structure] | −H | 7.602 | 6.306 | 7.190 | 7.026 | 6.723 | 7.877 | 6.837 | 7.582
39* | −H | −F | −OCH3 | −H | −C2H5 | −H | 6.793 | 6.881 | 6.385 | 6.607 | 7.153 | 6.308 | 6.729 | 6.489
40* | −H | −Cl | −OCH3 | −H | −C2H5 | −H | 7.559 | 7.072 | 6.536 | 6.714 | 7.291 | 6.411 | 6.679 | 6.489
41* | −H | −Br | −OCH3 | −H | −C2H5 | −H | 7.440 | 7.188 | 7.391 | 7.434 | 7.448 | 7.637 | 6.843 | 7.582
42 | −H | −Cl | −OCH3 | −H | −C7H15 | −H | 8.530 | 8.062 | 7.957 | 7.937 | 7.998 | 7.964 | 8.060 | 8.154
43 | −H | −Cl | −OCH3 | −H | −C9H19 | −H | 8.799 | 8.599 | 8.591 | 8.472 | 8.858 | 8.814 | 8.680 | 8.685
44 | −H | −Cl | −OCH3 | −H | [structure] | −H | 6.401 | 6.224 | 6.999 | 7.173 | 6.537 | 6.510 | 6.083 | 6.270
45 | −H | −Cl | −OCH3 | −H | [structure] | −H | 7.164 | 6.172 | 6.571 | 6.507 | 6.382 | 6.347 | 6.338 | 6.464
46 | −H | −Cl | −OCH3 | −H | −CH3 | −H | 7.564 | 6.781 | 6.279 | 6.487 | 7.339 | 6.245 | 7.320 | 6.055
47* | −H | −Cl | −OCH3 | −H | −iPr | −H | 6.712 | 7.270 | 6.665 | 6.821 | 7.475 | 6.363 | 6.762 | 6.489
48 | −H | −Cl | −OCH3 | −H | −iBu | −H | 7.199 | 7.394 | 6.925 | 7.046 | 6.586 | 6.568 | 6.690 | 6.922
49 | −H | −Cl | −OCH3 | −H | −H | −H | 5.731 | 6.512 | 6.137 | 6.366 | 6.824 | 6.127 | 6.334 | 6.055
50 | −H | −Cl | −OCH3 | −H | [structure] | −H | 5.376 | 5.209 | 5.271 | 5.365 | 5.418 | 5.480 | 5.445 | 5.539
51* | −H | −Cl | −OCH3 | −H | [structure] | −H | 7.830 | 8.127 | 7.646 | 7.675 | 7.840 | 7.829 | 8.118 | 8.227
52 | −H | −Cl | −OCH3 | −H | [structure] | −H | 7.975 | 8.165 | 7.825 | 7.835 | 8.144 | 8.133 | 8.118 | 8.227
53 | −H | −Cl | −OCH3 | −H | [structure] | −H | 7.271 | 7.897 | 7.873 | 7.914 | 7.746 | 7.736 | 7.724 | 7.793

* denotes test set compounds. [structure] marks an R5 substituent given as a drawn structure.
quality, predictive ability and mechanistic interpretability. Genetic algorithms are derived from an analogy with the evolution of DNA [47]. The GFA algorithm was developed from the concept of Holland's genetic algorithm (1975) [48] and Friedman's (1990) multivariate adaptive regression splines (MARS) algorithm [49]. In the GFA algorithm, an individual or model is represented as a one-dimensional string of bits. In this technique, an initial population of equations is built by random selection of descriptors, followed by crossover between pairs of those equations. A fitness function, the “Lack of Fit (LOF)”, is used to estimate the quality of an individual or model. The model quality improves as the value of LOF diminishes. The error measurement term LOF is determined by the following equation:

$$\mathrm{LOF} = \frac{\mathrm{LSE}}{\left(1 - \dfrac{c + dp}{M}\right)^{2}} \tag{1}$$
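A short Python sketch may help to show how the penalty in Eq. (1) behaves. Here LSE is the least squares error, c the number of basis functions other than the constant term, p the total number of features in all basis functions, M the number of training set samples and d the smoothing parameter. The function name `lack_of_fit` and the LSE value are hypothetical; M = 39 and d = 1.00 follow the settings reported in this study.

```python
def lack_of_fit(lse, c, p, m, d=1.0):
    """Lack-of-fit score of Eq. (1): LSE / (1 - (c + d*p)/M)^2."""
    penalty = 1.0 - (c + d * p) / m
    if penalty <= 0:
        raise ValueError("model too complex for the number of samples")
    return lse / penalty ** 2

# Same (hypothetical) least squares error, different model sizes,
# with M = 39 training compounds and smoothness d = 1.00
for n_terms in (2, 4, 6):
    score = lack_of_fit(lse=0.20, c=n_terms, p=n_terms, m=39, d=1.0)
    print(f"{n_terms} linear terms: LOF = {score:.4f}")
```

For a fixed LSE the score grows as terms are added, so a larger equation is preferred only when it lowers the error enough to offset the penalty; raising d strengthens this effect.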
Table 2 Categorical list of the descriptors used in the development of QSAR model.

Category of descriptors | Name of the descriptors
Topological | Jx, 1κ, 2κ, 3κ, 1κam, 2κam, 3κam, φ, SC-0, SC-1, SC-2, SC-3_P, SC-3_C, 0χ, 1χ, 2χ, 3χp, 3χc, 0χv, 1χv, 2χv, 3χvp, 3χvc, Wiener, Zagreb, S_sCH3, S_dCH2, S_ssCH2, S_dsCH, S_aaCH, S_sssCH, S_tsC, S_dssC, S_aasC, S_ssssC, S_ssNH, S_ddsN, S_sssN, S_dO, S_ssO, S_sF, S_sCl, S_sBr
Structural | MW, Rotlbonds, Hbond acceptor, Hbond donor
Thermodynamic | AlogP, AlogP98, MR, MolRef, LogP
AlogP_atom type | Atype_C_1, Atype_C_2, Atype_C_3, Atype_C_5, Atype_C_11, Atype_C_15, Atype_C_16, Atype_C_17, Atype_C_19, Atype_C_22, Atype_C_24, Atype_C_25, Atype_C_26, Atype_C_40, Atype_H_46, Atype_H_47, Atype_H_50, Atype_H_52, Atype_O_58, Atype_O_60, Atype_O_61, Atype_N_70, Atype_N_71, Atype_N_76, Atype_F_84, Atype_Cl_89, Atype_Br_94
In Eq. (1), c is the number of basis functions (other than the constant term); d is the smoothing parameter (adjustable by the user); M is the number of samples in the training set; LSE is the least squares error and p is the total number of features contained in all basis functions. Once the models in the population have been rated using the LOF score, the genetic crossover operation is repeatedly performed. Two good models are probabilistically selected as parents, each parent is randomly cut into two pieces, and a new model (child) is generated using a piece from each parent. After many mating steps, i.e., genetic crossover type operations, the average fitness of the individuals (models) in the population increases as good combinations of genes are discovered and spread through the population. The models developed by GFA can use linear polynomials, higher order polynomials, splines and Gaussians. In this study, we used both linear and spline terms for model development. Cerius2 version 4.10 [36] was used for the development of the GFA models. The mutation probabilities were kept at 50% with 5000 iterations (in our previous experience with several data sets, a GFA run with more than 5000 iterations leads to models with either poorer predictive ability, as evidenced by a lower $Q^2$, or no further improvement in predictive ability over the models obtained at 5000 iterations). The smoothness (d) was kept at 1.00. The smoothness parameter adjusts the penalty applied to the score of an equation for its number of descriptors: an equation obtained with a higher value of the smoothness parameter will have fewer descriptors than one obtained with a lower value. The initial equation length (number of descriptors) was set at four and no fixed length of the final equations was set.
2.2.3.7. G/PLS. G/PLS [42,47] is a statistical tool that combines the best features of GFA and partial least squares (PLS).
Both of these chemometric tools are especially useful when the number of descriptors (independent variables) is comparable to or greater than the number of compounds (data points). The PLS method is also used to develop stable, correct and highly predictive models even for correlated descriptors [50–52]. GFA is used to select the appropriate variables to be used in the development of a model, followed by PLS regression as a fitting technique to weigh the relative contributions of the selected variables in the final model. G/PLS retains the ease of interpretation of GFA by back-transforming the PLS components to the original variables. Application of G/PLS thus allows the construction of larger QSAR equations while avoiding overfitting and eliminating most variables. The cross-validation was performed by the leave-one-out (LOO) procedure. The initial equation length was set at 4, while the mutation probability was kept at 50% with 1000 iterations (in our previous experience with several data sets, a G/PLS run with more than 1000 iterations leads to models with either poorer predictive ability, as evidenced by a lower $Q^2$, or no further improvement in predictive ability over the models obtained at 1000 iterations), and the option of no fixed length of the final equations was selected.

Fig. 1. PCA score plot of training set and test set compounds.

Table 3 Statistical results of normality distribution of training set compounds.

Response variable | Statistical test | Obtained value
pEC50 | Shapiro-Wilk | W = 0.97447, p = 0.50864
pEC50 | Kolmogorov-Smirnov | d = 0.11791, p > 0.20
2.2.4. Model quality The statistical parameters used to check the quality of models were correlation coefficient (R), variance ratio (F), standard error of estimate (s) and adjusted R 2(Ra2). But, these statistical parameters are not sufficient to check the predictive ability of the models. Thus, to further check the predictive ability of the models, we have used some other statistical parameters such as classical ones like Q 2 and R 2pred (for internal validation and external validation respectively), 2 2 2 rm and also some novel metrics like rm (LOO), ðLOOÞ and Δrm(LOO) 2 2 2 for internal validation, rm(test), rmðtest Þ and Δrm(test) for external valida2 2 c 2 2 tion, rm (overall), rm ðoverallÞ and Δrm(overall) for overall validation and Rp for model randomization [23,53] (see below for details). We have also checked the models quality according to the sum of ranking differences validation criteria [54,55]. We have also calculated the variance inflation factor (VIF) [56] for MLR models to check presence of multicollinearity. VIF quantifies the severity of multicollinearity in an ordinary least squares regression analysis. It is a measure of how much the variance of an estimated regression coefficient is increased because of collinearity. VIF can be calculated by following equation:
$$\mathrm{VIF} = \frac{1}{1 - R_i^2} \qquad (2)$$
where $R_i^2$ is the unadjusted $R^2$ obtained when $X_i$ is regressed against all the other explanatory variables in the model, that is, against a constant, $X_2$, $X_3$, …, $X_{i-1}$, $X_{i+1}$, …, $X_k$. A VIF value greater than 5 indicates very high multicollinearity.
2.2.5. Validation methods
The robustness of the models should be verified using different validation criteria. The developed models were validated using internal validation, external validation, overall validation and randomization techniques. In internal validation, the predictive ability of a model is checked using the training set compounds only, whereas external validation uses a new set (test set) of compounds that were not used in the model development process. Both the training and test sets are used for the overall validation of the models.
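As a minimal sketch of the VIF calculation in Eq. (2), the following Python snippet regresses each descriptor column on a constant plus the remaining columns and converts the resulting $R_i^2$ into a VIF. The function name `vif` and the synthetic descriptor values are illustrative assumptions; the study itself reports VIF for the MLR models without giving code.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (n_samples x n_descriptors), Eq. (2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for i in range(k):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])   # constant + other descriptors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2_i = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[i] = 1.0 / (1.0 - r2_i)
    return out

# Synthetic illustration (not data from the study): column c nearly duplicates a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

In this toy case the two nearly collinear columns exceed the threshold of 5 mentioned above, while the independent column stays close to 1.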
Fig. 2. Schematic diagram of model development with application of descriptor-thinning prior to feature selection.
2.2.5.1. Internal validation. The cross-validated squared correlation coefficient ($Q^2$) and the predicted residual sum of squares (PRESS) [57], based on the observed and predicted activity data of the training set molecules, can be used as internal validation metrics. In the present study, we performed leave-one-out (LOO) cross-validation (jackknife test) for internal validation: each compound of the training set is deleted once, a model is built with the remaining compounds, and the activity of the deleted compound is predicted using that model. The cycle is repeated until all the compounds have been deleted once. We also performed leave-many-out (LMO) (25% out) cross-validation, in which 25% of the compounds are eliminated in each cycle. For each cycle, the model is built on the remaining molecules of the dataset and, as in the LOO technique, the activity of the deleted compounds is predicted using the developed model. The predicted activity data obtained for all the training set compounds are used for the calculation of the above
mentioned internal validation parameters. The cross-validated squared correlation coefficient $R^2$ (LOO-$Q^2$) is calculated according to the following equation:
$$Q^2 = 1 - \frac{\mathrm{PRESS}}{\sum \left(Y_{obs(training)} - \bar{Y}_{training}\right)^2} \qquad (3)$$
In the above equation, $\bar{Y}_{training}$ represents the average activity value of the training set, while $Y_{obs(training)}$ and $Y_{pred(training)}$ represent the observed and LOO-predicted activity values, respectively, of the training set compounds; PRESS is the sum of squared differences between $Y_{obs(training)}$ and $Y_{pred(training)}$. A high $Q^2$ value ($Q^2 > 0.5$) is often considered proof of high predictive ability of the model [58]. Though internal validation tools like the jackknife test have been criticized by Golbraikh and Tropsha [59], they have been increasingly and widely used by other investigators to examine the quality of various predictive models [60,61].
2.2.5.2. External validation. Predicting the activity/property/toxicity of a new set of compounds is one of the important objectives of QSAR modeling. The value of $Q^2$ signifies the ability of the model to predict the activity of molecules that are very similar to those of the training set. To determine the predictive potential of a QSAR model for a new set (test set or external set) of molecules differing in some aspects from the training set, external validation needs to be performed. The predictive capacity of the model is judged by applying it to predict the activity values of the test set compounds, and $R^2_{pred}$ is calculated according to the following equation (Eq. (4)) [62]:
$$R^2_{pred} = 1 - \frac{\sum \left(Y_{obs(test)} - Y_{pred(test)}\right)^2}{\sum \left(Y_{obs(test)} - \bar{Y}_{training}\right)^2} \qquad (4)$$
In the above equation, $Y_{pred(test)}$ and $Y_{obs(test)}$ indicate the predicted and observed activity values, respectively, of the test set compounds and $\bar{Y}_{training}$ indicates the mean activity value of the training set compounds. The value of $R^2_{pred}$ for an acceptable model should be more than 0.5. Other variants of $R^2_{pred}$ ($Q^2_{ext}$) have also been defined in the literature [63].
2.2.5.3. The $r_m^2$ metrics. The parameter $R^2_{pred}$ has been extensively used for judging the external predictivity of a model. However, it has been shown recently [64,65] that $R^2_{pred}$ may not be sufficient to indicate external predictivity, especially in the case of a test set with a wide activity range, because the value of $R^2_{pred}$ is mainly controlled by the term $\sum (Y_{obs(test)} - \bar{Y}_{training})^2$, i.e., the sum of squared differences between the observed values of the test set compounds and the mean observed value of the training set compounds. To better reflect the external predictive potential of a model, our group recently introduced a novel validation metric $r_m^2$ [64] for external validation [$r^2_{m(test)}$], which can be calculated by the following equation:
$$r^2_{m(test)} = r^2 \left(1 - \sqrt{r^2 - r_0^2}\right) \qquad (5)$$
The value of $r_m^2$ should be greater than 0.5 for an acceptable model. In the above equation, $r^2$ and $r_0^2$ are the determination coefficients of the linear relations between the observed and predicted values of the test set compounds with ($r^2$) and without ($r_0^2$) intercept, respectively. The determination coefficient $r_0^2$ can be calculated by the following equation:
$$r_0^2 = 1 - \frac{\sum \left(Y_{obs} - k\,Y_{pred}\right)^2}{\sum \left(Y_{obs} - \bar{Y}_{obs}\right)^2} \qquad (6)$$
In the above equation, $\bar{Y}_{obs}$ is the mean observed value of the compounds, while $k$ (the slope of the best fit line obtained from correlating the observed and predicted values with the intercept set to zero) can be calculated from the following equation:
$$k = \frac{\sum Y_{obs} Y_{pred}}{\sum \left(Y_{pred}\right)^2} \qquad (7)$$
We have shown that $r^2_{m(test)}$ appears to be a stricter metric for external validation than $R^2_{pred}$ [65]. Initially, the concept of $r_m^2$ was applied only to test set prediction [64,65], but it can also be applied to the training set [$r^2_{m(LOO)}$] using LOO-predicted values of the training set compounds. It has already been shown that $r^2_{m(LOO)}$ is a stricter test for internal validation than $Q^2$ [66]. The metric can also be applied to the whole data set for overall validation [$r^2_{m(overall)}$], based on LOO-predicted values of the training set compounds and predicted values of the test set compounds [66]. The advantages of the $r^2_{m(overall)}$ metric are as follows. (1) Unlike external validation parameters ($R^2_{pred}$ etc.), the $r^2_{m(overall)}$ statistic is not based only on a limited number of test set compounds. It includes predictions for both test set and training set (using LOO predictions) compounds, so it is based on predictions for a comparably large number of compounds. In many cases, the test set is considerably small, and a regression-based external validation parameter may be less reliable and highly dependent on individual test set observations. In such cases, the $r^2_{m(overall)}$ statistic may be advantageous. (2) In many cases, comparable models are obtained where some models show comparatively better values of internal validation parameters and others show relatively superior values of external validation parameters. This may create a problem in selecting the final model. The $r^2_{m(overall)}$ statistic may be used to select the best predictive models from among comparable models. Previously, we calculated the $r_m^2$ value by arbitrarily using the observed response values on the y-axis and the predicted values on the x-axis. However, we have recently discussed [67] that the metric can also be calculated ($r'^2_m$) by interchanging the axes, i.e., with observed response values on the x-axis and predicted response values on the y-axis.
The determination coefficient $r'^2_0$ can be calculated by the following equation:
$$r'^2_0 = 1 - \frac{\sum \left(Y_{pred} - k'\,Y_{obs}\right)^2}{\sum \left(Y_{pred} - \bar{Y}_{pred}\right)^2} \qquad (8)$$
In the above equation, $\bar{Y}_{pred}$ is the mean predicted value of the compounds, while $k'$ (the slope of the best fit line obtained from correlating the predicted and observed values with the intercept set to zero) can be calculated from the following equation:
$$k' = \frac{\sum Y_{obs} Y_{pred}}{\sum \left(Y_{obs}\right)^2} \qquad (9)$$
Our group [67] has proposed the additional validation metrics $\overline{r_m^2}$ and $\Delta r_m^2$, based on the average and the difference, respectively, of the $r_m^2$ and $r'^2_m$ values [$\overline{r^2_m}_{(test)}$ and $\Delta r^2_{m(test)}$ for external validation, $\overline{r^2_m}_{(LOO)}$ and $\Delta r^2_{m(LOO)}$ for internal validation, $\overline{r^2_m}_{(overall)}$ and $\Delta r^2_{m(overall)}$ for overall validation]. The value of $r'^2_m$ may differ from that of $r_m^2$, and the difference between these two metrics ($\Delta r_m^2$) may also be used as a measure of the goodness of predictions. The values of $r_m^2$ and $r'^2_m$ should be close (ideally, equal); any difference between them is due to the presence of errors in the predictions. Thus, the $\overline{r_m^2}$ and $\Delta r_m^2$ metrics may validate the models more precisely than the traditional metrics. In the present QSAR study, we determined the $r_m^2$, $r'^2_m$, $\overline{r_m^2}$ and $\Delta r_m^2$ values of the training set (based on LOO-predicted values), the test set and the whole dataset for the reported models.
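The $r_m^2$ family described above (Eqs. (5) to (9), plus the average and difference metrics) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example pEC50 values are hypothetical, and the `max(..., 0.0)` guard against tiny negative rounding errors under the square root is an addition of this sketch, not part of the published formula.

```python
import numpy as np

def _r2_origin(x, y):
    """Determination coefficient of y regressed on x through the origin."""
    k = (x * y).sum() / (x * x).sum()            # slope, cf. Eqs. (7) and (9)
    return 1.0 - ((y - k * x) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def rm2_metrics(y_obs, y_pred):
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2   # with intercept
    r02 = _r2_origin(y_pred, y_obs)              # observed on y-axis, Eq. (6)
    r02_prime = _r2_origin(y_obs, y_pred)        # axes interchanged, Eq. (8)
    # max(..., 0.0) only guards against floating-point round-off (assumption).
    rm2 = r2 * (1.0 - np.sqrt(max(r2 - r02, 0.0)))              # Eq. (5)
    rm2_prime = r2 * (1.0 - np.sqrt(max(r2 - r02_prime, 0.0)))
    return {
        "r2": r2,
        "rm2": rm2,
        "rm2_prime": rm2_prime,
        "rm2_mean": (rm2 + rm2_prime) / 2.0,
        "rm2_delta": abs(rm2 - rm2_prime),
    }

# Hypothetical observed and predicted pEC50 values, for illustration only.
y_obs = [5.2, 6.8, 7.4, 6.1, 8.0, 5.9]
y_pred = [5.5, 6.6, 7.6, 6.0, 7.9, 6.2]
metrics = rm2_metrics(y_obs, y_pred)
print(metrics)
```

Passing LOO-predicted training values, test set predictions, or both pooled together would give the (LOO), (test) and (overall) variants, respectively.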
Table 4. Calculation of SRD values.

| Chemical compounds | Rankings for original values (pEC50) | Rankings for model M1 | Ranking difference for model M1 | Rankings for model M2 | Ranking difference for model M2 | – | Rankings for model Mn | Ranking difference for model Mn |
|---|---|---|---|---|---|---|---|---|
| 1 | X1 | Y1 | ABS(X1-Y1) | P1 | ABS(X1-P1) | – | T1 | ABS(X1-T1) |
| 2 | X2 | Y2 | ABS(X2-Y2) | P2 | ABS(X2-P2) | – | T2 | ABS(X2-T2) |
| … | … | … | … | … | … | – | … | … |
| … | … | … | … | … | … | – | … | … |
| N | Xn | Yn | ABS(Xn-Yn) | Pn | ABS(Xn-Pn) | – | Tn | ABS(Xn-Tn) |
| | | | SRD for model M1 | | SRD for model M2 | – | | SRD for model Mn |

2.2.5.4. Randomization tests. Another statistical tool used for validation of a model is the randomization test (Y-randomization). There are two
types of randomization tests: process randomization and model randomization. In process randomization, the values of the dependent variable are randomly scrambled and variable selection is done afresh from the whole descriptor matrix. In model randomization, the Y column entries are scrambled and new QSAR models are developed using the same set of variables as present in the non-randomized model. If the score of the non-random QSAR model is significantly better than that of the random models, then that model should be considered statistically robust [68,69]. We have done process randomization for the genetic models, while model randomization results have been reported for all models. The process randomization and model randomization tests were performed at the 90% and 99% confidence levels, respectively. We used the parameter ${}^cR_p^2$ [53], which penalizes the model $R^2$ for small differences between the squared mean correlation coefficient ($R_r^2$) of the randomized models and the squared correlation coefficient ($R^2$) of the non-randomized model. This parameter can be calculated by the following equation:
$${}^cR_p^2 = R \sqrt{R^2 - R_r^2} \qquad (10)$$
The parameter ${}^cR_p^2$ ensures that the models developed are not obtained by chance. For an acceptable QSAR model, the value of ${}^cR_p^2$ should be greater than 0.5.
2.2.5.5. Sum of ranking differences (SRD). This non-parametric method has recently been introduced by Karoly Heberger in the field of chemistry [54,55]. To carry out the ranking of models, the data should be arranged in matrix form. The chemical compounds are arranged in the rows, whereas the models to be compared are arranged in the columns. Each column contains the values calculated by the respective model. A column of reference values is included separately. In the present study, the original measured data [experimental pEC50 (M) values] were used as the reference column, because the models were developed from the original measured data. An average of the values calculated from all models can also be used as the reference, but this ideally requires a large number of models. All the models and the original measured data were ranked in increasing order of activity, and the rankings of all the models were compared with the reference rankings. Finally, the absolute values of the differences between the reference and individual rankings were calculated and summed, giving the sum of ranking differences for each model. The SRD value should be close to zero for a good model, i.e., the rankings should be close to the reference rankings, so the best model can be selected as the one with the minimum SRD. The raw SRD values for the training and test sets cannot be compared directly because the numbers of compounds differ. We therefore calculated scaled SRD values between 0 and 100 using the software CRRN_DNA (downloaded from http://knight.kit.bme.hu/CRRN). Table 4 shows the calculation of the SRD values.
3. Results and discussion
QSAR models were developed using various chemometric tools (FA-MLR, FA-PLS, stepwise regression, GFA and G/PLS) applied to the total pool of descriptors. An attempt was also made to develop models with variable selection using stepwise regression followed by GFA and G/PLS. The developed models were validated using the different validation metrics discussed in the Materials and Methods section. Among all the models, those developed by stepwise regression followed by the GFA and G/PLS techniques were statistically more acceptable than the others. All the statistically significant models are summarized in Tables 5 and 6. Here we discuss in detail only the best two models. Though both linear and spline options were used to develop the genetic models, the models developed with the spline option showed statistically more significant results than those developed using only the linear option.
3.1. Variable selection using stepwise regression followed by GFA
$$\begin{aligned} pEC_{50} = {} & 5.9580(0.1719) + 0.43965(0.04201)\,\mathrm{Rotlbond} + 34.405(5.300)\,\langle J_X - 2.4723 \rangle \\ & - 0.028784(0.004878)\,\langle MR - 24.7006 \rangle - 1.0631(0.1907)\,\mathrm{Atype\_C\_22} \end{aligned} \qquad (1)$$
$n_{training} = 39$; $R^2 = 0.797$; $R_a^2 = 0.773$; $F = 33.41$ (df 4, 34); $s = 0.517$; PRESS = 11.260; $Q^2 = 0.749$; $r^2_{m(LOO)} = 0.568$; $\overline{r^2_m}_{(LOO)} = 0.651$; $\Delta r^2_{m(LOO)} = 0.165$;
$n_{test} = 14$; $R^2_{pred} = 0.808$; $r^2_{m(test)} = 0.791$; $\overline{r^2_m}_{(test)} = 0.704$; $\Delta r^2_{m(test)} = 0.174$; $r^2_{m(overall)} = 0.603$; $\overline{r^2_m}_{(overall)} = 0.685$; $\Delta r^2_{m(overall)} = 0.163$.
In the above equation, $n_{training}$ is the number of compounds used to develop the model and $n_{test}$ is the number of compounds used for external prediction. This model could explain 77.3% of the variance (adjusted coefficient of variation). The leave-one-out (LOO) cross-validated correlation coefficient ($Q^2 = 0.749$) and the leave-many-out (LMO) (25%) cross-validated correlation coefficient ($Q^2_{25\%} = 0.561$), both above the critical value of 0.5, signify the statistical reliability of the model. The predictivity of the model was judged by means of the predictive $R^2$ ($R^2_{pred} = 0.808$), which shows good predictive ability of the model. Values of the $r_m^2$ metrics ($r^2_{m(test)}$, $\overline{r^2_m}_{(test)}$, $r^2_{m(LOO)}$, $\overline{r^2_m}_{(LOO)}$, $r^2_{m(overall)}$ and $\overline{r^2_m}_{(overall)}$) of more than 0.5 and values of the differences between the $r_m^2$ and $r'^2_m$ metrics ($\Delta r^2_{m(test)}$, $\Delta r^2_{m(LOO)}$ and $\Delta r^2_{m(overall)}$) of less than 0.2 imply the goodness of predictions of this model. Using the standardized variable matrix for regression, the significance level of the descriptors was found to be in the following order: Rotlbond, JX, MR and Atype_C_22.
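As a quick consistency check on the statistics reported with Eq. (1), the adjusted $R^2$, the F ratio and $Q^2$ can be recomputed from $n$, the number of model terms, $R^2$, $s$ and PRESS. The sketch below assumes the usual ordinary least squares definitions (in particular, that $s$ is based on $n - p - 1$ degrees of freedom); the variable names are illustrative, and small deviations are expected because the inputs are rounded to three decimals.

```python
# Values reported with Eq. (1) (model M6).
n, p = 39, 4                  # training compounds, model terms
r2, s, press = 0.797, 0.517, 11.260

df_res = n - p - 1                                  # residual degrees of freedom (34)
r2_adj = 1 - (1 - r2) * (n - 1) / df_res            # adjusted R^2
f_ratio = (r2 / p) / ((1 - r2) / df_res)            # variance ratio F
ss_res = s ** 2 * df_res                            # residual sum of squares from s
ss_tot = ss_res / (1 - r2)                          # total sum of squares
q2 = 1 - press / ss_tot                             # Eq. (3)

print(round(r2_adj, 3), round(f_ratio, 2), round(q2, 3))
```

Under these assumptions the recomputed values agree with the reported $R_a^2 = 0.773$, $F = 33.41$ and $Q^2 = 0.749$ to within rounding.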
Table 5. Comparison of statistical quality and validation parameters of different models.

| Model no. | Type of model | Descriptors | LVs | s | $R^2$ | $R_a^2$ | F | $Q^2$ (LOO) | $Q^2$ (LMO, 25%) | $R^2_{pred}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| M1 | Stepwise | Jx, φ, S_tsC | – | 0.678 | 0.642 | 0.611 | 20.90 | 0.588 | – | 0.667 |
| M2 | FA-MLR | φ, S_aaCH, S_ssO, S_tsC, S_ssNH, Atype_Br_94 | – | 0.675 | 0.675 | 0.614 | 11.09 | 0.500 | – | 0.575 |
| M3 | FA-PLS | φ, S_aaCH, S_ssO, S_tsC, S_ssNH, Atype_Br_94, ²κ_am | 2 | 0.603 | 0.664 | 0.645 | 35.51 | 0.546 | – | 0.636 |
| M4 | GFA | <Jx-2.3879>, S_tsC, MR, φ | – | 0.567 | 0.756 | 0.727 | 26.35 | 0.699 | – | 0.803 |
| M5 | G/PLS | <2.8504-³κ>, MR, <1-Atype_Br_94>, φ, <6.05139-S_tsC> | 3 | 0.565 | 0.751 | 0.730 | 35.15 | 0.687 | – | 0.742 |
| M6 | Stepwise followed by GFA | Rotlbond, <Jx-2.4723>, <MR-24.7006>, Atype_C_22 | – | 0.517 | 0.797 | 0.773 | 33.41 | 0.749 | 0.561 | 0.808 |
| M7 | Stepwise followed by G/PLS | <Rotlbond-1>, <1-Atype_Br_94>, <MR-45.0402>, Atype_C_22 | 3 | 0.535 | 0.776 | 0.757 | 40.47 | 0.735 | 0.762 | 0.791 |

The columns s, $R^2$, $R_a^2$ and F describe equation quality.

Rotlbond is an important parameter that indicates the flexibility of the molecules. It counts the number of rotatable bonds in a molecule that are considered to be meaningful for molecular mechanics. The positive regression coefficient of the term Rotlbond indicates that molecules having a higher number of rotatable bonds should be more active. It has been observed that a long alkyl chain substitution at the R5 position (compound nos. 42 and 43) leads to better antimalarial activity, owing to the higher number of rotatable bonds (seven and nine, respectively), than in compounds without a substitution at the R5 position (e.g., compound nos. 2 and 49). It has also been observed that the number of rotatable bonds at the R5 position is the same for compound nos. 43 and 4, but compound 43 contains an additional rotatable bond at the R3 position (−OCH3), and this explains the better antimalarial activity of compound 43 compared with compound no. 4 (unsubstituted at the R3 position). From this observation, it can be inferred that the R5 and R3 positions should be substituted with groups containing rotatable bonds. The parameter Jx, the Balaban index, is a highly discriminating descriptor. The value of this descriptor does not significantly increase with molecular size and the number of rings present in the molecule. The Balaban J index, which characterizes the shape of a molecule based on covalent radii, is defined as follows:
$$J = \frac{q}{\mu + 1} \sum_{\text{edges } ij} \left(\bar{V}_{D_i}\,\bar{V}_{D_j}\right)^{-1/2}$$
In the above equation, i and j are the adjacent vertices, q is the number of edges, μ represents the number of cycles (i.e., μ = 0 for linear graphs), and $\bar{V}_{D_i}$ and $\bar{V}_{D_j}$ are the average distance sums of the vertices i and j, respectively. The spline term <Jx-2.4723> has a positive contribution toward the antimalarial activity. Thus, the numerical value of Jx should be more than 2.4723 for better antimalarial activity (e.g., compound nos. 8 and 46). This
indicates that the presence of cycles or rings in a molecule, or a minimum number of edges, disfavors the antimalarial activity (e.g., compound nos. 5, 26 and 27). From this it can be inferred that the presence of cycles or rings in the molecule disfavors the antimalarial activity. MR refers to the molar refractivity of a molecule, which is a measure of total polarizability. Polarizabilities determine the dynamical response of a bound system to external fields and provide insight into the internal structure. The polarizability of an atom is also proportional to its volume, so MR is also a measure of the volume occupied by an atom or group. The negative regression coefficient of the spline term <MR-24.7006> indicates that the numerical value of MR should be less than 24.7006 for better antimalarial activity. For example, the numerical values of MR are less than 24.7006 for compound nos. 37 and 38, and this explains their better antimalarial activity. The numerical values of MR for compound nos. 25, 18 and 35 are more than 24.7006 (81.91, 69.62 and 86.88, respectively), and this explains their poor antimalarial activity even though these compounds contain two, four and three rotatable bonds, respectively. From this observation, it can be inferred that the volume of the molecules should not be too high. It has been shown previously that atomic contribution values for physicochemical properties are an important guide for correlating the observed biological activity of ligands with their chemical structure [70]. The atom type descriptor Atype_C_22 is calculated using a fragment-based approach for log P prediction. Atype_C_22 refers to the hydrophobicity measure of C in ≡C-R and R=C=R fragments, where R represents any group linked through carbon, = represents a double bond and ≡ represents a triple bond. The negative regression coefficient of Atype_C_22 indicates that it has a detrimental effect on antimalarial activity. Compounds like 18 and 50 showed a lower range of activity due to the presence of the ≡C-R group. From this it can be inferred that the presence of the ≡C-R group in this series of compounds may disfavor the antimalarial activity.
Table 6. Comparison of statistical quality of different models according to $r_m^2$ metrics.

| Model no. | Type of model | $r^2_{m(LOO)}$ | $r'^2_{m(LOO)}$ | $\overline{r^2_m}_{(LOO)}$ | $\Delta r^2_{m(LOO)}$ | PRESS (LOO) | $r^2_{m(test)}$ | $r'^2_{m(test)}$ | $\overline{r^2_m}_{(test)}$ | $\Delta r^2_{m(test)}$ | PRESS (Test) | $r^2_{m(overall)}$ | $r'^2_{m(overall)}$ | $\overline{r^2_m}_{(overall)}$ | $\Delta r^2_{m(overall)}$ | PRESS (Overall) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 | Stepwise | 0.419 | 0.653 | 0.536 | 0.234 | 18.468 | 0.644 | 0.363 | 0.504 | 0.281 | 4.606 | 0.458 | 0.685 | 0.572 | 0.227 | 23.074 |
| M2 | FA-MLR | 0.353 | 0.625 | 0.489 | 0.272 | 22.400 | 0.538 | 0.184 | 0.361 | 0.354 | 5.88 | 0.388 | 0.651 | 0.520 | 0.263 | 28.280 |
| M3 | FA-PLS | 0.515 | 0.281 | 0.398 | 0.234 | 20.360 | 0.604 | 0.296 | 0.450 | 0.308 | 5.03 | 0.547 | 0.284 | 0.416 | 0.263 | 25.390 |
| M4 | GFA | 0.516 | 0.704 | 0.610 | 0.188 | 13.495 | 0.792 | 0.644 | 0.718 | 0.148 | 2.717 | 0.557 | 0.737 | 0.647 | 0.180 | 16.212 |
| M5 | G/PLS | 0.666 | 0.464 | 0.565 | 0.202 | 14.029 | 0.771 | 0.58 | 0.676 | 0.191 | 3.56 | 0.683 | 0.484 | 0.584 | 0.199 | 17.589 |
| M6 | Stepwise followed by GFA | 0.568 | 0.733 | 0.651 | 0.165 | 11.260 | 0.791 | 0.617 | 0.704 | 0.174 | 2.652 | 0.603 | 0.766 | 0.685 | 0.163 | 13.912 |
| M7 | Stepwise followed by G/PLS | 0.716 | 0.538 | 0.627 | 0.178 | 11.860 | 0.752 | 0.674 | 0.713 | 0.078 | 2.892 | 0.723 | 0.569 | 0.646 | 0.154 | 14.753 |
Fig. 3. Graphical comparison of statistical quality of different models according to internal validation parameters.
Fig. 4. Graphical comparison of statistical quality of different models according to external validation parameters.
Fig. 5. Graphical comparison of statistical quality of different models according to overall set validation parameters.
3.2. Variable selection using stepwise regression followed by G/PLS
$$\begin{aligned} pEC_{50} = {} & 7.13877 + 0.43466\,\langle \mathrm{Rotlbond} - 1 \rangle - 0.03748\,\langle MR - 45.0402 \rangle \\ & - 1.12684\,\mathrm{Atype\_C\_22} - 1.08438\,\langle 1 - \mathrm{Atype\_Br\_94} \rangle \end{aligned} \qquad (2)$$
$n_{training} = 39$; $R^2 = 0.776$; $R_a^2 = 0.757$; $F = 40.47$ (df 3, 35); $s = 0.535$; PRESS = 11.860; $Q^2 = 0.735$; $r^2_{m(LOO)} = 0.716$; $\overline{r^2_m}_{(LOO)} = 0.627$; $\Delta r^2_{m(LOO)} = 0.178$;
$n_{test} = 14$; $R^2_{pred} = 0.791$; $r^2_{m(test)} = 0.752$; $\overline{r^2_m}_{(test)} = 0.713$; $\Delta r^2_{m(test)} = 0.078$; $r^2_{m(overall)} = 0.723$; $\overline{r^2_m}_{(overall)} = 0.646$; $\Delta r^2_{m(overall)} = 0.154$.
In the above equation, $n_{training}$ is the number of compounds used to develop the model and $n_{test}$ is the number of compounds used for external prediction. This model could explain 75.7% of the variance (adjusted coefficient of variation). The leave-one-out (LOO) cross-validated correlation coefficient ($Q^2 = 0.735$) and the leave-many-out (LMO) (25%) cross-validated correlation coefficient ($Q^2_{25\%} = 0.762$), both above the critical value of 0.5, signify the statistical reliability of the model. The predictivity of the model was judged by means of the predictive $R^2$ ($R^2_{pred} = 0.791$), which shows good predictive ability of the model. It is interesting to note that the values of $R^2$, $Q^2$ and $R^2_{pred}$ in this equation are very close. Values of the $r_m^2$ metrics ($r^2_{m(test)}$, $\overline{r^2_m}_{(test)}$, $r^2_{m(LOO)}$, $\overline{r^2_m}_{(LOO)}$, $\overline{r^2_m}_{(overall)}$ and $r^2_{m(overall)}$) of more than 0.5 and values of the differences between the $r_m^2$ and $r'^2_m$ metrics ($\Delta r^2_{m(test)}$, $\Delta r^2_{m(LOO)}$ and $\Delta r^2_{m(overall)}$) of less than 0.2 imply the goodness of predictions of this model. Using the standardized variable matrix for regression, the significance level of the descriptors was found to be in the following order: Rotlbond, MR, Atype_C_22 and Atype_Br_94. The parameter Rotlbond (number of rotatable bonds, indicating the flexibility of molecules) has a favorable contribution toward the antimalarial activity, as evidenced by the positive regression coefficient. The spline term indicates that a Rotlbond value of more than one may enhance the antimalarial activity, as discussed earlier for Eq. (1). The negative regression coefficient of the spline term <MR-45.0402> indicates that the numerical value of MR should be less than 45.0402 for better antimalarial activity, as discussed earlier for Eq. (1). The negative regression coefficient of the atom type descriptor Atype_C_22 indicates that it has a detrimental effect on antimalarial activity, as discussed earlier for Eq. (1). The atom type descriptor Atype_Br_94 is calculated using a fragment-based approach for log P prediction. It refers to the hydrophobicity measure of a Br atom attached to an sp2 hybridized carbon atom. The negative regression coefficient of the spline term <1-Atype_Br_94> indicates that the numerical value of Atype_Br_94 should be more than 1 for better antimalarial activity. This indicates that the presence of a Br atom attached to an sp2 hybridized carbon atom may enhance the antimalarial activity. For example, compound no. 15 shows moderate antimalarial activity, as its numerical value of Atype_Br_94 is more than 1. From this it can be inferred that the presence of a -Br fragment in the compounds may favor the antimalarial activity.
4. Comparison of statistical quality of different models
In this study, we have developed QSAR models using five chemometric tools: stepwise regression, factor analysis followed by MLR (FA-MLR), factor analysis followed by PLS (FA-PLS), GFA and G/PLS. We have also developed QSAR models by variable selection using stepwise regression followed by the GFA and G/PLS techniques. Among these models, those developed using variable selection by stepwise regression followed by GFA and G/PLS were found to be statistically more significant than the other models. According to the different validation criteria, it can be clearly seen that model numbers M1-M3 are poorer than model numbers M4-M7 (Tables 5 and 6). This is shown by the internal ($Q^2$, $r^2_{m(LOO)}$, $\overline{r^2_m}_{(LOO)}$ and $\Delta r^2_{m(LOO)}$) (Fig. 3), external ($R^2_{pred}$, $r^2_{m(test)}$, $\overline{r^2_m}_{(test)}$ and $\Delta r^2_{m(test)}$) (Fig. 4) and overall ($r^2_{m(overall)}$, $\overline{r^2_m}_{(overall)}$ and $\Delta r^2_{m(overall)}$) validation parameters (Fig. 5). It has been found that the external prediction ability of model numbers M1-M3 is comparable with that of the other models, but their internal and overall prediction qualities are poorer. It has already been shown [66] that the $r_m^2$ metrics are more stringent validation parameters than the classical metrics. Recently, we have found [67] that, among the $r_m^2$ metrics, the average ($\overline{r_m^2}$) and difference ($\Delta r_m^2$) metrics are more reliable than the original $r_m^2$ metrics. Ideally, $r_m^2$ and $r'^2_m$ should be equal, i.e.,
the difference between $r_m^2$ and $r'^2_m$ should be minimal (< 0.2). A lower value of $\Delta r_m^2$ indicates accuracy of the predicted values of the compounds according to the developed models. For model numbers M6 and M7, with variable selection by stepwise regression followed by the GFA and G/PLS techniques, the values of $\overline{r^2_m}_{(LOO)}$, $\overline{r^2_m}_{(test)}$ and $\overline{r^2_m}_{(overall)}$ are higher than or comparable to those of the other corresponding MLR or PLS models, respectively, and the differences between $r_m^2$ and $r'^2_m$ for the internal set ($\Delta r^2_{m(LOO)}$), external set ($\Delta r^2_{m(test)}$) and overall set ($\Delta r^2_{m(overall)}$) are lower than or comparable to those of the corresponding other MLR or PLS models (Table 6). It has also been found that the PRESS values of model nos. M6 and M7 are lower than those of all other models for the internal (training) and overall sets. As shown in Table 7, all the models bear ${}^cR_p^2$ values greater than the threshold value of 0.5, indicating that all the models developed are robust and not the outcome of mere chance. After comparing the statistical quality of the models developed from the initial pool of descriptors with that of the models developed from the reduced pool of descriptors, it has been confirmed that the QSAR models developed from the reduced pool were better than those obtained from the entire pool. The quality of the models has also been checked by the sum of ranking differences validation method. According to the SRD values of the training set (Table 8) and test set (Table 9), it is clearly seen that models M7 and M6, i.e., the models developed by variable selection by stepwise regression followed by the G/PLS and GFA techniques, are the best two models. For the training set, the order of the models from best to worst according to the goodness of prediction is M7 > M6 > M4 > M5 > M2 > M3 > M1. For the test set, the order of the models from best to worst is M7 > M6 > M5 > M3 > M4 > M2 > M1.
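The SRD procedure behind these orderings (Section 2.2.5.5 and Table 4) can be sketched as follows. This is a simplified illustration, not the CRRN_DNA implementation: the function names are hypothetical, ties are broken by position rather than by averaged ranks, the 0 to 100 scaling is omitted, and the pEC50 values are invented for the example.

```python
import numpy as np

def ranks(values):
    """Rank values in increasing order (1 = smallest); ties broken by position."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    r = np.empty(len(values), dtype=int)
    r[order] = np.arange(1, len(values) + 1)
    return r

def srd(reference, predictions):
    """Sum of absolute rank differences between each model and the reference."""
    ref_rank = ranks(reference)
    return {name: int(np.abs(ranks(pred) - ref_rank).sum())
            for name, pred in predictions.items()}

# Hypothetical experimental pEC50 values and predictions from two models.
reference = [5.2, 6.8, 7.4, 6.1, 8.0]
predictions = {
    "A": [5.5, 6.6, 7.6, 6.0, 7.9],   # reproduces the reference ordering
    "B": [6.9, 5.4, 7.1, 6.2, 8.3],   # swaps the ranks of the first two compounds
}
scores = srd(reference, predictions)
print(scores)   # {'A': 0, 'B': 4}
```

A lower score means the model orders the compounds more like the experimental data, which is the sense in which M7 and M6 rank first in Tables 8 and 9.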
The sign ">" means that the former model is better than the latter one. This implies that model numbers M7 and M6 (whose SRD values are the lowest among all the models) are the best two models according to both the training and test sets. Figs. 6 and 7
show the ranking of the models for the training and test sets, respectively, by SRD values scaled between 0 and 100. Based on a comparison of all the validation parameters, it can be suggested that model numbers M7 (model developed by variable selection by stepwise regression followed by G/PLS) and M6 (model developed by variable selection by stepwise regression followed by GFA) are the best two models.

Table 7. Results of randomization tests for QSAR models.

| Model no. | Model type | $R^2$ | Process randomization $R_r^2$ | Process randomization ${}^cR_p^2$ | Model randomization $R_r^2$ | Model randomization ${}^cR_p^2$ |
|---|---|---|---|---|---|---|
| M1 | Stepwise | 0.642 | – | – | 0.064 | 0.609 |
| M2 | FA-MLR | 0.675 | – | – | 0.146 | 0.598 |
| M3 | FA-PLS | 0.664 | – | – | 0.004 | 0.662 |
| M4 | GFA | 0.756 | 0.203 | 0.646 | 0.095 | 0.707 |
| M5 | G/PLS | 0.751 | 0.392 | 0.519 | 0.004 | 0.749 |
| M6 | Stepwise followed by GFA | 0.797 | 0.171 | 0.706 | 0.088 | 0.752 |
| M7 | Stepwise followed by G/PLS | 0.776 | 0.333 | 0.586 | 0.003 | 0.775 |

Table 8. Sum of ranking differences of training set compounds.

| Model | M7 | M6 | M4 | M5 | M2 | M3 | M1 |
|---|---|---|---|---|---|---|---|
| SRDtrain | 274 | 288 | 322 | 338 | 364 | 370 | 414 |

Table 9. Sum of ranking differences (SRD) of test set compounds.

| Model | M7 | M6 | M5 | M3 | M4 | M2 | M1 |
|---|---|---|---|---|---|---|---|
| SRDtest | 12 | 16 | 18 | 24 | 26 | 30 | 32 |

5. Compliance of the developed models with OECD guidelines
In this study, we have tried to develop the models according to the Organisation for Economic Co-operation and Development (OECD) principles for QSAR models [71,72]. We have followed all five OECD principles: i) a defined endpoint; ii) an unambiguous algorithm; iii) a defined domain of applicability; iv) appropriate measures of goodness-of-fit, robustness and predictivity; v) a mechanistic interpretation, if possible. These principles are important for obtaining consistent, reliable and reproducible QSAR models.
i. Defined endpoint: We selected a series of 53 endochin analogs (4(1H)-quinolone derivatives) with antimalarial activity against the clinically relevant multidrug-resistant malarial strain TM-90-C2B. The human malaria parasite P. falciparum was grown in vitro in dilute human erythrocytes in RPMI 1640 media containing 10% heat-inactivated plasma. Concentration-response data were analyzed by a non-linear regression logistic dose-response model, and the EC50 value (the concentration of a drug that gives half-maximal response) of each compound was determined [32].
Fig. 6. Scaled SRD values for predictive models for the training set compounds. The mean and standard deviation of the distribution are given in the header.
Fig. 7. Scaled SRD values for predictive models for the test set compounds. The mean and standard deviation of the distribution are given in the header.
Fig. 8. Williams plot of model M6.
ii. Unambiguous algorithm: The structures were drawn in Cerius2 version 4.10 software [36]. All the descriptors (topological, structural, thermodynamic and atom type) were calculated using the Descriptor+ module of the Cerius2 version 4.10 software [36]. The whole data set was divided into training and test sets based on factor analysis leading to a PCA score plot obtained using SPSS software [37]. The normality of the distribution of the response values of the training set compounds was checked using the Shapiro-Wilk test and the Kolmogorov-Smirnov test in STATISTICA software [73]. Initially, the models were developed using stepwise regression, factor analysis followed by MLR (FA-MLR), factor analysis followed by PLS (FA-PLS), GFA and G/PLS techniques. We also developed QSAR models using another approach in which descriptor-thinning was performed prior to feature selection. In this case, the total pool of descriptors was divided into three categories: i) topological descriptors; ii) structural and thermodynamic descriptors; and iii) atom type descriptors. Stepwise regression was then performed separately on each category for variable selection. Finally, models were developed from the pool of selected descriptors using the GFA and G/PLS techniques. The final models contain well-defined descriptors and involve the widely used MLR or PLS technique.
iii. Applicability Domain (AD): The AD is a theoretical region in chemical space defined by the model descriptors and the modeled response. In this study, we have used both the leverage approach [74,75] with STATISTICA software [73] (model M6) and the DModX (distance to model) approach [42] at the 90% confidence level with SIMCA-P software [76] (model M7) to check the applicability domain of the two best models (M6 and M7). These approaches make it possible to verify whether a compound lies within or outside this region of chemical space. If the leverage value (h) of a compound is higher than the critical value (h*), i.e., h > h* (where h* = 3p′/n, p′ is the number of model variables plus one, and n is the number of compounds used to calculate the model), the prediction for that compound can be considered unreliable; otherwise it can be considered reliable. Response outliers can be detected easily by plotting the standardized cross-validated residuals (R) versus the leverage (hat diagonal) values (h) of the compounds (the Williams plot). Compounds with cross-validated standardized residuals greater than 2.5 standard deviation units (>2.5σ) may be considered outliers. Fig. 8 shows the Williams plot of model M6, suggesting that all the test set compounds are within the applicability domain of the developed model.
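The leverage criterion above can be illustrated with a short sketch. It computes h = x(XᵀX)⁻¹xᵀ for each compound, with an intercept column added to the training descriptor matrix so that p′ counts the model variables plus one, and compares it with h* = 3p′/n. The function names and the one-descriptor toy data are hypothetical; the study itself used STATISTICA for this step.

```python
def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def leverages(train, query):
    """Leverage of each query row relative to the training descriptor matrix."""
    Xt = [[1.0] + list(row) for row in train]  # prepend intercept column
    p = len(Xt[0])
    xtx = [[sum(r[a] * r[b] for r in Xt) for b in range(p)] for a in range(p)]
    out = []
    for row in query:
        x = [1.0] + list(row)
        # h = x (X'X)^-1 x', obtained by solving (X'X) z = x and taking x.z
        out.append(sum(xi * zi for xi, zi in zip(x, solve(xtx, x))))
    return out


def critical_leverage(n_train, n_descriptors):
    """h* = 3p'/n, with p' = number of model variables plus one."""
    return 3.0 * (n_descriptors + 1) / n_train


# Toy example with one descriptor (hypothetical values).
train = [[1.0], [2.0], [3.0], [4.0], [5.0]]
h_train = leverages(train, train)
h_star = critical_leverage(len(train), 1)
h_far = leverages(train, [[10.0]])[0]  # compound far from the training space
```

Here every training compound has h below h* = 1.2, while the distant query compound has h = 5.1 and would be flagged as outside the applicability domain. In a Williams plot these h values form the horizontal axis and the standardized residuals the vertical axis.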
Fig. 9. DModX values of the training set compounds at 90% confidence level of the developed model M7. The thick horizontal line signifies the critical DModX value (1.959) at the 90% confidence level.
Fig. 10. DModX values of the test set compounds at 90% confidence level of the developed model M7. The thick horizontal line signifies the critical DModX value (1.959) at the 90% confidence level.
The leverage values (h) of two training set compounds (18 and 50) are greater than the critical hat value (h*), suggesting that these two compounds are influential observations, but their cross-validated standardized residuals are lower than 2.5 standard deviation units. Fig. 9 shows that all the training set compounds are within the critical DModX value (1.959) of model M7 at the 90% confidence level. So, there is no outlier among the training set compounds. Fig. 10 shows that all the test set compounds are within the applicability domain of model M7.
iv. Validation parameters: We have checked the statistical quality of the models using classical validation parameters like $Q^2$ and $R^2_{pred}$ (for internal and external validation, respectively), and also some novel metrics like $r_m^2(\mathrm{LOO})$, $\overline{r_m^2}(\mathrm{LOO})$ and $\Delta r_m^2(\mathrm{LOO})$ for internal validation, $r_m^2(\mathrm{test})$, $\overline{r_m^2}(\mathrm{test})$ and $\Delta r_m^2(\mathrm{test})$ for external validation, $r_m^2(\mathrm{overall})$, $\overline{r_m^2}(\mathrm{overall})$ and $\Delta r_m^2(\mathrm{overall})$ for overall validation, and ${}^{c}R_p^2$ for model randomization [53,67]. We have also checked the quality of the models according to a nonparametric validation criterion, the sum of ranking differences (SRD) [54,55]. A cross-validated correlation coefficient ($Q^2$) above the critical value of 0.5 signifies the statistical reliability of the models. The predictivity of the models was judged by means of the predictive $R^2$ ($R^2_{pred}$), which shows good predictive ability of the models ($R^2_{pred} > 0.5$). Values of the $r_m^2$ metrics ($r_m^2(\mathrm{test})$, $\overline{r_m^2}(\mathrm{test})$, $r_m^2(\mathrm{LOO})$, $\overline{r_m^2}(\mathrm{LOO})$, $r_m^2(\mathrm{overall})$ and $\overline{r_m^2}(\mathrm{overall})$) greater than 0.5 and values of the differences between the $r_m^2$ and $r'^{2}_m$ metrics ($\Delta r_m^2(\mathrm{test})$, $\Delta r_m^2(\mathrm{LOO})$ and $\Delta r_m^2(\mathrm{overall})$) less than 0.2 imply the goodness of predictions of these models. All the models bear ${}^{c}R_p^2$ values greater than the threshold value of 0.5, indicating that the models developed are robust and were not obtained by chance.
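The validation metrics named above can be computed from observed and predicted activities as in the following sketch. The formulas are the commonly used definitions of these metrics as we understand them from the cited literature, not expressions quoted from this section, so they should be read as assumptions: $R^2_{pred} = 1 - \sum(Y_{obs} - Y_{pred})^2 / \sum(Y_{obs} - \bar{Y}_{train})^2$, $r_m^2 = r^2(1 - \sqrt{|r^2 - r_0^2|})$ with $r_0^2$ obtained from regression through the origin, and ${}^{c}R_p^2 = R\sqrt{|R^2 - R_r^2|}$. The assignment of axes for $r_m^2$ versus $r'^{2}_m$ is also an assumption; the average and the difference do not depend on it. Function names and example numbers are hypothetical.

```python
from math import sqrt


def mean(v):
    return sum(v) / len(v)


def r2_pred(obs, pred, train_mean):
    """Predictive R^2; gives LOO Q^2 when fed LOO predictions of the training set."""
    press = sum((o - p) ** 2 for o, p in zip(obs, pred))
    return 1.0 - press / sum((o - train_mean) ** 2 for o in obs)


def r2(x, y):
    """Squared Pearson correlation between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def r2_origin(y, x):
    """r0^2: determination coefficient for regressing y on x through the origin."""
    k = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    my = mean(y)
    return 1.0 - sum((b - k * a) ** 2 for a, b in zip(x, y)) / sum((b - my) ** 2 for b in y)


def rm2_metrics(obs, pred):
    """Return (average rm2, delta rm2); abs() guards against tiny negative differences."""
    rr = r2(obs, pred)
    rm2 = rr * (1.0 - sqrt(abs(rr - r2_origin(obs, pred))))
    rm2_swapped = rr * (1.0 - sqrt(abs(rr - r2_origin(pred, obs))))
    return (rm2 + rm2_swapped) / 2.0, abs(rm2 - rm2_swapped)


def c_rp2(r2_model, r2_random_mean):
    """cRp2 from the model R^2 and the mean R^2 of the randomized models."""
    return sqrt(r2_model) * sqrt(abs(r2_model - r2_random_mean))


# Hypothetical observed and predicted activities (illustration only).
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
rm2_avg, rm2_delta = rm2_metrics(obs, pred)
```

With these toy values the average $r_m^2$ is well above 0.5 and the difference is well below 0.2, which is the pattern the thresholds in the text are meant to detect.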
We have also checked the multicollinearity of the MLR models (M1, M2, M4 and M6) using the variance inflation factor (VIF), which shows that there is no significant intercorrelation among the descriptors in the MLR models (Table 10). After comparison of different statistical qualities of
Table 10 Different developed models with variance inflation factors (VIF).
Model   Variable        VIF
M1      JX              1.631
M1      PHI             1.656
M1      S_tsC           1.111
M2      PHI             1.524
M2      S_ssO           1.323
M2      S_tsC           1.114
M2      S_ssNH          1.093
M2      AtypeBr94       1.592
M2      S_aaCH          1.653
M4      PHI             1.577
M4      S_tsC           1.030
M4      MR              1.115
M4      <Jx−2.3879>     1.422
M6      <MR−24.7006>    1.252
M6      Rotlbond        1.572
M6      <Jx−2.4723>     1.289
M6      Atype_C_22      1.032
all the developed models, it has also been concluded that reduction of noise in the input descriptors and improvement of model quality can be achieved by the “descriptor-thinning” procedure prior to feature selection, as evidenced by the good statistical quality of models M6 and M7 compared to the other models.
v. Mechanistic interpretation: From the different QSAR models, it has been concluded (Fig. 11) that i) the R5 and R3 positions should be substituted with rotatable bond containing groups; ii) the volume of the molecules should not be too high; iii) presence of the … group (at the R5 position) in the compounds may disfavor the antimalarial activity; iv) presence of the –Br fragment in the compounds may favor the antimalarial activity; v) presence of cycles or rings in the molecules disfavors the antimalarial activity; vi) substitutions containing a long chain may favor the antimalarial activity. From the above discussion, it is confirmed that the developed models are in compliance with the OECD guidelines.
6. Overview and conclusions
In this paper, we have explored QSAR studies of 53 endochin analogs (4(1H)-quinolone derivatives) with well defined antimalarial activity against the TM-90-C2B strain. QSAR studies have been performed using topological (E-state index, kappa shape index, molecular connectivity index, subgraph count), thermodynamic (ALogP, ALogP98, MolRef, MR, LogP), structural (Hbond donor, Hbond acceptor, Rotlbonds) and atom type descriptors. The whole dataset was divided into a training set and a test set based on factor analysis of the descriptor matrix leading to a PCA score plot. We have developed QSAR models using different chemometric tools such as stepwise regression, factor analysis followed by MLR (FA-MLR), factor analysis followed by PLS (FA-PLS), GFA and G/PLS techniques.
We have also developed QSAR models after selecting variables by stepwise regression followed by GFA and G/PLS techniques to justify the importance of descriptor-thinning and noise reduction prior to feature selection. In this case, the initial pool of descriptors was reduced from 79 to 10. Among these models, only the best two (variable selection using stepwise regression followed by GFA and G/PLS techniques) have been detailed here. Acceptable values of the models in terms of different validation parameters like internal prediction, external prediction, $r_m^2$ metrics, randomization test and sum of ranking differences confirm the reliability of the models. We have also validated the models using the newly introduced validation parameters, the average $r_m^2$ metric ($\overline{r_m^2}$) and the difference of $r_m^2$ metrics ($\Delta r_m^2$), as discussed earlier in the Materials and Methods section. The results
Fig. 11. Schematic diagram of mechanistic interpretation of structure-activity relationship of antimalarial endochins.
clearly indicate that the QSAR models obtained from the reduced pool of descriptors were better in predictive quality than those obtained from the entire pool of descriptors. From the different QSAR models, it has been concluded (Fig. 11) that i) the R5 and R3 positions should be substituted with rotatable bond containing groups; ii) the volume of the molecules should not be too high; iii) presence of the … group (at the R5 position) in the compounds may disfavor the antimalarial activity; iv) presence of electronegative (like –Br) fragments in the compounds may favor the antimalarial activity; v) presence of cycles or rings in the molecules disfavors the antimalarial activity; vi) substitutions containing a long chain may favor the antimalarial activity. From the QSAR models developed using different chemometric tools, it can be inferred that variable selection using stepwise regression followed by GFA and G/PLS provided statistically more acceptable results. After comparison of the statistical quality of all the developed models, it has also been concluded that noise reduction (by reducing the initial pool of descriptors from 79 to 10) and improvement of model quality may be achieved by the “descriptor-thinning” procedure prior to feature selection, as evidenced by the improved statistical quality of the genetic models after descriptor-thinning by the stepwise regression process. A similar method of descriptor-thinning may in general be applied in QSAR research for other endpoints of interest using similar (as used in the present study) or other kinds of descriptors.
Acknowledgement
Financial assistance from the UGC (New Delhi) in the form of a fellowship to PKO and a major research project to KR is thankfully acknowledged.
Appendix A. Supplementary material
Supplementary material to this article can be found online at doi:10.1016/j.chemolab.2011.08.007.
References
[1] World Malaria Report, WHO, Geneva, Switzerland, 2008 www.who.int/entity/malaria/publications/atoz/…/en/index.html.
[2] Global burden of disease: 2004 update, World Health Organization, Geneva, 2008.
[3] I. Bathurst, C. Hentschel, Medicines for malaria venture: Sustaining antimalarial drug development, Trends in Parasitology 22 (2006) 301–307.
[4] M.C. Murray, M.E. Perkins, Chemotherapy of malaria, Annual Reports in Medicinal Chemistry 31 (1996) 141–150.
[5] N.J. White, P. Olliaro, Artemisinin and derivatives in the treatment of uncomplicated malaria, La Medicina Tropical 58 (1998) 54–56.
[6] H. Noedl, D. Socheat, W. Satimai, Artemisinin-resistant malaria in Asia, The New England Journal of Medicine 361 (2009) 540–541.
[7] W.O. Rogers, R. Sem, D. Tero, P. Chim, P. Lim, S. Muth, D. Socheat, F. Ariey, C. Wongsrichanalai, Failure of artesunate mefloquine combination therapy for uncomplicated Plasmodium falciparum malaria in Southern Cambodia, Malaria Journal 8 (2009) 10.
[8] H. Gonzalez-Diaz, M. Cruz-Monteagudo, D. Vina, L. Santana, E. Uriarte, E. De Clercq, QSAR for anti-RNA-virus activity, synthesis, and assay of anti-RSV carbonucleosides given a unified representation of spectral moments, quadratic, and topologic indices, Bioorganic & Medicinal Chemistry Letters 15 (2005) 1651–1657.
[9] H. Gonzalez-Diaz, O. Gia, E. Uriarte, I. Hernadez, R. Ramos, M. Chaviano, S. Seijo, J.A. Castillo, L. Morales, L. Santana, D. Akpaloo, E. Molina, M. Cruz, L.A. Torres, M.A. Cabrera, Markovian chemicals “in silico” design (MARCH-INSIDE), a promising approach for computer-aided molecular design I: discovery of anticancer compounds, Journal of Molecular Modeling 9 (2003) 395–407.
[10] L. Santana, E. Uriarte, H. Gonzalez-Diaz, G. Zagotto, R. Soto-Otero, E. Mendez-Alvarez, A QSAR model for in silico screening of MAO-A inhibitors. Prediction, synthesis, and biological assay of novel coumarins, Journal of Medicinal Chemistry 49 (2006) 1149–1156.
[11] Q.S. Du, P.G. Mezey, K.C. Chou, Heuristic molecular lipophilicity potential (HMLP): A 2D-QSAR study to LADH of molecular family pyrazole and derivatives, Journal of Computational Chemistry 26 (2005) 461–470.
[12] Q.S. Du, R.B. Huang, Y.T. Wei, L.Q. Du, K.C. Chou, Multiple field three dimensional quantitative structure-activity relationship (MF-3D-QSAR), Journal of Computational Chemistry 29 (2008) 211–219.
[13] Q.S. Du, R.B. Huang, K.C. Chou, Recent advances in QSAR and their applications in predicting the activities of chemical molecules, peptides and proteins for drug design, Current Protein & Peptide Science 9 (2008) 248–259.
[14] Z. He, J. Zhang, X.H. Shi, L.L. Hu, X. Kong, Y.D. Cai, K.C. Chou, Predicting drug-target interaction networks based on functional groups and biological features, PloS One 5 (2010) e9603.
[15] K.C. Chou, A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins, The Journal of Biological Chemistry 268 (1993) 16938–16948.
[16] P. Wang, L. Hu, G. Liu, N. Jiang, X. Chen, J. Xu, W. Zheng, L. Li, M. Tan, Z. Chen, H. Song, Y.D. Cai, K.C. Chou, Prediction of antimicrobial peptides based on sequence alignment and feature selection methods, PloS One 6 (2011) e18476.
[17] K. Roy, P.K. Ojha, Advances in quantitative structure–activity relationship models of antimalarials, Expert Opinion on Drug Discovery 5 (2010) 751–778.
[18] M.K. Gupta, Y.S. Prabhakar, Topological descriptors in modeling the antimalarial activity of 4-(3′,5′-disubstituted anilino)quinolines, Journal of Chemical Information and Modeling 46 (2006) 93–102.
[19] V.K. Agrawal, R. Srivastava, P.V. Khadikar, QSAR studies on some antimalarial sulfonamides, Bioorganic & Medicinal Chemistry 9 (2001) 3287–3293.
[20] P.K. Ojha, K. Roy, Chemometric modeling, docking and in silico design of triazolopyrimidine-based dihydroorotate dehydrogenase inhibitors as antimalarials, European Journal of Medicinal Chemistry 45 (2010) 4645–4656.
[21] P.K. Ojha, K. Roy, Exploring molecular docking and QSAR studies of plasmepsin-II inhibitor di-tertiary amines as potential antimalarial compounds, Molecular Simulation 37 (2011) 779–803.
[22] P.K. Ojha, K. Roy, Chemometric modelling of antimalarial activity of aryltriazolylhydroxamates, Molecular Simulation 36 (2010) 939–952.
[23] P.K. Ojha, K. Roy, Exploring QSAR, pharmacophore mapping and docking studies and virtual library generation for cycloguanil derivatives as PfDHFR-TS inhibitors, Medicinal Chemistry 7 (2011) 173–199.
[24] W. Salzer, H. Timmler, H. Andersag, A new type of compounds active against avian malaria, Chemische Berichte 81 (1948) 12–19.
[25] R.W. Winter, J.X. Kelly, M.J. Smilkstein, R. Dodean, D. Hinrichs, M.K. Riscoe, Antimalarial quinolones: synthesis, potency, and mechanistic studies, Experimental Parasitology 118 (2008) 487–497.
[26] Q. Shen, J.H. Jiang, G. Shen, R.Q. Yu, Variable selection by an evolution algorithm using modified Cp based on MLR and PLS modeling: QSAR studies of carcinogenicity of aromatic amines, Analytical and Bioanalytical Chemistry 375 (2003) 248–254.
[27] E. Bayram, P. Santago 2nd, R. Harris, Y.D. Xiao, A.J. Clauset, J.D. Schmitt, Genetic algorithms and self-organizing maps: a powerful combination for modeling complex QSAR and QSPR problems, Journal of Computer-Aided Molecular Design 18 (2004) 483–493.
[28] S. Schefzick, M. Bradley, Comparison of commercially available genetic algorithms: GAs as variable selection tool, Journal of Computer-Aided Molecular Design 18 (2004) 511–521.
[29] W. Zheng, A. Tropsha, Novel variable selection quantitative structure–property relationship approach based on the k-nearest-neighbor principle, Journal of Chemical Information and Computer Sciences 40 (2000) 185–194.
[30] T. Ghafourian, M. Cronin, The effect of variable selection on the non-linear modeling of oestrogen receptor binding, QSAR and Combinatorial Science 25 (2006) 824–835.
[31] K.C. Chou, Some remarks on protein attribute prediction and pseudo amino acid composition, Journal of Theoretical Biology 273 (2011) 236–247.
[32] R.M. Cross, A. Monastyrskyi, T.S. Mutka, J.N. Burrows, D.E. Kyle, R. Manetsch, Endochin optimization: Structure-activity and structure-property relationship studies of 3-substituted 2-Methyl-4(1H)-quinolones with antimalarial activity, Journal of Medicinal Chemistry 53 (2010) 7076–7094.
[33] R.E. Desjardins, C.J. Canfield, J.D. Haynes, J.D. Chulay, Quantitative assessment of antimalarial activity in vitro by a semiautomated microdilution technique, Antimicrobial Agents and Chemotherapy 16 (1979) 710–718.
[34] H. Kubinyi, The quantitative analysis of structure-activity relationships, in: M.E. Wolff (Ed.), Burger's Medicinal Chemistry and Drug Discovery, John Wiley, 1995, pp. 494–571.
[35] C.D. Selassie, History of Quantitative Structure-Activity Relationships, in: J. Abraham (Ed.), Burger's Medicinal Chemistry and Drug Discovery, Wiley, 2003, pp. 1–48.
[36] Cerius2 Version 4.10 is a product of Accelrys Inc., San Diego, CA, 2005.
[37] SPSS is statistical software of SPSS Inc., USA, 1999.
[38] J.T. Leonard, K. Roy, On selection of training and test sets for the development of predictive QSAR models, QSAR and Combinatorial Science 25 (2006) 235–251.
[39] M.A. Stephens, Asymptotic Results for Goodness of Fit Statistics with Unknown Parameters, The Annals of Statistics 4 (1976) 357–369.
[40] F.J. Massey Jr., The Kolmogorov-Smirnov test for goodness of fit, Journal of the American Statistical Association 46 (1951) 68–78.
[41] R.B. Darlington, Regression and Linear Models, McGraw-Hill, New York, 1990.
[42] S. Wold, M. Sjostrom, L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemometrics and Intelligent Laboratory Systems 58 (2001) 109–130.
[43] Y. Fan, L.M. Shi, K.W. Kohn, Y. Pommier, J.N. Weinstein, Quantitative structure-antitumor activity relationships of camptothecin analogues: cluster analysis and genetic algorithm-based studies, Journal of Medicinal Chemistry 44 (2001) 3254–3263.
[44] S. Wold, in: H. van de Waterbeemd (Ed.), Chemometric Methods in Molecular Design, VCH, Weinheim, 1995, p. 195.
[45] R. Franke, Theoretical Drug Design Methods, Elsevier, Amsterdam, 1984, p. 184.
[46] R. Franke, A. Gruska, in: H. van de Waterbeemd (Ed.), Chemometric Methods in Molecular Design, VCH, Weinheim, 1995, pp. 113–163.
[47] D. Rogers, A.J. Hopfinger, Application of genetic function approximation to quantitative structure-activity relationships and quantitative structure-property relationships, Journal of Chemical Information and Computer Sciences 34 (1994) 854–866.
[48] J. Holland, Adaptation in Artificial and Natural Systems, University of Michigan Press, Ann Arbor, MI, 1975.
[49] J. Friedman, Multivariate Adaptive Regression Splines, Technical Report No. 102, Laboratory for Computational Statistics, Department of Statistics, Stanford University, Stanford, CA, Nov 1988 (revised Aug 1990).
[50] H. Martens, T. Naes, Multivariate Calibration, Wiley, Chichester, 1989.
[51] A. Höskuldsson, PLS regression methods, Journal of Chemometrics 2 (1988) 211–228.
[52] L. Eriksson, E. Johansson, N. Kettaneh-Wold, S. Wold, Multi- and megavariate data analysis: Principles and applications, Umetrics, Umeå, 2001.
[53] I. Mitra, A. Saha, K. Roy, Exploring quantitative structure–activity relationship studies of antioxidant phenolic compounds obtained from traditional Chinese medicinal plants, Molecular Simulation 36 (2010) 1067–1079.
[54] K. Héberger, Sum of ranking differences compares methods or models fairly, Trends in Analytical Chemistry 29 (2010) 101–109.
[55] K. Héberger, K. Kollár-Hunek, Sum of ranking differences for method discrimination and its validation: comparison of ranks with random numbers, Journal of Chemometrics 25 (2010) 151–158.
[56] R.M. O'Brien, A caution regarding rules of thumb for variance inflation factors, Quality and Quantity 41 (2007) 673–690.
[57] G.W. Snedecor, W.G. Cochran, Statistical Methods, Oxford & IBH, New Delhi, 1967.
[58] H. Kubinyi, F.A. Hamprecht, T. Mietzner, Three-dimensional quantitative similarity-activity relationships (3D QSiAR) from SEAL similarity matrices, Journal of Medicinal Chemistry 41 (1998) 2553–2564.
[59] A. Golbraikh, A. Tropsha, Beware of q2!, Journal of Molecular Graphics & Modelling 20 (2002) 269–276.
[60] K.C. Chou, H.B. Shen, Cell-PLoc: A package of Web servers for predicting subcellular localization of proteins in various organisms (updated version: Cell-PLoc 2.0: An improved package of web-servers for predicting subcellular localization of proteins in various organisms, Natural Science 2 (2010) 1090–1103), Nature Protocols 3 (2008) 153–162.
[61] K.C. Chou, H.B. Shen, Review: Recent progresses in protein subcellular location prediction, Analytical Biochemistry 370 (2007) 1–16.
[62] G.R. Marshall, Binding-site modeling of unknown receptors, in: H. Kubinyi (Ed.), 3D QSAR in Drug Design – Theory, Methods and Applications, ESCOM, Leiden, 1994, pp. 80–116.
[63] V. Consonni, D. Ballabio, R. Todeschini, Comments on the definition of the Q2 parameter for QSAR validation, Journal of Chemical Information and Modeling 49 (2009) 1669–1678.
[64] P.P. Roy, K. Roy, On some aspects of variable selection for partial least squares regression models, QSAR and Combinatorial Science 27 (2008) 302–313.
[65] P.P. Roy, S. Paul, I. Mitra, K. Roy, On two novel parameters for validation of predictive QSAR models, Molecules 14 (2009) 1660–1701.
[66] I. Mitra, P.P. Roy, S. Kar, P.K. Ojha, K. Roy, On further application of $r_m^2$ as a metric for validation of QSAR models, Journal of Chemometrics 24 (2010) 22–33.
[67] P.K. Ojha, I. Mitra, R.N. Das, K. Roy, Further exploring $r_m^2$ metrics for validation of QSPR models, Chemometrics and Intelligent Laboratory Systems 107 (2011) 194–205.
[68] K. Roy, S. Paul, Exploring 2D and 3D QSARs of 2,4-diphenyl-1,3-oxazolines for ovicidal activity against Tetranychus urticae, QSAR and Combinatorial Science 28 (2009) 406–425.
[69] K. Roy, S. Paul, Docking and 3D-QSAR studies of acetohydroxy acid synthase inhibitor sulfonylurea derivatives, Journal of Molecular Modeling 16 (2010) 951–964.
[70] A.K. Ghose, A. Pritchett, G.M. Crippen, Atomic Physicochemical Parameters for Three-Dimensional Structure-Directed Quantitative Structure-Activity Relationships 3, Journal of Computational Chemistry 9 (1988) 80–90.
[71] http://www.oecd.org/document/23/0,3746,en_2649_34379_33957015_1_1_1_1,00.html.
[72] http://www.oecd.org/dataoecd/33/37/37849783.pdf.
[73] STATISTICA is a statistical software of STATSOFT Inc., USA, http://www.statsoft.com/.
[74] P. Gramatica, Principles of QSAR models validation: internal and external, QSAR and Combinatorial Science 26 (2007) 694–701.
[75] L. Eriksson, J. Jaworska, A.P. Worth, M.T. Cronin, R.M. McDowell, P. Gramatica, Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs, Environmental Health Perspectives 111 (2003) 1361–1375.
[76] UMETRICS SIMCA-P 10.0, www.umetrics.com, Umea, Sweden, 2002.