
Prediction of ITQ-21 Zeolite Phase Crystallinity: Parametric Versus Non-parametric Strategies
Laurent A. Baumes, Manuel Moliner, Avelino Corma*
Instituto de Tecnología Química (UPV-CSIC), av. de los Naranjos, E-46022, Valencia, Spain, E-mail: [email protected]
Keywords: Data mining, High-throughput, ITQ-21 zeolite, Parametric, Regression, Statistics
Received: May 8, 2006; Accepted: July 13, 2006
DOI: 10.1002/qsar.200620064

Abstract
This work deals with data analysis techniques and high-throughput tools for synthesis and characterization of solid materials. In previous studies, it was found that the final properties of materials could be successfully modeled using learning systems. Machine learning algorithms such as neural networks, support vector machines, and regression trees are non-parametric strategies. They are compared to traditional parametric statistical approaches. We review a wide range of statistical methodologies, and all the methods are evaluated using experimental data derived from an exploration-optimization of the material ITQ-21. The results are judged on the numerical prediction of phase crystallinity. We discuss the theoretical aspects of such statistical techniques, which make them an attractive method when compared to other learning strategies for modeling the properties of the solids. Advantages and drawbacks are highlighted. We show that such approaches, by offering broad solutions, can reach high-level performances while offering ease of use, comprehensibility, and control. Finally, we shed light on both the interpretation and stability of results, which remain the main drawbacks of the majority of machine learning methodologies when trying to retrieve knowledge from the data treatment.

1 Introduction

Molecular sieves, and more specifically zeolites, are materials of considerable interest in gas adsorption and separation, catalysis, and for electronic and medical uses [1, 2]. Recent research work from different groups has contributed to the understanding of the synthesis mechanism, as well as to the discovery of new zeolitic structures [3 – 11]. The discovery of new structures, the enlargement of the synthesis space, and the optimization of existing structures require a considerable experimental effort. This effort can be reduced by using High-Throughput (HT) synthesis and characterization techniques [12 – 14], since the number of samples that can be processed is tremendously increased and, consequently, so is the number of parameters that can be explored simultaneously. Thus, the possibility of discovering new materials or better covering a phase diagram may be strongly accelerated. The need for advanced strategies that aim at optimizing the retrieval of knowledge from experiments, while keeping their number at a reasonable level, is a critical part of the discovery and optimization processes. Numerous different Machine Learning (ML) techniques have been successfully applied for modeling experimental data obtained during the exploration of multi-component materials.

However, the synthesis of zeolitic materials through HT experimentation has received a weaker impulse and fewer studies have been reported. The models allow one to predict the properties of unsynthesized materials (also called "virtual" solids), taking into account their expected compositions or preparation conditions as input variables. Among the different ML techniques, Neural Networks (NNs) have often yielded the best modeling results. They have been applied for modeling and prediction of the catalytic performance of libraries for a variety of reactions; selected examples are the water gas shift reaction [15], oxidative dehydrogenation of ethane [16], oxidative dehydrogenation of propane to propene [17], and propene oxidation to propene-oxide [18]. However, NNs may suffer from overfitting of the data and from reproducibility problems and, therefore, there is still a need to use or even develop other techniques. In this sense, Support Vector Machines (SVMs) can be a suitable method for overcoming the pitfalls of NNs when they occur, and a first comparison has recently been made for the design of catalysts and materials [19, 20]. In this case, even if overfitting is largely ruled out, the interpretation of results still remains difficult when complex kernel functions are used.


ML methods do not assume any parametric form of the appropriate model to use; they are classified among the distribution-free methods. Instead of starting with assumptions about a particular problem, ML uses a toolbox approach in order to identify the correct model structure directly from the available data. One of the main consequences is that these methods typically require larger datasets than parametric statistics. In the materials science domain, even when HT techniques are used, the number of examples remains low (i.e., fewer than 150). This represents a great problem for non-parametric procedures in terms of preventing overfitting. Since the 1990s, a large number of publications have appeared using only such ML methods, while traditional parametric statistics has remained relatively neglected, as emphasized in [21]. In [22], the authors make use of traditional statistical analysis while examining a split-plot design, and very recently a new hybrid statistical methodology has been proposed [23], which combines evolutionary algorithm operators with a statistical criterion for optimizing the structural characterization of a given search space, taking into account an a priori limited number of experiments to be conducted. This work, which deals with data analysis techniques and HT tools for synthesis and characterization of solid materials, aims at showing that statistics can enable a better interpretation of results while showing a similar quality of performance and avoiding ML pitfalls. We review a wide range of statistical methodologies and discuss the theoretical aspects of such techniques, which make them an attractive method for modeling the properties of solids when compared to the other learning strategies. Advantages and drawbacks are highlighted. We show that such statistical approaches, by offering broad solutions, allow high-level performances to be reached while offering ease of use, comprehensibility, and control. Finally, we shed light on both the interpretation and stability of results, which remain the major drawbacks of the black-box learning methodologies. All the methods are evaluated here using experimental data derived from an exploration/optimization of the synthesis of a zeolitic material (ITQ-21) [24]. ITQ-21 is a zeolite with a three-dimensional pore network containing 1.18-nm-wide cavities, each of which is accessible through six circular, 0.74-nm-wide windows. The structure is shown in Figure 1a. We have chosen this system because ITQ-21 is one of the most interesting large-pore zeolites, combining the catalytic properties of USY zeolites with a higher diffusivity and a lower rate of catalyst deactivation. There is therefore an incentive for better understanding and optimizing the synthesis and chemical composition of this material. The results are judged on the numerical prediction of phase crystallinity.


Figure 1. a) Structure of the ITQ-21 zeolite. b) Standard diffractogram of the ITQ-21 zeolite.

2 Experimental Section

2.1 Synthesis Experimental Data

A large number of parameters govern the hydrothermal crystallization processes of microporous materials, determining which phases are formed and their crystallization kinetics. In this study, a detailed exploration of the hydrothermal synthesis in the system SiO2/GeO2/Al2O3/F−/H2O/N(16)-methylsparteinium (MSPT) has been performed, in order to understand the effect of these factors on the growth of ITQ-21. The synthesis variables have been selected in order to cover the broadest range of the most promising parameter space based on previous experience, while keeping the total number of experiments within a feasible and reasonable range. The five synthesis variables and their respective expected values are: Si/Ge = {15, 20, 25, 50}, Al/(Si + Ge) = {0.02, 0.04, 0.06}, MSPT/(Si + Ge) = {0.25, 0.5}, H2O/(Si + Ge) = {2, 5, 10}, and time (days) = {1, 5}. A sixth variable, F/(Si + Ge), is always maintained equal to the MSPT amount in order to obtain a neutral pH.
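As a quick illustration of the size of this parameter space, the sketch below (hypothetical Python code, not part of the original study; the dictionary keys are our own labels) enumerates every combination of the levels listed above and confirms the 144 candidate syntheses of the full factorial design described next.

```python
from itertools import product

# Levels taken from the text above; F/(Si+Ge) is tied to MSPT/(Si+Ge) and is not enumerated.
levels = {
    "Si/Ge":        [15, 20, 25, 50],
    "Al/(Si+Ge)":   [0.02, 0.04, 0.06],
    "MSPT/(Si+Ge)": [0.25, 0.5],
    "H2O/(Si+Ge)":  [2, 5, 10],
    "time_days":    [1, 5],
}

design = [dict(zip(levels, combo)) for combo in product(*levels.values())]
print(len(design))   # 4 * 3 * 2 * 3 * 2 = 144 experiments
```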


The distribution of experiments comes from a full factorial design with Si/Ge, H2O/(Si + Ge), (MSPT and F)/(Si + Ge), Al/(Si + Ge), and the synthesis duration (noted t) at 4, 3, 2, 3, and 2 levels, respectively. Each experiment is therefore one combination among the 144 possible ones. All the experiments were carried out in a random order. The reagents employed for the gel syntheses were ammonium fluoride (98%, Aldrich), germanium oxide (99.998%, Aldrich), aluminum isopropoxide (98%, Aldrich), methylsparteine, LUDOX (AS 40 wt%, Aldrich), MilliQ water (Millipore), and N(16)-methylsparteinium hydroxide. Automated gel synthesis was done inside Teflon vials (3 mL), which were finally inserted in a multi-autoclave with 15 positions, sealed with a Teflon-lined stainless-steel tip, and subsequently allowed to crystallize at 175 °C. The samples were then washed and filtered in parallel and dried at 100 °C overnight. Finally, the samples were weighed and characterized by XRD using a multi-sample Philips X'Pert diffractometer employing Cu Kα radiation. The standard X-ray diffractogram for ITQ-21 is shown in Figure 1b. The occurrence and crystallinity were calculated by integrating the area of the characteristic peaks; for ITQ-21 the characteristic 2θ range lies between 25.7° and 26.5°.

2.2 Computational Methods

In regression problems, the objective is to estimate the value of a continuous output variable, in our case the crystallinity of a given phase, from input variables such as the synthesis parameters. All the techniques used in this study are briefly described, except for NNs, which have already received considerable attention; see [15 – 20] for recent applications in materials science and [25, 26] for more technical explanations. In order to provide a fair comparison between the different techniques investigated, 28% of the data, chosen randomly among the whole available dataset of 144 distinct experiments, is kept unused for the evaluation of model generalization.

2.2.1 Multiple Linear Regression (MLR)

An MLR model specifies the relationship between one dependent variable y and a set of k predictor variables X, so that $y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i$, where the $\beta_i$ are the regression coefficients.
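A minimal sketch of such an MLR fit, assuming the 144 experiments are available as a design matrix X (five synthesis variables) and a response vector y (crystallinity); the arrays below are random placeholders, not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((144, 5))          # placeholder synthesis variables
y = 100 * rng.random(144)         # placeholder ITQ-21 crystallinity (%)

X1 = np.column_stack([np.ones(len(X)), X])        # prepend the intercept column
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)     # beta[0] = b0, beta[1:] = b_i
y_hat = X1 @ beta
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(beta.round(3), round(r2, 3))
```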


2.2.2 Generalized Linear Model (GLZ)

GLZ can be used to predict responses both for dependent variables with discrete distributions and for dependent variables which are non-linearly related to the predictors. GLZ differs from the linear model mainly in the following major aspects: (i) the distribution of the dependent variable can be explicitly non-normal; (ii) the dependent variable values are predicted from a linear combination of predictor variables, which are connected to the dependent variable via a function called the link function. The relationship in GLZ is assumed to be $y = g(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k) + e$, where e stands for the error variability. The inverse function $g^{-1} = f$ is the link function, so that $f(\tilde{y}) = \beta_0 + \sum_{i=1}^{k} \beta_i x_i$, where $\tilde{y}$ stands for the expected value of y. For additional information about GLZ, see [27, 28].

2.2.3 Piecewise Linear Regression (PLR)

This model specifies a common intercept $\beta_0$ and a slope that is equal to $\beta_1$ if $y \le 100$ and to $\beta_2$ otherwise. For a problem with only two variables, the model reads $y = \beta_0 + \beta_1 x\,(y \le 100) + \beta_2 x\,(y > 100)$, where the parenthesized conditions act as indicator functions. Stepwise model-building techniques for regression designs with a single dependent variable are described in numerous sources [29, 30].

2.2.4 SVMs as Regression Tool

A general introduction to SVMs was already presented in [20]. With ε-SV regression [31], the goal is to find a function f(x) that deviates by at most ε from the target $y_i$ for all the training data and is, at the same time, as flat as possible. Formally, the problem is written as a convex optimization problem.

2.2.5 Regression Trees (RTs)

Regression trees may be considered a variant of decision trees, designed to approximate real-valued functions instead of being used for classification tasks. An RT is built through a process known as binary recursive partitioning: an iterative process that splits the data into partitions and then splits these further on each of the branches. In our experiments the classical C&RT [32] tree is used.
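The following hedged sketch shows how the two non-parametric regressors just described could be fitted and evaluated on the 28% hold-out split mentioned above, using scikit-learn; the data are again random placeholders and the hyperparameter values are illustrative only, not those used in the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.random((144, 5))          # placeholder synthesis variables
y = 100 * rng.random(144)         # placeholder crystallinity

# 28% of the 144 experiments are kept unused for generalization assessment
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.28, random_state=0)

models = {
    "eps-SVR (RBF)": SVR(kernel="rbf", C=10.0, epsilon=0.1),        # epsilon-SV regression
    "Regression tree": DecisionTreeRegressor(min_samples_leaf=10),  # binary recursive partitioning
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    r = np.corrcoef(y_te, model.predict(X_te))[0, 1]
    print(f"{name}: test r = {r:.3f}")
```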

3 Results of Parametric Statistics and Prediction of ITQ-21 Phase Crystallinity

3.1. Experimental Results

Figure 2 shows the effect of each synthesis variable on ITQ-21 crystallinity. ITQ-21 is favored by certain combinations of synthesis variables. The highest values of crystallinity appear in concentrated gels [H2O/(Si + Ge) < 5] with low Si/Ge ratios. The presence of the zeolite is affected by the aluminum content, in such a way that the more aluminum, the lower the crystallinity. Furthermore, high MSPT/(Si + Ge) and F/(Si + Ge) ratios play positive roles in the formation of ITQ-21.


Figure 2. Variation of ITQ-21 phase crystallinity with the studied variables.


Table 1. Standardized regression coefficients (β) and raw regression coefficients (B) for the MLR.

Variable    β            B           t(138)      p-Level     Partial      Semi-partial   Tolerance   R²
Intercept   –             65.152      9.7772     0.000000    –            –              –           –
t            0.094280      1.205      1.7771     0.077757     0.149574     0.094273      0.999844    0.000156
Si/Ge       −0.403721      −0.741     −7.6099     0.000000    −0.543687    −0.403695      0.999875    0.000125
Al/c        −0.231247    −317.365     −4.3587     0.000025    −0.347867    −0.231227      0.999829    0.000171
MSPT/c       0.111600      22.344      2.1036     0.037230     0.176265     0.111593      0.999870    0.000130
H2O/c       −0.611522      −4.725    −11.5273     0.000000    −0.700392    −0.611513      0.999971    0.000029

Here c denotes (Si + Ge). The magnitude of β allows the relative contribution of each independent variable to the prediction of the dependent variable to be compared. The squared semi-partial correlation indicates the percentage of total variance uniquely accounted for by the respective independent variable, while the squared partial correlation indicates the percentage of residual variance accounted for after adjusting the dependent variable for all other independent variables. In the original table, gray cells indicate estimates that are significant (α = 5%).
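For readers who want to reproduce these quantities on their own data, the following is a rough sketch (our own helper functions with placeholder arrays, not code from the study) of one way to obtain the standardized coefficients β and the semi-partial correlations reported in Table 1.

```python
import numpy as np

def standardized_betas(X, y, b_raw):
    """Convert raw regression coefficients (without intercept) into standardized betas."""
    return b_raw * X.std(axis=0, ddof=1) / y.std(ddof=1)

def semi_partial(X, y, j):
    """Correlation of y with the part of X[:, j] not explained by the other predictors."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    residual = X[:, j] - A @ coef
    return np.corrcoef(y, residual)[0, 1]

rng = np.random.default_rng(2)
X = rng.random((144, 5)); y = 100 * rng.random(144)     # placeholders
X1 = np.column_stack([np.ones(len(X)), X])
b = np.linalg.lstsq(X1, y, rcond=None)[0]
print(standardized_betas(X, y, b[1:]).round(3))
print([round(semi_partial(X, y, j), 3) for j in range(X.shape[1])])
```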

Figure 3. Prediction of ITQ-21 phase crystallinity with synthesis variables as input, for GLZs using Gamma distribution with log link function, normal distribution with identity link function, and normal distribution with log link function as modeling approaches.

3.2. MLR and First Inspection of the Dataset

In Figure 3, the MLR is calculated with the synthesis variables as input. The real ITQ-21 phase crystallinity is indicated on the y-axis while the predicted one is represented on the x-axis. The adjustment was R² = 0.61164 [F(5,138) = 43.46; p = 0.00000]. According to this method, 61.16% of the original variability has been explained, and (1 − R²) is the residual variability. The regression coefficients are given in Table 1, where highlighted values (gray background color) are significant. As indicated by the β values, Si/Ge and H2O/(Si + Ge) (respectively, variables 3 and 6) are the most important predictors of ITQ-21 phase crystallinity, and all variables are statistically significant (p < 5%) except for the synthesis duration (variable 2).

If the risk α is increased to 10%, variable 2 becomes significant. A non-significant p-value does not mean that the null hypothesis* is true; it simply means that this dataset is not strong enough to demonstrate that the null hypothesis is false.

* A significance test is performed to determine if an observed value of a statistic differs enough from a hypothesized value of a parameter to draw the inference that the hypothesized value of the parameter is not the true value. The hypothesized value of the parameter is called the “null hypothesis”.


Concluding that a value is not statistically significant when the null hypothesis is in fact false is called a Type II error; for more details on this aspect see [23]. Another way of looking at the unique contribution of each independent variable is to compute the partial and semi-partial correlations. In Table 1, partial correlations are the correlations between the respective independent variable, adjusted for all other variables, and the dependent variable, adjusted for all other variables. The semi-partial correlation is the correlation of the respective independent variable, adjusted for all other variables, with the raw dependent variable. The values in Table 1 for the partial and semi-partial correlations are relatively similar and confirm the trends observed with the β values. In Table 2, the partial correlation measures the correlation between two variables that remains after partialling out one other variable (indicated with "–"), while the plain correlation coefficient does not take such control into account. It can be observed that the correlations and partial correlations between each variable and ITQ-21 crystallinity are quite similar. However, one can note that when the effect of H2O is removed (i.e., the fifth column of partial correlations), the correlation between Si/Ge and ITQ-21 crystallinity decreases by ten points. Actually, a similar jump is observed for all the correlations when H2O is partialled out: the positive effect (MSTP or F) is increased, while the negative partial correlations are decreased. Surprisingly, it thus seems that an increase in H2O, which has a globally negative effect on ITQ-21 crystallinity, softens the negative effects of the other variables while weakening the single positive one when acting in combination with them. Moreover, it is shown that the three variables that have the greatest influence on the formation of ITQ-21 are H2O, Si/Ge, and the Al content. For the levels chosen in the present work, the water content is the variable that has the largest influence on ITQ-21 crystallinity. This phase prefers concentrated gels, with H2O/(Si + Ge) values of less than 5. This can also indicate that a high concentration of F− has a positive effect on crystallization. The content of Ge in the framework of ITQ-21 is a critical factor. When the content of Ge decreases in the starting gel, the rate of crystallization of ITQ-21 decreases, and for high values of Si/Ge (> 30), only small amounts of ITQ-21 (low crystallinity) are achieved with the set of times and temperature reported here.

Finally, the other factor that is statistically interesting is the Al content. The highest values of crystallinity have been obtained at low levels of Al. The reason is that the number of framework negative charges introduced by Al, which have to be neutralized by the Organic Structure-Directing Agent (OSDA), is limited: the OSDA also has to compensate the F− located within the double four-member rings [33], and the void volume of the ITQ-21 structure can accommodate only a limited number of MSPT cations. It has to be noted that H2O has the largest effect on crystallization only when considering the chosen ranges of variation. The use of parametric procedures allows one to take advantage of the whole theory behind the model. However, the assumptions should always be verified first, otherwise the conclusions may not be accurate. For example, in MLR it is assumed that the residuals are normally distributed. Many tests are robust with regard to violations of this assumption, and the normal probability plot of residuals gives a quick indication of whether or not violations have occurred. If the observed residuals (plotted on the x-axis of Figure 4) are normally distributed, then all values should fall onto a straight line; if they are not, they will deviate from the line. Figure 4 shows a particular lack of fit: the data seem to form an S-shape around the line. This pattern is characteristic of cases where the dependent variable may have to be log-transformed to pull in the tails of the distribution. Another important step when building models is the detection of outliers. If one experiment is clearly an outlier, there is a tendency for the regression line to be pulled toward it. As mentioned before, such a deviation would nevertheless be rather small compared to the consequences (overfitting) which might be observed using ML models. As a result, if the respective cases were excluded, different B coefficients would be found. Figure 5 shows the "deleted residual" statistic, which is the standardized residual that one would obtain for the respective case if it were excluded from the analysis. Therefore, if the deleted residual differs from the standardized residual, the regression analysis may be biased by the given case. However, no such case belongs to our experimental dataset and therefore the entire set is kept.
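The two diagnostics just discussed (the normal probability plot of Figure 4 and the deleted residuals of Figure 5) can be reproduced, for example, with scipy and statsmodels; the sketch below uses placeholder data and is only meant to show where these statistics come from, not to reproduce the paper's figures.

```python
import numpy as np
import scipy.stats as st
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.random((144, 5)))   # placeholder design matrix with intercept
y = 100 * rng.random(144)                   # placeholder crystallinity

fit = sm.OLS(y, X).fit()

# Normal probability plot of the residuals (an S-shape suggests a log transformation)
(theoretical_q, ordered_resid), _ = st.probplot(fit.resid, dist="norm")

# Deleted (externally studentized) residuals for outlier screening
deleted = fit.get_influence().resid_studentized_external
print(round(float(np.abs(deleted).max()), 2))
```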

Table 2. Partial correlations and correlation coefficients between all variables involved in the synthesis study.

             Partial correlation with ITQ-21 crystallinity, one variable partialled out:    Correlation with
Variable     – Time     – Si/Ge    – Al       – (MSTP or F)   – H2O                         ITQ-21 crystallinity
Time          –          0.10       0.09       0.09            0.12                          0.09
Si/Ge        −0.40       –         −0.41      −0.41           −0.51                         −0.40
Al           −0.23      −0.26       –         −0.23           −0.29                         −0.23
MSTP or F     0.11       0.12       0.11       –               0.13                          0.11
H2O          −0.62      −0.67      −0.63      −0.62            –                            −0.61

The partial correlation measures the correlation between two variables that remains after controlling for (i.e., partialling out) one other variable, indicated with "–" in the column headings. Gray cells in the original table contain significant values at p < 0.05.


Figure 4. Normal probability plot of residuals for the ITQ-21 phase crystallinity linear model. This visualization permits a quick check of whether the assumption of normally distributed residuals is respected. Here the tails show an S-shape pattern.

Another interesting test, for heteroscedasticity, may also be carried out. Homoscedasticity is the assumption that the variability in scores for one variable is roughly the same at all values of the other variable. It is related to normality: when normality is not met, the variables are often not homoscedastic but heteroscedastic. For example, the Goldfeld – Quandt test is applicable if one thinks that heteroscedasticity is related to only one of the x variables. Such a test is of great interest, for example, if the heating system of a multi-channel reactor becomes unreliable as the temperature increases, generating additional noise. To summarize, the MLR fits the objective variable only moderately (~61%) and does not prevent the fitted ITQ-21 phase crystallinity from taking negative values. Moreover, the number of false positives is very high (i.e., the gray squares on the x-axis in Figure 3), and for the other experiments the phase crystallinity is greatly underestimated. Nevertheless, this preliminary methodology has allowed us to obtain a first idea of how the dependent variable can be modeled and of its correlations with the synthesis variables through the estimates. It has also been shown that this technique allows assumptions to be tested that are too often accepted without verification. Checks of the normality of residuals, the detection of outliers, the significance of variables, and others are of great help to the user in forming a first picture of how the underlying mechanism works. The examination of the normality assumption will require further, more complex methodologies that transform the dependent variable so as to respect the hypothesis of normal distribution while preventing negative predictions. The GLZ, as an extension of the MLR, is investigated below.
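A hedged sketch of the Goldfeld – Quandt test mentioned above, using statsmodels on placeholder data; the column index is an arbitrary stand-in for the synthesis variable suspected of driving the change in error variance.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

rng = np.random.default_rng(4)
X = sm.add_constant(rng.random((144, 5)))   # placeholder design matrix
y = 100 * rng.random(144)                   # placeholder crystallinity

# Sort the observations on one regressor (here the fifth variable, column index 5 of X,
# which includes the constant) and compare the residual variances of the two halves.
f_stat, p_value, _ = het_goldfeldquandt(y, X, idx=5)
print(round(f_stat, 2), round(p_value, 3))
```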


Figure 5. Residuals vs. deleted residuals plot. This technique allows outliers to be identified when they lie relatively far from the line.

3.3. Generalized Linear Model

The construction of a GLZ starts by selecting an appropriate link function and response probability distribution. Two alternatives are investigated: fitting a distribution and choosing the corresponding link function, or only transforming the dependent variable through a chosen link function. There are many potential distributions (normal, exponential, Weibull, log-normal, Gamma, etc.) that could be used as a distributional model for the data. Therefore, two basic questions are addressed: (i) Does a given distributional model provide an adequate fit to the data? (ii) Does one distribution fit the data better than another? Goodness-of-Fit (GoF) tests provide a method to answer these two questions. The Kolmogorov – Smirnov (KS) test is chosen for the following reason: unlike the parametric t-test for independent samples, which tests differences in the location (means) of two samples, the KS test is also sensitive to differences in the general shapes of the distributions in the two samples (i.e., differences in dispersion, skewness, etc.). The GoF tests confirm that either the Gamma or the log-normal distribution would provide a good model for these data. Finally, the best fit is the Gamma distribution, defined as $f(x) = (x/b)^{c-1}\,e^{-x/b}\,[b\,\Gamma(c)]^{-1}$, with b > 0 the scale parameter, c > 0 the so-called shape parameter, and $\Gamma$ the gamma function, $\Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t}\,dt$. Here b = 39.3 and c = 0.456; see Figure 6. The corresponding link function for such a distribution is the log function. Considering the second option proposed earlier, the normal probability plot of residuals has given an indication of the non-normal distribution of the observed residuals. Since the data follow an S-shape pattern around the line, we have supposed that the dependent variable should be transformed into a new one such as g(y) = ln(y) in order to pull in the tail of the distribution.
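A sketch of how such a distribution fit and Kolmogorov – Smirnov check could be carried out with scipy; the sample below is synthetic (the real crystallinity values are not reproduced here), so the fitted b and c will not match the values quoted above.

```python
import scipy.stats as st

# Synthetic positive "crystallinity" sample standing in for the experimental data
sample = st.gamma.rvs(a=0.5, scale=40.0, size=120, random_state=0)

# Fit a Gamma(c, scale=b) model with the location fixed at zero, then test the fit
c, loc, b = st.gamma.fit(sample, floc=0)
ks_stat, p_value = st.kstest(sample, "gamma", args=(c, loc, b))
print(round(c, 3), round(b, 1), round(ks_stat, 3), round(p_value, 3))
```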


Figure 6. Distribution fitting of ITQ-21 phase crystallinity with the Gamma function.

Therefore, this transformation is handled through the log link function, which forces the values to remain within a positive range while the distribution of the dependent variable is still assumed to be normal. Figure 3 shows the predictions of ITQ-21 crystallinity for the GLZ model using the Gamma distribution with log link function, the normal distribution assumption with log link function, and the MLR (i.e., normal distribution assumption and identity link function). A better fit of GLZ over MLR can be observed. However, the GLZ using the normal distribution and log link function remains the best. In this situation, compared to the Gamma assumption, more weight is indirectly given to non-null crystallinity values of the ITQ-21 phase, and therefore the variability of the response for high crystallinity values is narrower. The previous GLZ were defined with only first-order effects, i.e., $\beta_i x_i$. However, in GLZ more advanced configurations, such as factorial, fractional, polynomial, quadratic models, or even special user-defined effects, can be specified. In Figure 7, all the models are estimated and their respective predicted values of ITQ-21 crystallinity are plotted, with the corresponding effect estimates given in Table 3. The values of the parameters ($\beta_i$ and the scale parameter) in the GLZ are obtained by maximum likelihood estimation. Note that highlighted values correspond to statistically significant estimates at α = 0.05. On the basis of the estimate values and their significance when considering different forms of models, we can say that the MLR does not contain enough features to capture the underlying information, and consequently all the input variables are significant. On the contrary, the full factorial design takes into account too many variables; thus, the information is spread and smoothed over the numerous estimates.


Finally, the models retained are the quadratic one for its overall performance, the fractional factorial to degree 2 for its simplicity, and the fractional factorial design to degree 3 since it represents an intermediate solution. The relationships between predictors and their interactive effects (e.g., two predictors masking the effects of a third) are much more complex. However, it can be observed that the conclusion drawn previously about the effect of the Si/Ge and Al contents, when considering or not the effect of the water, is confirmed here through inspection of the significance of the two-way interaction effects. One can also make statistical inferences about the parameters using confidence intervals and hypothesis tests. The confidence interval for a specific statistic gives a range of values around the statistic where the "true" statistic can be expected to be located with a given level of certainty (here the level is set to 90%). It is therefore possible to provide confidence intervals for the predicted values; an example is given for the best model found earlier, i.e., the quadratic response surface regression model, in Figure 8. As α decreases, the interval becomes wider. These are examples of the numerous advantages offered by such parametric modeling.

3.4. Piecewise Linear Regression

The slope of a function at a particular point can be computed as the first-order derivative of the function at that point. The "slope of the slope" is the second-order derivative, which tells us how fast the slope is changing at the respective point, and in which direction. The quasi-Newton method, at each step, evaluates the function at different points in order to estimate the first-order and second-order derivatives, and uses this information to follow a path toward the minimum of the loss function. We have chosen the quasi-Newton method since, for most applications, it yields good performance. Other procedures, which use various geometrical approaches to function minimization, may be more "robust," that is, less likely to converge on a local minimum and less sensitive to "bad" starting values. However, special attention has been given to such parameters and care was taken regarding the reproducibility of the results. The loss function is least squares, as in many other cases. In Figure 9 the predicted values of ITQ-21 crystallinity are plotted against the observed values. It is surprising to see that this very simple method, compared to all other approaches, allows us to obtain a quasi-perfect fit of very low or even null crystallinity values, as shown in Figure 9. The equations of the PLR model, with a breakpoint at 17.4582, are y = 5.1988 − 0.1013 t − 0.0244 Si/Ge − 24.8966 Al/(Si + Ge) + 4.0457 MSTP/(Si + Ge) − 0.5039 H2O/(Si + Ge) for the first segment and y = 113.28 + 3.2668 t − 2.1384 Si/Ge − 557.013 Al/(Si + Ge) + 26.9783 MSTP/(Si + Ge) − 8.2507 H2O/(Si + Ge) for the second.
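A minimal sketch of this kind of piecewise fit, assuming (as stated in the text) that the segment is selected from the observed dependent variable and that a quasi-Newton routine minimizes the least-squares loss; the data and the breakpoint handling are simplified placeholders, not the actual fitting code of the study.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
X = rng.random((144, 5))                     # placeholder synthesis variables
y = 100 * rng.random(144)                    # placeholder crystallinity
X1 = np.column_stack([np.ones(len(X)), X])
BREAK = 17.4582                              # breakpoint reported in the text

def sse(params):
    b_low, b_high = params[:6], params[6:]   # one coefficient set per segment
    pred = np.where(y <= BREAK, X1 @ b_low, X1 @ b_high)
    return np.sum((y - pred) ** 2)

res = minimize(sse, x0=np.zeros(12), method="BFGS")   # quasi-Newton minimization
print(round(res.fun, 1), res.x[:6].round(3))
```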


Table 3. Estimates of GLZ models using different configurations of effects. Gray cells contain significant values at p < 0.05. Effects

Multiple reg. Full-factorial Polynomial- Quadratic Fractional Fractional degree 2 response factorial factorial surface to degree 3 to degree 2 regression

Intercept B0 First order (main effects) t Si/Ge Al/c MSPT/c H2O/c Two-way interaction t. ( Si/Ge) t. Al/c t. MSPT/c t. H2O/c ( Si/Ge).Al/c ( Si/Ge).MSPT/c ( Si/Ge).H2O/c ( Al/c).MSPT/c ( Al/c).H2O/c ( H2O/c).MSPT/c Three-way interaction t.( Si/Ge).Al/c t.( Si/Ge). MSPT/c t.( Si/Ge). H2O/c t.( Al/c). MSPT/c t.( Al/c). H2O/c t.( MSPT/c). H2O/c ( Si/Ge).Al/c.( MSPT/c) ( Si/Ge).Al/c.( H2O/c) ( Si/Ge).MSPT/c.( H2O/c) ( Al/c). MSPT/c.( H2O/c) {4...5}-way ... Second order T2 ( Si/Ge)2 ( Al/c) 2 ( MSPT/c) 2 ( H2O/c) 2 Scale

5.8659 0.0552  0.0672  13.7012 0.7943  0.3214

10.2652

One has to note that the breakpoint is defined on the dependent variable; therefore, in order to assign a value to a new experiment, it must first be estimated on which side of the breakpoint the dependent variable will fall. For this, a previous model can be used, or a classification algorithm with a two-class system defined by the threshold (i.e., the breakpoint). The final PLR efficiency therefore depends on this prior estimation. A quick classification using the quadratic model misclassified only six experiments.

5.446  0.324  0.032 61.500 0.482  0.247  0.006  6.832 0.655 0.076  2.568  0.103 0.003  162.920  27.895  0.583 1.203 0.028 0.009 18.467 5.623 0.145 9.108 0.486 0.064 100.063 ...

7.724

6.9819 0.0485 0.0255  11.3677  11.1976 0.3193

3.4273  0.0346 0.0191  1.3192 3.2865 0.7122 0.0016 0.6027  0.0004 0.0190  0.1918  0.0327 33.8077  0.0270  9.0908 0.4554

0.0000  0.0002  33.0816 15.6960  0.0935 9.8454

0.0000  0.0003  65.1404  5.3025  0.0751 7.5255

2.535  0.004 0.051 143.024 4.124 0.665  0.002  8.681 0.501 0.060  3.816  0.129  0.007  225.243  49.030  1.037 0.379  0.011  0.002  0.266 1.135  0.126 5.223  0.095  0.004 80.883

5.6146  0.0240  0.0256  10.6196  0.6261  0.0037 0.0023 0.2978 0.0192 0.0085  0.1389  0.0242  0.0171 29.0218  6.5146 0.3316

7.812

8.7050

4 Results of Non-parametric Approaches and Prediction of ITQ-21 Phase Crystallinity

Having previously estimated the distribution of the collected data from the ITQ-21 analysis study, the predictions of the previous parametric statistics are compared with those of NNs, SVMs, and RTs. For each ML approach, the whole dataset, which contains 144 data points, is divided into three different sets, namely "training," "selection," and "test," with 64, 40, and 40 individuals, respectively, in order to avoid overfitting. Thus, the test set represents 28% of the entire dataset, as mentioned before.

4.1 Comparison and Performance Assessments

As in the case of traditional MLR models, fitted GLZ can be summarized through statistics such as parameter estimates, their standard errors, and GoF statistics. Here, several statistics have been calculated: the correlation coefficient r (Eq. 2, i.e., the correlation between predicted and observed output values), the coefficient of determination (R², Eq. 3), the adjusted R² (R²adj, Eq. 4), the standard deviation (Eq. 1) of the target output variable (sy), and the standard deviation of the errors for the output variable (se). A perfect prediction will have a correlation coefficient of 1.


Figure 7. ITQ-21 phase crystallinity fitting using GLZs and MLR is given as a reference.

A correlation of 1 does not necessarily indicate a perfect prediction (only a prediction which is perfectly linearly correlated with the actual outputs), although in practice the correlation coefficient is a good indicator of performance. It also provides a simple and familiar way to compare the performance of statistical and ML methods. In Eqs. 1 – 4 the formulas are given for each statistic, with n the number of data points and p the number of predictors. Adding more independent variables to a model can only increase R². Since the number of variables selected by the NN is different from the one used in the other approaches, R²adj has also been used.

$s = \sqrt{\frac{1}{N}\sum_i (x_i - \bar{x})^2}$    (1)

$r = \dfrac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{\left[n\sum_i x_i^2 - \left(\sum_i x_i\right)^2\right]^{1/2}\left[n\sum_i y_i^2 - \left(\sum_i y_i\right)^2\right]^{1/2}}$    (2)

$R^2 = 1 - \dfrac{\sum_i (y_i - \tilde{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$    (3)

$R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\dfrac{n-1}{n-p-1}$    (4)
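These statistics, together with the "SD ratio" used later in Tables 8 and 9 (the ratio of the prediction-error standard deviation to the output-data standard deviation), can be computed in a few lines; the helper below is our own sketch and simply mirrors Eqs. 1 – 4.

```python
import numpy as np

def performance_stats(y_obs, y_pred, p):
    """r (Eq. 2), R2 (Eq. 3), adjusted R2 (Eq. 4) and the error-SD / data-SD ratio."""
    n = len(y_obs)
    r = np.corrcoef(y_obs, y_pred)[0, 1]
    r2 = 1 - np.sum((y_obs - y_pred) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    sd_ratio = np.std(y_obs - y_pred, ddof=1) / np.std(y_obs, ddof=1)
    return r, r2, r2_adj, sd_ratio

# Toy illustration with 40 synthetic observations and 5 predictors
y_obs = np.linspace(0, 80, 40)
y_pred = y_obs + np.random.default_rng(6).normal(0, 5, 40)
print([round(v, 3) for v in performance_stats(y_obs, y_pred, p=5)])
```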


4.2 Performances of Neural Networks, Regression Trees, and SVMs

The most common NN architectures have outputs in a limited range (e.g., 0 – 1 for the logistic activation function we use). When the desired output lies in such a range, this is of interest for classification problems, as investigated in [15]. However, for regression problems there is clearly an issue to be resolved, and some of the consequences are quite subtle. A scaling algorithm can be applied to ensure that the network's output will be in a sensible range. The simplest scaling function finds the minimum and maximum values of a variable in the training data and performs a linear transformation to convert the values into the target range; the network's output is then constrained to lie within this range. However, this leads to a problem of extrapolation for new materials outside the range defined by the training cases. For a fair simulation of the prediction of new materials' crystallinity, one has to consider that the expected values can be lower than the worst experiment or higher than the best case previously seen in the current dataset. Thus, we have chosen to always rescale the training data within the range [0, 0.9], since a crystallinity lower than zero cannot be attained (zero corresponding to an amorphous material).
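One possible reading of this scaling choice, as a hedged sketch (function names are ours, not the authors'): the training targets are mapped linearly onto [0, 0.9], keeping zero fixed because negative crystallinity is impossible and leaving headroom above the best sample seen so far.

```python
import numpy as np

def make_scaler(y_train, upper=0.9):
    """Return forward/inverse maps sending [0, max(y_train)] onto [0, upper]."""
    scale = upper / float(np.max(y_train))
    to_net = lambda y: np.asarray(y) * scale
    from_net = lambda s: np.asarray(s) / scale
    return to_net, from_net

to_net, from_net = make_scaler(np.array([0.0, 35.0, 100.0]))
print(to_net(100.0), from_net(to_net(80.0)))   # 0.9, 80.0
```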


Figure 8. Confidence intervals of predicted values for the best GLZ models. Three different α values are considered (i.e., 10, 5, and 1%).

However, it may be possible to obtain a more crystalline ITQ-21 sample than the one obtained up to now. The 100% crystallized material may not belong to the training set (e.g., with a random selection of the training set), and the 100% crystallinity has been arbitrarily defined by the best zeolite found in our experimentation; new syntheses could achieve an even better crystallized sample. In all cases, NNs in the form of a Multi-Layer Perceptron (MLP) and SVMs using an RBF kernel reached the best performances. In Table 4, the best NN model for the prediction of ITQ-21 crystallinity is shown. Two points have to be underlined concerning the performance assessment of NNs.


(i) The work required to obtain and select the best NN is by far more time-consuming than for the other non-parametric approaches. Considerable attention has been given to NNs due to the high variability of the results we obtained. Numerous architectures, activation functions, and other parameters have been tested, and several NN models were discarded due to the large difference in performance between the training/selection and the test sets, indicating a clear overfitting phenomenon. (ii) Having combined a feature selection algorithm with the NN, some of the first selected "good" networks are composed of very few input variables.


Figure 9. ITQ-21 phase crystallinity modeling with best GLZ, PLR, SVM, NN, and RT.

Considering the synthesis of zeolites, this does not mean that any of the variables we have used is without effect and can be eliminated from the synthesis procedure. Rather, the selection of input variables eliminates variables for which the network did not find the right way to exploit the information they carry once the other variables have been exploited. Moreover, reducing the pool of input variables inherently limits the potential for extrapolation when a broader range of synthesis variables is used, since the role of the discarded variables

could emerge. Both R² and corrected R² have been given, although the use of the latter can be questioned for the above reasons. It has to be noted that such a feature selection mechanism could also have been used for SVM or regression trees. Conversely, the stability of these methodologies is usually better, partially owing to the very small number of parameters compared to the numerous ones contained in the NN architecture, as will be shown later.

Table 4. Description of all the selected models for the prediction of ITQ-21 phase crystallinity.

Model                                     r       R²      R² adjusted   SD of errors
MLR                                       0.782   0.611   0.597         15.931
GLZ (normal distribution, log link)       0.919   0.844   0.839         10.151
Full factorial                            0.953   0.909   0.906          7.695
Polynomial of degree 2                    0.923   0.853   0.847          9.813
Quadratic response surface regression     0.955   0.913   0.910          7.514
Fractional factorial to degree 3          0.952   0.907   0.904          7.782
Fractional factorial to degree 2          0.941   0.885   0.881          8.664
Piecewise linear regression               0.962   0.925   0.923          6.978
Neural network (MLP 4:4-5-1:1)            0.918   0.843   0.838         10.139
SVM (radial basis function)               0.921   0.849   0.844          9.956
Regression tree                           0.916   0.840   0.835         10.216

In the original table, black cells denote the non-parametric approaches and gray cells the selected models. Mean of the whole dataset: 17.458. Standard deviation of the whole dataset: 25.565.


Figure 10. Final Regression tree for prediction of ITQ-21 phase crystallinity after additional user pruning.

Figure 9 shows the predictions of ITQ-21 crystallinity for the given NN. The effect of the synthesis variable "time" being rather low (as indicated earlier), the NN has removed it from the input parameters. The number of false positives is much higher for NN-MLP and SVM-RBF (radial basis function) than for all other techniques. The SVM-RBF is the best among the non-parametric approaches considering the overall criteria given in Table 4, but, on the other hand, numerous negative crystallinity values can be observed. A k-fold (k = 10) Cross-Validation (CV) has been used for optimizing the capacity (C) and epsilon (ε) at the same time; for C = 10, gamma (γ) has been set at 0.2 and ε = 0.1. The Regression Tree (RT) produces accurate predictions based on a few logical if – then conditions. A ten-fold CV is used for pruning. The original version of the RT was composed of 13 non-terminal nodes and 14 (terminal) leaves. In Figure 10, some terminal nodes have been pruned further (leaves containing fewer than 20 individuals are removed), making the reading easier. It can be observed through the gray-scale rectangles that the RT succeeds in isolating the different levels of ITQ-21 crystallinity.


It is interesting to observe that the position of the rectangles gives an intuitive classification of the samples studied, allowing an easy visualization of the crystallinity and the synthesis factors. The leaves in the left branches present an increasing crystallization (dark rectangles) going down the splits, while in the right branches the crystallinity descends (bright rectangles). For each leaf, the mean (μ) of the samples is indicated. Figure 10 shows that the most crystalline samples are obtained for concentrated gels (H2O < 3.5) and Si/Ge < 36. These results are in agreement with the conclusions obtained in Section 2 and with the MLR (Section 3.2). Moreover, previous work [33] suggests that ITQ-21 could be obtained for a Si/Ge ratio of 25, but not for 50. SVM-RBF obtains the best results without requiring heavy pre- and post-treatments. However, the RT may be preferred because the RBF kernel makes model interpretation more difficult than simpler ones. Such a kernel has been chosen to give the SVM the same chance as the NN.


However, simpler kernels such as polynomials of degrees 2 and 3 have also been tested. The results for a 30% test set and 10-fold CV are the following: {Degree, C, ε, γ, coeff.} = {3, 10, 0.1, 0.3} gives r = 0.915 (training) and r = 0.85 (test); {Degree, C, ε, γ, coeff.} = {2, 10, 0.1, 0.3} gives r = 0.920 (training) and r = 0.87 (test). These results confirm what was concluded from the MLR and GLZ examination, i.e., the use of second-degree effects is useful while the integration of higher-order effects is not. The difference in performance between the RBF and such a simple kernel is very low, which once again argues against more complex models in this study. Finally, both the RT and the SVM with a polynomial kernel of degree 2 are selected: the RT should be used for a quick overview of the system, while the SVM could allow a contour plot to be drawn precisely.
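A sketch of how such polynomial-kernel SVR settings could be screened with 10-fold cross-validation under a 30% hold-out protocol (scikit-learn); the data are placeholders and the parameter grid only illustrates the kinds of values quoted above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(7)
X = rng.random((144, 5)); y = 100 * rng.random(144)     # placeholders
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

grid = GridSearchCV(
    SVR(kernel="poly"),
    param_grid={"degree": [2, 3], "C": [10], "epsilon": [0.1], "gamma": [0.3]},
    cv=10, scoring="r2",
)
grid.fit(X_tr, y_tr)
r_test = np.corrcoef(y_te, grid.predict(X_te))[0, 1]
print(grid.best_params_, round(r_test, 3))
```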

5 Advanced Analysis of Methodologies and Interpretation

In order to show the difficulties encountered when using NNs, and also to better justify the selection of the SVM methodology over NNs in this study, both techniques are compared on the basis of the stability/variability of their performances depending on the amount of data available for the training step. We have chosen to assess generalization performance for only these two approaches, since SVM has been qualified as a more stable technique than NN, while all the other techniques used are far less likely to overfit the data, or a fast post-processing treatment (such as for RT) can easily be combined with them. Through this analysis it is also investigated whether decreasing the size of the test set, in order to allocate more "resources" to the training part, increases the variability of the performance and thus the risk of falsely accepting a model. The dataset is divided into two parts, training (Tr) and test (Te); their respective sizes vary and the fitting capacity is assessed. The relative amount of data in the test subset is set to either 70 or 30% of the whole available dataset. Five different samplings of each partition into training and test sets are presented for both NNs and SVM. The frequencies of responses have been checked in order to have a minimum number of each type of experiment in both the training and test sets, i.e., low and high ITQ-21 crystallinity values. This permits assessment of the modeling performance over three different ranges of crystallinity: {0, ]0...50], ]50...100]}. Table 5 gives the mean and standard deviation of each sample taking these ranges into account, while Tables 6 and 7 give the statistics for the predicted values. The best solution using RBF or MLP (three or four layers) is retained for the NN, while the SVM makes use of the RBF form only. Considering NNs, the best network found is kept for each sampling after elimination of the networks that show clear overfitting


through the difference between training and test performances, as done before. However, set #2 in Table 6 shows an obvious estimation failure: only one input has been kept and, consequently, the range of maximum interest (i.e., "> 50") is greatly under-evaluated while amorphous materials are overestimated. Obviously, such an NN has been trapped in a local optimum. In Tables 6 – 9, the gray cells indicate where a given failure has been encountered, while the black cells indicate the selected models. The different kinds of failure are outlined below; it has to be pointed out that the following criteria are not independent, and therefore only the most significant criterion of each failure is shown in gray. On the basis of the traditional statistics listed in Tables 8 and 9: (1) A high performance drop from the training to the test set, such as for the NN using the MLP technique tested with set #1 (11.4% = 98.1 − 86.7, Table 8), which represents the greatest fall, but also set #5 for NN-MLP in Table 8. (2) A relatively low performance compared to all other models of the same type; on this basis, set #4 for NN-RBF in Table 9 is discarded. (3) A relatively high error standard deviation. One has to note that, even if a prediction error mean extremely close to zero is expected, it is possible to obtain a zero prediction error mean simply by predicting the average of the training data, without any recourse to the input variables or to any advanced methodology at all. Thus, the error standard deviation is of great interest in order not to retain falsely good models, such as NN-RBF tested on set #4 in Tables 8 and 9. NN-RBF with set #2 in Table 8 could have been discarded directly on the basis of the error mean. Note that if the error standard deviation is no better than the standard deviation of the training data, then the technique has performed no better than a simple mean estimator. (4) A weak (i.e., non-robust) architecture. Not only the NN-MLP tested on set #2 in Table 8, but also the NN-MLP and NN-RBF tested on set #5 in Table 9, use a very low number of inputs, indicating that the networks did not manage to use the information brought by all the variables. In Tables 6 and 7, the predictions are followed over separate ranges of crystallinity. (5) A difference between the observed and predicted means of ITQ-21 crystallinity. This is generally observed for high values of crystallinity (sets #2, #3, and #5 for NN-MLP and sets #4 and #5 for NN-RBF in Table 6, and sets #1, #3, and #4 for NN-RBF in Table 7), and is due to the relatively low number of experiments belonging to the range "> 50." On the other hand, in set #2 for both NN-RBF and NN-MLP in Table 6, a very bad recognition of amorphous materials is detected, as well as for set #1 for NN-MLP. The prediction for medium crystallized materials is overestimated in set #1 for NN-MLP, making the margin between the medium and highly crystallized zeolites very narrow. (6) Overfitting is also detected through the high standard deviation of the predicted ITQ-21


Table 5. Two different partitions of the whole available experimental set of data are described.

Real ITQ-21 crystallinity, mean (SD), for each test set:

Test set % (mean nominal value)   Range         Set #1              Set #2              Set #3              Set #4              Set #5
30 (44.6)                         0             0 (0)               0 (0)               0 (0)               0 (0)               0 (0)
                                  ]0...50]      30.8362 (15.6366)   17.0754 (13.0485)   28.2170 (17.2106)   28.4415 (15.2259)   25.4530 (17.5253)
                                  ]50...100]    68.1282 (13.1038)   69.9754 (12.9964)   72.1727 (9.4963)    63.9546 (3.7527)    71.8512 (15.1351)
                                  Total         21.4306 (28.9117)   18.9280 (28.2568)   15.2557 (25.4654)   18.7385 (25.7839)   22.7955 (29.8045)
70 (100.8)                        0             0 (0)               0 (0)               0 (0)               0 (0)               0 (0)
                                  ]0...50]      25.692 (13.3596)    23.674 (14.5605)    25.772 (14.4003)    22.415 (15.2069)    24.740 (15.085)
                                  ]50...100]    65.853 (10.683)     65.077 (10.527)     66.109 (8.068)      66.822 (11.615)     69.327 (12.645)
                                  Total         17.0139 (24.3044)   17.5168 (25.4006)   16.7966 (24.8081)   17.3316 (26.3934)   17.4238 (26.5437)

The first column gives the percentage of the whole dataset placed in the test set and the corresponding nominal mean. For a given partition, the real mean and standard deviation of each test set are given for different ranges of ITQ-21 phase crystallinity (null, ]0...50], ]50...100], and the entire set).

Table 6. Mean and standard deviation of the predicted ITQ-21 crystallinity for a test set of 70% of the entire dataset. Neural networks Test sets only ITQ-21 crytallitnity

Mean

SD

Test sets only ITQ-21 crytallitnity

Mean

SD

Support vector machines Test sets only ITQ-21 crytallitnity Mean

SD

Ranges

MLP 4 : 4 – 2-1 : 1

MLP 1 : 1 – 1-1 – 1 : 1

MLP 3 : 3 – 1-2 – 1 : 1

MLP 3 : 3 – 2-1 : 1

MLP 3 : 3 – 3-1 : 1

0 > 50 < 50 0 > 50 < 50 Ranges 0 > 50 < 50 0 > 50 < 50

2.9502 22.0063 62.5251 7.9992 13.5509 31.1547 RBF 3 : 3 – 9-1 : 1 0.5330 30.0371 58.6176 7.6737 17.5384 24.2643

5.1625 21.1096 27.1190 6.3111 10.1026 6.7165 RBF 3 : 3 – 10 – 1 : 1 8.8558 33.4153 69.5698 13.1794 19.2353 19.7210

0.3841 24.6572 47.3451 5.1928 16.3604 14.6670 RBF 3 : 3 – 9-1 : 1 0.8193 33.7114 60.7412 7.3416 22.6005 23.1682

2.4025 26.2958 60.8654 8.8584 21.5982 17.3099 RBF 4 : 4 – 9-1 : 1 1.0616 27.1984 44.4586 14.0595 12.2213 8.7685

0.3139 24.6209 45.7861 5.6070 17.2128 12.2742 RBF 4 : 4 – 10 – 1 : 1 1.7603 24.8732 40.9229 11.2243 12.0779 11.6271

Ranges 0 > 50 < 50 0 > 50 < 50

RBF 1 1.3014 29.4619 54.8544 11.3509 15.1640 20.7396

RBF 2 1.87664 30.8891 54.5931 14.4216 14.9366 12.4630

RBF 3  1.3818 28.3283 60.4716 8.1337 15.0950 14.3953

RBF 4 0.7985 29.4217 60.6477 10.7433 12.1353 13.7363

RBF 5 2.2881 31.6706 51.7803 12.1001 15.8295 16.8359

The two statistics are given depending on both the methodologies employed and ranges of the real ITQ-21 crystallinity.

Table 7. Mean and standard deviation of the predicted ITQ-21 crystallinity depending on both the methodologies employed and ranges of the real ITQ-21 crystallinity. Neural networks Test sets only ITQ-21 crytallitnity

Mean

SD

Ranges

MLP 4 : 4 – 8-1 : 1

MLP 4 : 4 – 3-1 : 1

MLP 4 : 4 – 10 – 8-1 : 1

MLP 5 : 5 – 3-1 : 1

MLP 3 : 3 – 1-3 – 1 : 1

0 > 50 < 50 0 > 50 < 50

1.6462 39.6653 51.9028 2.8551 21.1517 20.4111

0.4754 20.6251 60.1008 4.4862 15.4822 13.2987

5.6347 24.4674 61.6987 10.0612 17.3254 26.5842

0.3186 30.0032 56.7048 5.3826 18.4007 23.2101

 0.3159 31.9088 61.3224 4.0088 24.8200 9.8563


Table 7. (cont.) Neural networks Test sets only

Ranges

MLP 4 : 4 – 8-1 : 1

MLP 4 : 4 – 3-1 : 1

MLP 4 : 4 – 10 – 8-1 : 1

MLP 5 : 5 – 3-1 : 1

MLP 3 : 3 – 1-3 – 1 : 1

Test sets only

Ranges 0 > 50 < 50 0 > 50 < 50

RBF 5 : 5 – 20 – 1 : 1 1.3549 33.6393 43.0530 7.5802 14.3402 14.2721

RBF 4 : 4 – 15 – 1 : 1 0.1903 23.2889 56.7472 7.4352 14.2591 12.0603

RBF 4 : 4 – 19 – 1 : 1 3.5478 31.4614 50.5591 9.1270 20.1331 14.6490

RBF 5 : 5 – 6-1 : 1 3.1014 29.4760 41.0717 14.8800 8.8716 9.5832

RBF 2 : 2 – 9-1 : 1 1.5366 27.0582 53.1617 5.3202 20.6843 13.1921

Ranges 0 > 50 < 50 0 > 50 < 50

RBF 1  0.3699 33.1845 55.7318 6.2682 17.0246 23.4673

RBF 2 1.5211 24.0824 59.0732 8.2075 14.4054 13.2500

RBF 3 2.0352 26.9391 55.0579 8.8295 11.9040 15.2324

RBF 4 0.0455 29.1968 62.3767 7.2102 12.9160 13.7425

RBF 5  0.1315 29.1049 58.3044 6.0857 16.2556 15.9692

ITQ-21 crytallitnity

Mean

SD

Support vector machines Test sets only ITQ-21 crytallitnity Mean

SD

Tables 6 and 7 differ from the percent of test set which is used. Here the test set represents 30% of the whole available dataset.

Table 8. Statistics for each type of model (form and parameters) and each test set: the mean error, the error standard deviation (SD), the ratio of the prediction-error standard deviation to the standard deviation of the original output data (noted "SD ratio"), and the Pearson correlation r for both the training and the test sets. Note that a lower SD ratio indicates a better prediction, and this is equivalent to 1 minus the explained variance of the model. The test set represents 70% of the whole available dataset.

Neural networks (statistics on the test sets)

Form   Parameters       Test set   Error mean   Error SD   SD ratio   r training (+ selection)   r test
MLP    4:4-2-1:1        Set #1       -0.0430      8.3626     0.5254     0.98191                   0.86779
MLP    1:1-1-1-1:1      Set #2        4.2375     12.8548     0.7489     0.55621                   0.70001
MLP    3:3-1-2-1:1      Set #3        2.8432      6.5612     0.4112     0.91686                   0.91527
MLP    3:3-2-1:1        Set #4       -1.3291      8.3130     0.4696     0.93298                   0.88991
MLP    3:3-3-1:1        Set #5        3.5427      7.8134     0.4612     0.96023                   0.89572
RBF    3:3-9-1:1        Set #1       -0.6261      8.9759     0.5109     0.92194                   0.87771
RBF    3:3-10-1:1       Set #2       -8.3668     12.8457     0.5488     0.90154                   0.86598
RBF    3:3-9-1:1        Set #3       -1.8639      9.4848     0.5318     0.92553                   0.87900
RBF    4:4-9-1:1        Set #4        2.0565     13.1512     0.6105     0.82674                   0.79190
RBF    4:4-10-1:1       Set #5        3.4021     11.8073     0.5934     0.81881                   0.80956

Support vector machines (statistics on the test sets); form: RBF, parameters {C, e, g} = {10, 0.1, 0.3}

Test set   Error mean   Error SD   SD ratio   r training (+ selection)   r test
Set #1       -0.3540      9.5929     0.5268     0.93276                   0.8517
Set #2       -1.2760      0.30301    0.4636     0.91980                   0.8597
Set #3        2.3972      8.7152     0.4634     0.93419                   0.8781
Set #4        0.5403     10.4749     0.4992     0.91399                   0.8581
Set #5       -0.4055     10.2129     0.5233     0.95273                   0.8670
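The per-test-set statistics of Tables 8 and 9 (mean error, error SD, SD ratio, and Pearson r) follow directly from the definitions given in the captions. A minimal sketch, assuming the measured and predicted crystallinities of one test set are available as NumPy arrays (the function and variable names are ours):

```python
# Minimal sketch of the statistics reported in Tables 8 and 9
# (assumed inputs: measured y_true and predicted y_pred for one test set).
import numpy as np

def test_set_statistics(y_true, y_pred):
    error = y_pred - y_true
    mean_error = error.mean()
    sd_error = error.std(ddof=1)
    # "SD ratio": prediction-error SD relative to the SD of the measured output;
    # the lower the ratio, the better the prediction.
    sd_ratio = sd_error / y_true.std(ddof=1)
    # Pearson correlation r between measured and predicted values.
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return {"mean error": mean_error,
            "error SD": sd_error,
            "SD ratio": sd_ratio,
            "Pearson r": r}
```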


Table 9. Statistics for each type of model (form and parameters) and each test set: the mean error, the error standard deviation (SD), the ratio of the prediction-error standard deviation to the standard deviation of the original output data (noted "SD ratio"), and the Pearson correlation r for both the training and the test sets. Note that a lower SD ratio indicates a better prediction, and this is equivalent to 1 minus the explained variance of the model. The test set represents 30% of the whole available dataset.

Neural networks (statistics on the test sets)

Form   Parameters       Test set   Error mean   Error SD   SD ratio   r training (+ selection)   r test
MLP    4:4-8-1:1        Set #1        1.7774      6.7509     0.3846     0.9309                    0.9246
MLP    4:4-3-1:1        Set #2        0.7065      6.6734     0.3610     0.9316                    0.9336
MLP    4:4-10-8-1:1     Set #3       -1.4581      7.7908     0.5077     0.9281                    0.8625
MLP    5:5-3-1:1        Set #4        0.7467      7.2215     0.4313     0.9520                    0.9081
MLP    3:3-1-3-1:1      Set #5        0.2503      8.2122     0.3942     0.9086                    0.9197
RBF    5:5-20-1:1       Set #1        4.3222      6.9701     0.4833     0.8688                    0.8891
RBF    4:4-15-1:1       Set #2        0.7534      4.0173     0.4108     0.9018                    0.9133
RBF    4:4-19-1:1       Set #3       -0.6127      8.7909     0.5222     0.8497                    0.8528
RBF    5:5-6-1:1        Set #4        2.1398     11.4482     0.6267     0.7989                    0.7793
RBF    2:2-9-1:1        Set #5        2.5788      6.9778     0.4557     0.8225                    0.8938

Support vector machines (statistics on the test sets); form: RBF, parameters {C, e, g} = {10, 0.1, 0.3}

Test set   Error mean   Error SD   SD ratio   r training (+ selection)   r test
Set #1        2.2549      7.5327     0.3688     0.9323                    0.9297
Set #2       -0.6212      8.3591     0.3921     0.9258                    0.9208
Set #3        0.3328      8.2413     0.4362     0.9420                    0.9055
Set #4        1.8904      6.8110     0.3529     0.9430                    0.9452
Set #5        1.6718      8.8380     0.3939     0.9207                    0.9204
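All SVM entries in Tables 8 and 9 use an RBF kernel with the single parameter set {C, e, g} = {10, 0.1, 0.3}. Assuming e and g denote the usual epsilon (insensitivity tube) and gamma (kernel width) parameters of support vector regression, a comparable model could be set up, for instance, with scikit-learn as sketched below; the library choice and the dummy data are our assumptions, not a description of the software actually used in this work.

```python
# Minimal sketch: support vector regression with an RBF kernel and the
# hyperparameters reported for the SVM models ({C, e, g} = {10, 0.1, 0.3}),
# assuming e = epsilon and g = gamma.
import numpy as np
from sklearn.svm import SVR

# X: synthesis descriptors (e.g., gel composition variables), y: ITQ-21 crystallinity.
# Dummy data for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 4))
y = 100.0 * X[:, 0] * (1.0 - X[:, 1])

model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.3)
model.fit(X[:40], y[:40])          # train on part of the data
y_pred = model.predict(X[40:])     # predict the held-out samples
```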

linity for medium and highly crystallized materials. This is observed for the majority of the NN models: set #1 for NN-MLP and sets #4 and #5 for NN-RBF in Table 6, and all NN-MLP models except the one tested on set #2, together with sets #3 – 5 for NN-RBF, in Table 7. This statistic shows that NNs often fail to find a good and stable model over the whole range of ITQ-21 crystallinity. Considering all these criteria, one can observe that NNs are much more affected than SVMs for both sizes of training set. As expected, the number of detected failures increases as the amount of training data decreases. Relatively small test sets increase the risk of selecting the wrong model, as seen in the higher variability of the criteria. For the case with 70% of the data kept as test set, the number of inputs selected for the NNs is lower than in the other case. The relative lack of experiments does not allow the whole set of features to be exploited: the variability of the responses is quickly attributed to a few variables, and the remaining ones are regarded as carrying redundant or noisy information.
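The comparison above relies on repeating the assessment over five different test sets of fixed size (70% or 30% of the data). How exactly the five sets were drawn is not detailed in this section; the sketch below merely illustrates such a repeated random-split protocol, with scikit-learn's train_test_split and the model_factory callable being our assumptions.

```python
# Minimal sketch of a repeated random-split assessment over several test sets
# (e.g., five sets holding out either 30% or 70% of the data).
import numpy as np
from sklearn.model_selection import train_test_split

def assess(model_factory, X, y, test_fraction, n_splits=5, seed=0):
    """Fit a fresh model on each random split and report test-set SD ratio and r."""
    for i in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_fraction, random_state=seed + i)
        model = model_factory()
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_te)
        err = y_pred - y_te
        sd_ratio = err.std(ddof=1) / y_te.std(ddof=1)
        r = np.corrcoef(y_te, y_pred)[0, 1]
        print(f"set #{i + 1}: SD ratio = {sd_ratio:.3f}, r = {r:.3f}")
```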


6 Conclusions

This work presents a broad investigation of different modeling techniques for the prediction of performances in materials science. Two types of methodologies are examined: on the one hand, parametric strategies and, on the other hand, non-parametric techniques. The non-parametric methods employed here are all ML algorithms, namely NNs, SVMs, and regression trees. They reach a reasonable fitting accuracy. However, for the ML techniques the recurrent problem of overfitting had to be considered and investigated. The parametric methods are less subject to this problem of great importance. The difference is due to the fact that the statistical models are inherently restricted in their model forms, while learning methods, and particularly NNs, possess a high flexibility and numerous setting parameters. The advanced performance assessment of NNs and SVMs has allowed us to verify this assumption. As a general rule, the parametric approach should always be employed as a reference for further work. Both approaches are compatible, and the selection of a unique model is not compulsory. In contrast, we advocate the use of the simplest methodologies at the beginning, while more complicated but less informative techniques should be reserved for cases where complex underlying systems are detected. The combination of multiple approaches is of great appeal and should allow higher and more stable performances to be reached, taking advantage of each method's contribution, while the expected drawbacks might be eliminated or corrected through the complementarity of each technique's strengths. This study has pointed out the difficulty of selecting models when no a priori knowledge or preference for a particular modeling technique is available. Of course, such a study is problem-dependent, as the number of input variables, the available amount of data, and the complexity of the system investigated make a given approach more or less suitable. Thus, such a preliminary inspection of the techniques appears mandatory in order to decrease the risk of misleading results. Data mining technology is increasingly applied in production mode, which usually requires automatic analysis of the data and the related results in order to draw conclusions. However, we have shown here that the selection of a given model remains a difficult task, which makes the automation of the whole combinatorial loop a problem that is too often underestimated. Unlike traditional data mining contexts, which deal with voluminous amounts of data, materials science is actually characterized by a scarcity of data, owing to the cost and time involved in conducting simulations or setting up experimental apparatus for data collection. In such domains, it is prudent to balance the speed gained through automation against the utility of the data. For these reasons, human interaction, verification, and guidance may lead to better-quality output.
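As a purely illustrative sketch of the combination of approaches advocated above (and not a procedure taken from this work), the predictions of a simple parametric model and of a non-parametric one can be averaged; the equal weights and the particular estimators chosen here are arbitrary assumptions.

```python
# Illustrative sketch only: combining a simple parametric model with a
# non-parametric one by averaging their predictions.
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

def combined_prediction(X_train, y_train, X_new):
    linear = LinearRegression().fit(X_train, y_train)
    net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000,
                       random_state=0).fit(X_train, y_train)
    # Equal-weight average; the weights could instead be derived from validation errors.
    return 0.5 * linear.predict(X_new) + 0.5 * net.predict(X_new)
```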

Acknowledgements

The EU Commission (TOPCOMBI Project) is gratefully acknowledged.

References

[1] H. Lee, S. I. Zones, M. E. Davis, Nature 2003, 425, 385 – 387.
[2] A. Corma, J. Catal. 2003, 216(1 – 2), 298 – 312.
[3] C. S. Cundy, P. A. Cox, Chem. Rev. 2003, 103, 663 – 702.
[4] S. I. Zones, S. J. Hwang, S. Elomari, I. Ogino, M. E. Davis, A. W. Burton, C. R. Chim. 2005, 8, 267 – 282.
[5] J. L. Paillaud, B. Harbuzaru, J. Patarin, N. Bats, Science 2004, 304(5673), 990 – 992.
[6] K. G. Strohmaier, D. E. Vaughan, J. Am. Chem. Soc. 2003, 125(51), 16035 – 16039.
[7] R. Millini, C. Perego, L. Carluccio, G. Bellussi, D. E. Cox, B. J. Campbell, A. K. Cheetham, Proceedings of the 12th International Zeolite Conference, Baltimore, July 5 – 10, 1998 (Meeting Date 1998), 1999, 541 – 549.
[8] A. Corma, F. Rey, J. Rius, M. J. Sabatier, S. Valencia, Nature 2004, 431, 287 – 290.
[9] A. Corma, M. J. Díaz-Cabañas, F. Rey, S. Nicolopoulus, K. Boulahya, Chem. Commun. 2004, 12, 1356 – 1357.
[10] C. S. Cundy, P. A. Cox, Micropor. Mesopor. Mat. 2005, 82, 1 – 78.
[11] A. Corma, V. Fornes, U. Diaz, Chem. Commun. 2001, 24, 2642 – 2643.
[12] M. Moliner, J. M. Serra, A. Corma, E. Argente, S. Valero, V. Botti, Micropor. Mesopor. Mat. 2005, 78, 73 – 81.
[13] O. B. Vistad, D. E. Akporiaye, K. Mejland, R. Wendelbo, A. Karlsson, M. Plassen, K. P. Lillerud, Stud. Surf. Sci. Catal. 2004, 154, 731 – 738.
[14] A. Cantín, A. Corma, M. J. Díaz-Cabañas, J. L. Jordá, M. Moliner, J. Am. Chem. Soc. 2006, 128, 4216 – 4217.
[15] L. A. Baumes, D. Farruseng, M. Lengliz, C. Mirodatos, QSAR Comb. Sci. 2004, 29, 767 – 778.
[16] A. Corma, J. M. Serra, E. Argente, S. Valero, V. Botti, Chem. Phys. Chem. 2002, 3, 939 – 945.
[17] M. Holena, M. Baerns, Catal. Today 2003, 81, 485 – 494.
[18] C. Klanner, D. Farrusseng, L. A. Baumes, C. Mirodatos, F. Schüth, Angew. Chem. Int. Ed. 2004, 43, 5347 – 5349.
[19] J. M. Serra, L. A. Baumes, M. Moliner, P. Serna, A. Corma, Comb. Chem. High Throughput Screen. 2006 (submitted).
[20] L. A. Baumes, J. M. Serra, P. Serna, A. Corma, J. Comb. Chem. 2006, 8, 583 – 596.
[21] D. Nicolaides, QSAR Comb. Sci. 2005, 24, 15 – 21.
[22] M. M. Gardner, J. N. Cawse, in: J. M. Cawse (Ed.), Experimental Design for Combinatorial and High Throughput Materials Development, John Wiley & Sons, Hoboken, New Jersey, 2003, pp. 129 – 145.
[23] L. A. Baumes, J. Comb. Chem. 2006, 8, 304 – 313.
[24] A. Corma, M. J. Díaz-Cabañas, J. Martínez-Triguero, F. Rey, J. Rius, Nature 2002, 418, 514 – 517.
[25] C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
[26] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan Publishing, New York, 1994.
[27] P. J. Green, B. W. Silverman, Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, Chapman & Hall, New York, 1994.
[28] A. J. Dobson, An Introduction to Generalized Linear Models, Chapman & Hall, New York, 1990.
[29] J. Stevens, Applied Multivariate Statistics for the Social Sciences, Erlbaum, Hillsdale, NJ, 1986.
[30] M. S. Younger, A First Course in Linear Regression, 2nd ed., Duxbury Press, Boston, 1985.
[31] V. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin, Germany, 1995.
[32] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees, Wadsworth & Brooks/Cole Advanced Books & Software, Monterey, CA, 1984.
[33] T. Blasco, A. Corma, M. J. Diaz-Cabanas, F. Rey, J. Rius, G. Sastre, J. A. Vidal-Moya, J. Am. Chem. Soc. 2004, 126, 13414 – 13423.

