FULL PAPER DOI: 10.1002/chem.2008(……))
Design of a Full-Profile Matching Solution for High-Throughput Analysis of MultiPhases Samples Through Powder X-Ray Diffraction Laurent A. Baumes*, Manuel Moliner and Avelino Corma*[a] Abstract: Few solutions that aim at identifying crystalline materials from the analysis of powder X-ray diffraction (XRD) data have been reported so far. A careful inspection of the latter and the corresponding highlight of specific failures when used for the discrimination of crystallographic phases of zeolites among mixtures, has allowed the creation of the recently proposed strategy called Adaptable Time Warping (ATW). Here, the design process is thoroughly detailed in a step by step manner allowing a deep understanding of the motivations,
improvements, and the resulting remarkable properties of our methodology. As the use of highthroughput techniques for the discovery or the synthesis space enlargement of new microporous crystalline structures makes the trustworthiness of searchmatch methods a critical point to be assessed, a meticulous evaluation of the reliability and the robustness is provided and supported by both empirical comparisons and mathematical proofs. The results offered by our methodology, as it clearly outperforms the wellestablished solutions, open the way
Introduction The knowledge of the structure at atomistic/molecular level of a material is required for any advanced understanding of its performance. In order to have access to the material micro-structure, numerous generic characterization tools exist, each of them providing very specific information either on the structure or the chemical properties, but they are time-consuming. While most of the methods are performed ex-situ, in-situ characterization techniques are developing rapidly, with a major effort for parallelization and high-throughput[1] (HT). Nevertheless HT characterization allows saving “machine” time, the corresponding high volume of data generated requires an automatic data analysis system to avoid slowing down the whole process. With respect to the targeted information, most of the responses take the form of series where the measurements follow an ordered variable; for example, temperature (thermogravimetry analysis e.g. TGA, temperature programmed
[a]
Dr. L. A. Baumes, Dr. M. Moliner, Dr. Prof. A. Corma Instituto de Tecnologia Quimica CSIC-Universidad Politecnica de Valencia UPV, Av. de los Naranjos, s/n E-46022 Spain Fax: (+34) 963879444 E-mail: {baumesl; mmoliner; acorma}@itq.upv.es Supporting information for this article is available on the WWW under http://www.chemeurj.org/
toward a total automation of such a routine procedure, eliminating laborious and time-consuming controls, preliminary treatments and settings. Consequently, the proposed solution is of great interest and appears to be very promising considering the numerous potential applications of X-ray diffraction in material science, but also the possible extent to several other characterization techniques. Keywords: x-ray diffraction · full profile · warping distance · mixture · high-throughput
desorption-reduction-oxidation, e.g. TPD, TPR, TPO), the time (gas chromatography, e.g. GC), the wave length (photoluminescence, infra-red, e.g. IR, ultraviolet-visible adsorption, e.g. UV-vis), the voltage (voltammetry), or the angles (x-ray diffraction, e.g. XRD), etc. Here, we focus on powder diffraction data to be automatically labeled by their corresponding phases (single or mixture). Moreover, such analysis is intended not only to identify crystalline materials by comparing diffraction data against a database, but also to detect the formation of an unknown phase. Few solutions[2,3,4] that aim at extracting phases’ ID directly from powder X-ray diffraction data have been reported so far. Despite the excellent results[5,6] reported for these techniques, their application on mixtures of zeolite crystallographic phases gave us motivations to further investigate their reliability and robustness. A careful inspection of the latter and the corresponding highlight of specific/representative failures brought us to create a new technique called Adaptable Time Warping (ATW).[7] Here, the design process of the recently proposed ATW strategy is thoroughly detailed in a step by step manner. This allows a deep understanding of the motivations, improvements, and the resulting remarkable properties of the methodology. Three different studies and the corresponding results using the commercial software PolySnap2©[3,4] from Bruker-AXS are presented in the first part of thus report. Then, a close examination of the responses will allow to point and understand the weaknesses of such an approach for the type of problems under study. This will be illustrated by the selection of few representative mistakes for
1
which some corrective adaptations of the methodology are described in the third section. It is shown how such modifications can be merged together making the so-called Adaptable Time Warping a fundamentally different, powerful and innovative approach. Its evaluation on numerous benchmarks allows supporting empirically the mathematical proofs of ATW robustness and reliability property. Finally, results using ATW for the same cases will show the clear superiority of this methodology with respect to those previously reported.
Figure 1. The automatic analysis of the 192 X-Ray diffraction data of the discovery program of the zeolite ITQ-33 is expected to identify the different crystallographic phases present in each sample, allowing the creation of the corresponding phase diagram (8 different phases, numerous mixtures, and among them the presence of a new material has to be established).
Identification of Zeolite Crystallographic Phases Zeolites are crystalline alumino-silicates, whose structures are 3D networks composed of channels and cavities of molecular dimensions. Their microporous framework structures, the wide range of chemical composition and surface acidity, and the possibility of tuning their electric fields are key factors that render zeolites versatile materials for an increasing number of applications.[8]
Among them, is the use of zeolites as catalysts (for the petrochemical industry, for pollution control and for the synthesis of special chemicals), for gas adsorption and separation, electronics and medical uses.[9] The discovery of new structures or enlarging the synthesis space, and optimization of existing ones require a considerable experimental effort. This can be diminished by using high throughput (HT) technology (usually parallelization and miniaturization)[10] as also done for heterogeneous catalysts.[11] As a recent example, this technique has allowed the synthesis of a very unique zeolite structure that includes extra-large 18MR connected with medium 10MR pores (ITQ-33).[12] The unusual conditions
2
“ITQ-21/30” - The gel composition is explored by varying the following molar ratios: Al/(Si+Ge), MSPT/(Si+Ge), F-/(Si+Ge) and Si/Ge. A factorial experimental design (4 x 32 x 22 =144) is selected for studying simultaneously the crystallization time and the composition of the gel varying the molar ratios of the components. Two zeolites are found in the area of the phase diagram studied: ITQ-21 and ITQ-30. “ITQ-33” - Hexamethonium is used as structure directing agent (SDA). An initial experimental factorial design (3×43) is selected. Si/Ge, TIII/(Si+Ge), OH-/(Si+Ge), and H2O/(Si+Ge) are the synthesis variables. This experimental design considers the following four molar ratios (level): Si/Ge (4) ranging from 2 to 30; B/(Si+Ge) (4) from 0 to 0.05; OH-/(Si+Ge) (3) from 0.1 to 0.5; and H2O/(Si+Ge) (4) from 5 to 30. The number of synthesized samples is 192. Hexamethonium is a promising structure directing agent (SDA), since this molecule has a high number of degrees of freedom. Such flexibility allows different conformations that stabilize several competing structures, such as EU-1, ITQ-17, ITQ-22, ITQ-24, SSZ31, a lamellar phase, and the new structure, ITQ-33.
Table 1. Confusion matrix with the real phases in the experimental design “Beta study” versus predicted classes obtained with PolySnap2 indicated at the top of the cell, and versus ATW indicated at the bottom of the cell (in italic and between brackets).
Am.
UDM
9
2
Growing Mixture
Mixture
Growing Beta
1
Beta
75 (76)
Growing UDM
UDM
Real phases
Amorphous
“Zeolite β” - In order to optimise the preparation conditions for the synthesis of beta zeolite, a factorial design (32×42) is employed for exploring different molar gel composition. The molar gel composition is explored by varying the following molar ratios: Na/(Si+Al), TEA/(Si+Al), OH/(Si+Al) and H2O/(Si+Al), while Si/Al ratio is fixed. The experimental design considers the following four molar ratios (level): Na/(Si+Al) (3) ranging from 0 to 0.5; TEA/(Si+Al) (4) from 0.1 to 0.6; OH-/(Si+Al) (4) from 0.15 to 0.52; and H2O/(Si+Al) (3) from 5 to 15. The total number of samples synthesised considering this factorial design is 144. The materials obtained in the studied area are amorphous materials, zeolite beta, and another dense material (named as UDM).
When PolySnap2© is employed for the studies “ITQ-33”, “Zeolite β”, and “ITQ-21/30” the overall recognition rates are 68.2%, 84.3%, 86.11% and respectively. Despite the relative easiness of β study and ITQ-21/30, i.e. only three different phases (including amorphous) are present with their corresponding mixtures, 15.7% and 13.89% of errors are respectively obtained. Among them, the majority corresponds to the fact that partially crystalline materials are not correctly identified. The same observation can be done for “ITQ-33” study. This makes the number of false negative, i.e. the presence of a given crystallographic phase which is not detected, relatively high, see the first line of the Tables 1 and 2. Confusion matrix for “ITQ-21/30” study is given in Supplementary information as Table S2. Among them, growing phases represent the principal source of misidentifications, and it can be observed that mixtures of pure phases (without the presence of amorphous) are also poorly recognized.
Predicted phases
needed to synthesize this new material were found by using HT techniques to cover a wide range of possible synthesis conditions. That study required the generation of 192 diffractograms to follow the formation of the different crystallographic phases and mixtures of phases depending on the synthesis conditions, and from those mixtures of phases the presence of a new zeolite structure had to be established, see Figure 1. To achieve this is not obvious when complex mixtures of several crystalline phases exist in the same samples and consequently, the identification of the phases is time consuming. At the same time when increasing the amount of experiments to cover a broader synthesis range, it becomes mandatory to develop an automatic data treatment not to slow down the whole process.[13] Therefore an important step in this direction is the automatic classification of crystalline structures through XRD data, as it has been previously pointed by Takeushi et al.[6] For doing this PolySnap2©[3] has been selected as a reference because the software is representative and highly relevant considering the current state of the art in HT identification of phases through full profile examination. Detailed explanations are available, while merging various techniques: principle component analysis (PCA), multi-dimensional scaling (MDS), cluster analysis,[14] and statistical tests. As PolySnap2© offers an automatic analysis where user interface is minimized to few options (“allows x-shift” and “check for amorphous” have been selected), this mode is chosen for comparison. Three datasets of experiments have been selected: “Zeolite β”,[10b] “ITQ-21/30”,[10a,10d] and “ITQ-33”[12].
2
(1)
89 (77)
11
2
13
(12)
(1)
(13)
Grow
0
0
UDM
(9)
(9) 27
Beta
2
29
(28)
(28)
Growing
0
0
0
Beta
(1)
(2)
(3)
Mixture
1
2
0
3
(1)
(1)
Growing
0
0
0
Mixture
(1)
(2)
(3)
2
2
84.3%
76
12
11
29
2
(97%)
Considering the selected applications, an adapted methodology is still lacking taking into account that one specific structure can present large differences in peak intensity and diffraction angle within the XRD diffractogram, depending on the level of crystallinity, crystallite size, and chemical composition. Subsequently, the synthesized powder which presents a mixture of phases and impurities makes the automatic determination of crystalline phases through X-ray diffraction data a real challenge. A closer look on mistakes should allow understanding why the methodology is failing. Therefore, some representative failures are extracted from the above treatments, see Figures 1 to 3. PolySnap2© operations[4] can be summarized as follows: i) each input sample is checked against a database of known phases, ii) the criterion based on the Pearson and Spearman correlation
3
coefficients is calculated taking into account all measured data points (full profile), iii) if the pattern correlations do not indicate a good match, then the percentage composition of the sample is determined, i.e. quantitative analysis, using the singular value decomposition (SVD) for matrix inversion, and all the known phases as potential elements of the composition.
Predicted phases Amorphous
ITQ-22
ITQ-24
EU-1
SSZ-31
Lamellar
ITQ-17
Lamellar + 24
24 + 33
17 + 24
22 + 24
22 + 24 + 33
Other mixtures
55 (55)
Am.
36 (36)
2
ITQ-22
2
3
ITQ-24
2
EU-1
3
SSZ-31
28
Lam.
55 (2)
38
8 (17)
6 (2)[a]
19
1 (3)
3 13 (15)
(1)[b]
16
18 (46)
46
Lam. + 24
2
2
1
24 + 33 17 + 24
1
22 + 24
1
1
0 (8)
1
2
0 (2)
2
2
0 (2)
3
2
0 (2)
8
Real phases
ITQ-17
0 (0)
0
22 + 24 + 33
1
0 (1)
1
95 (54)
40 (36)
10 (17)
1 (2)
13 (14)
19 (46)
1 (0)
0 (8)
0 (2)
0 (2)
0 (4)
0 (1)
13 (3)
131/192=68.2% (187/192=97.4%)
Table 2. Confusion matrix with the real phases in the experimental design “ITQ-33” versus predicted classes obtained with PolySnap2 indicated on the left hand-side of the cell, and versus ATW indicated on the right hand-side of the cell (in italic and between brackets). [a]
Figures 2-a ,3-b, and 2-c highlight a first problem inherent to methodologies for which a statistical criterion based on a correlation coefficient is employed to measure the strength (and direction) of the relationship between diffractograms. As highest peaks get the greatest influence on the variation of the criteria, the local overlapping of diffractograms considering the reflections with high intensities, such as for ITQ-22 and ITQ-24 at 11° and between 20° and 23°, see Figure 2-a, can mislead the methodology when used in the case of ITQ-33. The same confusion appears for the sample containing both zeolite Beta and UDM in Figure 3-b. In this case, smaller peaks belonging to UDM are clearly visible, but their influence remains relatively too low compared to the main phase. This problem is tackled in PolySnap2© as it integrates various techniques and criteria in order to better handle such kind of complications. Gilmore et al. define a general criterion as the linear combination (half-half for the automatic mode) of Pearson and Spearman coefficients.[3] Spearman’s coefficient is a non-parametric measure of correlation and a special case of Pearson productmoment as the data is first converted to ranking before calculating the coefficient. This has the advantage to decrease the influence of the nominal peaks height. However, the criterion is not strong enough by itself. When combined with Pearson coefficient, it simply behaves as an indicator for further user attention.
Figure 2. Predicted errors by using PolySnap2. a) The real phase is ITQ-22 (#19), being the software prediction mixture ITQ-22 with ITQ-24, b) it is growing the Lamellar phase, being the software prediction “amorphous”, and c) it predicts pure ITQ17 (#217), being a mixture between ITQ-17 and ITQ-24.
ITQ-24 + amorphous. [b] SSZ-31 + amorphous.
4
Similarly, Figures 2-b and 3-a point the problem of detecting a growing phase in presence of amorphous. As explained in Ref[3], if the ratio of non-background to background intensity falls below a default limit of 3% and if, additionally, there are less than a default value of 3 peaks, the sample is flagged as non-crystalline. Here, the conception of the methodology consisting in an a priori separation of the different situations, i.e. one pure phase-mixture of phasesamorphous sample, through thresholds explains such errors. Another intricacy encountered during the analysis of “ITQ-33” data is represented in Figure 4. It can be observed how the increasing integration of Germanium modifies the dimension of the unit cell due to [Ge-O] longer bond length than [Si-O], and consequently shifts the diffractograms along x-axis. The difficulty of such situation arises from the fact that the shifting is not constant along the entire angle range as emphasized in Figure 4-b. This has a direct influence on the criteria and thus on the identification as most techniques use the Euclidian distance at the basis of their calculation, i.e. the differences of intensity are calculated at the same angle positions. The resulting is that most of these diffractograms are interpreted as mixtures of ITQ-24 and another phase, i.e. false positive.
Figure 3. Predicted errors by using PolySnap2. a) Growing phases, UDM (#21) and Beta (#45), being the software prediction “amorphous”. b) Mixtures between Beta and UDM, being the software prediction “Beta” phase.
Figure 4. XRD patterns for ITQ-24 zeolites obtained in the “ITQ-33 study” with different Si/Ge ratio a) full pattern, and b) small angle range (19-24º).
To summarize, when facing difficult problems where samples present large differences in the intensity of peaks and diffraction angles, the traditional methodologies obviously fail. As basic statistical correlation coefficients allow the identification of pure phases in most of the cases, except for the latter, the real strength and need of new strategies is expected where the current methodologies does not succeed or simply warns for further user attention. Such tricky situations can be divided into 5 groups: i) the treatment of the highest peaks which either mislead the crystallographic phase assignment due to local overlapping, Figure 2-a, or hide the formation of competitive phases due to their relatively big weight into the criterion, Figure 2-c; ii) the simultaneous occurrence of small peaks representative of a given growing phase does not take enough importance among the evident presence of another phase, Figure 3-b; iii) the treatment of partially but predominantly amorphous samples, Figures 2-b and 3-a, is not accurate enough leading to numerous false negatives; iv) similarly, the shifting of diffractograms along angle axis, Figure 4, undoubtedly appears as one of the limitation of Euclidian-based criteria leading to uncertainty; iv) finally the effective detection of the presence of a new phase, or zeolite structure, which remains one of the most required feature of the strategy still needs improvements. The three samples containing the growing ITQ-33 zeolite have been classified as mixtures of other pure phases, Table 2. Despite the fact that other full-profile techniques are available either in PolySnap2© or from other sources, their overall conception remains very similar and always refers to statistical criteria which determine the similarity between the diffractograms. Such similarity is used either for a direct labelling of the samples as shown before, or for the visual representation of numerous XRD in a way to provide a quick idea of how diffractograms may be grouped,[2,14] such as principle component analysis (PCA), multi-dimensional scaling (MDS), cluster analysis. However, these traditional algorithms initially created for multi-dimensional data are not adequate when series are considered. The reason for this is mainly due to the lack of knowledge on the intrinsic ordering data particularity. Most of them are similarity/dissimilarity-based approaches, where the similarity between instances is usually explicitly expressed by the classical p-norm distance, e.g. Euclidian. Numerous algorithms are available, but only few distances have been proposed for the special treatment of series and, among them, the Dynamic Time Warping[15] (DTW) appears as the most famous one. As warping distances have never been integrated into searchmatch approaches, we propose this first modification to tackle the problem related to shifting. On the other hand, even though all points are considered in full-profile search-match methodologies, the whole available information is not fully used. Before analyzing unknown samples, the selection of potential pure phases can be a priori analyzed, the information being contained in these patterns should give precious indications on how the different phases can be identified taking into account the presence of other ones. Such kind of analysis can be handled by machine learning (ML) techniques. The aim is to localize strategic traits/features of each individual phase which makes them unique taking into account the competitive phases. Moreover, additional information can be inserted before launching the automatic phase recognition, such as previous experiments, i.e. the use of internal databases, o theoretical diffraction data simulating the formation of unsynthesized mixtures or the integration of new elements into the crystal network, such as Germanium considering ITQ-33 dataset.
5
In the following section, details about these modifications are given and we show how both have been combined showing our methodology its unique strength and reliability/robustness characteristics.
Corrective actions The treatment of naturally ordered (i.e. time, temperature, etc…) sequences represents an actual challenge in machine learning (ML) and data mining research due to the specific structure of the data. Most classical algorithms initially created for multi-dimensional data are not efficient when series are considered. The reason why their performances suffers is mainly due to the lack of knowledge of the intrinsic ordering data particularity. The lack of adapted methodologies has induced a new field of investigation called temporal data mining[16] which includes association rules,[17] indexing for large databases,[18] feature mining,[19] the discovery of recurrent or surprising motifs,[20] classification[21] and clustering.[22] All the works named above use a distance measure between instances but the effort is not focused on the elaboration of a suitable criterion considering the problem to be handled but rather on the algorithms making use of the distances. Due to the relative importance/impact of the similarity expression, even very sophisticated approaches may failed, and the lack of stability when facing various problems remains their principal drawback. Among the whole set of distances applied for the treatment of series, the Dynamic Time Warping[15] (DTW) appears as the most famous one. The operation of non-warping distances over a pair of series consists in matching each point from one time series with the point from the second one that occurs at the same time. The easiest nonwarping distance is the 1-norm distance 1- norm ( A, B ) = ak − bk with A = {a1 ,⋯ , at } and B = {b1 ,⋯ , bt } . The main drawback of non-warping distances is the impossibility to manage time axis gaps. For example, two time series with the same shapes that do not occur concurrently on the time axis obtain a relatively high 1-norm distance. Depending on the problem in hand, i.e. the characteristics that group instances, such similarity measure may be unintuitive and the obtained classification may result significantly disturbed,[23] as demonstrated earlier. On the other hand, warping distances aim at managing temporal shift; they allow matching two time series by computing distance between points that do not occur at the same moment without rearranging the sequence of the elements, see Figure 5. The final sequence of pairs of points that are matched together is called the warping path W ( A, B ) = w1 ,⋯ , wk ,⋯ , wK with wk = (ai , b j ) k and i, j ∈ [1..t ] . wk is the kth pair of the warping path. wk is composed of the ith point of the series A and the jth point of B. K, i.e. the length of the warping path, can be superior or equal to the length t of the series involved. The warping path W(A, B) = (a1, b1), …, (ai, bi),…, (at, bt), i.e. ∀ {k , i, j} , k = i = j corresponds to the 1-norm value. It may be pointed that one point can be totally ignored during the distance computation.[24] Due to combinatorial considerations but also due to the importance of the meaning of the matching, a warping window noted δ is employed. A point that occurs at instant t can be matched with points from the other series that occur in ( i − δ ) .. ( i + δ ) . δ is generally set as constant value. The Dynamic Time Warping[15,25] (DTW) compares two time series together by allowing a given point to be matched with one or several points, see Figure 5. Let’s defined A'i = a 1 ,… , ai and B 'i = b1 ,… , b j with 1 ≤ i, j ≤ t , A'i and B ' j being sub-sequences of A and B starting at 1 and finishing respectively at i and j. We note γD(A’i, B’j) the recursive function
that defines the distance between the two sub-sequences. DTW is defined following by the recursive Equation 1. The computation of DTW needs a dynamic programming approach consisting in creating a t×t matrix, see Figure 6. Inside the cell (i, j) of the matrix the value |ai – bj| is stored. After the cells are filled, the search for the best warping path begins. It is the path beginning in position (1,1) and finishing in (t, t) that minimizes the sum of the cells it goes through, see Figure 6. Table 3. Example of two series A and B
t
1
2
3
4
5
6
7
8
A
1
2
4
3
1
1
2
1
B
2
1
1
2
4
3
1
1
∑
(
)
(
)
Figure 5. The two series A (square) and B (triangle) with the matching using DTW and δ=±2. (A3,B5) are matched together, i.e. they belong to the warping path, as indicated by the grey line between the points. The distance which is retained corresponds to the black bold vertical line.
Figure 6. The matrix on the right hand side is filled with the values corresponding to the DTW recursive function (Eq.1). For i=3 and j=3, A3-B3=4-1=3, and the minimum of the cells {(Ai-1,Bj);(Ai,Bj-1); (Ai-1,Bj-1)}=2. The best warping path which minimizes the sum of the cell it goes through is displayed in the matrix on the left with dark cells. It can be noticed that there is not always a unique best warping path, (A6,B7) can be substituted by (A6, B8).
DTW ( A, B) = γ D ( Am , Bn ) with
(Equation 1)
γ D ( Ai , B j ) = Ai − B j + min {γ ( Ai −1 , B j ), γ ( Ai , B j −1 ), γ ( Ai −1 , B j −1 )} As mentioned earlier, we expect this first modification to efficiently handle the problem related to shifting of diffractograms. However, it must be merged with the integration of additional knowledge such as preliminary inspection of potential
6
crystallographic phases, previous experiments, and theoretical calculations. The aim of analyzing additional data is to modify the similarity between diffractograms in order to improve recognition of mixtures, with or without the presence of amorphous. A way to achieve this is to modify the influence of the intensities for each angle position. Since the matrix which is employed for DTW calculation is a squared matrix of dimension t, i.e. the total number of angles steps, it can be used to integrate weights. A possible way to correctly assign the values of each weight is to “learn” how intensities must be interpreted or modified in order to get a correct answer from the labelling procedure. Due to the large number of angle positions, a machine learning (ML) approach must be employed. By presenting various cases (better if complicated) either from past experiments or theoretical calculations, we expect the setting of weights to improve the classification results.
classification function which relates a new case to the learned cases. IBL approaches are chosen since the similarity function that is employed plays a central role as stressed by the definition given earlier. The following are the most common IBL methods: k-nearest neighbors (k-nn), locally weighted regression, and radial basis function (RBF). k-nn algorithm is selected for simplicity since it is the most basic of all IBL methods, but also because the aim is to stress on the proposed similarity measure and not on the algorithm which uses it. Moreover, superior performances may be easily obtained as k-nn algorithm can be refined by weighting the contribution of each of the k neighbors according to their distance to the query point giving greater weight to closer neighbors such as for RBF. One very important trait of the methodology is its independence from the classification algorithm, i.e. others could be employed as well.
Our methodology Adaptable Time Warping (ATW) is designed, as all warping (including the famous DTW), and p-norm distances are particular cases of ATW. Let Π a t×t matrix, be the set of + parameters required to compute ATW, Π = π i j ∈ ℝ ∀i, j ∈ [1, t ] . If πij=∞, ai is not allowed to be matched with bj. γA(A’i, B’j, πij) is the recursive function that computes the distance between A’i and B’j. ATW is defined by Equation 2.
At this level of the design, it remains defining the procedure for weight assignment. Since the aim is to improve labelling results, the best matrix Π is defined as the matrix that minimizes the classification error rate with ATW and the chosen learning strategy, see Figure 7. The “brute force” solution would perform the classification with all the possible values of Π, and finally keeps the one that corresponds to the best error rate. Each possible matrix Π must be tested for obtaining the error rate. However, the resulting complexity in time of the exhaustive search is intractable. This implies the use of a heuristic which aims at quickly finding a solution, i.e. obtaining a locally optimal matrix Π. Genetic algorithms (GAs) are selected because there is a body of theory providing theoretical evidence to complement the empirical proof of their robustness. Such techniques are now consistently used. The reader is referred to Ref[26] for an introduction and advanced discussions.
(
ATW ( A, B, Π ) = γ A A't , Bt' , π t t
(
with γ A A'i , B 'j , π i j
)
(Equation 2)
)
∞ if π i j = ∞ 0, if i = j = 1 ' ' γ A Ai , B j −1 , π i ( j −1) , if i = 1 and j > 1 ' ' γ A Ai −1 , B j , π (i −1) j , if i > 1 and j = 1 = ai − b j × π i j + γ A A'i −1 , B 'j , π (i −1) j ' ' , else min γ A Ai , B j −1 , π i ( j −1) ' ' γ A Ai −1 , B j −1 , π (i −1) ( j −1) π 11 … π 1t with Π = ⋮ ⋱ ⋮ π i j ∈ ℝ + , ∀i, j ∈ [1, t ] π ⋯ π tt t1
( (
) )
( ( (
) )
)
Considering a set of n time series and a given classification algorithm, we expect optimal settings for ATW distance considering the given problem and related data. Two different algorithms are required: a classification technique, and another one for the automatic weight setting. As an example and in order not to make the methodology too complex, the instance-based learning (IBL) method called k-nearest-neighbors (k-nn) is chosen as classification strategy. An IBL approach also called memory-based or case-based learning is defined as the generalizing of a new instance to be classified from the stored training examples. Training examples are processed when a new instance arrives. Each time a new query instance is encountered, its relationship to the previously stored examples is examined to assign a target function value for the new instance. Three characteristics may define IBL algorithms: i) A similarity function which tells the algorithm how close together two instances are. ii) A “typical instance” selection function: this tells the algorithm which of the instances to keep as examples. iii) A
Figure 7. ATW conception. Based on classification errors of training samples, the ∏ matrix is modified until reaching a given termination criterion, i.e. the convergence to a minimum. Finally the best matrix is used for unseen test samples.
Let X be a set of n time series, X = {x1,…,xn), and Π a population of p individuals {Π1,…, Πp). Each individual is a candidate matrix Πi that owns t² variables. The terms “individuals”, “population”, etc… are used following the GA terminology. Each cell of the matrix is a variable that the GA optimizes. The GA (i.e. all individuals Πi) is initialized by assigning random values to every cell πij. The evaluation, then the selection, and finally the mutation/recombination are applied to each generation. The details about the use of genetic algorithm as optimization tool are explained in supplementary information. Our approach is a learning process which uses a heuristic, and allows reaching near-optimal solutions. According to a given classification problem, ATW approach adapts itself for better capturing data specificity. ATW can be associated to both warping and non-warping distances as demonstrated by the proof given in supplementary material. Therefore, whatever the temporal classification problem, optimal parameters Π imply ATW
7
to give results at least equal or superior to other distances, and consequently ATW is always at least equivalent or superior to all other distance use. In the following section, results for 20 benchmarks[27] and the three real datasets of zeolite investigations are given, confirming empirically the mathematical proof.
same phase. Therefore, one advantage of ATW is to consider peak position shifts, which are common for zeolites with different Si/Ge ratio. For this reason, the dataset “ITQ-33” is an excellent proof to check the ability of the ATW to classify such materials. There are eight different materials (amorphous, ITQ-17, ITQ-22, ITQ-24, EU1, Lamellar phase, SSZ-31, and ITQ-33) with numerous mixtures between them. The ratio Si/Ge is also varied in the study getting variations in the peaks positions. The last challenge that shows the experimental design is the discovering of a new crystalline phase, ITQ-33, in a very narrow condition range. In Figure 1 it can be observed the occurrence of each competing phase as a function of the composition of the starting gel. Despite of the non-linearity of the system, it is clearly defined the range of composition in which a specific phase is formed. However, there are some narrow ranges of composition in which two or more phases are competing. Zeolite ITQ-33 was growing with ITQ-24 in an specific area of the phase diagram. The conditions for ITQ-33 synthesis are extremely unusual, due to the low OH-/(Si+Ge) ratio in a concentrated gel H2O/(Si+Ge) = 5, in the presence of a trivalent element ((Si+Ge)/B = 20 or 50) and germanium (Si/Ge = 2). When some variable was changed, the ITQ-33 phase disappears and ITQ-22 or ITQ-24 are formed. When ATW is applied, the classification error is below 4% (see Table 2), considering all the mixtures obtained in the experimental design.
Results and Discussion The k-nn algorithm is employed with k=3. The calculation of the error rate on the learning dataset is done using k’-CV with k’=5. The lowest error rate on the validation set (i.e. the winner matrix) is kept for comparison with other techniques which have been evaluated with the same validation set. The number of iteration (r) for ATW matrices optimization is set to 50, and the population size (p) is set to 4. pmut and q are respectively set to {0.1; 5}. Figure 8 and Table S1 in supplementary material shows that ATW always constitutes the lower bound of error rates when is compared to both DTW and Euclidian distance use on the 20 different benchmarks.[27] ATW is superior to DTW (either with the best delta value or without warping window) and the Euclidian distance for 7 benchmarks (see bold values). 0.6
Euclide DTW (no warping window)
0.5
DTWδ(best d value) ATW
0.4 0.3 0.2 0.1
Figure 8. Evaluation of ATW of 20 different benchmarks.[27] For all the cases, ATW reaches the lowest classification error indicated on the y-axis.
Considering “zeolite β” dataset, DTW performs nearly two times better than the Euclidian distance but here again ATW allows to reach a better performance. For “ITQ-21/30”, DTW and Euclidian distances perform equally while ATW decreases the error rate by half. In those datasets there are only few different classes with low number of mixtures between them. However, in the zeolite synthesis, it is normal to crystallise a high quantity of different materials, with mixtures of two or more different zeolites. The ATW methodology can not only predict the pure phase but also the mixture ordered by decreasing degree of crystallization of each one. For doing that, the algorithm compares to next neighbours getting the distance to them, and the output phase is ordered depending on that distance. Moreover, depending on the crystal composition, it is possible to obtain some shifts in the peak positions, although it is the
OliveOil
Coffee
Beef
Fish
Yoga
Adiac
ECG
Lightning-7
Lightning-2
Face (four)
Wafer
Two Patterns
Trace
50Words
Swedish Leaf
OSU Leaf
Face (all)
CBF
Gun-Point
Synthetic Control
0
Moreover, if the confusion matrix is studied in detail, one can see that 2 of the 5 errors are due to the fact that the algorithm predicts also the amorphous materials that are contained in these samples, being more accurate than our manual classification. Amorphous material is considered as a regular phase, even when the amorphous content is “impurity”. Two other errors come from two predicted mixtures ITQ-22/ITQ-24 while the real phases are pure ITQ-22. In fact, ITQ-24 was in the limit of significance, being a most minority phase, and the SVD procedure assigns less than 3%.
Conclusion ATW is a new time series distance that can be adapted according to the classification issue, using a genetic algorithm. GA components (chromosomes and operators) have been adapted for the task in-hand. Such hybridization is shown of great interest for the classification goal while the adaptation of the GA to the problem
8
form allows applying successfully the new strategy to complex datasets. Besides its obvious identification superiority, a large number of advantages are inherent to ATW such as the following: there is no background removal, no peak extraction, no smoothing, no interpolation, no region exclusion from the user, no pre-defined shifts, no false negative, the best δ is automatically found, it handles properly amorphous samples, captures phases characteristics while reducing the user interaction to the minimum, and finally encounters new crystallographic phases.
[5]
R. Storey, R. Docherty, P. D. Higginson, C. Dallman, C. J. Gilmore, G. Barr, W. Dong, Cryst. Rev. 2004, 10, 45-56.
[6]
C. J. Long, J. Hattrick-Simpers, M. Murakami, R. C. Srivastava, I. Takeuchi, V. L. Karen, X. Li. Rev. Sci. Instrum. 2007, 78, 072217.
[7]
L. A. Baumes, M. Moliner, A. Corma. ChemEngComm. 2008, 10, 1321-1324
[8]
A. Corma, J. Catal. 2003, 216, 298-312.
[9]
H. Lee, S. I. Zones, M. E. Davis, Nature. 2003, 425, 385-387.
[10]
a) A. Corma, M. J. Díaz-Cabañas, J. Martínez-Triguero, F. Rey, J. Rius, Nature. 2002 418, 514-517. b) O. B. Vistad, D. E. Akporiaye, K. Mejland, R. Wendelbo, A. Karlsson, M. Plassen, K. P. Lillerud, Stud. Surf. Sci. Catal. 2004, 154, 731738. c) A. Cantin, A. Corma, M. J. Diaz-Cabanas, J. L. Jordá, M. Moliner, J. Am. Chem. Soc. 2006, 128, 4216-4217. d) A. Corma, M.J. Diaz- Cabañas, M. Moliner, C. Martinez. J. Catal. 241, 2, 312-318, 2006. e) A.Corma, M. Moliner, J. M. Serra, P. Serna, M. J. Díaz-Cabañas, L. A. Baumes, Chem. Mater. 2006, 18, 3287-3296.
[11]
a) J. M. Serra, A. Corma, D. Farrusseng, L. A. Baumes, C. Mirodatos, C. Flego, C. Perego, Catal. Today, 2003, 81, 425-436; b) C. Klanner, D. Farrusseng, L. A. Baumes, C. Mirodatos, F. Schüth, QSAR & Comb. Sci., 2003, 22, 729-736; c) D. Farrusseng, C. Klanner, L. A. Baumes, M. Lengliz, C. Mirodatos, F. Schüth, QSAR & Comb. Sci. 2005, 24, 78-93; d) F. Schüth, L. A. Baumes, F. Clerc, D. Demuth, D. Farrusseng, J. Llamas-Galilea, C. Klanner, J. Klein, A. MartinezJoaristi, J. Procelewska, M. Saupe, S. Schunk, M. Schwickardi, W. Strehlau, T. Zech, Catal. Today. 2006, 117, 284-290.
[12]
a) A. Corma, M. J. Díaz-Cabañas, J. L. Jordá, C. Martínez, M. Moliner, Nature. 2006, 443, 842–845; b) M. Moliner, M. J. Díaz-Cabañas, V. Fornés, C. Martínez, A. Corma, J. Catal. 2008, 254, 101-109.
Figure 9. Coupling of ATW inside our platform hITeQ with ITQ existing HT tools
ATW clearly appears as a new method for automatically deciphering X-ray diffraction patterns, allowing crystallographic phases to be identified quickly and reliably. As XRD is used for a wide range of purposes, from routine characterization in industrial procedure control through in depth research investigations of the most complex high technology materials, such kind of methodology which can be easily coupled with existing high-throughput synthesis apparatus, see Figure 9, is a real breakthrough. It has been shown how the phase diagram corresponding to a broad research space can be quickly revealed making clearer the way towards new structures.
[13] a) L. A. Baumes, D. Farruseng, M. Lengliz, C. Mirodatos, QSAR & Comb. Sci. 2004, 29, 767-778; b) C. Klanner, D. Farrusseng, L. A. Baumes, M. Lengliz, C. Mirodatos, F. Schüth, Angew. Chem. Int. Ed. 2004, 43, 5347-5349; c) L. A. Baumes, J. M. Serra, P. Serna, A. Corma, J. Comb. Chem. 2006, 8, 583-596; d) L. A. Baumes, J. Comb. Chem. 2006, 8, 304-314; e) J. M. Serra, L. A. Baumes, M. Moliner, P. Serna, A. Corma, Comb. Chem. High Throughput Screen. 2007, 10, 13-24; f) L. A. Baumes, M. Moliner, A. Corma, QSAR & Comb. Sci 2007, 26, 255-272; g) P. Serna, L. A. Baumes, M. Moliner, A. Corma. J. Catal. 2008, 258, 25-34; h) L. A. Baumes, P. Collet, Com. Mat. Sci. 2008, doi:10.1016/j.commatsci.2008.03.051. [14]
G. Barr, W. Dong, C. J. Gilmore, J. Appl. Cryst. 2004, 37, 874-882.
[15]
a) D. J. Berndt, J. Clifford, in Advances in Knowledge Discovery and Data Mining (Ed.: U. M. Fayyad, et al.) , AAAI Press/MIT Press, Menlo Park, CA 1996, pp. 229-248; b) E. Keogh, Data Mining and Machine Learning in Time Series Databases, at the 18th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’04). Seattle, WA, USA, 2004.
Acknowledgements
[16]
EU Commission FP6 (TOPCOMBI Project) is gratefully acknowledged. We also thank Santiago Jimenez for his important collaboration to set the hITeQ platform which integrates all programming codes of the presented methodology. Laurent A. Baumes thanks Stephan Schunk from hte Company for his comments on the technique.
a) W. Lin, M. A. Orgun, G. J. Williams, Australasian Data Mining Workshop, Macquarie Univ. and CSIRO Data Mining, 2002; b) C. M. Antunes, A. L. Oliveira, Workshop on Temporal Data Mining, at the 7th Int. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, 2001.
[17]
G. Das, K. Lin, H. Mannila, G. Renganathan, P. Smyth, Global partial orders from sequential data, at the 4th Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, 1998.
[18]
a) E. Keogh, C. A. Ratanamahatana, J. Knowl. Inf. Syst., 2005, 7, 358–386; b) B. K. Yi, H. V. Jagadish, C. Faloutsos, Proceedings 14th IEEE Int. Conf. on Data Engineering. 1998, pp: 201 – 208.
[19]
N. Lesh, M. J. Zaki, M. Ogihara, IEEE Intell. Syst. App. 2000, 15(2), 48-56.
[20]
a) J. Buhler, M. Tompa, J. Comput. Biol. 2002, 9, 225-242; b) B. Chiu, E. Keogh, S. Lonardi, Probabilistic Discovery of Time Series Motifs, at the 9th Int. Conf. on Knowledge Discovery and Data Mining, Washington, DC, 2003.
[21]
a) J. C. Chappelier, M. Gori, A. Grumbach, in Sequence Learning: Paradigms, Algorithms and Applications (Ed.: R.Sun, G. L. Giles), Springer-Verlag, London, 2001, pp. 105-134; b) D. L. James, R. Miikkulainen, Advances in Neural Processing Systems. 1995, 7, 577-584; c) P. Somervuo, T. Kohonen, Neural Process. Lett. 1999, 10, 151-159.
[22]
a) F. R. Lin, L. S. Hsieh, S. M. Pan, Learning clinical pathway patterns by hidden markov model, at the 38th Annual Hawaii Int. Conf. on System Sciences (HICSS'05), Big Island, Hawaii, USA, 2005. b) J. Lin, M. Vlachos, E. Keogh, D. Gunopulos, Iterative incremental clustering of time series, at the IX Conf. on Extending Database Technology (EDBT 2004), Crete, Greece, 2004.
[1]
[2]
a) M. Fransen, Am. Lab. 2002, 34, 42-49; b) R. Meier, M. Dirken, Int. Cement & Lime Journal. 2002, 1, 44-47. a) I. Takeuchi, C. J. Long, O. O. Famodu, M. Murakami, J. Hattrick-Simpers, G. W. Rubloff, M. Stukowski, K. Rajan. Rev. Sci. Instrum. 2005, 76, 062223; b) X'Pert HighScore Plus software from PANalytical, see website visited Nov. 20th 2008, http://www.panalytical.com/index.cfm?pid=547; c) D. L. Bish, S. A. Howard, J Appl Crystallogr. 1988, 21, 86-91; d) D. L. Bish, S. J. Chipera, in Advances in X-ray Analysis (Ed.: C. Barrett, et al.), Plenum Press, New York 1988, 31, pp. 295-308; e) D. L. Bish, S. J. Chipera, in Advances in X-ray Analysis (Ed.: P. Predecki, et al.), Plenum Press, New York 1995, 38, pp. 4757; f) S. J. Chipera, D. L. Bish, Powder Diffr. 1995, 10, 47-55; g) F. H. Chung, J. Appl. Crystallogr. 1974, 7, 519-525; h) RockJock – Uses Microsoft Excel Macros and the Solver function to perform a whole-pattern modified Rietveldtype refinement to perform quantitative analysis. Written by Dennis D. Eberl, the software was published in 2003 as U.S.G.S. Open-File Report 03-78, “Determining Quantitative Mineralogy from Powder X-ray Diffration Data”. Available via FTP from ftp://brrcrftp.cr.usgs.gov/pub/ddeberl/RockJock. i) See JADE from http://www.rigaku.com/index_world.html.
[3]
a) C. J. Gilmore, G. Barr, J. Paisley, J. Appl. Crystallogr. 2004, 37, 231–242; b) G. Barr, W. Dong, C. J. Gilmore, J. Appl. Crystallogr. 2004, 37, 243–252.
[4]
G. Barr, W. Dong, C. J. Gilmore, J. Appl. Crystallogr. 2004, 37, 658–664.
9
[23]
E. Keogh, M. Pazzani, Scaling up Dynamic Time Warping for Datamining Applications, at the 6th Int. Conf. on Knowledge Discovery and Data Mining, Boston, MA, 2000.
[24]
E. Keogh, M. Pazzani, Derivative Dynamic Time Warping, at the 1st SIAM International Conference on Data Mining, Chicago, USA, 2001.
[25]
H. Sakoe, S. Chiba, IEEE T. Acoust. Speech. 1978, 26, 43-49.
[26]
a) D. E. Goldberg, The Design of Innovation: Lessons from and for Competent Genetic Algorithms, Kluwer Academic Publishers, Norwell, MA. 2002; b) L. M. Schmitt, Theory of Genetic Algorithms, Theoretical Computer Science, 2001, 259, 1-61; c) M. D. Vose, The Simple Genetic Algorithm: Foundations and
Theory, MIT Press, Cambridge, MA. 1999. d) D. Whitley. A genetic algorithm tutorial. Statistics and Computing. 1994, 4, 65–85. [27]
E. Keogh, X. Xi, L. Wei, C.A. Ratanamahatana. 2006, The UCR Time Series Benchmark: www.cs.ucr.edu/~eamonn/time_series_data/
Received: ((will be filled in by the editorial staff)) Revised: ((will be filled in by the editorial staff)) Published online: ((will be filled in by the editorial staff))
10
Making clearer the way towards new structures Laurent A. Baumes*, Manuel Moliner and Avelino Corma*
Design of ATW methodology for automatically, quickly and reliably deciphering X-ray diffraction patterns
Design of a Full-Profile Matching Solution for HighThroughput Analysis of MultiPhases Samples Through Powder X-Ray Diffraction
11