COMMUNICATION
www.rsc.org/crystengcomm | CrystEngComm
A reliable methodology for high throughput identification of a mixture of crystallographic phases from powder X-ray diffraction data†
Laurent Allan Baumes,*a Manuel Moliner,a Nicolas Nicoloyannisb and Avelino Corma*a
Received 18th July 2008, Accepted 23rd July 2008
First published as an Advance Article on the web 11th August 2008
DOI: 10.1039/b812395k
Because the inherent complexity of zeolites, together with the use of high throughput technology, makes the performance of existing solutions for the automatic identification of mixtures of crystallographic phases questionable, an adequate full profile search-match approach is presented, and its reliability is demonstrated and illustrated on the very complicated case of the zeolite ITQ-33.

The discovery of new microporous crystalline structures involves a considerable experimental effort, which can be reduced by using high throughput (HT) techniques. Despite the reduction of experimental time and cost per solid, HT technology has added substantial difficulties to the analysis of the powder X-ray diffraction (XRD) patterns of the resultant products. The associated loss of data quality and increase in data volume make the routine procedures, both manual and software-assisted, inadequate. Time constraints and the related complexity highlight the need for reliable and robust systems able to identify each component of mixtures of crystallographic phases in a fully automated way. When dealing with inherently complex materials like zeolites, the capability of existing solutions becomes questionable. Widely used to characterize crystallographic structure, crystallite size (grain size), and preferred orientation in powdered samples, powder diffraction is intended not only to identify crystalline materials by comparing diffraction data against a database but also to detect the formation of an unknown phase, even over broad ranges of synthesis conditions and variables. Because user involvement is expected to be minimized, at least in the first steps of screening, in order not to slow down the whole HT process, the reliability and robustness of the methodology become of outstanding significance.
Improving reliability principally means decreasing the number of identification errors, while assigning a relatively greater weight to the mismatch of phases presenting higher levels of crystallinity, and to false negatives in the detection of unknown phases. Closely related, we refer to robustness as the capability of a technique to perform successfully over a variety of problems, i.e. it should not be restricted to datasets with special characteristics. Thus, we first describe the weaknesses of existing methodologies, and then present an improved approach. Its originality and power are verified
a Instituto de Tecnologia Quimica, UPV-CSIC, Universidad Politecnica de Valencia, Avda de los Naranjos s/n, 46022 Valencia, Spain. E-mail: [email protected]; [email protected]; Fax: +34 9638 7789; Tel: +34 9638 77800
b Laboratoire ERIC, 5 Av. Pierre Mendes France, Universite Lumiere Lyon 2, 69676 Bron, France
† Electronic supplementary information (ESI) available: Proof of ATW reliability; warping path and ATW formula; optimization with a genetic algorithm; synthesis of the “hexamethonium study”. See DOI: 10.1039/b812395k
This journal is © The Royal Society of Chemistry 2008
and illustrated with the very complex case of the zeolite ITQ-33.1 Both the mathematical proof of the robustness and the results obtained on real studies make such an approach a very promising and reliable methodology. Search-match approaches for the recognition of crystallographic phases from XRD can be divided into (a) peak search and indexing, and (b) full profile solutions. Based on the examination of the respective advantages and drawbacks of the two kinds of approaches summarized in Table 1, the former is discarded, and we have focused on true full profile systems. The principal difficulty encountered when dealing with full patterns is the adequate design of the criterion used for the matching, i.e. the comparison, taking into account that one specific structure can present large differences in peak intensities and XRD angles, depending on its level of crystallinity, crystallite size, and chemical composition. In addition, synthesized powders presenting mixtures of phases and impurities make this even more complicated. Similarity measures are the key component that allows one to quantify how similar or dissimilar the samples are. However, a suitable criterion should (i) accurately manage the influence of the highest peaks on the overall measure, (ii) detect the growing phases despite their weak peak intensities, (iii) handle x-shift between samples, including when the shift is not constant along the 2θ range inside a given pattern (see Fig. S8-b in the ESI†), (iv) limit the number of user interactions for settings and pre-treatment decisions, and (v) treat the amorphous phase indifferently while remaining consistent. The proposed methodology, called adaptable time warping2 (ATW), is a two-step approach that can be applied to intrinsically ordered data such as X-ray diffractograms. The central criterion is first optimized in order to tackle all the previously mentioned points, taking into account the knowledge about the problem, i.e.
selection of possible phases. Then it is applied to incoming full profiles for identification of the phases, including amorphous and unknown ones. The preliminary combination of the following two components gives the method all its strength: a very flexible distance based on the dynamic time warping (DTW) model, and a learning system that aims at automatically tuning the distance parameters according to the specificities of the selected phases. DTW3 is a time series alignment algorithm developed originally for speech recognition in the 1970s.4 Rather than comparing the value of the input pattern at time t to a selected reference pattern at time t, an algorithm is used that searches the space of mappings from the time sequence of the input to that of the reference, so that the total distance is minimised. This is not always a linear mapping (see Fig. 1(right)); for example, we may find that time t1 in the input corresponds to time t1 + 5 in the reference, whereas t2 in the input stream corresponds to t2 − 3 in the reference. The computation of a warping distance requires a warping path, which defines the sequence of pairs of points that are matched together, see “Warping path and ATW formula” in the
CrystEngComm, 2008, 10, 1321–1324
Table 1 Comparison of search-match approaches

Approach: Peak search and indexing
Pattern: Reduced (stick)
Methods: d-spacing and intensity
Associated techniques: Hanawalt(a), Fink, Diffract AT(b)
Advantages: Low storage requirement; speed of search; ease of database building
Drawbacks: Peak determination(c); number of peaks to consider; weak peaks are discarded

Approach: Full profile
Pattern: Full
Methods: Similarity based on Euclidean distance
Associated techniques: Statistics, PCA, MMDS, clustering
Advantages: Full use of information
Drawbacks: No commercial database; local exptl patterns collected; decision for pre-processing

(a) 8 strongest peaks. (b) Intermediate approach. (c) Overlap, shoulders.
Fig. 1 (Left) Phase diagram of the entire research space. (Right) Adapting the distance calculation for the series.
ESI.† In ATW, this special feature makes it possible both to optimally manage a variable shift, by computing the distance between points that do not occur at the same moment, and to highlight important traits of each reference pattern, i.e. taking simultaneously into account the entire selection of possible phases, to find which 2θ angles make a given phase distinguishable, by assigning weights to each pair in the warping sequence. Note that one point can be either totally ignored during the distance computation or matched with one or several points. To do this, ATW uses P, a t × t matrix, as the set of parameters (i.e. weights) required to compute the distance, with $P = [P_{i,j}]$, $P_{i,j} \in \mathbb{R}^{+}$, $\forall\, i,j \in [1,t]$, and t the number of intensities. P is optimized by a learning system that examines the available patterns, and is modified by a genetic algorithm5 (GA) in order to maximize the recognition rate of the phases. After the method has been correctly trained, i.e. the specificities of the phases are captured, the unseen samples are analyzed. The algorithm labels each sample with all the phases present. In order to identify the different crystallographic phases contained in the experimental samples, the algorithm only uses the diffraction data from available zeolites (the laboratory internal database and theoretical patterns). Each time a new sample is characterized, its relationship to all previously stored materials is examined through distances in order to assign its crystallographic phases. Such an approach, which follows the instance-based learning (IBL) protocol, can predict not only the pure or majority phase but also mixtures of phases, ordered by decreasing crystallinity. To do so, the algorithm computes the distances to the neighbours, and the output phases are ordered according to these distances. The conception of ATW
makes any warping or non-warping distance a particular case of ATW; for example, the Euclidean distance is obtained when P is the unit (identity) matrix, see “proof of ATW reliability” in the ESI.† For a given classification problem (i.e. dataset), the optimal parameters P imply that ATW gives results at least equal or superior to those of all other distances used. HT technology in combination with chemical knowledge and data analysis6,7 has allowed the synthesis of a unique zeolite structure that includes extra-large 18MR pores connected with medium 10MR pores (ITQ-33). This study required the generation of 192 diffractograms with a parallelized XRD to follow the formation of 8 different crystallographic phases, and numerous mixtures of phases, depending on the synthesis conditions. The new crystalline phase had to be discovered in a very narrow range of conditions within the whole experimental space. The Si/Ge ratio was broadly varied, which produces variations in the peak positions (see Fig. S8 in the ESI).† Among other characteristics, our methodology is expected to handle shifts in peak positions, which are common when dealing with samples having different Si/Ge, Si/Al, and Si/B ratios. The results obtained during the synthesis of zeolites using hexamethonium as an organic structure directing agent (OSDA) are a good example for testing the ability of ATW to classify such materials. Fig. 1(left), which represents the complete experimental phase diagram of the investigation of ITQ-33, shows the occurrence of each competing phase as a function of the composition of the starting gel. Despite the non-linearity of the system, the range of composition in which the different phases are formed is clearly defined. The evaluation of the proposed strategy aims at recovering
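Table 2 Confusion matrix with the real phases in the experimental design versus predicted classes obtained with ATW. Each row gives a real class with its number of samples, followed by the predicted classes with their counts; all other cells are 0.

Real class (total): predicted class (count)
Amorphous (55): Amorphous (55)
ITQ-22 (38): ITQ-22 (36), ITQ-22/ITQ-24 (2)
ITQ-24 (19): ITQ-24 (17), ITQ-24/Amorphous (2)
EU-1 (3): EU-1 (3)
SSZ-31 (16): SSZ-31 (15), SSZ-31/Amorphous (1)
Lamellar (46): Lamellar (46)
Lamellar/ITQ-24 (8): Lamellar/ITQ-24 (8)
ITQ-24/ITQ-33 (2): ITQ-24/ITQ-33 (2)
ITQ-17/ITQ-24 (2): ITQ-17/ITQ-24 (2)
ITQ-22/ITQ-24 (2): ITQ-22/ITQ-24 (2)
ITQ-22/ITQ-24/ITQ-33 (1): ITQ-22/ITQ-24/ITQ-33 (1)
ITQ-24/Amorphous (0): none
SSZ-31/Amorphous (0): none

Predicted totals: Amorphous 55; ITQ-22 36; ITQ-24 17; EU-1 3; SSZ-31 15; Lamellar 46; Lamellar/ITQ-24 8; ITQ-24/ITQ-33 2; ITQ-17/ITQ-24 2; ITQ-22/ITQ-24 4; ITQ-22/ITQ-24/ITQ-33 1; ITQ-24/Amorphous 2; SSZ-31/Amorphous 1.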
the complete phase diagram from HT powder diffraction in a fully automated manner. The commercial software PolySnap2© from Bruker-AXS, written by Gilmore et al., is selected as the reference for comparison for the following reasons: it is representative and highly relevant considering the current state of the art in HT identification of phases through full profile examination; thoroughly detailed explanations are available;8 case studies have reported excellent results; a free time-limited demo version is accessible; and it merges a broad range of techniques, among them: (i) principal component analysis (PCA) and multi-dimensional scaling (MDS), principally used as data visualization tools for exploring the similarities or dissimilarities between patterns; (ii) cluster analysis, such as hierarchical or fuzzy clustering, aiming at creating subsets of data so that the patterns in each group ideally share some common trait; and (iii) parametric and non-parametric statistical tests such as Pearson, Spearman, or Kolmogorov–Smirnov (KS).8 It can be noticed that most of these techniques or criteria use the Euclidean distance as the basis of their calculation. PolySnap2© analyses the data, automatically sorts the full patterns into clusters, and identifies unusual samples which may be unknown structures. As PolySnap2© offers an automatic analysis where user interaction is minimized to a few options (“allow an x-shift” and “check for amorphous” have been selected), this mode is chosen for comparison. We illustrate the methodology for mixtures of crystallographic phases occurring during the synthesis of a zeolite. It can also be applied to the synthesis of MOFs or any other type of crystalline material.
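To make the contrast between a point-wise Euclidean criterion and a warping distance concrete, the following minimal sketch compares the two on synthetic patterns. It implements plain DTW with an absolute-difference local cost, not ATW itself (the ATW formula and its weight matrix P are given in the ESI and are not reproduced here). The Gaussian peak shapes, peak positions, 2θ grid, and all function names are illustrative assumptions, not data from this study.

```python
import numpy as np


def gaussian_peak(two_theta, center, width=0.15, height=1.0):
    """Synthetic diffraction peak (illustrative shape only)."""
    return height * np.exp(-0.5 * ((two_theta - center) / width) ** 2)


def euclidean_distance(a, b):
    """Point-wise distance: intensity i is compared only with intensity i."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


def dtw_distance(a, b):
    """Classic dynamic time warping with |a_i - b_j| as local cost.

    cost[i, j] holds the cheapest cumulative cost of any monotonic
    warping path that matches a[:i] with b[:j].
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(
                cost[i - 1, j],      # a[i-1] matched again with an earlier b
                cost[i, j - 1],      # b[j-1] matched again with an earlier a
                cost[i - 1, j - 1],  # one-to-one match
            )
    return float(cost[n, m])


# Hypothetical 2-theta grid and patterns (assumptions for illustration).
two_theta = np.linspace(5.0, 15.0, 201)
reference = gaussian_peak(two_theta, 8.0) + 0.5 * gaussian_peak(two_theta, 12.0)
# Same "phase" with a shift that is not constant along 2-theta.
shifted = gaussian_peak(two_theta, 8.3) + 0.5 * gaussian_peak(two_theta, 12.2)
# A different "phase" with a single peak elsewhere.
other = gaussian_peak(two_theta, 10.0)

for name, dist in (("Euclidean", euclidean_distance), ("DTW", dtw_distance)):
    print(name, "reference vs shifted:", dist(reference, shifted))
    print(name, "reference vs other:  ", dist(reference, other))
```

On this toy example the Euclidean distance to the shifted pattern is a large fraction of the distance to the unrelated pattern, whereas the DTW distance to the shifted pattern is close to zero. A weighted variant in the spirit of ATW would additionally learn which matched pairs matter for each phase.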
The results obtained with the methodology presented here are compared with those from what is probably the best method reported so far in the literature.8 ATW is applied to the “hexamethonium” dataset, and the resulting classification error is below 3% (see Table 2). After verification, we observed that the algorithm is more accurate than our “manual” classification: the algorithm considers the amorphous material as another class, even when the amorphous content is only the “impurity”. The other two errors came from two samples predicted by the algorithm as ITQ-22/ITQ-24 mixtures, while the real phase only contained ITQ-22. This is because, for these two zeolites, the distance to the ITQ-24 diffractogram is at the limit of significance. The comparison of ATW and PolySnap2© results for the “hexamethonium study” (see Table S7 in the ESI)† shows a clear improvement in the error rate when ATW is employed (the ATW error rate is 92% lower than that of PolySnap2©), principally when mixtures of phases and incompletely crystalline phases are present (see Fig. S9 in the ESI).† This has also been verified with another case of lower complexity (“Beta study”) with only two crystalline phases, which shows similar difficulties when using the PolySnap2© method (see Table S4 and Fig. S5 in the ESI).† Moreover, we have empirically verified, with 20 benchmarks and two other real datasets of zeolite investigations, that ATW effectiveness is always at least equivalent to that of all other distances used, including the well-known DTW (see Fig. S1 and Table S2 in the ESI).† In conclusion, we have shown that the ATW methodology can not only be successfully used to automatically identify mixtures of crystallographic phases but is also able to detect unknown structures. This makes ATW an innovative and robust approach.
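The classification error quoted above for Table 2 can be checked with a few lines of arithmetic. The sketch below assumes a particular reading of the confusion matrix: 192 samples in total and five off-diagonal samples, namely the two ITQ-22 samples predicted as ITQ-22/ITQ-24 and three samples assigned to a class with an additional amorphous component. The variable names are illustrative.

```python
# Off-diagonal cells of the confusion matrix, keyed by (real, predicted).
# These counts are an assumed reading of Table 2, not new data.
misclassified = {
    ("ITQ-22", "ITQ-22/ITQ-24"): 2,
    ("ITQ-24", "ITQ-24/Amorphous"): 2,
    ("SSZ-31", "SSZ-31/Amorphous"): 1,
}
total_samples = 192  # number of diffractograms in the study

errors = sum(misclassified.values())
error_rate = errors / total_samples
print(f"{errors} / {total_samples} = {error_rate:.1%}")
```

Under this reading, the error rate comes out at about 2.6%, consistent with the statement that it is below 3%.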
The lack of adapted methodologies for series has given rise to a new field of investigation called temporal data mining.9 ATW robustness places the methodology as a leading strategy in this domain. Considering the numerous applications in chemistry, especially when using characterization data (i.e. series), ATW appears to be a very promising methodology that can help those working in the synthesis of novel crystalline materials.
Experimental
Hexamethonium is used as a structure directing agent (SDA). An initial experimental factorial design (3 × 4³) is selected.1 Si/Ge, T(III)/(Si + Ge), OH/(Si + Ge), and H2O/(Si + Ge) are the synthesis variables, with boron as the trivalent element T(III). This design considers the following four molar ratios (number of levels in parentheses): Si/Ge (4) ranging from 2 to 30; B/(Si + Ge) (4) from 0 to 0.05; OH/(Si + Ge) (3) from 0.1 to 0.5; and H2O/(Si + Ge) (4) from 5 to 30. The total number of samples synthesized and characterized is 192. The flexibility of hexamethonium allows different conformations that stabilize diverse competing structures, like EU-1, ITQ-17, ITQ-22, ITQ-24, SSZ-31, a lamellar phase, and the new structure ITQ-33.
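The size of the factorial design follows directly from the number of levels per variable. As a minimal sketch, the design can be enumerated by level index; only the level counts are taken from the text, the intermediate level values are not given here and are therefore not assumed, and the variable names are illustrative.

```python
from itertools import product

# Number of levels per synthesis variable, as stated in the text.
levels = {
    "Si/Ge": 4,
    "B/(Si + Ge)": 4,
    "OH/(Si + Ge)": 3,
    "H2O/(Si + Ge)": 4,
}

# Full factorial design: every combination of level indices.
design = list(product(*(range(n) for n in levels.values())))
print(len(design))  # 3 x 4^3 = 192 gel compositions
```

Each tuple in `design` corresponds to one gel composition, and hence to one of the 192 diffractograms.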
Acknowledgements
Laurent A. Baumes especially thanks Nicolas Nicoloyannis who was one of the directors of his PhD thesis and a friend. EU Commission FP6 (TOPCOMBI Project) is gratefully acknowledged. We also thank Santiago Jimenez for his scientific collaboration on the hITeQ platform.
Notes and references
1 A. Corma, M. J. Díaz-Cabañas, J. L. Jordá, C. Martínez and M. Moliner, Nature, 2006, 443, 842–845.
2 R. Gaudin and N. Nicoloyannis, in 5th Int. Conf. Machine Learning and Applications (ICMLA’06), ICMLA, 2006, 213–218, ISBN 0-7695-2735-3, IEEE Computer Society, Los Alamitos, CA, USA.
3 (a) D. J. Berndt and J. Clifford, KDD Workshop, 1994; (b) E. Keogh, in Tutorial in 18th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’04), Seattle, WA, USA, 2004.
4 (a) V. M. Velichko and N. G. Zagoruyko, Int. J. Man-Machine Studies, 1970, 2, 223; (b) H. Sakoe and S. Chiba, IEEE Trans. Acoustics Speech Signal Process., 1978, 43–49; (c) C. S. Myers and L. R. Rabiner, Bell Syst. Tech. J., Sept. 1981, 60(7), 1389–1409.
5 (a) D. E. Goldberg, The Design of Innovation: Lessons from and for Competent Genetic Algorithms, Addison-Wesley, Reading, MA, 2002; (b) L. M. Schmitt, Theor. Comput. Sci., 2001, 259, 1–61; (c) M. D. Vose, The Simple Genetic Algorithm: Foundations and Theory, MIT Press, Cambridge, MA, 1999; (d) D. Whitley, Stat. Comput., 1994, 2, 65–85.
6 (a) M. Moliner, J. M. Serra, A. Corma, E. Argente, S. Valero and V. Botti, Microporous Mesoporous Mater., 2005, 78, 73–81; (b) O. B. Vistad, D. E. Akporiaye, K. Mejland, R. Wendelbo, A. Karlsson, M. Plassen and K. P. Lillerud, Stud. Surf. Sci. Catal., 2004, 154A, 731–738; (c) A. Cantin, A. Corma, M. J. Díaz-Cabañas, J. L. Jordá and M. Moliner, J. Am. Chem. Soc., 2006, 128, 4216–4217; (d) A. Corma, M. Moliner, J. M. Serra, P. Serna, M. J. Díaz-Cabañas and L. A. Baumes, Chem. Mater., 2006, 18, 3287–3296.
7 (a) C. Klanner, D. Farrusseng, L. A. Baumes, M. Lengliz, C. Mirodatos and F. Schüth, Angew. Chem., Int. Ed., 2004, 43, 5347–5349; (b) F. Schüth, L. A. Baumes, F. Clerc, D. Demuth, D. Farrusseng, J. Llamas-Galilea, C. Klanner, J. Klein, A. Martinez-Joaristi, J. Procelewska, M. Saupe, S. Schunk, M. Schwickardi, W. Strehlau and T. Zech, Catal. Today, 2006, 117, 284–290; (c) L. A. Baumes, J. Comb. Chem., 2006, 8, 304–314.
8 (a) C. J. Gilmore, G. Barr and J. Paisley, J. Appl. Crystallogr., 2004, 37, 231–242; (b) G. Barr, W. Dong and C. J. Gilmore, J. Appl. Crystallogr., 2004, 37, 243–252; (c) G. Barr, W. Dong and C. J. Gilmore, J. Appl. Crystallogr., 2004, 37, 658–664.
9 (a) W. Lin, M. A. Orgun and G. J. Williams, Australasian Data Mining Workshop, Macquarie Univ. and CSIRO Data Mining, 2002; (b) C. M. Antunes and A. L. Oliveira, Workshop on Temporal Data Mining, at the 7th Int. Conf. on Knowledge Discovery and Data Mining (KDD’01), San Francisco, CA, 2001.