Introduction

The emergence of computer-enabling technologies in the late 20th century has transformed the way science is practiced in the U.S. and around the world. Science researchers and educators now take advantage of electronic literature databases and virtual classrooms, and have benefited from sustained growth in all areas of low-, mid-, and high-performance computing hardware. So profound has been the change that the use of information, computational, and communication technologies is now an accepted part of the research infrastructure we expect to maintain and grow in the 21st century.

Cyberinfrastructure needs will differ from one area of science to another. The chemistry community encompasses researchers in a broad range of subdisciplines, including the core chemical sciences, interdisciplinary activities at the interfaces with biology, geosciences, materials, physics, and engineering, and the interplay of computational modeling and prediction with experimental chemistry. It is probably not surprising that, given the unifying chemistry theme, the subdisciplines share common hardware needs, experience similar algorithmic bottlenecks, and face similar information-technology issues. At the same time, each subdiscipline of chemistry has its own cyberinfrastructure problems that inhibit scientific advances or education and training in that area. Finally, the next generation of grand-challenge chemistries could help to define and anticipate the cyberinfrastructure bottlenecks in those emerging areas of the chemical sciences.

The NSF-sponsored workshop was organized around chemical sciences drivers, with the primary purpose of determining how cyberinfrastructure solutions can enable chemical science research and education, and how best to educate and train our future workforce to use and benefit from cyberinfrastructure advances. Having both identified common cyberinfrastructure solutions for the chemistry community as a whole and discerned the distinct needs of the chemical sciences drivers, we offer the following recommendations, identified and discussed during the two days of the workshop.
Workshop organization

A scientifically diverse set of approximately 40 participants from academia, government laboratories, and the private sector, along with observers from various agencies and international representatives, was invited to attend the two-day workshop. The workshop participants gathered on the first night for a reception and opening remarks from the co-organizers and representatives of the National Science Foundation that defined the charge to the workshop participants. The remainder of the evening was devoted to talks by four plenary speakers, who presented a broad vision of the issues of cyberinfrastructure and its impact on the chemistry community.

The plenary speakers addressed the various levels of complexity, detail, and control desired by different groups within the computational chemistry community, which ranges from algorithm developers to point-and-click users. Their goals centered on a set of highly modular and extensible elementary and composite workflow pieces that, by individual combination, allow researchers to explore new uses of the codes. In practice this meant an unprecedented level of integration of a variety of computational codes and tools, including
computational-chemistry codes; preparation, analysis, and visualization software; and database systems. Examples of the development of such an infrastructure include the “Resurgence” project presented by Kim Baldridge, and Thanh Truong’s ITR project, “Computational Science and Engineering Online.”

The following day was devoted to breakout sessions organized around four chemical sciences drivers: Core Computational Chemistries, Computational Chemistry at the Interface, Computational and Experimental Chemistry Interaction, and Grand Challenge Chemistries. These discussions were meant first to identify outstanding scientific problems in those areas, and then to focus on what cyberinfrastructure solutions could advance those areas. Participants in the chemical sciences driver breakout sessions were reshuffled so that all of the first session’s identified needs would be represented in each of a second set of breakout sessions, held later in the day, addressing core infrastructure solutions: Software and Algorithms, Hardware Infrastructure, Databases and ChemInformatics, Education and Training, and Remote Chemistries.

On the final day, the facilitators for each session provided brief oral reports, which they later submitted as written reports summarizing those discussions. These reports have been assembled by the co-organizers to draw out the main findings and recommendations that follow. We also gratefully acknowledge several previous workshop reports [1, 2] and a blue-ribbon panel report [3], which identified several important areas of cyberinfrastructure development for the chemical sciences and engineering and which served as background for the participants of the Cyber-enabled Chemistry workshop.
Chemical Sciences Drivers

1. Core Chemistry

Any endeavor called “cyber-enabled chemistry” should share at least two prominent characteristics. First, cyber-enabled chemistry (CEC) relies on the presence of ubiquitous and substantial network, information, and computing resources. Second, CEC is problem-centered, and is directed in particular toward problems so complex as to resist solution by any single method or approach. Giving some examples of problems in chemistry that stand to benefit from better cyberinfrastructure may help bring the term “cyber-enabled chemistry” into tighter focus. A key feature of the following illustrative problems is that none of them can be solved using methods from just one subdiscipline; a complete solution must combine different areas of experiment, theory, and simulation.

• Molecular design of new drugs, molecular materials, and catalysts
• Electrochemistry problems such as the molecular structure of double layers and the active chemical species at an electrode surface
• Influence of complex environments on chemical reactions, and understanding and predicting environmental contaminant flows
• Chemistry of electronically excited states and harnessing light energy
• Tribology, for example the molecular origins of macroscopic behavior such as friction and lubrication
• Rules for self-assembly at the nanoscale, with emphasis on non-covalent interactions
• Combustion, which involves integrating electronic-structure calculations, experiment, and reaction-kinetics models

Cyber-enabled Challenges. Tools for CEC should allow researchers to focus on the problems themselves by freeing them from an enforced focus on simulation details. While these details should be open to examination on demand, their unobtrusiveness would allow researchers to focus on the chemical problem being simulated. Cyberinfrastructure could promote progress in the core chemistries by improving software interoperability, prototyping models to a prescribed accuracy against well-defined benchmark experiments and simulation data, archiving and warehousing experimental and simulation benchmarks, providing education in computational science and model chemistries, and enabling collaborations between experimentalists and theorists.

Software Interoperability. The subdisciplines in theoretical chemistry generally have their own associated public and private software; in many cases, no single code suffices even to solve parts of a problem residing entirely within a single subdiscipline. Therefore, interoperability between codes both within and across these subdisciplines is required. There is a clear opportunity here for the NSF to play a strong role in encouraging the establishment of standards, such as file formats, to enable interoperability. However, such standards should be imposed only where the problems involved are sufficiently mature, and these standards should be extensible. A problem-oriented simulation language – a crucial missing ingredient in achieving interoperability across subdisciplines – could be realized through the definition of large-scale tasks with standardized input and output that can be threaded together in a way that avoids detailed specification of what happens inside each task. Along with extensibility, such a language needs to provide rudimentary type-checking to ensure that all the information needed for any task will be available from prior tasks (a minimal sketch of this idea follows the next paragraph).

Benchmarks for Model Accuracy. Simulation has become increasingly accepted as an essential tool in chemistry: for instance, many experimentalists use quantum-chemistry packages to help interpret results, often with little or no outside consultation with theoretical and/or computational chemists. One advance furthering this acceptance was the introduction of “model chemistries”: levels of theory that were empirically determined to provide a certain level of accuracy for a restricted set of questions. This “prototyping with prescribed accuracy” (PWPA) can be identified as a goal that needs to be achieved in the broader context of chemical simulation. Ideally, each such PWPA scheme would be a hierarchical approach with guaranteed accuracy (given sufficient computational resources) based on standardized protocols that the community can benchmark empirically in order to determine the level of accuracy expected.
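As a purely illustrative sketch (not an existing package), the following Python fragment shows how such a problem-oriented layer might declare coarse-grained tasks with typed inputs and outputs and an optional prescribed accuracy, so that a composed workflow can be checked for completeness before any simulation is launched. All task names, data types, and fields here are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    """One coarse-grained workflow step with declared, typed inputs and outputs."""
    name: str
    inputs: dict      # required quantities, e.g. {"geometry": "xyz"}
    outputs: dict     # produced quantities, e.g. {"delta_E": "kcal/mol"}
    target_accuracy: Optional[float] = None   # prescribed accuracy, if any

def check_workflow(tasks):
    """Rudimentary type-checking: every input must be produced by an earlier task."""
    available = {}
    for task in tasks:
        for key, kind in task.inputs.items():
            if key not in available:
                raise ValueError(f"{task.name}: no upstream task produces '{key}'")
            if available[key] != kind:
                raise ValueError(f"{task.name}: '{key}' has type {available[key]}, "
                                 f"but {kind} is required")
        available.update(task.outputs)
    return True

# Hypothetical three-step protocol: geometry -> reaction energy -> rate constant.
workflow = [
    Task("optimize_geometry", inputs={}, outputs={"geometry": "xyz"}),
    Task("reaction_energy", inputs={"geometry": "xyz"},
         outputs={"delta_E": "kcal/mol"}, target_accuracy=1.0),
    Task("rate_constant", inputs={"delta_E": "kcal/mol"}, outputs={"k": "1/s"}),
]
check_workflow(workflow)   # raises if any task's inputs cannot be satisfied
```

In a full implementation the declared accuracies would also propagate through the workflow, so that a request phrased as a desired accuracy on the final result could be translated into accuracy requirements on the component tasks.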
Databases and Data Warehousing. Cyberinfrastructure solutions that can enable PWPA include an extensive set of databases containing readily accessible experimental and theoretical results. Such data would allow testing of a specific proposed modular subsystem protocol (say, for use with a new database or new hardware) without carrying out all of the component simulations comprising the protocol. Furthermore, theoretical and experimental results must be stored with an associated “pedigree,” allowing database users to assess the data’s reliability through estimated error bars and sufficient information to allow researchers to revise these estimates if necessary. The NIST WebBook database of gas-phase bond energies is a notable example of a current effort directed along these lines. Ideally, automated implementation of these protocols should proceed from specification not of the protocols themselves but rather of a desired accuracy. The simulation language would allow for automatic tests of the sensitivity of specified final results (for example, the space-time profile of the concentration of some species in a flame) to the accuracy of component sub-problems (for example, computed reaction energies, diffusion constants, and fluid properties, to name but a few).

There are simulation cases where a compelling reason exists to archive relatively vast amounts of simulation data. Such “community simulations” have value outside their original stated intent because, for example, they provide initial conditions that can be leveraged to go beyond the original questions asked, because they can be used in benchmarking and creating the standardized protocols needed for PWPA, or because they provide useful model-consistent boundary conditions or averages for multiscale methods. In the first category are simulations of activated events, such as the millisecond folding simulation of the Kollman group or the simulation of water freezing from the Ohmine group. These simulations were so challenging because of the long time scales involved, which are in turn directly related to the presence of free-energy barriers. Most methods developed to model rare events without brute-force simulation rely on the availability of some representative samples in which the event occurs; these “heroic” simulations can thus be put to use in a more statistically significant context. Simulations of water at different pressures and temperatures would fall in the second category, providing benchmarks for empirical water-potential models and aiding development of standardized protocols for simulations involving water as a solvent. In the last category are simulations of biological membranes, which could be used as boundary conditions for embedding membrane-bound proteins in studies of active-site chemistry.

Education in Computational Science and Engineering. The development of an improved cyberinfrastructure could go a long way towards achieving an equal partnership between simulation and experiment in solving chemical problems. But such an equal partnership implies a drastic change in the culture of chemistry. Thus, it is particularly important that such a change be cultivated at the level of the undergraduate curriculum. In some disciplines, such as physics, it is natural for students to turn to modeling as an aid in understanding the ramifications of a complex problem, but this is rarely the case in chemistry. Improvements in cyberinfrastructure will enable earlier and more aggressive introduction of simulation techniques into the classroom. However, cultivation of a “model it first” attitude among undergraduates, and among chemists as a whole, is more useful if the simulation data come with a trust factor, i.e., error bars. Development of PWPA techniques is therefore critical. Moreover, the concept of error in simulation needs to be emphasized so that simulation tools are not misused.
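Tying together the “pedigree” and “trust factor” points above, the sketch below shows one possible (entirely hypothetical) shape for a deposited record: a value stored together with its error bar and enough provenance for a later user to judge, or revise, that estimate. The field names and the sample entry are placeholders, not a schema from any existing database.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class PedigreedValue:
    """A stored result plus the provenance needed to judge (or revise) its error bar."""
    quantity: str        # what was measured or computed
    value: float
    uncertainty: float   # estimated error bar, in the same units as value
    units: str
    source: str          # "experiment" or "simulation"
    method: str          # instrument and conditions, or level of theory and basis set
    reference: str       # DOI, report number, or archive identifier
    year: int

# Placeholder entry for illustration only; the numbers are not authoritative.
entry = PedigreedValue(
    quantity="gas-phase O-H bond dissociation energy of water",
    value=118.8, uncertainty=0.2, units="kcal/mol",
    source="experiment", method="placeholder", reference="doi:placeholder", year=2004)

print(json.dumps(asdict(entry), indent=2))   # ready to deposit or exchange
```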
Collaborations between Experiment and Theory. What would be the practical outcome of “simulation and experiment as equal partners”? Computing and modeling at the lab bench would become routine, both suggesting new experiments and, just as
valuably, helping avoid experiments with little or no hope of success. Numerous cyberinfrastructure tools are explicitly designed to enhance or facilitate collaborations. Thus, there is a double effect whereby cyberinfrastructure promotes collaborations, and these collaborative efforts in turn increase the demand for improved cyberinfrastructure. This is welcome, since increased collaboration between experiment and theory is a must for progress on complex chemical problems and also for the validation of protocols needed for PWPA.
2. Chemistry at the Interface

Many of the important potential advances for chemistry in the 21st century involve crossing an interface of one type or another. Significant intersections of chemistry and other disciplines (in parentheses) include:

• Understanding the chemistry of living systems in detail, including the development of medicines and therapies (biology, biochemistry, mathematical biology, bioinformatics, medicinal and pharmaceutical chemistry)
• Understanding the complex chemistry of the earth (geology, environmental science, atmospheric science)
• Designing and producing new materials that can be predicted, tailored, and tuned before production, including investigating self-assembly as a useful approach to synthesis and manufacturing (physics, electrical engineering, materials science, biotechnology)

Developing cyberinfrastructure to make these interfaces as seamless as possible will help address the challenges that arise. It is important to acknowledge multiple types of interfaces in the specification of needed infrastructure. One scientific theme underlying many of the areas described above is the requirement to cross multiple time and length scales. Examples range from representations or models for the breaking of bonds (quantum chemistry) to descriptions of molecular ensembles (force fields, molecular dynamics, Monte Carlo) to modeling of the chemistry of complex environments (e.g., stochastic methods) to entire systems. Today, computational scientists are generally trained in depth in one sub-area, but are not expert in the models used for other time and length scales. Herein lies a challenge, since frequently data from a shorter time or length scale are used as input for the model at the next scale (a minimal sketch of such a hand-off appears below).

Developing the interfaces between theory, computation, and experiment is also required to understand a new area of science. But again, because of specialization, no seamless interface exists between theorists, computational scientists, and experimentalists. Other interfaces deserve consideration as well: interfacing across institutions – academic, industrial labs, government labs, funding agencies – is needed to disseminate advances among the different institutions conducting research in the U.S., as well as across geographical locations, to take advantage of research already done around the globe. Better coordination between research and education is required to introduce new research topics into the undergraduate and K-12 curriculum, as well as to explain significant new chemistry findings that bear on public policy, such as stem-cell research or genetically modified foods.
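As a concrete, if deliberately simplified, illustration of handing information from a finer scale to a coarser one, the Python sketch below maps atomistic coordinates onto coarse-grained beads by center of mass, the kind of reduction a mesoscale model might consume. The grouping and masses are invented for illustration and are not tied to any particular code.

```python
import numpy as np

def coarse_grain(positions, masses, groups):
    """Map atomistic coordinates onto coarse-grained beads by center of mass.

    positions : (N, 3) array of atomic coordinates
    masses    : (N,) array of atomic masses
    groups    : list of index lists, one list of atom indices per bead
    """
    beads = []
    for idx in groups:
        m = masses[idx]
        beads.append(m @ positions[idx] / m.sum())   # mass-weighted average position
    return np.array(beads)

# Toy example: six atoms mapped onto two beads (coordinates and masses are invented).
pos = np.random.rand(6, 3)
mass = np.array([12.0, 1.0, 1.0, 16.0, 1.0, 1.0])
print(coarse_grain(pos, mass, [[0, 1, 2], [3, 4, 5]]))
```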
Cyber-enabled Challenges. Tools for cyber-enabled chemistry should allow for clear communication across the interfaces, broadly defined. The science interfaces listed above, for example, require scientific research involving complexity of representation, of analysis, of models, and of experimental design and measurement, even within individual sub-areas. How can all of the relevant information from one sub-area be conveyed (with advice about its reliability) to scientists in other sub-areas who use different terminology? How do we educate students at scientific boundaries, and also promote collaboration across disciplines?

Chemistry Research and Education Communication. How do we present problems to the broader research and education community in the most engaging manner? Different disciplines may use different terminology and concepts to describe similar chemistry. A science search-engine – such as the recently introduced Google Scholar – would be highly desirable. Research, development, and use of ontologies, thesauri, knowledge representations, and standards for data representation are all ways to tackle these issues, and are computer science research efforts in their own right. In some cases, a problem is best expressed at a higher level of abstraction, e.g., in a way that a mathematical scientist might understand and use generic techniques to solve. Alternatively, a problem may need to be expressed in the language of a different sub-area in order to encourage experts in that area to generate data needed as input to a model at a different scale. It is therefore important to present theories and algorithms in a context that can be understood by those in related fields.

Along another dimension, free access to the scientific literature carries particular importance for projects that span interfaces, because much of the literature required for these projects is not in core chemistry areas, but rather in a multitude of other disciplines. Today, in these cases, cost is a significant inhibitor to learning from the literature. As a result, advancement of the science is slower than it could be.

Members of a newly formed multidisciplinary research team must initially learn about one another’s disciplines, and the results of their research must later be conveyed to the wider academic community. Web pages, links to related literature, and online courses may all be useful. In addition, the academic curriculum should be updated to teach the basic concepts of multiple traditional disciplines (rather than just chemistry) to the next generation of students so that they may more easily understand and contribute to new areas.

Interfacing Data and Software across Disciplines. Potential inhibitors to clear communication across interfaces and to the success of multidisciplinary projects include, first, some sociological issues around data sharing. For example, traditional scientists, who have been encouraged to become deep experts in a particular field, may not feel the need, or have the time, to make their data available to others. To combat this and to encourage widespread deposition of protein structural data, the Protein Data Bank (PDB) has successfully allied itself with influential journals that frown on submission of articles without deposition of data in the databank. However, it is not clear that this model can be extended to all areas of chemistry.
This raises another issue, namely support for centralized data sources that can be centrally curated, as is the PDB, versus distributed autonomous sources where the maintenance and support are managed more cost-effectively by multiple
groups, but where the data models and curation protocols are likely to differ, thus hampering integration of the data.

Finding and understanding relevant data requires reliability and accuracy. Users of data need to understand its accuracy and the assumptions used in its derivation in order to use it wisely. Algorithm developers need to know what degree of accuracy is required (i.e., when their algorithm is good enough) for different uses of the data. Cyberinfrastructure “validation services” could provide information on what to compare, how to compare, and what protocols to follow in the comparison (a toy sketch of such a service appears at the end of this section). Curation is essential for ensuring improved reliability, although, in general, expert curation (unlike automated curation) is not scalable. Data provenance and annotation are also important. Standards are essential for interoperability among application programming interfaces (APIs) as well as among data models, although the difficulties of standards adoption should not be underestimated. Adoption of standards partially addresses this concern, but there is also a significant need for robust software-engineering practices, so that software developed for one subdiscipline can be easily transferred to codes for other subdisciplines. This facilitates building on proven technology where appropriate, and it argues for sustained funding of software maintenance and support.

Development of Collaboration Tools. Cyberinfrastructure can help existing, geographically dispersed teams communicate more effectively. Examples of useful collaboration tools are those that would improve point-to-point communication with usable remote-whiteboard technology, or would better enable viable international videoconferencing, such as VRVS (Virtual Rooms Videoconferencing System). In order to make many multidisciplinary projects successful, budgets also have to cover technical people who can translate between fields, even if they lack the depth of an expert in any one area. New information technology with sophisticated knowledge representation may someday fill this gap, but in the foreseeable future such people will be vital to the success of a multidisciplinary project. Large projects also need expert project managers to facilitate collaborations and supervise design, development, testing, and deployment of robust software. Funding should be available for multidisciplinary scientific projects (as long as such funding does not negatively impinge upon individual PI funding) that focus on novel science and novel computer science, as well as for projects that focus on novel science enabled by the design and deployment of infrastructure based on current technology. (In fact, any one project may involve both of these aspects.)
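The following toy sketch suggests what the “validation service” mentioned above might look like from a user’s point of view: a submitted value is compared against curated reference data using property-specific tolerances. The reference values, tolerances, and function names are illustrative assumptions, not an existing service.

```python
# Toy validation service: compare a submitted value against curated reference data
# using property-specific tolerances (all entries below are illustrative).
REFERENCE = {   # (system, property) -> (reference value, reference uncertainty)
    ("H2O", "O-H bond length / angstrom"): (0.958, 0.002),
}
TOLERANCE = {"O-H bond length / angstrom": 0.01}   # acceptable deviation per property

def validate(system, prop, submitted):
    ref, ref_unc = REFERENCE[(system, prop)]
    deviation = abs(submitted - ref)
    allowed = TOLERANCE.get(prop, 2 * ref_unc)     # fall back to 2x reference uncertainty
    return {"reference": ref, "deviation": round(deviation, 4),
            "within_tolerance": deviation <= allowed}

print(validate("H2O", "O-H bond length / angstrom", 0.969))
```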
3. Computational and Experimental Chemistry Interactions

Increasingly, experimentalists and computational chemists are teaming up to tackle the challenging problems presented by complex chemical and biological systems. However, many of the most challenging and critical problems are pushing the limits of the capabilities of current simulation methods. A key aspect of the computational/experimental interface is validation. This is a two-way process: high-quality experimental data are necessary for validating computational models, and results from highly accurate computational methods can frequently play an important role in validating experimental data, or provide qualitative insight that permits the development of new experimental directions.
Improved Computational Models that Connect with Experiment. A defining area of intersection between computation and experiment – the use of efficient computational models to drive experiments (for example, to predict optimal experimental conditions for deriving the highest-quality experimental data, or to design “smart experiments”) – could lower the cost of discovery and process design (a toy illustration appears at the end of this section). Several key prospective infrastructural advances will help researchers bridge the computational/experimental interface. High-bandwidth networks will allow large amounts of data to move rapidly among researchers. Robust visualization and analysis tools will give researchers better chemical insight from data exchanges. Better data-access and database-querying tools are also needed. Continued development of new and improved methods for modeling systems with non-bonded interactions, hard materials (e.g., ceramics), and interfacial processes should be given high priority. For many of these problems, there is a need to develop computational methods that truly bridge different time and length scales. To validate such approaches, it is important to generate and maintain databases with data from experiments and simulations, which in turn means developing mechanisms for certifying data, establishing standardized formats for data from different sources, and developing new tools (expert systems and visualization software) for querying a database and analyzing data.

Promoting Experimental/Theoretical Collaboration. Real-time interactions between experiments and simulations are needed in order to maximize the benefit of groups interacting effectively. Although groups are beginning to explore opportunities to enhance these collaborations, limited support for real-time interaction currently prevents their full exploitation. One problem is the time required to carry out experiments or simulations. Faster algorithms and peak-performance models and methods should help facilitate computational/experimental interactions. Better software and analyses of experimental data and databases, perhaps using expert systems, should help computational modelers access experimental results.

The computational/experimental interface also has an educational dimension. The Internet era has made it easier than ever for experimental and theoretical groups in different locations to interact, and modern cybertechnology makes it possible for these interactions to involve students and not just faculty members.
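As a toy illustration of the “smart experiment” idea introduced at the start of this section, the sketch below uses a cheap, made-up surrogate model to rank candidate reaction temperatures before any instrument time is committed. The functional form and numbers are invented and carry no chemical authority.

```python
# Toy model-driven experiment selection: rank candidate set points with a cheap
# surrogate model before running anything in the lab (the model is entirely invented).
import math

def predicted_yield(temperature_K):
    """Made-up surrogate: Arrhenius-like product formation minus product degradation."""
    formation = math.exp(-4000.0 / temperature_K)
    degradation = 4000.0 * math.exp(-9000.0 / temperature_K)
    return max(formation - degradation, 0.0)

candidates = range(300, 701, 50)                 # candidate set points in kelvin
ranked = sorted(candidates, key=predicted_yield, reverse=True)
print("Suggested order of experiments:", ranked[:3])
```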
4. Grand Challenges in Chemistry

Some very broadly defined areas of chemistry may yield only to next-generation technologies and innovations, which will in turn rely heavily on the development and application of novel cyberinfrastructures enhancing both computational power and collaborative efforts. In particular, three key grand challenges were discussed:

• Development of modeling protocols that can represent chemical systems at levels of detail spanning several orders of magnitude in length scale, time scale, or both. For example, modeling the materials properties of nanostructured composites might involve detail from the molecular level up to the mesoscopic scale. The bulk modulus of a solid, for example, is a property associated with the large scale. But developing parameters to describe that scale in a tractable coarse-grained fashion requires understanding the details of unit cell structures at the molecular scale.
• Development of modeling protocols that can represent very large sections of potential-energy surfaces of very high dimensionality to chemical accuracy, typically
defined as within 1 kcal mol^-1 of experiment. This level of accuracy will be critical to the successful modeling of such multiscale problems as protein folding, aggregation, self-assembly, phase separation, and phase changes such as those involved in conversion between crystal polymorphs. With respect to the latter problem, simply predicting the most stable crystal structure for an arbitrary molecule remains an outstanding grand challenge.
• Development of algorithms and data-handling protocols capable of providing real-time feedback to control a reacting system actively monitored by sensor technology (for example, controlling combustion of a reactive gas in a flow chamber); a toy sketch of such a feedback loop is given at the end of this section.

Solving such multiscale problems requires transferring data among adjacent scales, so that smaller-scale results can be the foundation for larger-scale model parameters and, at the same time, the larger-scale results can feed back to the smaller-scale model for refinement (e.g., improved accuracy). Addressing this grand challenge means developing algorithms for propagating the deterministic or probabilistic evolution of systems at fine and coarse scales. In addition, most systems of interest are expected to be multiphase in nature, e.g., solids in contact with gases, high polymers in solution, or a substance that is poorly characterized with respect to phase, as is a glass. Characterization of any of these systems will require considering significant ranges of system variables such as temperature and pressure. An added level of difficulty may arise when the system is not limited to its ground electronic state.

Another key point is that it is not really the potential-energy surface but the free-energy surface that needs to be modeled accurately. This requires accurate modeling of entropy, and it is unlikely that ideal-gas molecular partition functions will be sufficiently robust for this task. Improved algorithms for estimating entropy and other thermodynamic parameters will be critical to better modeling in this area. For problem areas such as combustion and sensor control, attaining the speeds needed for controlling combustion of a reactive gas in a flow chamber may require development of specialized hardware optimized to the algorithms involved. In addition, methods for handling very large data flows arriving from the sensors (and possibly being passed to control mechanisms) will need to be developed.

All three grand challenges share several common features that will place an onus on cyberinfrastructure development. First, model quality cannot be evaluated in the absence of experimental data against which to conduct validation studies. Useful data are not always available, and support for further measurement should not be ignored. Centralization of validation data into convenient databases – ideally with quality review of individual entries and standardization of formats – would contribute to more-efficient development efforts. Second, model and algorithm development at all but the very smallest scale inevitably involves some parameterization. Support for cyberinfrastructure tools that might speed parameter optimization (e.g., via grid computing across multiple sites) and simplify analysis of parameter sensitivity should also be a priority. Third, approaches to grand-challenge problems will benefit from improvements in processor speeds, memory usage, parallel-algorithm development, and grid-management technology. Finally, to ensure the maximum utility of tools developed as cyberinfrastructure, developers need to
be multidisciplinary, either individually or as teams, so that the tools themselves will be characterized both by good chemistry and physics and by good software engineering.
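As a toy sketch of the real-time feedback loop named in the third grand challenge, the fragment below reads a (simulated) sensor, compares it with a set point, and adjusts an actuator on each control interval. The plant model, gain, and numbers are invented and stand in for the far harder chemistry-aware controllers the challenge actually calls for.

```python
# Toy sensor-driven control loop: a proportional controller holds a reactor
# temperature near its set point (the plant model below is invented).
def read_sensor(state):
    return state["T"]                              # stand-in for a real sensor read

def actuate(state, fuel_rate, dt=0.1):
    """Invented plant: heating from fuel flow, relaxation toward 300 K ambient."""
    state["T"] += dt * (50.0 * fuel_rate - 0.1 * (state["T"] - 300.0))

set_point, gain = 900.0, 0.02
state, fuel = {"T": 300.0}, 0.0
for _ in range(200):                               # each pass = one control interval
    error = set_point - read_sensor(state)
    fuel = max(0.0, gain * error)                  # proportional feedback, clipped at 0
    actuate(state, fuel)
print(round(state["T"], 1))                        # settles near (not at) the set point
```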
Cyberinfrastructure Solution Areas

1. Models, Algorithms & Software

Several developments in models, algorithms, and software are needed to carry out computational and theoretical chemistry that will enhance, or be enhanced by, developments in cyberinfrastructure. Nearly all problems at the forefront of the chemical sciences require bridging across multiple length and time scales for their solution. Techniques are needed to reversibly map quantum-mechanical scales to atomic scales, atomic scales to mesoscales, and mesoscales to macroscales. These mapping techniques may not generalize across all chemical systems, materials, and processes. Consequently, development of coarse-grained models and methods for bridging models across length and time scales is a high priority.

Models and Algorithms. At the level of quantum-mechanical (QM) methods, accurate order-N methods are needed to enable the determination of ground and excited states of systems containing on the order of several thousand atoms, and to calculate system dynamics, kinetics, and transition states. Another approach worth pursuing is the application of ensemble and sampling methods to QM problems with greater sophistication and quality than is now possible. In the electronic-structure calculation of condensed phases, there is a need to include relativistic effects for heavy metals and to develop methods beyond density functional theory (DFT); it is not clear at this point how to systematically improve DFT and related methods. At the classical, atomistic level, more-sophisticated force fields, including reactive force fields and empirical potentials that can realistically model heterogeneous, disparate materials and complex (e.g., intermolecular) systems, will be beneficial. Development of these approaches would benefit from access to systematic databases of QM and experimental data and from new tools to automate force-field optimization. Databases should be provided in standard data formats with tags, and it should be easy to use tools to access data for validation and parameterization of models. Similarly, improved mesoscale models and methods for the treatment of heterogeneous, multiphase, multi-component, and multi-material systems are needed. Time-scale meshing is an issue, for example, in combustion and environmental-flow (e.g., through soil) problems. Additionally, methods are needed to model both chemical and physical processes at the cellular level, as well as such basic processes as solvation.

Better algorithms – for instance, multiscale integrators, coarse-sampling methods, and ergodic sampling methods such as Wang-Landau and hyperparallel tempering – are needed for global optimization of structures and, e.g., wave functions, as well as for the efficient exploration of time and the efficient generation of equilibrium or metastable states. This would include better rare-event and transition-state methods and, importantly, robust methods for validating these methods. Additionally, “bridging methods” that adaptively hand off parts of a simulation to different models and methods would permit seamless bridging across scales without the need to fully integrate these methods. To accomplish
these goals, community-endorsed standard data formats and I/O protocols are needed, as are modular programming environments.

Software Development Tools. Software tools enabling massively parallel computation – such as scalable algorithms, improved networking and communication protocols, and the ability to adapt to different architectures – are needed. Improved performance tools, such as profilers and debuggers, load balancing, checkpointing, and rapid approaches to assessing fault tolerance, comprise the necessary support for parallelizing good, older serial codes and for facilitating the generation of large new community codes. Tools to benchmark software accuracy and speed in a standard way would be helpful, as would metrics for quality assurance and comparison. One way to achieve this goal would be with community benchmarking challenges, similar to those for predicting protein structures or the NIST fluids challenge. Inclusion of benchmarking as a standard feature of larger projects should be encouraged. While visualization at the electronic-structure and atomic levels is generally adequate, there is a need for easy-to-use, affordable software to visualize unique shapes (e.g., nanoparticles, colloidal particles) and macromolecular objects, as well as very large data sets or multiple levels and dimensions. Interactive handling of data streams would allow for computational steering of simulations.

Finally, models, algorithms, and software must be carefully integrated. Common component architectures, common interfaces, and inter-language “translators,” for example, will facilitate software integration and interoperability, while better methods of integrating I/O, standardized formats for input (such as babel and CML), and error/accuracy assessments within a framework will facilitate integration. We encourage the support, development, and integration of public-domain community codes and, in particular, the solving of associated problems such as long-term maintenance, intellectual property, and tutorial development (with recognition of the varying expertise of likely users). Grid computing should be supported for model parameterization and to enable the coupling of databases to computations, so that pre-computed and pre-measured values can be accessed and duplication avoided. Collaborations between chemical scientists and computer scientists for continued development of tools, languages, and middleware to facilitate grid computing should be encouraged. As grid computing increases, so will the need for reliable data-mining tools. Data repositories – which should include negative computational results where applicable – would also facilitate benchmarking.
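A minimal sketch of the database-coupling idea just described: before an expensive calculation is launched, its specification is hashed and looked up in a shared results store, so pre-computed values are reused rather than recomputed. The keying scheme, store, and dummy “calculation” are illustrative assumptions.

```python
# Sketch of coupling computations to a results store so previously computed values
# are reused instead of recomputed (keying scheme and "calculation" are invented).
import hashlib, json

results_store = {}                               # stand-in for a shared database

def spec_key(spec):
    """Canonical hash of the calculation specification (molecule, method, basis...)."""
    return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()

def run_or_fetch(spec, compute):
    key = spec_key(spec)
    if key in results_store:                     # reuse a pre-computed value
        return results_store[key]
    value = compute(spec)                        # otherwise run the (expensive) job
    results_store[key] = value
    return value

spec = {"molecule": "H2O", "method": "placeholder-method", "basis": "placeholder"}
first = run_or_fetch(spec, compute=lambda s: -76.4)    # dummy energy; "job" runs once
second = run_or_fetch(spec, compute=lambda s: None)    # same spec, hits the store
print(first, second)
```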
2. Hardware Cyberinfrastructure

Hardware resources that can be brought to bear on a particular scientific challenge are not hard to identify. Yet these resources are the underpinnings of all the layers of the hierarchy of cyber-enabled chemistry (CEC). The areas discussed in this report – data and storage, networking, and computing resources – do not cover the full gamut of what constitutes the entire base infrastructure, but it is in these three areas that the need for sustained funding and opportunistic possibilities is most acute.

Data and Storage. A low-latency, high-bandwidth, transparent, hierarchical storage system (or systems) will be needed for facilitating federated-database access and archival
storage. The system interfaces must account for the mapping of data, information, and knowledge at all levels. Significant infrastructure research will be required to federate and use the multitude of databases that are and will be both available and required for CEC. The fidelity and pedigree or provenance of those data must be maintained; with the near-exponential growth of data now occurring, the situation will only become more demanding. The hardware should be tuned to allow for ubiquitous access to the literature and to existing chemistry-specific databases, such as the PDB, thermochemistry databases at NIST, and other chemical-sciences and engineering databases. The hardware system functions should include the collection of user-based and automatically generated metadata both from site-specific systems – e.g., the NSF-funded supercomputer centers – and from applications run at those centers. Site specifics from experimental facilities and their associated instruments need to be included as well. Finally, access to data must be carefully controlled at the user, group, facility, and public levels. Users must be able to determine what data get pushed out to the various levels within the user community and how.

Networking. The network components of cyberinfrastructure are essential to the success of any cyber-enabled science. Without a high-performance, seamless, robust network, none of the components of cyber-enabled science will work. The network has to identify and mitigate failures and performance issues in a productive way. Timely integration of new technology into the network infrastructure is critical, and evaluation of next- and future-generation innovations must be carried out on a continuing basis and in cooperation with others evaluating the same or similar technology. The network infrastructure’s critical nature, and the demands that CEC is expected to place on it, require that the current network backbone be upgraded promptly to the latest production-quality technology and that regional and network-wide links be upgraded in a timely fashion. Furthermore, appropriate testbeds for evaluating new technology must be put into place or strengthened where needed. Network research testbeds are available within other federal agencies (e.g., the DOE Office of Science) and should be leveraged where appropriate.

Computing Resources. The desktop, or principal-investigator, level of resources at the medium-to-high end of the current market – used for individual work and for accessing other cyber-enabled facilities on the network – is frequently adequate for users within the CEC community. However, desktop resources must keep pace with changes in cyberinfrastructure as base computing capability evolves, and periodic technological refreshment of these resources must be supported on a continuing basis.

The next levels in the hierarchy are the department and larger regional capacity centers. These are not high-end computing (HEC) resources with unique capabilities but, rather, facilities with the capacity to stage jobs to the high-end resources and to serve the computational needs of scientists who do not require high-end resources. These are currently underrepresented resources in the NSF landscape. Many people use high-end resources because those resources have significant capacity in addition to their capability. This obviously reduces access to the capabilities of the high-end resources for the capability user, while the capacity user often faces throughput issues on his or her work.
NSF’s Mathematical and Physical Sciences Directorate, which includes the Chemistry Division, and CISE should cooperate in developing new regional capacity centers, possibly in partnership with existing ones, in addition to augmenting the resources at the HEC
facilities and targeting applications to appropriate resources. Funds for the staffing necessary for maintaining, appropriating, and operating these expanded resources must be provided.

The third level of the hierarchy is high-end computing (HEC) resources. NSF national facilities such as NCSA, SDSC, and PSC, provided to meet the programmatic expectations of the various NSF divisions, are critical for expediting grand-challenge science, and they are the computational workhorses of cyber-enabled systems. Technology refresh is extremely important for the HEC centers. Mechanisms should be put in place to identify experimental and future technologies, to take advantage of high-risk opportunistic technologies, and to determine these technologies’ suitability for enhancing production-quality capabilities for the computational-science community and cyber-enabled science – always in a cost-effective manner, with engagement from the NSF computing community and possibly in cooperation with other agencies.

At the highest level of the current hierarchy is the TeraGrid, a large, comprehensive, distributed infrastructure for open scientific research. CEC will make use of the TeraGrid at some level. The NSF Chemistry Division must understand the objectives and goals of the TeraGrid program and determine a path forward to exploit these resources and any new cyber-enabled resources. It is likely that the TeraGrid will be only one component of cyber-enabled science.

The micro-architecture of all these resources must feature balanced characteristics among the many subsystems (e.g., memory bandwidth and latency must match the computational horsepower, and cache efficiency is critical to achieving high performance). Parallel-subsystem characteristics (e.g., inter-node communication systems and secondary storage) must also be balanced. Deployed systems’ characteristics must be appropriate for chemistry and chemical-engineering applications. In addition, programming models and compiler technology must be advanced to increase programmer efficiency. This is an obvious area for cross-collaborative development with other NSF directorates and divisions.

Visualization systems will become even more critical to the insights and engagement of experts and non-experts alike. It is easy to visualize a handful of atoms, but simulations of protein systems with chemically reactive sites, reactive chemical flows, microscopic systems, and the like will demand improved visualization capabilities, which the chemistry community, in turn, must learn how to develop and use. Visualization tools such as immersive caves, power walls, high-end flat-panel displays, and, most importantly, appropriate software tools will lead the practicing cyber-chemist to new chemical discoveries. Remote visualization of data from distributed data sources is a difficult challenge, but one that must be met.
3. Databases and ChemInformatics

Shared cyberinfrastructure and data resources will be needed to solve the grand-challenge problems identified in recent NAS reports [1, 2], as well as to improve baseline productivity and enable day-to-day progress in scientific advancement. Current databases are not the universal answer. For example, the current protein databases are very useful,
but we can’t mine them to learn about protein-protein interactions. However, several concrete examples of forward thinking on database organization and querying exist:

• Protein Data Bank (PDB): the first chaos of data access was exhilarating, and many discoveries were made by virtue of having data in the same place. The hierarchy and rules developed later.
• The Cambridge Structural Database is successful because it created a community of users with a common need on a focused problem.
• The Thermodynamics Resource Center at NIST is developing a program in dynamic data evaluation, whereby literature data are searched and a crude evaluation of uncertainty is performed by an expert system.
• The JPL/NASA panel on atmospheric chemistry and kinetics data is an example where standards have been agreed upon to evaluate data, but the rest of the community does not embrace these standards.

However, a number of standing problems remain wholly or partly unsolved with regard to databases, their management, and their use in the chemical sciences. Data can exist in database form, in the more amorphous literature, or on the Web. Cyberinfrastructure will be needed to provide tools to access data, organize it, and convert it into a form that enables chemical insight and better technical decision making, as well as to facilitate communication to and from non-experts, bridging the gap between scientists and public perceptions. On one level, data can be defined as a disorganized collection of facts. Information can be defined as organized data that enables interpretation and insight. Knowledge is understanding what happens. Wisdom is the ability to reliably predict what will happen. Cyberinfrastructure is essential to move between these levels.

The activity of validation and consistency checking is extremely labor-intensive, but essential. Tools should be developed that can cross-check data as much as possible, and experimental and computational results can be used for mutual screening. One example is the automatic evaluation of consistency for thermodynamic-property data performed by the NIST Thermodynamics Resource Center (a toy example of such a cross-check appears at the end of this section). Stored data and information should reference details of how the data were acquired (the “metadata” or “pedigree”) to enable experts to evaluate the data. It would be beneficial to have authors assign uncertainty to published data, or to have sufficient information available for an expert or expert system to quantify the uncertainty in a measurement or predictive model. If data are very crude, with a high uncertainty (e.g., more than an order of magnitude), this is important for a non-expert to know, as that person may have to engineer a system with greater allowances.

Converting non-evaluated data in paper legacy systems into a validated, refereed database is an extremely time-consuming activity that is currently done by experts only in their spare time. Is this a valuable use of an expert’s time? How else can we do it? Are there some aspects of evaluation that could be performed by an expert system? How do we ensure that data published from now on can be readily evaluated and captured? New approaches to these problems should be supported.

Standards will be very important for interoperability and communication, but how do we get people to adopt and use these standards? Should journals require them? The demand
on experts who evaluate data after publication could be relieved by having journals require the entry of raw as well as evaluated data, uncertainty estimates, and similar metadata. Standardization can enable automated data capture. Standardization should be tiered, perhaps consistent with the maturity of the data or medium being captured; it would be very effective in capturing the current literature, for example. Raw data have the longest-lasting value and should be archived, since interpretations may change over time as science and understanding progress. Long-term archives, legacy databases, and other collections of knowledge, especially non-electronic ones, require experts to extract information. How will we need to access information in the future?

Visualization is critical. Data need to be visualized in a manner consistent with how a person thinks. This may vary to some degree depending upon the field of expertise: visualization of the same process for a chemist may differ from what is most effective for a physicist, materials scientist, biologist, or lay person. Creative visualization at a fundamental level may be an effective way to bridge vocabulary and conceptual differences across disciplines. Other issues include the new database paradigms for collection, archiving, data mining, validation, and retrieval needed to facilitate interdisciplinary collaborations. Interoperability between databases at different levels, as well as common user interfaces, will facilitate data mining. Automated dictionaries and translators would greatly facilitate communication between scientific disciplines and between scientists and the lay public; this will require new software tools.

There is no substitute for critical thinking and the human element. Hence, for discovery, the emphasis in the near term should be on tool development to archive, gather, extract, and present data and information in a manner that enables creative thinking and insight. The development of artificial intelligence to draw conclusions from data and information may at this stage be best suited to the collection and evaluation of objective, quantitative information such as property data; this will certainly evolve in the future.
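As a toy example of the automated cross-checking discussed above, the sketch below tests whether a reported reaction enthalpy is consistent, within propagated uncertainties, with tabulated enthalpies of formation. The tabulated values and uncertainties are illustrative placeholders rather than evaluated data.

```python
# Toy consistency check: does a reported reaction enthalpy agree, within propagated
# uncertainty, with tabulated enthalpies of formation? (Illustrative numbers only.)
import math

dHf = {   # enthalpy of formation in kJ/mol: species -> (value, uncertainty)
    "CO2": (-393.5, 0.2),
    "CO":  (-110.5, 0.2),
    "O2":  (0.0, 0.0),
}

def reaction_enthalpy(products, reactants):
    value = (sum(n * dHf[s][0] for s, n in products)
             - sum(n * dHf[s][0] for s, n in reactants))
    unc = math.sqrt(sum((n * dHf[s][1]) ** 2 for s, n in products + reactants))
    return value, unc

# CO + 1/2 O2 -> CO2, compared against an independently reported value.
calc, u_calc = reaction_enthalpy([("CO2", 1)], [("CO", 1), ("O2", 0.5)])
reported, u_rep = -283.0, 0.3
consistent = abs(calc - reported) <= 2.0 * math.sqrt(u_calc**2 + u_rep**2)
print(f"calculated {calc:.1f} +/- {u_calc:.1f} kJ/mol; consistent: {consistent}")
```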
4. Education Infrastructure

Educational activities represent an important component of virtually all topics associated with any emerging cyber-enabled chemistry (CEC) project or initiative. Because of CEC’s multidisciplinary focus and the nature of new science potentially facilitated by it, even individual investigators in single, well-defined subdisciplines of computational chemistry are likely to benefit from educational components that might be developed. Defining the possible scope of these educational endeavors, as broad as they might be, is a helpful first step in this vision-generating process.

The prototype audiences for educational efforts can be divided into four groups, though they undoubtedly share some information needs. The first group is composed of research scientists, both within chemistry and from allied fields. While disparities certainly exist, by and large the individuals in this group can be described as problem-solving experts with strong motivation and capabilities for learning. Thus, non-chemists may need to become more proficient in molecular sciences, while chemists might require education in computational methodologies and limitations –
but all members of this category are probably capable of self-directed instruction given appropriate materials.

A second category of professionals who will require education comprises established and emerging educators themselves. If students are to be reached, particularly at early points in their studies, those who teach them will need both greater depth and greater breadth of information about the nature of CEC. High-school teachers in particular may face barriers to learning associated with missing background (in physical chemistry or mathematics) or a concomitant fear of materials that appear to require extensive background to be understood.

Educators, at either the high-school or the introductory college level of the curriculum, serve as the conduit to the next vital group, students (in the traditionally defined sense). Computationally related science in general, and CEC in particular, should be infused into the undergraduate curriculum. Inclusion of computational philosophy in the undergraduate chemistry curriculum is important both from an educational perspective and from a pragmatic one: future scientists who populate the world of CEC will need longer and more-complete training in scientific computing and modeling methodologies. Computational approaches must be introduced carefully into classrooms and laboratories, so as to avoid fueling student misconceptions about chemistry – a possibility that will probably necessitate continuing educational research as new materials, methodologies, and curricula are developed.

The fourth identifiable group consists of the general public and the legislative leadership within political bodies, particularly at the federal level. CEC is likely to provide the ability to construct complex models whose accomplishments and limitations must be communicated in an intelligible way to the general public. Because there is some tendency among non-scientists to view science as a body of factual information, it will be necessary to address and, if possible, forestall confusion associated with findings from CEC-related models that, while important in advancing understanding, do not represent the “final” step in the treatment of a particular topic. The intricacies of complex modeling may not need to be expounded upon in this forum, but the process of model building, with an eye to both its probabilities for success and its potential limitations, should be presented.

Cyberinfrastructure Challenges. The educational demands of training for a multidisciplinary CEC environment pose specific challenges and opportunities. It is possible to specify certain educational activities likely to support large-scale development and deployment of CEC in the future, such as instructional materials, software, and middleware.

Instructional materials. Self-training or tutorial materials are an important component of the educational strategies for several of the four groups delineated above. These materials should be developed with several factors in mind. First, materials for learning computational chemistry, visualization tools, and reaction animations already exist; we need to benefit fully from lessons gained in the creation of those materials.
Second, in the same way that workflow models for the CEC enterprise require human-factors research, educational materials must be tailored to the cognitive demands made on the target audiences, the different cognitive capabilities of different audiences, and the fundamental constraints associated with any asynchronous learning environment. Because,
for example, online educational materials tend to stress certain innate cognitive skills in the learner, educational research regarding differences in learning styles is likely to be valuable as part of the development process. Ultimately, materials development in CEC-oriented education may best proceed as multiple small projects, with users competitively “rating” the materials offered. A self-assessment component to calibrate users’ learning gains might also provide important systematic data for educational assessment and overall project evaluation.

Software and middleware. Educational software and middleware constitute another important area for future development. It will be important to develop interdisciplinary educational materials and programs that include both computational chemistry and more-fundamental computational sciences. As CEC works to help practitioners maximize the efficiency of their modeling efforts, the lessons learned in improving research productivity will probably also enhance the effectiveness of new learning materials. Interfaces to state-of-the-art computational resources for novices should be both intelligent and multilevel. As novice users gain proficiency in specific modeling technologies, the educational interfaces they are using should automatically allow those users more flexibility and reveal new, more complex modes of the CEC environment. Some students in this category will already be expert learners. The CEC modeling environment’s power, as well as its limitations, should be emphasized in ways that scientists who are not computer modelers will find useful.

Specific issues associated with cyberinfrastructure are common to educational efforts as well. Cybersecurity is important across the entire spectrum of CEC deployment, including educational developments. Networking-infrastructure disparities (including those that exist on local scales within most or all educational institutions) may play a bigger role in the educational components of CEC than in its research components, where infrastructure is more nearly uniform. Finally, changing hardware availability in the educational environment may make issues such as the interoperability of educational materials even more pressing within education than within CEC as a whole.
5. Remote Chemistry
The emerging national cyberinfrastructure is already enabling new scientific activities through, for example, the use of remote computers, the development and use of community databases, virtual laboratories, electronic support for geographically dispersed collaborations, and numerous capabilities hosted as web-accessible services. New virtual organizations are being established (such as virtual centers of excellence) that assemble distributed expertise and resources to target research and educational grand challenges. Cyberinfrastructure research currently being driven by other scientific domains – in areas such as scientific portals; workflow management; computational modeling; and data analysis, visualization, and management – is clearly relevant to chemistry as well. However, certain characteristics of the chemistry community – specifically, the broad range of computational techniques and data types in use and the large number of independent data producers – pose unique challenges for remote chemistry. Distributed-database federation, sample and data provenance tracking (e.g., as in laboratory information management systems, or LIMS), and mechanisms to support data fusion and community curation of data are particularly relevant to chemistry and thus are areas where this community may drive cyberinfrastructure requirements.
In addition, environmental chemistry – which may soon involve experiments drawing data from thousands to millions of sensors – and high-throughput chemistry will be leading-edge drivers for new cyberinfrastructure capabilities.

While the term “remote chemistry” suggests an emphasis on bridging physical distance, the more challenging gulfs to bridge are, in fact, differences among distributed collaborations and organizations in culture, level of expertise, organizational practice and policy, and scientific vocabulary. Close interaction between practicing chemists and information-technology developers, iterative approaches to development and deployment, and mechanisms to share best practices will all be critical in developing new remote-chemistry capabilities that meet the needs of a diverse chemistry community. Remote communities and practitioners may also be confronted with social and cultural constraints that are currently poorly understood. For example, members of certain constituencies (e.g., based on ethnicity, race, culture, nationality, age, and/or gender) may adapt to the remote-community concept far more readily after initial strong personal or even face-to-face contact with the other members of the community. Expecting remote communities to develop spontaneously and rapidly in a manner that reflects the current population of interest groups may or may not be realistic. Social research may be necessary to understand how to expand remote communities to accurately reflect national and international demographics.

Access to and Use of Remote Instruments. Advances in information technologies have made it possible to access and control scientific instruments in real time from computers anywhere on the Internet. Technologies such as Web-controlled laboratory cameras, electronic notebooks, and videoconferencing provide a sense of virtual presence in a laboratory that partially duplicates the experience of being there. More than a decade of R&D and technological evolution has greatly reduced the time and effort required to offer secure remote-instrument access and has proved the viability of remote-instrument services. Facilities such as Pacific Northwest National Laboratory’s Virtual NMR Facility have migrated from research projects to ongoing operations, and setting up a new instrument for remote operation can now be as simple as running screen-sharing software or enabling remote options in control software (e.g., in National Instruments’ LabVIEW).

The numerous benefits provided by access to remote instruments include sharing the acquisition, maintenance, and operating costs of expensive, cutting-edge instruments; broadening the range of capabilities available to local researchers and students; more-effective utilization of instruments; and easing the adoption of new techniques in research projects and student laboratory courses. While there can be drawbacks to remote facilities – for example, conflicts between the service and research missions of a facility, loss of “bragging rights” and control of instruments, and loss of contact with colleagues at an instrument site – the potential benefits far outweigh the drawbacks. Enhanced access to remote instruments would benefit the chemistry community.
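As a minimal sketch of how controlled remote access might be enforced in software – for instance, permitting remote data acquisition while reserving recalibration for on-site staff – the fragment below uses hypothetical command and role names in Python; it illustrates the general access-control idea and is not modeled on any actual facility’s protocol.

```python
# Hypothetical sketch: role-based gating of commands sent to a remote
# instrument controller. Command names, roles, and the demo instrument are
# illustrative assumptions, not any facility's actual interface.

ALLOWED_COMMANDS = {
    "remote_student": {"status", "acquire_spectrum"},
    "remote_collaborator": {"status", "acquire_spectrum", "set_temperature"},
    "onsite_operator": {"status", "acquire_spectrum", "set_temperature",
                        "recalibrate", "shutdown"},
}

def dispatch(role: str, command: str, instrument):
    """Execute `command` on `instrument` only if `role` is cleared for it."""
    if command not in ALLOWED_COMMANDS.get(role, set()):
        raise PermissionError(f"role {role!r} may not issue {command!r}")
    return getattr(instrument, command)()  # e.g. instrument.acquire_spectrum()

class DemoSpectrometer:
    """Stand-in for a real instrument driver."""
    def status(self):
        return "idle"
    def acquire_spectrum(self):
        return [0.1, 0.4, 0.2]  # placeholder data
    def recalibrate(self):
        return "recalibrated"

if __name__ == "__main__":
    nmr = DemoSpectrometer()
    print(dispatch("remote_student", "acquire_spectrum", nmr))  # allowed
    try:
        dispatch("remote_student", "recalibrate", nmr)          # blocked
    except PermissionError as err:
        print("blocked:", err)
```

An actual remote-instrument service would layer authentication, logging, and scheduling on top of such a gate, but the core decision of who may issue which command can remain this simple.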
Remote access to expensive, high-end, state-of-the-art instruments will maximize their scientific impact, serve broader audiences, and allow more widespread use of current-generation technologies in both research and education.
Technical support for planning and operating facilities will be a key enabler. On the other hand, adoption of this new mode of research and education is limited by potential users’ and facility operators’ unfamiliarity with state-of-the-art networking and distributed-computing technologies and with the best practices developed by existing remote facilities, as well as by the learning curve associated with the software tools themselves. Cyberinfrastructure research in support of remote facilities will be needed in several areas, including continued improvements in ease of use and support for multiple levels of instrument access (e.g., simplified interfaces for novice users, or the ability to allow data collection while prohibiting instrument recalibration), mechanisms for coordinating across experiments (e.g., experiments guided by simulation results or by other experiments, or the creation of large-scale shared community data resources that aggregate individual remote experiments), and the management of distributed facilities (e.g., with instruments and experts in various techniques at multiple sites).

Access to and Use of Advanced Computational Modeling Capabilities. Computational chemistry, in all of its forms, has made enormous advances. It is now possible to predict the properties of small molecules to an accuracy comparable to that of all but the most sophisticated experiments. Computational studies of complex molecules (e.g., proteins) have provided insights into their behavior that cannot be obtained from experiment alone. Investments are still required to continue to advance the core areas of computational chemistry. But high-bandwidth networking, remote computing, and distributed data and information storage, along with resource discovery and wide-area scheduling, promise to spark the development of new computational studies and approaches, providing opportunities to solve large, complex research problems and to open new scientific horizons. Of particular interest here are portals, workflow management, and distributed computing and data storage, especially as envisioned in the notion of the “grid,” whose goal is to couple geographically distributed clusters, supercomputers, workstations, and data stores into a seamless set of services and resources. Grids have the potential to increase not only the efficiency with which computational studies may be performed but also the broader community’s access to computational approaches. In this regard, an important target for the chemistry community will be to develop tools that allow scientists to couple computational codes together to build complex, flexible, and reusable workflows for innovative studies of molecular behavior.

Collaboratories. Collaboratories enable researchers and educators to work across geographic and organizational as well as disciplinary boundaries to solve complex scientific and engineering problems. They enable researchers and educators to share computing and data resources as well as ideas and concepts, to collaboratively execute and analyze computations, to compare the resulting output with experimental results, and to collectively document their work.
While early collaboratories tended to focus on rich interactions in small groups or lightweight coordination within a community, next-generation collaboratories will be able to operate far more effectively, allowing large groups to organize to tackle grand-challenge problems, form subgroups as needed to accomplish tasks, and publish results that are then made available to the larger community.
Examples of such activities that are relevant to the chemistry community include the Collaboratory for Multiscale Chemical Science (http://cmcs.org/), which is being used by groups of quantum chemists, thermodynamicists, kineticists, reaction-model developers, and reactive-flow modelers to coordinate combustion research, as well as the National Biomedical Informatics Research Network (http://www.nbirn.net), which supports researchers studying neurobiology across a wide range of length and time scales. These projects and other emerging frameworks allow scientists and engineers to securely access distributed data and computational services, share their work within small groups or across the community, and collaborate directly via conferencing, desktop sharing, and similar tools. Over the next few years, collaboratories will provide increasingly powerful capabilities for community data curation (tracking data provenance across disciplines, assessing data quality, annotating shared information), automating analysis workflows, and translating across the formats and models used in different subdomains. Collaboratories have great potential in chemical research and education, particularly in bringing together researchers from multiple subdisciplines and multiple cultures. The solutions of many grand-challenge problems in chemistry – e.g., the design of new catalysts and more-efficient photoconversion systems, or the integration of computation into the chemistry curriculum – would benefit greatly from the services provided by collaboratories.
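To give a concrete flavor of the coupled, reusable workflows envisioned in the modeling-capabilities discussion above, the following sketch (Python, with placeholder stage functions that stand in for real chemistry codes) threads each stage’s output, together with a crude uncertainty estimate, into the next stage; production grid-workflow systems add scheduling, provenance tracking, and fault tolerance on top of this skeleton.

```python
# Hypothetical sketch of a modular, reusable workflow: each stage consumes the
# previous stage's result plus its reported uncertainty. Stage functions are
# placeholders, not calls into any actual chemistry package.

from dataclasses import dataclass
from typing import Callable

@dataclass
class StageResult:
    value: dict          # e.g. optimized geometry, rate constants, ...
    uncertainty: float   # crude scalar stand-in for richer error metadata

def optimize_geometry(inp: StageResult) -> StageResult:
    # Placeholder for a quantum-chemistry geometry optimization.
    return StageResult(value={**inp.value, "geometry": "optimized"}, uncertainty=0.01)

def compute_rate_constants(inp: StageResult) -> StageResult:
    # Placeholder for a kinetics calculation using the optimized structure.
    return StageResult(value={**inp.value, "k(300K)": 1.2e9},
                       uncertainty=inp.uncertainty + 0.05)

def run_workflow(initial: StageResult,
                 stages: list[Callable[[StageResult], StageResult]]) -> StageResult:
    """Thread one stage's output (and uncertainty) into the next stage."""
    result = initial
    for stage in stages:
        result = stage(result)
        print(f"{stage.__name__}: cumulative uncertainty {result.uncertainty:.2f}")
    return result

if __name__ == "__main__":
    start = StageResult(value={"system": "CH4 + OH"}, uncertainty=0.0)
    final = run_workflow(start, [optimize_geometry, compute_rate_constants])
    print(final.value)
```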
Workshop Summary
Crossing Bridges: Toward an Unbounded Chemical Sciences Landscape. As illustrated by recurring themes in the breakout-session reports, workshop participants shared a consensus vision of cyber-enabled chemistry: that of an unbounded chemical sciences landscape in which different disciplines, computational and experimental methods, institutions, geographical areas, levels of user sophistication and specialization, and subdisciplines within chemistry itself are bridged by seamless, high-bandwidth telecommunications networks, computing resources, and disparate databases of curated data of known pedigree that can be conveniently accessed, retrieved, and processed by modular algorithms and compatible codes. Realizing this vision will require significant enhancements and expansions of the existing cyberinfrastructure, as well as the development and deployment of innovative models and technologies. With regard to accelerating progress in this direction, the consistency with which certain themes were voiced by workshop participants suggests a consensus across the chemical sciences community concerning recommended courses of action. In particular, breakout-group participants:
• noted the community’s increasing acceptance of and reliance upon simulation as a tool in chemistry. Participants concurred that advances in the field will be achieved through an “equal partnership” between simulation and experiment, whereby simulations first corroborate and interpret existing experiments and, subsequently, suggest new experiments.
• observed that the chemical sciences are characterized by the use of a broad range of computational techniques and data types and by a large number of independent data
producers. Certain areas of the field, such as environmental and high-throughput chemistry, are hugely data-intensive, and accumulated data must be available to be shared and processed in a distributed fashion by collaboratories.
• agreed on the importance both of using cyberinfrastructure to educate audiences at all levels – from K-12 through the college level to the broad public sector – about science topics (e.g., genetically engineered crops, stem cells), and of introducing cyberscience techniques into undergraduate chemistry curricula.
• noted that major advances in networking and distributed-computing technologies can make possible new modes of activity for chemistry researchers and educators by allowing them to perform their work without regard to geographical location – interacting with colleagues and accessing instruments, sensors, and computational and data resources scattered all over the country (indeed, the world). These new modes have great potential to enlarge the community of scientists engaged in advancing the frontiers of chemical knowledge and in developing new approaches to chemical education. However, this potential will not be realized unless chemical researchers and educators actively participate in guiding cyberinfrastructure research and development, obtain the needed assistance and support as these new technologies are deployed, and take advantage of the new forms of organization that these technologies make possible.
Recommendations
(1) It is suggested that NSF Chemistry research funding criteria for cyberinfrastructure focus on the chemical science drivers themselves, encouraging investigators to define the amount and type of bridging activities and mechanisms that would best enable the focus on their next-generation grand-challenge problems. Development of cyberinfrastructure that maximizes the impact of new chemistry in the grand-challenge arena is strongly recommended. A focus area of Multiscale Modeling in the Chemical Sciences would be an example of an immediate effort that would drive both the underlying chemistry problems of bridging length and time scales and the cyberinfrastructure investments described below. As long as individual-PI funding is not negatively impacted, the NSF should sponsor multidisciplinary, collaborative projects such as Multiscale Modeling and ensure that funding is specifically allocated for the development of cyberinfrastructure as a partner to scientific research. Early-year projects would allow the cyberinfrastructure to be guided (and critiqued) as it is developed, and would give the software developers both real data and scientists with whom to work. Consideration should be given to the funding of joint academic/industry projects wherein both existing and new information technologies can be exploited. A collaborative environment should be established bridging the cyberinfrastructure portions of all of the projects, with a funded technical person managing synergies across the cyberinfrastructure backbone, developing standard data models and API representations, acquiring knowledge – and, potentially, software – that could be passed on to future projects, and gathering requirements and use cases for future rounds of cyberinfrastructure for chemistry.
The chemical research community’s needs, going forward, include:
• Modeling protocols that bridge different subdisciplines and that can collectively represent chemical systems over many orders of magnitude of time and space scales, with output from one computational stage (along with accompanying information about precision) becoming input to the next stage
• Modular algorithms that can be used in diverse combinations in varying applications and/or automated, freeing researchers to concentrate on chemistry issues rather than computational “busywork”
• Improved tools for visualizing complex macromolecules and their interactions as well as large data sets
• Access to shared, heterogeneous community databases with accompanying information about data provenance and with standardized formats that allow crosstalk among, and data fusion from, different database types across the interfaces of different disciplines
• Person-to-person communication technologies (e.g., screen sharing with audio/video) to render collaborations more effective
• Grid-enabling technologies – speedy, interoperable hardware and software and remote-instrumentation capabilities – to enable both more-efficient collaboration in computational studies and broader access in general
(2) It is suggested that NSF Chemistry research funding criteria for cyberinfrastructure support for the educational community promote interdisciplinary educational-materials centers to accelerate the development and deployment of useful educational materials from ideas generated by individuals in any of the chemistry subdisciplines. It is unlikely that single individuals will be in a position to develop an entire array of educational materials; it may therefore be prudent to build centers (perhaps cyber-centers) conducive to collaborations for developing new educational materials. A gathering of experts – in scientific-computing content, software engineering, education and education research, and human factors – might provide an environment that could speed the development and deployment of useful educational materials from ideas generated by individuals from any of the subdisciplines of the fundamentally multidisciplinary CEC community. Thus, a computational chemist who has an idea about how to present a topic (such as, say, sampling in statistical-mechanical models) could bring that idea to an established center and receive assistance in developing and deploying the envisioned teaching module. Such an interdisciplinary educational-materials center is only one type of structure that could be devised. For example, the ability of local campuses to build specifically interdisciplinary programs – minors in scientific computing, perhaps, or interdisciplinary components in established curricula – would help sustain CEC, once it is established, by preparing students who could readily function within this environment. The development of an environment that uses computational resources to solve research problems also opens immediate avenues for enhanced research in chemistry education and, more generally, science education.
The capture of data about the usage of both educational materials and research problem-solving workspaces might yield insight into the cognitive factors at play in the development of problem-solving facility. In this manner, the development and deployment of CEC and its accompanying educational resources can help to build our understanding of the learning process in a more general sense, providing an important spin-off benefit. Specific recommendations include:
• Development of “tunable” algorithms that can be configured for several different levels of user proficiency
• Development of improved visualization tools to educate the broad public and novice chemists
• Promotion of the blending of experiment and simulation in the education system
• Funding of cognitive research on learning-style differences in order to tailor and optimize remote or asynchronous learning approaches
(3) It is suggested that NSF Chemistry seek partnerships with appropriate divisions within NSF, such as CISE, or explore other partnering mechanisms to help fund and support the development and deployment of hardware, databases, and remote-collaboration technologies.
Hardware Infrastructure. The chemical sciences will require computer performance well beyond what is currently available, but computational speed alone will not ensure that a computer is useful for any specific application area. Several other factors are critical, including the size of primary system memory (RAM), the size and speed of secondary storage, and the overall architecture of the computer. Grid computing, visualization, and remote instrumentation are areas that continue to evolve toward becoming standard cyberinfrastructure needs. Attention to the design, affordability, and procurement of desktop, midrange, and high-performance computing solutions is needed. Access to disk farms and midrange computing facilities, and sustained support for high-performance computing solutions available for “grand challenge” problems, are important long-term investments. Specific recommendations include:
• To realize cyber-enabled chemistry, or cyber-enabled science in general, the NSF and, in particular, the Chemistry Division must track and cooperate with other agency programs such as the DARPA high-productivity computing program, the DOE Leadership Class System Development program, the INCITE program at DOE’s NERSC facility, NASA’s Leadership Class systems, the NIH NCBC program, and the DOD computational-infrastructure program. The last of these has significant experience with midrange systems such as the regional capacity systems outlined above, and NSF should explore lessons learned from these facilities.
• The NSF and the Chemistry Division should also explore stronger industry partnerships to develop hardware and software, where appropriate.
Databases and Cheminformatics. Chemistry is becoming an information science, especially in industry, and the field is poised to put new information-science and data-management technologies directly into practice.
New techniques and solutions from computer science will focus on developing new databases, on the integration and interoperability of databases, on distributed-networking data exchange, and on the use of ultra-high-speed networks to interconnect students, experimentalists, and computational chemists with publicly funded data repositories. Data warehousing and data management will be critical issues. Database federation should be encouraged by establishing community standards for data curation and cybersecurity, and by establishing standards that promote interoperability of algorithms and hardware. It is important to create incentives (and, of equal importance, to overcome disincentives) for the sharing of individual researchers’ data, and to encourage free access to cross-disciplinary literature. Specific recommendations include:
• Bridging model and experiment for mutual validation and quality screening should be a top priority. Predictive models and simulations are among the most effective and sustainable tools for capturing and disseminating knowledge, particularly for use by nonexperts. They can also be used to perform quality checks on experimental results. The models, however, must be validated. This requires investment both in experimental data and standard reference simulations and in model development (validated models with quantified uncertainty). This work is not glamorous, but it is an essential foundation. Models based on first principles are evergreen and provide insight into mechanisms.
• Flexible cyberinfrastructure is needed to answer the questions of the future. We will be asking different questions tomorrow from the ones we are asking today, and it is not possible to anticipate every potential query or application need and develop an enduring solution. There is an underlying foundation of factual data that we know we need: standardized ways to enter, archive, and extract published chemical and physical experimental data from legacy systems, and to actively capture and disseminate new data as they are published. We need to go beyond this foundation and develop methods whereby the cyberinfrastructure learns and grows as needs change.
Remote Instrumentation. Access to remote instruments will become increasingly important as these instruments become increasingly sophisticated and expensive. It is important to examine which categories of instrumentation are most important for chemists to access remotely, and what the most pressing issues are in developing cyberinfrastructure (computing, data, networking) for virtual laboratories (the remote control of instrumentation).
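One way to picture the “standardized ways to enter, archive, and extract” data called for above is a minimal, provenance-carrying record; the sketch below (Python, with purely illustrative field names that do not correspond to any community standard) shows the kind of self-describing entry federated chemistry databases might exchange.

```python
# Hypothetical sketch of a self-describing, provenance-carrying data record of
# the sort a federated chemistry database might exchange. Field names are
# illustrative only; no community standard is implied.

import json
from dataclasses import dataclass, asdict, field

@dataclass
class ChemDataRecord:
    property_name: str            # e.g. "standard enthalpy of formation"
    species: str                  # e.g. an InChI or other identifier
    value: float
    units: str
    uncertainty: float            # reported or estimated
    method: str                   # "experiment" or a model / level of theory
    source: str                   # citation, DOI, or instrument/run identifier
    curation_notes: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize to JSON so heterogeneous databases can exchange records."""
        return json.dumps(asdict(self), indent=2)

if __name__ == "__main__":
    rec = ChemDataRecord(
        property_name="standard enthalpy of formation",
        species="InChI=1S/CH4/h1H4",
        value=-74.6, units="kJ/mol", uncertainty=0.3,
        method="experiment",
        source="hypothetical-legacy-table-42",
        curation_notes=["entered from a legacy compilation; units converted"],
    )
    print(rec.to_json())
```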
References
[1] Information and Communications: Challenges for the Chemical Sciences in the 21st Century. Organizing Committee for the Workshop on Information and Communications, Committee on Challenges for the Chemical Sciences in the 21st Century, National Research Council, 2003.
[2] Beyond the Molecular Frontier: Challenges for Chemistry and Chemical Engineering. Committee on Challenges for the Chemical Sciences in the 21st Century, National Research Council, 2003.
[3] Revolutionizing science and engineering through cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure. http://www.cise.nsf.gov/sci/reports/toc.cfm