Nsf 2003 Ci In The Biological Sciences

Any opinions, findings, conclusions, or recommendations expressed in this report are those of the workshop participants, and do not necessarily represent the official views, opinions, or policy of the National Science Foundation.

Building a Cyberinfrastructure for the Biological Sciences (CIBIO)

2005 and Beyond: A Roadmap for Consolidation and Exponentiation

A Workshop Report
July 14-15, 2003
Chair, John C. Wooley
Subcommittee on 21st Century Biology
NSF Directorate for Biological Sciences Advisory Committee (BIOAC)

Table of Contents

Abstract

Executive Summary
  Introduction: The Context for Deliberation
  The Unique Case for Including Biology in CI Development
  Intrinsic Aspects of the Biological Sciences
  Multiplying Exponentials through an Extensive Partnership
  The Essence of the Objectives for NSF BIO
  Resource Requirements and Initial Stages of Implementation
  Education and Training
  Coordination and Collaborations

Prelude (Commentary and Prescience in the Biological Sciences)

The Workshop Report
  Specifying and Building a CI to meet the Requirements of Biology
  The Opportunity Created by Building a BIO-Specific CI
    Invest in People
    Ensure Science Pull, Technology Push
    Stay the Course
    Prepare for the Data Deluge
    Enable Science Targets of Opportunity
    Select and Direct Technology Contributions
    Establish National and International Partnerships
  Implementation Strategies
    Funding and Management Mechanisms
    Outreach and Partnerships
  Immediate Steps for BIO

Appendices
  I. Central Questions for CIBIO
  II. Workshop Schedule and Assignments
  III. Workshop Participants
  IV. Overview of Blue Ribbon Panel/Atkins Report on CI
  V. Models for a Comprehensive CIBIO based on Extant Test Beds
  VI. Potential Prototypes for BIO Implementation
  VII. References for CIBIO Report

A Cyberinfrastructure View: Envisioning and Empowering Successes for 21st Century Biological Sciences

Creating and sustaining a comprehensive cyberinfrastructure (CI; the pervasive applications of all domains of scientific computing and information technology) is as relevant and as required for biology as for any science or intellectual endeavor; in the advances that led to today’s opportunity, the BIO Directorate made numerous, ad hoc contributions, and now can integrate its efforts to build the complete platforms needed for 21st Century Biology. Doing so will accelerate progress in extraordinary ways.



The time for creating a CI for all of the sciences, for research and education, has arrived and NSF will lead the way. BIO must co-define the extent and fine details of the NSF structure for CI, which will involve major internal NSF partnerships and external partnerships with other agencies, and will be fully international in scope.



Only the biological sciences have seen revolutionary advances as remarkable and sustained as those in the computer and information sciences. Only in the past few years has the world of computing and information technology become fully applicable to the wide range of cutting-edge themes characteristic of biological research. Multiplying the exponentials (of continuing advances in computing and bioscience) through deep partnerships will inevitably be exciting beyond any anticipation.



The stretch goals for the biological sciences community include both a need for community-level involvement and for the complete spectrum of CI; namely: People and Training; Instrumentation; Collaborations; Advanced Computing and Networking; Databases and Knowledge Management; Analytical Methods (Modeling and Simulation).



NSF BIO must:
  o Invest in People
  o Ensure Science Pull, Technology Push
  o Stay the Course
  o Prepare for the Data Deluge
  o Enable Science Targets of Opportunity
  o Select and Direct the Technology Contributions
  o Establish National and International Partnerships



The biology community must decide how best it can interact with the quantitative science community, where and when to intersect with computational sciences and technologies, how to cooperate on and contribute to infrastructure projects, and how NSF BIO should partner administratively. An implementation meeting, as well as briefings to the community through professional societies, will be essential.



For NSF BIO to underestimate the importance for biology, or fail to provide fuel over the entire journey, would severely retard progress and be very damaging for the entire national and international biological sciences community.


Executive Summary

Introduction: The Context for Deliberation

The biological sciences are at a critical juncture in their history, having absorbed over several decades the tremendous successes of “reductionist” experimentation, that is, of carefully focused investigations on simpler systems, model organisms and biological abstractions/models. Today, as the direct consequence of such extraordinary and even unanticipated successes, a new era of synthesis pervades thinking about the future of biological research, from macromolecules to ecosystems. To inform the process and deliver this synthesis, biological scientists must collect, organize, analyze and comprehend unprecedented volumes of highly heterogeneous, hierarchical information obtained by different means or modalities, with different standards and widely varying types of data, over vast scales of time, space and organizational complexity. NSF recently introduced the term “cyberinfrastructure” (CI) to describe the integrated, ubiquitous, increasingly pervasive application of scientific computing (SC) and information technology (IT) approaches, which are already changing both science and society. For example, a pervasive infrastructure arising from scientific computing and information technology will provide the circumstances and platforms to enable robust, widely distributed research teams or collaboratories, user-friendly interfaces to fully integrated information from multicomponent systems, and also the software and hardware for advanced simulation and modeling projects that are directly and tightly coupled with experimental studies and provide interactive, iterative capacities to refine our knowledge. Such approaches are already essential requirements for many features of contemporary scientific research.
A CI will do many things, but among the most important are to provide a means to establish: (1) the tools for capturing, storing, and managing data; (2) the tools for organizing, finding and analyzing the data to obtain information; (3) the connection of experimental and theoretical analyses and their interplay with simulations and complex models based on that information; and (4) the integration of disparate aspects of that information to provide a synthesis, a knowledge repository for further considerations. The reception of the concept of CI as a maturing, philosophical and practical perspective – that is, on the profound revolution provided through today’s integration of continuing advances in SC and IT - has been truly remarkable, with the entire worldwide community of scientists joining the dialogue. People, and their ideas and tools, are at the heart of CI. Building a fully effective cyberinfrastructure for science and society will require educating a new generation, although the technologies and the effort itself will generate new training environments and open novel options for enriching understanding of science for both technical features and for community relationships. 
After full implementation, including the training of a new cadre of scientists, a comprehensive CI (for any community) will address (1) the provision of routine, remote access to instrumentation, computing cycles, data storage and other specialized resources; (2) the facilitation of novel collaborations that allow new, robust, interdisciplinary and multidisciplinary research team efforts among the most appropriate individuals at widely separated institutions to mature; (3) the powerful and ready access to major digital knowledge resources and other libraries of distilled information including the scientific literature; (4) platforms or vehicles for the integration of information from multiple, highly diverse and distributed sources; (5) new training environments; and (6) other features essential to contemporary research.


The Unique Case for Including Biology

The history of the basic science community studying biology and of its federal sponsor, NSF BIO, is especially rich in prescience and sustained commitment. BIO invested early, and in numerous ways, in the advancing IT world that is now leading to a comprehensive cyberinfrastructure. The original investments, for example, ranged from ecology and the LTERs to structural biology and the PDB. In partnership with CISE, BIO also invested in a very wide range of high performance scientific computing opportunities, such as biophysical and neuroscience modeling, telescience or remote access to specialized instrumentation, and the requisite visualization, networking and database tools. Today, there is an extraordinary opportunity to consolidate those activities and thereby to build a compelling, integrated program that could arise only from BIO in its home at NSF. Specifically, building a cyberinfrastructure for the biological sciences requires an interface to all of the quantitative sciences as well as to computer science and engineering, and this can only happen at NSF. Already, we have seen examples, such as NCRR/NIH’s cyberinfrastructure prototype, the Biomedical Informatics Research Network (BIRN), where mission agencies have recognized NSF’s contribution and begun to establish CI activities to meet their needs. Numerous other examples will follow to meet the goals of those missions. Indeed, BIO’s previous and continuing investments will catalyze revolutionary change, not merely incremental improvements, around the world. CI is ideally suited for the cottage industry that is biology, due to the revolution in grid services, data integration, and modern information technology. This revolution can now be coupled with the advent of a biological research approach, focused at a systems level, that is integrative, synthetic and predictive, or what NSF calls 21st Century Biology.
The vantage point gained by looking at research issues in biology from a synthetic point of view, including the characterization of interacting processes, and the integration of informatics, simulation and experimental analysis, represents the central engine powering the entire discipline.

Cyberinfrastructure Enabled BioScience Research
  • Multidisciplinary
  • Multidimensional
  • Information-driven
  • Education-oriented
  • Internationally engaged

Not only does 21st Century Biology absolutely require a strong cyberinfrastructure, but also, more than any other scientific domain, biology, due to its inherent complexity and the core requirement for advanced IT, will drive the future cyberinfrastructure for all science. NSF BIO must engage with CISE fully in setting the course, in establishing an architectural plan describing the specific needs of the biosciences, in assembling the parts, and in building a full-blown, highly empowering cyberinfrastructure for the entire biological sciences community. Complementing the compelling scientific case for building the cyberinfrastructure required by the biological sciences, there are very favorable administrative considerations in the context of NSF. Notably, the implementation of CI by NSF, in incremental fashion and tailored to each discipline’s needs, offers a special opportunity, a perfect fit, for the biological sciences. In terms of management, access and resources, NSF should assign the utmost priority to BIO, the only organization positioned to lead the response of the biological sciences community.

Intrinsic Aspects of the Biological Sciences

As our understanding of living systems increases, the inherent complexity of biology has become very obvious, so apparent as to pose a daunting challenge. Indeed, biology encompasses more than twenty orders of magnitude in time, more than ten orders of magnitude in space, and a hierarchical organizational space of enormous variety. As calculus has been the language of the physical sciences, information technology (informatics) is becoming the language of the biological sciences. Although biological scientists have typically managed data sets up to the limit set by each generation’s computing parameters (cycles, storage, bandwidth), what distinguishes biological data is not their sheer volume but the singular nature of observations, the individuality of organisms, the typical lack of simplifying symmetries, the lack of redundancy in time and space, and the depth of detail and of intrinsic features. The biological sciences, in settings around the world, will remain dominated by widely distributed, individual or small team research efforts, rather than moving to a particular focus on centralized, community facilities, as has happened for some sciences; reaching out to the broadest range of the best performers, wherever they are, is consequently particularly important. As telecommunication networks advance, biologists around the entire world will be able to explore and contribute to 21st Century Biology.

[Figure: PDB & Genome enabled Biology: Using Structure to Understand Function. Panel labels: DNA Sequence Provides Protein Sequence; Synchrotron Facilities Provide 3-D Protein Structure; DNA Sequence Implies Structure Implies Function; Accelerated Drug Development; Individualized Medicine; Productive, Healthy Citizens; Environmental Remediation; Biofuels, Biocatalysts; Improved Agriculture; Basis for 21st Century Medicine, Sustainable Development: Enhanced U.S. Competitiveness, Environmental Quality. Caption: A Cyberinfrastructure for BIO is Needed to Extract Implicit Genome Information.]

At the molecular level, for example, a cyberinfrastructure for biology, using tools developed to extract implicit genome information, will allow biologists to understand how genes are regulated; how DNA sequences dictate protein structure and function; how genetic networks function in cellular development, differentiation, health and disease. A CI for BIO must integrate the expertise and approaches of scientific computing and information technology with experimental studies at all levels; for example, on molecular machines, gene regulatory circuits, metabolic pathways, signaling networks, microbial ecology and microbial cell projects, population biology, phylogenies, and ecosystems.

Multiplying Exponentials through an Extensive Partnership

As the consequence of the parallel, fully comparable revolutions in biological research and in computer and information science and engineering, an extraordinary frontier is emerging at the interface between the fields. Both communities, and their Federal counterparts, BIO and CISE, can facilitate the research agenda of the other. 21st Century Biology absolutely requires, through the domain of scientific computing (SC), all of the insight, expertise, methodology and technology of advanced information technology (IT), arising from the output of computer science and engineering (CSE) and its vigorous interconnection with experimental research. Indeed, only the biological sciences, over the past several decades, have seen increases in knowledge, understanding and applicability as remarkable, sustained and revolutionary as those in the computer and information sciences.

The Essence of the Objectives for NSF BIO

Today, the exponential increases in these two domains make them ideal partners and the dynamics of the twin revolutions underpin the potential unprecedented impact of building a cyberinfrastructure for the entire biological sciences. Building on these successes, the essence, for BIO utilizing CI in empowering 21st Century Biology, is “Keep Your Eye on the Prize”:

  • Invest in People
  • Ensure Science Pull, Technology Push
  • Stay the Course
  • Prepare for the Data Deluge
  • Enable Science Targets of Opportunity
  • Select and Direct Technology Contributions
  • Establish National and International Partnerships

The most obvious feature of 21st Century Biology is the increasing rate of data flow and, simultaneously, the highly complex nature of the data, whether obtained through conventional or automated means. Responding to the enormous challenge requires that biological scientists be able to organize that data into information, analyze the information to create insight and knowledge, and synthesize disparate elements of our knowledge to create a deep understanding or wisdom. A passive role will not suffice when the vitality of the entire biology enterprise is involved. In other words, BIO must provide the vision for the CIBIO, not rely on technology drivers and circumstantial access. Education and the investment in people will, of necessity, include retraining, lowering the barrier for entry by senior faculty and by those from other disciplines, programs at all academic levels, and training and stable career paths for future principal investigators and for academic professionals who will provide the energy for the scientific journey. Once involved, BIO will have made a major commitment to the community and must have an effective long-range plan to sustain the efforts. The changing relationship of CISE to its high performance computing centers and the introduction of a CI process across NSF place a significant obligation on the BIO Directorate to structure and maintain the role of the biological science community in the development and utilization of scientific computing and information technology applied to biology. Not all subdisciplines can be simultaneously provided with a CI by BIO, so selected pilot projects and areas of high NSF BIO impact should be the first focal points of effort. Strategic partnerships, discussed below, may well be needed to facilitate and accelerate implementation. Nothing succeeds like success, and the complete implementation of a CI for the biological sciences will depend on the initial choices paying off in easily demonstrated ways.
Thus, the early pilots should be selected not just for their longer term scientific contribution but also for their ability to contribute significantly in the near term, even though many aspects of a comprehensive CI for the biological sciences will take years to develop fully and the impact will continue to accelerate the science for the foreseeable future. All research communities should interoperate, working through and with CISE and NSF as a whole, and seek to absorb as many as possible of the computational contributions from other fields, rather than encouraging reinvention. Nonetheless, BIO must also choose its own technology course, not passively accept whatever (hardware, software, middleware) is delivered for the needs of other science domains. The entire community will engage, even those with fewer resources and alternatives than those available within the biomolecular computing community. Scientists can now facilitate each other’s progress in extraordinary ways, and to optimize introduction of 21st Century Biology, the biosciences need to be interconnected to the other NSF domains. For NSF BIO to underestimate the importance for the biological sciences or to fail to provide fuel over the journey would be very damaging, perhaps catastrophic, for the community. Cyberinfrastructure promises to be as pervasive and central an influence as any societal revolution ever. Given the breadth and the long-term impact, several considerations are very important. First, working with partnerships and working in a global context is obvious and imperative on a scientific basis. Second, these interconnections are equally obvious and imperative on a practical and administrative basis. The cost of full implementation of a comprehensive cyberinfrastructure, in which the biological sciences supported by NSF benefit from cyber-rich environments such as those piloted by NEES and BIRN, will be large, as would be expected for a program of such incredible significance and applicability. The administrative scale at which NSF and NSF BIO prepare and sustain this process will have to be well beyond any previous efforts, beyond even the STCs or extant MRE programs.

Resource Requirements and Initial Stages of Implementation

Doubling the BIO budget would be justified simply to underwrite key steps in a comprehensive CI for the biological sciences. This will be a decades-long effort, although each journey begins with a single step. NSF BIO should take that step, and further initial steps, as soon as possible. Funding increases will be needed as well for the core experimental programs and their projects to permit them to exploit fully the growing cyberinfrastructure and to build the requisite collaborations for a synthetic understanding of biology, which requires computational expertise and the deep involvement of information technology. In the biological sciences, database activities, modeling/simulations, and theory must always be connected to experimental efforts; a balanced expansion of the portfolio will be important. Beyond this base internal to BIO, major partnerships with computer and information science (CISE) and with the other sciences will be required. The impact should not be underestimated, but neither should the requirement for greatly enhanced, stable funding. BIO is already engaged upon a series of extraordinary opportunities, in creating a larger scale for shared, collaborative research efforts, through activities like FIBR and (the just initiated) NEON, while sustaining microbial projects and LTER. These larger scale projects particularly require a cyberinfrastructure, with costs of comparable magnitude to the projections for the experimental research component.
BIO will have to (1) build up its own core activities at this interface (e.g., the funding for bioinformatics, biological knowledge resources, computational biology tools and collaborations on simulation/modeling) that allow it to partner with other parts of NSF, (2) choose test beds for full implementation of CI, establish paths toward deep integration of CI into all of its communities and for all of its performers, and (3) set a leadership role for other agencies around the entire world, including notably the USA mission agencies, to follow. Of course, only through a decades-long commitment and through flexible, agile, engaged, proactive interactions with the entire community and with the other stakeholders (i.e., with other sources of funding for infrastructure and research newly enabled by CI) will the effort be a complete success. Several types of early actions are needed. Implementing these important requirements will be the responsibility of BIO, and they must be in place for effective collaborations on research frontiers with the other domains (Directorates) of NSF. The first implementation steps should be to expand the extant database activities and computational modeling/simulation studies, which need central attention. Many challenges remain as research problems as well as particularly difficult implementation problems for databases in the life sciences. Simulation studies could contribute considerably more across all of the biological sciences. Accelerating the introduction and expansion of tools and of the conceptual approaches provided through testing models, a prominent feature of research in the physical sciences, will require continued programmatic emphasis and commitment. Many biologists trained in more traditional ways are just starting to recognize the opportunities, making a renewed and invigorated focus on training in the quantitative sciences essential for 21st Century Biology. Encouragement of more collaborations between/among experimentalists and computational scientists is essential, but the full implementation of the opportunities will require the training of a new generation of translators, of “fearless” biologists able to understand and speak the language of the quantitative scientists well enough to choose the best collaborators and to build bridges to more traditionally trained experimentalists. Many basic requirements involve academic professionals and the use of well-documented approaches within computer and information science. Interdisciplinary training should be restored as a separate, defined program within BIO. As noted above, the enabling and transformational impact of CI justifies, and for full implementation would require, a doubling of the NSF BIO budget, but it will also require that BIO lead a much larger effort, marshalling resources from other National agencies and around the world, to provide adequate funding to ensure full participation by the international life sciences community. Consequently, other key, early actions are to establish a long range plan for sustained funding and to engage the community in a dialogue to ascertain implementation priorities as well as to prepare the biological scientists from around the Nation to participate fully.
The dialogue should begin as an open meeting that is highly interactive and inclusive in all ways; a major venue will be needed to explore all options, dig deeply into implementation features for subdisciplines and into national and international partnerships, and provide for the archival of discussions and recommendations. Important administrative features include the review and funding of infrastructure and establishing (over time) a balance across the subdisciplines. Infrastructure is different from individual research and needs separate processes for its consideration, which are described below. Central coordination, needed for effective selection of pilots and coordinated efforts, will ensure balance and accelerate penetration of the benefits of modern IT to every BIO discipline. All categories of infrastructure are increasingly important for scientific research, but cyberinfrastructure will be particularly valuable for the biological sciences. What will be critical is to recognize that infrastructure cannot be treated the same as individual research proposals. One cannot review infrastructure requests and plans against individual research proposals, and separate, centralized review and oversight will be needed. This situation arises since infrastructure benefits all, but has a different time frame, different budgets, and different staffing (more academic professionals). To make informed, equitable and effective judgments on behalf of the community, CIBIO simply cannot be considered simultaneously with individual projects. At the same time, robust, rigorous peer review is essential to establish the best opportunities. Competition is also important; overlapping efforts will need to be initiated in many cases and then the best project will ultimately become clearly identified.

Education and Training

The educational challenges are themselves vast, and will require an expansion of existing programs and possibly the creation of new ones. CI will dramatically alter how education is conducted (the means for training and transferring knowledge), and its full implementation and utilization will require a new cadre of scientists adroit at the frontier between computing and biology, able to recognize important biological problems, understand what computational tools are required, and serve as translators or communicators between more traditionally-trained biologists and their collaborators, computational scientists who will be just as traditionally-trained. These requirements are universal; that is, NSF BIO should work with NSF INT and with international agencies to encourage innovation and sustain the excitement beyond national boundaries. The technology itself will change all levels of education and BIO should coordinate with EHR and the other research Directorates. A simple example, beyond the graphical nature of knowledge representation and interactive media as a teaching vehicle, is remote learning. Democracy is at work on the web, in Nobelists answering the queries of students and in the ready access with routine tools to the world’s information and knowledge store.

Coordination and Collaborations

NSF in numerous cases will be able to nucleate an activity, but not have to plan for long term, expanding support, since another agency will ultimately adopt or extend some aspects to meet its own needs; the other agency may even sustain some or much of the original core. But some research problems, such as ecology, plant science, phylogeny and the tree of life, and the evolution of multicellularity and of developmental processes, among others, are research domains or areas that NSF will always own. Besides applying CI to these kinds of basic questions within the biological sciences, the overall catalysis of biology by CI will remain an NSF BIO role for the foreseeable future. NSF must ensure that, once the cyberinfrastructure for biology is put in place, the funding plan and priority level for resource allocation are in place to sustain the efforts. In particular, intellectual roadmaps linked to budgetary requirements must be developed to ensure that first the pilot efforts and then full implementations (each after peer review and selection of the best activities) can be funded and maintained stably, in order to deliver on the promise to the community.


A Prelude on Entering Biology’s Century

Commentary and Prescience in the Biological Sciences

Toward a Paradigm Shift in Biology (Nature, Vol. 349, p.99, 10 Jan 1991)

The questions of science always lie in what is not yet known. Although our techniques determine what questions we can study, they are not themselves the goal. The march of sciences devises ever newer and more powerful techniques. The new paradigm, now emerging, is . . . that the starting point of a biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture, only then turning to experiment to follow or test that hypothesis. The human genome project will continue and accelerate this rate of increase [in DNA sequences known and in internet accessible databases, notably Genbank]. Thus, I expect that...the total knowledge of the human organism will be available...by the end of the decade. To use this flood of knowledge, which will pour across the computer networks of the world, biologists not only must become computer-literate, but also change their approach to the problem of understanding life. We must hook our individual computers in the worldwide network that gives us access to daily changes in the database and also makes immediate our communication with each other. The programs that display and analyze the material for us must be improved – and we must learn to use them more effectively.
Walter Gilbert

Computing in Molecular Biology: Mapping and Interpreting Biological Information (Computer, November, p. 6-13, 1991; and Communications of the ACM, IEEE, 1991.)

Biology is in the middle of a major paradigm shift – driven by computing. Although it is already an information science in many respects, the field is rapidly becoming much more computational and analytical. Computerized databases of genetic information, for example, let researchers quickly determine the significance of research findings.
Eric S. Lander, Robert Langridge, Damian M. Saccocio

“Computing has changed biology forever; most biologists just don’t know it yet.” Michael Levitt, 1999

“Computational Biology will be as essential for the next quarter century of biology as molecular biology was for the past quarter century.” William McGinnis, 1999


More than the earlier implementations of research computing and far more than the contributions from any other scientific infrastructure, the integration and acceleration of scientific computing and advanced information technology (SC and AIT), driven by NSF, will enable applications for the entire biological sciences community beyond any expectation we could articulate today. By building a cyberinfrastructure for the biological sciences (CIBIO), the Biological Sciences Directorate will provide an enduring, extraordinary foundation for research. To encapsulate the value of this enterprise by a metaphor, consider the assertion by hockey great Wayne Gretzky about his ability to score goals. In sum: CIBIO should allow biologists to “scait” to where the puck will be.

The BIOAC Working Group of 21st Century Biology Subcommittee, 2003


The Workshop Report

Specifying and Building a Cyberinfrastructure To Meet the Requirements of Biology

The NSF has identified the pervasive, ubiquitous contribution of a set of broad enabling approaches as cyberinfrastructure (CI), which builds upon a long history of exceptional advances in basic computer science and engineering, information technology, computer hardware and software, networking, and wired and wireless communication technologies. The Directorate for the Biological Sciences (BIO) will establish its own plans for CI in the context of this broader effort. That broader effort began through the study of a “Blue Ribbon Panel,” chaired by Prof. Dan Atkins, for the Computer and Information Science and Engineering (CISE) Directorate of the NSF. A summary of their report is in Appendix IV.

CI is perceived by the NSF as contributing not only to basic/technical knowledge and deep understanding (and even to wisdom), but also to a knowledge-based democracy for science and society, beginning with widespread access and full participation, and to an accelerating translation from fundamental research to societal benefits. The aim of CI, as an integrative platform and enabling tool, is to empower all of the sciences to work on a systems level, while fully encompassing the requirements of studies that range from ultra-small, focused investigations to ultra-large analyses. The implementation of CI, which includes databases and knowledge resources, grid architectures and services, software engineering, telescience (or remote access to specialized methods and tools), collaborations around defined science projects and around implementing needed technologies, and so on, has to be international in scope. In fact, there are already international projects, and the international character and expectations of CI will increase rapidly.

[Figure: Integrated Cyberinfrastructure meeting the needs of a community of communities. Layers shown: Education and Training; Discovery & Innovation; Applications (Environmental Science, High Energy Physics, Proteomics/Genomics, …); Development Tools & Libraries; Grid Services & Middleware; Domain-specific Cybertools (software); Shared Cybertools (software); Distributed Resources (computation, communication, storage, etc.); Hardware.]

Cyberinfrastructure is not fully defined at this point. Each community is to establish its own requirements at the frontier with computing. Thus, CI is a developing, underdetermined set of expectations, which will grow both as technology advances and as the needs of the sciences are established. As the entire NSF sustains an active dialogue, CI will be increasingly defined, and will certainly turn out to be defined in different ways for different communities.

CIBIO will thus have to be part of a “family,” and will have to interconnect, Venn Diagram style, with the central or parent CSE (or CISE) CI and with all of its siblings. There are large areas of overlap with the GEO community, through common or shared research programs and challenges in understanding the environment. This is notably the case for activities such as GEON, Geoinformatics, climate change, and the marine sciences. However, CIBIO will interconnect with the research and infrastructure activities in all of the NSF Directorates, including EHR.


Many novel partnerships, involving new collaborations across long maintained disciplinary and subdisciplinary boundaries and connected anywhere that makes sense intellectually, will arise due to the enabling nature of modern infrastructure. This infrastructure is absolutely central to 21st Century Science. While only one aspect of that infrastructure, CI will have an especially productive impact and will require equally special attention by the funding agencies, by private Foundations, by performing Institutions and by the entire community. All economic, geographic and technical sectors of the Nation and indeed the entire world will participate; all communities and all Agencies, wherever they are located in the world, will have to define their relationship to CI and respond in an exceptionally proactive fashion. We need the performers of biological research around the world to be fully engaged in contributing to the data rich, knowledge empowered world of 21st Century Biology.

Such complicated dynamics present an important opportunity for leadership by NSF BIO, while creating substantive challenges as well. The dynamic infrastructure being driven by the explosive intellectual and pragmatic/technical growth within CSE offers both great promise and potential pitfalls for BIO. Both the community and the NSF will have to be aware of the path taken by CI as a whole, and will have to make sure that BIO helps define that path. This point must be taken very seriously: BIO must coordinate and “interoperate” with CISE and secondarily with all the Directorates in defining the extent and fine details of the NSF structure for CI. The biological sciences can drive their involvement to meet their needs, as this report recommends. Or, the members of the community can proceed as before, and then they will simply be dragged along, which will be far from satisfactory.
BIO must recognize that CI is here, is now, has even been growing (up) for years, but it certainly will be “more here tomorrow.”

The Opportunity Created by Building a BIO-Specific CI

Over the past several decades, only the biological sciences have seen increases in knowledge, understanding, and applicability as remarkable, sustained, and revolutionary as those in the computer and information sciences. The exponential increases in computer processor speed (density of the chips), storage capacity and networking bandwidth have been ongoing, while the biological sciences have seen both early, extraordinary and unanticipated advances with genome science and exponentials comparable to those in CSE for sequence and structure data, and soon for many types of even more complicated data (such as that arising from mass spectrometry and environmental sensors). Any metric of biological knowledge output would indicate an even faster rate of growth than the CSE applications (chip density, cycles, storage, communication bandwidth) have shown to date, which may serve to inspire more rapid progress on the technology side. At the same time, real world features are essential for modeling and simulation in any science. Only in the past few years has the world of scientific computing and advanced information technology reached the level of being fully applicable to a wide range of deep biological research themes. In particular, computing has advanced from being able to model tens of thousands, a hundred thousand, and then a million atoms, to tens of millions of atoms, bringing the macromolecular machines of living systems within reach for the first time.

[Figure: Instruments matched to biological scale. Molecules: Synchrotrons. Macromolecular Complexes, Organelles, Cells: Microscopes. Organs, Organ Systems, Organisms: Magnetic Resonance Imagers.]

From the complementary approaches to complexity to the shared history of incredible, sustained advances, the two fields seem ideal for each other: multiplying the exponentials through deep partnerships will inevitably be exciting beyond any anticipation. However, many issues need to be addressed, from technical to cultural, from education to process, for the two domains to benefit from each other. In addition, the context for IT and CI involves all of the sciences, and this interconnectivity must also be considered. This report focuses on CSE considerations from a biology perspective. At the frontiers with computer scientists and biologists are quantitative scientists of all backgrounds (engineers, mathematicians, statisticians, chemists, physicists, and others), who will be referred to collectively as quantitative scientists. While BIO should especially focus on a rich partnership with CISE, it is this broader mix of scientists who will jointly write the story of 21st Century Biology, and BIO should look for opportunities for partnerships around the entire NSF.

As happened over the past decades, during which molecular biology was established first as a discipline and then as a set of tools for all biologists, an intellectual thrust or push arising from computing and information technology over the next decade will “meld” with the biological sciences to form a tool kit for the next generation of advances, expected to be even more profound than the revolution wrought by molecular biology. NSF has a critical responsibility to outline a long range vision to accelerate the process of the meld, beginning with CIBIO.
As all domains of NSF science, engineering and education participate in the CI revolution, how can the biological sciences maximize their benefits and minimize the inevitable, intertwined perturbations, the growing pains? How can CI drive fundamental biological discovery and empower more complete participation for all stakeholders? What are the immediate or near term steps to take, and what management plans (notably, funding and implementation mechanisms) will be needed? These are among the central questions addressed in our overview. Some representative objectives and nuggets for the science are included.

The essence, for BIO utilizing CI in empowering 21st Century Biology, is “Keep Your Eye on the Prize”:

• Invest in People
• Ensure Science Pull, Technology Push
• Stay the Course
• Prepare for the Data Deluge
• Enable Science Targets of Opportunity
• Direct the Technology Contributions
• Establish National and International Partnerships

Invest in People

CI, as an empowering, pervasive technology, will change how training is done and will bring that change to all levels from K-12 to retraining of senior faculty. In doing so, CI will also eliminate geographic barriers from the educational experience, connecting instructors and students throughout the world. These aspects will be considered by other arms of NSF. To exploit CI fully will require that computational awareness be among the goals of those building training programs. As undergraduate curricula around the country change to reflect the convergence of approaches to address 21st Century Biology and to


reflect the need to establish a strong quantitative foundation for future biologists, the nature of what will be required at the graduate level and beyond will itself change. Early on, more short courses, of the types associated with MBL (the Marine Biological Laboratory at Woods Hole, MA) and CSHL (the Cold Spring Harbor Laboratory, Long Island, NY), among other Institutions, will be especially required. In addition, there will need to be extensive development of upper level undergraduate courses, with the expectation that initially, they will be heavily populated by graduate students filling in gaps in their earlier education. Novel training programs of this nature are being established already for genome (molecular) bioinformatics and will become increasingly common for all bioinformatics graduate training. The early, and especially successful, BIO efforts at interdisciplinary training (the RTGs: Research Training Groups) that helped lead to the NSF IGERT (Interdisciplinary Graduate Education and Research Training) program, should be inspected again, and the overall context, particularly in terms of the need for novel training at all of the institutions supported by NSF, considered. BIO should evaluate mechanisms either for ensuring that more of the IGERT activities fit into the CIBIO framework or for re-establishing some specialized training activities to accelerate the rate of adoption of scientific computing and advanced information technology into the biological science community. As important, significant, and valuable as the NSF IGERT program has been, the challenges of CI not only require the continued support for such interdisciplinary training, but also require novel training approaches, particularly for more senior scientists who need retraining, as well as the establishment of new undergraduate curricula. Training should be built into every major infrastructure award at this frontier, and not just within centralized programs. 
The case above, that NSF BIO should reexamine the need for RTGs, is made from two perspectives. First, the IGERTs are spread around the entire Foundation and must inevitably reflect this range. Second, numerous reports and articles have pointed to the need for far more training in the quantitative sciences in the biological sciences and other major changes in curriculum, which are likely to require some experimentation to implement adequately. Even though increased training in quantitative methods will provide a framework, the concomitant effort to introduce more knowledge about scientific computing and information technology provides its own challenge. To meet these specialized expectations and challenges, a revised RTG program, drawing on the original successes, could include focused problem solving training environments (programs), and explore novel (and potentially beyond the more established framework expected for IGERTs) crosstraining, providing an empowerment deep enough to fuel 21st Century Biology. A different form of “higher education” will be essential immediately: the training of mid-career biologists and even more senior graduate students on how to use CI and what it can do for their research. One option that should be considered is a center where biologists can learn enough to specify their needs accurately, with whomever they need to work; such training needs to start by helping the practicing scientist identify if they need a CSE investigator, a CSE researcher/academic professional, or simply to hire a programmer, or instead if there is even an existing tool that will accomplish the required task. (Admitting to this need shows the range of “introduction to” or experience with computing technology among biological scientists – some use thousands of hours of cycles per year at high performance computing centers, and others are just recognizing, particularly from the side of information technology, that computing will be important for their research.) 
This approach could be similar to the mouse genetics courses and the various MBL and CSHL courses, except perhaps of shorter duration but greater frequency during the year, and only for a few years of the transition to a functioning CIBIO. Another option would be a Center to which an investigator would provide funds to purchase a service or obtain unique information. This would be a Center acting as a central repository of knowledge of biologically-relevant computational tools, and would do software tool development when the needed algorithm or package did not yet exist. Such a Center must involve training/education among its roles, so that its implicit knowledge would be disseminated.


In all areas, the opportunity exists to employ new collaborative tools to broaden the traditional educational experience, allowing individual students, or all students, to be spatially removed from instructors, and to interact routinely and easily with each other and with their instructor. In addition, the new CI will facilitate the development of new approaches in an international effort, ranging from a more regular dialog across national boundaries, to the sharing of educational experiences, and to joint projects on an international scale. This will help prepare students to work in a global economy, and encourage rapid progress in the biological sciences.

National (Biological, Ecological, Biodiversity) Data Centers

Ensure Science Pull, Technology Push

The biology community must decide how best it can interact with the quantitative science community, where and when to intersect with computational sciences and technologies, how to partner in science projects, and how the NSF should partner administratively. We all know better than to permit a “build it and they will come” mentality to dominate the discussion and set the agenda. However, there will be many buildings created for other communities; some of those we will want to live in and for others we will learn from the design but will reconstruct for our own needs. For economies of scale, we can’t afford to manage all of the architect, builder and materials costs ourselves, and need sophisticated partnerships built on the normal metrics for academic teams with CSE/CISE endeavors and also especially with GEO, that is, in marine biology and environmental science, as well as the research and education overlaps with MPS, ENG, SBE and EHR. Indeed, the domain of overlap for the environmental sciences, between BIO and GEO, is already constructing its own cyberinfrastructure. Climate, biodiversity and other environmental research communities will have strong, multifaceted partnerships among NGOs, private research entities, and government.

One aspect of ensuring that the scientific requirements of the biological sciences determine what aspects of cyberinfrastructure are employed first and most extensively within the BIO domain will be to continue to build the bioinformatics and computational biology programs for BIO. Only very recently has the information technology revolution reached numerous biological sciences sub-disciplines.
The communities arising from research in ecology, cell biology, plant science and most other BIO domains have unmet needs at fundamental levels of computational expertise, and their requirements for software engineering, for example, have to be met to allow active participation across the biological sciences. What could be called a basal infrastructure has to be in place for the biological sciences before an elaborate, highly activated edifice can be erected to drive 21st Century Biology.

Stay the Course

In 1984, leading computational biologists, experimental biologists, and computer scientists assembled at Airlie House to give NSF BIO advice on how to respond to the first step toward a CI, the high performance


computing initiative and what led to funding the NSF National Supercomputer Centers. The discussions considered the potential for computing across all research areas supported by NSF BIO as well as biomedical research areas of interest to NIH. The workshop participants concluded that (1) biology has a role in high performance computing; (2) some of the compelling research problems in biology are already compute bound, that is, they require more advanced computing; (3) there would be an ever increasing set of such problems; and (4) more and more of the biological community would indeed be able to make effective use of such high performance computing resources. {Hilllman et al, 1984}

The scientists at that meeting in 1984 raised the concern that, if they put in the effort to port their code and so on, they did not want to have the resources withdrawn, as had recently happened at that time for an NSF disciplinary resource many of them had used. NSF BIO promised it would stay the course, and, subject to annual efforts to sustain the budget, indeed, it has. Modeling of biomolecular systems, and biology in general, represented only a fraction of a percent of usage in 1984. As one consequence of staying the course and working closely with CISE, biomolecular computing had become the plurality user at 28% of the time by FY 1998 at the NSF Centers.

The importance then (in 1984), however, pales in comparison to the importance now for NSF to stay the course. The entire biological science community must and will engage, even those with fewer resources than those already involved in biomolecular computing. Scientists can now facilitate each others’ progress in extraordinary ways, and to optimize introduction of 21st Century biology, the biosciences need to be interconnected to the other NSF domains. For NSF BIO to underestimate the importance for biology or fail to provide fuel over the journey would be very damaging, perhaps catastrophic, for the community.
Prepare for the Data Deluge

The biological sciences today are swimming in a swift current of data, characterized by an exploding rate of data production and an exploding ability to capture and manipulate data, which arises across vast scales and national boundaries, as well as from numerous disciplines. From acquisition through refinement, reduction and deposition, the current of data, and the rapids along the course, point to a compelling requirement across all of the disciplines for analysis tools and for links across these interfaces. All modern biology, from low throughput, spatial reconstruction studies on cells to ecosystems, to the automated methods of microarrays and (further) sequencing and structure determination research, will produce a high volume of complex, heterogeneous data, with demands on standards, archiving, mining, federation or integration, and other contemporary issues in information technology. What distinguishes bioscience research is not the net volume, though the spatial studies described above will have terabytes to petabytes of data as will any of the high throughput methods (especially, mass spectrometry and proteomics), but rather the inherent complexity of the data, which will be very heterogeneous in data type, in modality of acquisition, and in its ties to biological phenomena, arising from the hierarchical nature of living systems.

Calculus, in managing the infinitely small but large scale of events filled with redundancy, has been the language of the physical sciences. The processes of biology, the activities of living organisms, involve the usage, maintenance, dissemination, transformation or transduction, replication and transmittal across generations of information. Biology has high information content, along with individuality, historicity and contingency.
Indeed, given the diverse data types and other features of heterogeneity and the hierarchical nature of biology, the biological sciences as a research discipline are said to be an information science. As such, information technology is the language of the life sciences, managing the discrete, non-symmetric, largely non-reducible, unique nature of biological systems and observations.


Bioscience has already borrowed from IT in moving first to relational databases, object model databases, and to various standards like XML, with bio-centric implementations (such as CML, SBML, EBL). The wide variety of data types alone challenges IT. The cottage industry nature of biological data collection requires distributed data archiving and multiple data resources, run by those with deep knowledge, but with standards for interaction, integration, federation, for specific queries to merge relevant data to provide biological discovery and insight. No clear solution has emerged for building and sustaining biological databases but despite the challenges, hundreds of public databases exist, some of which represent major community resources, such as the PDB.

[Figure: The Complexity of Biosystems. Regions where Computational Modeling can be Employed Today vs Goals for Coverage. Horizontal axis: Time Scale (seconds), from 10^-15 to 10^9 (Geologic & Evolutionary Timescales). Vertical axis: Number Scale (over size scale from Angstroms to Km), with levels Atoms, Biopolymers, Cells, Organisms. Labeled regions: Ab initio Quantum Chemistry; First Principles Molecular Dynamics; Empirical force field Molecular Dynamics; Homology-based Protein modeling; Enzyme Mechanisms; Protein Folding; DNA replication; Electrostatic continuum models; Cell signalling; Organ function; Finite element models; Discrete Automata models; Evolutionary Processes; Ecosystems and Epidemiology.]

Vigorous research programs, and scientific and administrative processes, are needed to ensure the continued excellence of the extant public databases, in the face of the data deluge. There has not been enough consideration of stability and continuity. The professional societies need to be more firmly engaged. Considerable infrastructure is required just in constructing and maintaining a single, focused community database. As we require the community to submit more of its observations and to submit that data more rapidly, and as the community acquires the tools to obtain far more data in much less time, the challenges on the data resources will grow. (Let alone the challenges on the data providers, who must also have the software tools to analyze their own data and extract useful parameters from community databases to drive their experiments.) At the same time, the downstream requirements, data mining and other tools for biological discovery, need enhanced support as part of the transition to the research environment and expectations of this century.

As the data, information and knowledge base grows (exponentially) for the biological sciences, sustaining the computational analysis chain, one inevitably with deep human intervention, that runs from data to information to knowledge, and with more sophisticated contemplation, to wisdom, will require considerably more attention. To bring in modern IT, from data standards to federation tools such as mediators and wrappers, will require substantial collaborative efforts by the community, suitable attention by the agencies not just to funding but to workshops on standards and the use of grant mechanisms to ensure collaborations across disciplines, and the full involvement of professional societies and leading technical journals. BIO staff will need to take lead roles in engaging the breadth of the participants needed.
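To make the mediation-and-wrapper idea concrete, the following is a minimal sketch in Python, not a description of any existing federation system. The two toy "databases", the `Record` schema, and every function and class name are hypothetical and chosen only for illustration. Each wrapper translates one resource's native layout into a shared schema, and the mediator sends a single query to every wrapper and merges the answers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Record:
    """Hypothetical common schema that every wrapper translates into."""
    accession: str
    source: str
    fields: Dict[str, str]


# Two toy community resources with incompatible native layouts (made-up data).
STRUCTURE_DB = {"P001": {"method": "X-ray", "resolution_A": "2.1"}}
SEQUENCE_DB = [("P001", "MKTAYIAK"), ("P002", "MSDNEL")]


def structure_wrapper(accession: str) -> List[Record]:
    """Wrapper: expose the dict-based resource through the common schema."""
    row = STRUCTURE_DB.get(accession)
    return [Record(accession, "structure", dict(row))] if row else []


def sequence_wrapper(accession: str) -> List[Record]:
    """Wrapper: expose the tuple-based resource through the common schema."""
    return [
        Record(acc, "sequence", {"residues": seq, "length": str(len(seq))})
        for acc, seq in SEQUENCE_DB
        if acc == accession
    ]


class Mediator:
    """Mediator: fan one query out to all wrappers and merge the answers."""

    def __init__(self, wrappers: List[Callable[[str], List[Record]]]):
        self.wrappers = wrappers

    def query(self, accession: str) -> Dict[str, Dict[str, str]]:
        merged: Dict[str, Dict[str, str]] = {}
        for wrapper in self.wrappers:
            for record in wrapper(accession):
                merged[record.source] = record.fields
        return merged


mediator = Mediator([structure_wrapper, sequence_wrapper])
print(mediator.query("P001"))  # data from both resources, keyed by source
print(mediator.query("P002"))  # only the sequence resource knows this entry
```

The design choice illustrated here is that each resource stays under the control of those who run it; only the thin wrapper and the agreed common schema need community-level standardization.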
Rather than being condemned to repeat the past, bioscientists must be able to query the world’s literature each morning, as envisioned by W. Gilbert in 1991, to set their experimental path for the day’s research. (The vision section, termed A Prelude, reviews this prescient statement, a classic white paper laying out the future for introducing computational and informational technologies to modern biology, done even before the web provided the mechanism for the transformation.) This singular feature both provides the opportunity for a revolution beyond anyone’s imagination and the potential for major


limitations if access is not universal or if the free flow of public information is limited by technology, imagination, resources, policy or other artifacts.

Enable Science Targets of Opportunity

The most ambitious research projects in the biological sciences, without exception, have a considerable informatics component, along with any requisite technology and experimental advances. More generally, the stretch-goal projects for the biological sciences community all require both community-level involvement and the complete spectrum of CI, namely:

• People and Training
• Instrumentation
• Collaborations
• Advanced Computing and Networking
• Databases and Knowledge Management
• Analytical Methods (Modeling and Simulation)

[Figure: 21st Century BIO-Cyberinfrastructure. Changing How Science is Done; Providing the Tools to Swim in the Rapid Current of Data.]

A list that included the major set for all fields in the biological sciences supported by NSF would be excessively long and inevitably incomplete – every month’s discoveries, sometimes every day’s discoveries, open new horizons for the biological sciences. Knowledge intensive computing – organized and understood data – is the theme of computational biology; the tools and approaches of computational biology are embedded within all of the key science opportunities. A selected set, aimed at being representative, follows.

1. Establish an integrated understanding of individual biological systems, managing the inherently different organizational features from molecules to cells and organisms to populations, and build a dynamic form and function into that understanding at all scales. Simpler systems would include peroxisomes, endoplasmic reticulum, proteasomes, mitochondria, chloroplasts, and minimal genome microbes, through free living microbes and protists.

2. Enable well-tuned, detailed ecological forecasting, such as for progression of invasive species, the involvement of biology in climate change and climate dynamics, and the spread of new or newly mutated pathological organisms.

3. Describe the Tree of Life; that is, facilitate the community’s effort to reconstruct the history of life on earth.

4. Predict cellular function on multiple scales; for example, predict the function of the protein encoded by an arbitrary, uncharacterized DNA sequence (found only as an open reading frame, ORF) and predict the chemistry upstream from that genetic sequence. (Modeling cellular behavior would be complementary to the biocomplexity program at NIGMS, NIH and the genomes to life program of BER, DOE, and this complementarity as a whole would have a large impact on progress in the biosciences.)

5. Understand how to simulate biology in precise enough detail to provide external deliverables to other disciplines and industrial processes; that is, establish a designer biology, through biomimetics, and processes inspired by well characterized biological principles. Develop a comprehensive synthesis of microbial ecology, including how ecological processes function in the natural environment and potential methods for bioremediation. (A complete characterization of microbial ecology, exploiting microbial diversity and function, requires experimental efforts, informatics, and computing approaches. Strong teams should be established. The teams should attempt to provide an understanding deep enough that investigators can actually manipulate the system in predictable ways.)

6. Understand specialization and communication in communities and in multi-cellular organisms. The focus could be on any level, ranging from free-floating unicellular organisms to metazoans or their component systems such as the immune or nervous system.

7. Characterize information integration by selected biological systems, ranging from organisms to ecosystems. For example, determine how organisms make use of their environmental information, how the nervous system integrates electrical and chemical information, how a unicell or protozoan integrates information about its environment, and how a metazoan integrates information that guides development.

8. Understand a given system in enough detail to build models that simulate complex biological systems, and allow reengineering or manipulation of the system.

Representative Examples

The various genome projects have produced, and will continue to produce, sequence datasets that can then be used for phylogeny reconstruction; however, the best reconstruction methods are heuristics for hard optimization problems. Solving these problems for datasets of more than 100 sequences, to an acceptable level of accuracy, is beyond the current capabilities of existing software. In order for systematic biologists to be able to obtain good estimates of evolutionary trees, new methods (algorithms and software) need to be developed, and cycles on a high performance computing platform need to be made available. Novel database technology also needs to be developed, to tie the inferred phylogenies to sequence datasets.
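One way to see why phylogeny reconstruction leans on heuristics is to count the candidate trees. The short Python sketch below is an illustration added here, not part of the workshop's analysis. It relies on the standard combinatorial result that n labeled taxa admit (2n-5)!! distinct unrooted binary tree topologies, and the function name `unrooted_tree_count` is hypothetical.

```python
def unrooted_tree_count(n_taxa: int) -> int:
    """Number of distinct unrooted binary tree topologies on n labeled taxa.

    Uses the standard formula (2n - 5)!! = 3 * 5 * ... * (2n - 5), valid for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("formula applies to three or more taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product when n == 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20, 100):
    digits = len(str(unrooted_tree_count(n)))
    print(f"{n:>3} taxa: a count with {digits} decimal digits")
```

Ten taxa already give 2,027,025 topologies, and the count grows faster than exponentially from there, so exhaustive scoring of every tree is out of reach well before the 100-sequence datasets mentioned above.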
Large-scale simulations also need to be done to assess the performance of both new and existing software. All such efforts require collaborations between computer scientists (specializing in databases and algorithms) and systematic biologists. In particular, such collaborative efforts would enable the entire systematic biology community to contribute to the final product. The database itself could provide an opportunity for other scientists (for example, geologists, who study the interactions between the earth sciences and biological evolution) to benefit.

CI for BIO will enable new opportunities for the well-established, two-decades-old collaboratory around long term ecological research (LTER), for research within the ecology/ecosystems community as a whole, and for the new comprehensive, enabling collaboratory called NEON. In particular, ever more powerful sensors (regularly updated to be fully state-of-the-art), measuring physical, chemical, biological, meteorological, spatial and other relevant parameters, together with associated computational, informatics and communication tools, will enable ecologists to obtain an in-depth understanding of population processes and the interplay of all the living and physical players in the ecosystem. Sensors will provide a revolutionary advance in tracking threatened species and otherwise following and characterizing the behavior of individuals in populations under study in the natural world. Any individual creature can have its own IP address, an extraordinary expansion, for example, on traditional tracking of individual marine species. The transformation of data into information and ultimately into knowledge in population studies, empowered by the new sensor technology, cannot be overstated. The “24 by 7” monitoring of environmental and population events, through analysis of individuals, will require smart software filtering of the data, but will enable the establishment of robust models. All sponsors and all performers in the environmental sciences will incorporate these kinds of technologies, methodologies and infrastructure into their research.
As this major, new, innovative, exciting and truly extraordinary platform for scientific discovery and societal relevance is established, NEON, which can only be conceived and implemented in a comprehensive cyberinfrastructure environment, will provide both robust approaches to reductionist analysis of specific organisms and phenomena, and the ability to construct novel syntheses toward a fully comprehensive understanding of larger scale ecosystem behavior.
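The “smart software filtering” of continuous sensor streams mentioned above could take many forms. As one minimal sketch, assuming readings arrive as (timestamp, value) pairs, the following drops isolated spikes by comparing each value with the median of the most recent raw readings. The function name, window size, threshold, and sample values are all hypothetical choices for illustration, not a method described by the workshop.

```python
from collections import deque
from statistics import median


def filter_spikes(readings, window=5, max_deviation=10.0):
    """Yield (timestamp, value) pairs that are not isolated spikes.

    A reading is kept if fewer than `window` raw values have been seen,
    or if it lies within `max_deviation` of the median of the last
    `window` raw values. Every raw value enters the window, so a genuine
    shift in level is accepted once it dominates the window.
    """
    recent = deque(maxlen=window)
    for timestamp, value in readings:
        keep = len(recent) < window or abs(value - median(recent)) <= max_deviation
        recent.append(value)
        if keep:
            yield timestamp, value


if __name__ == "__main__":
    # Hypothetical temperature stream with one faulty reading at t=5.
    stream = [(0, 20.0), (1, 20.5), (2, 21.0), (3, 20.8), (4, 21.2), (5, 95.0), (6, 21.1)]
    for t, v in filter_spikes(stream):
        print(t, v)
```

A rolling median is used here because a single extreme value barely moves it; a production sensor network would likely combine several such checks with instrument-specific calibration.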

[Figure: Characteristics of Ecological Data. Axes: data volume per dataset (low to high) versus complexity/metadata requirements (low to high). Categories plotted: satellite images, weather stations, business data, most software, gene sequences, GIS, most ecological data, primary productivity, biodiversity surveys, population data, soil cores.]

The revolution wrought by NSF in plant genomics, from model plants to crop plants, has changed the face of the biological sciences forever. By enabling a new generation of plant science research, the construction of a comprehensive CI for BIO is a key next step in this revolution. Cutting-edge stretch goals, which will require collaborations among experimentalists and computational biologists and strengthened bioinformatics support, are:
1. Identify, characterize, and understand the genes responsible for the domestication of crops;
2. Characterize and understand the mechanism of polyploidization and genome reduction;
3. Uncover and characterize the molecular mechanism of symbiosis.


Test beds for Establishing Comprehensive, Collaborative CIBIO Teams (CITs)
BIO should select a few programs, termed CITs for convenience herein, to pilot the full introduction of the advantages of cyberinfrastructure, from multiscale, federated information environments to virtual centers or collaboratories, from shared expensive instrumentation to novel knowledge repositories, and certainly with rich educational environments. Given the priorities for NSF, the state of the science, and the community’s involvement at the interface with information technology, among the choices would be CITs for microbes (genomics and communities), ecology and ecosystems, and plant genomics. Ultimately, as NEON comes to be, NEON itself (for example, with the information requirements and implications of its embedded sensor nets and wirelessly tracked populations) will intrinsically be a CIT, but the richness of the infrastructure and the resulting research discoveries could be enhanced through a concentrated, focused effort at strengthening the corresponding CI. At this moment, BIRN represents the most fully implemented example within the biological sciences, a functioning, expanding CIT, although BIO can also learn from NEES, GriPhyN and the other early CITs outside biology.

A representative, obvious and important example of an extension to specific BIO domains would be an ecological CIT: a pilot, with planned, regular expansion, of a CI for ecosystems science (an ECI, or an EARN, ecological analysis research network), with a coordinating center and one to two dozen research partners. A detailed description of such a model for the ecological sciences, with the aim of achieving a grand synthesis for biocomplexity and other aspects of the exceptional components and dynamics of ecosystems, is in an appendix. Such a Center would be developed in consultation with the cognizant professional societies and a specialized BIO workshop.
Select and Direct Technology Contributions
Building a cyberinfrastructure for the biological sciences, and selecting the right targets and milestones, will require careful attention to the data-intensive nature of all of modern biology. Beyond data manipulation, the technology must enable an easily accessible knowledge management framework. Biologists are already collaborating with computer scientists to create pilots and establish the process by which the incoherent output of large numbers of investigators can be channeled so that all may benefit. Early steps for CIBIO are to expand upon BIO’s database and information management activities, using the successes in structural biology and ecology, and to establish a broad cyberinfrastructure umbrella for computational biology and bioinformatics throughout the BIO research arenas. Keeping these principles in mind, we provide a set of priorities on the technology side of CIBIO.
1. Establish how to create database federations: linking disparate databases. Support research for the biosciences into:
a. Federation and integration of heterogeneous databases; ontology development; pedigree; data validation; provenance chains; the processing pipeline; and other bioinformatics tools.
b. Data mining, data exploration capabilities, affordable tools.
c. Sustainable, stable knowledge resources.
d. How to optimize analysis algorithms for interactions with databases (for example, TOL), computing with inputs that change.
2. Large-scale data analysis that cannot be accomplished on a desktop.
3. Support for advanced simulations: multi-scale, standardized methodologies, linking, chaining uncertainty.
4. Network middleware for domain services; throughput will change the way we work with data (for example, grabbing whole databases or application servers).
5. Sensor development systems.
6. Cyberinfrastructure resources: professional services, systems support, helpdesk support, resource centers and tools for distributed infrastructure development. (NSF PACI centers do this to some extent now, with some specialized software support, but BIO should work with the community to identify key needs and set pathways for future implementations.)
7. Hardware applications raise numerous issues; for example, commodity processors might not be useful for all BIO problems, and the overarching principle is “better” (routine, reliable, easier and more useful) access to computational resources.
8. A major need for next-generation biological science: the means for interacting with sensors, including protocol stacks.
9. International collaboration on development; long-term, intergovernmental collaboration; codevelopment with sister programs.
10. Mechanisms should be in place for the hardening of community application code, following extensive debate and evaluation through professional societies and peer review.
11. Develop collaborative projects to address integrated environmental challenges (research goals that lie beyond the expertise and resources available to one community, and that require broad participation and a very rich information infrastructure), as well as the build-out of LTER and NEON.
12. Support an interface of experimental biology, bioinformatics, and software engineering through interdisciplinary teams who establish modeling frameworks that integrate biological systems with physical and engineering systems for insight into how living processes and organisms function at a systems level.
13. Combine spatial images, observations on species, and a variety of physical/digital data; this should include biological information stored in multidisciplinary knowledge resources that include geospatial, demographic, economic and other data for land use and disaster management.
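The first priority in the list above, federating heterogeneous databases while keeping provenance, can be pictured with a small sketch. It assumes two hypothetical member databases with different field names, each wrapped by an adapter that maps rows into a common record tagged with its source. All names and sample rows are invented for illustration and do not describe any existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    species: str
    observation: str
    source: str  # provenance: which member database supplied the record


# Two hypothetical member databases with different schemas.
SURVEY_DB = [{"taxon": "Quercus alba", "note": "canopy cover 40%"}]
SEQUENCE_DB = [{"organism": "Quercus alba", "accession": "HYPOTHETICAL-0001"}]


def from_survey(row):
    """Adapter: map a survey row onto the common record shape."""
    return Record(row["taxon"], row["note"], "survey_db")


def from_sequence(row):
    """Adapter: map a sequence row onto the common record shape."""
    return Record(row["organism"], "sequence " + row["accession"], "sequence_db")


def federated_query(species):
    """Ask every member database for a species; return provenance-tagged records."""
    members = [(SURVEY_DB, from_survey), (SEQUENCE_DB, from_sequence)]
    results = []
    for rows, adapt in members:
        for row in rows:
            record = adapt(row)
            if record.species == species:
                results.append(record)
    return results


if __name__ == "__main__":
    for record in federated_query("Quercus alba"):
        print(record)
```

The design choice worth noting is that the member databases are left untouched: only the adapters know each schema, so adding a new member means writing one new adapter rather than migrating data.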

Establish National and International Partnerships
In an era of data-intensive biology, reliable, routine and robust access to an international level of information infrastructure (to the organized products from observations, modeling and interpretations generated the world over) will be critical in order to sustain progress, for scientists to remain competitive, and to exploit the potential insight to be derived from comprehensive knowledge resources. Storage grids and compute grids will frequently not be local and sometimes not even regional, and data grids will certainly be widely distributed, coupling input from very remote sites. Research contributions, perhaps even more notably in biology than in other sciences, will arise from the entire world. Tomorrow’s CI will bring together remote resources (expertise, instruments, data resources, computer platforms) and provide access from the end user’s desktop. This aspect will have no disciplinary, national, political or cultural boundaries, and it reflects the growing awareness that science itself is a global enterprise, as seen, for example, in the new discoveries and insight coming from numerous research settings around the world, the existence of international databases, the explosion of electronic literature, and the extraordinary depth of information available on the world wide web. In addition, the growing international flavor, beyond the leading economies of the world, can be seen in the rapid spread of ideas in electronic as well as journal hard copy form, the increased facilitation of the exchange of data with colleagues around the world, and the strengthening of the role and recognition of international scientific organizations (or societies). The most obvious transformation is the development of an international e-science community that debates and exudes excitement 24 hours/day, 365 days/year.
Furthermore, an international flavor will facilitate, and will often be needed to address, many scientific questions (often with social impact) that are by their nature global in scope, such as understanding basic ecological and ecosystems processes, especially in the context of global ocean / atmospheric circulation, where observations, expertise, and resources are needed from across the world. After all, political, social, cultural and economic boundaries are human inventions, and the physical world, while it may have shaped some of those boundaries, follows its own “path.” While local, regional and national solutions to immediate requirements have, of course, to be established as quickly as possible, biologists need to envision a global grid, and to think and act locally, regionally, and globally. The challenges of international contexts for anything, especially infrastructure, are quite large, not so much in thinking but very much in implementing. As a consequence, from the beginning, the design and implementation of a National CI should be considered in the international context. Given the requirement for access to the world’s literature on a routine basis, the expertise and the resources to build CI will not arise from only a single country or region of the world.

Within the biological sciences, numerous examples already exist of international interactions that rely upon the current, albeit partial, implementation of CI. For example, the Protein Data Bank (PDB) is the international repository of 3D macromolecular structure data, and is now evolving into an internationally managed activity. Nearly from its beginning, GenBank has been an international collaboration in storing and managing DNA sequence data. The International Long Term Ecological Research (LTER) activity is a conceptual network of researchers, using research resources sited around the world for exploring regional and global questions in ecology. PRAGMA is an organization that has biological applications as a key set of drivers to build collaborations among research efforts located around the rim of the Pacific Ocean. The National Center for Microscopy and Imaging Research (NCMIR) provides international, high bandwidth, remote access, via dedicated Internet connections, that allows investigators to obtain three dimensional information on biological objects.
Among other instruments, NCMIR houses an intermediate voltage microscope and provides software tools for computed axial tomographic reconstruction of objects within thick specimens using information from rotated images. Investigators also use similar methods developed by NCMIR to access a high voltage electron microscope in Osaka, Japan. Data from such studies are incorporated into a multiscale database termed the Cell Centered Database (CCDB).

[Figure: Model-based Integration of Multi-resolution Data: Development of a Cell Centered Database. Panel labels: parallel computing resources for tomography; spatial database of rat brain anatomy; models; neuronal models; database federation; imaging databases; large scale 3D EM reconstructions; cells and tissues; modeling cellular microdomains; cellular processes; cellular microdomains; macromolecular distributions; correlated LM and EM; hi-throughput tomography.]

These activities, which are only selected examples among numerous biological science projects with international components and involvement, illustrate the value of working across national boundaries and the extra complexity (language, culture, policy) and time investments needed to be successful. Some specific additional challenges for international collaboration and the associated approaches for NSF include:
• Funding of joint international projects (funding both sides of the collaboration)
o Work with funding agencies in other countries.
• Accessing data for comparison runs into potential barriers of different laws in different countries.
o Work with government agencies to ensure the basic principle that “open access” to publicly funded data is guaranteed for scientific and educational endeavors.
o Work with International Agencies to accelerate development of CI in developing countries and to facilitate their access to resources and their ability to develop needed expertise.
• Shared resources will be developed and deployed under local (at the National level) funding, but will become part of the global CI.
o Work with Funding Agencies from cognizant Nations to reach simplified principles for sharing the resources.
• Exchange of researchers and students among Nations will be important in ensuring the most productive international CI environment.
o Develop incentives to encourage undergraduate, graduate and postdoctoral students to spend time outside their own culture conducting research.



To fully establish a CIBIO, the investments in people will need to keep in mind the international implications. Researchers will need to have an opportunity to be exposed to the global consequences and environment, and senior scientists must be empowered to prepare future research generations to work productively and succeed in the global context that is already beginning to come into being. This will entail incentives to change the overwhelming imbalance in the exchange of students, especially in the sciences. But this effort is essential to overcome the too prevalent but naïve belief that the only good or reliable science and technology research and development is conducted in the United States, or perhaps the US and Europe.

Two corollary responsibilities arise from the simultaneous importance of databases for 21st Century Biology and the related consequences of the growing international nature of information resources. Considerable community involvement and discussion are needed to ascertain the right directions, and the community must be encouraged to see the big picture and to understand the longer term implications. The very word “community” must also mean all users of the databases, not just a narrow view held by a few major data providers, or a view exclusive to the keepers, managers, or archivists of the database. On the other side of the coin, the agencies in the USA, and in particular NSF with its unique expertise, point of view and credibility, must also assert a leadership role from within the agency, not just in terms of dollars provided to the community. The managers of a database cannot speak for the Nation or for the community in the same way, that is, with the same authority, credibility and impact, as can officials from the agencies. Thus, to ensure the requirements of the community are met, NSF must take a proactive role in international settings.
Of course, NSF must provide continuity and reliable, sustained funding for major information resources in its domain, but it must also participate in international standards setting; the credibility of NSF is an essential vehicle to ensure that the international effort is on track, that the US effort is focused in the right directions and remains state of the art, and that access to important databases for the biological sciences is not compromised by competing standards established in other international settings or by commercial interests.

Implementation Strategies
NSF faces many hurdles in attempting to build a cyberinfrastructure for all of the sciences, and BIO certainly will have its own share of issues to address. The scale of the costs associated with CI means that BIO will have to choose directions carefully, build pilots and then expand them through robust mechanisms, and find suitable partners whenever possible. Even once adequate levels of funding are in hand to support initial community activities, BIO will have to balance a portfolio of investments. This is considered below under “Funding and Management Mechanisms.” NSF does not expect that its budget will grow fast enough to build CI by itself. Even with extraordinary budget enhancements, since the requirements for CI are fully intertwined with all research activities, the interdisciplinarity and scope are such that NSF’s programs must work together, and NSF’s work with other agencies and countries is essential now and will always be essential. The local, regional and international impacts and responsibilities for BIO are considered below under “Outreach and Partnerships.” Examining a variety of mechanisms already in place, both within NSF and in other agencies, will facilitate BIO’s efforts to ascertain the what and how for its initial investments and actions. Among the responsibilities are to balance the options and prototypes, to be careful not to prune early growth, and to confront head on the major challenge of sustained investments in all parts of CI for biology, ranging from the adequate support of data resources (from acquisition to management and development of query tools), to the maintenance and on-going development of community software for modeling and analysis, the formation of innovative collaborative efforts driving discovery along the frontier, and the development of new as well as expanded vehicles for education and training.

Funding and Management Mechanisms
Details about internal management will certainly have to be worked out by NSF BIO, but considerations from the perspective of researchers are provided. These are in the form of recommendations as to actions by BIO.



• A suite of channels, that is, alternative vehicles and pathways for investigators to obtain funding, is needed to ensure creative ideas can flourish for infrastructure; a variety of means for funding CI should be created, both within existing programs and in new programs that focus on CI BIO. One avenue for maintaining a suite of channels was the choice between regular investigator-initiated proposals/awards and the CISE program PACI; while weaknesses have been noted for such overarching programs, PACI provided flexibility for starting new directions and for taking risks. As PACI comes to a close, the overall pipeline and its component parts (science challenges, software requirements, connection of science pull to technology push, and so forth) should be considered by the research Directorates and notably BIO.
• ITRs will not fill the void being left by the PACI program. ITRs are long term, basic research, but there is little expectation that these would be for infrastructure development and deployment. Other mechanisms for development and maintenance will be required; to participate fully in the CI of the Foundation as a whole, all Directorates will have to ensure a core effort for their own disciplines. Simple infrastructure needs for research directorates will not be met by CISE, and BIO needs to engage in a dialogue with CISE over expectations and how to sustain the right focus for biology in the new context of CI.
• The biological science community varies widely in its degree of adoption of scientific computing. Some communities need a service center, tailored to the unique situation of the cognizant subdiscipline and community, where a given set of BIO PIs can turn to find the CS partnerships they need. There could be a stand-alone Center, with some similarities to the LTER Coordinating Center. There needs to be a critical mass of people focused on bio cyberinfrastructure. This could focus activities in groups of investigators, for the development and maintenance of CI as well as for training the next generation of scientists.
• CI BIO will range from leading-edge to routine (“every day”) facilities and tools. In setting up funding mechanisms, BIO should recognize that there are different kinds of infrastructure, a range of human experts and, similarly, a range of community software code hardening and support. These activities could fall under the same coordination or research support “center” as above. A centralized resource or center could also serve to ensure the community can adopt the latest, best hardware and software tools, enable more ready access to new technology, and facilitate standards that allow each member of a distributed team to accelerate their efforts by connections to the information obtained by others. NIH has funded a project, the Biomedical Informatics Research Network (BIRN), described below, that federates data from a community of researchers, as well as building other collaboratory aspects for 21st Century Biology research. This project may be a model for various CIBIO aspects. As currently constructed, BIRN has a coordinating center to put out all of the hardware, software, and pipes, and to get the distributed PIs to move their workflows into that framework. Currently, the Coordinating Center is funded at $2.5M/year for CI people to support the 200 investigators.


[Figure: Model Exists: Architecture To Support a Biological Informatics Research Network. BIRN, Phase I, 2001-2002. Goal: form a national scale data grid and federate multi-scale neuroimaging data from centers with high field MRI and advanced 3D microscopes. Sites and partners shown: UCSD, Cal Tech, UCLA, Harvard, Duke, Cal-(IT)2, NSF NPACI with SDSC, and NIH Centers for Bio Imaging and Computational Biology and NCRR Research Centers. Integrating cyberinfrastructure to link advanced imaging instruments, data intensive computing, and multi-scale brain databases; a “Deep Web” label also appears.]

Test Beds for an NSF BIO Cyberinfrastructure



• The NIH National Center for Research Resources also supports several computing resources. The Research Resources provide funding at the level of $1M per year for the development and maintenance of software, with components of service to the community, training on the tools of the resource, and clear dissemination of those tools. Part of the review of a Resource is in terms of the use of the tools and the quality of the science produced from those tools. Successful resources also can be multidisciplinary, employing academic professionals. While the NIH Research Resources have leveraged heavily off of the CISE investments in computing and computing infrastructure, their success may be useful in exploring how to introduce numerous basic biological sciences communities, which cannot access NIH funding, to advanced scientific computing and information technology.
• Advancing and hardening research grade, internal lab developed software is an important elementary consideration for CI. BIO should complement the new software maintenance program (from NIH) that will support the maintenance of community codes. Extant public phylogenetic software, along with code available from various labs and concomitant access to conventional desktops, is not enough. No means exists today for testing and developing new algorithms with community involvement, or for doing so in a rigorous fashion. To drive this community collaboratory will require a team of professional programmers; faculty from labs around the world would design the applications. A committee will have to be established to decide who gets priority; that committee should be approved by NSF, not simply established by the Principal Investigator.
• The NSF cyberinfrastructure goals must be at a higher level; that is, the support of efforts in establishing basic research software is not “on the table,” and is not likely to be funded through new CI programs within, or as partnerships with, CISE. However, BIO must address the basic software needs for individual research communities. A BIO CI program would have to collaborate with programs across the Foundation, set high goals for CI within BIO itself, and also evaluate “catch-up strategies” for individual sub-disciplines within BIO that would otherwise not be ready to participate fully.
• The CIBIO could be initiated with three tiers: (1) select via peer review a small set of “critical mass” activities: a core team, probably distributed geographically, with enough funding to have an early impact; (2) find similarly a distributed set of medium-sized projects in domains that tap into this core; (3) identify smaller, individual investigator activities that can benefit by direct connections to the infrastructure. NSF BIO will need to structure the funding mechanisms to be sure the funds are indeed used for projects either directly empowering investigators to use CI or developing the CI for BIO, and are not directed into conventional projects.

BIO should establish a cross-cutting CI activity for BIO, with separate review processes, including invigorated bioinformatics and database activities and a modeling and simulation or computational biology core. NSF pioneered the federal support of database activities and computational biology, and should now exploit the CI opportunity to create a new generation of these research endeavors. All categories of infrastructure are increasingly important for scientific research, but a strong cyberinfrastructure will be particularly essential for the biological sciences. What will be important is to recognize that infrastructure cannot be treated the same as individual research proposals. Robust, rigorous peer review is very important for establishing the best opportunities; competition is also important, and similar, overlapping efforts will need to be initiated in many cases, after which the best project will ultimately become clearly identified. However, infrastructure needs simply cannot be evaluated directly against individual research, and separate, centralized review and oversight will be needed. Infrastructure benefits all, but has a different time frame, different budgets, and different staffing (more academic professionals), and cannot be properly reviewed in connection with individual research projects. Public policy, not peer review, is the basis for balancing the levels of funding for infrastructure and for investigator-initiated projects, levels balanced to ensure optimum progress by the community.

Collaborations across biology, across institutions and across other disciplines will be the hallmark of 21st Century Biology. In allocating funds, along with ensuring appropriate means for peer review of CIBIO projects, NSF must put in place highly effective channels for collaborations beyond those for existing programs. NSF must prepare accordingly by sending an even stronger signal than that provided by the existing mechanisms.
Suggesting that the biological sciences will participate in big science activities is both quite misleading and needlessly controversial. Instead, a healthy mix of microscale (i.e., the so-called cottage industry or small lab, individual investigator, hypothesis-driven research) with mesoscale (i.e., interdisciplinary collaborations and team efforts) is needed and is today being established naturally, without centralized planning or top down intervention. In this era of mesoscale biology (so named, for simplicity, according to its emerging properties), the NSF should be particularly attuned to these distinguishing features:
• Nurturing Collaboratories: intermediate scale environments for remote access to specialized instruments, and for interdisciplinary, multidisciplinary, and sometimes multi-institutional interactions, both local and distant.
• Sustaining Portfolio Balance: instrument development (high versus low risk, exploratory versus implementation) and prototype applications.
o Teams are needed; there should be “ferment” for selecting the direction, then driving instrument development (by situation dependent, biology pull and technology push); funding should also facilitate commercialization and getting tools to the community for a broad impact.
o Portfolio policy-level considerations include cost sharing in instrument acquisition and mechanisms to enable PIs to sustain applications / cover recurrent costs of user facilities.
• Facilitating effective utilization of the Data Deluge: rapid data acquisition for high throughput biology and equally effective acquisition of complex data sets for many lower throughput approaches; organization and long-term maintenance of data sets; creation of novel query and integration tools.
• Developing Quantitative Approaches for all fields of the biological sciences: move beyond binary biological questions and answers (yes or no, up or down, on or off, spot or no spot), and provide a strong extension of adequate training to address the complexity of biology.





Outreach and Partnerships
As pointed out above, each component of all of the science-funding agencies will need to establish its own course, since all communities will participate in CI. Of necessity, the internal NSF interactions for BIO must begin by coordinating as much as possible with CISE, and by exploring options, most notably the potential for joint programs. Especially in the environmental context, there will continue to be a major area of overlapping interest with GEO. Nonetheless, CI promises to be the most interconnected activity at NSF in its history, and there will be productive interfaces with all of the other Directorates.

For reasons described above and in the background material in the Appendix, NSF is the only agency that can lead CI, and BIO is the only possible leader for the biological sciences effort. However, nothing succeeds like success, and the scientists working in applied life science research and funded by the NIH, DOE, USGS, ONR, or components of other mission agencies will require access to CI; thus, the mission agencies should look to BIO’s efforts as a model. One arm of NIH is already invested, through efforts conceived and run as more targeted than an NSF project; other Institutes are interested. A role for NIH in its biomedical world can be expected; that is, NIH will continue to build group activities around specific health topics, such as the human and model brain studies facilitated by the BIRN project. At this point, NSF has covered much of the infrastructure development in basic CSE, and driven innovation in early CI across the biological sciences. What could prove to be very important for the community will be the potential, led by NSF, of “pulling or pushing” NIH a little further into the development of specialized CI, and helping with seamless connections between the CI of basic and biomedical science research.
Similarly, private foundations, most notably the Howard Hughes Medical Institute, naturally will also have to extend CI to their investigators, and opportunities for collaborations should be explored. The pervasive, ubiquitous connections that must characterize a fully successful CI will require a form of “intellectual glue,” a willingness to fill in the gaps and reach out to all communities and institutions. The bottom line on outreach and partnerships begins with the core: BIO can ensure democracy and that CI delivers a new era in biological sciences research. This role is essential, and no one else will undertake such an effort or even has the capacity to do so. Just as the grid is universal, CI has huge international implications. Through professional societies and continued use of BIO’s ongoing relationships, BIO will need to coordinate with policy makers in Europe and Asia. NSF’s International Directorate has already begun a CI effort, known as PRAGMA, linking the USA and Far East Asia. CIBIO should prove an opportunity to cement relationships with Japan and the EBI around the Protein Data Bank, and more generally, to ensure that there is an international information infrastructure for the biological sciences, or, in other words, to ensure that there are common standards and shared efforts for ontologies.

Immediate Steps for BIO

Continue to take risks. NSF BIO has the capacity to be far more adventuresome than the mission agencies. The plunge is essential for 21st Century Biology, and the adventure will benefit not only fundamental biology but also the applied biology supported by the mission agencies.

Ensure success through adequate, sustained funding of the best activities. A very serious, central issue that urgently must be addressed before bringing an implementation plan to the community is this: what are the commitments for sustaining the infrastructure projects over 15 years? Who actually runs a given project needs to be determined by peer review; investigators and teams will compete, and some projects will change ownership. However, it will take a number of years to establish the infrastructure, and once established, CIBIO must stay in place to sustain the full development of 21st Century Biology. A CI for the biological sciences is, of course, going to require attention and support for the indefinite future, but no matter what first generation implementations are established, there will be a need, along with routine community involvement and active dialogue on the best mechanisms to drive research, to establish plans for a sunset review of the projects as a whole and a careful full assessment of mechanisms, progress and impacts. A new plan should be established as technology and biological understanding advance, and of course, the sun might rise to shine in the same path and on the same organizations. (Sunset involves rigorous assessment, yet often it should simply precede renewal.) In consideration of the engineering research centers and the science and technology centers, we suggest that this broad overview occur in time to set a revised agenda in this fifteen-year time frame. The time frame is chosen to emphasize the deep and central significance of stability in putting together productive teams at multiple institutions across many disciplines. Without stability, it will be difficult to develop and clearly impossible to sustain an efficient, effective cyberinfrastructure for biology. One vehicle to begin discussion about the necessary long term plan would be for NSF BIO leadership to put together a set of milestones or a roadmap with projections for the lifetime first of developmental and then of maintenance efforts, and share that long range plan with the BIOAC and ultimately, through numerous society meetings, with the broad community.

Prepare a five year plan as a first step to place the biological sciences on the map toward a successful CI; that is, identify the special role NSF BIO will play and empower the community to look forward effectively. An implementation workshop would be an effective vehicle to review this plan, which could be developed through a leadership retreat coupled with post-panel long range plan discussions and the output from the satellite meetings of professional societies. BIO must also make sure the plan is international.

Anticipate the accelerated pace that will characterize CI for all of the sciences. The growth path will be so rapid, the requirements so extensive, and the costs so significant, that a broad swath of the BIO research community needs to be involved in planning the implementation phase. To ensure the best ideas can compete, the entire community, more generally, needs to be forewarned and encouraged to prepare (in order to participate fully).

Obtain input from the broader BIO community. Given the extraordinarily broad importance of CI, building a cyberinfrastructure for the biological sciences will absolutely require considerable initial and ongoing input from the entire community. One mechanism, which could prove particularly useful and cost effective, would simply be to include a talk, a few talks, or a round table discussion at subgroup meetings, i.e., at the satellite meetings typically held before the annual meetings of biological science professional societies.
BIO program officers could participate in a session in which a given discipline examines its needs and well-established investigators in the field address their vision for CI. Such discussions will also be important to ensure full participation and thereby the consideration of the opportunities for all of BIO (at the level of the specific scientific domains of individual Divisions). Thus, biological society meetings could help NSF gather information and plan how to implement the funding. There have been examples of this previously in the biology community’s interactions with NSF, from the role the Genetics Society of America plays in genetic stock centers to the involvement in the Protein Data Bank by the International Union of Crystallography. Programs at other agencies, such as SciDAC (Scientific Discovery through Advanced Computing) at the Department of Energy, and the National Tissue Culture Cell Repository at the National Institutes of Health, also reflect community input, and may represent collaborative channels for communication and outreach to the community. As BIO moves toward an implementation phase, NSF could release a call for proposals aimed at scientific societies, to encourage them to identify key needs and requirements and make sure that the CI activities are inclusive and bring in new people. In addition, a larger meeting to consider specific implementation across the BIO domain will be very important.

Appendix 1
July 2003 Central Questions for a CI BIO

Questions grouped in order to focus and facilitate the breakout group discussions. Each breakout group and each attendee considered all questions.

A: Why is cyberinfrastructure for biology so important? What difference will it make? What is its scientific scope? Where are we now? What are successful examples? Difficulties? Where do we have to go? Opportunities? Challenges?

B: What is its technology scope? Data Intensive Bioscience, Knowledge Management? What are the educational requirements and opportunities?

C: What is its administrative scope? What do we need to get there? Funds? Management? Balance between NSF, Institutions; Cost Sharing vs Stable Adequate Funding. What should BIO itself do now? Who will be natural collaborators? Internal Agency Partners, External Agency Partners; Not for Profits; NGOs; International Implications.

D: What further meetings or actions are needed in the near term? What are the first steps for BIO?

Appendix II
Schedule and Assignments (14-15 July 2003, NSF BIO Workshop)

14 July 2003

Morning Session (9:30 start)
Introduction, Welcome, Charge by AD, BIO; Workshop Chair, BIO AC (9:30-9:40)
Presentation by DAD, CISE on CI as viewed by CISE (9:40-10; Qs&As, 10-10:10)
Review of Initial Requirements and Overview of CI-BIO (“seabio”), CH (10:10-10:20)
Examples and Models for CI / examples from biological science (10:40-11:40)
Overview of Grid, CI Technologies, Issues, CA (11:40-12:10)
Working Lunch (12:10-1:10): lunch discussion includes overviews, general group discussion and assignments for breakout groups

Afternoon Session
Breakout Groups formed, begin (A. 1:10-2:40; break, then reassemble; B. 3-4:30)
A. Science Assignments; Science Drive, Pull; “Why SciBIO CI”
B. Application Assignments; Generic Infrastructure; Technology Push
Presentations from Breakout Group Chairs - Review Initial Contributions (4:30-5:10)
Round Table Discussion (5:30-6:30)
Break for dinner by 6:30; expect continued discussion 9 PM, hotel
Brief Meeting of Writing Group/Steering Committee, Review Outcomes, Plans for Day 2

15 July 2003 - Day 2

Morning Session (9 AM start)
Reform Breakout Groups (A. 9-10; B. 10:15-11:15)
[Review Monday “A” and “B” discussions, add “C” to first; add “D” to second discussion]
[During Breakouts, Complete First Draft Major Writing Contributions, esp. both re BIOSCI and General Enablers, connections to CISE & Technology]
Review Output for consensus (11:15-12:15)
Working Lunch, continued discussion [Outcomes, Implications, OPTIONS FOR FUTURE MEETINGS, OTHER STRATEGIES]

Afternoon Session
Final Draft of Each Section Prepared (1:15-2:25); outline key points/prepare overview
Overview presented to AD, BIO, possibly also to CISE, other NSF (2:30-3:15)
Integrated Draft Prepared (3:30-6:30 PM); Writing Team, through working dinner Tuesday

Day 3, Wed AM 16th: revised “first draft” given to AD, BIO, for use in NSF August discussions.
Final Version, Slides completed by end of August.

Package presented, reviewed at BIOAC Nov 14, 2003

Assignments
Chair, Breakout Group 1: Gwen Jacobs
Lead Scribe: Susan Stafford
Chair, Breakout Group 2: Jim Beach
Lead Scribe: Paul Gilna
Workshop Chair and Editor in Chief: John Wooley
NSF Liaison: Judy Verbeke
Other Writing Team Members: Gwen Jacobs, Susan Stafford
Workshop Administrative Support: Amanda Voight
Follow-up Administrative Support: Joy Gorback
Web Design and Implementation: Kareem Elbayer
Printed Text Editing and Implementation: Courtney Smoot

Appendix III
Workshop Participants

Nancy Amato Texas A & M University [email protected] (979) 862-2275

Peter Arzberger San Diego Supercomputer Center [email protected] (858) 534-5079

Paul Barber Boston University [email protected] (508) 289-7685

James Beach University of Kansas [email protected] (785) 864-4645

Helen Berman Rutgers University [email protected] (732) 445-4667

Emery Brown Massachusetts General [email protected] (617) 726-8786

Brenda Claiborne University of Texas San Antonio [email protected] (210) 458-5487

Mike Colvin Lawrence Livermore National Lab & UC-Merced [email protected] (925) 423-9177

Mark Ellisman University of California San Diego [email protected] (858) 534-2251

Deborah Estrin University of California LA [email protected] (310) 206-3923

Stephanie Forrest University of New Mexico [email protected] (505) 277-7104

Claire Fraser The Institute for Genomic Research [email protected] (301) 838-3504

Paul Gilna Los Alamos National Lab [email protected] (505) 667-3114

Teresa Head-Gordon University of California at Berkeley [email protected] (505) 667-3114

Gwen Jacobs Montana State University [email protected] (406) 994-7334

Leonard Krishtalka Kansas State University [email protected] (785) 864-4540

Michael Levitt Stanford University [email protected] (650) 723-6800

Robert MacLeod University of Utah [email protected] (801) 587-9511

William K. Michener Long Term Ecological Research Network Office [email protected] (505) 272-7831

Margaret Palmer University of Maryland [email protected] (301) 405-3795

Phil Papadopoulos San Diego Supercomputer Center [email protected] (858) 822-2628

Jay Snoddy University of Tennessee Oak Ridge National Labs [email protected] (865) 974-3466

Susan Stafford University of Minnesota Twin Cities [email protected] (612) 624-1234

Lincoln Stein Cold Spring Harbor Laboratory [email protected] (516) 367-8380

Russell M. Taylor II University of North Carolina [email protected] (919) 962-1701

Tandy Warnow University of Texas Austin [email protected] (512) 471-9724

Ross T. Whitaker University of Utah [email protected] (801) 587-9549

John Wooley University of California San Diego [email protected] (858) 822-3630

Appendix IV
What Is Cyberinfrastructure? An Overview of the Blue-Ribbon Panel/Atkins Report: Overall NSF Considerations on CI

The extraordinary impact of computing and information technology, from transaction databases and wired commerce to the World Wide Web, is apparent already in society, in which the transition to an internet economy has happened in a fraction of the time taken by the telephone and the television in penetrating the Nation’s homes. In the sciences, where the future is being created today, the interplay is already at least as profound, and the impact of information technology (IT) seems certain to sustain a positive second derivative, continuing change empowering science and serving society. All disciplines of science and engineering face challenges inherent in studying more complex phenomena, managing to drink from the fire hose of contemporary instrumentation and experimental observation, and establishing data mining tools to probe beneath the surface of massive data sets. Indeed, a common theme to the analyses by the communities of science is that the next generation, the 21st Century of discovery, must deal with the actual details, not simple abstractions, recognizing that all natural systems are inherently complex, heterogeneous and nonlinear in behavior and in all cases span huge spatial and temporal scales. At the same time, the future of computing at all levels fuels data analysis on ever more effective scales, and Computing Platforms will exploit emerging trends: the rise of Linux; the introduction of productive, efficient commodity clusters; the rapid expansion of participation and availability toward ubiquitous access; and the maturation of grid computing.
Advanced IT, over the past decade in particular, has played an ever more important, pervasive, obvious and essential role across all of the sciences, in science and engineering education, and even across all elements of the economy and society. Workshops, journal articles, community white papers, and major symposia have highlighted the general and the specific (disciplinary) contributions arising from nothing less than a revolution in the technology, intellectual approaches, philosophy and vision, and implementation of IT. As a deeply embedded feature of research design and implementation, IT advances experimentation and data collection, archiving and analysis; indeed, new methods of analysis that draw out information present within the data but not obvious to manual observation (a process termed data mining) have often changed entire fields and directions of research. Extant data derives more value and is likely to be used and more fully evaluated and exploited, while at the same time, IT can identify limitations and point to the need for more sophisticated instruments or means of collecting data. Collaborations among researchers and the value of theoretical models and computational simulations have also become prominent features of experimental analysis due to IT.

[Figure: Evolution of the Computational Infrastructure. Timeline, 1985 to 2010. Labels: Prior Computing Investments; NSF Networking; Supercomputer Centers (SDSC, NCSA, PSC, CTC); PACI (NPACI and Alliance); Terascale (TCS, DTF, ETF); Cyberinfrastructure.]

Leading to the Atkins study, or blue ribbon panel, commissioned by CISE was the recognition by NSF and NSB of the pervasiveness of IT, including the neologism “cyberinfrastructure” or CI to reflect the revolution. Consequently, a discussion has been ongoing concerning, among other key matters, what CI would enable, what types and levels of CI would be useful in different communities for both education and research, what balance should be established between research into advancing the quality of CI itself versus research and funding toward implementation of instances of CI for specific communities, and what the balance should be between CI and other forms of infrastructure.

The report of the Blue-Ribbon Advisory Panel on Cyberinfrastructure, or simply the Atkins report, is the product of an interdisciplinary team convened to consider (1) the extant high performance computing program; (2) new directions for CISE; (3) implementation plans to meet the recommendations of the study or report itself. The report defined CI as:

A set of functions, capabilities, and/or services that make it easier, quicker, and less expensive to develop, provision, and operate a relatively broad range of applications. This can include facilities, software, tools, documentation, and associated human support.

The main recommendation contained in the report is summarized as:

We propose a large and concerted new effort, not just a linear extension of the current investment level and resources. NSF must recognize that the scope of shared cyber-infrastructure must be far broader than in the past: it includes computing cycles, but also greater bandwidth networking, massive storage, and managed information. Even these are not sufficient: there must be leadership on shared standards, middleware, and basic applications for scientific computation. The individual disciplines must take the lead on defining certain specialized software and hardware configurations, but in a context that encourages them to give back results for the general good of the research enterprise, and that facilitates innovative cross-disciplinary activities.

The basis for such a far reaching conclusion included:
• Increasing science/engineering community-based initiatives and budget demands in all Directorates for greater investment in information technology to support domain-specific research.
• A growing sense that science and engineering research and practice are reaching thresholds in performance and adoption of IT that could radically transform the “what”, “how” and “who” of scientific research on a truly global scale.

A very extensive appendix considers the breadth of CI, and includes an extensive survey of the scientific communities. Many aspects of the report should be considered in the sense that not just the biological sciences but most sciences have been making increasing use of IT and are only beginning to adopt fully the tools and approaches that would be most beneficial. The panel asserted that CI is now at the core of revolutionary science in every discipline, and that NSF must take a central role in its deployment; that NSF must consider people and software as co-equals with hardware in the context of CI, and recognize that to date NSF has not placed enough emphasis on people, on software, and on maintenance, compared to hardware acquisition and implementation. Networking, clusters, grids and many other aspects of modern computing that are quite relevant to biology have been considered, including the need to deliver the full cyberinfrastructure (IT, data and knowledge resources, scientific computing tools, grid access, visualization tools, etc.) to the desktop of individual students, as well as the need to create a new workforce, one that is expert in a domain but conversant enough in IT and scientific computing to exploit the opportunities and select the most productive collaborations.

Appendix V
Models for a Comprehensive CIBIO Based on Extant Test Beds Utilizing Advanced IT

Biomedical Informatics Research Network (BIRN)

The BIRN is an NCRR initiative aimed at creating a test bed to address biomedical researchers' need to access and analyze data at a variety of levels of aggregation located at diverse sites throughout the country. The BIRN test bed will bring together hardware and develop software necessary for a scalable network of databases and computational resources. Issues of user authentication, data integrity, security, and data ownership will also be addressed. The test bed focuses on research involving neuroimaging to take advantage of the relatively advanced level of sophistication of this community in the use of information technology. An essential feature of the test bed will be creation of infrastructure that can be deployed rapidly at other research centers throughout the country, which may have research emphases outside of neuroimaging. This means that in addition to scalability, the software/hardware must be reusable and extensible.

[Figure: National-Scale Testbed in Cyberinfrastructure: Federating Multi-Scale, Multi-Modal NeuroImaging Data. Labels: Sites. Could Expand Readily; Examples: Plant & Microbe Genome; Cellular Level; Tree of Life; also Indefinite Expansion to Many Laboratories, around USA.]

The BIRN test bed uses the most advanced networking available and will draw heavily on resources of the next generation Internet. The upgrade for networking is funded by the National Science Foundation for both design and implementation. The initial awards join General Clinical Research Centers (GCRCs) and co-located Biomedical Technology Research Resources, as team-level, shared user facilities (P41s) to establish the necessary infrastructure in the context of ongoing neuroimaging research projects. A separate grant for a "system integrator" that coordinates network, grid, and data mining software development as well as hardware configurations will be awarded to a recognized leader in such technical development and service efforts.

Specific BIRN program objectives today are to:
• Establish a stable high-performance infrastructure linking key NCRR-supported team efforts (NIH P41 Resources) and GCRC sites using the Internet 2 network.
• Establish distributed, linked data collections for all of the test bed projects.
• Enable the use of distributed, heterogeneous, grid-based computing resources for project-specific data storage and collaborative analysis.
• Enable data mining from multiple data collections or databases on neuroimaging.
• Develop software and hardware infrastructure that is stable, and can be reapplied and/or expanded to include other sites with different research foci (than the founding sites).
• Demonstrate effectiveness of these technologies in improving and extending research results (needs pull not technology push).
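
As a purely illustrative aside, the idea of distributed, linked data collections that can be mined together can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of BIRN software: the site names, the `scans` table, its columns, and the functions `make_site` and `federated_query` are all hypothetical, and each "site" is simulated by an in-memory SQLite database rather than a networked resource.

```python
import sqlite3

def make_site(name, rows):
    """Create an in-memory database standing in for one site's data collection.
    The schema (subject_id, modality, size_mb) is a hypothetical example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scans (subject_id TEXT, modality TEXT, size_mb REAL)")
    conn.executemany("INSERT INTO scans VALUES (?, ?, ?)", rows)
    conn.commit()
    return name, conn

def federated_query(sites, modality):
    """Run the same query at every site and merge the results,
    tagging each row with the site it came from (its provenance)."""
    merged = []
    for name, conn in sites:
        cursor = conn.execute(
            "SELECT subject_id, size_mb FROM scans "
            "WHERE modality = ? ORDER BY subject_id",
            (modality,),
        )
        for subject_id, size_mb in cursor:
            merged.append({"site": name, "subject_id": subject_id, "size_mb": size_mb})
    return merged

# Two hypothetical sites, each holding its own collection.
sites = [
    make_site("site_a", [("a01", "MRI", 120.0), ("a02", "EM", 900.0)]),
    make_site("site_b", [("b01", "MRI", 150.0)]),
]
results = federated_query(sites, "MRI")
for row in results:
    print(row["site"], row["subject_id"], row["size_mb"])
```

The design point the sketch tries to convey is that data stay at the sites that own them; only the query and the matching rows travel, and every merged row keeps a record of its source. A real deployment would additionally need the authentication, data integrity, and ownership controls mentioned above.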

BIRN has become an example for all implementations and discussions of CI, a project that could be called one of the founding activities for a comprehensive cyberinfrastructure. Thus, for BIO considerations, BIRN provides the most fully established model for how communities of researchers can share data and conduct research across distance. Built upon a foundation from previous work (supported by various NSF awards, namely a Grand Challenge award and NPACI, and by NIH funding as NCMIR, NBCR and the Collaboratory), BIRN researchers have created a set of tools that distribute computing, manipulate data, and control instruments.
http://www.nbirn.net

National Biomedical Computation Resource (NBCR)

The goal of NBCR, funded by the National Center for Research Resources as an NIH Research Resource, is to “Conduct, catalyze and advance biomedical research by harnessing, developing and deploying forefront computational and information technologies”. NBCR develops and deploys software and services for the biomedical community. The award supports five interconnected core projects, ranging from systems biology to grid services and visualization, which also interact with specific application drivers (Collaborative Projects), and have distinctive components of service projects (individuals using the software), training on the tools of the resource, and dissemination of the tools and also the activities of the resource. There are several other related NIH Research Resources in the area of biomedical research. NBCR, and these other research resources, are models of how to support a critical mass of researchers and academic professionals to develop, deploy, and support tools that are used by the community. This is a model for how CI essential for basic biology could be established and developed within BIO funded activities.
http://nbcr.sdsc.edu

Pacific Rim Application and Grid Middleware Assembly (PRAGMA)

A collaborative effort of 15 institutions located around the rim of the Pacific Ocean, PRAGMA’s mission is to establish sustained collaborations and to advance the use of grid technologies in applications among a community of investigators working with leading institutions around the Pacific Rim. To fulfill this mission, PRAGMA hosts a series of workshops for members; the workshops focus on developing underlying applications and a test bed for those applications.

[Figure: PRAGMA Partners (slide from the BIO Cyberinfrastructure workshop, 14-15 July 2003, Alexandria VA).]

Current applications include:
• Establishing workflows in biology (e.g., protein annotation);
• Linking via web services climate data (working with some LTER sites in US and East Asia Pacific Region ILTER);
• Running molecular solvation models in chemical biology and biochemistry;
• Extending remote access and collaboration or telescience applications and partnerships to more institutions.

The NSF supports the participation of scientists from the USA in PRAGMA, which includes an extraordinary breadth of commitment and participation from Asia. All of the institutions involved have committed to active participation and contribution of labor and resources for applications and the test bed.
http://www.pragma-grid.net

Telescience Examples

Effective telescience collaborations are supported by portal interfaces into tomographic codes (which have been distributed on various platforms in the US, Japan, and Taiwan), and by means for storing large data sets. Funding is from both NSF and NIH.
http://gridport.npcaci.edu/telescience

Science Environment for Ecological Knowledge (SEEK)

http://seek.ecoinformatics.org
A new NSF supported effort in ecology is already exploiting many features of modern, advanced information technology.
http://intranet.lternet.edu/archives/documents/Newsletters/DataBits/03spring/

Geosciences Network (GEON)

Involves the geosciences community and the scientific computing/data management community working together to build an infrastructure for sharing data; includes social scientists for understanding how the community coalesces.
http://www.geongrid.org

OptIPuter (an NSF large ITR)

The OptIPuter was funded by NSF to prototype and establish a distributed computing paradigm on campus sites, with connectivity to the end user’s desk, for which the network is no longer the bottleneck. L. Smarr, PI, for the physics applications, combines applications from Neuroscience (M. Ellisman) and the Environment (J. Orcutt), including cluster and grid solutions (P. Papadopoulos).

EuroGrid for Biology

An early overview of a European perspective on e-science and grids for biology; see Carole Goble, 2001, Computational Functional Genomics.
http://www.hgmp.mrc.ac.uk/CCP11/cfg_grid.pdf

Appendix VI
Potential Prototypes for BIO Implementation

NSF BIO will have to evaluate the strongest options for early prototypes and also establish milestones and a roadmap for the overall National and international effort in building a CI for the biological sciences. Areas of potential impact can be found across all of the BIO research domains, including such things as the Tree of Life, Deep Green, the extant database activities, LTER, new FIBR activities, comparative genomics and an understanding of cis regulatory elements during development, and so on. Due to the enhanced funding and focus on microbes by NSF and also by DOE, there is strong potential for BIO, in partnership with DOE (and perhaps secondarily with NIAID and NIEHS) to develop a CI for microbes, ranging from genomics to studies on environmental communities. As the plant genome projects advance their sequencing and database efforts, the overall information infrastructure to be developed will also provide a strong base for an NSF BIO CI. As NEON is developed and integrated with ongoing advances in the science enabled by LTERs, there will be extraordinary opportunities, building on sensor nets and meso-scale collaboratories, to exploit a cyberinfrastructure to accelerate our understanding of ecosystems and our ability to utilize that understanding to the benefit of society. As an example of what such a successful CI would be, an overview of one such development within the ecosystem community will be outlined.

Ecological Analysis Research Network (EARN) Within an Environmental Cyberinfrastructure (ECI)

A comprehensive, fully integrated computational environment, a cyberinfrastructure for ecology and ecosystem research, represents both an essential step for the ecological community and an obvious opportunity for NSF.
In analogy with other Research Networks, and for simplicity, such a comprehensive environment, with support for the right mix of people using the ideas and tools of ecology and exploiting advanced information technology and scientific computing, will be described here as an Ecological Analysis Research Network (EARN): one that enables the organization of ecodata into information, the integration of that information into knowledge, and the synthesis to wisdom about the broad impacts and roles of complex ecosystems, through deep partnerships among experimental, theoretical and computational approaches. An EARN should aim to create a digital knowledge environment that enables vast increases in the ability of ecologists to communicate and collaborate to establish a new synthesis for ecological science and its interconnection with environmental science. The NSF Advisory Committee on Environmental Research and Education (AC-ERE) has described the creation of an Environmental Cyberinfrastructure (ECI), a set of scientific computing and advanced information technology tools for the study of complex environmental systems. (See the Occasional Paper of the NSF Advisory Committee for Environmental Research and Education, AC-ERE 1, May 2003.) AC-ERE has described a very complete plan for implementing cyberinfrastructure to underpin their vision for environmental research and education (Complex Environmental Systems: Synthesis for Earth, Life, and Society in the 21st Century [CES]). A challenge and an opportunity for an EARN would be the need for the ecoscience deployment to be deeply interconnected with the overall environmental research infrastructure. Considerable background about what that would entail is in the AC-ERE Paper. In sum, the AC described the range of empowerment arising from the development of collaboratory software, visualization tools, data mining and data management techniques, and software and embedded hardware for sensor nets.
To realize the vision will require the creation of common platforms, applications that support data management and manipulation, computer intensive applications, and advances in high-end computing as well as desktop computing. The CI, through proper management, will lead to increased scientific democratization that is particularly relevant to internationalizing our community and our research. Access to data and computing capabilities will be enhanced worldwide. This is particularly important for ecologists given the rapid environmental changes taking place globally as well as the rapid loss of biodiversity in parts of Asia, Africa, and South America. Furthermore, internationalization through enhanced ECI is critical to answering the many key ecological questions that require multiple forms of archived data, real-time global scale data and the ability to work locally (at home) with remote collaborators working in all parts of the world.

What would a visionary ECI include?
• Software development and production related to data acquisition, management, and application, including tools for data representation and pattern recognition.
• Hardware, software, and expertise to enhance communication and provide digital libraries.
• Hardware and middleware development for computational grids, data grids, and networking.
• A community of computer scientists and technical experts who work routinely with ecologists to develop and then put into production new tools as well as maintain software widely used by the community.
• Research associated with the development and deployment of intelligent sensor networks for monitoring and research.
• More sophisticated mathematical and statistical tools and researchers trained in their usage.
• Advances in model building (numerical, analytical), in linking models and data, and development of common modeling frameworks that can be used by broad communities.

Suggested early steps by the community and for the NSF include the following. However, a detailed implementation plan will require further community input.

1. BIO should establish a coordination center for the national effort, the EARNCC, which would serve as the hub for technology introduction and dissemination and for intellectual organization and process management; when mature, the EARN should include about two dozen regional nodes.
(More than that many sites would be required for comprehensive analysis of ecosystem dynamics and interactions, and certainly more would be required for what will have to be an international effort, but an initial EARN would inevitably have to be a pilot, and only when the path is established should more coordination centers and associated nodes be added.) The coordination center, at a single site, would consist of a group of computer scientists, ecologists, and academic professions, who serve as a national coordinating group. A. The center director in concert with a science advisory board would make decisions on funding of EARN development proposals – these would not be investigator-originated ecological research proposals but rather (for example) proposals to create new software or tools that would help a particular community of ecological researchers and managers. B. The center should be an entrepreneurial entity (in identifying and leading community efforts to set important directions for EARN) and the coordination entity for a network of service nodes (i.e., the center would provide support to regional service nodes who would be the links to individual investigators and research teams). 2. The expectations for national EARN centers A. Coordinate national funding program for CI tool and idea development and hosts visiting researcher teams. B. House “gardens” or diverse, well-maintained, organized collections of software that are used widely by the community, but watched, pruned, or allowed to die at the Center following community input. C. Ensures that the development of critical software is accompanied by suitable software engineering activities that ensure utility on a variety of platforms, is fast and efficient, etc. (otherwise, researchers may develop novel software in codes that are not useable by the community or are not interoperable, or the software will remain as research grade and never be truly robust enough for widespread use). 
This recognizes that not only is exploratory research is needed for tool development, but so is production. Research and development are two separate entities.

3. The expectations for the regional service nodes

A. Provide advice and consultation to individuals and research teams on technical issues (hardware, software, network design).
B. Host education and training programs and workshops (particularly focused on graduate students, postdocs, and mid-career scientists).

4. The establishment of funding for EARN Centers:

A. The center should be funded from multiple sources. Federal agencies that fund basic research should provide the primary funding base to ensure longevity and sustainability of personnel and applications. The center is not meant to be a service arm for any agency but a means by which CI discoveries and applications will be made that will benefit the broad research community. For the regional service nodes, depending on the magnitude of the project or issue, modest fees would be charged for this service, and these should be acceptable line items in grant proposals submitted to standard research programs.
B. Federal agencies should put in place initiatives to provide funding for capturing legacy datasets. No other vehicle exists to ensure this valuable information is not lost to the community’s efforts at deeper insight.
C. Federal agencies in the Nation should work in concert with international agencies, and look to the international programs within our own agencies to help bring the Nation’s talent to bear on the intellectual and technology transfer requirements, toward an international infrastructure. The programs in this case would assist scientists in developing countries to upgrade or obtain CI resources and to interconnect with worldwide knowledge resources.
D. Programs (funding) and reward systems (including awards by professional societies and other means of recognition) must be established for the development of models that can be widely used by the community for simulation studies, including ecological prediction, for guiding experiments, and for facilitating cross-scale work.
E. Agencies, looking to partnerships of statisticians and experimentalists, should put in place competitions that focus on advances in developing algorithms for rapidly evaluating the quality and validity of data (QA/QC).
F. Other competitions, for partnerships with mathematicians and computer scientists, are needed to focus on the development of methods for dealing with fuzzy data, to capture essential information in a quantitative way.
G. Competitions for partnerships between experimentalists, informaticians, and computer scientists will be needed to drive the development of diverse semantic tools to integrate/federate data, mine the information, empower further modeling and analysis, and convert it to usable knowledge (metadata tools).

*Note: items D-G above particularly involve programs and incentives for computer scientists to work with ecologists, or to work on ecologically motivated problems, AND all of the above require sustained support of CI personnel (academic professionals, programmers, and experts in visualization, networking, and other computer science applications).

Ecological forecasting, evolutionary ecology, and biodiversity, a set of interconnected ecological research arenas, urgently require a set of tools from scientific computing and advanced information technology that would be provided by this comprehensive CI:

• Analytical modeling and simulation
• Scientific computing at levels of high performance
• Database creation, organization, development, mining, and other queries and analyses
• Communication networks
• Wireless sensor networks (for data sets in general and particularly for biodiversity)

The following are a set of environmental “Science Nuggets” that outline the interesting adventures of discovery that a CI for the ecological sciences would enable.

The Ecology of Sound

This is a new and untapped field, which has been ignored largely because there has not been the technology to ask the right questions and, even more seriously, because there has never before been an opportunity to develop the requisite CI. The field will be able to answer questions such as: What are the sound niches of different organisms? Do organisms within an ecosystem always partition the entire sound spectrum, and if not, why? How does this partitioning influence communication among organisms? The tools of CI would be essential for monitoring and modeling bioacoustics. The experimental setup would involve small microphones deployed across habitats (such as wetlands or cities) recording the full spectrum of sounds (both anthropogenic and natural). The observations would create multi-gigabyte databases daily, at a minimum, and would create more depending on how pervasive the technology became in the community. Data must be obtained, sampled, stored, organized, managed, processed, and mined, passing through the typical data chain, to provide the understanding. This will be a huge challenge for ecoinformatics. The data sets will be quite large, beyond ready human comprehension, so creative CI tools will be required to provide visual representations and entry points for analysis, as well as means for the detection of sophisticated, deeply embedded patterns and other still unknown phenomena.

Environmental Event Science

An environmental event detection, recording, and analysis process, or event science, would begin with smart sensor networks that detect changes and also enhance sampling during those events. This research challenge has plagued environmental biologists for years. (Solving the challenge is critical for basic research as well as for applied research; the challenge is also related to national security issues.) We can answer questions like: How do storm events or temperature inversions influence N deposition, and what are the pathways by which the elements move through the soil and groundwater to reach surface waters? Such a science would require that the EARN link directly to research innovations, because it depends on advances in computing hardware (related to wireless communication and power systems) and software (intelligent sensor webs that turn on and off to conserve power; smart sensors that take over other functions as necessary when others fail). The infrastructure required includes the development of software for sensors, possibly collaborations with materials scientists and computer scientists on sensor development itself, and the use of efficient and flexible network designs that transmit (wirelessly) huge amounts of digital information in a form that can be immediately captured and stored directly. The overall infrastructure, as a bottom line, will deliver absolutely HUGE data streams that will allow us to ask and answer a new generation of questions about the environment and how ecosystems function. The kinds of questions are an extension of LTER research and represent ultimately the goals of NEON, but the actual technology needed for this revolutionary expansion of ecological science would be enabled by establishing a strong baseline effort in environmental event science.
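The idea of a sensor that detects a change and then enhances sampling during the event can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not something specified by the workshop: the class name AdaptiveSampler, the sampling intervals, the deviation threshold, and the use of a simple running baseline are all hypothetical, and a real sensor network would need far more (power management, failover, data transmission).

```python
from collections import deque
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class AdaptiveSampler:
    """Hypothetical event-aware sampler for a single sensor (illustrative only)."""

    base_interval_s: float = 600.0  # routine sampling period (assumed value)
    event_interval_s: float = 10.0  # burst sampling period during an event (assumed)
    z_threshold: float = 3.0        # deviations from baseline that count as an event
    min_spread: float = 0.01        # floor on the spread so a flat baseline is usable
    window: int = 12                # number of quiet readings kept as the baseline
    history: deque = field(default_factory=deque)

    def next_interval(self, reading: float) -> float:
        """Record one reading and return the delay before the next sample."""
        in_event = False
        if len(self.history) >= 3:
            baseline = mean(self.history)
            spread = max(pstdev(self.history), self.min_spread)
            in_event = abs(reading - baseline) / spread > self.z_threshold
        if not in_event:
            # Only quiet readings update the baseline, so a long event
            # is not absorbed into what counts as normal conditions.
            self.history.append(reading)
            if len(self.history) > self.window:
                self.history.popleft()
        return self.event_interval_s if in_event else self.base_interval_s


if __name__ == "__main__":
    sampler = AdaptiveSampler()
    for value in [10.0, 10.2, 9.9, 10.1, 25.0, 24.0, 10.0]:
        print(value, sampler.next_interval(value))
```

In this sketch the sensor samples slowly under quiet conditions and switches to the short interval only while readings depart sharply from the recent baseline, which is one simple way to conserve power while still capturing a storm or inversion in detail.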

Appendix VII

References for CIBIO Report

There is an enormous wealth of material available, both online and in print, addressing the possibilities for twenty-first century science within the framework of a modern cyberinfrastructure. This literature covers a wide range of topics and disciplines; indeed, it would be impossible to reproduce all of this information here. Instead, we have chosen to highlight below some of the best outside references we have come across. The references listed below can be accessed by visiting CIBIO on the web at http://research.calit2.net/cibio/.

Cyberinfrastructure Background Information

• Charting Our Cyberinfrastructure Future
  A presentation made by Dr. Deborah Crawford, Chair of the NSF Cyberinfrastructure Working Group (CIWG), 6 June 2003.
• Computing: Getting us on the Path to Wisdom (HTML)
  Slide show presentation and remarks made by Dr. Rita R. Colwell, Director of the NSF, at SC2002: From Terabytes to Insights, 19 November 2002.
• Cyberinfrastructure - Issues and Challenges
  An introduction to Cyberinfrastructure prepared by Dr. Francine Berman, Director of the San Diego Supercomputer Center.
• Cyberinfrastructure: Opportunities for Connections and Collaboration
  Paper exploring the concepts of envisioning and building a cyberinfrastructure. By Joan Lippincott, Associate Executive Director, Coalition for Networked Information (2002).
• Cyberinfrastructure: Revolutionizing Science & Engineering
  Slide show presentation made by Peter A. Freeman, NSF Assistant Director for Computer and Information Science and Engineering (CISE), at CENIC 2003, 9 May 2003.
• The Grid: A New Infrastructure for 21st Century Science
  Article from the February 2002 issue of Physics Today. Explores some of the transformations in science and engineering that are possible thanks to cyberinfrastructure.
• The Human Side of the Cyberinfrastructure
  Short article by Fran Berman (NPACI and SDSC Director) in the April - June 2001 issue of enVision magazine.

Cyberinfrastructure Applications for the Biological Sciences

• Challenges Faced in the Integration of Biological Information
  Article by Su Yun Chung and John C. Wooley. Excerpt from Bioinformatics: Managing Scientific Data.
    o Appendix: Biological Resources
      Also excerpted from Bioinformatics: Managing Scientific Data. This appendix contains useful biological resources, databases, organizations, and applications.
• Computational Cell Biology - Challenges and Opportunities for an Emerging Field
  A report based on a roundtable discussion at the First International Symposium on Computational Cell Biology (March 2001).
• Computational Infrastructure Workshop for the Genomes to Life Program
  Report on cyberinfrastructure-related workshop of the US Department of Energy's Genomes to Life program (March 2002).
• Next Generation Biology: The Role of Next Generation Computing
  Overview and report of the workshop held 20 July 1998 by the NSF, National Institutes of Health, and US Department of Energy.
• The Biomedical Information Science and Technology Initiative
  Cyberinfrastructure and its potential uses for Biomedical Computing. Document prepared by the Working Group on Biomedical Computing, Advisory Committee to the Director, National Institutes of Health (3 June 1999).
• Trends in Computational Biology: A Summary Based on a RECOMB Plenary Lecture, 1999
  Early paper discussing the potential applications of computers to biology. Useful as a historical primer for the current effort to build a cyberinfrastructure for the biological sciences.

Cyberinfrastructure Applications for Other Scientific Disciplines

• A Geosciences Network for Understanding the Whole Earth
  Article from the April-June 2003 issue of EnVision. Details the efforts of researchers at the San Diego Supercomputer Center to build a cyberinfrastructure network for the Geological Sciences.
• Collaborative Large-scale Engineering Assessment Network for Environmental Research
  Link to the UC Berkeley-directed infrastructure program for environmental field facilities mandated to help in the formation of environment-friendly engineering and policy options.
• Cyberinfrastructure in the Mathematical and Physical Sciences [See Page 5]
  Excerpt from Background for the Discussion on Long-Range Planning at the MPSAC Meeting of April 3-4, 2003, a publication of the NSF Directorate for Mathematical and Physical Sciences.
• Cyberinfrastructure Needs of the Engineering Community
  PowerPoint presentation given by Esin Gulari, Acting Assistant Division Director for Engineering at the NSF.
• Cybersecurity, Cyberinfrastructure Top Priorities at Information Technology Forum
  Lead article from the April 2003 issue of NASULGC (National Association of State Universities and Land Grant Colleges) Newsline.
• Environmental Cyberinfrastructure Needs for Distributed Sensor Networks
  A report from an NSF sponsored workshop held 12 - 14 August 2003 at the Scripps Institution of Oceanography.
• Environmental Cyberinfrastructure: Tools for the Study of Complex Environmental Systems
  An Occasional Paper of the NSF Advisory Committee for Environmental Research and Education, 1 May 2003.
• Environmental Cyberinfrastructure: Turning Data into Knowledge
  Presentation made by Margaret Cavanaugh at the NSF SINE Workshop - SDSC (29 October 2001).
• GEON: Cyberinfrastructure for the Geosciences
  Web site coordinating the NSF effort to enable scientific discovery and improve education in Earth Sciences through information technology research.
• Mathematics & Biology: The Interface - Challenges & Opportunities
  Web version of a publication from June of 1992. Relevant discussion of the ways that Mathematics and Biology can be coordinated, especially with the use of new technologies.
• Securing Cyberinfrastructure
  Presentation made by Michael McRobbie, Ph.D., to the NSF Workshop on Cyberinfrastructure Research for Homeland Security (26 February 2003).
• Steering Committee for Cyberinfrastructure Research and Development in the Atmospheric Sciences (CyRDAS)
  Links to publications, research, and information provided by CyRDAS and the NSF National Center for Atmospheric Research.
• The OptIPuter
  Article from the November 2003 issue of Communications of the ACM detailing the use of advanced computing architecture in conjunction with distributed cyberinfrastructure.

Cyberinfrastructure Workshops, Lectures, and Reports

• CISE Lecture List
  Links to presentations made as part of the Distinguished Lecture Series in the NSF's Directorate for Computer and Information Science and Engineering (CISE).
    o Recent Presentations made by the Assistant Director of CISE
• Cyberinfrastructure for Engineering Research and Education
  Reports and presentations made at a workshop sponsored by the NSF Directorate for Engineering, 5 - 6 June 2003.
• Cyberinfrastructure for Environmental Research and Education
  Agenda and links to reports presented at a workshop sponsored by the National Science Foundation, 30 October - 1 November 2002.
• Global Terabit Research Network: Building Global Cyber Infrastructure
  Presentation made by Steven Wallace at the Global Research Networking Summit, May 2002.
• Management and Models for Cyberinfrastructure
  Links provided by the University of Michigan's School of Information to presentations from two workshops (held 14 - 15 May and 29 - 30 July 2003) and one town hall meeting (held 23 June 2003).
• Report of the Task Force on the Future of the NSF Supercomputer Centers Program (Hayes Report)
• Revolutionizing Science and Engineering through Cyberinfrastructure: Report of the National Science Foundation Advisory Panel on Cyberinfrastructure (February 2003)
• Science and Engineering Infrastructure for the 21st Century: The Role of the National Science Foundation (February 2003)
• Workshop on Cyberinfrastructure Research for Homeland Security
  Access to the list of guests, presentations, and draft reports prepared at a workshop held from 25 - 27 February, 2003.
• Workshop on EPSCoR Cyberinfrastructure for Large-Scale Science and Engineering
  Notes and final report of the "Experimental Program to Stimulate Competitive Research" conference that took place from 27 - 29 April 2003.

Historical Roots of Cyberinfrastructure

• Advanced Computing in the Life Sciences: Proceedings of the Workshop on the Applications of Supercomputers in the Life Sciences
  Report from a December 1984 workshop that set the stage for today's cyberinfrastructure applications.
• Computational Biology for Biotechnology: Applications of Scientific Computing Biotechnology
  This 1989 article by John C. Wooley is another excellent look into the historical background of today's cyberinfrastructure applications.
• Research Opportunities in Computational Biology
  Results from an invitational workshop held in Washington, DC (13 - 15 December, 1989). Many of the concepts discussed are relevant to today's cyberinfrastructure.

Miscellaneous Links of Interest

• American Association for the Advancement of Science (AAAS) Research and Development Budget and Policy Program
    o AAAS Report XXVIII: Research and Development in FY 2004
• Computing Research in the FY 2003 Budget Request
  Excerpt from AAAS Report XXVII: Research and Development in FY 2003
• Digital Libraries
  David Hart details the NSF's approach to the creation of usable digital resources in the next millennium, and how cyberinfrastructure will play a role.
• Implications Of Information Technologies: IT Overview
  Links to multiple web sites in the Division of Science Resources Statistics of the National Science Foundation.
• National Science, Technology, Engineering, and Mathematics Digital Library (NSDL) Program Solicitation
  Grant and program solicitation information (NSF document 03-530).
• National Science Foundation FY 2004 Budget Request Overview
    o FY 2004 Summary of NSF Accounts
• National Science Foundation FY 2005 Budget Request to Congress
    o Biological Sciences Budget Request
      Excerpts from FY 2005

Other Agency Web Sites

• Digital Libraries Initiative Phase 2
  Homepage of the Digital Libraries Initiative Phase 2 (DLI2) with links to ongoing programs, sponsors, and research.
• Environmental Research and Education (ERE)
  Homepage of the National Science Foundation's ERE division, detailing current projects and research.
• Experimental Program to Stimulate Competitive Research (EPSCoR)
  Homepage of the NSF EPSCoR project, detailing current projects and research.
• Information Technology Research (ITR) Program
  Homepage of the NSF's ITR Program, with links to current projects and research.
• National Center for Atmospheric Research / University Corporation for Atmospheric Research
  Information on funding and projects administered by the NSF Atmospheric Sciences subdivision.
• National Computational Science Alliance
• National Coordination Office for Information Technology Research and Development
• National Ecological Observatory Network
• National Partnership for Advanced Computational Infrastructure (NPACI)
    o NPACI Alpha Projects (Biology-Related Applications)
• Network for Earthquake Engineering Simulation
• NSF Middleware Initiative Program Solicitation
  Grant and program solicitation information (NSF document 03-513).
• Science and Technology Centers (NSF Office of Integrative Activities)
  The NSF Science and Technology Centers fund 11 centers in a variety of disciplines, including many related to cyberinfrastructure.
• TeraGrid
  Homepage of the NSF project to create the world's largest, most comprehensive distributed infrastructure for open scientific research.
• The US Long Term Ecological Research Network
  Homepage for the program established by the NSF in 1980 to support research in long-term ecological phenomena in the United States.
