Data Mining and Grid
Ian Foster
Computation Institute, Argonne National Lab & University of Chicago
http://ianfoster.typepad.com
www.ci.uchicago.edu
www.ci.anl.gov
Data Mining
Grid
In the Next 50 Years, We Must …
● Increase energy production by a factor of 5, while reducing GHG emissions by a factor of 2 or more
● Mitigate and adapt to climate change
● Address increasingly drug-resistant diseases
● Provide meaningful livelihoods for 9B people
Innovation
Innovation as a Systems Problem
● Quasi-ubiquitous Internet …
● … connects many potential innovators
  ◆ Millions of scientists, billions of people
● Who need to leverage
  ◆ Enormous data of tremendous complexity
  ◆ Immensely powerful computing
  ◆ Experimental apparatus of great power
We must address problem solving as a distributed, end-to-end, systems problem
Grid: A Unifying Concept & Technology
Grid enables the federation of resources
• Distributed computers, storage, data, people, …
• Networks provide connectivity
• Software & standards provide the “glue”
• Infrastructure services facilitate operation
[Figure labels: Infrastructure; Applications; Research; Data Acquisition; Imaging Instruments; Advanced Visualization, Analysis; Computational Resources; Large-Scale Databases. Credit: Mark Ellisman]
Grid Infrastructure
● Massive computing and storage
● Service interfaces facilitate access and use
TeraGrid
Open Science Grid
Software and Standards
Uniform interfaces, security mechanisms, Web service transport, monitoring
[Diagram labels: Angle (Bob Grossman); Weka4WS (Domenico Talia); Tool; User Svc; Host Env; Registry; Globus; GRAM; File Transfer; GridFTP; DAI; Computers; Specialized resource; File system; Database]
Globus Downloads
[Figures: last 24 hours; last month]
First Generation Grids: On-Demand/Batch Computing
Focus on aggregation of many resources for massively (data-)parallel applications
EGEE
Globus
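The aggregation idea on this slide can be illustrated with a minimal, single-machine sketch: split a dataset into chunks, process each chunk independently on a pool of workers, and combine the partial results. Everything here is an assumption for illustration. Local threads stand in for grid resources, and `split`, `analyze`, and `run_data_parallel` are hypothetical names, not part of Globus, EGEE, or any grid middleware.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split(data: List[int], n_chunks: int) -> List[List[int]]:
    """Partition data into at most n_chunks contiguous slices of similar size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def analyze(chunk: List[int]) -> int:
    """Hypothetical per-chunk task; a stand-in for a real analysis job."""
    return sum(x * x for x in chunk)


def run_data_parallel(data: List[int], n_workers: int = 4) -> int:
    """Run analyze() on every chunk concurrently, then aggregate the partial results."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(analyze, chunks))
    return sum(partials)


# Example: sum of squares over 1000 values, computed chunk by chunk.
total = run_data_parallel(list(range(1000)))
```

On a real grid, each chunk would be submitted as a batch job to a remote resource; the split, map, and aggregate structure stays the same.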
Applications: High Energy Physics
Globus
Integrating Data and Computing, on Demand
Public PUMA Knowledge Base: information about proteins analyzed against ~2 million gene sequences
Back Office Analysis on Grid: millions of BLAST, BLOCKS, etc., on OSG and TeraGrid
Natalia Maltsev et al., http://compbio.mcs.anl.gov/puma2
Second Generation Grids: Service-Oriented Science
● Empower many more users by enabling on-demand access to services
● Grids become an enabling technology for service-oriented science (or business)
  ◆ Grid infrastructures host services
  ◆ Grid technologies used to build services
Science Gateways
“Service-Oriented Science”, Science, 2005
Service-Oriented Science
People create services (data or functions) …
… which I discover (& decide whether to use) …
… & compose to create a new function …
… & then publish as a new service.
I find “someone else” to host services, so I don’t have to become an expert in operating services & computers!
I hope that this “someone else” can manage security, reliability, scalability, …
“Service-Oriented Science”, Science, 2005
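The create, discover, compose, and publish cycle described on this slide can be sketched with a small in-memory registry. This is only an illustration under stated assumptions: `ServiceRegistry`, `compose`, and the two example services are hypothetical names, and a plain Python function stands in for a hosted Web service.

```python
from typing import Callable, Dict, List


class ServiceRegistry:
    """Hypothetical in-memory registry; stands in for a real service registry."""

    def __init__(self) -> None:
        self._services: Dict[str, dict] = {}

    def publish(self, name: str, func: Callable, description: str) -> None:
        """Make a service available under a name, with a searchable description."""
        self._services[name] = {"func": func, "description": description}

    def discover(self, keyword: str) -> List[str]:
        """Return the names of services whose name or description mentions keyword."""
        keyword = keyword.lower()
        return sorted(
            name for name, entry in self._services.items()
            if keyword in name.lower() or keyword in entry["description"].lower()
        )

    def get(self, name: str) -> Callable:
        return self._services[name]["func"]


def compose(*funcs: Callable) -> Callable:
    """Chain services left to right: the output of one is the input of the next."""
    def composed(value):
        for func in funcs:
            value = func(value)
        return value
    return composed


def normalize(values: List[float]) -> List[float]:
    """Example service: rescale values to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def mean(values: List[float]) -> float:
    """Example service: arithmetic mean."""
    return sum(values) / len(values)


registry = ServiceRegistry()
# People create services ...
registry.publish("normalize", normalize, "rescale a data series to [0, 1]")
registry.publish("mean", mean, "average of a data series")
# ... which I discover ...
found = registry.discover("data series")
# ... compose to create a new function ...
normalized_mean = compose(registry.get("normalize"), registry.get("mean"))
# ... and then publish as a new service.
registry.publish("normalized_mean", normalized_mean, "mean of a rescaled data series")
```

Hosting, security, reliability, and scalability are exactly what this sketch leaves out; the slide's point is that a third party would provide them.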
Earth System Grid
● On-demand access to climate simulation data
  ◆ Multiple archives
  ◆ Interactive query
  ◆ Per-collection control
  ◆ Server-side processing
● Major scientific impact
  ◆ >5000 users
  ◆ >200 TB downloaded
  ◆ >300 scientific papers
Globus
www.earthsystemgrid.org (DOE OASCR)
Cancer Biomedical Informatics Grid (caBIG)
caBIG: sharing of infrastructure, applications, and data.
Data Integration!
Globus
caBIG Under the Covers
[Diagram labels: Analytical Service; Grid-Enabled Client; Grid Portal; Tools 1 to 4; Grid Data Service; caArray; Gene Database; Protein Database; Image; Microarray; Research Center; NCICB; Grid Services Infrastructure (Metadata, Registry, Query, Invocation, Security, etc.)]
Globus
LIGO Data Grid
LIGO Gravitational Wave Observatory
[Map labels include: Birmingham, Cardiff, AEI/Golm]
● Replicating >1 Terabyte/day to 8 sites
● >150 million replicas so far
● MTBF = 1 month
Globus
www.globus.org/solutions
The Angle Project
Globus
Social Informatics Data Grid
Globus
Bennett Bertenthal et al., www.sidgrid.org
A Few Example Research Themes
● Service discovery, composition, provisioning
  ◆ SOA, virtualization, cloud computing, …
● Large-scale (distributed) computation
  ◆ E.g., Swift, Kepler, Taverna
● Provenance
  ◆ E.g., “Provenance Challenge”
● “Virtual organizations”
  ◆ E.g., attribute-based authorization, trust
● Integration of physical systems
  ◆ Optimization of end-to-end workflows
Security Services for Virtual Organization Policy
● Attribute Authority (ATA)
  ◆ Issue signed attribute assertions (incl. identity, delegation & mapping)
● Authorization Authority (AZA)
  ◆ Decisions based on assertions & policy
[Diagram labels: Delegation Assertion (“User B can use Service A”); VO Resource Admin; User A; VO User B; VO ATA; VO Member Attribute; Mapping ATA (VO-A Attr, VO-B Attr); VO AZA; VO A Service; VO B Service]
Globus
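The division of labor between the two authorities can be sketched as follows: an attribute authority issues signed attribute assertions, and an authorization authority verifies the signature and then decides from the asserted attributes and a policy. This is a minimal sketch under loud assumptions. The class names, the policy format, and the use of a shared-key HMAC in place of public-key signatures are all illustrative choices, not the Globus security implementation.

```python
import hashlib
import hmac
import json
from typing import Dict


def _sign(key: bytes, payload: dict) -> str:
    """Sign a payload with HMAC-SHA256 (assumption: stands in for a real digital signature)."""
    message = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class AttributeAuthority:
    """Hypothetical ATA: issues signed attribute assertions about a subject."""

    def __init__(self, name: str, key: bytes) -> None:
        self.name = name
        self._key = key

    def issue(self, subject: str, attributes: Dict[str, object]) -> dict:
        payload = {"issuer": self.name, "subject": subject, "attributes": attributes}
        return {"payload": payload, "signature": _sign(self._key, payload)}


class AuthorizationAuthority:
    """Hypothetical AZA: decides access from verified assertions and a policy."""

    def __init__(self, trusted_keys: Dict[str, bytes], policy: Dict[str, dict]) -> None:
        self._trusted_keys = trusted_keys  # issuer name -> verification key
        self._policy = policy              # service name -> required attributes

    def decide(self, assertion: dict, service: str) -> bool:
        payload = assertion["payload"]
        key = self._trusted_keys.get(payload["issuer"])
        if key is None:
            return False  # unknown issuer
        if not hmac.compare_digest(_sign(key, payload), assertion["signature"]):
            return False  # assertion was altered or not issued by this ATA
        required = self._policy.get(service)
        if required is None:
            return False  # no policy for this service: deny by default
        attributes = payload["attributes"]
        return all(attributes.get(k) == v for k, v in required.items())


ata = AttributeAuthority("VO ATA", b"demo-key")
aza = AuthorizationAuthority(
    trusted_keys={"VO ATA": b"demo-key"},
    policy={"Service A": {"vo_member": True}},
)
assertion = ata.issue("User B", {"vo_member": True})
# A forged assertion reuses the signature with a different subject.
forged = {"payload": {**assertion["payload"], "subject": "User C"},
          "signature": assertion["signature"]}
```

With a shared key, the AZA could also forge assertions; a real deployment would use asymmetric signatures so that only the ATA can sign.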
Swift (www.ci.uchicago.edu/swift)
An Integrated View of Modeling, Simulation, Experiment, & Informatics
[Diagram labels: Problem Specification; Modeling and Simulation; Experimental Design; High-throughput Experiments; Analysis & Visualization; Bioinformatics Analysis Tools; Integrated Biological Databases]
Robot Scientist
“The robot scientist project aims to develop a computer system capable of originating its own experiments, physically doing them, interpreting the results, & then repeating the cycle.”
[Diagram labels: Background Knowledge; Machine Learning; Consistent Hypothesis; Experiment(s) selection; Robot (Biomek 200); Experiment(s); Results; Analysis; Final Theory]
Stephen Muggleton, Ross King et al., UK
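The cycle quoted above (select an experiment, run it, interpret the result, repeat) can be illustrated with a toy hypothesis-elimination loop. This is a sketch under stated assumptions, not the project's actual machine-learning method: hypotheses are simple predicates, the "robot" is a function call, and `select_experiment` and `run_cycle` are hypothetical names.

```python
from typing import Callable, Dict, List

Hypothesis = Callable[[int], bool]


def select_experiment(hypotheses: Dict[str, Hypothesis], candidates: List[int]) -> int:
    """Pick the experiment whose predicted outcomes split the hypotheses most evenly."""
    def imbalance(x: int) -> int:
        positives = sum(1 for h in hypotheses.values() if h(x))
        return abs(2 * positives - len(hypotheses))
    return min(candidates, key=imbalance)


def run_cycle(
    hypotheses: Dict[str, Hypothesis],
    candidates: List[int],
    run_experiment: Callable[[int], bool],
) -> List[str]:
    """Repeat select, run, interpret until one hypothesis or no experiments remain."""
    remaining = dict(hypotheses)
    untried = list(candidates)
    while len(remaining) > 1 and untried:
        x = select_experiment(remaining, untried)
        untried.remove(x)
        outcome = run_experiment(x)  # stand-in for the physical experiment
        # Keep only the hypotheses consistent with the observed result.
        remaining = {name: h for name, h in remaining.items() if h(x) == outcome}
    return sorted(remaining)


# Toy example: three candidate "theories" about which inputs give a positive result.
hypotheses = {
    "even": lambda x: x % 2 == 0,
    "greater_than_5": lambda x: x > 5,
    "multiple_of_3": lambda x: x % 3 == 0,
}
final_theory = run_cycle(
    hypotheses,
    candidates=list(range(1, 11)),
    run_experiment=lambda x: x % 3 == 0,  # hidden ground truth for the simulation
)
```

Choosing the experiment that splits the surviving hypotheses most evenly is one simple selection heuristic; a real system would also weigh experiment cost and measurement noise.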
Team Science meets Data Deluge