071011 Data Mining Foster


Data Mining and Grid

Ian Foster

Computation Institute
Argonne National Lab & University of Chicago
http://ianfoster.typepad.com

www.ci.uchicago.edu

www.ci.anl.gov

Data Mining

Grid

In the Next 50 Years, We Must …

● Increase energy production by a factor of 5, while reducing GHG emissions by a factor of 2 or more
● Mitigate and adapt to climate change
● Address increasingly drug-resistant diseases
● Provide meaningful livelihoods for 9B people

Innovation

Innovation as a Systems Problem

● Quasi-ubiquitous Internet …
● … connects many potential innovators
  ◆ Millions of scientists, billions of people
● Who need to leverage
  ◆ Enormous data of tremendous complexity
  ◆ Immensely powerful computing
  ◆ Experimental apparatus of great power

We must address problem solving as a distributed, end-to-end, systems problem

Grid: A Unifying Concept & Technology

Grid enables the federation of resources
• Distributed computers, storage, data, people, …
• Networks provide connectivity
• Software & standards provide the “glue”
• Infrastructure services facilitate operation

[Figure: federated resources labeled DATA ACQUISITION, IMAGING INSTRUMENTS, COMPUTATIONAL RESOURCES, LARGE-SCALE DATABASES, ADVANCED VISUALIZATION, ANALYSIS; side labels: Infrastructure, Applications, Research. Credit: Mark Ellisman]

Grid Infrastructure

● Massive computing and storage
● Service interfaces facilitate access and use

TeraGrid

Open Science Grid

Software and Standards

Uniform interfaces, security mechanisms, Web service transport, monitoring

[Figure: layered diagram. People named: Bob Grossman, Domenico Talia. Tools: Angle, Weka4WS. Services: user services in hosting environments, registry, GRAM, file transfer (GridFTP), DAI. Resources (icons labeled IBM): computers, specialized resource, file system, database. Logo: Globus]

Globus Downloads

[Figures: downloads in the last 24 hours; downloads in the last month]

First Generation Grids: On-Demand/Batch Computing

Focus on aggregation of many resources for massively (data-)parallel applications

EGEE

Globus

Applications: High Energy Physics

Globus


Integrating Data and Computing, on Demand

Public PUMA Knowledge Base: information about proteins analyzed against ~2 million gene sequences

Back Office Analysis on Grid: millions of BLAST, BLOCKS, etc., on OSG and TeraGrid

Natalia Maltsev et al., http://compbio.mcs.anl.gov/puma2

Second Generation Grids: Service-Oriented Science

● Empower many more users by enabling on-demand access to services
● Grids become an enabling technology for service-oriented science (or business)
  ◆ Grid infrastructures host services
  ◆ Grid technologies used to build services

Science Gateways

“Service-Oriented Science”, Science, 2005

Service-Oriented Science

People create services (data or functions) …
… which I discover (& decide whether to use) …
… & compose to create a new function …
… & then publish as a new service.

● I find “someone else” to host services, so I don’t have to become an expert in operating services & computers!
● I hope that this “someone else” can manage security, reliability, scalability, …

“Service-Oriented Science”, Science, 2005
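The create, discover, compose, publish cycle described on this slide can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the in-memory `registry`, the `publish`, `discover`, and `compose` helpers, and the temperature-conversion services are hypothetical and do not correspond to any Globus or Web services API.

```python
from typing import Callable, Dict

# Hypothetical in-memory registry standing in for a real service registry.
Service = Callable[[float], float]
registry: Dict[str, Service] = {}


def publish(name: str, service: Service) -> None:
    """Publish a service under a name so that others can discover it."""
    registry[name] = service


def discover(name: str) -> Service:
    """Look up a published service by name; raises KeyError if absent."""
    return registry[name]


def compose(first: Service, second: Service) -> Service:
    """Build a new function that feeds the output of `first` into `second`."""
    return lambda x: second(first(x))


# People create services (data or functions) ...
publish("celsius_to_kelvin", lambda c: c + 273.15)
publish("kelvin_to_rankine", lambda k: k * 1.8)

# ... which I discover, compose into a new function, and publish as a new service.
celsius_to_rankine = compose(
    discover("celsius_to_kelvin"), discover("kelvin_to_rankine")
)
publish("celsius_to_rankine", celsius_to_rankine)
```

In a real deployment the registry and the hosting would be run by the “someone else” mentioned on the slide; the sketch only shows the shape of the composition step.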

Earth System Grid

● On-demand access to climate simulation data
  ◆ Multiple archives
  ◆ Interactive query
  ◆ Per-collection control
  ◆ Server-side processing
● Major scientific impact
  ◆ >5000 users
  ◆ >200 TB downloaded
  ◆ >300 scientific papers

Globus

www.earthsystemgrid.org — DOE OASCR


Cancer Biomedical Informatics Grid (caBIG)

caBIG: sharing of infrastructure, applications, and data.

Data Integration!

Globus

caBIG Under the Covers

[Figure: architecture diagram. Labels: Grid-Enabled Client, Grid Portal, Tools 1 to 4, Analytical Service, Grid Data Service, caArray, Gene Database, Protein Database, Microarray, Image, NCICB, Research Center (twice), all connected through the Grid Services Infrastructure (Metadata, Registry, Query, Invocation, Security, etc.). Logo: Globus]

LIGO Data Grid

LIGO Gravitational Wave Observatory

[Map labels: Birmingham, Cardiff, AEI/Golm. Logo: Globus]

● Replicating >1 Terabyte/day to 8 sites
● >150 million replicas so far
● MTBF = 1 month

www.globus.org/solutions
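To put the replication figure in perspective, the short calculation below estimates the sustained network rate it implies. Two assumptions are not stated on the slide: 1 TB is taken as 10^12 bytes, and every one of the 8 sites is assumed to receive a full copy of each day's data. Since the slide says ">1 Terabyte/day", the results are lower bounds.

```python
# Sustained bandwidth implied by replicating 1 TB/day to each of 8 sites.
# Assumptions (not from the slide): 1 TB = 10**12 bytes, and every site
# receives a full copy of each day's data.
BYTES_PER_DAY = 10**12
SECONDS_PER_DAY = 24 * 60 * 60
SITES = 8


def sustained_mbit_per_s(bytes_per_day: int) -> float:
    """Average rate in megabits per second needed to move bytes_per_day."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


per_site = sustained_mbit_per_s(BYTES_PER_DAY)
aggregate = per_site * SITES
print(f"per site: {per_site:.1f} Mbit/s, aggregate: {aggregate:.1f} Mbit/s")
```

Under these assumptions each site needs roughly 93 Mbit/s around the clock, and the source side about 741 Mbit/s in aggregate.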

The Angle Project

Globus

Social Informatics Data Grid

Globus

Bennett Bertenthal et al., www.sidgrid.org

A Few Example Research Themes

● Service discovery, composition, provisioning
  ◆ SOA, virtualization, cloud computing, …
● Large-scale (distributed) computation
  ◆ E.g., Swift, Kepler, Taverna
● “Virtual organizations”
  ◆ E.g., attribute-based authorization, trust
● Provenance
  ◆ E.g., “Provenance Challenge”
● Integration of physical systems
  ◆ Optimization of end-to-end workflows

Security Services for Virtual Organization Policy

● Attribute Authority (ATA)
  ◆ Issue signed attribute assertions (incl. identity, delegation & mapping)
● Authorization Authority (AZA)
  ◆ Decisions based on assertions & policy

[Figure: labels: VO Resource Admin, User A, VO User B, Delegation Assertion “User B can use Service A”, VO ATA, Mapping ATA, VO Member Attribute (twice), VO AZA, VO A Service, VO B Service, VO-A Attr, VO-B Attr. Logo: Globus]
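The division of labor between the two authorities can be sketched in a few lines. This is an illustration under stated assumptions, not the Globus security implementation: the shared `ATA_KEY`, the `POLICY` table, and all function names are hypothetical, and an HMAC stands in for the public-key signatures a real attribute authority would use.

```python
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical shared key. Real deployments would use public-key signatures;
# HMAC is only a stand-in that keeps this sketch self-contained.
ATA_KEY = b"demo-attribute-authority-key"

# Hypothetical policy: attribute values required to use each service.
POLICY = {"service_a": {"vo": "VO-A", "role": "member"}}


def _sign(payload: str) -> str:
    """Compute the keyed digest that plays the role of the ATA signature."""
    return hmac.new(ATA_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_assertion(subject: str, attributes: dict) -> dict:
    """Attribute Authority: issue a signed attribute assertion."""
    payload = json.dumps(
        {"subject": subject, "attributes": attributes}, sort_keys=True
    )
    return {"payload": payload, "signature": _sign(payload)}


def verify_assertion(assertion: dict) -> Optional[dict]:
    """Return the decoded claims if the signature checks out, else None."""
    expected = _sign(assertion["payload"])
    if not hmac.compare_digest(expected, assertion["signature"]):
        return None
    return json.loads(assertion["payload"])


def authorize(assertion: dict, service: str) -> bool:
    """Authorization Authority: decide from a verified assertion and policy."""
    claims = verify_assertion(assertion)
    if claims is None or service not in POLICY:
        return False
    required = POLICY[service]
    return all(claims["attributes"].get(k) == v for k, v in required.items())


user_b = issue_assertion("User B", {"vo": "VO-A", "role": "member"})
tampered = dict(user_b, payload=user_b["payload"].replace("member", "admin"))
print(authorize(user_b, "service_a"), authorize(tampered, "service_a"))
```

The point of the separation is that the service never inspects identities directly: it trusts the signature on the assertion and leaves the decision to policy, so attributes and policy can be managed by different parties within the virtual organization.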

Swift (www.ci.uchicago.edu/swift)


An Integrated View of Modeling, Simulation, Experiment, & Informatics

[Figure: cycle with boxes labeled Problem Specification, Modeling and Simulation, Analysis & Visualization, Experimental Design, High-throughput Experiments, Analysis & Visualization, Bioinformatics Analysis Tools, Integrated Biological Databases]

Robot Scientist

“The robot scientist project aims to develop a computer system capable of originating its own experiments, physically doing them, interpreting the results, & then repeating the cycle.”

[Figure: loop with boxes labeled Background Knowledge, Machine Learning, Consistent Hypothesis, Experiment(s) selection, Robot (Biomek 200), Experiment(s), Results, Analysis, Final Theory]

Stephen Muggleton, Ross King et al., UK
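The cycle in the quotation (select an experiment, run it, interpret the result, repeat) can be illustrated with a toy hypothesis-elimination loop. This is not the Robot Scientist's actual method; the dose-threshold hypotheses, the `run_cycle` function, and the simulated experiment are all invented for illustration.

```python
from typing import Callable, List


def run_cycle(hypotheses: List[int], experiment: Callable[[int], bool]) -> int:
    """Repeat select -> run -> interpret until one hypothesis stays consistent.

    Each hypothesis h claims "the response switches on at dose h", so an
    experiment at dose d is predicted to succeed exactly when d >= h.
    """
    candidates = sorted(set(hypotheses))
    while len(candidates) > 1:
        # Experiment selection: pick a dose on which the candidates disagree,
        # splitting them as evenly as possible.
        dose = candidates[(len(candidates) - 1) // 2]
        # Run the experiment (here: call the simulated robot).
        outcome = experiment(dose)
        # Interpretation: keep only hypotheses consistent with the result.
        candidates = [h for h in candidates if (dose >= h) == outcome]
    return candidates[0]


# Simulated lab in which the true threshold is 7 (an invented value).
TRUE_THRESHOLD = 7
final_theory = run_cycle(list(range(1, 17)), lambda dose: dose >= TRUE_THRESHOLD)
print(final_theory)
```

Choosing the experiment that splits the remaining hypotheses most evenly keeps the number of experiments logarithmic in the number of hypotheses, which is the intuition behind automated experiment selection.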

Team Science meets Data Deluge

