Research Overview

Network Based Computing

Dhabaleswar K. (DK) Panda
Department of Computer Science and Engineering
The Ohio State University
E-mail: [email protected]
http://nowlab.cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Presentation Outline
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions

Current and Next Generation Applications and Computing Systems
• Big demand for
  – High Performance Computing (HPC)
  – Servers for file systems, web, multimedia, database, visualization, ...
  – Internet data centers (Multi-tier Datacenters)
• Processor speed continues to grow
  – Doubling every 18 months
  – Multicore chips are emerging
• Commodity networking also continues to grow
  – Increase in speed
  – Many features
  – Affordable pricing
• Clusters are increasingly becoming popular to design next generation computing systems
  – Scalability
  – Modularity
  – Can be easily upgraded as computing and networking technologies improve

Growth in Commodity Networking Technologies
• Representative commodity networking technologies and their entries into the market
  – Ethernet (1979- ): 10 Mbit/sec
  – Fast Ethernet (1993- ): 100 Mbit/sec
  – Gigabit Ethernet (1995- ): 1000 Mbit/sec
  – ATM (1995- ): 155/622/1024 Mbit/sec
  – Myrinet (1993- ): 1 Gbit/sec
  – Fibre Channel (1994- ): 1 Gbit/sec
  – InfiniBand (2001- ): 2.5 Gbit/sec (1X) -> 2 Gbit/sec
  – InfiniBand (2003- ): 10 Gbit/sec (4X) -> 8 Gbit/sec
  – 10 Gigabit Ethernet (2004- ): 10 Gbit/sec
  – InfiniBand (2005- ): 20 Gbit/sec (4X DDR) -> 16 Gbit/sec and 30 Gbit/sec (12X SDR) -> 24 Gbit/sec
• Bandwidth has increased 12 times in the last 5 years

Network-Based Computing
[Diagram: four converging trends enabling Network-Based Computing]
• Computing Systems: powerful, cost-effective, commodity
• Web Technology: powerful, ability to integrate computation and communication
• Networking: high bandwidth, cost-effective, commodity
• Emerging Applications: collaborative/interactive, wide range of computation and communication characteristics

Presentation Outline
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions

Different Types of NBC Systems
• Six major categories
  – Dedicated Cluster over a LAN
  – Clusters with Interactive Clients
  – Globally-Interconnected Systems
  – Interconnected Clusters over WAN
  – Multi-tier Data Centers
  – Integrated environment of multiple clusters

Dedicated Cluster over a SAN/LAN
• Packaged inside a rack/room, interconnected with a SAN/LAN
[Diagram: compute nodes connected through one or more networks]

Trends for Computing Clusters in the Top 500 List
• Top 500 list of Supercomputers (www.top500.org)
  – June 2001: 33/500 (6.6%)
  – Nov 2001: 43/500 (8.6%)
  – June 2002: 80/500 (16%)
  – Nov 2002: 93/500 (18.6%)
  – June 2003: 149/500 (29.8%)
  – Nov 2003: 208/500 (41.6%)
  – June 2004: 291/500 (58.2%)
  – Nov 2004: 294/500 (58.8%)
  – June 2005: 304/500 (60.8%)
  – Nov 2005: 360/500 (72.0%)
  – June 2006: 364/500 (72.8%)

Cluster with Interactive Clients
[Diagram: clients connected to a cluster over a LAN/WAN]

Globally-Interconnected Systems
[Diagram: SMP systems interconnected over a WAN]

Interconnected Clusters over WAN
[Diagram: workstation clusters interconnected over a WAN]

Generic Three-Tier Model for Data Centers
[Diagram: Tier 1 routers/servers, Tier 2 application servers, and Tier 3 database servers, connected through switches to back-end storage]
• All major search engines and e-commerce companies are using clusters for multi-tier datacenters
  – Google, Amazon, Financial institutions, ...

Integrated Environment with Multiple Clusters
[Diagram: a compute cluster, a storage cluster, and a datacenter for visualization and data mining, each on its own LAN, interconnected through a front end over a LAN/WAN]

Presentation Outline
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions

Our Vision
• Network-Based Computing Group to take a lead in
  – Proposing new designs for high performance NBC systems by taking advantage of modern networking technologies and computing systems
  – Developing better middleware/API/programming environments so that modern NBC applications can be developed and implemented in a scalable fashion
• Carry out research in an integrated manner (systems, networking, and applications)

Group Members • Currently 12 PhD students and one MS student

– Lei Chai, Qi Gao, Wei Huang, Matthew Koop, Rahul Kumar, Ping Lai, Amith Mamidala, Sundeep Narravula, Ranjit Noronha, Sayantan Sur, Gopal Santhanaraman, Karthik Vaidyanathan, and Abhinav Vishnu

• All current students are supported as RAs • Programmers – Shaun Rowland – Jonathan Perkins

• System Administrator – Keith Stewart

External Collaboration
• Industry
  – IBM TJ Watson and IBM
  – Intel
  – Dell
  – SUN
  – Apple
  – Linux Networx
  – NetApp
  – Mellanox and other InfiniBand companies
• Research Labs
  – Sandia National Lab
  – Los Alamos National Lab
  – Pacific Northwest National Lab
  – Argonne National Lab
• Other units on campus
  – Bio-Medical Informatics (BMI)
  – Ohio Supercomputer Center (OSC)
  – Internet 2 Technology Evaluation Centers (ITEC-Ohio)

Research Funding
• National Science Foundation
  – Multiple grants
• Department of Energy
  – Multiple grants
• Sandia National Lab
• Los Alamos National Lab
• Pacific Northwest National Lab
• Ohio Board of Regents
• IBM Research
• Intel
• SUN
• Mellanox
• Cisco
• Linux Networx
• Network Appliance

Research Funding (Cont'd)
• Three Major Large-Scale Grants
  – Started in Sept '06
  – DOE Grant for Next Generation Programming Models
    • Collaboration with Argonne National Laboratory, Rice University, Univ. of Berkeley, and Pacific Northwest National Laboratory
  – DOE Grant for Scalable Fault Tolerance in Large-Scale Clusters
    • Collaboration with Argonne National Laboratory, Indiana University, Lawrence Berkeley National Laboratory, and Oak Ridge National Laboratory
  – AVETEC Grant for Center for Performance Evaluation of Cluster Interconnects
    • Collaboration with major DOE Labs and NASA
• Several continuation grants from other funding agencies

Presentation Outline
• Systems Area in the Department
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions

Requirements for Modern Clusters, Cluster-based Servers, and Datacenters
• Requires
  – Good Systems Area Network with excellent performance (low latency and high bandwidth) for interprocessor communication (IPC) and I/O
  – Good Storage Area Networks for high performance I/O
  – Good WAN connectivity in addition to intra-cluster SAN/LAN connectivity
  – Quality of Service (QoS) for supporting interactive applications
  – RAS (Reliability, Availability, and Serviceability)

Research Challenges
[Layered view, top to bottom:]
• Applications (Scientific, Commercial, Servers, Datacenters)
• Programming Models (Message Passing, Sockets, Shared Memory w/o Coherency)
• Low Overhead Substrates: Point-to-point Communication, Collective Communication, Synchronization & Locks, Fault Tolerance, I/O & File Systems, QoS
• Networking Technologies (Myrinet, Gigabit Ethernet, InfiniBand, Quadrics, 10GigE, RNICs) & Intelligent NICs
• Commodity Computing System Architectures (single, dual, quad, ...); Multi-core architecture

Research Challenges and Their Dependencies
[Layered view, top to bottom:]
• Applications
• System Software/Middleware Support
• Networking and Communication Support
• Trends in Networking/Computing Technologies

Major Research Directions
• System Software/Middleware
  – High Performance MPI on InfiniBand Cluster
  – Clustered Storage and File Systems
  – Solaris NFS over RDMA
  – iWARP and its Benefits to High Performance Computing
  – Efficient Shared Memory on High-Speed Interconnects
  – High Performance Computing with Virtual Machines (Xen-IB)
  – Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
  – High Performance Networking for TCP-based Applications
  – NIC-level Support for Collective Communication and Synchronization
  – NIC-level Support for Quality of Service (QoS)
  – Micro-Benchmarks and Performance Comparison of High-Speed Interconnects

More details on http://nowlab.cse.ohio-state.edu/ -> Projects

Why InfiniBand?
• Traditionally, HPC clusters have used
  – Myrinet or Quadrics
    • Proprietary interconnects
  – Fast Ethernet or Gigabit Ethernet for low-end clusters
• Datacenters have used
  – Ethernet (Gigabit Ethernet is common)
  – 10.0 Gigabit Ethernet is emerging and not yet available at low cost
  – QoS and RAS capabilities are not there in Ethernet
• Storage and file systems have used
  – Ethernet with IP
  – Fibre Channel
  – Does not support high bandwidth and QoS, etc.

Need for A New Generation Computing, Communication & I/O Architecture • Started around late 90s • Several initiatives were working to address high performance I/O issues – Next Generation I/O – Future I/O

• Virtual Interface Architecture (VIA) consortium was working towards high performance interprocessor communication • An attempt was made to consolidate all these efforts as an open standard • The idea behind InfiniBand Architecture (IBA) was conceived ... • The original target was for Data Centers

IBA Trade Organization
• IBA Trade Organization was formed with seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
• Goal: To design a scalable and high performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies
• Many other industry partners participated in the effort to define the IBA architecture specification
• InfiniBand Architecture (Volume 1, Version 1.0) was released to the public on Oct 24, 2000
• www.infinibandta.org

Rich Set of Features of IBA
• High Performance Data Transfer
  – Interprocessor communication and I/O
  – Low latency (~1.0-3.0 microsec), high bandwidth (~1-2 GigaBytes/sec), and low CPU utilization (5-10%)
• Flexibility for WAN communication
• Multiple Transport Services
  – Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
  – Provides flexibility to develop upper layers
• Multiple Operations
  – Send/Recv
  – RDMA Read/Write
  – Atomic Operations (very unique)
    • High performance and scalable implementations of distributed locks, semaphores, and collective communication operations
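To make the use of remote atomics concrete, here is a minimal verbs-level sketch of acquiring a distributed lock with IBA's remote compare-and-swap. This is a generic illustration, not code from our projects: the helper name and parameters are illustrative, the queue pair is assumed to be connected already, the local result buffer registered, and the lock word's remote address and rkey exchanged out of band.

/* Illustrative sketch: try to flip a 64-bit lock word on a remote node
 * from 0 (unlocked) to 1 (locked) with a remote atomic compare-and-swap.
 * QP/MR setup and rkey exchange are assumed; error handling is omitted. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_lock_cas(struct ibv_qp *qp, struct ibv_mr *result_mr, uint64_t *result,
                  uint64_t lock_addr, uint32_t lock_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)result,      /* previous value of the lock word lands here */
        .length = sizeof(uint64_t),
        .lkey   = result_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode                = IBV_WR_ATOMIC_CMP_AND_SWP;
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = lock_addr;  /* address of the lock word on the peer */
    wr.wr.atomic.rkey        = lock_rkey;
    wr.wr.atomic.compare_add = 0;          /* expect "unlocked" ...                 */
    wr.wr.atomic.swap        = 1;          /* ... and atomically swap in "locked"   */
    return ibv_post_send(qp, &wr, &bad_wr);
}

After polling the completion queue, the caller inspects *result: 0 means the lock was acquired, while any other value means another node holds it and the request should be retried or backed off.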

Principles behind RDMA Mechanism
[Diagram: two nodes, each with memory, processors (P0, P1), a PCI/PCI-Express bus, and an IBA adapter; data moves directly between the two memories through the adapters]
• No involvement by the processor at the receiver side (RDMA Write/Put)
• No involvement by the processor at the sender side (RDMA Read/Get)
• 1-2 microsec latency (for short data)
• 1.5 GBytes/sec bandwidth (for large data)
• 3-5 microsec for atomic operations
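For concreteness, a minimal verbs-level sketch of posting such a one-sided RDMA Write is shown below. It is illustrative only (the helper name is made up for this example) and assumes the queue pair is connected, the local buffer is registered, and the peer's virtual address and rkey were exchanged ahead of time.

/* Illustrative sketch: post a one-sided RDMA Write. The receiver's CPU is
 * not involved; completion is detected by polling the local CQ. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                    uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,      /* local source buffer (registered) */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided put */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* generate a local completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* destination address on the peer */
    wr.wr.rdma.rkey        = rkey;               /* peer's registered-memory key */
    return ibv_post_send(qp, &wr, &bad_wr);
}

An RDMA Read (Get) is posted the same way with the IBV_WR_RDMA_READ opcode, in which case the local buffer is the destination and the processor on the data's home node is not involved.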

Rich Set of Features of IBA (Cont'd)
• Range of Network Features and QoS Mechanisms
  – Service Levels (priorities)
  – Virtual Lanes
  – Partitioning
  – Multicast
    • Allows designing a new generation of scalable communication and I/O subsystems with QoS
• Protected Operations
  – Keys
  – Protection Domains
• Flexibility for supporting Reliability, Availability, and Serviceability (RAS) in next generation systems with IBA features
  – Multiple CRC fields
    • Error detection (per-hop, end-to-end)
  – Fail-over
    • Unmanaged and managed
  – Path Migration
  – Built-in Management Services

[Diagram: multicast setup in an IBA subnet: compute nodes send multicast join requests to the subnet manager, which performs multicast setup at the switches and activates the required links]

Designing MPI Using InfiniBand Features
• MPI Design Components: Protocol Mapping, Flow Control, Communication Progress, Multirail Support, Buffer Management, Connection Management, Collective Communication, One-sided (Active/Passive)
• Communication substrate of InfiniBand Features: Communication Semantics, Transport Services, Atomic Operations, Communication Management, Completion & Event, End-to-End Flow Control, Multicast, Quality of Service

High Performance MPI over IBA – Research Agenda at OSU
• Point-to-point communication
  – RDMA-based design for both small and large messages
• Collective communication
  – Taking advantage of IBA hardware multicast for Broadcast
  – RDMA-based designs for barrier, all-to-all, all-gather
• Flow control
  – Static vs. dynamic
• Connection Management
  – Static vs. dynamic
  – On-demand
• Multi-rail designs
  – Multiple ports/HCAs
  – Different schemes (striping, binding, adaptive)
• MPI Datatype Communication
  – Taking advantage of scatter/gather semantics of IBA
• Fault-tolerance Support
  – End-to-end reliability to tolerate I/O errors
  – Network fault-tolerance with Automatic Path Migration (APM)
  – Application-transparent checkpoint and restart
• Currently extending our designs for iWARP adapters

Overview of MVAPICH and MVAPICH2 Projects (OSU MPI for InfiniBand)
• Focusing on
  – MPI-1 (MVAPICH)
  – MPI-2 (MVAPICH2)
• Open Source (BSD licensing) with anonymous SVN access
• Directly downloaded and used by more than 440 organizations worldwide (in 30 countries)
• Available in the software stacks of
  – Many IBA and server vendors
  – OFED (OpenFabrics Enterprise Distribution)
  – Linux distributors
• Empowers multiple InfiniBand clusters in the TOP 500 list
• URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/

Larger IBA Clusters using MVAPICH and Top500 Rankings (June '06)
• 6th: 4000-node dual Intel Xeon 3.6 GHz cluster at Sandia
• 28th: 1100-node dual Apple Xserve 2.3 GHz cluster at Virginia Tech
• 66th: 576-node dual Intel Xeon EM64T 3.6 GHz cluster at Univ. of Sherbrooke (Canada)
• 367th: 356-node dual Opteron 2.4 GHz cluster at Trinity Center for High Performance Computing (TCHPC), Trinity College Dublin (Ireland)
• 436th: 272-node dual Intel Xeon EM64T 3.4 GHz cluster at SARA (The Netherlands)
• 460th: 200-node dual Intel Xeon EM64T 3.2 GHz cluster at Texas Advanced Computing Center/Univ. of Texas
• 465th: 315-node dual Opteron 2.2 GHz cluster at NERSC/LBNL
• More are getting installed ...

MPI-level Latency (One-way): IBA (Mellanox and PathScale) vs. Myrinet vs. Quadrics
[Figures: small-message and large-message one-way latency (us) vs. message size (0 bytes to 256 KB) for MVAPICH-X, MVAPICH-Ex-1p, MVAPICH-Ex-2p, MVAPICH-Gen2-ExDDR-1p, MPICH/MX, and MPICH/QsNet-X; small-message latencies fall in the 2.0-4.9 microsec range]

• SC ’03 • Hot Interconnect ’04 • IEEE Micro (Jan-Feb) ’05, one of the best papers from HotI ‘04

09/26/05
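These one-way numbers come from the standard ping-pong measurement between two processes; a minimal MPI sketch of that pattern (illustrative code, not OSU's actual benchmark) looks like the following, where the one-way latency is half of the averaged round-trip time.

/* Illustrative MPI ping-pong latency sketch for two ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char buf[4] = {0};               /* small 4-byte message */
    int rank, i, iters = 10000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half the average round-trip time */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));
    MPI_Finalize();
    return 0;
}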

MPI-level Bandwidth (Uni-directional): IBA (Mellanox and PathScale) vs. Myrinet vs. Quadrics
[Figure: uni-directional bandwidth (MillionBytes/sec) vs. message size (4 bytes to 1 MB) for MVAPICH-X, MVAPICH-Ex-1p, MVAPICH-Ex-2p, MVAPICH-Gen2-DDR-Ex-1p, MPICH/MX, and MPICH/QsNet-X; peak bandwidths of the individual curves are roughly 1492, 1472, 970, 910, 891, and 494 MillionBytes/sec]

09/26/05

MPI-level Bandwidth (Bi-directional): IBA (Mellanox and PathScale) vs. Myrinet vs. Quadrics
[Figure: bi-directional bandwidth (MillionBytes/sec) vs. message size (4 bytes to 1 MB) for MVAPICH-X, MVAPICH-Ex-1p, MVAPICH-Ex-2p, MVAPICH-Gen2-DDR-Ex-1p, MPICH/MX-X, and MPICH/QsNet-X; peak bandwidths of the individual curves are roughly 2724, 2628, 1841, 943, 908, and 901 MillionBytes/sec]

08/15/05

MPI over InfiniBand Performance (Dual-core Intel Bensley Systems with Dual-Rail DDR InfiniBand)
[Figures: 4 processes on each node concurrently communicating over dual-rail InfiniBand DDR (Mellanox); uni-directional bandwidth reaches 2808 MB/sec, bi-directional bandwidth reaches 4553 MB/sec, and the messaging rate reaches 3.16 million messages/sec at 16 bytes]

M. J. Koop, W. Huang, A. Vishnu and D. K. Panda, Memory Scalability Evaluation of Next Generation Intel Bensley Platform with InfiniBand, to be presented at Hot Interconnect Symposium (Aug. 2006).

High Performance and Scalable Collectives
• Reliable MPI Broadcast using IB hardware multicast
  – Capability to broadcast a 1 KB message to 1024 nodes in less than 40 microsec
• RDMA-based designs for
  – MPI_Barrier
  – MPI_All_to_All

J. Liu, A. Mamidala and D. K. Panda, Fast and Scalable MPI-Level Broadcast using InfiniBand's Hardware Multicast Support, Int'l Parallel and Distributed Processing Symposium (IPDPS '04), April 2004.
S. Sur and D. K. Panda, Efficient and Scalable All-to-all Exchange for InfiniBand-based Clusters, Int'l Conference on Parallel Processing (ICPP '04), Aug. 2004.
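From the application's point of view, the hardware-multicast and RDMA-based designs sit entirely beneath the standard MPI interface, so a broadcast is still just a call to MPI_Bcast. A simple timing loop of the kind used to measure the average cost of a 1 KB broadcast (an illustrative sketch, not the actual benchmark code) is:

/* Illustrative sketch: time repeated 1 KB broadcasts from rank 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char buf[1024];                  /* 1 KB payload, as in the slide */
    int rank, i, iters = 1000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Bcast(buf, sizeof(buf), MPI_BYTE, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg 1KB broadcast time: %.2f us\n", (t1 - t0) * 1e6 / iters);
    MPI_Finalize();
    return 0;
}

Whether the implementation underneath uses point-to-point messages or IB hardware multicast is invisible at this level, which is what makes the multicast-based design transparent to existing MPI applications.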

Analytical Model – Extrapolation for 1024 nodes
[Figures: actual and estimated broadcast latency vs. message size (1 byte to 4 KB), with and without hardware multicast, together with the latency extrapolated to 1024 nodes for both cases]

• Can broadcast a 1Kbyte message to 1024 nodes in 40.0 microsec

Fault Tolerance
• Component failures are the norm in large-scale clusters
• This imposes a need for reliability and fault tolerance
• Working along the following three angles
  – End-to-end reliability with memory-to-memory CRC
    • Already available with the latest release
  – Reliable networking with Automatic Path Migration (APM) utilizing redundant communication paths
    • Will be available in a future release
  – Process fault tolerance with efficient checkpoint and restart
    • Already available with the latest release

Checkpoint/Restart Support for MVAPICH2
• Process-level fault tolerance
  – User-transparent, system-level checkpointing
  – Based on BLCR from LBNL to take coordinated checkpoints of the entire program, including the front end and individual processes
  – Designed novel schemes to
    • Coordinate all MPI processes to drain all in-flight messages in IB connections
    • Store communication state, buffers, etc. while taking the checkpoint
    • Restart from the checkpoint

A Running Example
• Shows how to checkpoint/restart LU from the NAS benchmarks
• There are two terminals:
  – Left one for the normal run
  – Right one for checkpoint/restart
• Steps: start LU; get its PID; take a checkpoint using the PID; stop the running LU with CTRL-C; restart LU from the checkpoint; the restarted LU is not affected and finishes running

NAS Execution Time and HPL Performance
[Figures: execution time of NAS benchmarks (lu.C.8, bt.C.9, sp.C) and HPL performance (GFLOPS) under checkpointing intervals of 1, 2, 4, and 8 minutes (with the corresponding number of checkpoints) compared to runs with no checkpointing]
• The execution time of the NAS benchmarks increased slightly with frequent checkpointing
• Frequent checkpointing also has some impact on HPL performance metrics

Q. Gao, W. Yu, W. Huang and D. K. Panda, Application-Transparent Checkpoint/Restart for MPI over InfiniBand, International Conference on Parallel Processing, Columbus, Ohio, 2006.

Continued Research on several components of MPI
• Collective communication
  – All-to-all
  – All-gather
  – Barrier
  – All-reduce
• Intra-node shared memory support
  – User-level, kernel-level
  – NUMA support
  – Multi-core support
• MPI-2
  – One-sided (get, put, accumulate)
  – Synchronization (active, passive)
  – Message ordering
• Datatype

Continued Research on several components of MPI (Cont’d) • Multi-rail support – Multiple NICs – Multiple ports/NIC

• uDAPL support – Portability across different networks (Myrinet, Quadrics, GigE, 10GigE)

• Fault-tolerance – Check-point restart – Automatic Path Migration

• MPI-IO – High performance I/O support

Major Research Directions
• System Software/Middleware
  – High Performance MPI on InfiniBand Cluster
  – Clustered Storage and File Systems
  – Solaris NFS over RDMA
  – iWARP and its Benefits to High Performance Computing
  – Efficient Shared Memory on High-Speed Interconnects
  – High Performance Computing with Virtual Machines (Xen-IB)
  – Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
  – High Performance Networking for TCP-based Applications
  – NIC-level Support for Collective Communication and Synchronization
  – NIC-level Support for Quality of Service (QoS)
  – Micro-Benchmarks and Performance Comparison of High-Speed Interconnects

More details on http://nowlab.cse.ohio-state.edu/ -> Projects

Can we enhance NFS Performance with RDMA and InfiniBand? • Many enterprise environments use file servers and NFS • Most of the current systems use Ethernet • Can such environments use InfiniBand? • NFS over RDMA standard has been proposed • Designed and implemented this on InfiniBand in Open Solaris – Taking advantage of RDMA mechanisms

• Joint work with funding from Sun and Network Appliance

NFS over RDMA – New Design
• NFS over RDMA in Open Solaris
  – New designs with server-based RDMA read and write operations
  – Allows different registration modes, including IB FMR (Fast Memory Registration)
  – Enables multiple chunk support
  – Supports multiple concurrent outstanding RDMA operations
  – Interoperable with Linux NFSv3 and v4
  – Currently tested on InfiniBand only
• Advantages
  – Improved performance
  – Reduced CPU utilization and improved efficiency
  – Eliminates risk to NFS servers
• The code will be available with Solaris and Open Solaris in 2007

IOzone Read Performance
[Figures: read and read-write bandwidth (MB/s) and client utilization (%/(MB/s)) vs. number of IOzone threads (1-12)]
• Sun x2100s (dual 2.2 GHz Opteron CPUs with x8 PCI-Express adapters)
• IOzone read bandwidth up to 713 MB/s

IOzone Write Performance
[Figure: write throughput (MB/s) vs. number of IOzone threads (1-8) for tmpfs and sinkfs back ends]
• System: Sun x2100s (dual 2.2 GHz Opteron CPUs with x8 PCI-Express adapters)
• File size: 128 MB for tmpfs and 1 GB for sinkfs
• IO size: 1 MB

Major Research Directions
• System Software/Middleware
  – High Performance MPI on InfiniBand Cluster
  – Clustered Storage and File Systems
  – Solaris NFS over RDMA
  – iWARP and its Benefits to High Performance Computing
  – Efficient Shared Memory on High-Speed Interconnects
  – High Performance Computing with Virtual Machines (Xen-IB)
  – Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
  – High Performance Networking for TCP-based Applications
  – NIC-level Support for Collective Communication and Synchronization
  – NIC-level Support for Quality of Service (QoS)
  – Micro-Benchmarks and Performance Comparison of High-Speed Interconnects

More details on http://nowlab.cse.ohio-state.edu/ -> Projects

Why Targeting Virtualization?
• Ease of management
  – Virtualized clusters
  – VM migration to deal with system upgrades/failures
• Customized OS
  – Light-weight OSes: not widely adopted due to management difficulties
  – VMs make those techniques possible
• System security & productivity
  – Users can do 'anything' in a VM; in the worst case they crash a VM, not the whole system

Challenges
• Performance overhead of I/O virtualization
• Management framework to take advantage of VM technology for HPC
• Migration of modern OS-bypass network devices
• File system support for VM-based clusters

Our Initial Studies
• J. Liu*, W. Huang, B. Abali*, D. K. Panda, High Performance VMM-Bypass I/O in Virtual Machines, USENIX '06
• W. Huang, J. Liu*, B. Abali*, D. K. Panda, A Case for High Performance Computing with Virtual Machines, ICS '06
* External collaborators from IBM T.J. Watson Research Center

From OS-bypass to VMM-bypass
[Diagram: applications in guest VMs sit on top of a guest module in the guest OS; the guest module talks to a backend module that sits next to the privileged module in the VMM; privileged accesses go through the VMM, while VMM-bypass accesses go directly to the device]
• Guest modules in guest VMs handle setup and management operations (privileged access)
  – Guest modules communicate with backend modules in the VMM to get these jobs done
  – The original privileged module can be reused
• Once things are set up properly, devices can be accessed directly from guest VMs (VMM-bypass access)
  – Either from the OS kernel or from applications
• Backend and privileged modules can also reside in a special VM

Xen-IB: An InfiniBand Virtualization Driver for Xen
• Follows the Xen split driver model
• Presents virtual HCAs to guest domains
  – Para-virtualization
• Two modes of access:
  – Privileged access
    • OS involved
    • Setup, resource management and memory management
  – OS/VMM-bypass access
    • Done directly in user space/guest VM
    • Maintains high performance of InfiniBand hardware

MPI Latency and Bandwidth (MVAPICH)
[Figures: MPI latency (us) and bandwidth (MillionBytes/s) vs. message size (1 byte to 4 MB), Xen (Xen-IB) vs. native InfiniBand]

• Only VMM-bypass operations are used
• Xen-IB performs similarly to native InfiniBand
• Numbers taken with MVAPICH, high performance MPI-1 and MPI-2 (MVAPICH2) implementations over InfiniBand
  – Currently used by more than 405 organizations across the world

HPC Benchmarks (NAS)
[Figure: normalized execution time of the NAS Parallel Benchmarks (BT, CG, EP, FT, IS, LU, MG, SP) in VM vs. native environments]

CPU time distribution across Xen domains:

  Benchmark   Dom0    VMM     DomU
  BT          0.4%    0.2%    99.4%
  CG          0.6%    0.3%    99.0%
  EP          0.6%    0.3%    99.3%
  FT          1.6%    0.5%    97.9%
  IS          3.6%    1.9%    94.5%
  LU          0.6%    0.3%    99.0%
  MG          1.8%    1.0%    97.3%
  SP          0.3%    0.1%    99.6%

• NAS Parallel Benchmarks achieve similar performance in VM and native environments

Future work • System-level support for better virtualization

– Fully compatible implementation with OpenIB/Gen2 interface (including SDP, MAD service, etc. besides user verbs) – Migration of OS-bypass interconnects – Enhancement to file systems to support effective image management in VM-based environment – Scalability studies

Future Work (Cont’d) • Comparing with Native Check-point Restart in MPI over InfiniBand – MVAPICH2 – BLCR-based approach

• Service virtualization

– Migration of web services (e.g. virtualized http servers, application servers, etc.) – Designing service management protocols to enable efficient coordination of services in a virtualized system

• Designing higher level services with virtualization – Providing QoS and prioritization at all levels: network, application and system – Enabling fine-grained application scheduling for better system utilization

Major Research Directions
• System Software/Middleware
  – High Performance MPI on InfiniBand Cluster
  – Clustered Storage and File Systems
  – Solaris NFS over RDMA
  – iWARP and its Benefits to High Performance Computing
  – Efficient Shared Memory on High-Speed Interconnects
  – High Performance Computing with Virtual Machines (Xen-IB)
  – Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
  – High Performance Networking for TCP-based Applications
  – NIC-level Support for Collective Communication and Synchronization
  – NIC-level Support for Quality of Service (QoS)
  – Micro-Benchmarks and Performance Comparison of High-Speed Interconnects

More details on http://nowlab.cse.ohio-state.edu/ -> Projects

Data Centers – Issues and Challenges
[Diagram: three-tier data center (Tier 1 routers/servers, Tier 2 application servers, Tier 3 database servers) connected through switches to back-end storage]
• Client requests come over TCP (WAN)
• Traditionally TCP requests have been forwarded through multi-tiers
• Higher response time and lower throughput
• Can performance be improved with IBA?
• High Performance TCP-like communication over IBA
• TCP Termination

Multi-Tier Commercial Data-Centers: Can they be Re-architected?
[Diagram: clients reach a proxy server over the WAN; requests flow to web servers (Apache), application servers (PHP), and database servers (MySQL) backed by storage; complexity, overhead, and the number of requests grow as requests move through the tiers]
• Proxy: caching, load balancing, resource monitoring, etc.
• Application server performs the business logic
  – Retrieves data from the database to process the requests

Data-Center Communication
• Data center services require efficient communication
  – Resource management
  – Cache and shared state management
  – Support for data-centric applications like Apache
  – Load monitoring
• High performance and feature requirements
  – Utilizing capabilities of networks like InfiniBand
• Providing effective mechanisms - a challenge!!

SDP Architectural Model
[Diagram (source: InfiniBand Trade Association, 2002): in the traditional model, a sockets application calls the sockets API, which goes through the kernel TCP/IP sockets provider and TCP/IP transport driver down to the InfiniBand CA; in the possible SDP model, the same sockets API can instead be served by the Sockets Direct Protocol, which uses kernel bypass and RDMA semantics directly over the InfiniBand CA]
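The key property of SDP is that the sockets API itself is unchanged: an ordinary TCP client like the sketch below (the host address and port are placeholders) can run as-is, with only the sockets provider underneath switched to the RDMA-based SDP path.

/* Illustrative sketch: a plain sockets client. The same code runs over
 * TCP/IP or over SDP; only the provider beneath the sockets API changes. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);           /* standard stream socket */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(5001);                      /* placeholder port */
    inet_pton(AF_INET, "192.168.1.10", &addr.sin_addr); /* placeholder server */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        const char msg[] = "hello over SDP or TCP";
        write(fd, msg, sizeof(msg));                    /* identical call either way */
    }
    close(fd);
    return 0;
}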

Our Objectives
• To study the importance of the communication layer in the context of a multi-tier data center
  – Sockets Direct Protocol (SDP) vs. IPoIB
• Explore whether InfiniBand mechanisms can help in designing various components of a data center efficiently
  – Web caching/coherency
  – Re-configurability and QoS
  – Load balancing, I/O and file systems, etc.
• Studying workload characteristics
• In-memory databases

3-Tier Datacenter Testbed at OSU
[Diagram: clients generate requests for both web servers and database servers; Tier 1 proxy nodes handle TCP termination, load balancing, and caching; Tier 2 application servers (Apache, PHP) handle file system evaluation, caching schemes, dynamic content caching, and persistent connections; Tier 3 database servers run MySQL/DB2]

SDP vs. IPoIB: Latency and Bandwidth (3.4 GHz PCI-Express)
[Figures: latency (us) and bandwidth (Mbps) vs. message size (1 byte to 16 KB) for IPoIB and SDP]
• SDP enables high bandwidth (up to 750 MBytes/sec or 6000 Mbps), low latency (21 µs) message passing

Cache Coherency and Consistency with Dynamic Data
[Diagram: user requests hit caches on the proxy nodes, while the back-end data those cached responses depend on is updated concurrently on the back-end nodes; the proxy caches therefore have to be kept coherent and consistent with the back end]

Active Caching Performance
[Figures: throughput (transactions per second) vs. number of client threads for a ZipF distribution and for the WorldCup trace, comparing No Cache, TCP/IP, RDMA, and SDP schemes]

"Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-tier Data-centers over InfiniBand", S. Narravula, P. Balaji, K. Vaidyanathan and D. K. Panda, IEEE International Conference on Cluster Computing and the Grid (CCGrid) '05.
"Supporting Strong Coherency for Active Caches in Multi-tier Data-Centers over InfiniBand", S. Narravula, P. Balaji, K. Vaidyanathan, K. Savitha, J. Wu and D. K. Panda, Workshop on System Area Networks (SAN), with HPCA '03.

Dynamic Re-configurability and QoS
• More datacenters are using dynamic data
• How to decide the number of proxy nodes vs. application servers?
• Current approach
  – Use a fixed distribution
  – Incorporate over-provisioning to handle dynamic data
• Can we design a dynamic re-configurability scheme with shared state using RDMA operations?
  – To allow a proxy node to work as an app server and vice versa, as needed
  – Allocates resources as needed
  – Over-provisioning need not be used
• The scheme can be used for servers hosting multiple web sites
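To make the "shared state using RDMA operations" idea concrete, here is a hedged verbs-level sketch (a generic illustration, not the implementation from our papers; the helper name and parameters are made up) of the basic primitive such a scheme builds on: a node fetches a remote load/state word with a one-sided RDMA Read, without involving the CPU of the possibly heavily loaded node being queried. Queue pair setup, memory registration, and the exchange of the remote address and rkey are assumed to have happened already.

/* Illustrative sketch: fetch a remote 64-bit shared-state word (for example,
 * a load counter) with a one-sided RDMA Read; the remote CPU is not involved.
 * QP/MR setup and rkey exchange are assumed; error handling is omitted. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_state_read(struct ibv_qp *qp, struct ibv_mr *mr, uint64_t *local_copy,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_copy,   /* the fetched word lands here */
        .length = sizeof(uint64_t),
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;   /* address of the shared word on the peer */
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr); /* poll the CQ before using *local_copy */
}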

Dynamic Reconfigurability in Shared Multi-tier Data-Centers
[Diagram: clients reach load-balancing clusters for web sites A, B, and C over the WAN, each backed by its own set of website servers]
• Nodes reconfigure themselves to serve highly loaded websites at run-time

Dynamic Re-configurability with Shared State using RDMA Operations
[Figures: transactions per second vs. burst length (512 to 16384) for the Rigid, Reconf, and Over-Provisioning schemes, and number of busy nodes vs. iterations showing node utilization under the same schemes]
• For large bursts of requests, the dynamic reconfiguration scheme utilizes all idle nodes in the system
• Performance of the dynamic reconfiguration scheme largely depends on the burst length of requests

P. Balaji, S. Narravula, K. Vaidyanathan, H.-W. Jin, K. Savitha and D. K. Panda, Exploiting Remote Memory Operations to Design Efficient Reconfigurations for Shared Data-Centers over InfiniBand, RAIT '04.

QoS Meeting Capabilities
[Figures: percentage of times hard QoS is met for low-priority and for high-priority requests under the Reconf, Reconf-P, and Reconf-PQ schemes in three cases]

Re-configurability with QoS and Prioritization handles the QoS meeting capabilities of both high and low priority requests for all cases

P. Balaji, S. Narravula, K. Vaidyanathan, H. -W. Jin and D. K. Panda, On the Provision of Prioritization and Soft QoS in Dynamically Reconfigurable Shared Data Centers over InfiniBand, Int’l Symposium on Performance Analysis of Systems and Software (ISPASS 05)

Other Research Directions
• System Software/Middleware
  – High Performance MPI on InfiniBand Cluster
  – Clustered Storage and File Systems
  – Solaris NFS over RDMA
  – iWARP and its Benefits to High Performance Computing
  – Efficient Shared Memory on High-Speed Interconnects
  – High Performance Computing with Virtual Machines (Xen-IB)
  – Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
  – High Performance Networking for TCP-based Applications
  – NIC-level Support for Collective Communication and Synchronization
  – NIC-level Support for Quality of Service (QoS)
  – Micro-Benchmarks and Performance Comparison of High-Speed Interconnects

More details on http://nowlab.cse.ohio-state.edu/ -> Projects

Presentation Outline
• Systems Area in the Department
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions

NBCLab Infrastructure
• Multiple clusters (around 500 processors in total)
  – 128-processor
    • 32 dual Intel EM64T processors
    • 32 dual AMD Opterons
    • InfiniBand DDR on all nodes (32 nodes with dual rail)
  – 128-processor (Donation by Intel for 99 years)
    • 64 dual Intel Pentium 2.6 GHz processors
    • InfiniBand SDR on all nodes
  – 32-processor (Donation by Appro, Mellanox and Intel)
    • 16 dual Intel EM64T blades
    • InfiniBand SDR and DDR on all nodes
  – 28-processor (Donation by Sun)
    • 12 dual AMD Opterons
    • 1 quad AMD Opteron
  – 32-processor
    • 16 dual-Pentium 300 MHz nodes
    • Myrinet, Gigabit Ethernet, and GigaNet cLAN
  – 96-processor
    • 16 quad-Pentium 700 MHz nodes
    • 16 dual-Pentium 1.0 GHz nodes
    • Myrinet and GigaNet cLAN
  – 32-processor
    • 8 dual-Xeon 2.4 GHz nodes
    • 8 dual-Xeon 3.0 GHz nodes
    • InfiniBand, Quadrics, Ammasso (1GigE), Level 5 (1GigE), Chelsio (10GigE), Neteffect (10GigE)
• Many other donated systems from Intel, AMD, Apple, IBM, SUN, Microway and QLogic
  – 18 dual GHz nodes
• Other test systems and desktops

NBCLab Clusters

High-End Computing and Networking Research Testbed for Next Generation Data Driven, Interactive Applications
• PI: D. K. Panda
• Co-PIs: G. Agrawal, P. Sadayappan, J. Saltz and H.-W. Shen
• Other Investigators: S. Ahalt, U. Catalyurek, H. Ferhatosmanoglu, H.-W. Jin, T. Kurc, M. Lauria, D. Lee, R. Machiraju, S. Parthasarathy, P. Sinha, D. Stredney, A. E. Stutz, and P. Wyckoff
• Dept. of Computer Science and Engineering, Dept. of Biomedical Informatics, and Ohio Supercomputer Center, The Ohio State University
• Total funding: $3.01M ($1.53M from NSF + $1.48M from the Ohio State Board of Regents and various units in OSU)

Experimental Testbed Installed
[Diagram: three campus sites connected through 10.0 GigE switches with 2x10 = 20 GigE links (40 GigE in Yr4):
• OSC: existing 500 TByte mass storage system
• BMI: 70-node memory cluster with 512 GBytes of memory, 24 TB of disk, GigE and InfiniBand SDR (upgrade in Yr4)
• CSE: 64-node compute cluster with InfiniBand DDR and 4 TB of disk, 10 GigE on some nodes (to be added), graphics adapters and haptic devices, 20 wireless clients and 3 access points, and a video wall (upgrades in Yr4)]

Collaboration among the Components and Investigators
• Data Intensive Applications: Saltz, Stredney, Sadayappan, Machiraju, Parthasarathy, Catalyurek, and other OSU collaborators
• Data Intensive Algorithms: Shen, Agrawal, Machiraju, and Parthasarathy
• Programming Systems and Scheduling: Saltz, Agrawal, Sadayappan, Kurc, Catalyurek, Ahalt, and Hakan
• Networking, Communication, QoS, and I/O: Panda, Jin, Lee, Lauria, Sinha, Wyckoff, and Kurc

Wright Center for Innovation (WCI)
• New funding to install a larger cluster with 64 nodes with dual dual-core processors (up to 256 processors)
• Storage nodes with 40 TBytes of space
• Connected with InfiniBand DDR
• Focuses on Advanced Data Management

Presentation Outline
• Systems Area in the Department
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions

788.08P course • Network-Based Computing • Covers all the issues in detail and develops the foundation in this research area • Last offered in Wi ‘06 – http://www.cis.ohio-state.edu/~panda/788/ – Contains all papers and presentations

• Will be offered in Wi ’08 again

888.08P course • Research Seminar on Network-Based Computing • Offered every quarter • This quarter (Au ’06) offering – Mondays: 4:30-6:00pm – Thursdays: 3:30-4:30pm

• Project-based course • 3-credits • Two-quarter rule

Presentation Outline
• Systems Area in the Department
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions

Group Members • Currently 12 PhD students and One MS student

– Lei Chai, Qi Gao, Wei Huang, Matthew Koop, Rahul Kumar, Ping Lai, Amith Mamidala, Sundeep Narravula, Ranjit Noronha, Sayantan Sur, Gopal Santhanaraman, Karthik Vaidyanathan and Abhinav Vishnu

• All current students are supported as RAs • Programmers – Shaun Rowland – Jonathan Perkins

• System Administrator – Keith Stewart

Publications from the Group
• Since 1995
  – 11 PhD Dissertations
  – 15 MS Theses
  – 180+ papers in prestigious international conferences and journals
  – 5 journal and 24 conference/workshop papers in 2005
  – 2 journal and 21 conference papers in 2006 (so far)
  – Electronically available from the NBC web page
    • http://nowlab.cse.ohio-state.edu -> Publications

Student Quality and Accomplishments
• First employment of graduated students
  – IBM TJ Watson, IBM Research, Argonne National Laboratory, Oak Ridge National Laboratory, SGI, Compaq/Tandem, Pacific Northwest National Lab, ASK Jeeves, Fore Systems, Microsoft, Lucent, Citrix, Dell
• Many employments are by invitation
• Several of the past and current students have received
  – OSU Graduate Fellowship
  – OSU Presidential Fellowship ('95, '98, '99, '03 and '04)
  – IBM Co-operative Fellowship ('96, '97, '02 and '06)
  – CIS annual research award ('99, '02, '04 (2), '05 (2), '06)
  – Best Papers at conferences (HPCA '95, IPDPS '03, Cluster '03, Euro PVM '05, Hot Interconnect '04, Hot Interconnect '05, ...)
• Big demand for summer internships every year
  – IBM TJ Watson, IBM Almaden, Argonne, Sandia, Los Alamos, Pacific Northwest National Lab, Intel, Dell, ...
  – 7 internships this past summer (Argonne (2), IBM TJ Watson, IBM, Sun, LLNL, HP)

In News
• Research in our group gets quoted in press releases and articles in a steady manner
• HPCwire, Grid Today, Supercomputing Online, AVM News
• http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ -> In News

Visit to Capitol Hill (June 2005) •

Invited to Capitol Hill (CNSF) to demonstrate the benefits of InfiniBand networking technology to – Congressmen – Senators – White House Director (Office of Science and Technology)

More details at http://www.cra.org/govaffairs/blog/archives/000101.htm

Looking For ....
• Motivated students with strong interest in different aspects of network-based computing
  – architecture
  – networks
  – operating systems
  – algorithms
  – applications
• Should have good knowledge in Architecture (775, 875), High Performance/Parallel Computing (621, 721), Networking (677, 678), Operating Systems (662)
• Should have or should be willing to develop skills in
  – architecture design
  – networking
  – parallel algorithms
  – parallel programming
  – experimentation and evaluation

Presentation Outline
• Systems Area in the Department
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions

Conclusions • Network-Based computing is an emerging trend • Small to medium scale network-based computing systems are getting ready to be used • Many open research issues need to be solved before building large-scale network-based computing systems • Will be able to provide affordable, portable, scalable, and ..able computing paradigm for the next 10-15 years • Exciting area of research

Web Pointers
NBC home page:
http://www.cse.ohio-state.edu/~panda/
http://nowlab.cse.ohio-state.edu/
E-mail: [email protected]

Acknowledgements Our research is supported by the following organizations • Current Funding support by

• Current Equipment support by

