Network-Based Computing
Dhabaleswar K. (DK) Panda
Department of Computer Science and Engineering
The Ohio State University
E-mail: [email protected]
http://nowlab.cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Presentation Outline
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
Current and Next Generation Applications and Computing Systems
• Big demand for
  – High Performance Computing (HPC)
  – Servers for file systems, web, multimedia, database, visualization, ...
  – Internet data centers (multi-tier datacenters)
• Processor speed continues to grow
  – Doubling every 18 months
  – Multicore chips are emerging
• Commodity networking also continues to grow
  – Increase in speed
  – Many features
  – Affordable pricing
• Clusters are increasingly becoming popular for designing next generation computing systems
  – scalability
  – modularity
  – can be easily upgraded as computing and networking technologies improve
Growth in Commodity Networking Technologies
• Representative commodity networking technologies and their entries into the market:

  Ethernet (1979- )              10 Mbit/sec
  Fast Ethernet (1993- )         100 Mbit/sec
  Gigabit Ethernet (1995- )      1000 Mbit/sec
  ATM (1995- )                   155/622/1024 Mbit/sec
  Myrinet (1993- )               1 Gbit/sec
  Fibre Channel (1994- )         1 Gbit/sec
  InfiniBand (2001- )            2.5 Gbit/sec (1X) -> 2 Gbit/sec
  InfiniBand (2003- )            10 Gbit/sec (4X) -> 8 Gbit/sec
  10 Gigabit Ethernet (2004- )   10 Gbit/sec
  InfiniBand (2005- )            20 Gbit/sec (4X DDR) -> 16 Gbit/sec
                                 30 Gbit/sec (12X SDR) -> 24 Gbit/sec

• The arrows show signaling rate -> effective data rate
• InfiniBand rates have grown roughly 12 times in the last 5 years (2.5 Gbit/sec in 2001 to 30 Gbit/sec in 2005)
Network-Based Computing
[Diagram: four enabling trends converging into Network-Based Computing]
• Computing Systems: powerful, cost-effective, commodity
• Web Technology: powerful, with the ability to integrate computation and communication
• Networking: high bandwidth, cost-effective, commodity
• Emerging Applications: collaborative/interactive, with a wide range of computation and communication characteristics
Presentation Outline
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
Different Types of NBC Systems
• Six major categories
  – Dedicated cluster over a SAN/LAN
  – Clusters with interactive clients
  – Globally-interconnected systems
  – Interconnected clusters over WAN
  – Multi-tier data centers
  – Integrated environment of multiple clusters
Dedicated Cluster over a SAN/LAN
• Packaged inside a rack/room, interconnected with a SAN/LAN
[Diagram: compute nodes interconnected through one or more networks]
Trends for Computing Clusters in the Top 500 List
• Top 500 list of supercomputers (www.top500.org)
  – June 2001: 33/500 (6.6%)
  – Nov 2001: 43/500 (8.6%)
  – June 2002: 80/500 (16%)
  – Nov 2002: 93/500 (18.6%)
  – June 2003: 149/500 (29.8%)
  – Nov 2003: 208/500 (41.6%)
  – June 2004: 291/500 (58.2%)
  – Nov 2004: 294/500 (58.8%)
  – June 2005: 304/500 (60.8%)
  – Nov 2005: 360/500 (72.0%)
  – June 2006: 364/500 (72.8%)
Cluster with Interactive Clients
[Diagram: clients connecting to a cluster over a LAN/WAN]
Globally-Interconnected Systems
[Diagram: SMP systems interconnected over a WAN]
Interconnected Clusters over WAN
[Diagram: workstation clusters interconnected over a WAN]
Generic Three-Tier Model for Data Centers
[Diagram: Tier 1 routers/servers, Tier 2 application servers, and Tier 3 database servers, connected through switches and backed by storage]
• All major search engines and e-commerce companies are using clusters for multi-tier datacenters
  – Google, Amazon, financial institutions, ...
Integrated Environment with Multiple Clusters
[Diagram: a front-end LAN connects clients to a compute cluster, a storage cluster, and a datacenter for visualization and data mining, interconnected over LAN/WAN links]
Presentation Outline
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
Our Vision
• Network-Based Computing Group aims to take a lead in
  – Proposing new designs for high performance NBC systems by taking advantage of modern networking technologies and computing systems
  – Developing better middleware/APIs/programming environments so that modern NBC applications can be developed and implemented in a scalable fashion
• Carry out research in an integrated manner (systems, networking, and applications)
Group Members
• Currently 12 PhD students and one MS student
  – Lei Chai, Qi Gao, Wei Huang, Matthew Koop, Rahul Kumar, Ping Lai, Amith Mamidala, Sundeep Narravula, Ranjit Noronha, Sayantan Sur, Gopal Santhanaraman, Karthik Vaidyanathan, and Abhinav Vishnu
• All current students are supported as RAs
• Programmers
  – Shaun Rowland
  – Jonathan Perkins
• System Administrator
  – Keith Stewart
External Collaboration
• Industry
  – IBM TJ Watson and IBM
  – Intel
  – Dell
  – SUN
  – Apple
  – Linux Networx
  – NetApp
  – Mellanox and other InfiniBand companies
• Research Labs
  – Sandia National Lab
  – Los Alamos National Lab
  – Pacific Northwest National Lab
  – Argonne National Lab
• Other units on campus
  – Bio-Medical Informatics (BMI)
  – Ohio Supercomputer Center (OSC)
  – Internet 2 Technology Evaluation Centers (ITEC-Ohio)
Research Funding
• National Science Foundation
  – Multiple grants
• Department of Energy
  – Multiple grants
• Sandia National Lab
• Los Alamos National Lab
• Pacific Northwest National Lab
• Ohio Board of Regents
• IBM Research
• Intel
• SUN
• Mellanox
• Cisco
• Linux Networx
• Network Appliance
Research Funding (Cont'd)
• Three major large-scale grants (started in Sept '06)
  – DOE Grant for Next Generation Programming Models
    • Collaboration with Argonne National Laboratory, Rice University, UC Berkeley, and Pacific Northwest National Laboratory
  – DOE Grant for Scalable Fault Tolerance in Large-Scale Clusters
    • Collaboration with Argonne National Laboratory, Indiana University, Lawrence Berkeley National Laboratory, and Oak Ridge National Laboratory
  – AVETEC Grant for Center for Performance Evaluation of Cluster Interconnects
    • Collaboration with major DOE Labs and NASA
• Several continuation grants from other funding agencies
Presentation Outline
• Systems Area in the Department
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
Requirements for Modern Clusters, Cluster-based Servers, and Datacenters
• Require
  – a good system area network with excellent performance (low latency and high bandwidth) for interprocessor communication (IPC) and I/O
  – a good storage area network for high performance I/O
  – good WAN connectivity in addition to intra-cluster SAN/LAN connectivity
  – Quality of Service (QoS) for supporting interactive applications
  – RAS (Reliability, Availability, and Serviceability)
Research Challenges
[Diagram: layered view of the research challenges]
• Applications (scientific, commercial, servers, datacenters)
• Programming models (message passing, sockets, shared memory w/o coherency)
• Low-overhead substrates: point-to-point communication, collective communication, synchronization & locks, fault tolerance, I/O & file systems, QoS
• Networking technologies (Myrinet, Gigabit Ethernet, InfiniBand, Quadrics, 10GigE, RNICs) and intelligent NICs
• Commodity computing system architectures (single, dual, quad, ...) and multi-core architectures
Research Challenges and Their Dependencies
[Diagram: applications build on system software/middleware support, which builds on networking and communication support, which in turn builds on trends in networking/computing technologies]
Major Research Directions
• System Software/Middleware
  – High Performance MPI on InfiniBand Cluster
  – Clustered Storage and File Systems
  – Solaris NFS over RDMA
  – iWARP and its Benefits to High Performance Computing
  – Efficient Shared Memory on High-Speed Interconnects
  – High Performance Computing with Virtual Machines (Xen-IB)
  – Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
  – High Performance Networking for TCP-based Applications
  – NIC-level Support for Collective Communication and Synchronization
  – NIC-level Support for Quality of Service (QoS)
  – Micro-Benchmarks and Performance Comparison of High-Speed Interconnects
• More details at http://nowlab.cse.ohio-state.edu/ -> Projects
Why InfiniBand?
• Traditionally, HPC clusters have used
  – Myrinet or Quadrics (proprietary interconnects)
  – Fast Ethernet or Gigabit Ethernet for low-end clusters
• Datacenters have used
  – Ethernet (Gigabit Ethernet is common)
  – 10 Gigabit Ethernet is emerging but is not yet available at low cost
  – Ethernet lacks QoS and RAS capabilities
• Storage and file systems have used
  – Ethernet with IP
  – Fibre Channel
  – Neither supports high bandwidth together with QoS, etc.
Need for a New Generation Computing, Communication & I/O Architecture
• Started in the late '90s
• Several initiatives were working to address high performance I/O issues
  – Next Generation I/O
  – Future I/O
• The Virtual Interface Architecture (VIA) consortium was working towards high performance interprocessor communication
• An attempt was made to consolidate all these efforts as an open standard
• The idea behind the InfiniBand Architecture (IBA) was conceived ...
• The original target was data centers
IBA Trade Organization
• The IBA Trade Organization was formed with seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
• Goal: to design a scalable and high performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies
• Many other companies participated in the effort to define the IBA architecture specification
• The InfiniBand Architecture specification (Volume 1, Version 1.0) was released to the public on Oct 24, 2000
• www.infinibandta.org
Rich Set of Features of IBA
• High performance data transfer
  – Interprocessor communication and I/O
  – Low latency (~1.0-3.0 microsec), high bandwidth (~1-2 GBytes/sec), and low CPU utilization (5-10%)
• Flexibility for WAN communication
• Multiple transport services
  – Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
  – Provides flexibility to develop upper layers
• Multiple operations
  – Send/Recv
  – RDMA Read/Write
  – Atomic operations (unique to IBA)
    • enable high performance, scalable implementations of distributed locks, semaphores, and collective communication operations (see the sketch below)
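As an illustration of how these atomic operations can serve as building blocks for distributed locks and counters, here is a minimal, hedged sketch (not taken from the talk) using the OpenFabrics verbs API. It assumes an already-connected reliable-connection queue pair, a registered 8-byte local buffer, and the remote counter's address and rkey exchanged out of band during setup.

/* Sketch: remote fetch-and-add over InfiniBand verbs (OpenFabrics).
 * Assumes: qp is a connected RC queue pair, local_mr registers an
 * 8-byte, 8-byte-aligned local buffer, and remote_addr/remote_rkey
 * describe a registered 8-byte counter on the peer. */
#include <infiniband/verbs.h>
#include <stdint.h>

int post_fetch_and_add(struct ibv_qp *qp, struct ibv_mr *local_mr,
                       uint64_t *local_buf, uint64_t remote_addr,
                       uint32_t remote_rkey, uint64_t add_val)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* old remote value is returned here */
        .length = sizeof(uint64_t),
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;

    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.send_flags = IBV_SEND_SIGNALED;        /* generate a completion */
    wr.wr.atomic.remote_addr = remote_addr;   /* must be 8-byte aligned */
    wr.wr.atomic.rkey        = remote_rkey;
    wr.wr.atomic.compare_add = add_val;       /* amount added atomically */

    return ibv_post_send(qp, &wr, &bad_wr);   /* completion polled via ibv_poll_cq() */
}

The old value returned into the local buffer can then act, for example, as a ticket number in a ticket-lock style distributed lock.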
Principles behind the RDMA Mechanism
[Diagram: two nodes, each with processors, memory, a PCI/PCI-Express bus and an IBA adapter; data moves directly between the memories of the two nodes across IBA]
• No involvement of the processor at the receiver side (RDMA Write/Put)
• No involvement of the processor at the sender side (RDMA Read/Get)
• 1-2 microsec latency (for short data)
• 1.5 GBytes/sec bandwidth (for large data)
• 3-5 microsec for atomic operations
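For concreteness, the following is a minimal, hedged sketch of the RDMA Write (put) path described above, again using the OpenFabrics verbs API; it assumes a connected RC queue pair, a registered local source buffer, and the target's virtual address and rkey obtained during connection setup. The code runs only on the initiating side; the receiver's CPU is not involved in the transfer, which is the point of the mechanism.

/* Sketch: one-sided RDMA Write with OpenFabrics verbs.
 * Assumes qp (connected RC), src_mr registering src of length len,
 * and dst_addr/dst_rkey describing registered remote memory. */
#include <infiniband/verbs.h>
#include <stdint.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *src_mr,
                    void *src, uint32_t len,
                    uint64_t dst_addr, uint32_t dst_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)src,
        .length = len,
        .lkey   = src_mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;

    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* data lands directly in remote memory */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = dst_addr;
    wr.wr.rdma.rkey        = dst_rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
    /* Completion is detected locally by polling the sender's CQ
     * (ibv_poll_cq); no remote CPU cycles are consumed. */
}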
Rich Set of Features of IBA (Cont'd)
• Range of network features and QoS mechanisms
  – Service Levels (priorities)
  – Virtual Lanes
  – Partitioning
  – Multicast
  – These allow a new generation of scalable communication and I/O subsystems with QoS to be designed
• Protected operations
  – Keys
  – Protection Domains
• Flexibility for supporting Reliability, Availability, and Serviceability (RAS) in next generation systems with IBA features
  – Multiple CRC fields
    • error detection (per-hop, end-to-end)
  – Fail-over
    • unmanaged and managed
  – Path Migration
  – Built-in Management Services
[Diagram: a subnet manager servicing multicast join requests from compute nodes and setting up multicast forwarding state in the switches, across active and inactive links]
Designing MPI Using InfiniBand Features
[Diagram: MPI design components (protocol mapping, flow control, communication progress, multi-rail support, buffer management, connection management, collective communication, one-sided active/passive) mapped onto a substrate of InfiniBand features (communication semantics, transport services, atomic operations, communication management, completion & events, end-to-end flow control, multicast, quality of service)]
High Performance MPI over IBA – Research Agenda at OSU
• Point-to-point communication
  – RDMA-based design for both small and large messages
• Collective communication
  – Taking advantage of IBA hardware multicast for broadcast
  – RDMA-based designs for barrier, all-to-all, all-gather
• Flow control
  – Static vs. dynamic
• Connection management
  – Static vs. dynamic
  – On-demand
• Multi-rail designs
  – Multiple ports/HCAs
  – Different schemes (striping, binding, adaptive)
• MPI datatype communication
  – Taking advantage of scatter/gather semantics of IBA
• Fault-tolerance support
  – End-to-end reliability to tolerate I/O errors
  – Network fault tolerance with Automatic Path Migration (APM)
  – Application-transparent checkpoint and restart
• Currently extending our designs for iWARP adapters
Overview of MVAPICH and MVAPICH2 Projects (OSU MPI for InfiniBand) • Focusing on
– MPI-1 (MVAPICH) – MPI-2 (MVAPICH2)
• Open Source (BSD licensing) with anonymous SVN access • Directly downloaded and being used by more than 440 organizations worldwide (in 30 countries) • Available in the software stacks of – Many IBA and server vendors – OFED (OpenFabrics Enterprise Distribution) – Linux Distributors
• Empowers multiple InfiniBand clusters in the TOP 500 list • URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
Larger IBA Clusters using MVAPICH and Their Top500 Rankings (June '06)
• 6th: 4,000-node dual Intel Xeon 3.6 GHz cluster at Sandia
• 28th: 1,100-node dual Apple Xserve 2.3 GHz cluster at Virginia Tech
• 66th: 576-node dual Intel Xeon EM64T 3.6 GHz cluster at Univ. of Sherbrooke (Canada)
• 367th: 356-node dual Opteron 2.4 GHz cluster at Trinity Center for High Performance Computing (TCHPC), Trinity College Dublin (Ireland)
• 436th: 272-node dual Intel Xeon EM64T 3.4 GHz cluster at SARA (The Netherlands)
• 460th: 200-node dual Intel Xeon EM64T 3.2 GHz cluster at Texas Advanced Computing Center/Univ. of Texas
• 465th: 315-node dual Opteron 2.2 GHz cluster at NERSC/LBNL
• More are getting installed ...
MPI-level Latency (One-way): IBA (Mellanox and PathScale) vs. Myrinet vs. Quadrics
[Charts: small-message and large-message one-way latency (us) vs. message size (4 bytes to 256 KBytes) for MVAPICH-X, MVAPICH-Ex-1p, MVAPICH-Ex-2p, MVAPICH-Gen2-ExDDR-1p, MPICH/MX and MPICH/QsNet-X; small-message latencies range from about 2.0 to 4.9 microsec across the stacks]
• SC '03
• Hot Interconnect '04
• IEEE Micro (Jan-Feb) '05, one of the best papers from HotI '04
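For reference, a ping-pong micro-benchmark of the kind typically used to produce one-way latency numbers such as those above looks roughly like the hedged sketch below (illustrative only, not the actual OSU benchmark code); it halves the averaged round-trip time after a warm-up phase.

/* Sketch: MPI ping-pong latency test (run with 2 processes). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 10000, skip = 1000;        /* skip warm-up iterations */
    int size = (argc > 1) ? atoi(argv[1]) : 4;   /* message size in bytes */
    int rank;
    char *buf;
    double t_start = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = calloc(size, 1);

    for (int i = 0; i < iters + skip; i++) {
        if (i == skip) t_start = MPI_Wtime();    /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        /* One-way latency = half of the average round-trip time. */
        double one_way_us = (MPI_Wtime() - t_start) * 1e6 / (2.0 * iters);
        printf("%d bytes: %.2f us one-way\n", size, one_way_us);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}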
MPI-level Bandwidth (Uni-directional): IBA (Mellanox and PathScale) vs. Myrinet vs. Quadrics
[Chart: uni-directional bandwidth (MillionBytes/sec) vs. message size (4 bytes to 1 MByte) for MVAPICH-X, MVAPICH-Ex-1p, MVAPICH-Ex-2p, MVAPICH-Gen2-DDR-Ex-1p, MPICH/MX and MPICH/QsNet-X; peak bandwidths range from about 494 to 1492 MillionBytes/sec across the stacks]
MPI-level Bandwidth (Bi-directional): IBA (Mellanox and PathScale) vs. Myrinet vs. Quadrics
[Chart: bi-directional bandwidth (MillionBytes/sec) vs. message size (4 bytes to 1 MByte) for MVAPICH-X, MVAPICH-Ex-1p, MVAPICH-Ex-2p, MVAPICH-Gen2-DDR-Ex-1p, MPICH/MX-X and MPICH/QsNet-X; peak bandwidths range from about 901 to 2724 MillionBytes/sec across the stacks]
MPI over InfiniBand Performance (Dual-core Intel Bensley Systems with Dual-Rail DDR InfiniBand)
[Charts: 4 processes on each node, each node with two HCAs, communicating concurrently over dual-rail InfiniBand DDR (Mellanox); peak uni-directional bandwidth of 2808 MB/sec, peak bi-directional bandwidth of 4553 MB/sec, and a messaging rate of 3.16 million messages/sec for 16-byte messages]
M. J. Koop, W. Huang, A. Vishnu and D. K. Panda, Memory Scalability Evaluation of Next Generation Intel Bensley Platform with InfiniBand, to be presented at Hot Interconnect Symposium (Aug. 2006).
High Performance and Scalable Collectives
• Reliable MPI broadcast using IB hardware multicast
  – Capability to broadcast a 1 KByte message to 1024 nodes in less than 40 microsec
• RDMA-based designs for
  – MPI_Barrier
  – MPI_Alltoall
J. Liu, A. Mamidala and D. K. Panda, Fast and Scalable MPI-Level Broadcast using InfiniBand's Hardware Multicast Support, Int'l Parallel and Distributed Processing Symposium (IPDPS '04), April 2004
S. Sur and D. K. Panda, Efficient and Scalable All-to-all Exchange for InfiniBand-based Clusters, Int'l Conference on Parallel Processing (ICPP '04), Aug. 2004
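A broadcast latency such as "1 KByte to 1024 nodes in under 40 microsec" is typically measured at the MPI level with a loop like the hedged sketch below (illustrative only, not the published benchmark); the barrier keeps successive broadcasts from pipelining, and its own cost can be timed separately and subtracted.

/* Sketch: measuring average MPI_Bcast latency for a fixed message size. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000, msg_size = 1024;   /* 1 KByte broadcast */
    int rank, nprocs;
    char *buf;
    double t0, per_iter_us;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    buf = malloc(msg_size);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        MPI_Bcast(buf, msg_size, MPI_CHAR, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);           /* keep iterations in lockstep */
    }
    per_iter_us = (MPI_Wtime() - t0) * 1e6 / iters;

    if (rank == 0)
        printf("%d procs, %d bytes: %.2f us per bcast+barrier\n",
               nprocs, msg_size, per_iter_us);
    /* A separate loop timing only the barrier can be subtracted to
     * isolate the broadcast latency itself. */
    free(buf);
    MPI_Finalize();
    return 0;
}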
Analytical Model – Extrapolation for 1024 Nodes
[Charts: measured vs. estimated broadcast latency (us) as a function of message size and node count, with and without hardware multicast, including the extrapolation to 1024 nodes]
• Can broadcast a 1 KByte message to 1024 nodes in 40.0 microsec
Fault Tolerance
• Component failures are the norm in large-scale clusters
• This imposes a need for reliability and fault tolerance
• Working along the following three angles
– End-to-end Reliability with memory-to-memory CRC • Already available with the latest release
– Reliable Networking with Automatic Path Migration (APM) utilizing Redundant Communication Paths • Will be available in future release
– Process Fault Tolerance with Efficient Checkpoint and Restart • Already available with the latest release
Checkpoint/Restart Support for MVAPICH2
• Process-level fault tolerance
  – User-transparent, system-level checkpointing
  – Based on BLCR from LBNL to take coordinated checkpoints of the entire program, including the front end and the individual processes
  – Designed novel schemes to (see the conceptual sketch below)
    • coordinate all MPI processes to drain all in-flight messages in IB connections
    • store communication state, buffers, etc., while taking the checkpoint
    • restart from the checkpoint
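The coordination sequence above can be summarized by the purely conceptual sketch below; every helper is a hypothetical placeholder standing in for an internal MVAPICH2 step, not a real API.

#include <stdio.h>

/* Hypothetical placeholders for the internal steps; each stub just logs
 * its phase so the sketch is self-contained and runnable. */
static void phase(const char *what) { printf("checkpoint phase: %s\n", what); }

static void suspend_new_communication(void)      { phase("suspend new sends"); }
static void drain_in_flight_messages(void)       { phase("drain in-flight IB messages"); }
static void save_communication_state(void)       { phase("save channel state and buffers"); }
static void release_ib_resources(void)           { phase("tear down QPs/MRs (not checkpointable)"); }
static void invoke_system_level_checkpoint(void) { phase("system-level tool (e.g., BLCR) writes the process image"); }
static void reinitialize_ib_resources(void)      { phase("re-open HCA, re-register memory, rebuild QPs"); }
static void restore_communication_state(void)    { phase("restore channel state"); }
static void resume_communication(void)           { phase("resume sends/receives"); }

/* Conceptual coordinated checkpoint sequence as described on the slide. */
int main(void)
{
    suspend_new_communication();
    drain_in_flight_messages();
    save_communication_state();
    release_ib_resources();
    invoke_system_level_checkpoint();
    /* on restart (or after the checkpoint completes): */
    reinitialize_ib_resources();
    restore_communication_state();
    resume_communication();
    return 0;
}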
A Running Example
[Demo: start LU running, get its PID, take a checkpoint using the PID (LU is not affected and keeps running), stop LU with CTRL-C, restart it from the checkpoint, and let it finish]
• Shows how to checkpoint/restart LU from the NAS benchmarks
• Two terminals are used: the left one for the normal run, the right one for checkpoint/restart
NAS Execution Time and HPL Performance
[Charts: execution time of NAS benchmarks (lu.C.8, bt.C.9, sp.C) and HPL performance (GFLOPS) under different checkpointing intervals (none, 1 min, 2 min, 4 min, 8 min; the number of checkpoints taken is shown in parentheses)]
• The execution time of the NAS benchmarks increases slightly with frequent checkpointing
• Frequent checkpointing also has some impact on HPL performance
Q. Gao, W. Yu, W. Huang and D. K. Panda, Application-Transparent Checkpoint/Restart for MPI over InfiniBand, International Conference on Parallel Processing, Columbus, Ohio, 2006
Continued Research on Several Components of MPI
• Collective communication
  – All-to-all
  – All-gather
  – Barrier
  – All-reduce
• Intra-node shared memory support
  – User-level, kernel-level
  – NUMA support
  – Multi-core support
• MPI-2
  – One-sided (get, put, accumulate)
  – Synchronization (active, passive)
  – Message ordering
• Datatype
Continued Research on Several Components of MPI (Cont'd)
• Multi-rail support
  – Multiple NICs
  – Multiple ports/NIC
• uDAPL support
  – Portability across different networks (Myrinet, Quadrics, GigE, 10GigE)
• Fault tolerance
  – Checkpoint/restart
  – Automatic Path Migration
• MPI-IO
  – High performance I/O support
Major Research Directions
• System Software/Middleware
  – High Performance MPI on InfiniBand Cluster
  – Clustered Storage and File Systems
  – Solaris NFS over RDMA
  – iWARP and its Benefits to High Performance Computing
  – Efficient Shared Memory on High-Speed Interconnects
  – High Performance Computing with Virtual Machines (Xen-IB)
  – Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
  – High Performance Networking for TCP-based Applications
  – NIC-level Support for Collective Communication and Synchronization
  – NIC-level Support for Quality of Service (QoS)
  – Micro-Benchmarks and Performance Comparison of High-Speed Interconnects
• More details at http://nowlab.cse.ohio-state.edu/ -> Projects
Can we enhance NFS Performance with RDMA and InfiniBand?
• Many enterprise environments use file servers and NFS
• Most of the current systems use Ethernet
• Can such environments use InfiniBand?
• An NFS over RDMA standard has been proposed
• Designed and implemented this on InfiniBand in Open Solaris
  – Taking advantage of RDMA mechanisms
• Joint work with funding from Sun and Network Appliance
NFS over RDMA – New Design
• NFS over RDMA in Open Solaris
  – New designs with server-based RDMA Read and Write operations
  – Allows different registration modes, including IB FMR (Fast Memory Registration)
  – Enables multiple chunk support
  – Supports multiple concurrent outstanding RDMA operations
  – Interoperable with Linux NFSv3 and v4
  – Currently tested on InfiniBand only
• Advantages
  – Improved performance
  – Reduced CPU utilization and improved efficiency
  – Eliminates risk to NFS servers
• The code will be available with Solaris and Open Solaris in 2007
IOzone Read Performance
[Charts: read-write bandwidth (MB/s) and client utilization (%/(MB/s)) vs. the number of IOzone threads (1-12)]
• Sun x2100s (dual 2.2 GHz Opteron CPUs with x8 PCI-Express adapters)
• IOzone read bandwidth of up to 713 MB/s
IOzone Write Performance
[Chart: write throughput (MB/s) vs. the number of threads (1-8) for tmpfs and sinkfs]
• System: Sun x2100s (dual 2.2 GHz Opteron CPUs with x8 PCI-Express adapters)
• File size: 128 MB for tmpfs and 1 GB for sinkfs
• I/O size: 1 MB
Major Research Directions
• System Software/Middleware
  – High Performance MPI on InfiniBand Cluster
  – Clustered Storage and File Systems
  – Solaris NFS over RDMA
  – iWARP and its Benefits to High Performance Computing
  – Efficient Shared Memory on High-Speed Interconnects
  – High Performance Computing with Virtual Machines (Xen-IB)
  – Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
  – High Performance Networking for TCP-based Applications
  – NIC-level Support for Collective Communication and Synchronization
  – NIC-level Support for Quality of Service (QoS)
  – Micro-Benchmarks and Performance Comparison of High-Speed Interconnects
• More details at http://nowlab.cse.ohio-state.edu/ -> Projects
Why Targeting Virtualization?
• Ease of management
  – Virtualized clusters
  – VM migration to deal with system upgrades/failures
• Customized OS
  – Light-weight OSes have not been widely adopted due to management difficulties
  – VMs make those techniques practical
• System security & productivity
  – Users can do 'anything' in a VM; in the worst case they crash the VM, not the whole system
Challenges
• Performance overhead of I/O virtualization
• Management framework to take advantage of VM technology for HPC
• Migration of modern OS-bypass network devices
• File system support for VM-based clusters
Our Initial Studies
• J. Liu*, W. Huang, B. Abali*, D. K. Panda, High Performance VMM-Bypass I/O in Virtual Machines, USENIX '06
• W. Huang, J. Liu*, B. Abali*, D. K. Panda, A Case for High Performance Computing with Virtual Machines, ICS '06
(* external collaborators from IBM T.J. Watson Research Center)
From OS-bypass to VMM-bypass
[Diagram: guest VMs run applications and guest OS modules; a backend module and the original privileged module sit in the VMM; the device is reached either through privileged access or through VMM-bypass access]
• Guest modules in guest VMs handle setup and management operations (privileged access)
  – Guest modules communicate with backend modules in the VMM to get the job done
  – The original privileged module can be reused
• Once things are set up properly, devices can be accessed directly from the guest VMs (VMM-bypass access)
  – Either from the OS kernel or from applications
• Backend and privileged modules can also reside in a special VM
Xen-IB: An InfiniBand Virtualization Driver for Xen
• Follows the Xen split driver model
• Presents virtual HCAs to guest domains
  – Para-virtualization
• Two modes of access:
  – Privileged access
    • OS involved
    • Setup, resource management and memory management
  – OS/VMM-bypass access
    • Performed directly in user space / the guest VM
    • Maintains the high performance of InfiniBand hardware
MPI Latency and Bandwidth (MVAPICH)
[Charts: MPI latency (us) and bandwidth (MillionBytes/s) vs. message size for Xen-IB (xen) and native InfiniBand (native)]
• Only VMM-bypass operations are used
• Xen-IB performs similarly to native InfiniBand
• Numbers taken with MVAPICH and MVAPICH2, high performance MPI-1 and MPI-2 implementations over InfiniBand
  – Currently used by more than 405 organizations across the world
HPC Benchmarks (NAS)
[Chart: normalized execution time of the NAS benchmarks (BT, CG, EP, FT, IS, LU, MG, SP) in a VM vs. native]

Distribution of execution time among Dom0, the VMM and DomU:

  Benchmark   Dom0    VMM     DomU
  BT          0.4%    0.2%    99.4%
  CG          0.6%    0.3%    99.0%
  EP          0.6%    0.3%    99.3%
  FT          1.6%    0.5%    97.9%
  IS          3.6%    1.9%    94.5%
  LU          0.6%    0.3%    99.0%
  MG          1.8%    1.0%    97.3%
  SP          0.3%    0.1%    99.6%

• The NAS Parallel Benchmarks achieve similar performance in the VM and native environments
Future Work
• System-level support for better virtualization
  – Fully compatible implementation with the OpenIB/Gen2 interface (including SDP, MAD services, etc., besides user verbs)
  – Migration of OS-bypass interconnects
  – Enhancement to file systems to support effective image management in VM-based environments
  – Scalability studies

Future Work (Cont'd)
• Comparing with native checkpoint/restart in MPI over InfiniBand
  – MVAPICH2
  – BLCR-based approach
• Service virtualization
  – Migration of web services (e.g., virtualized HTTP servers, application servers, etc.)
  – Designing service management protocols to enable efficient coordination of services in a virtualized system
• Designing higher-level services with virtualization
  – Providing QoS and prioritization at all levels: network, application and system
  – Enabling fine-grained application scheduling for better system utilization
Major Research Directions
• System Software/Middleware
  – High Performance MPI on InfiniBand Cluster
  – Clustered Storage and File Systems
  – Solaris NFS over RDMA
  – iWARP and its Benefits to High Performance Computing
  – Efficient Shared Memory on High-Speed Interconnects
  – High Performance Computing with Virtual Machines (Xen-IB)
  – Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
  – High Performance Networking for TCP-based Applications
  – NIC-level Support for Collective Communication and Synchronization
  – NIC-level Support for Quality of Service (QoS)
  – Micro-Benchmarks and Performance Comparison of High-Speed Interconnects
• More details at http://nowlab.cse.ohio-state.edu/ -> Projects
Data Centers – Issues and Challenges
[Diagram: three-tier data center with Tier 1 routers/servers, Tier 2 application servers and Tier 3 database servers, connected through switches and backed by storage]
• Client requests come in over TCP (WAN)
• Traditionally, TCP requests have been forwarded through the multiple tiers
  – Higher response time and lower throughput
• Can performance be improved with IBA?
  – High performance TCP-like communication over IBA
  – TCP termination
Multi-Tier Commercial Data-Centers: Can They Be Re-architected?
[Diagram: clients reach the data center over the WAN; requests flow through a proxy server, a web server (Apache), an application server (PHP) and a database server (MySQL) backed by storage, with more complexity, overhead and a larger number of requests at each deeper tier]
• Proxy: caching, load balancing, resource monitoring, etc.
• The application server performs the business logic
  – Retrieves data from the database to process the requests
Data-Center Communication
• Data center services require efficient communication
  – Resource management
  – Cache and shared-state management
  – Support for data-centric applications like Apache
  – Load monitoring
• High performance and feature requirements
  – Utilizing the capabilities of networks like InfiniBand
• Providing effective mechanisms is a challenge!
SDP Architectural Model
[Diagram: in the traditional model, a sockets application uses the sockets API over a kernel TCP/IP sockets provider, TCP/IP transport driver and driver on top of an InfiniBand CA; in the possible SDP model, the same sockets application and API run over the Sockets Direct Protocol, which provides kernel bypass and RDMA semantics directly over the InfiniBand CA. Source: InfiniBand Trade Association, 2002]
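The appeal of SDP in this picture is that an unmodified sockets program keeps working. The hedged sketch below is ordinary TCP sockets code in C (the host name and port are illustrative); in an OFED-style deployment such a program could be carried over SDP/InfiniBand, for example through a preloaded SDP library, without any source changes.

/* Sketch: a plain TCP sockets client; nothing InfiniBand-specific appears
 * here, which is exactly the point of SDP's socket-level transparency.
 * Host and port are illustrative only. */
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_INET;
    hints.ai_socktype = SOCK_STREAM;   /* stream socket: SDP can map this onto IB */

    int rc = getaddrinfo("appserver.example.com", "8080", &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return 1;
    }

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");
        freeaddrinfo(res);
        return 1;
    }

    const char *req = "GET /status HTTP/1.0\r\n\r\n";
    char reply[4096];
    ssize_t n;

    send(fd, req, strlen(req), 0);                 /* same calls over TCP or SDP */
    while ((n = recv(fd, reply, sizeof(reply), 0)) > 0)
        fwrite(reply, 1, (size_t)n, stdout);

    close(fd);
    freeaddrinfo(res);
    return 0;
}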
Our Objectives
• To study the importance of the communication layer in the context of a multi-tier data center
  – Sockets Direct Protocol (SDP) vs. IPoIB
• To explore whether InfiniBand mechanisms can help in designing various data-center components efficiently
  – Web caching/coherency
  – Re-configurability and QoS
  – Load balancing, I/O and file systems, etc.
• Studying workload characteristics
• In-memory databases
3-Tier Datacenter Testbed at OSU
[Diagram: clients generate requests for both web servers and database servers; Tier 1 proxy nodes handle TCP termination, load balancing and caching; Tier 2 application/web servers (Apache, PHP) handle caching schemes, dynamic content caching and persistent connections; Tier 3 database servers (MySQL/DB2) are used for file system evaluation]
SDP vs. IPoIB: Latency and Bandwidth (3.4 GHz, PCI-Express)
[Charts: latency (us) and bandwidth (Mbps) vs. message size (1 byte to 16 KBytes) for IPoIB and SDP]
• SDP enables high bandwidth (up to 750 MBytes/sec, i.e. 6000 Mbps) and low latency (21 us) message passing
Cache Coherency and Consistency with Dynamic Data
[Diagram: user requests are served from caches on the proxy nodes while updates arrive at the back-end nodes, so the cached responses must be kept coherent and consistent with the back-end data]
Active Caching Performance
[Charts: throughput in transactions per second (TPS) vs. the number of threads for a ZipF distribution and for the WorldCup trace, comparing No Cache, TCP/IP, RDMA and SDP]
"Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-tier Data-centers over InfiniBand", S. Narravula, P. Balaji, K. Vaidyanathan and D. K. Panda, IEEE International Conference on Cluster Computing and the Grid (CCGrid) '05
"Supporting Strong Coherency for Active Caches in Multi-tier Data-Centers over InfiniBand", S. Narravula, P. Balaji, K. Vaidyanathan, K. Savitha, J. Wu and D. K. Panda, Workshop on System Area Networks (SAN), held with HPCA '03
Dynamic Re-configurability and QoS
• More datacenters are serving dynamic data
• How should the number of proxy nodes vs. application servers be decided?
• Current approach
  – Use a fixed distribution
  – Incorporate over-provisioning to handle dynamic data
• Can we design a dynamic re-configurability scheme with shared state using RDMA operations? (see the conceptual sketch below)
  – Allows a proxy node to work as an app server and vice versa, as needed
  – Allocates resources as needed
  – Over-provisioning need not be used
• The scheme can also be used for servers hosting multiple web sites
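A purely conceptual sketch of this idea follows; the helper names are hypothetical placeholders, not an existing API. Each node periodically fetches the load counters of the other nodes with one-sided reads of the shared state, so no remote CPU is interrupted, and switches its role when the imbalance between tiers crosses a threshold.

/* Conceptual sketch of RDMA-based reconfiguration; all helpers and the
 * threshold are hypothetical placeholders, not an actual implementation. */
#include <stdio.h>

#define NNODES 8
enum role { PROXY, APP_SERVER };

/* Hypothetical: one-sided read of node i's load counter from the shared
 * state (in a real design this would be an RDMA Read, so the remote CPU
 * is not involved). Returns synthetic data here. */
static unsigned read_remote_load(int i) { return (unsigned)((i * 13) % 70); }

/* Hypothetical: advertise and switch this node's role. */
static void switch_role(enum role r)
{
    printf("switching role to %s\n", r == PROXY ? "proxy" : "app server");
}

int main(void)
{
    enum role my_role = PROXY;
    unsigned proxy_load = 0, app_load = 0;

    /* Aggregate the load reported by each tier's nodes (here, even-numbered
     * nodes stand in for proxies, odd-numbered ones for app servers). */
    for (int i = 0; i < NNODES; i++) {
        unsigned l = read_remote_load(i);
        if (i % 2 == 0) proxy_load += l; else app_load += l;
    }

    /* Reconfigure when the application tier is overloaded relative to the
     * proxy tier (illustrative threshold only). */
    if (my_role == PROXY && app_load > 2 * proxy_load) {
        my_role = APP_SERVER;
        switch_role(my_role);
    }
    return 0;
}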
Dynamic Reconfigurability in Shared Multi-tier Data-Centers
[Diagram: clients reach load-balancing clusters for web sites A, B and C over the WAN; nodes reconfigure themselves toward the highly loaded web sites at run time]
Dynamic Re-configurability with Shared State using RDMA Operations
[Charts: throughput (TPS) vs. burst length (512-16384) for the Rigid, Reconf and Over-Provisioning schemes, and the number of busy nodes over time for each scheme]
• The performance of the dynamic reconfiguration scheme largely depends on the burst length of the requests
• For large bursts of requests, the dynamic reconfiguration scheme utilizes all idle nodes in the system
P. Balaji, S. Narravula, K. Vaidyanathan, H.-W. Jin, K. Savitha and D. K. Panda, Exploiting Remote Memory Operations to Design Efficient Reconfigurations for Shared Data-Centers over InfiniBand, RAIT '04
QoS Meeting Capabilities
[Charts: percentage of times hard QoS is met for low-priority and high-priority requests in Cases 1-3, comparing Reconf, Reconf-P and Reconf-PQ]
• Re-configurability with QoS and prioritization handles the QoS requirements of both high- and low-priority requests in all cases
P. Balaji, S. Narravula, K. Vaidyanathan, H.-W. Jin and D. K. Panda, On the Provision of Prioritization and Soft QoS in Dynamically Reconfigurable Shared Data Centers over InfiniBand, Int'l Symposium on Performance Analysis of Systems and Software (ISPASS '05)
Other Research Directions
• System Software/Middleware
  – High Performance MPI on InfiniBand Cluster
  – Clustered Storage and File Systems
  – Solaris NFS over RDMA
  – iWARP and its Benefits to High Performance Computing
  – Efficient Shared Memory on High-Speed Interconnects
  – High Performance Computing with Virtual Machines (Xen-IB)
  – Design of Scalable Data-Centers with InfiniBand
• Networking and Communication Support
  – High Performance Networking for TCP-based Applications
  – NIC-level Support for Collective Communication and Synchronization
  – NIC-level Support for Quality of Service (QoS)
  – Micro-Benchmarks and Performance Comparison of High-Speed Interconnects
• More details at http://nowlab.cse.ohio-state.edu/ -> Projects
Presentation Outline
• Systems Area in the Department
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
NBCLab Infrastructure
• Multiple clusters (around 500 processors in total)
  – 128-processor
    • 32 dual Intel EM64T processor
    • 32 dual AMD Opteron
    • InfiniBand DDR on all nodes (32 nodes with dual rail)
  – 128-processor (donation by Intel for 99 years)
    • 64 dual Intel Pentium 2.6 GHz processor
    • InfiniBand SDR on all nodes
  – 32-processor (donation by Appro, Mellanox and Intel)
    • 16 dual Intel EM64T blades
    • InfiniBand SDR and DDR on all nodes
  – 28-processor (donation by Sun)
    • 12 dual AMD Opteron
    • 1 quad AMD Opteron
  – 32-processor
    • 16 dual-Pentium 300 MHz nodes
    • Myrinet, Gigabit Ethernet, and GigaNet cLAN
  – 96-processor
    • 16 quad-Pentium 700 MHz nodes
    • 16 dual-Pentium 1.0 GHz nodes
    • Myrinet and GigaNet cLAN
  – 32-processor
    • 8 dual-Xeon 2.4 GHz nodes
    • 8 dual-Xeon 3.0 GHz nodes
    • InfiniBand, Quadrics, Ammasso (1GigE), Level 5 (1GigE), Chelsio (10GigE), Neteffect (10GigE)
  – 18 dual GHz nodes
  – Many other donated systems from Intel, AMD, Apple, IBM, SUN, Microway and QLogic
• Other test systems and desktops
NBCLab Clusters
High-End Computing and Networking Research Testbed for Next Generation Data-Driven, Interactive Applications
PI: D. K. Panda
Co-PIs: G. Agrawal, P. Sadayappan, J. Saltz and H.-W. Shen
Other Investigators: S. Ahalt, U. Catalyurek, H. Ferhatosmanoglu, H.-W. Jin, T. Kurc, M. Lauria, D. Lee, R. Machiraju, S. Parthasarathy, P. Sinha, D. Stredney, A. E. Stutz, and P. Wyckoff
Dept. of Computer Science and Engineering, Dept. of Biomedical Informatics, and Ohio Supercomputer Center, The Ohio State University
Total funding: $3.01M ($1.53M from NSF + $1.48M from the Ohio Board of Regents and various units at OSU)
Experimental Testbed Installed
[Diagram: testbed spanning OSC, BMI, and CSE, connected by 10 GigE switches over 2x10 = 20 GigE links (upgrading to 40 GigE in Yr 4): a mass storage system of 500 TBytes (existing), a 70-node memory cluster with 512 GBytes memory, 24 TB disk, GigE and InfiniBand SDR, a 64-node compute cluster with InfiniBand DDR and 4 TB disk (10 GigE to be added on some nodes), graphics adapters and haptic devices, 20 wireless clients with 3 access points, and a video wall; several components are upgraded in Yr 4]
Collaboration among the Components and Investigators
[Diagram: layered collaboration across the project components]
• Data-intensive applications: Saltz, Stredney, Sadayappan, Machiraju, Parthasarathy, Catalyurek, and other OSU collaborators
• Data-intensive algorithms: Shen, Agrawal, Machiraju, and Parthasarathy
• Programming systems and scheduling: Saltz, Agrawal, Sadayappan, Kurc, Catalyurek, Ahalt, and Hakan
• Networking, communication, QoS, and I/O: Panda, Jin, Lee, Lauria, Sinha, Wyckoff, and Kurc
Wright Center for Innovation (WCI)
• New funding to install a larger cluster with 64 nodes, each with two dual-core processors (up to 256 processors)
• Storage nodes with 40 TBytes of space
• Connected with InfiniBand DDR
• Focuses on advanced data management
Presentation Outline
• Systems Area in the Department
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
788.08P course • Network-Based Computing • Covers all the issues in detail and develops the foundation in this research area • Last offered in Wi ‘06 – http://www.cis.ohio-state.edu/~panda/788/ – Contains all papers and presentations
• Will be offered in Wi ’08 again
888.08P course • Research Seminar on Network-Based Computing • Offered every quarter • This quarter (Au ’06) offering – Mondays: 4:30-6:00pm – Thursdays: 3:30-4:30pm
• Project-based course • 3-credits • Two-quarter rule
Presentation Outline
• Systems Area in the Department
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
Group Members
• Currently 12 PhD students and one MS student
  – Lei Chai, Qi Gao, Wei Huang, Matthew Koop, Rahul Kumar, Ping Lai, Amith Mamidala, Sundeep Narravula, Ranjit Noronha, Sayantan Sur, Gopal Santhanaraman, Karthik Vaidyanathan and Abhinav Vishnu
• All current students are supported as RAs
• Programmers
  – Shaun Rowland
  – Jonathan Perkins
• System Administrator
  – Keith Stewart
Publications from the Group
• Since 1995
  – 11 PhD dissertations
  – 15 MS theses
  – 180+ papers in prestigious international conferences and journals
  – 5 journal and 24 conference/workshop papers in 2005
  – 2 journal and 21 conference/workshop papers in 2006 (so far)
  – Electronically available from the NBC web page
    • http://nowlab.cse.ohio-state.edu -> Publications
Student Quality and Accomplishments
• First employment of graduated students
  – IBM TJ Watson, IBM Research, Argonne National Laboratory, Oak Ridge National Laboratory, SGI, Compaq/Tandem, Pacific Northwest National Lab, ASK Jeeves, Fore Systems, Microsoft, Lucent, Citrix, Dell
• Many employments are by invitation
• Several of the past and current students have received
  – OSU Graduate Fellowship
  – OSU Presidential Fellowship ('95, '98, '99, '03 and '04)
  – IBM Co-operative Fellowship ('96, '97, '02 and '06)
  – CIS annual research award ('99, '02, '04 (2), '05 (2), '06)
  – Best Paper awards at conferences (HPCA '95, IPDPS '03, Cluster '03, Euro PVM '05, Hot Interconnect '04, Hot Interconnect '05, ...)
• Big demand for summer internships every year
  – IBM TJ Watson, IBM Almaden, Argonne, Sandia, Los Alamos, Pacific Northwest National Lab, Intel, Dell, ...
  – 7 internships this past summer (Argonne (2), IBM TJ Watson, IBM, Sun, LLNL, HP)
In News
• Research in our group gets quoted in press releases and articles in a steady manner
  – HPCwire, Grid Today, Supercomputing Online, AVM News
• http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ -> In News
Visit to Capitol Hill (June 2005)
• Invited to Capitol Hill (CNSF) to demonstrate the benefits of InfiniBand networking technology to
  – Congressmen
  – Senators
  – White House Director (Office of Science and Technology)
• More details at http://www.cra.org/govaffairs/blog/archives/000101.htm
Looking For ...
• Motivated students with strong interest in different aspects of network-based computing
  – architecture
  – networks
  – operating systems
  – algorithms
  – applications
• Should have good knowledge in Architecture (775, 875), High Performance/Parallel Computing (621, 721), Networking (677, 678), Operating Systems (662)
• Should have, or should be willing to develop, skills in
  – architecture design
  – networking
  – parallel algorithms
  – parallel programming
  – experimentation and evaluation
Presentation Outline
• Systems Area in the Department
• Network-Based Computing (NBC) Trend
• Different Kinds of NBC Systems
• Our Vision, Collaboration, Funding
• Research Challenges and Projects
• Available Experimental Testbed
• Related 788 and 888 Courses
• Students in the Group and their accomplishments
• Conclusions
Conclusions • Network-Based computing is an emerging trend • Small to medium scale network-based computing systems are getting ready to be used • Many open research issues need to be solved before building large-scale network-based computing systems • Will be able to provide affordable, portable, scalable, and ..able computing paradigm for the next 10-15 years • Exciting area of research
Web Pointers
• NBC group home pages
  – http://www.cse.ohio-state.edu/~panda/
  – http://nowlab.cse.ohio-state.edu/
• E-mail: [email protected]
Acknowledgements
Our research is supported by the following organizations
• Current funding support by [sponsor logos]
• Current equipment support by [sponsor logos]