INFINIBAND CS 708 Seminar
NEETHU RANJIT (Roll No. 05088) B. Tech. Computer Science & Engineering
College of Engineering Kottarakkara Kollam 691 531 Ph: +91.474.2453300 http://www.cek.ihrd.ac.in
[email protected]
Certificate
This is to certify that this report titled InfiniBand is a bonafide record of the CS 708 Seminar work done by Miss. NEETHU RANJIT, Reg. No. 10264042, Seventh Semester B. Tech. Computer Science & Engineering student, under our guidance and supervision, in partial fulfillment of the requirements for the award of the degree of B. Tech. Computer Science and Engineering of Cochin University of Science & Technology.
October 16, 2008
Guide: Mr Renjith S.R, Lecturer, Dept. of Computer Science & Engg.
Coordinator & Dept. Head: Mr Ahammed Siraj K K, Asst. Professor, Dept. of Computer Science & Engg.
Acknowledgments
I express my wholehearted thanks to our respected Principal, Dr Jacob Thomas, and to Mr Ahammed Siraj sir, Head of the Department, for providing me with the guidance and facilities for the seminar. I wish to express my sincere thanks to Mr Renjith sir, Lecturer in the Computer Science Department and my guide, for his timely advice during the course of my seminar. I thank all faculty members of College of Engineering Kottarakara for their cooperation in completing my seminar. My sincere thanks to all those well-wishers and friends who have helped me during the course of the seminar work and have made it a great success. Above all, I thank the Almighty Lord, the foundation of all wisdom, for guiding me step by step throughout my seminar. Last but not least, I would like to thank my parents for their moral support.
NEETHU RANJIT
Abstract
InfiniBand is a powerful new architecture designed to support I/O connectivity for the Internet infrastructure. It is supported by all major OEM server vendors as a means to expand and create the next-generation I/O interconnect standard in servers. For the first time, a high-volume, industry-standard I/O interconnect extends the role of traditional "in the box" buses beyond the physical connector. InfiniBand is unique in providing connectivity in a way previously reserved only for traditional networking, and this unification of I/O and system-area networking requires a new architecture domain. Underlying this major transition is InfiniBand's superior ability to support the Internet's requirement for RAS: Reliability, Availability, and Serviceability.
The InfiniBand Architecture (IBA) is an industry-standard architecture for server I/O and interprocessor communication. IBA enables Quality of Service (QoS) through three main mechanisms: service levels, virtual lanes, and table-based arbitration of virtual lanes. A formal model exists to manage the IBA arbitration tables so as to provide QoS; according to this model, each application needs a sequence of entries in the arbitration tables based on its requirements, which relate to the mean bandwidth needed and the maximum latency tolerated by the application. InfiniBand provides a comprehensive silicon, software, and system solution; this report gives an overview of its layered protocol and management infrastructure. The comprehensive nature of the architecture is reflected in the major sections of the InfiniBand specification, which range from industry-standard electrical interfaces and mechanical connectors to well-defined software and management services. InfiniBridge is a channel adapter and switch architecture that implements InfiniBand's packet-switching features.
Contents

1 INTRODUCTION
2 INFINIBAND ARCHITECTURE
3 COMPONENTS OF INFINIBAND
  3.1 HCA and TCA Channel Adapters
  3.2 Switches
  3.3 Routers
4 INFINIBAND BASIC FABRIC TOPOLOGY
5 IBA Subnet
  5.1 Links
  5.2 Endnodes
6 FLOW CONTROL
7 INFINIBAND SUBNET MANAGEMENT AND QoS
8 REMOTE DIRECT MEMORY ACCESS (RDMA)
  8.1 Comparing Traditional Server I/O and RDMA-Enabled I/O
9 INFINIBAND PROTOCOL STACK
  9.1 Physical Layer
  9.2 Link Layer
  9.3 Network Layer
  9.4 Transport Layer
10 COMMUNICATION SERVICES
  10.1 Communication Stack: InfiniBand Support for the Virtual Interface Architecture (VIA)
11 INFINIBAND FABRIC VERSUS SHARED BUS
12 INFINIBRIDGE
  12.1 Hardware Transport Performance of InfiniBridge
13 INFINIBRIDGE CHANNEL ADAPTER ARCHITECTURE
14 VIRTUAL OUTPUT QUEUEING ARCHITECTURE
15 FORMAL MODEL TO MANAGE INFINIBAND ARBITRATION TABLES TO PROVIDE QUALITY OF SERVICE (QoS)
  15.1 THREE MECHANISMS TO PROVIDE QoS
    15.1.1 Service Level
    15.1.2 Virtual Lanes
    15.1.3 Virtual Arbitration Table
16 FORMAL MODEL FOR THE INFINIBAND ARBITRATION TABLE
  16.0.4 Initial Hypothesis
17 FILLING IN THE VL ARBITRATION TABLE
  17.1 Insertion and elimination in the table
    17.1.1 Example 1
  17.2 Defragmentation Algorithm
  17.3 Reordering Algorithm
  17.4 Global management of the table
18 CONCLUSION
REFERENCES
1 INTRODUCTION
Bus architectures have a tremendous amount of inertia because they dictate the bus interface architecture of semiconductor devices. For this reason, successful bus architectures typically enjoy a dominant position for ten years or more. The PCI bus was introduced to the standard PC architecture in the early 1990s and has maintained its dominance with only one major upgrade during that period: from 32-bit/33 MHz to 64-bit/66 MHz. The PCI-X initiative takes this one step further, to 133 MHz, and seemingly should provide the PCI architecture with a few more years of life. But there is a divergence between what personal computers and servers require.
Throughout the past decade of fast-paced computer development, the traditional Peripheral Component Interconnect (PCI) architecture has continued to be the dominant input/output standard for most internal backplane and external peripheral connections. However, the PCI bus, with its shared-bus approach, is now noticeably lagging. Performance limitations, poor bandwidth, and reliability issues are surfacing within the higher market tiers, and the PCI bus is quickly becoming an outdated technology.
Computers are made up of a number of addressable elements (CPU, memory, screen, hard disks, LAN and SAN interfaces, and so on) that use a system bus for communication. As these elements have become faster, the system bus and the overhead associated with data movement between devices, commonly referred to as I/O, have become a gating factor in computer performance. To address the problem of server performance with respect to I/O in particular, InfiniBand was developed as a standards-based protocol that offloads data movement from the CPU to dedicated hardware, thus allowing more CPU cycles to be dedicated to application processing. As a result, InfiniBand, by leveraging networking technologies and principles, provides scalable, high-bandwidth transport for efficient communication between InfiniBand-attached devices.
InfiniBand technology advances I/O connectivity for data center and enterprise infrastructure deployment, overcoming the I/O bottleneck in today's server architectures. Although primarily suited for next-generation server I/O, InfiniBand can also extend to the embedded computing, storage, and telecommunications industries. This high-volume, industry-standard I/O interconnect extends the role of traditional backplane and board buses beyond the physical connector.
Another major bottleneck is the scalability problem of parallel-bus architectures such as the Peripheral Component Interconnect (PCI). As these buses scale in speed, they cannot support the multiple network interfaces that system designers require. For example, the PCI-X bus at 133 MHz can support only one slot, and at higher speeds these buses begin to look like point-to-point connections. Mellanox Technologies' InfiniBand silicon product, InfiniBridge, lets system designers construct entire fabrics based on the device's switching and channel adapter functionality. InfiniBridge implements an advanced set of packet switching, quality of service, and flow control mechanisms. These capabilities support multiprotocol environments with many I/O devices shared by multiple servers. InfiniBridge features include an integrated switch and PCI channel adapter, InfiniBand 1X and 4X link speeds (defined as 2.5 and 10 Gbps), eight virtual lanes, and a maximum transfer unit (MTU) size of up to 2 Kbytes. InfiniBridge also offers multicast support, an embedded subnet management agent, and InfiniPCI for transparent PCI-to-PCI bridging.
InfiniBand is an architecture and specification for data flow between processors and I/O devices that promises greater bandwidth and almost unlimited expandability; it is hence positioned to replace the existing PCI bus. Offering throughput of 2.5 gigabits per second per 1X link and support for up to 64,000 addressable devices, the architecture also promises increased reliability, better sharing of data between clustered processors, and built-in security. The InfiniBand architecture specification was released by the InfiniBand Trade Association, and InfiniBand is backed by top companies in the industry, including Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun. Underlying this major I/O transition, InfiniBand provides the unique feature of Quality of Service through several mechanisms; one such mechanism is the formal method of managing the arbitration tables.
2 INFINIBAND ARCHITECTURE
InfiniBand is a switched, point-to-point interconnect for data centers based on a 2.5-Gbps link speed, scaling up to 30 Gbps. The architecture defines a layered hardware protocol (physical, link, network, and transport layers) and a software layer to support fabric management and low-latency communication between devices. InfiniBand provides transport services for the upper-layer protocols and supports flow control and Quality of Service to provide ordered, guaranteed packet delivery across the fabric. An InfiniBand fabric may comprise a number of InfiniBand subnets that are interconnected using InfiniBand routers, where each subnet may consist of one or more InfiniBand switches and InfiniBand-attached devices. The InfiniBand standard defines Reliability, Availability, and Serviceability from the ground up, making the specification efficient to implement in silicon yet able to support a broad range of applications.
InfiniBand's physical layer supports a wide range of media by using a differential serial interconnect with an embedded clock. This signaling supports printed circuit board, backplane, copper, and fiber links, and it leaves room for further growth in speed and media types. The physical layer implements 1X, 4X, and 12X links by byte striping over multiple links. An InfiniBand system area network has four basic system components that interconnect using InfiniBand links, as Figure 1 shows:
The host channel adapter (HCA) terminates a connection for a host node. It includes hardware features to support high-performance memory transfers into CPU memory.
The target channel adapter (TCA) terminates a connection for a peripheral node. It defines a subset of HCA functionality and can be optimized for embedded applications.
The switch handles link-layer packet forwarding. A switch does not consume or generate packets other than management packets.
The router sends packets between subnets using the network layer. InfiniBand routers divide InfiniBand networks into subnets and do not consume or generate packets other than management packets.
A subnet manager runs on each subnet and handles device and connection management tasks. A subnet manager can run on a host or be embedded in switches and routers. All system components must include a subnet management agent that handles communication with the subnet manager.
Figure 1: INFINIBAND ARCHITECTURE
3 COMPONENTS OF INFINIBAND
The main components in the InfiniBand architecture are:
3.1 HCA and TCA Channel Adapters
HCAs are present in servers or even desktop machines and provide the interface used to integrate InfiniBand with the operating system. TCAs are present on I/O devices such as a RAID subsystem or a JBOD subsystem. Host and target channel adapters present an interface to the layers above them that allows those layers to generate and consume packets. In the case of a server writing a file to a storage device, the host generates the packets that are then consumed by the storage device. Each channel adapter has one or more ports; a channel adapter with more than one port may be connected to multiple switch ports.
3.2 Switches
Switches simply forward packets between two of their ports based on the established routing table and the addressing information stored in the packets. A collection of end nodes connected to one another through one or more switches forms a subnet. Each subnet must have at least one subnet manager that is responsible for the configuration and management of the subnet.
Figure 2: InfiniBand Switch
3.3 Routers
Routers are like switches in that they simply forward packets between their ports. The difference is that a router is used to interconnect two or more subnets to form a multidomain system area network. Within a subnet, each port is assigned a unique identifier by the subnet manager, called the local identifier (LID). In addition to the LID, each port is assigned a globally unique identifier (GID). A main feature of the InfiniBand architecture, not available in the current shared-bus I/O architecture, is the ability to partition the ports within the fabric that can communicate with one another. This is useful for partitioning the available storage across one or more servers for management reasons.
Figure 3: System Network of Infiniband
4 INFINIBAND BASIC FABRIC TOPOLOGY

InfiniBand is a high-speed, serial, channel-based, switch-fabric, message-passing architecture in which servers, Fibre Channel devices, SCSI RAID arrays, routers, and other end nodes each get their own dedicated fat pipe. Each node can talk to any other node in a many-to-many configuration. Redundant paths can be set up through an InfiniBand fabric for fault tolerance, and InfiniBand routers can connect multiple subnets.
The figure below shows the simplest configuration of an InfiniBand installation, where two or more nodes are connected to one another through the fabric. A node represents either a host device, such as a server, or an I/O device, such as a RAID subsystem. The fabric itself may consist of a single switch in the simplest case, or a collection of interconnected switches and routers. Each connection between nodes, switches, and routers is a point-to-point, serial connection.
Figure 4: InfiniBand Fabric Topology
Figure 5: IBA SUBNET
5 IBA Subnet
The smallest complete IBA unit is a subnet, illustrated in the figure. Multiple subnets can be joined by routers (not shown) to create large IBA networks. The elements of a subnet, as shown in the figure, are endnodes, switches, links, and a subnet manager. Endnodes, such as hosts and devices, send messages over links to other endnodes; the messages are routed by switches. Routing is defined, and subnet discovery performed, by the subnet manager. Channel adapters (CAs), not shown, connect endnodes to links.
5.1 Links
IBA links are bidirectional point-to-point communication channels and may be either copper or optical fibre. The signalling rate on all links is 2.5 Gbaud in the 1.0 release; later releases will undoubtedly be faster. Automatic training sequences are defined in the architecture
that will allow compatibility with later, faster speeds. The physical links may be used in parallel to achieve greater bandwidth; the different link widths are referred to as 1X, 4X, and 12X. The basic 1X copper link has four wires, comprising a differential signaling pair for each direction. Similarly, the 1X fibre link has two optical fibres, one for each direction. Wider widths increase the number of signal paths accordingly. There is also a copper backplane connection allowing dense structures of modules to be constructed. The 1X size allows up to six ports on the faceplate of the standard (smallest) size IBA module. Short-reach (multimode) optical fibre links are provided in all three widths; while distances are not specified, it is expected that they will reach 250 m for 1X and 125 m for 4X and 12X. Long-reach (single-mode) fibre is defined in the 1.0 IBA specification only for 1X widths, with an anticipated reach of up to 10 km.
5.2 Endnodes
IBA endnodes are the ultimate sources and sinks of communication in IBA. They may be host systems or devices (network adapters, storage subsystems, and so on). It is also possible that endnodes will be developed that are bridges to legacy I/O buses such as PCI, but whether and how that is done is vendor-specific; it is not part of the InfiniBand architecture. Note that, as a communication service, IBA makes no distinction between these types; an endnode is simply an endnode. So all IBA facilities may be used equally to communicate between hosts and devices; between hosts and other hosts, like normal networking; or even directly between devices, for example direct disk-to-tape backup without any load imposed on a host. IBA defines several standard form factors for devices used as endnodes: standard, wide, tall, and tall-wide. The standard form factor is approximately 20 x 100 x 220 mm; wide doubles the width and tall doubles the height.
Figure 6: Flow control in InfiniBand
6 FLOW CONTROL
InfiniBand defines two levels of credit-based flow control to manage congestion: link level and end-to-end. Link-level flow control applies back pressure to traffic on a link, while end-to-end flow control protects against buffer overflow at endpoint connections that might be multiple hops away. Each receiving end of a link or connection supplies credits to the sending device to specify the amount of data that it can reliably receive. Sending devices do not transmit data unless the receiver advertises credits indicating available receive buffer space. The link and connection protocols have built-in credit passing between each pair of devices to guarantee reliable flow-control operation. InfiniBand handles link-level flow control on a per-quality-of-service-level (virtual lane) basis.
InfiniBand has a unidirectional 2.5-Gbps wire-speed connection (250 MB/s, using the 8B/10B encoding of 10 bits per data byte, similar to 3GIO) and uses either one differential signal pair per direction (1X) or 4 (4X) or 12 (12X) pairs, for bandwidth up to 30 Gbps per direction (12 x 2.5 Gbps). Bidirectional throughput with InfiniBand is often expressed in MB/s, yielding 500 MB/s for 1X, 2 GB/s for 4X, and 6 GB/s for 12X. Each bidirectional 1X connection consists of four wires, two for send and two for receive. Both fiber and copper are supported; copper can be in the form of traces or cables, and fiber distances between nodes can be as far as 300 meters or more. Each InfiniBand subnet can host up to 64,000 nodes.
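The following Python sketch illustrates the per-virtual-lane credit mechanism described above; the class name, methods, and credit granularity are illustrative assumptions, not definitions from the InfiniBand specification.

# Minimal sketch of credit-based, per-virtual-lane link flow control.
# Names and buffer granularity are illustrative, not taken from the IBA spec.

class VirtualLaneLink:
    def __init__(self, num_vls=8, buffer_credits=16):
        # Credits advertised by the receiver for each data VL.
        self.credits = {vl: buffer_credits for vl in range(num_vls)}

    def advertise_credits(self, vl, freed_buffers):
        """Receiver frees buffer space and grants more credits for one VL."""
        self.credits[vl] += freed_buffers

    def try_send(self, vl, packet_buffers):
        """Sender transmits on a VL only if enough credits are available."""
        if self.credits[vl] >= packet_buffers:
            self.credits[vl] -= packet_buffers
            return True          # packet goes on the wire
        return False             # hold the packet; no receive buffer space

link = VirtualLaneLink()
assert link.try_send(vl=0, packet_buffers=4)      # succeeds, 12 credits left
link.advertise_credits(vl=0, freed_buffers=4)     # receiver returns credits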
7 INFINIBAND SUBNET MANAGEMENT AND QoS

InfiniBand supports two levels of management packets: subnet management and the general services interface (GSI). High-priority subnet management packets (SMPs) are used to discover the topology of the network, the attached nodes, and so on, and are transported within the high-priority VLane (which is not subject to flow control). The low-priority GSI management packets handle management functions such as chassis management and other functions not associated with subnet management. These services are not critical to subnet management, so GSI management packets are not transported within the high-priority VLane and are subject to flow control.
InfiniBand supports quality of service at the link level through virtual lanes. An InfiniBand virtual lane is a separate logical communication link that shares a single physical link with other virtual lanes. Each virtual lane has its own buffer and flow-control mechanism implemented at each port of a switch. InfiniBand allows up to 15 general-purpose virtual lanes plus one additional lane dedicated to management traffic. Link-layer quality of service comes from isolating traffic congestion to individual virtual lanes. For example, the link layer can isolate isochronous real-time traffic from non-real-time data traffic, that is, isolate real-time voice or multimedia streams from Web or FTP data traffic. The system manager can assign a higher virtual-lane priority to voice traffic, in effect scheduling voice packets ahead of congested data packets in each link buffer encountered on the voice packets' end-to-end path. Thus, the voice traffic will still move through the fabric with minimal latency.
InfiniBand presents a number of transport services that provide different characteristics. To ensure reliable, sequenced packet delivery, InfiniBand uses flow control and service levels in conjunction with VLanes to achieve end-to-end QoS. InfiniBand VLanes are logical channels that share a common physical link, where VLane 15 has the highest priority and is used exclusively for management traffic, and VLane 0 has the lowest. The concept of a VLane is similar to that of the hardware queues found in routers and switches. For applications that require reliable delivery, InfiniBand supports reliable delivery of packets using flow control. Within an InfiniBand network, the receivers on a point-to-point link periodically transmit
information to the upstream transmitter to specify the amount of data that can be transmitted without data loss, on a per-VLane basis. The transmitter can then transmit data up to the amount of credits advertised by the receiver. If no buffer credits exist, data cannot be transmitted. The use of credit-based flow control prevents packet loss that might result from congestion. Furthermore, it enhances application performance, because it avoids packet retransmission. For applications that do not require reliable delivery, InfiniBand also supports unreliable delivery of packets (which may be dropped with little or no consequence) that are not subject to flow control; some management traffic, for example, does not require reliable delivery.
At the InfiniBand network layer, the GRH contains an 8-bit traffic class field. This value is mapped to a 4-bit service level field within the LRH to indicate the service level that the packet is requesting from the InfiniBand network. The HCA matches the packet's service level against a service-level-to-VLane table, which has been populated by the subnet manager, and then transmits the packet on the VLane associated with that service level. As the packet traverses the network, each switch matches the service level against a table for the packet's egress port to identify the VLane within which the packet should be transported.
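As a rough illustration of the service-level-to-VLane lookup just described, the sketch below shows how an egress port might resolve a packet's VLane from its service level; the table contents and function name are hypothetical.

# Illustrative sketch of resolving the VLane for a packet at an egress port:
# the 4-bit service level indexes an SL-to-VL table that the subnet manager
# has populated for that port. The mappings here are made up for the example.

SL_TO_VL = {15: 15,       # management traffic -> VL15 (highest priority)
            7: 3,         # e.g. an isochronous voice/video class -> VL3
            0: 0}         # best-effort data -> VL0

def egress_vlane(service_level: int) -> int:
    # Unmapped service levels fall back to the default data lane, VL0.
    return SL_TO_VL.get(service_level, 0)

print(egress_vlane(7))    # -> 3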
Figure 7: RDMA Hardware
8 REMOTE DIRECT MEMORY ACCESS (RDMA)
One of the key problems with server I/O is the CPU overhead associated with data movement between memory and I/O devices such as LAN and SAN interfaces. InfiniBand solves this problem by using RDMA to offload data movement from the server CPU to the InfiniBand host channel adapter (HCA). RDMA is an extension of hardware-based Direct Memory Access (DMA) capabilities that allow the CPU to delegate data movement within the computer to the DMA hardware: the CPU tells the DMA hardware the memory location where the data associated with a particular process resides and the memory location the data is to be moved to. Once the DMA instructions are sent, the CPU can process other threads while the DMA hardware moves the data. RDMA extends this idea so that data can be moved from one memory location to another even if that memory resides on another device.
8.1 Comparing Traditional Server I/O and RDMA-Enabled I/O

The process in a traditional server I/O operation is extremely inefficient because it results in multiple copies of the same data traversing the memory system bus, and it also invokes multiple CPU interrupts and context switches. By contrast, RDMA, an embedded hardware function of the InfiniBand HCA, handles all communication operations without interrupting the CPU. Using RDMA, the sending device reads data from, or writes data to, the target device's user-space memory, thereby avoiding CPU interrupts and multiple data copies on the memory bus; this enables RDMA to significantly reduce CPU overhead.

Figure 8: Traditional Server I/O
Figure 9: RDMA-Enabled Server I/O
Figure 10: InfiniBand Protocol Stack
9 INFINIBAND PROTOCOL STACK
From a protocol perspective, the InfiniBand architecture consists of four layers: physical, link, network, and transport. These layers are analogous to Layers 1 through 4 of the OSI protocol stack. InfiniBand is divided into multiple layers, where each layer operates independently of the others.
9.1 Physical Layer
InfiniBand is a comprehensive architecture that defines both electrical and mechanical characteristics for the system. These include cables, receptacles, and copper media; backplane connectors; and hot-swap characteristics. InfiniBand defines three link speeds at the physical layer: 1X, 4X, and 12X. Each individual 1X link is a four-wire serial connection (two wires in each direction) that provides a full-duplex connection at 2.5 Gbps. The physical layer also specifies the hardware components.
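A quick calculation, based only on the figures quoted in this report (2.5 Gbps signalling per 1X lane and 8B/10B coding), reproduces the per-direction data rates of the three link widths:

# Back-of-the-envelope data rates: each 1X lane signals at 2.5 Gbps, 8b/10b
# coding leaves 80% for data, and 4X/12X stripe bytes over multiple lanes.

SIGNALLING_GBPS = 2.5
CODING_EFFICIENCY = 8 / 10            # 8b/10b line coding

for width in (1, 4, 12):
    data_gbps = width * SIGNALLING_GBPS * CODING_EFFICIENCY
    print(f"{width}X link: {data_gbps:.0f} Gbps "
          f"({data_gbps / 8 * 1000:.0f} MB/s per direction)")
# 1X: 2 Gbps (250 MB/s), 4X: 8 Gbps (1000 MB/s), 12X: 24 Gbps (3000 MB/s)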
9.2 Link Layer
The link layer (along with the transport layer) is the heart of the InfiniBand architecture. The link layer encompasses packet layout, point-to-point link operations, and switching within a subnet. At the packet communication level, two packet types are specified: data transfer packets and network management packets. The management packets provide operational control over device enumeration, subnet directing, and fault tolerance. Data packets transfer the actual information, with each packet carrying a maximum of four kilobytes of transaction information. Within each subnet, packet direction and switching properties are directed by a subnet manager using 16-bit local identification addresses.
The link layer also provides the Quality of Service characteristics of InfiniBand. The primary consideration is the use of the Virtual Lane (VL) architecture for interconnectivity. Even though a single IBA data path may be defined at the hardware level, the VL approach allows for 16 logical links. With 15 independent levels (VL0-VL14) and one management path (VL15) available, device-specific prioritization can be configured. Since management requires the highest priority, VL15 retains the maximum priority. The ability to assert a priority-driven architecture lends itself not only to Quality of Service but to performance as well. Credit-based flow control is also used to manage data flow between two point-to-point links. Flow control is handled on a per-VL basis, allowing separate virtual fabrics to maintain communication while utilizing the same physical media.
9.3 Network Layer
The network layer handles routing of packets from one subnet to another (within a subnet, the network layer is not required). Packets sent between subnets contain a Global Route Header (GRH). The GRH contains the 128-bit IPv6 addresses for the source and destination of the packet. The packets are forwarded between subnets through routers based on each device's 64-bit globally unique ID (GUID). The router modifies the LRH with the proper local address within each subnet; therefore, the last router in the path replaces the LID in the LRH with the LID of the destination port. InfiniBand packets do not require the network-layer information and header overhead when used within a single subnet, which is a likely scenario for InfiniBand system area networks.
9.4 Transport Layer
The transport layer is responsible for in-order packet delivery, partitioning, channel multiplexing, and transport services (reliable connection, reliable datagram, unreliable datagram). The transport layer also handles transaction data segmentation when sending and reassembly when receiving. Based on the Maximum Transfer Unit (MTU) of the path, the transport layer divides the data into packets of the proper size. The receiver reassembles the packets based on the Base Transport Header (BTH), which contains the destination queue pair and the packet sequence number. The receiver acknowledges the packets, and the sender receives these acknowledgements and updates the completion queue with the status of the operation. A significant improvement that IBA offers at the transport layer is that all of these functions are implemented in hardware. InfiniBand specifies multiple transport services for data reliability.
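The sketch below illustrates the segmentation and reassembly behaviour described above under simplified assumptions; the Packet fields stand in for the destination queue pair and packet sequence number carried in the BTH, not the real header layout.

# Simplified transport-layer segmentation/reassembly: a message is cut into
# MTU-sized packets carrying a sequence number, and the receiver reassembles
# them in order. Field names are illustrative only.

from dataclasses import dataclass

@dataclass
class Packet:
    dest_qp: int      # destination queue pair
    psn: int          # packet sequence number
    payload: bytes

def segment(message: bytes, mtu: int, dest_qp: int, first_psn: int = 0):
    return [Packet(dest_qp, first_psn + i, message[off:off + mtu])
            for i, off in enumerate(range(0, len(message), mtu))]

def reassemble(packets):
    return b"".join(p.payload for p in sorted(packets, key=lambda p: p.psn))

msg = b"x" * 5000
pkts = segment(msg, mtu=2048, dest_qp=7)      # 3 packets on a 2 KB MTU path
assert reassemble(pkts) == msg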
10 COMMUNICATION SERVICES
IBA provides several different types of communication services between endnodes:
Reliable Connection (RC): a connection is established between endnodes, and messages are reliably sent between them. This is optional for TCAs (devices) but mandatory for HCAs (hosts).
(Unreliable) Datagram (UD): a single-packet message can be sent to an endnode without first establishing a connection; transmission is not guaranteed.
Unreliable Connection (UC): a connection is established between endnodes and messages are sent, but transmission is not guaranteed. This is optional.
Reliable Datagram (RD): a single-packet message can be reliably sent to any endnode without a one-to-one connection. This is optional.
Raw Datagrams (Raw IPv6 Datagram and Raw EtherType Datagram, both optional): a single-packet unreliable datagram service with all but local transport header information stripped off. This allows packets using non-IBA transport layers to traverse an IBA network, for example by routers and network interfaces transferring packets to other media with minimal modification.
Here, "reliably sent" means the data is, barring catastrophic failure, guaranteed to arrive in order, checked for correctness, with its receipt acknowledged. Each packet, even those for unreliable datagrams, contains two separate CRCs: one covering data that cannot change (the constant CRC) and one that must be recomputed (the V-CRC) because it covers data that can change; such change can occur only when a packet moves from one IBA subnet to another. These services are designed for hardware implementation, as required by a high-performance I/O system. In addition, the host-side functions have been designed to allow all service types to be used completely in user mode, without necessarily using any operating system services, with RDMA moving data directly into or out of the memory of an endnode. This user-mode operation implies that virtual addressing must be supported by the channel adapters, since real addresses are unavailable in user mode. In addition to RDMA, the reliable communication classes also optionally support atomic operations directly against endnode memory. The atomic operations supported are Fetch-and-Add and Compare-and-Swap, both on 64-bit data. Atomics are effectively a variation on RDMA: a combined write and read RDMA, carrying the data.
10.1 Communication Stack: InfiniBand Support for the Virtual Interface Architecture (VIA)

The Virtual Interface Architecture is a distributed messaging technology that is both hardware independent and compatible with current network interconnects. The architecture provides an API that can be used to provide high-speed, low-latency communication between peers in clustered applications. InfiniBand was developed with the VIA architecture in mind. InfiniBand offloads traffic control from the software client through the use of execution queues. These queues, called work queues, are initiated by the client and then left for InfiniBand to manage. For each communication channel between devices, a work queue pair (WQP, a send and a receive queue) is assigned at each end. The client places a transaction into the work queue as a work queue entry (WQE), which is then processed by the channel adapter and sent out to the remote device. When the remote device responds, the channel adapter returns status to the client through a completion queue or an event. The client can post multiple WQEs, and the channel adapter's hardware handles each of the communication requests. The channel adapter then generates a completion queue entry (CQE) to provide status for each WQE in the proper prioritized order. This allows the client to continue with other activities while the transactions are being processed.
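The following toy model illustrates the work-queue interaction described above (post a WQE, let the adapter process it, poll a CQE); it is a conceptual sketch, not a real channel adapter or verbs API.

# Conceptual model of the queue-based interaction: the client posts work
# queue entries (WQEs) to a send queue and later reaps completion queue
# entries (CQEs), while the channel adapter performs the actual transfers.

from collections import deque

class ChannelAdapterSim:
    def __init__(self):
        self.send_queue = deque()        # one half of a work queue pair
        self.completion_queue = deque()

    def post_send(self, wqe):
        self.send_queue.append(wqe)      # client returns immediately

    def process(self):
        # Real hardware drains the queue asynchronously; here we simulate it.
        while self.send_queue:
            wqe = self.send_queue.popleft()
            self.completion_queue.append({"wqe": wqe, "status": "success"})

    def poll_cq(self):
        return self.completion_queue.popleft() if self.completion_queue else None

ca = ChannelAdapterSim()
ca.post_send({"op": "RDMA_WRITE", "remote_addr": 0x1000, "length": 4096})
ca.process()
print(ca.poll_cq()["status"])            # -> success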
Figure 11: InfiniBand Protocol Stack
11 INFINIBAND FABRIC VERSUS SHARED BUS

The switched-fabric architecture of InfiniBand is designed around a completely different approach compared with the limited capabilities of the shared bus. IBA specifies a point-to-point (PTP) communication protocol for primary connectivity. Being based upon PTP, each link along the fabric terminates at exactly one connection point (or device). The underlying transport addressing standard is derived from the IP method employed by advanced networks: each InfiniBand device is assigned an IP address, so the load management and signal termination characteristics are clearly defined and more efficient. To add more TCA connection points or endnodes, the simple addition of a dedicated IBA switch is all that is required. Unlike the shared bus, each TCA and IBA switch can be interconnected via multiple data paths in order to sustain maximum aggregate device bandwidth and to provide fault tolerance by way of multiple redundant connections.
12 INFINIBRIDGE
InfiniBridge is effective for implementing HCAs, TCAs, or standalone switches with very few external components. The device's channel adapter side has a standard 64-bit-wide PCI interface operating at 66 MHz that enables operation with a variety of standard I/O controllers, motherboards, and backplanes. The device's InfiniBand side is an advanced switch architecture that is configurable as eight 1X ports, two 4X ports, or a mix of each. Industry-standard external serializer/deserializers interface the switch ports to InfiniBand-supported media (printed circuit board traces, copper cable connectors, or fiber transceiver modules). No external memory is required for switching or channel adapter functions. An embedded processor initializes the IC on reset and executes subnet management agent functions in firmware; an I2C EPROM holds the boot configuration.
InfiniBridge also effectively implements managed or unmanaged switch applications. The PCI or CPU interface can connect external controllers running InfiniBand management software, or an unmanaged switch design can eliminate the processor connection for applications with low area and part count. Appropriate configuration of the ports can implement 4X-to-four-1X aggregation switches. The InfiniBridge switching architecture implements these advanced features of the InfiniBand architecture: standard InfiniBand packets up to an MTU size of 4 Kbytes, eight virtual lanes plus one management lane, 16K unicast local identifiers (LIDs), 1K multicast LIDs, VCRC and ICRC integrity checks, and 4X-to-1X link aggregation.
12.1 Hardware Transport Performance of InfiniBridge

Hardware transport is probably the most significant feature InfiniBand offers to next-generation data center and telecommunications equipment. Hardware transport performance is primarily a measurement of CPU utilization during a period of a device's maximum wire-speed throughput; the lowest CPU utilization is desired. The following test setup was used to evaluate InfiniBridge hardware transport: two 800-MHz Pentium III servers with InfiniBridge 64-bit/66-MHz PCI channel adapter cards running Red Hat Linux 7.1, a 1X InfiniBand link between the two server channel adapters, an InfiniBand protocol analyzer inserted in the link, and an embedded storage protocol running over the link.
The achieved wire speed was 1.89 Gbps in both directions simultaneously, which is 94 percent of the maximum possible bandwidth of a 1X link (2.5 Gbps minus 8b/10b encoding overhead, or 2 Gbps). During this time, the driver used an average of 6.9 percent of the CPU. The bidirectional traffic also traverses the PCI bus, which has a unidirectional upper limit of 4.224 Gbps. Although the InfiniBridge DMA engine can efficiently send burst packet data across the PCI bus, PCI is likely the limiting factor in this test case.
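Reproducing the arithmetic behind the quoted efficiency figure:

# The numbers quoted above, as a quick check: a 1X link carries 2.5 Gbps on
# the wire, 8b/10b coding leaves 2 Gbps for data, and the test achieved
# 1.89 Gbps per direction.

wire_gbps = 2.5
data_gbps = wire_gbps * 8 / 10          # 2.0 Gbps usable
achieved_gbps = 1.89
print(f"link efficiency: {achieved_gbps / data_gbps:.0%}")   # ~94%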
13 INFINIBRIDGE CHANNEL ADAPTER ARCHITECTURE

The InfiniBridge channel adapter architecture has two blocks, each having independent ports to the switch fabric, as the figure shows. One block uses a direct memory access (DMA) engine interface to the PCI bus, and the other uses PCI target and PCI master interfaces. This provides flexibility in the use of the PCI bus and enables implementation of the InfiniPCI feature. This unique feature lets the transport hardware automatically translate PCI transactions into InfiniBand packets, thus enabling transparent PCI-to-PCI bridging over the InfiniBand fabric. Both blocks include hardware transport engines that implement the InfiniBand features of reliable connection, unreliable datagram, raw datagram, RDMA reads/writes, message sizes up to 2 Kbytes, and eight virtual lanes.
The PCI target includes address base/limit hardware to claim PCI transactions in segments of the PCI address space. Each segment can be associated with a standard InfiniBand channel in the PCI-target transport engine. This association lets claimed transactions be translated into InfiniBand packets that go out over the corresponding channel. In the reverse direction, the PCI master also has segment hardware that lets a channel automatically translate InfiniBand packet payload into PCI transactions generated onto the PCI bus. This flexible segment capability and channel association enables the construction of transparent PCI bridges over the InfiniBand fabric.
The DMA interface can move data directly between local memory and InfiniBand channels. This process uses execution queues containing linked lists of descriptors that one of multiple DMA execution engines will execute. Each descriptor can contain a multi-entry scatter-gather list, and each engine can use this list to gather data from multiple locations in local memory and combine it into a single message to send into an InfiniBand channel. Similarly, the engines can scatter data received from an InfiniBand channel to local memory.
Figure 13: InfiniBridge Channel Adapter Architecture
14 VIRTUAL OUTPUT QUEUEING ARCHITECTURE

Figure 14: Virtual output-queuing architecture

InfiniBridge uses an advanced virtual output queuing (VOQ) and cut-through switching architecture to implement these features with low latency and nonblocking performance. Each port has a VOQ buffer, transmit-scheduling logic, and packet-decoding logic. Incoming data goes to both the VOQ buffer and the packet-decoding logic. The decoder extracts the parameters needed for flow control, scheduling, and forwarding decisions. Processing of the flow-control inputs gives link flow-control credits to the local transmit port, limiting output packets based on available credits. InfiniBridge decodes the destination local identification from the packet and uses it to index the forwarding database and retrieve the destination port number. The switch fabric uses the destination port number to decide which port to send the scheduling information to. The decoder also extracts the service level identification field from the input packet and uses it to determine the virtual lane, which is passed to the destination port's transmit-scheduling logic. All parameter decoding takes place in real time and is given to the switch fabric to make scheduling requests as soon as the information is available.
The packet data is stored only once in the VOQ. The transmit-scheduling logic of each port arbitrates the order of output packets and pulls them from the correct VOQ buffer. Each port logic module is actually part of a distributed scheduling architecture that maintains the status of all output ports and receives all scheduling requests. In cut-through mode, a port scheduler receives notification of an incoming packet as soon as the local identification for that packet's destination is decoded. Once the port scheduler receives the virtual lane and other scheduling information, it schedules the packet for output. This transmission can start immediately, based on the priority of waiting packets and the flow-control credits for the packet's virtual lane. The switch fabric actually includes three on-chip ports in addition to the eight external ones, as the figure shows. One port is a management port that connects to the internal RISC processor, which handles management packets and exceptions. The other two ports interface with the channel adapter.
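The following toy model illustrates the virtual output queuing idea (one queue per input/output pair, so a busy output does not block traffic headed to other outputs); it ignores credits, service levels, and cut-through timing.

# Toy virtual output queuing model: each input port keeps one queue per
# output port, so a packet blocked toward a busy output never holds up
# packets destined elsewhere.

from collections import defaultdict, deque

class VOQSwitch:
    def __init__(self, num_ports):
        # voq[(input_port, output_port)] -> queue of packets
        self.voq = defaultdict(deque)
        self.num_ports = num_ports

    def enqueue(self, input_port, packet):
        # A real decoder would look up the destination LID in the forwarding
        # database; here the packet simply carries its output port.
        self.voq[(input_port, packet["out"])].append(packet)

    def schedule(self, output_port):
        # Pull the oldest waiting packet destined to this output, if any.
        for in_port in range(self.num_ports):
            q = self.voq[(in_port, output_port)]
            if q:
                return q.popleft()
        return None

sw = VOQSwitch(num_ports=8)
sw.enqueue(0, {"out": 3, "payload": b"hello"})
print(sw.schedule(3)["payload"])   # -> b'hello'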
15 FORMAL MODEL TO MANAGE INFINIBAND ARBITRATION TABLES TO PROVIDE QUALITY OF SERVICE (QoS)

The InfiniBand Architecture (IBA) has been proposed as an industry standard both for communication between processing nodes and I/O devices and for interprocessor communication. It replaces the traditional bus-based interconnect with a switch-based network for connecting processing nodes and I/O devices. It is being developed by the InfiniBand Trade Association (IBTA) with the aim of providing the levels of reliability, availability, performance, scalability, and quality of service (QoS) required by present and future server systems. For this purpose, IBA provides a series of mechanisms that are able to guarantee QoS to applications. It is therefore important for InfiniBand to satisfy both the applications that only need minimum latency and those applications that need other characteristics to meet their QoS requirements. These mechanisms are mainly the segregation of traffic into categories and the arbitration of the output ports according to an arbitration table that can be configured to give priority to the packets with higher QoS requirements.
15.1 THREE MECHANISMS TO PROVIDE QoS
Basically, IBA has three mechanisms to support QoS: service levels, virtual lanes, and virtual lane arbitration.
15.1.1 Service Level
IBA defines up to 16 service levels (SLs), although the specification does not fix the meaning of each one. Each packet is marked with the service level it requires, so traffic can be grouped into categories and given differentiated treatment at each link along its path, which makes a higher degree of QoS provision possible.
15.1.2 Virtual Lanes
IBA ports support virtual lanes (VLs), providing a mechanism for creating multiple virtual links within a single physical link. A VL is an independent set of receive and transmit buffers associated with a port. Each VL must be an independent resource for flow-control purposes. IBA ports must support a minimum of two and a maximum of 16 virtual lanes (VL0 ... VL15). All ports support VL15, which is reserved exclusively for subnet management and must always have priority over data traffic in the other VLs. Since systems can be constructed with switches supporting different numbers of VLs, the number of VLs used by a port is configured by the subnet manager. Also, packets are marked with a service level (SL), and a relation between SL and VL is established at the input of each link with the SL-to-VL mapping table. When more than two VLs are implemented, an arbitration mechanism is used to allow an output port to select which virtual lane to transmit from. This arbitration applies only to data VLs, because VL15, which transports control traffic, always has priority over any other VL. The priorities of the data lanes are defined by the VL arbitration table.
15.1.3 Virtual Arbitration Table
When more than two VLs are implemented, the VL arbitration table defines the priorities of the data lanes. Each VL arbitration table consists of two tables: one for delivering packets from high-priority VLs and another for low-priority VLs. Up to 64 table entries are cycled through, each one specifying a VL and a weight. The weight is the number of units of 64 bytes to be sent from that VL; it must be in the range 0 to 255 and is always rounded up in order to transmit a whole packet. In addition, a Limit of High Priority value specifies the maximum amount of high-priority traffic that can be sent before a low-priority packet is sent. More specifically, the VLs of the high-priority table can transmit Limit of High Priority x 4096 bytes before a packet from the low-priority table may be transmitted. If no high-priority packets are ready for transmission at a given time, low-priority packets can also be transmitted.
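As an illustration of table-based arbitration, the sketch below performs weighted round-robin over a single arbitration table; the high/low-priority split and the Limit of High Priority counter are deliberately omitted, and the packet size and weights are made up for the example.

# Weighted round-robin over one VL arbitration table. Each entry is
# (vl, weight); the weight is spent in units of 64 bytes, and a whole packet
# is charged even if it slightly overruns the remaining weight.

def arbitrate(table, pending, n_packets, packet_bytes=2048):
    """table: list of (vl, weight); pending: set of VLs with queued packets.
    Returns the VL order of the next n_packets transmissions.
    Assumes at least one table entry refers to a pending VL."""
    order, entry = [], 0
    while len(order) < n_packets:
        vl, weight = table[entry % len(table)]
        entry += 1
        if weight == 0 or vl not in pending:
            continue
        budget = weight * 64                      # weight is in 64-byte units
        while budget > 0 and len(order) < n_packets:
            order.append(vl)
            budget -= packet_bytes                # whole packet charged (rounded up)
    return order

table = [(0, 128), (1, 32), (2, 64)]              # weights set the bandwidth shares
print(arbitrate(table, pending={0, 1, 2}, n_packets=8))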
Figure 15: Virtual Lanes
16 FORMAL MODEL FOR THE INFINIBAND ARBITRATION TABLE

This section presents a formal model to manage the IBA arbitration table, together with several algorithms that adapt the model for use in a dynamic scenario where new requests and releases are made. In particular, a concrete algorithm is proposed to find a sequence of free entries able to accommodate a connection request in the table. The treatment of the problem basically consists of setting out an efficient algorithm able to select a sequence of free entries in the arbitration table; these entries must be selected with a maximum separation between any consecutive pair. To develop this algorithm, some hypotheses and definitions are first proposed to establish the correct frame, and the algorithm and its associated theorems are presented later. Some specific characteristics of IBA are taken into account: the number of table entries (64) and the range of the weights (0 ... 255). All we need to know is that the requests are originated by the connections so that certain requirements are guaranteed. Besides, the group of entries assigned to a request belongs to the arbitration table associated with the output ports and interfaces of the InfiniBand switches and hosts, respectively.

Figure 16: Virtual Arbitration Table

We formally define the following concepts:
Table: a circular list of 64 entries.
Entry: each one of the 64 parts composing a table.
Weight: the numerical value of an entry in the table, which can vary between 0 and 255.
Status of an entry: the situation of an entry in the table; it can be free (weight 0) or occupied (weight greater than 0).
Request: a demand for a certain number of entries.
Distance: the maximum separation between two consecutive entries in the table that are assigned to one request.
Type of request: each one of the different types into which the requests can be grouped, based on the requested distances and, hence, on the requested number of entries.
Group or sequence of entries: a set of entries of the table with a fixed distance between any consecutive pair. In order to characterize a sequence of entries, it is enough to give the first entry and the distance between consecutive entries.
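Under these definitions, a minimal sketch of the basic feasibility check (finding a sequence of free entries at a given distance d) could look as follows; the refinement that also maximizes the separations left free for later requests is not shown.

# A request of distance d (a power of two between 1 and 64) needs the 64/d
# entries {s, s+d, s+2d, ...} of the circular table to be free for some
# start s. Because the table is circular, starts 0..d-1 cover all candidate
# sequences.

TABLE_SIZE = 64

def find_sequence(weights, d):
    """weights[i] == 0 means entry i is free; return the entries or None."""
    for start in range(d):
        entries = range(start, TABLE_SIZE, d)
        if all(weights[e] == 0 for e in entries):
            return list(entries)
    return None

table = [0] * TABLE_SIZE
table[0] = 10                                   # entry 0 already occupied
print(find_sequence(table, d=16))               # -> [1, 17, 33, 49]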
16.0.4 Initial Hypothesis
In what follows, and when not indicated to the contrary, the following hypotheses will be considered:
1. There are no request eliminations, so the table is filled in as new requests are received and these requests are never removed. In other words, entries can change from a free status to an occupied status, but it is not possible for an occupied entry to become free. This hypothesis permits a simpler and clearer initial study, but it will logically be discarded later on.
2. It may be necessary to devote more than one group of entries to a set of requests of the same type.
3. The total weight associated with one request is distributed among the entries of the selected sequence so that the weight of the first entry of the sequence is always larger than or equal to the weight of the other entries of the sequence.
4. The distance d associated with one request will always be a power of 2, and it must be between 1 and 64. These powers of 2 define the different types of requests that we are going to consider.

Figure 17: Structure of a VL Arbitration Table
17 FILLING IN THE VL ARBITRATION TABLE

The classification of traffic into categories based on its QoS requirements is just a first step toward providing QoS; a suitable filling in of the arbitration table is also critical. We therefore propose a strategy to fill in the weights of the arbitration tables. In this section, we see how to fill in the table in order to provide the bandwidth requested by each application and, on that basis, how to provide latency guarantees. Each arbitration table has only 64 entries, so if a different entry were devoted to each connection, the number of connections that could be accepted would be limited; moreover, a connection requiring very high bandwidth could also need slots in more than one entry of the table. For that reason, we propose grouping the connections with the same SL into a single entry of the table until the maximum weight for that entry is reached, before moving to another free entry. In this way, the number of entries in the table is not a limitation for the acceptance of new connections; only the available bandwidth is. Each set contains the entries needed to meet a request of a certain distance, and the first of these sets having all of its entries free is selected. The order in which the sets are examined aims to maximize the distance between two consecutive free entries that would remain in the table after the selection is carried out. This way, the table remains in the optimum condition to later meet the most restrictive possible request, in particular a new request of maximum distance d = 2^i.
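A simplified sketch of this filling-in strategy, assuming one weight value per connection and ignoring the latency-driven spacing of entries:

# Connections with the same service level share one table entry until its
# weight reaches the 255 maximum; only then is a new free entry consumed.

MAX_WEIGHT = 255

def add_connection(table, sl_of_entry, sl, weight):
    """table: list of 64 weights; sl_of_entry: SL assigned to each entry."""
    for i, (w, entry_sl) in enumerate(zip(table, sl_of_entry)):
        if entry_sl == sl and w + weight <= MAX_WEIGHT:
            table[i] = w + weight                 # pack into an existing entry
            return i
    for i, w in enumerate(table):
        if w == 0:                                # otherwise take a free entry
            table[i], sl_of_entry[i] = weight, sl
            return i
    return None                                   # table full: reject connection

table, sls = [0] * 64, [None] * 64
print(add_connection(table, sls, sl=2, weight=100))   # -> 0 (new entry)
print(add_connection(table, sls, sl=2, weight=100))   # -> 0 (same entry reused)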
17.1 Insertion and elimination in the table
The elimination of requests is now possible. As a consequence, the entries used by an eliminated request are released and become free again. Considering the filling-in algorithm, after several insertions and eliminations the free entries may no longer be correctly separated; this situation is handled by the algorithms presented below.
17.1.1 Example 1
Suppose the table is full and two requests of type d = 8 are eliminated. These requests were made using the entries of the sets specified in the tree. This means that the table now has free entries, and so those entries can be reused to meet new requests.
17.2 Defragmentation Algorithm
The basic idea of this algorithm is to group the free entries of the table into free sets that permit meeting any request needing a number of entries equal to or lower than the number of available table entries. Thus, the objective of the algorithm is to perform a grouping of the free entries: a process that consists of joining the entries of two free sets of the same size into a single larger free set. This joining is effective only if the two free sets do not already belong to the same larger free set; therefore, the algorithm is restricted to singular sets. The goal is to obtain a free set of the biggest possible size, so that a request of this size can be met when the table has enough free entries which, however, belong to two smaller free sets that individually cannot meet that request.
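Assuming free sets are described by their first entry and their distance, the joining step could be sketched as follows; the condition used here (two sets of the same distance whose starts differ by half that distance interleave into one finer set) is an interpretation of the model, not text taken from the source.

# Joining two free sets of the same size: sets of distance d starting at s
# and s + d/2 interleave into one free set of distance d/2 (twice as many
# entries). Bookkeeping of which sets are "singular" is left out.

def join_free_sets(set_a, set_b):
    """Each set is (start, distance); return the merged set or None."""
    (sa, da), (sb, db) = sorted([set_a, set_b])
    if da == db and sb - sa == da // 2:      # complementary halves of a finer set
        return (sa, da // 2)
    return None

print(join_free_sets((1, 16), (9, 16)))      # -> (1, 8)
print(join_free_sets((1, 16), (5, 16)))      # -> None (not complementary)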
17.3 Reordering Algorithm
The reordering algorithm is basically an ordering algorithm, but applied at the level of sets. It has been designed to be applied to a table that is not ordered, with the purpose of leaving the table ordered; an ordered table ensures the proper handling of requests.
17.4 Global management of the table
For the global management of the table, with both insertions and releases, a combination of the filling-in and defragmentation algorithms (and even the reordering algorithm, if needed) must be used. This global management guarantees that the table always remains in a correct state, so that the propositions underlying the filling-in algorithm continue to hold; in this way the overall management of the arbitration table is achieved.
18 CONCLUSION
InfiniBand is a powerful new architecture designed to support I/O connectivity for the Internet infrastructure. It is supported by all major OEM server vendors as a means to expand and create the next-generation I/O interconnect standard in servers. IBA enables Quality of Service (QoS) through three mechanisms: service levels, virtual lanes, and table-based arbitration of virtual lanes. A formal model exists to manage the InfiniBand arbitration tables so as to provide QoS; according to this model, each application needs a sequence of entries in the IBA arbitration tables based on its requirements, which relate to the mean bandwidth needed and the maximum latency tolerated by the application. InfiniBand provides a comprehensive silicon, software, and system solution built on a layered protocol and a well-defined management infrastructure.
Mellanox and related companies are now positioned to release InfiniBand as a multifaceted architecture within several market segments. The most notable application area is enterprise-class network clusters and Internet data centers; these types of applications require extreme performance with the maximum in fault tolerance and reliability. Other uses include Internet service providers, colocation hosting, and large corporate networks. At least for its introduction, InfiniBand is positioned as a complementary architecture: IBA will move through a transitional period where future PCI, IBA, and other interconnect standards can be offered within the same system or network. The limitations of PCI (even PCI-X) should allow InfiniBand to be an aggressive market contender as higher-class systems move toward IBA devices. Currently, Mellanox is developing the IBA software interface standard using Linux as its internal OS choice. Another key concern is the cost of implementing InfiniBand at the consumer level; industry sources currently project IBA prices to fall somewhere between the currently available Gigabit Ethernet and Fibre Channel technologies. InfiniBand could be positioned as the dominant I/O connectivity architecture at all upper-tier levels, providing the top level of Quality of Service (QoS), which can be implemented by the various methods discussed. This is definitely a technology to watch and one that can provide a competitive market.
References

[1] Chris Eddington. "InfiniBridge: An InfiniBand Channel Adapter with Integrated Switch." IEEE Micro, pages 492-524, March-April 2006.
[2] J. L. Sanchez, M. Menduia, J. Duato, and F. J. Alfaro. "A Formal Model to Manage the InfiniBand Arbitration Tables Providing QoS." IEEE Transactions on Computers, pages 1024-1039, August 2007.
[3] Cisco Collection Library. Understanding InfiniBand. Cisco Public Information, second edition, 2006.