THE VIRTUAL INTERFACE ARCHITECTURE

Dave Dunning, Greg Regnier, Gary McAlpine, Don Cameron, Bill Shubert, Frank Berry, Anne Marie Merritt, Ed Gronke, Chris Dodd
Intel Corporation

This protected, zero-copy, user-level network interface architecture reduces the system overhead for sending and receiving messages between high-performance CPU/memory subsystems and networks to less than 10 microseconds when integrated in silicon.

Network bandwidths are increasing, and latencies through these networks are decreasing. Unfortunately, applications have not been able to take full advantage of these performance improvements because of the many layers of user-level and kernel-level software required to get to and from the network. To address this problem, Intel Corporation, Compaq Computer Corporation, and Microsoft Corporation jointly authored the VI Architecture (VIA) specification. VIA significantly reduces the software overhead between a high-performance CPU/memory subsystem and a high-performance network. Access http://www.viarch.org/ for a copy of the specification. VIA defines a set of functions, data structures, and associated semantics for moving data into and out of a process's memory. It achieves low-latency, high-bandwidth communication and data exchange between processes running on two nodes within a computing cluster, with minimal CPU usage. VIA gives a user process direct access to the network interface, avoiding intermediate copies of data and bypassing the operating system in a fully protected fashion. Avoiding interrupts and context switches whenever possible minimizes CPU usage. This article presents the mechanisms that support protected, zero-copy, user-level access and performance data from two related implementations.
VIA attacks the problem of the relatively low achievable performance of interprocess communication (IPC) within a cluster. (Cluster computing consists of short-distance, low-latency, high-bandwidth IPCs between multiple building blocks. Cluster building blocks include servers, workstations, and I/O subsystems, all of which connect directly to a network.) IPC performance depends on the software overhead to send and receive messages and/or data and on the time required to move a message or data across the network. The number of software layers traversed and the number of interrupts, context switches, and data copies incurred when crossing those boundaries contribute to the software overhead. Note that the time to traverse the software stack layers and the time to field interrupts and complete context switches do not depend on the number of bytes being sent or received. Only the time to complete the data copies depends on the number of bytes being moved. Clearly, the time to move a message or data across the network depends on the number of bytes moved.

Faster processors that execute protocol layers in less time are the primary means of reducing software overhead. To facilitate the increase in clock frequency, designers have used deeper pipelines, intelligent branch prediction algorithms in hardware, more on-board registers, and larger, faster caches. These innovations lead to executing code paths faster. However, they also lead to a larger penalty (usually measured in clock cycles) paid when interrupting the code sequence, switching to another process with its associated context, and flushing a portion of the instruction cache. Therefore, an increase in processor clock frequency does not necessarily lead to a proportional reduction in the software overhead of each message.

With the introduction of technologies such as fast Ethernet and OC-3 ATM, network bandwidths have increased from 10 Mbps to 100-Mbps and 150-Mbps rates, with roadmaps to 1 Gbps to 2 Gbps. These impressive increases in bandwidth have reduced the time to complete bulk data transfers. However, a common misconception is that the data rate of the network's physical layer is a good metric to assess communication performance. It must be coupled with the software overhead associated with the communication traffic as well as the traffic pattern. See Martin et al.1 for a more comprehensive discussion of the relationships between latency, overhead, and bandwidth in a cluster environment.
Key design issues and resolutions

The Virtual Interface Architecture is connection oriented; each VI instance (VI) is specifically connected to another VI and thus can only send to and receive from its connected VI. Bailey et al.1 describe the need to dispatch packets quickly and to start processing them just as quickly. The high weighting placed on message-passing (data movement) performance led to the connection architecture. Requiring connections between VIs allows lower latency between source and destination and simpler VIA-compliant implementations. Connecting two VIs removes one level of protection checking from the critical path of sending and receiving packets. The sender has permission to send to the receiver because the kernel agent has already established the connection. The receiver is not required to verify that the sender is a valid source of data. Upon arrival, the data is funneled directly to the VI receive queue, reducing the amount of processing required and therefore the latency to receive a packet.

VIA's connection-oriented nature facilitates a simpler NIC design, which reduces head-of-line blocking. We define head-of-line blocking as a small message getting stuck behind a large message. If VI pairs were not connected, a small message could be queued behind a large message with a different endpoint. To prevent the large message from blocking the small one, the scheduler would then need to check ahead on the VI and begin processing the small message. The small message would likely complete first, but its done status would be blocked until either the large message completed or the NIC reordered the descriptors, neither of which is an appealing design option. Avoiding this blocking is one service that is desirable in NICs. The next level of service is a minimum bandwidth guarantee, a maximum bandwidth not to be exceeded, and a maximum latency not to be exceeded. Although not required in the current revision of VIA, connection-oriented VIs simplify the design of a scheduler that can support these qualities of service.

Reliability guarantees. The reliability attributes of the physical network below VIA are separated into three categories: an unreliable datagram service, a reliable delivery service, and a reliable reception service. Unreliable datagram service guarantees that all data that is delivered was not corrupted during transmission and is delivered only once. There is no guarantee that all data sent is delivered, nor that delivered packets arrive in the order in which they were sent. If an error occurs in the network, the data is discarded, and the connection between the two VIs remains intact.

Reliable delivery service guarantees that data sent is received uncorrupted, only once, and in the order that it was sent. When data has been transferred to a network supporting reliable delivery service, the descriptor can be marked complete. All errors in the network must be detected and reported. Note that most transport errors will be reported in a receive descriptor, not the send descriptor of a given transaction. When an error is detected, the connection on which the error occurred is disconnected.

Reliable reception service guarantees that data sent is received uncorrupted, only once, and in the order that it was sent. It differs from reliable delivery service in that a descriptor cannot be marked complete until the data has been transferred into memory at the remote endpoint. Like the reliable delivery service, all errors in the network must be detected and reported; when an error occurs, the connection on which the error occurred is disconnected. Unlike reliable delivery, with reliable reception a transport error is reported in the send descriptor associated with the transaction, and no further descriptors on that queue of that VI can be completed.

Remote-DMA/read. This is an optional feature in VIA. The only communication operations required to achieve the goals of the architecture are send and receive.2 As the architecture progressed, it became apparent that remote-DMA/write was a very useful operation, allowing a node to "put" data into remote memory without consuming a descriptor. The added complexity of implementations was fairly small, remote-DMA/write being very similar to a send operation. The trade-offs between application performance and design complexity with its associated costs could not be quantitatively evaluated. Therefore, the semantics of remote-DMA/read are defined but not required in VIA.

Pinned memory regions. Performance drove the decision to require that memory regions used for descriptors and data buffers be pinned. We considered methods to allow caching of virtual-to-physical translations, but they would not yield the performance associated with pinning and registering the memory. Nonvariable time for virtual-to-physical address translation is especially critical at the receiving endpoint. We did not consider backing data up onto the network or discarding data at a receiving endpoint to be acceptable.
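The three service levels differ mainly in when a send descriptor may be marked complete and in how errors surface. An illustrative encoding in C (the identifier names are ours, not the specification's):

```c
/* Illustrative encoding of VIA's three reliability levels; the
   identifier names are ours, not the specification's. */
enum via_reliability {
    VIA_UNRELIABLE_DGRAM,   /* delivered data is uncorrupted and arrives
                               at most once; no delivery or ordering
                               guarantee; errors discard data but leave
                               the connection intact */
    VIA_RELIABLE_DELIVERY,  /* uncorrupted, exactly once, in order; the
                               send descriptor completes once data is on
                               the wire; an error disconnects the VIs */
    VIA_RELIABLE_RECEPTION  /* as above, but the send descriptor completes
                               only after data lands in remote memory;
                               transport errors surface in that send
                               descriptor and halt the queue */
};
```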
References
1. M.L. Bailey et al., "PathFinder: A Pattern-Based Packet Classifier," Proc. First Symp. Operating Systems Design and Implementation, Usenix Assoc., Sunset Beach, Calif., Nov. 1994, pp. 115-123.
2. P. Pierce and G. Regnier, "Fast Messages: Efficient, Portable Communication for Workstation Clusters and MPPs," IEEE Concurrency, Vol. 5, No. 2, Apr.-June 1997, pp. 60-72.
Communication traffic

Without knowing the traffic pattern within a system, we cannot assess the relative importance of decreasing software overhead versus increasing bandwidth. If the majority of messages are large, higher bandwidth is of greater importance because the software overhead can be amortized over the entire message, making the average per-byte overhead cost relatively low. Thus, for large message sizes, the network bandwidth is the dominant contributor to message latency. If the majority of messages are small, the average per-byte overhead cost is relatively high, and the dominant contributor to message latency becomes the software overhead. Studies of LAN traffic patterns2,3 show that the typical pattern is bimodal, with over 80% of messages being 200 bytes or smaller and approximately 8% of messages over 8,192 bytes.

The most significant problem confronting communication performance in clusters is the magnitude of software overhead in virtual-memory operating environments when sending and receiving messages. As confirmed by Clark et al.,4 it is not simply one or two of the software layers that must be traversed. To further quantify the problem, we conducted a series of simple tests between two 133-MHz Pentium Pro processor servers. We inserted instrumentation hooks into a version of the Windows NT operating system and collected cycle counts associated with the various software layers and the TCP/IP protocol stack. We recorded measurements for the send and receive paths through the Windows NT protocol stack in a set of trials large enough to ensure reasonable statistical accuracy. Figures 1a and 1b summarize the send and receive results of these experiments. Note that no single layer can be tuned to significantly reduce the latency. Making the common case (a 200-byte message) fast is difficult.

Figure 1. Sending (a) and receiving (b) overhead in two Pentium Pro servers. (Bar charts of percentage of overhead per layer; send path: user to AFD, AFD to TCP, TCP to driver, driver; receive path: ISR to handler, driver, NDIS, TCP, AFD, sockets, AFD to user.)

The following application of Amdahl's law,5 which points out the law of diminishing returns, displays the magnitude of the problem. For a latency budget, we conservatively assumed a 200-MHz processor averaging one clock per instruction, a network with a 1.0-Gbps physical bandwidth, and a 200-byte message. In a balanced system design, the overhead portion of sending a 200-byte message must balance the bandwidth-dependent portion of the message latency (1.6 µs). The overhead comes from executed software instructions and hardware processing. The budget allocated between instructions executed and hardware processing time varies with the implementation, but the upper bound on software overhead is less than 1.6 µs, or fewer than 320 instructions in this simple example.
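The budget arithmetic behind these numbers:

```latex
t_{\mathrm{wire}} = \frac{200\ \mathrm{bytes} \times 8\ \mathrm{bits/byte}}{1.0\ \mathrm{Gbps}} = 1.6\ \mu\mathrm{s},
\qquad
N_{\mathrm{instr}} \le 1.6\ \mu\mathrm{s} \times 200\ \mathrm{MIPS} = 320\ \mathrm{instructions}.
```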
VIA description
We use the following two terms when describing VIA: user and kernel agent. A user is the software layer using the architecture; it could be an application or a communication services layer. The kernel agent is a driver running in protected (kernel) mode. It must set up the necessary tables and structures that allow communication between cooperating processes. It is not in the critical path for data transfers unless specifically requested by a user. We describe its functions throughout this article.

Major tenets. VIA defines and specifies a simple set of operations that moves data between network-connected endpoints with latencies closer to memory operations than the longer latencies of network operations. VIA accomplishes low latency in a message-passing environment by following these rules:

• Eliminate any intermediate copies of the data.
• Avoid traps into the operating system whenever possible to avoid context switches in the CPU as well as cache thrashing.
• Eliminate the need for a driver running in protected kernel mode to multiplex a hardware resource (the network interface) between multiple concurrent processes.
• Minimize the number of instructions a process must execute to initiate data movement.
• Remove the constraint of requiring an interrupt when initiating and/or completing an I/O operation.
• Define a simple set of operations that send and receive data.
• Keep the architecture simple enough to be emulated in software as well as integrated in silicon.
VI instances. VIA presents the illusion to each process that it owns the interface to the network. The construct used to create this illusion is a VI instance. Each VI consists of one send queue and one receive queue, and is owned and maintained by a single process. A process can own many VIs, and many processes may own many VIs that all contain active work to be processed. The kernel can also own VIs. See Figure 2.

Figure 2. VI queues. (Example showing three work queues.)

Each VI queue is formed by a linked list of variable-length descriptors. To add a descriptor to a queue, the user builds the descriptor and posts it onto the tail of the appropriate work queue. That same user pulls completed descriptors off the head of the work queue on which they were posted. Posting a descriptor includes 1) linking the descriptor being posted to the descriptor currently on the tail of the desired work queue and 2) notifying the network interface controller (NIC) that work (an additional descriptor) has been added. Notification consists of writing to a memory-mapped register called a doorbell. Posting a descriptor transfers its ownership from the owning process to the NIC.

The process that owns a VI can post four types of descriptors. Send, remote-DMA/write, and remote-DMA/read descriptors are placed on the send queue of a VI. Receive descriptors are placed on the receive queue of a VI.

Synchronization. VIA provides both polling and blocking mechanisms to synchronize between the user process and completed operations. When descriptor processing completes, the NIC writes a done bit and any error bits associated with that descriptor into its specified fields. This act transfers ownership of the descriptor from the NIC back to the process that originally posted it. The head of each queue of each VI can be polled (as long as the VI is not associated with a completion queue, as discussed later). Polling checks the descriptor at the head of the work queue to see if it has been marked complete. If it is complete, the function call removes the descriptor from the head of the queue and returns the descriptor's address to the calling process. Otherwise it returns a unique status, and the head of the queue does not change.

The user may also use a blocking call to check the head of the queue for a completed descriptor. If it is complete, the function call removes the descriptor from the head of the queue and returns the address of the descriptor to the calling process. Otherwise the function call requests the operating system to remove the process from the active run list until the descriptor on the head of the queue completes. Then an interrupt will be generated. VIA supports both an interrupt to awaken the process as well as a callback with an associated handler, through two different blocking calls.
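As a concrete illustration of these mechanics, the following sketch shows how posting and polling might look in C. All structure layouts, field names, and the done-bit value are our own illustrative assumptions; VIA leaves the concrete formats to the implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative layouts only; VIA leaves the concrete formats to the
   implementation. */
struct vi_desc {
    struct vi_desc *next;        /* virtual address of next descriptor */
    uint32_t flags;              /* descriptor type, queue fence, etc. */
    uint32_t status;             /* null when posted; done/error bits
                                    written by the NIC on completion */
};

#define VI_DONE 0x1u             /* hypothetical done bit */

struct vi_queue {
    struct vi_desc *head, *tail; /* linked list of posted descriptors */
    volatile uint32_t *doorbell; /* memory-mapped NIC register */
};

/* Post: link the descriptor onto the tail of the work queue, then ring
   the doorbell. Ownership passes to the NIC until it sets the done bit. */
static void vi_post(struct vi_queue *q, struct vi_desc *d)
{
    d->next = NULL;
    d->status = 0;
    if (q->tail)
        q->tail->next = d;
    else
        q->head = d;
    q->tail = d;
    *q->doorbell = 1;            /* notify the NIC of one more descriptor */
}

/* Poll: dequeue and return the head descriptor if the NIC has marked it
   done; otherwise return NULL and leave the queue unchanged. */
static struct vi_desc *vi_poll(struct vi_queue *q)
{
    struct vi_desc *d = q->head;
    if (d == NULL || !(d->status & VI_DONE))
        return NULL;
    q->head = d->next;
    if (q->head == NULL)
        q->tail = NULL;
    return d;                    /* ownership is back with the process */
}
```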
Figure 3. Example of one completion queue.
Completion queues. These queues (see Figure 3) are an additional construct that allows the coalescing of completion notifications from multiple work queues into a single queue. The two work queues of one VI can be associated with completion queues independently of one another. The work queues of the same VI can be associated with different completion queues; it is possible to associate only one work queue of a VI with a completion queue.
Figure 4. Descriptor formats.
When a descriptor on a VI that is associated with a completion queue completes, the NIC marks the descriptor done and places a pointer to that descriptor on the tail of the associated completion queue. If a VI work queue is associated with a completion queue, synchronization on completed descriptors must occur either by polling or by waiting on the completion queue, not on the work queue itself.

Descriptors. These constructs describe work to be done by the network interface. Send and receive descriptors contain one control segment and a variable number of data segments. Remote-DMA/write and remote-DMA/read descriptors contain one additional address segment following the control segment and preceding the data segment(s). The control segment contains the descriptor type, the immediate data if present, a queue fence if present, the number of segments that follow the control segment, the virtual address of the next descriptor on the queue and its associated memory handle, the status of the descriptor (initially null when posted and filled in by the NIC upon completion), and the total length (in bytes) of the data to be moved by the descriptor. The address segment contains the remote memory region's virtual address and associated memory handle. Each data segment describes a memory region and contains the local memory region's virtual address, its associated memory handle, and the length (in bytes) of data associated with that region of memory.

For remote-DMA/write descriptors, the address/memory handle pair in the address segment is the starting location of the region where the data is placed. Though only one remote memory region address is supported per descriptor, a user can specify many local memory regions in each remote-DMA/write descriptor. Therefore it is possible to gather data but not possible to scatter data in one remote-DMA/write descriptor. For remote-DMA/read descriptors, the address/memory handle pair in the address segment is the starting location of the region from which the data is read. Again, although only one remote memory region address is supported per descriptor, a user can specify many local memory regions in each remote-DMA/read descriptor. Therefore it is possible to scatter data but not possible to gather data in one remote-DMA/read descriptor. Figure 4 illustrates the format of the descriptors.
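In C, the segments of Figure 4 might be declared roughly as follows. The field widths and ordering here are our guesses for illustration; the specification fixes the actual encoding.

```c
#include <stdint.h>

/* Control segment: present in every descriptor type. */
struct vi_ctrl_seg {
    uint64_t next_va;          /* virtual address of the next descriptor */
    uint32_t next_handle;      /* memory handle for that address */
    uint32_t flags;            /* descriptor type, immediate-data flag,
                                  queue fence */
    uint32_t immediate;        /* 32 bits of immediate data, if present */
    uint16_t num_segs;         /* count of segments that follow */
    uint16_t status;           /* null when posted; done/error bits set
                                  by the NIC upon completion */
    uint32_t total_len;        /* total bytes moved by this descriptor */
};

/* Address segment: remote-DMA/write and remote-DMA/read descriptors
   only; names the single remote memory region. */
struct vi_addr_seg {
    uint64_t remote_va;        /* starting virtual address at the remote */
    uint32_t remote_handle;    /* memory handle of the remote region */
};

/* Data segment: one per local memory region, allowing gather (send,
   remote-DMA/write) or scatter (receive, remote-DMA/read). */
struct vi_data_seg {
    uint64_t local_va;         /* local buffer virtual address */
    uint32_t local_handle;     /* memory handle for that region */
    uint32_t len;              /* bytes in this region */
};
```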
Immediate data. VIA permits 32 bits of immediate data to be specified in each descriptor. When immediate data is present in a send descriptor, the NIC moves the field directly into the receive descriptor. The presence of immediate data in a remote-DMA/write descriptor causes a receive descriptor to be consumed at the remote VI, with the immediate data field transferred into that consumed receive descriptor. The presence of immediate data in a remote-DMA/read descriptor is benign; the 32-bit immediate data field can be written into and will remain unchanged when the operation is complete, with no descriptor being consumed at the remote VI. Potential uses for immediate data include synchronization of remote-DMA/writes at the remote VI as well as sequence numbers used by a software layer using that VI.

Work queue ordering. VIA maintains ordering and data consistency rules within one VI but not between different VIs. Descriptors posted on a VI are processed in FIFO order. VIA doesn't specify the order of completion for descriptors posted on different VIs; the order depends on the scheduling algorithm implemented. This rule is easily maintained with sends and remote-DMA/writes because sends behave like remote-DMA/writes without a remote address segment. The receive descriptor on the head of the queue determines the remote address(es). Remote-DMA/reads make it more difficult to balance data consistency with keeping the data pipelines full. Remote-DMA/reads are round-trip transactions and are not complete until the requested data is returned from the remote node/endpoint to the initiating node/endpoint. The VI-NIC designer must decide on an implementation that trades off performance against complexity. Two possible implementations follow.

One implementation would stop processing descriptors on a send queue when a remote-DMA/read descriptor is encountered, until the read is complete. With many active VIs and a well-designed scheduler, the NIC can continue to make progress by multiplexing between VIs. With only a few active VIs, the performance to or from the network(s) may drop due to the inability to pipeline the execution of multiple descriptors from one VI.

An alternative implementation is to process the request portion of a remote-DMA/read descriptor and continue processing descriptors on that VI's send queue. The benefit of this approach is the ability to continue processing additional descriptors posted on that VI. However, either the NIC or a software layer (usually the application) must ensure that memory consistency is maintained. Specifically, a write to a remote address that follows a read from that same address cannot "pass" the read, or the read may return incorrect data. To allow and encourage pipelining within a VI, VIA defines a queue fence bit. The queue fence bit forces all prior descriptors on the queue to be completed before further descriptor processing can continue. This queue fence bit ensures that memory consistency is maintained even when the NIC does not guarantee it.
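Continuing the hypothetical helpers sketched earlier, an application relying on the second implementation style could fence a write that must not pass an outstanding read. The flag values are assumptions for illustration.

```c
/* Hypothetical flag values; structures and vi_post() continue the
   earlier sketch. */
#define VI_TYPE_RDMA_READ   0x2u
#define VI_TYPE_RDMA_WRITE  0x4u
#define VI_QFENCE           0x100u /* complete all prior descriptors first */

/* Post a remote-DMA/read, then a remote-DMA/write to the same remote
   address. The fence keeps the write from passing the in-flight read,
   which could otherwise return stale data. */
static void read_then_write(struct vi_queue *sendq,
                            struct vi_desc *rd, struct vi_desc *wr)
{
    rd->flags = VI_TYPE_RDMA_READ;
    vi_post(sendq, rd);

    wr->flags = VI_TYPE_RDMA_WRITE | VI_QFENCE;
    vi_post(sendq, wr);
}
```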
As described previously, if a NIC executes more than one descriptor from a VI, it is possible for a send or remote-DMA/write descriptor to complete before a remote-DMA/read descriptor completes. The completion of the send and/or remote-DMA/write descriptor(s) is not visible until the descriptor on the head of the queue is also complete.

Work queue scheduling. There is no implicit ordering relationship between the execution of descriptors placed on different VIs. The scheduling of service for each active VI depends on the algorithm used in the NIC, the message sizes associated with the active descriptors, and the underlying transport used.

Memory protection. VIA provides memory protection for all VI operations to ensure that a user process cannot send out of, or receive into, memory that it does not own. A mechanism called protection tags, programmed by the kernel agent in a translation and protection table (TPT), provides the memory protection. Protection tags are unique identifiers associated with VIs and memory regions. Prior to creating a new VI or registering a memory region, a user must obtain a protection tag(s). When a user requests the creation of a VI or requests that a region of memory be registered, a protection tag is an input parameter to the requesting call. The kernel agent checks that the user owns the protection tag specified in the calling request.

When a user creates a new VI, the user must specify a protection tag as an input parameter. The kernel agent checks that the requesting user is the owner of that protection tag. If that check fails, a new VI is not created. When a user registers a memory region, the user must specify the protection tag associated with the region, whether remote-DMA/write is enabled, and whether remote-DMA/read is enabled. The kernel agent checks that the process registering the memory region owns the region and the specified protection tag. The kernel agent also checks whether the memory region being registered is marked as read only. If the requesting user does not own the specified protection tag, the kernel agent rejects the entire registration request. If the memory region is marked as read only by the operating system, the agent rejects a request to enable remote-DMA/write. Otherwise, the memory region is registered, and the kernel agent programs the appropriate entry(s) and associated bits into the translation and protection table and returns the memory handle for that region.

A process can own many VIs, each with the same or different protection tags. Each protection tag is unique; different processes are not allowed to share protection tags. VIs can only access memory regions registered with the same protection tag. Therefore, not all VIs owned by a process can necessarily access all memory regions owned by that process.

Virtual address translation. An equally important task performed by the kernel agent when registering a memory region is giving the NIC a method of translating virtual addresses to physical addresses. When a user requests registration of a memory region, the kernel agent performs ownership checks, pins the pages into physical memory, and probes the region for the virtual-to-physical address translations. The kernel agent enters the
physical page addresses into the translation and protection table and returns a memory handle. The memory handle is an opaque, 32-bit data structure. The user can now refer to that memory region using the virtual address and memory handle pair without having to worry about crossing page boundaries.

The use of a memory handle/virtual address pair is a unique feature of VIA. In contrast to the Hamlyn6 and U-Net7 architectures, in which addresses are specified as a tag and offset, VIA allows the direct use of virtual addresses. This means that the application does not need to keep track of mappings of virtual addresses to tags. A VI-NIC is responsible for taking a memory handle/virtual address pair, determining the physical address, and verifying that the user has the appropriate rights for the requested operation. The designer of a VI-NIC is free to decide the format of a memory handle and the way the translation and protection information is stored and retrieved. In our implementation, when a memory region of n pages is registered, the kernel agent locates n consecutive entries in the translation and protection table and assigns a memory handle such that the calculation

(VirtualAddress >> PageOffsetBits) - MemoryHandle

results in the index into the table of the first entry in the series. The page offset bits are the bits of the virtual address used to store the page offset (12 for a 4,096-byte page). The >> symbol represents a logical shift-right operation. We use unsigned binary arithmetic: after the page offset bits in the virtual address have been shifted right (truncated), the calculation ignores the remaining upper bits that extend beyond the number of bits in the memory handle. The n entries in the table contain the physical addresses of the n pages. The page offset bits are not entered in the table because they do not differ from the virtual-address page offset bits.

There are two additional consequences of registering memory regions, as opposed to the more common practice of registering pages. First, the user need not be concerned about crossing page boundaries; the network interface must therefore be aware of the page size so that when the virtual address crosses over to another page, the controller receives a new translation. (The physical pages need not be contiguous.) Second, although the user may request that only a partial page be registered, the entire page is actually registered, and therefore the entire page is subject to the protection attributes requested for the registered portion.
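For 4,096-byte pages, the whole translation reduces to a shift, a subtraction, and one table lookup. A minimal sketch, with the table-entry layout assumed for illustration:

```c
#include <stdint.h>

#define PAGE_SHIFT 12u                       /* 4,096-byte pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)

/* Assumed TPT entry layout; the specification leaves this to the NIC. */
struct tpt_entry {
    uint64_t phys_page;    /* physical address of one registered page */
    uint32_t prot_tag;     /* protection tag checked against the VI's tag */
    uint32_t attrs;        /* e.g., remote-DMA/write and read enables */
};

/* Translate a memory-handle/virtual-address pair to a physical address.
   Unsigned arithmetic: bits above the handle width simply wrap, as the
   text describes. */
static uint64_t vi_translate(const struct tpt_entry *tpt,
                             uint32_t memory_handle, uint32_t va)
{
    uint32_t index = (va >> PAGE_SHIFT) - memory_handle;
    return tpt[index].phys_page | (uint64_t)(va & PAGE_MASK);
}
```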
Prototype projects

We split the VI prototype program into two separate projects. The first project was a proof of concept for the overall architectural concepts. The second project aimed to validate a proposed hardware implementation.

Architecture validation prototype. We built a prototype that used a standard commodity Ethernet NIC. This controller has no VI-specific hardware support or intelligent subsystem, so we emulated the VI-NIC functionality entirely in software running on the host.

Goals. We established three goals prior to starting the project:
• Implement all the features of Version 0.9 of the VIA specification, the version current at that time.
• Demonstrate at least a twofold reduction in overhead compared with UDP on a similar platform. An important component of the architecture validation is to demonstrate that VI provides performance gains over legacy protocol stacks. We selected UDP because it is most similar to the VI unreliable service.
• Provide a stable VI development platform. We aimed to support at least four nodes interconnected through a network, with each node supporting 1,000 (emulated) VIs.

NIC description. We selected Intel Ethernet Pro 100B NICs and Ethernet switches because the controllers and switches are commodity items with hardware and tools, such as network analyzers, readily available, and because an NT NDIS driver was available for UDP performance comparisons.

Prototype implementation. The architecture validation prototype implemented the VI-NIC functionality in software. This approach provided the quickest route to validating the architectural concepts and an easily duplicated hardware platform for VI application experiments.

Performance measurements and results. The host nodes used for the performance tests were IA-32 server systems with dual 200-MHz Pentium Pro processors, the Intel 82450GX PCIset, 64 Mbytes of memory, and the Intel EtherExpress Pro/100B network adapter. We used Microsoft Windows NT 3.51 with the EtherExpress Pro/100B network adapter NDIS driver as the software environment for the UDP testing. The VIA performance tests were run on Microsoft Windows NT 4.0 with an Intel-supplied VI driver. We measured the UDP/IP performance using a ttcp test that we modified only to give timing and round-trip information. We wrote a VI test program to measure the equivalent functions as ttcp.

Application send latency. This test measures the time to copy the contents of an application's data buffer to another application's data buffer across an interconnect using the VI interface. Since VIA semantics require a preposted receive buffer, the test program includes the time to post a receive buffer.
The test program calculates application send latency by sending a packet to a remote node, which then sends the packet back to the sender. Multiple round-trip operations provide an average time per round trip. One half of the average round-trip time is the single-packet application send latency. Figure 5 compares VI to UDP latency.

Figure 5. VI versus UDP latency. (Latency in µs versus data payload size in bytes for Pro100B UDP/IP and Pro100B VI emulation.)

Application send bandwidth. These tests measure the rate at which large amounts of data can be sent from one application to another application across the network. The test program calculates application send bandwidth by initiating multiple concurrent sends, then continuing to issue sends as quickly as possible for the duration of the test time. The receiving node must have multiple receives outstanding throughout the test. The program calculates bandwidth by simply dividing the total amount of data sent by the total test time. Figure 6 illustrates the comparison of VI and UDP bandwidth for various message sizes.

Figure 6. VI versus UDP bandwidth. (Bandwidth in Mbps versus data payload size in bytes for Pro100B UDP/IP and Pro100B VI emulation.)

Hardware validation prototype. To further validate VIA concepts and verify performance goals, we built a prototype system using a NIC with an on-board RISC CPU. We moved most of the VIA emulation from the driver running on the host CPU to code running on the RISC CPU on the NIC. Throughout the following description, we refer to a NIC coupled with VIA emulation in software running on the host node as a VI-emulated NIC. We refer to a NIC coupled with VIA implemented in hardware and/or running as code on the NIC board as a VI-NIC.

Goals. We had three goals. First, we aimed to validate VIA concepts and provide a basis for experimentation with applications using VIA in place of traditional network interfaces. We required that prototypes support at least four nodes interconnected with a network, each node supporting at least 200 VIs and the mapping of 64 Mbytes of memory per NIC. Second, we wanted to demonstrate the performance gains available with a VI-NIC and compare them with a VI-emulated NIC. The performance advantages of a VI-NIC should be clearly visible compared to emulating the VI interface using a low-cost NIC. Finally, we hoped to expose potential problems and issues with VI-NIC designs. Creating a functional VI-NIC and the software to drive it allows discovery of any issues left undiscovered in the original VIA proof of concept.
Figure 7. Application-to-application latency test results. (Latency in µs versus data payload size in bytes for Myrinet-VI emulated in kernel agent and Myrinet-VI emulated in NIC.)
Figure 8. Application-to-application bandwidth test results. (Bandwidth in Mbps versus data payload size in bytes for Myrinet-VI emulated in kernel agent and Myrinet-VI emulated in NIC.)
Prototype nodes. The host nodes used for the performance tests were Micron Millenia PRO2 systems with 64 Mbytes of DRAM, dual 200-MHz Pentium Pro processors each with 256 Kbytes of second-level cache, a 440FX chipset, and Windows NT 4.0 with Service Pack 3.

Hardware environment. We selected Myricom's Myrinet NICs and switches for several reasons. They provide a data rate of at least 1.28 Gbps. The programmable RISC CPU on the NIC allows rapid development and modification of the prototype. Enough memory (1 Mbyte) is present to allow the control program, the memory protection and translation table, and per-VI context information to reside on the NIC. Tools and documentation were available for developing the network interface's control program, and example source code was available as a starting point. Finally, the NICs were available as off-the-shelf products compatible with the host hardware (based on a PCI-32 I/O bus and Pentium Pro processors).

Prototype implementations. We created two prototype environments: a VI-emulated NIC and an emulation of a VI-NIC. For the VI-emulated NIC environment, we programmed the Myrinet NIC to emulate a low-cost gigabit NIC; the driver software supported the VI emulation. This provided only a means to place packets onto a network and to extract packets from the network. These types of controllers understand only physical addresses and deal with network data on a packet basis. This leaves the work of memory address translation and per-connection multiplexing and demultiplexing of the packet stream to the host CPU. We refer to this environment in the performance graphs as "VI emulated in kernel agent."

For the emulated VI-NIC environment, we programmed the Myrinet NIC to emulate an example VI-NIC hardware design and modified the driver software to support this emulation. We refer to this environment in the performance graphs as "VI emulated in NIC."

VI-NIC characteristics. A VI-NIC works directly with the send and receive descriptors, using virtual addresses. The network adapter handles all the memory translation and protection functions, as well as the multiplexing and demultiplexing of packets to connections. The hardware and software map the doorbell registers directly into the application's memory space, allowing descriptors to be posted without a kernel transition.
The emulation that generated the results we present did not emulate directly accessible doorbell registers and thus required a kernel transition to post a descriptor.
Performance measurements

We measured both application-to-application latency and bandwidth.

Latency. The application-to-application latency test is the same test we used to gather the VI Ethernet latency numbers described earlier. Figure 7 summarizes the results of this test.

Bandwidth. The application-to-application bandwidth test measures the rate at which large amounts of data can be copied from one application to another application across an interconnect using VIA. The test program calculates application-to-application bandwidth by initiating multiple round-trip operations (with an average of 50 outstanding) and obtaining an average time per round trip. Using one half of the average round-trip time and the data packet payload size, we calculated the application-to-application bandwidth for streamed data. Figure 8 summarizes the results.

CPU utilization. The application CPU load test measures the amount of CPU time that is unavailable to the application when passing messages. This test program measures the total cost of system activity related to message passing that affects the application. It includes the execution of message-passing code (library calls), kernel transitions (if any), interrupt handling, and CPU instruction-cache flushing (due to execution of instructions outside of the main application). It also includes cache snooping and memory bandwidth contention due to movement of data to and from the NIC.

The program calculates the overhead to pass a message by constructing a situation in which the test application can consume all of the CPU resources while performing its work. The program uses a vector sum as the workload. We ran the application twice, once without passing messages and once while passing messages. Each run continues for a fixed amount of time, and we recorded the number of vector sums completed. The first run yields an average time to perform a vector sum. The second run yields the overhead of the message-passing operations performed during the test run.
Figure 9. CPU utilization test results. (Application time lost per message in µs versus data payload size in bytes for Myrinet-VI emulated in kernel agent and Myrinet-VI emulated in NIC.)
The second run completes some number of vector sums (always fewer than in the first run). The time consumed by passing messages is determined by subtracting the amount of time spent running the application (the number of vector sums performed in the second test run multiplied by the average time to perform a vector sum from the first test run) from the total test time. The overhead per message is the total message-passing time divided by the number of messages passed. See Figure 9 for the results.
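The bookkeeping reduces to a few lines of arithmetic; the variable names here are ours:

```c
/* Overhead per message, computed from the two timed runs described
   above: run 1 performs vector sums only; run 2 runs for the same
   wall-clock time while also passing messages. */
double overhead_per_message_us(double test_time_us,
                               long sums_run1, long sums_run2,
                               long messages_passed)
{
    double us_per_sum  = test_time_us / (double)sums_run1;
    double app_time_us = (double)sums_run2 * us_per_sum; /* useful work */
    double msg_time_us = test_time_us - app_time_us;     /* lost to messaging */
    return msg_time_us / (double)messages_passed;
}
```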
Performance results. Both of the application-to-application tests showed that the VI-NIC performance improvements resulted from the elimination of host data copies at the receiving node and of host system interrupts. The quantity of code executed by the NIC's processor limited the performance gains. The test of application CPU cost per message clearly revealed the entire cost of handling host system interrupts and performing host system data copies. In the VI-NIC results, the slight increase in CPU time with increasing packet size is attributable to the PCI data transfers competing with the application for host system memory bandwidth.

Performance issues and projections

Our prototype environment was successful in increasing performance and reducing the CPU cycles consumed on the host, but it did limit performance compared to a fully integrated VI-NIC.

Hardware performance limitations. Although the selected hardware's capabilities were well matched to the requirements, certain limitations constrained the performance available when emulating the VI-NIC hardware. The host system's 200-MHz Pentium Pro processors overshadow the 33-MHz instruction rate of the on-board controller. Also, due to the Myrinet NIC architecture, the transfer of a block of data takes four data copy operations: local host memory to local adapter memory, local adapter memory to network (mostly overlapped with the next step), network to remote adapter memory, and remote adapter memory to host memory. Finally, emulating the VI doorbell scheme is difficult for more than a trivial number of VIs without consuming a significant amount of memory, controller CPU cycles, or PCI bus resources.

Design limitations. To support the stated project goals,
the design focused on emulating the reference hardware design rather than on optimizing performance for the Myrinet NIC. If, for example, the goal had been to design and implement the fastest VI-emulated system for the Myrinet NICs, different design choices might have yielded better performance.

Emulation overheads. The scheme used to emulate the doorbell functionality required an application-to-kernel transition. Doorbell support in hardware would reduce measured single-packet latencies by an estimated 7.5 µs and the CPU load by an estimated 9 µs.

Prototype conclusions. Emulating a VI-NIC in host software using a low-cost adapter is quite feasible, but a significant amount of performance is lost in both the communications space and the application space. Emulating a VI-NIC using an intelligent NIC provides the minimum level of performance that VIA can provide. Two additions to an intelligent NIC would provide significant performance improvements over the prototype results achieved in this project. A faster programmable controller would produce a more desirable bandwidth performance curve and would improve small-packet latency; a control CPU in the 200-MIPS range would have been a good match for today's host CPUs. A hardware assist for implementing the doorbell registers would reduce small-packet latency and significantly reduce host CPU overhead per message sent.

Performance conclusions. The best network performance and the best price/performance will be available on network adapters that implement the core VI functionality in special-purpose silicon and eliminate all possible processing overheads in the send and receive paths. Conservative estimates show that eliminating kernel transitions and intermediate copies of data at both the send and receive nodes yields a software overhead of well under 300 user-level instructions, many of which construct descriptors. The use of completion queues and associated handler(s) should avoid most interrupts on receive for a busy system. With today's processors, a gigabit network, and a modest hardware design, small-message latencies well below 10 µs are achievable.
GIVING A USER PROCESS DIRECT ACCESS to a network resource is not a new concept. What is new about VIA is that the concepts, data structures, and semantics are specified and defined for the computing industry in an open fashion, independent of the operating system and the CPU. The goal of this approach is to encourage a vibrant and competitive environment that accelerates the acceptance of clustered computers. VIA removes the barrier of application code being written for one system with a specific network interface controller, operating system, and CPU, each of which has a limited lifetime. Establishing and adhering to VIA results in application code that achieves excellent interprocess communication using today's technologies. More importantly, that application code will benefit from future technology improvements and achieve higher performance without being rewritten.

Reuse of application code is a key to architecture acceptance. However, the performance must be there to warrant the initial investment. Prototyping efforts show that VIA does meet its three performance goals: reducing the latency and increasing the bandwidth between communicating processes while significantly reducing the number of instructions the CPU must execute to accomplish the communication. Performance investigations show that VI-NICs integrated in silicon should easily achieve end-to-end latencies below 10 µs. The sustainable bandwidth between two communicating processes is limited by the NIC hardware and fabric implementation, not by the code used to send and receive the data. The reduction in latency and increase in bandwidth for interprocess communication are achieved with a great reduction in the instructions executed by the CPU.
Related work

Academic and industrial research have described and validated many of VIA's concepts. The works referenced and/or mentioned here influenced VIA.

The message-driven processor (MDP) project8 showed that fast context switches and off-loading the host CPU from handling and buffering messages would achieve good message-passing performance. The progress of microprocessors has led to impressive increases in clock speed but not a proportional decrease in context-switch times. Therefore VIA avoids context switches while off-loading the processor.

The Active Messages9 project at the University of California, Berkeley, and the Fast Messages10 project at the University of Illinois, Urbana-Champaign, point out the benefit of an asynchronous communication mechanism. The two architectures are similar in that each uses a handler, which is executed on message arrival. In VIA, all message-passing operations are asynchronous, but VIA does not specify a per-message handler; it relies instead on protection tags, translation and protection tables, and registered memory.

Four examples of user-level networking architectures are Shrimp,11 U-Net,7 Hamlyn,6 and Memory Channel.12 In Princeton University's Shrimp project, a virtual memory-mapped network interface maps the send buffer to main memory through a kernel call. This call checks for protection and stores memory-mapping information on the network interface. The network interface maps each physical page of the send buffer to a physical page of the destination node.
This information is set up in the page tables of the network interfaces on both communicating nodes. After the memory is mapped, data is sent by performing writes to the send buffer.

Cornell University's U-Net project demonstrated user-level access to a network using a simple queuing method. This user-level network architecture provides every process with the illusion of its own protected interface to the network. It associates a communication segment and a set of send, receive, and free message queues with each U-Net endpoint. Endpoints and communication channels allow U-Net to provide protection boundaries between multiple processes.

The Hamlyn architecture from Hewlett-Packard Labs provides applications with direct access to network hardware. The sender-based memory management used in Hamlyn avoids software-induced packet losses. An application's virtually contiguous address space is used to send and receive data in a protected fashion without operating system intervention.

The Memory Channel interconnect developed at Digital Equipment Corporation uses memory mapping to provide low communication latency. Applications map a portion of a clusterwide address space into their virtual address space and then use load and store instructions to read or write data. Direct manipulation of shared-memory pages achieves message-passing operations. In the Memory Channel network, no operating system calls are made once a mapping is established.
References
1. R.P. Martin et al., "Effects of Communication Latency, Overhead and Bandwidth in a Cluster Architecture," Computer Architecture News, May 1997, pp. 85-97.
2. R. Gusella, "A Measurement Study of Diskless Workstation Traffic on an Ethernet," IEEE Trans. Communications, Vol. 38, No. 9, Sept. 1990, pp. 1557-1568.
3. J. Kay and J. Pasquale, "The Importance of Non-Data Touching Processing Overheads in TCP/IP," Computer Communication Review, Vol. 23, No. 4, Oct. 1993, pp. 259-268.
4. D.D. Clark et al., "An Analysis of TCP Processing Overhead," IEEE Communications, Vol. 27, No. 6, June 1989, pp. 23-29.
5. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann Publishers, San Francisco, Calif., 1995.
6. G. Buzzard et al., "An Implementation of the Hamlyn Sender-Managed Interface Architecture," Operating Systems Review, 1996, pp. 245-259.
7. T. von Eicken et al., "U-Net: A User-Level Network Interface for Parallel and Distributed Computing," Operating Systems Review, Vol. 29, No. 5, Dec. 1995, pp. 40-53.
8. W.J. Dally et al., "Architecture of a Message-Driven Processor," Proc. 14th Int'l Symp. Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., 1987, pp. 189-196.
9. T. von Eicken et al., "Active Messages: A Mechanism for Integrated Communications and Computation," Computer Architecture News, Vol. 20, No. 2, May 1992, pp. 256-266.
10. S. Pakin, V. Karamcheti, and A. Chien, "Fast Messages (FM): Efficient, Portable Communication for Workstations and Massively-Parallel Processors," IEEE Concurrency, Vol. 5, No. 2, Apr.-June 1997, pp. 60-72.
11. M.A. Blumrich et al., "Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer," Proc. 21st Int'l Symp. Computer Architecture, Apr. 1994, pp. 142-153.
12. R. Gillett and R. Kaufmann, "Using the Memory Channel Network," IEEE Micro, Vol. 17, No. 1, Jan.-Feb. 1997, pp. 19-25.
Dave Dunning is a principal engineer within the Intel Server Architecture Lab. His technical interests include highspeed networks and network interfaces. Dunning received a BA in physics from Grinnell College, a BS in electrical engineering from Washington University (St. Louis), and an MS in computer science from Portland State University.
Greg Regnier is a senior staff engineer within the Intel Server Architecture Lab. His technical interests include operating systems, distributed computing, message passing, and networking technology. Regnier has a BS in computer science from Saint Cloud State University in Minnesota.
Gary McAlpine is a system/component architect in Intel's Enterprise Server Group. His technical interests include distributed computer systems, high-end computer system architectures, networking systems, graphics, signal and image processing, real-time computing, I/O, and neural networks.
Don Cameron is a senior staff architect in the Intel Enterprise Server Group's Server Architecture Lab. His current interests are user-level I/O and network-attached storage. He is a member of the Storage Network Industry Association.
Bill Shubert is a software engineer in Intel Corporation’s Server Architecture Lab. He has done extensive work on networking and communications software for supercomputers and servers. Shubert graduated with a BS from Carnegie Mellon University in Pittsburgh.
Frank Berry is a staff software engineer in Intel's Enterprise Server Group. His technical interests include operating systems, device drivers, and networking technology. Berry was the principal developer of very high performance Windows NT network drivers on Intel-based systems. He is currently the senior designer for the VI Proof of Concept work.
Anne Marie Merritt is a software engineer in Intel’s Enterprise Server Group. She has worked on the VI Architecture library and device driver under Windows NT. Her technical interests include Windows NT device drivers and network management. Merritt is finishing her master’s degree at California State University, Sacramento. Her final project details an SNMP MIB-II to manage VIA 1.0 clusters.
Ed Gronke is a staff software engineer in Intel's Server Architecture Lab. As part of the VI Architecture team, he focuses on how database systems and databaselike applications can make the best use of the capabilities of VI-enabled clusters. Gronke has a BA in physics from Reed College and did graduate work in stellar astrophysics at Wesleyan University.
Chris Dodd is a principal engineer in Intel's Server Architecture Laboratory. His technical interests focus on scalable server architectures, especially high-performance message passing for multicomputers. Dodd received his BS and MS degrees in computer science from the University of Wisconsin, Madison.
Direct questions concerning this article to Dave Dunning, Server Architecture Laboratory, Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124; ddunning@co.intel.com.