Lker-Level Interprocess Communication for Shared Memory Multiprocessors BRIAN N. BERSHAD Carnegie
Mellon
University
THOMAS
E. ANDERSON,
University
of Washington
Interprocess (between
communication address
operating IPC
applications encounter level
and
communication facilities
These
when
at the user
level,
with
thread
Terms—thread,
local
to the design of the
kernel,
is architecturally
from
one
problems them
since
between
cornmzmzcation of contemporary but
limited
address
space
kernel-based by the
to another.
cost of Second,
their own thread management the interaction between kernel-
User-Level
can be solved at the user
kernel
address
structure
allows
the
sees threads
on the
same Call
shared kernel
within
the communica-
each address
and processor
Procedure
using
by moving
level
invocation
spaces
Remote
protocol
The programmer
unconventional
towards
central
M. LEVY
management.
these
is improved
This
oriented
threads and must provide problems stemming from
and supporting
motivated
and HENRY
the responsibility
a processor
space communication
space communication.
Index
kernel
communicating
observations
though
been
multiprocessor,
performance
cross-address
managed
IPC
has become
its performance
reallocating
and user-level
memory out of the
be avoided
particular
First,
that need inexpensive functional and performance
Communication
fast
in
has traditionally problems
kernel
On a shared tion
IPC
inherent
the
(IPC),
D. LAZOWSKA,
spaces on the same machine),
systems.
has two
invoking
EDWARD
can
machine.
(URPC)
memory
URPC
with
through
combines
lightweight
to be bypassed
and RPC
space.
reallocation
during
a
threads
cross-address
a conventional
interface,
performance,
multiprocessor,
operating
system,
parallel
programming,
performance,
communication Categories
and
Subject
Descriptors:
D. 3.3
[Programming
Languages]:
Language
Constructs
[Operating Systems]: Process Management— concurrency, multiprocessing / multiprograrnmmg; D.4.4 [Operating Systems]: Communications Manfl!ow agement; D.4. 6 [Operating Systems]: Security and Protection— accesscontrols, information controls; D.4. 7 [Operating systems]: Organization and Desig~ D.4.8 [Operating Systems]: Performance— measurements
and
Features
— modules,
This
material
is based
8619663, and
on work
CCR-87OO1O6,
Digital
Program).
supported
CCFL8703049,
Equipment Bershad
D.4. 1
packages;
Corporation
was
(the
supported
by the
and
National
CCR-897666),
Systems
by an AT&T
Science the
Research Ph.D.
Center
Fellowship.
Authors’ Pittsburgh,
addresses: B. N. Bershad, School of Computer Science, PA 15213; T. E. Anderson, E. D. Lazowska, and Science
Permission not made of the
and Engineering,
to copy without or distributed
publication
Association
and
specific
permission.
@ 1991
ACM
fee all or part
for direct its
for Computing
FR-35,
date
of this
commercial
appear,
Machinery.
0734-2071/91/0500-0175 ACM
University
Transactions
and
material is given
To copy otherwise,
the
External
Anderson
CCRCenter,
Research by an IBM
Carnegie Mellon University, H. M Levy, Department
is granted the ACM that
(grants
Technology
and
of Washington,
advantage, notice
and
Scholarship,
Graduate
Computer
Foundation
Washington
Seattle, provided copyright
copying
or to republish,
WA that notice
the copies
are
and the title
is by permission requires
of
98195.
of the
a fee and/or
$01.50 on Computer
Systems,
Vol
9, No. 2, May 1991, Pages 175-198
176
.
General
B. N
Terms:
Addltlonal
Bershad
Design,
Key
et al
Performance,
Words
and
Phrases:
Measurement Modularity,
remote
procedure
call,
small-kernel
operating
system
1. INTRODUCTION Efficient
interprocess
communication
is central
to the design
of contemporary
operating systems [16, 231. An efficient communication facility encourages system decomposition across address space boundaries. Decomposed systems have several advantages over more monolithic ones, including failure isolation (address space boundaries prevent a fault in one module from “leaking” into another), extensibility (new modules can be added to the system without having
to modify
mechanism
existing
rather
Although
than
address
which they primitives.
ones),
and
modularity
(interfaces
are enforced
by
by convention).
spaces
can be a useful
structuring
device,
the
extent
to
can be used depends on the performance of the communication If cross-address space communication is slow, the structuring
benefits that come from decomposition are difficult to justify to end users, whose primary concern is system performance, and who treat the entire operating system as a “black box” [181 regardless of its internal structure. Consequently, designers are forced to coalesce weakly related subsystems into
the same address
modularity Interprocess operating problems:
space, trading
away
failure
isolation,
extensibility,
and
for performance. communication
system
—Architectural
has traditionally
kernel.
However,
performance
barriers.
been the responsibility
kernel-based
The
communication
performance
of the has
of kernel-based
two
syn-
chronous communication is architecturally limited by the cost of invoking the kernel and reallocating a processor from one address space to another. In our earlier work on Lightweight Remote Procedure Call (LRPC) [10], we show that it is possible to reduce the overhead of a kernel-mediated cross-address space call to nearly the limit possible on processor architecture: the time to perform a cross-address slightly
greater
than
that
required
to twice
invoke
a conventional LRPC is only
the kernel
and have
it
reallocate a processor from one address space to another. The efficiency of LRPC comes from taking a “common case” approach to communication, thereby avoiding unnecessary synchronization, kernel-level thread management, and data copying for calls between address spaces on the same machine. The majority of LRPC’S overhead (70 percent) can be attributed directly to the fact that the kernel mediates every cross-address space call. —Interaction
between
kernel-based
communication
and
high-performance
The performance of a parallel application running on a multiprocessor can strongly depend on the efficiency of thread management operations. Medium and fine-grained parallel applications must use a thread management system implemented at the user level to obtain user-level
ACM
TransactIons
threads.
on Computer
Systems,
Vol
9, No
2, May
1991
User-Level Interprocess satkfactory have
Communication
performance
strong
for Shared Memory Multiprocessors
[6, 36]. Communication
interdependencies,
though,
and thread
and the
across protection boundaries (kernel-level level thread management) is high in terms
.
177
management
cost of partitioning
communication of performance
them
and userand system
complexity. On
a shared
eliminate Because
memory
multiprocessor,
these
problems
the kernel from the path of cross-address address spaces can share memory directly,
have
space shared
a solution:
communication. memory can be
used as the data transfer channel. Because a shared memory multiprocessor has more than one processor, processor reallocation can often be avoided by taking
advantage
of a processor
already
active
without involving the kernel. User-level management of cross-address improved performance over kernel-mediated –Messages kernel.
are sent
between
–Unnecessary processor reducing call overhead, contexts
across
inherent
exploited,
the
same
address and
spaces,
processor
call
Procedure
directly,
results
without
invoking
in
the
without
components management, which
reallocation.
(URPC),
the
communication
kernel
can be
of a message
can be
mediation.
system
described
between URPC
address
isolates
of interprocess communication: and data transfer. Control
is the
is implemented
and receiving
its overhead
performance.
Call
safe and efficient
machine
programmer,
in the sending
improving
Remote
other the three cation, thread
spaces
space
reallocation between address spaces is eliminated, and helping to preserve valuable cache and TLB
parallelism
provides
address
space communication approaches because
reallocation does prove to be necessary, several independent calls.
thereby
User-Level paper,
target
calls.
–When processor amortized over —The
address
in the
communication through
Only
processor
volvement; thread management and ment and interprocess communication rather than by the kernel.
from
presented
of thread
reallocation
one an-
processor reallotransfer between
abstraction
a combination
in this spaces on
to the
management
requires
data transfer do not. Thread are done by application-level
kernel
in-
managelibraries,
URPC’S approach stands in contrast to conventional “small kernel” systems in which the kernel is responsible for address spaces, thread management, and interprocess communication. Moving communication and thread management to the user level leaves the kernel responsible only for the mechanisms that allocate processors to address spaces. For reasons of performance and flexibility, this is an appropriate division of responsibility for operating systems of shared memory multiprocessors. (URPC can also be appropriate for uniprocessors running multithreaded applications. ) The latency of a simple cross-address space procedure call is 93 psecs using a URPC implementation running on the Firefly, DEC SRC’S multiprocessor workstation [37]. Operating as a pipeline, two processors can complete one ACM
Transactions
on Computer
Systems,
Vol
9, No
2, May 1991.
178
.
call
every
B. N. Bershad et al 53 psecs.
that involve support for kernel-based user-level
On the
same
multiprocessor
hardware,
RPC
threads
that
accompany
URPC
takes
43
psecs,
while
operation for the kernel-level threads that accompany LRPC millisecond. To put these figures in perspective, a same-address dure
facilities
the kernel are markedly slower, without providing adequate user-level thread management. LRPC, a high performanceimplementation, takes 157 psecs. The Fork operation for the
call takes
7 psecs on the Firefly,
and a protected
kernel
the
Fork
take over a space proce-
invocation
(trap)
takes 20 psecs. We describe the mechanics of URPC in more detail in the next section. In Section 3 we discuss the design rationale behind URPC. We discuss performance
in Section
we present
4. In Section
2. A USER-LEVEL In
many
5 we survey
related
work.
REMOTE PROCEDURE
contemporary
uniprocessor
applications
communicate
sages
narrow
across
operations (create, between programs
a control
languages memory tween
in Section
6
with
channels
and
CALL FACILITY multiprocessor
operating
system
that
or ports
support
and
that for
by
only
a small
systems,
sending
mes-
number
of
send, receive, destroy). Messages permit communication on the same machine, separated by address space bound-
data-structuring
support
device
synchronous
communication
address
operating
services
aries, or between programs on different machines, boundaries. Although messages are a powerful communication sent
Finally,
our conclusions.
within
spaces is with
foreign
procedure an
untyped,
primitive, to traditional
call,
address
separated
data
space.
asynchronous
typing,
by
physical
they
repre-
Algol-like and shared
Communication
messages.
be-
Programmers
of message-passing systems who must use one of the many popular Algol-like languages as a systems-building platform must think, program, and structure according to two quite different programming paradigms. Consequently, almost every mature message-based operating system also includes support for Remote Procedure Call (RPC) [111, enabling messages to serve as the underlying transport mechanism beneath a procedure call interface. Nelson defines RPC as the synchronous language-level transfer of control between programs in disjoint address spaces whose primary communication medium is a narrow channel [301. The definition of RPC is silent about the operation of that narrow channel and how the processor scheduling (reallocation) mechanisms interact with data transfer. IJRPC exploits this silence in two ways: -Messages
are passed
between
address
spaces through
logical
channels
kept
in memory that is pair-wise shared between the client and server. The channels are created and mapped once for every client/server pairing, and are used for all subsequent communication between the two address spaces so that several interfaces can be multiplexed over the same channel. The integrity of data passed through the shared memory channel is ensured by ACM
Transactions
on Computer
Systems,
Vol
9, No
2, May
1991
User-Level Interprocess a combination statically) correctness
Communication
for Shared Memory Multiprocessors
of the pair-wise
mapping
and the URPC runtime is verified dynamically).
(message
system
authentication
in each address
—Thread management is implemented entirely at the integrated with the user-level machinery that manages nels.
A user-level
directly
thread
enqueuing
channel.
the
No kernel
Although
sends
a message
message
calls
on the
are necessary
a cross-address
space
server,
it blocks
waiting
to send a call
procedure
for the reply
link
call
signifying
179
is implied
space (message
user level and is the message chan-
to another
outgoing
perspective of the programmer, it is asynchronous thread management. When a thread in a client
.
address of the
space
by
appropriate
or reply
message.
is synchronous
from
the
at and beneath the level of invokes a procedure in a
the procedure’s
return;
while
blocked, its processor can run another ready thread in the same address space. In our earlier system, LRPC, the blocked thread and the ready thread were really the same; the thread just crosses an address space boundary. In contrast,
URPC
address that
always
tries
space on the client
can be handled
When
the reply
entirely
arrives,
to
schedule
thread’s
another
processor.
by the
This
user-level
the blocked
client
thread
thread
thread
from
the
is a scheduling
same
operation
management
system.
can be rescheduled
on any
of the processors allocated to the client’s address space, again without kernel intervention. Similarly, execution of the call on the server side can be done by a processor
already
and need not occur By preferentially takes
advantage
in switching call
this
address requires
executing
in the context
of the fact that
a processor context
of the server’s
address
synchronously with the call. scheduling threads within the same address there
to another than
switching)
is significantly
thread in
reallocating
space, URPC
less overhead
in the same address
space,
involved
space (we will
it to a thread
in
another
reallocation). Processor reallocation space (we will call this processor changing the mapping registers that define the virtual address
space context architectures, in privileged Several
in which these kernel
components
tion. There are processor should
a processor
mapping mode.
is executing.
registers
contribute
are protected
to the
high
On conventional and can only
overhead
processor be accessed
of processor
realloca-
scheduling costs to decide the address space to which a be reallocated; immediate costs to update the virtual mem-
ory mapping registers and to transfer the processor between and long-term costs due to the diminished performance of translation lookaside buffer (TLB) that occurs whefi locality address space to another [31. Although there is a long-term with context switching within when processors are frequently To demonstrate the relative
the same address space, that cost is less than reallocated between address spaces [201. overhead involved in switching contexts be-
tween two threads in the between address spaces, same-address
space
same-address
space
same address space we introduce the On the context switch. context switch requires ACM
address spaces; the cache and shifts from one cost associated
TransactIons
versus reallocating processors latency concept of a minimal C-VAX, the minimal latency saving and then restoring the
on Computer
Systems,
Vol
9, No
2, May 1991.
180
B. N Bershad et al
.
machine’s
general-purpose
registers,
reallocating
the
takes
55 psecs without
about
Because reallocation, switching
processor
from
and
takes
one address
including
about
15 psecs.
space to another
the long-term
C-VAX
cost.
of the cost difference between context switching URPC strives to avoid processor reallocation, whenever
In contrast,
on the
and processor instead context
possible.
Processor reallocation is sometimes necessary with URPC, though. If a client invokes a procedure in a server that has an insufficient number of processors to handle the call (e. g., the server’s processors are busy doing other reply. with
work), the client’s processors may idle for some time awaiting the This is a load-balancing problem. URPC considers an address space pending
incoming
messages
on its
channel
to be “underpowered.”
An
idle processor in the client address space can balance the load by reallocating itself to a server that has pending (unprocessed) messages from that client. Processor reallocation requires that the kernel be invoked to change the processor’s virtual memory context to that of the underpowered address space. That done, the kernel upcalls into a server routine that handles the client’s outstanding requests. After these have been processed, the processor is returned to the client address space via the kernel. The responsibility for detecting incoming messages and scheduling threads to handle them belongs to special, low-priority threads that are part of a URPC runtime library linked incoming messages only when Figure proceeding
1 illustrates downward.
manager (WinMgr) a separate address procedures
into they
each address space. Processors would otherwise be idle.
a sample execution timeline One client, an editor, and
and a file cache manager space. Two threads in
in the window
manager
for URPC two servers,
scan
for
with time a window
(FCMW), are shown. Each is in T1 and T2, invoke the editor,
and the file
cache manager.
Initially,
the
editor has one processor allocated to it, the window manager has one, and the space call into file cache manager has none. T1 first makes a cross-address the window manager by sending and then attempting to receive a message. The window manager receives the message and begins processing. In the meantime, the thread scheduler in the editor has context switched from TI (which is blocked waiting T2 initiates a procedure
to receive a message) to another thread call to the file cache manager by sending
Thread a message
T2.
T1 can be unblocked because the window and waiting to receive a reply. T1 then calls manager has sent a reply message back to the editor. Thread into the file cache manager and blocks. The file cache manager now has two pending messages from the editor but no processors with which to handle T1 and T2 are blocked in the editor, which has no other them. Threads threads waiting to run. The editor’s thread management system detects the
outgoing pending messages and donates a processor to the file cache manager, which can then receive, process, and reply to the editor’s two incoming calls before returning the processor back to the editor. At this point, incoming reply messages from the file cache manager can be handled. Tz each terminate ACM
TransactIons
when
on Computer
they
receive
Systems,
Vol
their 9, No
replies.
2, May
1991
the two 7’1 and
User-Level Interprocess
Communication
for Shared Memory Multiprocessors
(has no processors)
(scanning channels) m
181
FCMgr
WinMgr
Editor
.
m
Send WinMgr
TI
Recv & Process Reply to T1’s Call
Call [
Recv WinMgr /
Context Switch T2
Send FCMgr
Call [
Recv FCMgr
Context
Switch
Send
TI Call [
FCMgr
Recv
FCMgr
Reallocate Processor
Recv
to
FCMgr
Reply
Recv
& Process to T2’s Call
& Process
Reply
to T1’s Call
Return
Processor
to Editor Context Terminate
Context Terminate
Switch T2
Switch T1
Time Fig.
1.
URPC
timeline
URPC’S treatment of pending messages is analogous to the treatment of runnable threads waiting on a ready queue. Each represents an execution context in need of attention from a physical processor; otherwise idle processors poll the queues looking for work in the form of these execution contexts. There are two main differences between the operation of message channels and thread ready queues. First, URPC enables on address space to spawn work in another address space by enqueuing a message (an execution context consisting of some control information and arguments or results). Second, ACM
Transactions
on Computer
Systems,
Vol
9, No
2, May 1991
182
B. N Bershad et al.
.
Server
Client
Application
Application
Stubs
Message
URPC
❑
Stubs
Channel mm
L
URPC
Fa~tThreads
FastThreads
/
\ /
~eallocate~roces~F-<
Fig,2
URPC
enables
another
Thesoftware
processors
through
components
in one address
an explicit
reallocation
of URPC
space to execute
in the
if the
between
workload
context
of
them
is
imbalance. 2.1
The User View
It is important to emphasize that all of the URPC mechanisms described in this section exist “under the covers” of quite ordinary looking Modula2 + [34] interfaces. The RPC paradigm provides the freedom to implement the control and data transfer mechanisms in ways that are best matched by the underlying hardware. URPC consists of two code. One package, called
software
are managed
level
at the
user
packages
FczstThreacis
and
[5],
used by stubs and application provides lightweight threads that
scheduled
on top of middleweight
kernel
implements the channel managethreads. The second package, called URPC, package ment and message primitives described in this section. The URPC FastThreads, as lies directly beneath the stubs and closely interacts with shown
in Figure
3. RATIONALE
2.
FOR THE URPC DESIGN
In this section we discuss the design rationale behind URPC. rationale is based on the observation that there are several components to a cross-address space call and each can be separately.
The main
components
In brief, this independent implemented
are the following:
management: blocking the caller’s thread, running a thread through – ‘Thread the procedure in the server’s address space, and resuming the caller’s thread on return, ACM
Transactions
on Computer
Systems,
Vol
9, No
2, May
1991
User-Level Interprocess —Data
moving
transfer:
spaces,
arguments
between
the
client
and
.
server
183 address
and
–Processor
ensuring that there is a physical processor in the server and the server’s reply in the client.
to handle
reallocation:
the client’s In
Communicahon for Shared Memory Mulhprocessors
call
conventional
message
systems,
these
three
components
are
combined
beneath a kernel interface, leaving the kernel responsible for each. However, thread management and data transfer do not require kernel assistance, only processor how
reallocation
the
another,
3.1
does. In the three
components
and the benefits
Processor attempts
occurs URPC
through the optimistically
—The
client
address —The
to reduce
arise
from
that
call
follow
we describe
can be isolated
from
one
such a separation.
has serious
the
frequency
with
use of an optimistic assumes the following: other
messages,
not have
that
space
Reallocation
URPC
incoming
subsections
of a cross-address
work and
effect
to
do,
in
a potential
which
processor
reallocation
the
delay
form
of runnable
in the
on the performance
reallocation At
policy.
processing
of other
threads
call
time,
threads of a call
or will
in the client’s
space.
server
has, or will
soon have,
a processor
with
which
it can service
a
message. In
terms
several context
of performance,
ways. The first switch between
URPC’S
assumption user-level
optimistic makes threads
assumptions
can pay
off in
it possible to do an inexpensive during the blocking phase of a
cross-address space call. The second assumption enables a URPC to execute in parallel with threads in the client’s address space while avoiding a processor reallocation. An implication of both assumptions is that it is possible
to amortize
the
cost of a single
outstanding calls between the threads in the same client make a single
processor
reallocation
processor
reallocation
across
several
same address spaces. For example, if two calls into the same server in succession, then can handle
both.
A user-level approach to communication and thread management is appropriate for shared memory multiprocessors where applications rely on aggressive multithreading to exploit parallelism while at the same time compensating for multiprogramming effects and memory latencies [2], where a few key operating system services are the target of the majority calls [8], and where operating system functions are affixed ing nodes for the sake of locality [31]. In contrast well
suited
–Kernel-level
to URPC,
contemporary
for use on shared
memory
communication
and
uniprocessor
structures
are not
multiprocessors: thread
pessimistic processor reallocation policies, rency within an application to reduce scheduling [13] underscores this Handoff tion blocks the client and reallocates its ACM
kernel
of all application to specific process-
TransactIons
management
facilities
rely
on
and are unable to exploit concurthe overhead of communication. pessimism: a single kernel operaprocessor directly to the server.
on Computer
Systems,
Vol. 9, No
2, May 1991
184
B. N. Bershad et al
.
Although
handoff
is limited
by the cost of kernel
–In
a traditional
scheduling
operating
shared
performance,
invocation
and processor
system
running on a multiprocessor, distributed over the many large-scale
does improve
memory
kernel
designed
for
the
improvement
reallocation. a uniprocessor,
but
kernel resources are logically centralized, processors in the system. For a mediummultiprocessor
such as the Butterfly
but to
[71, Alewife
[41, or DASH [251, URPC’S user-level orientation to operating system design localizes system resources to those processors where the resources are in use, relaxing the performance bottleneck that comes from relying on centralized kernel data structures. This bottleneck is due to the contention for logical resources (locks) and physical resources (memory and interconnect bandwidth) that results when a few data structures, such as thread run
queues
and message By distributing
processors. functions
to
directly,
the
address
contention
URPC
can
channels, are shared by a large number of the communication and thread management
also
spaces
(and
hence
processors)
that
use
be appropriate
on uniprocessors
applications where inexpensive threads physical concurrency within a problem.
running
multithreaded
can be used to express Low overhead threads
the logical and and communi-
cation make it possible to overlap even small amounts of external tion. Further, multithreaded applications that are able to benefit layed
reallocation
tion
protocols
uniprocessor, long
them
is reduced.
can do so without [191. Although
having
reallocation
it can be delayed
to develop will
by scheduling
their
eventually within
computafrom de-
own communicabe necessary
an address
on a
space for as
as possible.
3.1.1
The
Optimistic
Assumptions
Won’
t Alulays
In cases where
Hold.
the
optimistic assumptions do not hold, it is necessary to invoke the kernel to force a processor reallocation from one address space to another. Examples of where it is inappropriate to rely on URPC’S optimistic processor reallocation policy are single-threaded latency must be bounded), initiate
the
and priority call
is of high
1/0
operation
invocations
applications, high-latency early (where
priority).
since
real-time applications 1/0 operations (where it will
the thread
To handle
these
take
a long
executing
situations,
time
(where call it is best to to complete),
the cross-address URPC
address space to force a processor reallocation to the there might still be runnable threads in the client’s.
allows
server’s,
space
the client’s even
though
3.1.2 The Kernel Handles Processor Reallocation. The kernel implements the mechanism that support processor reallocation. When an idle processor discovers an underpowered address space, it invokes a procedure called Processor. Donate, passing in the identity of the address space to which the processor
(on which the invoking thread is running) should be reallocated. Donate transfers control down through the kernel and then up to a prespecified address in the receiving space. The identity of the donating Processor.
address ACM
space is made
TransactIons
known
on Computer
to the receiver
Systems,
Vol
9, No
2, May
by the kernel. 1991
User-Level Interprocess 3.1.3
Voluntary
interface URPC,
Return
defines as with
Communication of
a contract
for Shared Memory Multiprocessors
Processors
Cannot
between
traditional
RPC
a client
systems,
Be
and
in the
185
A service
Guaranteed.
a server.
implicit
.
In
the
contract
case of
is that
the
server obey the policies that determine when a processor is to be returned back to the client. The URPC communication library implements the following policy in the server: upon receipt of a processor from a client address space, return the processor when all have generated replies, or when the become
“underpowered”
(there
outstanding messages from server determines that the
are outstanding
messages
and one of the server’s processors is idle). Although URPC’S runtime libraries implement no way
to enforce
that
kernel
transfers
control
guarantee returning
protocol. of
Just
the
as with
processor
back
a specific
kernel-based to
an
to the client,
protocol,
there
systems,
application,
there
a fair
available
deals
processing
power.
of load balancing
URPC’s
between
direct
reallocation
applications
that
clients,
is
once the is
that the application will voluntarily return the processor from the procedure that the client invoked. The server could,
example, use the processor to handle requests from other this was not what the client had intended. It is necessary to ensure that applications receive problem
the client client has
no by for
even though share only
of the
with
are communicating
the with
one another. Independent of communication, a multiprogrammed system requires policies and mechanisms that balance the load between noncommunicating (or noncooperative) address spaces. Applications, for example, must not be able to starve able
to delay
clients
one another
out for processors,
indefinitely
by
not
returning
and servers processors.
policies, which forcibly reallocate processors from one address other, are therefore necessary to ensure that applications make A processor reallocation policy should be work conserving, high-priority thread waits for a processor while a lower priority and,
by implication,
that
anywhere in the system, specifics of how to enforce
no processor
address be easily
3.2
when
there
is work
not be
Preemptive space to anprogress. in that no thread runs, for
it to do
even if the work is in another address space. The this constraint in a system with user-level threads
are beyond the scope of this paper, The direct processor reallocation optimization space of a ing to that standpoint reason for processor
idles
must
but are discussed done in URPC
by Anderson et al. in [61. can be thought of as an
of a work-conserving policy. A processor idling in the address URPC client can determine which address spaces are not respondclient’s calls, and therefore which address spaces are, from the of the client, the most eligible to receive a processor. There is no the client to first voluntarily relinquish the processor to a global allocator, only to have the allocator then determine to which
space the processor made
by the client
should
be reallocated.
This
is a decision
that
can
itself.
Data Transfer Using Shared Memory
The requirements of a communication system can be relaxed and its efficiency improved when it is layered beneath a procedure call interface rather than exposed to programmers directly. In particular, arguments that are part ACM
TransactIons
on Computer
Systems,
Vol. 9, No
2, May 1991.
186
B. N. Bershad et al
.
of a cross-address space procedure call while still guaranteeing safe interaction
can be passed using shared between mutually suspicious
memory subsys-
tems. Shared memory message of client-server interactions. still overload one another,
channels do not increase the “abusability factor” As with traditional RPC, clients and servers can deny service, provide bogus results, and violate
communication protocols (e. g., fail nel data structures). And, as with
to release traditional
protocols
abuses
to ensure
that
lower
level
a well-defined manner (e. g., by raising down the channel). In URPC, the safety of communication responsibility
of the stubs.
channel RPC,
filter
locks, or corrupt it is up to higher
up to the application exception
a call-faded based
The arguments
on shared
of a URPC!
layer
memory
are passed
is implemented
one address space to another. application programmers deal
necessary
nor sufficient
by having
Such directly
form of messages. But, when standard runtime as is the case with URPC, having the kernel spaces is neither
is the
in message phase server.
unmarshal the message’s data into copying and checking is necessary to
ensure the application’s safety. Traditionally, safe communication copy data from necessary when
in
or by closing
buffers that are allocated and pair-wise mapped during the binding that precedes the first cross-address space call between a client and On receipt of a message, the stubs procedure parameters doing whatever
chanlevel
the
kernel
an implementation is with raw data in the
facilities and stubs are used, copy data between address
to guarantee
safety.
Copying in the kernel is not necessary because programming languages are implemented to pass parameters between procedures on the stack, the heap, or in registers. When data is passed between address spaces, none of these
storage
areas
can,
in general,
be used
directly
by both
the
client
and
the server. Therefore, the stubs must copy the data into and out of the memory used to move arguments between address spaces. Safety is not increased by first doing an extra kernel-level copy of the data. Copying in the kernel is not sufficient for ensuring safety when using type-safe languages such as Modula2 + or Ada [1] since each actual parameter must be checked by the stubs for conformity with the type of its corresponding formal. Without such checking, for example, a client could crash the server by passing in an illegal (e. g., out of range) value for a parameter. These
points
motivate
the use of pair-wise
shared
memory
for cross-address
space communication. Pair-wise shared memory can be used to transfer between address spaces more efficiently, but just as safely, as messages are copied by the kernel between address spaces.
3.2.1
Controlling
different address test-and-set locks nitely on message is simply
if the
spin-wait.
The
ACM
TransactIons
data that
Channel Access. Data flows between URPC packages in spaces over a bidirectional shared memory queue with on either end. To prevent processors from waiting indefichannels, the locks are nonspinning; i.e., the lock protocol
lock
is free,
rationale on Computer
here
acquire
is that
Systems,
Vol
it,
or
else go on
the receiver 9, No
2, May
1991
to something
of a message
else– never
should
never
User-Level Interprocess
Communication
for Shared Memory Multiprocessors
187
.
have to delay while the sender holds the lock. If the sender stands to benefit from the receipt of a message (as on call), then it is up to the sender to ensure that locks are not held indefinitely; if the receiver stands to benefit (as on return), realized
then the sender could just as easily prevent the benefits by not sending the message in the first place.
Cross-Address
3.3
Space Procedure
The calling semantics normal procedure call, performing software address
a server
starts
from
space procedure call, like those with respect to the thread that
the call. Consequently, there system that manages threads
the
server.
thread.
When
the
The server
communication
thread client
client
thread
finishes,
—High
and thread
performance
grained
parallel
management.
thread
it
stops, sends
waiting
on a receive
a response
back
to the
we argue
facilities
are
the following:
necessary
for
fine-
programs.
— While thread management facilities at the user level, high-performance be provided
(send
Finally, the to be started. for the next message. between cross-address space
In brief,
management
function
management synchronization sends a message to the server,
client, which causes the client’s waiting thread stops again on a receive, waiting server’s thread In this subsection we explore the relationship communication
of is
is a strong interaction between the and the one that manages cross-
Each underlying
and receive) involves a corresponding function (start and stop). On call, the which
being
Call and Thread Management
of cross-address are synchronous
space communication.
from
at the user
can be provided either in the kernel or thread management facilities can only
level.
—The close interaction between communication and thread management can be exploited to achieve extremely good performance for both, when both are implemented 3.3.1
threads
Concurrent
within
activities. parallel
together
at the user
Programming
an
address
Concurrency programs
space
needs
and
Thread
simplify
the
can be entirely
for
access to a thread
or
it
to the can
some amount address space.
management
system
that
continuum
major
middleweight,
are three
and lightweight,
for which
on the order of thousands, hundreds, Kernels supporting heavyweight
points thread
application,
as with as with
an
of its own computation with In either case, the program-
of control. can be described
there
of concurrent
be external,
create, schedule, and destroy threads Support for thread management on which
Multiple
Management.
management
internal
multiprocessors,
application that needs to overlap activity on its behalf in another mer
level.
makes in
it possible
terms
of reference: management
of
to
a cost
heavyweight, overheads
are
and tens of psecs, respectively. threads [26, 32, 35] make no distinction
between a thread, the dynamic component of a program, and its address space, the static component. Threads and address spaces are created, scheduled, and destroyed as single entities. Because an address space has a large amount of baggage, such as open file descriptors, page tables, and accounting ACM
Transactions
on Computer
Systems,
Vol
9, No
2, May 1991
188
B. N Bershad et al.
.
state that must be manipulated during thread management operations, operations on heavyweight threads are costly, taking tens of milliseconds. Many contemporary operating system kernels provide support for middleweight, address
threads [12, 361. In contrast to heavyweight middleweight threads are decoupled, so that
or kernel-level spaces and
address space can have many middleweight threads. As with threads, though, the kernel is responsible for implementing management functions, and directly schedules an application’s the
available
defines
physical
a general
processors.
programming
With
middleweight
interface
for
heavyweight the thread threads on
threads,
use by all
threads, a single
the
concurrent
kernel applica-
tions. Unfortunately,
the
grained
parallel
slicing,
or
between
saving
threads
applications,
cost of this
applications. and
restoring
contexts
traps are required kernel must treat
affects
example, floating
can be sources
even those for which
thread management malfeasant programs.
generality
For
the
simple point
performance
features
registers
of performance
the features
overhead
are unnecessary.
of kernel-level for all parallel
weight
for
all
A kernel-level
thread management operation, and the suspiciously, checking arguments and
from
the effect
of thread
to the high cost of but more pervasive
management
on program
policy
There are a large number of parallel programming models, and a wide variety of scheduling disciplines that are most appropri-
ate (for performance) to a given strongly influenced by the choice rithms used to implement threads,
In response to lightweight
switching
system must also protect itself from malfunctioning or Expensive (relative to procedure call) protected kernel to invoke any each invocation
stemming
when
as time
degradation
access privileges to ensure legitimacy. In addition to the direct overheads that contribute kernel-level thread management, there is an indirect performance. within these,
of fine-
such
thread is unlikely programs.
model [25, 33]. Performance, though, is of interfaces, data structures, and algoso a single model represented by one style to have
an implementation
to the costs of kernel-level threads, threads that execute in the context
threads
provided
by the kernel,
but
that
is efficient
programmers have turned of middleweight or heavy-
are managed
at the user level
by
a library linked in with each application [9, 38]. Lightweight thread managetwo-level scheduling, since the application library schedules ment implies lightweight threads on top of weightier threads, which are themselves being scheduled by the kernel. Two-level schedulers attack both the direct and indirect costs of kernel threads. Directly, it is possible to implement efficient user-level thread management functions because they are accessible through a simple procedure call rather than a trap, need not be bulletproofed against application errors, and can be customized to provide only the level of support needed by a given application. The cost of thread management operations in a user-level scheduler
3.3.2 for two ACM
can be orders The
Problem
reasons:
fiansactlons
of magnitude
with
Having
(1) to synchronize
on Computer
Systems,
Vol
less than Functions
their 9, No
for kernel-level at
activities 2, May 1991
Two
Levels.
within
threads.
Threads an address
block space
User-Level Interprocess
Communication
for Shared Memory Multiprocessors
(e.g., while controlling access to critical sections], events in other address spaces (e. g., while reading manager). pensive kernel,
If threads
are implemented
at the
and flexible concurrency), but then the performance benefits
switching
are lost when
scheduler
accurately
reflects
In contrast,
when
efficient
user-level
(necessary
in separate
address
of the application,
awaiting
and thread
and
management
scheduling
and then
kernel-mediated
com-
are moved
together at user level, the system can control threads
synchronization
for inex-
is implemented in the scheduling and context
activities
state
applications notified.
communication
of the kernel and implemented needed by the communication
level
operation must be implemented and the application, so that the user-level
the scheduling
again (2) in the kernel, so that munication activity are properly
user
between
189
and (2) to wait for external keystrokes from a window
communication of inexpensive
synchronizing
spaces. In effect, each synchronizing executed at two levels: (1) once in
.
out
synchronization with the same
primitives
that
are used
by applications.
3.4 Summary of Rationale This
section
has
described
the
design
advantages of a user-level approach made without invoking the kernel processors
between
processor reallocation over multiple calls. grated
with
spaces.
do turn Finally,
out to be necessary, their user-level communication
thread
programs, as within,
4. THE PERFORMANCE
Further,
as
measurements megabytes
4.1
have
kernel
processor described
many
as
dedicated
main
invocation
and
cost can be amortized can be cleanly intenecessary
can inexpensively
in
this
of URPC by DEC’s
six to
C-VAX
1/0.
paper
for
fine-
synchronize
The
had
on the Firefly, an experiSystems Research Center. microprocessors, Firefly
four
used
C-VAX
to
plus
one
collect
the
processors
and
32
of memory.
User-Level Thread Management
Performance
In this section we compare the performance of thread management when implemented at the user level and in the kernel. The threads are those provided by Taos, the Firefly’s native operating Table mented invokes
The
calls can be reallocating
OF URPC
A
can
URPC.
facilities,
programs spaces.
the performance workstation built
Firefly
when
management
so that address
This section describes mental multiprocessor MicroVAX-11
behind
address
user-level
grained parallel between, as well
rationale
to communication are that and without unnecessarily
primitives kernel-level system.
I shows the cost of several thread management operations impleat the user level and at the kernel level. PROCEDURE CALL the null procedure and provides a baseline value for the machine’s
performance. FORK creates, starts, and terminates a thread. FORK;JOIN is like FORK, except that the thread which creates the new thread blocks until the forked thread completes. YIELD forces a thread to make a full trip through the scheduling machinery; when a thread yields the processor, it is blocked (state saved), made ready (enqueued on the ready queue), scheduled ACM
Transactions
on Computer
Systems,
Vol
9, No. 2, May 1991
190
.
B. N Bershad et al Table
1.
Comparative
Performance
of Thread
URPC Test
II
Component
(ysecs)
7 43 102 37 27 53
7 1192 1574 57 27 271
(dequeued),
Server (ysecs)
18
13
poll
6
6
receme
10 20
25
Total
54
53
unblocked
and forth on a condition means of paired signal Each
(state
measured in
the
was
elapsed
table.
processor.
variable, and wait
and unblocks
test
The
run
restored),
was divided
using
the
continued.
ACQUIRE;
in succession.
number
FastThreads
tests
and
a blocking (nonspinning) lock for has two threads “pingpong” back
blocking and unblocking one another by operations. Each pingpong cycle blocks,
two threads a large
time
The
9
dispatch
RELEASE acquires and then releases which there is no contention. PING-PONG
schedules,
of a URPC
Client (~secs)
send
again
Taos Threads
FastThreads
Breakdown
Component
Operations
(psecs)
PROCEDURE CALL FORK FORK;JOIN YIELD ACQUIRE, RELEASE PINGPONG
Table
Management
of times
as part
by the loop limit
tests
were
kernel-level
of a loop,
to determine
constrained
to run
threads
no such
had
and
the
the values on a single constraint;
Taos caches recently executed kernel-level threads on idle processors to reduce wakeup latency. Because the tests were run on an otherwise idle machine, the caching optimization worked well and we saw little variance in the measurements. The table demonstrates the performance advantage of implementing thread management at the user-level, where the operations have an overhead hundred, as when 4.2
on the threads
URPC Component
order of a few procedure calls, are provided by the kernel.
rather
than
a few
Breakdown
In the case when an address space has enough processing power to handle all incoming messages without requiring a processor reallocation, the work done send (enqueuing a during a URPC can be broken down into four components: poll (detecting the channel on which a message on an outgoing channel); receive (dequeueing the message); and dispatch (dismessage has arrived); patching the appropriate thread to handle an incoming message). Table II shows the time taken by each component. The total processing times are almost the same for the client and the server. Nearly half of that time goes towards thread management (dispatch), ACM TransactIons
on Computer
Systems,
Vol
9, No
2, May 1991
User-Level Interprocess demonstrating
the
munication The
influence
for Shared Memory Mulhprocessors
that
thread
management
.
overhead
has
191 on
com-
performance.
send
and
marshaling
receive
that
spaces.
On
buffers
adds
passed
in
average Table
Communication
is
the
Firefly,
about
most
do
cross-address
and
not
the
cost
of
psec space
is
reflect
cost
are
data
of
data
passed
passed
of latency.
in
calls
the
influenced
by
is
and
address
shared
the
memory
amount
small
the
copying
between
Because
procedure
most
of data
the
parameters
word
additional
performance
not
when
each
one
call II,
times
incurred
[8,
of data
16,
components
17,
24],
shown
in
transfer.
4.3 Call Latency and Throughput The
figures
there
in
Table
is no need
latency
and
call
throughput server
II
for
depend
space,
on
for
are
the
S,
and
of client
reallocation. load
number the
Two
and
dependent, of
client
number
of
server
other
load,
important
though.
Both
processors,
runnable
C,
call
latency
and
the
threads
provided
metrics, number
in
the
of
client’s
T.
To evaluate values
independent
throughput,
processors,
address
are
processor
T,
latency
and
C,
S
and
throughput, in
we ran
which
threads to make 100,000 “Null” loop. The Null procedure call
we
timed
a series how
of tests
long
procedure calls into takes no arguments,
it
using
took
different
the
the server computes
client’s
T
inside a tight nothing, and
returns no results; it exposes the performance characteristics of a round-trip cross-address space control transfer. Figures 3 and 4 graph call latency and throughput as a function of T. Each line on the graphs represents a different combination of S and C. Latency is measured as the time from when a thread calls into the Null stub to when control returns from the stub, and depends on the speed of the message primitives and the length of the thread ready queues in the client and server. When (T>
the
number
processor
of caller
call
C + S),
latency
in the client
exceeds since
or the server
to T. Except
is less sensitive call processing
threads increases
times
total call
or both.
for the special
in the client
the each
number must
When
the
number
of calling
on the other
case where
S = 0, as long
and server
message wasted
transfer. processing
threads,
are roughly
client
equal T >
hand, as the
(see Table
II),
At that no further
C + S.
can
processors,
a free
and
be
server
proces-
latency is 93 ~secs. This is “pure” latency, in basic processing required to do a round-trip
1 Unfortunately, time
for
Throughput,
until then throughput, as a function of T, improves point, processors are kept completely busy, so there T. improvement in throughput by increasing sors is 1 (T = C = S = 1), call the sense that it reflects the
of processors
wait
the
due to idling
latency
in the
includes
client
and
a large server
amount
while
of
waiting
1 The 14 psec discrepancy between the obseved latency and the expected latency (54 + 53, cf., Table II) is due to a low-level scheduling optimization that allows the caller thread’s context to remain loaded on the processor if the ready queue is empty and there is no other work to be done anywhere in the system. In this case, the idle loop executes in the context of the caller thread, enabling a fast wakeup when the reply message arrives. At 2’ = C = S, the optimization is enabled. The optimization is also responsible for the small “spike” in the throughput curves shown in Figure 4. ACM
Transactions
on Computer
Systems,
Vol. 9, No
2, May 1991.
192
.
B. N. Bershad et al.
1500–
C=l,
s=o
C=l,
S=2 s=l
14001300120011001000 –
900Call
800-
Latency
700-
(psecs)
/c=l,
600500400 -
/<,..””
~=*~5~2s==2 ‘-- ‘
/’
300 200 100 -
k’~”;”;
‘
“
,---”-’”
o~ 12
8 Number Fig
for the next
call
or reply
of Client
3
Threads
9
10
{T)
URPC call latency
message.
At
T =
2, C = 1, S = 1, latency
increases
by only 20 percent (to 112 Wsecs), but throughput increases by nearly 75 percent (from 10,300 to 17,850 calls per second). This increase reflects the fact that message processing can be done in parallel between the client and server.
Operating
as a two-stage
pipeline,
the client
and server
complete
call every 53 psecs. For larger values of S and C, the same analysis holds. As long sufficiently large to keep the processors in the client and server spaces busy, throughput Note that throughput
is maximized. for S = C = 2 is not twice
as large
one
as T is address
as for S = C = 1
(23,800 vs. 17,850 calls per second, a 33 percent improvement). Contention for the critical sections that manage access to the thread and message queues in each address space limits throughput over a single channel. Of the 148 instructions required to do a round-trip message transfer, approximately half are part of critical sections guarded by two separate locks kept on a perchannel basis. With four processors, the critical sections are therefore nearly always occupied, so there is slowdown due to queueing at the critical sections. This factor does not constrain the aggregate multiple clients and servers since they use different When reallocate round-trip sor ACM
S = O throughput and latency are at their processors frequently between the client latency
is 375
reallocations.
At this
Transactions
on Computer
psecs. point, Systems,
Every URPC Vol
call
requires
performs
9, No
worse
2, May 1991
rate of URPCS channels.
between
worst due to the need to and server. When T = 1, two
traps
than
and
LRPC
two
proces-
(157
psecs).
User-Level Interprocess
Communication
for Shared Memory Multiprocessors
.
193
25000
---1
—C=2’S’2
A — — —
/
/
20000 4
/–––__–___c=2, ~<:.— ——.— —————— ~ -’-----------------------
15000
s=l —_ —c=l S=l ~=1:’=2
///
Calls
$ f
per Second 10000 1
I
C=l,
s=o
5000
I
o~
2345678
1
Number Fig. 4.
9 of Client
Threads
(T)
URPC call throughput
We consider this comparison further in Section 5.5. mance is worse, URPC’S thread management primitives performance overheads proves
be amortized
steadily,
and
has been absorbed, 4.4
I). As T increases,
(see Table can
Performance
over
latency,
the
once the
worsens
10
only
though,
the trap
outstanding initial
because
Although call perforretain their superior calls.
shock
and reallocation Throughput
of processor
of scheduling
im-
reallocation
delays.
Summary
We have described the performance of the communication and thread management primitives for an implementation of URPC on the Firefly multiprocessor. The performance figures demonstrate the advantages of moving
traditional
operating
them at the user level. Our approach stands ogy.
Normally,
system in contrast
operating
system
order to improve performance we are pushing functionality and
functions
out
to traditional functionality
of kernel
and
operating
implementing
system
is pushed
into
methodol-
the
kernel
in
at the cost of reduced flexibility. With URPC, out of the kernel to improve both performance
flexibility.
5. WORK RELATED TO URPC In this of
section
URPC.
The
we examine common
several theme
other
communication
underlying
ACM Transactions
these
on Computer
systems other
Systems,
in the light
systems
is
that
Vol. 9, No, 2, May 1991
194
.
B N Bershad et al
they
reduce
the
kernel’s
role
in
communication
in
an
effort
to
improve
performance.
5.1 Sylvan Sylvan [141 relies on architectural message-passing primitives. Special
support to implement coprocessor instructions
the Thoth [151 are used to ac-
cess these primitives, so that messages can be passed between Thoth tasks without kernel intervention. A pair of 68020 processors, using a coprocessor to implement the communication primitives, can execute a send-receiue-reply cycle on a zero-byte message (i. e., the Null call) in 48 ~secs. Although ing
the
Sylvan
outperforms
URPC,
use of special-purpose
the
hardware.
enqueueing and dequeueing messages and when a thread becomes runnable or blocked. handles
thread
scheduling
and context
improvement
Sylvan’s
is small
coprocessor
updating Software
switching.
consider-
takes
care
of
thread control blocks on the main processor
As Table
II shows,
though,
thread management can be responsible for a large portion of the round-trip processing time. Consequently, a special-purpose coprocessor can offer only limited performance gains. 5.2
Yackos
Yackos
[22] is similar
communication Messages
and
to URPC
in that
is intended
are passed
between
for
it strives
to bypass
use on shared
address
memory
spaces by means
the kernel
during
multiprocessors.
of a special
message-
passing process is no automatic
that shares memory with all communicating processes. There load balancing among the processors cooperating to provide a
Yackos
Address
service.
spaces are single
threaded,
and clients
communicate
with a process running on a single processor. That process is responsible for forwarding the message to a less busy process if necessary. In URPC, address spaces are multithreaded, so a message sent from one address space to another can be fielded by any processor allocated to the receiving address space.
Further,
URPC
on
a multiprocessor
in
the
best
case
(no
kernel
involvement) outperforms Yackos by a factor of two for a round-trip call (93 VS. 200 psecs).2 The difference in communication performance is due mostly to the indirection through the interposing Yackos message-passing process. This
indirection
results
in increased
latency
and data copying. By taking advantage in an RPC relationship, URPC avoids
due to message
queueing,
polling,
of the client/server pairing implicit the indirection through the use of
pair-wise shared message queues. In the pessimistic case (synchronous cessor allocation on every message send), URPC is over thirteen times efficient (375 vs. 5000 psecs) than Yackos.
5.3
promore
5.3 Promises

Argus [28] provides the programmer with a mechanism called a Promise [27] that can be used to make an asynchronous cross-address space procedure call. The caller thread continues to run, possibly in parallel with the callee, until the time that the call's results are needed. If a thread attempts to "collect" on a Promise that has not yet been fulfilled (i.e., the cross-address space call has not yet completed), the collecting thread blocks.

A Promise is used to compensate for the high cost of thread management in the Argus system. Although an asynchronous call can be simulated by forking a thread and having the forked thread do the cross-address space call synchronously, the cost of thread creation in Argus [29] (around 20 procedure call times) precludes this approach. In URPC, the cost of thread creation is on the order of six procedure call times, and so the issue of whether to provide explicit language support for asynchronous cross-address space calls becomes one largely of preference and style, rather than performance.
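As a rough illustration of these semantics, the C sketch below shows collection and fulfillment of a Promise. The names are hypothetical, since Argus provides Promises at the language level rather than as a library, and yielding stands in for a real thread-blocking primitive.

    #include <sched.h>

    typedef struct {
        volatile int done;   /* set when the cross-address space call completes */
        long result;         /* the call's reply value                          */
    } Promise;

    /* The caller keeps running after issuing the asynchronous call and
       blocks only here, if it collects before the Promise is fulfilled. */
    long promise_collect(Promise *p)
    {
        while (!p->done)
            sched_yield();
        return p->result;
    }

    /* Executed by the callee's reply path. The result must be made
       visible before the done flag is set. */
    void promise_fulfill(Promise *p, long value)
    {
        p->result = value;
        p->done = 1;
    }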
5.4 Camelot

Camelot [19] is a high-performance distributed transaction processing system. Key to the system's performance is its use of recoverable virtual memory and write-ahead logging. Because of the cost of Mach's communication primitives [18], data servers use shared memory queues to communicate virtual memory control information and log records to the disk manager on each machine. When the queues become full, data servers use Mach's kernel-level RPC system to force a processor reallocation to the disk manager so that it can process the pending messages in the queue.

URPC generalizes the ad hoc message-passing approach used by Camelot. Camelot's shared memory queues can only be used between the data servers and the disk manager. URPC's channels are implemented beneath the stub layer, allowing any client and any server to communicate through a standard RPC interface. Camelot's thread management primitives are not integrated with those that access the shared memory queues. Finally, Camelot's shared memory queues support only unidirectional communication, whereas communication in URPC is bidirectional, supporting request-reply interaction.

At the lowest level, URPC is simply a protocol by which threads in separate address spaces access shared memory in a disciplined fashion. Programmers have often used shared memory for high-bandwidth cross-address space communication, but have had to implement special-purpose interfaces. URPC formalizes this interface by presenting it in the framework of standard programming abstractions: procedure call and inexpensive threads.
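The pattern that Camelot applies between its data servers and the disk manager, and that URPC generalizes, can be sketched in C as follows, reusing the hypothetical Channel and channel_send from the earlier sketch; kernel_rpc_wakeup is likewise a hypothetical stand-in for the kernel-level RPC that forces a processor reallocation.

    /* Hypothetical kernel entry whose only job is to get a processor
       running in the named address space. */
    extern void kernel_rpc_wakeup(int address_space);

    #define DISK_MANAGER 1   /* illustrative address-space identifier */

    /* Optimistically enqueue through shared memory; involve the kernel
       only when the consumer has no processor draining the queue. */
    void send_log_record(Channel *c, const void *rec, unsigned len)
    {
        while (!channel_send(c, rec, len))
            kernel_rpc_wakeup(DISK_MANAGER);   /* queue full: kick the consumer */
    }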
5.5 LRPC

LRPC demonstrates that it is possible to communicate between address spaces by way of the kernel at "hardware speed." Like URPC, LRPC passes parameters in shared memory that is pair-wise mapped between the client and server. Unlike URPC, though, LRPC uses kernel-level threads that pass between address spaces on each call and return. LRPC greatly reduces the amount of thread management that must be done on each call, but is still performance limited by kernel mediation.
In the worst case, when there are no processors allocated to the server and there is one client thread, each call involves two kernel invocations and two processor reallocations, one to the server and one to the client. For the Null call, the latency is 375 μsecs for URPC; on the same hardware, LRPC takes 157 μsecs. There are two reasons why LRPC outperforms URPC in this case, both involving processor reallocation; the first is an artifact of URPC's implementation, while the second is inherent in URPC's approach:

—Processor reallocation in URPC is based on LRPC. URPC implements processor allocation using LRPC, which is a general cross-address space procedure call facility for kernel-level threads. We used LRPC because it was available and integrated into the Firefly operating system. The overhead of LRPC-based processor reallocation is 180 μsecs, but we estimate that a special-purpose reallocation mechanism would be about 30 percent faster.

—URPC is integrated with two-level scheduling. Processor reallocation occurs after two distinct scheduling decisions are made: (1) Is there an idle processor, and (2) is there an underpowered address space to which it can be reallocated? In contrast, a call using LRPC makes no scheduling decisions (one of the main reasons for its speed). Making these two decisions is more expensive than not making them, but is necessary to take advantage of the inexpensive synchronization and scheduling functions made possible by user-level thread management. The overhead of these two scheduling decisions is about 100 μsecs per round-trip call (only in the case where processor reallocation is required, of course).

It should not be surprising that indirecting through a second-level scheduler, which manages threads, increases the cost of accessing the scheduler at the first level, which manages processors. The trick is to interact infrequently with the first-level scheduler. When the first-level scheduler must be used, the overhead of having to pass through the second-level mechanism will inevitably degrade performance relative to a system with only a single level of scheduling.
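A sketch in C of these two decisions follows; the helper names are hypothetical stand-ins for the two-level scheduler's internal state.

    typedef struct AddressSpace AddressSpace;
    typedef struct Processor Processor;

    /* Hypothetical queries over the user-level scheduler's state. */
    extern Processor *find_idle_processor(void);                     /* decision (1) */
    extern int        is_underpowered(const AddressSpace *space);    /* decision (2) */
    extern void       reallocate(Processor *p, AddressSpace *space); /* via LRPC     */

    /* Considered on a message send; adds roughly 100 usecs per
       round-trip call when a reallocation actually occurs. */
    void consider_reallocation(AddressSpace *receiver)
    {
        Processor *p = find_idle_processor();
        if (p != NULL && is_underpowered(receiver))
            reallocate(p, receiver);
        /* Otherwise no kernel interaction: the receiver finds the message
           the next time one of its processors polls its channels. */
    }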
6. CONCLUSIONS

This paper has described the motivation, design, implementation, and performance of URPC, a new approach that addresses the problems of kernel-based communication by moving traditional operating system functionality out of the kernel and up to the user level. We believe that URPC represents the appropriate division of responsibility for the operating system kernels of shared memory multiprocessors.

While it is a straightforward task to port a uniprocessor operating system to a multiprocessor by adding kernel support for threads and synchronization, it is another thing entirely to design facilities for a multiprocessor that will enable programmers to fully exploit the processing power of the machine. URPC demonstrates that one way in which a multiprocessor's performance potential can be greatly increased is by designing system facilities for a multiprocessor in the first place, thereby making a distinction between a multiprocessor operating system and a uniprocessor operating system that happens to run on a multiprocessor.
REFERENCES

1. UNITED STATES DEPARTMENT OF DEFENSE. Reference Manual for the Ada Programming Language. July 1980.
2. AGARWAL, A. Performance tradeoffs in multithreaded processors. Tech. Rep. MIT VLSI Memo 89-566, Laboratory for Computer Science, Massachusetts Institute of Technology, Sept. 1989.
3. AGARWAL, A., HENNESSY, J., AND HOROWITZ, M. Cache performance of operating system and multiprogramming workloads. ACM Trans. Comput. Syst. 6, 4 (Nov. 1988), 393-431.
4. AGARWAL, A., LIM, B.-H., KRANZ, D., AND KUBIATOWICZ, J. APRIL: A processor architecture for multiprocessing. In Proceedings of the 17th Annual International Symposium on Computer Architecture (May 1990), 104-114.
5. ANDERSON, T. E., LAZOWSKA, E. D., AND LEVY, H. M. The performance implications of thread management alternatives for shared memory multiprocessors. IEEE Trans. Comput. 38, 12 (Dec. 1989), 1631-1644.
6. ANDERSON, T. E., BERSHAD, B. N., LAZOWSKA, E. D., AND LEVY, H. M. Efficient user-level thread management for shared memory multiprocessors. Tech. Rep. 90-04-02, Dept. of Computer Science and Engineering, Univ. of Washington, April 1990. To appear in Proceedings of the 13th ACM Symposium on Operating Systems Principles.
7. BBN LABORATORIES. Butterfly Parallel Processor Overview. BBN Labs., Cambridge, Mass., June 1985.
8. BERSHAD, B. N. High performance cross-address space communication. Ph.D. dissertation, Tech. Rep. 90-06-02, Dept. of Computer Science and Engineering, Univ. of Washington, June 1990.
9. BERSHAD, B. N., LAZOWSKA, E. D., AND LEVY, H. M. PRESTO: A system for object-oriented parallel programming. Softw. Pract. Exper. 18, 8 (Aug. 1988), 713-732.
10. BERSHAD, B. N., ANDERSON, T. E., LAZOWSKA, E. D., AND LEVY, H. M. Lightweight remote procedure call. ACM Trans. Comput. Syst. 8, 1 (Feb. 1990), 37-55. Also appeared in Proceedings of the 12th ACM Symposium on Operating Systems Principles (Dec. 1989).
11. BIRRELL, A. D. An introduction to programming with threads. Tech. Rep. 35, Digital Equipment Corporation Systems Research Center, Palo Alto, Calif., Jan. 1989.
12. BIRRELL, A. D., AND NELSON, B. J. Implementing remote procedure calls. ACM Trans. Comput. Syst. 2, 1 (Feb. 1984), 39-59.
13. BLACK, D. L. Scheduling support for concurrency and parallelism in the Mach operating system. IEEE Computer 23, 5 (May 1990), 35-43.
14. BURKOWSKI, F. J., CORMACK, G., AND DUECK, G. Architectural support for synchronous task communication. In Proceedings of the 3rd Conference on Architectural Support for Programming Languages and Operating Systems (April 1989), 40-53.
15. CHERITON, D. R. Thoth: A portable real-time operating system. Commun. ACM 22, 2 (Feb. 1979), 105-115.
16. CHERITON, D. R. The V distributed system. Commun. ACM 31, 3 (Mar. 1988), 314-333.
17. COOK, D. The evaluation of a protection system. Ph.D. dissertation, Computer Laboratory, Cambridge Univ., 1978.
18. DUCHAMP, D. Analysis of transaction management performance. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (Dec. 1989), 177-190.
19. EPPINGER, J. L. Virtual memory management for transaction processing systems. Ph.D. dissertation, Dept. of Computer Science, Carnegie Mellon Univ., Feb. 1989.
20. GUPTA, A., TUCKER, A., AND URUSHIBARA, S. The impact of operating system scheduling policies and synchronization methods on the performance of parallel applications. In Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May 1991).
21. HALSTEAD, R. Multilisp: A language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst. 7, 4 (Oct. 1985), 501-538.
22. HENSGEN, D., AND FINKEL, R. Dynamic server squads in Yackos. Tech. Rep. 138-89, Dept. of Computer Science, Univ. of Kentucky, 1989.
23. JONES, M. B., AND RASHID, R. F. Mach and Matchmaker: Kernel and language support for object-oriented distributed systems. In Proceedings of the ACM Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '86) (Sept. 1986), 67-77.
24. KARGER, P. A. Using registers to optimize cross-domain call performance. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems (April 1989), 194-204.
25. LENOSKI, D., LAUDON, J., GHARACHORLOO, K., GUPTA, A., AND HENNESSY, J. The directory-based cache coherence protocol in the DASH multiprocessor. In Proceedings of the 17th Annual International Symposium on Computer Architecture (May 1990), 148-159.
26. LEVY, H. M., AND ECKHOUSE, JR., R. H. Computer Programming and Architecture: The VAX-11. 2nd ed. Digital Press, Bedford, Mass., 1989.
27. LISKOV, B., AND SHRIRA, L. Promises: Linguistic support for efficient asynchronous procedure calls in distributed systems. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation (June 1988), 260-267.
28. LISKOV, B. Distributed programming in Argus. Commun. ACM 31, 3 (Mar. 1988), 300-312.
29. LISKOV, B., CURTIS, D., JOHNSON, P., AND SCHEIFLER, R. Implementation of Argus. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (Nov. 1987), 111-122.
30. NELSON, B. J. Remote procedure call. Ph.D. dissertation, Dept. of Computer Science, Carnegie-Mellon Univ., May 1981.
31. OUSTERHOUT, J. K. Partitioning and cooperation in a distributed multiprocessor operating system. Ph.D. dissertation, Dept. of Computer Science, Carnegie-Mellon Univ., April 1980.
32. RITCHIE, D., AND THOMPSON, K. The Unix time-sharing system. Commun. ACM 17, 7 (July 1974), 365-375.
33. ROBERTS, E. S., AND VANDEVOORDE, M. T. WorkCrews: An abstraction for controlling parallelism. Tech. Rep. 42, Digital Equipment Corporation Systems Research Center, Palo Alto, Calif., April 1989.
34. ROVNER, P., LEVIN, R., AND WICK, J. On extending Modula-2 for building large, integrated systems. Tech. Rep. 3, Digital Equipment Corporation Systems Research Center, Palo Alto, Calif., Jan. 1985.
35. SEQUENT COMPUTER SYSTEMS, INC. Symmetry Technical Summary.
36. TEVANIAN, A., RASHID, R. F., GOLUB, D. B., BLACK, D. L., COOPER, E., AND YOUNG, M. W. Mach threads and the Unix kernel: The battle for control. In Proceedings of the USENIX Summer Conference (1987), 185-197.
37. THACKER, C. P., STEWART, L. C., AND SATTERTHWAITE, JR., E. H. Firefly: A multiprocessor workstation. IEEE Trans. Comput. 37, 8 (Aug. 1988), 909-920.
38. WEISER, M., DEMERS, A., AND HAUSER, C. The portable common runtime approach to interoperability. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (Dec. 1989), 114-122.

Received July 1990; revised February 1991; accepted February 1991