User-Level Interprocess Communication for Shared Memory Multiprocessors

BRIAN N. BERSHAD, Carnegie Mellon University
THOMAS E. ANDERSON, EDWARD D. LAZOWSKA, and HENRY M. LEVY, University of Washington

Interprocess communication (IPC), communication between address spaces on the same machine, has become central to the design of contemporary operating systems. IPC has traditionally been the responsibility of the kernel, but kernel-based IPC has two inherent problems. First, its performance is architecturally limited by the cost of invoking the kernel and reallocating a processor from one address space to another. Second, applications that need inexpensive threads must provide their own thread management facilities at the user level, and encounter functional and performance problems stemming from the interaction between kernel-level communication and user-level thread management. On a shared memory multiprocessor, these problems can be solved by moving communication out of the kernel to the user level. This paper describes User-Level Remote Procedure Call (URPC), a communication facility that combines cross-address space communication through shared memory with lightweight threads managed at the user level. This unconventional structure allows the kernel to be bypassed during cross-address space communication, since each address space manages its own threads and processor reallocation can often be avoided. The programmer sees threads and a conventional Remote Procedure Call interface; communication performance is improved since the kernel no longer mediates every call.

Index terms: thread, local communication, multiprocessor, operating system, parallel programming, performance

Categories and Subject Descriptors: D.3.3 [Programming Languages]: Language Constructs and Features — modules, packages; D.4.1 [Operating Systems]: Process Management — concurrency, multiprocessing/multiprogramming; D.4.4 [Operating Systems]: Communications Management; D.4.6 [Operating Systems]: Security and Protection — access controls, information flow controls; D.4.7 [Operating Systems]: Organization and Design; D.4.8 [Operating Systems]: Performance — measurements

General Terms: Design, Measurement, Performance

Additional Key Words and Phrases: Modularity, remote procedure call, small-kernel operating system

This material is based on work supported by the National Science Foundation (grants CCR-8619663, CCR-8700106, CCR-8703049, and CCR-8907666), the Washington Technology Center, and Digital Equipment Corporation (the Systems Research Center and the External Research Program). Bershad was supported by an AT&T Ph.D. Scholarship, and Anderson by an IBM Graduate Fellowship.

Authors' addresses: B. N. Bershad, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213; T. E. Anderson, E. D. Lazowska, and H. M. Levy, Department of Computer Science and Engineering, FR-35, University of Washington, Seattle, WA 98195.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1991 ACM 0734-2071/91/0500-0175 $01.50

ACM Transactions on Computer Systems, Vol. 9, No. 2, May 1991, Pages 175-198.
1. INTRODUCTION

Efficient interprocess communication is central to the design of contemporary operating systems [16, 23]. An efficient communication facility encourages system decomposition across address space boundaries. Decomposed systems have several advantages over more monolithic ones, including failure isolation (address space boundaries prevent a fault in one module from "leaking" into another), extensibility (new modules can be added to the system without having to modify existing ones), and modularity (interfaces are enforced by mechanism rather than by convention).

Although address spaces can be a useful structuring device, the extent to which they can be used depends on the performance of the communication primitives. If cross-address space communication is slow, the structuring benefits that come from decomposition are difficult to justify to end users, whose primary concern is system performance, and who treat the entire operating system as a "black box" [18] regardless of its internal structure. Consequently, designers are forced to coalesce weakly related subsystems into the same address space, trading away failure isolation, extensibility, and modularity for performance.

Interprocess communication has traditionally been the responsibility of the operating system kernel. However, kernel-based communication has two problems:

—Architectural performance barriers. The performance of kernel-based synchronous communication is architecturally limited by the cost of invoking the kernel and reallocating a processor from one address space to another. In our earlier work on Lightweight Remote Procedure Call (LRPC) [10], we show that it is possible to reduce the overhead of a kernel-mediated cross-address space call to nearly the limit possible on a conventional processor architecture: the time for LRPC to perform a cross-address space call is only slightly greater than twice that required to invoke the kernel and have it reallocate a processor from one address space to another. The efficiency of LRPC comes from taking a "common case" approach to communication, thereby avoiding unnecessary synchronization, kernel-level thread management, and data copying for calls between address spaces on the same machine. The majority of LRPC's overhead (70 percent) can be attributed directly to the fact that the kernel mediates every cross-address space call.

—Interaction between kernel-based communication and high-performance user-level threads. The performance of a parallel application running on a multiprocessor can strongly depend on the efficiency of thread management operations. Medium and fine-grained parallel applications must use a thread management system implemented at the user level to obtain satisfactory performance [6, 36]. Communication and thread management have strong interdependencies, though, and the cost of partitioning them across protection boundaries (kernel-level communication and user-level thread management) is high in terms of performance and system complexity.
On a shared memory multiprocessor, these problems have a solution: eliminate the kernel from the path of cross-address space communication. Because address spaces can share memory directly, shared memory can be used as the data transfer channel. Because a shared memory multiprocessor has more than one processor, processor reallocation can often be avoided by taking advantage of a processor already active in the target address space, without involving the kernel. User-level management of cross-address space communication results in improved performance over kernel-mediated communication:

—Messages are sent between address spaces directly, without invoking the kernel.
—Unnecessary processor reallocation between address spaces is eliminated, reducing call overhead and helping to preserve valuable cache and TLB contexts.
—When processor reallocation does prove to be necessary, its cost can be amortized over several independent calls.
—The parallelism inherent in the sending and receiving of messages can be exploited, improving call performance.

User-Level Remote Procedure Call (URPC), the communication system presented in this paper, provides safe and efficient communication between address spaces on the same machine without kernel mediation. URPC isolates the three components of interprocess communication from one another: processor reallocation, thread management, and data transfer. Only processor reallocation requires kernel involvement; thread management and data transfer do not. Thread management and interprocess communication are done by application-level libraries, rather than by the kernel.

URPC's approach stands in contrast to conventional "small kernel" systems in which the kernel is responsible for address spaces, thread management, and interprocess communication. Moving communication and thread management to the user level leaves the kernel responsible only for the mechanisms that allocate processors to address spaces. For reasons of performance and flexibility, this is an appropriate division of responsibility for operating systems of shared memory multiprocessors. (URPC can also be appropriate for uniprocessors running multithreaded applications.)

The latency of a simple cross-address space procedure call is 93 µsecs using a URPC implementation running on the Firefly, DEC SRC's multiprocessor workstation [37]. Operating as a pipeline, two processors can complete one call every 53 µsecs. On the same multiprocessor hardware, RPC facilities that involve the kernel are markedly slower, without providing adequate support for user-level thread management. LRPC, a high performance kernel-based implementation, takes 157 µsecs. The Fork operation for the user-level threads that accompany URPC takes 43 µsecs, while the Fork operation for the kernel-level threads that accompany LRPC takes over a millisecond. To put these figures in perspective, a same-address space procedure call takes 7 µsecs on the Firefly, and a protected kernel invocation (trap) takes 20 µsecs.

We describe the mechanics of URPC in more detail in the next section. In Section 3 we discuss the design rationale behind URPC. We discuss performance in Section 4. In Section 5 we survey related work. Finally, in Section 6 we present our conclusions.
2. A USER-LEVEL REMOTE PROCEDURE CALL FACILITY

In many contemporary uniprocessor and multiprocessor operating systems, applications communicate with operating system services and with one another by sending messages across narrow channels or ports that support only a small number of operations (create, send, receive, destroy). Messages permit communication between programs on the same machine, separated by address space boundaries, or between programs on different machines, separated by physical boundaries.

Although messages are a powerful communication primitive, they represent a control and data-structuring device that is foreign to traditional Algol-like languages, which support synchronous procedure call, data typing, and shared memory within an address space. Communication between address spaces is with untyped, asynchronous messages.
Programmers of message-passing systems who must use one of the many popular Algol-like languages as a systems-building platform must think, program, and structure according to two quite different programming paradigms. Consequently, almost every mature message-based operating system also includes support for Remote Procedure Call (RPC) [11], enabling messages to serve as the underlying transport mechanism beneath a procedure call interface.

Nelson defines RPC as the synchronous language-level transfer of control between programs in disjoint address spaces whose primary communication medium is a narrow channel [30]. The definition of RPC is silent about the operation of that narrow channel and about how the processor scheduling (reallocation) mechanisms interact with data transfer. URPC exploits this silence in two ways:

—Messages are passed between address spaces through logical channels kept in memory that is pair-wise shared between the client and server. The channels are created and mapped once for every client/server pairing, and are used for all subsequent communication between the two address spaces, so that several interfaces can be multiplexed over the same channel. The integrity of data passed through the shared memory channel is ensured by a combination of the pair-wise mapping (message authentication is implied statically) and the URPC runtime system in each address space (message correctness is verified dynamically).

—Thread management is implemented entirely at the user level and is integrated with the user-level machinery that manages the message channels. A user-level thread sends a message to another address space by directly enqueuing the message on the appropriate outgoing channel. No kernel calls are necessary to send a call or reply message.
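To make the channel abstraction concrete, here is a minimal C sketch of what such a pair-wise shared channel might look like. It is an illustration under assumptions, not the paper's Modula-2+ implementation: the type names, slot counts, and single-sender queue discipline are all invented.

```c
/*
 * Illustrative sketch of a pair-wise shared message channel.  One such
 * structure would be mapped into both the client and the server address
 * space at bind time.  All names and sizes here are hypothetical.
 */
#include <stdatomic.h>
#include <stdint.h>

#define SLOTS     64            /* messages outstanding per direction  */
#define MSG_WORDS 8             /* small fixed-size argument area      */

typedef struct {
    uint32_t proc_id;           /* which remote procedure to run       */
    uint32_t words[MSG_WORDS];  /* marshaled arguments or results      */
} urpc_msg_t;

typedef struct {
    _Atomic uint32_t head;      /* next slot the sender will fill      */
    _Atomic uint32_t tail;      /* next slot the receiver will drain   */
    urpc_msg_t slot[SLOTS];
} urpc_queue_t;

/* A channel is one queue per direction, plus the nonspinning locks
 * discussed in Section 3.2.1 (omitted here for brevity). */
typedef struct {
    urpc_queue_t call;          /* client -> server                    */
    urpc_queue_t reply;         /* server -> client                    */
} urpc_channel_t;

/* Enqueue without any kernel call: plain stores into shared memory.
 * Assumes a single sender per queue, as in a client/server pairing.  */
static int urpc_send_msg(urpc_queue_t *q, const urpc_msg_t *m) {
    uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (h - atomic_load_explicit(&q->tail, memory_order_acquire) == SLOTS)
        return 0;               /* full: caller must back off          */
    q->slot[h % SLOTS] = *m;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return 1;
}
```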

Although a cross-address space procedure call is synchronous from the perspective of the programmer, it is asynchronous at and beneath the level of thread management. When a thread in a client address space invokes a procedure in a server, it blocks waiting for the reply message signifying the procedure's return; while blocked, its processor can run another ready thread in the same address space. In our earlier system, LRPC, the blocked thread and the ready thread were really the same; the thread just crosses an address space boundary. In contrast, URPC always tries to schedule another thread from the same address space on the client thread's processor. This is a scheduling operation that can be handled entirely by the user-level thread management system. When the reply arrives, the blocked client thread can be rescheduled on any of the processors allocated to the client's address space, again without kernel intervention. Similarly, execution of the call on the server side can be done by a processor already executing in the context of the server's address space, and need not occur synchronously with the call.

By preferentially scheduling threads within the same address space, URPC takes advantage of the fact that there is significantly less overhead involved in switching a processor to another thread in the same address space (we will call this context switching) than in reallocating it to a thread in another address space (we will call this processor reallocation). Processor reallocation requires changing the virtual memory context in which a processor is executing, that is, the mapping registers that define the virtual address space. On conventional processor architectures, these mapping registers are protected and can only be accessed in privileged kernel mode.

Several components contribute to the high overhead of processor reallocation. There are scheduling costs to decide the address space to which a processor should be reallocated; immediate costs to update the virtual memory mapping registers and to transfer the processor between address spaces; and long-term costs due to the diminished performance of the cache and translation lookaside buffer (TLB) that occurs when locality shifts from one address space to another [3]. Although there is a long-term cost associated with context switching within the same address space, that cost is less than when processors are frequently reallocated between address spaces [20].

To demonstrate the relative overhead involved in switching contexts between two threads in the same address space versus reallocating processors between address spaces, we introduce the concept of a minimal latency same-address space context switch. On the C-VAX, the minimal latency same-address space context switch requires saving and then restoring the machine's general-purpose registers, and takes about 15 µsecs. In contrast, reallocating the processor from one address space to another takes about 55 µsecs on the C-VAX, without including the long-term cost. Because of the cost difference between context switching and processor reallocation, URPC strives to avoid processor reallocation, instead context switching whenever possible.
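As a present-day illustration of what a minimal user-level context switch involves, the sketch below uses the POSIX ucontext primitives to save and restore processor state entirely within one address space. This is a stand-in for the hand-coded C-VAX register save/restore, not the Firefly code.

```c
/*
 * Minimal user-level context switch, sketched with <ucontext.h>.
 * Saving and restoring the general-purpose state happens without any
 * processor reallocation between address spaces.  Illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, thr_ctx;

static void thread_body(void) {
    puts("in user-level thread");       /* runs on its own stack       */
    swapcontext(&thr_ctx, &main_ctx);   /* "block": back to scheduler  */
}

int main(void) {
    char *stack = malloc(64 * 1024);
    getcontext(&thr_ctx);               /* initialize the context      */
    thr_ctx.uc_stack.ss_sp   = stack;
    thr_ctx.uc_stack.ss_size = 64 * 1024;
    thr_ctx.uc_link          = &main_ctx;
    makecontext(&thr_ctx, thread_body, 0);

    swapcontext(&main_ctx, &thr_ctx);   /* dispatch the thread: save
                                           our state, restore its state */
    puts("back in scheduler");
    free(stack);
    return 0;
}
```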

Processor reallocation is sometimes necessary with URPC, though. If a client invokes a procedure in a server that has an insufficient number of processors to handle the call (e.g., the server's processors are busy doing other work), the client's processors may idle for some time awaiting the reply. This is a load-balancing problem. URPC considers an address space with pending incoming messages on its channel to be "underpowered." An idle processor in the client address space can balance the load by reallocating itself to a server that has pending (unprocessed) messages from that client. Processor reallocation requires that the kernel be invoked to change the processor's virtual memory context to that of the underpowered address space. That done, the kernel upcalls into a server routine that handles the client's outstanding requests. After these have been processed, the processor is returned to the client address space via the kernel. The responsibility for detecting incoming messages and scheduling threads to handle them belongs to special, low-priority threads that are part of a URPC runtime library linked into each address space. Processors scan for incoming messages only when they would otherwise be idle.

Figure 1 illustrates a sample execution timeline for URPC, with time proceeding downward. One client, an editor, and two servers, a window manager (WinMgr) and a file cache manager (FCMgr), are shown. Each is in a separate address space. Two threads in the editor, T1 and T2, invoke procedures in the window manager and the file cache manager. Initially, the editor has one processor allocated to it, the window manager has one, and the file cache manager has none. T1 first makes a cross-address space call into the window manager by sending and then attempting to receive a message. The window manager receives the message and begins processing. In the meantime, the thread scheduler in the editor has context switched from T1 (which is blocked waiting to receive a message) to another thread, T2. T2 initiates a procedure call to the file cache manager by sending a message and waiting to receive a reply. T1 can be unblocked because the window manager has sent a reply message back to the editor. Thread T1 then calls into the file cache manager and blocks. The file cache manager now has two pending messages from the editor but no processors with which to handle them. Threads T1 and T2 are blocked in the editor, which has no other threads waiting to run. The editor's thread management system detects the outgoing pending messages and donates a processor to the file cache manager, which can then receive, process, and reply to the editor's two incoming calls before returning the processor back to the editor. At this point, the two incoming reply messages from the file cache manager can be handled. T1 and T2 each terminate when they receive their replies.
Fig. 1. URPC timeline.
URPC's treatment of pending messages is analogous to the treatment of runnable threads waiting on a ready queue. Each represents an execution context in need of attention from a physical processor; otherwise idle processors poll the queues looking for work in the form of these execution contexts. There are two main differences between the operation of message channels and thread ready queues. First, URPC enables one address space to spawn work in another address space by enqueuing a message (an execution context consisting of some control information and arguments or results). Second, URPC enables processors in one address space to execute in the context of another through an explicit reallocation if the workload between them is imbalanced.

Fig. 2. The software components of URPC.

2.1 The User View

It is important to emphasize that all of the URPC mechanisms described in this section exist "under the covers" of quite ordinary looking Modula-2+ [34] interfaces. The RPC paradigm provides the freedom to implement the control and data transfer mechanisms in ways that are best matched to the underlying hardware. URPC consists of two software packages used by stubs and application code. One package, called FastThreads [5], provides lightweight threads that are managed and scheduled at the user level on top of middleweight kernel threads. The second package, called URPC, implements the channel management and message primitives described in this section. The URPC package lies directly beneath the stubs and closely interacts with FastThreads, as shown in Figure 2.

3. RATIONALE FOR THE URPC DESIGN
In this section we discuss the design rationale behind URPC. In brief, this rationale is based on the observation that there are several independent components to a cross-address space call, and each can be implemented separately. The main components are the following:

—Thread management: blocking the caller's thread, running a thread through the procedure in the server's address space, and resuming the caller's thread on return;
—Data transfer: moving arguments between the client and server address spaces; and
—Processor reallocation: ensuring that there is a physical processor to handle the client's call in the server and the server's reply in the client.

In conventional message systems, these three components are combined beneath a kernel interface, leaving the kernel responsible for each. However, thread management and data transfer do not require kernel assistance; only processor reallocation does. In the subsections that follow we describe how the three components can be isolated from one another, and the benefits that arise from such a separation.
3.1 Processor Reallocation

URPC attempts to reduce the frequency with which processor reallocation occurs through the use of an optimistic reallocation policy. At call time, URPC optimistically assumes the following:

—The client has other work to do, in the form of runnable threads or incoming messages, so that a potential processor reallocation delay will not have a serious effect on the processing of other threads in the client's address space.
—The server has, or will soon have, a processor with which it can service a message.
terms

several context

of performance,

ways. The first switch between

URPC’S

assumption user-level

optimistic makes threads

assumptions

can pay

off in

it possible to do an inexpensive during the blocking phase of a

cross-address space call. The second assumption enables a URPC to execute in parallel with threads in the client’s address space while avoiding a processor reallocation. An implication of both assumptions is that it is possible

to amortize

the

cost of a single

outstanding calls between the threads in the same client make a single

processor

reallocation

processor

reallocation

across

several

same address spaces. For example, if two calls into the same server in succession, then can handle

both.

A user-level approach to communication and thread management is appropriate for shared memory multiprocessors where applications rely on aggressive multithreading to exploit parallelism while at the same time compensating for multiprogramming effects and memory latencies [2], where a few key operating system services are the target of the majority calls [8], and where operating system functions are affixed ing nodes for the sake of locality [31]. In contrast well

suited

–Kernel-level

to URPC,

contemporary

for use on shared

memory

communication

and

uniprocessor

structures

are not

multiprocessors: thread

pessimistic processor reallocation policies, rency within an application to reduce scheduling [13] underscores this Handoff tion blocks the client and reallocates its ACM

kernel

of all application to specific process-

TransactIons

management

facilities

rely

on

and are unable to exploit concurthe overhead of communication. pessimism: a single kernel operaprocessor directly to the server.

on Computer

Systems,

Vol. 9, No

2, May 1991

184

B. N. Bershad et al

.

Although

handoff

is limited

by the cost of kernel

–In

a traditional

scheduling

operating

shared

performance,

invocation

and processor

system

running on a multiprocessor, distributed over the many large-scale

does improve

memory

kernel

designed

for

the

improvement

reallocation. a uniprocessor,

but

kernel resources are logically centralized, processors in the system. For a mediummultiprocessor

such as the Butterfly

but to

[71, Alewife

[41, or DASH [251, URPC’S user-level orientation to operating system design localizes system resources to those processors where the resources are in use, relaxing the performance bottleneck that comes from relying on centralized kernel data structures. This bottleneck is due to the contention for logical resources (locks) and physical resources (memory and interconnect bandwidth) that results when a few data structures, such as thread run

queues

and message By distributing

processors. functions

to

directly,

the

address

contention

URPC

can

channels, are shared by a large number of the communication and thread management

also

spaces

(and

hence

processors)

that

use

be appropriate

on uniprocessors

applications where inexpensive threads physical concurrency within a problem.

running

multithreaded

can be used to express Low overhead threads

the logical and and communi-

cation make it possible to overlap even small amounts of external tion. Further, multithreaded applications that are able to benefit layed

reallocation

tion

protocols

uniprocessor, long

them

is reduced.

can do so without [191. Although

having

reallocation

it can be delayed

to develop will

by scheduling

their

eventually within

computafrom de-

own communicabe necessary

an address

on a

space for as

as possible.

3.1.1 The Optimistic Assumptions Won't Always Hold. In cases where the optimistic assumptions do not hold, it is necessary to invoke the kernel to force a processor reallocation from one address space to another. Examples of where it is inappropriate to rely on URPC's optimistic processor reallocation policy are single-threaded applications, real-time applications (where latency must be bounded), high-latency I/O operations (where it is best to initiate the I/O operation early since it will take a long time to complete), and priority invocations (where the thread executing the cross-address space call is of high priority). To handle these situations, URPC allows the client's address space to force a processor reallocation to the server's, even though there might still be runnable threads in the client's.
3.1.2 The Kernel Handles Processor Reallocation. The kernel implements the mechanisms that support processor reallocation. When an idle processor discovers an underpowered address space, it invokes a procedure called Processor.Donate, passing in the identity of the address space to which the processor (on which the invoking thread is running) should be reallocated. Processor.Donate transfers control down through the kernel and then up to a prespecified address in the receiving space. The identity of the donating address space is made known to the receiver by the kernel.
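The sketch below shows how the idle-processor scan described in Section 2 might drive this interface. Processor_Donate stands in for Processor.Donate, and the helper predicates are hypothetical; this is an illustration of the mechanism, not the actual implementation.

```c
/* Hypothetical rendering of the donation path. */
typedef int space_id_t;

/* Kernel trap (assumed): switch this processor's virtual memory
 * context to `target` and upcall at the address it registered.     */
extern void Processor_Donate(space_id_t target);

extern int channel_has_unprocessed_calls(space_id_t server);
extern int have_runnable_threads(void);

/* Run by the special low-priority scanner threads when a processor
 * would otherwise be idle.                                          */
void idle_scan(space_id_t *servers, int n) {
    for (;;) {
        if (have_runnable_threads())
            return;                     /* real work has appeared     */
        for (int i = 0; i < n; i++)
            if (channel_has_unprocessed_calls(servers[i]))
                Processor_Donate(servers[i]);  /* returns via the
                                                  kernel when done    */
    }
}
```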

3.1.3 Voluntary Return of Processors Cannot Be Guaranteed. A service interface defines a contract between a client and a server. In the case of URPC, as with traditional RPC systems, implicit in the contract is that the server obey the policies that determine when a processor is to be returned to the client. The URPC communication library implements the following policy in the server: upon receipt of a processor from a client address space, return the processor when all outstanding messages from the client have generated replies, or when the server determines that the client has become "underpowered" (there are outstanding messages and one of the server's processors is idle). Although URPC's runtime libraries implement a specific protocol, there is no way to enforce that protocol. Just as with kernel-based systems, once the kernel transfers control of the processor to an application, there is no guarantee
the client client has

no by for

even though share only

of the

with

are communicating

the with

one another. Independent of communication, a multiprogrammed system requires policies and mechanisms that balance the load between noncommunicating (or noncooperative) address spaces. Applications, for example, must not be able to starve able

to delay

clients

one another

out for processors,

indefinitely

by

not

returning

and servers processors.

policies, which forcibly reallocate processors from one address other, are therefore necessary to ensure that applications make A processor reallocation policy should be work conserving, high-priority thread waits for a processor while a lower priority and,

by implication,

that

anywhere in the system, specifics of how to enforce

no processor

address be easily

3.2

when

there

is work

not be

Preemptive space to anprogress. in that no thread runs, for

it to do

even if the work is in another address space. The this constraint in a system with user-level threads

are beyond the scope of this paper, The direct processor reallocation optimization space of a ing to that standpoint reason for processor

idles

must

but are discussed done in URPC

by Anderson et al. in [61. can be thought of as an

of a work-conserving policy. A processor idling in the address URPC client can determine which address spaces are not respondclient’s calls, and therefore which address spaces are, from the of the client, the most eligible to receive a processor. There is no the client to first voluntarily relinquish the processor to a global allocator, only to have the allocator then determine to which

space the processor made

by the client

should

be reallocated.

This

is a decision

that

can

itself.

Data Transfer Using Shared Memory

The requirements of a communication system can be relaxed, and its efficiency improved, when it is layered beneath a procedure call interface rather than exposed to programmers directly. In particular, arguments that are part of a cross-address space procedure call can be passed using shared memory while still guaranteeing safe interaction between mutually suspicious subsystems. Shared memory message channels do not increase the "abusability factor" of client-server interactions. As with traditional RPC, clients and servers can still overload one another, deny service, provide bogus results, and violate communication protocols (e.g., fail to release channel locks, or corrupt channel data structures). And, as with traditional RPC, it is up to higher level protocols to ensure that lower level abuses filter up to the application in a well-defined manner (e.g., by raising a call-failed exception or by closing down the channel).

In URPC, the safety of communication based on shared memory is the responsibility of the stubs. The arguments of a URPC are passed in message buffers that are allocated and pair-wise mapped during the binding phase that precedes the first cross-address space call between a client and server. On receipt of a message, the stubs unmarshal the message's data into procedure parameters, doing whatever copying and checking is necessary to ensure the application's safety.

Traditionally, safe communication is implemented by having the kernel copy data from one address space to another. Such an implementation is necessary when application programmers deal directly with raw data in the form of messages. But, when standard runtime facilities and stubs are used, as is the case with URPC, having the kernel copy data between address spaces is neither necessary nor sufficient to guarantee safety.

Copying in the kernel is not necessary because programming languages are implemented to pass parameters between procedures on the stack, the heap, or in registers. When data is passed between address spaces, none of these storage areas can, in general, be used directly by both the client and the server. Therefore, the stubs must copy the data into and out of the memory used to move arguments between address spaces. Safety is not increased by first doing an extra kernel-level copy of the data.

Copying in the kernel is not sufficient for ensuring safety when using type-safe languages such as Modula-2+ or Ada [1], since each actual parameter must be checked by the stubs for conformity with the type of its corresponding formal. Without such checking, for example, a client could crash the server by passing in an illegal (e.g., out of range) value for a parameter.

These points motivate the use of pair-wise shared memory for cross-address space communication. Pair-wise shared memory can be used to transfer data between address spaces more efficiently, but just as safely, as messages that are copied by the kernel between address spaces.
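A small sketch of the stub-side copying and checking argued for above follows. The interface (a hypothetical SetBrightness procedure whose argument must lie in 0..100) and all names are invented for illustration.

```c
/* Server-side stub sketch: copy out of the shared buffer, then
 * validate each actual against its formal's type before the call.  */
#include <string.h>

typedef struct { int level; } set_brightness_args_t;

int server_stub_set_brightness(const void *shared_buf) {
    set_brightness_args_t a;
    memcpy(&a, shared_buf, sizeof a);  /* copy first: the client can
                                          still mutate the shared buf */
    if (a.level < 0 || a.level > 100)
        return -1;                     /* reject: raise a call-failed
                                          exception to the caller     */
    extern void SetBrightness(int level);
    SetBrightness(a.level);
    return 0;
}
```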

3.2.1 Controlling Channel Access. Data flows between URPC packages in different address spaces over a bidirectional shared memory queue with test-and-set locks on either end. To prevent processors from waiting indefinitely on message channels, the locks are nonspinning; i.e., the lock protocol is simply: if the lock is free, acquire it, or else go on to something else; never spin-wait. The rationale here is that the receiver of a message should never have to delay while the sender holds the lock. If the sender stands to benefit from the receipt of a message (as on call), then it is up to the sender to ensure that locks are not held indefinitely; if the receiver stands to benefit (as on return), then the sender could just as easily prevent the benefits from being realized by not sending the message in the first place.
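The nonspinning protocol can be expressed directly with a test-and-set primitive. The sketch below uses C11's atomic_flag as a modern stand-in for the Firefly's interlocked instructions; it is illustrative only.

```c
/* "Acquire or move on" channel lock: one test-and-set, never a loop. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag held; } channel_lock_t;
/* Initialize `held` with ATOMIC_FLAG_INIT before first use. */

static bool try_acquire(channel_lock_t *l) {
    return !atomic_flag_test_and_set_explicit(&l->held,
                                              memory_order_acquire);
}

static void release(channel_lock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Usage: a processor polling for work simply skips a busy channel. */
void poll_channel(channel_lock_t *l) {
    if (!try_acquire(l))
        return;             /* go on to something else; never spin   */
    /* ... dequeue a message here ... */
    release(l);
}
```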

3.3 Cross-Address Space Procedure Call and Thread Management

The calling semantics of a cross-address space procedure call, like those of a normal procedure call, are synchronous with respect to the thread that is performing the call. Consequently, there is a strong interaction between the software system that manages threads and the one that manages cross-address space communication. Each communication function (send and receive) involves a corresponding thread management function (start and stop). On call, the client thread sends a message to the server, which causes a server thread to be started. The client thread then stops, waiting on a receive for the reply. When the server thread finishes, it sends a response back to the client, which causes the client's waiting thread to be started. Finally, the server's thread stops again on a receive, waiting for the next message.

In this subsection we explore the relationship between cross-address space communication and thread management. In brief, we argue the following:

—High performance thread management facilities are necessary for fine-grained parallel programs.
—While thread management facilities can be provided either in the kernel or at the user level, high-performance thread management can only be provided at the user level.
—The close interaction between communication and thread management can be exploited to achieve extremely good performance for both, when both are implemented together at the user level.
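The pairing of communication and thread management functions described at the start of this subsection can be made concrete in a short sketch. The names below (and the channel types from the earlier sketch) are assumptions for illustration.

```c
/* send/stop and receive/start, made explicit.  ch_recv stops the
 * calling thread until a message arrives; ch_send may start one.    */
extern void ch_send(urpc_channel_t *ch, const urpc_msg_t *m);
extern void ch_recv(urpc_channel_t *ch, urpc_msg_t *m);

/* Client side of a call: send, then stop until the reply starts us. */
void client_call(urpc_channel_t *ch, urpc_msg_t *args, urpc_msg_t *res) {
    ch_send(ch, args);    /* may start a thread in the server        */
    ch_recv(ch, res);     /* stop; the processor runs another ready
                             thread in this address space meanwhile  */
}

/* Server side: each reply stops this thread on the next receive.    */
void server_loop(urpc_channel_t *ch) {
    urpc_msg_t m, r;
    for (;;) {
        ch_recv(ch, &m);  /* started when a call message arrives     */
        /* ... dispatch to the named procedure, fill in r ...        */
        ch_send(ch, &r);  /* starts the waiting client thread        */
    }
}
```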

threads

Concurrent

within

activities. parallel

together

at the user

Programming

an

address

Concurrency programs

space

needs

and

Thread

simplify

the

can be entirely

for

access to a thread

or

it

to the can

some amount address space.

management

system

that

continuum

major

middleweight,

are three

and lightweight,

for which

on the order of thousands, hundreds, Kernels supporting heavyweight

points thread

application,

as with as with

an

of its own computation with In either case, the program-

of control. can be described

there

of concurrent

be external,

create, schedule, and destroy threads Support for thread management on which

Multiple

Management.

management

internal

multiprocessors,

application that needs to overlap activity on its behalf in another mer

level.

makes in

it possible

terms

of reference: management

of

to

a cost

heavyweight, overheads

are

and tens of psecs, respectively. threads [26, 32, 35] make no distinction

between a thread, the dynamic component of a program, and its address space, the static component. Threads and address spaces are created, scheduled, and destroyed as single entities. Because an address space has a large amount of baggage, such as open file descriptors, page tables, and accounting ACM

Transactions

on Computer

Systems,

Vol

9, No

2, May 1991

188

B. N Bershad et al.

.

state that must be manipulated during thread management operations, operations on heavyweight threads are costly, taking tens of milliseconds.

Many contemporary operating system kernels provide support for middleweight, or kernel-level, threads [12, 36]. In contrast to heavyweight threads, address spaces and middleweight threads are decoupled, so that a single address space can have many middleweight threads. As with heavyweight threads, though, the kernel is responsible for implementing the thread management functions, and directly schedules an application's threads on the available physical processors. With middleweight threads, the kernel defines a general programming interface for use by all concurrent applications.

Unfortunately, the cost of this generality affects the performance of fine-grained parallel applications, even those for which the features are unnecessary. For example, features such as time slicing, or saving and restoring floating point registers when switching between threads, can be sources of performance overhead for all parallel applications. A kernel-level thread management system must also protect itself from malfunctioning or malfeasant programs. Expensive (relative to procedure call) protected kernel traps are required to invoke any thread management operation, and the kernel must treat each invocation suspiciously, checking arguments and access privileges to ensure legitimacy.

In addition to the direct overheads that contribute to the high cost of kernel-level thread management, there is an indirect but more pervasive effect of thread management policy on program performance. There are a large number of parallel programming models, and within these, a wide variety of scheduling disciplines that are most appropriate (for performance) to a given model [25, 33]. Performance, though, is strongly influenced by the choice of interfaces, data structures, and algorithms used to implement threads, so a single model represented by one style of kernel-level thread is unlikely to have an implementation that is efficient for all parallel programs.

In response to the costs of kernel-level threads, programmers have turned to lightweight threads that execute in the context of middleweight or heavyweight threads provided by the kernel, but that are managed at the user level by a library linked in with each application [9, 38]. Lightweight thread management implies two-level scheduling, since the application library schedules lightweight threads on top of weightier threads, which are themselves being scheduled by the kernel. Two-level schedulers attack both the direct and indirect costs of kernel threads. Directly, it is possible to implement efficient user-level thread management functions because they are accessible through a simple procedure call rather than a trap, need not be bulletproofed against application errors, and can be customized to provide only the level of support needed by a given application. The cost of thread management operations in a user-level scheduler can be orders of magnitude less than for kernel-level threads.

3.3.2 The Problem with Having Functions at Two Levels. Threads block for two reasons: (1) to synchronize their activities within an address space

(e.g., while controlling access to critical sections), and (2) to wait for external events in other address spaces (e.g., while reading keystrokes from a window manager). If threads are implemented at the user level (necessary for inexpensive and flexible concurrency), but communication is implemented in the kernel, then the performance benefits of inexpensive thread scheduling and context switching are lost when applications synchronize with threads in separate address spaces. In effect, each synchronizing operation must be implemented and executed at two levels: (1) once in the kernel, so that the kernel-mediated communication activity and awaiting applications are properly notified, and (2) again in the application, so that the user-level scheduler accurately reflects the scheduling state of the application. In contrast, when communication and thread management are moved out of the kernel and implemented together at user level, the system can control the scheduling and synchronization activities needed by the communication with the same primitives that are used by applications.

3.4 Summary of Rationale

This section has described the design rationale behind URPC. The main advantages of a user-level approach to communication are that calls can be made without invoking the kernel, and without unnecessarily reallocating processors between address spaces. Further, when kernel processor reallocations do turn out to be necessary, their cost can be amortized over multiple calls. Finally, user-level communication can be cleanly integrated with user-level thread management facilities, so that fine-grained parallel programs can inexpensively synchronize between, as well as within, address spaces.

4. THE PERFORMANCE OF URPC

This section describes the performance of URPC on the Firefly, an experimental multiprocessor workstation built by DEC's Systems Research Center. A Firefly can have as many as six C-VAX microprocessors, plus one MicroVAX-II dedicated to I/O. The Firefly used to collect the measurements described in this paper had four C-VAX processors and 32 megabytes of main memory.

4.1 User-Level Thread Management Performance
In this section we compare the performance of thread management functions when implemented at the user level and in the kernel. The user-level threads are those provided by FastThreads; the kernel-level threads are those provided by Taos, the Firefly's native operating system.

Table I shows the cost of several thread management operations implemented at the user level and at the kernel level. PROCEDURE CALL invokes the null procedure and provides a baseline value for the machine's performance. FORK creates, starts, and terminates a thread. FORK;JOIN is like FORK, except that the thread which creates the new thread blocks until the forked thread completes. YIELD forces a thread to make a full trip through the scheduling machinery; when a thread yields the processor, it is blocked (state saved), made ready (enqueued on the ready queue), scheduled (dequeued), unblocked (state restored), and then continued. ACQUIRE;RELEASE acquires and then releases a blocking (nonspinning) lock for which there is no contention. PINGPONG has two threads "pingpong" back and forth on a condition variable, blocking and unblocking one another by means of paired signal and wait operations. Each pingpong cycle blocks, schedules, and unblocks two threads in succession.

Table I. Comparative Performance of Thread Management Operations (µsecs)

  Test                FastThreads    Taos Threads
  PROCEDURE CALL                7               7
  FORK                         43            1192
  FORK;JOIN                   102            1574
  YIELD                        37              57
  ACQUIRE;RELEASE              27              27
  PINGPONG                     53             271

Table II. Breakdown of a URPC (µsecs)

  Component      Client    Server
  send               18        13
  poll                6         6
  receive            10         9
  dispatch           20        25
  Total              54        53

Each test was run a large number of times as part of a loop, and the elapsed time was measured using the loop limit to determine the values shown in the table. The FastThreads tests were constrained to run on a single processor. The kernel-level threads had no such constraint; Taos caches recently executed kernel-level threads on idle processors to reduce wakeup latency. Because the tests were run on an otherwise idle machine, the caching optimization worked well and we saw little variance in the measurements.

The table demonstrates the performance advantage of implementing thread management at the user level, where the operations have an overhead on the order of a few procedure calls, rather than a few hundred, as when the threads are provided by the kernel.
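For readers who want to reproduce the flavor of the PINGPONG measurement on current hardware, here is a rough analogue using POSIX threads and a condition variable. It measures kernel-level signal/wait pairs, the counterpart of the Taos column; it is not the original FastThreads harness.

```c
/* Ping-pong between two threads on a condition variable.  Time the
 * create/join region externally (e.g., clock_gettime) and divide by
 * CYCLES to get the cost of one blocking/unblocking round trip.     */
#include <pthread.h>

#define CYCLES 100000

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int turn = 0;                /* whose turn it is: 0 or 1       */

static void *player(void *arg) {
    int me = *(int *)arg;
    pthread_mutex_lock(&m);
    for (int i = 0; i < CYCLES; i++) {
        while (turn != me)
            pthread_cond_wait(&c, &m);   /* block (state saved)       */
        turn = 1 - me;
        pthread_cond_signal(&c);         /* unblock the other thread  */
    }
    pthread_mutex_unlock(&m);
    return 0;
}

int main(void) {
    pthread_t t1, t2;
    int id0 = 0, id1 = 1;
    pthread_create(&t1, 0, player, &id0);
    pthread_create(&t2, 0, player, &id1);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}
```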

4.2 URPC Component Breakdown

In the case when an address space has enough processing power to handle all incoming messages without requiring a processor reallocation, the work done during a URPC can be broken down into four components: send (enqueuing a message on an outgoing channel); poll (detecting the channel on which a message has arrived); receive (dequeueing the message); and dispatch (dispatching the appropriate thread to handle an incoming message). Table II shows the time taken by each component.

The total processing times are almost the same for the client and the server. Nearly half of that time goes towards thread management (dispatch), demonstrating the influence that thread management overhead has on communication performance.

The send and receive times reflect the cost of marshaling procedure parameters into and out of the message buffers, and are influenced by the amount of data passed. Because most cross-address space procedure calls pass a small amount of data [8, 16, 17, 24], the component times shown in Table II are for calls in which a small amount of data is passed. On the Firefly, each additional word of data passed adds, on average, about one µsec of latency; copying between address spaces through shared memory is not a large component of the cost of data transfer.

4.3 Call Latency and Throughput

4.3 Call Latency and Throughput The

figures

there

in

Table

is no need

latency

and

call

throughput server

II

for

depend

space,

on

for

are

the

S,

and

of client

reallocation. load

number the

Two

and

dependent, of

client

number

of

server

other

load,

important

though.

Both

processors,

runnable

C,

call

latency

and

the

threads

provided

metrics, number

in

the

of

client’s

T.

To evaluate values

independent

throughput,

processors,

address

are

processor

T,

latency

and

C,

S

and

throughput, in

we ran

which

threads to make 100,000 “Null” loop. The Null procedure call

we

timed

a series how

of tests

long

procedure calls into takes no arguments,

it

using

took

different

the

the server computes

client’s

T

inside a tight nothing, and

returns no results; it exposes the performance characteristics of a round-trip cross-address space control transfer. Figures 3 and 4 graph call latency and throughput as a function of T. Each line on the graphs represents a different combination of S and C. Latency is measured as the time from when a thread calls into the Null stub to when control returns from the stub, and depends on the speed of the message primitives and the length of the thread ready queues in the client and server. When (T>

the

number

processor

of caller

call

C + S),

latency

in the client

exceeds since

or the server

to T. Except

is less sensitive call processing

threads increases

times

total call

or both.

for the special

in the client

the each

number must

When

the

number

of calling

on the other

case where

S = 0, as long

and server

message wasted

transfer. processing

threads,

are roughly

client

equal T >

hand, as the

(see Table

II),

At that no further

C + S.

can

processors,

a free

and

be

server

proces-

latency is 93 ~secs. This is “pure” latency, in basic processing required to do a round-trip

1 Unfortunately, time

for

Throughput,

until then throughput, as a function of T, improves point, processors are kept completely busy, so there T. improvement in throughput by increasing sors is 1 (T = C = S = 1), call the sense that it reflects the

of processors

wait

the

due to idling

latency

in the

includes

client

and

a large server

amount

while

of

waiting

¹ The 14 µsec discrepancy between the observed latency and the expected latency (54 + 53; cf. Table II) is due to a low-level scheduling optimization that allows the caller thread's context to remain loaded on the processor if the ready queue is empty and there is no other work to be done anywhere in the system. In this case, the idle loop executes in the context of the caller thread, enabling a fast wakeup when the reply message arrives. At T = C = S, the optimization is enabled. The optimization is also responsible for the small "spike" in the throughput curves shown in Figure 4.

Fig. 3. URPC call latency.

At T = 2, C = 1, S = 1, latency increases by only 20 percent (to 112 µsecs), but throughput increases by nearly 75 percent (from 10,300 to 17,850 calls per second). This increase reflects the fact that message processing can be done in parallel between the client and server. Operating as a two-stage pipeline, the client and server complete one call every 53 µsecs. For larger values of S and C, the same analysis holds. As long as T is sufficiently large to keep the processors in the client and server address spaces busy, throughput is maximized.

Note that throughput for S = C = 2 is not twice as large as for S = C = 1

(23,800 vs. 17,850 calls per second, a 33 percent improvement). Contention for the critical sections that manage access to the thread and message queues in each address space limits throughput over a single channel. Of the 148 instructions required to do a round-trip message transfer, approximately half are part of critical sections guarded by two separate locks kept on a per-channel basis. With four processors, the critical sections are therefore nearly always occupied, so there is slowdown due to queueing at the critical sections. This factor does not constrain the aggregate rate of URPCs between multiple clients and servers, since they use different channels.

When S = 0, throughput and latency are at their worst due to the need to reallocate processors frequently between the client and server. When T = 1, round-trip latency is 375 µsecs. Every call requires two traps and two processor reallocations. At this point, URPC performs worse than LRPC (157 µsecs).

Fig. 4. URPC call throughput.

We consider this comparison further in Section 5.5. Although call performance is worse, URPC's thread management primitives retain their superior performance (see Table I). As T increases, though, the trap and reallocation overheads can be amortized over the outstanding calls. Throughput improves steadily, and once the initial shock of processor reallocation has been absorbed, latency worsens only because of scheduling delays.

4.4 Performance Summary
We have described the performance of the communication and thread management primitives for an implementation of URPC on the Firefly multiprocessor. The performance figures demonstrate the advantages of moving traditional operating system functions out of the kernel and implementing them at the user level. Our approach stands in contrast to traditional operating system methodology. Normally, functionality is pushed into the kernel in order to improve performance, at the cost of reduced flexibility. With URPC, we are pushing functionality out of the kernel to improve both performance and flexibility.

5. WORK RELATED TO URPC

In this section we examine several other communication systems in the light of URPC. The common theme underlying these other systems is that they reduce the kernel's role in communication in an effort to improve performance.

5.1 Sylvan

Sylvan [14] relies on architectural support to implement the Thoth [15] message-passing primitives. Special coprocessor instructions are used to access these primitives, so that messages can be passed between Thoth tasks without kernel intervention. A pair of 68020 processors, using a coprocessor to implement the communication primitives, can execute a send-receive-reply cycle on a zero-byte message (i.e., the Null call) in 48 μsecs.

Although Sylvan outperforms URPC, the improvement is small considering the use of special-purpose hardware. Sylvan's coprocessor takes care of enqueueing and dequeueing messages and updating thread control blocks when a thread becomes runnable or blocked. Software on the main processor handles thread scheduling and context switching. As Table II shows, though, thread management can be responsible for a large portion of the round-trip processing time. Consequently, a special-purpose coprocessor can offer only limited performance gains.

5.2 Yackos

Yackos [22] is similar to URPC in that it strives to bypass the kernel during communication and is intended for use on shared memory multiprocessors. Messages are passed between address spaces by means of a special message-passing process that shares memory with all communicating processes. There is no automatic load balancing among the processors cooperating to provide a Yackos service. Address spaces are single threaded, and clients communicate with a process running on a single processor. That process is responsible for forwarding the message to a less busy process if necessary. In URPC, address spaces are multithreaded, so a message sent from one address space to another can be fielded by any processor allocated to the receiving address space.

Further, URPC on a multiprocessor in the best case (no kernel involvement) outperforms Yackos by a factor of two for a round-trip call (93 vs. 200 μsecs).² The difference in communication performance is due mostly to the indirection through the interposing Yackos message-passing process. This indirection results in increased latency due to message queueing, polling, and data copying. By taking advantage of the client/server pairing implicit in an RPC relationship, URPC avoids the indirection through the use of pair-wise shared message queues. In the pessimistic case (synchronous processor reallocation on every message send), URPC is over thirteen times more efficient (375 vs. 5000 μsecs) than Yackos.

² Yackos runs on the Sequent Symmetry, which uses 80386 processors that are somewhat faster than the C-VAX processors used in the Firefly.

5.3 Promises

Argus [28] provides the programmer with a mechanism called a Promise [27] that can be used to make an asynchronous cross-address space procedure call.

The caller thread continues to run, possibly in parallel with the callee, until the time that the call's results are needed. If a thread attempts to "collect" on a Promise that has not yet been fulfilled (i.e., the cross-address space call has not yet completed), the collecting thread blocks.

precludes

order

explicit

this

approach.

of six procedure language

one largely

call

support

In URPC,

times,

for asynchronous

of preference

and style,

the cost of thread

and so the

issue

cross-address

rather

than

creation

of whether

is on

to provide

space calls

becomes

performance.

Camelot

5.4

Camelot

[19]

is a high-performance

distributed

transaction

processing

sys-

tem. Key to the system’s performance is its use of recoverable virtual memory and write-ahead logging. Because of the cost of Mach’s communication

primitives

[18], data

servers

use shared

memory

queues

to communicate

virtual memory control information and log records to the disk manager on each machine. When the queues become full, data servers use Mach’s kernellevel RPC system to force a processor reallocation to the disk manager so that it can process

the pending

messages

in the queue.

URPC generalizes the ad hoc message-passing approach used by Camelot. Camelot’s shared memory queues can only be used between the data servers and the disk manager. URPC’S channels are implemented beneath the stub layer, allowing any client and any server to communicate through a standard RPC

interface.

with those that memory queues nication At

Camelot’s

in URPC the

thread

management

access the shared memory support only unidirectional

lowest

is bidirectional, level,

supporting

URPC

is

simply

separate address spaces access shared Programmers have often used shared address space communication, interfaces. URPC formalizes work of standard programming pensive 5.5

primitives

are not integrated

queues. Finally, communication, request-reply a protocol memory memory

by

in for

Camelot’s whereas

shared commu-

interaction. which

threads

a disciplined high-bandwidth

in

fashion. cross-

but have had to implement special-purpose this interface by presenting it in the frameabstraction based on procedure call and inex-

threads.

LRPC

LRPC

demonstrates

that

it

is

possible

to

communicate

between

address

spaces by way of the kernel at “hardware speed. ” Like URPC, LRPC passes parameters in shared memory that is pair-wise mapped between the client and server.

Unlike

URPC,

though,

LRPC

uses kernel-level

threads

that

pass

between address spaces on each call and return. LRPC greatly reduces the amount of thread management that must be done on each call, but is still performance limited by kernel mediation. ACM

Transactions

on Computer

Systems,

Vol. 9, No

2, MaY 1991

196

.

B. N. Bershad et al

In the worst case, when there are no processors allocated to the server and there is one client thread, each call involves two kernel invocations and two processor reallocations, one to the server and one back to the client. For URPC, the latency for the Null call is 375 μsecs; on the same hardware, LRPC takes 157 μsecs. There are two reasons why LRPC outperforms URPC in this case; the first is an artifact of URPC's implementation, while the second is inherent in URPC's approach:

—Processor reallocation in URPC is based on LRPC. URPC implements processor reallocation using LRPC, which is a general cross-address space procedure call facility for kernel-level threads. We used LRPC because it was available and integrated into the Firefly operating system. The overhead of LRPC-based processor reallocation is 180 μsecs, but we estimate that a special-purpose reallocation mechanism would be about 30 percent faster.

—URPC is integrated with two-level scheduling. Processor reallocation occurs after two distinct scheduling decisions are made: (1) Is there an idle processor, and (2) is there an underpowered address space to which it can be reallocated? In contrast, a call using LRPC makes no scheduling decisions (one of the main reasons for its speed). Making these two decisions is more expensive than not making them, but is necessary to take advantage of the inexpensive synchronization and scheduling functions made possible by user-level thread management. The overhead of these two scheduling decisions is about 100 μsecs per round-trip call (only in the case where processor reallocation is required, of course); a sketch of the decision logic appears at the end of this section.

It should not be surprising that indirecting through a second-level scheduler, which manages threads, increases the cost of accessing the first-level scheduler, which manages processors. The trick is to interact infrequently with the first-level scheduler. When the first-level scheduler must be used, the overhead of having to pass through the second-level mechanism will inevitably degrade performance relative to a system with only a single level of scheduling.
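The following minimal sketch shows the shape of the two reallocation decisions described above (our illustration; the function names are invented, with reallocate standing in for the kernel's processor reallocation primitive):

    typedef struct processor processor;
    typedef struct address_space address_space;

    extern processor     *find_idle_processor(void);       /* decision (1), assumed */
    extern address_space *find_underpowered_space(void);   /* decision (2), assumed */
    extern void           reallocate(processor *, address_space *);  /* kernel trap */

    /* Run by the second-level (user-level) scheduler when work is pending. */
    static void reconsider_allocation(void)
    {
        processor *p = find_idle_processor();
        if (p == NULL)
            return;             /* nothing to donate; no kernel involvement */

        address_space *as = find_underpowered_space();
        if (as == NULL)
            return;             /* every address space has enough processors */

        reallocate(p, as);      /* only now pay for the kernel-mediated move */
    }

Only the final step traps to the kernel; the two decisions themselves run at user level, which is what keeps the common (no-reallocation) case cheap.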

6. CONCLUSIONS

This paper has described the motivation, design, implementation, and performance of URPC, a new approach that addresses the problems of kernel-based communication by moving traditional operating system functionality out of the kernel and up to the user level. We believe that URPC represents the appropriate division of responsibility for the operating system kernels of shared memory multiprocessors. While it is a straightforward task to port a uniprocessor operating system to a multiprocessor by adding kernel support for threads and synchronization, it is another thing entirely to design facilities for a multiprocessor that will enable programmers to fully exploit the processing power of the machine. URPC demonstrates that one way in which a multiprocessor's performance potential can be greatly increased is by designing system facilities for a multiprocessor in the first place, thereby making a distinction between a multiprocessor operating system and a uniprocessor operating system that happens to run on a multiprocessor.

REFERENCES

1. UNITED STATES DEPARTMENT OF DEFENSE. Reference Manual for the Ada Programming Language, July 1980.
2. AGARWAL, A. Performance tradeoffs in multithreaded processors. Tech. Rep. MIT VLSI Memo 89-566, Laboratory for Computer Science, Massachusetts Institute of Technology, Sept. 1989.
3. AGARWAL, A., HENNESSY, J., AND HOROWITZ, M. Cache performance of operating system and multiprogramming workloads. ACM Trans. Comput. Syst. 6, 4 (Nov. 1988), 393-431.
4. AGARWAL, A., LIM, B.-H., KRANZ, D., AND KUBIATOWICZ, J. APRIL: A processor architecture for multiprocessing. In Proceedings of the 17th Annual International Symposium on Computer Architecture (May 1990), 104-114.
5. ANDERSON, T. E., LAZOWSKA, E. D., AND LEVY, H. M. The performance implications of thread management alternatives for shared memory multiprocessors. IEEE Trans. Comput. 38, 12 (Dec. 1989), 1631-1644.
6. ANDERSON, T. E., BERSHAD, B. N., LAZOWSKA, E. D., AND LEVY, H. M. Efficient user-level thread management for shared memory multiprocessors. Tech. Rep. 90-04-02, Dept. of Computer Science and Engineering, Univ. of Washington, April 1990. To appear in Proceedings of the 13th ACM Symposium on Operating Systems Principles.
7. BBN LABORATORIES. Butterfly Parallel Processor Overview. BBN Labs., Cambridge, Mass., 1985.
8. BERSHAD, B. N. High performance cross-address space communication. Ph.D. dissertation, Tech. Rep. 90-06-02, Dept. of Computer Science and Engineering, Univ. of Washington, June 1990.
9. BERSHAD, B. N., LAZOWSKA, E. D., AND LEVY, H. M. PRESTO: A system for object-oriented parallel programming. Softw. Pract. Exper. 18, 8 (Aug. 1988), 713-732.
10. BERSHAD, B. N., ANDERSON, T. E., LAZOWSKA, E. D., AND LEVY, H. M. Lightweight remote procedure call. ACM Trans. Comput. Syst. 8, 1 (Feb. 1990), 37-55. Also appeared in Proceedings of the 12th ACM Symposium on Operating Systems Principles (Dec. 1989).
11. BIRRELL, A. D. An introduction to programming with threads. Tech. Rep. 35, Digital Equipment Corporation Systems Research Center, Palo Alto, Calif., Jan. 1989.
12. BIRRELL, A. D., AND NELSON, B. J. Implementing remote procedure calls. ACM Trans. Comput. Syst. 2, 1 (Feb. 1984), 39-59.
13. BLACK, D. L. Scheduling support for concurrency and parallelism in the Mach operating system. IEEE Computer 23, 5 (May 1990), 35-43.
14. BURKOWSKI, F. J., CORMACK, G., AND DUECK, G. Architectural support for synchronous task communication. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems (April 1989), 40-53.
15. CHERITON, D. R. Thoth: A portable real-time operating system. Commun. ACM 22, 2 (Feb. 1979), 105-115.
16. CHERITON, D. R. The V distributed system. Commun. ACM 31, 3 (Mar. 1988), 314-333.
17. COOK, D. The evaluation of a protection system. Ph.D. dissertation, Computer Laboratory, Cambridge Univ., 1978.
18. DUCHAMP, D. Analysis of transaction management performance. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (Dec. 1989), 177-190.
19. EPPINGER, J. L. Virtual memory management for transaction processing systems. Ph.D. dissertation, Dept. of Computer Science, Carnegie Mellon Univ., Feb. 1989.
20. GUPTA, A., TUCKER, A., AND URUSHIBARA, S. The impact of operating system scheduling policies and synchronization methods on the performance of parallel applications. In Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May 1991).
21. HALSTEAD, R. Multilisp: A language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst. 7, 4 (Oct. 1985), 501-538.
22. HENSGEN, D., AND FINKEL, R. Dynamic server squads in Yackos. Tech. Rep. 138-89, Dept. of Computer Science, Univ. of Kentucky, 1989.
23. JONES, M. B., AND RASHID, R. F. Mach and Matchmaker: Kernel and language support for object-oriented distributed systems. In Proceedings of the ACM Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '86) (Sept. 1986), 67-77.
24. KARGER, P. A. Using registers to optimize cross-domain call performance. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems (April 1989), 194-204.
25. LENOSKI, D., LAUDON, J., GHARACHORLOO, K., GUPTA, A., AND HENNESSY, J. The directory-based cache coherence protocol in the DASH multiprocessor. In Proceedings of the 17th Annual International Symposium on Computer Architecture (May 1990), 148-159.
26. LEVY, H. M., AND ECKHOUSE, R. H. Computer Programming and Architecture: The VAX-11. 2nd ed. Digital Press, Bedford, Mass., 1989.
27. LISKOV, B., AND SHRIRA, L. Promises: Linguistic support for efficient asynchronous procedure calls in distributed systems. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation (June 1988), 260-267.
28. LISKOV, B. Distributed programming in Argus. Commun. ACM 31, 3 (Mar. 1988), 300-312.
29. LISKOV, B., CURTIS, D., JOHNSON, P., AND SCHEIFLER, R. Implementation of Argus. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (Nov. 1987), 111-122.
30. NELSON, B. J. Remote procedure call. Ph.D. dissertation, Dept. of Computer Science, Carnegie Mellon Univ., 1981.
31. OUSTERHOUT, J. Partitioning and cooperation in a distributed system. Ph.D. dissertation, Dept. of Computer Science, Carnegie-Mellon Univ., April 1980.
32. RITCHIE, D., AND THOMPSON, K. The Unix time-sharing system. Commun. ACM 17, 7 (July 1974), 365-375.
33. ROBERTS, E. S., AND VANDEVOORDE, M. T. WorkCrews: An abstraction for controlling parallelism. Tech. Rep. 42, Digital Equipment Corporation Systems Research Center, Palo Alto, Calif., April 1989.
34. ROVNER, P., LEVIN, R., AND WICK, J. On extending Modula-2 for building large, integrated systems. Tech. Rep. 3, Digital Equipment Corporation Systems Research Center, Palo Alto, Calif., Jan. 1985.
35. SEQUENT COMPUTER SYSTEMS, INC. Symmetry Technical Summary.
36. TEVANIAN, A., RASHID, R. F., GOLUB, D. B., BLACK, D. L., COOPER, E., AND YOUNG, M. W. Mach threads and the Unix kernel: The battle for control. In Proceedings of the 1987 USENIX Summer Conference (1987), 185-197.
37. THACKER, C. P., STEWART, L. C., AND SATTERTHWAITE, JR., E. H. Firefly: A multiprocessor workstation. IEEE Trans. Comput. 37, 8 (Aug. 1988), 909-920.
38. WEISER, M., DEMERS, A., AND HAUSER, C. The portable common runtime approach to interoperability. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (Dec. 1989), 114-122.

Received July 1990; revised February 1991; accepted February 1991
