Components of a Network Operating System

Abstract: Recent advances in hardware interconnection techniques for local networks have highlighted the inadequacies in existing software interconnection technology. Though the idea of using a pure message-passing operating system to improve this situation is not a new one, in the environment of a mature, high speed, local area network it is an idea whose time has come. A new timesharing system being developed at the Lawrence Livermore Laboratory is a pure message-passing system. In this system, all services that are typically obtained directly by a system call (reading and writing files, terminal I/O, creating and controlling other processes, etc.) are instead obtained by communicating via messages with system service processes (which may be local or remote). The motivation for the development of this new system and some design and implementation features needed for its efficient operation are discussed.

Keywords: Network, local network, operating system, network operating system, message, protocol, distributed
1. Introduction

The basic job performed by an operating system is multiplexing the physical resources available on its system (Fig. 1). By a variety of techniques such as time slicing, spooling, paging, reservation, allocation, etc., the operating system transforms the available physical resources into logical resources that can be utilized by the active processes running under it (Fig. 2).
Figure 1 - Physical resources directly attached to a single processor.
Figure 2 - Logical resources made available to a user process.

The interface between a process running under an operating system and the world outside its memory space is the "system call", a request for service from the operating system. The usual approach taken in operating system design has been to provide distinct system calls to obtain service for each type of available local resource (Fig. 3).
Figure 3 - Request structure for a typical third generation operating system.

If a network becomes available, system calls for network communication are added to the others already supported by the operating system. Some problems with this approach are the Dual Access and Dual Service Dichotomies discussed below. It is argued here that operating systems to be connected to a network (particularly a high speed local area network) should be based on a pure message-passing monitor (Fig. 4).
Figure 4 - Resource interface for a message-passing operating system.

The title of this paper has at least two interpretations that are consistent with the intent of the author:

• If the term "Network Operating System" is taken to refer to a collection of cooperating computer systems working together to provide services by multiplexing the hardware resources available on a network, then the title "Components of a Network Operating System" suggests a discussion of the "Component" systems.
• On the other hand, the term "Network Operating System" can also be taken to refer to a single machine monitor to which the adjective "Network" is applied to indicate a design that facilitates network integration. In this case the title "Components of a Network Operating System" suggests a discussion of the component pieces or modules that comprise such a single machine operating system.

The basic approach taken here will be to describe the components of a single machine operating system being implemented at the Lawrence Livermore Laboratory (LLL). The presentation will be largely machine independent, however, and will include discussion of the integration of the described system into a network of similar and dissimilar systems.
2. Historical Perspective

LLL has a long history of pushing the state of the art in high speed scientific processing to satisfy the prodigious raw processing requirements of the many physics simulation codes run at the laboratory. The high speed, often few of a kind computing engines (for example, Univac-1, 1953; Larc, Remington Rand, 1960; Stretch, IBM, 1961; 6600, CDC, 1964; Star-100, CDC, 1974; Cray-1, Cray Research, 1978) utilized at LLL are usually purchased before mature operating system software is available for them [22].

The very early operating systems implemented at LLL were quite simple and were usually coded in assembly language. By the time of the CDC 6600 (1965), however, they were becoming more complex timesharing systems. By 1966 it was decided to write the operating system for the 6600 in a higher level language. This decision made it easier to transfer that system (dubbed LTSS, Livermore Time Sharing System) to new machines as they arrived: CDC 7600, CDC Star-100, and the Cray-1.

Another important development at LLL that began about the time of the first LTSS was networking. It started with a local packet switched message network for sharing terminals and a star type local file transport network for sharing central file storage (e.g. the trillion bit IBM photodigital storage device). These early networks worked out so well that they eventually multiplied to include a Computer Hardcopy Output Recording network, a Remote Job Entry Terminal network, a data acquisition network, a tape farm, a high speed MASS storage network, and others. The entire interconnected facility has retained the name "Octopus" [12, 27] from its earliest days as a star topology.

Recent developments in high speed local networking [5, 11, 17] are making it easier to flexibly connect new high speed processors into a comprehensive support network like Octopus. This very ease of hardware interconnection, however, is forcing some rethinking of software interconnection issues to ensure that the software interconnects as easily as the hardware [26, 27].
3. Motivation for Network LTSS

Recently the network systems group at LLL has started down a significant new fork in the LTSS development path. The new version of LTSS is different enough from the existing versions that it has been variously called New LTSS or Network LTSS (NLTSS). Many of the reasons for the new development have little to do with networking. For example, NLTSS shares resources with capabilities [4, 8, 10, 16, 20, 21, 28]. This allows it to combine the relatively ad hoc sharing mechanisms of older LTSS versions into a more general facility providing principle-of-least-privilege domain protection. It is only the lowest level network related motivations for the NLTSS development, however, that we will consider here.

When a processor is added to a mature high speed local area network like Octopus, it needs very little in the way of peripherals [27]. For example, when a Cray-1 computer was recently added to Octopus, it came with only some disks, a console, and a high speed network interface. All of the other peripherals (terminals, mass storage, printers, film output, tape drives, card readers, etc.) are accessed through the network. The operating system on a machine like this faces two basic problems when it is connected to the network:

• A. How to make all the facilities available on the network available to its processes, and
• B. How to make all of the resources that it and its processes supply available to processes out on the network (as well as its own processes).

Typical third generation operating systems have concerned themselves with supplying local processes access to local resources. They do this via operating system calls. There are system calls for reading and writing files (access to the storage resource), running processes (access to the processing resource), reading and writing tapes (access to a typical peripheral resource), etc. When networks came along, it was natural to supply access to the network resources by supporting system calls to send and receive data on the network (Fig. 3).
3.1 The Dual Access Dichotomy

Unfortunately, however, this approach is fraught with difficulties for existing operating systems. Just supporting general network communication is not at all an easy task, especially for operating systems without a flexible interprocess communication facility. In fact, if flexible network communication system calls are added to an operating system, they provide a de facto interprocess communication mechanism (though usually with too much overhead for effective local use).

Even systems that are able to add flexible network communication calls create a dual access problem for their users (Fig. 5). For example, consider a user programming a utility to read and write magnetic tapes. If a tape drive is remote, it must be accessed via the network communication system calls. On the other hand, if the drive is local, it must be accessed directly via a tape access operating system call. Since any resource may be local or remote, users must always be prepared to access each resource in two possible ways.
Figure 5 - The Dual Access Dichotomy for direct call operating systems.

3.2 The Dual Service Dichotomy

The problem of making local resources available to a network has proven difficult for existing operating systems. The usual approach is to have one or more "server" processes waiting for requests from the network (Fig. 6). These server processes then make local system calls to satisfy the remote requests and return results through the network. Examples of this type of server (though somewhat complicated by access control and translation issues) are the ARPA network file transfer server and Telnet user programs [6, 7]. With this approach there are actually two service codes for each resource, one in the system monitor for local service and one in the server process for remote access.
Figure 6 - The Dual Service Dichotomy for direct call operating systems.

The major network motivation for the New LTSS development is to solve problems A. and B. in future versions of LTSS in such a way as to avoid the dual access and dual service dichotomies. By doing so, NLTSS also reaps some consequential benefits such as user and server mobility, user extendibility, and others.
4. The Overall NLTSS Philosophy

NLTSS provides only a single message system call (described in the next section). Figure 7 illustrates the view that an NLTSS process has of the world outside its memory space. Deciding how and where to deliver message data is the responsibility of the NLTSS message system and the rest of the distributed data delivery network.
Figure 7 - NLTSS processes have only the distributed message system for dealing with the world outside their memory spaces.

4.1 Avoiding the Dual Access and Dual Service Dichotomies

There are two fundamentally opposite methods of avoiding the dual access dichotomy: either make all resource accesses appear local, or make all resource accesses appear remote. The TENEX Resource Sharing EXECutive (RSEXEC) is an example of the former approach [23]. Under the RSEXEC, system calls are trapped and interpreted to see if they refer to local or remote resources. The interpreter must then be capable of both access modes (Fig. 8).
Figure 8 - Emulation technique for removing dual access from user codes.
NLTSS uses the opposite approach. Since all NLTSS resource requests are made and serviced with message exchanges, the message system is the only part of NLTSS that need distinguish between local and remote transfers (Fig. 9). Also, since the distinction made by the message system is independent of the message content, NLTSS eliminates the dual access dichotomy rather than just moving it away from the user as the RSEXEC and similar systems do.
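The point that only the message system distinguishes local from remote transfers, and that it does so without looking at message content, can be illustrated with a small sketch. This is not the NLTSS implementation: the `MessageSystem` class, the `(machine, process)` address format, the machine names, and the `read_tape` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MessageSystem:
    """Hypothetical per-machine message system (illustrative only)."""
    machine: str
    local_mailboxes: dict = field(default_factory=dict)  # process name -> payloads
    network_out: list = field(default_factory=list)      # (address, payload) pairs

    def send(self, to_addr, payload):
        # Assumption: an address is a (machine, process) pair.
        # The routing decision uses only the address, never the payload.
        machine, process = to_addr
        if machine == self.machine:
            self.local_mailboxes.setdefault(process, []).append(payload)
            return "local"
        self.network_out.append((to_addr, payload))
        return "remote"


def read_tape(msg_sys, tape_server_addr):
    # The requester builds the same request whether the server is local or remote.
    return msg_sys.send(tape_server_addr, b"Read block 0")


ms = MessageSystem(machine="cray1")
print(read_tape(ms, ("cray1", "tape_server")))    # local
print(read_tape(ms, ("cdc7600", "tape_server")))  # remote
```

The user code in `read_tape` has a single access path, in contrast to the tape utility example of Section 3.1.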
Figure 9 - Uniform remote access in a message-passing operating system.

NLTSS is able to avoid the dual service dichotomy by having the resource service processes be the only codes that service resource requests (Fig. 10). This means, however, that all "system calls" must go through the NLTSS message system. The major difficulty faced in the NLTSS design is to supply resource access with this pure message-passing mechanism and yet still keep system overhead at least as low as that found in the competing third generation operating systems available to LLL.
Figure 10 - Uniform remote service in a message-passing operating system.

4.2 Comparable Systems

There have been many operating system designs and implementations that supply all resource access through a uniform interprocess communication facility [1, 2, 3, 8, 10, 15, 16, 21, 24, 28]. These interprocess communication mechanisms generally do not extend readily into a network, however. For example, in a system that utilizes shared memory for communication, remote processes have difficulty communicating with processes that expect such direct memory access. Capability based systems generally experience difficulty extending the capability passing mechanism into the network [4, 8, 10, 16, 20, 21, 28].

NLTSS is certainly not the first pure message-passing system [1, 15, 24]. In fact, it is remarkably similar to a system proposed by Walden [24]. Any contributions that NLTSS has to make will come from the care that was given to exclude system overhead and yet still support full service access to local and remote resources through a uniform message-passing mechanism.
5. The NLTSS Message System Interface

Since all resource access in NLTSS is provided through the message system, the message system interface is a key element in the system design. The major goal of the NLTSS message system interface design was to supply a simple, flexible communication facility with an average local system overhead comparable to the cost of copying the communicated data. To do this it was necessary to minimize the number of times that the message system must be called. Another important goal was to allow data transfers from processes directly to and from local peripherals without impacting the uniformity of the message system interface.
5.1 The Buffer Table

The most important element in the NLTSS message system design is a data structure that has been called a Buffer Table (Fig. 11). A linked list of Buffer Tables is passed to the NLTSS message system when a user process executes a system call (Fig. 12).

The NLTSS Buffer Table
1. Link
2. Action bits (Activate, Cancel, and Wait)
3. Send/Receive bit
4. Done bit
5. Beginning (BOM) and end (EOM) of message bits
6. Receive-From-Any and Receive-To-Any bits
7. To and From network addresses
8. Base and length buffer description
9. Buffer offset pointer
10. Status
Figure 11 - The NLTSS Buffer Table.
Figure 12 - The NLTSS message system call.

The Buffer Table fields are used as follows:

1. The Link field is a pointer to the next Buffer Table (if any) to be processed by the message system. When the message system is called, it is passed the head of this linked list of Buffer Tables. The linkage mechanism provides for data chaining of message pieces to and from a single address pair, for activation of parallel data transfers, and for waiting on completion of any number of data transfers.
2. The Action bits indicate what actions are to be performed by the message system during a call:
   ○ The Activate bit requests initiation of a transfer. If the transfer can't be completed immediately because the matching Buffer Table is remote or because of an insufficient matching buffer size, the message system remembers the active Buffer Table for later action.
   ○ The Cancel bit requests deactivation of a previously activated Buffer Table. The Cancel operation completes immediately unless a peripheral is currently transferring into or out of the buffer.
   ○ The Wait action bit requests that the process be awakened when this Buffer Table is Done (see Done bit below).
3. The Send/Receive bit indicates the direction of the data transfer.
4. The Done bit is set by the message system when a previously activated Buffer Table is deactivated due to completion, error, or explicit Cancel.
5. The BOM and EOM bits provide a mechanism for logical message demarcation. In a send Buffer Table, the BOM bit indicates that the first data bit in the buffer marks the beginning of a message. Similarly, the EOM bit indicates that the last bit in the buffer marks the end of a message. For receive Buffer Tables the BOM and EOM bits are set to indicate the separation in incoming data.
6. The Receive-From-Any and Receive-To-Any bits are only meaningful for receive Buffer Tables. If on, they indicate that the Buffer Table will match (for data transfer) a send Buffer Table with anything in the corresponding address field (see below). Of course data will only be routed to this receive buffer if its "To" address actually addresses the activating process. If an "Any" bit is set, the corresponding address is filled in upon initiation of a transfer and the "Any" bit is turned off.
7. The To and From address fields indicate the address pair (or association) over which the data transfer occurs. The From address is checked for validity.
8. The Base and Length fields define the data buffer (bit address and bit length).
9. The Offset field is updated to point just after the last bit of data in the buffer successfully transferred (relative to Base).
10. The Status field is set by the message system to indicate the current state of the transfer.
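The ten fields listed above can be pictured as a record with a link to the next record. The following is a minimal sketch only: the field names, Python types, and the `chain` and `walk` helpers are assumptions made for illustration, and the real Buffer Table is a packed structure of bits and machine addresses rather than a Python object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BufferTable:
    """Illustrative model of the Buffer Table fields (names are hypothetical)."""
    link: Optional["BufferTable"] = None  # 1. next Buffer Table, if any
    activate: bool = False                # 2. action bits
    cancel: bool = False
    wait: bool = False
    send: bool = False                    # 3. True = send, False = receive
    done: bool = False                    # 4. set by the message system
    bom: bool = False                     # 5. beginning of message
    eom: bool = False                     #    end of message
    receive_from_any: bool = False        # 6. only meaningful for receives
    receive_to_any: bool = False
    to_addr: Optional[str] = None         # 7. address pair
    from_addr: Optional[str] = None
    base: int = 0                         # 8. buffer bit address
    length: int = 0                       #    buffer bit length
    offset: int = 0                       # 9. bits transferred so far
    status: int = 0                       # 10. state of the transfer


def chain(tables):
    """Link the tables in order and return the head passed to the system call."""
    for current, following in zip(tables, tables[1:]):
        current.link = following
    return tables[0] if tables else None


def walk(head):
    """Yield each Buffer Table in the linked list, as the message system would."""
    while head is not None:
        yield head
        head = head.link
```

A process would build one such list, set the Activate and Wait bits it needs, and hand the head of the list to the single message system call.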
It should be noted that the NLTSS message system call is designed to minimize the number of times that a process must execute a system call. Generally a process will call the message system only when it has no processing left to do until some communication completes. It is also important that messages of arbitrary length can be exchanged (even by processes that have insufficient memory space to hold an entire message).

The BOM and EOM message separators are in many ways like virtual circuit opening and closing indicators. It is expected that NLTSS message systems interfacing with virtual circuit networks (e.g. an X.25 network) will open circuits at the beginning of a message and close them at the end. The first network protocol that the NLTSS message system will be interfaced with, however, has been designed to eliminate the opening and closing of circuits while still maintaining logical message separation very much as the NLTSS message system interface does [13, 25, 26].
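The use of BOM and EOM to exchange a message larger than any single buffer can be sketched as follows. This is an illustration under stated assumptions: buffers are modelled as Python byte strings (the paper describes bit-addressed buffers), and `split_message` and `reassemble` are hypothetical helper names, not part of NLTSS.

```python
def split_message(data: bytes, buffer_size: int):
    """Cut one message into (bom, eom, chunk) pieces no larger than buffer_size."""
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    pieces = []
    for start in range(0, len(data), buffer_size):
        chunk = data[start:start + buffer_size]
        bom = start == 0                        # first piece begins the message
        eom = start + buffer_size >= len(data)  # last piece ends the message
        pieces.append((bom, eom, chunk))
    return pieces


def reassemble(pieces):
    """Rebuild whole messages from a stream of (bom, eom, chunk) pieces."""
    messages = []
    current = None
    for bom, eom, chunk in pieces:
        if bom:
            current = bytearray()
        if current is None:
            raise ValueError("data received before any BOM")
        current.extend(chunk)
        if eom:
            messages.append(bytes(current))
            current = None
    return messages
```

Neither side needs to hold the entire message at once; only the markers on the first and last pieces carry the message boundaries.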
6. The Structure of the NLTSS Monitor

The paucity and simplicity of the NLTSS system calls allow its monitor to be quite small and simple (a distinct advantage at LLL where memory is always in short supply and security is an important consideration).

Essentially all that is in the NLTSS monitor is the message call handler and device drivers for directly attached hardware devices (Fig. 4). In the case of the CPU device, the driver contains the innermost parts of the scheduler (the so-called Alternator) and memory manager (that is, those parts that implement mechanism, not policy).

One property of the current NLTSS monitor implementations is that each device driver must share some resident memory with a hardware interface process for its device. For example, the storage driver must share some memory with the storage access process, and the alternator must share some memory with the process server. This situation is a little awkward on machines that don't have memory mapping hardware. On systems with only base and bounds memory protection, for example, it forces the lowest level device interface processes to be resident.
7. The NLTSS File System

The file system illustrates several features of the NLTSS design and implementations.

The basic service provided by the file system is to allow processes to read and write data stored outside their memory spaces. The way in which a process gets access to a file involves the NLTSS capability protocol [26] and is beyond the scope of this paper. We will assume that the file server has been properly instructed to accept requests on a file from a specific network address. The trust that the servers have in the "From" address delivered by the message system is the basis for the higher-level NLTSS capability protection mechanisms [10, 14].

The simplest approach for a file server to take might be to respond to a message of the form "Read", portion description (To file server, From requesting process) with a message containing either "OK", data or "Error" (To requesting process, From file server).

Unfortunately, this approach would require that the file server be responsible for both storage allocation (primarily a policy matter) and storage access (a mechanism). Either that or the file server would have to flush all data transfers through itself on their way to or from a separate storage access process.

The mechanism implemented in NLTSS is pictured in Fig. 13. To read or write a file, a process must activate three Buffer Tables. For reading, it activates a send of the command to the file server, a receive for the returned status, and a separate receive for the data returned from the storage access process. For writing, it activates similar command and status Buffer Tables, but in place of a data receive, it activates a data send to the storage access process.
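The three Buffer Tables activated for a read can be sketched as one linked list. The structure below is a deliberately reduced, hypothetical model: the `BufferTable` fields shown, the string addresses, the `build_read_request` helper, and the choice to set Wait on both receives are all assumptions for illustration and are not taken from the NLTSS implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BufferTable:
    """Reduced, illustrative Buffer Table (only the fields this sketch needs)."""
    send: bool                 # True = send, False = receive
    to_addr: str
    from_addr: str
    data: bytes = b""
    activate: bool = True      # every table in the request is activated
    wait: bool = False
    link: Optional["BufferTable"] = None


def build_read_request(me, file_server, storage_access, command):
    """Return the head of the three-table list for a file read."""
    # 1. Send the command to the file server.
    command_bt = BufferTable(send=True, to_addr=file_server, from_addr=me, data=command)
    # 2. Receive the returned status from the file server.
    status_bt = BufferTable(send=False, to_addr=me, from_addr=file_server, wait=True)
    # 3. Receive the data from the storage access process.
    data_bt = BufferTable(send=False, to_addr=me, from_addr=storage_access, wait=True)
    command_bt.link = status_bt
    status_bt.link = data_bt
    return command_bt


head = build_read_request("user_proc", "file_server", "storage_access", b"Read 0..4095")
```

For a write, the third table would instead be a send addressed to the storage access process, as the text describes.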
Figure 13 - The NLTSS file system.

This example illustrates the importance of the linkage mechanism in the message system interface. In most systems a file access request requires only one system call. Through the linkage mechanism, NLTSS shares this property. In fact, in NLTSS a process can initiate and/or wait on an arbitrary number of other transfers at the same time. For example, when initiating a file request, it may be desirable to also send an alarm request (return a message after T units of time) and wait for either the file status message or the alarm response.

When the file server gets a read or write request, it translates the logical file access request into one or more physical storage access requests that it sends to the storage access process. In this request it includes the network address for the data transfer (this was included in the original "Read" or "Write" request). Having received the storage access request, the access process can receive the written data and write it to storage or read the data from storage and send it to the "Read"ing process.

This mechanism works fine in the case where the requesting process and the storage access process are on separate machines (note that the file server can be on yet a third machine). In this case the data must be buffered as it is transferred to or from storage. In the case where the requesting process and the storage access processes are on the same machine, however, it is possible to transfer the data directly to or from the memory space of the requesting process. In fact, many third generation operating systems perform this type of direct data transfer.

To be a competitive stand-alone operating system, NLTSS must also take advantage of this direct transfer opportunity. In our implementations, the mechanism to take advantage of direct I/O requires an addition to the message system.
There are two additional action bits available in the Buffer Tables of device access processes, IOLock and IOUnLock. If a device access process wants to attempt a direct data transfer, it sets the IOLock bit in its Buffer Table before activation. If the message system finds a match in a local process, instead of copying the data, it will lock the matching process in memory and return the Base address (absolute), Length and Offset of its buffer in the IOLocking Buffer Table. The device access process can then transfer the data directly to or from storage. The IOUnLock operation releases the lock on the requesting process's memory and updates the status of the formerly locked Buffer Table.

The most important aspect of this direct I/O mechanism is that it has no effect on the operation of the requesting process OR on that of the file server. Only the device access process (which already has to share resident memory to interact with its device driver) and the message system need be aware of the direct I/O mechanism.
8. A Semaphore Server Example

The example of an NLTSS semaphore [9, 10] server can be used to further illustrate the flexibility of the NLTSS message system. The basic idea of the semaphore server is to implement a logical semaphore resource to support the following operations:

1. "P": semaphore number (To semaphore server, From requester) - Decrement the integer value of the semaphore by 1. If its new value is less than zero then add the "From" address of the request to a list of pending notifications. Otherwise send a notification immediately.
2. "V": semaphore number (To semaphore server, From requester) - Increment the value of the semaphore by 1. If its value was less than zero then send a notification to the oldest address in the pending notification list and remove the address from the list.

Typically such a semaphore resource is used by several processes to coordinate exclusive access to a shared resource (a file for example). In this case, after the semaphore value is initialized to 1, each process sends a "P" request to the semaphore server to lock the resource and awaits notification before accessing it (note that the first such locking process will get an immediate notification). After accessing the resource, each process sends a "V" request to the semaphore server to unlock the resource.

An NLTSS implementation of such a server might keep the value of the semaphore and a notification list for each supported semaphore. The server would at all times keep a linked list of Buffer Tables used for submission to the message system. This list would be initialized with some number (chosen to optimize performance) of receive Buffer Tables "To" the semaphore server and "From" any. These Buffer Tables would also have their Activate and Wait action bits turned on.

The semaphore server need only call the message system after making a complete scan of its receive Buffer Tables without finding any requests to process (i.e. any with Done bits on).
Any Done receive requests can be processed as indicated above (1. and 2.). If a notification is to be sent, an appropriate send Buffer Table with only the Activate action bit on can be added to the Buffer Table list for the next message system call. These send Buffer Tables are removed from the list after every message system call.

Processes may in general be waiting on some receive completions to supply more data, and for some send completions to free up more output buffer space. Even in this most general situation, however, they need only call the message system when they have no processing left to do.

This semaphore server example can be compared with that given in [10] to illustrate how the network operating system philosophy has evolved at LLL over the years. In earlier designs, for example, capabilities were handled only by the resident monitor. In the NLTSS implementations, the resident monitor handles only the communication and hardware multiplexing described here. Resource access in NLTSS is still managed by capabilities, but this matter is handled as a protocol between the users and servers [26]. The integrity of the capability access protection mechanism is built on the simpler data protection and address control implemented in the distributed network message system of which NLTSS can be a component [10, 14].
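The "P" and "V" request handling described in this section can be sketched as follows. This is an illustration only, not the NLTSS server: the `SemaphoreServer` class, its `create` and `handle` methods, and the `outbox` list (standing in for the send Buffer Tables queued for the next message system call) are hypothetical names, and message delivery itself is not modelled.

```python
from collections import deque


class SemaphoreServer:
    """Illustrative semaphore server logic (all names are hypothetical)."""

    def __init__(self):
        self.values = {}   # semaphore number -> integer value
        self.pending = {}  # semaphore number -> queue of waiting "From" addresses
        self.outbox = []   # (to_addr, semaphore number) notifications to send

    def create(self, number, initial=1):
        self.values[number] = initial
        self.pending[number] = deque()

    def handle(self, op, number, from_addr):
        if op == "P":
            self.values[number] -= 1
            if self.values[number] < 0:
                # Requester must wait: remember its address for later notification.
                self.pending[number].append(from_addr)
            else:
                self.outbox.append((from_addr, number))
        elif op == "V":
            was_negative = self.values[number] < 0
            self.values[number] += 1
            if was_negative:
                # Wake the oldest waiter and remove it from the list.
                waiter = self.pending[number].popleft()
                self.outbox.append((waiter, number))
        else:
            raise ValueError(f"unknown operation {op!r}")


server = SemaphoreServer()
server.create(7, initial=1)
server.handle("P", 7, "proc_a")  # value 0: proc_a is notified immediately
server.handle("P", 7, "proc_b")  # value -1: proc_b is queued
server.handle("V", 7, "proc_a")  # value 0: proc_b is notified
print(server.outbox)             # [('proc_a', 7), ('proc_b', 7)]
```

In the scheme described above, each entry in `outbox` would become a send Buffer Table with only the Activate bit on, added to the list for the next message system call.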