Lecture Notes on Operating Systems
Marvin Solomon
Computer Sciences Department
University of Wisconsin -- Madison
[email protected] Mon Jan 24 13:28:57 CST 2000 Copyright © 1996-1999 by Marvin Solomon. All rights reserved.
Contents

• Introduction
  • History
  • What is an OS For?
  • Bottom-up View
  • Top-Down View
  • Course Outline
• Java for C++ Programmers
• Processes and Synchronization
  • Using Processes
    • What is a Process?
    • Why Use Processes
    • Creating Processes
    • Process States
  • Synchronization
    • Race Conditions
    • Semaphores
    • The Bounded Buffer Problem
    • The Dining Philosophers
    • Monitors
    • Messages
  • Deadlock
    • Terminology
    • Deadlock Detection
    • Deadlock Recovery
    • Deadlock Prevention
    • Deadlock Avoidance
  • Implementing Processes
    • Implementing Monitors
    • Implementing Semaphores
    • Implementing Critical Sections
    • Short-term Scheduling
• Memory Management
  • Allocating Main Memory
    • Algorithms for Memory Management
    • Compaction and Garbage Collection
    • Swapping
  • Paging
    • Page Tables
    • Page Replacement
    • Frame Allocation for a Single Process
    • Frame Allocation for Multiple Processes
    • Paging Details
  • Segmentation
    • Multics
    • Intel x86
• Disks
• File Systems
  • The User Interface to Files
    • Naming
    • File Structure
    • File Types
    • Access Modes
    • File Attributes
    • Operations
  • The User Interface to Directories
  • Implementing File Systems
    • Files
    • Directories
    • Symbolic Links
    • Mounting
    • Special Files
    • Long File Names
  • Space Management
    • Block Size and Extents
    • Free Space
  • Reliability
    • Bad-block Forwarding
    • Back-up Dumps
    • Consistency Checking
    • Transactions
  • Performance
• Protection and Security
  • Security
    • Threats
    • The Trojan Horse
    • Design Principles
  • Authentication
  • Protection Mechanisms
    • Access Control Lists
    • Capabilities
  • Encryption
    • Key Distribution
    • Public Key Encryption
CS 537 Lecture Notes Part 1
Introduction

Contents
• History
• What is an OS For?
• Bottom-up View
• Top-Down View
• Course Outline
History

The first computers were built for military purposes during World War II, and the first commercial computers were built during the 50's. They were huge (often filling a large room with tons of equipment), expensive (millions of dollars, back when that was a lot of money), unreliable, and slow (about the power of today's $1.98 pocket calculator).

Originally, there was no distinction between programmer, operator, and end-user (the person who wants something done). A physicist who wanted to calculate the trajectory of a missile would sign up for an hour on the computer. When his time came, he would come into the room, feed in his program from punched cards or paper tape, watch the lights flash, maybe do a little debugging, get a print-out, and leave. The first card in the deck was a bootstrap loader. The user/operator/programmer would push a button that caused the card reader to read that card, load its contents into the first 80 locations in memory, and jump to the start of memory, executing the instructions on that card. Those instructions read in the rest of the cards, which contained the instructions to perform all the calculations desired: what we would now call the "application program".

This set-up was a lousy way to debug a program, but more importantly, it was a waste of the fabulously expensive computer's time. Then someone came up with the idea of batch processing. User/programmers would punch their jobs on decks of cards, which they would submit to a professional operator. The operator would combine the decks into batches. He would precede the batch with a batch executive (another deck of cards). This program would read the remaining programs into memory and run them, one at a time. The operator would take the printout from the printer, tear off the part associated with each job, wrap it around the associated deck, and put it in an output bin for the user to pick up.

The main benefit of this approach was that it minimized the wasteful down time between jobs. However, it did not solve the growing I/O bottleneck. Card readers and printers got faster, but since they are mechanical devices, there were limits to how fast they could go. Meanwhile the central processing unit (CPU) kept getting faster and was spending more and more time idly waiting for the next card to be read in or the next line of output to be printed.

The next advance was to replace the card reader and printer with magnetic tape drives, which were much faster. A separate, smaller, slower (and presumably cheaper) peripheral computer would copy batches of input decks onto tape and transcribe output tapes to print. The situation was better, but there were still problems. Even magnetic tape drives were not fast enough to keep the mainframe CPU busy, and the peripheral computers, while cheaper than the mainframe, were still not cheap (perhaps hundreds of thousands of dollars).
Then someone came up with a brilliant idea. The card reader and printer were hooked up to the mainframe (along with the tape drives), and the mainframe CPU was reprogrammed to switch rapidly among several tasks. First it would tell the card reader to start reading the next card of the next input batch. While it was waiting for that operation to finish, it would go and work for a while on another job that had been read into "core" (main memory) earlier. When enough time had gone by for that card to be read in, the CPU would temporarily set aside the main computation, start transferring the data from that card to one of the tape units (say tape 1), start the card reader reading the next card, and return to the main computation. It would continue this way, servicing the card reader and tape drive when they needed attention and spending the rest of its time on the main computation. Whenever it finished working on one job in the main computation, the CPU would read another job from an input tape that had been prepared earlier (tape 2). When it finished reading in and executing all the jobs from tape 2, it would swap tapes 1 and 2. It would then start executing the jobs from tape 1, while the input "process" was filling up tape 2 with more jobs from the card reader. Of course, while all this was going on, a similar process was copying output from yet another tape to the printer.

This amazing juggling act was called Simultaneous Peripheral Operations On Line, or SPOOL for short. The hardware that enabled SPOOLing is called direct memory access, or DMA. It allows the card reader to copy data directly from cards to core and the tape drive to copy data from core to tape, while the expensive CPU is doing something else. The software that enabled SPOOLing is called multiprogramming. The CPU switches from one activity, or "process", to another so quickly that it appears to be doing several things at once.

In the 1960's, multiprogramming was extended to ever more ambitious forms. The first extension was to allow more than one job to execute at a time. Hardware developments supporting this extension included the decreasing cost of core memory (replaced during this period by semiconductor random-access memory (RAM)) and the introduction of direct-access storage devices (called DASD - pronounced "dazdy" - by IBM and "disks" by everyone else). With larger main memory, multiple jobs could be kept in core at once, and with input spooled to disk rather than tape, each job could get directly at its part of the input. With more jobs in memory at once, it became less likely that they would all be simultaneously blocked waiting for I/O, leaving the expensive CPU idle.

Another break-through idea from the 60's based on multiprogramming was timesharing, which involves running multiple interactive jobs, switching the CPU rapidly among them so that each interactive user feels as if he has the whole computer to himself. Timesharing let the programmer back into the computer room - or at least a virtual computer room. It allowed the development of interactive programming, making programmers much more productive. Perhaps more importantly, it supported new applications such as airline reservation and banking systems that allowed 100s or even 1000s of agents or tellers to access the same computer "simultaneously". Visionaries talked about a "computing utility" by analogy with the water and electric utilities, which would deliver low-cost computing power to the masses. Of course, it didn't quite work out that way.
The cost of computers dropped faster than almost anyone expected, leading to minicomputers in the '70s and personal computers (PCs) in the '80s. It was only in the '90s that the idea was revived, in the form of an information utility otherwise known as the information superhighway or the World-Wide Web. Today, computers are used for a wide range of applications, including personal interactive use (word-processing, games, desktop publishing, web browsing, email), real-time systems (patient care, factories, missiles), embedded systems (cash registers, wrist watches, toasters), and transaction processing (banking, reservations, e-commerce).
What is an OS For?

Beautification Principle
The goal of an OS is to make hardware look better than it is.
• More regular, uniform (instead of lots of idiosyncratic devices)
• Easier to program (e.g., don't have to worry about speeds, asynchronous events)
• Closer to what's needed for applications:
  • named, variable-length files, rather than disk blocks
  • multiple ``CPU's'', one for each user (in a shared system) or activity (in a single-user system)
  • multiple large, dynamically-growing memories (virtual memory)

Resource Principle
The goal of an OS is to mediate sharing of scarce resources.

Q: What is a ``resource''?
A: Something that costs money!

Why share?
• expensive devices
• need to share data (a database is an ``expensive device''!)
• cooperation between people (a community is an ``expensive device''!!)

Problems:
• getting it to work at all
• getting it to work efficiently
  • utilization (keeping all the devices busy)
  • throughput (getting a lot of useful work done per hour)
  • response (getting individual things done quickly)
• getting it to work correctly
  • limiting the effects of bugs (preventing idiots from ruining it for everyone)
  • preventing unauthorized
    • access to data
    • modification of data
    • use of resources
    (preventing bad guys from ruining it for everyone)
Bottom-up View (starting with the hardware)

Hardware (summary; more details later)
• Components
  • one or more central processing units (CPU's)
  • main memory (RAM, core)
  • I/O devices
  • a bus, or other communication mechanism, that connects them all together
• CPU has a PC [1] (a sketch of the resulting fetch-execute cycle follows this list)
  • fetches instructions one at a time from the location specified by the PC
  • increments the PC after fetching an instruction; branch instructions can also alter the PC
  • responds to "interrupts" by jumping to a different location (like an unscheduled procedure call)
• Memory responds to "load" and "store" requests from the CPU, one at a time.
• I/O device
  • Usually looks like a chunk of memory to the CPU.
  • CPU sets options and starts I/O by sending "store" requests to a particular address.
  • CPU gets back status and small amounts of data by issuing "load" requests.
  • Direct memory access (DMA): the device may transfer large amounts of data directly to/from memory by doing loads and stores just like a CPU.
  • Issues an interrupt to the CPU to indicate that it is done.
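The fetch-execute cycle and interrupt response can be summarized in code. Here is a minimal sketch, written in Java for concreteness; all the names (CpuSketch, handlerAddress, execute) are hypothetical, invented for illustration, since real hardware does this in circuitry:

    // A toy model of the CPU loop described above.
    class CpuSketch {
        int pc = 0;                        // program counter
        int[] memory = new int[1024];      // main memory, one word per cell
        boolean interruptPending;          // set by a device, e.g., when DMA completes

        void run() {
            for (;;) {
                if (interruptPending) {            // respond to an interrupt by
                    interruptPending = false;      // jumping to a different location
                    pc = handlerAddress();         // (an unscheduled procedure call)
                }
                int instruction = memory[pc];      // fetch from the location the PC specifies
                pc++;                              // increment the PC after fetching
                execute(instruction);              // a branch may itself change the PC
            }
        }

        int handlerAddress() { return 0; }         // placeholder
        void execute(int instruction) { }          // placeholder
    }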
Timing problem
• I/O devices are millions or even billions of times slower than the CPU. E.g.:
  • A typical PC executes >10 million instructions/sec.
  • A typical disk takes >10 ms to get one byte from disk. Ratio: 100,000 : 1.
  • A typical typist types 60 wpm = 1 word/sec = 5 bytes/sec = 200 ms/byte = 2 million instructions per key-stroke. And that doesn't include head-scratching time!
• Solution:

      start disk device
      do 100,000 instructions of other useful computation
      wait for disk to finish

  A terrible program to write and debug. And it would change with a faster disk!
• Better solution:

      Process 1:
          for (;;) {
              start I/O
              wait for it to finish
              use the data for something
          }

      Process 2:
          for (;;) {
              do some useful computation
          }

  The operating system takes care of switching back and forth between process 1 and process 2 as ``appropriate''. (Question: which process should have higher priority?) A sketch of this structure, using Java threads, follows this list.
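In Java (the language used for this course's projects), the two-process structure above maps naturally onto threads. This is only a minimal sketch; the file name data.bin and the helper methods are made up for illustration. The operating system (and the Java runtime) switch the CPU between the two threads, so useful computation proceeds while the I/O thread is blocked. Like the pseudocode above, both loops run forever.

    import java.io.FileInputStream;
    import java.io.IOException;

    class Overlap {
        public static void main(String[] args) {
            Thread io = new Thread(() -> {
                byte[] buffer = new byte[512];
                try (FileInputStream in = new FileInputStream("data.bin")) {
                    for (;;) {
                        int n = in.read(buffer);   // start I/O; this thread blocks here
                        if (n < 0) break;          // end of file
                        use(buffer, n);            // use the data for something
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            Thread compute = new Thread(() -> {
                for (;;) {                         // runs whenever the I/O thread is blocked
                    doSomeUsefulComputation();
                }
            });
            io.start();
            compute.start();
        }

        static void use(byte[] b, int n) { /* placeholder */ }
        static void doSomeUsefulComputation() { /* placeholder */ }
    }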
Space problem
• Most of the time, a typical program is "wasting" most of the memory space allocated to it.
  • Looping in one subroutine (wasting space allocated to the rest of the program)
  • Fiddling with one data structure (wasting space allocated to other data structures)
  • Waiting for I/O or user input (wasting all of its space)
• Solution: virtual memory
  • Keep the program and data on disk (100-1000 times cheaper per byte).
  • The OS automatically copies to memory the pieces needed by the program, on demand (a sketch of this demand-loading idea follows this list).
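The demand-loading idea can be sketched in a few lines of Java. This is only an analogy, with hypothetical names (DemandPagingSketch, residentPages, loadPageFromDisk); a real OS does this with hardware page tables and a page-fault trap, not an explicit map lookup:

    import java.util.HashMap;
    import java.util.Map;

    class DemandPagingSketch {
        static final int PAGE_SIZE = 4096;
        // pages currently copied into main memory, indexed by page number
        Map<Integer, byte[]> residentPages = new HashMap<>();

        byte read(int virtualAddress) {
            int pageNumber = virtualAddress / PAGE_SIZE;
            int offset = virtualAddress % PAGE_SIZE;
            byte[] page = residentPages.get(pageNumber);
            if (page == null) {                       // a "page fault"
                page = loadPageFromDisk(pageNumber);  // copy it in on demand
                residentPages.put(pageNumber, page);
            }
            return page[offset];
        }

        byte[] loadPageFromDisk(int pageNumber) {
            return new byte[PAGE_SIZE];               // placeholder for the disk read
        }
    }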
Top-Down View (what does it look like to various kinds of users?)
• End user.
  • Wants to get something done (bill customers, write a love letter, play a game, design a bomb).
  • Doesn't know what an OS is (or care!). May not even realize there is a computer there.
• Application programmer.
  • Writes software for end users. Uses the ``beautified'' virtual machine:
    • named files of unlimited size
    • unlimited memory
    • read/write returns immediately
  • Calls library routines (see the sketch after this list)
    • some really are just subroutines written by someone else
      • sort an array
      • solve a differential equation
      • search a string for a character
    • others call the operating system
      • read/write
      • create process
      • get more memory
• Systems programmer (you, at the end of this course)
  • Creates abstractions for application programmers
  • Deals with real devices
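The two kinds of library routines can be seen in a few lines of Java. Arrays.sort really is just a subroutine written by someone else, while FileReader.read eventually traps into the operating system's read call. This is a minimal sketch, and the file name letter.txt is made up for illustration:

    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Arrays;

    class LibraryCalls {
        public static void main(String[] args) throws IOException {
            int[] data = {5, 3, 1, 4, 2};
            Arrays.sort(data);                 // pure library code: no OS call involved

            char[] buffer = new char[100];
            try (FileReader in = new FileReader("letter.txt")) {
                int n = in.read(buffer);       // ends in an operating-system read
                System.out.println(new String(buffer, 0, Math.max(n, 0)));
            }
        }
    }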
Course Outline
1. Processes
   • What processes are.
   • Using processes
     • synchronization and communication
       • semaphores, critical regions, monitors, conditions
       • messages, pipes
     • process structures
       • pipelines, producer/consumer, remote procedure call
     • deadlock
   • Implementing processes
     • mechanism
       • critical sections
       • process control block
       • process swap
       • semaphores, monitors
     • policy (short-term scheduling)
       • fcfs, round-robin, shortest-job next, multilevel queues
2. Memory
   • Main-memory allocation
   • Swapping, overlays
   • Stack allocation (implementation of programming languages)
   • Virtual memory
     • hardware
       • paging, segmentation, translation lookaside buffer
     • policy
       • page-replacement algorithms
         • random, fifo, lru, clock, working set
3. I/O devices
   • device drivers, interrupt handlers
   • disks
     • hardware characteristics
     • disk scheduling
       • elevator algorithm
4. File systems
   • file naming
   • file structure (user's view)
     • flat (array of bytes)
     • record-structured
     • indexed
     • random-access
     • metadata
     • mapped files
   • implementation
     • structure
       • linked, tree-structured, B-tree
       • inodes
     • directories
     • free-space management
5. Protection and security
   • threats
   • access policy
     • capabilities, access-control lists
   • implementation
     • authentication/determination/enforcement
   • encryption
     • conventional
     • public-key
     • digital signatures

[1] In this course, PC stands for program counter, not personal computer or politically correct.
CS 537 Lecture Notes Part 2
Java for C++ Programmers

Contents
• Introduction
• A First Example
• Names, Packages, and Separate Compilation
• Values, Objects, and Pointers
• Garbage Collection
• Static, Final, Public, and Private
• Arrays
• Strings
• Constructors and Overloading
• Inheritance, Interfaces, and Casts
• Exceptions
• Threads
• Input and Output
• Other Goodies
Introduction

The purpose of these notes is to help students in Computer Sciences 537 (Introduction to Operating Systems) at the University of Wisconsin - Madison learn enough Java to do the course projects. The Computer Sciences Department is in the process of converting most of its classes from C++ to Java as the principal language for programming projects. CS 537 was the first course to make the switch, in the Fall term of 1996. At that time virtually all the students had heard of Java and none had used it. Over the last few years more and more of our courses were converted to Java. Finally, last year (1998-99), the introductory programming prerequisites for this course, CS 302 and CS 367, were taught in Java. Nonetheless, many students are unfamiliar with Java, having learned how to program from earlier versions of 302 and 367, or from courses at other institutions.
Applications vs Applets

The first thing you have to decide when writing a Java program is whether you are writing an application or an applet. An applet is a piece of code designed to display a part of a document. It is run by a browser (such as Netscape Navigator or Microsoft Internet Explorer) in response to an