Justin Spencer: xv6 by Russ Cox, Frans Kaashoek, and Robert Morris

xv6: a simple, unix-like teaching operating system

The job of an operating system is to share a computer among multiple programs and to provide a more useful set of services than the hardware alone supports.
An operating system provides services to user programs through an interface. Designing a good interface turns out to be difficult.
xv6 takes the traditional form of a kernel, a special program that provides services to running programs.
Each running program, called a process, has memory containing instructions, data, and a stack. The instructions implement the program's computation. The data are the variables on which the computation acts. The stack organizes the program's procedure calls.
When a process needs to invoke a kernel service, it invokes a procedure call in the operating system interface. Such a procedure call is called a system call. The system call enters the kernel; the kernel performs the service and returns. Thus a process alternates between executing in user space and kernel space.
The kernel uses the CPU's hardware protection mechanisms to ensure that each process executing in user space can access only its own memory.
When a user program invokes a system call, the hardware raises the privilege level and starts executing a pre-arranged function in the kernel.
The shell is an ordinary program that reads commands from the user and executes them.
An xv6 process consists of user-space memory (instructions, data, and stack) and per-process state private to the kernel.
When a process is not executing, xv6 saves its CPU registers, restoring them when it next runs the process.
A process may create a new process using the fork system call. Fork creates a new process, called the child process, with exactly the same memory contents as the calling process, called the parent process.
The exit system call causes the calling process to stop executing and to release resources such as memory and open files.
Although the child has the same memory contents as the parent initially, the parent and child are executing with different memory and different registers: changing a variable in one does not affect the other.
The exec system call replaces the calling process's memory with a new memory image loaded from a file stored in the file system.
xv6 does not provide a notion of users or of protecting one user from another; in Unix terms, all xv6 processes run as root.
A file descriptor is a small integer representing a kernel-managed object that a process may read from or write to.
Internally, the xv6 kernel uses the file descriptor as an index into a per-process table, so that every process has a private space of file descriptors starting at zero.
The read and write system calls read bytes from and write bytes to open files named by file descriptors.
The dup system call duplicates an existing file descriptor, returning a new one that refers to the same underlying I/O object.
File descriptors are a powerful abstraction, because they hid the details of what they are connected to: a process writing to file descriptor 1 may be writing to a file, to a device like the console, or to a pipe.
A pipe is a small kernel buffer exposed to processes as a pair of file descriptors, one for reading and one for writing. Writing data to one end of the pipe makes that data available for reading from the other end of the pipe. Pipes provide a way for processes to communicate.
Pipes have at least four advantages over temporary files:

Pipes automatically clean themselves up.
Pipes can pass arbitrarily long streams of data.
Pips allow for parallel executing of pipeline stages.
If you are implementing inter-process communication, pipes' blocking reads and writes are more efficient than the non-blocking semantics of files.

The xv6 file system provides data files, which are uninterpreted byte arrays, and directories, which contain named references to data files and other directories. The directories form a tree, starting at a special directory called the root.
mknod creates a file in the file system, but the file has no contents. Instead, the file's metadata marks it as a device file and records the major and minor device numbers, which uniquely identify a kernel device.
A file's name is distinct from the file itself; the same underlying file, called an inode, can have multiple names, called links. The link system call creates another file system name referring to the same inode as an existing file.
The unlink system call removes a name from the file system. The file's inode and the disk space holding it's content are only freed when the file's link count is zero and no file descriptors refer to it.
cd must change the current working directory of the sell itself.
Unix's combination of the "standard" file descriptors, pipes, and convenient shell syntax for operations on them was a major advance in writing general-purpose reusable programs. The idea sparked a whole culture of "software tools" that was responsible for much of Unix's power and popularity, and the shell was the first so-called "scripting language".
The Unix system call interface has been standardized through the Portable Operating System Interface (POSIX) standard.
Our main goals for xv6 are simplicity and clarity while providing a simple UNIX-like system call interface.
The authors of Unix went on to build Plan 9, which applied the "resources are files" concept to modern facilities, representing networks, graphics, and other resources as files or file trees.
Any operating system must multiplex processes onto the underlying hardware, isolate processes from each other, and provide mechanisms for controlled inter-process communication.
A key requirement for an operating system is to support several activities at once.
An operating system must fulfill three requirements: multiplexing, isolation, and interaction.
To achieve strong isolation it's helpful to forbid applications from directly accessing sensitive hardware resources, and instead to abstract the resources into services.
Unix transparently switches hardware processors among processes, saving and restoring register state as necessary, so that applications don't have to be aware of time sharing.
Unix processes use exec to build up their memory image, instead of directly interacting with physical memory. This allows the operating system to decide where to place a process in memory; if memory is tight, the operating system might even store some of a process's data on disk.
The Unix interface is not the only way to abstract resources, but it has proven to be a very good one.
Applications shouldn't be able to modify (or even read) the operating system's data structures or instructions, should be able to access other process's memory, etc.
Processors provide hardware support for strong isolation.
An application can execute only user-mode instructions and is said to be running in user space, while the software in kernel mode can also execute privileged instructions and is said to be running in kernel space. The software running in kernel space (or in kernel mode) is called the kernel.
Processors provide a special instruction that switched the processor from user mode to kernel mode and enters the kernel at an entry point specified by the kernel.
A key design question is what part of the operating system should run in kernel mode. One possibility is that the entire operating system resides in the kernel, so that the implementations of all system calls run in kernel mode. This organization is called a monolithic kernel.
In a monolithic kernel, a mistake is fatal, because an error in kernel mode will often result in the kernel to fail. If the kernel fails, the computer stops working, and thus all applications fail too. The computer must reboot to start again.
To reduce the risk of mistakes in the kernel, OS designers can minimize the amount of operating system code that runs in kernel mode, and execute the bulk of the operating system in user mode. This kernel organization is called a microkernel.
OS services running as processes are called servers.
To allow applications to interact with the file server, the kernel provides an inter-process communication mechanism to send messages from one user-mode process to another.
In a microkernel, the kernel interface consists of a few low-level functions for starting applications, sending messages, accessing device hardware, etc. This organization allows the kernel to be relatively simple, as most of the operating system resides in user-level servers.
The unit of isolation in xv6 is a process. The process abstraction prevents one process from wrecking or spying on another process' memory, CPU, file descriptors, etc. It also prevents a process from wrecking the kernel itself, so that a process can't subvert the kernel's isolation mechanisms.
The mechanisms used by the kernel to implement processes include the user/kernel mode flag, address spaces, and time-slicing of threads.
A process provides a program with what appears to be a private memory system, or address space, which other processes can not read or write. A process also provides the program with what appears to be its own CPU to execute the program's instructions.
xv6 uses page tables (which are implemented by hardware) to give each process its own address space. The x86 page table translates (or "maps") a virtual address (the address than an x86 instruction manipulates) to a physical address (an address that the processor chip sends to main memory).
xv6 maintains a separate page table for each process that defines that process's address space.
Each process's address space maps the kernel's instructions and data as well as the user program's memory. When a process invokes a system call, the system call executes in the kernel mappings of the process's address space. This arrangement exists so that the kernel's system call code can directly refer to user memory.
A process's most important pieces of kernel state are its page table, its kernel stack, and its run state.
Each process has a thread of execution (or thread for short) that executes the process's instructions.
Each process has two stacks: a user stack and a kernel stack.
The kernel stack is separate (and protected from user code) so that the kernel can execute even if a process has wrecked its user stack.
When a process makes a system call, the processor switched to the kernel stack, raises the hardware privilege level, and starts executing the kernel instructions that implement the system call. When the system call completes, the kernel returns to user space: the hardware lowers its privilege level, switches back to the user stack, and resumes executing user instructions just after the system call instruction.
The first step in providing strong isolation is setting up the kernel to run in its own address space.
The way that control transfers from user software to the kernel is via an interrupt mechanism, which is used by system calls, interrupts, and exceptions.
Page tables are the mechanism through which the operating system controls what memory addresses mean. They allow xv6 to multiplex the address spaces of different process onto a single physical memory, and to protect the memories of different processes.
As a reminder, x86 instructions (both user and kernel) manipulate virtual addresses. The machine's RAM, or physical memory, is indexed with physical addresses. The x86 page table hardware connects these two kinds of addresses, by mapping each virtual address to a physical address.
Each process has a separate page table, and xv6 tells the page table hardware to switch page tables when xv6 switches between processes.
The kernel must allocate and free physical memory at run-time for page tables, process user memory, kernel stacks, and pipe buffers.
To guard a stack growing off the stack page, xv6 places a guard page right below the stack. The guard page is not mapped and so if the stack runs off the stack page, the hardware will generate an exception because it cannot translate the faulting address.
sbrk is the system call for a process to shrink or grow its memory.
Exec is the system call that creates the user part of an address space.
When running a process, a CPU executes the normal processor loop: read an instruction, advance the program counter, execute the instruction, repeat. But there are events on which control from a user program must transfer back to the kernel instead of executing the next instruction. These events include a device signaling that it wants attention, a user program doing something illegal, or a user program asking the kernel for a service with a system call.
There are three cases when control must be transferred from a user program to the kernel.

First, a system call: when a user program asks for an operating system service.
Second, an exception: when a program performs an illegal action.
Third, an interrupt: when a device generates a signal to indicate that it needs attention from the operating system.

On the x86, interrupt handlers are defined in the interrupt descriptor table (IDT). The IDT has 256 entries, each giving the %cs and %rip to be used when handling the corresponding interrupt.
To make a system call on the x86, a program invokes the int n instruction, where n specifies the index into the IDT.
An operating system can use the iret instruction to return from an int instruction. It pops the saved values during the int instruction from the stack, and resumed execution at the saved %eip.
Devices on the motherboard can generate interrupts, and xv6 must set up the hardware to handle these interrupts.
Interrupts are usually optional in the sense that the kernel could instead periodically check (or "poll") the device hardware to check for new events. Interrupts are preferable to polling if the events are relatively rare, so that polling would waste CPU time.
Interrupts are similar to system calls, except devices generate them at any time.
A driver is the code in an operating system that manages a particular device: it tells the device hardware to perform operations, configures the device to generate interrupts when done, and handles the resulting interrupts. Driver code can be tricky to write because a driver executes concurrently with the device that it manages.
In many operating systems, the drivers together account for more code in the operating system than the core kernel.
A lock provides mutual exclusion, ensuring that only one CPU at a time can hold the lock. If a lock is associated with each shared data item, and the code always holds the associate lock when using a given item, then we can be sure that the item is used from only one CPU at a time. In this situation, we say that the lock protects the data item.
You must keep in mind that a single C statement can be several machine instructions and thus another processor or an interrupt may muck around in the middle of a C statement. You cannot assume that lines of code on the page are executed atomically. Concurrency makes reasoning about correctness much more difficult.
A race condition is a situation in which a memory location is accessed concurrently, and at least one access is a write. A race is often a sign of a bug, either a lost update or a read of an incompletely-updated data structure. The outcome of a race depends on the exact timing of the two CPUs involved and how their memory operations are ordered by the memory system, which can make race-induced errors difficult to reproduce and debug.
The usual way to avoid races is to use a lock. Locks ensure mutual exclusion, so that only one CPU can execute at a time.
Invariants are properties of data structures that are maintained across operations.
Interrupts can cause concurrency even on a single processor: if interrupts are enabled, kernel code can be stopped at any moment to run an interrupt handler instead.
Many compilers and processors execute code out of order to achieve higher performance. If an instruction takes many cycles to complete, a processor may want to issue the instruction early so that it can overlap with other instructions and avoid processor stalls.
It is best to use locks as the base for higher-level constructs like synchronized queues.
It is possible to implement locks without atomic instructions, but it is expensive, and most operating systems use atomic instructions.
Switching from one thread to another involves saving the old thread's CPU registers, and restoring the previously-saved registers of the new thread; the fact that %esp and %eip are saved and restored means that the CPU will switch stacks and switch what code it is executing.
The reason to enable interrupts periodically on an idling CPU is that there might be no RUNNABLE process because processes are waiting for I/O; if the scheduler left interrupts disabled all the time, the I/O would never arrive.
The purpose of a file system is to organize and store data. File systems typically support sharing of data among users and applications, as well as persistence so that data is still available after a reboot.
The file system needs on-disk data structures to represent the tree of named directories and files, to record the identities of the blocks that hold each file's content, and to record which areas of the disk are free.
One of the most interesting problems in file system design is crash recovery. The problem arises because many file system operations involve multiple writes to the disk, and a crash after a subset of the writes may leave the on-disk file system in an inconsistent state.
From our point of view, we can abstract the PC into three components: CPU, memory, and input/output (I/O) devices. The CPU performs computation, the memory contains instructions and data for that computation, and devices allow the CPU to interact with hardware for storage, communication, and other functions.
A computer's CPU runs a conceptually simple loop: it consults an address in a register called the program counter, reads a machine instruction from that address in memory, advances the program counter past the instruction, and executes the instruction. Repeat.
A register is a storage cell inside the processor itself, capable of holding a machine word-size value. Data stored in registers can typically be read or written quickly, in a single CPU cycle.
The modern x86 provides eight general purpose 32-bit registers--eax, ebx, ecx, edx, edi, esi, ebp, and esp--and a program counter eip.
Main memory is 10-100x slower than a register, but it is much cheaper, so there can be more of it. One reason main memory is relatively slow is that it is physically separate from the processor chip.
The cache memory serves as a middle ground between registers and memory both in access time and in size.

Justin Spencer

Pages

20180127

xv6 by Russ Cox, Frans Kaashoek, and Robert Morris

No comments:

Post a Comment