- An often-quoted tenet of UNIX-like systems such as Linux or BSD is that everything is a file.
- The concept of a file is a good abstraction of either a sink for, or source of, data. As such it is an excellent abstraction of all the devices one might attach to the computer.
- No one person can understand everything from designing a modern user-interface to the internal workings of a modern CPU, much less build it all themselves. To programmers, abstractions are the common language that allows us to collaborate and invent.
- Learning to navigate across abstractions gives one greater insight into how to use the abstractions in the best and most innovative ways.
- In general, abstraction is implemented by what is generically termed an Application Programming Interface (API).
- A common method used in the Linux kernel and other large C code bases, which lack a built-in concept of object-orientation, is function pointers. Learning to read this idiom is key to navigating most large C code bases. By learning to read the abstractions provided within the code, you can build an understanding of the internal API designs.
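A minimal sketch of the function-pointer idiom; the `struct dev_ops` and the function names here are made up for illustration (loosely in the spirit of structures like the kernel's `file_operations`, not taken from any real code base). A structure of function pointers acts like a table of virtual methods, and each "driver" fills it in with its own implementations.

```c
#include <stdio.h>

/* A hypothetical "operations" table: each driver supplies its own
 * implementations of these functions. */
struct dev_ops {
    int (*open)(const char *name);
    int (*read)(char *buf, int len);
};

static int null_open(const char *name) { printf("opening %s\n", name); return 0; }
static int null_read(char *buf, int len) { (void)buf; (void)len; return 0; /* always empty */ }

/* A concrete "driver" is just a filled-in table of function pointers. */
static const struct dev_ops null_dev = { .open = null_open, .read = null_read };

/* Generic code calls through the pointers without knowing which driver it has. */
static int do_read(const struct dev_ops *ops, char *buf, int len)
{
    return ops->read(buf, len);
}

int main(void)
{
    char buf[16];
    null_dev.open("null0");
    printf("read %d bytes\n", do_read(&null_dev, buf, sizeof(buf)));
    return 0;
}
```

The generic code only ever calls through the table, so new "drivers" can be added without changing it.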
- Libraries have two roles which illustrate abstraction.
- Allow programmers to reuse commonly accessed code.
- Act as a black box implementing functionality for the programmer.
- The standard library of a UNIX platform is generically referred to as libc. It provides the basic interface to the system: fundamental calls such as read(), write(), and printf(). This API is described in its entirety by a specification called POSIX.
- Libraries are a fundamental abstraction with many details. The value returned by an 'open' call is termed a file descriptor and is essentially an index into an array of open files kept by the kernel.
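A small sketch of the file-descriptor idea; `/etc/hostname` is just an arbitrary example path.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* open() returns a small integer: an index into the kernel's
     * record of files this process has open (or -1 on error). */
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf)); /* refer to the file by its descriptor */
    printf("fd = %d, read %zd bytes\n", fd, n);

    close(fd);
    return 0;
}
```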
- Starting at the lowest level, the operating system requires a programmer to create a device driver to be able to communicate with a hardware device. This device driver is written to an API provided by the kernel; the device driver will provide a range of functions which are called by the kernel in response to various requirements.
- To provide the abstraction to user-space, the kernel provides a file-interface via what is generically termed a device layer. Physical devices on the host are represented by a file in a special file system such as /dev.
- Mounting a file system has the dual purpose of setting up a mapping so the file system knows the underlying device that provides the storage, and so the kernel knows that files opened under that mount-point should be directed to the file system driver.
- The shell is the gateway to interacting with the operating system.
- The 'pipe' is an in-memory buffer that connects two processes together; file descriptors point to the pipe object, which buffers data sent to it (via a write) to be drained (via a read).
- Writes to the pipe are stored by the kernel until a corresponding read from the other side drains the buffer. This is a very powerful concept and is one of the fundamental forms of inter-process communication or IPC in UNIX-like operating systems.
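A minimal sketch of pipe-based IPC: the parent writes into the kernel's buffer and the child drains it with a read.

```c
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];                 /* fds[0] = read end, fds[1] = write end */
    if (pipe(fds) == -1)
        return 1;

    if (fork() == 0) {          /* child: drain the buffer */
        char buf[32];
        ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
        buf[n > 0 ? n : 0] = '\0';
        printf("child read: %s\n", buf);
        _exit(0);
    }

    /* parent: the write is held in the kernel's buffer until the child reads it */
    write(fds[1], "hello", strlen("hello"));
    wait(NULL);
    return 0;
}
```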
- Binary is a base-2 number system that uses two mutually exclusive states to represent information. A binary number is made up of elements called bits where each bit can be in one of the two possible states.
- We can essentially choose to represent anything by a number, which can be converted to binary and operated on by the computer.
- Parity allows a simple check of the bits of a byte to ensure they were read correctly. We can implement either odd or even parity by using an extra bit as a parity bit. In odd parity, if the number of 1s in the data is even, the parity bit is set so that the total number of 1s is odd; even parity is the opposite, setting the parity bit so that the total number of 1s is even.
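A small sketch of computing a parity bit for one byte (even parity here: the bit is chosen so the total count of 1s, data plus parity, comes out even); the function name is illustrative.

```c
#include <stdio.h>

/* Count the 1 bits in a byte and return the parity bit needed
 * to make the total number of 1s even ("even parity"). */
static unsigned int even_parity_bit(unsigned char byte)
{
    unsigned int ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (byte >> i) & 1u;
    return ones % 2;            /* 1 only if the data has an odd number of 1s */
}

int main(void)
{
    unsigned char b = 0x53;     /* 0101 0011: four 1 bits, so the parity bit is 0 */
    printf("parity bit for 0x%02x = %u\n", b, even_parity_bit(b));
    return 0;
}
```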
- It can be very useful to commit base-2 factors to memory as an aid to quickly correlate the relationship between number-of-bits and "human" sizes.
- Electronically, the Boolean operations are implemented in gates made by transistors.
- Computers only ever deal in binary and hexadecimal is simply a shortcut for us humans trying to work with the computer.
- In low level code, it is often important to keep your structures and variables as space efficient as possible. In some cases, this can involve effectively packing two (generally related) variables into one.
- Often a program will have a large number of variables that only exist as flags to some condition.
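A common sketch of packing several boolean flags into one integer with bitwise operations; the flag names are made up for illustration.

```c
#include <stdio.h>

/* Each flag occupies one bit of a single variable. */
#define FLAG_READABLE  (1u << 0)
#define FLAG_WRITABLE  (1u << 1)
#define FLAG_DIRTY     (1u << 2)

int main(void)
{
    unsigned int flags = 0;

    flags |= FLAG_READABLE | FLAG_WRITABLE;   /* set two flags */
    flags &= ~FLAG_WRITABLE;                  /* clear one flag */

    if (flags & FLAG_READABLE)                /* test a flag */
        printf("readable\n");
    if (!(flags & FLAG_DIRTY))
        printf("not dirty\n");
    return 0;
}
```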
- C is the common language of the systems programming world. Every operating system and its associated system libraries in common use are written in C, and every system provides a C compiler.
- The "glue" between the C standard and the underlying architecture is the Application Binary Interface (or ABI) which we discuss below.
- In a typed language, such as C, every variable must be declared with a type. The type tells the computer about what we expect to store in a variable; the compiler can then both allocate sufficient space for this usage and check that the programmer does not violate the rules of the type.
- The C99 standard purposely only mentions the smallest possible size of each of the types defined for C. This is because across different processor architectures and operating systems the best size for types can be wildly different.
- To be completely safe, programmers need to never assume the size of any of their variables.
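A quick sketch of querying sizes at runtime rather than assuming them, and of the C99 fixed-width types from <stdint.h> for when an exact size matters.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Sizes vary across architectures and ABIs, so query them rather
     * than hard-coding assumptions. */
    printf("int      : %zu bytes\n", sizeof(int));
    printf("long     : %zu bytes\n", sizeof(long));
    printf("void *   : %zu bytes\n", sizeof(void *));

    /* When an exact width is required, use the C99 fixed-width types. */
    uint32_t exactly_32_bits = 0;
    printf("uint32_t : %zu bytes\n", sizeof(exactly_32_bits));
    return 0;
}
```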
- Pointers are really just an address (i.e. their value is an address and thus "points" somewhere else in memory) therefore a pointer needs to be sufficient in size to be able to address any memory in the system.
- A 64-bit variable is so large that its full range is not generally required for most variables.
- 'signed' and 'unsigned' are probably the two most important qualifiers; they say whether a variable can take on a negative value or not.
- Qualifiers are all intended to pass extra information about how the variable will be used to the compiler. This means two things; the compiler can check if you are violating your own rules and it can make optimizations based upon the extra knowledge.
- By implementing two's complement hardware designers need only provide logic for addition circuits; subtraction can be done by two's complement negating the value to be subtracted and then adding the new value. Similarly you could implement multiplication with repeated addition and division with repeated subtraction. Consequently two's complement can reduce all simple mathematical operations down to addition!
- All modern computers use two's complement representation.
- Because of two's complement format, when increasing the size of a signed value, it is important that the additional bits be sign-extended; that is, copied from the top-bit of the existing value.
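A small sketch of two's complement behaviour: negation is "invert the bits and add one", and widening a negative value copies (sign-extends) the top bit.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Two's complement negation: invert the bits and add one. */
    int8_t five = 5;
    int8_t neg  = (int8_t)(~five + 1);
    printf("-5 as a two's complement byte: 0x%02x\n", (uint8_t)neg);   /* 0xfb */

    /* Sign extension: widening -5 to 32 bits copies the top bit
     * into all of the new high-order bits. */
    int32_t wide = neg;
    printf("-5 widened to 32 bits: 0x%08x\n", (uint32_t)wide);         /* 0xfffffffb */
    return 0;
}
```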
- To create a decimal number, we require some way to represent the concept of the decimal place in binary. The most common scheme for this is the IEEE-754 floating point standard.
- The CPU performs instructions on values held in registers.
- To greatly simplify, a computer consists of a central processing unit (CPU) attached to memory.
- The CPU executes instructions read from memory. There are two categories of instructions:
- Those that load values from memory into registers and store values from registers to memory.
- Those that operate on values stored in registers. For example adding, subtracting, multiplying or dividing the values in two registers, performing bitwise operations, or performing other mathematical operations.
- Internally, the CPU keeps a record of the next instruction to be executed in the instruction pointer. Usually, the instruction pointer is incremented to point to the next instruction sequentially; a branch instruction will usually check if a specific register is zero or if a flag is set and, if so, will modify the pointer to a different address. Thus the next instruction to execute will come from a different part of the program; this is how loops and decision statements work.
- Executing a single instruction consists of a particular cycle of events.
- Fetch: get the instruction from memory into the processor.
- Decode: internally decode what it has to do.
- Execute: take the values from the registers and actually perform the operation (for example, add them together).
- Store: store the result back into another register.
- The CPU has two main types of registers, those for integer calculations and those for floating point calculations.
- A register file is the collective name for the registers inside the CPU.
- The Arithmetic Logic Unit (ALU) is the heart of the CPU operation. It takes values in registers and performs any of the multitude of operations the CPU is capable of. All modern processors have a number of ALUs so each can be working independently.
- The Address Generation Unit (AGU) handles talking to cache and main memory to get values into the registers for the ALU to operate on, and to get values out of registers back into main memory.
- Acquire semantics mean that memory operations after this instruction may not be reordered to happen before it, so later code is guaranteed to see its result (typically used when taking a lock). Release semantics mean that all memory operations before this instruction must be completed and visible before it takes effect (typically used when releasing a lock).
- As we know from the memory hierarchy, registers are the fastest type of memory and ultimately all instructions must be performed on values held in registers, so all other things being equal more registers leads to higher performance.
- The CPU can only directly fetch instructions and data from cache memory, located directly on the processor chip. Cache memory must be loaded in from the main system memory (RAM). RAM, however, only retains its contents when the power is on, so data also needs to be kept on more permanent storage.
- Cache memory is memory actually embedded inside the CPU.
- The important point to know about the memory hierarchy is the trade offs between speed and size--the faster the memory the smaller it is.
- The reason caches are effective is because computer code generally exhibits two forms of locality:
- Spatial locality suggests that data within blocks is likely to be accessed together.
- Temporal locality suggests that data that was used recently will likely be used again shortly.
- Cache is one of the most important elements of the CPU architecture.
- When data is only read from the cache there is no need to ensure consistency with main memory. However, when the processor starts writing to cache lines it needs to make some decisions about how to update the underlying main memory.
- A write-through cache will write the changes directly into the main system memory as the processor updates the cache.
- A write-back cache delays writing the changes to RAM until absolutely necessary.
- To quickly decide if an address lies within the cache, it is separated into three parts: the tag, the index and the offset.
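A sketch of splitting an address into tag, index and offset; the cache geometry here (64-byte lines, 256 sets) is hypothetical, chosen only to give concrete field widths.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cache geometry: 64-byte lines (6 offset bits)
 * and 256 sets (8 index bits); everything above those bits is the tag. */
#define OFFSET_BITS 6
#define INDEX_BITS  8

int main(void)
{
    uint64_t addr = 0x7ffd1234;

    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("addr 0x%llx -> tag 0x%llx, index %llu, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)index, (unsigned long long)offset);
    return 0;
}
```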
- Peripherals are any of the many external devices that connect to your computer.
- The communication channel between the processor and the peripheral is called a bus.
- A device requires both input and output to be useful.
- An interrupt allows the device to literally interrupt the processor to flag some information. Each device is assigned an interrupt by some combination of the operating system and BIOS.
- Devices are generally connected to a programmable interrupt controller (PIC), a separate chip that is part of the motherboard which buffers and communicates interrupt information to the main processor. Each device has a physical interrupt line between it and one of the PICs provided by the system. When the device wants to interrupt, it will modify the voltage on this line.
- A very broad description of the PIC's role is that it receives this interrupt and converts it to a message for consumption by the main processor. While the exact procedure varies by architecture, the general principle is that the operating system has configured an interrupt descriptor table which pairs each of the possible interrupts with a code address to jump to when the interrupt is received.
- Writing the interrupt handler is the job of the device driver author in conjunction with the operating system.
- A generic overview of handling an interrupt. The device raises the interrupt to the interrupt controller, which passes the information onto the processor. The processor looks at its descriptor table, filled out by the operating system, to find the code to handle the fault.
- Most drivers will split the handling of interrupts into top and bottom halves. The top half acknowledges the interrupt, queues actions for processing and returns the processor to what it was doing quickly. The bottom half then runs later, when the CPU is free, and does the more intensive processing. This is to stop an interrupt hogging the entire CPU.
- While an interrupt is generally associated with an external event from a physical device, the same mechanism is useful for handling internal system operations.
- There are two main ways of signalling interrupts on a line--level and edge triggered.
- It is important for the system to be able to mask or prevent interrupts at certain times. Generally, it is possible to put interrupts on hold, but a particular class of interrupts, called non-maskable interrupts (NMI), are the exception to this rule.
- The most common form of IO is called memory mapped IO where registers on the device are mapped into memory. This means that to communicate with the device, you need simply read or write to a specific address in memory.
- Direct Memory Access (DMA) is a method of transferring data directly between peripheral and system RAM.
- Snooping is where a processor listens on a bus, which all processors are connected to, for cache events, and updates its cache accordingly.
- Having [multiple] processors all on the same bus starts to present physical problems. Physical properties of wires only allow them to be laid out at certain distances from each other and to only have certain lengths. With processors that run at many gigahertz, the speed of light starts to become a real consideration in how long it takes messages to move around a system.
- Much of the time of a modern processor is spent waiting for much slower devices in the memory hierarchy to deliver data for processing. Thus strategies to keep the pipeline of the processor full are paramount.
- A cluster is simply a number of individual computers which have some ability to talk to each other. At the hardware level the systems have no knowledge of each other; the task of stitching the individual computers together is left up to software.
- Programmers need to use techniques such as profiling to analyze the code paths taken and the consequences their code has for the system, in order to extract the best performance.
- Programmers use a higher level of abstraction called locking to allow simultaneous operation of programs when there are multiple CPUs. When a program acquires a lock over a piece of code, no other processor can obtain the lock until it is released. Before any critical pieces of code, the processor must attempt to take the lock; if it cannot acquire it, it does not continue.
- Locking schemes make programming more complicated, as it is possible to deadlock programs.
- A lock that has just two states--locked or unlocked--is referred to as a mutex (short for mutual exclusion; that is, if one holder has it the other can not have it).
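A minimal sketch of a mutex using POSIX threads: both threads must take the lock before touching the shared counter, so their updates cannot interleave.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* only one thread may hold the lock */
        counter++;                   /* the critical section */
        pthread_mutex_unlock(&lock); /* release so the other thread can proceed */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld\n", counter);   /* always 200000 with the lock held */
    return 0;
}
```

Build with `cc -pthread`; without the lock the final count usually comes out short because of lost updates.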
- The fundamental operation of the operating system (OS) is to abstract the hardware to the programmer and user. The operating system provides generic interfaces to services provided by the underlying hardware.
- The processes the kernel is running live in userspace, and the kernel talks to hardware both directly and through drivers.
- The kernel is the operating system.
- Just as the kernel abstracts the hardware to user programs, drivers abstract hardware to the kernel.
- The Linux kernel implements a module system, where drivers can be loaded into the running kernel "on the fly" as they are required.
- System calls are how userspace programs interact with the kernel.
- Each and every system call has a system call number which is known by both the userspace and the kernel.
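A sketch showing that a system call is ultimately just a number agreed on by userspace and the kernel; on Linux, glibc's syscall() wrapper can invoke one by number directly.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>   /* SYS_getpid and friends (Linux-specific) */
#include <unistd.h>

int main(void)
{
    /* getpid() via libc and via its raw system call number give the same answer. */
    long via_number = syscall(SYS_getpid);
    printf("getpid() = %d, syscall(SYS_getpid) = %ld\n", getpid(), via_number);
    return 0;
}
```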
- The Application Binary Interface (ABI) is very similar to an API, but rather than being an interface for source code it describes the conventions that compiled code, the operating system and the hardware must agree on at the machine level.
- Ensuring the application only accesses memory it owns is implemented by the virtual memory system. The essential point is that the hardware is responsible for enforcing these rules.
- The process ID (or the PID) is assigned by the operating system and is unique to each running process.
- Program code and data should be kept separately, since they require different permissions from the operating system and separation facilitates sharing of code. The operating system needs to give program code permission to be read and executed, but generally not written to. On the other hand, data (variables) requires read and write permissions but should not be executable.
- Stacks are fundamental to function calls. Each time a function is called it gets a new "stack frame". This is an area of memory which usually contains, at a minimum, the address to return to when complete, the input arguments to the function and space for local variables.
- Stacks are ultimately managed by the compiler, as it is responsible for generating the program code. To the operating system the stack just looks like any other area of memory for the process.
- To keep track of the current growth of the stack, the hardware defines a register as the stack pointer.
- The heap is an area of memory that is managed by the process for on-the-fly memory allocation. This is for variables whose memory requirements are not known at compile time.
- The end of the heap is known as the brk, so called for the system call which modifies it. By using the brk call to extend the heap, the process can request the kernel allocate more memory for it to use.
- The heap is most commonly managed by the malloc library call. This makes managing the heap easy for the programmer by allowing them to simply allocate and free heap memory.
- Due to the complexity of managing memory correctly, it is very uncommon for any modern program to have a reason to call brk directly.
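A small sketch of the usual interface: the program asks malloc() for memory whose size is only known at runtime and hands it back with free(), never touching brk itself.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Size only known at runtime, so the memory cannot be laid out at compile time. */
    size_t n = 32;
    char *buf = malloc(n);
    if (buf == NULL)
        return 1;

    strcpy(buf, "allocated on the heap");
    printf("%s\n", buf);

    free(buf);          /* hand the memory back to the allocator */
    return 0;
}
```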
- Whilst the operating system can run many processes at the same time, in fact it only ever directly starts one process, called the init (short for initial) process. This isn't a particularly special process except that its PID is always 1 and it will always be running. All other processes can be considered children of this initial process.
- The return value from the fork() system call is the only way the process can determine if it was the existing process or the new one. The return value to the parent process will be the Process ID (PID) of the child, whilst the child will get a return value of 0.
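A minimal fork() sketch: after the call there are two processes, and the return value is the only thing that tells them apart.

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        /* The child sees a return value of 0. */
        printf("child:  my pid is %d\n", (int)getpid());
        return 0;
    }

    /* The parent sees the child's PID. */
    printf("parent: fork() returned %d\n", (int)pid);
    wait(NULL);
    return 0;
}
```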
- Separate processes can not see each other's memory. They can only communicate with each other via system calls. Threads, however, share the same memory, so you have the advantage of multiple streams of execution without the expense of having to use system calls to communicate between them.
- A running system has many processes, maybe even into the hundreds or thousands. The part of the kernel that keeps track of all these processes is called the scheduler because it schedules which process should be run next.
- Scheduling strategies can broadly fall into two categories.
- Co-operative scheduling is where the currently running process voluntarily gives up executing to allow another process to run.
- Preemptive scheduling is where the process is interrupted to stop it to allow another process to run.
- Hard-realtime systems make guarantees about scheduling decisions, such as the maximum amount of time a process will be interrupted before it can run again. They are often used in life-critical applications such as medical, aircraft and military systems.
- Big-O notation is a way of describing how long an algorithm takes to run given increasing inputs.
- On a UNIX system, the shell is the standard interface to handling a process on your system.
- The primary job of the shell is to help the user handle starting, stopping, and otherwise controlling processes running in the system.
- Processes running in the system require a way to be told about events that influence them. On UNIX there is infrastructure between the kernel and processes called signals which allows a process to receive notification about events important to it.
- When a signal is sent to a process, the kernel invokes the handler which the process has registered with the kernel to deal with that signal. A handler is simply a designated function in the code that has been written specifically to deal with that signal.
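A small sketch of registering a handler with sigaction(): the process tells the kernel which function to run when, say, SIGINT arrives.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;

/* The function the kernel will invoke when SIGINT is delivered. */
static void on_sigint(int signum)
{
    (void)signum;
    got_signal = 1;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);      /* register the handler with the kernel */

    printf("press Ctrl-C...\n");
    while (!got_signal)
        pause();                       /* sleep until a signal arrives */

    printf("caught SIGINT\n");
    return 0;
}
```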
- Virtual memory is all about making use of address space.
- The address space of a processor refers to the range of possible addresses that it can use when loading and storing to memory. The address space is limited by the width of the registers, since, as we know, to load an address we need to issue a load instruction with the address to load from stored in a register.
- Every program compiled in 64-bit mode requires 8-byte pointers, which can increase code and data size, and hence impact both instruction and data cache performance.
- While 64-bit processors have 64-bit wide registers, systems generally do not implement all 64 bits for addressing--it is not actually possible to do a load or store to all 16 exabytes of theoretically addressable memory.
- As with most components of the operating system, virtual memory acts as an abstraction between the address space and the physical memory available in the system. This means that when a program uses an address that address does not refer to the bits in an actual physical location in memory.
- All addresses a program uses are virtual. The operating system keeps track of virtual addresses and how they are allocated to physical addresses. When a program does a load or store from an address, the processors and operating system work together to convert this virtual address to the actual address in the system memory chips.
- The total address-space is divided into individual pages. Pages can be many different sizes; generally they are around 4 KiB. The page is the smallest unit of memory that the operating system and hardware can deal with.
- Just as the operating system divides the possible address space up into pages, it divides the available physical memory up into frames. A frame is just the conventional name for a hunk of physical memory the same size as the system page size.
- The operating system keeps a frame-table, which is a list of all frames of physical memory and whether they are free (available for allocation) or not. When memory is allocated to a process, it is marked as used in the frame-table. In this way, the operating system keeps track of all memory allocations.
- It is the job of the operating system to keep track of which virtual page points to which physical frame. This information is kept in a page-table which, in its simplest form, could simply be a table where each row contains the associated frame--this is termed a linear page-table.
- Virtual address translation refers to the process of finding out which physical frame a given virtual page maps to.
- By giving each process its own page table, every process can pretend that it has access to the entire address space available from the processor.
- In a system without virtual memory, every process has complete access to all of system memory. This means that there is nothing stopping one process from overwriting another process's memory, causing it to crash (or perhaps worse).
- Systems that use virtual memory are inherently more stable because, assuming the perfect operating system, a process can only crash itself and not the entire system.
- Virtual memory is necessarily quite dependent on the hardware architecture, and each architecture has its own subtleties.
- All processors have some concept of either operating in physical or virtual mode. In physical mode, the hardware expects that any address will refer to an address in actual system memory. In virtual mode, the hardware knows that addresses will need to be translated to find their physical address.
- Segmentation is really only interesting as a historical note, since virtual memory has made it less relevant.
- In segmentation there are a number of registers which hold an address that is the start of a segment. The only way to get to an address in memory is to specify it as an offset from one of these segment registers.
- The Translation Lookaside Buffer (TLB) is the main component of the processor responsible for virtual-memory. It is a cache of virtual-page to physical-frame translations inside the processor. The operating system and hardware work together to manage the TLB as the system runs.
- A program that can be loaded directly into memory needs to be in a straight binary format. The process of converting source code, written in a language such as C, to a binary file ready to be executed is called compiling.
- A compiled program is completely dependent on the hardware of the machine it is compiled for, since it must be able to simply be copied to memory and executed. A virtual machine is an abstraction of hardware into software.
- The linking process is really two steps: combining all object files into one executable file and then going through each object file to resolve any symbols. This usually requires two passes; one to read all the symbol definitions and take note of unresolved symbols, and a second to fix up all those unresolved symbols to point to the right place.
- At a minimum, any executable file format will need to specify where the code and data are in the binary file. These are the two primary sections within an executable file.
- The common thread between all executable file formats is that they include a predefined, standardized header which describes how program code and data are stored in the rest of the file.
- a.out is a very simple header format that only allows a single data, code, and BSS section. This is insufficient for modern systems with dynamic libraries.
- The ELF specification provides for symbol tables which are simply mappings of strings (symbols) to locations in the file. Symbols are required for linking.
- A relocation is simply a blank space left to be patched up later.
- Sections are a way to organize the binary into logical areas to communicate information between the compiler and the linker.
- The .bss section is defined for global variables whose value should be zero when the program starts.
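A small sketch of which section a global ends up in: an initialized global goes in .data (its value must be stored in the binary), while a zero-valued global only needs its size recorded in .bss.

```c
#include <stdio.h>

int initialised = 42;   /* stored in .data: the value 42 lives in the binary file     */
int zeroed;             /* stored in .bss: only its size is recorded, and the loader
                           provides zero-filled memory for it at program start        */

int main(void)
{
    printf("initialised = %d, zeroed = %d\n", initialised, zeroed);
    return 0;
}
```

Running `size` or `objdump -h` on the resulting binary shows .bss growing as more zero-initialized globals are added, without the file itself getting bigger.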
- A library is simply a collection of functions which you can call from your program.
- A shared library is a library that is loaded dynamically at runtime for each application that requires it.
- Dynamic linking is one of the more intricate parts of a modern operating system.
- A core dump is simply a complete snapshot of the program as it was running at a particular time.
- ABI's refer to lower level interfaces which the compiler, operating system, and to some extent processor must agree on to communicate together.
- The kernel needs to communicate some things to programs when they start up; namely the arguments to the program, the current environment variables, and a special structure called the Auxiliary Vector or auxv. The kernel communicates this by putting all the required information on the stack for the newly created program to pick up. Thus when the program starts it can use its stack pointer to find all the startup information required.
- The auxiliary vector is a special structure that is for passing information directly from the kernel to the newly running program.
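A sketch of reading one auxv entry; glibc exposes the vector through getauxval(), so the program does not have to walk its own stack to find it.

```c
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval() and the AT_* constants (glibc-specific) */

int main(void)
{
    /* The kernel passes the system page size to every new program in auxv. */
    unsigned long page_size = getauxval(AT_PAGESZ);
    printf("AT_PAGESZ = %lu\n", page_size);
    return 0;
}
```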
- Libraries are very much like a program that never gets started. They have code and data sections (functions and variables) just like every executable; but no where to start running. They just provide a library of functions for developers to call.
- The dynamic linker is the program that manages shared dynamic libraries on behalf of an executable. It works to load libraries into memory and modify the program at runtime to call the functions in the library.
- The essential part of the dynamic linker is fixing up addresses at runtime, which is the only time you can know for certain where you are loaded in memory. A relocation can simply be thought of as a note that a particular address will need to be fixed at load time. Before the code is ready to run, you will need to go through and read all the relocations and fix the addresses they refer to so they point to the right place.
- All libraries must be produced with code that can execute no matter where it is put into memory, known as position independent code (PIC).
- The important points to remember [about dynamic linking] are:
- Library calls in your program actually call a stub of code in the PLT of the binary.
- That stub code loads an address and jumps to it.
- Initially, that address points to a function in the dynamic linker which is capable of looking up the "real" function, given the information in the relocation entry for that function.
- The dynamic linker re-writes the address that the stub code reads, so that the next time the function is called it will go straight to the right address.
- With only static libraries there is much less potential for problems, as all library code is built directly into the binary of the application.
- The binding of a symbol dictates its external visibility during the dynamic linking process. A local symbol is not visible outside the object file it is defined in. A global symbol is visible to other object files, and can satisfy undefined references in other objects.
- A weak reference is a special type of lower priority global reference. This means it is designed to be overridden.
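A sketch of a weak definition using the GCC/Clang attribute; a strong (ordinary) global definition of the same symbol elsewhere would override it at link time.

```c
#include <stdio.h>

/* Weak default implementation: a strong definition of do_work() in another
 * object file or library takes priority over this one when linking. */
__attribute__((weak)) void do_work(void)
{
    printf("default (weak) do_work\n");
}

int main(void)
{
    do_work();
    return 0;
}
```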
Computer Science from the Bottom Up