Justin Spencer: Programming from the Ground Up by Jonathan Bartlett

One of the greatest programmers, Donald Knuth, describes programming not as telling a computer how to do something, but telling a person how they would instruct a computer to do something. The point is that programs are meant to be read by people, not just computers.
The kernel is the core part of an operating system that keeps track of everything.
As a gate, it allows programs to access hardware in a uniform way. Without the kernel, you would have to write programs to deal with every device model ever made.
As a fence, the kernel prevents programs from accidentally overwriting each other's data and from accessing files and devices that they don't have permission to. It limits the amount of damage a poorly-written program can do to other running programs.
Every command the computer sees is given as a number or sequence of numbers.
High-level languages are there to make programming easier. Assembly language requires you to work with the machine itself. High-level languages allow you to describe the program in a more natural language. A single command in a high-level language usually is equivalent to several commands in an assembly language.
The Von Neumann architecture divides the computer up into two main parts--the CPU (for Central Processing Unit) and the memory.
In fact, in a computer, there is no difference between a program and a program's data except how it is used by the computer. They are both stored and accessed the same way.
The CPU reads in instructions from memory one at a time and executes them. This is known as the fetch-execute cycle.
The CPU contains the following elements to accomplish this:

program counter
instruction decoder
data bus
general-purpose registers
arithmetic and logic unit

The program counter is used to tell the computer where to fetch the next instruction from.
General-purpose registers are where the main action happens. Addition, subtraction, multiplication, comparisons, and other operations generally use general-purpose registers for processing.
Computer memory is a numbered sequence of fixed-size storage locations. The number attached to each storage location is called its address. The size of a single storage location is called a byte.
Registers are what the computer uses for computation. Think of a register as a place on your desk--it holds things you are currently working on.
Registers keep the contents of numbers that you are currently manipulating.
Addresses which are stored in memory are also called pointers, because instead of having a regular value in them, they point you to a different location in memory.
The only way the computer knows that a memory location is an instruction is that a special-purpose register called the instruction pointer points to it at one point or another.
Computers are very exact. Because they are exact, programmers have to be equally exact. A computer has no idea what your program is supposed to do. Therefore, it will only do exactly what you tell it to do.
The computer will execute your instructions in the exact order you specify, even if it doesn't make sense.
Remember, computers can only store numbers, so letters, pictures, music, web pages, documents, and anything else are just long sequences of numbers in the computer, which particular programs know how to interpret.
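A minimal C sketch of this idea, assuming an ASCII system: the letters of a string are read back as plain numbers and added up. The function name sum_bytes is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: adds up the numeric values of the bytes in a string.
   Each letter is stored as a number, so it can be summed like one. */
unsigned int sum_bytes(const char *text)
{
    unsigned int total = 0;
    for (size_t i = 0; text[i] != '\0'; i++) {
        total += (unsigned char)text[i];
    }
    return total;
}
```

On an ASCII system, sum_bytes("AB") adds 65 and 66. Whether those bytes mean letters or numbers depends only on the program reading them.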
Processors have a number of different ways of accessing data, known as addressing modes. The simplest mode is immediate mode, in which the data to access is embedded in the instruction itself.
In the register addressing mode, the instruction contains a register to access, rather than a memory location.
In the direct addressing mode, the instruction contains the memory address to access.
In the indexed addressing mode, the instruction contains a memory address to access, and also specifies an index register to offset that address.
In the indirect addressing mode, the instruction contains a register that contains a pointer to where the data should be accessed.
Even if your tinkering does not work, every failure will help you learn.
Source code is the human-readable form of a program. In order to transform it into a program that a computer can run, we need to assemble and link it.
Assembling is the process that transforms what you typed into instructions for the machine.
An assembly language is a more human-readable form of the instructions a computer understands.
An object file is code that is in the machine's language, but has not been completely put together.
The linker is the program that is responsible for putting the object files together and adding information to it so that the kernel knows how to load and run it.
You must always reassemble and relink programs after you modify the source file for the changes to occur in the program.
UNIX programs return numbers other than zero to indicate failure or other errors, warnings, or statuses. The programmer determines what each number means.
Comments are not translated by the assembler. They are used only for the programmer to talk to anyone who looks at the code in the future.
You should always document any strange behavior your program performs.
Symbols are generally used to mark locations of programs or data, so you can refer to them by name instead of by their location number.
_start is a special symbol that always needs to be marked with .globl because it marks the location of the start of the program.
Labels define a symbol's value.
The number 1 is the number of the exit system call.
An interrupt interrupts the normal program flow, and transfers control from our program to Linux so that it will do a system call.
To recap--Operating System features are accessed through system calls. These are invoked by setting up the registers in a special way and issuing the instruction "int 0x80". Linux knows which system call we want to access by what we stored in the eax register. Each system call has other requirements as to what needs to be stored in the other registers.
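The same mechanism can be reached from C, as a rough sketch: the C library's syscall() wrapper takes the system call number and sets up the registers and the trap instruction for you. This assumes Linux with glibc or a compatible library; raw_getpid is a hypothetical name, and the getpid call is used only because it needs no other arguments.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Invokes the getpid system call by number instead of through its usual
   C wrapper. SYS_getpid is the call number the kernel looks at. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The value returned should match what the ordinary getpid() function reports for the same process.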
A loop is a piece of program code that is meant to be repeated.
Just be aware that the result of the comparison is stored in the status register.
The general form of memory address references is this:

ADDRESS_OR_OFFSET(%BASE_OR_OFFSET, %INDEX, MULTIPLIER)

Every mode except immediate mode can be used as either the source or destination operand. Immediate mode can only be a source operand.
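The address such a reference resolves to is ADDRESS_OR_OFFSET + %BASE_OR_OFFSET + MULTIPLIER * %INDEX. A small C sketch of that arithmetic follows; the function and parameter names are illustrative only, and on x86 the multiplier is limited to 1, 2, 4, or 8.

```c
#include <stdint.h>

/* Computes the address that an operand of the form
   ADDRESS_OR_OFFSET(%BASE_OR_OFFSET, %INDEX, MULTIPLIER) refers to.
   A field left out of the operand counts as zero here. */
uint32_t effective_address(uint32_t address_or_offset,
                           uint32_t base_or_offset,
                           uint32_t index,
                           uint32_t multiplier)
{
    return address_or_offset + base_or_offset + multiplier * index;
}
```

For example, the fourth 4-byte item (index 3) of a list starting at 0x1000 is found with effective_address(0x1000, 0, 3, 4).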
Programmers use functions to break their programs into pieces which can be independently developed and tested.
A function's name is a symbol that represents the address where the function's code starts.
A function's parameters are the data items that are explicitly given to the function for processing.
Local variables are data storage that a function uses while processing that is thrown away when it returns.
Static variables are data storage that a function uses while processing that is not thrown away afterwards, but is reused every time the function's code is activated.
Global variables are data storage that a function uses for processing which are managed outside the function.
The return value is the main method of transferring data back to the main program. Most programming languages only allow a single return value for a function.
A convention is a way of doing things that is standardized, but not forcibly so.
In the C language calling convention, the stack is the key element for implementing a function's local variables, parameters, and return address.
Before executing a function, a program pushes all of the parameters for the function onto the stack in the reverse order that they are documented. Then the program issues a call instruction indicating which function it wishes to start. The call instruction does two things. First it pushes the address of the next instruction, which is the return address, onto the stack. Then it modifies the instruction pointer (eip) to point to the start of the function.
The base pointer is a special register used for accessing function parameters and local variables.
The only difference between the global and static variables is that static variables are only used by one function, while global variables are used by many functions.
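In C the three kinds of storage can be seen side by side. This is only a sketch; count_calls and the variable names are invented for the example.

```c
/* Global variable: lives outside any function and can be used by many. */
int total_calls = 0;

/* Hypothetical function showing local, static, and global storage. */
int count_calls(void)
{
    static int calls_here = 0; /* static: kept between calls, used only here */
    int increment = 1;         /* local: thrown away when the function returns */

    calls_here += increment;
    total_calls += increment;
    return calls_here;
}
```

Calling count_calls() twice returns 1 and then 2, because calls_here survives between calls while increment is recreated each time.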
When a function is done executing, it does three things:

It stores its return value in eax.
It resets the stack to what it was when it was called (it gets rid of the current stack frame and puts the stack frame of the calling code back into effect).
It returns control back to wherever it was called from. This is done using the ret instruction, which pops whatever value is at the top of the stack, and sets the instruction pointer, eip, to that value.

When you call a function, you should assume that everything currently in your registers will be wiped out.
If there are registers you want to save before calling a function, you need to save them by pushing them on the stack before pushing the function's parameters. You can then pop them back off in reverse order after popping off the parameters.
In fact, almost all of programming is writing and calling functions.
Data which is stored in files is called persistent data, because it persists in files that remain on the disk even when the program isn't running.
UNIX files, no matter what program created them, can all be accessed as a sequential stream of bytes. When you access a file, you start by opening it by name. The operating system then gives you a number, called a file descriptor, which you use to refer to the file until you are through with it. You can then read and write to the file using its file descriptor. When you are done reading and writing, you then close the file, which then makes the file descriptor useless.
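The open, read/write, close sequence looks like this in C, using the standard POSIX calls. It is a sketch under assumptions: write_then_read is a hypothetical helper, and the caller picks the path.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Writes text to the file at path, then reopens the file and reads it back
   into buffer. Returns the number of bytes read, or -1 on any failure. */
long write_then_read(const char *path, const char *text,
                     char *buffer, size_t buffer_size)
{
    /* Opening by name gives back a file descriptor (a small number). */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    size_t length = strlen(text);
    ssize_t written = write(fd, text, length);
    close(fd);                      /* this descriptor is now useless */
    if (written != (ssize_t)length) return -1;

    fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    ssize_t got = read(fd, buffer, buffer_size);
    close(fd);
    return (long)got;
}
```

Note that the buffer is not null-terminated by read; the return value tells you how many bytes are valid.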
A buffer is a continuous block of bytes used for bulk data transfer.
Communication between processes is usually done through special files called pipes.
One of the keys of programming is continually breaking down problems into smaller and smaller chunks until each chunk is small enough that you can easily solve it. Then you can build these chunks back up until you have a working program.
In programming, a constant is a value that is assigned when a program assembles or compiles, and is never changed.
Guarding against potential user and programming errors is an important task of a programmer.
When a Linux program begins, all pointers to command-line arguments are stored on the stack.
Structured data is data that is divided up into fields and records.
Robust programs are able to handle error conditions gracefully. They are programs that do not crash no matter what the user does. Building robust programs is essential to the practice of programming.
Programmers schedule poorly. In almost every programming project, programmers will take two, four, or even eight times as long to develop a program or function as they originally estimated.
It takes a lot of time and effort to develop robust programs.
Testing is one of the most essential things a programmer does. If you haven't tested something, you should assume it doesn't work.
Allowing non-programmers to use your program for testing purposes usually gives you much more accurate results as to how robust your program truly is.
Most important is testing corner cases or edge cases. Corner cases are the inputs that are most likely to cause problems or behave unexpectedly.
When testing numeric data, there are several corner cases you always need to test:

The number 0.
The number 1.
A number within the expected range.
A number outside the expected range.
The first number in the expected range.
The last number in the expected range.
The first number below the expected range.
The first number above the expected range.
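As a sketch of how that checklist turns into tests, assume a hypothetical function in_range that should accept the numbers 1 through 10. Several items in the list land on the same input for this range, which is noted in the comments.

```c
#include <stdbool.h>

/* Hypothetical function under test: accepts values from low to high, inclusive. */
bool in_range(int value, int low, int high)
{
    return value >= low && value <= high;
}

/* Runs the corner cases for the assumed range 1..10.
   Returns how many checks gave the wrong answer. */
int failed_corner_cases(void)
{
    int failures = 0;
    failures += in_range(0, 1, 10) != false;   /* the number 0; also first below the range */
    failures += in_range(1, 1, 10) != true;    /* the number 1; also first in the range */
    failures += in_range(5, 1, 10) != true;    /* a number within the range */
    failures += in_range(50, 1, 10) != false;  /* a number outside the range */
    failures += in_range(10, 1, 10) != true;   /* the last number in the range */
    failures += in_range(11, 1, 10) != false;  /* the first number above the range */
    return failures;
}
```

A result of zero from failed_corner_cases() means every corner case behaved as expected.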

You need to test that your program behaves as expected for lists of 0 items, 1 item, massive numbers of items, and so on. In addition, you should also test any turning points you have.
Not only should you test your program as a whole, you need to test the individual pieces of your program. As you develop your program, you should test individual functions by providing them with data you create to make sure they respond appropriately.
The simplest way to handle recovery points is to wrap the whole program into a single recovery point. You would just have a simple error-reporting function that you can call with an error code and a message. The function would print them and simply exit the program. This is not usually the best solution for real-world situations, but it is a good fall-back, last resort mechanism.
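A minimal sketch of such an error-reporting function in C. The names and the message format are assumptions made for the example; the formatting step is split out only so it can be checked without ending the program.

```c
#include <stdio.h>
#include <stdlib.h>

/* Builds the report text "error CODE: MESSAGE" followed by a newline.
   Returns what snprintf returns (the length the full text needs). */
int format_error(char *buffer, size_t size, int code, const char *message)
{
    return snprintf(buffer, size, "error %d: %s\n", code, message);
}

/* The single recovery point: print the code and message, then exit. */
void report_error_and_exit(int code, const char *message)
{
    char line[256];
    format_error(line, sizeof line, code, message);
    fputs(line, stderr);
    exit(code);
}
```

A call such as report_error_and_exit(2, "cannot open file") prints the report to standard error and ends the program with exit status 2.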
After every system call, function call, or instruction which can have erroneous results, you should add error checking and handling code.
When using dynamic linking, the name itself resides within the executable, and is resolved by the dynamic linker when it is run. When the program is run by the user, the dynamic linker loads the shared libraries listed in our link statements, and then finds all of the function and variable names that were named by our program but not found at link time, and matches them up with corresponding entries in the shared libraries it loads. It then replaces all of the names with the addresses which they are loaded at.
When you use shared libraries, your program is then dynamically-linked, which means that not all of the code needed to run the program is actually contained within the program file itself, but in external libraries.
In Linux, functions are described in the C programming language. In fact, most Linux programs are written in C. That is why most documentation and binary compatibility is defined using the C language.
A typedef basically allows you to rename a type.
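For instance, in this small sketch the made-up name myowntype becomes another name for int:

```c
/* myowntype is an invented name; after this line it means exactly int. */
typedef int myowntype;

/* A function written against the new name works on plain ints. */
myowntype add_one(myowntype value)
{
    return value + 1;
}
```

Nothing about the data changes; only the name used to describe it does.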
A computer looks at memory as a long sequence of numbered storage locations. A sequence of millions of numbered storage locations. Everything is stored in these locations. Your programs are stored there, your data is stored there, everything. Each storage location looks like every other one. The locations holding your program are just like the ones holding your data. In fact, the computer has no idea which are which, except that the executable file tells it where to start executing.
An address is a number that refers to a byte in memory.
Every piece of data on the computer not in a register has an address.
A pointer is a register or memory word whose value is an address.
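In C terms, a short sketch: the pointer parameter holds an address, and the function changes whatever is stored at that address. The name store_through is illustrative.

```c
/* pointer holds an address; *pointer is the memory at that address. */
void store_through(int *pointer, int new_value)
{
    *pointer = new_value;
}
```

Given an int variable x, the call store_through(&x, 7) passes the address of x and leaves x equal to 7.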
When your program is loaded into memory, each .section is loaded into its own region of memory.
The actual instructions (the .text section) are loaded at the address 0x08048000.
The .data section is loaded immediately after that, followed by the .bss section.
The last byte that can be addressed on Linux is location 0xbfffffff. Linux starts the stack here and grows it downward toward the other sections. Between them is a huge gap.
At the bottom of the stack there is a word of memory that is zero. After that comes the null-terminated name of the program using ASCII characters. After the program name comes the program's environment variables. Then come the program's command-line arguments.
Your program's data region starts at the bottom of memory and goes up. The stack starts at the top of memory, and moves downward with each push. This middle part between the stack and your program's data sections is inaccessible memory; you are not allowed to access it until you tell the kernel that you need it. If you try, you will get an error (the error message is usually "segmentation fault").
The last accessible memory address to your program is called the system break (also called the current break or just the break).
Physical memory refers to the actual RAM chips inside your computer and what they contain.
If we talk about a physical memory address, we are talking about where exactly on these chips a piece of memory is located.
Virtual memory is the way your program thinks about memory. Before loading your program, Linux finds an empty physical memory space large enough to fit your program, and then tells the processor to pretend that this memory is actually at the address 0x08048000 to load your program into.
Each program gets its own sandbox to play in. Every program running on your computer thinks that it was loaded at memory address 0x08048000, and that its stack starts at 0xbfffffff. When Linux loads a program, it finds a section of unused memory, and then tells the processor to use that section of memory as the address 0x08048000 for this program. The address that a program believes it uses is called the virtual address, while the actual address on the chips that it refers to is called the physical address. The process of assigning virtual addresses to physical addresses is called mapping.
Here is an overview of the way memory accesses are handled under Linux:

The program tries to load memory from a virtual address.
The processor, using tables supplied by Linux, transforms the virtual memory address into a physical memory address on the fly.

Note that not only can Linux have a virtual address map to a different physical address, it can also move those mappings around as needed.

All of the memory mappings are done a page at a time. Physical memory assignment, swapping, mapping, etc. are all done to memory pages instead of individual memory addresses.
If you try to access a piece of virtual memory that hasn't been mapped yet, it triggers an error known as a segmentation fault, which will terminate your program.
The way we tell Linux to move the break point is through the brk system call.
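From C, the break can be inspected and moved with sbrk(), a C library function layered on top of brk. The sketch below assumes a Linux system whose C library supports growing the break this way (some do not, in which case it reports -1); grow_break is a hypothetical name.

```c
#define _DEFAULT_SOURCE
#include <unistd.h>

/* Asks for the break to move up by size bytes.
   Returns how far it actually moved, or -1 if the request was refused. */
long grow_break(long size)
{
    void *before = sbrk(0);          /* sbrk(0) reports the current break */
    if (sbrk(size) == (void *)-1) {
        return -1;
    }
    void *after = sbrk(0);
    return (long)((char *)after - (char *)before);
}
```

After a successful grow_break(4096), the bytes between the old and new break belong to the program and can be used.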
A memory manager is a set of routines that takes care of the dirty work of getting your program memory for you. Most memory managers have two basic functions--allocate and deallocate.
The way memory managers work is that they keep track of where the system break is, and where the memory that you have allocated is. They mark each block of memory in the heap as being used or unused. When you request memory, the memory manager checks to see if there are any unused blocks of the appropriate size. If not, it calls the brk system call to request more memory.
When you free memory, it marks the block as unused so that future requests can retrieve it.
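The used/unused bookkeeping can be sketched in C with a toy manager. To stay short it makes simplifying assumptions that are not the book's design: the heap is a fixed array of equal-sized blocks rather than memory obtained with brk, and all names are invented.

```c
#include <stddef.h>

#define HEAP_BLOCKS 8    /* assumed toy heap: 8 blocks */
#define BLOCK_SIZE  64   /* of 64 bytes each */

static unsigned char heap[HEAP_BLOCKS][BLOCK_SIZE];
static int in_use[HEAP_BLOCKS];   /* 0 = unused, 1 = used */

/* Returns the first unused block that is big enough, or NULL if none is. */
void *toy_allocate(size_t size)
{
    if (size > BLOCK_SIZE) {
        return NULL;
    }
    for (int i = 0; i < HEAP_BLOCKS; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            return heap[i];
        }
    }
    return NULL;   /* a real manager would ask the kernel for more memory here */
}

/* Marks the block as unused so that a later request can reuse it. */
void toy_deallocate(void *block)
{
    for (int i = 0; i < HEAP_BLOCKS; i++) {
        if (block == (void *)heap[i]) {
            in_use[i] = 0;
            return;
        }
    }
}
```

Freeing a block and then allocating again hands the same block back, which is the reuse described above.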
A memory manager by itself is not a full program--it doesn't do anything. It is simply a utility to be used by other programs.
Generally, you should avoid calling the kernel unless you really need to.
Each digit in a binary number is called a bit, which stands for binary digit.
You should note that it takes most computers a lot longer to do floating-point arithmetic than it does integer arithmetic. So, for programs that really need speed, integers are mostly used.
Sign extension means that you have to pad the left-hand side of the quantity with whatever digit is in the sign digit when you add bits.
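A short C sketch of the padding, widening an 8-bit quantity to 32 bits by hand. The function name is illustrative; in practice a C compiler does this for you when converting between signed types.

```c
#include <stdint.h>

/* Widens an 8-bit two's-complement value to 32 bits by copying
   its sign bit into all of the newly added bits. */
uint32_t sign_extend_8_to_32(uint8_t bits)
{
    uint32_t wide = bits;
    if (bits & 0x80u) {          /* sign bit set: the value is negative */
        wide |= 0xFFFFFF00u;     /* pad the left-hand side with ones */
    }
    return wide;
}
```

So 0xFE (the 8-bit pattern for -2) becomes 0xFFFFFFFE, while 0x05 becomes 0x00000005.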
The x86 processor is a little-endian processor, which means that it stores the "little end", or least-significant-byte of its words first.
Other processors are big-endian processors, which means that they store the "big end", or most significant byte, of their words first, the way we would naturally read a number.
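One way to see which kind of processor you are on, sketched in C: copy a word into a byte array and look at the byte at the lowest address. The function name is made up.

```c
#include <stdint.h>
#include <string.h>

/* Returns the byte stored at the lowest address of a 32-bit word. */
uint8_t first_byte_in_memory(uint32_t word)
{
    uint8_t bytes[sizeof word];
    memcpy(bytes, &word, sizeof word);
    return bytes[0];
}
```

For the word 0x11223344, a little-endian machine such as x86 gives 0x44 (the "little end"), while a big-endian machine gives 0x11.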
Assembly language is the language used at the machine's level, but most people find coding in assembly language too cumbersome for everyday use. Many computer languages have been invented to make the programming task easier.
High-level languages, whether compiled or interpreted, are oriented around you, the programmer, instead of around the machine.
Each language is different, and the more languages you know the better programmer you will be.
The main function is a special function in the C language--it is the start of all C programs (much like _start in our assembly-language programs).
Optimization is the process of making your application run more effectively.
It is better to not optimize at all than to optimize too soon. When you optimize, your code generally becomes less clear, because it becomes more complex.
Once you have determined that you have a performance issue you need to determine where in the code the problems occur. You can do this by running a profiler. A profiler is a program that will let you run your program, and it will tell you how much time is spent in each function, and how many times they are run.
Any optimization done on the first code base is completely wasted.
After running a profiler, you can determine which functions are called the most or have the most time spent in them. These are the ones you should focus your optimization efforts on.
If a program only spends 1% of its time in a given function, then no matter how much you speed it up you will only achieve a maximum of a 1% overall speed improvement. However, if a program spends 20% of its time in a given function, then even minor improvements to that function's speed will be noticeable.
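The arithmetic behind this, usually called Amdahl's law, can be written as a one-line C function. This is a sketch; fraction is the share of run time spent in the function (0 to 1) and factor is how many times faster that function becomes.

```c
/* Overall speedup of the whole program when the given fraction of its
   run time is made `factor` times faster and the rest is unchanged. */
double overall_speedup(double fraction, double factor)
{
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}
```

With fraction 0.01 the result stays near 1.01 no matter how large factor gets, while with fraction 0.20 merely doubling the function's speed already gives about 1.11.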
Sometimes a function has a limited number of possible inputs and outputs. In fact, it may be so few that you can actually precompute all of the possible answers beforehand, and simply look up the answer when the function is called.
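As an example of the technique, a byte has only 256 possible values, so the number of 1 bits in each can be computed once and looked up afterwards. The sketch below is illustrative and the names are invented.

```c
#include <stdint.h>

static uint8_t bit_count_table[256];
static int table_ready = 0;

/* Fills the table once by counting bits the slow way for every byte value. */
static void build_table(void)
{
    for (int value = 0; value < 256; value++) {
        int count = 0;
        for (int v = value; v != 0; v >>= 1) {
            count += v & 1;
        }
        bit_count_table[value] = (uint8_t)count;
    }
    table_ready = 1;
}

/* Answers by table lookup instead of recomputing. */
int bits_set(uint8_t value)
{
    if (!table_ready) {
        build_table();
    }
    return bit_count_table[value];
}
```

The first call pays for building the table; every later call is a single memory read.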
Registers are the fastest memory locations on the computer.
Parallelization means that your algorithm can effectively be split among multiple processes.
The more parallelize-able your application is, the better it can take advantage of multiprocessor and clustered computer configurations.
Two great benefits resulting from statelessness are that most stateless functions are parallelize-able and often benefit from memoization.
If you are constantly looking for new and better ways of doing and thinking, you will make a successful programmer.

20170818
