20191210
Stack Computers by Philip Koopman
- Stack machines offer processor complexity that is much lower than that of CISC machines, and overall system complexity that is lower than that of either RISC or CISC machines.
- LIFO stacks, also known as “push down” stacks, are the conceptually simplest way of saving information in a temporary storage location for such common computer operations as mathematical expression evaluation and recursive subroutine calling.
- LIFO stacks may be programmed into conventional computers in a number of ways. The most straightforward way is to allocate an array in memory, and keep a variable with the array index number of the topmost active element.
- “Pushing” a stack element refers to the act of allocating a new word on the stack and placing data into it. “Popping” the stack refers to the action of removing the top element from the stack and then returning the data value removed to the routine requesting the pop.
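For a concrete picture of those two notes, here is a minimal Python sketch of the array-plus-top-index approach; the class and method names are illustrative, not from the book.

```python
class ArrayStack:
    """LIFO stack kept in a pre-allocated array plus a top-of-stack index."""

    def __init__(self, capacity=64):
        self.cells = [0] * capacity   # array allocated for stack storage
        self.top = -1                 # index of the topmost active element

    def push(self, value):
        """Allocate a new word on the stack and place data into it."""
        if self.top + 1 == len(self.cells):
            raise OverflowError("stack overflow")
        self.top += 1
        self.cells[self.top] = value

    def pop(self):
        """Remove the top element and return its value to the caller."""
        if self.top < 0:
            raise IndexError("stack underflow")
        value = self.cells[self.top]
        self.top -= 1
        return value
```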
- A very important property of stacks is that, in their purest form, they only allow access to the top element in the data structure.
- Stacks make excellent mechanisms for temporary storage of information within procedures. A primary reason for this is that they allow recursive invocations of a procedure without risk of destroying data from previous invocations of the routine.
- Hardware implementation of stacks has the obvious advantage that it can be much faster than software management.
- Stack machines are easier to write compilers for, since they have fewer exceptional cases to complicate a compiler.
- Stack machines are also simpler than other machines, and provide very good computational power using little hardware.
- A particularly favorable application area for stack machines is in real time embedded control applications, which require a combination of small size, high processing speed, and excellent support for interrupt handling that only stack machines can provide.
- Both hardware and software stacks have been used to support four major computing requirements: expression evaluation, subroutine return address storage, dynamically allocated local variable storage, and subroutine parameter passing.
- The use of an expression evaluation stack is so basic to the evaluation of expressions that even register-based machine compilers often allocate registers as if they formed an expression evaluation stack.
- The solution to the recursion problem is to use a stack for storing the subroutine return address. As each subroutine is called the machine saves the return address of the calling program on a stack.
- Modern machines usually have some sort of hardware support for a return address stack. In conventional machines, this support is often a stack pointer register and instructions for performing subroutine calls and subroutine returns.
- Whenever a subroutine is called it must usually be given a set of parameters upon which to act.
- Real machines combine the various stack types. It is common in register-based machines to see the local variable stack, parameter stack, and return address stack combined into a single stack of activation records, or “frames”.
- Many of the designs for these stack computers have their roots in the Forth programming language. This is because Forth forms both a high level and assembly language for a stack machine that has two hardware stacks: one for expression evaluation/parameter passing, and one for return addresses.
- An interesting point to note is that, although some of these machines are designed primarily to run Forth, they are also good at running conventional languages.
- Stack machines have small program size, low system complexity, high system performance, and good performance consistency under varying conditions.
- Stack machines run conventional languages reasonably well, and do so using less hardware for a given level of performance than register-based machines.
- Stack machines are very good at running the Forth language, which is known for rapid program development because of its interactivity and flexibility. Forth is also known for producing compact code that is very well suited to real time control problems.
- A very good application area for stack machines is embedded real time control.
- In general, a single stack leads to simple hardware, but at the expense of intermingling data parameters with return address information.
- An advantage of having a single stack is that it is easier for an operating system to manage only one block of variable sized memory per process.
- A disadvantage of a single stack is that parameter and return address information are forced to become mutually well nested.
- Multiple stacks allow separating control flow information from data operands.
- An important advantage of having multiple stacks is one of speed. Multiple stacks allow access to multiple values within a clock cycle.
- The amount of dedicated memory used to buffer stack elements is a crucial performance issue.
- To be competitive in speed, a stack machine must have at least one or two stack elements buffered inside the processor.
- A small stack buffer with primary stacks residing in program memory allows quick switching between stacks for different tasks since the stack elements are predominantly memory resident at all times.
- The fact that a small dedicated stack buffer is simple to implement and easy to manage makes it very popular. In particular, the fact that most stack elements reside in main memory makes managing pointers, strings, and other data structures quite easy. The disadvantage of this approach is that significant main memory bandwidth is consumed to read and write stack elements.
- An advantage of a large stack buffer is that program memory cycles are not consumed while accessing data elements and subroutine return addresses. This can lead to significant speedups, particularly in subroutine intensive environments.
- The number of addressing modes has a tremendous effect on how the stacks are constructed and how the stacks can be used by programs.
- 0-operand instructions do not allow any operands to be associated with the opcode. All operations are implicitly specified to be on the top stack element(s). This kind of addressing is often called “pure” stack addressing.
- A disadvantage to the 0-operand addressing mode is that complex addressing modes for data structure accessing may take several instructions to synthesize.
- A machine with 1-operand instruction format usually performs operations on the specified operand and uses the top stack element as the implicit second operand.
- 2-operand instruction formats allow each instruction to specify both a source and destination.
- 2-operand machines offer maximum flexibility, but require more complicated hardware to perform efficiently.
- Forth is an unconventional programming language which uses a two-stack model of computation and strongly encourages frequent calls to many small procedures.
- Stack machines support small program sizes by encouraging frequent use of subroutines to reduce code size, and by the fact that stack machines can have short instructions. Small program sizes reduce memory costs, component count, and power requirements, and can improve system speed by allowing the cost effective use of smaller, higher speed memory chips.
- Decreased system complexity decreases system development time and chip size. This decreased chip size leaves more room on-chip for semi-custom features and program memory.
- System performance includes not only raw execution speed, but also total system cost and system adaptability when used in real world applications.
- For simplicity, the Canonical Stack Machine has a single bus connecting the system resources.
- The data stack is a memory with an internal mechanism to implement a LIFO stack. A common implementation for this might be a conventional memory with an up/down counter used for address generation.
- The data stack allows two operations: push and pop. The push operation allocates a new cell on the top of the stack and writes the value on the data bus into that cell. The pop operation places the value in the top cell of the stack onto the data bus, then deletes the cell, exposing the next cell on the stack for the next processor operation.
- The return stack is a LIFO stack implemented in an identical manner to the data stack. The only difference is that the return stack is used to store subroutine return addresses instead of instruction operands.
- The ALU functional block performs arithmetic and logical computations on pairs of data elements.
- The ALU supports the standard primitive operations needed by any computer.
- The program counter holds the address of the next instruction to be executed. The PC may be loaded from the bus to implement branches, or may be incremented to fetch the next sequential instruction from program memory.
- The program memory block has both a Memory Address Register (MAR) and a reasonable amount of random access memory.
- An important point to note is that Forth notation often makes extensive use of special characters.
- Stack machines execute data manipulation operations using postfix operations.
- Postfix notation is distinguished by the fact that the operands come before the operator.
- In postfix notation, the operator acts upon the most recently seen operands, and uses an implied stack for evaluation.
- Postfix notation has an economy of expression when compared to infix notation in that neither operator precedence nor parentheses are necessary. It is much better suited to the needs of computers.
- The Canonical Stack Machine described in the preceding section is designed to execute postfix operations directly without burdening the compiler with register allocation bookkeeping.
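As a small worked example of postfix evaluation on an implied stack (my own sketch, not from the book): the infix expression (2 + 3) * 4 becomes 2 3 + 4 * in postfix, and needs neither precedence rules nor parentheses.

```python
def eval_postfix(tokens):
    """Evaluate a postfix expression using an implied operand stack."""
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    stack = []
    for tok in tokens:
        if tok in ops:
            b = stack.pop()          # the operator acts on the most recently seen operands
            a = stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(int(tok))   # operands are pushed as they are encountered
    return stack.pop()

print(eval_postfix("2 3 + 4 *".split()))   # 20
```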
- The instruction >R and its complement R> allow shuffling data elements between the data and return stacks. This technique is used to access stack elements buried more than two elements deep on the stacks by temporarily placing the topmost data stack elements on the return stack.
- Even though all arithmetic and logical operations are performed on data elements on the stack, there must be some way of loading information onto the stack before operations are performed, and storing information from the stack into memory. The Canonical Stack Machine uses a simple load/store architecture, so has only a single load instruction ‘@’ and a single store instruction ‘!’.
- Somehow there must be a way to get a constant value onto the stack. The instruction to do this is called the literal instruction, which is often called a load-immediate instruction in register-based architecture. The literal instruction uses two consecutive instruction words: one for the actual instruction, and one to hold a literal value to be pushed onto the stack.
- The Program Counter is the register that holds a pointer to the next instruction to be executed. After fetching an instruction, the program counter is automatically incremented to point to the next word in memory. In the case of a branch or subroutine call instruction, the program counter is loaded with a new value to accomplish the branch.
- In order to be able to make decisions, the machine must have available some method for conditional branching. The Canonical Stack Machine uses the simplest method possible. A conditional branch may be performed based on whether the top stack element is equal to 0. This approach eliminates the need for condition codes, yet allows implementation of all control flow structures.
- Since there is a dedicated return address stack, subroutine calls simply require pushing the current program counter value onto the stack, then loading the program counter with a new value.
- Subroutine returns are accomplished by simply popping the return address from the top of the return address stack and placing the address in the program counter.
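To tie the last few notes together, here is a toy instruction loop covering the literal, fetch/store, ALU, conditional branch, call, and return instructions. The mnemonics and encoding are my own illustrative assumptions, not Koopman's actual Canonical Stack Machine encoding.

```python
def run(program, memory):
    """Toy stack-machine loop: lit, @, !, +, 0branch, call, ret."""
    data, ret, pc = [], [], 0
    while True:
        op = program[pc]; pc += 1
        if op == "lit":                # two consecutive words: opcode, then the value
            data.append(program[pc]); pc += 1
        elif op == "@":                # fetch: replace address with memory contents
            data.append(memory[data.pop()])
        elif op == "!":                # store: ( value address -- )
            addr, val = data.pop(), data.pop()
            memory[addr] = val
        elif op == "+":
            b, a = data.pop(), data.pop()
            data.append(a + b)
        elif op == "0branch":          # conditional branch taken if the top of stack is zero
            target = program[pc]; pc += 1
            if data.pop() == 0:
                pc = target
        elif op == "call":             # push the return address on the return stack, then jump
            ret.append(pc + 1); pc = program[pc]
        elif op == "ret":
            if not ret:                # empty return stack: treat as end of program
                return data
            pc = ret.pop()

# Main program calls a subroutine (starting at word 3) that doubles memory cell 0.
prog = ["call", 3, "ret",
        "lit", 0, "@", "lit", 0, "@", "+", "lit", 0, "!", "ret"]
mem = {0: 21}
run(prog, mem)
print(mem[0])   # 42
```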
- An important consideration in real time control applications is how the processor can handle interrupts and task switches.
- Interrupts are caused either by exceptional events, such as stack overflow, or by requests for I/O service. Both events require quick resolution without disturbing the flow of the current task.
- I/O servicing is a potentially frequent event that must be handled quickly for real time control tasks.
- Stack machines treat interrupts as hardware generated subroutine calls.
- Interrupts are much less expensive on stack machines than on conventional machines for several reasons: registers need not be saved since the stack allocates working storage automatically, there are no condition code flags to be saved since the only branch conditions are kept as flags on the data stack, and most stack processors have a short or nonexistent data pipeline, so there is no penalty for saving the pipeline state when an interrupt is received.
- Because of its roots, Forth stresses efficiency, compactness, and flexible, efficient hardware/software interaction.
- The Forth Virtual Machine has two stacks: a data stack and a return stack.
- Forth programs consist of very small subroutines that execute only calls to other subroutines and primitive stack operation instructions.
- The main characteristic of Forth programs that separates Forth from most other languages is the high frequency of subroutine calls.
- Good Forth programming style encourages incremental program development and testing with small subroutines.
- The consensus among Forth programmers is that use of the Forth language reduces development time by a factor of 10 compared to other languages over a wide range of applications.
- A major reason that Forth has historically been a 16-bit language is that 8 bits is too small for general purpose computations and addressing data structures. While 12-bit words were tried in some of the earliest minicomputers, 16 bits seems to be the smallest integer size that is truly useful.
- 16-bit machines are capable of addressing 64K of memory, which for a stack machine is a rather large program memory.
- Embedded applications require a small processor with a small amount of program memory to satisfy demanding power, weight, size, and cost considerations.
- One of the difficult technical challenges that arise when designing a 32-bit stack processor is the management of the stacks.
- The traditional solution for a growing program size is to employ a hierarchy of memory devices with a series of capacity/cost/access-time tradeoffs.
- The memory problem comes down to one of supplying a sufficient quantity of memory fast enough to support the processor at a price that can be afforded. This is accomplished by fitting as much of the program as possible into the fastest level of the memory hierarchy.
- The usual way to manage the fastest level of the memory hierarchy is by using cache memories. Cache memories work on the principle that a small section of a program is likely to be used more than once within a short period of time.
- Stack machines have much smaller programs than either RISC or CISC machines. Stack machine programs can be 2.5 to 8 times smaller than CISC code.
- When speaking of the complexity of a computer, two levels are important: processor complexity and system complexity.
- Processor complexity is the amount of logic (measured in chip area, number of transistors, etc.) in the actual core of the processor that does the computations.
- System complexity considers the processor embedded in a fully functional system which contains support circuitry, the memory hierarchy, and software.
- The complexity of CISC machines is partially the result of encoding instructions to keep programs relatively small.
- Stack machines strive to achieve a balance between processor complexity and system complexity. Stack machine designs realize processor simplicity not by restricting the number of instructions, but rather by limiting the data upon which instructions may operate: all operands are on the top stack elements.
- Stack machines are extraordinarily simple: 16-bit stack machines typically use only 20 to 35 thousand transistors for the processor core.
- Stack machine compilers are also simple, because the instructions are very consistent in format and operand selection. In fact, most compilers for register machines go through a stack-like view of the source program for expression evaluation, then map that information onto a register set.
- Stack machine compilers have that much less work to do in mapping the stack-like version of the source code into assembly language.
- Forth compilers are well known to be exceedingly simple and flexible.
- Stack computer systems are also simple as a whole. Because stack programs are so small, exotic cache control schemes are not required for good performance. Typically the entire program can fit into cache-speed memory chips without the complexity of cache control circuitry.
- Interrupts are treated as hardware invoked subroutine calls. There is no pipeline to flush or save, so the only thing a stack processor needs to do to process an interrupt is to insert the interrupt response address as a subroutine call into the instruction stream, and push the interrupt mask register onto the stack while masking interrupts (to prevent an infinite recursion of interrupt service calls). Once the interrupt service routine is entered, no registers need be saved, since the new routine can simply push its data onto the top of the stack.
- The simplest way to solve the stack problem is simply to assume that stack overflow will never happen.
- Given that stack overflows are allowed to occur on a regular basis, the most conceptually appealing way to deal with the problem is to use a demand fed stack manager that moves single elements on and off the stack as required.
- To implement this strategy, the stack buffer is set up as a circular buffer with a head and tail pointer. A pointer to memory is also needed to keep track of the top element of the memory-resident portion of the stack. Whenever a stack overflow is encountered, the bottom-most buffer-resident element is copied to memory, freeing a buffer location. Whenever an underflow is encountered, one element from memory is copied into the buffer. This technique has the appeal that the processor never moves a stack element to or from memory unless absolutely necessary, guaranteeing the minimum amount of stack traffic.
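A rough Python sketch of the demand-fed strategy (names and buffer size are illustrative); a list stands in for the circular on-chip buffer, and another list for the memory-resident portion of the stack.

```python
class DemandFedStack:
    """Keep the top few elements in a small buffer; spill/refill one element at a time."""

    def __init__(self, buffer_size=4):
        self.size = buffer_size
        self.buf = []    # models the on-chip circular buffer (head/tail handled by the list)
        self.mem = []    # memory-resident portion of the stack

    def push(self, value):
        if len(self.buf) == self.size:    # overflow: copy the bottom-most buffered element to memory
            self.mem.append(self.buf.pop(0))
        self.buf.append(value)

    def pop(self):
        if not self.buf:                  # underflow: copy one element back from memory
            self.buf.append(self.mem.pop())
        return self.buf.pop()
```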
- An alternative to the demand-fed strategy is to generate an interrupt on stack overflow and underflow, then use software to manage the stack spill. This approach uses less control hardware than the demand-fed method, but requires a stack buffer that is somewhat bigger to reduce the frequency of the interrupts.
- There are three components to the performance of processing interrupts.
- The first component is the amount of time that elapses between the time that an interrupt request is received by the processor and the time that the processor takes action to begin processing the interrupt service routine. This delay is called interrupt latency.
- The second component of interrupt service performance is interrupt processing time. This is the amount of time that the processor spends actually saving the machine state of the interrupted job and diverting execution to the interrupt service routine.
- The third component of interrupt service performance is what we shall call state saving overhead. This is the amount of time taken to save machine registers that are not automatically saved by the interrupt processing logic, but which must be saved in order for the interrupt service routine to do its job.
- A stack machine can implement lightweight threads simply by requiring that each task run a short sequence of instructions when invoked, then relinquish control to the central task manager. This can be called non-preemptive, or cooperative task management.
- If each task starts and stops its operation with no parameters on the stack, there is no overhead for context switches between tasks.
- The amount of stack memory needed by most programs is typically rather small.
- Any computer system is worthless without software.
- Having hardware that effectively supports software requirements is of the utmost importance.
- The use of a large number of small procedures when writing a program reduces the complexity of each piece that must be written, tested, debugged, and understood by the programmer.
- Lower software complexity implies lower development and maintenance costs, as well as better reliability.
- The currently accepted best practice in software design involves structured programming using modular designs.
- On a large scale, the use of modules is essential for partitioning tasks among members of programming teams.
- On a smaller scale, modules control complexity by limiting the amount of information that a programmer must deal with at any given time.
- Conventional computers are still optimized for executing programs made up of streams of serial instructions.
- If a procedure contains more than seven distinct operations it should be broken apart by chunking related portions into subordinate procedures to reduce the complexity of each portion of the program.
- Today’s hardware and programming environments unnecessarily restrict the usage of modularity, and therefore unnecessarily increase the cost of providing computer-based solutions to problems.
- The language used should reflect the entire system being developed, including the system operating environment, the suitability of the language to solve the problem at hand, development time and costs, the maintainability of the finished product, the strengths of the underlying processor at running various languages, and the previous programming experience of the programmers assigned to the project.
- Forth is the most obvious language to consider using on a stack machine. That is because the Forth language is based upon a set of primitives that execute on a virtual stack machine architecture.
- One of the characteristics of Forth is its very high use of subroutine calls. This promotes an unprecedented level of modularity, with approximately 10 instructions per procedure being the norm.
- One of the advantages of the Forth programming language is that it covers the full spectrum of language levels.
- Forth tends to act as a programmer amplifier.
- Forth is best used on medium-sized programming projects involving no more than two or three programmers who have compatible programming styles.
- Probably the most common reason for using a conventional language will be the existence of a large body of existing source code that must be ported onto a better processor.
- Generation of stack-based code for expression evaluation is relatively straightforward.
- Stack machines are well suited to LISP programming as well as to expert systems. This is because LISP and Forth are very similar languages in many respects.
- The major difference is that LISP involves dynamic storage allocation for its cells, while Forth uses a static storage allocation.
- Functional programming languages offer the promise of a new way of solving problems using a different model of computation than that used by conventional computers.
- One of the side effects of using a functional programming language is that a high degree of parallelism is available during program execution.
- This raises the idea of a massively parallel computer made of stack processors that is programmed with a functional programming language.
- A key conceptual feature of stack machines is their uniformity of interface between high level code and machine instructions.
- Any system in which a high speed processor with low system complexity is needed is a good candidate for using a stack processor.
- Real time embedded control processors are computers that are built into pieces of (usually) complicated equipment such as cars, airplanes, computer peripherals, audio electronics, and military vehicles/weapons. The processor is embedded because it is built into a piece of equipment that is not itself considered a computer.
- Most embedded systems place severe constraints on the processor in terms of requirements for size, weight, cost, power, reliability and operating environment.
- Real-time events are typically external stimuli to the system which require a response within a matter of microseconds or milliseconds.
- A processor that has a large number of pins takes up precious printed circuit board area.
- Microcoded machines are more flexible than hardwired machines.
- Preliminary research shows that stack machines can execute functional programming languages very efficiently. Programs written in these languages have a great deal of inherent parallelism, which may be exploited by a multiprocessor stack machine system.
- Conventional languages can be implemented very easily on stack machines.
- Virtual memory and memory protection schemes are concepts that have not yet been widely incorporated into existing stack machines.
- Protection is the capability for the hardware to prevent a program from accessing another program’s memory except under very carefully controlled conditions.
- Memory bandwidth is the amount of information that can be transferred to and from memory per unit time.
- The usual way of implementing an IF statement is by using a conditional branch that is taken if the condition tested by the IF is false.
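A sketch of how an IF ... ELSE ... THEN structure might be compiled into that conditional branch plus an unconditional branch, with the forward targets patched once they are known. The opcode names and helper are illustrative assumptions, not any particular compiler's output.

```python
def compile_if(condition_code, true_code, false_code):
    """Compile IF/ELSE/THEN: 0branch skips the true part when the flag is false (zero)."""
    code = list(condition_code)
    code += ["0branch", None]          # target not known yet; patched below
    patch_if = len(code) - 1
    code += list(true_code)
    code += ["branch", None]           # unconditional branch over the ELSE part
    patch_else = len(code) - 1
    code[patch_if] = len(code)         # 0branch lands at the start of the ELSE part
    code += list(false_code)
    code[patch_else] = len(code)       # branch lands just past THEN
    return code

# x 0= IF ... ELSE ... THEN, with placeholder body tokens
print(compile_if(["lit", 0, "0="], ["true-part"], ["false-part"]))
```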
- Stack machines do not seem to be best suited as primary CPUs for the workstation and minicomputer markets.
- The Forth language is based on an extensible, interactive compiler that creates code for a virtual stack machine.
20190526
The Algorithm Design Manual by Steven S. Skiena
- There are three desirable properties for a good algorithm. We seek algorithms that are correct and efficient, while being easy to implement. These goals may not be simultaneously achievable.
- There is a fundamental difference between algorithms, which always produce a correct result, and heuristics, which may usually do a good job, but without providing any guarantee.
- Reasonable-looking algorithms can easily be incorrect. Algorithm correctness is a property that must be carefully demonstrated.
- The heart of any algorithm is an idea. If your idea is not clearly revealed when you express an algorithm, then you are using too low-level a notation to describe it.
- It is impossible to prove the correctness of an algorithm for a fuzzily-stated problem. Put another way, ask the wrong problem and you will get the wrong answer.
- An important and honorable technique in algorithm design is to narrow the set of allowable instances until there is a correct and efficient algorithm.
- The best way to prove that an algorithm is incorrect is to produce an instance in which it yields an incorrect answer. Such instances are called counter-examples.
- Hunting for counter-examples is a skill worth developing. It bears some similarity to the task of developing test sets for computer programs, but relies more on inspiration than exhaustion.
- Searching for counterexamples is the best way to disprove the correctness of a heuristic.
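A tiny illustration of counter-example hunting (my own example): the greedy coin-changing heuristic that always grabs the largest coin first fails for denominations {1, 3, 4}.

```python
def greedy_change(amount, coins=(4, 3, 1)):
    """Greedy heuristic: repeatedly take the largest coin that still fits."""
    used = []
    for coin in coins:
        while amount >= coin:
            amount -= coin
            used.append(coin)
    return used

# Counter-example: for amount 6 greedy returns [4, 1, 1] (3 coins),
# but [3, 3] uses only 2 coins, so the heuristic is not correct in general.
print(greedy_change(6))
```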
- Mathematical induction is usually the right way to verify the correctness of a recursive or incremental insertion algorithm.
- Modeling is the art of formulating your application in terms of precisely described, well-understood problems. Proper modeling is the key to applying algorithmic design techniques to real-world problems. Indeed, proper modeling can eliminate the need to design or even implement algorithms, by relating your application to what has been done before.
- Modeling is only the first step in designing an algorithm for a problem.
- Modeling your application in terms of well-defined structures and algorithms is the most important single step towards a solution.
- Learning to think recursively is learning to look for big things that are made from smaller things of exactly the same type as the big thing.
- Recursive structures occur everywhere in the algorithmic world.
- Algorithms are the most important and durable part of computer science because they can be studied in a language and machine independent way.
- Algorithms can be understood and studied in a language and machine independent manner.
- The worst-case complexity of the algorithm is the function defined by the maximum number of steps taken in any instance of size n.
- The best-case complexity of the algorithm is the function defined by the minimum number of steps taken in any instance of size n.
- The average-case complexity of the algorithm, which is the function defined by the average number of steps over all instances of size n.
- The Big-Oh notation and worst-case analysis are tools that greatly simplify our ability to compare the efficiency of algorithms.
- Although esoteric functions arise in advanced algorithm analysis, a small variety of time complexities suffice and account for most algorithms that are widely used in practice.
- A basic rule of thumb in Big-Oh analysis is that worst-case running time follows from multiplying the largest number of times each nested loop can iterate.
- Pattern matching is the most fundamental algorithmic operation on text strings.
- A logarithm is simply an inverse exponential function.
- Exponential functions grow at a distressingly fast rate. Thus, inverse exponential functions--i.e. Logarithms--grow refreshingly slowly.
- Logarithms arise in any process where things are repeatedly halved.
- Binary search is one of the most powerful ideas in algorithm design.
- Logarithms arise whenever things are repeatedly halved or doubled.
- The maximum benefit from good data structures results from designing your program around them in the first place.
- Data structures can be neatly classified as either contiguous or linked, depending upon whether they are based on arrays or pointers.
- Contiguously-allocated structures are composed of single slabs of memory, and include arrays, matrices, heaps, and hash tables.
- Linked data structures are composed of distinct chunks of memory bound together by pointers, and include lists, trees, and graph adjacency lists.
- The array is the fundamental contiguously-allocated data structure. Arrays are structures of fixed-sized data records such that each element can be efficiently located by its index or address.
- Physical continuity between successive data accesses helps exploit the high-speed cache memory on modern computer architectures.
- Pointers are the connections that hold the pieces of linked structures together.
- Pointers represent the address of a location in memory.
- A variable storing a pointer to a given data item can provide more freedom than storing a copy of the item itself.
- Dynamic memory allocation provides us with flexibility on how and where we use our limited storage resources.
- Stacks are simple to implement and very efficient. For this reason, stacks are probably the right container to use when retrieval order doesn’t matter at all, such as when processing batch jobs.
- Algorithmically, LIFO tends to happen in the course of executing recursive algorithms.
- Queues are somewhat trickier to implement than stacks and thus are more appropriate for applications where the order is important.
- The dictionary data type permits access to data items by content. You stick an item into a dictionary so you can find it when you need it.
- Data structure design must balance all the different operations it supports. The fastest data structure to support both operations A and B may well not be the fastest structure to support either operation A or B.
- Picking the wrong data structure for the job can be disastrous in terms of performance. Identifying the very best data structure is usually not as critical, because there can be several choices that perform similarly.
- Building algorithms around data structures such as dictionaries and priority queues leads to both clean structure and good performance.
- Greedy heuristics always try to grab the best possible thing first.
- When working with a large enough data set, only linear or near linear algorithms are likely to be fast enough.
- Choosing the right data structure is often the key to getting the time complexity down to this point.
- Using smart heuristics like greedy is likely to significantly improve quality over the naive approach.
- Hash tables are a very practical way to maintain a dictionary. They exploit the fact that looking an item up in an array takes constant time once you have its index.
- A hash function is a mathematical function that maps keys to integers. We will use the value of our hash function as an index into an array, and store our item at that position.
- Chaining is the easiest approach to collision resolution.
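A minimal sketch of a chained hash table (class and method names are my own); each array slot holds the list of key/value pairs that hash to it.

```python
class ChainedHashTable:
    """Dictionary maintained by hashing with chaining."""

    def __init__(self, buckets=8):
        self.table = [[] for _ in range(buckets)]

    def _slot(self, key):
        return hash(key) % len(self.table)   # hash function maps keys to array indices

    def put(self, key, value):
        chain = self.table[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                     # key already present: overwrite it
                chain[i] = (key, value)
                return
        chain.append((key, value))           # otherwise append to this slot's chain

    def get(self, key):
        for k, v in self.table[self._slot(key)]:
            if k == key:
                return v
        raise KeyError(key)
```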
- Pragmatically, a hash table is often the best data structure to maintain a dictionary.
- Strings are sequences of characters where the order of the characters matters.
- Text strings are fundamental to a host of computing applications, from programming language parsing/compilation, to web search engines, to biological sequence analysis.
- The primary data structure for representing strings is an array of characters. This allows us constant-time access to the i-th character of the string.
- The most fundamental operation on text strings is substring search.
- The key idea of hashing is to represent a large object using a single number. The goal is a representation of the large object by an entity that can be manipulated in constant time, such that it is relatively unlikely that two different large objects map to the same value.
- Hashing has a variety of clever applications beyond just speeding up search.
- Suffix trees are amazing data structures.
- Sorting is the basic building block that many other algorithms are built around. By understanding sorting, we obtain an amazing amount of power to solve other problems.
- An important algorithm design technique is to use sorting as a basic building block, because many other problems become easy once a set of items is sorted.
- Never be afraid to spend time sorting, provided you use an efficient sorting routine.
- Sorting lies at the heart of many algorithms. Sorting the data is one of the first things any algorithm designer should try in the quest for efficiency.
- Heaps are a simple and elegant data structure for efficiently supporting the priority queue operations insert and extract-min.
- Although other algorithms prove slightly faster in practice, you won’t go wrong using heapsort for sorting data that sits in the computer’s main memory.
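A short sketch using Python's heapq module: the priority queue operations insert and extract-min give heapsort almost for free.

```python
import heapq

def heapsort(items):
    """Insert everything into a heap, then repeatedly extract the minimum."""
    heap = []
    for x in items:
        heapq.heappush(heap, x)                               # insert
    return [heapq.heappop(heap) for _ in range(len(heap))]    # extract-min, n times

print(heapsort([9, 4, 7, 1, 8]))   # [1, 4, 7, 8, 9]
```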
- Recursive algorithms reduce large problems into smaller ones.
- Randomization is a powerful tool to improve algorithms with bad worst-case but good average-case complexity. It can be used to make algorithms more robust to boundary cases and more efficient on highly structured input instances that confound heuristic decisions.
- The fundamental step in quicksort is partitioning elements around a pivot.
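A sketch of that partitioning step inside a plain recursive quicksort (variable names are my own):

```python
def quicksort(a, lo=0, hi=None):
    """Partition around a pivot, then recurse on the elements to each side of it."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return a
    pivot = a[hi]
    firsthigh = lo                                  # boundary between "low" and "high" elements
    for i in range(lo, hi):
        if a[i] < pivot:
            a[i], a[firsthigh] = a[firsthigh], a[i]
            firsthigh += 1
    a[firsthigh], a[hi] = a[hi], a[firsthigh]       # put the pivot between the two regions
    quicksort(a, lo, firsthigh - 1)
    quicksort(a, firsthigh + 1, hi)
    return a

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))
```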
- Sorting can be used to illustrate most algorithm design paradigms. Data structure techniques, divide-and-conquer, randomization, and incremental construction all lead to efficient sorting algorithms.
- Binary search is a fast algorithm for searching in a sorted array of keys.
- Binary search and its variants are the quintessential divide-and-conquer algorithms.
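A minimal binary search sketch; each comparison discards half of the remaining candidates.

```python
def binary_search(keys, target):
    """Search a sorted array by repeatedly halving the candidate interval."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if keys[mid] == target:
            return mid
        if keys[mid] < target:
            lo = mid + 1        # target can only be in the upper half
        else:
            hi = mid - 1        # target can only be in the lower half
    return -1                   # not found

print(binary_search([2, 3, 5, 7, 11, 13], 7))   # 3
```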
- One of the most powerful techniques for solving problems is to break them down into smaller, more easily solved pieces. Smaller problems are less overwhelming, and they permit us to focus on details that are lost when we are studying the entire problem.
- Effective parallel processing requires decomposing jobs into at least as many tasks as processors, and is becoming more important with the advent of cluster computing and multicore processors.
- The key to solving many algorithmic problems is to think of them in terms of graphs. Graph theory provides a language for talking about the properties of relationships, and it is amazing how often messy applied problems have a simple description and solution in terms of classical graph properties.
- Designing truly novel graph algorithms is a very difficult task. The key to using graph algorithms effectively in applications lies in correctly modeling your problem so you can take advantage of existing algorithms.
- Graphs can be used to model a wide variety of structures and relationships. Graph-theoretic terminology gives us a language to talk about them.
- Adjacency lists are the right data structure for most applications of graphs.
- The take-home lesson is that even elementary problems like initializing data structures can prove to be bottlenecks in algorithm development. Most programs working with large amounts of data have to run in linear or almost linear time.
- Breadth-first and depth-first searches provide mechanisms to visit each edge and vertex of the graph. They prove the basis of most simple, efficient graph algorithms.
- By storing the vertices in a first-in, first-out (FIFO) queue, we explore the oldest unexplored vertices first.
- By storing the vertices in a last-in, first-out (LIFO) stack, we explore the vertices by lurching along a path, visiting a new neighbor if one is available, backtracking up only when we are surrounded by previously discovered vertices.
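A compact sketch of both traversals; the only difference that matters is FIFO queue versus LIFO stack (the toy graph is my own).

```python
from collections import deque

def bfs(graph, start):
    """FIFO queue: explore the oldest unexplored vertex first."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order

def dfs(graph, start):
    """LIFO stack: lurch down a path, backtracking only when surrounded by visited vertices."""
    seen, stack, order = set(), [start], []
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        stack.extend(graph[v])
    return order

g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs(g, "a"), dfs(g, "a"))   # ['a', 'b', 'c', 'd'] ['a', 'c', 'd', 'b']
```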
- DFS organizes vertices by entry/exit times, and edges into tree and back edges. This organization is what gives DFS its real power.
- As algorithm design paradigms go, a depth-first search isn’t particularly intimidating. It is surprisingly subtle, however, meaning that its correctness requires getting details right.
- Greedy algorithms make the decision of what to do next by selecting the best local option from all available choices without regard to the global structure.
- Most applications of graphs can be reduced to standard graph properties where well-known algorithms can be used. These include minimum spanning trees, shortest paths, and other problems presented in the catalog.
- The maximum flow from s to t always equals the weight of the minimum s-t cut. Thus, flow algorithms can be used to solve general edge and vertex connectivity problems in graphs.
- Proper modeling is the key to making effective use of graph algorithms.
- Designing novel graph algorithms is very hard, so don’t do it. Instead, try to design graphs that enable you to use classical algorithms to model your problem.
- Backtracking is a systematic way to iterate through all the possible configurations of a search space.
- Combinatorial searches, when augmented with tree pruning techniques, can be used to find the optimal solution of small optimization problems. How small depends upon the specific problem, but typical size limits are somewhere between 15 <= N <= 50 items.
- Successful pruning requires looking ahead to see when a solution is doomed to go nowhere, and backing off as soon as possible.
- Even simple pruning strategies can suffice to reduce running time from impossible to instantaneous.
- Clever pruning can make short work of surprisingly hard combinatorial search problems. Proper pruning will have a greater impact on search time than any other factor.
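A small backtracking-with-pruning sketch using the N-queens puzzle: any partial placement in which two queens attack each other is abandoned immediately rather than extended.

```python
def n_queens(n, cols=()):
    """cols[r] is the column of the queen already placed in row r."""
    row = len(cols)
    if row == n:
        return [cols]                                     # a complete legal configuration
    solutions = []
    for col in range(n):
        doomed = any(col == c or abs(col - c) == row - r  # same column or same diagonal
                     for r, c in enumerate(cols))
        if not doomed:                                    # prune: never extend a doomed placement
            solutions += n_queens(n, cols + (col,))
    return solutions

print(len(n_queens(6)))   # 4 distinct solutions on a 6x6 board
```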
- The simplest method to search in a solution space uses random sampling. It is also called the Monte Carlo method. We repeatedly construct random solutions and evaluate them, stopping as soon as we get a good enough solution, or (more likely) when we are tired of waiting. We report the best solution found over the course of our sampling.
- Simulated annealing is effective because it spends much more of its time working on good elements of the solution space than on bad ones, and because it avoids getting trapped repeatedly in the same local optima.
- Simulated annealing is a simple but effective technique for efficiently obtaining good but not optimal solutions to combinatorial search problems.
- I have never encountered any problem where genetic algorithms seemed to me the right way to attack it. Further, I have never seen any computational results reported using genetic algorithms that have favorably impressed me. Stick to simulated annealing for your heuristic search voodoo needs.
- Your time spent parallelizing code might be better spent enhancing the sequential version.
- I recommend considering parallel processing only after attempts at solving a problem sequentially prove too slow. Even then, I would restrict attention to algorithms that parallelize the problem by partitioning the input into distinct tasks where no communication is needed between the processors, except to collect the final results.
- Once you understand it, dynamic programming is probably the easiest algorithm design technique to apply in practice.
- Dynamic programming is a technique for efficiently implementing a recursive algorithm by storing partial results. The trick is seeing whether the naive recursive algorithm computes the same subproblems over and over and over again. If so, storing the answer for each subproblem in a table to look up instead of recompute can lead to an efficient algorithm.
- Start with a recursive algorithm or definition. Only once we have a correct recursive algorithm do we worry about speeding it up by using a results matrix.
- Dynamic programming is generally the right method for optimization problems on combinatorial objects that have an inherent left to right order among components.
- Dynamic programming is essentially a tradeoff of space for time. Repeatedly recomputing a given quantity is harmless unless the time spent doing so becomes a drag on performance. Then we are better off storing the results of the initial computation and looking them up instead of recomputing them again.
- Explicit caching of the results of recursive calls provides most of the benefits of dynamic programming, including usually the same running time as the more elegant full solution. If you prefer doing extra programming to more subtle thinking, you can stop here.
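A short example of exactly that: the recursive definition of edit distance, with the recursive calls cached so each subproblem is evaluated only once.

```python
from functools import lru_cache

def edit_distance(a, b):
    """Recursive definition first; caching the calls makes it a dynamic program."""
    @lru_cache(maxsize=None)
    def dist(i, j):
        if i == 0:
            return j                                              # insert the rest of b
        if j == 0:
            return i                                              # delete the rest of a
        return min(dist(i - 1, j - 1) + (a[i - 1] != b[j - 1]),   # match or substitute
                   dist(i, j - 1) + 1,                            # insert
                   dist(i - 1, j) + 1)                            # delete
    return dist(len(a), len(b))

print(edit_distance("kitten", "sitting"))   # 3
```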
- Once you understand dynamic programming, it can be easier to work out such algorithms from scratch than to try to look them up.
- For any optimization problem on left-to-right objects, such as characters in a string, elements of a permutation, points around a polygon, or leaves in a search tree, dynamic programming likely leads to an efficient algorithm to find the optimal solution.
- Dynamic programming algorithms are only as correct as the recurrence relations they are based on.
- The running time of any dynamic programming algorithm is a function of two things: (1) number of partial solutions we must keep track of, and (2) how long it will take to evaluate each partial solution.
- Without an inherent left-to-right ordering on the objects, dynamic programming is usually doomed to require exponential space and time.
- The global optimum is often noticeably better than the solution found by typical heuristics. How important this improvement is depends on your application, but it can never hurt.
- Reductions are a way to show that two problems are essentially identical. A fast algorithm for one of the problems implies a fast algorithm for the other.
- A small set of NP-complete problems suffice to prove the hardness of most other hard problems.
- Approximation algorithms guarantee answers that are always close to the optimal solution. They can provide a practical approach to dealing with NP-complete problems.
- The key to algorithm design (or any other problem-solving task) is to proceed by asking yourself questions to guide your thought process.
- Should you get stuck on the problem, the best thing to do is move onto the next question.
- By clearly articulating your reasoning as to why something doesn’t work, you can check whether you have glossed over a possibility that you didn’t want to think hard enough about. It is amazing how often the reason you can’t find a convincing explanation for something is because your conclusion is wrong.
- The distinction between strategy and tactics is important to keep aware of during any design process. Strategy represents the quest for the big picture--the framework around which we construct our path to the goal. Tactics are used to win the minor battles we must fight along the way.
- In problem-solving, it is important to check repeatedly whether you are thinking on the right level.
- Problem-solving is not a science, but part art and part skill. It is one of the skills most worth developing.
- Data structures are not so much algorithms as they are the fundamental constructs around which you build your application. Becoming fluent in what the standard data structures can do for you is essential to get full value from them.
- The abstract data type “dictionary” is one of the most important structures in computer science.
- In practice, it is more important to avoid using a bad data structure than to identify the single best option available.
- For applications involving a moderate-to-large number of keys (say between 100 and 10,000,000), a hash table is probably the right way to go.
- Balanced search trees use local rotation operations to restructure search trees, moving more distant nodes closer to the root while maintaining the in-order search structure of the tree.
- Which tree is best for your application? Probably the one of which you have the best implementation. The flavor of balanced tree is probably not as important as the skill of the programmer who coded it.
- Once a data structure has to be stored outside of main memory, the search time grows by several orders of magnitude.
- The idea behind a B-tree is to collapse several levels of a binary search tree into a single large node, so that we can make the equivalent of several search steps before another disk access is needed.
- Priority queues are useful data structures in simulations, particularly for maintaining a set of future events ordered by time. They are called “priority” queues because they enable you to retrieve items not by the insertion time (as in a stack or queue), nor by a key match (as in a dictionary), but by which item has the highest priority of retrieval.
- Suffix trees and arrays are phenomenally useful data structures for solving string problems elegantly and efficiently. Proper use of suffix trees often speeds up string processing algorithms [...] to linear time--likely the answer.
- Tries are very simple to build (repeatedly insert new strings) and very fast to search, although they can be expensive in terms of memory.
- Special efforts must be taken to represent very large graphs efficiently.
- The bottom line is that you should try to avoid working in high-dimensional spaces, perhaps by discarding (or projecting away) the least important dimensions.
- Numerical algorithms typically perform repeated floating-point computations, which accumulate error at each operation until, eventually, the results are meaningless.
- A simple and dependable way to test for round-off errors in numerical programs is to run them both at single and double precision, and then think hard whenever there is a disagreement.
- Linear programming is the most important problem in mathematical optimization and operations research.
- The simplex method is the standard algorithm for linear programming. Each constraint in a linear programming problem acts like a knife that carves away a region from the space of possible solutions. We seek the point within the remaining region that maximizes (or minimizes) f(X).
- The bottom line on linear programming is this: you are much better off using an existing LP code than writing your own.
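For illustration, this is what “using an existing LP code” can look like in practice; scipy.optimize.linprog is used here purely as an example solver, and the toy problem is my own.

```python
from scipy.optimize import linprog

# Maximize 3x + 2y subject to x + y <= 4, x + 3y <= 6, x >= 0, y >= 0.
# linprog minimizes, so the objective is negated.
result = linprog(c=[-3, -2],
                 A_ub=[[1, 1], [1, 3]],
                 b_ub=[4, 6],
                 bounds=[(0, None), (0, None)])
print(result.x, -result.fun)   # optimal point (4, 0) with objective value 12
```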
- Linear programming is an algorithmic problem of such economic importance that commercial implementations are far superior to free versions.
- It is fundamentally impossible to produce truly random numbers on any deterministic device.
- The best we can hope for are pseudorandom numbers, a stream of numbers that appear as if they were generated randomly.
- The standard algorithm of choice [for generating random numbers] is the linear congruential generator. It is fast, simple, and (if instantiated with the right constants) gives reasonable pseudorandom numbers.
- Note that the stream of numbers produced by a linear congruential generator repeats the instant the first number repeats.
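A minimal LCG sketch; the constants below are widely quoted textbook values, chosen here just for illustration.

```python
def lcg(seed, a=1103515245, c=12345, m=2**31):
    """Linear congruential generator: x(n+1) = (a * x(n) + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m   # the stream repeats as soon as any value repeats
        yield x

gen = lcg(42)
print([next(gen) for _ in range(3)])
```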
- High but not arbitrary-precision arithmetic can be conveniently performed using the Chinese remainder theorem and modular arithmetic.
- Taking the Fourier transform of a function is equivalent to representing it as the sum of sine functions.
- By eliminating undesirable high- and/or low-frequency components and taking an inverse Fourier transform to get us back into the time domain, we can filter an image to remove noise and other artifacts.
- A convolution is the pairwise product of elements from two different sequences, such as in multiplying two n-variable polynomials f and g or comparing two character strings.
- Indeed, “when in doubt, sort” is one of the first rules of algorithm design.
- Degeneracy refers to annoying special cases that must be treated in substantially different ways, such as when two lines intersect in more or less than a single point.
- There are three primary approaches to dealing with degeneracy:
- Ignore it--Make as an operating assumption that your program will work correctly only if no three points are collinear, no three lines meet at a point, no intersections happen at the endpoints of line segments, etc.
- Fake it-- Randomly or symbolically perturb your data so that it becomes nondegenerate.
- Deal with it--Geometric applications can be made more robust by writing special code to handle each of the special cases that arise.
- Geometric computations often involve floating-point arithmetic, which leads to problems with overflows and numerical precision. There are three basic approaches to the issue of numerical stability:
- Integer arithmetic--By forcing all points of interest to lie on a fixed-size integer grid, you can perform exact comparisons to test whether any two points are equal or two line segments intersect.
- Double precision reals--By using double-precision floating point numbers, you may get lucky and avoid numerical errors.
- Arbitrary precision arithmetic--This is certain to be correct, but also to be slow.
- Sets are collections of symbols whose order is assumed to carry no significance, while strings are defined by the sequence or arrangement of symbols.
- The assumption of a fixed order makes it possible to solve string problems much more efficiently than set problems, through techniques such as dynamic programming and advanced data structures like suffix trees.
- A good algorithm designer does not reinvent the wheel, and a good programmer does not rewrite code that other people have written. Picasso put it best: “Good artists borrow. Great artists steal.”
20190406
Thinking Forth by Leo Brodie
- To call forth a concept, a word is needed; to portray a phenomenon, a concept is needed.
- Programming computers can be crazy making.
- Programmers design, build, and repair the stuff of imagination, ghostly mechanisms that escape the senses.
- Our work takes place not in RAM, not in an editor, but within our own minds.
- Forth is a language, an operating system, a set of tools, and a philosophy.
- Assembly-language programming is characterized by a one-for-one correspondence between each command that the programmer types and each command that the processor performs.
- High-level languages are clearly more “powerful” than assembly languages in the sense that each instruction might compile dozens of machine instructions. But more significantly, high-level languages eliminate the linear correspondence between source code and the resulting machine instructions.
- Decisions about the algorithms and associated data structures should be made in parallel.
- Simplicity is the primary measurement recommended for evaluating alternative designs relative to reduced debugging and modification time.
- By dividing a problem into simple modules, programs were expected to be easier to write, easier to change, and easier to understand.
- The safest kind of coupling is the passing of local variables as parameters from one module to another.
- The smallest atom of a Forth program is not a module or a subroutine or a procedure, but a “word”.
- Everything in Forth is a word.
- Calls are implicit.
- Data passing is implicit.
- Because Forth uses a stack for passing data, words can nest within words.
- Forth eliminates from our programs the details of how words are invoked and how data are passed.
- Because Forth is interactive, the programmer can type and test the primitive commands.
- Forth programming consists of extending the root language toward the application, providing new commands that can be used to describe the problem at hand.
- Forth is a programming environment for creating application-oriented languages.
- The purpose of hiding information is simply to minimize the effects of a possible design-change by localizing things that might change within each component.
- Forth conveniently separates “things” from “actions” by allowing addresses of data structures to be passed on the stack and providing the “fetch” and “store” commands.
- Forth pays little attention to whether something is a data structure or an algorithm. This indifference allows us programmers incredible freedom in creating the parts of speech we need to describe our applications.
- Forth is a design language.
- Unfortunately, human foresight is limited even under the best conditions. Too much planning becomes counterproductive.
- Constructing a prototype is a more refined way to plan, just as breadboarding is in electronic design.
- Experimentation proves more reliable in arriving at the truth than the guesswork of planning.
- Overall, Forth outdoes all other high-level languages in speed, capability, and compactness.
- Although Forth is an interpretive language, it executes compiled code.
- Forth’s use of a data stack greatly reduces the performance cost of passing arguments from word to word.
- Forth can do anything any other language can do--usually easier.
- Forth can be written to run on top of any operating system or, for those who prefer it, Forth can be written as a self-sufficient operating system including its own terminal drivers and disk drivers.
- Start simple. Get it running. Learn what you’re trying to do. Add complexity gradually, as needed to fit the requirements and constraints. Don’t be afraid to restart from scratch.
- Testing and prototyping are the best ways to discover what is really needed.
- For newcomers to Forth: Keep the analysis phase to a minimum.
- For Forth addicts: Hold off on coding as long as you possibly can.
- Plan for change (by designing components that can be changed).
- Prototype.
- The first step is to determine what the application should do.
- A conceptual model is an imaginary solution to the problem. It is a view of how the system appears to work.
- A design begins to describe how the system really works.
- Strive to build a solid conceptual model before beginning the design.
- First, and most importantly, the conceptual model should describe the system’s interfaces.
- Decide on error- and exception-handling early as part of defining the interface.
- Develop the conceptual model by imagining the data traveling through and being acted upon by the parts of the model.
- Forth encourages you to think in terms of the conceptual model, and Forth’s implicit use of a data stack makes the passing of data among modules so simple it can usually be taken for granted. This is because Forth, used properly, approaches a functional language.
- Most of your efforts at defining a problem will center on describing the interface.
- Visualization of ideas helps in understanding problems, particularly those problems that are too complex to perceive in a linear way.
- You don’t understand a problem until you can simplify it.
- Keep it simple.
- Given two solutions to a problem, the correct one is the simpler.
- Generality usually involves complexity. Don’t generalize your solution any more than will be required; instead, keep it changeable.
- Go back to what the problem was before the customer tried to solve it. Exploit the “don’t cares”.
- To simplify, quantize.
- To simplify, keep the user out of trouble.
- To simplify, take advantage of what’s available.
- Careful planning is essential, because things always take longer than you expect.
- The mean time for making a “two-hour” addition to an application is approximately 12 hours.
- Conventional wisdom reveres complexity.
- Always give your client some options. Clients like options.
- Everything takes longer than you think, including thinking.
- To see the relationship between two things, put them close together. To remind yourself of the relationship, keep them together.
- The goal of preliminary design is to determine what components are necessary to accomplish the requirements.
- Within each component, implement only the commands needed for the current iteration. (But don’t preclude future additions.)
- Definitions are invoked by being named.
- The purpose of an interface component is to implement, and hide information about, the data interface between two or more other components.
- Both data structures and the commands involved in the communication of data between modules should be localized in an interface component.
- We must distinguish between data structures that are validly used only within a single component and those that may be shared by more than one component.
- Express in objective units any data to be shared by components.
- One of Forth’s rules is that a word must already have been defined to be invoked or referred to.
- Most of us are guilty of over-emphasizing the difference between “high-level” and “low-level”. This notion is an arbitrary one. It limits our ability to think clearly about software problems.
- An object is a portion of code that can be invoked by a single name, but that can perform more than one function. To select a particular function you have to invoke the object and pass it a parameter or a group of parameters.
- Don’t bury your tools.
- A component is simply a set of commands that together transform data structures and algorithms into useful functions. These functions can be used without knowledge of the structures and/or algorithms within.
- By thinking about the ways in which we solve problems, apart from the problems themselves, we enrich our subconscious storehouse of techniques.
- Determine your goal.
- Picture the problem as a whole.
- Develop a plan.
- Set a course for action and avoid the trap of fumbling about aimlessly.
- Think of an analogous problem.
- Work forward.
- Work backward.
- Belief is a necessary ingredient for successfully working backward.
- Recognize the auxiliary problem.
- Step back from the problem.
- It’s easy to get so emotionally attached to one particular solution that we forget to keep an open mind.
- Use whole-brain thinking.
- When a problem has you stumped and you seem to be getting nowhere, relax, stop worrying about it, perhaps even forget about it for a while.
- Creative people have always noted that their best ideas seem to come out of the blue, in bed or in the shower.
- Evaluate your solutions. Look for other solutions.
- Don’t invest too much effort in your first solution without asking yourself for a second opinion.
- The human mind works exceptionally well with analogies.
- Each definition should perform a simple, well-defined task.
- In designing a component, the goal is to create a lexicon that will make your later code readable and easy to maintain.
- Let numbers precede names.
- Let text follow names.
- Let definitions consume their arguments.
- Use zero-relative numbering.
- Since computers are numeric processors, software becomes easier to write when we use zero-relative numbering.
- Let addresses precede counts.
- Let sources precede destinations.
- Avoid expectations (in the input stream).
- Let commands perform themselves.
- Don’t write your own interpreter/compiler when you can use Forth’s.
- A simple solution is one that does not obscure the problem with irrelevancies.
- An algorithm is a procedure, described as a finite number of rules, for accomplishing a certain task. The rules must be unambiguous and guaranteed to terminate after a finite number of applications.
- A data structure is an arrangement of data or locations for data, organized especially to match the problem.
- We’ve stated before that the best solution to a problem is the simplest adequate one; for any problem we should strive for the simplest approach.
- In choosing which approach to apply towards solving a problem, give preference in the following order: calculation, data structures, logic.
- In devising an algorithm, consider exceptions last. In writing code, handle exceptions first.
- In Forth we try to minimize our dependence on logic.
- If you have trouble thinking about a conceptual model, visualize it--or draw it--as a mechanical device.
- Good organization has three aspects: decomposition, composition, disk partitioning.
- Composition is the putting together of pieces to create a whole. Good composition requires as much artistry as good decomposition.
- Structure your application listing like a book: hierarchically.
- Modularity of the source code is one of the reasons for Forth’s quick turnaround time for editing, loading, and testing (necessary for the iterative approach).
- Begin all definitions at the left edge of the screen, and define only one word per line.
- Spacing and indentation are essential for readability.
- Every colon or code definition that consumes and/or returns any arguments on the stack must include a stack-effect comment.
- The purpose of commenting is to allow a reader of your code to easily determine what’s going on.
- Design your application so that the code, not the comments, conveys the meaning.
- The most-accurate, least-expensive documentation is self-documenting code.
- Choose names according to “what”, not “how”.
- Find the most expressive word.
- Spell names in full.
- Favor short words.
- Hyphenated names may be a sign of bad factoring.
- Don’t bundle numbers into names.
- Use prefixes and suffixes to differentiate between like words rather than to cram details of meaning into the name itself.
- Maintainability requires readability.
- Factoring means organizing code into useful fragments.
- Don’t pass control flags downward.
- If a series of definitions contains identical functions, with variation only in data, use a defining word.
- Keep definitions short.
- A word should be a line long. That’s the target.
- Factor at the point where you feel unsure about your code (where complexity approaches the conscious limit).
- Factor at the point where a comment seems necessary.
- Limit repetition of code.
- When factoring out duplicate code, make sure the factored code serves a single purpose.
- Look for repetition of patterns.
- Be sure you can name what you factor.
- If you have a concept that you can’t assign a single name to, not a hyphenated name, but a name, it’s not a well-formed concept. The ability to assign a name is a necessary part of decomposition.
- Simplify the command interface by reducing the number of commands.
- For maximum maintainability, limit redundancy even at compile time.
- Work on only one aspect of a problem at a time.
- By concentrating on one dimension of the problem at a time, you can solve each dimension more efficiently.
- Don’t change too much at once.
- Don’t try to anticipate ways to factor too early.
- Building levels of abstraction is a dynamic process, not one you can predict.
- Today, make it work. Tomorrow, optimize it.
- A good programmer continually tries to balance the expense of building-in changeability against the expense of changing things later if necessary.
- Anticipate things-that-may-change by organizing information, not by adding complexity. Add complexity only as necessary to make the current iteration work.
- Forth handles data in one of two ways: either on the stack or in data structures.
- The simplest way for Forth words to pass arguments to each other is via the stack.
- Simplify code by using the stack. But don’t stack too deeply within any single definition. Redesign, or, as a last resort, use a named variable.
- Generally, three elements on the stack is the most you can manage within a single definition.
- Especially in the design phase, keep on the stack only the arguments you’re using immediately. Create local variables for any others. (If necessary, eliminate the variables during the optimization phase.)
- When determining which arguments to handle via data structures rather than via the stack, choose the arguments that are more permanent or that represent a current state.
- Make sure that stack effects balance out under all possible control flows.
- When doing two things with the same number, perform the function that will go underneath first.
- Where possible, keep the number of return arguments the same in all possible cases.
- Keep return stack operators symmetrical.
- Unless it involves cluttering up the stack to the point of unreadability, try to pass arguments via the stack rather than pulling them out of variables.
- Most of the modularity of Forth comes from designing and treating Forth words as “functions” in the mathematical sense.
- The return stack is a component of the Forth system designed to hold return addresses, and thereby serve as an indication of where you’ve been and where you’re going.
- When the application requires handling a group of conditions simultaneously, use a state table, not separate variables.
- A CONSTANT’s value should never be changed once the application is compiled.
- Using the stack is preferable for testing and reusability, but too many values manipulated on the stack by a single definition hurts readability and writability.
- Control structures aren’t as important in Forth as they are in other languages.
- The use of conditional structures adds complexity to your code.
- The more complex your code is, the harder it will be for you to read and to maintain.
- Give each function its own definition.
- The Forth dictionary is a giant string case statement. The match and execute functions are hidden within the Forth system.
- Don’t test for something that has already been excluded.
- When multiple conditions have dissimilar weights (in likelihood or calculation time) nest conditionals with the term that is least likely to be true or easiest to calculate on the outside.
- The most elegant code is that which most closely matches the problem. Choose the control structure that most closely matches the control-flow problem.
- Don’t decide, calculate.
- Many times conditional control structures are applied mistakenly to situations in which the difference in outcome results from a difference in numbers.
- A trick is simply taking advantage of certain properties of operation.
- The use of tricks becomes dangerous when a trick depends on something likely to change, or when the thing it depends on is not protected by information hiding.
- Use decision tables.
- A decision table is clearly a better choice than a conditional structure when the problem has multiple dimensions.
- One change at the bottom can save ten decisions at the top.
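
A rough sketch of the “calculate, don’t decide” idea, in C rather than Forth (the zones and rates below are invented purely for illustration): a table lookup replaces a chain of conditionals, and changing one entry at the bottom replaces edits to many decisions at the top.

```c
#include <stdio.h>

/* Hypothetical example: instead of an if/else chain that decides which
 * rate applies to each shipping zone, index a table.  The zone numbers
 * and rates are made up for illustration. */
static const double rate_per_kg[] = { 1.10, 1.75, 2.40, 3.95 };
#define NZONES (sizeof rate_per_kg / sizeof rate_per_kg[0])

double shipping_cost(unsigned zone, double kg)
{
    if (zone >= NZONES)              /* handle the exception first */
        return -1.0;
    return rate_per_kg[zone] * kg;   /* then calculate, don't decide */
}

int main(void)
{
    printf("%.2f\n", shipping_cost(2, 3.0));  /* prints 7.20 */
    return 0;
}
```
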
- Don’t test for something that can’t possibly happen.
- Reexamine the algorithm.
- A lot of conditionals arise from fuzzy thinking about the problem.
- Avoid the need for special handling.
- Use the structured exit.
- Fooling with the return stack is like playing with fire. You can get burned.
- Take the action when you know you need to, not later.
- Don’t put off till run time what you can compile today.
- Any time you can make a decision prior to compiling an application, do.
- The use of logic and conditionals as a significant structural element in programming leads to overly-complicated, difficult-to-maintain, and inefficient code.
- Most people go out and attack problems with complicated tools. But simpler tools are available and more useful.
- Forth is expressed in words (and numbers), separated by spaces.
- All words, whether included with the system or user-defined, exist in the “dictionary”, a linked list.
- Writing a Forth application consists of building increasingly powerful definitions in terms of previously defined ones.
20181216
Make: FPGAs by David Romano
- For the FPGA hobbyist and DIYers, Papilio by Gadget Factory is second to none.
- The Papilio series consists of FPGA development boards and add-on hardware application modules called “Wings” that plug into the main board--sort of like Arduino shields.
- Logic cell ratings are intended to show the logic density of one Xilinx device as compared to another device.
- Like a logic block, the typical cell includes a couple of flip-flops, multiplier, logic gates, and a small amount of RAM for a configurable lookup table (LUT).
- A block RAM (BRAM) is a dedicated two-port memory block containing several kilobits of RAM.
- A DOA (dead on arrival) test is a very popular test for SoC validation engineers to write (typically it is the first test they write for a new chip or design block). This test usually checks to see if the chip or design has any life at all after power-up.
- Logic synthesis is the process by which the register-transfer level (RTL) of the SoC is turned into a design implementation in terms of logic gates, typically by a CAD tool called a synthesis tool.
- We can use a simple binary counter as a clock frequency divider circuit.
- It’s very important to use a naming convention for labeling--this is a good design habit to get yourself into. For small designs, it is not critical, but in larger designs or joint designs it becomes essential.
- Verilog modules are like functions in other programming languages. They are pieces of code that can be used and reused within a single program.
- Modules can contain both structural and behavioral statements. Structural statements represent circuit components like logic gates, counters, and microprocessors. Behavioral-level statements are programming statements (such as loops, if-then statements, and the stimulus vectors used to exercise a circuit) that have no direct mapping to circuit components.
- We now can see that a synchronous circuit consists of two kinds of elements: registers and combinational logic. Registers (usually implemented as D flip-flops) synchronize the circuit’s operation to the edges of the clock signal, and are the only elements in the circuit that have memory properties. Combinational logic performs all the logical functions in the circuit, and it typically consists of logic gates.
- Binary counters make good frequency dividers.
- Using a simple LED blink test and an FPGA frequency divider is a good vehicle to test the functionality of a new FPGA platform.
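
The book’s dividers are written in Verilog; as a rough non-HDL illustration of why a binary counter divides a clock, here is a C sketch showing that bit k of a free-running counter toggles once every 2^k input ticks, i.e. forms a square wave at f_clk / 2^(k+1).

```c
#include <stdio.h>

/* Software sketch (not HDL) of the divide-by-2^n principle:
 * bit 3 of a free-running counter toggles every 8 ticks, so it is a
 * square wave at f_clk / 16. */
int main(void)
{
    unsigned counter = 0;
    int prev_bit = 0;

    for (unsigned tick = 1; tick <= 64; tick++) {
        counter++;
        int bit3 = (counter >> 3) & 1;     /* "divide-by-16" output */
        if (bit3 != prev_bit) {
            printf("tick %2u: bit3 toggles to %d\n", tick, bit3);
            prev_bit = bit3;
        }
    }
    return 0;
}
```
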
- The RTL consists of two kinds of elements: registers and combinational logic. Registers (usually implemented as D flip-flops) synchronize the circuit’s operation to the edges of the clock signal, and are the only elements in the circuit that have memory properties. Combinational logic performs all the logical functions in the circuit, and it typically consists of logic gates.
- Timing issues are the most challenging issues to debug in an FPGA design. For us, it’s good to know that we will not be working with clock speeds that push the envelope of our FPGA technology.
- Doing a simple DOA simulation first, even for small designs, provides a good sanity check to see if the design works functionally at some level.
- Assigning physical I/O connections can be the most difficult part of the whole design process, because this is where you link the virtual world of your ISE design to the real world of the actual FPGA chip and your particular FPGA module circuit board layout.
- The first thing you need to wrap your head around is that when you are coding in HDL, you are not writing a software program; rather, you are describing digital hardware logic functionality.
- With concurrency, there are no sequential steps of code execution like: “first do this, then do this, then do that.” There really is only one instant in time, and that is the clock tick.
- Think of the HDL you are using as more like describing a block diagram than a flow chart.
- Real hardware operates concurrently.
- Concurrency is what differentiates HDL from other programming languages, which are sequential. Concurrency is not explicit in programming languages like C.
- In the simplest sense, a test bench is a virtual testing environment used to verify that a design does everything it’s supposed to do and doesn’t do anything it’s not supposed to do.
- A test bench applies stimuli (inputs) to the unit under test (UUT), also referred to as the device under test (DUT). It monitors the outputs, providing status and error reporting in a readable and user-friendly format.
- The I2C serial bus is a multimaster, multipoint protocol, unlike the very popular UART serial bus protocol, which is point-to-point only.
- Wishbone is an open source hardware computer bus that uses a single simple, high-speed synchronous specification to connect components together in an SoC.
- Waveforms are very useful for viewing large amounts of data quickly and efficiently.
- Graphical analysis is an easy way to see if there is a difference in test results.
- The I2C bus is a low-speed, low-power, two-wire serial bus/protocol that is an industry standard for many peripheral devices used in the electronics and computer industry.
- Brian Stuart of Drexel University has written a great piece on CARDIAC, where he explains how the simple instruction set makes it very easy to understand how complex programs can be built out of simpler sets of operations and data.
- VTACH is an OpenCores FPGA project that is actually a Verilog implementation of the original CARDIAC teaching computer from Bell Labs.
- Simply put, an SoC is a semiconductor microchip that contains multiple electronic components integrated together on a single silicon die. This single chip may contain digital, analog, mixed-signal, and even RF (radio frequency) functions that collectively comprise a complete system.
- Gadget Factory’s DesignLab is a great frontend tool for Xilinx ISE schematic entry.
- The ZPUino soft processor core is a 32-bit processor that is easily programmed like the Arduino and is a great building block for FPGA SoCs.
- SDR is a radio communication system where components that have typically been implemented in hardware (e.g. mixers, filters, amplifiers, modulators/demodulators, detectors, etc) are implemented through software, typically in an embedded system or PC.
20181206
Compiler Construction by Niklaus Wirth
- Compilers convert program texts into internal code. Hence they constitute the bridge between software and hardware.
- Genuine understanding of a subject is best acquired from an in-depth involvement with both concepts and details.
- We consider syntax analysis as a means to an end, but not as the ultimate goal.
- Computer programs are formulated in a programming language and specify classes of computing processes.
- The construction of the first compiler for the language Fortran around 1956 was a daring enterprise, whose success was not at all assured.
- The translation process essentially consists of the following parts:
- The sequence of characters of a source text is translated into a corresponding sequence of symbols of the vocabulary of the language.
- The sequence of symbols is transformed into a representation that directly mirrors the syntactic structure of the source text and lets this structure easily be recognized.
- High-level languages are characterized by the fact that objects of programs are classified according to their type. Therefore, in addition to syntactic rules, compatibility rules among types of operators and operands define the language. Hence, verification of whether these compatibility rules are observed by a program is an additional duty of a compiler. This verification is called type checking.
- On the basis of the representation resulting from step 2, a sequence of instructions taken from the instruction set of the target computer is generated. This phase is called code generation.
- A partitioning of the compilation process into as many parts as possible was the predominant technique until about 1980, because until then the available [memory] store was too small to accommodate the entire compiler.
- Modern computers with their apparently unlimited stores make it feasible to avoid intermediate storage on disk. And with it, the complicated process of serializing a data structure for output, and its reconstruction on input can be discarded as well.
- A wise compromise exists in the form of a compiler with two parts, namely a front-end and a back-end. The first part comprises lexical and syntax analyses and type checking, and it generates a tree representing the syntactic structure of the source text. This tree is held in main store and constitutes the interface to the second part which handles code generation.
- The idea of decoupling source language and target architecture has also led to projects creating several front ends for different languages generating trees for a single back-end.
- A compiler which generates code for a computer different from the one executing the compiler is called a cross compiler.
- Every language displays a structure called its grammar or syntax.
- A language is defined by the following:
- The set of terminal symbols. These are the symbols that occur in its sentences.
- The set of non-terminal symbols. They denote classes and can be substituted.
- The set of syntactic equations. These define the possible substitutions of non-terminal symbols.
- The start symbol.
- A language is the set of sequences of terminal symbols which, starting with the start symbol, can be generated by repeated application of syntactic equations, that is, substitutions.
- The idea of defining languages and their grammar with mathematical precision goes back to Noam Chomsky.
- The use of the Chomsky formalism is also responsible for the term programming language, because programming languages seemed to exhibit a structure similar to spoken languages.
- A language is "regular" if its syntax can be expressed by a single EBNF expression.
- The method of recursive descent is only one of several techniques to realize the top-down parsing principle.
- Parsing in the bottom-up direction is also called shift-reduce parsing.
- Bottom-up parsing is in general more intricate and complex than top-down parsing.
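
As a minimal sketch of recursive descent (a toy grammar of digits, “+”, “*”, and parentheses, not Wirth’s Oberon subset), each non-terminal becomes one C function that consumes input and calls the functions for the non-terminals on its right-hand side:

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy grammar:
 *   expr   = term { "+" term } ;
 *   term   = factor { "*" factor } ;
 *   factor = digit | "(" expr ")" ;
 * Each non-terminal is parsed (and here also evaluated) by one function. */
static const char *p;                 /* current position in the input */

static int expr(void);

static int factor(void)
{
    if (*p == '(') { p++; int v = expr(); if (*p == ')') p++; return v; }
    if (isdigit((unsigned char)*p)) return *p++ - '0';
    fprintf(stderr, "syntax error at '%c'\n", *p);
    exit(1);
}

static int term(void)
{
    int v = factor();
    while (*p == '*') { p++; v *= factor(); }
    return v;
}

static int expr(void)
{
    int v = term();
    while (*p == '+') { p++; v += term(); }
    return v;
}

int main(void)
{
    p = "2*(3+4)";
    printf("%d\n", expr());   /* prints 14 */
    return 0;
}
```
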
- Ultimately, the basic idea behind every language is that it should serve as a means for communication. This means that partners must use and understand the same language.
- The scanner has to recognize terminal symbols in the source text.
- As soon as an unacceptable symbol turns up, the task of the parser is completed, and the process of syntax analysis is terminated.
- We postulate the following quality criteria for error handling:
- As many errors as possible must be detected in a single scan through the text.
- As few additional assumptions as possible about the language are to be made.
- Error handling features should not slow down the parser appreciably.
- The parser program should not grow in size significantly.
- Without doubt, a simple language structure significantly simplifies error diagnostics, or, in other words, a complicated syntax complicates error handling unnecessarily.
- The essential characteristics of a good compiler, regardless of details, are that (1) no sequence of symbols leads to its crash, and (2) frequently encountered errors are correctly diagnosed and subsequently generate no, or few additional, spurious error messages.
- Context is represented by a data structure which contains an entry for every declared identifier. This entry associates the identifier with the denoted object and its properties. The data structure is known by the name symbol table.
- Every declaration results in a new symbol table entry.
- Every occurrence of an identifier in a statement requires a search of the symbol table in order to determine the attributes (properties) of the object denoted by the identifier.
- The simplest form of data structure for representing a set of items is the list. Its major disadvantage is a relatively slow search process, because it has to be traversed from its root to the desired element.
- In order to speed up the search process, the list is often replaced by a tree structure. Its advantage becomes noticeable only with a fairly large number of entries.
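
A minimal C sketch of the list-based symbol table described above (the type and field names are placeholders, not Wirth’s): every declaration prepends an entry, and every use of an identifier searches linearly from the head.

```c
#include <stdlib.h>
#include <string.h>

/* Simplest symbol table: a linked list of declared identifiers. */
typedef struct Object {
    char          name[32];
    int           type;          /* stand-in for a type descriptor */
    struct Object *next;
} Object;

static Object *top = NULL;       /* head of the list */

void declare(const char *name, int type)
{
    Object *obj = malloc(sizeof *obj);
    strncpy(obj->name, name, sizeof obj->name - 1);
    obj->name[sizeof obj->name - 1] = '\0';
    obj->type = type;
    obj->next = top;             /* every declaration adds a new entry */
    top = obj;
}

Object *lookup(const char *name)
{
    for (Object *obj = top; obj != NULL; obj = obj->next)
        if (strcmp(obj->name, name) == 0)
            return obj;          /* slow: may traverse the whole list */
    return NULL;                 /* undeclared identifier */
}
```
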
- In languages featuring data types, their consistency checking is one of the most important tasks of a compiler. The checks are based on the type attribute recorded in every symbol table entry.
- We must determine the format in which data are to be represented at run-time in the [memory] store. The choice inherently depends on the target architecture, although this fact is less apparent because of the similarity of virtually all computers in this respect.
- Every computer features certain elementary data types together with corresponding instructions, such as integer addition and floating-point addition. These types are invariably scalar types, and they occupy a small number of consecutive memory locations (bytes).
- The size of an array is its element size multiplied by the number of its elements. The address of an element is the sum of the array's address and the element's index multiplied by the element size.
- Absolute addresses of variables are usually unknown at the time of compilation. All generated addresses must be considered as relative to a common base address which is given at run-time. The effective address is then the sum of this base address and the address determined by the compiler.
- Although bytes can be accessed individually, typically a small number of bytes are transferred from or to memory as a packet, a so-called word.
- If allocation occurs strictly in sequential order it is possible that a variable may occupy (parts of) several words. But this should be avoided, because otherwise a variable access would involve several memory accesses, resulting in an appreciable slowdown.
- A simple method of overcoming this problem is to round up (or down) each variable's address to the next multiple of its size. This process is called alignment.
- The price of alignment is the loss of some bytes in memory, which is quite negligible.
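
The address arithmetic in these notes reduces to two small formulas; a C sketch with made-up numbers:

```c
#include <stdio.h>

/* address of a[i] = array base address + i * element size */
unsigned elem_addr(unsigned base, unsigned index, unsigned elem_size)
{
    return base + index * elem_size;
}

/* alignment: round an offset up to the next multiple of the item's size
 * (size is assumed to be a power of two) */
unsigned align_up(unsigned offset, unsigned size)
{
    return (offset + size - 1) & ~(size - 1);
}

int main(void)
{
    printf("%u\n", elem_addr(1000, 3, 8));  /* 1024 */
    printf("%u\n", align_up(13, 4));        /* 16   */
    return 0;
}
```
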
- Following a longstanding tradition, addresses of variables are assigned negative values, that is, negative offsets to the common base address determined during program execution.
- The acronym RISC stands for reduced instruction set computer, where "reduced" is to be understood as relative to architectures with large sets of complex instructions.
- From the viewpoints of the programmer and the compiler designer the computer consists of an arithmetic unit, a control unit and a store.
- Register instructions operate on registers only and feed data through a shifter and the arithmetic logic unit ALU.
- Memory instructions fetch and store data in memory.
- Branch instructions affect the program counter.
- Our ideal computer would be capable of directly interpreting post-fix notation. Such an ideal computer requires a stack for holding intermediate results. Such a computer architecture is called a stack architecture.
- Whenever arithmetic expressions are evaluated, the inherent danger of overflow exists. The evaluating system should therefore be suitably guarded.
- The essence of delayed code generation is that code is not emitted before it is clear that no better solution exists.
- Conditional and repeated statements are implemented with the aid of jump instructions, also called branch instructions.
- Procedures, which are also known as subroutines, are perhaps the most important tool for structuring programs. Because of their frequency of occurrence, it is mandatory that their implementation is efficient. Implementation is based on the branch instruction that saves the current PC value, and thereby the point of return; after the procedure terminates, this value is reloaded into the PC register.
- Algol-60 introduced the very fundamental concept of local variables. It implied that every identifier declared had a limited range of visibility and validity.
- In concrete terms, variables may be declared local to a procedure such that they are visible and valid within this procedure only.
- Addresses of local variables generated by the compiler are always relative to the base address of the respective activation frame. Since in programs most variables are local, their addressing also must be highly efficient. This is achieved by reserving a register to hold the base address, and to make use of the fact that the effective address is the sum of a register value and the instruction's address field.
- Parameters constitute the interface between the calling and the called procedures. Parameters on the calling side are said to be actual parameters, and those on the called side formal parameters. The latter are in fact only place holders for which the actual parameters are substituted.
- Most programming languages distinguish between at least two kinds of parameters. The first is the value parameter where, as its name suggests, the value of the actual parameter is assigned to the formal variable. The second kind is the reference parameter, where, also as suggested by its name, a reference to the actual parameter is assigned to the formal variable.
- The fact that with real numbers only approximate results can be obtained, may be understood by considering that real numbers are represented by scaled integers with a fixed, finite number of digits.
- In many computers, instructions for floating-point operands use a special set of registers. The reason behind this is that often separate co-processors, so-called floating-point units (FPUs) are used which implement all floating-point instructions and contain this set of floating-point registers.
- In order to be complete, a computer's set of instructions must also contain conversion instructions which convert integers into floating-point numbers and vice-versa. The same holds at the level of the programming language.
- Conversions between integer types are simple and efficient, because they consist of a sign extension only.
- An open array is an array parameter whose length is unknown (open) at the time of compilation.
- The two forms of data structures provided in Oberon are the array (all elements of the same type, homogeneous structure) and the record (heterogeneous structure).
- If every pointer variable is initialized to NIL, it suffices to precede every access via a pointer with a test for the pointer value NIL.
- A variable is no longer relevant when there are no references to it.
- The principal advantage of separate compilation is that changes in a module M do not invalidate clients of M, if the interface of M remains unaffected.
- A primary goal of good code optimization is the most effective use of registers in order to reduce the number of accesses to the relatively slow main memory. A good strategy of register usage yields more advantages than any other branch of optimization.
20181205
xv6: a simple, Unix-like teaching operating system
- The job of an operating system is to share a computer among multiple programs and to provide a more useful set of services than the hardware alone supports.
- An operating system provides services to user programs through an interface.
- Each running program, called a process, has memory containing instructions, data, and a stack. The instructions implement the program's computation. The data are the variables on which the computation acts. The stack organizes the program's procedure calls.
- When a process needs to invoke a kernel service, it invokes a procedure call in the operating system interface. Such a procedure call is called a system call. The system call enters the kernel; the kernel performs the service and returns.
- The kernel uses the CPU's hardware protection mechanisms to ensure that each process executing in user space can access only its own memory.
- The shell is an ordinary program that reads commands from the user and executes them, and is the primary user interface to traditional Unix-like systems.
- An xv6 process consists of user-space memory (instructions, data, and stack) and per-process state private to the kernel.
- The exec system call replaces the calling process's memory with a new memory image loaded from a file stored in the file system.
- A process that needs more memory at run-time (perhaps for malloc) can call sbrk(n) to grow its data memory by n bytes; sbrk returns the location of the new memory.
- A file descriptor is a small integer representing a kernel-managed object that a process may read from or write to.
- Internally, the xv6 kernel uses the file descriptor as an index into a per-process table, so that every process has a private space of file descriptors starting at zero.
- By convention, a process reads from file descriptor 0 (standard input), writes to file descriptor 1 (standard output), and writes error messages to file descriptor 2 (standard error).
- The read and write system calls read bytes from and write bytes to open files named by file descriptors.
- Each file descriptor that refers to a file has an offset associated with it. Read reads data from the current file offset and then advances that offset by the number of bytes read: a subsequent read will return the bytes following the ones returned by the first read.
- Like read, write writes data at the current file offset and then advances that offset by the number of bytes written: each write picks up where the previous one left off.
- A newly allocated file descriptor is always the lowest-numbered unused descriptor of the current process.
- Fork copies the parent's file descriptor table along with its memory, so that the child starts with exactly the same open files as the parent.
- The dup system call duplicates an existing file descriptor, returning a new one that refers to the same underlying I/O object.
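
A minimal POSIX C sketch (not xv6 source; “out.txt” is an invented name) of how shell-style redirection falls out of these two rules:

```c
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    close(1);                                    /* free descriptor 1 */
    open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);  /* becomes fd 1 */

    close(2);                                    /* free descriptor 2 */
    dup(1);                 /* fd 2 now shares the file and offset of fd 1 */

    write(1, "to stdout\n", 10);
    write(2, "to stderr\n", 10);  /* both lines land in out.txt, in order */
    return 0;
}
```
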
- File descriptors are a powerful abstraction, because they hide the details of what they are connected to: a process writing to file descriptor 1 may be writing to a file, to a device like the console, or to a pipe.
- A pipe is a small kernel buffer exposed to processes as a pair of file descriptors, one for reading and one for writing. Writing data to one end of the pipe makes that data available for reading from the other end of the pipe.
- Pipes provide a way for processes to communicate.
- Pipes allow for synchronization: two processes can use a pair of pipes to send messages back and forth to each other, with each read blocking its calling process until the other process has sent data with write.
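
A small POSIX C sketch of two processes communicating over a pipe, in the spirit of these notes:

```c
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fds[2];
    pipe(fds);                   /* fds[0] = read end, fds[1] = write end */

    if (fork() == 0) {           /* child: writer */
        close(fds[0]);
        write(fds[1], "ping\n", 5);
        close(fds[1]);
        _exit(0);
    }

    /* parent: reader; read blocks until the child has written */
    close(fds[1]);
    char buf[16];
    int n = read(fds[0], buf, sizeof buf);
    write(1, buf, n);
    close(fds[0]);
    wait(0);
    return 0;
}
```
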
- A file's name is distinct from the file itself; the same underlying file, called an inode, can have multiple names, called links.
- The link system call creates another file system name referring to the same inode as an existing file.
- Each inode is identified by a unique inode number.
- Unix's combination of the "standard" file descriptors, pipes, and convenient shell syntax for operations on them was a major advance in writing general-purpose reusable programs.
- The authors of Unix went on to build Plan 9, which applied the "resources are files" concept to modern facilities, representing networks, graphics, and other resources as files or file trees.
- Any operating system must multiplex processes onto the underlying hardware, isolate processes from each other, and provide mechanisms for controlled inter-process communication.
- The implementation of an operating system must achieve three requirements: multiplexing, isolation, and interaction.
- To achieve strong isolation, a helpful approach is to disallow applications direct access to the hardware resources, and instead to abstract the resources into services.
- In kernel mode, the processor is allowed to execute privileged instructions.
- The software running in kernel space (or in kernel mode) is called the kernel.
- A key design question for an operating system is what part of the operating system should run in kernel mode.
- The unit of isolation in xv6 is a process.
- A process is an abstraction that provides the illusion to a program that it has its own abstract machine.
- The x86 page table translates (or "maps") a virtual address (the address that an x86 instruction manipulates) to a physical address (an address that the processor chip sends to main memory).
- The xv6 kernel maintains many pieces of state for each process, which it gathers into a struct proc. A process's most important pieces of kernel state are its page table, its kernel stack, and its run state.
- Each process has two stacks: a user stack and a kernel stack.
- When a process makes a system call, the processor switches to the kernel stack, raises the hardware privilege level, and starts executing the kernel instructions that implement the system call. When the system call completes, the kernel returns to user space: the hardware lowers its privilege level, switches back to the user stack, and resumes executing user instructions just after the system call instruction.
- Page tables are the mechanism through which the operating system controls what memory addresses mean.
- An x86 page table is logically an array of 2^20 page table entries (PTEs). Each PTE contains a 20-bit physical page number (PPN) and some flags. The paging hardware translates a virtual address by using its top 20 bits with the PPN in the PTE. The paging hardware copies the low 12 bits unchanged from the virtual to the translated physical address.
- A page table is stored in physical memory as a two-level tree. [...] This two-level structure allows a page table to omit entire page table pages in the common case in which large ranges of virtual addresses have no mappings.
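
A sketch of the address-splitting arithmetic described above (plain C macros; the example address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

/* A 32-bit x86 virtual address splits into 10 bits of page-directory
 * index, 10 bits of page-table index, and a 12-bit offset that the
 * paging hardware copies through unchanged. */
#define PDX(va)    (((uint32_t)(va) >> 22) & 0x3FF)   /* top 10 bits    */
#define PTX(va)    (((uint32_t)(va) >> 12) & 0x3FF)   /* middle 10 bits */
#define OFFSET(va) ((uint32_t)(va) & 0xFFF)           /* low 12 bits    */

int main(void)
{
    uint32_t va = 0x00802ABC;    /* arbitrary example address */
    printf("pdx=%u ptx=%u offset=0x%03X\n", PDX(va), PTX(va), OFFSET(va));
    /* the physical address is (PPN from the PTE << 12) | OFFSET(va) */
    return 0;
}
```
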
- Physical memory refers to storage cells in DRAM. A byte of physical memory has an address, called a physical address. Instructions use only virtual addresses, which the paging hardware translates to physical addresses, and then sends to the DRAM hardware to read or write storage.
- Each process has a separate page table, and xv6 tells the page table hardware to switch page tables when xv6 switches between processes.
- Different processes' page tables translate user addresses to different pages of physical memory, so that each process has private user memory.
- To guard a stack growing off the stack page, xv6 places a guard page right below the stack. The guard page is not mapped and so if the stack runs off the stack page, the hardware will generate an exception because it cannot translate the faulting address.
- Exec is the system call that creates the user part of an address space. It initializes the user part of an address space from a file stored in the file system.
- If the ELF header has the right magic number, exec assumes that the binary is well-formed.
- Small pages make sense when physical memory is small, to allow allocation and page-out to disk with fine granularity.
- Larger pages make sense on machines with lots of RAM, and may reduce overhead for page-table manipulation.
- Today people care more about speed than space-efficiency.
- When running a process, a CPU executes the normal processor loop: read an instruction, advance the program counter, execute the instruction, repeat.
- The term exception refers to an illegal program action that generates an interrupt.
- The term interrupt refers to a signal generated by a hardware device, indicating that it needs attention of the operating system.
- The kernel handles all interrupts, rather than processes handling them, because in most cases only the kernel has the required privilege and state.
- An interrupt stops the normal processor loop and starts executing a new sequence called an interrupt handler.
- It is important to remember that traps are caused by the current process running on a processor, and interrupts are caused by devices and may not be related to the currently running process.
- The x86 has 4 protection levels, numbered 0 (most privileged) to 3 (least privileged). In practice, most operating systems use only 2 levels: 0 and 3, which are then called kernel mode and user mode, respectively.
- The current privilege level with which the x86 executes instructions is stored in the %cs register, in the field CPL.
- On the x86, interrupt handlers are defined in the interrupt descriptor table (IDT). The IDT has 256 entries, each giving the %cs and %eip to be used when handling the corresponding interrupt.
- To make a system call on the x86, a program invokes the int n instruction, where n specifies the index into the IDT.
- An operating system can use the iret instruction to return from an int instruction. It pops the saved values during the int instruction from the stack, and resumes execution at the saved %eip.
- System calls conventionally return negative numbers to indicate errors, positive numbers for success.
- Interrupts are similar to system calls, except devices generate them at any time.
- A processor can control if it wants to receive interrupts through the IF flag in the eflags register. The instruction cli disables interrupts on the processor by clearing IF, and sti enables interrupts on a processor.
- xv6 disables interrupts during booting of the main cpu and the other processors.
- A driver is the piece of code in an operating system that manages a particular device: it provides interrupt handlers for a device, causes a device to perform operations, causes a device to generate interrupts, etc.
- Driver code can be tricky to write because a driver executes concurrently with the device that it manages.
- In many operating systems, the drivers together account for more code in the operating system than the core kernel.
- Typically devices are slower than the CPU, so the hardware uses interrupts to notify the operating system of status changes.
- Using DMA means that the CPU is not involved at all in the [data] transfer, which can be more efficient and is less taxing for the CPU's memory caches.
- All modern devices are programmed using memory-mapped I/O.
- A lock provides mutual exclusion, ensuring that only one CPU at a time can hold the lock.
- In one atomic operation, xchg swaps a word in memory with the contents of a register.
- Interrupts can cause concurrency even on a single processor: if interrupts are enabled, kernel code can be stopped at any moment to run an interrupt handler instead.
- It is possible to implement locks without atomic instructions, but it is expensive, and most operating systems use atomic instructions.
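
A sketch of an xchg-style spin lock using the GCC/Clang atomic builtins rather than inline x86 assembly (xv6’s real acquire also disables interrupts, which is omitted here):

```c
/* Spin lock built on an atomic exchange, in the spirit of xv6's
 * acquire/release. */
typedef struct {
    volatile unsigned locked;    /* 0 = free, 1 = held */
} spinlock;

void acquire(spinlock *lk)
{
    /* atomically swap 1 into lk->locked; loop until the old value was 0 */
    while (__sync_lock_test_and_set(&lk->locked, 1) != 0)
        ;                        /* spin */
}

void release(spinlock *lk)
{
    __sync_lock_release(&lk->locked);   /* atomically store 0 */
}
```
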
- Any operating system is likely to run with more processes than the computer has processors, and so a plan is needed to time-share the processors among the processes.
- Switching from one thread to another involves saving the old thread's CPU registers, and restoring previously-saved registers of the new thread; the fact that %esp and %eip are saved and restored means that the CPU will switch stacks and switch what code it is executing.
- Each pipe is represented by a struct pipe, which contains a lock and a data buffer.
- The xv6 scheduler implements a simple scheduling policy, which runs each process in turn. This policy is called round robin.
- A semaphore is an integer value with two operations, increment and decrement (or up and down). It is always possible to increment a semaphore, but the semaphore value is not allowed to drop below zero.
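
A sketch of such a semaphore built from a pthreads mutex and condition variable (not xv6’s own sleep/wakeup machinery):

```c
#include <pthread.h>

/* Counting semaphore: up always succeeds, down waits while count is 0.
 * Fields must be initialized first (e.g. PTHREAD_MUTEX_INITIALIZER,
 * PTHREAD_COND_INITIALIZER, and the starting count). */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonzero;
    int             count;       /* never allowed to drop below zero */
} semaphore;

void sem_up(semaphore *s)        /* increment: always possible */
{
    pthread_mutex_lock(&s->lock);
    s->count++;
    pthread_cond_signal(&s->nonzero);
    pthread_mutex_unlock(&s->lock);
}

void sem_down(semaphore *s)      /* decrement: blocks while count is zero */
{
    pthread_mutex_lock(&s->lock);
    while (s->count == 0)
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->count--;
    pthread_mutex_unlock(&s->lock);
}
```
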
- The purpose of a file system is to organize and store data.
- One of the most interesting problems in file system design is crash recovery. The problem arises because many file system operations involve multiple writes to the disk, and a crash after a subset of the writes may leave the on-disk file system in an inconsistent state.
- One of the cool aspects of the Unix interface is that most resources in Unix are represented as a file, including devices such as the console, pipes, and of course, real files. The file descriptor layer is the layer that achieves this uniformity.
- All the open files in the system are kept in a global file table, the ftable.
- From our point of view, we can abstract the PC into three components: CPU, memory, and input/output (I/O) devices. The CPU performs computation, the memory contains instructions and data for that computation, and devices allow the CPU to interact with hardware for storage, communication, and other functions.
- A computer's CPU runs a conceptually simple loop: it consults an address in a register called the program counter, reads a machine instruction from that address in memory, advances the program counter past the instruction, and executes the instruction. Repeat.
- A register is a storage cell inside the processor itself, capable of holding a machine word-sized value.
- The modern x86 provides eight general purpose 32-bit registers: eax, ebx, ecx, edx, edi, esi, ebp, and esp--and a program counter eip.
- Registers are fast but expensive.
- Main memory is 10-100x slower than a register, but it is much cheaper, so there can be more of it. One reason main memory is relatively slow is that it is physically separate from the processor chip.
- The x86 processor provides special in and out instructions that read and write values from device addresses called I/O ports.