
20191210

Stack Computers by Philip Koopman


  • Stack machines offer processor complexity that is much lower than that of CISC machines, and overall system complexity that is lower than that of either RISC or CISC machines.
  • LIFO stacks, also known as “push down” stacks, are the conceptually simplest way of saving information in a temporary storage location for such common computer operations as mathematical expression evaluation and recursive subroutine calling.
  • LIFO stacks may be programmed into conventional computers in a number of ways. The most straightforward way is to allocate an array in memory, and keep a variable with the array index number of the topmost active element.
  • “Pushing” a stack element refers to the act of allocating a new word on the stack and placing data into it. “Popping” the stack refers to the action of removing the top element from the stack and then returning the data value removed to the routine requesting the pop.
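One way to picture the array-based approach is a hedged C sketch of a software LIFO stack: a fixed-size array plus an index to the topmost active element. The names, sizes, and error handling below are illustrative only, not taken from the book.

    #include <stdio.h>

    #define STACK_DEPTH 64

    static int stack[STACK_DEPTH];
    static int top = -1;                 /* index of the topmost active element; -1 = empty */

    /* "Push": allocate a new word on the stack and place data into it. */
    static void push(int value) {
        if (top + 1 >= STACK_DEPTH) { fprintf(stderr, "stack overflow\n"); return; }
        stack[++top] = value;
    }

    /* "Pop": remove the top element and return its value to the requesting routine. */
    static int pop(void) {
        if (top < 0) { fprintf(stderr, "stack underflow\n"); return 0; }
        return stack[top--];
    }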
  • A very important property of stacks is that, in their purest form, they only allow access to the top element in the data structure.
  • Stacks make excellent mechanisms for temporary storage of information within procedures. A primary reason for this is that they allow recursive invocations of a procedure without risk of destroying data from previous invocations of that routine.
  • Hardware implementation of stacks has the obvious advantage that it can be much faster than software management.
  • Stack machines are easier to write compilers for, since they have fewer exceptional cases to complicate a compiler.
  • Stack machines are also simpler than other machines, and provide very good computational power using little hardware.
  • A particularly favorable application area for stack machines is in real time embedded control applications, which require a combination of small size, high processing speed, and excellent support for interrupt handling that only stack machines can provide.
  • Both hardware and software stacks have been used to support four major computing requirements: expression evaluation, subroutine return address storage, dynamically allocated local variable storage, and subroutine parameter passing.
  • The use of an expression evaluation stack is so basic to the evaluation of expressions that even register-based machine compilers often allocate registers as if they formed an expression evaluation stack.
  • The solution to the recursion problem is to use a stack for storing the subroutine return address. As each subroutine is called the machine saves the return address of the calling program on a stack.
  • Modern machines usually have some sort of hardware support for a return address stack. In conventional machines, this support is often a stack pointer register and instructions for performing subroutine calls and subroutine returns.
  • Whenever a subroutine is called it must usually be given a set of parameters upon which to act.
  • Real machines combine the various stack types. It is common in register-based machines to see the local variable stack, parameter stack, and return address stack combined into a single stack of activation records, or “frames”.
  • Many of the designs for these stack computers have their roots in the Forth programming language. This is because Forth forms both a high level and assembly language for a stack machine that has two hardware stacks: one for expression evaluation/parameter passing, and one for return addresses.
  • An interesting point to note is that, although some of these machines are designed primarily to run Forth, they are also good at running conventional languages.
  • Stack machines have small program size, low system complexity, high system performance, and good performance consistency under varying conditions.
  • Stack machines run conventional languages reasonably well, and do so using less hardware for a given level of performance than register-based machines.
  • Stack machines are very good at running the Forth language, which is known for rapid program development because of its interactivity and flexibility. Forth is also known for producing compact code that is very well suited to real time control problems.
  • A very good application area for stack machines is embedded real time control.
  • In general, a single stack leads to simple hardware, but at the expense of intermingling data parameters with return address information.
  • An advantage of having a single stack is that it is easier for an operating system to manage only one block of variable sized memory per process.
  • A disadvantage of a single stack is that parameter and return address information are forced to become mutually well nested.
  • Multiple stacks allow separating control flow information from data operands.
  • An important advantage of having multiple stacks is one of speed. Multiple stacks allow access to multiple values within a clock cycle.
  • The amount of dedicated memory used to buffer stack elements is a crucial performance issue.
  • To be competitive in speed, a stack machine must have at least one or two stack elements buffered inside the processor.
  • A small stack buffer with primary stacks residing in program memory allows quick switching between stacks for different tasks since the stack elements are predominantly memory resident at all times.
  • The fact that a small dedicated stack buffer is simple to implement and easy to manage makes it very popular. In particular, the fact that most stack elements reside in main memory makes managing pointers, strings, and other data structures quite easy. The disadvantage of this approach is that significant main memory bandwidth is consumed to read and write stack elements.
  • An advantage of a large stack buffer is that program memory cycles are not consumed while accessing data elements and subroutine return addresses. This can lead to significant speedups, particularly in subroutine intensive environments.
  • The number of addressing modes has a tremendous effect on how the stacks are constructed and how the stacks can be used by programs.
  • 0-operand instructions do not allow any operands to be associated with the opcode. All operations are implicitly specified to be on the top stack element(s). This kind of addressing is often called “pure” stack addressing.
  • A disadvantage to the 0-operand addressing mode is that complex addressing modes for data structure accessing may take several instructions to synthesize.
  • A machine with 1-operand instruction format usually performs operations on the specified operand and uses the top stack element as the implicit second operand.
  • 2-operand instruction formats allow each instruction to specify both a source and destination.
  • 2-operand machines offer maximum flexibility, but require more complicated hardware to perform efficiently.
  • Forth is an unconventional programming language which uses a two-stack model of computation and strongly encourages frequent calls to many small procedures.
  • Stack machines support small program sizes by encouraging frequent use of subroutines to reduce code size, and by the fact that stack machines can have short instructions. Small program sizes reduce memory costs, component count, and power requirements, and can improve system speed by allowing the cost effective use of smaller, higher speed memory chips.
  • Decreased system complexity decreases system development time and chip size. This decreased chip size leaves more room on-chip for semi-custom features and program memory.
  • System performance includes not only raw execution speed, but also total system cost and system adaptability when used in real world applications.
  • For simplicity, the Canonical Stack Machine has a single bus connecting the system resources.
  • The data stack is a memory with an internal mechanism to implement a LIFO stack. A common implementation for this might be a conventional memory with an up/down counter used for address generation.
  • The data stack allows two operations: push and pop. The push operation allocates a new cell on the top of the stack and writes the value on the data bus into that cell. The pop operation places the value in the top cell of the stack onto the data bus, then deletes the cell, exposing the next cell on the stack for the next processor operation.
  • The return stack is a LIFO stack implemented in an identical manner to the data stack. The only difference is that the return stack is used to store subroutine return addresses instead of instruction operands.
  • The ALU functional block performs arithmetic and logical computations on pairs of data elements.
  • The ALU supports the standard primitive operations needed by any computer.
  • The program counter holds the address of the next instruction to be executed. The PC may be loaded from the bus to implement branches, or may be incremented to fetch the next sequential instruction from program memory.
  • The program memory block has both a Memory Address Register (MAR) and a reasonable amount of random access memory.
  • An important point to note is that Forth notation often makes extensive use of special characters.
  • Stack machines execute data manipulation operations using postfix notation.
  • Postfix notation is distinguished by the fact that the operands come before the operator.
  • In postfix notation, the operator acts upon the most recently seen operands, and uses an implied stack for evaluation.
  • Postfix notation has an economy of expression when compared to infix notation in that neither operator precedence nor parentheses are necessary. It is much better suited to the needs of computers.
  • The Canonical Stack Machine described in the preceding section is designed to execute postfix operations directly without burdening the compiler with register allocation bookkeeping.
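To make the implied evaluation stack concrete, here is a hedged C sketch that evaluates (2 + 3) * 4, written in postfix as 2 3 + 4 *. Each operand is pushed as it is seen, and each operator pops its operands and pushes the result; the stack layout in the comments is illustrative only.

    #include <stdio.h>

    int main(void) {
        int stack[16], sp = 0;              /* evaluation stack and its pointer */

        stack[sp++] = 2;                    /* 2   -> stack: 2      */
        stack[sp++] = 3;                    /* 3   -> stack: 2 3    */
        { int b = stack[--sp], a = stack[--sp];
          stack[sp++] = a + b; }            /* +   -> stack: 5      */
        stack[sp++] = 4;                    /* 4   -> stack: 5 4    */
        { int b = stack[--sp], a = stack[--sp];
          stack[sp++] = a * b; }            /* *   -> stack: 20     */

        printf("%d\n", stack[sp - 1]);      /* prints 20 */
        return 0;
    }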
  • The instruction >R and its complement R> allow shuffling data elements between the data and return stacks. This technique is used to access stack elements buried more than two elements deep on the stacks by temporarily placing the topmost data stack elements on the return stack.
  • Even though all arithmetic and logical operations are performed on data elements on the stack, there must be some way of loading information onto the stack before operations are performed, and storing information from the stack into memory. The Canonical Stack Machine uses a simple load/store architecture, so has only a single load instruction ‘@’ and a single store instruction ‘!’.
  • Somehow there must be a way to get a constant value onto the stack. The instruction to do this is called the literal instruction, which is often called a load-immediate instruction in register-based architectures. The literal instruction uses two consecutive instruction words: one for the actual instruction, and one to hold a literal value to be pushed onto the stack.
  • The Program Counter is the register that holds a pointer to the next instruction to be executed. After fetching an instruction, the program counter is automatically incremented to point to the next word in memory. In the case of a branch or subroutine call instruction, the program counter is loaded with a new value to accomplish the branch.
  • In order to be able to make decisions, the machine must have available some method for conditional branching. The Canonical Stack Machine uses the simplest method possible. A conditional branch may be performed based on whether the top stack element is equal to 0. This approach eliminates the need for condition codes, yet allows implementation of all control flow structures.
  • Since there is a dedicated return address stack, subroutine calls simply require pushing the current program counter value onto the stack, then loading the program counter with a new value.
  • Subroutine returns are accomplished by simply popping the return address from the top of the return address stack and placing the address in the program counter.
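The load/store, literal, branch, call, and return instructions described above can be pulled together into a toy interpreter. The following C sketch only illustrates the idea; the opcode names, encodings, and memory sizes are invented here and are not the Canonical Stack Machine's actual instruction set.

    #include <stdio.h>

    /* Illustrative opcodes only. */
    enum { LIT, FETCH, STORE, ADD, ZBRANCH, CALL, RET, HALT };

    int mem[256];                /* program memory, also used for data */
    int ds[32], dsp = 0;         /* data stack and its pointer         */
    int rs[32], rsp = 0;         /* return stack and its pointer       */

    void run(int pc) {
        for (;;) {
            int op = mem[pc++];                                  /* fetch, then increment PC */
            switch (op) {
            case LIT:   ds[dsp++] = mem[pc++];                            break; /* literal: next word is the value */
            case FETCH: { int a = ds[--dsp]; ds[dsp++] = mem[a]; }        break; /* '@' load  */
            case STORE: { int a = ds[--dsp], v = ds[--dsp]; mem[a] = v; } break; /* '!' store */
            case ADD:   { int b = ds[--dsp]; ds[dsp - 1] += b; }          break;
            case ZBRANCH: { int t = mem[pc++];                                   /* branch taken if top == 0 */
                            if (ds[--dsp] == 0) pc = t; }                 break;
            case CALL:  { int t = mem[pc++]; rs[rsp++] = pc; pc = t; }    break; /* push return address */
            case RET:   pc = rs[--rsp];                                   break; /* pop return address  */
            case HALT:  return;
            }
        }
    }

    int main(void) {
        /* Push 2 and 3, call an "add" subroutine at address 10, then halt. */
        int prog[] = { LIT, 2, LIT, 3, CALL, 10, HALT, 0, 0, 0,
                       ADD, RET };
        for (int i = 0; i < 12; i++) mem[i] = prog[i];
        run(0);
        printf("%d\n", ds[dsp - 1]);    /* prints 5 */
        return 0;
    }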
  • An important consideration in real time control applications is how the processor can handle interrupts and task switches.
  • Interrupts are caused either by exceptional events, such as stack overflow, or by requests for I/O service. Both events require quick resolution without disturbing the flow of the current task.
  • I/O servicing is a potentially frequent event that must be handled quickly for real time control tasks.
  • Stack machines treat interrupts as hardware generated subroutine calls.
  • Interrupts are much less expensive on stack machines than on conventional machines for several reasons: registers need not be saved, since the stack allocates working storage automatically; there are no condition code flags to be saved, since the only branch conditions are kept as flags on the data stack; and most stack processors have a short or nonexistent data pipeline, so there is no penalty for saving the pipeline state when an interrupt is received.
  • Because of its roots, Forth stresses efficiency, compactness, and flexible, efficient hardware/software interaction.
  • The Forth Virtual Machine has two stacks: a data stack and a return stack.
  • Forth programs consist of very small subroutines that execute only calls to other subroutines and primitive stack operation instructions.
  • The main characteristic of Forth programs that separates Forth from most other languages is the high frequency of subroutine calls.
  • Good Forth programming style encourages incremental program development and testing with small subroutines.
  • The consensus among Forth programmers is that use of the Forth language reduces development time by a factor of 10 compared to other languages over a wide range of applications.
  • A major reason that Forth has historically been a 16-bit language is that 8 bits is too small for general purpose computations and addressing data structures. While 12 bits was tried in some of the earliest minicomputers, 16 bits seems to be the smallest integer size that is truly useful.
  • 16-bit machines are capable of addressing 64K of memory, which for a stack machine is a rather large program memory.
  • Embedded applications require a small processor with a small amount of program memory to satisfy demanding power, weight, size, and cost considerations.
  • One of the difficult technical challenges that arise when designing a 32-bit stack processor is the management of the stacks.
  • The traditional solution for a growing program size is to employ a hierarchy of memory devices with a series of capacity/cost/access-time tradeoffs.
  • The memory problem comes down to one of supplying a sufficient quantity of memory fast enough to support the processor at a price that can be afforded. This is accomplished by fitting as much of the program as possible into the fastest level of the memory hierarchy.
  • The usual way to manage the fastest level of the memory hierarchy is by using cache memories. Cache memories work on the principle that a small section of a program is likely to be used more than once within a short period of time.
  • Stack machines have much smaller programs than either RISC or CISC machines. Stack machine programs can be 2.5 to 8 times smaller than CISC code.
  • When speaking of the complexity of a computer, two levels are important: processor complexity and system complexity.
  • Processor complexity is the amount of logic (measured in chip area, number of transistors, etc.) in the actual core of the processor that does the computations.
  • System complexity considers the processor embedded in a fully functional system which contains support circuitry, the memory hierarchy, and software.
  • The complexity of CISC machines is partially the result of encoding instructions to keep programs relatively small.
  • Stack machines strive to achieve a balance between processor complexity and system complexity. Stack machine designs realize processor simplicity not by restricting the number of instructions, but rather by limiting the data upon which instructions may operate: all operands are on the top stack elements.
  • Stack machines are extraordinarily simple: 16-bit stack machines typically use only 20 to 35 thousand transistors for the processor core.
  • Stack machine compilers are also simple, because the instructions are very consistent in format and operand selection. In fact, most compilers for register machines go through a stack-like view of the source program for expression evaluation, then map that information onto a register set.
  • Stack machine compilers have that much less work to do in mapping the stack-like version of the source code into assembly language.
  • Forth compilers are well known to be exceedingly simple and flexible.
  • Stack computer systems are also simple as a whole. Because stack programs are so small, exotic cache control schemes are not required for good performance. Typically the entire program can fit into cache-speed memory chips without the complexity of cache control circuitry.
  • Interrupts are treated as hardware invoked subroutine calls. There is no pipeline to flush or save, so the only thing a stack processor needs to do to process an interrupt is to insert the interrupt response address as a subroutine call into the instruction stream, and push the interrupt mask register onto the stack while masking interrupts (to prevent an infinite recursion of interrupt service calls). Once the interrupt service routine is entered, no registers need be saved, since the new routine can simply push its data onto the top of the stack.
  • The simplest way to solve the stack problem is simply to assume that stack overflow will never happen.
  • Given that stack overflows are allowed to occur on a regular basis, the most conceptually appealing way to deal with the problem is to use a demand fed stack manager that moves single elements on and off the stack as required.
  • To implement this strategy, the stack buffer is set up as a circular buffer with a head and tail pointer. A pointer to memory is also needed to keep track of the top element of the memory-resident portion of the stack. Whenever a stack overflow is encountered, the bottom-most buffer-resident element is copied to memory, freeing a buffer location. Whenever an underflow is encountered, one element from memory is copied into the buffer. This technique has the appeal that the processor never moves a stack element to or from memory unless absolutely necessary, guaranteeing the minimum amount of stack traffic.
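Here is a hedged C sketch of the demand-fed strategy described above, assuming a small on-chip circular buffer backed by a memory-resident stack; the buffer size, variable names, and layout are invented for illustration.

    #define BUF_SIZE 16                  /* on-chip circular buffer capacity     */
    #define MEM_STACK_SIZE 1024          /* memory-resident portion of the stack */

    int buf[BUF_SIZE];                   /* circular buffer                      */
    int head = 0, tail = 0, count = 0;   /* head = next free slot, tail = bottom-most buffered element */
    int memstack[MEM_STACK_SIZE];
    int memtop = 0;                      /* top of the memory-resident portion   */

    void push(int value) {
        if (count == BUF_SIZE) {         /* overflow: spill the bottom-most element to memory */
            memstack[memtop++] = buf[tail];
            tail = (tail + 1) % BUF_SIZE;
            count--;
        }
        buf[head] = value;
        head = (head + 1) % BUF_SIZE;
        count++;
    }

    int pop(void) {
        if (count == 0) {                /* underflow: fill one element from memory */
            tail = (tail - 1 + BUF_SIZE) % BUF_SIZE;
            buf[tail] = memstack[--memtop];
            count++;
        }
        head = (head - 1 + BUF_SIZE) % BUF_SIZE;
        count--;
        return buf[head];
    }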
  • An alternative to the demand-fed strategy is to generate an interrupt on stack overflow and underflow, then use software to manage the stack spill. This approach uses less control hardware than the demand-fed method, but requires a stack buffer that is somewhat bigger to reduce the frequency of the interrupts.
  • There are three components to the performance of processing interrupts.
  • The first component is the amount of time that elapses between the time that an interrupt request is received by the processor and the time that the processor takes action to begin processing the interrupt service routine. This delay is called interrupt latency.
  • The second component of interrupt service performance is interrupt processing time. This is the amount of time that the processor spends actually saving the machine state of the interrupted job and diverting execution to the interrupt service routine.
  • The third component of interrupt service performance is what we shall call state saving overhead. This is the amount of time taken to save machine registers that are not automatically saved by the interrupt processing logic, but which must be saved in order for the interrupt service routine to do its job.
  • A stack machine can implement lightweight threads simply by requiring that each task run a short sequence of instructions when invoked, then relinquish control to the central task manager. This can be called non-preemptive, or cooperative task management.
  • If each task starts and stops its operation with no parameters on the stack, there is no overhead for context switches between tasks.
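A hedged C sketch of the cooperative (non-preemptive) scheme: each task is a function that performs a short burst of work and then returns to a central task manager. The task names and fixed round count are invented for illustration; the point is that a task which starts and ends with nothing of its own on the stack needs no saved context.

    /* Each lightweight task runs briefly, then relinquishes control. */
    typedef void (*task_fn)(void);

    static void poll_sensors(void)   { /* read inputs          */ }
    static void update_display(void) { /* refresh outputs      */ }
    static void log_status(void)     { /* record current state */ }

    static task_fn tasks[] = { poll_sensors, update_display, log_status };

    int main(void) {
        /* Round-robin task manager: invoke each task in turn.  No per-task
           registers or stack contents are saved or restored between calls. */
        for (int round = 0; round < 1000; round++) {
            for (unsigned i = 0; i < sizeof tasks / sizeof tasks[0]; i++) {
                tasks[i]();
            }
        }
        return 0;
    }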
  • The amount of stack memory needed by most programs is typically rather small.
  • Any computer system is worthless without software.
  • Having hardware that effectively supports software requirements is of the utmost importance.
  • The use of a large number of small procedures when writing a program reduces the complexity of each piece that must be written, tested, debugged, and understood by the programmer.
  • Lower software complexity implies lower development and maintenance costs, as well as better reliability.
  • The currently accepted best practice in software design involves structured programming using modular designs.
  • On a large scale, the use of modules is essential for partitioning tasks among members of programming teams.
  • On a smaller scale, modules control complexity by limiting the amount of information that a programmer must deal with at any given time.
  • Conventional computers are still optimized for executing programs made up of streams of serial instructions.
  • If a procedure contains more than seven distinct operations it should be broken apart by chunking related portions into subordinate procedures to reduce the complexity of each portion of the program.
  • Today’s hardware and programming environments unnecessarily restrict the usage of modularity, and therefore unnecessarily increase the cost of providing computer-based solutions to problems.
  • The language used should reflect the entire system being developed, including the system operating environment, the suitability of the language to solve the problem at hand, development time and costs, the maintainability of the finished product, the strengths of the underlying processor at running various languages, and the previous programming experience of the programmers assigned to the project.
  • Forth is the most obvious language to consider using on a stack machine. That is because the Forth language is based upon a set of primitives that execute on a virtual stack machine architecture.
  • One of the characteristics of Forth is its very high use of subroutine calls. This promotes an unprecedented level of modularity, with approximately 10 instructions per procedure being the norm.
  • One of the advantages of the Forth programming language is that it covers the full spectrum of language levels.
  • Forth tends to act as a programmer amplifier.
  • Forth is best used on medium-sized programming projects involving no more than two or three programmers who have compatible programming styles.
  • Probably the most common reason for using a conventional language will be the existence of a large body of existing source code that must be ported onto a better processor.
  • Generation of stack-based code for expression evaluation is relatively straightforward.
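As an illustration of why this is straightforward, here is a hedged C sketch of a compiler fragment: a recursive walk of an expression tree that emits operand code first and the operator last, which yields postfix (stack) code directly. The node layout and the textual "LIT"/operator output are invented for this example.

    #include <stdio.h>

    /* Toy expression tree node: a constant leaf (op == 0) or an operator node. */
    typedef struct Node {
        char op;                     /* '+', '*', ... or 0 for a constant leaf */
        int value;                   /* used when op == 0                      */
        struct Node *left, *right;
    } Node;

    static void emit(const Node *n) {
        if (n->op == 0) {
            printf("LIT %d\n", n->value);   /* push the constant            */
        } else {
            emit(n->left);                  /* code for the left operand    */
            emit(n->right);                 /* code for the right operand   */
            printf("%c\n", n->op);          /* operator acts on the top two */
        }
    }

    int main(void) {
        /* (2 + 3) * 4 */
        Node two = {0, 2, 0, 0}, three = {0, 3, 0, 0}, four = {0, 4, 0, 0};
        Node sum  = {'+', 0, &two, &three};
        Node prod = {'*', 0, &sum, &four};
        emit(&prod);                        /* prints: LIT 2, LIT 3, +, LIT 4, * */
        return 0;
    }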
  • Stack machines are well suited to LISP programming as well as to expert systems. This is because LISP and Forth are very similar languages in many respects.
  • The major difference is that LISP involves dynamic storage allocation for its cells, while Forth uses a static storage allocation.
  • Functional programming languages offer the promise of a new way of solving problems using a different model of computation than that used by conventional computers.
  • One of the side effects of using a functional programming language is that a high degree of parallelism is available during program execution.
  • This raises the idea of a massively parallel computer made of stack processors that is programmed with a functional programming language.
  • A key conceptual feature of stack machines is their uniformity of interface between high level code and machine instructions.
  • Any system in which a high speed processor with low system complexity is needed is a good candidate for using a stack processor.
  • Real time embedded control processors are computers that are built into pieces of (usually) complicated equipment such as cars, airplanes, computer peripherals, audio electronics, and military vehicles/weapons. The processor is embedded because it is built into a piece of equipment that is not itself considered a computer.
  • Most embedded systems place severe constraints on the processor in terms of requirements for size, weight, cost, power, reliability and operating environment.
  • Real-time events are typically external stimuli to the system which require a response within a matter of microseconds or milliseconds.
  • A processor that has a large number of pins takes up precious printed circuit board area.
  • Microcoded machines are more flexible than hardwired machines.
  • Preliminary research shows that stack machines can execute functional programming languages very efficiently. Programs written in these languages have a great deal of inherent parallelism, which may be exploited by a multiprocessor stack machine system.
  • Conventional languages can be implemented very easily on stack machines.
  • The use of virtual memory and memory protection schemes are concepts that have not yet been widely incorporated into existing stack machines.
  • Protection is the capability for the hardware to prevent a program from accessing another program’s memory except under very carefully controlled conditions.
  • Memory bandwidth is the amount of information that can be transferred to and from memory per unit time.
  • The usual way of implementing an IF statement is by using a conditional branch that is taken if the condition tested by the IF is false.
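A hedged C sketch of that compilation pattern: emit a branch-if-zero with a placeholder target, compile the body, then backpatch the target to point just past the body. The opcode values and the code buffer are invented for illustration.

    #include <stdio.h>

    /* Illustrative opcodes: ZBRANCH jumps to its operand when the top of the
       stack is zero, i.e. when the IF condition is false. */
    enum { LIT, ZBRANCH, DO_BODY, HALT };

    int code[64];
    int here = 0;                        /* next free slot in the code area  */

    int main(void) {
        /* <condition code already compiled: leaves a flag on the stack>     */
        code[here++] = ZBRANCH;          /* branch taken if the flag is 0    */
        int patch = here++;              /* remember the slot for the target */

        code[here++] = DO_BODY;          /* ...code for the IF body...       */

        code[patch] = here;              /* backpatch: skip past the body    */
                                         /* when the condition is false      */
        code[here++] = HALT;

        for (int i = 0; i < here; i++) printf("%d ", code[i]);
        printf("\n");                    /* prints: 1 3 2 3                  */
        return 0;
    }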
  • Stack machines do not seem to be best suited as primary CPUs for the workstation and minicomputer markets.
  • The Forth language is based on an extensible, interactive compiler that creates code for a virtual stack machine.
