- Compilers convert program texts into internal code. Hence they constitute the bridge between software and hardware.
- However, from my experience as a teacher, genuine understanding of a subject is best acquired from an in-depth involvement with both concepts and details.
- A partitioning of the compilation process into as many parts as possible was the predominant technique until about 1980, because until then the available store [of memory] was too small to accommodate the entire compiler.
- Modern computers with their apparently unlimited stores make it feasible to avoid intermediate storage on disk. And with it, the complicated process of serializing a data structure for output, and its reconstruction on input can be discarded as well.
- A wise compromise exists in the form of a compiler with two parts, namely a front end and a back end. [...] The main advantage of this solution lies in the independence of the front end of the target computer and its instruction set.
- The idea of decoupling source language and target architecture has also led to projects creating several front ends for different languages generating trees for a single back end.
- A compiler which generates code for a computer different from the one executing the compiler is called a cross compiler. The generated code is then transferred--downloaded--via a data transmission line.
- Every language displays a structure called its grammar or syntax.
- Syntactic equations of the form defined in EBNF generate context-free languages. The term "context-free" is due to Chomsky and stems from the fact that substitution of the symbol left of "=" by a sequence derived from the expression to the right of "=" is always permitted, regardless of the context in which the symbol is embedded within the sequence.
- A language is regular if its syntax can be expressed by a single EBNF expression.
- Sentences of regular languages can be recognized by finite state machines. They are usually described by transition diagrams.
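A transition diagram like the ones mentioned above translates directly into a loop over a table of states. Here is a minimal sketch in Python; the grammar it recognizes, identifiers of the form letter (letter | digit)*, is my own illustrative choice, not an example from the book:

```python
# Finite state machine recognizing identifiers: letter (letter | digit)*.
# State 0 = start, state 1 = at least one letter seen (the accepting state).
def is_identifier(s):
    state = 0
    for ch in s:
        if state == 0 and ch.isalpha():
            state = 1                      # first character must be a letter
        elif state == 1 and (ch.isalpha() or ch.isdigit()):
            state = 1                      # subsequent letters/digits loop here
        else:
            return False                   # no transition: reject
    return state == 1                      # accept only in state 1
```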
- Regular languages are subject to the restriction that no nested structures can be expressed. Nested structures can be expressed with the aid of recursion only.
- The method of recursive descent is only one of several techniques to realize the top-down parsing principle.
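To make the technique concrete, here is a minimal recursive descent sketch in Python for a small expression grammar of my own choosing (expr = term {("+"|"-") term}; term = factor {("*"|"/") factor}; factor = digit | "(" expr ")"). Each EBNF nonterminal becomes one function, and nesting in the grammar becomes recursion among the functions:

```python
def parse(src):
    pos = 0

    def peek():
        return src[pos] if pos < len(src) else None

    def eat(ch):
        nonlocal pos
        assert peek() == ch, f"expected {ch!r} at position {pos}"
        pos += 1

    def factor():                      # factor = digit | "(" expr ")"
        if peek() == "(":
            eat("(")
            v = expr()
            eat(")")
            return v
        v = int(peek())
        eat(peek())
        return v

    def term():                        # term = factor {("*" | "/") factor}
        v = factor()
        while peek() in ("*", "/"):
            op = peek(); eat(op)
            v = v * factor() if op == "*" else v // factor()
        return v

    def expr():                        # expr = term {("+" | "-") term}
        v = term()
        while peek() in ("+", "-"):
            op = peek(); eat(op)
            v = v + term() if op == "+" else v - term()
        return v

    result = expr()
    assert pos == len(src), "trailing input"
    return result
```

Note how operator precedence falls out of the grammar's structure: term binds tighter than expr simply because expr calls term.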
- Parsing in the bottom-up direction is also called shift-reduce parsing. The syntactic constructs are built up and then reduced; the syntax tree grows from the bottom to the top.
- Ultimately, the basic idea behind every language is that it should serve as a means for communication. This means that partners must use and understand the same language.
- Procedures represent functional units of statements. It is therefore appropriate to associate the concept of locality of names with the notion of the procedure.
- The simplest form of data structure for representing a set of items is the list. Its major disadvantage is a relatively slow search process, because it has to be traversed from its root to the desired element.
- In order to speed up the search process, the list is often replaced by a tree structure. Its advantage becomes noticeable only with a fairly large number of entries.
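A minimal sketch of the idea, assuming a plain binary search tree keyed on identifier names (real symbol tables add scope handling on top of this): search cost drops from linear in the number of entries to logarithmic on average.

```python
# Binary search tree for a symbol table: each node maps a name to its info.
class Node:
    def __init__(self, name, info):
        self.name, self.info = name, info
        self.left = self.right = None

def insert(root, name, info):
    if root is None:
        return Node(name, info)
    if name < root.name:
        root.left = insert(root.left, name, info)
    elif name > root.name:
        root.right = insert(root.right, name, info)
    return root                       # duplicate names keep the first entry

def find(root, name):
    while root is not None and root.name != name:
        root = root.left if name < root.name else root.right
    return root                       # None if the name is not declared
```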
- In languages featuring data types, their consistency checking is one of the most important tasks of a compiler. The checks are based on the type attribute recorded in every symbol table entry.
- The size of an array is its element size multiplied by the number of its elements. The address of an element is the sum of the array's address and the element's index multiplied by the element size.
- Absolute addresses of variables are usually unknown at the time of compilation. All generated addresses must be considered as relative to a common base address which is given at run-time. The effective address is then the sum of this base address and the address determined by the compiler.
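The two quotes above reduce to simple arithmetic. A sketch with illustrative numbers of my own (sizes, addresses, and the index are all made up):

```python
# Array size = element size * number of elements.
elem_size = 4                     # bytes per element
n_elems = 10
array_size = elem_size * n_elems  # 40 bytes

# Compiler-determined address, relative to a base known only at run time.
array_rel_addr = 100              # offset assigned by the compiler
base = 0x8000                     # base address supplied at run time
index = 3

# Effective address = base + array offset + index * element size.
effective_addr = base + array_rel_addr + index * elem_size
```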
- Although bytes can be accessed individually, typically a small number of bytes (say 4 or 8) are transferred from or to memory as a packet, a so-called word.
- [To prevent a variable from occupying several different words you should enforce variable alignment.] A simple method of overcoming this problem is to round up (or down) each variable's address to the next multiple of its size. This process is called alignment.
- The price of alignment is the loss of some bytes in memory, which is quite negligible.
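Rounding up to the next multiple of the size is a one-line bit trick when the size is a power of two (the usual case for basic types), since a multiple of 2^k has its low k bits clear:

```python
# Round addr up to the next multiple of size; size must be a power of two.
def align_up(addr, size):
    return (addr + size - 1) & ~(size - 1)
```

For example, a 4-byte variable placed at address 13 would straddle a word boundary; aligning moves it to 16 at the cost of 3 unused bytes.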
- The acronym RISC stands for reduced instruction set computer, where "reduced" is to be understood as relative to architectures with large sets of complex instructions, as these were dominant until about 1980.
- An architecture defines those aspects of a computer that are relevant to the programmer and the compiler designer. A computer consists of an arithmetic unit, a control unit, and a store.
- There are three types of instructions and instruction formats. Register instructions operate on registers only and feed data through the arithmetic-logic unit ALU or through a shifter. Memory instructions move data between registers and memory. Branch instructions affect the program counter.
- There are only two instructions accessing memory, load and store. It is a characteristic of the RISC structure that access to memory is not combined with any other operation. All arithmetic or logical operations are performed on registers.
- Branch instructions are used to break the sequence of instructions.
- Our ideal computer would be capable of directly interpreting postfix notation. Such an ideal computer requires a stack for holding intermediate results. Such a computer architecture is called a stack architecture.
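A sketch of how such an ideal machine would interpret postfix notation (the token format is my own): operands are pushed, and each operator pops its two operands and pushes the result.

```python
# Evaluate a sequence of postfix tokens with an explicit operand stack.
def eval_postfix(tokens):
    stack = []
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    for t in tokens:
        if t in ops:
            b = stack.pop()           # right operand is on top
            a = stack.pop()
            stack.append(ops[t](a, b))
        else:
            stack.append(int(t))      # operand: push its value
    return stack.pop()
```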
- Computers based on a stack architecture are not in common use. Sets of explicitly addressable registers are preferred to a stack.
- The principle of delayed code emission is also used to avoid the emission of arithmetic instructions if the compiler can perform the operation itself. This is the case when both operands are constants. The technique is known as constant folding.
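A sketch of the decision the code generator makes, assuming a simple item representation of my own where operands are tagged as constants or registers: when both operands are constants, the result is itself a constant and no instruction is emitted.

```python
# Constant folding: return a folded constant item, or None if a real
# instruction must be emitted because at least one operand is not constant.
def fold(op, left, right):
    if left[0] == "const" and right[0] == "const":
        ops = {"+": lambda a, b: a + b,
               "-": lambda a, b: a - b,
               "*": lambda a, b: a * b}
        return ("const", ops[op](left[1], right[1]))
    return None
```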
- Type checking is typically performed along with syntactic analysis. But, whenever arithmetic expressions are evaluated, the inherent danger of overflow exists. The evaluating statements should therefore be suitably guarded.
- The essence of delayed code generation is that code is not emitted before it is clear that no better solution exists.
- The principle of delayed code generation is also useful in many other cases, but it becomes indispensable when considering computers with complex addressing modes, for which reasonably efficient code has to be generated by making good use of the available complex modes.
- Conditional and repeated statements are implemented with the aid of branch instructions.
- The destination location of branches is still unknown when the instruction is to be emitted. This problem is solved by adding the location of the branch instruction as an attribute to the item generated. This attribute is used later when the destination of the jump becomes known in order to complete the branch with its true address. This is called a fixup.
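The fixup technique above can be sketched in a few lines; the instruction encoding here is a made-up tuple form, not the book's:

```python
# Code is a list of instructions; a branch is emitted with a dummy
# destination and its location is remembered as an attribute.
code = []

def emit_branch():
    code.append(("BR", 0))        # destination not yet known
    return len(code) - 1          # location of the branch: the fixup attribute

def fixup(loc):
    # Patch the branch at `loc` to jump to the current end of the code,
    # as a PC-relative displacement.
    code[loc] = ("BR", len(code) - loc)
```

In practice several forward branches may share the same destination; they are then chained through the dummy fields and fixed up in one pass along the chain.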
- Procedures, which are also known as subroutines, are perhaps the most important tool for structuring programs. Because of their frequency of occurrence, it is mandatory that their implementation is efficient. Implementation is based on the branch instruction which saves the current PC value, and thereby the point of return; after termination of the procedure this value is reloaded into the PC register.
- Algol-60 introduced the very fundamental concept of local variables. It implied that every identifier declared had a limited range of visibility and validity.
- Addresses of local variables generated by the compiler are always relative to the base address of the respective activation frame. Since in programs most variables are local, their addressing also must be highly efficient. This is achieved by reserving a register to hold the base address, and to make use of the fact that the effective address is the sum of a register value and the instruction's address field. The reserved register is called the frame pointer (FP). This scheme makes it possible to call procedures recursively.
- Parameters constitute the interface between the calling and the called procedures. Parameters on the calling side are said to be actual parameters, and those on the called side formal parameters. The latter are in fact only place holders for which the actual parameters are substituted. Basically, a substitution is an assignment of the actual value to the formal variable.
- Most programming languages distinguish between at least two kinds of parameters. The first is the value parameter where, as its name suggests, the value of the actual parameter is assigned to the formal variable. The second kind of parameter is the reference parameter, where, also as suggested by its name, a reference to the actual parameter is assigned to the formal variable.
- Most programming languages feature certain procedures and functions which do not need to be declared in a program. They are said to be predeclared and they can be called from anywhere, as they are pervasive.
- In many computers, instructions for floating-point operands use a special set of registers. The reason behind this is that often separate co-processors, so-called floating-point units (FPUs) are used which implement all floating-point instructions and contain this set of floating-point registers.
- An open array is an array parameter whose length is unknown (open) at the time of compilation.
- In order to check index bounds when accessing elements of the open array parameters, the length must be known.
- Relationships between components are expressed explicitly by pointers.
- The step from the referencing pointer variable to the referenced record variable is called dereferencing.
- The range of visibility of an identifier in the text is called scope, and it extends over the block in which the identifier is declared.
- The desire to hide certain objects and details is particularly justified if a system consists of various parts whose tasks are relatively well separated, and if the parts themselves are of a certain complexity.
- This encapsulation of details solely responsible for the specified invariants is the true purpose of information hiding and of the module concept.
- The principal advantage of separate compilation is that changes in a module M do not invalidate clients of M, if the interface of M remains unaffected.
- Symbol files thus make it possible to make modules available without giving away the source text.
- Perhaps the best known case among the target independent optimizations is the elimination of common sub-expressions.
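A minimal sketch of the idea over three-address code of my own invention: if an (operator, operands) combination has already been computed into a temporary, later occurrences are replaced by a copy of that temporary. (Real implementations must also track when an operand is reassigned, which this sketch omits.)

```python
# Each entry is (result_name, op, operand_a, operand_b).
def eliminate_cse(triples):
    seen = {}                    # (op, a, b) -> temporary holding the result
    out = []
    for name, op, a, b in triples:
        key = (op, a, b)
        if key in seen:
            out.append((name, "copy", seen[key], None))  # reuse earlier result
        else:
            seen[key] = name
            out.append((name, op, a, b))
    return out
```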
- The dominant theme in the subject of optimization is the use and allocation of processor registers.
- A primary goal of good code optimization is the most effective use of registers in order to reduce the number of accesses to the relatively slow main memory. A good strategy of register usage yields more advantages than any other branch of optimization.
- A widespread technique is register allocation using graph coloring.
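The core of the technique, greatly simplified: variables are nodes of an interference graph, an edge joins two variables that are live at the same time, and a register (color) is chosen that no neighbor already uses. The greedy pass below is my own simplification; full allocators (e.g. Chaitin-style) add a simplify/spill phase when the registers run out.

```python
# interference maps each variable to the set of variables it conflicts with.
def color(interference):
    assignment = {}
    for var in interference:
        used = {assignment[n] for n in interference[var] if n in assignment}
        reg = 0
        while reg in used:        # smallest register not used by a neighbor
            reg += 1
        assignment[var] = reg
    return assignment
```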
20180222
Compiler Construction by Niklaus Wirth