
20181206

Compiler Construction by Niklaus Wirth

  • Compilers convert program texts into internal code. Hence they constitute the bridge between software and hardware.
  • Genuine understanding of a subject is best acquired from an in-depth involvement with both concepts and details.
  • We consider syntax analysis as a means to an end, but not as the ultimate goal.
  • Computer programs are formulated in a programming language and specify classes of computing processes.
  • The construction of the first compiler for the language Fortran around 1956 was a daring enterprise, whose success was not at all assured.
  • The translation process essentially consists of the following parts:
    1. The sequence of characters of a source text is translated into a corresponding sequence of symbols of the vocabulary of the language.
    2. The sequence of symbols is transformed into a representation that directly mirrors the syntactic structure of the source text and lets this structure easily be recognized.
    3. High-level languages are characterized by the fact that objects of programs are classified according to their type. Therefore, in addition to syntactic rules, compatibility rules among types of operators and operands define the language. Hence, verification of whether these compatibility rules are observed by a program is an additional duty of a compiler. This verification is called type checking.
    4. On the basis of the representation resulting from step 2, a sequence of instructions taken from the instruction set of the target computer is generated. This phase is called code generation.
  • A partitioning of the compilation process into as many parts as possible was the predominant technique until about 1980, because until then the available [memory] store was too small to accommodate the entire compiler.
  • Modern computers with their apparently unlimited stores make it feasible to avoid intermediate storage on disk. With it, the complicated process of serializing a data structure for output and reconstructing it on input can be discarded as well.
  • A wise compromise exists in the form of a compiler with two parts, namely a front-end and a back-end. The first part comprises lexical and syntax analyses and type checking, and it generates a tree representing the syntactic structure of the source text. This tree is held in main store and constitutes the interface to the second part which handles code generation.
  • The idea of decoupling source language and target architecture has also led to projects creating several front ends for different languages generating trees for a single back-end.
  • A compiler which generates code for a computer different from the one executing the compiler is called a cross compiler.
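
To make the front-end/back-end interface described above concrete, here is a minimal C sketch of a syntax-tree node held in main store, with a stand-in back end that merely walks the tree. The book's compiler is written in Oberon; all names here are invented for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical node of the syntax tree that the front end hands to the
   back end; the field and type names are assumptions, not Wirth's. */
typedef enum { ND_CONST, ND_ADD, ND_MUL } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;                  /* payload when kind == ND_CONST */
    struct Node *left, *right;  /* operands for binary operators */
} Node;

static Node *node(NodeKind k, int v, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    n->kind = k; n->value = v; n->left = l; n->right = r;
    return n;
}

/* Stand-in for the back end: walk the tree and "generate code"
   (here we just evaluate, to keep the sketch self-contained). */
static int walk(Node *n) {
    switch (n->kind) {
    case ND_CONST: return n->value;
    case ND_ADD:   return walk(n->left) + walk(n->right);
    default:       return walk(n->left) * walk(n->right);
    }
}

int main(void) {
    /* tree for 2 * (3 + 4), as a front end might build it */
    Node *t = node(ND_MUL, 0,
                   node(ND_CONST, 2, NULL, NULL),
                   node(ND_ADD, 0, node(ND_CONST, 3, NULL, NULL),
                                   node(ND_CONST, 4, NULL, NULL)));
    printf("%d\n", walk(t));    /* prints 14 */
    return 0;
}
```
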
  • Every language displays a structure called its grammar or syntax.
  • A language is defined by the following:
    • The set of terminal symbols. These are the symbols that occur in its sentences.
    • The set of non-terminal symbols. They denote classes and can be substituted.
    • The set of syntactic equations. These define the possible substitutions of non-terminal symbols.
    • The start symbol.
  • A language is the set of sequences of terminal symbols which, starting with the start symbol, can be generated by repeated application of syntactic equations, that is, substitutions.
  • The idea of defining languages and their grammar with mathematical precision goes back to Noam Chomsky.
  • The use of the Chomsky formalism is also responsible for the term programming language, because programming languages seemed to exhibit a structure similar to spoken languages.
  • A language is "regular", if its syntax can be expressed by a single EBNF expression.
  • The method of recursive descent is only one of several techniques to realize the top-down parsing principle.
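
As an illustration of recursive descent, here is a minimal C sketch for a classic expression grammar, with one parsing procedure per non-terminal. The book's compiler is written in Oberon; this sketch evaluates while parsing instead of emitting code.

```c
#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

/* Grammar (EBNF):
     expression = term {("+" | "-") term}.
     term       = factor {("*" | "/") factor}.
     factor     = digit | "(" expression ")".          */

static const char *src;        /* current position in the input */

static int expression(void);   /* forward declaration: grammar is recursive */

static int factor(void) {
    if (*src == '(') {
        src++;                          /* consume "(" */
        int v = expression();
        if (*src == ')') src++;         /* consume ")" */
        return v;
    }
    if (isdigit((unsigned char)*src)) return *src++ - '0';
    fprintf(stderr, "syntax error at '%c'\n", *src);
    exit(1);
}

static int term(void) {
    int v = factor();
    while (*src == '*' || *src == '/') {
        char op = *src++;
        int w = factor();
        v = (op == '*') ? v * w : v / w;
    }
    return v;
}

static int expression(void) {
    int v = term();
    while (*src == '+' || *src == '-') {
        char op = *src++;
        int w = term();
        v = (op == '+') ? v + w : v - w;
    }
    return v;
}

int main(void) {
    src = "2*(3+4)-5";
    printf("2*(3+4)-5 = %d\n", expression());   /* prints 9 */
    return 0;
}
```
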
  • Parsing in the bottom-up direction is also called shift-reduce parsing.
  • Bottom-up parsing is in general more intricate and complex than top-down parsing.
  • Ultimately, the basic idea behind every language is that it should serve as a means for communication. This means that partners must use and understand the same language.
  • The scanner has to recognize terminal symbols in the source text.
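
A sketch of the scanner's core loop in C, under the simplifying assumption that only identifiers, numbers, and single-character operators need to be recognized; the token names are invented for illustration.

```c
#include <stdio.h>
#include <ctype.h>

typedef enum { T_IDENT, T_NUMBER, T_OP, T_EOF } Token;

/* Group characters into terminal symbols; the lexeme is copied out. */
static Token getSym(const char **p, char *lexeme) {
    while (**p == ' ' || **p == '\t') (*p)++;        /* skip blanks */
    int n = 0;
    if (**p == '\0') return T_EOF;
    if (isalpha((unsigned char)**p)) {               /* identifier */
        while (isalnum((unsigned char)**p)) lexeme[n++] = *(*p)++;
        lexeme[n] = '\0';
        return T_IDENT;
    }
    if (isdigit((unsigned char)**p)) {               /* number */
        while (isdigit((unsigned char)**p)) lexeme[n++] = *(*p)++;
        lexeme[n] = '\0';
        return T_NUMBER;
    }
    lexeme[0] = *(*p)++; lexeme[1] = '\0';           /* operator etc. */
    return T_OP;
}

int main(void) {
    const char *p = "count := count + 10";
    char lex[64];
    Token t;
    while ((t = getSym(&p, lex)) != T_EOF)
        printf("token %d: %s\n", (int)t, lex);
    return 0;
}
```
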
  • As soon as an unacceptable symbol turns up, the task of the parser is completed, and the process of syntax analysis is terminated.
  • We postulate the following quality criteria for error handling:
    • As many errors as possible must be detected in a single scan through the text.
    • As few additional assumptions as possible about the language are to be made.
    • Error handling features should not slow down the parser appreciably.
    • The parser program should not grow in size significantly.
  • Without doubt, a simple language structure significantly simplifies error diagnostics, or, in other words, a complicated syntax complicates error handling unnecessarily.
  • The essential characteristics of a good compiler, regardless of details, are that (1) no sequence of symbols leads to its crash, and (2) frequently encountered errors are correctly diagnosed and subsequently generate no, or few additional, spurious error messages.
  • Context is represented by a data structure which contains an entry for every declared identifier. This entry associates the identifier with the denoted object and its properties. The data structure is known by the name symbol table.
  • Every declaration results in a new symbol table entry.
  • Every occurrence of an identifier in a statement requires a search of the symbol table in order to determine the attributes (properties) of the object denoted by the identifier.
  • The simplest form of data structure for representing a set of items is the list. Its major disadvantage is a relatively slow search process, because it has to be traversed from its root to the desired element.
  • In order to speed up the search process, the list is often replaced by a tree structure. Its advantage becomes noticeable only with a fairly large number of entries.
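
A minimal linked-list symbol table in C, along the lines described above; field and function names are assumptions for illustration. Every declaration prepends an entry, and every use searches from the head, which is why a tree (or hash table) only pays off with many entries.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Entry {
    char name[32];
    int type;                 /* stand-in for the type attribute */
    struct Entry *next;
} Entry;

static Entry *table = NULL;

/* A declaration results in a new entry at the head of the list. */
static void declare(const char *name, int type) {
    Entry *e = malloc(sizeof *e);
    strncpy(e->name, name, sizeof e->name - 1);
    e->name[sizeof e->name - 1] = '\0';
    e->type = type;
    e->next = table;
    table = e;
}

/* Every use of an identifier requires a search through the list. */
static Entry *lookup(const char *name) {
    for (Entry *e = table; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0) return e;
    return NULL;              /* undeclared identifier */
}

int main(void) {
    declare("x", 1);
    declare("y", 2);
    Entry *e = lookup("x");
    printf("x %s, type %d\n", e ? "found" : "missing", e ? e->type : -1);
    return 0;
}
```
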
  • In languages featuring data types, their consistency checking is one of the most important tasks of a compiler. The checks are based on the type attribute recorded in every symbol table entry.
  • We must determine the format in which data are to be represented at run-time in the [memory] store. The choice inherently depends on the target architecture, although this fact is less apparent because of the similarity of virtually all computers in this respect.
  • Every computer features certain elementary data types together with corresponding instructions, such as integer addition and floating-point addition. These types are invariably scalar types, and they occupy a small number of consecutive memory locations (bytes).
  • The size of an array is its element size multiplied by the number of its elements. The address of an element is the sum of the array's address and the element's index multiplied by the element size.
  • Absolute addresses of variables are usually unknown at the time of compilation. All generated addresses must be considered as relative to a common base address which is given at run-time. The effective address is then the sum of this base address and the address determined by the compiler.
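
A worked example of the address arithmetic from the two notes above, with assumed numbers: the compiler knows the array's offset and the element size, while the base address is supplied at run-time.

```c
#include <stdio.h>

int main(void) {
    unsigned base   = 0x1000;  /* assumed run-time base address */
    unsigned offset = 24;      /* compiler-computed offset of array a */
    unsigned size   = 4;       /* element size in bytes */
    unsigned i      = 5;       /* index of the accessed element */

    /* effective address = base + compiler-determined relative address */
    unsigned addr = base + offset + i * size;
    printf("address of a[%u] = 0x%X\n", i, addr);  /* 0x1000+24+20 = 0x102C */
    return 0;
}
```
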
  • Although bytes can be accessed individually, typically a small number of bytes are transferred from or to memory as a packet, a so-called word.
  • If allocation occurs strictly in sequential order it is possible that a variable may occupy (parts of) several words. But this should be avoided, because otherwise a variable access would involve several memory accesses, resulting in an appreciable slowdown.
  • A simple method of overcoming this problem is to round up (or down) each variable's address to the next multiple of its size. This process is called alignment.
  • The price of alignment is the loss of some bytes in memory, which is quite negligible.
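
A sketch of the usual rounding trick for alignment, assuming variable sizes are powers of two: allocating a 1-byte variable followed by a 4-byte variable leaves a 3-byte gap, so that the word does not straddle a word boundary.

```c
#include <stdio.h>

/* Round an address up to the next multiple of size (a power of two). */
static unsigned align(unsigned addr, unsigned size) {
    return (addr + size - 1) & ~(size - 1);
}

int main(void) {
    unsigned addr = 0;
    unsigned aByte = addr; addr += 1;   /* 1-byte variable at offset 0 */
    addr = align(addr, 4);              /* skip to the next multiple of 4 */
    unsigned aWord = addr; addr += 4;   /* 4-byte variable at offset 4, not 1 */
    printf("byte at %u, word at %u\n", aByte, aWord);
    return 0;
}
```
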
  • Following a longstanding tradition, addresses of variables are assigned negative values, that is, negative offsets to the common base address determined during program execution.
  • The acronym RISC stands for reduced instruction set computer, where "reduced" is to be understood as relative to architectures with large sets of complex instructions.
  • From the viewpoints of the programmer and the compiler designer the computer consists of an arithmetic unit, a control unit and a store.
  • Register instructions operate on registers only and feed data through a shifter and the arithmetic logic unit ALU.
  • Memory instructions fetch and store data in memory.
  • Branch instructions affect the program counter.
  • Our ideal computer would be capable of directly interpreting post-fix notation. Such an ideal computer requires a stack for holding intermediate results. Such a computer architecture is called a stack architecture.
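
A toy stack machine interpreting post-fix notation, as a C sketch: operands are pushed, and each operator pops two values and pushes the result, which is exactly the stack discipline for intermediate results described above.

```c
#include <stdio.h>
#include <ctype.h>

int main(void) {
    const char *code = "234+*";   /* post-fix for 2 * (3 + 4) */
    int stack[32], sp = 0;        /* sp points to the next free slot */

    for (const char *p = code; *p; p++) {
        if (isdigit((unsigned char)*p)) {
            stack[sp++] = *p - '0';                     /* push operand */
        } else {
            int b = stack[--sp], a = stack[--sp];       /* pop two operands */
            stack[sp++] = (*p == '+') ? a + b : a * b;  /* push result */
        }
    }
    printf("%s = %d\n", code, stack[0]);                /* prints 14 */
    return 0;
}
```
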
  • Whenever arithmetic expressions are evaluated, the inherent danger of overflow exists. The evaluating system should therefore be suitably guarded.
  • The essence of delayed code generation is that code is not emitted before it is clear that no better solution exists.
  • Conditional and repeated statements are implemented with the aid of jump instructions, also called branch instructions.
  • Procedures, which are also known as subroutines, are perhaps the most important tool for structuring programs. Because of their frequency of occurrence, it is mandatory that their implementation is efficient. Implementation is based on the branch instruction that saves the current PC value, and thereby the point of return; after termination of the procedure, this value is reloaded into the PC register.
  • Algol-60 introduced the very fundamental concept of local variables. It implied that every identifier declared had a limited range of visibility and validity.
  • In concrete terms, variables may be declared local to a procedure such that they are visible and valid within this procedure only.
  • Addresses of local variables generated by the compiler are always relative to the base address of the respective activation frame. Since in programs most variables are local, their addressing also must be highly efficient. This is achieved by reserving a register to hold the base address and by making use of the fact that the effective address is the sum of a register value and the instruction's address field.
  • Parameters constitute the interface between the calling and the called procedures. Parameters on the calling side are said to be actual parameters, and those on the called side formal parameters. The latter are in fact only place holders for which the actual parameters are substituted.
  • Most programming languages distinguish between at least two kinds of parameters. The first is the value parameter where, as its name suggests, the value of the actual parameter is assigned to the formal variable. The second is the reference parameter where, also as suggested by its name, a reference to the actual parameter is assigned to the formal variable.
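
In C terms, the two kinds of parameters can be modeled as follows; since C itself has only value parameters, the reference parameter is simulated with a pointer.

```c
#include <stdio.h>

static void byValue(int x)      { x = 99; }   /* modifies a local copy only */
static void byReference(int *x) { *x = 99; }  /* modifies the caller's variable */

int main(void) {
    int a = 1, b = 1;
    byValue(a);          /* a's value is copied into the formal variable */
    byReference(&b);     /* b's address is passed, so the assignment reaches b */
    printf("a = %d, b = %d\n", a, b);   /* prints a = 1, b = 99 */
    return 0;
}
```
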
  • The fact that only approximate results can be obtained with real numbers may be understood by considering that real numbers are represented by scaled integers with a fixed, finite number of digits.
  • In many computers, instructions for floating-point operands use a special set of registers. The reason behind this is that often separate co-processors, so-called floating-point units (FPUs) are used which implement all floating-point instructions and contain this set of floating-point registers.
  • In order to be complete, a computer's set of instructions must also contain conversion instructions which convert integers into floating-point numbers and vice-versa. The same holds at the level of the programming language.
  • Conversions between integer types are simple and efficient, because they consist of a sign extension only.
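
A small C illustration of sign extension: widening a signed 8-bit value to 32 bits fills the added bits with copies of the sign bit.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int8_t  small = -5;       /* 8-bit two's complement: 0xFB */
    int32_t wide  = small;    /* sign-extended to 0xFFFFFFFB */
    printf("%d -> %d (0x%08X)\n", small, wide, (uint32_t)wide);
    return 0;
}
```
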
  • An open array is an array parameter whose length is unknown (open) at the time of compilation.
  • The two forms of data structures provided in Oberon are the array (all elements of the same type, homogeneous structure) and the record (heterogeneous structure).
  • If every pointer variable is initialized to NIL, it suffices to precede every access via a pointer with a test for the pointer value NIL.
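
A minimal sketch of the NIL test in C, with NULL playing the role of NIL: every pointer is initialized to NIL, and every access via a pointer is preceded by a test that traps instead of crashing.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct Node { int value; struct Node *next; } Node;

static int valueOf(Node *p) {
    if (p == NULL) {                       /* the NIL test before access */
        fprintf(stderr, "NIL pointer dereference\n");
        exit(1);
    }
    return p->value;
}

int main(void) {
    Node *p = NULL;                        /* initialized to NIL */
    printf("%d\n", valueOf(p));            /* traps with a clear message */
    return 0;
}
```
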
  • A variable is no longer relevant when there are no references to it.
  • The principal advantage of separate compilation is that changes in a module M do not invalidate clients of M, if the interface of M remains unaffected.
  • A primary goal of good code optimization is the most effective use of registers in order to reduce the number of accesses to the relatively slow main memory. A good strategy of register usage yields more advantages than any other branch of optimization.
