- The truth is that the architecture of most language applications is freakishly similar.
- A good way to learn about language design is to look at lots of different languages.
- A domain-specific language is just that: a computer language designed to make users particularly productive in a specific domain.
- Many algorithms and processes are inherently recursive.
- A language is just a set of valid sentences.
- A reader builds a data structure from one or more input streams. The input streams are usually text but can be binary data as well.
- A generator walks an internal data structure and emits output.
- A translator reads text or binary input and emits output conforming to the same or a different language. It is essentially a combined reader and generator.
- An interpreter reads, decodes, and executes instructions.
- Some languages are tougher to parse than others, and so we need parsers of varying strength. The trade-off is that the stronger parsing patterns are more complicated and sometimes a bit slower.
- Backtracking strength comes at the cost of slower execution.
- Rather than repeatedly parsing the input text in every stage, we’ll construct an IR. The IR is a highly processed version of the input text that’s easy to traverse.
- An abstract-syntax tree (AST) has a node for every important token and uses operators as subtree roots.
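  A minimal homogeneous-AST sketch in Java (class and method names are mine, not the book's):

  ```java
  // A homogeneous AST node: operators become subtree roots, operands
  // become leaves. Everything here is illustrative.
  import java.util.ArrayList;
  import java.util.List;

  class AST {
      String token;                  // the token this node represents, e.g. "+" or "3"
      List<AST> children = new ArrayList<>();

      AST(String token) { this.token = token; }
      void addChild(AST t) { children.add(t); }

      // Print in LISP-like prefix form, e.g. (+ 3 (* 4 5))
      public String toString() {
          if (children.isEmpty()) return token;
          StringBuilder buf = new StringBuilder("(" + token);
          for (AST c : children) buf.append(" ").append(c);
          return buf.append(")").toString();
      }

      public static void main(String[] args) {
          AST plus = new AST("+");   // operator as subtree root
          plus.addChild(new AST("3"));
          AST mul = new AST("*");
          mul.addChild(new AST("4"));
          mul.addChild(new AST("5"));
          plus.addChild(mul);
          System.out.println(plus);  // prints (+ 3 (* 4 5))
      }
  }
  ```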
- There are four common kinds of scoping rules: languages with a single scope, nested scopes, C-style struct scopes, and class scopes.
- Interpreters execute instructions stored in the AST but usually need other data structures too, like a symbol table.
- The more you know about existing language applications, the easier it’ll be to design your own.
- An interpreter is a program that executes other programs. In effect, an interpreter simulates a hardware processor in software, which is why we call them virtual machines. An interpreter’s instruction set is typically pretty low level but higher level than raw machine code. We call the instructions bytecodes because we can represent each instruction with a unique integer code from 0...255 (a byte’s range).
- To execute a program, the interpreter uses a fetch-decode-execute cycle.
- Self-assignment is when we assign a variable to itself.
- All the scary voodoo within a compiler happens inside the semantic analyzer and optimizer.
- Just as a parser is the key to analyzing the syntax, a symbol table is the key to understanding the semantics (meaning) of the input. In a nutshell, syntax tells us what to do, and semantics tells us what to do it to.
- Having an AST lets us sniff the input multiple times without having to reparse it, which would be pretty inefficient.
- The act of recognizing a phrase by computer is called parsing.
- Parse trees are important because they tell us everything we need to know about the syntax (structure) of a phrase.
- A parser checks whether a sentence conforms to the syntax of a language.
- Building recursive-descent parsers in a general-purpose programming language is tedious and error-prone.
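  A feel for the tedium, as a sketch: one method per grammar rule, and every rule hand-checks its own lookahead. The toy grammar and names are invented:

  ```java
  // Hand-written recursive-descent parser for a tiny list grammar:
  //   list : '[' NAME (',' NAME)* ']' ;
  // One method per rule; each matches tokens or calls other rules.
  class ListParser {
      private final String input;    // e.g. "[a,b,c]"
      private int p = 0;             // index of current character (the lookahead)

      ListParser(String input) { this.input = input; }

      private char LA() { return p < input.length() ? input.charAt(p) : (char) -1; }
      private void match(char c) {
          if (LA() == c) p++;
          else throw new RuntimeException("expected " + c + " found " + LA());
      }

      void list() {                  // list : '[' NAME (',' NAME)* ']'
          match('[');
          name();
          while (LA() == ',') { match(','); name(); }
          match(']');
      }

      void name() {                  // NAME : a single letter, for simplicity
          if (Character.isLetter(LA())) p++;
          else throw new RuntimeException("expected name, found " + LA());
      }

      public static void main(String[] args) {
          new ListParser("[a,b,c]").list();   // parses silently on success
          System.out.println("ok");
      }
  }
  ```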
- Tools that translate grammars to parsers are called parser generators.
- Grammars are concise and act like functional specifications for languages.
- Recognizers that feed off character streams are called tokenizers or lexers.
- Just as an overall sentence has structure, the individual tokens have structure. At the character level, we refer to syntax as the lexical structure.
- The more powerful the underlying recognition strategy, the easier it is to write a grammar. That is because more powerful parsing strategies allow a larger set of grammars.
- Lexers derive a stream of tokens from a character stream by recognizing lexical patterns.
- Lexers are also called scanners, lexical analyzers, and tokenizers.
- The goal of the lexer is to emit a sequence of tokens. Each token has two primary attributes: a token type (symbol category) and the text associated with it.
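  A sketch of the idea in Java (the token-type codes and names are invented):

  ```java
  // A token pairs a type (category) with the matched text. This toy lexer
  // recognizes names and integers only.
  import java.util.ArrayList;
  import java.util.List;

  class Token {
      static final int NAME = 1, INT = 2;
      final int type;
      final String text;
      Token(int type, String text) { this.type = type; this.text = text; }
      public String toString() { return "<" + text + "," + type + ">"; }
  }

  class Lexer {
      static List<Token> tokenize(String s) {
          List<Token> tokens = new ArrayList<>();
          int i = 0;
          while (i < s.length()) {
              char c = s.charAt(i);
              if (Character.isWhitespace(c)) { i++; continue; }
              int start = i;
              if (Character.isLetter(c)) {
                  while (i < s.length() && Character.isLetter(s.charAt(i))) i++;
                  tokens.add(new Token(Token.NAME, s.substring(start, i)));
              } else if (Character.isDigit(c)) {
                  while (i < s.length() && Character.isDigit(s.charAt(i))) i++;
                  tokens.add(new Token(Token.INT, s.substring(start, i)));
              } else throw new RuntimeException("bad char: " + c);
          }
          return tokens;
      }

      public static void main(String[] args) {
          System.out.println(tokenize("count 42"));  // [<count,1>, <42,2>]
      }
  }
  ```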
- Having more lookahead is like being able to see farther down multiple paths emanating from a fork in a maze.
- The simplest way to provide lots of parser lookahead is to buffer up all the input tokens.
- A backtracking parser attempts alternatives in order until one of them matches the current input.
- A context-free language is a language whose constructs don’t depend on the presence of other constructs.
- Using exceptions for control flow is usually a very bad idea because they act like gotos.
- The heart of a backtracking parser lies in its token buffer management. The buffer must handle an arbitrary amount of lookahead and support nested mark and release operations. The easiest way to deal with arbitrary lookahead is simply to buffer up the entire token stream.
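  A sketch of such a buffer in Java, with the markers kept on a stack so speculation can nest (all names invented):

  ```java
  // Token buffer for a backtracking parser: the whole token stream is
  // buffered up front; mark/release save and restore the position.
  import java.util.ArrayDeque;
  import java.util.Deque;

  class TokenBuffer {
      private final String[] tokens;            // entire token stream, buffered
      private int p = 0;                        // current token index
      private final Deque<Integer> markers = new ArrayDeque<>();

      TokenBuffer(String[] tokens) { this.tokens = tokens; }

      String LT(int i) {                        // lookahead: token i ahead
          int j = p + i - 1;
          return j < tokens.length ? tokens[j] : "<EOF>";
      }
      void consume() { p++; }

      void mark() { markers.push(p); }          // start speculating
      void release() { p = markers.pop(); }     // backtrack: rewind the input

      public static void main(String[] args) {
          TokenBuffer buf = new TokenBuffer(new String[]{"[", "a", "]"});
          buf.mark();                           // try an alternative...
          buf.consume(); buf.consume();
          buf.release();                        // ...it failed; rewind
          System.out.println(buf.LT(1));        // back at "[" again
      }
  }
  ```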
- Compiler writers often leverage an existing optimizer and code generator by translating multiple languages to an established IR.
- Not all programming language constructs map directly to executable code.
- The best way to create ASTs and to verify their structure is with a formal mechanism.
- A parse tree describes how a parser recognized an input sentence.
- Compiler optimizers try to reduce operations to simpler equivalents in an effort to generate faster code.
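  A classic example is constant folding: reduce an operation over literals to its value at compile time. A toy sketch, not tied to any particular compiler API:

  ```java
  // Toy constant folding: if both operands of '*' are integer literals,
  // replace the operation with its computed value. Names invented.
  class Fold {
      static String fold(String op, String left, String right) {
          if (op.equals("*") && left.matches("\\d+") && right.matches("\\d+")) {
              return String.valueOf(Integer.parseInt(left) * Integer.parseInt(right));
          }
          return "(" + op + " " + left + " " + right + ")"; // can't fold
      }

      public static void main(String[] args) {
          System.out.println(fold("*", "3", "4"));   // 12
          System.out.println(fold("*", "x", "4"));   // (* x 4)
      }
  }
  ```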
- To enforce language semantics, we need to track symbol definitions and be able to identify those symbols later.
- A symbol is just a name for a program entity like a variable or method.
- Language applications track symbols in an abstract data structure called a symbol table.
- Scopes start and end at the start and end of language structures such as functions.
- A scope is a code region with a well-defined boundary that groups symbol definitions (in a dictionary associated with that code region).
- Scope boundaries usually coincide with begin and end tokens such as curly braces. We call this lexical scoping because the extent of a scope is lexically delimited.
- Think of dynamic scoping as allowing methods to see the local variables of invoking methods.
- To avoid ambiguity, programming languages use context to figure out which symbol we’re talking about. Context corresponds to the scope surrounding the symbol and any enclosing scopes.
- To track nested scopes, we push and pop scopes onto a scope stack.
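  One common sketch links each scope to its enclosing scope, so the stack falls out of the nesting and resolution walks from innermost scope outward (names invented):

  ```java
  // Nested scopes: define() adds a symbol here; resolve() looks here,
  // then in each enclosing scope in turn.
  import java.util.HashMap;
  import java.util.Map;

  class Scope {
      final Scope enclosing;                     // next scope up the stack
      final Map<String, String> symbols = new HashMap<>();

      Scope(Scope enclosing) { this.enclosing = enclosing; }

      void define(String name, String type) { symbols.put(name, type); }

      String resolve(String name) {              // look here, then outward
          String t = symbols.get(name);
          if (t != null) return t;
          return enclosing != null ? enclosing.resolve(name) : null;
      }

      public static void main(String[] args) {
          Scope globals = new Scope(null);
          globals.define("x", "int");
          Scope block = new Scope(globals);       // "push" by nesting
          block.define("y", "float");
          System.out.println(block.resolve("x"));   // int, found in enclosing scope
          // "pop" by returning to globals when the block ends:
          System.out.println(globals.resolve("y")); // null: y is out of scope
      }
  }
  ```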
- A forward reference is a reference to a method, type, or variable defined later in the input file.
- The goal of automatic type promotion is to get all operands of a single operation to be of the same type or compatible types.
- A language application can convert between types at will as long as it doesn’t lose information. We call such automatic conversion promotion because we can widen types without problem but can’t narrow them in general.
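  A minimal sketch of promotion driven by a result-type lookup table (the table and type codes here are made up):

  ```java
  // Automatic promotion via a table: widening int -> float is safe,
  // so mixed operands widen to float.
  class Promote {
      static final int INT = 0, FLOAT = 1;
      // resultType[left][right]: the type both operands are widened to
      static final int[][] resultType = {
          /* int   */ { INT,   FLOAT },
          /* float */ { FLOAT, FLOAT },
      };

      public static void main(String[] args) {
          int t = resultType[INT][FLOAT];
          System.out.println(t == FLOAT);  // true: int + float widens to float
      }
  }
  ```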
- Polymorphism means that a pointer can refer to objects of multiple types.
- To execute a program not written in machine code, we’ve got to interpret the program or translate it to the equivalent program in a language that already runs on that machine.
- High-level interpreters directly execute source code instructions or the AST equivalent. Low-level interpreters execute instructions called bytecodes that are close to CPU machine instructions.
- Usually, simplicity and low-cost implementation trump execution efficiency for DSLs.
- Function calls save return addresses so they can return to the instruction following the function call.
- There are three things to consider when building an interpreter: how to store data, how and when to track symbols, and how to execute instructions.
- High-level interpreters store values according to value names, not memory addresses (like low-level interpreters and CPUs do). That means we’ve got to represent memory with a dictionary mapping names to values.
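  In Java, that dictionary can be as simple as a `Map` from names to values (a toy sketch):

  ```java
  // High-level interpreter "memory": variable names map to values,
  // not numeric addresses. Names here are invented.
  import java.util.HashMap;
  import java.util.Map;

  class GlobalMemory {
      static final Map<String, Object> memory = new HashMap<>();

      public static void main(String[] args) {
          memory.put("x", 3);                             // x = 3
          memory.put("y", (Integer) memory.get("x") + 1); // y = x + 1
          System.out.println(memory.get("y"));            // 4
      }
  }
  ```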
- The basic idea behind executing instructions is called the fetch-decode-execute cycle. First, we load an instruction from code memory. Then, we decode the instruction to find the operation and operands. Finally, we execute the operation. Rinse and repeat. Ultimately the interpreter runs out of instructions in the main program, or it executes a halt instruction.
- The more highly we process the program before execution, the faster it will go at run-time.
- Bytecode interpreters have a number of useful characteristics including portability.
- Adding specialized instructions for common operations is the easiest way to speed up an interpreter.
- The only real difference between a register and a stack machine is in the way we program them. We either load operands into registers or push them onto the stack; the difference is as simple as that.
- The instruction pointer is a special-purpose “register” that points into code memory at the next instruction to execute.
- To execute instructions, the interpreter has a simulated CPU that amounts to a loop around a giant “switch on bytecode” statement. This is called the instruction dispatcher.
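  A toy dispatcher in Java that implements the fetch-decode-execute cycle described above; the opcodes and three-instruction set are invented:

  ```java
  // Instruction dispatcher: a loop around "switch on bytecode".
  class Dispatcher {
      static final short ICONST = 1, IADD = 2, PRINT = 3, HALT = 4;

      public static void main(String[] args) {
          // Program: push 3, push 4, add, print, halt
          short[] code = { ICONST, 3, ICONST, 4, IADD, PRINT, HALT };
          int[] stack = new int[100];   // operand stack
          int sp = -1;                  // stack pointer
          int ip = 0;                   // instruction pointer

          short opcode = code[ip];                  // FETCH
          while (opcode != HALT && ip < code.length) {
              ip++;
              switch (opcode) {                     // DECODE
                  case ICONST: stack[++sp] = code[ip++]; break;   // EXECUTE...
                  case IADD:   stack[sp - 1] += stack[sp]; sp--; break;
                  case PRINT:  System.out.println(stack[sp--]); break;
              }
              opcode = code[ip];                    // FETCH next
          }
      }
  }
  ```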
- The interpreter has a stack to hold function call return addresses as well as parameters and local variables.
- The processor is the heart of an interpreter and has a simple job: it loops around a fetch-decode-execute mechanism.
- The CPU stops when it hits a halt instruction or runs out of instructions to execute at the end of the code memory.
- Bytecode interpreters store strings and other non-integer constants in a constant pool.
- The constant pool is just an array of objects and is commonly used by bytecode interpreters.
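  A sketch of a pool in action: the instruction's operand is an index into the pool, not the string itself (opcodes invented):

  ```java
  // Constant pool demo: SCONST's operand indexes into an object array
  // instead of embedding the string in the bytecode.
  class ConstantPoolDemo {
      static final short SCONST = 1, PRINT = 2, HALT = 3;

      public static void main(String[] args) {
          Object[] constPool = { "hello", "world" };   // non-integer constants
          // Program: push constPool[0], print, halt
          short[] code = { SCONST, 0, PRINT, HALT };
          Object[] stack = new Object[10];
          int sp = -1, ip = 0;
          while (code[ip] != HALT) {
              switch (code[ip++]) {
                  case SCONST: stack[++sp] = constPool[code[ip++]]; break;
                  case PRINT:  System.out.println(stack[sp--]); break;
              }
          }
      }
  }
  ```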
- A stack frame is an object that tracks information about a function call.
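  A minimal frame in Java (field names are mine):

  ```java
  // A stack frame records what one function call needs: which function,
  // where to resume after 'ret', and slots for parameters and locals.
  class StackFrame {
      final String functionName;   // in a real VM, a function symbol
      final int returnAddress;     // instruction to resume after 'ret'
      final Object[] slots;        // parameters first, then locals

      StackFrame(String functionName, int returnAddress, int nslots) {
          this.functionName = functionName;
          this.returnAddress = returnAddress;
          this.slots = new Object[nslots];
      }

      public static void main(String[] args) {
          java.util.Deque<StackFrame> calls = new java.util.ArrayDeque<>();
          calls.push(new StackFrame("f", 12, 2));   // CALL f: push a frame
          StackFrame fr = calls.pop();              // RET: pop, jump back
          System.out.println("resume at " + fr.returnAddress);
      }
  }
  ```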
- One of the things you’ll discover quickly when reading source code or building your own interpreter is that making an interpreter go really fast is not easy. Every CPU clock cycle and memory access counts.
- Register machines can often execute high-level programs faster than stack machines.
- Bytecode interpreters feed off low-level byte arrays full of bytecodes (operation codes that fit into a byte) and integer operands.
- Bytecode interpreters don’t know anything about symbols. For speed reasons, they only deal with code addresses, data addresses, and constant pool indexes.
- To jump around an assembly language program, branch instructions use code label operands. A code label is just a symbol that marks a location in code memory and saves the programmer from having to manually compute addresses.
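  A two-pass assembler sketch that records each label's address on the first pass (the assembly syntax and names are invented):

  ```java
  // Pass 1: map each label to the address of the next real instruction.
  // Pass 2 (not shown) would emit code, replacing "loop" with its address.
  import java.util.HashMap;
  import java.util.Map;

  class Labels {
      public static void main(String[] args) {
          String[] asm = { "loop:", "iconst 1", "print", "br loop" };
          Map<String, Integer> labels = new HashMap<>();
          int addr = 0;
          for (String line : asm) {                 // pass 1: record addresses
              if (line.endsWith(":")) labels.put(line.replace(":", ""), addr);
              else addr++;                          // real instruction: advance
          }
          System.out.println(labels.get("loop"));   // 0
      }
  }
  ```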
- A stack-based bytecode interpreter simulates a hardware processor with no general-purpose registers. That means that bytecode instructions must use an operand stack to hold temporary values.
- To return a value, a function pushes a value onto the operand stack and then executes the ‘ret’ instruction.
- The biggest speed impediment for a stack-based interpreter is the extra work involved in constantly manipulating the operand stack.
- Building an interpreter is one way to implement a language, but we can also build a translator from the new language to an existing language.
- One of the most important lessons I’ve learned over the years is that we should compute information and translate phrases when it’s convenient and efficient to do so, not when the output order demands it.
- There are two key components to implementing this DSL: the translator itself and the run-time support.
- The basic idea in an object-to-relational database mapping is to translate classes to tables and fields to columns in the corresponding table.
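  A toy translator sketch that reflects over a Java class and emits DDL (the type mapping is made up, and real ORMs are far more involved):

  ```java
  // Class -> table, field -> column: reflect over the fields and build a
  // CREATE TABLE statement. Field order follows declaration on typical JVMs.
  import java.lang.reflect.Field;

  class OrmDemo {
      static class Person { String name; int age; }   // class to map

      static String toSql(Class<?> cls) {
          StringBuilder sql = new StringBuilder("CREATE TABLE " + cls.getSimpleName() + " (");
          Field[] fields = cls.getDeclaredFields();
          for (int i = 0; i < fields.length; i++) {
              if (i > 0) sql.append(", ");
              String type = fields[i].getType() == int.class ? "INTEGER" : "TEXT";
              sql.append(fields[i].getName()).append(" ").append(type);
          }
          return sql.append(");").toString();
      }

      public static void main(String[] args) {
          System.out.println(toSql(Person.class));
          // CREATE TABLE Person (name TEXT, age INTEGER);
      }
  }
  ```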
Notes from Language Implementation Patterns by Terence Parr, 2018-04-08.