
20181105

Crafting Interpreters by Bob Nystrom


  • Static type systems in particular require rigorous formal reasoning.
  • Being able to reason precisely and formally about syntax and semantics is a vital skill when working on a language.
  • For every successful general-purpose language out there, there are a thousand successful niche ones.
  • Implementing a language is a real test of programming skill.
  • You must master recursion, dynamic arrays, trees, graphs, and hash tables.
  • A compiler reads in files in one language and translates them to files in another language. You can implement a compiler in any language, including the same language it compiles, a process called “self-hosting”.
  • C is the perfect language for understanding how an implementation really works, all the way down to the bytes in memory and the code flowing through the CPU.
  • The first step is scanning, or lexing, or (if you’re trying to impress someone) lexical analysis.
  • A scanner takes in a linear stream of characters and chunks them together into a series of something more akin to “words”. In programming languages, each of these words is called a token.
  • The next step is parsing. This is where our syntax gets a grammar--the ability to compose larger expressions and statements out of smaller parts.
  • A parser takes a flat sequence of tokens and builds a tree structure that mirrors the nested nature of the grammar.
  • The first bit of analysis that most languages do is called binding or resolution. For each identifier we find out where that name is defined and wire the two together. This is where scope comes into play--the region of source code where a certain name can be used to refer to a certain declaration.
  • The most powerful bookkeeping tool is to transform the tree into an entirely new data structure that more directly expresses the semantics of the code.
  • You can think of the compiler as a pipeline where each stage's job is to organize the code in a way that makes the next stage simpler to implement.
  • If some expression always evaluates to the exact same value, we can do the evaluation at compile time and replace the code for the expression with its result (a constant-folding sketch appears after this list).
  • Optimizing is a huge part of the programming language business.
  • Many successful languages have surprisingly few compile-time optimizations.
  • Native code is lightning fast, but generating it is a lot of work.
  • Speaking the chip’s language also means your compiler is tied to a specific architecture.
  • In a fully compiled language, the code implementing the runtime gets inserted directly into the resulting executable.
  • If the language is run inside an interpreter or a VM, then the runtime lives there.
  • Compiling is an implementation technique that involves translating a source language to some other--usually lower-level--form.
  • When we say a language implementation “is a compiler”, we mean it translates source code to some other form but doesn’t execute it.
  • When we say an implementation “is an interpreter”, we mean it takes in source code and executes it immediately.
  • C’s most egregious grammar problems are around types.
  • A static type system is a ton of work to learn and implement.
  • High-level languages exist to eliminate error-prone, low-level drudgery.
  • There are two main techniques for managing memory: reference counting and tracing garbage collection (a reference-counting sketch appears after this list).
  • Where an expression’s main job is to produce a value, a statement’s job is to produce an effect.
  • An argument is an actual value you pass to a function when you call it.
  • A parameter is a variable that holds the value of the argument inside the body of the function.
  • Prototypes are simpler in the language, but they seem to accomplish that only by pushing the complexity onto the user.
  • The idea behind object-oriented programming is encapsulating behavior and state together.
  • If you care about making a language that is actually usable, then handling errors gracefully is vital.
  • It’s a good engineering practice to separate the code that generates the errors from the code that reports them.
  • Our job is to scan through the list of characters and group them together in the smallest sequences that still represent something. Each of these blobs of characters is called a lexeme.
  • An interactive prompt is also called a REPL.
  • That’s a token: a bundle containing the raw lexeme along with the other things the scanner learned about it.
  • The core of the scanner is a loop. Starting at the first character of the source code, it figures out what lexeme that character belongs to, then consumes it and any following characters that are part of that lexeme. When it reaches the end of that lexeme, it emits a token (a cut-down version is sketched below).
  • Maximal munch: When two grammar rules can both match a chunk of code that the scanner is looking at, whichever one matches the most characters wins.
  • Regular expressions aren’t powerful enough to handle expressions which can nest arbitrarily deeply.
  • A formal grammar takes a set of atomic pieces it calls its “alphabet”. Then it defines a (usually infinite) set of “strings” that are “in” the grammar. Each string is a sequence of “letters” in the alphabet.
  • A formal grammar’s job is to specify which strings are valid and which aren’t.
  • The visitor pattern is really about approximating the functional style within an OOP language.
  • Unlike overriding, overloading is statically dispatched at compile time.
  • When we debug our parser and interpreter, it’s often useful to look at a parsed syntax tree and make sure it has the structure we expect.
  • Converting a tree to a string is sort of the opposite of a parser, and is often called “pretty printing” when the goal is to produce a string of text that is valid syntax in the source language.
  • Writing a real parser--one with decent error-handling, a coherent internal structure, and the ability to robustly chew through a sophisticated syntax--is considered a rare, impressive skill.
  • Precedence determines which operator is evaluated first in an expression containing a mixture of different operators.
  • Associativity determines which operator is evaluated first in a series of the same operator.
  • Recursive descent is the simplest way to build a parser, and doesn’t require using complex parser generator tools (a toy example is sketched below).
  • Recursive descent parsers are fast, robust, and can support sophisticated error handling.
  • A parser really has two jobs:
    • Given a valid sequence of tokens, produce a corresponding syntax tree.
    • Given an invalid sequence of tokens, detect any errors and tell the user about their mistakes.
  • Syntax errors are a fact of life and language tools have to be robust in the face of them.
  • The way a parser responds to an error and keeps going to look for later errors is called “error recovery”.
  • The traditional place in the grammar to synchronize is between statements.
  • Taking advantage of what users already know is one of the most powerful tools you can use to ease adoption of your language. It’s almost impossible to overestimate how useful this is.
  • A literal is a bit of syntax that produces a value.
  • Once you misinterpret bits in memory, all bets are off.
  • While a runtime error needs to stop evaluating the expression, it shouldn’t kill the interpreter. If a user is running the REPL and has a typo in a line of code, they should still be able to keep going and enter more code after that.
  • Some languages are statically typed, which means type errors are detected and reported at compile time, before any code is run.
  • Others are dynamically typed and defer checking for type errors until runtime right before an operation is attempted.
  • A key reason users choose statically typed languages is because of the confidence the language gives them that certain kinds of errors can never occur when their program is run.
  • To support bindings, our interpreter needs internal state.
  • State and statements go hand in hand. Since statements, by definition, don’t evaluate to a value, they need to do something else to be useful. That something is called a “side effect”.
  • A single token lookahead recursive descent parser can’t see far enough to tell that it’s parsing an assignment until after it has gone through the left-hand side and stumbled onto the =.
  • The key difference between assignment and definition is that assignment is not allowed to create a new variable.
  • A scope is a region where a name maps to a certain entity. Multiple scopes enable the same name to refer to different things in different contexts.
  • Lexical scope is a specific style of scope where the text of the program itself shows where a scope begins and ends.
  • This is in contrast with dynamic scope where you don’t know what a name refers to until you execute the code.
  • One motivation for lexical scope is encapsulation--a block of code in one corner of the program shouldn’t interfere with some other one.
  • When a local variable has the same name as a variable in an enclosing scope, it shadows the outer one. Code inside the block can’t see it anymore, but it’s still there (an environment-chain sketch appears below).
  • The main advantage to implicit declaration is simplicity.
  • In fact, any programming language with some minimum level of expressiveness is powerful enough to compute any computable function.
  • Reducing syntactic sugar to semantically equivalent but more verbose forms is called desugaring.
  • A rich syntax makes the language more pleasant and productive to work in.
  • Arity is the fancy term for the number of arguments a function or operation expects.
  • When it comes to making your language actually good at doing useful stuff, the native functions your implementation provides are key. They provide access to the fundamental services that all programs are defined in terms of.
  • Many languages also allow users to provide their own native functions. The mechanism for doing so is called a foreign function interface (FFI), native extension, native interface, or something along those lines.
  • By using different environments when we execute the body, calls to the same function with the same code can produce different results.
  • When a function is declared, it captures a reference to the current environment.
  • A closure retains a reference to the environment instance in play when the function was declared.
  • There are three broad paths to object-oriented programming: classes, prototypes, and multimethods.
  • Doing a hash table lookup for every field access is fast enough for many language implementations, but not ideal.
  • Methods and fields let us encapsulate state and behavior together so that an object always stays in a valid configuration.
  • Prototypes are simpler than classes--less code for the language implementer to write, and fewer concepts for the user to learn and understand.
  • Breadth is the range of different things the language lets you express.
  • Ease is how little effort it takes to make the language do what you want.
  • Complexity is how big the language is.
  • Inheriting from another class means that everything that’s true of the superclass should be true, more or less, of the subclass.
  • A tree-walk interpreter is fine for some kinds of high-level, declarative languages. But for a general-purpose, imperative language--even a “scripting” language--it won’t fly.
  • A dynamically-typed language is never going to be as fast as a statically-typed language with manual memory management.
  • In engineering, few choices are without trade-offs.
  • Modern CPUs process data way faster than they can pull it from RAM. To compensate for that, chips have multiple layers of caching.
  • Many implementations of malloc() store the allocated size in memory right before the returned address (sketched below).
  • C asks us to manage memory not just explicitly, but mentally.
  • A trie stores a set of strings.
  • Tries are a special case of an even more fundamental data structure: a deterministic finite automaton (a keyword-matching trie is sketched below).
  • We sometimes fall into the trap of thinking that performance comes from complicated data structures, layers of caching, and other fancy optimizations. But, many times, all that’s required is to do less work, and I often find that writing the simplest code I can is sufficient to accomplish that.
  • Bytecode was good enough for Niklaus Wirth.
  • Pratt parsers are a sort of oral tradition in industry.
  • Vaughan Pratt’s “top-down operator precedence parsing” is the most elegant way I know to parse expressions.
  • A compiler has roughly two jobs. It parses the user’s source code to understand what it means. Then it takes that knowledge and outputs low-level instructions that produce the same semantics.
  • Good error handling and reporting is more valuable to users than almost anything else you can put time into in the front end.
  • You wouldn’t believe the breadth of software problems that miraculously seem to require a new little language in their solution as soon as you ask a compiler hacker for help.
  • The nice thing about working in C is that we can build our data structures from the raw bits up.
  • A value contains two parts: a “type” tag, and a payload for the actual value (sketched below).
  • A union looks like a struct except that all of its fields overlap in memory.
  • The size of a union is the size of its largest field.
  • Most architectures prefer values to be aligned to their size.
  • A bytecode VM spends much of its execution time reading and decoding instructions. The fewer, simpler instructions you need for a given piece of behavior, the faster it goes (a toy dispatch loop is sketched below).
  • There’s no maximum length for a string.
  • We need a way to support values whose size varies, sometimes greatly. This is exactly what dynamic allocation on the heap is designed for.
  • The [garbage] collector must ensure it can find every bit of memory that is still being used so that it doesn’t collect live data.
  • If your language needs GC, get it working as soon as you can.
  • Choosing a character representation and encoding involves fundamental trade-offs.
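
The sketches that follow illustrate a few of the notes above in C. They are deliberately minimal stand-ins, not the book’s code; the type and function names in them are invented for illustration.

Constant folding (from the note on expressions that always evaluate to the same value): if both operands of an addition are already number literals, the compiler can add them itself and replace the whole subtree with the result.

    typedef enum { NODE_NUMBER, NODE_ADD } NodeKind;

    typedef struct Node {
        NodeKind kind;
        double number;              /* meaningful when kind == NODE_NUMBER */
        struct Node *left, *right;  /* meaningful when kind == NODE_ADD */
    } Node;

    /* Fold constants bottom-up: once both children are literals, the
       addition can be performed now instead of at runtime. */
    static void fold(Node *node) {
        if (node->kind != NODE_ADD) return;
        fold(node->left);
        fold(node->right);
        if (node->left->kind == NODE_NUMBER && node->right->kind == NODE_NUMBER) {
            node->number = node->left->number + node->right->number;
            node->kind = NODE_NUMBER;   /* the subtree is now just its value */
        }
    }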
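
Reference counting, the first of the two memory-management techniques mentioned above: every object carries a count of live references, and releasing the last reference frees it. (Tracing garbage collection instead walks the graph of reachable objects from a set of roots.)

    #include <stdlib.h>

    typedef struct {
        int refcount;
        double payload;   /* stand-in for whatever the object holds */
    } Obj;

    static Obj *obj_new(double payload) {
        Obj *obj = malloc(sizeof(Obj));
        obj->refcount = 1;            /* the creator holds the first reference */
        obj->payload = payload;
        return obj;
    }

    static void obj_retain(Obj *obj)  { obj->refcount++; }

    static void obj_release(Obj *obj) {
        if (--obj->refcount == 0) free(obj);   /* last reference gone */
    }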
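
The token and the scanner’s core loop, cut down to three token kinds. A token bundles the raw lexeme (a pointer and length into the source) with what the scanner learned about it, and the digit loop is “maximal munch” in miniature.

    typedef enum { TOKEN_LEFT_PAREN, TOKEN_NUMBER, TOKEN_ERROR, TOKEN_EOF } TokenType;

    typedef struct {
        TokenType type;
        const char *start;   /* first character of the lexeme */
        int length;          /* number of characters in the lexeme */
        int line;
    } Token;

    typedef struct {
        const char *current; /* next character to be consumed */
        int line;
    } Scanner;

    /* Look at the next character, consume the whole lexeme it starts,
       and emit one token describing it. */
    static Token scan_token(Scanner *s) {
        Token token;
        token.start = s->current;
        token.line = s->line;

        char c = *s->current;
        if (c == '\0') {
            token.type = TOKEN_EOF;
        } else if (c >= '0' && c <= '9') {
            while (*s->current >= '0' && *s->current <= '9') s->current++;
            token.type = TOKEN_NUMBER;
        } else {
            s->current++;
            token.type = (c == '(') ? TOKEN_LEFT_PAREN : TOKEN_ERROR;
        }
        token.length = (int)(s->current - token.start);
        return token;
    }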
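
Recursive descent over a toy arithmetic grammar (not Lox’s): one function per precedence level. Calling into the higher-precedence function first is what makes * and / bind tighter than + and -, and the while loops make both levels left-associative. Assumes the input contains no whitespace.

    #include <stdlib.h>

    /* term   -> factor ( ("+" | "-") factor )*
       factor -> NUMBER ( ("*" | "/") NUMBER )*  */
    static const char *src;

    static double factor(void) {
        double value = strtod(src, (char **)&src);
        while (*src == '*' || *src == '/') {
            char op = *src++;
            double rhs = strtod(src, (char **)&src);
            value = (op == '*') ? value * rhs : value / rhs;
        }
        return value;
    }

    static double term(void) {
        double value = factor();
        while (*src == '+' || *src == '-') {
            char op = *src++;
            double rhs = factor();
            value = (op == '+') ? value + rhs : value - rhs;
        }
        return value;
    }

    double evaluate(const char *expression) {
        src = expression;
        return term();   /* evaluate("1+2*3") yields 7, not 9 */
    }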
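
Lexical scope and shadowing as an environment chain: each scope holds its own bindings plus a pointer to the enclosing scope, and lookup starts in the innermost scope, so an inner binding shadows an outer one without destroying it.

    #include <stddef.h>
    #include <string.h>

    typedef struct Environment {
        struct Environment *enclosing;  /* NULL for the global scope */
        const char *names[8];           /* fixed-size table, enough for a sketch */
        double values[8];
        int count;
    } Environment;

    /* Walk outward from the innermost scope until the name is found. */
    static double *lookup(Environment *env, const char *name) {
        for (; env != NULL; env = env->enclosing) {
            for (int i = 0; i < env->count; i++) {
                if (strcmp(env->names[i], name) == 0) return &env->values[i];
            }
        }
        return NULL;   /* undefined variable */
    }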
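
The malloc() size-header idea, shown as a toy wrapper rather than by poking at a real allocator’s internals: allocate a little extra room, stash the size just below the pointer handed out, and step back over it to read or free.

    #include <stdlib.h>

    static void *sized_alloc(size_t size) {
        size_t *block = malloc(sizeof(size_t) + size);
        if (block == NULL) return NULL;
        *block = size;          /* header lives right before the returned address */
        return block + 1;       /* real allocators also keep stricter alignment */
    }

    static size_t sized_size(void *ptr) {
        return *((size_t *)ptr - 1);   /* step back over the header */
    }

    static void sized_free(void *ptr) {
        free((size_t *)ptr - 1);       /* free the true start of the block */
    }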
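
A hand-unrolled trie for keyword recognition: branch on the first character, then confirm the rest of the word. Each branch point is effectively a state in a deterministic finite automaton. The keyword set here is a made-up subset.

    #include <stdbool.h>
    #include <string.h>

    static bool is_keyword(const char *word, int length) {
        switch (word[0]) {
            case 'f':
                if (length > 1) {
                    switch (word[1]) {
                        case 'a': return length == 5 && memcmp(word + 2, "lse", 3) == 0;  /* false */
                        case 'o': return length == 3 && word[2] == 'r';                   /* for   */
                    }
                }
                return false;
            case 'i': return length == 2 && word[1] == 'f';                               /* if    */
            case 'w': return length == 5 && memcmp(word + 1, "hile", 4) == 0;             /* while */
            default:  return false;
        }
    }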
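
A tagged value, per the note that a value is a type tag plus a payload: the union’s fields all overlap in memory, so the union is as large as its largest field, and alignment typically pads the whole struct out further (16 bytes is common on 64-bit machines).

    #include <stdbool.h>

    typedef enum { VAL_BOOL, VAL_NIL, VAL_NUMBER } ValueType;

    typedef struct {
        ValueType type;      /* the tag says which union field is live */
        union {
            bool boolean;
            double number;   /* largest field, so the union is 8 bytes */
        } as;
    } Value;

    /* Reading a field other than the one the tag names is the "misinterpret
       bits in memory" mistake mentioned above; always check the tag first. */
    static double as_number(Value v) {
        return (v.type == VAL_NUMBER) ? v.as.number : 0.0;
    }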
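
Why a bytecode VM lives and dies by instruction decoding: the dispatch loop below reads one opcode byte and branches on it on every iteration, so fewer and simpler instructions mean fewer trips through this loop. The opcodes are invented for the sketch.

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { OP_CONST, OP_ADD, OP_PRINT, OP_HALT } OpCode;

    static void run(const uint8_t *code, const double *constants) {
        double stack[256];
        double *top = stack;        /* next free stack slot */
        const uint8_t *ip = code;   /* instruction pointer */

        for (;;) {
            switch (*ip++) {        /* read and decode one instruction */
                case OP_CONST: *top++ = constants[*ip++]; break;  /* operand: constant index */
                case OP_ADD:   top--; top[-1] = top[-1] + top[0]; break;
                case OP_PRINT: printf("%g\n", *--top); break;
                case OP_HALT:  return;
            }
        }
    }

    /* run((uint8_t[]){OP_CONST, 0, OP_CONST, 1, OP_ADD, OP_PRINT, OP_HALT},
           (double[]){1.5, 2.5});  prints 4 */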
