- Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications--bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.
- A data-intensive application is typically built from standard building blocks that provide commonly needed functionality.
- A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
- It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.
- Many critical bugs are actually due to poor error handling; by deliberately inducing faults, you ensure that the fault tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally.
- Design systems in a way that minimizes opportunities for error.
- Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually.
- An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises.
- One of the best tools we have for removing accidental complexity is abstraction. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand facade.
- Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem that we are solving.
- Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it represented in terms of the next-lower layer?
- MapReduce is a programming model for processing large amounts of data in bulk across many machines.
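A toy single-machine sketch of that model (not the book's code, and a real MapReduce framework distributes the map and reduce tasks across machines) is the canonical word-count job:

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

The mapper and reducer are pure functions, which is what lets a framework rerun them freely on different machines.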
- The Datalog approach requires a different kind of thinking to the other query languages discussed in this chapter, but it’s a very powerful approach, because rules can be combined and reused in different queries. It’s less convenient for simple one-off queries, but it can cope better if your data is complex.
- Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.
- Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.
- On the most fundamental level, a database needs to do two things: when you give it some data, it should store the data, and when you ask it again later, it should give the data back to you.
- In order to efficiently find the value for a particular key in the database, we need a different data structure: an index.
- An index is an additional structure that is derived from the primary data. Many databases allow you to add and remove indexes, and this doesn’t affect the contents of the database; it only affects the performance of queries.
- Key-value stores are quite similar to the dictionary type that you can find in most programming languages, and which is usually implemented as a hash map.
- The simplest possible indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file--the location at which the value can be found. Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote. When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.
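That strategy (in the spirit of Bitcask) can be sketched in a few lines. This is a minimal illustration, not production code: keys and values are assumed to be strings without commas or newlines, and there is no segment compaction.

```python
class LogKVStore:
    """Append-only log file plus an in-memory hash map of byte offsets."""

    def __init__(self, path):
        self.path = path
        self.index = {}            # key -> byte offset in the data file
        open(path, "a").close()    # make sure the data file exists

    def put(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.tell()      # byte offset where this record starts
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset   # the newest offset wins

    def get(self, key):
        offset = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)         # jump straight to the record
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]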
- An append-only log seems wasteful at first glance: why don’t you update the file in place, overwriting the old value with the new value? But an append-only design turns out to be good for several reasons.
- Appending and segment merging are sequential write operations, which are generally much faster than random writes, especially on magnetic spinning-disk hard drives.
- Concurrency and crash recovery are much simpler if segment files are append-only or immutable.
- Merging old segments avoids the problem of data files getting fragmented over time.
- A full-text index is much more complex than a key-value index but is based on a similar idea: given a word in a search query, find all the documents that mention the word. This is implemented with a key-value structure where the key is a word and the value is the list of IDs of all the documents that contain the word.
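The key-value structure described here is an inverted index: word -> postings list of document IDs. A minimal sketch (real engines such as Lucene add tokenization, ranking, and compressed on-disk postings):

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "a quick dog"}
index = build_inverted_index(docs)
```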
- A Bloom filter is a memory-efficient data structure for approximating the contents of a set. It can tell you if a key does not appear in the database, and thus saves many unnecessary disk reads for nonexistent keys.
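A toy Bloom filter, to make the one-sided guarantee concrete: k hash functions set k bits per key, so membership tests can return false positives but never false negatives. The sizes below are illustrative, and SHA-256 stands in for the faster non-cryptographic hashes real systems use.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        """Derive num_hashes bit positions for a key."""
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # If any bit is unset, the key was definitely never added.
        return all(self.bits[pos] for pos in self._positions(key))
```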
- The basic idea of LSM-trees--keeping a cascade of SSTables that are merged in the background--is simple and effective.
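The background merge step can be sketched as follows (a toy illustration, not the book's code): each segment is a list of (key, value) pairs sorted by key, later segments are newer, and compaction keeps only the most recent value for each key.

```python
import heapq
from itertools import groupby

def merge_segments(*segments):
    """Merge key-sorted segments, keeping the newest value per key.

    Segments are passed oldest first; later segments shadow earlier ones.
    """
    tagged = []
    for age, segment in enumerate(segments):       # age 0 = oldest
        # For equal keys, -age makes entries from newer segments sort first.
        tagged.append([(key, -age, value) for key, value in segment])
    merged = []
    for key, group in groupby(heapq.merge(*tagged), key=lambda e: e[0]):
        newest = next(group)                       # first entry = newest
        merged.append((key, newest[2]))
    return merged
```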
- B-trees have stood the test of time very well. They remain the standard index implementation in almost all relational databases, and many nonrelational databases use them too.
- Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups and range queries.
- In write-heavy applications, the performance bottleneck might be the rate at which the database can write to disk.
- A primary key uniquely identifies one row in a relational table, or one document in a document database, or one vertex in a graph database.
- It is also very common to have secondary indexes. In relational databases, you can create several secondary indexes on the same table [...] and they are often crucial for performing joins efficiently.
- The key in an index is the thing that queries search for, but the value can be one of two things: it could be the actual row (document, vertex) in question, or it could be a reference to the row stored elsewhere.
- The heap file approach is common because it avoids duplicating data when multiple secondary indexes are present: each index just references a location in the heap file, and the actual data is kept in one place.
- As RAM becomes cheaper, the cost-per-gigabyte argument is eroded. Many datasets are simply not that big, so it’s quite feasible to keep them entirely in memory, potentially distributed across several machines. This has led to the development of in-memory databases.
- When an in-memory database is restarted, it needs to reload its state, either from disk or over the network from a replica.
- A data warehouse is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations. The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company.
- The big advantage of using a separate data warehouse, rather than querying OLTP systems directly for analytics, is that the data warehouse can be optimized for analytic access patterns.
- The name “star schema” comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star.
- The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead. If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query, which can save a lot of work.
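A minimal illustration of the layout difference (the data and query are made up): transposing rows into columns means an aggregation touching two columns out of many never reads the others.

```python
# Row-oriented: each record's values stored together.
rows = [
    {"date": "2024-01-01", "product": "apple", "quantity": 3},
    {"date": "2024-01-01", "product": "pear",  "quantity": 5},
    {"date": "2024-01-02", "product": "apple", "quantity": 2},
]

# Column-oriented: each column's values stored together
# (each column would typically live in its own file on disk).
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query like SUM(quantity) WHERE product = 'apple' reads only two columns.
total = sum(
    qty
    for product, qty in zip(columns["product"], columns["quantity"])
    if product == "apple"
)
```

Keeping a column's values together also makes them compress much better, since values within one column tend to be similar.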
- Log-structured storage engines are a comparatively recent development. Their key idea is that they systematically turn random-access writes into sequential writes on disk, which enables higher write throughput due to the performance characteristics of hard drives and SSDs.
- In most cases, a change to an application’s features also requires a change to data that it stores: perhaps a new field or record type needs to be captured, or perhaps existing data needs to be presented in a new way.
- Many programming languages come with built-in support for encoding in-memory objects into byte sequences.
- CSV does not have any schema, so it is up to the application to define the meaning of each row and column. If an application change adds a new row or column, you have to handle that change manually.
- The difficulty of getting different organizations to agree on anything outweighs most other concerns.
- Apache Thrift and Protocol Buffers are binary encoding libraries that are based on the same principle. Protocol Buffers was originally developed at Google, Thrift was originally developed at Facebook, and both were made open source in 2007-2008. Both Thrift and Protocol Buffers require a schema for any data that is encoded.
- A key design goal of a service-oriented/microservices architecture is to make the application easier to change and maintain by making services independently deployable and evolvable.
- REST is not a protocol, but rather a design philosophy that builds upon the principles of HTTP. It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, and content type negotiation.
- The detailed delivery semantics vary by implementation and configuration, but in general, message brokers are used as follows: one process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers of or subscribers to that queue or topic. There can be many producers and many consumers on the same topic.
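The pattern can be sketched with a toy in-process broker (hypothetical names; real brokers add persistence, acknowledgements, and the delivery guarantees mentioned above): producers publish to a named topic, and the broker fans each message out to every subscriber.

```python
from collections import defaultdict

class Broker:
    """Minimal publish/subscribe fan-out over named topics."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic.
        for callback in self.subscribers[topic]:
            callback(message)
```

The producer never needs to know who the consumers are; the named topic decouples them.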
- If all you need is to scale to higher load, the simplest approach is to buy a more powerful machine (sometimes called vertical scaling or scaling up). Many CPUs, many RAM chips, and many disks can be joined together under one operating system, and a fast interconnect allows any CPU to access any part of the memory or disk. In this kind of shared-memory architecture, all the components can be treated as a single machine.
- Being able to reboot individual nodes without downtime is a big advantage for operations and maintenance. Thus, our goal is to keep the system as a whole running despite individual node failures, and to keep the impact of a node outage as small as possible.
- Despite being a simple goal--keeping a copy of the same data on several machines--replication turns out to be a remarkably tricky problem. It requires carefully thinking about concurrency and about all the things that can go wrong, and dealing with the consequences of those faults.
- The main reason for wanting to partition data is scalability. Different partitions can be placed on different nodes in a shared-nothing cluster. Thus, a large dataset can be distributed across many disks, and the query load can be distributed across many processors.
- The simplest approach for avoiding hot spots would be to assign records to nodes randomly. That would distribute the data quite evenly across the nodes, but it has a big disadvantage: when you’re trying to read a particular item, you have no way of knowing which node it is on, so you have to query all nodes in parallel.
- A good hash function takes skewed data and makes it uniformly distributed.
- For partitioning purposes, the hash function need not be cryptographically strong.
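A sketch of hash partitioning under those two observations (MD5 here stands in for the non-cryptographic hashes such as MurmurHash that real systems use; the simple modulo scheme is illustrative and rebalances poorly when the partition count changes):

```python
import hashlib

def partition_for(key, num_partitions):
    """Map a key deterministically and near-uniformly to a partition."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because every node computes the same partition for a given key, a read for that key goes to exactly one node instead of being broadcast to all of them.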
- Partitioning is necessary when you have so much data that storing and processing it on a single machine is no longer feasible.
- In general, atomic refers to something that cannot be broken down into smaller parts.
- A transaction is usually understood as a mechanism for grouping multiple operations on multiple objects into one unit of execution.
- A key feature of a transaction is that it can be aborted and safely retried if an error occurred. ACID databases are based on this philosophy: if the database is in danger of violating its guarantee of atomicity, isolation, or durability, it would rather abandon the transaction entirely than allow it to remain half-finished.
- If two transactions don’t touch the same data, they can safely be run in parallel, because neither depends on the other.
- The simplest way of avoiding concurrency problems is to remove the concurrency entirely: to execute only one transaction at a time, in serial order, on a single thread. By doing so, we completely sidestep the problem of detecting and preventing conflicts between transactions: the resulting isolation is by definition serializable.
- Transactions are an abstraction layer that allows an application to pretend that certain concurrency problems and certain kinds of hardware and software faults don’t exist.
- A large class of errors is reduced down to a simple transaction abort, and the application just needs to try again.
- This nondeterminism and possibility of partial failures is what makes distributed systems hard to work with.
- With these philosophies come very different approaches to handling faults.
- The bigger a system gets, the more likely it is that one of its components is broken. Over time, broken things get fixed and new things break, but in a system with thousands of nodes, it is reasonable to assume that something is always broken.
- When the error handling strategy consists of simply giving up, a large system can end up spending a lot of its time recovering from faults rather than doing useful work.
- If we want to make distributed systems work, we must accept the possibility of partial failure and build fault-tolerance mechanisms into the software. In other words, we need to build a reliable system from unreliable components.
- When one part of the network is cut off from the rest due to a network fault, that is sometimes called a network partition or net-split.
- Rapid feedback about a remote node being down is useful, but you can’t count on it. [...] If you want to be sure that a request was successful, you need a positive response from the application itself.
- UDP is a good choice in situations where delayed data is worthless.
- In a distributed system, time is a tricky business, because communication is not instantaneous: it takes time for a message to travel across the network from one machine to another.
- A time-of-day clock does what you intuitively expect of a clock: it returns the current date and time according to some calendar.
- A monotonic clock is suitable for measuring a duration (time interval), such as timeout or a service’s response time.
- Monotonic clocks don’t need synchronization, but time-of-day clocks need to be set according to an NTP server or other external time source in order to be useful.
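In Python, the two clocks described above correspond to `time.time()` (time-of-day) and `time.monotonic()`. Durations should be measured with the monotonic clock, since its value is never stepped backwards by NTP adjustments:

```python
import time

# Measure how long an operation took using the monotonic clock.
start = time.monotonic()
time.sleep(0.01)                     # stand-in for the operation being timed
elapsed = time.monotonic() - start

# Unlike time.time(), this difference can never come out negative,
# even if the system's wall clock is adjusted mid-measurement.
assert elapsed >= 0.01
```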
- Thus, if you use software that requires synchronized clocks, it is essential that you also carefully monitor the clock offsets between all the machines. Any node whose clock drifts too far from the others should be declared dead and removed from the cluster. Such monitoring ensures that you notice broken clocks before they can cause too much damage.
- A node in a distributed system must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function. During the pause, the rest of the world keeps moving and may even declare the paused node dead because it’s not responding. Eventually, the paused node may continue running, without even noticing that it was asleep until it checks its clock sometime later.
- In embedded systems, real-time means that a system is carefully designed and tested to meet specified timing guarantees in all circumstances.
- Providing real-time guarantees in a system requires support from all levels of the software stack.
- For most server-side data processing systems, real-time guarantees are simply not economical or appropriate. Consequently, these systems must suffer the pauses and clock instability that come from operating in a non-real-time environment.
- The moral of these stories is that a node cannot necessarily trust its own judgement of a situation. A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover.
- If a quorum of nodes declares another node dead, then it must be considered dead, even if that node still very much feels alive. The individual node must abide by the quorum decision and step down.
- To tolerate faults, the first step is to detect them, but even that is hard. Most systems don’t have an accurate mechanism of detecting whether a node has failed, so most distributed algorithms rely on timeouts to determine whether a remote node is still available.
- If you can avoid opening Pandora’s box and simply keep things on a single machine, it is generally worth doing so.
- The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees.
- One of the most important abstractions for distributed systems is consensus: that is, getting all of the nodes to agree on something.
- Most replicated databases provide at least eventual consistency, which means that if you stop writing to the database and wait for some unspecified length of time, then eventually all read requests will return the same value.
- A better name for eventual consistency may be convergence, as we expect all replicas to eventually converge to the same value.
- The basic idea behind linearizability is simple: to make a system appear as if there is only a single copy of the data.
- Consensus is one of the most important and fundamental problems in distributed computing.
- Surprisingly many data analyses can be done in a few minutes using some combination of awk, sed, grep, sort, uniq, and xargs, and they perform surprisingly well.
- The main difference from pipelines of Unix commands is that MapReduce can parallelize a computation across many machines, without you having to write code to explicitly handle the parallelism.
- In the Unix world, the uniform interface that allows one program to be composed with another is files and pipes; in MapReduce, that interface is a distributed file system.
- In general, a “stream” refers to data that is incrementally made available over time.
- There is no single system that can satisfy all data storage, querying, and processing needs. In practice, most nontrivial applications need to combine several different technologies in order to satisfy their requirements.
- Conventional search engines first index the documents and then run queries over the index. By contrast, searching a stream turns the processing on its head: the queries are stored, and the documents run past the queries, like in CEP.
- Actor frameworks are primarily a mechanism for managing concurrency and distributed execution of communication modules, whereas stream processing is primarily a data management technique.
- One solution is to break the stream into small blocks, and treat each block like a miniature batch process. This approach is called micro-batching, and it is used in Spark Streaming.
- An idempotent operation is one that you can perform multiple times, and it has the same effect as if you performed it only once.
- Even if an operation is not naturally idempotent, it can often be made idempotent with a bit of extra metadata.
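A sketch of that metadata trick (hypothetical names): a deposit is not naturally idempotent, but attaching a client-supplied request ID and remembering which IDs were already applied makes redelivering the same request harmless.

```python
class Account:
    """Deposits made idempotent via deduplication on request IDs."""

    def __init__(self):
        self.balance = 0
        self.processed = set()     # request IDs already applied

    def deposit(self, request_id, amount):
        if request_id in self.processed:
            return self.balance    # duplicate delivery: do nothing
        self.processed.add(request_id)
        self.balance += amount
        return self.balance
```

This is exactly what lets an at-least-once delivery system behave as if messages were delivered exactly once.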
- A recurring theme in this book has been that for any given problem, there are several solutions, all of which have different pros, cons, and trade-offs.
- Thus, the most appropriate choice of software tool also depends on the circumstances. Every piece of software, even a so-called “general-purpose” database, is designed for a particular usage pattern.
2021-04-11
Designing Data-Intensive Applications by Martin Kleppmann