《Database Internals: A Deep-Dive into How Distributed Data System Work》读书笔记
Overview
Terminology can sometimes be ambiguous and hard to understand without a complete context.
Distinction among the DBMS in terms of :
a storage medium:
Memory
In-memory DBMS (main memory DBMS) store data primarily in memory and use the disk for recovery and logging.
Store their contents almost exclusively in RAM
Disk based DBMS
Disk based DBMS hold most of the data on disk and use memory for caching disk contents or as a temporary storage.
layout: Column vs. Row Oriented DBMS
three major categories:
Online transaction processing databases (OLTP)
Online analytical processing databases (OLAP)
Hybrid transactional and analytical processing (HTAP)
another way:
key-value stores
relational databases
document-oriented stores
graph databases
Client/Server
Database management systems use a client/server model, where database system instances (nodes) take the role of servers, and application instances take the role of clients.
Client requests arrive through the transpot subsystem.

The storage engine has several components with dedicated responsibilities:
1. Transaction manager
2. Lock manager
3. Access methods
4. Buffer manager
5. Recovery manager
Horizontal Scaling vs. Scaling Vertically
Scaling out: improving performance and increasing capacity by running multiple database instances acting as a single logical unit: Gamma Database Machine Project, Teradata, Greenplum, Parallel DB2.
Scaling up by moving the database to a larger, more powerful machine.
Database components
1.
A transport layer accepting request
2.
A query processor determing the most efficient way to run queries
3.
An execution engine carrying out the operations
4.
A storage engine(or databasa engine)
A software component of a database management system responsible for storing, retrieving, and amanging data in memory and on disk, designed to capture a persistent, long-term memory of each node.
Database management systems are applications built on top of storage engines, offering a schema, a query language, indexing, transactions, and many other useful features.
Storage engines were developed independently from the database management systems they’re now embedded into.
Database file system
The primary goal of a database system is to store data and to allow quick access to it.
File organization improve efficiency.
Database systems do use files for storing the data, but instead of relying on filesystem hierarcheies of directories and files for lacating records, they compose files using implementation-specific formats.
The main reasons to use specialized file organization over flat files are:
1. Storage efficiency
2. Access efficiency
3. Update efficincy
A database system ususally separates data files and index files: data files store data records, while index files store record metadata and use it to locate records in data files.
Data files also called primary files.
B-Tree
B-Trees were introduced by Rudolph Bayer and Edward M. McCreight back in 1971 and gained popularity over the years.
Binary search trees
A binary search tree (BST) is a sorted in-memory data structure, used for efficient key-value lookups.
The balanced tree is defined as one that has a height of log_2N where N is the total number of items in the tree; and the difference in height between the two subtrees is not greater than one.
Unbalanced trees have a worst-case complexity of O(N), balanced trees give us an average O(log_2N).
Binary trees have a fanout of just two.
2-3 Trees and other low-fanout trees have a similar limitation: while they are useful as in-memory data structures, small node size makes them impractical for external storage.
* `High fanout` to improve locality of the neighboring keys
* `Low height` to reduce the number of seeks during traversal
Disk-Basked Structures
Most traditional algorithms were developed when spinning disks were the most widespread persistent storage medium, which significantly influenced their design.
On spinning disk, seeks increase costs of random reads because they require disk rotation and mechanical head movements to position the read/write head to the desired location. Once the expensive part is done, reading or writing contiguous bytes (i.e., sequential operations) is relatively cheap.
Head positioning is the most expensive part of an operation on the HDD.
Solid state drives (SSDs) do not have moving parts: there’s no disk that spins, or head that has to be positioned for the read. A typical SSD is built of memory cells , connected into strings (typeically 32 to 64 cells per string), strings are combined into arrays , arrays are combined into pages(range from 2 to 16 Kb), and pages are combined into blocks.
| SSD concepts | size |
|---|---|
| memory cells | one or multiple bits |
| string | 32-64 cells |
| array | |
| page | 2-16 Kb(smallest unit that can be written (programmed) or read) |
| block | 64-512 pages(the smallest erase entity) |
| plane | |
| die | where plane are placed on |
| SSD | one or more dies |
The smallest unit that can be written (programmed) or read is a page, however we can only make changes to the empty memory cells (i.e., to ones that have been erased befor the write). The smallest erase entity is not a page, but a block that holds multiple pages, which is why it is often called an erase block.
pages in an empty block have to be written sequentially.
On-Disk Structures
Besides the cost of disk access itself, the main limitation and design condition for building efficient on-disk structures is the fact that the smallest unit of disk operation is a block.
Ubiquitous B-Trees
B-Trees can be thought of as a cast catalog room in the library
B-Trees build upon the foundataion of balanced search trees and are different in that they have higher fanout (have more child nodes) and smaller height.
While binary tree nodes are drawn as circles, B-Tree nodes are often drawn as rectangles.

B-Trees are sorted: keys inside the B-Tree nodes are stored in order. This implies that lookups in B-Trees have logarithmic complexity.
B-Trees consist of multiple nodes, each node holds up to N keys and N+1 pointers to the child nodes. These nodes are logically grouped into three groups:
* Boot node
* Leaf nodes
* Internal nodes
B-Trees are a page organization technique, we often use terms node and page interchangeably.
Keys stored in B-Tree nodes are called index entries (sepatator keys, divider cells), they split the tree into subtrees (branches, subranges)
Page Structure
In B-Trees, we distinguish:
* leaf nodes: hold keys and data records pair
* nonleaf nodes: hold keys and pointers to other nodes
Each B-Tree node occupies one page or multiple pages linked together.
Slotted page
To efficiently store variable-size records such as strings, binary large objects (BLOBs), etc., we can use an organization technique called slotted page or slot directory.
Cell Layout
key cells
Key cells hold a separator key and a pointer to the page between two neighboring pointers
key-value cells
Key-value cells hold keys and data records associated with them
Database Transaction
A transaction is an indivisible logical unit of work in database management system, allowing you to represent multiple operations as a single step.
Distributed System
In a distributed system, we have several participants (also called processes, nodes, or replicas).
Each participant has its own local state.
CPUs are tiny distributed systems with links, processors, and communication protocols.
Algorithms define participant roles, exchanged messages, states, transitions, eecuted steps, properties of the delivery medium, timing assumptions, failure models, and other characteristics that describe processes and their interactions.
Consistency models
Consistency models describe concurrent executions and establish an order in which operations can be executed and made visible to the participants.
Concurrent and Parallel
When two sequences of steps execute concurrently, both of them are in progress, but only one of them is executed at any moment; Concurrent operations overlap in time;
If two sequences execute in parallel, their steps can be executed simultaneously; parallel operations are executed by multiple processors.
Network Partitions
When two or more servers cannot communicate with each other, we call the situation network partition.
