Advertisement

《Database Internals: A Deep-Dive into How Distributed Data System Work》读书笔记

阅读量:

Overview

Terminology can sometimes be ambiguous and hard to understand without a complete context.

Distinction among the DBMS in terms of :

a storage medium:

Memory

In-memory DBMS (main memory DBMS) store data primarily in memory and use the disk for recovery and logging.

Store their contents almost exclusively in RAM

Disk based DBMS

Disk based DBMS hold most of the data on disk and use memory for caching disk contents or as a temporary storage.

layout: Column vs. Row Oriented DBMS

three major categories:

Online transaction processing databases (OLTP)
Online analytical processing databases (OLAP)
Hybrid transactional and analytical processing (HTAP)

another way:

key-value stores
relational databases
document-oriented stores
graph databases

Client/Server

Database management systems use a client/server model, where database system instances (nodes) take the role of servers, and application instances take the role of clients.

Client requests arrive through the transpot subsystem.
在这里插入图片描述

The storage engine has several components with dedicated responsibilities:

复制代码
1. Transaction manager
2. Lock manager
3. Access methods
4. Buffer manager
5. Recovery manager

Horizontal Scaling vs. Scaling Vertically

Scaling out: improving performance and increasing capacity by running multiple database instances acting as a single logical unit: Gamma Database Machine Project, Teradata, Greenplum, Parallel DB2.

Scaling up by moving the database to a larger, more powerful machine.

Database components

复制代码
1. 

A transport layer accepting request

复制代码
2. 

A query processor determing the most efficient way to run queries

复制代码
3. 

An execution engine carrying out the operations

复制代码
4. 

A storage engine(or databasa engine)

A software component of a database management system responsible for storing, retrieving, and amanging data in memory and on disk, designed to capture a persistent, long-term memory of each node.

Database management systems are applications built on top of storage engines, offering a schema, a query language, indexing, transactions, and many other useful features.

Storage engines were developed independently from the database management systems they’re now embedded into.

Database file system

The primary goal of a database system is to store data and to allow quick access to it.

File organization improve efficiency.

Database systems do use files for storing the data, but instead of relying on filesystem hierarcheies of directories and files for lacating records, they compose files using implementation-specific formats.

The main reasons to use specialized file organization over flat files are:

复制代码
1. Storage efficiency
2. Access efficiency
3. Update efficincy

A database system ususally separates data files and index files: data files store data records, while index files store record metadata and use it to locate records in data files.

Data files also called primary files.

B-Tree

B-Trees were introduced by Rudolph Bayer and Edward M. McCreight back in 1971 and gained popularity over the years.

A binary search tree (BST) is a sorted in-memory data structure, used for efficient key-value lookups.

The balanced tree is defined as one that has a height of log_2N where N is the total number of items in the tree; and the difference in height between the two subtrees is not greater than one.

Unbalanced trees have a worst-case complexity of O(N), balanced trees give us an average O(log_2N).

理解fanout in database

Binary trees have a fanout of just two.

2-3 Trees and other low-fanout trees have a similar limitation: while they are useful as in-memory data structures, small node size makes them impractical for external storage.

复制代码
* `High fanout` to improve locality of the neighboring keys
* `Low height` to reduce the number of seeks during traversal

Disk-Basked Structures

Most traditional algorithms were developed when spinning disks were the most widespread persistent storage medium, which significantly influenced their design.

On spinning disk, seeks increase costs of random reads because they require disk rotation and mechanical head movements to position the read/write head to the desired location. Once the expensive part is done, reading or writing contiguous bytes (i.e., sequential operations) is relatively cheap.

Head positioning is the most expensive part of an operation on the HDD.

Solid state drives (SSDs) do not have moving parts: there’s no disk that spins, or head that has to be positioned for the read. A typical SSD is built of memory cells , connected into strings (typeically 32 to 64 cells per string), strings are combined into arrays , arrays are combined into pages(range from 2 to 16 Kb), and pages are combined into blocks.

SSD concepts size
memory cells one or multiple bits
string 32-64 cells
array
page 2-16 Kb(smallest unit that can be written (programmed) or read)
block 64-512 pages(the smallest erase entity)
plane
die where plane are placed on
SSD one or more dies

The smallest unit that can be written (programmed) or read is a page, however we can only make changes to the empty memory cells (i.e., to ones that have been erased befor the write). The smallest erase entity is not a page, but a block that holds multiple pages, which is why it is often called an erase block.

pages in an empty block have to be written sequentially.

On-Disk Structures

Besides the cost of disk access itself, the main limitation and design condition for building efficient on-disk structures is the fact that the smallest unit of disk operation is a block.

Ubiquitous B-Trees

B-Trees can be thought of as a cast catalog room in the library

B-Trees build upon the foundataion of balanced search trees and are different in that they have higher fanout (have more child nodes) and smaller height.

While binary tree nodes are drawn as circles, B-Tree nodes are often drawn as rectangles.
在这里插入图片描述

B-Trees are sorted: keys inside the B-Tree nodes are stored in order. This implies that lookups in B-Trees have logarithmic complexity.

B-Trees consist of multiple nodes, each node holds up to N keys and N+1 pointers to the child nodes. These nodes are logically grouped into three groups:

复制代码
* Boot node
* Leaf nodes
* Internal nodes

B-Trees are a page organization technique, we often use terms node and page interchangeably.

Keys stored in B-Tree nodes are called index entries (sepatator keys, divider cells), they split the tree into subtrees (branches, subranges)

Page Structure

In B-Trees, we distinguish:

复制代码
* leaf nodes: hold keys and data records pair
* nonleaf nodes: hold keys and pointers to other nodes

Each B-Tree node occupies one page or multiple pages linked together.

Slotted page

To efficiently store variable-size records such as strings, binary large objects (BLOBs), etc., we can use an organization technique called slotted page or slot directory.

理解slotted page

Cell Layout

key cells

Key cells hold a separator key and a pointer to the page between two neighboring pointers

key-value cells

Key-value cells hold keys and data records associated with them

Database Transaction

A transaction is an indivisible logical unit of work in database management system, allowing you to represent multiple operations as a single step.

Distributed System

In a distributed system, we have several participants (also called processes, nodes, or replicas).

Each participant has its own local state.

CPUs are tiny distributed systems with links, processors, and communication protocols.

Algorithms define participant roles, exchanged messages, states, transitions, eecuted steps, properties of the delivery medium, timing assumptions, failure models, and other characteristics that describe processes and their interactions.

Consistency models

Consistency models describe concurrent executions and establish an order in which operations can be executed and made visible to the participants.

Concurrent and Parallel

When two sequences of steps execute concurrently, both of them are in progress, but only one of them is executed at any moment; Concurrent operations overlap in time;

If two sequences execute in parallel, their steps can be executed simultaneously; parallel operations are executed by multiple processors.

Network Partitions

When two or more servers cannot communicate with each other, we call the situation network partition.

全部评论 (0)

还没有任何评论哟~