Notes for CMU 15-445/645 Introduction to Database Systems.

Link to the course website.

Link to the course’s open-source database project.

Lecture 1: Relational Model & Algebra

Course info:

Grading Scheme

The final grade for the course will be based approximately on the following weights:

  • 15% — Homeworks (6 sets)
  • 45% — Programming Projects
  • 20% — Midterm Exam
  • 20% — Final Exam

A database is an organized collection of interrelated data that models some aspect of the real world. Databases are the core component of most computer applications.

A database management system (DBMS) is the software that manages a database. Among many things, a DBMS is responsible for inserting, deleting, and retrieving data from a database.

Flat File Strawman

A simple way to store data is to use a flat file (e.g. CSV). The application has to parse these files each time it wants to read or update any records.

Issues:

  • Data Integrity: How to ensure that the data is correct and consistent?
  • Implementation: Need to write a lot of code to implement the application logic.
  • Durability: How to ensure that the data is not lost due to crashes or other failures?

Data Models

A data model is a collection of concepts for describing the data in a database. Some examples include:

  • Relational (most common)
  • NoSQL (key/value, document, graph)
  • Array / Matrix / Vector (for machine learning)

A schema is a description of a particular collection of data using a given data model. This defines the structure of data for a data model. Otherwise, you have random bits with no meaning.

Relational Model

The relational data model defines three concepts:

  • Structure: The definition of relations and their contents independent of their physical representation. Each relation has a set of attributes. Each attribute has a domain of values.
  • Integrity: Ensure the database’s contents satisfy certain constraints. An example of a constraint would be that the age of a person cannot be a negative number.
  • Manipulation: Declarative API for accessing and modifying a database’s contents via relations (sets). Programmers only specify the desired result; the database system decides the most efficient query plan to execute.

The relational model provides data independence, which isolates the user/application from the low-level data representation. The user only worries about high-level application logic; the DBMS optimizes the layout according to the operating environment, database contents, and workload, and re-optimizes if/when these factors change. Changes in the physical data layout therefore do not break applications.

  • Relation (/the table): an unordered set of tuples describing relationships among attributes that represent entities. Duplicates are possible (SQL tables are bags), as long as their primary keys differ. A relation with n attributes is called an n-ary relation (/a table with n columns).
  • Tuple (/a row): a sequence of attribute values in the relation, each drawn from its attribute’s domain.
  • Primary key: uniquely identifies a single tuple in a table. Sometimes auto-generated by the DBMS.
  • Foreign key: an attribute from one relation maps to a tuple in another relation. Generally, the foreign key will point / be equal to a primary key in another table.
    • Benefit of foreign keys:
      • Data integrity: ensures that the data in the database is consistent and accurate.
      • Query optimization: allows the database to optimize queries.
    • But why are foreign keys abandoned in some large-scale applications?
      • IO latch competition/deadlock.
      • Tables may be in different databases.
    • Ways to replace foreign keys:
      • Application-level foreign keys: the application is responsible for maintaining the integrity of the data.
      • Soft deletes: use is_deleted flag instead of actually deleting records.
      • Message queues: use message queues to ensure that the data is consistent across different databases.
  • Constraints: a user-defined condition that must hold for any instance of the database (e.g. primary key constraint, foreign key constraint, not-null column…).

Data Manipulation Languages (DMLs)

DMLs refer to the API that a DBMS exposes to applications to store and retrieve information from a database. There are two classes of languages for manipulating a database:

  • Procedural: The query specifies the (high-level) execution strategy the DBMS should use to find the desired result, operating on sets/bags.
    • E.g. use a for loop to scan all records and count how many records are there to retrieve the number of records in the table.
  • Non-Procedural (Declarative): The query specifies only what data is wanted and not how to find it.
    • E.g. use SQL SELECT COUNT(*) FROM artist to count how many records are there in the table.

Relational Algebra

Relational Algebra is a set of fundamental operations to retrieve and manipulate tuples in a relation. Each operator takes in one or more relations as inputs, and outputs a new relation. To write queries we can ‘chain’ these operators together.

  • Select: takes in a relation and outputs a subset of the tuples from that relation that satisfy a selection predicate.
    • Syntax: $σ_{predicate}(R)$
    • Example: $σ_{a_{id}=2}(R)$; SQL: SELECT * FROM R WHERE a_id = 2
  • Projection: takes in a relation and outputs a relation with tuples that contain only specified attributes.
    • Syntax: $π_{A_1, A_2, …}(R)$
    • Example: $π_{b_{id}-100, a_{id}}(σ_{a_{id}=2}(R))$; SQL: SELECT b_id-100, a_id FROM R WHERE a_id = 2
  • Union: takes in two relations and outputs a relation that contains all tuples that appear in at least one of the input relations. Note: The two input relations have to have the exact same attributes.
    • Syntax: $R \cup S$
    • Example: $R \cup S$; SQL: (SELECT * FROM R) UNION ALL (SELECT * FROM S)
  • Intersection: takes in two relations and outputs a relation that contains all tuples that appear in both input relations. Note: The two input relations have to have the exact same attributes.
    • Syntax: $R \cap S$
    • Example: $R \cap S$; SQL: (SELECT * FROM R) INTERSECT (SELECT * FROM S)
  • Difference: takes in two relations and outputs a relation that contains all tuples that appear in the first input relation but not in the second input relation. Note: The two input relations have to have the exact same attributes.
    • Syntax: $R - S$
    • Example: $R - S$; SQL: (SELECT * FROM R) EXCEPT (SELECT * FROM S)
  • Product: takes in two relations and outputs a relation that contains all possible combinations of tuples from the two input relations. Note: The two input relations do not have to have the same attributes.
    • Syntax: $R \times S$
    • Example: $R \times S$; SQL: (SELECT * FROM R) CROSS JOIN (SELECT * FROM S) or SELECT * FROM R, S
  • Join: takes in two relations and outputs a relation that contains all combinations of tuples from the two input relations that satisfy a join predicate.
    • Syntax: $R \bowtie_{predicate} S$
    • Example: $R \bowtie S$; SQL: SELECT * FROM R NATURAL JOIN S or SELECT * FROM R JOIN S ON predicate or SELECT * FROM R JOIN S USING (common_attribute)
      • NATURAL JOIN: automatically joins two tables based on all columns with the same name.
      • JOIN ... USING: automatically joins two tables based on the specified common attribute.
      • JOIN ... ON: joins two tables based on the specified join predicate. This WILL NOT remove duplicate columns in the output.
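As a toy illustration (not how a real DBMS implements them), select, projection, and natural join can be sketched over Python lists of dicts; the relations R and S below are made up:

```python
# Toy relational operators: a relation is a list of dicts (tuples).

def select(relation, predicate):
    """sigma: keep tuples satisfying the predicate."""
    return [t for t in relation if predicate(t)]

def project(relation, attrs):
    """pi: keep only the listed attributes."""
    return [{a: t[a] for a in attrs} for t in relation]

def natural_join(r, s):
    """bowtie: combine tuples that agree on all shared attribute names."""
    shared = set(r[0]) & set(s[0]) if r and s else set()
    out = []
    for tr in r:
        for ts in s:
            if all(tr[a] == ts[a] for a in shared):
                out.append({**tr, **ts})
    return out

R = [{"a_id": 1, "b_id": 101}, {"a_id": 2, "b_id": 102}]
S = [{"b_id": 102, "val": "x"}]

# pi_{b_id}(sigma_{a_id=2}(R)) -- operators chained together
print(project(select(R, lambda t: t["a_id"] == 2), ["b_id"]))  # [{'b_id': 102}]
print(natural_join(R, S))  # [{'a_id': 2, 'b_id': 102, 'val': 'x'}]
```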

Procedural vs Declarative:

For example, $\sigma_{b_{id}=102}(R \bowtie S)$ represents joining $R$ and $S$ and then selecting / filtering the result. However, $(R \bowtie (\sigma_{b_{id}=102}(S)))$ will do the selection on $S$ first, and then join the result of the selection with $R$. These two statements will always produce the same answer. However, if $S$ has 1 billion tuples and there is only 1 tuple in $S$ with $b_{id}=102$, then:

  • $\sigma_{b_{id}=102}(R \bowtie S)$ will first join $R$ and $S$, which will produce a huge intermediate result, and then filter the result to get the final answer. This is very inefficient.
  • $(R \bowtie (\sigma_{b_{id}=102}(S)))$ will first filter $S$ down to the single tuple with $b_{id}=102$, and then join $R$ with that tuple, producing a much smaller intermediate result. This is much more efficient.

A better approach is to state the high-level result you want, and the DBMS will figure out the best way to execute the query.

Other Data Models

  • Document Model: Hierarchy of key-value pairs (JSON). High flexibility, but poor integrity. Good for rapid prototyping and semi-structured data.
  • Vector Model: One-dimensional arrays (Embeddings). Core task: Similarity Search (NN). Critical for AI/LLM applications. Relational DBs are now integrating this (e.g., pgvector).

Lecture 2: Modern SQL

Relational Languages

  • Data Manipulation Language (DML): API for accessing and modifying a database’s contents via relations (sets).
  • Data Definition Language (DDL): API for defining and modifying the database schema.
  • Data Control Language (DCL): API for controlling access to the database (e.g., GRANT, REVOKE).

Aggregate Functions

An aggregation function takes in a bag of tuples as its input and then produces a single scalar value as its output. Aggregate functions can (almost) only be used in a SELECT output list.

Examples: AVG, COUNT, MAX, MIN, SUM.

Some aggregate functions (e.g. COUNT, SUM, AVG) support the DISTINCT keyword.

Note: COUNT(column) counts the number of NON-NULL values, while COUNT(*), COUNT(1), COUNT(1+1+1)… counts the number of rows.
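This NULL-counting behavior can be checked with Python’s built-in sqlite3 module (the single-column table is made up for illustration):

```python
# COUNT(column) skips NULLs; COUNT(*) counts all rows.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INT)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (3,)])

count_star = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
count_x = con.execute("SELECT COUNT(x) FROM t").fetchone()[0]
print(count_star, count_x)  # 3 2 -- COUNT(x) skipped the NULL row
```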

Non-aggregated values in the SELECT output clause must appear in the GROUP BY clause. GROUP BY partitions the tuples based on those values and computes the aggregates for each subset, so every group has a single value for each non-aggregated column.

Example: Get the average GPA of students in each course.

SELECT AVG(s.gpa), e.cid
FROM enrolled AS e
  JOIN student AS s
    ON e.sid = s.sid
GROUP BY e.cid;

GROUPING SETS allows us to specify multiple groupings in the same query. This is useful when we want to get aggregates for different levels of granularity.

Example: Get the count of students by each course and grade, the count of students by course, and the total student count.

SELECT c.name AS c_name, e.grade,
  COUNT(*) AS num_students
FROM enrolled AS e
  JOIN course AS c ON e.cid = c.cid
GROUP BY GROUPING SETS (
  (c.name, e.grade),
  (c.name),
  ()
);
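Not every engine supports GROUPING SETS (SQLite, for example, does not); the same result can be emulated as a UNION ALL of one GROUP BY query per grouping set. A runnable sketch over an illustrative mini schema:

```python
# Emulating GROUPING SETS ((cid, grade), (cid), ()) with UNION ALL in SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE enrolled (sid INT, cid TEXT, grade TEXT);
INSERT INTO enrolled VALUES (1,'15-445','A'), (2,'15-445','B'), (3,'15-721','A');
""")
rows = con.execute("""
SELECT cid, grade, COUNT(*) FROM enrolled GROUP BY cid, grade
UNION ALL
SELECT cid, NULL, COUNT(*) FROM enrolled GROUP BY cid   -- per-course totals
UNION ALL
SELECT NULL, NULL, COUNT(*) FROM enrolled               -- grand total
""").fetchall()
for r in rows:
    print(r)
```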

The FILTER clause filters results before aggregation computation (i.e. filter out rows going into the aggregate function). This allows computing multiple aggregates on the same raw data but with different conditions. It is helpful for pivoting rows to columns, among other things.

Example: Get the average GPA of students enrolled in 15-445 and 15-721.

SELECT
  AVG(s.gpa) FILTER(WHERE e.cid = '15-445') AS intro_db_avg_gpa,
  AVG(s.gpa) FILTER(WHERE e.cid = '15-721') AS adv_db_avg_gpa
FROM enrolled AS e
  JOIN student AS s
ON e.sid = s.sid;

The HAVING clause filters output results based on aggregation computation (i.e. filters out groups as opposed to filtering rows which is what the WHERE clause does). This makes HAVING behave like a WHERE clause for a GROUP BY.

Example: Get the set of courses in which the average student GPA is greater than 3.9.

SELECT AVG(s.gpa), e.cid
FROM enrolled AS e
  JOIN student AS s
ON e.sid = s.sid
GROUP BY e.cid
HAVING AVG(s.gpa) > 3.9;

String Operations

  • '%': matches any sequence of zero or more characters.
  • '_': matches any single character.
  • SIMILAR TO: supports more complex patterns using regular expressions.
  • String functions: CONCAT (or ||, +), SUBSTRING, UPPER, LOWER

Date and Time

Date and time support differs wildly across systems.

Output Control

  • ORDER BY ... (ASC|DESC): sorts the output by the specified columns in ascending (by default) or descending order.
  • LIMIT n OFFSET m: limits the output to the first n rows, starting from row m.

Output Redirection

Instead of having the result of a query returned to the client (e.g., terminal), you can tell the DBMS to store the results in another table. You can then access this data in subsequent queries.

  • New Table: Store the output of the query into a new (permanent) table.
    • SELECT DISTINCT cid INTO CourseIds FROM enrolled;
  • Existing Table: Store the output of the query into a table that already exists in the database. The query must produce the same number of columns with the same types as the target table, but the column names do not have to match.
    • INSERT INTO CourseIds (SELECT DISTINCT cid FROM enrolled);
  • Temporary Table: Store the output of the query into a temporary table created during the insertion. This table can then be used until the client disconnects.
    • SELECT DISTINCT cid INTO TEMPORARY CourseIds FROM enrolled;

Nested Queries

Invoke a query inside of another query to compose more complex computations.

The scope of the outer query is included in an inner query (i.e. the inner query can access attributes from the outer query). The opposite is not true.

  • SELECT Output Targets:
    SELECT (SELECT 1) AS one FROM student;
    
  • FROM Clause:
    SELECT name
    FROM student AS s, (SELECT sid FROM enrolled) AS e
    WHERE s.sid = e.sid;
    
  • WHERE Clause:
    SELECT name FROM student
    WHERE sid IN ( SELECT sid FROM enrolled );
    

Result expressions:

  • ALL: returns true if all rows in the subquery satisfy the condition.
  • ANY: returns true if at least one row in the subquery satisfies the condition.
  • IN: equivalent to = ANY().
  • EXISTS: returns true if the subquery returns at least one row.
  • NOT: negates the condition.

Example: Find all courses that have no students enrolled in it.

SELECT * FROM course
WHERE NOT EXISTS(
  SELECT * FROM enrolled
  WHERE course.cid = enrolled.cid
);

Lateral Joins

The LATERAL operator allows a nested query to reference attributes in other nested queries that precede it. You can think of lateral joins like a for loop that allows you to invoke another query for each tuple in a table.

Example: Calculate the number of students enrolled in each course and the average GPA. Sort by enrollment count in descending order.

SELECT * FROM course AS c,
  LATERAL (SELECT COUNT(*) AS cnt
    FROM enrolled AS e WHERE e.cid = c.cid) AS enrollment_count,
  LATERAL (SELECT AVG(s.gpa) AS avg_gpa FROM student AS s
    JOIN enrolled AS e ON s.sid = e.sid
    WHERE e.cid = c.cid) AS average_gpa
ORDER BY enrollment_count.cnt DESC;
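SQLite does not support LATERAL, but the same per-course counts and averages can be computed with correlated scalar subqueries; a runnable sketch over an illustrative mini schema:

```python
# Per-course enrollment count and average GPA via correlated scalar subqueries.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE course (cid TEXT);
CREATE TABLE student (sid INT, gpa REAL);
CREATE TABLE enrolled (sid INT, cid TEXT);
INSERT INTO course VALUES ('15-445'), ('15-721');
INSERT INTO student VALUES (1, 4.0), (2, 3.0);
INSERT INTO enrolled VALUES (1, '15-445'), (2, '15-445');
""")
rows = con.execute("""
SELECT c.cid,
  (SELECT COUNT(*) FROM enrolled e WHERE e.cid = c.cid) AS cnt,
  (SELECT AVG(s.gpa) FROM student s JOIN enrolled e ON s.sid = e.sid
    WHERE e.cid = c.cid) AS avg_gpa
FROM course c
ORDER BY cnt DESC
""").fetchall()
print(rows)  # [('15-445', 2, 3.5), ('15-721', 0, None)]
```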

Common Table Expressions (CTEs)

Common Table Expressions (CTEs) are an alternative to windows or nested queries when writing more complex queries. They provide a way to write auxiliary statements for use in a larger query. A CTE can be thought of as a temporary table that is scoped to a single query.

Example: Generate a CTE called cteName that contains a single tuple with a single attribute set to 1. Select all attributes from cteName.

WITH cteName AS (
  SELECT 1
)
SELECT * FROM cteName;

Adding the RECURSIVE keyword after WITH allows a CTE to reference itself. This enables the implementation of recursion in SQL queries. With recursive CTEs, SQL is provably Turing-complete, implying that it is as computationally expressive as more general purpose programming languages (ignoring the fact that it is a bit more cumbersome).

Example: Print the sequence of numbers from $1$ to $10$.

WITH RECURSIVE cteSource (counter) AS (
  ( SELECT 1 )
  UNION
  ( SELECT counter + 1 FROM cteSource
    WHERE counter < 10 )
)
SELECT * FROM cteSource;
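This recursive CTE runs unchanged on SQLite, via Python’s sqlite3 module:

```python
# WITH RECURSIVE counts from 1 to 10: the anchor row seeds the CTE, and the
# recursive member keeps adding 1 until the WHERE condition fails.
import sqlite3

con = sqlite3.connect(":memory:")
rows = con.execute("""
WITH RECURSIVE cteSource (counter) AS (
  SELECT 1
  UNION
  SELECT counter + 1 FROM cteSource
  WHERE counter < 10
)
SELECT * FROM cteSource
""").fetchall()
print([r[0] for r in rows])
```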

Window Functions

A window function performs a “sliding” calculation across a set of related tuples. Window functions are similar to aggregations, but the tuples are not collapsed into a single output tuple.

Note: The DBMS computes RANK() after the window’s ORDER BY sorting, whereas it computes ROW_NUMBER() before the sorting.

Example: Find the student with the second highest grade for each course.

SELECT * FROM (
  SELECT *, RANK() OVER (PARTITION BY cid ORDER BY grade ASC) AS rank
  FROM enrolled
) AS ranking
WHERE ranking.rank = 2;
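The same ranking pattern runs on SQLite (window functions require SQLite 3.25+); the sample enrollment data below is made up, with letter grades sorting ascending so 'A' gets rank 1:

```python
# RANK() over each course partition; the outer query keeps only rank 2.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE enrolled (sid INT, cid TEXT, grade TEXT);
INSERT INTO enrolled VALUES (1,'15-445','A'), (2,'15-445','B'), (3,'15-445','C');
""")
rows = con.execute("""
SELECT * FROM (
  SELECT *, RANK() OVER (PARTITION BY cid ORDER BY grade ASC) AS rnk
  FROM enrolled
) AS ranking
WHERE ranking.rnk = 2
""").fetchall()
print(rows)  # [(2, '15-445', 'B', 2)]
```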

Lecture 3: Database Storage (Part I)

Storage Hierarchy

Summary: The further you get away from the CPU, the larger but slower the storage devices get.

  • Volatile Storage: CPU Registers, Cache, Main Memory (RAM).
    • Random access; byte addressable. All called memory in this class.
  • Non-Volatile Storage: Solid State Drives (SSDs), Hard Disk Drives (HDDs), Network Storage.
    • Block access; block addressable.

Access Time and Access Pattern

  • Access time to memory $\ll$ access time to disk.
  • In memory, random access $\approx$ sequential access.
  • On disk, random access $\gg$ sequential access.
    • Algorithms want to maximize sequential accesses.

Disk-Oriented DBMS Overview

The database is stored on disk, and the data within the database files are organized into pages, with the first page being the directory page. To operate on the data, the DBMS needs to bring the data into memory. It does this by having a buffer pool manager that manages the data movement back and forth between disk and memory.

File Storage

  • File format: different from how the OS stores files.
  • File system: most DBMSs use the file system provided by the OS.
  • File storage manager: is responsible for managing the files on disk.

Database Pages

A page is a fixed-size block of data.

  • It can contain tuples, meta-data, indexes, log records…
  • Most systems do not mix page types.
  • Some systems require a page to be self-contained.

Each page is given a unique identifier (page ID).

  • A page ID could be unique per DBMS instance, per database, or per table.
  • The DBMS uses an indirection layer to map page IDs to physical locations.

DBMSs specializing in read-heavy workloads tend to have larger page sizes (1 MB or larger).

  • Fetching a single page brings in many tuples that are needed for a query.

DBMSs specializing in write-heavy workloads tend to have smaller page sizes (4-16 KB).

  • The system must write an entire page to disk even if only a small portion of it is modified.

Database Heap

A heap file is an unordered collection of pages with tuples that are stored in random order.

  • Create / Get / Write / Delete Page
  • Must also support iterating over all pages.

Need additional meta-data to track location of files and free space availability.

Page Layout

Header (meta-data: page size, checksum, DBMS version…) + Content

  • Row-oriented storage mode
    • Tuple-oriented: slotted pages
    • Log-structured, Index-organized…

Slotted Pages: Header keeps track of the number of used slots, the offset of the starting location of the last used slot, and a slot array, which keeps track of the location of the start of each tuple. To add a tuple, the slot array will grow from the beginning to the end, and the data of the tuples will grow from end to the beginning. The page is considered full when the slot array and the tuple data meet.
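A toy slotted page in Python, assuming a 4-byte slot entry and ignoring the on-disk header format (both are illustrative choices):

```python
# Slot array grows from the front; tuple data grows down from the back.
# The page is full when the two regions would meet.

PAGE_SIZE = 4096

class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.slots = []            # slot array: (offset, length) per tuple
        self.free_end = PAGE_SIZE  # tuple data grows down from here

    def insert(self, tup: bytes) -> int:
        slot_bytes = 4 * (len(self.slots) + 1)  # assume 4 bytes per slot entry
        if self.free_end - len(tup) < slot_bytes:
            raise ValueError("page full")
        self.free_end -= len(tup)
        self.data[self.free_end:self.free_end + len(tup)] = tup
        self.slots.append((self.free_end, len(tup)))
        return len(self.slots) - 1  # slot number: one component of a record ID

    def get(self, slot: int) -> bytes:
        off, length = self.slots[slot]
        return bytes(self.data[off:off + length])

page = SlottedPage()
rid = page.insert(b"hello")
print(page.get(rid))  # b'hello'
```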

Record ID, RID: a unique record identifier that represents its physical location in the database. Applications should never rely on these IDs to mean anything.

Tuple Layout

  • Header: Visibility (for concurrency control), Bitmap for NULL values.
  • Data
    • Attributes are typically stored in the order in which they were specified at table creation, though the DBMS may physically reorder them (e.g. for alignment).
    • Attributes must be word aligned.
    • Most DBMSs do not allow a tuple to exceed the size of a page.

Data Representation

  • INTEGER/BIGINT/SMALLINT/TINYINT: stored as native C/C++ integer types.
  • FLOAT/REAL vs. NUMERIC/DECIMAL
    • IEEE-754 Standard (C/C++ native, efficient) / Fixed-point Decimals (allows arbitrary precision and scale, e.g. financial applications).
  • VARCHAR / VARBINARY / TEXT / BLOB
    • Stored as a header containing the length, followed by the data bytes, or as a pointer to a data page and offset.
    • External storage: the DBMS has no durability and transaction guarantees on it.
  • TIME / DATE / TIMESTAMP / INTERVAL
    • Stored as a 32/64-bit integer of micro-/milliseconds since the Unix epoch.
  • Null Data Types
    • Null Column Bitmap Header (Most common approach in row-stores)
      • Store bitmap in centralized header, bit is set when the corresponding attribute is NULL.
    • Special Values (Most common in column-stores)
      • Use a special placeholder for NULL for a data type (e.g. INT32 MIN).
    • Per-Attribute Null Flag (Undesirable because of the storage overhead)
      • Stores a per-attribute flag that marks whether the value is null.

DBMS vs OS

OS is not your friend.

We do not advise using mmap in a DBMS for correctness and performance reasons.

  • No Control: OS decides when to evict/flush pages.
  • Blocking: Page faults block threads, hurting concurrency.
  • Correctness: Difficult to handle write errors and maintain transactional guarantees.

Even though the OS provides facilities (madvise, mlock, msync) that seem to address these problems, having the DBMS implement these procedures itself gives it better control and performance.

Lecture 4: Memory & Disk Management

Introduction

Two main things that we will try to optimize for:

  • Spatial Control: where pages are physically located on disk.
    • The goal of spatial control is to keep pages that are used together often as physically close together as possible on disk. This can potentially help with prefetching and other optimizations.
  • Temporal Control: when pages have been brought into memory and when they should be written back out to disk.
    • Temporal control aims to minimize the number of stalls from having to read data from disk.

Buffer Pool

Buffer pool: an in-memory cache of pages between memory and disk. Its entries are called frames.

Buffer pool meta-data: page table, dirty flag, pin counter, access tracking info…

  • The page directory is the mapping from page ids to page locations on the physical database files. All changes must be recorded on disk to allow the DBMS to find them on restart.
  • The page table is the mapping from page ids to a copy of the page in buffer pool frames. This is an in-memory data structure that does not need to be stored on disk.

OS vs DBMS

  • Lock vs Latch

|  | Lock | Latch |
| --- | --- | --- |
| Protect | Logical database objects | Internal DS |
| Goal | Isolate different transactions | Protect critical sections in internal DS |
| Duration | Until end of transaction | During critical section execution |
| Rollback | Require support | No |
| Visibility | Visible to user (e.g. SELECT ... FOR UPDATE) | Invisible to user |
| Implementation | Lock table | Mutex/Spinlock/Atomic operations |
| *New for Lecture 10* |  |  |
| Modes | Shared/Exclusive/Update/Intention | Read/Write |
| Deadlock | Detection & Resolution | Avoidance |
| …by… | Waits-for, Timeout, Aborts | Coding Discipline |
| Kept in | Lock Manager | Protected Data Structure |
  • Again, why not OS?
    • Transaction Safety: The OS can flush dirty pages at any time.
    • I/O Stalls: The DBMS doesn’t know which pages are in memory. The OS might stall a thread on a page fault.
    • Error Handling: It is difficult to validate pages. Any page access can cause a SIGBUS signal that the DBMS then must handle.
    • Performance Issues: Internal OS data structure contention. TLB shootdowns.

Buffer Replacement Policies

  • Least Recently Used, LRU
  • CLOCK
    • Each page has an access bit.
    • When a page is accessed, set its bit to 1.
    • As the hand visits each page, check if its access bit is set to 1. If yes, set it to zero. If no, then evict.
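A minimal CLOCK sketch in Python (frame bookkeeping only, no actual pages):

```python
# CLOCK replacement: one reference bit per frame; the hand clears set bits
# (second chance) and evicts the first frame whose bit is already 0.

class Clock:
    def __init__(self, nframes):
        self.ref = [0] * nframes
        self.hand = 0

    def access(self, frame):
        self.ref[frame] = 1  # grants the frame a second chance

    def evict(self):
        while True:
            if self.ref[self.hand] == 1:
                self.ref[self.hand] = 0  # clear the bit, move on
                self.hand = (self.hand + 1) % len(self.ref)
            else:
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.ref)
                return victim

c = Clock(3)
c.access(0); c.access(1)  # frames 0 and 1 recently used, frame 2 not
print(c.evict())  # 2
```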

Both are susceptible to sequential flooding: A query performs a sequential scan that reads every page in a table one or more times (e.g., nested-loop joins). This pollutes the buffer pool with pages that are read once and then never again.

  • LRU-K: track the K most recent accesses to each page. Evict the page with the oldest K-th access.
    • MySQL Approximate LRU-K: Single LRU linked list but with two entry points (“old” vs “young”).
  • Adaptive Replacement Cache, ARC:
    • Recency list: track pages by recency of access (like LRU).
    • Frequency list: track pages by frequency of access (like LFU).
    • Target size: parameter adjusts the factor of recency vs frequency.

Dirty page: a page that has been modified in memory but not yet written back to disk.

  • Background writing: A separate thread periodically scans the buffer pool for dirty pages and writes them back to disk.

Disk IO and OS Cache

Summary: The DBMS has its own disk scheduler.

Optimizations

  • Multiple Buffer Pools.
  • Prefetching: bring pages into memory before they are needed.
  • Scan Sharing (Synchronized Scans): query cursors can attach to an in-progress scan and reuse the data it retrieves from storage or operator computations; redundant scans are especially wasteful when we pay per I/O.
  • Buffer Pool Bypass: The sequential scan operator will not store fetched pages in the buffer pool to avoid overhead. Instead, memory is local to the running query. This works well if an operator needs to read a large sequence of pages that are contiguous on disk and will not be used again. Buffer Pool Bypass can also be used for temporary data (sorting, joins).

Lecture 5: Database Storage (Part II)

Page Layout, Lecture 3

Header (meta-data: page size, checksum, DBMS version…) + Content

  • Row-oriented storage mode
    • Tuple-oriented: slotted pages
    • Log-structured, Index-organized…

Tuple-oriented

  • Search: check the page directory to find the page position; fetch the page; use the slot array to find the tuple position; fetch the tuple.
  • Insert: check the page directory to find a page with free space; check if the tuple can fit in the page (if not, find another page/create a new page); insert the tuple into the page; update the slot array and the page header.
  • But the cost of updating/deleting can be high: if an in-place update is not possible, the old tuple must be marked as deleted and a new tuple inserted, leading to:
    • Fragmentation
    • Useless Disk IO
    • Random Disk IO

Log-structured

Each log entry represents a tuple PUT/DELETE operation.

  • MemTable: in memory; ordered; mutable.
  • SSTable: on disk; ordered; immutable.
  • Flush: when the MemTable is full, flush it to disk as a new SSTable.
  • Read: check the MemTable first; if not found, check the SSTables in order of recency.
    • Binary search in each SSTable to find the key.
    • Use SummaryTable to track metadata (e.g. min/max key) of the SSTable to skip irrelevant SSTables.
  • Compaction: periodically merge SSTables to reduce the number of SSTables and improve read performance.
    • Leveled Compaction: SSTables are organized into levels. Each level has a size limit, and when the limit is exceeded, the SSTables in that level are merged and moved to the next level.
      • Level 0: new SSTables; possible key range overlap
      • Level 1+: no key range overlap
      • If the merged SSTable overlaps with some higher-level SSTables, then those SSTables will also be merge-sorted and split (according to size limit).
      • Better for read-heavy workloads.
    • Universal Compaction: only one level, merge when too many SSTables overlap in key ranges or exceed size thresholds. Better for insert-heavy workloads.
  • Tradeoffs:
    • Fast sequential writes, good for append only storage.
    • Reads may be slow.
    • Compaction is expensive.
    • Write amplification (for each logical write, there could be multiple physical writes during the compaction process).
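The MemTable/SSTable read path can be sketched in Python; ToyLSM below is illustrative and skips compaction, summary tables, and real disk I/O:

```python
# Toy LSM read path: check the mutable MemTable first, then immutable
# SSTables from newest to oldest. A None value is a DELETE tombstone.
import bisect

class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}   # in memory, mutable
        self.sstables = []   # sorted, immutable runs; newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def delete(self, key):
        self.put(key, None)  # tombstone

    def flush(self):
        # MemTable full: write it out as a new sorted, immutable SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            i = bisect.bisect_left(table, (key,))  # binary search per SSTable
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

db = ToyLSM()
db.put("a", 1); db.put("b", 2)   # second put fills the MemTable -> flush
db.put("a", 10)                  # newer version shadows the flushed one
print(db.get("a"), db.get("b"))  # 10 2
```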

Index-organized

In the index-organized storage scheme, the DBMS directly stores a table’s tuples as the value of an index data structure (e.g. B+ tree, skip list, trie). The DBMS uses a page layout similar to a slotted page, and tuples are typically sorted in the page based on key.

System Catalogs

A DBMS stores meta-data about databases in its internal catalogs.

  • Tables, columns, indexes, views
  • Users, permissions
  • Internal statistics

Lecture 6: Storage Models & Compression

Database Workloads

  • On-Line Transaction Processing (OLTP): fast operations that only read/update a small amount of data each time (simple queries, write-heavy).
  • On-Line Analytical Processing (OLAP): complex queries that read a lot of data to compute aggregates (complex queries, read-heavy).
  • Hybrid Transaction + Analytical Processing (HTAP): OLTP + OLAP together on the same database instance.

Storage Models

  • N-ary Storage Model (NSM): Row store; stores (almost) all attributes for a single tuple contiguously in a single page; ideal for OLTP.
    • Advantage
      • Fast insert/update/delete.
      • Better for queries that access most/all attributes of a tuple.
      • Can use index-oriented physical storage for clustering.
    • Disadvantage
      • Inefficient for scanning large portions of the table and/or a subset of the attributes.
      • Poor memory locality in access patterns.
      • Difficult to apply compression because of multiple value domains within a single page.
  • Decomposed Storage Model (DSM): Column store; stores each attribute of a tuple in a separate page; ideal for OLAP.
    • Advantage
      • Reduces the amount of I/O wasted per query because the DBMS only reads the attributes that it needs for a given query.
      • Better (faster) query processing because of increased locality and cached data reuse.
      • Better data compression. (see later…)
    • Disadvantage
      • Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching.
    • Put the tuple back together:
      • Fixed-length offsets: the value in a given column will belong to the same tuple as the value in another column at the same offset.
      • Embedded tuple ID: BAD PRACTICE.
  • Partition Attributes Across (PAX): horizontally partition data into row groups. Then vertically partition their attributes into column chunks. Each row group contains its own meta-data header about its contents.
    • Goal: get the benefit of faster processing on columnar storage while retaining the spatial locality benefits of row storage.
    • Note that when DBMS say they are a column store, they are most likely using PAX.

Database Compression

Idea: Compress to reduce disk IO and memory usage.

  • Must produce fixed-length compressed values, except for variable-length data stored in a separate pool.
  • Postpone decompression for as long as possible during query execution (i.e. late materialization).
  • Must be lossless.

Compression granularity:

  • Block Level: Compress a block of tuples for the same table.
  • Tuple Level: Compress the contents of the entire tuple (NSM only).
  • Attribute Level: Compress a single attribute value within one tuple. Can target multiple attributes for the same tuple.
  • Columnar Level: Compress multiple values for one or more attributes stored for multiple tuples (DSM only). This allows for more complicated compression schemes.

Naive Compression

Compress with general purpose algorithms.

  • Disadvantage
    • Need to decompress the entire block to access any value.
    • Not designed for database workloads, so may not be very effective.

Columnar Compression

  • Run-Length Encoding (RLE)
    • Compress consecutive repeated values into (value, start offset, run length) triplets.
    • Best for low-cardinality, sorted columns containing many repeated values.
    • Poor when data is random.
  • Bit-Packing Encoding
    • When all values for an attribute are less than the value’s declared largest size, store them with fewer bits.
    • Patching / Mostly Encoding: variant that uses a special marker to indicate when a value exceeds the largest size and then maintains a look-up table to store them. Use when values are ‘mostly’ less than the largest size.
  • Bitmap Encoding
    • For each distinct value, create a bitmap with a bit for each tuple. The bit is set to 1 if the tuple has that value and 0 otherwise.
  • Delta Encoding
    • Store the difference between each value and a base value.
    • Combine with RLE for better compression.
  • Dictionary Encoding
    • The DBMS replaces frequent patterns in values with smaller codes, then stores only these codes and a data structure that maps codes to their original values (the dictionary).
    • Need to encode and decode efficiently.
    • Need to be order-preserving for range queries.
    • Best example: trie + string match.
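As a concrete sketch of two of these schemes (illustrative Python, not any particular DBMS's implementation), run-length encoding and an order-preserving dictionary might look like:

```python
def rle_encode(column):
    """Compress a column into (value, start_offset, run_length) triplets.
    Works best on sorted, low-cardinality columns."""
    runs = []
    for offset, value in enumerate(column):
        if runs and runs[-1][0] == value:
            v, start, length = runs[-1]
            runs[-1] = (v, start, length + 1)
        else:
            runs.append((value, offset, 1))
    return runs

def dict_encode(column):
    """Replace each value with a small integer code. Assigning codes in sorted
    value order makes the dictionary order-preserving, so range predicates can
    be evaluated directly on the codes."""
    dictionary = {v: code for code, v in enumerate(sorted(set(column)))}
    return [dictionary[v] for v in column], dictionary

col = ["M", "M", "M", "F", "F", "M"]
print(rle_encode(col))      # [('M', 0, 3), ('F', 3, 2), ('M', 5, 1)]
print(dict_encode(col)[0])  # [1, 1, 1, 0, 0, 1]
```

Because codes are assigned in sorted value order, a predicate like `value < 'M'` can be rewritten as `code < 1` without decompressing.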

Bifurcated Environment

Idea: Use OLTP for online transactions and OLAP for analytical queries. When new data arrives:

  • Extract, Transform, Load (ETL): periodically extract data from OLTP, transform it to fit the OLAP schema, and load it into the OLAP database.

How can we use only one system? Exploits the temporal nature of data:

  • Data is “hot” when it enters the database.
  • As a tuple ages, it is updated less frequently.

Fractured Mirror: The DBMS automatically maintains a second copy of the database in a DSM layout.

  • All updates are first entered in NSM then eventually copied into DSM mirror.
  • If the DBMS supports updates, it must invalidate tuples in the DSM mirror.

Delta Store: Stage updates to the database in an NSM table. A background thread migrates updates from delta store and applies them to DSM data.

  • Batch large chunks and then write them out as a PAX file.
  • Delete records in the delta store once they are in column store.
  • A tuple can only exist in either NSM or DSM, not both.

Lecture 7: Hash Table

Data Structures

Different data structures are used in different parts of a DBMS:

  • Internal Meta-Data: page tables, page directories…
  • Core Data Storage: actual tuples
  • Temporary Data Storage: intermediate results during query execution (e.g. hash tables for joins, sorting buffers…)
  • Table Indices: additional data structures that help efficiently locate specific tuples.

Key design decisions:

  • Data Organization: how the data layout affects the efficiency.
  • Concurrency: how multiple threads can access and modify the data structure concurrently.

Hash Table

  • Time Complexity: O(1) on average. But remember that the constant matters.
  • Space Complexity: O(n).
  • Hash function: maps a key to a hash value.
    • In DB, we care about speed and collision rate, but not (key) security.
  • Hashing Schemes: How to handle collisions? …

Static Hashing Schemes

Static hashing schemes: the size of the hash table is fixed and known ahead of time.

  • Linear Probe Hashing
    • Insertion: For insertions, when a collision occurs, we linearly search the subsequent slots until an open one is found, looping around from the end to the start of the array if necessary.
    • Lookup: For lookups, we can check the slot the key hashes to, and search linearly until we find the desired entry.
    • Deletion: Set a tombstone flag.
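A minimal sketch of this scheme (illustrative, with two simplifications: no resizing, so the table is assumed never to fill up, and a tombstone slot is reused on insert without checking whether the key already exists further along the probe chain):

```python
EMPTY, TOMBSTONE = object(), object()

class LinearProbeHashTable:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        # Start at the slot the key hashes to, scan forward, wrap around.
        i = hash(key) % len(self.slots)
        while True:
            yield i
            i = (i + 1) % len(self.slots)

    def insert(self, key, value):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY or slot is TOMBSTONE or slot[0] == key:
                self.slots[i] = (key, value)
                return

    def lookup(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:                  # never-used slot: key is absent
                return None
            if slot is not TOMBSTONE and slot[0] == key:
                return slot[1]

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return
            if slot is not TOMBSTONE and slot[0] == key:
                self.slots[i] = TOMBSTONE      # keep later probes scanning past
                return
```

Setting the slot to a tombstone (rather than back to empty) is what keeps lookups for keys further along the probe chain from terminating early.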

Issues and solutions:

  • Storing Keys and Values
    • Fixed length: Store the key and value together in the hash table.
      • Can also store hash(key) to accelerate lookups.
    • Variable length: Insert the (key, value) into a temp table and use the hash table to store (hash(key), rid).
  • Non-unique keys
    • Separate Linked List? Can overflow.
    • Redundant Keys: Store duplicate key entries together in the hash table. This is what most systems do. Because linear probe hashing already stores full keys in the table to resolve collisions, non-unique keys can be handled without any additional data structures.
  • Optimizations
    • Specialized hashing schemes.
    • Store metadata in a separate array (e.g. packed bitmap to track whether a slot is empty, occupied, or deleted).
    • Use table + slot versioning to quickly invalidate all entries.

Cuckoo Hashing

Instead of using a single hash table, this approach maintains multiple hash tables with different hash functions. The hash functions use the same algorithm (e.g., XXHash, CityHash) but generate different hashes for the same key by using different seed values.

  • When we insert, we check every table and choose one that has a free slot
    • If multiple have one, we can compare things like load factor, or more commonly, just choose a random table.
  • If no table has a free slot, we choose one (typically at random), evict its old entry, and rehash that entry into a different table.
    • In rare cases, we may end up in a cycle. If this happens, we can rebuild all of the hash tables with new hash function seeds (less common) or rebuild the hash tables using larger tables (more common).

Cuckoo hashing guarantees O(1) lookups and deletions, but insertions may be more expensive.

Professor’s note: The essence of cuckoo hashing is that multiple hash functions map a key to different slots. In practice, cuckoo hashing is implemented with multiple hash functions that map a key to different slots in a single hash table. Further, as hashing may not always be O(1), cuckoo hashing lookups and deletions may cost more than O(1).
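Following the professor's note, here is a sketch using two seeded hash functions over a single table (capacity, seeds, and the kick limit are all illustrative choices):

```python
class CuckooHashTable:
    """Cuckoo hashing sketch: two seeded hash functions over one table, with
    eviction on collision and a rebuild into a larger table on a likely cycle."""

    def __init__(self, capacity=11, max_kicks=32):
        self.capacity, self.max_kicks = capacity, max_kicks
        self.slots = [None] * capacity

    def _h(self, key, seed):
        # Same algorithm, different seeds -> different slots for the same key.
        return hash((seed, key)) % self.capacity

    def lookup(self, key):
        for seed in (0, 1):                        # only two slots ever checked
            slot = self.slots[self._h(key, seed)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def insert(self, key, value):
        entry = (key, value)
        for _ in range(self.max_kicks):
            for seed in (0, 1):
                i = self._h(entry[0], seed)
                if self.slots[i] is None or self.slots[i][0] == entry[0]:
                    self.slots[i] = entry
                    return
            # Both candidate slots taken: evict one occupant and re-home it.
            i = self._h(entry[0], 0)
            self.slots[i], entry = entry, self.slots[i]
        self._grow()                               # likely cycle: rebuild larger
        self.insert(*entry)

    def _grow(self):
        old = [e for e in self.slots if e is not None]
        self.capacity = self.capacity * 2 + 1
        self.slots = [None] * self.capacity
        for k, v in old:
            self.insert(k, v)
```

Lookups check at most two slots, while an insert may trigger a chain of evictions or a full rebuild, matching the cost asymmetry described above.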

Dynamic Hashing Schemes

Adjusting the size of the hash table as the number of entries changes.

  • Chained Hashing: Each slot contains a pointer to a linked list of entries.
    • Can use a filter (e.g. Bloom filter) to quickly check if a key is not in the list.
  • Extendible Hashing: If a slot is full, split it into two slots and redistribute the entries, with each entry being hashed again and determining which of the two new slots it belongs to by the next bit of the hash value.
  • Linear Hashing: Instead of immediately splitting a bucket when it overflows, this scheme maintains a split pointer that keeps track of the next bucket to split. No matter whether this pointer is pointing to the bucket that overflowed, the DBMS always splits. The overflow criterion is left up to the implementation.
    • When any bucket overflows, split the bucket at the pointer location. Add a new slot entry and a new hash function, and apply this function to rehash the keys in the split bucket.
    • If the original hash function maps to a slot that has previously been pointed to by the split pointer, apply the new hash function to determine the actual location of the key.
    • When the pointer reaches the very last slot, delete the original hash function (since all buckets are now addressed using the new hash function) and move the pointer back to the beginning.
    • If the highest bucket below the split pointer is empty, we can also remove the bucket and move the split pointer in the reverse direction, thereby shrinking the size of the hash table.
    • The key advantage of linear hashing is that the table grows or shrinks incrementally, avoiding the high cost of a full rebuild.

Lecture 8: Indexes and Filters (Part I)

B+Tree:

  • Insertion
    • To insert a new entry into a B+Tree, one must traverse down the tree and use the inner nodes to figure out which leaf node to insert the key into.
      1. Find correct leaf L.
      2. Add new entry into L in sorted order:
        • If L has enough space, the operation is done.
        • Otherwise split L into two nodes L1 and L2. Redistribute entries evenly and copy up the middle key. Insert an entry pointing to L2 into the parent of L.
        • If the parent overflows, repeat the split step for the parent.
      3. To split an inner node, redistribute entries evenly, but push up the middle key.
      4. Hint: copy means the key is stored in both the child and the parent, while push means the key is only stored in the parent.
  • Deletion
    • Whereas in inserts we occasionally had to split leaves when the tree got too full, if a deletion causes a tree to be less than half-full, we must merge in order to re-balance the tree.
      1. Find correct leaf L.
      2. Remove the entry:
        • If L is at least half full, the operation is done.
        • Otherwise, you can try to borrow from a sibling; update the parent key if needed.
        • If borrowing fails, merge L and a sibling. Delete the separator key (and pointer) entry in the parent for the node that was removed.
          • If the parent underflows, repeat the borrow and merge step for the parent.
      3. To merge two inner nodes, pull down the key (that separates them) from their parent and merge it with the inner nodes.
  • Composite Index: The key is composed of multiple attributes. The order of the attributes in the key matters.
  • Selection Conditions: Because B+Trees are in sorted order, look-ups have fast traversal and also do not require the entire key.
  • Duplicate Keys: append the RID to the key to make each entry unique.
  • Clustered Indexes: The table is stored in the sorted order specified by the primary key, as either heap- or index-organized storage. Since some DBMSs always use a clustered index, they will automatically make a hidden row id primary key if a table doesn’t have an explicit one, but others cannot use them at all.
  • Index Scan Page Sorting: Since directly retrieving tuples from an unclustered index is inefficient, the DBMS can first figure out all the tuples that it needs and then sort them based on their page id. This way, each page will only need to be fetched exactly once.

Design Choices

  • Node size:
    • On disk: in the order of MB (reduce the number of seeks needed)
    • In memory: ~ 512 Bytes (reduce the number of cache misses)
  • Merge threshold: probably do not merge immediately when a node is less than half full…
  • Variable-length keys:
    • Pointers? Then we must chase a pointer to fetch the key for each node…
    • Variable-length nodes / padding to a maximum size? Wastes memory…
    • Key Map/Indirection: The method that nearly everyone uses is replacing the keys with an index to the key-value pair in a separate dictionary.
  • Intra-node search: linear, binary search, interpolation search (for academic purposes)

Optimizations

  • Prefix Compression: store the prefix once at the beginning of the (internal/leaf) node
  • Duplication: for non-unique key cases
  • Suffix Truncation: store the minimum prefix that is needed to route probes into the correct node (e.g. [banana, apple, abs] -> [b, ap, ab])
  • Pointer Swizzling: reduce a page-table search
  • Bulk Insert: build from the leaf nodes instead of inserting one by one from the root
  • Write-optimized B+Tree: logs changes in the internal node and lazily propagates the updates down to the leaf node later
  • Partial Indexes: create an index on a subset of the table (e.g. only on active users/time index per month or per year) to save space and improve performance for queries that only access that subset.
  • Covering Indexes: include additional attributes in the index to avoid having to access the table at all for certain queries.

Lecture 9: Indexes and Filters (Part II)

Index vs Filter

The index answers the question “Where is the data that I need?” while the filter answers the question “Do I need this data?”.

Bloom Filter

$m$-bit bitmap + $k$ hash functions

  • Insert: hash the key with all $k$ functions and set the corresponding $k$ bits to $1$.
  • Lookup: check whether all $k$ corresponding bits are $1$; if yes, return “possibly in set”; if no, return “definitely not in set”.
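A minimal sketch (the $k$ hash functions are simulated by seeding one hash function $k$ ways; `m` and `k` values are illustrative):

```python
class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Simulate k independent hash functions with k different seeds.
        return [hash((seed, item)) % self.m for seed in range(self.k)]

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def maybe_contains(self, item):
        # True  -> "possibly in set"      (false positives are possible)
        # False -> "definitely not in set" (false negatives are impossible)
        return all(self.bits[p] for p in self._positions(item))
```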

Variants:

  • Counting Bloom Filter: use a counter instead of a bit to allow for deletions.
    • Supports deletions, but costs more space.
  • Cuckoo Filter: store fingerprints of elements.
    • Can delete.
  • Succinct Range Filter (SuRF): an immutable, compact trie that supports approximate exact-match and range filtering.

Skip List

Each layer has about half the number of entries as the layer below it.

  • Search: start from the top layer and traverse down to the bottom layer, using the upper layers to skip over large sections of the list.
  • Insert: flip coins to decide the highest level at which the new node will be inserted.
  • Delete: could be expensive, so use a background thread.
  • Advantages:
    • Uses less memory than a B+Tree (if reverse pointers are not included).
    • No rebalancing is needed while inserting and deleting.
  • Disadvantages:
    • Not disk/cache friendly because they do not optimize for locality of reference.
    • Reverse search is non-trivial: it is tricky to handle both ascending and descending scans.
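A compact sketch of the structure (insert and search only; `MAX_LEVEL` and the 1/2 coin-flip probability are illustrative choices):

```python
import random

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)   # one next-pointer per level

class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = SkipNode(None, self.MAX_LEVEL)
        self.level = 0

    def _random_level(self):
        # Coin flips: each extra level is included with probability 1/2.
        lvl = 0
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * (self.MAX_LEVEL + 1)
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node                  # rightmost node before key, per level
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = SkipNode(key, lvl)
        for i in range(lvl + 1):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        node = self.head
        # Start at the top layer; drop down a level when the next key overshoots.
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key
```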

Trie

An order-preserving tree structure where each node represents a common prefix of the keys below it. The keys are stored in the leaf nodes, and the inner nodes only store meta-data (e.g. prefix, pointer to child nodes).

  • Vertical Compression: if a node has only one child, we can merge it with its child and store the combined prefix in the merged node (this can cause false positives, since we no longer know whether a compressed prefix is actually in the trie or just a prefix of some other key).
  • Horizontal Compression: if the number of children is bounded and known, we can use an array instead of a hash map to store the children.
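An uncompressed trie sketch (a dict stands in for the child map; `rid` is the leaf payload). The prefix scan shows why tries are order-preserving:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # could be a fixed-size array if the alphabet is known
        self.rid = None      # payload: record id for the key ending here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, rid):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.rid = rid

    def lookup(self, key):
        node = self.root
        for ch in key:
            if ch not in node.children:
                return None
            node = node.children[ch]
        return node.rid

    def keys_with_prefix(self, prefix):
        # Walking children in sorted order under the prefix node yields the
        # matching keys in sorted order.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        out = []
        def walk(n, path):
            if n.rid is not None:
                out.append(prefix + path)
            for ch in sorted(n.children):
                walk(n.children[ch], path + ch)
        walk(node, "")
        return out
```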

Inverted Index

Goal: Keyword search

Sol: word -> RID (“inverted”)

Lucene Implementation:

  • Finite State Transducer (FST) + weight (to optimize some queries)
  • Background Merging
  • Compression: delta encoding + bit packing

PostgreSQL Implementation:

  • Generalized Inverted Index (GIN): word -> list of RIDs
    • Few RIDs -> sorted list; Many RIDs -> another B+Tree
  • Pending list: to avoid frequent updates.

Enhancements:

  • boolean match -> ranking (e.g. term frequency, inverse document frequency)
  • tokenizer: Split terms into n-grams to support fuzzy text searches and autocomplete (“did you mean”)

Vector Index

  • LLM embedding: text -> vector
  • Inverted indexes: Partition vectors into smaller groups by a clustering algorithm and then build an inverted index that maps cluster centroids to records.
    • E.g. IVFFlat
  • Graph: builds a graph that represents the neighbor relationship between vectors, where each node represents a vector and its edges link to its n nearest neighbors.
    • Can use multiple layers to accelerate search
    • E.g. FAISS, HNSWlib

Lecture 10: Index Concurrency Control

Goal: A DBMS needs to allow multiple workers to safely access data structures to take advantage of additional CPU cores and hide disk I/O stalls.

  • Logical Correctness: Can a worker see the data that it is supposed to see?
  • Physical Correctness: Is the internal representation of the object valid?

Lock vs Latch

-> see Lecture 4

  • Read latch: Multiple workers are allowed to read the same item at the same time. A worker can acquire the latch in read mode even if another thread has already acquired it in read mode.
  • Write latch: Only one worker is allowed to access the item. A worker cannot acquire a write latch if another thread holds the latch in any mode. A worker holding a write latch also prevents other workers from acquiring the latch in read mode.

Latch Implementation

Goals:

  • Small memory footprint.
  • Fast execution path when no contention.
  • Decentralized management of latches.
  • Avoid expensive system calls.

Atomic Instruction Example: Compare-and-Swap (CAS), __sync_bool_compare_and_swap(&M, 20, 30)

  • If M is 20, set M to 30 and return true.
  • If M is not 20, do not change M and return false.

Approach 1: Test-and-Set Spin Latch (TAS), std::atomic<T>

  • Very efficient: single instruction to latch/unlatch;
  • Not scalable: all threads spin on the same memory location, causing contention and cache coherence traffic;
  • Not cache friendly: the cache line containing the latch is constantly invalidated and updated by different threads.
  • Not OS friendly: the CPU will waste cycles on spinning, and the threads look busy to the OS.

Approach 2: Blocking OS Mutex, std::mutex

  • Simple to use
  • Does not burn CPU cycles when contended (the OS deschedules the waiting thread)
  • Not scalable: ~ 25 ns per latch/unlatch

Approach 3: Reader-Writer Latch, std::shared_mutex

  • Allows for concurrent readers. Must manage read/write queues to avoid starvation.
  • Can be implemented on top of spinlocks.

Hash Table Latch

Approach 1: Latch the entire hash table. This is simple but not scalable.

Approach 2: Page/Block latch

  • Each page/block has an embedded reader-writer latch.
  • Workers acquire latch before accessing a page/block.
  • Release latch before moving to the next page/block.
  • Trade-off: decreases parallelism, but accessing multiple slots in a page will be fast for a single thread

Approach 3: Slot latch

  • Each slot has its own latch (can use single-mode to save space).
  • Trade-off: increases parallelism, but increases the storage and computational overhead of accessing the table.

B+Tree Latch

Goal: allow multiple workers to read and update a B+Tree at the same time.

Problems:

  • Workers modifying the contents of a node at the same time.
  • One worker traversing the tree while another worker splits/merges nodes.

Sol: latch crabbing

  • Get latch for parent
  • Get latch for child
  • Release parent latch if the child is safe:
    • For search: always safe
    • For insertion: it is not full
    • For deletion: it is more than half full
  • Optimistic Latch Crabbing Protocol: Assume all nodes are safe (i.e. do search), and if the leaf node is not safe, then release all latches and restart the operation.

Leaf node scan: possible deadlock! If two threads start to scan from different directions and meet in the middle…

  • Wait? But how long…
  • Kill the other thread? What will happen…
  • Kill self? Probably the only option
  • To avoid: support no-wait mode (e.g. restart)

B+Tree Write Operations

Record all write operations privately; if the operation fails, roll back all the changes.

Lecture 11: Sorting & Aggregation Algorithm

Query Plan

Compile a SQL into a query plan, which is a tree (or DAG) of operators.

Sorting

Assumptions:

  1. Query Results may not fit in memory.
  2. Algorithms that maximize sequential I/O are faster.

Why sorting? GROUP BY, ORDER BY, DISTINCT, JOIN

For a given run (i.e. list of key-value pairs):

  • Early Materialization: store the entire key-value pair in the run.
    • The sorting result is directly usable
    • But cost more memory
  • Late Materialization: store only the key and a pointer to the value in the run.
    • Cost less memory
    • But need to fetch the value after sorting

External Merge Sort:

  • Sort each chunk of data that fits in memory and write the sorted runs to disk.
  • Merge the sorted runs together by repeatedly merging a few runs at a time until only one run is left.

2-Way External Merge Sort:

  • During the sorting phase, the algorithm reads each page, sorts it, and writes the sorted version back to disk.
  • Then, in the merge phase, it uses three buffer pages.
  • It reads two sorted pages in from disk, and merges them together into a third buffer page. Whenever the third page fills up, it is written back to disk and replaced with an empty page.
  • Each set of sorted pages is called a run. The algorithm then recursively merges the runs together.

General (K-way) Merge Sort:

  • Let $b_r$ be the size of the relation, $B$ be the number of buffer pages available in memory, $b_b$ be the size of output buffer (which can be adjusted by ourselves), we can combine $K$ runs in each pass.
  • $K = \lfloor\frac{B}{b_b}\rfloor - 1$
  • Number of runs: $N = \lceil\frac{b_r}{B}\rceil$
  • Total passes through the data: $P = \lceil \log_{K}(N) \rceil$.
  • Total I/O cost: $(2P+1)\cdot b_r$ blocks.
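The formulas above can be checked numerically with a small helper (names are ours; the final merge output is not written back, which is what the $(2P+1)\cdot b_r$ count assumes):

```python
import math

def external_sort_stats(b_r, B, b_b):
    """Plug numbers into the K-way merge sort formulas.
    Returns (K, N, P, total block I/Os)."""
    K = B // b_b - 1                  # runs merged per pass
    N = math.ceil(b_r / B)            # initial sorted runs
    P = math.ceil(math.log(N, K))     # merge passes over the data
    return K, N, P, (2 * P + 1) * b_r

# e.g. a relation of 10^7 blocks with 10^4 buffer pages:
print(external_sort_stats(10**7, 10**4, 1))    # (9999, 1000, 1, 30000000)
print(external_sort_stats(10**7, 10**4, 100))  # (99, 1000, 2, 50000000)
```

Note the trade-off made visible here: a larger output buffer $b_b$ reduces seeks but lowers $K$, which can add a whole extra pass.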

Double Buffering Optimization:

  • Prefetch the next run in the background while merging the current runs.
  • Reduces the wait time for I/O, but halves the effective number of buffers.

Comparison Optimizations:

  • Code Specialization: Instead of providing the comparator as a function pointer to the sorting algorithm, the sorting function can be hard-coded to the specific key type. An example of this is template specialization in C++.
  • Suffix Truncation: Another optimization for string-based comparisons is to compare only part of the keys first. This compares binary prefixes of long VARCHAR keys, falling back to a full string comparison if the prefixes are equal.
  • Key Normalization: This converts variable-length keys into a single encoded / padded fixed-length string that preserves sort order.

Using B+Tree:

  • Clustered B+Tree: the BEST case (no computational cost, all sequential IO).
  • Unclustered B+Tree: not a good idea.

Top-N Heap Sorting:

  • For ORDER BY ... LIMIT N (WITH TIES) case, we can use HeapSort (and dynamically expand the heap if there are ties)

Aggregation

By Sorting:

  • First sorts the tuples on the GROUP BY key(s), then scans the sorted tuples sequentially to compute the aggregates for each group.
  • Important to maximize the efficiency (e.g. perform the filter first).

By Hashing: Sometimes we do not need the data to be ordered:

  • GROUP BY without ORDER BY
  • DISTINCT

Generally, hashing is more efficient than sorting for aggregation, unless the data is already sorted or the output is required to be sorted.

External Hashing Aggregation:

  • Partition:
    • Use $h1$ to split tuples into partitions on disk.
    • Use $B-1$ buffers to hold the partitions in memory, $1$ for input, and spill to disk if a partition exceeds the buffer size.
  • Rehashing:
    • For each partition, use $h2$ to hash the tuples into an in-memory hash table.
    • Then go through each bucket of this hash table to bring together matching tuples.
      • Or, we can calculate the targeted aggregates during this process (i.e. update the (GroupByKey, RunningValue) pair)
    • This assumes that each partition fits in memory. If not, recursively apply the same partitioning and hashing process to that partition until it fits in memory.

Exercises

  1. Suppose a point-query heavy workload is run on a DBMS that currently implements vanilla merge sort. Do you expect a speedup from switching to 2-way external merge sort?

    No. Point queries do not require sorting the entire dataset, so the performance of merge sort is not a bottleneck.

  2. Consider Top-N heap sort, B+ Trees, and K-way merge sort. For each sorting algorithm, describe a scenario where it is clearly preferable to the others.

    • Top-N: LIMIT
    • B+ Tree: when the data is already indexed and we want to sort by the indexed key.
    • K-way merge sort: when the data is too large to fit in memory and we want to minimize the number of passes through the data.
  3. (Textbook Practice Exercise 15.9) What is the effect on the cost of merging runs if the number of buffer blocks per run is increased while overall memory available for buffering runs remains fixed?

    • Fewer seeks, but more passes.
    • Trade-off: depends on the disk characteristics.
  4. (Textbook 15.17) Suppose you need to sort a relation of $40$ gigabytes, with $4$-kilobyte blocks, using a memory size of $40$ megabytes. Suppose the cost of a seek is $5$ milliseconds, while the disk transfer rate is $40$ megabytes per second.

    1. Find the cost of sorting the relation, in seconds, with $b_b = 1$ and with $b_b = 100$. (Note: $b_b$ is the size of the output buffer.)

      Number of blocks of the relation: $b_r = 40\text{GB} / 4\text{KB} = 10^7$ blocks.

      Number of blocks available in memory: $B = 40\text{MB} / 4\text{KB} = 10^4$ blocks.

      Number of runs needed: $N = \frac{b_r}{B} = 10^3$.

      For $b_b=1$:

      • $K = \lfloor\frac{B}{b_b}\rfloor - 1 = 9999$.
      • $P = \lceil \log_{9999}(1000) \rceil = 1$.
      • Number of block IO: $n_{io}=(2P+1)\cdot b_r = 3\cdot 10^7$.
      • Time per block IO: $t_{io}=\frac{4\text{KB}}{40\text{MB/s}} = 0.1\text{ms}$.
      • Number of seeks: $n_s=2N+\lceil\frac{b_r}{b_b}\rceil\cdot(2P-1)=1.0002\times10^7$.
      • Time per seek: $t_s=5\text{ms}$.
      • Total time: $T=n_{io}\times t_{io}+n_s\times t_s=5.301\times 10^7\text{ms}$

      For $b_b=100$:

      • $K = 99, P = 2$
      • $n_{io}=(2P+1)\cdot b_r = 5\cdot 10^7$.
      • $t_{io}=0.1\text{ms}$.
      • $n_s=2N+\lceil\frac{b_r}{b_b}\rceil\cdot(2P-1)=3.02\times10^5$.
      • $t_s=5\text{ms}$.
      • $T=n_{io}\times t_{io}+n_s\times t_s=6.51\times 10^6\text{ms}$.
    2. In each case, how many merge passes are required?

      For $b_b=1$, $P=1$. For $b_b=100$, $P=2$.

    3. Suppose a flash storage device is used instead of a disk, and it has a latency of $20\text{μs}$ and a transfer rate of $400\text{MB/s}$. Recompute the cost of sorting the relation, in seconds, with $b_b=1$ and with $b_b=100$, in this setting.

      $t_{io}=\frac{4\text{KB}}{400\text{MB/s}}=0.01\text{ms}$. $t_{s}=20\text{μs}=0.02\text{ms}$. For $b_b=1$, $T=n_{io}\times t_{io}+n_s\times t_s=5.0004\times 10^5\text{ms}$. For $b_b=100$, $T=n_{io}\times t_{io}+n_s\times t_s=5.0604\times 10^5\text{ms}$.

Lecture 12: Join Algorithms

Why join?

  • Reconstruct the original tuples without any information loss.

In this lecture, we focus on binary joins using inner equijoins.

In general, we want the smaller table to always be the left table (“outer table”) in the query plan.

Operator Output

  • Early Materialization: Copy the value when joining.
    • No need to fetch the value after joining.
  • Late Materialization: Store a pointer to the value when joining.
    • Cost less memory.
    • Need to fetch the value after joining.
    • Good for column stores.
  • How to analyse the cost?
    • Number of Disk I/O.

Algorithms

Assume we are joining table $R$ with $M$ pages, $m$ tuples and table $S$ with $N$ pages, $n$ tuples. Assume we have $B$ buffer pages in memory.

Loop Join

  • Naive loop join:
    • Cost: $M+m\cdot N$.
  • Block nested loop join:
    • Cost: $M+M\cdot N$.
    • With buffer pool: one buffer for inner table, $B-2$ buffers for outer table, and one buffer for output.
    • Cost: $M+\lceil\frac{M}{B-2}\rceil\cdot N$.
  • Index nested loop join:
    • Assume the cost of each index probe is some constant $C$ per tuple in the outer table.
    • Cost: $M+m\cdot C$.

Sort-Merge Join

  • Sort, Cost: $2M\cdot(1+\lceil\log_{B-1}(\lceil\frac{M}{B}\rceil)\rceil)+2N\cdot(1+\lceil\log_{B-1}(\lceil\frac{N}{B}\rceil)\rceil)$.
  • Merge, Cost: $M+N$.
    • The worst case for the merging phase is when the join attribute of all the tuples in both relations contains the same value.
    • Cost in that worst case: $M \cdot N$ + sort cost.
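The merge phase can be sketched as a two-cursor scan over the sorted inputs; the inner cross product over runs of equal keys is exactly where the worst case comes from:

```python
def merge_phase(R, S):
    """Merge phase of a sort-merge join; R and S are (key, payload) lists,
    both already sorted on the join key."""
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            key = R[i][0]
            # Gather the run of equal keys on each side, emit the cross product.
            # If every tuple shares one key, this degenerates to M*N work.
            i2 = i
            while i2 < len(R) and R[i2][0] == key:
                i2 += 1
            j2 = j
            while j2 < len(S) and S[j2][0] == key:
                j2 += 1
            for r in R[i:i2]:
                for s in S[j:j2]:
                    out.append((r[1], s[1]))
            i, j = i2, j2
    return out

R = [(1, "r1"), (2, "r2"), (2, "r3")]
S = [(2, "s1"), (2, "s2"), (4, "s3")]
print(merge_phase(R, S))
# [('r2', 's1'), ('r2', 's2'), ('r3', 's1'), ('r3', 's2')]
```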

Hash Join

  • Simple hash: Build + Probe
  • Optimization, probe filter
  • Partitioned (GRACE) hash join:
    • Used when the table is too large to fit in memory.
    • Cost: $2(M+N)$ for partition, $M+N$ for probe, total $3(M+N)$.
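The build and probe phases of a simple in-memory hash join look like this (a sketch: $R$ is the smaller outer table we build on, $S$ is probed one tuple at a time):

```python
def hash_join(R, S):
    """Simple hash join over (join_key, payload) tuples.
    Returns pairs of matching payloads."""
    # Build phase: hash table over the outer (smaller) table.
    table = {}
    for key, payload in R:
        table.setdefault(key, []).append(payload)
    # Probe phase: one hash lookup per tuple of S.
    out = []
    for key, payload in S:
        for r_payload in table.get(key, []):
            out.append((r_payload, payload))
    return out

R = [(1, "r1"), (2, "r2")]
S = [(2, "s1"), (2, "s2"), (3, "s3")]
print(hash_join(R, S))  # [('r2', 's1'), ('r2', 's2')]
```

A probe filter (e.g. a Bloom filter over the build keys) would let the probe phase skip tuples of $S$ whose key definitely has no match.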

Lecture 13: Query Processing (Part I)

Query plan: a DAG/tree of operators.

Pipeline: a sequence of operators where tuples continuously flow between them without intermediate storage.

Pipeline Breakers: an operator that cannot finish until all its children emit all their tuples.

Processing Model

A processing model defines how a DBMS executes a query plan.

Iterator Model (also called Volcano Model/Pipeline Model): (most common)

  • Each operator implements a Next() method that returns the next tuple in the output of that operator.
  • Also Open() and Close() methods for initialization and cleanup.
  • Good for LIMIT queries, because it can stop early without processing the entire dataset.
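Python generators make the pull-based control flow of the iterator model easy to sketch (operator names here are illustrative, not any DBMS's API):

```python
def seq_scan(table):
    for tup in table:         # each yield plays the role of Next()
        yield tup

def selection(child, predicate):
    for tup in child:
        if predicate(tup):
            yield tup

def limit(child, n):
    for i, tup in enumerate(child):
        if i >= n:
            return            # stop early: upstream operators never finish scanning
        yield tup

table = [(i, i * 10) for i in range(1000)]
plan = limit(selection(seq_scan(table), lambda t: t[0] % 2 == 0), 3)
print(list(plan))  # [(0, 0), (2, 20), (4, 40)] -- only a prefix of table is read
```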

Materialization Model:

  • Each operator produces an entire intermediate result and stores it in memory or on disk before the next operator can start processing.
  • Each operator implements a Output() method that returns the entire result of that operator.
  • Good for OLTP, where each query typically accesses a small portion of the data.

Vectorized Model:

  • Batched Next() method that returns a batch of tuples at a time.
  • Good for OLAP: fewer function calls, and can be optimized by modern CPU + SIMD.

Plan Processing Direction:

  • Top-to-Bottom (Pull):
    • Easy to control output via LIMIT.
    • Parent operator blocks until its child returns with a tuple.
    • Additional overhead because operators’ Next() functions are implemented as virtual functions.
    • Branching costs on each Next() invocation.
  • Bottom-to-Top (Push):
    • Allows for tighter control of caches/registers in pipelines.
    • May not have exact control of intermediate result sizes.
    • Difficult to implement some operators (Sort-Merge Join).

Access Model

  • Sequential Access: scan the entire table.
    • Approximate Queries (lossy data skipping): Execute on a sampled subset of the data.
    • Zone Map (lossless data skipping): Pre-compute aggregations and skip pages that do not satisfy the query predicate.
  • Index Access: use an index to directly access the relevant tuples.
  • Multi-index Access: use multiple indexes and use bitmap/hashmap/… to calculate the intersection of the results.
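A zone map sketch for a numeric column (per-page min/max aggregates let a scan skip whole pages that cannot satisfy the predicate):

```python
def build_zone_maps(pages):
    """Precompute (min, max) per page of values."""
    return [(min(p), max(p)) for p in pages]

def scan_greater_than(pages, zone_maps, threshold):
    """Evaluate `value > threshold`, skipping pages via the zone maps."""
    out = []
    for page, (lo, hi) in zip(pages, zone_maps):
        if hi <= threshold:       # no value in this page can match: skip the I/O
            continue
        out.extend(v for v in page if v > threshold)
    return out

pages = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
zmaps = build_zone_maps(pages)
print(scan_greater_than(pages, zmaps, 6))  # [7, 8, 9] -- first two pages skipped
```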

Modification Queries

UPDATE/DELETE:

  • Child operators pass Record IDs for target tuples.
  • Must keep track of previously seen tuples.

INSERT:

  • Choice 1: Materialize tuples inside of the operator.
  • Choice 2: Operator inserts any tuple passed in from child operators.

Update Query Problem (Halloween Problem):

  • A query updates the position of a tuple.
  • Solution: Track modified record ids per query.

Expression Evaluation

WHERE clause is expressed by a tree. Optimizations:

  • JIT (Just-In-Time) Compilation: Compile the expression into machine code at runtime for faster execution.
  • Constant Folding: Precompute constant expressions at compile time to reduce runtime computation.
  • Common Sub-Expr. Elimination: Identify and compute common sub-expressions only once to avoid redundant calculations.

Lecture 14: Query Processing (Part II)

Parallel vs. Distributed

  • Parallel DBMS: Data is physically stored on a single machine, but multiple threads can access it concurrently.
    • Communication is cheap and reliable.
  • Distributed DBMS: Data is distributed across multiple machines, and the DBMS manages the communication and coordination between them.
    • Communication costs and failures are non-trivial problems.

Process Models

A DBMS’s process model defines how the system is architected to support concurrent requests / queries.

A worker is the DBMS component responsible for executing tasks on behalf of the client and returning the results.

Process per Worker

  • Each connection gets its own process.
  • Pros: Single failure will not affect other connections.
  • Cons: Use shared memory for data exchange, which is expensive.

Thread per Worker (Most common)

  • Each connection gets its own thread.
  • Pros: Cheaper than process per worker, since threads share memory, and the DBMS can fully utilize the CPU cores.
  • Cons: A failed thread can affect the entire process, and thus all connections.

Embedded DBMS

  • DBMS runs in the same address space as the client application.

Scheduling

  • For each query plan, the DBMS has to decide where, when, and how to execute. Relevant questions include:
    • How many tasks should it use?
    • How many CPU cores should it use?
    • What CPU cores should the tasks execute on?
    • Where should a task store its output?

Intra-Query Parallelism

Improve the performance of a single query by executing its operators in parallel.

  • Think of the organization of operators in terms of a producer/consumer paradigm.

Approach 1: Intra-Operator (Horizontal)

  • Divide an operator into multiple tasks that can be executed in parallel.
    • E.g. Sequential scan.
  • Introduce Exchange Operators to manage the communication between tasks.
    • Gather: Combine the output of multiple tasks into a single stream for the next operator.
    • Distribute: Distribute the input stream to multiple tasks for the next operator.
    • Repartition: Shuffle multiple input streams across multiple output streams.

Approach 2: Inter-Operator (Vertical)

  • Also called pipelined parallelism.
  • Different operators run on different threads.
    • E.g. Project and Join

Approach 3: Bushy Parallelism (Hybrid)

  • Combines the two approaches above.

Data Parallelism

  • Use SIMD instructions to process multiple data items in parallel within a single CPU core.
    • E.g.: Selection scan: use SIMD Compare.

I/O Parallelism

Multi-Disk:

  • DBMS’s file is stored across multiple devices.

Database Partitioning:

  • The DB is split up into disjoint subsets and assigned to different disks.

Lecture 15: Query Optimization (Part I)

Goal: Given a logical plan as input, find an equivalent physical plan with the lowest cost.

  • Logical plan: an algebra expression.
  • Physical plan: a specific execution strategy using an access path.

Transformations

Selection: Split conjunctive predicates and push them down.

Joins: Joins are commutative and associative, so we can reorder them.

  • A complex problem: the number of possible join orderings grows with the Catalan numbers.
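To see why exhaustive join enumeration is infeasible, we can count the binary join-tree shapes for $n$ tables, which is the $(n-1)$-th Catalan number:

```python
from math import comb

def catalan(n):
    # n-th Catalan number: C_n = (2n choose n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

# Number of binary join-tree shapes for joining n tables is catalan(n - 1);
# multiplying in the n! orderings of the leaves makes the space even larger.
for n in [2, 4, 8, 16]:
    print(n, catalan(n - 1))
```

Already at 16 tables there are 9,694,845 tree shapes before even considering leaf orderings or physical join algorithms, which is why optimizers prune with heuristics (e.g. left-deep trees only).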

Search Algorithms

Rules/Heuristics-Based Optimization:

  • Pros: Easy and fast.
  • Cons: Magic numbers.

Cost-Based Optimization:

  • Apply transformation rules to enumerate different variations of a query’s plan, estimate their costs, and use those estimates to guide the search process.
  • Approaches:
    • Approach 1, Wall-clock Time: Stop after the optimizer runs for some length of time.
    • Approach 2, Cost Threshold: Stop when the optimizer finds a plan that has a lower cost than some threshold.
    • Approach 3, Exhaustion: Stop when there are no more enumerations of the target plan. Usually done per sub-plan/group.
    • Approach 4, Transformation Count: Stop after a certain number of rules/transformations have been considered.

Query Planning Direction

Top-Down/Backward Chaining:

  • Start from the target, and divide it into sub-targets recursively.
  • Uses DFS.
  • Star/Snowflake Queries:
    • Sort by selectivity and build a left-deep or right-deep tree.

Bottom-Up/Forward Chaining:

  • Start from basic tables, build from sub-plans to the final plan.
  • Uses Dynamic Programming.

Enforcers:

  • Physical operators that ensure the properties of the output of a sub-plan / expression.
    • E.g.: ORDER BY $\rightarrow$ Join (without ordering) + Sort Enforcer.
  • Optimizer can choose between different strategies:
    • E.g. for ORDER BY, we can use INDEX_SCAN (costs a little more) or SEQ_SCAN (costs less) + Sort (may cost a lot).

Lecture 16: Query Optimization (Part II)

The hard part of query planning: estimating the cost of a plan.

Summarization Approaches

  • Histograms (most common)
    • Maintain an occurrence count per value (or range of values) in a column.
    • Implementations:
      • Equal-width: divide the range of values into equal-width buckets.
      • Equal-depth: divide the values into buckets with equal number of occurrences.
      • End-biased: use $N-1$ buckets for the most frequent $N-1$ values, and one bucket for the rest of the values.
  • Sketches (increasingly used)
    • Probabilistic data structure that gives an approximate count for a given value.
    • E.g. Count-Min Sketch, HyperLogLog.
  • Sampling (rare)
    • DBMS maintains a small subset of each table that it then uses to evaluate expressions to compute selectivity.
    • Approach 1: Maintain a read-only copy.
    • Approach 2: Sample real tables.
  • ML Model (very rare)
    • Train an ML model that learns the selectivity of predicates and correlations between multiple tables.
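A minimal sketch of an equal-width histogram over an integer attribute (the bucket math is simplified and the uniform-within-bucket assumption is the lecture's, not exact). It also shows why equal-width histograms mishandle heavy hitters, motivating end-biased histograms:

```python
def build_hist(values, num_buckets):
    # Equal-width histogram over an integer attribute.
    lo, hi = min(values), max(values)
    width = (hi - lo + 1) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        counts[min(int((v - lo) / width), num_buckets - 1)] += 1
    return lo, width, counts

def sel_equals(hist, x, n_total):
    # Estimate sel(A = x): take the bucket's count and assume values
    # inside the bucket are spread uniformly across its `width` values.
    lo, width, counts = hist
    b = min(int((x - lo) / width), len(counts) - 1)
    return (counts[b] / width) / n_total

values = list(range(100)) + [42] * 100  # one heavy hitter at 42
hist = build_hist(values, 10)
est = sel_equals(hist, 42, len(values))
```

Here the true selectivity of `A = 42` is 101/200 ≈ 0.505, but the histogram estimates only 0.055, because the heavy hitter's count is smeared across its whole bucket; an end-biased histogram would give 42 its own bucket.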

Cost Estimation

Goal: with statistics info of each table, estimate the cost of a plan.

  • Selection cardinality $SC(A,R)=N_R/V(A,R)$: the average number of tuples with a value for an attribute $A$, where $N_R$ is the total number of tuples in the relation $R$, and $V(A,R)$ is the number of distinct values for attribute $A$ in relation $R$.
  • Selectivity $sel(p)$: the fraction of tuples that qualify.
    • Depends on the predicate. E.g.:
    • For $p: A=x$, $sel(p)=(\text{number of occurrences of }x)/N_R$.
    • For $p=a\land b$, $sel(p)=sel(a)\cdot sel(b)$.
  • Assumptions:
    • Uniform Data: The distribution of values (except for the heavy hitters) is the same.
    • Independent Predicates: The predicates on attributes are independent.
    • Containment Principle: The domain of join keys overlap such that each key in the inner relation will also exist in the outer table.
  • Join size estimation: $\text{size} = N_R\cdot N_S/\max(V(A,R),V(A,S))$.
    • However, error can propagate and amplify.
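The two formulas above can be written down directly (the table statistics below are made-up numbers for illustration):

```python
def selection_cardinality(n_r, v_a_r):
    # SC(A, R) = N_R / V(A, R): expected tuples per distinct value of A,
    # under the uniform-data assumption.
    return n_r / v_a_r

def join_size(n_r, n_s, v_a_r, v_a_s):
    # Under uniformity + containment:
    # |R join S| = N_R * N_S / max(V(A, R), V(A, S))
    return n_r * n_s / max(v_a_r, v_a_s)

# Hypothetical stats: R has 10,000 tuples and 100 distinct join keys;
# S has 2,000 tuples and 50 distinct join keys.
sc = join_estimate = None
sc = selection_cardinality(10_000, 100)        # 100.0 tuples per key
join_estimate = join_size(10_000, 2_000, 100, 50)  # 200000.0 tuples
```

These are point estimates: when the output of one join feeds another, the estimation error compounds, which is the error propagation the notes warn about.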

Lecture 17: Concurrency Control Theory

Strawman System:

  • Execute each transaction one-by-one.
  • Before the transaction, copy the database to a new file and make all changes to it.
  • If the transaction completes successfully, overwrite the original file with the new file; otherwise, discard the new file.
  • Cons: very slow, no concurrency.
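The strawman above can be sketched in a few lines, assuming a JSON file as the "database" (the file layout and `run_txn` helper are invented for illustration). The atomic rename is the commit point: an aborted transaction never touches the original file.

```python
import json
import os
import tempfile

def run_txn(db_path, txn_fn):
    """Strawman txn: copy the DB, mutate the copy, swap it in on success."""
    with open(db_path) as f:
        data = json.load(f)          # private copy of the whole database
    try:
        txn_fn(data)                 # all writes go to the copy
    except Exception:
        return False                 # abort: the original file is untouched
    tmp = db_path + ".new"
    with open(tmp, "w") as f:
        json.dump(data, f)
    os.replace(tmp, db_path)         # atomic rename = commit point
    return True

path = os.path.join(tempfile.mkdtemp(), "db.json")
with open(path, "w") as f:
    json.dump({"alice": 100}, f)

run_txn(path, lambda db: db.update(alice=db["alice"] - 10))  # commits
run_txn(path, lambda db: db.update(bogus=1 / 0))             # aborts
```

After both calls the file holds `{"alice": 90}`: the committed change survives and the aborted one is discarded. The obvious cost is copying the entire database per transaction, with zero concurrency.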

Correctness Criteria: ACID

  • Atomicity: All actions in txn happen, or none happen.
    • All or nothing…
    • Redo/Undo Mechanisms, Concurrency Control
  • Consistency: If each txn is consistent and the DB starts consistent, then it ends up consistent.
    • It looks correct to me…
    • Integrity Constraints, Replication Protocols
  • Isolation: Execution of one txn is isolated from that of other txns.
    • All by myself…
    • Concurrency Control
  • Durability: If a txn commits, its effects persist.
    • My changes will survive…
    • Redo/Undo Mechanisms, Replication Protocols

Atomicity

Logging:

  • Records all actions, and undo them if needed (i.e. if the transaction aborts).
    • Also maintain undo records.
    • Replay log after crash to put database back to a consistent state.
  • Used by almost all DBMSs.

Shadow Paging:

  • Makes copies of pages, and txns make changes to them. If the txn commits, update the page table to point to the new pages.
    • Instant recovery after crash.
    • But copying pages on every txn is slow, so few systems use it.

Consistency

Make sure that the database accurately models the real world. E.g., Age > 0.

Approaches:

  • SQL specifies integrity constraints using CHECK, ADD CONSTRAINT, key definitions, etc.
  • Constraints are defined by the application.
  • DBMS ensures that all integrity constraints are satisfied before and after each transaction.
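A small demonstration of a DBMS-enforced integrity constraint, using SQLite's CHECK clause via Python's `sqlite3` module (the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The Age > 0 rule from above, expressed as a CHECK constraint.
conn.execute("CREATE TABLE person (name TEXT, age INTEGER CHECK (age > 0))")

conn.execute("INSERT INTO person VALUES ('alice', 30)")  # satisfies age > 0
try:
    conn.execute("INSERT INTO person VALUES ('bob', -5)")  # violates age > 0
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the DBMS aborts the offending statement
```

Only the valid row ends up in the table: the DBMS, not the application, rejects data that would make the database inconsistent with the real world.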

Eventual Consistency… See Lecture 23.

Isolation

A concurrency control protocol is how the DBMS decides the proper interleaving of operations from multiple transactions.

  • Pessimistic: Do not let problems arise in the first place.
  • Optimistic: Assume conflicts are rare; deal with them after they happen.

Goal:

  • Txns still appear to execute serially (i.e. serializability).
  • When one txn stalls because of a resource (e.g., page fault), another txn can continue executing and make forward progress.

Schedules:

  • Serial Schedule: No interleaving of operations from different transactions.
  • Equivalent Schedules: Two schedules are equivalent if they produce the same final state of the database.
  • Serializable Schedule: A schedule is serializable if it is equivalent to SOME serial schedule.
    • If each txn preserves consistency, every serializable schedule also preserves consistency.

Interleaved Execution Anomalies:

  • Unrepeatable Read (Read-Write)
  • Dirty Read (Write-Read)
  • Lost Update (Write-Write)
  • Phantom Reads (Scan-Write) (Lecture 18)
  • Write-Skew (Read-Write) (Lecture 20)

Unrepeatable Read:

  • T1.R, T2.W, T1.R $\rightarrow$ T1 reads different values for the same data item.

Dirty Read:

  • T1.W, T2.R, T1.ABORT $\rightarrow$ T2 reads uncommitted changes made by T1.

Lost Update:

  • T1.W(A), T2.W(A), T2.W(B), T2.COMMIT, T1.COMMIT $\rightarrow$ T1’s update to A is lost (T2 overwrites T1’s uncommitted data).

Conflict Serializability:

  • Two schedules are conflict equivalent iff:
    • They involve the same actions of the same transactions.
    • Every pair of conflicting actions is ordered the same way.
  • Schedule S is conflict serializable if:
    • S is conflict equivalent to some serial schedule.
    • Intuition: You can transform S into a serial schedule by swapping consecutive non-conflicting operations of different transactions.

Dependency Graph (also called Precedence Graph):

  • Nodes: transactions.
  • Edges: T1 $\rightarrow$ T2 if T1 has an action that conflicts with an action of T2, and T1’s action comes before T2’s in the schedule

View Serializability:

  • Two schedules are view equivalent if:
    • If T1 reads initial value of A in S1, then T1 also reads initial value of A in S2.
    • If T1 reads value of A written by T2 in S1, then T1 also reads value of A written by T2 in S2.
    • If T1 writes final value of A in S1, then T1 also writes final value of A in S2.
  • No DBMS enforces this in practice: checking view serializability is NP-complete.

Summary:

  • Serial $\subset$ Conflict Serializable $\subset$ View Serializable $\subset$ All Schedules.
  • Conflict Serializable is the most commonly used correctness criterion for isolation, because it is easier to enforce and check.
  • How to check? Build the dependency graph and check for cycles. If there is a cycle, then the schedule is not conflict serializable.
  • How to enforce? We still do not know for now…
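The dependency-graph check can be sketched directly (the schedule encoding as `(txn, op, obj)` tuples is an assumption for illustration): build an edge Ti → Tj for every ordered conflicting pair, then look for a cycle with DFS.

```python
def conflicts(a, b):
    # Two actions conflict if they touch the same object, come from
    # different txns, and at least one of them is a write.
    return a[2] == b[2] and a[0] != b[0] and "W" in (a[1], b[1])

def is_conflict_serializable(schedule):
    """schedule: list of (txn, op, obj) tuples, e.g. ('T1', 'R', 'A')."""
    # Build dependency-graph edges Ti -> Tj for ordered conflicting pairs.
    edges = set()
    for i, a in enumerate(schedule):
        for b in schedule[i + 1:]:
            if conflicts(a, b):
                edges.add((a[0], b[0]))
    nodes = {t for t, _, _ in schedule}
    adj = {n: [m for (x, m) in edges if x == n] for n in nodes}
    # Conflict serializable iff the dependency graph is acyclic (DFS check).
    WHITE, GRAY, BLACK = 0, 1, 2
    color = dict.fromkeys(nodes, WHITE)
    def has_cycle(n):
        color[n] = GRAY
        for m in adj[n]:
            if color[m] == GRAY or (color[m] == WHITE and has_cycle(m)):
                return True  # back edge found
        color[n] = BLACK
        return False
    return not any(color[n] == WHITE and has_cycle(n) for n in nodes)

# The unrepeatable-read schedule from above: T1.R(A), T2.W(A), T1.R(A).
bad = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "R", "A")]
good = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]
```

Here `bad` produces edges T1 → T2 and T2 → T1, a cycle, so it is not conflict serializable, while `good` only has edges from T1 to T2 and passes.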

Durability

All changes of committed txns must persist across crashes and restarts. Like atomicity, this is enforced with logging or shadow paging.