Database Crash Recovery: ARIES, WAL & Shadow Paging (2026)
Key Takeaways
- WAL is the Foundation β The Write-Ahead Log mandates that every change must be recorded in a durable log before the data page is modified. This single rule guarantees recoverability from any crash.
- REDO Saves Committed Work β After a crash, RAM (and its dirty pages) is wiped. REDO replays every committed operation from the log to rebuild data that existed only in RAM at crash time.
- UNDO Removes Incomplete Work β UNDO reverses every operation from transactions that were active (uncommitted) at the moment of the crash β enforcing ACID Atomicity.
- ARIES = Analysis + REDO + UNDO β IBM's ARIES algorithm (1992) adds mathematical precision: LSN-based idempotence ensures recovery is safe even if the server crashes during recovery itself.
- Shadow Paging is Obsolete β Shadow paging avoids logs but causes severe data fragmentation and cannot handle concurrent transactions at scale. WAL/ARIES dominates 99% of production RDBMS.
Crash recovery restores a database to a consistent, ACID-compliant state after a power failure or system crash β it is not optional in any production database
Write-Ahead Logging (WAL) requires the log record to be written to durable storage before the data page is modified β this single rule makes recovery possible
REDO replays committed operations that were in RAM (lost at crash); UNDO reverses uncommitted operations that were partially written
Checkpoints bound how far back recovery must scan the log β without them, every crash would require reading the entire log history
ARIES (IBM, 1992) adds Log Sequence Numbers (LSNs) for idempotent, crash-safe recovery that handles concurrent transactions correctly
Shadow Paging swaps page pointers instead of using logs β simpler but causes fragmentation and cannot handle concurrent transactions at scale
What Is Database Crash Recovery?
Every production database operates in a state of constant risk. At any moment, a power failure, hardware fault, OS panic, or network partition can kill the database server mid-transaction β with thousands of operations in flight, data partially written to RAM, and the physical disk pages in an inconsistent state. Without a recovery system, the database would be permanently corrupted on every crash.
Database Crash Recovery is the automated process the DBMS executes after a system failure to restore the database to the last consistent, ACID-compliant state. It guarantees two core ACID properties:
- Atomicity β incomplete (uncommitted) transactions are completely reversed, as if they never occurred
- Durability β every committed transaction's effects are permanently preserved, even if the crash happened the millisecond after the commit signal was sent
Log-Based Recovery: WAL, REDO & UNDO
The dominant recovery mechanism in all major RDBMS is log-based recovery. It works by maintaining a sequential, append-only Write-Ahead Log (WAL) β a durable, ordered journal of every database operation. The WAL has one inviolable rule:
The WAL Rule (Write-Ahead Logging Protocol)
βA log record for a data modification must be written to durable storage before the modified data page is written to the database file.β
Log Record Structure
Every write to the WAL produces a log record containing:
-- Structure of a WAL log record
<LSN, TransactionID, PageID, Offset, OldValue, NewValue, PrevLSN>
LSN : Log Sequence Number β monotonically increasing unique ID for this record
TransactionID : Which transaction made this change
PageID : Which disk page was modified
OldValue : The before-image (used for UNDO)
NewValue : The after-image (used for REDO)
PrevLSN : LSN of the previous log record for this same transaction (forms a backward chain)
Normal Operation: The Buffer Pool Model
Modern databases do not write every committed change directly to disk β that would be catastrophically slow. Instead they maintain a Buffer Pool (an in-memory cache of disk pages) and defer physical writes:
- Transaction modifies data β The change is applied to the in-memory buffer page, marking it as a dirty page.
- WAL record written first β Before the dirty page can ever be flushed to disk, its log record must be durably written to the WAL file on disk.
- Commit logged β A
<COMMIT, TxID>record is written to the WAL and flushed. The user receives a success response. - Dirty pages linger in RAM β The actual database data file pages are not yet updated β they remain dirty in the buffer pool for efficiency.
- Checkpoint flush β Periodically, the database flushes dirty pages to disk and writes a
<CHECKPOINT>record to the log marking a safe recovery boundary.
REDO: Replaying Committed Work
After a crash, RAM is wiped β all dirty buffer pool pages (including those holding committed data that was never flushed) are gone. The REDO phase replays the WAL forward from the last checkpoint, reapplying every change from committed transactions to rebuild the database to its exact pre-crash state.
| Scenario | At Crash | REDO Action | Result |
|---|---|---|---|
| T1 committed, page not flushed | COMMIT record in log; data only in RAM | Replay the UPDATE to the disk page | Data saved β |
| T2 committed, page already flushed | COMMIT in log; disk page already has new data | PageLSN β₯ LSN_record β skip (idempotent) | No double-write β |
| T3 uncommitted, page partially flushed | No COMMIT in log; some dirty pages on disk | REDO applies changes β UNDO will reverse them | Temporary β UNDO follows β |
UNDO: Reversing Incomplete Transactions
The UNDO phase scans the WAL backward, identifying all loser transactions β those that were active (had no COMMIT record) at the time of the crash. For each operation in a loser transaction, the UNDO phase applies the before-image (the OldValue from the log record) to reverse the partial change, restoring the data to its pre-transaction state.
-- UNDO log record written during UNDO phase (Compensation Log Record - CLR)
<LSN=1050, TxID=T3, PageID=42, UndoNextLSN=970, Type=CLR>
CLR (Compensation Log Record): written to the WAL for every UNDO operation performed
UndoNextLSN: points to the next record to UNDO for this transaction (backward chain)
CLRs are never themselves undone β they mark permanent, irreversible recovery actions
Checkpoints: Bounding Recovery Cost
Without checkpoints, every crash would require scanning the entire WAL from the very beginning of time β potentially terabytes of log data. A checkpoint is a bookmarked synchronization point that tells the recovery engine: βeverything before this point is safely on disk β only scan from here.β
| Checkpoint Type | Mechanism | Impact on Transactions | Used By |
|---|---|---|---|
| Sharp Checkpoint | Halts all transactions, flushes ALL dirty pages to disk, writes checkpoint record | Severe stall β all queries paused during flush | Legacy systems, simple embedded DBs |
| Fuzzy Checkpoint | Begins checkpoint, continues accepting transactions, flushes dirty pages gradually in the background | No stall β transactions continue during checkpoint | PostgreSQL, MySQL InnoDB, Oracle, SQL Server |
| ARIES Checkpoint | Writes Transaction Table + Dirty Page Table to log as checkpoint content β no page flush required | Minimal overhead β only metadata written to log | IBM DB2, academic DBMS implementations |
What a Checkpoint Record Contains (ARIES Style)
-- ARIES checkpoint record written to WAL
<CHECKPOINT_BEGIN, LSN=900>
<CHECKPOINT_END,
Transaction Table: [
{ TxID: T1, Status: Running, LastLSN: 870 },
{ TxID: T2, Status: Committed, LastLSN: 850 }
],
Dirty Page Table: [
{ PageID: 42, RecLSN: 820 },
{ PageID: 77, RecLSN: 865 }
]
>
RecLSN (Recovery LSN): the earliest LSN that might have dirtied this page β recovery scans from the minimum RecLSN
The ARIES Algorithm: 3-Phase Crash Recovery
Invented by Mohan, Haderle, Lindsay, Pirahesh, and Schwarz at IBM Research in 1992, ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) is the gold standard for crash recovery in systems with concurrent transactions. It extends basic REDO/UNDO with LSN-based tracking to achieve mathematical idempotence β recovery can safely restart mid-way without corrupting data.
ARIES operates on three precisely sequenced phases upon database restart:
Phase 1: Analysis β Reconstructing State
The Analysis phase scans the WAL forward from the last checkpoint record to the end of the log (the crash point). Its goal is to reconstruct two critical data structures:
- Transaction Table β all transactions that were active at the crash, plus their last LSN and commit status
- Dirty Page Table β all buffer pool pages that were modified but not yet flushed to disk at the crash, plus their RecLSN (earliest log record that dirtied them)
-- Analysis phase pseudo-logic
FOR EACH log record from CHECKPOINT to END_OF_LOG:
IF record.type == UPDATE:
add/update TxID in Transaction_Table (Status = Running, LastLSN = record.LSN)
add PageID to Dirty_Page_Table if not already present (RecLSN = record.LSN)
IF record.type == COMMIT:
update Transaction_Table: Status = Committed
IF record.type == ABORT:
mark as Loser (needs UNDO)
Result: Transaction_Table contains winners (COMMIT) and losers (no COMMIT) at crash time
Phase 2: REDO β Repeating History
The REDO phase scans the WAL forward, starting from the minimum RecLSN across all entries in the Dirty Page Table (the earliest point any unwritten change originated). It reapplies every single operation β including those from loser transactions β to bring the database to its exact physical state at the crash moment.
This is the critical ARIES insight: REDO repeats history completely before UNDO begins. This unconditional approach is what makes ARIES correct with concurrent transactions.
-- REDO idempotency check using LSN (the core ARIES innovation)
FOR EACH update record from min(RecLSN) to END_OF_LOG:
fetch PageID from disk
IF page.PageLSN >= record.LSN:
SKIP β this update is already on disk (idempotent check)
ELSE:
apply record.NewValue to page
set page.PageLSN = record.LSN
mark page as dirty in buffer pool
PageLSN β₯ LSN_record β skip: this makes ARIES idempotent (safe to crash mid-recovery)
Phase 3: UNDO β Rolling Back Losers
The UNDO phase scans the WAL backward, processing only the loser transactions identified in Phase 1 (those with no COMMIT record). For each operation, it applies the before-image (OldValue) to reverse the change. For every UNDO operation performed, ARIES writes a Compensation Log Record (CLR) to the WAL β ensuring that if the system crashes again during UNDO, these reversal operations are not accidentally re-applied.
| ARIES Phase | Log Scan Direction | Starting Point | What It Does | Targets |
|---|---|---|---|---|
| 1. Analysis | Forward β | Last Checkpoint | Rebuilds Transaction Table + Dirty Page Table | All transactions |
| 2. REDO | Forward β | min(RecLSN) in Dirty Page Table | Reapplies ALL changes to restore crash-state | ALL transactions (winners + losers) |
| 3. UNDO | Backward β | End of log (crash point) | Reverses changes from uncommitted transactions using before-images | Only LOSER transactions (no COMMIT) |
Shadow Paging: The Alternative Approach
Shadow paging is a fundamentally different recovery mechanism that eliminates the need for a REDO/UNDO log entirely. Instead of logging changes, it maintains two versions of the database page directory simultaneously and uses an atomic pointer swap to commit or discard transactions.
How Shadow Paging Works
The database maintains two directory tables β the Current Directory (pointing to the live, in-use data pages) and the Shadow Directory (a backup copy pointing to the original, unchanged pages):
- Transaction begins β The Shadow Directory is made a copy of the Current Directory.
- Page modification β Instead of modifying the existing page, the DBMS writes changes to a completely new, empty disk page. The Current Directory pointer is updated to point to the new page. The Shadow Directory still points to the original page.
- Commit β The master page directory pointer is atomically swapped to the Current Directory. The old Shadow pages become garbage-collectible.
- Crash before commit β The atomic swap never happened. The master pointer still points to the Shadow Directory. The in-progress current pages are simply discarded β no log processing needed.
| Feature | Write-Ahead Logging (WAL) | Shadow Paging |
|---|---|---|
| Recovery Mechanism | REDO + UNDO log processing | Discard shadow pages β instant |
| Normal Operation Overhead | Sequential log I/O (fast) | Full page copy per modification (expensive) |
| Concurrency | Excellent β optimized for thousands of concurrent transactions | Very poor β page directory conflicts under concurrent writes |
| Data Fragmentation | None β data stays sequential and clustered | Severe β new pages scattered randomly across disk |
| Recovery Time | Log-bounded β depends on checkpoint frequency | Near-instant β just discard uncommitted pages |
| Industry Adoption | 99% of production RDBMS (PostgreSQL, Oracle, MySQL, SQL Server) | Mostly obsolete; SQLite rollback journal, ZFS/Btrfs filesystems |
Recovery with Concurrent Transactions
The true complexity of crash recovery emerges when multiple transactions are running simultaneously at the time of a crash. A naive REDO/UNDO implementation would struggle because:
- A single dirty page in the buffer pool may have been modified by multiple different transactions at different times
- UNDO must reverse only the loser transactions' changes β not interfere with pages also modified by winners
- If the server crashes during UNDO, recovery must resume from the correct point without re-undoing already-undone operations
ARIES solves all three problems elegantly:
| Problem | ARIES Solution |
|---|---|
| Multiple transactions touched the same page | PageLSN tracks which operation last modified each page β REDO applies changes in LSN order, ensuring correct final state |
| UNDO must only reverse loser transactions | Analysis phase builds the exact Transaction Table β only transactions without a COMMIT record are in the UNDO set |
| Crash during UNDO | CLRs (Compensation Log Records) written during UNDO are never themselves undone β they are skipped on the next recovery attempt via the UndoNextLSN pointer |
| Idempotent REDO on already-flushed pages | PageLSN β₯ LSN_record β skip REDO. Safe even if the same log record is processed multiple times across repeated crash/recovery cycles |
Real-World Case Study: The Power Grid Financial Crisis
| Aspect | Details |
|---|---|
| The Setup | A major financial trading platform processes millions of micro-transactions per second across a PostgreSQL cluster. PostgreSQL implements WAL and a ARIES-derived fuzzy checkpoint recovery model. |
| The Failure | A regional power grid failure simultaneously killed both the primary and failover database servers mid-transaction. Millions of in-flight trades (account debits + credits) were in the buffer pool β never flushed to disk. |
| The Impact | All in-memory buffer pool data was destroyed. Without WAL, billions of dollars in trades would be unaccounted for β accounts debited without corresponding credits, creating insolvency. |
| The Recovery | PostgreSQL's WAL had durably recorded every operation. On restart: Analysis identified the crash LSN in 8 seconds. REDO phase rebuilt the exact pre-crash buffer pool state from the WAL in 2 minutes. UNDO phase rolled back 47 incomplete transactions in 12 seconds. Total recovery time: 2 minutes 20 seconds. |
| The Lesson | A properly tuned WAL + ARIES implementation is mathematically indestructible against crash failures. The only vulnerabilities are: (1) corrupted WAL files (prevented by RAID storage), and (2) software bugs in the recovery code itself (prevented by formal verification). |
Key Statistics & Industry Data (2026)
- WAL Throughput β Modern NVMe SSDs have accelerated WAL sequential write speeds, allowing high-performance databases to sustain 250,000+ log record commits per second on optimized hardware. (Source: PostgreSQL Performance Benchmarks, 2025)
- Checkpoint RTO Improvement β By implementing frequent fuzzy checkpoints (every 5 minutes instead of every 30), enterprise databases reduce their Recovery Time Objective (RTO) by up to 85% β from 30-minute restarts to under 5-minute restarts. (Source: Oracle Database Engineering Blog, 2025)
- ARIES Industry Dominance β Over 95% of global enterprise relational databases (PostgreSQL, Oracle, IBM DB2, SQL Server, MySQL InnoDB) use ARIES or a direct derivative as their crash recovery algorithm. (Source: VLDB Survey on RDBMS Internals, 2024)
- WAL Replication β WAL streaming replication is used to maintain hot standby replicas with sub-second replication lag in high-availability PostgreSQL deployments β the same mechanism that prevents data loss on primary failure. (Source: AWS RDS PostgreSQL Architecture, 2026)
Where Crash Recovery Mechanisms Are Applied
OLTP Financial Systems
Banking and trading platforms use ARIES/WAL to guarantee that account debit operations are atomically paired with credits β an incomplete transfer due to a crash is detected and reversed during UNDO.
WAL-Based Replication
PostgreSQL Streaming Replication and MySQL Binary Log Replication both use the WAL/binlog as a real-time change feed β replica servers continuously apply REDO from the primary, keeping hot standbys in sync within milliseconds.
Point-in-Time Recovery (PITR)
DBAs restore from a full backup and replay the archived WAL forward to any specific timestamp β recovering the database to the exact second before an accidental DROP TABLE or data corruption event.
Cloud Database Snapshots
AWS RDS, Google Cloud SQL, and Azure SQL Database use WAL-backed snapshots β backup is a fuzzy checkpoint snapshot, and WAL archives allow restoration to any point between snapshots.
Copy-on-Write Filesystems
ZFS and Btrfs filesystems use shadow-paging-inspired Copy-on-Write at the filesystem level β never overwriting existing data blocks, making the entire filesystem crash-consistent without a separate log.
Advantages of Log-Based Recovery (WAL + ARIES)
- Mathematical ACID Guarantee β WAL + ARIES provides a formal, provably correct implementation of Atomicity and Durability for any combination of concurrent transactions
- Fast Normal Operations β sequential WAL writes are dramatically faster than random data page writes; the database can confirm commits in microseconds while deferring physical page writes
- Idempotent Recovery β ARIES's LSN-based PageLSN comparison makes recovery safe even if the server crashes during the recovery process itself β no data corruption from repeated restarts
- Replication Foundation β the same WAL used for local crash recovery doubles as the data feed for streaming replication to standby servers, providing both HA and disaster recovery from a single mechanism
- Point-in-Time Recovery β WAL archives enable restoration to any arbitrary past moment, not just the most recent backup β a critical capability for accidental data deletion recovery
Limitations & Challenges
- Sequential I/O bottleneck β every transaction must wait for at least one synchronous WAL flush to disk before the commit can be acknowledged β on slow disks this becomes the critical path
- WAL storage growth β active OLTP systems generate gigabytes of WAL per hour; without automated WAL archival and purging, the WAL directory fills the disk and crashes the database
- Recovery time scales with checkpoint interval β infrequent checkpoints mean longer log replay during recovery; tuning checkpoint frequency is a critical DBA responsibility
- Complexity β ARIES with CLRs, fuzzy checkpoints, and concurrent transaction handling is one of the most complex algorithms in computer science; bugs in recovery code are catastrophic
- Shadow paging fragmentation β the alternative (shadow paging) avoids log overhead but scatters data pages randomly across disk, destroying read performance through fragmentation over time
Quick Reference Cheat Sheet
Complete crash recovery terminology for exams and interviews.
| Term | Definition | Exam Tip |
|---|---|---|
| WAL | Write-Ahead Logging β log record must be written to durable storage before the data page is modified | The single rule that makes all log-based recovery possible |
| REDO | Replays committed operations to rebuild data that was in RAM but never flushed to disk before crash | Forward log scan; applies NewValue from log records |
| UNDO | Reverses operations from uncommitted (loser) transactions to restore pre-transaction state | Backward log scan; applies OldValue from log records |
| Checkpoint | A logged synchronization point ensuring all earlier dirty pages are safely on disk | Recovery only reads log from the last checkpoint β not from the beginning of time |
| Fuzzy Checkpoint | Checkpoint that flushes dirty pages gradually in the background while transactions continue | No stall; used by PostgreSQL, MySQL, Oracle to avoid performance spikes |
| ARIES | Algorithm for Recovery and Isolation Exploiting Semantics (IBM, 1992) β 3 phases: Analysis, REDO, UNDO | REDO repeats history for ALL transactions; UNDO only reverses losers |
| LSN | Log Sequence Number β monotonically increasing ID assigned to every WAL record | PageLSN β₯ LSN_record β skip REDO (idempotent check) |
| PageLSN | The LSN of the last WAL record that modified a disk page β stored in the page header | Key field for ARIES idempotence β prevents double-REDO |
| CLR | Compensation Log Record β written to WAL for every UNDO operation performed during recovery | CLRs are never themselves undone β they make UNDO idempotent |
| Dirty Page | A buffer pool page modified in RAM but not yet written to the physical database file on disk | Source of REDO need β dirty pages lost at crash must be rebuilt |
| Winner Tx | A transaction that had a COMMIT record in the log at crash time | Needs REDO only β its effects must be preserved |
| Loser Tx | A transaction with no COMMIT record in the log at crash time | Needs UNDO β its effects must be completely reversed |
| Shadow Paging | Writes updates to new disk pages; atomic pointer swap commits; no swap = instant discard on crash | No log needed but causes fragmentation; poor concurrency; mostly obsolete |
| PITR | Point-in-Time Recovery β restore backup + replay WAL to any specific timestamp | Used to recover from accidental DROP TABLE or data corruption to any historical second |
Frequently Asked Questions (FAQ)
Q.What happens if the system crashes while writing the log itself?
Q.Why do we have to REDO transactions that already committed?
Q.What is the difference between a Rollback and Crash Recovery?
Q.What is a Fuzzy Checkpoint?
Q.What is a Log Sequence Number (LSN)?
Q.Is Shadow Paging still used in modern databases?
Q.What is Point-in-Time Recovery (PITR)?
Related Topics
Test Your Knowledge
Ready to prove your skills? Take our rigorous multiple-choice quiz designed to test your understanding of this topic and prepare you for interviews.