Distributed Databases

1. What is a Distributed Database?

A distributed database is a database whose data is stored across multiple sites (computers) connected through a network.

Each site:

  • Has its own DBMS

  • Can work independently

  • Can store part of the data or full copies

Example:
A bank has branches in Mumbai, Delhi, and Chennai.
Each branch stores account information of its local customers, but can also access data from other cities when needed.

Difference from parallel databases:

  • Parallel DB = tightly coupled processors working on a single database

  • Distributed DB = loosely coupled sites, each running its own DBMS


2. Types of Distributed Databases

a. Homogeneous Distributed DB

  • All sites run the same DBMS

  • All sites agree to cooperate

  • Same schema everywhere

Example:
All branches of a bank use MySQL with the same schema.

b. Heterogeneous Distributed DB

  • Sites may use different DBMSs and different schemas

  • Harder to integrate

Example:
One site uses PostgreSQL, another uses Oracle.


3. Distributed Data Storage

A relation can be stored at multiple sites using:

  1. Replication

  2. Fragmentation

  3. A combination of both


3.1 Replication

Copies of the relation are stored at multiple sites.

Advantages

  1. High availability
    If one site fails, another has the data.

  2. Faster read performance
    Read queries can be processed at multiple places.

  3. Less data transfer
    Data is usually available locally.

Disadvantages

  1. Update overhead
    All copies must be updated.

  2. Complex concurrency control
    Must keep all replicas consistent.

Example:
Account details are copied at 3 branches.
If one branch goes offline, the data is still available.
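To see the trade-off concretely, here is a minimal Python sketch (the site names and account data are made up; each site's copy is just an in-memory dictionary): a read uses only the local copy, while an update must touch every copy.

# Full replication: every site stores a complete copy of the data.
# Site names and account data are hypothetical.
replicas = {
    "Mumbai":  {"A-101": 5000},
    "Delhi":   {"A-101": 5000},
    "Chennai": {"A-101": 5000},
}

def read(account, local_site):
    # Reads are served from the local copy: fast, no data transfer.
    return replicas[local_site][account]

def write(account, value):
    # An update must be applied to every copy (the update overhead).
    for site in replicas:
        replicas[site][account] = value

write("A-101", 7000)
print(read("A-101", "Chennai"))   # 7000, served locally at Chennai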


3.2 Fragmentation

A relation is split into parts, and each part is stored at a different site.

Two types:

a. Horizontal Fragmentation

Split rows based on a condition.

Example:
account table split by branch:

  • account1 = accounts of Mumbai branch

  • account2 = accounts of Delhi branch

To reconstruct original:
account = account1 ∪ account2
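A small Python sketch of the same idea, with rows as dictionaries (the account numbers and branches are hypothetical):

# Rows of the account relation (hypothetical data).
account = [
    {"account_number": 10, "branch": "Mumbai", "balance": 5000},
    {"account_number": 20, "branch": "Delhi",  "balance": 7000},
]

# Horizontal fragments: rows selected by a predicate on branch.
account1 = [r for r in account if r["branch"] == "Mumbai"]   # stored at Mumbai
account2 = [r for r in account if r["branch"] == "Delhi"]    # stored at Delhi

# Reconstruction is the union of the fragments.
reconstructed = account1 + account2
assert sorted(r["account_number"] for r in reconstructed) == [10, 20]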

b. Vertical Fragmentation

Split columns, but keep the primary key in each fragment so the original relation can be rebuilt with a join.

Example:
employee_info split into:

  • employee_public_info (id, name, designation)

  • employee_private_info (id, salary)

Reconstruction:
employee_info = employee_public_info ⋈ employee_private_info
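A matching Python sketch of vertical fragmentation (the employee data is made up); note that the key id appears in both fragments so a join can rebuild the original rows:

# Rows of employee_info (hypothetical data).
employee_info = [
    {"id": 1, "name": "Asha", "designation": "Clerk",   "salary": 30000},
    {"id": 2, "name": "Ravi", "designation": "Manager", "salary": 90000},
]

# Vertical fragments: column subsets, each keeping the key (id).
public_info  = [{"id": e["id"], "name": e["name"], "designation": e["designation"]}
                for e in employee_info]
private_info = [{"id": e["id"], "salary": e["salary"]} for e in employee_info]

# Reconstruction is a natural join on the key.
by_id = {p["id"]: p for p in private_info}
reconstructed = [{**pub, **by_id[pub["id"]]} for pub in public_info]
assert reconstructed == employee_info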


3.3 Transparency

The user should not need to know:

  • Where data is stored

  • How many fragments/replicas exist

  • How data is accessed

Types:

  1. Fragmentation transparency

  2. Replication transparency

  3. Location transparency

Example:
User just writes:
SELECT * FROM account WHERE account_number = 20;
The system finds the right site automatically.
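Behind the scenes the system consults a catalog that maps data to sites. A hypothetical Python sketch of such a lookup (the ranges and site names are invented for illustration):

# Hypothetical catalog: which site stores which range of account numbers.
catalog = [
    (range(1, 100),   "Mumbai"),
    (range(100, 200), "Delhi"),
    (range(200, 300), "Chennai"),
]

def site_for(account_number):
    # The user never sees this lookup: that is location transparency.
    for key_range, site in catalog:
        if account_number in key_range:
            return site
    raise KeyError(account_number)

print(site_for(20))   # Mumbai: the query is routed there automatically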


4. Distributed Transactions

A transaction that accesses data at multiple sites.

Types:

  • Local transactions: access data at a single site

  • Global transactions: access data at multiple sites

To maintain ACID across sites, all sites must coordinate.


4.1 Transaction Manager and Coordinator

Each site has:

  • Transaction Manager (TM)
    Handles local execution, logging, and concurrency.

  • Transaction Coordinator (TC)
    Starts transactions, sends subtransactions to sites, coordinates commit or abort.


5. Two-Phase Commit Protocol (2PC)

Guarantees:
A transaction either commits at all sites, or aborts at all sites.

Phase 1: Prepare Phase

The coordinator writes <prepare T> to its log and asks each participating site: "Are you ready to commit T?"

Each site replies:

  • YES (<ready T>) → the site writes <ready T> to its log and guarantees it can commit

  • NO (<abort T>) → the transaction must be aborted

Phase 2: Commit Phase

  • If all sites say YES → the coordinator commits T at every site

  • If any site says NO (or does not reply) → the coordinator aborts T at every site
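A toy Python sketch of the coordinator's decision rule. It only models the voting logic; a real implementation also force-writes the log records (<prepare T>, <ready T>, <commit T>) at each step and handles timeouts:

class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
        self.state = "INIT"

    def prepare(self):                 # Phase 1: vote YES or NO
        self.state = "READY" if self.will_commit else "ABORTED"
        return self.will_commit

    def finish(self, decision):        # Phase 2: obey the global decision
        self.state = decision

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]        # Phase 1
    decision = "COMMIT" if all(votes) else "ABORT"     # unanimous YES required
    for p in participants:                             # Phase 2
        p.finish(decision)
    return decision

sites = [Participant("flight"), Participant("hotel", will_commit=False)]
print(two_phase_commit(sites))   # ABORT -> both sites roll back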

Drawback:

If the coordinator fails after sites have voted YES, those sites must hold their locks and wait for it to recover → the blocking problem

Example:
Booking a flight + hotel:
If either site fails before the commit decision, both bookings must be rolled back.


6. Three-Phase Commit (3PC)

Adds an extra (pre-commit) phase so that sites can reach a decision even if the coordinator fails, avoiding blocking.
It is more complex and slower, and is rarely used in practice.


7. Concurrency Control in Distributed Systems

Goal: ensure correctness when multiple transactions run at different sites.

Main approaches:


7.1 Locking Methods

a. Single Lock Manager

A single site manages all locks.
Simple, but it becomes a bottleneck and a single point of failure.

b. Distributed Lock Manager

Each site manages locks for its own data.
Requires global deadlock detection.

c. Primary Copy

For each replicated data item, one designated site (the primary copy) handles all lock requests for that item.

Example:
For account A, Mumbai branch is primary.
All locks for A go to Mumbai.

d. Majority Protocol

To read or write an item, a transaction must lock a majority of its replicas.

Example:
If an item has 3 replicas → locks on at least 2 of them are needed.
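A quick Python check of the majority rule; because any two majorities of the same replicas overlap in at least one site, two conflicting lock requests can never both succeed:

def majority_needed(n_replicas):
    # Smallest number of replicas that forms a majority.
    return n_replicas // 2 + 1

print(majority_needed(3))   # 2
print(majority_needed(5))   # 3

# Two majorities of 3 replicas always share at least one replica:
assert 2 * majority_needed(3) > 3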

e. Biased Protocol

A shared lock needs only one replica, while an exclusive lock requires locking all replicas; this favours reads over writes.

f. Quorum Consensus

Each replica is assigned a weight. A read must gather a read quorum Qr and a write a write quorum Qw such that Qr + Qw > S and 2 × Qw > S, where S is the total weight. Choosing Qr and Qw lets the system trade read cost against write cost.
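A minimal Python check of these rules, with hypothetical weights for three replicas:

# Hypothetical weights for the three replicas of one data item.
weights = {"Mumbai": 2, "Delhi": 1, "Chennai": 1}
S = sum(weights.values())   # total weight = 4

def valid_quorums(q_read, q_write):
    # Read/write quorums must overlap, and any two write quorums must overlap.
    return q_read + q_write > S and 2 * q_write > S

print(valid_quorums(2, 3))   # True: cheap reads, expensive writes
print(valid_quorums(1, 3))   # False: a read could miss the latest write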


7.2 Timestamp Ordering

Each transaction is assigned a unique global timestamp when it begins.
Conflicting operations are executed in timestamp order.
Timestamps are generated from logical clocks (a local counter combined with the site ID) or from synchronized physical clocks.
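One common way to build such timestamps is a Lamport-style logical clock: a local counter paired with the site ID so that timestamps are globally unique. A minimal sketch (the site IDs and message exchange are invented for illustration):

class LogicalClock:
    def __init__(self, site_id):
        self.counter, self.site_id = 0, site_id

    def now(self):
        # Timestamp = (counter, site_id); the site ID breaks ties,
        # so timestamps are globally unique.
        self.counter += 1
        return (self.counter, self.site_id)

    def observe(self, remote_counter):
        # On receiving a message, move the local counter forward.
        self.counter = max(self.counter, remote_counter)

mumbai, delhi = LogicalClock(1), LogicalClock(2)
t1 = mumbai.now()
delhi.observe(t1[0])
t2 = delhi.now()
print(t1, t2, t1 < t2)   # (1, 1) (2, 2) True -> t2 is ordered after t1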


7.3 Replica Consistency (Weak forms)

Master–slave replication

  • Updates at primary

  • Read from replicas

  • Replicas updated periodically

Multimaster (update-anywhere)

  • Writes can happen anywhere

  • Conflict resolution needed

  • Harder to maintain consistency


7.4 Deadlock Handling

Each site keeps a local wait-for graph.
A global deadlock may span several sites, so no single local graph shows a cycle.
A deadlock-detection coordinator collects the local graphs, merges them into a global graph, and checks it for cycles.
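A small Python sketch of that merge-and-check step (the transaction and site names are hypothetical):

# Local wait-for graphs: edges mean "Ti waits for Tj".
site1 = {"T1": ["T2"]}          # at site 1, T1 waits for T2
site2 = {"T2": ["T1"]}          # at site 2, T2 waits for T1

def merge(*graphs):
    merged = {}
    for g in graphs:
        for t, waits in g.items():
            merged.setdefault(t, []).extend(waits)
    return merged

def has_cycle(graph):
    # Depth-first search for a back edge in the (global) graph.
    visiting, done = set(), set()
    def dfs(node):
        visiting.add(node)
        for nxt in graph.get(node, []):
            if nxt in visiting or (nxt not in done and dfs(nxt)):
                return True
        visiting.discard(node)
        done.add(node)
        return False
    return any(dfs(t) for t in list(graph) if t not in done)

print(has_cycle(site1), has_cycle(site2))   # False False: no local cycle
print(has_cycle(merge(site1, site2)))       # True: a global deadlock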

Problems:

  • False deadlocks: delayed or outdated messages can make the coordinator report a cycle that no longer exists, causing unnecessary rollbacks


8. Availability in Distributed Databases

Goal: database should keep working despite failures.

Failures may include:

  • Site failure

  • Link failure

  • Network partition

  • Lost messages

System must:

  1. Detect failure

  2. Reconfigure

  3. Recover


8.1 Majority-Based Approach for Availability

A read must access a majority of the replicas and uses the copy with the highest version number.
A write updates a majority of the replicas and increments the version number.
After a failure, a recovering site catches up automatically the next time its replica takes part in a majority operation.
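A toy Python sketch of the idea, with three in-memory replicas and no simulated failures (the site names are invented):

import random

# Each replica stores (version, value) for one data item.
replicas = {"Mumbai": (0, None), "Delhi": (0, None), "Chennai": (0, None)}
MAJORITY = len(replicas) // 2 + 1

def write(value):
    chosen = random.sample(list(replicas), MAJORITY)
    # New version = highest version seen in the majority + 1.
    new_version = max(replicas[s][0] for s in chosen) + 1
    for s in chosen:
        replicas[s] = (new_version, value)

def read():
    chosen = random.sample(list(replicas), MAJORITY)
    # The highest-versioned copy in any majority is the latest write,
    # because any two majorities overlap in at least one replica.
    return max(replicas[s] for s in chosen)[1]

write("balance=5000")
write("balance=7000")
print(read())   # always balance=7000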


8.2 Read-One Write-All Available

A read can use any single replica.
A write updates all replicas that are currently reachable.
This gives high availability, but during a network partition each side may update its own replicas, leading to inconsistencies.


8.3 Site Reintegration

When a site recovers:

  • Must update catalog

  • Must get latest replica values

  • May need to replay missed updates


9. Coordinator Selection

If a coordinator fails, a new one must be chosen.

Bully Algorithm

  • Highest-numbered active site becomes coordinator

  • A recovering site with higher ID can "bully" its way to leadership
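A simplified Python sketch of the election rule as seen from one site; real implementations exchange ELECTION / OK / COORDINATOR messages, which are omitted here:

def elect_coordinator(site_id, all_sites, alive):
    # Return the new coordinator as seen from site_id (a simplification:
    # failure detection is modelled by the `alive` set).
    higher_alive = [s for s in all_sites if s > site_id and s in alive]
    if higher_alive:
        # A higher-numbered live site bullies its way to leadership.
        return max(higher_alive)
    return site_id   # no one higher is alive: this site wins

all_sites = [1, 2, 3, 4, 5]
alive = {1, 2, 3}            # sites 4 and 5 have failed
print(elect_coordinator(2, all_sites, alive))   # 3 becomes coordinator

alive.add(5)                 # site 5 recovers...
print(elect_coordinator(2, all_sites, alive))   # ...and takes over: 5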

Problem:
Each partition may elect its own coordinator unless majority checking is enforced.


10. CAP Theorem (Consistency, Availability, Partition Tolerance)

A distributed system can guarantee at most two of the following three properties:

  1. Consistency

  2. Availability

  3. Partition Tolerance

Partition tolerance is unavoidable → must choose between:

  • Consistency (banks, finance)

  • Availability (social networks, caches)

Example:
Social network prefers availability → shows stale data instead of failing.
