Question 1: Functional Dependency Closure Analysis
Problem: Given relation R(A, B, C, D, E, F) with functional dependencies: F = {A → BC, CD → E, B → D, E → AF}
a) Find the closure of attribute set {A}⁺
b) Find all candidate keys for R
c) Determine if the dependency B → F can be derived from F
d) Calculate the minimal cover Fc of F
Solution:
Part (a): Finding {A}⁺
- Start: {A}⁺ = {A}
- Apply A → BC: {A}⁺ = {A, B, C}
- Apply B → D: {A}⁺ = {A, B, C, D}
- Apply CD → E: Since C, D ∈ {A}⁺, add E: {A}⁺ = {A, B, C, D, E}
- Apply E → AF: Since E ∈ {A}⁺, add F: {A}⁺ = {A, B, C, D, E, F}
Therefore, {A}⁺ = {A, B, C, D, E, F}
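The fixpoint computation above is mechanical and easy to script. A minimal Python sketch (the closure helper and the FD encoding below are illustrative, not part of the original problem):

# Illustrative sketch: attribute closure under a set of FDs given as (lhs, rhs) attribute sets.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:   # FD applies and adds something new
                result |= rhs
                changed = True
    return result

F1 = [({'A'}, {'B', 'C'}), ({'C', 'D'}, {'E'}), ({'B'}, {'D'}), ({'E'}, {'A', 'F'})]
print(sorted(closure({'A'}, F1)))   # ['A', 'B', 'C', 'D', 'E', 'F']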
Part (b): Finding candidate keys
Since {A}⁺ = R, A is a superkey, and a single attribute is trivially minimal, so {A} is a candidate key. Checking the other attribute sets:
- {E}⁺ = R (E → AF, then A → BC, B → D), so {E} is a candidate key
- {B, C}⁺ = R (B → D, then CD → E, E → AF), so {B, C} is a candidate key
- {C, D}⁺ = R (CD → E, E → AF, A → BC), so {C, D} is a candidate key
- No attribute set reaches R without containing one of these ({B}⁺ = {B, D}, {C}⁺ = {C}, {D}⁺ = {D}, {F}⁺ = {F}, and the remaining pairs {B, D}, {B, F}, {C, F}, {D, F} also fail)
Candidate keys: {A}, {E}, {B, C}, {C, D}
Part (c): Can B → F be derived?
Find {B}⁺:
- {B}⁺ = {B}
- Apply B → D: {B}⁺ = {B, D}
- No more rules apply
- Since F ∉ {B}⁺, B → F cannot be derived
Part (d): Minimal cover Fc
- Convert to singleton right-hand sides: {A → B, A → C, CD → E, B → D, E → A, E → F}
- Remove extraneous LHS attributes: in CD → E, neither C nor D is extraneous ({C}⁺ = {C}, {D}⁺ = {D}), so all left-hand sides are already minimal
- Remove redundant FDs: none can be dropped; for example, without E → A we get {E}⁺ = {E, F}, which no longer contains A, and the same test fails for every other FD
Fc = {A → B, A → C, CD → E, B → D, E → A, E → F}
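Each redundancy claim above can be verified mechanically: drop one FD and test whether its right-hand side is still in the closure of its left-hand side. A small sketch reusing the closure helper from Part (a); every FD comes back as needed:

# Redundancy test over the singleton-RHS cover (requires closure() from the Part (a) sketch).
Fc = [({'A'}, {'B'}), ({'A'}, {'C'}), ({'C', 'D'}, {'E'}),
      ({'B'}, {'D'}), ({'E'}, {'A'}), ({'E'}, {'F'})]
for i, (lhs, rhs) in enumerate(Fc):
    rest = Fc[:i] + Fc[i + 1:]                # cover with this FD removed
    redundant = rhs <= closure(lhs, rest)     # still derivable without it?
    print(lhs, '->', rhs, 'redundant' if redundant else 'needed')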
Question 2: BCNF Decomposition with Dependency Analysis
Problem: Consider relation R(Student, Professor, Course, Room, Time, Building) with FDs: F = { Student, Course → Professor, Professor → Room, Room → Building, Professor, Time → Course, Course → Time }
a) Prove that R is not in BCNF
b) Decompose R into BCNF using the decomposition algorithm
c) Analyze which dependencies are lost in the decomposition
d) Calculate the space complexity increase factor if the original relation had n tuples
Solution:
Part (a): Proving R is not in BCNF
First find candidate keys by computing closures:
- {Student, Course}⁺ = {Student, Course, Professor, Room, Building, Time} = R ✓
- {Student, Professor, Time}⁺ = {Student, Professor, Time, Course, Room, Building} = R ✓ (no proper subset is a superkey)
- Student appears on no right-hand side, so it must belong to every key; no other minimal combinations give the full closure
Candidate keys: {Student, Course}, {Student, Professor, Time}
Check BCNF violations (every non-trivial FD must have a superkey on its left-hand side):
- Professor → Room: Professor is not a superkey → violates BCNF
- Room → Building: Room is not a superkey → violates BCNF
- Course → Time: Course is not a superkey → violates BCNF
- Professor, Time → Course: {Professor, Time} is not a superkey → violates BCNF
Part (b): BCNF Decomposition
Step 1: Decompose using Room → Building (Room is not a superkey)
- R₃(Room, Building)
- R'(Student, Professor, Course, Room, Time)
Step 2: In R', Professor → Room still violates BCNF (Professor is not a superkey of R')
- R₁(Professor, Room)
- R''(Student, Professor, Course, Time)
Step 3: In R'', Course → Time violates BCNF (Course is not a superkey of R'')
- R₂₁(Course, Time)
- R₂₂(Student, Professor, Course)
Final BCNF Relations:
- R₁(Professor, Room)
- R₂₁(Course, Time)
- R₂₂(Student, Professor, Course)
- R₃(Room, Building)
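The same result can be reached with the generic BCNF decomposition loop: pick a fragment with a violating FD and split it on the closure of the violating left-hand side. The sketch below (reusing closure from Question 1 and a brute-force subset check, fine for schemas this small) is illustrative; depending on which violation is picked first it may return a different, but still lossless, BCNF decomposition.

from itertools import combinations

def bcnf_decompose(attrs, fds):
    # Split fragments until no subset X of a fragment R satisfies X ⊂ (X⁺ ∩ R) ⊂ R.
    fragments, result = [frozenset(attrs)], []
    while fragments:
        R = fragments.pop()
        violation = None
        for k in range(1, len(R)):
            for X in map(set, combinations(R, k)):
                Xc = closure(X, fds) & R
                if X < Xc and Xc < R:          # X determines more of R, but not all of it
                    violation = (X, Xc)
                    break
            if violation:
                break
        if violation:
            X, Xc = violation
            fragments.append(frozenset(Xc))             # fragment (X, X⁺ ∩ R)
            fragments.append(frozenset((R - Xc) | X))   # the rest keeps X as join key
        else:
            result.append(R)
    return result

F2 = [({'Student', 'Course'}, {'Professor'}), ({'Professor'}, {'Room'}),
      ({'Room'}, {'Building'}), ({'Professor', 'Time'}, {'Course'}),
      ({'Course'}, {'Time'})]
R2 = {'Student', 'Professor', 'Course', 'Room', 'Time', 'Building'}
for frag in bcnf_decompose(R2, F2):
    print(sorted(frag))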
Part (c): Lost Dependencies Analysis
Original FDs in F:
- ✓ Student, Course → Professor: Preserved in R₂₂
- ✓ Professor → Room: Preserved in R₁
- ✓ Room → Building: Preserved in R₃
- ❌ Professor, Time → Course: LOST (Professor and Time never appear together in any fragment, and the FD does not follow from the projected dependencies)
- ✓ Course → Time: Preserved in R₂₁
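Whether a given FD survives a decomposition can be checked with the standard dependency-preservation test: grow the left-hand side using only the attributes visible inside each fragment. A sketch (reusing closure and F2 from the previous sketches); it reports Professor, Time → Course as lost and the others as preserved:

def preserved(lhs, rhs, fragments, fds):
    # Grow Z from lhs, but only through attributes co-located in some fragment.
    Z = set(lhs)
    changed = True
    while changed:
        changed = False
        for Ri in fragments:
            gained = closure(Z & Ri, fds) & Ri
            if not gained <= Z:
                Z |= gained
                changed = True
    return set(rhs) <= Z

frags = [{'Professor', 'Room'}, {'Course', 'Time'},
         {'Student', 'Professor', 'Course'}, {'Room', 'Building'}]
print(preserved({'Professor', 'Time'}, {'Course'}, frags, F2))    # False -> lost
print(preserved({'Student', 'Course'}, {'Professor'}, frags, F2)) # True  -> preserved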
Part (d): Space Complexity
Original: n tuples × 6 attributes = 6n storage units
Decomposed:
- R₁: ≤ n tuples × 2 attributes = 2n
- R₂₁: ≤ n tuples × 2 attributes = 2n
- R₂₂: ≤ n tuples × 3 attributes = 3n
- R₃: ≤ n tuples × 2 attributes = 2n
Total: ≤ 9n storage units
Space increase factor: 9n / 6n = 1.5x
Question 3: 3NF vs BCNF Trade-off Analysis
Problem: Banking system relation Account_Transaction(Account_ID, Branch_ID, Manager_ID, Transaction_ID, Amount, Date, Customer_SSN) with dependencies:
F = { Account_ID → Branch_ID, Account_ID → Customer_SSN, Branch_ID → Manager_ID, Transaction_ID → {Account_ID, Amount, Date} }
a) Determine if the relation is in 3NF and BCNF
b) If not in BCNF, provide both 3NF and BCNF decompositions
c) Calculate the join cost for a query retrieving all transaction details with manager information for transactions > $10,000
d) Recommend which normal form to use for this banking system
Solution:
Part (a): Normal Form Analysis
Finding Candidate Keys:
- {Transaction_ID}⁺ = {Transaction_ID, Account_ID, Amount, Date, Branch_ID, Customer_SSN, Manager_ID} = R
- Candidate key: {Transaction_ID}
3NF Check: for every FD X → A, either X must be a superkey or A must be a prime attribute; the only prime attribute here is Transaction_ID
- Account_ID → Branch_ID: Account_ID is not a superkey and Branch_ID is not prime → Violates 3NF
BCNF Check:
- Account_ID → Branch_ID: Account_ID not superkey → Violates BCNF
- Branch_ID → Manager_ID: Branch_ID not superkey → Violates BCNF
Neither 3NF nor BCNF
Part (b): Decompositions
3NF Decomposition (Synthesis method):
- Minimal cover: same as F (already minimal)
- Group FDs by left-hand side:
  - R₁(Account_ID, Branch_ID, Customer_SSN)
  - R₂(Branch_ID, Manager_ID)
  - R₃(Transaction_ID, Account_ID, Amount, Date)
- Candidate key Transaction_ID already appears in R₃, so no separate key relation is needed ✓
BCNF Decomposition:
- Decompose using Branch_ID → Manager_ID (Branch_ID is not a superkey):
  - R₃(Branch_ID, Manager_ID)
  - R'(Transaction_ID, Account_ID, Branch_ID, Customer_SSN, Amount, Date)
- In R', Account_ID → {Branch_ID, Customer_SSN} violates BCNF (Account_ID is not a superkey of R'):
  - R₁(Account_ID, Branch_ID, Customer_SSN)
  - R₂(Transaction_ID, Account_ID, Amount, Date)
- R₁ can be split further on Account_ID → Customer_SSN; both fragments keep Account_ID as their key and remain in BCNF:
  - R₁₁(Account_ID, Customer_SSN)
  - R₁₂(Account_ID, Branch_ID)
Final BCNF: R₁₁(Account_ID, Customer_SSN), R₁₂(Account_ID, Branch_ID), R₂(Transaction_ID, Account_ID, Amount, Date), R₃(Branch_ID, Manager_ID)
Part (c): Join Cost Analysis
Query: Find all transaction details with manager info for Amount > $10,000
3NF Cost:
- Filter R₃ on Amount > $10,000: O(n), where n = number of transactions
- Join result ⋈ R₁ on Account_ID: O(n)
- Join result ⋈ R₂ on Branch_ID: O(n)
- Total: 2 joins, O(3n)
BCNF Cost:
- Filter R₂ on Amount > $10,000: O(n)
- Join result ⋈ R₁₁ on Account_ID: O(n)
- Join result ⋈ R₁₂ on Account_ID: O(n)
- Join result ⋈ R₃ on Branch_ID: O(n)
- Total: 3 joins, O(4n)
Part (d): Recommendation
Use 3NF because:
- Dependency preservation is guaranteed by the synthesis algorithm
- Fewer relations mean a lower join cost for common queries (2 joins vs. 3 in the BCNF schema)
- Banking systems require strict constraint enforcement
- The slightly higher redundancy is acceptable for the performance gain
Question 5: Advanced Decomposition Algorithm Analysis
Problem: Given relation R(A, B, C, D, E, F, G) with FDs: F = {AB → C, AC → B, AD → E, B → D, C → F, E → G}
a) Prove that the 3NF decomposition algorithm produces a lossless join
b) Find the maximum number of relations in any 3NF decomposition of R
c) Determine if there exists a dependency-preserving BCNF decomposition
d) If multiple 3NF decompositions exist, which minimizes join cost for queries accessing all attributes?
Solution:
Part (a): Lossless Join Proof
Step 1: Find minimal cover Fc = F (already minimal after checking)
Step 2: Find candidate keys
- {AB}⁺ = {A, B, C, D, E, F, G} = R
- {AC}⁺ = {A, B, C, D, E, F, G} = R
- Check minimality: A appears on no right-hand side, so every key must contain A, and no single attribute is a superkey, so AB and AC are minimal
- Candidate keys: {AB}, {AC}
Step 3: 3NF Decomposition
Group FDs by left-hand side (schemas with identical attribute sets are merged):
- R₁(A, B, C) from AB → C and AC → B
- R₂(A, D, E) from AD → E
- R₃(B, D) from B → D
- R₄(C, F) from C → F
- R₅(E, G) from E → G
Step 4: Lossless join
The candidate key {AB} is contained in R₁(A, B, C), which guarantees the lossless-join property of the synthesis decomposition.
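The synthesis steps above follow the textbook algorithm directly: one schema per FD of the minimal cover, merge identical schemas, drop schemas contained in others, and add a candidate key relation only if no schema already contains one. A hedged Python sketch of that procedure (function names are illustrative):

def synthesize_3nf(min_cover, candidate_keys):
    # One schema per FD; merge duplicates and drop schemas contained in another.
    schemas = []
    for lhs, rhs in min_cover:
        s = frozenset(lhs | rhs)
        if s not in schemas:
            schemas.append(s)
    schemas = [s for s in schemas if not any(s < t for t in schemas)]
    # Add a candidate key relation only if no schema already contains a key.
    if not any(set(k) <= s for k in candidate_keys for s in schemas):
        schemas.append(frozenset(candidate_keys[0]))
    return schemas

Fc5 = [({'A', 'B'}, {'C'}), ({'A', 'C'}, {'B'}), ({'A', 'D'}, {'E'}),
       ({'B'}, {'D'}), ({'C'}, {'F'}), ({'E'}, {'G'})]
for s in synthesize_3nf(Fc5, [{'A', 'B'}, {'A', 'C'}]):
    print(sorted(s))    # ABC, ADE, BD, CF, EG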
Part (b): Maximum Relations Count
The synthesis algorithm creates at most one relation per FD of the minimal cover, plus at most one extra relation to contain a candidate key if no generated schema already does. With |Fc| = 6: maximum 7 relations (6 from FDs + 1 for the key). Here the key AB is already contained in R₁ and the schemas for AB → C and AC → B coincide, so the synthesis above actually produces 5.
Part (c): BCNF Dependency Preservation
Check each relation of the 3NF decomposition for BCNF:
- R₁(A, B, C): FDs AB → C, AC → B; keys are AB and AC → BCNF ✓
- R₂(A, D, E): FD AD → E; key is AD → BCNF ✓
- R₃(B, D): FD B → D; key is B → BCNF ✓
- R₄(C, F): FD C → F; key is C → BCNF ✓
- R₅(E, G): FD E → G; key is E → BCNF ✓
Yes, this 3NF decomposition is also in BCNF and dependency-preserving.
Part (d): Join Cost Optimization
For queries accessing all attributes, the number of joins equals the number of relations minus one, so the 5-relation decomposition above needs 4 joins. Merging relations reduces joins, for example:
- R₁(A, B, C, D) from AB → C, AC → B, B → D
- R₂(A, D, E, G) from AD → E, E → G
- R₃(C, F) from C → F
This 3-relation schema needs only 2 joins, but it is no longer in 3NF: in R₁, B → D is a partial dependency on the key AB, and in R₂, E → G is transitive via AD → E. Among valid 3NF decompositions, the 5-relation synthesis is already minimal; the 3-relation grouping is a deliberate denormalization that can only be justified for query optimization.
Question 6: Probabilistic Database Consistency Analysis
Problem: A distributed database system maintains replicas across 3 data centers. Updates are propagated with probability p = 0.8 per replica per time unit. A normalization violation occurs when replicas are inconsistent.
Given relation R(ID, Name, Department, Manager) where Department → Manager:
a) If the relation is not normalized and has 1000 tuples with 50 departments, calculate expected inconsistencies after 3 time units
b) Compare with normalized relations R₁(ID, Name, Department) and R₂(Department, Manager)
c) What's the optimal replication strategy to minimize inconsistency probability below 0.01?
Solution:
Part (a): Denormalized Inconsistency Calculation
Setup:
- 1000 tuples, 50 departments → average 20 tuples per department
- Each department's manager info is replicated 20 times
- Probability of successful update per replica per time unit: p = 0.8
- After 3 time units: effective success probability per replica = 1 - (1 - 0.8)³ = 1 - 0.008 = 0.992
Inconsistency Analysis:
For each department group of 20 tuples across 3 replicas:
- Total state combinations: 3²⁰ (each tuple can be in 3 states across replicas)
- Consistent states: states where all manager values agree
- P(inconsistent) = 1 - P(all replicas have the same manager value)
For each department: P(inconsistent) ≈ 1 - (0.992)³ ≈ 0.024
Expected inconsistent departments: 50 × 0.024 = 1.2
Expected inconsistent tuples: 1.2 × 20 = 24
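The expected-value arithmetic in this part (and the comparison in Part (b)) follows from the independence assumption stated in the setup; a short sketch reproducing the numbers:

# Per-replica miss probability after 3 time units, then per-value inconsistency across 3 replicas.
p, time_units, replicas = 0.8, 3, 3
p_ok = 1 - (1 - p) ** time_units            # 0.992
p_inconsistent = 1 - p_ok ** replicas       # ≈ 0.024

departments, tuples_per_dept = 50, 1000 // 50
print(departments * p_inconsistent)                      # ≈ 1.2 inconsistent departments
print(departments * p_inconsistent * tuples_per_dept)    # ≈ 24 denormalized tuples affected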
Part (b): Normalized Relations Comparison
R₁(ID, Name, Department): 1000 tuples, no redundancy
- P(inconsistent) = 1 - (0.992)³ ≈ 0.024 per tuple
- Expected inconsistent tuples: 1000 × 0.024 = 24
R₂(Department, Manager): 50 tuples, no redundancy
- P(inconsistent) = 1 - (0.992)³ ≈ 0.024 per tuple
- Expected inconsistent tuples: 50 × 0.024 = 1.2
Total normalized inconsistencies: 24 + 1.2 = 25.2, versus 24 in the denormalized relation.
The totals look similar only because the 24 inconsistencies in R₁ concern the non-redundant ID/Name/Department data, which is equally exposed in either design. For the manager data itself, normalization reduces the exposure from 24 redundantly stored copies to 1.2 tuples, a 20x improvement matching the redundancy factor that was eliminated.
Part (c): Optimal Replication Strategy
Target: P(inconsistent) < 0.01 per tuple
Required: (1 - p)ᵏ < 0.01, where k = number of time units
For p = 0.8: (0.2)ᵏ < 0.01 → k > log(0.01)/log(0.2) ≈ 2.86
Strategy 1: Keep p = 0.8 and allow k = 3 propagation rounds: (0.2)³ = 0.008 < 0.01 ✓
Strategy 2: Increase the per-round success probability to p = 0.95: (0.05)³ = 0.000125 < 0.01 ✓
Strategy 3: Reduce the replication factor with stronger per-replica consistency:
- Use 2 replicas with p = 0.9
- (0.1)³ = 0.001 < 0.01 ✓
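The threshold (1 - p)ᵏ < 0.01 can be solved for k for each candidate reliability; a small sketch of the calculation behind the three strategies (the candidate p values are those discussed above):

import math
# Minimum number of propagation rounds k with (1 - p)^k < 0.01.
for p in (0.8, 0.9, 0.95):
    k_min = math.log(0.01) / math.log(1 - p)
    print(f"p = {p}: need k > {k_min:.2f} rounds")
# p = 0.8: k > 2.86   p = 0.9: k > 2.00   p = 0.95: k > 1.54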
Recommendation: Strategy 3 (fewer replicas, higher individual reliability)
Question 7: Enterprise-Scale Normalization Performance Analysis
Problem: E-commerce platform with Orders(OrderID, CustomerID, ProductID, Quantity, Price, ProductName, CustomerName, CustomerEmail, CategoryID, CategoryName) containing 10M orders, 1M customers, 100K products, 1K categories.
a) Calculate storage reduction from full normalization
b) Analyze query performance impact for: "Find all orders with customer and product details for electronics category"
c) Design hybrid normalization strategy optimizing for 80% read, 20% write workload
d) Calculate the break-even point where normalization overhead equals denormalization benefits
Solution:
Part (a): Storage Reduction Analysis
Current denormalized storage:
- 10M orders × 10 attributes × 50 bytes avg = 5GB
Normalized storage:
- Orders(OrderID, CustomerID, ProductID, Quantity, Price): 10M × 5 × 8 bytes = 400MB
- Customers(CustomerID, CustomerName, CustomerEmail): 1M × 3 × 50 bytes = 150MB
- Products(ProductID, ProductName, CategoryID): 100K × 3 × 50 bytes = 15MB
- Categories(CategoryID, CategoryName): 1K × 2 × 50 bytes = 100KB
Total normalized: ≈565MB
Storage reduction: 5000MB → 565MB ≈ 88.7% reduction
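A quick sanity check of the storage arithmetic, under the same average byte sizes assumed above:

# Back-of-envelope storage totals (byte sizes as assumed above).
MB = 10 ** 6
denormalized_mb = 10_000_000 * 10 * 50 / MB              # 5000 MB
normalized_mb = (10_000_000 * 5 * 8 +                    # Orders
                 1_000_000 * 3 * 50 +                    # Customers
                 100_000 * 3 * 50 +                      # Products
                 1_000 * 2 * 50) / MB                    # Categories
print(denormalized_mb, normalized_mb)                    # 5000.0  565.1
print(f"{1 - normalized_mb / denormalized_mb:.1%}")      # 88.7% reduction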
Part (b): Query Performance Analysis
Query: Electronics category orders with customer and product details
Denormalized approach:
- Scan 10M orders: O(10M)
- Filter CategoryName = 'Electronics': ~1M matching orders (assuming 10% of orders are electronics)
- Time complexity: O(10M); I/O: full 5GB scan
Normalized approach:
- Filter Categories: O(1K) → find the Electronics CategoryID
- Join Products: O(100K) → find electronics ProductIDs (~10K products)
- Join Orders on ProductID (index lookups per electronics product) → ~1M electronics orders
- Join Customers: O(1M) → get customer details
- Time complexity: ~O(1M) rows touched; I/O: ~500MB
Performance improvement: 10x faster, 90% less I/O
Part (c): Hybrid Strategy for 80/20 Workload
Strategy: Selective Denormalization
- Normalize core entities: Customers, Products, Categories (rarely change)
- Partially denormalize Orders:
  - Keep CustomerName in Orders (frequently accessed, low update rate)
  - Normalize ProductName out to Products (higher update rate)
  - Keep CategoryName in Orders (read-heavy, minimal updates)
- Resulting schema:
  - Orders(OrderID, CustomerID, CustomerName, ProductID, Quantity, Price, CategoryName)
  - Products(ProductID, ProductName, CategoryID)
  - Categories(CategoryID, CategoryName)
  - Customers(CustomerID, CustomerEmail) [other customer details live in Orders]
Benefits:
- 70% of queries avoid joins (customer name and category already in Orders)
- Product updates affect only the Products table
- Storage increase: 20% vs. full normalization; 60% reduction vs. full denormalization
Part (d): Break-even Analysis
Costs:
- Denormalized: higher storage cost; each logical update must be written to R redundant copies
- Normalized: extra join cost on every query; each update touches a single copy
Variables:
- S = storage cost per GB per month
- Q = queries per month
- U = updates per month
- R = redundancy factor (average tuple duplication)
- C_update = cost per redundant copy written; C_join = cost per extra join performed
Denormalized monthly cost ≈ 5GB × S + U × R × C_update
Normalized monthly cost ≈ 0.5GB × S + Q × C_join
Break-even when: Q × C_join = 4.5GB × S + U × R × C_update
For typical values (S = $0.10/GB/month, R = 5, C_update = $0.001, C_join = $0.01) and the given 80:20 workload (Q = 4U), the query-side join cost outweighs the update-side redundancy cost at any realistic volume, so full normalization is the more expensive option here.
This supports the hybrid approach from Part (c): denormalize the read-hot attributes while keeping the frequently updated ones normalized.
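A toy version of this cost model, using the assumed unit costs above (the dollar figures and the one-join-chain-per-query simplification are illustrative only), shows the query-side cost dominating at the 80/20 workload:

# Toy monthly-cost comparison under the assumed unit costs (illustrative values).
S, R, C_update, C_join = 0.10, 5, 0.001, 0.01   # $/GB-month, copies, $/copy-write, $/join

def denormalized_cost(updates):
    return 5.0 * S + updates * R * C_update     # 5 GB stored, R copies touched per update

def normalized_cost(queries):
    return 0.565 * S + queries * C_join         # 0.565 GB stored, one join chain per query

ops = 1_000                                      # example monthly volume, 80% reads / 20% writes
print(denormalized_cost(0.2 * ops), normalized_cost(0.8 * ops))   # 1.5 vs ≈ 8.06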
Question 8: Temporal Database Normalization
Problem: Time-variant employee database stores historical salary and department changes: Employee_History(EmpID, ValidFrom, ValidTo, Name, DeptID, DeptName, Salary, ManagerID, ManagerName)
Temporal functional dependencies:
- EmpID, ValidFrom → Name, DeptID, Salary
- DeptID, ValidFrom → DeptName, ManagerID
- ManagerID, ValidFrom → ManagerName
a) Define temporal normal forms analogous to traditional normal forms
b) Normalize the temporal relation
c) Design queries to find salary changes > 20% with manager approval timeline
d) Calculate versioning overhead for 100K employees over 10 years with monthly changes
Solution:
Part (a): Temporal Normal Forms
Temporal 1NF (T1NF): All attributes atomic, valid time periods non-overlapping for same entity
Temporal 2NF (T2NF): T1NF + no partial temporal dependency
- Non-temporal attributes fully dependent on (EntityKey, ValidFrom)
Temporal 3NF (T3NF): T2NF + no transitive temporal dependency
- No A, T → B → C where T is temporal key component
Part (b): Temporal Normalization
Violations identified:
- DeptID, ValidFrom → DeptName, ManagerID: the department attributes depend on DeptID rather than on the full employee key (EmpID, ValidFrom)
- ManagerID, ValidFrom → ManagerName: transitive temporal dependency via ManagerID
Normalized schema:
- Employee_History(EmpID, ValidFrom, ValidTo, Name, DeptID, Salary)
- Department_History(DeptID, ValidFrom, ValidTo, DeptName, ManagerID)
- Manager_History(ManagerID, ValidFrom, ValidTo, ManagerName)
Part (c): Complex Temporal Query
-- Find salary changes > 20% with manager approval timeline
WITH SalaryChanges AS (
SELECT e1.EmpID, e1.ValidFrom as ChangeDate, e1.DeptID,
e1.Salary as NewSalary, e2.Salary as PrevSalary,
(e1.Salary - e2.Salary) / e2.Salary as ChangePercent
FROM Employee_History e1
JOIN Employee_History e2 ON e1.EmpID = e2.EmpID
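-- assumes ValidFrom/ValidTo are stored as integer day numbers; for DATE columns,
-- replace "e2.ValidTo + 1" with interval arithmetic (e.g. e2.ValidTo + INTERVAL '1' DAY)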
AND e1.ValidFrom = e2.ValidTo + 1
WHERE ABS((e1.Salary - e2.Salary) / e2.Salary) > 0.2
),
ManagerTimeline AS (
SELECT sc.EmpID, sc.ChangeDate, dh.ManagerID, mh.ManagerName,
dh.ValidFrom as MgrValidFrom, dh.ValidTo as MgrValidTo
FROM SalaryChanges sc
JOIN Department_History dh ON sc.DeptID = dh.DeptID
AND sc.ChangeDate BETWEEN dh.ValidFrom AND dh.ValidTo
JOIN Manager_History mh ON dh.ManagerID = mh.ManagerID
AND sc.ChangeDate BETWEEN mh.ValidFrom AND mh.ValidTo
)
SELECT * FROM ManagerTimeline;
Part (d): Versioning Overhead
Assumptions:
- 100K employees × 10 years × 12 months = 12M potential version records
- Average tenure: 5 years → roughly 60% of employee-months are active
- Monthly change rate: 5% of active employees
Calculation:
- Active employee-months: 100K × 10 × 12 × 0.6 = 7.2M
- Salary change records: 7.2M × 0.05 = 360K
- Department change records (less frequent): 360K × 0.1 = 36K
- Manager change records (least frequent): 36K × 0.2 = 7.2K
Total temporal records: ≈403.2K
Overhead vs. a current-state snapshot of 100K rows: 403.2K / 100K ≈ 4.03x storage
Acceptable for historical analytics requirements.
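The overhead estimate is a straight product of the stated assumptions; a short sketch reproducing it:

# Versioning overhead under the stated assumptions.
active_months = 100_000 * 10 * 12 * 0.6        # 7.2M active employee-months
salary_recs = active_months * 0.05             # 360K salary versions
dept_recs = salary_recs * 0.10                 # 36K department versions
mgr_recs = dept_recs * 0.20                    # 7.2K manager versions
total = salary_recs + dept_recs + mgr_recs
print(total, total / 100_000)                  # 403200.0 records, ≈ 4.03x the snapshot size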
Question 9: Distributed Database Consistency with CAP Theorem
Problem: Global banking system uses distributed database across 5 continents. Account table: Account(ID, CustomerID, Balance, Branch, Country) with functional dependency Branch → Country.
During network partition between continents:
a) Analyze consistency issues if denormalized (Branch, Country together)
b) Compare with normalized approach: Account(ID, CustomerID, Balance, Branch) + Branch_Location(Branch, Country)
c) Prove that strong consistency is impossible during partition for both approaches
d) Design eventual consistency protocol minimizing inconsistency window
Solution:
Part (a): Denormalized Consistency Issues
During Partition Scenario:
- Network split: {US, Europe} vs {Asia, Africa, Australia}
- Update: Move branch "London_Central" from UK to Ireland (Brexit scenario)
Denormalized problems:
- Phantom updates: accounts in partition 1 show London_Central → Ireland, partition 2 shows UK
- Propagation complexity: must update every account record with that branch (potentially millions)
- Inconsistent queries: the same branch shows different countries simultaneously
- Conflict resolution: which partition's country value is authoritative?
Inconsistency impact: O(n) accounts per branch affected
Part (b): Normalized Approach
During same partition:
- Account table: unaffected by branch location changes
- Branch_Location table: only one record needs updating
Advantages:
- Isolated updates: a branch location change affects only the Branch_Location table
- Reduced conflict scope: a single-record conflict vs. thousands in the denormalized design
- Cleaner resolution: clear authority over Branch_Location vs. scattered updates
Inconsistency impact: O(1) - a single branch record
Part (c): Strong Consistency Impossibility Proof
CAP Theorem Application:
- Consistency (C): all nodes see the same data simultaneously
- Availability (A): the system remains operational during failures
- Partition tolerance (P): the system continues during network splits
Proof by contradiction:
- Assume strong consistency is possible during a partition
- Partition occurs: network split into P₁ and P₂
- An update arrives at P₁: branch location change
- For strong consistency, P₂ must see the update immediately
- But the network partition prevents P₁ → P₂ communication
- Therefore either sacrifice Consistency (allow stale reads) or Availability (block operations)
- Contradiction: cannot have both C and A during P
Conclusion: strong consistency is impossible for both the normalized and denormalized approaches during a partition
Part (d): Eventual Consistency Protocol
Protocol: Vector Clock with Semantic Reconciliation
Components:
- Vector clocks: track causality per partition
- Semantic rules: Branch_Location updates have priority over Account updates
- Reconciliation windows: 30-second batched updates
- Conflict resolution: last-writer-wins with business rule validation
Algorithm:
On network partition:
1. Each partition maintains local vector clock
2. Updates get timestamped: (partition_id, local_time, vector_clock)
3. Queue cross-partition updates
On partition healing:
1. Exchange vector clocks between partitions
2. Identify conflicting updates using causal ordering
3. Apply semantic rules:
- Branch_Location changes override Account.Branch references
- Balance updates use commutative resolution (sum of changes)
4. Batch apply resolved updates
5. Synchronize vector clocks
Inconsistency window = max(network_partition_time, reconciliation_batch_time)
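A minimal sketch of the vector-clock bookkeeping behind steps 1-2 and of the concurrency test used during healing (partition identifiers and the conflict handling shown are illustrative, not a full implementation of the protocol):

# Minimal vector-clock sketch (illustrative).
def vc_increment(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def vc_happens_before(a, b):
    # a causally precedes b if a <= b componentwise and a != b.
    return a != b and all(a.get(n, 0) <= b.get(n, 0) for n in set(a) | set(b))

def vc_concurrent(a, b):
    return not vc_happens_before(a, b) and not vc_happens_before(b, a)

# Both partitions update the same Branch_Location row while split:
clock_p1 = vc_increment({}, "partition_1")
clock_p2 = vc_increment({}, "partition_2")
if vc_concurrent(clock_p1, clock_p2):
    print("conflict detected -> apply semantic rule (e.g. last-writer-wins + validation)")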
Performance Analysis:
- Normalized: inconsistency window = network partition time + O(log branches)
- Denormalized: inconsistency window = network partition time + O(accounts_per_branch)
Recommendation: the normalized approach reduces inconsistency resolution time by a factor of 1000+ for large banks