Question 1: Functional Dependency Closure Analysis
Problem: Given relation R(A, B, C, D, E, F) with functional dependencies: F = {A → BC, CD → E, B → D, E → AF}
a) Find the closure of attribute set {A}⁺
b) Find all candidate keys for R
c) Determine if the dependency B → F can be derived from F
d) Calculate the minimal cover Fc of F
Solution:
Part (a): Finding {A}⁺
- Start: {A}⁺ = {A}
- Apply A → BC: {A}⁺ = {A, B, C}
- Apply B → D: {A}⁺ = {A, B, C, D}
- Apply CD → E: Since C, D ∈ {A}⁺, add E: {A}⁺ = {A, B, C, D, E}
- Apply E → AF: Since E ∈ {A}⁺, add F: {A}⁺ = {A, B, C, D, E, F}
Therefore, {A}⁺ = {A, B, C, D, E, F}
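The fixpoint computation above is mechanical and easy to script. A minimal Python sketch (the closure helper and the FD encoding below are illustrative, not part of the original problem):

# Illustrative sketch: attribute closure under a set of FDs given as (lhs, rhs) attribute sets.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:   # FD applies and adds something new
                result |= rhs
                changed = True
    return result

F1 = [({'A'}, {'B', 'C'}), ({'C', 'D'}, {'E'}), ({'B'}, {'D'}), ({'E'}, {'A', 'F'})]
print(sorted(closure({'A'}, F1)))   # ['A', 'B', 'C', 'D', 'E', 'F']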
Part (b): Finding candidate keys
Since {A}⁺ = R, A is a superkey, and a single attribute is trivially minimal, so {A} is a candidate key. Checking the other attribute sets:
- {E}⁺ = R (E → AF, then A → BC, B → D), so {E} is a candidate key
- {B, C}⁺ = R (B → D, then CD → E, E → AF), so {B, C} is a candidate key
- {C, D}⁺ = R (CD → E, E → AF, A → BC), so {C, D} is a candidate key
- No attribute set reaches R without containing one of these ({B}⁺ = {B, D}, {C}⁺ = {C}, {D}⁺ = {D}, {F}⁺ = {F}, and the remaining pairs {B, D}, {B, F}, {C, F}, {D, F} also fail)
Candidate keys: {A}, {E}, {B, C}, {C, D}
Part (c): Can B → F be derived?
Find {B}⁺:
- {B}⁺ = {B}
- Apply B → D: {B}⁺ = {B, D}
- No more rules apply
- Since F ∉ {B}⁺, B → F cannot be derived
Part (d): Minimal cover Fc
- Convert to singleton right-hand sides: {A → B, A → C, CD → E, B → D, E → A, E → F}
- Remove extraneous LHS attributes: in CD → E, neither C nor D is extraneous ({C}⁺ = {C}, {D}⁺ = {D}), so all left-hand sides are already minimal
- Remove redundant FDs: none can be dropped; for example, without E → A we get {E}⁺ = {E, F}, which no longer contains A, and the same test fails for every other FD
Fc = {A → B, A → C, CD → E, B → D, E → A, E → F}
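Each redundancy claim above can be verified mechanically: drop one FD and test whether its right-hand side is still in the closure of its left-hand side. A small sketch reusing the closure helper from Part (a); every FD comes back as needed:

# Redundancy test over the singleton-RHS cover (requires closure() from the Part (a) sketch).
Fc = [({'A'}, {'B'}), ({'A'}, {'C'}), ({'C', 'D'}, {'E'}),
      ({'B'}, {'D'}), ({'E'}, {'A'}), ({'E'}, {'F'})]
for i, (lhs, rhs) in enumerate(Fc):
    rest = Fc[:i] + Fc[i + 1:]                # cover with this FD removed
    redundant = rhs <= closure(lhs, rest)     # still derivable without it?
    print(lhs, '->', rhs, 'redundant' if redundant else 'needed')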
Question 2: BCNF Decomposition with Dependency Analysis
Problem: Consider relation R(Student, Professor, Course, Room, Time, Building) with FDs: F = { Student, Course → Professor, Professor → Room, Room → Building, Professor, Time → Course, Course → Time }
a) Prove that R is not in BCNF
b) Decompose R into BCNF using the decomposition algorithm
c) Analyze which dependencies are lost in the decomposition
d) Calculate the space complexity increase factor if the original relation had n tuples
Solution:
Part (a): Proving R is not in BCNF
First find candidate keys by computing closures:
- {Student, Course}⁺ = {Student, Course, Professor, Room, Building, Time} = R ✓
- {Student, Professor, Time}⁺ = {Student, Professor, Time, Course, Room, Building} = R ✓ (no proper subset is a superkey)
- Student appears on no right-hand side, so it must belong to every key; no other minimal combinations give the full closure
Candidate keys: {Student, Course}, {Student, Professor, Time}
Check BCNF violations (every non-trivial FD must have a superkey on its left-hand side):
- Professor → Room: Professor is not a superkey → violates BCNF
- Room → Building: Room is not a superkey → violates BCNF
- Course → Time: Course is not a superkey → violates BCNF
- Professor, Time → Course: {Professor, Time} is not a superkey → violates BCNF
Part (b): BCNF Decomposition
Step 1: Decompose using Room → Building (Room is not a superkey)
- R₃(Room, Building)
- R'(Student, Professor, Course, Room, Time)
Step 2: In R', Professor → Room still violates BCNF (Professor is not a superkey of R')
- R₁(Professor, Room)
- R''(Student, Professor, Course, Time)
Step 3: In R'', Course → Time violates BCNF (Course is not a superkey of R'')
- R₂₁(Course, Time)
- R₂₂(Student, Professor, Course)
Final BCNF Relations:
- R₁(Professor, Room)
- R₂₁(Course, Time)
- R₂₂(Student, Professor, Course)
- R₃(Room, Building)
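The same result can be reached with the generic BCNF decomposition loop: pick a fragment with a violating FD and split it on the closure of the violating left-hand side. The sketch below (reusing closure from Question 1 and a brute-force subset check, fine for schemas this small) is illustrative; depending on which violation is picked first it may return a different, but still lossless, BCNF decomposition.

from itertools import combinations

def bcnf_decompose(attrs, fds):
    # Split fragments until no subset X of a fragment R satisfies X ⊂ (X⁺ ∩ R) ⊂ R.
    fragments, result = [frozenset(attrs)], []
    while fragments:
        R = fragments.pop()
        violation = None
        for k in range(1, len(R)):
            for X in map(set, combinations(R, k)):
                Xc = closure(X, fds) & R
                if X < Xc and Xc < R:          # X determines more of R, but not all of it
                    violation = (X, Xc)
                    break
            if violation:
                break
        if violation:
            X, Xc = violation
            fragments.append(frozenset(Xc))             # fragment (X, X⁺ ∩ R)
            fragments.append(frozenset((R - Xc) | X))   # the rest keeps X as join key
        else:
            result.append(R)
    return result

F2 = [({'Student', 'Course'}, {'Professor'}), ({'Professor'}, {'Room'}),
      ({'Room'}, {'Building'}), ({'Professor', 'Time'}, {'Course'}),
      ({'Course'}, {'Time'})]
R2 = {'Student', 'Professor', 'Course', 'Room', 'Time', 'Building'}
for frag in bcnf_decompose(R2, F2):
    print(sorted(frag))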
Part (c): Lost Dependencies Analysis
Original FDs in F:
- ✓ Student, Course → Professor: Preserved in R₂₂
- ✓ Professor → Room: Preserved in R₁
- ✓ Room → Building: Preserved in R₃
- ❌ Professor, Time → Course: LOST (Professor and Time never appear together in any fragment, and the FD does not follow from the projected dependencies)
- ✓ Course → Time: Preserved in R₂₁
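Whether a given FD survives a decomposition can be checked with the standard dependency-preservation test: grow the left-hand side using only the attributes visible inside each fragment. A sketch (reusing closure and F2 from the previous sketches); it reports Professor, Time → Course as lost and the others as preserved:

def preserved(lhs, rhs, fragments, fds):
    # Grow Z from lhs, but only through attributes co-located in some fragment.
    Z = set(lhs)
    changed = True
    while changed:
        changed = False
        for Ri in fragments:
            gained = closure(Z & Ri, fds) & Ri
            if not gained <= Z:
                Z |= gained
                changed = True
    return set(rhs) <= Z

frags = [{'Professor', 'Room'}, {'Course', 'Time'},
         {'Student', 'Professor', 'Course'}, {'Room', 'Building'}]
print(preserved({'Professor', 'Time'}, {'Course'}, frags, F2))    # False -> lost
print(preserved({'Student', 'Course'}, {'Professor'}, frags, F2)) # True  -> preserved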
Part (d): Space Complexity
Original: n tuples × 6 attributes = 6n storage units
Decomposed:
- R₁: ≤ n tuples × 2 attributes = 2n
- R₂₁: ≤ n tuples × 2 attributes = 2n
- R₂₂: ≤ n tuples × 3 attributes = 3n
- R₃: ≤ n tuples × 2 attributes = 2n
Total: ≤ 9n storage units
Space increase factor: 9n / 6n = 1.5x
Question 3: 3NF vs BCNF Trade-off Analysis
Problem: Banking system relation Account_Transaction(Account_ID, Branch_ID, Manager_ID, Transaction_ID, Amount, Date, Customer_SSN) with dependencies:
F = { Account_ID → Branch_ID, Account_ID → Customer_SSN, Branch_ID → Manager_ID, Transaction_ID → {Account_ID, Amount, Date} }
a) Determine if the relation is in 3NF and BCNF
b) If not in BCNF, provide both 3NF and BCNF decompositions
c) Calculate the join cost for a query retrieving all transaction details with manager information for transactions > $10,000
d) Recommend which normal form to use for this banking system
Solution:
Part (a): Normal Form Analysis
Finding Candidate Keys:
- {Transaction_ID}⁺ = {Transaction_ID, Account_ID, Amount, Date, Branch_ID, Customer_SSN, Manager_ID} = R
- Candidate key: {Transaction_ID}
3NF Check: for every FD X → A, either X must be a superkey or A must be a prime attribute; the only prime attribute here is Transaction_ID
- Account_ID → Branch_ID: Account_ID is not a superkey and Branch_ID is not prime → Violates 3NF
BCNF Check:
- Account_ID → Branch_ID: Account_ID not superkey → Violates BCNF
- Branch_ID → Manager_ID: Branch_ID not superkey → Violates BCNF
Neither 3NF nor BCNF
Part (b): Decompositions
3NF Decomposition (Synthesis method):
- Minimal cover: same as F (already minimal)
- Group FDs by left-hand side:
  - R₁(Account_ID, Branch_ID, Customer_SSN)
  - R₂(Branch_ID, Manager_ID)
  - R₃(Transaction_ID, Account_ID, Amount, Date)
- Candidate key Transaction_ID already appears in R₃, so no separate key relation is needed ✓
BCNF Decomposition:
- Decompose using Branch_ID → Manager_ID (Branch_ID is not a superkey):
  - R₃(Branch_ID, Manager_ID)
  - R'(Transaction_ID, Account_ID, Branch_ID, Customer_SSN, Amount, Date)
- In R', Account_ID → {Branch_ID, Customer_SSN} violates BCNF (Account_ID is not a superkey of R'):
  - R₁(Account_ID, Branch_ID, Customer_SSN)
  - R₂(Transaction_ID, Account_ID, Amount, Date)
- R₁ can be split further on Account_ID → Customer_SSN; both fragments keep Account_ID as their key and remain in BCNF:
  - R₁₁(Account_ID, Customer_SSN)
  - R₁₂(Account_ID, Branch_ID)
Final BCNF: R₁₁(Account_ID, Customer_SSN), R₁₂(Account_ID, Branch_ID), R₂(Transaction_ID, Account_ID, Amount, Date), R₃(Branch_ID, Manager_ID)
Part (c): Join Cost Analysis
Query: Find all transaction details with manager info for Amount > $10,000
3NF Cost:
- Filter R₃ on Amount > $10,000: O(n), where n = number of transactions
- Join result ⋈ R₁ on Account_ID: O(n)
- Join result ⋈ R₂ on Branch_ID: O(n)
- Total: 2 joins, O(3n)
BCNF Cost:
- Filter R₂ on Amount > $10,000: O(n)
- Join result ⋈ R₁₁ on Account_ID: O(n)
- Join result ⋈ R₁₂ on Account_ID: O(n)
- Join result ⋈ R₃ on Branch_ID: O(n)
- Total: 3 joins, O(4n)
Part (d): Recommendation
Use 3NF because:
- Dependency preservation is guaranteed by the synthesis algorithm
- Fewer relations mean a lower join cost for common queries (2 joins vs. 3 in the BCNF schema)
- Banking systems require strict constraint enforcement
- The slightly higher redundancy is acceptable for the performance gain
Question 5: Advanced Decomposition Algorithm Analysis
Problem: Given relation R(A, B, C, D, E, F, G) with FDs: F = {AB → C, AC → B, AD → E, B → D, C → F, E → G}
a) Prove that the 3NF decomposition algorithm produces a lossless join
b) Find the maximum number of relations in any 3NF decomposition of R
c) Determine if there exists a dependency-preserving BCNF decomposition
d) If multiple 3NF decompositions exist, which minimizes join cost for queries accessing all attributes?
Solution:
Part (a): Lossless Join Proof
Step 1: Find minimal cover Fc = F (already minimal after checking)
Step 2: Find candidate keys
- {AB}⁺ = {A, B, C, D, E, F, G} = R
- {AC}⁺ = {A, B, C, D, E, F, G} = R
- Check minimality: A appears on no right-hand side, so every key must contain A, and no single attribute is a superkey, so AB and AC are minimal
- Candidate keys: {AB}, {AC}
Step 3: 3NF Decomposition
Group FDs by left-hand side (schemas with identical attribute sets are merged):
- R₁(A, B, C) from AB → C and AC → B
- R₂(A, D, E) from AD → E
- R₃(B, D) from B → D
- R₄(C, F) from C → F
- R₅(E, G) from E → G
Step 4: Lossless join
The candidate key {AB} is contained in R₁(A, B, C), which guarantees the lossless-join property of the synthesis decomposition.
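The synthesis steps above follow the textbook algorithm directly: one schema per FD of the minimal cover, merge identical schemas, drop schemas contained in others, and add a candidate key relation only if no schema already contains one. A hedged Python sketch of that procedure (function names are illustrative):

def synthesize_3nf(min_cover, candidate_keys):
    # One schema per FD; merge duplicates and drop schemas contained in another.
    schemas = []
    for lhs, rhs in min_cover:
        s = frozenset(lhs | rhs)
        if s not in schemas:
            schemas.append(s)
    schemas = [s for s in schemas if not any(s < t for t in schemas)]
    # Add a candidate key relation only if no schema already contains a key.
    if not any(set(k) <= s for k in candidate_keys for s in schemas):
        schemas.append(frozenset(candidate_keys[0]))
    return schemas

Fc5 = [({'A', 'B'}, {'C'}), ({'A', 'C'}, {'B'}), ({'A', 'D'}, {'E'}),
       ({'B'}, {'D'}), ({'C'}, {'F'}), ({'E'}, {'G'})]
for s in synthesize_3nf(Fc5, [{'A', 'B'}, {'A', 'C'}]):
    print(sorted(s))    # ABC, ADE, BD, CF, EG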
Part (b): Maximum Relations Count
The synthesis algorithm creates at most one relation per FD of the minimal cover, plus at most one extra relation to contain a candidate key if no generated schema already does. With |Fc| = 6: maximum 7 relations (6 from FDs + 1 for the key). Here the key AB is already contained in R₁ and the schemas for AB → C and AC → B coincide, so the synthesis above actually produces 5.
Part (c): BCNF Dependency Preservation
Check each relation of the 3NF decomposition for BCNF:
- R₁(A, B, C): FDs AB → C, AC → B; keys are AB and AC → BCNF ✓
- R₂(A, D, E): FD AD → E; key is AD → BCNF ✓
- R₃(B, D): FD B → D; key is B → BCNF ✓
- R₄(C, F): FD C → F; key is C → BCNF ✓
- R₅(E, G): FD E → G; key is E → BCNF ✓
Yes, this 3NF decomposition is also in BCNF and dependency-preserving.
Part (d): Join Cost Optimization
For queries accessing all attributes, the number of joins equals the number of relations minus one, so the 5-relation decomposition above needs 4 joins. Merging relations reduces joins, for example:
- R₁(A, B, C, D) from AB → C, AC → B, B → D
- R₂(A, D, E, G) from AD → E, E → G
- R₃(C, F) from C → F
This 3-relation schema needs only 2 joins, but it is no longer in 3NF: in R₁, B → D is a partial dependency on the key AB, and in R₂, E → G is transitive via AD → E. Among valid 3NF decompositions, the 5-relation synthesis is already minimal; the 3-relation grouping is a deliberate denormalization that can only be justified for query optimization.
Question 6: Probabilistic Database Consistency Analysis
Problem: A distributed database system maintains replicas across 3 data centers. Updates are propagated with probability p = 0.8 per replica per time unit. A normalization violation occurs when replicas are inconsistent.
Given relation R(ID, Name, Department, Manager) where Department → Manager:
a) If the relation is not normalized and has 1000 tuples with 50 departments, calculate expected inconsistencies after 3 time units
b) Compare with normalized relations R₁(ID, Name, Department) and R₂(Department, Manager)
c) What's the optimal replication strategy to minimize inconsistency probability below 0.01?
Solution:
Part (a): Denormalized Inconsistency Calculation
Setup:
- 1000 tuples, 50 departments → average 20 tuples per department
- Each department's manager info is replicated 20 times
- Probability of successful update per replica per time unit: p = 0.8
- After 3 time units: effective success probability per replica = 1 - (1 - 0.8)³ = 1 - 0.008 = 0.992
Inconsistency Analysis:
For each department group of 20 tuples across 3 replicas:
- Total state combinations: 3²⁰ (each tuple can be in 3 states across replicas)
- Consistent states: states where all manager values agree
- P(inconsistent) = 1 - P(all replicas have the same manager value)
For each department: P(inconsistent) ≈ 1 - (0.992)³ ≈ 0.024
Expected inconsistent departments: 50 × 0.024 = 1.2
Expected inconsistent tuples: 1.2 × 20 = 24
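The expected-value arithmetic in this part (and the comparison in Part (b)) follows from the independence assumption stated in the setup; a short sketch reproducing the numbers:

# Per-replica miss probability after 3 time units, then per-value inconsistency across 3 replicas.
p, time_units, replicas = 0.8, 3, 3
p_ok = 1 - (1 - p) ** time_units            # 0.992
p_inconsistent = 1 - p_ok ** replicas       # ≈ 0.024

departments, tuples_per_dept = 50, 1000 // 50
print(departments * p_inconsistent)                      # ≈ 1.2 inconsistent departments
print(departments * p_inconsistent * tuples_per_dept)    # ≈ 24 denormalized tuples affected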
Part (b): Normalized Relations Comparison
R₁(ID, Name, Department): 1000 tuples, no redundancy
- P(inconsistent) = 1 - (0.992)³ ≈ 0.024 per tuple
- Expected inconsistent tuples: 1000 × 0.024 = 24
R₂(Department, Manager): 50 tuples, no redundancy
- P(inconsistent) = 1 - (0.992)³ ≈ 0.024 per tuple
- Expected inconsistent tuples: 50 × 0.024 = 1.2
Total normalized inconsistencies: 24 + 1.2 = 25.2, versus 24 in the denormalized relation.
The totals look similar only because the 24 inconsistencies in R₁ concern the non-redundant ID/Name/Department data, which is equally exposed in either design. For the manager data itself, normalization reduces the exposure from 24 redundantly stored copies to 1.2 tuples, a 20x improvement matching the redundancy factor that was eliminated.
Part (c): Optimal Replication Strategy
Target: P(inconsistent) < 0.01 per tuple
Required: (1 - p)ᵏ < 0.01, where k = number of time units
For p = 0.8: (0.2)ᵏ < 0.01 → k > log(0.01)/log(0.2) ≈ 2.86
Strategy 1: Keep p = 0.8 and allow k = 3 propagation rounds: (0.2)³ = 0.008 < 0.01 ✓
Strategy 2: Increase the per-round success probability to p = 0.95: (0.05)³ = 0.000125 < 0.01 ✓
Strategy 3: Reduce the replication factor with stronger per-replica consistency:
- Use 2 replicas with p = 0.9
- (0.1)³ = 0.001 < 0.01 ✓
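The threshold (1 - p)ᵏ < 0.01 can be solved for k for each candidate reliability; a small sketch of the calculation behind the three strategies (the candidate p values are those discussed above):

import math
# Minimum number of propagation rounds k with (1 - p)^k < 0.01.
for p in (0.8, 0.9, 0.95):
    k_min = math.log(0.01) / math.log(1 - p)
    print(f"p = {p}: need k > {k_min:.2f} rounds")
# p = 0.8: k > 2.86   p = 0.9: k > 2.00   p = 0.95: k > 1.54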
Recommendation: Strategy 3 (fewer replicas, higher individual reliability)
Question 7: Enterprise-Scale Normalization Performance Analysis
Problem: E-commerce platform with Orders(OrderID, CustomerID, ProductID, Quantity, Price, ProductName, CustomerName, CustomerEmail, CategoryID, CategoryName) containing 10M orders, 1M customers, 100K products, 1K categories.
a) Calculate storage reduction from full normalization
b) Analyze query performance impact for: "Find all orders with customer and product details for electronics category"
c) Design hybrid normalization strategy optimizing for 80% read, 20% write workload
d) Calculate the break-even point where normalization overhead equals denormalization benefits
Solution:
Part (a): Storage Reduction Analysis
Current denormalized storage:
- 10M orders × 10 attributes × 50 bytes avg = 5GB
Normalized storage:
- Orders(OrderID, CustomerID, ProductID, Quantity, Price): 10M × 5 × 8 bytes = 400MB
- Customers(CustomerID, CustomerName, CustomerEmail): 1M × 3 × 50 bytes = 150MB
- Products(ProductID, ProductName, CategoryID): 100K × 3 × 50 bytes = 15MB
- Categories(CategoryID, CategoryName): 1K × 2 × 50 bytes = 100KB
Total normalized: ≈565MB
Storage reduction: 5000MB → 565MB ≈ 88.7% reduction
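A quick sanity check of the storage arithmetic, under the same average byte sizes assumed above:

# Back-of-envelope storage totals (byte sizes as assumed above).
MB = 10 ** 6
denormalized_mb = 10_000_000 * 10 * 50 / MB              # 5000 MB
normalized_mb = (10_000_000 * 5 * 8 +                    # Orders
                 1_000_000 * 3 * 50 +                    # Customers
                 100_000 * 3 * 50 +                      # Products
                 1_000 * 2 * 50) / MB                    # Categories
print(denormalized_mb, normalized_mb)                    # 5000.0  565.1
print(f"{1 - normalized_mb / denormalized_mb:.1%}")      # 88.7% reduction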
Part (b): Query Performance Analysis
Query: Electronics category orders with customer and product details
Denormalized approach:
- Scan 10M orders: O(10M)
- Filter CategoryName = 'Electronics': ~1M matching orders (assuming 10% of orders are electronics)
- Time complexity: O(10M); I/O: full 5GB scan
Normalized approach:
- Filter Categories: O(1K) → find the Electronics CategoryID
- Join Products: O(100K) → find electronics ProductIDs (~10K products)
- Join Orders on ProductID (index lookups per electronics product) → ~1M electronics orders
- Join Customers: O(1M) → get customer details
- Time complexity: ~O(1M) rows touched; I/O: ~500MB
Performance improvement: 10x faster, 90% less I/O
Part (c): Hybrid Strategy for 80/20 Workload
Strategy: Selective Denormalization
- Normalize core entities: Customers, Products, Categories (rarely change)
- Partially denormalize Orders:
  - Keep CustomerName in Orders (frequently accessed, low update rate)
  - Normalize ProductName out to Products (higher update rate)
  - Keep CategoryName in Orders (read-heavy, minimal updates)
- Resulting schema:
  - Orders(OrderID, CustomerID, CustomerName, ProductID, Quantity, Price, CategoryName)
  - Products(ProductID, ProductName, CategoryID)
  - Categories(CategoryID, CategoryName)
  - Customers(CustomerID, CustomerEmail) [other customer details live in Orders]
Benefits:
- 70% of queries avoid joins (customer name and category already in Orders)
- Product updates affect only the Products table
- Storage increase: 20% vs. full normalization; 60% reduction vs. full denormalization
Part (d): Break-even Analysis
Costs:
- Denormalized: higher storage cost; each logical update must be written to R redundant copies
- Normalized: extra join cost on every query; each update touches a single copy
Variables:
- S = storage cost per GB per month
- Q = queries per month
- U = updates per month
- R = redundancy factor (average tuple duplication)
- C_update = cost per redundant copy written; C_join = cost per extra join performed
Denormalized monthly cost ≈ 5GB × S + U × R × C_update
Normalized monthly cost ≈ 0.5GB × S + Q × C_join
Break-even when: Q × C_join = 4.5GB × S + U × R × C_update
For typical values (S = $0.10/GB/month, R = 5, C_update = $0.001, C_join = $0.01) and the given 80:20 workload (Q = 4U), the query-side join cost outweighs the update-side redundancy cost at any realistic volume, so full normalization is the more expensive option here.
This supports the hybrid approach from Part (c): denormalize the read-hot attributes while keeping the frequently updated ones normalized.
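A toy version of this cost model, using the assumed unit costs above (the dollar figures and the one-join-chain-per-query simplification are illustrative only), shows the query-side cost dominating at the 80/20 workload:

# Toy monthly-cost comparison under the assumed unit costs (illustrative values).
S, R, C_update, C_join = 0.10, 5, 0.001, 0.01   # $/GB-month, copies, $/copy-write, $/join

def denormalized_cost(updates):
    return 5.0 * S + updates * R * C_update     # 5 GB stored, R copies touched per update

def normalized_cost(queries):
    return 0.565 * S + queries * C_join         # 0.565 GB stored, one join chain per query

ops = 1_000                                      # example monthly volume, 80% reads / 20% writes
print(denormalized_cost(0.2 * ops), normalized_cost(0.8 * ops))   # 1.5 vs ≈ 8.06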
Question 8: Temporal Database Normalization
Problem: Time-variant employee database stores historical salary and department changes: Employee_History(EmpID, ValidFrom, ValidTo, Name, DeptID, DeptName, Salary, ManagerID, ManagerName)
Temporal functional dependencies:
- EmpID, ValidFrom → Name, DeptID, Salary
- DeptID, ValidFrom → DeptName, ManagerID
- ManagerID, ValidFrom → ManagerName
a) Define temporal normal forms analogous to traditional normal forms
b) Normalize the temporal relation
c) Design queries to find salary changes > 20% with manager approval timeline
d) Calculate versioning overhead for 100K employees over 10 years with monthly changes
Solution:
Part (a): Temporal Normal Forms
Temporal 1NF (T1NF): All attributes atomic, valid time periods non-overlapping for same entity
Temporal 2NF (T2NF): T1NF + no partial temporal dependency
- Non-temporal attributes fully dependent on (EntityKey, ValidFrom)
Temporal 3NF (T3NF): T2NF + no transitive temporal dependency
- No A, T → B → C where T is temporal key component
Part (b): Temporal Normalization
Violations identified:
- DeptID, ValidFrom → DeptName, ManagerID: the department attributes depend on DeptID rather than on the full employee key (EmpID, ValidFrom)
- ManagerID, ValidFrom → ManagerName: transitive temporal dependency via ManagerID
Normalized schema:
- Employee_History(EmpID, ValidFrom, ValidTo, Name, DeptID, Salary)
- Department_History(DeptID, ValidFrom, ValidTo, DeptName, ManagerID)
- Manager_History(ManagerID, ValidFrom, ValidTo, ManagerName)
Part (c): Complex Temporal Query
-- Find salary changes > 20% with manager approval timeline
WITH SalaryChanges AS (
SELECT e1.EmpID, e1.ValidFrom as ChangeDate, e1.DeptID,
e1.Salary as NewSalary, e2.Salary as PrevSalary,
(e1.Salary - e2.Salary) / e2.Salary as ChangePercent
FROM Employee_History e1
JOIN Employee_History e2 ON e1.EmpID = e2.EmpID
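-- assumes ValidFrom/ValidTo are stored as integer day numbers; for DATE columns,
-- replace "e2.ValidTo + 1" with interval arithmetic (e.g. e2.ValidTo + INTERVAL '1' DAY)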
AND e1.ValidFrom = e2.ValidTo + 1
WHERE ABS((e1.Salary - e2.Salary) / e2.Salary) > 0.2
),
ManagerTimeline AS (
SELECT sc.EmpID, sc.ChangeDate, dh.ManagerID, mh.ManagerName,
dh.ValidFrom as MgrValidFrom, dh.ValidTo as MgrValidTo
FROM SalaryChanges sc
JOIN Department_History dh ON sc.DeptID = dh.DeptID
AND sc.ChangeDate BETWEEN dh.ValidFrom AND dh.ValidTo
JOIN Manager_History mh ON dh.ManagerID = mh.ManagerID
AND sc.ChangeDate BETWEEN mh.ValidFrom AND mh.ValidTo
)
SELECT * FROM ManagerTimeline;
Part (d): Versioning Overhead
Assumptions:
- 100K employees × 10 years × 12 months = 12M potential version records
- Average tenure: 5 years → roughly 60% of employee-months are active
- Monthly change rate: 5% of active employees
Calculation:
- Active employee-months: 100K × 10 × 12 × 0.6 = 7.2M
- Salary change records: 7.2M × 0.05 = 360K
- Department change records (less frequent): 360K × 0.1 = 36K
- Manager change records (least frequent): 36K × 0.2 = 7.2K
Total temporal records: ≈403.2K
Overhead vs. a current-state snapshot of 100K rows: 403.2K / 100K ≈ 4.03x storage
Acceptable for historical analytics requirements.
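The overhead estimate is a straight product of the stated assumptions; a short sketch reproducing it:

# Versioning overhead under the stated assumptions.
active_months = 100_000 * 10 * 12 * 0.6        # 7.2M active employee-months
salary_recs = active_months * 0.05             # 360K salary versions
dept_recs = salary_recs * 0.10                 # 36K department versions
mgr_recs = dept_recs * 0.20                    # 7.2K manager versions
total = salary_recs + dept_recs + mgr_recs
print(total, total / 100_000)                  # 403200.0 records, ≈ 4.03x the snapshot size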
Question 9: Distributed Database Consistency with CAP Theorem
Problem: Global banking system uses distributed database across 5 continents. Account table: Account(ID, CustomerID, Balance, Branch, Country) with functional dependency Branch → Country.
During network partition between continents:
a) Analyze consistency issues if denormalized (Branch, Country together)
b) Compare with normalized approach: Account(ID, CustomerID, Balance, Branch) + Branch_Location(Branch, Country)
c) Prove that strong consistency is impossible during partition for both approaches
d) Design eventual consistency protocol minimizing inconsistency window
Solution:
Part (a): Denormalized Consistency Issues
During Partition Scenario:
- Network split: {US, Europe} vs {Asia, Africa, Australia}
- Update: Move branch "London_Central" from UK to Ireland (Brexit scenario)
Denormalized problems:
- Phantom updates: accounts in partition 1 show London_Central → Ireland, partition 2 shows UK
- Propagation complexity: must update every account record with that branch (potentially millions)
- Inconsistent queries: the same branch shows different countries simultaneously
- Conflict resolution: which partition's country value is authoritative?
Inconsistency impact: O(n) accounts per branch affected
Part (b): Normalized Approach
During same partition:
- Account table: unaffected by branch location changes
- Branch_Location table: only one record needs updating
Advantages:
- Isolated updates: a branch location change affects only the Branch_Location table
- Reduced conflict scope: a single-record conflict vs. thousands in the denormalized design
- Cleaner resolution: clear authority over Branch_Location vs. scattered updates
Inconsistency impact: O(1) - a single branch record
Part (c): Strong Consistency Impossibility Proof
CAP Theorem Application:
- Consistency (C): all nodes see the same data simultaneously
- Availability (A): the system remains operational during failures
- Partition tolerance (P): the system continues during network splits
Proof by contradiction:
- Assume strong consistency is possible during a partition
- Partition occurs: network split into P₁ and P₂
- An update arrives at P₁: branch location change
- For strong consistency, P₂ must see the update immediately
- But the network partition prevents P₁ → P₂ communication
- Therefore either sacrifice Consistency (allow stale reads) or Availability (block operations)
- Contradiction: cannot have both C and A during P
Conclusion: strong consistency is impossible for both the normalized and denormalized approaches during a partition
Part (d): Eventual Consistency Protocol
Protocol: Vector Clock with Semantic Reconciliation
Components:
- Vector clocks: track causality per partition
- Semantic rules: Branch_Location updates have priority over Account updates
- Reconciliation windows: 30-second batched updates
- Conflict resolution: last-writer-wins with business rule validation
Algorithm:
On network partition:
1. Each partition maintains local vector clock
2. Updates get timestamped: (partition_id, local_time, vector_clock)
3. Queue cross-partition updates
On partition healing:
1. Exchange vector clocks between partitions
2. Identify conflicting updates using causal ordering
3. Apply semantic rules:
- Branch_Location changes override Account.Branch references
- Balance updates use commutative resolution (sum of changes)
4. Batch apply resolved updates
5. Synchronize vector clocks
Inconsistency window = max(network_partition_time, reconciliation_batch_time)
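A minimal sketch of the vector-clock bookkeeping behind steps 1-2 and of the concurrency test used during healing (partition identifiers and the conflict handling shown are illustrative, not a full implementation of the protocol):

# Minimal vector-clock sketch (illustrative).
def vc_increment(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def vc_happens_before(a, b):
    # a causally precedes b if a <= b componentwise and a != b.
    return a != b and all(a.get(n, 0) <= b.get(n, 0) for n in set(a) | set(b))

def vc_concurrent(a, b):
    return not vc_happens_before(a, b) and not vc_happens_before(b, a)

# Both partitions update the same Branch_Location row while split:
clock_p1 = vc_increment({}, "partition_1")
clock_p2 = vc_increment({}, "partition_2")
if vc_concurrent(clock_p1, clock_p2):
    print("conflict detected -> apply semantic rule (e.g. last-writer-wins + validation)")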
Performance Analysis:
- Normalized: inconsistency window = network partition time + O(log branches)
- Denormalized: inconsistency window = network partition time + O(accounts_per_branch)
Recommendation: the normalized approach reduces inconsistency resolution time by a factor of 1000+ for large banks