Page 9: Dynamic Hashing - Extendable Hashing

Problem with Static Hashing

Static hashing uses a fixed number of buckets, causing problems as databases grow:

Problems:

  1. Too few buckets initially → Many overflows → Poor performance

  2. Too many buckets initially → Wasted space

  3. Fixed structure → Can't adapt to growth

Solution: Dynamic Hashing

Adjust hash structure as database grows or shrinks.

Extendable Hashing Concept

Key Idea: Use variable number of bits from hash value

Example Hash Values (4-bit for simplicity):

"Physics" → hash = 0011
"Math" → hash = 1001  
"Chemistry" → hash = 0101
"Biology" → hash = 1100

Dynamic Bit Usage:

  • Initially use 1 bit (0 or 1)

  • As needed, increase to 2 bits (00, 01, 10, 11)

  • Then 3 bits, 4 bits, etc.

Data Structure Components

1. Bucket Address Table

  • Array indexed by first i bits of hash value

  • i = global depth (starts small, grows as needed)

2. Buckets

  • Actual storage containers for records

  • Each has local depth ij (bits needed for records in this bucket)

3. Hash Function

  • Generates uniform b-bit values (usually b=32)

  • Use first i bits for bucket address table lookup

Example: Building Extendable Hash

Initial State (i=1, using 1 bit):

Global depth i = 1
Bucket Address Table:
0 → Bucket A (local depth = 1)
1 → Bucket B (local depth = 1)

Buckets:
Bucket A: [empty] (for hash values starting with 0)
Bucket B: [empty] (for hash values starting with 1)

Step 1: Insert "Physics" (hash = 0011, starts with 0)

Bucket A: [Physics]
Bucket B: [empty]

Step 2: Insert "Math" (hash = 1001, starts with 1)

Bucket A: [Physics]
Bucket B: [Math]

Step 3: Insert "Chemistry" (hash = 0101, starts with 0)

Bucket A: [Physics][Chemistry] 
Bucket B: [Math]

Step 4: Insert "Biology" (hash = 1100, starts with 1)

Bucket A: [Physics][Chemistry]
Bucket B: [Math][Biology]

Bucket Split Example

Assume bucket size = 2, and we insert "History" (hash = 0110)

Bucket A is full! Need to split.

Split Process:

  1. Increase global depth: i = 1 → i = 2

  2. Double bucket address table size

  3. Split Bucket A using 2 bits

Before Split (i=1):

0 → Bucket A: [Physics(0011)][Chemistry(0101)]
1 → Bucket B: [Math(1001)][Biology(1100)]

After Split (i=2):

Global depth i = 2
00 → Bucket A: [Physics(0011)] (local depth = 2)  
01 → Bucket C: [Chemistry(0101)][History(0110)] (local depth = 2)
10 → Bucket B: [Math(1001)] (local depth = 1)
11 → Bucket B: [Biology(1100)] (local depth = 1)

Notice: Buckets 10 and 11 both point to Bucket B because Biology and Math only need 1 bit to distinguish.

Search Operation

To find "Chemistry" (hash = 0101):

  1. Take first i=2 bits: 01

  2. Look up bucket address table[01] → Bucket C

  3. Search Bucket C linearly

Insertion Algorithm

To insert record with key K:

  1. Compute hash h(K)

  2. Take first i bits → bucket address j

  3. If bucket j has space → insert record

  4. If bucket j is full:

    • If i = ij (global depth = local depth):

      • Double bucket address table (i = i + 1)

      • Split bucket using i bits

    • If i > ij:

      • Split bucket without doubling table

      • Increase local depth: ij = ij + 1

Advantages of Extendable Hashing

Compared to Static Hashing:

  • ✅ Adapts to database growth

  • ✅ No performance degradation

  • ✅ Minimal space overhead

  • ✅ No periodic reorganization needed

Minor Disadvantages:

  • Slightly more complex implementation

  • Extra indirection through bucket address table

  • But indirection cost is negligible compared to disk I/O

When to Use Each Method

Static Hashing: Good when database size is predictable and stable

Extendable Hashing: Best for growing databases with unpredictable sizes

B+ Trees: Best overall choice for most applications due to:

  • Excellent range query support

  • Predictable performance

  • Good for both growth and mixed workloads


Summary and Key Takeaways

File Organization Methods Comparison

Method

Best For

Advantages

Disadvantages

Heap

Insertions

Simple, fast inserts

Slow searches

Sequential

Range queries

Fast ranges, ordered

Slow insertions

Hash

Exact matches

Very fast lookups

Poor ranges

B+ Tree

Mixed workload

Balanced performance

More complex

Index Types Summary

Index Type

Use Case

Performance

Dense

Small tables, frequent access

Fast but more storage

Sparse

Large tables, clustering index

Less storage, still fast

Multilevel

Very large indices

Manageable memory usage

Hash Methods Comparison

Method

Adaptability

Complexity

Performance

Static

Fixed size

Simple

Degrades with growth

Extendable

Dynamic

Moderate

Consistent

Choosing the Right Approach

Consider these factors:

  1. Query patterns: Range queries → B+ trees, Exact matches → Hashing

  2. Data growth: Predictable → Static, Unpredictable → Dynamic

  3. Update frequency: Many updates → Consider overflow handling

  4. Storage constraints: Limited space → Sparse indices, Plenty of space → Dense indices

Most Common Choice: B+ trees offer the best balance for most database applications, which is why they're the default in most commercial database systems.

Updated on