What is Hashing?
Hashing directly computes where to store/find records using a mathematical function, eliminating the need for index structures.
Basic Concept
Hash Function: h(key) = bucket address
Example:
h(dept_name) = sum of character codes % 8
"Physics" → P(16) + h(8) + y(25) + s(19) + i(9) + c(3) + s(19) = 99
99 % 8 = 3 → Store in bucket 3
"Math" → M(13) + a(1) + t(20) + h(8) = 42
42 % 8 = 2 → Store in bucket 2
Hash Operations
Insertion:
-
Compute h(search_key)
-
Store record in that bucket
Search:
-
Compute h(search_key)
-
Search only that bucket
Deletion:
-
Compute h(search_key)
-
Find and remove record from bucket
Hash Function Requirements
Good Hash Function Properties:
-
Uniform Distribution: Spreads keys evenly across buckets
-
Random Distribution: No correlation with external patterns
Bad Hash Function Example:
Map department names by first letter:
A-departments → Bucket 1
B-departments → Bucket 2
...
Problem: More departments start with 'C' than 'X'
Better Hash Function for Strings:
hash = 0
for each character c in string:
hash = hash * 31 + ascii_value(c)
return hash % number_of_buckets
Example: Instructor File Hashing
Data:
Record 1: Physics → h("Physics") = 3 → Bucket 3
Record 2: Math → h("Math") = 2 → Bucket 2
Record 3: Chemistry → h("Chemistry") = 1 → Bucket 1
Record 4: Physics → h("Physics") = 3 → Bucket 3 (collision!)
Result:
Bucket 0: [empty]
Bucket 1: [Chemistry record]
Bucket 2: [Math record]
Bucket 3: [Physics record 1][Physics record 4]
Bucket Overflow Problem
Causes of Overflow:
1. Insufficient Buckets
-
Too few buckets for number of records
-
Example: 100 records, 5 buckets → Average 20 records per bucket
2. Skew
-
Uneven distribution due to:
-
Multiple records with same key
-
Poor hash function
-
Example of Skew:
University with many Computer Science instructors:
Bucket 0: [empty]
Bucket 1: [Comp. Sci.][Comp. Sci.][Comp. Sci.][Comp. Sci.]...overflow!
Bucket 2: [Physics]
Bucket 3: [Math]
Handling Overflow: Overflow Chaining
Solution: Chain overflow buckets together
Example:
Primary Bucket 1: [Record A][Record B] → Overflow Bucket 1a
Overflow Bucket 1a: [Record C][Record D] → Overflow Bucket 1b
Overflow Bucket 1b: [Record E] → NULL
Search with Overflow:
-
Hash to primary bucket
-
Search primary bucket
-
Follow chain through all overflow buckets
Preventing Overflow
Formula: Number of buckets = (nr/fr) × (1 + d)
-
nr = number of records
-
fr = records per bucket
-
d = fudge factor (usually 0.2)
Example:
-
1000 records, 10 records per bucket
-
Buckets needed = (1000/10) × (1 + 0.2) = 100 × 1.2 = 120 buckets
-
20% waste but much fewer overflows
Hash Indices
Hash functions can also organize indices (not just files).
Example: Hash index on instructor ID
ID 12345 → h(12345) = 2 → Bucket 2: [12345 → pointer to record]
ID 67890 → h(67890) = 5 → Bucket 5: [67890 → pointer to record]
Benefits:
-
Very fast exact-match queries: O(1) average time
-
No need to traverse tree structures
Limitations:
-
Poor for range queries
-
Performance degrades with poor hash function