How Hash Tables Work
Deep-dive into hash tables — hashing functions, collision resolution (chaining & open addressing), load factor, rehashing, Big-O analysis, real-world implementations, and an interactive playground.
The Big Picture — What Problem Do Hash Tables Solve?
Imagine you have a million user records and need to find one by username. With an array, you'd scan every record — O(n). With a hash table, you compute an index directly from the username and jump straight to it — O(1) on average. Hash tables are arguably the most widely used data structure in computer science, powering everything from Python dicts to database indexes to Redis caches.
The Library Analogy
Finding a book in an unsorted pile vs a perfectly indexed library.
📖 Check book #1... nope
📖 Check book #2... nope
📖 Check book #3... nope
📖 ... keep going until you find it
🔑 Look up "Algorithms" in the index card
📍 Index says: Shelf 7, Position 3
📖 Walk directly there — found it!
Array vs Hash Table — Live Lookup
Toggle mode and search for a name. Watch the step counter.
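The step counter above can be approximated in plain Python — a minimal sketch with made-up usernames, counting comparisons for a linear scan versus a single dict lookup:

```python
# A rough stand-in for the demo: count "steps" for a linear scan of a
# list versus a single dict lookup. (Usernames are made up.)
def linear_search_steps(records, target):
    for steps, name in enumerate(records, start=1):
        if name == target:        # one comparison per record scanned
            return steps
    return len(records)

records = [f"user{i}" for i in range(1_000_000)]
index = {name: i for i, name in enumerate(records)}  # the hash table

print(linear_search_steps(records, "user999999"))  # 1,000,000 comparisons
print(index["user999999"])                          # one hash, one jump
```

The dict lookup does the same amount of work whether the table holds ten entries or ten million.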
Hash Tables Are Everywhere
Python
dict / set
Java
HashMap / HashSet
JavaScript
Object / Map / Set
Go
map[K]V
Redis
Key-Value Store
Database
Hash Index
🔑 Key Takeaway: Hash tables solve the fundamental lookup problem — turning O(n) linear search into O(1) constant-time access by computing exactly where each item lives, rather than scanning everything.
How Hashing Functions Work
A hash function takes any input — a string, number, or object — and produces a fixed-size integer that serves as a bucket index. The quality of this function determines everything: a good hash distributes keys uniformly across buckets, while a bad one clusters them together, destroying O(1) performance.
Properties of a Good Hash Function
Deterministic: Same input ALWAYS produces the same output. hash("alice") = 5 every time.
Uniform distribution: Outputs are spread evenly across all buckets. No bucket should be much fuller than others.
Fast: O(k) where k = key length. Must be faster than the lookup it enables.
Avalanche effect: A tiny change in input causes a massive change in output. "abc" → 7, "abd" → 2.
Hash Computation — Step by Step
Type any string and watch the hash pipeline: string → ASCII codes → sum → modulo → bucket index.
Step 1: String → ASCII Codes
Step 2: Sum All ASCII Codes
110 + 97 + 109 + 101 = 417
Step 3: Modulo (% 8 buckets)
417 % 8 = 1
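The three steps can be sketched in a few lines of Python (`toy_hash` is a name invented for this illustration, not a production function):

```python
# Toy hash from the pipeline above: sum the ASCII codes, then take modulo.
def toy_hash(key: str, buckets: int = 8) -> int:
    total = sum(ord(c) for c in key)  # Steps 1-2: ASCII codes, summed
    return total % buckets            # Step 3: modulo → bucket index

assert sum(ord(c) for c in "name") == 417   # 110 + 97 + 109 + 101
assert toy_hash("name") == 1                # 417 % 8 = 1
```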
djb2 (production hash):
djb2("name") = 2090536006 → bucket 6
Distribution Quality — Bucket Fill Visualization
Add keys and watch how evenly (or unevenly) they distribute across 8 buckets.
8 keys across 8 buckets — ideal: 1.0 per bucket
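The same bucket-fill experiment can be run in a few lines, here using the toy sum-modulo hash from earlier and a handful of illustrative names:

```python
from collections import Counter

# Toy sum-modulo hash, as in the pipeline above.
def toy_hash(key: str, buckets: int = 8) -> int:
    return sum(ord(c) for c in key) % buckets

# Eight illustrative keys into eight buckets (ideal load: 1.0 per bucket).
keys = ["alice", "bob", "carol", "dave", "erin", "frank", "grace", "heidi"]
fill = Counter(toy_hash(k) for k in keys)

print(sorted(fill.items()))  # bucket → key count; some buckets double up
```

Even with as many buckets as keys, the toy hash leaves some buckets empty and puts two keys in others — real hashes like MurmurHash3 distribute far more evenly.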
Famous Hash Functions
| Name | Speed | Quality | Use Case |
|---|---|---|---|
| djb2 | Very Fast | Good | General purpose, string hashing. Default in many C hash tables. |
| MurmurHash3 | Fast | Excellent | Used in Redis, Cassandra, Elasticsearch. Great distribution. |
| xxHash | Fastest | Excellent | Fastest non-crypto hash. Used in Linux kernel, LZ4 compression. |
| SHA-256 | Slow | Perfect | Cryptographic — overkill for hash tables. Used in Bitcoin, TLS, git. |
🔑 Key Takeaway: A hash function converts any key into a fixed-size bucket index. A good one distributes keys uniformly across all buckets — minimizing collisions and keeping lookups at O(1).
Collision Resolution — Chaining
When two keys hash to the same bucket — a collision — the hash table needs a strategy. Separate chaining stores all colliding entries in a linked list at that bucket. It's simple, robust, and the default in Java's HashMap. But when chains grow long, lookups degrade from O(1) to O(n).
Separate Chaining — Animated Insertion
Click "Insert Next" to add keys one by one and watch collisions form chains.
[0]
empty
[1]
empty
[2]
empty
[3]
empty
[4]
empty
[5]
empty
[6]
empty
[7]
empty
⚠️ Worst Case: All Keys in One Bucket
If every key hashes to bucket 0, the chain becomes a linked list of n elements. Lookup degrades from O(1) to O(n). This is why hash function quality matters.
🌳 Java 8+ Improvement: Treeification
When a chain grows beyond 8 nodes, Java HashMap converts it from a LinkedList to a Red-Black Tree. Worst-case lookup improves from O(n) to O(log n). When it shrinks below 6, it converts back.
Lookup in a Chain
Insert some keys above, then search for one to see the chain traversal.
Chaining Implementation
// Simplified chaining HashMap (resizing and null handling omitted)
class HashMap<K, V> {
    private LinkedList<Entry<K, V>>[] buckets = new LinkedList[16];
    private int capacity = 16;

    public void put(K key, V value) {
        int index = Math.abs(key.hashCode()) % capacity;
        if (buckets[index] == null)
            buckets[index] = new LinkedList<>();
        for (Entry<K, V> e : buckets[index])
            if (e.key.equals(key)) { e.value = value; return; }
        buckets[index].add(new Entry<>(key, value));
    }
}

🔑 Key Takeaway: Chaining handles collisions by storing multiple entries in each bucket as a linked list. Average lookup is O(1+α) where α is the load factor, but degrades to O(n) if all keys cluster in one bucket.
Collision Resolution — Open Addressing
Open addressing takes a different approach: instead of chaining, it finds another empty slot within the same array. Linear probing checks the next slot, quadratic probing jumps by i², and double hashing uses a second hash function. It's more cache-friendly than chaining but introduces clustering problems and makes deletion tricky.
Open Addressing — Probing Strategies
Instead of chaining, open addressing finds another empty slot in the same array.
index = (hash + i) % size
Check the next slot sequentially. Simple but causes primary clustering — long runs of occupied slots form and slow lookups.
Primary Clustering Problem (Linear Probing)
As the table fills, occupied slots cluster together into long runs. New insertions that hash ANYWHERE in the cluster must probe to the end — making the cluster even longer. It's a positive feedback loop that degrades performance.
🪦 The Deletion Problem: Tombstones
You can't simply clear a slot — it would break the probe chain for keys inserted after it. Instead, mark deleted slots as "tombstones" (DELETED). Lookups skip tombstones but don't stop. Insertions can reuse them. Too many tombstones degrade performance → periodic rehash to clean up.
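A minimal sketch of tombstone handling in Python (the class and sentinel names are invented for this illustration; resizing and full-table edge cases are omitted):

```python
# Open-addressing table with tombstones.
DELETED = object()  # tombstone sentinel

class ProbingTable:
    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size  # None = never used, DELETED = tombstone

    def put(self, key, value):
        i = hash(key) % self.size
        reusable = None
        for _ in range(self.size):
            slot = self.slots[i]
            if slot is None:                      # end of probe chain
                self.slots[reusable if reusable is not None else i] = (key, value)
                return
            if slot is DELETED:
                if reusable is None:
                    reusable = i                  # remember, but keep probing
            elif slot[0] == key:
                self.slots[i] = (key, value)      # update in place
                return
            i = (i + 1) % self.size
        if reusable is not None:
            self.slots[reusable] = (key, value)

    def get(self, key):
        i = hash(key) % self.size
        for _ in range(self.size):
            slot = self.slots[i]
            if slot is None:
                return None                       # empty slot stops the search
            if slot is not DELETED and slot[0] == key:
                return slot[1]                    # tombstones are skipped over
            i = (i + 1) % self.size
        return None

    def delete(self, key):
        i = hash(key) % self.size
        for _ in range(self.size):
            slot = self.slots[i]
            if slot is None:
                return
            if slot is not DELETED and slot[0] == key:
                self.slots[i] = DELETED           # tombstone, never plain None
                return
            i = (i + 1) % self.size

t = ProbingTable(8)
t.put(0, "even")        # hash(0) % 8 == 0 → slot 0
t.put(8, "collides")    # hash(8) % 8 == 0 → probes to slot 1
t.delete(0)             # leaves a tombstone in slot 0
print(t.get(8))         # prints "collides" — lookup walks past the tombstone
```

Clearing slot 0 to None instead would have stopped the probe at slot 0 and made key 8 unreachable — exactly the broken-chain problem tombstones prevent.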
Chaining vs Open Addressing
| Feature | Chaining | Open Addressing |
|---|---|---|
| Memory | Extra pointers per node (linked list overhead) | No extra pointers — all data in the array |
| Cache Performance | Poor — linked list nodes scattered in memory | Excellent — data is contiguous in memory |
| Load Factor | Can exceed 1.0 (chains grow freely) | Must stay below 1.0 (table must have empty slots) |
| Deletion | Simple — remove node from linked list | Complex — requires tombstone markers |
| Clustering | No clustering problem | Primary/secondary clustering with linear/quadratic |
| Worst Case | O(n) — long chain | O(n) — long probe sequence |
Open Addressing — Linear Probing
class HashTable:
    def __init__(self, size=16):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size

    def put(self, key, value):
        index = hash(key) % self.size
        # Assumes the table is never full — a real implementation
        # resizes before the load factor reaches 1.0.
        while self.keys[index] is not None:
            if self.keys[index] == key:
                self.values[index] = value
                return
            index = (index + 1) % self.size  # linear probe
        self.keys[index] = key
        self.values[index] = value

🔑 Key Takeaway: Open addressing stores everything in the array itself — no linked lists. It has better cache performance but suffers from clustering and requires tombstones for deletion. Double hashing gives the best distribution.
Load Factor & Rehashing
The load factor (entries ÷ capacity) is the critical health metric of a hash table. As it approaches 1.0, collisions skyrocket and lookups slow down exponentially. When it exceeds a threshold (typically 0.75), the table rehashes — doubling its capacity and reinserting every entry. This sounds expensive, but the amortized cost is still O(1) per insertion.
Load Factor & Rehashing
Load factor = entries / capacity. When it exceeds the threshold, the table doubles in size and all entries are rehashed.
Rehash Threshold
0.75
Entries
0
Capacity
8
Load Factor
0.00
Collisions
0
[0]
—
[1]
—
[2]
—
[3]
—
[4]
—
[5]
—
[6]
—
[7]
—
Why Load Factor Matters — Collision Curve
Average probes needed vs load factor (open addressing). Above 0.75, performance degrades rapidly.
Load Factor (α) → Avg Probes
📊 Amortized O(1)
Rehashing costs O(n) when it happens, but it only happens after n/2 insertions (doubling). Spreading the O(n) cost across n/2 cheap O(1) insertions gives an amortized cost of O(1) per insertion — like a staircase where each big step is paid for by many small ones.
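The staircase argument can be checked numerically — this sketch counts one move per insertion plus one move per entry copied during each rehash, using the 0.75 threshold and ×2 growth described above:

```python
# Count total "moves" while inserting n keys into a doubling table:
# one move per insertion, plus one move per entry copied at each rehash.
def total_moves(n, capacity=8, threshold=0.75):
    entries = moves = 0
    while entries < n:
        entries += 1
        moves += 1                    # the insertion itself
        if entries / capacity > threshold:
            moves += entries          # rehash copies every entry
            capacity *= 2
    return moves

n = 100_000
print(total_moves(n) / n)  # amortized moves per insertion stays below 3
```

The ratio stays a small constant no matter how large n grows — each O(n) rehash is "paid for" by the cheap insertions that preceded it.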
Rehashing Implementation
func (h *HashTable) resize() {
    oldBuckets := h.buckets
    h.capacity *= 2
    h.buckets = make([][]Entry, h.capacity)
    h.count = 0
    for _, bucket := range oldBuckets {
        for _, entry := range bucket {
            h.Put(entry.Key, entry.Value)
        }
    }
}

🔑 Key Takeaway: Rehashing doubles the table size when the load factor exceeds a threshold (typically 0.75). Every entry is reinserted with the new modulus. This keeps collisions low and maintains O(1) average performance — at an amortized cost.
Why HashMap is O(1) Average but O(n) Worst Case
Hash tables are O(1) on average under the assumption of uniform hashing. But an adversary who knows your hash function can craft keys that all collide in one bucket, degrading each operation to O(n) — and a single request that inserts n such keys to O(n²) total work. Real-world hash flooding attacks have exploited this in PHP, Python, Ruby, and Java — leading to the adoption of randomized hashing (SipHash) as a defense.
Why O(1) Average but O(n) Worst Case?
Hash tables are O(1) on average with uniform hashing — but adversarial inputs can force O(n).
| Operation | Chaining | Open Addressing | Java 8+ (Treeified) |
|---|---|---|---|
| Insert (avg) | O(1) | O(1/(1-α)) | O(1) |
| Insert (worst) | O(n) | O(n) | O(log n) tree |
| Lookup (avg) | O(1+α) | O(1/(1-α)) | O(1) |
| Lookup (worst) | O(n) | O(n) | O(log n) tree |
| Delete (avg) | O(1+α) | O(1/(1-α)) | O(1) |
| Delete (worst) | O(n) | O(n) | O(log n) tree |
☠️ Worst Case: Adversarial Hash Collision
// All keys hash to bucket 0 — degenerate case
bucket[0]: key1 → key2 → key3 → ... → key_n
bucket[1]: (empty)
bucket[2]: (empty)
...
bucket[m]: (empty)
// Lookup for key_n requires traversing entire chain
// Time: O(n) — no better than a linked list!
An attacker who knows the hash function can craft keys that ALL hash to the same bucket, turning your O(1) hash table into an O(n) linked list.
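Against the toy sum-modulo hash from earlier, crafting such keys is trivial — 'h' has ASCII code 104, a multiple of 8, so every string of 'h' characters lands in bucket 0:

```python
# Craft 100 distinct keys that all collide under the toy sum-modulo hash.
def toy_hash(key: str, buckets: int = 8) -> int:
    return sum(ord(c) for c in key) % buckets

evil_keys = ["h" * i for i in range(1, 101)]  # sums are all multiples of 8

table = [[] for _ in range(8)]                # chaining table
for k in evil_keys:
    table[toy_hash(k)].append(k)

print([len(chain) for chain in table])        # one chain of 100, seven empty
```

Production hashes are harder to invert than this toy, but without a secret seed an attacker can still precompute colliding keys offline.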
Hash Flooding Attack — Real CVEs
Attacker sends crafted HTTP headers/POST data that all collide, turning O(1) per-key processing into O(n²) total work.
CVE-2003-0364 — Linux Kernel
Hash table DoS in IP fragment reassembly. Crafted packets caused all fragments to collide in one bucket, enabling kernel-level DoS.
CVE-2011-4885 — PHP
Crafted POST parameters all hashed to bucket 0. Processing 100,000 colliding keys took 30+ seconds, freezing web servers.
CVE-2012-5371 — Ruby
Predictable hash function allowed crafted HTTP headers to trigger O(n²) hash table behavior, enabling DoS attacks.
Hashdos (2011) — Perl, Python 2, Java, ASP.NET
Universal attack across languages. A single HTTP request with ~100KB of crafted form data could consume minutes of CPU time.
Defenses Against Hash Flooding
Randomized Hash Seed
Each process uses a random seed for hash computation. Same key produces different bucket indices across restarts. Attacker cannot predict collisions.
Python: PYTHONHASHSEED (random since 3.3)
Ruby: Hash seed randomized since 1.9
Perl: Hash seed randomized since 5.18
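You can observe this from Python itself by launching fresh interpreters with a fixed PYTHONHASHSEED (the helper name below is invented for the demo):

```python
import os
import subprocess
import sys

# Launch a fresh interpreter with a fixed hash seed and hash the same key.
def hash_with_seed(seed: str) -> int:
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('alice'))"],
        env={**os.environ, "PYTHONHASHSEED": seed},
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

print(hash_with_seed("0") == hash_with_seed("0"))  # same seed → same hash
print(hash_with_seed("0") == hash_with_seed("1"))  # new seed → new buckets
```

Without setting PYTHONHASHSEED, each process picks a random seed at startup, so an attacker cannot precompute colliding keys.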
🔑 Key Takeaway: Hash tables are O(1) average under uniform hashing, but adversarial inputs can force O(n) worst case. Real-world hash flooding attacks exploited this in PHP, Python, Ruby, and Java. Modern defenses use randomized seeds (SipHash) to make collision crafting computationally infeasible.
Real-World Implementations Compared
Every language implements hash tables differently. Java uses chaining with tree fallback. Python uses open addressing with a compact layout that preserves insertion order. Go uses 8-entry buckets with overflow chains. Redis maintains two tables simultaneously for non-blocking incremental rehashing. Understanding these differences helps you choose the right tool and avoid performance pitfalls.
Real-World Hash Table Implementations
How Java, Python, Go, and Redis implement hash tables differently — and why.
Java HashMap
Implementation Notes
Java 8 introduced treeification: when a chain exceeds 8 nodes AND capacity ≥ 64, the chain converts to a red-black tree (O(log n) worst case). Untreeifies at 6 nodes. The hash is also perturbed: h ^ (h >>> 16) to spread bits.
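The bit-spreading step can be mimicked in Python to see why it helps — two keys that differ only in their high 16 bits collide under naive masking but separate after the XOR-shift:

```python
# Java's spreader: index = (h ^ (h >>> 16)) & (capacity - 1).
# Mimicked here with 32-bit masking to show why it matters.
def spread(h: int) -> int:
    h &= 0xFFFFFFFF              # Java's int is 32-bit
    return (h ^ (h >> 16)) & 0xFFFFFFFF

capacity = 16
a, b = 0x10000, 0x20000          # differ only in their high bits
assert a & (capacity - 1) == b & (capacity - 1) == 0  # naive masking: collide
assert spread(a) & (capacity - 1) != spread(b) & (capacity - 1)  # separated
```

Because HashMap's capacity is always a power of two, the index mask only sees the low bits; XOR-ing the high half down makes those bits participate.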
Side-by-Side Comparison
| Feature | ☕ Java | 🐍 Python | 🐹 Go | 🔴 Redis |
|---|---|---|---|---|
| Collision Strategy | Chain → Tree | Open addressing | Bucket + overflow | Chaining |
| Load Factor | 0.75 | 0.667 | 6.5 (avg entries per 8-slot bucket) | 1.0 |
| Rehash | ×2 (stop-world) | ×2/×4 | Incremental | Incremental (2 tables) |
| Ordering | None | Insertion order | Random | None |
| Hash Function | Object.hashCode() | SipHash-1-3 | AES-based | SipHash-2-4 |
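Redis's dual-table scheme can be sketched in Python — `IncrementalDict` is an invented name and real Redis migrates buckets in batches with many more details, but the core idea (writes go to the new table, reads check both, each operation migrates one old bucket) looks like this:

```python
# Sketch of Redis-style incremental rehashing: two tables coexist until
# every old bucket has been migrated, one bucket per operation.
class IncrementalDict:
    def __init__(self, capacity=4):
        self.table = [[] for _ in range(capacity)]  # current (new) table
        self.old = None                             # non-None mid-rehash
        self.migrated = 0
        self.count = 0

    def _bucket(self, table, key):
        return table[hash(key) % len(table)]

    def _step(self):
        """Migrate one old bucket per operation instead of all at once."""
        if self.old is None:
            return
        for kv in self.old[self.migrated]:
            self._bucket(self.table, kv[0]).append(kv)
        self.old[self.migrated] = []
        self.migrated += 1
        if self.migrated == len(self.old):
            self.old = None                         # rehash complete

    def put(self, key, value):                      # updates/deletes omitted
        self._step()
        if self.old is None and self.count >= len(self.table):  # load >= 1.0
            self.old = self.table
            self.table = [[] for _ in range(2 * len(self.old))]
            self.migrated = 0
        self._bucket(self.table, key).append((key, value))
        self.count += 1

    def get(self, key):
        self._step()
        tables = (self.table,) if self.old is None else (self.table, self.old)
        for t in tables:
            for k, v in self._bucket(t, key):
                if k == key:
                    return v
        return None

d = IncrementalDict(4)
for i in range(20):
    d.put(i, i * 10)
print(d.get(7))  # prints 70 — found whether its bucket has migrated or not
```

No single operation ever pays the full O(n) rehash cost, which is why Redis can resize a dictionary of millions of keys without a latency spike.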
🔑 Key Takeaway: Every language implements hash tables differently — Java uses chaining with tree fallback, Python uses open addressing with compact layout, Go uses 8-entry buckets, and Redis uses dual-table incremental rehashing. The best choice depends on your workload: memory-constrained? Use open addressing. Latency-sensitive? Use incremental rehashing.
Interactive Hash Table Playground
Build your own hash table from scratch. Insert key-value pairs, watch them get hashed into buckets, toggle between chaining and open addressing, monitor the load factor, and search for keys to see the probe sequence in action.
Interactive Hash Table Playground
Insert key-value pairs, watch them hash into buckets, and search for keys.
Entries
0
Capacity
8
Load Factor
0.00
Max Chain
1
Insert Key-Value Pair
[0]
empty
[1]
empty
[2]
empty
[3]
empty
[4]
empty
[5]
empty
[6]
empty
[7]
empty
🔍 Lookup Key
🔑 Key Takeaway: Hash tables provide O(1) average-case lookups by computing a bucket index from the key. The choice between chaining and open addressing affects memory layout, cache behavior, and how collisions are resolved. Keeping the load factor below 0.75 ensures collisions stay rare.