Hash Tables in C — C Programming Lecture Notes

A hash table (or hash map) is one of the most powerful data structures in computing. It maps keys to values with an average-case O(1) lookup, insertion, and deletion. This makes it the go-to choice for dictionaries, caches, symbol tables, database indexes, and deduplication systems. In this lecture, we will build a production-quality hash table from scratch in pure C using separate chaining with dynamically allocated linked-list buckets.

1. Why Hash Tables Matter

Imagine you need to store 100,000 key-value pairs and look up values by key hundreds of thousands of times. With an unsorted array, each lookup is O(n): you must scan every element. With a balanced binary search tree, it is O(log n). With a well-designed hash table, it is O(1) average—a constant number of steps irrespective of the table's size.

The core idea is deceptively simple: feed your key into a hash function, which produces an integer index. Use that index to jump directly to a slot (bucket) in an array. If two keys map to the same index (collision), resolve it.

+---+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |   <-- Array of buckets
+---+---+---+---+---+---+---+---+
  |
  v
+---------+    +---------+
| "alice" |-->| "carol" |--> NULL   <-- Collision chain (linked list)
|  42     |    |  99     |
+---------+    +---------+

The three essential operations, all amortised O(1):

insert(key, value): Hash the key, find the bucket, prepend (or append) a node.
lookup(key): Hash the key, walk the chain at that bucket, return the value.
delete(key): Hash the key, walk the chain, unlink the matching node.

2. Hash Functions

A hash function maps arbitrary data (strings, structs) to a fixed-size integer. A good hash function must be:

Deterministic: The same key always produces the same hash.
Uniform: Outputs must be evenly spread across the output range.
Fast: Hashing must be cheap; it happens on every operation.
Avalanche effect: A tiny change in input should drastically change the output.

A poor hash function that always returns 0 degenerates a hash table into a single linked list—O(n) for everything. A hash table is only as good as its hash function.

2.1 The djb2 Hash

Created by Daniel J. Bernstein, djb2 is one of the simplest and most effective string hash functions. It multiplies an accumulator by 33 and XORs in each character:

unsigned long hash_djb2(const char *str) {
    unsigned long hash = 5381;
    int c;

    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c;  // hash * 33 + c
    }

    return hash;
}

The magic number 5381 and multiplier 33 work well empirically for ASCII strings. The left-shift trick (hash << 5) + hash computes hash * 33 using a fast bit-shift-and-add instead of multiplication. djb2 produces a 32-bit or 64-bit value, which we then reduce modulo the table size.

2.2 The FNV-1a Hash

Fowler–Noll–Vo (FNV) is another widely-used non-cryptographic hash. The FNV-1a variant XORs before multiplying:

#define FNV_OFFSET 14695981039346656037UL
#define FNV_PRIME  1099511628211UL

unsigned long hash_fnv1a(const char *str) {
    unsigned long hash = FNV_OFFSET;

    while (*str) {
        hash ^= (unsigned char)*str++;
        hash *= FNV_PRIME;
    }

    return hash;
}

FNV-1a has slightly better avalanche characteristics than djb2 for English text, but both are excellent choices for general-purpose hash tables.

2.3 Reducing to a Table Index

Hashes produce large integers. To map into [0, table_size - 1], take the modulo:

size_t index = hash(key) % table->capacity;

For best distribution, table sizes should be prime numbers when using modulo. Powers of two let you use the faster hash & (capacity - 1), but require a hash function with good low-bit entropy. We will use primes for safety.

3. Collision Handling

A collision happens when two different keys hash to the same bucket index. We must handle this gracefully.

3.1 Separate Chaining

Each bucket holds a linked list of all key-value pairs that hash to that index. Lookup walks the chain; insertion prepends. This is the approach we implement. Even with a poor load factor, performance degrades gracefully.

Bucket[0] --> NULL
Bucket[1] --> ("dog", 5) --> ("god", 7) --> NULL   <-- collision!
Bucket[2] --> ("cat", 3) --> NULL
Bucket[3] --> NULL

3.2 Open Addressing

All entries live directly in the array. On collision, a probe sequence (linear, quadratic, or double hashing) finds the next empty slot. Linear probing: index = (hash + i) % capacity. This is more cache-friendly but harder to implement correctly for deletions (you need tombstone markers).

Approach	Pros	Cons
Separate Chaining	Simple deletions, graceful degradation, unlimited capacity per bucket	Extra pointer memory overhead, potential cache misses from pointer chasing
Open Addressing	Cache-friendly, no pointer overhead, all data in one contiguous block	Complex deletions (tombstones), clustering, must resize earlier

4. Complete Hash Table Implementation

We will build a generic hash table with char * keys and int values using separate chaining. Every function includes defensive checks, and we provide full memory management.

4.1 Header File

/* hash_table.h */
#ifndef HASH_TABLE_H
#define HASH_TABLE_H

#include <stddef.h>   /* size_t */

/* A single node in a bucket's linked list */
typedef struct HashNode {
    char           *key;
    int             value;
    struct HashNode *next;
} HashNode;

/* The hash table itself */
typedef struct {
    HashNode **buckets;   /* array of linked-list head pointers */
    size_t     capacity;  /* number of buckets */
    size_t     size;      /* number of stored key-value pairs */
} HashTable;

/* --- Public API --- */

HashTable *ht_create(size_t capacity);
void       ht_destroy(HashTable *table);

/* Insert or update. Returns 0 on success, -1 on error. */
int  ht_insert(HashTable *table, const char *key, int value);

/* Lookup. Returns 1 and sets *value if found, 0 otherwise. */
int  ht_lookup(const HashTable *table, const char *key, int *value);

/* Delete. Returns 1 if the key was found and removed, 0 otherwise. */
int  ht_delete(HashTable *table, const char *key);

/* Utility */
size_t ht_size(const HashTable *table);
float  ht_load_factor(const HashTable *table);

#endif /* HASH_TABLE_H */

4.2 Implementation

/* hash_table.c */
#include "hash_table.h"
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* ---------------------------------------------------------------------------
 * Internal: hash function (djb2)
 * --------------------------------------------------------------------------- */
static unsigned long hash_djb2(const char *str) {
    unsigned long hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c;
    }
    return hash;
}

/* ---------------------------------------------------------------------------
 * Internal: resize the table to a new capacity and rehash all entries
 * --------------------------------------------------------------------------- */
static int ht_resize(HashTable *table, size_t new_capacity) {
    HashNode **new_buckets = calloc(new_capacity, sizeof(HashNode *));
    if (!new_buckets) return -1;

    /* Rehash every existing node into the new bucket array */
    for (size_t i = 0; i < table->capacity; i++) {
        HashNode *node = table->buckets[i];
        while (node) {
            HashNode *next = node->next;

            size_t new_index = hash_djb2(node->key) % new_capacity;

            /* Prepend to new bucket */
            node->next = new_buckets[new_index];
            new_buckets[new_index] = node;

            node = next;
        }
    }

    free(table->buckets);
    table->buckets  = new_buckets;
    table->capacity = new_capacity;
    return 0;
}

/* Prime numbers near powers of two for good modulo distribution */
static const size_t PRIMES[] = {
    53, 97, 193, 389, 769, 1543, 3079, 6151, 12289,
    24593, 49157, 98317, 196613, 393241, 786433,
    1572869, 3145739, 6291469, 12582917, 25165843,
    50331653, 100663319, 201326611, 402653189,
    805306457, 1610612741
};
static const size_t NUM_PRIMES =
    sizeof(PRIMES) / sizeof(PRIMES[0]);

/* ---------------------------------------------------------------------------
 * Public: create a hash table with the given initial capacity
 * --------------------------------------------------------------------------- */
HashTable *ht_create(size_t capacity) {
    HashTable *table = malloc(sizeof(HashTable));
    if (!table) return NULL;

    /* Snap to the next suitable prime */
    size_t cap = 53;  /* minimum */
    for (size_t i = 0; i < NUM_PRIMES; i++) {
        if (PRIMES[i] >= capacity) {
            cap = PRIMES[i];
            break;
        }
    }

    table->capacity = cap;
    table->size     = 0;
    table->buckets  = calloc(cap, sizeof(HashNode *));
    if (!table->buckets) {
        free(table);
        return NULL;
    }
    return table;
}

/* ---------------------------------------------------------------------------
 * Public: destroy the hash table, freeing all memory
 * --------------------------------------------------------------------------- */
void ht_destroy(HashTable *table) {
    if (!table) return;

    for (size_t i = 0; i < table->capacity; i++) {
        HashNode *node = table->buckets[i];
        while (node) {
            HashNode *next = node->next;
            free(node->key);
            free(node);
            node = next;
        }
    }
    free(table->buckets);
    free(table);
}

/* ---------------------------------------------------------------------------
 * Public: insert or update a key-value pair
 * --------------------------------------------------------------------------- */
int ht_insert(HashTable *table, const char *key, int value) {
    if (!table || !key) return -1;

    /* Check load factor and resize if needed (threshold: 0.75) */
    float load = (float)(table->size + 1) / (float)table->capacity;
    if (load > 0.75f) {
        /* Find next prime in our table */
        size_t new_cap = table->capacity;
        for (size_t i = 0; i < NUM_PRIMES; i++) {
            if (PRIMES[i] > table->capacity) {
                new_cap = PRIMES[i];
                break;
            }
        }
        /* If we are already at the largest prime, double it */
        if (new_cap == table->capacity) {
            new_cap = table->capacity * 2 + 1;
        }
        if (ht_resize(table, new_cap) != 0) return -1;
    }

    size_t index = hash_djb2(key) % table->capacity;

    /* Check if key already exists -- update value if so */
    HashNode *node = table->buckets[index];
    while (node) {
        if (strcmp(node->key, key) == 0) {
            node->value = value;
            return 0;  /* updated */
        }
        node = node->next;
    }

    /* Key not found -- prepend a new node */
    HashNode *new_node = malloc(sizeof(HashNode));
    if (!new_node) return -1;

    new_node->key = strdup(key);
    if (!new_node->key) {
        free(new_node);
        return -1;
    }
    new_node->value = value;
    new_node->next  = table->buckets[index];
    table->buckets[index] = new_node;
    table->size++;
    return 0;
}

/* ---------------------------------------------------------------------------
 * Public: look up a value by key
 * --------------------------------------------------------------------------- */
int ht_lookup(const HashTable *table, const char *key, int *value) {
    if (!table || !key || !value) return 0;

    size_t index = hash_djb2(key) % table->capacity;
    HashNode *node = table->buckets[index];

    while (node) {
        if (strcmp(node->key, key) == 0) {
            *value = node->value;
            return 1;
        }
        node = node->next;
    }
    return 0;
}

/* ---------------------------------------------------------------------------
 * Public: delete a key-value pair
 * --------------------------------------------------------------------------- */
int ht_delete(HashTable *table, const char *key) {
    if (!table || !key) return 0;

    size_t index = hash_djb2(key) % table->capacity;
    HashNode *node  = table->buckets[index];
    HashNode *prev  = NULL;

    while (node) {
        if (strcmp(node->key, key) == 0) {
            /* Unlink from the chain */
            if (prev) {
                prev->next = node->next;
            } else {
                table->buckets[index] = node->next;
            }
            free(node->key);
            free(node);
            table->size--;
            return 1;
        }
        prev = node;
        node = node->next;
    }
    return 0;
}

/* ---------------------------------------------------------------------------
 * Public: utility functions
 * --------------------------------------------------------------------------- */
size_t ht_size(const HashTable *table) {
    return table ? table->size : 0;
}

float ht_load_factor(const HashTable *table) {
    if (!table || table->capacity == 0) return 0.0f;
    return (float)table->size / (float)table->capacity;
}

5. Struct Design Deep Dive

Our design uses two structs:

typedef struct HashNode {
    char           *key;      /* heap-allocated copy of the key */
    int             value;    /* user's data */
    struct HashNode *next;    /* pointer to next node in chain */
} HashNode;

typedef struct {
    HashNode **buckets;       /* array of HashNode* pointers */
    size_t     capacity;      /* length of the buckets array */
    size_t     size;          /* number of key-value pairs stored */
} HashTable;

Key design decisions:

buckets is a HashNode **: An array of pointers-to-head-nodes. Each buckets[i] is NULL when empty, or points to the first node in a chain. This indirection keeps the table array itself compact (just pointers).
strdup() for keys: We copy incoming string keys to the heap so the caller can free or reuse their buffer. We must free(node->key) during deletions and table destruction. A production system might optionally take ownership via a flag.
Prepending (not appending): New nodes are inserted at the head of the chain (LIFO). This is O(1) without traversing the list. Traversal order is irrelevant for hash tables.
Tracking size separately: We increment/decrement on insert and delete so ht_size() is O(1) rather than counting every bucket every time.

6. Resizing and Rehashing

As the table fills up, chains grow longer and performance degrades from O(1) toward O(n). The load factor is the ratio of entries to buckets:

load_factor = size / capacity

We resize when the load factor exceeds 0.75—a threshold well-established in the literature. Resizing involves:

Allocate a new, larger buckets array (zeroed with calloc).
Walk every node in every chain of the old table.
Recompute hash(key) % new_capacity for each node.
Insert (prepend) into the new bucket array.
Free the old bucket array.

This is an O(n) operation that happens rarely—amortised across many insertions, it adds negligible per-operation cost. Our prime-number table of capacities doubles roughly each step, so resizes become exponentially less frequent as the table grows.

Resize is called from ht_insert before inserting the new element. This guarantees the table always has room:

static int ht_resize(HashTable *table, size_t new_capacity) {
    HashNode **new_buckets = calloc(new_capacity, sizeof(HashNode *));
    if (!new_buckets) return -1;

    for (size_t i = 0; i < table->capacity; i++) {
        HashNode *node = table->buckets[i];
        while (node) {
            HashNode *next = node->next;
            size_t new_index = hash_djb2(node->key) % new_capacity;
            node->next = new_buckets[new_index];
            new_buckets[new_index] = node;
            node = next;
        }
    }
    free(table->buckets);
    table->buckets  = new_buckets;
    table->capacity = new_capacity;
    return 0;
}

7. String Keys vs Integer Keys

Our implementation uses string keys because they are the most common real-world case. Adapting for integer keys is straightforward:

/* Integer hash function -- simple multiplicative hash */
static size_t hash_int(int key, size_t capacity) {
    /* Knuth's multiplicative hash */
    return (key * 2654435761UL) % capacity;
}

/* For integer-key variant, change the node struct: */
typedef struct IntHashNode {
    int              key;
    int              value;
    struct IntHashNode *next;
} IntHashNode;

/* Comparisons change from strcmp() to a simple == */

For generic key types, you can use a void * key with function pointers for hashing and comparison passed at table creation—but that adds complexity and indirection overhead. For most C projects, specialising on the key type is clearer and faster.

8. Common Pitfalls

8.1 Bad Hash Distribution

If your hash function maps many keys to the same bucket, you have built an expensive linked list. Always test distribution. A quick histogram tool:

void ht_print_distribution(const HashTable *table) {
    printf("Bucket distribution (capacity=%zu, size=%zu, load=%.2f):\n",
           table->capacity, table->size, ht_load_factor(table));
    for (size_t i = 0; i < table->capacity; i++) {
        int count = 0;
        HashNode *node = table->buckets[i];
        while (node) { count++; node = node->next; }
        if (count > 0) {
            printf("  bucket[%3zu]: %d\n", i, count);
        }
    }
}

8.2 Forgetting to Free Memory

Three levels of memory must be freed:

Each node (HashNode) allocated in ht_insert.
Each node's key string allocated by strdup.
The bucket array and the table struct itself.

Our ht_destroy() correctly walks every chain and frees nodes and keys, then the bucket array, then the table. Always use ht_destroy()—never free(table) alone.

8.3 Iterator Invalidation

If you are iterating over a hash table and call ht_insert (which may trigger a resize), all pointers to nodes become invalid because the entire bucket array is replaced. Either defer mutations during iteration, or use an index-based iteration pattern that re-acquires table->buckets after each step:

/* Safe iteration that survives resizes */
for (size_t i = 0; i < table->capacity; i++) {
    HashNode *node = table->buckets[i];  /* re-read after potential resize */
    while (node) {
        /* ... process node ... */
        node = node->next;
    }
}

8.4 NULL Key Guard

Always check for NULL keys at the API boundary. A null pointer passed to strcmp or hash_djb2 causes a segfault. Our functions return early if the key is NULL.

8.5 Integer Overflow in Hash Functions

Multiplying large values in a hash function can overflow. Our implementation uses unsigned long, which wraps around modulo 2^64 (or 2^32 on 32-bit systems) with well-defined behaviour in C. This is intentional and harmless—hash functions rely on modulo arithmetic.

9. Performance Characteristics

Average-case time complexities (assuming a good hash function and load factor ≤ 0.75):

Operation	Hash Table	Sorted Array	BST / AVL Tree
Insert	O(1)	O(n)	O(log n)
Lookup	O(1)	O(log n)*	O(log n)
Delete	O(1)	O(n)	O(log n)
Ordered traversal	No	Yes	Yes
Memory overhead	Moderate (pointers)	Minimal	Moderate (left/right pointers)

*Sorted array lookup is O(log n) with binary search.

When to use a hash table vs alternatives:

Hash table: You need fast lookups by exact key, order does not matter, keys are hashable.
Array: You need sequential access, random access by integer index, or minimal memory overhead.
Tree: You need ordered traversal (range queries, "next greater key"), or consistent worst-case guarantees.

10. Complete Demo Program

Putting everything together with a main() that exercises all operations:

/* demo.c */
#include "hash_table.h"
#include <stdio.h>
#include <assert.h>

int main(void) {
    /* Create with capacity hint of 10 (snaps to prime 53) */
    HashTable *ht = ht_create(10);
    assert(ht != NULL);

    /* Insert */
    ht_insert(ht, "alice", 42);
    ht_insert(ht, "bob",   17);
    ht_insert(ht, "carol", 99);
    ht_insert(ht, "dave",  55);

    printf("Size after 4 inserts: %zu (load=%.2f)\n",
           ht_size(ht), ht_load_factor(ht));

    /* Lookup */
    int val;
    if (ht_lookup(ht, "alice", &val)) {
        printf("alice = %d\n", val);        /* 42 */
    }

    /* Update */
    ht_insert(ht, "alice", 100);
    ht_lookup(ht, "alice", &val);
    printf("alice after update = %d\n", val); /* 100 */

    /* Delete */
    if (ht_delete(ht, "bob")) {
        printf("Deleted bob\n");
    }
    printf("Lookup bob after delete: %s\n",
           ht_lookup(ht, "bob", &val) ? "found" : "not found");

    /* Insert many to trigger resize */
    for (int i = 0; i < 200; i++) {
        char key[32];
        snprintf(key, sizeof(key), "user_%d", i);
        ht_insert(ht, key, i * 10);
    }

    printf("After bulk insert: size=%zu, capacity=%zu, load=%.2f\n",
           ht_size(ht), ht->capacity, ht_load_factor(ht));

    /* Verify a few */
    ht_lookup(ht, "user_42", &val);
    printf("user_42 = %d (expected 420)\n", val);
    ht_lookup(ht, "user_199", &val);
    printf("user_199 = %d (expected 1990)\n", val);

    /* Distribution check */
    ht_print_distribution(ht);

    ht_destroy(ht);
    printf("All tests passed.\n");
    return 0;
}

Compile and run:

gcc -Wall -Wextra -std=c99 -o demo hash_table.c demo.c
./demo

11. Exercises

Exercise 1: Word Frequency Counter

Using the hash table you built, write a program that reads a text file and counts how many times each word appears. Tokenise on whitespace and punctuation. At the end, print the top 10 most frequent words and their counts. Use the hash table with the word as key and count as value. Handle case-insensitivity by converting words to lowercase before hashing.

Test on a sample text—Gutenberg books work well. Expected output for the first paragraph of A Tale of Two Cities: "the" should appear 10 times, "of" 5 times.

Exercise 2: LRU Cache with Hash Table + Doubly Linked List

Extend the hash table to implement a Least Recently Used (LRU) cache with a maximum capacity. When the cache is full and a new item is inserted, evict the least recently used entry. Use a doubly linked list to track access order, with the hash table providing O(1) lookup into list nodes. Implement cache_get() (moves accessed item to front) and cache_put() (inserts or updates, evicts if needed).

Test with a capacity of 3 and the access sequence: put(1,1), put(2,2), get(1), put(3,3), put(4,4). After these operations, key 2 should be evicted and key 4 should be present.

Exercise 3: Hash Table with Custom Allocator

Modify the hash table to accept custom malloc/free function pointers at creation time via a HashTableConfig struct. Then build an arena allocator variant: pre-allocate a large block of memory and serve all hash table allocations from it using a simple bump pointer. Measure and compare the performance (insertions per second) between the standard malloc version and the arena version for 1,000,000 insertions. Discuss the trade-offs: when would you use each approach in a real system?