Introduction to Threading with Pthreads

A thread is the smallest unit of execution within a process. While a traditional C program runs as a single thread doing one thing at a time, multithreading lets your program split into multiple threads that run concurrently—potentially in parallel on multi-core CPUs. The POSIX Threads library, or pthreads, is the standard threading API on Linux and Unix-like systems.

1. What Is Threading and Why It Matters

Think of a single-threaded program as one chef cooking a meal alone: chop vegetables, then boil water, then cook pasta, then prepare sauce—always one step at a time. A multi-threaded program is like having several chefs: while one boils water, another chops vegetables, and a third prepares the sauce. The work finishes faster because tasks overlap.

Key benefits of threading in C:

  • Parallelism: On multi-core CPUs, threads truly run simultaneously, dividing CPU-bound work across cores.
  • Responsiveness: A GUI or network server can keep one thread handling user input while another performs heavy computation in the background.
  • Resource sharing: Threads within the same process share the same memory space, making inter-thread communication cheap compared to inter-process communication (IPC).

However, shared memory also introduces race conditions and synchronisation challenges—topics we will tackle head-on in this lesson.

2. The pthreads Library Overview

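pthreads is not part of the C standard library; it is defined by POSIX and available on Linux, macOS, and most Unix-like systems. To use it, include <pthread.h> and link against the thread library. The commands in this lesson use -lpthread; with GCC and Clang, -pthread is generally the preferred spelling because it applies to both compiling and linking: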

gcc -o program program.c -lpthread

Here are the fundamental types you will work with:

Type              Purpose
pthread_t         Opaque thread identifier (like a "handle" to a thread)
pthread_attr_t    Thread attributes (stack size, scheduling policy, detach state)
pthread_mutex_t   Mutual exclusion lock for protecting shared data
pthread_cond_t    Condition variable for signalling between threads

We will focus on pthread_t and pthread_mutex_t in this lesson. Most pthread functions return 0 on success and a non-zero error code on failure.

3. Creating and Joining Threads

pthread_create()

The core function for spawning a new thread:

int pthread_create(
    pthread_t *thread,          /* OUT: thread ID */
    const pthread_attr_t *attr, /* NULL = default attributes */
    void *(*start_routine)(void*), /* function the thread runs */
    void *arg                   /* single argument passed to start_routine */
);

Each thread executes a function with the signature void* func(void *arg). The arg parameter lets you pass data—typically a pointer to a struct if you need multiple arguments.

pthread_join()

Joining means waiting for a thread to finish. If main() returns before the worker threads complete, the whole process exits and the workers are cut off mid-task. A joinable thread that is never joined also leaks its resources:

int pthread_join(pthread_t thread, void **retval);

Here is a minimal complete example—two threads printing messages:

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

void* worker(void *arg) {
    int id = *(int*)arg;
    printf("Thread %d: starting work\n", id);
    sleep(1);  /* simulate work */
    printf("Thread %d: finished\n", id);
    return NULL;
}

int main() {
    pthread_t t1, t2;
    int id1 = 1, id2 = 2;

    pthread_create(&t1, NULL, worker, &id1);
    pthread_create(&t2, NULL, worker, &id2);

    /* Wait for both threads to finish */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("Both threads done. Main exiting.\n");
    return 0;
}
gcc -o threads threads.c -lpthread
./threads

Output order may vary between runs—that is the nature of concurrency!

4. Race Conditions

A race condition occurs when two or more threads access shared data simultaneously, and at least one access is a write. The final result depends on the unpredictable timing of thread scheduling.

Here is the classic example—two threads incrementing a global counter 1,000,000 times each:

#include <stdio.h>
#include <pthread.h>

int counter = 0;  /* shared global variable */

void* increment(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        counter++;  /* RACE CONDITION HERE */
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;

    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("Expected: 2000000\n");
    printf("Actual:   %d\n", counter);
    return 0;
}
gcc -o race race.c -lpthread
./race
# Possible output:
# Expected: 2000000
# Actual:   1456783  (varies every run!)

Why does this happen? The innocent-looking counter++ is not atomic. On most CPUs it breaks down into three steps:

  1. LOAD: Read counter from memory into a register.
  2. INCREMENT: Add 1 inside the register.
  3. STORE: Write the register back to memory.

If both threads read the same value before either writes back, one increment is lost:

Time   Thread 1              Thread 2

t1     LOAD counter (100)     ...
t2     INCREMENT (101)        LOAD counter (100)  /* still 100! */
t3     STORE counter (101)    INCREMENT (101)
t4     ...                    STORE counter (101)  /* overwrote T1's work! */

Both threads incremented once each, but the counter only went from 100 → 101 instead of 100 → 102.

5. Mutexes: Solving Race Conditions

A mutex (mutual exclusion) is a lock that ensures only one thread at a time executes a critical section of code. It is the simplest and most common synchronisation primitive.

The Mutex API

pthread_mutex_t lock;                   /* declare a mutex */

pthread_mutex_init(&lock, NULL);        /* initialise (NULL = default attrs) */
pthread_mutex_lock(&lock);              /* acquire the lock (blocks if held) */
/* ... critical section (only one thread here at a time) ... */
pthread_mutex_unlock(&lock);            /* release the lock */
pthread_mutex_destroy(&lock);           /* clean up when done */

Alternatively, use the static initialiser for global/static mutexes:

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

Fixing the Race Condition

#include <stdio.h>
#include <pthread.h>

int counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void* increment(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;  /* safe now — only one thread at a time */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;

    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("Expected: 2000000\n");
    printf("Actual:   %d\n", counter);

    pthread_mutex_destroy(&lock);
    return 0;
}
gcc -o race_fixed race_fixed.c -lpthread
./race_fixed
# Expected: 2000000
# Actual:   2000000  /* correct every time! */

Important: Lock only the smallest necessary region. Locking too much code destroys parallelism; locking too little allows races. Aim to minimise the critical section while keeping it correct.

6. Practical Example: Parallel Array Sum

Let us apply threading to a real computation. We will sum a large array of integers, splitting the work across 4 threads, and compare the performance against a single-threaded version.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>

#define ARRAY_SIZE 100000000   /* 100 million */
#define NUM_THREADS 4

long long *array;
long long total_sum = 0;
pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct {
    int thread_id;
    long long start;
    long long end;
} ThreadArgs;

void* partial_sum(void *arg) {
    ThreadArgs *args = (ThreadArgs*)arg;
    long long local_sum = 0;

    for (long long i = args->start; i < args->end; i++) {
        local_sum += array[i];
    }

    pthread_mutex_lock(&sum_lock);
    total_sum += local_sum;
    pthread_mutex_unlock(&sum_lock);

    printf("Thread %d: range [%lld, %lld) sum = %lld\n",
           args->thread_id, args->start, args->end, local_sum);
    return NULL;
}

int main() {
    /* Allocate and initialise array (about 800 MB, so check the result) */
    array = malloc(ARRAY_SIZE * sizeof *array);
    if (array == NULL) {
        perror("malloc");
        return 1;
    }
    for (long long i = 0; i < ARRAY_SIZE; i++) {
        array[i] = i + 1;  /* 1, 2, 3, ..., ARRAY_SIZE */
    }

    /* ------ Multi-threaded sum ------ */
    /* Measure wall-clock time with clock_gettime() (POSIX, <time.h>).
       clock() would report CPU time summed across all threads and
       hide the parallel speedup. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    pthread_t threads[NUM_THREADS];
    ThreadArgs thread_args[NUM_THREADS];
    long long chunk_size = ARRAY_SIZE / NUM_THREADS;

    for (int i = 0; i < NUM_THREADS; i++) {
        thread_args[i].thread_id = i;
        thread_args[i].start = i * chunk_size;
        /* The last thread also takes any remainder */
        thread_args[i].end = (i == NUM_THREADS - 1)
            ? ARRAY_SIZE
            : (i + 1) * chunk_size;
        pthread_create(&threads[i], NULL, partial_sum, &thread_args[i]);
    }

    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double multi_time = (double)(t1.tv_sec - t0.tv_sec)
                      + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;

    printf("\nMulti-threaded sum: %lld\n", total_sum);
    printf("Multi-threaded time: %.4f seconds\n", multi_time);

    /* ------ Single-threaded sum (for comparison) ------ */
    clock_gettime(CLOCK_MONOTONIC, &t0);

    long long single_sum = 0;
    for (long long i = 0; i < ARRAY_SIZE; i++) {
        single_sum += array[i];
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double single_time = (double)(t1.tv_sec - t0.tv_sec)
                       + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;

    printf("Single-threaded sum: %lld\n", single_sum);
    printf("Single-threaded time: %.4f seconds\n", single_time);

    /* Guard the divisor, not the dividend */
    if (multi_time > 0) {
        printf("\nSpeedup: %.2fx\n", single_time / multi_time);
    }

    pthread_mutex_destroy(&sum_lock);
    free(array);
    return 0;
}
gcc -O2 -o parallel_sum parallel_sum.c -lpthread
./parallel_sum

Key takeaways from this example:

  • Each thread computes a local partial sum first (no locking needed for the bulk of the work).
  • Only the final accumulation into total_sum requires the mutex—this minimises lock contention.
  • On a multi-core machine, you should see a significant speedup (ideally close to 4× with 4 threads, though thread-creation overhead and memory bandwidth limit the real-world gain).

7. Thread Safety Considerations

Not all C library functions are safe to call from multiple threads simultaneously. Here is what you need to know:

Thread-Safe Functions

POSIX requires most standard library functions to be thread-safe. Functions like malloc(), free(), printf(), and fopen() are safe to call concurrently. Each thread gets its own errno.

Functions That Are NOT Thread-Safe

Watch out for functions that use internal static buffers:

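  • strtok(): use strtok_r() (the reentrant version) instead.
  • asctime(), ctime(): use asctime_r(), ctime_r().
  • localtime(), gmtime(): use localtime_r(), gmtime_r().
  • rand(): use a per-thread PRNG state. rand_r() exists, but POSIX.1-2008 marks it obsolescent.
  • readdir(): safe when each thread reads its own DIR stream; guard a shared stream with a mutex. readdir_r() is deprecated in glibc and best avoided in new code.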

Reentrant Functions

A reentrant function can be safely interrupted mid-execution and called again (e.g., from a signal handler or another thread) without corrupting data. Reentrant functions do not rely on static or global data; the caller supplies any buffer or state they need. Where POSIX provides a reentrant replacement for a non-thread-safe function, its name ends in _r. Prefer those _r variants whenever you write multi-threaded code.

8. Common Pitfalls

Deadlocks

A deadlock happens when two threads each hold a lock the other needs, and neither can proceed:

/* Thread 1: */                  /* Thread 2: */
pthread_mutex_lock(&lockA);      pthread_mutex_lock(&lockB);
pthread_mutex_lock(&lockB);      pthread_mutex_lock(&lockA);
/* ... never reached ... */      /* ... never reached ... */

Prevention: Always acquire locks in the same global order. If you must lock A then B in one place, never lock B then A elsewhere. For complex cases, use pthread_mutex_trylock() and back off if acquisition fails.

Forgetting to Join

If you create threads but never join (or detach) them, you leak thread resources. Always either:

  • Call pthread_join() to wait for the thread and reclaim its resources, or
  • Call pthread_detach() to let the OS automatically clean up when the thread exits.

Sharing Stack Variables

Never pass a pointer to a local (stack) variable from the creating function to a thread if the creating function may return before the thread finishes with it:

void* worker(void *arg) {
    int *p = (int*)arg;
    printf("%d\n", *p);  /* DANGER: *p may be garbage! */
    return NULL;
}

void create_thread_badly() {
    int local = 42;
    pthread_t t;
    pthread_create(&t, NULL, worker, &local);
    /* create_thread_badly returns - local goes out of scope - dangling pointer! */
}

Fix: Either allocate the argument on the heap with malloc(), use a global/static variable, or ensure the creator joins the thread before returning.

False Sharing

When two threads write to different variables that happen to reside on the same CPU cache line (typically 64 bytes), the cache-coherency protocol forces constant invalidation, killing performance. Pad frequently-written per-thread data to cache-line boundaries if you observe unexplained slowdowns.

9. When NOT to Use Threads

Threads are not always the answer. Avoid threading when:

  • The problem is I/O-bound: Threads that spend most of their time blocked on disk or network are not limited by the CPU, so adding threads for parallel speedup gains little. Consider asynchronous I/O or event-driven approaches instead.
  • The work is too small: Thread creation and synchronisation have overhead. If a task takes microseconds, the threading overhead may outweigh any gain.
  • Single-core constrained: On a single-core CPU, threads do not provide true parallelism—only concurrency (time-slicing). The context-switching overhead can make multi-threaded code slower than single-threaded.
  • Debugging complexity is unacceptable: Race conditions and deadlocks are notoriously hard to reproduce and debug. If correctness is paramount and the problem is small, keep it single-threaded.
  • Alternatives are simpler: For many producer-consumer problems, multiple processes with pipes or message queues provide natural isolation that avoids shared-memory bugs entirely.

10. Exercises

Exercise 1: Threaded Greeting

Write a program that creates 5 threads. Each thread prints "Hello from thread N" where N is the thread number (passed as an argument). Use pthread_join() to wait for all threads before main() exits. Observe that the order of messages is non-deterministic.

Exercise 2: Bank Account with Mutex

Simulate a bank account with a starting balance of 10,000. Create two threads: one deposits 1,000 one thousand times; the other withdraws 500 one thousand times. Use a mutex to protect the balance. The final balance should be exactly 10000 + (1000×1000) − (500×1000) = 510,000. Run the program multiple times to confirm the result is always correct.

Exercise 3: Parallel Merge Sort

Implement a parallel merge sort that splits the array into two halves, recursively sorts each half in a separate thread, then merges. Measure and compare the runtime against a single-threaded merge sort for arrays of size 100,000 and 1,000,000. Think about when it makes sense to stop spawning new threads and fall back to sequential sort for small sub-arrays.