Lock-Free SPSC Ring Buffer in C++
Designing ultra-low latency systems means eliminating locks, syscalls, and unpredictable pauses. In high-frequency trading or real-time engines, even microseconds matter. This article explains how to build a Lock-Free Single Producer Single Consumer (SPSC) Ring Buffer using C++ atomics and memory ordering.
🚀 Problem Statement
We need fast communication between two threads without mutex overhead. Mutex causes:
- context switching
- kernel involvement
- latency spikes
Solution → Lock-free ring buffer
🧠 Architecture
Producer ---> [ Ring Buffer ] ---> Consumer
Only:
- Producer writes tail
- Consumer writes head
This is called the Single Writer Principle → zero write contention → naturally lock-free.
📦 Data Structure
template<typename T, size_t N>
struct RingBuffer {
T buffer[N];
std::atomic<size_t> head;
std::atomic<size_t> tail;
};
Explain each:
- buffer → Stores elements
- head → Index to read
- tail → Index to write
⚙️ Producer (push)
Producer steps:
- Load head (acquire)
- Check if buffer full
- Write element
- Update tail (release)
Why? Before updating tail → data must be fully written. Release ensures consumer sees correct memory.
bool push(const T& item) {
size_t t = tail.load(std::memory_order_relaxed);
size_t next = (t + 1) % N;
if (next == head.load(std::memory_order_acquire))
return false; // full
buffer[t] = item;
tail.store(next, std::memory_order_release);
return true;
}
⚙️ Consumer (pop)
Consumer steps:
- Load tail (acquire)
- If empty → return
- Read data
- Update head (release)
bool pop(T& item) {
size_t h = head.load(std::memory_order_relaxed);
if (h == tail.load(std::memory_order_acquire))
return false; // empty
item = buffer[h];
head.store((h + 1) % N, std::memory_order_release);
return true;
}
🧩 Why Acquire–Release?
Release → Ensures all writes before store become visible
Acquire → Ensures all reads after load see latest data
Creates happens-before relationship between producer and consumer.
🔥 Why No Mutex?
Mutex → kernel involvement, context switch, 1000ns+, unpredictable
Atomics → CPU instruction only, few nanoseconds
Lock-free gives deterministic latency, high throughput → ideal for HFT
🔥 Why Fast?
- No locks → no syscalls
- Contiguous memory → cache friendly
- No heap → preallocated
- No false sharing → alignas(64)
🧠 Common Follow-ups
- Q: Why only SPSC?
A: MPMC needs CAS → slower + ABA issues - Q: Why modulo slow?
A: Use power-of-two + bitmask → index & (N-1) - Q: What if producer faster?
A: Buffer full → drop or spin - Q: What about false sharing?
A: Pad head and tail
📝 Perfect Interview Answer
I implemented a lock-free SPSC ring buffer to achieve ultra-low latency inter-thread communication. Since only one thread writes head and one writes tail, there’s no write contention. I used std::atomic with acquire-release ordering to guarantee visibility without locks. The buffer is preallocated and cache-aligned to avoid heap allocation and false sharing. This eliminates syscalls and context switching, allowing us to achieve 10–20M messages/sec with sub-microsecond latency.
🔥 Problem Setup (Example)
Assume:
- Buffer size = 4
- Indexes: 0 1 2 3
Initial state:
buffer: [ _ _ _ _ ] head = 0 tail = 0
Meaning: head == tail → EMPTY
🎯 Scenario Simulation
- Producer pushes A → tail moves to 1 → buffer: [ A _ _ _ ]
- Producer pushes B → tail = 2 → buffer: [ A B _ _ ]
- Producer pushes C → tail = 3 → buffer: [ A B C _ ]
- Consumer pops first item → head = 1 → reads A → buffer stays [ A B C _ ]
- Consumer pops second item → head = 2 → reads B → buffer: [ A B C _ ]
- Producer pushes D → tail wraps to 0 → buffer: [ A B C D ]
- Producer pushes E → tail = 1 → buffer: [ E B C D ]
- Consumer pops all → head = 1 → reads C,D,E → final state head=tail=1 → EMPTY
Works without locks because producer only modifies tail and consumer only modifies head. Only reads other's pointer → safe.