How to Optimise C# (S1E3): Concurrency: Busy-Wait vs Semaphore
Continuing on from the hot/cold cache series (S1E1, S1E2). A concurrency sidestep this time.
Someone on Stack Overflow asked how to wait for a background thread to finish before posting a message into a multi-threaded dictionary. Their working solution was a busy-wait: a while (!ready) { /* spin */ } loop that pegs a CPU core until the flag flips.
It works. It also throws away the entire reason you’re in a multi-threaded program in the first place.
Let me contrast it with what I did in Frontier, an actor-model framework I built a couple of years ago for a project of mine. Every actor (“foundry”) has an inbox, a message loop, and a signal. When the inbox is empty the actor sleeps; when a message arrives the signal wakes exactly one sleeping actor. The signal is a SemaphoreSlim.
On the surface that’s just “semaphore replaces busy-wait”, the standard textbook advice. But the reason it matters for an actor system is deeper than “spinning is wasteful”. It’s the thing that makes the whole architecture scale.
The busy-wait
The shape of the Stack Overflow answer, slightly cleaned up:
```csharp
// Don't do this.
MessageEnvelope message;
while (!_inbox.TryDequeue(out message))
{
    // maybe a Thread.Yield() or Thread.Sleep(0) to be polite
}
ProcessMessage(message);
```
The polite variants soften the blow a bit. Thread.Yield() offers the CPU slice to any other thread that's ready to run; Thread.Sleep(0) is similar; Thread.Sleep(1) actually sleeps, and on Windows that sleep is roughly 15 ms, not 1 ms (the default system timer resolution is ~15.6 ms). But every variant carries the same three costs:
- CPU burn. The thread runs forever. On a laptop that’s a warm lap and a flat battery. On a server it’s another core you can’t use for real work.
- Cache-line thrashing. The tight loop reads the queue’s head pointer on every iteration; the producer thread that wants to push a message has to fight a cache line that your waiter is perpetually pulling back into its own L1. Producer’s writes stall on the coherency protocol. Both sides get slower, which brings us neatly back to S1E1.
- Thread-per-actor. This is the one that bites later. If each actor in your system spins on its own while (!ready) loop, every actor you instantiate needs its own OS thread. An OS thread on .NET/x64 costs you ~1 MB of stack reservation plus kernel scheduling overhead. At a few hundred actors you're fine. At ten thousand (a rounding error in an actor model) you're out of memory and your scheduler is drowning in context switches.
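To make the thread-per-actor cost concrete, here is a back-of-the-envelope sketch. It only multiplies out the ~1 MB default stack reservation quoted above; actual committed memory varies by workload and settings.

```csharp
using System;

// Stack reservation alone for one dedicated OS thread per spinning actor,
// assuming the ~1 MB default reservation on .NET/x64.
const long StackBytesPerThread = 1L * 1024 * 1024;

long ReservedBytes(int actorCount) => actorCount * StackBytesPerThread;

Console.WriteLine($"{ReservedBytes(300) / (1024.0 * 1024):F0} MB");          // a few hundred actors: fine
Console.WriteLine($"{ReservedBytes(10_000) / (1024.0 * 1024 * 1024):F1} GB"); // ten thousand: trouble
```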
The third cost is the one nobody warns you about when you’re writing your first threading code.
The semaphore
The Frontier actor’s message loop, trimmed:
```csharp
private Task _runAsync()
{
    return Task.Run(async () =>
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            await _messageSignal.WaitAsync(cancellationToken);
            while (_inbox.TryDequeue(out var envelope))
            {
                await OnReceive(envelope, cancellationToken);
            }
        }
    });
}
```
_messageSignal is a SemaphoreSlim initialised to 0. The producer side, Post(), looks like this:
```csharp
public void Post(MessageEnvelope env)
{
    _inbox.Enqueue(env);
    _messageSignal.Release(); // "one permit available"
}
```
When the actor’s message loop hits await _messageSignal.WaitAsync(), one of three things happens:
- If the semaphore's count is zero (no messages waiting), the task yields its thread. Not just the CPU: the Task.Run doesn't actually pin an OS thread for the lifetime of the actor. It yields back to the .NET thread pool. The thread is now free to service any other ready task.
- If the count is positive, the await completes synchronously and the loop drains the inbox.
- On cancellation, WaitAsync throws OperationCanceledException and the loop breaks out cleanly.
When Release() is called by another actor’s Post(), the scheduler picks a thread from the pool and resumes the awaiting task on it.
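Stitched together, the two halves above make a self-contained toy you can run. The name ToyActor and the string payload are mine, not Frontier's; this is a sketch of the same Post/WaitAsync handshake, not the real framework code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Drive the toy actor: post two messages, then cancel the loop.
var cts = new CancellationTokenSource();
var actor = new ToyActor();
var loop = actor.RunAsync(cts.Token);

actor.Post("hello");
actor.Post("world");

await Task.Delay(200); // give the loop a moment to drain
cts.Cancel();
try { await loop; }
catch (OperationCanceledException) { } // expected: WaitAsync observed the cancel

// Minimal stand-in for a Frontier-style actor: a queue, a signal, a drain loop.
class ToyActor
{
    private readonly ConcurrentQueue<string> _inbox = new();
    private readonly SemaphoreSlim _messageSignal = new(0);

    public void Post(string message)
    {
        _inbox.Enqueue(message);
        _messageSignal.Release(); // "one permit available"
    }

    public Task RunAsync(CancellationToken ct) => Task.Run(async () =>
    {
        while (!ct.IsCancellationRequested)
        {
            await _messageSignal.WaitAsync(ct); // parks the task, frees the thread
            while (_inbox.TryDequeue(out var message))
                Console.WriteLine($"processed: {message}");
        }
    });
}
```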
Why this changes the math
With the busy-wait, one actor = one dedicated OS thread. With the semaphore + await, N actors share M threads, where M is whatever the thread pool decides, typically cores × a small constant. An actor that’s waiting for a message isn’t on any thread at all. It’s a continuation parked in a queue somewhere, costing the price of a closure object plus a few machine words of state.
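You can observe "not on any thread at all" directly: park tens of thousands of waiters and watch the pool stay small. A sketch; the exact ThreadPool.ThreadCount value varies by machine, but it stays near the core count rather than growing with the waiter count.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// 20,000 "actors", each parked on its own semaphore. If each needed a
// dedicated OS thread, that alone would be ~20 GB of stack reservation.
var signals = new SemaphoreSlim[20_000];
var waiters = new Task[signals.Length];

for (int i = 0; i < signals.Length; i++)
{
    signals[i] = new SemaphoreSlim(0);
    waiters[i] = signals[i].WaitAsync(); // a parked continuation, not a thread
}

Console.WriteLine($"parked waiters: {waiters.Length}");
Console.WriteLine($"pool threads:   {ThreadPool.ThreadCount}"); // stays near core count

foreach (var s in signals) s.Release(); // wake everyone
await Task.WhenAll(waiters);
Console.WriteLine("all resumed");
```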
In Frontier I can spin up tens of thousands of foundries on a single box without sweating. Every one is a fully independent actor with its own state and its own inbox. At any moment, at most Environment.ProcessorCount of them are actually running. The rest are sleeping for free.
This is the same architectural property that lets Erlang processes, Akka actors, and Orleans grains scale to millions. None of them are built on busy-waits. None of them could be.
When busy-waiting is actually right
Semaphores aren’t free either. The minimum latency between Release() and the awaiting task resuming is on the order of low microseconds: the cost of a thread-pool wake-up, a kernel round-trip on some platforms, and the task machinery. If you need sub-microsecond reaction time (a lock-free ring buffer between an audio thread and a mixer, a frame-boundary fence in a renderer, an atomic counter flipped every few hundred nanoseconds), busy-waiting wins, often with a short Thread.SpinWait.
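For completeness, the spin side done properly uses the SpinWait struct, which spins briefly and then starts yielding on its own if the wait drags on. A sketch with a hypothetical ready flag standing in for whatever the producer publishes:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

int ready = 0; // hypothetical flag the producer flips when the value is published

var producer = Task.Run(() =>
{
    // ... produce the value, then publish with a volatile write ...
    Volatile.Write(ref ready, 1);
});

// Busy-wait, but polite about it: SpinWait escalates from pure spinning
// to Thread.Yield()/Sleep(0)/Sleep(1) if the flag takes too long to flip.
var spinner = new SpinWait();
while (Volatile.Read(ref ready) == 0)
    spinner.SpinOnce();

Console.WriteLine("flag observed");
await producer;
```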
The distinguishing question is always the same: do I expect the thing I’m waiting for to arrive within a few microseconds, every time? If yes, spin. If the wait could be milliseconds, seconds, or indefinite, sleep.
For message-passing actors, the wait is almost always indefinite. Sleep.
Takeaway
Busy-waiting is a sensible-looking solution to a concurrency question until you think about what your program looks like at scale. Semaphores are the part of the standard library that makes actor models actually work, not because they’re elegant (they are), but because they decouple number of actors from number of threads.
If you’re building anything that wants to have more units of concurrent work than you have cores, you want the sleeping behaviour. The rest is plumbing.