$ cat /post/how-to-optimise-csharp-for-memory-access.md

How to Optimise C# (S1E1): Memory Access: Hot & Cold Data

Not all memory is equal.

It’s a beginner’s mistake to think that all memory access takes an equal amount of time. Many managed languages abstract memory layout away so the programmer can focus on functionality, but in doing so C# can hide some sizable performance cliffs.

Each cache level sits between the CPU and main memory; a memory access walks down through them, and the stall time grows the further out it has to go. Rough numbers — these shift between generations, treat them as orders of magnitude:

Memory type     Latency                 Typical desktop core    Apple A11 (iPhone 8)
L1 cache hit    ~1-2 ns                 32-48 KB                64 KB D + 64 KB I
L2 cache hit    ~3-5 ns                 512 KB - 1 MB           ~128 KB per core
L3 cache hit    ~15-40 ns               4-32 MB shared          (system cache varies)
Main memory     ~80-120 ns              -                       -
Disk seek       ~10 ms (SSD: ~100 µs)   -                       -

Each cache is usually split into a data cache and an instruction cache; the A11 above lists both L1 halves.

So what’s happening?

When a memory location is accessed, the line containing it — typically 64 bytes — is moved up through the cache hierarchy. If it’s found in L3, the 64-byte line around the requested byte is copied into L2 and L1. If your next read is the adjacent byte, it comes from L1 and is essentially free. Eventually the cache fills up and the oldest line gets evicted when a new one is loaded.

Modern CPUs also run hardware prefetchers that watch for sequential access patterns and speculatively pull the next one or two cache lines into L1 before you ask for them. For clean sequential reads (the kind we’re about to optimise towards) the prefetcher does a lot of the work — the worst-case numbers below are exactly that, worst case, not typical.

Given how expensive it is to retrieve a cache line from main memory, it pays to minimise jumping around memory randomly. Typically that means batching work on data that’s currently hot, and laying data out so related fields sit together.
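
To make that concrete, here's a minimal console-app sketch (the names and sizes are mine, not from any library) that sums the same array twice, once in order and once through a shuffled index table, so both passes do identical work apart from the access pattern. On a typical desktop the random pass runs several times slower, purely from cache misses:

using System;
using System.Diagnostics;

class CacheDemo {
    static void Main() {
        const int N = 1 << 24;                  // 16M ints = 64 MB, far larger than L3
        int[] data = new int[N];
        int[] seq  = new int[N];                // identity order: 0, 1, 2, ...
        int[] rnd  = new int[N];                // same indices, shuffled

        var rng = new Random(12345);
        for (int i = 0; i < N; i++) seq[i] = rnd[i] = i;
        for (int i = N - 1; i > 0; i--) {       // Fisher-Yates shuffle
            int j = rng.Next(i + 1);
            (rnd[i], rnd[j]) = (rnd[j], rnd[i]);
        }

        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++) sum += data[seq[i]];   // sequential: the prefetcher keeps up
        Console.WriteLine($"sequential: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int i = 0; i < N; i++) sum += data[rnd[i]];   // random: most reads go to main memory
        Console.WriteLine($"random:     {sw.ElapsedMilliseconds} ms (sum={sum})");
    }
}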

In this series I’ll outline a few simple techniques I’ve used in published Unity games to keep the CPU fed.

Episode 1: Hot and Cold Data

The L1 and L2 caches are small, and it’s easy to fill them with data you’re not actually using. Take the following:

struct Lukewarm {
    public int myHotData;
    public int myColdData1;
    public int myColdData2;
    public int myColdData3;
}

class MyManager {
    Lukewarm[] lukewarm;

    MyManager() {
        lukewarm = new Lukewarm[100000];
        //Initialise lukewarm data here...
    }

    void Update() {
        for (int i = 0; i < 100000; i++) {
            lukewarm[i].myHotData++;
        }
    }
}

Note that Lukewarm is a struct, not a class; this matters. Arrays of structs are stored inline: new Lukewarm[100000] allocates one contiguous 1.6 MB block in which each 64-byte cache line holds 4 instances. Arrays of classes are arrays of references to individually heap-allocated objects; lukewarm[i].myHotData would need a pointer dereference into memory that could be anywhere on the heap, and the cache-line density argument below simply doesn't apply. For hot-path code, use struct (and consider decorating it with [System.Runtime.InteropServices.StructLayout(LayoutKind.Sequential)] if you want to pin the field order).
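
If you'd rather check the 16-byte figure than take it on faith, Unsafe.SizeOf<T> reports the managed size of a struct (on older runtimes it needs the System.Runtime.CompilerServices.Unsafe NuGet package). A quick sketch, using the Lukewarm struct from the listing above:

using System;
using System.Runtime.CompilerServices;

class SizeCheck {
    static void Main() {
        // Lukewarm as defined above: four 4-byte ints = 16 bytes,
        // so 64 / 16 = 4 instances per cache line.
        int size = Unsafe.SizeOf<Lukewarm>();
        Console.WriteLine($"sizeof(Lukewarm) = {size} bytes");          // 16
        Console.WriteLine($"instances per 64-byte line = {64 / size}"); // 4
    }
}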

Imagine that the cold fields aren’t accessed often, but MyManager.Update() runs 60 times a second — so each myHotData is touched 60 times a second. Each Lukewarm instance is 16 bytes, so a 64-byte cache line holds 4 instances. That’s 1 cache miss for every 4 accesses — 25,000 cache misses per pass across 100,000 instances, at a worst-case ~100 ns+ each. Run that at 60 Hz and you’ve burned ~2.5 ms of your 16.67 ms frame budget just stalling the CPU.

Every speedup figure in this post is back-of-envelope worst-case: miss-rate × miss-latency, ignoring hardware prefetchers, IL2CPP/Mono codegen differences, and what else is in the same frame. Treat them as theoretical upper bounds, not measurements. If you’re making a decision that depends on real numbers, capture them in the Unity Profiler or BenchmarkDotNet.

The fix is to split the record into hot and cold parts — what perf people call an AoS → SoA transform (Array-of-Structs → Struct-of-Arrays). Strictly, what follows is a two-AoS split rather than full SoA, but the density win is the same:

struct HotData {
    public int myHotData;
}

struct ColdData {
    public int myColdData1;
    public int myColdData2;
    public int myColdData3;
}

class MyManager {
    HotData[] hotData;
    ColdData[] coldData;

    MyManager() {
        hotData = new HotData[100000];
        coldData = new ColdData[100000];
        //Initialise hotData and coldData here...
    }

    void Update() {
        for (int i = 0; i < 100000; i++) {
            hotData[i].myHotData++;
        }
    }
}

All we’ve done is separate frequently-accessed hot data from rarely-accessed cold data. Each cache line loaded into L1 now holds 16 instances of hot data rather than 4 — so 1 miss per 16 accesses, or ~6,250 cache misses per pass over 100,000 instances. Theoretical cost drops from ~2.5 ms to ~0.625 ms per frame.
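
If you want to put real numbers on this on your own hardware (as the caveat above urges), a minimal BenchmarkDotNet harness could look like the following. The class and method names are mine, and since both arrays stay warm in cache between iterations, expect a smaller gap than the worst-case arithmetic suggests:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class HotColdBench {
    const int N = 100000;
    Lukewarm[] lukewarm = new Lukewarm[N];  // interleaved layout from the first listing
    HotData[] hot = new HotData[N];         // split layout from the second

    [Benchmark(Baseline = true)]
    public void InterleavedPass() {
        for (int i = 0; i < N; i++) lukewarm[i].myHotData++;
    }

    [Benchmark]
    public void SplitPass() {
        for (int i = 0; i < N; i++) hot[i].myHotData++;
    }
}

class Program {
    static void Main() => BenchmarkRunner.Run<HotColdBench>();
}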

The bigger win is cross-frame: hotData in total is only 400 KB, which fits comfortably in most desktop L2/L3. If nothing else evicts it in between, the next frame’s pass will find most of it still cached, and the per-frame cost drops further.

A full SoA version of the same transform would be int[] hot; int[] cold1; int[] cold2; int[] cold3; — same density win, no wrapper struct.
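
Spelled out (with a hypothetical manager name), that looks like:

class MyManagerSoA {
    int[] hot   = new int[100000];
    int[] cold1 = new int[100000];
    int[] cold2 = new int[100000];
    int[] cold3 = new int[100000];

    void Update() {
        // Same 16-values-per-line density as the split above, one array per field.
        for (int i = 0; i < 100000; i++) {
            hot[i]++;
        }
    }
}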

In Unity specifically

For Unity code, a few things make the hot/cold split cleaner still:

  • NativeArray<T> from Unity.Collections enforces struct-only storage, lives outside the managed heap (no GC churn), and is the expected input for Jobs.
  • [StructLayout(LayoutKind.Sequential)] pins a struct's field order when layout matters. Structs already default to sequential layout, but the attribute makes the intent explicit; classes default to auto layout, which the runtime may reorder for alignment.
  • Burst compiles Jobs over NativeArray<T> through LLVM; tight loops like the one above frequently get vectorised (see the sketch after this list).
  • ECS / DOTS (Data-Oriented Tech Stack) — in preview when I first wrote this — takes the hot/cold split as its core design pattern: components are naturally small structs, archetypes group related components contiguously, and iteration is chunk-by-chunk over contiguous memory. If you’re starting a new perf-critical system today, look at ECS before hand-rolling the transform above.
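
Putting the first three bullets together, here's a minimal sketch of the hot loop as a Burst-compiled parallel job. IncrementHotJob and HotDataManager are my names; the attributes and types are the standard Unity Jobs API, so treat this as a shape rather than drop-in code:

using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

[BurstCompile]
struct IncrementHotJob : IJobParallelFor {
    public NativeArray<int> hot;
    public void Execute(int i) => hot[i]++;
}

class HotDataManager : MonoBehaviour {
    NativeArray<int> hot;

    void Awake() {
        // Unmanaged storage: no GC pressure, and Burst/Jobs can consume it directly.
        hot = new NativeArray<int>(100000, Allocator.Persistent);
    }

    void OnDestroy() {
        hot.Dispose();      // NativeArray is not garbage collected; free it explicitly
    }

    void Update() {
        // Schedule in batches of 64 indices; Complete() immediately here for simplicity.
        new IncrementHotJob { hot = hot }.Schedule(hot.Length, 64).Complete();
    }
}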

In practice

In MineSprint (iOS) I had a lot more cold data per instance than hot data, so the real-world gain from splitting was much larger than the 4× density improvement in the toy example.

Next time: Optimising nested loops for memory access

$ env | grep ^CAT
CAT_01=programming
CAT_02=optimisation
$ env | grep ^TAG
TAG_01=C#
TAG_02=unity
TAG_03=L2 Cache
TAG_04=Memory
TAG_05=hot & cold data