How to Optimise C# (S1E2): Memory Access: Nested Loops
Continuing on from S1E1, which covered memory access stalls and the optimisation of hot and cold data, the same issue shows up with nested loops: traverse a data structure larger than the cache in the wrong order and you end up re-fetching cache lines you already had and then evicted.
Episode 2: Nested Loops
Take the following code, which holds height data for a 2D map:
class MyMap
{
    double[,] altitude;

    MyMap()
    {
        altitude = new double[8000, 8000];
        // Initialise altitude data here...
    }

    void Update()
    {
        for (int z = 0; z < 8000; z++)
        {
            for (int x = 0; x < 8000; x++)
            {
                altitude[x, z]++;
            }
        }
    }
}
The data structure is large: 8,000 × 8,000 × 8 bytes per double = 512 MB, far bigger than any CPU cache. As mentioned in S1E1, the further from the CPU the memory is, the longer it takes to retrieve the 64-byte cache line that holds our variable. In this example, 8 doubles fit into a single cache line.
So what goes wrong? C# multi-dimensional arrays T[,] are laid out in row-major order: in altitude[x, z], the second subscript (z) is the one that’s contiguous in memory. Incrementing z by 1 moves forward one double; incrementing x by 1 moves forward 8,000 doubles — one whole row. In the loop above the inner loop varies x, so every iteration jumps forward 8,000 doubles.
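To make that concrete, here is the same indexing written against a flat 1D array, a sketch of the arithmetic the runtime does for a T[,] (the names Width, flat and Get are mine, not part of the example):

// A double[8000, 8000] is one contiguous block; element [x, z] lives at
// flat index x * 8000 + z. So z + 1 moves 8 bytes forward, while
// x + 1 moves 8000 * 8 = 64,000 bytes forward.
const int Width = 8000;
double[] flat = new double[Width * Width];
double Get(int x, int z) => flat[x * Width + z];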
Each pass of the inner loop touches one 64-byte cache line in every one of the 8,000 rows, so about 8,000 × 64 B = 512 KB of cache lines are in flight per pass. That is more than ten times the typical L1 (32-48 KB on a modern desktop core) and overflows most L2s too, so by the time we loop back round to the adjacent double in the same cache line (at z+1), that line has long since been evicted. We stall waiting for each one to be re-fetched.
Worst case: in every cache line of 8 doubles we pay 1 access from main memory (~100 ns) plus 7 accesses which now come from L3 rather than L1 (20-40 ns each) — call it ~300 ns per 8-double cache line. 64M doubles ÷ 8 = 8M cache lines × 300 ns = ~2.4 seconds in stalls alone. Catastrophic.
As in S1E1, the figures here are theoretical upper bounds: miss-rate × miss-latency, ignoring the hardware prefetcher. A 64 KB stride is also close to the worst case for prefetching, because every access lands on a different 4 KB page and hardware prefetchers generally stop at page boundaries, so the real numbers sit closer to this upper bound than they do in the sequential case. Still: measure, don't quote.
Now suppose the map were 4,000 × 4,000 instead. One pass of the inner loop would then touch 4,000 cache lines, 256 KB, which fits in a typical L2; the other 7 accesses in each cache line would come back at L2 speed (a few ns) rather than 20-40 ns from L3. Shrinking the data structure is a genuine optimisation if you're programming for specific hardware; getting the entire thing into L1 is the dream but usually impossible.
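If halving the map isn't an option, a smaller element type shrinks the structure the same way; this is my suggestion rather than anything from the original example, and it trades precision for footprint:

// float halves the structure to 256 MB and packs 16 elements per
// 64-byte cache line instead of 8, halving the lines we must fetch.
float[,] altitude = new float[8000, 8000];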
More often you can’t shrink the data. The simpler fix is to loop in the right order so each cache line is loaded from memory once, consumed fully, and then replaced — rather than loaded, partially consumed, evicted, and loaded again:
class MyMap
{
    double[,] altitude;

    MyMap()
    {
        altitude = new double[8000, 8000];
        // Initialise altitude data here...
    }

    void Update()
    {
        for (int x = 0; x < 8000; x++)
        {
            for (int z = 0; z < 8000; z++)
            {
                altitude[x, z]++;
            }
        }
    }
}
The only change is the loop order: z is now the inner loop. Because z is the second subscript in altitude[x,z], the contiguous one, each iteration moves forward exactly one double rather than one row. We're walking memory sequentially. Worst case drops from ~300 ns per cache line to ~100 ns (one miss, then 7 cheap L1 hits), and the hardware prefetcher will typically hide most of that remaining miss too.
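To see the gap on your own hardware rather than trusting the arithmetic, here is a minimal timing sketch using System.Diagnostics.Stopwatch (the harness and names are mine; for rigorous numbers reach for a proper benchmarking library such as BenchmarkDotNet):

using System;
using System.Diagnostics;

class LoopOrderTiming
{
    const int N = 8000;
    static readonly double[,] altitude = new double[N, N];

    static void Main()
    {
        // Strided: inner loop varies x, jumping one whole row per iteration.
        Time("x inner (strided)", () =>
        {
            for (int z = 0; z < N; z++)
                for (int x = 0; x < N; x++)
                    altitude[x, z]++;
        });

        // Sequential: inner loop varies z, walking memory one double at a time.
        Time("z inner (sequential)", () =>
        {
            for (int x = 0; x < N; x++)
                for (int z = 0; z < N; z++)
                    altitude[x, z]++;
        });
    }

    static void Time(string label, Action body)
    {
        body(); // warm-up: JIT the delegate and page the array in
        var sw = Stopwatch.StartNew();
        body();
        Console.WriteLine($"{label}: {sw.ElapsedMilliseconds} ms");
    }
}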
When loop interchange isn’t enough
Loop interchange is the first lever for 2D access and it’s sufficient here because the operation (altitude[x,z]++) gives the same result in either loop order. When the algorithm genuinely needs access in both dimensions — matrix transpose, 2D stencil operations, anything where output[i,j] depends on inputs around [i, j] in both axes — a single fixed loop order will always thrash the other dimension’s cache lines.
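For instance, a naive transpose sketch (hypothetical code, not from this series): with z as the inner loop the reads of src are sequential, but the writes to dst stride one whole row; swap the loops and the strided accesses simply move to the reads.

void TransposeNaive(double[,] src, double[,] dst)
{
    for (int x = 0; x < 8000; x++)
    {
        for (int z = 0; z < 8000; z++)
        {
            // Sequential read of src, stride-8000 write to dst; with the
            // loops swapped it is the other way round. One side always thrashes.
            dst[z, x] = src[x, z];
        }
    }
}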
The standard next technique is block (or tile) traversal: process the data in sub-blocks sized to fit comfortably in L1/L2, so each block’s cache lines are loaded once and fully consumed before you move on.
const int blockSize = 64; // tune to cache size

for (int xBlock = 0; xBlock < 8000; xBlock += blockSize)
{
    for (int zBlock = 0; zBlock < 8000; zBlock += blockSize)
    {
        for (int x = xBlock; x < xBlock + blockSize; x++)
        {
            for (int z = zBlock; z < zBlock + blockSize; z++)
            {
                // ...work on altitude[x, z] and any neighbours you need...
            }
        }
    }
}
This pattern is called cache blocking or tiling; the recursive variant that subdivides until tiles fit any cache, with no tuned block size at all, is cache-oblivious (see Frigo et al., "Cache-Oblivious Algorithms", 1999). In everyday game code loop interchange is usually enough; blocking is the tool you reach for when both dimensions matter.
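For completeness, here is a sketch of the cache-oblivious idea applied to transpose; this is my illustration of the recursive scheme, not code from the paper. Halve the larger dimension until the tile is trivially small, and somewhere down the recursion the tiles fit each level of the cache hierarchy, with no tuned blockSize anywhere:

void Transpose(double[,] src, double[,] dst, int x0, int x1, int z0, int z1)
{
    if ((long)(x1 - x0) * (z1 - z0) <= 1024)
    {
        // Base case: the tile is small enough to sit in cache; do it directly.
        for (int x = x0; x < x1; x++)
            for (int z = z0; z < z1; z++)
                dst[z, x] = src[x, z];
    }
    else if (x1 - x0 >= z1 - z0)
    {
        // Split the wider dimension in half and recurse on each half.
        int mid = (x0 + x1) / 2;
        Transpose(src, dst, x0, mid, z0, z1);
        Transpose(src, dst, mid, x1, z0, z1);
    }
    else
    {
        int mid = (z0 + z1) / 2;
        Transpose(src, dst, x0, x1, z0, mid);
        Transpose(src, dst, x0, x1, mid, z1);
    }
}

// Usage: Transpose(src, dst, 0, 8000, 0, 8000);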