#### CS 250B: Modern Computer Systems

#### **Cache-Efficient Algorithms**



Sang-Woo Jun



# Back To The Matrix Multiplication Example

#### Blocked matrix multiplication recap

- C1 sub-matrix = A1×B11 + A1×B21 + A1×B31 ...
- $\circ~$  Intuition: One full read of  $B^{\scriptscriptstyle T}$  per S rows in A. Repeated N/S times
- $\Box$  Best performance when S<sup>2</sup> ~= Cache size
  - Machine-dependent magic number!



# Back To The Matrix Multiplication Example

 $\Box$  For sub-block size S × S -> N \* N \* (N/S) reads. What S do we use?

- Optimized for L1? (32 KiB for me, who knows for who else?)
- If S\*S exceeds cache, we lose performance
- If S\*S is too small, we lose performance
- Do we ignore the rest of the cache hierarchy?
  - $\,\circ\,\,$  Say S optimized for L3,
    - S × S multiplication is further divided into T×T blocks for L2 cache
  - $\circ~$  T  $\times$  T multiplication is further divided into U×U blocks for L1 cache

0 ...

# Solution: Cache Oblivious Algorithms

□ No explicit knowledge of cache architecture/structure

- $\circ~$  Except that one exists, and is hierarchical
- Also, "tall cache assumption", which is natural
- □ Still (mostly) cache optimal
- □ Typically recursive, divide-and-conquer







# Aside: Even More Important With Storage/Network

- □ Latency difference becomes even larger
  - Cache: ~5 ns
  - DRAM: 100+ ns
  - Network: 10,000+ ns
  - Storage: 100,000+ ns
- □ Access granularity also becomes larger
  - Cache/DRAM: Cache lines (64 B)
  - Storage: Pages (4 KB 16 KB)

Also see: "Latency Numbers Every Programmer Should Know" <u>https://people.eecs.berkeley.edu/~rcs/research/interactive\_latency.html</u>

# **Applications of Interest**

- □ Matrix multiplication
- Merge Sort
- □ Stencil Computation
- Trees And Search

#### **Cache Optimized Matrix Multiplication**

□ How to make sure we use an optimal S, for all cache levels?



#### **Recursive Matrix Multiplication**



8 multiply-adds of  $(n/2) \times (n/2)$  matrices Recurse down until very small

# **Performance Analysis**

#### Generation Work:

- $\circ$  Recursion tree depth is  $\log_2(N)$ , each node fan-out is 8
- $\circ 8^{\log_2 N} = N^{\log_2 8} = N^3$
- o Same amount of work!
- Cache misses:
  - Recurse tree for cache access has depth log(N)-1/2(log(cM))
    - (Because we stop recursing at n<sup>2</sup> < cM for a small c)
  - So number of leaves =  $8^{\log N 1/2 \log cM} = N^{\log 8} \div cM^{1/2 \log 8} = N^3 / cM^{3/2}$
  - $\circ$  At leaf, we load cM/B cache lines
  - Total cache lines read =  $\theta(\frac{n^3}{BM^{1/2}})$  <- Optimal

Also, logN function call overhead is not high

#### Performance Oblivious to Cache Size



Steven G. Johnson, "Experiments with Cache-Oblivious Matrix Multiplication for 18.335," MIT Applied Math

#### Bonus: Cache-Oblivious Matrix Transpose

□ Also possible to define recursively



# **Applications of Interest**

- □ Matrix multiplication
- Trees And Search
- □ Merge Sort
- □ Stencil Computation

#### **Trees And Search**

□ Binary Search Trees are cache-ineffective

- $\circ$  e.g., Searching for 72 results in 3 cache line reads
- $\circ$  Not to mention trees in the heap!





Each traversal pretty much hits new cache line:  $\Theta(Log(N))$  cache lines read

### **Better Layout For Trees**

□ Tree can be organized into locally encoded sub-trees

- Much better cache characteristics!
- We want cache-obliviousness:
   How to choose the size of sub-tree?





# Recursive Tree Layout: van Emde Boas Layout



|  |  | ••• | ••• |  | ••• |
|--|--|-----|-----|--|-----|
|--|--|-----|-----|--|-----|

# Performance Evaluations Against Binary Tree



Brodal et.al., "Cache Oblivious Search Trees via Binary Trees of Small Height," SODA 02

# Performance Evaluations Against Binary Tree And B-Tree



\* *High1024:* 1024 elements per node, to make use of the whole cache line (B-Tree)

Question: How do we optimize N in HighN? Databases use N optimized for storage page

Note: Storage access not explicitly handled! Letting swap handle storage management

Figure 8: Beyond main memory

Brodal et.al., "Cache Oblivious Search Trees via Binary Trees of Small Height," SODA 02

### More on the van Emde Boas Tree

- □ Actually a tricky data structure to do inserts/deletions
  - $\circ~$  Tree needs to be balanced to be effective
  - $\circ~$  van Emde Boas trees with van Emde Boas trees as leaves?
- Good thing to have, in the back of your head!

# **Applications of Interest**

- □ Matrix multiplication
- **Trees And Search**
- □ Merge Sort
- □ Stencil Computation

# Merge Sort



Source: <u>https://imgur.com/gallery/voutF</u>, created by morolin

# Merge Sort Cache Effects

Depth-first binary merge sort is relatively cache efficient

- Log(N) full accesses on data, for blocks larger than M
- $\circ$  n × log( $\frac{n}{M}$ )
- □ Binary merge sort of higher fan-in (say, R) is more cache-efficient
  - Using a tournament of mergers!
  - $\circ$  n × log<sub>R</sub>( $\frac{n}{M}$ )
- Cache obliviousness: how to choose R?
  - Too large R spills merge out of cache -> Thrash -> Performance loss!



# Lazy K-Merger

- □ Again, recursive definition of mergers!
- □ Each sub-merger has k<sup>3</sup> element output buffer
- $\Box$  Second level has  $\sqrt{k} + 1$  sub-mergers
  - $\circ \sqrt{k}$  sub-mergers feeding into 1 sub-merger
  - $\circ$  Each sub-merger has  $\sqrt{k}$  inputs
  - $\circ k^{3/2}$ -element buffer per bottom sub-merger
  - Recurses until very small fan-in (two?)



# Lazy K-Merger

Procedure Fill(v):

while v's output buffer is not full
if left input buffer empty
Fill(left child of v)

if right input buffer empty
 Fill(right child of v)

perform one merge step

- Each k merger fits in k<sup>2</sup> space
- □ Ideal cache effects!
  - Proof too complex to show today...
- What should k be?
  - Given N elements,  $k = N^{(1/3)}$ "Funnelsort"



Source: Brodal et. al., "Engineering a Cache-Oblivious Sorting Algorithm," 2008

#### In-Memory Funnelsort Empirical Performance



Source: Brodal et. al., "Engineering a Cache-Oblivious Sorting Algorithm"

#### In-Memory Funnelsort Empirical Performance



Source: Brodal et. al., "Engineering a Cache-Oblivious Sorting Algorithm"

#### In-Storage Funnelsort Empirical Performance



# **Applications of Interest**

- □ Matrix multiplication
- **Trees And Search**
- Merge Sort
- □ Stencil Computation

#### **Stencil Computation**

**D** Example: Heat diffusion

o Uses parabolic partial differential equation to simulate heat diffusion

$$rac{\partial u}{\partial t} = lpha \left( rac{\partial^2 u}{\partial x^2} + rac{\partial^2 u}{\partial y^2} + rac{\partial^2 u}{\partial z^2} 
ight)$$



#### Heat Equation In Stencil Form

$$\Box \text{ Simplified model: 1-dimensional heat diffusion } \frac{\partial u}{\partial t} = \alpha \left(\frac{\partial^2 u}{\partial x^2}\right)$$

$$\frac{\partial u}{\partial t} = \lim_{\Delta t \to 0} \frac{u(x, t + \Delta t) - u(x, t)}{\Delta t}$$

$$\frac{\partial u}{\partial t} \approx \frac{u(x, t + \Delta t) - u(x, t)}{\Delta t}$$

$$\frac{\partial^2 u}{\partial t} \approx \frac{u(x, t + \Delta t) - u(x, t)}{\Delta t}$$

$$\frac{\partial^2 u}{\partial x^2} = \frac{\partial u_x}{\partial x}$$

$$\approx \frac{u(x + \Delta x, t) - u(x, t)}{\Delta x}$$

$$\approx \frac{u(x + \Delta x, t) - u(x, t)}{\Delta x}$$

$$\frac{u(x, t + \Delta t) - u(x, t)}{\Delta t} \approx k \frac{u(x + \Delta x, t) - 2u(x, t) + u(x - \Delta x, t)}{(\Delta x)^2} \quad \longleftrightarrow \quad = \frac{u(x + \Delta x, t) - 2u(x, t) + u(x - \Delta x, t)}{(\Delta x)^2}$$

$$u(x, t + \Delta t) \approx u(x, t) + \alpha \left[u(x + \Delta x, t) - 2u(x, t) + u(x - \Delta x, t)\right]$$

\

# A 3-point Stencil

$$u(x,t + \Delta t) \approx u(x,t) + \alpha \left[ u(x + \Delta x,t) - 2u(x,t) + u(x - \Delta x,t) \right]$$

 $\Box$  u(x, t +  $\Delta$ t) can be calculated using u(x, t), u(x +  $\Delta$ x, t), u(x -  $\Delta$ x, t)

- A "stencil" updates each position t using surrounding values as input
  - This is a 1D 3-point stencil
  - 2D 5 point, 2D 9 point, 3D 7 point,
     3D 25-point stencils popular
  - Popular for simulations, including fluid dynamics, solving linear equations and PDEs



### Some Important Stencils



[1] 19-point 3D Stencil for Lattice Boltzmann Method flow simulation



[2] 25-point 3D stencil for seismic wave propagation applications

[1] Peng, et. al., "High-Order Stencil Computations on Multicore Clusters"[2] Gentryx, Wikipedia

# Cache Behavior of Naïve Loops

Using the 1D 3-point stencil

- Unless x is small enough, there is no cache reuse
- Continuing the theme, can we recursively process data in a cacheoptimal way?



# Cache Efficient Processing: Trapezoid Units

Computation in a trapezoid is either:

- Self-contained, does not require anything from outside( \_\_\_\_\_), or
- $\circ$  Only uses data that has been computed and ready (



, after

### Recursion #1: Space Cut

#### $\Box \text{ If width } >= \text{height} \times 2$

 $\circ~$  Cut the trapezoid through the center using a line of slope -1

 $\circ$  Process left, then right



### Recursion #2: Time Cut

#### $\Box$ If width < height × 2

- $\circ~$  Cut the trapezoid with a horizontal line through the center
- $\circ~$  Process bottom, then top



# **Cache Analysis**

- □ Intuitively, trapezoids are split until they are of size M (cache size)
- **D**ata read =  $\Theta(NT/M)$ 
  - Cache lines read =  $\Theta(NT/MB)$
  - $\circ$  Good!

# Parallelism-Aware Cutting

Vanilla method not good for parallelism /
 Three splits have strict dependencies...

Space cuts can be made parallelism-friendly!

Bottom two first, top one next

Effects on parallel scalability

- $\circ$  Difference in impact of four cores
- o Why? DRAM bandwidth bottleneck!





1.93x

3.96x

| Code                 | Time    |   |
|----------------------|---------|---|
| Serial looping       | 128.95s | 1 |
| Parallel looping     | 66.97s  | S |
| Serial trapezoidal   | 66.76s  | ٦ |
| Parallel trapezoidal | 16.86s  | S |

Performance scaling with four cores Source: 2008-2018 by the MIT 6.172 Lecturers