

## CSCI 350 Ch. 9 – Caching and VM

Mark Redekopp Michael Shindler & Ramesh Govindan

#### USC Viterbi <sup>2</sup>

School of Engineering

# **Examples of Caching Used**

- What is caching?
  - Maintaining copies of information in locations that are faster to access than their primary home
- Examples
  - TLB
  - Data/instruction caches
  - Branch predictors
  - VM
  - Web browser
  - File I/O (disk cache)
  - Internet name resolutions

## **REVIEW OF DEFINITIONS & TERMS**



3

## What Makes a Cache Work

- What are the necessary conditions
  - Locations used to store cached data must be faster to access than original locations
  - Some reasonable amount of reuse
  - Access patterns must be somewhat predictable

## Memory Hierarchy & Caching

5

**ISC**Viterb

School of Engineering

 Use several levels of faster and faster memory to hide delay of upper levels





School of Engineering

## Hierarchy Access Time & Sizes

| Cache                            | Hit Cost    | Size   |
|----------------------------------|-------------|--------|
| 1st level cache/first level TLB  | 1 ns        | 64 KB  |
| 2nd level cache/second level TLB | 4 ns        | 256 KB |
| 3rd level cache                  | 12 ns       | 2 MB   |
| Memory (DRAM)                    | 100 ns      | 10 GB  |
| Data center memory (DRAM)        | $100\mu s$  | 100 TB |
| Local non-volatile memory        | $100 \mu s$ | 100 GB |
| Local disk                       | 10 ms       | 1 TB   |
| Data center disk                 | 10 ms       | 100 PB |
| Remote data center disk          | 200 ms      | 1 XB   |

# Principle of Locality

- Caches exploit the Principle of Locality
  - Explains why caching with a hierarchy of memories yields improvement gain
- Works in two dimensions
  - <u>Temporal Locality</u>: If an item is referenced, it will tend to be referenced again soon
    - Examples: Loops, repeatedly called subroutines, setting a variable and then reusing it many times
  - <u>Spatial Locality</u>: If an item is referenced, items whose addresses are nearby will tend to be referenced soon
    - Examples: Arrays and program code

## Cache Blocks/Lines

- Cache is broken into "blocks" or "lines"
  - Any time data is brought in, it will bring in the entire block of data
  - Blocks start on addresses multiples of their size



8

## Cache Blocks/Lines

- Whenever the processor generates a read or a write, it will first check the cache memory to see if it contains the desired data
  - If so, it can get the data quickly from cache
  - Otherwise, it must go to the slow main memory to get the data



9

## **Cache Definitions**

- **Cache Hit** = Desired data is in current level of cache
- **Cache Miss** = Desired data is not present in current level
- When a cache miss occurs, the new block is brought from the lower level into cache
  - If cache is full a block must be evicted
- When CPU writes to cache, we may use one of two policies:
  - Write Through (Store Through): Every write updates both current and next level of cache to keep them in sync. (i.e. coherent)
  - Write Back: Let the CPU keep writing to cache at fast rate, not updating the next level. Only copy the block back to the next level when it needs to be replaced or flushed

## Write Back Cache

- On write-hit
  - Update only cached copy
  - Processor can continue quickly
  - Later when block is evicted, entire block is written back (because bookkeeping is kept on a per block basis)



11

# Write Through Cache

- On write-hit
  - Update both levels of hierarchy
  - Depending on hardware implementation, lower-level may have to wait for write to complete to lower level
  - Later when block is evicted, no writeback is needed



12

# Write-through vs. Writeback

13

- Write-through
  - Pros
    - Avoid coherency issues between levels (need for eviction)
  - Cons
    - Poor performance if next level of hierarchy is slow (VM page fault to disk) or if many, repeated accesses
- Writeback
  - Pros
    - Fast if many repeated accesses
  - Cons
    - Coherency issues
    - Slow if few, isolated writes since entire block must be written back

# Principle of Inclusion

14

- When the cache at level j misses on data that is store in level k (j < k), the data is brought into all levels i where j < i < k
- This implies that lower levels always contains a subset of higher levels
- Example:
  - L1 contains most recently used data
  - L2 contains that data + data used earlier
  - MM contains all data
- This make coherence far easier to maintain between levels



### USC Viterbi

## Average Access Time

- Define parameters
  - H<sub>i</sub> = Hit Rate of Cache Level L<sub>i</sub>
    (Note that 1-H<sub>i</sub> = Miss rate)
  - $T_i$  = Access time of level i
  - R<sub>i</sub> = Burst rate per word of level i (after startup access time)
  - B = Block Size
- Let us find T<sub>AVE</sub> = average access time



School of Engineering

# T<sub>ave</sub> without L2 cache

- 2 possible cases:
  - Either we have a hit and pay only the L1 cache hit time
  - Or we have a miss and read in the whole block to L1 and then read from L1 to the processor

• 
$$T_{ave} = T_1 + (1-H_1) \cdot [T_{MM} + B \cdot R_{MM}]$$
  
(Miss Rate)\*(Miss Penalty)

For T<sub>1</sub>=10ns, H<sub>1</sub> = 0.9, B=8, T<sub>MM</sub>=100ns, R<sub>MM</sub>=25ns
 - T<sub>ave</sub> = 10 + [ (0.1) • (100+8•25) ] = 40 ns

# T<sub>ave</sub> with L2 cache

#### • 3 possible cases:

- Either we have a hit and pay the L1 cache hit time
- Or we miss L1 but hit L2 and read in the block from L2
- Or we miss L1 and L2 and read in the block from MM

• 
$$T_{ave} = T_1 + (1-H_1) \bullet H_2 \bullet (T_2 + B \bullet R_2) + (1-H_1) \bullet (1-H_2) \bullet (T_{MM} + B \bullet R_{MM})$$
  
L1 miss / L2 Hit L1 miss / L2 Miss

• 
$$T_{ave} = 10 + (0.1) \cdot (.98) \cdot (20 + 8 \cdot 10) + (0.1) \cdot (.02) \cdot (100 + 8 \cdot 25)$$
  
= 10 + 9.8 ns + 0.6 = 20.4 ns

17

## **Three Main Issues**

18

School of Engineering

- Finding cached data (hit/miss)
- Replacement algorithms
- Coherency (managing multiple versions)

- Discussed in previous lectures





### Cache Question



Hi, I'm a block of cache data. Can you tell me what address I came from? Oxbfffeff0? 0x0080a1c4? 20

## **Cache Implementation**

- Assume a cache of 4 blocks of 16-bytes each
- Must store more than just data!
- What other bookkeeping and identification info is needed?
  - Has the block been modified
  - Is the block empty or full
  - Address range of the data: Where did I come from?



Cache with 4 data blocks

21

## Implementation Terminology

22

- What bookkeeping values must be stored with the cache in addition to the block data?
- **Tag** Portion of the block's address range used to identify the MM block residing in the cache from other MM blocks.
- Valid bit Indicates the block is occupied with valid data (i.e. not empty or invalid)
- Dirty bit Indicates the cache and MM copies are "inconsistent" (i.e. a write has been done to the cached copy but not the main memory copy)
  - Used for write-back caches

### Identifying Blocks via Address Range

- Possible methods
  - Store start and end address (requires multiple comparisons)
  - Ensure block ranges sit on binary boundaries (upper address bits identify the block with a single value)
    - Analogy: Hotel room layout/addressing



| 200 |                                        | 220 |  |
|-----|----------------------------------------|-----|--|
| 201 | 201                                    |     |  |
| 202 | 202<br>203<br>204<br>205<br>206<br>207 | 222 |  |
| 203 |                                        | 223 |  |
| 204 |                                        | 224 |  |
| 205 |                                        | 225 |  |
| 206 |                                        | 226 |  |
| 207 |                                        | 227 |  |
| 208 |                                        | 228 |  |
| 209 |                                        | 229 |  |

Analogy: Hotel Rooms

1<sup>st</sup> Digit = Floor 2<sup>nd</sup> Digit = Aisle 3<sup>rd</sup> Digit = Room w/in aisle

To refer to the range of rooms on the second floor, left aisle we would just say rooms **20x**  4 word (16-byte) blocks:

| Addr. Range |      | Binary |                |
|-------------|------|--------|----------------|
| 000-00f     | 0000 | 0000   | 0000 -<br>1111 |
| 010-01f     | 0000 | 0001   | 0000 -<br>1111 |

23

School of Engineering

#### 8 word (32-byte) blocks:

| Addr. Range |      | Binary |                  |
|-------------|------|--------|------------------|
| 000-01f     | 0000 | 000    | 00000 -<br>11111 |
| 020-03f     | 0000 | 001    | 00000 -<br>11111 |

## **Cache Implementation**

- Assume 12-bit addresses and 16-byte blocks
- Block addresses will range from xx0-xxF
  - Address can be broken down as follows
  - A[11:4] = identifies block range (i.e. xx0-xxF)
  - A[3:0] = byte offset within the cache block



Addr. = 0x124 Word 0x4 (1<sup>st</sup>) w/in block 120-12F 0001 0010 0100 Addr. = 0xACC Word 0xC (3<sup>rd</sup>) w/in block AC024

# Cache Implementation

25

School of Engineering

 To identify which MM block resides in each cache block, the tags need to be stored along with the Dirty and Valid bits



## Scenario

- You lost your keys
- You think back to where you have been lately
  - You've been the library, to class, to grab food at campus center, and the gym
  - Where do you have to look to find your keys?
- If you had been home all day and discovered your keys were missing, where would you have to look?
- Key lesson: If something can be anywhere you have to search

By contrast, if we limit where things can be then our search need only look in those limited places

## **Content-Addressable Memory**

27

- Cache memory is one form of what is known as "content-addressable" memory
  - This means data can be in any location in memory and does not have one particular address
  - Additional information is saved with the data and is used to "address"/find the desired data (this is the "tag" in this case) via a search on each access
  - This search can be very time consuming!!

|           |  |            | 1010 1100 |                   | Data of 0xAC0-ACF | Is block 0x470- |
|-----------|--|------------|-----------|-------------------|-------------------|-----------------|
|           |  |            | V=1       | D=0               | (unmodified)      | 0x47f here?     |
| Processor |  | 0100       | 0111      | Data of 0x470-47F | or boro?          |                 |
|           |  | Read 0x47c | V=1       | D=1               | (modified)        | of here?        |
|           |  | 00         | 0000      | 0000              | empty             | or horo?        |
|           |  |            | V=0       | D=0               |                   | or nere?        |
|           |  |            | 0000      | 0000              | empty             | or boro?        |
|           |  |            | V=0       | D=0               |                   | or here?        |
|           |  |            |           |                   |                   |                 |

## Tag Comparison

28

School of Engineering

• When caches have many blocks (> 16 or 32) it can be expensive (hardware-wise) to check all tags



## **Tag Comparison Example**

29

School of Engineering

 Tag portion of desired address is check against all the tags and qualified with the valid bits to determine a hit



# Mapping Techniques

30

- Determines where blocks can be placed in the cache
- By reducing number of possible MM blocks that map to a cache block, hit logic (searches) can be done faster
- 3 Primary Methods
  - Direct Mapping
  - Fully Associative Mapping
  - Set-Associative Mapping

# Cache Mapping Schemes

- Cache mappings are really just variations in hash table configurations
  - We hash the larger memory space to the smaller cache space
  - Key = originating memory address (e.g. main memory address where the block came from)



School of Engineering

31



# Fully Associative Cache Mapping

32

School of Engineering

Memory

- Any memory block can go anywhere
- Like a hash table with 1 bucket/chain [ h(k) = 0 ]
  - Turns into a linked list
  - To find something in the list we must do a linear search



## **Direct Mapped Caches**

- Cache is like a hash table without chaining (one slot per bucket)
  - Collisions yield to evictions
  - Each main memory block will always map to the same cache location



33



# K-way Set Associative Mapping

- Buckets in the hash table are limited to size=k
  - Once a bucket is full, must evict blocks to make room for a new one
  - Each bucket is referred to as a set
  - Each MM block maps to one set but can go anywhere in that bucket



Block 0

School of Engineering

34



# **Fully Associative Mapping**

- Any block from memory can be put in any cache block (i.e. no restriction)
  - Implies we have to search everywhere to determine hit or miss
    Memory



## **Direct Mapping**

36

- Each block from memory can only be put in one location
- Given n cache blocks, MM block i maps to cache block i mod n


#### **K-way Set-Associative Mapping**

37

- Given, S sets, block i of MM maps to set i mod s
- Within the set, block can be put anywhere
- Let k = number of cache blocks per set = n/s





38

- 12-bit address:
  - 16 bytes per block => 4 LSB's used to determine the desired byte/word offset within the block
  - Tag = Block # = Upper bits used to identify the block in the cache





#### Fully Associative Address Scheme

- A[1:0] unused (word access only)
- Word bits = log<sub>2</sub>B bits (B=Block Size)
- Tag = Remaining bits

- Any block from memory can be put in any cache block (i.e. no mapping scheme)
- Completely flexible



40



41

School of Engineering



42

School of Engineering



43

School of Engineering



 Any block from memory can be put in any cache block (i.e. no mapping)

Tag

11111111

**Byte** 

| Тад      | Cache    |
|----------|----------|
| 11111111 | Block FF |
| 11111110 | Block FE |
| 0000000  | Block 0  |
| 00000001 | Block 1  |

Block FF can go in any cache block, so the only one left is cache block 0



Memory

44

45

School of Engineering



46

School of Engineering



47

- Each block from memory can only be put in one location
- MM block i maps to cache block i mod n



# **Direct Mapping Implementation**

48

School of Engineering

#### • 12-bit address:

- 16 bytes per block => 4 LSB's used to determine the desired byte/word offset within the block
- 4 = 2<sup>2</sup> possible blocks => 2 bits to determine cache location (i.e. hash function => use these 2 bits of address)
- Tag = Upper 6 bits used to identify the block in the cache (identifies between blocks that map to the same bucket (block 0, 4, 8, etc.)

|               | Tag    | Block | Byte |            |          |           |  |
|---------------|--------|-------|------|------------|----------|-----------|--|
| Address = 080 | 000010 | 00    | 0000 |            |          |           |  |
| Cache         |        |       |      |            | Memory   |           |  |
|               |        |       |      | 080<br>085 | Block 08 | = 0 mod 4 |  |
|               |        |       |      | 090        | Block 09 | = 1 mod 4 |  |
| Cache Block 1 |        |       |      | 09F<br>0A0 |          | - 2 mod 4 |  |
| Cache Block 2 |        |       |      |            | BIOCK UA | = 2 mod 4 |  |
| Cache Block 3 |        |       |      | 0BF        | Block 0B | = 3 mod 4 |  |
|               |        |       |      | 0C0<br>0CF | Block 0C | = 0 mod 4 |  |

# **Direct Mapping Implementation**

49

School of Engineering

#### • 12-bit address:

- 16 bytes per block => 4 LSB's used to determine the desired byte/word offset within the block
- 4 = 2<sup>2</sup> possible blocks => 2 bits to determine cache location (i.e. hash function => use these 2 bits of address)
- Tag = Upper 6 bits used to identify the block in the cache (identifies between blocks that map to the same bucket (block 0, 4, 8, etc.)

|               | Tag I  | Block | Byte |            |          |           |
|---------------|--------|-------|------|------------|----------|-----------|
| Address = 0A8 | 000010 | 10    | 0000 |            |          |           |
| Cache         |        |       |      |            | Memory   | _         |
|               |        |       |      | 080<br>085 | Block 08 | = 0 mod 4 |
|               |        |       |      | 090        | Block 09 | = 1 mod 4 |
| Cache Block 1 |        |       |      | 09F<br>0A0 |          |           |
| Cache Block 2 |        |       |      | 0AF        | BIOCK UA | = 2 mod 4 |
| Cache Block 3 |        |       |      | 080<br>0BF | Block 0B | = 3 mod 4 |
| Cache Diock 3 |        |       |      | 0Ĉ0<br>0CF | Block 0C | = 0 mod 4 |

- Each block from memory can only be put in one location
- MM block i maps to cache block i mod n

| Tag | Cache         |  |  |  |  |  |  |  |
|-----|---------------|--|--|--|--|--|--|--|
|     | Cache Block 0 |  |  |  |  |  |  |  |
|     | Cache Block 1 |  |  |  |  |  |  |  |
|     | Cache Block 2 |  |  |  |  |  |  |  |
|     | Cache Block 3 |  |  |  |  |  |  |  |



Memory

50

51

- Each block from memory can only be put in one location
- Block i mod n maps to cache block i



52

- Each block from memory can only be put in one location
- Block i mod n maps to cache block i



53

- Each block from memory can only be put in one location
- Block i mod n maps to cache block i



54

- Each block from memory can only be put in one location
- Block i mod n maps to cache block i



55

- Each block from memory can only be put in one location
- Block i mod n maps to cache block i



56

- Each block from memory can only be put in one location
- Block i mod n maps to cache block i



57

- Each block from memory can only be put in one location
- Block i mod n maps to cache block i



58

- Each block from memory can only be put in one location
- Block i mod n maps to cache block i



59

School of Engineering



60

- 12-bit address:
  - 16 bytes per block => 4 LSB's used to determine the desired byte/word offset within the block
  - 2 = 2<sup>1</sup> possible sets => 1 bits to determine cache set (i.e. hash function => use this 1-bit of address)
  - Tag = Upper 7 bits used to identify the block in the cache



 Blocks from set i can map into any cache block from set i





61

| Tag     | Set | Byte |
|---------|-----|------|
| 0000000 | 0   | 0000 |

 Blocks from set i can map into any cache block from set i



62

63

School of Engineering



64

School of Engineering



65

School of Engineering



66

School of Engineering



# **Summary of Mapping Schemes**

- Fully associative
  - Most flexible (less evictions)
  - Longest search time O(N)
- Direct-mapped cache
  - Least flexible (more evictions)
  - Shortest search time O(1)
- K-way Set Associative mapping
  - Compromise
    - 1-way set associative = \_
    - N-way set associative = \_\_\_\_
  - Search time is O(k) [usually small enough to be done in parallel => O(1)]



67



School of Engineering

#### Intel Nehalem Quad Core





School of Engineering

#### **Cache Configurations**

|                    | AMD Opteron     | Intel P4         | PPC 7447a        |
|--------------------|-----------------|------------------|------------------|
| Clock rate (2004)  | 2.0 GHz         | 3.2 GHz          | 1.5 – 2 GHz      |
| Instruction Cache  | 64KB, 2-way SA  | 96 KB            | 32 KB, 8-way SA  |
| Latency (clocks)   | 3               | 4                | 1                |
| Data cache         | 64 KB, 2-way SA | 8 KB, 4-way SA   | 32 KB, 8-way SA  |
| Latency (clocks)   | 3               | 2                | 1                |
| L1 Write Policy    | Write-back      | Write-through    | Programmable     |
| On-chip L2         | 1 MB, 16-way SA | 512 KB, 8-way SA | 512 KB, 8-way SA |
| L2 Latency         | 6               | 5                | 9                |
| Block size (L1/L2) | 64              | 64/128           | 32/64            |
| L2 Write-Policy    | Write-back      | Write-back       | Programmable     |

Sources: H&P, "CO&D", 3<sup>rd</sup> ed., Freescale.com,



School of Engineering

#### **REPLACEMENT ALGORITHMS**

# **Replacement Policies**

- On a miss, a new block must be brought in
- This requires evicting a current block residing in the cache
- Optimal Replacement Policy
  - MIN: Replace block used the farthest in the future
    - Requires knowledge of the future
- Practical Replacement policies
  - FIFO: First-in first-out (oldest block replaced)
  - LRU: Least recently used (usually best but hard to implement)
  - Random: Actually performs surprisingly well
- What about Least Frequently Used (LFU?)

#### **Replacement Aglorithms**

- FIFO can be pessimal (worst possible) for repeated linear scans that don't fit in the cache
- Consider cache of 4 blocks with a repeated iteration through an array that requires 5 blocks of storage

| FIFO      |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
|-----------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Reference | А | В | С | D | Е | А | В | С | D | E | А | В | С | D | E |
| 1         | А |   |   |   | Е |   |   |   | D |   |   |   | С |   |   |
| 2         |   | В |   |   |   | А |   |   |   | Е |   |   |   | D |   |
| 3         |   |   | С |   |   |   | В |   |   |   | А |   |   |   | E |
| 4         |   |   |   | D |   |   |   | С |   |   |   | В |   |   |   |

OS:PP 2<sup>nd</sup> Ed. Fig 9.13

72


73

School of Engineering

• Compare the following replacement algorithms for a pattern exhibiting temporal locality

|           |   |   |   |   |   |   | LRU  |   |   |   |   |   |   |   |   |
|-----------|---|---|---|---|---|---|------|---|---|---|---|---|---|---|---|
| Reference | А | В | А | С | В | D | А    | D | Е | D | А | Е | В | А | С |
| 1         | А |   | + |   |   |   | +    |   |   |   | + |   |   | + |   |
| 2         |   | В |   |   | + |   |      |   |   |   |   |   | + |   |   |
| 3         |   |   |   | С |   |   |      |   | E |   |   | + |   |   |   |
| 4         |   |   |   |   |   | D |      | + |   | + |   |   |   |   | С |
|           |   |   |   |   |   | ] | FIFO |   |   |   |   |   |   |   |   |
| 1         | А |   | + |   |   |   | +    |   | Е |   |   |   |   |   |   |
| 2         |   | В |   |   | + |   |      |   |   |   | Α |   |   | + |   |
| 3         |   |   |   | С |   |   |      |   |   |   |   | + | В |   |   |
| 4         |   |   |   |   |   | D |      | + |   | + |   |   |   |   | С |
|           |   |   |   |   |   | 1 | MIN  |   |   |   |   |   |   |   |   |
| 1         | А |   | + |   |   |   | +    |   |   |   | + |   |   | + |   |
| 2         |   | В |   |   | + |   |      |   |   |   |   |   | + |   | С |
| 3         |   |   |   | С |   |   |      |   | Е |   |   | + |   |   |   |
| 4         |   |   |   |   |   | D |      | + |   | + |   |   |   |   |   |

OS:PP 2<sup>nd</sup> Ed. Fig 9.14



#### **Replacement Aglorithms**

• Compare LRU & MIN following replacement algorithms for a pattern that repeatedly scans through memory

|           |   |   |   |   |   | ] | LRU |   |   |   |   |   |   |   |   |
|-----------|---|---|---|---|---|---|-----|---|---|---|---|---|---|---|---|
| Reference | А | В | С | D | Е | А | В   | С | D | Е | А | В | С | D | E |
| 1         | А |   |   |   | Е |   |     |   | D |   |   |   | С |   |   |
| 2         |   | В |   |   |   | А |     |   |   | E |   |   |   | D |   |
| 3         |   |   | С |   |   |   | В   |   |   |   | А |   |   |   | E |
| 4         |   |   |   | D |   |   |     | С |   |   |   | В |   |   |   |
|           |   |   |   |   |   | 1 | MIN |   |   |   |   |   |   |   |   |
| 1         | А |   |   |   |   | + |     |   |   |   | + |   |   | + |   |
| 2         |   | В |   |   |   |   | +   |   |   |   |   | + | С |   |   |
| 3         |   |   | С |   |   |   |     | + | D |   |   |   |   | + |   |
| 4         |   |   |   | D | Е |   |     |   |   | + |   |   |   |   | + |

OS:PP 2<sup>nd</sup> Ed. Fig 9.15

# Belady's Anomaly

- Adding space to a cache generally helps improve the hit rate
- BUT NOT ALWAYS!
- For FIFO, more slots may actually decrease hit rate: **Belady's anomaly** 
  - Other algorithms like LRU, MIN, and LFU can be proven to show that adding slots to the cache will ONLY HELP
- Compare the hit rate for FIFO replacement with 3 vs. 4 slots

| FIFO (3 slots) |   |   |   |    |       |        |   |   |   |   |   |   |
|----------------|---|---|---|----|-------|--------|---|---|---|---|---|---|
| Reference      | А | В | С | D  | А     | В      | Е | А | В | С | D | E |
| 1              | А |   |   | D  |       |        | Е |   |   |   |   | + |
| 2              |   | В |   |    | А     |        |   | + |   | С |   |   |
| 3              |   |   | С |    |       | В      |   |   | + |   | D |   |
|                |   |   |   | FI | FO (4 | slots) | 1 |   |   |   |   |   |
| 1              | А |   |   |    | +     |        | Е |   |   |   | D |   |
| 2              |   | В |   |    |       | +      |   | А |   |   |   | E |
| 3              |   |   | С |    |       |        |   |   | В |   |   |   |
| 4              |   |   |   | D  |       |        |   |   |   | С |   |   |

OS:PP 2<sup>nd</sup> Ed. Fig 9.15

75

#### **Miss Rate**

76

- Reducing Miss Rate means lower T<sub>AVE</sub>
- To analyze miss rate categorize them based on why they occur
  - Compulsory Misses
    - First access to a block will always result in a miss
  - Capacity Misses
    - Misses because the cache is too small
  - Conflict Misses
    - Misses due to mapping scheme (replacement of direct or set associative)



School of Engineering

#### Miss Rate & Block Size



Graph used courtesy "Computer Architecture: AQA, 3<sup>rd</sup> ed.", Hennessey and Patterson



School of Engineering

#### Hit/Miss Rate vs. Cache Size



#### USC Viterbi 🤇

School of Engineering

79

#### Miss Rate & Associativity



# Prefetching

80

- Hardware Prefetching
  - On miss of block i, fetch block i and i+1
- Software Prefetching
  - Special "Prefetch" Instructions
  - Compiler inserts these instructions to give hints ahead of time as to the upcoming access pattern



#### CACHE CONSCIOUS PROGRAMMING

# Working Sets

82

- Generally a program works with different sets of data at different times
  - Consider an image processing algorithm akin to JPEG encoding
    - Perform data transformation on image pixels using several weighting tables/arrays
    - Create a table of frequencies
    - Perform compression coding using that table of frequencies
    - Replace pixels with compressed codes
- The data that the program is accessing in a small time window is referred to as its working set
- We want that working set to fit in cache and make as much reuse of that working set as possible while it is in cache
  - Keep weight tables in cache when performing data transformation
  - Keep frequency table in cache when compressing



Engineering



INSTRUCTIONS (376706 per pixel)

Page size: 4096: 0 to 2% memory

#### https://cartesianproduct.wordpress.com/tag/working-set/

## **Cache-Conscious Programming**

- Order of array indexing
  - Row major vs. column major ordering
- Blocking (keeps working set small)
- Pointer-chasing
  - Linked lists, graphs, tree data structures that use pointers do not exhibit good spatial locality
- General Principles
  - Keep working set reasonably small (temporal locality)
  - Use small strides (spatial locality)
  - Static structures usually better than dynamic ones



84

## **Blocked Matrix Multiply**

- Traditional working set
  - 1 row of C, 1 row of A, NxN matrix B
- Break NxN matrix into smaller BxB matrices
  - Perform matrix multiply on blocks
  - Sum results of block multiplies to produce overall multiply result
- Blocked multiply working set
  - Three BxB matrices







85



86

- Intel Nehalem processor
  - L1D = 32 KB, L2 = 256KB, L3 = 8 MB



# Zipf Distribution

- Zipf modeled the frequency of word usage in larger text bodies
- Zipf model says the frequency of access of the k-th most frequent/popular item from a set is 1/k<sup>α</sup> where [1 < α < 2]</li>
- Applies to may other domains
  - Web page access on the Internet
  - Popularity of cities, books, etc.
  - Size of friend lists in social networks



OS:PP 2<sup>nd</sup> Ed.: Fig. 9.7

87

## **Cache Implications**

- Zipf-ian distributions may not perform well even on large caches due to the heavy-tail
- Web-page cache
  - New data: New pages are being added all the time
  - No working set: While there are some popular webpages, no small subset will cover the bulk of the accesses
- Diminishing returns as the cache size is increased



88

School of Engineering

OS:PP 2<sup>nd</sup> Ed.: Fig. 9.7

#### **SWAPPING**

USC Viterbi

89



School of Engineering

#### Recall: VM Swap = Caching



## Page Fault Steps

3

(4)

3. Evict (writeback) page if no

frame free (update PT & TLB) 4. then bring in needed page 91

School of Engineering

**Disk Driver** 

(Interrupt)

4

data

- On page fault, handler will access disk possibly on eviction and to bring in desired page
  - Likely context switch on each access since disk is slow
- and update PT OS Exception Make sure PT & TLB are updated (Page Fault) appropriately Handler 4 Page Table Invalid / Not Present 3 Miss Restart faulting 5 PPFN VPN instruction TI B 10 ns 6 Memory TLB Miss / PT VA PA 6 CPU walk / Update TLB Cache Page Offset **Translation Unit / MMU** Hit **Processor Chip**

# VM Eviction Algorithms

- Clock algorithm
  - Cycle through frames (circular queue)
- Second-chance Algorithm
  - Clock algorithm but pages w/ referenced bit set get a 2<sup>nd</sup> chance (wait until next cycle) to be evicted)
  - May give preference to dirty pages
- Pseudo-LRU
  - Use HW reference bits + OS-managed reference counts to perform some form of pseudo-LRU





92

# **Thrashing and Sharing**

93

School of Engineering

- When too many processes are sharing cache or main memory paging, thrashing may occur
- **Thrashing**: **Working set** cannot fit in memory causing constant, evictions and re-fetching of needed data
  - CPU is underutilized b/c it is constantly waiting on the memory system



https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/9\_VirtualMemory.html

### Page Allocation Fairness

- Want to prevent a few processes from hogging all the physical resources (or possibly all the swap space)
- Max-min fairness for how many pages allocated to process
  - Maximize responsiveness to the minimum request and then redistribute remainder to other processes
- Example:
  - Solaris (Unix) has a background thread that can utilize some percentage of the CPU's time looking for pages to evict
  - Can enforce limits on how many frames a process is occupying





School of Engineering

94

## Page Coloring

- We would not want to allocate pages to a process that all map (hash) to the same cache
  - If so, then when that process runs it would be having to walk the page table much too often
- The OS can keep track of the sets that pages allocated to a given process hash to and then allocate a page that hash to a different set (color) on the next request



#### Phys. Mem. Frames

95