### Chapter 7: Memory System Design

- Introduction
- RAM structure: Cells and Chips
- Memory boards and modules
- Two-level memory hierarchy
- The cache
- Virtual memory
- The memory as a sub-system of the computer

### Introduction

So far, we've treated memory as an array of words limited in size only by the number of address bits. Life is seldom so easy...

Real-world issues arise: cost, speed, size, power consumption, volatility, etc.

What other issues can you think of that will influence memory design?

Computer Systems Design and Architecture by V. Heuring and H. Jordan © 1997 V. Heuring and H. Jordan: Updated David M. Zar, February, 2001

### In This Chapter we will cover

- Memory components:
  - RAM memory cells and cell arrays
  - Static RAM: more expensive, but less complex
  - Tree and matrix decoders: needed for large RAM chips
  - Dynamic RAM: less expensive, but needs "refreshing"
  - Chip organization
  - Timing
  - ROM: read-only memory
- Memory boards
  - Arrays of chips give more addresses and/or wider words
  - 2-D and 3-D chip arrays
- Memory modules
  - Large systems can benefit by partitioning memory for
    - separate access by system components
    - fast access to multiple words


### In This Chapter we will also cover

- The memory hierarchy: from fast and expensive to slow and cheap
  - Example: Registers -> Cache -> Main Memory -> Disk
  - At first, consider just two adjacent levels in the hierarchy
- The cache: high speed and expensive
  - Kinds: direct mapped, associative, set associative
- Virtual memory: makes the hierarchy transparent
  - Translate the address from the CPU's logical address to the physical address where the information is actually stored
  - Memory management: how to move information back and forth
  - Multiprogramming: what to do while we wait
  - The "TLB" helps in speeding the address translation process
- Overall consideration of the memory as a subsystem

### Fig. 7.1 The CPU–Main Memory Interface



Sequence of events:

Read:

1. CPU loads MAR, issues Read, and REQUEST.
2. Main Memory transmits words to MDR.
3. Main Memory asserts COMPLETE.

Write:

1. CPU loads MAR and MDR, asserts Write, and REQUEST.
2. Value in MDR is written into address in MAR.
3. Main Memory asserts COMPLETE.

Some systems use separate R and W lines, and omit REQUEST.

### Table 7.1 Some Memory Properties

| Symbol | Definition | Intel 8088 | Intel 8086 | IBM/Moto. 601 |
|---|---|---|---|---|
| w | CPU word size | 16 bits | 16 bits | 64 bits |
| m | Bits in a logical memory address | 20 bits | 20 bits | 32 bits |
| s | Bits in smallest addressable unit | 8 | 8 | 8 |
| b | Data bus size | 8 | 16 | 64 |
| 2<sup>m</sup> | Memory word capacity, s-sized words | 2<sup>20</sup> | 2<sup>20</sup> | 2<sup>32</sup> |
| 2<sup>m</sup> × s | Memory bit capacity | 2<sup>20</sup> × 8 | 2<sup>20</sup> × 8 | 2<sup>32</sup> × 8 |

### Big-Endian and Little-Endian Storage

When data types having a word size larger than the smallest addressable unit are stored in memory, the question arises: is the least significant part of the word stored at the lowest address (*little-endian, little end first*), or is the most significant part of the word stored at the lowest address (*big-endian, big end first*)?

Example: The hexadecimal 16-bit number ABCDH, stored at address 0:
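The two byte orders for this example can be checked with a short Python sketch. It assumes byte-addressable memory starting at address 0; the variable names are illustrative only.

```python
value = 0xABCD  # the 16-bit example value from the text

# int.to_bytes lays the value out in the requested byte order.
little = value.to_bytes(2, byteorder="little")
big = value.to_bytes(2, byteorder="big")

for name, layout in (("little-endian", little), ("big-endian", big)):
    cells = ", ".join(f"addr {a}: {b:02X}" for a, b in enumerate(layout))
    print(f"{name}: {cells}")
```

In the little-endian layout the low byte CD sits at address 0; in the big-endian layout the high byte AB does.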



### Table 7.2 Memory Performance Parameters

| Symbol | Definition | Units | Meaning |
|---|---|---|---|
| t<sub>a</sub> | Access time | time | Time to access a memory word |
| t<sub>c</sub> | Cycle time | time | Time from start of access to start of next access |
| k | Block size | words | Number of words per block |
| b | Bandwidth | words/time | Word transmission rate |
| t<sub>l</sub> | Latency | time | Time to access first word of a sequence of words |
| t<sub>bl</sub> = t<sub>l</sub> + k/b | Block access time | time | Time to access an entire block of words |

(Information is often stored and moved in blocks at the cache and disk level.)
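As a rough illustration of the block access time formula in Table 7.2, the sketch below evaluates t_bl = t_l + k/b. The latency, block size, and bandwidth values are made-up numbers chosen only to show the arithmetic.

```python
def block_access_time(latency, block_words, bandwidth):
    """Return t_bl = t_l + k / b (time in ns, bandwidth in words per ns)."""
    return latency + block_words / bandwidth

# Hypothetical values: 100 ns latency, 8-word block, 0.25 words/ns.
t_bl = block_access_time(latency=100.0, block_words=8, bandwidth=0.25)
print(f"t_bl = {t_bl:.1f} ns")
```

With these assumed numbers the latency term dominates, which is why moving whole blocks pays off when latency is large.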

### Table 7.3 The Memory Hierarchy, Cost, and Performance

#### **Some Typical Values:**




### **Intel Architecture Over Time**

| Processor | Release<br>Date | MIPS | Max. CPU<br>Frequency at<br>Introduction | Transistors<br>on Die | Main CPU<br>Register<br>Size (bits) | External<br>Data Bus<br>Size (bits) | Max.<br>External<br>Address<br>Space | Caches in<br>CPU<br>Package |
|----------------|-----------------|------|---------------------------------------------|----------------------|------------------------------|---------------------------------|--------------------------------------|------------------------------------|
| 8086           | 1978            | 0.8  | 8 MHz                                       | 29 K                 | 16                           | 16                              | 1 MB                                 | None                               |
| Intel286       | 1982            | 2.7  | 12.5 MHz                                    | 134 K                | 16                           | 16                              | 16 MB                                | None                               |
| Intel386 DX    | 1985            | 6    | 20 MHz                                      | 275 K                | 32                           | 32                              | 4 GB                                 | None                               |
| Intel486 DX    | 1989            | 20   | 25 MHz                                      | 1.2 M                | 32                           | 32                              | 4 GB                                 | 8 KB L1                            |
| Pentium        | 1993            | 100  | 60 MHz                                      | 3.1 M                | 32                           | 64                              | 4 GB                                 | 16 KB L1                           |
| Pentium<br>Pro | 1995            | 440  | 200 MHz                                     | 5.5 M                | 32                           | 64                              | 64 GB                                | 16 KB L1;<br>256KB or<br>512 KB L2 |
| Pentium II     | 1997            | 466  | 266 MHz                                     | 7 M                  | 32                           | 64                              | 64 GB                                | 32KB L1;<br>256 KB or<br>512 KB L2 |
| Pentium III    | 1999            | 1000 | 500 MHz                                     | 8.2 M                | 32 GP<br>128<br>SIMD-FP      | 64                              | 64 GB                                | 32 KB L1;<br>512 KB L2             |


### Fig. 7.3 Memory Cells: a conceptual view

Regardless of the technology, all RAM memory cells must provide these four functions: Select, DataIn, DataOut, and R/W.



This "static" RAM cell is unrealistic in practice, but it is functionally correct. We will discuss more practical designs later.


### Fig. 7.4 An 8-bit register as a 1D RAM array

The entire register is selected with one select line, and uses one R/W line.



Data bus is bi-directional, and buffered. (Why?)





### Fig 7.7 A 16Kx4 SRAM Chip



This chip requires 24 pins including power and ground, and so will require a 24-pin package. Package size and pin count can dominate chip cost.


### Fig 7.8 Matrix and Tree Decoders

•2-level decoders are limited in size because of gate fanin. Most technologies limit fanin to ~8.

•When decoders must be built with fanin >8, then additional levels of gates are required.

•Tree and Matrix decoders are two ways to design decoders with large fanin:



3-to-8 line tree decoder constructed from 2-input gates.

4-to-16 line matrix decoder constructed from 2-input gates.
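The matrix decoder idea can be sketched behaviourally in Python. This is only an illustration under assumed structure: two 2-to-4 decoders drive row and column lines, and one 2-input AND sits at each crossing. The function names are hypothetical and the gate arrangement in the figure may differ.

```python
def decode_2to4(a1, a0):
    """2-to-4 decoder using only inverters and 2-input ANDs."""
    n1, n0 = 1 - a1, 1 - a0
    return [n1 & n0, n1 & a0, a1 & n0, a1 & a0]

def matrix_decode_4to16(a3, a2, a1, a0):
    """4-to-16 matrix decoder: row lines ANDed with column lines."""
    rows = decode_2to4(a3, a2)  # high address bits select a row
    cols = decode_2to4(a1, a0)  # low address bits select a column
    # One 2-input AND gate at each row/column crossing.
    return [r & c for r in rows for c in cols]

outputs = matrix_decode_4to16(1, 0, 1, 1)  # address 0b1011 = 11
print(outputs.index(1))
```

No gate in this sketch has more than two inputs, which is the point of the matrix arrangement when fanin is limited.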


### Fig 7.9 A 6 Transistor static RAM cell

This is a more practical design than the 8-gate design shown earlier.

A value is read by precharging the bit lines to a value halfway between a 0 and a 1, while asserting the word line. This allows the latch to drive the bit lines to the value stored in the latch.





Access time: the time required to decode the address and provide the value to the data bus.



### Fig 7.11 Static RAM Write Timing



Write time: the time the data must be held valid in order to decode the address and store the value in the memory cells.


### Fig 7.12 A Dynamic RAM (DRAM) Cell

Capacitor will discharge in 4-15ms.

Refresh capacitor by reading (sensing) value on bit line, amplifying it, and placing it back on bit line where it recharges capacitor.

Write: place value on bit line and assert word line. Read: precharge bit line, assert word line, sense value on bit line with sense/amp.

This need to refresh the storage cells of dynamic RAM chips complicates DRAM system design.








### **DRAM Refresh and row access**

•Refresh is usually accomplished by a "RAS-only" cycle. The row address is placed on the address lines and RAS is asserted. This refreshes the entire row. CAS is not asserted. The absence of a CAS phase signals the chip that a row refresh is requested, and thus no data is placed on the external data lines.

•Many chips use "CAS before RAS" to signal a refresh. The chip has an internal counter, and whenever CAS is asserted before RAS, it is a signal to refresh the row pointed to by the counter, and to increment the counter.

•Most DRAM vendors also supply one-chip DRAM controllers that encapsulate the refresh and other functions.

•Page mode, nibble mode, and static column mode allow rapid access to the entire row that has been read into the column latches.

•Video RAMs (VRAMs) clock an entire row into a shift register where it can be rapidly read out, bit by bit, for display.


### Fig 7.16 A CMOS ROM Chip

2-D CMOS ROM Chip




### Table 7.4 Kinds of ROM

| ROM Type | Cost | Programmability | Time to program | Time to erase |
|---|---|---|---|---|
| Mask programmed | Very inexpensive | At the factory | Weeks (turnaround) | N/A |
| PROM | Inexpensive | Once, by end user | Seconds | N/A |
| EPROM | Moderate | Many times | Seconds | 20 minutes |
| Flash EPROM | Expensive | Many times | 100 µs | 1 s, large block |
| EEPROM | Very expensive | Many times | 100 µs | 10 ms, byte |


### Memory boards and modules

- There is a need for memories that are larger and wider than a single chip.
- Chips can be organized into "boards."
- Boards may not be actual, physical boards, but may consist of structured chip arrays present on the motherboard.
- A board or collection of boards makes up a memory module.
- Memory modules:
  - Satisfy the processor-main memory interface requirements
  - May have DRAM refresh capability
  - May expand the total main memory capacity
  - May be interleaved to provide faster access to blocks of words





### **SRC DRAM Design**




### **SRC DRAM Design with Refresh**


### **Refresh Counter**

- Each row needs to be refreshed every R μs.
- There are N rows in the DRAM, so every R/N μs we need to refresh one of them.
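The refresh spacing above is a single division. The sketch below uses a 4 ms refresh period (the low end of the 4-15 ms discharge time quoted earlier) and an assumed 1024 rows; the row count is a hypothetical value, not taken from the SRC design.

```python
def refresh_tick_us(refresh_period_us, rows):
    """Time between single-row refreshes: R / N microseconds."""
    return refresh_period_us / rows

# Assumed example: R = 4000 us (4 ms), N = 1024 rows.
tick = refresh_tick_us(4000, 1024)
print(f"refresh one row every {tick} us")
```

A refresh counter that steps once per tick and wraps after N steps visits every row within the period R.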



### **Memory Controller State Machine**



READ = READ.H \* RQST.H' \* A<31> \* A<30> \* A<29> \* A<28> \* A<27> \* A<26> \* A<25>

WRITE = WRITE.H \* RQST.H' \* A<31> \* A<30> \* A<29> \* A<28> \* A<27> \* A<26> \* A<25>

REFRESH = RQST.H


### Fig 7.17 General structure of memory chip

This is a slightly different view of the memory chip than the previous one.




### Fig 7.18 Word Assembly from Narrow Chips

All chips have common CS, R/W, and Address lines.




### Fig 7.19 Increasing the Number of Words by a Factor of 2<sup>k</sup>

The additional k address bits are used to select one of 2<sup>k</sup> chips, each one of which has 2<sup>m</sup> words:
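A minimal sketch of this address split, assuming the k chip-select bits are the high-order bits of the (k+m)-bit address. The function name and the example widths are illustrative.

```python
def split_address(addr, m, k):
    """Split a (k+m)-bit address into (chip number, address within chip)."""
    if not 0 <= addr < (1 << (k + m)):
        raise ValueError("address does not fit in k+m bits")
    chip = addr >> m                # upper k bits drive the chip-select decoder
    within = addr & ((1 << m) - 1)  # lower m bits go to every chip in parallel
    return chip, within

# Assumed example: 4 chips (k = 2) of 1024 words each (m = 10).
chip, within = split_address(0b10_0000000101, m=10, k=2)
print(chip, within)
```

Only the selected chip responds, so the array behaves like one memory of 2<sup>k+m</sup> words.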








### Fig 7.22 A Memory Module interface

- Read and Write signals.
- Ready: memory is ready to accept commands.
- Address: to be sent with Read/Write command.
- Data: sent with Write or available upon Read when Ready is asserted.
- Module Select: needed when there is more than one module.




### Fig 7.23 DRAM module with refresh control

Figure labels: address register (k+m bits), chip/board selection, refresh counter, address multiplexer, refresh clock and control, board and chip selects, module select, memory timing (RAS, CAS), dynamic RAM array.





### Fig 7.25 Timing of Multiple Modules on a Bus

If the time to transmit information over the bus, $t_b$, is less than the module cycle time, $t_c$, it is possible to time multiplex information transmission to several modules.

Example: store one word of each cache line in a separate module.

Main Memory Address: [ Word | Module No. ]

This provides successive words in successive modules.

With interleaving of $2^k$ modules, and $t_b < t_c/2^k$, it is possible to get a $2^k$-fold increase in memory bandwidth, provided memory requests are pipelined. DMA satisfies this requirement.
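The address split for interleaving can be sketched as follows, assuming the low-order k bits of the word address select the module, as in the address format above. The function name is illustrative.

```python
def interleave(addr, k):
    """Map a word address to (module number, word within module)."""
    module = addr & ((1 << k) - 1)  # low k bits: module number
    word = addr >> k                # remaining bits: word inside the module
    return module, word

# Assumed example: 4 modules (k = 2).
for addr in range(8):
    module, word = interleave(addr, k=2)
    print(f"address {addr}: module {module}, word {word}")
```

Consecutive addresses rotate through the modules, so a sequential block read keeps all modules busy at once.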


#### Memory system performance

Breaking the memory access process into steps:

For all accesses:

- transmission of address to memory
- transmission of control information to memory (R/W, Request, etc.)
- decoding of address by memory

For a read:

- return of data from memory
- transmission of completion signal

For a write:

- transmission of data to memory (usually simultaneous with address)
- storage of data into memory cells
- transmission of completion signal

The next slide shows the access process in more detail.


### Fig 7.26 Static and dynamic RAM timing

(a) Static RAM behavior

"Hidden refresh" cycle: a normal cycle would exclude the pending refresh step.

### Example SRAM timings

Approximate values for static RAM read timing:

- Address bus drivers turn-on time: 40 ns
- Bus propagation and bus skew: 10 ns
- Board select decode time: 20 ns
- Time to propagate select to another board: 30 ns
- Chip select: 20 ns

**Propagation time for address and command to reach chip: 120 ns**

- On-chip memory read access time: 80 ns
- Delay from chip to memory board data bus: 30 ns
- Bus driver and propagation delay (as before): 50 ns

**Total memory read access time: 280 ns**

Moral: 70 ns chips do not necessarily provide 70 ns access time!
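The totals above can be reproduced by summing the individual delays. The dictionary keys below are shortened labels for the items in the list; the numbers are the ones given in the example.

```python
# Delays in ns on the way to the chip.
address_path = {
    "address bus drivers turn-on": 40,
    "bus propagation and skew": 10,
    "board select decode": 20,
    "propagate select to another board": 30,
    "chip select": 20,
}
# Delays in ns from the chip access back to the requester.
data_path = {
    "on-chip read access": 80,
    "chip to board data bus": 30,
    "bus driver and propagation": 50,
}

to_chip = sum(address_path.values())
total = to_chip + sum(data_path.values())
print(f"address and command reach chip after {to_chip} ns")
print(f"total read access time: {total} ns")
```

The chip's own access time is well under a third of the total in this example.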

#### Considering any two adjacent levels of the memory hierarchy

Some definitions:

**Temporal locality:** the property of most programs that if a given memory location is referenced, it is likely to be referenced again, "soon."

**Spatial locality:** if a given memory location is referenced, those locations near it numerically are likely to be referenced "soon."

**Working set:** the set of memory locations referenced over a fixed period of time, or in a *time window*.

Notice that temporal and spatial locality both work to assure that the contents of the working set change only slowly over execution time.





#### Primary and secondary levels of the memory hierarchy

Speed between levels is defined by latency, the time to access the first word, and bandwidth, the number of words per second transmitted between levels.

Typical latencies: cache latency is a few clocks; disk latency is 100,000 clocks.

- The item of commerce between any two levels is the **block**.
- Blocks may/will differ in size at different levels in the hierarchy. Example: cache block size ~ 16-64 bytes; disk block size ~ 1-4 Kbytes.
- As the working set changes, blocks are moved back and forth through the hierarchy to satisfy memory access requests.
- A complication: addresses will differ depending on the level.
  - Primary address: the address of a value in the primary level.
  - Secondary address: the address of a value in the secondary level.


#### Primary and secondary address examples

- Main memory address: unsigned integer
- Disk address: track number, sector number, offset of word in sector.



### Fig 7.29 Primary Address Formation



### Hits and misses; paging

**Hit:** the word was found at the level from which it was requested.

**Miss:** the word was not found at the level from which it was requested. (A miss will result in a request for the block containing the word from the next higher level in the hierarchy.)

Hit ratio (or hit rate): $h = \dfrac{\text{number of hits}}{\text{total number of references}}$

Miss ratio: $1 - h$

$t_p$ = primary memory access time; $t_s$ = secondary memory access time.

Access time: $t_a = h \cdot t_p + (1-h) \cdot t_s$

**Page:** commonly, a disk block. **Page fault:** synonymous with a miss.

**Demand paging:** pages are moved from disk to main memory only when a word in the page is requested by the processor.
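A small numeric sketch of the hit ratio and access time definitions. The reference counts and the two access times are hypothetical values chosen for illustration.

```python
def hit_ratio(hits, references):
    """h = number of hits / total number of references."""
    return hits / references

def access_time(h, t_p, t_s):
    """t_a = h * t_p + (1 - h) * t_s."""
    return h * t_p + (1 - h) * t_s

# Assumed example: 950 hits in 1000 references, t_p = 10 ns, t_s = 100 ns.
h = hit_ratio(950, 1000)
t_a = access_time(h, t_p=10.0, t_s=100.0)
print(f"h = {h:.2f}, miss ratio = {1 - h:.2f}, t_a = {t_a:.1f} ns")
```

Even a 5% miss ratio adds noticeably to the average, because each miss pays the much larger secondary access time.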

Block placement and replacement decisions must be made each time a block is moved.

## Virtual memory

A virtual memory is a memory hierarchy, usually consisting of at least main memory and disk, in which the processor issues all memory references as effective addresses in a flat address space. All translations to primary and secondary addresses are handled transparently to the process making the address reference, thus providing the *illusion* of a flat address space.

Recall that disk accesses may require 100,000 clock cycles to complete, due to the slow access time of the disk subsystem. Once the processor has, through mediation of the operating system, made the proper request to the disk subsystem, it is available for other tasks.

**Multiprogramming** shares the processor among independent programs that are resident in main memory and thus available for execution.


### Decisions in designing a two-level hierarchy

- Translation procedure: to translate from system address to primary address.
- Block size: block transfer efficiency and miss ratio will be affected.
- Processor dispatch on miss: processor wait or processor multiprogrammed.
- Primary level placement: direct, associative, or a combination. Discussed later.
- Replacement policy: which block is to be replaced upon a miss.
- Direct access to secondary level: in the cache regime, can the processor directly access main memory upon a cache miss?
- Write through: can the processor write directly to main memory upon a cache miss?
- Read through: can the processor read directly from main memory upon a cache miss as the cache is being updated?
- Read or write bypass: can certain infrequent read or write misses be satisfied by a direct access of main memory without any block movement?



The cache mapping function is responsible for all cache operations:

- Placement strategy: where to place an incoming block in the cache
- Replacement strategy: which block to replace upon a miss
- Read and write policy: how to handle reads and writes upon cache misses

The mapping function must be implemented in hardware. (Why?)

Three different types of mapping functions:

- Associative
- Direct mapped
- Block-set associative



### Memory fields and address translation




### Fig 7.31 Associative mapped caches

Associative mapped cache model: any block from main memory can be put anywhere in the cache. Assume a 16-bit main memory address.\*

\*16 bits, while unrealistically small, simplifies the examples.

### Fig 7.32 Associative cache mechanism

Because any block can reside anywhere in the cache, an *associative* (content addressable) memory is used. All locations are searched simultaneously.



### Advantages and disadvantages of the associative mapped cache

Advantage:

- Most flexible of all: any MM block can go anywhere in the cache.

Disadvantages:

- Large tag memory.
- Need to search entire tag memory simultaneously means lots of hardware.

Replacement policy is an issue when the cache is full (more later).

**Q.:** How is an associative search conducted at the logic gate level?

Direct mapped caches simplify the hardware by allowing each MM block to go into only one place in the cache.

### Fig 7.33 The direct mapped cache



Each MM block can go into only one location in the cache, the one corresponding to its group number. The cache therefore need only examine the single group that a reference specifies.

### Fig 7.34 Direct Mapped Cache Operation

1. Decode the group number of the incoming MM address to select the group.
2. If Match AND Valid,
3. then gate out the tag field.
4. Compare cache tag with incoming tag.
5. If a hit, then gate out the cache line,
6. and use the word field to select the desired word.
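The lookup steps above can be sketched in software. The field widths are assumptions, since the figure is not reproduced here: a 16-bit address, 256 groups, and 8-byte lines, giving a 5-bit tag, 8-bit group, and 3-bit word field. The class and method names are hypothetical.

```python
WORD_BITS, GROUP_BITS = 3, 8  # assumed field widths for a 16-bit address

def split(addr):
    """Return (tag, group, word) fields of a 16-bit address."""
    word = addr & ((1 << WORD_BITS) - 1)
    group = (addr >> WORD_BITS) & ((1 << GROUP_BITS) - 1)
    tag = addr >> (WORD_BITS + GROUP_BITS)
    return tag, group, word

class DirectMappedCache:
    def __init__(self):
        groups = 1 << GROUP_BITS
        self.valid = [False] * groups
        self.tags = [0] * groups
        self.lines = [bytes(1 << WORD_BITS)] * groups

    def read(self, addr):
        """Return the byte on a hit, or None on a miss."""
        tag, group, word = split(addr)         # select the group
        if self.valid[group] and self.tags[group] == tag:
            return self.lines[group][word]     # hit: word field picks the byte
        return None

    def fill(self, addr, line):
        """Place a line in the only group this address can map to."""
        tag, group, _ = split(addr)
        self.valid[group] = True
        self.tags[group] = tag
        self.lines[group] = bytes(line)

cache = DirectMappedCache()
cache.fill(0x1234, range(8))
print(cache.read(0x1234), cache.read(0x1A34))
```

Addresses 0x1234 and 0x1A34 share a group but differ in tag, so filling one evicts the other; that is the conflict a direct mapped cache cannot avoid.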


#### **Direct mapped caches**

•The direct mapped cache uses less hardware, but is much more restrictive in block placement.

•If two blocks from the same group are frequently referenced, then the cache will "thrash." That is, repeatedly bring the two competing blocks into and out of the cache. This will cause a performance degradation.

•Block replacement strategy is trivial.

•Compromise: allow several cache blocks in each group. This is the Block Set Associative Cache.


### Fig 7.35 2-Way Set Associative Cache

Example shows 256 groups, a set of two per group. Sometimes referred to as a 2-way set associative cache.




### Getting Specific: The Original Intel Pentium Cache

- The Pentium actually has two separate caches, one for instructions and one for data. The Pentium issues 32-bit MM addresses.
- Each cache is 2-way set associative.
- Each cache is 8K = 2<sup>13</sup> bytes in size.
- There are 32 = 2<sup>5</sup> bytes per line.
- Thus there are 64 or 2<sup>6</sup> bytes per set, and therefore 2<sup>13</sup>/2<sup>6</sup> or 2<sup>7</sup> = 128 groups.
- This leaves 32-5-7 = 20 bits for the tag field.

MMX Pentium: 16 KB code cache, 16 KB data cache, 4-way set associative.



This "cache arithmetic" is important, and deserves your mastery.
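The same arithmetic can be written as a small helper. It assumes all sizes are powers of two; the function and parameter names are illustrative.

```python
def cache_fields(addr_bits, cache_bytes, line_bytes, ways):
    """Return (groups, word_bits, group_bits, tag_bits) for a set associative cache."""
    set_bytes = line_bytes * ways            # bytes held by one group (set)
    groups = cache_bytes // set_bytes
    word_bits = line_bytes.bit_length() - 1  # log2, valid for powers of two
    group_bits = groups.bit_length() - 1
    tag_bits = addr_bits - word_bits - group_bits
    return groups, word_bits, group_bits, tag_bits

# Original Pentium: 32-bit address, 8 KB cache, 32-byte lines, 2-way.
print(cache_fields(32, 8 * 1024, 32, 2))
```

For the original Pentium this gives 128 groups, a 5-bit byte field, a 7-bit group field, and a 20-bit tag, matching the figures above.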


### **Cache Read and Write policies**

- Read and write cache hit policies
  - Write-through: updates both cache and MM upon each write.
  - Write back: updates only cache. Updates MM only upon block removal.
    - "Dirty bit" is set upon first write to indicate block must be written back.
- Read and write cache miss policies
  - Read miss: bring block in from MM
    - Either forward desired word as it is brought in, or
    - wait until entire line is filled, then repeat the cache request.
  - Write miss
    - Write allocate: bring block into cache, then update
    - Write no-allocate: write word to MM without bringing block into cache.

#### **Block replacement strategies**

- Not needed with direct mapped cache
- Least Recently Used (LRU)
  - Track usage with a counter. Each time a block is accessed:
    - Clear counter of accessed block
    - Increment counters with values less than the one accessed
    - All others remain unchanged
  - When set is full, remove line with highest count.
- Random replacement: replace block at random.
  - Even random replacement is a fairly effective strategy.
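The counter scheme for LRU can be sketched for a single set. Two details are assumptions not stated above: counters start as distinct values 0..ways-1, and an empty way is filled by the same highest-count rule as a full set. The class name is hypothetical.

```python
class LRUSet:
    """One cache set that tracks recency with per-line counters."""

    def __init__(self, ways):
        self.tags = [None] * ways
        self.count = list(range(ways))  # distinct ages; highest is the next victim

    def _touch(self, way):
        old = self.count[way]
        for j in range(len(self.count)):
            if self.count[j] < old:     # only younger lines age by one
                self.count[j] += 1
        self.count[way] = 0             # accessed line becomes most recent

    def access(self, tag):
        """Return True on a hit; on a miss, replace the line with the highest count."""
        hit = tag in self.tags
        if hit:
            way = self.tags.index(tag)
        else:
            way = self.count.index(max(self.count))
            self.tags[way] = tag
        self._touch(way)
        return hit

s = LRUSet(ways=2)
results = [s.access(t) for t in ("A", "B", "A", "C")]
print(results, s.tags)
```

In this run B is the least recently used line when C arrives, so C replaces B and A stays resident.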

#### **Cache performance**

Recall access time, $t_a = h \cdot t_p + (1-h) \cdot t_s$, for primary and secondary levels.

For $t_p$ = cache and $t_s$ = MM,

$t_a = h \cdot t_C + (1-h) \cdot t_M$

We define S, the speedup, as  $S = T_{without}/T_{with}$  for a given process, where  $T_{without}$  is the time taken without the improvement, cache in this case, and  $T_{with}$  is the time the process takes with the improvement.

Having a model for cache and MM access times, and cache line fill time, the speedup can be calculated once the hit ratio is known.
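A simplified speedup calculation, under two assumptions not made in the text: without a cache every reference costs $t_M$, and line fill time is ignored. The access times are hypothetical.

```python
def average_access_time(h, t_c, t_m):
    """t_a = h * t_C + (1 - h) * t_M."""
    return h * t_c + (1 - h) * t_m

def speedup(h, t_c, t_m):
    """S = T_without / T_with, assuming every uncached reference costs t_M."""
    return t_m / average_access_time(h, t_c, t_m)

# Assumed example: t_C = 10 ns, t_M = 100 ns.
for h in (0.5, 0.9, 0.99):
    print(f"h = {h:.2f}: S = {speedup(h, 10.0, 100.0):.2f}")
```

The speedup rises steeply as h approaches 1, which is why small improvements in hit ratio matter so much at the high end.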



- The PPC 601 has a *unified* cache, that is, a single cache for both instructions and data.
- It is 32 KB in size, organized as 64 x 8 block-set associative, with blocks being 8 8-byte words organized as 2 independent 4-word sectors for convenience in the updating process.
- A cache line can be updated in two single-cycle operations of 4 words each.
- Normal operation is write back, but write through can be selected on a per-line basis via software. The cache can also be disabled via software.

## **Virtual memory**

The Memory Management Unit (MMU) is responsible for mapping logical addresses issued by the CPU to physical addresses that are presented to the cache and main memory.



#### A word about addresses:

- Effective address: an address computed by the processor while executing a program. Synonymous with logical address.
  - The term effective address is often used when referring to activity inside the CPU. Logical address is most often used when referring to addresses when viewed from outside the CPU.
- Virtual address: the address generated from the logical address by the Memory Management Unit (MMU).
- Physical address: the address presented to the memory unit.

(Note: Every address reference must be translated.)

### Virtual addresses - why

The logical address provided by the CPU is translated to a virtual address by the MMU. Often the virtual address space is larger than the logical address space, allowing program units to be mapped to a much larger virtual address space.

Getting Specific: The PowerPC 601

- The PowerPC 601 CPU generates 32-bit logical addresses.
- The MMU translates these to 52-bit virtual addresses, before the final translation to physical addresses.
- Thus while each process is limited to 32 bits, the main memory can contain many of these processes.
- Other members of the PPC family will have different logical and virtual address spaces, to fit the needs of various members of the processor family.


#### Virtual addressing - advantages

•Simplified addressing. Each program unit can be compiled into its own memory space, beginning at address 0 and potentially extending far beyond the amount of physical memory present in the system.

•No address relocation required at load time.

•No need to fragment the program to accommodate memory limitations.

•Cost effective use of physical memory.

•Less expensive secondary (disk) storage can replace primary storage. (The MMU will bring portions of the program into physical memory as required)

•Access control. As each memory reference is translated, it can be simultaneously checked for read, write, and execute privileges.

•This allows access/security control at the most fundamental levels.

•Can be used to prevent buggy programs and intruders from causing damage to other users or the system. This is the origin of those "bus error" and "segmentation fault" messages.




Notice that each segment's virtual address starts at 0, different from its physical address.
Repeated movement of segments into and out of physical memory will result in gaps between segments. This is called external fragmentation.
Compaction routines must be occasionally run to remove these fragments.



•The computation of physical address from virtual address requires an integer addition for each memory reference, and a comparison if segment limits are checked.

•Q: How does the MMU switch references from one segment to another?
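
The addition and the limit comparison can be sketched as follows. This is a minimal illustration, assuming a segment table that maps a segment name to a (base, limit) pair; the table layout, the names, and the addresses are hypothetical.

```python
# Sketch of segmented address translation: one comparison, one addition.

class SegmentLimitError(Exception):
    """Raised when an offset falls outside its segment."""


def segment_translate(segment_table, segment, offset):
    base, limit = segment_table[segment]
    if not 0 <= offset < limit:          # the comparison (limit check)
        raise SegmentLimitError(f"offset {offset:#x} outside segment {segment!r}")
    return base + offset                 # the integer addition


# Hypothetical table: segment name -> (base physical address, length in bytes)
segment_table = {
    "code": (0x4000, 0x1000),
    "data": (0x9000, 0x0800),
}

print(hex(segment_translate(segment_table, "code", 0x0010)))
```

Note that offset 0 in each segment maps to a different physical address, which is why every segment's virtual addresses can start at 0.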

#### Fig 7.40 The Intel 8086 Segmentation Scheme


The first popular 16-bit processor, the Intel 8086, had a primitive segmentation scheme to "stretch" its 16-bit logical address to a 20-bit physical address.



The CPU allows four simultaneously active segments: CODE, DATA, STACK, and EXTRA. There are four 16-bit segment base registers.
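
A small sketch of the 8086 address computation: the 16-bit segment base is shifted left by 4 bits and added to the 16-bit offset, and the result is kept to 20 bits. The function name and the sample values are illustrative assumptions.

```python
# Sketch of 8086 real-mode address formation:
#   physical = (segment base * 16 + offset), truncated to 20 bits

def physical_address(segment_base, offset):
    assert 0 <= segment_base <= 0xFFFF, "segment register is 16 bits"
    assert 0 <= offset <= 0xFFFF, "logical address (offset) is 16 bits"
    return ((segment_base << 4) + offset) & 0xFFFFF   # 20-bit result


print(hex(physical_address(0x1234, 0x0010)))
```

Since segments can start every 16 bytes and each spans 64 KB, many different segment:offset pairs name the same physical byte.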



This figure shows the mapping between virtual memory pages, physical memory pages, and pages in secondary memory. Page n-1 is not present in physical memory, but only in secondary memory.
The MMU manages this mapping.



A page fault will result in 100,000 or more cycles passing before the page has been brought from secondary storage to MM.
Page tables are maintained by the OS.
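
The page table lookup and the page fault case can be sketched as below. This is only an illustration: the 4 KB page size, the list-of-tuples page table, and all names are assumptions, and a real MMU does this in hardware.

```python
# Sketch of paged address translation with a direct-indexed page table.

PAGE_SIZE = 4096   # assumed page size in bytes


class PageFault(Exception):
    """Raised when the referenced page is not in physical memory."""


def page_translate(page_table, virtual_address):
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    present, frame = page_table[page_number]   # index directly by page number
    if not present:
        # The OS would now fetch the page from secondary storage.
        raise PageFault(page_number)
    return frame * PAGE_SIZE + offset


# Hypothetical table: entry i is (present bit, physical frame number)
page_table = [(True, 5), (True, 2), (False, None)]

print(hex(page_translate(page_table, 1 * PAGE_SIZE + 0x2A)))
```

Only the page number is translated; the offset within the page is carried over unchanged.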

#### Page placement and replacement

Page tables are direct mapped, since the physical page is computed directly from the virtual page number.

But physical pages can reside anywhere in physical memory.

Page tables like the one on the previous slide grow large, since there must be a page table entry for every page in the program unit.

Some implementations resort to hash tables instead, which need have entries only for those pages actually present in physical memory.

Replacement strategies are generally LRU, or at least employ a "use bit" to guide replacement.
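
One common way a "use bit" can guide replacement is the clock (second-chance) scheme, sketched here. It approximates LRU rather than implementing it exactly, and the slides do not say which variant a given system uses; the function name and data layout are assumptions.

```python
# Sketch of use-bit ("clock") page replacement.
# use_bits[i] is True if frame i was referenced since the hand last passed it.

def choose_victim(use_bits, hand):
    """Return (victim_frame, new_hand), clearing use bits the hand sweeps past."""
    while True:
        if use_bits[hand]:
            use_bits[hand] = False                 # give this page a second chance
            hand = (hand + 1) % len(use_bits)
        else:
            return hand, (hand + 1) % len(use_bits)


use_bits = [True, True, False, True]
victim, hand = choose_victim(use_bits, hand=0)
print(victim, hand, use_bits)
```

The loop always terminates: in the worst case the hand clears every bit on one pass and picks its starting frame on the next.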

## Fast address translation: regaining lost ground

•The concept of virtual memory is very attractive, but leads to considerable overhead:

•There must be a translation for every memory reference

•There must be *two* memory references for every program reference:
  - one to retrieve the page table entry,
  - one to retrieve the value.


•Most caches are addressed by physical address, so there must be a virtual to physical translation before the cache can be accessed.

The answer: a small cache in the processor that retains the last few virtual-to-physical translations, called a Translation Lookaside Buffer (TLB).

The TLB contains not only the virtual to physical translations, but also the valid, dirty, and protection bits, so a TLB hit allows the processor to access physical memory directly.

The TLB is usually implemented as a fully associative cache.
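
A toy model of a fully associative TLB, to show why a hit avoids the extra page table reference. The capacity, the LRU eviction policy, and the dictionary standing in for the page table are all assumptions made for illustration; real TLBs do the lookup in parallel hardware.

```python
# Sketch of a small fully associative TLB with LRU eviction.
from collections import OrderedDict


class TLB:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()   # virtual page -> cached page table entry
        self.hits = 0
        self.misses = 0

    def lookup(self, page, page_table):
        if page in self.entries:                 # any slot may hold any page
            self.hits += 1
            self.entries.move_to_end(page)       # mark as most recently used
            return self.entries[page]
        self.misses += 1
        entry = page_table[page]                 # the extra memory reference
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)     # evict least recently used
        self.entries[page] = entry
        return entry


page_table = {0: 7, 1: 3, 2: 9}                  # hypothetical page -> frame map
tlb = TLB(capacity=2)
frames = [tlb.lookup(p, page_table) for p in (0, 1, 0, 2, 1)]
print(frames, tlb.hits, tlb.misses)
```

In a fuller model each cached entry would also carry the valid, dirty, and protection bits, so that a hit needs no page table access at all.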



#### Fig 7.44 Operation of the Memory Hierarchy






#### Fig 7.46 I/O Connection to a Memory with a Cache

•The memory system is quite complex, and affords many possible tradeoffs.

The only realistic way to choose among these alternatives is to study a typical workload, using either simulations or prototype systems.
Instruction and data accesses usually have different patterns.

It is possible to employ a cache at the disk level, using the disk hardware.
Traffic between MM and disk is I/O, and Direct Memory Access (DMA) can be used to speed the transfers.


