Memory Hierarchy And Cache Management

Speculate Ambitiously
To speculate ambitiously requires three capabilities:
• The ability of the compiler to find instructions that, with the possible use of register renaming, can be speculatively moved and not affect the program data flow.
• The ability to ignore exceptions in speculated instructions, until we know that such exceptions should really occur.
• The ability to speculatively interchange loads and stores, or stores and stores, which may have address conflicts.
The first of these is a compiler capability, while the last two require hardware support.

Hardware Support for Preserving Exception Behavior
• To speculate ambitiously, we must be able to move any type of instruction and still preserve its exception behavior.
• The key to being able to do this is to observe that the results of a speculated sequence that is mispredicted will not be used in the final computation, and such a speculated instruction should not cause an exception.

Investigated for Supporting More Ambitious Speculation
• There are four methods that have been investigated for supporting more ambitious speculation without introducing erroneous behavior:
– The hardware and operating system cooperatively ignore exceptions for speculative instructions. This approach preserves exception behavior for correct programs, but not for incorrect ones. It may be viewed as unacceptable for some programs, but it has been used, under program control, as a "fast mode" in several processors.
– Speculative instructions that never raise exceptions are used, and checks are introduced to determine when an exception should occur.

Investigated for Supporting More Ambitious Speculation …
– A set of status bits, called poison bits, are attached to the result registers written by speculated instructions when the instructions cause exceptions. The poison bits cause a fault when a normal instruction attempts to use the register.
– A mechanism is provided to indicate that an instruction is speculative, and the hardware buffers the instruction result until it is certain that the instruction is no longer speculative.

Schemes
• To understand these schemes, we need to distinguish between exceptions that indicate a program error and would normally cause termination, such as a memory protection violation, and those that are handled and normally resumed, such as a page fault.
• Exceptions that can be resumed can be accepted and processed for speculative instructions just as if they were normal instructions.
• If the speculative instruction should not have been executed, handling the unneeded exception may have some negative performance effects, but it cannot cause incorrect execution.

Schemes …
• The cost of these exceptions may be high, however, and some processors use hardware support to avoid taking such exceptions, just as processors with hardware speculation may take some exceptions in speculative mode, while avoiding others until an instruction is known not to be speculative.
• Exceptions that indicate a program error should not occur in correct programs, and the result of a program that gets such an exception is not well defined, except perhaps when the program is running in a debugging mode.
• If such exceptions arise in speculated instructions, we cannot take the exception until we know that the instruction is no longer speculative.

Schemes …
• In the simplest method for preserving exceptions, the hardware and the operating system handle all resumable exceptions when the exception occurs and simply return an undefined value for any exception that would cause termination.
• If the instruction generating the terminating exception was not speculative, then the program is in error.
• Note that, instead of terminating the program, the program is allowed to continue, although it will almost certainly generate incorrect results.
• If the instruction generating the terminating exception is speculative, then the program may be correct and the speculative result will simply be unused.

Schemes …
• Thus, returning an undefined value for the instruction cannot be harmful.
• This scheme can never cause a correct program to fail, no matter how much speculation is done.
• An incorrect program, which formerly might have received a terminating exception, will get an incorrect result.
• This is acceptable for some programs, assuming the compiler can also generate a normal version of the program, which does not speculate and can receive a terminating exception.

Schemes …
• In such a scheme, it is not necessary to know that an instruction is speculative. Indeed, knowing that is helpful only when a program is in error and receives a terminating exception on a normal instruction; in such cases, if the instruction were not marked as speculative, the program could be terminated.
• A second approach to preserving exception behavior when speculating introduces speculative versions of instructions that do not generate terminating exceptions, plus instructions to check for such exceptions. This combination preserves the exception behavior exactly.

Schemes …
• A third approach for preserving exception behavior tracks exceptions as they occur but postpones any terminating exception, although not in a completely precise fashion.
• The scheme is simple: a poison bit is added to every register, and another bit is added to every instruction to indicate whether the instruction is speculative. The poison bit of the destination register is set whenever a speculative instruction results in a terminating exception; all other exceptions are handled immediately.
• If a speculative instruction uses a register with a poison bit turned on, the destination register of the instruction simply has its poison bit turned on.
• If a normal instruction attempts to use a register source with its poison bit turned on, the instruction causes a fault.
• In this way, any program that would have generated an exception still generates one, albeit at the first instance where a result is used by an instruction that is not speculative. (A toy model of this bookkeeping follows below.)
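The poison-bit bookkeeping can be made concrete with a toy C model (purely illustrative; the register structure and operations are assumptions, not any real ISA):

  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Toy register: a value plus a poison bit. */
  typedef struct { long value; bool poison; } reg_t;

  /* Speculative op: on a terminating exception it poisons the destination
     instead of trapping, and it propagates poison from its source. */
  static void spec_op(reg_t *dst, const reg_t *src, bool excepts) {
      if (src->poison || excepts) { dst->poison = true; return; }
      dst->value = src->value + 4;      /* stand-in for the real operation */
      dst->poison = false;
  }

  /* Normal op: faults the moment it reads a poisoned source. */
  static void normal_op(reg_t *dst, const reg_t *src) {
      if (src->poison) {
          fprintf(stderr, "fault: poisoned register used by normal instruction\n");
          exit(1);
      }
      dst->value = src->value;
      dst->poison = false;
  }

  int main(void) {
      reg_t r1 = {0, false}, r14 = {0, false};
      spec_op(&r14, &r1, true);   /* speculative instruction excepts: defer   */
      normal_op(&r1, &r14);       /* exception surfaces only at the real use  */
      return 0;
  }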

Schemes …
• Since poison bits exist only on register values and not memory values, stores are never speculative and thus trap if either operand is "poison".
• One complication that must be overcome is how the OS saves the user registers on a context switch if the poison bit is set.
• A special instruction is needed to save and reset the state of the poison bits to avoid this problem.
• The fourth and final approach listed earlier relies on a hardware mechanism that operates like a reorder buffer.
• In such an approach, instructions are marked by the compiler as speculative and include an indicator of how many branches the instruction was speculatively moved across and what branch action (taken/not taken) the compiler assumed.
• The last piece of information basically tells the hardware the location of the code block where the speculated instruction originally was.

Schemes …
• The original location of the speculated instruction is marked by a sentinel, which tells the hardware that the earlier speculative instruction is no longer speculative and values may be committed.
• All instructions are placed in a reorder buffer when issued and are forced to commit in order, as in the hardware speculation approach. (Notice, though, that no actual speculative branch prediction or dynamic scheduling occurs.)
• The reorder buffer tracks when instructions are ready to commit and delays the "write-back" portion of any speculative instruction.

Schemes …
• Speculative instructions are not allowed to commit until the branches that have been speculatively moved over are also ready to commit, or, alternatively, until the corresponding sentinel is reached.
• At that point, we know whether the speculated instruction should have been executed or not.
• If it should have been executed and it generated a terminating exception, then we know that the program should be terminated.
• If the instruction should not have been executed, then the exception can be ignored.

Example 1
• Consider the code fragment from an if-then-else statement of the form:

      if (A == 0) A = B; else A = A + 4;

  where A is at 0(R3) and B is at 0(R2). Assume the then clause is almost always executed. Compile the code using compiler-based speculation. Assume R14 is unused and available.

       LD    R1,0(R3)   ; load A
       BNEZ  R1,L1      ; test A
       LD    R1,0(R2)   ; then clause
       J     L2         ; skip else
  L1:  DADDI R1,R1,#4   ; else clause
  L2:  SD    R1,0(R3)   ; store A

Example 2
• Show how the code would use a speculative load (sLD) and a speculation check instruction (SPECCK) to completely preserve exception behavior. Assume R14 is unused and available.

       LD    R1,0(R3)   ; load A
       BNEZ  R1,L1      ; test A
       LD    R1,0(R2)   ; then clause
       J     L2         ; skip else
  L1:  DADDI R1,R1,#4   ; else clause
  L2:  SD    R1,0(R3)   ; store A

Example 3
• Consider the code fragment and show how it would be compiled with speculative instructions and poison bits. Show where an exception for the speculative memory reference would be recognized. Assume R14 is unused and available.

       LD    R1,0(R3)   ; load A
       BNEZ  R1,L1      ; test A
       LD    R1,0(R2)   ; then clause
       J     L2         ; skip else
  L1:  DADDI R1,R1,#4   ; else clause
  L2:  SD    R1,0(R3)   ; store A

Memory Reference Speculation
• Moving loads across stores is usually done when the compiler is certain the addresses do not conflict.
• A special instruction to check for address conflicts can be included in the architecture.
• The special instruction is left at the original location of the load instruction (and acts like a guardian), and the load is moved up across one or more stores.
• When a speculated load is executed, the hardware saves the address of the accessed memory location.
• If a subsequent store writes to that location before the check instruction, then the speculation has failed.

Memory Reference Speculation …
• If the location has not been touched, then the speculation is successful.
• Speculation failure can be handled in two ways:
– If only the load instruction was speculated, then it suffices to redo the load at the point of the check instruction (which could supply the target register in addition to the memory address).
– If additional instructions that depend on the load were also speculated, then a fix-up sequence that re-executes all the speculated instructions starting with the load is needed. In this case, the check instruction specifies the address where the fix-up code is located. (A toy model of the check follows below.)
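A toy C model of the guardian check (illustrative assumptions only; real hardware tracks speculated addresses in a dedicated structure):

  #include <stdio.h>
  #include <stdbool.h>

  static const int *spec_addr = 0;     /* address saved by the speculated load */
  static bool spec_clobbered = false;  /* set if a later store touches it      */

  static int speculative_load(const int *p) {
      spec_addr = p;                   /* hardware records the accessed address */
      spec_clobbered = false;
      return *p;
  }

  static void store(int *p, int v) {
      if (p == spec_addr)              /* address conflict with the moved load  */
          spec_clobbered = true;
      *p = v;
  }

  /* The check left at the load's original position: redo the load on conflict. */
  static int check_and_fixup(const int *p, int speculated_value) {
      return spec_clobbered ? *p : speculated_value;
  }

  int main(void) {
      int a = 1;
      int v = speculative_load(&a);    /* load hoisted above the store */
      store(&a, 7);                    /* conflicting store: speculation fails */
      v = check_and_fixup(&a, v);      /* check instruction redoes the load    */
      printf("%d\n", v);               /* prints 7 */
      return 0;
  }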

Sections to be covered from Chapter 4
• Section 4.5
• Software versus hardware-based scheduling

Memory Hierarchy & Cache Memory

Entry Quiz
1. Primary cache or level 1 cache is implemented in a separate chip.
   a. True  b. False
2. SRAM is implemented using
   a. Flip-flop  b. Magnetic core  c. Capacitor  d. Non-volatile technology
3. Main memory (200 ns) is slower compared to a register (0.2 ns) by an order of
   a. 3  b. 4  c. 1000  d. 10000
4. Virtual memory is
   a. Same as caching  b. Same as associative memory  c. Different from caching  d. Same as disk memory
5. Cache miss occurs when the
   a. Required instruction is not found in the cache
   b. Required data is not found in the cache
   c. Required instruction or data is not found in the cache
   d. Required instruction or data is not found in the main memory
   e. For all of the above conditions

Module Objective • To understand 1. Memory requirements of different computers 2. Memory hierarchy and the motivation behind it  3. Moore’s Law  4. Principles of Locality  5. Cache Memory and its implementation 6. Cache Performance 7. Terms: Cache, Cache Miss, Cache Hit, Latency, Bandwidth, SRAM, DRAM, by an order of, Direct Mapping, Associative Mapping, Set Mapping, Write Through, Write Allocated, Write Back, Dirty Bit, and Valid Bit

Memory Requirements • In general we would like to have – Faster Memory (lower access time or latency) – Larger (capacity and bandwidth) Memory – Simpler Memory 

Memory Requirements – Server, Desktop, and Embedded Devices
• Server: lower access time, higher bandwidth*, better protection*, larger memory
• Desktop: lower access time, larger memory
• Embedded: lower access time, simpler memory*

Processor-Memory Gap

Moore's Law
• Transistor density on a chip die doubles roughly every couple (1.5) of years.
• Short reference: http://en.wikipedia.org/wiki/Moore's_law

What is Memory Hierarchy and Why?
(Figure) CPU registers: 0.25 ns → main memory: 250 ns → bus adapter → storage & I/O devices: 2,500,000 ns!

Memory Hierarchy & Cache

Cache
• Cache is a smaller, faster, and more expensive memory.
• It improves the throughput/latency of the slower memory next to it in the memory hierarchy.
• Blocking reads and delaying the writes to the slower memory offers better performance.
• There are two cache memories, L1 and L2, between the CPU and main memory.
• L1 is built into the CPU.
• L2 is an SRAM.

Cache Operation
• Cache Hit: the CPU finds the required data item (or instruction) in the cache.
• Cache Miss: the CPU does not find the required item in the cache.
– The CPU is stalled.
– Hardware loads the entire block that contains the required data item from memory into the cache.
– The CPU continues to execute once the cache is loaded.

Principle of Locality
• Temporal Locality
• Spatial Locality

Hit and Miss
• Hit – Hit Rate, Hit Time
• Miss – Miss Rate, Miss Penalty
• Hit Time << Miss Penalty
(Figure: a block moves between a higher level and a lower level of the hierarchy.)

Cache Performance
• Program Execution Time
• CPU Clock Cycle, Cycle Time
• IC – Instruction Count

Program Execution Time (simple model)
  = (Useful CPU Cycles + Stalled CPU Cycles) × Cycle Time

Stalled CPU Cycles
  = Number of Cache Misses × Miss Penalty
  = IC × (Memory Accesses / Instruction) × Miss Rate × Miss Penalty

(Memory Accesses / Instruction = number of cache references per instruction.)
Note: the unit of Miss Penalty is CPU cycles.

Separating Read Misses from Write Misses

Stalled CPU Cycles
  = IC × (Memory Accesses / Instruction) × Miss Rate × Miss Penalty
  = IC × (Read Accesses / Instruction) × Read Miss Rate × Read Penalty
  + IC × (Write Accesses / Instruction) × Write Miss Rate × Write Penalty

Example 1 • Assume we have a computer where Clock cycles Per Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores and these total 50% of the instructions. If the miss penalty is 25 clock cycles and miss rate is 2%, how much faster would the computer be if all instructions were cache hits?

Example 1 …
1. Execution time for the program in a computer with 100% hits
   = (Useful CPU Cycles + Stalled CPU Cycles) × Cycle Time
   = (IC × CPI + 0) × Cycle Time
   = IC × 1.0 × Cycle Time

2. Stall cycles when there are cache misses
   = IC × (Memory Accesses / Instruction) × Miss Rate × Miss Penalty × Cycle Time
   = IC × (1 + 0.5) × 0.02 × 25 × Cycle Time
   = IC × 0.75 × Cycle Time

3. Total execution time
   = (IC × 1 + IC × 0.75) × Cycle Time
   = 1.75 × IC × Cycle Time

4. Ratio
   = (1.75 × IC × Cycle Time) / (IC × 1.0 × Cycle Time)
   = 1.75

The computer with no misses is therefore 1.75 times faster.

Example 2 • Assume we have a computer where Clock cycles Per Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores and these total 50% of the instructions. Miss penalty for read miss is 25 clock cycles and miss penalty for write miss is 50 clock cycles. Miss rate is 2% and out of this 80% is Read miss. How much faster would the computer be if all instructions were cache hits?
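A worked solution in the same style as Example 1 (assuming, as there, 1.5 memory accesses per instruction):

1. Average miss penalty = 0.80 × 25 + 0.20 × 50 = 30 clock cycles
2. Stall cycles = IC × 1.5 × 0.02 × 30 × Cycle Time = IC × 0.9 × Cycle Time
3. Total execution time = (IC × 1.0 + IC × 0.9) × Cycle Time = 1.9 × IC × Cycle Time
4. The computer with no misses is therefore 1.9 times faster.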

Cache Operation
(Figure) Cache structure: the CPU presents an address; the cache holds C (= 16 here) numbered lines, each with a tag and a block of K words; main memory holds 2^N words organized as blocks of K words (block length = K words).

Elements of Cache Design
• Cache Size
• Block Size
• Mapping Function
• Replacement Algorithm
• Write Policy
• Write Miss Policy
• Number of Caches
• Split versus Unified/Mixed Cache

Mapping Functions
(Figure) Direct mapping: the address from the CPU is split into Tag, Line, and Word fields. 1. The Line field selects a cache line; 2. the stored tag is copied out and 3. compared with the address Tag; 4. equal means a hit, otherwise a miss goes to main memory (blocks 0 … n); 5. on a hit the addressed word is loaded from the line.

Mapping Function
• Direct – The Line value in the address uniquely points to one line in the cache (1 tag comparison).
• Set Associative – The Line value points to a set of lines in the cache (typically 2/4/8, so 2/4/8 tag comparisons). This is known as 2/4/8-way set associative.
• Associative – The Line value is always 0; the address effectively points to all the lines in the cache (m tag comparisons, 4 in the figure).
  – Uses Content Addressable Memory (CAM) for comparison.
  – Needs a non-trivial replacement algorithm.
A sketch of the address split follows below.
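To make the field split concrete, here is a minimal C sketch of how a direct-mapped cache could decompose an address; the field widths (2 line bits, 3 word bits) and the sample address are illustrative assumptions, not taken from the figures:

  #include <stdint.h>
  #include <stdio.h>

  /* Assumed geometry: 4 cache lines, 8 words per block. */
  enum { WORD_BITS = 3, LINE_BITS = 2 };

  int main(void) {
      uint32_t addr = 0x1ACu;   /* example word address */
      uint32_t word = addr & ((1u << WORD_BITS) - 1);
      uint32_t line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1);
      uint32_t tag  = addr >> (WORD_BITS + LINE_BITS);
      /* A hit means: cache[line].valid && cache[line].tag == tag.
         In a fully associative cache LINE_BITS would be 0 (every tag is
         compared); in an N-way set-associative cache the line field becomes
         a set field and N tags are compared. */
      printf("tag=%u line=%u word=%u\n", tag, line, word);
      return 0;
  }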

(Figure) 2-way set-associative mapping: a 512-word main memory (addresses 0–511) as 128 blocks of 4 words; a 4-line cache organized as 2 sets (Set 0, Set 1) of 2 lines each. The Set field of the address selects a set, the tags of both lines in the set are compared with the address Tag, and on a hit the word is loaded; on a miss the request goes to main memory.

Mapping Function Comparison

Cache Type             | Hit Ratio                         | Search Speed
Direct Mapping         | Good                              | Best
Fully Associative      | Best                              | Moderate
N-Way Set Associative  | Very good, better as N increases  | Good, worse as N increases

Replacement Algorithm
• Least Recently Used
• First In First Out
• Least Frequently Used
• Random

Write Policy
• Write-through – Information is written to both the cache and the memory.
• Write-back – Information is written only to the cache. The content of the cache is written to the main memory only when the cache block is replaced (or the program terminates).
• A dirty bit is used to indicate that a cache block needs a "write back". (See the sketch below.)
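A minimal C sketch of the two policies and the role of the dirty bit (the line structure and memory array are illustrative assumptions):

  #include <stdbool.h>
  #include <stdio.h>

  /* One cache line; the dirty bit marks data not yet written to memory. */
  typedef struct { bool valid, dirty; unsigned tag; int data; } line_t;

  static int memory[1024];              /* stand-in for main memory */

  /* Write-through: update the cache and the memory on every write. */
  static void write_through(line_t *l, unsigned addr, int v) {
      l->data = v;
      memory[addr] = v;
  }

  /* Write-back: update only the cache and set the dirty bit. */
  static void write_back(line_t *l, unsigned addr, int v) {
      l->data = v;
      l->dirty = true;                  /* memory is now stale */
  }

  /* On replacement, a dirty line is finally written to memory. */
  static void evict(line_t *l, unsigned addr) {
      if (l->valid && l->dirty)
          memory[addr] = l->data;
      l->valid = l->dirty = false;
  }

  int main(void) {
      line_t l = {true, false, 0, 0};
      write_back(&l, 5, 42);            /* memory[5] is still 0 here      */
      evict(&l, 5);                     /* the deferred write happens now */
      write_through(&l, 6, 7);          /* memory[6] updated immediately  */
      printf("%d %d\n", memory[5], memory[6]);   /* prints: 42 7 */
      return 0;
  }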

Write Miss • Write-Allocate • No Write-Allocate

Number of Caches • Primary Cache (CPU) • Secondary Cache (SRAM) • L3 Cache (Cheaper SRAM) 

Split versus Unified/Mixed Cache
• Single or two level
• Unified or split
• Misses per 1000 instructions for various cache sizes (an 8 KB instruction cache sees roughly 8 misses per 1000 instructions)

Improving Cache Performance
• Reducing Penalty
• Reducing Misses
  – Compiler optimization that attempts to reduce cache misses falls under this category
• Stalled Cycles = IC × (Memory Accesses / Instruction) × Miss Rate × Miss Penalty

Improving Cache Performance Using the Compiler
• Compilers are built with the following optimizations:
  Instructions:
  – Reordering instructions to avoid conflict misses
  Data:
  – Merging Arrays
  – Loop Interchange
  – Loop Fusion
  – Blocking

Merging Arrays

  /* Two independent arrays may conflict in the cache: */
  int key[size];
  int value[size];

  /* Instead, one array of key/value pairs — no conflict: */
  struct merge {
      int key;
      int value;
  };
  struct merge merged_array[size];

Loop Interchange

  /* Before: the inner loop strides down columns (poor locality) */
  for (k = 0; k < 100; k = k + 1)
      for (j = 0; j < 100; j = j + 1)
          for (i = 0; i < 5000; i = i + 1)
              x[i][j] = 2 * x[i][j];

  /* Instead: the inner loop walks consecutive addresses */
  for (k = 0; k < 100; k = k + 1)
      for (i = 0; i < 5000; i = i + 1)
          for (j = 0; j < 100; j = j + 1)
              x[i][j] = 2 * x[i][j];

(Figure) Memory address order (row major): X(0,0) X(0,1) X(0,2) X(0,3) X(0,4) X(0,5) … X(1,0) X(1,1) …

Blocking
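Blocking operates on submatrices (tiles) small enough to stay in the cache, so data loaded once is reused before being evicted. A minimal sketch of the classic blocked matrix multiply (N, B, and the zero-initialized static arrays are illustrative assumptions):

  #define N 512
  #define B 64                /* tile size, assumed to fit in the cache */

  static double x[N][N], y[N][N], z[N][N];   /* x starts zeroed (static) */

  static int min(int a, int b) { return a < b ? a : b; }

  /* Blocked x = y * z: each pass works on a B x B tile of z, reusing it
     from the cache instead of streaming the whole matrix for every row. */
  void matmul_blocked(void) {
      for (int jj = 0; jj < N; jj += B)
          for (int kk = 0; kk < N; kk += B)
              for (int i = 0; i < N; i++)
                  for (int j = jj; j < min(jj + B, N); j++) {
                      double r = 0.0;
                      for (int k = kk; k < min(kk + B, N); k++)
                          r += y[i][k] * z[k][j];
                      x[i][j] += r;   /* accumulate partial dot products */
                  }
  }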

Loop Fusion

  /* Before: two separate passes over the same range */
  for (i = 0; i < 5000; i = i + 1)
      a[i] = i;
  …
  for (i = 0; i < 5000; i = i + 1)
      b[i] = b[i] + a[i];

  /* Instead: one fused pass — a[i] is still in the cache */
  for (i = 0; i < 5000; i = i + 1) {
      a[i] = i;
      b[i] = b[i] + a[i];
  }

Example 2
• Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets):

  WriteMem[100];
  WriteMem[100];
  ReadMem[200];
  WriteMem[200];
  WriteMem[100];

• What are the numbers of hits and misses using no-write allocate versus write allocate?
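A worked answer (following the standard analysis for this classic exercise; the cache starts empty):
• No-write allocate: WriteMem[100] miss, WriteMem[100] miss, ReadMem[200] miss, WriteMem[200] hit, WriteMem[100] miss — 4 misses and 1 hit. Write misses to 100 never allocate the block, so every write to 100 misses again; only address 200 is ever brought into the cache.
• Write allocate: WriteMem[100] miss, WriteMem[100] hit, ReadMem[200] miss, WriteMem[200] hit, WriteMem[100] hit — 2 misses and 3 hits.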

Exit Quiz
1. In the memory hierarchy, the top layer is occupied by the
   a. Fastest and the most expensive memory
   b. Slowest and the most expensive memory
   c. Highest-bandwidth and the fastest memory
   d. Slowest and the least expensive memory
2. The memory technology suitable for L2 cache is
   a. EPROM  b. DRAM  c. SRAM  d. Flash
3. A server's memory requirements are different from a desktop's memory requirements.
   a. True  b. False
4. Hit Rate is (1 – Miss Rate).
   a. True  b. False
5. Gordon Moore's law states that the transistor density
   a. Triples every year
   b. Doubles every two years
   c. Doubles every 1.5 years
   d. Doubles every year

Improving Cache Performance
• Two options:
  – By reducing Miss Penalty
  – By reducing Miss Rate

Reducing Cache Miss Penalty
1. Multilevel caches
2. Critical word first & early restart
3. Priority for read misses over writes
4. Merging write buffers
5. Victim cache

Reducing Miss Penalty – Multilevel Caches (1)
(Figure) Option A: CPU → 2 ns, 16 KB cache (miss rate 0.04) → 100 ns, 128 MB main memory — faster but smaller. Option B: CPU → 10 ns, 512 KB cache (miss rate 0.02) → the same main memory — slower but larger. Combining them (CPU → Cache 1 → Cache 2 → main memory) behaves like a single "? ns, 528 KB" cache: faster and larger.

Reducing Miss Rate & Global Miss Rate (1)
(Figure) CPU → 2 ns, 16 KB L1 (miss rate 0.05) → 10 ns, 512 KB L2 (miss rate 0.02) → 100 ns, 128 MB main memory; the pair behaves like a 2.6 ns, 528 KB cache — faster and larger.

Average Memory Access Time (effective) = Hit TimeL1 + Miss RateL1 × Miss PenaltyL1
Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 × Miss PenaltyL2
Global Miss Rate = Miss RateL1 × Miss RateL2

Reducing Miss Penalty – Critical Word First and Early Restart (2)
• Critical word first
• Early restart
• Let's discuss the following:
  – How is Miss Penalty improved?
  – Under what conditions is this useful?
  – Is it worth going through all the trouble?

Reducing Miss Penalty – Priority to Read Misses (3)
• A read miss takes priority
• The write to memory is put on hold
• Let's discuss the following:
  – Is there a problem?
  – How is Miss Penalty improved?
  – How is this done in write-through?
  – How is this done in write-back?
  – Under what conditions is this useful?
  – Is it worth going through all the trouble?

Reducing Miss Penalty – Merging Write Buffer (4)
• A write-through write is sent to a write buffer
• Buffer & write possible options:
  1. Buffer is empty – write the address and data to the buffer
  2. Buffer is not full – write or merge the data
  3. Buffer is full – stall until the buffer is written to memory
• Note: a block write to memory is more efficient than multiple single-word writes.
• Let's discuss the following:
  – How many different ways do we optimize Miss Penalty in this scheme?
  – Explain Fig 5.12 in the book
  – When there is no block write (as with an I/O device address), there is no merging

Fully Associative Victim Cache (5)
• The key word is recycle
• Another implicit key word is complex (hardware)
• Aimed at handling conflict misses
• Victim caches of one to six entries are typically enough

Reducing Misses
• Misses are due to the 3 Cs: Compulsory, Capacity, and Conflict
• Miss-reduction techniques:
  1. Larger lines/blocks
  2. Larger caches
  3. Higher associativity
  4. Way prediction and pseudo-associative caches
  5. Compiler techniques

Note: copy Tables 5.14 and 5.17 to two slides.

Reducing Cache Misses – Larger Block Size (1)
• A larger block improves the number of misses, but only up to a point!
• Points to discuss:
  – Why does the number of misses start increasing for larger blocks?
  – Low latency encourages smaller blocks; higher latency encourages larger blocks.
  – Low bandwidth encourages smaller blocks; higher bandwidth encourages larger blocks.
  – Effect on Miss Penalty with larger blocks
• What is the other name for 1-way set associative mapping?

Reducing Cache Misses – Larger Caches (2)

• Obviously!

Reducing Cache Misses – Higher Associativity (3)

• Miss rate improves with associativity
• Points to discuss:
  – Complexity of set-associative mapping versus the improvement
  – Rule of thumb 1: 8-way set associative is as good as fully associative.
  – Rule of thumb 2 (the 2:1 rule): a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.
  – Greater associativity increases hit time.

Pseudo-Associative Caches (4)
• Let's understand this using an 8-way set-associative instruction cache.
• Instructions exhibit better locality.
• Each access to the instruction cache normally needs 8 tag comparisons.
• Using locality, predict the next block to access in the set to reduce the number of comparisons.
• Effect on hit time and cache tags

(Figure) Cache memory, the ith set of an 8-way set-associative cache and its blocks; way prediction picks one block in the set to compare first.

Reducing Hit Time – Small and Simple Caches (1)

• Tag comparison is complex, especially in associative mapping.
• Tag checking can be overlapped with data transfer to reduce the hit time.
• Tag checking on the CPU chip, with the cache in a chip by itself, provides better hit time and larger cache capacity.

Reducing Hit Time – Virtual Cache (2)

• Normal addressing: VA → physical address → cache
• Skip the translation step: the VA maps directly to the cache
• Problems:
  – No page boundary checks
  – Building a direct mapping between VA and cache for every process is not easy.

Reducing Hit Time – Pipelined Cache (3)

• Increases the cache throughput, not the access time

Reducing Hit Time – Trace Cache

• Increases instruction-level parallelism
• Instead of 4 consecutive cache locations, load the next 4 instructions required by the CPU using a trace.
• Folds branch prediction into the cache

Improving Parallelism with Cache
• Cache hit under cache miss
• Cache miss under cache miss
• Both require a non-blocking cache and an out-of-order execution CPU
• Other methods to improve performance using parallelism:
  – Hardware prefetching of instructions
  – Software (compiler) prefetching of data (see the sketch below)
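As an illustration of software prefetching, a compiler (or programmer) can issue a prefetch a fixed distance ahead of the current access. A minimal sketch using GCC/Clang's __builtin_prefetch; the distance of 16 elements is an assumption that would be tuned to the machine:

  /* Prefetch a[i+16] for writing while a[i] is being processed, so the
     miss latency overlaps with useful work (needs a non-blocking cache). */
  void scale(double *a, long n, double s) {
      for (long i = 0; i < n; i++) {
          if (i + 16 < n)
              __builtin_prefetch(&a[i + 16], 1, 1);   /* rw=1: for write */
          a[i] *= s;
      }
  }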

Preparation: Write-back versus Write-through

Write-back:
• Complicates the cache coherency problem
• Low memory access overhead
• Better cache access time than write-through
• Requires higher memory bandwidth if blocking

Write-through:
• Simplifies the cache coherency problem
• High overhead
• If the cache is blocking, then higher access time
• Requires lower memory bandwidth

Note: no need to describe these two policies again; write-through does not buy anything extra for a single-processor system due to the absence of cache coherency issues.

CPU Execution Time
CPU Execution Time = (Useful CPU Cycles + Stalled Cycles) × Cycle Time
• Stalled Cycles = Misses × Miss Penalty
• Misses are given either as misses per 1000 instructions or as misses per memory access (AKA miss rate).
• Instruction Count, Cycles per Instruction, and Miss Penalty are also required to compute CPU execution time.

Average Access Time with Cache
Average Access Time = Hit Time + Miss Rate × Miss Penalty

Multi-level Cache:
Avg Access Time = Hit TimeL1 + Miss RateL1 × PenaltyL1
PenaltyL1 = Hit TimeL2 + Miss RateL2 × PenaltyL2
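A small C sketch that evaluates these two formulas; the 1/10/100-cycle latencies mirror the multi-level cache figure later in the deck, and the 0.05/0.02 miss rates come from the earlier two-level figure:

  #include <stdio.h>

  /* PenaltyL1 = Hit TimeL2 + Miss RateL2 x PenaltyL2 (memory access time) */
  static double amat2(double hit1, double mr1,
                      double hit2, double mr2, double mem) {
      double penalty1 = hit2 + mr2 * mem;
      return hit1 + mr1 * penalty1;   /* Hit TimeL1 + Miss RateL1 x PenaltyL1 */
  }

  int main(void) {
      printf("AMAT = %.2f clock cycles\n",
             amat2(1.0, 0.05, 10.0, 0.02, 100.0));   /* prints AMAT = 1.60 */
      return 0;
  }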

Addressing
(Figure) Set-associative addressing: the CPU address is split into Tag, Set, and Word fields. The Set field selects a set (Set 0 or Set 1); the tags of the lines in that set are compared with the address Tag; a hit loads the word, a miss goes to main memory (512-word memory, addresses 0–511, as blocks 0–127).

Assignment I – Due same day next week
• Mapping functions
• Replacement algorithms
• Write policies
• Write miss policies
• Split cache versus unified cache
• Primary cache versus secondary cache
• Compiler cache optimization techniques with examples

Assignment II – Due same day next week
• Multilevel cache
• Cache inclusion/exclusion property
• Thumb rules of cache
• Compiler pre-fetch
• Multi-level caching + one other miss penalty optimization technique
• Two miss rate optimization techniques

Assignment III - Due 2nd class of next week • All odd numbered problems from cache module of your text book.

Assignment IV - Due 2nd class of next week • All even numbered problems from cache module of your text book.

CPU Execution Time & Average Access Time
(Figure) CPU → cache (1 CC) → main memory (100 CC)

With Multi-level Cache
(Figure) CPU → L1 cache (1 CC) → L2 cache (10 CC) → main memory (100 CC)

Memory Hierarchy – Main Memory

Main Memory
• Module Objective
  – To understand main memory latency and bandwidth
  – Techniques to improve latency and bandwidth

Memory Hierarchy & Cache

Main Memory – Cache – I/O
(Figure) CPU (0.25 ns) → cache → main memory (250 ns) → bus adapter → storage & I/O devices (2,500,000 ns!)

• Cache prefers low-latency main memory
• I/O & multiprocessors prefer high-bandwidth/throughput main memory

Main Memory Access Time
(Figure) CPU → cache → main memory over an address bus and a data bus: 4 CC to send the address, 56 CC for the memory access, 4 CC to transfer the data.
• Access time per word = 4 + 56 + 4 CC
• One word is 8 bytes
• Latency is 1 bit/CC
(CC – clock cycle)

Improving Memory Performance
• Improving latency (time to access one memory unit – a word)
• Improving bandwidth (bytes accessed per unit time)

Improving Memory Bandwidth
(Figure) Three organizations, with a 4-word cache block and one word = 8 bytes:
• Simple design: CPU ↔ cache ↔ memory over a 64-bit bus
• Wider bus: CPU ↔ cache ↔ memory over a 4 × 64-bit bus
• Interleaved: a 64-bit bus to 4 memory banks, with consecutive words spread across the banks – Bank 0: 0, 4, 8, 12; Bank 1: 1, 5, 9, 13; Bank 2: 2, 6, 10, 14; Bank 3: 3, 7, 11, 15
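With the timings from the Main Memory Access Time slide (4 CC to send the address, 56 CC per access, 4 CC to transfer one word), the miss penalty for a 4-word block works out roughly as follows (a sketch; the exact overlap depends on the design):
• Simple design: 4 × (4 + 56 + 4) = 256 CC (the four words are fetched one after another)
• Wider (4-word) bus: 4 + 56 + 4 = 64 CC (one wide access fetches the whole block)
• 4-way interleaved: 4 + 56 + 4 × 4 = 76 CC (the banks access in parallel; the words return one transfer at a time)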

Bandwidth, Latency, Penalty
• Interleaving factor
• Address allocation with interleaving
