Advanced Computer Architecture CS5223
Lecture #7: Memory Architecture 3

Topics
• Virtual Memory
• Translation Look-Aside Buffer (TLB)
• Virtual Caches
• Homonym and Synonym Problems
• Alpha 21064 & Pentium 4 Memory Hierarchy

Virtual Memory (VM)
• In order to manage memory more efficiently and robustly, modern systems provide an abstraction of main memory known as virtual memory (VM).
• Virtual memory is an elegant interaction of hardware exceptions, hardware address translation, main memory, disk drives, and operating system software that provides each process with a large, uniform, and private address space.

Virtual Memory (cont.)

Concept
• multiple processes are running at any instant in time
• too expensive to dedicate a full address space to each process
• physical memory must be shared among many processes
Virtual memory
• divides physical memory into blocks and allocates them to different processes
• protection schemes restrict access to blocks
• reduces the time to start a program: not all code and data need be in physical memory
• automatically manages the memory hierarchy (no programmer-managed overlays)

Virtual Memory versus Cache

Virtual memory:
• replacement on miss is primarily controlled by the OS; the longer miss penalty means the OS can afford to spend more time deciding what to replace
• the size of the processor address determines the size of virtual memory
• secondary storage is also used

Cache:
• replacement on miss is controlled by hardware
• the cache size is independent of the processor address

Virtual Memory Categories
• Fixed-size blocks (pages)
  • pages of 4 KB to 64 KB
  • a single fixed-size address, divided into page number and offset
• Variable-size blocks (segments)
  • largest segment 64 KB to 4 GB; smallest segment 1 byte
  • requires two words for addressing: one word for the segment number, one word for the offset
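As a sketch of the fixed-size (paged) case, the split of a virtual address into page number and offset can be expressed as below. The 4 KB page size is one of the sizes mentioned above; the function name and example address are illustrative.

```python
# Sketch: splitting a virtual address into page number and offset,
# assuming a 4 KB page size (12 offset bits). Illustrative only.
PAGE_SIZE = 4096              # 4 KB pages
OFFSET_BITS = 12              # log2(4096)

def split_address(vaddr):
    """Return (virtual page number, page offset) for a virtual address."""
    vpn = vaddr >> OFFSET_BITS        # high-order bits select the page
    offset = vaddr & (PAGE_SIZE - 1)  # low-order bits select the byte
    return vpn, offset

vpn, off = split_address(0x12345)
# 0x12345 >> 12 = 0x12, and 0x12345 & 0xFFF = 0x345
```

A segment, by contrast, would need two separate words (segment number and offset), since the offset cannot be packed into a fixed number of low-order bits.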

Virtual Memory Capabilities
• It uses main memory efficiently by treating it as a cache for an address space stored on disk, keeping only the active areas in main memory and transferring data back and forth between disk and memory as needed.
• It simplifies memory management by providing each process with a uniform address space.
• It protects the address space of each process from corruption by other processes.


System with Virtual Addressing
Modern processors such as the Intel P6, the Sun SPARC, and the Compaq Alpha use a form of addressing known as virtual addressing.


VM as a Tool for Caching
When the CPU accesses virtual memory contents, the virtual memory stored on disk is cached in DRAM. The data at the lower level of the hierarchy must be partitioned into blocks that serve as the transfer units between the lower level and the cache.


Partitions of Virtual Pages
At any point in time, the set of virtual pages is partitioned into three disjoint subsets:
• Unallocated: pages that have not yet been allocated (or created) by the VM system. Unallocated blocks have no data associated with them and thus occupy no space on disk.
• Cached: allocated pages that are currently cached in physical memory.
• Uncached: allocated pages that are not cached in physical memory.


Translation Look-Aside Buffer (TLB)
• Every time the CPU generates a virtual address, the Memory Management Unit (MMU) must refer to a Page Table Entry (PTE) in order to translate the virtual address into a physical address. This requires an additional fetch from memory, at a cost of tens to hundreds of cycles.
• If the PTE happens to be cached in L1, the cost drops to one or two cycles. However, many systems try to eliminate even this cost by including a small cache of PTEs in the MMU called a Translation Look-Aside Buffer (TLB).


Translation Look-Aside Buffer (cont.)
Motivation
• page tables are large and stored in memory
• a translation requires two memory accesses: one to get the physical address, one to get the data
Solution
• use a cache for the address translations themselves: the TLB (or TB)
• the tag holds a portion of the virtual address
• the data portion holds a physical page number


Components of a Virtual Address Used to Access the TLB
• The index and tag fields used for set selection and line matching are extracted from the virtual page number (VPN) in the virtual address.
• If the TLB has T = 2^t sets, then the TLB index (TLBI) consists of the t least significant bits of the VPN, and the TLB tag (TLBT) consists of the remaining bits of the VPN.
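The TLBI/TLBT split described above can be sketched as follows; the number of TLB sets (2^4 = 16, so t = 4) is an illustrative assumption, not a value from the slides.

```python
# Sketch: extracting the TLB index (TLBI) and TLB tag (TLBT) from a
# virtual page number, assuming a TLB with T = 2^t sets. t = 4 is an
# illustrative choice (a 16-set TLB).
T_BITS = 4                             # t

def tlb_index_tag(vpn):
    tlbi = vpn & ((1 << T_BITS) - 1)   # t least significant bits of the VPN
    tlbt = vpn >> T_BITS               # remaining bits of the VPN
    return tlbi, tlbt

# For VPN 0xABC: the low 4 bits (0xC) index the set, the rest (0xAB)
# form the tag compared against the entries in that set.
```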


Steps on a TLB Hit
• Step 1: The CPU generates a virtual address.
• Steps 2–3: The MMU fetches the appropriate PTE from the TLB.
• Step 4: The MMU translates the virtual address to a physical address and sends it to the cache.
• Step 5: The cache returns the appropriate data word to the CPU.


TLB Miss When there is a TLB miss, then the MMU must fetch the PTE from the L1 cache, as shown in the Figure. The newly fetched PTE is stored in the TLB, possibly overwriting an existing entry.


TLBs and Caches


Q4: What Happens on a Write?
• Write-through: the information is written both to the block in the cache and to the block in the lower-level memory.
• Write-back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced.
Pros and cons of each:
• WT: read misses cannot result in writes (because of replacements)
• WB: repeated writes to a block cost only one write to memory
• WT is usually combined with write buffers so the processor does not wait for the lower-level memory
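The "repeated writes" advantage of write-back can be illustrated by counting how many writes reach the lower-level memory for a run of stores to one cached block. The model below is purely illustrative: it tracks a single block and ignores everything except the final eviction.

```python
# Sketch: lower-level memory traffic for n repeated writes to one
# cached block, under the two policies described above.
def memory_writes(policy, n_writes):
    """Return how many writes reach the lower-level memory."""
    if policy == "write-through":
        return n_writes        # every store also goes to memory
    elif policy == "write-back":
        return 1               # dirty block written once, on replacement
    raise ValueError(policy)
```

For 10 stores to the same word, write-through sends 10 writes down the hierarchy (absorbed by a write buffer, ideally), while write-back sends just one.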


Write Policies We

know about write-through vs. write-back Assume: a 16-bit write to memory location 0x00 causes a cache miss. Do we change the cache tag and update data in the block? Yes: Write Allocate No: Write No-Allocate Do we fetch the other data in the block? Yes: Fetch-on-Write (usually do write-allocate) No: No-Fetch-on-Write Write-around cache Write-through no-write-allocate 18

Virtual Caches
• Send the virtual address to the cache: called a virtually addressed cache (or just virtual cache), vs. a physical (real) cache.
• Avoids address translation before accessing the cache, for a faster hit time.
• Context switches? Just like the TLB: flush the cache or tag entries with a PID.
  • The cost of flushing is the flush time plus the "compulsory" misses from an empty cache.
  • A process-identifier tag identifies the process as well as the address within the process: no hit if the wrong process.
• I/O must interact with the cache.

I/O and Virtual Caches


Aliases and Virtual Caches
• Aliases (sometimes called synonyms): two different virtual addresses map to the same physical address.
• Because the virtual address is used to index the cache, the same data could end up in two different locations in the cache.
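A minimal sketch of why aliasing puts the same data in two cache locations. The cache geometry here (64-byte lines, 256 sets, 4 KB pages) is an illustrative assumption, chosen so that the index uses two virtual-address bits above the page offset.

```python
# Sketch: two virtual addresses mapping to the same physical page can
# land in different sets of a virtually indexed cache, because the set
# index is taken from virtual-address bits above the page offset.
LINE_BITS = 6      # 64-byte lines
INDEX_BITS = 8     # 256 sets -> index uses address bits [13:6]
OFFSET_BITS = 12   # 4 KB pages

def cache_index(addr):
    return (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)

# Suppose two mappings name the same physical page (illustrative VPNs):
va1 = (0x100 << OFFSET_BITS) | 0x40   # the page mapped at VPN 0x100
va2 = (0x201 << OFFSET_BITS) | 0x40   # the same page, mapped again at VPN 0x201
# Same page offset (0x40), but bits 13:12 differ, so the two aliases
# index different sets and the data is cached twice.
```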


Using Virtual Caches
• The cache is indexed and/or tagged with the virtual address.
• Cache access and MMU translation/validation are done in parallel.
• The physical address is saved in the tags for later write-back but is not used during indexing.

[Figure: the processor sends a virtual address to a cache holding virtual/physical address tags and data; on a cache miss, an address mapper produces the physical address sent to main memory.]

Problems with Virtual Caches

• Homonym problem
• Synonym problem


Homonym Problem
A homonym is one virtual address that means different physical addresses in different processes.
• Process 1's translation: virtual page 100 -> physical page 10
• Process 2's translation: virtual page 100 -> physical page 20
Sequence:
• Process 1 writes 1000 to virtual page 100 (the cache now holds tag 100, data 1000)
• Context switch to process 2
• Process 2 reads from virtual page 100 and wrongly hits on process 1's cached data (tag 100, data 1000)

Solutions to the Homonym Problem
• Purge (flush) the cache at each context switch
• Use the PID (process id) as an additional tag
• Use virtually indexed, physically tagged caches
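The PID-tag solution can be sketched as a toy direct-mapped cache whose tag comparison includes the process id; all structures and numeric values here are illustrative.

```python
# Sketch: a cache lookup that matches on (PID, virtual tag), so the
# same virtual address from a different process (a homonym) misses
# instead of returning the wrong data.
cache = {}   # set index -> (pid, vtag, data); toy direct-mapped cache

def lookup(pid, index, vtag):
    entry = cache.get(index)
    if entry and entry[0] == pid and entry[1] == vtag:
        return entry[2]           # hit: both PID and virtual tag match
    return None                   # miss: wrong process or wrong tag

cache[5] = (1, 0x100, 1000)       # process 1 cached its virtual page 0x100
# After a context switch, process 2 presents the same virtual tag, but
# its PID differs, so the lookup misses rather than hitting stale data.
```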


Synonym Problem
Process 1's translation information: virtual page 100 -> physical page 10, and virtual page 200 -> physical page 10 (two virtual names for the same physical page).
• Process 1 reads from virtual page 100 → the cache holds (tag 100, data 5)
• Process 1 reads from virtual page 200 → the cache holds (tag 100, data 5) and (tag 200, data 5): two copies of the same data
• Process 1 writes 10 to virtual page 100 → the cache holds (tag 100, data 10) and (tag 200, data 5)
• Process 1 reads from virtual page 200 → it gets the stale value 5 instead of 10

Solutions to the Synonym Problem
• Hardware anti-aliasing
• Alignment of synonyms: require all synonyms to be identical in the lower bits of their virtual addresses (assuming a direct-mapped cache)
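The alignment-of-synonyms rule can be sketched as a check that two virtual addresses agree in every bit used to index a direct-mapped cache, so both names select the same line and only one copy can exist. The 16 KB cache size is an illustrative assumption.

```python
# Sketch: synonyms are harmless in a direct-mapped virtually indexed
# cache if they agree in all index+offset bits, i.e. modulo the cache
# size. Assumed 16 KB direct-mapped cache -> 14 such bits.
CACHE_BITS = 14                           # log2(16 KB)

def synonyms_aligned(va1, va2):
    mask = (1 << CACHE_BITS) - 1
    return (va1 & mask) == (va2 & mask)   # same line -> only one copy

# Aligned synonyms differ only above bit 13, so they share a cache line;
# unaligned ones select different lines and can hold stale copies.
```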


Challenges for Memory Hierarchy Designers
The challenge in designing memory hierarchies is that every change that potentially improves the miss rate can also negatively affect overall performance. This combination of positive and negative effects is what makes the design of a memory hierarchy challenging.

Design change            Effect on miss rate                           Possible negative performance effect
Increase size            Decreases capacity misses                     May increase access time
Increase associativity   Decreases miss rate due to conflict misses    May increase access time
Increase block size      Decreases miss rate for a wide range of       May increase miss penalty
                         block sizes

Four Questions for Memory Hierarchy Designers
• Where can a block be placed in the upper level? (Block placement) — fully associative, set associative, direct mapped
• How is a block found if it is in the upper level? (Block identification) — tag/block
• Which block should be replaced on a miss? (Block replacement) — random, LRU
• What happens on a write? (Write strategy) — write-back or write-through (with a write buffer)

Alpha 21064 & Pentium 4 Memory Hierarchy


Alpha 21064

Separate instruction & data TLBs and caches:
• TLBs: 32 entries, fully associative
• TLB updates in software
• Caches: 8 KB, direct mapped
• Critical 8 bytes first
• Prefetch instruction stream buffer
• 2 MB L2 cache, direct mapped (off-chip)
• 256-bit path to main memory, 4 × 64-bit modules


Alpha VM Mapping
• 64-bit address divided into 3 segments:
  • seg0 (bit 63 = 0): user code/heap
  • seg1 (bit 63 = 1, bit 62 = 1): user stack
  • kseg (bit 63 = 1, bit 62 = 0): kernel segment for the OS
• Three-level page table, each level one page in size
• The Alpha uses only 43 unique bits of virtual address (a future minimum page size of up to 64 KB would extend this to 55 bits of VA)
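The three-level walk can be sketched as follows. The field widths (three 10-bit levels plus a 13-bit offset) are illustrative choices, not the 21064's actual parameters, and nested dicts stand in for the one-page tables at each level.

```python
# Sketch: a three-level page-table walk in the style described above.
# Each level of the virtual address indexes one table; the walk costs
# three memory references before the physical address is formed.
L_BITS, OFFSET_BITS = 10, 13      # illustrative field widths

def walk(root, vaddr):
    """root: nested dicts standing in for the per-level page tables."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    idx = [(vaddr >> (OFFSET_BITS + L_BITS * i)) & ((1 << L_BITS) - 1)
           for i in (2, 1, 0)]            # level-1, level-2, level-3 indices
    ppn = root[idx[0]][idx[1]][idx[2]]    # three dependent lookups
    return (ppn << OFFSET_BITS) | offset

root = {1: {2: {3: 0x7A}}}                # a table mapping one page
va = ((1 << (OFFSET_BITS + 2 * L_BITS))   # level-1 index = 1
      | (2 << (OFFSET_BITS + L_BITS))     # level-2 index = 2
      | (3 << OFFSET_BITS) | 0x5)         # level-3 index = 3, offset = 5
```

This cost of three references per translation is exactly what the TLB exists to avoid on the common case.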

Alpha Memory Performance Miss Rates


Predicting Cache Performance from Different Programs (ISA, compiler, ...)


Pentium 4 Processor Cache
• The Intel NetBurst micro-architecture can support up to three levels of on-chip cache.
• Only two levels of on-chip cache are implemented in the Pentium 4 processor. The level nearest to the execution core, the first level, contains separate caches for instructions and data: a first-level data cache and the trace cache, which is an advanced first-level instruction cache.
• All other levels of cache are shared.
• All caches use a pseudo-LRU (least recently used) replacement algorithm.
• A second-level cache miss initiates a transaction across the system bus to memory.

Pentium 4 Processor Cache (cont.)
• The Pentium 4 processor supports up to four outstanding load misses that can be serviced either by the on-chip caches or by memory.
• The streaming instructions (prefetches and streaming stores) can be used to manage data and minimize disturbance of temporal data held within the processor's caches.
• The Pentium 4 processor takes advantage of the Intel C++ Compiler, which supports C++ language-level features for the Streaming SIMD Extensions. The Streaming SIMD Extensions and MMX technology instructions provide intrinsics that allow you to optimize cache utilization.

Optimization of the Memory Copy Algorithm
The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations:
• alignment of data
• proper layout of pages in memory
• cache size
• interaction of the translation look-aside buffer (TLB) with memory accesses
• combining prefetch and streaming-store instructions

Pentium 4 Processor Cache Parameters

