Advanced Computer Architecture
CS5223 Lecture #7
Memory Architecture 3
Topics
- Virtual Memory
- Translation Look-Aside Buffer (TLB)
- Virtual Caches
- Homonym and Synonym Problems
- Alpha 21064 & Pentium 4 Memory Hierarchy
Virtual Memory (VM)
In order to manage memory more efficiently and robustly, modern systems provide an abstraction of main memory known as Virtual Memory (VM). Virtual memory is an elegant interaction of hardware exceptions, hardware address translation, main memory, disk drives, and operating system software that provides each process with a large, uniform, and private address space.
Virtual Memory (cont.)
Concept
- Multiple processes are running at any instant in time; it is too expensive to dedicate a full address space to each process, so physical memory is shared among many processes.
- Virtual memory divides physical memory into blocks and allocates them to different processes.
- Protection schemes restrict access to blocks.
- Reduced time to start a program: not all code and data need be in physical memory.
- Virtual memory automatically manages the memory hierarchy, eliminating programmer-managed overlays.
Virtual Memory versus Cache
Virtual memory:
- Replacement on miss is primarily controlled by the OS; the longer miss penalty means the OS can afford to spend more time deciding what to replace.
- The size of the processor address determines the size of virtual memory.
- Secondary storage is also used.
Cache:
- Replacement on miss is controlled by hardware.
- The cache size is independent of the processor address size.
Virtual Memory Categories
Fixed-size blocks (pages)
- Pages of 4 KB to 64 KB, a single fixed size.
- The address is divided into a page number and an offset (a sketch follows this list).
Variable-size blocks (segments)
- Largest segments 64 KB to 4 GB; smallest segment 1 byte.
- Requires two words for addressing: one word for the segment and one word for the offset.
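As a concrete illustration of the paged scheme, here is a minimal sketch of splitting a virtual address into a page number and an offset, assuming a hypothetical 4 KB page size (the address value and constants are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12                    /* 4 KB = 2^12 bytes */
#define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

int main(void) {
    uint64_t va     = 0x12345678;
    uint64_t vpn    = va >> PAGE_SHIFT;   /* virtual page number */
    uint64_t offset = va & OFFSET_MASK;   /* offset within the page */
    printf("VA 0x%llx -> page 0x%llx, offset 0x%llx\n",
           (unsigned long long)va, (unsigned long long)vpn,
           (unsigned long long)offset);
    return 0;
}
```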
Virtual Memory Capabilities
- It uses main memory efficiently by treating it as a cache for an address space stored on disk, keeping only the active areas in main memory and transferring data back and forth between disk and memory as needed.
- It simplifies memory management by providing each process with a uniform address space.
- It protects the address space of each process from corruption by other processes.
System with Virtual Addressing
Modern systems such as the Intel P6, the Sun SPARC, and the Compaq Alpha use a form of addressing known as virtual addressing.
VM as a Tool for Caching
When the CPU accesses virtual memory contents, the virtual memory stored on disk is cached in DRAM. The data at the lower level of the cache hierarchy must be partitioned into blocks that serve as the transfer units between the lower level and the cache.
Partitions of Virtual Pages
At any point in time, the set of virtual pages is partitioned into three disjoint subsets:
- Unallocated: pages that have not yet been allocated (or created) by the VM system. Unallocated pages have no data associated with them and thus occupy no space on disk.
- Cached: allocated pages that are currently cached in physical memory.
- Uncached: allocated pages that are not cached in physical memory.
(A sketch of this state tracking follows the list.)
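A minimal sketch of how a VM system might track these three states per page; the names are illustrative, not taken from any particular OS:

```c
/* Per-page state in a hypothetical VM system. */
typedef enum {
    PAGE_UNALLOCATED, /* not yet created; occupies no space on disk */
    PAGE_CACHED,      /* allocated and resident in physical memory  */
    PAGE_UNCACHED     /* allocated on disk, not in physical memory  */
} page_state_t;
```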
Translation Look-Aside Buffer (TLB)
Every time the CPU generates a virtual address, the Memory Management Unit (MMU) must refer to a Page Table Entry (PTE) in order to translate the virtual address into a physical address. This requires an additional fetch from memory, at a cost of tens to hundreds of cycles. If the PTE happens to be cached in L1, the cost goes down to one or two cycles. However, many systems try to eliminate even this cost by including a small cache of PTEs in the MMU called a Translation Look-Aside Buffer (TLB).
Translation Look-Aside Buffer (cont.)
Motivation
- Page tables are large and are stored in memory.
- Each reference therefore requires two memory accesses: one to get the physical address and one to get the data.
Solution
- Use a cache for the address translations themselves: the TLB (or TB).
- The tag holds a portion of the virtual address; the data portion holds a physical page number (a sketch of such an entry follows).
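A minimal sketch of a TLB entry along these lines; the field widths are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/* One TLB entry: the tag holds a portion of the virtual address
 * (the upper VPN bits); the data holds a physical page number. */
typedef struct {
    bool     valid;
    uint64_t vpn_tag; /* tag: upper bits of the virtual page number */
    uint64_t ppn;     /* data: physical page number */
} tlb_entry_t;
```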
Components of a Virtual Address Used to Access the TLB
The index and tag fields used for set selection and line matching are extracted from the virtual page number (VPN) in the virtual address. If the TLB has T = 2^t sets, then the TLB index (TLBI) consists of the t least significant bits of the VPN, and the TLB tag (TLBT) consists of the remaining bits of the VPN.
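A minimal sketch of this bit extraction, assuming a hypothetical TLB with t = 4 (T = 16 sets):

```c
#include <stdint.h>

#define TLB_T 4  /* t: number of TLB index bits, so T = 2^t = 16 sets */

/* TLBI: the t least significant bits of the VPN. */
static inline uint64_t tlbi(uint64_t vpn) {
    return vpn & ((1u << TLB_T) - 1);
}

/* TLBT: the remaining bits of the VPN. */
static inline uint64_t tlbt(uint64_t vpn) {
    return vpn >> TLB_T;
}
```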
Steps on a TLB Hit
- Step 1: The CPU generates a virtual address.
- Steps 2-3: The MMU fetches the appropriate PTE from the TLB.
- Step 4: The MMU translates the virtual address to a physical address and sends it to the cache.
- Step 5: The cache returns the appropriate data word to the CPU.
TLB Miss
When there is a TLB miss, the MMU must fetch the PTE from the L1 cache, as shown in the figure. The newly fetched PTE is stored in the TLB, possibly overwriting an existing entry.
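The hit and miss paths together can be summarized in a compact sketch; the direct-mapped TLB organization and the identity-mapping PTE fetch below are illustrative stand-ins, not real MMU behavior:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)
#define TLB_SETS    16  /* hypothetical direct-mapped TLB */

typedef struct { bool valid; uint64_t tag, ppn; } tlb_entry_t;
static tlb_entry_t tlb[TLB_SETS];

/* Stand-in for the page-table walk that fetches a PTE from the
 * L1 cache or memory; here it just fakes an identity mapping. */
static uint64_t fetch_pte(uint64_t vpn) { return vpn; }

uint64_t translate(uint64_t va) {
    uint64_t vpn = va >> PAGE_SHIFT;
    uint64_t idx = vpn % TLB_SETS, tag = vpn / TLB_SETS;
    if (!(tlb[idx].valid && tlb[idx].tag == tag)) { /* TLB miss */
        tlb[idx].valid = true;                      /* refill, possibly */
        tlb[idx].tag   = tag;                       /* overwriting an   */
        tlb[idx].ppn   = fetch_pte(vpn);            /* existing entry   */
    }
    return (tlb[idx].ppn << PAGE_SHIFT) | (va & OFFSET_MASK);
}
```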
TLBs and Caches
Q4: What Happens on a Write?
- Write-through: the information is written both to the block in the cache and to the block in the lower-level memory.
- Write-back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
Pros and cons of each:
- WT: read misses cannot result in writes (because of replacements).
- WB: repeated writes to a block do not each go to memory.
- WT is always combined with write buffers so that the processor does not wait for the lower-level memory.
Write Policies
We know about write-through vs. write-back. Now assume a 16-bit write to memory location 0x00 causes a cache miss.
- Do we change the cache tag and update the data in the block? Yes: write-allocate. No: write-no-allocate.
- Do we also fetch the other data in the block? Yes: fetch-on-write (usually paired with write-allocate). No: no-fetch-on-write.
- A write-around cache is write-through with no-write-allocate.
(A sketch of these combinations follows.)
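A minimal sketch of the write-miss policy space; the names are hypothetical and the "cache" and "memory" actions are just prints:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    bool write_through;  /* else write-back                 */
    bool write_allocate; /* else no-allocate (write-around) */
} write_policy_t;

static void handle_write_miss(write_policy_t p, uint64_t addr, uint16_t v) {
    if (p.write_allocate) {
        puts("fetch rest of block into cache (fetch-on-write)");
        printf("update cached word at 0x%llx\n", (unsigned long long)addr);
    }
    if (p.write_through || !p.write_allocate)
        printf("write 0x%x through to memory\n", v);
}

int main(void) {
    /* Write-around cache: write-through + no-write-allocate. */
    write_policy_t write_around = { .write_through = true,
                                    .write_allocate = false };
    handle_write_miss(write_around, 0x00, 0xBEEF);
    return 0;
}
```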
Virtual Caches
- Send the virtual address to the cache. Called a virtually addressed cache, or just virtual cache, as opposed to a physical cache or real cache.
- Avoids address translation before accessing the cache: faster hit time.
- Context switches? Handle just like the TLB (flush or PID). The cost is the time to flush plus the "compulsory" misses from an empty cache.
- Alternatively, add a process-identifier tag that identifies the process as well as the address within the process: the cache cannot hit if the process is wrong.
- I/O must interact with the cache.
I/O and Virtual Caches
Aliases and Virtual Caches
- Aliases (sometimes called synonyms): two different virtual addresses map to the same physical address.
- The virtual address is used to index the cache.
- The same data could therefore reside in two different locations in the cache, as the sketch below illustrates.
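A quick illustration of how two synonyms can land in different sets of a virtually-indexed cache; the sizes are hypothetical (16 KB direct-mapped cache, 64-byte lines, 4 KB pages):

```c
#include <stdint.h>
#include <stdio.h>

/* 16 KB / 64-byte lines = 256 sets; index = VA bits [13:6]. With
 * 4 KB pages, index bits [13:12] come from the virtual page number,
 * so two synonyms can select different sets. */
#define LINE_SHIFT 6
#define NUM_SETS   256

static unsigned cache_set(uint64_t va) {
    return (unsigned)((va >> LINE_SHIFT) % NUM_SETS);
}

int main(void) {
    /* Two virtual addresses assumed to map to the same physical
     * address: same 4 KB page offset (0x234), different VPNs. */
    uint64_t va1 = 0x2234, va2 = 0x5234;
    printf("va1 -> set %u, va2 -> set %u\n",
           cache_set(va1), cache_set(va2)); /* two different sets */
    return 0;
}
```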
Using Virtual Caches
- The cache is indexed and/or tagged with the virtual address.
- The cache access and the MMU translation/validation are done in parallel.
- The physical address is saved in the tags for later write-back but is not used during indexing.
[Figure: the processor sends a virtual address to a cache holding virtual/physical address tags and data; on a cache miss, an address mapper produces the physical address sent to main memory.]
Problems with Virtual Caches
- The homonym problem
- The synonym problem
Homonym Problem
The same virtual address maps to different physical addresses in different processes.
- Process 1 translation information: virtual page 100 -> physical page 10.
- Process 2 translation information: virtual page 100 -> physical page 20.
Example sequence:
1. Process 1 writes 1000 to virtual page 100, leaving the cache entry (tag 100, data 1000).
2. The system context-switches to process 2.
3. Process 2 reads from virtual page 100 and hits on the stale entry (tag 100, data 1000) left by process 1: it receives the wrong data.
Solutions to the Homonym Problem
- Purge the cache at each context switch.
- Use the PID (process id) as an additional tag (a sketch follows this list).
- Use virtually-indexed, physically-tagged caches.
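A minimal sketch of the PID-tag solution; the structure name and field widths are hypothetical. The hit test compares the process id as well as the address tag, so one process can never hit on another process's homonym:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    uint16_t pid;      /* process identifier tag */
    uint64_t tag;      /* virtual address tag    */
    uint8_t  data[64];
} vcache_line_t;

/* A hit now requires a PID match in addition to the tag match. */
static bool is_hit(const vcache_line_t *line, uint16_t pid, uint64_t tag) {
    return line->valid && line->pid == pid && line->tag == tag;
}
```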
Synonym Problem
Two different virtual addresses map to the same physical address within one process.
- Process 1 translation information: virtual page 100 -> physical page 10, and virtual page 200 -> physical page 10.
Example sequence (the physical location initially holds 5):
1. Process 1 reads from virtual page 100: the cache holds (tag 100, data 5).
2. Process 1 reads from virtual page 200: the cache holds (tag 100, data 5) and (tag 200, data 5).
3. Process 1 writes 10 to virtual page 100: only the entry tagged 100 is updated, leaving (tag 100, data 10) and (tag 200, data 5).
4. Process 1 reads from virtual page 200 and receives the stale value 5 instead of 10.
Solutions to the Synonym Problem
- Hardware anti-aliasing.
- Alignment of synonyms: require all synonyms to be identical in the lower bits of their virtual addresses, assuming a direct-mapped cache (a sketch follows this list).
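A sketch of the alignment rule for a hypothetical 16 KB direct-mapped, virtually-indexed cache: synonyms must agree in the low 14 bits (log2 of the cache size) of their virtual addresses so that they always select the same cache line:

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_BITS 14  /* log2(16 KB): index + offset bits */

/* Synonyms that agree in these low bits always map to the same
 * line of the direct-mapped cache, so no stale copy can exist. */
static bool synonyms_aligned(uint64_t va1, uint64_t va2) {
    uint64_t mask = (1u << CACHE_BITS) - 1;
    return (va1 & mask) == (va2 & mask);
}
```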
Challenges for Memory Hierarchy Designers
The challenge in designing memory hierarchies is that every change that potentially improves the miss rate can also negatively affect overall performance. This combination of positive and negative effects is what makes the design of a memory hierarchy challenging.

Design change            Effect on miss rate                                   Possible negative performance effect
Increase size            Decreases capacity misses                             May increase access time
Increase associativity   Decreases miss rate due to conflict misses            May increase access time
Increase block size      Decreases miss rate for a wide range of block sizes   May increase miss penalty
Four Questions for Memory Hierarchy Designers
- Where can a block be placed in the upper level? (Block placement) Fully associative, set associative, or direct mapped.
- How is a block found if it is in the upper level? (Block identification) Tag/block.
- Which block should be replaced on a miss? (Block replacement) Random or LRU.
- What happens on a write? (Write strategy) Write-back or write-through (with a write buffer).
Alpha 21064 & Pentium 4 Memory Hierarchy
Alpha 21064
Separate instruction & data TLBs and caches
- TLBs: 32-entry, fully associative; TLB updates in software
- Caches: 8 KB, direct-mapped; critical 8 bytes first; prefetch instruction; stream buffer
- 2 MB L2 cache, direct-mapped (off-chip)
- 256-bit path to main memory, 4 x 64-bit modules
Alpha VM Mapping
- The 64-bit address is divided into 3 segments:
  - seg0 (bit 63 = 0): user code/heap
  - seg1 (bit 63 = 1, bit 62 = 1): user stack
  - kseg (bit 63 = 1, bit 62 = 0): kernel segment for the OS
- Three-level page table, each level occupying one page (a sketch of the walk follows).
- The Alpha uses only 43 unique bits of VA (a future minimum page size of up to 64 KB would extend this to 55 bits of VA).
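A sketch of a three-level walk in this style. With 8 KB pages and 8-byte PTEs, each one-page table level holds 1024 entries, so three 10-bit level indices plus the 13-bit page offset account for the 43 VA bits mentioned above; the flat in-memory table representation below is illustrative, not the 21064's exact format:

```c
#include <stdint.h>

#define PAGE_SHIFT 13                        /* 8 KB pages         */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)
#define LVL_BITS   10                        /* 1024 PTEs per page */
#define LVL_MASK   ((1u << LVL_BITS) - 1)

/* Each table level is modeled as an array of 64-bit entries whose
 * value is the address of the next level (or the PPN at the leaf). */
uint64_t walk(const uint64_t *l1, uint64_t va) {
    uint64_t vpn = va >> PAGE_SHIFT;
    const uint64_t *l2 =
        (const uint64_t *)(uintptr_t)l1[(vpn >> (2 * LVL_BITS)) & LVL_MASK];
    const uint64_t *l3 =
        (const uint64_t *)(uintptr_t)l2[(vpn >> LVL_BITS) & LVL_MASK];
    uint64_t ppn = l3[vpn & LVL_MASK];
    return (ppn << PAGE_SHIFT) | (va & PAGE_MASK);
}
```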
Alpha Memory Performance: Miss Rates
Predicting Cache Performance from Different Programs (ISA, compiler, ...)
Pentium 4 Processor Cache
The Intel NetBurst micro-architecture can support up to three levels of on-chip cache, but only two levels are implemented in the Pentium 4 processor. The level nearest to the execution core, the first level, contains separate caches for instructions and data: a first-level data cache and the trace cache, which is an advanced first-level instruction cache. All other levels of cache are shared. All caches use a pseudo-LRU (least recently used) replacement algorithm. A second-level cache miss initiates a transaction across the system bus to the memory subsystem.
Pentium 4 Processor Cache (cont.)
- The Pentium 4 processor supports up to four outstanding load misses that can be serviced either by the on-chip caches or by memory.
- The streaming instructions (prefetches and stores) can be used to manage data and minimize disturbance of temporal data held within the processor's caches.
- The Pentium 4 processor takes advantage of the Intel C++ Compiler, which supports C++ language-level features for the Streaming SIMD Extensions.
- The Streaming SIMD Extensions and MMX technology instructions provide intrinsics that allow you to optimize cache utilization.
Optimization of the Memory Copy Algorithm
The memory copy algorithm can be optimized using the Streaming SIMD Extensions, taking these considerations into account:
- alignment of data
- proper layout of pages in memory
- cache size
- interaction of the translation look-aside buffer (TLB) with memory accesses
- combining prefetch and streaming-store instructions
(A sketch combining prefetch and streaming stores follows.)
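A minimal sketch of the last point using real SSE/SSE2 intrinsics: prefetch the source ahead of use and write the destination with non-temporal (streaming) stores so it does not pollute the caches. It assumes 16-byte-aligned buffers and a size that is a multiple of 16 bytes:

```c
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <xmmintrin.h>  /* SSE:  _mm_prefetch, _mm_sfence         */
#include <stddef.h>

void stream_copy(void *dst, const void *src, size_t bytes) {
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; i++) {
        if ((i & 3) == 0)  /* once per 64-byte line: prefetch ahead */
            _mm_prefetch((const char *)(s + i) + 64, _MM_HINT_NTA);
        _mm_stream_si128(d + i, _mm_load_si128(s + i)); /* NT store */
    }
    _mm_sfence();  /* make the non-temporal stores globally visible */
}
```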
Pentium 4 Processor Cache Parameters