Random Access Memory And Problems

Problem 1 – Interleaving Factor
• What is the optimal interleaving factor if the memory cycle is 8 CC (1 CC to send the address + 6 CC access + 1 CC to transfer the data)?
• Assume 4 banks, b1–b4.

[Timing diagram: banks b1–b4 plotted against clock cycles 1–20, showing when the 1st, 2nd, and 3rd read requests are issued and served; with only 4 banks, the 3rd read request finds no free bank and stalls (marked X).]
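A minimal sketch of this timing exercise in Python, assuming the 1 + 6 + 1 breakdown above and back-to-back single-word reads round-robined across the banks. The bus model is simplified (one address issue per cycle, transfer overlap ignored), so it illustrates the bank-conflict stall rather than reproducing the diagram cycle for cycle.

# Hypothetical simulation of memory-bank interleaving for the
# 8 CC (1 + 6 + 1) memory cycle in Problem 1.
ADDRESS, ACCESS, TRANSFER = 1, 6, 1
CYCLE = ADDRESS + ACCESS + TRANSFER          # 8 CC per bank

def last_completion(n_reads, n_banks):
    """Cycle at which the last of n_reads back-to-back word reads
    completes when consecutive words live in consecutive banks."""
    bank_free = [0] * n_banks                # cycle each bank becomes free
    finish = 0
    for i in range(n_reads):
        bank = i % n_banks
        start = max(bank_free[bank], i * ADDRESS)  # one address issue per CC
        finish = start + CYCLE
        bank_free[bank] = finish             # bank busy for the full cycle
    return finish

for banks in (1, 2, 4, 8):
    print(f"{banks} bank(s): 8 reads finish at CC {last_completion(8, banks)}")

With 4 banks the 5th read must wait for bank b1 to free up (the stall marked X above); with 8 banks a read completes every cycle once the pipeline fills.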

Problem 2 (page 452 in your textbook)
• Block size = 1 word
• Memory bus size = 1 word
• CPU miss rate = 3%
• Memory accesses per instruction = 1.2
• Cache miss penalty = 64 CC
• Avg cycles per instruction = 2

[Diagram: CPU with cache connected to a simple memory; each miss costs 64 CC.]

Assume 1000 instructions in your program. If there are no misses, the execution time is 2000 CC. One instruction needs 1.2 memory accesses, so 1000 instructions -> 1200 accesses. If the miss rate is 3%, the number of misses for 1200 accesses is 36. The execution time is 2000 + 36 x 64 = 4304 CC. Average cycles per instruction = 4304/1000 = 4.3.
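The slide's arithmetic as a small Python helper (the function name is illustrative; the numbers are the slide's):

def effective_cpi(instructions, base_cpi, accesses_per_instr,
                  miss_rate, miss_penalty):
    """Total cycles and effective CPI with memory stall cycles added."""
    base_cycles = instructions * base_cpi
    misses = instructions * accesses_per_instr * miss_rate
    total = base_cycles + misses * miss_penalty
    return total, total / instructions

total, cpi = effective_cpi(1000, 2, 1.2, 0.03, 64)
print(total, cpi)   # 4304.0 CC, CPI = 4.304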

Problem 2 (wider bus – 2 words)
• Block size = 4 words
• Memory bus size = 2 words
• CPU miss rate = 2.5%
• Memory accesses per instruction = 1.2
• Cache miss penalty = 128 CC (two 64 CC transfers per 4-word block)
• Avg cycles per instruction = 2

[Diagram: CPU with cache connected to memory over a 2-word bus; each 2-word transfer takes 64 CC.]

Assume 1000 instructions in your program. With no misses the execution time is 2000 CC, and 1000 instructions -> 1200 accesses. If the miss rate is 2.5%, the number of misses for 1200 accesses is 30. The execution time is 2000 + 30 x 128 = 5840 CC. Average cycles per instruction = 5840/1000 = 5.84.

Problem 2 (2-way interleaving)
• Block size = 4 words
• Memory bus size = 1 word
• CPU miss rate = 2.5%
• Memory accesses per instruction = 1.2
• Cache miss penalty = 68 x 2 = 136 CC
• Avg cycles per instruction = 2

[Diagram: CPU with cache connected to a 2-way interleaved memory; 64 CC access per bank.]

Assume 1000 instructions in your program. With no misses the execution time is 2000 CC, and 1000 instructions -> 1200 accesses. If the miss rate is 2.5%, the number of misses for 1200 accesses is 30. The execution time is 2000 + 30 x 136 = 6080 CC. Average cycles per instruction = 6080/1000 = 6.08.

DRAM (Dynamic Random Access Memory)
• Address multiplexing
• RAS and CAS (row and column address strobes)
• Dynamic refreshing and its implications – data must be written back after a READ!
• Amdahl's rule of thumb – a 1000 MIPS machine should have 1000 MB of memory (1 MB per MIPS)
• Read your textbook for DRAM, SRAM, RAMBUS, and Flash.

SRAM (Static Random Access Memory)
• Uses six transistors per bit.
• Static – NO NEED to write data back after a READ.
• Comparison versus DRAM:
  – More transistors
    • Less capacity: 1/4 to 1/8
    • Expensive: 8–10 times
  – No refresh
    • Faster: 8–16 times
  – Used in caches instead of main memory

Flash
• Similar to EEPROM
• Ideal choice for embedded systems
  – Low power requirement, no moving parts
  – Faster READ time, slower WRITE time
• Comparison versus EEPROM
  – Finer and simpler control
  – Better capacity
• WRITE operation – an entire block must be erased and replaced.
  – Writes are 50 to 100 times slower than READs
• NOR type flash and NAND type flash

Virtual Memory

[Figure: a 16K virtual address space holding pages A–D (4K pages) is mapped onto a 32K physical main memory and a disk; some pages reside in physical memory while the rest live on disk.]

Virtual Memory versus First-Level Cache

Parameter        | First-Level Cache | Virtual Memory
Block/page size  | 16–128 B          | 4K–64K B
Hit time         | 1–3 CC            | 50–150 CC
Miss penalty     | 8–150 CC          | 1–10 M CC
  Access time    | 6–130 CC          | 0.8–8 M CC
  Transfer time  | 2–20 CC           | 0.2–2 M CC
Miss rate        | 0.1–10%           | 0.00001–0.001%
Address mapping  | 25–45 bit physical address to 14–20 bit cache address | 32–64 bit virtual address to 25–45 bit physical address

Memory Hierarchy Questions
• Where can a block be placed?
• How is a block found if it is in?
• Which block should be replaced?
• What happens on a write?

Segmentation versus Paging

[Figure: a program's code and data regions divided into variable-size segments versus fixed-size pages.]

Remark                  | Page                         | Segment
Words per address       | One                          | Two (segment and offset)
Programmer visible      | Invisible                    | Visible to programmer
Replacing a block       | Trivial                      | Hard
Memory use inefficiency | Internal fragmentation       | External fragmentation
Efficient disk traffic  | Yes (easy to tune page size) | Not always (small segments)

CPU Execution Time

CPU execution time = (CPU clock cycles + memory stall cycles) x cycle time

• Memory stall cycles = misses x penalty
• Misses are given either as misses per 1000 instructions or as misses per memory access, AKA the miss rate.
• Instruction count, cycles per instruction, and miss figures are also required to compute CPU execution time.

Average Access Time with Cache

Average access time = hit time + miss rate x miss penalty

Multi-level Cache

Avg access time = hit time(L1) + miss rate(L1) x penalty(L1)
penalty(L1) = hit time(L2) + miss rate(L2) x penalty(L2)
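A minimal sketch of the two-level formula above; the L1/L2 numbers are made up for illustration.

def amat_two_level(hit_l1, miss_l1, hit_l2, miss_l2, mem_penalty):
    """Average memory access time with an L1 and L2 cache, in CC."""
    penalty_l1 = hit_l2 + miss_l2 * mem_penalty   # cost of an L1 miss
    return hit_l1 + miss_l1 * penalty_l1

print(amat_two_level(hit_l1=1, miss_l1=0.05, hit_l2=10,
                     miss_l2=0.02, mem_penalty=100))   # 1.6 CC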

I/O Systems
• Objective – to understand mainly disk storage technologies.

Storage Systems
• Individual disks
• SCSI subsystem
• RAID
• Storage area networks (NAS & SAN)

Disk Storage
• Disk storage is slower than memory.
• Disk offers better volume at a lower unit price.
• Fits nicely into the virtual memory concept.
• A critical piece of overall performance.

Courtesy: www.shortcourses.com/choosing/storage/06.htm

Disk System
• Disk pack & cache
• Disk embedded controller
• SCSI host adapter
• OS (kernel/driver)
• User program

[Figure: host with a SCSI adapter on the host I/O bus, connected over the SCSI bus to per-disk controllers, each managing a disk.]

Individual Disks

[Figure: platter, track, and sector layout of an individual disk.]

Reference: http://www.stjulians.com/cs/diskstoragenotes.html

Components of Disk Access Time
• Seek time
• Rotational latency
• Internal transfer time
• Other delays
• That is:
  Avg access time = avg seek time + avg rotational delay + transfer time + other overhead

Problem (page 684)
• Seek time = 5 ms/100 tracks
• RPM = 10,000
• Transfer rate = 40 MB/sec
• Other delays = 0.1 ms
• Sector size = 512 B

Average access time =
  Average seek time (5 ms)
  + Average rotational delay (time for 1/2 revolution)
  + Transfer time (512/(40 x 10^6) s)
  + Other overhead (0.1 ms)
= 5 + 10^3 x 60/(2 x 10,000) + 512 x 10^3/(40 x 10^6) + 0.1 ms
= 5 + 3 + 0.0128 + 0.1
= 8.11 ms
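The same computation in Python, using the slide's parameters (the helper name is illustrative):

def disk_access_ms(seek_ms, rpm, sector_bytes, rate_bytes_per_s, other_ms):
    """Average disk access time in milliseconds."""
    rotation_ms = 0.5 * 60_000 / rpm                  # half a revolution
    transfer_ms = sector_bytes / rate_bytes_per_s * 1000
    return seek_ms + rotation_ms + transfer_ms + other_ms

print(disk_access_ms(5, 10_000, 512, 40e6, 0.1))      # 8.1128 -> ~8.11 ms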

RAID
• RAID stands for Redundant Array of Inexpensive/Independent Disks.
• Uses a number of small disks instead of one large disk.
• Six+ types of RAID (0–5).

Terminologies
• Mean Time Between Failures (MTBF)
• Mean Time To Data Loss (MTTDL)
• Mean Time To Data Inaccessibility (MTTDI)

RAID-0 Striping

[Figure: logical chunks 01–12 striped across four physical disks (stripe width 4): disk 1 holds 01/05/09, disk 2 holds 02/06/10, disk 3 holds 03/07/11, disk 4 holds 04/08/12 – n independent disks in general.]

RAID-0 Performance
• Throughput: best case – nearly n x the single-disk value.
• Utilization: worst case – nearly (1/n) x the single-disk value.
• Data reliability: r^n, where r is the reliability of one disk (r <= 1).
• Sequential access: fast.
• Random access: multithreaded random access offers better performance.
• When r = 0.8 and n = 2, reliability is 0.64.



RAID-1 Mirroring
• Issue addressed: RAID-0's reliability problem.
• Shadowing or mirroring is used.
• Data is not lost when a disk goes down.
• One or more disks can be used to mirror the primary disk.
• Writes are posted to the primary and shadowing disks.
• Reads can be served from any of them.

[Figure: primary disk and mirror disk, each holding identical blocks 01–03.]

RAID-1 Performance
• Reliability is improved with mirroring: 1 – (1 – r)(1 – r)
• Example: when r is 0.8, the reliability of RAID-1 is 0.96.
• Writes are more complex – they must be committed to the primary and all shadowing disks.
• Writes are much slower due to the atomicity requirement.
• Expensive – due to 1-to-1 redundancy.

RAID 0+1 – Striping & Mirroring

[Figure: a RAID-1 mirror of two RAID-0 arrays; chunks 01–06 are striped across the n disks of each RAID-0 set, and the two sets mirror each other.]

Performance RAID-0+1
• Let the reliability of a RAID-0 sub-tree be R':
  – Then the reliability of the RAID-1 tree = 1 – (1 – R')(1 – R')
• Reliability R' is:
  – R' = r^n (where r is the reliability of a single disk)
• Throughput is the same as RAID-0, but with 2 x n disks.
• Utilization is lower than RAID-0 due to mirroring.
• "Write" is marginally slower due to atomicity.
• When r = 0.9 and n = 2, R' = 0.81, and R = 1 – (0.19)^2 = 0.96.

RAID 1+0 – Mirroring & Striping

[Figure: a RAID-0 stripe across RAID-1 mirrored pairs; each pair holds identical copies of its chunks (01/03 and 02/04 on the first two pairs, 05 and 06 on the next).]

Performance RAID-1+0
• Let the reliability of a RAID-1 sub-tree be R':
  – Then the reliability of the RAID-0 tree = (R')^n
• Reliability R' is:
  – R' = 1 – (1 – r)^2 (where r is the reliability of a single disk)
• Throughput is the same as RAID-0, but with 2 x n disks.
• Utilization is lower than RAID-0 due to mirroring.
• "Write" is marginally slower due to atomicity.
• When r = 0.9 and n = 2, R' = 0.99, and R = (0.99)^2 = 0.98.

RAID-2 Hamming Code Arrays
• Low commercial interest due to the complex nature of the Hamming code computation.

RAID-3 Striping with Parity

[Figure: logical storage of single bits or words 01–12 striped across four data disks (stripe width 4), plus a dedicated parity disk holding P0, P1, P2.]

RAID-3 Operation
• Based on the principle of a reversible parity computation.
• Parity P = C0 ⊕ C1 ⊕ … ⊕ Cn-1 ⊕ Cn
• A missing stripe Cm can be rebuilt:
  Cm = P ⊕ C0 ⊕ C1 ⊕ … ⊕ Cm-1 ⊕ Cm+1 ⊕ … ⊕ Cn-1 ⊕ Cn
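A small sketch of this reversible parity in Python, using byte strings as stand-ins for disk chunks:

from functools import reduce

def parity(chunks):
    """XOR a list of equal-length byte strings together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

data = [b"\x01\x02", b"\x0a\x0b", b"\xf0\x0f"]
p = parity(data)                          # what the parity disk stores
rebuilt = parity([p] + [data[0], data[2]])  # lose chunk 1, rebuild it
assert rebuilt == data[1]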

RAID-3 Performance
• RAID-1's 1-to-1 redundancy issue is addressed by a 1-for-n parity disk. Less expensive than RAID-1.
• The rest of the performance is similar to RAID-0.
• RAID-3 can withstand the failure of one of its disks.
• Reliability = P(all disks working) + P(exactly one disk failed)
  = r^n + nC1 x r^(n-1) x (1 - r)
• When r = 0.9 and n = 5:
  = 0.9^5 + 5 x 0.9^4 x (1 - 0.9)
  = 0.59 + 5 x 0.656 x 0.1
  = 0.59 + 0.33 = 0.92
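The reliability formulas from these RAID slides, collected into one sketch (function names are ours; n counts data disks, or mirrored pairs for RAID 1+0):

def raid0(r, n):  return r ** n
def raid1(r):     return 1 - (1 - r) ** 2
def raid01(r, n): return 1 - (1 - r ** n) ** 2      # two RAID-0 sets, mirrored
def raid10(r, n): return (1 - (1 - r) ** 2) ** n    # n mirrored pairs, striped
def raid3(r, n):  return r ** n + n * r ** (n - 1) * (1 - r)

print(round(raid3(0.9, 5), 2))    # 0.92
print(round(raid01(0.9, 2), 2))   # 0.96
print(round(raid10(0.9, 2), 2))   # 0.98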

RAID-4 Performance
• Similar to RAID-3, but supports larger chunks.
• Performance measures are similar to RAID-3.

RAID-5 (Distributed Parity)

[Figure: chunks 01–12 striped across five disks, with the parity blocks P0, P1, P2 distributed among the disks rather than placed on a dedicated parity disk.]

RAID-5 Performance
• In RAID-3 and RAID-4, the parity disk is a bottleneck for write operations.
• This issue is addressed in RAID-5 by distributing the parity blocks.

Reading Assignment
• Optical disks
• RAID study material
• Worked-out problems from the textbook: p 452, p 537, p 539, p 561, p 684, p 691, p 704

Amdahl's Speedup for Multiprocessors

Speedup = 1 / (Fraction_e/Speedup_e + (1 – Fraction_e))

• Speedup_e – number of processors
• Fraction_e – fraction of the program that runs in parallel on the Speedup_e processors
• Assumption: the program either runs in fully parallel (enhanced) mode, making use of all the processors, or in non-enhanced mode.
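The formula as a one-line Python helper, with an illustrative 90%-parallel workload:

def speedup(fraction_e, speedup_e):
    """Amdahl's law: overall speedup given the parallel fraction."""
    return 1 / (fraction_e / speedup_e + (1 - fraction_e))

print(speedup(0.9, 16))   # 6.4 – the 10% serial part dominates 16 processors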





Multiprocessor Architectures
• Single Instruction Stream, Single Data Stream (SISD)
• Single Instruction Stream, Multiple Data Streams (SIMD)
• Multiple Instruction Streams, Single Data Stream (MISD)
• Multiple Instruction Streams, Multiple Data Streams (MIMD)

Classification of MIMD Architectures
• Shared Memory:
  – Centralized shared-memory architecture, also called Symmetric (shared-memory) Multiprocessors (SMP) or Uniform Memory Access (UMA).
  – Distributed shared-memory architecture, also called Non-Uniform Memory Access (NUMA) architecture.
• Message Passing:
  – Multiprocessor systems based on messaging.

Shared Memory versus Message Passing

Shared memory:
1. Compatibility with the well-understood shared-memory mechanism used in centralized multiprocessor systems.
2. Simplifies compiler design.
3. No need to learn a messaging protocol.
4. Low communication overhead.
5. Caching to improve latency.

Message passing:
1. Simpler hardware compared to scalable shared memory.
2. Communication is explicit, which makes it easier to understand.
3. Improved modularity.
4. No need for the expensive and complex synchronization mechanisms used in shared memory.

Two Types of Shared Memory Architectures

[Figure: centralized shared-memory architecture – several processors, each with caches, sharing a single main memory and I/O system over a common bus.]

[Figure: distributed shared-memory architecture – processor+cache nodes, each with local memory and I/O, connected by an interconnection network.]

Symmetric versus Distributed Memory MP
• SMP: uses shared memory for inter-process communication.
  – Advantages:
    • Close coupling due to shared memory
    • Sharing of data is faster between processors
  – Disadvantages:
    • Scaling: memory is a bottleneck
    • High and unpredictable memory latency
• Distributed: uses message passing for inter-process communication.
  – Advantages:
    • Low memory latency
    • Scaling: scales better than SMP
  – Disadvantages:
    • Control and management are more complex due to distributed memory

Performance Metrics
• Communication bandwidth
• Communication latency: ideally as low as possible.
  Communication latency = sender overhead + time of flight + transmission time + receiver overhead
• Communication latency hiding
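A quick sketch of this sum; the microsecond figures describe a hypothetical link and are not measurements.

def latency_us(sender_oh, time_of_flight, msg_bytes, bytes_per_us, receiver_oh):
    """Total communication latency in microseconds."""
    transmission = msg_bytes / bytes_per_us
    return sender_oh + time_of_flight + transmission + receiver_oh

# 4 KB message over a 1 GB/s link (1000 bytes/us), 1 us overhead per side:
print(latency_us(1.0, 0.5, 4096, 1000, 1.0))   # ~6.6 us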

Cache Coherence Problem

Time | Event               | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X
0    |                     |                          |                          | 1
1    | CPU A reads X       | 1                        |                          | 1
2    | CPU B reads X       | 1                        | 1                        | 1
3    | CPU A stores 0 in X | 0                        | 1                        | 0

Cache Coherence

A memory system is coherent if:

1. A read by a processor to location X that follows a write by another processor to X returns the written value, provided the read and write are sufficiently separated in time and no other writes to X occur between the two accesses. This defines a coherent view of memory.

2. Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.

Coherency and Consistency
• Coherency and consistency are complementary.
• Coherency defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to accesses to other memory locations.

Features Supported by Coherent Multiprocessors
• Migration: shared data in a cache can be moved to another cache directly (without going through main memory). This is referred to as migration, and it happens transparently. It reduces both latency (versus going to the remote cache every time the data item is accessed) and demand on precious memory bandwidth.
• Replication: caches also provide replication for shared data items that are being simultaneously read, since each cache makes a local copy of the data item. Replication reduces both the latency of access and contention for a read-shared data item.

Migration and Replication

[Figure: centralized shared-memory architecture (SMP) – processors with caches, each cache attached to a cache controller (cc), all connected over a bus to main memory and the I/O systems.]

Cache Coherence (CC) Protocols
• CC protocols implement cache coherency.
• Two types:
  – Snooping (replicated): there are multiple copies of the sharing status. Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, and no centralized state is kept.
  – Directory based (logically centralized): there is only one copy of the sharing status of a block of physical memory. All the processors use this one copy. This copy could be in any of the participating processors.

Snooping Protocol

[Figure: SMP with a cache controller (cc) per cache snooping the shared bus between the processors, main memory, and I/O systems. Two ways to maintain coherence: 1. Write Invalidation  2. Write Broadcast.]

Invalidation with the Snooping Protocol

Processor activity    | Bus activity       | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X
                      |                    |                          |                          | 0
CPU A reads X (block) | Cache miss for X   | 0                        |                          | 0
CPU B reads X (block) | Cache miss for X   | 0                        | 0                        | 0
A writes 1 to X       | Invalidation for X | 1                        |                          | 0
B reads X             | Cache miss for X   | 1                        | 1                        | 1

Write Broadcast with the Snooping Protocol

Processor activity    | Bus activity          | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X
                      |                       |                          |                          | 0
CPU A reads X (block) | Cache miss for X      | 0                        |                          | 0
CPU B reads X (block) | Cache miss for X      | 0                        | 0                        | 0
A writes 1 to X       | Write broadcast for X | 1                        | 1                        | 1
B reads X             |                       | 1                        | 1                        | 1

Reading Assignment
• Section 7.4 (Reliability, Availability, & Dependability) in your textbook
• Pages 554 and 555
• Section 7.11 I/O Design – attempt all 5 problems in this section.

Invalidation versus Write Distribute (Update)
• Multiple writes to the same data item with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation in a write-invalidate protocol.
• With multiword cache blocks, each word written in a cache block requires a write broadcast in an update protocol, although only the first write to any word in the block needs to generate an invalidate in an invalidation protocol.
• An invalidation protocol works on cache blocks, while an update protocol must work on individual words.
• The delay between writing a word in one processor and reading the written value in another processor is usually less in a write-update scheme, since written data are immediately updated in the reader's cache. By comparison, in an invalidation protocol the reader is invalidated first, then later reads the data and is stalled until a copy can be read and returned to the processor.

Coherence Misses
• Write invalidate causes read cache misses.
  – True miss: if an item X in a block is made exclusive by a write invalidate, and a read of the same item in another processor causes a read miss, then it is a true miss.
  – False miss: if non-exclusive items in the same block cause the miss, then it is a false miss.

Use of Valid, Shared, and Dirty Bits
• Valid bit: every time a block is loaded into a cache from memory, the tag for the block is saved in the cache and the valid bit is set to TRUE. A write to the same block in a different processor may reset this valid bit due to write invalidate. Thus, when a cache block is accessed for READ or WRITE, the tag should match AND the value of the valid bit should be TRUE. If the tag matches but the valid bit is reset, then it is a cache miss.
• Shared bit: when a memory block is loaded into a cache block for the first time, the shared bit is set to FALSE. When some other cache loads the same block, it is turned to TRUE. When this block is updated, write invalidate uses the value of the shared bit to decide whether to send a write-invalidate message or not: if the shared bit is set, an invalidate message is sent; otherwise not.
• Dirty bit: the dirty bit is set to FALSE when a block is loaded into cache memory. It is set to TRUE when the block is updated for the first time. When another processor wants to load this block, the block is migrated to that processor instead of being loaded from memory.

Summary of Snooping Mechanism

Request    | Source    | State of addressed cache block | Function and explanation
Read hit   | processor | shared or exclusive | Read data in cache.
Read miss  | processor | invalid             | Place read miss on bus.
Read miss  | processor | shared              | Address conflict miss: place read miss on bus.
Read miss  | processor | exclusive           | Address conflict miss: write back block, then place read miss on bus.
Write hit  | processor | exclusive           | Write data in cache.
Write hit  | processor | shared              | Place write miss on bus.
Write miss | processor | invalid             | Place write miss on bus.
Write miss | processor | shared              | Address conflict miss: place write miss on bus.
Write miss | processor | exclusive           | Address conflict miss: write back block, then place write miss on bus.
Read miss  | bus       | shared              | No action; allow memory to service the read miss.
Read miss  | bus       | exclusive           | Attempt to share data: place cache block on bus and change state to shared.
Write miss | bus       | shared              | Attempt to write shared block: invalidate the block.
Write miss | bus       | exclusive           | Attempt to write a block that is exclusive elsewhere: write back the cache block and make its state invalid.

State Transition

[Figure: cache state transition diagram (invalid, shared, exclusive) for requests from the CPU. A CPU read miss places a read miss on the bus and moves the block to shared; a CPU write places a write miss on the bus and moves the block to exclusive; replacing an exclusive block forces a write-back before the new miss is placed on the bus.]

[Figure: cache state transition diagram for requests arriving from the bus. A read miss for an exclusive block forces a write-back (aborting the memory access) and a transition to shared; a write miss for this block invalidates it, with a write-back if it was exclusive.]
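The summary table and the two diagrams above can be condensed into a toy transition function. This is a sketch of a write-invalidate (MSI-style) protocol only: data movement, bus arbitration, and races are deliberately ignored, and the function names are ours.

def cpu_event(state, op):
    """Return (new_state, bus_action) for a CPU read or write."""
    if op == "read":
        if state == "invalid":
            return "shared", "place read miss on bus"
        return state, None                        # read hit
    if op == "write":
        if state == "exclusive":
            return "exclusive", None              # write hit, already owner
        return "exclusive", "place write miss on bus"
    raise ValueError(op)

def bus_event(state, msg):
    """Return (new_state, action) when a miss for this block is snooped."""
    if msg == "read miss" and state == "exclusive":
        return "shared", "write back block"       # share the data
    if msg == "write miss" and state == "shared":
        return "invalid", None                    # invalidate our copy
    if msg == "write miss" and state == "exclusive":
        return "invalid", "write back block"
    return state, None

print(cpu_event("invalid", "read"))        # ('shared', 'place read miss on bus')
print(bus_event("exclusive", "read miss")) # ('shared', 'write back block')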

Some Terminologies
• Polling: a process periodically checks if there is a message it needs to handle. This method of awaiting a message is called polling. Polling reduces processor utilization.
• Interrupt: a process is notified when a message arrives, using a built-in interrupt register. Interrupts increase processor utilization in comparison to polling.
• Synchronous: a process sends a message and waits for the response to come before sending another message or carrying out other tasks. This way of waiting is referred to as synchronous communication.
• Asynchronous: a process sends a message and continues to carry out other tasks while the requested message is processed. This is referred to as asynchronous communication.

Communication Infrastructure
• Multiprocessor systems with shared memory can have two types of communication infrastructure:
  – Shared bus
  – Interconnect

[Figure: processors connected over a shared bus versus over an interconnection network.]

Directory Based Protocol

[Figure: nodes 1 through n, each with its own directory, connected by an interconnect; the sharing status of each memory block lives in exactly one directory.]

State of the Block
• Shared: one or more processors have the block cached, and the value in memory is up to date.
• Uncached: no processor has a copy of the block.
• Exclusive: exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date. That processor is called the owner of the block.

Local, Remote, Home Node
• Local node: the node where the request originates.
• Home node: the node where the memory location and the directory entry of an address reside.
• Remote node: a node that has a copy of a cache block.

Shared State Operation
• Read miss: the requesting processor is sent the requested data from memory, and the requestor is added to the sharing set.
• Write miss: the requesting processor is sent the value. All the processors in the Sharer set are sent invalidate messages, and the Sharer set is updated to contain only the identity of the requesting processor. The state of the block is made exclusive.

Uncached State Operation
• Read miss: the requesting processor is sent the requested data from memory, and the requestor is made the only sharing node. The state of the block is made shared.
• Write miss: the requesting processor is sent the requested data and becomes the sharing node. The block is made exclusive to indicate that the only valid copy is cached. Sharer indicates the identity of the owner.

Exclusive State Operation
• Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the Sharer set, which still contains the identity of the processor that was the owner (since it still has a readable copy).
• Data write back: the owner processor is replacing the block and therefore must write it back. This write back makes the memory copy up to date (the home directory essentially becomes the owner), the block is now uncached, and the Sharer set is empty.
• Write miss: the block has a new owner. A message is sent to the old owner, causing the cache to invalidate the block and send the value to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharer is set to the identity of the new owner, and the state of the block remains exclusive.
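A sketch of the directory's read-miss handling across the three block states described above; message passing is reduced to returned strings, and the names are illustrative rather than from a real implementation.

def directory_read_miss(state, sharers, requester, owner=None):
    """Return (new_state, new_sharers, action) for a read miss."""
    if state == "uncached":
        return "shared", {requester}, "reply with data from memory"
    if state == "shared":
        return "shared", sharers | {requester}, "reply with data from memory"
    if state == "exclusive":
        # Fetch from the owner, write the block back to memory, then reply;
        # the old owner keeps a readable copy and stays in the sharing set.
        return "shared", sharers | {requester}, f"fetch from {owner}, then reply"
    raise ValueError(state)

print(directory_read_miss("exclusive", {"P0"}, "P1", owner="P0"))
# ('shared', {'P0', 'P1'}, 'fetch from P0, then reply')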

Cache State Transition (Directory)

[Figure: directory state transition diagram for a block (uncached, shared, exclusive). A read miss on an uncached block replies with the data value and sets Sharers = {P}; further read misses add to Sharers (Sharers += {P}); a write miss replies with the data, invalidates the sharers, and sets Sharers = {P}, moving the block to exclusive; a data write back empties Sharers and returns the block to uncached; a read miss on an exclusive block fetches from the owner and moves the block to shared.]

True Sharing Miss & False Sharing Miss

Time | Processor P1 | Processor P2
1    | Write X1     |
2    |              | Read X2
3    | Write X1     |
4    |              | Write X2
5    | Read X2      |

Synchronization
• We need a set of hardware primitives with the ability to atomically read and modify a memory location.

Example 1: Atomic Exchange
• Interchanges a value in a register with a value in memory.
• You could build a lock with this: if the memory location contains 1, the lock is held.

[Figure: register A holding 0 exchanged with a main-memory location holding 1.]

Other Operations
• Test-and-Set
• Fetch-and-Increment

Load Linked & Store Conditional
• Load Linked: a primitive operation that loads the content of a specified memory location.
• Store Conditional: paired with Load Linked by operating on the same memory location. It sets the location to a new value only if the location still holds the value read by Load Linked; otherwise it fails.

Operation of LL & SC

try:  LL     R2, 0(R1)   ; load linked
      DADDUI R3, R2, #1  ; increment
      SC     R3, 0(R1)   ; store conditional
      BEQZ   R3, try     ; branch if store fails

1. The address of the memory location loaded by LL is kept in a link register.
2. If there is an interrupt or an invalidate, this register is cleared.
3. SC checks whether this register is zero:
   a. If it is, SC fails.
   b. Otherwise it stores the content of R3 at that memory address.
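A toy Python model of rules 1–3, assuming a single link register as described; the real mechanism is in hardware, so this only mimics the success/failure behavior.

class Memory:
    def __init__(self):
        self.cells = {}
        self.link = None                 # address tracked by the last LL

    def ll(self, addr):                  # load linked
        self.link = addr
        return self.cells.get(addr, 0)

    def store(self, addr, val):          # any other store (or an invalidate)
        if self.link == addr:            # ...clears the link register
            self.link = None
        self.cells[addr] = val

    def sc(self, addr, val):             # store conditional
        if self.link != addr:
            return 0                     # fails: caller branches back to try
        self.cells[addr] = val
        self.link = None
        return 1

m = Memory()
v = m.ll(0x100)
print(m.sc(0x100, v + 1))                # 1 – no intervening store, SC succeeds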

Single Processor System
• Concurrent execution is the source of the problem. X = 0 in memory; processes P1 and P2 both test-and-set the flag:

  time  P1                  P2
        TOP: READ X
        BNEZ TOP
        (P2 is scheduled)   TOP: READ X
                            BNEZ TOP
                            WRITE 1
        (P1 is scheduled)
        WRITE 1

Both processes read X = 0 before either writes, so both write 1 and both proceed.

With Interrupt Mask
• X = 0 in memory. On a single processor, masking interrupts around the test-and-write prevents a context switch between the READ and the WRITE:

  time  P1                 P2
        MASK
        TOP: READ X
        BNEZ TOP
        WRITE 1
        UNMASK
                           MASK
                           TOP: READ X
                           BNEZ TOP
                           WRITE 1
                           UNMASK

Multiprocessor System
• Masking the interrupt will not work: the processes run on different processors, so both can execute the READ–WRITE sequence at the same time even with interrupts masked.

  time  Processor 1 (P1)   Processor 2 (P2)
        MASK               MASK
        TOP: READ X        TOP: READ X
        BNEZ TOP           BNEZ TOP
        WRITE 1            WRITE 1
        UNMASK             UNMASK

Multiprocessor System
• Uses cache coherence – bus-based systems or interconnect-based systems.
• The coherence system arbitrates writes (invalidation/write distribute).
• Thus, it serializes writes!

Spin Locks
• Locks that a processor continuously tries to acquire, spinning around a loop until it succeeds.
• Used when a lock is to be held for only a short amount of time.

Spin Lock Implementation

lockit:  LD     R2, 0(R1)   ; load the lock
         BNEZ   R2, lockit  ; not available – spin
         DADDUI R2, R0, #1  ; load locked value
         EXCH   R2, 0(R1)   ; atomic swap
         BNEZ   R2, lockit  ; branch if lock wasn't 0
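The same loop rendered in Python as a sketch; the atomic EXCH is faked with an internal threading.Lock (Python has no exchange primitive), since the point is the test-then-swap structure, not a real hardware primitive.

import threading

class SpinLock:
    def __init__(self):
        self._val = 0                      # 0 = free, 1 = held
        self._guard = threading.Lock()     # stands in for atomic EXCH

    def _exchange(self, new):
        with self._guard:                  # atomically swap, return old value
            old, self._val = self._val, new
            return old

    def acquire(self):
        while True:
            while self._val != 0:          # spin on the cached value (LD/BNEZ)
                pass
            if self._exchange(1) == 0:     # EXCH: ours if the old value was 0
                return

    def release(self):
        self._val = 0

Spinning on the plain read first keeps the swap (and its invalidation traffic) off the bus until the lock looks free, mirroring the LD/BNEZ before EXCH above.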

Cache Coherence Steps for the Spin Lock

Step | P0             | P1                                              | P2                                          | Coherence state of lock | Bus/directory activity
1    | Has lock       | Spins, testing if lock = 0                      | Spins, testing if lock = 0                  | Shared                  | None
2    | Sets lock to 0 | Invalidate received                             | Invalidate received                         | Exclusive (P0)          | Write invalidate of lock variable from P0
3    |                | Cache miss                                      | Cache miss                                  | Shared                  | Bus/directory services P2 cache miss; write back from P0
4    |                | Waits for bus                                   | Lock = 0                                    | Shared                  | Cache miss for P2 satisfied
5    |                | Lock = 0                                        | Executes swap, gets cache miss              | Shared                  | Cache miss for P1 satisfied
6    |                | Executes swap, gets cache miss                  | Completes swap: returns 0 and sets lock = 1 | Exclusive (P2)          | Bus/directory services P2 cache miss; generates invalidate
7    |                | Swap completes and returns 1, and sets lock = 1 | Enters critical section                     | Exclusive (P1)          | Bus/directory services P1 cache miss; generates write back
8    |                | Spins, testing if lock = 0                      |                                             |                         | None

Multi-threading
• Multi-threading – why and how?
• Fine-grain
  – Almost round robin
  – High overhead (compared to coarse grain)
  – Better throughput than coarse grain
• Coarse-grain
  – Threads are switched only on long stalls (e.g., an L2 cache miss)
  – Low overhead
  – Individual threads get better performance

Reading Assignment
• Memory consistency – notes (answers)
• Problem on page 596
• An example that uses LL (Load Linked) and SC (Store Conditional).

End!

IETF – Internet Engineering Task Force
• Loosely self-organized groups (working groups) of …
• Two types of documents: I-Ds and RFCs
• An I-D's life is shorter than an RFC's
• RFCs include proposed standards and standards
• RFC examples: RFC 1213, RFC 2543
• I-D -> Proposed Standard -> Standard

SIP Evolution
• SIP I-D v1 '97, v2 '98
• SIP RFC 2543 '99
• SIP RFC 3261 '02, obsoletes RFC 2543
• SIP working groups – SIPPING, SIMPLE, PINT, SPIRITS, …
• SIPit – SIP interoperability test





Predictor for SPEC92 Benchmarks

[Figure: bar chart of instructions between mispredictions (0–300) for a predicted-taken scheme versus a profile-based scheme, across the SPEC92 benchmarks ear, doduc, gcc, hydro2d, mdljdp, su2cor, compress, eqntott, espresso, and li.]
