Problem 1 • Interleaving factor – What is the optimal interleaving factor if the memory cycle is 8 CC (1 CC address + 6 CC access + 1 CC transfer)? – Assume 4 banks (b1–b4)
[Timing diagram: back-to-back read requests issued to banks b1–b4 over clock cycles 1–20; with only 4 banks, a later read request finds its bank still busy and stalls (marked X) until the bank's 8 CC memory cycle completes.]
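Since a bank is busy for its full 8 CC cycle, eight banks would let a new access begin every cycle; with only four banks, the fifth sequential access returns to a still-busy bank and stalls. The following is a minimal sketch of this timing model; the model, variable names, and output format are my own, not from the slides.

    #include <stdio.h>

    /* Simulate back-to-back reads to sequential addresses across
     * interleaved banks. Memory cycle = 8 CC per bank (1 CC to send
     * the address, 6 CC access, 1 CC to transfer the data). */
    enum { CYCLE = 8, BANKS = 4 };

    int main(void) {
        int bank_free[BANKS] = {0};    /* cycle at which each bank is free */

        for (int req = 0; req < 8; req++) {
            int b = req % BANKS;       /* sequential interleaved mapping */
            int start = req;           /* earliest issue slot on the bus */
            if (start < bank_free[b])
                start = bank_free[b];  /* bank-conflict stall */
            bank_free[b] = start + CYCLE;
            printf("read %d -> bank %d: issue @ CC %2d, data @ CC %2d\n",
                   req + 1, b + 1, start, start + CYCLE);
        }
        return 0;
    }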
Problem 2 (page 452 in your text book)
• Block size = 1 word
• Memory bus size = 1 word
• Miss rate = 3%
• Memory accesses per instruction = 1.2
• Cache miss penalty = 64 CC
• Avg cycles per instruction = 2
[Diagram: CPU → cache → simple memory; 64 CC per memory access.]
Assume 1000 instructions in your program. If there are no misses, the execution time is 2000 CC. One instruction needs 1.2 memory accesses, so 1000 instructions make 1200 accesses. With a 3% miss rate, 1200 accesses produce 36 misses. The execution time is 2000 + 36×64 = 4304 CC, and the average cycles per instruction = 4304/1000 ≈ 4.3.
Problem 2 (wider bus – 2 words)
• Block size = 4 words
• Memory bus size = 2 words
• Miss rate = 2.5%
• Memory accesses per instruction = 1.2
• Cache miss penalty = 128 CC
• Avg cycles per instruction = 2
[Diagram: CPU → cache → interleaved memory over a 2-word bus; 64 CC per memory access.]
Assume 1000 instructions in your program. If there are no misses, the execution time is 2000 CC. 1000 instructions make 1200 memory accesses; with a 2.5% miss rate, that is 30 misses. The execution time is 2000 + 30×128 = 5840 CC, and the average cycles per instruction = 5840/1000 ≈ 5.84.
Problem 2 (2-way interleaving)
• Block size = 4 words
• Memory bus size = 1 word
• Miss rate = 2.5%
• Memory accesses per instruction = 1.2
• Cache miss penalty = 68×2 CC
• Avg cycles per instruction = 2
[Diagram: CPU → cache → 2-way interleaved memory; 64 CC per memory access.]
Assume 1000 instructions in your program. If there are no misses, the execution time is 2000 CC. 1000 instructions make 1200 memory accesses; with a 2.5% miss rate, that is 30 misses. The execution time is 2000 + 30×68×2 = 6080 CC, and the average cycles per instruction = 6080/1000 ≈ 6.08.
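The three configurations differ only in miss rate and miss penalty, so they can be compared with one small program. This is a hedged sketch; the structure and names are illustrative, with the rates and penalties taken from the three slides above.

    #include <stdio.h>

    struct config {
        const char *name;
        double miss_rate;   /* misses per memory access */
        double penalty;     /* miss penalty in CC       */
    };

    int main(void) {
        const double insns = 1000.0;            /* assumed program size */
        const double base_cpi = 2.0;            /* CPI with no misses   */
        const double accesses_per_insn = 1.2;
        struct config cfgs[] = {
            { "1-word bus, 1-word block", 0.030,  64.0 },
            { "2-word bus, 4-word block", 0.025, 128.0 },
            { "2-way interleaved",        0.025, 136.0 },  /* 68 x 2 CC */
        };
        for (int i = 0; i < 3; i++) {
            double misses = insns * accesses_per_insn * cfgs[i].miss_rate;
            double cycles = insns * base_cpi + misses * cfgs[i].penalty;
            printf("%-26s: %5.0f CC, CPI = %.2f\n",
                   cfgs[i].name, cycles, cycles / insns);
        }
        return 0;
    }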
DRAM (Dynamic Random Access Memory) • Address multiplexing • RAS and CAS • Dynamic refreshing and its implications – data must be rewritten after a READ!
• Amdahl's rule of thumb – a machine capable of 1000 MIPS should have 1000 MB of memory (1 MB per MIPS)
• Read your text book for DRAM, SRAM, RAMBUS, and Flash.
SRAM (Static Random Access Memory) • Uses six transistors per bit. • Static – NO NEED to rewrite data after a READ
• Comparison versus DRAM
 – More transistors
  • Less capacity – 1/4 to 1/8
  • Expensive – 8 to 10 times
 – No refresh
  • Faster – 8 to 16 times
 – Used in caches instead of main memory
Flash • Similar to EEPROM • Ideal choice for embedded systems – Low power requirement, no moving parts – Faster READ time, slower WRITE time
• Comparison versus EEPROM – Finer and simpler control – Better capacity
• WRITE operation – the entire block must be erased and rewritten. – Writes are 50 to 100 times slower than READs
• NOR type flash and NAND type flash
Virtual Memory
[Figure: virtual address space with pages A–D at virtual addresses 0, 4K, 8K, and 12K, mapped onto a physical address space of 0–28K; each virtual page resides either in a page frame of physical main memory or on disk.]
Virtual Memory versus First-Level Cache

Parameter         First-level cache   Virtual memory
Block/page size   16–128 B            4 KB – 64 KB
Hit time          1–3 CC              50–150 CC
Miss penalty      8–150 CC            1M–10M CC
 (access time)    6–130 CC            0.8M–8M CC
 (transfer time)  2–20 CC             0.2M–2M CC
Miss rate         0.1–10%             0.00001–0.001%
Address mapping   25–45-bit physical address to 14–20-bit cache address;  32–64-bit virtual address to 25–45-bit physical address
Memory Hierarchy Questions
• Where can a block be placed?
• How is a block found if it is in?
• Which block should be replaced?
• What happens on a write?
Segmentation versus Paging
[Figure: a program's code and data regions divided into pages or segments.]

Remark                    Page                             Segment
Words per address         One                              Two (segment and offset)
Programmer visible        Invisible to the programmer      Visible to the programmer
Replacing a block         Trivial                          Hard
Memory use inefficiency   Internal fragmentation           External fragmentation
Efficient disk traffic    Yes (easy: tune the page size)   Not always (small segments)
CPU Execution Time
CPU execution time = (CPU clock cycles + memory stall cycles) × clock cycle time
• Memory stall cycles = number of misses × miss penalty
• Misses are given either as misses per 1000 instructions or as misses per memory access (the miss rate).
• The instruction count, cycles per instruction (CPI), and miss statistics are all required to compute the CPU execution time.
Average Access Time with Cache
Average access time = hit time + miss rate × miss penalty

Multi-level Cache
Avg access time = Hit Time_L1 + Miss Rate_L1 × Penalty_L1
Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Penalty_L2
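The L1 miss penalty is itself an average access time into L2, so the two formulas nest. A minimal sketch with invented parameter values (none of these numbers come from the slides):

    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1.0,  miss_l1 = 0.05;  /* L1 hit time (CC), miss rate */
        double hit_l2 = 10.0, miss_l2 = 0.20;  /* L2 hit time (CC), local miss rate */
        double mem_penalty = 100.0;            /* main-memory penalty (CC) */

        double penalty_l1 = hit_l2 + miss_l2 * mem_penalty;  /* Penalty_L1 */
        double amat = hit_l1 + miss_l1 * penalty_l1;         /* avg access time */
        printf("Penalty_L1 = %.1f CC, avg access time = %.2f CC\n",
               penalty_l1, amat);
        return 0;
    }

With these values, Penalty_L1 = 30 CC and the average access time is 2.5 CC.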
I/O Systems • Objective – mainly to understand disk storage technologies.
Storage Systems • Individual Disks • SCSI Subsystem • RAID • Storage Area Networks (NAS & SAN)
Disk Storage • Disk storage is slower than memory. • Disks offer better volume at a lower unit price. • Fits nicely into the virtual memory concept. • A critical piece of overall system performance.
Courtesy: www.shortcourses.com/ choosing/storage/06.htm
Disk System • Disk pack & Cache • Disk Embedded Controller • SCSI Host Adapter • OS (kernel/driver) • User Program
[Figure: a host with a SCSI adapter on the host I/O bus; the SCSI bus connects embedded disk controllers, each managing its own disks, and ends in a terminator (T).]
Individual Disks
Reference: http://www.stjulians.com/cs/diskstoragenotes.html
Components of Disk Access Time • Seek time • Rotational latency • Internal transfer time • Other delays • That is:
Avg access time = avg seek time + avg rotational delay + transfer time + other overhead
Problem (page 684)
• Seek time = 5 ms/100 tracks
• RPM = 10,000
• Transfer rate = 40 MB/sec
• Other delays = 0.1 ms
• Sector size = 0.5 KB (512 bytes)

Average access time =
average seek time (5 ms) + average rotational delay (time for 1/2 revolution) + transfer time (512 B / 40 MB/s) + other overhead (0.1 ms)
= 5 + (60 × 10^3)/(2 × 10000) + (512 × 10^3)/(40 × 10^6) + 0.1 ms
= 5 + 3 + 0.0128 + 0.1 = 8.1128 ms
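The same computation as a small sketch; the parameter names are my own, and the model is exactly the four-term sum above.

    #include <stdio.h>

    int main(void) {
        double seek_ms  = 5.0;       /* average seek time           */
        double rpm      = 10000.0;
        double rate_bps = 40e6;      /* transfer rate in bytes/sec  */
        double sector_b = 512.0;     /* sector size in bytes        */
        double other_ms = 0.1;       /* controller and other delays */

        double rot_ms  = 0.5 * 60.0 / rpm * 1000.0;  /* half a revolution */
        double xfer_ms = sector_b / rate_bps * 1000.0;
        printf("avg access time = %.4f ms\n",
               seek_ms + rot_ms + xfer_ms + other_ms);  /* 8.1128 ms */
        return 0;
    }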
RAID • RAID stands for Redundant Array of Inexpensive/Independent Disks • Uses a number of small disks instead of one large disk. • Six+ types of RAID (0–5).

Terminologies
• Mean Time Between Failures (MTBF)
• Mean Time To Data Loss (MTTDL)
• Mean Time To Data Inaccessibility (MTTDI)
RAID-0 Striping
[Figure: logical storage of chunks 01–12 striped across n independent disks with a stripe width of 4 chunks: disk 1 holds 01, 05, 09; disk 2 holds 02, 06, 10; disk 3 holds 03, 07, 11; disk 4 holds 04, 08, 12.]
RAID-0 Performance • Throughput: best case – nearly n × the single-disk value • Utilization: worst case – nearly (1/n) × the single-disk value • Data reliability: r^n, where r is the reliability of one disk (r ≤ 1). • Sequential access: fast • Random access: multithreaded random access offers better performance. • When r = 0.8 and n = 2, reliability is 0.64.
RAID-1 Mirroring • Issue Addressed: RAID-0 Reliability Problem. • Shadowing or mirroring is used. • Data is not lost when a disk goes down. • One or more disks can be used to mirror primary disk. • Writes are posted to primary and shadowing disks. • Read from any of them.
[Figure: a primary disk holding chunks 01–03 mirrored onto a shadow disk holding identical copies of 01–03.]
RAID-1 Performance • Reliability is improved with mirroring: 1 − (1−r)(1−r)
• Example: when r is 0.8, the reliability of RAID-1 is 1 − (0.2)² = 0.96. • Writes are more complex – they must be committed to the primary and all shadow disks. • Writes are much slower due to the atomicity requirement. • Expensive – due to the 1-to-1 redundancy.
RAID 0+1 – Striping & Mirroring
[Figure: a RAID 1 mirror built over two RAID 0 stripe sets; chunks 01–06 are striped across n disks, and the entire stripe set is mirrored onto another n disks.]
Performance RAID-0+1 • Let the reliability of a RAID 0 sub-tree be R′; then the reliability of the RAID 1 tree = 1 − (1−R′)(1−R′)
• Reliability R′ is: R′ = r² (where r is the reliability of a single disk)
• Throughput is the same as RAID-0, but with 2 × n disks • Utilization is lower than RAID-0 due to mirroring • "Write" is marginally slower due to atomicity • When r = 0.9: R′ = 0.81 and R = 1 − (0.19)² ≈ 0.96
RAID 1+0 – Mirroring & Striping
[Figure: a RAID 0 stripe built over RAID 1 mirrored pairs; each chunk (01–06) is stored on a mirrored pair of disks, and the pairs are striped.]
Performance RAID-1+0 • Let the reliability of a RAID 1 sub-tree be R′; then the reliability of the RAID 0 tree = (R′)²
• Reliability R′ is: R′ = 1 − (1−r)² (where r is the reliability of a single disk)
• Throughput is the same as RAID-0, but with 2 × n disks • Utilization is lower than RAID-0 due to mirroring • "Write" is marginally slower due to its atomicity • When r = 0.9: R′ = 0.99 and R = (0.99)² ≈ 0.98
RAID-2 Hamming Code Arrays • Low commercial interest due to complex nature of Hamming code computation.
RAID-3 Striping with Parity
[Figure: logical storage of single bits or words 01–12 striped across four data disks (disk 1: 01, 05, 09; disk 2: 02, 06, 10; disk 3: 03, 07, 11; disk 4: 04, 08, 12) plus a dedicated parity disk holding P0, P1, P2.]
RAID-3 Operation • Based on the principle of a reversible parity computation. • Parity P = C0 ⊕ C1 ⊕ … ⊕ Cn−1 ⊕ Cn • A missing stripe can be recovered as:
Cm = P ⊕ C0 ⊕ C1 ⊕ … ⊕ Cm−1 ⊕ Cm+1 ⊕ … ⊕ Cn−1 ⊕ Cn
RAID-3 Performance
• RAID-1's 1-to-1 redundancy issue is addressed by one parity disk per n data disks. Less expensive than RAID-1.
• The rest of the performance is similar to RAID-0.
• RAID-3 can withstand the failure of any one of its disks.
• Reliability = P(all disks working) + P(exactly one disk failed)
 = r^n + C(n,1) × r^(n−1) × (1−r)
• When r = 0.9 and n = 5:
 = (0.9)^5 + 5 × (0.9)^4 × (1 − 0.9)
 = 0.59 + 5 × 0.6561 × 0.1
 ≈ 0.59 + 0.33 = 0.92
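The reliability expressions from the RAID slides above can be checked together in a few lines. This is a sketch; the variable names are illustrative, with r = 0.9 and n = 5 matching the RAID-3 example.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double r = 0.9;   /* reliability of one disk   */
        int n = 5;        /* disks in the RAID-3 array */

        double raid0  = pow(r, 2);                     /* 2 striped disks    */
        double raid1  = 1.0 - (1.0 - r) * (1.0 - r);   /* one mirror pair    */
        double raid01 = 1.0 - pow(1.0 - r * r, 2);     /* mirror of stripes  */
        double raid10 = pow(1.0 - pow(1.0 - r, 2), 2); /* stripe of mirrors  */
        double raid3  = pow(r, n)                      /* all disks up, or   */
                      + n * pow(r, n - 1) * (1.0 - r); /* exactly one failed */

        printf("RAID-0: %.4f  RAID-1: %.4f  RAID-0+1: %.4f  "
               "RAID-1+0: %.4f  RAID-3: %.4f\n",
               raid0, raid1, raid01, raid10, raid3);
        return 0;
    }

It prints 0.8100, 0.9900, 0.9639, 0.9801, and 0.9185, matching the rounded values on the slides.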
RAID-4 Performance • Similar to RAID-3, but supports larger chunks. • Performance measures are similar to RAID-3.
RAID-5 (Distributed Parity)
[Figure: logical storage of chunks 01–12 striped across five disks, with the parity blocks P0, P1, P2 distributed among the disks rather than kept on a dedicated parity disk.]
RAID-5 Performance • In RAID-3 and RAID-4, Parity Disk is a bottleneck for Write operations. • This issue is addressed in RAID-5.
Reading Assignment • Optical Disk • RAID study material • Worked out problems from the text book: p 452, p 537, p 539, p561, p 684, p 691, p 704
Amdahl's Speedup for Multiprocessors
Speedup = 1 / (Fraction_e / Speedup_e + (1 − Fraction_e))
• Speedup_e – number of processors
• Fraction_e – fraction of the program that runs in parallel on Speedup_e processors
• Assumption: the program either runs in fully parallel (enhanced) mode, making use of all the processors, or in non-enhanced mode.
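A direct transcription of the formula, with invented example values (95% parallel fraction, 16 processors):

    #include <stdio.h>

    /* Multiprocessor form of Amdahl's law from the slide. */
    double speedup(double fraction_e, double speedup_e) {
        return 1.0 / (fraction_e / speedup_e + (1.0 - fraction_e));
    }

    int main(void) {
        printf("speedup = %.2f\n", speedup(0.95, 16.0));  /* about 9.14 */
        return 0;
    }

Even with 16 processors, the 5% serial fraction limits the speedup to about 9.1.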
Multiprocessor Architectures • Single Instruction Stream, Single Data Stream (SISD): • Single Instruction Stream, Multiple Data Stream (SIMD) • Multiple Instruction Streams, Single Data Stream (MISD) • Multiple Instruction Streams, Multiple Data Streams (MIMD)
Classification of MIMD Architectures • Shared Memory: – Centralized shared-memory architecture, also known as Symmetric (shared-memory) Multiprocessor (SMP) or Uniform Memory Access (UMA). – Distributed shared-memory architecture, also known as Non-Uniform Memory Access (NUMA) architecture.
• Message Passing: – Multiprocessor Systems based on messaging
Shared Memory versus Message Passing

No.  Shared Memory                                         Message Passing
1    Compatibility with the well-understood shared-memory  Simpler hardware compared to scalable shared memory
     mechanisms used in centralized multiprocessor systems
2    Simplifies compiler design                            Communication is explicit, which makes it easier to understand
3    No need to learn a messaging protocol                 Improved modularity
4    Low communication overhead                            No need for the expensive and complex synchronization
                                                           mechanisms used in shared memory
5    Caching to improve latency                            –
Two Types of Shared Memory Architectures
[Figure: centralized shared-memory architecture – several processors, each with its own cache, share one main memory and the I/O systems over a common bus.]
[Figure: distributed shared-memory architecture – processor + cache nodes, each with local memory and I/O, connected through an interconnection network.]
Symmetric versus Distributed Memory MP
• SMP: uses shared memory for inter-process communication
 – Advantages:
  • Close coupling due to shared memory
  • Sharing of data is faster between processors
 – Disadvantages:
  • Scaling: memory is a bottleneck
  • High and unpredictable memory latency
• Distributed: uses message passing for inter-process communication
 – Advantages:
  • Low memory latency
  • Scaling: scales better than SMP
 – Disadvantages:
  • Control and management are more complex due to distributed memory
Performance Metrics • Communication bandwidth • Communication latency: ideally as low as possible. Communication latency = sender overhead + time of flight + transmission time + receiver overhead • Communication latency hiding
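A sketch of the latency sum; all four values are invented for illustration (in microseconds):

    #include <stdio.h>

    int main(void) {
        double sender_overhead   = 1.0;
        double time_of_flight    = 0.5;
        double transmission_time = 2.0;   /* message size / bandwidth */
        double receiver_overhead = 1.5;
        printf("latency = %.2f us\n", sender_overhead + time_of_flight
               + transmission_time + receiver_overhead);
        return 0;
    }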
Cache Coherence Problem

Time  Event                  Cache contents  Cache contents  Memory contents
                             for CPU A       for CPU B       for location X
0                                                            1
1     CPU A reads X          1                               1
2     CPU B reads X          1               1               1
3     CPU A stores 0 in X    0               1               0
Cache Coherence
A memory system is coherent if:
1. A read by a processor P to location X that follows a write by P to X, with no writes to X by another processor in between, always returns the value written by P. This preserves program order.
2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses. This defines a coherent view of memory.
3. Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.
Coherency and Consistency • Coherency and consistency are complementary.
• Coherency defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to other memory locations.
Features Supported by Coherent Multiprocessors • Migration: shared data in a cache can be moved to another cache directly (without going through main memory). This is referred to as migration; it happens transparently. It reduces both latency (avoiding a remote access every time the data item is used) and the demand on precious memory bandwidth.
• Replication: Cache also provides replication for shared data items that are being simultaneously read, since the caches make a copy of the data item in the local cache. Replication reduces both latency of access and contention for a read shared data item.
Migration and Replication in a Centralized Shared-Memory Architecture (SMP)
[Figure: SMP with several processors, each with a cache and a cache controller (cc), sharing main memory and the I/O systems over a common bus; the cache controllers implement migration and replication.]
Cache Coherence (CC) Protocols • CC protocols implement cache coherence • Two types: – Snooping (replicated): there are multiple copies of the sharing status. Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, and no centralized state is kept. – Directory-based (logically centralized): there is only one copy of the sharing status of a block of physical memory, and all processors use this one copy. The copy could be in any of the participating processors.
Snooping Protocol
[Figure: centralized shared-memory architecture (SMP) – processors with caches and cache controllers (cc) on a shared bus with main memory and the I/O systems.]
Two ways:
1. Write Invalidation
2. Write Broadcast
Invalidation with a Snooping Protocol

Processor activity     Bus activity         Cache contents  Cache contents  Memory contents
                                            for CPU A       for CPU B       for location X
                                                                            0
CPU A reads X (block)  Cache miss for X     0                               0
CPU B reads X (block)  Cache miss for X     0               0               0
A writes 1 to X        Invalidation for X   1                               0
B reads X              Cache miss for X     1               1               1
Write Broadcast with a Snooping Protocol

Processor activity     Bus activity            Cache contents  Cache contents  Memory contents
                                               for CPU A       for CPU B       for location X
                                                                               0
CPU A reads X (block)  Cache miss for X        0                               0
CPU B reads X (block)  Cache miss for X        0               0               0
A writes 1 to X        Write broadcast for X   1               1               1
B reads X                                      1               1               1
Reading Assignment • Section 7.4 (Reliability, Availability, & Dependability) in your text book • Page 554 and 555 • Section 7.11 I/O Design – attempt all 5 problems in this section.
Invalidation versus Write Update
• Multiple writes to the same data item with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation in a write-invalidate protocol.
• With multiword cache blocks, each word written in a cache block requires a write broadcast in an update protocol, although only the first write to any word in the block needs to generate an invalidate in an invalidation protocol.
• An invalidation protocol works on cache blocks, while an update protocol must work on individual words.
• The delay between writing a word in one processor and reading the written value in another processor is usually less in a write-update scheme, since written data are immediately updated in the reader's cache. By comparison, in an invalidation protocol the reader is invalidated first, then later reads the data and is stalled until a copy can be read and returned to the processor.
Coherence Misses • Write invalidate causes read cache misses – True miss: if an item X in a block is made exclusive by a write invalidate, and a read of that same item in another processor causes a read miss, it is a true sharing miss. – False miss: if the miss is caused by reading a different, unwritten item in the same (invalidated) block, it is a false sharing miss.
Use of Valid, Shared, and Dirty Bits
• Valid bit: every time a block is loaded into a cache from memory, the tag for the block is saved in the cache and the valid bit is set to TRUE. A write to the same block by a different processor may reset this valid bit due to write invalidate. Thus, when a cache block is accessed for READ or WRITE, the tag must match AND the valid bit must be TRUE. If the tag matches but the valid bit is reset, it is a cache miss.
• Shared bit: when a memory block is loaded into a cache block for the first time, the shared bit is set to FALSE. When some other cache loads the same block, it is turned to TRUE. When the block is updated, write invalidate uses the value of the shared bit to decide whether to send a write-invalidate message: if the shared bit is set, an invalidate message is sent; otherwise not.
• Dirty bit: the dirty bit is set to FALSE when a block is loaded into cache memory, and set to TRUE when the block is updated the first time. When another processor wants to load this block, the block is migrated to that processor instead of being loaded from memory.
Summary of the Snooping Mechanism

Request     Source     State of addressed   Function and explanation
                       cache block
Read hit    processor  shared or exclusive  Read data in cache.
Read miss   processor  invalid              Place read miss on bus.
Read miss   processor  shared               Address conflict miss: place read miss on bus.
Read miss   processor  exclusive            Address conflict miss: write back block, then place read miss on bus.
Write hit   processor  exclusive            Write data in cache.
Write hit   processor  shared               Place write miss on bus.
Write miss  processor  invalid              Place write miss on bus.
Write miss  processor  shared               Address conflict miss: place write miss on bus.
Write miss  processor  exclusive            Address conflict miss: write back block, then place write miss on bus.
Read miss   bus        shared               No action; allow memory to service the read miss.
Read miss   bus        exclusive            Attempt to share data: place cache block on bus and change state to shared.
Write miss  bus        shared               Attempt to write shared block; invalidate the block.
Write miss  bus        exclusive            Attempt to write block that is exclusive elsewhere: write back the cache block and make its state invalid.
State Transitions Based on CPU Requests
[Figure: cache state diagram with states Invalid, Shared (read only), and Exclusive (read/write). A CPU read from Invalid places a read miss on the bus and moves to Shared; a CPU write from Invalid or Shared places a write miss on the bus and moves to Exclusive; a CPU read miss in Shared places a read miss on the bus; a CPU read or write miss in Exclusive writes back the cache block before placing the miss on the bus; CPU read hits and write hits leave the state unchanged.]
State Transitions Based on Bus Requests
[Figure: a write miss for this block seen on the bus moves Shared to Invalid; from Exclusive it writes back the block (aborting the memory access) and moves to Invalid; a read miss for this block seen on the bus moves Exclusive to Shared, also writing back the block and aborting the memory access.]
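The table and the two diagrams can be condensed into a small state machine. This is an illustrative sketch (one block, one cache, bus messages stubbed out as prints), not code from the textbook:

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE } State;

    /* Transition on a request from this cache's own CPU. */
    State cpu_request(State s, int is_write) {
        if (is_write) {
            if (s != EXCLUSIVE) puts("place write miss on bus");
            return EXCLUSIVE;                /* gain ownership */
        }
        if (s == INVALID) puts("place read miss on bus");
        return (s == EXCLUSIVE) ? EXCLUSIVE : SHARED;
    }

    /* Transition when another cache's miss is snooped on the bus. */
    State bus_request(State s, int is_write) {
        if (s == EXCLUSIVE) puts("write back block");  /* supply dirty data */
        if (is_write) return INVALID;                  /* remote write      */
        return (s == INVALID) ? INVALID : SHARED;      /* remote read       */
    }

    int main(void) {
        State s = INVALID;
        s = cpu_request(s, 0);   /* read miss: Invalid -> Shared      */
        s = cpu_request(s, 1);   /* write: Shared -> Exclusive        */
        s = bus_request(s, 1);   /* remote write: write back, Invalid */
        printf("final state = %d\n", s);
        return 0;
    }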
Some Terminologies • Polling: a process periodically checks whether there is a message it needs to handle. This method of awaiting a message is called polling. Polling reduces processor utilization. • Interrupt: a process is notified when a message arrives using a built-in interrupt register. Interrupts increase processor utilization in comparison to polling. • Synchronous: a process sends a message and waits for the response to come before sending another message or carrying out other tasks. This way of waiting is referred to as synchronous communication. • Asynchronous: a process sends a message and continues to carry out other tasks while the requested message is processed. This is referred to as asynchronous communication.
Communication Infrastructure • Multiprocessor systems with shared memory can have two types of communication infrastructure: – Shared bus – Interconnect
[Figure: processors connected over a shared bus versus over a point-to-point interconnect.]
Directory-Based Protocol
[Figure: nodes 1 … n, each holding a portion of the directory, connected through an interconnect; the sharing status of each memory block is kept in the directory at its home node.]
State of the Block • Shared: one or more processors have the block cached, and the value in memory is up to date. • Uncached: no processor has a copy of the block. • Exclusive: exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date. That processor is called the owner of the block.
Local, Remote, and Home Nodes • Local node: the node where the request originates. • Home node: the node where the memory location and the directory entry of an address reside. • Remote node: a node that has a copy of the cache block.
Shared State Operation • Read miss: the requesting processor is sent the requested data from memory, and the requestor is added to the sharing set. • Write miss: the requesting processor is sent the value. All processors in the Sharers set are sent invalidate messages, and the Sharers set is updated to contain only the identity of the requesting processor. The state of the block is made exclusive.
Uncached State Operation • Read miss: the requesting processor is sent the requested data from memory, and the requestor is made the only sharing node. The state of the block is made shared. • Write miss: the requesting processor is sent the requested data and becomes the sharing node. The block is made exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
Exclusive State Operation
• Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the Sharers set, which still contains the identity of the processor that was the owner (since it still has a readable copy).
• Data write-back: the owner processor is replacing the block and therefore must write it back. The write-back makes the memory copy up to date (the home directory essentially becomes the owner), the block is now uncached, and the Sharers set is empty.
• Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to invalidate the block and send the value to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block remains exclusive.
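A sketch of a directory entry and the shared/uncached-state actions described above; the exclusive-state owner fetch is omitted, message sends are stubbed out as prints, and all names are illustrative:

    #include <stdio.h>

    #define MAX_NODES 8

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } DirState;

    typedef struct {
        DirState state;
        int sharers[MAX_NODES];   /* sharer set kept as a bitmap */
    } DirEntry;

    void read_miss(DirEntry *d, int node) {
        printf("send data value reply to node %d\n", node);
        d->sharers[node] = 1;     /* add requestor to the sharing set */
        d->state = SHARED;
    }

    void write_miss(DirEntry *d, int node) {
        for (int i = 0; i < MAX_NODES; i++)
            if (d->sharers[i] && i != node) {
                printf("send invalidate to node %d\n", i);
                d->sharers[i] = 0;
            }
        printf("send data value reply to node %d\n", node);
        d->sharers[node] = 1;     /* requestor becomes the only sharer */
        d->state = EXCLUSIVE;     /* ... and the owner */
    }

    int main(void) {
        DirEntry d = { UNCACHED, {0} };
        read_miss(&d, 1);    /* node 1 becomes a sharer        */
        read_miss(&d, 2);    /* node 2 joins the sharing set   */
        write_miss(&d, 3);   /* nodes 1, 2 invalidated; 3 owns */
        return 0;
    }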
Cache State Transitions in the Directory Protocol
[Figure: state diagram for a block in the directory, with states Uncached, Shared (read only), and Exclusive (read/write). A read miss sends a data value reply and sets Sharers = {P} (Uncached → Shared), or adds P to Sharers (Sharers += {P}) on a read miss in Shared; in Exclusive, a read miss fetches the block from the owner, replies with the data, and moves to Shared. A write miss invalidates the sharers, replies with the data, sets Sharers = {P}, and moves the block to Exclusive. A data write-back sets Sharers = {} and returns the block to Uncached.]
True Sharing Miss & False Sharing Miss
(Assume x1 and x2 reside in the same cache block.)

Time  Processor P1  Processor P2
1     Write x1
2                   Read x2
3     Write x1
4                   Write x2
5     Read x2
Synchronization • We need a set of hardware primitives with the ability to atomically read and modify a memory location.
Example 1: Atomic Exchange • Interchanges a value in a register with a value in memory. • You could build a lock with this: if the memory location contains 1, the lock is held.
[Figure: register A holding 0 is atomically swapped with a main-memory location holding 1.]
Other Operations • Test-and-Set • Fetch-and-Increment
Load Linked & Store Conditional • Load linked: a primitive operation that loads the content of a specified location into a register. • Store conditional: related to the load linked operation by operating on the same memory location. It sets the location to a new value only if the location still contains the same value read by the load linked; otherwise it fails.
Operation of LL & SC

try:  LL     R2, 0(R1)   ; load linked
      DADDUI R3, R2, #1  ; increment
      SC     R3, 0(R1)   ; store conditional
      BEQZ   R3, try     ; branch if store fails

1. The address of the memory location loaded by LL is kept in a link register.
2. If there is an interrupt or an invalidate, this register is cleared.
3. SC checks whether the link register is zero:
 a. If it is, the SC fails.
 b. Otherwise it stores the content of R3 at that memory address.
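In C, the same retry loop is usually written with the C11 atomics library, which compiles down to LL/SC (or compare-and-swap) on the target machine. A hedged sketch of an atomic fetch-and-increment:

    #include <stdatomic.h>
    #include <stdio.h>

    void atomic_increment(atomic_int *p) {
        int old = atomic_load(p);               /* like the LL */
        /* Retry while another processor intervenes, like a failing SC;
         * on failure, old is reloaded with the current value. */
        while (!atomic_compare_exchange_weak(p, &old, old + 1))
            ;
    }

    int main(void) {
        atomic_int counter = 0;
        atomic_increment(&counter);
        printf("%d\n", atomic_load(&counter));  /* prints 1 */
        return 0;
    }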
Single Processor System • Concurrent execution is the source of the problem (X = 0 in memory)
[Figure: processes P1 and P2 time-share one processor. P1 executes READ X and BNEZ TOP; P2 is then scheduled and executes the full READ X / BNEZ TOP / WRITE 1 sequence; when P1 is scheduled again, it also executes WRITE 1, so both processes acquire the lock.]
With Interrupt Mask
[Figure: the same two processes with X = 0 in memory, but each brackets the READ X / BNEZ TOP / WRITE 1 sequence with MASK and UNMASK instructions, so a process cannot be preempted between testing X and writing 1 to it.]
Multiprocessor System • Masking the interrupt will not work (X = 0 in shared memory)
[Figure: processors P1 and P2 each run the MASK / READ X / BNEZ TOP / WRITE 1 / UNMASK sequence; masking only disables local interrupts, so both processors can still read X = 0 and write 1 concurrently.]
Multiprocessor System • Uses Cache Coherence – Bus Based Systems or Interconnect Based System • Coherence system arbitrates writes (Invalidation/write distribute) • Thus, serializes writes!
Spin Locks • Locks that a processor continuously tries to acquire, spinning around a loop until it succeeds. • Used when a lock is expected to be held for only a short amount of time.
Spin Lock Implementation

lockit: LD     R2, 0(R1)   ; load of lock
        BNEZ   R2, lockit  ; not available - spin
        DADDUI R2, R0, #1  ; load locked value
        EXCH   R2, 0(R1)   ; swap
        BNEZ   R2, lockit  ; branch if lock wasn't 0
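The same test-and-test-and-set pattern in C11 atomics: spin on a plain load (which hits in the local cache) and attempt the expensive atomic exchange only when the lock looks free. A sketch, not a drop-in library routine:

    #include <stdatomic.h>

    void spin_lock(atomic_int *lock) {
        for (;;) {
            while (atomic_load(lock) != 0)
                ;                               /* spin in the local cache */
            if (atomic_exchange(lock, 1) == 0)  /* EXCH: swap in a 1       */
                return;                         /* lock was 0: we hold it  */
        }
    }

    void spin_unlock(atomic_int *lock) {
        atomic_store(lock, 0);
    }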
Cache Coherence Steps for the Spin Lock

Step  P0             P1                               P2                               Coherence state of lock  Bus/directory activity
1     Has lock       Spins, testing if lock = 0       Spins, testing if lock = 0       Shared                   None
2     Set lock to 0  (Receives invalidate)            (Receives invalidate)            Exclusive (P0)           Write invalidate of lock variable from P0
3                    Cache miss                       Cache miss                       Shared                   Bus/directory services P2 cache miss; write back from P0
4                    Waits for bus                    Lock = 0                         Shared                   Cache miss for P2 satisfied
5                    Lock = 0                         Executes swap, gets cache miss   Shared                   Cache miss for P1 satisfied
6                    Executes swap, gets cache miss   Completes swap: returns 0
                                                      and sets lock = 1                Exclusive (P2)           Bus/directory services P2 cache miss; generates invalidate
7                    Swap completes and returns 1,    Enters critical section          Exclusive (P1)           Bus/directory services P1 cache miss; generates write back
                     and sets lock = 1
8                    Spins, testing if lock = 0                                                                 None
Multi-threading • Multi-threading – why and how? • Fine-grain – Almost round-robin switching – High overhead (compared to coarse grain) – Better throughput than coarse grain
• Coarse-grain – Threads switched only on long stalls (e.g., an L2 cache miss) – Low overhead – Individual threads get better performance
Reading Assignment • Memory Consistency – Notes (Answers)
• Problem on page 596 • An example that uses LL (load linked) and SC (store conditional).
End!
IETF – Internet Engineering Task Force • Loosely self-organized groups (working groups) of … • Two types of documents: I-Ds and RFCs • An I-D's lifetime is shorter than an RFC's • RFCs include proposed standards and standards • RFC examples: RFC 1213, RFC 2543 • I-D -> Proposed Standard -> Standard
SIP Evolution • SIP I-D v1 '97, v2 '98 • SIP RFC 2543 '99 • SIP RFC 3261 '02, obsoletes RFC 2543 • SIP working groups – SIPPING, SIMPLE, PINT, SPIRITS, … • SIPit – SIP interoperability test
•
Predictor for SPEC92 Benchmarks 300 250
Instructions between 200 mispredictio ns 150
P re d ic te d T a k e n P r o fi le b a s e d
100 50
ear
dod uc
gcc
hyd ro2 d md ijdp su2 cor
com
pre ss eqn tot esp res so
li
0
B e n ch m a rk