Performance

  • Uploaded by: api-26072581
  • 0
  • 0
  • July 2020
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Performance as PDF for free.

More details

  • Words: 2,817
  • Pages: 42
Performance

CSE 313:Computer Architecture

1

Performance 

Evaluation is a key component in the iterative architecture design process 



Also important in many other cases     



Performance is crucial to system design Purchasing a computer Evaluating new technologies Writing software Implementing an instruction set Designing a new architecture

Need to quantify performance

CSE 313:Computer Architecture

2

Performance What is performance?  Performance Metrics  How to evaluate and summarize performance?  Performance Pitfalls 

CSE 313:Computer Architecture

3

Airplane Example

 

Which one is the fastest for transporting 1 passenger? Which one is the fastest for transporting 5000 passengers?

CSE 313:Computer Architecture

4

Computer Performance: TIME, TIME, TIME 

Response Time (latency)/ Execution Time — How long does it take for my job to run? — How long does it take to execute a job? — How long must I wait for the database query?



Throughput — How many jobs can the machine run at once? — What is the average execution rate? — How much work is getting done?



If we upgrade a machine with a new processor what do we improve?



If we add a new machine to the lab what do we improve?

CSE 313:Computer Architecture

5

Book's Definition of Performance 

For some program running on machine X, PerformanceX = 1 / Execution timeX



"X is n times faster than Y“ (speedup) PerformanceX / PerformanceY = n = Execution time y / Execution time x



Hard Real Time/soft real time 

Embedded System and DVD playback CSE 313:Computer Architecture

6

Example Machine A runs a program in 20 seconds  Machine B runs the same program in 25 seconds  How much faster is A than B?  Performance ratio 25 / 20 = 1.25  A is 1.25 times faster than B. (the speedup of A over B is 1.25) 

CSE 313:Computer Architecture

7

Time 

Elapsed time/Wall Clock time/response time  

Counts everything (disk and memory accesses, I/O, operating system overhead etc.) A useful number, but often not good for comparison purposes 



CPU time/CPU Execution Time  





Doesn’t count I/O or time spent in running other programs Can be broken into system CPU time and user CPU time

Our focus: user CPU time 



Time sharing among multiple programs

CPU time spent in executing the lines of code that are “in” our program

CPU performance System Performance

CSE 313:Computer Architecture

8

Clock Cycles 

Instead of reporting execution time in seconds, we often use cycles seconds = cycles * seconds / cycle



Clock “ticks” indicate when to start activities (one abstraction):



cycle time = time between ticks = seconds per cycle clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)



time

A 2 Ghz. clock has a

1 / 2×109 = 0.5 nano-second (ns) cycle time

CSE 313:Computer Architecture

9

How to Improve Performance CPU time = # of Cycles * Clock cycle time = # of Cycles / Clock rate

So, to improve performance (everything else being equal) you can either ________ the # of required cycles for a program, or ________ the clock cycle time or, said another way, ________ the clock rate.

CSE 313:Computer Architecture

10

Example 

Our favorite program runs in 8 seconds on computer A, which has a 4 GHz. clock. We are trying to help a computer designer build a new machine B, that will run this program in 4 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?"



Don't Panic, you can easily work this out from basic principles

CSE 313:Computer Architecture

11

How many cycles are required for a program?

...

6th

5th

4th

3rd instruction

2nd instruction

Can we assume that # of cycles = # of instructions? 1st instruction



time

This assumption is incorrect, different instructions take different amounts of time on different machines. CSE 313:Computer Architecture

12

Different numbers of cycles for different instructions 

Multiplication takes more time than addition



Floating point operations take longer than integer ones



Accessing memory takes more time than accessing registers

CSE 313:Computer Architecture

13

Measure Clock Cycles 

CPU clock cycles/program is not an intuitive or easily determined value Clock cycles = # instructions * average clock cycles per instruction



Cycles per instruction (CPI) is often used

Time = # instructions * CPI * clock cycle time  

CPI is an average since the number of cycles per instruction varies from instruction to instruction CPI varies by application, as well as among implementation with the same instruction set   

Number of cycles for each instruction Frequency of instructions (instruction mix) Memory access time

CSE 313:Computer Architecture

14

CPI Example 

It takes a program of 1,000,000 instructions 1.2ms to run on computer A. Computer A has a clock rate of 2 GHz. What is the CPI of the program on computer A?

CSE 313:Computer Architecture

15

CPI Example 

Suppose we have two implementations of the same instruction set architecture (ISA). For some program, Machine A has a clock cycle time of 1 ns. and a CPI of 2.0 Machine B has a clock cycle time of 2 ns. and a CPI of 1.2 Which machine is faster for this program, and by how much?

CSE 313:Computer Architecture

16

# of Instructions Example 

A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require 1, 2, and 3 cycles respectively. The and The and

first code sequence has 5 instructions: 2 of A, 1 of B, 2 of C second sequence has 6 instructions: 4 of A, 1 of B, 1 of C.

Which sequence will be faster? How much faster? What is the CPI for each sequence?

CSE 313:Computer Architecture

17

CPI Variability 



Different instruction types require different number of cycles per instruction CPI is often reported for classes of instructions n

total clock cycles = ∑ CPI i × I i i =1

where CPIi is the CPI for the instruction in class i and Ii is the count of instructions in class i

CSE 313:Computer Architecture

18

Computing CPI 

To compute the overall average CPI use CPI =

∑ CPI ⋅# instruction i

i

i

total

# instruction Ii = ∑ CPI i i ∑ Ii = ∑ CPI i ⋅ Pi i

∑ P =1 i

i

Where Pi is the percentage (frequency) of instruction in class i CSE 313:Computer Architecture

19

Computing CPI Example Instruction Type



CPI

Frequency

ALU

1

50%

Branch

2

20%

Load

2

20%

Store

2

10%

Average CPI is 1 x 50% + 2 x 20% + 2 x 20% + 2 x 10% = 0.5 + 0.4 + 0.4 +0.2 = 1.5

CSE 313:Computer Architecture

20

Tradeoffs 

Instruction count, CPI, and clock cycle present tradeoffs 

RISC – reduced instruction set computer (MIPS)   



Simple instructions Higher instruction counts for an application Lower CPI

CISC – complex instruction set computer (IA-32)   

More complex instructions Lower instruction counts for an application Higher CPI

CSE 313:Computer Architecture

21

Understanding Program Performance

CSE 313:Computer Architecture

22

MIPS 

MIPS (millions instructions per second)

instruction count clock rate MIPS = = 6 execution time × 10 CPI × 10 6



Is it a good measure for performance?

CSE 313:Computer Architecture

23

MIPS example 

Two different compilers are being tested for a 1 GHz. machine with three different classes of instructions: Class A, Class B, and Class C, which require 1, 2, and 3 cycles (respectively). Both compilers are used to produce code for a large piece of software. The first compiler's code uses 5 billion Class A instructions, 1 billion Class B instructions, and 1 billion Class C instructions. The second compiler's code uses 10 billion Class A instructions, 1 billion Class B instructions, and 1 billion Class C instructions.

 

Which sequence will be faster according to MIPS? Which sequence will be faster according to execution time?

CSE 313:Computer Architecture

24

Performance  

Performance is determined by execution time Do any of the other variables equal performance?     



# of cycles to execute program? # of instructions in program? # of cycles per second? average # of cycles per instruction? average # of instructions per second?

Common pitfall: thinking one of the variables is indicative of performance when it really isn’t.

CSE 313:Computer Architecture

25

Now that we understand Performance 



A given program will require 

some number of instructions (machine instructions)



some number of cycles



some number of seconds

We have a vocabulary that relates these quantities: 

cycle time (seconds per cycle)



clock rate (cycles per second)



CPI (cycles per instruction) a floating point intensive application might have a higher CPI



MIPS (millions of instructions per second) this would be higher for a program using simple instructions

CSE 313:Computer Architecture

26

Example 

If two machines have the same ISA which of the following quantities will always be identical for a same program and compiler?     

clock rate # of instructions CPI execution time MIPS

CSE 313:Computer Architecture

27

Example Consider a program on a computer of two classes of instructions



 

A: CPI = 2, frequency = 40% B: CPI = 4, frequency = 60%

What’s the CPI of this machine? If the CPI of the instruction class B is reduced to 3 without changing clock rate, how much faster is the new machine? What’s its CPI? If we can also reduce the number of class A instructions to 50% of the original for the program, how much faster is the new machine? What’s its CPI?

  

CPI = 2 * 40% + 4 * 60% = 3.2 CPI = 2 * 40% + 3 * 60% = 2.6. Because both have the same number of instructions and clock rate, the ratio of execution time is the ratio of the CPI. The speedup is 3.2 / 2.6 = 1.23 times faster 3. Assume originally there are I instructions, then the number of cycles is 3.2*I. The number of cycles after reducing CPI of B and instructions of A is 2 * 0.2 * I + 3 * 0.6 * I = 2.2 *I So the speedup is 3.2 / 2.2 = 1.45 The CPI of the new machine is 2.2 * I / 0.8 * I = 2.75 We can not compute CPI as 2 * 20% + 3 * 60% = 2.2. Why? 1. 2.

CSE 313:Computer Architecture

28

Tools for Evaluating Performance 

Estimations 



Analysis 



Queuing theory

Simulations 



Area/delay estimate

ISA execution, circuit, gate

Benchmarks

CSE 313:Computer Architecture

29

Benchmarks 





Performance best determined by running real applications  Use programs typical of expected workload  Or, typical of expected class of applications e.g., compilers/editors, scientific applications, graphics, etc.  Not always feasible Small benchmarks  nice for architects and designers  easy to standardize  can be abused SPEC (System Performance Evaluation Cooperative)  companies have agreed on a set of real program and inputs  valuable indicator of performance (and compiler technology)  require upgraded to current software applications: SPEC ’89, SPEC ’92, SPEC ’95, SPEC ’2000 (www.spec.org)  can still be abused

CSE 313:Computer Architecture

30

SPEC CPU2000

CSE 313:Computer Architecture

31

Benchmark Games 

An embarrassed Intel Corp. acknowledged Friday that a bug in a software program known as a compiler had led the company to overstate the speed of its microprocessor chips on an industry benchmark by 10 percent. However, industry analysts said the coding error…was a sad commentary on a common industry practice of “cheating” on standardized performance tests…The error was pointed out to Intel two days ago by a competitor, Motorola …came in a test known as SPECint92…Intel acknowledged that it had “optimized” its compiler to improve its test scores. The company had also said that it did not like the practice but felt to compelled to make the optimizations because its competitors were doing the same thing…At the heart of Intel’s problem is the practice of “tuning” compiler programs to recognize certain computing problems in the test and then substituting special handwritten pieces of code… Saturday, January 6, 1996 New York Times

CSE 313:Computer Architecture

32

Compare Performance Across Multiple Programs 

Benchmark results are often contradictory Computer A

Computer B

Computer C

Program 1 (seconds)

1

10

20

Program 2 (seconds)

1000

100

20

Program 3 (seconds)

1001

110

40

A is 10 times faster than B for program 1 B is 10 times faster than A for program 2 A is 20 times faster than C for program 1 C is 50 times faster than A for program 2 B is 2 times faster than C for program 1 C is 5 times faster than B for program 2 Each statement is correct. But we just want to know which machine is the fastest.

CSE 313:Computer Architecture

33

Summarizing Results Use (weighted) arithmetic mean for with execution times

CSE 313:Computer Architecture

34

Example Computer A

  

Computer B

Program 1

2s (60%)

4s (60%)

Program 2

16s (40%)

4s (40%)

Computer A average time = 2x60% + 16x40% = 7.6s Computer B average time = 4x60% + 4x40% = 4s Computer B is 7.6/4 = 1.9 times faster than A for the benchmark

CSE 313:Computer Architecture

35

SPEC ‘95 Does doubling the clock rate double the performance?

10

10

9

9

8

8

7

7

6

6

SPECfp

SPECint

Can a machine with a slower clock rate have better performance?

5

5

4

4

3

3

2

2

1

1 0

0 50

100

150 Clock rate (MHz)

200

250

50

Pentium Pentium Pro

CSE 313:Computer Architecture

100

150

Clock rate (MHz)

200

250

Pentium Pentium Pro

36

Amdahl's Law Execution Time After Improvement = Execution Time Unaffected +( Execution Time Affected / speedup )

ExTimeold Speedup = = ExTimenew

1 Fractionenhanced (1 − Fractionenhanced ) + Speedupenhanced

CSE 313:Computer Architecture

37

Example 

Floating point instructions of a computer to run 2 times faster; but only 10% of actual time is used for floating point operations. How much faster will the computer become?

ExTimenew = ExTimeold × (0.9 + 0.1 / 2) = 0.95 × ExTimeold Speedup = 1 / 0.95 = 1.053

The speedup resulting from an enhancement is bounded by the inverse of the fraction that is not enhanced.

CSE 313:Computer Architecture

38

Amdahl’s Law: Corollary 

Principle: Make the common case fast



Examples: 



All instructions requires an instruction fetch, only a fraction require a data fetch/store => optimize instruction access over data access Programs spend a lot of time accessing memory and exhibit spatial and temporal locality =>design a storage hierarchy such that the most frequent access are to the smallest (fastest) memories

CSE 313:Computer Architecture

39

Example 

"Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?" How about making it 5 times faster?

CSE 313:Computer Architecture

40

Example 





A program takes 10 seconds to run on the current computer. The program spends 40% of its time on floating-point operations, 40% on integer operations, and 20% on I/O operations. If you can make the floating-point operations 2 times faster, what is the overall speedup of the program? If we want the whole program to run 2 times faster, how much do we need to improve the speed of integer operations, in additional to doubling the speed of floating-point operations?

CSE 313:Computer Architecture

41

Remember 

Performance is specific to a particular program/s 



For a given architecture performance increases come from:   



Execution time is a consistent summary of performance

increases in clock rate (without adverse CPI affects) improvements in processor organization that lower CPI compiler enhancements that lower CPI and/or instruction count

Pitfall: expecting improvement in one aspect of a machine’s performance to significantly affect the total performance



References: 

4.1 4.2



4.5



See Slides and solve some exercise

CSE 313:Computer Architecture

42

Related Documents

Performance
November 2019 45
Performance
July 2020 26
Performance
June 2020 37
Performance
April 2020 40
Performance
April 2020 14
Performance
May 2020 11