Pipelined Instruction Processing In Computer Organization and Architecture
Vishal Patyal Reg. No = 3050060122 97806-66742 (M) (0190) 5230-650 (Home)
[email protected]
Term Paper for CSE-211 Computer Arithmetic Fifth Term 2008
ABSTRACT This term paper is a minor project for Computer System Organization and Architecture in which I made a detailed study of performance metrics for parallel systems, a topic that will remain useful in the IT industry. One frequently needs to compare the performance of two or more parallel computers, but how should this be done? This report discusses some of these issues so that the problems they can cause may be avoided in the future. We review the many performance metrics that have been proposed for parallel systems (i.e., program-architecture combinations), including the many variants of speedup, efficiency, and isoefficiency, and we give reasons why none of these metrics should be used independently of the run time of the parallel system. Keywords: Principle, Problems and Solution, Advantages & Disadvantages, Examples, Implementation, Development, Conclusion, References
Acknowledgements: First of all, I would like to express my sincere gratitude to the Almighty for encouraging me to complete this term paper. The following are some important people in my life who gave me strength and valuable suggestions to complete the task. First, my parents and friends, whose love and affection gave me the strength and encouragement to face the challenges of life. Second, my teacher Bijoy Chatterjee, whose inspiration, motivation, and spiritual guidance always kept me on the right path and kept my faith in God Almighty, without whose blessings nothing is possible. Finally, thanks to Lovely Professional University, which gave me this great opportunity to prepare the term paper.
DATED:
VISHAL PATYAL 3050060122
CONTENTS:
An introduction to performance metrics
Performance metrics for parallel systems
1. Execution time
2. Total parallel overhead
3. Speedup
4. Superlinear speedups
5. Efficiency
File systems for parallel I/O
Evaluating performance metrics for parallel systems
An introduction to performance metrics

The term performance. Performance can be considered a sub-characteristic of dependability, although the classical literature does not include it in that definition; users perceive the performance of a system as bad, for example, when response times are high or requests are denied due to overload. Some of the most commonly used performance metrics are response time, throughput, and utilization. Response time is defined as the time interval between a user's request for a service and the response of the system; this metric is sometimes called responsiveness. Since responses often cannot be delivered instantaneously, there are two possible readings of this definition: the response time is either the interval between the user issuing a request and the system starting to respond, or the interval until the system finishes responding. A small computational sketch of these definitions is given at the end of this section.

Performance metrics are, more generally, measures of an organization's activities and performance. Performance metrics should support a range of stakeholder needs, from customers and shareholders to employees. While traditionally many metrics are finance-based, focusing inwardly on the performance of the organization, metrics may also focus on performance against customer requirements and value. Developing performance metrics usually follows a process of: 1. establishing critical processes/customer requirements, 2. developing measures, and 3. establishing targets against which the results can be scored.

A criticism of performance metrics is that when the value of information is computed using mathematical methods, it turns out that even performance metrics professionals choose measures that have little value. This is referred to as the "measurement inversion". For example, metrics tend to emphasize what organizations find immediately measurable, even if it is of low value, and to ignore high-value measurements simply because they seem harder to measure (whether they are or not).
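To make the system-level definitions above concrete, here is a minimal Python sketch (the timestamp data is hypothetical, chosen only for illustration) that computes mean response time and throughput from logged request and completion times:

    # Hypothetical (request_time, completion_time) pairs, in seconds.
    requests = [(0.0, 0.4), (0.5, 1.1), (1.0, 1.3), (2.0, 3.0)]

    # Response time: interval between a user's request and the system's
    # (completed) response.
    response_times = [done - start for (start, done) in requests]
    mean_response_time = sum(response_times) / len(response_times)

    # Throughput: completed requests per unit of observed time.
    window = max(d for (_, d) in requests) - min(s for (s, _) in requests)
    throughput = len(requests) / window

    print(mean_response_time, throughput)   # 0.575 s, ~1.33 requests/s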
Performance Metrics for Parallel Systems
It is important to study the performance of parallel programs with a view to determining the best algorithm, evaluating hardware platforms, and examining the benefits from parallelism. A number of metrics have been used, based on the desired outcome of performance analysis.
1. Execution time
The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. The parallel runtime is the time that elapses from the moment a parallel computation starts to the moment the last processing element finishes execution. We denote the serial runtime by TS and the parallel runtime by TP.
2. Total parallel overhead
The overheads incurred by a parallel program are encapsulated into a single expression referred to as the overhead function. We define the overhead function, or total overhead, of a parallel system as the total time collectively spent by all the processing elements over and above that required by the fastest known sequential algorithm for solving the same problem on a single processing element. We denote the overhead function of a parallel system by the symbol To. The total time spent in solving a problem, summed over all processing elements, is pTP. TS units of this time are spent performing useful work, and the remainder is overhead. Therefore, the overhead function is given by To = pTP - TS.
1. Performance metrics for parallel systems: Execution time
• The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer.
• The parallel runtime is the time that elapses from the moment the first processor starts to the moment the last processor finishes execution.
• We denote the serial runtime by TS and the parallel runtime by TP.
2. Performance metrics for parallel systems: Total parallel overhead
• Let Tall be the total time collectively spent by all the processing elements, and let TS be the serial time.
• Observe that Tall - TS is then the total time spent by all processors combined on non-useful work. This is called the total overhead.
• The total time collectively spent by all the processing elements is Tall = pTP (where p is the number of processors).
• The overhead function (To) is therefore given by To = pTP - TS.
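As a small sketch of this bookkeeping (the timing values are made up purely for illustration), the overhead function can be computed directly from its definition in Python:

    # Hypothetical measured times, in seconds.
    Ts = 100.0   # serial runtime of the fastest known sequential algorithm
    Tp = 30.0    # parallel runtime
    p = 4        # number of processing elements

    T_all = p * Tp    # total time spent by all processors combined
    To = T_all - Ts   # total overhead (time not spent on useful work)

    print(To)         # 4 * 30 - 100 = 20 seconds of overhead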
3. Performance metrics for parallel systems: Speedup
Speedup (S) is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements: S = TS / TP.
Performance metrics: Speedup example
• Consider the problem of adding n numbers by using n processing elements.
• If n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processors.
[Figure: computing the global sum of 16 partial sums using 16 processing elements. Σij denotes the sum of the numbers with consecutive labels from i to j.]
• If an addition takes constant time tc and communication of a single word takes time ts + tw, we have the parallel time TP = Θ(log n).
• Since TS = Θ(n), the speedup is given by S = Θ(n / log n).
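A minimal sequential Python sketch of this scheme (simulating the combining steps on one machine rather than running on n real processors, with one possible pairing of partners) is:

    import math

    # Simulate the binary-tree summation of n = 2^k numbers.  Each step
    # halves the number of active partial sums, so there are log2(n)
    # steps in total, matching TP = Theta(log n).
    def tree_sum(values):
        assert len(values) & (len(values) - 1) == 0, "n must be a power of two"
        steps = 0
        while len(values) > 1:
            half = len(values) // 2
            # One possible pairing: element i absorbs the partial sum
            # held by its partner, element i + half.
            values = [values[i] + values[i + half] for i in range(half)]
            steps += 1
        return values[0], steps

    total, steps = tree_sum(list(range(16)))
    print(total, steps)   # 120 in 4 steps; log2(16) = 4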
Performance metrics: Speedup (continued)
• For a given problem, there might be many serial algorithms available. These algorithms may have different asymptotic runtimes and may be parallelizable to different degrees.
• For the purpose of computing speedup, we always take the best sequential program as the baseline.
• Consider the problem of parallel bubble sort. The serial time for bubble sort is 150 seconds, and the parallel time for odd-even sort (an efficient parallelization of bubble sort) is 40 seconds. The speedup would appear to be 150/40 = 3.75.
• But is this really a fair assessment of the system? What if serial quicksort takes only 30 seconds? In that case the speedup is 30/40 = 0.75, which is a more realistic assessment of the system.
4. Performance metrics for parallel systems: Superlinear speedups
Resource-based superlinearity: the higher aggregate cache/memory bandwidth can result in better cache-hit ratios, and therefore superlinearity. Example: a processor with 64 KB of cache yields an 80% hit ratio. If two processors are used, since the problem size per processor is smaller, the hit ratio goes up to 90%. Of the remaining 10% of accesses, 8% come from local memory and 2% from remote memory. If DRAM access time is 100 ns, cache access time is 2 ns, and remote memory access time is 400 ns, this corresponds to a speedup of 2.43!
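The figure of 2.43 follows from comparing average memory access times, as the short Python sketch below shows (it assumes, as the example implicitly does, that runtime is proportional to the average time per memory access):

    # One processor: 80% cache hits at 2 ns, 20% DRAM accesses at 100 ns.
    t_one = 0.80 * 2 + 0.20 * 100               # 21.6 ns per access

    # Two processors: 90% cache hits, 8% local DRAM, 2% remote memory.
    t_two = 0.90 * 2 + 0.08 * 100 + 0.02 * 400  # 17.8 ns per access

    # Two processors each serve half the accesses, and each access is
    # cheaper on average, so the speedup is 2 * (21.6 / 17.8).
    print(round(2 * t_one / t_two, 2))          # 2.43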
5. Performance metrics for parallel systems: Efficiency
Efficiency is a measure of the fraction of time for which a processing element is usefully employed. It is defined as the ratio of speedup to the number of processing elements: E = S / p.
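Putting these metrics together, a minimal Python sketch (reusing the made-up timings from the overhead sketch above) computes speedup, efficiency, and overhead from the same measurements:

    # The same hypothetical measurements as in the overhead sketch above.
    Ts, Tp, p = 100.0, 30.0, 4

    S = Ts / Tp        # speedup
    E = S / p          # efficiency: useful fraction of each processor's time
    To = p * Tp - Ts   # total overhead

    print(S, E, To)    # 3.33..., 0.83..., 20.0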
File systems for parallel I/O

The section opens with a survey of "The Vesta parallel file system" (Peter F. Corbett and Dror G. Feitelson, 1996). IBM's AIX Parallel I/O File System, offered on the SP2, is based on Vesta. While many previous parallel file systems hide how a file is distributed (striped) across nodes, Vesta provides user interface-level access to a file's parallel structure. Vesta also aims to facilitate file access via different types of decomposition of a file's data (i.e., data might be accessed by row, and also by column). A Vesta file has a two-dimensional structure built from cells. A cell is defined as "a container where data can be deposited, or alternatively, as virtual I/O nodes that are then mapped to available physical I/O nodes." Thus, one file dimension is the cell dimension, and the other is the data within the cell. As a file is written, a parameter specifies how many cells it can occupy. Given that the number of cells determines the amount of parallelism a file uses, that feature is especially useful in the presence of small files, since distributing a small file over many nodes results in potentially poor access to that file. The article describes many other Vesta features, and offers an evaluation of the file system's performance using a parallel sorting application as an example.

"The Zebra striped network file system" (John H. Hartman and John K. Ousterhout, 1995) describes an extension to log-structured file systems. In such a system, "each client forms its new data for all files into a sequential log that it stripes across the storage servers." That batching of writes from each node results in much improved performance, especially in the presence of many small writes. Zebra also takes advantage of RAID-style redundancy techniques: it can recover from a single server failure without loss of information (but not from multiple concurrent server failures). File block creations, updates, and deletes are communicated between nodes in the form of deltas. The paper gives an overview of log-structured file systems, and then outlines Zebra's
components (including a "stripe cleaner" to reclaim unused space). It concludes with a performance evaluation focusing on small-file writes.

Many parallel file systems hide the striping and other data-distribution strategies from client applications in order to simplify the interface application programmers must deal with. However, that facade also makes it difficult to experiment with, and explore, distributed file system characteristics in general. "PPFS: A High-Performance Portable Parallel File System" (James V. Huber, Jr., Christopher L. Elford, Daniel A. Reed, Andrew A. Chien, and David S. Blumenthal), an updated version of a 1995 paper, defines a parallel file system that lends itself to experimentation. It is a rich user-level library implemented on top of an operating system's native file system, effectively interposing itself between that OS and an application program. The article outlines PPFS's design philosophy, and provides two benchmark application results, one from a read-intensive gene-sequencing application, and the other based on a write-intensive electron-scattering code from plasma physics.

"The Global File System" (Steven R. Soltis, Thomas M. Ruwart, Grant M. Erickson, Kenneth W. Preslan, and Matthew T. O'Keefe, 1996) is a cluster-based distributed file system. Instead of focusing on serving a large number of clients, GFS aims to deliver high performance to a relatively small number of systems. Its target applications are those requiring high bandwidth and large amounts of storage space, such as multimedia information systems. GFS takes advantage of device locks to ensure data consistency, and follows a file-access pattern somewhat similar to how processes access shared memory in an SMP. In GFS, clients obtain a lock when they read or write data on shared network storage, and those locks correspond to device-level locks on the storage system; in the prototype implementation, those locks are implemented with the SCSI Dlock command. Because that design allows GFS to atomically modify data, GFS clients can remain unaware of each other's presence. To provide a unified logical storage space, GFS collects storage devices into a network storage pool; subpools divide network storage pools based on device type. GFS enables clients to export file systems to machines not directly connected to a storage pool: a GFS client can act as an NFS server, and can even serve HTTP clients, enabling access to the storage pool via the Web.

"Serverless network file systems" (Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe Matthews, David A. Patterson, Drew S. Roselli, and Randolph Y. Wang, 1996) describes xFS, the file system for the network of workstations (NOW) project. NOW defines a cooperation mechanism for independent workstations. For storage, that implies that there is no central server and that file system services are provided collectively by the workstations: any workstation on the network can store, cache, and access any block of data. Since any machine on the network can fail as well, xFS also provides for the redundancy of data. xFS incorporates previous research and extends it to enable a serverless mode of operation. The paper discusses RAID, log-structured storage, Zebra, and different multiprocessor cache-consistency mechanisms, and then focuses on how that research is adapted to xFS's unique needs.
For instance, xFS distributes file system management services among the workstations: it attempts to assign a file used by a client to a manager located on that client machine. The paper describes a simulation study showing that co-locating a file's management with that file's client improves locality, and thereby significantly reduces network-bound client requests. Data distribution in xFS is achieved by a software implementation of RAID, using Zebra-like log-based striping.
Evaluating performance metrics for parallel systems

In conducting any evaluation, we need to identify a set of performance metrics that we would like to measure, and the techniques and tools that will be used to gather those metrics. Metrics which capture the "available" compute power are often not a true indicator of the performance actually "delivered" by a parallel system. Metrics for parallel-system performance evaluation should quantify this gap between available and delivered compute power, since understanding application and architectural bottlenecks is crucial for application restructuring and architectural enhancements. Many performance metrics, such as speedup, scaled speedup, and isoefficiency, have been proposed to quantify the match between the application and the architecture in a parallel system. While these metrics are useful for tracking overall performance trends, they provide little additional information about where performance is lost. Some of these metrics attempt to identify the cause of the problem when the parallel system does not scale as expected. However, it is essential to find the individual application and architectural artifacts that lead to these bottlenecks and to quantify their relative contribution toward limiting the overall scalability of the system; traditional metrics do not help further in this regard. Parallel-system overheads may be broadly classified into a purely algorithmic component, a component arising from the interaction of the application with the system software, and a component arising from the interaction of the application with the hardware.
Evaluation techniques

Experimentation, analytical modeling, and simulation are three well-known techniques for evaluating parallel systems. Experimentation involves implementing the application on the actual hardware and measuring its performance. Analytical models abstract hardware and application details in a parallel system, capturing complex system features by simple mathematical formulae; these formulae are usually parameterized by a limited number of degrees of freedom so that the analysis is kept tractable. Simulation is a valuable technique which exploits computer resources to model and imitate the behavior of a real system in a controlled manner. Each technique has its own limitations. The amount of statistics that can be gleaned by experimentation (to quantify the overhead functions) is limited by the monitoring/instrumentation support provided by the underlying system, and additional instrumentation can itself disturb the evaluation. Analytical models are often criticized for the unrealistic, simplifying assumptions made in expressing the complex interaction between the application and the architecture. Simulating a realistic computer system demands considerable resources in terms of space and time.
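As a toy illustration of the analytical approach, the Python sketch below captures the tree-based addition of n numbers by a simple parameterized formula and predicts speedup across processor counts; the model and all parameter values are illustrative assumptions, not measurements of any real machine:

    import math

    # Toy analytical model for adding n numbers on p processors:
    # computation time (n/p)*tc plus log2(p) combining steps, each
    # costing one message of time ts + tw.  All parameters are
    # illustrative placeholders.
    def parallel_time(n, p, tc=1e-9, ts=1e-6, tw=1e-8):
        return (n / p) * tc + math.log2(p) * (ts + tw)

    n = 1_000_000
    Ts = n * 1e-9                            # modeled serial time
    for p in (2, 4, 8, 16):
        print(p, Ts / parallel_time(n, p))   # predicted speedup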