1. What is Parallel Computing?
I would like to start with a few metaphors from real life. What do we do when we are given a Herculean task that must be finished quickly? We divide the task into pieces and employ more people to share the work. Similarly, a large corporation requires many employees for its proper functioning, and a large economy requires many people at work across various sectors of business.

If a computer were human, its central processing unit (CPU) would be its brain. A CPU is a microprocessor, a computing engine on a chip. While modern microprocessors are small, they are also very powerful: they can interpret millions of instructions per second. Even so, some computational problems are so complex that a powerful microprocessor would need years to solve them.

Computer scientists use different approaches to address this problem. One potential approach is to push for more powerful microprocessors, which usually means finding ways to fit more transistors on a chip. Computer engineers are already building microprocessors with transistors that are only a few dozen nanometers wide. How small is a nanometer? It is one-billionth of a meter. A red blood cell has a diameter of about 2,500 nanometers; the width of a modern transistor is a fraction of that size. Building more powerful microprocessors requires an intense and expensive production process, and some computational problems take years to solve even with the benefit of a faster chip.

Partly because of these factors, computer scientists sometimes use a different approach: parallel processing. In general, parallel processing means that at least two microprocessors handle parts of an overall task. The concept is simple: a complex problem is divided into component parts using software designed for the task, each component part is assigned to a dedicated processor, each processor solves its part of the overall computational problem, and the software reassembles the results to reach the conclusion of the original complex problem.
It's a high-tech way of saying that it's easier to get work done if you can share the load. You could divide the load up among different processors housed in the same computer, or you could network several computers together and divide the load up among all of them. There are several ways to achieve the same goal.

Traditionally, computer software has been written for serial computation. To solve a problem, an algorithm is constructed and implemented as a serial stream of instructions. These instructions are executed on the central processing unit of one computer. Only one instruction may execute at a time; after that instruction is finished, the next is executed.

Parallel computing, on the other hand, uses multiple processing elements simultaneously to solve a problem. This is accomplished by breaking the problem into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others. The processing elements can be diverse and include resources such as a single computer with multiple processors, several networked computers, specialized hardware, or any combination of the above.
1.1 Frequency Scaling and Power Consumption
Frequency scaling was the dominant reason for improvements in computer performance from the mid-1980s until 2004. The runtime of a program is equal to the number of instructions multiplied by the average time per instruction. Holding everything else constant, increasing the clock frequency decreases the average time it takes to execute an instruction; an increase in frequency thus decreases runtime for all compute-bound programs. However, the power consumed by a chip is given by the equation P = C × V² × F, where P is power, C is the capacitance switched per clock cycle (proportional to the number of transistors whose inputs change), V is voltage, and F is the processor frequency (cycles per second). Increases in frequency therefore increase the amount of power used in a processor. Rising processor power consumption led ultimately to Intel's May 2004 cancellation of its Tejas and Jayhawk processors, which is generally cited as the end of frequency scaling as the dominant computer architecture paradigm.
1.2 Amdahl's Law
Theoretically, the speed-up from parallelization should be linear: doubling the number of processing elements should halve the runtime, and doubling it a second time should again halve the runtime. However, very few parallel algorithms achieve optimal speed-up. Most of them show near-linear speed-up for small numbers of processing elements, which flattens out into a constant value for large numbers of processing elements.
The potential speed-up of an algorithm on a parallel computing platform is given by Amdahl's Law, originally formulated by Gene Amdahl in the 1960s. It states that a small portion of the program which cannot be parallelized will limit the overall speed-up available from parallelization. Any large mathematical or engineering problem will typically consist of several parallelizable parts and several non-parallelizable (sequential) parts.
This relationship is given by the equation:

S = 1 / (1 - P)
where S is the speed-up of the program (as a factor of its original sequential runtime), and P is the fraction that is parallelizable. If the sequential portion of a program is 10% of the runtime, we can get no more than a 10x speed-up, regardless of how many processors are added. This puts an upper limit on the usefulness of adding more parallel execution units. "When a task cannot be partitioned because of sequential constraints, the application of more effort has no effect on the schedule. The bearing of a child takes nine months, no matter how many women are assigned."
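As a quick, hedged illustration (not part of the original law statement), the short C sketch below evaluates the commonly used finite-processor form of Amdahl's law, S = 1 / ((1 - P) + P/N), which approaches the 1 / (1 - P) bound as N grows. The parallel fraction of 0.9 mirrors the 10% sequential example above; the function and variable names are merely illustrative.

#include <stdio.h>

/* Speed-up predicted by Amdahl's law for a parallelizable fraction p
 * of the work running on n processing elements. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double p = 0.9;                    /* 90% parallel, 10% sequential */
    const int counts[] = {1, 2, 4, 8, 16, 64, 1024};

    for (int i = 0; i < 7; i++)
        printf("N = %4d  ->  speed-up = %.2f\n",
               counts[i], amdahl_speedup(p, counts[i]));

    /* As N grows, the speed-up approaches 1/(1 - p) = 10x and no more. */
    return 0;
}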
1.3 Gustafson’s Law
Gustafson's law is another law in computer engineering, closely related to Amdahl's law. It can be formulated as:

S = P - α(P - 1)
where P is the number of processors, S is the speed-up, and α the non-parallelizable part of the process. Amdahl's law assumes a fixed problem size and that the size of the sequential section is independent of the number of processors, whereas Gustafson's law does not make these assumptions.
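For comparison, here is a minimal C sketch of Gustafson's scaled speed-up using the formula quoted above; the serial fraction α = 0.1 is only an assumed value for illustration.

#include <stdio.h>

/* Scaled speed-up predicted by Gustafson's law for p processors and a
 * non-parallelizable fraction alpha of the scaled workload. */
static double gustafson_speedup(int p, double alpha)
{
    return p - alpha * (p - 1);
}

int main(void)
{
    const double alpha = 0.1;                /* assumed serial fraction */

    for (int p = 1; p <= 1024; p *= 4)
        printf("P = %4d  ->  scaled speed-up = %.1f\n",
               p, gustafson_speedup(p, alpha));

    return 0;
}

Unlike the Amdahl sketch above, the predicted speed-up here keeps growing with P, because the problem size is assumed to grow along with the machine.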
1.4 Pipelining
One of the basic concepts behind parallel computing is the concept of pipelining. In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.
Computer-related pipelines include:
* Instruction pipelines, such as the classic RISC pipeline, which are used in processors to allow overlapping execution of multiple instructions with the same circuitry. The circuitry is usually divided up into stages, including instruction decoding, arithmetic, and register fetching stages, wherein each stage processes one instruction at a time.
* Software pipelines, consisting of multiple processes arranged so that the output stream of one process is automatically and promptly fed as the input stream of the next one. Unix pipelines are the classical implementation of this concept.
2. Approaches To Parallel Computing
To understand parallel processing, we need to look at the four basic programming models. Computer scientists define these models based on two factors: the number of instruction streams and the number of data streams the computer handles. Instruction streams are algorithms. An algorithm is just a series of steps designed to solve a particular problem. Data streams are information pulled from computer memory used as input values to the algorithms. The processor plugs the values from the data stream into the algorithms from the instruction stream. Then, it initiates the operation to obtain a result.
2.1 Flynn’s Taxonomy
Michael J. Flynn created one of the earliest classification systems for parallel (and sequential) computers and programs, now known as Flynn's taxonomy. Flynn classified programs and computers by whether they operate on a single set or multiple sets of instructions, and whether those instructions use a single set or multiple sets of data.
Flynn’s Taxonomy
SISD - Single Instruction, Single Data
SIMD - Single Instruction, Multiple Data
MISD - Multiple Instruction, Single Data
MIMD - Multiple Instruction, Multiple Data
2.1.1 Single Instruction, Single Data (SISD)
Single Instruction, Single Data (SISD) computers have one processor that handles one algorithm using one source of data at a time. The computer tackles and processes each task in order, and so sometimes people use the word "sequential" to describe SISD computers. They aren't capable of performing parallel processing on their own.

2.1.2 Single Instruction, Multiple Data (SIMD)
Single Instruction, Multiple Data (SIMD) computers have several processors that follow the same set of instructions, but each processor inputs different data into those instructions. SIMD computers run different data through the same algorithm. This can be useful for analyzing large chunks of data based on the same criteria. Many complex computational problems don't fit this model.

2.1.3 Multiple Instruction, Single Data (MISD)
Multiple Instruction, Single Data (MISD) computers have multiple processors. Each processor uses a different algorithm but uses the same shared input data. MISD computers can analyze the same set of data using several different operations at the same time. The number of operations depends upon the number of processors. There aren't many actual examples of MISD computers, partly because the problems an MISD computer can calculate are uncommon and specialized.
2.1.4 Multiple Instruction, Multiple Data (MIMD)
Multiple Instruction, Multiple Data (MIMD) computers have multiple processors, each capable of accepting its own instruction stream independently from the others. Each processor also pulls data from a separate data stream. An MIMD computer can execute several different processes at once. MIMD computers are more flexible than SIMD or MISD computers, but it's more difficult to create the complex algorithms that make these computers work. Single Program, Multiple Data (SPMD) systems are a subset of MIMDs. An SPMD computer is structured like an MIMD, but it runs the same set of instructions across all processors.
According to David A. Patterson and John L. Hennessy, "Some machines are hybrids of these categories, of course, but this classic model has survived because it is simple, easy to understand, and gives a good first approximation. It is also—perhaps because of its understandability—the most widely used scheme."
2.2 Classification On The Basis Of Computation
On this basis, parallel systems can be classified into three major classes:
Massively Parallel
Embarrassingly Parallel
Grand Challenge Problems
2.2.1 Massively Parallel Systems
Massive parallelism (MP) is a term used in computer architecture, reconfigurable computing, application-specific integrated circuit (ASIC), and field-programmable gate array (FPGA) design. It signifies the presence of many independent arithmetic units or entire microprocessors that run in parallel. The term "massive" connotes hundreds if not thousands of such units. Early examples of such systems are the Distributed Array Processor, the Goodyear MPP, and the Connection Machine.
Today's most powerful supercomputers are all MP systems such as Earth Simulator, Blue Gene, ASCI White, ASCI Red, ASCI Purple, ASCI Thor's Hammer.
In this class of computing, all of the processing elements are connected together to be one very large computer. This is in contrast to distributed computing, where massive numbers of separate computers are used to solve a single problem.
Through advances in Moore's Law, System-On-Chip (SOC) implementations of massively parallel architectures are becoming cost effective, and finding particular application in high definition video processing.
2.2.2 Embarrassingly Parallel Systems
In the jargon of parallel computing, an embarrassingly parallel workload (or embarrassingly parallel problem) is one for which no particular effort is needed to segment the problem into a very large number of parallel tasks, and there is no essential dependency (or communication) between those parallel tasks. A very common use of embarrassingly parallel problems lies within graphics processing units (GPUs), for things like 3D projection, since each pixel on the screen can be rendered independently of any other pixel.
Embarrassingly parallel problems are ideally suited to distributed computing over the Internet (e.g. SETI@home), and are also easy to perform on server farms which do not have any of the special infrastructure used in a true supercomputer cluster.
Embarrassingly parallel problems lie at one end of the spectrum of parallelization, the degree to which a computational problem can be readily divided amongst processors.
2.2.3 Grand Challenge Problems
A grand challenge is a fundamental problem in science or engineering, with broad applications, whose solution would be enabled by the application of high performance computing resources that could become available in the near future. Examples of grand challenges are:
Computational fluid dynamics for the design of hypersonic aircraft, efficient automobile bodies, and extremely quiet submarines; for weather forecasting for short- and long-term effects; for efficient recovery of oil; and for many other applications
Electronic structure calculations for the design of new materials such as chemical catalysts, immunological agents, and superconductors
Plasma dynamics for fusion energy technology and for safe and efficient military technology
Calculations to understand the fundamental nature of matter, including quantum chromodynamics and condensed matter theory
Symbolic computations, including speech recognition, computer vision, natural language understanding, automated reasoning, and tools for design, manufacturing, and simulation of complex systems
Prediction of weather, climate, and global change
Challenges in materials sciences
Semiconductor design
Superconductivity
Structural biology
Design of pharmaceutical drugs
Human genome
Quantum chromodynamics
Astronomy
Challenges in transportation
Vehicle signature
Turbulence
Vehicle dynamics
Nuclear fusion
Efficiency of combustion systems
Enhanced oil and gas recovery
Computational ocean sciences
Speech
Vision
Undersea surveillance for anti-submarine warfare (ASW)

One current topical grand challenge is the DARPA Autonomous Vehicle challenge.
2.3 Types Of Parallelism
There are four major types of parallelism:
Bit-Level Parallelism
Instruction-Level Parallelism
Data Parallelism
Task Parallelism
2.3.1 Bit-Level Parallelism
From the advent of very-large-scale integration (VLSI) computer-chip fabrication technology in the 1970s until about 1986, speed-up in computer architecture was driven by doubling the computer word size, the amount of information the processor can execute per cycle. Increasing the word size reduces the number of instructions the processor must execute to perform an operation on variables whose sizes are greater than the length of the word. For example, where an 8-bit processor must add two 16-bit integers, the processor must first add the 8 lower-order bits from each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry bit from the lower-order addition; thus, an 8-bit processor requires two instructions to complete a single operation, where a 16-bit processor would be able to complete the operation with a single instruction.
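To make the two-instruction sequence concrete, here is a hedged C sketch that emulates what an 8-bit processor must do: add the low-order bytes first, then add the high-order bytes together with the carry produced by the first addition. The helper function is purely illustrative; a real 8-bit CPU would use an ADD instruction followed by an ADD-with-carry instruction.

#include <stdint.h>
#include <stdio.h>

/* Emulate a 16-bit addition using only 8-bit quantities, the way an
 * 8-bit processor would: low bytes first, then high bytes plus carry. */
static uint16_t add16_with_8bit_ops(uint16_t a, uint16_t b)
{
    uint8_t a_lo = a & 0xFF, a_hi = a >> 8;
    uint8_t b_lo = b & 0xFF, b_hi = b >> 8;

    uint8_t sum_lo = (uint8_t)(a_lo + b_lo);          /* standard add         */
    uint8_t carry  = (sum_lo < a_lo) ? 1 : 0;         /* carry out of low add */
    uint8_t sum_hi = (uint8_t)(a_hi + b_hi + carry);  /* add-with-carry       */

    return (uint16_t)((sum_hi << 8) | sum_lo);
}

int main(void)
{
    uint16_t x = 0x12FF, y = 0x0301;
    printf("0x%04X + 0x%04X = 0x%04X\n", x, y, add16_with_8bit_ops(x, y));
    /* A 16-bit processor performs the same addition with one instruction. */
    return 0;
}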
Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors. This trend generally came to an end with the introduction of 32-bit processors, which have been a standard in general-purpose computing for two decades. Not until recently (c. 2003–2004), with the advent of x86-64 architectures, have 64-bit processors become commonplace.
2.3.2 Instruction-Level Parallelism
A computer program is, in essence, a stream of instructions executed by a processor. These instructions can be re-ordered and combined into groups which are then executed in parallel without changing the result of the program. This is known as instruction-level parallelism. Advances in instruction-level parallelism dominated computer architecture from the mid-1980s until the mid-1990s.

Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different action the processor performs on the instruction in that stage; a processor with an N-stage pipeline can have up to N different instructions at different stages of completion. The canonical example of a pipelined processor is a RISC processor, with five stages: instruction fetch, decode, execute, memory access, and write back. The Pentium 4 processor had a 35-stage pipeline.

A five-stage pipelined superscalar processor, capable of issuing two instructions per cycle, can have two instructions in each stage of the pipeline, for a total of up to ten instructions being executed simultaneously.

In addition to instruction-level parallelism from pipelining, some processors can issue more than one instruction at a time. These are known as superscalar processors. Instructions can be grouped together only if there is no data dependency between them. Scoreboarding and the Tomasulo algorithm (which is similar to scoreboarding but makes use of register renaming) are two of the most common techniques for implementing out-of-order execution and instruction-level parallelism.

Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. Consider the following program:
1. e = a + b
2. f = c + d
3. g = e * f

Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. However, operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously. If we assume that each operation can be completed in one unit of time, then these three instructions can be completed in a total of two units of time, giving an ILP of 3/2.
2.3.3 Data Parallelism
Data parallelism is parallelism inherent in program loops, which focuses on distributing the data across different computing nodes to be processed in parallel. "Parallelizing loops often leads to similar (not necessarily identical) operation sequences or functions being performed on elements of a large data structure." Many scientific and engineering applications exhibit data parallelism.
A loop-carried dependency is the dependence of a loop iteration on the output of one or more previous iterations. Loop-carried dependencies prevent the parallelization of loops. For example, consider the following pseudocode that computes the first few Fibonacci numbers:
PREV2 := 0
PREV1 := 1
CUR := 1
do:
    CUR := PREV1 + PREV2
    PREV2 := PREV1
    PREV1 := CUR
while (CUR < 10)

This loop cannot be parallelized because CUR depends on PREV1 (which holds the previous iteration's value of CUR) and on PREV2, both of which are computed in each loop iteration. Since each iteration depends on the result of the previous one, the iterations cannot be performed in parallel. As the size of a problem gets bigger, the amount of data parallelism available usually grows as well.
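By contrast, a loop whose iterations are independent can be divided among processing elements. The C sketch below (a hedged illustration using POSIX threads; the array names, sizes, and the slice-based decomposition are assumptions, not something from the text) squares every element of an array by giving each thread its own contiguous slice of the data.

#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double x[N], y[N];

struct slice { int begin, end; };

/* Every thread runs the same code on a different part of the data. */
static void *square_slice(void *arg)
{
    struct slice *s = arg;
    for (int i = s->begin; i < s->end; i++)
        y[i] = x[i] * x[i];            /* no iteration depends on another */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct slice part[NTHREADS];

    for (int i = 0; i < N; i++)
        x[i] = i;

    for (int t = 0; t < NTHREADS; t++) {
        part[t].begin = t * (N / NTHREADS);
        part[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, square_slice, &part[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("y[10] = %.1f\n", y[10]);   /* expect 100.0 */
    return 0;
}

(Compile with a pthreads-capable compiler, e.g. cc -pthread.)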
2.3.4 Task Parallelism
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing execution processes (threads) across different parallel computing nodes. It contrasts with data parallelism as another form of parallelism.

In a multiprocessor system, task parallelism is achieved when each processor executes a different thread (or process) on the same or different data. The threads may execute the same or different code. In the general case, different execution threads communicate with one another as they work. Communication usually takes place to pass data from one thread to the next as part of a workflow.
As a simple example, if we are running code on a 2-processor system (CPUs "a" and "b") in a parallel environment and we wish to do tasks "A" and "B", it is possible to tell CPU "a" to do task "A" and CPU "b" to do task "B" simultaneously, thereby reducing the runtime of the execution. The tasks can be assigned using conditional statements, as described below.

Task parallelism emphasizes the distributed (parallelized) nature of the processing (i.e. threads), as opposed to the data (data parallelism). Most real programs fall somewhere on a continuum between task parallelism and data parallelism. The pseudocode below illustrates task parallelism:
program:
...
if CPU="a" then
    do task "A"
else if CPU="b" then
    do task "B"
end if
...
end program

The goal of the program is to do some net total task ("A+B"). If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it as follows.
In an SPMD system, both CPUs will execute the code. In a parallel environment, both will have access to the same data. The "if" clause differentiates between the CPUs: CPU "a" will read true on the "if" and CPU "b" will read true on the "else if", so each has its own task. Now both CPUs execute separate code blocks simultaneously, performing different tasks. Code executed by CPU "a":
program:
...
do task "A"
...
end program

Code executed by CPU "b":
program:
...
do task "B"
...
end program

This concept can now be generalized to any number of processors.
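On a real shared-memory machine, the CPU-selection pseudocode above is usually expressed by launching one thread per task. A minimal POSIX-threads sketch in C (the task bodies and names are illustrative placeholders) that performs task "A" and task "B" at the same time could look like this:

#include <pthread.h>
#include <stdio.h>

/* Two different tasks, free to run on two different processors. */
static void *task_A(void *arg) { printf("doing task A\n"); return NULL; }
static void *task_B(void *arg) { printf("doing task B\n"); return NULL; }

int main(void)
{
    pthread_t a, b;

    /* The operating system may schedule each thread on its own CPU. */
    pthread_create(&a, NULL, task_A, NULL);
    pthread_create(&b, NULL, task_B, NULL);

    pthread_join(a, NULL);
    pthread_join(b, NULL);

    /* The net total task "A+B" is complete once both threads finish. */
    return 0;
}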
2.4 Classification Based on Memory Access
One major way to classify parallel computers is based on their memory architectures. Shared memory parallel computers have multiple processors accessing all available memory as a global address space. They can be further divided into two main classes based on memory access times:
Uniform Memory Access (UMA), in which access times to all parts of memory are equal.
Non-Uniform Memory Access (NUMA), in which they are not.
2.4.1 Uniform Memory Access (UMA)
Uniform Memory Access (UMA) is a computer memory architecture used in parallel computers having multiple processors and possibly multiple memory chips. All the processors in the UMA model share the physical memory uniformly. Peripherals are also shared. Cache memory may be private to each processor. In a UMA architecture, the access time to a memory location is independent of which processor makes the request or which memory chip contains the target data. It is used in symmetric multiprocessing (SMP). Uniform Memory Access architectures are often contrasted with Non-Uniform Memory Access (NUMA) architectures. UMA machines are, generally, harder for computer architects to design, but easier for programmers to program, than NUMA architectures.
2.4.2 Non-Uniform Memory Access (NUMA)
Non-Uniform Memory Access or Non-Uniform Memory Architecture (NUMA) is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.

NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. Their commercial development came in work by Burroughs, Convex Computer (later HP), SGI, Sequent, and Data General during the 1990s. Techniques developed by these companies later featured in a variety of Unix-like operating systems, as well as to some degree in Windows NT.

Modern CPUs operate considerably faster than the main memory they are attached to. In the early days of high-speed computing and supercomputers, the CPU generally ran slower than its memory, until the performance lines crossed in the 1970s. Since then, CPUs, increasingly starved for data, have had to stall while they wait for memory accesses to complete. Many supercomputer designs of the 1980s and 1990s focused on providing high-speed memory access as opposed to faster processors, allowing them to work on large data sets at speeds other systems could not approach.
Architecture of a NUMA system. In the figure, the processors are connected to the bus or crossbar by connections of varying thickness/number, showing that different CPUs have different memory-access priorities based on their location.

Limiting the number of memory accesses provided the key to extracting high performance from a modern computer. For commodity processors, this means installing an ever-increasing amount of high-speed cache memory and using increasingly sophisticated algorithms to avoid "cache misses". But the dramatic increase in the size of operating systems and of the applications run on them has generally overwhelmed these cache-processing improvements. Multiprocessor systems make the problem considerably worse. Now a system can starve several processors at the same time, notably because only one processor can access memory at a time.
NUMA attempts to address this problem by providing separate memory for each processor, avoiding the performance hit when several processors attempt to address the same memory. For problems involving spread data (common for servers and similar applications), NUMA can improve the performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks). Of course, not all data ends up confined to a single task, which means that more than one processor may require the same data. To handle these cases, NUMA systems include additional hardware or software to move data between banks. This operation has the effect of slowing down the processors attached to those banks, so the overall speed increase due to NUMA will depend heavily on the exact nature of the tasks run on the system at any given time.
2.5 Classes Of Parallel Computers
Parallel computers can be roughly classified according to the level at which the hardware supports parallelism. This classification is broadly analogous to the distance between basic computing nodes. These are not mutually exclusive; for example, clusters of symmetric multiprocessors are relatively common.
2.5.1 Multi-Core Computing
A multi-core CPU (or chip-level multiprocessor, CMP) combines two or more independent cores into a single package composed of a single integrated circuit (IC), called a die, or more dies packaged together. A dual-core processor contains two cores, and a quad-core processor contains four cores. A multi-core microprocessor implements multiprocessing in a single physical package. A processor with all cores on a single die is called a monolithic processor. Cores in a multi-core device may share a single coherent cache at the highest on-device cache level (e.g. L2 for the Intel Core 2) or may have separate caches (e.g. current AMD dual-core processors). The processors also share the same interconnect to the rest of the system. Each "core" independently implements optimizations such as superscalar execution, pipelining, and multithreading. A system with n cores is effective when it is presented with n or more threads concurrently.

The most commercially significant (or at least the most obvious) multi-core processors are those used in personal computers (primarily from Intel and AMD) and game consoles (e.g., the eight-core Cell processor in the PS3 and the three-core Xenon processor in the Xbox 360). In this context, "multi" typically means a relatively small number of cores. However, the technology is widely used in other areas, especially embedded processors such as network processors and digital signal processors, and in GPUs.
The amount of performance gained by the use of a multi-core processor depends on the problem being solved and the algorithms used, as well as their implementation in software (Amdahl's law). For so-called "embarrassingly parallel" problems, a dual-core processor with two cores at 2 GHz may perform very nearly as fast as a single core of 4 GHz. Other problems, though, may not yield as much speedup. This all assumes, however, that the software has been designed to take advantage of the available parallelism. If it hasn't, there will not be any speedup at all; however, the processor will multitask better, since it can run two programs at once, one on each core.

2.5.2 Symmetric Multiprocessing
In computing, symmetric multiprocessing (SMP) involves a multiprocessor computer architecture where two or more identical processors connect to a single shared main memory. Most common multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors. SMP systems allow any processor to work on any task no matter where the data for that task are located in memory; with proper operating system support, SMP systems can easily move tasks between processors to balance the workload efficiently.
2.5.3 Distributed Computing
Distributed computing deals with hardware and software systems containing more than one processing element or storage element, concurrent processes, or multiple programs, running under a loosely or tightly controlled regime.
In distributed computing, a program is split up into parts that run simultaneously on multiple computers communicating over a network. Distributed computing is a form of parallel computing, but the term parallel computing is most commonly used to describe program parts running simultaneously on multiple processors in the same computer. Both types of processing require dividing a program into parts that can run simultaneously, but distributed programs often must also deal with heterogeneous environments, network links of varying latencies, and unpredictable failures in the network or the computers.

2.5.4 Cluster Computing
A cluster is a group of loosely coupled computers that work together closely, so that in some respects they can be regarded as a single computer. Clusters are composed of multiple standalone machines connected by a network. While machines in a cluster do not have to be symmetric, load balancing is more difficult if they are not. The most common type of cluster is the Beowulf cluster, which is a cluster implemented on multiple identical commercial off-the-shelf computers connected with a TCP/IP Ethernet local area network. Beowulf technology was originally developed by Thomas Sterling and Donald Becker. The vast majority of the TOP500 supercomputers are clusters.
Beowulf clusters

2.5.5 Grid Computing
Grid computing is the most distributed form of parallel computing. It makes use of computers communicating over the Internet to work on a given problem. Because of the low bandwidth and extremely high latency available on the Internet, grid computing typically deals only with embarrassingly parallel problems. Many grid computing applications have been created, of which SETI@home and Folding@Home are the best-known examples.
Most grid computing applications use middleware, software that sits between the operating system and the application to manage network resources and standardize the software interface. The most common grid computing middleware is the Berkeley Open Infrastructure for Network Computing (BOINC). Often, grid computing software makes use of "spare cycles", performing computations at times when a computer is idling.

2.5.6 Vector Processors
A vector processor is a CPU or computer system that can execute the same instruction on large sets of data. "Vector processors have high-level operations that work on linear arrays of numbers or vectors. An example vector operation is A = B × C, where A, B, and C are each 64-element vectors of 64-bit floating-point numbers." They are closely related to Flynn's SIMD classification. Cray computers became famous for their vector-processing computers in the 1970s and 1980s. However, vector processors, both as CPUs and as full computer systems, have generally disappeared. Modern processor instruction sets do include some vector processing instructions, such as AltiVec and Streaming SIMD Extensions (SSE).
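As a small, hedged illustration of such instructions (the array length and names are assumptions, and an x86 compiler with SSE support is assumed), the C sketch below uses SSE intrinsics to multiply two arrays of single-precision floats four elements at a time, the element-wise analogue of the A = B × C vector operation quoted above.

#include <xmmintrin.h>   /* SSE intrinsics */
#include <stdio.h>

#define N 16             /* kept a multiple of 4 for simplicity */

int main(void)
{
    float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { b[i] = (float)i; c[i] = 2.0f; }

    /* Each pass through the loop multiplies four floats with a single
     * SIMD multiply instruction. */
    for (int i = 0; i < N; i += 4) {
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_loadu_ps(&c[i]);
        _mm_storeu_ps(&a[i], _mm_mul_ps(vb, vc));
    }

    printf("a[5] = %.1f\n", a[5]);   /* expect 10.0 */
    return 0;
}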
The Cray-1, the most famous vector processor
3. Parallel Computing Implementation
Parallel computing has implementations in both software and hardware. To be precise about terminology, parallel computing in software is called ‘parallel programming’, whereas in the world of hardware it is referred to as ‘parallel processing’.
3.1 Parallel Computing in Software
Parallel programming models are implemented in several ways: as libraries invoked from traditional sequential languages, as language extensions, or as entirely new execution models. They are also roughly categorized for two kinds of systems, shared-memory systems and distributed-memory systems, though the lines between them are largely blurred nowadays. A programming model is usually judged by its expressiveness and its simplicity, which are often conflicting goals; the ultimate aim is to improve the productivity of programming.

3.1.1 Automatic Parallelization
Automatic parallelization of a sequential program by a compiler is the "holy grail" of parallel computing. Despite decades of work by compiler researchers, automatic parallelization has had only limited success. Mainstream parallel programming languages remain either explicitly parallel or (at best) partially implicit, in which a programmer gives the compiler directives for parallelization. A few fully implicit parallel programming languages exist, such as SISAL, Parallel Haskell, and (for FPGAs) Mitrion-C, but these are niche languages that are not widely used.

The Fortran code below can be auto-parallelized by a compiler because each iteration is independent of the others, and the final result of array z will be correct regardless of the execution order of the iterations.

do i=1, n
   z(i) = x(i) + y(i)
enddo
On the other hand, the following code cannot be auto-parallelized, because the value of z(i) depends on the result of the previous iteration, z(i-1).
do i=2, n
   z(i) = z(i-1)*2
enddo
This does not mean that the code cannot be parallelized. Indeed, it is equivalent to:

do i=2, n
   z(i) = z(1)*2**(i-1)
enddo
However, current parallelizing compilers are not usually capable of bringing out these parallelisms automatically, and it is questionable whether this code would benefit from parallelization in the first place.
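The "partially implicit" approach mentioned earlier, in which the programmer gives the compiler a directive, can be sketched with OpenMP. The following C rendering of the first (independent) loop is a hedged example: it assumes a compiler with OpenMP support (e.g. built with -fopenmp); without that support the pragma is simply ignored and the loop runs sequentially.

#include <stdio.h>

#define N 1000000

static double x[N], y[N], z[N];

int main(void)
{
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2.0 * i; }

    /* The directive asserts that the iterations are independent and may be
     * distributed across the available threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        z[i] = x[i] + y[i];

    printf("z[N-1] = %.1f\n", z[N - 1]);   /* expect 3*(N-1) */
    return 0;
}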
3.1.2 Application Checkpointing
Application checkpointing is about providing a restore or backup point, to ensure that work is not lost. The larger and more complex a computer, the more that can go wrong and the shorter the mean time between failures. Application checkpointing is a technique whereby the computer system takes a "snapshot" of the application, a record of all current resource allocations and variable states, akin to a core dump; this information can be used to restore the program if the computer should fail. Checkpointing means that the program has to restart from only its last checkpoint rather than from the beginning. For an application that may run for months, that is critical. Application checkpointing may also be used to facilitate process migration.
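The idea can be sketched in a few lines of C. The hypothetical program below (the file name, state layout, and checkpoint interval are all illustrative assumptions) periodically writes its entire working state to a file and, on start-up, reloads that file if it exists, so a failure costs at most the work done since the last snapshot. Real checkpointing systems record far more state (open files, memory contents, pending messages), but the principle is the same.

#include <stdio.h>

/* All of the state this toy computation needs in order to resume. */
struct state { long steps_done; double partial_sum; };

static int load_checkpoint(struct state *s)
{
    FILE *f = fopen("checkpoint.dat", "rb");
    if (!f) return 0;                         /* no checkpoint: start fresh */
    int ok = (fread(s, sizeof *s, 1, f) == 1);
    fclose(f);
    return ok;
}

static void save_checkpoint(const struct state *s)
{
    FILE *f = fopen("checkpoint.dat", "wb");
    if (!f) return;
    fwrite(s, sizeof *s, 1, f);
    fclose(f);
}

int main(void)
{
    struct state s = {0, 0.0};

    if (load_checkpoint(&s))
        printf("resuming after %ld completed steps\n", s.steps_done);

    while (s.steps_done < 1000000) {
        s.partial_sum += 1.0 / (s.steps_done + 1);   /* the "real" work   */
        s.steps_done++;
        if (s.steps_done % 100000 == 0)
            save_checkpoint(&s);                     /* periodic snapshot */
    }

    printf("result = %f\n", s.partial_sum);
    return 0;
}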
3.1.3 Dependencies in Parallel Algorithms
Understanding data dependencies is fundamental in implementing parallel algorithms. No program can run more quickly than the longest chain of dependent calculations (known as the critical path), since calculations that depend upon prior calculations in the chain must be executed in order. However, most algorithms do not consist of just a long chain of dependent calculations; there are usually opportunities to execute independent calculations in parallel.

Let Pi and Pj be two program fragments. Bernstein's conditions describe when the two are independent and can be executed in parallel. Let Ii be all of the input variables to Pi and Oi the output variables, and likewise for Pj. Pi and Pj are independent if they satisfy the following conditions:

1. Ij ∩ Oi = ∅
2. Ii ∩ Oj = ∅
3. Oi ∩ Oj = ∅

Consider the following functions, which demonstrate several kinds of dependencies:

1: function Dep(a, b)
2:    c := a·b
3:    d := 2·c
4: end function
Operation 3 in Dep(a, b) cannot be executed before (or even in parallel with) operation 2, because operation 3 uses a result from operation 2. It violates condition 1, and thus introduces a flow dependency.

1: function NoDep(a, b)
2:    c := a·b
3:    d := 2·b
4:    e := a+b
5: end function
In this example, there are no dependencies between the instructions, so they can all be run in parallel. Bernstein’s conditions do not allow memory to be shared between different processes. For that, some means of enforcing an ordering between accesses is necessary, such as semaphores, barriers or some other synchronization method.
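As a small illustration of one such mechanism (the variable names and thread count are illustrative assumptions), the C sketch below uses a POSIX mutex so that two threads updating the same shared variable cannot interleave their read-modify-write sequences; the lock imposes an ordering on accesses that Bernstein's conditions would otherwise forbid from running concurrently.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                          /* shared output variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Both threads write the same memory, so the fragments are not independent;
 * the mutex serializes the conflicting accesses. */
static void *increment_many(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                                /* protected read-modify-write */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, increment_many, NULL);
    pthread_create(&t2, NULL, increment_many, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("counter = %ld\n", counter);           /* always 200000 */
    return 0;
}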
3.2 Parallel Computing in Hardware
Some implementations of parallel computing in hardware are found in vector processors like the Cray-1, and in many of the classes of parallel computers discussed earlier (Section 2.5). The NUMA architecture (Section 2.4.2) is another practical implementation in hardware. Moreover, most new-generation computers come with multiple processors, be they client systems or server systems. Most personal computers today have Intel Dual-Core or later processors, or AMD's Sempron series. The need for parallel computing is felt even at the micro level of a personal computer. No supercomputer can be built without the concepts of parallel computing applied in it; the famous Cray-2 supercomputer also used them. In fact, one of the major distinguishing features of a supercomputer is precisely its extensive use of parallelism. We will take a short look at how parallel processing is implemented in Intel's Xeon processors.

3.2.1 Parallel Processing in the Intel Xeon Series
The figures below illustrate how multi-threading technology gives an edge over the earlier, now obsolete, single-threading processors.
Thread (or instruction) processing in a single-threading processor.
Threads (instructions) are executed serially in a single-threading processor (top figure), whereas they are executed concurrently with multi-threading processor technology (bottom figure).
3.3 Applications
The applications of parallel computing are far and wide.
As parallel computers become larger and faster, it becomes feasible to solve problems that previously took too long to run. Parallel computing is used in a wide range of fields, from bioinformatics (to do protein folding) to economics (to do simulation in mathematical finance). Common types of problems found in parallel computing applications are:
Dense linear algebra
Sparse linear algebra
Spectral methods (such as the Cooley-Tukey fast Fourier transform)
N-body problems (such as Barnes-Hut simulation)
Structured grid problems (such as Lattice Boltzmann methods)
Unstructured grid problems (such as found in finite element analysis)
Monte Carlo simulation
Combinational logic (such as brute-force cryptographic techniques)
Graph traversal (such as sorting algorithms)
Dynamic programming
Branch and bound methods
Graphical models (such as detecting hidden Markov models and constructing Bayesian networks)
Finite state machine simulation
FUTURE SCOPE
We all know that silicon-based chips are reaching a physical limit in processing speed, as they are constrained by the speed of electricity, light, and certain thermodynamic laws. A viable solution to overcome this limitation is to connect multiple processors working in coordination with each other to solve grand challenge problems. Hence, high-performance computing requires the use of Massively Parallel Processing (MPP) systems containing thousands of powerful CPUs.

The development of parallel processing is being influenced by many factors. The prominent among them include the following:

1. Computational requirements are ever increasing, both in scientific and business computing. The technical computing problems which require high-speed computational power are related to life sciences, aerospace, geographical information systems, mechanical design and analysis, etc.

2. Sequential architectures are reaching physical limitations, as they are constrained by the speed of light and the laws of thermodynamics. The speed at which sequential CPUs can operate is reaching a saturation point (no more vertical growth), and hence an alternative way to get high computational speed is to connect multiple CPUs (an opportunity for horizontal growth).
3. Hardware improvements in pipelining, superscalar execution, etc. do not scale indefinitely and require sophisticated compiler technology, and developing such compiler technology is a difficult task.
4. Vector processing works well only for certain kinds of problems, mainly scientific ones involving lots of matrix operations; it is not as useful for other areas such as databases.
5. The technology of parallel processing is mature and can be exploited commercially; significant research and development work on tools and environments has already been done.
6. Significant developments in networking technology are paving the way for heterogeneous computing.

India launched a major initiative in parallel computing in 1988, with five or six independent projects to construct parallel processing systems. This was motivated by the need for advanced computing, a vision of developing its own technology, and the difficulties (political and economic) of obtaining commercial products.
CONCLUSION
With increasing human demands and greater expectations of technology, the importance of processing faster and at a larger scale is growing rapidly. In every sphere of computing, whether an application as ambitious as sending a spacecraft to another planet, launching a satellite, capturing images for meteorological purposes, or even imaging the secret hideouts of terrorists, the need for complex and fast computing is felt. With growing demands and expectations, parallel processing is the only way to go, be it in software or hardware.

Many algorithms designed by the developer community must be heavily prioritized, and some part of a large algorithm (as big as 10,000 lines) needs to be executed only if a certain condition is met. Other algorithms are time-sliced: they work as if running a marathon, with each piece scheduled to execute for a limited time frame. Consider a satellite application that must send images of the earth and oceans at strict points of time (day, evening, night), from which forecasts are made based on certain assumptions. This requires parallel processing to a great extent.

Even in hardware, multi-core processors are now used in home PCs, whereas earlier the privilege of multiple processors was restricted to mainframe computers. In new-generation mobiles (3G and 4G) and PDAs, parallel processing is already taking place. In Google's Android mobile project, parallel processing is used in the phones, providing an edge over rivals such as the Apple iPhone and the Nokia N96. All this shows the growing influence of parallelism and the declining use of purely sequential processing.

I would like to end this seminar report with one line about parallel computing.
“The future of computing is parallelism!”
ACKNOWLEDGEMENT
I take this opportunity to express my deep sense of gratitude and sincere thanks for the invaluable assistance that I received at the worthy hands of my guide, Dr. R. P. Adgaonkar.
I express my sincere thanks to our Head of the Computer Department, Dr. R. P. Adgaonkar, for permitting me to present this seminar, and also to all the staff members who have helped me directly or indirectly. I also express my thanks to my friends and well-wishers for their support during the preparation of this seminar. Finally, I would like to express my sincere gratitude towards my family for always being there. The credit goes to all of them.
Ameya Waghmare
BE CSE
Roll No. 41