  May 2020
Some General Parallel Terminology

Like everything else, parallel computing has its own "jargon". Some of the more commonly used terms associated with parallel computing are listed below. Most of these will be discussed in more detail later.

Task
A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor.

A task is "an execution path through address space": a set of program instructions loaded in memory. Once the address registers have been loaded with the initial address of the program, the CPU starts execution at the next clock cycle, in accordance with the program. The sense is that some part of "a plan is being accomplished". As long as the program remains in this part of the address space, the task can continue, in principle indefinitely, unless the program instructions contain a halt, exit, or return.

• In the computer field, "task" has the sense of a real-time application, as distinguished from a process, which takes up space (memory) and execution time. See operating system.
  o Both "task" and "process" should be distinguished from an event, which takes place at a specific time and place, and which can be planned for in a computer program. In a computer graphical user interface (GUI), an event can be as simple as a mouse click or keystroke.

Parallel Task
A task that can be executed by multiple processors safely (that is, it yields correct results). Task parallelism (also known as function parallelism or control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments. It focuses on distributing execution processes (threads) across different parallel computing nodes, and it contrasts with data parallelism, another form of parallelism.

Serial Execution
Execution of a program sequentially, one statement at a time. In the simplest sense, this is what happens on a one-processor machine. However, virtually all parallel tasks will have sections of a parallel program that must be executed serially.

Parallel Execution
Execution of a program by more than one task, with each task being able to execute the same or a different statement at the same moment in time; that is, the simultaneous use of more than one CPU or processor core to execute a program or multiple computational threads.
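The contrast between serial execution and task parallelism can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two functions and the sample text are hypothetical, and Python threads are used to show the structure of distributing different functions to different workers, not to demonstrate a speedup (in CPython, threads running pure Python code generally do not execute at the same instant).

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(text):
    """First task: count whitespace-separated words."""
    return len(text.split())


def count_vowels(text):
    """Second task: count vowels, independent of the first task."""
    return sum(ch in "aeiou" for ch in text.lower())


def run_serial(text):
    """Serial execution: one task after the other."""
    return count_words(text), count_vowels(text)


def run_task_parallel(text):
    """Task parallelism: different functions handed to different workers."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        words = pool.submit(count_words, text)
        vowels = pool.submit(count_vowels, text)
        return words.result(), vowels.result()


SAMPLE = "parallel computing has its own jargon"  # hypothetical input
```

Both versions must return the same result; that is what "safely" means in the definition of a parallel task.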

Ideally, parallel processing makes programs run faster because there are more engines (CPUs or cores) running them. In practice, it is often difficult to divide a program in such a way that separate CPUs or cores can execute different portions without interfering with each other.

Most computers have just one CPU, but some models have several, and multicore processor chips are becoming the norm. There are even computers with thousands of CPUs. With single-CPU, single-core computers, it is possible to perform parallel processing by connecting the computers in a network. However, this type of parallel processing requires very sophisticated software called distributed processing software.

Note that parallel processing differs from multitasking, in which a CPU provides the illusion of simultaneously executing instructions from multiple different programs by rapidly switching between them, or "interleaving" their instructions. Parallel processing is also called parallel computing. In the quest for cheaper computing alternatives, parallel processing provides a viable option: the idle processor cycles available across a network can be used effectively by sophisticated distributed computing software.

Pipelining
Breaking a task into steps performed by different processor units, with inputs streaming through, much like an assembly line; a type of parallel computing. In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.

As the assembly line analogy suggests, pipelining does not decrease the time for a single datum to be processed; it only increases the throughput of the system when processing a stream of data. Deeper pipelining increases latency, the time required for a signal to propagate through the full pipeline.
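The throughput-versus-latency point can be checked with simple arithmetic. The sketch below assumes an idealized pipeline in which every stage takes the same time and buffering costs nothing; the function names are illustrative, not taken from any library.

```python
def serial_time(items, stages, stage_time):
    """Total time when each item passes through all stages before the next starts."""
    return items * stages * stage_time


def pipelined_time(items, stages, stage_time):
    """Total time for an ideal pipeline: fill the pipe once, then one item per step."""
    if items == 0:
        return 0
    return (stages + items - 1) * stage_time


def item_latency(stages, stage_time):
    """Time for a single item to traverse the whole pipeline (unchanged by pipelining)."""
    return stages * stage_time
```

For 100 items and 4 stages of 1 time unit each, the serial total is 400 units and the pipelined total is 103 units, yet each individual item still needs 4 units to pass through. Adding stages raises that per-item latency.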
A pipelined system typically requires more resources (circuit elements, processing units, computer memory, etc.) than one that executes one batch at a time, because its stages cannot reuse the resources of a previous stage. Moreover, pipelining may increase the time it takes for an individual instruction to finish.

Shared Memory
From a strictly hardware point of view, describes a computer architecture where all processors have direct (usually bus-based) access to common physical memory. In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists.

In computing, shared memory is memory that may be simultaneously accessed by multiple programs with the intent of providing communication among them or avoiding redundant copies. Depending on context, programs may run on a single processor or on multiple separate processors. Using memory for communication inside a single program, for example among its multiple threads, is generally not referred to as shared memory.

Symmetric Multi-Processor (SMP)
Hardware architecture where multiple processors share a single address space and access to all resources; shared memory computing. In computing, symmetric multiprocessing or SMP involves a multiprocessor computer architecture where two or more identical processors can connect to a single shared main memory. Most common multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors.

SMP systems allow any processor to work on any task no matter where the data for that task are located in memory; with proper operating system support, SMP systems can easily move tasks between processors to balance the workload efficiently.

SMP has many uses in science, industry, and business, which often use custom-programmed software for multithreaded (multitasked) processing. However, most consumer products such as word processors and computer games are written in such a manner that they cannot gain large benefits from concurrent systems. For games this is usually because writing a program to increase performance on SMP systems can produce a performance loss on uniprocessor systems. Recently, however, multi-core chips have become more common in new computers, and the balance between installed uni- and multi-core computers may change in the coming years.
The nature of the different programming methods would generally require two separate code-trees to support both uniprocessor and SMP systems with maximum performance.

Programs running on SMP systems may experience a performance increase even when they have been written for uniprocessor systems. This is because hardware interrupts that usually suspend program execution while the kernel handles them can execute on an idle processor instead. The effect in most applications (e.g. games) is not so much a performance increase as the appearance that the program is running much more smoothly. In some applications, particularly compilers and some distributed computing projects, one will see an improvement by a factor of (nearly) the number of additional processors.

In situations where more than one program executes at the same time, an SMP system will have considerably better performance than a uniprocessor system because different programs can run on different CPUs simultaneously.

Systems programmers must build support for SMP into the operating system; otherwise, the additional processors remain idle and the system functions as a uniprocessor system.

In cases where an SMP environment processes many jobs, administrators often experience a loss of hardware efficiency. Software programs have been developed to schedule jobs so that processor utilization reaches its maximum potential. Good software packages can achieve this by scheduling each CPU separately, and can also integrate multiple SMP machines and clusters.

Access to RAM is serialized; this and cache coherency issues cause performance to lag slightly behind the number of additional processors in the system.
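The job-scheduling idea mentioned above can be illustrated with a small sketch. It assumes a simple greedy heuristic (longest job first, always placed on the least-loaded CPU) and hypothetical job durations; it is not a description of how any particular scheduling package works.

```python
import heapq


def schedule_jobs(job_times, cpu_count):
    """Greedy sketch: give each job, longest first, to the least-loaded CPU.

    Returns (assignment, makespan), where assignment maps a CPU index to the
    list of job durations placed on it and makespan is the busiest CPU's load.
    """
    # Min-heap of (current_load, cpu_index) so the least-loaded CPU pops first.
    heap = [(0, cpu) for cpu in range(cpu_count)]
    heapq.heapify(heap)
    assignment = {cpu: [] for cpu in range(cpu_count)}

    for job in sorted(job_times, reverse=True):
        load, cpu = heapq.heappop(heap)
        assignment[cpu].append(job)
        heapq.heappush(heap, (load + job, cpu))

    makespan = max(sum(jobs) for jobs in assignment.values())
    return assignment, makespan


def utilization(job_times, cpu_count, makespan):
    """Fraction of available CPU time spent on jobs (1.0 means no idle time)."""
    if makespan == 0:
        return 0.0
    return sum(job_times) / (cpu_count * makespan)
```

With hypothetical jobs of 7, 5, 4, 3, 3 and 2 time units on two CPUs, this heuristic places 12 units on each CPU, so neither processor sits idle.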

Distributed Memory
In hardware, refers to network-based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.

Communications
Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed.

Synchronization
The coordination of parallel tasks in real time, very often associated with communications. It is often implemented by establishing a synchronization point within an application where a task may not proceed further until another task (or tasks) reaches the same or a logically equivalent point. Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application's wall clock execution time to increase.

Granularity
In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.

• Coarse: relatively large amounts of computational work are done between communication events
• Fine: relatively small amounts of computational work are done between communication events
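Granularity is described as a qualitative measure, but a rough number can make the coarse/fine distinction concrete. In the sketch below, a list is summed in chunks and each partial result handed back counts as one communication event; the chunk sizes and the "values per event" ratio are illustrative assumptions, not a standard metric.

```python
def chunked_sum(values, chunk_size):
    """Sum values chunk by chunk.

    Returns (total, events, work_per_event): each partial sum stands in for
    one communication event, and work_per_event is the number of values
    processed per event.
    """
    partials = []
    for start in range(0, len(values), chunk_size):
        # One chunk of computation followed by one "communication" of its result.
        partials.append(sum(values[start:start + chunk_size]))

    events = len(partials)
    work_per_event = len(values) / events if events else 0.0
    return sum(partials), events, work_per_event
```

Summing 1000 values in chunks of 250 gives 4 communication events (coarse grained), while chunks of 10 give 100 events for the same result (fine grained).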

Parallel Overhead
The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel overhead can include factors such as:

• Task start-up time
• Synchronizations
• Data communications
• Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
• Task termination time
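Synchronization, one of the overhead sources listed above, can be sketched with a barrier: no task proceeds past the synchronization point until every task has reached it. The example uses Python's `threading.Barrier`; the task body and the log format are hypothetical.

```python
import threading


def run_with_barrier(task_count):
    """Run task_count threads that all wait at one synchronization point.

    Returns a log of ("before", id) and ("after", id) entries in the order
    they were recorded.
    """
    barrier = threading.Barrier(task_count)
    log = []
    log_lock = threading.Lock()  # protects the shared log

    def task(task_id):
        with log_lock:
            log.append(("before", task_id))
        barrier.wait()  # blocks until all task_count threads have arrived
        with log_lock:
            log.append(("after", task_id))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(task_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log
```

Every "before" entry is recorded ahead of any "after" entry, because the earliest arrivals wait for the slowest task. That waiting time is the overhead.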

Massively Parallel
Refers to the hardware that comprises a given parallel system: having many processors. The meaning of "many" keeps increasing, but currently the largest parallel computers comprise processors numbering in the hundreds of thousands.

Embarrassingly Parallel
Solving many similar, but independent, tasks simultaneously; little to no need for coordination between the tasks. In parallel computing, an embarrassingly parallel workload (or embarrassingly parallel problem) is one for which little or no effort is required to separate the problem into a number of parallel tasks. This is often the case where there exists no dependency (or communication) between those parallel tasks.[1] Embarrassingly parallel problems are ideally suited to distributed computing and are also easy to perform on server farms which do not have any of the special infrastructure used in a true supercomputer cluster.

Scalability
Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more processors. Factors that contribute to scalability include:

• Hardware, particularly memory-CPU bandwidths and network communications
• Application algorithm
• Related parallel overhead
• Characteristics of your specific application and coding
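Scalability is usually judged from measured run times. A minimal sketch, assuming the common definitions of speedup (serial time divided by parallel time) and efficiency (speedup divided by processor count):

```python
def speedup(serial_time, parallel_time):
    """How many times faster the parallel run is than the serial run."""
    return serial_time / parallel_time


def efficiency(serial_time, parallel_time, processors):
    """Speedup per processor; 1.0 means perfectly proportionate scaling."""
    return speedup(serial_time, parallel_time) / processors
```

As a hypothetical example, a program that takes 100 seconds on one processor and 20 seconds on eight has a speedup of 5.0 but an efficiency of only 0.625. The shortfall from 1.0 reflects the factors listed above.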

Multi-core Processors
Multiple processors (cores) on a single chip. A multi-core processor is a processing system composed of two or more independent cores (or CPUs). The cores are typically integrated onto a single integrated circuit die (known as a chip multiprocessor or CMP), or they may be integrated onto multiple dies in a single chip package. A many-core processor is one in which the number of cores is large enough that traditional multiprocessor techniques are no longer efficient (this threshold is somewhere in the range of several tens of cores) and which likely requires a network on chip.

Cluster Computing

Use of a combination of commodity units (processors, networks or SMPs) to build a parallel system. A computer cluster is a group of linked computers, working together closely so that in many respects they form a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks. Clusters are usually deployed to improve performance and/or availability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.[1]

Supercomputing / High Performance Computing
Use of the world's fastest, largest machines to solve large problems. A supercomputer (a mainframe computer that is one of the most powerful available at a given time) is a computer at the frontline of current processing capacity, particularly speed of calculation.

Supercomputers introduced in the 1960s were designed primarily by Seymour Cray at Control Data Corporation (CDC), and led the market into the 1970s until Cray left to form his own company, Cray Research. He then took over the supercomputer market with his new designs, holding the top spot in supercomputing for five years (1985–1990). In the 1980s a large number of smaller competitors entered the market, in parallel to the creation of the minicomputer market a decade earlier, but many of these disappeared in the mid-1990s "supercomputer market crash".

Today, supercomputers are typically one-of-a-kind custom designs produced by "traditional" companies such as Cray, IBM and Hewlett-Packard, which purchased many of the 1980s companies to gain their experience. As of July 2009, the IBM Roadrunner, located at Los Alamos National Laboratory, was the fastest supercomputer in the world. The term supercomputer itself is rather fluid, and today's supercomputer tends to become tomorrow's ordinary computer.
CDC's early machines were simply very fast scalar processors, some ten times the speed of the fastest machines offered by other companies. In the 1970s most supercomputers were built around a vector processor, and many of the newer players developed their own such processors at a lower price to enter the market. In the early and mid-1980s, machines with a modest number of vector processors working in parallel became the standard. Typical numbers of processors were in the range of four to sixteen. In the later 1980s and 1990s, attention turned from vector processors to massively parallel processing systems with thousands of "ordinary" CPUs, some being off-the-shelf units and others being custom designs. Today, parallel designs are based on off-the-shelf server-class microprocessors, such as the PowerPC, Opteron, or Xeon, and most modern supercomputers are now highly tuned computer clusters using commodity processors combined with custom interconnects.

Grid Computing

Grid computing is the most distributed form of parallel computing. It makes use of computers communicating over the Internet to work on a given problem. Because of the low bandwidth and extremely high latency available on the Internet, grid computing typically deals only with embarrassingly parallel problems. Many grid computing applications have been created, of which SETI@home and Folding@Home are the best-known examples.[31]

Most grid computing applications use middleware, software that sits between the operating system and the application to manage network resources and standardize the software interface. The most common grid computing middleware is the Berkeley Open Infrastructure for Network Computing (BOINC). Often, grid computing software makes use of "spare cycles", performing computations at times when a computer is idling.
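The reason grid computing suits embarrassingly parallel problems is that the work can be cut into units that never need to talk to each other. The sketch below shows that structure only: the work-unit functions are hypothetical, a local thread pool stands in for volunteer machines, and nothing here uses or resembles the BOINC API.

```python
from concurrent.futures import ThreadPoolExecutor


def make_work_units(start, stop, unit_size):
    """Split the range [start, stop) into independent (lo, hi) work units."""
    return [(lo, min(lo + unit_size, stop)) for lo in range(start, stop, unit_size)]


def process_unit(unit):
    """Work done on one machine: needs only its own unit, no communication."""
    lo, hi = unit
    return sum(n * n for n in range(lo, hi))


def run_grid(start, stop, unit_size, volunteers=4):
    """Hand units to workers and combine the results at the end."""
    units = make_work_units(start, stop, unit_size)
    with ThreadPoolExecutor(max_workers=volunteers) as pool:
        return sum(pool.map(process_unit, units))
```

The only communication is sending out a unit and receiving one number back, so high network latency costs little relative to the computation in each unit.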

