Abstract Initiative Performance Analysis & Tuning / Capacity Management / Cluster (HA) & Grid Computing / LDAP
www.AbstractInitiative.com
Viral Grid Computing
Contents
Summary
The Workflow
Differences
Philosophy
Scalability
More Complex Architectures
Workloads
Permission, Trademarks, and Copyrights
Summary

This computing model has been around for a long time. I know this because the principal architects here at Abstract Initiative, LLC have been doing it, in some form or another, for around 10 years. You've seen it on news websites, possibly on television. It's cheap, powerful, efficient, and, most importantly, incredibly effective: thousands of desktop computers infected with some sort of computer virus, receiving instructions from an IRC chat room and acting on a task or tasks en masse.
The Workflow

The basic workflow is this:

1) A job source fills a queue with atomic tasks. (The queue can be satisfied perfectly well by a database and an application to insert/update/delete records - a simple queue.)
2) Preconfigured worker nodes pull tasks from the queue and update their status (running/run time/output/errors/completion/etc.).
3) A "queue manager" process checks for errors, failed jobs, etc. and, depending on the desired queue policies, can resubmit jobs that do not complete correctly, or can stall/isolate failed jobs if necessary.
4) An application reports on queue status, performance, errors, and job completion.

This scenario works very well for repetitive, high-volume, batch-type work - but the model can easily be adjusted for online/transactional work as well. (The type of work commonly performed by MPI environments - with a few changes to the scheduling model.)
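To make that workflow concrete, here is a minimal sketch in Python of a database-backed queue and a worker loop. It is an illustration only: SQLite stands in for whatever queue software or RDBMS you would actually use, and the "tasks" table layout and the do_work() function are assumptions, not part of any particular product.

    # viral_grid_queue.py - minimal sketch of a database-backed task queue.
    # Assumptions: SQLite stands in for the RDBMS; the "tasks" table layout
    # and do_work() are illustrative only.
    import json
    import sqlite3
    import time

    def open_queue(path="queue.db"):
        conn = sqlite3.connect(path)
        conn.execute("""CREATE TABLE IF NOT EXISTS tasks (
            id INTEGER PRIMARY KEY,
            payload TEXT,
            status TEXT DEFAULT 'queued',   -- queued / running / done / failed
            output TEXT,
            run_time REAL)""")
        return conn

    def submit(conn, payload):
        # Step 1: the job source fills the queue with atomic tasks.
        conn.execute("INSERT INTO tasks (payload) VALUES (?)", (json.dumps(payload),))
        conn.commit()

    def do_work(payload):
        # Placeholder for the real unit of work.
        return sum(payload["numbers"])

    def worker_loop(conn):
        # Step 2: a worker node pulls tasks and updates their status.
        while True:
            row = conn.execute(
                "SELECT id, payload FROM tasks WHERE status='queued' LIMIT 1").fetchone()
            if row is None:
                break                       # queue drained
            task_id, payload = row
            conn.execute("UPDATE tasks SET status='running' WHERE id=?", (task_id,))
            conn.commit()
            start = time.time()
            try:
                result = do_work(json.loads(payload))
                conn.execute(
                    "UPDATE tasks SET status='done', output=?, run_time=? WHERE id=?",
                    (json.dumps(result), time.time() - start, task_id))
            except Exception as exc:
                conn.execute("UPDATE tasks SET status='failed', output=? WHERE id=?",
                             (str(exc), task_id))
            conn.commit()

    if __name__ == "__main__":
        conn = open_queue()
        for i in range(10):
            submit(conn, {"numbers": list(range(i))})
        worker_loop(conn)
        print(conn.execute("SELECT status, COUNT(*) FROM tasks GROUP BY status").fetchall())

The "queue manager" from step 3 would be a similar loop that flips failed rows back to 'queued' (or quarantines them) according to your queue policy, and the reporting application from step 4 is little more than queries against the same table.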
Differences

Historically, in "HPC" or "grid" computing there are one or more "manager" or "head" nodes from which jobs are initiated. The jobs are then initialized and sent to the compute slices. This model isn't much different, except that it does away with the reliance on message-passing middleware and, possibly, some sort of distributed shared memory mechanism. Those mechanisms usually require non-standard programming and architectural knowledge on the part of the application developers, and can sometimes restrict the possible programming languages to those supported by the MPI or NUMA/shared memory middleware. Because the tasks previously managed by an MPI suite (MPI_Send/MPI_Recv/etc.) are replaced with simple communications between the compute slices and the "manager" node(s), a significant layer of complexity is removed.

Because this model does not provide a shared memory mechanism, attention must be paid at the workload architecture level to the atomicity of units of work, and some facility must be created to reassemble the "pieces" into the final product. There is no specific operating system, programming language, high-level network transfer protocol, queue software, hardware platform, or vendor that one needs to use to implement this sort of computing platform. This decision can be made for the specific workload, or for your infrastructure. We are effectively simplifying what was previously a complex computing operation.

A common infrastructure in HPC/grid computing is to interconnect all of the nodes with each other. This helps with shared memory access and utilization, data migration, and in some cases fencing in the event of node failure. Because we are breaking our workload into smaller, discrete, bite-sized pieces, we shouldn't need this additional infrastructure; the atomicity of the units of work should render it unnecessary. In the images below, the inter-node interconnects are red arrows. In real life, with a large number of compute slices, these red arrows become expensive and heavily used critical infrastructure.
[Figures: compute node topologies, with the inter-node interconnects drawn as red arrows]
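A hedged sketch of the "atomic units plus reassembly" idea described above, in Python: a large input is chopped into independent chunks, each chunk can be worked in isolation by any compute slice in any order, and a final step reassembles the results by chunk index. The chunk size and the per-chunk computation are arbitrary assumptions for illustration.

    # split_reassemble.py - sketch of atomic units of work plus reassembly.
    # The chunk size and the per-chunk computation are illustrative assumptions.

    def split(data, chunk_size=1000):
        # Each chunk is an independent, atomic unit of work; no shared memory needed.
        return [(i, data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]

    def work(chunk):
        # Stand-in for the real computation a compute slice would perform.
        return [x * x for x in chunk]

    def reassemble(results):
        # Results may come back in any order; the chunk index restores the order.
        merged = []
        for index, partial in sorted(results):
            merged.extend(partial)
        return merged

    if __name__ == "__main__":
        data = list(range(10_000))
        tasks = split(data)
        # In the real architecture these would be worked by many nodes; here, a loop
        # in reverse order stands in for out-of-order completion.
        results = [(index, work(chunk)) for index, chunk in reversed(tasks)]
        assert reassemble(results) == [x * x for x in data]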
In addition to the hardware infrastructure simplicity, there are further benefits to Viral Grid Computing with regard to batch work. Another common piece of the puzzle is some sort of job scheduling software. There are many packages out there, with a broad range of prices and functionality, but this type of architecture is capable of negating the need for that added complexity.
Philosophy

The source of this philosophy arguably comes from two sources:

1) The UNIX(R) philosophy of "Do one thing, do it well," as evidenced by the large number of small, single-function commands that specialize in specific tasks.

As well as:
2) The "virus" methodology of feeding work to an army of single or limited function "muscle" nodes (sometimes known as zombies).
Scalability

The Queue: Depending on how the work queue(s) is/are implemented, you may be able to scale the "manager" node(s) significantly in a horizontal fashion. RDBMS technologies like Oracle RAC(R), MySQL NDB, or other "cluster-able" database architectures can scale to the physical or logical limits of the RDBMS software. This can also be leveraged to increase the availability of the "manager" nodes or to increase their concurrent processing power.

Worker Nodes: Since the work is pulled, worked, and returned by the worker nodes, in the simplest examples of this architecture your practical limits are:

1) Physical resource limitations (rack space, network/storage bandwidth, network/storage ports, electricity, etc.).
2) The amount of available work and the amount of time/resources required to complete that work.
3) The capabilities of the code written to complete each unit of work, and the ability to reassemble the completed pieces (if necessary).

Reporting/Management Application: This is merely a user-queue interoperability mechanism. Its scalability depends on the complexity of the workload(s) and the capabilities and functionality of this application.
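One practical detail behind "the work is pulled ... by the worker nodes": when many workers share one queue, each worker must claim a task in a way that no other worker can claim it too. Below is a hedged sketch reusing the illustrative "tasks" table from the earlier example; the optimistic UPDATE-and-check shown here is only one of several ways to do this, and a clustered RDBMS may offer better primitives (row locks, SKIP LOCKED, and so on).

    # claim_task.py - sketch of how many workers can safely share one queue.
    # Assumes a connection to the "tasks" table from the earlier sketch;
    # the table and column names are illustrative.

    def claim_next_task(conn):
        # Atomically claim one queued task, or return None if the queue is empty.
        while True:
            row = conn.execute(
                "SELECT id, payload FROM tasks WHERE status='queued' LIMIT 1").fetchone()
            if row is None:
                return None
            task_id, payload = row
            # Optimistic claim: only succeeds if no other worker got here first.
            cur = conn.execute(
                "UPDATE tasks SET status='running' WHERE id=? AND status='queued'",
                (task_id,))
            conn.commit()
            if cur.rowcount == 1:
                return task_id, payload
            # Another worker claimed it between our SELECT and UPDATE; try again.

Adding capacity is then just starting more copies of the worker process on more nodes; nothing about the queue itself has to change, which is why the practical limits reduce to the three items listed above.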
More Complex Architectures

Because this is nothing more than a multi-server queue (M/M/c in Kendall notation), it can also be expanded into a multi-queue/multi-server queuing network (open or closed). If you think of the base architecture as people lining up at a bank to get to a teller, the more complex architectures can be seen as:

1) Going to the bank.
2) Then going to the DMV.
3) Then a short shopping spree at your local shopping mall.

The high-level description of this queuing network could simply be named "errands". There is no rule that says you can't have more than one queue, or that your worker nodes (compute slices) can't be multi-talented. You simply have to allow for this in your workload design.
The same thing can be done with MPI, but for the most part this is fairly uncommon because of the complexities of MPI architectures and the development time and effort required for implementation. To implement it here, the management application simply needs to include the logic to migrate a job/task to another queue upon its completion (or upon assembly of the completed jobs) in the previous queue.
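A hedged sketch of that migration logic in Python, extending the illustrative "tasks" table from earlier with a "queue" column: when the queue manager sees a finished task in one queue, it enqueues a follow-on task in the next queue of the chain. The "errands" chain (bank -> dmv -> mall), the table layout, and all names are assumptions for illustration.

    # queue_network.py - sketch of migrating finished work to the next queue in a chain.
    # The chain, the table layout, and all names are illustrative assumptions.
    import sqlite3

    CHAIN = {"bank": "dmv", "dmv": "mall", "mall": None}   # the "errands" network

    def open_network(path="errands.db"):
        conn = sqlite3.connect(path)
        conn.execute("""CREATE TABLE IF NOT EXISTS tasks (
            id INTEGER PRIMARY KEY,
            queue TEXT,
            payload TEXT,
            status TEXT DEFAULT 'queued',
            output TEXT,
            migrated INTEGER DEFAULT 0)""")
        return conn

    def migrate_completed(conn):
        # Queue-manager pass: push each finished task into the next queue in the chain.
        rows = conn.execute(
            "SELECT id, queue, output FROM tasks "
            "WHERE status='done' AND migrated=0").fetchall()
        for task_id, queue, output in rows:
            next_queue = CHAIN.get(queue)
            if next_queue is not None:
                # The output of one stage becomes the payload of the next stage.
                conn.execute(
                    "INSERT INTO tasks (queue, payload) VALUES (?, ?)",
                    (next_queue, output))
            conn.execute("UPDATE tasks SET migrated=1 WHERE id=?", (task_id,))
        conn.commit()

Multi-talented worker nodes then simply filter their pull on the queue column (WHERE queue='dmv', for instance), or on any set of queues they know how to serve.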
Workloads

Almost every IT shop has a workload or two that would benefit greatly from this sort of architecture. An easy way of picking the right workload is to find one that has a large number of discrete units of work, or a number of stages. One type of application that comes to mind is a payroll system: in this case the units of work are paycheck amounts, accrued vacation & sick time, and items of that sort. Traditionally a payroll system will work through these calculations in a serial manner. To make this more efficient, simply take the information needed to make these calculations, split it up (by job band, by business unit, or whatever), and put it in a queue for a compute node to process. Instead of processing each payroll instance one at a time, you can send them out in blocks and reassemble them upon completion. This can be done with MPI as well, but vendor support for this may vary. You may not even need to make many changes to an existing payroll system to do it this way (licensing costs may be another concern) - simply have a compute node per business unit (or whatever your distinction may be) and use the database that stores this information as an informal "queue".
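A rough, hedged sketch of that payroll example in Python: group the employee records by business unit, treat each block as one queueable unit of work, and let any available compute node work a block. The field names, the grouping key, and calculate_paycheck() are assumptions for illustration, not a description of any real payroll package.

    # payroll_blocks.py - sketch of splitting payroll work into per-business-unit blocks.
    # Field names, the grouping key, and calculate_paycheck() are illustrative assumptions.
    from collections import defaultdict

    def split_by_business_unit(employees):
        # Turn one big serial run into independent blocks, one per business unit.
        blocks = defaultdict(list)
        for emp in employees:
            blocks[emp["business_unit"]].append(emp)
        return blocks   # each value is one queueable unit of work

    def calculate_paycheck(emp):
        # Stand-in for the real payroll calculation.
        return {"employee_id": emp["id"], "gross": emp["hours"] * emp["rate"]}

    def process_block(block):
        # A compute node works one block, independently of every other block.
        return [calculate_paycheck(emp) for emp in block]

    if __name__ == "__main__":
        employees = [
            {"id": 1, "business_unit": "manufacturing", "hours": 80, "rate": 21.50},
            {"id": 2, "business_unit": "manufacturing", "hours": 72, "rate": 23.00},
            {"id": 3, "business_unit": "sales", "hours": 80, "rate": 30.00},
        ]
        # Reassemble the completed blocks into the final payroll run.
        payroll_run = []
        for unit, block in split_by_business_unit(employees).items():
            payroll_run.extend(process_block(block))
        print(payroll_run)

In the "informal queue" variant described above, the blocks would not even need to be materialized; each compute node would simply query the existing payroll database for its own business unit.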
Permission, Trademarks, and Copyrights

The contents of this document were assembled from documents originally authored by Brian Horan between August 1998 and November of 2008 (except for the "Summary" section and the mentions of specific vendors' RDBMS products), used with written permission and copyright 2002-2009 Brian Horan. UNIX is a registered trademark of The Open Group. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. MySQL is a trademark of Sun Microsystems, Inc. in the United States, the European Union and other countries. Abstract Initiative, LLC is independent of Sun Microsystems, Inc.
Abstract Initiative, LLC All Rights Reserved