Fault Tolerance In Campus Grids Irb

  • November 2019
Fault Tolerance in Campus Grids

1. Problem Formulation

The promise of Grid computing for effectively harnessing computing resources across organizations is enormous. The Grid research community has made great strides in defining middleware for core job management and security services. In a grid environment there are potentially thousands of resources, services and applications that need to interact to make the grid usable as an execution platform. Since these elements are extremely heterogeneous, there are many ways to fail, including not only independent failures of each element but also failures that result from interactions between them. Because of the inherent instability of grid environments, fault detection and recovery is another critical component that must be addressed. The need for fault tolerance is especially acute for large parallel applications, since the failure rate grows with the number of processors and the duration of the computation.

Fault tolerance is the survival attribute of computer systems. The function of fault tolerance is “…to preserve the delivery of expected services despite the presence of fault-caused errors within the system itself. Errors are detected and corrected, and permanent faults are located and removed while the system continues to deliver acceptable service.”

Two basic problems in grid fault management are:

First, existing solutions for failure diagnosis and correction mainly address information collection. In principle one needs to know only what a software component does, but when such a component breaks, one also needs to know how the component works.

Second, the fault tolerance schemes implemented on grids today tolerate only crash failures. Since grids are prone to more complex failures, such as heisenbugs, tougher failures also need to be tolerated.

Most fault tolerance systems on the market use heartbeat techniques to detect failures in grids, and they handle those failures with checkpointing and replica management techniques in one form or another. In my thesis I will try to improve the fault tolerance mechanism of the grid that we have set up. The proposed work explores various grid fault tolerance mechanisms in order to develop a complete, accurate, consistent, scalable, flexible, and adaptive fault tolerance system.
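The heartbeat-based detection mentioned above can be illustrated with a minimal sketch. It is not taken from any system named in this proposal: the class `HeartbeatMonitor`, its `timeout` parameter, and the injectable `clock` are hypothetical names chosen for illustration. The sketch assumes a single central monitor that suspects a node once no heartbeat has arrived from it within the timeout.

```python
import time


class HeartbeatMonitor:
    """Hypothetical centralized detector: a node is suspected when its
    most recent heartbeat is older than `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock      # injectable so the logic can be tested without sleeping
        self.last_seen = {}     # node id -> time of the most recent heartbeat

    def heartbeat(self, node):
        """Record a heartbeat message received from `node`."""
        self.last_seen[node] = self.clock()

    def suspected(self):
        """Return the sorted list of nodes whose heartbeat is overdue."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)
```

A fixed timeout like this can only suspect a node, not prove that it crashed: a slow network link produces the same symptom, which is one reason accuracy appears among the goals listed above.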


2. Objective

1. To study the kinds of faults¹ that can occur in grids.
2. To study and analyze fault tolerance mechanisms².
3. To study different fault tolerance algorithms³.
4. To explore the possibility of fault occurrence in the grid.
5. To demonstrate the usability of the proposed technique in fault tolerance to provide better QoS.

Kinds of Failures¹
- Configuration
- Middleware
- Application
- Hardware

Fault Tolerance Mechanisms²
- Checkpointing
  o Coordinated Checkpointing
  o Uncoordinated Checkpointing
- Replica Management
- Heartbeating Mechanism
  o Centralized Heartbeating
  o Heartbeating along a virtual ring
  o All-to-all heartbeating
  o Membership Management Protocol

Fault Tolerance Algorithms³
- Consensus Algorithm
- Dynamic Heartbeat Grouping Algorithm
  o New Node(s) Join the Grid
  o Node(s) Fail or Detached From The Grid
- Gossip Protocol
- SWIM group membership protocol
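As an illustration of the gossip entry in the list above, the following is a minimal sketch of one common formulation of gossip-style failure detection. It makes several assumptions: each node keeps a table of heartbeat counters, pushes that table to one randomly chosen other node per round, and suspects any node whose counter has not risen for more than `t_fail` local rounds. The names `GossipNode`, `gossip_round`, and `t_fail` are hypothetical, and the sketch is not the algorithm of any particular system named in this proposal.

```python
import random


class GossipNode:
    """Hypothetical node in a gossip-style heartbeat scheme."""

    def __init__(self, name):
        self.name = name
        self.round = 0
        self.table = {name: 0}      # node -> highest heartbeat counter seen
        self.updated = {name: 0}    # node -> local round when that counter last rose

    def tick(self):
        """Advance one local round and increment our own heartbeat counter."""
        self.round += 1
        self.table[self.name] += 1
        self.updated[self.name] = self.round

    def merge(self, remote_table):
        """Adopt any counter in `remote_table` that is newer than ours."""
        for node, counter in remote_table.items():
            if counter > self.table.get(node, -1):
                self.table[node] = counter
                self.updated[node] = self.round

    def suspected(self, t_fail):
        """Nodes whose counter has not risen for more than `t_fail` rounds."""
        return sorted(n for n, r in self.updated.items()
                      if self.round - r > t_fail)


def gossip_round(nodes, alive, rng=random):
    """One round: every live node ticks, then pushes its table to one
    randomly chosen other node. Needs at least two nodes in `nodes`."""
    for name in sorted(alive):
        sender = nodes[name]
        sender.tick()
        target = rng.choice([n for n in nodes if n != name])
        if target in alive:         # messages to crashed nodes are simply lost
            nodes[target].merge(sender.table)
```

Compared with a central monitor, no single node has to receive every heartbeat, at the cost of a detection delay that depends on how quickly information spreads.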


3. Methodology

1. Literature Survey: To study the existing¹ and forthcoming techniques for fault tolerance.
2. Analysis: Find different fault tolerance mechanisms.
3. Setup: Set up a grid environment using Globus and Sun N1 Grid Engine.
4. Propose: A new fault tolerance system for the grid that has been set up.
5. Experimentation & Write up: Demonstrate and publish the usability of the improved fault tolerance system.

Existing Solutions¹
- GALLOP: Replicates SPMD (Single Program, Multiple Data) applications at different sites within a VO.
- WQR: Fault-tolerant scheduler for bag-of-tasks applications. If the system fails, it automatically reschedules.
- Legion and Condor Checkpointing Recovery
  o Legion: Checkpointing at application level
  o Condor: Checkpointing at system level
- GEMS (Grid Enactor and Management Service): Uses the concept of a DQ server and NQ monitor clients.
- GUSTO: A grid testbed that uses HBM, currently spans over 20 institutions, and detects failures in NetSolve.
- N1 Grid Engine 6 Checkpoint Recovery
  o N1GE6 does not provide any checkpointing tools but has built-in support for the integration of third-party tools
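To make the idea of application-level checkpointing concrete, here is a minimal sketch. It does not reproduce how Legion, Condor, or N1GE6 implement checkpointing. It assumes a simple loop that periodically saves its own state to a JSON file and resumes from that file on restart. The functions `save_checkpoint`, `load_checkpoint`, and `sum_squares`, and the `fail_at` crash hook, are hypothetical names used only for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` atomically, so a crash mid-write keeps the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)   # atomic rename over the previous checkpoint


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def sum_squares(n, path, every=100, fail_at=None):
    """Sum i*i for i in range(n), checkpointing every `every` iterations.

    `fail_at` is a hypothetical hook that simulates a crash at that iteration.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    i, total = state["i"], state["total"]
    while i < n:
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        total += i * i
        i += 1
        if i % every == 0:
            save_checkpoint(path, {"i": i, "total": total})
    return total
```

After a crash, calling `sum_squares` again with the same `path` redoes only the iterations since the last checkpoint. System-level checkpointing, by contrast, saves the whole process image without the application's cooperation.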


4. Work Plan

Months: Jan-2006, Feb-2006, Mar-2006, Apr-2006, May-2006

Activities: Literature Survey, Analysis, Setup, Propose, Experimentation & Write Up, Documentation

[Gantt chart: activities by month; the schedule bars are not recoverable]
