GridServer Administration Guide Version 4.2
The GridServer Administration Series Proprietary and Confidential
Confidentiality and Disclaimer
Neither this document nor any of its contents may be used or disclosed without the express written consent of DataSynapse. This document does not carry any right of publication or disclosure to any other party. While the information provided herein is believed to be accurate and reliable, DataSynapse makes no representations or warranties, express or implied, as to the accuracy or completeness of such information. Only those representations and warranties contained in a definitive license agreement shall have any legal effect. In furnishing this document, DataSynapse reserves the right to amend or replace it at any time and undertakes no obligation to provide the recipient with access to any additional information. Nothing contained within this document is or should be relied upon as a promise or representation as to the future.

This product includes software developed by the Apache Software Foundation (www.apache.org/). This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (www.openssl.org/). This product includes code licensed from RSA Data Security (java.sun.com/products/jsse/LICENSE.html).

DataSynapse GridServer Administration Guide Version 4.2
Copyright © 2006 DataSynapse, Inc. All Rights Reserved.

GridServer® is a registered trademark, DataSynapse, FabricServer™, the DataSynapse logo, LiveCluster™, and GridClient™ are trademarks, and GRIDesign is a servicemark of DataSynapse, Inc. Protected by U.S. Patent No. 6,757,730. Other patents pending. WebSphere® is a registered trademark and CloudScape™ is a trademark of International Business Machines Corporation in the United States, other countries, or both. All other product names are trademarks or registered trademarks of their respective companies.

DataSynapse, Inc.
632 Broadway, 5th Floor; New York, NY 10012
Tel: 212.842.8842
Fax: 212.842.8843
Email: [email protected]
Web: www.datasynapse.com

For technical support issues and product updates, please visit customer.datasynapse.com. We appreciate any comments or suggestions you may have about this manual or other DataSynapse documentation. Please send your feedback to [email protected].

2212006
Contents

Confidentiality and Disclaimer
Contents

Chapter 1 - Introduction
    Before you begin
    GridServer 4.2 Documentation Roadmap
    GridServer Guides
    Other Documentation and Help
    Document Conventions

Chapter 2 - Work
    Introduction
    Services
    Clients
    Service Implementations
    Service Session
    Service benefits
    Jobs
    Job Benefits
    Binary-level Integration

Chapter 3 - Engine Balancing and Client Routing
    Introduction
    Client Routing
    Allowed Brokers Set
    Client Properties Rules
    Driver API
    Engine Routing and Balancing
    Engine Weight-Based Balancer
    Home/Shared Balancer
    Engine Balancer Configuration
    Failover Brokers
    Engine Upper and Lower Bounds
    Example Use Cases
    N+1 Failover with Weighting
    Engine Localization with Sharing

Chapter 4 - Grid Fault-Tolerance and Failover
    Introduction
    The Fault-tolerant GridServer Deployment
    Heartbeats and Failure Detection
    Manager Stability Features
    Engine Failure
    Driver Failure
    Director Failure
    Broker Failure
    Failover Brokers
    Fault-Tolerant Tasks
    Batch Fault-Tolerance
    GridCache Fault-Tolerance
    Client
    Broker Restart
    Failover

Chapter 5 - Scheduling
    Introduction
    Reschedules and Retries
    Retry
    Reschedule
    Timeout Behavior
    The Scheduler
    Scheduler Overview
    Service Priority
    Usage Algorithm
    Time Algorithm
    Serial Priority Algorithm
    Urgent Priority Services and Preemption
    Engine Blacklisting
    Conditions
    Redundant Task Rescheduling

Chapter 6 - The GridServer Administration Tool
    Introduction
    Getting Started
    User Accounts and Access Levels
    Creating User Accounts
    Features Available by Access Level
    User Account Security
    Navigating the Administration Tool
    The Home Page
    Tabs
    Shortcut buttons
    Action Controls
    Links on other pages
    Using Tables
    Pager control
    Search control
    Personalize Table
    Refresh
    Broker and Director Monitors
    Manager Component Indicator
    Status Display

Chapter 7 - Application Resource Deployment
    Introduction
    Grid Libraries
    Grid Library Format
    Using Grid Libraries from a Service
    Deployment
    Grid Library Manager
    C++ Bridges
    JREs
    Grid Library Example
    Legacy Resource Deployment
    Using Default Resources
    Default Resource Paths
    C++ Bridges
    Grid Library features not supported by Default Resources
    Code Versioning Deprecation
    Resource Deployment: Distributing Grid Libraries and Default Resources
    The Resource Deployment Interface
    Resource Deployment File Locations
    Configuring Directory Replication
    Using Engines with Shared Network Directories
    JAR Ordering File
    Remote Application Installation
    Service Run-As
    Types of Credentials
    Using Run-As

Chapter 8 - The Batch Scheduling Facility
    Introduction
    Terminology
    Editing Batch Definitions
    Batch Components
    Service Runners
    Scheduling Batch Definitions
    The Batch Schedule Page
    Running Batches
    Deploying Batch Resources
    Batch Fault-Tolerance
    Using PDriver in a Batch

Chapter 9 - Configuring Security
    Introduction
    Authentication
    Operating System Users
    Grid Users
    GridServer Built-In Authentication
    Extensible Authentication Hooks
    Enabling Client Authentication
    SSL
    Communication Overview
    Certificate Overview
    Keypair and Cert Location
    Types of Connections Using SSL
    Enabling HTTPS on the Application Server
    Enabling HTTPS on all Components
    Driver SSL
    Engines and Engine Daemon SSL
    Brokers and Director SSL
    Resources over HTTPS
    Disabling HTTP
    Resource Protection

Chapter 10 - GridServer Performance and Tuning
    Diagnosing Performance Problems
    Tuning Data Movement
    Stateful Processing
    Compression
    Packing
    Direct Data Transfer
    Shared Directories and DDT
    Caching
    Data References
    Tasks Per Message
    Invocations Per Message
    Tuning for Large Grids

Chapter 11 - Diagnosing GridServer Issues
    Troubleshooting
    Obtaining Log Files
    Manager Logs
    Engine and Daemon Logs
    Driver Logs
    Application Server Logs

Chapter 12 - Administration Howto
    Backup / Restore
    Backup Procedure
    Restore Procedure
    Manager Configuration
    Applying a patch or service pack to GridServer
    Importing and Exporting Manager Configuration
    Installing Manager Licenses
    Setting the SMTP host
    Setting Up a Failover Broker
    Configuring SNMP
    Enabling Enhanced Task Instrumentation
    Engine Management
    Deploying Files to Engines
    Updating the Windows Engine JRE
    Updating the Unix Engine JRE
    Setting the Director Used by Engines
    Running Services
    Running MPI Jobs using PDriver
    Registering a Service Type
    Creating and Running a Batch
    Creating a native stack trace in Linux
    Attaching GDB to Engine native code on Linux
    Logging messages from a Native service to the Engine log
    Running a .NET Driver from an Engine Service
    Configuration Issues
    Installation on Dual-Interface Machines
    Configuring the timeout period for the Administration Tool
    Reconfiguring Managers when Installing a secondary Director
    Using UNC paths in a driver.properties file

Chapter 13 - Database Administration
    Introduction
    Database Types
    The Reporting Database
    The Internal Database
    Internal Database Backup

Appendix A - The grid-library.dtd
    Introduction

Appendix B - Reporting Database Tables
    Introduction
    Batches
    Brokers
    Broker_stats
    Driver_events
    Driver_profiles
    Driver_users
    Engine_events
    Engine_info
    Engine_stats
    Event_codes
    Job_status_codes
    Jobs
    Job_discriminators
    Properties
    Tasks
    Task_status_codes
    Users
    User_events

Index
This Document is Proprietary and Confidential
Chapter 1 Introduction
This guide is a reference for administrators who maintain GridServer installations. It explains how GridServer works, including scheduling, routing, failover, and file deployment, and gives a tour of the GridServer Administration Tool. It also provides how-to information for frequent tasks and advanced information on security, tuning, database administration, and log files.
Before you begin
This guide assumes that you already have a GridServer Manager running and know the hostname, username, and password. If this isn’t true, see the GridServer Installation Guide or contact the administrator responsible for the installation.
GridServer 4.2 Documentation Roadmap
The following documentation is available for GridServer 4.2:
GridServer Guides
Four guides and four tutorials are included with GridServer in Adobe Acrobat (PDF) format. They are also available in print format. To view the guides, log in to the Administration Tool, select the Admin tab, go to the Documentation page, and select a guide. A search engine is also available on this page for you to search all of the documentation for a phrase or keywords. The PDF files can also be found on the Manager at livecluster/admin/docs.
The following guides are available:
Introducing the GridServer Platform Series:
Introducing the GridServer Platform
Contains an introduction to GridServer, including definitions of key concepts and terms, such as work, Engines, Directors, and Brokers. This should be read first if you are new to GridServer.
The GridServer Administration Series:
GridServer Administration Guide
Covers the operation of a GridServer installation as relevant to a system administrator. It includes basic theory on scheduling, fault-tolerance, failover, and other concepts, plus how-to information and performance and tuning information.
GridServer Installation Guide
Covers installation of GridServer for Windows and Unix, including Managers, Engines, and pre-installation planning.
The GridServer Developer Series:
GridServer Developer’s Guide
Contains information on how to develop applications for GridServer, including information on Service Domains, using Services, PDriver (the Batch-oriented GridServer Client), the theory behind development with the GridServer Tasklet API, and concepts needed to write and adapt applications.
GridServer Object-Oriented Integration Tutorial
Tutorial on developing applications for GridServer using the object-oriented Tasklet API in Java or C++.
GridServer Service-Oriented Integration Tutorial
Tutorial on developing applications for GridServer using Services, such as Java, .NET, native, or binary executable Services.
GridServer PDriver Tutorial
Tutorial on using PDriver, the Parametric Service Driver, to create and run Services with GridServer.
GridServer COM Tutorial
Tutorial explaining how client applications in Windows can use COMDriver, GridServer’s COM API, to work with services on GridServer.
Other Documentation and Help
In addition to the GridServer guides, you can also find help and information from the following sources:
GridServer Administration Tool Help
Context-sensitive help is available throughout the GridServer Administration Tool by clicking the help icon located on any page. This provides reference help, plus how-to topics.
API Reference
Reference information for the GridServer API is provided in the GridServer SDK in the docs directory. The Java API information is in JavaDoc format, while C++ documentation is presented in HTML, and .NET API help is in HTMLHelp. You can also view and search them from the GridServer Administration Tool; log in to the Administration Tool, click the Admin tab, and select the Documentation link.
Knowledge Base
A searchable archive of known issues and support articles is available online. To access the DataSynapse Knowledge Base, go to the DataSynapse customer extranet site at customer.datasynapse.com and log in. You can also use this site to file an issue report, download product updates and licenses, and view documentation.
Document Conventions
Convention | Explanation | Example
italics | Book titles | The GridServer Developer’s Guide describes this API in detail.
“Text in quotation marks” | References to chapter or section titles | See “Preliminaries.”
bold text | Emphasizes key terminology | Client applications (Drivers) submit work to a central Manager.
bold text | Interface labels or options | Enter your URL in the Address box and click Next.
Courier New | User input, directories, file names, file contents, and program scripts | Run the script in the /opt/datasynapse directory.
Blue text | Hypertext link. Click to jump to the specified page or document. | See the GridServer Developer’s Guide for details.
[GS Manager Root] | The directory where GridServer is installed, such as c:\datasynapse or /opt/datasynapse. | The Driver packages are located in [GS Manager Root]/webapps/livecluster/WEB-INF/driverInstall
Chapter 2 Work
Introduction
GridServer supports a Services model for dividing and processing work. This method takes a large data-intensive or compute-intensive problem and logically breaks it down into units of work that can run independently and combine for a final result. GridServer receives the work unit requests and services them in parallel. Additionally, high-throughput applications or services can be distributed to a Grid, so that many similar requests for that service can be fulfilled as they arrive. Each request for service is independent, may be stateful, and generally arrives unpredictably at different points in time. Services also provide a language-independent interface to the GridServer platform. As an alternative, the language-specific Job API can be used to leverage existing Java or C++ development resources. Both models are described below.
Services
The Service-Oriented method of defining work in GridServer is a standards-based model. It uses a thin client model, which promotes easy integration of an existing implementation. It also promotes language interoperability, as clients written in different languages can invoke methods in Service Implementations written in the same or other languages. There are two components used with the Service-Oriented method: Clients and Service Implementations. Both are described below.
Clients
A client or client application is the implementation that is used to create a Service Session. The client invokes methods that have been distributed on Engines. You can create a Service client in different ways:
• A client-side API in Java, COM, C++, or .NET.
• A service proxy of Java or .NET client stubs generated by GridServer.
• A Web Service client using SOAP, a lightweight protocol used for exchanging messages with decentralized components.
Service Implementations
Service Implementations are deployed to Engines, and process requests from clients. They process data and return results back to the client. Service Implementations are registered on a GridServer Manager, as a Service Type, which is virtualized on its Engines. When a client makes a client request, it sends the request to a Manager instead of directly requesting an Engine to do the work. This one-to-many relationship provides fault tolerance and scalability for Services.
FIGURE 2-1: The relationship between Service Clients and Service Implementations.
Service Implementations can be constructed with any of the following:
• Arbitrary Java classes
• Arbitrary .NET classes
• A Dynamic Library (.so, .DLL) with methods that conform to a simple input-output string interface.
• A command, such as a script or binary executable
Integration as a Service in most cases requires minimal changes to the client application.
Service Session
A running Service is referred to as a Service Session. The Service Session comprises the Service Client, the Service Implementation, and the Service state on all components. In other words, once a client has created a Service and the Service Implementation is running on Engines, these are collectively called the Service Session.
Service benefits
There are many advantages to Services:
Cross-language: Client and Service can be in different languages
Dynamic: Method names can be determined dynamically, or use generated proxies for type safety
Flexible: Use synchronous or asynchronous invocation patterns; can use client proxies generated by GridServer
Virtual: Client-Engine correspondence is not one-to-one; Service requests are adaptively load balanced
Stateful: Despite being virtual, stateful Services can be handled
Standards: Standards-compliant
For more information on Services, see Chapter 3, “Creating Services” on page 23 and Chapter 4, “Accessing Services” on page 33 of the GridServer Developer’s Guide.
Jobs
The Object-Oriented method of defining work in GridServer utilizes easy-to-use C++ and Java APIs to create a rich, empowered client. Using this API, a programmer defines a “Job” as a collection of Tasks, with each Task defined as an atomic sub-partition of the overall workload that is run in its entirety on an Engine. The client code submits work and administrative commands and retrieves computational results and status information through a simple API.
FIGURE 2-2: Tasks within a Job.
Using the API, you design a Tasklet, which contains the Engine-side code for each Task, and marker interfaces called TaskInput and TaskOutput.
Job Benefits
The Job-Task model differs from the Service model in ways that may be an advantage, depending on your development scenario. If you are designing new applications in Java or C++, its API is easy to adopt, and it lets you leverage existing trained programming resources. For more information on the Job API, see Chapter 5, “The Tasklet API” on page 45 of the GridServer Developer’s Guide.
FIGURE 2-3: Workflow between a Job and an Engine.
Binary-level Integration
Another native Driver, PDriver, enables you to execute command-line programs as a parallel processing Job without using the API. PDriver, or the Parametric Job Driver, is a Driver that can execute existing command-line programs as a parallel processing service using the GridServer environment, taking full advantage of the parallelism and fault tolerance of GridServer.
PDriver achieves parallelism by running the same program on Engines several times with different parameters. A script is used to define how these parameters change. For example, a distributed search mechanism using the grep command could conduct a brute-force search of a network-attached file system, with each task in the Service being given a different directory or piece of the file system to search.
PDriver uses its scripting language, called PDS, to define jobs. These scripts can also be used to set options for a PDriver Service, such as remote logging and exit code checking. For more information on the PDriver, see Chapter 6, “PDriver” on page 49 of the GridServer Developer’s Guide.
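The idea of running one program many times with different parameters can be pictured with a short sketch. The Python below is not PDS and uses no GridServer API; it is a hypothetical illustration that builds one grep command line per directory, the way each task in the search example would receive a different piece of the file system. The function name, the search pattern, and the directory names are all assumptions.

```python
import shlex


def build_task_commands(pattern, directories):
    """Return one grep command line per directory (hypothetical helper).

    Each entry stands for one task: the same program, a different parameter.
    """
    commands = []
    for directory in directories:
        commands.append(["grep", "-r", pattern, directory])
    return commands


if __name__ == "__main__":
    # Assumed example values; a real PDriver job would define these in a PDS script.
    for cmd in build_task_commands("ERROR", ["/data/a", "/data/b", "/data/c"]):
        print(shlex.join(cmd))
```

The sketch only generates the command lines. Distributing them to Engines, logging, and exit code checking are what PDriver itself provides.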
Chapter 3 Engine Balancing and Client Routing
Introduction
This chapter covers the mechanisms that GridServer Directors use to route Engines and Clients to Brokers, and to reallocate Engines as the state of the grid changes.
Client Routing
The following sections describe methods of routing Clients to Brokers. One or more of these methods can be used together. However, in most scenarios Clients are associated with a specific Broker, and usually a Failover Broker for fault tolerance.
Allowed Brokers Set
The easiest and most common method of routing clients is to use the Driver Profile’s allowedBrokers property to perform direct routing to a set of Brokers. This is configured using the Driver Profile page on the Driver tab in the GridServer Administration Tool. The profile must be associated with the username of the client using the User Admin page on the Admin tab.
Client Properties Rules
Clients can also be routed to Brokers using rules based on client properties. For centralized management at the Director, user-defined properties are created using the Driver Property List, and are set on a Driver Profile using the Driver Profile page. Client properties can also be set using the DriverManager API or the driver.properties file on the client. The profile must be associated with the username of the client using the User Admin page on the Admin tab. The Broker Routing page is then used to set up routing rules based on these properties.
Driver API
The DriverManager API on all Driver platforms provides a method, connect(String broker), that will force the client to log in to the specified Broker. If a Driver Profile is associated with the client, this profile must permit the specified Broker.
Engine Routing and Balancing
Engines are dynamically allocated resources that can migrate among Brokers based on such criteria as load and policy. The Engine Balancer is the component on the Director that manages login and regularly re-routes Engines to maintain an optimal balance across the Grid. The Primary Director’s balancer always runs, while the Secondary Director’s will only run if the Primary is down.
On a regular basis, the Director polls all Brokers for the state of all Engines on those Brokers. The routing mechanisms are tested against all Engines to determine where all Engines should optimally reside. Typically, changes in state due to load balancing requirements will result in changes in the optimal distribution. If it is determined that Engines should be re-routed, the Director sends a request to each Broker that has Engines that should be moved, to log those Engines off. When an Engine logs off, it will then log back in to the optimal Broker.
There are three balancers available, depending on how the grid is to be used. The weight-based balancer algorithm attempts to distribute Engines in proportion to relative weights, and it also allows rule-based routing using Engine properties. The Home/Shared Balancer routes Engines based on an Engine’s assigned Home Brokers, and the sharing policy of Home Brokers to other Brokers. Additionally, because version 4.1 used a different routing mechanism, and version 4.2 allows for 4.1 Brokers for staged migration of large grids, a 4.1-based balancer is available.
All of the balancers take into account the number of running and pending tasks on each Broker, and the desired maximum and minimum number of Engines for each Broker. If the Engine Balancer is changed on the Director, it must be restarted. Also, all balancer settings must be identical on Primary and Secondary Directors.
Engine Weight-Based Balancer
The Engine weight-based balancer allocates Engines based on each Broker’s Engine Weight value, which is on the Broker Admin page. This value determines the number of Engines that the Broker will be allocated relative to the other Brokers’ weights, when all Brokers are idle. The algorithm also takes into account session load, and idle Engines will be reallocated to busy Brokers as they are needed.
This balancer also allows for rule-based routing via Engine Properties, when it is necessary to restrict some Engines to a set of Brokers. Engines can be routed via their intrinsic properties, such as cpuTotal, and by user-defined properties, which can be created using the Engine Property List page and assigned using the Engine Properties List page. The Broker Routing page is used to set up routing rules based on these properties.
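The idle-grid case described above is a proportional split, which the following sketch works through. It is a minimal illustration and not the balancer's actual implementation: the function name is hypothetical, the rounding rule is an assumption, and session load and routing rules are ignored.

```python
def idle_allocation(weights, total_engines):
    """Split total_engines across Brokers in proportion to their weights.

    Hypothetical helper modelling only the case where all Brokers are idle.
    """
    total_weight = sum(weights.values())
    shares = {
        broker: int(total_engines * weight / total_weight)
        for broker, weight in weights.items()
    }
    # Assumed rounding rule: hand leftover Engines to the heaviest Brokers first.
    leftover = total_engines - sum(shares.values())
    for broker in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        shares[broker] += 1
    return shares


if __name__ == "__main__":
    # Assumed example: one Broker weighted 3.0, three Brokers weighted 1.0.
    print(idle_allocation({"A": 3.0, "B": 1.0, "C": 1.0, "D": 1.0}, 60))
```

With these assumed weights, Broker A receives 3 / (3 + 1 + 1 + 1) of the Engines, that is, half of the grid.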
Home/Shared Balancer
The Home/Shared Engine balancer uses an algorithm based on the idea that every Engine has a set of Home Brokers that it will always work on when there are outstanding tasks, yet it can be shared with other Brokers when there are no outstanding tasks on any home. An Engine is assigned its Home Brokers through its configuration, using the Engine Configuration page. Brokers are configured to share their homed Engines with other Brokers using the Broker Admin page.
This algorithm uses Broker needs and Engine preferences for Brokers to perform allocation. Each Engine divides the existing Brokers into tiers by preference. A tier is an unordered set of Brokers. There are two tiers by default: the Engine’s home Brokers, and the shared Brokers of those home Brokers. A third tier can be introduced by splitting shared Brokers into two groups. The higher the tier, the more the Engine prefers the Brokers in that tier. The balancer uses the following rules:
1. An Engine is routed to the highest-tiered Broker that has pending tasks. If multiple Brokers in the same tier have pending tasks, the choice is made at random, as if all weights were 1.
2. An Engine will leave its current Broker only if there is a needy Broker in a higher tier. An Engine will not move to a lower-tiered Broker unless it is idle.
3. Failover Brokers are never allocated Engines unless they are needy.
When using the Home/Shared Engine balancer, tiers are shown in the GridServer Administration Tool, in the Broker Sharing field of the Broker Admin page. Brokers are separated into tiers with the semicolon, such as “A,B;C,D,E”.
For example, an Engine configuration’s home Brokers are A and B. A’s shared list is “C,D;E”. B’s shared list is “F;G”. An Engine with this configuration will have the following preferences: first: A, B; second: C, D, F; third: E, G. Within each group, Brokers are equal, and ordering doesn’t matter.
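The tier example above can be reproduced with a short sketch. The code below is a hypothetical illustration of how the semicolon-separated sharing lists combine into Engine preferences; the function names are assumptions, and the handling of a Broker that appears in more than one list (it keeps its highest tier) is also an assumption rather than documented behavior.

```python
def parse_tiers(sharing):
    """Split a Broker Sharing string such as "C,D;E" into a list of tiers."""
    return [
        [name.strip() for name in tier.split(",") if name.strip()]
        for tier in sharing.split(";")
    ]


def engine_preferences(home_brokers, shared_lists):
    """Return preference tiers: home Brokers first, then merged shared tiers."""
    tiers = [list(home_brokers)]
    per_home = [parse_tiers(shared_lists.get(home, "")) for home in home_brokers]
    depth = max((len(home_tiers) for home_tiers in per_home), default=0)
    for level in range(depth):
        merged = []
        for home_tiers in per_home:
            if level < len(home_tiers):
                for broker in home_tiers[level]:
                    # Assumption: a Broker already placed in a higher tier stays there.
                    if not any(broker in tier for tier in tiers) and broker not in merged:
                        merged.append(broker)
        if merged:
            tiers.append(merged)
    return tiers


if __name__ == "__main__":
    print(engine_preferences(["A", "B"], {"A": "C,D;E", "B": "F;G"}))
```

Because a tier is an unordered set, the order of Brokers inside each inner list carries no meaning.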
Engine Balancer Configuration
Engine Balancing is configured in the GridServer Administration Tool on the Manager Configuration page, in the Engines and Clients section. These settings must be identical on all Directors. Each setting is listed below, followed by its description.
Engine Balancer
The Engine balancer that will be used: Weight-Based, Home/Shared, or 4.1-Compatible.
Rebalance Interval
The amount of time, in seconds, between balancing episodes. (Previously called the Poll Period.)
Soft Logoff
If true, Engine logoffs do not restart the JVM. This enables them to retain state and log in faster.
Logoff Timeout
The amount of time in seconds that an Engine will wait to finish a task before logging off.
Engine Balance Fraction
The fraction of extra Engines that will actually be moved to another Broker on a balance. This can be set to less than 1 to dampen Engine movement. For instance, if the fraction is 0.5 and the balancer determines that a Broker has 8 extra Engines, it will only move 4 on the first balance. Assuming those Engines move, on the next balance it will determine that there are 4 extra and move 2, and so on.
Engine Balance Maximum
The maximum number of Engines that will be moved to another Broker on a rebalance. The maximum applies over the entire grid. For instance, if this parameter is set to 100 and the balancer determines that 200 Engines should be rebalanced (after taking Engine Balance Fraction into account), then only 100 Engines will actually be rebalanced. Does not apply to 4.1-Compatible balancer.
Engine Threshold
The difference between the actual and optimal number of Engines on a Broker must be greater than this value before any Engines are logged off. This threshold minimizes unnecessary Engine reallocation. For example, if the threshold is 2, and a Broker’s optimal number of Engines is calculated to be 8, it must have more than 10 Engines before it will log off any of them. Applies to 4.1-Compatible balancer only.
Note that if the 4.1-Compatible balancer is selected, it forces Engine instance grouping to avoid constant Engine upgrading or downgrading.
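The arithmetic behind Engine Balance Fraction, Engine Balance Maximum, and Engine Threshold can be checked with a small sketch. The helpers below are hypothetical and only mirror the examples given in the setting descriptions; in particular, rounding down fractional Engines is an assumption.

```python
def engines_to_move(extra, fraction=1.0, maximum=None):
    """Engines actually moved on one balance, given the number of extra Engines.

    'maximum' models Engine Balance Maximum, which applies over the entire grid.
    Rounding down is an assumption.
    """
    moved = int(extra * fraction)
    if maximum is not None:
        moved = min(moved, maximum)
    return moved


def dampened_sequence(extra, fraction, rounds):
    """Moves made on successive balances, assuming every move succeeds."""
    moves = []
    for _ in range(rounds):
        moved = engines_to_move(extra, fraction)
        moves.append(moved)
        extra -= moved
    return moves


def exceeds_threshold(actual, optimal, threshold):
    """True when a Broker has enough surplus Engines to start logging them off."""
    return actual - optimal > threshold


if __name__ == "__main__":
    print(dampened_sequence(8, 0.5, 3))
    print(engines_to_move(200, maximum=100))
    print(exceeds_threshold(10, 8, 2), exceeds_threshold(11, 8, 2))
```

A fraction below 1 spreads a large correction over several balancing episodes, which is why it dampens Engine movement.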
Failover Brokers
The purpose of a Failover Broker is to temporarily take over the execution of service sessions when the Client has no other Brokers to which it is permitted to connect. As far as Clients are concerned, Failover Brokers become part of the pool of active Brokers when there are no other non-Failover Brokers on which the client is permitted. As far as Engines are concerned, Failover Brokers are considered to be part of the active pool when there are active sessions in progress on that Failover. In either case, this Broker is now treated like a non-Failover by the algorithm.
It is important to take this into account when setting up the routing configuration. For example, if you are setting up a Driver Profile to allow a client on only one Broker under normal conditions, you must also include a Failover Broker in its list of allowed Brokers if you wish this client to have a failover if its main Broker goes down. See Chapter 4, “Grid Fault-Tolerance and Failover” on page 23 for more information.
Engine Upper and Lower Bounds
Brokers can also be configured to have upper and lower bounds on the number of Engines that can be logged in at a given time. These are set on the Broker Admin page. By default the columns are hidden, so you may need to add them using the Add Column control. The minimum value specifies that the balancer algorithm will always leave at least this number of Engines (assuming there are this many) on the Broker regardless of the state of other Brokers. The maximum value is the cap on the total number of Engines that can be allowed on the Broker. Both values are always considered by the balancing algorithms.
Example Use Cases
Example use cases are presented in this section.
N+1 Failover with Weighting
An organization has four groups using all available Engines in a Grid. One group is guaranteed to be allocated at least half of the Grid any time it needs it, and the other three groups share the remaining Engines.
Brokers
Set up five Brokers. Each group gets a Broker, plus one is used for failover.
Drivers
Create four Driver Profiles, one for each group. In each profile, set the allowedBrokers value to the group’s Broker and the failover Broker. Assign the Profiles to the appropriate users.
Engines
Use the weight-based Engine Balancer. Adjust Engine Weight on the Broker Admin page so the first group’s Broker is weighted at 3.0, and the other three groups’ Brokers are weighted at 1.0. You would most likely set the failover Broker weight at 1.0, so that a group would not be assigned any more resources than normal if their Broker went down.
Engine Localization with Sharing
A company has two groups, one in New York and one in London. Each has a single middleware application that has a Driver that connects to its own Broker. Each group also has a set of CPUs that it expects to always be working on their own calculations. However, there will be times when one group’s Broker is idle, so they are allowed to share with each other.
Brokers
Set up four Brokers, a regular and a failover for each group. Each regular Broker shares with the other regular Broker, plus its own failover Broker.
Drivers
Create two Driver Profiles, one for each group. In each profile, set the allowedBrokers value to the group’s Broker and its failover Broker. Assign the Profile to the middleware application user.
Engines
Use the Home/Shared Engine Balancer. Set up two Engine Configurations, “London” and “New York,” which would home the Engines to their respective Broker.
In this scenario, the application always connects to its local Broker, unless it is down, in which case it moves to its failover. Whenever that Broker has pending requests, all of its Engines will always be local. If the other group’s Broker is idle, or if it does not need all of its Engines, any of its idle Engines will be routed to the Broker that needs them. You may also want to increase the Engine Threshold, and decrease the Engine Balance Fraction, to minimize wandering of Engines during normal work periods, when a Broker may occasionally have idle Engines for brief times.
Chapter 4 Grid Fault-Tolerance and Failover
Introduction
GridServer is a fault-tolerant and resilient distributed computing platform. The GridServer platform will recover from a component failure, guaranteeing the execution of Services over a distributed computing Grid with diverse, intermittent compute resources. This section describes how GridServer behaves in the event of Engine, Driver, and Manager failure. Failures of components within the Grid can happen for a number of reasons, such as power outage, network failure, or interruptions by end users. For the purposes of this discussion, failure means any event that causes Grid components to be unable to communicate with each other.
The Fault-tolerant GridServer Deployment
A GridServer deployment consists of a primary Director, an optional secondary Director, and one or more Brokers. Drivers and Engines log in to the Director, which routes them to one of the Brokers. Directors balance the load among their Brokers by routing Drivers and Engines to currently running Brokers.
A minimal fault-tolerant GridServer deployment contains two Directors, a primary and a secondary, and at least two Brokers. The Brokers, Engines, and Drivers in the Grid have the network locations of both the primary and the secondary Directors. During normal operation, the Engines and Drivers log in to their primary Director; the secondary Director is completely idle.
FIGURE 4-1: A typical redundant GridServer configuration.
Other GridServer topologies, such as having multiple Managers to handle volume or to segregate different types of Services to different Managers, are discussed in Chapter 2, “Installation Overview” on page 7 of the GridServer Installation Guide.
Heartbeats and Failure Detection
Heartbeats are lightweight network communications sent at regular intervals between GridServer components, such as from Drivers to Brokers, from Engine Instances to Brokers, and from Engine Daemons to Directors. A Manager detects Driver and Engine failure when it does not receive a heartbeat within the configurable heartbeat interval time. Drivers detect Broker failure by failing to connect when they submit Jobs or poll for results. Engines detect Broker failure when they attempt to report for work or return results. To minimize unnecessary messaging, a heartbeat is only sent if no other message has been sent within the heartbeat interval.
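The two rules in this section (send a heartbeat only when the connection has been quiet, and declare a failure when nothing arrives within the interval) can be sketched as follows. This is a simplified model rather than GridServer code: the class names are hypothetical, time is passed in explicitly instead of being read from a clock, and any grace period a real Manager may apply is not modelled.

```python
class HeartbeatSender:
    """Decides when a component needs to send a heartbeat."""

    def __init__(self, interval):
        self.interval = interval
        self.last_sent = 0.0

    def record_message(self, now):
        # Any outgoing message (work, results, or a heartbeat) resets the timer.
        self.last_sent = now

    def heartbeat_due(self, now):
        # A heartbeat is needed only if nothing was sent within the interval.
        return now - self.last_sent >= self.interval


class FailureDetector:
    """Tracks the last time each component was heard from."""

    def __init__(self, interval):
        self.interval = interval
        self.last_seen = {}

    def record(self, component, now):
        self.last_seen[component] = now

    def failed(self, now):
        # Components silent for longer than the interval are considered failed.
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.interval
        )


if __name__ == "__main__":
    detector = FailureDetector(interval=5)
    detector.record("engine-1", now=0)
    detector.record("driver-1", now=4)
    print(detector.failed(now=6))
```

Passing the current time as an argument keeps the sketch deterministic and easy to test.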
Manager Stability Features
Several precautions are taken to prevent Manager failure due to excessive traffic. For example, the number of threads used for file update is limited. This keeps a large number of file updates from Brokers to Engines from consuming all of the HTTP threads on the application server and blocking other HTTP activity; instead, Engines will retry the download later when this maximum is reached. By default, this is set at 50 threads, but it can be changed in the GridServer Administration Tool on the Manager Configuration page, in the Communication section, with the Maximum Resource Download Connections property. The number of Broker/Director messaging threads is also limited. If this limit is reached, clients will retry rather than immediately fail.
Engine Failure
Network connection loss, hardware failure, or errant application code can cause Engine failure. When an Engine goes offline, the work assigned to it is requeued, and will be assigned to another Engine. Although work done on the failed Engine is lost, the Task will be assigned to a new Engine.
Engines that have built up a considerable state or cache or that are running particularly long Tasks could cause a larger loss if Engine failure occurs. This can be avoided by shortening Task duration in your application or by using the Engine Checkpointing mechanism. For more information on Task duration, see Chapter 10, “GridServer Performance and Tuning” on page 77.
Each Engine has a checkpoint directory where a Task can save intermediate results. If an Engine fails and the Manager retains access to the Engine machine’s file system, a new Engine will copy the checkpoint directory from the failed Engine. It is the responsibility of the client application to handle correct resumption of work given the contents of the checkpoint directory.
Note that if an Engine Daemon logs off the Director or otherwise fails, it does not log off its Engines. Provided the failure has not caused the Engines to also fail, they will continue working and return results when completed.
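Because resuming from a checkpoint is the application's responsibility, a sketch of what that can look like may help. The example below is hypothetical and uses no GridServer API: the file name progress.json, the JSON layout, and the fail_after parameter (which simulates an Engine failure) are all assumptions, and copying the checkpoint directory to a new Engine is not modelled.

```python
import json
import os
import tempfile


def run_task(items, checkpoint_dir, fail_after=None):
    """Sum items, saving progress after each one so a restart can resume."""
    path = os.path.join(checkpoint_dir, "progress.json")  # assumed file name
    state = {"next_index": 0, "total": 0}
    if os.path.exists(path):
        # Resume from the intermediate result left by an earlier run.
        with open(path) as f:
            state = json.load(f)
    for i in range(state["next_index"], len(items)):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated Engine failure")
        state["total"] += items[i]
        state["next_index"] = i + 1
        with open(path, "w") as f:
            json.dump(state, f)
    return state["total"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as checkpoint_dir:
        try:
            run_task([1, 2, 3, 4], checkpoint_dir, fail_after=2)
        except RuntimeError:
            pass  # the first attempt stops partway through
        # The second attempt picks up at the third item instead of starting over.
        print(run_task([1, 2, 3, 4], checkpoint_dir))
```

Saving after every item is the simplest policy; a long-running Task would more likely checkpoint at coarser intervals to limit file I/O.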
Driver Failure
When a client application fails, the Broker detects the failure when the Client does not return a heartbeat and does not log back in within the interval specified by the Client Timeout setting. When this happens, any currently running Services are cancelled, and application failure recovery or restart is the responsibility of your application.
The exceptions to cancellation are fully submitted Services of type Collection.LATER, and any Services of type Collection.NEVER. Also, if a Client is collecting results from a Collection.LATER type Service, none of the outputs will be removed until all have been collected and the Client destroys the Service, so that if a Client fails during collection it can restart and recollect the outputs.
All Driver fileservers return a “Server Unavailable” code with instructions to retry if they are processing too many concurrent requests. This significantly reduces the chance of a Service invocation failing due to a temporarily overloaded Driver.
Director Failure
If the primary Director fails, the secondary Director takes over balancing and routing Drivers and Engines to Brokers. Since the Directors do not maintain any state, no work is lost if a Director fails and is restarted. Also, because both Directors follow the same rules for routing to Brokers, it makes no difference which Director is used for login.
The Primary Director is also responsible for the Administrative Database, which contains data needed by the Grid for operation, such as the User list, routing properties, and so on. These values, then, can only be modified on the Primary Director. This database is synchronized to the Secondary Director while both are running, and backed up by the Secondary Director on every database backup, so that the Grid can remain in operation when the Primary Director is down.
Broker Failure
Like the Director, the Broker is designed as a robust application that will run indefinitely, and will typically only fail in the event of a hardware failure, power outage, or network failure. However, the fault-tolerance built into the Drivers guarantees that all Services will complete even in the event of failure.
Because the most likely reason that a Driver will be disconnected from its Broker is a temporary network outage, the Driver does not immediately attempt to log in to another Broker. Instead, it waits a configurable amount of time to reconnect to the Broker to which it was connected. After this amount of time, it will then attempt to log in to any available Broker. This amount of time is specified in the driver.properties file or via the API.
Once the Driver has timed out and reconnected to another Broker, all Service instances will then resubmit any outstanding tasks and continue. Tasks that are already complete will not be resubmitted. The Service instances will also resubmit all state updates in the order in which they were originally made. From the Service instance point of view, there will be no indication of error, such as exceptions or failure, just the absence of any activity during the time in which the Driver is disconnected. That is, all Services will run successfully to completion as long as eventually a suitable Broker is brought online.
If an Engine is disconnected from its Broker, the process simply shuts down, restarts, and logs in to any suitable Broker. Any work is discarded.
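The Driver-side behavior described above amounts to two decisions: which Broker to log in to after a disconnect, and which tasks to resubmit. The sketch below models both under stated assumptions. The function names are hypothetical, the choice of the first available Broker is an assumption (the text says only "any available Broker"), and times are plain numbers of seconds rather than values read from driver.properties.

```python
def choose_broker(previous, available, waited, reconnect_timeout):
    """Return the Broker to log in to, or None to keep waiting.

    previous: the Broker the Driver was connected to
    available: Brokers currently reachable, in an arbitrary order
    waited: seconds elapsed since the disconnect
    reconnect_timeout: configurable wait before trying other Brokers
    """
    if previous in available:
        return previous  # the outage was temporary; reconnect to the same Broker
    if waited < reconnect_timeout:
        return None  # keep waiting for the original Broker
    # Assumption: take the first available Broker once the wait has expired.
    return available[0] if available else None


def tasks_to_resubmit(all_tasks, completed):
    """Outstanding tasks only; completed tasks are not resubmitted."""
    return [task for task in all_tasks if task not in completed]


if __name__ == "__main__":
    print(choose_broker("A", ["B"], waited=10, reconnect_timeout=60))
    print(choose_broker("A", ["B"], waited=60, reconnect_timeout=60))
    print(tasks_to_resubmit([1, 2, 3], completed={2}))
```

Waiting before switching avoids resubmitting work after a brief network interruption, at the cost of a delay when the Broker has really failed.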
Failover Brokers In the fault-tolerant configuration, some Brokers can be set up as Failover Brokers. When a Driver logs in to a Director, the Director will first attempt to route it to a non-Failover Broker. If no non-Failover Brokers are available, the Director will consider all Brokers, which would typically then route the Driver to a Failover Broker.
FIGURE 4-1: A GridServer configuration with Failover capability.
A Failover Broker is not considered for Engine routing if there are no active Services on that Broker. Otherwise, it is considered like any other Broker, and follows Engine routing like any other Broker. By virtue of these rules, if a Failover Broker becomes idle, Engines will be routed back to other Brokers. The primary Director monitors the state of all Brokers on the Grid. If a Driver logged into a Failover Broker is able to log in to a non-Failover Broker, it will be logged off so it can return to the non-Failover Broker. All running Services will be continued on the new Broker by auto-resubmission. By default, all Brokers are non-Failover Brokers. Designate one or more Brokers within the Grid as Failover Brokers when you want those Brokers to remain idle during normal operation.
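The routing rules for Failover Brokers can be illustrated with a small model. The Broker class and both functions are hypothetical names used only for this sketch; picking the first candidate stands in for whatever balancing the Director actually applies.

```python
from dataclasses import dataclass


@dataclass
class Broker:
    name: str
    failover: bool = False
    available: bool = True
    active_services: int = 0


def route_driver(brokers):
    """Prefer non-Failover Brokers; fall back to all available Brokers."""
    up = [b for b in brokers if b.available]
    non_failover = [b for b in up if not b.failover]
    candidates = non_failover or up
    # Placeholder choice: real balancing rules are not modeled here.
    return candidates[0] if candidates else None


def engine_candidates(brokers):
    """A Failover Broker is considered for Engines only if it has active Services."""
    return [
        b for b in brokers
        if b.available and (not b.failover or b.active_services > 0)
    ]
```

Under these assumptions, an idle Failover Broker drops out of Engine routing automatically, which is why Engines drift back to the regular Brokers once the Failover Broker has no active Services.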
Fault-Tolerant Tasks Fault-Tolerant Tasks enable an Engine to continue executing a task even if it logs off of a Broker, so that it does not lose work due to a Broker failure. The feature is intended for use on long-running tasks. This means that if an Engine is working on a task, and it logs off of the Broker, it will not immediately exit. Rather, it will continue to work on that task, while continuing to attempt to log in to a Broker that has the Service on which it is working. If it does not log back in within a defined time period, it will exit. If it does log back in, it will first notify the Broker that it is working on the task. If it has already completed, it will immediately send the result; otherwise, it will do so upon completion.
It is not recommended that you use this feature unless you have individual tasks that take many hours to finish (or the longest task takes nearly as long as the whole job). For example, if a report runs during the night and some tasks take 8 hours to process, then you may want this feature in place to ensure that an 8-hour task does not have to start from the beginning if the Broker fails at 7 AM. On the other hand, enabling fault-tolerant tasks can diminish the efficiency of the Grid, since it will redundantly schedule all outstanding tasks. With short tasks, it is usually more efficient to simply recalculate tasks in the event of a Broker failure.
As an example of Fault-Tolerant Tasks, consider the following:
1. An Engine and Driver are connected to Broker A.
2. Broker A goes down.
3. The Driver continues for 5 minutes to find the Broker with its Service. The Engine continues working, while it attempts to find the Broker with its Service.
4. After 5 minutes, the Driver connects to Broker B, and resubmits outstanding work.
5. Now that the Service is on Broker B, the Engine logs in to Broker B, and indicates that it has taken that task. When it has finished, it writes its task. If it has already finished, it immediately writes the task.
If another Engine has already taken that task by the time this Engine logs in, no attempt will be made to cancel the task on the Broker. It will essentially be the same as a redundantly rescheduled task. When an Engine logs in to a Failover Broker and works on a task, the task is cancelled once the Driver switches to the regular Broker. To enable Fault-Tolerant Tasks, in the GridServer Administration Tool, click the Manager tab, click Manager Configuration, then click Engines and Clients, change the value of Engine Timeout Minutes, and click Save. The timeout should be longer than the Driver's timeout, which is the value of DSBrokerTimeout set in the driver.properties file.
To use Fault-Tolerant Tasks, another Broker must be available for failover, and the Client running the session will need to fail over to the Broker and resubmit its session. No attempt will be made upon login of the Engine running a fault-tolerant task to cancel that same task if it has already been taken by another Engine.
Batch Fault-Tolerance Batch Schedules that exist on a Manager are persistent, provided the Next Run field is not never. This provides failover capability in the event of a Manager failure, as the Batch Schedules will still exist when the Manager is restarted. The following Batch Schedules are persistent:
• Absolute schedules
• Relative schedules with repeat
• Cron schedules
All persistent Batches are restarted when the Manager is restarted, just as if they were scheduled for the first time. Batch runs that were to occur during the time when the Manager was down are ignored.
GridCache Fault-Tolerance GridCache supports fault-tolerance, as described below. Note that primary and failover Brokers must have their clocks synchronized for GridCache failover.
Client If any client puts data in the cache and subsequently dies or logs out, that data is still available to all other clients. This is due to the fact that the Broker maintains the master index and complete view of the cached data. This does not apply to the local caching mode where a region has a local loader that does not synchronize with the other local caches.
Broker Restart GridCache can be configured to survive Manager restart and failure. GridCache’s cache index is rebuilt on system startup; objects persisted on the Broker’s file system will be recovered. If some or all of the cache is stored in memory, that information will be lost.
Failover A failover Broker can manage a GridServer cache when a regular Broker goes down, provided that the persistent cache directory is on a shared filesystem. The location of this filesystem is configurable from the Manager Configuration page in the GridServer Administration Tool. When the regular Broker goes down and the failover Broker takes over, the failover Broker will build its cache index and begin managing the cache from the shared filesystem. All clients that then fail over to the failover Broker will be able to get references to the existing cache regions on the shared filesystem.
Note that a failover Broker can only be configured to fail over to one shared cache directory. Therefore, a failover Broker can’t serve as a failover for multiple Brokers with different cache directories; a different failover Broker would have to be used for each Broker.
Chapter 5 Scheduling
One of the responsibilities of Brokers is scheduling, which is the management of Services and Tasks on Engines and interactions between Engines and Drivers. This chapter gives more details on how scheduling works, and the method used to determine what Tasks in a Service are sent to what Engines.
Introduction Most of the time, the scheduling of Services and Tasks on Engines is completely transparent and requires no administration. However, in order to tune performance, or to diagnose and resolve problems, it is helpful to have a basic understanding of how the Broker manages scheduling. Recall that clients create Service Sessions on the Broker. Each Service Session consists of one or more Tasks, which may be performed in any order. The scheduler determines the optimal match of Engines to Services. Whenever an Engine reports to the Broker to request work, the Broker assigns a Task from that Service to the Engine. When an Engine completes a Task, it is queued on the Broker for collection by the client. If an Engine is interrupted during processing, the Task is requeued by the Broker.
Reschedules and Retries Before the discussion of scheduling behavior, we must first define the terms Retry and Reschedule within the context of scheduling Tasks.
Retry A Retry is when a Task is re-queued due to a known failure of the Task. Such failures could be due to an error condition in the implementation, an error due to inability to download data, or a failure of an Engine (the monitor has detected that the Engine is no longer connected but it has not logged off.) It is always the result of the Engine returning the Task as failed to the Broker. When a Task is retried, it is always placed at the front of that session’s queue. The scheduler manages a retry count for each Task, so that a limit can be placed on the number of allowed retries.
Reschedule A Reschedule is when a Task is re-queued when it may or may not have failed. When a Task is rescheduled, it is by default placed at the back of that session's queue, unless the Reschedule First configuration option on the Broker (set in the Manager tab, on the Manager Configuration page, in the Services section) is set to true. The scheduler also manages a reschedule count for each Task. The following conditions result in a reschedule:
• Engine Logoff: When an Engine logs off gracefully while running a Task (such as when UI or CPU idle conditions are met, or there is a forced rebalance), the Task is rescheduled, but the reschedule count is not incremented, since there was no Task error.
• Redundant Rescheduler: If any of the Redundant Rescheduler strategies are in effect, Tasks may be rescheduled to other Engines. By default, those Tasks are allowed to continue to run on the current Engines, in case they finish before the rescheduled Tasks. In this case, the reschedule count is increased.
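The difference between a Retry and a Reschedule comes down to where the Task re-enters the session's queue and which counter is incremented. The following sketch models that with a double-ended queue; the SessionQueue class and its method names are hypothetical and are not part of the GridServer API.

```python
from collections import deque


class SessionQueue:
    """Toy model of one session's waiting list."""

    def __init__(self, reschedule_first=False):
        self.waiting = deque()
        self.reschedule_first = reschedule_first  # mirrors Reschedule First
        self.retry_count = {}
        self.reschedule_count = {}

    def retry(self, task):
        # Known failure: count it and put the Task at the front.
        self.retry_count[task] = self.retry_count.get(task, 0) + 1
        self.waiting.appendleft(task)

    def reschedule(self, task, count=True):
        # count=False models a graceful Engine logoff (no Task error).
        if count:
            self.reschedule_count[task] = self.reschedule_count.get(task, 0) + 1
        if self.reschedule_first:
            self.waiting.appendleft(task)
        else:
            self.waiting.append(task)
```

For example, with two Tasks already waiting, a retried Task jumps ahead of both, while a Task rescheduled after a graceful logoff goes to the back and leaves its reschedule count unchanged.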
Timeout Behavior When the INVOCATION_MAX_TIME option is set, it specifies that any invocation of a request may not exceed this value. If a Task times out on an Engine, it may be either retried or rescheduled, depending on what makes more sense for your application. If retried, the current Engine's invoke process is terminated, and the Task is assigned to another Engine. If rescheduled, the current Engine Task is allowed to continue execution. In either case, the appropriate count is incremented. The default behavior is set on the Broker, and is set to retry by default. It can also be set for the Service Type via the Service Type Registry page, or programmatically when the Service Session is created.
The Scheduler The Scheduler is the component that is used on a GridServer Broker to assign tasks to Engines. It attempts to make optimal matches based on criteria such as the session priority level, affinity, and Serial Service and Priority execution modes.
Scheduler Overview The scheduler aims to schedule tasks to Engines by attempting to have the proper amount of Engines allocated to all active Service Sessions at any given time. On any given scheduling event, the algorithm decides the number of Engines each Session should have at the time based on static and dynamic criteria, and then assigns the appropriate number of Engines to sessions based on how many the Session needs to reach the ideal level. Additionally, the scheduler takes into account the amount of usage that the Session has received over a given historical window of time. The “usage” refers to the amount of Engine clock time that the Session has occupied during that window. When a Session is created, it is initialized in such a way that it simulates as if it was running ideally over this window. This usage provides the ordering in which Engines are allocated to Sessions. This addresses starvation issues, round off error (the number of ideal Engines will rarely be an integer), and under/over-utilization due to discrimination, changes in the number of available Engines, and so on. Essentially, on a scheduling event, sessions are assigned the ideal number of Engines less the amount that are currently allocated, in the order of least to most usage. The following sections will discuss first the general algorithm, and then address specific subclasses of that algorithm for serial service and priority execution modes. This approach can be seen as analogous to a CPU thread scheduling algorithm. Each session is a “thread”, the engines are the “CPU”, the window is the sample period, and each task is an uninterruptible unit of CPU time allotted to a thread.
Service Priority Every GridServer Service has an associated priority. Priorities can take any integer value between zero and ten, so that there are eleven priority levels in all. 0 is the lowest priority (a suspended Service), 10 is the highest (an urgent priority Service, see below), and 5 is the default. The GridServer API provides methods that allow the application code to attach priorities to Services at runtime (see the GridServer API documentation for more details) and you can use the GridServer Administration Tool to change priorities while a Service is running. Priority Weight refers to the weight associated with a Priority Level. The weight defines the number of Engines allocated to a session relative to all other active sessions. For example, if Sessions A and B have weights of 2.0, Session C has a weight of 4.0, and there are eight Engines, Sessions A and B are allocated two Engines each, and Session C gets four. The weights are set with the Priority Weights property in the GridServer Administration Tool, on the Manager Configuration page in the Services section.
Usage Algorithm The usage algorithm is the default mode, and is used when Serial Service Execution mode is not enabled. Whenever an Engine or set of Engines is available for scheduling, the scheduler decides how many Engines each session should be allocated. In general, that value is:
Ideal Engines per Session = All Engines * Session Priority Weight / Total Weight
where "Total Weight" is the sum of all Priority Weights of active sessions. This value is rounded up to the next integer, which prevents starvation when the ideal calculation is less than 0.5 and assures that the sum of Ideal Engines is always at least as large as Total Engines. This algorithm also takes into account cases where the actual number of Engines that can be allocated is less than the ideal, such as when a Session is towards the end, or when Max Engines is used. Recall that a Session's usage is considered to be the total Engine clock time spent on the session over the last configurable amount of time. This includes running and completed tasks. When a Session is created, it must initialize its usage. The simplest, most fair method of doing this is to assume it has been operating in a steady state over the window with the ideal non-rounded number of Engines. The variables that monitor usage are then initialized as such. If no sessions are active, it initializes them such that the session's ideal is the total number of Engines currently on the Broker. Whenever there is any event that requires a scheduling episode, the scheduler assigns the proper number of Engines to each session for it to be at its ideal amount. This assignment is performed in order of least to most priority-normalized usage. If there are any unassigned Engines remaining after this initial round based on usage (typically due to disallowed conditions preventing assignment), a second tier round robin assignment is performed.
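As a worked illustration of the ideal-Engine formula, consider three sessions with weights 2.0, 2.0, and 4.0 sharing eight Engines. The sketch below computes the rounded-up ideal per session and the order in which sessions would receive Engines. The function names are hypothetical, and the usage values are assumed to be already priority-normalized.

```python
import math


def ideal_engines(total_engines, weights):
    """Ideal Engines per session, rounded up to the next integer."""
    total_weight = sum(weights.values())
    return {
        session: math.ceil(total_engines * weight / total_weight)
        for session, weight in weights.items()
    }


def engines_to_assign(ideal, current, usage):
    """(session, shortfall) pairs, ordered from least to most usage."""
    order = sorted(ideal, key=lambda s: usage.get(s, 0.0))
    return [(s, max(0, ideal[s] - current.get(s, 0))) for s in order]


ideal = ideal_engines(8, {"A": 2.0, "B": 2.0, "C": 4.0})
print(ideal)  # {'A': 2, 'B': 2, 'C': 4}
```

Because each value is rounded up, the ideals can sum to more than the Engines available (three equal-weight sessions on ten Engines each get an ideal of four), which is why the assignment order by usage matters.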
Time Algorithm The time algorithm is used when Serial Service Execution mode is enabled. This algorithm works as follows:
Session Addition When a session is added to the Waiting List, it is placed such that it is ordered by Session creation time. Typically this is at the back of the list, although if the session had been removed and then re-added, it may not be.
Scheduling Episode On each episode, only the first session with waiting tasks is considered for assignment. The scheduler simply attempts to assign all Idle Engines to the session. Affinity is not considered. Note that as soon as the Session has no more waiting tasks, subsequent Sessions may be assigned Engines on the next episode even while the previous session is still running.
Serial Priority Algorithm The Serial Priority Algorithm is used when Serial Priority Execution mode is enabled. Either the Time Algorithm or the Usage Algorithm, depending on whether Serial Service Execution mode is enabled, is used on the subset of sessions at the current highest Priority Level that have waiting tasks in any sessions. For example, with Serial Service Execution mode off, all sessions at level 9 (assuming highest) will be allocated equal amounts of Engines until no more sessions at level 9 have waiting tasks, after which level 8 sessions are allocated. On the other hand, with Serial Service Execution mode on, all sessions at level 9 will execute in their order of creation. Note that in this state, if they finish, and level 8 sessions start, and then a new level 9 session is created, that new level 9 session will take over at that point. This is because priority takes precedence over creation time.
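The Time Algorithm and the Serial Priority filter can be sketched together. The Session class and function names below are hypothetical; the sketch only models which session is considered, not how Engines are actually handed out.

```python
from dataclasses import dataclass


@dataclass
class Session:
    name: str
    priority: int
    created: float      # creation time, smaller is older
    waiting_tasks: int


def time_algorithm_pick(sessions):
    """Oldest session (by creation time) that still has waiting tasks."""
    waiting = [s for s in sessions if s.waiting_tasks > 0]
    return min(waiting, key=lambda s: s.created) if waiting else None


def serial_priority_subset(sessions):
    """Sessions with waiting tasks at the highest such priority level."""
    waiting = [s for s in sessions if s.waiting_tasks > 0]
    if not waiting:
        return []
    top = max(s.priority for s in waiting)
    return [s for s in waiting if s.priority == top]
```

With Serial Priority Execution enabled, the scheduler would first narrow the sessions with serial_priority_subset and then apply either the Time Algorithm or the Usage Algorithm to that subset. A newly created level 9 session therefore displaces running level 8 sessions on the next episode, because the subset is recomputed each time.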
Urgent Priority Services and Preemption Services with priority of 10 are considered urgent by the scheduler. (The API defines PRIORITY_URGENT to be equal to 10.) An urgent Service's weight is hard-coded to be essentially infinite, so that it is assigned all available Engines. Urgent Services may also preempt Engines that are currently working. When an Engine is preempted, the Task it is currently running is cancelled and rescheduled, and the Engine becomes available for new Tasks. Engines are preempted on a Service under the following conditions: if after being assigned all free Engines a Service can still make use of more Engines, then it may preempt some busy Engines, subject to two constraints that can be adjusted with configuration properties. First, the urgent Service must have been in the queue for Preempt Delay Seconds. Second, the percentage of Engines in the Grid running urgent Services cannot exceed Preemptable Engine Percent. For example, if this property is set to 50, and 47 percent of the Engines are currently running urgent Services, then at most three percent will be preempted. This value is not a hard limit on the number of Engines that may be running urgent Services, because free Engines are allocated to urgent Services regardless of how many Engines are already running urgent Services. The scheduler chooses Engines for preemption based on the following rules: Engines running an urgent Service will never be preempted. An Engine running a Task from a Service with lower priority will generally be selected in preference to one running a higher-priority Task. However, if the lower-priority Task has been running for a long time, a short-running, higher-priority Task may be preempted instead. The Preempt Threshold Minutes property determines the value at which this crossover happens. For example, if this property is set to 30, then an Engine that has just started running a priority 2 Task will be chosen for preemption over an Engine that has been running a priority 1 Task for more than 30 minutes. Other important points concerning priority Services and preemption:
• Tasks canceled by preemption are not subject to a rescheduling limit, since they are not considered failures.
• To prevent preemption from ever occurring, set Preemptable Engine Percent to 0.
• It is possible that the first Service on the queue will not get all free Engines if it doesn't have enough Tasks, it is already using its maximum number of Engines, or it discriminates against some Engines. Free Engines that are not taken by the first urgent Service are first offered to the other urgent Services on the queue, and then to all other Services.
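The two preemption examples above can be reproduced with a short sketch. This is one possible reading of the rules, not the scheduler's actual implementation: the function names and the dictionary fields are hypothetical.

```python
PRIORITY_URGENT = 10


def preemption_headroom(total_engines, urgent_engines, preemptable_engine_percent):
    """How many more Engines may be preempted before the cap is reached."""
    cap = total_engines * preemptable_engine_percent / 100
    return max(0, int(cap - urgent_engines))


def choose_victim(busy_engines, preempt_threshold_minutes):
    """Pick the busy Engine to preempt, or None if none is eligible.

    Engines running urgent Services are never chosen. Among the rest,
    Tasks running longer than the threshold are protected in favor of
    shorter-running ones; otherwise the lowest priority goes first.
    """
    candidates = [e for e in busy_engines if e["priority"] < PRIORITY_URGENT]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda e: (e["minutes"] > preempt_threshold_minutes, e["priority"]),
    )
```

On a Grid of 100 Engines with 47 running urgent Services and Preemptable Engine Percent set to 50, preemption_headroom returns 3. With a threshold of 30 minutes, an Engine that just started a priority 2 Task is chosen over one that has run a priority 1 Task for 31 minutes.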
Engine Blacklisting If a Service sets the option "engineBlacklisting" (ENGINE_BLACKLISTING) to true, then Engines that fail on a Task from that Service will not be given any other Tasks from that Service. The default is false. Here, "fail" means any action that results in a failed Task being sent back to the Manager, regardless of whether that failure was due to Engine hardware, Engine environment, or Tasklet code. It does not include events such as the Engine going offline due to user activity, since that does not result in a Task failure. Blacklisted Engines are excluded for a particular Service Session only; they can freely accept tasks from any other Service, regardless of Service Type, assuming the other Services haven't also blacklisted the Engine and don't have discriminators in place that prevent it. To remove an Engine from all blacklists, go to the Engine Daemon Admin page in the GridServer Administration Tool and select Clear from Blacklists from the Actions list.
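The per-session scope of blacklisting can be modeled as a mapping from each Service Session to the set of Engines that failed on it. The Blacklist class below is a hypothetical illustration, not a GridServer class.

```python
from collections import defaultdict


class Blacklist:
    """Tracks, per Service Session, the Engines that failed a Task."""

    def __init__(self):
        self._by_session = defaultdict(set)

    def record_failure(self, session, engine, blacklisting_enabled):
        # Only sessions that enabled the option blacklist the Engine.
        if blacklisting_enabled:
            self._by_session[session].add(engine)

    def may_assign(self, session, engine):
        return engine not in self._by_session.get(session, set())

    def clear_engine(self, engine):
        # Models "Clear from Blacklists": remove the Engine everywhere.
        for engines in self._by_session.values():
            engines.discard(engine)
```

An Engine blacklisted for one session remains eligible for every other session until it fails there too, and clearing it restores eligibility for all sessions at once.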
Conditions Task Discrimination allows limiting certain Tasks to a subset of Engines. If an Engine is ineligible to take the next waiting Task, it will be assigned the first Task it is eligible to take. The Broker tracks a number of predefined properties, such as available memory or disk space, performance rating (megaflops), operating system, and so forth, that the Discriminator can use to define eligibility. The site administrator can also establish additional attributes to be defined as part of the Engine installation, or attach arbitrary properties to Engines "on the fly" from the Broker. More information on using the Discriminator API can be found in Chapter 9, "Using Discriminators" on page 85 of the GridServer Developer's Guide.
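The "first eligible Task" rule can be sketched as a scan of the waiting list. This does not use the real Discriminator API; the task dictionaries, the predicate functions, and the property names (os, free_memory_mb) are assumptions made for illustration.

```python
def first_eligible_task(waiting_tasks, engine_props):
    """Return the first waiting Task whose discriminator accepts the Engine."""
    for task in waiting_tasks:
        if task["discriminator"](engine_props):
            return task
    return None


waiting = [
    {"id": "t1", "discriminator": lambda p: p["os"] == "linux"},
    {"id": "t2", "discriminator": lambda p: p["free_memory_mb"] >= 512},
]
engine = {"os": "windows", "free_memory_mb": 1024}
print(first_eligible_task(waiting, engine)["id"])  # t2
```

The Windows Engine is skipped for the first Task, which requires Linux, and is given the second Task instead of being left idle.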
Redundant Task Rescheduling Redundant rescheduling addresses the situation in which a handful of Tasks, running on less-capable processors, might significantly delay or prevent Job completion. The basic idea is to launch redundant instances of long-running Tasks. The Broker accepts the first result to return; remaining instances are not cancelled immediately, but continue until they finish or until the Job finishes. Redundant rescheduling does not apply to Services. It is also unrelated to any other retry/reschedule behavior described above.
By default, redundant Task rescheduling is not enabled. With pools of more capable or nearly identical Engines, fastest Task execution occurs when there is no redundancy from rescheduling. In general, rescheduling is only appropriate when there are widely different capabilities in Engines. Three separate strategies, running in parallel, govern rescheduling. Tasks are rescheduled whenever one or more of the three corresponding criteria are satisfied. However, none of the rescheduling strategies comes into play for any Service until a certain percentage of Tasks within that Service have completed; the Strategy Effective Percent parameter determines this percentage. The rescheduler scans the pending Task list for each Service at regular intervals, as determined by the Poll Period parameter. Each Service has an associated taskMaxTime, after which Tasks within that Service will be rescheduled. When the strategies are active (based on the Strategy Effective Percent), the Broker tracks the mean and standard deviation of the (clock) times consumed by each completed Task within the Service. Each of the three strategies uses one or both of these statistics to define a strategy-specific time limit for rescheduling Tasks. Each time the rescheduler scans the pending list, it checks the elapsed computation time for each pending Task. Initially, rescheduling is driven solely by the taskMaxTime for the Service; after enough Tasks complete, and the strategies are active, the rescheduler also compares the elapsed time for each pending Task against the three strategy-specific limits. If any of the limits is exceeded, it adds a redundant instance of the Task to the waiting list. (The Broker will reset the elapsed time for that Task when it gives the redundant instance to an Engine.) 
The Reschedule First flag determines whether the redundant Task instance is placed at the front or the back of the waiting list; that is, if Reschedule First is true, rescheduled Tasks are placed at the front of the queue to be distributed before other Tasks that are waiting. The default setting is false, which results in less aggressive rescheduling. Each of the three strategies computes its corresponding limit as follows:
• The Percent Completed Strategy waits until the Service nears completion (as determined by the Remaining Task Percent setting), after which it begins rescheduling every pending Task at regular intervals, based on the average completion time for Tasks within the Service.
• The Average Strategy returns the product of the mean completion time and the Average Limit parameter. That is, this strategy reschedules Tasks when their elapsed time exceeds some multiple (as determined by the Average Limit) of the mean completion time:
limit = Average Limit * mean completion time
• The Standard Dev Strategy returns the mean plus the product of the Standard Dev Limit parameter and the standard deviation of the completion times. That is, this strategy reschedules Tasks when their elapsed time exceeds the mean by some multiple (as determined by the Standard Dev Limit) of the standard deviation:
limit = mean completion time + Standard Dev Limit * standard deviation
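The Average and Standard Dev limits described above can be computed from the completed Task times as in the following sketch. The function names are hypothetical, the Percent Completed Strategy is not modeled, and the use of the population standard deviation is an assumption, since the text does not say which form the Broker uses.

```python
import statistics


def strategies_active(completed, total, strategy_effective_percent):
    """True once enough Tasks have completed for the strategies to apply."""
    return total > 0 and 100 * completed / total >= strategy_effective_percent


def strategy_limits(completed_times, average_limit, standard_dev_limit):
    """Time limits for the Average and Standard Dev strategies."""
    mean = statistics.mean(completed_times)
    std = statistics.pstdev(completed_times)  # assumption: population form
    return {
        "average": average_limit * mean,
        "standard_dev": mean + standard_dev_limit * std,
    }


def should_reschedule(elapsed, task_max_time, limits=None):
    """Reschedule on taskMaxTime, or on any active strategy limit."""
    if elapsed > task_max_time:
        return True
    return limits is not None and any(elapsed > v for v in limits.values())
```

For completed times of 8 and 12 minutes, with an Average Limit of 3 and a Standard Dev Limit of 2, the limits are 30 and 14 minutes, so a pending Task that has run for 15 minutes would receive a redundant instance under the Standard Dev Strategy.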
Chapter 6 The GridServer Administration Tool
Introduction
The GridServer Manager provides the GridServer Administration Tool, a set of web-based tools that allow the administrator to monitor and manage the Manager, its Grid of Engines, and the associated job space. The GridServer Administration Tool is accessed from a web-based interface, usable by authorized users from any compatible browser, anywhere on the network. Administrative user accounts provide password-protected, role-based authorization. With the pages in the Administration Tool, you can:
• Monitor Service and Task execution and cancel Services
• Monitor Engine activity and kill Engines
• View and modify Manager and Engine configuration
• Install Engines
• Create administrative user accounts and edit user profiles
• Subscribe to get e-mail notification of events
• Edit Engine Tracking properties and change values
• Configure Broker discrimination
• View the GridServer API
• Download the SDK files necessary to integrate application code and run Drivers
• View and extract log information
• View diagnostic reports
• Run Service Tests
FIGURE 6-1: The GridServer Administration Tool.
FIGURE 6-2: The GridServer Administration Tool.
Getting Started The Administration Tool is accessible via HTTP network access from any supported browser that supports JavaScript and Java applets. Make sure that both of these features are enabled in the browser.
In the browser, open http://hostname:port/livecluster (where hostname is the address of the GridServer Manager, and port is the port on which it is listening). The Manager will prompt you for a username and password. If you are running a browser on the same machine that runs the Manager, you can typically open http://localhost:8000/livecluster to begin.
User Accounts and Access Levels All of the administrative screens require you to first log in with a user account. The GridServer Administration Tool uses a system of tiered access to provide security and enable different users to access different areas of the interface. This is done by assigning different access levels for user accounts. There are four account access levels: Configure, Manage, Service, and View. The Configure level is for administrators and allows access to any part of the Administration Tool. By default, the admin account you created at installation is set to the Configure level; you can also create accounts with full access for other administrative users. Other users can be given accounts with more limited access. When a user account with an access level of View, Service, or Manage is used with the Administration Tool, some pages will either function differently, or will not be available.
Creating User Accounts To create a User Account:
1. Log in to the GridServer Administration Tool using an account that has Configure-level access, such as the one created when you first installed GridServer.
2. Click the Admin tab, then click User Admin.
3. On the User Admin page, select Create New User from the Global Actions list. The New User Information page will open.
4. Enter the User Name and a password, then confirm the password.
The following information for a username is optional. You can also:
FIGURE 6-3: Creating a User account.
• Enter a first and last name, and an email address for notifications.
• Select an access level. By default, this will be View.
• If you are using Driver Authentication, you can associate a Driver Profile with a user account, so Drivers using the same username as a user account will also use a specified Driver Profile. Select a Driver Profile from the Driver Profile list to do this.
• Select the users that can be viewed with this account. This user will be able to view any Services submitted by the selected users. Services that don't specify a user will default to the hostname of the Driver and can only be viewed by setting Service Username Access to all.
Features Available by Access Level The following table lists what pages are available in each level:
Level: Pages
View: Service Session Admin, Service Group Admin, GridCache Admin (view only), Dataset Admin (view only), Propagator Admin (view only), Engine Home, Engine Admin, Engine Install, Driver Admin, Broker Admin, Broker Monitor, Director Monitor, License Information, Discriminator Admin (view only), Engine Configuration (view only), Manager Configuration (view only), and Documentation.
Service: All pages from the View level, plus SDK Download, Cache Configuration (view only), Resource Deployment (view only), Service Test, Engine Admin - Log URL List, Engine Admin - Remote Engine Log, Engine Admin - Search Logs, Engine Daemon Admin, Engine Daemon Admin - Log URL List, Engine Daemon Admin - Search Logs, Event Subscription, Cache Configuration, Hook Admin, Service Session Admin - Cancel Service, Service Session Admin - Cancel All Services, Service Session Admin - Remove Finished Service, Service Session Admin - Remove Finished Services, Service Session Admin - Set Priority, TaskAdmin - Cancel Task, and ServiceSessionAdmin - Update Deployment Files.
Manage: All pages from the Service level (with full rights on all Admin pages), plus Discriminator Admin (full rights), Engine Properties, Broker Routing, Event Subscription, Batch Admin, Batch Schedule, Reports (except Direct Query), Engine Configuration (full rights), Manager Configuration (full rights), Cache Configuration, Hook Admin, Current Log, and Diagnostics.
Configure: All pages.
Service Session Admin methods or actions require the user to have Service Username Access to the Service in question. For example, the Service Session page will only show a user’s Services, and that user can only cancel their own Services. User account access levels also affect the ability to use GridServer Web Services to programmatically interact with GridServer. For a list of GridServer Web Service objects and methods enabled by access level, see Chapter 10, “GridServer Admin API” on page 89 of the GridServer Developer’s Guide. Note that access levels don’t filter Services that were submitted before the access level was changed. For example, if a user’s account is changed from Configure to View while a long-running Service was active, the user would still have Configure-level access to that Service.
User Account Security
User accounts can be secured by assigning minimum username and password length, password aging, and other attributes. To configure User security, click the Manager tab, click Manager Configuration, then click Security. The following are configurable: Minimum Username Length, Minimum Password Length, Password Complexity, Password Aging, Password Aging Expiration, and Driver Fails Login With Expired Password. Note that when a user’s password expires, they are required to provide a new password when they log into the Manager.
Session timeouts are also configured for logins to the GridServer Administration Tool and Admin Web Services. By default, these are set at 60 minutes for Administration Tool logins and 300 seconds for Admin Web Services. To change these values, click the Manager tab, click Manager Configuration, then click Security. Values are located in the Admin User Management section.
Navigating the Administration Tool
The Administration Tool consists of a number of pages, organized in the following ways:

The Home Page
When you first open the Administration Tool, a home page is displayed with links to every page. Click a link to go to that page. You can return to this home page by clicking the Home button in the shortcut buttons.
Tabs
All of the pages in the Administration Tool are arranged under seven tabs, grouped by component or function. Click a tab to display a home page, which contains a description and link for each of the pages available on the tab. You can click a page link to view that page. Each page in a section is also listed in the page bar, which is located below the tab controls. Below each tab is a bar containing a link to each page that’s on the home page, including the home page itself. This is useful for returning to the home page, or quickly going to another page without first returning to the home page. Note that if you have gone to a page other than the home page, clicked on another tab, then clicked on the first tab, you will return to the page you previously viewed, not the home page.
FIGURE 6-4: The Administration Tool Tabs.
The following tabs are available:
Services: The Services tab contains pages used to manage, view, and submit Services.
Engine: The Engine tab contains pages used to manage, view, install, and configure Engines.
Driver: The Driver tab contains pages used to manage and install Drivers.
Manager: The Manager tab contains pages used to manage Brokers and configure your Manager.
Reports: The Reports tab contains pages used to view statistics and events generated by the Manager.
Admin: The Admin tab contains various administrative pages used to manage users, view logs, edit Manager hooks, and view Documentation.
Batch: The Batch tab contains links to create, edit, and manage Batches.
Shortcut buttons
The shortcut buttons, shown to the right, are displayed in the upper right of each page. The following buttons are available:
• Home - returns to the home page of the Administration Tool.
• License Information - displays information on your GridServer license. This button flashes when your license has expired, or when proxy limits are exceeded. You can turn this off on the Manager tab, in the Manager Configuration page, in the Admin section, by setting the property under the License Manager heading to false. You will also get a license warning on the login page, starting 14 days before your license is due to expire.
• Help Index - opens an index of online help topics in a new window.
• Documentation - opens a list of all documentation, including links and a search engine.
FIGURE 6-5: Shortcut buttons.

Action Controls
Each table item has an action control, which is a list of actions you can choose. Some of these perform actions on table items, while others open a new page.

Links on other pages
Some pages contain shortcut links to other related pages. Note that only pages that are accessible from the current account are displayed. If you are not using an administrative account with all privileges enabled, some options will not be visible.

Using Tables
Most pages have controls or information grouped in tables. The following controls can be used to sort or reorganize tables for more convenient viewing:

Pager control
The Pager control enables you to step through multiple pages, or specify how many rows appear on a page. Select a page number from the Page list, or select a range from the second list to display those items. You can select a greater number of items listed per page in a table or display all of the items; type a number in the Results Per Page box and click Go.
FIGURE 6-6: The Pager control.

Search control
The Search control is displayed on any page containing a table. You can use it to search any column of a table. Select a column from the list, enter a search term, and click Go.
FIGURE 6-7: The Search control.
Personalize Table
The Personalize Table commands enable you to make changes to a table by removing or adding columns. There are two lists that control this:
Add Column: Select the name of a listed column to add it to the table. Columns previously deleted from the table will be listed, along with any optional columns that are not displayed in a table’s default configuration. Columns will be added to the right of existing columns.
Delete Column: Select the name of a column to remove it from the table. Deleted columns will remain hidden to this account, and these settings will be saved for future login sessions.
FIGURE 6-8: The Add and Delete column controls.
Tables are always sorted by a column that has an arrow in it, either facing up or down. You can click this arrow to reverse the sort order of a table, or click another column to change the sort column.

Refresh
To update the list and display the most current information in a table, click the Refresh button. You can also select a time value from the Refresh list to automatically refresh the table at a regular interval. To stop automatic refreshes, select none.
Broker and Director Monitors
While pages like the Service Session Admin page and Engine Admin page can be used to oversee the running of Services on your Grid, two graphical tools can be used to provide a simpler overview of status information on your system. Both Directors and Brokers have a graphical monitor available, which can be displayed in its own window.
To display the Director Monitor, click the button to the left in the Administration Tool. Note that this button is not present in Managers that only host a Broker. To display the Broker Monitor, click the button to the left in the Administration Tool. Note that this button is not present in Managers running only a Director.
Both monitors display up-to-date information on your Grid. The Director Monitor contains graphs with statistics on Engines, Tasks, Services, and machine status, including thread and memory information. The Broker Monitor contains similar information about one specific Broker. To the right is a sample of a Director Monitor for a Grid with three Engines running several Services at once.
FIGURE 6-9: The Director Monitor.

Manager Component Indicator
The Manager Component Indicator graphically displays what part of the Manager is controlled by each page within the Administration Tool. Each page’s functionality will control either the entire Manager, a Broker, or a Director.
On Manager pages, a red and a blue sphere will be displayed.
If a page’s functionality is tied to a Director, just the red sphere is shown.
If a page’s functionality is for a Broker, just the blue sphere is shown. Also, the Manager Component Indicator will show the hostname of the related component.
Status Display
The GridServer Administration Tool contains a Status Bar at the top of each page, which contains four Status displays. Each of these displays is updated at each page reload with information about the status of your Grid. The following Status displays are included:
• Busy Engines and Available Engines
• Drivers and Engine Daemons
• Running Services and Finished Services
• Running Tasks and Pending Tasks
Chapter 7 Application Resource Deployment
Introduction
GridServer provides several options for distributing classes, libraries, and other resources to Engines. A Grid Library (or GL) provides an enterprise solution to managing versioned sets of resources that may be used by multiple services. Grid Libraries provide the following features:
• Version control, including optional automatic selection of the most current version of a Grid Library.
• Resource upgrading without interrupting current Sessions.
• Specification of dependencies on other Grid Libraries.
• Specification of C++ Bridges and non-default JREs via dependencies.
• All-in-one packaging for JARs, native libraries for multiple OSes, .NET assemblies, Command Service executables, and Engine Hooks.
• Specification of Environment Variables and Java System properties.
• Engines that require different compiler support libraries (GCC2/GCC3) can participate in the same Service Session.
• Optimization of Engine restarts.
• Task reservation when an Engine requires a restart.
• Parameterization of package configuration through the use of property substitution files.
The Resource Deployment feature replicates sets of directories from a Manager to Engines to provide a method of copying and managing files. It can be used for Grid Libraries and for the default set of resources. In the simplest sense, this enables you to copy a JAR, DLL, or another resource to each Engine to run a Service.
Remote Application Installation can install and uninstall applications on remote Windows Engines in non-Grid Library deployment.
This chapter details how to use each of these methods of deployment for your GridServer installation.
Grid Libraries
A Grid Library is essentially a set of resources and properties necessary to run a Grid Service, along with configuration information that describes to the GridServer environment how those resources are to be used. For example, a Grid Library can contain JARs, native libraries, configuration files, environment variables, hooks, and other resources. A Grid Library is deployed as an archive file in ZIP or gzipped TAR format, with a grid-library.xml file in the root that describes the Grid Library. It may also contain any number of directories that contain resources.
Grid Libraries are identified by name and version. All Grid Libraries must have a name, and typically have a version. The version is used to detect conflicts between a desired library and library that has already been loaded; it also provides for automatic selection of the latest version of a library. A GridServer Service can specify that it is implemented by a particular Grid Library by specifying the gridLibrary and gridLibraryVersion Service Options or Service Type Registry Options. Grid Libraries can specify that they depend on other Grid Libraries; like the Service Option, such dependencies can be specified by the name, and optionally the version. Also, nearly all aspects of a Grid Library can be specified to be valid only for a specific operating system. This means that the same Grid Library can specify distinct paths and properties for Windows, Linux, and Solaris, but only the appropriate set of package options will be applied at run-time.
Grid Library Format
The Grid Library can be any archive file in ZIP (.zip) or gzipped TAR format (.tgz or .tar.gz), with a grid-library.xml file in the root. Although the filename has no inherent meaning, we recommend the format:
[library name]-[library version].[zip|tar.gz|tgz]
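As a pre-deployment check, the packaging rule above can be sketched in Python. This is an illustrative sketch only, not a GridServer tool: the function names recommended_file_name and has_root_descriptor are hypothetical, and the check covers only the presence of grid-library.xml at the archive root, not validation against the DTD.

```python
import tarfile
import zipfile

DESCRIPTOR = "grid-library.xml"


def recommended_file_name(name, version, extension="zip"):
    """Build a file name in the recommended [name]-[version].[ext] form."""
    if extension not in ("zip", "tar.gz", "tgz"):
        raise ValueError("extension must be zip, tar.gz, or tgz")
    return f"{name}-{version}.{extension}"


def has_root_descriptor(archive_path):
    """Return True if a ZIP or gzipped TAR has grid-library.xml at its root."""
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            return DESCRIPTOR in archive.namelist()
    try:
        # "r:gz" accepts only gzipped TAR, matching the documented formats.
        with tarfile.open(archive_path, "r:gz") as archive:
            names = archive.getnames()
    except (tarfile.ReadError, OSError):
        return False
    # Some tar tools store root entries with a leading "./".
    return any(n in (DESCRIPTOR, "./" + DESCRIPTOR) for n in names)
```

A descriptor stored in a subdirectory (for example conf/grid-library.xml) is deliberately reported as missing, since the file must be in the root of the Grid Library.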
The directory structure is completely up to the user, since the configuration file is used to specify where resources are found within the Grid Library. The configuration file must be a well-formed XML file named grid-library.xml, and be in the root of the Grid Library. The GridServer SDKs include a grid-library.dtd file that can be used to validate the XML file. They also include an example Apache Ant build.xml file that can be used to validate and build Grid Libraries. This DTD can also be found at Appendix A, “The grid-library.dtd” on page 99. Following is a list that specifies all elements and attributes of the grid-library.dtd file. It uses the XML schema notation for elements and attributes, such as:
[no tag]: (Required)
?: (Optional)
*: (Optional and Repeatable)

Element: grid-library
Description: The root element.
ELEMENTS: grid-library-name grid-library-version? dependency* jar-path* lib-path* assembly-path* command-path* hooks-path* environment-variables* java-system-properties*
ATTRIBUTES: os? compiler?

Element: grid-library-name
Description: The library name. All libraries must be named.

Element: grid-library-version
Description: The version. If not specified, 0 is implied. If in comparable format as defined below, it can be used to determine the latest version.

Element: dependency
Description: A library dependency. If the version is not specified, the latest version is chosen at runtime.
ELEMENTS: grid-library-name* grid-library-version?

Element: conflict
Description: Indicates that this library conflicts with the given library. If this Grid Library is NOT a dependency, and grid-library-name="*", then it indicates that this Grid Library conflicts with all other Grid Libraries (aside from its dependencies).
ELEMENTS: grid-library-name*

Element: pathelement
Description: An element containing a relative path, typically set to a directory. This element must be in the proper format for the OS. The path is resolved relative to the Grid Library.

Element: jar-path
Description: The JAR path. If specified, all JARs and classes in the path are loaded.
ELEMENTS: pathelement*
ATTRIBUTES: os? compiler?

Element: lib-path
Description: The native library search path.
ELEMENTS: pathelement*
ATTRIBUTES: os? compiler?

Element: assembly-path
Description: The .NET assembly search path. Absolute assembly paths, mapped drives, and UNC paths will not work.
ELEMENTS: pathelement*

Element: command-path
Description: The path in which the Engine will search for Command Service executables.
ELEMENTS: pathelement*
ATTRIBUTES: os? compiler?

Element: hooks-path
Description: Engine hooks library path. Engine Hooks will be initialized at the time the containing Grid Library is loaded.
ELEMENTS: pathelement*
ATTRIBUTES: os? compiler?

Element: name
Description: The name of a property

Element: value
Description: The value of a property

Element: property
Description: A name/value pair, used by environment variables and Java System properties.
ELEMENTS: name, value

Element: environment-variables
Description: Environment variables to set.
ELEMENTS: property
ATTRIBUTES: os? compiler?

Element: java-system-properties
Description: Java system properties, which are set immediately prior to executing a task using this library.
ELEMENTS: property
ATTRIBUTES: os? compiler?
The following is a list of attributes used above. Valid values can be found in the Product Info page in the GridServer Administration Tool.

Attribute: os
Description: Specifies the operating system (OS) to which the containing element applies. If the current operating system does not match this value, the containing element and its children and content are ignored.

Attribute: compiler
Description: Specifies the compiler to which the containing element applies. If the current compiler does not match this value, the containing element and its children and content are ignored.
Variable Substitution
A file can be created that contains variable substitutions, which are substituted into the grid-library.xml file. This allows for quick changes in properties in the grid-library.xml file without redeploying the Grid Library.
You can have a default properties file in your Grid Library called grid-library.properties that can provide baseline values for your variables. You can also create an external properties file, named with the same name as the Grid Library archive, with the extension .properties, and place it in the Grid Library deployment directory. Values in the external properties file override those in the Grid Library.
If the grid-library.xml file contains a value with a variable name enclosed in $ characters, such as $mydir$, and the properties file contains an assignment, such as mydir=c:\\dir, the variable is substituted.
NOTE: Substitutions are allowed within the content of property value elements and pathelements only. If the substitution is not found in the file, the empty string, "", is substituted. Substitutions are allowed anywhere in a string. Multiple substitutions per string are allowed. $ characters can be treated as literals by escaping them with another $ character. Windows paths that are specified in the [library].properties file must escape the \ character with another \.
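The substitution rules above can be illustrated with a short Python sketch. It is not the Engine's implementation: the function name substitute is hypothetical, the property values are assumed to be already parsed (so the doubled backslash of the properties file has already become a single backslash), and the handling of edge cases such as an unmatched single $ is an assumption.

```python
import re

# Either an escaped "$$" or a "$name$" reference.
_TOKEN = re.compile(r"\$\$|\$([^$]+)\$")


def substitute(text, external, defaults=None):
    """Replace $name$ references in text.

    external: values from the external [library].properties file.
    defaults: values from grid-library.properties inside the archive.
    External values take precedence over defaults.
    """
    values = dict(defaults or {})
    values.update(external)

    def replace(match):
        if match.group(0) == "$$":
            return "$"  # escaped literal dollar sign
        # Unknown names are replaced with the empty string.
        return values.get(match.group(1), "")

    return _TOKEN.sub(replace, text)
```

For example, substitute("$mydir$/bin", {"mydir": "c:\\dir"}) yields c:\dir/bin, and a reference to a name that appears in neither file is replaced with the empty string.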
Versioning
Versioning provides the following functionality:
• It allows for deployment of new versions of libraries and deletion of old versions without interrupting currently executing Service Sessions.
• It provides for specifying conflicts, or libraries that cannot coexist with each other.
• It allows for a Service Session or dependency to specify the use of the latest version of a Grid Library.
To use versioning, you must specify the Grid Library version in the configuration file. An Engine can load only one version of the library with the same name at any time. If the version is not specified, it is implied to be 0. While the version can be any String, if it follows the proper comparable version format it can also be used to determine the latest version of the library, for automatic loading. This format is:
[n1].[n2].[n3]...
where nx is an integer, and there may be one or more version points. For instance, 4.0.1.1, 4.1, and 3 are in the proper comparable version format. The integers at each version point are compared starting at the first point, continuing until a version point in one version is greater than the corresponding point in the other. If a version point does not exist for one, it is implied as zero. For instance:
4.0.0.1 > 4.0
4.0.0.5 < 4.0.1.1
To specify that a dependency or Service use a particular version of a Grid Library, the version field is set to that value. To specify that it use the latest version, the field is left blank. If a version is specified but not in this format, and there are multiple versions of a library, the “latest version” is undefined. Thus, automatic selection of the latest version is only possible when all Grid Libraries with the specified name provide a version in the proper format. Note that automatic versioning is dynamic. That is, if a Service or dependency specifies the latest version, and a new version of a Grid Library is deployed, the next time that Grid Library is used by any Session it will be the new version.
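The comparison rule can be written out as a small Python sketch. The function names are hypothetical and this is not GridServer code; returning None from latest_version is simply this sketch's way of representing the "undefined" case described above.

```python
import re
from functools import cmp_to_key
from itertools import zip_longest

# One or more integers separated by dots, e.g. "4.0.1.1" or "3".
_COMPARABLE = re.compile(r"[0-9]+(\.[0-9]+)*")


def parse_version(version):
    """Return the version points as integers, or None if not comparable."""
    if not _COMPARABLE.fullmatch(version):
        return None
    return [int(point) for point in version.split(".")]


def compare_versions(left, right):
    """Return -1, 0, or 1; missing version points count as zero."""
    a, b = parse_version(left), parse_version(right)
    if a is None or b is None:
        raise ValueError("version is not in comparable format")
    for x, y in zip_longest(a, b, fillvalue=0):
        if x != y:
            return -1 if x < y else 1
    return 0


def latest_version(versions):
    """Return the latest version, or None when it is undefined."""
    versions = list(versions)
    if not versions or any(parse_version(v) is None for v in versions):
        return None
    return max(versions, key=cmp_to_key(compare_versions))
```

With these definitions, compare_versions("4.0.0.1", "4.0") returns 1 and compare_versions("4.0.0.5", "4.0.1.1") returns -1, matching the two examples in the text.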
Dependencies
Grid Libraries may specify dependencies on other Grid Libraries. A dependency specification resolves to a particular Grid Library using two values:
grid-library-name: The name of the Grid Library, as specified in the dependency’s XML.
grid-library-version: The version of the Grid Library, as specified in the dependency’s XML. If not specified, it will use the latest version supported by the OS. OS compatibility is determined by checking the os and compiler tags for the top-level element in the dependent Grid Library.
Note that if a dependency resolves to more than one Grid Library, the dependency used is undefined. Two dependent libraries conflict if they have the same library name, but different versions.
Conflicts
A conflict between two Grid Libraries means that these libraries cannot be loaded concurrently. When there is a conflict between a loaded Grid Library and a Grid Library required by a Service, the Engine must restart to unload the current libraries and load the requested library. The following circumstances result in a conflict:

Version Conflict
The most common conflict arises via versioning, typically when upgrading versions or using more than one version of the same library concurrently. This conflict arises when a Grid Library with the same grid-library-name as the requested Grid Library, but a different version, is loaded.

Explicit Conflict
There can be situations in which different Grid Libraries can conflict with each other due to conflicting native libraries, different versions of Java classes, and so on. Because the Engine cannot determine these implicitly, the conflict element can be used to specify Grid Libraries that are known to conflict with this Grid Library. Additionally, the value of the grid-library-name can be set to "*". This means that this Grid Library conflicts with all other Grid Libraries (aside from its dependencies), and it is guaranteed that no other Grid Libraries will be loaded concurrently with this Grid Library. Note that this is only allowed if the Grid Library is not a dependency; if the "*" is used as a conflict in a Grid Library that is a dependency, a verification error will occur.

Dynamic Version Conflict
A Grid Library conflict occurs if dynamic versioning is used, and the latest version of a Grid Library or Grid Library dependency has changed due to an addition or removal of a dependency since the Grid Library was loaded.

Variable Substitution Conflict
A Grid Library conflict occurs if its variable substitution file has changed since it was loaded.
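The first two conflict types (version and explicit) can be sketched as a simple check in Python. The GridLibrary class and the function names are hypothetical and model only the fields needed for the check; dynamic version and variable substitution conflicts depend on Engine state and are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridLibrary:
    name: str
    version: str = "0"
    # Names given in conflict elements; may contain "*".
    conflicts: frozenset = frozenset()
    # Names of the libraries this library depends on.
    dependencies: frozenset = frozenset()


def conflicts_with(requested, loaded):
    """Return True if the two libraries cannot be loaded concurrently."""
    if requested.name == loaded.name:
        # Version conflict: same name, different version.
        return requested.version != loaded.version
    for a, b in ((requested, loaded), (loaded, requested)):
        if b.name in a.conflicts:
            return True  # explicit conflict by name
        if "*" in a.conflicts and b.name not in a.dependencies:
            return True  # "*" excludes everything except dependencies
    return False


def needs_restart(requested, loaded_libraries):
    """An Engine restart is needed if any loaded library conflicts."""
    return any(conflicts_with(requested, lib) for lib in loaded_libraries)
```

The check is symmetric on purpose: a conflict declared by either library prevents the pair from being loaded together.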
Grid Library Loading
When a Service Session is set to use a Grid Library, that library is loaded. Loading is the process of setting up all resources in the Grid Library for use by the Service. A library is loaded only once per Engine session. First, the library loads itself, and then it loads all dependencies. Libraries are loaded depth-first rather than breadth-first. Certain aspects of a load may require a restart, and possibly re-initialization of the state. The following steps are performed by a load of the root library and all dependencies:
1. The Engine checks for conflicts with currently loaded Grid Libraries. If there is a conflict, it will restart with the requested Grid Library and clear out the current state of any loaded libraries.
2. If new lib-paths have been added for its OS, they will be appended to the current list of lib-paths, and the Engine will restart. The state of loaded libraries will include all libraries already loaded, plus the requested library. Note that specifying a JRE dependency has this effect.
3. If new jar-paths have been added for its OS, the jars and classes will be added to the classloader.
4. If new assembly-paths have been added, it will add them to the .NET search path.
5. If new command-paths have been added for its OS, they are added to the search path for Command Tasklets.
6. If new hooks-paths have been added, any hooks in the path will be initialized.
7. If the default resources are currently in use and a Grid Library is requested, the Engine will restart.
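The load order described above (the library itself first, then its dependencies, depth-first, each library only once) can be sketched as follows. The function name load_order and the dependency map are hypothetical; the sketch computes only the order and does not model restarts or path updates.

```python
def load_order(root, dependencies):
    """Return library names in depth-first load order.

    dependencies maps a library name to the list of names it depends on,
    in the order they appear in its grid-library.xml.
    """
    order = []
    seen = set()

    def visit(name):
        if name in seen:
            return  # a library is loaded only once
        seen.add(name)
        order.append(name)  # the library loads itself first
        for dependency in dependencies.get(name, ()):
            visit(dependency)  # then each dependency, depth-first

    visit(root)
    return order
```

For a library A that depends on B and C, where both B and C depend on D, the resulting order is A, B, D, C: D is loaded while descending through B, before C is reached.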
State Preservation
In most cases, when an Engine shuts down, it preserves the current state of which Grid Libraries it has loaded. When it starts back up, it loads all Grid Libraries that were loaded when it shut down.
As Grid Libraries are loaded, the pathelements they contain are added to a ‘master’ list of paths for that type of pathelement. For example, if a Grid Library contains a lib-path specification, that lib-path is appended to the list of lib-path values obtained from already-loaded Grid Libraries. Note that this means that it is up to the creator of the Grid Libraries deployed on the Grid to ensure that the ordering of library paths does not lead to loading the wrong library. For example, if two different Grid Libraries each provide DLLs in their lib-paths that share the same name, because of OS-specific library load conventions, the one that will be used will be the first one found in the aggregate lib-path from across all loaded Grid Libraries. Likewise for Java classes, when more than one copy of the same class is in the classloader, it is undefined which class will be loaded. Therefore it is important to either subdivide Grid Libraries appropriately when such conflicts could arise, or to use the conflict element to explicitly state conflicts.
If an Engine shuts down due to a conflict, it clears the current state and sets up for only the requested Grid Library upon restart. This is referred to as preloading. If an Engine shuts down due to internal library inconsistencies or a crash, the state is not saved. State is also cleared on all instances for file updates, Daemon restarts, and Daemon disable.
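The "first one found" behavior for the aggregate lib-path can be illustrated with a Python sketch that may also help when auditing Grid Libraries for shadowed files before deployment. Both function names are hypothetical, and the sketch imitates a simple ordered directory search; actual OS library loaders have additional rules.

```python
from pathlib import Path


def resolve_library(file_name, aggregate_lib_path):
    """Return the first match for file_name in the ordered directory list."""
    for directory in aggregate_lib_path:
        candidate = Path(directory) / file_name
        if candidate.is_file():
            return candidate
    return None


def shadowed_files(aggregate_lib_path):
    """Map each file name to the later copies hidden by an earlier one."""
    first_seen = {}
    shadowed = {}
    for directory in aggregate_lib_path:
        for entry in sorted(Path(directory).iterdir()):
            if not entry.is_file():
                continue
            if entry.name in first_seen:
                shadowed.setdefault(entry.name, []).append(entry)
            else:
                first_seen[entry.name] = entry
    return shadowed
```

A non-empty result from shadowed_files suggests that the libraries involved should either be subdivided or declared as conflicting.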
Task Reservation
If an Engine requires a restart to load a Grid Library, the task will be reserved on the Broker for that Engine. The Engine is instructed to log back into the same Broker, and will take that task upon login. The timeout for this is configurable on the Broker on the Manager Configuration page, in the Services section.

Environment Variables and System Properties
All Environment variables and Java System properties for a Grid Library and all dependencies will be set each time a task is taken from a particular service that specified that Grid Library. (They are not cleared after the task is finished.) Environment variables are set via JNI so that they can be used by native libraries or .NET assemblies, and they are also passed into Command Services. Note that environment variables such as PATH and LD_LIBRARY_PATH should not be changed through this mechanism. Rather, lib-path and command-path are reserved for manipulating these variables.
Using Grid Libraries from a Service
Services can specify a Grid Library to use by setting the GRID_LIBRARY and optionally the GRID_LIBRARY_VERSION Service Options. This would typically be set by Service Type in the Service Registry page, although it can be set programmatically on the Session. Jobs can specify a Grid Library to use by setting the corresponding JobOption values. If the version is not set, a Service will use the latest version of a Grid Library.
If a Service needs to find resources in a Grid Library, it can use the Grid Library Path. This value is a path value that includes the root directories of all Grid Libraries currently loaded. This path can be retrieved in the following ways:
ds.GridLibraryPath: Java System property, .NET System.AppDomain.CurrentDomain data entry
ds_GridLibraryPath: Command Service, native library Service environment variable
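For a Command Service implemented as a script, reading the ds_GridLibraryPath environment variable might look like the following Python sketch. The function names are hypothetical, and the separator between directories is an assumption: the sketch defaults to the platform's path separator (os.pathsep), which should be verified against an actual Engine before relying on it.

```python
import os
from pathlib import Path


def grid_library_roots(environ=None, separator=os.pathsep):
    """Return the Grid Library root directories from ds_GridLibraryPath.

    The separator is assumed to be the platform path separator.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("ds_GridLibraryPath", "")
    return [Path(part) for part in raw.split(separator) if part]


def find_resource(relative_path, roots):
    """Return the first existing copy of relative_path under the roots."""
    for root in roots:
        candidate = root / relative_path
        if candidate.exists():
            return candidate
    return None
```

A Service would typically call grid_library_roots() once at startup and then use find_resource to locate files such as configuration data packaged in a Grid Library.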
Deployment
Grid Libraries are typically deployed by placing them in the Grid Library deployment directory on the Primary Director. The Resource Manager will then replicate these libraries to all Engines. Variable Substitution property files also should be placed in this directory.
Grid Libraries are special resources, in that adding or removing Grid Libraries or property files will not result in an Engine and Daemon restart, like other resources. This is because it is not necessary to restart until the Engine actually needs to use the Grid Library, and even then only if necessary according to the loading procedure. Note that if a Grid Library is changed, the Daemon and Engines will restart like they would in the case of a change to any other resource. Also, it is the responsibility of the user not to delete Grid Libraries via the Resource Deployment page that have been loaded by active Services, as that may lead to library load failures for subsequently executed Tasks.
If you are not using the Resource Manager for replication, you can use an alternate shared Grid Library directory. You must then set the Grid Library Path in all Engine Configurations to point to this directory, instead of the default replicated location. When changes are made to this library, you must then use the Update button on the Resource Deployment page on the Primary Director. This will send a message to all Engines to check and update their Grid Libraries via the Grid Library Manager.
Grid Library Manager
The Grid Library Manager exists on all Engines, and is responsible for maintaining the state of all Grid Libraries deployed. Whenever any change is made to the Grid Library directory (typically due to replication), the Grid Library Manager will update the local status as follows:
1. Any new Grid Library files are unzipped to a directory with the name corresponding to the file name. This new library will be added to the Grid Library Manager’s catalog, but not loaded until needed.
2. If a Grid Library is removed, it will delete the local copy of the zipped Grid Library and the unzipped directory.
3. Variable substitution files are copied into the appropriate directory. If a variable substitution file has been changed, and the corresponding Grid Library has already been loaded, it is marked as dirty so that the next time an Engine attempts to use it, it will restart due to conflict.
4. If any Grid Library uses a latest version in the Grid Library’s catalog, and the latest version has changed, it is marked dirty so that the next time an Engine attempts to use it, it will restart due to conflict.
The Grid Library Manager locks the directory while making any changes, so that if multiple Engine instances are running or multiple Engine Daemons are running from a shared Engine directory, only one Engine will perform any file manipulation. Other Engines will wait until those operations are completed, and then their Grid Library Managers will update their links appropriately.
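Steps 1 through 3 above amount to comparing the deployment directory against the local catalog. The following Python sketch shows that comparison only; the function name plan_update and its inputs are hypothetical, libraries are identified by an abstract name rather than by real file names, and step 4 (a change in the latest version) and directory locking are not modeled.

```python
def plan_update(archives, catalog, changed_properties, loaded):
    """Work out what a catalog update would do.

    archives: names of Grid Libraries now present in the directory.
    catalog: names currently known to the local catalog.
    changed_properties: names whose substitution file has changed.
    loaded: names of libraries already loaded by the Engine.
    """
    archives, catalog = set(archives), set(catalog)
    return {
        # Step 1: new archives are unpacked and cataloged, not loaded.
        "unpack": sorted(archives - catalog),
        # Step 2: removed archives lose their local copy and directory.
        "delete": sorted(catalog - archives),
        # Step 3: loaded libraries with changed properties become dirty.
        "mark_dirty": sorted(set(loaded) & set(changed_properties) & archives),
    }
```

A library whose properties changed but which has not been loaded is not marked dirty, because it will simply pick up the new values the first time it is loaded.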
C++ Bridges
C++ Bridges are the native bridges that allow Engines to execute native Services. They are packaged as Grid Libraries, named cppbridge-[os]-[compiler]-[M]-[m], where M and m are the GridServer major and minor version numbers. All C++ Bridges are pre-packaged and deployed in the Grid Library replication directory upon GridServer Manager installation or upgrade. Only one version of a bridge can be loaded at any given time, so all bridges for a particular platform are built to explicitly conflict with each other. For example, a Service that uses VC7.1 conflicts with one that uses VC7.0.

JREs
JREs will be packaged as jre-[os]-[version].glz. The Grid Library name will be jre-[os], and the version will be the JRE version, for example, 1.4.2.06. DataSynapse will package JREs for customers as needed, or as they become available; contact DataSynapse support for details.
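Grid Library Example
The following example grid-library.xml is for a mixed Java/C++ application that runs on Windows, and both gcc2 and gcc3 for Linux:
Example 7.1: grid-library.xml example
<grid-library>
  <grid-library-name>MyLib</grid-library-name>
  <grid-library-version>1.0.0.1</grid-library-version>
  <lib-path>
    <pathelement>lib/gcc2</pathelement>
  </lib-path>
  <lib-path>
    <pathelement>lib/gcc3</pathelement>
  </lib-path>
  <dependency>
    <grid-library-name>cppbridge-vc6</grid-library-name>
  </dependency>
  <dependency>
    <grid-library-name>cppbridge-gcc3</grid-library-name>
  </dependency>
  <dependency>
    <grid-library-name>cppbridge-gcc2</grid-library-name>
  </dependency>
  <dependency>
    <grid-library-name>jre-win32</grid-library-name>
    <grid-library-version>1.4.2.06</grid-library-version>
  </dependency>
  <dependency>
    <grid-library-name>MyCalculator</grid-library-name>
  </dependency>
  <hooks-path>
    <pathelement>hooks</pathelement>
  </hooks-path>
  <jar-path>
    <pathelement>jars</pathelement>
    <pathelement>morejars</pathelement>
  </jar-path>
  <lib-path>
    <pathelement>lib\win</pathelement>
    <pathelement>s:\lib\win</pathelement>
  </lib-path>
  <environment-variables os="win32">
    <property>
      <name>MY_WIN_VAR</name>
      <value>$WinVar$</value>
    </property>
  </environment-variables>
  <environment-variables os="linux" compiler="gcc3">
    <property>
      <name>MY_GCC3_VAR</name>
      <value>$LinuxDriverDir$</value>
    </property>
  </environment-variables>
  <java-system-properties>
    <property>
      <name>foo</name>
      <value>bar</value>
    </property>
  </java-system-properties>
</grid-library>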
Legacy Resource Deployment When it is not necessary or optimal to use Grid Libraries, a default set of resources is also available for use by Engines. For instance, a Grid with only a small number of applications that do not require uninterrupted upgrading may not require Grid Libraries. Also, developing and testing GridServer applications is typically easier using the default resources.
Using Default Resources
Default resources are used when a Service does not specify a Grid Library. They cannot be used concurrently with Grid Libraries, so the default resources can be thought of as a non-versioned Grid Library that conflicts with all other Grid Libraries. Also, rather than using a grid-library.xml file, default resources use the Engine Configuration to specify paths.
When using Default Resources, the following Engine Configuration properties take effect; when using Grid Libraries, they do nothing:
• Environment Variables
• Default JAR and Class Path
• Default Library Path
• Common Library Path
• Default Hook Path
Default Resource Paths
The paths used by Default Resources are set in the Engine Configuration, in the Classes, Libraries, and Paths section. By default, these paths are set to replicated resource locations. Following is a list of the paths, and their analogs in Grid Libraries:
JAR and Class Path: The jar-path
Library Path: The lib-path and assembly-path (for Windows)
Hooks Path: The hooks-path
C++ Bridges C++ Bridges are used by simply including the bridge libraries in the Library Path. These libraries are installed by default when the Manager is installed or upgraded, into the default library path. Note that this means that only one version of a bridge may be used. For example, when using the default resources, you cannot use both VC6 and VC7 services for the same Engine configuration.
Grid Library features not supported by Default Resources
The following features are unique to Grid Libraries and cannot be utilized when using Default Resources:
JRE: Only the default JRE can be used.
System Properties: Not supported, although they can be set via an Engine Hook or in the Service implementation.
Environment Variables: Not supported, although they can be set via an Engine Hook or in the Service implementation via JNI.
Daemon and Engine restart optimization: When default resources are changed, all Engines and Daemons will restart to update those resources.
Variable Substitution: Not supported.
Code Versioning Deprecation Code Versioning has been replaced by Grid Libraries as of GridServer version 4.1.
To support migration from Code Versioning to Grid Libraries without changing the client implementation, GridServer does the following: if the CODE_VERSION option is set for a Service, the GRID_LIBRARY value is set to that value. To migrate, you must at minimum perform the following steps so that legacy clients work correctly:
1. Package all Code Version directories as Grid Libraries with grid-library-name=codeVersion.
2. If any directories include C++ Bridge DLLs, remove them and replace them with the proper bridge dependency.
3. If Code Versions conflict with each other, use the conflict element. If all Code Versions conflict with each other, you can simply use the "*" conflict value.
Note that these instructions are the minimum necessary to migrate from Code Versions to Grid Libraries without changing existing client code. As client code is changed, you may find a more optimal division of resources into dependencies.
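The option mapping described above can be pictured with a small Python sketch. It is illustrative only: the function name and the dictionary of Service options are hypothetical, and the precedence applied when both options are present is an assumption, since the text only states that GRID_LIBRARY takes the CODE_VERSION value.

```python
def effective_grid_library(service_options):
    """Return the Grid Library name a Service would resolve to.

    service_options is a hypothetical dict of Service option names to values.
    """
    code_version = service_options.get("CODE_VERSION")
    if code_version:
        # Legacy clients: the Code Version name is reused as the Grid Library name.
        return code_version
    # No Code Version set: use GRID_LIBRARY as given (None means default resources).
    return service_options.get("GRID_LIBRARY")
```

A legacy client that sets CODE_VERSION to a hypothetical name such as "calc-2.1" would therefore need a Grid Library packaged with that same name.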
Resource Deployment: Distributing Grid Libraries and Default Resources The GridServer system provides a Resource Deployment mechanism for securely distributing Grid Libraries and resources, such as libraries (.dll or .so), Java class archives (JAR), binaries, or large data files that change relatively infrequently. The resources to be deployed are placed within a reserved directory on the Primary Director. The system maintains a synchronized replica of the reserved directory structure for all Engines. The replica of files on the Director is synchronized to Brokers, and then Brokers synchronize the files with Engines. The files are secure in that they cannot be accessed by anyone on the network, only the Engines.
The Resource Deployment Interface
The GridServer Administration Tool provides a graphical interface to manage resources synchronized to Engines. To manage resources, on the Primary Director click the Services tab in the Administration Tool, and click Resource Deployment. The Resource Deployment page, shown in Figure 7-1, features a file browser that can be used to navigate the replicated directories, create new directories, and add or delete files. To navigate the directories, simply click the displayed file names or the directory names in the current directory, displayed above. You can add new files to a directory by entering a filename and clicking the Upload button, or clicking the Browse button to find files on your computer. Once you have added new files, you can click Update to update the files to your Engines.
FIGURE 7-1: The Resource Deployment page.
Resource Deployment File Locations
The resources directory contains a directory for each Engine OS that is deployed only to Engines with the respective operating system. The gridlib and shared directories are deployed to all Engines.
The default locations for these directories, relative to the livecluster base directory, are in the deploy/resources directory. Files in the resources directory itself are not deployed. The corresponding Engine-side directory is located under the root directory for the Engine installation, for example, C:\Program Files\DataSynapse\Engine\resources for Windows; or /usr/local/DSEngine/resources for Unix. There are two reserved file patterns: those that contain a #, and those that end in .tmp. You cannot deploy resources that match these patterns, as they will cause problems with the replication mechanism.
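As a pre-upload check, the two reserved patterns and the per-OS directory rule can be expressed in a few lines of Python. This is a minimal sketch, not part of GridServer: the function names are hypothetical, the OS directory names (such as win32 and linux) are passed in by the caller, and case-sensitive matching of .tmp is an assumption.

```python
def is_deployable(filename):
    """Return False for names that match either reserved pattern."""
    return "#" not in filename and not filename.endswith(".tmp")


def directories_for_engine(engine_os, os_directories):
    """List the top-level resource directories replicated to one Engine.

    gridlib and shared go to every Engine; an OS directory goes only to
    Engines running that operating system.
    """
    selected = ["gridlib", "shared"]
    if engine_os in os_directories:
        selected.append(engine_os)
    return selected
```

Running such a check before copying files into deploy/resources would avoid placing files there that the replication mechanism cannot handle.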
Configuring Directory Replication
The system can be configured to trigger updates of the replicas in one of two modes:
• Automatic update mode. The resources are automatically deployed to any Engine upon login to the Broker. Also, the Manager continuously polls the file signatures within the designated subdirectories at the time interval specified in Monitor Interval, and triggers Engine updates whenever it detects changes; to update the Engines, the system administrator need only add or overwrite files within the directories. This is the default update method.
• Manual update mode. The administrator ensures that the correct files are located in the designated subdirectories and triggers the updates manually by issuing the appropriate command in the GridServer Administration Tool. Updates also take place at startup.
To configure manual updating:
1. Click the Manager tab, then click Manager Configuration.
2. Under Broker Resources and Director Resources, set Monitor Interval for both to 0.
There are two ways to update files to Engines manually:
1. Click the Services tab, then click Resource Deployment.
2. Click Update.
or:
1. Click the Engine tab, then click Engine Admin.
2. Click Update Deployment Files on the Global Actions menu.
Either of these actions causes all Engines to update. If you have installed new files and want all Engines to use them immediately, use either of these procedures.
During rapid Java development, an alternative to file updating is the use of the JAR_FILE Service Option to dynamically attach a local JAR file to the Service. By default, this option is not available for security reasons, and has certain restrictions.
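Automatic update mode can be pictured as a comparison of file signatures between two polls. The Python sketch below is illustrative only and does not describe the Manager's actual mechanism: the SHA-256 signature, the function names, and the dictionary layout are all assumptions. The only behavior taken from the text is that a Monitor Interval of 0 turns polling off.

```python
import hashlib
import os


def snapshot(root):
    """Map each file's path (relative to root) to a SHA-256 digest."""
    signatures = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            signatures[os.path.relpath(path, root)] = digest
    return signatures


def changed_paths(previous, current):
    """Return the sorted paths that were added, removed, or modified."""
    changed = set(previous) ^ set(current)
    changed.update(
        path for path in previous.keys() & current.keys()
        if previous[path] != current[path]
    )
    return sorted(changed)


def polling_enabled(monitor_interval):
    """A Monitor Interval of 0 selects manual update mode."""
    return monitor_interval > 0
```

A poller would call snapshot once per interval and trigger an Engine update whenever changed_paths returns a non-empty list.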
Using Engines with Shared Network Directories Instead of using directory replication, you can also provide Engines with common files with a shared network directory, such as an NFS mounted directory. To do this, you must provide a directory on a shared server that can be accessed from all of the Engines. Then the Engines must be configured to use that location. Click the Engine tab in the Administration Tool, click Engine Configuration, and change the directories appropriately.
JAR Ordering File If you are using multiple JAR files and need the classloader to load them in a specific order to prevent conflicts, you can specify the order in which they are loaded. To do this, create a file called index.libs in the JAR path root and put the names of JAR files, one per line, in the order in which they should be loaded. Those not in the list will be loaded afterwards, in no specified order.
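The ordering rule for index.libs can be sketched in Python as follows. This is a hedged illustration rather than the Engine's classloader logic: the function name is hypothetical, and unlisted JARs are sorted alphabetically here only to make the output repeatable, since the guide states that their order is unspecified.

```python
def order_jars(available, index_lines):
    """Return JAR names with index.libs entries first, then the rest.

    available   -- JAR file names found in the JAR path root
    index_lines -- lines read from index.libs, one JAR name per line
    """
    listed = []
    for line in index_lines:
        name = line.strip()
        # Skip blank lines, duplicates, and names with no matching file.
        if name and name in available and name not in listed:
            listed.append(name)
    # Unlisted JARs load afterwards; sorting is an assumption for determinism.
    rest = sorted(name for name in available if name not in listed)
    return listed + rest
```

For example, an index.libs that lists c.jar and then a.jar yields c.jar, a.jar, followed by any remaining JARs.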
Remote Application Installation The Windows Deployment Scripting Language provides a mechanism by which programs can be executed in conjunction with file updating on Windows Engines. This can be used for such purposes as registering COM DLLs and .NET assemblies, running Microsoft Installer packages, and so on. It runs an installation command when the script is added, and when any dependent files are modified. It can also run an uninstallation command when the script is removed. Note that the Remote Application Installation feature does not work with Grid Libraries. A deployment script is a file named dsinstall.conf in a resource subdirectory. This is a reserved filename, and the Engine Daemon interprets any file with this name as a deployment script. The script is a properties file, with name and value pairs that govern the command execution. Typically, the script is placed, with associated files, in its own subdirectory of the win32 deployment directory. This will be referred to as the installation directory. The following properties are provided:
Property: Description
install_cmd: The installation command. The command should be either in the current directory or the resources/win32/lib directory; you can also specify the full path to a command. This command is run when the dsinstall.conf file is added or modified, and when any dependency is modified.
workdir: Working directory from which the commands are launched. The directory is relative to the installation directory.
uninstall_cmd: Optional. The uninstall command. This is executed when the script is deleted, or prior to subsequent runs of the install command if uninstall_first is true. Supporting files for the uninstall script may be deleted along with the script; the command is executed prior to local deletion of the files. Typically an uninstall is performed by simply removing the entire installation directory.
dependfiles: Comma-delimited list of file names that the script depends on. The files are relative to the installation directory. If any of these files change on a file update, the install command is re-run. A file name may contain wildcards only as replacements for the entire name or extension, such as *.dll, *.*, or file.*.
waittime: Number of seconds to wait for the install or uninstall command to finish. The default is 30 seconds. If this time is exceeded, the process running the command is killed.
uninstall_first: Optional. If true, the uninstall command is always run prior to the install command, except for the first time the install command is run. This is for situations in which you need to uninstall software prior to reinstallation.
success_exit_codes: Optional. Comma-delimited list of exit code values that indicate successful command execution. If the exit code does not match any value, an error is logged with the failure code, and the next time the Daemon restarts it retries the installation. If this property is not set, exit codes are ignored.
disable_on_fail: Whether an Engine Daemon should disable itself upon the failure of an install. The default is false if not specified in the conf file. When the value is true, the Engine Daemon disables itself if the exit code returned by the installation is not in success_exit_codes.
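The interaction between success_exit_codes and disable_on_fail can be summarized in a short Python sketch. It is an interpretation of the property descriptions above, not the Engine Daemon's code; the function names are hypothetical.

```python
def parse_exit_codes(value):
    """Parse a comma-delimited success_exit_codes value; None means unset."""
    if value is None or not value.strip():
        return None
    return {int(code) for code in value.split(",") if code.strip()}


def install_succeeded(exit_code, success_exit_codes):
    """Exit codes are ignored when success_exit_codes is not set."""
    codes = parse_exit_codes(success_exit_codes)
    return codes is None or exit_code in codes


def daemon_should_disable(exit_code, success_exit_codes, disable_on_fail=False):
    """The Daemon disables itself only on failure with disable_on_fail=true."""
    return disable_on_fail and not install_succeeded(exit_code, success_exit_codes)
```

One consequence of this reading is that disable_on_fail has no effect unless success_exit_codes is also set, because without it no exit code counts as a failure.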
The : and \ characters must be escaped with a backslash (\) character in the dsinstall.conf file. Also, you should not rename the dsinstall.conf file. The following is an example of a script that installs a Microsoft Installer package:
Example 7.2: A Microsoft Installer Package Installation Script
dsinstall.conf:
    dependfiles=install.bat,uninstall.bat,mypackage.msi
    workdir=.
    waittime=30
    uninstall_first=true
    install_cmd=install.bat
    uninstall_cmd=uninstall.bat
    success_exit_codes=0
install.bat:
    %SystemRoot%\system32\msiexec /q /i mypackage.msi ALLUSERS=1
uninstall.bat:
    %SystemRoot%\system32\msiexec /q /x mypackage.msi ALLUSERS=1
These three files, plus the mypackage.msi file, are all placed in a subdirectory under win32. Note that the uninstall_first property is used to uninstall the previous version of the software whenever the package is changed. To uninstall the software, simply remove the entire installation directory; the uninstallation is performed prior to deleting the files.
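When a dsinstall.conf file is generated by a tool rather than written by hand, the escaping rule for the : and \ characters is easy to overlook. The following Python sketch shows one way to apply it; the helper names and the example path are hypothetical, and only the escaping rule itself comes from this guide.

```python
def escape_value(value):
    """Backslash-escape the colon and backslash characters in a value."""
    # Escape backslashes first so the ones added for colons are not doubled.
    return value.replace("\\", "\\\\").replace(":", "\\:")


def render_conf(properties):
    """Render a dict of properties as dsinstall.conf name=value lines."""
    return "\n".join(
        f"{name}={escape_value(value)}" for name, value in properties.items()
    )
```

With this helper, a full command path such as C:\tools\setup.exe is written with the colon and each backslash preceded by a backslash.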
Service Run-As
There are often cases where Services require specific user permissions in order to access needed resources. By creating the Engine process as a given user, all Service invocations executed by the Engine can operate with these permissions. Service Run-As (or RA) allows for specification of authentication domain accounts under which Service invocations will execute. By default, all RA credentials are authenticated on the Engine Daemon in order to verify that the credentials are valid for the Engine’s authentication domain. Service RA authentication may be disabled on the Broker, but in most installations this is discouraged unless there is a specific reason for doing so. If Service RA authentication is disabled, then Driver user authentication should be enabled to prevent unauthorized users from submitting Services that may run under arbitrary accounts. Also note that while disabling this authentication step removes the need for passwords, such Services may only run on Unix Engines due to restrictions in the Windows API. Note that Service Run-As only supports the Service model; there is no support for RA using the legacy Job API.
Types of Credentials There are two ways in which Service Run-as credentials may be specified for a given Service:
Stored Credentials Service Run-as credentials are entered on the Director with the GridServer Administration Tool and are synchronized with all Brokers. These credentials are linked to Services in the Service Type Registry by specifying the username in the RunAsUser field. Credentials in the repository consist of a username and a password. The username may be in Windows DOMAIN/username format if domain-specific authentication is required. This domain is ignored by Unix Engines.
“Pass through” Credentials The Driver provides the username of the current Principal that is logged in and is running the Driver. The password is provided as a DriverManager property, CURRENT_USER_PASSWORD. These are referred to as “pass through” credentials. A password set on the Driver is required in order to prevent user account spoofing between authentication domains (for example, logging in as a local user on the Driver machine to pose as an LDAP user in the credentials DB). “Pass through” credentials are indicated for a Service in the Service Type Registry with the $ token. This token is substituted with the username of the current principal that is executing the Driver process. The token may also be prepended with a Windows domain if domain specific authentication is required. This domain is ignored by Unix Engines.
Using Run-As To use Run-As, you must do three things: set up Engines, add credentials, and associate credentials with Service Types.
Engine Setup
To set up Engines for Service RA:
Unix Engines
For Unix Engines, from the DSEngine directory, after running configure.sh, but before you start the Engine for the first time, do the following:
1. Change the mode of all files to be group read/writable:
    find . | xargs chmod g+u
2. Change ownership of the invokeRA program to root, and change it to be set UID:
    sudo chown root bin/invokeRA
    sudo chmod +s bin/invokeRA
3. Set the Engine user’s umask to make these permissions the default:
    umask 002
4. Start the Engine:
    ./engine.sh
Windows Engines
For Windows Engines:
1. Right-click the Engine’s install directory, select Properties, and under the Security tab use Add... to add all users that you intend to run Services as.
2. Select the Allow check box for Full Control.
3. From the Start menu, click Settings, then Control Panel, then Administrative Tools, then Services. Right-click the Service running the Engine and select Properties. You will need to ensure that the Engine Daemon user is allowed to interact with the desktop. If the Local System user is selected, select the Allow Service to Interact with the Desktop check box.
4. The domain user who launches the Engine service in Windows needs to have the following security privileges set. Click the Start menu, then click Settings, click Control Panel, click Administrative Tools, then click Local Security Policy. Click Local Policies, then click User Rights Assignment, and add the user who launches the Engine service to the following policies:
    SE_TCB_NAME (“Act as part of the operating system”)
    SE_CHANGE_NOTIFY_NAME (“Bypass traverse checking”)
    SE_ASSIGNPRIMARYTOKEN_NAME (“Replace a process level token”)
    SE_INCREASE_QUOTA_NAME (“Increase quotas” or “Adjust memory quotas for a process”)
If you are using .NET Services that use XML serialization, complete the following steps: 1. Right-click the Engine’s temp directory in its Windows system directory (C:\WINNT\temp for Windows 2000, C:\Windows\temp for Windows XP and Windows Server 2003), select Properties, and under the Security tab, use Add... to add all users that you intend to run Services as. 2. Select the Allow check box, for Read, Write, and Delete permissions. Note that the Delete permission is set using the Advanced button on the Security page of the Windows Explorer folder properties dialog box.
Managing Credentials
The Credentials DB is a store of RA credentials on the Director and Brokers to be used for RA services. It is maintained on the Director and synchronized with Brokers. The Credential Repository page in the GridServer Administration Tool enables you to create, edit, and delete RA credentials. To add new Credentials to your Manager:
1. Log in to the GridServer Administration Tool.
2. Click the Admin tab, then click Credentials Repository.
3. Enter the name of a credential, a password, and then enter the same password again.
4. Click Add.
Manage Service Types
The Service Type Registry entries allow specification of an RA username for use with that Service. To specify a Run-As user for a Service Type:
1. Log in to the GridServer Administration Tool.
2. Click the Services tab, then click Service Type Registry.
3. For an existing Service Type, go to the Actions control for that Service Type and select Edit Service Type. This opens the Service Type Editor window.
4. In the Service Type Editor window, under the ContainerBinding header, enter the user name in RunAsUser. Note that in this field, you can use $ to indicate the Driver’s current user. Leaving this value blank (the default) indicates that the process will run as the same user running the Engine Daemon.
It is also possible to specify a Windows domain in the RunAsUser field. For example, if you are using a Unix Driver (which would not be in a Windows domain) and you want to run Services on Windows Engines using a specific user and domain, you can specify this in the form domain/username. The forward slash will be translated to a backslash. For example, specifying DATASYNAPSE/BILL will run Services as the user BILL in the DATASYNAPSE Windows domain (DATASYNAPSE\BILL).
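The rules for the RunAsUser field (blank value, $ token, and domain/username form) can be collected into one Python sketch. It is an illustration of the rules as described in this chapter, not GridServer code: the function name and parameters are hypothetical, and writing a domain-qualified token as DOMAIN/$ is an assumption about how the token is prepended with a domain.

```python
def resolve_run_as_user(run_as_user, driver_user, unix_engine=False):
    """Resolve a RunAsUser field to the account a Service would run under.

    Returns None when the field is blank, meaning the Service runs as the
    user running the Engine Daemon.
    """
    if not run_as_user:
        return None
    domain, slash, user = run_as_user.rpartition("/")
    if user == "$":
        # "Pass through": substitute the Driver's current user.
        user = driver_user
    if unix_engine or not slash:
        # Unix Engines ignore the Windows domain.
        return user
    # The forward slash is translated to a backslash for Windows.
    return domain + "\\" + user
```

For instance, DATASYNAPSE/BILL resolves to DATASYNAPSE\BILL on a Windows Engine and to BILL on a Unix Engine.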
Chapter 8 The Batch Scheduling Facility
Introduction
Commands and Services can be scheduled to run on a regular basis using the Batch Scheduling Facility. A Batch Definition contains instructions in the form of components that define scheduling and what the Batch will execute. When the Batch Definition is scheduled on the Manager, it creates a Batch Entry, which typically waits until its scheduled time, then executes, creating a Batch Execution. Services are executed using an embedded Driver on the Manager. Using the Batch Editor page in the GridServer Administration Tool, you can write a Batch Definition with specific scheduling instructions. You can specify a Batch Definition to immediately execute when scheduled, or it can wait until a given time and date. A Batch Definition can be submitted to run at a specific absolute time, or a relative time, such as every hour. They can also be written to wait for an event, such as a new, modified, or deleted file. Batch Definitions contain one or more components contained within a batch component. A Command component contains a program that will be run by the Batch Definition. A schedule or event component will specify when subsequent Command components will run.
FIGURE 8-1: A Batch Definition consists of Batch Components. When a Batch Definition is scheduled, it creates a Batch Entry, and will run as defined by the Batch Components. When it runs, it creates a Batch Execution, which then executes the components according to the definition.
Terminology
The following terms are used to describe components related to the Batch Scheduling Facility:
Name (Page): Description
Batch Definition (Batch Registry): How a Batch is written. The Batch Definition is edited with the Batch Editor page and contains a Batch Component, which in turn contains other components that define the Batch. Once created, it can be managed from the Batch Registry page.
Batch Component (Batch Editor): When a Batch Definition is created, it consists of a Batch component, which can contain other components, such as ServiceCommand components, Conditional components, and other Batch Components. The Batch Editor page enables you to add, remove, and edit Batch components and other components it contains.
Batch Entry (Batch Schedule): When a Batch Definition has been instantiated by being scheduled on the Batch Schedule page, a Batch Entry is created. The Batch Entry will either run immediately, or wait to run, depending on what scheduling components were added to the Batch Definition.
Batch Execution (Batch Admin): When a Batch Entry runs, it creates a Batch Execution, which does whatever was defined in the Batch Definition. For example, if a Batch Definition uses the ServiceCommand to start ten Service Sessions, the Batch Execution will do that. The Batch Execution is managed on the Batch Admin page. Any actual Service Sessions created can be managed on the Service Session page on the Services tab.
Service Runner (Service Runner): Service Runners enable you to define a registered Service Type with Registry options and init data that can be used in a Batch Definition.
Editing Batch Definitions
To create a new Batch Definition, click the Batch tab in the Administration Tool, then click Batch Registry. The Batch Registry page contains a list of Batch Definitions on the Manager, plus a blank box for entering the name of a new Batch Definition. In the Action column, there is an Action list for each Batch Definition. From each Action list, you can select Edit Batch Definition to edit a Batch Definition, Rename Batch Definition to rename a Batch Definition, Copy Batch Definition to copy a Batch Definition, Delete Batch Definition to remove a Batch Definition, Export Batch Definition to save an XML file of the Batch Definition, or Schedule Batch Definition to place a Batch Definition in the Manager’s Batch queue. You can also select Batch View to display a graphical representation of the Batch Definition in a new window.
To edit a Batch Definition, either select Edit Batch Definition from an existing Batch Definition’s Action list, or type the name of a new Batch Definition in the empty box at the end of the list and click Add. This opens a window, shown in Figure 8-2, containing parameters for your new Batch Definition. You can then change the values of parameters, and click Save to save the values as a Batch Definition on the Manager, or click Cancel to exit the Batch Editor and discard any changes you have made.
FIGURE 8-2: The Batch Definition Editor.
The Batch Definition parameters are as follows:
Parameter: Description
Batch Component Name: The name of the Batch Definition. If this is a new Batch Definition, this is the name you initially typed in the blank box prior to selecting Add, and is not editable. (You can rename a Batch Definition by selecting the Rename action from the Batch Registry page.) If an additional Batch component is added to a Batch Definition, you can set its name.
Type: Determines how a Batch Definition is run, either in serial or parallel. If set to parallel, all Batch components are executed when the Batch Definition is scheduled. If set to serial, Batch components are executed in the order in which they were added. If any of the components fail, it prevents the Batch from continuing, and the Batch will fail. The default is serial.
Schedule Component Type: Sets the type of the Schedule. If Immediate, the Batch Definition runs when scheduled. If Absolute, the Batch Definition runs once according to the date set in startTime. If Relative, the Batch Definition runs after the number of minutes specified in minuteDelay, and repeats or executes immediately according to repeat and runNow. If Cron, the Batch Definition runs according to the values set in the cron. If Manager Startup, the Batch Definition runs when the Manager is first initialized.
Add component: Adds a component to the Batch Definition. A Batch Definition can contain one or more components, which are described below.
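The difference between the serial and parallel settings of the Type parameter can be illustrated with a small Python sketch. It is a conceptual model only: components are represented as hypothetical zero-argument callables that return True on success, and treating a parallel Batch as failed when any component fails is an assumption, since the failure rule is stated only for serial execution.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(components, batch_type="serial"):
    """Run components and return True only if every one succeeds."""
    if batch_type == "parallel":
        # Parallel: every component is started; none waits for another.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda component: component(), components))
        return all(results)
    # Serial: run in the order added and stop at the first failure.
    for component in components:
        if not component():
            return False
    return True
```

In the serial case, components placed after a failing component never run, which is why ordering with Move Up and Move Down matters only for serial Batches.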
Batch Components
The parameters in the Batch Editor window correspond to components contained in the Batch Definition. Each Batch Definition can contain one or more Batch components. These components can be commands, events, or other Batch Definitions. For example, a LogCommand component is shown in Figure 8-3. To add a component to a Batch Definition, select a component from the add component list. Batch components are processed in a Batch Definition in order when Batch Type, described above, is set to serial. You can change the order of Batch components by clicking the Move Up and Move Down buttons in the upper-right corner of each Batch component, to move that component’s order up or down in the Batch Definition. You can also remove a Batch component by clicking the Remove button in the upper-right corner.
FIGURE 8-3: A Batch component.
Each of the types of Batch components that can be added to a Batch Definition is described below. In the Batch Editor window, a help description is provided for each Batch component shown. By default, Extended Help is displayed. Using the help control in the upper right corner, you can select Help to display only the first sentence of help, or No Help to suppress the help display.
Name: Description
Batch: Contains another Batch Definition. This can be used to create a complex or multileveled Batch Definition. For example, a parent Batch Definition could start each day, starting two child Batch Definitions, each with different schedules or conditions. For each new Batch component, you must set the same parameters for a Batch Definition as described above. You can then add additional components to the Batch.
Conditional: Provides conditional processing when running Batches. The component specified by test is run. If it runs successfully, the component specified by success is executed. If it fails, the component specified by failed is executed. The component specified in test returns success in the following conditions:
• Command returns Command.SUCCESS
• ServiceCommand creates the Service and submits the invocation without exception
• ServiceRunnerCommand creates the Service and submits all invocations without exception
BatchReference: Contains a reference to a registered Batch Definition that gets loaded when scheduled from the Batch Registry.
Command: Runs an implemented method in a deployed class.
ServiceCommand: Starts a Service. You can specify a Service type registered on the Manager and method name to run. You can also specify a Service reference ID (this enables you to reference the Service from another Service Command), Service action, and input and init data for the Service. Data is comma-delimited. You can add ServiceDescription, ServiceOptions, and Discriminator components to a Service by using a Service Runner.
ServiceRunnerReference: Loads the specified registered Service Runner. See below for information on registering a Service Runner.
AdminCommand: Executes a command via the GridServer Admin API. For more information on using the Admin API, see Chapter 10, “GridServer Admin API” on page 89 of the GridServer Developer’s Guide.
EmailCommand: Sends an email message from a Batch Definition, for notification or alerts. You can enter a comma-delimited list of email addresses for recipients, and a message string, which will be used as a subject and a body. Note that in order for email to be sent, you must define an SMTP server in your Manager Configuration. To do this, click the Manager tab, click Manager Configuration, click Admin, and enter a value in SMTP Host under the Mail heading.
EmailFileCommand: Sends an email message from a Batch Definition that includes files as attachments, typically used to send the output of a previous command by saving that output to a file. You can enter a subject, a message body string, a comma-delimited list of email addresses, and a semicolon-delimited list of files, which will then be sent as attachments in the message. The setup rules given above in the description of the EmailCommand component also apply to the EmailFileCommand component.
ExecCommand: Executes a command from a Batch. This will execute a command from the application server’s root directory. You can set an input, output, and error file, plus a log file for the command to be run.
LogCommand: Writes a string to the Manager log. This is useful for testing Batches or indicating when a Batch is starting or stopping.
WaitCommand: Halts for a moment before proceeding. The amount of wait time is specified in seconds. Note that this component is only useful for generating a wait time when the Batch type is serial.
EngineWeightCommand: Sets the Engine distribution weighting relative to other Brokers. The Brokers must be logged in to the Director during execution, and in order to show up in the Batch Editor. The current Broker list is fetched only when adding a new EngineWeightCommand component in the Batch Editor.
Event: Makes a Batch wait for an implemented event to take place. You can use this to pause until a specific condition in a class you deployed has occurred.
FileEvent: Makes a Batch wait for a file event to occur before completing the remaining items in the Batch Definition. Specifically, it enables you to watch a file and wait until it is created, deleted, or modified before proceeding.
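The branching behavior of the Conditional component listed above can be modeled in a few lines of Python. This is a conceptual sketch, not the Batch Scheduling Facility's implementation: the three arguments are hypothetical callables standing in for the test, success, and failed components, and an exception raised by test is treated as failure, in line with the "without exception" conditions in the table.

```python
def run_conditional(test, success, failed):
    """Run test, then run either the success or the failed component."""
    try:
        passed = bool(test())
    except Exception:
        # A component that raises is treated as having failed.
        passed = False
    branch = success if passed else failed
    return branch()
```

Only one of the two branches runs for any given execution, so a Conditional can be used, for example, to send a notification only when the tested component fails.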
Service Runners
Service Runners enable you to define a registered Service Type with options and init data that can be used in a Batch Definition. They can also be used to chain together Service Types and discriminators into a single unit that can be used in a Batch Definition.
To create a Service Runner, click the Service Runner Registry page. Type the name of a Service Runner in the box and click Add. This will open a Service Runner Editor page, where you can choose a Service Type and enter init data, a description, and method names and input data for invocations. You can also use the list at the bottom of the page to add discriminators, Service input description data, and Service options. The Service Runner Registry also lists all Service Runners existing on a Manager. Using the Actions controls, you can edit, rename, copy, delete, export, or launch each Service Runner.
Scheduling Batch Definitions After you have created a Batch Definition with the Batch Editor page, it will be listed with the other Batch Definitions on the Batch Registry page. However, these Batch Definitions are not actually running on the Manager yet. To create a Batch from a Batch Definition, you must first schedule it. This actually instantiates a Batch and inserts it into the Manager’s batch queue. To schedule a Batch Definition, click the Batch Registry page, and find the Batch Definition in the list. Select Schedule Batch Definition from the Actions control. This will schedule the Batch Definition, and open the Batch Schedule page, displaying it as a Batch Entry.
The Batch Schedule Page Batch Entries on a Manager can be listed and administered on the Batch Schedule page. To do this, click the Batch tab, then click the Batch Schedule page. All Batch Entries resident on the Manager are listed. To remove or edit an existing Batch Entry or view logs or Batch executions, select a command from the Actions control next to the relevant Batch.
FIGURE 8-4: The Batch Schedule page.
Running Batches
Batch Entries will automatically run when they reach the scheduled time or conditions defined in their Batch Definition. When this happens, Batch Executions are created and displayed on the Batch Admin page. PDriver Batches (which are also Batch Executions) are also displayed on this page. On the Batch Admin page, you can monitor Batch Executions, search for logs, and display the Batch Monitor applet to view what parts of a Batch have completed. Any Services that are run by the Batch Execution are displayed on the Service Session Admin page. From there, you can cancel Service Sessions, view Tasks, or do any other actions you normally would with a Service. Note that it is possible to have a Batch Execution run a Service that continues to run, even after the Batch Execution reports that it is finished.
Deploying Batch Resources Java Services, Commands, and other resources must be placed in [GS Manager Root]/webapps/livecluster/WEB-INF/batch/jar to be properly loaded by the embedded Driver. For more information on resource deployment, see Chapter 7, “Application Resource Deployment” on page 43.
Batch Fault-Tolerance
Batch Schedules that exist on a Manager are persistent, provided the Next Run field is not never. This provides failover capability in the event of a Manager failure, as the Batch Schedules will still exist when the Manager is restarted. The following Batch Schedules are persistent:
• Absolute schedules
• Relative schedules with repeat
• Cron schedules
All persistent Batches are restarted when the Manager is restarted, as if they were being scheduled for the first time. Batch runs that would have occurred while the Manager was down are ignored.
Using PDriver in a Batch
You can use PDriver within a Batch, with the following configuration changes:
1. Download the GridServer SDK on your Broker machine.
2. Write a batch or shell script to run your PDriver job on the Broker.
3. Create a Batch Definition that uses the ExecCommand component to run that script.
Chapter 9 Configuring Security
Introduction
GridServer provides a rich set of security options for integrating into your organization’s computing environment. GridServer does not impose its own security policy; instead you select from the features available to implement your preferred policy. The key security areas of authentication, access control and authorization, event logging, data validation, and cryptography are discussed.
Authentication Authentication is the process of determining if an entity is what it claims to be. In keeping with the GridServer philosophy of providing a flexible set of tools that can be used to implement an organization’s security policy, GridServer provides both a built-in authentication service and an extensible set of hooks for integrating to external authentication systems.
Operating System Users By default, GridServer does not authenticate using operating system accounts. Operating system accounts are used to start GridServer software components, like the Manager, Engine, and Driver. It is not required to use a superuser operating system account to start any GridServer component. Certain features do require superuser level access. For instance, to use GridServer’s UIIdle scheduling mode on Windows, at least the DSHook UI event timing service must run as superuser. It is possible to use operating system user authentication for GridServer authentication. See “Extensible Authentication Hooks” on page 70 for more information. Authentication of operating system users is handled by the operating system in question.
Grid Users Users of Grid Services may be either compute Service users or administrative users. In either case they are authenticated through the same mechanism. GridServer is responsible for authenticating Grid users according to the policy defined by the administrator. Extensible authentication hooks can be used to interface to an external authentication system such as Active Directory, LDAP, or NIS. Once a Grid user has been authenticated, they are given an authentication token to use in further correspondence. In the case of Administration Tool or Web Services users, the authentication token is a standard HTTP session cookie. In the case where compute users connect via the DataSynapse APIs, the authentication token is a DataSynapse object.
User accounts are added or modified with the User Admin page, located on the Admin tab in the Administration Tool. Each user account is given an access level, which dictates what features of the Administration Tool they can use. For further details on access levels and their corresponding permissions, see Chapter 6, “The GridServer Administration Tool” on page 36.
GridServer Built-In Authentication
GridServer’s built-in authentication mechanism uses the embedded Director database (the internal database) to authenticate Grid users. Administration Tool users must be authenticated with a username and password before they can access the Administration Tool. Likewise, Web Services users must be authenticated with a username and password. The DataSynapse Client APIs (JDriver, CPPDriver, PDriver) do not require authentication by default, but authentication can be enabled. GridServer built-in authentication includes options for minimum username length, minimum password length, password complexity, password aging, and application behavior on password failure. Password authentication can be configured on the Manager Configuration page, in the Security section.
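To make the policy options listed above concrete, the following sketch checks a username and password against a policy object. Everything in it is an assumption for illustration: the class `PasswordPolicy`, the function `policy_violations`, the default thresholds, and the particular meaning given to "complexity" (letters plus digits) are hypothetical and do not reflect GridServer's actual settings or defaults.

```python
import string
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PasswordPolicy:
    # All defaults are illustrative assumptions, not GridServer defaults.
    min_username_length: int = 4
    min_password_length: int = 8
    require_mixed_classes: bool = True   # stand-in for "password complexity"
    max_password_age_days: int = 90      # stand-in for "password aging"


def policy_violations(policy, username, password, last_changed, today):
    """Return a list of violations; an empty list means compliant."""
    problems = []
    if len(username) < policy.min_username_length:
        problems.append("username too short")
    if len(password) < policy.min_password_length:
        problems.append("password too short")
    if policy.require_mixed_classes:
        has_alpha = any(c in string.ascii_letters for c in password)
        has_digit = any(c in string.digits for c in password)
        if not (has_alpha and has_digit):
            problems.append("password needs letters and digits")
    if today - last_changed > timedelta(days=policy.max_password_age_days):
        problems.append("password expired")
    return problems
```

Returning every violation at once, rather than stopping at the first, lets an administrator see the full effect of a policy change on a given account.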
Extensible Authentication Hooks Many environments already have a suitable authentication service that can be used by GridServer. For instance, the organization may be running an LDAP-based service like Active Directory. In this case the organization’s policy may be to centralize all authentication information in Active Directory. GridServer’s extensible authentication hooks can be used to integrate with existing authentication services. Since there is no universally-accepted standard for Grid authentication nor for application authentication, DataSynapse has chosen to create its own interfaces, DriverAuthenticationHook and UserDatabaseHook, that can be used to integrate existing authentication models. We provide example implementations for these hooks to integrate with LDAP. Since LDAP bindings for Grid authentication can be expected to vary from organization to organization, it may be necessary to modify the example implementations to work with your bindings. An additional authentication hook example is provided for NTLM.
Enabling Client Authentication
By default, any client is allowed to log in to a Manager. However, the Manager can be configured to allow only Drivers with a valid Grid User identity that is associated with a Driver Profile to log in. Driver Authentication is a Director setting, and should be set on all Directors. To enable Driver authentication:
1. Click the Manager tab on the Director.
2. Click Manager Configuration.
3. Click Engines and Clients.
4. In Client Authentication Enabled, enter True.
5. Click Save.
After authentication is enabled, you will then need to allow clients to log in. To do this, a Driver Profile must be assigned to a Grid User. For example:
1. Click the Driver tab.
2. Click Driver Profiles.
3. Create a new Driver Profile and save it.
4. Click the Admin tab.
5. Click the User Admin page.
6. Create a new user, and assign the profile to that user.
For Drivers, the username and password are assigned using the driver.properties file or the API. For SOAP clients, they are set using HTTP basic authentication. Most SOAP packages provide a method for setting the username/password on the proxy.
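For SOAP clients whose toolkit does not set credentials for you, HTTP basic authentication amounts to one request header. The sketch below builds that header with the Python standard library; the function names and the URL passed in are placeholders chosen for illustration, and the request is only constructed, never sent.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_soap_request(url, soap_body, username, password):
    """Return a urllib Request carrying Basic credentials; nothing is sent."""
    request = urllib.request.Request(url, data=soap_body.encode("utf-8"))
    request.add_header("Content-Type", "text/xml; charset=utf-8")
    request.add_header("Authorization", basic_auth_header(username, password))
    return request
```

Because basic authentication only base64-encodes the credentials, it is normally paired with an HTTPS URL so that the header is not readable in transit.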
SSL
SSL (Secure Sockets Layer) communication can be enabled at each level in the GridServer architecture, depending on the security requirements of the organization and the deployment scenarios involved. SSL provides both encryption of messaging between components and a way for the client to establish trust in the server. In addition, SSL can be used for resource downloading by Engines, and for use of the Administration Tool. In general, HTTP communication can be completely disabled, and all GridServer components can be run using only HTTPS.
Communication Overview
To understand how SSL is used for messaging, it is important to understand how components establish communication channels with each other. For the remainder of this discussion, the terms “client” and “server” will be used in the traditional way, that is, a client/server relationship. For example, the Engine Daemon is a “client” to the Director’s “server”. There are two aspects to establishing communication. The first is the login process: the client requests a login via a known communication channel. At that point, the server may perform authentication or validation, and if successful, it returns a connection for use from then on. The second is that connection. Note that the connection may be on a different server. For example, an Engine logs in via a Director, but the connection exists on a Broker. SSL is configurable for both aspects. If SSL is to be used for login, it must be configured on the client. If SSL is to be used for the connection, it must be enabled on the server. For example, to enable a Driver to log in via SSL, the Driver must be set to the HTTPS URL address on the Director, either via the driver.properties file or the API. To enable HTTPS communication between the Driver and Broker after login, it must be set on the Broker, typically by configuring all Messaging and Download URLs to the HTTPS URL.
Certificate Overview All SSL clients establish a trust relationship with their server. This is performed via a certificate on the client side, which essentially is a public key that is associated with a private key on the server. When establishing the trust relationship, the server’s certificate must either have been signed by a key trusted by the client, or be trusted implicitly by the client (a self-signed certificate). Most SSL clients contain a set of trusted Certificate Authorities (CAs), so that if a server has a certificate signed by one of those CAs, it will automatically trust the server. If the server is self-signed, that server’s certificate must be added to the client’s list of trusted servers.
In addition, the client may check the Common Name (CN) of the server’s certificate against the hostname of the server, to verify that the certificate is being used on the intended host. GridServer is packaged with a default self-signed key-pair and certificate. All clients have a local copy of the certificate added to their list of trusted servers. In addition, hostname verification is disabled by default, as the CN will not match the server’s hostname. This configuration allows immediate use of SSL without any additional setup. This may or may not be sufficient, depending on your needs.
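The two client-side decisions described above (which certificates to trust, and whether to check the certificate name against the hostname) exist in most SSL client libraries. As an analogy only, the sketch below shows them in Python's standard `ssl` module; it does not use or describe GridServer's own client code, and the helper name `make_client_context` is hypothetical.

```python
import ssl


def make_client_context(trust_file=None, verify_hostname=True):
    """Build a client-side TLS context.

    trust_file: optional PEM file of trusted certificates (for example a
    self-signed server certificate). None falls back to the system CA set.
    verify_hostname: whether the certificate name must match the host.
    """
    context = ssl.create_default_context(cafile=trust_file)
    # The certificate chain is still validated against the trust store;
    # only the hostname comparison is switched on or off here.
    context.check_hostname = verify_hostname
    return context
```

Calling `make_client_context(trust_file="ssl.pem", verify_hostname=False)` would correspond to the default arrangement described above: a locally trusted self-signed certificate with hostname verification turned off.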
Keypair and Cert Location
All Managers must contain a keypair, either self-signed or CA-signed. The default keypair is stored in a keystore, located at [GS Manager Root]/webapps/livecluster/WEB-INF/certs/server.keystore. The keystore password is configurable via the Manager Configuration page, in the Security section, under the SSL Certificates heading. If you have replaced the cert on the Manager with one signed by your own CA, you need to replace the cert in each downloaded SDK. If you have your own CA and ROOT_CA.pem contains its cert:
• The ROOT_CA.pem file should be imported into config/ssl.keystore as a trusted cert. This is for JDriver and .NET.
• ROOT_CA.pem should be renamed ssl.pem (replacing the existing one) in the config directory. This is for C++-based code (including PDriver).
The default SSL trust files are ssl.keystore, ssl.crt, and ssl.pem for JDriver, .NETDriver, and CPP/PDriver, respectively. New certificates can be used by either importing them into the appropriate one of these files, or by changing the DSSSLTrustFile property in driver.properties or the DriverManager.SSL_TRUST_FILE option through the API to the file containing the certs.
Types of Connections Using SSL It is possible to enable SSL on several different types of connections within GridServer. SSL can be used for Driver connections, Engine and Engine Daemon connections, Broker and Director communication, and Engine resources. There are two methods for enabling SSL within GridServer. The first is to enable Manager HTTPS and then enable SSL on some components. The other method is to enable HTTPS to all components. Both methods are detailed below.
Enabling HTTPS on the Application Server
To enable HTTPS, you must first enable HTTPS on the Manager’s application server. You can then configure HTTPS on any of the connections to components. To enable HTTPS on the application server:
1. Log in to the GridServer Administration Tool.
2. Click the Admin tab, then click Manager Reconfigure.
3. Click the Resin Configuration option.
4. Proceed to step 4 of the Resin Configuration, the Resin SSL page. Click Enable SSL and enter an SSL port, or use the default of 8443.
5. Complete the Manager Reconfigure steps and restart your application server.
6. After restart, open the URL to your GridServer Administration Tool. You will be presented with the Manager Installation page. Complete the installation (enabling HTTPS on components if needed, described in the next section) and restart your application server.
Enabling HTTPS on all Components
Because it is possible to enable SSL on several different types of connections within GridServer, the option is available to enable everything with SSL on installation, for those who want to run a pure SSL environment. To do this:
1. Complete the above procedure for Enabling Manager HTTPS up to step 6, and start the Manager Installation.
2. On step 3 of the Manager Installation, you are given the option to select Protocol and Port for both Web Administration and Messaging and Resource Download. The Web Administration settings are used for connections for the GridServer Administration Tool. When this is set to HTTPS (typically with port set to 8443), any attempted HTTP connection will be rerouted to an HTTPS connection on this port. The Messaging and Resource Download settings are used for all Engine and Client messaging and Resource Downloads. Setting this protocol to HTTPS will cause all connections to use HTTPS. To configure HTTPS for only a subset of these, such as HTTPS only for Resources, you should set this protocol to HTTP, and then set HTTPS for individual components in the Manager Configuration after installation. Each component’s specific settings are described below.
3. Complete the remaining steps in the configuration, then click Start Installation to complete the installation/reconfiguration. You will need to restart your application server.
Note that if you have already installed Drivers from this GridServer installation, their driver.properties files will have to be edited to point to the new HTTPS URL before they will use SSL. Engines will reconfigure themselves to use the new secure reinstallation; the Director URLs in all Engine Configurations are changed to https://host:sslport.
Driver SSL
All Driver certificates can be found in the SDK, stored in the config directory. Drivers will look for this certificate in this directory by default. The Driver can use a different location if desired; see the API for more information. If your server is using a CA-signed certificate, there is no need for the default certificate. The JDriver keystore includes all certificates packaged with the Java 1.4.2 cacerts file, plus the GridServer default certificate. HTTPS must be enabled on the Director for login, and on the Brokers for the connection. To enable SSL for Driver login, you must set the Director URLs to the HTTPS location, either via the driver.properties file (with the DSPrimaryDirector property) or by setting the URL programmatically through the DriverManager API.
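When many Drivers are already installed, editing each driver.properties file by hand is error-prone. The sketch below rewrites the DSPrimaryDirector line so that it uses the https scheme and the SSL port. It assumes the property value is a full URL on a single `name=value` line; the function name `use_https_director` and the example URL used to try it are hypothetical, and any path in an existing URL is carried over unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def use_https_director(properties_text, ssl_port=8443, key="DSPrimaryDirector"):
    """Return the properties text with the Director URL switched to HTTPS."""
    rewritten = []
    for line in properties_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            parts = urlsplit(value.strip())
            if parts.hostname is not None:
                # Keep host, path, and query; change only scheme and port.
                netloc = f"{parts.hostname}:{ssl_port}"
                url = urlunsplit(
                    ("https", netloc, parts.path, parts.query, parts.fragment)
                )
                line = f"{key}={url}"
        rewritten.append(line)
    return "\n".join(rewritten)
```

Comment lines and unrelated properties pass through untouched, so the result can be written back over the original file after review.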
To enable SSL for Driver communication, you must enable it on all Brokers on which you wish to use it. This setting will affect any Driver that is logged in to that Broker. If your Broker is configured to use HTTPS for all Messaging, Drivers will already use HTTPS. If you did not enable HTTPS for all messaging and want to enable SSL for Driver communication:
1. Click the Manager tab.
2. Click Manager Configuration.
3. Click Security.
4. Under HTTPS Communication, set Use HTTPS for Client Communication to True.
5. Click Save.
If you wish to use hostname verification, it can be enabled via the driver.properties file or API. Keep in mind that you have to create and install your own keypair corresponding to the CN of the host.
Engines and Engine Daemon SSL
The Engine Daemon and Engine use the ssl.pem and ssl.keystore files, respectively, found in the Engine’s root directory. HTTPS must be enabled on the Director for login and connection for Daemons, and on the Brokers for the connection for Engines. To enable SSL for Engine and Engine Daemon login, you must set the Directors to the HTTPS location in the Engine Configuration. To enable SSL for Engine communication, you must enable it on all Brokers on which you wish to use it. SSL is enabled for Engine Daemons on Directors. If your Broker is configured to use HTTPS for all Messaging, Engines will already use HTTPS. If you did not enable HTTPS for all messaging and want to enable SSL for Engines on a Broker:
1. Click the Manager tab.
2. Click Manager Configuration.
3. Click Security.
4. Under HTTPS Communication, set Use HTTPS for Engine Communication to True.
5. Click Save.
To enable SSL for Engine Daemons on a Director:
1. Click the Manager tab.
2. Click Manager Configuration.
3. Click Security.
4. Under HTTPS Communication, set Use HTTPS for Engine Daemon Communication to True.
5. Click Save.
If you wish to use hostname verification, it can be enabled via the Engine Configuration. Keep in mind that you have to create and install your own keypair corresponding to the CN of the host.
Brokers and Director SSL
The communication between Brokers and Directors, and between the Secondary Director and Primary Director, can also be configured to use SSL. Note that because they use pure sockets for communication, HTTPS does not need to be enabled on the Manager. The default cert is stored in livecluster/WEB-INF/certs/ssl.keystore. Its location is configurable via the Manager Configuration page, in the SSL section. To enable SSL for Broker and Secondary Director login:
1. Click the Manager tab on the Director.
2. Click Manager Configuration.
3. Click Security.
4. Under Server-side Socket SSL, set Require SSL for Login to True.
5. Click Save.
6. Click the Manager tab on the Brokers and/or Secondary Director.
7. Click Manager Configuration.
8. Click Security.
9. Set Use SSL for Login to True for all applicable categories (such as Broker-Primary Director).
10. Click Save.
WARNING: If a Director requires SSL, all Brokers and the Secondary Director must also use SSL for login.
To enable SSL for the connections:
1. Click the Manager tab on the Director.
2. Click Manager Configuration.
3. Click Security.
4. Set Use SSL for Communication to True for the Broker-Primary Director and/or Broker-Secondary Director connections.
5. Click Save.
If you wish to use hostname verification, it can be enabled via the Verify Hostname setting on the Security page. Keep in mind that you have to create and install your own keypair corresponding to the CN of the host.
Resources over HTTPS
The resources used by Engines may be downloaded via HTTPS. In addition to Engines downloading resources from Brokers, Brokers also download synchronized resources from the Director. Thus there are two settings. If your Broker is configured to use HTTPS for all Messaging, Resources will already use HTTPS. Otherwise, the following procedure will enable it. To enable SSL for the connections:
1. Click the Manager tab.
2. Click Manager Configuration.
3. Click Security.
4. Under Broker Resources, set HTTPS Enabled to True for the appropriate settings. On a Manager that contains only a Broker or Director, there will only be a single setting.
Disabling HTTP
For security reasons, you may want to disable HTTP on the Director and only use HTTPS.
NOTE: 1-Click install will not work if you are accessing the Manager using SSL (through an HTTPS URL).
To disable HTTP connections:
1. Reconfigure the Manager, setting the URL to use the HTTPS URL.
2. Update all Drivers (in the driver.properties files) to use the HTTPS URL.
3. Shut down the Manager and edit the datasynapse/conf/resin.conf file (or whatever RESIN_CONF refers to) and comment out the entry for port 8000. (If you have already successfully gone through the Resin Configuration pages in the Administration Tool, there will be another, uncommented entry that contains an SSL-enabled tag.)
4. When you restart the Manager, everything should use SSL, with no HTTP port open.
Resource Protection
Resources that Engines download over HTTPS are protected from unauthorized download. This is done in the following manner:
• The deployment directory is protected such that files cannot be directly downloaded from it.
• When an Engine receives a message to download resources, it is provided a random nonce (a single-use token) that will expire. (This expiration time is configurable via the Manager Configuration page, in the Security section, in the Resource Deployment heading, in the Broker Resources section, with the Token Timeout setting.) When the Engine attempts to download data from the URL, it is redirected to the protected deployment directory. The nonce is then validated by the Manager, and the Engine is allowed to download the data.
Note that if you are using an alternate base directory, resources are NOT protected.
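The nonce scheme described above has two properties worth seeing in code: a token can be redeemed only once, and only before it expires. The sketch below is a generic illustration of that idea, not the Manager's implementation; the class name `NonceStore`, its methods, and the injectable clock are assumptions made so the behavior can be exercised without waiting.

```python
import secrets
import time


class NonceStore:
    """Issue random single-use tokens that expire after `timeout_seconds`."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._expiry_by_nonce = {}

    def issue(self):
        """Create an unguessable token and record when it stops being valid."""
        nonce = secrets.token_urlsafe(32)
        self._expiry_by_nonce[nonce] = self._clock() + self._timeout
        return nonce

    def redeem(self, nonce):
        """Return True once for a fresh nonce; False if unknown, used, or expired."""
        # pop() removes the entry, so a second attempt with the same
        # nonce finds nothing and is rejected.
        expiry = self._expiry_by_nonce.pop(nonce, None)
        return expiry is not None and self._clock() <= expiry
```

In this picture, the Token Timeout setting plays the role of `timeout_seconds`: a longer value tolerates slow Engines, while a shorter one narrows the window in which a leaked token could be used.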
Chapter 10 GridServer Performance and Tuning
Diagnosing Performance Problems To find bottlenecks in application performance, use GridServer’s Instrumentation feature. With instrumentation enabled, you can get detailed timings of each request submitted to the Broker. These timings highlight scheduling overhead, data marshalling time and network delays. Note that Instrumentation measures only GridServer-related times. It does not show other application delays due to, for example, excessive database load. For information on turning on Instrumentation, see Chapter 12, “Administration Howto” on page 89. For more information on instrumentation, see Appendix A, “Task Instrumentation” on page 105 of the GridServer Developer’s Guide.
Tuning Data Movement Efficient handling of data can often make or break achieving performance gains in a Grid-enabled application. Instrumentation will reveal problems with having too much data per request: serialization, deserialization and network transport times will be high compared to the actual Engine-side compute time. There are a number of remedies for inefficient data movement. We survey them here in order from simplest to most complex.
Stateful Processing
GridServer supports two related mechanisms that link client-side service instances to Engine-side state, thereby reducing the need to transmit the same data many times. The two mechanisms are initialization/update data, and Service affinity.
Data that is constant across an entire set of task requests should be made Service initialization data. Initialization data is transmitted once per Engine, rather than once per request. Long-lived volume-based applications will typically process thousands of requests, and compute-intensive applications should be designed to create many small requests, rather than few large ones, for a variety of reasons (see Chapter 8, “GridServer Design Guidelines” on page 79 in the GridServer Developer’s Guide for more information). If a piece of data is not constant throughout the life of the application, but changes rarely (relative to the frequency of requests), it can be passed as initialization data and then changed by using an update method. See Chapter 3, “Creating Services” on page 23 of the GridServer Developer’s Guide for details.
The GridServer scheduler uses the fact that an Engine has initialization data and updates from a particular Service to route subsequent requests to that Service. This feature, called affinity, further reduces data movement, because unneeded Engines are not recruited into the Service. (However, if the Service has pending requests, available but uninitialized Engines will be allocated to it.)
Affinity can be further exploited by dividing the state of an application across multiple client-side Service instances, called Service Sessions. The application then routes requests to the instance with the appropriate data. For example, in an application dealing with bonds, each Service instance can be initialized with the data from one or several bonds. When a request comes in for the value of a particular bond, it is routed to the service instance responsible for that bond. In this way, a request is likely to arrive on an Engine that already has the bond data loaded, yet no Engine will be burdened with the entire universe of bonds.
There are Engine and Service parameters related to stateful processing. The Service Session Size parameter, located on Engine Configuration pages under the Caches heading, controls how much initialization data can be stored on an Engine in aggregate. In other words, if the total size of init data across all loaded service instances exceeds the set value of the parameter, then the least-recently used Service instance will be purged from the cache. If Instrumentation shows a non-zero time for Engine Download Instance the second or subsequent time an Engine receives a request from a service, that indicates that the service instance was purged from the cache. Increasing Service Session Size may then result in improved performance.
The STATE_AFFINITY Service option is a number that controls how strongly the scheduler uses affinity for this service. The default is 1, so set it to a higher value to give your service preference when Engines are being allocated by affinity.
The AFFINITY_WAIT Service option controls how long a queued request will avoid being allocated to an available Engine that has no affinity, in the hope of later being matched to an Engine with affinity. Use this option when the initialization time for a service instance is large. For instance, say it takes five minutes to load a bond. If AFFINITY_WAIT is set to two minutes, then for two minutes from the time the first Engine becomes available, a queued request will not be assigned to an available Engine that lacks affinity. If an Engine that already has loaded the bond becomes available in those two minutes, then the request will be assigned to that Engine, saving five minutes of startup time.
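The purge rule described for the Engine-side cache (when the total init data exceeds the configured size, the least-recently used Service instance is dropped) is a size-bounded LRU cache. The sketch below illustrates that rule only; the class `ServiceInstanceCache`, its byte-based accounting, and the instance identifiers are assumptions for illustration and are not the Engine's actual data structures.

```python
from collections import OrderedDict


class ServiceInstanceCache:
    """Hold init data per Service instance, evicting least-recently used first."""

    def __init__(self, max_total_bytes):
        self.max_total_bytes = max_total_bytes
        self._instances = OrderedDict()  # instance id -> init data (bytes)
        self.total_bytes = 0

    def get(self, instance_id):
        """Return cached init data and mark it recently used, or None on a miss."""
        data = self._instances.get(instance_id)
        if data is not None:
            self._instances.move_to_end(instance_id)
        return data

    def put(self, instance_id, init_data):
        """Store init data, then purge old instances until the total fits.

        Returns the ids that were purged, oldest first.
        """
        old = self._instances.pop(instance_id, None)
        if old is not None:
            self.total_bytes -= len(old)
        self._instances[instance_id] = init_data
        self.total_bytes += len(init_data)
        purged = []
        while self.total_bytes > self.max_total_bytes and len(self._instances) > 1:
            victim, data = self._instances.popitem(last=False)
            self.total_bytes -= len(data)
            purged.append(victim)
        return purged
```

A miss from `get` after an earlier `put` corresponds to the situation described above, where the Engine has to download the instance again because it was purged.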
Compression Setting the COMPRESS_DATA Service option to true (in the Service client or on the Service Type Registry page) will cause all transmitted data to be compressed. For large amounts of data, the transmission time saved more than makes up for the time to do the compression.
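How much compression helps depends on how compressible the data is, so it can be worth measuring a representative payload before turning the option on. The sketch below uses Python's standard `zlib` purely as a measuring tool; it says nothing about which algorithm GridServer uses, and the function name `compression_report` is hypothetical.

```python
import zlib


def compression_report(payload):
    """Compress `payload` and return (original_size, compressed_size, restored)."""
    compressed = zlib.compress(payload)
    restored = zlib.decompress(compressed)
    return len(payload), len(compressed), restored
```

Highly repetitive inputs such as delimited text usually shrink considerably, while data that is already compressed or encrypted tends not to shrink at all, in which case the compression time is pure overhead.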
Packing Packing multiple requests into a single one can improve performance by amortizing the fixed per-request overhead of GridServer and the application over multiple units of work. The fixed overhead includes TCP/IP connection setups for multiple transits, GridServer scheduling, and other possible application initialization steps. GridServer’s AUTO_PACK_NUM Service option is an easy way to achieve request packing. If its value is greater than zero, then that many requests will be packed into a single request, and responses will be unpacked, transparently to the application. (If the application makes fewer than AUTO_PACK_NUM requests, then the accumulated requests are transmitted after one second.) Auto-packing amortizes per-request overhead, but does not factor out common data.
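The amortization argument above can be checked with simple arithmetic: if each message carries a fixed cost, packing N requests per message divides the total fixed cost by roughly N. The sketch below is an illustrative model, not GridServer code; the function names and the idea of a single per-message overhead figure are assumptions.

```python
def pack_requests(requests, auto_pack_num):
    """Group requests into packed messages of at most `auto_pack_num` items."""
    if auto_pack_num <= 0:
        # Packing disabled: every request travels on its own.
        return [[request] for request in requests]
    return [
        requests[i:i + auto_pack_num]
        for i in range(0, len(requests), auto_pack_num)
    ]


def total_overhead(num_requests, auto_pack_num, per_message_overhead):
    """Fixed overhead paid for `num_requests`, given a per-message cost."""
    messages = len(pack_requests(list(range(num_requests)), auto_pack_num))
    return messages * per_message_overhead
```

The final, possibly short, group in `pack_requests` stands in for the accumulated requests that are sent when fewer than AUTO_PACK_NUM arrive before the timeout.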
Direct Data Transfer
By default, GridServer uses Direct Data Transfer (DDT) to transfer inputs and outputs between Drivers and Engines. When Driver-Engine DDT is enabled, the Driver saves each request as a file and sends a URL to the Broker. The Engine assigned to the request gets the URL from the Broker and reads the data directly from the Driver. Engine-Driver DDT works the same way in the opposite direction. Without DDT, all data must needlessly go through the Broker. DDT is efficient for medium to large amounts of data, and prevents the Broker from becoming a bottleneck. However, if the amount of data read and written is small, disabling DDT may boost performance. Disable Driver-Engine DDT in the driver.properties file on the client. Disable Engine-Driver DDT from the Engine Configuration page.
Shared Directories and DDT In some network configurations, it may be more efficient to use a shared directory for DDT rather than the internal fileservers included in the Drivers and Engines. In this case, the Driver and Engines are configured to read and write requests and results to the same shared network directory, rather than transferring data over HTTP. All Engines and the Driver must have read and write permissions on this directory. Shared directories are configured at the Job and Service level with the SHARED_UNIX_DIR and SHARED_WIN_DIR options. If using both Windows and Unix Engines and Drivers, you must configure both options to be directories that resolve to the same directory location for the respective operating systems.
Caching
Service initialization data is effectively a caching mechanism for data whose lifetime corresponds to the Service Session. Other caching mechanisms can be used for data with other lifetimes.
If the data is constant or rarely changing, use GridServer’s resource deployment mechanism to distribute it to Engine disks before the computation begins. This is the most efficient form of data transfer, because the transfer occurs before the application starts.
GridCache can also be used to cache data. GridCache data is stored on the Manager and cached by Engines and other clients. GridCache can handle large amounts of frequently updated data. See Chapter 7, “GridCache” on page 73 of the GridServer Developer’s Guide for more information.
Data References
GridServer supports Data References: remote pointers to data. A Data Reference is small, but can refer to an arbitrary amount of data on another machine. Data References are helpful in reducing the number of network hops a piece of data needs to make.
For instance, imagine that an Engine has computed a result that another Engine may want to use. It could write this result to GridCache. But if the result is large, it will travel from the writing Engine to the GridCache repository on the Broker, and then to the reading Engine. If the first Engine writes a Data Reference instead, the second Engine can read the data directly from the first Engine. Data References hide this implementation from the programmer, making network programming much simpler. See Chapter 4, “Accessing Services” on page 39 of the GridServer Developer’s Guide or the GridServer API for more information.
Tasks Per Message
In the Job model, messages are sent to the Engine when TaskInputs are created. To minimize message overhead, by default one message is sent for every 20 Tasks in a Job. When running Jobs with many short-running Tasks, you may be able to reduce message overhead further by setting the Job option TASKS_PER_MESSAGE to a number higher than the default of 20.
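As a rough illustration of the arithmetic, the sketch below estimates how many messages a Job generates for a given TASKS_PER_MESSAGE value. It assumes one message per full or partial group of Tasks, which is an assumption about how a final partial group is handled. The function name messages_for_job is hypothetical and is not part of GridServer.

```shell
#!/bin/sh
# Hypothetical helper: estimate the number of messages sent for a Job,
# assuming one message per group of TASKS_PER_MESSAGE Tasks (rounded up).
messages_for_job() {
    tasks=$1
    per_message=${2:-20}   # 20 is the documented default
    echo $(( (tasks + per_message - 1) / per_message ))
}

messages_for_job 10000        # default of 20 Tasks per message
messages_for_job 10000 100    # TASKS_PER_MESSAGE raised to 100
```

Under these assumptions, raising the option from 20 to 100 cuts the message count for a 10,000-Task Job by a factor of five.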
Invocations Per Message
In the Services model, Drivers send one message per invocation submitted to the Manager. To minimize message overhead, more invocations can be sent in each message. This can increase submission speed on Services when many invocations are submitted in bulk. The Service option INVOCATIONS_PER_MESSAGE can be changed to a number greater than 1, so that the Driver buffers that number of invocations before submitting them to the Manager. The buffered invocations are also flushed to the Manager every second if the buffer has not reached the maximum number.
Tuning for Large Grids
In GridServer installations with a large Grid, Manager performance may become extremely slow. For example, the Broker Monitor may take several seconds to update. The following changes can improve performance on large Grids:
• Increase the number of Resin request threads from the default of 200 to 300 or more. A good rule of thumb is Resin Threads = Maximum Messaging Connections + Maximum Resource Download Connections + 50. This ensures enough threads to handle all messaging, downloads, and browser requests. To do this, edit the conf/resin.conf file at the top of the Broker's installation directory. Change the line that reads:
    200
to change the setting to 300 or more. Note that your Broker will restart when the resin.conf file is modified.
• On the Brokers, increase the Engine “Max Millis Per Heartbeat” value to at least 2 minutes; the default is 30 seconds.
• Increase the SSL “Token Timeout” for both “Broker Resources” and “Director Resources” to 5 minutes. This timeout is in effect regardless of whether SSL is used. The settings are on the Manager Configuration page, in the SSL section, under the Resource Deployment heading.
• Increase the Assignment Timeout, on the Manager Configuration page, in the Services section, to 60000 ms. Increasing this allows more time for an Engine to connect and pick up an assigned task when the Broker is under heavy load. This value should be increased if you often see 'Task assignment expired:'... messages.
• On the Manager Configuration page, in the Communication section, change Maximum Messaging Connections to 200; change Messaging Retry Wait to 10000 ms; change Driver/Engine/Daemon Socket Timeout to 120 seconds.
• Increase the heap size. The Java maximum heap size is set in the server.sh or server.bat file, and is 512 MB by default in GridServer 4.2. It can be increased by changing the environment variable MAX_HEAP in the server.bat or server.sh file.
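The Resin thread rule of thumb can be worked through with a small shell sketch. The function name resin_threads is hypothetical, and the figure of 100 resource download connections is an assumed value for illustration; substitute the values from your own Manager Configuration page.

```shell
#!/bin/sh
# Rule of thumb from the text:
#   Resin threads = messaging connections + resource download connections + 50
resin_threads() {
    messaging=$1
    downloads=$2
    echo $(( messaging + downloads + 50 ))
}

# 200 messaging connections is the value suggested for large Grids;
# 100 download connections is an assumed figure for illustration only.
resin_threads 200 100
```

With these inputs the rule gives 350 threads, which is consistent with the advice to raise the setting to 300 or more.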
Chapter 11 Diagnosing GridServer Issues
This chapter describes how to find the information needed to diagnose GridServer issues. It covers troubleshooting your installation and gathering information that will be helpful if you contact DataSynapse for support.
Troubleshooting
When troubleshooting a GridServer installation, try the following:
1. Search the GridServer Knowledge Base, located at customer.datasynapse.com. This contains known issues, including those that have occurred since the publication of this guide, and is updated frequently.
2. Check the state of your Grid:
• Check Engine Daemon state configuration.
• Is File Update enabled?
• Are Engine paths set as desired?
3. Read the log files, as described below.
Obtaining Log Files
There are several logs generated by GridServer. Depending on what kind of issue you are troubleshooting, you may need to examine one or more logs. These include Manager, Driver, Engine, and Engine Daemon logs.
Manager Logs
Manager logs are written to the console window on Windows machines if the Manager is not run as a service, or on Unix machines if the Manager is run in the foreground on the console. Because GridServer is usually run as a service or in the background, there are several other ways to view the Manager log:
• In the GridServer Administration Tool, from the Admin menu, select Current Log. This displays new lines of the log as they happen, in a new window. It doesn’t, however, display any historical information. Click the Snapshot button to open a frozen duplicate of the current log window.
• Also in the Administration Tool, from the Admin menu, select Diagnostics. This page enables you to search the Manager log, plus other logs, and display the results or create a .ZIP file of them. To view Manager Log results, select Manager Log in Choose Files, then select a time range in Choose Manager Log Date/Time. You can then display the log on-screen by clicking Display Below, display it in a new window with Display in Separate Popup Window, or save it in a compressed file with Create .ZIP File.
• The Manager log is available directly at manager_root/webapps/livecluster/WEB-INF/log/server/* or the location specified on the Manager Configuration page in the Logging section, on the Manager tab.
The Manager log can be set to different levels of granularity, ranging from Severe, which provides the least amount of logging information, to Finest, which logs the most information. By default, this level is set at Info. For debugging purposes, it may be necessary to set the level higher, to Finer or Finest.
To change the log level:
1. In the GridServer Administration Tool, select the Manager tab.
2. Select Manager Configuration.
3. Select Logging.
4. In Default Debug Level, select a new level.
Engine and Daemon Logs
Each Engine and Engine Daemon generates its own logs. These can be accessed directly on Engines. However, because Engines are typically installed on several different machines, there are also methods to view logs remotely from other computers. The following procedures describe how to read Engine logs.
To read the log in a scrolling window:
1. In the GridServer Administration Tool, select the Engine tab.
2. Select the Engine Admin page.
3. From the Actions menu, select Remote Log.
This will open a window that displays the log for the Engine. As new logging information is generated, it is displayed. This does not, however, display any prior logging history.
To access previous logs:
1. In the GridServer Administration Tool, select the Engine tab.
2. Select the Engine Admin page.
3. From the Actions menu, select Log URL List.
This will open a window containing hyperlinks to each of the log files on the Engine. You can click on each link to remotely view each log. Note that if you open a log and then more Engine activity occurs, you will need to reload the log to view it.
To directly view log files, look in the following directories in each Engine install directory:
• Instance logs: work/name-instance/log/*
• Daemon logs: profiles/name/logs/engined.log
• Also examine other .log files in the Engine directory tree.
To change the log level for Engines:
1. In the GridServer Administration Tool, select the Engine tab.
2. Select the Engine Configuration page.
3. Select an Engine Configuration from the list.
4. In the Log section, select a new level in the Level list.
5. Change this setting in each Engine Configuration for which you want to change logging.
Driver Logs
Driver logs are displayed in the command or shell window when a Driver is running. They are also captured in the logs subdirectory of the working directory. For SOAP access, including Web Services and Batches, an embedded Driver on the Manager is used; no local logs are generated.
Application Server Logs
The application server used to run the GridServer Manager also generates logs that can be helpful in diagnosing issues. For Resin, the logs are in manager_root/log/error.log.
Chapter 12 Administration Howto
This chapter contains several procedures that are commonly used when administering a GridServer Manager. Most of the tasks outlined below use the GridServer Administration Tool, which is also described in Chapter 6, “The GridServer Administration Tool” on page 35. The Administration Tool also has online help, which further describes each page’s features.
Backup / Restore
Backing up and restoring GridServer Managers requires little more than an OS-level file copy of the webapps/livecluster directory in your installation directory. On Director installations you may also have to use the database repair scripts to back up or restore the internal and reporting databases.
Backup Procedure
To back up a GridServer installation:
1. Archive (with tar or zip) or simply copy the [GS Manager Root]/datasynapse/webapps/livecluster directory. Exclude the subdirectories livecluster/dataTransfer and livecluster/localDriverDDT from your archive process.
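On Unix, the backup step might be scripted along the following lines. This is a minimal sketch, not a DataSynapse-supplied tool: the function name backup_livecluster and the paths in the commented usage line are hypothetical, and it assumes a tar that supports the --exclude option (GNU tar and BSD tar both do).

```shell
#!/bin/sh
# Hypothetical helper: archive the livecluster directory, skipping the
# two subdirectories the backup procedure says to exclude.
#   $1 = webapps directory that contains livecluster
#   $2 = path of the gzip-compressed tar archive to create
backup_livecluster() {
    webapps_dir=$1
    archive=$2
    tar -czf "$archive" -C "$webapps_dir" \
        --exclude='livecluster/dataTransfer' \
        --exclude='livecluster/localDriverDDT' \
        livecluster
}

# Example call (paths are placeholders, adjust to your installation):
# backup_livecluster /opt/gridserver/webapps /backup/livecluster.tar.gz
```

Archiving relative to the webapps directory (the -C option) keeps livecluster as the top-level entry, so the archive can later be unpacked directly into a webapps directory.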
Restore Procedure
To restore a GridServer installation:
1. Unpack the original GridServer Manager installation using WinZip or a similar tool for Windows. On a Unix system, do the following:
    gzip -d -c GridServer_R4*gz | tar xvf -
2. Delete the livecluster directory from [GS Manager Root]/DataSynapse/webapps.
3. Copy the backup livecluster directory to [GS Manager Root]/DataSynapse/webapps.
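Steps 2 and 3 of the restore procedure could be combined in a small Unix script such as the sketch below. It assumes the backup is a gzip-compressed tar archive whose top-level entry is livecluster; the function name restore_livecluster is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper covering steps 2 and 3 of the restore procedure.
#   $1 = webapps directory of the freshly unpacked installation
#   $2 = backup archive whose top-level entry is livecluster
restore_livecluster() {
    webapps_dir=$1
    archive=$2
    # Refuse to delete anything unless both arguments look valid.
    [ -d "$webapps_dir" ] || { echo "no such directory: $webapps_dir" >&2; return 1; }
    [ -r "$archive" ] || { echo "cannot read archive: $archive" >&2; return 1; }
    # Step 2: remove the livecluster directory from the new installation.
    rm -rf "$webapps_dir/livecluster"
    # Step 3: put the backed-up livecluster directory in its place.
    tar -xzf "$archive" -C "$webapps_dir"
}
```

The two checks run before the delete so that a mistyped path or missing archive leaves the fresh installation untouched.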
Manager Configuration
Applying a patch or service pack to GridServer
To apply a patch or service pack to GridServer, do the following:
1. Shut down the GridServer Managers that will be updated.
2. Run the JAR file. The syntax for running the JAR is:
    java -jar [Patch or Service Pack].jar [webapp_dir] [basedir1] [basedir2] ...
[webapp_dir] is the livecluster directory on your application server.
[basedirX] is the base directory for each Manager, if using alternate base dirs.
For example, to apply GridServer 3.2 patch 1 to GridServer 3.2 installed in c:\datasynapse:
    java -jar GridServer-3_2-Patch1.jar C:\datasynapse\webapps\livecluster
Driver Upgrade: Be sure to re-download the SDK and update all Drivers after a successful Manager update.
Note: All files that are changed will be saved in the corresponding directory in [basedirX]\WEB-INF\uninstall. For instance, the above example will save the old files in c:\datasynapse\webapps\livecluster\WEB-INF\uninstall\3_2-Patch1.
Importing and Exporting Manager Configuration
GridServer Managers can export the Director and Broker configurations and Engine configuration profiles into a signed JAR file, and later import the same format to migrate settings from one Manager to another. This can be used to migrate Engines from one Manager to another Manager without reconfiguring all of the Engines, to simplify administration of multiple Manager systems, or to disseminate an organization’s preferred default Engine configuration among all clusters in the organization.
To export a configuration:
1. In the GridServer Administration Tool, click the Admin tab and click the Import/Export page.
2. Select the configurations you would like to include in the JAR. This includes the Broker configuration, Director configuration, and any Engine configuration profiles.
3. Click Export.
4. A File Download dialog box appears. Click Save to save the JAR file.
To import a configuration:
1. In the GridServer Administration Tool, click the Admin tab and click the Import/Export page.
2. Next to the Provide File for import box, click Browse.
3. Browse to the location of the JAR file containing the GridServer Manager configuration export.
4. Click Upload to begin the import.
5. A list of configurations found in the JAR file will be displayed, with configurations highlighted in red if they will install over existing configurations. Select the configurations you wish to import, then click Import.
When the import is complete, the Manager may need to be restarted for changes to take effect; in this case, a message will be displayed and the Manager will automatically shut down.
Installing Manager Licenses
Each GridServer Manager requires a valid license to function. Licenses are limited by date, hostname, and number of Engines. By default, a demo license for four Engines is included with each Manager, but for further evaluation or production use, you must obtain a license by contacting DataSynapse Support.
To view your Manager’s license information in the GridServer Administration Tool, click the Admin tab and click the License Information page.
A Manager license consists of a single XML file, and is typically sent by DataSynapse Support via email as an attached file. Licenses can also be downloaded at any time from the http://customer.datasynapse.com customer support site. You can inspect the license with a text editor to determine its capacity, but you should not make changes to the file.
To install the license:
1. In the GridServer Administration Tool, click the Admin tab, then click the License Information page.
2. Copy the license file that you received as an email attachment from DataSynapse, or downloaded from the DataSynapse customer support site, to a location accessible with your web browser (either a local directory or a shared directory).
3. Click Browse.
4. Find the license file and click Open.
5. Click Upload New License.
If the license file is valid, it will overwrite the existing license and changes will take effect immediately. If it is expired, corrupt, or otherwise not valid, an error message will appear and your existing license will remain in place.
Setting the SMTP host
The GridServer Administration Tool can be configured to send notifications via email, using the Event Subscription page. To send the email, there must be an SMTP host configured for the Manager. This is typically configured during Manager installation, but you can later add or change the value.
To set the SMTP host:
1. Click the Manager tab.
2. Click Manager Configuration.
3. Click Admin.
4. In the Mail heading, in SMTP Host, enter the name of your SMTP server. For many organizations, this is simply mail.
5. In Contact Address, enter the email address of an administrative contact. A notification will be sent to this address when new users are added to the Administration Tool.
6. Click Save.
Setting Up a Failover Broker
In the fault-tolerant configuration, some Brokers can be set up as Failover Brokers. When a Broker is designated a Failover Broker, no Director will route Engines to that Broker unless there are no other active Brokers. When there are no Jobs waiting for Service on a Failover Broker and other Brokers in the Grid are available, the Failover Broker will “kick off” idle Engines, causing the Engines to log in to their Primary Director and get reassigned to a non-Failover Broker in the Grid.
By default, all Brokers are non-Failover Brokers (they load-balance work). Designate one or more Brokers within the Grid as Failover Brokers when you want those Brokers to remain idle during normal (non-failure) operation.
To set up a Failover Broker:
1. Log in to the GridServer Administration Tool.
2. Click the Admin tab, then click the Manager Reconfigure page.
3. Go through each configuration step. In the third step, set Broker to Failover.
4. After completing the eight steps of the Manager Reconfigure, click Start Installation. This will reinstall GridServer and restart the Broker as a Failover Broker.
Configuring SNMP
The ServerEvent API supports the generation of SNMP traps on a per-event basis. For example, events such as ‘Job Cancelled’ and ‘Engine Died’ can be sent as traps to an SNMP monitoring station. The SNMP interface can be administered through an administrative plugin on the GridServer Manager. The traps themselves are defined in the GridServer application MIB.
To configure and enable SNMP support for your Manager:
1. In the Administration Tool, click the Admin tab and click SNMP Configuration.
2. Enter the hostname and port of your SNMP server in the Host and Port fields, then click Add.
3. If you have multiple SNMP servers, repeat step 2 for each server.
4. In SNMP Version, select the version of the SNMP protocol your servers use.
5. Select each event in the event list for which you would like to have a trap generated.
6. Click the Manager tab, click Manager Configuration, and click Admin.
7. In the SNMP section, set enabled to True for the Broker, Director, or both.
The GridServer MIB can be found in [GS Manager Root]/webapps/livecluster/WEB-INF/etc/snmp.
Some SNMP events generate traps from the Broker, while others generate traps from the Director. The following is a list of events that generate traps, sorted by Broker or Director:
Broker Trap Events      Director Trap Events
DriverAddedEvent        BrokerAddedEvent
DriverRemovedEvent      BrokerRemovedEvent
EngineAddedEvent        EngineDaemonAddedEvent
EngineDiedEvent         EngineDaemonRemovedEvent
EngineRemovedEvent      RemoteDatabaseBackupFailure
JobCancelled            LocalDatabaseBackupFailure
JobFinished             ServerStartedEvent
JobRunning
ServerStartedEvent
TaskFailed
Enabling Enhanced Task Instrumentation
Normally, the execution time of a submitted task or remote Service invocation is measured only from start to finish. But often it is useful to be able to track the time spent in the various stages of this process, including input serialization, disk writing, task message submission, task queueing, task fetching, data transport, input deserialization, task processing, output serialization, output transport, queuing, and so on. This will allow you to understand the timing characteristics of distributed computing, optimize the process, and diagnose problems with greater ease.
To enable enhanced task instrumentation:
1. In the Administration Tool, click the Manager tab, click Manager Configuration, then click Services.
2. In Instrumentation, set Enable to True.
3. Click Save.
When enabled, task instrumentation applies to all Services on the Manager.
WARNING: Task instrumentation will slow down the Manager, and also requires additional disk space, so it is important to disable it after you have finished using it. It is NOT recommended for production systems.
To view data generated by enhanced task instrumentation:
1. Click the Services tab, and click Service Session Admin.
2. Find the Service you wish to view, and select View Instrumentation from the Actions menu. Note that this choice will only appear after the Service has finished running.
A new window will open, displaying a table of data collected by enhanced task instrumentation for the Service.
For more information on instrumentation, see Appendix A, “Task Instrumentation” on page 105 of the GridServer Developer’s Guide.
Engine Management
Deploying Files to Engines
Directory Replication enables you to coordinate and synchronize files from your Manager to Engines. You can use this to ensure that Engines all have the latest version of a library, file, data set, or other resources needed to complete work.
By default, the Directory Replication mechanism automatically looks for files in a predefined directory (typically deploy/resources, within the livecluster directory on the Manager). This contains six directories: one for each OS supported, and a shared directory that is replicated to all OSes. During each check of the directory (the default is once per minute), if it notices changes, it sends the new files to each Engine. It also forces the Engine to log out and log back in. This interrupts any current work, but it also ensures that work isn’t completed with incorrect libraries or data.
You can also manually trigger a file update to ensure all Engines have the same files.
To upload files and manually trigger an update:
1. Click the Services tab.
2. Click Resource Deployment.
3. Add your files to the Manager by clicking directory names to navigate to a directory. Then click Browse to find a file on your PC, and Upload to upload it to the Manager.
4. You can also place files in the livecluster/deploy/resources directories on the Manager. There are OS-specific directories for Engines running on Linux, Solaris, and Win32 machines, and a shared directory which is copied to all Engines.
5. Click the Update button.
Updating the Windows Engine JRE
By default, the 1.4.2_03 JRE is used for Windows Engines. You can change what version of the JRE is used. For Windows, the JRE used resides on the Manager, and is updated on Engines, so you only need to change the JRE once on the Manager. It is not necessary to re-install the Engines after adding the new JRE because they will update themselves automatically.
Note that when downloading a new JRE from Sun, you should download the SDK and use the JRE contained within that package. There is also a downloadable JRE package, but the JRE it contains does not contain the server version of a library required for Engines to run.
To change the JRE version:
1. Unpack the JRE you wish to use into a temporary directory. Also, ensure that the JRE you have is the Server version (included in the Java JDK) and not the client version (from the standalone JRE).
2. Download the Java Cryptography Extension (JCE) from Sun at http://java.sun.com/j2se/1.4.2/download.html (at the bottom, under “Other Downloads”). This download contains the two files local_policy.jar and US_export_policy.jar, which should be copied into the jre/lib/security directory.
3. Create a ZIP file of the directory containing the JRE files and additional files.
4. Replace the public_html/register/install/jre/jre.ZIP on the Manager with the ZIP file of the JRE you created. Note: the file name is case-sensitive, and ZIP must be uppercase. If you are using an alternate base directory, a read-only installation, or running multiple Managers on one machine, make sure to copy this file into the same location in each DS_BASEDIR directory.
5. Open the file engineUpdate/Win32/jre.dat in your GridServer distribution with a text editor.
6. Replace the 1.4.2_03 with the version number of the JRE you wish to use.
Updating the Unix Engine JRE
By default, the 1.4.2 JRE is used for Unix Engines. You can change what version of the JRE is used. For Unix, there is no update mechanism similar to the one for Windows, so you need to update the JRE on each Engine.
Note that when downloading a new JRE from Sun, you should download the SDK and use the JRE contained within that package. There is also a downloadable JRE package, but the JRE it contains does not contain the server version of a library required for Engines to run.
To change the JRE version:
1. Shut down any running daemons:
    engine.sh stop
2. Change directories to the Engine home directory on the machine running the Engine, for example, DSEngine.
3. Move the current JRE to a new directory, named so that it does not clash with the directory used for the new JRE:
    mv jre jre1_4_2
4. Unarchive the desired JRE into a new directory, such as jre1_4_3.
5. Download the Java Cryptography Extension (JCE) from Sun at http://java.sun.com/j2se/1.4.2/download.html (at the bottom, under “Other Downloads”). This download contains the two files local_policy.jar and US_export_policy.jar, which should be copied into the lib/security directory of the new JRE (for example, jre1_4_3/lib/security).
6. Symlink the desired JRE to jre:
    ln -s jre1_4_3 jre
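Because this procedure must be repeated on every Unix Engine, the move and symlink steps are natural candidates for a script. The sketch below is one hypothetical way to do it: the function name switch_engine_jre and its arguments are illustrative, and it assumes the new JRE has already been unpacked inside the Engine home directory and the Engine daemons have been stopped.

```shell
#!/bin/sh
# Hypothetical helper: set the current JRE aside and point the jre
# symlink at a newly unpacked JRE directory.
#   $1 = Engine home directory (for example, DSEngine)
#   $2 = name of the new JRE directory inside the Engine home
#   $3 = name under which to keep the old JRE
# The body runs in a subshell so the caller's working directory is unchanged.
switch_engine_jre() (
    engine_home=$1
    new_jre=$2
    old_name=$3
    cd "$engine_home" || exit 1
    [ -d "$new_jre" ] || { echo "missing new JRE: $new_jre" >&2; exit 1; }
    if [ -e "$old_name" ]; then
        echo "$old_name already exists, not overwriting" >&2
        exit 1
    fi
    # Keep the old JRE, then make jre a symlink to the new one.
    mv jre "$old_name" && ln -s "$new_jre" jre
)

# Example call (directory names are placeholders):
# switch_engine_jre /opt/DSEngine jre1_4_3 jre_previous
```

Keeping the old JRE under a separate name makes it straightforward to roll back by removing the symlink and moving the old directory back to jre.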
Setting the Director Used by Engines
The primary and secondary Directors for an Engine are set during Engine installation. You can later change the Directors to which an Engine reports by changing the Engine Configuration used by the Engine.
To configure an Engine’s Directors:
1. Log in to the GridServer Administration Tool.
2. Click the Engine tab, then click the Engine Configuration page.
3. Select the Engine distribution used by the Engine. This is typically the operating system of the Engine.
4. Go to the Directors and Brokers heading and change Primary Director URL and Secondary Director URL to the corresponding addresses and ports of the primary and secondary Directors, in the format http(s)://address:port.
Note that this will change the Directors for all Engines using that Engine distribution.
Running Services
Running MPI Jobs using PDriver
PDriver, the Parametric Job Driver, has support for running MPI Jobs. The following two options in the PDS language supported by PDriver are used when running MPI:
mpiEnabled - Boolean switch which indicates the job is to be run in MPI mode. An MPI mode job is based on a group size (see below), with each “group step” treated as a single step of the job. If a single task in an MPI job fails, all other tasks in that “group step” are rescheduled.
mpiGroupsize - The number of nodes used in each MPI group step. The number of tasks for the job must be evenly divisible by this setting.
For more information on writing PDS scripts for PDriver, see Chapter 6, “PDriver” on page 49 of the GridServer Developer’s Guide.
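Before submitting an MPI Job, the divisibility requirement on mpiGroupsize can be checked with a short shell sketch like the one below. The function name mpi_groupsize_ok is hypothetical and is not a PDriver command; it only illustrates the arithmetic.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeeds only when the task count is
# evenly divisible by the MPI group size, as mpiGroupsize requires.
mpi_groupsize_ok() {
    tasks=$1
    groupsize=$2
    [ "$groupsize" -gt 0 ] && [ $(( tasks % groupsize )) -eq 0 ]
}

if mpi_groupsize_ok 64 8; then
    echo "64 tasks divide evenly into groups of 8"
fi
mpi_groupsize_ok 65 8 || echo "65 tasks cannot be split into groups of 8"
```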
Registering a Service Type
To use a Service, you must first register a Service Type from the GridServer Administration Tool.
To register a Service Type:
1. Log in to the GridServer Administration Tool.
2. Click the Service tab, then click the Service Type Registry page.
3. A list of existing Service Types appears on that page, along with a line for adding a new Service Type.
4. Enter the Service Type Name on the blank line.
5. Select the Service Implementation, then click Add. A window with several options appears after clicking the Add button.
6. For Java Service Types, enter the fully qualified class name for the service; for .NET, dynamic libraries, or commands, enter the class name plus assembly name, library name, or command line, respectively. The window also allows you to enter options for the Service Type.
Note that after you register a Service Type, you must deploy the implementation to your Engines.
Creating and Running a Batch
To run a Batch, you must first create a Batch Definition, which contains components that specify the schedule used by a Batch and what Services or commands are executed.
To create a Batch Definition:
1. In the Administration Tool, click the Batch tab, and click Batch Registry.
2. Type a name for your Batch Definition in the blank box at the bottom of the list and click Add.
3. The Batch Editor dialog box will open. You can type values for the Batch and Schedule components, and add additional components.
For example, to create a simple Batch Definition named NightlyBatch that runs a registered Service at midnight, do the following:
1. In the Schedule object, select a type of cron.
2. In the Cron subheading, enter 0 for minute and hour. This specifies a starting time of 00:00 on a daily basis, in cron format. You could change the values here to select a different time pattern, or select a type of absolute to enter times in a string, like Sat, 12 Aug 1995 13:30:00 GMT.
3. In the Add Component list, select ServiceCommand.
4. In the ServiceCommand component, select a Service Type from the ServiceName list. This is a list of all Service Types currently registered on your Manager.
5. In the ServiceCommand object, enter a MethodName, initData, inputData, or any other values that will be needed by your Service.
6. Click Save.
7. In the Actions control next to your Batch Definition in the list, select Schedule Batch Definition.
8. Your Batch will now be on the Manager and viewable in the Batch Schedule page on the Batch tab. It will wait until midnight, and then run the specified Service. When the Batch is running, you can monitor it on the Batch Admin page.
Creating a native stack trace in Linux
Sometimes when you are troubleshooting native C/C++ code on Linux, you want to generate a stack trace, for example when a SIGSEGV is raised. Since the JVM on the Engine already traps SIGSEGV and prints out a Java (not native) stack trace, you need to override the actions of the JVM and install your own SIGSEGV handler for debugging. The backtrace() and backtrace_symbols_fd() functions from glibc can be used for this purpose.
To install your own SIGSEGV handler for debugging, add code to your tasklet or service initialization method similar to this:
    #include <execinfo.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>   // abort()

    #define TRACE_DEPTH 50

    // Must be declared static in MyService so it can be passed to signal().
    void MyService::segv_handler(int signum) {
        void *trace[TRACE_DEPTH];
        // Capture up to TRACE_DEPTH return addresses from the current stack.
        int depth = backtrace(trace, TRACE_DEPTH);
        FILE *fp = fopen("trace.log", "w");
        if (fp != NULL) {
            // Write the symbolic form of each frame to trace.log.
            backtrace_symbols_fd(trace, depth, fileno(fp));
            fclose(fp);
        }
        abort();
    }

    void MyService::init() {
        // Replace the JVM's handlers for these signals.
        signal(SIGSEGV, segv_handler);
        signal(SIGBUS, segv_handler);
    }
Attaching GDB to Engine native code on Linux
GDB can be used to debug native code in cppdriver or JNI in Linux. GDB can also be useful in identifying unusual problems with the Linux JVM. However, there are some subtle issues when trying to use GDB on a JVM, as is the case with the GridServer Engine.
First, when attaching GDB to the Engine, you must set LD_LIBRARY_PATH to include both the Engine components and the JVM components. You must also obtain the process ID of a running “invoke” process from the ps command. It is also somewhat easier if you run GDB from the base directory of the Engine install (typically DSEngine). The GDB command used is something like:
    LD_LIBRARY_PATH=lib:jre/lib/i386:jre/lib/i386/native_threads:jre/lib/i386/server:resources/lib/linux gdb bin/invoke $INVOKEPID
This method of running GDB works well for troubleshooting those rare JVM problems. However, when you are troubleshooting cppdriver code, you need a little more finesse. The issue is that cppdriver loads your application shared objects only when the tasklet or service is instantiated, so it becomes difficult to set a breakpoint in the application shared object. Further, attaching GDB to a running JVM often has undesired side effects, including crashing the JVM, depending on the versions of JVM, pthreads, and GDB being used.
One technique that works in this instance is to have your application tasklet or service method include some conditional code that enters a loop checking a variable whose value is never changed by the application code, effectively creating an infinite loop. When you need to attach GDB, trigger the condition that causes the loop to be entered on the next invocation. Then attach GDB as above. You’ll see that the invoke process is stopped while running in the loop. At that point you can set your breakpoint and change the loop variable from GDB so that the infinite loop is exited, and the code will continue to your breakpoint, where you can continue debugging.
Logging messages from a Native service to the Engine log
To log messages to the Engine log file, use the UtilFactory::log method. See the C++ API documentation for more information. Alternatively, you may redirect your standard output to a separate log file. See “Redirecting Engine Output” in the “Log Overview” chapter of the GridServer Developer’s Guide (page 21) for more details. Also, if you are using C++ via JNI from a Java Tasklet on Linux Engines, you can log to stderr and the output will appear in profiles/.../engine.x.log. Note that JNI C++ code cannot write to standard output on Windows Engines.
Running a .NET Driver from an Engine Service
To run a .NET Driver from an Engine Service, you must first deploy the driver.properties file to the Engine, and then configure the Engine to use the new file. To do this:
1. If you haven’t already downloaded a copy of the driver.properties file, log in to the GridServer Administration Tool, click the Driver tab, click SDK download, and download the driver.properties to your local machine.
2. In the GridServer Administration Tool, on the Services page, click the Resource Deployment page.
3. Navigate to the resources\shared\config directory.
4. Click Browse and find your local copy of the driver.properties file, and click Upload.
5. On the Engine tab, click the Engine Configuration page.
6. Select the configuration that your Engines are currently configured to use or create a new one. If you create a new configuration remember to change the Engines to use that configuration before you test.
7. In the configuration editing screen there will be a section called Properties. In the Environment Variables section, change the value of DSDRIVER_DIR to .\resources\shared\config.
8. Click Save.
Chapter 12 – Administration Howto
Configuration Issues
Installation on Dual-Interface Machines
In some network configurations, a machine may have more than one network interface, and a GridServer component may default to using the incorrect interface. This can be corrected by configuring the component to use the correct interface.
Drivers:
To configure the Driver to use a different network interface, set the DSLocalIPAddress property to the IP address of the correct interface. For example:
DSLocalIPAddress=192.168.12.1
Engines:
To configure the Engine to use a different network interface, select the Engine Configuration that will be used by the Engine on the Engine Configuration page, and set the Net Mask value under the File Server heading to match the network range on which the Engine should run.
Configuring the timeout period for the Administration Tool
For security purposes, the GridServer Administration Tool times out and requires users to log in again. By default, the timeout period is 60 minutes. To change the timeout period, log in to the Administration Tool and click the Manager tab. Click Manager Configuration and click Security. In the Admin User Management section, type a time in seconds in the Admin Browser Timeout box. For example, enter 1800 for a 30-minute timeout.
Reconfiguring Managers when Installing a secondary Director
When you install a Manager that includes a secondary Director, you must also reconfigure the Manager containing the primary Director. To do so, click the Admin menu, click Manager Reconfigure, and enter the secondary Director’s address and port in the corresponding page. This registers the secondary Director’s address with the primary Director and updates the Engine and Driver configurations accordingly.
Using UNC paths in a driver.properties file
It is possible to use UNC paths to specify a hostname or directory within a driver.properties file. However, you will need to change all backslashes (\) to forward slashes (/) in the path. For example, to change the input directory for task (Job) inputs to the UNC path \\homer\job1-dir, change the following line:
DSWebserverDir=./ds-data
to this:
DSWebserverDir=//homer/job1-dir
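If many paths need converting, the backslash-to-slash substitution can be scripted. The following is a minimal Python sketch; the helper name unc_to_properties_path is illustrative and not part of any GridServer tool.

```python
def unc_to_properties_path(unc_path: str) -> str:
    """Return the path with every backslash replaced by a forward slash."""
    return unc_path.replace("\\", "/")


# A raw string keeps the backslashes of the UNC path literal.
print("DSWebserverDir=" + unc_to_properties_path(r"\\homer\job1-dir"))
```

The sketch only rewrites the separators; it does not check that the host or share exists.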
Chapter 13 Database Administration
Introduction
Each GridServer Manager has an embedded database running on each Director. This internal, or admin, database stores administrative data, such as User, Engine, Driver, and Broker information. An external reporting database can be used to log events and statistics. By default, GridServer is not configured with a reporting database; the included HSQLDB or a different external reporting database can be used.
Database Types
There are two databases used by the GridServer ManagerBroker, each of which is described below.
The Reporting Database
The external reporting database is optionally used to store events and statistics, which, depending on configuration settings, can grow fairly quickly. If you are going to make extensive use of the reporting capabilities, a robust external database is recommended. The specific types of data stored in the reporting database are configurable in the Database section of the Manager Configuration page. The external database can be installed on any machine, provided that the ManagerBroker is able to create connections to the database through a protocol such as JDBC. For information on installing an external database for the reporting database, see Appendix B, “Database Configuration” on page 61 of the GridServer Installation Guide.
The Internal Database
GridServer’s internal database stores admin data such as User, Engine, Driver, and Broker information. In typical cases, the internal database is read at Manager startup, and only written to thereafter if user-driven admin events occur, such as adding a user, Engine, Broker, or Driver profile. The internal database is required in order to start the Manager. If it becomes unavailable or corrupt, a running Manager will continue to function, but it cannot be restarted until the database is available again. This database is an embedded component of the GridServer software.
Internal Database Backup
The internal database used by GridServer is automatically backed up at a regular interval. The database is backed up to the [GS Manager Root]/webapps/livecluster/WEB-INF/db/internal/backup directory and is also replicated to the secondary Director if one is installed.
Backups take place based on the Backup Cron configuration option, located in the Database section of the Manager Configuration page of the Manager tab in the GridServer Administration Tool. The cron setting is similar to traditional Unix cron settings. It is a string of the form “minute, hour, day of month, month, day of week, year”. A field set to -1 matches every value of that field, so the backup repeats. For instance, a setting of “00,23,-1,-1,-1,-1” means the backup will occur daily at 11 PM. A setting of “00,23,1,-1,-1,-1” means the backup will occur on the first of every month at 11 PM. Ranges are as follows:
Name          Description
minute        Minute of the backup. Allowed values 0-59.
hour          Hour of the backup. Allowed values 0-23.
dayOfMonth    Day of month of the backup (-1 if every day). This attribute is exclusive with dayOfWeek. Allowed values 1-31. If both dayOfMonth and dayOfWeek are restricted, each backup will be scheduled for the earlier match.
month         Month of the backup (-1 if every month). Allowed values 0-11 (0 = January, 1 = February, ...). java.util.Calendar constants can be used.
dayOfWeek     Day of week of the backup (-1 if every day). This attribute is exclusive with dayOfMonth. Allowed values 1-7 (1 = Sunday, 2 = Monday, ...). java.util.Calendar constants can be used. If both dayOfMonth and dayOfWeek are restricted, each backup will be scheduled for the earlier match.
year          Year of the backup. When this field is not set (i.e. -1), the backup is repetitive (i.e. it is rescheduled when reached).
NOTE: Database backups can be very resource-intensive. It’s advisable to schedule them to occur during off-peak hours when your Grid usage is minimal.
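Before saving a Backup Cron value, it can help to check it against the field ranges listed above. The following is a minimal Python sketch of such a check. It is not the parser GridServer uses. The function name parse_backup_cron is illustrative, and the sketch assumes that the fields are comma-separated integers and that -1 is accepted in any field.

```python
# Field names and allowed ranges, taken from the table above.
FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("dayOfMonth", 1, 31),
    ("month", 0, 11),       # 0 = January
    ("dayOfWeek", 1, 7),    # 1 = Sunday
    ("year", None, None),   # no range is documented for year
]


def parse_backup_cron(setting: str) -> dict:
    """Split a Backup Cron string into named fields and range-check each one."""
    parts = [p.strip() for p in setting.split(",")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    parsed = {}
    for (name, low, high), text in zip(FIELDS, parts):
        value = int(text)
        # -1 means "every value" for that field, so it skips the range check.
        if value != -1 and low is not None and not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
        parsed[name] = value
    return parsed


daily = parse_backup_cron("00,23,-1,-1,-1,-1")    # every day at 11 PM
monthly = parse_backup_cron("00,23,1,-1,-1,-1")   # first of each month at 11 PM
print(daily)
print(monthly)
```

The sketch validates values only; it does not model the rule for settings that restrict both dayOfMonth and dayOfWeek.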
Appendix A The grid-library.dtd
Introduction
The grid-library.xml configuration file in the root of a Grid Library must be a well-formed XML file. The GridServer SDKs include a grid-library.dtd file that can be used to validate the XML file. The DTD is also shown below.
Example A.1: grid-library.dtd
Appendix B Reporting Database Tables
Introduction
GridServer uses a simple relational database to report Grid processing events for historical analysis. This appendix describes the tables in the reporting and internal databases for use by external programs.
Batches
Batches that have been scheduled or executed.
Database: reporting
Primary key: none
Column name     Data type    Description
server          Varchar      Manager where the Batch resided or ran
batch_id        Bigint       Unique ID number of the Batch Entry
time_stamp      Timestamp    Timestamp of the event
event           Int          Event code
class           Varchar      Class in the Batch
execution_id    Bigint       Unique ID number of the Batch Execution, if applicable
description     Longvarchar  Description of the Batch Event
Brokers
Table of all Brokers that have participated in this Grid.
Database: internal
Primary key: broker_id
Column name      Data type    Description
broker_id        Int          Broker ID #
broker_url       Varchar      Broker’s configured base URL
weight_0         Float        Engine weight for Broker routing
weight_1         Float        Driver weight for Broker routing
discriminator_0  Longvarchar  Engine discriminator for Broker routing*
discriminator_1  Longvarchar  Driver discriminator for Broker routing*
broker_name      Varchar      Name of the Broker
shared_brokers   Longvarchar  Comma-delimited list of Brokers that share Engines with this Broker
min_engines      Int          Minimum number of Engines allowed on the Broker
max_engines      Int          Maximum number of Engines allowed on the Broker
* Stored as XML object
Broker_stats
All statistic reports from Brokers are stored in this table.
Database: reporting
Primary key: broker_id + timestamp
Column name        Data type    Description
broker_id          Int          The unique ID of the Broker
time_stamp         Timestamp    Timestamp of the report
num_busy_engines   Int          Number of Engines busy at report time
num_total_engines  Int          Number of Engines logged in at report time
num_drivers        Int          Number of Drivers logged in at report time
uptime_minutes     Float        Time since Broker start in minutes
num_jobs_running   Int          Number of jobs running at report time
num_tasks_pending  Int          Number of tasks pending (not yet assigned to Engines) at report time
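An external program might summarize these reports per Broker. The following is a minimal Python sketch that uses an in-memory SQLite table as a stand-in for the reporting database. The sample rows are invented for illustration. A real reporting database would be reached through its own driver, and its SQL dialect may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE broker_stats (
        broker_id INT, time_stamp TIMESTAMP,
        num_busy_engines INT, num_total_engines INT,
        num_drivers INT, uptime_minutes FLOAT,
        num_jobs_running INT, num_tasks_pending INT);
    -- Hypothetical sample reports, not real GridServer data.
    INSERT INTO broker_stats VALUES
        (1, '2006-01-02 10:00:00', 5, 10, 2, 60.0, 3, 12),
        (1, '2006-01-02 10:05:00', 10, 10, 2, 65.0, 4, 30),
        (2, '2006-01-02 10:00:00', 0, 4, 1, 15.0, 0, 0);
""")

# Average fraction of logged-in Engines that were busy, and the largest
# backlog of pending tasks, for each Broker.
rows = conn.execute("""
    SELECT broker_id,
           AVG(1.0 * num_busy_engines / num_total_engines) AS avg_busy_fraction,
           MAX(num_tasks_pending) AS peak_pending
    FROM broker_stats
    WHERE num_total_engines > 0
    GROUP BY broker_id
    ORDER BY broker_id
""").fetchall()

for row in rows:
    print(row)
```

Reports with no Engines logged in are filtered out so that the division is always defined.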
Driver_events
Brokers report when a Driver logs in or out.
Database: reporting
Primary key: none
Column name  Data type    Description
username     Varchar      Driver user name
hostname     Varchar      Hostname Driver is running on
time_stamp   Timestamp    Timestamp of the report
broker_id    Int          ID of Broker where event occurred
event        Int          0 for an add, or the reason code for a remove; map these to the event_codes table
Driver_profiles
Profiles that can be used by Drivers.
Database: internal
Primary key: name
Column name                Data type    Description
name                       Varchar      Profile name
driver_properties          Longvarchar  Internal properties*
permission_properties      Longvarchar  Permissions*
description_discriminator  Longvarchar  Job description discriminator*
* Stored as XML object
Driver_users
Driver users for internal use.
Database: internal
Primary key: username
Column name  Data type    Description
username     Varchar      Driver username
password     Varchar      Driver password
hostname     Varchar      Hostname Driver is on
profile      Varchar      Driver profile used by Driver
Engine_events
The Brokers report when an Engine is added or removed; for example, when an Engine logs in or logs out.
Database: reporting
Primary key: none
Column name  Data type    Description
engine_id    Bigint       The unique ID of the Engine
time_stamp   Timestamp    Timestamp of the report
broker_id    Int          ID of Broker where event occurred
event        Int          0 for an add, or the reason code for a remove; map these to the event_codes table
Engine_info
This table contains administrative information for all Engines that have ever logged in to this Director.
Database: internal
Primary key: engine_id
Column name            Data type    Description
engine_id              Bigint       The unique ID of the Engine
username               Varchar      The username used by the Engine
guid                   Varchar      Another unique identifier, such as a MAC address
IP                     Varchar      The IP address used by the Engine
install_date           Timestamp    When the Engine was installed
last_logon_date        Timestamp    When the Engine last logged on*
last_file_update_date  Timestamp    The last successful file update to the Engine*
properties             Longvarchar  Administratively defined Engine properties**
* Deprecated; fields are no longer updated
** Stored as XML object
Engine_stats
All statistic reports from Engine Daemons are stored in this table.
Database: reporting
Primary key: none
Column name         Data type    Description
engine_id           Bigint       The unique ID of the Engine
time_stamp          Timestamp    Timestamp of the report
cpu_utilization     Float        %CPU total utilization
ds_cpu_utilization  Float        %CPU utilized by DataSynapse processes
total_ram_kb        Bigint       Installed RAM reported by the OS in kilobytes
free_ram_kb         Bigint       Free RAM reported by the OS in kilobytes
disk_mb             Bigint       Free disk reported by the OS in megabytes
num_invokes         Int          Number of Engine processes currently running
Event_codes
Table mapping event codes to reasons.
Database: reporting or internal
Primary key: none
Column name  Data type    Description
code         Int          Numeric code
name         Varchar      Description
Job_status_codes
Table mapping numeric job status codes to descriptive text.
Database: reporting or internal
Primary key: none
Column name  Data type    Description
code         Int          Numeric code
name         Varchar      Description
Jobs
Historical information about all jobs that have been run by GridServer.
Database: reporting
Primary key: job_id+start_time
Column name        Data type    Description
job_id             Bigint       Job ID
service_type_name  Varchar      The Service Type used for the Service
job_class          Varchar      Java or pseudo-java class used to create the job on the client
start_time         Timestamp    When job was started
end_time           Timestamp    When job finished
job_status         Int          Job status (see job_status_codes table)
num_tasks          Int          Number of tasks in the job
task_time_std      Float        Standard deviation of task completion time
task_time_avg      Float        Mean task completion time
priority           Int          Job priority when submitted
end_priority       Int          Job priority when complete
driver_username    Varchar      Submitting Driver username
driver_hostname    Varchar      Submitting Driver hostname
job_name           Varchar      Optional descriptive job name from JobDescription
app_name           Varchar      Optional descriptive application name from JobDescription
description        Varchar      Optional description from JobDescription
dept_name          Varchar      Optional descriptive department name from JobDescription
group_name         Varchar      Optional descriptive group name from JobDescription
indiv_name         Varchar      Optional descriptive individual name from JobDescription
broker_id          Int          ID of Broker that ran the job
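The job_status column holds a numeric code, so a report usually joins it to the job_status_codes table to get readable text. The following is a minimal Python sketch using an in-memory SQLite stand-in with only a subset of the columns. The status codes, status names, and job rows are invented placeholders, not real GridServer values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (job_id BIGINT, job_name VARCHAR, start_time TIMESTAMP,
                       end_time TIMESTAMP, job_status INT, num_tasks INT);
    CREATE TABLE job_status_codes (code INT, name VARCHAR);
    -- Hypothetical sample rows; real codes and names come from GridServer.
    INSERT INTO job_status_codes VALUES
        (1, 'sample-finished'), (2, 'sample-cancelled');
    INSERT INTO jobs VALUES
        (101, 'pricing', '2006-01-02 23:00:00', '2006-01-02 23:05:00', 1, 40),
        (102, 'risk',    '2006-01-03 23:00:00', '2006-01-03 23:01:00', 2, 10);
""")

# LEFT JOIN keeps jobs whose status code has no matching entry.
rows = conn.execute("""
    SELECT j.job_id, j.job_name, c.name, j.num_tasks
    FROM jobs AS j
    LEFT JOIN job_status_codes AS c ON c.code = j.job_status
    ORDER BY j.job_id
""").fetchall()

for row in rows:
    print(row)
```

The same pattern applies to the task_status column of the Tasks table and the task_status_codes table.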
Job_discriminators
Table of Job-based discriminators.
Database: internal
Primary key: name
Column name                Data type    Description
name                       Varchar      Name of discriminator
description_discriminator  Longvarchar  Discriminator on Job description to determine whether to attach job discriminator*
job_discriminator          Longvarchar  Engine discriminator for service
* Stored as XML object
Properties
Properties used by the Manager for its internal processing.
Database: internal
Primary key: none
Column name  Data type    Description
name         Varchar      The property name
value        Longvarchar  The property value as an XML object
Tasks
Historical information about all tasks that have been run by GridServer.
Database: reporting
Primary key: none
Column name      Data type    Description
job_id           Bigint       Job ID
task_id          Int          Task ID
engine_id        Bigint       Engine that (finally) ran task
start_time       Timestamp    When task was started
end_time         Timestamp    When task finished
task_status      Int          Task status (see task_status_codes table)
num_reschedules  Int          Number of times task was retried
engine_instance  Int          Number of Engine instance that ran task
task_info        Varchar      Task information
Task_status_codes
Table mapping numeric task status codes to descriptive text.
Database: reporting or internal
Primary key: none
Column name  Data type    Description
code         Int          Numeric code
name         Varchar      Description
Users
Administrative users for internal use.
Database: internal
Primary key: none
Column name      Data type    Description
username         Varchar      User name
user_access      Int          Authorized role
user_info        Longvarchar  Various internal info about the user
personalization  Longvarchar  UI personalization*
* Stored as XML object
User_events
Table stores historical user events.
Database: reporting
Primary key: none
Column name  Data type    Description
server       Varchar      Server where event occurred
username     Varchar      User recording event
time_stamp   Timestamp    When event occurred
handler      Varchar      Internal handler class that recorded event
event        Longvarchar  Description of event