David Walker
Page 1
BUILDING LARGE SCALEABLE CLIENT/SERVER SOLUTIONS David Walker International Oracle User Group, San Francisco, USA, 1993 Summary The replacement of 'Legacy' systems by 'Right-Sized' solutions has lead to a growth in the number of large open client/server installations. In order to achieve the required return on investment these solutions must be at the same time flexible, resilient and scaleable. This paper sets out to describe the building of an open client/server solution by breaking it up into components. Each component is examined for scaleability and resilience. The complete solution, however, is more than the sum of the parts and therefore a view of the infrastructure required around each element will be discussed. Introduction This paper will explore the proposition that large scale client server solutions can be designed and implemented by de-constructing the architecture into quantifyable components. This can be expressed as a model based on the following four server types: Database Server Application Server Batch Server Print Server Each of these components must have two characteristics. The first characteristic is resilience. Having a single point of failure will mean that the whole solution may fail. This must be designed out wherever possible. The second characteristic is scaleablity. All computers have limitations in processing power, memory and disk capacity. The solution must take into account hardware limitations and be designed to make optimum use of the underlying platform. It is assumed that the system described will be mission critical, with over two hundred concurrent users and a database not less than ten gigabytes. The growth of these sites to thousands of concurrent users and hundreds of gigabytes of data is made easier by designing the system to be scaleable.
David Walker
Page 2
The Large Scaleable Server ( Client devices connect via the ‘Client Network’ )
Database Server (multi node cluster)
Batch/Print Server (two node cluster) Server Network (FDDI)
Application Server (free standing)
Client Networks
David Walker
Page 3
Database Server The Database Server is the component traditionally thought of as 'The Server'. It is here that the Oracle RDBMS deals with requests, normally via SQL*Net. A resilient Database Server will be spread over more than one node of a clustered system using Oracle Parallel Server [OPS] in case of hardware failure within a single node. A clustered system is one where homogeneous hardware nodes are linked together. They share a common pool of disk, but have discrete processor and memory. There is a Lock Manager [LM] which manages the locking of data and control objects on the shared disks. The Lock Manager operates over a Low Latency Inter-connect [LLI] (typically Ethernet or FDDI) to pass messages between nodes. The disks are connected to all nodes through a High Bandwidth Inter-connect [HBI] (typically SCSI). Clustering a system provides additional resilience, processing power and memory capacity. However, since the complete database must be available on shared disk, the maximum database size can be no greater than the maximum database size configurable on a single node. The disks within such mission critical systems are also required to meet the resilience and scaleability criteria. The Redundant Array of Inexpensive Disks [RAID] technologies provide a solution. RAID level 0 provides striping across many disks for high performance, but it is fault intolerant. RAID level 1 implements disk mirroring, where each disk has an identical copy. RAID level 5 offers striping with parity. Whilst RAID level 5 is cheaper than full mirroring there are performance issues when a failure occurs. In the type of system that is being examined in this paper the solution is to use RAID 0 and RAID 1; striping and mirroring. The basic level of resilience is to start an OPS instance on each of two nodes. All users are then connected to one node, the other node participates in the OPS database but no connections are made to the database. When the active node fails, all users migrate to the alternate node and continue processing. The precise method of migration will depend on the network protocol in use. Any existing network connections will have to be established to the alternate node and the current transaction will be lost. The connection will require users to be directed to the alternate node. This would typically be dealt with transparently from the 'Application Server' (See below). As the application grows the load may exceed the processing power, or memory capacity of a single node. At this point the users can be distributed over multiple nodes. In order to do this the application must have been designed with OPS in mind. This requires that, where possible, the applications are 'partitioned'.
David Walker
Page 4
A large airline system, for example, may have applications for 'Reservations' and for 'Flight Operations'. Select statements will involve blocks being read from common lookup tables. Inserts and updates will take place on tables mainly within the set of tables belonging to one or other application. Partitioning these tables reduces the lock traffic being passed between each node and consequently improves the overall performance of the OPS system. Where it is impossible to partition an application more sophisticated techniques are required. A detailed discussion of these techniques is beyond the scope of this paper. It is also important to make some decisions about the use of stored procedures. These are a powerful tool that can reduce coding in front-end applications. They can also ensure code integrity as changes to standard packages need only be modified in one place. They do, however, force the load back into the Database Server. This means that scaleability is limited by the processing capacity of the Database Server. This, as it will be shown later, may not be as scaleable as putting the work back into the Application Server (See Below). There exists a number of concerns about buying 'redundant' equipment. For two node systems the second node could, for example, be utilised for a Decision Support System [DSS] database. This normally fits well with the business requirements. A DSS is normally built from the transactional data of the On-line Transaction Processing [OLTP] system. This will transfer the data via the system bus rather than over a network connection. If the node running the OLTP system fails, DSS users can be disabled and the OLTP load transferred to the DSS node. Where there are two or more OLTP nodes a strategic decision can be made to limit access to the 'most important' users up to the capacity of the remaining node(s). This is inevitably difficult to decide and even harder to manage. When one node has failed pressure is on to return the system to service. Having to make decisions about the relative importance of users may be impossible. The OPS option is available with Oracle 7.0. With Oracle 7.1 the ability to use the Parallel Query Option [PQO] became available. This allows a query to be broken up into multiple streams and run in parallel on one node. This is not normally useful in the daytime operations of an OLTP system, as it is designed to make use of full table scans that return a large volume of data. Batch jobs and the DSS database may however take advantage of PQO for fast queries. In Oracle 7.2 the ability to perform PQO over multiple nodes becomes available. This is most useful to perform the rapid extract of data in parallel from the OLTP system running across N minus one nodes whilst inserting into the DSS node.
David Walker
Page 5
Backup of the system is also critical. A system of this size will most probably use the Oracle on-line or 'hot' backup ability of the database. Backup and recovery times are directly related to the number of tape devices, the size of each device, the read/write speed of the devices and the I/O scaleability of the system and the backup software. If the system provides backup tape library management then it is desirable to use a larger number of short tape devices with high read/write speeds. Tapes such as 3480 or 3490 compatible cartridges are therefore desirable over higher capacity but less durable and slower devices such as DAT or Exabyte Recovery will require one or more datafiles to be read from tape. These can be quickly referenced by a backup tape library manager that prompts for the correct tape to be inserted. Large tapes may contain more than one of the required datafiles and may have the required datafile at the end of a long tape. This does not affect backup time but can have a considerable affect on the time taken to recover. Since all nodes are unavailable a quick recovery time is important. Further enhancements to the speed of backup will be made possible by using Oracle Parallel Backup and Restore Utility [OPBRU]. This will integrate with vendor specific products to provide high performance, scaleable, automated backups and restores via robust mainframe class media management tools. OPBRU became available with Oracle 7.0. The optimal solution for the Database Server is based on an N node cluster. There would be N minus one nodes providing the OLTP service and one node providing the DSS service. The DSS functionality is replaced by OLTP when a system fails. The number and type of backup devices should relate to the maximum time allowed for recovery. Application Server The Application Server configuration can be broken down into two types. The first type is called 'host based', the second is 'client based'. These two techniques can either be used together in a single solution, or one can be used exclusively. A host based Application Server is where the user connects to a host via telnet or rlogin and starts up a session in either SQL*Forms or SQL*Menu. This typically uses SQL*Forms v3, character based SQL*Forms v4, or the MOTIF based GUI front-end to SQL*Forms v4. This type of application is highly scaleable and makes efficient use of machine resources. The terminal is a low resource device of the 'Thin-client' type. The amount of memory required per user can be calculated as the sum of the size of concurrently open forms. The CPU load can be measured by benchmarking on a per SQL*Forms session basis and scaled accordingly. Disk requirement equals the size of the code set for the application, plus the size of the code set for Oracle.
David Walker
Page 6
Each host based Application Server can potentially support many hundreds of users, dependent on platform capability, and the growth pattern is easily predicted. When a host based Application Server is full, a second identical machine can be purchased. The code set is then replicated to that machine and users connected. Resilience is achieved by having N+1 host based Application Servers, where N is the number of machines required to support the load. The load is then distributed over all the machines. Each machine is then 100*N/(N+1)% busy. In the case of a host based Application Server failure 100/(N+1)% of the users are redistributed causing minimal impact to the users as a whole. It should be noted that these machines are not clustered; they only hold application code that is static. Since the machines are not clustered it is possible to continue to add machines indefinitely (subject to network capacity). This relates back to the issue of stored procedures mentioned above. The Database Server is limited by the maximum processing power of the cluster. The host based Application Server can always add an additional node to add power. This means that the work should take place on the host based Application Server rather than the Database Server. The more scaleable solution is therefore to put validation into the SQL*Forms, rather than into stored procedures. In practice, it is likely that a combination of stored procedures and client based procedures will provide the best solution. Host based Application Servers also make it easy to manage transition between nodes in the Database Server when a node in the Database Server fails. A number of methods could be employed to connect the client session to the appropriate server node. A simple example is that of a script monitoring the primary and secondary nodes. If the primary Database Server node is available then its host connection string is written to a configuration file. If it is not available then the connection string for a secondary Database Server node is written to the configuration file. When users reconnect they read the configuration file and connect to the correct server. Client based Application Servers are used in conjunction with the more traditional client side machines. Here a Personal Computer [PC] runs the SQL*Forms which today is likely to be Windows based SQL*Forms v4. This is the so called 'Fat Client'. In a large environment the management of the distribution of application code to maybe several thousand PCs is a system administrators' nightmare. Each PC would more than likely need to be a powerfully configured machine requiring a 486 or better processor, with 16 Mb of memory and perhaps 0.5 Gb of disk to hold the application. Best use of the power and capacity available is not made as it is an exclusive, rather than shared, resource. Microsoft's Systems Management Server [SMS] has recently become available. This may help with the management of many networked PCs. SMS provides four main functions: an inventory of hardware and software across the network; the management of networked application; the automated installation and execution of software on workstations; the remote diagnostics and control of workstations. Whilst in production, this software is yet to be tested on large scale sites.
David Walker
Page 7
The client based Application Server can help by serving the source code to each PC, typically by acting as a network file server. The SQL*Forms code is then downloaded to the PC at run-time. This overcomes the problem of disk capacity on the client PC and helps in the distribution of application code sets. It does create the problem of potentially heavy local network traffic between the PC and the client based Application Server as the SQL*Forms are downloaded. It also does not help the requirement for memory and processor power on the PC. The client based Application Server can also be used to hold the configuration file in the same manner as the host based Application Server. This again deals with the problem of Database Server node failure. The growing requirements for Graphical User Interfaces [GUI] may be driven by a misguided focus of those in control of application specification. Windows based front-end systems are seen as very attractive by those managers who are considering them. The basic operator using the system however is usually keyboard, rather than mouse, orientated. Users will typically enter a code and only go to a lookup screen when an uncommon situation arises. An example where a GUI is inappropriate is of a Bureau de Change whose old character based systems accepted UKS for UK Sterling and USD for US Dollars. In all, entering the code required eight keystrokes and took about five seconds. The new system requires the use of a drop down menu. A new 'mousing' skill was needed by the operator to open the window. More time was spent scrolling to the bottom of the window. A double click was then used to select the item. This was repeated twice. After a month of use, the best time achieved by the operator was about thirty seconds. This is six times longer than with the original character based system and any delay is immediately obvious as the operation is carried out in front of the customer. A large scaleable client/server system can take advantage of both methodologies. The general operator in a field office can make a connection to a host based Application Server and use the traditional character based system. This is a low cost solution as the high power PCs are not required and the resources are shared. The smaller number of managers and 'knowledge workers' who are based at the head office or in regional offices can then take advantage of local client based Application Servers and GUIs. Although these users require high specification PCs the ratio of knowledge workers to operators makes the investment affordable. In order to make the task of building application code simpler, it is advantageous to code for a standard generic batch and print service. Since the platform each component is running on may be different (for example: a UNIX Database Server, a Windows/NT client based Application Server, etc.), SQL*Forms user exits should be coded to one standard for batch and printing. A platform specific program should then be implemented on each server to work with a local or remote batch queue manager.
David Walker
Page 8
Batch Server Batch work varies from site to site, and is often constrained by business requirements. A Batch Server will manage all jobs that can be queued. The Batch Server may also act as the Print Server (See below). The batch queues have to be resilient and when one node fails another node must be capable of picking up the job queues and continuing the processing. Some ways of achieving this resilience are suggested below. The first is to implement the queue in the database. The database is accessible from all nodes and the resilience has been achieved. The alternate node therefore need only check that the primary node is still running the batch manager, and if not start up a local copy. The second method is for the Batch Server to be clustered with another machine (perhaps the Print Server), and a cluster-aware queue manager product employed. The queues set up under such a product can be specified as load-balancing across clustered nodes, or queuing to one node, with automatic job fail-over and re-start to an alternate node on failure. An alternative clustered implementation would be to maintain the queue on shared disk. When the Batch Server node fails the relevant file systems are released and acquired by the alternate node, which examines the queue and restarts from the last incomplete job. Two other requirements exist. The first is the submission of jobs; the second is the processing of jobs. The submission of jobs requires that the submitting process sends a message to the batch manager or, in the case of a queue held in the database, inserts a row into the queue table. This can be done via the platform specific code discussed above. If the queue is held in the database the workload is increased on the Database Server as the batch queue is loaded, however it is also easy to manage from within the environment. Job processing depends on the location of the resource to perform the task. The task can be performed on the Batch Server, where it can have exclusive access to the machine's resources. This requires that the code to perform the task is held in one place and the ability to process batch requests is therefore limited by the performance of the Batch Server node. Alternatively the Batch Server can distribute jobs to utilise available Application Server systems. These machines must each contain a copy of the batch processing code and a batch manager sub-process to determine the level of resource available. If an Application Server finds that the system has spare capacity then it may take on a job. The batch manager sub-process then monitors the load, accepting more jobs until a predefined load limit is reached.
David Walker
Page 9
In practice it is common to find many batch jobs are run in the evening when the Application Servers are quiet, with a small number of jobs running during the day. To manage this effectively the system manager may enable the batch manager subprocesses on the Application Servers only outside the main OLTP hours. During the OLTP operational window batch jobs are then limited to the Batch Server. This capitalises on available resource whilst ensuring that batch and OLTP processing do not interfere. It should be noted that all jobs submit work to the Database Server. Large batch jobs that take out locks will inhibit the performance of the database as a whole. It is important to categorise jobs that can be run without affecting OLTP users and those which can not. Only those tasks that do not impact the Database Server should be run during the day. Batch Servers of this nature are made from existing products combined with bespoke work tailored to meet site requirements. Print Server The Print Server provides the output for the complete system. There are many possible printing methods that could be devised. This paper will look at three. The first is the distributed Print Server. Each place where output is produced manages a set of queues for all possible printers it may use. Printers need to be intelligent and able to manage new requests whilst printing existing ones. As the load increases the chance of a printout being lost or of a queue becoming unavailable increases. The management of many queues also becomes very complex. An alternative is to use a central queue manager. This can be integrated with the batch manager mentioned above and use either of the queuing strategies suggested. The pitfall with this method is that the output file has to be moved to the Print Server before being sent to the printer. This involves a larger amount of network traffic that may cause overall performance problems. The movement of the file from one system to another may be achieved either by file transfer or by writing the output to a spool area that is Network File System [NFS] mounted between the systems. This will also stress the print sub-system on the Print Server. A third method requires the print task be submitted to a central queue for each printer on the Print Server (or in the database). This requires that the task submission includes the originating node. When the task is executed a message is sent back to the submitting node. The Print Server sub-process then sends the output to the local queue. On completion the sub-process notifies the Print Server. The Print Server can then queue the next job for a given printer. In this way the network is not saturated by sending output files over the network twice, but there is a cost associated with managing the queues on machines that can submit print requests. As has already been suggested the Print Server can either be combined with the Batch Server, or clustered with it for resilience. When a failure occurs each can take on the role of the other.
David Walker
Page 10
The above illustrates that printer configuration requires considerable thought at the development stage. Each site will have very specific requirements and generic solutions will have to be tailored to meet those requirements. As previously mentioned, this type of solution requires a combination of existing product and bespoke work. Backup and Recovery Backup and recovery tasks have already been identified for the Database Server. This is by far the most complex server to backup as it holds the volatile data. All the other servers identified remain relatively static. This is because they contain application code and transient files that can be regenerated if lost. Most systems are replicas of another (Application Servers being clones and the Batch/Print Servers being clustered), but full system backups should be regularly taken and recovery strategies planned. In the case of a complete black-out of the data centre it is important to understand what level of service can be provided by a minimal configuration. The size of the systems means that a complete hot site is unlikely to be available. The cost of the minimal configuration should be balanced against the cost to the business of running with a reduced throughput at peak times. The full recovery of a system is non-trivial and requires periodic full rehearsals to ensure that the measures in place meet the requirements and that the timespans involved are understood. Systems Management Effective management of a large number of complex systems requires a number of specialist tools. Many such tools have become available over the last two years. Most have a component that performs the monitoring task and a component that displays the information. If the monitoring component runs on the a server then it can increase the load on that server. Some sites now incorporate SNMP Management Information Bases [MIBs] into their applications and write low-impact SNMP agents to look at the system and database. Information from these MIBs can be requested be the remote console component. It is desirable that any monitoring tool has the ability to maintain a history, trigger alerts and perform recovery actions. Long term monitoring of trends can assist in predicting failures and capacity problems. This information will help avoid system outages and maintain availability.
David Walker
Page 11
Networking The network load on these systems can be very complex. It is recommended that a network such as FDDI with its very high bandwidth is provided between the servers for information flow. Each machine should also have a private Ethernet or Token Ring for systems management. This network supports the agent traffic for the system management and system administrators' connections. The users are only attached to the Application Servers via a third network. Where clusters are in use each LLI will also require network connections. The distribution over many networks aids resilience and performance. Integrating the Solution The methodology discussed above is complex. Much of the technology used will be leading edge. Many design decisions will be made before the final hardware platform is purchased. A successful implementation requires hardware (including the network), the database and an application. Each element must have a symbiotic relationship with the others. This has worked best where hardware vendor, database supplier, application provider (where appropriate) and customer put a team together and work in partnership. Staffing of the new system will be on a commensurate scale requiring operators and administrators for the system, database and network. A '7 by 24' routine and cover for leave and sickness means that one person in each role will be insufficient. Conclusion The building of a large scaleable client/server solution is possible with existing technologies. An important key to success is to design the system to grow. This should be done regardless of whether the current project requires it or not. The easiest way to manage the design process is to break up the system into elements and then examine each element for scaleability and resilience. Then bring all the elements together and ensure that they work together. Do not attempt to do this on your own but partner with your suppliers to get the best results from your investment.