Atomic Commit Protocols
A thesis submitted to the University of Manchester for the degree of M.Phil. in the Faculty of Humanities.
2005
Sumit Agarwal
School of Informatics
CONTENTS

1. Introduction .......... 15
   1.1 Aims and Objectives .......... 16
   1.2 Chapter Outline .......... 18
2. Background .......... 19
   2.1 Transaction Processing .......... 19
       2.1.1 Flat Transactions .......... 20
       2.1.2 Spheres of Control .......... 23
       2.1.3 Nested Transactions .......... 28
       2.1.4 SAGAS .......... 30
       2.1.5 ASSET .......... 30
       2.1.6 Distributed Transaction Processing .......... 30
   2.2 Fault Tolerant Distributed Computing .......... 32
       2.2.1 Failure .......... 32
       2.2.2 Properties of a distributed system: Liveness and Safety .......... 33
       2.2.3 Fault tolerance .......... 33
   2.3 Consensus Problem .......... 36
   2.4 Atomic Commit Protocols .......... 36
       2.4.1 Generic Commit Protocol .......... 37
       2.4.2 Two Phase Commit (2PC) Protocol .......... 39
       2.4.3 Non-blocking atomic commit algorithms .......... 41
       2.4.4 Backup Commit .......... 49
   2.5 Summary .......... 50
3. Algorithm Overview .......... 51
   3.1 Two Phase Commit (2PC) .......... 51
   3.2 Paxos Commit .......... 55
   3.3 Messaging Analysis .......... 59
       3.3.1 Two Phase Commit .......... 59
       3.3.2 Paxos Commit .......... 60
   3.4 Analysis of Disk Writes .......... 61
       3.4.1 Two Phase Commit .......... 61
       3.4.2 Paxos Commit .......... 62
   3.5 Recommended Optimisation .......... 62
   3.6 Alternate Algorithm .......... 63
   3.7 Summary .......... 64
4. Design and Implementation of Simulation Framework .......... 65
   4.1 Motivation for the Design .......... 65
       4.1.1 Overview of Simulations .......... 65
       4.1.2 Assumptions .......... 66
   4.2 Simulated Message Passing .......... 66
       4.2.1 Simulation .......... 68
       4.2.2 Messages .......... 68
       4.2.3 Counters .......... 69
       4.2.4 2PC Processes .......... 69
       4.2.5 Paxos Commit Processes .......... 69
       4.2.6 Optimisation Implementation .......... 70
   4.3 Simulation Using Message Passing Library .......... 71
       4.3.1 Counters .......... 71
   4.4 Design of Modified Paxos .......... 72
       4.4.1 Disk Write Analysis .......... 73
   4.5 Constants Used .......... 73
       4.5.1 Simulated Messaging .......... 73
       4.5.2 Message Passing .......... 74
   4.6 Summary .......... 74
5. Results and Analysis .......... 76
   5.1 Glossary .......... 76
   5.2 Overview .......... 77
   5.3 Failure Free Environment .......... 78
       5.3.1 Simulated Message Passing .......... 78
       5.3.2 Message Passing .......... 88
   5.4 Message Failures .......... 91
       5.4.1 Simulated Message Passing .......... 92
       5.4.2 Message Passing .......... 102
   5.5 Process and Message Failures .......... 104
       5.5.1 Simulated Message Passing .......... 104
       5.5.2 Message Passing .......... 114
   5.6 Disk Write Analysis .......... 115
       5.6.1 Paxos .......... 115
       5.6.2 Modified Paxos .......... 116
   5.7 Summary .......... 116
6. Conclusions and Further Work .......... 117
   6.1 Conclusions .......... 118
   6.2 Further Work .......... 119
       6.2.1 Simulation Analyser .......... 119
       6.2.2 Low Latency Non-Blocking Atomic Commit .......... 120
       6.2.3 Modified Paxos .......... 121
       6.2.4 Disk Write Optimisation .......... 121
7. References .......... 122
Appendix A Algorithm for Leader Election .......... 125
Appendix B Result for 2PC with mean message delay of 1.4ms .......... 126
Appendix C Average Transaction Commit Time for 2PC .......... 127
List of Figures

Figure 1 SoC for containing intermediate results .......... 25
Figure 2 Example to illustrate Dynamic creation of SoC .......... 26
Figure 3 Tracing back and creation of Sphere F as recovery environment .......... 27
Figure 4 Extension of Sphere F to include all dependent spheres .......... 27
Figure 6 Commitment control in Nested Transaction .......... 29
Figure 7 Commitment control in Distributed Transaction .......... 31
Figure 9 Set diagram for Fault and Fault Tolerance .......... 35
Figure 10 .......... 38
Figure 11 .......... 39
Figure 12 .......... 43
Figure 13 .......... 46
Figure 14 State Transition Diagram Legend .......... 51
Figure 15 State Transition Diagram for 2PC Transaction Manager .......... 52
Figure 16 State transition diagram for 2PC Resource Manager .......... 53
Figure 17 State Transition Diagram for Client .......... 54
Figure 18 State Transition Diagram for Registrar .......... 56
Figure 19 State Transition Diagram for RM .......... 57
Figure 20 State Diagram for an Instance of the Paxos Algorithm on an Acceptor .......... 59
Figure 21 Class Diagram for the various processes of 2PC .......... 67
Figure 22 Interactions between the various classes .......... 67
Figure 23 Internal working of the message class .......... 69
Figure 24 Class diagram for the processes of Paxos Commit .......... 70
List of Tables

Table 1 Summary of Results for 2PC in fault free environment .......... 78
Table 2 Transaction completion times for 2PC in fault free environment .......... 79
Table 3 Results for Paxos Commit without optimisation in the failure free environment .......... 79
Table 4 Commit and Abort times for Paxos Commit without optimisation in failure free environment .......... 80
Table 5 Results for Paxos Commit with message bundling in failure free environment .......... 81
Table 6 Transaction completion times for Paxos Commit with bundling in fault free environment .......... 81
Table 7 Results for Paxos commit without optimisation in the fault free environment where one fault is tolerated .......... 82
Table 8 Transaction completion times for Paxos Commit with one fault setting in fault free environment .......... 83
Table 9 Results for Paxos commit with message bundling with one fault to tolerate in fault free environment .......... 84
Table 10 Transaction completion times for Paxos commit with one fault to tolerate with message bundling in fault free environment .......... 84
Table 11 Results for Paxos commit with disk write optimisation with one fault to tolerate in fault free environment .......... 85
Table 12 Transaction completion times for Paxos commit with disk write optimisation with one fault to tolerate in fault free environment .......... 85
Table 13 Results for Paxos with one fault to tolerate with disk write optimisation and message bundling in fault free environment .......... 86
Table 14 Transaction completion times for Paxos with one fault to tolerate with disk write optimisation and message bundling in fault free environment .......... 86
Table 15 Results for Modified Paxos in fault free environment .......... 87
Table 16 Transaction commit times for Modified Paxos in fault free environment .......... 87
Table 17 Results for Modified Paxos where it tolerates one fault in fault free environment .......... 88
Table 18 Transaction commit times for Modified Paxos with one fault tolerance in fault free environment .......... 88
Table 19 Results for 2PC in fault free environment .......... 89
Table 20 Transaction completion times for 2PC in fault free environment .......... 89
Table 21 Results for Modified Paxos in fault free environment with no faults to tolerate .......... 90
Table 22 Transaction completion times for Modified Paxos in fault free environment with no faults to tolerate .......... 90
Table 23 Results for Modified Paxos in fault free environment with one fault to tolerate .......... 91
Table 24 Transaction completion times for Modified Paxos in fault free environment .......... 91
Table 25 Results for 2PC where messages fail .......... 92
Table 26 Transaction completion times for 2PC with message failures .......... 93
Table 27 Results for Paxos Commit with message failures .......... 94
Table 28 Transaction completion time for Paxos Commit with message failures .......... 95
Table 29 Results for Paxos Commit with Durable Write optimisation where environment has message failures .......... 96
Table 30 Transaction completion times for Paxos Commit with disk write optimisation where message failures occur .......... 97
Table 31 Results for Paxos commit with optimisation and bundling in message failure environment .......... 98
Table 32 Transaction completion times for Paxos commit with optimisation and bundling in environment with message failures .......... 99
Table 33 Results for Modified Paxos in environment with message failures .......... 100
Table 34 Transaction completion times for Modified Paxos in message failure environment .......... 101
Table 35 Results for 2PC where there are message failures .......... 102
Table 36 Transaction completion times for 2PC where there are message failures .......... 102
Table 37 Results for Modified Paxos in message failure environment .......... 103
Table 38 Transaction completion times for Modified Paxos where message failures occur .......... 103
Table 39 Results for 2PC with all failures .......... 104
Table 40 Transaction completion times for 2PC with all failures .......... 105
Table 41 Results for Paxos commit with all failures .......... 106
Table 42 Transaction completion time for Modified Paxos with all failures .......... 108
Table 43 Results for Paxos commit with disk optimisation .......... 108
Table 44 Transaction completion times for Paxos commit with disk optimisation .......... 110
Table 45 Results for Paxos commit with optimisation and message bundling .......... 110
Table 46 Transaction completion times for Paxos commit with optimisation and message bundling .......... 112
Table 47 Results for Modified Paxos .......... 112
Table 48 Transaction completion time for Modified Paxos .......... 113
Table 49 Results for 2PC with all failures .......... 114
Table 50 Transaction completion times for 2PC with all failures .......... 114
Table 51 Results for Modified Paxos with all failures .......... 115
Table 52 Transaction completion times for Modified Paxos .......... 115
Table 53 Results for 2PC with 1.4ms mean message delay .......... 126
Table 54 Transaction commit times for 2PC with 1.4ms mean message delay .......... 126
Abstract

A transaction is a logical collection of tasks that must either all complete successfully or all fail as a group. A system that provides tools to accomplish this is known as a transaction processing system. In a distributed environment, where the tasks that make up a transaction are spread across physically distinct nodes, a protocol is required to decide whether the transaction should be completed successfully (Commit) or cancelled (Abort); such a protocol is called an Atomic Commit Protocol.

The protocol used in most distributed transaction processing systems is the Two Phase Commit (2PC) protocol. In 2PC a single coordinator collects each participant's status of the work; the coordinator decides to Commit the transaction if all participants have successfully completed their part of the transaction, otherwise the transaction is Aborted. 2PC is a blocking protocol, i.e. if the coordinator fails while the participants are waiting for a decision, all the participants remain blocked until the coordinator recovers and terminates the transaction. As alternatives, Non-Blocking Atomic Commit (NB-AC) protocols have been suggested; all NB-AC protocols require a synchronous environment.

Paxos Commit is an atomic commit protocol that solves a weaker form of NB-AC: the protocol is non-blocking only if the network is non-faulty for long enough. It uses an instance of the Paxos Consensus algorithm to achieve consensus on each participant's status. Paxos Consensus is a fault-tolerant consensus algorithm that uses 2F+1 acceptors (processes that receive values from proposing processes and pass the first value received to the Leader) and can proceed as long as F+1 of them are active.

The study investigates and develops an alternative algorithm, Modified Paxos, based on the Paxos consensus algorithm, and a disk write optimisation to Paxos Commit that utilises the available redundancy to improve performance. The algorithms have been simulated in a framework designed to observe overall transaction commit times in varying failure conditions. The results show that 2PC is faster than Paxos Commit and Modified Paxos in a fault free environment. This is true even when the two Paxos-consensus-based commit algorithms are run with fault tolerance switched off. The disk write optimisation improves the performance of Paxos Commit for committed transactions by 46%. Modified Paxos, compared with unoptimised Paxos Commit, is 4.5% faster for committed transactions. Modified Paxos also has the advantage of not using a single Transaction Manager process; the current leader of the acceptor processes acts as the Transaction Manager.
DECLARATION

No portion of the work referred to in the thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.
COPYRIGHT STATEMENT

i. Copyright in the text of this thesis rests with the author. Copies (by any process) either in full, or of extracts, may be made only in accordance with instructions given by the author and lodged in the John Rylands University Library of Manchester. Details may be obtained from the Librarian. This page must form part of any such copies made. Further copies (by any process) of copies made in accordance with such instructions may not be made without the permission (in writing) of the author.

ii. The ownership of any intellectual property rights which may be described in this thesis is vested in The University of Manchester, subject to any prior agreement to the contrary, and may not be made available for use by third parties without the written permission of the University, which will prescribe the terms and conditions of any such agreement.

iii. Further information on the conditions under which disclosures and exploitation may take place is available from the Head of the School of Informatics.
ACKNOWLEDGMENT

I would like to thank Dr. Jim Gray for his guidance and for the initial framework which helped in the implementation of Paxos Commit. I would like to thank Prof. John Keane for all the guidance and support he has provided throughout my research. Finally, I am really grateful for the support of my family and friends, who have been there to help me through everything.
1. Introduction

"In computer science, a transaction is a group of logical operations that must all succeed or fail as a group. Systems dedicated to supporting such operations are known as transaction processing systems. [1]"

Transaction processing is a part of everyday life. With an increasing amount of data being processed every day, the data needs to be split over multiple databases. This gives rise to transactions being processed in a distributed manner. There are various transaction properties that need to be maintained for distributed transaction processing. The group of logical actions that form a transaction must succeed or fail as a group; the actions across all databases that are part of a distributed transaction either succeed or fail, as a group, in updating the data, i.e. a consistent outcome. An Atomic Commit Protocol ensures that transactions terminate consistently across all participating resources even in the presence of failure [2].

In certain failure scenarios the most common Atomic Commit Protocol, Two Phase Commit (2PC) [3], is unable to decide a consistent outcome without waiting for a failed site to recover; the protocol is therefore said to be blocking [4]. In large real-world distributed systems, failures are the rule rather than the exception and so must be dealt with at all levels – problem specification, algorithm design, and analysis [5]. Various protocols [4, 6, 7] have been suggested over the years that attempt to solve the problem of blocking in Atomic Commit Protocols; these are termed Non-Blocking Atomic Commit (NB-AC) protocols. NB-AC protocols permit all correct participants to come to a commit decision and terminate the transaction. There have been various studies of NB-AC protocols [2, 8, 9].

The problem of choosing a single value out of multiple values proposed by a collection of processes is the consensus problem [10]. Atomic commit has been compared in the literature with the consensus problem [11], and there have been various studies of the solvability of consensus in asynchronous environments [12-14]. In an asynchronous system no assumptions can be made about process execution speeds and/or message delivery delays [15]. Much research has been done on consensus and NB-AC in asynchronous systems with failures, but NB-AC in an asynchronous environment has not been empirically evaluated.
1.1 Aims and Objectives

Chapter 2 briefly describes existing atomic commit protocols. 2PC, Paxos Commit and Backup Commit are implementable in a simple asynchronous system. All the other protocols discussed require either a synchronous system or advanced underlying frameworks, such as virtual synchrony, to implement. Paxos Commit is an NB-AC protocol with weakened requirements for Non-Blocking and Non-Triviality; it also has a message delay that is only one message cycle more than that of 2PC. These characteristics suggest that Paxos Commit has the potential to be implemented in real-world systems, as messaging in modern local area networks is cheap. By comparison, 3PC has not been used in real-world systems despite having been around for many years [16].

The aims are to:
1. Investigate asynchronous NB-AC protocols.
2. Develop a simulation of a complete transaction system with the Paxos Commit protocol.
3. Compare the performance of the Paxos Commit protocol with 2PC by measuring the overall transaction completion times for both protocols in varying failure conditions.
4. Develop an alternative algorithm, using the Paxos Consensus algorithm, that improves on the performance of Paxos Commit.

Paxos is non-blocking only if the network is non-faulty for long enough; blocking can still occur if the entire network is faulty throughout the transaction. Thus, in this study, the overall numbers of aborted and committed transactions over a set period of time are compared for the two protocols. The average time taken is also measured.

The performance analysis outlined by Gray et al. [6] for the Paxos Commit algorithm focuses on the number of messages sent between the participants and the coordinators and the number of forced writes that have to be done during the commit decision-making process. Although the number of messages sent and the forced disk writes are important indicators of the performance of a distributed commit algorithm, other factors also contribute to it. The algorithms are compared based on:

1. Performance in a failure-free environment. The results in a failure-free environment provide a base value for the performance of the algorithms. These results can be used to determine the performance overhead of using a consensus-based atomic commit protocol.
2. Resilience to failure. When failure occurs, a blocking protocol such as 2PC grinds to a halt, whereas the Paxos Commit algorithm is non-blocking provided a sufficiently large set of nodes is non-faulty for long enough. Because this non-blocking behaviour holds only under a weakened condition, it is essential to measure the resilience of the Paxos Commit algorithm to various failures. Using the results obtained for the algorithms running in a failure-free environment, a comparison can be made to determine the resilience of the algorithms to failures.

The faults that can be simulated are:

• process failures – any of the processes can fail at random;

• message non-delivery – the messaging system might be unreliable and messages may not be delivered to the recipient;

• late or out-of-order delivery – in addition to messages not being delivered at all, messages may be delivered late; in scenarios where processes are coordinating actions, timely arrival of messages can be important.

In this study the NB-AC algorithm with weakened requirements suggested by Gray et al. [6] has been simulated and compared with the 2PC protocol in an asynchronous environment with failures.
1.2 Chapter Outline

Chapter 2 reviews the literature to present an introduction to transaction processing, failure and fault tolerance, consensus, and atomic commit. Chapter 3 goes into the details of the two Atomic Commit protocols, 2PC and Paxos Commit, that have been simulated, and presents detailed algorithms for both. Paxos Commit utilises redundancy to achieve fault tolerance. The available redundancy can also be used to gain a performance advantage by not performing disk writes, which are otherwise important for failure recovery; this optimisation is analysed in Chapter 3. A different method of implementing the Paxos Consensus algorithm to achieve a commit algorithm was designed, and this is also discussed in Chapter 3.

A simulation framework was designed in order to obtain results that can be analysed to compare the transaction commit/abort numbers and the transaction completion times for the 2PC and Paxos Commit algorithms. Gray et al. [6] compared 2PC with Paxos Commit theoretically by analysing the number of message delays and disk write delays; the simulation here aims to test these empirically. This appears to be the first reported simulation of Paxos Commit [17] and thus the first empirical comparison with the 2PC protocol. Chapter 4 presents details of the design motivation and the design used. The results and analysis are presented in Chapter 5. The findings of the study are presented as conclusions in Chapter 6. During the course of the study, aspects of further work were identified to improve NB-AC in asynchronous systems; these are also examined in Section 6.2.
2. Background

A distributed commit protocol is essential to maintain consistency in distributed transaction systems. In this study, simulations of two distributed commit protocols, Paxos Commit and Two Phase Commit, have been implemented, and a new algorithm based on Paxos Commit has been designed and implemented. The results from these simulations have been compared on the basis of average transaction completion time and commit-to-abort ratio for each protocol running for a certain duration in varying fault conditions.

This chapter introduces the concepts of transaction processing, distributed transaction processing and commit protocols. In a system that operates within its agreed specification, no extra primitives are required to maintain the well-known transaction properties; if failure occurs, extra primitives are required to ensure that these transaction properties hold. This chapter therefore also describes failures and fault tolerance and how these affect commit protocols. Section 2.2 gives an outline of failure and fault tolerance. Section 2.3 gives an introduction to the consensus problem. Section 2.4 defines atomic commitment and then presents a generic commit protocol, followed by a description of some known commit protocols.
2.1 Transaction Processing

A transaction processing system provides tools to ease or automate application programming, execution, and administration. Transaction processing applications typically support a network of devices that submit queries and updates to the application. Based on these inputs, the application maintains a database representing some real-world state. Application responses and outputs typically drive real-world actuators and transducers that alter or control the real-world states. The applications, database, and network tend to evolve over several decades. Increasingly, the systems are geographically distributed, heterogeneous, continuously available and have stringent response time requirements [18].

A transaction is a state change to a database representing some real-world state. For example, Customer A deposits £500 into his account; the account is part of the database and the deposit is the state change to that database. The database is updated to reflect the new balance of the account. A transaction can be considered a collection of actions with the following properties [18]:

Atomicity – a state transition is said to be atomic if it appears to jump from the initial state to the result state without any intermediate steps, or if it appears as though it never left the initial state. A transaction is atomic when it behaves atomically to any outside observer, i.e., the outside observer should not be able to view any of the intermediate results generated by any intermediate steps.

Consistency – a transaction produces consistent results only; otherwise it aborts. A result is consistent if the new state of the database fulfils all the rules or integrity constraints of the application. If an integrity constraint states that all accounts must have a positive balance, then any transaction violating this rule will be aborted.

Isolation – a program running under transaction protection must behave exactly as it would in single-user mode. This behaviour applies only in terms of an outside observer: the results obtained at the end of a transaction are the same irrespective of whether it was a single-user or a multi-user system. However, the data is still shared between transactions.

Durability – the results of transactions that have completed successfully must not be forgotten by the system. Once the system acknowledges the execution of a transaction, it must be able to re-establish its results after any type of subsequent failure.
2.1.1 Flat Transactions

There are various types of transactions, based on the control structure they provide for controlling how the ACID properties described above are implemented. Flat transactions represent the simplest type of transaction and are the basic building block for organising an application into atomic actions. These transactions are called flat because there is only one layer of control by the application: everything inside the BEGIN and COMMIT brackets is at the same level; that is, the whole transaction will either be committed together, or it will be rolled back as a whole. Flat transactions are also called ACID transactions (because of the four properties Atomicity, Consistency, Isolation, and Durability, as above). Flat transactions are very simple and easy to implement, but as a result are not applicable to more complex problems. One of the major restrictions of flat transactions is that there is no way of committing or aborting parts of such transactions, or of committing results in several stages [18].

An example where a flat transaction is applicable is a banking system where a debit or credit needs to be applied to an account. The system is also required to update the running balance of the teller and the branch, and finally to file a history record with the details of the transaction. These changes may be carried out in four steps (a sketch of the corresponding transaction follows the list):
• the account is updated with a debit or credit

• the teller's balance is updated

• the branch balance is updated

• the history record is inserted
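The following is a minimal sketch of how these four steps can be grouped into a single atomic unit. It uses Python's standard sqlite3 module purely for illustration; the table and column names (accounts, tellers, branches, history) are assumptions, not taken from this thesis.

import sqlite3

def post_debit_credit(conn, account_id, teller_id, branch_id, amount):
    """Apply a debit/credit as one flat (ACID) transaction:
    all four updates commit together, or none of them do."""
    with conn:  # opens a transaction; commits on success, rolls back on any error
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, account_id))
        conn.execute("UPDATE tellers SET balance = balance + ? WHERE id = ?",
                     (amount, teller_id))
        conn.execute("UPDATE branches SET balance = balance + ? WHERE id = ?",
                     (amount, branch_id))
        conn.execute("INSERT INTO history (account_id, amount) VALUES (?, ?)",
                     (account_id, amount))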
These four changes form a single flat (ACID) transaction performed within a single BEGIN and COMMIT block. If an execution performs only a part of these tasks, it is considered inconsistent and its effects are not carried through. Flat transactions thus isolate the application from faults that would otherwise result from an execution not adhering to the ACID properties. In addition to the protection provided to the application, flat transactions are simple to implement. However, the simple control structure of flat transactions does not fit well with the requirements of complex applications that need a higher degree of control. Examples illustrating this are provided in [18]:
• Trip Planning – A person wants to travel from place A to place F. Since there is no mode of travel that could take the person directly from A to F, the travel agent needs to break the journey down into several segments, which may involve travel by air, train or rental car and may require overnight stays in hotels:

  BEGIN WORK
  S1. Book flight from A to B
  S2. Book flight from B to C, same day
  S3. Book flight from C to D, same day

  The only way of getting from D to F on the same day is by rental car, but the person does not wish to drive at night. There may be other options for travelling from D to F via another place E, or the person may travel onwards from place C using a different route and mode of travel. In a normal flat transaction the entire booking would have to be cancelled, which is unnecessary, as there is no need to cancel the first two steps. The application instead requires a control structure in which selective rollback can be carried out (see the savepoint sketch after this list).

• Bulk Updates – At the end of a month a bank updates all its accounts with the debit or credit interest accumulated for that month. A flat transaction is fine if the entire process completes without any faults. However, if the database crashes after 100,000 accounts have been updated, the flat transaction requires that all these updates are rolled back. These updates will already have consumed considerable processing time, and rolling them back is unnecessary; a better option is to continue from the last updated account once the database recovers from the crash.
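As an illustration of such a control structure, the sketch below uses SQL savepoints through Python's sqlite3 module to roll back only the unsatisfactory third leg while keeping the first two bookings. The bookings table and leg names are hypothetical, and savepoints are only one possible mechanism for selective rollback, not something proposed in this thesis.

import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage the transaction manually
conn.execute("CREATE TABLE bookings (leg TEXT)")

conn.execute("BEGIN")
conn.execute("INSERT INTO bookings VALUES ('A-B')")   # S1
conn.execute("INSERT INTO bookings VALUES ('B-C')")   # S2
conn.execute("SAVEPOINT leg3")                        # point we may want to undo to
conn.execute("INSERT INTO bookings VALUES ('C-D')")   # S3: no acceptable onward travel
conn.execute("ROLLBACK TO SAVEPOINT leg3")            # undo S3 only; S1 and S2 survive
conn.execute("INSERT INTO bookings VALUES ('C-E')")   # book an alternative leg instead
conn.execute("COMMIT")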
The traditional flat transaction approach is the best approach when dealing with hard-coded systems in which the ACID properties are essential [18]. However, in many modern applications, the ACID properties may be:

• unnecessary – e.g., maintaining consistency of some street traffic management data may be unnecessary, since the data will be overwritten with new data anyway;

• impossible to satisfy – e.g., it is impossible to model a travel reservation activity involving multiple airlines and hotels as a transaction, because the different companies will not allow atomic commit protocols to be run across their separate databases;

• undesirable – e.g., it is obviously undesirable to model the activities of two collaborating humans as transactions, because then they would (by definition) be isolated, i.e., uncollaborative.
For the above reasons, more expressive primitives are needed to model composite activities and the commitment of their components. Although the flat transaction model [18] is simple and powerful, it is found to be lacking in functionality and performance when applied to applications that have reactive, open-ended or collaborative activities [19], like the Trip Planning and Bulk Updates examples above. There are many such application scenarios that are not handled efficiently by flat transactions. To address these shortcomings of the standard flat transaction, extensions to the transaction model have been proposed [19-23]. New types of applications normally result in new, specialised transaction models, and each transaction model is at least partially incompatible with the others [18]. For example, the commit and abort dependencies between sub-transactions are very different in nested transactions and SAGAS. A brief description of some of these transaction models is provided in the following sections.
2.1.2 Spheres of Control

Davies [24] developed the concept of spheres of control, which triggered the subsequent development of the transaction paradigm. This was the first attempt to investigate the interrelated issues of control, recovery, and concurrency in integrated systems in a coherent fashion. A framework is presented for better definition of the audit and control aspects of data processing applications. It introduces the concept of spheres of control, which are logical boundaries that exist in all data processing systems. Spheres of control define process bounding for recovery and process commitment [18]. At the core of the notion of spheres of control is the observation that controlling computations in a distributed multi-user environment primarily means:

1. containing the effects of arbitrary operations as long as there might be a necessity to revoke them, and
2. monitoring the dependencies of operations on each other in order to be able to trace the execution history in case faulty data is found at some point.

Gray et al. [18] describe the dynamic creation of spheres of control around actions accessing data. In this model all the dependencies between data and computation are recorded so that, in case of a subsequent problem, the execution history can be traced back and all affected computation can be redone or invalidated. Spheres of control may be defined statically, by structuring the system into a hierarchy of abstract data types, or created dynamically through the interaction among spheres of control on shared data. The dynamic creation of spheres of control brings to attention the concept of commitment control. Spheres of control become dependent on each other due to shared intermediate data; the spheres of control sharing this data are then encapsulated within a dynamically created sphere of control. A complete log of dependencies between data and processing is maintained, to be able to trace back all affected processes in case of a failure. The following definitions are quoted from Gray et al. [18].
• Process control. Process control ensures that the information required by an atomic process is not modified by others, and it constrains the dependencies that another process may place on the update made by this process.

• Process atomicity. Process atomicity is the amount of processing one wishes to consider as having identity, i.e., an outside observer considers atomic. It is the control over processing that permits an operator to be atomic at one level of control, while the implementation of that operator may consist of many parallel and/or serial atomic operators at the next lower level of control.

• Process commitment. While a function is in process, changes of state are being made only by that function or they are expressly known to that function. This allows the establishment of a single point to which one can return for rerun purposes independently of the error detected or the number of functions since that point. Preventing process commitment by holding (controlling) the use of its results permits the system to perform a unilateral back out (process undo) over much larger units of process. Unilateral here means without having to request permission from each participant in a process. In summary, process commitment control is the containment of the effects of a process, even beyond the end of the process.
In the concept of spheres of control (SoC), the commitment of data produced by an atomic process is not strictly bound to the completion of the process. It is possible to dynamically create a new enclosing process that will control the commitment of the result of the first process.
Figure 1 SoC for containing intermediate results
Figure 1 demonstrates an example of the case above. Each solid sphere represents a predefined hierarchy of functions, each executing in its own sphere of control as an atomic process. The highest level processes A1 and A2 are executed sequentially. If for some reason A1 decides that its data D is not ready to be committed to the outside world, A2 can start working on this data after a new, higher-level sphere of control, S, has been created. S now contains the joint activities of A1, A2 and whatever might follow after that. S can be terminated when all processes depending on A1’s result have been terminated, and A1 has decided to commit its result. The dynamically created spheres for commitment control allow processing data that are not ready for access by the outside world and would otherwise require the application to stop or undergo some checking procedure. The concept of dynamic creation of spheres of control also allows the description of process dependencies on data that are found to be incorrect long after they have been committed.
Figure 2 Example to illustrate Dynamic creation of SoC (spheres of control A, B, C and E and data items D1–D6 plotted against time). Source: Gray, J. and Reuter, A., Transaction Processing: Concepts and Techniques, 1993, Morgan Kaufmann Publishers.
In Figure 2, the dotted line denotes current time and time progresses from left to right. There are three completed spheres of control, A, C and E. While sphere B is executing, a problem is found. There is no way to unilaterally back out of the work done so far, because other processes, which may or may not depend on faulty data, could have already committed their results; in Figure 2, the process in sphere E has completed with result D6. The problem is dealt with by the dynamic creation of spheres of control in the following way:

• The processing is traced back to find which process created the incorrect data (process A in this case) and to enclose the already-completed SoC in a dynamically created new SoC (F) to serve as the recovery environment (Figure 3).

• Item A is checked to determine the source of the incorrect data.

• The process now moves forward in time to recover all processes that have become dependent on any data produced by the process that created the incorrect data D1. This leads to an extension of SoC F (Figure 4).
Figure 3 Tracing back and creation of Sphere F as recovery environment
Figure 4 Extension of Sphere F to include all dependent spheres (sphere F now encloses the dependent spheres and data items D1–D6, plotted against time). Source: Gray, J. and Reuter, A., Transaction Processing: Concepts and Techniques, 1993, Morgan Kaufmann Publishers.
The SoC F (Figure 4) indicates the necessity to go backwards and forwards through various levels of commitment. The steps taken in the actual recovery of SoC F are application dependent. The notion of spheres of control is both general and powerful, but has never been fully formalised; due to its generality, it is likely to be difficult to build a system exploiting it in full [18]. Flat transactions, on the other hand, are simple and easy to implement, but they only describe the bottom level of atomic actions in the spheres of control model; flat transactions implement spheres of control only to prevent intermediate data from being accessed by an outside observer, as shown in Figure 1. Spheres of control characterise activities more generally than database transactions. Conceptually, SoCs are related to a number of extended transaction models that have been proposed [25]. As stated earlier, at the core of SoCs is the notion of not making data changes public as long as there might be a necessity to revoke them, and of monitoring dependencies so as to be able to trace the execution history if faulty data is found. These concepts are utilised in extended transactions. For example, in nested transactions the data changes made by a child transaction are not externalised but are only available to its parent transaction, and in SAGAS dependencies are maintained so as to be able to run compensating transactions to undo the changes made by partial executions. Nested Transactions and SAGAS are discussed in the following sections.
2.1.3 Nested Transactions

To maintain isolation in transactions, concurrency controls are required. Moss [23] presents a method for concurrency control in transactions using nested transactions. Nested transactions provide nested universes of synchronisation and recovery from failures. The advantages of nested transactions over flat transactions are that they provide concurrency control within transactions, by serialising sub-transactions appropriately, and that they permit parts of a transaction to fail without necessarily aborting the entire transaction. The method suggested uses locking for concurrency control.

In nested transactions, a transaction is composed of an arbitrary number of sub-transactions that may be run concurrently, organised in a tree form. Figure 5 shows an example of how a nested transaction may be organised: the root transaction T initiates multiple sub-transactions T1, T2 and T3; T1 subsequently initiates sub-transactions T11 and T12. A sub-transaction acquires a lock on the data that it wishes to access; on completion this lock is not released but passed to the parent transaction, and thus the results are made available to the parent. Failure of a child transaction does not cause the whole transaction to fail, as the rest of the transaction may still be committed. The results of the whole computation are externalised only on commit of the parent transaction. If the parent transaction aborts, all other sub-transactions are also aborted.

Figure 5 Nested Transaction Organisation
Figure 6 Commitment control in Nested Transaction. A) Root transaction T is running. B) Subtransaction T1 has been started. C) T1 has created subtransaction T11, and after that the root transaction starts another subtransaction, T2. Source: Gray, J. and Reuter, A., Transaction Processing: Concepts and Techniques, 1993, Morgan Kaufmann Publishers.
Figure 6 shows commitment control in a nested transaction. The abort of a higher-level transaction will trigger the abort of all its sub-transactions, e.g., the abort of transaction T will trigger the abort of T1 and T2, and the abort of T1 will subsequently trigger the abort of T11. The commit of a sub-transaction is not externalised unless the parent commits; it therefore waits on the commit of the parent.
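These commit and abort rules can be summarised in a small sketch: a parent's abort cascades to its children, a child's commit is only provisional (its locks and results pass to the parent), and nothing is externalised until the root commits. The Python below is illustrative only and is not the locking implementation described by Moss [23].

class NestedTransaction:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        self.locks = set()          # data items this (sub)transaction holds
        self.state = "active"
        if parent:
            parent.children.append(self)

    def acquire(self, item):
        self.locks.add(item)

    def commit(self):
        # A sub-transaction's commit is provisional: its locks (and hence its
        # results) are handed to the parent rather than released.
        self.state = "committed"
        if self.parent:
            self.parent.locks |= self.locks
        else:
            print(f"root {self.name} commits; results are externalised")

    def abort(self):
        # Abort cascades downwards: aborting T aborts its sub-transactions too.
        self.state = "aborted"
        for child in self.children:
            if child.state != "aborted":
                child.abort()

# Example mirroring Figure 5: T spawns T1, T2, T3; T1 spawns T11 and T12.
T = NestedTransaction("T")
T1, T2, T3 = (NestedTransaction(n, T) for n in ("T1", "T2", "T3"))
T11, T12 = NestedTransaction("T11", T1), NestedTransaction("T12", T1)
T11.acquire("x")
T11.commit()    # lock on x passes to T1, not released
T1.commit()     # x is now provisionally held by T
T.commit()      # only now are the results made visible to the outside world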
2.1.4 SAGAS

Long-lived transactions hold on to database resources for relatively long periods of time, delaying the termination of shorter and more common transactions. The notion of a saga addresses this problem. A long-lived transaction is a saga if it can be split into a sequence of transactions that can be interleaved with other transactions. The database management system guarantees that either all the transactions making up a saga complete successfully, or compensating transactions are run to amend the changes made by the partial execution [21].
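The execution rule for a saga can be sketched as follows: run the component transactions in order, and if one of them fails, run the compensating transactions of the already-completed steps in reverse order. The helper below is a minimal illustration; the step and compensation callables (e.g., book_flight/cancel_flight) are hypothetical.

def run_saga(steps):
    """steps: a list of (do, compensate) pairs of callables.
    Either every 'do' completes, or the compensations of the completed
    prefix are executed in reverse order."""
    done = []
    try:
        for do, compensate in steps:
            do()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()  # amend the changes made by the partial execution
        raise

# Hypothetical usage:
# run_saga([(book_flight, cancel_flight), (book_hotel, cancel_hotel)])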
2.1.5 ASSET

ASSET is a system for supporting extended transactions. It consists of a set of transaction primitives that allow users to define custom transaction semantics to match the needs of specific applications. These transaction primitives can be used to specify a variety of transaction models, such as nested transactions, split transactions and sagas [20].
2.1.6 Distributed Transaction Processing

A distributed transaction is the execution of a program accessing shared data at multiple sites [26]. A distributed transaction is typically a flat transaction that runs in a distributed environment and therefore has to visit several nodes in the network, depending on the location of the data. The conceptual difference between a distributed transaction and a nested transaction can be put as follows: the structure of nested transactions is determined by the functional decomposition of the application, that is, by what the application views as spheres of control, whereas the structure of a distributed transaction depends on the distribution of data in the network. In other words, even for what is, from the application's point of view, a flat transaction, a distributed transaction may have to be executed if the data involved are scattered across a number of nodes [18].
Sub-transactions are spawned to access data at nodes other than the local node. For example, a transaction T running on node a requires data P and Q. The data P and Q are not available locally at node a, but reside on nodes b and c respectively over the network. Sub-transactions T1 and T2 of transaction T will therefore be spawned on the respective nodes.
Figure 7 Commitment control in Distributed Transaction. A) Root transaction T is running. B) Subtransactions T1 and T2 have been started at the same level; the abort of a subtransaction triggers an abort of the parent.
The decomposition into sub-transactions does not reflect a hierarchical structure in the programs to be executed, but is induced by the placement of the data in the network. Consequently, the sub-transactions are not really executing at a conceptually lower level of control, as is the case in nested transactions; they are parts of the same top-level transaction. If a sub-transaction commits its work, it signals the commit of the entire transaction, which forces all other sub-transactions to commit. In comparison, the commit of a sub-transaction in a nested transaction is only local to that sub-transaction, and its final success depends on its ancestors committing [18]. The abort of a sub-transaction in a distributed transaction causes the abort of the whole transaction; this can be seen in Figure 7.
Atomic Commit Problem

The atomic commit problem is concerned with deciding a consistent outcome for a distributed transaction even in the presence of failure [14].
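At its core the problem is to turn the participants' individual votes into one decision that every participant then applies. A minimal sketch of this decision rule is shown below; the atomic commit protocols discussed later in this chapter differ in how the votes are collected and how the rule is made to tolerate failures, not in the rule itself (the names used here are illustrative).

from enum import Enum

class Vote(Enum):
    PREPARED = "prepared"   # the participant is able to commit its part
    NO = "no"               # the participant must abort

def decide(votes):
    """Commit only if every participant voted Prepared; otherwise abort.
    Every participant must then learn and apply the same decision."""
    return "COMMIT" if votes and all(v is Vote.PREPARED for v in votes) else "ABORT"

print(decide([Vote.PREPARED, Vote.PREPARED]))  # COMMIT
print(decide([Vote.PREPARED, Vote.NO]))        # ABORT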
2.2 Fault Tolerant Distributed Computing

This study is about Atomic Commit Protocols. As stated earlier, an Atomic Commit Protocol must provide a consistent outcome even in the presence of failure. It is therefore important to understand what failure is and how it affects the decision-making process. This section presents some definitions and concepts related to failure.
2.2.1 Failure

"Whenever the service of a system, as seen by the user of the system, deviates from the agreed specification of the system, the system has failed. A failure is thus an event occurring at a particular point in real time. Most computer system failures can be traced back to an incorrect internal state of the computer, for instance, a wrong data element in the memory or a register. We call such an incorrect internal state an error. An error is thus an unintended state that lasts for a given interval of time. The cause of an error, and thus indirectly of a failure, is called a fault." [27]

A variety of failure models have been proposed in connection with distributed systems. The models are based on assigning responsibility for faulty behaviour to the system's components – processors and communication channels. Common failure models are listed below [15]:
• Failstop. A processor fails by halting. Once it halts, the processor remains in that state. The fact that a processor has failed is detectable by other processors.

• Crash. A processor fails by halting. Once it halts, the processor remains in that state. The fact that the processor has failed may not be detectable by other processors.

• Crash+Link. A processor fails by halting. Once it halts, the processor remains in that state. A link fails by losing some messages, but does not delay, duplicate, or corrupt messages.

• Receive Omission. A processor fails by receiving only a subset of the messages that have been sent to it or by halting and remaining halted.

• Send Omission. A processor fails by transmitting only a subset of the messages that it actually attempts to send or by halting and remaining halted.

• General Omission. A processor fails by receiving only a subset of the messages that have been sent to it, by transmitting only a subset of the messages that it actually attempts to send, and/or by halting and remaining halted.

• Byzantine Failure. A processor fails by exhibiting arbitrary behaviour.
Failstop failures are the least disruptive, because processors never perform erroneous actions and failures are detectable. Other processors can safely perform actions on behalf of a faulty failstop processor. Unless a system is synchronous, it is not possible to distinguish between a processor that is executing very slowly and one that has halted due to a crash failure. A processor that has halted due to a crash cannot take any further action, but a slowly executing processor will continue processing. Another processor can safely perform actions on behalf of a processor that has halted due to a crash, but not on behalf of a slow processor, because a slow processor may perform subsequent actions that are inconsistent with the actions performed on its behalf by others. Thus, crash failures in asynchronous systems are harder to deal with than failstop failures. In a synchronous system, crash and failstop failures are equivalent.
2.2.2 Properties of a distributed system: Liveness and Safety In general, there are two kinds of correctness properties that an algorithm must satisfy: safety and liveness. Intuitively, a safety property describes what is not allowed to happen, and a liveness property describes what must happen. The safety property of a system defines that during the execution a “bad thing” never happens and the liveness property specifies that a “good thing” eventually happens [28].
2.2.3 Fault tolerance
Informally, fault tolerance is the ability of a system to behave in a well-defined manner once faults occur [29]. A fault-tolerant system aims to detect a fault and take measures to prevent it from leading to an error. A formal definition of fault tolerance has been given by Gärtner [29]: “A distributed program A is said to tolerate faults from a fault class F for an invariant P if there exists a predicate T for which the following three requirements hold:
• At any configuration where P holds, T also holds (i.e., P ⇒ T).
• Starting from any state where T holds, if any actions of A or F are executed, the resulting state will always be one in which T holds (i.e., T is closed in A and T is closed in F).
• Starting from any state where T holds, every computation that executes actions from A alone eventually reaches a state where P holds.”

var x ∈ {0,1,2,3} init 1    {* local state *}
begin
  {* normal program actions *}
  x = 1 → x := 2
  x = 2 → x := 1
  {* fault that should be tolerated *}
  true → x := 0
  {* protection mechanism *}
  x = 0 → x := 1
end
Figure 8 A fault-tolerant program

Figure 8 shows a program for a simple process that keeps on changing its only variable x between the values 1 and 2. The process only tolerates a single type of fault, specified by the fault action¹ true → x := 0, and the protection mechanism² sets the value of x back to 1. The definition can be explained using this program. The invariant of the process is P ≡ x ∈ {1,2}, and the fault class to tolerate is F = {true → x := 0}. The predicate T is called a fault span; it is the limit to the changes that will be made by the faults from F. Thus, T ≡ x ∈ {0,1,2}. The definition states that faults from F are handled by knowing the fault span T. As long as such faults f ∈ F occur, the system may leave the set P (Figure 9 shows the invariant and the predicate as sets), but will always remain within T.
¹ A fault action is the additional virtual programming used to transform a fault-free program A into a program A′ that may be subject to faults. Note that A′ will never be explicitly implemented; the transformation process is just a means to be able to reason about programs which are subject to faults.
² A protection mechanism is a component or concept that enriches the system to provide protection against faults from the fault class.
When fault actions are not executed for a sufficiently long period of time, but only normal program actions (i.e. actions not in F), the system will eventually reach P again and resume “normal” behaviour.
Figure 9 Set diagram for Fault and Fault Tolerance
A key method of providing fault tolerance is to introduce redundancy [29]. Redundancy refers to extra resources that are not used during the normal operation of the system. For example, in Figure 8, the code in the protection mechanism (x = 0 → x := 1) is extra code that is present to return the program to normal execution after a fault has occurred. This code is not used in the normal execution of the program. In an asynchronous system, fault detection is required in order to utilise the available redundancy; it is a contributing factor in achieving fault tolerance [30]. For example, consider a distributed system formed by two processes p and q, both of which can crash. Process q is the redundant process and starts making decisions when p crashes. In this setup, process q cannot find out whether process p has crashed. If p sends messages to q at regular intervals, the absence of a message, in an asynchronous system, does not indicate that the process has crashed; it could just be that process p is running very slowly. Thus there is a need for a fault detection method in this system. Fault detection is being increasingly used in the design, analysis and implementation of many fault-tolerant distributed algorithms [31].
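To make Gärtner's definition concrete, the following is a minimal Java sketch of the program in Figure 8; the step counts, the fault probability and the use of assertions are illustrative assumptions and are not part of [29].

import java.util.Random;

// Minimal sketch of the program of Figure 8.
// Invariant P : x in {1,2}; fault span T : x in {0,1,2}.
// (Run with -ea to enable the assertions.)
public class FaultTolerantToggle {
    static int x = 1;                                   // var x, init 1
    static final Random rnd = new Random();

    static void step(boolean faultsEnabled) {
        if (faultsEnabled && rnd.nextInt(5) == 0) {
            x = 0;                                      // fault action: true -> x := 0
        } else if (x == 0) {
            x = 1;                                      // protection mechanism: x = 0 -> x := 1
        } else {
            x = (x == 1) ? 2 : 1;                       // normal actions: toggle x between 1 and 2
        }
        assert x >= 0 && x <= 2 : "fault span T violated";
    }

    public static void main(String[] args) {
        for (int i = 0; i < 50; i++) step(true);        // while faults may occur, x stays within T
        for (int i = 0; i < 2; i++) step(false);        // once faults stop, x converges back into P
        assert x == 1 || x == 2 : "invariant P not re-established";
        System.out.println("final x = " + x);
    }
}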
2.3 Consensus Problem
Consensus is a paradigm of agreement problems. Consensus is required whenever processes have to agree on the same action to take. In the consensus problem, every correct process proposes a value vᵢ and all the correct processes have to decide on the same value v. The consensus problem can be defined by the following properties [31]:
• C-Termination: Every correct process eventually decides on some value.
• C-Validity: If a process decides v, then v was proposed by some process.
• C-Agreement: No two correct processes decide differently.
The atomic commit problem is a specialised case of the consensus problem. The various processes in the distributed system need to decide on whether to commit the transaction or not.
2.4 Atomic Commit Protocols
In a distributed transaction system, a transaction is performed over multiple nodes or participants. The transaction terminates in a globally consistent manner by utilising an Atomic Commit Protocol. The protocol aims to decide between two possible decision values, commit or abort, for each participant. Deciding commit causes all the participants to make the transaction changes permanent, while deciding abort causes none of them to make the changes. Each participant’s decision is irreversible. The commit decision is only made if all participants unanimously vote YES [2]. The requirements of the atomic commit protocol are:
1. All participants that decide reach the same decision.
2. If any participant decides commit, then all participants must have voted YES.
3. If all participants vote YES and no failure occurs, then all participants decide commit.
4. Each participant decides at most once, i.e., a decision is irreversible.
2.4.1 Generic Commit Protocol
Each of the participants in a distributed transaction is a process called a Resource Manager (RM). The commit protocol is initiated when one of the resource managers issues a request either to commit or to abort the transaction. For the transaction to be committed, each participating RM must be willing to commit it; otherwise, the transaction must be aborted. Prior to the commit request, any RM may decide to abort its part of the transaction. The fundamental requirement is that all RMs must agree on whether the transaction is committed or aborted; this ensures that the transaction is Atomic and Consistent. Once an RM has decided to abort or commit, the decision cannot be changed [6]. The RM may be viewed as a state machine and the commit protocol as a set of state transitions. The states of the RM are as follows:
• Working – the RM is currently processing its sub-transaction.
• Prepared – the RM has finished processing its sub-transaction successfully and is willing to commit.
• Aborted – the RM has aborted the transaction, either because it did not complete its sub-transaction successfully or because one of the other RMs has decided to abort.
• Committed – all the RMs unanimously wish to commit the transaction and thus the protocol decides that the transaction is to be committed. The RM then makes the changes for its sub-transaction permanent.
Figure 10 shows the state transition diagram for an RM
Figure 10 State transition diagram for an RM (Source: Gray, J. and L. Lamport, Consensus on Transaction Commit. 2004, Microsoft Research.)
All RMs start in the working state. The protocol ensures that either all the RMs reach the Committed state or all reach the Aborted state. The requirements of the Generic Commit Protocol are [6]:
• Stability – once an RM enters the committed or aborted state, it remains in that state forever.
• Consistency – one RM must not be in the committed state at the same time as another one is in the aborted state.
These two properties imply that, once an RM enters the committed state, no other RM can enter the aborted state, and vice versa [6]. Each RM also has a prepared state. We require that:
• An RM can enter the committed state only after all RMs have been in the prepared state.
These requirements imply that the transaction can commit, meaning that all RMs reach the committed state, only by the following sequence of events:
1. All the RMs enter the prepared state, in any order.
2. All the RMs enter the committed state, in any order.
The protocol allows the following event that prevents the transaction from committing:
• Any RM in the working state can enter the aborted state.
The stability and consistency conditions imply that this abort event cannot occur if some RM has entered the committed state. In practice, an RM will abort when it learns that a failure has occurred that prevents the transaction from committing. However, the abstract representation of the problem permits an RM to enter the aborted state in the absence of failure as well [6].
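These transition rules can be illustrated with a short Java sketch; the class and method names below are invented for illustration and do not come from the specification in [6].

import java.util.List;

// Sketch of the RM state machine of the generic commit protocol.
// Legal transitions: Working -> Prepared -> Committed and Working -> Aborted;
// Committed and Aborted are terminal (stability), and commit is only possible
// once every RM has been in the prepared state.
enum RMState { WORKING, PREPARED, ABORTED, COMMITTED }

class GenericRM {
    private RMState state = RMState.WORKING;

    RMState state() { return state; }

    void prepare() {                          // sub-transaction finished, willing to commit
        require(state == RMState.WORKING);
        state = RMState.PREPARED;
    }

    void abortFromWorking() {                 // only an RM in the working state may abort
        require(state == RMState.WORKING);
        state = RMState.ABORTED;
    }

    void commit(List<GenericRM> allRMs) {     // all RMs must have been in the prepared state
        require(state == RMState.PREPARED);
        for (GenericRM rm : allRMs)
            require(rm.state() == RMState.PREPARED || rm.state() == RMState.COMMITTED);
        state = RMState.COMMITTED;
    }

    private static void require(boolean ok) {
        if (!ok) throw new IllegalStateException("transition not allowed by the protocol");
    }
}

Because abort is only permitted from the working state and commit requires every RM to have prepared, the consistency property follows: once some RM has committed, no RM can still be in the working state, so no abort can occur.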
2.4.2 Two Phase Commit (2PC) Protocol The 2PC [3] is an implementation of the Generic Commit Protocol. The protocol uses a Transaction Manager (TM) process to coordinate the decision making process.
Figure 11 Finite state automaton for a 2PC participant (states q, w, a, c)
The RM has the same states as in the generic transaction commit protocol (working, prepared, aborted, committed). The TM has the init, preparing, aborted and committed states. In the 2PC protocol, one of the RMs enters the prepared state and sends a prepared message to the TM. The TM, on receiving the prepared message, sends a prepare message to all the RMs. The RMs that are still in the working state can then enter the prepared state. Once the TM receives a prepared message from each of the RMs, the TM makes a commit decision. If any of the RMs in the working state decide
to abort, an abort message is sent to the TM; the TM may also decide to abort. The TM then sends an abort message to all the RMs. Once the TM makes a commit decision, a commit message is sent to all the RMs. On receiving the commit message the RMs move from the Prepared state to the Committed state. Once an RM enters the committed state it must not move to the aborted state. In the 2PC protocol the coordinator, i.e. the TM, goes through the following two-phase process:
1. Prepare: first, it instructs all resource managers to get ready to "go either way" on the transaction. In practice this means that each participant in the process, i.e. each RM, must force all log records for local resources used by the transaction out to its own physical log. Assuming the forced write is successful, the resource manager replies "Yes" to the coordinator; otherwise it replies "No."
2. Commit: when the coordinator has received replies from all participants, it forces a record to its own physical log, recording its decision regarding the transaction. If all replies were "Yes," the decision is "commit"; if any reply was "No," the decision is "abort." Either way, the coordinator then informs each participant of its decision, and each participant must then commit or roll back the transaction locally, as instructed.
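The coordinator side of this two-phase exchange can be summarised by the hedged Java sketch below; the Participant and Log interfaces are placeholders assumed for illustration and are not part of the protocol as published in [3].

import java.util.List;

// Sketch of the 2PC coordinator (TM) logic described above: a prepare round
// collecting YES/NO votes, followed by a forced log write of the decision and
// a commit/abort round.
class TwoPhaseCoordinator {
    interface Participant {
        boolean prepare();                 // forces its own log and returns its YES (true) / NO (false) vote
        void commit();
        void abort();
    }
    interface Log { void forceWrite(String record); }

    private final Log log;
    TwoPhaseCoordinator(Log log) { this.log = log; }

    void run(List<Participant> participants) {
        // Phase 1 (prepare): ask every RM to get ready to "go either way".
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.prepare()) { allYes = false; break; }
        }
        // Phase 2 (commit): force the decision to the coordinator's log, then tell everyone.
        String decision = allYes ? "COMMIT" : "ABORT";
        log.forceWrite(decision);
        for (Participant p : participants) {
            if (allYes) p.commit(); else p.abort();
        }
    }
}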
Effects of Faults on the 2PC Protocol
For example, consider a system with three participants, one of which performs the role of the coordinator, using 2PC to ensure commit atomicity. Consider a scenario where one of the participants and the coordinator fail after the third site has voted but before it has received the commit decision. There are several possible execution states; two of these are of interest:
1. One of the failed sites may have aborted the transaction.
2. All sites may have decided to commit the transaction.
In the second case, if the coordinator failed while sending the commit messages and the second site failed after receiving the commit message, then the transaction has
been committed at the second site. As the third site has no way of knowing the status of the transaction, it cannot safely proceed. The third site has to wait for the coordinator to recover to determine the commit decision. The execution of the transaction is blocked at site three until the coordinator recovers. Performance and consistency need to be correctly balanced for an optimal transaction management system. Even though 2PC's blocking behaviour is well known, it is the most widely used protocol due to its overall performance.
2.4.3 Non-blocking atomic commit algorithms
Protocols that allow operational sites to continue transaction processing even though site failures have occurred are called non-blocking. A protocol that never requires operational sites to block until a failed site has recovered is called a non-blocking protocol [4]. The fundamental non-blocking theorem as defined by Skeen [4] is:
"A protocol is non-blocking if and only if it satisfies both of the following conditions (for all participating sites):
1. there exists no local state such that its concurrency set³ contains both an abort and a commit state
2. there exists no noncommittable state whose concurrency set contains a commit state"
Analysing Figure 11, which displays the finite state automaton (FSA) for a participating site of the 2PC, against the non-blocking theorem, it can be seen that:
1. state w's concurrency set contains both an abort and a commit state, and
2. the noncommittable state w has a commit state in its concurrency set.
Thus, 2PC is an example of a blocking protocol: operational sites sometimes wait on the recovery of a failed site. Locks are held while the transaction is blocked.
³ The concurrency set is the set of states that may be concurrently occupied by other sites when the local site i is in state sᵢ.
The Non-Blocking Atomic Commit (NB-AC) [11] problem has the following conditions:
• Uniform Agreement – forbids any two participants (correct or not) to decide differently.
• Uniform Validity – if a participant decides commit, all participants have voted yes.
• Termination – every correct participant eventually decides.
• Non-Triviality – commit must be decided if all participants vote yes and there is no failure.
The conditions for NB-AC are similar to the properties of the consensus problem described in section 2.3. The Non-Triviality condition is added to rule out a trivial solution to NB-AC in which the algorithm always decides abort. In the case of a site failing, a non-blocking protocol initiates a termination or recovery protocol. The termination protocol aims at terminating the transaction at the operational sites as quickly as possible. The recovery protocol aims at resuming transaction processing when a failed site recovers. Both of these methods must ensure the ACID properties. Non-blocking algorithms generally have a higher latency [32] due to extra message overheads.
Three Phase Commit Protocol
Skeen [4] presents the three phase commit (3PC) protocol as a non-blocking atomic commit protocol. A buffer state is introduced between the wait (prepared) state and the commit state of the 2PC protocol. The coordinator and the resource managers move to the prepared state, and once all of them have acknowledged the prepared state they move to the commit state. A backup coordinator can wake up and commit the transaction in case of failure.
Figure 12 Finite state automaton for a 3PC participant (states q, w, p (buffer state), a, c)
Analysing Figure 12, which displays the FSA for the 3PC, against the non-blocking theorem, the following can be said:
1. there is no state that has both an abort and a commit state in its concurrency set, as state w's concurrency set now contains the abort and buffer states but not the commit state;
2. the noncommittable state w does not have the commit state in its concurrency set.
The 3PC protocol is non-blocking only in the case of a non-faulty messaging system [2].
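The two conditions of the non-blocking theorem can also be checked mechanically. The following Java sketch applies them to concurrency-set tables read off Figures 11 and 12; the concurrency sets listed are my reading of the two FSAs and are assumptions, not values quoted from [4].

import java.util.Map;
import java.util.Set;

// Checks Skeen's two non-blocking conditions for a participant FSA, given each
// local state's concurrency set and the set of committable states.
public class NonBlockingCheck {

    static boolean isNonBlocking(Map<String, Set<String>> concurrency,
                                 Set<String> committable,
                                 String abortState, String commitState) {
        for (Map.Entry<String, Set<String>> entry : concurrency.entrySet()) {
            Set<String> cset = entry.getValue();
            // Condition 1: no concurrency set contains both an abort and a commit state.
            if (cset.contains(abortState) && cset.contains(commitState)) return false;
            // Condition 2: no noncommittable state has a commit state in its concurrency set.
            if (!committable.contains(entry.getKey()) && cset.contains(commitState)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // 2PC participant (Figure 11): while in w, other sites may already be in a or c.
        Map<String, Set<String>> twoPC = Map.of(
                "q", Set.of("q", "w", "a"),
                "w", Set.of("q", "w", "a", "c"),
                "a", Set.of("q", "w", "a"),
                "c", Set.of("w", "c"));
        System.out.println("2PC non-blocking? " + isNonBlocking(twoPC, Set.of("c"), "a", "c"));        // false

        // 3PC participant (Figure 12): the buffer state p separates w from c.
        Map<String, Set<String>> threePC = Map.of(
                "q", Set.of("q", "w", "a"),
                "w", Set.of("q", "w", "p", "a"),
                "p", Set.of("w", "p", "c"),
                "a", Set.of("q", "w", "a"),
                "c", Set.of("p", "c"));
        System.out.println("3PC non-blocking? " + isNonBlocking(threePC, Set.of("p", "c"), "a", "c")); // true
    }
}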
Non-Blocking Atomic Commit (NB-AC) in Asynchronous Systems
Studies have compared atomic commit with consensus [11, 14]. NB-AC has been considered a harder problem than consensus; the non-triviality condition of NB-AC (commit must be decided if all participants vote "Yes" and there is no failure), which avoids trivial solutions to the problem, requires precise knowledge about failures. Guerraoui [11] proves that a weaker form of NB-AC is equivalent to consensus and can be solved with weak failure detectors in an asynchronous system. In the following sections, NB-AC algorithms for asynchronous systems are discussed.
Low Latency Non-blocking Atomic Commit Algorithm
Jimenez-Peris et al. [7] present a low latency non-blocking atomic commit algorithm. The algorithm is characterised by using multiple coordinators and an optimistic commit⁴; additionally, the message to commit is sent before the disk write is done. This protocol aims to reduce the latency of non-blocking atomic commit. The optimistic commit suggested may need roll-back: in the majority of cases the decision will be correct, but in the few cases where the decision was incorrect the transactions will be rolled back. Jimenez-Peris et al. optimise the non-blocking algorithm not only by not flushing entries to the log but also by reducing the messaging overhead. The protocol reduces the messaging overhead by using virtual synchronous uniform multicast [33]. In distributed systems processes may communicate via broadcast⁵ or multicast⁶. A message transmitted through a broadcast protocol is available to all hosts in the network. The broadcast protocol is defined in terms of two primitives: broadcast(m) and deliver(m)⁷, where m is a message from a set of M possible messages. In the presence of failure, the broadcast may not reach all the hosts on the network. Hadzilacos et al. [34] present a collection of broadcast primitives which consider failures. Uniform Reliable Broadcast [34] is a broadcast primitive with the following properties:
• Validity. If a correct process broadcasts a message m, then all correct processes in the system eventually deliver m.
• Uniform Integrity. For any message m, each process in the system delivers m at most once, and only if some process actually broadcasts m.
⁴ Optimistic commit refers to the process where the commit message is sent to the TM group before the votes become uniform across the Coordinator Group.
⁵ Broadcast protocols send messages to all hosts on the network.
⁶ Multicast protocols send messages to a selected subset of hosts on the network.
⁷ The deliver primitive in a host sends a message received by the host to all its neighbours.
• Uniform Timeliness. There exists a known constant ∆b such that if the broadcast of m is initiated at real-time t, no process in the system delivers m after real-time t + ∆b.
• Uniform Agreement. If any process (correct or not) in the system delivers a message m, then all correct processes in the system eventually deliver m.
This broadcast primitive has mostly been considered in synchronous systems; Schiper et al. [33] present it in a virtually synchronous environment⁸. The non-blocking behaviour is achieved by the coordinator using a virtually synchronous uniform multicast⁹ protocol. This guarantees that either all or none of the participants know the outcome of the transaction. Uniformity is very expensive in terms of the delay it introduces, and the delay is dependent on the size of the network. The algorithm therefore uses the uniform multicast only over a small subset of participants, i.e., the commit servers, and it hides the latency introduced by the uniform multicast behind tasks that need to be performed for the transaction commit. The flushing of log records to disk is a source of increased latency in AC protocols. This algorithm sends messages instead of flushing log records; the main memory of a replicated group is used as stable storage.
⁸ A virtually synchronous environment allows processes to be structured into process groups, and makes events like broadcasts to the group as an entity, group membership changes, and even migration of an activity from one place to another appear to occur instantaneously -- in other words, synchronously.
⁹ A uniform reliable multicast of a message m has the property that if m has been received by any destination process (faulty or not), then m is received by all processes that reach a decision.
Figure 13 Message flow between the TM group (TM1, TM2, TM3) and the commit server (CS) group (C1, C2, C3): the TMs send their votes to the CS group and an opt-commit message is returned
Protocol Overview
The AC protocol starts when a client requests to commit a transaction. The commit request arrives at a TM, TMi, which then starts the protocol. The protocol involves several rounds of messages in two phases [7]:
First phase
1. Upon delivering the commit request, TMi multicasts a reliable prepare to commit message to the TM group. This message contains the transaction identifier (tid) to be committed and the number of participants involved (the number of TMs contacted during the execution of the transaction).
2. Upon delivering the prepare to commit message, each participant uniform multicasts its vote and the number of participants to the CS group. If a participant has not yet written the corresponding entries to its local log when the prepare to commit message arrives, it sends the log entry in addition to its vote without waiting to write to the log. After the message has been sent, it then writes the log entry to its local disk. (This is illustrated in Figure 13)
Second phase
1. Upon opt-delivering a vote message, the processes of the commit server decide who will act as proxy coordinator for the protocol based on the tid of the
transaction and the current view. Assume this site is Ci. The rest of the processes in the CS group act as backups in case Ci fails. If a no vote is opt-delivered, the transaction is aborted immediately and an abort message is reliably multicast to the TM group. If all votes are yes, as soon as the last vote is opt-delivered at Ci, Ci sends a reliable multicast with an opt-commit message to the TM group. (This is illustrated in Figure 13)
2. Upon delivering an abort message, a participant aborts the transaction. Upon delivering an opt-commit message, the participant changes the transaction locks to opt mode.
3. If all votes are affirmative when they have been uni-delivered at Ci, Ci reliably multicasts a commit message to the TM group.
4. When a participant delivers a commit or abort message, it releases all locks (both opt and non-opt) held by the transaction and returns the corresponding transactions that were on hold to their normal state.
5. If the coordinator opt-commits the transaction but is excluded from the majority view before being able to commit it, and one or more votes do not reach the majority view, the transaction will be aborted by the new coordinator.
Figure 13 displays a representation of the system with the messages that would be sent in the second step of Phase 1 and the first step of Phase 2 of the protocol.
Paxos Commit Protocol
The Paxos Commit protocol [6] uses the Paxos consensus algorithm [10] on the commit/abort decision of each participant to achieve fault tolerance in distributed commit. Gray et al. [6], while specifying the requirements of an atomic commit protocol, give two liveness requirements:
• Non-Triviality: If the entire network is nonfaulty throughout the execution of the protocol, then (a) if all RMs reach the prepared state, then all RMs reach the committed state, and (b) if some RM reaches the aborted state, then all RMs reach the aborted state.
• Non-Blocking: If, at any time, a sufficiently large network of nodes is nonfaulty for long enough, then every RM executed on those nodes will reach either the committed or aborted state.
As stated before, NB-AC is harder to solve in asynchronous systems and only a weaker form of NB-AC is solvable in an asynchronous system. Gray et al. have weakened the Non-Triviality requirement so that participants are only required to reach a decision if the entire network remains nonfaulty throughout the execution of the protocol. The Paxos consensus algorithm [10] is an algorithm for implementing a fault-tolerant distributed system. The algorithm uses multiple acceptors to reach consensus on choosing a proposed value. A unique value is chosen when a majority of acceptors receive the same proposed value. Paxos Commit uses 2F+1 coordinators and makes progress if at least F+1 of them are active. The distributed commit problem can be made fault tolerant by implementing Paxos consensus for the TM's commit/abort decision. The Paxos Commit algorithm runs an instance of the Paxos consensus algorithm to choose the prepared/aborted decision of each of the RMs. The transaction is only committed if all the RMs are prepared. The safety property of the Paxos consensus algorithm does not place any requirements on processing speed or an upper bound on message delay; this classifies the system as asynchronous.
Paxos Consensus in Brief
Paxos consensus is an algorithm for choosing a unique proposed value. The algorithm uses a collection of acceptor processes to decide on a proposed value. At the start there is a unique leader process that proposes a value to the acceptors. The acceptors reply to the leader and, once the leader receives a message from a majority of the acceptors, the value is chosen. The leader then informs any other processes of the value [10]. This algorithm is fault tolerant: a new leader is elected if the original leader fails. The new leader is able to continue processing if there is enough information available; otherwise the transaction is aborted. The Paxos algorithm proceeds in the following manner [6]:
Phase 1a: The leader chooses a ballot number that it thinks is higher than any other ballot number that has been processed. A message is sent to all the acceptors with this ballot number.
Phase 1b: The acceptor sends a message to the leader with the value for the highest ballot number that it has received and the highest ballot number for which it has received a Phase 1a message.
Phase 2a: Once the leader receives a Phase 1b message for the ballot number from a majority of acceptors, it knows the status of the ballot number:
• Free – a value has not been chosen for the ballot number.
• Forced – a value has already been chosen for the ballot number.
If the ballot is free, the leader can propose any value to be chosen by the acceptors. In the forced situation, the value chosen earlier is proposed again.
Phase 2b: If the acceptor has not received a Phase 1a or Phase 2a message for a higher ballot number, it sends a Phase 2b message for the current ballot number.
Phase 3: Once the leader receives Phase 2b messages from a majority of acceptors, the value proposed for this ballot number is chosen. The value is communicated to all interested processes with a Phase 3 message.
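A hedged Java sketch of the acceptor's side of these phases, for a single consensus instance, is given below; the message formats and field names are assumptions made for illustration and are not taken from [10].

import java.util.Optional;

// Sketch of a single Paxos acceptor for one consensus instance. It remembers the
// highest ballot it has promised (Phase 1b) and the last ballot/value it has
// accepted (Phase 2b); a real acceptor must record both on stable storage.
class PaxosAcceptorSketch {
    private long promisedBallot = -1;     // highest ballot seen in a Phase 1a (or 2a) message
    private long acceptedBallot = -1;     // ballot of the last accepted value
    private String acceptedValue = null;  // value accepted at that ballot

    // Phase 1b: reply to a Phase 1a message for ballot b with the last accepted value, if any.
    Optional<String> onPhase1a(long b) {
        if (b > promisedBallot) {
            promisedBallot = b;
            return Optional.of("promised " + b + ", last accepted (" + acceptedBallot + ", " + acceptedValue + ")");
        }
        return Optional.empty();          // ignore lower-numbered ballots
    }

    // Phase 2b: accept the proposed value unless a higher-numbered ballot has been seen.
    boolean onPhase2a(long b, String value) {
        if (b >= promisedBallot) {
            promisedBallot = b;
            acceptedBallot = b;
            acceptedValue = value;        // written to disk before replying (see section 3.4.2)
            return true;                  // a Phase 2b message is sent to the leader
        }
        return false;
    }
}

Once the leader has received Phase 2b replies for the same ballot from a majority (F+1 of 2F+1) of such acceptors, the value is chosen and can be announced with a Phase 3 message.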
2.4.4 Backup Commit
Reddy et al. [16] describe the Backup Commit (BC) protocol, which reduces the blocking in 2PC by using multiple backup sites. In the algorithm there are multiple backup sites connected to the coordinator. The coordinator, on making a decision, broadcasts it to the backup sites before informing the participants. When the coordinator fails, the participants can check the decision for the transaction with one of the backup sites. Reddy et al. compare BC with 3PC, which has high latency due to an extra round of messaging between the participants and the coordinator. BC has lower latency, as the backup sites are placed close to the coordinator to reduce network latency. It is stated that the performance of commit processing depends on the message processing latency of the coordinator. As BC does not introduce an extra
round of messaging between the coordinator and the participants, an overhead is not incurred.
2.5 Summary
This chapter has given an overview of the atomic commit problem. The chapter started by introducing transaction processing, the ACID properties and various transaction processing models. Failure and fault tolerance were then described, followed by the consensus problem and how it is made more difficult by failures. Atomic commit in distributed databases is a form of the consensus problem, and the chapter looked at a number of algorithms from the literature. In the following chapter, the algorithms for 2PC and Paxos Commit are explained in detail; this discussion will form the basis for the design of the simulation framework.
3. Algorithm Overview
An introduction to transaction processing concepts was provided in the previous chapter. The chapter also described how failure affects transaction processing, especially distributed transactions. Since this study compares atomic commit protocols, a detailed understanding of these protocols is required. This chapter provides a detailed analysis of the atomic commit protocols that are simulated in this study. All processes that are a part of a transaction processing system are modelled as state machines, and this chapter presents a state transition diagram for each of these processes. Figure 14 shows how the state transition diagrams are drawn.
Figure 14 State Transition Diagram Legend (initial state, destination state, message sent, message received)
3.1 Two Phase Commit (2PC)
The 2PC protocol described in section 2.4.2 uses a single coordinator, the Transaction Manager (TM). The various processes in the transaction system, including the TM, are described below:
1. Transaction Manager (TM)
The TM in 2PC works as the coordinator and is responsible for servicing all requests from the Client and the Resource Managers (RMs). The state transition diagram for the TM can be seen in Figure 15.
• It starts in the Free State, assigns a transaction id to every transaction started, and then moves to the Working State.
• It keeps track of all the RMs that are part of the transaction.
• Once the Client process completes assigning work to the RMs, it sends a Do Commit message to the TM. This causes the TM to move from the Working State to the Preparing State.
• Each RM informs the TM whether it is ready to commit.
• Based on the individual RM states, the TM decides whether to commit or abort, and moves to either the Aborted or Committed State.
Figure 15 State Transition Diagram for 2PC Transaction Manager
2. Resource Manager (RM)
The RM is the process that does the work in a transaction processing system. The state transition diagram can be seen in Figure 16.
• The RM starts in the Free State. The Client requests the RM to do work for a transaction. The RM moves to the Joining State by sending a message to the TM.
• The RM knows that its registration with the TM for the transaction is complete when it receives the Joined Transaction message. The RM then moves to the Working State.
• After completing the work, the RM informs the TM of its status and moves to the Prepared or Aborted State.
• When the TM decides whether the transaction should be committed or aborted, the RM can discard the changes or make them permanent and finally move to the Aborted or Committed State.
Figure 16 State transition diagram for 2PC Resource Manager
3. Client
The Client is the process that utilises the transaction service. The state transition diagram can be seen in Figure 17. The Client:
• initiates the transaction by sending a Begin Transaction message to the TM and moves from the Free State to the Joining State;
• receives a Began Transaction message from the TM with the transaction id, moves from the Joining State to the Working State, and asks the different RMs to do processing by sending them Do Work messages;
Figure 17 State Transition Diagram for Client
• Each RM sends a message back to the Client when it has completed the work. The Client moves to the Prepared State when it receives a Did Work message from all the RMs, and at this point it sends a Do Commit message to the TM;
• finally, the TM informs the Client of the final state of the transaction and the Client moves to the Aborted or the Committed State.
3.2 Paxos Commit
Paxos Commit is a non-blocking atomic commit protocol that was briefly introduced in section 2.4.3. Paxos Commit utilises the Paxos consensus algorithm on the commit/abort decision of each of the RMs to achieve fault tolerance and thus non-blocking behaviour. In Paxos Commit the same set of 2F+1 acceptors and the same leader are used for each instance of the Paxos algorithm. The execution of Paxos Commit begins when an RM decides to prepare and sends a BeginCommit message to the leader. The leader then sends a Prepare message to all the RMs. Each RM then sends a phase 2a message with ballot number 0, carrying either an Abort or a Prepared value depending on its state, to all the 2F+1 acceptors. For each instance, each acceptor sends a phase 2b message to the leader. Once the leader receives F+1 phase 2b messages for an instance, it knows the outcome for that instance. The leader can then send a Phase 3 message regarding the decision for that instance to all interested parties, in this case all the RMs. The Phase 3 messages for all the instances are combined into a single Commit/Abort message, depending on the outcome of each of the instances of the Paxos algorithm. All the instances must agree on the Prepared value for a Commit message to be sent; if even one of the instances agrees on the Abort value, the Abort message is sent. There are four types of participants in the Paxos Commit algorithm; they are described below:
1. Registrar
The Registrar is the process that contains the logic for transaction management. The state transition diagram for the Registrar can be seen in Figure 18. The Registrar handles the following tasks in the Paxos Commit protocol:
• The creation of the transaction.
• The registration of the RMs of the transaction, i.e. which RMs are part of the transaction.
An instance of the Paxos consensus algorithm is also run for the Registrar process; this is initiated when the Client requests the Registrar to begin a transaction. The Registrar moves from the "Free" state to the "Working" state. Once the Client has completed the transaction, it indicates to the Registrar that the transaction has been completed with a "Do Commit" message. The Registrar sends a "Prepare" message to all the RMs and moves to the "Preparing" state. An instance of the Paxos consensus algorithm is then initiated for each of the RMs that are a part of the transaction. When a transaction is created, an instance of the Paxos algorithm is initiated for the Registrar process. This sets a timeout for the main transaction, because if the Paxos algorithm for this instance does not choose a value for the set of RMs (S), the leader will propose a special value B, which causes the transaction to abort.
Figure 18 State Transition Diagram for Registrar
Once a value S has been chosen by the Paxos algorithm, an instance of the Paxos algorithm is started for each one of the RMs. This in effect initiates a
timeout on each of the RMs. If a value is not chosen for one of the RM instances, the leader will propose an abort value for that instance. The Registrar controls the registration of the RMs for a transaction, i.e. it does the job of the TM. It then runs an instance of the Paxos algorithm to get its list of RMs accepted. The Registrar moves to the Aborted or Committed state on receiving Phase 2b messages related to the commit decision.
2. Resource Manager (RM)
The Resource Manager is a participant in the transaction; it is the process that performs work as part of the transaction.
Figure 19 State Transition Diagram for RM
It is the RM that needs to know the decision of the coordinator. An RM registers with the Registrar for the transaction when it is asked to do work by the Client. Figure 19 shows the state transition diagram for the RM.
• The RM starts in the "Free" state. It moves to the "Joining" state when the Client requests the RM to do work, by sending a "Join Transaction" message.
• When the Registrar responds to the "Join Transaction" message with a "Joined Transaction" message, the RM can finish the work and move to the "Working" state by sending a "Did Work" message.
• The RM moves to the "Prepared" state after it receives a "Prepare" message; it sends a Phase 2a message to all the acceptors regarding its Prepared state.
• The RM moves to the "Aborted" or "Committed" state after it receives Phase 2b messages relating to the commit decision of the entire transaction.
3. Acceptor
The Acceptor is the process that contains the Paxos consensus logic. Figure 20 shows the state transition diagram for an instance of the Paxos algorithm that runs at an acceptor. An instance of the Paxos consensus algorithm is run to obtain agreement on the decision each RM makes of whether to prepare or abort (represented by the value Prepared or Aborted), and an instance is also run for the Registrar process's list of RMs. Each instance of the Paxos consensus algorithm works to achieve consensus on the prepared decision of a specific RM. When the leader knows the prepared decision for all the RMs, from at least F+1 acceptors, it can make the commit decision. For the Paxos consensus to work, one of the acceptors has to act as the leader.
Figure 20 State Diagram for an Instance of the Paxos Algorithm on an Acceptor
As part of the optimisations recommended by Gray et al. [6], the phase 2b messages are sent as a bundled message to the leader by the acceptors. In 2PC the TM can unilaterally abort a transaction. In the Paxos Commit algorithm the leader can only abort an RM's decision to be prepared; the leader sends a Phase 2b abort message with a higher ballot number than the one from the RM.
4. Client
The Client is the process that:
• initiates a transaction;
• assigns work to each of the RMs; and
• requests that the transaction be committed.
The Client finds out the final outcome of the transaction. The state transition diagram shown in Figure 17 for the 2PC is also applicable to the Paxos Commit Algorithm.
3.3 Messaging Analysis The messaging delay in atomic commit protocols is a key part of the performance analysis. In this section an analysis of the messaging delay for 2PC and Paxos Commit is provided [6]. This analysis is for the normal case, where the transaction is committed.
3.3.1 Two Phase Commit The 2PC protocol sends the following sequence of messages in the normal case:
• The initiating RM enters the prepared state and sends a Prepared message to the TM. (1 message)
• The TM sends a Prepare message to every other RM. (N-1 messages)
• Each other RM sends a Prepared message to the TM. (N-1 messages)
• The TM sends a Commit message to every RM. (N messages)
Thus, in the normal case, the RMs learn that the transaction has been committed after four message delays. A total of 3N-1 messages are sent. In a faster form of 2PC the prepare messages from the TM can be avoided, as all the RMs can inform the TM as and when they become ready; this reduces the message complexity to 2N [6].
3.3.2 Paxos Commit
Assume that there are N RMs and a system that can tolerate F faults, so there are 2F+1 acceptors. In the normal case, the Paxos Commit algorithm uses the following (potentially inter-node) messages [6] (the details of the Paxos algorithm messages can be found in section 2.4.3):
• The first RM to prepare sends a BeginCommit message to the leader. (1 message)
• The leader sends a Prepare message to every other RM. (N-1 messages)
• Each RM sends a ballot 0 phase 2a Prepared message for its instance of Paxos to every acceptor. ((2F + 1)N messages)
• For each RM's instance of Paxos, every acceptor sends a phase 2b Prepared message to the leader. However, an acceptor can bundle the messages for all those instances into a single message. (2F + 1 messages)
• The leader sends a single Commit message to each RM containing a phase 3 Prepared message for every instance of Paxos. (N messages)
The RMs therefore learn after five message delays that the transaction has been committed. A total of 2F(N + 1) + 3N + 1 messages are sent. If the leader is on the same node as one of the acceptors, then one of the phase 2b messages is free, so the total number of messages is (2F + 3)N + 2F. If each acceptor is on the same node as
an RM, then 2F + 1 of the phase 2a messages are intra-node messages that can be discounted [6]. As observed above, phase 3 of Paxos algorithm can be eliminated by having each acceptor send its phase 2b messages directly to all the RMs. This allows the RMs to learn the outcome in only four message delays, but a total of 2F(2N +1)+3N +1 messages are required. If each acceptor is on the same node as an RM, then 2(2F +1) of those messages can be discounted, leaving 2F(2N - 2F - 1) + 3N + 1 messages. So far Paxos commit requires five message delays, which can be reduced to four by eliminating phase 3 and having acceptors send extra phase 2b messages. Two of those message delays result from the sending of Prepare messages to the RMs. As stated in section 3.3.1, these delays can be eliminated by allowing the RMs to prepare spontaneously, leaving just two message delays [6].
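As a sanity check on these formulas, the short calculation below evaluates the message counts for an illustrative configuration of N = 5 RMs and F = 1 (three acceptors); the chosen values of N and F are assumptions, while the formulas are those given above.

// Evaluation of the message-count formulas above for N = 5 RMs and F = 1 (2F+1 = 3 acceptors).
public class MessageCounts {
    public static void main(String[] args) {
        int n = 5;   // resource managers
        int f = 1;   // tolerated faults

        int twoPC      = 3 * n - 1;                       // 14 messages, 4 delays
        int twoPCFast  = 2 * n;                           // 10 messages, RMs prepare spontaneously
        int paxos      = 2 * f * (n + 1) + 3 * n + 1;     // 28 messages, 5 delays
        int paxosColoc = (2 * f + 3) * n + 2 * f;         // 27 messages, leader co-located with an acceptor
        int paxosNoPh3 = 2 * f * (2 * n + 1) + 3 * n + 1; // 38 messages, 4 delays (phase 3 eliminated)

        System.out.println("2PC: " + twoPC + ", fast 2PC: " + twoPCFast);
        System.out.println("Paxos Commit: " + paxos + ", co-located leader: " + paxosColoc
                + ", without phase 3: " + paxosNoPh3);
    }
}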
3.4 Analysis of Disk Writes
Disk writes for recording the states of the processes in a transaction processing system are also a large factor in the performance of the system. This is in addition to the state changes and messaging described in the previous two sections for 2PC and Paxos Commit. In this section an analysis of the disk writes performed by the two protocols is provided.
3.4.1 Two phase commit
In 2PC there are two important decisions or states that need to be recorded to stable storage. These are essential so that if a failure occurs the system is able to read the last state of the transaction from stable storage and continue processing. They are [6]:
1. The RM's internal decision about whether to Abort or Commit.
2. The TM's decision, after receiving all RM decisions, about whether to Abort or Commit.
3.4.2 Paxos Commit
Paxos Commit has similar disk write requirements to 2PC, but they vary slightly due to the introduction of the Acceptors. Paxos consensus requires a write to stable storage for [6]:
1. The leader's decision to send a phase 2a message.
2. The acceptor's decision to send a phase 1b or phase 2b message.
In the context of Paxos Commit these can be stated as:
1. RMs record their individual decisions to disk. As per the Paxos Commit protocol, the phase 2a message for the prepared decision is sent by the RM to the acceptors, thus a write is required by the RM before sending its prepared message.
2. According to the Paxos consensus algorithm, each of the acceptors is supposed to write to disk before sending its Phase 2b message. This is applicable for every instance of the Paxos algorithm. It also applies to the instance of the Paxos algorithm that is run for the Registrar process and the selection of the list of the RMs taking part in the transaction. This means that each of the acceptors needs to do a disk write for each of the RMs and the Registrar process.
Gray et al. [6] suggest that disk writes for the RM processes should be bundled together, similarly to the phase 2b messages, to reduce the number of disk writes that are done.
3.5 Recommended Optimisation
While analysing the Paxos Commit algorithm, scope for further improvement was identified. Considering that disk writes are much more time consuming than messaging, if the delay associated with disk writes can be reduced, the additional messaging delay introduced by the additional processes can be overcome. The Acceptor writes to disk before sending the Phase 2b message. The reason for performing the write before sending the Phase 2b message is to ensure that when an acceptor recovers from a fault it knows that the Phase 2b message was sent, or was ready to be sent. This delays the process of choosing a correct value. Paxos can tolerate F faults, and the likelihood of multiple faults occurring at the same time is fairly low. If the
algorithm is adjusted to send the message concurrently with the disk write and the disk write fails at F acceptors, there are still F+1 acceptors to complete the decision making process. In the rare case that there are more than F failures and the system is unable to reach the same decision that was optimistically made due to the concurrent messaging and writing to disk, the transaction can be rolled back. A similar optimisation was also proposed by Jimenez-Peris et al. [7]. They propose multicasting the RM's decision to a coordination group without flushing its result to the log. Multicasting is used in small groups to reduce latency; the architecture uses the coordinator group for multicasting. When the RM is required to send its decision to the coordinator, if the RM has not already written its record to disk, it multicasts its decision to the coordinator group concurrently with writing to the log. The latency of the multicast is masked behind the disk write, which is essential, and the coordinator becomes aware of the RM's decision earlier. Both approaches use the redundancy in the acceptors to achieve the optimisation. The optimisation suggested by Jimenez-Peris et al. can also be applied to Paxos Commit: the message from the RM can be sent to the acceptors before the RM writes the decision to disk. The multiple acceptors will act as durable storage for the RMs. This concept has been called an online data store or persistent memory by Mehra et al. [35].
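The optimisation amounts to overlapping the send with the forced log write instead of sequencing them. The Java sketch below illustrates the idea; sendPhase2b and forceLogWrite are placeholder operations assumed for illustration, and the sketch is not taken from either [6] or [7].

import java.util.concurrent.CompletableFuture;

// Sketch of the suggested optimisation: the Phase 2b message is sent concurrently
// with the acceptor's forced disk write instead of after it.
class OptimisticAcceptor {
    void sendPhase2b(String decision)  { /* network send (placeholder) */ }
    void forceLogWrite(String record)  { /* forced disk write (placeholder) */ }

    void onPhase2a(String decision) {
        // Start the forced write in the background and send the message immediately.
        // If the write later fails on up to F acceptors, F+1 acceptors still complete
        // the decision; with more than F failures the transaction is rolled back.
        CompletableFuture<Void> write = CompletableFuture.runAsync(() -> forceLogWrite(decision));
        sendPhase2b(decision);
        write.join();   // durability is still awaited, but no longer delays the message
    }
}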
3.6 Alternate algorithm
The results for Paxos Commit revealed a problem with using the Registrar process: the instance of the Paxos algorithm that is run for the Registrar does not complete in time, and the RM messages are received before the acceptors find out which RMs are a part of the transaction. The other problem with the Paxos Commit algorithm is that while the Registrar is non-operational due to failure, the whole transaction system remains unavailable for that duration. This occurs because the transaction processing system does not have a process performing the TM function of initiating a transaction. An alternate algorithm was designed to address these problems. In the alternate algorithm (Modified Paxos) the leader process of the Paxos algorithm performs the role of the TM.
In Paxos, if the Registrar process fails the transaction fails. The same is applicable in the Modified Paxos algorithm if the current Leader fails. The Leader informs the acceptors of the list at the time the Client sends the ‘Do Commit’ message to the Leader. In a situation where the Leader fails before the Client sends the ‘Do Commit’ message, the new Leader does not have a list of the registered RMs and the transaction terminates unsuccessfully. The Leader has to additionally send messages to all the participants in the transaction process so that all the processes know the current leader.
3.7 Summary
In this chapter the 2PC and Paxos Commit algorithms have been explained in detail, and optimisations and an alternate algorithm have been suggested. The next chapter presents the purpose of the simulation and outlines its design and implementation.
4. Design and Implementation of Simulation Framework
This chapter provides both a motivation for and a description of the design used to implement the simulation. The aim is to implement the Paxos Commit Algorithm and Two Phase Commit (2PC) in a complete transaction processing system simulation. The overall number of transactions committed and aborted over a set period of time will be compared for the two Atomic Commit Protocols.
4.1 Motivation for the Design
As stated in the previous chapters, Paxos Commit is fault tolerant and non-blocking. The simulation therefore needs to be able to introduce various kinds of faults into the system in order to measure the benefits provided by the fault tolerance. The faults that need to be introduced are as follows:
• Process Failure and Recovery – processes can fail intermittently and then recover after a certain amount of time.
• Message Failures – the messaging system may have a failure, because of which a message does not get delivered to the recipient.
• Message Delay – there is a certain amount of delay on all messages; some messages may get delayed more than others and hence create an out-of-order delivery scenario.
This appears to be the first attempt at simulating the Paxos Commit Algorithm [17].
4.1.1 Overview of Simulations
As part of the study two types of simulation were designed:
1. Simulated Message Passing – a queue is used to simulate message passing between the various processes. In this simulation a high number of failures can be introduced and a large number of transaction cycles can be simulated in a short span of time. This is possible because the simulation operates a virtual clock which advances based on the message delivery time rather than actual time; this is explained in section 4.2.1.
2. Simulation Using a Message Passing Library – this simulation uses a message passing library to pass messages between the various processes instead of using a queue to simulate the message passing. With this approach the network load and the number of messages sent impact the messaging delay, and the effects of this can be seen on both the 2PC and Paxos Commit algorithms. In the simulation that uses a queue for the messages, the message delay is a theoretical value; this simulation uses a real network, which may behave differently from the theoretical values calculated in the first simulation.
4.1.2 Assumptions
The following assumptions are made about the environment that is being simulated:
• The simulation implements the Receive Omission fault model (described in section 2.2.1).
• The failure of a node does not affect the permanent storage.
• The messaging system is asynchronous and failure prone.
• All sites fail randomly. At failure the site loses its volatile state. The stable storage does not fail, and when the site recovers it is able to read its last state from stable storage.
• All the participating resource managers perform work for every transaction. The Client progresses by assigning a task to each of the Resource Managers for every transaction.
• A failed node in the system does not behave incorrectly; it stops responding to messages and remembers its previous state when it returns to the working state.
• The messaging delay is based on a LAN.
4.2 Simulated Message Passing
The solution that was developed to account for all the requirements utilised a series of classes which model the various types of processes (Figure 21 shows the class
diagram for 2PC and Figure 24 shows the class diagram for Paxos Commit). A ‘simulation’ class was designed; this object controls the entire simulation. The simulation class instantiates objects from the process classes depending on the parameters set. The messaging is modelled as a message queue. When a process sends a message it is added to the queue and when the message is scheduled for delivery the simulation class, depending on the target for the message, passes it to the specific object instance to be processed (Figure 22 shows some of the interactions between the various classes of the simulation).
Figure 21 Class Diagram for the various processes of 2PC
The failure of the processes is triggered by the simulation sending a failure message to the specific instance. In the failed state the process does not process any messages until it receives the message to recover from the failure. The timing of the failure is triggered by a random distribution around the mean time to failure set in the simulation. A random distribution is used so that processes fail at random times in relation to the mean time to failure. If the mean time to failure were used all the processes would fail at the same time.
Figure 22 Interactions between the various classes
Message transmission also fails randomly; in this case the simulation does not deliver the message to the receiving process. The message delay is simulated by the messages being queued in order of delivery time. The delivery time is calculated using a random distribution with a constant value for the mean delay. A random distribution is used for the messages so that each transaction that does not encounter a failure does not complete in the same amount of time. Using a random distribution brings the simulated messaging system closer to a network messaging system, where each message can have a different delay due to the various factors that affect message delay. The framework allows for the execution of millions of transaction commit cycles with a high failure rate within a reasonable amount of time (approximately an hour).
4.2.1 Simulation Each transaction commit cycle is started by the simulation process. It sends a message to the client to start a new transaction. The simulation process detects if the last transaction has finished and starts a new one. The simulation process maintains a virtual clock. Every message that is processed by the simulation has a delivery time; the virtual clock advances based on this delivery time; the delivery time of the message is set based on the virtual clock. The simulation can be set to either run for a certain number of transactions or for a certain amount of virtual time.
4.2.2 Messages The simulation uses a queuing system to simulate message passing between multiple processes. A single message queue simulates the arrival order of messages to the various processes (illustrated in Figure 23). The transaction commit decision process requires communication between the various processes. The messages are simulated as a single queue sorted by time of arrival, so that the simulation can pick up the next message to be delivered and the virtual clock is advanced in a sequential order. Each message that is to be sent has an associated delay; this may be zero, i.e. for immediate delivery. The messaging system adds the network associated delay and calculates the expected arrival time based on a random distribution associated with the mean delay set in the simulation (The values used for the mean delay are listed in section 4.5).
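The queue and virtual-clock mechanism can be sketched as follows in Java; the class and method names are illustrative and do not correspond to the actual simulation code, and the exponential distribution used for the network delay is an assumption (the simulation only requires some random distribution around the mean delay).

import java.util.PriorityQueue;
import java.util.Random;

// Sketch of the simulated message passing: a single queue ordered by delivery time
// drives a virtual clock that jumps forward to each message's arrival time.
public class SimulatedMessaging {
    static class Message {
        final double deliveryTime;
        final String destination;
        final String payload;
        Message(double t, String d, String p) { deliveryTime = t; destination = d; payload = p; }
    }

    private final PriorityQueue<Message> queue =
            new PriorityQueue<>((a, b) -> Double.compare(a.deliveryTime, b.deliveryTime));
    private final Random rnd = new Random();
    private final double meanDelay;
    private double virtualClock = 0.0;

    SimulatedMessaging(double meanDelay) { this.meanDelay = meanDelay; }

    // Sender-specified delay plus a randomised network delay around the mean
    // (exponentially distributed here, as an assumption).
    void send(String destination, String payload, double extraDelay) {
        double networkDelay = -meanDelay * Math.log(1.0 - rnd.nextDouble());
        queue.add(new Message(virtualClock + extraDelay + networkDelay, destination, payload));
    }

    // Deliver the next message and advance the virtual clock to its arrival time.
    Message deliverNext() {
        Message next = queue.poll();
        if (next != null) virtualClock = next.deliveryTime;
        return next;
    }

    double now() { return virtualClock; }
}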
Figure 23 Internal working of the message class
As the simulation progresses the next message in the queue is picked up. The virtual time is advanced to the arrival time for the message.
4.2.3 Counters At the end of each transaction cycle, the relevant processes record the results of the transaction and other associated statistics. The time taken for a transaction is calculated by checking the elapsed time for each of the processes in the transaction and using the largest amount of time taken by any of the processes. This could be the RM in case it failed before receiving the commit decision and then, after recovering, completed the transaction.
4.2.4 2PC Processes The 2PC processes are designed as per the state diagrams described in section 3.1
4.2.5 Paxos Commit Processes Each of the Paxos Commit processes is designed according to the state transition diagrams discussed in section 3.2. Some additional implementation decisions had to be made which are described below.
Figure 24 Class diagram for the processes of Paxos Commit
Acceptors
For the Paxos consensus algorithm to progress, one of the Acceptors has to act as the leader process. The leader is decided at the start of the algorithm, and a new leader is elected if/when the previous one fails. The (n+4)-stable leader election with message loss [36] has been used as the algorithm for leader election amongst the acceptors. This algorithm was chosen because it is a stable¹⁰ leader election algorithm that can tolerate message loss. The algorithm is provided in Appendix A.
Client
The Client sets a timeout when it has finished assigning work to all the RMs, just before sending the ‘Do Commit’ message to the Registrar; the timeout uses the commit request timeout constant.
4.2.6 Optimisation Implementation
Gray et al. [6], while outlining the Paxos Commit algorithm, use the Paxos consensus algorithm for each RM's Abort/Commit decision. The discussion suggests optimisations to the algorithm which should provide improved performance. In the simulations some of these optimisations have been tested against normal execution. For example, as discussed in section 3.3.2, if the Acceptor and RM are on the same node, the message sent between these two processes does not need to be counted. To
10
An algorithm is stable if it ensures that once a leader is elected, it remains the leader for as long as it does not crash and its links have been behaving well.
70
achieve this, the node process is used to assign physical locations to the processes; if the message’s source and destination are the same the message is delivered instantaneously.
Node Process
A node object allows the location of processes to be specified; the message delay can be reduced to zero when two processes are located on the same node. The mapping between nodes and processes is parameterised. For Paxos Commit each Acceptor is assigned to the same node as one of the RMs. A failure message can be sent directly to a node, which then forwards messages to each of its associated processes.
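A minimal sketch of how node co-location can collapse the message delay to zero; the names (NodeMap, sameNode, delayFor) are hypothetical and only illustrate the mapping described above.

import java.util.HashMap;
import java.util.Map;

// Sketch of the node mapping: processes assigned to the same node exchange
// messages with zero delay, otherwise the normal network delay applies.
class NodeMap {
    private final Map<String, Integer> processToNode = new HashMap<>();

    void assign(String processName, int nodeId) { processToNode.put(processName, nodeId); }

    boolean sameNode(String src, String dst) {
        return processToNode.containsKey(src)
                && processToNode.get(src).equals(processToNode.get(dst));
    }

    long delayFor(String src, String dst, long networkDelay) {
        // e.g. RM1 and Acceptor1 paired on node 1 => instantaneous delivery
        return sameNode(src, dst) ? 0 : networkDelay;
    }
}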
4.3 Simulation Using Message Passing Library
The simulated message passing provides a good basis for comparing the 2PC and Paxos Commit protocols, but it relies on mean values for network message delays and disk writes. The JPVM [37] message passing library is therefore used to develop a second simulation in which the various participants communicate over a real network. Because the participants communicate over the network, the message delay is not drawn from a random distribution but is the actual delay within the network at that time, determined by the network load; this kind of simulation accounts for network load more faithfully. The message passing simulation is not ideal for running millions of transaction cycles, as the time spent is real rather than virtual: in the simulated message passing, a virtual time of approximately 28 hours executes in approximately 2 hours, whereas the message passing simulation run for 28 hours simply takes 28 hours of real time.
4.3.1 Counters
In the message passing simulation, the statistics for the transaction and for messaging are recorded at the end of each transaction. The difference in this simulation is that the time taken by a transaction is not calculated as the maximum of the elapsed times at all the processes, but only as the elapsed time at the Client. It was harder in this simulation to collect the transaction completion times from all the processes and compute the maximum per transaction, so the decision was made to use the completion time at the Client, as this is where the next transaction is started once the previous one has completed. In the case of an RM failure the Client is likely to learn the outcome of the transaction before the RM does. In the simulated message passing, by contrast, the elapsed time is not calculated until all the participants have finished processing the transaction; there a single RM failure causes only a single transaction to abort, because a new transaction is not started until the previous one has finished at all the participants, whereas in the message passing simulation a single RM failure can cause more than one transaction to abort.
4.4 Design of Modified Paxos
In the Paxos consensus algorithm the leader process is only required to ensure progress. In the Modified Paxos commit model, the leader also takes over the role of the TM. The Paxos consensus algorithm uses ballot numbers for proposing values because different proposers may choose different values; in the case of distributed commit, each RM's prepared/abort decision only ever has a single proposer proposing a single value at a time, so ballot numbers are not used in this implementation. The Modified Paxos commit algorithm also omits Phase 1 of the Paxos consensus algorithm: Phase 1 finds the highest ballot number that has been executed and determines whether the leader is free to propose a value, and since ballot numbers are not used, Phase 1 is not required. The commit algorithm starts in phase 2b of the consensus algorithm, with the RM proposing a value. Each of the acceptors communicates this value to the leader. The leader, acting as the TM, is the only party interested in the value and makes the commit decision based on it; the commit decision is then communicated to all the RMs. If a new leader is elected, it queries the state of the RMs' commit/abort decisions from all the other acceptors; this equates to Phase 1 through to Phase 2a of the Paxos consensus algorithm.
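The leader's role described above can be sketched roughly as follows. This is only an illustration of the message flow under the assumptions stated in this section (no ballot numbers, Phase 1 omitted); the class and message names are hypothetical and do not describe the thesis implementation.

import java.util.HashMap;
import java.util.Map;

// Rough sketch of the Modified Paxos leader, which also plays the TM role:
// it collects each RM's prepared/abort value (relayed by the acceptors as
// phase 2b messages) and decides commit once all RMs are prepared.
class ModifiedPaxosLeader {
    enum Vote { PREPARED, ABORT }

    private final int numRms;
    private final Map<Integer, Vote> votes = new HashMap<>();

    ModifiedPaxosLeader(int numRms) { this.numRms = numRms; }

    // Called when an acceptor relays RM rmId's phase 2b value to the leader.
    String onPhase2b(int rmId, Vote vote) {
        votes.putIfAbsent(rmId, vote);
        if (vote == Vote.ABORT) {
            return "ABORT";                        // broadcast abort to all RMs
        }
        if (votes.size() == numRms
                && votes.values().stream().allMatch(v -> v == Vote.PREPARED)) {
            return "COMMIT";                       // broadcast commit to all RMs
        }
        return null;                               // decision not yet known
    }
}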
4.4.1 Disk Write Analysis
The acceptors need to write to stable storage before sending the phase 2b message, or, in terms of our algorithm, the prepared message. The disk writes for all the RMs are bundled into a single disk write.
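A minimal sketch of the bundling idea, assuming an acceptor buffers the prepared/abort values it has received for a transaction and flushes them with one stable-storage write (all names here are hypothetical):

import java.util.ArrayList;
import java.util.List;

// Sketch: instead of one stable-storage write per RM, the acceptor records the
// prepared/abort values of all RMs for a transaction in a single disk write.
class AcceptorLog {
    private final List<String> pendingRecords = new ArrayList<>();

    void recordVote(String transactionId, int rmId, String vote) {
        pendingRecords.add(transactionId + ":" + rmId + ":" + vote);
    }

    // One forced write covers every RM's record for this transaction.
    void flushBundled(StableStorage storage) {
        storage.forceWrite(String.join("\n", pendingRecords));
        pendingRecords.clear();
    }
}

interface StableStorage { void forceWrite(String records); }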
4.5 Constants Used

4.5.1 Simulated Messaging
The simulations used values for calculating message delays, disk delays, process lifetimes and the various timeouts. The list below gives these constants [38]:

Process Lifetime: 100 seconds. The process lifetime is set to such a low value in order to cause a high number of process failures.
Process Recovery Time: 1 second. The recovery time is also very short; as the processes fail so often, they need to recover in a comparable amount of time.
Mean Message Delay: 100 msecond. In modern local area networks messaging is close to instantaneous; the value has been chosen to model real-world LAN messaging.
Mean Disk Delay: 5 milliseconds. A disk write is more expensive than a network message; this value models the difference.
Request Timeout: 50 milliseconds. The amount of time to wait for a message response.
Transaction Timeout: 100 milliseconds. The transaction fails after this time if it is not already in the commit decision stage.
Commit Request Timeout (Paxos Commit only): 80 milliseconds. This value is used by the Client to set a timeout after it finishes assigning work to the RMs.

The timeouts were chosen after some testing of the simulations; these values were selected so as to allow transactions to complete under regular processing, where processes were active.
4.5.2 Message Passing
These constants are used for calculating delays for timeouts and process failures in the message passing simulations. The reasons for the values are the same as for the values in section 4.5.1.

Process Lifetime: 100 seconds
Process Recovery Time: 1 second
Transaction Timeout: 350 milliseconds
Request Timeout: 150 milliseconds
Mean Disk Delay: 5 milliseconds
4.6 Summary
This chapter detailed the motivations for the design of the simulation framework. The architecture of the simulator has been described, with details of how 2PC and Paxos Commit are implemented on the simulation framework. The optimisations tested as part of this work have been described and a description of the Modified Paxos algorithm has been given. The chapter also listed the assumptions made for the simulations and the constants used to emulate the environment.
The next chapter presents the results and analysis for the tests run on the simulation framework.
5. Results and Analysis
The previous chapters have explained the concept of atomic commit in a distributed transaction processing system and the concepts that are essential to understand commit protocols. Algorithms for Two Phase Commit (2PC), Paxos Commit and Modified Paxos have been explained, along with the framework used to simulate them. This chapter presents the results obtained by running these simulations under various settings that emulate different system conditions and protocol configurations.
5.1 Glossary
Before the results are presented, a description of the row headings used in the results tables is provided:

Events: The number of steps processed by the simulation.
Messages: The number of messages sent.
Messages Failed: The number of messages that were not delivered.
Messages Dead Letter: The number of messages that reached a process after that process had failed.
Process Failures: The number of processes that failed during the whole simulation.
Application Messages: The number of messages related to the transaction application (the remaining messages are sent to control the processes and the simulation, e.g. messages that cause process failures).
Commits: The number of transactions that were committed.
Aborts: The number of transactions that were aborted.
Virtual Time: The amount of time that was simulated, in mseconds.
5.2 Overview
The results provided are for various test sets as follows:

1. Failure Free Environment – Paxos Commit, Modified Paxos and 2PC are run in an environment where there are no failures. Paxos Commit and Modified Paxos are also run with no fault tolerance; in this setting Paxos Commit and Modified Paxos are equivalent to 2PC, which allows them to be compared for message overheads. In a failure free environment the commit protocols should have a 100% commit record; if the tests reveal a different result this will be due to an error in the commit protocol. The Paxos Commit protocol is run with and without the optimisations discussed before, namely:

• Bundling of Phase 2 messages – bundling the Phase 2 messages reduces the number of messages sent, thus reducing the message overhead in Paxos Commit. Running the simulations with and without this optimisation allows the impact of the number of messages on the overall performance of the Paxos Commit protocol to be analysed.

• Durable Write Optimisation – this optimisation uses the redundancy in Paxos Commit to improve performance by sending the Phase 2 message in parallel with the durable write. This shows the impact of the durable write on the overall performance of the Paxos Commit protocol.

2. Message Failure – the tests run in the failure free environment are also run in an environment where the messaging system is unreliable, i.e. messages may not be delivered. The simulations in this environment show how the protocols deal with intermittent message failures.

3. Process and Message Failure – this environment models an asynchronous system in which processes can fail, messages may be delayed or not delivered, and the application that starts the transaction can also fail. All the tests described for the failure free environment are also run here, which allows us to measure how the protocols behave and cope with all kinds of failures in such an environment.
The simulations have been run with the number of Resource Managers (RMs) set to 3, 5, 7 and 9. This was done so that the numbers of Acceptors and RMs increase in the same increments: in Paxos Commit an Acceptor and an RM should be paired on the same node to improve the protocol's performance, and using the same increments allows these processes to be paired. It is not necessary to have RMs in these increments; the 2PC protocol was also run with 1, 2, 4, 8, 16 and 32 RMs, and a graph showing the average transaction commit times is presented in Appendix C.
5.3 Failure Free Environment
To be able to analyse the level of improvement provided by the Paxos Commit algorithm, the first set of results compares the performance of the Two Phase Commit (2PC) algorithm and the Paxos Commit algorithm, with varying properties, when no faults occur.
5.3.1 Simulated Message Passing

Two Phase Commit

                      3 RM             5 RM             7 RM             9 RM
events:               247,335,530      353,139,638      446,300,916      529,084,520
messages:             354,514,266      499,001,673      626,260,973      739,361,714
messagesFailed:       -                -                -                -
messagesDeadLetter:   -                -                -                -
processFailures:      -                -                -                -
applicationMessages:  247,335,530      353,139,638      446,300,916      529,084,520
commits:              8,244,517        7,676,948        7,198,402        6,783,135
aborts:               -                -                -                -
virtualTime (ms):     100,000,004,057  100,000,004,762  100,000,002,912  100,000,001,583
Table 1 Summary of Results for 2PC in fault free environment
time in ms
10 11 12 13 14 15 16 17 18 19 34 35 36 37
3 RM Commits
5 RM Commits
7 RM Commits
9 RM Commits
30,727 4,886,544 3,225,152 101,555 537 1 1 -
2 293,425 4,740,978 2,504,814 136,061 1,660 7 1 -
1,443 794,973 4,359,850 1,903,714 135,745 2,655 19 1 1 -
23,101 1,362,612 3,831,489 1,441,363 121,139 3,383 43 1 2 1
Average Transaction Time 11.41 12.32 13.19 14.04 Table 2 Transaction completion times for 2PC in fault free environment
In the failure free situation all transactions are completed, as no failure occurs. The completion time is dependent on the message delay. The results in Table 2 demonstrate this: as the number of resource managers increases, the average amount of time required to complete a transaction increases.
Paxos Commit
No Faults to Tolerate
To compare the overheads of the Paxos Commit protocol, Paxos Commit is simulated in the configuration where it does not tolerate any faults. The number of acceptors in this case is 1; the protocol in this situation is equivalent to 2PC.
                      3 RM             5 RM             7 RM             9 RM
events:               173,060,514      224,153,100      282,562,178      347,363,780
messages:             239,734,602      300,191,334      368,098,482      442,186,228
messagesFailed:       -                -                -                -
messagesDeadLetter:   -                -                -                -
processFailures:      -                -                -                -
applicationMessages:  173,060,514      224,153,100      282,562,178      347,363,780
commits:              1,287,691        916,615          734,563          621,942
aborts:               1,032,623        1,066,787        1,077,021        1,078,823
amnesia:              -                -                -                -
virtualTime (ms):     100,000,000,640  100,000,000,349  100,000,000,640  100,000,000,640
Table 3 Results for Paxos Commit without optimisation in the failure free environment
3 RM 5 RM 7 RM 9 RM time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts 10 1,249 11 639,609 18,020 40 12 621,010 515,092 54,549 944 13 25,649 359,924 428,518 95,916 14 174 23,271 231,789 352,534 15 307 19,207 157,005 16 457 15,054 17 2 477 18 1 12 80 51,611 67 81 832,218 202,918 5,349 28 82 147,120 733,288 358,739 34,315 83 1,671 127,605 606,289 471,053 84 3 2,894 103,016 487,652 85 15 3,593 82,005 86 35 3,710 87 59 88 1 Average Time 11.5215 81.0957 12.4248 81.9347 13.2954 82.7594 14.1460 83.5827 Table 4 Commit and Abort times for Paxos Commit without optimisation in failure free environment
As the system has no failures, the commit protocol should commit all transactions, yet in this set of results a very high abort rate can be seen. The Registrar's instance of Paxos must finish before the Phase 2 messages are received by the acceptors; this ensures that the acceptors have the list of registered RMs for the current transaction. In this scenario one of the Phase 2a messages is received before the Registrar's instance of Paxos finishes; the acceptors ignore the Phase 2a message because they do not yet know the list of registered RMs, and the transaction is aborted. Overall there are fewer events in this scenario than in the results for 2PC (Table 1) because the aborts are only initiated after the Commit Request Timeout fires, 80 milliseconds after the transaction starts; as a result the average transaction time increases from about 11ms for 2PC to about 46ms ((11.52 + 81.10)/2) for Paxos Commit.
Phase 2 Messages Bundled 3 RM 5 RM 7 RM 9 RM events: 157,376,204 185,037,763 212,499,750 238,969,966 messages: 224,070,165 261,073,341 298,095,996 333,861,742 messagesFailed: messagesDeadLetter: processFailures: applicationMessages: 157,376,204 185,037,763 212,499,750 238,969,966 commits: 1,288,524 915,925 736,198 623,292 aborts: 1,033,071 1,067,496 1,077,291 1,079,084 virtualTime: 100,000,000,640 100,000,000,640 100,000,000,640 100,000,000,640 Table 5 Results for Paxos Commit with message bundling in failure free environment
When the Phase 2b messages are bundled, the overall number of transactions executed (committed and aborted) increases in comparison to the results from Table 3. The number of events and number of messages decreases. This shows that the system has less network traffic and transactions are processed faster than when the messages are not bundled. 3 RM 5 RM 7 RM 9 RM Commits Aborts Commits Aborts Commits Aborts time in ms Commits Aborts 10 2,498 11 684,753 27,209 97 12 578,231 540,369 70,496 1,676 13 22,873 328,269 439,076 115,388 14 169 19,809 209,751 351,832 15 269 16,441 141,351 16 333 12,662 17 4 375 18 8 80 52,071 54 81 831,314 204,160 5,468 18 82 148,007 732,548 359,185 33,910 83 1,673 127,866 605,999 471,373 84 6 2,851 103,005 488,032 85 17 3,588 81,988 86 46 3,700 87 63 Average Time 11.48 81.10 12.37 81.93 13.23 82.76 14.08 83.58 Table 6 Transaction completion times for Paxos Commit with bundling in fault free environment
The average time taken for transactions to complete is reduced marginally: for 3 RMs the average commit time falls from 11.52ms without message bundling (Table 4) to 11.48ms with message bundling.
Durable Write Optimised
In the scenario where Paxos Commit is run in the no fault tolerance setting, the disk write optimisation should not be used, as there is no redundancy available (F = 0, so there is only one Acceptor).
One Fault to Tolerate

                      3 RM             5 RM             7 RM             9 RM
events:               339,942,114      459,831,023      601,737,477      763,562,448
messages:             454,427,403      581,920,459      731,715,650      901,411,437
messagesFailed:       0                0                0                0
messagesDeadLetter:   0                0                0                0
processFailures:      0                0                0                0
applicationMessages:  339,942,114      459,831,023      601,737,477      763,562,448
commits:              664,594          409,383          288,080          217,353
aborts:               959,579          1,031,676        1,064,846        1,080,972
amnesia:              0                0                0                0
virtualTime (ms):     100,000,000,640  100,000,000,640  100,000,000,006  100,000,000,640
Table 7 Results for Paxos commit without optimisation in the fault free environment where one fault is tolerated
As fault tolerance is introduced in Paxos Commit, the commit to abort ratio becomes considerably lower than in Table 3: for 3 RMs it is 69:100 where one fault is tolerated, compared with 124:100 where no faults are tolerated (Table 3). The overall number of transactions processed is also lower, because the average transaction completion time increases from 46.31ms to 46.63ms. The number of messages sent has increased, since additional messages need to be sent to the extra acceptors introduced for fault tolerance.
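For reference, the quoted ratios follow directly from the commit and abort counts in Table 7 and Table 3 for 3 RMs:

\[
\frac{664{,}594}{959{,}579} \approx 0.69 \;(\approx 69{:}100)
\qquad\text{and}\qquad
\frac{1{,}287{,}691}{1{,}032{,}623} \approx 1.24 \;(\approx 124{:}100).
\]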
9 RM 3 RM 5 RM 7 RM time in Commits Aborts Commits Aborts Commits Aborts Commits Aborts ms 10 423 11 328,962 9,099 21 12 314,884 235,634 25,018 472 13 11,389 149,230 170,419 39,007 14 51 8,479 81,280 121,570 15 94 5,858 47,777 16 110 4,068 17 1 118 18 3 60 32 61 5,411 432 8 62 3,342 4,554 908 34 63 99 1,784 3,348 1,309 64 1 77 1,059 2,355 65 49 608 66 1 30 67 1 80 49,158 83 81 744,245 195,756 5,618 25 82 128,697 691,234 352,949 34,660 83 1,410 119,478 588,311 468,702 84 2,666 99,315 482,100 85 16 3,359 80,731 86 45 3,696 87 57 88 1 1 Average 12.1814 81.0892 13.2243 81.9296 14.1704 82.7541 15.0696 83.5796 Table 8 Transaction completion times for Paxos Commit with one fault setting in fault free environment
In Table 4 the commits complete in the 10ms to 18ms range; in this table we see that some transactions commit in the 60ms to 67ms region. This happens when a transaction completes only after a timeout causes the result to be re-requested. As the number of RMs is increased, the average amount of time required per transaction also increases, and the number of transactions that are aborted also increases when more RMs are involved in a transaction.
Phase 2 Messages Bundled

                      3 RM             5 RM             7 RM             9 RM
events:               301,805,563      364,742,664      428,756,371      492,620,007
messages:             415,690,018      485,986,280      557,728,823      629,449,785
messagesFailed:       -                -                -                -
messagesDeadLetter:   -                -                -                -
processFailures:      -                -                -                -
applicationMessages:  301,805,563      364,742,664      428,756,371      492,620,007
commits:              580,994          321,577          207,140          146,368
aborts:               991,139          1,070,787        1,103,733        1,117,580
amnesia:              -                -                -                -
virtualTime (ms):     100,000,000,640  100,000,000,640  100,000,000,640  100,000,000,640
Table 9 Results for Paxos commit with message bundling with one fault to tolerate in fault free environment 9 RM 3 RM 5 RM 7 RM time in Commits Aborts Commits Aborts Commits Aborts Commits Aborts ms 10 396 11 270,891 5,947 8 12 291,036 173,552 14,633 269 13 12,874 129,710 117,605 21,935 14 83 9,026 66,765 81,186 15 2 121 5,775 37,613 16 108 3,709 17 2 105 18 3 60 23 61 3,443 184 2 62 2,187 2,133 360 10 63 59 861 1,398 426 64 42 465 880 65 19 218 66 14 80 50,174 71 81 783,805 204,431 5,512 20 82 137,274 728,640 367,915 35,514 83 1,469 127,149 617,667 487,570 84 3 2,750 105,093 503,793 85 20 3,588 84,515 45 3,786 86 87 1 78 Average 12.0405 81.0926 12.9452 81.9324 13.8513 82.7577 14.6838 83.5819 Table 10 Transaction completion times for Paxos commit with one fault to tolerate with message bundling in fault free environment
The average transaction commit times are better with message bundling than those in Table 8: 12.04ms where messages are bundled against 12.18ms where they are not.
The average times are shown in both tables because, although the total number of commits was lower in the message bundling simulation, the individual transaction commit times were expected to be quicker.
Durable Write Optimised
3 RM 5 RM 7 RM 9 RM events: 1,066,173,039 1,675,596,123 2,364,412,167 3,119,683,368 messages: 1,295,685,319 1,956,387,875 2,690,312,335 3,485,635,521 messagesFailed: messagesDeadLetter: processFailures: applicationMessages: 1,066,173,039 1,675,596,123 2,364,412,167 3,119,683,368 commits: 8,398,483 7,834,749 7,350,550 6,927,652 aborts: 365 2 1 amnesia: virtualTime: 100,000,000,640 100,000,000,640 100,000,000,020 100,000,000,640 Table 11 Results for Paxos commit with disk write optimisation with one fault to tolerate in fault free environment
The results when 3 acceptors are used, as compared to 1 acceptor, show that fewer transactions are completed, but only marginally. The optimisation is only viable when there is redundancy. The results are far better than those of the non-optimised simulation. The problem of the Registrar’s instance of Paxos not finishing before the Phase 2b messages, described on page 80, does not occur with the disk optimisation, as the Registrar’s instance of Paxos always completes before the Phase 2b message is received from any of the RMs. 3 RM 5 RM 7 RM 9 RM time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts 5 12,061 6 4,455,494 220,003 899 7 3,788,372 4,619,627 704,552 19,425 8 141,726 2,824,422 4,386,080 1,296,215 9 828 168,584 2,094,271 3,908,158 10 1 2,100 161,159 1,559,982 11 13 3,555 139,587 12 33 4,220 13 64 80 209 81 156 2 83 1 Average 6.4837 80.4274 7.3763 81.0000 8.2341 83.0000 9.0746 0.0000 Table 12 Transaction completion times for Paxos commit with disk write optimization with one fault to tolerate in fault free environment
The transactions complete considerably faster than when using 2PC: the average transaction commit times are 6ms, 7ms, 8ms and 9ms for optimised Paxos Commit, compared with 11ms, 12ms, 13ms and 14ms for 2PC and 12ms, 13ms, 14ms and 15ms for Paxos Commit without optimisation; this is an improvement of 46% over Paxos Commit without optimisation.
Phase 2 Messages Bundled and Durable Write Optimised
3 RM 5 RM 7 RM 9 RM events: 890,814,519 1,184,711,969 1,443,180,490 1,672,504,037 messages: 1,120,320,498 1,465,506,105 1,769,086,715 2,038,444,381 messagesFailed: messagesDeadLetter: processFailures: applicationMessages: 890,814,519 1,184,711,969 1,443,180,490 1,672,504,037 commits: 8,398,457 7,834,934 7,350,732 6,927,709 aborts: 365 2 amnesia: virtualTime: 100,000,000,640 100,000,000,635 100,000,000,013 100,000,000,018 Table 13 Results for Paxos with one fault to tolerate with disk write optimisation and message bundling in fault free environment
The number of commits achieved with bundling and optimisation as compared to only optimisation is equivalent. The number of messages sent is less compared to the results from Table 11. 3 RM 5 RM 7 RM 9 RM Aborts Commits Aborts Commits Aborts Commits Aborts time in ms Commits 5 13,602 2 6 4,526,377 227,779 961 7 3,721,652 4,677,157 719,403 19,532 8 136,092 2,768,068 4,425,495 1,319,475 9 729 159,950 2,048,899 3,924,746 10 3 1,966 152,747 1,525,917 11 12 3,191 133,963 12 35 4,024 13 51 14 1 29 1 30 1 80 253 81 112 2 Average 6.4742 80.3068 7.3658 81.0000 8.2235 0.0000 9.0646 0.0000
Table 14 Transaction completion times for Paxos with one fault to tolerate with disk write optimisation and message bundling in fault free environment
The improvement can be seen in the transaction completion times: more transactions are committed within 6ms in the 3 RM scenario than in Table 12. The Phase 2b message bundling causes fewer messages to be sent, which reduces the overall transaction completion time; this can be seen in the reduction of the average transaction commit time from 6.4837ms (Table 12) to 6.4742ms.
Modified Paxos
No Faults To Tolerate

                      3 RM             5 RM             7 RM             9 RM
events:               579,844,464      803,440,533      1,014,712,973    1,215,852,287
messages:             725,074,531      986,816,737      1,231,721,110    1,462,770,231
messagesFailed:       -                -                -                -
messagesDeadLetter:   -                -                -                -
processFailures:      -                -                -                -
applicationMessages:  579,844,464      803,440,533      1,014,712,973    1,215,852,287
commits:              8,230,719        7,668,810        7,192,620        6,778,686
aborts:               -                -                -                -
virtualTime:          100,000,000,257  100,000,000,074  100,000,000,025  100,000,000,200
Table 15 Results for Modified Paxos in fault free environment
The results for Modified Paxos are similar to those for 2PC, but the number of messages sent has doubled even though there is only one acceptor. The results from the Message Passing simulation will demonstrate more clearly how the increased number of messages contribute to the performance. 3 RM 5 RM 7 RM 9 RM Commits Commits Commits time in ms Commits 10 13,425 11 4,117,047 183,164 689 12 3,922,992 4,273,609 576,692 13,763 13 176,032 2,998,716 4,138,385 1,083,933 14 1,220 210,189 2,273,830 3,777,826 15 2 3,117 198,343 1,725,434 16 15 4,646 172,130 17 35 5,540 18 60 34 1 Average 11.5182 12.4232 13.2929 14.1438 Table 16 Transaction commit times for Modified Paxos in fault free environment
The completion times are also similar to those for 2PC.
One Fault to Tolerate 3 RM 5 RM 7 RM 9 RM events: 813,698,447 1,060,214,193 1,291,850,289 1,511,188,517 messages: 1,018,527,166 1,303,218,365 1,568,533,321 1,817,836,970 messagesFailed: messagesDeadLetter: processFailures: applicationMessages: 813,698,447 1,060,214,193 1,291,850,289 1,511,188,517 commits: 8,202,050 7,650,208 7,180,116 6,770,263 aborts: virtualTime: 100,000,000,640 100,000,000,521 100,000,000,021 100,000,000,640 Table 17 Results for Modified Paxos where it tolerates one fault in fault free environment
The results are still similar to 2PC and the results from Table 15, but because of the fault tolerance there are now more processes and thus more messages.
3 RM 5 RM 7 RM 9 RM time in ms Commits Commits Commits Commits 10 8,686 1 11 4,163,096 163,405 513 12 3,883,194 4,349,041 562,670 12,281 13 146,272 2,952,483 4,209,563 1,086,018 14 799 183,015 2,226,141 3,830,781 15 3 2,257 177,462 1,681,394 16 6 3,727 155,324 17 39 4,408 18 1 57 Average 11.5083 12.4133 13.2825 14.1322 Table 18 Transaction commit times for Modified Paxos with one fault tolerance in fault free environment.
In comparison to Table 16 there are fewer commits in the 10ms range for the 3 RM simulation. In the fault tolerant mode Modified Paxos has more Acceptors and therefore sends more messages; this can be seen from the increase in messages from 725,074,531 (Table 15), where faults are not tolerated, to 1,018,527,166 where one fault is tolerated. This increase in messages causes fewer transactions to commit in the 10ms range.
5.3.2 Message Passing
The results for the simulation performed with real message passing cannot be compared directly against the results for the simulated message passing: the two simulations have been implemented in different languages and the processing times of their run-time environments differ.
The JPVM messaging library adds a degree of delay on top of the network messaging. The study by Stankovic et al. [39] compares various Java based message passing libraries, and according to them JPVM on a PC has a message latency of 74ms. Results similar to the ones presented in the following section were achieved by setting the mean message delay of the simulated message passing to 1.4ms, in comparison to the 100ms used for all the tests presented in the previous section. The appendix shows the results for 2PC with 3 RMs and a mean message delay of 1.4ms in the fault free environment.
Two Phase Commit 3 RM 5 RM 7 RM 9 RM TotalSimulation: 5,061 3,859 3,233 2,768 Messages: 131,562 154,102 174,156 187,486 MessageDeadLetter: 0 0 0 0 MessagesFailed: 0 0 0 0 ApplicationMessages: 131,559 154,098 174,153 187,483 processFailures: 0 0 0 0 Aborts: 0 4 2 4 Commits: 5,060 3,850 3,223 2,756 Time: 200,000,000 200,000,000 200,000,000 200,000,000 Table 19 Results for 2PC in fault free environment
The results show that there are practically no aborts for 2PC in the fault free environment. The number of transactions completed reduces as more RMs are introduced.
The commit times increase as more RMs are used for the transaction.
Modified Paxos No Faults To Tolerate 3 RM 5 RM 7 RM 9 RM TotalSimulation: 3,811 2,885 2,504 1,982 Messages: 135,775 151,732 174,175 174,846 MessageDeadLetter: MessagesFailed: ApplicationMessages: 130,923 144,742 164,695 163,606 processFailures: Aborts: 34 39 53 22 Commits: 3,777 2,846 2,451 1,960 Table 21 Results for Modified Paxos in fault free environment with no faults to tolerate
The results show that, in the same amount of time as 2PC, fewer transactions are completed. A few aborts can be seen in these results; Table 22 shows that the majority of these happen as soon as the transaction starts. The leader election algorithm has a small lead time before a Leader is elected; if a Client sends a “Begin Transaction” message during this time, the transaction is aborted since there is no Leader to process the message. This is the cause of the aborts at the beginning of the simulation. 3 RM 5 RM 7 RM 9 RM time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts 0 32 35 39 12 20 11 30 207 109 1 40 842 3 52 50 1,512 509 407 35 60 749 1,018 659 261 1 70 295 628 488 465 80 90 281 298 376 90 24 102 171 1 246 100-150 34 1 187 320 487 151-200 9 1 5 2 32 1 53 201-299 4 3 1 16 5 25 2 300-399 1 8 3 12 4 400-1100 4 2 MaxTime 1 Table 22 Transaction completion times for Modified Paxos in fault free environment with no faults to tolerate
The average transaction commit time is higher for Modified Paxos than for 2PC.
One Fault Tolerated 3 RM 5 RM 7 RM 9 RM TotalSimulation: 2,383 1,691 1,173 849 Messages: 128,564 132,167 124,512 116,185 MessageDeadLetter: MessagesFailed: ApplicationMessages: 123,890 125,277 115,609 105,404 processFailures: Aborts: 39 34 33 54 Commits: 2,344 1,657 1,140 795 Table 23 Results for Modified Paxos in fault free environment with one fault to tolerate
The number of messages sent is similar to results in Table 21 but the number of transactions committed is fewer. As there are more processes in the transaction system there are more messages. 3 RM 5 RM 7 RM 9 RM time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts 0 26 20 21 17 30 21 1 1 40 121 50 451 10 60 603 59 70 386 259 6 80 197 348 37 90 117 258 107 1 100-150 232 420 524 236 1 151-200 88 5 71 2 89 2 96 201-299 125 4 214 331 2 360 6 300-399 3 18 3 45 5 96 29 400-1100 2 4 1 3 6 17 MaxTime 1 4 1 Table 24 Transaction completion times for Modified Paxos in fault free environment
The average transaction completion time is higher than in Table 22, because more messages are sent to complete each transaction.
5.4 Message Failures
The simulations of Paxos Commit and Modified Paxos without fault tolerance were only run in the fault free system, in order to compare the overhead added by the protocols without any fault tolerance, so they are not repeated here. In this scenario both messages and the application fail randomly, and an application failure causes the transaction to abort.
5.4.1 Simulated Message Passing
Two Phase Commit

                      3 RM             5 RM             7 RM             9 RM
events:               237,792,016      333,676,955      415,392,133      485,930,299
messages:             341,093,184      471,919,348      583,473,266      679,815,538
messagesFailed:       197,968          282,704          354,747          417,390
messagesDeadLetter:   0                0                0                0
processFailures:      0                0                0                0
applicationMessages:  237,792,016      333,676,955      415,392,133      485,930,299
commits:              7,914,251        7,237,814        6,680,896        6,208,084
aborts:               24,255           36,768           46,941           56,512
virtualTime:          100,000,004,883  100,000,003,901  100,000,000,204  100,000,000,026
Table 25 Results for 2PC where messages fail
The results show that not all message failures cause the transaction to abort. As described earlier, the various processes set timeouts to make sure that they receive timely responses, and when such a timeout fires the process resends the message. The results in Table 26 show that message failures add more variance to the transaction completion times; the transactions that take over 50ms complete only after one of the processes has hit a timeout.
3 RM 5 RM 7 RM 9 RM time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts 0 10,084 9,320 8,434 7,853 1 12,472 16,325 15,257 14,304 2 1,230 9,279 14,413 14,616 3 16 1,029 6,771 12,396 4 9 726 4,933 5 1 17 586 6 22 10 28,679 2 11 4,575,292 267,568 1,280 1 12 3,027,043 4,302,961 702,093 19,690 13 95,575 2,273,627 3,839,792 1,169,205 14 495 122,619 1,678,556 3,280,466 15 1,518 119,402 1,236,229 16 11 2,307 104,211 17 1 17 2,814 18 37 20 77 72 83 54 21 191 246 252 244 22 30 231 383 396 23 27 246 472 24 46 239 25 39 26 1 30 557 31 100,618 10,205 104 32 80,797 157,093 37,460 1,751 33 3,182 91,819 194,430 78,555 34 16 5,874 90,204 212,364 35 1 78 7,198 83,248 36 159 7,553 37 3 206 38 3 50 22 58 47 64 49 51 967 75 173 80 10 53 57 52 946 10 2,411 78 817 98 62 62 53 46 1,671 4 4,393 63 2,211 87 54 139 1 2,348 12 6,335 53 55 3 213 2,619 10 56 4 275 57 16 70 1 1 1 71 7 3 3 2 1 1 72 6 23 2 10 6 73 12 1 59 1 46 1 74 3 35 132 5 75 2 46 76 7 MaxTime 6 4 2 4 Table 26 Transaction completion times for 2PC with message failures
Paxos Commit

                      3 RM             5 RM             7 RM             9 RM
events:               335,574,102      451,771,097      587,991,410      743,112,556
messages:             449,801,508      573,448,269      717,368,941      880,183,141
messagesFailed:       279,976          388,178          516,156          663,316
messagesDeadLetter:   -                -                -                -
processFailures:      -                -                -                -
applicationMessages:  335,574,102      451,771,097      587,991,410      743,112,556
commits:              649,050          397,736          279,220          211,221
aborts:               959,942          1,037,250        1,074,915        1,095,748
virtualTime:          100,000,000,625  100,000,003,229  100,000,002,903  100,000,000,179
Table 27 Results for Paxos Commit with message failures
The results in the table above cannot be compared to the results for 2PC with message failures: Paxos Commit has more aborts because the Registrar's instance of the Paxos algorithm does not finish before the RMs start sending their messages, as explained in the section 'No Faults to Tolerate' on page 79. The results can instead be compared to those in Table 7. There are a large number of message failures in this simulation, but the number of commits does not decrease by much in the Paxos case. The results in Table 28 show that most of the aborts happen after the Paxos Commit Timeout or the Transaction Timeout has fired; this shows that the algorithm is unable to make further progress because of a message that failed to be delivered, and when the timeout fires it causes the algorithm to complete by aborting the transaction.
3 RM 5 RM 7 RM 9 RM Aborts Commits Aborts Commits Aborts Commits Aborts time in ms Commits 0 1,938 1,789 1,612 1,587 1 2,286 3,035 2,986 2,848 2 253 1,744 2,804 2,876 3 3 218 1,390 2,497 4 2 139 1,101 10 399 11 310,505 8,117 27 12 298,299 215,897 22,396 426 13 10,902 137,858 151,643 33,406 14 57 7,781 72,261 105,110 15 100 5,115 41,432 16 110 3,547 20 16 12 13 10 21 28 47 48 42 22 5 45 76 81 23 6 42 115 24 7 46 31 10,825 1,239 25 32 8,881 12,888 3,806 306 33 377 6,565 12,661 6,590 34 5 416 4,936 11,465 35 5 333 3,624 36 10 276 37 11 50 9 13 13 10 51 105 6 25 6 2 15 11 52 104 216 2 106 7 9 11 53 3 126 339 1 208 11 54 10 135 412 3 55 10 166 60 29 61 5,125 381 6 62 3,185 4,159 913 25 63 90 1,714 3,200 1,259 64 2 75 938 2,107 65 57 643 80 1 47,440 69 81 79 719,529 12 186,646 5,264 25 82 70 125,139 93 658,125 32 330,623 2 31,701 83 4 1,368 49 113,525 109 551,826 65 432,458 84 4 1 2,586 38 93,415 87 445,559 85 19 4 3,173 29 75,099 86 35 2 3,313 100 1,155 7 1 101 2 12,292 7,131 456 3 102 1,768 1 16,797 15,485 2,940 103 11 2,207 18,181 1 23,825 104 37 2,365 17,780 105 50 2,352 121 63 48 1 122 15 143 161 35 123 16 214 321 124 3 56 336 MaxTime 3 1 1 2 Table 28 Transaction completion time for Paxos Commit with message failures
Durable Write Optimised 3 RM 5 RM 7 RM 9 RM events: 919,921,563 1,338,663,511 1,756,689,457 2,169,857,296 messages: 1,128,648,188 1,579,497,748 2,020,703,273 2,451,066,592 messagesFailed: 784,538 1,172,932 1,571,715 1,968,840 messagesDeadLetter: processFailures: applicationMessages: 919,921,563 1,338,663,511 1,756,689,457 2,169,857,296 commits: 7,076,660 6,119,092 5,345,294 4,722,463 aborts: 43,881 50,512 54,776 57,701 virtualTime: 100,000,002,118 100,000,000,001 100,000,002,250 100,000,000,032 Table 29 Results for Paxos Commit with Durable Write optimisation where environment has message failures
In Table 29, compared with the results from Table 11, there are considerably more aborts and fewer events take place. Compared with the results for 2PC in Table 25 there are fewer commits and more aborts, due to the high volume of message failures. Looking at the transaction completion times in Table 30, it can be observed that most of the transaction failures occur either during the initiation of the transaction or after the Paxos Commit Timeout has fired.
3 RM 5 RM 7 RM 9 RM time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts 0 8,912 7,818 6,946 5,992 1 10,860 13,933 12,195 10,801 2 1,122 7,858 11,772 10,994 3 11 859 5,393 9,499 4 17 577 3,714 5 9,741 22 462 6 3,643,655 162,903 591 1 15 7 3,099,992 3,422,936 475,888 12,171 8 115,697 2,081,143 2,938,810 797,696 9 663 123,763 1,382,673 2,362,813 10 2 1,542 105,432 925,320 4 2,301 81,872 11 12 1 15 2,422 20 73 56 43 46 21 143 229 192 182 22 25 182 332 339 23 23 184 364 24 1 31 196 25 1,052 2 1 31 26 109,084 19,172 598 7 2 27 90,224 194,604 72,136 6,831 28 4,153 100,799 250,501 148,388 97,442 272,144 29 33 6,251 30 93 7,406 87,302 31 1 187 7,441 32 2 235 46 1,114 288 14 1 47 1,176 3,377 1,660 200 48 49 2,020 6,542 4,697 49 141 2,702 9,111 50 59 3 54 230 43 3,200 34 51 26 60 5 44 296 45 52 13 36 6 48 15 31 53 66 6 67 10 25 20 2 68 1 18 90 45 69 4 41 178 70 2 65 80 1 1,297 1 81 3 17,763 3,522 89 82 3,102 2 13,023 5,444 2 431 83 44 2,276 5 9,104 4 6,295 84 45 1,587 2 6,229 1,125 85 55 86 47 100 34 7 7 6 101 191 87 1 102 35 189 154 20 103 25 195 200 104 1 19 182 105 1 15 MaxTime 5 8 7 6 Table 30 Transaction completion times for Paxos Commit with disk write optimisation where message failures occur
Phase 2 Messages Bundled and Durable Write Optimised 3 RM 5 RM 7 RM 9 RM events: 785,328,355 999,760,413 1,174,570,796 1,319,795,058 messages: 996,320,766 1,248,271,518 1,453,718,686 1,624,430,607 messagesFailed: 649,239 828,121 974,153 1,096,716 messagesDeadLetter: processFailures: applicationMessages: 785,328,355 999,760,413 1,174,570,796 1,319,795,058 commits: 7,220,376 6,446,662 5,832,799 5,328,286 aborts: 45,185 53,102 59,392 65,382 virtualTime: 100,000,000,016 100,000,002,821 100,000,000,082 100,000,000,012 Table 31 Results for Paxos commit with optimisation and bundling in message failure environment
In the results where message bundling is applied in addition to the disk write optimisation, more transactions are processed. The average commit to abort ratio is higher; this is because fewer messages are sent, so there are fewer message failures. The results for transaction completion time do not show any new trend.
3 RM 5 RM 7 RM 9 RM Commits Aborts Commits Aborts Commits Aborts Commits Aborts time in ms 0 9,148 8,119 7,306 6,838 1 11,255 14,335 13,343 12,191 2 1,096 8,312 12,768 12,287 3 9 928 5,893 10,805 4 14 686 4,254 5 11,188 20 489 6 3,794,588 178,730 744 19 7 3,131,281 3,704,782 540,250 14,088 1 8 114,495 2,201,117 3,335,021 947,264 9 684 128,056 1,553,213 2,834,228 10 1 1,559 116,585 1,105,656 6 2,467 97,052 11 12 24 2,829 13 34 20 76 72 45 55 21 152 225 223 189 22 31 218 297 362 23 1 32 231 394 24 36 194 25 281 3 24 26 81,708 7,056 57 4 27 80,430 129,792 28,027 1,318 162,299 61,444 28 3,953 85,528 29 25 6,029 80,641 175,238 30 90 6,876 72,382 31 2 172 7,046 32 3 258 46 773 106 47 891 2,130 566 35 48 54 1,520 3,581 1,708 49 1 115 1,980 5,053 50 64 3 48 194 56 2,183 55 51 21 56 6 43 257 51 33 11 47 52 1 21 53 1 10 47 67 5 16 8 1 68 15 43 28 69 31 104 70 1 5 51 1 80 4 1,326 5 81 3 18,228 2 3,867 90 1 82 3,283 4 13,906 3 5,952 3 507 83 36 1 2,293 3 9,872 3 7,140 84 61 1,694 1 7,279 85 63 1,239 86 3 58 100 27 9 4 2 101 192 90 7 102 42 205 182 18 103 37 198 229 104 23 202 105 25 MaxTime 7 3 2 6 Table 32 Transaction completion times for Paxos commit with optimisation and bundling in environment with message failures
Modified Paxos 3 RM 5 RM 7 RM 9 RM events: 790,150,978 1,019,150,566 1,231,535,029 1,431,284,485 messages: 988,798,494 1,251,061,846 1,491,696,179 1,715,871,661 messagesFailed: 707,649 926,577 1,131,767 1,323,517 messagesDeadLetter: processFailures: applicationMessages: 790,150,978 1,019,150,566 1,231,535,029 1,431,284,485 commits: 7,703,072 7,034,314 6,479,874 6,013,461 aborts: 23,445 35,504 46,032 54,604 virtualTime: 100,000,000,077 100,000,000,438 100,000,000,436 100,000,000,007 Table 33 Results for Modified Paxos in environment with message failures
The results for Modified Paxos, in comparison to those for 2PC in Table 25, are very promising: Modified Paxos achieves a higher commit to abort ratio than 2PC (328:1 compared with 326:1) despite suffering more message failures. Modified Paxos, due to its increased number of processes, sends more messages; as the message failure rate is fixed, there are more message failures, but this does not translate into more aborts. Looking at the transaction completion times in Table 34 it can be seen that most of the aborts occur during the transaction initialisation stage, so resources do not get blocked.
3 RM 5 RM 7 RM 9 RM time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts 0 9,708 8,931 8,281 7,826 1 11,937 15,667 14,819 13,777 2 1,191 8,952 14,191 13,998 3 23 978 6,561 11,788 4 14 779 4,781 5 13 540 6 34 10 7,884 11 3,786,984 143,674 406 12 3,546,132 1 3,808,874 474,603 10,096 13 133,515 2,589,966 3,555,308 2 886,365 14 674 160,837 1,881,399 2 3,128,608 1 15 48 2,063 151,035 1 1,374,510 16 20 1 36 3,064 127,228 17 15 47 3,694 18 8 100 19 1 5 20 76 66 50 48 21 176 221 249 232 22 28 236 364 421 23 36 252 456 24 32 233 25 3 24 30 1,966 5 31 121,490 2 20,048 655 5 32 97,373 1 187,760 1 60,803 2 5,436 1 33 4,469 107,919 228,574 2 113,320 34 31 7,276 105,142 1 244,822 35 121 8,877 95,412 36 2 3 241 8,967 37 3 283 50 19 74 64 48 49 51 1,250 96 332 73 6 48 40 52 1,134 22 3,137 115 1,388 94 164 59 53 54 2,010 17 5,424 102 3,410 101 54 1 162 2,519 22 7,430 78 55 3 225 3,025 12 56 1 6 310 2 71 10 3 1 3 72 9 41 5 12 1 3 73 26 67 3 53 2 74 2 41 139 4 75 3 52 1 76 1 8 100 8 12 9 1 MaxTime 13 10 8 13 Table 34 Transaction completion times for Modified in message failure environment
5.4.2 Message Passing
Two Phase Commit 3 RM 5 RM 7 RM 9 RM TotalSimulation: 2,891 2,356 1,992 1,827 Messages: 58,417 65,495 66,290 73,328 MessageDeadLetter: MessagesFailed: 587 679 698 690 ApplicationMessages: 57,830 64,811 65,582 72,637 processFailures: Aborts: 847 957 983 961 Commits: 1,925 1,315 924 788 Table 35 Results for 2PC where there are message failures
Comparing the results for 2PC in failure free environment, Table 19, with those where there are message failures shows that there are fewer transactions processed in the same amount of time, this signifies that the overall transaction completion time has increased. The results from Table 36 show that all Aborts happen in the over 100ms time range. This is in line with the previous observation that the transaction completion times increase, i.e. due to message failures resources wait for timeouts to fire and then the transactions complete, either by Aborting or Committing. 3 RM 5 RM 7 RM 9 RM time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts 10 1 20 84 30 280 18 40 484 128 7 50 410 263 53 7 60 234 250 106 26 70 106 178 144 70 80 50 132 159 107 90 38 67 105 131 100-150 69 292 104 254 166 184 242 155 151-200 108 392 88 572 45 604 46 597 50 73 58 125 99 128 128 201-300 49 301-400 12 10 14 5 11 16 27 16 401-1000 1 3 3 4 MaxTime 102 68 77 65 Table 36 Transaction completion times for 2PC where there are message failures
Modified Paxos 3 RM 5 RM 7 RM 9 RM TotalSimulation: 787 781 830 903 Messages: 42,850 57,344 70,706 85,084 MessageDeadLetter: MessagesFailed: 355 449 568 680 ApplicationMessages: 37,572 49,611 60,090 72,682 processFailures: Aborts: 250 305 400 513 Commits: 537 476 430 390 Table 37 Results for Modified Paxos in message failure environment
The results seen for Modified Paxos are similar to those seen for 2PC. There are fewer transactions processed in the same amount of time. In 2PC 29% of all transactions fail. For Modified Paxos 31% of all transactions fail, there are an increased number of aborts because there are more messages being sent per transaction. The transaction completion times show the same trend as that for 2PC. 3 RM 5 RM 7 RM 9 RM time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts 0 32 28 42 39 10 5 10 5 3 20 1 6 4 8 7 30 4 1 2 40 24 1 1 50 84 3 1 60 97 1 24 3 70 84 35 9 1 1 1 80 37 59 1 28 1 3 90 26 70 40 10 100 16 40 36 1 13 101-150 35 37 75 31 122 35 112 32 151-200 53 96 31 135 37 186 47 245 201-300 64 12 116 16 131 13 149 58 301-400 4 13 15 28 17 23 46 41 401-900 2 3 3 4 3 4 3 13 MaxTime 6 44 4 48 4 78 6 72 Table 38 Transaction completion times for Modified Paxos where message failures occur
5.5 Process and Message Failures

5.5.1 Simulated Message Passing
Two Phase Commit

                      3 RM             5 RM             7 RM             9 RM
events:               254,119,729      337,951,033      406,717,685      465,655,469
messages:             352,643,763      467,833,442      561,614,537      641,732,005
messagesFailed:       193,745          269,912          331,547          383,229
messagesDeadLetter:   13,806,947       13,037,167       13,255,445       13,551,204
processFailures:      4,986            6,849            8,849            10,842
applicationMessages:  240,302,810      324,900,168      393,444,542      452,082,581
commits:              7,520,712        6,754,906        6,092,870        5,560,481
aborts:               79,476           126,009          174,216          210,260
virtualTime:          100,000,002,330  100,000,002,616  100,000,004,351  100,000,000,201
Table 39 Results for 2PC with all failures
In the case where messages fail to be delivered and processes fail, the 2PC algorithm has 8% fewer commits than in the failure free case, and the average transaction completion time increases. The results in Table 40 show that the majority of aborted transactions abort in the 50 millisecond range, which reduces the number of transactions that can be processed.
3 RM 5 RM 7 RM time in Commits Aborts Commits Aborts Commits Aborts ms 0 9,688 8,556 7,954 1 11,343 15,269 14,275 2 1,136 8,550 13,201 3 18 961 6,124 4 18 702 5 17 6 10 27,209 3 11 4,349,561 10 250,122 1,127 12 2,874,721 9 4,015,499 8 637,130 4 13 90,458 2,122,125 8 3,504,207 15 14 496 113,964 1 1,530,501 4 15 1 1,349 109,123 16 7 2,196 17 12 18 20 65 69 59 21 165 234 232 22 30 257 375 23 41 229 24 1 38 25 2 30 521 1 31 95,987 1 9,632 1 94 32 76,694 2 146,490 4 34,221 1 33 3,093 85,630 1 177,517 8 34 26 5,680 82,765 2 35 77 6,624 36 157 37 2 38 50 15 41,426 1 40,198 42,895 51 968 15,409 170 38,199 8 41,365 52 890 505 2,388 12,280 766 34,099 53 37 4 1,562 628 4,081 9,751 54 138 9 1,999 746 55 3 192 15 56 8 70 201 214 250 71 6 222 3 607 752 72 6 17 14 355 8 1,022 73 1 23 28 59 411 74 1 35 53 75 2 2 91 4 1 4 92 6 12 93 1 12 99 32 24 30 100 238 323 334 101 14 76 105 102 12 38 103 2 9 MaxTime 13,737,576 12,926,904 13,105,754 Table 40 Transaction completion times for 2PC with all failures
9 RM Commits Aborts 7,292 13,028 13,150 11,084 4,482 525 23 17,824 1,045,231 4 2,939,691 11 1,106,813 3 93,404 2,484 34 49 227 369 423 222 31 1,586 70,835 2 190,218 6 75,119 2 6,722 216 2 42,160 39,337 55 37,723 1,911 28,407 5,564 7,636 2,294 652 245 20 219 667 2 1,217 42 1,107 94 422 61 45 3 13 31 30 392 85 86 27 13,370,326
If the Registrar fails, the simulation keeps sending messages to the Client to start new transactions. The elapsed time for these transactions is -1, as a transaction id is never issued. This causes the high number in the MaxTime row; it is not accounted for in the analysis that shows 8% fewer commits for 2PC with failures in comparison to the failure free case.
Paxos Commit

                      3 RM             5 RM             7 RM             9 RM
events:               349,727,127      453,892,002      571,076,067      701,908,917
messages:             461,796,497      572,789,962      696,626,844      834,196,622
messagesFailed:       273,702          372,085          481,265          605,132
messagesDeadLetter:   15,584,255       15,344,182       16,335,484       16,069,384
processFailures:      8,243            9,832            12,184           13,661
applicationMessages:  334,120,336      438,518,378      554,702,331      685,793,967
commits:              598,744          361,128          246,481          181,991
aborts:               946,265          1,015,820        1,047,609        1,065,816
virtualTime:          100,000,000,512  100,000,000,611  100,000,000,013  100,000,002,261
Table 41 Results for Paxos commit with all failures
Paxos Commit does have fewer Commits than it does in the fault free environment, but comparatively more transactions finish in the 50ms time range.
3 RM 5 RM 7 RM 9 RM time in Commits Aborts Commits Aborts Commits Aborts Commits Aborts ms 0 1,889 1,741 1,571 1,533 1 2,281 2,918 2,765 2,539 2 199 1,657 2,573 2,691 3 2 198 1,175 2,257 4 9 118 968 5 4 94 10 385 11 286,272 7,551 15 12 275,361 195,995 19,535 386 13 10,315 124,823 134,546 28,806 14 59 7,137 63,365 91,142 15 89 4,661 35,043 16 1 91 3,160 17 87 20 11 11 14 7 21 34 45 37 43 22 7 51 68 80 23 5 32 82 24 7 33 30 93 31 9,819 1,171 24 32 7,940 11,765 3,356 254 33 379 5,938 11,208 5,475 34 5 332 4,322 10,047 35 12 312 3,018 36 10 249 50 1 21,504 21,077 22,287 20,102 51 117 7,289 22 18,338 20,794 20,399 52 83 140 183 5,170 79 17,460 10 20,800 53 4 1 119 204 278 4,498 174 13,953 54 11 3 115 215 368 3,063 55 15 5 104 198 56 4 3 60 30 1 61 4,801 22 371 1 5 62 2,886 10 3,841 16 841 3 25 63 62 1,532 5 2,664 7 1,028 5 64 1 68 826 1 1,863 65 1 50 519 2 66 29 70 97 105 105 89 71 1 100 259 354 340 72 1 4 135 1 485 622 73 1 2 6 3 170 4 568 74 2 7 6 148 75 5 10 80 1 45,495 61 81 72 688,700 10 174,661 4,826 19 82 50 119,474 105 616,736 22 301,261 28,956 83 4 1,292 48 106,767 104 503,669 45 388,805 84 3 1 2,375 28 85,823 99 400,705 85 15 1 2,959 26 67,084
3 RM 5 RM 7 RM 9 RM time in Commits Aborts Commits Aborts Commits Aborts Commits Aborts ms 86 30 3 3,095 87 56 100 1,933 771 874 868 101 1 11,767 6,365 459 22 102 1,571 15,600 14,224 2,606 103 14 2,069 16,670 21,656 104 49 1 2,156 1 15,744 105 56 2,042 106 1 77 121 58 41 4 122 10 147 147 32 123 27 199 293 124 27 261 1 49 125 MaxTime 13,950,195 13,457,235 14,114,659 13,562,897 Table 42 Transaction completion time for Modified Paxos with all failures
Durable Write Optimised

                      3 RM             5 RM             7 RM             9 RM
events:               887,387,691      1,256,146,821    1,604,444,892    1,941,779,866
messages:             1,086,847,088    1,483,137,252    1,849,502,063    2,199,650,887
messagesFailed:       739,356          1,082,457        1,416,823        1,738,772
messagesDeadLetter:   15,102,425       15,392,213       16,144,742       16,561,429
processFailures:      7,968            9,900            11,822           14,030
applicationMessages:  872,263,390      1,240,724,912    1,588,262,600    1,925,172,855
commits:              6,540,268        5,549,421        4,739,031        4,111,643
aborts:               102,443          123,312          144,801          161,772
virtualTime:          100,000,000,024  100,000,000,002  100,000,000,699  100,000,001,771
Table 43 Results for Paxos commit with disk optimisation
The percentage of aborts in Paxos Commit with optimisation is considerably higher: Paxos Commit appears to be affected by failures much more than 2PC. A higher number of transactions complete in the 80ms time range. Even though individual transactions complete in less time than with 2PC, the overall transaction throughput is lower for Paxos Commit.
3 RM time in Commits Aborts ms 0 8,254 1 10,175 2 1,011 3 12 4 5 8,976 6 3,364,252 7 2,867,332 8 107,560 9 642 10 1 11 12 13 20 69 21 150 22 21 23 1 24 25 927 26 101,023 27 83,420 28 3,953 29 30 30 31 32 33 46 1,007 47 1,053 48 66 49 1 50 20,780 51 7,092 52 133 53 54 55 56 66 7 67 6 68 1 69 70 93 71 86 72 3 73 74 75 -
5 RM Commits Aborts 7,121 12,760 7,076 750 19 147,800 3,101,466 1 1,888,432 112,494 1,358 7 51 174 153 33 2 1 17,793 177,172 91,636 5,693 82 283 3,078 1,892 160 2 21,198 18,446 5,598 196 1 2 42 20 2 1 123 301 171 11 -
7 RM Commits Aborts 6,168 10,988 10,566 4,695 542 22 604 420,205 2,602,269 1,228,157 94,014 2,008 32 512 64,315 223,403 86,729 6,579 147 2 8 1,466 5,799 2,445 183 5 17 86 33 4 -
39 192 281 182 26 2 21,115 21,751 17,195 4,270 220 5 111 367 514 161 11 -
9 RM Commits Aborts 5,418 9,766 9,717 8,421 3,244 376 19 10,504 1 690,963 2,056,438 808,751 71,311 2,197 49 29 148 281 332 170 21 3 1 5,917 129,605 237,621 76,002 6,443 205 2 173 4,089 8,118 2,724 21,025 19,589 263 2 20,404 14,238 3,210 195 5 1 65 123 57 107 12 352 603 534 154 12
3 RM 5 RM 7 RM 9 RM time in Commits Aborts Commits Aborts Commits Aborts Commits Aborts ms 80 1 2,891 5 81 4 42,707 9,031 1 225 82 7,448 2 32,059 5 14,362 1,230 83 72 1 5,624 1 23,539 2 17,174 84 1 132 4,089 1 17,955 85 1 119 3,021 86 1 147 100 317 314 346 335 101 657 326 70 55 102 100 776 658 173 103 110 764 981 104 2 95 688 105 7 94 MaxTime 13,376,582 13,443,074 13,983,736 14,115,387
Table 44 Transaction completion times for Paxos commit with disk optimisation
Phase 2 Messages Bundled and Durable Write Optimised 3 RM 5 RM 7 RM 9 RM events: 765,621,165 948,821,459 1,088,325,072 1,195,644,921 messages: 967,156,910 1,182,997,570 1,346,971,872 1,473,087,033 messagesFailed: 612,197 769,048 884,978 973,640 messagesDeadLetter: 15,909,022 15,066,268 15,424,586 16,574,630 processFailures: 8,077 9,849 11,906 14,192 applicationMessages: 749,689,921 933,725,694 1,072,862,827 1,179,024,140 commits: 6,670,780 5,855,641 5,176,481 4,617,180 aborts: 102,291 124,688 148,942 171,703 virtualTime: 100,000,003,072 100,000,000,253 100,000,001,698 100,000,001,926 Table 45 Results for Paxos commit with optimisation and message bundling
The results are slightly better than those in Table 43: there are fewer message failures, which in turn increases the number of commits, and the percentage of failures is lower. The transaction completion times do not show any new trends.
3 RM time Aborts in ms Commits 0 8,410 1 10,456 2 1,011 3 12 4 5 10,281 6 3,504,474 7 2,894,468 8 106,246 9 589 10 1 11 12 13 20 46 21 1 164 22 22 23 24 25 246 26 75,026 27 74,067 28 3,654 29 33 30 31 32 46 731 47 875 48 59 49 50 21,569 51 6,821 52 134 53 54 55 56 66 6 67 10 68 1 69 70 106 71 97 72 3 73 74 75 80 2 2,785 81 6 41,792 82 1 7,399 83 71
5 RM Commits Aborts 7,642 13,270 7,516 832 17 1 162,762 3,363,715 2,000,268 116,345 1,475 6 1 54 229 199 23 1 1 6,259 117,953 77,827 5,399 92 1 95 1,878 1,387 133 1 21,051 17,909 5,383 199 1 1 18 14 4 114 328 147 14 2 1 9,125 2 32,413 1 5,710
7 RM Commits Aborts 6,720 12,220 11,447 5,314 636 18 618 479,351 2,957,904 1,379,870 1 103,850 2,250 25 48 197 322 216 28 2 59 24,692 143,739 72,052 6,214 171 1 505 3,165 1,745 188 21,222 3 21,287 16,348 4,280 207 7 5 31 29 8 101 367 478 167 3 258 14,822 3 24,803
9 RM Commits Aborts 6,111 10,893 10,886 9,222 3,663 442 24 12,103 820,938 2,456,162 958,813 83,610 2,562 19 44 1 164 322 376 184 25 1 1,045 53,256 151,863 62,324 6,159 214 42 1,427 4,263 2,008 21,613 212 21,646 11 21,143 14,295 3,220 173 7 1 15 83 35 130 3 369 588 590 165 17 2 1,329 4 18,031
3 RM time in Commits Aborts ms 84 85 86 100 326 101 648 102 105 103 1 104 105 121 2 122 123 124 125 MaxTime 14,269,470
5 RM Commits Aborts 137 -
321 376 783 122 2
-
3 6
-
13,176,138
7 RM Commits Aborts 4,141 135 -
336 82 647 784 113 2 1 9 16 2
-
13,287,196
9 RM Commits Aborts 1 18,672 3,124 130 391 73 173 996 732 101 2 1 15 18 2 14,220,126
Table 46 Transaction completion times for Paxos commit with optimisation and message bundling
Modified Paxos

                      3 RM             5 RM             7 RM             9 RM
events:               790,263,290      1,019,166,544    1,231,649,454    1,431,074,240
messages:             988,918,831      1,251,057,789    1,491,818,447    1,715,584,636
messagesFailed:       707,432          926,763          1,131,622        1,324,560
messagesDeadLetter:   955,547          1,527,474        2,151,195        2,727,824
processFailures:      10               12               14               16
applicationMessages:  789,301,813      1,017,629,338    1,229,484,155    1,428,328,511
commits:              7,703,861        7,033,394        6,480,145        6,011,188
aborts:               23,362           35,438           45,999           54,371
virtualTime:          100,000,002,016  100,000,000,154  100,000,001,959  100,000,000,765
Table 47 Results for Modified Paxos
Modified Paxos appears more resilient to failures: the percentage of aborts is considerably lower than for 2PC, even though there are more message failures. This behaviour can be explained by the very small number of process failures. The majority of the transactions complete in the 10ms time range.
time in ms 0 1 2 3 4 5 6 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 30 31 32 33 34 35 36 37 50 51 52 53 54 55 56 57 70 71 72 73 74 75 76 MaxTime
3 RM 5 RM 7 RM Commits Aborts Commits Aborts Commits Aborts 9,752 8,940 8,153 11,723 15,424 14,714 1,164 8,960 14,204 9 992 6,537 17 738 17 7,787 3,788,316 143,436 422 3,545,546 1 3,807,072 473,670 134,389 2,589,663 3,556,595 759 161,503 1,882,533 50 1,988 150,448 19 1 54 3,120 1 15 67 8 1 67 59 51 158 298 237 26 239 416 34 255 1 36 5 1,949 3 120,920 20,098 597 97,126 188,008 1 59,874 4,462 108,415 1 228,979 2 24 7,255 104,677 5 105 8,980 1 201 3 19 154 112 109 1,250 147 311 127 14 116 1,165 21 3,209 117 1,411 134 56 2,064 13 5,603 122 1 129 1 2,553 19 5 239 1 11 1 3 4 4 2 1 3 10 37 2 18 7 17 79 6 1 39 1 70,932 70,208 73,294 Table 48 Transaction completion time for Modified Paxos
9 RM Commits Aborts 7,633 13,687 13,745 11,997 4,730 513 25 10,038 885,653 3,126,315 1 1,373,988 126,913 3,610 69 7 46 222 378 417 250 32 1 3 5,569 113,444 245,417 1 96,234 8,997 262 80 1 73 144 119 3,461 172 7,535 98 2,962 16 280 16 2 4 65 7 132 5 50 1 11 18,540
5.5.2 Message Passing
Two Phase Commit

                        3 RM      5 RM      7 RM      9 RM
TotalSimulation:        2,830     2,295     2,019     1,904
Messages:               56,684    61,556    66,074    69,828
MessageDeadLetter:      57        76        67        104
MessagesFailed:         554       640       657       708
ApplicationMessages:    56,070    60,839    65,349    69,016
processFailures:        7         13        12        13
Aborts:                 844       1,011     1,006     1,122
Commits:                1,874     1,206     922       703

Table 49 Results for 2PC with all failures
The results for 2PC with all failures show that the throughput of the transaction system is reduced, and that the numbers of transaction Commits and Aborts are similar to those seen in the simulation with only message failures. Transaction completion times have increased: in Table 36 a few transactions finished in the 100-150ms range, whereas in Table 50 there are none in this range.

time in ms 0 10 20 30 40 50 60 70 80 90 100-149 150-199 200-299 300-599 MaxTime
3 RM 5 RM 7 RM 9 RM Commits Aborts Commits Aborts Commits Aborts Commits Aborts 2 1 4 2 2 91 291 16 2 403 84 35 4 382 226 149 35 209 259 208 94 115 181 149 107 76 110 86 102 50 64 49 70 96 92 86 2 133 691 56 839 32 763 30 824 94 57 53 106 88 121 134 120 210 8 8 12 11 5 26 8 9 90 72 77 77 Table 50 Transaction completion times for 2PC with all failures
Modified Paxos

                        3 RM      5 RM      7 RM      9 RM
TotalSimulation:        1,235     1,032     808       881
Messages:               61,493    68,881    61,282    73,928
MessageDeadLetter:      173       56        322       439
MessagesFailed:         559       604       518       581
ApplicationMessages:    55,745    61,096    51,054    61,407
processFailures:        12        9         25        30
Aborts:                 395       461       509       569
Commits:                840       571       299       312

Table 51 Results for Modified Paxos with all failures
The trend in the results is similar to that seen when Modified Paxos was run in the environment with only message failures, discussed in the Modified Paxos section on page 103.

time in ms 0 10 20 30 40 50 60 70 80 90 100 101-150 151-200 201-300 301-400 401-900 MaxTime
3 RM 5 RM 7 RM 9 RM Commits Aborts Commits Aborts Commits Aborts Commits Aborts 37 32 30 41 5 4 6 1 1 12 7 2 5 14 5 1 7 101 1 1 1 161 10 2 167 39 2 1 116 57 7 1 47 60 15 1 28 50 18 6 12 21 13 7 38 87 79 20 76 12 72 15 78 254 20 256 32 296 74 155 74 2 153 26 126 72 153 88 2 24 15 35 14 43 33 27 2 5 5 4 3 10 5 13 3 66 3 70 4 76 2 74 Table 52 Transaction completion times for Modified Paxos
5.6 Disk Write Analysis

5.6.1 Paxos

The Phase 2b message is sent to all the interested parties directly. Where the message send is optimised by performing it concurrently with the disk write, the acceptors receiving the Phase 2b message act as durable storage: in case of failure, the acceptors can be queried instead of reading the decision back from disk.
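By way of illustration only, the following Java sketch shows the general shape of this optimisation (the class and interface names are invented here and are not taken from the simulation code): the durable write is started asynchronously and the Phase 2b message is broadcast without waiting for it.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch: the Phase 2b value is appended to the local log and
// broadcast at the same time, so the recipients collectively hold a redundant
// copy of the value that can be queried after a failure.
public class ConcurrentWriteAndSend {

    private final ExecutorService diskWriter = Executors.newSingleThreadExecutor();

    public void sendPhase2b(String value, List<Messenger> interestedParties, DurableLog log)
            throws Exception {
        // Start the durable write asynchronously rather than blocking on it.
        Future<?> write = diskWriter.submit(() -> log.append(value));

        // Broadcast the Phase 2b message concurrently with the disk write.
        for (Messenger party : interestedParties) {
            party.send("PHASE2B " + value);
        }

        // The sender may still wait for the local write before acknowledging,
        // but the message latency is no longer added on top of the write latency.
        write.get();
    }

    // Assumed collaborator interfaces, defined only to keep the sketch self-contained.
    public interface Messenger { void send(String message); }
    public interface DurableLog { void append(String record); }
}
```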
5.6.2 Modified Paxos

Each acceptor sends only a single Prepared message, to the leader, so this message cannot be optimised in the same way: there is not enough redundancy. The leader, however, does not need to write to durable storage, as all the acceptors durably write their individual decisions.
5.7 Summary

The analysis of the results can be summarised as follows:

• The average transaction completion time increases as more RMs are used in a transaction.

• For Paxos Commit, the delayed completion of the Registrar's instance of Paxos introduces a large number of Aborts into the results.

• The message bundling optimisation improves performance by approximately 1% where Paxos Commit is set to tolerate one fault.

• As fault tolerance is increased for Paxos Commit, the number of messages sent grows because more processes participate in the same transaction, and this increases the average transaction completion time.

• The durable write optimisation in Paxos Commit halves the transaction completion time compared with unoptimised Paxos Commit, and improves on 2PC by approximately 46% in the fault-free environment.

• In the environment with failures, Modified Paxos is the best-performing algorithm in the simulated message passing tests (its commit-to-abort ratio is better than that of 2PC), but not in the JPVM simulation. The average transaction commit time is shortest for the 2PC protocol.
6. Conclusions and Further Work

The overall scope of this work was to investigate and implement a simulation of a complete transaction system that uses the Paxos Commit algorithm to achieve atomic commit in a distributed transaction environment. A simulation of Two Phase Commit (2PC) was to be implemented and the results of 2PC and Paxos Commit compared. Optimisations suggested for Paxos Commit were to be tested in the simulation, and an alternative algorithm was designed to overcome the problems encountered with Paxos Commit. The algorithms were to be compared both in how they differ under perfect conditions and in how each performs when the environment is faulty.

Two simulation frameworks, the simulated message passing framework and the JPVM-based message passing framework, were designed, and various configurations were run on them. The alternative algorithm uses the basic principles of the Paxos Commit algorithm but does not use Paxos's phase-based messaging or the Registrar process.

The results demonstrated that under failure-free conditions 2PC is faster than Paxos Commit and Modified Paxos. This remains true even when the two Paxos consensus based commit algorithms are run with fault tolerance switched off. The disk write optimisation improves the performance of Paxos Commit for committed transactions by 46%. Modified Paxos, compared with unoptimised Paxos Commit, is 4.5% faster for committed transactions. Modified Paxos also has the advantage of not using a single Transaction Manager process: the current leader of the acceptor processes acts as the Transaction Manager.

This study has examined the performance and behaviour of 2PC, Paxos Commit and Modified Paxos in environments with varying degrees of failure. After analysing this data we can present the following conclusions.
6.1 Conclusions

The 2PC protocol that has been simulated blocks the Resource Manager (RM) if the RM is in the prepared state and the Transaction Manager (TM) fails; the RM waits until the TM recovers from the failure. In the Paxos Commit protocol, one of the backup coordinators takes over the role of the TM and completes the execution.

The various processes in the transaction system set timeouts to keep track of responses from other processes. The Client and the TM set a timeout for the overall transaction to ensure that the system does not block indefinitely: if the transaction has not completed after a certain amount of time, it is cancelled. A request timeout is set before certain messages are sent; depending on the local state of the process, the transaction may then be aborted or the request re-sent.

If any of the processes other than the coordinator fails, the transaction fails. The only redundancy available is in the commit decision-making process; the protocol does not provide redundancy for the RM, Client and Registrar processes, so a failure of one of these processes results in the transaction not being successful. A failed RM causes subsequent transactions to abort until the RM recovers.

In the past, disk storage and the network had distinct tasks: disk storage was used for storing local data and networks for communication between independent computers. With advances in technology the difference between the two is disappearing, and this has led to network storage devices [40]. This advance has been utilised in the optimisation suggested here: the disk write is done concurrently with the messaging, improving the transaction commit time considerably. In the Paxos Commit algorithm redundancy is used to achieve fault tolerance; here that redundancy has also been used to improve performance, by essentially using the additional processes as durable storage. The disk write before sending messages is normally done to ensure that the protocol can restart at the correct point after a failure. Sending the message to multiple processes replicates this behaviour: since the likelihood of all the processes failing together is very low, the protocol can retrieve the status from one of these processes instead of from disk. Should all the processes fail together, no other process could act on the decision in any case, as it would be lost with them.
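To illustrate the request-timeout handling described above, here is a minimal Java sketch (the class and method names are hypothetical and not part of the simulation code): a timeout is armed when a request is sent, and on expiry the request is either re-sent or the transaction is aborted, depending on how many retries remain.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of request timeouts with re-send or abort.
public class RequestTimeoutExample {

    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    /**
     * Sends a request and arms a timeout. If no response has arrived when the
     * timeout fires, the request is re-sent up to maxRetries times; after that
     * the transaction is aborted, mirroring the behaviour described in the text.
     */
    public void sendWithTimeout(Transaction txn, Request request, int maxRetries, long timeoutMs) {
        request.send();
        timer.schedule(() -> {
            if (request.responseReceived()) {
                return;                                       // reply arrived in time
            }
            if (maxRetries > 0) {
                sendWithTimeout(txn, request, maxRetries - 1, timeoutMs);   // re-send
            } else {
                txn.abort();                                  // give up and abort
            }
        }, timeoutMs, TimeUnit.MILLISECONDS);
    }

    // Assumed collaborators, declared only so the sketch is self-contained.
    public interface Transaction { void abort(); }
    public interface Request { void send(); boolean responseReceived(); }
}
```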
The Paxos Commit protocol is not a non-blocking protocol; it is a fault-tolerant protocol. It tolerates the failure of the TM only because there are multiple coordinators that can perform the tasks required of the TM. With the RM and an Acceptor on the same node, the failure of that node leads to the transaction being aborted, as the RM fails too. The simulation produces more process failures when there are more nodes; this is due to the mean time to process failure.

The results show that the optimisation of writing to disk concurrently with sending the message presents a considerable performance advantage. It improves over 2PC with respect to the average time taken per transaction commit: transactions commit in approximately 6ms with the disk write optimisation, compared with approximately 11ms for 2PC. In addition to being quicker than 2PC, Paxos Commit is an NB-AC protocol. With the performance improvement achieved, and coordinator failure not affecting transaction commit, Paxos Commit appears to be an algorithm that could be used to advantage in transaction processing systems.

The message passing simulation does show considerably slower performance for Paxos Commit compared with 2PC. This is most likely due to the simulation being executed in Java and the additional overhead of messaging using the JPVM library, which was computed to be approximately 14 times slower than the simulated message passing constant used for message delay.

Modified Paxos performed better than Paxos Commit by approximately 4.5% when the two are compared on average transaction commit time without the disk optimisation. If the disk optimisation were implemented in Modified Paxos, the algorithm would be an ideal choice for transaction processing, as Registrar failures do not block the whole transaction system.
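As a rough consistency check using the rounded 6ms and 11ms figures quoted above (so only approximate), the relative improvement of optimised Paxos Commit over 2PC is in line with the approximately 46% reported in the summary:

\[
\text{improvement} \approx \frac{t_{\mathrm{2PC}} - t_{\mathrm{Paxos,opt}}}{t_{\mathrm{2PC}}} = \frac{11\,\mathrm{ms} - 6\,\mathrm{ms}}{11\,\mathrm{ms}} \approx 45\%.
\]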
6.2 Further Work

6.2.1 Simulation Analyser

A log of all messages sent and received could be maintained, together with the exact time of each process failure. A tool could be built to analyse this log and present a graphical representation of transaction failures in relation to process failures.
The tool could be used to drill into every transaction that was processed, to see the messages sent and received by each of the processes, and whether any messages failed. It could also be used to fine-tune the timeout values: by showing each message transmitted and received, and the failure of a process or a message on the transaction's timeline, it would let the user compare average message response times and set the timeout values accordingly. The tool would further enable the user to find the cause of a transaction abort. The process that initiated the message causing the abort could be identified, as could the case where a transaction aborted because messages about the prepared state of one or more RMs were not received. Identifying the exact cause of a transaction failure would help expose any shortfalls in the protocol. In the simulated framework only the number of transaction failures is accumulated; fault identification would allow the failures to be classified by cause, which would in turn allow an exact measurement of the improvement provided by the NB-AC protocol.
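As an illustration only, a log record consumed by such an analyser might carry fields like the following; the field and class names are assumptions made here and are not part of the existing simulation.

```java
// Hypothetical log record for the proposed simulation analyser. One record
// would be written per message event (plus records for process failures), so
// that aborts can later be correlated with the failure that caused them.
public class MessageLogRecord {
    public enum Outcome { DELIVERED, FAILED, DEAD_LETTER }

    public final long transactionId;   // transaction this message belongs to
    public final String sender;        // e.g. "RM-2", "TM", "Acceptor-1"
    public final String receiver;
    public final String messageType;   // e.g. "PREPARE", "PREPARED", "COMMIT"
    public final long sentAtMs;        // virtual time when sent
    public final long receivedAtMs;    // virtual time when received, -1 if never
    public final Outcome outcome;

    public MessageLogRecord(long transactionId, String sender, String receiver,
                            String messageType, long sentAtMs, long receivedAtMs,
                            Outcome outcome) {
        this.transactionId = transactionId;
        this.sender = sender;
        this.receiver = receiver;
        this.messageType = messageType;
        this.sentAtMs = sentAtMs;
        this.receivedAtMs = receivedAtMs;
        this.outcome = outcome;
    }
}
```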
6.2.2 Low Latency Non-Blocking Atomic Commit

The Low-Latency Non-Blocking Atomic Commit (NBAC) protocol suggested by Jiménez-Peris et al. [7] has, according to the literature search, not been implemented. The algorithm requires a virtually synchronous environment with a uniform broadcast primitive, and utilises coordinator and Resource Manager uniform multicast groups. It also uses the concepts of optimistic commit and of concurrent messaging and disk flushing. Uniform multicast is slow but essential; in the protocol it is performed in parallel with other essential tasks, such as recording the final commit decision and informing all the processes of it, thus reducing the latency of the protocol. The optimisation used for Paxos Commit in this work performs the disk write in parallel with the messaging, improving the performance of the protocol considerably; this optimisation is similar to the process described for the Low Latency NBAC.
A simulation of the Low Latency NBAC could be constructed, taking into consideration the performance impact of introducing virtual synchrony and uniform multicast. The simulation could be run on a framework similar to the one used in this study, where transactions are started and the overall time is calculated from when all the processes finish with each transaction. The results could then be compared with the results of this study to determine how a specialist protocol that uses a virtually synchronous environment with a uniform multicast primitive to reduce latency compares with a protocol designed specifically for an asynchronous environment.
6.2.3 Modified Paxos

Further variations to Modified Paxos can be made. For example, the ready-to-commit messages could be sent to all the parties instead of only to the Leader. Each Acceptor's decision would then be stored with all the other processes and could be retrieved from them in case of failure, which allows each Acceptor to send the message without waiting for its disk write to finish. As has been seen in the results for Paxos Commit with the disk optimisation, sending messages concurrently with disk writes provides a considerable performance improvement. Modified Paxos performed better than Paxos Commit where no optimisation was used; with this optimisation it should also perform better than Paxos Commit with the disk optimisation. Modified Paxos has the added advantage of having no downtime associated with the failure of the Registrar process, as it uses the current Leader for Transaction Management functions. A minimal sketch of this variation follows.
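The following Java sketch (hypothetical names, not part of the simulation code) illustrates the variation described above: each Acceptor broadcasts its ready-to-commit decision to every party while its own durable write proceeds, so that a failed Acceptor's decision can be recovered from any surviving peer.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical Acceptor-side sketch of the suggested Modified Paxos variation:
// the ready-to-commit decision is broadcast to all parties rather than only to
// the Leader, so the peers collectively act as durable storage and the Acceptor
// need not delay the broadcast until its local disk write has completed.
public class BroadcastingAcceptor {

    public void readyToCommit(long transactionId, List<Party> allParties, DurableLog log) {
        String decision = "READY-TO-COMMIT " + transactionId;

        // The durable write runs asynchronously; the broadcast does not wait for it.
        CompletableFuture<Void> write =
                CompletableFuture.runAsync(() -> log.append(decision));

        // Broadcast the decision to every party, not just the Leader.
        for (Party party : allParties) {
            party.send(decision);
        }

        // On recovery, either the local log or any peer that received the
        // broadcast can supply the decision; here we merely wait so the write
        // has completed before this Acceptor acknowledges locally.
        write.join();
    }

    // Assumed collaborator interfaces so the sketch is self-contained.
    public interface Party { void send(String message); }
    public interface DurableLog { void append(String record); }
}
```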
6.2.4 Disk Write Optimisation

The disk write optimisation that has been suggested and tested provides a performance improvement of up to 46% in the simulations, with the various parameters set for the system environment. The informal analysis of a pathological case does not show any drawbacks to the suggested approach; a formal analysis of this optimisation should be performed.
7. References

1. Transaction processing, Wikipedia, the free encyclopedia. [Online], Available: http://en.wikipedia.org/wiki/Transaction_processing, 2005, (Accessed 10 October 2005)

2. Babaoglu, O. and S. Toueg, Understanding non-blocking atomic commitment, 1993, University of Bologna, TR UBLCS-93-2, Available: http://citeseer.ist.psu.edu/babaoglu93understanding.html

3. Gray, J. Notes on Data Base Operating Systems. in Operating Systems, An Advanced Course. 1978: R. Bayer, et al. eds., Lecture Notes in Computer Science Vol. 60, p. 393-481. Available: http://research.microsoft.com/%7EGray/papers/DBOS.doc

4. Skeen, D. Nonblocking commit protocols. in Proceedings of the ACM SIGMOD International Conference on Management of Data. 1981, ACM Press, p. 133-142.

5. Fischer, M.J. and M. Merritt, Appraising two decades of distributed computing theory research. Distributed Computing, 2003. 16(2-3): p. 239-247.

6. Gray, J. and L. Lamport, Consensus on Transaction Commit, 2004, Microsoft Research, MSR-TR-2003-96, Available: http://research.microsoft.com/research/pubs/view.aspx?tr_id=701, (Accessed 3 February 2004)

7. Jiménez-Peris, R., et al. A Low-Latency Non-Blocking Commit Service. in Proceedings of the 15th International Conference on Distributed Computing. 2001, Lecture Notes in Computer Science Vol. 2180, Springer-Verlag, p. 93-107.

8. Guerraoui, R. and P. Kouznetsov, On the Weakest Failure Detector for Non-Blocking Atomic Commit, in Proceedings of the IFIP 17th World Computer Congress - TC1 Stream / 2nd IFIP International Conference on Theoretical Computer Science: Foundations of Information Technology in the Era of Networking and Mobile Computing. 2002, Kluwer, B.V. p. 461-473.

9. Guerraoui, R., Non-blocking atomic commit in asynchronous distributed systems with failure detectors. Distributed Computing, 2002. 15(1): p. 17-25.

10. Lamport, L., Paxos Made Simple. ACM SIGACT News distributed computing column 5, 2001. 32(4): p. 51-58.

11. Guerraoui, R., Revisiting the relationship between non-blocking atomic commitment and consensus, in Proceedings of the 9th International Workshop on Distributed Algorithms. 1995, Springer-Verlag. p. 87-100.

12. Chandra, T.D., V. Hadzilacos, and S. Toueg, The weakest failure detector for solving consensus. J. ACM, 1996. 43(4): p. 685-722.

13. Fischer, M.J., N.A. Lynch, and M.S. Paterson, Impossibility of distributed consensus with one faulty process. J. ACM, 1985. 32(2): p. 374-382.

14. Hadzilacos, V., On the relationship between the atomic commitment and consensus problems, in Fault-tolerant distributed computing, B. Simons and A. Spector, Editors. 1990, Springer-Verlag. p. 201-208. Lecture Notes in Computer Science Vol. 448.

15. Schneider, F.B., What good are models and what models are good?, in Distributed systems (2nd Ed.), S. Mullender, Editor. 1993, ACM Press/Addison-Wesley Publishing Co. p. 17-26.

16. Reddy, P.K. and M. Kitsuregawa, Reducing the blocking in two-phase commit with backup sites. Inf. Process. Lett., 2003. 86(1): p. 39-47.

17. Gray, J., Personal Communication: Implementation of Paxos Commit Algorithm, 25/03/2004.

18. Gray, J. and A. Reuter, Transaction processing: concepts and techniques. 1993: Morgan Kaufmann Publishers.

19. Chrysanthis, P.K. and K. Ramamritham, ACTA: The SAGA Continues, in Database Transaction Models for Advanced Applications, A.K. Elmagarmid, Editor. 1992, Morgan Kaufmann Publishers. p. 349-397.

20. Biliris, A., et al. ASSET: A System for Supporting Extended Transactions. in International Conference on Management of Data. 1994. Minneapolis, Minnesota: R.T. Snodgrass and M. Winslett eds., ACM SIGMOD, p. 44-54.

21. Garcia-Molina, H. and K. Salem. Sagas. in Proceedings of the 1987 ACM SIGMOD International Conference on Management of Data. 1987. San Francisco, California, United States, ACM Press, p. 249-259.

22. Prasad, K.H., T.K. Nayak, and R.K. Ghosh. DiET: a distributed extended transaction processing framework. in Third International Conference on High-Performance Computing. 1996. Trivandrum, India, p. 114-119.

23. Moss, J.E.B., Nested transactions: an approach to reliable distributed computing. 1981, Massachusetts Institute of Technology.

24. Davies, C.T., Data processing spheres of control. IBM Systems Journal, 1978. 17(2): p. 179-198.

25. Singh, M.P. Multiagent Systems as Spheres of Commitment. in International Conference on Multiagent Systems (ICMAS) Workshop on Norms, Obligations, and Conventions. 1996. Kyoto, Japan. Available: http://www.csc.ncsu.edu/faculty/mpsingh/papers/mas/icmas-96-norms.pdf

26. Lampson, B.W., Atomic Transactions, in Distributed Systems - Architecture and Implementation, An Advanced Course. 1981, Springer-Verlag. p. 246-265.

27. Kopetz, H. and P. Veríssimo, Real time and dependability concepts, in Distributed systems (2nd Ed.), S. Mullender, Editor. 1993, ACM Press/Addison-Wesley Publishing Co. p. 411-446.

28. Alpern, B. and F.B. Schneider, Defining Liveness. Information Processing Letters, 1984. 21(4): p. 181-185. Available: http://www.cs.cornell.edu/fbs/publications/85-650.ps

29. Gärtner, F.C., Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys, 1999. 31(1): p. 1-26.

30. Szentiványi, D., Performance and Availability Trade-offs in Fault-Tolerant Middleware, in Department of Computer and Information Science. 2002, Linköping University. p. 7-26. Available: http://www.ep.liu.se/lic/science_technology/09/82/digest.pdf

31. Raynal, M., A short introduction to failure detectors for asynchronous distributed systems. SIGACT News, 2005. 36(1): p. 53-70.

32. Keidar, I. and S. Rajsbaum, On the cost of fault-tolerant consensus when there are no faults: preliminary version. SIGACT News, 2001. 32(2): p. 45-63.

33. Schiper, A. and A. Sandoz. Uniform Reliable Multicast in a Virtually Synchronous Environment. in Proceedings of the 13th International Conference on Distributed Computing Systems. 1993, p. 561-568. Available: http://ieeexplore.ieee.org/iel2/912/7169/00287667.pdf?isnumber=&arnumber=287667

34. Hadzilacos, V. and S. Toueg, Fault-tolerant broadcasts and related problems, in Distributed systems (2nd Ed.), S. Mullender, Editor. 1993, ACM Press/Addison-Wesley Publishing Co. p. 97-145.

35. Mehra, P. and S. Fineberg. Fast and flexible persistence: the magic potion for fault-tolerance, scalability and performance in online data stores. in 18th International Parallel and Distributed Processing Symposium. 2004, p. 206. Hewlett-Packard Co., CA, USA.

36. Aguilera, M.K., et al. Stable Leader Election. in Proceedings of the 15th International Conference on Distributed Computing. 2001, Lecture Notes in Computer Science Vol. 2180, Springer-Verlag, p. 108-122.

37. Ferrari, A., JPVM: network parallel computing in Java. Concurrency: Practice and Experience, 1998. 10(11-13): p. 985-992. Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA.

38. Gray, J., Personal Communication: Simulation Material for Paxos Commit, 25/03/2004.

39. Stankovic, N. and K. Zhang, An evaluation of Java implementations of message-passing. Software—Practice & Experience, 2000. 30(7): p. 741-763.

40. Gibson, G.A. and R.V. Meter, Network attached storage architecture. Communications of the ACM, 2000. 43(11): p. 37-45.
Appendix A  Algorithm for Leader Election

Code for each process:

procedure StartRound(s)                            { executed upon start of a new round }
    if p ≠ s mod n then send (START, s) to all     { bring all to new round }
    r ← s                                          { update current round }
    leader ← ⊥                                     { demote previous leader but do not elect leader quite yet }
    restart timer

on initialization:
    StartRound(0)
    start tasks 0 and 1

task 0:                                            { leader/candidate sends OK every δ time }
    loop forever
        if p = r mod n and have not sent (OK, r) within δ then send (OK, r) to all

task 1:
    upon receive (OK, k) with k = r do             { current leader/candidate is active }
        if leader = ⊥ and received at least two (OK, k) messages then
            leader ← k mod n                       { now elect leader }
        restart timer

    upon timer > 2δ do                             { timeout on current leader/candidate }
        StartRound(r + 1)                          { start next round }

    upon receive (OK, k) or (START, k) with k > r do
        StartRound(k)                              { jump to round k }

    upon receive (OK, k) or (START, k) from q with k < r do
        send (START, r) to q                       { update process in old round }
Appendix B  Results for 2PC with mean message delay of 1.4ms

                        3 RM
events:                 150,710
messages:               216,024
messagesFailed:
messagesDeadLetter:
applicationMessages:    150,710
commits:                5,023
aborts:
virtualTime:            200,001,082

Table 53 Results for 2PC with 1.4ms mean message delay

time in ms   Commits         time in ms   Commits
20           1               40           280
21           2               41           199
22           5               42           185
23           10              43           182
24           27              44           153
25           44              45           107
26           50              46           71
27           108             47           73
28           132             48           64
29           153             49           32
30           208             50           26
31           265             51           25
32           318             52           28
33           328             53           9
34           336             54           10
35           308             55           9
36           346             56           2
37           335             57           5
38           298             61           1
39           286             62           1
                             64           1

Table 54 Transaction commit times for 2PC with 1.4ms mean message delay
Appendix C  Average Transaction Commit Time for 2PC

[Figure: Average Transaction Commit Time for 2PC - Average Time (ms) plotted against the Number of Resource Managers (0 to 36).]