MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2004
AUTHORSHIP DECLARATION
I, Keke Qi, confirm that this dissertation and the work presented in it are my own achievement.
1. Where I have consulted the published work of others, this is always clearly attributed.
2. Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this dissertation is entirely my own work.
3. I have acknowledged all main sources of help.
4. If my research follows on from previous work or is part of a larger collaborative research project, I have made clear exactly what was done by others and what I have contributed myself.
5. I have read and understood the penalties associated with plagiarism.
Signed:
Date:
Matriculation no:
DATA INTEGRATION SCENARIOS IN OGSA‐DAI
Keke Qi 10 September 2004
ABSTRACT
OGSA-DAI is middleware that provides data access and integration capabilities to a Grid, consistent with the OGSA vision. The role of the OGSA-DAI middleware is to present a unified programming model to application writers and to mask the problems of heterogeneity and distribution. The data access capabilities of OGSA-DAI have been well tested and demonstrated. In this work, OGSA-DAI's data integration capabilities are investigated and evaluated. The evaluation is based mainly on two data integration scenarios. In addition, a proof-of-concept piece of work that proposes a service driven data integration model is introduced and discussed.
TABLE OF CONTENTS
ABSTRACT ........ I
TABLE OF CONTENTS ........ II
LIST OF FIGURES ........ IV
LIST OF TABLES ........ VI
LIST OF CODES ........ VII
ACKNOWLEDGMENTS ........ VIII
1 INTRODUCTION ........ 1
5 DATA INTEGRATION USING THE SERVICE DRIVEN MODEL ........ 62
  5.1 Data copy ........ 62
    5.1.1 Client driven ........ 62
    5.1.2 Service driven ........ 63
  5.2 Distributed join ........ 65
    5.2.1 Client driven ........ 66
    5.2.2 Service driven ........ 68
  5.3 Profiling the service driven model ........ 72
  5.4 Summary ........ 73
LIST OF FIGURES

Figure 1  Web services stack ........ 2
Figure 2  OGSA-DAI framework; the arrows indicate the interactions occurring between components ........ 4
Figure 3  Data copy scenario ........ 8
Figure 4  Distributed join scenario ........ 9
Figure 5  Deployment diagram of the experiment environment ........ 11
Figure 6  Data copy using the OGSA-DAI client-driven approach ........ 14
Figure 7  Data copy: JDBC vs. OGSA-DAI (1) ........ 17
Figure 8  Data copy: JDBC vs. OGSA-DAI (2) ........ 18
Figure 9  Data copy: JDBC vs. OGSA-DAI (3); the lines of the two OGSA-DAI cases overlap ........ 18
Figure 10  Data copy: JDBC vs. OGSA-DAI (4) ........ 19
Figure 11  Data copy: JDBC vs. OGSA-DAI (5) ........ 19
Figure 12  The total time for pulling data out of the MySQL database ........ 24
Figure 13  Time to create a GDS ........ 25
Figure 14  Zoomed view of Figure 13; the block size is varied from 0 to 1000 rows ........ 25
Figure 15  Obtaining a ResultSet; the block size was varied from 100 rows to 10000 rows ........ 26
Figure 16  Zoomed view of Figure 15; the block size was varied from 50 to 1000 in fixed steps of 50 ........ 27
Figure 17  Pulling data from the MySQL database using the blockAggregator and outputStream activities ........ 28
Figure 18  Zoomed view of Figure 17; the block size was varied from 50 to 1000 in fixed steps of 50 ........ 29
Figure 19  Data copy using OGSA-DAI deliver activities ........ 29
Figure 20  Data copy using the OGSA-DAI deliverToGDT activity, where the source database is MySQL ........ 32
Figure 21  Data copy using the OGSA-DAI deliverToGDT activity, where the source database is Oracle ........ 32
Figure 22  Data copy using the OGSA-DAI deliverToGDT activity, where the source database is PostgreSQL ........ 33
Figure 23  Data copy using the OGSA-DAI deliverToGDT activity, where the source database is DB2 ........ 33
Figure 24  Data copy using the OGSA-DAI deliverToGDT activity, where the source database is SQLServer ........ 34
Figure 25  Data copy using the OGSA-DAI deliverFromGDT activity (1) ........ 35
Figure 26  Data copy using the OGSA-DAI deliverFromGDT activity (2) ........ 35
Figure 27  Data copy from an XML database to a relational database ........ 40
Figure 28  Client driven data integration ........ 45
Figure 29  Service driven data integration ........ 46
Figure 30  UML class diagram ........ 50
Figure 31  UML class diagram of GDSActivity ........ 52
Figure 32  UML diagram of processing a GDSActivity ........ 53
Figure 33  An example of GDSActivity data flow ........ 55
Figure 34  UML diagram of processing a sequence activity ........ 56
Figure 35  An example of the contained activities of a sequence activity chained by an inner IO ........ 57
Figure 36  UML diagram of processing a flow activity ........ 60
Figure 37  Data copy using the service driven model ........ 64
Figure 38  Distributed join ........ 65
Figure 39  Distributed join using the client driven model ........ 66
Figure 40  Distributed join implemented using the service driven model ........ 69
Figure 41  Performance of the simple data copy scenario using the service driven model ........ 72
LIST OF TABLES
Table 1  Activities supported by the current OGSA-DAI release ........ 5
Table 2  Specification of the machines used ........ 10
Table 3  Operating system and JDK version ........ 10
Table 4  Specification of data resources ........ 12
LIST OF CODES
Code 1  Perform document to the source GDS of the data copy scenario ........ 14
Code 2  Perform document to the sink GDS ........ 15
Code 3  Perform document to the source GDS ........ 30
Code 4  Perform document sent to the source GDS using the deliverFromGDT activity ........ 30
Code 5  Perform document sent to the sink GDS using the deliverFromGDT activity ........ 31
Code 6  Perform document sent to the source GDS ........ 37
Code 7  Perform document sent to the sink GDS ........ 38
Code 8  Example of a GDSActivity ........ 51
Code 9  Example of a sequence activity ........ 58
Code 10  Example of a flow activity ........ 61
Code 11  Complete perform document sent to the source GDS in the client driven model ........ 62
Code 12  Complete perform document sent to the sink GDS in the client driven model ........ 62
Code 13  Complete perform document sent to a coordinating GDS in the service driven model ........ 64
Code 14  A perform document sent to GDS1 to select data and deliver the data to GDS3 ........ 67
Code 15  A perform document sent to GDS3 to insert the data delivered from GDS1 ........ 67
Code 16  A perform document sent to GDS2 to select data and deliver the data to GDS3 ........ 67
Code 17  A perform document sent to GDS3 to insert the data delivered from GDS2 ........ 68
Code 18  A perform document sent to GDS3 to perform a join select query and retrieve the result ........ 68
Code 19  A perform document used in the service driven model to implement the distributed join scenario ........ 71
Code 20  grid_data_services_type_ext.xsd ........ 79
Code 21  gds_activity.xsd ........ 81
Code 22  sequence_activity.xsd ........ 82
Code 23  flow_activity.xsd ........ 84
ACKNOWLEDGMENTS
I would like to gratefully acknowledge the enthusiastic supervision of Dr. Mario Antonioletti, who gave me great support in my work and whose comments and suggestions made this work possible, and Tom Sugden, who shared with me his great knowledge and skill with OGSA-DAI, as well as wonderful music. I also wish to thank Alastair Hume (EPCC), Alex Woehrer (NeSC), Konstantinos Karasavvas (NeSC) and Neil P. Chue Hong (EPCC) for their comments on and discussions of my work. In addition, I thank the OGSA-DAI team and the EPCC support team for their assistance with all types of technical problems. Most importantly, I am grateful to all my friends in Edinburgh for their care and attention.
1 Introduction

1.1 Background

Data plays a fundamental role in all kinds of cross-organisational research and collaboration. The organisations involved can collectively be deemed to form virtual organisations (VOs) [15]. Data exists in a variety of different formats, such as unstructured or multimedia files in file systems and structured collections stored in relational or XML databases [4]; it can vary in volume and may be geographically distributed across a VO. Moreover, some large projects ([40], [41]), such as the LHC [40], will generate terabytes or even petabytes of data. It is impossible to handle such large amounts of data within a single organisation or institute, so addressing data access and integration across organisations is going to be one of the big challenges in setting up VOs.

At present, Grid technologies are being developed to facilitate "coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations" [15]. The Grid is "a system that coordinates distributed resources using standard, open, general-purpose protocols and interfaces to deliver nontrivial qualities of service" [27]. The concept of VOs proposed by [15] can be regarded as the framework in which data access and integration have to be addressed.

The Open Grid Service Architecture (OGSA) [9] effort within the Global Grid Forum (GGF) [28] is trying to define a standard framework that addresses the key concerns of the Grid, e.g. service registries and the discovery process, lifecycle management, metadata services, etc. This work is based on the Web services framework [16], "a distributed computing paradigm based on standard techniques for describing interfaces to software components, methods for accessing these components via interoperable protocols and discovery methods that enable the identification of relevant service providers" [29]. XML [10] technologies are used as the core foundation for Web services, as they provide platform independence and portability. XML Schema [36][37] defines a grammar that can be used to define other XML languages. The Web services stack (shown in Figure 1) consists of multiple layers; each layer is dedicated to a particular piece of functionality and is defined by an XML-based specification.
[Figure 1: Web services stack, comprising the Discovery (UDDI), Description (WSDL) and Messaging (SOAP) layers.]
• The Simple Object Access Protocol (SOAP) [30] is an XML protocol, defined using XML Schema, which is used to specify the communication messages.
• The Web Services Description Language (WSDL) [18] is defined using an XML schema and describes Web service interfaces.
• The Universal Description, Discovery and Integration (UDDI) registry [31] specifies a directory model for Web services.
As a result, OGSA inherits the advantages of Web services: platform independence and language neutrality. OGSA services can thus be implemented in different languages, deployed on different platforms and made to interoperate with each other using XML messaging. The Open Grid Service Infrastructure (OGSI) specification [3] attempted to define a detailed set of interfaces which could be used to realise the OGSA vision, based on extensions to Web services technologies. OGSI extended the concept of Web services [16] by defining a "Grid service". A Grid service can simply be viewed as a Web service with state and lifecycle management properties explicitly added. A Grid service is identified by a Grid Service Handle (GSH) [9] and described using GWSDL [3], an extension of WSDL. The operations of a Grid service are grouped into WSDL portTypes; a portType is a collection of related operations. For a variety of reasons [38] OGSI has been deprecated and will be replaced by the WS-Resource Framework (WSRF) [42]. At the time of writing, a public implementation of WSRF is not available.
1.2 OGSA-DAI

The Open Grid Service Architecture - Data Access and Integration (OGSA-DAI) [1] is middleware that provides data access and integration capabilities to a Grid, consistent with the OGSA vision. (Middleware is a distributed software layer, or 'platform', which abstracts the complexity and heterogeneity of the underlying distributed environment, with its multitude of network technologies, machine architectures, operating systems and programming languages.) The role of the OGSA-DAI middleware is to present a unified programming model to application writers and to mask the problems of heterogeneity and distribution [32]. OGSA-DAI enables various homogeneous or heterogeneous data resources, such as relational databases, XML databases [4] and even file systems, to be accessed through a uniform, Web-service-based data access interface.

At present, OGSI is used by OGSA-DAI as its infrastructure; this will be replaced by WSRF in a future release. OGSA-DAI is currently implemented on top of Globus Toolkit 3 (Globus core) [20], which also relies on OGSI, and is written in the Java language [21].

In OGSA-DAI, a data resource is exposed as a persistent Grid service [3] called a Grid Data Service Factory (GDSF), which acts as a point of presence for the data resource on the Grid and is identified by a GSH. Clients access a data resource through a transient Grid service called a Grid Data Service (GDS), created by the corresponding GDSF. A GDS acts as a client session and is responsible for managing access to a data resource. Any OGSA-DAI service can declare its existence and expose its metadata by registering with a DAIServiceGroupRegistry (DAISGR) service provided by OGSA-DAI. The basic OGSA-DAI framework is illustrated in Figure 2.
[Figure 2: OGSA-DAI framework. A client interacts with the DAISGR, the GDSF and the GDS, which run inside a service container; the GDSF and GDS front a data resource. The arrows indicate the interactions occurring between components.]
Two portTypes defined by the GGF Database Access and Integration Services (DAIS) working group [22] are supported by OGSA-DAI. At this time, OGSA-DAI is based on the GGF 7 version [23] of the DAIS specification. The GridDataService portType defines a perform method which consumes an XML document composed by the client; the document specifies the operations that the GDS needs to execute. The GridDataTransport portType defines a set of operations that allow a GDS to push or pull data to or from a third party. By this means, data can be transferred between two GDSs or between a client and a GDS.
OGSA-DAI has adopted a document based framework [39]. An XML document, called a perform document, is used to describe the functionality that an OGSA-DAI service can undertake. A basic task to be executed in a perform document is called an activity. An OGSA-DAI activity defines a capability or action that can be executed by a GDS. The activities discussed in this work are listed in Table 1. (In what follows, except where otherwise noted, the term "activity" is used in place of the more accurate "OGSA-DAI activity".)

Activity             Description
sqlQueryStatement    Run an SQL [24] query statement.
sqlBulkLoadRowSet    Bulk load data into a table.
xPathStatement       Run an XPath [25] statement against an XML database.
inputStream          Receive data through the GridDataTransport portType.
outputStream         Deliver data through the GridDataTransport portType.
deliverFromGDT       Pull data from a service implementing the GridDataTransport portType.
deliverToGDT         Push data to a service implementing the GridDataTransport portType.
deliverFromURL       Retrieve data from a URL.
xsltTransform        Transform data using an XSLT [14].
blockAggregator      Aggregate multiple blocks into a single block.

Table 1 Activities supported by the current OGSA-DAI release (replicated from the OGSA-DAI release 4 documentation)

A perform document can contain one or more activities. Activities can be chained to form a pipeline by linking their inputs and outputs together; by this means they can interoperate with each other. By combining various activities, very complex scenarios may be enacted using OGSA-DAI.

The OGSA-DAI engine, the heart of a GDS, is responsible for validating and analysing perform documents and constructing the activity instances specified in them. A pipe is constructed by the engine to link any two activities for which a data flow has been specified in a perform document. The data flow describes the motion of data from the output of one activity to the input of another.

OGSA-DAI provides a Java API, the client toolkit, to facilitate the composition of perform documents by clients. The basic building block of the client toolkit is a client toolkit activity. A client toolkit activity is a Java representation of a server-side activity that serialises into the XML fragment of the server-side activity to be executed by a GDS.
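Whether composed via the client toolkit or by hand, a perform document is ultimately an XML description of chained activities. As a rough illustration only, the following self-contained Java sketch builds such a document as a plain string. The activity element names follow Table 1, but the root element, the namespaces and the explicit input/output stream wiring of a real OGSA-DAI R4 perform document are simplified assumptions here.

// A sketch of composing a schematic perform document as a string.
// The element names follow Table 1; the root element name, namespaces
// and the input/output wiring of a real perform document are assumed.
public class PerformDocumentSketch {

    static String composePerformDocument(String sql, int blockSize) {
        StringBuilder doc = new StringBuilder();
        doc.append("<gridDataServicePerform>\n");
        // Run an SQL query on the data resource behind the GDS.
        doc.append("  <sqlQueryStatement>\n");
        doc.append("    <expression>").append(sql).append("</expression>\n");
        doc.append("  </sqlQueryStatement>\n");
        // Aggregate many small result blocks into fewer, larger blocks.
        doc.append("  <blockAggregator blockSize=\"").append(blockSize).append("\"/>\n");
        // Expose the aggregated stream through the GridDataTransport portType.
        doc.append("  <outputStream/>\n");
        doc.append("</gridDataServicePerform>");
        return doc.toString();
    }

    public static void main(String[] args) {
        System.out.println(composePerformDocument(
                "SELECT * FROM littleblackbook WHERE id &lt;= 100", 100));
    }
}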
1.3 Data integration in OGSA-DAI

Data integration is defined in [7] as "the problem of combining data residing at different sources, and providing the user with a unified view of these data". The evaluation and discussion of data integration in OGSA-DAI in this work is based on the investigation of the data integration scenarios described later. Three key advantages of data integration identified in [8] are that it frees users from:
• locating the data residing at the various data resources;
• interacting with each data resource independently; and
• combining results from different data resources manually.
The purpose of this work is to evaluate and extend OGSA-DAI's data integration capabilities. It tries to identify and assess the following issues, which relate to those capabilities:
• What is the overhead of using OGSA-DAI to implement a data integration scenario?
• How easy is it to implement a data integration scenario with OGSA-DAI?
• How well does OGSA-DAI scale for these data integration scenarios?
• Is OGSA-DAI robust and stable?
In addition, the experience of using and developing with OGSA-DAI is discussed.
The remaining chapters are organised as follows. Chapter 2 introduces the scenarios and methodologies used for the evaluation. A simple data copy data integration scenario is investigated in chapter 3. Based on the results obtained in chapter 3, a service driven model for data integration is considered in chapter 4, where a detailed design using activities is given as a proof of concept. The use of this service driven model for data integration is discussed in chapter 5. In chapter 6 the conclusions of this work are outlined and possible future directions for data integration in OGSA-DAI are discussed.
2 Methodologies

2.1 Scenarios

Two data integration scenarios are introduced and used to evaluate OGSA-DAI's data integration capabilities. The simple data copy scenario (shown in Figure 3) is discussed and investigated in sections 3 and 5.1. In this scenario data is copied from a source data resource (DR) to a sink data resource. A data resource in this scenario can be either a relational or an XML database. The scenario is implemented both using OGSA-DAI and using a direct approach based on the APIs provided by the Java programming language for connecting to data resources: JDBC [11] for relational databases and the XML:DB API [4] for XML databases.
[Figure 3: Data copy scenario. A client copies data from a source data resource to a sink data resource.]
A more complex data integration scenario (shown in Figure 4) implements a distributed join. In this case, the data is distributed between two geographically separated data resources. When a client wants to query these distributed data resources for information, each data resource has to be queried independently and the client must merge the results of these queries to form the final result. The data resources may or may not have the same data schema.
[Figure 4: Distributed join scenario. The numbered steps 1 to 5 run between the client and the data resources DR1, DR2 and DR3.]
A very simple method, described as follows, is used to implement this scenario (a JDBC sketch of the five steps appears after the list).
• Steps 1, 2: the client executes a select query on data resource 1 and inserts the data retrieved into data resource 3.
• Steps 3, 4: the client executes a select query on data resource 2 and inserts the data retrieved into data resource 3.
• Step 5: the client executes a join select on data resource 3 and retrieves the final result.
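For reference, this is how the five steps look in plain JDBC, the kind of direct approach used later for comparison. The connection URLs, credentials, table names (t1, t2, j1, j2) and column layouts are illustrative assumptions, not taken from the dissertation's schema.

import java.sql.*;

// Sketch of the five steps in plain JDBC: rows from DR1 and DR2 are
// copied into staging tables on DR3, where the join select produces
// the final result. URLs, table and column names are assumptions.
public class DistributedJoinSketch {

    static void copy(Connection from, Connection to, String select, String insert)
            throws SQLException {
        Statement stmt = from.createStatement();
        ResultSet rs = stmt.executeQuery(select);          // steps 1 and 3: select
        PreparedStatement ps = to.prepareStatement(insert);
        while (rs.next()) {                                // steps 2 and 4: insert
            ps.setInt(1, rs.getInt(1));
            ps.setString(2, rs.getString(2));
            ps.executeUpdate();
        }
        ps.close();
        stmt.close();
    }

    public static void main(String[] args) throws SQLException {
        Connection dr1 = DriverManager.getConnection("jdbc:mysql://host1/db", "u", "p");
        Connection dr2 = DriverManager.getConnection("jdbc:postgresql://host2/db", "u", "p");
        Connection dr3 = DriverManager.getConnection("jdbc:mysql://host3/db", "u", "p");
        try {
            copy(dr1, dr3, "SELECT id, name FROM t1", "INSERT INTO j1 VALUES (?, ?)");
            copy(dr2, dr3, "SELECT id, phone FROM t2", "INSERT INTO j2 VALUES (?, ?)");
            Statement stmt = dr3.createStatement();        // step 5: the join select
            ResultSet result = stmt.executeQuery(
                    "SELECT j1.id, j1.name, j2.phone FROM j1 JOIN j2 ON j1.id = j2.id");
            while (result.next()) {
                System.out.println(result.getInt(1) + " " + result.getString(2)
                        + " " + result.getString(3));
            }
        } finally {
            dr1.close();
            dr2.close();
            dr3.close();
        }
    }
}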
2.2 Method

A simple benchmarking framework was developed, which used the following tools to measure performance:
• Java's System.currentTimeMillis method, which returns the current system time in milliseconds.
• Apache Log4J [12], a logging toolkit used to write the system log to the console or to a log file.
Each benchmark experiment was run 10 times. The mean was then computed after excluding the maximum and minimum values, in order to reduce the influence of outliers on the mean. The standard deviation was calculated to show how much the results spread out from the mean; it is presented as error bars in the graphs later on.
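A minimal sketch of this benchmarking calculation using only standard Java: each run is timed with System.currentTimeMillis, the minimum and maximum are discarded, and the mean and standard deviation of the remaining runs are reported. The Experiment interface and the placeholder workload are assumptions for illustration.

import java.util.Arrays;

// Sketch of the benchmarking calculation: time each run, drop the
// minimum and maximum of the 10 runs, then report the mean and the
// standard deviation of the rest.
public class BenchmarkSketch {

    interface Experiment {
        void run() throws Exception;
    }

    static void benchmark(Experiment experiment, int runs) throws Exception {
        long[] times = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.currentTimeMillis();
            experiment.run();
            times[i] = System.currentTimeMillis() - start;
        }
        Arrays.sort(times); // minimum at index 0, maximum at index runs-1
        double sum = 0;
        for (int i = 1; i < runs - 1; i++) sum += times[i];
        double mean = sum / (runs - 2);
        double sumSq = 0;
        for (int i = 1; i < runs - 1; i++) sumSq += (times[i] - mean) * (times[i] - mean);
        double stddev = Math.sqrt(sumSq / (runs - 2));
        System.out.println("mean=" + mean + " ms, stddev=" + stddev + " ms");
    }

    public static void main(String[] args) throws Exception {
        // Placeholder workload standing in for a real data copy run.
        benchmark(new Experiment() {
            public void run() throws Exception { Thread.sleep(50); }
        }, 10);
    }
}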
2.3 Environments and tools

The experimentation environment included three EPCC machines: coal, brucite and diabase. The specifications of these machines are listed in Table 2.

Machine                          CPU                              Memory
coal (coal.epcc.ed.ac.uk)        Dual Pentium IV Xeon, 2.4 GHz    2GB
brucite (brucite.epcc.ed.ac.uk)  Pentium IV, 1500 MHz             1GB
diabase (diabase.epcc.ed.ac.uk)  Pentium IV Xeon, 2400 MHz        1GB

Table 2 Specification of the machines used

The operating systems and JDK versions installed on each of these machines are listed in Table 3.

Machine    OS                     JDK
coal       Red Hat Linux 9        J2SE 1.4.2-b28
brucite    Mandrake 7.2           J2SE 1.4.2_04-b05
diabase    Windows Server 2003    N/A

Table 3 Operating system and JDK version
Diabase is the server that hosts the large commercial databases; its JDK version is irrelevant here, as this machine could not be accessed directly.
OGSA-DAI R4, Tomcat 5.0.25 and Globus Toolkit core 3.2 were installed on brucite and coal. Figure 5 illustrates the deployment of the experiment environment.
[Figure 5: Deployment diagram of the experiment environment. Diabase hosts the Oracle, SQLServer and DB2 databases; brucite and coal each run OGSA-DAI on top of GT core, Tomcat and a JDK; brucite also hosts the Xindice database, and coal the MySQL and PostgreSQL databases.]
Five relational databases and one XML database, those officially supported by OGSA-DAI at the time, were used to conduct the experiments. Their specifications are presented in Table 4.

[Table 4: Specification of the data resources: MySQL, Oracle, PostgreSQL, DB2, SQLServer and Xindice.]

A database with the same data schema was deployed on each of the relational systems. A one-million-row table was inserted into every relational database, and 50,000 documents, each 250 bytes in size, were created in the XML database.
3 The data copy scenario and benchmarking

3.1 Data copy between two relational databases

In this case, the data copy scenario involved two relational databases: one acted as the source and the other as the sink. The data was copied from the source to the sink under the client's control.
3.1.1 Direct approach

JDBC was used to access the relational databases in the direct approach. The client executes an N-row SQL query (i.e. a select query that returns N rows of data):

SELECT * FROM littleblackbook WHERE ID <= N

on the source database to obtain a ResultSet instance. The client then inserts each row read from the ResultSet into the sink database. The data size, N, was selected from {100, 500, 1000, 5000, 10000, 50000, 100000}.

The whole data copy procedure consisted of four sub-operations (a JDBC sketch follows the list).
1. Creating two JDBC connections
2. Executing the SQL query
3. Pulling and pushing data
4. Releasing resources

The time measurements taken to benchmark the direct approach started just before point 2 and ended just after point 3.
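A minimal JDBC sketch of this procedure, with the benchmark's timing boundaries marked. The connection URLs and credentials are illustrative, and the assumption that littleblackbook has four columns is mine, not the dissertation's.

import java.sql.*;

// Sketch of the direct-approach copy with the timing boundaries marked.
// URLs, credentials and the four-column layout are assumptions.
public class DirectCopySketch {
    public static void main(String[] args) throws SQLException {
        int n = 10000; // the data size N
        Connection source = DriverManager.getConnection(
                "jdbc:mysql://coal/test", "user", "password");
        Connection sink = DriverManager.getConnection(
                "jdbc:oracle:thin:@diabase:1521:orcl", "user", "password");
        try {
            long start = System.currentTimeMillis();          // just before point 2
            Statement stmt = source.createStatement();
            ResultSet rs = stmt.executeQuery(
                    "SELECT * FROM littleblackbook WHERE ID <= " + n);
            PreparedStatement insert = sink.prepareStatement(
                    "INSERT INTO littleblackbook VALUES (?, ?, ?, ?)");
            while (rs.next()) {                                // point 3: pull and push
                for (int col = 1; col <= 4; col++) {
                    insert.setObject(col, rs.getObject(col));
                }
                insert.executeUpdate();
            }
            long elapsed = System.currentTimeMillis() - start; // just after point 3
            System.out.println("Copied " + n + " rows in " + elapsed + " ms");
        } finally {
            source.close();                                    // point 4: release resources
            sink.close();
        }
    }
}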
3.1.2 OGSA-DAI approach: client control

Figure 6 shows the configuration in which an OGSA-DAI client is used to control the pulling and pushing of data. Two perform documents are needed for this scenario in OGSA-DAI.
[Figure 6: Data copy using the OGSA-DAI client-driven approach. The client pulls data from the source GDS and pushes it to the sink GDS through the GDT portType; each GDS fronts its own data resource.]
The perform document sent to the source GDS is schematically described by Code 1.

<sqlQueryStatement/>
<blockAggregator/>
<outputStream/>

Code 1 Perform document to the source GDS of the data copy scenario
• The sqlQueryStatement activity has an SQL query expression: SELECT * FROM littleblackbook WHERE ID <= N. The data size N is varied over {100, 500, 1000, 5000, 10000, 50000}.
• The blockAggregator activity aggregates a large number of small blocks of data generated by the sqlQueryStatement activity into a smaller number of large blocks. In this case, the block size is fixed at 100 rows.
• The outputStream activity opens a GDT port on the source GDS from which users can pull data to the client side.

When processing a large amount of data, the blockAggregator and outputStream activities were used together to avoid pulling lots of small chunks of data, which can cause the client or the service to fail with a Java OutOfMemory error.

Another perform document (shown in Code 2) was sent to the sink GDS, consisting of one inputStream activity and one sqlBulkLoadRowSet activity.

<inputStream/>
<sqlBulkLoadRowSet/>

Code 2 Perform document to the sink GDS
• The inputStream activity opens a GDT port on the sink GDS through which clients can push data to the sink GDS.
• The sqlBulkLoadRowSet activity bulk loads the data into a table. The input of the activity must take the form of an XML document in the WebRowSet format [26]. It is the client's responsibility to translate the data pulled from the source data resource into a WebRowSet formatted document.

Since the perform method on the GridDataService portType of the OGSA-DAI client toolkit is synchronous, it does not return until its operation has completed. A thread therefore needs to be created on the client side so that the data transfer between the two GDSs can be performed asynchronously.
The whole procedure of the data copy scenario using OGSA-DAI consisted of the following operations (a sketch of the client-side threading follows the list).
1. Create two GDS instances.
2. Compose two perform documents.
3. Send the perform document (see Code 1) to the source GDS and obtain the ResultSet instance used to retrieve data from the source database.
4. Create a new thread to transfer the data read from the ResultSet to the sink GDS through its GDT portType.
5. Send the perform document (see Code 2) to the sink GDS. Once the GDT portType is initiated on the sink GDS, the thread can start the data transfer.
6. Release the resources allocated on the client side during the processing of this scenario.

To be consistent with the direct approach, the time measurements taken to benchmark this scenario started just before point 2 and ended just after point 5.
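The asynchronous step 4 can be sketched as follows. The GDS interface here is a stand-in used only to keep the example self-contained; it is not the real OGSA-DAI client toolkit API, which exposes the perform and GDT operations differently.

// Sketch of the client-side threading needed because perform() is
// synchronous: one thread streams blocks from the source GDS to the
// sink GDS while the main thread performs the sink document.
public class AsyncTransferSketch {

    interface GDS {                       // stand-in for a Grid Data Service
        String pullBlock();               // pull a block via the GDT portType
        void pushBlock(String block);     // push a block via the GDT portType
        void perform(String performDoc);  // synchronous perform operation
    }

    static void copy(final GDS source, final GDS sink, String sinkDoc) {
        // The transfer thread moves blocks while perform() blocks below.
        Thread transfer = new Thread(new Runnable() {
            public void run() {
                for (String block; (block = source.pullBlock()) != null; ) {
                    sink.pushBlock(block);
                }
            }
        });
        transfer.start();
        sink.perform(sinkDoc);            // blocks until the bulk load completes
    }
}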
3.1.3 Analysis and discussion 3.1.3.1 Performance Some of the benchmark results are presented in the graphs from Figure 7 to Figure 11. The fact that OGSA‐DAI performs worse than JDBC is not surprising because of the overheads introduced by using OGSA‐DAI. Figure 7 shows the performance results when MySQL database was used as the source database. Both the fastest and slowest performances occur in the copies made from the MySQL database to the SQLServer database.
16
Chapter 3: The data copy scenario and benchmarking 450000
[OGSADAI]MySQL to Oracle [OGSADAI]MySQL to PSQL [OGSADAI]MySQL to SQL
Figure 7 Data copy: JDBC vs. OGSA-DAI (1)
As the data type retrieved from the Oracle database was incompatible with the data type required by the other databases, the sqlBulkLoadRowSet activity could not insert data read from the Oracle database into any database other than MySQL. As a result, Figure 8 only shows the copy made from the Oracle database to the MySQL database. The OGSA-DAI client control approach, in which the data is copied back to the client and then pushed on to the sink GDS, was about 9 times slower than the direct approach.
[Figure 8: Data copy: JDBC vs. OGSA-DAI (2). Oracle to MySQL, JDBC vs. OGSA-DAI.]
The performance results when PostgreSQL was used as the source database are shown in Figure 9. In the two cases where the OGSA-DAI client control approach was used the results were very similar, hence the lines in the graph overlap. The times taken by the OGSA-DAI client control approach to copy data from the PostgreSQL database to the MySQL database and to the SQLServer database are almost the same. However, the times required by the direct approach to perform these two copies are significantly different. A possible reason for this is that both the MySQL database and the PostgreSQL database are installed on the same machine (coal), while the SQLServer database is installed on another machine (diabase). Network traffic may cause the observed disparity in performance.
[Figure 9: Data copy: JDBC vs. OGSA-DAI (3), with PostgreSQL as the source database. The lines of the two OGSA-DAI cases overlap.]
The performance results when the DB2 database was used as the source database are shown in Figure 10. Again, it is observed that both the fastest and slowest performances occurred in the copies made between the same pair of databases (DB2 and PostgreSQL).
[Figure 10: Data copy: JDBC vs. OGSA-DAI (4), with DB2 as the source database.]
Figure 11 shows the performance results where the SQLServer database was used as the source database.
[Figure 11: Data copy: JDBC vs. OGSA-DAI (5), with SQLServer as the source database.]
It is observed from these graphs that JDBC's performance advantage becomes more and more pronounced as the data size increases. When the data size is 50000 rows, the OGSA-DAI approach is 3 to 13 times slower than the direct approach. This shows that the overheads introduced by OGSA-DAI grow in proportion to the amount of data processed.

3.1.3.2 Ease of use

Compared with the direct approach, the OGSA-DAI client control approach introduces limited additional coding complexity. The additional work mainly lies in composing the two perform documents and pushing the data.
• Composing perform documents. In the simple data copy scenario, composing a perform document in OGSA-DAI looked relatively complex in comparison with the JDBC code. OGSA-DAI also allows clients to construct a perform document by loading a predefined XML document, which may be a simpler alternative.
• Pushing data. In the current OGSA-DAI release, the perform method on the GDS of the OGSA-DAI client toolkit is synchronous. Thus, a new thread spawned by the client is required to transfer data asynchronously between the two GDSs. An asynchronous version of the perform method would be a good enhancement to the OGSA-DAI client toolkit; the observer pattern [13] could be adopted to notify clients when the perform method completes (a sketch follows).
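A sketch of what such an observer-based asynchronous perform could look like on the client side. PerformListener and the GDS stand-in are hypothetical names for illustration, not part of the actual toolkit.

// Sketch of an observer-style asynchronous perform: the client
// registers a listener and is notified when the (blocking) perform
// call completes or fails.
public class AsyncPerformSketch {

    interface GDS { void perform(String doc); }   // stand-in, not the real API

    interface PerformListener {
        void performCompleted(String doc);
        void performFailed(String doc, Exception cause);
    }

    static void performAsync(final GDS gds, final String doc, final PerformListener l) {
        new Thread(new Runnable() {
            public void run() {
                try {
                    gds.perform(doc);      // blocking call runs off the main thread
                    l.performCompleted(doc);
                } catch (Exception e) {
                    l.performFailed(doc, e);
                }
            }
        }).start();
    }
}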
3.1.3.3 Robustness

Some OGSA-DAI bugs and flaws were detected during the investigation of this scenario.
Connection releasing

When the number of activities that a perform document contains exceeds a certain value, an exception indicating "Cannot connect to the database" is thrown. This happens frequently. The possible reasons for this are:
• each end-point activity is executed by an individual thread in an OGSA-DAI service;
• the OGSA-DAI service allocates a JDBC connection for each SQL activity; and
• these JDBC connections are not released until the Java objects that contain them are finalised by the JVM.
Thus, all the connections of a relational database are quickly consumed, and new requests cannot be served.
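For contrast, the standard JDBC discipline that avoids this exhaustion is to release statements and connections deterministically in finally blocks rather than relying on finalisation. A minimal sketch:

import java.sql.*;

// Sketch of deterministic connection release: closing the connection
// in a finally block returns it to the database immediately, instead
// of waiting for the JVM to finalise the object that holds it.
public class ConnectionReleaseSketch {
    static void runQuery(String url, String sql) throws SQLException {
        Connection con = DriverManager.getConnection(url, "user", "password");
        try {
            Statement st = con.createStatement();
            try {
                ResultSet rs = st.executeQuery(sql);
                while (rs.next()) { /* process row */ }
            } finally {
                st.close();
            }
        } finally {
            con.close(); // released even if the query throws
        }
    }
}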
Parameterised LIMIT select statement

Another bug was detected when a parameterised LIMIT select statement, e.g. SELECT * FROM littleblackbook LIMIT ?, was executed on the MySQL database using OGSA-DAI. The cause was identified after it was reported as a bug to the OGSA-DAI team. Since clients cannot specify type information for the parameter of a parameterised SQL query in the perform document, the OGSA-DAI engine cannot determine whether the parameter parsed from the perform document should be converted into an integer or a string. Currently, the engine uses the PreparedStatement.setObject method to set the parameters, so all parameters are set as Java Strings. As a result, when an unexpected string parameter is sent to MySQL, an exception is inevitable.
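The difference can be reproduced with plain JDBC. In this sketch the typed setInt call succeeds, while the untyped setObject call with a String value mirrors what the engine does and, with the MySQL driver of the time, leads to the reported exception. The connection URL is an illustrative assumption.

import java.sql.*;

// Sketch of the cause of the LIMIT bug: setObject with a String sends
// MySQL a quoted value for the LIMIT parameter, whereas setInt sends
// a proper integer.
public class LimitParameterSketch {
    public static void main(String[] args) throws SQLException {
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://coal/test", "user", "password");
        PreparedStatement ps = con.prepareStatement(
                "SELECT * FROM littleblackbook LIMIT ?");

        ps.setInt(1, 10);        // typed: MySQL receives LIMIT 10 and succeeds
        ps.executeQuery();

        ps.setObject(1, "10");   // untyped String, as the engine does:
        ps.executeQuery();       // MySQL sees LIMIT '10' and throws an exception
        con.close();
    }
}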
Very large data sizes

During the experiments, when the data size was increased above 100,000 rows the OGSA-DAI service gradually became unstable, while the JDBC version worked fine even when the data size was increased to one million rows. (It has been reported that the one-million-row data copy case had already been performed by the OGSA-DAI team.) The reason for this problem is not known at this time.
Incompatible data types

When trying to bulk load a WebRowSet result retrieved from a GDS accessing an Oracle database into a table behind a GDS for any database other than MySQL, an exception stating "unable to copy the data" was thrown. The reasons for this were:
• in a ResultSet from an Oracle database, all numbers are reported as the SQL type NUMERIC; and
• OGSA-DAI converts the NUMERIC type into a Java double, which causes the problem.
A bug report was submitted to the OGSA-DAI team and the problem has since been fixed.
3.2 Profiling the blockAggregator activity

The previous experiments show that OGSA-DAI does not perform as well as a direct approach, but they do not indicate where the overheads are introduced. To better understand this, this section investigates the blockAggregator activity in detail. The blockAggregator activity aggregates small pieces of data read from the output of another activity into a larger chunk for transportation. It is expected that using a blockAggregator activity together with an outputStream activity reduces the data transfer traffic.

The experiment used the perform document shown in Code 1 and involved only one GDS. The detailed procedure is as follows:
1. Create a GDS instance.
2. Send the perform document to the GDS and then obtain a ResultSet instance from the outputStream activity.
3. Loop over the ResultSet and read in all the data.
4. Release the GDS instance.

The MySQL database was used as the target data resource. The total time included the cost of points 1, 2 and 3 above; the results are shown in Figure 12. For each line in the graph, the data size was fixed and the block size was varied over {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2500, 5000, 7500, 10000}. The graph shows that:
• the total time is proportional to the size of the data to be pulled; and
• with increasing block size, the total time in each case increases linearly.
[Figure 12: The total time for pulling data out of the MySQL database. Mean time (ms) for 10 runs against block size, with one line per data size from 10000 to 100000 rows.]
The performance of creating a GDS instance is shown in Figure 13. A zoomed view of this graph is shown in Figure 14, where the x-axis is examined in the 0 to 1000 region. The fact that the time to create a GDS is related to the data size is unexpected; the reason for this is not known at this time.

[Figure 13: Time to create a GDS. Mean time (ms) for 10 runs against block size.]
[Figure 14: Zoomed view of Figure 13. The block size is varied from 0 to 1000 rows.]
Obtaining a ResultSet consisted of two sub-operations:
• sending a perform document to the target GDS; and
• obtaining a ResultSet from the outputStream activity.
Inside the OGSA-DAI engine, the three activities described in Code 1 are chained together by the dependencies set between them. None of them is executed by the engine until the GDT port opened by the outputStream activity receives a request from the client. Once a request is received, the SQL select query is executed on the database by the sqlQueryStatement activity, and the first chunk of data is retrieved, aggregated by the blockAggregator activity and returned to the client.
Therefore, it is no surprise to see that the time required to obtain a ResultSet, presented in Figure 15, is proportional to both the block size and the data size used. For each line in the graph, the data size was fixed and the block size was varied over {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2500, 5000, 7500, 10000}.
[Figure 15: Obtaining a ResultSet. The block size was varied from 100 rows to 10000 rows.]
In order to see clearly what happens in the region where the block size is smaller than 1000 rows, a new set of experiments was undertaken that zoomed into this region. The corresponding results are presented in Figure 16.

[Figure 16: Zoomed view of Figure 15. The block size was varied from 50 to 1000 in fixed steps of 50.]
The block size determines the size of each communication and thus the number of transfers required. The blockAggregator activity uses a StringBuffer to concatenate multiple small blocks into a larger one. A larger block size therefore introduces more overhead through the use of the StringBuffer, while, on the other hand, a smaller block size requires more transfers. Figure 17 shows the performance of pulling the data results from the MySQL database. It is observed that almost 90% of the total time (see Figure 12) was spent on the data transfer. The results show that when the block size was in the 1000 to 7500 range, increasing the block size led to a reduction in execution time. However, when the data size was larger than 30000 rows, a still larger block size, i.e. 10000, could degrade the performance. This seems to indicate that the StringBuffer operation then introduces a bigger overhead than the transportation saving.
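The essence of this trade-off can be seen in a small sketch of the aggregation step, which is assumed here to mirror what the activity does internally:

import java.util.Iterator;

// Sketch of the aggregation at the heart of the blockAggregator
// activity: small blocks are concatenated into one large block with a
// StringBuffer. Each append may grow and copy the internal buffer, so
// very large chunks trade fewer transfers for growing copy overhead.
public class BlockAggregatorSketch {
    static String aggregate(Iterator smallBlocks, int blocksPerChunk) {
        StringBuffer chunk = new StringBuffer();
        for (int i = 0; i < blocksPerChunk && smallBlocks.hasNext(); i++) {
            chunk.append((String) smallBlocks.next());
        }
        return chunk.toString(); // sent as a single GDT transfer
    }
}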
[Figure 17: Pulling data from the MySQL database using the blockAggregator and outputStream activities. The block size was varied from 100 rows to 10000 rows.]
Figure 18 shows the results of pulling data from a GDS, zooming into the region where the block size was varied from 50 rows to 1000 rows in fixed steps of 50. The five lines presented in the graph show similar behaviour regardless of the size of the data being retrieved. The jump when the block size is somewhere between 600 and 650 rows is unexpected; furthermore, the maximum amount of time required to pull the data from the GDS occurred when the block size was set to 650. The reason for this behaviour is not known at this time.

[Figure 18: Zoomed view of Figure 17. The block size was varied from 50 to 1000 in fixed steps of 50.]
3.3 Data copy using OGSA-DAI delivery activities

In this scenario, instead of pulling data from a source GDS to the client and then having the client push it to a sink GDS, the deliverToGDT and deliverFromGDT activities supported by OGSA-DAI are used to enable the two GDSs to transfer data directly between themselves through their GDT ports (shown in Figure 19).
[Figure 19: Data copy using OGSA-DAI deliver activities. The client sends control flow to the source and sink GDSs; the data flows directly between the two GDSs through the GDT ports.]
3.3.1 DeliverToGDT

Two perform documents were needed to compose this scenario. The perform document sent to the source GDS, shown in Code 3, consisted of a sqlQueryStatement activity and a deliverToGDT activity. The sqlQueryStatement activity performs an SQL select query on the source database. The deliverToGDT activity enables the source GDS to push data to the sink GDS via a GDT port opened on the sink GDS. The perform document sent to the sink GDS is the same as Code 2.

<sqlQueryStatement/>
<deliverToGDT/>

Code 3 Perform document to the source GDS
3.3.2 DeliverFromGDT

Two perform documents needed to be composed for this scenario. The perform document sent to the source GDS, shown in Code 4, consisted of two activities. The sqlQueryStatement activity performs an SQL select query on the source database. The outputStream activity opens a GDT port on the source GDS and serves requests.

<sqlQueryStatement/>
<outputStream/>

Code 4 Perform document sent to the source GDS when using the deliverFromGDT activity
The perform document sent to the sink GDS is shown in Code 5. The deliverFromGDT activity pulls data from the GDT port opened on the source GDS. The sqlBulkLoadRowSet activity bulk loads the data coming from the output of the deliverFromGDT activity into a corresponding table on the sink database.

<deliverFromGDT/>
<sqlBulkLoadRowSet/>

Code 5 Perform document sent to the sink GDS when using the deliverFromGDT activity
3.3.3 Analysis and discussion

3.3.3.1 Performance

The benchmarking results obtained using the deliverToGDT activity to implement the simple data copy scenario are shown in the graphs from Figure 20 to Figure 24. These results show that the performance of the deliverToGDT activity is similar to that of the OGSA-DAI client control approach (see section 3.1.2). The results for all pairs of databases used in this work are quite similar. The slowest result occurs when data is copied from the MySQL database to the PostgreSQL database; these two databases are installed on the same machine (coal). Figure 20 shows the performance results where the source database was the MySQL database. The three results obtained when copies are made from MySQL to Oracle, PostgreSQL and SQLServer are all very close.

[Figure 20: Data copy using the OGSA-DAI deliverToGDT activity, where the source database is MySQL.]
Figure 21 shows the performance results when the source database was the Oracle database. For the same reason described in section 3.1.3.1, only the result for the copy from the Oracle database to the MySQL database is presented in the graph.

[Figure 21: Data copy using the OGSA-DAI deliverToGDT activity, where the source database is Oracle.]
Figure 22 shows the performance results where the source database was the PostgreSQL database. The three results for data copies made from PostgreSQL to MySQL, Oracle and SQLServer overlap.

[Figure 22: Data copy using the OGSA-DAI deliverToGDT activity, where the source database is PostgreSQL.]
Figure 23 shows the performance results where the source database was the DB2 database. The four results for copies from DB2 to MySQL, Oracle, PostgreSQL and SQLServer are all very close.

[Figure 23: Data copy using the OGSA-DAI deliverToGDT activity, where the source database is DB2.]
Figure 24 shows the performance results when the source database was the SQLServer database. The three results obtained for data copies from SQLServer to Oracle, PostgreSQL and MySQL are very close.

[Figure 24: Data copy using the OGSA-DAI deliverToGDT activity, where the source database is SQLServer.]
The benchmarking results obtained when using the deliverFromGDT activity to implement the simple data copy scenario are shown in Figure 25 and Figure 26. The deliverFromGDT approach is slower than the deliverToGDT approach. Copying data from the MySQL database to the Oracle database is the slowest amongst all these results (shown in Figure 25). Compared with the deliverToGDT results, the results of the individual cases are also relatively dissimilar.

[Figure 25: Data copy using the OGSA-DAI deliverFromGDT activity (1). Cases shown: MySQL to Oracle, Oracle to MySQL and PostgreSQL to MySQL.]
The best performance of this scenario implemented with the deliverFromGDT activity occurred when the data was copied from the SQLServer database to the MySQL database (see Figure 26).

[Figure 26: Data copy using the OGSA-DAI deliverFromGDT activity (2). Cases shown: DB2 to SQLServer, SQLServer to MySQL and SQLServer to Oracle.]
3.3.3.2 Ease of use

Because the data was delivered by the GDS executing the corresponding delivery activity, the client did not need to handle the data transfer explicitly, as was the case with the JDBC and OGSA-DAI client control approaches. By specifying the data flow using the delivery activities, OGSA-DAI can handle the data transfer of the data integration scenario automatically. All the client needs to do is compose two perform documents and send them to the corresponding GDSs; some complexity on the client side is thus shifted to the server side. A drawback of this approach is that the client still has to compose several perform documents and manage their execution on the corresponding GDSs. As mentioned before, each perform operation is synchronous, so the client has to use threads explicitly to deal with this.
3.4 Data copy from XML database to relational database

3.4.1 Description

This scenario is similar to the scenario described in section 3.1. The difference is that it is performed between an XML database, which acts as the source data resource, and a relational database, which acts as the sink data resource. The case where the XML database acts as the sink was not investigated, because the current OGSA-DAI release does not support bulk loading data into an XML database.
3.4.2 Direct approach

In the direct approach, it is the client's responsibility to transform the XML format into an appropriate tabular format for entry into the relational data resource. The XML:DB API [4] was used to access the XML database. The XPath [25] query used to select N XML documents was

/entry[@id<=N]

The size N was selected from {100, 500, 1000, 5000, 10000}. The data extracted from each XML document was composed into an SQL insert query to be executed on the sink database. The procedure for the direct approach in this scenario is described below (a sketch follows the list).
1. Create a JDBC connection and an XML collection. An XML collection is similar to a JDBC connection in that it is mainly used to manage an access session.
2. Select N XML documents from the XML collection.
3. Extract the corresponding data from each selected XML document and compose an SQL insert query to be executed on the sink database.
4. Release the resources.

The time taken for this benchmark started just before point 2 and ended just after point 3.
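A sketch of this procedure using the XML:DB API. The collection URI and document structure are illustrative assumptions, and the Xindice driver registration and the JDBC insert on the sink side are elided.

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.ResourceIterator;
import org.xmldb.api.base.ResourceSet;
import org.xmldb.api.modules.XPathQueryService;

// Sketch of the direct XML-to-relational copy via the XML:DB API.
// A real client must first register the Xindice database driver with
// DatabaseManager, and must parse each document (e.g. with DOM) to
// build the SQL insert for the sink database.
public class XmlDbCopySketch {
    public static void main(String[] args) throws Exception {
        int n = 1000; // the data size N
        Collection col = DatabaseManager.getCollection(
                "xmldb:xindice://brucite:8080/db/entries"); // assumed URI
        try {
            XPathQueryService xpath =
                    (XPathQueryService) col.getService("XPathQueryService", "1.0");
            ResourceSet results = xpath.query("/entry[@id<=" + n + "]");
            ResourceIterator it = results.getIterator();
            while (it.hasMoreResources()) {
                // getContent() returns the document text for an XMLResource.
                String document = (String) it.nextResource().getContent();
                // Extract fields from 'document' and execute an SQL INSERT
                // on the sink database via JDBC (omitted).
            }
        } finally {
            col.close();
        }
    }
}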
3.4.3 OGSA-DAI approach

Two perform documents are needed for this scenario. The perform document (Code 6) sent to the source GDS consists of two activities, xPathQueryStatement and deliverToGDT. The xPathQueryStatement activity performs an XPath query on the source XML database. The deliverToGDT activity delivers the data retrieved from the output of the xPathQueryStatement activity to the sink GDS through a GDT port opened on the sink GDS.

<xPathQueryStatement/>
<deliverToGDT/>

Code 6 Perform document sent to the source GDS
The perform document sent to the sink GDS is described in Code 7.
<inputStream/>
<deliverFromUrl/>
<xsltTransform/>
<sqlBulkLoadRowSet/>
Code 7 Perform document sent to the sink GDS
• The inputStream activity opened a GDT port on the sink GDS.
• The deliverFromUrl activity was used to load the XSLT definition from a specified URL.
• The xsltTransform activity was used to translate the XML data into WebRowSet format. Two inputs are required by an xsltTransform activity: the XML data was read from the inputStream activity, and the XSLT definition was read from the deliverFromUrl activity. Given an appropriate XSLT definition, the xsltTransform activity can transform XML data between arbitrary schemas.
• The sqlBulkLoadRowSet activity bulk loaded the data into the corresponding table on the sink database.
A sketch of how these activities might be wired together follows below.
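The following sketch expands Code 7 with explicit wiring between the four activities. The activity names come from the list above; the wiring attributes, the input element names, the stylesheet URL and the table name are assumptions made for illustration only.

<gridDataServicePerform>
  <!-- GDT port through which the source GDS delivers the XML data -->
  <inputStream name="incoming"/>
  <!-- fetch the XSLT definition from a (placeholder) URL -->
  <deliverFromUrl name="stylesheet" url="http://example.org/entryToWebRowSet.xsl"/>
  <!-- transform the incoming XML into WebRowSet format -->
  <xsltTransform name="transform">
    <xmlInput from="incoming"/>
    <xsltInput from="stylesheet"/>
    <output name="rowset"/>
  </xsltTransform>
  <!-- bulk load the WebRowSet data into the sink table -->
  <sqlBulkLoadRowSet name="load" from="rowset" tableName="entry"/>
</gridDataServicePerform>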
The detailed procedure for this scenario using OGSA‐DAI is described below.
1. Create two GDS instances.
2. Compose two perform documents.
3. Asynchronously perform the perform document described in Code 7 on the sink GDS. A new thread needs to be created on the client to achieve this.
4. Send the perform document presented in Code 6 to the source GDS and wait until both operations complete.
5. Release resources.
The time taken to benchmark this scenario when implemented using the OGSA‐DAI approach started just before step 2 and ended just after step 4.
3.4.4 Analysis and discussion
3.4.4.1 Performance
The comparison between the direct and the OGSA‐DAI approaches is presented in Figure 27. The line marked as "XPath" represents the time required to perform the XPath query on the XML database using the direct approach. The gap between the two approaches is not as big as that found in the data copy scenario between two relational databases. One possible reason is that the performance of the direct approach mainly depended on the time required to process the XPath query. The reason why the time required to process the XPath query decreased with increasing data size is not known. The cost of using OGSA‐DAI increased with the data size, which differs from the behaviour encountered with the direct approach. This may be caused by the overheads introduced by the xsltTransform and deliverFromUrl activities, as the time required to process both of these activities is dominated by the size of the data they process.
Figure 27 Data copy from XML database to relational database: mean time (ms) for 10 runs against data size; series: Direct approach, OGSA-DAI approach, XPath
3.4.4.2 Ease of use
In the direct approach, the format transformation needs to be handled explicitly by the client, using the DOM [43] API to extract data directly from the XML document or using an XSLT processor. The xsltTransform activity enables OGSA‐DAI to transform arbitrary XML documents to arbitrary document types on the server side. By embedding it in the activity chain, the data can flow smoothly from the inputStream activity to the sqlBulkLoadRowSet activity, and the transformation is done automatically. This mechanism reduces the complexity on the client side.
Comparing this scenario with the data copy scenario between relational data resources, the heterogeneity of using relational and XML data resources introduces more difficulties when trying to do data integration using a direct approach. Moreover, this becomes more difficult as more data resources are introduced.
• The client has to use two different sets of APIs.
• The client has to address the difference in data formats between the different data resources.
• The scalability of the direct approach is poor because the XML transformation has to be explicitly performed on the client.
• The client has to handle the data transfer between the two data resources.
These difficulties can be addressed by OGSA‐DAI in this scenario.
• OGSA‐DAI itself provides a uniform interface to heterogeneous data resources.
• The format transformation can be addressed by defining an XSLT document, which is loosely coupled to the code. Furthermore, an XSLT document can itself be regarded as a data source and integrated into the whole scenario (an illustrative sketch follows below).
• The data transfer is handled by the OGSA‐DAI service.
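As an illustration of such an XSLT definition, the sketch below maps a hypothetical entry document, with an id attribute and name and phone children, onto simplified WebRowSet rows. The source element names are assumptions, and the real WebRowSet schema carries additional metadata that is omitted here.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- emit one WebRowSet row per source entry document -->
  <xsl:template match="/">
    <webRowSet>
      <data>
        <xsl:for-each select="//entry">
          <currentRow>
            <columnValue><xsl:value-of select="@id"/></columnValue>
            <columnValue><xsl:value-of select="name"/></columnValue>
            <columnValue><xsl:value-of select="phone"/></columnValue>
          </currentRow>
        </xsl:for-each>
      </data>
    </webRowSet>
  </xsl:template>
</xsl:stylesheet>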
3.5 Summary
From the results of the above investigation, it appears that for the biggest data sizes the performance of OGSA‐DAI is about 3 to 13 times slower than that of a direct JDBC approach. The major OGSA‐DAI overhead identified by the above investigations is introduced by the GDT when transferring data. Robustness is also an issue for OGSA‐DAI: it was not found to be fully compatible with the five relational databases it officially supports. Some OGSA‐DAI bugs and flaws were found during the experiments. When the data sizes in the data copy scenario were increased, OGSA‐DAI always failed, for a variety of different reasons (see section 3.1.3.3).
In terms of the programming experience of this scenario, using the OGSA‐DAI client toolkit and the direct approach were quite similar. Both approaches used the same query languages, e.g. SQL and XPath, and programming interfaces, such as ResultSet in JDBC and Collection in the XMLDB API, to implement the data copy scenario. Composing a perform document using the OGSA‐DAI client toolkit introduces some complexity on the client side; however, it is a simple job and can be avoided by using a pre‐defined perform document instead. The advantages of using OGSA‐DAI to perform data integration identified in this data copy scenario can be summarised as follows:
• Uniform access interface. The uniform access interface of OGSA‐DAI can greatly simplify the complexity of data integration. All data resources, whether homogeneous or heterogeneous, and provided by different vendors, are exposed by their GDSFs and identified through GSHs. The client does not need to be concerned with whether JDBC or XMLDB drivers are used, or with the details of the data resource configuration. Furthermore, only one interface, the GDS, is needed by clients to access all kinds of data resources. The instructions as to how the data integration scenario is to be processed by a GDS are described in a perform document. By this means, OGSA‐DAI users can focus on the data integration logic instead of how to implement it.
• GDT. OGSA‐DAI enables data to be transported between two GDSs directly through a GDT instead of passing back and forth through the client side. This basic service coordination functionality frees the client from handling the data transportation explicitly.
• Data flow. In the data integration scenario above, the client can specify the data flows in a perform document by linking the input and output of two related activities together. Moreover, these data flows can be recognised and handled by OGSA‐DAI. This reduces the complexity at the client side, and enables the whole data integration to be processed automatically and independently of the client.
It is observed from the data copy scenario that the complexity on the client side can be reduced by implementing functionality, such as data transformation and data transport, at the server side. Two GDSs can be coupled together in OGSA‐DAI using the deliverToGDT or deliverFromGDT activities to enable data to be transported directly between them. This frees clients from having to handle the data transfer directly. However, what is also observed from the above scenario is that the client still has to interact with every data resource independently and synchronously. Each GDS needs to be initialised and the perform documents need to be set up independently and sent to each GDS, and the perform method on the GridDataService in the OGSA‐DAI client toolkit is synchronous. When the number of data resources participating in a data integration scenario increases, the client still needs to interact with each GDS. A high level OGSA‐DAI service driven data integration framework, which attempts to remove this restriction, is proposed in the next chapter.
4 A service driven model
4.1 Introduction
Amongst all the potential types of users of OGSA‐DAI, an important role is that of the analyst. An analyst is mainly responsible for analysing a data integration scenario and identifying the data integration logic for that scenario. Ideally, the analyst could hand this logic directly to OGSA‐DAI as the instructions required to direct the whole scenario, so that the data integration could proceed smoothly and automatically. However, this cannot be done easily with the current OGSA‐DAI release. The main reason is that the interaction between two GDSs has only limited support; only data transfer between services is supported. Also, OGSA‐DAI does not have support in its document framework to express data integration logic properly. The perform document can only express very limited data integration logic, such as the actions and a limited expression of data flow. As a result, some logic has to be implemented on the client side and the whole scenario has to be explicitly controlled by the client.
4.1.1 Client and service driven data integration
In the current OGSA‐DAI distribution it is up to the client to explicitly drive any data integration scenarios, such as the one shown in Figure 28. In this case a user, or application, would have to create both GDSs and then compose perform documents, using the OGSA‐DAI client toolkit, which would then be sent to each GDS. However, when it comes to coordinating more than one GDS in a data integration scenario it is up to the user to compose the corresponding number of perform documents and manage the GDSs explicitly.
Figure 28 Client driven data integration (the client drives both the source GDS and the sink GDS; data flows from the source DR through a GDT to the sink DR)
In this chapter, a service driven model is proposed where the service‐to‐service interactions are mediated through a single service, shown schematically in Figure 29. This would free users from much of the explicit service management. In the service driven case a user would only need to describe one perform document and send this to the coordinating service. A GDS supporting an extra set of activities would then be used as a pivot point to coordinate the other GDSs. This preliminary work designs and implements a service driven model by extending the existing OGSA‐DAI activity model. The work should be regarded as a proof of concept which, if successful, could see some of the functionality suggested here as activities eventually being integrated into OGSA‐DAI components such as the GDS engine.
Figure 29 Service driven data integration (the client sends a single perform document to a coordinating GDS, which drives the source and sink GDSs; data flows between the data resources through a GDT)
4.1.2 Definitions
Some new terminology is introduced here, base and composite activities, which is used throughout the remainder of this document to mean:
• base activities: all activities supported in the current OGSA‐DAI release are referred to as base activities. These do not contain other activities and usually perform a single operation, such as executing an SQL or XMLDB query on a target data resource, transforming an XML document using an XSLT document, or transferring data.
• composite activities: activities that act as containers for other activities. The activities enclosed in a composite activity are called contained activities. A composite activity may manage and control these contained activities in a similar manner to the OGSA‐DAI engine.
4.2 System overview
4.2.1 Introduction
The current OGSA‐DAI release, as it stands, is not entirely sufficient to support the service‐driven model, for the following reasons:
1. Limited interaction functionality between GDSs. In the current OGSA‐DAI release interaction between GDSs is only supported through the GDT, and this interaction is limited to data transfer. No function is currently provided by OGSA‐DAI to enable a GDS to control or manage other GDSs.
2. Local data flow. Users can specify the data flow between two activities in a perform document. However, in a data integration scenario involving multiple GDSs, each GDS will only be able to see its local contribution to the global data flow. This is a problem for a service driven model; hence, the coordination among multiple GDSs has had to adopt the client‐driven model.
3. No control flow syntax or control flow support. In the service driven model, the coordinating GDS needs to know the particular order in which activities are to execute. However, control flow syntax and management are not supported in the current OGSA‐DAI distribution.
In order to be able to use a service‐driven model the extensions discussed in the following sections are necessary.
4.2.1.1 Interaction between GDSs
In a service‐driven model a GDS needs to coordinate other GDSs to implement a given data integration scenario. The GDSActivity proposed in this document is used to meet this requirement. A GDSActivity is a composite activity executed by a GDS to send a collection of base activities to another GDS and receive the corresponding results. By this means, a GDS is able to use or manage other GDSs in OGSA‐DAI.
4.2.1.2 Control flow management among multiple GDSs
In a service driven model, the coordinating GDS needs to be able to explicitly manage the sequence in which GDSActivities are executed on GDSs to enable service coordination. There is no support for the specification and management of control flow (an abstract representation of all possible sequences of events in a program's execution) in the current OGSA‐DAI release. Hence, two new composite activities, sequence and flow, are used to prescribe two types of simple control flow for any activities they contain. Because this document mainly proposes an approach to support service‐driven data integration in OGSA‐DAI, discussion of control flow within a GDS is out of scope. As a result of this restriction, the activities contained in sequence and flow activities must not be base activities.
4.2.1.3 Data flow
At present, in OGSA‐DAI the IO an activity undertakes takes one of the following two forms:
• Internal IO: Both the producer and the consumer of an internal IO are activities located in the same GDS.
• External IO: The producer and the consumer of an external IO are activities located in different GDSs. For example, the output of an outputStream activity is an external output.
Two new types of IO are identified in composite activities; their definitions are limited to the scope of a composite activity:
• Inner IO: Both the producer and the consumer of the IO are activities inside a composite activity.
• Outer IO: Either the producer or the consumer of the IO is an activity outside of a composite activity but located on the same GDS.
Composite activities need to explicitly handle the inner and outer types of IO. The details of composite activities are discussed in the following sections.
4.2.2 GDSActivity
A GDSActivity is a composite activity which is used to perform a collection of activities on a target GDS and receive the corresponding results. The client must specify the GSH of either a GDS or a GDSF. The GDSActivity decides how to obtain the GDS instance according to the name of the element containing the GSH. For simplicity, all activities contained in a GDSActivity must be base activities.
4.2.3 Sequence and flow activities
Sequence and flow activities are containers for composite activities. More complex control flow can be defined and processed by nesting composite activities. A sequence activity prescribes a sequential execution order: the activities contained in a sequence activity must be executed by the coordinating GDS in the order they are specified. A flow activity enables the contained activities to be executed in parallel.
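For example, nesting might be used to run two coordinated GDSActivities in parallel and only start a third once both have completed. The sketch below follows the element naming proposed in this chapter; the contained activity bodies are elided, and the exact schema is an assumption.

<sequence name="outer">
  <!-- the two contained activities below run in parallel -->
  <flow name="inParallel">
    <gdsActivity name="updateA"> ... </gdsActivity>
    <gdsActivity name="updateB"> ... </gdsActivity>
  </flow>
  <!-- runs only after both parallel activities have completed -->
  <gdsActivity name="queryAfterBoth"> ... </gdsActivity>
</sequence>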
4.2.4 Security
As this document only presents a preliminary proposal for service driven data integration in OGSA‐DAI, security is not a main concern here.
4.3 Detailed design
4.3.1 Introduction
Figure 30 shows an overview of the classes needed to implement the service‐driven model. The abstract class CompositeActivity implements the Activity interface and provides common functions for all three inheritors, such as identifying the inputs and outputs of each activity.
Figure 30 UML class diagram
4.3.2 GDSActivity
The GDSActivity XML schema is shown in appendix 1.2. The activities contained in a GDSActivity can be one or more base activities. They are enclosed in an <activities> element so as to facilitate the XML processing. Either a <gdsFactory> or a <gds> element is used by a client to specify the target GDS. Code 8 presents an example of a GDSActivity. The <gdsFactory> element indicates that a GDS needs to be created from a GridDataServiceFactory, which is specified by the url attribute. The GDSActivity (gdsActivity) contains one sqlUpdateStatement (statement) activity. The output of the GDSActivity is established and mapped to the output (results) of the contained sqlUpdateStatement activity. After the GDSActivity receives the result of the sqlUpdateStatement, it rewrites the result to its output. The data may be composed into a response document or consumed by another activity.
<gdsActivity name="gdsActivity">
  <gdsFactory url="..."/>
  <activities>
    <sqlUpdateStatement name="statement">
      <sqlParameter position="1">12</sqlParameter>
      <sqlParameter position="2">321</sqlParameter>
      <expression>insert into littleblackbook values (?,?)</expression>
    </sqlUpdateStatement>
  </activities>
</gdsActivity>
Code 8 Example of a GDSActivity
4.3.2.1 Performing activities
The UML class diagram in Figure 31 shows the classes used by the GDSActivity to perform the activities it contains.
Figure 31 UML class diagram of GDSActivity
When a GDSActivity is executed by the GDS engine, it acts as a client to the GDS specified in its description. The processing steps of a GDSActivity are illustrated in Figure 32.
1. If a <gdsFactory> element is specified in the XML description of a GDSActivity, the GDSActivity creates a new GDS instance from the given GDSF's GSH. If a <gds> element is specified, the GDSActivity uses that instance.
2. The GDSActivity constructs the activities it contains into a DocumentRequest object. Because these activities are executed by the coordinated GDS, it is not necessary for the GDSActivity to check whether they are supported by the coordinating GDS.
3. The DocumentRequest is sent to the coordinated GDS using the OGSA‐DAI client toolkit.
4. If any outer input is identified in the GDSActivity, it transfers data from that input to the coordinated GDS using a GDT portType.
5. A response document is received.
6. The GDSActivity is responsible for transferring data from the coordinated GDS to any outer output identified in the GDSActivity, using a GDT portType.
7. If the GDSActivity created the coordinated GDS from a GridDataServiceFactory, it destroys the GDS instance when processing of the base activities in that GDS has finished.
Figure 32 UML diagram of processing a GDSActivity
Any exception that arises from these steps causes an ActivityUserException to be thrown by the ActivityEngine.
4.3.2.2 Data flow
The GDSActivity is responsible for handling the data transfer for any outer IO of the activities it contains. Thus, if a base activity contained in a GDSActivity needs to read data from an outer input of the GDSActivity, an input link is established by the GDSActivity, and the GDSActivity is responsible for transferring data between these two activities. If an activity contained in a GDSActivity needs to write data to an outer output of the GDSActivity, an output link is established by the GDSActivity, and the transfer is likewise done by the GDSActivity. The inner IOs of a GDSActivity are managed by the coordinated GDS. Currently, in order to transfer data between two GDSs a GDT has to be explicitly specified. Because of this, if a GDSActivity takes an outer input, the consumer of this outer input must be an inputStream activity. An example of the data flow of a GDSActivity is shown in Figure 33.
Figure 33 An example of GDSActivity data flow (the GDSActivity on the coordinating GDS mirrors Activity1, Activity2 and Activity3, which actually execute on the coordinated GDS; data crosses between the two GDSs through a GDT)
The GDSActivity in Figure 33 contains three base activities. Dashed boxes are used to represent the activities contained in the GDSActivity, as they are actually executed by the coordinated GDS. The data flow is illustrated by the arrowed lines. Activity3 needs to read data from an outer input of the GDSActivity, so an input (the top rectangle in the GDSActivity) is set up by the GDSActivity and the data is transferred from the outside to Activity3 by the GDSActivity. In the same way as the input processing, an output (the bottom rectangle in the GDSActivity) is established by the GDSActivity and is used to write the data generated by Activity3 to the outside of the GDSActivity.
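In XML terms, a GDSActivity with an outer input might be written roughly as below, respecting the constraint above that the consuming activity on the coordinated GDS must be an inputStream activity. The wiring attributes, the table name and the elided GDS handle are assumptions for illustration.

<gdsActivity name="loadRemote">
  <gds handle="..."/>
  <activities>
    <!-- consumes the GDSActivity's outer input, delivered over a GDT -->
    <inputStream name="incoming"/>
    <!-- loads the delivered rows into a table on the coordinated GDS's resource -->
    <sqlBulkLoadRowSet name="load" from="incoming" tableName="littleblackbook"/>
  </activities>
</gdsActivity>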
4.3.3 Sequence activity
A sequence is a composite activity which contains one or more composite activities that are performed sequentially, in the order in which they are specified. The XML schema for a sequence activity is shown in appendix 1.3. The sequence activity completes when the last activity contained in the sequence finishes or when an error occurs during execution.
4.3.3.1 Processing activities
When a sequence activity is executed by a GDS, it first identifies the activities it contains, then constructs them into corresponding composite activity implementation instances, and finally executes them in turn. The processing is illustrated in Figure 34.
Figure 34 UML diagram of processing a sequence activity
1. A sequence activity constructs the activity elements it contains into activity instances. If an unsupported activity exists, an exception is thrown and the sequence activity is terminated.
2. If an inner IO is identified between two activities contained in a sequence activity, a pipe is created to link these two activities. Thus an activity chain is formed. An example is shown in Figure 35.
Figure 35 An example of the activities contained in a sequence activity being chained by an inner IO (Activity 1 and Activity 3 are linked by an inner IO to form an activity chain)
3. The sequence activity processes the end point activity of each activity chain in turn. If any exception arises during this step, the sequence activity terminates and an exception is thrown up to the GDS engine.
4. The sequence activity completes when no contained activity instances are left.
4.3.3.2 Data flow
In the same way as for a GDSActivity (section 4.3.2.2), a sequence activity also needs to link inputs and outputs that come from outside to the inputs and outputs of the activities it contains. For an inner IO of a sequence activity, a pipe is used to link the two corresponding activities involved in the IO; by this means, an activity chain is formed. For an outer IO of a sequence activity, the sequence activity needs to handle the data transfer explicitly.
4.3.3.3 Example
There are two gdsActivities in the sequence activity presented in Code 9, gdsActivity1 and gdsActivity2. These must be executed sequentially. The sequence activity completes when gdsActivity2 completes. No input is required
for this case; the two outputs exposed by the sequence activity (results1 and statementResponse) link to the innermost outputs that belong to the activities update1 and query1 respectively.
<sequence name="sequence">
  <gdsActivity name="gdsActivity1">
    <gdsFactory url="..."/>
    <activities>
      <sqlUpdateStatement name="update1">
        <expression>insert into littleblackbook values (12,33)</expression>
      </sqlUpdateStatement>
    </activities>
  </gdsActivity>
  <gdsActivity name="gdsActivity2">
    <gdsFactory url="..."/>
    <activities>
      <sqlQueryStatement name="query1">
        <expression>select * from littleblackbook where id&lt;=321 and id&gt;=12</expression>
        <webRowSetStream name="statementResponse"/>
      </sqlQueryStatement>
    </activities>
  </gdsActivity>
</sequence>