WHITE PAPER
Third Generation ETL: Delivering the Best Performance
BY YVES DE MONTCHEUIL AND CHRIS DUPUPET, SUNOPSIS
The process of selecting an ETL (extract, transform, load) software solution is typically a complex one, during which many features need to be evaluated. One of the most critical criteria is how well the solution performs in a given environment and configuration. Many vendors of ETL software will conclude their sales pitch by quoting numbers – always very impressive – regarding the performance of their solution. These numbers are almost invariably given in the form: this tool can transfer so many rows per unit of time. Many users, however, have been misled by impressive-looking performance numbers that turned out to be less than impressive in real life. Why? Because performance is one of the most difficult elements to evaluate without conducting a full-scale evaluation. Indeed, performance in the production environment is significantly affected by the overall architecture of the information system and by the flow of data during the ETL process. This white paper addresses a number of elements that impact the performance of ETL and explains which architectures and approaches give some products a nearly “unfair advantage” when it comes to performance.
Historical Background

As computer systems started to evolve from monolithic mainframes to distributed computing systems, and as business intelligence made its debut, the first ETL solutions were introduced. Since that time, several generations of ETL have been produced.

First Generation: the Origin of ETL and the Legacy Code Generators.

The original data integration tools generated native code for the operating system of the platform on which the data integration processes were to run. Most of these products actually generated COBOL, since at that time data was largely stored on mainframes. These products made data integration easier than it had been, by providing a centralized tool to generate data integration processes and by propagating the code to the appropriate platforms – instead of requiring programs to be written manually. Performance was very good because of the inherent performance of native compiled code, but these tools required in-depth knowledge of programming on the different platforms. Maintenance was also difficult, because the code was disseminated across different platforms and differed with the types of sources and targets. At the time, this architecture provided the best possible performance, since data was stored in flat files and hierarchical databases and record-level access was fast. Although this worked well on mainframes, using such an approach on relational databases has proven less successful for managing large data volumes.
Second Generation: the Proprietary ETL Engines.

Next came the second generation of ETL tools, based on a proprietary engine (sitting between the sources and the data warehouse target) that runs all the transformation processes. This approach solved the problem of having to use different languages on different platforms, and required expertise in only one programming language: the language of the ETL tool itself. However, a new problem arose: the proprietary engine performing all the transformations became a bottleneck in the transformation process. All data, traveling from the various sources to the target, had to pass through the engine, which processed data transformations row by row – and row-by-row processing is very slow when dealing with significant volumes of data.

Third Generation: the E-LT (Extract - Load & Transform) Architecture.

Addressing the challenges faced by tools from the previous two generations while leveraging their respective strengths, a new generation of ETL tools recently appeared. Since the inception of the previous generation – proprietary engines – database vendors have invested considerable resources to greatly improve the capabilities of their SQL dialects. These improvements make it possible for an ETL tool to generate and execute highly optimized data integration processes in the native SQL, or other native languages, of the databases involved in these processes. The strength of the first generation – legacy code generators – was the centralized approach to generating data integration processes and automatically propagating optimized code to the appropriate platforms; performance was very good because of the native compiled code. Likewise, the third generation – the E-LT, or distributed, architecture – provides a centralized design tool that generates native code for each platform. The strength of the second generation – proprietary engines – was the graphical environment and the transformation features that were simply not available in SQL in the 90s. The third generation likewise provides a highly graphical environment, along with the ability to generate native SQL to execute data transformations on the data warehouse server. This new approach has several clearly identified advantages:
• It eliminates the ETL hub server sitting between sources and target, which was introduced by the second generation of ETL products.
• Using an RDBMS to execute data transformations allows bulk processing of the data. Bulk processing is up to 1,000 times faster than row-by-row processing, and the larger the volume of data to manage, the more important bulk processing becomes (a short SQL sketch after this list illustrates the difference).
• It also provides better performance than any other solution, because transformations are executed by the RDBMS engine in bulk instead of row by row as with second generation ETL.
Additionally, the large database vendors – Oracle, IBM, Microsoft, Sybase, etc. – have had significantly more resources and time to invest in improving the performance of their engines than the vendors of second generation ETL software have had; relying on the RDBMS engines is a way to leverage these investments.
• In the E-LT architecture, all database engines can potentially participate in a transformation – thus running each part of the process where it executes most efficiently. Any RDBMS can act as an engine, and it may make sense to distribute the SQL code among sources and target to achieve the best performance. For example, a join between two large tables may be done on the source.
• A centralized design tool makes programming and maintenance easy, allowing developers to control which database engine processes which piece of information, and in which way.
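To make the bulk-versus-row-by-row contrast concrete, here is a minimal SQL sketch; the staging and warehouse table names (stg_orders, stg_exchange_rates, dwh_sales) are hypothetical, and the code illustrates the principle rather than the output of any particular tool.

    -- Row by row (second generation engine): each source row is fetched,
    -- transformed in the engine, then sent to the target as its own command:
    --   INSERT INTO dwh_sales (order_id, customer_id, amount_usd) VALUES (:1, :2, :3);
    --   ...repeated once per row, e.g. 1,000,000 times for 1,000,000 rows.

    -- Bulk (E-LT): the same transformation expressed as one set-based statement,
    -- executed entirely by the target RDBMS engine.
    INSERT INTO dwh_sales (order_id, customer_id, amount_usd)
    SELECT o.order_id,
           o.customer_id,
           o.amount * r.usd_rate              -- transformation applied in bulk
    FROM   stg_orders o
    JOIN   stg_exchange_rates r
           ON r.currency_code = o.currency_code
    WHERE  o.order_status = 'SHIPPED';        -- filtering handled by the engine as well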
The rest of this white paper details several elements that impact the performance of the ETL process for the products that belong to the third generation – the generation to which Sunopsis products belong.
The E-LT Architecture

The innovative architecture of third generation ETL, based on SQL code generation, leverages the RDBMS already installed on the data warehouse server. This approach is clearly different from the hub-and-spoke architecture, which relies on a central engine (usually proprietary) that performs the transformations for all the processes throughout the information system. The E-LT architecture brings performance benefits in several ways:

Reduces network traffic to improve performance. The first benefit of an E-LT and distributed architecture is the reduction of traffic on the network. Data can be combined and filtered on the source systems, thus avoiding unnecessary transfers to a transformation engine. Sending the data directly from the source database to the target database saves even more, since it offers a direct path – as opposed to the engine solution, where data first has to be loaded into the engine (via the network), processed, and then sent to the target (via the network again). Indeed, with the hub-and-spoke architecture of second generation ETL tools, all the data needs to go through the engine to be processed row by row, creating additional network traffic between the proprietary engine and the target database (the data warehouse) in order to validate data integrity. Unfortunately, transferring data over the network is usually the most costly operation that can be performed.
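As an illustration of combining and filtering on the source before moving data over a direct path, consider the following sketch. It assumes an Oracle-style database link named src_link from the target server to the source; the table names are hypothetical, and other RDBMSs would use their own native mechanism (linked servers, etc.).

    -- Step 1, executed on the source: combine and filter locally, so that only
    -- the rows of interest end up in a compact extract table.
    CREATE TABLE src_extract AS
    SELECT o.order_id, o.customer_id, c.country_code, o.amount
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id
    WHERE  o.order_date >= DATE '2005-01-01';

    -- Step 2, executed on the target: pull the extract straight from source to
    -- target over the database link: one hop, with no intermediate engine.
    INSERT INTO stg_orders (order_id, customer_id, country_code, amount)
    SELECT order_id, customer_id, country_code, amount
    FROM   src_extract@src_link;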
It is far more efficient to perform data transformations on the same server as the data warehouse database, dramatically reducing network traffic.

Distributes the load for better scalability. The SQL code generation approach enables an E-LT, or distributed, architecture. The data integration tool allows each RDBMS (source or target of the data transfer process) to participate in the transformation process. This distribution of the workload is not required by the tool, but whenever it is needed, the flexibility is there. As database servers are added to the infrastructure, more data needs to be processed; new databases become available, with new engines that can be used. Additional transformations will have little impact – if any – on existing transformations. Indeed, a distributed architecture can easily handle an increase in the number of databases involved in the process. Conversely, with an engine-based architecture, all transformations are performed by the engine. As the number of databases and the volume of transformations grow, so does the load on the engine. The ETL tool quickly encounters hardware scalability issues and requires more resources (more memory, more processors), until it eventually requires additional software licenses so that the load can be split among multiple engines – without fixing the number one problem of proprietary ETL engines: transformations are still processed row by row. For example, inserting one million rows one by one amounts to sending one million insert commands to the database; in bulk, inserting one million rows can be a single insert command.

Delegates the load. A third generation ETL tool offers multiple performance benefits because it delegates the load. This white paper has already emphasized that this approach eliminates the bottlenecks common to all engine-based architectures, but the benefit goes beyond the new organization of the operations. For example, you can choose where the transformations will be performed. If you have to filter out data, you obviously want to perform this operation before you start transferring the data. And you want to define the filter rule directly in your development tool – not have to connect to the database and manually write a script that will not be managed from the GUI.

Supports all types of sources. An efficient third generation data integration tool lets you control all types of sources (such as databases, flat files, XML documents, LDAP servers, and messages from a middleware) through a single graphical interface. You can choose the technologies best suited to deliver the best possible performance for each transformation. Database servers from the same vendor can be connected directly to one another using the RDBMS vendor's native mechanisms (dblink, linked servers, etc.) – you can set up this configuration and monitor it from the ETL tool itself. Flat files can be loaded directly using native utilities (SQL*Loader, bulk insert, bcp, FastLoad, etc.); the necessary configuration files or control files are generated by the data integration tool as needed. One possible form of such a generated load is sketched below.
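As one illustration of what such a generated load might look like, the sketch below uses an Oracle external table, a SQL-level counterpart to a SQL*Loader control file; on other platforms the tool would generate the corresponding bcp, bulk insert, or FastLoad artifacts instead. The directory, file, and column names are hypothetical.

    -- Declare the flat file as an external table: this DDL plays the role of a
    -- loader control file (record format, field delimiters, file location).
    CREATE TABLE ext_customers (
        customer_id    NUMBER,
        customer_name  VARCHAR2(100),
        country_code   VARCHAR2(2)
    )
    ORGANIZATION EXTERNAL (
        TYPE ORACLE_LOADER
        DEFAULT DIRECTORY etl_src_dir
        ACCESS PARAMETERS (
            RECORDS DELIMITED BY NEWLINE
            FIELDS TERMINATED BY ','
        )
        LOCATION ('customers.csv')
    );

    -- Load the file into a staging table with one bulk, direct-path statement.
    INSERT /*+ APPEND */ INTO stg_customers
    SELECT customer_id, customer_name, country_code
    FROM   ext_customers;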
Such a product provides the best performance through its versatility, which ensures that the generated code is the most appropriate for the circumstances (SQL code, control files, etc.). Second generation tools either require loading the data into their engines through proprietary loaders, or require manual coding to take advantage of the database utilities – hence losing all traceability and centralization of the developments from the ETL tool.

Takes advantage of the architecture of the database itself. Since a third generation data integration tool delegates the transformations to the database engines, it is possible to take advantage of all the advanced features offered by the RDBMS. Options such as load balancing, clustering, and massively parallel processing (MPP) are increasingly available in modern RDBMSs. It is possible to set the appropriate parameters and options, and call the appropriate database utilities, to take full advantage of these features.
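For example, on an Oracle-based data warehouse, the generated code might enable parallel, direct-path processing for a large transformation, as sketched below; the degree of parallelism and the table names are hypothetical, and the equivalent settings differ on other platforms.

    -- Allow parallel, direct-path DML for this session.
    ALTER SESSION ENABLE PARALLEL DML;

    -- The target engine spreads both the read and the write across multiple
    -- parallel execution servers.
    INSERT /*+ APPEND PARALLEL(dwh_sales, 8) */ INTO dwh_sales
    SELECT /*+ PARALLEL(stg_orders, 8) */
           order_id, customer_id, amount_usd
    FROM   stg_orders;

    COMMIT;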
Flexibility

The flexibility of the architecture is a key component of the actual performance of data integration processes. It is imperative to be able to reorganize the architecture as dictated by the technical environment – the data integration tool should not dictate the architecture. Here are some typical choices that must be left open, because the optimal answer may differ for each project:
• Which database engine will be used for the transformations? Will it be the source database engine, the target, or another one?
• Is it acceptable to have processing data mixed in with production data?
• How easy is it to redefine the location of your databases?
• What happens to data integration developments if the physical address of a server changes or if a database name changes?
• How will the code be promoted from the development environment to the production environment?
Generated Code. The code generated by the ETL tool needs to be native SQL code. A tool that generates generic SQL limits itself to a small fraction of the capabilities of the RDBMS. As mentioned earlier, database vendors have extended SQL with their own transformation functions to the point where all required data transformations can be performed in SQL. A truly flexible tool must be able to take advantage of all of these functions.
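The MERGE statement is a typical example: generic SQL generation would fall back to separate lookup, UPDATE, and INSERT passes, whereas native code generation can express an incremental ("upsert") load as a single bulk statement. The table and column names below are hypothetical.

    -- Incremental load of a dimension table in one native, set-based statement.
    MERGE INTO dim_customer d
    USING stg_customer s
    ON (d.customer_id = s.customer_id)
    WHEN MATCHED THEN
        UPDATE SET d.customer_name = s.customer_name,
                   d.country_code  = s.country_code
    WHEN NOT MATCHED THEN
        INSERT (customer_id, customer_name, country_code)
        VALUES (s.customer_id, s.customer_name, s.country_code);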
Likewise, optimizing the generated code must be easy. Generating SQL is not enough to obtain good performance: the code generated by the ETL tool cannot be optimal in 100% of cases. It is thus important that optimizations of that code be easy to implement through re-usable components, so that any improvement can be carried over to new projects.

Platform. Not every database fits every need. Some industries or companies have a preferred RDBMS vendor; some applications require a specific technology. In this regard, it is important for the data integration tool to respect these choices – even to the point of letting the organization decide where the metadata and project repository are stored. This not only improves overall maintainability, but also enables organizations to take advantage of in-house expertise, guaranteeing better performance thanks to a more homogeneous environment. This is why Sunopsis does not impose any RDBMS technology in its architecture – even for its own repository.
Data Access Technologies

There are numerous ways to extract and load data – each RDBMS provides specific technologies and utilities in addition to the industry-standard drivers.

Drivers: JDBC vs. ODBC. A third generation ETL tool should support all drivers (JDBC Types 1, 2, 3, and 4, as well as ODBC). However, when available, a Type 4 JDBC driver is strongly recommended, because this class of drivers offers the best speed for remote database access. Typically, ODBC drivers and Type 2 JDBC drivers are installed on top of the database client connectivity stack, and the database client itself connects to the database. In this case, the increased number of layers on the client side slows down the data transfer and adds to the complexity of installation, configuration, and troubleshooting. Conversely, Type 4 JDBC drivers connect directly to the database listener, with no client software needed. This improves performance (fewer software layers process the information) and considerably reduces the complexity of installation, configuration, and troubleshooting.

Database Utilities. No matter which driver is used, there always comes a time when drivers do not provide sufficient performance. This is especially true for large-volume batch data transfers. In these cases, use of database utilities such as dblink and FastLoad is preferred. The third generation ETL tool automatically generates the appropriate information (command and/or parameter files) for these utilities and natively includes them in its loading strategies. There is no need to run the utilities outside of the tool or to manually generate or maintain information for them to run.
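For instance, an earlier sketch in this paper assumed an Oracle-style database link; creating one is itself a small piece of code that the tool can generate and manage, rather than leaving it as a manual, untracked script. The names and connection details below are placeholders.

    -- Created once on the target database; afterwards, remote tables can be read
    -- directly as table_name@src_link in generated statements.
    CREATE DATABASE LINK src_link
        CONNECT TO etl_user IDENTIFIED BY etl_password
        USING 'SRC_DB';   -- TNS alias of the source database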
Summary – Delivering on the Vision Today

Today's RDBMSs and SQL have the power to perform any data integration work. Third generation ETL tools take advantage of this power by generating native SQL code for the RDBMS engines involved in the data integration process and orchestrating the work of these systems – processing all data transformations in bulk. Sunopsis' product portfolio – and its success in the marketplace – is based on this approach. Sunopsis software supports more than 50 RDBMSs and all hardware platforms (from PCs to mainframes), and provides a powerful graphical interface. Sunopsis' more than 400 customers worldwide demonstrate that data integration can be completed without coding – with faster results and higher ROI. For a virtual product tour, more detailed product specifications, or a free trial copy of Sunopsis, visit www.sunopsis.com.
Boston, USA: +1 781 238 1770 / 1 888 740 0300, [email protected]
Bonn, Germany: +49 (0) 170 296 4124, [email protected]
London, UK: +44 (0) 870 351 4408, [email protected]
Lyon, France: +33 (0)4 78 33 92 20, [email protected]
Milan, Italy: +39 02 89307064, [email protected]
Paris, France: +33 (0)1 40 21 42 21, [email protected]
Rome, Italy: +39 06 5163139, [email protected]
Singapore: +65 6335 6139, [email protected]
Sunopsis. All rights reserved. The information contained in this document does not constitute a contractual agreement with Sunopsis and may be modified at any time without advance notice. WP-ThirdGenETL -1005-4.0