WHITE PAPER

Is SQL Becoming the Industry Standard Language for Data Integration?

BY JOHN RAUSCHER, SUNOPSIS

© Sunopsis. All rights reserved. The information contained in this document does not constitute a contractual agreement with Sunopsis and may be modified at any time without advance notice.

This year is set to be the year when a major shift will occur in the ETL market. Last year saw major RDBMS vendors gain momentum with new and cost-effective versions of their data integration / E-L-T (Extract, Load, Transform) software that generate SQL scripts. Major market analysts now consider data integration tools from RDBMS vendors to be a viable option for enterprise-wide solutions, even when dealing with a large number of developers or a large number of heterogeneous technologies.

Understandably, the largest players in the ETL market have based their product strategy and positioning on the claim that SQL is not properly suited to perform ETL tasks. That was true in the 90s. They have pushed proprietary languages and expensive software as the only solution for successful ETL projects. In reality, custom code is still involved in nearly two thirds of all data integration projects, because IT teams actually use the enhanced capabilities of SQL in the latest versions of their RDBMSs to extract data from databases, then load and transform it in the data warehouse server.

Which is the right approach? Is SQL becoming the industry standard language for data integration, pushing proprietary ETL languages into obsolescence?

The traditional engine-based ETL approach

For at least 10 years, most ETL software tools have been based on proprietary engines, and traditional ETL vendors brought a strong value proposition to the market. In the 90s, the most widely used versions of the major RDBMSs, such as Teradata, Oracle, Ingres, Informix, DB2, and SQL Server from Sybase and Microsoft, provided a reliable means for storing and accessing large volumes of data without having to define the physical path that enabled those actions. Nevertheless, the SQL dialects supported by the RDBMS releases of the 90s were not rich enough to handle the complex transformations required by customers who needed to build a data warehouse. Hence, during that decade, ETL software vendors were not able to use SQL and the RDBMS as engines to perform transformations. The only viable technical solution was to build costly proprietary-engine-based software that used complex proprietary languages to perform the ETL work. The cost of implementing traditional ETL software is so high, however, that only 25% of data warehouse projects have opted to use it.

As is typical in a growing market, proprietary solutions dominate at first, generally at a high price, until an industry standard emerges. When the industry standard is adequate to achieve what the majority of customers need, it becomes widely adopted, and the proprietary solutions disappear or must refocus on more specific niche markets. This is the scenario that will likely play out in 2005 and 2006, with SQL becoming the industry standard for data integration.

Architectures: Traditional ETL Solutions (left) and E-L-T based on Industry Standard Solutions (right)

Is the proportion of ETL projects hand coded using SQL increasing or decreasing?

Before covering the ability of SQL to handle data integration jobs, let's quickly review when IT teams choose hand coding rather than a traditional ETL tool. There are two ways of looking at this question: why would IT teams select hand coding for a data integration project, and why would they stop hand coding and start using a data integration tool?

The responses obtained when a large number of people involved in ETL projects were asked why they chose hand coding were not surprising. They generally opted for this solution for one of two reasons: 1) the initial scope of the project was limited, and the cost of purchasing a traditional ETL tool was too high in terms of both dollars and learning curve; or 2) they were burned in the past by the performance of a traditional ETL tool processing data row by row.

On the other hand, the main reasons people decided to stop hand coding and purchase an ETL tool were the following:

1- Productivity, or speed of development. Even though the SQL provided with major RDBMSs today is extremely powerful (up to 20 times richer than 10 years ago, with dedicated ETL features), SQL is still time consuming to write, and hand-coded projects are delivered late more often than not. Maintaining hand-coded ETL processes can be just as costly, especially when the code is passed to other developers (see point 3 below). Most ETL tools provide a dedicated development environment for data integration, including metadata management capabilities and a powerful graphical interface that allows one to develop five times faster than hand coding; moreover, objects can be re-used, code is often platform independent, templates can be designed for easier coding, and maintenance is managed more efficiently.

2- Sophistication of the ETL processes. Most hand coding is done with generic SQL, and many of the advanced features required of an ETL process are difficult to implement in manual scripts. Among these are:

- Data quality control: this is an integral part of data integration processes and may be one of the most important issues in data warehousing. Data quality control involves checking the integrity of data (primary keys, foreign key references, and user-defined business rules). Data quality is often overlooked when developing data integration processes because of the complexity of implementing it with manual coding (a sketch of such a check follows this list).

- Handling of heterogeneous data sources and targets, including performing joins on tables contained in different databases, and handling type conversion and concatenation/splitting of data between different source and target models. Heterogeneity creates extremely complex challenges, including the need to use different languages, to manage equivalence between platforms, and to build and load a staging area to perform the joins.

- Smart updates and slowly changing dimensions: loading and/or updating only the records that have been added or changed since the last iteration, or managing slowly changing dimensions in a data warehouse, is almost impossible to perform manually.

- Real-time data transfers: detecting changes on the sources and propagating those changes in real time or near real time to the target databases. Real-time transfers present extremely complex challenges for manual coding.
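
To make the data quality and smart update points above concrete, here is a minimal sketch, in generic SQL, of the kind of checks and incremental logic such a process involves. All table and column names (SRC_ORDERS, TGT_ORDERS, ERR_ORDERS, CUSTOMER, LOAD_AUDIT) are hypothetical and used only for illustration.

    -- Data quality control: route source rows whose CUSTOMER_ID has no
    -- matching parent row to an error table instead of the warehouse.
    INSERT INTO ERR_ORDERS (ORDER_ID, CUSTOMER_ID, ERR_MESSAGE)
    SELECT s.ORDER_ID, s.CUSTOMER_ID, 'Unknown customer'
    FROM   SRC_ORDERS s
    WHERE  NOT EXISTS (SELECT 1 FROM CUSTOMER c
                       WHERE c.CUSTOMER_ID = s.CUSTOMER_ID);

    -- Smart update (simplified to inserts): pick up only rows stamped after
    -- the last load and skip the rows isolated by the check above; changed
    -- rows would additionally need an UPDATE or a merge step.
    INSERT INTO TGT_ORDERS (ORDER_ID, CUSTOMER_ID, ORDER_DATE, AMOUNT)
    SELECT s.ORDER_ID, s.CUSTOMER_ID, s.ORDER_DATE, s.AMOUNT
    FROM   SRC_ORDERS s
    WHERE  s.LAST_UPDATE_DATE > (SELECT MAX(LOAD_DATE) FROM LOAD_AUDIT)
    AND    EXISTS (SELECT 1 FROM CUSTOMER c
                   WHERE c.CUSTOMER_ID = s.CUSTOMER_ID);

Writing, scheduling, and maintaining dozens of such statements by hand is exactly the workload a data integration tool is meant to automate.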

3- Maintenance. Using manual coding and SQL creates serious obstacles for code maintenance. No proactive error management and no graphical interface for error analysis and debugging are available when scripts are hand coded. Conversely, metadata management helps quickly identify where enhancements need to be made. Using a tool also helps standardize the quality of the generated code and ensures that unique circumstances are addressed, by allowing easy additions and modifications to the generated code.

4- Security issues. Manual coding requires the management of user privileges on all the servers used for development, generally without a central view of these rights and privileges: operating system, databases, and remote access all have to be administered independently. Developing these security features is itself a major project. Organizations that develop their own code by necessity are not interested in becoming software development organizations, with all the associated costs.

5- Project management. Hand coding does not inherently enable a project manager to easily and accurately follow up on development progress. There is no "30,000 foot" overview, just thousands of lines of code. ETL software provides a central repository and tools to query it for analysis and reports. It also gives a team the ability to easily back up work for contingency purposes.

6- Exception handling. Hand-coded projects typically do not take error handling into account, so time-consuming yet critical processes to identify and handle exceptions are often omitted. A good ETL tool should isolate invalid rows, document the errors, and, if needed, automate the processing of the error.

Nevertheless, the main obstacle for an IT team moving from hand coding to an ETL tool is the flexibility the developer loses when using a tool. IT teams that developed their own data integration scripts with custom code need to preserve some of what has been developed while keeping the flexibility they previously enjoyed. This is typically a real challenge, since traditional ETL tools do not offer the same capabilities with their proprietary languages. Only an E-L-T tool based on a native SQL code generator technology provides the flexibility required to move from hand coding to automatic code generation without throwing away all previous developments. E-L-T software also allows organizations to keep the high level of performance generally achieved with manual coding: both rely on bulk data processing executed by the RDBMS.

Will RDBMS vendors become the new dominant players with their data integration tools / SQL code generators?

A few years ago, RDBMSs and their associated SQL dialects did not have the functions required to perform data integration processes and their associated complex transformation rules. They could not efficiently join their data sets. Now this is no longer true. Since the mid 90s, at the end of the marketing battle (fought mainly by Oracle and Sybase) over compatibility between their SQL dialects and the SQL-89/SQL-92 ISO standards, RDBMS vendors have focused their engineering efforts on enhancing the functionality of their SQL and improving the performance and reliability of their engines. In the last ten years, major RDBMS vendors have increased the number of SQL features available to programmers by a factor of ten to twenty.

For example, Oracle's Quick Reference Guide grew from two pages for version 7.3 to 40 pages for 9i, and the feature sets of IBM, Teradata, Microsoft, and Sybase grew similarly. Some vendors have even added advanced analytic features, capabilities that traditional ETL tools do not offer within their proprietary languages.

Indeed, many new functions have been added to most RDBMS solutions. For instance, the CASE...WHEN statement (equivalent to an IF...THEN...ELSE) can be used for complex transformation rules. Outer joins (LEFT OUTER, RIGHT OUTER, or FULL JOIN) can be used to easily join data sets in a variety of ways. Ranking and windowing functions (MIN OVER PARTITION, MAX OVER PARTITION, LEAD, LAG, and RANK) allow for more effective handling of large volumes of data (see the example query after the list below). RDBMS vendors now provide a long list of out-of-the-box features and utilities that enable impressive performance when doing ETL. Some features are dedicated to loading data from sources to targets with high performance in mind, while others process various data formats such as XML files. These are just examples of what can be done with the SQL and RDBMS solutions available today. Using these utilities, data integration tools can achieve the best possible performance with the RDBMS, taking advantage of parallel processing, bulk operations, and other capabilities that are fully integrated with native SQL code.

Indeed, most RDBMS vendors have started to leverage the power of this enhanced SQL and now include data transformation software in the server package, seriously challenging the market traditionally owned by independent ETL software vendors. The data transformation tools provided by RDBMS vendors have improved to the point where they can enter the enterprise data integration market. Such tools use the RDBMS as an engine to perform transformations and aggregations. Their distributed approach enables the execution of a job to take place on the source, target (most of the time), or staging servers, wherever the processing is most efficient given the IT architecture. These light data integration tools are true SQL generators: they actually do E-L-T (Extract, Load, Transform) and prove that today's SQL is sufficiently powerful to perform all of the work needed to integrate data, even when complex transformations are involved. They are limited, however, and their limitations are clearly linked to three factors inherent in the nature of these products:

- Some RDBMS vendors have only recently entered the data integration market, and the user interfaces of the data integration/data movement tools they provide still lack the features needed to significantly improve development productivity over manual scripting. These tools still require a large amount of manual coding, even though there are improvements with every new version.

- Since RDBMS vendors sell their data integration/data movement tools at a low price or even give them away (Microsoft), they are not able to commit the engineering resources required to improve their products. Developers are often disappointed by the limited improvements each successive version provides. Significant differences in features exist between those embedded tools and those developed by third-party E-L-T vendors, especially in terms of productivity and technical capabilities.

- RDBMS vendors clearly position their data integration tools as a way to sell more RDBMS licenses and deliberately limit compatibility with other databases. They often charge significant additional fees for underperforming "gateways" to access other RDBMSs as sources, and the performance of the generic SQL used in that case is far from optimized. Both are critical issues for data integration, where large volumes of data need to be transferred. A distributed architecture is not possible with the data movement tools of RDBMS vendors, since the generated code can only be executed on the vendor's own RDBMS.
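
Returning to the enhanced SQL described above, the query below is a minimal sketch of how a CASE...WHEN expression, an outer join, and a windowing function can express a transformation rule directly in SQL. The SALES and REGION tables and their columns are hypothetical, used only for illustration.

    -- CASE...WHEN implements a transformation rule, the LEFT OUTER JOIN keeps
    -- sales rows that have no matching region, and RANK() OVER (PARTITION BY ...)
    -- ranks each sale within its region by amount.
    SELECT s.SALE_ID,
           r.REGION_NAME,
           CASE WHEN s.AMOUNT >= 10000 THEN 'LARGE'
                WHEN s.AMOUNT >= 1000  THEN 'MEDIUM'
                ELSE 'SMALL'
           END AS AMOUNT_BAND,
           RANK() OVER (PARTITION BY s.REGION_ID
                        ORDER BY s.AMOUNT DESC) AS RANK_IN_REGION
    FROM   SALES s
    LEFT OUTER JOIN REGION r
           ON r.REGION_ID = s.REGION_ID;

Because such a statement runs inside the RDBMS, it benefits from whatever optimizer and parallel execution features the engine provides.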

Facing these new competitors, traditional ETL vendors have started to include some SQL as a possible development language for performing the required data integration. They have, however, limited that SQL to the capabilities of their proprietary engines and created their own syntax (which is not the same as the SQL offered by the major RDBMSs), based on the most generic, and thus least powerful, functionality, in order to be compatible with all RDBMS solutions. In contrast, a new generation of powerful third-party SQL code generators based on an E-L-T architecture has emerged. With the power and reliability of today's RDBMSs, IT teams strongly believe it makes sense to use the RDBMS already in place to perform the data integration work, and IT managers want third-generation SQL code generators that can be used regardless of the RDBMS involved in a project. The maturity of the market for third-party SQL code generators for enterprise projects is the second major shift occurring today.
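
To illustrate the difference between generic and native SQL, consider one simple transformation, formatting a date as a YYYY-MM-DD character string, written in two dialects. The ORDERS table is hypothetical; the functions shown are standard in the dialects named, and neither form is portable to the other dialect.

    -- Oracle: native date formatting with TO_CHAR
    SELECT TO_CHAR(ORDER_DATE, 'YYYY-MM-DD') AS ORDER_DAY FROM ORDERS;

    -- Microsoft SQL Server: the same result with CONVERT and style 120
    SELECT CONVERT(VARCHAR(10), ORDER_DATE, 120) AS ORDER_DAY FROM ORDERS;

A generator restricted to lowest-common-denominator syntax cannot use either form; a generator that emits native SQL for each target RDBMS can.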

[Figure: Different feature sets for different types of technologies. A positioning chart with one axis running from Proprietary Technologies to SQL Code Generators and the other from Basic Features to Advanced Features; the major players' proprietary engines and the niche players sit on the proprietary side, while ETL tools from RDBMS vendors (basic features) and Sunopsis (advanced features) sit on the SQL code generator side.]

The strong benefits provided by the SQL code generator approach

Now that SQL is rich enough to perform data integration work, vendors of SQL code generator software offer compelling reasons for adopting their innovative approach:

- The learning curve for this approach is very short because most of the generated code is SQL, which is known by most developers and can be reviewed quickly. There is no need to train a team on a proprietary language. Most often, developers doing mappings and transformations with a SQL code generator use the graphical interface and do not need to know SQL; only an administrator who wants to review the code before putting it into production needs to look at the generated SQL.

- Using a SQL generator means there is no need to add any software or hardware to the production IT environment. The code is executed by the existing components of the IT infrastructure, i.e., the RDBMS.

- Performance is unbeatable: SQL and the RDBMS allow a bulk approach to data processing, which is up to 100 times faster than the row-by-row processing tied to proprietary ETL engines (a sketch of such a set-based statement follows this list).

- The code generator produces native SQL code for the RDBMS without the time-consuming process of manual coding, and gives developers the same flexibility they had with custom code. They can reuse some of the work they did previously, such as stored procedures and stored functions.

- The cost of SQL code generators is lower than that of traditional ETL software based on a proprietary language and an engine, most often installed on a dedicated server. SQL code generators are less expensive because the vendor does not have to develop a transformation engine supporting complex features such as parallel processing: the engine is the RDBMS the customer has already bought, most often the one installed on the target/data warehouse server.

- There is no need to buy a new server to perform the ETL processes. The SQL generator approach allows IT managers to leverage all previous investments made in hardware, software, and training. Moreover, the load can be distributed wherever CPU resources are available: some types of work can be done on the sources, while other types can be performed on the target. For example, when joining two tables, it may be a good idea to execute the join on the source instead of transferring the tables over the network to another server to execute it.

- SQL generators offer tremendous flexibility in the data integration process. Whether some transformations need to be performed on the sources or on the targets (E-L-T: Extract, Load, and Transform), or any combination thereof, the approach is not forced on the user by a proprietary engine and an architectural choice made by the vendor.

- Third-party SQL-based data integration software vendors provide comprehensive support for the most widely used RDBMSs on the market. Within minutes, one of their tools can generate an E-L-T scenario to be performed on DB2 one day and regenerate the same scenario for use on Teradata the next day; the tool generates the native SQL for the specific RDBMS each time.

- Because the RDBMS processes data in bulk, there is no performance penalty in including in the generated code the same constraints as those defined in the RDBMS of the target data warehouse (captured through automatic reverse engineering) to achieve automatic data quality control. Invalid data is never loaded into the data warehouse.

- The high level of performance obtained with today's RDBMSs is unquestionable and clearly supports the SQL generator approach. In addition, IT teams prefer to rely on their own DBAs to fine-tune their RDBMS instead of hiring high-priced consultants from a traditional ETL software vendor to tune a proprietary engine.

- Finally, the ETL process takes full advantage of any features supported by the RDBMS itself. If an RDBMS supports massively parallel processing or clustering, for example, then the SQL code generated for the data integration process benefits from it as well. There is no longer any concern about the ability of ETL software vendors to support new platforms, since capabilities are tied to those of the RDBMS used to execute the generated code.
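
As a concrete illustration of the bulk, set-based processing referred to in the performance point above, a single INSERT ... SELECT executed by the target RDBMS can join, transform, and load the data in one pass instead of pushing rows one by one through an external engine. This is a minimal sketch; the staging and warehouse tables (STG_SALES, STG_REGION, DW_SALES_BY_REGION) and their columns are hypothetical.

    -- One set-based statement run inside the target RDBMS: the join, the
    -- aggregation, and the load all happen in bulk, not row by row.
    INSERT INTO DW_SALES_BY_REGION (REGION_NAME, SALE_MONTH, TOTAL_AMOUNT)
    SELECT r.REGION_NAME,
           s.SALE_MONTH,
           SUM(s.AMOUNT)
    FROM   STG_SALES  s
    JOIN   STG_REGION r ON r.REGION_ID = s.REGION_ID
    GROUP BY r.REGION_NAME, s.SALE_MONTH;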

Summary

The success of data integration tools from RDBMS vendors has definitively changed the landscape of the data integration market, providing a strong push to SQL as the industry standard language for data integration. This offering proves that RDBMSs and SQL have the power to perform any type of data integration process and that SQL code generators are going to be the foundation of coming generations of enterprise data integration software. These tools have paved the way for much more powerful third-party SQL code generator products such as Sunopsis, which allow complex data integration jobs to be performed in batch or in real time, without any coding.

This important shift in the data integration market becomes apparent as SQL becomes the industry standard language for data integration and E-L-T the right architecture for enterprise data warehousing. Sunopsis' product portfolio, and its worldwide success, is based on this approach. Sunopsis software supports more than 50 RDBMSs and all hardware platforms (from PCs to mainframes) thanks to its Java architecture, and provides a very powerful, easy-to-use graphical interface. Sunopsis' 400+ customers worldwide demonstrate, through the data integration work they perform, that large-scale data integration can be completed without coding, with faster results and higher ROI. For a 5-minute virtual product tour, more detailed product specifications, or a free trial copy of Sunopsis, visit www.sunopsis.com.

Boston, USA: +1 781 238 1770 / 1 888 740 0300, [email protected]
London, UK: +44 (0) 870 351 4408, [email protected]
Lyon, France: +33 (0)4 78 33 92 20, [email protected]
Milan, Italy: +39 02 89307064, [email protected]
Paris, France: +33 (0)1 40 21 42 21, [email protected]
Rome, Italy: +39 06 5163139, [email protected]
Singapore: +65 6335 6139, [email protected]
