Oracle9i Warehouse Builder, Integrated Data Quality An Oracle White Paper July 2003
Oracle9i Warehouse Builder, Integrated Data Quality
Introduction........................................................................................................3 What Is Data Quality?.......................................................................................4 Data Quality is Critical......................................................................................5 Integrating Data Quality Into ETL.................................................................5 Data Quality in Oracle9i Warehouse Builder ................................................6 Name and Address Cleansing .....................................................................6 Sophisticated matching and merging..........................................................7 Overview of the Name and Address Cleansing Process.........................8 Architecture ...............................................................................................9 Improving Name and Address Cleansing ...........................................10 Overview of the Match/Merge Process ..................................................11 Match Rules.............................................................................................13 XREF Group ..........................................................................................21 Merge Rules .............................................................................................22 Conclusion ........................................................................................................23
Oracle 9i Warehouse Builder, Integrated Data Quality Page 2
Oracle9i Warehouse Builder, Integrated Data Quality
INTRODUCTION
Today, more companies realize the need for Business Intelligence and establishing customer relationships to gain competitive advantage. Regardless of industry, there is a universal foundation for all customer-focused initiatives – customer data. Using customer data, businesses can gain a clear picture of customers, analyze their buying patterns, and predict future sales. For an accurate view of customers, companies are turning to solutions such as data warehouses or data marts to help identify and manage their customer relationships. Data quality is one of the biggest obstacles blocking the success of many data warehousing projects. The Data Warehousing Institute estimates that data quality problems cost U.S. businesses more than $600 billion a year. Customer data usually reside in multiple disparate sources and needs to be consolidated. Data must be filtered for unreasonable values and transformed to match other related data. During this process, issues of data quality become even more pronounced. Data of unknown quality is fundamentally untrustworthy and represents a significant risk to the successful completion of any data integration or migration effort. Therefore, a data warehouse or data mart initiative cannot deliver a tangible Return On Investment (ROI) unless the data within it is reliable and accurate. This paper will answer the questions: What is data quality? Why put an effort into data quality? Why is this effort most efficient inside ETL (Extraction, Transformation and Loading) process? How will Oracle9i Warehouse Builder make this effort successful? You will discover how Oracle9i Warehouse Builder combines ETL functionality with data cleansing capabilities such as parsing, correction, standardization and Match/Merge of data elements. With Oracle9i Warehouse Builder, data warehousing managers are ensured of the loading accurate data into a data warehouse as well as effective management and maintenance of the customer data foundation.
Oracle 9i Warehouse Builder, Integrated Data Quality Page 3
WHAT IS DATA QUALITY?
Data quality is an all-encompassing term describing both the state of data that is complete, accurate and relevant as well as the set of processes to achieve such a state. The goal is to have data free of duplicates, misspellings, omissions and unnecessary variations and to have the data conform to the defined structure. Simply put, the data quality addresses the problem cynically but precisely summed in “garbage in – garbage out”. The data quality processes are not utilized just during the population of data warehouse, they are also employed before the warehouse is even designed and after the warehouse is loaded. A significant part of data quality deals with customer data – names and addresses, due to both their key role in business processes and their highly dynamic nature. But data quality is certainly not limited to names and addresses. Any data will benefit from being standardized and complete. Names and addresses usually contain dirty data, since they often include nicknames, abbreviations, misspellings, bad fielding etc. Often companies rely upon name and address information to match and merge data. A common unique identifier is typically not available across systems and because of the anomalies in data sources, it is crucial to cleanse and normalize it. Consequently, whatever data is available must be used to determine if different individuals, businesses or other types of records are actually the same. Names and addresses are ubiquitous—they tend to exist in almost every source and are often the only identifying data. Therefore, most matching applications rely heavily on names and addresses. Sophisticated name and address processing can solve these problems and have a huge effect on matching. The following figure shows that in order to get a correct view of customer data in the data warehouse, the source data must be cleansed, normalized, matched and merged.
Figure1 - Complete and accurate view of customers
Oracle 9i Warehouse Builder, Integrated Data Quality Page 4
DATA QUALITY IS CRITICAL
There are many challenges in building the data warehouse and consequently many areas of importance for the tools that assist with building the data warehouse. Is it easy to build and maintain the warehouse (productivity and maintainability)? Can all required sources of data be brought into the warehouse (source integration)? Can data warehouse continue to add data within the allowed time periods and can it accommodate the growth (performance and scalability)? However, to the end users of the data warehouse making the business decisions based on the data in data warehouse the most important question is “Can I trust this data?” It does not matter if the warehouse was build in record short time, it practically maintains itself, every imaginable source system is in, the loads are blazingly fast and are only getting faster. If the business users find the data in the data warehouse not to be trust-worthy, they will not use it to make their business decisions. All the impressive achievements in those other areas will then be meaningless. INTEGRATING DATA QUALITY INTO ETL
Having complete, accurate and relevant data free of all defects sounds great on paper but is never entirely achieved in reality. Instead, each organization defines, formally or informally, the acceptable level of data quality, a certain threshold that can be measured. Just as impressive as the figures that show the losses to businesses due to poor data quality, are the figures estimating how much is spent on trying to address the data quality issues. Therefore, it is not surprising that the true goal for businesses is not the absolute data quality in its academic sense but the acceptable level of data quality achieved efficiently in terms of time, effort and money. Name and address cleansing is a specialized function traditionally handled by data cleansing vendors. The advent of data warehousing and especially the importance of customer data have created a need for the name and address cleansing functionality to be incorporated into ETL tools. Today many ETL tools treat the name and address cleansing as a separate option that requires users’ knowledge of multiple tools. The key difference of data quality in Oracle9i Warehouse Builder is the level of integration into the ETL process that makes data quality transformations an inseparable function of data warehouse development. Built-in, same interface for design, same interface for management. To understand just how important this is, consider the example of spell checker in a word processor. It would be unacceptable to anyone to close the document, start a spell checker application, process and correct the document, then re-open it in word processor. Even if the spell checker has a great user interface of its own, is fast and is otherwise pleasant to work with. Why not just invoke the spell checker from within a word processor? Better yet, why not correct the mistakes as the document is created? Of course there is no such problem in modern word processors. Spell-checkers have long been integrated and work seamlessly. Another interesting parallel to note is that it’s
Oracle 9i Warehouse Builder, Integrated Data Quality Page 5
fairly unimportant to the user of a word processor that the actual spell checking technology is provided by a third-party vendor. Oracle9i Warehouse Builder fully integrates ETL, data warehousing, name/address cleansing and householding capabilities in a single tool. The following features are key to rapidly building an effective, enterprise-wide customer data foundation suitable for CRM and Business Intelligence: Complete Data Consolidation. Customer data is fragmented and can be found in many internal and external sources. This data needs to be consolidated to minimize divergent data entry practices. Oracle9i Warehouse Builder provides an extensible framework and productive GUI for designing and deploying data warehouses and data marts, facilitating: •
Design: Graphical support for dimensional and relational modeling;
•
Extraction: Move data from relational, legacy, ERP and other data sources into the target data warehouse;
•
Deployment: Generation and population of warehouse and marts (taking advantage of the Oracle database features)
•
Management: Metadata-driven, integration with other Business Intelligence tools, Oracle Enterprise Manager and Oracle Workflow.
Advanced Name and Address processing. Name and Address ensures accurate data for individual names, firm names and locations. Cleansing includes address parsing, validation, correction, standardization and augmentation. Cleansed names and addresses greatly improve the process of matching and merging customer data. Sophisticated matching and merging. Powerful fuzzy matching capabilities identify unique customers, businesses, households or other entities for which common identifiers may not exist. Match rules can be created through an intuitive user interface. Consolidation of match results is automated to create one view of the customer. Lifecycle Management. Complete data warehouse management by efficiently managing all aspects of the data warehouse lifecycle. Pre-built reports, impact analysis and data lineage reports provide a graphical way to identify the impact of changes in your source or target system. Lifecycle support and management ensures information accuracy via synchronization of the data warehouse with the metadata source in Oracle9i Warehouse Builder. DATA QUALITY IN ORACLE9I WAREHOUSE BUILDER Name and Address Cleansing
In Oracle9i Warehouse Builder, name and address cleansing is a built-in transformation. During cleansing, individual names or organization names and/or addresses are:
Oracle 9i Warehouse Builder, Integrated Data Quality Page 6
•
Parsed into individual elements for improved correction and matching
•
Standardized which involves modification of components to a standard version acceptable to a postal service or suitable for record matching. Address components will be converted into a consistent representation; for example, East can be converted to E., and Str or Street can be converted to St. Nicknames and organizational names can be transiently standardized as well to improve matching. Standardization of components include: - Street and city misspellings/improper abbreviations - Directional identifiers: N, S, W, NW - Street types: ST, AVE, RUA, PSO, and JLN - State/province/neighborhood - Business keywords (INC, Corp) - Standardized version of first and middle names as separate components •
Validated and Corrected by using a database of postal data for address components (such as street name, city and zip code). Only addresses will be validated or corrected, names are standardized
•
Augmented, by adding new data elements for mailing, better target marketing or other purposes. Examples are ZIP+4, probable gender, consumer versus business record identification, county codes, and geocodes (including latitude and longitude)
For example, for the original Source data of: Sue Smith|500 Oracl Parkway|Redwood Shores CA The end result after parsing, standardization, correction/validation and augmentation would be include the following components: Sue|Susan|Smith|500|Oracle|Pkwy|Redwood City|CA|94065|1675 Sophisticated matching and merging
In Oracle9i Warehouse Builder, Match/Merge is also a built-in transformation and integrated as part of the ETL functionality. Powerful deterministic and probabilistic matching is provided to identify unique consumers, businesses, households, or other entities for which common identifiers may not exist. Exceptional matching effectiveness can be achieved because any combination of attributes and relationships - name, address, email address, phone number, etc.- may be used to qualify a match. Examples of match criteria include: Similarity scoring to account for typos and other anomalies in the data, Soundex to identify data that sounds alike, Abbreviation and Acronym matching, etc.
Oracle 9i Warehouse Builder, Integrated Data Quality Page 7
Match rules are specified and managed through intuitive wizards. Like any other operator in Oracle9i Warehouse Builder, Match/Merge can also be accessed through an extensive scripting language. Merging consolidates the matched data, including all transactions, service history and other data is associated with a customer, to provide a complete unified view of the customer for the entire organization. Unlike mailing tools, which are designed to drop all but one record from a match set, match results are intelligently consolidated into a unified view based on rules you specify. The user is able to specify the rules for consolidating (merging) the match results to create the best possible final result. Oracle9i Warehouse Builder supports a variety of merge rule strategies, including custom rules, a variety of approaches for resolving inconsistencies that often occur in matched data. Examples include: Ranking sources by order of preference, any non-null value, etc. The resolution of relationships in the merge process is automated to ensure the integrity of the integrated result. The merge rule framework assembles the “best possible” integrated result. This provides significant timesavings over traditional handcoding approaches and dramatically reduces maintenance complexity. Figure2 illustrates that effective Customer-Centric initiatives rely on Name and Address cleansing as well as householding and de-duplication.
Figure 2 - Data quality for all customer-centric applications
Oracle9i Warehouse Builder’s unique data integration and consolidation architecture provides unmatched performance and scalability for creation and maintenance of customer-focused databases.
Overview of the Name and Address Cleansing Process
Oracle9i Warehouse Builder features a Name and Address Mapping Operator that works with included Warehouse Builder Name & Address Server component to support parsing, standardization, postal matching, and geocoding of name and address data.
Oracle 9i Warehouse Builder, Integrated Data Quality Page 8
Architecture
The Name & Address engine resides on the database server, as does the ETL engine for Oracle9i Warehouse Builder. Warehouse Builder Name & Address Server provides an open architecture allowing multiple third-party name/address providers to “plug-in” to Oracle9i Warehouse Builder. Name/Address Server communication is socket based. Sockets were chosen for higher performance. Handling of parse requests and parse results over the socket is done via messaging.
Figure 3 – Architecture
At the design time Name and Address operator is used in a mapping to model Name and Address rules inside Oracle9i Warehouse Builder (see screenshot in Figure 4). At runtime, the deployed Name and Address transformations transparently access parsers and data provided by the third-party vendors. The process of designing and running Name and Address transformation is always the same, regardless of which third-party provider (or even multiple providers) are used.
Oracle 9i Warehouse Builder, Integrated Data Quality Page 9
Figure 4 – Name and Address Operator in Oracle9i Warehouse Builder
Many countries have postal data libraries available from their postal service organizations. The third-party Name and Address providers incorporate such data libraries into their offerings. When postal data is available address verification and correction is most reliable, often to the street level. The list of supported countries differs depending on the provider. The most up to date list of providers and the countries they support is available on Oracle9i Warehouse Builder product page at otn.oracle.com. Other third-party vendor specific capabilities are Geocoding (census and locational data such as latitude and longitude) and postal certification reports: USPS CASS for United States, SERP for Canada, AMAS for Australia. Improving Name and Address Cleansing
Discrete or semi-discrete input roles yield better parsing results by removing ambiguity problems the parser may encounter. An input of “Electric Avenue” with a role of line1 is ambiguous – it could be an electrical appliance business or a street name. If the same input has a role of firm or primary address, the ambiguity is resolved. Input roles can be selected from a list of predefined role types that vary depending on the selected parsing option. An input role specifies the type of data that name/address input item represents. Based on the specified Input Roles, the name/address parser will analyze and standardize the input data and return the requested Output Components. These components are defined by adding output parameters, then assigning an Output Component type to each parameter using the associated predefined list. The Output Name, Output Address and Output Status are what the user maps to targets.
Oracle 9i Warehouse Builder, Integrated Data Quality Page 10
Several status flags are provided to help determine the quality of parsing and postal matching. Generally, these flags should be utilized to determine the final quality of a record. High quality records may be mapped to a destination data target, while other records should be mapped for manual quality checks and correction. The splitter operator in the Mapping Editor can be used to achieve this. The Boolean flags specify the degree to which an input record could be matched to postal directories. Is Good Group, Is Parsed, Is Found, etc. are examples of such flags.
Figure 5 – Input Roles and Output Components in Oracle9i Warehouse Builder
Figure 5 shows how an input name and address would be parsed into the abbreviated list of name and address components. Based on the user defined input roles and output components, Oracle9i Warehouse Builder generates the necessary PL/SQL code. This code is then passed to the name and address server for processing. Overview of the Match/Merge Process
Oracle9i Warehouse Builder features a Match/Merge Mapping Operator that supports matching and consolidation of the name and address data including householding. Mapping wizards assist users through the Match/Merge operator setup. The Match/Merge Operator will accept records from an input source, determine the records that are logically the same, and construct a new “merged” record from the “matched” records. The operator has one “ingroup” and two “outgroup” parameters.
Oracle 9i Warehouse Builder, Integrated Data Quality Page 11
INPUT INGROUP: Map the input records to this group. OUTPUT MERGE: Attributes in this group will be merged. The operator will match and merge the input record set to provide the merged records. Any records that have been matched will be merged into a single merge record. XREF: Cross Reference (XREF) provides a record of the merge process. Every record from the ingroup will construct a record in the XREF outgroup (1-1 cardinality with the ingroup).
Figure 6 – Match/Merge Operator
The Match/Merge operator has an ordered record stream as input. From this stream, it will construct the match bins based on the defined match bin keys. From each match bin, it will construct matched sets based on defined match rules. From each matched set, it will construct a merged record based on defined merge rules. The initial query will contain an order clause consisting of the match bin keys. Constructing Match Bins Match bins are subsets of data on which matching is performed. The match bins should be constructed carefully to fulfill two conflicting needs. One is to make sure any records that should match reside in the bin, and second to keep the size of the bins as small as possible. A small possible match record set is desirable for efficiency. Constructing Match Record Sets Each match bin will generate one or more match record sets. The match rules simply determine if two records match. The Match rules will be applied to all
Oracle 9i Warehouse Builder, Integrated Data Quality Page 12
records in the possible match set to determine the match set. The algorithm is transitive, i.e. given records A, B, C with A = B and B = C, A will be considered to be equal to C without making a comparison. Constructing Merge Records A single merge record is constructed from the match record set. Users may create specific rules for defining merge attributes through the merge rules.
Figure 7 – Match/Merge Rule Diagram
Match BIN Attributes Bin keys limit the possible matches to manageable sets. The fuzzy match algorithm will compare each record in the bin to every other record in the set. This total number of comparisons is therefore (N*(N-1))/2. It is necessary to select a bin key or keys to limit the number of records in the bin. Match Rules
Match rules determine if two records are logically the same and are used to create the matched record set. A match rule may be active or passive. A passive match rule is generated but not automatically invoked. All Active match rules are OR’d together to determine if two records match. At least one active match rule must be defined Match Rule types are as follows: All Records Match, No Records Match, Conditional, Weight, Person, Firm, Address, and Custom.
Conditional Conditional Match rules specify the conditions under which records match. Conditional match rules allow you to combine multiple attribute comparisons into one composite rule. If more than one attribute is involved in the rule then all the comparisons must be true for the record to be considered matched. That is, the attribute comparisons are AND'ed together. Comparison Algorithms: Each attribute in the rule is assigned a comparison algorithm, which specifies how the attribute values are compared. Multiple
Oracle 9i Warehouse Builder, Integrated Data Quality Page 13
attributes may be compared in one rule with a separate comparison algorithm selected for each. The types of comparisons are: · Exact - The attributes match if their values are exactly the same. For example, "Dog" and "dog!" would not match, because the second string is not capitalized and contains an extra character. For data types other than String, this is the only type of comparison allowed. · Standardized Exact - The values of the attribute are standardized before being compared for an exact match. With standardization, the comparison ignores case, spaces, and non-alphanumeric characters. Using this algorithm, "Dog" and "dog!" would match. · Similarity - For this type of comparison, a "similarity score" in the range 0100 is entered. If the similarity of the two attributes is equal or greater to this value, then the attribute values are considered matched. The similarity algorithm computes the edit distance between two strings. A value of 100 indicates that the two values are identical; a value of zero indicates no similarity whatsoever. For example, if the string "tootle" is compared with the string "tootles", then the edit distance is 1. The length of the string "tootles" is 7. The similarity value is therefore (6/7)*100 or 85. · Standardized Similarity - The values of the attribute are standardized before using the Similarity algorithm to determine a match. With standardization, the comparison ignores case, spaces, and non-alphanumeric characters. · Soundex - This converts the data to a Soundex representation and then compares the text strings. If the Soundex representations match, then the two attribute values are considered matched. · Partial Name - The values of a string attribute are considered a match if the value of one entire attribute is contained within the other, starting with the first word. For example, "Midtown Power" would match "Midtown Power and Light," but would not match "Northern Midtown Power". The comparison ignores case and non-alphanumeric characters. · Abbreviation - The values of a string attribute are considered a match if one string contains words that are abbreviations of corresponding words in the other. Before attempting to find an abbreviation, this algorithm performs a Std Exact comparison on the entire string. The comparison ignores case and nonalphanumeric character. For each word, the match rule will look for abbreviations, as follows. If the larger of the words being compared contains all of the letters from the shorter word, and the letters appear in the same order as the shorter word, then the words are considered a match. For example, "Intl. Business Products" would match "International Bus Prd".
Oracle 9i Warehouse Builder, Integrated Data Quality Page 14
· Acronym - The values of a string attribute are considered a match if one string is an acronym for the other. Before attempting to identify an acronym, this algorithm performs a Std Exact comparison on the entire string. If no match is found, then each word of one string is compared to the corresponding word in the other string. If the entire word does not match, each character of the word in one string is compared to the first character of each remaining word in the other string. If the characters are the same, the names are considered a match. For example, "Chase Manhattan Bank NA" matches "CMB North America". The comparison ignores case and non-alphanumeric characters.
Figure 8 – Match Rules
Weight Probabilistic matching is implemented in Oracle9i Warehouse Builder with the match rule of type Weight. A weighted match rule allows you to assign an integer weight to each attribute included in the rule. A threshold must also be specified equaled or exceeded for the data to be considered matched. For each attribute, the Match/Merge operator multiplies the weight by the similarity score, and sums the scores. If the sum equals or exceeds the threshold, the two records being compared are considered a match. A weighted match rule is most useful when comparing a large number of attributes, without having a single attribute that is different causing a non-match, as can happen with conditional rules.
Oracle 9i Warehouse Builder, Integrated Data Quality Page 15
Weight rules implicitly invoke the similarity algorithm to compare two attribute values. This algorithm returns an integer, percentage value in the range 0-100, which represents the degree to which two values are alike. A value of 100 indicates that the two values are identical; a value of zero indicates no similarity whatsoever. Example of a Weight Match Rule: Assume the following attribute values are contained in three separate records and were read in the following order: Attribute 1
Attribute 2
Record 1
CA
QQ
Record 2
CA
QQ
Record 3
CA
QR
The match rule is defined as follows: Attribute_1: Weight 50, Attribute_2: Weight 50, required score for match: 80 Record 1 is the initial record read. Comparing Record 1 to Record 2: Record 2 has a value for Attribute_1 of CA. That value has a similarity of 100 with the value in Record 1. The weight value for Attribute_1 is 50, so its weighted score is 50 (100% of 50). The total weighted score so far is 50. The value for Attribute_2 is QQ. That value has a similarity of 100. The weight value for Attribute_2 is 50 so its weighted score is 50 (100% of 50). The total weighted score is 100 (50 + 50). This equals or exceeds the value of the “required” score for match and so Record 1 and Record 2 are matched. Comparing Record 1 to Record 3: Record 3 has a value for Attribute_1of CA, which has a similarity of 100 with the value in the Record 1. The weight value for Attribute_1 is 50, so its weighted score is 50 (100% of 50). The total weighted score so far is 50. The value for Attribute_2 is QR and that has a similarity of 50 with the value in the driver record. The weight value for Attribute_2 is 50, so its weighted score is 25 (50% of 50). The total weighted score is 75 (50+25). This is less than the value of the Required Score for Match and so Record 1 and Record 3 are not matched.
Person Built-in Person rules provide an easy and convenient way for matching names of individuals. Person match rules are most effective when the data has first been corrected using Warehouse Builder Name and Address. This rule requires users to set what data within the record make up a person’s name. The data can come from multiple columns and each column specified must be assigned an input role as to what it means. The Person roles for matching are: Prename, First_name_std, Middle_name_std, Middle_name_2_std, Middle_name_3_std, and Last_name.
Oracle 9i Warehouse Builder, Integrated Data Quality Page 16
To create a Person match rule, roles must be assigned to the various attributes. For example, you could assign roles as follows for SSN, FirstName, MI, and LastName: Attribute
Person Role
SSN FirstName
First_name_std
MI
Middle_name_std
LastName
Last_name
The attributes assigned to person roles are used in the match rule to compare the records. How the attributes are compared depends on which role they have been assigned to, and what other comparison options have been set. Refer to Person Roles for a complete list of person roles and how each role is treated in a person match rule. Person Roles Prename: If this role is assigned, the prenames are compared only if the following are true: • The Last_name and, if present, the middle name (Middle_name_std, Middle_name_2_std, and Middle_name_3_std roles) in both records match •
The "Mrs. Match" option is selected
•
Either record has a missing First_name_std For example, the following would match: Prename
First_name_std
Last_name
Mrs.
William
Webster
Mrs.
Webster
First_name_std: The First names are compared if this role is assigned. By default, the first names must match exactly, but other comparison options can be specified as well. First names match if both are blank. A blank first name will not match a non-blank first name unless the Prename role has been assigned and the "Mrs. Match" option is set. If a Last_name role has not been assigned, a role of First_name_std must be assigned. Middle_name_std, Middle_name_2_std, Middle_name_3_std: Middle names are compared if any of the mentioned roles are assigned. By default, the middle names must match exactly, but other comparison options can be specified. If more than one middle name role is assigned, attributes assigned to the different roles are cross-compared. For example, values for Middle_name_std will be compared not only against other Middle_name_std values, but also against Middle_name_2_std if that role is also assigned. Middle names match if either or both are blank. If any of
Oracle 9i Warehouse Builder, Integrated Data Quality Page 17
the middle name roles are assigned, the First_name_std role must also be assigned. Last_name: Last names are compared if this role is assigned. By default, the last names must match exactly, but you can specify other comparison options. The last names match if both are blank, but not if only one is blank. Maturity_postname: If this role is assigned, the post name, such as "Jr.", "III," etc., are compared. The post names match if the values are exactly the same, or if either value is blank.
Firm Built-in Firm rules provide an easy and convenient way for matching names of businesses. Firm match rules are most effective when the data has first been corrected using Warehouse Builder Name and Address. Similar to the Person rule, this rule requires users to set what data within the record make up a firm’s name. The data can come from multiple columns and each column specified must be assigned an input role as to what it means. For example, if attributes TaxID, Company, and Contact are present, assign roles are as follows: Attribute
Firm Role
TaxID Company
Firm1
Contact Note that not every attribute needs to be assigned to a firm role, and not every role needs to be assigned to an attribute. The attributes assigned to firm roles are used in the match rule to compare the records. How the attributes are compared depends on which role they have been assigned, and what other comparison options have been set. Refer to Firm Roles for a complete list of firm roles and how each role is treated in a firm match rule. Firm Roles Firm1: If this role is assigned, the business names represented by Firm1 are compared. Firm1 names will not be compared against Firm2 names unless if the Cross-match firm1 and firm2 box is checked. By default, the firm names must match exactly; but other comparison options can also be specified. Firm1 names do not match if either or both names are blank. Firm2: If this role is assigned, the values of the attribute assigned to Firm2 will be compared. Firm2 names will not be compared against Firm1 names unless if the Cross-match firm1 and firm2 box is checked. By default, the firm names must match exactly; but other comparison options
Oracle 9i Warehouse Builder, Integrated Data Quality Page 18
can also be specified. Firm2 names do not match if either or both names are blank. If a Firm1 role is not assigned, a Firm2 roles must be assigned.
Address Address Match rules provide an easy way to match records based on postal addresses. Address match rules are most effective when the data has first been corrected using Warehouse Builder Name and Address. Address Rules work differently depending on whether the address being processed has been corrected or not. Generally, corrected addresses have already been identified in a postal matching database, and are therefore not only syntactically correct, but are legal and existing addresses according to the Postal Service of the country containing the address. Corrected addresses can be processed more quickly, since the match rule can make certain assumptions about their format. Uncorrected addresses may be syntactically correct, but have not been found in a postal matching database. Addresses may have not been found because they are not in the database, or because there is no postal matching database installed for the country containing the address. Address match rules determine whether an address has been corrected based on the Is_found role. If Is_found role is not assigned, then the match rule performs the comparisons for both the corrected and uncorrected addresses. A list of address roles is as follows: Primary_address, Unit_number, PO_Box, Dual_primary_address, Dual_unit_number, Dual_PO_Box, City, State, Postal_code, and Is_found. Note that not every attribute needs to be assigned to an address role, and not every role needs to be assigned to an attribute. To create an Address match rule, address roles to the various attributes should be assigned. For example, if the attributes Business, AddressLine1, City, StateAbbr, and ZIP exist, the assigned roles are as follows: Attribute
Address Role
Business
AddressLine1
City StateAbbr ZIP
Primary_address City
State Postal_code
The attributes assigned to address roles are used in the match rule to compare the records. How the attributes are compared depends on which role they have been assigned, and what other comparison options have been set. Refer to Address Roles for a complete list of address roles and how each role is treated in an address match rule. Address Roles
Oracle 9i Warehouse Builder, Integrated Data Quality Page 19
Primary_address: If this role is assigned, the primary addresses are compared. Primary addresses can be, for example, street addresses ("100 Main Street") or PO boxes ("PO Box 100"). By default, the primary addresses must match exactly, but a similarity option can also be specified. The Primary_address role must be assigned. Unit_number: If this role is assigned, unit numbers (such as suite numbers, floor numbers, or apartment numbers) are compared if the primary addresses match. The unit numbers match if both are blank, but not if one is blank, unless the Match on blank secondary address option is set. If the Allow differing secondary address is set, the unit numbers are ignored. PO_Box: If this role is assigned, the Post Office Boxes are compared. The PO Box is just the number portion of the PO Box ("100"), and is a subset of the primary address, when the primary address represents a PO Box ("PO Box 100"). If the primary address represents a street address, the PO Box will be blank. Dual_primary_address: If this role is assigned, the Dual_primary_address is compared against the other record's Dual_primary_address and Primary_address to determine a match. Dual_unit_number: If this role is assigned, the Dual_unit_number address is compared against the other record's Dual_unit_number and Unit_number. The unit numbers will also match if one or both are blank. To assign the Dual_unit_number role, the Dual_primary_address role must also be assigned. Dual_PO_Box: If this role is assigned, the Dual_PO_Box address is compared against the other record's Dual_PO_Box and the PO_Box. To assign the Dual_PO_Box role, the Dual_primary_address role must also be assigned. City: If this role is assigned, the cities are compared for uncorrected addresses. For corrected addresses, the cities are only compared if the postal codes do not match. If both City and State roles match, then the address line roles, such as Primary_address, can be compared. By default, the cities must match exactly, but a last line similarity option may be specified.. The cities match if both are blank, but not if only one is blank. If the City role is assigned, then the State role must also be assigned. State: If this role is assigned, the states are compared for uncorrected addresses. For corrected addresses, the states are only compared if the postal codes do not match. If both State and City roles match, then the address line roles, such as Primary_address, can be compared. By default, the states must match exactly, but a last line similarity option may be specified. The states match if both are blank, but not if only one is blank. If the State role is assigned, then the City role must also be assigned. Postal_code: If this role is assigned, the postal codes are compared for corrected addresses. For uncorrected addresses, the Postal_code role is not used. To match, the postal codes must be exactly the same. The postal codes are not considered a match if one or both are blank. If the postal codes match, then the address line roles, such as Primary_address, can be compared. If the postal codes do not match, City and State roles are
Oracle 9i Warehouse Builder, Integrated Data Quality Page 20
compared to determine whether the address line roles should be compared. Is_found: If this role is assigned, the Is_found_flag attributes are not compared, but instead are used to determine whether an address has been found in a postal matching database, and therefore represents a legal address according to the postal service of the country containing the address. This determination is important because the type of comparison done during matching depends on whether the address has been found in the postal database or not.
Custom The users can write their own comparison algorithms desired for matching. They can also reference one of the other match rule types to logically “and” or “or” them together to produce a unique rule type. The editor will allow editing a PL/SQL function that returns a true or false. The users can validate the PL/SQL from this editor. Custom rules can invoke other match rules. This allows multiple rules to be chained together in more complex ways than the default, which OR's individual rules together. By default, if there is more than one match rule defined and "active", then any rule that gives a true condition (matches) results in the record being matched. Custom rules reference other match rules using the name of that rule. This allows individual weight rules, conditional rules, etc. to be invoked from custom rules. If a custom rule invokes another match rule, the invoked match rule should be marked as "passive." Otherwise the invoked rule will be executed twice, and will unnecessarily increase processing time or produce an undesirable match. Each match rule is represented as a PL/SQL function with two input parameters. Parameters “THIS_” and “THAT_” refer to the two records being compared. All defined match rules are available in the Custom Match Rule Editor for simple drag-and-drop. For example, if the are defined passive rules LocationMatch, HomePhoneMatch and WorkPhoneMatch, then this custom rule simply accomplishes the logical AND of those rules. BEGIN RETURN LocationMatch(THIS_, THAT_) and (HomePhoneMatch(THIS_, THAT_) and WorkPhoneMatch(THIS_, THAT_)); END;
XREF Group
XREF provides a record of the merge process. Every record from the ingroup will construct a record in the Xref outgroup (1-1 cardinality with the ingroup). The Xref record may be populated with both the original attribute values and the merged attributes.
Oracle 9i Warehouse Builder, Integrated Data Quality Page 21
Merge Rules
Matching will produce a set of records that are logically the same. Merging is the process of creating one record from the set of matched records. A Merge rule is provided for each attribute of a record and uses the matched records to generate a merged value for that attribute.
Figure 9 –Merge Rules
Any This rule will pick any (first non-null) value of the match set for this attribute. Rank This rule will rank the records from the match set. The associated attribute from the highest ranked record will be used to populate the merge attribute value. Sequence User can specify a database sequence for this rule. The nextval of the sequence will be used for the value. Min/Max User can specify an attribute and a relation to choose the record to be used as a source for the merge attribute.
Oracle 9i Warehouse Builder, Integrated Data Quality Page 22
Copy User can choose a value from a different previously merged value. Any RECORD, Rank RECORD, Min/Max RECORD, custom RECORD The “RECOD” merge rules select a matched record to populate values for a group of attributes. Custom User can create a PL/SQL package function to select the merge value. The operator will provide the signature of this function. The user is responsible for the implementation of the rule from “BEGIN” to “END;” The matched records and merge record are parameters for this function.
CONCLUSION
A unified, enterprise-wide view of customer data is the foundation for successful Data Warehousing and Business Intelligence. As companies implement data warehouses, data quality should not be overlooked. Oracle9i Warehouse Builder provides the full range of data consolidation and customer data quality capabilities required to build and to maintain customerfocused databases in a single, integrated solution. It produces significant productivity gains over traditional multi-vendor solutions involving generalpurpose transformation tools, mailing utilities and extensive custom coding. Specifically in data quality area, Oracle9i Warehouse Builder allows accomplishment of tasks such as: •
Name and Address parsing, standardization, verification and augmentation
•
Matching and Merging for the purpose of de-duplication
•
Matching and Merging for the purpose of record-linking (including householding)
This leads to providing a higher ROI through delivery of: •
Improved relationship and reputation with customers by better targeting communications and campaigns
•
Reduced costs of marketing campaigns due to postal discounts for cleansed data
•
Reliable geography specific marketing
•
Customer-oriented “services” as necessary (address validation, correction, lookups, householding, etc.).
•
Other initiatives requiring a cleansed and unified view of customer.
Oracle 9i Warehouse Builder, Integrated Data Quality Page 23
Oracle9i Warehouse Builder addresses the challenges of data quality by offering data cleansing and integration in one tool, allowing businesses to more accurately locate, understand and target customers. Best of all, Oracle9i Warehouse Builder offers the flexibility and accuracy that data consolidation projects demand, yet at a low cost. By implementing Oracle9i Warehouse Builder, you will build a solid foundation for data quality in your data warehouse.
Oracle 9i Warehouse Builder, Integrated Data Quality Page 24
Oracle9i Warehouse Builder, Integrated Data Quality July 2003 Authors: Behnaz Westman, Nikolai Rochnik
Oracle Corporation World Headquarters 500 Oracle Parkway Redwood Shores, CA 94065 U.S.A. Worldwide Inquiries: Phone: +1.650.506.7000 Fax: +1.650.506.7200 www.oracle.com Oracle is a registered trademark of Oracle Corporation. Various product and service names referenced herein may be trademarks of Oracle Corporation. All other product and service names mentioned may be trademarks of their respective owners. Copyright © 2003 Oracle Corporation All rights reserved.