Data Profiling Guide
Informatica PowerCenter® (Version 7.1.2)
Informatica PowerCenter® Data Profiling Guide
Version 7.1.2
February 2005

Copyright (c) 2003-2005 Informatica Corporation. All rights reserved. Printed in the USA.

This software and documentation contain proprietary information of Informatica Corporation; they are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording, or otherwise) without prior consent of Informatica Corporation. Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement as provided in DFARS 227.7202-1(a) and 227.7202-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.

The information in this document is subject to change without notice. If you find any problems in the documentation, please report them to us in writing. Informatica Corporation does not warrant that this documentation is error free.

Informatica, PowerMart, PowerCenter, PowerCenter Connect, PowerConnect, and PowerChannel are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.

Portions of this software are copyrighted by DataDirect Technologies, 1991-2000. Informatica PowerCenter products contain ACE (TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University and University of California, Irvine, Copyright © 1993-2002, all rights reserved.
DISCLAIMER: Informatica Corporation provides this documentation “as is” without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of non-infringement, merchantability, or use for a particular purpose. The information provided in this documentation may include technical inaccuracies or typographical errors. Informatica could make improvements and/or changes in the products described in this documentation at any time without notice.
Table of Contents

List of Figures . . . ix
List of Tables . . . xi
Preface . . . xiii
    New Features and Enhancements . . . xiv
    About Informatica Documentation . . . xv
    About this Book . . . xvi
        Document Conventions . . . xvi
    Other Informatica Resources . . . xvii
        Visiting Informatica Customer Portal . . . xvii
        Visiting the Informatica Webzine . . . xvii
        Visiting the Informatica Web Site . . . xvii
        Visiting the Informatica Developer Network . . . xvii
        Obtaining Technical Support . . . xviii
Chapter 1: Data Profiling Overview . . . 1
    Understanding Data Profiling . . . 2
        Profiling Source Data During Mapping Development . . . 2
        Profiling Source Data to Monitor Data Quality . . . 2
    Steps for Profiling Source Data . . . 3
        Step 1. Create the Data Profiling Warehouse . . . 3
        Step 2. Create a Data Profile . . . 3
        Step 3. Create and Run the Profile Session . . . 4
        Step 4. View Profile Reports . . . 5
    Using the Profile Manager . . . 6
        Profile View . . . 7
        Source View . . . 8
Chapter 2: Installing and Configuring Data Profiling . . . 9
    Overview . . . 10
        Installing and Configuring Data Profiling . . . 10
        Upgrading from a Previous Version . . . 10
        Installing PowerAnalyzer . . . 11
    Configuring a Data Profiling Warehouse . . . 12
        Creating a Data Profiling Warehouse . . . 12
        Upgrading the Data Profiling Warehouse . . . 13
        Configuring a Relational Database Connection to the Data Profiling Warehouse . . . 15
    Installing PowerAnalyzer Data Profiling Reports . . . 16
        Installing Data Profiling XML Scripts . . . 16
        Importing Data Profiling Schema and Reports . . . 16
        Configuring a Data Source for the Data Profiling Warehouse . . . 20
        Creating a Data Connector . . . 21
        Adding a Data Profiling Data Source to a Data Connector . . . 23
    Configuring Default Data Profile Options . . . 26
    Purging the Data Profiling Warehouse . . . 29
Chapter 3: Managing Data Profiles . . . 31
    Overview . . . 32
        Profiling Sources and Mapplet Output Data . . . 32
        Profile Functions . . . 33
    Creating an Auto Profile . . . 34
        Auto Profile Naming Conventions . . . 34
        Steps to Create an Auto Profile . . . 35
    Creating a Custom Profile . . . 38
        Step 1. Specify a Profile Name and Profile Description . . . 38
        Step 2. Add Sources to the Profile . . . 38
        Step 3. Add Functions and Enable Session Configuration . . . 39
        Step 4. Configure Profile Functions . . . 41
        Step 5. Configure the Profile Session . . . 45
        Generating the Profile Mapping . . . 45
    Editing a Data Profile . . . 47
    Deleting a Data Profile . . . 49
    Working with Profile Mappings . . . 51
        Copying Data Profiling Objects . . . 51
        Combining Data Profile Mappings with Other Mappings . . . 51
    Using Mapplets to Extend Data Profiling Functions . . . 52
        Extending Data Profiling Functionality with Mapplets . . . 52
        Profiling Aggregate Data . . . 52
        Profiling Multiple Sources with One Matching Port . . . 55
        Profiling Sources that Use All Matching Ports . . . 57
    Troubleshooting . . . 60
Chapter 4: Running Profile Sessions . . . 61
    Overview . . . 62
        Working with Data Samples . . . 62
    Running Sessions from the Profile Manager . . . 63
        Configuring a Session in the Profile Wizard . . . 64
        Running a Session when You Create a Data Profile . . . 67
        Running a Session for an Existing Data Profile . . . 67
        Monitoring Interactive Sessions . . . 67
    Profiling Data Samples . . . 68
        Step 1. Select a Function Type . . . 68
        Step 2. Select a Data Sampling Mode . . . 69
    Profiling Relational Data Samples . . . 70
        Performing Sampling as the PowerCenter Server Reads Data . . . 70
        Delegating Sampling to the Database . . . 71
        Improving Data Sampling Accuracy and Performance . . . 72
    Creating a Session in the Workflow Manager . . . 74
    Troubleshooting . . . 75
Chapter 5: Viewing Profile Results . . . 77
    Overview . . . 78
    PowerCenter Data Profiling Reports . . . 79
        Auto Profile Reports . . . 79
        Custom Profile Reports . . . 81
        Viewing PowerCenter Data Profiling Reports . . . 83
    PowerAnalyzer Data Profiling Reports . . . 85
Chapter 6: Working with Domains . . . 89
    Working with Domains . . . 90
    Prepackaged Domains . . . 91
    Custom Domains . . . 92
        List of Values Domains . . . 92
        Regular Expression Domains . . . 94
        Domain Definition Filename Domains . . . 98
        Specifying Localization and Code Page Information . . . 99
    Editing a Domain . . . 101
    Deleting a Domain . . . 102
Chapter 7: Working with Functions . . . 103
    Overview . . . 104
    Source-Level Functions . . . 105
        Row Count . . . 105
        Business Rule Validation . . . 106
        Candidate Key Evaluation . . . 108
        Redundancy Evaluation . . . 109
        Row Uniqueness . . . 110
    Column-Level Functions . . . 113
        Business Rule Validation . . . 113
        Domain Validation . . . 115
        Domain Inference . . . 117
        Aggregate Functions . . . 119
        Distinct Value Count . . . 121
    Intersource Functions . . . 123
        Orphan Analysis . . . 123
        Join Complexity Evaluation . . . 125
Appendix A: Data Profiling Views . . . 127
    Overview . . . 128
        List of Data Profiling Views . . . 128
    DPR_LATEST_PRFLS . . . 129
    DPR_PRFL_AUTO_COL_FN_METRICS . . . 130
    DPR_PRFL_CART_PROD_METRICS . . . 131
    DPR_PRFL_COL_FN . . . 132
    DPR_PRFL_COL_FN_METRICS . . . 134
    DPR_PRFL_COL_FN_VERBOSE . . . 136
    DPR_PRFL_CP_FN . . . 137
    DPR_PRFL_FN_DTLS . . . 139
    DPR_PRFL_OJ_FN . . . 141
    DPR_PRFL_OJ_FN_VERBOSE . . . 143
    DPR_PRFL_OJ_METRICS . . . 144
    DPR_PRFL_RUN_DTLS . . . 145
    DPR_PRFL_SRC_FN . . . 146
    DPR_PRFL_SRC_FN_METRICS . . . 148
    DPR_PRFL_SRC_FN_VERBOSE . . . 149
    DPR_PRFL_VER_DTLS . . . 150
Appendix B: Code Page Compatibility . . . 151
    Code Page Compatibility . . . 152
Appendix C: Data Profiling Error Messages . . . 155
    Overview . . . 156
    Designer Messages . . . 157
    Server Messages . . . 162
        DP Codes . . . 162
Appendix D: Glossary . . . 169
    Glossary Terms . . . 170
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
List of Figures

Figure 1-1. Profile Manager . . . 7
Figure 1-2. Profile Manager Source View . . . 8
Figure 2-1. Configuring Default Profile Options . . . 26
Figure 3-1. Profile Wizard - Auto Profile Column Selection Page . . . 36
Figure 3-2. Profile Wizard - Profile Sources Page . . . 39
Figure 3-3. Profile Wizard - Function-Level Operations Page . . . 40
Figure 3-4. Profile Wizard - Function Details Page . . . 41
Figure 3-5. Profile Wizard - Function Role Details Page (Row Count Function) . . . 42
Figure 3-6. Data Profile Mappings in the Designer . . . 45
Figure 3-7. Sample Mapplet to Aggregate and Filter Mapplet Output Data . . . 53
Figure 3-8. Sample Mapplet - Aggregator Transformation Settings . . . 54
Figure 3-9. Sample Verbose Summary Report . . . 55
Figure 3-10. Sample Mapplet to Join Data from Two Sources . . . 56
Figure 3-11. Sample Data Profiling Report for Mapplet Output Data . . . 57
Figure 3-12. Sample Mapplet to Merge Mapplet Output Data . . . 58
Figure 3-13. Sample Data Profiling Report for the Union Mapplet . . . 59
Figure 4-1. Profile Wizard - Profile Run Page . . . 64
Figure 4-2. Profile Wizard - Session Setup Page . . . 66
Figure 4-3. Profile Session Status in the Profile Manager . . . 67
Figure 4-4. Manual PowerCenter Sample . . . 71
Figure 4-5. Automatic Random Sampling . . . 71
Figure 4-6. Data Sampling Delegated to the Database . . . 72
Figure 4-7. Data Profile of an Historical Sample of Data . . . 72
Figure 5-1. Sample Auto Profile Report . . . 81
Figure 5-2. Sample Verbose Data Report . . . 81
Figure 5-3. Sample Custom Profile Report . . . 83
Figure 5-4. PowerAnalyzer Data Profiling Analytic Workflow . . . 85
Figure 6-1. Regular Expression Domain . . . 95
Figure 6-2. Test Data Dialog Box . . . 96
Figure 7-1. Row Count Function . . . 106
Figure 7-2. Business Rule Validation Source-Level Function . . . 107
Figure 7-3. Business Rule Validation Editor . . . 108
Figure 7-4. Candidate Key Evaluation Function . . . 109
Figure 7-5. Redundancy Evaluation Function . . . 110
Figure 7-6. Row Uniqueness Function . . . 111
Figure 7-7. Business Rule Validation Column-Level Function . . . 114
Figure 7-8. Domain Validation Function . . . 116
Figure 7-9. Domain Browser . . . 117
Figure 7-10. Domain Inference Function . . . 118
Figure 7-11. Aggregate Functions . . . 120
Figure 7-12. Distinct Value Count Function . . . 121
Figure 7-13. Orphan Analysis Function . . . 124
Figure 7-14. Join Complexity Evaluation Function . . . 126
Figure B-1. Configuring Code Page Compatibility . . . 153
List of Tables

Table 2-1. Scripts for Creating a Data Profiling Warehouse . . . 12
Table 2-2. Scripts for Upgrading the Data Profiling Warehouse . . . 13
Table 2-3. Data Connector Properties . . . 22
Table 2-4. Default Data Profile Options - General Tab . . . 27
Table 2-5. Default Data Profile Options - Prefixes Tab . . . 28
Table 3-1. Employee Expenses Sample Source Data . . . 52
Table 4-1. Profile Run Properties for Profile Sessions . . . 65
Table 4-2. Session Setup Properties for Profile Sessions . . . 66
Table 4-3. Function Behavior with Data Samples . . . 68
Table 4-4. Recommended Percentages for Manual Random Sampling . . . 73
Table 5-1. Auto Profile Report Summary Attributes . . . 80
Table 5-2. Custom Profile Report Summary Attributes . . . 82
Table 5-3. Custom Profile Report Function Attributes . . . 82
Table 5-4. PowerAnalyzer Data Profiling Reports . . . 86
Table 6-1. Prepackaged Domains . . . 91
Table 6-2. perl Syntax Guidelines for Regular Expressions . . . 94
Table 6-3. COBOL Syntax and perl Syntax Compared . . . 96
Table 6-4. SQL Syntax and perl Syntax Compared . . . 96
Table 7-1. Row Count Function Options . . . 106
Table 7-2. Business Rule Validation Source-Level Function Options . . . 107
Table 7-3. Candidate Key Evaluation Function Options . . . 109
Table 7-4. Redundancy Evaluation Function Options . . . 110
Table 7-5. Row Uniqueness Function Options . . . 111
Table 7-6. Business Rule Validation Column-Level Function Options . . . 114
Table 7-7. Domain Validation Function Options . . . 116
Table 7-8. Aggregate Functions to Add Based on the Column Datatype . . . 119
Table 7-9. Aggregate Function Options . . . 120
Table 7-10. Distinct Value Count Function Options . . . 122
Table 7-11. Orphan Analysis Function Options . . . 124
Table A-1. DPR_LATEST_PRFLS Column Information . . . 129
Table A-2. DPR_PRFL_AUTO_COL_FN_METRICS Column Information . . . 130
Table A-3. DPR_PRFL_CART_PROD_METRICS Column Information . . . 131
Table A-4. DPR_PRFL_COL_FN Column Information . . . 132
Table A-5. DPR_PRFL_COL_FN_METRICS Column Information . . . 134
Table A-6. DPR_PRFL_COL_FN_VERBOSE Column Information . . . 136
Table A-7. DPR_PRFL_CP_FN Column Information . . . 137
Table A-8. DPR_PRFL_FN_DTLS Column Information . . . 139
Table A-9. DPR_PRFL_OJ_FN Column Information . . . 141
Table A-10. DPR_PRFL_OJ_FN_VERBOSE Column Information . . . 143
Table A-11. DPR_PRFL_OJ_METRICS Column Information . . . 144
Table A-12. DPR_PRFL_RUN_DTLS Column Information . . . 145
Table A-13. DPR_PRFL_SRC_FN Column Information . . . 146
Table A-14. DPR_PRFL_SRC_FN_METRICS Column Information . . . 148
Table A-15. DPR_PRFL_SRC_FN_VERBOSE Column Information . . . 149
Table A-16. DPR_PRFL_VER_DTLS Column Information . . . 150
Preface

Welcome to PowerCenter, Informatica's integrated suite of software products that delivers an open, scalable data integration solution addressing the complete life cycle for data warehouse and analytic application development. PowerCenter combines the latest technology enhancements for reliably managing data repositories and delivering information resources in a timely, usable, and efficient manner.

PowerCenter Data Profiling is a technique for analyzing source data. PowerCenter Data Profiling lets you determine information such as implicit datatypes, suggest candidate keys, detect data patterns, and evaluate join criteria.
New Features and Enhancements This section describes new features and enhancements to Data Profiling 7.1.1.
♦ Data sampling. You can create a data profile for a sample of source data instead of the entire source. You can view a profile for a random sample of data, a specified percentage of data, or a specified number of rows starting with the first row.

♦ Verbose data enhancements. You can specify the type of verbose data you want the PowerCenter Server to write to the Data Profiling warehouse. The PowerCenter Server can write all rows, only the rows that meet the business rule, or only the rows that do not meet the business rule.

♦ Profile session enhancements. You can save sessions that you create from the Profile Manager to the repository.

♦ Domain Inference function tuning. You can configure the Profile Wizard to filter the Domain Inference function results. You can configure a maximum number of patterns and a minimum pattern frequency. You may want to narrow the scope of patterns returned to view only the primary domains, or you may want to widen the scope of patterns returned to view exception data.

♦ Row Uniqueness function. You can determine unique and duplicate rows for a source based on a selection of columns for the specified source.

♦ Define mapping, session, and workflow prefixes. You can define default mapping, session, and workflow prefixes for the mappings, sessions, and workflows generated when you create a data profile.

♦ Profile mapping display in the Designer. The Designer now displays profile mappings under a profile mappings icon in the repository folder.
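The Domain Inference tuning described above, a maximum number of patterns combined with a minimum pattern frequency, can be illustrated with a small conceptual sketch. This is not an Informatica API: the function name, the character-class encoding, and the sample data are invented for illustration only.

```python
# Conceptual sketch of Domain Inference filtering (not an Informatica API).
# Each value is reduced to a coarse character pattern; only patterns that
# occur often enough are kept, up to a maximum number of patterns.
from collections import Counter
import re

def infer_patterns(values, max_patterns=3, min_frequency=0.10):
    """Return (pattern, frequency) pairs for the most common patterns."""
    def to_pattern(value):
        # Encode digits as 9 and letters as X; keep other characters literal.
        return re.sub(r"[A-Za-z]", "X", re.sub(r"\d", "9", value))

    counts = Counter(to_pattern(v) for v in values)
    total = len(values)
    return [
        (pattern, count / total)
        for pattern, count in counts.most_common(max_patterns)
        if count / total >= min_frequency
    ]

# A high minimum frequency narrows the results to the primary domains;
# lowering it widens the scope to surface exception data.
phones = ["555-1234", "555-9876", "556-1200", "55A-0000"]
print(infer_patterns(phones, max_patterns=2, min_frequency=0.25))
# → [('999-9999', 0.75), ('99X-9999', 0.25)]
```

With min_frequency raised to 0.5, the exception pattern 99X-9999 drops out and only the primary domain 999-9999 remains, which mirrors the narrowing behavior the feature describes.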
About Informatica Documentation

The complete set of documentation for PowerCenter includes the following books:

♦ Data Profiling Guide. Provides information about how to profile PowerCenter sources to evaluate source data and detect patterns and exceptions.

♦ Designer Guide. Provides information needed to use the Designer. Includes information to help you create mappings, mapplets, and transformations. Also includes a description of the transformation datatypes used to process and transform source data.

♦ Getting Started. Provides basic tutorials for getting started.

♦ Installation and Configuration Guide. Provides information needed to install and configure the PowerCenter tools, including details on environment variables and database connections.

♦ PowerCenter Connect® for JMS® User and Administrator Guide. Provides information to install PowerCenter Connect for JMS, build mappings, extract data from JMS messages, and load data into JMS messages.

♦ Repository Guide. Provides information needed to administer the repository using the Repository Manager or the pmrep command line program. Includes details on functionality available in the Repository Manager and Administration Console, such as creating and maintaining repositories, folders, users, groups, and permissions and privileges.

♦ Transformation Language Reference. Provides syntax descriptions and examples for each transformation function provided with PowerCenter.

♦ Transformation Guide. Provides information on how to create and configure each type of transformation in the Designer.

♦ Troubleshooting Guide. Lists error messages that you might encounter while using PowerCenter. Each error message includes one or more possible causes and actions that you can take to correct the condition.

♦ Web Services Provider Guide. Provides information you need to install and configure the Web Services Hub. This guide also provides information about how to use the web services that the Web Services Hub hosts. The Web Services Hub hosts Real-time Web Services, Batch Web Services, and Metadata Web Services.

♦ Workflow Administration Guide. Provides information to help you create and run workflows in the Workflow Manager, as well as monitor workflows in the Workflow Monitor. Also contains information on administering the PowerCenter Server and performance tuning.

♦ XML User Guide. Provides information you need to create XML definitions from XML, XSD, or DTD files, and relational or other XML definitions. Includes information on running sessions with XML data. Also includes details on using the midstream XML transformations to parse or generate XML data within a pipeline.
About this Book

The Informatica PowerCenter Data Profiling Guide provides information to install Data Profiling, build data profiles, run profile sessions, and view profile results. It is written for the database administrators and developers who are responsible for building PowerCenter mappings and running PowerCenter workflows.

This book assumes you have knowledge of relational database concepts, database engines, and PowerCenter. You should also be familiar with the interface requirements for other supporting applications. The material in this book is available for online use.
Document Conventions

This guide uses the following formatting conventions:
If you see…                  It means…
italicized text              The word or set of words is especially emphasized.
boldfaced text               Emphasized subjects.
italicized monospaced text   This is the variable name for a value you enter as part of an operating system command. This is generic text that should be replaced with user-supplied values.
Note:                        The following paragraph provides additional facts.
Tip:                         The following paragraph provides suggested uses.
Warning:                     The following paragraph notes situations where you can overwrite or corrupt data, unless you follow the specified procedure.
monospaced text              This is a code example.
bold monospaced text         This is an operating system command you enter from a prompt to run a task.
Other Informatica Resources

In addition to the product manuals, Informatica provides these other resources:
♦ Informatica Customer Portal
♦ Informatica Webzine
♦ Informatica web site
♦ Informatica Developer Network
♦ Informatica Technical Support
Visiting Informatica Customer Portal

As an Informatica customer, you can access the Informatica Customer Portal site at http://my.informatica.com. The site contains product information, user group information, newsletters, access to the Informatica customer support case management system (ATLAS), the Informatica Knowledgebase, Informatica Webzine, and access to the Informatica user community.
Visiting the Informatica Webzine

The Informatica Documentation team delivers an online journal, the Informatica Webzine. This journal provides solutions to common tasks, detailed descriptions of specific features, and tips and tricks to help you develop data warehouses. The Informatica Webzine is a password-protected site that you can access through the Customer Portal. The Customer Portal has an online registration form for login accounts to its webzine and web support. To register for an account, go to http://my.informatica.com. If you have any questions, please email [email protected].
Visiting the Informatica Web Site

You can access Informatica's corporate web site at http://www.informatica.com. The site contains information about Informatica, its background, upcoming events, and the location of your closest sales office. You will also find product information, as well as literature and partner information. The services area of the site includes important information on technical support, training and education, and implementation services.
Visiting the Informatica Developer Network

The Informatica Developer Network is a web-based forum for third-party software developers. You can access the Informatica Developer Network at the following URL:

http://devnet.informatica.com
The site contains information on how to create, market, and support customer-oriented add-on solutions based on Informatica's interoperability interfaces.
Obtaining Technical Support

There are many ways to access Informatica technical support. You can call or email your nearest Technical Support Center listed below, or you can use our WebSupport Service. WebSupport requires a user name and password. You can request a user name and password at http://my.informatica.com.

North America / South America
Informatica Corporation
100 Cardinal Way
Redwood City, CA 94063
Phone: 866.563.6332 or 650.385.5800
Fax: 650.213.9489
Hours: 6 a.m. - 6 p.m. (PST/PDT)
email: [email protected]

Africa / Asia / Australia / Europe
Informatica Software Ltd.
6 Waltham Park
Waltham Road, White Waltham
Maidenhead, Berkshire SL6 3TN
Phone: 44 870 606 1525
Fax: +44 1628 511 411
Hours: 9 a.m. - 5:30 p.m. (GMT)
email: [email protected]

Belgium: Phone: +32 15 281 702, Hours: 9 a.m. - 5:30 p.m. (local time)
France: Phone: +33 1 41 38 92 26, Hours: 9 a.m. - 5:30 p.m. (local time)
Germany: Phone: +49 1805 702 702, Hours: 9 a.m. - 5:30 p.m. (local time)
Netherlands: Phone: +31 30 602 2797, Hours: 9 a.m. - 5:30 p.m. (local time)
Singapore: Phone: +65 322 8589, Hours: 9 a.m. - 5 p.m. (local time)
Switzerland: Phone: +41 800 81 80 70, Hours: 8 a.m. - 5 p.m. (local time)
Chapter 1

Data Profiling Overview

This chapter includes the following topics:
♦ Understanding Data Profiling, 2
♦ Steps for Profiling Source Data, 3
♦ Using the Profile Manager, 6
Understanding Data Profiling

Data profiling is a technique used to analyze source data. PowerCenter Data Profiling can help you evaluate source data and detect patterns and exceptions. PowerCenter lets you profile source data to suggest candidate keys, detect data patterns, evaluate join criteria, and determine information such as implicit datatypes.

You can use Data Profiling to analyze source data in the following situations:
♦ During mapping development
♦ During production, to maintain data quality

Profiling Source Data During Mapping Development

You can use Data Profiling during mapping development for the following purposes:
♦ To learn more about your source data. If you want to know more about your source data, you can profile the data to understand data patterns and exceptions. As a result, you can base the design of mappings and workflows on actual data, rather than on theoretical assumptions about sources.
♦ To validate documented business rules about the source data. For example, if you have a business rule that columns in a source table must contain U.S. zip codes, you can profile the source data to verify that the rows in this table contain the proper values.
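A check like the zip code rule above is essentially a pattern match over a column. The following shell sketch illustrates the idea outside of PowerCenter; the file name and sample values are invented for illustration and are not part of the product:

```shell
# Hypothetical stand-in for a Business Rule Validation check:
# count how many values in a sample ZIP column match the
# 5-digit or ZIP+4 pattern.
printf '%s\n' 94063 9406 94063-1234 ABCDE > zips.txt
valid=$(grep -Ec '^[0-9]{5}(-[0-9]{4})?$' zips.txt)
total=$(grep -c '' zips.txt)
echo "$valid of $total values are valid zip codes"
```

In Data Profiling itself, this kind of check is expressed as a Business Rule Validation function inside the data profile rather than as a script.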
Profiling Source Data to Monitor Data Quality

You can use Data Profiling during production for the following purposes:
♦ To verify that the assumptions you made about your source data during project development are still valid. For example, you may want to view statistics about how many rows satisfied a business rule and how many did not.
♦ To verify the validity of business intelligence (BI) reports. When you use a BI tool, you can profile your data to verify the validity of your BI reports. For example, if you run a BI report stating that your most profitable customers are in California, you might want to profile your source data to verify what percentage of your customer data contains a state code. If a certain percentage of the data does not contain a state code, the results of your BI report may not be reliable.
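The state-code check described above boils down to a completeness percentage. A minimal sketch of that calculation, using an invented customer,state sample file:

```shell
# Hypothetical completeness check: what fraction of customer rows
# carry a non-empty state code? (sample data is invented)
printf '%s\n' 'Acme,CA' 'Bolt,' 'Cray,NY' 'Dune,CA' > customers.csv
with_state=$(awk -F, '$2 != ""' customers.csv | grep -c '')
total=$(grep -c '' customers.csv)
echo "$((100 * with_state / total))% of rows have a state code"
```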
Steps for Profiling Source Data

You create data profiles to profile source data in PowerCenter. A data profile contains functions that perform calculations on the source data. When you create a data profile, the Designer generates a mapping. The PowerCenter repository stores your data profile and the associated mapping. You can run profile sessions for the mapping to gather information about your source data. The Data Profiling warehouse stores the results of profile sessions. After you run profile sessions, you can view reports that display the profile session results.

To profile a source, mapplet, or groups in a source or mapplet, perform the following tasks:
1. Create the Data Profiling warehouse (one-time step).
2. Create a data profile.
3. Create and run a session for the data profile.
4. View reports based on the profile results.

The PowerCenter Designer provides a Profile Manager and Profile Wizard to complete these tasks.
Step 1. Create the Data Profiling Warehouse

The Data Profiling warehouse is a relational database that stores data profile information from profile sessions. When you run multiple sessions for a data profile, the Data Profiling warehouse retains the information for each session run. If you update your data profile, the Data Profiling warehouse creates a new version of the data profile. The PowerCenter Client and other applications can access the Data Profiling warehouse for reporting purposes. For more information about viewing Data Profiling reports, see “Viewing Profile Results” on page 77.

You create a separate Data Profiling warehouse for each repository when you configure Data Profiling. You can create the warehouse on any relational database that PowerCenter supports. You can also create the warehouse on the same database instance as your PowerCenter repository.

Note: If you create a Data Profiling warehouse on Informix, you cannot use PowerAnalyzer to view Data Profiling reports.
Step 2. Create a Data Profile

To profile a source, you must first create a data profile. Data profiles contain functions that perform calculations on the source data. For example, you can use an Aggregate function or Business Rule Validation function in a data profile. For more information about profile functions, see “Working with Functions” on page 103.
The PowerCenter repository stores the profile as an object. You can apply profile functions to a column within a source, to a single source, or to more than one source. You can create the following types of data profiles:
♦ Auto profile. Contains a predefined set of functions for profiling source data. You can use an auto profile during mapping development to learn more about your source data. For more information about auto profiles, see “Creating an Auto Profile” on page 34.
♦ Custom profile. A profile you define with the functions you need to profile source data. You can use a custom profile during mapping development to validate documented business rules about the source data. You can also use a custom profile to monitor data quality or to validate the results of BI reports. For more information about creating custom profiles, see “Creating a Custom Profile” on page 38.

You use the Designer to create an auto profile. You can use the Designer or the Profile Manager to create a custom profile by invoking the Profile Wizard. Once you create an auto profile or custom profile, the Designer generates a mapping based on the profile information. PowerCenter stores the data profile and profile mapping in your repository.

You can configure a data profile to write verbose data to the Data Profiling warehouse during a profile session. Verbose data provides more details about the data that results from a profile function. For example, for a function that validates business rules, verbose data may include the invalid rows in the source. For a function that determines the number of distinct values, verbose data can include a list of distinct values.

After you create a data profile, you can view profile details from the Profile Manager. You can also modify and remove the data profile. For more information about the Profile Manager, see “Using the Profile Manager” on page 6. For more information about creating, editing, and deleting data profiles, see “Managing Data Profiles” on page 31.
Step 3. Create and Run the Profile Session

Once you define a profile, run a PowerCenter session for the profile mapping. The PowerCenter Server writes the profile session results to the Data Profiling warehouse. You can run profile sessions from the following places:
♦ Profile Manager. You can create and run temporary and persistent profile sessions from the Profile Manager. A temporary session runs on demand and is not stored in the repository. A persistent session can run on demand and is stored in the repository.
♦ Workflow Manager. You can create and run persistent sessions from the Workflow Manager, which are stored in the repository. Use the Workflow Manager to create a session for a profile mapping and include the session in a workflow. You can run the workflow from the Workflow Manager or at a scheduled time.

For more information about running profile sessions, see “Running Profile Sessions” on page 61.
Step 4. View Profile Reports

After the profile session completes, the PowerCenter Server loads the session results into the Data Profiling warehouse. You can view the profile session results in the following ways:
♦ View PowerCenter Data Profiling reports from the Profile Manager.
♦ View customizable reports in PowerAnalyzer.

For more information about viewing profile reports, see “Viewing Profile Results” on page 77.
Using the Profile Manager

The Profile Manager is a tool in the Designer that helps you manage data profiles. Use the Profile Manager to set default profiling options, work with data profiles in your repository, run profile sessions, view profile results, and view sources and mapplets with at least one profile defined for them. When you launch the Profile Manager, you can access profile information for the open folders in your repository.

There are two views in the Profile Manager:
♦ Profile View. The Profile View tab displays the data profiles in the open folders in your repository.
♦ Source View. The Source View tab displays the source definitions in the open folders in your repository for which you have defined data profiles.

Note: If the repository folder is read-only, you can only view data profiles in the Profile View.

From the Profile View and the Source View, you can perform the following tasks to manage, run, and view data profiles:
♦ Create a custom profile. For more information, see “Creating a Custom Profile” on page 38.
♦ View data profile details. You can view the details of an auto profile or custom profile.
♦ Edit a data profile. For more information, see “Editing a Data Profile” on page 47.
♦ Delete a data profile. For more information, see “Deleting a Data Profile” on page 49.
♦ Run a session. For more information, see “Running Sessions from the Profile Manager” on page 63.
♦ Regenerate a profile mapping. You can regenerate a profile mapping to validate the mapping.
♦ Check in profile mappings. You can check in profile mappings for versioned repositories to commit the changes to the repository. When you check in an object, the repository creates a new version of the object and assigns it a version number. For more information on versioning, see the Repository Guide.
♦ Configure default data profile options. For more information, see “Configuring Default Data Profile Options” on page 26.
♦ Configure domains for profile functions. For more information, see “Working with Domains” on page 89.
♦ Purge the Data Profiling warehouse. For more information, see “Purging the Data Profiling Warehouse” on page 29.
♦ Display the status of interactive profile sessions. For more information, see “Monitoring Interactive Sessions” on page 67.
♦ Display PowerCenter Data Profiling reports. For more information, see “PowerCenter Data Profiling Reports” on page 79.
By default, the Profile Manager launches automatically after you create a data profile. You can also manually launch the Profile Manager from the Source Analyzer or Mapplet Designer. To launch the Profile Manager from the Source Analyzer, select Sources-Profiling-Launch Profile Manager. To launch the Profile Manager from the Mapplet Designer, select Mapplets-Profiling-Launch Profile Manager.

Tip: If you do not want the Profile Manager to launch automatically after you create a data profile, you can change the default data profile options in the Profile Manager. For more information about configuring default data profile options, see “Configuring Default Data Profile Options” on page 26.
Profile View

The Profile View tab displays all of the data profiles in the open folders in your repository. Use the Profile View to determine the data profiles that exist for a particular repository folder.

Figure 1-1 shows the Profile Manager:

[Figure 1-1. Profile Manager. Callouts: Repository Data Profile Folder, Profile Navigator, Properties Window, Session Status]
Source View

The Source View displays the source definitions with data profiles in the open folders in your repository. Use the Source View to determine if a specific source definition already has data profiles defined. The Source View shows whether the data profile is an auto profile or a custom profile.

You can also use the Source View when you want to work with a data profile but are more familiar with the source name than the data profile name. For example, you may want to run a profile session, and you know the source definition name but not the data profile name.

Figure 1-2 shows the Profile Manager Source View:

[Figure 1-2. Profile Manager Source View. Callouts: Repository Folder Name, Source Name, Data Profile Name]

When you select the Source View tab in the Profile Manager, a source view tree displays data profiles as nodes under the source definition for which you defined the data profile. If you change or delete a data profile, or a source or mapplet with a data profile, you can click View-Refresh to refresh the Source View.
Chapter 2

Installing and Configuring Data Profiling

This chapter contains information on the following topics:
♦ Overview, 10
♦ Configuring a Data Profiling Warehouse, 12
♦ Installing PowerAnalyzer Data Profiling Reports, 16
♦ Configuring Default Data Profile Options, 26
♦ Purging the Data Profiling Warehouse, 29
Overview

Data Profiling installs automatically with the PowerCenter Server and Client when you purchase the Data Profiling Option. After you install PowerCenter with the Data Profiling Option, you must complete configuration steps before you can create data profiles, run profile sessions, or view profile reports.
Installing and Configuring Data Profiling

To install and configure Data Profiling, complete the following steps:

1. Install and configure PowerCenter 7.1.2. Install and configure the PowerCenter 7.1.2 Server and Client with the Data Profiling Option.
   ♦ You must register the PowerCenter Server before you can view session properties when you create or edit data profiles. For more information about registering the PowerCenter Server, see “Registering the PowerCenter Server” in the Installation Guide.
   ♦ You must ensure that the code pages for the operating systems that host the PowerCenter Client and Server, the Data Profiling warehouse, and domains are compatible. For more information about configuring compatible code pages for PowerCenter and Data Profiling components, see “Code Page Compatibility” on page 152.
2. Create a Data Profiling warehouse. Create a Data Profiling warehouse for each PowerCenter repository in which you want to store data profiles. For more information, see “Configuring a Data Profiling Warehouse” on page 12.
3. Install PowerAnalyzer reports (optional). For more information, see “Installing PowerAnalyzer Data Profiling Reports” on page 16.
4. Configure the default data profile options (optional). You can configure default options for the Profile Wizard, data profile mappings, and data profile workflows and sessions. For more information about configuring default data profile options, see “Configuring Default Data Profile Options” on page 26.
Upgrading from a Previous Version

If you are upgrading from a previous version of Data Profiling, complete the following steps:

1. Install and configure PowerCenter 7.1.2. Install and configure the PowerCenter 7.1.2 Server and Client with the Data Profiling Option.
2. Upgrade the Data Profiling warehouse. For more information, see “Upgrading the Data Profiling Warehouse” on page 13.
3. Install Data Profiling PowerAnalyzer reports (optional). For more information, see “Installing PowerAnalyzer Data Profiling Reports” on page 16.

Note: When you upgrade Data Profiling, the Designer maintains the default Data Profiling options you configured in the previous version.
Installing PowerAnalyzer

If you want to view Data Profiling reports in PowerAnalyzer, install and configure PowerAnalyzer before installing and configuring Data Profiling. To use PowerAnalyzer for Data Profiling, you need the following product licenses:
♦ Application server license
♦ PowerAnalyzer license
Application Server License

JBoss Application Server does not require a license. However, if you are using another application server, you may need a license. You should have received the license when you purchased the application server product.
PowerAnalyzer License

PowerAnalyzer Data Profiling reports use PowerAnalyzer and require a PowerAnalyzer license. Informatica does not ship the PowerAnalyzer license key with the Data Profiling installation. To obtain a PowerAnalyzer license, send a request to [email protected]. Informatica provides the license key in an email with instructions on how to apply the license to the product.

You do not need the license key to install PowerAnalyzer. If you have the PowerAnalyzer license key at the time of installation, you can apply the license during the installation. Otherwise, you can complete the installation and apply the license key afterward.

The PowerAnalyzer license is a restricted license. The license can only be used to run Data Profiling and PowerCenter Metadata Reporter. For more information about the license, contact Informatica Technical Support. For more information about activating the PowerAnalyzer license or installing and configuring PowerAnalyzer, see the PowerAnalyzer Installation Guide.
Configuring a Data Profiling Warehouse

When you install Data Profiling for the first time, create a Data Profiling warehouse for each PowerCenter repository in which you want to store data profiles. When you upgrade from a previous version, you must upgrade your Data Profiling warehouse. Once you create a Data Profiling warehouse, you must configure a relational database connection to the warehouse in the Workflow Manager.
Creating a Data Profiling Warehouse

You can create a Data Profiling warehouse on any relational database that PowerCenter supports. You can also create the Data Profiling warehouse on the same database instance as your PowerCenter repository.

Note: If you create a Data Profiling warehouse on Informix, you cannot use PowerAnalyzer to view Data Profiling reports.

To create a Data Profiling warehouse, run a script to create a warehouse schema for your database type. When you install Data Profiling, the installation program places the script in the following directory:

\Extensions\DataProfile\ProfileWarehouse
Table 2-1 shows the script you must run for each database type:

Table 2-1. Scripts for Creating a Data Profiling Warehouse

Database Type          Script
IBM DB2                create_schema_db2.sql
Informix               create_schema_inf.sql
Microsoft SQL Server   create_schema_sql.sql
Oracle                 create_schema_ora.sql
Sybase                 create_schema_syb.sql
Teradata               create_schema_ter.sql
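The mapping in Table 2-1 can be sketched as a small helper that returns the creation script for a given database type. The short type keys used here are invented for the example:

```shell
# Sketch: select the warehouse creation script by database type,
# mirroring Table 2-1. The lowercase keys are illustrative only.
pick_script() {
  case "$1" in
    db2)       echo create_schema_db2.sql ;;
    informix)  echo create_schema_inf.sql ;;
    sqlserver) echo create_schema_sql.sql ;;
    oracle)    echo create_schema_ora.sql ;;
    sybase)    echo create_schema_syb.sql ;;
    teradata)  echo create_schema_ter.sql ;;
    *)         echo "unsupported database type: $1" >&2; return 1 ;;
  esac
}
pick_script oracle   # -> create_schema_ora.sql
```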
Creating a Data Profiling Warehouse on IBM DB2

To create a Data Profiling warehouse on IBM DB2, you must connect to DB2. Then, enter the following command from the command line:

db2 -tvf create_schema_db2.sql
Creating a Data Profiling Warehouse on Informix

To create a Data Profiling warehouse on Informix, run the dbaccess command line program. Select create_schema_inf.sql as the file you want to execute. Select Run.
Creating a Data Profiling Warehouse on Microsoft SQL Server

To create a Data Profiling warehouse on Microsoft SQL Server, open Query Analyzer and run the create_schema_sql.sql script.

Note: Informatica recommends that you use Query Analyzer rather than isql to run the script. Otherwise, the script may not run successfully.
Creating a Data Profiling Warehouse on Oracle

To create a Data Profiling warehouse on Oracle, run the following command from the command line:

sqlplus <user name>/<password>@<connect string> @create_schema_ora.sql
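For example, assuming a warehouse database user dp_admin with password secret and the Oracle connect string dpwh (all invented values for illustration), the assembled command would be:

```shell
# Invented credentials for illustration only; substitute your own values.
DP_USER=dp_admin
DP_PASS=secret
DP_CONNECT=dpwh
CMD="sqlplus ${DP_USER}/${DP_PASS}@${DP_CONNECT} @create_schema_ora.sql"
echo "$CMD"   # the command you would run on the database client machine
```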
Creating a Data Profiling Warehouse on Sybase

To create a Data Profiling warehouse on Sybase, run the following command from the command line:

isql -U <user name> -P <password> -S <server name> -i create_schema_syb.sql
Creating a Data Profiling Warehouse on Teradata

To create a Data Profiling warehouse on Teradata, open Teradata SQL Assistant. Open the file create_schema_ter.sql and click Execute.
Upgrading the Data Profiling Warehouse

To upgrade a Data Profiling warehouse, run a script to upgrade the warehouse schema for your database type. The script you run depends on your database type and the version of the PowerCenter Data Profiling installation you are upgrading from. When you install Data Profiling, the installation program places the script in the following directory:

\Extensions\DataProfile\ProfileWarehouse
Table 2-2 shows the script you must run for each database type and PowerCenter Data Profiling installation version:

Table 2-2. Scripts for Upgrading the Data Profiling Warehouse

Upgrade             Database Type          Script
From 7.1 to 7.1.2   IBM DB2                upgrade_schema02-03_db2.sql
From 7.1 to 7.1.2   Informix               upgrade_schema02-03_inf.sql
From 7.1 to 7.1.2   Microsoft SQL Server   upgrade_schema02-03_sql.sql
From 7.1 to 7.1.2   Oracle                 upgrade_schema02-03_ora.sql
From 7.1 to 7.1.2   Sybase                 upgrade_schema02-03_syb.sql
From 7.1 to 7.1.2   Teradata               upgrade_schema02-03_ter.sql
From 7.0 to 7.1.2   IBM DB2                upgrade_schema01-03_db2.sql
From 7.0 to 7.1.2   Informix               upgrade_schema01-03_inf.sql
From 7.0 to 7.1.2   Microsoft SQL Server   upgrade_schema01-03_sql.sql
From 7.0 to 7.1.2   Oracle                 upgrade_schema01-03_ora.sql
From 7.0 to 7.1.2   Sybase                 upgrade_schema01-03_syb.sql
From 7.0 to 7.1.2   Teradata               upgrade_schema01-03_ter.sql
Upgrading a Data Profiling Warehouse on IBM DB2

To upgrade a Data Profiling warehouse on IBM DB2, you must connect to DB2. Then, enter the following command from the command line, using the appropriate script from Table 2-2:

db2 -tvf <upgrade script>
Upgrading a Data Profiling Warehouse on Informix

To upgrade a Data Profiling warehouse on Informix, run the dbaccess command line program. Select the appropriate script from Table 2-2 as the file you want to execute. Select Run.
Upgrading a Data Profiling Warehouse on Microsoft SQL Server

To upgrade a Data Profiling warehouse on Microsoft SQL Server, open Query Analyzer and run the upgrade script indicated in Table 2-2.

Note: Informatica recommends that you use Query Analyzer rather than isql to run the script. Otherwise, the script may not run successfully.
Upgrading a Data Profiling Warehouse on Oracle

To upgrade a Data Profiling warehouse on Oracle, run the following command from the command line, using the appropriate script from Table 2-2:

sqlplus <user name>/<password>@<connect string> @<upgrade script>
Upgrading a Data Profiling Warehouse on Sybase

To upgrade a Data Profiling warehouse on Sybase, run the following command from the command line, using the appropriate upgrade script listed in Table 2-2:

isql -U <user name> -P <password> -S <server name> -i <upgrade script>
Upgrading a Data Profiling Warehouse on Teradata

To upgrade a Data Profiling warehouse on Teradata, open Teradata SQL Assistant. Open the upgrade script indicated in Table 2-2 and click Execute.
Configuring a Relational Database Connection to the Data Profiling Warehouse

After you create a Data Profiling warehouse, you must configure a relational database connection to the warehouse in the Workflow Manager. When you run profile sessions, use this relational database connection as the target connection. For more information about configuring a relational database connection, see the Workflow Administration Guide. For more information about running profile sessions, see “Running Profile Sessions” on page 61.
Installing PowerAnalyzer Data Profiling Reports

If you want to view Data Profiling reports on PowerAnalyzer, you must install and configure PowerAnalyzer. Informatica recommends that a PowerAnalyzer user with the System Administrator role set up the Data Profiling reports.

To install PowerAnalyzer Data Profiling reports for the first time, complete the following steps:

1. Install the Data Profiling XML scripts. For more information, see “Installing Data Profiling XML Scripts” on page 16.
2. Import the Data Profiling schema and reports into PowerAnalyzer. For more information, see “Importing Data Profiling Schema and Reports” on page 16.
3. Create a data source for the Data Profiling warehouse in PowerAnalyzer. For more information, see “Configuring a Data Source for the Data Profiling Warehouse” on page 20.
4. Create a data connector in PowerAnalyzer. If you do not already have a data connector, you must create one. When you create the data connector, add your Data Profiling warehouse data source to it. If you create a data connector in this step, you can skip the next step. For more information, see “Creating a Data Connector” on page 21.
5. Add the Data Profiling warehouse data source to a data connector in PowerAnalyzer. If you already have a data connector, add your Data Profiling warehouse data source to it. If you created a data connector in the previous step, you have already added the data source to it. For more information, see “Adding a Data Profiling Data Source to a Data Connector” on page 23.
Installing Data Profiling XML Scripts

The Data Profiling installation includes XML scripts to create the following PowerAnalyzer objects:
♦ Reports
♦ Schemas

Locate the XML scripts in the following directory:

PowerCenter Client\Extensions\DataProfile\IPAReports
Importing Data Profiling Schema and Reports

Before you import the Data Profiling schema and reports, ensure that PowerAnalyzer works properly. To import the Data Profiling schema and reports, log in to PowerAnalyzer with system administrator privileges.
Import the following PowerAnalyzer objects from the XML files in the following order:

1. Schemas. Import the schema tables, attributes, and metrics from an XML file.
2. Reports. Import the reports from an XML file.
Importing the Data Profiling Schema

The dpr_schema.xml file contains the schema, attribute, and metric definitions required for Data Profiling reports.

To import the Data Profiling schema:

1. Click Administration > XML Export / Import > Import Schemas. The Import Schemas page displays.
2. On the Import Schemas page, select Validate XML Against DTD.
3. Click Browse to locate the dpr_schema.xml file.
4. Click Import XML.
The Import Schemas page displays the schema objects to be imported.

5. Click Continue. The Import Schemas page displays the following message: The objects have been successfully imported to the target repository.
Importing the Data Profiling Reports

The dpr_reports.xml file contains the definitions of the Data Profiling reports. Locate the file in the Data Profiling Reports folder on the PowerCenter installation CD. You can copy the file to your PowerAnalyzer machine.

To import the Data Profiling reports:

1. Click Administration > XML Export / Import > Import Reports.
The Import Reports page displays.

2. On the Import Reports page, select Validate XML Against DTD.
3. Click Browse to locate the dpr_reports.xml file.
4. Click Import XML. The Import Reports page displays the reports to be imported. By default, the Publish to Everyone and Run Scheduled Reports After Import options are selected.
5. If you do not want to give all PowerAnalyzer users access to the reports, clear the Publish to Everyone option. You can give specific individuals access to the Data Profiling reports. For more information about providing access to a PowerAnalyzer report, see the PowerAnalyzer Administrator Guide.
6. Clear the Run Scheduled Reports After Import option.
7. Click Continue. The Import Reports page displays the following message: The objects have been successfully imported to the target repository.
Configuring a Data Source for the Data Profiling Warehouse

You must configure a data source for each Data Profiling warehouse that contains profiling data for which you want to view PowerAnalyzer Data Profiling reports.

To create the Data Profiling warehouse data source:
1. Click Administration > Schema Design > Data Sources.
2. On the Data Sources page, click Add.
3. Enter a name and description for the data source.
4. Select the server type of your Data Profiling warehouse. PowerAnalyzer provides JDBC drivers to connect to the Data Profiling warehouse. When you select the server type, PowerAnalyzer supplies the driver name and connection string format for the JDBC drivers that PowerAnalyzer provides. The server type list includes the following databases:
♦ Oracle. Select to connect to an Oracle repository.
♦ SQL Server. Select to connect to a Microsoft SQL Server repository.
♦ DB2. Select to connect to an IBM DB2 repository.
♦ Sybase. Select to connect to a Sybase repository.
♦ Teradata. Select to connect to a Teradata repository.

Note: You cannot create an Informix data source.

5. Customize the JDBC connection string with the information for your PowerCenter repository database.
6. Enter the user name and password to connect to the repository database.
7. Test the connection. If the connection fails, verify that the repository database information is correct. Consult your database administrator if necessary.
8. Click OK to save the data source.
For more information about PowerAnalyzer data sources, see the PowerAnalyzer Administrator Guide.
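For reference, JDBC connection strings generally follow a URL format similar to the sketches below. These are illustrative placeholders only — the host names, ports, and database names are invented, and the exact driver names and URL templates to use are the ones PowerAnalyzer supplies when you select the server type:

```
Oracle:     jdbc:oracle:thin:@dbhost:1521:ORCL
SQL Server: jdbc:sqlserver://dbhost:1433;databaseName=dp_warehouse
DB2:        jdbc:db2://dbhost:50000/dp_warehouse
Sybase:     jdbc:sybase:Tds:dbhost:5000/dp_warehouse
Teradata:   jdbc:teradata://dbhost/DATABASE=dp_warehouse
```

Replace the host, port, and database name with the values for your repository database.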
Creating a Data Connector

You must create a data connector. Once you create the data connector, add your data source to the data connector. You can create data connectors for specific users and groups.

Note: If you already have a data connector to which you want to add your data source, you can skip this step. However, you must add your Data Profiling data source to your data connector. For more information about adding a data source to a data connector, see “Adding a Data Profiling Data Source to a Data Connector” on page 23.

To create a data connector:
1. Click Administration > Schema Design > Data Connectors. The list of data connectors defined in the repository displays in the task area.
2. Click Add. The Data Connectors page displays.
3. Enter the data connector properties listed in Table 2-3:

Table 2-3. Data Connector Properties
♦ System Name (Required). Enter the name of the data connector. The data connector name must be unique. The system name can include any character except a space, tab, newline character, and the following special characters: \ / : * ? " < > | ' &
♦ Description (Optional). Enter a description for the data connector. The connector description can be between 1 and 255 characters.
♦ Primary Data Source (Required). Select the primary data source from the list of data sources available in PowerAnalyzer. PowerAnalyzer uses this data source to connect to the Data Profiling warehouse and read the metric and attribute data for a report. For information about how PowerAnalyzer connects to the primary and additional data sources, see the PowerAnalyzer Administrator Guide.
♦ Primary Time Dimension (Optional). This option does not apply to Data Profiling reports.
♦ Additional Schema Mappings (Optional). This option does not apply to Data Profiling reports.

4. If you do not want to set up additional data sources, click OK. If this is the first data connector in PowerAnalyzer, PowerAnalyzer saves the data connector to the repository as the system data connector, making it available to all users. You cannot edit the list of users who can use the system data connector. If this is not the system data connector, you can edit the list of users who can use this data connector, and you can activate it in addition to the system data connector.
Adding a Data Profiling Data Source to a Data Connector

PowerAnalyzer uses a data connector to connect to a data source and read the data for Data Profiling reports. Typically, PowerAnalyzer uses the system data connector to connect to all the data sources required for reports. To enable PowerAnalyzer to run Data Profiling reports, add the Data Profiling data source to the system data connector. If you have several data connectors and you want to use a specific data connector for Data Profiling reports, add the Data Profiling data source to that data connector.

Note: If you configured a data connector in the previous step, verify that you have already added your Data Profiling data source. Click the data connector name to see if your data source is listed as the primary data source. For more information about PowerAnalyzer data connectors, see the PowerAnalyzer Administrator Guide.

To add the Data Profiling warehouse data source to the system data connector:
1. Click Administration > Schema Design > Data Connectors. The Data Connectors page displays.
2. Click the name of the system data connector. PowerAnalyzer displays the properties of the system data connector.
3. In the Additional Schema Mappings section, click Add. PowerAnalyzer expands the section and displays the available schemas in the repository.
4. In the Data Source list, select the Data Profiling data source you created in “Configuring a Data Source for the Data Profiling Warehouse” on page 20.
5. In the Available Schemas section, select DP_SCHEMA, and click Add >>.
6. Click Add. PowerAnalyzer displays the additional schema mappings for the system data connector.
7. Click OK.
You can now run Data Profiling reports using the system data connector.
Configuring Default Data Profile Options

After you install and configure Data Profiling client components, you can configure the default data profile options for the PowerCenter Client machine. Each data profile you create in the Designer uses these default options. You configure the default data profile options from the Profile Manager.

Figure 2-1. Configuring Default Profile Options (the Options dialog box; the Revert All button reverts options to the default settings)

To configure default data profile general options:
1. Launch the Profile Manager. To launch the Profile Manager from the Source Analyzer, select Sources-Profiling-Launch Profile Manager. To launch the Profile Manager from the Mapplet Designer, select Mapplets-Profiling-Launch Profile Manager.
2. In the Profile Manager, click Tools-Options. The General tab for the Options dialog box displays.
3. Configure the following options as necessary:

Table 2-4. Default Data Profile Options - General Tab
♦ Always save changes before interactive run (Optional). Select to always save changes to the profile mapping before running a profile session interactively. If you clear this option, the Designer prompts you to save changes before you run an interactive session.
♦ Display Profile Manager after creating a profile (Optional). Select to always launch the Profile Manager after you create a data profile. If you clear this option, the Profile Manager does not launch automatically after you create a data profile.
♦ Always run profile interactively (Optional). Select to always run a profile session interactively when you create a data profile. If you clear this option, you can still run auto and custom profile sessions interactively from the Profile Manager. For more information about running interactive sessions, see “Running Sessions from the Profile Manager” on page 63. For more information about creating custom profiles, see “Creating a Custom Profile” on page 38.
♦ Check in profile mapping when profile is saved (Optional). Select if you want the Designer to check in profile mappings when you save changes for versioned repositories. Saving versions of profile mappings in your repository can consume large amounts of disk space. Make sure you have enough disk space on the machine hosting the repository.
♦ Always invoke auto profiling dialog (Optional). Select to display the Auto Profiling dialog box when you create a new auto profile. If you clear this option, the Auto Profiling dialog box does not display when you create a new data profile. Also, you cannot configure Domain Inference tuning options and verbose data loading options when you create the auto profile. However, if you clear this option and create an auto profile for a source with 25 or more columns, the Auto Profiling dialog box displays. For more information about creating auto profiles, see “Creating an Auto Profile” on page 34.
♦ Use source owner name during profile mapping generation (Optional). Select to add the table owner name to relational sources when the Designer generates a profile mapping. Note: If the owner name changes after you generate the profile mapping, you must regenerate the mapping. You can regenerate a profile mapping in the Profile Manager. For more information about regenerating profile mappings, see “Using the Profile Manager” on page 6.

4. Enter the name of a text editor for the session log file. By default, the Profile Manager selects WordPad as the text editor.
5. Enter the location where you want your session log files to be written.
6. Click the Prefixes tab.
7. Configure the following options as necessary:

Table 2-5. Default Data Profile Options - Prefixes Tab
♦ Profile Mapping Prefix (Required). Edit the prefix to use with all profile mapping names. Profile mappings use the naming convention <prefix>_. The default prefix is m_DP_. The prefix must be 1 to 10 characters and cannot contain spaces.
♦ Profile Workflow Prefix (Required). Edit the prefix to use with all profile workflow names. Profile workflows use the naming convention <prefix>_. The default prefix is wf_DP_. The prefix must be 1 to 10 characters and cannot contain spaces.
♦ Profile Session Prefix (Required). Edit the prefix to use with all profile session names. Profile sessions use the naming convention <prefix>_. The default prefix is s_DP_. The prefix must be 1 to 10 characters and cannot contain spaces.
♦ Revert All (Optional). Reverts the prefixes back to the default settings.

8. Click OK.
Purging the Data Profiling Warehouse

When you delete data profiles from a repository, the related metadata and profile session results remain in the Data Profiling warehouse. You can purge metadata and profile session results from the Data Profiling warehouse for deleted data profiles. You can also purge profile session results that are no longer associated with data profile metadata.

When you purge the Data Profiling warehouse, you can purge all metadata and session results, all except the most recent metadata and session results, or metadata and session results older than a date. You can purge metadata and session results associated with a particular repository folder or with all folders in the repository.

To purge the Data Profiling warehouse:
1. Launch the Profile Manager. To launch the Profile Manager from the Source Analyzer, select Sources-Profiling-Launch Profile Manager. To launch the Profile Manager from the Mapplet Designer, select Mapplets-Profiling-Launch Profile Manager.
2. Select Target Warehouse-Connect to connect to your Data Profiling warehouse.
3. Select Target Warehouse-Purge. The Purge Data Profiling Warehouse dialog box displays.
4. Select the folder from which you want to purge metadata and/or profile session results. The folder must be open to purge it. Or, select All Open Folders to purge metadata and/or profile session results from all open folders.
5. Select one of the following options to purge from your Data Profiling warehouse:
♦ Profiles. Purges metadata and profile session results for deleted data profiles.
♦ Orphan Profiles Only. Purges profile session results that are no longer associated with data profile metadata.
6. If you selected Profiles, select one of the following options for Purge Profile Metadata and Runs. Or, if you selected Orphan Profiles Only, select one of the following options for Purge Profile Runs Only:
♦ All. Purges all profile metadata and session run results for deleted data profiles.
♦ All Except Latest. Purges all profile metadata and session run results for deleted data profiles except the latest instance of each.
♦ Older Than. Select a date to purge all profile metadata and session run results for deleted data profiles prior to that date.
7. Select Purge Metadata and Runs for Orphan Profiles to delete both metadata and session results that are no longer associated with data profile metadata.
8. Click Purge.
The Data Profiling warehouse may take a few minutes to purge.
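The purge options combine as sketched below. This Python fragment illustrates only the selection logic — the profile names and run dates are invented, and the actual purge is performed internally by the Profile Manager, not by user code:

```python
from datetime import date

# Invented runs for deleted profiles: (profile_name, run_date).
runs = [
    ("AP_CustomerData", date(2004, 11, 1)),
    ("AP_CustomerData", date(2004, 12, 15)),
    ("AP_Orders", date(2004, 10, 20)),
]

def to_purge(runs, option, cutoff=None):
    """Return the runs each purge option would remove."""
    if option == "All":
        return list(runs)
    if option == "All Except Latest":
        # Keep the latest run of each profile, purge the rest.
        latest = {}
        for name, d in runs:
            if name not in latest or d > latest[name]:
                latest[name] = d
        return [(n, d) for n, d in runs if d != latest[n]]
    if option == "Older Than":
        return [(n, d) for n, d in runs if d < cutoff]
    raise ValueError(option)

print(to_purge(runs, "All Except Latest"))
# purges only the 2004-11-01 run of AP_CustomerData
```
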
Chapter 3

Managing Data Profiles

This chapter includes the following topics:
♦ Overview, 32
♦ Creating an Auto Profile, 34
♦ Creating a Custom Profile, 38
♦ Editing a Data Profile, 47
♦ Deleting a Data Profile, 49
♦ Working with Profile Mappings, 51
♦ Using Mapplets to Extend Data Profiling Functions, 52
♦ Troubleshooting, 60
Overview

You can create, edit, and delete data profiles. Data profiles contain a set of functions to apply to a specified set of source data. The functions return metadata about the profile sources; this metadata makes up the data profile reports. You can create the following types of data profiles:
♦ Auto profile. Contains predefined functions for profiling source data. Use an auto profile during mapping or mapplet development to learn more about source data. For more information about creating an auto profile, see “Creating an Auto Profile” on page 34.
♦ Custom profile. Contains functions that you add and configure yourself. Use a custom profile during mapping or mapplet development to validate documented business rules about the source data. You can also use a custom profile to monitor data quality. For more information about creating a custom profile, see “Creating a Custom Profile” on page 38.

When you create a data profile, the Designer generates a mapping based on the profile functions. When you run a session for the mapping, the PowerCenter Server writes profile data to the Data Profiling warehouse. Once you create a data profile, you can edit and delete the data profile. For more information about editing a data profile, see “Editing a Data Profile” on page 47. For more information about deleting a data profile, see “Deleting a Data Profile” on page 49.
Profiling Sources and Mapplet Output Data

You can profile sources and output data from connected ports in mapplet output groups. A source is any external data source that you create or import in PowerCenter as a source definition. You can profile the following types of sources:
♦ Relational database sources
♦ Flat file sources
♦ XML sources
♦ VSAM sources
♦ Application sources, such as SAP R/3, PeopleSoft, and MQSeries

When you profile a multi-group source, such as an XML source, you can select the groups in the source you want to profile, or you can profile the entire source. When you profile mapplet output data, the Profile Wizard creates a data profile based on the data output from the connected ports in the Output transformation. You can profile output data from mapplets that meet the following conditions:
♦ The mapplet contains a source definition for input.
♦ The mapplet contains no Input transformation.
♦ The mapplet contains no Transaction Control transformations.
♦ The mapplet contains at least one Output transformation.
♦ The ports in the Output transformation are connected.
Note: You can profile sources with ASCII and non-ASCII port names.
Profiling Multiple Sources

If you want to profile multiple sources, you can create a mapplet that combines multiple sources and create a data profile based on the mapplet output data. For example, you might use several Joiner transformations within the mapplet to join source data from multiple sources, and profile the output data from this mapplet.
Profiling SAP R/3 Sources

When you create a data profile for SAP R/3 sources, you must generate an ABAP program for the profile mapping before you can run a session for the data profile. For more information about generating an ABAP program, see the PowerCenter Connect for SAP R/3 User and Administrator Guide.
Profile Functions

You can add multiple profile functions to a data profile. Profile functions are calculations you perform on the source data that return information about various characteristics of the source data. When you add a function to a data profile, you can choose from the following types of functions:
♦ Source-level functions. Perform calculations on two or more columns of a source, source group, or mapplet group. For example, you can evaluate a business rule for groups in an XML source.
♦ Column-level functions. Perform calculations on one column of a source. For example, you can evaluate the data in a column to find patterns that frequently occur in the data.
♦ Intersource functions. Perform calculations on two or more sources. These functions generate information about the relationship between the sources. For example, you might compare the values of columns in two sources to find out the percentage of identical data that appears in both sources.

Each function type has a subset of functionality that you can configure when you add a function to the data profile. For more information about profile functions, see “Working with Functions” on page 103.
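As a concrete illustration of the intersource example above, the following Python sketch (invented data, not PowerCenter code) computes the percentage of distinct values in one source column that also appear in another source's column:

```python
# Hypothetical column values from two sources.
source_a = ["CA", "NY", "TX", "WA"]
source_b = ["NY", "TX", "FL", "NV"]

# Percentage of distinct source A values that also appear in source B.
common = set(source_a) & set(source_b)
overlap_pct = 100.0 * len(common) / len(set(source_a))
print(f"{overlap_pct:.0f}% of source A values also appear in source B")
```
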
Creating an Auto Profile

Create an auto profile to learn more about source data or mapplet output data during mapping development. When you create an auto profile, the Designer creates a data profile with the following functions:
♦ Row Count. Counts the number of rows read from the source during the profile session. When you create a data profile that uses the Row Count function with data samples, the Row Count function estimates the total row count.
♦ Candidate Key Evaluation. Calculates the number and percentage of unique values in one or more columns.
♦ Redundancy Evaluation. Calculates the number of duplicate values in one or more columns of the source.
♦ Domain Inference. Reads all values in a column and infers patterns that occur in the data. You can configure the Profile Wizard to filter the Domain Inference results.
♦ Distinct Value Count. Reads all the values in a column and returns the number of distinct values for the column. You can configure the auto profile to load verbose data to the Data Profiling warehouse.
♦ Aggregate functions. Calculate an aggregate value for numeric or string values in a column. Use aggregate functions to count null values, determine average values, and determine minimum or maximum values.

For more information about functions, see “Working with Functions” on page 103.
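To make these functions concrete, the following Python sketch (invented sample data, not PowerCenter code) computes rough equivalents of several auto profile statistics over a single column:

```python
# Invented sample column, including duplicates and a null.
values = ["A101", "B202", "B202", "C303", None]

non_null = [v for v in values if v is not None]

row_count = len(values)                           # Row Count
distinct_count = len(set(non_null))               # Distinct Value Count
duplicate_count = len(non_null) - distinct_count  # Redundancy Evaluation
unique_pct = 100.0 * distinct_count / row_count   # rough Candidate Key measure
null_count = row_count - len(non_null)            # Aggregate function: null count

print(row_count, distinct_count, duplicate_count, null_count)
```
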
Auto Profile Naming Conventions

The Designer uses the following naming convention for an auto profile:
AP_<source/mapplet name>
For example, if you generate an auto profile for the source CustomerData, the Designer names the auto profile AP_CustomerData.

After you create the auto profile, the Designer generates a mapping based on the profile functions. The Designer uses the following naming convention when it saves the profile mapping to the repository:
m_DP_AP_<source/mapplet name>
For example, if you create an auto profile called AP_CustomerData, the profile mapping name is m_DP_AP_CustomerData.

Tip: You can rename an auto profile in the Profile Manager. You can click Description on the Auto Profile Column Selection page to change the name or description of the profile. Or, you can change the naming convention for profile mappings in the default data profile options. For information about changing default data profile options, see “Configuring Default Data Profile Options” on page 26.
If you have an auto profile for a source, and you generate another auto profile for the same source, the Designer does not overwrite the existing auto profile. It creates a new auto profile using the following naming convention:
AP_<source/mapplet name>N
where N is the latest version number of the previous auto profile plus 1. For example, if you have an auto profile AP_CustomerData, and you generate a new auto profile for the source CustomerData, the new auto profile name is AP_CustomerData1. The mapping the Designer generates for the new auto profile uses the following naming convention:
m_DP_AP_<source/mapplet name>N
where N is the latest version number of the auto profile plus 1.

Tip: As source data changes, you may need to create new auto profiles for the source. Because new auto profiles do not overwrite old ones, you can preserve existing auto profiles and their corresponding mappings in the Data Profiling warehouse for tracking or auditing purposes.
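The naming behavior described above can be sketched as follows. This is an illustrative approximation, not the Designer's actual implementation:

```python
def auto_profile_name(source_name, existing):
    """Return AP_<source>, or AP_<source>N if auto profiles already exist."""
    base = f"AP_{source_name}"
    if base not in existing:
        return base
    # Find the next unused version suffix.
    n = 1
    while f"{base}{n}" in existing:
        n += 1
    return f"{base}{n}"

name = auto_profile_name("CustomerData", {"AP_CustomerData"})
mapping = f"m_DP_{name}"
print(name, mapping)  # AP_CustomerData1 m_DP_AP_CustomerData1
```
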
Steps to Create an Auto Profile

When you create an auto profile, you can select the groups or columns in the source to profile, or you can profile the entire source.

To create an auto profile:
1. Select the source definition in the Source Analyzer or the mapplet in the Mapplet Designer that you want to profile.
2. In the Source Analyzer, select Sources-Profiling-Create Auto Profile. In the Mapplet Designer, select Mapplets-Profiling-Create Auto Profile. The Auto Profile Column Selection page displays in the following cases:
♦ You set the default data profile options to open the Auto Profile Column Selection page when you create an auto profile.
♦ The source definition contains 25 or more columns.
Figure 3-1. Profile Wizard - Auto Profile Column Selection Page (on this page, you can add or edit a name or description for the profile, add columns or groups to the profile, specify the type of verbose data to load, select whether to configure the session, and configure the Domain Inference function settings)

If the Auto Profile Column Selection page does not display, the Designer generates an auto profile and profile mapping based on the profile functions. Go to step 10.

Note: If you skip this dialog box, you cannot configure verbose data loading settings or Domain Inference tuning settings.

3. Optionally, click Description to add a description for the data profile. If you do not want to add a description, go to step 5.
4. Enter a description of up to 200 characters for the profile, and click OK to return to the Auto Profile Column Selection page.
5. Optionally, select the groups or columns in the source that you want to profile. By default, all columns or groups are selected.
6. Specify the type of verbose data for the Distinct Value Count function you want the PowerCenter Server to write to the Data Profiling warehouse during the profile session. The PowerCenter Server can write the following types of verbose data to the Data Profiling warehouse for auto profile functions:
♦ No rows. The PowerCenter Server writes no verbose data to the Data Profiling warehouse.
♦ Duplicate rows only. The PowerCenter Server writes duplicate values to the Data Profiling warehouse.
♦ All rows. The PowerCenter Server writes distinct and duplicate values to the Data Profiling warehouse.

Note: If you load verbose data for columns with a precision greater than 1,000 characters, the PowerCenter Server might write truncated data to the Data Profiling warehouse during the profile session. For more information about working with verbose data, see “Configuring a Function for Verbose Mode” on page 43.

7. Click Profile Settings to specify settings for Domain Inference tuning. The Profile Settings dialog box opens. You can specify the following criteria to filter Domain Inference data:
♦ Maximum number of patterns. The PowerCenter Server returns the most frequently occurring patterns up to the number of patterns you specify.
♦ Minimum pattern frequency. The PowerCenter Server returns patterns that occur at or above the frequency you specify.

For more information about the Domain Inference function and Domain Inference tuning, see “Domain Inference” on page 117.

8. Select Configure Session to configure the session properties after you create the data profile.
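The two Domain Inference tuning criteria act as filters on the inferred patterns. The following Python sketch uses invented pattern counts to show how the filters combine; it is an illustration, not PowerCenter code:

```python
# Invented (pattern, occurrence count) pairs for a 1,000-row column.
patterns = {"999-999": 700, "A-9999": 250, "XX-99": 40, "misc": 10}
total = sum(patterns.values())

max_patterns = 2       # Maximum number of patterns
min_frequency = 0.05   # Minimum pattern frequency (5%)

# Keep patterns at or above the minimum frequency...
frequent = {p: c for p, c in patterns.items() if c / total >= min_frequency}
# ...then keep only the most frequent ones, up to the maximum.
top = sorted(frequent.items(), key=lambda pc: pc[1], reverse=True)[:max_patterns]
print(top)
```
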
9. Click Next if you selected Configure Session, or click Finish if you cleared Configure Session. The Designer generates a data profile and profile mapping based on the profile functions.
10. Click Repository-Save. If you clicked Next, the Profile Run page opens. If you clicked Finish, the Profile Wizard closes.
11. Configure the Run Profile options and click Next. For information about configuring the Run Profile options for a session, see “Running a Session when You Create a Data Profile” on page 67. The Session Setup page opens.
12. Configure the Session Setup page options. For information about configuring the Session Setup page options, see “Running a Session when You Create a Data Profile” on page 67. If you selected Run Session, the Profile Manager starts the session. If you cleared the Run Session option, the Profile Wizard saves the session properties you configured and closes.
Creating a Custom Profile

You can create a custom profile from the following areas of the Designer:
♦ Source Analyzer. Select Sources-Profiling-Create Custom Profile.
♦ Mapplet Designer. Select Mapplets-Profiling-Create Custom Profile.
♦ Profile Manager. Select Profile-Create Custom.

Note: If you select a multi-group source definition, the Profile Wizard adds all groups in the source definition to the profile. Use the Profile Wizard to remove any groups you do not want to profile.

You can also edit an existing data profile or delete a data profile. For more information about editing data profiles, see “Editing a Data Profile” on page 47. For more information about deleting data profiles, see “Deleting a Data Profile” on page 49.

To create a custom profile, complete the following steps:
1. Enter a profile name and optionally add a description of the profile. For more information, see “Step 1. Specify a Profile Name and Profile Description” on page 38.
2. Add sources to the profile. For more information, see “Step 2. Add Sources to the Profile” on page 38.
3. Add, edit, or delete profile functions and enable session configuration. For more information, see “Step 3. Add Functions and Enable Session Configuration” on page 39.
4. Configure the profile functions. For more information, see “Step 4. Configure Profile Functions” on page 41.
5. Configure the profile session if you enabled session configuration. For more information, see “Step 5. Configure the Profile Session” on page 45.
Step 1. Specify a Profile Name and Profile Description

When you start the Profile Wizard, the General Properties page prompts you to specify a name for the data profile and add a description. When you name a data profile, you cannot use the following characters:
.+-=~`!%^&*()[]{}'\";:/?,<>\\|\t\r\n
Also, you cannot use an at sign (@) or a number at the beginning of a data profile name. When you are finished, click Next.
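A quick way to check a candidate profile name against these rules is sketched below. The helper function is hypothetical, not part of PowerCenter:

```python
# Characters a data profile name cannot contain (per the rules above).
INVALID = set(".+-=~`!%^&*()[]{}'\\\";:/?,<>|") | {"\t", "\r", "\n"}

def is_valid_profile_name(name):
    """True if the name avoids the forbidden characters and does not
    begin with an at sign or a number."""
    if not name or name[0].isdigit() or name[0] == "@":
        return False
    return not any(ch in INVALID for ch in name)

print(is_valid_profile_name("Customer_Profile"))  # True
print(is_valid_profile_name("1Profile"))          # False
```
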
Step 2. Add Sources to the Profile

After you specify the general properties, the Add Profile Sources page prompts you to select the sources you want to profile. If you selected the source definitions or mapplets you wanted to profile before you launched the Profile Wizard, this page does not display.
You can add source definitions, mapplets, and groups in a multi-group source definition or output from multiple groups in mapplets to the profile. When you profile a multi-group source, such as an XML source, you can select the groups in the source definition you want to profile or the entire source definition. For more information about eligible sources for profiling, see “Profiling Sources and Mapplet Output Data” on page 32.

Figure 3-2. Profile Wizard - Profile Sources Page (the page contains a Sources node, a Source Mapplets node, and Add Source and Remove Source buttons)

To add sources to the data profile, select a source definition, a group in a source definition, or a mapplet, and click the Add Source button. To remove a source, select the source definition, group, or mapplet and click the Remove Source button. When you are finished adding sources, click Next.

Tip: If you want to profile multiple sources, you can create a mapplet that combines multiple sources and create a data profile based on the mapplet output data.
Step 3. Add Functions and Enable Session Configuration

After you add sources to the data profile, use the Function-Level Operations page to perform the following tasks:
♦ Add functions to the profile. When you add functions to the profile, the Profile Wizard opens the Function Details page for you to configure details about the functions. For information about adding functions to a data profile, see “Step 3. Add Functions and Enable Session Configuration” on page 39.
♦ Edit existing functions. You can edit existing functions for the data profile.
♦ Delete functions from the profile. You can remove functions from the data profile.
♦ Organize functions. Use the up and down arrows to organize the functions in a data profile. The order of the functions does not affect the data profile results.
♦ Configure Domain Inference function parameters. You can widen or narrow the scope of the results from the Domain Inference function depending on whether you want to view the primary domains or exception data. For information about configuring Domain Inference function parameters, see “Domain Inference” on page 117.
♦ Enable session configuration. When you enable session configuration, the Profile Wizard prompts you to configure the profile session for the mapping. If you configured the default data profile options to always run profile sessions interactively, this option is selected by default. For more information about configuring default data profile options, see “Configuring Default Data Profile Options” on page 26.
Figure 3-3. Profile Wizard - Function-Level Operations Page (the page contains Add Function, Edit Function, and Delete Function buttons, up and down arrows, an option to configure Domain Inference function settings, and an option to enable session configuration)

If you finish adding functions to the profile and you have not enabled session configuration, click Finish. The Profile Wizard generates the mapping. For more information about generating the profile mapping, see “Generating the Profile Mapping” on page 45.

If you finish adding functions to the profile and you enabled session configuration, click Next. The Profile Wizard prompts you to configure the profile session. For more information about configuring a profile session, see “Step 5. Configure the Profile Session” on page 45.
Step 4. Configure Profile Functions

When you add a function to the profile, the Function Details page prompts you to perform the following tasks:
♦ Name the function. Function names are not case sensitive.
♦ Optionally, enter a description of the function. You can enter text to describe the function.
♦ Specify the type of function. You can specify source-level, column-level, or intersource functions. For details about these function types, see “Working with Functions” on page 103.
♦ Specify the functionality you want to use. The functionality you can configure depends on the type of function you choose. For details about functionality, see “Working with Functions” on page 103.

Figure 3-4. Profile Wizard - Function Details Page (on this page, you select the sources to apply the function to)

If you specify an intersource function, you must select at least two sources or two groups from different sources to apply the function to. If you added multiple sources to the profile, you must select the source you want to apply the function to. For more information about profile functions, see “Working with Functions” on page 103.

When you finish specifying the function type and functionality, click Next. The Profile Wizard prompts you to specify the functionality details for the function. The configuration of the Functionality Details window and available options change depending on the type of function you specify.

Figure 3-5. Profile Wizard - Function Role Details Page (Row Count Function)

Each function type has a subset of functionality you can configure to perform calculations on the source data. For more information about configuring functions, see “Working with Functions” on page 103. When you finish configuring the function, the Profile Wizard returns to the Function-Level Operations page described in “Step 3. Add Functions and Enable Session Configuration” on page 39. You can then continue to add and configure functions for the profile.
Configuring a Function with Group-by Columns Some functions let you generate profile data in a profile session run by group. When you configure a function, you can determine the column by which you want to group the data.
To select a group-by column:
1. Configure a source-level function or column-level function.
2. Select Generate Profile Data By Group.
3. Click Group By Columns. The Group By Columns dialog box displays.
4. Select the columns you want to group by. You cannot select the column for which you created the function as a group-by column in a column-level function. For example, if you created a Business Rule Validation function for the column Agreement_Status, you cannot select this column to group by.
5. Click OK.
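To make the group-by behavior concrete, here is a minimal sketch in plain Python, not PowerCenter code. The REGION column, the sample rows, and the rule are hypothetical; it only illustrates how a Business Rule Validation result can be tallied per group while the rule column itself is not a group key:

```python
from collections import defaultdict

# Hypothetical source rows. REGION is the group-by column; the rule column
# (AGREEMENT_STATUS) cannot itself be chosen as a group-by column.
rows = [
    {"REGION": "East", "AGREEMENT_STATUS": "ACTIVE"},
    {"REGION": "East", "AGREEMENT_STATUS": "EXPIRED"},
    {"REGION": "West", "AGREEMENT_STATUS": "ACTIVE"},
]

def rule(row):
    # Hypothetical Business Rule Validation on Agreement_Status
    return row["AGREEMENT_STATUS"] == "ACTIVE"

# Tally rule results per group: group -> [rows meeting rule, total rows]
by_group = defaultdict(lambda: [0, 0])
for r in rows:
    counts = by_group[r["REGION"]]
    counts[1] += 1
    if rule(r):
        counts[0] += 1
```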
Configuring a Function for Verbose Mode

When you configure a function for verbose mode, the PowerCenter Server writes verbose data to the Data Profiling warehouse when you run a profile session. You can configure verbose mode for the following functions:
♦ Source-level Business Rule Validation. For information about the source-level Business Rule Validation function, see “Business Rule Validation” on page 106.
♦ Column-level Business Rule Validation. For information about the column-level Business Rule Validation function, see “Business Rule Validation” on page 113.
♦ Domain Validation. For information about the Domain Validation function, see “Domain Validation” on page 115.
♦ Distinct Value Count. For information about the Distinct Value Count function, see “Distinct Value Count” on page 121.
♦ Orphan Analysis. For information about the Orphan Analysis function, see “Orphan Analysis” on page 123.
♦ Row Uniqueness. For information about the Row Uniqueness function, see “Row Uniqueness” on page 110.
The type of verbose data the PowerCenter Server can load to the target depends on the function for which you configure verbose mode.
You can load the following types of verbose data:
♦ Rows that meet the business rule. The PowerCenter Server writes rows to the Data Profiling warehouse that meet the business rule in the function. For example, for a Domain Validation function, you might load the values that match the specified domain pattern.
♦ Rows that do not meet the business rule. The PowerCenter Server writes rows to the Data Profiling warehouse that do not meet the business rule in the function. For example, when you create a Row Uniqueness function, you might load only duplicate rows.
♦ All rows. The PowerCenter Server writes all verbose data rows to the Data Profiling warehouse. For example, when you create a Domain Validation function, you might load verbose data for values that match the domain pattern as well as those that do not.
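As an informal illustration of the three options, the following plain-Python sketch (not PowerCenter code; the rows and the rule are made up) shows how a business rule partitions verbose data:

```python
# Hypothetical rows and rule, for illustration only.
rows = [
    {"EID": 12, "AMOUNT": 1044.78},
    {"EID": 19, "AMOUNT": 1340.42},
    {"EID": 21, "AMOUNT": 712.88},
]

def business_rule(row):
    # Hypothetical rule: AMOUNT > 1000
    return row["AMOUNT"] > 1000

meeting = [r for r in rows if business_rule(r)]        # rows that meet the rule
violating = [r for r in rows if not business_rule(r)]  # rows that do not meet it
all_rows = rows                                        # all rows
```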
When you configure a function, you can select the columns to profile in verbose mode. If the precision for the data in the column you select exceeds 1,000 characters, the PowerCenter Server may write truncated data to the Data Profiling warehouse.

To configure a function for verbose mode:
1. Configure a function that can output data in verbose mode.
2. Select the type of verbose data to load.
3. Click Select Columns. The Columns for Verbose Data dialog box displays. By default, all columns in the source are selected.
4. Clear the columns you do not want to profile in verbose mode.
5. Click OK.
Step 5. Configure the Profile Session

If you enabled session configuration on the Function-Level Operations page, the Profile Wizard opens the Profile Run page so that you can configure the profile session properties. You can configure and run a profile session, or save the session configuration and run the profile session at another time. For information about configuring profile sessions, see “Running Sessions from the Profile Manager” on page 63.
Generating the Profile Mapping

After you create the profile, the Designer generates a mapping based on the profile metadata. You must save changes to store the new data profile and profile mapping in the repository. The Designer saves the data profile and profile mapping in the repository folder that stores the source or mapplet output you profiled. You can view profile mappings in the Designer. The Designer denotes profile mappings in the Repository Navigator with a Profile Mappings icon.

Figure 3-6 shows Data Profile mappings in the Designer:

Figure 3-6. Data Profile Mappings in the Designer
The profile mapping name is based on the data profile name. By default, the mapping name contains the prefix m_DP_. For example, if you name the data profile SalaryValidation, the mapping name for the data profile is m_DP_SalaryValidation. You can change the naming convention for profile mappings in the default data profile options. For more information about changing default data profile options, see “Configuring Default Data Profile Options” on page 26.
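The naming convention can be expressed as a simple string rule. This is an illustrative sketch only; the prefix is whatever you configure in the default data profile options:

```python
def profile_mapping_name(profile_name, prefix="m_DP_"):
    # Default convention: mapping name = prefix + data profile name
    return prefix + profile_name

# profile_mapping_name("SalaryValidation") -> "m_DP_SalaryValidation"
```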
Editing a Data Profile

You can edit a data profile to change any of the properties you configured. When you edit a data profile, the Designer replaces the profile mapping. You can edit an auto profile to change the profile name or description. However, if you add a source to an auto profile, it becomes a custom profile. For more information about auto profiles, see “Creating an Auto Profile” on page 34.

When you delete a source for which you have defined functions in a data profile, the Designer marks the functions as invalid. If a function is invalid, the data profile is invalid. You can edit the functions to use a valid source or sources. Similarly, if you delete a column, change the column datatype, or rename the column, all functions using this column are invalid. You can edit the functions to use the modified version of the column or another column.

To edit a data profile:
1. Launch the Profile Manager. To launch the Profile Manager from the Source Analyzer, select Sources-Profiling-Launch Profile Manager. To launch the Profile Manager from the Mapplet Designer, select Mapplets-Profiling-Launch Profile Manager. The Profile Manager displays.
2. From the Profile View, select the profile you want to edit.
3. Select Profile-Edit. The Profile Wizard displays.
4. Use the Profile Wizard to change any of the data profile properties. For more information about using the Profile Wizard, see “Creating a Custom Profile” on page 38.
Deleting a Data Profile

You can delete any data profile from the repository. When you delete a data profile, the Profile Manager gives you the option to delete the associated profile mapping from the repository. However, if you delete a profile mapping, the Designer does not delete the data profile associated with the mapping. You must delete the data profile manually.

Tip: If you delete a profile mapping, but do not want to delete the data profile, you can save the data profile and regenerate the profile mapping in the Profile Manager later.

To delete a data profile:
1. Launch the Profile Manager. To launch the Profile Manager from the Source Analyzer, select Sources-Profiling-Launch Profile Manager. To launch the Profile Manager from the Mapplet Designer, select Mapplets-Profiling-Launch Profile Manager. The Profile Manager displays.
2. From the Profile View, select the profile you want to delete.
3. Select Profile-Delete. The Profile Manager asks if you want to delete the selected data profile.
4. Click Yes to delete the data profile and the associated profile mapping.
5. Click OK.
6. Click Repository-Save.
Working with Profile Mappings

When you create a data profile, the Designer generates mappings to create the metadata for the data profile. You can copy or deploy profile mappings to other folders or other repositories. Or, you can combine profile mappings with other mappings in the Designer.
Copying Data Profiling Objects

After you create a data profile and the Designer generates the profile mapping, you can copy the mapping to another repository or repository folder. When you copy a profile mapping, the Designer does not copy the data profile the mapping is associated with. To copy all the objects associated with a profile, you must copy or deploy the entire folder to another repository or repository folder. You cannot copy a profile mapping with a reusable domain. If you copy a mapping with a reusable domain and run a session for the mapping, the session fails. For information about copying folders, see the Repository Guide.
Combining Data Profile Mappings with Other Mappings

You can combine profile mappings with other mappings. This allows the PowerCenter Server to read the source data once to perform mapping logic and profile the data. To do this, make a copy of the profile mapping and edit the copy rather than modify the original data profile mapping. If you edit the original data profile mapping, you may lose the changes you added. This occurs because the Designer regenerates the mapping each time you edit the data profile from the Profile Manager. When the Designer regenerates the mapping, it overwrites any changes.
Using Mapplets to Extend Data Profiling Functions

A function can operate on a column, a source, or multiple sources. Sometimes, you need to combine data from multiple sources or multiple columns to use a particular function with it. Or, you may need to aggregate data in order to get the profiling results you want. For example, you want to create a Business Rule Validation function that operates on aggregate values from a source. You need to aggregate the values before you can profile the data using the Business Rule Validation function. To do this, you can use a mapplet. You can use a mapplet when you want to profile the following information:
♦ Aggregate data from a single source
♦ Data from two or more sources with one or more matching ports
♦ Data from two sources with all matching ports
Extending Data Profiling Functionality with Mapplets

Complete the following steps to extend data profiling functionality with mapplets:
1. Create the mapplet. Create a mapplet to aggregate data or join or merge sources.
2. Create a data profile using the mapplet output data as a source. Create an auto profile based on the mapplet output data. Or, create a custom profile based on the mapplet output data and add functions to the data profile.
3. Run the data profile. Run the profile session. When you run the profile session from the Profile Manager, it processes the mapplet data as it would if you were running a workflow. You do not need to run a workflow to aggregate or join the data.
4. View the Data Profiling report. Open the Data Profiling report in the Profile Manager or PowerAnalyzer to view the results.
Profiling Aggregate Data

When you want to profile aggregate data, you can use a mapplet to aggregate the data before you create the data profile. Use the mapplet to aggregate data when you want to use a column-level function on aggregate data. For example, you have an Employee Expenses flat file source that provides information on employee expenditures.

Table 3-1 shows the data from the Employee Expenses flat file source:

Table 3-1. Employee Expenses Sample Source Data

EID   Spending Date   Amount   Reason
12    12/3/2003       123.22   Acquired new books
19    4/09/2004       600.21   Purchased ticket to Boston
213   6/29/2004       215.61   Purchased new software
12    6/12/2004       921.56   Acquired new books
19    6/16/2004       740.21   Purchased ticket to New York
21    7/21/2004       712.88   Purchased a new computer
To test data consistency, you want to create a data profile that shows employees who spent over $1,000 in the last six months. To get this information, aggregate and filter the spending amounts before you create a data profile.
Creating the Mapplet

Create a mapplet to aggregate and filter the data. Later, you can use the Profile Manager to create a data profile for the mapping.

Figure 3-7 shows a mapplet that aggregates and filters the Spendings source data:

Figure 3-7. Sample Mapplet to Aggregate and Filter Mapplet Output Data

When you create the mapplet, add a Filter transformation to filter out purchases older than six months. Add the following condition to the Filter transformation:

DATE_DIFF(SYSDATE, SPENDING_DATE, 'MM') < 6
After you filter the data, add an Aggregator transformation to aggregate the cumulative spending for each employee.
Figure 3-8 shows the Aggregator transformation settings: Figure 3-8. Sample Mapplet - Aggregator Transformation Settings
In the Aggregator transformation, add the cumulative_emp_spending output port to aggregate the total amount of money spent by all employees. Group the results by employee ID (EID), so you can see the total for each employee. Connect the EID port and the cumulative_emp_spending port to the Output transformation. You can then use the Profile Manager to profile the mapplet output data.

Note: You do not need to run a session to generate the correct mapplet output data. When you run a profile session, the Profile Manager processes the mapplet before it profiles the mapplet output data.
Creating the Data Profile

After you create the mapplet to filter and aggregate the data, you can profile the mapplet output data. From the Profile Manager, create a custom profile. The custom profile locates the employees who spent more than $1,000 in the last six months using the source-level Business Rule Validation function. Create a Business Rule Validation function using the following expression:

cumulative_emp_spending > 1000
When you specify the type of verbose data to load to the Data Profiling warehouse, select valid rows only. This ensures that you can view the verbose data for the employees who spent over $1,000 in the last 6 months.
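The mapplet logic above (filter out purchases older than six months, aggregate spending per employee, then apply the business rule) can be sketched in plain Python. This is illustrative only: PowerCenter does this with Filter and Aggregator transformations, the run date stands in for SYSDATE and is an assumption, and the month arithmetic only approximates DATE_DIFF:

```python
from collections import defaultdict
from datetime import date

# Assumed "current" date for the example (stand-in for SYSDATE).
TODAY = date(2004, 9, 1)

expenses = [  # (EID, spending date, amount) from Table 3-1
    (12, date(2003, 12, 3), 123.22),
    (19, date(2004, 4, 9), 600.21),
    (213, date(2004, 6, 29), 215.61),
    (12, date(2004, 6, 12), 921.56),
    (19, date(2004, 6, 16), 740.21),
    (21, date(2004, 7, 21), 712.88),
]

def month_diff(later, earlier):
    # Rough stand-in for DATE_DIFF(later, earlier, 'MM')
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

# Filter transformation: keep purchases from the last six months
recent = [(eid, amt) for eid, d, amt in expenses if month_diff(TODAY, d) < 6]

# Aggregator transformation: cumulative spending per employee, grouped by EID
cumulative = defaultdict(float)
for eid, amt in recent:
    cumulative[eid] += amt

# Business Rule Validation: cumulative_emp_spending > 1000
over_limit = {eid: total for eid, total in cumulative.items() if total > 1000}
```

With this data, only employee 19 exceeds the $1,000 limit, which matches the report described below.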
Viewing the Data Profiling Report

Once you create the data profile, you can run a profile session. After you run a profile session, you can view the profile results in the PowerCenter Data Profiling report or the PowerAnalyzer Data Profiling report. For example, when you view a PowerCenter Data Profiling report, you can view the verbose data for the rows that satisfy the Business Rule Validation function. You can see that the employee with employee ID 19 has spent $1,340.42. All other employees have spent under $1,000, and therefore do not appear in the Verbose Report Summary.

Figure 3-9 shows the verbose summary report:

Figure 3-9. Sample Verbose Summary Report
Profiling Multiple Sources with One Matching Port

When you want to profile two related sources with one or more matching ports, you can create a mapplet with a Joiner transformation to join the sources. Then, you can profile the mapplet output data. For example, you have an Items relational source and a Manufacturers relational source. The Items relational source contains information about items, such as item description, wholesale cost, and item price. The Manufacturers relational source contains information about the manufacturers who manufacture the items. You want to find the manufacturers whose items sell with a markup that is greater than 50 percent of the wholesale cost.
Creating the Mapplet

To determine this information, you need to join the two sources before you can create a data profile. Create the following mapplet using a Joiner transformation to join the two sources.

Figure 3-10 shows a mapplet used to join data from two sources:

Figure 3-10. Sample Mapplet to Join Data from Two Sources

Use the following join condition:

MANUFACTURER_ID1 = MANUFACTURER_ID
Creating the Data Profile

After you join the sources, you can profile the mapplet output data to find the manufacturers whose items sell with a markup that is greater than 50 percent of the wholesale cost. Create a custom profile with a source-level Business Rule Validation function and enter the following expression in the Rule Editor:

PRICE > (WHOLESALE_COST + (WHOLESALE_COST * .50))
When you specify the type of verbose data to load to the Data Profiling warehouse, select valid rows only. This allows you to see the verbose data for the rows that meet the business rule.
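The join-then-validate logic can be sketched outside PowerCenter as follows. This is an illustrative sketch only: the item and manufacturer rows are hypothetical, and PowerCenter performs the join in a Joiner transformation rather than in code:

```python
# Hypothetical source rows, for illustration only.
items = [
    {"ITEM": "Tent",  "MANUFACTURER_ID": 101, "WHOLESALE_COST": 100.0, "PRICE": 160.0},
    {"ITEM": "Stove", "MANUFACTURER_ID": 102, "WHOLESALE_COST": 40.0,  "PRICE": 55.0},
]
manufacturers = {101: "AlpineCo", 102: "CampWorks"}  # MANUFACTURER_ID1 -> name

# Join condition: MANUFACTURER_ID1 = MANUFACTURER_ID
joined = [
    dict(item, MANUFACTURER=manufacturers[item["MANUFACTURER_ID"]])
    for item in items
    if item["MANUFACTURER_ID"] in manufacturers
]

# Business rule: PRICE > (WHOLESALE_COST + (WHOLESALE_COST * .50))
high_markup = [r for r in joined if r["PRICE"] > r["WHOLESALE_COST"] * 1.5]
```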
Viewing the Data Profiling Report

Once you create a data profile, you can run a profile session. After you run a profile session, you can view the profile results in the PowerCenter Data Profiling report or the PowerAnalyzer Data Profiling report. For example, when you view a PowerCenter Data Profiling report, you can see the information for the companies whose items sell with a markup greater than 50 percent of the wholesale cost.
Figure 3-11 shows the Data Profiling Report for the mapplet output data: Figure 3-11. Sample Data Profiling Report for Mapplet Output Data
Profiling Sources that Use All Matching Ports

When you want to profile data from two sources that use the same ports, you can use a mapplet to merge the sources and then profile the mapplet output data. For example, you have the Orders1 and Orders2 flat file sources that contain order information from two different stores. The sources use the same structure, but are stored in two different files. Both sources contain the following ports:
♦ OrderID
♦ CID
♦ DATE
♦ AMOUNT
You want to create an auto profile that displays profile data from both sources. You also want to ensure that the order IDs are distinct for order items in both stores.
Creating the Mapplet

To create an auto profile that displays data from both sources, create a mapplet that uses a Union transformation to merge the sources.
Figure 3-12 shows a mapplet that uses a Union transformation to merge source data: Figure 3-12. Sample Mapplet to Merge Mapplet Output Data
After you merge the data using the Union transformation, you can profile the mapplet output data using the Profile Manager.
Creating the Data Profile

Create an auto profile based on the merged source data. This ensures that functions, such as the Row Uniqueness function, can generate reports based on all the data rather than only a part of the source data. When you specify the type of verbose data to load to the Data Profiling warehouse, select all rows to see both valid and exception data.
Viewing the Data Profiling Report

Once you create the data profile, you can run a profile session. After you run a profile session, you can view the profile results in the PowerCenter Data Profiling report or the PowerAnalyzer Data Profiling report. For example, when you view a PowerCenter Data Profiling report, you view data from both branches of the store. The percentage of duplicate values for the ORDERID column is 0 percent. Therefore, the Order ID for each order item is unique.
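The merge-and-check logic can be sketched informally as follows (illustrative rows only; a Union transformation appends rows from sources that share the same ports, and the Row Uniqueness function reports duplicate percentages):

```python
# Hypothetical rows with ports (OrderID, CID, DATE, AMOUNT).
orders1 = [(1001, 7, "6/01/2004", 45.00), (1002, 9, "6/02/2004", 12.50)]
orders2 = [(2001, 7, "6/03/2004", 99.99), (2002, 3, "6/04/2004", 20.00)]

# Union transformation: rows from both sources appended into one stream
merged = orders1 + orders2

# Row Uniqueness check on the OrderID column
order_ids = [row[0] for row in merged]
duplicates = len(order_ids) - len(set(order_ids))
duplicate_pct = 100.0 * duplicates / len(order_ids)
```

A duplicate percentage of 0 means every OrderID in the merged data is unique.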
Figure 3-13 shows the Data Profiling report for the Union mapplet: Figure 3-13. Sample Data Profiling Report for the Union Mapplet
Troubleshooting

I received a message that my data profile updated successfully, but the mapping associated with the data profile is not valid and the data profile must be run in interactive mode.

When you update a data profile, the Designer validates the profile mapping. After validating the mapping, the Designer displays a message if the mapping includes blocking transformations. Mappings that include blocking transformations may hang when you run the associated session. For information about how blocking transformations affect mapping validation, see the Designer Guide.
Chapter 4

Running Profile Sessions

This chapter includes the following topics:
♦ Overview, 62
♦ Running Sessions from the Profile Manager, 63
♦ Profiling Data Samples, 68
♦ Profiling Relational Data Samples, 70
♦ Creating a Session in the Workflow Manager, 74
♦ Troubleshooting, 75
Overview

To generate information about source data from a data profile, you must create and run a profile session. You can create and run profile sessions from the following tools:
♦ Profile Manager. You can create and run profile sessions from the Profile Manager. This allows you to run sessions immediately to quickly obtain profile results from a source. You can also run a session from the Profile Manager when you want to profile a sample of source data instead of the entire source. When you create sessions from the Profile Manager, the Profile Manager creates a workflow and associates it with the session. For information about running sessions from the Profile Manager, see “Running Sessions from the Profile Manager” on page 63.
♦ Workflow Manager. If you want to monitor ongoing data quality issues, you can create a persistent session and workflow for the profile mapping in the Workflow Manager and add a scheduling task. This allows you to perform a time-dimensional analysis of data quality issues. You can also edit and run persistent sessions that you create in the Profile Manager from the Workflow Manager. For more information about creating and running sessions and workflows, see the Workflow Administration Guide.
Working with Data Samples

When you run a session from the Profile Manager, you can create a data profile based on a sample of data rather than the entire source. Use data sampling when you want to understand general trends within the data or view some exception data. For information about data sampling, see “Profiling Data Samples” on page 68 and “Profiling Relational Data Samples” on page 70.
Running Sessions from the Profile Manager

You can run a profile session from the Profile Manager to quickly profile sources during mapping development. When you run a session from the Profile Manager, the session runs interactively. You can run sessions from the Profile Manager in the following cases:
♦ Immediately after you create a data profile.
♦ At any time for an existing data profile.
You can run temporary or persistent sessions. A temporary session runs on demand and is not stored in the repository. A persistent session can run on demand and is stored in the repository. When you run a temporary session, the PowerCenter Server uses normal mode to load data to the target. You can configure the default data profile options to run a session immediately after you create a data profile. For more information about configuring default data profile options, see “Configuring Default Data Profile Options” on page 26.
Configuring a Session in the Profile Wizard

You must configure a session before you can run it. When you configure a session to run from the Profile Manager, you create and configure the session in the Profile Wizard. When you create a session in the Profile Wizard, you can run it later or run it interactively.

Figure 4-1 shows the Profile Run page of the Profile Wizard:

Figure 4-1. Profile Wizard - Profile Run Page
Table 4-1 describes the session property settings on the Profile Run page:

Table 4-1. Profile Run Properties for Profile Sessions

Server (Required). Select a PowerCenter Server to run the profile session.

Profile Run (Required). Select one of the following options:
- Create a persistent session for the profile. A persistent session is stored in the repository.
- Create a temporary session for this profile. A temporary session is not stored in the repository.

Sampling (Required). Select one of the following options:
- No Sampling. Select to disable profile sampling. When you disable sampling, the Designer generates a profile based on all selected source data.
- Sample First N Rows. Select to read the first N rows from the source (up to the number of rows you specify).
- Automatic Random Sampling. Select to run a data profile for a random sample of source data and allow the PowerCenter Server to determine the percentage of data to sample. The percent sampled represents a percentage of the total row count. The PowerCenter Server scales the percent of data to sample based on the total row count.
- Manual Random Sampling. Select to specify a percentage of data to sample. You can specify 1 to 100 percent. The PowerCenter Server selects random row sets from all parts of the source data.
For more information about using data samples, see “Profiling Data Samples” on page 68.
After you configure the options for the Profile Run page, you must configure options for the Session Setup page.
Figure 4-2 shows the Session Setup page of the Profile Wizard: Figure 4-2. Profile Wizard - Session Setup Page
Table 4-2 describes the session property settings on the Session Setup page:

Table 4-2. Session Setup Properties for Profile Sessions

Source Properties (Required). Configure source connection properties on the Connections tab. Configure source properties on the Properties tab. Configure reader properties on the Reader tab. The source properties are the same as those in the session properties you configure in the Workflow Manager. For more information about session property settings for sources, see the Workflow Administration Guide.

Target Connections (Required). The relational database connection to the Data Profiling warehouse database. This is the relational database connection you configured for the Data Profiling warehouse in the Workflow Manager. For more information about configuring relational database connections, see the Workflow Administration Guide.

Reject File Directory (Required). Directory for session reject files. The default reject file directory is $PMBadFileDir\.

Run Session (Optional). Select to run the session immediately. Otherwise, the Profile Manager saves the session configuration information and exits.
Running a Session when You Create a Data Profile

You can configure and run profile sessions when you create a data profile. To do this, select Configure Session on the Auto Profile Column Selection page of the Profile Wizard when you create an auto profile. Or, select Configure Session on the Function-Level Operations page of the Profile Wizard when you create a custom profile. After the Profile Wizard generates the profile mapping, it prompts you to configure and run the session. If you want to run a session immediately after creating a data profile, configure the Always Run Profile Interactively option in the default data profile options. For more information about configuring default data profile options to always run profile sessions, see “Configuring Default Data Profile Options” on page 26. For more information about running sessions from the Profile Manager, see “Using the Profile Manager” on page 6.
Running a Session for an Existing Data Profile

If you want to run a session for an existing data profile, select the data profile in the Profile Manager and click Profile-Run. The Profile Wizard prompts you to configure the session before running it. For more information about the Profile Manager, see “Using the Profile Manager” on page 6.
Monitoring Interactive Sessions

After you configure an interactive session, the PowerCenter Server runs the session. You can monitor the progress of the session from the Profile Manager. The appearance of the profile node icon in the Navigator window changes depending on the status of the session. The session status displays in the Profile Session Status window.

Figure 4-3 shows profile session status in the Profile Manager:

Figure 4-3. Profile Session Status in the Profile Manager
You can also monitor the profile session from the Workflow Monitor. The PowerCenter Server creates a workflow for profile sessions. For more information about monitoring workflows, see the Workflow Administration Guide. If an interactive session fails, the PowerCenter Server writes a session log. You can review the session log, correct any errors, and restart the session. To view the session log, click View-Session Log in the Profile Manager. When the session successfully finishes, you can view reports that contain the profile results. For more information about viewing Data Profiling reports, see “Viewing Profile Results” on page 77.
Profiling Data Samples

When you run a session from the Profile Manager, you can create a data profile based on a sample of data rather than the entire source. Use data sampling when you want to understand general trends within the data or view some exception data. For example, you have a large data source and want to quickly verify that a business rule you created in a custom profile returns the data you expect. Instead of running a profile session against the entire source, run the session against a sample of the data. When you run a profile session with data sampling, the Data Profiling report shows the estimated total number of rows in the source, the number of rows sampled, the mode of sampling used, and the percentage of the total source sampled. For information about viewing Data Profiling reports from the Profile Manager, see “Overview” on page 78. When you sample relational data, the PowerCenter Server can sample data by delegating sampling to the database or by sampling as it reads data from the database. You may be able to improve data sampling performance and accuracy by enabling the PowerCenter Server to delegate sampling to the database. For more information about working with relational data samples, see “Profiling Relational Data Samples” on page 70.

To profile samples of data, complete the following steps:
1. Select a function type.
2. Select a data sampling mode.
Step 1. Select a Function Type

When you create data profiles, some functions can return useful data from data samples. Other functions must perform calculations on all source data to return useful information.

Table 4-3 describes function behavior with data samples:

Table 4-3. Function Behavior with Data Samples
♦ Source-level functions. Profiles created with source-level functions and data samples can display general patterns within the data.
♦ Column-level functions. Data profiles created with the following column-level functions and data samples can display general patterns within the data:
  - Domain Inference
  - Business Rule Validation
  - Distinct Value Count
  Data samples for the NULL Count and Average Value Aggregate functions display general patterns within the data. However, the Minimum and Maximum Value Aggregate functions can return inconsistent results because a column can have unusually high maximum values or unusually low minimum values that a sample misses.
♦ Intersource functions. You cannot use data samples with intersource functions.
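The caveat about the Minimum and Maximum Value Aggregate functions can be illustrated with a small sketch (Python, with made-up data; this is not how PowerCenter computes the functions): a sample that misses a handful of outlier rows reports a wrong minimum and maximum, while the average stays representative.

```python
# Hypothetical source: 100,000 values with two planted outlier rows.
source = list(range(100_000))
source[123] = -10_000    # unusually low minimum
source[456] = 999_999    # unusually high maximum

# A 1 percent sample (every 100th row, for a deterministic sketch)
# misses both outlier rows.
sample = source[::100]

# Min and Max from the sample disagree with the true values...
assert min(sample) != min(source)
assert max(sample) != max(source)

# ...but the sampled average is still close to the true average.
true_avg = sum(source) / len(source)
sample_avg = sum(sample) / len(sample)
assert abs(sample_avg - true_avg) / true_avg < 0.01
```

The same effect appears with random sampling: unless the sample happens to include the outlier rows, the reported extremes understate the true range.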
Step 2. Select a Data Sampling Mode

When you use sample data to create a data profile, choose a sampling mode based on how much control you want over the amount of data sampled. Choose from the following data sampling modes:
♦ Automatic Random Sampling. The PowerCenter Server determines the percentage of data to sample, scaling the percentage based on the size of the source data. Use this option if you do not know the size of the source data.
♦ Manual Random Sampling. You specify a percentage of data to sample, from 1 to 100. The PowerCenter Server selects random data from all parts of the source. Use this option if you want to control the percentage of data to sample.
♦ Sample First N Rows. The PowerCenter Server reads the first rows of the source, up to the number of rows you specify. Use First N Rows sampling when you cannot use manual random sampling for the data profile function, or when you want to specify the exact number of rows to profile. For example, if you have a very large source, you may want to sample the first 100 rows to understand some basic information about the source.
Profiling Relational Data Samples

When you create a data profile based on relational data samples, the PowerCenter Server can obtain data samples in the following ways:
♦ By sampling as it reads data. The PowerCenter Server applies a sampling algorithm to the source data as it reads it from the database.
♦ By delegating sampling to the database. The database performs the sampling operation, and the PowerCenter Server reads the sampled data afterward.

When you run a profile session configured to sample data, the PowerCenter Server first attempts to delegate sampling to the database. If it cannot, it applies the sampling algorithm to the data as it reads it from the database. The PowerCenter Server can delegate sampling to the following databases:
♦ Oracle
♦ DB2

Often, you can optimize performance and accuracy by enabling the PowerCenter Server to delegate sampling to the database. To do so, ensure that you use native database drivers to connect to the database. You do not need to complete any other configuration steps.
Performing Sampling as the PowerCenter Server Reads Data

When the PowerCenter Server performs sampling as it reads data, it selects the data samples using the C rand function. You can specify that the PowerCenter Server sample data in the following ways:
♦ Manual random sampling. You specify a percentage of data to sample, from 1 to 100. The PowerCenter Server selects random data from the source up to the percentage you specify.
♦ Automatic random sampling. The PowerCenter Server determines the percentage of data to sample based on the size of the source data.
Using Manual Random Sampling When you use manual random sampling, you specify a percentage of data for the PowerCenter Server to sample, and it samples that percentage of data as it reads rows. For example, if you have 100,000 rows and you sample 50 percent of the data, the PowerCenter Server reads 50 percent of the first 10,000 rows, and 50 percent of the next 10,000 rows, until it samples 50 percent of the data across the data set.
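As a sketch (in Python rather than C, and not PowerCenter's actual code), per-row manual random sampling can be modeled as one uniform draw per row, which naturally spreads the sample evenly across the data set:

```python
import random

def manual_random_sample(rows, percent, seed=None):
    """Keep roughly `percent` of the rows by making one uniform
    random draw per row as it streams past (analogous to using
    the C rand function while reading)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() * 100 < percent]

rows = list(range(100_000))
sample = manual_random_sample(rows, 50, seed=1)

# About half the rows are kept, and the first and second halves of
# the source contribute roughly equal shares.
first_half = sum(1 for r in sample if r < 50_000)
```

Because every row gets the same chance of selection, each 10,000-row block contributes close to the requested percentage, matching the behavior described above.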
Chapter 4: Running Profile Sessions
Figure 4-4 shows a manual PowerCenter sample:

[Figure 4-4. Manual PowerCenter Sample: the PowerCenter Server samples 50 percent of each 10,000-row block, from 10K through 100K.]
Using Automatic Random Sampling

When the PowerCenter Server performs automatic random sampling, it determines the percentage of data to sample as it reads data from the database. Because you do not specify a percentage, the PowerCenter Server begins by sampling 100 percent of the data and samples progressively less as it reads from the database. This ensures that it samples sufficient data for accuracy. For example, you have 100,000 rows of data to sample. The PowerCenter Server samples 100 percent of the first 10,000 rows, 90 percent of the next 10,000 rows, 80 percent of the next 10,000 rows, and progressively less for each subsequent group of 10,000 rows.

Figure 4-5 shows how data is sampled using automatic random sampling:

[Figure 4-5. Automatic Random Sampling: the sampling rate decreases from 100 percent of the first 10,000-row block to 10 percent of the last, in steps of 10 percent.]
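The taper in the example above can be sketched as follows (the actual percentages PowerCenter uses may differ by source size; this simply mirrors the 10-points-per-block decrease described for a 100,000-row source):

```python
import random

def automatic_random_sample(rows, block_size=10_000, seed=None):
    """Sample 100% of the first block, 90% of the next, and so on,
    tapering by 10 points per block down to a 10% floor."""
    rng = random.Random(seed)
    kept = []
    for i, row in enumerate(rows):
        percent = max(100 - (i // block_size) * 10, 10)
        if rng.random() * 100 < percent:
            kept.append(row)
    return kept

sample = automatic_random_sample(range(100_000), seed=1)
# The first 10,000 rows are all kept; later blocks contribute less.
```

Note the consequence explored in the historical-data discussion later in this chapter: rows near the start of the source are heavily over-represented relative to rows near the end.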
Delegating Sampling to the Database

When the PowerCenter Server delegates sampling to the database, the database uses native algorithms to sample data. The database samples data in the same way for both automatic and manual random sampling because it uses the same algorithm in both cases. Once the database has sampled the data, the PowerCenter Server reads the sampled data. For example, if you have 100,000 rows of data and you sample 50 percent, the database samples 50 percent of the first 10,000 rows, 50 percent of the next 10,000 rows, and so on. The database selects an equal percentage of data samples from each part of the data source, so while the samples are random, they are drawn equally from across the data set.
Figure 4-6 shows sampling delegated to the database:

[Figure 4-6. Data Sampling Delegated to the Database: the database samples 50 percent of each 10,000-row block, from 10K through 100K.]
Note: To enable database sampling, use native database drivers to connect to the database.
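For reference, both supported databases expose native sampling syntax. The statements below are illustrative only: this guide does not show the SQL that PowerCenter actually generates, and the table names are hypothetical.

```python
def sampling_query(table, percent, dialect):
    """Build an illustrative pushed-down sampling query."""
    if dialect == "oracle":
        # Oracle row sampling via the SAMPLE clause.
        return f"SELECT * FROM {table} SAMPLE ({percent})"
    if dialect == "db2":
        # DB2 row sampling via TABLESAMPLE BERNOULLI.
        return f"SELECT * FROM {table} TABLESAMPLE BERNOULLI ({percent})"
    raise ValueError(f"no native sampling syntax known for {dialect!r}")

print(sampling_query("CUSTOMERS", 50, "oracle"))
# → SELECT * FROM CUSTOMERS SAMPLE (50)
```

Pushing the sampling predicate into the database is what lets the server read only the sampled rows instead of the full table.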
Improving Data Sampling Accuracy and Performance

You can improve data sampling accuracy and performance when you work with the following data sources:
♦ Data sources with historical data
♦ Large data sources
Working with Historical Data

If you have historical data, you can improve data accuracy by enabling the PowerCenter Server to delegate sampling to the database. For example, you have 100,000 rows of data: the latest data is stored in the first 30,000 rows, newer data in the next 30,000 rows, and the oldest data in the last 40,000 rows. When you select automatic random sampling and the PowerCenter Server samples as it reads data, it samples the greatest percentage of the latest data and the smallest percentage of the oldest data. If there are great differences between the historical data and the latest data, the resulting data profile reflects the latest data more accurately than the older data.

Figure 4-7 shows how Data Profiling processes a historical sample of data:

[Figure 4-7. Data Profile of a Historical Sample of Data: the sampling rate falls from 80-100 percent of the newest data (2000-2004), through 50-70 percent of the newer data (1995-2000), to 10-40 percent of the oldest data (1990-1995).]
To improve data accuracy, ensure that the PowerCenter Server delegates sampling to the database. When you enable database sampling, the database samples data equally from across the data set. When the PowerCenter Server cannot delegate sampling to the database, you can improve data profiling results for historical data by using manual random sampling. Manual random sampling, like database sampling, samples a fixed percentage of data across the data set.

Table 4-4 shows recommended sampling percentages for the most accurate results with manual random sampling:

Table 4-4. Recommended Percentages for Manual Random Sampling
♦ 1 million or more rows: 1 percent
♦ 100,000 or more rows: 10 percent
♦ 100 - 1,000 rows: greater than 50 percent
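Table 4-4's guidance can be encoded as a small lookup (a sketch; the thresholds come straight from the table, and sizes the table does not cover return None):

```python
def recommended_sample_percent(row_count):
    """Return the manual random sampling guidance from Table 4-4
    as a string, or None where the table gives no recommendation."""
    if row_count >= 1_000_000:
        return "1 percent"
    if row_count >= 100_000:
        return "10 percent"
    if 100 <= row_count <= 1_000:
        return "greater than 50 percent"
    return None  # e.g. 1,001 - 99,999 rows: not covered by the table

print(recommended_sample_percent(2_500_000))  # → 1 percent
```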
Working with Large Data Sources

You can improve performance when you have a large source by ensuring the PowerCenter Server delegates sampling to the database. For example, you have a data source with 100,000,000 rows and you select automatic random sampling. When the PowerCenter Server delegates sampling to the database, the database selects a sample of 1,000,000 rows and the PowerCenter Server reads only those 1,000,000 rows. However, if the PowerCenter Server performs the sampling algorithm itself, it must first read all 100,000,000 rows and then sample from them.
Creating a Session in the Workflow Manager

You can create persistent profile sessions and workflows in the Workflow Manager. You may want to create and configure a profile session and workflow from the Workflow Manager to monitor ongoing data quality issues, or to have more control over the session and workflow properties. You may also want to add workflow tasks, such as a scheduler that allows you to run the workflow regularly. This allows you to perform a time-dimensional analysis of data quality issues.

When you create a persistent session and workflow in the Workflow Manager, the following rules apply:
♦ You must run the session from the Workflow Manager.
♦ You cannot run the session in real time.
♦ You must view the session status in the Workflow Monitor.
You can view Data Profiling reports that you generate from the Workflow Manager in the Profile Manager or PowerAnalyzer. For more information about using the Workflow Manager and Workflow Monitor, see the Workflow Administration Guide. For more information about viewing reports, see “Viewing Profile Results” on page 77.
Troubleshooting

An interactive session failed due to an invalid SQL query.
The interactive session failed because you need to specify an owner name for the Target Name Prefix in the session properties. However, you cannot specify the owner name for an interactive session. To specify an owner name for the session, configure a persistent session in the Workflow Manager and specify the owner name for the Target Name Prefix in the session properties. For more information about configuring persistent sessions, see “Creating a Session in the Workflow Manager” on page 74.

I tried to run an interactive session for an SAP R/3 source, but the session failed.
When you create a data profile for an SAP R/3 source, the Designer does not generate an ABAP program for the data profile. If you run an interactive session immediately after creating the data profile, the session fails. Create an ABAP program for the profile mapping and then run the session.

An interactive session failed with an error message stating that the buffer block size is too low.
You ran an interactive session for a source with a large number of rows with high precision, or for a multi-group source with a large number of groups. As a result, the PowerCenter Server could not allocate enough memory blocks to hold the data, and the session failed. Create a persistent session for the profile mapping in the Workflow Manager. In the session properties, set the buffer block size to the value that the error message in the session log recommends. For more information about optimizing sessions and setting the buffer block size, see the Workflow Administration Guide.

My profile session fails but no error message displays.
If a data profile contains a Business Rule Validation function and a column name on which the function is based changes, the profile session fails. The PowerCenter Server does not display the error on screen, but it writes the error to the log file.
I tried to restart a persistent session from the Profile Manager and I received the following error message: ERROR: Failed to fetch Workflow [id = <workflow ID>] from the repository. Please check the session log.
However, when I try to locate the session log, no session log for this session run exists.
You deleted the workflow for the profile session in the Workflow Manager. When you delete a workflow in the Workflow Manager, the PowerCenter Server cannot run the session or generate a session log. To run the profile session from the Profile Manager, reconfigure and save the session information in the Profile Wizard.

I edited session properties for a persistent session from the Workflow Manager. However, the properties I configured in the Workflow Manager do not appear on the Session Setup page of the Profile Wizard.
You edited the profile session from the Workflow Manager while the Profile Wizard was open. To refresh the session properties in the Profile Wizard, close and reopen the Profile Wizard.

I edited session properties for a persistent session from the Profile Wizard, but the session properties I configured were not saved.
You edited the profile session from the Profile Wizard while the session was open in the Workflow Manager. Close the workflow in the Workflow Manager before you edit the session properties from the Profile Wizard.
Chapter 5

Viewing Profile Results

This chapter covers the following topics:
♦ Overview, 78
♦ PowerCenter Data Profiling Reports, 79
♦ PowerAnalyzer Data Profiling Reports, 85
Overview

After you run a profile session, you can view the session results in a report. There are two types of Data Profiling reports:
♦ PowerCenter Data Profiling reports. Reports you can view from the Profile Manager after running a profile session. PowerCenter Data Profiling reports display data for the latest session run. Use PowerCenter Data Profiling reports to quickly view profile results during mapping development.
♦ PowerAnalyzer Data Profiling reports. Reports you can view from PowerAnalyzer after running a profile session. PowerAnalyzer reports provide a time-dimensional view of your data and display information about rejected rows in your profile results. Use PowerAnalyzer Data Profiling reports when you want to monitor data quality during production.
PowerCenter Data Profiling Reports

After you run a profile session, you can view PowerCenter Data Profiling reports from the Profile Manager. The reports provide information based on the last profile session run and display data profile information based on the functions in the data profile. Use PowerCenter Data Profiling reports to view the latest profile information about your source data during mapping development. For more information about running profile sessions, see “Running Profile Sessions” on page 61.

The format of PowerCenter Data Profiling reports is static and cannot be modified, except for resizing columns. PowerCenter stores each report as an XML file in the PowerCenter Client\Extensions\DataProfile\ProfileReports directory.

You can view two types of PowerCenter Data Profiling reports:
♦ Auto profile. An auto profile report contains profiling information for the predefined functions in an auto profile. Viewing an auto profile report gives you an overall view of the profile information in your source.
♦ Custom profile. Custom profile reports contain profiling information for the functions you defined in your custom profile. The amount of information in custom profile reports varies depending on the functions you specified in your data profile.
Auto Profile Reports

An auto profile report displays information about source data based on the functions in an auto profile. The report displays information for the following functions:
♦ Row Count. Counts the number of rows read from the source. When profiling data samples, the PowerCenter Server estimates the number of rows.
♦ Distinct Value Count. Returns the number of distinct values for the column. You can view verbose data for this function if you selected this mode when creating the auto profile.
♦ Aggregate. Calculates an aggregate value for a numeric or string value or calculated expression applied to one column of a profile source.
♦ Domain Inference. Reads all values in the column and infers a pattern that fits the data.
♦ Candidate Key Evaluation. Calculates the number and percentage of unique values for one or more columns in a source.
♦ Redundancy Evaluation. Calculates the number of duplicate values in one or more columns of a source.
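To make the function list concrete, here is a toy re-creation of a few of these metrics over an in-memory row set (the real functions run inside PowerCenter mappings; the rows and column names here are made up for illustration):

```python
rows = [
    {"id": 1, "state": "CA"},
    {"id": 2, "state": "NY"},
    {"id": 3, "state": "CA"},
    {"id": 4, "state": None},
]

row_count = len(rows)                                      # Row Count
distinct_states = {r["state"] for r in rows
                   if r["state"] is not None}              # Distinct Value Count
null_count = sum(1 for r in rows if r["state"] is None)    # NULL Count aggregate
# Candidate Key Evaluation: percentage of unique values in the column.
id_unique_pct = len({r["id"] for r in rows}) / row_count * 100

print(row_count, len(distinct_states), null_count, id_unique_pct)  # → 4 2 1 100.0
```

A column whose unique percentage is 100, like `id` here, is a candidate key; repeated values, like `CA` in `state`, are what the Redundancy Evaluation function counts.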
For more information about creating an auto profile, see “Creating an Auto Profile” on page 34. For more information about profile functions, see “Working with Functions” on page 103. An auto profile report contains information about the data profile in a report summary. It also contains profile results in the body of the report.
Table 5-1 describes the attributes that display in the report summary of an auto profile report:

Table 5-1. Auto Profile Report Summary Attributes
♦ Auto Profile Name: Name of the auto profile.
♦ Profile Description: Description of the auto profile.
♦ Profile Run Time: Date and time of the profile session run.
♦ Folder Name: Name of the folder in which the data profile is stored.
♦ Repository Name: Name of the repository in which the data profile is stored.
♦ Source: Name of the source definition on which the auto profile is based, in the following format: ::<source definition name>
♦ Sample Type: The type of data sampling used in the report. For more information about using data samples, see “Profiling Data Samples” on page 68.
♦ Profiled Rows: The number of rows applied to the data profile.
♦ Total Rows: The total number of rows in the source.
♦ % Profiled: The percentage of the total number of source rows profiled.
♦ Groups: The groups in the source definition or mapplet on which the auto profile is based, where applicable.
The body of an auto profile report provides general and detailed information. In an auto profile report, you can click the hyperlinks to view information about verbose data for the Distinct Value Count function, Redundancy Evaluation function, and Domain Inference function.
Figure 5-1 shows a sample auto profile report:

[Figure 5-1. Sample Auto Profile Report]

Figure 5-2 shows verbose report results:

[Figure 5-2. Sample Verbose Data Report]
Custom Profile Reports

You can view PowerCenter Data Profiling reports for custom profiles. The data the report displays depends on the functions you included in your data profile. The custom profile report displays data in the order of the functions you defined in your data profile. For more information about functions, see “Working with Functions” on page 103. For more information about creating a custom profile, see “Creating a Custom Profile” on page 38.

A custom profile report contains information about the data profile in a report summary. It also contains profile results in the body of the report. Each section in the body of the report corresponds to a different function.

Table 5-2 describes the attributes that display in the report summary of a custom profile report:

Table 5-2. Custom Profile Report Summary Attributes
♦ Custom Profile Name: Name of the custom profile.
♦ Profile Description: Description of the custom profile.
♦ Profile Run Time: Date and time of the profile session run.
♦ Folder Name: Name of the folder in which the data profile is stored.
♦ Repository Name: Name of the repository in which the data profile is stored.
♦ Source Name: The type and name of the source definitions upon which the custom profile is based.
♦ Sample Type: The type of data sampling used in the report. For more information about using data samples, see “Profiling Data Samples” on page 68.
♦ Profiled Rows: The number of rows applied to the data profile.
♦ Total Rows: The total number of rows in the source.
♦ % Profiled: The percentage of the total number of source rows profiled.
♦ Show report for all sources: Click to show Data Profiling reports for all sources upon which the custom profile is based. By default, the report shows all sources. To view a particular source, click the source name on the Sampling Summary page. This filters out results for other sources.
Table 5-3 describes the attributes that display for each function in a custom profile report:

Table 5-3. Custom Profile Report Function Attributes
♦ Function Name: The name of the function in the custom profile.
♦ Function Type: The type of function.
♦ Source: The name of the source definitions in the function.
♦ Group By Columns: Columns selected for grouping by in the report.
♦ Column Names: Name of the column to which the profile business rule is applied.
♦ Rule: Name of the profile business rule.
Figure 5-3 is an example of a custom profile report:

[Figure 5-3. Sample Custom Profile Report]

You can click the hypertext links in the report to view information about verbose data.
Viewing PowerCenter Data Profiling Reports

Use the following procedure to view an auto or custom PowerCenter Data Profiling report.

To view a PowerCenter Data Profiling report:
1. Launch the Profile Manager. To launch the Profile Manager from the Source Analyzer, select Sources-Profiling-Launch Profile Manager. To launch the Profile Manager from the Mapplet Designer, select Mapplets-Profiling-Launch Profile Manager.
2. Select the data profile for which you want to view a report. You can select a data profile from the Navigator window or from the Session Status bar.
3. Select View-View Report. If the Designer requires the connection information for the Data Profiling warehouse, the Connect to an ODBC Data Source dialog box displays. If the Designer does not require the connection information, the report displays.
4. Select the ODBC data source for the Data Profiling warehouse.
5. Enter the user name and password for the Data Profiling warehouse.
6. Click Connect.
The PowerCenter Data Profiling report displays.
PowerAnalyzer Data Profiling Reports

PowerAnalyzer displays Data Profiling reports in an analytic workflow. You can run the reports to analyze source data from profile sessions. The reports display data from the latest version of a profile.

You can view reports for profile information across session runs for a time-dimensional analysis of the data. PowerAnalyzer Data Profiling reports display information across session runs by default. You can also view reports with data from the most recent profile session run. This is useful during production to monitor data quality.

When you display a PowerAnalyzer Data Profiling report, you can drill down to row-level information, drill up for a summary view, and drill across data. You can also see reports for verbose data. PowerAnalyzer lets you modify the Data Profiling reports to fit your business needs. You can set alerts for the Data Profiling reports to call attention to fluctuations in the source data. You can also add or change metrics in the reports or build your own reports.

Figure 5-4 shows the PowerAnalyzer Data Profiling analytic workflow:

[Figure 5-4. PowerAnalyzer Data Profiling Analytic Workflow]

You access the PowerAnalyzer Data Profiling reports folder in the Public Folders on the Find tab. Informatica recommends that you do not use the Summary option when viewing Data Profiling reports. If you use the Summary option, the values that display in the report may not accurately reflect your profile session results.

Tip: When you view a PowerAnalyzer Data Profiling report, select +Workflows to open the analytic workflow.
Table 5-4 describes the reports you can view in the PowerAnalyzer Data Profiling reports analytic workflow. Each report displays data for the latest data profile version.

Table 5-4. PowerAnalyzer Data Profiling Reports
♦ Profile List: Displays information for all data profiles in a folder.
♦ Profile Function List: Displays the functions for the selected profile.
♦ Source Function Statistics: Displays details about source functions for the selected profile.
♦ Column Function Statistics: Displays details about column functions for the selected profile.
♦ Redundancy Evaluation Statistics: Displays details from the Redundancy Evaluation source-level function for the selected profile.
♦ Candidate Key Statistics: Displays details from the Candidate Key Evaluation source-level function for the selected profile.
♦ Row Count Statistics: Displays details from the Row Count source-level function for the selected profile.
♦ Source Business Rule Validation Statistics: Displays details from the Business Rule Validation source-level function for the selected profile.
♦ Rejected Rows - Source Business Rule Validation: Displays details about the rows that do not satisfy the specified business rule for a source (verbose data details).
♦ Column Business Rule Validation Statistics: Displays details about the Business Rule Validation column-level function for the selected profile.
♦ Aggregate Column Statistics: Displays details about the Aggregate function for the selected profile.
♦ Distinct Count Statistics: Displays details about the Distinct Value Count function for the selected profile.
♦ Domain Validation Statistics: Displays details about the Domain Validation function for the selected profile.
♦ Domain Inference Statistics: Displays details about the Domain Inference column function type for the selected profile.
♦ Rejected Rows - Column Business Rule Validation: Displays details about the rows that do not satisfy the specified business rule for a column (verbose data details).
♦ Rejected Rows - Domain Validation: Displays details about the rows that do not satisfy the specified domain validation rule for a column (verbose data details).
♦ Outer Join Analysis: Displays the results of all Orphan Analysis functions for the selected profile.
♦ Rejected Rows - Outer Join Analysis: Displays the unmatched rows of the Orphan Analysis function for the selected profile.
♦ Cartesian Product Analysis: Displays details about the Join Complexity Evaluation function for the selected profile.
♦ Auto Profile - Column Statistics: Displays the results of column-level statistics for a group generated as part of an auto profile for column functions.
♦ Inter Source Function List: Displays details about intersource functions for the selected data profiles.
Chapter 6

Working with Domains

This chapter covers the following topics:
♦ Working with Domains, 90
♦ Prepackaged Domains, 91
♦ Custom Domains, 92
♦ Editing a Domain, 101
♦ Deleting a Domain, 102
Working with Domains

Domains are sets of all valid values for a column. When you create a custom profile, you can create domains or use existing domains that Informatica provides. Some domains contain a list of all valid values that the source column can contain. Other domains contain a regular expression that describes a range or pattern of values that the source column can contain.

You can use the following domains to profile your data:
♦ Prepackaged domains. For more information, see “Prepackaged Domains” on page 91.
♦ Custom domains. For more information about creating custom domains, see “Custom Domains” on page 92.
Prepackaged Domains

Informatica provides a set of prepackaged domains to verify data such as phone numbers, postal codes, and email addresses. You can use prepackaged domains when you create a custom profile with a Domain Validation function. For more information about creating custom profiles, see “Creating a Custom Profile” on page 38.

You can view prepackaged domains in the Domain Browser. Prepackaged domains install the first time you open the Domain Browser, so it may take a few moments for the Domain Browser to open for the first time. From the Domain Browser, you can edit or delete prepackaged domains. For more information about working with domains, see “Custom Domains” on page 92.

Table 6-1 shows the prepackaged domains and their descriptions:

Table 6-1. Prepackaged Domains
♦ Country-Codes: Contains a list of country abbreviations.
♦ Country-Names: Contains a list of country names.
♦ Email-Address: Validates the source against email patterns.
♦ URL: Validates the source against accepted URL patterns. Validates HTTP, HTTPS, and FTP URLs.
♦ North-American-Phone-Number: Validates the source against North American phone number patterns.
♦ Dollar-Currency: Displays source currency in U.S.A. currency values.
♦ US-Zip-Codes-Pattern (faster): Validates the source against the U.S.A. zip code pattern.
♦ US-Zip-Codes-List (more-accurate): Contains a list of United States Postal Service zip codes.
♦ US-State-Codes (50-states): Contains a list of U.S.A. state abbreviations.
♦ US-State-Codes (extended): Contains a list of U.S.A. state abbreviations with additional values for territories and outlying areas served by the United States Postal Service.
♦ US-State-Names (50-states): Contains a list of the names of all U.S.A. states.
♦ US-State-Names (extended): Contains a list of the names of all U.S.A. states with additional values for territories and outlying areas.
♦ US-Social-Security-Number: Validates the source against the U.S.A. social security number pattern.
♦ Canadian-State-Codes: Contains a list of Canadian province abbreviations.
♦ Canadian-State-Names: Contains a list of Canadian province names.
♦ Canadian-Zip-Codes: Validates the source against the Canadian zip code pattern.
♦ UK-Postal-Codes: Validates the source against the U.K. postal code pattern.
♦ North-American-Industry-Classification-System (NAICS)-codes: Contains a list of the North American Industry Classification System codes.
Custom Domains

You can create custom domains when you create a Domain Validation function. During a profile session, the Domain Validation function uses the domains you specify to validate source values or to help you infer patterns from source data. You can create the following types of domains:
♦ List of Values. Domains defined by a comma-delimited list of values.
♦ Regular Expression. Domains defined by a range of values in an expression.
♦ Domain Definition Filename. Domains defined by an external file containing values.

You can create reusable and non-reusable custom domains. Apply a reusable domain to multiple Domain Validation functions. Apply a non-reusable domain to one Domain Validation function. For more information about configuring a Domain Validation function, see “Column-Level Functions” on page 113.

You can create a domain from the Profile Manager or when you configure a function. Any domain you create from the Profile Manager is a reusable domain. When you create a domain from a function, you can make the domain reusable or non-reusable. When you view domains in the Domain Browser from the Profile Manager, you can view only reusable and prepackaged domains. You can view non-reusable domains from the Domain Browser when you define a function to which the non-reusable domain applies.

Once you create a domain, you can edit or delete it. For more information about editing domains, see “Editing a Domain” on page 101. For more information about deleting a domain, see “Deleting a Domain” on page 102.
List of Values Domains

You can create a List of Values domain from the Profile Manager or when you configure a function. When you create a List of Values domain, you add the values you want to apply to the domain. For example, if you want to use a domain that lists area codes in a region, you create a List of Values domain. You can also use a List of Values domain to evaluate your data for non-null white spaces.

To create a List of Values domain:
1. To create a List of Values domain from the Profile Manager, select Tools-Domains. To create a domain when you define a function, click the Domains button on the Profile Function Details page. The Domain Browser dialog box displays.
2. Click New to create a new domain. The Domain Details dialog box displays.
3. Enter a name for the domain. Note: The domain name cannot contain spaces.
4. Clear Reusable Domain if you do not want to be able to use this domain in other Domain Validation functions. Note: If you configure a domain from the Profile Manager, the domain is automatically reusable. You cannot make the domain non-reusable.
5. Select List of Values as the domain type.
6. In the Value box, enter a new domain value to manually add values to the list of values. If you want to add a file with a list of values, go to step 8. When you enter a domain value, the Designer ignores any spaces before or after the value.
7. Click Add to add the domain values you entered to the list of values.
8. Click Values File to add a list of values. If you want to add domain values manually, go to step 6.
9. Navigate to the file, and select the file to use.
10. Select the appropriate code page from the drop-down list. The code page you specify must be a subset of the code page for the operating system that hosts the PowerCenter client. You can specify localization and code page information in the values file. If you do not specify the localization information, the PowerCenter Server uses default values. For information about specifying localization information, see “Specifying Localization and Code Page Information” on page 99.
11. Repeat steps 6 and 7 for each domain value you want to add.
12. If you want to remove a domain value, select the value from the list of values and click Remove.
13. Click OK to save the domain.
14. Click Close.
Regular Expression Domains

You can create a Regular Expression domain from the Profile Manager or when you configure a function. When you create a Regular Expression domain, you create a range of values you want to apply to the domain. For example, if you want to use a domain that includes values for 5-digit employee ID numbers, create a Regular Expression domain.
Using perl Syntax in a Regular Expression Domain

You must use perl syntax when you create a Regular Expression domain. Table 6-2 provides perl syntax guidelines to create a regular expression:

Table 6-2. perl Syntax Guidelines for Regular Expressions
♦ . (a period): Matches any one character.
♦ [a-z]: Matches one instance of a letter. For example, [a-z][a-z] can match ab or CA.
♦ \d: Matches one instance of any digit from 0-9.
♦ (): Groups an expression. For example, the parentheses in (\d\d-\d\d) group the expression \d\d-\d\d, which finds any two numbers followed by a hyphen and any two numbers, as in 12-34.
♦ {}: Matches the number of characters exactly. For example, \d{3} matches any three numbers, such as 650 or 510. Or, [a-z]{2} matches any two letters, such as CA or NY.
♦ ?: Matches the preceding character or group of characters zero or one time. For example, \d{3}(-\d{4})? matches any three numbers, which can be followed by a hyphen and any four numbers.
♦ * (an asterisk): Matches zero or more instances of the values that follow the asterisk. For example, *0 is any value that precedes a 0.
For example, to create a regular expression for U.S.A. zip codes, you can enter the following perl syntax: \d{5}(-\d{4})?
This expression lets you find 5-digit U.S.A. zip codes, such as 93930, as well as 9-digit zip codes, such as 93930-5407. In this example, \d{5} refers to any five numbers, such as 93930. The parentheses surrounding -\d{4} group this segment of the expression. The hyphen represents the hyphen of a 9-digit zip code, as in 93930-5407. \d{4} refers to any four numbers, such as 5407. The question mark states that the hyphen and last four digits are optional or can appear one time. Figure 6-1 shows an example of a Regular Expression domain: Figure 6-1. Regular Expression Domain
Click to enter test data to validate a regular expression.
When you enter a regular expression, you can validate the regular expression with test data. You can use the test data instead of the data in your repository. The test data should represent the source data you plan to use the Regular Expression domain against.
Custom Domains
95
Figure 6-2 shows an example of test data in the Test Data dialog box:

Figure 6-2. Test Data Dialog Box (shows the regular expression to test, a field to enter test data, and the test results)
Tips for Converting COBOL Syntax to perl Syntax

If you are familiar with COBOL syntax, you can use the following information to help you write perl regular expressions for a Regular Expression domain. Table 6-3 shows examples of COBOL syntax and their perl equivalents:

Table 6-3. COBOL Syntax and perl Syntax Compared
♦ 9 (COBOL) = \d (perl). Matches one instance of any digit from 0-9.
♦ 9999 (COBOL) = \d\d\d\d or \d{4} (perl). Matches any four digits from 0-9, as in 1234 or 5936.
♦ x (COBOL) = [a-z] (perl). Matches one instance of a letter.
♦ 9xx9 (COBOL) = \d[a-z][a-z]\d (perl). Matches any number followed by two letters and another number, as in 1ab2.
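Because the picture symbols in Table 6-3 translate symbol by symbol, the conversion can be mechanized. A minimal sketch in Python (the dictionary covers only the symbols shown above; it is an illustration, not a full COBOL picture parser):

```python
import re

# Only the COBOL picture symbols from Table 6-3.
COBOL_TO_PERL = {"9": r"\d", "x": "[a-z]"}

def picture_to_regex(picture):
    """Translate a COBOL picture string such as 9xx9 symbol by symbol."""
    return "".join(COBOL_TO_PERL[symbol] for symbol in picture)

pattern = picture_to_regex("9xx9")          # \d[a-z][a-z]\d
matched = bool(re.fullmatch(pattern, "1ab2"))
```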
Tips for Converting SQL Syntax to perl Syntax

If you are familiar with SQL syntax, you can use the following information to help you write perl regular expressions for a Regular Expression domain. Table 6-4 shows examples of SQL syntax and their perl equivalents:

Table 6-4. SQL Syntax and perl Syntax Compared
♦ % (SQL) = .* (perl). Matches any string.
♦ A% (SQL) = A.* (perl). Matches the letter “A” followed by any string, as in Area.
♦ _ (SQL) = . (perl, a period). Matches any one character.
♦ A_ (SQL) = A. (perl). Matches “A” followed by any one character, such as AZ.
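The two LIKE wildcards in Table 6-4 translate mechanically, with every other character escaped so it matches literally. A sketch in Python (the function name is an illustration, not a PowerCenter utility):

```python
import re

def like_to_regex(like_pattern):
    """Convert a SQL LIKE pattern to perl-style syntax using the rules in Table 6-4."""
    parts = []
    for char in like_pattern:
        if char == "%":
            parts.append(".*")   # % matches any string
        elif char == "_":
            parts.append(".")    # _ matches any one character
        else:
            parts.append(re.escape(char))
    return "".join(parts)

# "A%" becomes "A.*" and matches values such as "Area".
```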
Chapter 6: Working with Domains
Creating a Regular Expression Domain

Use the following procedure to create a Regular Expression domain.

To create a Regular Expression domain:
1. To create a domain from the Profile Manager, select Tools-Domains. To create a domain when you define a function, click the Domains button on the Profile Function Details page. The Domain Browser dialog box displays.
2. Click New to create a new domain. The Domain Details dialog box displays.
3. Enter a name for the domain. The domain name cannot contain spaces.
4. Clear Reusable Domain if you do not want to be able to use this domain in other Domain Validation functions. If you configure a domain from the Profile Manager, the domain is automatically reusable. You cannot make the domain non-reusable.
5. In the Expression box, enter a new domain value.
6. If you want to validate the regular expression, click Test Data.
7. Enter the test data you want to test the regular expression against.
8. Click Test to validate the regular expression. You can view the results of the test in the Result box at the bottom of the Test Data dialog box.
9. When you finish validating the regular expression, click Done.
10. Click OK to save the domain.
11. Click Close.
Domain Definition Filename Domains

You can create a Domain Definition Filename domain from the Profile Manager or when you configure a function. Create a Domain Definition Filename domain when you have a text file you can use to validate source data. For example, you have a file that contains a list of area codes in a region. Create a Domain Definition Filename domain for the PowerCenter Server to use the file to validate the source data according to this list.

When you create a Domain Definition Filename domain, you select a filename with the domain values you want to use. You must store the file on the same machine as the PowerCenter Server on which you want to run the session. You can use the $PMSourceDir server variable when you specify the filename path.

To create a Domain Definition Filename domain:
1. To create a domain from the Profile Manager, select Tools-Domains. To create a domain when you define a function, click the Domains button on the Profile Function Role Details page. The Domain Browser dialog box displays.
2. Click New to create a new domain. The Domain Details dialog box displays.
3. Enter a name for the domain. The domain name cannot contain spaces.
4. Clear Reusable Domain if you do not want to be able to use this domain in other Domain Validation functions. If you configure a domain from the Profile Manager, the domain is automatically reusable. You cannot make the domain non-reusable.
5. In the Filename box, enter a new domain value. You can use server variables, such as $PMSourceDir and $PMRootDir, when you specify a filename and path for the domain value. Note: The file you specify must use a code page that is a subset of the PowerCenter Server code page. You can specify a code page by entering valid syntax on the first line of the file. For information about entering localization and code page information, see “Specifying Localization and Code Page Information” on page 99.
6. Click OK to save the domain.
7. Click Close.
Specifying Localization and Code Page Information

When you import a list of values or a domain definition file name, the file can contain localization information. Use the following guidelines to specify localization information:
♦ Enter localization information in the first line of the file.
♦ Enter localization information using 7-bit ASCII.
♦ Use the following syntax:
locale=<language>_<territory>.<code page>@<sort>
where language, territory, code page, and sort represent the following information:
− Language. Specifies translations for the names of the months and the days of the week.
− Territory. Specifies country-dependent information such as currency symbols, numeric and monetary formatting rules, and Date/Time formats.
− Code page. Specifies the character encoding to use. The code page you specify must be a subset of the code page for the operating system that hosts the PowerCenter client.
− Sort. Specifies the collation sequence to use. For example, you can use Binary.
For example, you can specify the following localization information for a U.S. English file:
locale=English_UnitedStates.US-ASCII@binary
For a Japanese file, you can specify the following localization information: locale=Japanese_Japan.JapanEUC@binary
For more information about code page compatibility for Data Profiling components, see “Code Page Compatibility” on page 152.
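The locale line is simple enough to check mechanically before you deploy a domain file. A sketch in Python (the function and its tokenization are assumptions based on the syntax and examples above, not a PowerCenter API):

```python
import re

def parse_locale_line(first_line):
    """Split a locale=<language>_<territory>.<code page>@<sort> specification."""
    match = re.fullmatch(r"locale=(\w+)_(\w+)\.([\w-]+)(?:@(\w+))?",
                         first_line.strip())
    if match is None:
        return None
    language, territory, code_page, sort = match.groups()
    return {"language": language, "territory": territory,
            "code_page": code_page, "sort": sort}

info = parse_locale_line("locale=English_UnitedStates.US-ASCII@binary")
# info["language"] == "English", info["code_page"] == "US-ASCII"
```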
Editing a Domain

You can edit a reusable domain to change the domain name and domain value, expression, or filename. When you define a function, you can edit a non-reusable domain for the function to change the domain name, domain type, and domain value, expression, or filename. You can edit prepackaged domains to change the domain value, expression, or filename.

To edit a domain:
1. To edit a reusable or prepackaged domain from the Profile Manager, select Tools-Domains. To edit a domain when you define a function, click the Domains button on the Profile Function Role Details page. The Domain Browser dialog box displays.
2. Select the domain you want to edit, and click Edit. The Domain Details dialog box displays.
3. If you are editing a custom domain, optionally change the domain name. If you are editing a List of Values domain, add or remove domain values as necessary. If you are editing a Regular Expression domain, modify the domain expression as necessary. If you are editing a Domain Definition Filename domain, modify the filename that contains the domain values as necessary.
4. Click OK to save your changes.
5. Click Close.
Deleting a Domain

You can delete a domain if you no longer want to apply it to Domain Validation functions. If you delete a domain, the Designer invalidates all of the data profiles and related profile mappings that reference the domain.

To delete a domain:
1. To delete a domain from the Profile Manager, select Tools-Domains. To delete a domain when you define a function, click the Domains button on the Profile Function Role Details page. The Domain Browser dialog box displays.
2. Select the domain you want to delete, and click Delete.
3. Click Close.
Chapter 7

Working with Functions

This chapter contains information on the following topics:
♦ Overview, 104
♦ Source-Level Functions, 105
♦ Column-Level Functions, 113
♦ Intersource Functions, 123
Overview

You include functions in a profile to perform calculations on sources during a profile session. When you create an auto profile, the Designer adds a predefined set of functions to your profile. When you create a custom profile, you create functions that meet your business needs and add them to your profile. You can add the following types of functions to a profile:
♦ Source-level functions. Perform calculations on two or more columns of a source, source group, or mapplet group. For more information about source-level functions, see “Source-Level Functions” on page 105.
♦ Column-level functions. Perform calculations on one column in a source. For more information about column-level functions, see “Column-Level Functions” on page 113.
♦ Intersource functions. Perform calculations on two or more sources, source groups, or mapplet groups. For more information about intersource functions, see “Intersource Functions” on page 123.

For many profile functions, you can write data to the Data Profiling warehouse in verbose mode. When you select this option, the PowerCenter Server writes verbose data to the Data Profiling warehouse during the profile session. You can use Data Profiling reports to view more information about the verbose data. For more information about viewing PowerAnalyzer reports, see “Viewing Profile Results” on page 77.

For more information about creating auto profiles, see “Creating an Auto Profile” on page 34. For more information about creating custom profiles, see “Creating a Custom Profile” on page 38.
Source-Level Functions

Source-level functions perform calculations on two or more columns of a source, source group, or mapplet group. You can add the following source-level functions to a profile:
♦ Row Count. Counts the number of rows read from the source during the profile session. If you enable data sampling, the Row Count function returns an estimated row count based on the rows sampled. For information about the Row Count function, see “Row Count” on page 105.
♦ Business Rule Validation. Evaluates the number of rows for one or more source columns that satisfy a specified business rule and the number of rows that do not satisfy the business rule. For information about the Business Rule Validation function, see “Business Rule Validation” on page 106.
♦ Candidate Key Evaluation. Calculates the number and percentage of unique values in one or more source columns. This helps you identify source columns you might use as a primary key. For information about the Candidate Key Evaluation function, see “Candidate Key Evaluation” on page 108.
♦ Redundancy Evaluation. Calculates the number of duplicate values in one or more source columns. This helps you identify columns to normalize into separate tables. For information about the Redundancy Evaluation function, see “Redundancy Evaluation” on page 109.
♦ Row Uniqueness. Calculates the number of unique and duplicate values in the source based on the columns selected. You can profile all columns in the source row or choose individual columns to profile. For information about the Row Uniqueness function, see “Row Uniqueness” on page 110.

When you specify a source-level function on the Function Details page of the Profile Wizard, the Profile Wizard prompts you to configure the function on the Function Role Details page. The options available on the Function Role Details page for source-level functions depend on the function you select. For more information about configuring function details, see “Step 4. Configure Profile Functions” on page 41.
Row Count

The Row Count function returns the number of rows in a source. It can also report the number of rows in each group. If you configure a session for random manual sampling or automatic manual sampling, the Row Count function returns an estimate of the total source rows. If you configure a session to sample First N Rows, the Row Count function returns the number of rows read during the session. For more information about using data samples, see “Profiling Data Samples” on page 68.
Figure 7-1 shows the Function Role Details page for the Row Count function: Figure 7-1. Row Count Function
Table 7-1 shows the properties of the Row Count function:

Table 7-1. Row Count Function Options
♦ Generate profile data by group (Optional). Select to group source rows in a particular column. You can view a result for each group in the Data Profiling report.
♦ Group by Columns (Optional). If you selected Generate Profile Data by Group, select a column to group by. You can select a column of any datatype except Binary. If you select a column of numeric datatype, the precision must be between 1 and 28 digits. If you select a column of String datatype, the precision must be between 1 and 200 characters.
Business Rule Validation

The Business Rule Validation source-level function calculates the number of rows for one or more columns in a source that satisfy a business rule and the number of rows for one or more columns in a source that do not. A business rule is a valid Boolean expression that you create with the Business Rule Validation Editor. For example, you have a business rule that applies to products with a price of $50.00 or more. You want to profile your source data during mapping development to learn how many rows in the Price column in the source contain values greater than $50.00. You can define a business rule in a Business Rule Validation source-level function that evaluates each source row in the Price column to see how many values are $50.00 or higher.

Figure 7-2 shows the Function Role Details page for the Business Rule Validation source-level function: Figure 7-2. Business Rule Validation Source-Level Function
Click to enter a business rule.
Table 7-2 shows the properties of the Business Rule Validation source-level function:

Table 7-2. Business Rule Validation Source-Level Function Options
♦ Rule Summary (Required). Click the Rule Editor button to enter a business rule. Once you enter a business rule, the rule displays in the Rule Summary dialog box. Use a valid Boolean expression. You can only enter Business Rule Validation functions in the Business Rule Editor. If you enter other functions available through Business Rule Validation, such as Date or String functions, the session may generate unexpected results.
♦ Generate profile data by group (Optional). Select to group source rows in a particular column. You can view a result for each group.
♦ Group by Columns (Optional). If you selected Generate Profile Data by Group, select a column to group by. You can select a column of any datatype except Binary. If you select a column of numeric datatype, the precision must be between 1 and 28 digits. If you select a column of String datatype, the precision must be between 1 and 200 characters.
♦ Specify the type of verbose data to load into the warehouse (Required). Select for the PowerCenter Server to write verbose data to the Data Profiling warehouse. You can load the following types of verbose data: No Rows, Valid rows only, Invalid rows only, All Rows. The character limit is 1,000 bytes/K, where K is the maximum number of bytes for each character in the Data Profiling warehouse code page. If the column exceeds this limit, the PowerCenter Server writes truncated data to the Data Profiling warehouse. For more information about configuring verbose mode, see “Configuring a Function for Verbose Mode” on page 43.
♦ Select Columns (Optional). Click to select the columns you want to profile in verbose mode. For more information about configuring a function for verbose mode, see “Configuring a Function for Verbose Mode” on page 43.
Figure 7-3 shows the Business Rule Validation Editor for a simple business rule: Figure 7-3. Business Rule Validation Editor
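Conceptually, the function evaluates the Boolean rule once per row and tallies the outcomes. A sketch of the $50.00 example in Python (the sample rows and column names are invented for illustration; this is not PowerCenter code):

```python
# Hypothetical source rows; PRICE stands in for the Price column in the example.
rows = [
    {"ITEM": "wrench", "PRICE": 12.50},
    {"ITEM": "drill",  "PRICE": 89.99},
    {"ITEM": "saw",    "PRICE": 50.00},
]

# The business rule is a Boolean expression evaluated against each row.
def rule(row):
    return row["PRICE"] >= 50.00

valid = sum(1 for row in rows if rule(row))
invalid = len(rows) - valid
# valid == 2, invalid == 1
```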
Candidate Key Evaluation

The Candidate Key Evaluation function calculates the number and percentage of unique values for one or more columns in a source. You can use the Candidate Key Evaluation function to determine the column in a source to use as a primary key. You may want to use the column with the highest percentage of unique values as the primary key. This function can evaluate unique values in columns of numeric datatypes with a precision of 1 to 28 digits or columns of the String datatype with a precision between 1 and 10 characters.
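The metric behind this function is the ratio of distinct values to total rows per column. A sketch in Python (the sample data and column names are invented for illustration):

```python
def unique_percentage(rows, column):
    """Percentage of distinct values in a column across all rows."""
    values = [row[column] for row in rows]
    return 100.0 * len(set(values)) / len(values)

# EMP_ID is fully unique, so it is the stronger primary key candidate.
rows = [
    {"EMP_ID": 101, "DEPT": "SALES"},
    {"EMP_ID": 102, "DEPT": "SALES"},
    {"EMP_ID": 103, "DEPT": "HR"},
]
# unique_percentage(rows, "EMP_ID") == 100.0
```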
Figure 7-4 shows the Function Role Details page for the Candidate Key Evaluation function: Figure 7-4. Candidate Key Evaluation Function
Table 7-3 shows the properties of the Candidate Key Evaluation function:

Table 7-3. Candidate Key Evaluation Function Options
♦ Select Column(s) (Required). Select the columns to profile. By default, the Designer selects all columns of numeric datatype with a precision between 1 and 28 digits or String datatype with a precision of 1 to 10 characters.
♦ Enable duplicate count for pairs of selected columns (Optional). Select to evaluate candidate keys based on pairs of columns.
Redundancy Evaluation

The Redundancy Evaluation function calculates the number of duplicate values in one or more columns of a source. You can use the Redundancy Evaluation function to identify columns to normalize into separate tables. You may want to normalize the columns that have the highest percentage of redundant values. This function can evaluate unique values in columns of numeric datatypes with a precision of 28 digits or less or columns of the String datatype with a precision of 10 characters or less.
Figure 7-5 shows the Function Role Details page for the Redundancy Evaluation function: Figure 7-5. Redundancy Evaluation Function
Table 7-4 shows the properties of the Redundancy Evaluation function:

Table 7-4. Redundancy Evaluation Function Options
♦ Select Column(s) (Required). Select the columns you want to profile using the Redundancy Evaluation function. By default, the Designer selects all columns of numeric datatype with a precision between 1 and 28 digits or String datatype with a precision of 1 to 10 characters.
♦ Enable duplicate count for pairs of selected columns (Optional). Select to perform redundancy evaluation based on pairs of columns.
Row Uniqueness

The Row Uniqueness function calculates the number of unique and duplicate values based on the columns selected. You can profile all columns in the source row or choose columns to profile. This helps you identify columns to normalize into a separate table. You can also use this function to test for distinct rows. This function is particularly useful for flat files, which have no internal validation tools such as primary key constraints or unique indexes. For example, if you have a flat file source that uses unique employee ID values to identify each row, you can select all columns to test for duplicate rows. Because you have no primary key constraints, this allows you to verify that you did not create duplicate entries for employees.

Figure 7-6 shows the Function Role Details page for the Row Uniqueness function: Figure 7-6. Row Uniqueness Function
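For the flat file case above, the same check can be sketched outside PowerCenter by keying each row on the selected columns (Python; the rows and column names are invented for illustration):

```python
from collections import Counter

def duplicate_rows(rows, columns):
    """Return the value tuples that occur more than once in the selected columns."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

# A flat file source with an accidentally repeated employee entry.
rows = [
    {"EMP_ID": 101, "NAME": "Ann"},
    {"EMP_ID": 102, "NAME": "Raj"},
    {"EMP_ID": 101, "NAME": "Ann"},
]
dupes = duplicate_rows(rows, ["EMP_ID", "NAME"])
# dupes == {(101, "Ann"): 2}
```

Selecting all columns, as in the example, tests for fully duplicated rows; selecting a subset tests uniqueness of that subset only.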
Table 7-5 shows the properties of the Row Uniqueness function:

Table 7-5. Row Uniqueness Function Options
♦ Select Column(s) (Required). Select the columns you want to profile using the Row Uniqueness function. By default, the Designer selects all columns of numeric datatype with a precision between 1 and 28 digits or String datatype with a precision of 1 to 10 characters.
♦ Generate profile data by group (Optional). Select to group source rows in a particular column. You can view a result for each group in the Data Profiling report.
♦ Group by Columns (Optional). If you selected Generate Profile Data by Group, select a column to group by. You can select a column of any datatype except Binary. If you select a column of numeric datatype, the precision must be between 1 and 28 digits. If you select a column of String datatype, the precision must be between 1 and 200 characters.
♦ Specify the type of verbose data to load into the warehouse (Required). Select for the PowerCenter Server to write verbose data to the Data Profiling warehouse. You can load the following types of verbose data: All rows, No rows, Duplicate rows only. The character limit is 1,000 bytes/K, where K is the maximum number of bytes for each character in the Data Profiling warehouse code page. If the column exceeds this limit, the PowerCenter Server writes truncated data to the Data Profiling warehouse. For more information about configuring verbose mode, see “Configuring a Function for Verbose Mode” on page 43.
Column-Level Functions

Column-level functions perform a calculation on one column in a source. You can add the following column-level functions to a profile:
♦ Business Rule Validation. For information about the Business Rule Validation function, see “Business Rule Validation” on page 113.
♦ Domain Validation. For information about the Domain Validation function, see “Domain Validation” on page 115.
♦ Domain Inference. For information about the Domain Inference function, see “Domain Inference” on page 117.
♦ Aggregate Functions. For information about Aggregate functions, see “Aggregate Functions” on page 119.
♦ Distinct Value Count. For information about the Distinct Value Count function, see “Distinct Value Count” on page 121.

When you specify a column-level function on the Function Details page of the Profile Wizard, the Profile Wizard prompts you to configure the function. The options available on the Function Role Details page for column-level functions depend on the function you select. For more information about the Function Details page of the Profile Wizard, see “Step 3. Add Functions and Enable Session Configuration” on page 39. For more information about the Function Role Details page of the Profile Wizard, see “Step 4. Configure Profile Functions” on page 41.

When you create a Domain Validation column-level function, you include domains in the function. For more information about creating domains, see “Custom Domains” on page 92.

Note: If you enable the PowerCenter Server to write verbose data to the Data Profiling warehouse, the PowerCenter Server loads all selected rows to the Data Profiling warehouse except those with the datatype Raw. For more information about configuring a function for verbose mode, see “Configuring a Function for Verbose Mode” on page 43.
Business Rule Validation

The Business Rule Validation column-level function calculates the number of rows in a single source column that satisfy a business rule and the number of rows that do not. A business rule is a valid Boolean expression that you create with the Rule Editor.
Figure 7-7 shows the Function Role Details page for the Business Rule Validation column-level function: Figure 7-7. Business Rule Validation Column-Level Function
Click to enter a business rule.
Table 7-6 shows the properties of the Business Rule Validation column-level function:

Table 7-6. Business Rule Validation Column-Level Function Options
♦ Selected Column (Required). Select the column you want to apply the business rule to.
♦ Generate profile data by group (Optional). Select to group source rows in a particular column. You can view a result for each group.
♦ Group by Columns (Optional). If you selected Generate Profile Data by Group, select a column to group by. You can select a column of any datatype except Binary. If you select a column of numeric datatype, the precision must be between 1 and 28 digits. If you select a column of String datatype, the precision must be between 1 and 200 characters.
♦ Rule Summary (Required). Click the Rule Editor button shown in Figure 7-3 to enter a business rule. Once you enter a business rule, the rule displays in the Rule Summary dialog box. Use a valid Boolean expression. You can only enter Business Rule Validation functions in the Business Rule Editor. If you enter other functions available through Business Rule Validation, such as Date or String functions, the session may generate unexpected results.
♦ Specify the type of verbose data to load into the warehouse (Required). Select for the PowerCenter Server to write verbose data to the Data Profiling warehouse. You can load the following types of verbose data: No Rows, Valid rows only, Invalid rows only, All Rows. The character limit is 1,000 bytes/K, where K is the maximum number of bytes for each character in the Data Profiling warehouse code page. If the column exceeds this limit, the PowerCenter Server writes truncated data to the Data Profiling warehouse. For more information about configuring verbose mode, see “Configuring a Function for Verbose Mode” on page 43.
Domain Validation

The Domain Validation function calculates the number of values in the profile source column that fall within a specified domain and the number of values that do not. A domain is the set of all possible valid values for a column. For example, a domain might include a list of abbreviations for all of the states in the U.S.A. Or, a domain might include a list of valid U.S.A. zip code patterns.
Column-Level Functions
Figure 7-8 shows the Function Role Details page for the Domain Validation function: Figure 7-8. Domain Validation Function
Table 7-7 shows the properties of the Domain Validation function: Table 7-7. Domain Validation Function Options
Property
Required/Optional
Description
Selected Column
Required
Select the column you want to evaluate against the domain.
Domain Summary
Required
Click the Domains button to select a reusable domain or to create a non-reusable domain. Once you enter a domain, it displays in the Domain Summary box. For more information about domains, see “Working with Domains” on page 89.
Specify the type of verbose data to load into the warehouse
Required
Select for the PowerCenter Server to write verbose data to the Data Profiling warehouse. You can load the following types of verbose data: - No rows - Valid rows only - Invalid rows only - All rows The character limit is 1,000/K characters, where K is the maximum number of bytes for each character in the Data Profiling warehouse code page. If the column exceeds this limit, the PowerCenter Server writes truncated data to the Data Profiling warehouse. For more information about configuring verbose mode, see “Configuring a Function for Verbose Mode” on page 43.
When you click the Domains button to add a domain, the Domain Browser displays. Figure 7-9 shows the Domain Browser: Figure 7-9. Domain Browser
Select the domain you want to use for your domain validation function. Click the Close button to return to the Domain Validation Function Role Details page. If you validate the domain against a List of Values domain or a Domain Definition Filename domain, the list must use a code page that is two-way compatible with the PowerCenter Server. For information about specifying a compatible code page for these domains, see “Custom Domains” on page 92.
Domain Inference
The Domain Inference function reads all values in the column and infers a pattern that fits the data. The function determines whether the values fit a list of values derived from the column values or a pattern that describes the structure of the source data. For example, if you have a column of social security numbers in your source, you can use the Domain Inference function to determine the pattern of numbers in the column. The Domain Inference function can also infer a pattern of ‘STRING WITH ONLY SPACES’ for columns containing non-null blank space data, which is useful for assessing data quality. This function can infer domains for columns of a numeric datatype with a precision of 28 digits or less, a String datatype with a precision of 200 characters or less, or the Date/Time datatype.
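One common way to infer a column pattern is to map each character to a character class. The sketch below uses an illustrative convention (9 for digits, X for letters, punctuation kept literally); the actual symbols PowerCenter emits may differ, but the idea is the same, including the special pattern for all-space values.

```python
def infer_pattern(value):
    """Infer a character-class pattern for one column value (illustrative)."""
    if value is None:
        return "NULL"
    if value != "" and value.strip() == "":
        return "STRING WITH ONLY SPACES"
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")      # digit positions
        elif ch.isalpha():
            out.append("X")      # letter positions
        else:
            out.append(ch)       # keep separators literally
    return "".join(out)

print(infer_pattern("123-45-6789"))   # 999-99-9999
print(infer_pattern("   "))           # STRING WITH ONLY SPACES
```

Running this over every value in a column and tallying the resulting patterns yields the inferred domains the function reports.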
Figure 7-10 shows the Function Role Details page for the Domain Inference function: Figure 7-10. Domain Inference Function
When you create a Domain Inference function, select a column for which you want to infer a domain. Click Finish.
Configuring Domain Inference Settings
When you work with the Domain Inference function, you can configure the Profile Wizard to filter the Domain Inference results. You may want to narrow the scope of patterns returned to view only the primary domains, or you may want to widen the scope of patterns returned to view exception data. You configure these settings from the Domain Inference Tuning dialog box when you create custom or auto profiles. When you use a Domain Inference function in a data profile, you can configure the following settings:
♦ Maximum number of patterns. The PowerCenter Server returns the most frequently occurring patterns up to the number of patterns you specify. For example, you create an auto profile for a source, and you set the maximum number of patterns to 20. The PowerCenter Server returns the top 20 patterns.
♦ Minimum pattern frequency. The PowerCenter Server returns patterns that occur at or above the frequency you specify. For example, if you set the minimum pattern frequency to 30 percent, the PowerCenter Server returns patterns that occur 30 percent of the time or more and filters the rest.
For example, you have a source that contains customers with Canadian and United States zip codes. Most of your customers are in the United States, but you also have Canadian customers. You want to view exception data and see what percentage of the zip codes are Canadian, so you configure the auto profile with a maximum of 300 patterns and a minimum pattern frequency of 1 percent. When you run the auto profile, the PowerCenter Server returns a maximum of 300 patterns. This allows you to view a wide range of exception data. The auto profile also infers a domain for Canadian zip codes, which occurs in a small percentage of the data. Note: When you configure minimum pattern frequency, the PowerCenter Server calculates the
percentage of patterns to return based on non-null data and ignores null data. Therefore, if you have null data in the source row, the percentage of patterns returned may not represent a percentage of the total data.
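The two tuning settings amount to a top-N cut followed by a frequency threshold computed over non-null values only, as the note above describes. A minimal sketch (function and data names are illustrative):

```python
from collections import Counter

def top_patterns(patterns, max_patterns, min_frequency_pct):
    """Filter inferred patterns by count and by minimum frequency (illustrative)."""
    # Frequency is computed over non-null data only; nulls are ignored.
    non_null = [p for p in patterns if p is not None]
    counts = Counter(non_null)
    total = len(non_null)
    return [(p, c) for p, c in counts.most_common(max_patterns)
            if 100.0 * c / total >= min_frequency_pct]

# 97 U.S.-style zip patterns, 2 Canadian-style patterns, 1 null
data = ["99999"] * 97 + ["X9X 9X9"] * 2 + [None]
print(top_patterns(data, max_patterns=300, min_frequency_pct=1.0))
# [('99999', 97), ('X9X 9X9', 2)]
```

With a 1 percent threshold the rare Canadian pattern survives, matching the exception-data scenario described above; raising the threshold to 30 percent would filter it out.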
Aggregate Functions
An Aggregate function calculates an aggregate value for a numeric or string value applied to one column of a profile source. You can add the following aggregate functions to a profile:
♦ NULL Value Count. The number of rows with NULL values in the source column.
♦ Average Value. The average value of the rows in the source column.
♦ Minimum Value. The minimum value of the rows in the source column.
♦ Maximum Value. The maximum value of the rows in the source column.
The Aggregate function you can add to a source column depends on the datatype of the column. Table 7-8 describes the Aggregate functions you can add based on the datatype of the source column:
Table 7-8. Aggregate Functions to Add Based on the Column Datatype
Aggregate Function
Allowed Datatypes and Precision
NULL Value Count
- Binary - Date/Time - Numeric - String with a precision of 4,000 or less.
Minimum Value
- Date/Time - Numeric - String with a precision of 200 or less.
Maximum Value
- Date/Time - Numeric - String with a precision of 200 or less.
Average Value
- Numeric with a precision of 1,000 or less.
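The four aggregate metrics over one numeric column can be sketched as follows. The assumption (not stated explicitly in the guide) is that Minimum, Maximum, and Average skip NULLs, mirroring SQL aggregate semantics, while NULL Value Count counts them.

```python
def aggregate_metrics(values):
    """NULL count, min, max, and average for one source column (illustrative)."""
    non_null = [v for v in values if v is not None]
    return {
        "NULL_ROWS": len(values) - len(non_null),   # NULL Value Count
        "MIN_VALUE": min(non_null) if non_null else None,
        "MAX_VALUE": max(non_null) if non_null else None,
        "AVG_VALUE": sum(non_null) / len(non_null) if non_null else None,
    }

print(aggregate_metrics([10, 20, None, 30]))
# {'NULL_ROWS': 1, 'MIN_VALUE': 10, 'MAX_VALUE': 30, 'AVG_VALUE': 20.0}
```

The key names deliberately match the NULL_ROWS, MIN_VALUE, MAX_VALUE, and AVG_VALUE columns of the Data Profiling warehouse views described in Appendix A.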
Figure 7-11 shows the Function Role Details page for Aggregate functions: Figure 7-11. Aggregate Functions
Table 7-9 shows the properties of Aggregate functions: Table 7-9. Aggregate Function Options
Property
Required/Optional
Description
Selected Column
Required
Select the column you want to apply the aggregation to.
Aggregate Functions
Required
Select the functions you want to apply to the source column.
Generate profile data by group
Optional
Select to group source rows in a particular column. You can view a result for each group.
Group by Columns
Optional
If you selected Generate Profile Data by Group, select a column to group by. You can select a column of any datatype except Binary. If you select a column of a numeric datatype, the precision must be between 1 and 28 digits. If you select a column of the String datatype, the precision must be between 1 and 200 characters.
Distinct Value Count
The Distinct Value Count function returns the number of distinct values for the column. You can also enter a calculated expression to return the number of distinct values for the column based on the result of the expression. For example, if you have work orders organized in a table by Job ID, you can use the Distinct Value Count function to find out how many unique work orders exist.
Figure 7-12 shows the Function Role Details page for the Distinct Value Count function:
Figure 7-12. Distinct Value Count Function
Click to enter a business rule.
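The distinct count, with and without a calculated expression, amounts to counting unique values after an optional transformation. A sketch (the expression is modeled here as a plain Python callable, which is an illustrative stand-in for the Rule Editor expression):

```python
def distinct_value_count(rows, expression=None):
    """Count distinct column values, optionally transformed first (illustrative)."""
    values = rows if expression is None else [expression(v) for v in rows]
    return len(set(values))

job_ids = ["J-100", "J-101", "J-100", "J-102"]
print(distinct_value_count(job_ids))                   # 3 unique work orders
print(distinct_value_count(job_ids, lambda v: v[:1]))  # 1 (all share prefix "J")
```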
Table 7-10 shows the properties of the Distinct Value Count function: Table 7-10. Distinct Value Count Function Options
Property
Required/Optional
Description
Selected Column
Required
Select the column you want to apply the function to.
Rule Summary
Optional
Click the Rule Editor button to enter an expression. Once you enter an expression, it displays in the Rule Summary dialog box. If you enter other functions available through the Rule Editor, such as Date or String functions, the session may generate unexpected results.
Generate profile data by group
Optional
Select to group source rows in a particular column. You can view a result for each group.
Group by Columns
Optional
If you selected Generate Profile Data by Group, select a column to group by. You can select a column of any datatype except Binary. If you select a column of a numeric datatype, the precision must be between 1 and 28 digits. If you select a column of the String datatype, the precision must be between 1 and 200 characters.
Specify the type of verbose data to load into the warehouse
Required
Select for the PowerCenter Server to write verbose data to the Data Profiling warehouse. You can load the following types of verbose data: - All rows - No rows - Duplicate rows only The character limit is 1,000/K characters, where K is the maximum number of bytes for each character in the Data Profiling warehouse code page. If the column exceeds this limit, the PowerCenter Server writes truncated data to the Data Profiling warehouse. For more information about configuring verbose mode, see “Configuring a Function for Verbose Mode” on page 43.
Intersource Functions
Intersource functions perform calculations on two or more sources, source groups from different sources, or mapplet output groups, and generate information about their relationship. You can add the following intersource functions to a profile:
♦ Orphan Analysis. For information about the Orphan Analysis function, see “Orphan Analysis” on page 123.
♦ Join Complexity Evaluation. For information about the Join Complexity Evaluation function, see “Join Complexity Evaluation” on page 125.
Note: If you enable the PowerCenter Server to write verbose data to the Data Profiling
warehouse, the PowerCenter Server loads all selected rows to the Data Profiling warehouse except those with the datatype Raw. You must select at least one column to write verbose data to the Data Profiling warehouse. For more information about configuring a function for verbose mode, see “Configuring a Function for Verbose Mode” on page 43.
Orphan Analysis
The Orphan Analysis function compares the values of columns in two sources. When you create this function, you select the columns that you want to analyze. During the profile session, the PowerCenter Server reports the number and percentage of rows that appear in a specified column in the master source but not in the detail source. This is useful for referential integrity analysis, also known as orphan analysis. This function can evaluate unique values in columns of numeric datatypes with a precision of 28 digits or less or columns of the String datatype with a precision of 200 characters or less. The columns can use any combination of datatypes except Date with Numeric and Raw with non-Raw. You can use Raw with Raw when you disable verbose mode.
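The referential-integrity check the function performs can be sketched as a set-membership test: master values with no match in the detail source are orphans. Source and variable names below are illustrative.

```python
def orphan_analysis(master, detail):
    """Return master values absent from the detail source, plus the percentage."""
    detail_keys = set(detail)
    orphans = [k for k in master if k not in detail_keys]
    pct = 100.0 * len(orphans) / len(master) if master else 0.0
    return orphans, pct

customers = [101, 102, 103, 104]   # master source key column
orders    = [101, 103, 103]        # detail source foreign-key column
orphans, pct = orphan_analysis(customers, orders)
print(orphans, pct)   # [102, 104] 50.0
```

In SQL terms this is a left join from master to detail keeping only unmatched rows; the orphan rows themselves are what verbose mode can write to the Data Profiling warehouse.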
Figure 7-13 shows the Function Role Details page for the Orphan Analysis function: Figure 7-13. Orphan Analysis Function
Table 7-11 shows the properties of the Orphan Analysis function: Table 7-11. Orphan Analysis Function Options
Property
Required/Optional
Description
Source
Required
The source you selected to perform the intersource function. The source is listed using the following syntax: :<Source Name> or :<Mapplet Name>.
Port Name
Required
Select the columns you want to profile for each source.
Datatype
Required
The datatype for the columns you want to profile in the corresponding source. The columns can use any combination of datatypes except Date with Numeric and Raw with non-Raw. You can use Raw with Raw when you disable verbose mode.
Specify the type of verbose data to load into the warehouse
Required
Select for the PowerCenter Server to write all unmatched or orphaned rows to the Data Profiling warehouse. You can load the following types of verbose data: - No rows - Orphan rows The character limit is 1,000/K characters, where K is the maximum number of bytes for each character in the Data Profiling warehouse code page. If the column exceeds this limit, the PowerCenter Server writes truncated data to the Data Profiling warehouse. For more information about configuring verbose mode, see “Configuring a Function for Verbose Mode” on page 43.
Join Complexity Evaluation
The Join Complexity Evaluation intersource function measures the columns in multiple sources that do not have a join condition. This function provides information to help you analyze join complexity, which is useful for designing and optimizing queries. You can select up to five sources for the Join Complexity Evaluation function.
Figure 7-14 shows the Function Role Details page for the Join Complexity Evaluation function: Figure 7-14. Join Complexity Evaluation Function
When you configure the details for the Join Complexity Evaluation function, you must select the column in each source that you want to profile. The columns can use any datatype except Binary, and any combination of datatypes except Date with Numeric, Raw with Raw, and Raw with non-Raw.
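One intuition for why unjoined columns matter: with no join condition between sources, a join degenerates to a cartesian product, so the result size is the product of the source row counts. This is an illustrative measure of join complexity, not necessarily the exact metric the function computes.

```python
from math import prod

def cartesian_product_rows(row_counts):
    """Result size of joining sources with no join condition (illustrative)."""
    return prod(row_counts)

print(cartesian_product_rows([1000, 500]))       # 500000
print(cartesian_product_rows([1000, 500, 20]))   # 10000000
```

Even modest sources multiply quickly, which is why identifying missing join conditions before designing queries pays off.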
Appendix A
Data Profiling Views
This appendix includes the following topics:
♦ Overview, 128
♦ List of Data Profiling Views, 128
Overview
PowerCenter Data Profiling provides views to create metrics and attributes in the Data Profiling schema. The prepackaged reports use these metrics and attributes to provide information about your source data. Your business needs may require you to create custom metrics, attributes, or reports using the prepackaged Data Profiling schema tables. The schema tables are views built on top of the Data Profiling tables. To understand the Data Profiling schema, you must understand the views and the information contained in each view. This appendix provides the following information about each view:
♦ Description. Provides a description of the view.
♦ Usage. Provides information about how you can use the view for analysis.
♦ Column Name. Provides the name of each column used in the view.
List of Data Profiling Views
PowerCenter Data Profiling prepackages the following views:
♦ DPR_LATEST_PRFLS, 129
♦ DPR_PRFL_AUTO_COL_FN_METRICS, 130
♦ DPR_PRFL_CART_PROD_METRICS, 131
♦ DPR_PRFL_COL_FN, 132
♦ DPR_PRFL_COL_FN_METRICS, 134
♦ DPR_PRFL_COL_FN_VERBOSE, 136
♦ DPR_PRFL_CP_FN, 137
♦ DPR_PRFL_FN_DTLS, 139
♦ DPR_PRFL_OJ_FN, 141
♦ DPR_PRFL_OJ_FN_VERBOSE, 143
♦ DPR_PRFL_OJ_METRICS, 144
♦ DPR_PRFL_RUN_DTLS, 145
♦ DPR_PRFL_SRC_FN, 146
♦ DPR_PRFL_SRC_FN_METRICS, 148
♦ DPR_PRFL_SRC_FN_VERBOSE, 149
♦ DPR_PRFL_VER_DTLS, 150
Appendix A: Data Profiling Views
DPR_LATEST_PRFLS
♦ Description. This view contains the latest version and latest session run details for every profile ID.
♦ Usage. Use this view to get the latest version and session run details.
Table A-1. DPR_LATEST_PRFLS Column Information
Column Name
Description
PROFILE_ID
Identifies the data profile in every repository and repository folder.
CURR_PRFL_VER_KEY
The surrogate key that the Designer generates for a data profile. This identifies the current version of a data profile.
CURR_PRFL_RUN_KEY
The surrogate key that the Designer generates for a profile session run. This identifies the current version of a profile session run.
CURR_VER_DT
Identifies the last saved date of the data profile.
CURR_RUN_DT
Identifies the last saved date of the profile session run.
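A typical use of this view is fetching the current run key and run date for one profile. The sketch below mocks the view in SQLite purely to show the shape of the query; the real view lives in the Data Profiling warehouse and is queried the same way, and the sample key values and dates are invented.

```python
import sqlite3

# Mock of DPR_LATEST_PRFLS (columns match Table A-1)
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE DPR_LATEST_PRFLS (
    PROFILE_ID INTEGER, CURR_PRFL_VER_KEY INTEGER,
    CURR_PRFL_RUN_KEY INTEGER, CURR_VER_DT TEXT, CURR_RUN_DT TEXT)""")
con.executemany("INSERT INTO DPR_LATEST_PRFLS VALUES (?,?,?,?,?)",
                [(1, 11, 101, "2005-02-01", "2005-02-02"),
                 (2, 21, 201, "2005-01-15", "2005-01-20")])

# Latest session run details for one data profile:
row = con.execute("""SELECT CURR_PRFL_RUN_KEY, CURR_RUN_DT
                     FROM DPR_LATEST_PRFLS WHERE PROFILE_ID = 1""").fetchone()
print(row)   # (101, '2005-02-02')
```

The run key retrieved this way can then be joined to the metrics views (for example, DPR_PRFL_COL_FN_METRICS.PRFL_RUN_KEY) to pull only the latest results.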
DPR_PRFL_AUTO_COL_FN_METRICS
♦ Description. This view contains the results (metric statistics) of the column-level functions of an auto profile.
♦ Usage. Use this view to get all metric statistics of the column-level functions of an auto profile and to analyze those statistics. This view does not apply to custom profiles.
Table A-2. DPR_PRFL_AUTO_COL_FN_METRICS Column Information
Column Name
Description
FUNCTION_KEY
The surrogate key that the Designer generates for a data profile function.
PRFL_RUN_KEY
The profile run key that the Designer generates for a profile session. Refers to DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.
TOTAL_ROWS
The total row count in a column. This applies to the column-level Business Rule Validation, Distinct Value Count, and Aggregate functions.
NULL_ROWS
The null rows count in a column. This applies to the Aggregate function.
DISTINCT_ROWS
The distinct count of values in a column. This applies to the Distinct Value Count function.
AVG_VALUE
The average value of the numeric fields of the selected column in a source. This applies to the Aggregate function.
MIN_VALUE
The minimum value of a column selected in a source. This applies to the Aggregate function.
MAX_VALUE
The maximum value of a column selected in a source. This applies to the Aggregate function.
DPR_PRFL_CART_PROD_METRICS
♦ Description. This view contains the results (metric statistics) of the Join Complexity Evaluation functions.
♦ Usage. Use this view to get the result set (metric statistics) of the Join Complexity Evaluation functions.
Table A-3. DPR_PRFL_CART_PROD_METRICS Column Information
Column Name
Description
FUNCTION_KEY
The surrogate key that the Designer generates for a data profile function.
PRFL_RUN_KEY
The profile run key that the Designer generates for a profile session. Refers to DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.
CP_VAL
The value for two or more source tables based on a joined column.
COLUMN_VALUE
The column value for the selected joined columns on two or more source tables.
DPR_PRFL_COL_FN
♦ Description. This view contains details about column-level functions, such as Aggregate and Domain Validation functions.
♦ Usage. Use this view to get all data profile column-level function details.
Table A-4. DPR_PRFL_COL_FN Column Information
Column Name
Description
FUNCTION_LEVEL
Identifies the function level for a function in a data profile. The function level is COLUMN.
FUNCTION_KEY
The surrogate key that the Designer generates for a data profile column-level function.
PRFL_VER_KEY
The profile version key that the Designer generates for a data profile. Refers to DPR_PRFL_VER_DTLS.PRFL_VER_KEY column.
SOURCE_NAME1
Identifies the name of the source object that is being profiled.
SOURCE_NAME2
This column does not apply to column-level functions.
SOURCE_NAME
Identifies the name of the source object that is being profiled.
SOURCE_TYPE1
Identifies the type of source object that is being profiled, such as a database table or mapplet.
SOURCE_TYPE2
This column does not apply to column-level functions.
SOURCE_TYPE
Identifies the type of source object that is being profiled, such as a database table or mapplet.
DBD_NAME1
Identifies the DBD name for the source in a data profile.
DBD_NAME2
This column does not apply to column-level functions.
DBD_NAME
Identifies the DBD name for a source in a data profile.
GROUP_NAME1
Identifies the group name of the source object.
GROUP_NAME2
This column does not apply to column-level functions.
GROUP_NAME
Identifies the group name of the source object.
FUNCTION_NAME
Identifies the name of a function in a data profile.
FUNCTION_TYPE
Identifies the function type for the column-level functions.
FN_TYPE_TEXT
The detailed text of the function type for the column-level functions.
RULE_VAL
Identifies the rule applied to a function. This applies to the column-level Business Rule Validation function and the Distinct Value Count function.
COLNAME1
Identifies the column for which the profile function is defined.
COLNAME2
This column does not apply to column-level functions.
COLNAME3
This column does not apply to column-level functions.
COLTYPE1
Identifies the datatype of the source column to which the profile function applies. This column applies to column-level and intersource functions.
COLTYPE2
This column does not apply to column-level functions.
COLTYPE3
This column does not apply to column-level functions.
GRP_BY_COLNAME1
Identifies the first column on which the profiling function is grouped.
GRP_BY_COLNAME2
Identifies the second column on which the profiling function is grouped.
GRP_BY_COLNAME3
Identifies the third column on which the profiling function is grouped.
GRP_BY_COLUMNS
Identifies the concatenated names of the three group-by columns.
GRP_BY_COLTYPE1
Identifies the datatype of the first column on which the profiling function is grouped.
GRP_BY_COLTYPE2
Identifies the datatype of the second column on which the profiling function is grouped.
GRP_BY_COLTYPE3
Identifies the datatype of the third column on which the profiling function is grouped.
GRP_BY_COLTYPES
Identifies the datatype of all the columns on which the profiling function is grouped.
DOMAIN_NAME
Identifies the domain name used for domain validation.
DOMAIN_TYPE
Identifies the domain type for domain validation.
DOMAIN_VALUE
Identifies the domain value for domain validation. This applies to Regular Expression and Domain Validation Filename domains.
DPR_PRFL_COL_FN_METRICS
♦ Description. This view contains the results (metric statistics) of all column-level functions.
♦ Usage. Use this view to get all metric statistics of all the column-level functions. You can use this view to analyze the result set of column-level functions.
Table A-5. DPR_PRFL_COL_FN_METRICS Column Information
Column Name
Description
FUNCTION_KEY
The surrogate key that the Designer generates for a data profile column-level function.
PRFL_RUN_KEY
The profile run key that the Designer generates for a profile session. Refers to DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.
GRP_BY_COL1_VAL
Identifies the first column value on which the profile function is grouped.
GRP_BY_COL2_VAL
Identifies the second column value on which the profile function is grouped.
GRP_BY_COL3_VAL
Identifies the third column value on which the profile function is grouped.
GRP_BY_COL_VALUES
Identifies the concatenated values of the three group-by column values.
FUNCTION_TYPE
Identifies the function types for the column-level functions.
FN_TYPE_TEXT
The detailed text of the function type for the column-level functions.
TOTAL_ROWS
The total row count in a column. This applies to the column-level Business Rule Validation, Distinct Value Count, and Aggregate functions.
SATISFIED_ROWS
The row count for rows that satisfied a rule or condition in a column. This applies to the column-level Business Rule Validation function.
UNSATISFIED_ROWS
The row count for rows that did not satisfy a rule or condition in a column. This applies to the column-level Business Rule Validation function.
DUPLICATE_ROWS
The percentage of duplicate rows in a column.
NULL_ROWS
The null row count in a column. This applies to the Aggregate function.
DISTINCT_ROWS
The distinct count of values in a column. This applies to the Distinct Value Count function.
AVG_VALUE
The average value of the numeric fields of the selected column in a source. This applies to the Aggregate function.
MIN_VALUE
The minimum value of a selected column in a source. This applies to the Aggregate function.
MAX_VALUE
The maximum value of a selected column in a source. This applies to the Aggregate function.
COLUMN_PATTERN
The list of values for domain validation and an inferred pattern for the Domain Inference function.
DOMAIN_TYPE
The domain type of domain inference. This could be a list of values or a regular expression.
DOMAIN_TOTAL_ROWS
The total row count in a source for Domain Validation and Domain Inference functions.
DOMAIN_SATISFIED_ROWS
The total row count for rows that satisfied a domain validation rule or inferred pattern. This applies to the Domain Validation and Domain Inference functions.
DOMAIN_UNSATISFIED_ROWS
The total row count for rows that did not satisfy a domain validation rule or inferred pattern. This applies to Domain Validation and Domain Inference functions.
DOMAIN_NULL_ROWS
The null row count for Domain Validation and Domain Inference functions.
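A common derived metric from this view is the domain conformance rate, computed from DOMAIN_SATISFIED_ROWS and DOMAIN_UNSATISFIED_ROWS over non-null rows. A sketch (the helper name is illustrative):

```python
def domain_conformance_pct(satisfied, unsatisfied):
    """Percent of non-null rows that satisfied the domain (illustrative)."""
    non_null = satisfied + unsatisfied
    return 100.0 * satisfied / non_null if non_null else 0.0

print(domain_conformance_pct(950, 50))   # 95.0
```

The same calculation works for Business Rule Validation results using SATISFIED_ROWS and UNSATISFIED_ROWS.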
DPR_PRFL_COL_FN_VERBOSE
♦ Description. This view contains the rejected row information for column-level functions.
♦ Usage. Use this view to get all rejected row information for the Business Rule Validation and Domain Validation column-level functions.
Table A-6. DPR_PRFL_COL_FN_VERBOSE Column Information
Column Name
Description
FUNCTION_KEY
The surrogate key that the Designer generates for a data profile column-level function.
PRFL_RUN_KEY
The profile run key that the Designer generates for a profile session run. Refers to DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.
COLUMN_VALUE
Identifies the column value in a source table.
GRP_BY_COL1_VAL
Identifies the first column value on which the profile function is grouped.
GRP_BY_COL2_VAL
Identifies the second column value on which the profile function is grouped.
GRP_BY_COL3_VAL
Identifies the third column value on which the profile function is grouped.
GRP_BY_COL_VALUES
Identifies all the column values on which the profile function is grouped.
DPR_PRFL_CP_FN
♦ Description. This view contains details about Join Complexity Evaluation functions.
♦ Usage. Use this view to get all the Join Complexity Evaluation function details.
Table A-7. DPR_PRFL_CP_FN Column Information
Column Name
Description
FUNCTION_LEVEL
Identifies the function level for a function in a data profile. The function level is INTER SOURCE.
FUNCTION_KEY
The surrogate key that the Designer generates for a data profile Join Complexity Evaluation function.
PRFL_VER_KEY
The profile version key that the Designer generates for a data profile. Refers to DPR_PRFL_VER_DTLS.PRFL_VER_KEY column.
SOURCE_NAME1
Identifies the name of the source object that is being profiled.
SOURCE_NAME2
This column does not apply to Join Complexity Evaluation functions.
SOURCE_NAME
Identifies the name of the source object that is being profiled.
SOURCE_TYPE1
Identifies the type of source object that is being profiled, such as a database table or mapplet.
SOURCE_TYPE2
This column does not apply to Join Complexity Evaluation functions.
SOURCE_TYPE
Identifies the type of source object that is being profiled, such as a database table or mapplet.
DBD_NAME1
Identifies the DBD name for the source in a data profile.
DBD_NAME2
This column does not apply to Join Complexity Evaluation functions.
DBD_NAME
Identifies the DBD name for a source in a data profile.
GROUP_NAME1
Identifies the group name of the source object.
GROUP_NAME2
This column does not apply to Join Complexity Evaluation functions.
GROUP_NAME
Identifies the group name of the source object.
FUNCTION_NAME
Identifies the name of a function in a data profile.
FUNCTION_TYPE
Identifies the function type for the Join Complexity Evaluation function.
FN_TYPE_TEXT
The detailed text of the function type for the Join Complexity Evaluation function.
RULE_VAL
This column does not apply to Join Complexity Evaluation functions.
COLNAME1
Identifies the column for which the profile function is defined.
COLNAME2
This column does not apply to Join Complexity Evaluation functions.
COLNAME3
This column does not apply to Join Complexity Evaluation functions.
COLTYPE1
Identifies the datatype of the source column to which the profile function applies.
COLTYPE2
This column does not apply to Join Complexity Evaluation functions.
COLTYPE3
This column does not apply to Join Complexity Evaluation functions.
GRP_BY_COLNAME1
This column does not apply to Join Complexity Evaluation functions.
GRP_BY_COLNAME2
This column does not apply to Join Complexity Evaluation functions.
GRP_BY_COLNAME3
This column does not apply to Join Complexity Evaluation functions.
GRP_BY_COLUMNS
This column does not apply to Join Complexity Evaluation functions.
GRP_BY_COLTYPE1
This column does not apply to Join Complexity Evaluation functions.
GRP_BY_COLTYPE2
This column does not apply to Join Complexity Evaluation functions.
GRP_BY_COLTYPE3
This column does not apply to Join Complexity Evaluation functions.
GRP_BY_COLTYPES
This column does not apply to Join Complexity Evaluation functions.
DOMAIN_NAME
This column does not apply to Join Complexity Evaluation functions.
DOMAIN_TYPE
This column does not apply to Join Complexity Evaluation functions.
DOMAIN_VALUE
This column does not apply to Join Complexity Evaluation functions.
DPR_PRFL_FN_DTLS
♦ Description. This view contains details about all source-level, column-level, and intersource functions. This view is registered in the operational schema to get all the function details.
♦ Usage. Use this view to obtain information about all profile functions.
Table A-8. DPR_PRFL_FN_DTLS Column Information
Column Name
Description
FUNCTION_LEVEL
Identifies the function level for a function in a data profile. The function levels are: SOURCE, COLUMN, and INTER SOURCE.
FUNCTION_KEY
The surrogate key that the Designer generates for a data profile function.
PRFL_VER_KEY
The profile version key that the Designer generates for a data profile. Refers to DPR_PRFL_VER_DTLS.PRFL_VER_KEY column.
SOURCE_NAME1
Identifies the name of the source object that is being profiled. For Orphan Analysis functions, this contains the master source name.
SOURCE_NAME2
Identifies the name of the detail source object that is being profiled. This only applies to Orphan Analysis functions.
SOURCE_NAME
Identifies the name of the source object that is being profiled.
SOURCE_TYPE1
Identifies the type of source object that is being profiled, such as a database table or mapplet. For Orphan Analysis functions, this identifies the type of the master source object.
SOURCE_TYPE2
Identifies the type of detail source object that is being profiled, such as a database table or mapplet. This only applies to Orphan Analysis functions.
SOURCE_TYPE
Identifies the type of source object that is being profiled, such as a database table or mapplet.
DBD_NAME1
Identifies the DBD name for the source in a data profile. For Orphan Analysis functions, this is populated with the DBD name of the master source.
DBD_NAME2
Identifies the DBD name for the detail source in a data profile. This only applies to Orphan Analysis functions.
DBD_NAME
Identifies the DBD name for a source in a data profile.
GROUP_NAME1
Identifies the group name of the source object.
GROUP_NAME2
Identifies the group name of the source object.
GROUP_NAME
Identifies the group name of the source object.
FUNCTION_NAME
Identifies the name of a function in a data profile.
FUNCTION_TYPE
Identifies the function type for each of the function levels: source level, column level, and intersource.
FN_TYPE_TEXT
The detailed text of the function type for each of the function levels: source level, column level, and intersource.
RULE_VAL
Identifies the rule applied to a function. This applies to the source-level and column-level Business Rule Validation functions and the Distinct Value Count function.
COLNAME1
Identifies the column for which the profile function is defined. This column does not apply to source-level functions.
COLNAME2
Identifies one of the source columns to which the profile function applies. If the profile function applies to various columns, this column populates. This column only applies to Orphan Analysis functions.
COLNAME3
Identifies one of the source columns to which the profile function applies. If the profile function applies to various columns, this column populates. This column only applies to Orphan Analysis functions.
COLTYPE1
Identifies the datatype of the source column to which the profile function applies. This column applies to column-level and intersource functions.
COLTYPE2
Identifies the datatype of the source column to which the profile function applies. This column applies to Orphan Analysis functions.
COLTYPE3
Identifies the datatype of the source column to which the profile function applies. If a function applies to various source columns, this column populates. This column applies to Orphan Analysis functions.
GRP_BY_COLNAME1
Identifies the first column on which the profiling function is grouped.
GRP_BY_COLNAME2
Identifies the second column on which the profiling function is grouped.
GRP_BY_COLNAME3
Identifies the third column on which the profiling function is grouped.
GRP_BY_COLNAMES
Identifies the concatenated values of the three group-by columns.
DOMAIN_NAME
Identifies the domain name used for domain validation.
DOMAIN_TYPE
Identifies the domain type for domain validation.
DOMAIN_VALUE
Identifies the domain value for domain validation. This applies to Regular Expression and Domain Validation Filename domains.
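Function detail rows such as these can be filtered by the domain columns to find every domain validation function in the warehouse. The following sketch uses an in-memory SQLite stand-in for the view with a subset of its columns; the `DOMAIN_TYPE` value and all sample data are assumptions for illustration only.

```python
import sqlite3

# In-memory stand-in for DPR_PRFL_FN_DTLS (subset of columns;
# the DOMAIN_TYPE value and sample data are invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DPR_PRFL_FN_DTLS (
        FUNCTION_NAME TEXT, COLNAME1 TEXT,
        DOMAIN_NAME TEXT, DOMAIN_TYPE TEXT, DOMAIN_VALUE TEXT)
""")
conn.executemany(
    "INSERT INTO DPR_PRFL_FN_DTLS VALUES (?, ?, ?, ?, ?)",
    [("ZipCheck", "ZIP", "ZipDomain", "Regular Expression", "[0-9]{5}"),
     ("RowCount", None, None, None, None)])

# List the domain validation functions and the pattern each one applies.
rows = conn.execute("""
    SELECT FUNCTION_NAME, COLNAME1, DOMAIN_VALUE
    FROM DPR_PRFL_FN_DTLS
    WHERE DOMAIN_TYPE = 'Regular Expression'
""").fetchall()
print(rows)  # [('ZipCheck', 'ZIP', '[0-9]{5}')]
```

Against a real Data Profiling warehouse, the same SELECT statement would run through the ODBC connection rather than SQLite.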
Appendix A: Data Profiling Views
DPR_PRFL_OJ_FN
♦ Description. This view contains details about Orphan Analysis functions.
♦ Usage. Use this view to get all the Orphan Analysis function details.
Table A-9. DPR_PRFL_OJ_FN Column Information
Column Name
Description
FUNCTION_LEVEL
Identifies the function level for a function in a data profile. The function level is INTER SOURCE.
FUNCTION_KEY
The surrogate key that the Designer generates for a data profile Orphan Analysis function.
PRFL_VER_KEY
The profile version key that the Designer generates for a data profile. Refers to DPR_PRFL_VER_DTLS.PRFL_VER_KEY column.
SOURCE_NAME1
Identifies the name of the source object that is being profiled. For Orphan Analysis functions, this contains the master source name.
SOURCE_NAME2
Identifies the name of the detail source object that is being profiled. This column only applies to Orphan Analysis functions.
SOURCE_NAME
Identifies the name of the source object that is being profiled.
SOURCE_TYPE1
Identifies the type of source object that is being profiled, such as a database table or mapplet. For Orphan Analysis functions, this identifies the type of the master source object.
SOURCE_TYPE2
Identifies the type of detail source object that is being profiled, such as a database table or mapplet. This only applies to Orphan Analysis functions.
SOURCE_TYPE
Identifies the type of source object that is being profiled, such as a database table or mapplet.
DBD_NAME1
Identifies the DBD name for the source in a data profile. For Orphan Analysis functions, this is populated with the DBD name of the master source.
DBD_NAME2
Identifies the DBD name for the detail source in a data profile. This only applies to Orphan Analysis functions.
DBD_NAME
Identifies the DBD name for a source in a data profile.
GROUP_NAME1
Identifies the group name of the source object.
GROUP_NAME2
Identifies the group name of the source object.
GROUP_NAME
Identifies the group name of the source object.
FUNCTION_NAME
Identifies the name of a function in a data profile.
FUNCTION_TYPE
Identifies the function type for the Orphan Analysis functions.
FN_TYPE_TEXT
The detailed text of the function type for the Orphan Analysis functions.
RULE_VAL
This column does not apply to Orphan Analysis functions.
COLNAME1
Identifies the column for which the profile function is defined.
COLNAME2
Identifies one of the source columns to which the profile function applies. If the profile function applies to various columns, this column populates. This column only applies to Orphan Analysis functions.
COLNAME3
Identifies one of the source columns to which the profile function applies. If the profile function applies to various columns, this column populates. This column only applies to Orphan Analysis functions.
COLTYPE1
Identifies the datatype of the source column to which the profile function applies. This column applies to column-level and intersource functions.
COLTYPE2
Identifies the datatype of the source column to which the profile function applies. This column only applies to Orphan Analysis functions.
COLTYPE3
Identifies the datatype of the source column to which the profile function applies. If a function applies to various source columns, this column populates. This column only applies to Orphan Analysis functions.
GRP_BY_COLNAME1
This column does not apply to Orphan Analysis functions.
GRP_BY_COLNAME2
This column does not apply to Orphan Analysis functions.
GRP_BY_COLNAME3
This column does not apply to Orphan Analysis functions.
GRP_BY_COLNAMES
This column does not apply to Orphan Analysis functions.
GRP_BY_COLTYPE1
This column does not apply to Orphan Analysis functions.
GRP_BY_COLTYPE2
This column does not apply to Orphan Analysis functions.
GRP_BY_COLTYPE3
This column does not apply to Orphan Analysis functions.
GRP_BY_COLTYPES
This column does not apply to Orphan Analysis functions.
DOMAIN_NAME
This column does not apply to Orphan Analysis functions.
DOMAIN_TYPE
This column does not apply to Orphan Analysis functions.
DOMAIN_VALUE
This column does not apply to Orphan Analysis functions.
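The PRFL_VER_KEY column above is documented as a reference to DPR_PRFL_VER_DTLS, so the two views can be joined to see which data profile each Orphan Analysis function belongs to. The sketch below illustrates the join with an in-memory SQLite stand-in for the warehouse; the column subset and all sample values are invented for illustration.

```python
import sqlite3

# In-memory stand-in for two Data Profiling views (subset of columns;
# sample values are invented for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DPR_PRFL_OJ_FN (
        FUNCTION_KEY INTEGER, PRFL_VER_KEY INTEGER,
        FUNCTION_NAME TEXT, SOURCE_NAME1 TEXT, SOURCE_NAME2 TEXT);
    CREATE TABLE DPR_PRFL_VER_DTLS (
        PRFL_VER_KEY INTEGER, PROFILE_NAME TEXT);
    INSERT INTO DPR_PRFL_OJ_FN
        VALUES (1, 11, 'OrphanCheck', 'CUSTOMERS', 'ORDERS');
    INSERT INTO DPR_PRFL_VER_DTLS VALUES (11, 'OrdersProfile');
""")

# Join on the documented PRFL_VER_KEY reference to list each Orphan
# Analysis function with its data profile, master, and detail sources.
rows = conn.execute("""
    SELECT v.PROFILE_NAME, f.FUNCTION_NAME, f.SOURCE_NAME1, f.SOURCE_NAME2
    FROM DPR_PRFL_OJ_FN f
    JOIN DPR_PRFL_VER_DTLS v ON v.PRFL_VER_KEY = f.PRFL_VER_KEY
""").fetchall()
print(rows)  # [('OrdersProfile', 'OrphanCheck', 'CUSTOMERS', 'ORDERS')]
```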
Appendix A: Data Profiling Views
DPR_PRFL_OJ_FN_VERBOSE
♦ Description. This view contains the rejected row information for Orphan Analysis functions.
♦ Usage. Use this view to get all rejected row information for the Orphan Analysis function.
Table A-10. DPR_PRFL_OJ_FN_VERBOSE Column Information
Column Name
Description
FUNCTION_KEY
The surrogate key that the Designer generates for a data profile function.
PRFL_RUN_KEY
The profile run key that the Designer generates for a profile session run. Refers to DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.
COL1_VALUE
Identifies the column value of the first joined column.
COL2_VALUE
Identifies the column value of the second joined column.
COL3_VALUE
Identifies the column value of the third joined column.
COL_VALUE
Identifies the concatenated text of the three column values of the joined columns.
MASTER_ROW_FLAG
Identifies whether the verbose record pertains to the master source or the detail source.
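Because MASTER_ROW_FLAG separates master-side and detail-side rejections, a simple grouped count shows which side of the join produced the orphans. A sketch with an in-memory SQLite stand-in for the view: the column names come from Table A-10, but the 'Y'/'N' flag values and the sample data are assumptions for illustration.

```python
import sqlite3

# In-memory stand-in for DPR_PRFL_OJ_FN_VERBOSE (columns from Table A-10;
# the 'Y'/'N' flag values and sample data are invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DPR_PRFL_OJ_FN_VERBOSE (
        FUNCTION_KEY INTEGER, PRFL_RUN_KEY INTEGER,
        COL1_VALUE TEXT, COL2_VALUE TEXT, COL3_VALUE TEXT,
        COL_VALUE TEXT, MASTER_ROW_FLAG TEXT)
""")
conn.executemany(
    "INSERT INTO DPR_PRFL_OJ_FN_VERBOSE VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(1, 7, "1001", None, None, "1001", "Y"),    # rejected master row
     (1, 7, "2002", None, None, "2002", "N")])   # rejected detail row

# Count rejected rows on each side of the join for one session run.
rows = conn.execute("""
    SELECT MASTER_ROW_FLAG, COUNT(*)
    FROM DPR_PRFL_OJ_FN_VERBOSE
    WHERE FUNCTION_KEY = 1 AND PRFL_RUN_KEY = 7
    GROUP BY MASTER_ROW_FLAG
    ORDER BY MASTER_ROW_FLAG
""").fetchall()
print(rows)  # [('N', 1), ('Y', 1)]
```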
DPR_PRFL_OJ_METRICS
♦ Description. This view contains the results (metric statistics) of Orphan Analysis functions.
♦ Usage. Use this view to get all the Orphan Analysis function run details.
Table A-11. DPR_PRFL_OJ_METRICS Column Information
Column Name
Description
FUNCTION_KEY
The surrogate key that the Designer generates for a data profile function.
PRFL_RUN_KEY
The profile run key that the Designer generates for a profile session run. Refers to DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.
TOTAL_PARENT_ROWS
The total row count in the master source.
TOTAL_CHILD_ROWS
The total row count in the detail source.
UNSATISFIED_PARENT_ROWS
The row count for rows that did not satisfy join conditions in the master source.
UNSATISFIED_CHILD_ROWS
The row count for rows that did not satisfy join conditions in the detail source.
SATISFIED_PARENT_ROWS
The row count for rows that satisfied join conditions in the master source.
SATISFIED_CHILD_ROWS
The row count for rows that satisfied join conditions in the detail source.
NULL_PARENT_ROWS
The null row count in the master source.
NULL_CHILD_ROWS
The null row count in the detail source.
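The metrics above can be combined to compute an orphan percentage for a run, for example the share of detail rows that found no master row. The sketch below queries an in-memory SQLite stand-in for the view; the column names come from Table A-11, but the sample counts are invented for illustration.

```python
import sqlite3

# In-memory stand-in for DPR_PRFL_OJ_METRICS (columns from Table A-11;
# sample counts are invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DPR_PRFL_OJ_METRICS (
        FUNCTION_KEY INTEGER, PRFL_RUN_KEY INTEGER,
        TOTAL_PARENT_ROWS INTEGER, TOTAL_CHILD_ROWS INTEGER,
        UNSATISFIED_PARENT_ROWS INTEGER, UNSATISFIED_CHILD_ROWS INTEGER,
        SATISFIED_PARENT_ROWS INTEGER, SATISFIED_CHILD_ROWS INTEGER,
        NULL_PARENT_ROWS INTEGER, NULL_CHILD_ROWS INTEGER)
""")
# Hypothetical run: 100 master rows, 250 detail rows, 10 detail orphans.
conn.execute(
    "INSERT INTO DPR_PRFL_OJ_METRICS VALUES (1, 7, 100, 250, 5, 10, 95, 240, 0, 0)")

# Percentage of detail (child) rows that did not satisfy the join condition.
row = conn.execute("""
    SELECT FUNCTION_KEY,
           100.0 * UNSATISFIED_CHILD_ROWS / TOTAL_CHILD_ROWS AS orphan_pct
    FROM DPR_PRFL_OJ_METRICS
    WHERE PRFL_RUN_KEY = 7
""").fetchone()
print(row)  # (1, 4.0)
```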
DPR_PRFL_RUN_DTLS
♦ Description. This view contains session run details for all data profiles.
♦ Usage. Use this view to get all session run details for all data profiles.
Table A-12. DPR_PRFL_RUN_DTLS Column Information
Column Name
Description
PRFL_VER_KEY
The profile version key that the Designer generates for a data profile. Refers to DPR_PRFL_VER_DTLS.PRFL_VER_KEY column.
PRFL_RUN_KEY
The surrogate key that the Designer generates for a profile session run.
PRFL_REPO_NAME
Identifies the name of the repository that stores the data profile.
PRFL_FOLDER_NAME
Identifies the folder that stores the data profile.
PROFILE_ID
Identifies the data profile in every repository and repository folder.
PROFILE_NAME
Identifies the name of the data profile within a folder and repository.
PRFL_VER_DT
Identifies the last saved date of the data profile.
PRFL_RUN_DT
Identifies the last run date of the session for a data profile.
PRFL_RUN_STATUS
Identifies the status of profile session runs. You can run every profile session one or more times. The run status displays the status of every session run. The run status can be: RUNNING, SUCCESS, or FAILURE.
LATEST_PRFL_RUN_FLAG
Identifies the latest profile session run for all functions.
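With PRFL_RUN_STATUS and PRFL_RUN_DT, the view answers questions such as "when did each profile last run successfully?". A sketch against an in-memory SQLite stand-in: the column names come from Table A-12 and the SUCCESS status value from the description above, while the sample data is invented for illustration.

```python
import sqlite3

# In-memory stand-in for DPR_PRFL_RUN_DTLS (subset of the Table A-12
# columns; sample data is invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DPR_PRFL_RUN_DTLS (
        PRFL_RUN_KEY INTEGER, PROFILE_NAME TEXT,
        PRFL_RUN_DT TEXT, PRFL_RUN_STATUS TEXT, LATEST_PRFL_RUN_FLAG TEXT)
""")
conn.executemany(
    "INSERT INTO DPR_PRFL_RUN_DTLS VALUES (?, ?, ?, ?, ?)",
    [(1, "CustomerProfile", "2005-01-10", "FAILURE", "N"),
     (2, "CustomerProfile", "2005-01-11", "SUCCESS", "Y")])

# Most recent successful session run of each data profile.
rows = conn.execute("""
    SELECT PROFILE_NAME, MAX(PRFL_RUN_DT)
    FROM DPR_PRFL_RUN_DTLS
    WHERE PRFL_RUN_STATUS = 'SUCCESS'
    GROUP BY PROFILE_NAME
""").fetchall()
print(rows)  # [('CustomerProfile', '2005-01-11')]
```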
DPR_PRFL_SRC_FN
♦ Description. This view contains details about the source-level functions, such as Candidate Key Evaluation and Redundancy Evaluation.
♦ Usage. Use this view to get all source-level function details.
Table A-13. DPR_PRFL_SRC_FN Column Information
Column Name
Description
FUNCTION_LEVEL
Identifies the function level for a function in a data profile. The function level is SOURCE.
FUNCTION_KEY
The surrogate key that the Designer generates for a data profile source-level function.
PRFL_VER_KEY
The profile version key that the Designer generates for a data profile. Refers to DPR_PRFL_VER_DTLS.PRFL_VER_KEY column.
SOURCE_NAME1
Identifies the name of the source object that is being profiled.
SOURCE_NAME2
This column does not apply to source-level functions.
SOURCE_NAME
Identifies the name of the source object that is being profiled.
SOURCE_TYPE1
Identifies the type of source object that is being profiled, such as a database table or mapplet.
SOURCE_TYPE2
This column does not apply to source-level functions.
SOURCE_TYPE
Identifies the type of source object that is being profiled, such as a database table or mapplet.
DBD_NAME1
Identifies the DBD name for the source in a data profile.
DBD_NAME2
This column does not apply to source-level functions.
DBD_NAME
Identifies the DBD name for a source in a data profile.
GROUP_NAME1
Identifies the group name of the source object.
GROUP_NAME2
This column does not apply to source-level functions.
GROUP_NAME
Identifies the group name of the source object.
FUNCTION_NAME
Identifies the name of a function in a data profile.
FUNCTION_TYPE
Identifies the function type for the source-level functions.
FN_TYPE_TEXT
The detailed text of the function type for source-level functions.
RULE_VAL
Identifies the rule applied to a function. This applies to the source-level Business Rule Validation function.
COLNAME1
This column does not apply to source-level functions.
COLNAME2
This column does not apply to source-level functions.
COLNAME3
This column does not apply to source-level functions.
COLTYPE1
This column does not apply to source-level functions.
COLTYPE2
This column does not apply to source-level functions.
COLTYPE3
This column does not apply to source-level functions.
GRP_BY_COLNAME1
Identifies the first column on which the profiling function is grouped.
GRP_BY_COLNAME2
Identifies the second column on which the profiling function is grouped.
GRP_BY_COLNAME3
Identifies the third column on which the profiling function is grouped.
GRP_BY_COLNAMES
Identifies the concatenated values of the three group-by columns.
GRP_BY_COLTYPE1
Identifies the datatype of the first column on which the profiling function is grouped.
GRP_BY_COLTYPE2
Identifies the datatype of the second column on which the profiling function is grouped.
GRP_BY_COLTYPE3
Identifies the datatype of the third column on which the profiling function is grouped.
GRP_BY_COLTYPES
Identifies the datatypes of all the columns on which the profiling function is grouped.
DOMAIN_NAME
This column does not apply to source-level functions.
DOMAIN_TYPE
This column does not apply to source-level functions.
DOMAIN_VALUE
This column does not apply to source-level functions.
DPR_PRFL_SRC_FN_METRICS
♦ Description. This view contains the results (metric statistics) of all source-level functions. Use this view to analyze the results of various source-level functions.
♦ Usage. Use this view to get all metric statistics of all the source-level functions.
Table A-14. DPR_PRFL_SRC_FN_METRICS Column Information
Column Name
Description
FUNCTION_KEY
The surrogate key that the Designer generates for a data profile source-level function.
PRFL_RUN_KEY
The profile run key that the Designer generates for a profile session. Refers to DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.
COLUMN_NAME
Identifies the concatenated text of one or a combination of two columns used for the Candidate Key Evaluation and Redundancy Evaluation functions.
GRP_BY_COL1_VAL
Identifies the first column value on which the profiling function is grouped.
GRP_BY_COL2_VAL
Identifies the second column value on which the profiling function is grouped.
GRP_BY_COL3_VAL
Identifies the third column value on which the profiling function is grouped.
GRP_BY_COL_VALUES
Identifies the concatenated values of the three group-by column values.
FUNCTION_TYPE
Identifies the function types for the source-level functions.
FN_TYPE_TEXT
The detailed text of the function type for the source-level functions.
TOTAL_ROWS
The total rows in a source. This applies to the source-level Business Rule Validation and Row Count functions.
TOTAL_ROWS_EVAL
The total rows in a source. This applies to the Candidate Key Evaluation and Redundancy Evaluation functions.
SATISFIED_ROWS
The total row count for rows that satisfied a rule or condition. This applies to the source-level Business Rule Validation and Row Count functions.
PERC_SATISFIED_ROWS
The percentage of rows that satisfied a rule or condition in a source. This applies to the source-level Business Rule Validation and Row Count functions.
UNSATISFIED_ROWS
The row count for rows that did not satisfy a rule or condition. This applies to the source-level Business Rule Validation and Row Count functions.
DUP_ROWS
The duplicate row count in a source. The duplicate row count is the total number of rows minus the distinct number of rows. This applies to the Redundancy Evaluation function.
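The DUP_ROWS metric (total rows minus distinct rows) makes it easy to rank columns by redundancy, since a column with zero duplicates held only unique values in that run. The sketch below uses an in-memory SQLite stand-in with a subset of the Table A-14 columns; the FUNCTION_TYPE value and all sample data are assumptions for illustration.

```python
import sqlite3

# In-memory stand-in for DPR_PRFL_SRC_FN_METRICS (subset of the
# Table A-14 columns; FUNCTION_TYPE value and data are invented).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DPR_PRFL_SRC_FN_METRICS (
        FUNCTION_KEY INTEGER, PRFL_RUN_KEY INTEGER, COLUMN_NAME TEXT,
        FUNCTION_TYPE TEXT, TOTAL_ROWS_EVAL INTEGER, DUP_ROWS INTEGER)
""")
conn.executemany(
    "INSERT INTO DPR_PRFL_SRC_FN_METRICS VALUES (?, ?, ?, ?, ?, ?)",
    [(3, 7, "CUST_ID", "Redundancy Evaluation", 1000, 0),
     (3, 7, "LAST_NAME", "Redundancy Evaluation", 1000, 120)])

# Rank columns by redundancy: DUP_ROWS is total rows minus distinct
# rows, so DUP_ROWS = 0 marks a column that held only unique values.
rows = conn.execute("""
    SELECT COLUMN_NAME, DUP_ROWS
    FROM DPR_PRFL_SRC_FN_METRICS
    WHERE FUNCTION_KEY = 3 AND PRFL_RUN_KEY = 7
    ORDER BY DUP_ROWS DESC
""").fetchall()
print(rows)  # [('LAST_NAME', 120), ('CUST_ID', 0)]
```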
DPR_PRFL_SRC_FN_VERBOSE
♦ Description. This view contains the rejected row information for source-level functions.
♦ Usage. Use this view to get all rejected row information for the Business Rule Validation source-level function.
Table A-15. DPR_PRFL_SRC_FN_VERBOSE Column Information
Column Name
Description
FUNCTION_KEY
The surrogate key that the Designer generates for a source-level function.
PRFL_RUN_KEY
The profile run key that the Designer generates for a profile session run. Refers to DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.
COLUMN_NAME
Identifies the column name in a source.
COLUMN_TYPE
Identifies the column datatype in a source.
COLUMN_VALUE
Identifies the column value in a source.
GRP_BY_COL1_VAL
Identifies the first column value on which the profile function is grouped.
GRP_BY_COL2_VAL
Identifies the second column value on which the profile function is grouped.
GRP_BY_COL3_VAL
Identifies the third column value on which the profile function is grouped.
GRP_BY_COL_VALUES
Identifies all the column values on which the profiling function is grouped.
DP_ROW_NUMBER
Identifies the row to which the column belongs. Use this row number to group the columns belonging to one row.
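Because each rejected source row is stored as one record per column, DP_ROW_NUMBER is what reassembles the original rows. The sketch below does that grouping against an in-memory SQLite stand-in with a subset of the Table A-15 columns; the sample data is invented for illustration.

```python
import sqlite3
from itertools import groupby

# In-memory stand-in for DPR_PRFL_SRC_FN_VERBOSE (subset of the
# Table A-15 columns; sample data is invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DPR_PRFL_SRC_FN_VERBOSE (
        FUNCTION_KEY INTEGER, PRFL_RUN_KEY INTEGER,
        COLUMN_NAME TEXT, COLUMN_VALUE TEXT, DP_ROW_NUMBER INTEGER)
""")
conn.executemany(
    "INSERT INTO DPR_PRFL_SRC_FN_VERBOSE VALUES (?, ?, ?, ?, ?)",
    [(5, 7, "CUST_ID", "42", 1),
     (5, 7, "STATE", "ZZ", 1),    # same rejected source row as CUST_ID 42
     (5, 7, "CUST_ID", "43", 2),
     (5, 7, "STATE", "XX", 2)])

# One record per column: use DP_ROW_NUMBER to reassemble each
# rejected source row from its per-column records.
records = conn.execute("""
    SELECT DP_ROW_NUMBER, COLUMN_NAME, COLUMN_VALUE
    FROM DPR_PRFL_SRC_FN_VERBOSE
    ORDER BY DP_ROW_NUMBER, COLUMN_NAME
""").fetchall()
rejected = {row_num: {name: value for _, name, value in cols}
            for row_num, cols in groupby(records, key=lambda r: r[0])}
print(rejected)
# {1: {'CUST_ID': '42', 'STATE': 'ZZ'}, 2: {'CUST_ID': '43', 'STATE': 'XX'}}
```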
DPR_PRFL_VER_DTLS
♦ Description. This view contains version details for all data profiles.
♦ Usage. Use this view to get details about profile versions.
Table A-16. DPR_PRFL_VER_DTLS Column Information
Column Name
Description
PRFL_VER_KEY
The surrogate key that the Designer generates for a data profile.
PRFL_REPO_NAME
Identifies the name of the repository that stores the data profile.
PRFL_FOLDER_NAME
Identifies the folder that stores the data profile.
PROFILE_ID
Identifies the data profile in every repository and repository folder.
PROFILE_NAME
Identifies the name of the data profile within a folder and repository.
PRFL_VER_DT
Identifies the last saved date of the data profile.
LATEST_PRFL_VER_FLAG
Identifies the latest version of the profile.
PROFILE_TYPE
Identifies whether the data profile is an auto profile or custom profile. Values are AUTO PROFILE and CUSTOM PROFILE.
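LATEST_PRFL_VER_FLAG lets you restrict queries to the current version of each data profile. A sketch against an in-memory SQLite stand-in with a subset of the Table A-16 columns; the 'Y'/'N' flag values and the sample data are assumptions for illustration.

```python
import sqlite3

# In-memory stand-in for DPR_PRFL_VER_DTLS (subset of the Table A-16
# columns; 'Y'/'N' flag values and sample data are invented).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DPR_PRFL_VER_DTLS (
        PRFL_VER_KEY INTEGER, PROFILE_NAME TEXT, PRFL_VER_DT TEXT,
        LATEST_PRFL_VER_FLAG TEXT, PROFILE_TYPE TEXT)
""")
conn.executemany(
    "INSERT INTO DPR_PRFL_VER_DTLS VALUES (?, ?, ?, ?, ?)",
    [(10, "OrdersProfile", "2004-12-01", "N", "AUTO PROFILE"),
     (11, "OrdersProfile", "2005-01-15", "Y", "AUTO PROFILE")])

# Current (latest) version of each data profile.
rows = conn.execute("""
    SELECT PROFILE_NAME, PRFL_VER_DT, PROFILE_TYPE
    FROM DPR_PRFL_VER_DTLS
    WHERE LATEST_PRFL_VER_FLAG = 'Y'
""").fetchall()
print(rows)  # [('OrdersProfile', '2005-01-15', 'AUTO PROFILE')]
```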
Appendix B
Code Page Compatibility
This appendix provides details about the following topics:
♦ Code Page Compatibility, 152
Code Page Compatibility
When you use Data Profiling, configure code page compatibility between all PowerCenter components and Data Profiling components and domains. Follow the instructions in “Code Pages” in the Installation and Configuration Guide to ensure that the code pages of each PowerCenter component have the correct relationship with each other. When you work with data profiles, ensure that the code pages for the Data Profiling components have the correct relationship with each other:
♦ The PowerCenter Server must use a code page that is a subset of the Data Profiling warehouse code page.
♦ The PowerCenter Server and the Repository Server must use two-way compatible code pages.
♦ The code page for the PowerCenter Client must be two-way compatible with the code page for the Repository Server.
♦ A Domain Definition Filename domain must use a code page that is a subset of the PowerCenter Server code page. For more information about creating the Domain Definition Filename domain, see “Custom Domains” on page 92.
♦ A List of Values domain must use a code page that is a subset of the code page for the operating system that hosts the PowerCenter Client. For more information about creating a List of Values domain, see “Custom Domains” on page 92.
♦ To view reports from the Profile Manager, the Data Profiling warehouse must use a code page that is two-way compatible with the code page of the operating system that hosts the PowerCenter Client. For more information about viewing Data Profiling reports from the Profile Manager, see “PowerCenter Data Profiling Reports” on page 79.
Figure B-1 shows code page compatibility requirements for Data Profiling:
Figure B-1. Configuring Code Page Compatibility
[Figure B-1 diagrams the relationships listed above among the PowerCenter Client, PowerCenter Repository Server, PowerCenter Server, Data Profiling warehouse, List of Values file, and Domain Definition file. The PowerCenter Client, Repository Server, and PowerCenter Server each use the operating system code page.]
Appendix C
Data Profiling Error Messages
This appendix provides details about the following topics:
♦ Overview, 156
♦ Designer Messages, 157
♦ Server Messages, 162
Overview
PowerCenter returns messages when you perform data profiling tasks. Some messages are errors and some are informational. You can use this chapter to help determine what causes error messages to appear and what measures you can take to correct the error. When an error message is a result of an internal error, causes are listed when possible, but contacting Informatica Technical Support is still the best recourse. For contact information, see “Obtaining Technical Support” on page xviii. Data profiling messages can originate from the following sources:
♦ PowerCenter Designer
♦ PowerCenter Server
The Designer displays messages when you create and modify data profiles. The PowerCenter Server displays messages in the session log when you run profile sessions. For other PowerCenter messages, see the Troubleshooting Guide.
Designer Messages
The Designer displays messages in message boxes, in the output window, and in the status bar as you perform a task. The messages in this section relate to data profiles and profile mappings in the Designer. For information about other error messages that appear in the Designer, see the Troubleshooting Guide.
Cannot access session log file .
Cause:
The log file might have been deleted.
Action:
Rerun the session to create a session log.
Cannot regenerate mapping for the profile since it is running.
Cause:
The Designer cannot regenerate a mapping for this data profile because a session for the mapping is running.
Action:
Wait for the session to complete before you regenerate the profile mapping.
Datatype ‘number’ of port <port name> and datatype ‘character’ of port <port name> are not the same.
Cause:
You specified ports with different datatypes for a Join Complexity Evaluation function.
Action:
Select ports with the same datatypes for the Join Complexity Evaluation function.
Failed to add plug-in menu.
Cause:
Internal error.
Action:
Contact Informatica Technical Support.
Failed to launch the Profile Wizard.
Cause:
Internal error.
Action:
Contact Informatica Technical Support.
Failed to load the mapping.
Cause:
The mapping was deleted.
Action:
Recreate the data profile.
Failed to load the mapplet.
Cause:
The mapplet was deleted.
Action:
Recreate the data profile.
Failed to validate server connection: <server name>.
Cause:
You attempted to run a profile against a PowerCenter Server that is down.
Action:
Run the profile on a PowerCenter Server that is running or start the PowerCenter Server that you wish to run the profile against.
file is not a valid domain definition file. Reason: <error message>.
Cause:
The domain definition file that you specified for the List of Values domain is empty or contains a value longer than 200 characters. The domain definition file must contain entries of 200 or fewer characters.
Action:
Make sure that your domain definition file contains valid content before you import it. or
Action:
See the additional error message for more information.
Mapping <mapping name> representing this profile is not valid. This profile cannot be run in interactive mode.
Cause:
The mapping is invalid because it has been modified.
Action:
Regenerate the profile mapping. or
Action:
If this is the first time that you generated the mapping, and you did not modify it, contact Informatica Technical Support.
No report is available for the profile .
Cause:
The PowerCenter Server has not generated a report for the selected data profile because you have not run a profile session for the data profile.
Action:
You must run a profile session before you can view a report.
No session log was created for this session.
Cause:
The session failed.
Action:
Check the workflow log or the Event Viewer for a message that indicates the reason for the failure. Correct the error and rerun the session.
Not connected to the warehouse.
Cause:
You specified an incorrect ODBC database connection for viewing reports.
Action:
Enter the correct ODBC database connection in the Profile Manager. Make sure that the target warehouse connection for viewing reports matches the relational database connection you specified when you ran the profile session.
Orphan Analysis is not valid for groups in the same source.
Cause:
You specified more than one group from the same source while adding or editing an Orphan Analysis function.
Action:
Move groups back to the Available Sources field, leaving only one group from each source in the Selected Sources field.
Port <port name> cannot be used in a Join Complexity Evaluation function, because it is of the datatype ‘binary’.
Cause:
You specified a port with a Binary datatype while adding or editing a Join Complexity Evaluation function.
Action:
Specify a port with a datatype other than Binary.
Profile information does not exist in the target warehouse.
Cause:
You did not run a profile session against the Data Profiling warehouse for which you are trying to view reports. or
Cause:
The profile was modified, and you did not run a profile session after the profile was modified.
Action:
Run a profile session against the Data Profiling warehouse for which you are trying to view reports.
Profile was created successfully but mapping <mapping name> representing this profile is not valid. This profile cannot be run in interactive mode.
Cause:
Internal error.
Action:
Contact Informatica Technical Support.
Profile was updated successfully but mapping <mapping name> representing this profile is not valid. This profile cannot be run in interactive mode.
Cause:
Internal error.
Action:
Contact Informatica Technical Support.
Profiling for this source is not supported.
Cause:
Internal error.
Action:
Contact Informatica Technical Support.
Regular expression is invalid.
Cause:
You specified an invalid regular expression.
Action:
Make sure that your regular expression contains valid characters.
Regular expression validation failed at character position <position>: <error message>.
Cause:
The Designer could not validate the regular expression.
Action:
Check the additional error message for more information.
Some fields in source <source file name> group cannot be found. The source might be corrupt.
Cause:
The source has been modified since you last edited the data profile.
Action:
Create a new data profile for the modified source.
Target warehouse does not exist.
Cause:
The Data Profiling warehouse connection that you specified for viewing reports does not match the relational database connection that you specified for your session target.
Action:
In the Profile Manager, modify your connection for viewing reports to match the relational database connection that you specified for your session target. or
Cause:
Your database does not contain Data Profiling warehouse tables.
Action:
Rerun the Data Profiling warehouse script in the PowerCenter Client directory. Commit your SQL script after you run it. Rerun the profile session and try to view the report again.
The mapping corresponding to this profile cannot be found. The repository might be corrupt.
Cause:
The mapping corresponding to this data profile might have been deleted.
Action:
Delete the data profile and create a new one.
The session log editor could not be invoked.
Cause:
The path that you entered for the session log editor is invalid.
Action:
In the Profile Manager, enter a valid path for your session log editor.
The target warehouse does not contain profile results for repository .
Cause:
The Data Profiling warehouse connection that you specified for viewing reports does not match the relational database connection that you specified for your session target.
Action:
In the Profile Manager, modify your connection for viewing reports to match the relational database connection that you specified for your session target.
There are no relational connections in this repository.
Cause:
The Connection Browser does not contain any relational database connections for the selected repository.
Action:
Import a relational database connection from another repository, or create a relational database connection in the selected repository.
This is already used for repository .
Cause:
The Data Profiling warehouse that you specified for the profile session has been used for a different repository. A Data Profiling warehouse can contain information for only one repository.
Action:
Specify the correct Data Profiling warehouse for the profile session that you are trying to run.
This mapplet is not valid. Invalid mapplet cannot be profiled.
Cause:
The mapplet that you used to create an auto profile is invalid.
Action:
Open your mapplet in the Mapplet Designer, and choose Mapplet-Validate to view the mapplet error. The error displays in the Output window. Correct the error and try to create the auto profile again.
Unable to load the profile.
Cause:
The Designer cannot load the data profile.
Action:
See the additional error message for more information.
You cannot profile a mapplet with transformations that generate transaction controls.
Cause:
You tried to create a data profile for a mapplet with a transformation that generates transaction controls. For example, you tried to create a data profile for a mapplet with a Custom transformation configured for transaction control.
Action:
Make sure the transformations in your mapping are not configured for transaction control.
Server Messages
The PowerCenter Server writes error and informational messages about profile sessions in the session log. This section lists error messages specific to profile sessions. For information about other error messages in the session log, see the Troubleshooting Guide.
DP Codes
The following messages may appear when you run a profile session:
DP_90001
Invalid target type. Profile mapping targets should either be relational or null.
Cause:
The profile mapping contains target definitions that are not relational or null. The targets in the profile mapping might have been modified.
Action:
Recreate the profile.
DP_90002
All the targets in a profiling mapping should use the same connection and have the same connection attributes.
Cause:
There are two or more relational database connections configured for targets in the profile mapping.
Action:
Make sure you use the same relational database connection for all targets in a profiling mapping.
DP_90003
Create server database connection failed.
Cause:
The database driver might not be set up correctly.
Action:
Check your database driver configuration.
DP_90004
Connection to the database using user <username>, connect string failed. Reason: <error message>.
Cause:
The username or connect string is invalid.
Action:
Verify that your username and connect string values are valid. or
Action:
See the additional error message for more information.
DP_90005
Cannot find the profile object for this mapping.
Cause:
You ran a session with a copied or deployed mapping that contains reusable domains. You cannot run a session with a copied or deployed profile mapping that contains reusable domains.
Action:
Run a profile session with the original profile mapping.
DP_90007
Profile metadata deserialization failed with error <error message>.
Cause:
Your Data Profiling Client or Server installation might be missing a DTD file. The file may have been moved from the PowerCenter Client or Server installation directory.
Action:
If the PowerCenter Client or Server installation directory does not contain the DTD file, reinstall the PowerCenter Server or Client. If the problem persists, contact Informatica Technical Support. or
Cause:
The DTD file does not have read permission.
Action:
Verify that you have read permission on the DTD file.
DP_90008
Commit to the database failed with error .
Cause:
A database error prevents the PowerCenter Server from loading data into your Data Profiling warehouse tables.
Action:
Fix the database error indicated in the message and run the session again. If Data Profiling warehouse tables are incomplete, recreate the Data Profiling warehouse.
DP_90009
SQL Prepare failed for statement <SQL statement> with error .
Cause:
The SQL query failed.
Action:
Fix the database error indicated in the message, and rerun the Data Profiling warehouse script in the Data Profiling installation directory. Commit your SQL script after you run it.
DP_90010
SQL Bind failed for statement <SQL statement> with error <error message>.
Cause:
The Data Profiling warehouse tables are invalid.
Action:
Rerun the Data Profiling warehouse script in the Data Profiling installation directory. Commit your SQL script after you run it.
DP_90011
SQL Execute failed for statement <SQL statement> with error <error message>.
Cause:
The Data Profiling warehouse tables are invalid.
Action:
Rerun the Data Profiling warehouse script in the Data Profiling installation directory. Commit your SQL script after you run it.
DP_90012
SQL Fetch failed for statement <SQL statement> with error <error message>.
Cause:
The Data Profiling warehouse tables are invalid.
Action:
Rerun the Data Profiling warehouse script in the Data Profiling installation directory. Commit your SQL script after you run it.
Server Messages
DP_90013
Fetching the key for type <type> failed.
Cause:
A Data Profiling warehouse table is missing surrogate keys. The following table lists the surrogate keys that correspond to the key types in error messages:

Type    Surrogate Key
0       Profile run key
1       Column key
2       Source function key
3       Column function key
4       Orphan Analysis function key
5       Join Complexity Evaluation function key
Action:
Rerun the Data Profiling warehouse script in the Data Profiling installation directory. Commit your SQL script after you run it.
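When reading session logs, the type-to-key mapping in the table above can serve as a quick lookup. This sketch is purely illustrative (the helper function is not part of any Informatica tool); the codes and names come directly from the table:

```python
# Map the numeric key type in a DP_90013 message to its surrogate key name.
# Codes and names are taken from the table above; the helper is illustrative.
SURROGATE_KEYS = {
    0: "Profile run key",
    1: "Column key",
    2: "Source function key",
    3: "Column function key",
    4: "Orphan Analysis function key",
    5: "Join Complexity Evaluation function key",
}

def surrogate_key_name(key_type: int) -> str:
    """Return the surrogate key name for a key type, or a fallback label."""
    return SURROGATE_KEYS.get(key_type, "unknown key type")
```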
DP_90014
There must be exactly one input group and one output group for this transformation.
Cause:
The Custom transformation in the profile mapping has been modified and is invalid.
Action:
Regenerate the profile mapping.
DP_90015
The output port <port name> datatype should be long.
Cause:
The Custom transformation in the profile mapping has been modified and is invalid.
Action:
Regenerate the profile mapping.
DP_90016
The target warehouse is already used for repository with GUID <GUID>. Either drop the warehouse tables or use a different one.
Cause:
You tried to use two repositories for the same Data Profiling warehouse.
Action:
Create a second Data Profiling warehouse. Also, create a new relational database connection to the second Data Profiling warehouse in the Workflow Manager.
DP_90017
The profile warehouse tables are not present in the target database connection. Please check the target connection information.
Cause:
The Data Profiling warehouse tables are not in the target database.
Action:
Run the Data Profiling warehouse script in the Data Profiling installation directory. Commit your SQL script after you run it.
DP_90019
Failed to get the folder information from repository.
Cause:
Internal error.
Action:
Contact Informatica Technical Support.
DP_90020
Failed to get the metadata extension <metadata extension> for the mapping.
Cause:
You copied a profile mapping without copying the metadata extensions.
Action:
Copy the metadata extensions from the original profile mapping to the copied mapping and run the session again. or
Cause:
You are running a session with an original profile mapping that corresponds to a data profile, but the metadata extensions are deleted.
Action:
Regenerate the profile mapping.
DP_90022
Profile has some invalid functions. Session run failed.
Cause:
You tried to run a profile session for a data profile. The data profile contains invalid functions.
Action:
Edit the data profile to modify or remove the invalid functions.
DP_90023
Warehouse table PMDP_WH_VERSION is missing. The target warehouse version is incorrect. You may need to upgrade the warehouse.
Cause:
The version of the Data Profiling warehouse does not match the version of PowerCenter.
Action:
Upgrade the Data Profiling warehouse using the upgrade script for your database type.
DP_90024
The target warehouse uses schema version <schema version> and data version <data version>. You may need to upgrade the warehouse.
Cause:
The version of the Data Profiling warehouse does not match the version of PowerCenter.
Action:
Upgrade the Data Profiling warehouse using the upgrade script for your database type.
DP_90026
Warehouse table PMDP_WH_VERSION is missing or column SCHEMA_VERSION/DATA_VERSION/DATABASE_TYPE is missing. You need to upgrade the warehouse.
Cause:
The version of the Data Profiling warehouse does not match the version of PowerCenter.
Action:
Upgrade the Data Profiling warehouse using the upgrade script for your database type.
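The condition behind DP_90023 through DP_90026 can be checked directly against the warehouse before running a session. This sketch uses sqlite3 only as a stand-in for the warehouse database; the table and column names (PMDP_WH_VERSION, SCHEMA_VERSION, DATA_VERSION, DATABASE_TYPE) come from the messages above, but the full table layout in an actual warehouse may differ:

```python
import sqlite3

# Column names are taken from the DP_90026 message text.
REQUIRED_COLUMNS = {"SCHEMA_VERSION", "DATA_VERSION", "DATABASE_TYPE"}

def check_warehouse_version(conn):
    """Return (ok, detail) describing whether PMDP_WH_VERSION looks usable."""
    cur = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='PMDP_WH_VERSION'"
    )
    if cur.fetchone() is None:
        return False, "table PMDP_WH_VERSION is missing"
    cols = {row[1] for row in conn.execute("PRAGMA table_info(PMDP_WH_VERSION)")}
    missing = REQUIRED_COLUMNS - cols
    if missing:
        return False, "missing column(s): " + ", ".join(sorted(missing))
    return True, "version table present"

# Example: a warehouse created by an older script, before DATABASE_TYPE existed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PMDP_WH_VERSION (SCHEMA_VERSION INT, DATA_VERSION INT)")
ok, detail = check_warehouse_version(conn)
print(ok, detail)  # this warehouse would need the upgrade script
```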
DP_90029
Source index of mapping parameter <mapping parameter> is invalid.
Cause:
The profile mapping is modified.
Action:
Regenerate the profile mapping.
DP_90030
Missing mapping parameter(s) for source with index <index>.
Cause:
The profile mapping is modified.
Action:
Regenerate the profile mapping.
DP_90031
Source Qualifier transformation [%s] was not found in this mapping.
Cause:
The profile mapping is modified.
Action:
Regenerate the profile mapping.
DP_90401
Failed to load List of Values from the file <file name>.
Cause:
You tried to load a domain definition file for a List of Values domain type. However, the file path might be incorrect.
Action:
Verify your domain definition file path. or
Cause:
The file does not exist in the specified location.
Action:
Make sure that the file exists in the specified location. or
Cause:
The file is empty.
Action:
Make sure that the domain definition file contains valid entries.
DP_90403
List of Values domain cannot be empty.
Cause:
You tried to load a domain definition file for a List of Values domain type. However, the file might be empty.
Action:
Make sure that your domain definition file contains valid content.
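The failure causes behind DP_90401 and DP_90403 (bad path, missing file, empty file) can be checked before a session runs. This is an illustrative pre-flight check, not part of the Data Profiling Client; the comma-delimited file format follows the List of Values domain description in the glossary:

```python
import os

def validate_lov_file(path):
    """Pre-flight check for a List of Values domain definition file.

    Returns the list of values, or raises ValueError mirroring the
    DP_90401/DP_90403 failure causes described above.
    """
    if not os.path.isfile(path):
        raise ValueError(f"domain definition file not found: {path}")
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    # List of Values domains are defined by a comma-delimited list of values.
    values = [v.strip() for v in content.split(",") if v.strip()]
    if not values:
        raise ValueError("List of Values domain cannot be empty")
    return values
```

A file containing `CA, NY, TX` yields `['CA', 'NY', 'TX']`; a missing or empty file raises the same conditions these messages report.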
DP_90404
Failed to expand List of Values file <file name>.
Cause:
You tried to load a domain definition file for a List of Values domain type. However, the domain definition file path contains an unrecognized variable.
Action:
Check the domain definition file path that you entered.
DP_90405
Failed to open Domain definition file <expanded file path>.
Cause:
You tried to load a domain definition file for a List of Values domain type. However, the domain definition file does not exist in the specified location.
Action:
Make sure that you place the domain definition file in the file path location that you enter.
DP_90406
Fetching Custom transformation input group failed.
Cause:
Internal error.
Action:
Contact Informatica Technical Support.
DP_90407
Fetching Custom transformation output group(s) failed.
Cause:
Internal error.
Action:
Contact Informatica Technical Support.
DP_90408
Unexpected number of input/output groups received.
Cause:
You tried to run a session with a copied or deployed profile mapping that contains reusable domains.
Action:
You must use the original profile mapping. or
Cause:
You copied a profile transformation to use in another mapping.
Action:
You must use the original profile mapping.
DP_90409
Could not fetch output port <port number>.
Cause:
You tried to run a session with a copied or deployed profile mapping that contains reusable domains.
Action:
You must use the original profile mapping. or
Cause:
You copied a profile transformation to use in another mapping.
Action:
You must use the original profile mapping.
DP_90410
Unexpected number of input/output ports received.
Cause:
You tried to run a session with a copied or deployed profile mapping that contains reusable domains.
Action:
You must run a profile session with the original profile mapping if the mapping contains reusable domains. or
Cause:
You copied a profile transformation to use in another mapping.
Action:
You must run a profile session with the original profile mapping.
DP_90411
Corrupt input received.
Cause:
You tried to run a session with a copied or deployed profile mapping that contains reusable domains.
Action:
You must run a profile session with the original profile mapping if the mapping contains reusable domains. or
Cause:
You copied a profile transformation to use in another mapping.
Action:
You must run a profile session with the original profile mapping.
DP_90603
Regular expression is invalid.
Cause:
The regular expression is not valid. As a result, the session failed.
Action:
Edit the Regular Expression function in the data profile and verify the expression against your source data.
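Because a Regular Expression domain uses perl syntax (see the glossary), an invalid expression often surfaces only when the session runs. A pattern can be sanity-checked beforehand; this sketch uses Python's `re` module, whose syntax is perl-like but not identical to the syntax PowerCenter accepts, so treat a pass here as a hint rather than a guarantee:

```python
import re

def regex_is_valid(pattern, samples=()):
    """Compile a pattern and optionally try it against sample source values."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        return False, f"invalid expression: {exc}"
    misses = [s for s in samples if compiled.fullmatch(s) is None]
    return True, f"{len(samples) - len(misses)} of {len(samples)} samples matched"

# A five-digit postal code pattern checked against a few sample values.
print(regex_is_valid(r"[0-9]{5}", ["90210", "1234A"]))
print(regex_is_valid(r"[0-9"))  # unterminated character set: reported as invalid
```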
DP_90604
Unexpected condition encountered.
Cause:
Internal error.
Action:
Contact Informatica Technical Support.
DP_90802
No input port(s) found for output port <port>.
Cause:
The profile mapping is modified.
Action:
Regenerate the profile mapping.
DP_90803
Invalid number of input port(s) associated with output port <port>.
Cause:
The profile mapping is modified.
Action:
Regenerate the profile mapping.
DP_90804
There are input port(s) with no corresponding output port.
Cause:
The profile mapping is modified.
Action:
Regenerate the profile mapping.
Appendix D
Glossary

This appendix contains information on the following topics:
♦ Glossary Terms, 170
Glossary Terms

Aggregate functions
Functions that calculate an aggregate value for a numeric or string value applied to one column of a profile source.

analytic workflow
A list of PowerAnalyzer reports linked together in a hierarchy consisting of a primary report and related workflow reports. See PowerAnalyzer Data Profiling reports.

auto profile
A data profile containing a predetermined set of functions for profiling source data.

auto profile report
A PowerCenter Data Profiling report that displays information about source data based on the functions in an auto profile.

Average Value function
A column-level aggregate function that calculates an aggregate value for the average value of the rows in the source column.

Business Rule Validation column-level function
A column-level function that calculates the number of rows in a single source column that satisfy a business rule and the number of rows that do not.

Business Rule Validation source-level function
A source-level function that calculates the number of rows for one or more columns of a source that satisfy a business rule and the number of rows for one or more columns of a source that do not.

Candidate Key Evaluation function
A source-level function that calculates the number of duplicate values in one or more columns of a source.

column-level functions
Functions that perform calculations on one column of a source, source group, or mapplet group.

custom domain
A domain you create to validate source values or to infer patterns from source data. You can create a custom domain when you create a Domain Validation function. See domain and Domain Validation function.
custom profile
A data profile for which you define functions to profile source data.

custom profile report
A PowerCenter Data Profiling report that displays information about source data based on the functions in a custom profile.

data connector
A requirement for PowerAnalyzer to connect to a data source and read data for Data Profiling reports. Typically, PowerAnalyzer uses the system data connector to connect to all the data sources required for reports.

data profile
A profile of source data in PowerCenter. A data profile contains functions that perform calculations on the source data.

Data Profiling views
Views to create metrics and attributes in the Data Profiling schema. PowerCenter Data Profiling reports and PowerAnalyzer Data Profiling reports use these metrics and attributes. You can also use these views to create reports in your BI tool.

Data Profiling warehouse
A relational database that stores data profile information from profile sessions.

Distinct Value Count function
A column-level function that returns the number of distinct values for a column.

domain
A set of all valid values for a source column. A domain can contain a regular expression, list of values, or the name of a file that contains a list of values. Informatica provides prepackaged domains. You can also create your own domains.

Domain Definition Filename domain
Domains defined by an external file containing a list of values.

Domain Inference function
A column-level function that reads all values in the column and infers a pattern that fits the data. The function determines if the values fit a list of values derived from the column values or a pattern that describes the pattern of the source data.
Domain Validation function
A column-level function that calculates the number of values in the profile source column that fall within a specified domain and the number of values that do not. The Domain Validation function requires that you specify a domain.

group-by columns
Columns by which you want to group data for a custom profile. When you configure a function, you can determine the column by which you want to group the data.

interactive session
A profile session that you run from the Profile Manager.

intersource functions
Functions that perform calculations on two or more sources, source groups, or mapplet groups.

Join Complexity Evaluation function
An intersource function that measures the columns in multiple sources that do not have a join condition.

List of Values domain
Domains defined by a comma-delimited list of values.

Maximum Value aggregate function
A column-level aggregate function that calculates an aggregate value for the maximum value of rows in the source column.

Minimum Value aggregate function
A column-level aggregate function that calculates an aggregate value for the minimum value of rows in the source column.

non-reusable domain
A domain that applies to one Domain Validation function. See also reusable domain.

NULL Value Count aggregate function
A column-level aggregate function that calculates an aggregate value for the number of rows with NULL values in the source column.

Orphan Analysis function
An intersource function that compares the values of columns in two sources.
persistent session
A session stored in the repository that you run from the Profile Manager or the Workflow Manager. You can use a persistent session to run a profile mapping more than once.

PowerAnalyzer Data Profiling reports
Reports you can view from PowerAnalyzer after running a profile session. PowerAnalyzer reports provide a time-dimensional view of your data and information about rejected rows in your profile results.

PowerCenter Data Profiling reports
Reports you can view from the Profile Manager after running a profile session. PowerCenter reports provide information based on the last profile session run.

prepackaged domains
Domains Informatica provides, which verify data such as phone numbers, postal codes, and email addresses. See domain.

Profile Manager
A tool in the Designer that manages data profiles. Use the Profile Manager to set default data profile options, work with data profiles in your repository, run profile sessions, view profile results, and view sources and mapplets with at least one profile defined for them.

profile mapping
A mapping the Designer generates when you create a data profile. The PowerCenter repository stores the data profile and the associated mapping.

profile session
A session for a profile mapping that gathers information about your source data. The Data Profiling warehouse stores the results of profile sessions. See persistent session and interactive session.

Redundancy Evaluation function
A source-level function that calculates the number of duplicate values in one or more columns of the source.

Regular Expression domain
A domain defined by a range of values in an expression.

reusable domain
A domain you can apply to multiple Domain Validation functions in one or more data profiles. See domain.

Row Count function
A source-level function that counts the number of rows read from the source during a profile session.

Row Uniqueness function
A source-level function that calculates the number of unique and duplicate values based on the columns selected. You can profile all columns in the source row or choose individual columns to profile.

source-level function
A function that performs calculations on two or more columns of a source, source group, or mapplet group.

temporary session
A session that is run from the Profile Manager and is not stored to the repository.

verbose mode
An option to view verbose data that the PowerCenter Server writes to the Data Profiling warehouse during a profile session. You can specify the type of verbose data to load when you configure the data profile.
Index
A
ABAP program: generating for SAP R/3 sources 33
adding: functions 39
aggregate data: profiling 52
Aggregate functions: description 119
auto profile reports: description 79
auto profiles: auto profile reports 79; creating 34; deleting 49; editing 47; running a session after creating an auto profile 67
automatic random sampling: description 71

B
buffer block size: See Workflow Administration Guide
Business Rule Validation function: column-level function description 113; source-level function description 106

C
Candidate Key Evaluation function: description 108; profiling primary key columns 108
checking in: profile mappings 27
COBOL syntax: converting to perl syntax 96
code pages: configuring compatibility 152; Data Profiling report requirements 152; rules for Domain Definition Filename domain 152; rules for List of Values domain 152; specifying for a Domain Definition Filename domain 99; specifying for a List of Values domain 99; syntax for Domain Definition Filename domain 99; syntax for List of Values domain 99
column-level functions: Aggregate functions 119; Business Rule Validation 113; description 113; Distinct Value Count 121; Domain Inference 117; Domain Validation 115
configuring: default data profile options 26; functions 41; sessions 64
copying: profile mappings 51
creating: auto profiles 34; custom profiles 38; data connectors 21; Data Profiling warehouse on IBM DB2 12; Data Profiling warehouse on Informix 12; Data Profiling warehouse on Microsoft SQL Server 13; Data Profiling warehouse on Oracle 13; Data Profiling warehouse on Sybase 13; Data Profiling warehouse on Teradata 13; Domain Definition Filename domain 98; List of Values domain 92; Regular Expression domain 94
custom domains: description 92
custom profile reports: viewing 81
custom profiles: creating 38; deleting 49; editing 47; running a session 67

D
data connectors: adding the Data Profiling data source 23; creating 21; primary data source 22; properties 22
data profiles: creating a custom profile 38; creating an auto profile 34; deleting 49; editing 47
Data Profiling: during mapping development 2; during production 2; overview 2
Data Profiling reports: auto profile reports 79; custom profile reports 81; importing into PowerAnalyzer 16; requirements for viewing from the Profile Manager 152; sample verbose summary report 55; viewing in PowerCenter 83
Data Profiling schemas: importing 16
Data Profiling warehouse: configuring a data source 20; configuring a relational database connection 15; creating 12; overview 3; upgrading 13
data samples: profiling 68; profiling relational data samples 70
data sources: adding to data connector 23; configuring for the Data Profiling warehouse 20; primary in data connector 22; supported databases 20
datatypes: Raw 113, 123
default data profile options: configuring 26
deleting: data profiles 49; domains 102
Distinct Value Count function: description 121
documentation: conventions xvi; description xv; online xvi
Domain Definition Filename domain: code page compatibility 152; creating 98; description 92; syntax for specifying code pages 99
Domain Inference function: configuring Domain Inference settings 118; description 117
Domain Validation function: description 115
domains: creating 90; custom 92; deleting 102; description 90; Domain Definition Filename domain 92; editing 101; List of Values domain 92; non-reusable 92; prepackaged 91; Regular Expression domain 92; reusable 92

E
editing: data profiles 47; domains 101
error messages: designer 157; DP codes 162; overview 156; session logs 162

F
functions: adding 39; column-level functions 113; configuring 41; configuring for verbose mode 43; configuring group-by columns 42; intersource functions 123; source-level functions 105

G
generating: ABAP program for SAP R/3 sources 33; profile mappings 45
glossary: terms 170
group-by columns: configuring for functions 42

H
historical data: sampling 72

I
IBM DB2: creating a Data Profiling warehouse 12; upgrading a Data Profiling warehouse 14
IMA views 128
importing: Data Profiling schemas and reports 16
Informatica: documentation xv; Webzine xvii
Informix: creating a Data Profiling warehouse 12; upgrading a Data Profiling warehouse 14
installing: PowerAnalyzer Data Profiling reports 16; XML scripts 16
interactive profiling: See interactive sessions
interactive sessions: configuring 45; definition 63; monitoring in the Profile Manager 67; monitoring with the Workflow Monitor 67; viewing the session log 67
intersource functions: description 123; Join Complexity Evaluation 125; Orphan Analysis 123

J
Join Complexity Evaluation function: description 125

K
keys: profiling primary key columns 108

L
List of Values domain: code page compatibility requirements 152; creating 92; description 92; syntax for specifying code pages 99
localization: specifying for a Domain Definition Filename domain 99; specifying for a List of Values domain 99

M
manual random sampling: description 70
mappings: generating a profile mapping 45; prefix for the profile mapping name 28
mapplets: extending data profiling functions 52; joining output data from two sources 56; merging mapplet output data 58; profiling 32
Microsoft SQL Server: creating a Data Profiling warehouse 13; upgrading a Data Profiling warehouse 14
modifying: profile mappings 51
monitoring: interactive sessions 67
multiple sources: profiling 33

N
non-reusable domains: description 92
normal mode: running temporary sessions 63

O
Oracle: creating a Data Profiling warehouse 13; upgrading a Data Profiling warehouse 14
Orphan Analysis function: description 123

P
perl syntax: using in a Regular Expression domain 94
persistent sessions: definition 63; running from the Profile Manager 63; running in real time 74
PowerAnalyzer: adding data sources 23; configuring a data source 20; creating data connectors 21; importing Data Profiling schemas and reports 16
PowerAnalyzer Data Profiling reports: description 78; installing 16; viewing verbose data 85
PowerCenter Data Profiling reports: auto profile reports 79; custom profile reports 81; description 78
PowerCenter Server: performing automatic random sampling 71; performing manual random sampling 70
prepackaged domains: description 91
primary data sources: data connector 22
primary keys: profiling with the Candidate Key Evaluation function 108
profile functions: description 33; extending functionality 52
Profile Manager: checking in profile mappings 6; creating custom profiles 6; creating domains 90; deleting data profiles 6; editing data profiles 6; Profile View 7; regenerating profile mappings 6; running interactive sessions 6; running persistent sessions 63; running sessions 67; running temporary sessions 63; Source View 8; using 6; viewing data profile details 6
profile mappings: checking in 6; copying 51; copying with reusable domains 51; modifying 51; prefix for the profile mapping name 28; regenerating 6
profile sessions: See also interactive sessions; See also persistent sessions; configuring to use data samples 68; prefix for the profile session name 28; troubleshooting 75
Profile View: in Profile Manager 7
Profile Wizard: configuring sessions 64; creating a custom profile 38; creating an auto profile 34; creating domains 90
profile workflows: prefix for the profile workflow name 28; requirements for creating in the Workflow Manager 74
profiling: aggregate data 52; data samples 68; eligible sources 32; mapplets 32; relational data samples 70; SAP R/3 sources 33; sources with matching ports 55, 57

R
Raw datatype: verbose mode sessions 113, 123
real-time sessions: running 74
Redundancy Evaluation function: description 109
Regular Expression domain: creating 94; description 92; using perl syntax 94
relational data samples: delegating to the database 71; improving data accuracy for historical data 72; increasing performance 73; sampling as the PowerCenter Server reads data 70
relational database connections: See also Workflow Administration Guide; configuring for the Data Profiling warehouse 15
reports: auto profile reports 79; custom profile reports 81; PowerAnalyzer Data Profiling reports 85; PowerCenter Data Profiling reports 79
reusable domains: copying a profile mapping 51; description 92
Row Count function: description 105
Row Uniqueness function: description 110

S
SAP R/3 sources: profiling 33
scripts: See XML scripts
session configuration: enabling 39
session logs: description 162; viewing for interactive sessions 67
sessions: See also interactive sessions; See also persistent sessions; configuring 64; creating in the Workflow Manager 74; running after creating a custom profile 67; running for auto profiles 67; running from the Profile Manager 67
Source View: in Profile Manager 8
source-level functions: Business Rule Validation 106; Candidate Key Evaluation 108; description 105; loading the Raw datatype for verbose mode 113, 123; Redundancy Evaluation 109; Row Count 105; Row Uniqueness 110
sources: eligible sources for profiling 32; profiling multiple sources 33
SQL syntax: converting to perl syntax 96
Sybase: creating a Data Profiling warehouse 13; upgrading a Data Profiling warehouse 15

T
temporary sessions: definition 63; running from the Profile Manager 63; running in normal mode 63
Teradata: creating a Data Profiling warehouse 13; upgrading a Data Profiling warehouse 15
troubleshooting: profile sessions 75

U
upgrading: Data Profiling warehouse on IBM DB2 14; Data Profiling warehouse on Informix 14; Data Profiling warehouse on Microsoft SQL Server 14; Data Profiling warehouse on Oracle 14; Data Profiling warehouse on Sybase 15; Data Profiling warehouse on Teradata 15

V
verbose data: description 4; viewing in PowerAnalyzer Data Profiling reports 85
verbose mode: configuring a function 43
verbose mode sessions: Raw datatype 113, 123
versioned objects: See also Repository Guide; checking in profile mappings 6
viewing: auto profile reports 79; custom profile reports 81; PowerAnalyzer Data Profiling reports 85; PowerCenter Data Profiling reports 83
views: descriptions 128; list of 128

W
warehouse: See Data Profiling warehouse
webzine: xvii
Workflow Manager: creating a session and workflow 74
Workflow Monitor: monitoring interactive sessions 67
workflows: creating in the Workflow Manager 74

X
XML scripts: installing 16