Expression Editor
• To expedite the validation of a newly created expression, press OK to initiate the parsing/validation of the expression, then press OK again in the "Expression parsed successfully" pop-up.
• To select PowerCenter functions and ports during expression creation, use the Functions and Ports Tab.
Creating Inventories of Reusable Objects & Mappings
Challenge
Successfully creating inventories of reusable objects and mappings, including identifying potential economies of scale in loading multiple sources to the same target.
Description
Reusable Objects
The first step in creating an inventory of reusable objects is to review the business requirements and look for any common routines/modules that may appear in more than one data movement. These common routines are excellent candidates for reusable objects. In PowerCenter, reusable objects can be single transformations (lookups, filters, etc.) or even a string of transformations (mapplets). Evaluate potential reusable objects by two criteria:
• Is there enough usage and complexity to warrant the development of a common object?
• Are the data types of the information passing through the reusable object the same from case to case, or is it simply the same high-level steps with different fields and data?
Common objects are sometimes created just for the sake of creating common components when in reality, creating and testing the object does not save development time or future maintenance. For example, if there is a simple calculation like subtracting a current rate from a budget rate that will be used for two different mappings, carefully consider whether the effort to create, test, and document the common object is worthwhile. Often, it is simpler to add the calculation to both mappings. However, if the calculation were to be performed in a number of mappings, if it was very difficult, and if all occurrences would be updated following any change or fix – then this would be an ideal case for a reusable object. The second criterion for a reusable object concerns the data that will pass through the reusable object. Many times developers see a situation where they may perform a certain type of high-level process (e.g., filter, expression, update strategy) in two
or more mappings; at first look, this seems like a great candidate for a mapplet. However, after performing half of the mapplet work, the developers may realize that the actual data or ports passing through the high level logic are totally different from case to case, thus making the use of a mapplet impractical. Consider whether there is a practical way to generalize the common logic, so that it can be successfully applied to multiple cases. Remember, when creating a reusable object, the actual object will be replicated in one to many mappings. Thus, in each mapping using the mapplet or reusable transformation object, the same size and number of ports must pass into and out of the mapping/reusable object. Document the list of the reusable objects that pass this criteria test, providing a high-level description of what each object will accomplish. The detailed design will occur in a future subtask, but at this point the intent is to identify the number and functionality of reusable objects that will be built for the project. Keep in mind that it will be impossible to identify 100 percent of the reusable objects at this point; the goal here is to create an inventory of as many as possible, and hopefully the most difficult ones. The remainder will be discovered while building the data integration processes. Mappings A mapping is an individual movement of data from a source system to a target system. In a simple world, a single source table would populate a single target table. However, in practice, this is usually not the case. Sometimes multiple sources of data need to be combined to create a target table, and sometimes a single source of data creates many target tables. The latter is especially true for mainframe data sources where COBOL OCCURS statements litter the landscape. In a typical warehouse or data mart model, each OCCURS statement decomposes to a separate table. The goal here is to create an inventory of the mappings needed for the project. For this exercise, the challenge is to think in individual components of data movement. While the business may consider a fact table and its three related dimensions as a single ‘object’ in the data mart or warehouse, five mappings may be needed to populate the corresponding star schema with data (i.e., one for each of the dimension tables and two for the fact table, each from a different source system). Typically, when creating an inventory of mappings, the focus is on the target tables, with an assumption that each target table has its own mapping, or sometimes multiple mappings. While often true, if a single source of data populates multiple tables, this approach yields multiple mappings. Efficiencies can sometimes be realized by loading multiple tables from a single source. By simply focusing on the target tables, however, these efficiencies can be overlooked. A more comprehensive approach to creating the inventory of mappings is to create a spreadsheet listing all of the target tables. Create a column with a number next to each Target table. For each of the target tables, in another column, list the source file or table that will be used to populate the table. In the case of multiple source tables per target, create two rows for the target each with the same number, and list the additional source(s) of data. The Table would look similar to the following:
Number   Target Table    Source
1        Customers       Cust_File
2        Products        Items
3        Customer_Type   Cust_File
4        Orders_Item     Tickets
4        Orders_Item     Ticket_Items
When completed, the spreadsheet can be sorted either by target table or source table. Sorting by source table can help determine potential mappings that create multiple targets. When using a source to populate multiple tables at once for efficiency, be sure to keep restartability/reloadability in mind. The mapping will always load two or more target tables from the source, so there will be no easy way to rerun a single table. In this example, the Customers table and the Customer_Type table can potentially be loaded in the same mapping. When merging targets into one mapping in this manner, give both targets the same number. Then, re-sort the spreadsheet by number. For the mappings with multiple sources or targets, merge the data back into a single row to generate the inventory of mappings, with each number representing a separate mapping. The inventory would look similar to the following:
Number   Target Table               Source
1        Customers, Customer_Type   Cust_File
2        Products                   Items
4        Orders_Item                Tickets, Ticket_Items
At this point, it is often helpful to record some additional information about each mapping to help with planning and maintenance. First, give each mapping a name. Apply the naming standards generated in 2.2 DESIGN DEVELOPMENT ARCHITECTURE. These names can then be used to distinguish mappings from each other and also can be put on the project plan as individual tasks. Next, determine for the project a threshold for a High, Medium, or Low number of target rows. For example, in a warehouse where dimension tables are likely to number in the thousands and fact tables in the hundred thousands, the following thresholds might apply:
Low – 1 to 10,000 rows
Med – 10,000 to 100,000 rows
High – 100,000 rows +
Assign a likely row volume (High, Med or Low) to each of the mappings based on the expected volume of data to pass through the mapping. These high level estimates will help to determine how many mappings are of ‘High’ volume; these mappings will be the first candidates for performance tuning. Add any other columns of information that might be useful to capture about each mapping, such as a high-level description of the mapping functionality, resource (developer) assigned, initial estimate, actual completion time, or complexity rating.
Updating Repository Statistics
Challenge
The PowerCenter repository has more than eighty tables, and nearly all use one or more indexes to speed up queries. Most databases keep and use column distribution statistics to determine which index to use in order to optimally execute SQL queries. Database servers do not update these statistics continuously, so they quickly become outdated in frequently-used repositories, and SQL query optimizers may choose a less-than-optimal query plan. In large repositories, choosing a sub-optimal query plan can drastically affect performance. As a result, the repository becomes slower and slower over time.
Description
The Database Administrator needs to continually update the database statistics to ensure that they remain up-to-date. The frequency of updating depends on how heavily the repository is used. Because the statistics need to be updated table by table, it is useful for Database Administrators to create scripts to automate the task. For the repository tables, it is helpful to understand that all PowerCenter repository table and index names begin with "OPB_" or "REP_". The following information is useful for generating scripts to update distribution statistics.
Oracle
Run the following queries:
select 'analyze table ', table_name, ' compute statistics;' from user_tables where table_name like 'OPB_%'
select 'analyze index ', INDEX_NAME, ' compute statistics;' from user_indexes where INDEX_NAME like 'OPB_%'
This will produce output like:
'ANALYZETABLE' TABLE_NAME 'COMPUTESTATISTICS;'
analyze table OPB_ANALYZE_DEP compute statistics;
analyze table OPB_ATTR compute statistics;
analyze table OPB_BATCH_OBJECT compute statistics;
. . .
'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'
analyze index OPB_DBD_IDX compute statistics;
analyze index OPB_DIM_LEVEL compute statistics;
analyze index OPB_EXPR_IDX compute statistics;
. . .
Save the output to a file. Then, edit the file and remove all the headers (i.e., the lines that look like: 'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'). Run this as a SQL script. This updates statistics for the repository tables.
MS SQL Server
Run the following query:
select 'update statistics ', name from sysobjects where name like 'OPB_%'
This will produce output like:
name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT
. . .
Save the output to a file, then edit the file and remove the header information (i.e., the top two lines) and add a 'go' at the end of the file. Run this as a SQL script. This updates statistics for the repository tables.
Sybase
Run the following query:
select 'update statistics ', name from sysobjects where name like 'OPB_%'
This will produce output like:
name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT
. . .
Save the output to a file, then remove the header information (i.e., the top two lines), and add a 'go' at the end of the file. Run this as a SQL script. This updates statistics for the repository tables.
Informix
Run the following query:
select 'update statistics low for table ', tabname, ' ;' from systables where tabname like 'opb_%' or tabname like 'OPB_%';
This will produce output like:
(constant) tabname (constant)
update statistics low for table OPB_ANALYZE_DEP ;
update statistics low for table OPB_ATTR ;
update statistics low for table OPB_BATCH_OBJECT ;
. . .
Save the output to a file, then edit the file and remove the header information (i.e., the top line that looks like: (constant) tabname (constant)). Run this as a SQL script. This updates statistics for the repository tables.
DB2
Run the following query:
select 'runstats on table ', (rtrim(tabschema)||'.')||tabname, ' and indexes all;' from sysstat.tables where tabname like 'OPB_%'
This will produce output like:
runstats on table PARTH.OPB_ANALYZE_DEP and indexes all;
runstats on table PARTH.OPB_ATTR and indexes all;
runstats on table PARTH.OPB_BATCH_OBJECT
and indexes all;
. . .
Save the output to a file. Run this as a SQL script to update statistics for the repository tables.
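Because these statistics have to be refreshed regularly, the generate-and-run cycle above is usually scripted. The following is a minimal sketch for the Oracle case only; it assumes a UNIX host with sqlplus on the path, and repuser/reppasswd@repdb is a placeholder for the repository schema owner. It concatenates the ANALYZE text instead of selecting it as separate columns, so no header lines need to be stripped by hand:

#!/bin/sh
# Sketch: refresh optimizer statistics on the PowerCenter repository tables (Oracle).
# repuser/reppasswd@repdb is a placeholder for the repository schema owner.
sqlplus -s repuser/reppasswd@repdb <<'EOF'
set pagesize 0 heading off feedback off verify off
spool analyze_repo.sql
select 'analyze table ' || table_name || ' compute statistics;'
  from user_tables where table_name like 'OPB_%';
select 'analyze index ' || index_name || ' compute statistics;'
  from user_indexes where index_name like 'OPB_%';
spool off
@analyze_repo.sql
exit
EOF

Equivalent wrappers for the other databases would follow the same pattern around the queries shown above.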
Daily Operations
Challenge
Once the data warehouse has been moved to production, the most important task is keeping the system running and available for the end users.
Description
In most organizations, the day-to-day operation of the data warehouse is the responsibility of a Production Support Team. This team is typically involved with the support of other systems and has expertise in database systems and various operating systems. The Data Warehouse Development team becomes, in effect, a customer of the Production Support team. To that end, the Production Support team needs two documents, a Service Level Agreement and an Operations Manual, to help in the support of the production data warehouse.
Service Level Agreement
The Service Level Agreement outlines how the overall data warehouse system will be maintained. This is a high-level document that discusses the system to be maintained and the components of the system, and identifies the groups responsible for monitoring the various components of the system. At a minimum, it should contain the following information:
• Times when the system should be available to users
• Scheduled maintenance window
• Who is expected to monitor the operating system
• Who is expected to monitor the database
• Who is expected to monitor the Informatica sessions
• How quickly the support team is expected to respond to notifications of system failures
• Escalation procedures that include data warehouse team contacts in the event that the support team cannot resolve the system failure
Operations Manual
The Operations Manual is crucial to the Production Support team because it provides the information needed to perform the maintenance of the data warehouse system. This manual should be self-contained, providing all of the information necessary for a production support operator to maintain the system and resolve most problems that may arise. This manual should contain information on how to maintain all components of the data warehouse system. At a minimum, the Operations Manual should contain:
• Information on how to stop and re-start the various components of the system
• IDs and passwords (or how to obtain passwords) for the system components
• Information on how to re-start failed PowerCenter sessions
• A listing of all jobs that are run, their frequency (daily, weekly, monthly, etc.), and the average run times
• Who to call in the event of a component failure that cannot be resolved by the Production Support team
Load Validation
Challenge
Knowing that all data for the current load cycle has loaded correctly is essential for good data warehouse management. However, the need for load validation varies, depending on the extent of error checking, data validation, or data cleansing functionality inherent in your mappings.
Description
Methods for validating the load process range from simple to complex. The first step is to determine what information you need for load validation (e.g., batch names, session names, session start times, session completion times, successful rows and failed rows). Then, you must determine the source of this information. All of this information is stored as metadata in the repository, but you must have a means of extracting it. Finally, you must determine how you want this information presented to you. Do you want it stored as a flat file? Do you want it e-mailed to you? Do you want it available in a relational table, so that history can easily be preserved? All of these factors weigh in finding the correct solution for you. The following paragraphs describe three possible solutions for load validation, beginning with a fairly simple solution and moving toward the more complex:
1. Post-session e-mails on either success or failure
Post-session e-mail is configured in the session, under the General tab and 'Session Commands'. The following variables are available to simplify the text of the e-mail:
%s Session name
%e Session status
%b Session start time
%c Session completion time
%i Session elapsed time
%l Total records loaded
%r Total records rejected
%t Target table details
%m Name of the mapping used in the session
%n Name of the folder containing the session
%d Name of the repository containing the session
%g Attach the session log to the message
TIP: One practical application of this functionality is the situation in which a key business user waits for completion of a session to run a report. You can configure e-mail to this user, notifying him/her that the session was successful and the report can run.
2. Query the repository
Almost any query can be put together to retrieve data about the load execution from the repository. The MX view REP_SESS_LOG is a great place to start. This view is likely to contain all the information you need. The following sample query shows how to extract folder name, session name, session end time, successful rows and session duration:
select subject_area, session_name, session_timestamp, successful_rows,
       (session_timestamp - actual_start) * 24 * 60 * 60
from rep_sess_log a
where session_timestamp = (select max(session_timestamp)
                           from rep_sess_log
                           where session_name = a.session_name)
order by subject_area, session_name
The sample output would look like this:
Folder Name     Session Name                  Session End Time      Successful Rows   Failed Rows   Session Duration (sec's)
Web Analytics   S M W DYNMIC KEYS FILE LOAD   5/8/2001 7:49:18 AM   12900             0             126
Web Analytics   SMW LOAD WEB FACT             5/8/2001 7:53:01 AM   125000            0             478
Finance         SMW NEW LOANS                 5/8/2001 8:06:01 AM   35987             0             178
Finance         SMW UPD LOANS                 5/8/2001 8:10:32 AM   45                0             12
HR              SMW NEW PERSONNEL             5/8/2001 8:15:27 AM   5                 0             10
3. Use a mapping
A more complex approach, and the most customizable, is to create a PowerCenter mapping to populate a table or flat file with desired information. You can do this by sourcing the MX view REP_SESS_LOG and then performing lookups to other repository tables or views for additional information. The following graphic illustrates a sample mapping:
This mapping selects data from REP_SESS_LOG and performs lookups to retrieve the absolute minimum and maximum run times for that particular session. This enables you to compare the current execution time with the minimum and maximum durations.
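For reference, the comparison that such a mapping performs can be approximated directly in SQL against the same view. This is only a sketch: it uses the REP_SESS_LOG columns already shown above (SUBJECT_AREA, SESSION_NAME, SESSION_TIMESTAMP, ACTUAL_START) and expresses all durations in seconds.

-- Sketch: compare each session's latest duration with its historical minimum and maximum.
select cur.subject_area,
       cur.session_name,
       (cur.session_timestamp - cur.actual_start) * 86400 as current_duration,
       hist.min_duration,
       hist.max_duration
from   rep_sess_log cur,
       (select session_name,
               min((session_timestamp - actual_start) * 86400) as min_duration,
               max((session_timestamp - actual_start) * 86400) as max_duration
        from   rep_sess_log
        group  by session_name) hist
where  cur.session_name = hist.session_name
and    cur.session_timestamp = (select max(session_timestamp)
                                from   rep_sess_log
                                where  session_name = cur.session_name)
order  by cur.subject_area, cur.session_name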
Third Party Scheduler
Challenge Successfully integrate a third-party scheduler with PowerCenter. This Best Practice describes various levels to integrate a third-party scheduler. Description When moving into production, many companies require the use of a third-party scheduler that is the company standard. A third-party scheduler can start and stop an Informatica session or batch using the PMCMD commands. Because PowerCenter has a scheduler, there are several levels at which to integrate a third-party scheduler with PowerCenter. The correct level of integration depends on the complexity of the batch/schedule and level and type of production support. Third Party Scheduler Integration Levels In general, there are three levels of integration between a third-party scheduler and Informatica: Low Level, Medium Level, and High Level. Low Level Low level integration refers to a third-party scheduler kicking off only one Informatica session or a batch. That initial PowerCenter process subsequently kicks off the rest of the sessions and batches. The PowerCenter scheduler handles all processes and dependencies after the third-party scheduler has kicked off the initial batch or session. In this level of integration, nearly all control lies with the PowerCenter scheduler. This type of integration is very simple and should only be used as a loophole to fulfill a corporate mandate on a standard scheduler. A low level of integration is very simple to implement because the third-party scheduler kicks off only one process. The third-party scheduler is not adding any functionality that cannot be handled by the PowerCenter scheduler.
Low level integration requires production support personnel to have a thorough knowledge of PowerCenter. Because many companies only have Production Support personnel with knowledge of the company's standard scheduler, one of the main disadvantages of this level of integration is that if a batch fails at some point, the Production Support personnel may not be able to determine the exact breakpoint. Thus, the majority of the production support burden falls back on the Project Development team.
Medium Level
Medium level integration is when a third-party scheduler kicks off many different batches or sessions, but not all sessions. A third-party scheduler may kick off several PowerCenter batches and sessions, but within those batches PowerCenter may have several sessions defined with dependencies; PowerCenter controls the dependencies within those batches. In this level of integration, control is shared between PowerCenter and the third-party scheduler.
This type of integration is more complex than low level integration because there is much more interaction between the third-party scheduler and PowerCenter. However, to reduce the total amount of work required to integrate the third-party scheduler and PowerCenter, many of the PowerCenter sessions may be left in batches. This reduces the integration chores because the third-party scheduler is only communicating with a limited number of PowerCenter batches.
Medium level integration requires Production Support personnel to have a fairly good knowledge of PowerCenter. Because Production Support personnel in many companies are knowledgeable only about the company's standard scheduler, one significant disadvantage of this level of integration is that if a batch fails at some point, the Production Support personnel may not be able to determine the exact breakpoint; they can probably determine the general area, but not necessarily the specific session. Thus, the production support burden is shared between the Project Development team and the Production Support team.
High Level
High level integration is when a third-party scheduler has full control of scheduling and kicks off all PowerCenter sessions. Because the PowerCenter sessions are not part of any batches, the third-party scheduler controls all dependencies among the sessions. This type of integration is the most complex to implement because there are many more interactions between the third-party scheduler and PowerCenter.
High level integration allows the Production Support personnel to have only limited knowledge of PowerCenter. Because the Production Support personnel in many companies are knowledgeable only about the company's standard scheduler, one of the main advantages of this level of integration is that if a batch fails at some point, the Production Support personnel are usually able to determine the exact breakpoint. Thus, the production support burden lies with the Production Support team.
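Whatever the level of integration, the scheduler usually calls a small wrapper script rather than pmcmd directly, so that failures surface through the script's exit code. The sketch below is deliberately generic: the pmcmd command and its arguments (user, password, server, folder, session or batch name) are passed in by the scheduler job, because the exact pmcmd syntax differs between PowerCenter releases and should be taken from the pmcmd reference.

#!/bin/sh
# Sketch: generic wrapper a third-party scheduler can call around pmcmd.
# Usage: run_load.sh <label> <pmcmd command and version-specific arguments...>
LABEL="$1"; shift

"$@"                      # run the pmcmd command exactly as supplied
RC=$?

if [ "$RC" -ne 0 ]; then
    echo "PowerCenter load ${LABEL} failed (pmcmd return code ${RC})" >&2
    exit "$RC"            # non-zero exit lets the scheduler raise an alert
fi
echo "PowerCenter load ${LABEL} completed successfully"
exit 0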
Event Based Scheduling
Challenge
In an operational environment, the start of a session often needs to be triggered by another session or some other event. The best method of event-based scheduling with the PowerCenter Server is the use of indicator files.
Description
The indicator file is specified in the session configuration, under advanced options. The file used as the indicator file must be in a location where the PowerCenter Server can find it, much like a flat file source. When the session starts, the PowerCenter Server looks for the existence of this file and removes it when it sees it. If the session is waiting on its source file to be FTP'ed from another server, the FTP process should be scripted so that it creates the indicator file upon successful completion of the source file FTP. The indicator file can be an empty, or dummy, file; its mere existence is enough to signal that the session should start. Because the dummy file is removed immediately after it is located, it is essential that you do not use your flat file source as the indicator file.
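A minimal sketch of such an FTP script is shown below. The host, login, and file names are illustrative only, and a production version should also examine the FTP output for transfer errors before sending the indicator file.

#!/bin/sh
# Sketch: deliver a source file to the PowerCenter host, then send an empty
# indicator file so that the waiting session can start.
SRC=/data/outbound/orders.dat
IND=/tmp/orders.ind
DEST_HOST=etl01
DEST_DIR=/informatica/srcfiles

: > "$IND"     # create the empty (dummy) indicator file

ftp -n "$DEST_HOST" <<EOF
user ftpuser ftppassword
put $SRC $DEST_DIR/orders.dat
put $IND $DEST_DIR/orders.ind
bye
EOF
# The session's indicator file option points at $DEST_DIR/orders.ind; the
# PowerCenter Server deletes that file as soon as it starts the session.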
Repository Administration
Challenge
The task of managing the repository, either in development or production, is extremely important. A number of best practices are available to facilitate the tasks involved with this responsibility.
Description
The following paragraphs describe several of the key tasks involved in managing the repository.
Backing Up the Repository
Two back-up methods are available: (1) the PowerCenter Repository Manager or the 'pmrep' command-line utility, and (2) the traditional database backup method. The native PowerCenter backup is required; using the database backup as well is recommended, though not essential. If database corruption occurs, the native PowerCenter backup provides a clean backup that can be restored to a new database.
Analyzing Tables in the Repository
If operations in any of the client tools, including connectivity to the PowerCenter repository, are slowing down, you may need to analyze the tables in the repository to facilitate data retrieval, thereby increasing performance.
Purging Old Session Log Information
Similarly, if folder copies are taking an unusually long time, the likely cause is that large OPB_SESSION_LOG and/or OPB_SESS_TARG_LOG tables are being transferred along with the folder. Removing unnecessary data from these tables will expedite the repository backup process as well as the folder copy operation. To determine which logs to eliminate, execute the following select statement to retrieve the sessions with the most entries in OPB_SESSION_LOG:
select subj_name, sessname, count(*)
from opb_session_log a, opb_subject b, opb_load_session c
where a.session_id = c.session_id
and b.subj_id = c.subj_id
group by subj_name, sessname
order by count(*) desc
1. Copy the original session, then delete the original session. When a session is copied, the entries in the repository tables are not duplicated. When you delete the session, the entries in the tables are deleted, eliminating all rows for an individual session.
2. Log into Repository Manager and expand the sessions in a particular folder. When you select one of the sessions, all of the session logs appear on the right-hand side of the screen. You can manually delete any of these by highlighting a particular log, then selecting Delete from the Edit menu. Respond 'Yes' when the system prompts you with the question "Delete these logs from the Repository?"
pmrep Utility
The pmrep utility was introduced in PowerCenter 5.0 to facilitate repository administration and server-level administration. This utility is a command-line program for Windows 95/98 or Windows NT/2000 that updates session-related parameters in a PowerCenter repository. It is a standalone utility that installs in the PowerCenter Client installation directory. It is not currently available for UNIX. The pmrep utility has two modes: command line and interactive.
• Command line mode lets you execute pmrep commands from the Windows command line. This mode invokes and exits each time a command is issued. Command line mode is useful for batch files or scripts.
• Interactive mode invokes pmrep and allows you to issue a series of commands from a pmrep prompt without exiting after each command.
The following examples illustrate the use of pmrep:
Example 1: Script to backup PowerCenter Repository
echo Connecting to repository
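A minimal sketch of what such a backup script typically contains is shown below. The repository name, user, and output path are examples only, and the exact pmrep connect options vary by PowerCenter release, so verify them against the pmrep reference before use.

#!/bin/sh
# Sketch: nightly repository backup via pmrep in command line mode.
# Repository name, user, password and paths are examples only.
echo Connecting to repository
pmrep connect -r Production_Repo -n Administrator -x AdminPasswd

echo Backing up repository
pmrep backup -o /informatica/backups/Production_Repo_`date +%Y%m%d`.rep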
Example 2: Script to update database connection information
echo Connecting to repository Informatica Repository
The repository connection information stored in the client registry includes:
• Repository name
• Database username and password (must be in US-ASCII)
• Repository username and password (must be in US-ASCII)
• ODBC data source name (DSN)
The registry does not include the ODBC data source. If you import a registry containing a DSN that does not exist on that client system, the connection fails. Be sure to have the appropriate data source configured under the exact same name as the registry you are going to import, for each imported DSN.
High Availability
Challenge In a highly available environment, load schedules cannot be impacted by the failure of physical hardware. The PowerCenter Server must be running at all times. If the machine hosting the PowerCenter Server goes down, another machine must recognize this and start another Server and assume responsibility for running the sessions and batches. This is best accomplished in a clustered environment. Description While there are many types of hardware and many ways to configure a clustered environment, this example is based on the following hardware and software characteristics: • • • •
• 2 Sun 4500s, running Solaris OS
• Sun High-Availability Clustering Software
• External EMC storage, with each server owning specific disks
• PowerCenter installed on a separate disk that is accessible by both servers in the cluster, but only by one server at a time
One of the Sun 4500s serves as the primary data integration server, while the other server in the cluster is the secondary server. Under normal operations, the PowerCenter Server 'thinks' it is physically hosted by the primary server and uses the resources of the primary server, although it is physically installed on its own disk. When the primary server goes down, the Sun high-availability software automatically starts the PowerCenter Server on the secondary server using the basic auto start/stop scripts that are used in many UNIX environments to automatically start the PowerCenter Server whenever a host is rebooted. In addition, the Sun high-availability software changes the ownership of the disk where the PowerCenter Server is installed from the primary server to the secondary server. To facilitate this, a logical IP address can be created specifically for the PowerCenter Server. This logical IP address is specified in the pmserver.cfg file instead of the physical IP addresses of the servers. Thus, only one pmserver.cfg file is needed.
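The auto start/stop script referred to above is typically an rc-style script owned by the cluster software. The sketch below assumes an installation directory of /informatica/pm and a UNIX account named pmuser (both assumptions); a graceful shutdown through the pmcmd stop command of your release is preferable where available, so the stop branch here is only a crude fallback.

#!/bin/sh
# Sketch: start/stop script the clustering software calls on failover.
PM_HOME=/informatica/pm
PM_USER=pmuser

case "$1" in
  start)
        # pmserver reads pmserver.cfg, which points at the logical IP address
        su - $PM_USER -c "cd $PM_HOME && ./pmserver" &
        ;;
  stop)
        # Prefer the pmcmd stop command for your release; signalling the
        # process directly is only a fallback.
        pkill -u $PM_USER pmserver
        ;;
  *)
        echo "Usage: $0 {start|stop}"
        exit 1
        ;;
esac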
Recommended Performance Tuning Procedures
Challenge
Efficient and effective performance tuning for PowerCenter products.
Description
Performance tuning procedures consist of the following steps, in a pre-determined order, to pinpoint where tuning efforts should be focused.
1. Perform benchmarking. Benchmark the sessions to set a baseline to measure improvements against.
2. Monitor the server. By running a session and monitoring the server, it should immediately be apparent if the system is paging memory or if the CPU load is too high for the number of available processors. If the system is paging, correcting the system to prevent paging (e.g., increasing the physical memory available on the machine) can greatly improve performance.
3. Use the performance details. Re-run the session and monitor the performance details. This time, look at the details and watch for the buffer input and outputs for the sources and targets.
4. Tune the source system and target system based on the performance details. When the source and target are optimized, re-run the session to determine the impact of the changes.
5. Only after the server, source, and target have been tuned to their peak performance should the mapping be analyzed for tuning.
6. After the tuning achieves a desired level of performance, the DTM should be the slowest portion of the session details. This indicates that the source data is arriving quickly, the target is inserting the data quickly, and the actual application of the business rules is the slowest portion. This is the optimum desired performance. Only minor tuning of the session can be conducted at this point, and it usually has only a minor effect.
7. Finally, re-run the sessions that have been identified as the benchmark, comparing the new performance with the old performance. In some cases, optimizing one or two sessions to run quickly can have a disastrous effect on another mapping and care should be taken to ensure that this does not occur.
Performance Tuning Databases
Challenge Database tuning can result in tremendous improvement in loading performance. This Best Practice covers tips on tuning several databases: Oracle, SQL Server and Teradata. Oracle Performance Tuning Tools Oracle offers many tools for tuning an Oracle instance. Most DBAs are already familiar with these tools, so we’ve included only a short description of some of the major ones here. •
V$ Views: V$ views are dynamic performance views that provide real-time information on database activity, enabling the DBA to draw conclusions about database performance. Because SYS is the owner of these views, only SYS can query them by default. Keep in mind that querying these views impacts database performance, with each query having an immediate hit; carefully consider which users should be granted the privilege to query them. You can grant viewing privileges with either the 'SELECT' privilege on individual V$ views, or the 'SELECT ANY TABLE' privilege, which allows the user to view all V$ views. Using the SELECT ANY TABLE option requires the 'O7_DICTIONARY_ACCESSIBILITY' parameter to be set to 'TRUE', which allows the 'ANY' keyword to apply to SYS-owned objects. (A combined example of using these tools appears after this list.)
•
Explain Plan Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing bottlenecks and developing a strategy to avoid them. Explain Plan allows the DBA or developer to determine the execution path of a block of SQL code. The SQL in a source qualifier or in a lookup that is running
for a long time should be generated and copied to SQL*PLUS or other SQL tool and tested to avoid inefficient execution of these statements. Review the PowerCenter session log for long initialization time (an indicator that the source qualifier may need tuning) and the time it takes to build a lookup cache to determine if the SQL for these transformations should be tested. •
SQL Trace SQL Trace extends the functionality of Explain Plan by providing statistical information about the SQL statements executed in a session that has tracing enabled. This utility is run for a session with the ‘ALTER SESSION SET SQL_TRACE = TRUE’ statement.
•
TKPROF The output of SQL Trace is provided in a dump file that is difficult to read. TKPROF formats this dump file into a more understandable report.
•
UTLBSTAT & UTLESTAT Executing ‘UTLBSTAT’ creates tables to store dynamic performance statistics and begins the statistics collection process. Run this utility after the database has been up and running (for hours or days). Accumulating statistics may take time, so you need to run this utility for a long while and through several operations (i.e., both loading and querying). ‘UTLESTAT’ ends the statistics collection process and generates an output file called ‘report.txt.’ This report should give the DBA a fairly complete idea about the level of usage the database experiences and reveal areas that should be addressed.
Disk I/O Disk I/O at the database level provides the highest level of performance gain in most systems. Database files should be separated and identified. Rollback files should be separated onto their own disks because they have significant disk I/O. Co-locate tables that are heavily used with tables that are rarely used to help minimize disk contention. Separate indexes so that when queries run indexes and tables, they are not fighting for the same resource. Also be sure to implement disk striping; this, or RAID technology can help immensely in reducing disk contention. While this type of planning is time consuming, the payoff is well worth the effort in terms of performance gains. Memory and Processing Memory and processing configuration is done in the init.ora file. Because each database is different and requires an experienced DBA to analyze and tune it for optimal performance, a standard set of parameters to optimize PowerCenter is not practical and will probably never exist.
TIP: Changes made in the init.ora file will take effect after a restart of the instance. Use svrmgr to issue the commands “shutdown” and “startup” (eventually “shutdown immediate”) to the instance. The settings presented here are those used in a 4-CPU AIX server running Oracle 7.3.4 set to make use of the parallel query option to facilitate parallel processing of queries and indexes. We’ve also included the descriptions and documentation from Oracle for each setting to help DBAs of other (nonOracle) systems to determine what the commands do in the Oracle environment to facilitate setting their native database commands and settings in a similar fashion. •
HASH_AREA_SIZE = 16777216 o Default value: 2 times the value of SORT_AREA_SIZE o Range of values: any integer o This parameter specifies the maximum amount of memory, in bytes, to be used for the hash join. If this parameter is not set, its value defaults to twice the value of the SORT_AREA_SIZE parameter. o The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. (Note: ALTER SESSION refers to the Database Administration command issued at the svrmgr command prompt.)
•
Optimizer_percent_parallel=33 This parameter defines the amount of parallelism that the optimizer uses in its cost functions. The default of 0 means that the optimizer chooses the best serial plan. A value of 100 means that the optimizer uses each object's degree of parallelism in computing the cost of a full table scan operation. The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. Low values favor indexes, while high values favor table scans. Cost-based optimization is always used for queries that reference an object with a nonzero degree of parallelism. For such queries, a RULE hint or optimizer mode or goal is ignored. Use of a FIRST_ROWS hint or optimizer mode overrides a nonzero setting of OPTIMIZER_PERCENT_PARALLEL.
•
parallel_max_servers=40 o Used to enable parallel query. o Initially not set on Install. o Maximum number of query servers or parallel recovery processes for an instance.
•
Parallel_min_servers=8 o Used to enable parallel query.
o Initially not set on Install.
o Minimum number of query server processes for an instance. This is also the number of query server processes Oracle creates when the instance is started.
•
SORT_AREA_SIZE=8388608 o Default value: Operating system-dependent o Minimum value: the value equivalent to two database blocks o This parameter specifies the maximum amount, in bytes, of Program Global Area (PGA) memory to use for a sort. After the sort is complete and all that remains to do is to fetch the rows out, the memory is released down to the size specified by SORT_AREA_RETAINED_SIZE. After the last row is fetched out, all memory is freed. The memory is released back to the PGA, not to the operating system. o Increasing SORT_AREA_SIZE size improves the efficiency of large sorts. Multiple allocations never exist; there is only one memory area of SORT_AREA_SIZE for each user process at any time. o The default is usually adequate for most database operations. However, if very large indexes are created, this parameter may need to be adjusted. For example, if one process is doing all database access, as in a full database import, then an increased value for this parameter may speed the import, particularly the CREATE INDEX statements.
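Collected in init.ora form, the example settings above look like the excerpt below (these are the quoted 4-CPU values, not recommendations):

# init.ora excerpt -- example values from the 4-CPU AIX / Oracle 7.3.4 case above
hash_area_size             = 16777216
optimizer_percent_parallel = 33
parallel_max_servers       = 40
parallel_min_servers       = 8
sort_area_size             = 8388608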
IPC as an Alternative to TCP/IP on UNIX
On an HP/UX server with Oracle as a target (i.e., PMServer and the Oracle target on the same box), using an IPC connection can significantly reduce the time it takes to build a lookup cache. In one case, a fact mapping that was using a lookup to get five columns (including a foreign key) and about 500,000 rows from a table was taking 19 minutes. Changing the connection type to IPC reduced this to 45 seconds. In another mapping, the total time decreased from 24 minutes to 8 minutes for a ~120-130 bytes/row, 500,000 row write (array inserts), with the primary key and unique index in place. Performance went from about 2MB/min (280 rows/sec) to about 10MB/min (1360 rows/sec).
A normal TCP (network TCP/IP) connection in tnsnames.ora would look like this:
DW.armafix =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP) (HOST = armafix) (PORT = 1526))
    )
    (CONNECT_DATA = (SID = DW))
  )
Make a new entry in tnsnames.ora like this, and use it for the connection to the local Oracle instance:
DWIPC.armafix =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = ipc) (KEY = DW))
    (CONNECT_DATA = (SID = DW))
  )
Improving Data Load Performance
Alternative to Dropping and Reloading Indexes Dropping and reloading indexes during very large loads to a data warehouse is often recommended but there is seldom any easy way to do this. For example, writing a SQL statement to drop each index, then writing another SQL statement to rebuild it can be a very tedious process. Oracle 7 (and above) offers an alternative to dropping and rebuilding indexes by allowing you to disable and re-enable existing indexes. Oracle stores the name of each index in a table that can be queried. With this in mind, it is an easy matter to write a SQL statement that queries this table. then generate SQL statements as output to disable and enable these indexes. Run the following to generate output to disable the foreign keys in the data warehouse: SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;' FROM USER_CONSTRAINTS WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT') AND CONSTRAINT_TYPE = 'R' This produces output that looks like: ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011077 ; ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011075 ; ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011060 ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011059 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011133 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011134 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011131 ;
Dropping or disabling primary keys will also speed loads. Run the results of this SQL statement after disabling the foreign key constraints:
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE PRIMARY KEY ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'P'
This produces output that looks like:
ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE PRIMARY KEY ;
Finally, disable any unique constraints with the following (note that, unlike primary keys, unique constraints are disabled by constraint name):
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'U'
This produces output that looks like:
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011070 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011071 ;
Save the results in a single file and name it something like 'DISABLE.SQL'.
To re-enable the indexes, rerun these queries after replacing ‘DISABLE’ with ‘ENABLE.’ Save the results in another file with a name such as ‘ENABLE.SQL’ and run it as a post-session command. Re-enable constraints in the reverse order that you disabled them. Re-enable the unique constraints first, and re-enable primary keys before foreign keys. TIP: Dropping or disabling foreign keys will often boost loading, but this also slows queries (such as lookups) and updates. If you do not use lookups or updates on your target tables you should get a boost by using this SQL statement to generate scripts. If you use lookups and updates (especially on large tables), you can exclude the index that will be used for the lookup from your script. You may want to experiment to determine which method is faster. SQL*Loader •
Loader Options
SQL*Loader is a bulk loader utility used for moving data from external files into the Oracle database. To use the Oracle bulk loader, you need a control file, which specifies how data should be loaded into the database. SQL*Loader has several options that can improve data loading performance and are easy to implement. These options are:
• DIRECT
• PARALLEL
• SKIP_INDEX_MAINTENANCE
• UNRECOVERABLE
A control file normally has the following format: LOAD DATA INFILE
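A complete minimal control file, filled in with illustrative table, file, and column names (none of these come from the original example), looks like this:

-- Sketch of a SQL*Loader control file; all names are illustrative.
LOAD DATA
INFILE 'customer_dim.dat'
APPEND
INTO TABLE customer_dim
FIELDS TERMINATED BY ','
(customer_id, customer_name, customer_type)

-- Invoked from the command line, the performance options discussed in this
-- section are passed as parameters, for example:
--   sqlldr userid=mddb_dev/password control=customer_dim.ctl direct=true parallel=true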
The DIRECT option loads data directly into database blocks, bypassing much of the normal SQL processing. Note that no other users can write to the loading table due to the exclusive lock taken during the load, and no SQL transformations can be made in the control file during the load.
The PARALLEL option can be used with the DIRECT option when loading multiple partitions of the same table. If the partitions are located on separate disks, the performance time can be reduced to that of loading a single partition.
If the CONVENTIONAL path must be used (e.g., when transformations are performed during the load), you can still bypass index updates by using the SKIP_INDEX_MAINTENANCE option. You will have to rebuild the indexes after the load, but overall performance may improve significantly.
The UNRECOVERABLE option in the control file allows you to bypass redo log writes during a DIRECT path load. Recoverability should not be an issue since the data file still exists.
The DIRECT option automatically disables CHECK and foreign key REFERENCES constraints, but not PRIMARY KEY, UNIQUE KEY and NOT NULL constraints. Disabling these constraints with the SQL scripts described earlier will benefit performance when loading data into a target warehouse.
Loading Partitioned Sessions
To improve performance when loading data to an Oracle database using a partitioned session, create the Oracle target table with the same number of partitions as the session.
Optimizing Query Performance
Oracle Bitmap Indexing With version 7.3.x, Oracle added bitmap indexing to supplement the traditional b-tree index. A b-tree index can greatly improve query performance on data that has high cardinality or contains mostly unique values, but is not much help for low cardinality/highly duplicated data and may even increase query time. A typical example of a low cardinality field is gender – it is either male or female (or possibly unknown). This kind of data is an excellent candidate for a bitmap index, and can significantly improve query performance. Keep in mind, however, that b-tree indexing is still the Oracle default. If you don’t specify an index type when creating an index, Oracle will default to btree. Also note that for certain columns, bitmaps will be smaller and faster to create than a b-tree index on the same column. Bitmap indexes are suited to data warehousing because of their performance, size, and ability to create and drop very quickly. Since most dimension tables in a warehouse have nearly every column indexed, the space savings is dramatic. But it is important to note that when a bitmap-indexed column is
updated, every row associated with that bitmap entry is locked, making bitmap indexing a poor choice for OLTP database tables with constant insert and update traffic. Also, bitmap indexes are rebuilt after each DML statement (e.g., inserts and updates), which can make loads very slow. For this reason, it is a good idea to drop or disable bitmap indexes prior to the load and recreate or re-enable them after the load. The relationship between Fact and Dimension keys is another example of low cardinality. With a b-tree index on the Fact table, a query processes by joining all the Dimension tables in a Cartesian product based on the WHERE clause, then joins back to the Fact table. With a bitmapped index on the Fact table, a ‘star query’ may be created that accesses the Fact table first followed by the Dimension table joins, avoiding a Cartesian product of all possible Dimension attributes. This ‘star query’ access method is only used if the STAR_TRANSFORMATION_ENABLED parameter is equal to TRUE in the init.ora file and if there are single column bitmapped indexes on the fact table foreign keys. Creating bitmap indexes is similar to creating b-tree indexes. To specify a bitmap index, add the word ‘bitmap’ between ‘create’ and ‘index’. All other syntax is identical. •
Bitmap indexes:
drop index emp_active_bit;
drop index emp_gender_bit;
create bitmap index emp_active_bit on emp (active_flag);
create bitmap index emp_gender_bit on emp (gender);
•
B-tree indexes:
drop index emp_active;
drop index emp_gender;
create index emp_active on emp (active_flag);
create index emp_gender on emp (gender);
Information for bitmap indexes is stored in the data dictionary in dba_indexes, all_indexes, and user_indexes with the word 'BITMAP' in the Uniqueness column rather than the word 'UNIQUE.' Bitmap indexes cannot be unique. To enable bitmap indexes, you must set the following items in the instance initialization file:
• compatible = 7.3.2.0.0 # or higher
• event = "10111 trace name context forever"
• event = "10112 trace name context forever"
• event = "10114 trace name context forever"
Also note that the parallel query option must be installed in order to create bitmap indexes. If you try to create bitmap indexes without the parallel query option, a syntax error will appear in your SQL statement; the keyword ‘bitmap’ won't be recognized. •
TIP: To check if the parallel query option is installed, start and log into SQL*Plus. If the parallel query option is installed, the word ‘parallel’ appears in the banner text.
Index Statistics •
Table Method Index statistics are used by Oracle to determine the best method to access tables and should be updated periodically as normal DBA procedures. The following will improve query results on Fact and Dimension tables (including appending and updating records) by updating the table and index statistics for the data warehouse: The following SQL statement can be used to analyze the tables in the database: SELECT 'ANALYZE TABLE ' || TABLE_NAME || ' COMPUTE STATISTICS;' FROM USER_TABLES WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT') This generates the following results: ANALYZE TABLE CUSTOMER_DIM COMPUTE STATISTICS; ANALYZE TABLE MARKET_DIM COMPUTE STATISTICS; ANALYZE TABLE VENDOR_DIM COMPUTE STATISTICS; The following SQL statement can be used to analyze the indexes in the database: SELECT 'ANALYZE INDEX ' || INDEX_NAME || ' COMPUTE STATISTICS;' FROM USER_INDEXES WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT') This generates the following results: ANALYZE INDEX SYS_C0011125 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011119 COMPUTE STATISTICS; ANALYZE INDEX SYS_C0011105 COMPUTE STATISTICS; Save these results as a SQL script to be executed before or after a load. •
Schema Method Another way to update index statistics is to compute indexes by schema rather than by table. If data warehouse indexes are the only indexes located in a single schema, then you can use the following command to update the statistics: EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'compute'); In this example, BDB is the schema for which the statistics should be updated. Note that the DBA must grant the execution privilege for dbms_utility to the database user executing this command. TIP: These SQL statements can be very resource intensive, especially for very large tables. For this reason, we recommend running them at off-peak times when no other process is using the database. If you find the exact computation of the statistics consumes too much time, it is often acceptable to estimate the statistics rather than compute them. Use ‘estimate’ instead of ‘compute’ in the above examples.
Parallelism Parallel execution can be implemented at the SQL statement, database object, or instance level for many SQL operations. The degree of parallelism should be identified based on the number of processors and disk drives on the server, with the number of processors being the minimum degree. •
SQL Level Parallelism Hints are used to define parallelism at the SQL statement level. The following examples demonstrate how to utilize four processors: SELECT /*+ PARALLEL(order_fact,4) */ …; SELECT /*+ PARALLEL_INDEX(order_fact, order_fact_ixl,4) */ …; TIP: When using a table alias in the SQL Statement, be sure to use this alias in the hint. Otherwise, the hint will not be used, and you will not receive an error message. Example of improper use of alias:
SELECT /*+PARALLEL (EMP, 4) */ EMPNO, ENAME FROM EMP A
Here, the parallel hint will not be used because of the used alias “A” for table EMP. The correct way is: SELECT /*+PARALLEL (A, 4) */ EMPNO, ENAME FROM EMP A •
Table Level Parallelism Parallelism can also be defined at the table and index level. The following example demonstrates how to set a table’s degree of parallelism to four for all eligible SQL statements on this table: ALTER TABLE order_fact PARALLEL 4; Ensure that Oracle is not contending with other processes for these resources or you may end up with degraded performance due to resource contention.
Additional Tips •
Executing Oracle SQL Scripts as Pre- and Post-Session Commands on UNIX
You can execute queries as both pre- and post-session commands. For a UNIX environment, the format of the command is:
sqlplus -s user_id/password@database @script_name.sql
For example, to execute the ENABLE.SQL file created earlier (assuming the data warehouse is on a database named 'infadb'), you would execute the following as a post-session command:
sqlplus -s pmuser/pmuser@infadb @/informatica/powercenter/Scripts/ENABLE.SQL
In some environments, this may be a security issue since both username and password are hard-coded and unencrypted. To avoid this, use the operating system's authentication to log onto the database instance. In the following example, the Informatica id "pmuser" is used to log onto the Oracle database. Create the Oracle user "pmuser" with the following SQL statement:
CREATE USER PMUSER IDENTIFIED EXTERNALLY
DEFAULT TABLESPACE . . .
TEMPORARY TABLESPACE . . .
In the following pre-session command, “pmuser” (the id Informatica is logged onto the operating system as) is automatically passed from the operating system to the database and used to execute the script: sqlplus -s /@infadb @/informatica/powercenter/Scripts/ENABLE.SQL You may want to use the init.ora parameter “os_authent_prefix” to distinguish between “normal” oracle-users and “external-identified” ones. •
DRIVING_SITE ‘Hint’ If the source and target are on separate instances, the Source Qualifier transformation should be executed on the target instance. For example, you want to join two source tables (A and B) together, which may reduce the number of selected rows. However, Oracle fetches all of the data from both tables, moves the data across the network to the target instance, then processes everything on the target instance. If either data source is large, this causes a great deal of network traffic. To force the Oracle optimizer to process the join on the source instance, use the ‘Generate SQL’ option in the source qualifier and include the ‘driving_site’ hint in the SQL statement as: SELECT /*+ DRIVING_SITE */ …;
SQL Server
Description
Proper tuning of the source and target database is a very important consideration for the scalability and usability of a business analytical environment. Managing performance on a SQL Server encompasses the following points:
• Manage system memory usage (RAM caching)
• Create and maintain good indexes
• Partition large data sets and indexes
• Monitor disk I/O subsystem performance
• Tune applications and queries
• Optimize active data
Manage RAM Caching
Managing random access memory (RAM) buffer cache is a major consideration in any database server environment. Accessing data in RAM cache is much faster than accessing the same information from disk. If database I/O (input/output operations to the physical disk subsystem) can be reduced to the minimal required set of data and index pages, these pages will stay in RAM longer. Too much unneeded data and index information flowing into buffer cache quickly pushes out valuable pages. The primary goal of performance tuning is to reduce I/O so that buffer cache is best utilized.
Several settings in SQL Server can be adjusted to take advantage of SQL Server RAM usage:
• Max async I/O is used to specify the number of simultaneous disk I/O operations that SQL Server can submit to the operating system. Note that this setting is automated in SQL Server 2000.
• SQL Server allows several selectable models for database recovery (a sketch of switching between them follows this list); these include:
- Full Recovery
- Bulk-Logged Recovery
- Simple Recovery
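As a minimal, hedged sketch (the database name is hypothetical, not from this document), the recovery model can be switched in SQL Server 2000 with ALTER DATABASE around a bulk load window:
-- Reduce logging overhead during the load window.
ALTER DATABASE infadw SET RECOVERY BULK_LOGGED;
-- ... run the bulk load here ...
-- Restore full point-in-time recoverability afterwards.
ALTER DATABASE infadw SET RECOVERY FULL;
In SQL Server 7.0, the equivalent behavior is controlled through database options (for example, ‘select into/bulkcopy’ via sp_dboption) rather than a recovery model setting.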
Cost Threshold for Parallelism Option. Use this option to specify the threshold where SQL Server creates and executes parallel plans. SQL Server creates and executes a parallel plan for a query only when the estimated cost to execute a serial plan for the same query is higher than the value set in cost threshold for parallelism. The cost refers to an estimated elapsed time in seconds required to execute the serial plan on a specific hardware configuration. Only set cost threshold for parallelism on symmetric multiprocessors (SMP).
Max Degree of Parallelism Option. Use this option to limit the number of processors (a maximum of 32) to use in parallel plan execution. The default value is 0, which uses the actual number of available CPUs. Set this option to 1 to suppress parallel plan generation. Set the value to a number greater than 1 to restrict the maximum number of processors used by a single query execution.
Priority Boost Option. Use this option to specify whether SQL Server should run at a higher scheduling priority than other processes on the same computer. If you set this option to one, SQL Server runs at a priority base of 13. The default is 0, which is a priority base of seven.
Set Working Set Size Option. Use this option to reserve physical memory space for SQL Server that is equal to the server memory setting. The server memory setting is configured automatically by SQL Server based on workload and available resources. It will vary dynamically between min server memory and max server memory. Setting ‘set working set size’ means the operating system will not attempt to swap out SQL Server pages even if they can be used more readily by another process when SQL Server is idle.
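These are server-wide options set with sp_configure. The values below are purely illustrative assumptions, not recommendations from this document; they should be derived from the actual processor count and workload:
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Raise the threshold so that only expensive queries go parallel (hypothetical value).
EXEC sp_configure 'cost threshold for parallelism', 10;
-- Cap a single query at four processors (hypothetical value).
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;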
Optimizing Disk I/O Performance
When configuring a SQL Server that will contain only a few gigabytes of data and not sustain heavy read or write activity, you need not be particularly concerned with the subject of disk I/O and balancing of SQL Server I/O activity across hard drives for maximum performance. To build larger SQL Server databases however, which will contain hundreds of gigabytes or even terabytes of data and/or that can sustain heavy read/write activity (as in a DSS application), it is necessary to drive configuration around maximizing SQL Server disk I/O performance by load-balancing across multiple hard drives.
Partitioning for Performance
For SQL Server databases that are stored on multiple disk drives, performance can be improved by partitioning the data to increase the amount of disk I/O parallelism. Partitioning can be done using a variety of techniques. Methods for creating and managing partitions include configuring your storage subsystem (i.e., disk, RAID partitioning) and applying various data configuration mechanisms in SQL Server such as files, file groups, tables and views (a brief sketch follows the list below). Some possible candidates for partitioning include:
• Transaction log
• Tempdb
• Database
• Tables
• Non-clustered indexes
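As a hedged sketch of the file/filegroup mechanism mentioned above (the database, filegroup, file, path, and table names are hypothetical, not from this document), a large table can be placed on its own filegroup backed by a file on a separate physical drive:
ALTER DATABASE infadw ADD FILEGROUP fg_fact;
ALTER DATABASE infadw
ADD FILE (NAME = 'infadw_fact1', FILENAME = 'E:\mssql\data\infadw_fact1.ndf', SIZE = 500MB)
TO FILEGROUP fg_fact;
-- Objects created ON fg_fact have their I/O isolated to that drive.
CREATE TABLE dbo.order_fact (order_id INT NOT NULL, order_amt MONEY) ON fg_fact;
Spreading filegroups across separate drives or RAID sets is what actually delivers the additional disk I/O parallelism.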
Using bcp and BULK INSERT
Two mechanisms exist inside SQL Server to address the need for bulk movement of data. The first mechanism is the bcp utility. The second is the BULK INSERT statement.
• Bcp is a command prompt utility that copies data into or out of SQL Server.
• BULK INSERT is a Transact-SQL statement that can be executed from within the database environment. Unlike bcp, BULK INSERT can only pull data into SQL Server. An advantage of using BULK INSERT is that it can copy data into instances of SQL Server using a Transact-SQL statement, rather than having to shell out to the command prompt.
TIP: Both of these mechanisms enable you to exercise control over the batch size. Unless you are working with small volumes of data, it is good to get in the habit of specifying a batch size for recoverability reasons. If none is specified, SQL Server commits all rows to be loaded as a single batch. For example, you attempt to load 1,000,000 rows of new data into a table. The server suddenly loses power just as it finishes processing row number 999,999. When the server recovers, those 999,999 rows will need to be rolled back out of the database before you attempt to reload the data. By specifying a batch size of 10,000 you could have saved significant recovery time, because SQL Server would only have had to roll back 9,999 rows instead of 999,999.
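A minimal, hedged BULK INSERT sketch follows (the table name, file path, and delimiters are hypothetical assumptions, not from this document); the BATCHSIZE and TABLOCK options correspond to the batching advice above and the load guidelines below:
BULK INSERT dbo.order_fact
FROM 'E:\loads\order_fact.txt'
WITH (
FIELDTERMINATOR = '|',
ROWTERMINATOR = '\n',
BATCHSIZE = 10000,  -- commit every 10,000 rows to limit rollback after a failure
TABLOCK             -- table-level lock for faster, minimally logged loads
);
The same batch size can be supplied to bcp with its -b switch.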
General Guidelines for Initial Data Loads
While loading data:
• Remove indexes
• Use BULK INSERT or bcp
• Parallel load using partitioned data files into partitioned tables
• Run one load stream for each available CPU
• Set Bulk-Logged or Simple Recovery model
• Use the TABLOCK option
After loading data:
• Create indexes
• Switch to the appropriate recovery model
• Perform backups
General Guidelines for Incremental Data Loads
• Load data with indexes in place
• Performance and concurrency requirements should determine locking granularity (sp_indexoption).
• Change from Full to Bulk-Logged Recovery mode unless there is an overriding need to preserve point-in-time recovery, such as online users modifying the database during bulk loads. Read operations should not affect bulk loads.
Teradata
Description
Teradata offers several bulk load utilities including FastLoad, MultiLoad, and TPump. FastLoad is used for loading inserts into an empty table. One of TPump’s advantages is that it does not lock the table that is being loaded. MultiLoad supports inserts, updates, deletes, and “upserts” to any table. This best practice will focus on MultiLoad since PowerCenter 5.x can auto-generate MultiLoad scripts and invoke the MultiLoad utility per PowerCenter target.
Tuning MultiLoad
There are many aspects to tuning a Teradata database. With PowerCenter 5.x, several aspects of tuning can be controlled by setting MultiLoad parameters to maximize write throughput. Other areas to analyze when performing a MultiLoad job include estimating space requirements and monitoring MultiLoad performance.
Note: In PowerCenter 5.1, the Informatica server transfers data via a UNIX named pipe to MultiLoad, whereas in PowerCenter 5.0, the data is first written to file.
MultiLoad Parameters
With PowerCenter 5.x, you can auto-generate MultiLoad scripts. This not only enhances development, but also allows you to set performance options. Here are the MultiLoad-specific parameters that are available in PowerCenter:
• TDPID. A client-based operand that is part of the logon string.
• Date Format. Ensure that the date format used in your target flat file is equivalent to the date format parameter in your MultiLoad script. Also validate that your date format is compatible with the date format specified in the Teradata database.
• Checkpoint. A checkpoint interval is similar to a commit interval for other databases. When you set the checkpoint value to less than 60, it represents the interval in minutes between checkpoint operations. If the checkpoint is set to a value greater than 60, it represents the number of records to write before performing a checkpoint operation. To maximize write speed to the database, try to limit the number of checkpoint operations that are performed.
• Tenacity. Interval in hours between MultiLoad attempts to log on to the database when the maximum number of sessions are already running.
• Load Mode. Available load methods include Insert, Update, Delete, and Upsert. Consider creating separate external loader connections for each method, selecting the one that will be most efficient for each target table.
• Drop Error Tables. Allows you to specify whether to drop or retain the three error tables for a MultiLoad session. Set this parameter to 1 to drop error tables or 0 to retain error tables.
• Max Sessions. Available only in PowerCenter 5.1, this parameter specifies the maximum number of sessions that are allowed to log on to the database. This value should not exceed one per working AMP (Access Module Processor).
• Sleep. Available only in PowerCenter 5.1, this parameter specifies the number of minutes that MultiLoad waits before retrying a logon operation.
Estimating Space Requirements for MultiLoad Jobs
Always estimate the final size of your MultiLoad target tables and make sure the destination has enough space to complete your MultiLoad job. In addition to the space that may be required by target tables, each MultiLoad job needs permanent space for:
• Work tables
• Error tables
• Restart Log table
Note: Spool space cannot be used for MultiLoad work tables, error tables, or the restart log table. Spool space is freed at each restart. By using permanent space for the MultiLoad tables, data is preserved for restart operations after a system failure. Work tables, in particular, require a lot of extra permanent space. Also remember to account for the size of error tables since error tables are generated for each target table. Use the following formula to prepare the preliminary space estimate for one target table, assuming no fallback protection, no journals, and no non-unique secondary indexes:
PERM = (using data size + 38)
x (number of rows processed)
x (number of apply conditions satisfied)
x (number of Teradata SQL statements within the applied DML)
A worked example follows. Make adjustments to your preliminary space estimates according to the requirements and expectations of your MultiLoad job.
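As a purely hypothetical illustration of the formula (none of these figures come from this document): for a target table whose rows have a using data size of 100 bytes, with 1,000,000 rows processed, one apply condition satisfied, and one Teradata SQL statement in the applied DML, the preliminary estimate would be PERM = (100 + 38) x 1,000,000 x 1 x 1 = 138,000,000 bytes, or roughly 132 MB of permanent space before adjustments.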
Monitoring MultiLoad Performance
Here are some tips for analyzing MultiLoad performance:
1. Determine which phase of the MultiLoad job is causing poor performance.
• If the performance bottleneck is during the acquisition phase, as data is acquired from the client system, then the issue may be with the client system.
• If it is during the application phase, as data is applied to the target tables, then the issue is not likely to be with the client system.
The MultiLoad job output lists the job phases and other useful information. Save these listings for evaluation.
2. Use the Teradata RDBMS Query Session utility to monitor the progress of the MultiLoad job.
3. Check for locks on the MultiLoad target tables and error tables.
4. Check the DBC.Resusage table for problem areas, such as data bus or CPU capacities at or near 100 percent for one or more processors.
5. Determine whether the target tables have non-unique secondary indexes (NUSIs). NUSIs degrade MultiLoad performance because the utility builds a separate NUSI change row to be applied to each NUSI sub-table after all of the rows have been applied to the primary table.
6. Check the size of the error tables. Write operations to the fallback error tables are performed at normal SQL speed, which is much slower than normal MultiLoad tasks.
7. Verify that the primary index is unique. Non-unique primary indexes can cause severe MultiLoad performance problems.
Performance Tuning UNIX Systems
Challenge
The following tips have proven useful in performance tuning UNIX-based machines. While some of these tips will be more helpful than others in a particular environment, all are worthy of consideration.
Description
Running ps -axu
Run ps -axu to check for the following items:
• Are there any processes waiting for disk access or for paging? If so, check the I/O and memory subsystems.
• What processes are using most of the CPU? This may help you distribute the workload better.
• What processes are using most of the memory? This may help you distribute the workload better.
• Does ps show that your system is running many memory-intensive jobs? Look for jobs with a large resident set size (RSS) or a high storage integral.
Identifying and Resolving Memory Issues
Use vmstat or sar to check swapping activity. Check the system to ensure that swapping does not occur at any time during the session processing. By using sar 5 10 or vmstat 1 10, you can get a snapshot of page swapping. If page swapping does occur at any time, increase memory to prevent swapping. Swapping, on any database system, causes a major performance decrease and increased I/O. On a memory-starved and I/O-bound server, this can effectively shut down the PowerCenter process and any databases running on the server. Some swapping will normally occur regardless of the tuning settings because some processes use the swap space by design. To check swap space availability, use pstat and swap. If the swap space is too small for the intended applications, it should be increased.
Run vmstat 5 (sar –wpgr); for SunOS, run vmstat –S 5 to detect and confirm memory problems, and check for the following:
• Are page-outs occurring consistently? If so, you are short of memory.
• Are there a high number of address translation faults? (System V only) This suggests a memory shortage.
• Are swap-outs occurring consistently? If so, you are extremely short of memory. Occasional swap-outs are normal; BSD systems swap out inactive jobs. Long bursts of swap-outs mean that active jobs are probably falling victim and indicate extreme memory shortage. If you don’t have vmstat –S, look at the w and de fields of vmstat. These should ALWAYS be zero.
If memory seems to be the bottleneck of the system, try the following remedial steps:
• Reduce the size of the buffer cache, if your system has one, by decreasing BUFPAGES. (The buffer cache is not used in System V.4 and SunOS 4.X systems.) Making the buffer cache smaller will hurt disk I/O performance.
• If you have statically allocated STREAMS buffers, reduce the number of large (2048- and 4096-byte) buffers. This may reduce network performance, but netstat –m should give you an idea of how many buffers you really need.
• Reduce the size of your kernel’s tables. This may limit the system’s capacity (number of files, number of processes, etc.).
• Try running jobs requiring a lot of memory at night. This may not help the memory problems, but you may not care about them as much.
• Try running jobs requiring a lot of memory in a batch queue. If only one memory-intensive job is running at a time, your system may perform satisfactorily.
• Try to limit the time spent running sendmail, which is a memory hog.
• If you don’t see any significant improvement, add more memory.
Identifying and Resolving Disk I/O Issues
Use iostat to check I/O load and utilization, as well as CPU load. iostat can be used to monitor the I/O load on the disks on the UNIX server, and permits monitoring the load on specific disks. Take notice of how fairly disk activity is distributed among the system disks. If it is not, are the most active disks also the fastest disks?
Run sadp to get a seek histogram of disk activity. Is activity concentrated in one area of the disk (good), spread evenly across the disk (tolerable), or in two well-defined peaks at opposite ends (bad)?
• Reorganize your file systems and disks to distribute I/O activity as evenly as possible.
• Using symbolic links helps to keep the directory structure the same throughout while still moving the data files that are causing I/O contention.
• Use your fastest disk drive and controller for your root filesystem; this will almost certainly have the heaviest activity. Alternatively, if single-file throughput is important, put performance-critical files into one filesystem and use the fastest drive for that filesystem.
• Put performance-critical files on a filesystem with a large block size: 16KB or 32KB (BSD).
• Increase the size of the buffer cache by increasing BUFPAGES (BSD). This may hurt your system’s memory performance.
• Rebuild your file systems periodically to eliminate fragmentation (backup, build a new filesystem, and restore).
• If you are using NFS and using remote files, look at your network situation. You don’t have local disk I/O problems.
• Check memory statistics again by running vmstat 5 (sar –rwpg). If your system is paging or swapping consistently, you have memory problems; fix the memory problem first. Swapping makes performance worse.
If your system has a disk capacity problem and is constantly running out of disk space, try the following actions:
• Write a find script that detects old core dumps, editor backup and auto-save files, and other trash and deletes it automatically. Run the script through cron.
• If you are running BSD UNIX or V.4, use the disk quota system to prevent individual users from gathering too much storage.
• Use a smaller block size on file systems that are mostly small files (e.g., source code files, object modules, and small data files).
Identifying and Resolving CPU Overload Issues
Use sar –u to check for CPU loading. This provides the %usr (user), %sys (system), %wio (waiting on I/O), and %idle (percent of idle time). A target goal should be %usr + %sys = 80 and %wio = 10, leaving %idle at 10. If %wio is higher, the disk and I/O contention should be investigated to eliminate the I/O bottleneck on the UNIX server. If the system shows a heavy load of %sys, and %usr has a high %idle, this is indicative of memory contention and swapping/paging problems. In this case, it is necessary to make memory changes to reduce the load on the system server.
When you run iostat 5 above, also observe the CPU idle time. Is the idle time always 0, without letup? It is good for the CPU to be busy, but if it is always busy 100 percent of the time, work must be piling up somewhere. This points to CPU overload.
• Eliminate unnecessary daemon processes. rwhod and routed are particularly likely to be performance problems, but any savings will help.
• Get users to run jobs at night with at or any queuing system that’s available. You may not care if the CPU (or the memory or I/O system) is overloaded at night, provided the work is done in the morning.
• Using nice to lower the priority of CPU-bound jobs will improve interactive performance. Also, using nice to raise the priority of CPU-bound jobs will expedite them but will hurt interactive performance. In general though, using nice is really only a temporary solution. If your workload grows, it will soon become insufficient. Consider upgrading your system, replacing it, or buying another system to share the load.
Identifying and Resolving Network Issues
You can suspect problems with network capacity or with data integrity if users experience slow performance when they are using rlogin or when they are accessing files via NFS.
Look at netstat –i. If the number of collisions is large, suspect an overloaded network. If the number of input or output errors is large, suspect hardware problems. A large number of input errors indicates problems somewhere on the network. A large number of output errors suggests problems with your system and its interface to the network.
If collisions and network hardware are not a problem, figure out which system appears to be slow. Use spray to send a large burst of packets to the slow system. If the number of dropped packets is large, the remote system most likely cannot respond to incoming data fast enough. Look to see if there are CPU, memory or disk I/O problems on the remote system. If not, the system may just not be able to tolerate heavy network workloads. Try to reorganize the network so that this system isn’t a file server.
A large number of dropped packets may also indicate data corruption. Run netstat –s on the remote system, then spray the remote system from the local system and run netstat –s again. If the increase in UDP socket full drops (as indicated by netstat) is equal to or greater than the number of dropped packets that spray reports, the remote system is a slow network server. If the increase in socket full drops is less than the number of dropped packets, look for network errors.
Run nfsstat and look at the client RPC data. If the retrans field is more than 5 percent of calls, the network or an NFS server is overloaded. If timeout is high, at least one NFS server is overloaded, the network may be faulty, or one or more servers may have crashed. If badxid is roughly equal to timeout, at least one NFS server is overloaded. If timeout and retrans are high, but badxid is low, some part of the network between the NFS client and server is overloaded and dropping packets.
Try to prevent users from running I/O-intensive programs across the network. The grep utility is a good example of an I/O-intensive program. Instead, have users log into the remote system to do their work. Reorganize the computers and disks on your network so that as many users as possible can do as much work as possible on a local system. Use systems with good network performance as file servers.
If you are short of STREAMS data buffers and are running SunOS 4.0 or System V.3 (or earlier), reconfigure the kernel with more buffers.
General Tips and Summary of Other Useful Commands
• Use dirs instead of pwd.
• Avoid ps.
• If you use sh, avoid long search paths.
• Minimize the number of files per directory.
• Use vi or a native window editor rather than emacs.
• Use egrep rather than grep: it’s faster.
• Don’t run grep or other I/O-intensive applications across NFS.
• Use rlogin rather than NFS to access files on remote systems.
lsattr –E –l sys0 is used to determine some current settings on most UNIX environments. Of particular attention is maxuproc, the setting that determines the maximum level of user background processes. On most UNIX environments, this is defaulted to 40 but should be increased to 250 on most systems.
Avoid raw devices. In general, proprietary file systems from the UNIX vendor are most efficient and well suited for database work when tuned properly. Be sure to check the database vendor documentation to determine the best file system for the specific machine. Typical choices include: s5, the UNIX System V file system; ufs, the “UNIX File System” derived from Berkeley (BSD); vxfs, the Veritas file system; and lastly raw devices that, in reality, are not a file system at all.
Use the pmprocs utility (a PowerCenter utility) to view the current Informatica processes. For example:
harmon 125: pmprocs
<------------ Current PowerMart processes --------------->
UID      PID   PPID  C   STIME     TTY  TIME  CMD
powermar 2711  1421  16  18:13:11  ?    0:07  dtm pmserver.cfg 0 202 -
powermar 2713  2711  11  18:13:17  ?    0:05  dtm pmserver.cfg 0 202 -
powermar 1421  1     1   08:39:19  ?    1:30  pmserver
powermar 2712  2711  17  18:13:17  ?    0:08  dtm pmserver.cfg 0 202 -
powermar 2714  1421  11  18:13:20  ?    0:04  dtm pmserver.cfg 1 202 -
powermar 2721  2714  12  18:13:27  ?    0:04  dtm pmserver.cfg 1 202 -
powermar 2722  2714  8   18:13:27  ?    0:02  dtm pmserver.cfg 1 202 -
<------------ Current Shared Memory Resources --------------->
IPC status from
There are 19 Semaphores held by PowerMart processes
• Pmprocs is a script that combines the ps and ipcs commands
• Only available for UNIX
• CPID - Creator PID
• LPID - Last PID that accessed the resource
• Semaphores - used to sync the reader and writer
• 0 or 1 - shows slot in LM shared memory
Finally, when tuning UNIX environments, the general rule of thumb is to tune the server for a major database system. Most database systems provide a special tuning supplement for each specific version of UNIX. For example, there is a specific IBM Redbook for Oracle 7.3 running on AIX 4.3. Because PowerCenter processes data in a similar fashion as SMP databases, by tuning the server to support the database, you also tune the system for PowerCenter.
References: System Performance Tuning (O’Reilly Publishing) by Mike Loukides is the main reference book for this Best Practice. For detailed information on each of the parameters discussed here, and much more on performance tuning of the applications running on UNIX-based systems, refer to this book.
Performance Tuning Windows NT/2000 Systems
Challenge
The Microsoft Windows NT/2000 environment is easier to tune than UNIX environments, but offers limited performance options. NT is considered a “self-tuning” operating system because it attempts to configure and tune memory to the best of its ability. However, this does not mean that the NT system administrator is entirely free from performance improvement responsibilities. The following tips have proven useful in performance tuning NT-based machines. While some are likely to be more helpful than others in any particular environment, all are worthy of consideration.
Note: Tuning is essentially the same for both NT and 2000 based systems, with differences for Windows 2000 noted in the last section.
Description
The two places to begin when tuning an NT server are:
• The Performance Monitor.
• The Performance tab (hit ctrl+alt+del, choose Task Manager, and click on the Performance tab).
When using the Performance Monitor, check these performance indicators:
Processor: percent processor time. For SMP environments you need to add one monitor for each CPU. If the system is “maxed out” (i.e., running at 100 percent for all CPUs), it may be necessary to add processing power to the server. Unfortunately, NT scalability is quite limited, especially in comparison with UNIX environments. Also keep in mind NT’s inability to split processes across multiple CPUs. Thus, one CPU may be at 100% utilization while the other CPUs are at 0% utilization. There is currently no solution for optimizing this situation, although Microsoft is working on the problem.
Memory: pages/second. A value of five pages per second or less is acceptable. If the number is much higher, there is a need to tune the memory
to make better use of hardware rather than virtual memory. Remember that this is only a guideline, and the recommended setting may be too high for some systems.
Physical disks: percent time. This is the best place to tune database performance within NT environments. By analyzing the disk I/O, the load on the database can be leveled across multiple disks. High I/O settings indicate possible contention for I/O; files should be moved to less utilized disk devices to optimize overall performance.
Physical disks: queue length. This setting is used to determine the number of users sitting idle waiting for access to the same disk device. If this number is greater than two, moving files to less frequently used disk devices should level the load of the disk device.
Server: bytes total/second. This is a very nebulous performance indicator. It monitors the server network connection. It is nebulous because it bundles multiple network connections together. Some connections may be fast while others are slow, making it difficult to identify real problems, and very possibly resulting in a false sense of security. Intimate knowledge of the network card, connections, and hubstacks is critical for optimal server performance when moving data across the network. Careful analysis of the network card (or cards) and their settings, combined with the use of a Network Analyzer, can eliminate bottlenecks and improve throughput of network traffic at a magnitude of 10 to 1000 times depending on the hardware.
Resolving Typical NT Problems
The following paragraphs describe some common performance problems in an NT environment and suggest tuning solutions.
Load reasonableness. Assume that some software will not be well coded, and some background processes, such as a mail server or web server running on the same machine, can potentially starve the CPUs on the machine. Off-loading CPU hogs may be the only recourse.
Device Drivers. The device drivers for some types of hardware are notorious for wasting CPU clock cycles. Be sure to get the latest drivers from the hardware vendor to minimize this problem.
Memory and services. Although adding memory to NT is always a good solution, it is also expensive and usually must be planned to support the BANK system for EISA and PCI architectures. Before adding memory, check the Services in Control Panel because many background applications do not uninstall the old service when installing a new update or version. Thus, both the unused old service and the new service may be using valuable CPU memory resources.
I/O Optimization. This is, by far, the best tuning option for database applications in the NT environment. If necessary, level the load across the disk devices by moving files. In situations where there are multiple controllers, be sure to level the load across the controllers too.
Using electrostatic devices and fast-wide SCSI can also help to increase performance, and fragmentation can be eliminated by using a Windows NT/2000 disk defragmentation product. Using this type of product is a good idea whether the disk is formatted for FAT or NTFS. Finally, on NT servers, be sure to implement disk striping to split single data files across multiple disk drives and take advantage of RAID (Redundant Arrays of Inexpensive Disks) technology. Also increase the priority of the disk devices on the NT server. NT, by default, sets the disk device priority low. Change the disk priority setting in the Registry at service\lanman\server\parameters and add a key for ThreadPriority of type DWORD with a value of 2.
Monitoring System Performance in Windows 2000
In Windows 2000, the Informatica server uses system resources to process transformations, session execution, and the reading and writing of data. The Informatica server also uses system memory for other data such as aggregate, joiner, rank, and cached lookup tables. With Windows 2000, you can use the System Monitor in the Performance Console of the administrative tools, or system tools in the Task Manager, to monitor the amount of system resources used by the Informatica server and to identify system bottlenecks. Windows 2000 provides the following tools (accessible under Control Panel/Administrative Tools/Performance) for monitoring resource usage on your computer:
• System Monitor
• Performance Logs and Alerts
These Windows 2000 monitoring tools enable you to analyze usage and detect bottlenecks at the disk, memory, processor, and network level. The System Monitor displays a graph which is flexible and configurable. You can copy counter paths and settings from the System Monitor display to the Clipboard and paste counter paths from Web pages or other sources into the System Monitor display. The System Monitor is portable. This is useful in monitoring other systems that require administration. Typing perfmon.exe at the command prompt causes the system to start System Monitor, not Performance Monitor. The Performance Logs and Alerts tool provides two types of performance-related logs—counter logs and trace logs—and an alerting function. Counter logs record sampled data about hardware resources and system services based on performance objects and counters in the same manner as System Monitor. Therefore they can be viewed in System Monitor. Data in counter logs can be saved as comma-separated or tab-separated files that are easily viewed with Excel. Trace logs collect event traces that measure performance statistics associated with events such as disk and file I/O, page faults, or thread activity. The alerting function allows you to define a counter value that will trigger actions such as sending a network message, running a program, or starting a log. Alerts are
useful if you are not actively monitoring a particular counter threshold value, but want to be notified when it exceeds or falls below a specified value so that you can investigate and determine the cause of the change. You might want to set alerts based on established performance baseline values for your system.
Note: You must have Full Control access to a subkey in the registry in order to create or modify a log configuration. (The subkey is HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SysmonLog\Log_Queries.)
The predefined log settings under Counter Logs, named System Overview, are configured to create a binary log that, after manual start-up, updates every 15 seconds and logs continuously until it achieves a maximum size. If you start logging with these settings, data is saved to the Perflogs folder on the root directory and includes the counters: Memory\ Pages/sec, PhysicalDisk(_Total)\Avg. Disk Queue Length, and Processor(_Total)\ % Processor Time. If you want to create your own log setting, right-click one of the log types. Some other useful counters include Physical Disk: Reads/sec and Writes/sec, and Memory: Available Bytes and Cache Bytes.
Tuning Mappings for Better Performance
Challenge
In general, a PowerCenter mapping is the biggest ‘bottleneck’ in the load process, as business rules determine the number and complexity of transformations in a mapping. This Best Practice offers some guidelines for tuning mappings.
Description
Analyze mappings for tuning only after you have tuned the system, source and target for peak performance.
Consider Single-Pass Reading
If several mappings use the same data source, consider a single-pass reading. Consolidate separate mappings into one mapping with either a single Source Qualifier Transformation or one set of Source Qualifier Transformations as the data source for the separate data flows. Similarly, if a function is used in several mappings, a single-pass reading will reduce the number of times that function will be called in the session.
Optimize SQL Overrides
When SQL overrides are required in a Source Qualifier, Lookup Transformation, or in the update override of a target object, be sure the SQL statement is tuned. The extent to which and how SQL can be tuned depends on the underlying source or target database system.
Scrutinize Datatype Conversions
PowerCenter Server automatically makes conversions between compatible datatypes. When these conversions are performed unnecessarily, performance slows. For example, if a mapping moves data from an Integer port to a Decimal port, then back to an Integer port, the conversion may be unnecessary.
In some instances however, datatype conversions can help improve performance. This is especially true when integer values are used in place of other datatypes for performing comparisons using Lookup and Filter transformations.
Eliminate Transformation Errors
Large numbers of evaluation errors significantly slow performance of the PowerCenter Server. During transformation errors, the PowerCenter Server engine pauses to determine the cause of the error, removes the row causing the error from the data flow, and logs the error in the session log. Transformation errors can be caused by many things including: conversion errors, conflicting mapping logic, any condition that is specifically set up as an error, and so on. The session log can help point out the cause of these errors. If errors recur consistently for certain transformations, re-evaluate the constraints for these transformations. Any source of errors should be traced and eliminated.
Optimize Lookup Transformations
There are a number of ways to optimize lookup transformations that are set up in a mapping.
When to Cache Lookups
When caching is enabled, the PowerCenter Server caches the lookup table and queries the lookup cache during the session. When this option is not enabled, the PowerCenter Server queries the lookup table on a row-by-row basis.
NOTE: All the tuning options mentioned in this Best Practice assume that memory and cache sizing for lookups are sufficient to ensure that caches will not page to disks. Practices regarding memory and cache sizing for Lookup transformations are covered in Best Practice: Tuning Sessions for Better Performance.
In general, if the lookup table needs less than 300MB of memory, lookup caching should be enabled. A better rule of thumb than memory size is to determine the ‘size’ of the potential lookup cache with regard to the number of rows expected to be processed. Consider the following example. In Mapping X, the source and lookup contain the following number of records:
ITEMS (source): 5000 records
MANUFACTURER: 200 records
DIM_ITEMS: 100000 records
Number of Disk Reads            Cached Lookup    Un-cached Lookup
LKP_Manufacturer
  Build Cache                   200              0
  Read Source Records           5000             5000
  Execute Lookup                0                5000
  Total # of Disk Reads         5200             10000
LKP_DIM_ITEMS
  Build Cache                   100000           0
  Read Source Records           5000             5000
  Execute Lookup                0                5000
  Total # of Disk Reads         105000           10000
Consider the case where MANUFACTURER is the lookup table. If the lookup table is cached, it will take a total of 5200 disk reads to build the cache and execute the lookup. If the lookup table is not cached, then it will take a total of 10,000 disk reads to execute the lookup. In this case, the number of records in the lookup table is small in comparison with the number of times the lookup is executed. So this lookup should be cached. This is the more likely scenario.
Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached, it will result in 105,000 total disk reads to build and execute the lookup. If the lookup table is not cached, then the disk reads would total 10,000. In this case the number of records in the lookup table is not small in comparison with the number of times the lookup will be executed. Thus the lookup should not be cached.
Use the following eight-step method to determine if a lookup should be cached:
1. Code the lookup into the mapping.
2. Select a standard set of data from the source. For example, add a where clause on a relational source to load a sample 10,000 rows.
3. Run the mapping with caching turned off and save the log.
4. Run the mapping with caching turned on and save the log to a different name than the log created in step 3.
5. Look in the cached lookup log and determine how long it takes to cache the lookup object. Note this time in seconds: LOOKUP TIME IN SECONDS = LS.
6. In the non-cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: NON-CACHED ROWS PER SECOND = NRS.
7. In the cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: CACHED ROWS PER SECOND = CRS.
8. Use the following formula to find the breakeven row point:
(LS*NRS*CRS)/(CRS-NRS) = X
where X is the breakeven point. If the expected number of source records is less than X, it is better not to cache the lookup. If the expected number of source records is more than X, it is better to cache the lookup.
For example: Assume the lookup takes 166 seconds to cache (LS=166). Assume with a cached lookup the load is 232 rows per second (CRS=232). Assume with a non-cached lookup the load is 147 rows per second (NRS = 147). The formula would result in: (166*147*232)/(232-147) = 66,603. Thus, if the source has less than 66,603 records, the lookup should not be cached. If it has more than 66,603 records, then the lookup should be cached.
Sharing Lookup Caches
There are a number of methods for sharing lookup caches.
• Within a specific session run for a mapping, if the same lookup is used multiple times in a mapping, the PowerCenter Server will re-use the cache for the multiple instances of the lookup. Using the same lookup multiple times in the mapping will be more resource intensive with each successive instance.
• If multiple cached lookups are from the same table but are expected to return different columns of data, it may be better to set up the multiple lookups to bring back the same columns even though not all return ports are used in all lookups. Bringing back a common set of columns may reduce the number of disk reads.
• Across sessions of the same mapping, the use of an unnamed persistent cache allows multiple runs to use an existing cache file stored on the PowerCenter Server. If the option of creating a persistent cache is set in the lookup properties, the memory cache created for the lookup during the initial run is saved to the PowerCenter Server. This can improve performance because the Server builds the memory cache from cache files instead of the database. This feature should only be used when the lookup table is not expected to change between session runs.
• Across different mappings and sessions, the use of a named persistent cache allows sharing of an existing cache file.
Reducing the Number of Cached Rows
There is an option to use a SQL override in the creation of a lookup cache. Conditions can be added to the WHERE clause to reduce the set of records included in the resulting cache (see the sketch below). NOTE: If you use a SQL override in a lookup, the lookup must be cached.
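As a hedged illustration only (the table, columns, and flag value are hypothetical, not taken from this document), a lookup SQL override might restrict the cache to current rows:
SELECT CUSTOMER_ID, CUSTOMER_NAME
FROM DIM_CUSTOMER
-- Only active rows are read into the lookup cache, shrinking the cache size.
WHERE ACTIVE_FLAG = 'Y'
The smaller the cached result set, the less memory and build time the lookup requires.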
Optimizing the Lookup Condition
In the case where a lookup uses more than one lookup condition, set the conditions with an equal sign first in order to optimize lookup performance.
Indexing the Lookup Table
The PowerCenter Server must query, sort and compare values in the lookup condition columns. As a result, indexes on the database table should include every column used in a lookup condition. This can improve performance for both cached and un-cached lookups.
• In the case of a cached lookup, an ORDER BY condition is issued in the SQL statement used to create the cache. Columns used in the ORDER BY condition should be indexed. The session log will contain the ORDER BY statement.
• In the case of an un-cached lookup, since a SQL statement is created for each row passing into the lookup transformation, performance can be helped by indexing columns in the lookup condition.
Optimize Filter and Router Transformations
Filtering data as early as possible in the data flow improves the efficiency of a mapping. Instead of using a Filter Transformation to remove a sizeable number of rows in the middle or end of a mapping, use a filter on the Source Qualifier or a Filter Transformation immediately after the source qualifier to improve performance.
Avoid complex expressions when creating the filter condition. Filter transformations are most effective when a simple integer or TRUE/FALSE expression is used in the filter condition.
Filters or routers should also be used to drop rejected rows from an Update Strategy transformation if rejected rows do not need to be saved.
Replace multiple filter transformations with a router transformation. This reduces the number of transformations in the mapping and makes the mapping easier to follow.
Optimize Aggregator Transformations
Aggregator Transformations often slow performance because they must group data before processing it. Use simple columns in the group by condition to make the Aggregator Transformation more efficient. When possible, use numbers instead of strings or dates in the GROUP BY columns. Also avoid complex expressions in the Aggregator expressions, especially in GROUP BY ports.
Use the Sorted Input option in the aggregator. This option requires that data sent to the aggregator be sorted in the order in which the ports are used in the aggregator’s group by. The Sorted Input option decreases the use of aggregate caches. When it is used, the PowerCenter Server assumes all data is sorted by group and, as a group is passed through an aggregator, calculations can be performed and information passed on to the next transformation. Without sorted input, the Server must wait for all rows of data before processing aggregate calculations. Use of the
Sorted Input option is usually accompanied by a Source Qualifier which uses the Number of Sorted Ports option.
Use an Expression and Update Strategy instead of an Aggregator Transformation. This technique can only be used if the source data can be sorted. Further, using this option assumes that a mapping is using an Aggregator with Sorted Input option. In the Expression Transformation, the use of variable ports is required to hold data from the previous row of data processed. The premise is to use the previous row of data to determine whether the current row is a part of the current group or is the beginning of a new group. Thus, if the row is a part of the current group, then its data would be used to continue calculating the current group function. An Update Strategy Transformation would follow the Expression Transformation and set the first row of a new group to insert and the following rows to update.
Optimize Joiner Transformations
Joiner transformations can slow performance because they need additional space in memory at run time to hold intermediate results. Define the rows from the smaller set of data in the joiner as the Master rows. The Master rows are cached to memory and the detail records are then compared to rows in the cache of the Master rows. In order to minimize memory requirements, the smaller set of data should be cached and thus set as Master.
Use Normal joins whenever possible. Normal joins are faster than outer joins and the resulting set of data is also smaller.
Use the database to do the join when sourcing data from the same database schema. Database systems usually can perform the join more quickly than the Informatica Server, so a SQL override or a join condition should be used when joining multiple tables from the same database schema.
Optimize Sequence Generator Transformations
Sequence Generator transformations need to determine the next available sequence number; thus, increasing the Number of Cached Values property can increase performance. This property determines the number of values the Informatica Server caches at one time. If it is set to cache no values, the Informatica Server must query the Informatica repository each time to determine the next number to be used. Consider configuring the Number of Cached Values to a value greater than 1000. Note that any cached values not used in the course of a session are ‘lost’, since the sequence generator value in the repository is advanced so that the next call returns a new set of cached values.
Avoid External Procedure Transformations
For the most part, making calls to external procedures slows down a session. If possible, avoid the use of these Transformations, which include Stored Procedures, External Procedures and Advanced External Procedures.
Field Level Transformation Optimization
As a final step in the tuning process, expressions used in transformations can be tuned. When examining expressions, focus on complex expressions for possible simplification.
To help isolate slow expressions, do the following:
1. Time the session with the original expression.
2. Copy the mapping and replace half the complex expressions with a constant.
3. Run and time the edited session.
4. Make another copy of the mapping and replace the other half of the complex expressions with a constant.
5. Run and time the edited session.
Processing field level transformations takes time. If the transformation expressions are complex, then processing will be slower. It is often possible to get a 10-20% performance improvement by optimizing complex field level transformations. Use the target table mapping reports or the Metadata Reporter to examine the transformations. Likely candidates for optimization are the fields with the most complex expressions. Keep in mind that there may be more than one field causing performance problems.
Factoring Out Common Logic
This can reduce the number of times a mapping performs the same logic. If a mapping performs the same logic multiple times, moving the task upstream in the mapping may allow the logic to be done just once. For example, a mapping has five target tables. Each target requires a Social Security Number lookup. Instead of performing the lookup right before each target, move the lookup to a position before the data flow splits.
Minimize Function Calls
Anytime a function is called it takes resources to process. There are several common examples where function calls can be reduced or eliminated.
Aggregate function calls can sometimes be reduced. In the case of each aggregate function call, the Informatica Server must search and group the data. Thus the following expression:
SUM(Column A) + SUM(Column B)
Can be optimized to:
SUM(Column A + Column B)
In general, operators are faster than functions, so operators should be used whenever possible. For example, if you have an expression which involves a CONCAT function such as:
CONCAT(CONCAT(FIRST_NAME,’ ‘), LAST_NAME)
It can be optimized to:
FIRST_NAME || ‘ ‘ || LAST_NAME
Remember that IIF() is a function that returns a value, not just a logical test. This allows many logical statements to be written in a more compact fashion. For example:
IIF(FLG_A=’Y’ and FLG_B=’Y’ and FLG_C=’Y’, VAL_A+VAL_B+VAL_C,
IIF(FLG_A=’Y’ and FLG_B=’Y’ and FLG_C=’N’, VAL_A+VAL_B,
IIF(FLG_A=’Y’ and FLG_B=’N’ and FLG_C=’Y’, VAL_A+VAL_C,
IIF(FLG_A=’Y’ and FLG_B=’N’ and FLG_C=’N’, VAL_A,
IIF(FLG_A=’N’ and FLG_B=’Y’ and FLG_C=’Y’, VAL_B+VAL_C,
IIF(FLG_A=’N’ and FLG_B=’Y’ and FLG_C=’N’, VAL_B,
IIF(FLG_A=’N’ and FLG_B=’N’ and FLG_C=’Y’, VAL_C,
IIF(FLG_A=’N’ and FLG_B=’N’ and FLG_C=’N’, 0.0))))))))
Can be optimized to:
IIF(FLG_A=’Y’, VAL_A, 0.0) + IIF(FLG_B=’Y’, VAL_B, 0.0) + IIF(FLG_C=’Y’, VAL_C, 0.0)
The original expression had 8 IIFs, 16 ANDs and 24 comparisons. The optimized expression results in 3 IIFs, 3 comparisons and two additions.
Be creative in making expressions more efficient. The following is an example of reworking an expression to reduce three comparisons to one. For example:
IIF(X=1 OR X=5 OR X=9, 'yes', 'no')
Can be optimized to (assuming X is known to fall within a bounded range such as 1 through 12, so that MOD(X, 4) = 1 only for the listed values):
IIF(MOD(X, 4) = 1, 'yes', 'no')
Calculate Once, Use Many Times
Avoid calculating or testing the same value multiple times. If the same subexpression is used several times in a transformation, consider making the subexpression a local variable. The local variable can be used only within the transformation, but calculating it only once can speed performance.
Choose Numeric versus String Operations
The Informatica Server processes numeric operations faster than string operations. For example, if a lookup is done on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance.
Optimizing Char-Char and Char-Varchar Comparisons
When the Informatica Server performs comparisons between CHAR and VARCHAR columns, it slows each time it finds trailing blank spaces in the row. The Treat CHAR as CHAR On Read option can be set in the Informatica Server setup so that the Informatica Server does not trim trailing spaces from the end of CHAR source fields.
Use DECODE instead of LOOKUP
When a LOOKUP function is used, the Informatica Server must look up a table in the database. When a DECODE function is used, the lookup values are incorporated into the expression itself, so the Informatica Server does not need to look up a separate table. Thus, when looking up a small set of unchanging values, using DECODE may improve performance.
Reduce the Number of Transformations in a Mapping
Whenever possible, the number of transformations should be reduced, as there is always overhead involved in moving data between transformations. Along the same lines, unnecessary links between transformations should be removed to minimize the amount of data moved. This is especially important with data being pulled from the Source Qualifier Transformation.
Tuning Sessions for Better Performance
Challenge
Running sessions is where ‘the pedal hits the metal’. A common misconception is that this is the area where most tuning should occur. While it is true that various specific session options can be modified to improve performance, this should not be the major or only area of focus when implementing performance tuning.
Description
When you have finished optimizing the sources, target database and mappings, you should review the sessions for performance optimization.
Caches
The greatest area for improvement at the session level usually involves tweaking memory cache settings. The Aggregator, Joiner, Rank and Lookup Transformations use caches. Review the memory cache settings for sessions where the mappings contain any of these transformations. When performance details are collected for a session, information about readfromdisk and writetodisk counters for Aggregator, Joiner, Rank and/or Lookup transformations can point to a session bottleneck. Any value other than zero for these counters may indicate a bottleneck. Because index and data caches are created for each of these transformations, both the index cache and data cache sizes may affect performance, depending on the factors discussed in the following paragraphs.
When the PowerCenter Server creates memory caches, it may also create cache files. Both index and data cache files can be created for the following transformations in a mapping:
• Aggregator transformation (without sorted ports)
• Joiner transformation
• Rank transformation
• Lookup transformation (with caching enabled)
The PowerCenter Server creates the index and data cache files by default in the PowerCenter Server variable directory, $PMCacheDir. The naming convention used by the PowerCenter Server for these files is PM [type of widget] [generated number].dat or .idx. For example, an aggregate data cache file would be named PMAGG31_19.dat. The cache directory may be changed however, if disk space is a constraint. Informatica recommends that the cache directory be local to the PowerCenter Server. You may encounter performance or reliability problems when you cache large quantities of data on a mapped or mounted drive.
If the PowerCenter Server requires more memory than the configured cache size, it stores the overflow values in these cache files. Since paging to disk can slow session performance, try to configure the index and data cache sizes to store the appropriate amount of data in memory. Refer to Chapter 9: Session Caches in the Informatica Session and Server Guide for detailed information on determining cache sizes.
The PowerCenter Server writes to the index and data cache files during a session in the following cases:
• The mapping contains one or more Aggregator transformations, and the session is configured for incremental aggregation.
• The mapping contains a Lookup transformation that is configured to use a persistent lookup cache, and the Informatica Server runs the session for the first time.
• The mapping contains a Lookup transformation that is configured to initialize the persistent lookup cache.
• The DTM runs out of cache memory and pages to the local cache files. The DTM may create multiple files when processing large amounts of data. The session fails if the local directory runs out of disk space.
When a session is run, the PowerCenter Server writes a message in the session log indicating the cache file name and the transformation name. When a session completes, the DTM generally deletes the overflow index and data cache files. However, index and data files may exist in the cache directory if the session is configured for either incremental aggregation or to use a persistent lookup cache. Cache files may also remain if the session does not complete successfully.
If a cache file handles more than 2 gigabytes of data, the PowerCenter Server creates multiple index and data files. When creating these files, the PowerCenter Server appends a number to the end of the filename, such as PMAGG*.idx1 and PMAGG*.idx2. The number of index and data files is limited only by the amount of disk space available in the cache directory.
Aggregator Caches
Keep the following items in mind when configuring the aggregate memory cache sizes (a rough sizing sketch follows this list).
• Allocate enough space to hold at least one row in each aggregate group.
• Remember that you only need to configure cache memory for an Aggregator transformation that does NOT use sorted ports. The PowerCenter Server uses memory to process an Aggregator transformation with sorted ports, not cache memory.
• Incremental aggregation can improve session performance. When it is used, the PowerCenter Server saves index and data cache information to disk at the end of the session. The next time the session runs, the PowerCenter Server uses this historical information to perform the incremental aggregation. The PowerCenter Server names these files PMAGG*.dat and PMAGG*.idx and saves them to the cache directory.
• Mappings that have sessions which use incremental aggregation should be set up so that only new detail records are read with each subsequent run.
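Because aggregate cache needs scale with the number of groups, a back-of-the-envelope estimate can be useful before opening the session properties. The Python sketch below uses assumed group counts and row widths and is not the vendor's published cache-sizing formula; confirm actual settings against the Session and Server Guide.

# Back-of-the-envelope aggregator cache estimate (assumed figures, not the vendor formula).
group_count = 250_000           # distinct aggregate groups (hypothetical)
group_key_bytes = 64            # approximate width of the group-by ports
aggregate_row_bytes = 200       # approximate width of the aggregate/output ports

index_cache_bytes = group_count * group_key_bytes
data_cache_bytes = group_count * aggregate_row_bytes
print(f"index cache ~ {index_cache_bytes / 2**20:.1f} MB")
print(f"data cache  ~ {data_cache_bytes / 2**20:.1f} MB")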
Joiner Caches
The source with fewer records should be specified as the master source because only the master source records are read into cache. When a session is run with a Joiner transformation, the PowerCenter Server reads all the rows from the master source and builds memory caches based on the master rows. After the memory caches are built, the PowerCenter Server reads the rows from the detail source and performs the joins. Also, the PowerCenter Server automatically aligns all data for joiner caches on an eight-byte boundary, which helps increase the performance of the join.
Lookup Caches
Several options can be explored when dealing with lookup transformation caches.
• Persistent caches should be used when lookup data is not expected to change often. Lookup cache files are saved after a session which has a lookup that uses a persistent cache is run for the first time. These files are reused for subsequent runs, bypassing the querying of the database for the lookup. If the lookup table changes, you must be sure to set the Recache from Database option to ensure that the lookup cache files will be rebuilt.
• Lookup caching should be enabled for relatively small tables (a rough cost comparison follows this list). Refer to Best Practice: Tuning Mappings for Better Performance to determine when lookups should be cached.
• When the Lookup transformation is not configured for caching, the PowerCenter Server queries the lookup table for each input row. The result of the Lookup query and processing is the same, regardless of whether the lookup table is cached or not. However, when the transformation is configured to not cache, the PowerCenter Server queries the lookup table instead of the lookup cache. Using a lookup cache can sometimes increase session performance.
• Just like for a joiner, the PowerCenter Server aligns all data for lookup caches on an eight-byte boundary, which helps increase the performance of the lookup.
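To make the caching trade-off concrete, here is a small Python sketch comparing a cached and an uncached lookup. All of the per-row timings and row counts are illustrative assumptions, not measurements from any particular environment.

# Illustrative cached vs. uncached lookup comparison (all figures are assumptions).
rows_processed = 1_000_000            # rows flowing through the Lookup transformation
lookup_table_rows = 5_000             # size of the lookup table
cache_build_secs = 0.002 * lookup_table_rows   # one-time cost to read the table into cache
cached_probe_secs = 0.00001           # in-memory probe per row
uncached_query_secs = 0.003           # database round trip per row

cached_total = cache_build_secs + rows_processed * cached_probe_secs
uncached_total = rows_processed * uncached_query_secs
print(f"cached:   ~{cached_total:,.0f} seconds")
print(f"uncached: ~{uncached_total:,.0f} seconds")

With a small lookup table, the one-time cache build is quickly amortized; with a very large, rarely hit table, the balance can tip the other way.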
Allocating Buffer Memory
When the PowerCenter Server initializes a session, it allocates blocks of memory to hold source and target data. Sessions that use a large number of sources and targets may require additional memory blocks. You can tweak session properties to increase the number of available memory blocks by adjusting:
• DTM Buffer Pool Size – the default setting is 12,000,000 bytes
• Default Buffer Block Size – the default size is 64,000 bytes
To configure these settings, first determine the number of memory blocks the PowerCenter Server requires to initialize the session. Then you can calculate the buffer pool size and/or the buffer block size based on the default settings, to create the required number of session blocks. If there are XML sources and targets in the mappings, use the number of groups in the XML source or target in the total calculation for the total number of sources and targets.
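A minimal Python sketch of this arithmetic follows. The (sources + targets) * 2 rule of thumb for required blocks and the assumption that roughly 10 percent of the pool is reserved are commonly cited guidelines rather than exact engine behavior, so verify both against the Session and Server Guide for your release.

# Sketch of the buffer-block arithmetic (rule-of-thumb figures; verify against the
# Session and Server Guide for your PowerCenter release).
sources = 3                     # count XML groups individually, as noted above
targets = 2
required_blocks = (sources + targets) * 2          # commonly cited guideline

dtm_buffer_pool = 12_000_000    # default DTM Buffer Pool Size, in bytes
buffer_block_size = 64_000      # default Buffer Block Size, in bytes
available_blocks = int(dtm_buffer_pool * 0.9 / buffer_block_size)   # ~10% assumed reserved

print(f"blocks required:  {required_blocks}")
print(f"blocks available: {available_blocks}")
if available_blocks < required_blocks:
    print("increase the DTM Buffer Pool Size or decrease the Buffer Block Size")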
Increasing the DTM Buffer Pool Size
The DTM Buffer Pool Size setting specifies the amount of memory the PowerCenter Server uses as DTM buffer memory. The PowerCenter Server uses DTM buffer memory to create the internal data structures and buffer blocks used to bring data into and out of the Server. When the DTM buffer memory is increased, the PowerCenter Server creates more buffer blocks, which can improve performance during momentary slowdowns.
If a session’s performance details show low numbers for your source and target BufferInput_efficiency and BufferOutput_efficiency counters, increasing the DTM buffer pool size may improve performance. Increasing DTM buffer memory allocation generally causes performance to improve initially and then level off. When the DTM buffer memory allocation is increased, the total memory available on the PowerCenter Server needs to be evaluated. If a session is part of a concurrent batch, the combined DTM buffer memory allocated for the sessions or batches must not exceed the total memory for the PowerCenter Server system. If you don’t see a significant performance increase after increasing DTM buffer memory, then it was not a factor in session performance.
Optimizing the Buffer Block Size
Within a session, you may modify the buffer block size by changing it in the Advanced Parameters section. This setting specifies the size of a memory block that is used to move data throughout the pipeline. Each source, each transformation, and each target may have a different row size, which results in different numbers of rows that can fit into one memory block. Row size is determined in the server, based on the number of ports, their datatypes, and their precisions. Ideally, block size should be configured so that it can hold roughly 100 rows, plus or minus a factor of ten. When calculating this, use the source or target with the largest row size. The default is 64K. The buffer block size does not become a factor in session performance until the number of rows falls below 10 or goes above 1000. Informatica recommends that the size of the shared memory (which determines the number of buffers available to the session) should not be increased at all unless the mapping is “complex” (i.e., more than 20 transformations).
Running Concurrent Batches
Performance can sometimes be improved by creating a concurrent batch to run several sessions in parallel on one PowerCenter Server. This technique should only be employed on servers with multiple CPUs available. A concurrent batch uses a maximum of 1.4 CPUs for the first session and a maximum of 1 CPU for each additional session. Also, it has been noted that simple mappings (i.e., mappings with only a few transformations) do not make the engine “CPU bound”, and therefore use a lot less processing power than a full CPU. If there are independent sessions that use separate sources and mappings to populate different targets, they can be placed in a concurrent batch and run at the same time. If there is a complex mapping with multiple sources, you can separate it into several simpler mappings with separate sources. This enables you to place the sessions for each of the mappings in a concurrent batch to be run in parallel.
Partitioning Sessions
If large amounts of data are being processed with PowerCenter 5.x, data can be processed in parallel with a single session by partitioning the source via the Source Qualifier. Partitioning allows you to break a single source into multiple sources and to run each in parallel. The PowerCenter Server will spawn a Read and Write thread for each partition, thus allowing for simultaneous reading, processing, and writing. Keep in mind that each partition will compete for the same resources (i.e., memory, disk, and CPU), so make sure that the hardware and memory are sufficient to support a parallel session. Also, the DTM buffer pool size is split among all partitions, so it may need to be increased for optimal performance.
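Using the CPU figures quoted in the Running Concurrent Batches paragraph above (1.4 CPUs for the first session and 1 CPU for each additional session), a quick Python check of CPU headroom looks like the following; the session and CPU counts are hypothetical.

# Quick CPU headroom check for a concurrent batch, using the figures quoted above.
available_cpus = 4
concurrent_sessions = 3
cpus_needed = 1.4 + (concurrent_sessions - 1) * 1.0

print(f"estimated CPUs needed: {cpus_needed:.1f} of {available_cpus}")
if cpus_needed > available_cpus:
    print("run fewer sessions concurrently, or add partitions only where CPU remains")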
Increasing the Target Commit Interval
One method of resolving target database bottlenecks is to increase the commit interval. Each time the PowerCenter Server commits, performance slows. Therefore, the smaller the commit interval, the more often the PowerCenter Server writes to the target database, and the slower the overall performance. If you increase the commit interval, the number of times the PowerCenter Server commits decreases and performance may improve. When increasing the commit interval at the session level, you must remember to increase the size of the database rollback segments to accommodate this larger number of rows. One of the major reasons that Informatica has set the default commit interval to 10,000 is to accommodate the default rollback segment / extent size of most databases. If you increase both the commit interval and the database rollback segments, you should see an increase in performance. In some cases though, just increasing the commit interval without making the appropriate database changes may cause the session to fail part way through (you may get a database error like “unable to extend rollback segments” in Oracle).
Disabling Session Recovery
You can improve performance by turning off session recovery. The PowerCenter Server writes recovery information in the OPB_SRVR_RECOVERY table during each commit, which can decrease performance. The PowerCenter Server setup can be set to disable session recovery. But be sure to weigh the importance of improved session performance against the ability to recover an incomplete session when considering this option.
Disabling Decimal Arithmetic
If a session runs with decimal arithmetic enabled, disabling decimal arithmetic may improve session performance. The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a high-precision Decimal datatype in a session, it must be configured so that the PowerCenter Server recognizes this datatype by selecting Enable Decimal Arithmetic in the session property sheet. However, since reading and manipulating a high-precision datatype (i.e., one with a precision greater than 28) can slow the PowerCenter Server, session performance may be improved by disabling decimal arithmetic.
Reducing Error Tracing
If a session contains a large number of transformation errors, you may be able to improve performance by reducing the amount of data the PowerCenter Server writes to the session log. To reduce the amount of time spent writing to the session log file, set the tracing level to Terse. Terse tracing should only be set if the sessions run without problems and session details are not required. At this tracing level, the PowerCenter Server does not write error messages or row-level information for reject data. However, if terse is not an acceptable level of detail, you may want to consider leaving the tracing level at Normal and focus your efforts on reducing the number of transformation errors. Note that the tracing level must be set to Normal in order to use the reject loading utility. As an additional debug option (beyond the PowerCenter Debugger), you may set the tracing level to Verbose to see the flow of data between transformations. However, this will significantly affect the session performance. Do not use Verbose tracing
except when testing sessions. Always remember to switch tracing back to Normal after the testing is complete. The session tracing level overrides any transformation-specific tracing levels within the mapping. Informatica does not recommend reducing error tracing as a long-term response to high levels of transformation errors. Because there are only a handful of reasons why transformation errors occur, it makes sense to fix and prevent any recurring transformation errors.
Determining Bottlenecks
Challenge
Because there are many variables involved in identifying and rectifying performance bottlenecks, an efficient method for determining where bottlenecks exist is crucial to good data warehouse management.
Description
The first step in performance tuning is to identify performance bottlenecks. Carefully consider the following five areas to determine where bottlenecks exist; use a process of elimination, investigating each area in the order indicated:
1. Write
2. Read
3. Mapping
4. Session
5. System
Before you begin, you should establish an approach for identifying performance bottlenecks. To begin, attempt to isolate the problem by running test sessions. You should be able to compare the session’s original performance with that of the tuned session’s performance. The swap method is very useful for determining the most common bottlenecks. It involves the following five steps:
1. Make a temporary copy of the mapping and/or session that is to be tuned, then tune the copy before making changes to the original.
2. Implement only one change at a time and test for any performance improvements to gauge which tuning methods work most effectively in the environment.
3. Document the change made to the mapping and/or session and the performance metrics achieved as a result of the change. The actual execution time may be used as a performance metric.
4. Delete the temporary sessions upon completion of performance tuning.
5. Make appropriate tuning changes to mappings and/or sessions.
Write Bottlenecks
Relational Targets
The most common performance bottleneck occurs when the PowerCenter Server writes to a target database. This type of bottleneck can easily be identified with the following procedure:
1. Make a copy of the original session.
2. Configure the test session to write to a flat file.
If the session performance is significantly increased when writing to a flat file, you have a write bottleneck.
Flat File Targets
If the session targets a flat file, you probably do not have a write bottleneck. You can optimize session performance by writing to a flat file target local to the PowerCenter Server. If the local flat file is very large, you can optimize the write process by dividing it among several physical drives.
Read Bottlenecks
Relational Sources
If the session reads from a relational source, you should first use a read test session with a flat file as the source in the test session. You may also use a database query to indicate if a read bottleneck exists.
Using a Test Session with a Flat File Source
1. Create a mapping and session that writes the source table data to a flat file.
2. Create a test mapping that contains only the flat file source, the source qualifier, and the target table.
3. Create a session for the test mapping.
If the test session’s performance increases significantly, you have a read bottleneck.
Using a Database Query
To identify a source bottleneck by executing a read query directly against the source database, follow these steps:
1. Copy the read query directly from the session log.
2. Run the query against the source database with a query tool such as SQL*Plus.
3. Measure the query execution time and the time it takes for the query to return the first row.
If there is a long delay between the two time measurements, you have a source bottleneck; the two measurements can also be scripted, as sketched below.
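The sketch below scripts the same two measurements with a Python DB-API driver; the driver, connection string, and query text are placeholders for whatever read query appears in your session log, and SQL*Plus (as suggested above) works just as well.

# Time-to-first-row vs. total fetch time for the read query copied from the session log.
# The driver, connection string, and query text are placeholders.
import time
import cx_Oracle                      # any DB-API 2.0 driver can be used the same way

query = "SELECT ..."                  # paste the read query from the session log here
conn = cx_Oracle.connect("user/password@source_db")
cursor = conn.cursor()

start = time.time()
cursor.execute(query)
cursor.fetchone()
first_row_secs = time.time() - start

cursor.fetchall()                     # drain the remaining rows
total_secs = time.time() - start
print(f"time to first row: {first_row_secs:.1f}s, total query time: {total_secs:.1f}s")
# A long gap between the two usually points to a read (source) bottleneck.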
Flat File Sources
If your session reads from a flat file source, you probably do not have a read bottleneck. Tuning the Line Sequential Buffer Length to a size large enough to hold approximately four to eight rows of data at a time may help when reading flat file sources. Ensure the flat file source is local to the PowerCenter Server.
Mapping Bottlenecks
If you have eliminated the reading and writing of data as bottlenecks, you may have a mapping bottleneck. Use the swap method to determine if the bottleneck is in the mapping. After using the swap method, you can use the session’s performance details to determine if mapping bottlenecks exist. High Rowsinlookupcache and Errorrows counters indicate mapping bottlenecks. Follow these steps to identify mapping bottlenecks:
Using a Test Mapping without Transformations
1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to the target.
High Rowsinlookupcache counters: Multiple lookups can slow the session. You may improve session performance by locating the largest lookup tables and tuning those lookup expressions.
High Errorrows counters: Transformation errors affect session performance. If a session has large numbers in any of the Transformation_errorrows counters, you may improve performance by eliminating the errors.
For further details on eliminating mapping bottlenecks, refer to the Best Practice: Tuning Mappings for Better Performance.
Session Bottlenecks
Session performance details can be used to flag other problem areas in the session Advanced Options Parameters or in the mapping.
Low Buffer Input and Buffer Output Counters
If the BufferInput_efficiency and BufferOutput_efficiency counters are low for all sources and targets, increasing the session DTM buffer pool size may improve performance.
Aggregator, Rank, and Joiner Readfromdisk and Writetodisk Counters
If a session contains Aggregator, Rank, or Joiner transformations, examine each Transformation_readfromdisk and Transformation_writetodisk counter. If these counters display any number other than zero, you can improve session performance by increasing the index and data cache sizes.
For further details on eliminating session bottlenecks, refer to the Best Practice: Tuning Sessions for Better Performance.
System Bottlenecks
After tuning the source, target, mapping, and session, you may also consider tuning the system hosting the PowerCenter Server.
Windows NT/2000
Use system tools such as the Performance tab in the Task Manager or the Performance Monitor to view CPU usage and total memory usage.
UNIX
On UNIX, use system tools like vmstat and iostat to monitor such items as system performance and disk swapping actions.
For further information regarding system tuning, refer to the Best Practices: Performance Tuning UNIX-Based Systems and Performance Tuning NT/2000-Based Systems.
The following table details the Performance Counters that can be used to flag session and mapping bottlenecks. Note that these counters can only be found in the Session Performance Details file. The counters, grouped by transformation, are:
Source Qualifier and Normalizer Transformations
BufferInput_Efficiency: Percentage reflecting how seldom the reader waited for a free buffer when passing data to the DTM.
BufferOutput_Efficiency: Percentage reflecting how seldom the DTM waited for a full buffer of data from the reader.
Target
BufferInput_Efficiency: Percentage reflecting how seldom the DTM waited for a free buffer when passing data to the writer.
BufferOutput_Efficiency: Percentage reflecting how seldom the Informatica Server waited for a full buffer of data from the reader.
Aggregator and Rank Transformations
Aggregator/Rank_readfromdisk: Number of times the Informatica Server read from the index or data file on the local disk, instead of using cached data.
Aggregator/Rank_writetodisk: Number of times the Informatica Server wrote to the index or data file on the local disk, instead of using cached data.
Joiner Transformation (see Note below)
Joiner_readfromdisk: Number of times the Informatica Server read from the index or data file on the local disk, instead of using cached data.
Joiner_writetodisk: Number of times the Informatica Server wrote to the index or data file on the local disk, instead of using cached data.
Lookup Transformation
Lookup_rowsinlookupcache: Number of rows stored in the lookup cache.
All Transformations
Transformation_errorrows: Number of rows in which the Informatica Server encountered an error.
Note: The PowerCenter Server generates two sets of performance counters for a Joiner transformation. The first set of counters refers to the master source. The Joiner transformation does not generate output row counters associated with the master source. The second set of counters refers to the detail source.
Advanced Client Configuration Options
Challenge
Setting the Registry in order to ensure consistent client installations, resolve potential missing or invalid license key issues, and change the Server Manager Session Log Editor to your preferred editor.
Description
Ensuring Consistent Data Source Names
To ensure the use of consistent data source names for the same data sources across the domain, the Administrator can create a single "official" set of data sources, then use the Repository Manager to export that connection information to a file. You can then distribute this file and import the connection information for each client machine.
Solution
• From Repository Manager, choose Export Registry from the Tools drop-down menu.
• For all subsequent client installs, simply choose Import Registry from the Tools drop-down menu.
Resolving the Missing or Invalid License Key Issue
The “missing or invalid license key” error occurs when attempting to install PowerCenter Client tools on NT 4.0 or Windows 2000 with a userid other than ‘Administrator.’ This problem also occurs when the client software tools are installed under the Administrator account, and subsequently a user with a non-administrator ID attempts to run the tools. The user who attempts to log in using the normal ‘non-administrator’ userid will be unable to start the PowerCenter Client tools. Instead, the software will display the message indicating that the license key is missing or invalid.
Solution
• While logged in as the installation user with administrator authority, use regedt32 to edit the registry.
• Under HKEY_LOCAL_MACHINE, open Software/Informatica/PowerMart Client Tools/. From the menu bar, select Security/Permissions, and grant read access to the users that should be permitted to use the PowerMart Client. (Note that the registry entries for both PowerMart and PowerCenter server and client tools are stored as PowerMart Server and PowerMart Client tools.)
Changing the Server Manager Session Log Editor
The session log editor is not automatically determined when the PowerCenter Client tools are installed. A window appears the first time a session log is viewed from the PowerCenter Server Manager, prompting the user to enter the full path name of the editor to be used to view the logs. Users often set this parameter incorrectly and must access the registry to change it.
Solution
• While logged in as the installation user with administrator authority, use regedt32 to go into the registry.
• Move to registry path location: HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\[CLIENT VERSION]\Server Manager\Session Files.
• From the menu bar, select View Tree and Data. Select the Log File Editor entry by double-clicking on it.
• Replace the entry with the appropriate editor entry, i.e., typically WordPad.exe or Write.exe.
• Select Registry --> Exit from the menu bar to save the entry.
Advanced Server Configuration Options
Challenge
Configuring the Throttle Reader and File Debugging options, adjusting semaphore settings in the UNIX environment, and configuring server variables.
Description
Configuring the Throttle Reader
If problems occur when running sessions, some adjustments at the Server level can help to alleviate issues or isolate problems. One technique that often helps resolve “hanging” sessions is to limit the number of reader buffers that use Throttle Reader. This is particularly effective if your mapping contains many target tables, or if the session employs constraint-based loading. This parameter closely manages buffer blocks in memory by restricting the number of blocks that can be utilized by the Reader.
Note for PowerCenter 5.x and above ONLY: If a session is hanging and it is partitioned, it is best to remove the partitions before adjusting the throttle reader. When a session is partitioned, the server makes separate connections to the source and target for every partition. This will cause the server to manage many buffer blocks. If the session still hangs, try adjusting the throttle reader.
Solution: To limit the number of reader buffers using Throttle Reader in NT/2000:
• Access hkey_local_machine\system\currentcontrolset\services\powermart\parameters\miscinfo.
• Create a new String value with value name of 'ThrottleReader' and value data of '10'.
To do the same thing in UNIX:
• Add this line to the .cfg file: ThrottleReader=10
Configuring File Debugging Options
If problems occur when running sessions or if the PowerCenter Server has a stability issue, help technical support to resolve the issue by supplying them with Debug files.
To set the debug options on for NT/2000:
1. Select Start, Run, and type “regedit”.
2. Go to hkey_local_machine, system, current_control_set, services, powermart, miscInfo.
3. Select edit, then add value.
4. Place "DebugScrubber" as the value name, then hit OK. Insert "4" as the value.
5. Repeat steps 3 and 4, but use "DebugWriter", "DebugReader", "DebugDTM" with all three set to "1".
To do the same in UNIX, insert the following entries in the pmserver.cfg file:
• DebugScrubber=4
• DebugWriter=1
• DebugReader=1
• DebugDTM=1
Adjusting Semaphore Settings
The UNIX version of the PowerCenter Server uses operating system semaphores for synchronization. You may need to increase these semaphore settings before installing the server. The number of semaphores required to run a session is 7. Most installations require between 64 and 128 available semaphores, depending on the number of sessions the server runs concurrently. This is in addition to any semaphores required by other software, such as database servers. The total number of available operating system semaphores is an operating system configuration parameter, with a limit per user and system. The method used to change the parameter depends on the operating system:
• HP/UX: Use sam (1M) to change the parameters.
• Solaris: Use admintool or edit /etc/system to change the parameters.
• AIX: Use smit to change the parameters.
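Given the figure quoted above of 7 semaphores per session, a quick Python estimate of the total requirement might look like the following; the session count and the allowance for other software are assumptions to adjust locally.

# Rough semaphore estimate using the per-session figure quoted above (7 per session).
concurrent_sessions = 12
other_software_semaphores = 40        # database server and other tools -- an assumption
required = concurrent_sessions * 7 + other_software_semaphores
print(f"approximate semaphores needed: {required}")
# Compare the result with the kernel limits (e.g., SEMMNS) listed in the next section.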
Setting Shared Memory and Semaphore Parameters
Informatica recommends setting the following parameters as high as possible for the operating system. However, if you set these parameters too high, the machine may not boot. Refer to the operating system documentation for parameter limits:
Parameter (recommended value for Solaris): description
SHMMAX (4294967295): Maximum size in bytes of a shared memory segment.
SHMMIN (1): Minimum size in bytes of a shared memory segment.
SHMMNI (100): Number of shared memory identifiers.
SHMSEG (10): Maximum number of shared memory segments that can be attached by a process.
SEMMNS (200): Number of semaphores in the system.
SEMMNI (70): Number of semaphore set identifiers in the system. SEMMNI determines the number of semaphores that can be created at any one time.
SEMMSL (equal to or greater than the value of the PROCESSES initialization parameter): Maximum number of semaphores in one semaphore set. Must be equal to the maximum number of processes.
For example, you might add the following lines to the Solaris /etc/system file to configure the UNIX kernel:
set shmsys:shminfo_shmmax = 4294967295
set shmsys:shminfo_shmmin = 1
set shmsys:shminfo_shmmni = 100
set shmsys:shminfo_shmseg = 10
set semsys:seminfo_semmns = 200
set semsys:seminfo_semmni = 70
Always reboot the system after configuring the UNIX kernel.
Configuring Server Variables
One configuration best practice is to properly configure and leverage Server variables. Benefits of using server variables:
• Ease of deployment from development environment to production environment.
• Ease of switching sessions from one server machine to another without manually editing all the sessions to change directory paths.
• All the variables are related to directory paths used by the server.
Approach
In Server Manager, edit the server configuration to set or change the variables. Each registered server has its own set of variables. The list is fixed, not user-extensible.
Server variable: value
$PMRootDir: (no default – user must insert a path)
$PMSessionLogDir: $PMRootDir/SessLogs
$PMBadFileDir: $PMRootDir/BadFiles
$PMCacheDir: $PMRootDir/Cache
$PMTargetFileDir: $PMRootDir/TargetFiles
$PMSourceFileDir: $PMRootDir/SourceFiles
$PMExtProcDir: $PMRootDir/ExtProc
$PMSuccessEmailUser: (no default – user must insert a path)
$PMFailureEmailUser: (no default – user must insert a path)
$PMSessionLogCount: 0
$PMSessionErrorThreshold: 0
Where are these variables referenced?
• Server Manager session editor: anywhere in the fields for session log directory, bad file directory, etc.
• Designer: Aggregator/Rank/Joiner attribute for ‘Cache Directory’; External Procedure attribute for ‘Location’.
Does every session and mapping have to use these variables (are they mandatory)?
• No. If you remove any variable reference from the session or the widget attributes, then the server does not use that variable.
What if a variable is not referenced in the session or mapping?
• The variable is just a convenience; the user can choose to use it or not. The variable will be expanded only if it is explicitly referenced from another location. If the session log directory is specified as $PMSessionLogDir, then the logs are put in that location.
Note that this location may be different on every server. This is in fact a primary purpose for utilizing variables. But if the session log directory field is changed to designate a specific location, e.g. ‘/home/john/logs’, then the session logs will instead be placed in the directory location as designated. (The variable $PMSessionLogDir will be unused so it does not matter what the value of the variable is set to).
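As a toy illustration of why the variables ease deployment, the Python sketch below mirrors how $PMRootDir-relative defaults resolve to different physical paths on different servers; the actual expansion is done by the PowerCenter Server, and the directory names here are hypothetical.

# Toy illustration of $PMRootDir-relative defaults resolving on two servers.
def expand(root_dir):
    return {
        "$PMSessionLogDir": f"{root_dir}/SessLogs",
        "$PMBadFileDir": f"{root_dir}/BadFiles",
        "$PMCacheDir": f"{root_dir}/Cache",
    }

print(expand("/opt/pm/dev"))      # development server
print(expand("/opt/pm/prod"))     # production server -- same sessions, different paths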
Platform Sizing
Challenge
Determining the appropriate platform size to support PowerCenter, considering specific environmental and processing requirements. Sizing may not be an easy task because it may be necessary to configure a single server to support numerous applications.
Description
This Best Practice provides general guidance for sizing computing environments. It also discusses some potential questions and pitfalls that may arise when migrating to Production. Certain terms used within this Best Practice are specific to Informatica’s PowerCenter. Please consult the appropriate PowerCenter manuals for explanation of these terms where necessary.
Environmental configurations may vary greatly with regard to hardware and software sizing. In addition to requirements for PowerCenter, other applications may share the server. Be sure to consider all mandatory server software components, including the operating system and all of its components, the database engine, front-end engines, etc. Regardless of whether or not the server is shared, it will be necessary to research the requirements of these additional software components when estimating the size of the overall environment.
Technical Information
Before delving into key sizing questions, let us review the PowerCenter engine and its associated resource needs. Each session:
• Represents an active task that performs data loading. PowerCenter provides session parameters that can be set to specify the amount of required shared memory per session. This shared memory setting is important, as it will dictate the amount of RAM required when running concurrent sessions, and will also be used to provide a level of performance that meets your needs.
• Uses up to 140% of CPU resources. This is important to remember if sessions will be executed concurrently.
• Requires 20-30 MB of memory per session if there are no aggregations, lookups, or heterogeneous data joins contained within the mapping.
• May require additional memory for the caching of aggregations, lookups, or joins. The amount of memory can be calculated per session. Refer to the Session and Server Guide to determine the exact amount of memory necessary per session.
Note: It may be helpful to refer to the Performance Tuning section in Phase 4 of the Informatica Methodology when determining memory settings. The Performance Tuning section provides additional information on factors that typically affect session performance, and offers general guidance for estimating session resources.
The PowerCenter engine:
• Requires 20-30 MB of memory for the main server engine for session coordination.
• Requires additional memory when caching for aggregation, lookups, or joins, because:
  • Lookup tables, when cached in full, result in memory consumption commensurate with the size of the tables involved.
  • Aggregate caches store the individual groups; more memory is used if there are more groups.
  • In a join, cache the master table; memory consumed depends on the size of the master.
Note: Sorting the input to aggregations will greatly reduce the need for memory.
Disk space is not a factor if the machine is dedicated exclusively to the server engine. However, if the following conditions exist, disk space will need to be carefully considered:
• Data is staged to flat files on the PowerCenter server.
• Data is stored in incremental aggregation files for adding data to aggregates. The space consumed is about the size of the data aggregated.
• Temporary space is not used like a database on disk, unless the cache requires it after filling system memory.
• Data does not need to be striped to prevent head contention. This includes all types of data, such as flat files and database tables.
A rough combined memory estimate follows.
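Pulling the memory figures above together, the following Python sketch gives a rough total for a set of concurrent sessions. The session count and per-session cache sizes are placeholders; calculate real cache requirements per session as described in the Session and Server Guide.

# Back-of-the-envelope memory estimate using the figures above (placeholders throughout).
engine_mb = 30                          # main server engine (20-30 MB)
per_session_mb = 30                     # per session without caches (20-30 MB)
cache_mb_per_session = [0, 0, 150, 0, 300, 0, 0, 50]   # lookup/aggregate/join caches (assumed)
concurrent_sessions = len(cache_mb_per_session)

total_mb = engine_mb + concurrent_sessions * per_session_mb + sum(cache_mb_per_session)
print(f"estimated PowerCenter memory: ~{total_mb} MB, plus OS, database, and other tools")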
Key Questions
The goal of this analysis is to size the machine so that the ETL processes can complete within the specified load window. Consider the following questions when estimating the required number of sessions, the volume of data moved per session, and the caching requirements for the session’s lookup tables, aggregation, and heterogeneous joins. Use these estimates along with recommendations in the preceding Technical Information section to determine the required number of processors, memory, and disk space to achieve the required performance to meet the load window.
Please note that the hardware sizing analysis is highly dependent on the environment in which the server is deployed. It is very important to understand the performance characteristics of the environment before making any sizing conclusions.
It is vitally important to remember that in addition to PowerCenter, other applications may be vying for server resources. PowerCenter commonly runs on a server that also hosts a database engine plus query/analysis tools. In an environment where PowerCenter runs in parallel with all of these tools, the query/analysis tool often drives the hardware requirements. However, if the ETL processing is performed after business hours, the query/analysis tool requirements may not impose a sizing limitation.
With these additional processing requirements in mind, consider platform size in light of the following questions:
• What sources are accessed by the mappings? How do you currently access those sources?
• Do the sources reside locally, or will they be accessed via a network connection? What kind of network connection exists?
• Have you decided on the target environment (database/hardware/operating system)? If so, what is it?
• Have you decided on the PowerCenter server environment (hardware/operating system)? Is it possible for the PowerCenter server to be on the same machine as the target?
• How will information be accessed for reporting purposes (e.g., cube, ad-hoc query tool, etc.) and what tools will you use to implement this access?
• What other applications or services, if any, run on the PowerCenter server?
• Has the database table space been distributed across controllers, where possible, to maximize throughput by reading and writing data in parallel?
When considering the server engine size, answer the following questions:
• Are there currently extract, transform, and load processes in place? If so, what are the processes, and how long do they take?
• What is the total volume of data that must be moved, in bytes?
• What is the largest table (bytes and rows)? Is there any key on this table that could be used to partition load sessions, if necessary?
• How will the data be moved: via flat file processing or relational tables?
• What is the load strategy: is the data updated, incrementally loaded, or will all tables be truncated and reloaded?
• Will the data processing require staging areas?
• What is the load plan? Are there dependencies between facts and dimensions?
• How often will the data be refreshed? Will the refresh be scheduled at a certain time, or driven by external events?
• Is there a "modified" timestamp on the source table rows, enabling incremental load strategies?
• What is the size of the batch window that is available for the load?
• Does the load process populate detail data, aggregations, or both?
• If data is being aggregated, what is the ratio of source/target rows for the largest result set? How large is the result set (bytes and rows)?
The answers to these questions will provide insight into the factors that impact PowerCenter's resource requirements. To simplify the analysis, focus on large, "critical path" jobs that drive the resource requirement.
A Sample Performance Result
The following is a testimonial from a customer configuration. Please note that these performance tests were run on a previous version of PowerCenter, which did not include the performance and functional enhancements in release 5.1. These results are offered as one example of throughput. However, results will definitely vary by installation because each environment has a unique architecture and unique data characteristics.
The performance tests were performed on a 4-processor Sun E4500 with 2GB of memory. This server handled just under 20.5 million rows, and more than 2.8GB of data, in less than 54 minutes. In this test scenario, 22 sessions ran in parallel, populating a large product sales table. Four sessions ran after the set of 22, populating various summarization tables based on the product sales table. All of the mappings were complex, joining several sources and utilizing several Expression, Lookup, and Aggregation transformations. The source and target database used in the tests was Oracle. The source and target were both hosted locally on the ETL Server.
Links
The following link may prove helpful when determining the platform size: www.tpc.org. This website contains benchmarking reports that will help you fine-tune your environment and may assist in determining the processing power required.
Running Sessions in Recovery Mode
Challenge
Use PowerCenter standard functionality to recover data that is committed to a session's targets, even if the session does not complete.
Description
When a network or other problem causes a session whose source contains a million rows to fail after only half of the rows are committed to the target, one option is to truncate the target and run the session again from the beginning. But that is not the only option. Rather than processing the first half of the source again, you can tell the server to keep data already committed to the target database and process the rest of the source. This results in accurate and complete target data, as if the session completed successfully with one run. This technique is called performing recovery.
When you run a session in recovery mode, the server notes the row id of the last row committed to the target database. The server then reads all sources again, but only processes from the subsequent row id. For example, if the server commits 1000 rows before the session fails, when you run the session in recovery mode, the server reads all source tables, and then passes data to the Data Transformation Manager (DTM) starting from row 1001. When necessary, the server can recover the same session more than once. That is, if a session fails while running in recovery mode, you can re-run the session in recovery mode until the session completes successfully. This is called nested recovery.
The server can recover committed target data if the following three criteria are met:
•
All session targets are relational. The server can only perform recovery on relational tables. If the session has file targets, the server cannot perform recovery. If a session writing to file targets fails, delete the files, and run the session again. The session is configured for a normal (not bulk) target load. The server uses database logging to perform recovery. Since bulk loading bypasses database logging, the server cannot recover sessions configured to bulk load targets. Although recovering a large session can be more efficient
INFORMATICA CONFIDENTIAL
BEST PRACTICES
PAGE BP-193
than running the session again, bulk loading increases general session performance. When configuring session properties for sessions processing large amounts of data, weigh the importance of performing recovery when choosing a target load type. When you configure a session to load in bulk, the server logs a message in the session log stating that recovery is not supported. •
The server configuration parameter Disable Recovery is not selected. When the Disable Recovery option is checked, the server does not create the OPB_SRVR_RECOVERY table in the target database to store recovery-related information. If the table already exists, the server does not write information to that table.
In addition, to ensure accurate results from the recovery, the following must be true: • •
Source data does not change before performing recovery. This includes inserting, updating, and deleting source data. Changes in source files or tables can result in inaccurate data. The mapping used in the session does not use a Sequence Generator or Normalizer. Both the Sequence Generator and the Normalizer transformations generate source values: the Sequence Generator generates sequences, and the Normalizer generates primary keys. Therefore, sessions using these transformations are not guaranteed to return the same values when performing recovery.
Session Logs If a session is configured to archive session logs, the server creates a new session log for the recovery session. If you perform nested recovery, the server creates a new log for each session run. If the session is not configured to archive session logs, the server overwrites the existing log when you recover the session. Reject Files When performing recovery, the server creates a single reject file. The server appends rejected rows from the recovery session (or sessions) to the session reject file. This allows you to correct and load all rejected rows from the completed session. Example Session “s_recovery” reads from a Sybase source and writes to a target table in “production_target”, a Microsoft SQL Server database. This session is configured for a normal load. The mapping consists of: Source Qualifier: SQ_LINEITEM Expression transformation: EXP_TRANS Target: T_LINEITEM The session is configured to save 5 session logs.
PAGE BP-194
BEST PRACTICES
INFORMATICA CONFIDENTIAL
First Run
The first time the session runs, the server creates a session log named s_recovery.log. (If the session is configured to save logs by timestamp, the server appends the date and time to the log file name.) The server also creates a reject file for the target table named t_lineitem.bad. The following section of the session log shows the server preparing to load normally to the production_target database. Since the server cannot find OPB_SRVR_RECOVERY, it creates the table.
CMN_1053 Writer: Target is database [TOMDB@PRODUCTION_TARGET], user [lchen], bulk mode [OFF]
...
CMN_1039 SQL Server Event
CMN_1039 [01/14/99 18:42:44 SQL Server Message 208 : Invalid object name 'OPB_SRVR_RECOVERY'.] Thu Jan 14 18:42:44 1999
CMN_1040 SQL Server Event
CMN_1040 [01/14/99 18:42:44 DB-Library Error 10007 : General SQL Server error: Check messages from the SQL Server.] Thu Jan 14 18:42:44 1999
CMN_1022 Database driver error...
CMN_1022 [Function Name : Execute SqlStmt : SELECT SESSION_ID FROM OPB_SRVR_RECOVERY]
WRT_8017 Created OPB_SRVR_RECOVERY table in target database.
As the following session log shows, the server performs six target-based commits before the session fails.
TM_6095 Starting Transformation Engine...
Start loading table [T_LINEITEM] at: Thu Jan 14 18:42:50 1999
TARGET BASED COMMIT POINT Thu Jan 14 18:43:59 1999
=============================================
Table: T_LINEITEM
Rows Output: 10125
Rows Applied: 10125
Rows Rejected: 0
TARGET BASED COMMIT POINT Thu Jan 14 18:45:09 1999
=============================================
Table: T_LINEITEM
Rows Output: 20250
Rows Applied: 20250
Rows Rejected: 0
TARGET BASED COMMIT POINT Thu Jan 14 18:46:25 1999
=============================================
Table: T_LINEITEM
Rows Output: 30375
Rows Applied: 30375
Rows Rejected: 0
TARGET BASED COMMIT POINT Thu Jan 14 18:47:31 1999
=============================================
Table: T_LINEITEM
Rows Output: 40500
Rows Applied: 40500
Rows Rejected: 0
TARGET BASED COMMIT POINT Thu Jan 14 18:48:35 1999
=============================================
Table: T_LINEITEM
Rows Output: 50625
Rows Applied: 50625
Rows Rejected: 0
TARGET BASED COMMIT POINT Thu Jan 14 18:49:41 1999
=============================================
Table: T_LINEITEM
Rows Output: 60750
Rows Applied: 60750
Rows Rejected: 0
When a session fails, you can truncate the target and run the entire session again. However, since the server committed more than 60,000 rows to the target, rather than running the whole session again, you can configure the session to recover the committed rows.
Running a Recovery Session
To run a recovery session, check the Perform Recovery option on the Log Files tab of the session property sheet. To archive the existing session log, either increase the number of session logs saved, or choose the Save Session Log By Timestamp option on the Log Files tab. Start the session, or if necessary, edit the session schedule and reschedule the session.
Second Run (Recovery Session)
When you run the session in recovery mode, the server creates a new session log. Since the session is configured to save multiple logs, it renames the existing log s_recovery.log.0, and writes all new session information in s_recovery.log. The server reopens the existing reject file (t_lineitem.bad) and appends any rejected rows to that file. When performing recovery, the server reads the source, and then passes data to the DTM beginning with the first uncommitted row. In the session log below, the server notes the session is in recovery mode, and states the row at which it will begin recovery (i.e., row 60751).
TM_6098 Session [s_recovery] running in recovery mode.
…
TM_6026 Recovering from row [60751] for target instance [T_LINEITEM].
When running the session with the Verbose Data tracing level, the server provides more detailed information about the session. As seen below, the server sets row 60751 as the row from which to recover. It opens the existing reject file and begins processing with the next row, 60752. Note: Setting the tracing level to Verbose Data slows the server's performance and is not recommended for most production sessions.
CMN_1053 SetRecoveryInfo for transform(T_LINEITEM): Rows To Recover From = [60751]:
CMN_1053 Current Transform [SQ_lineitem]: Rows To Consume From = [60751]:
CMN_1053 Output Transform [EXPTRANS]: Rows To Produce From = [60751]:
CMN_1053 Current Transform [EXPTRANS]: Rows To Consume From = [60751]:
CMN_1053 Output Transform [T_LINEITEM]: Rows To Produce From = [60751]:
CMN_1053 Writer: Opened bad (reject) file [C:\winnt\system32\BadFiles\t_lineitem.bad]
Third Run (Nested Recovery)
If the recovery session fails before completing, you can run the session in recovery mode again. The server runs the session as it did the earlier recovery sessions, creating a new session log and appending bad data to the reject file. You can run the session in recovery mode as many times as necessary to complete the session's target tables. When the server completes loading target tables, it performs any configured post-session stored procedures or commands normally, as if the session completed in a single run.
Returning to Normal Session
After successfully recovering a session, you must edit the session properties to clear the Perform Recovery option. If necessary, return the session to its normal schedule and reschedule the session.
Things to Consider
In PowerCenter 5.1, the DisableRecovery server initialization flag defaults to Yes. This means the OPB_SRVR_RECOVERY table will not be created, and ‘Perform Recovery’ will not be possible unless this flag is changed to No during server configuration. You will need to have “create table” permissions in the target database in order to create this table.
Developing the Business Case
Challenge
Identifying the departments and individuals that are likely to benefit directly from the project implementation. Understanding these individuals, and their business information requirements, is key to defining and scoping the project.
Description
The following four steps summarize business case development and lay a good foundation for proceeding into detailed business requirements for the project.
1. One of the first steps in establishing the business scope is identifying the project beneficiaries and understanding their business roles and project participation. In many cases, the Project Sponsor can help to identify the beneficiaries and the various departments they represent. This information can then be summarized in an organization chart that is useful for ensuring that all project team members understand the corporate/business organization.
• Activity - Interview project sponsor to identify beneficiaries, define their business roles and project participation.
• Deliverable - Organization chart of corporate beneficiaries and participants.
2. The next step in establishing the business scope is to understand the business problem or need that the project addresses. This information should be clearly defined in a Problem/Needs Statement, using business terms to describe the problem. For example, the problem may be expressed as "a lack of information" rather than "a lack of technology" and should detail the business decisions or analysis that is required to resolve the lack of information. The best way to gather this type of information is by interviewing the Project Sponsor and/or the project beneficiaries.
• Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries regarding problems and needs related to the project.
• Deliverable - Problem/Need Statement
3. The next step in creating the project scope is defining the business goals and objectives for the project and detailing them in a comprehensive Statement of Project Goals and Objectives. This statement should be a high-level expression of the desired business solution (e.g., what strategic or tactical benefits does the business expect to gain from the project?) and should avoid any technical considerations at this point. Again, the Project Sponsor and beneficiaries are the best sources for this type of information. It may be practical to combine information gathering for the needs assessment and goals definition, using individual interviews or general meetings to elicit the information.
• Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries regarding business goals and objectives for the project.
• Deliverable - Statement of Project Goals and Objectives
4. The final step is creating a Project Scope and Assumptions statement that clearly defines the boundaries of the project based on the Statement of Project Goals and Objectives and the associated project assumptions. This statement should focus on the type of information or analysis that will be included in the project rather than what will not. The assumptions statements are optional and may include qualifiers on the scope, such as assumptions of feasibility, specific roles and responsibilities, or availability of resources or data.
• Activity - Business Analyst develops Project Scope and Assumptions statement for presentation to the Project Sponsor.
• Deliverable - Project Scope and Assumptions statement
Assessing the Business Case
Challenge
Developing a solid business case for the project that includes both the tangible and intangible potential benefits of the project.
Description
The Business Case should include both qualitative and quantitative assessments of the project. The Qualitative Assessment portion of the Business Case is based on the Statement of Problem/Need and the Statement of Project Goals and Objectives (both generated in Subtask 1.1.1) and focuses on discussions with the project beneficiaries of expected benefits in terms of problem alleviation, cost savings or controls, and increased efficiencies and opportunities. The Quantitative Assessment portion of the Business Case provides specific measurable details of the proposed project, such as the estimated ROI, which may involve the following calculations:
• Cash flow analysis - Projects positive and negative cash flows for the anticipated life of the project. Typically, ROI measurements use the cash flow formula to depict results.
• Net present value - Evaluates cash flow according to the long-term value of current investment. Net present value shows how much capital needs to be invested currently, at an assumed interest rate, in order to create a stream of payments over time. For instance, to generate an income stream of $500 per month over six months at an interest rate of eight percent would require an investment (a net present value) of $2,311.44 (see the worked calculation following this list).
• Return on investment - Calculates net present value of total incremental cost savings and revenue divided by the net present value of total costs multiplied by 100. This type of ROI calculation is frequently referred to as return of equity or return on capital employed.
• Payback - Determines how much time will pass before an initial capital investment is recovered.
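The $2,311.44 figure in the Net present value bullet can be reproduced with the standard present-value-of-an-annuity formula; note that reproducing the published number requires treating the eight percent as a per-month rate. A minimal Python check:

# Present value of an annuity: $500 per period for 6 periods at 8% per period.
payment = 500.0
rate = 0.08            # the example's eight percent, applied per monthly period
periods = 6

npv = payment * (1 - (1 + rate) ** -periods) / rate
print(f"required investment (net present value): ${npv:,.2f}")   # -> $2,311.44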
The following are steps to calculate the quantitative business case or ROI:
Step 1. Develop Enterprise Deployment Map. This is a model of the project phases over a timeline, estimating as specifically as possible customer participation (e.g., by department and location), subject area and type of information/analysis, numbers of users, numbers of data marts and data sources, types of sources, and size of data set.
Step 2. Analyze Potential Benefits. Discussions with representative managers and users or the Project Sponsor should reveal the tangible and intangible benefits of the project. The most effective format for presenting this analysis is often a "before" and "after" format that compares the current situation to the project expectations.
Step 3. Calculate Net Present Value for all Benefits. Information gathered in this step should help the customer representatives to understand how the expected benefits will be allocated throughout the organization over time, using the enterprise deployment map as a guide.
Step 4. Define Overall Costs. Customers need specific cost information in order to assess the dollar impact of the project. Cost estimates should address the following fundamental cost components:
• Hardware
• Networks
• RDBMS software
• Back-end tools
• Query/reporting tools
• Internal labor
• External labor
• Ongoing support
• Training
Step 5. Calculate Net Present Value for all Costs. Use either actual cost estimates or percentage-of-cost values (based on cost allocation assumptions) to calculate costs for each cost component, projected over the timeline of the enterprise deployment map. Actual cost estimates are more accurate than percentage-of-cost allocations, but much more time-consuming. The percentage-of-cost allocation process may be valuable for initial ROI snapshots until costs can be more clearly predicted.

Step 6. Assess Risk, Adjust Costs and Benefits Accordingly. Review potential risks to the project and make corresponding adjustments to the costs and/or benefits. Some of the major risks to consider are:
• Scope creep, which can be mitigated by thorough planning and a tight project scope
• Integration complexity, which can be reduced by standardizing on vendors with integrated product sets or open architectures
• An inappropriate architectural strategy
• Other miscellaneous risks from management or end users who may withhold project support; from the entanglements of internal politics; and from technologies that don't function as promised
Step 7. Determine Overall ROI. When all other portions of the business case are complete, calculate the project's "bottom line". Determining the overall ROI is simply a matter of subtracting net present value of total costs from net present value of (total incremental revenue plus cost savings). For more detail on these steps, refer to the Informatica White Paper: 7 Steps to Calculating Data Warehousing ROI.
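As a purely illustrative sketch of the Step 7 "bottom line", the short Python example below applies the subtraction described above to hypothetical figures; the cash flow amounts and the per-period rate are invented assumptions, not values from this Best Practice.

    def npv(rate, cash_flows):
        """Discount end-of-period cash flows back to present value."""
        return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1))

    benefits = [0, 40000, 60000, 60000]    # incremental revenue plus cost savings, per period
    costs = [100000, 10000, 10000, 10000]  # total costs, per period

    # Overall ROI: NPV of (incremental revenue + cost savings) minus NPV of total costs.
    overall_roi = npv(0.08, benefits) - npv(0.08, costs)
    print(f"Overall ROI in net present value terms: {overall_roi:,.2f}")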
Defining and Prioritizing Requirements
Challenge
Defining and prioritizing business and functional requirements. This is often accomplished through a combination of interviews and facilitated meetings (i.e., workshops) between the Project Sponsor and beneficiaries and the Project Manager and Business Analyst.
Description
The following three steps are key for successfully defining and prioritizing requirements:
Step 1: Discovery
During individual (or small group) interviews with high-level management, there is often a focus and clarity of vision that may be hindered in large meetings or unavailable from lower-level management. On the other hand, detailed review of existing reports and current analysis from the company's "information providers" can fill in helpful details. As part of the initial "discovery" process, Informatica generally recommends several interviews at the Project Sponsor and/or upper management level and a few with those acquainted with current reporting and analysis processes. A few peer group forums can also be valuable. However, this part of the process must be focused and brief or it can become unwieldy, as much time can be expended trying to coordinate calendars between worthy forum participants. Set a time period and target list of participants with the Project Sponsor, but avoid lengthening the process if some participants aren't available. Questioning during these sessions should include the following:
• What are the target business functions, roles, and responsibilities?
• What are the key relevant business strategies, decisions, and processes (in brief)?
• What information is important to drive, support, and measure success for those strategies/processes? What key metrics? What dimensions for those metrics?
• What current reporting and analysis is applicable? Who provides it? How is it presented? How is it used?
Step 2: Validation and Prioritization
The Business Analyst, with the help of the Project Architect, documents the findings of the discovery process. The resulting Business Requirements Specification includes a matrix linking the specific business requirements to their functional requirements.
At this time also, the Architect develops the Information Requirements Specification in order to clearly represent the structure of the information requirements. This document, based on the business requirements findings, will facilitate discussion of informational details and provide the starting point for the target model definition. The detailed business requirements and information requirements should be reviewed with the project beneficiaries and prioritized based on business need and the stated project objectives and scope.
Step 3: The Incremental Roadmap
Concurrent with the validation of the business requirements, the Architect begins the Functional Requirements Specification, providing details on the technical requirements for the project. As general technical feasibility is compared to the prioritization from Step 2, the Project Manager, Business Analyst, and Architect develop consensus on a project "phasing" approach. Items of secondary priority and those with poor near-term feasibility are relegated to subsequent phases of the project. Thus, they develop a phased, or incremental, "roadmap" for the project (Project Roadmap). This is presented to the Project Sponsor for approval and becomes the first "Increment" or starting point for the Project Plan.
Developing a WBS
Challenge
Developing a comprehensive work breakdown structure that clearly depicts all of the tasks and subtasks required to complete the project. Because project time and resource estimates are typically based on the Work Breakdown Structure (WBS), it is critical to develop a thorough, accurate WBS.
Description
A WBS is a tool for identifying and organizing the tasks that need to be completed in a project. The WBS serves as a starting point for both the project estimate and the project plan. One challenge in developing a good WBS is striking the correct balance between enough detail and too much detail. The WBS shouldn't be a 'grocery list' of every minor detail in the project, but it does need to break the tasks down to a manageable level of detail. One general guideline is to keep task detail to a duration of at least a day.

It is also important to remember that the WBS is not necessarily a sequential document. Tasks in the hierarchy are often completed in parallel. At this stage of project planning, the goal is to list every task that must be completed; it is not necessary to determine the critical path for completing these tasks. For example, we may have multiple subtasks under a task (e.g., 4.3.1 through 4.3.7 under task 4.3). So, although subtasks 4.3.1 through 4.3.4 may have sequential requirements that force us to complete them in order, subtasks 4.3.5 through 4.3.7 can - and should - be completed in parallel if they do not have sequential requirements. However, it is important to remember that a task is not complete until all of its corresponding subtasks are completed, whether sequentially or in parallel. For example, the BUILD phase is not complete until tasks 4.1 through 4.7 are complete, but some work can (and should) begin for the DEPLOY phase long before the BUILD phase is complete.

The sample Project Plan provides a starting point for further development of the project WBS. It is a Microsoft Project file that has been "pre-loaded" with the Phases, Tasks, and Subtasks that make up the Informatica Methodology. The Project Manager can use this WBS as a starting point, but should review it carefully to ensure that it corresponds to the specific development effort, removing any steps that aren't relevant or adding steps as necessary. Many projects will require the addition of detailed steps to accurately represent the development effort. If the Project Manager chooses not to use Microsoft Project, an Excel version of the Work Breakdown Structure is available. The phases, tasks, and subtasks can be exported from Excel into many other project management tools, simplifying the effort to develop the WBS.

After the WBS has been loaded into the selected project management tool and refined for the specific project needs, the Project Manager can begin to estimate the level of effort involved in completing each of the steps. When the estimate is complete, individual resources can be assigned and scheduled. The end result is the Project Plan. Refer to Developing and Maintaining the Project Plan for further information about the project plan.
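As a purely illustrative aside, the hierarchical structure described above (tasks rolled up from subtasks) can be sketched in a few lines of Python. The task IDs echo the 4.3.x example in the text; the names and day estimates are invented assumptions, not part of the methodology.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Task:
        wbs_id: str                 # e.g. "4.3.1"
        name: str
        days: float = 0.0           # effort estimate for leaf tasks
        subtasks: List["Task"] = field(default_factory=list)

        def total_days(self) -> float:
            # A task is not complete (or fully estimated) apart from its subtasks.
            return self.days + sum(t.total_days() for t in self.subtasks)

    build = Task("4", "BUILD", subtasks=[
        Task("4.3", "Develop data integration processes", subtasks=[
            Task("4.3.1", "Design mapping", days=2),
            Task("4.3.2", "Build mapping", days=3),
            Task("4.3.5", "Unit test", days=2),   # could run in parallel with other 4.3.x subtasks
        ]),
    ])
    print(build.total_days())   # -> 7.0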
Developing and Maintaining the Project Plan
Challenge
Developing the first pass of a project plan that incorporates all of the necessary components but is sufficiently flexible to accept the inevitable changes.
Description
Use the following steps as a guide for developing the initial project plan:
• Define the project's major milestones based on the Project Scope.
• Break the milestones down into major tasks and activities. The Project Plan should be helpful as a starting point or for recommending tasks for inclusion.
• Continue the detail breakdown, if possible, to a level at which tasks are of about one to three days' duration. This level provides satisfactory detail to facilitate estimation and tracking. If the detail tasks are too broad in scope, estimates are much less likely to be accurate.
• Confer with technical personnel to review the task definitions and effort estimates (or even to help define them, if applicable).
• Establish the dependencies among tasks, where one task cannot be started until another is completed (or must start or complete concurrently with another).
• Define the resources based on the role definitions and the estimated number of resources needed for each role.
• Assign resources to each task. If a resource will only be part-time on a task, indicate this in the plan.
At this point, especially when using Microsoft Project, it is advisable to create dependencies (i.e., predecessor relationships) between tasks assigned to the same resource in order to indicate the sequence of that person's activities.

The initial definition of tasks and effort and the resulting schedule should be an exercise in pragmatic feasibility, unfettered by concerns about ideal completion dates. In other words, be as realistic as possible in your initial estimations, even if the resulting schedule is likely to be a hard sell to company management. This initial schedule becomes a starting point. Expect to review and rework it, perhaps several times. Look for opportunities for parallel activities, perhaps adding resources, if necessary, to improve the schedule.
When a satisfactory initial plan is complete, review it with the Project Sponsor and discuss the assumptions, dependencies, assignments, milestone dates, and such. Expect to modify the plan as a result of this review.
Reviewing and Revising the Project Plan
Once the Project Sponsor and company managers agree to the initial plan, it becomes the basis for assigning tasks to individuals on the project team and for setting expectations regarding delivery dates. The planning activity then shifts to tracking tasks against the schedule and updating the plan based on status and changes to assumptions.

One approach is to establish a baseline schedule (and budget, if applicable) and then track changes against it. With Microsoft Project, this involves creating a "Baseline" that remains static as changes are applied to the schedule. If company and project management do not require tracking against a baseline, simply maintain the plan through updates without a baseline.

Regular status reporting should include any changes to the schedule, beginning with team members' notification that dates for task completions are likely to change or have already been exceeded. These status report updates should trigger a regular plan update so that project management can track the effect on the overall schedule and budget. Be sure to evaluate any changes to scope (see 1.2.4 Manage Project and Scope Change Assessment), or changes in priority or approach, as they arise to determine if they impact the plan. It may be necessary to modify the plan if changes in scope or priority require rearranging task assignments or delivery sequences, or if they add new tasks or postpone existing ones.
Managing the Project Lifecycle
Challenge
Providing a structure for ongoing management throughout the project lifecycle.
Description
It is important to remember that the quality of a project can be directly correlated to the amount of review that occurs during its lifecycle.
Project Status and Plan Reviews
In addition to the initial project plan review with the Project Sponsor, schedule regular status meetings with the sponsor and project team to review status, issues, scope changes, and schedule updates. Gather status, issue, and schedule update information from the team one day before the status meeting in order to compile and distribute the Status Report.
Project Content Reviews
The Project Manager should coordinate, if not facilitate, reviews of requirements, plans, and deliverables with company management, including business requirements reviews with business personnel and technical reviews with project technical personnel. Set a process in place beforehand to ensure that appropriate personnel are invited, that any relevant documents are distributed at least 24 hours in advance, and that reviews focus on questions and issues (rather than a laborious "reading of the code"). Reviews may include:
• Project scope and business case review
• Business requirements review
• Source analysis and business rules reviews
• Data architecture review
• Technical infrastructure review (hardware and software capacity and configuration planning)
• Data integration logic review (source to target mappings, cleansing and transformation logic, etc.)
• Source extraction process review
• Operations review (operations and maintenance of load sessions, etc.)
• Reviews of operations plan, QA plan, deployment and support plan
Change Management
Directly address and evaluate any changes to the planned project activities, priorities, or staffing as they arise, or are proposed, in terms of their impact on the project plan.
• Use the Scope Change Assessment to record the background problem or requirement and the recommended resolution that constitutes the potential scope change.
• Review each potential change with the technical team to assess its impact on the project, evaluating the effect in terms of schedule, budget, staffing requirements, and so forth.
• Present the Scope Change Assessment to the Project Sponsor for acceptance (with formal sign-off, if applicable). Discuss the assumptions involved in the impact estimate and any potential risks to the project.
The Project Manager should institute this type of change management process in response to any issue or request that appears to add or alter expected activities and has the potential to affect the plan. Even if there is no evident effect on the schedule, it is important to document these changes because they may affect project direction and it may become necessary, later in the project cycle, to justify these changes to management.
Issues Management
Any questions, problems, or issues that arise and are not immediately resolved should be tracked to ensure that someone is accountable for resolving them and that their effect remains visible. Use the Issues Tracking template, or something similar, to track issues, their owner, and dates of entry and resolution, as well as the details of the issue and of its solution. Significant or "showstopper" issues should also be mentioned on the status report.
Project Acceptance and Close
Rather than simply walking away from a project when it seems complete, there should be an explicit close procedure. For most projects this involves a meeting where the Project Sponsor and/or department managers acknowledge completion or sign a statement of satisfactory completion.
• Even for relatively short projects, use the Project Close Report to finalize the project with a final status report detailing:
  o What was accomplished
  o Any justification for tasks expected but not completed
  o Recommendations
• Prepare for the close by considering what the project team has learned about the environments, procedures, data integration design, data architecture, and other project plans.
• Formulate the recommendations based on issues or problems that need to be addressed. Succinctly describe each problem or recommendation and, if applicable, briefly describe a recommended approach.
Configuring Security
Challenge
Configuring a PowerCenter security scheme to prevent unauthorized access to mappings, folders, sessions, batches, repositories, and data, in order to ensure system integrity and data confidentiality.

Description
Configuring security is one of the most important components of building a Data Warehouse. Security should be implemented with the goals of easy maintenance and scalability. Determining an optimal security configuration for a PowerCenter environment requires a thorough understanding of business requirements, data content, and end users' access requirements. Knowledge of PowerCenter's security facilities is also a prerequisite to security design. Before implementing security measures, it is imperative to answer the following basic questions:
• Who needs access to the Repository? What do they need the ability to do?
• Is a central administrator required? What permissions are appropriate for him/her?
• Is the central administrator responsible for designing and configuring the repository security? If not, has a security administrator been identified?
• What levels of permissions are appropriate for the developers? Do they need access to all the folders?
• Who needs to start sessions manually?
• Who is allowed to start and stop the Informatica Server?
• How will PowerCenter security be administered? Will it be the same as the database security scheme?
• Do we need to restrict access to Global Objects?
The following pages offer some answers to these questions and some suggestions for assigning user groups and access privileges. In most implementations, the administrator takes care of maintaining the Repository. There should be a limit to the number of administrator accounts for PowerCenter. While this is less important in a development/unit test environment, it is critical for protecting the production environment.
PowerCenter's security approach is similar to database security environments. All security management is performed through the Repository Manager. The internal security enables multi-user development through management of users, groups, privileges, and folders. Every user ID must be assigned to one or more groups. These are PowerCenter users, not database users; all password information is encrypted and stored in the repository.

The Repository may be connected to sources/targets that contain sensitive information. The Server Manager provides another level of security for this purpose. It is used to assign read, write, and execute permissions for Global Objects. Global Objects include Database Connections, FTP Connections, and External Loader Connections. Global Object permissions, in addition to privileges and permissions assigned using the Repository Manager, affect the ability to perform tasks in the Server Manager, the Repository Manager, and the command line program, pmcmd.

The Server Manager also offers an enhanced security option that allows you to specify a default set of privileges that applies restricted access controls for Global Objects. Only the owner of the Object or a Super User can manage permissions for a Global Object. Choosing the Enable Security option activates the following default Global Object permissions:
• Owner: Read, Write, and Execute
• Owner Group: Read and Execute
• World: No Permissions
Enabling Enhanced Security does not lock the restricted access settings for Global Objects. This means that the permissions for Global Objects can be changed after enabling Enhanced Security. Although privileges can be assigned to users or groups, privileges are commonly assigned to groups, with users then added to each group. This approach is simpler than assigning privileges on a user-by-user basis, since there are generally few groups and many users, and any user can belong to more than one group. The following table summarizes some possible privileges that may be granted: Privilege Session Operator Use Designer Browse Repository Create Sessions and Batches Administer Repository Administer Server
INFORMATICA CONFIDENTIAL
Description Can run any sessions or batches, regardless of folder level permissions. Can edit metadata in the Designer. Can browse repository contents through the Repository Manager. Can create, modify, and delete sessions and batches in Server Manager. Can create and modify folders. Can configure connections on the server and
BEST PRACTICES
PAGE BP-211
Privilege
Super User
Description stop the server through the Server Manager or the command-line interface. Can perform all tasks with the repository and the server
The following suggests a common set of initial groups and the privileges that may be associated with them:
• Developer: PowerCenter developers who are creating the mappings. Privileges: Session Operator, Use Designer, Browse Repository, Create Sessions and Batches.
• End User: Business end users who run reports off of the data warehouse. Privileges: Browse Repository.
• Operator: Operations department that runs and maintains the environment in production. Privileges: Session Operator, Administer Server, Browse Repository.
• Administrator: Data warehouse administrators who maintain the entire warehouse environment. Privileges: Super User.
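As a purely illustrative summary of the groups and privileges just listed, the small Python sketch below models the mapping; the group names and privilege strings simply restate the list above and do not represent a PowerCenter API.

    # Illustrative model of the suggested group-to-privilege assignments (not a PowerCenter API).
    GROUP_PRIVILEGES = {
        "Developer": {"Session Operator", "Use Designer", "Browse Repository",
                      "Create Sessions and Batches"},
        "End User": {"Browse Repository"},
        "Operator": {"Session Operator", "Administer Server", "Browse Repository"},
        "Administrator": {"Super User"},
    }

    def user_privileges(user_groups):
        """Union of privileges across a user's groups (a user can belong to more than one group)."""
        privileges = set()
        for group in user_groups:
            privileges |= GROUP_PRIVILEGES.get(group, set())
        return privileges

    def can(user_groups, privilege):
        privileges = user_privileges(user_groups)
        return "Super User" in privileges or privilege in privileges

    print(can(["Developer"], "Use Designer"))       # -> True
    print(can(["End User"], "Administer Server"))   # -> False
    print(can(["Administrator"], "Use Designer"))   # -> True (Super User can perform all tasks)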
Users with Administer Repository or Super User privileges may edit folder properties, which must identify a folder owner and group, and also determine whether the folder is shareable, meaning that shortcuts can be created pointing to objects within the folder, thereby enabling object reuse. After a folder is flagged as shareable, this property cannot be changed. For each folder, privileges are set for the owner, group, and repository (i.e., any user). The three folder level privileges are Read, Write, and Execute:
• Read: Can read, copy, and create shortcuts to repository objects in the folder. Users without read permissions cannot see the folder.
• Write: Can edit metadata in the folder.
• Execute: Can run sessions using mappings in the folder.
Allowing shortcuts enables other folders in the same repository to share objects such as source/target tables, transformations, and mappings. A recommended practice is to create only one shareable folder per repository, and to place all reusable objects within that shareable folder. When another folder creates a shortcut to an object in the shareable folder, the shortcut inherits the properties of the original object, so changes to common logic or elements can be managed more efficiently.
Users who own a folder or have Administer Repository or Super User privileges can edit folder properties to change the owner, the group assigned to the folder, the three levels of privileges, and the Allow Shortcuts option. Note that users with the Session Operator privilege can run sessions or batches, regardless of folder level permissions. A folder owner should be allowed all three folder level permissions. However, members of the folder's group may be granted only Read/Write, or possibly all three levels, depending on the desired level of security. Repository (i.e., any user) permissions should be restricted to Read only, if granted at all.

You might also wish to add a group specific to each application if there are many application development tasks being performed within the same repository. For example, if you have two projects, ABC and XYZ, it may be appropriate to create a group for ABC developers and another for XYZ developers. This enables you to assign folder level security to the group and keep the two projects from accidentally working in folders that belong to the other project team. In this example, you may assign group level security for all of the ABC folders to the ABC group. In this way, only members of the ABC group can make changes to those folders.

Tight security is recommended in the production environment to ensure that developers and other users do not accidentally make changes to production. Only a few people should have Administer Repository or Super User privileges, while everyone else should have the appropriate privileges within the folders they use.

Informatica recommends creating individual user IDs for all developers and administrators on the system rather than using a single shared ID. One of the most important reasons for this is session level locking: when a session is in use by a developer, it cannot be opened and modified by anyone but that user. Locks thus prevent repository corruption by preventing simultaneous uncoordinated updates. Also, if multiple individuals share a common login ID, it is difficult to identify which developer is making (or has made) changes to an object.