Pipeline Partitioning Overview

You create a session for each mapping you want the Informatica Server to run. Every mapping contains one or more source pipelines. A source pipeline consists of a source qualifier and all the transformations and targets that receive data from that source qualifier. If you use PowerCenter, you can specify partitioning information for each source pipeline in a mapping. If you use PowerMart, you must accept the default partitioning information. The partitioning information for a pipeline controls the following factors:
• The number of reader, transformation, and writer threads that the master thread creates for the pipeline. For more information, see Understanding Processing Threads.
• How the Informatica Server reads data from the source, including the number of connections to the source.
• How the Informatica Server distributes rows of data to each transformation as it processes the pipeline.
• How the Informatica Server writes data to the target, including the number of connections to each target in the pipeline.
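As a conceptual illustration of the reader/transformation/writer threading described above, the sketch below wires a reader thread to a writer thread through a queue, the way stages of a pipeline hand rows to one another. This models the idea only; it is not Informatica's actual implementation.

```python
import queue
import threading

def run_pipeline(rows):
    """Pass rows from a reader thread to a writer thread through a queue."""
    buf = queue.Queue()
    out = []

    def reader():
        # Reader stage: extract rows and hand them downstream.
        for row in rows:
            buf.put(row)
        buf.put(None)  # sentinel: no more data

    def writer():
        # Writer stage: consume rows until the sentinel arrives.
        while True:
            row = buf.get()
            if row is None:
                break
            out.append(row)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

print(run_pipeline([1, 2, 3]))  # [1, 2, 3]
```

Because each stage runs in its own thread, the reader can fetch the next row while the writer is still loading the previous one, which is the source of the performance gains described in this chapter.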
You can specify partitioning information for a pipeline by setting the following attributes:
• Location of partition points. Partition points mark the thread boundaries in a pipeline and divide the pipeline into stages. The Informatica Server sets partition points at several transformations in a pipeline by default. If you use PowerCenter, you can define other partition points. When you add partition points, you increase the number of transformation threads, which can improve session performance. The Informatica Server can redistribute rows of data at partition points, which can also improve session performance. For more information on partition points, see Partition Points.
• Number of partitions. A partition is a pipeline stage that executes in a single thread. If you use PowerCenter, you can set the number of partitions at any partition point. If you use PowerMart, the Informatica Server defines one partition for the pipeline. When you add partitions, you increase the number of processing threads, which can improve session performance. For more information, see Number of Partitions.
• Partition types. The Informatica Server specifies a default partition type at each partition point. If you use PowerCenter, you can change the partition type. The partition type controls how the Informatica Server redistributes data among partitions at partition points. For more information, see Partition Types.
Partition Points

By default, the Informatica Server sets partition points at various transformations in the pipeline. Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a pipeline between any two partition points. When you set a partition point at a transformation, the new pipeline stage includes that transformation.

Table 10-1 lists the partition points that the Workflow Manager creates by default:

Table 10-1. Default Partition Points

Transformation (Partition Point)              | Default Partition Type | Description
Source Qualifier or Normalizer transformation | Pass-through           | Controls how the Informatica Server reads data from the source and passes data into the source qualifier.
Rank and unsorted Aggregator transformations  | Hash auto-keys         | Ensures that the Informatica Server groups rows properly before it sends them to the transformation.
Target instances                              | Pass-through           | Controls how the target instances pass data to the targets.
If you use PowerCenter, you can add partition points at other transformations and delete some partition points. If you use PowerMart, you cannot add or delete partition points.
Figure 10-1 shows the default partition points and pipeline stages for a simple mapping with one source pipeline:
The mapping in Figure 10-1 contains four stages. The partition point at the source qualifier marks the boundary between the first (reader) and second (transformation) stages. The partition point at the Aggregator transformation marks the boundary between the second and third (transformation) stages. The partition point at the target instance marks the boundary between the third (transformation) and fourth (writer) stages. When you add a partition point, you increase the number of pipeline stages by one. Similarly, when you delete a partition point, you reduce the number of stages by one. For more information, see Understanding Processing Threads.

Besides marking stage boundaries, partition points also mark the points in the pipeline where the Informatica Server can redistribute data across partitions. For example, if you place a partition point at a Filter transformation and define multiple partitions, the Informatica Server can redistribute rows of data among the partitions before the Filter transformation processes the data. The partition type you set at this partition point controls the way in which the Informatica Server passes rows of data to each partition. For more information, see Partition Types. For more information on adding and deleting partition points, see Adding and Deleting Partition Points.

Number of Partitions

A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. By default, the Informatica Server defines a single partition in the source pipeline. If you use PowerCenter, you can increase the number of partitions. This increases the number of processing threads, which can improve session performance.

For example, suppose you need to use the mapping in Figure 10-1 to extract data from three flat files of various sizes. To do this, you define three partitions at the source qualifier to read the data simultaneously. When you do this, the Workflow Manager defines three partitions in the pipeline.
Figure 10-2 shows the threads that the master thread creates for this mapping:
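The thread counts can be sketched with simple arithmetic. This assumes, based on the text above, that each stage runs one thread per partition and that each partition point added increases the stage count by one; the function is illustrative only.

```python
def thread_count(partition_points, partitions):
    """Estimate total processing threads for a pipeline.

    Assumption drawn from the text: stages = partition points + 1
    (each partition point marks a stage boundary, plus the initial
    reader stage), and each stage runs one thread per partition.
    """
    stages = partition_points + 1
    return stages * partitions

# The Figure 10-1 mapping has 3 default partition points (source
# qualifier, Aggregator, target), so 4 stages. With 3 partitions,
# the master thread creates 3 reader, 6 transformation, and 3 writer
# threads: 12 in total.
print(thread_count(3, 3))  # 12
```

Adding one more partition point to the same three-partition session would add a stage and therefore three more threads, which is why partition points should be added deliberately.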
By default, the Informatica Server sets the number of partitions to one. If you use PowerMart, you cannot change the number of partitions. If you use PowerCenter, you can generally define up to 16 partitions at any partition point. However, there are situations in which you can define only one partition in the pipeline. For more information, see Restrictions on the Number of Partitions.

Note: Increasing the number of partitions or partition points increases the number of threads, and therefore also increases the load on the server machine. If the server machine contains ample CPU bandwidth, processing rows of data in a session concurrently can increase session performance. However, if you create a large number of partitions or partition points in a session that processes large amounts of data, you can overload the system.

For more information on adding and deleting partitions, see Adding and Deleting Partitions.

Partition Types

When you configure the partitioning information for a pipeline, you must specify a partition type at each partition point in the pipeline. The partition type determines how the Informatica Server redistributes data among partitions at partition points. The Workflow Manager allows you to specify the following partition types:
• Round-robin partitioning. The Informatica Server distributes data evenly among all partitions. Use round-robin partitioning where you want each partition to process approximately the same number of rows. For more information, see Round-Robin Partitioning.
• Hash partitioning. The Informatica Server applies a hash function to a partition key to group data among partitions. If you select hash auto-keys, the Informatica Server uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify the ports that form the partition key. Use hash partitioning where you want to ensure that the Informatica Server processes groups of rows with the same partition key in the same partition. For more information, see Hash Partitioning.
• Key range partitioning. You specify one or more ports to form a compound partition key. The Informatica Server passes data to each partition depending on the ranges you specify for each port. Use key range partitioning where the sources or targets in the pipeline are partitioned by key range. For more information, see Key Range Partitioning.
• Pass-through partitioning. The Informatica Server passes all rows at one partition point to the next partition point without redistributing them. Choose pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions. For more information, see Pass-through Partitioning.
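The four partition types above can be sketched as row-routing functions: given a row, each returns the index of the partition that should process it. The function names and hashing below are illustrative, not Informatica's actual logic.

```python
def round_robin(row_index, n_partitions):
    # Distribute rows evenly: each partition receives roughly the
    # same number of rows, regardless of their contents.
    return row_index % n_partitions

def hash_partition(key_values, n_partitions):
    # Apply a hash function to the partition key, so rows with the
    # same key always land in the same partition.
    return hash(tuple(key_values)) % n_partitions

def key_range(value, boundaries):
    # boundaries such as [100, 200] define three ranges:
    # value < 100 -> partition 0, 100 <= value < 200 -> partition 1,
    # value >= 200 -> partition 2.
    for i, bound in enumerate(boundaries):
        if value < bound:
            return i
    return len(boundaries)

def pass_through(current_partition):
    # Rows stay in the partition they already occupy; no redistribution.
    return current_partition

print(round_robin(5, 3))           # 2
print(key_range(150, [100, 200]))  # 1
```

The hash function guarantees only that equal keys map to the same partition within a run, which is exactly the property hash partitioning needs for grouping transformations such as Rank and Aggregator.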
You can specify different partition types at different points in the pipeline.
The mapping in Figure 10-3 reads data about items and calculates average wholesale costs and prices. The mapping must read item information from three flat files of various sizes, and then filter out discontinued items. It sorts the active items by description, calculates the average prices and wholesale costs, and writes the results to a relational database in which the target tables are partitioned by key range. When you use this mapping in a session, you can increase session performance by specifying different partition types at the following partition points in the pipeline:
• Source qualifier. To read data from the three flat files concurrently, you must specify three partitions at the source qualifier. Accept the default partition type, pass-through.
• Filter transformation. Since the source files vary in size, each partition processes a different amount of data. Set a partition point at the Filter transformation, and choose round-robin partitioning to balance the load going into the Filter transformation.
• Sorter transformation. To eliminate overlapping groups in the Sorter and Aggregator transformations, use hash auto-keys partitioning at the Sorter transformation. This causes the Informatica Server to group all items with the same description into the same partition before the Sorter and Aggregator transformations process the rows. You can delete the default partition point at the Aggregator transformation.
• Target. Since the target tables are partitioned by key range, specify key range partitioning at the target to optimize writing data to the target.