SQL Server Integration Services an Introduction - Part 1

Introduction

SQL Server Integration Services (SSIS) is a platform for building high-performance data integration and workflow solutions. It allows you to create packages (SSIS packages) made up of tasks that can move data from source to destination and, if necessary, alter it on the way. SSIS is essentially an ETL (Extraction, Transformation, and Load) tool whose main purpose is to extract, transform, and load data, but it can be used for several other purposes as well, for example to automate maintenance of SQL Server databases or to update multidimensional cube data.

SSIS is a component of SQL Server 2005/2008 and is the successor of DTS (Data Transformation Services), which shipped with SQL Server 7.0/2000. Though from an end-user perspective DTS and SSIS look similar to some extent, they are quite different internally. SSIS has been completely rewritten from scratch and hence overcomes several limitations of DTS. The list of differences between DTS and SSIS is quite large, but one thing to note here is that the internal architecture of SSIS is completely different from DTS: it segregates the Data Flow Engine from the Control Flow Engine (the SSIS Runtime Engine), which improves performance several-fold. The architecture and internals of SSIS are not in the scope of this introductory article; I will cover them in my next article series, "SQL Server Integration Services - An Inside View".

Note: In this article, "SSIS 2008" refers to the SSIS version that comes with SQL Server 2008, whereas "SSIS 2005" refers to the SSIS version that comes with SQL Server 2005.

Creating an SSIS Package

There are three different ways to create SSIS packages:
• The Import and Export Wizard – Though this is one of the simplest ways to create an SSIS package, it has very limited capability: you cannot define any kind of transformation (though with SSIS 2008 you can choose to include a Data Conversion Transformation if there is a data type mismatch between source and destination), so it is used mainly for simple data transfers from source to destination. For more details, refer to the "Import and Export Wizard" section later in this article.

• The SSIS Designer – The SSIS Designer is hosted inside the Business Intelligence Development Studio (BIDS) as part of an Integration Services project. It is a graphical tool that you can use to create and maintain Integration Services packages. It has a toolbox that contains the various items needed for the Control Flow and the Data Flow Task, as well as tasks needed for maintenance plans. The number of tasks in SSIS is much larger than what was available in DTS. For more details, refer to the "SSIS Designer" section later in this article.

• SSIS API Programming – SSIS provides an API object model, which you can use in your programming language of choice to create SSIS packages programmatically. For more details, refer to the "SSIS API Programming" section later in this article.

SSIS Components

As we learnt in the introduction, SSIS allows you to create packages made up of tasks that can move data from source to destination and, if necessary, alter it on the way. Within an SSIS package you define the workflow of the package, and the SSIS runtime engine ensures that the tasks inside the package are executed in the order defined; in other words, it maintains the workflow of tasks inside the package. So it's now time to look at the different tasks/components/executables of a package.

Package

A package is itself a collection of tasks that are executed in an orderly fashion by the SSIS runtime engine. It is an XML file, which can be saved on SQL Server or on the file system. A package can be executed by a SQL Server Agent job, by the DTEXEC command (a command-line utility bundled with SSIS to execute a package; there is a similar utility, DTEXECUI, which has a GUI), from the BIDS environment, or by one package calling another package (which achieves a modular approach). You can use the DTUTIL utility to move a package from the file system to SQL Server or vice versa (see the example commands after the container list below), or else you can use the undocumented sp_dts_getpackage/sp_ssis_getpackage and sp_dts_putpackage/sp_ssis_putpackage stored procedures, which reside in the msdb system database.

Control Flow

The control flow handles the main workflow of the package and determines the processing sequence within the package. It consists of containers, different kinds of workflow tasks, and precedence constraints.

Control Flow Tasks

A task is an individual unit of work. SSIS provides several inbuilt control flow tasks that perform a variety of workflow actions. They provide functionality to the package in much the same way that a method does in a programming language. All the inbuilt tasks are operational tasks except the Data Flow Task. Though there are several dozen inbuilt tasks to use, if required you can go beyond them and write your own custom task using VB/C# etc.

Containers

A container groups a variety of package components (including other containers), affecting their scope, sequence of execution, and mutual interaction. Containers are used to create logical groups of tasks. There are basically four types of containers in SSIS, as given below:
• Task Host Containers – The default container; every task falls into it.

• Sequence Containers – Define a subset of the overall package control flow.

• For Loop Containers – Define a repeating control flow in a package.

• ForEach Loop Containers – Loop over a collection, enumerating through it; for example, you would use one when you have to process each record of a record-set.
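
As referenced in the Package section above, here is a brief sketch of how DTEXEC and DTUTIL are typically invoked from a command prompt; the package name and paths are made up for illustration:

    REM Run a package stored on the file system
    dtexec /FILE "C:\Packages\MyPackage.dtsx"

    REM Run a package named MyPackage stored in msdb on the local server
    dtexec /SQL "MyPackage" /SERVER "."

    REM Copy a package from the file system into msdb (as MyPackage)
    dtutil /FILE "C:\Packages\MyPackage.dtsx" /COPY SQL;MyPackage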

Precedence Constraints

Precedence constraints link the items in your package into a logical flow and specify the conditions upon which the items are executed. They define the ordinal relationships among the containers and tasks in the package and evaluate the conditions that determine the sequence in which they are processed; in other words, they control the order in which tasks execute. More specifically, they provide the transition from one task or container to another. The condition can be a Constraint, an Expression, or both. The constraint can be Success (green line), Failure (red line), or Completion (blue line). For example, in the package image below, Script Task 1 will be executed only if the Execute SQL Task completed successfully. Script Task 2 will be executed irrespective of whether the Execute SQL Task completed successfully or failed. Script Task 3 will be executed only if the Execute SQL Task failed.

Apart from the constraints discussed above, you can also define an expression as a condition on a precedence constraint; it is evaluated at runtime, and the transition is decided depending on its value. For example, in the image below, after Task A completes, Task B will be executed if X >= Z, or Task C will be executed if X < Z. You can also combine a constraint and an expression in a single condition with either an AND or an OR operator, as the sketch below illustrates.
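
Here is a minimal C# sketch of the same idea, using the Microsoft.SqlServer.Dts.Runtime API that Part 4 of this series introduces; the task moniker is the one used in Part 4, and the variables X and Z and their values are assumptions made up for illustration:

    using Microsoft.SqlServer.Dts.Runtime;

    class PrecedenceConstraintDemo
    {
        static void Main(string[] args)
        {
            Package pkg = new Package();

            //User variables referenced by the expression below
            pkg.Variables.Add("X", false, "User", 10);
            pkg.Variables.Add("Z", false, "User", 5);

            //Two placeholder tasks (same moniker as the Part 4 example)
            Executable taskA = pkg.Executables.Add("STOCK:ExecuteProcessTask");
            Executable taskB = pkg.Executables.Add("STOCK:ExecuteProcessTask");

            //Success constraint (the green line): taskB runs only if taskA succeeds
            PrecedenceConstraint pc = pkg.PrecedenceConstraints.Add(taskA, taskB);
            pc.Value = DTSExecResult.Success;

            //Combine the constraint with an expression (AND semantics):
            //taskB now runs only if taskA succeeded AND X >= Z at runtime
            pc.EvalOp = DTSPrecedenceEvalOp.ExpressionAndConstraint;
            pc.Expression = "@X >= @Z";
        }
    }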

Variables

The concept of a variable in SSIS is the same as in any other programming language. A variable provides temporary storage for a parameter whose value can change from one package execution to another, accommodating package reusability; in other words, variables are used to dynamically configure a package at runtime, for example when you want to execute the same T-SQL statement or script against a different set of connections. The scope of a variable depends on where it is defined: variables can be declared at the package level, container level, task level, event handler level, etc. In SSIS there are two types of variables: System (pre-defined) variables, whose values are set by SSIS (ErrorCode, ErrorDescription, MachineName, PackageName, StartTime, etc.) and cannot be changed, and User variables, which are created as needed at package development time and can be assigned values of the corresponding type.

Note: One exception applies here: there is a system variable called "Propagate" whose value you can change from its default TRUE to FALSE to stop event bubbling from a task to its parent and grandparent. Refer to the next article in this series, "SQL Server Integration Services - Features and Properties", which discusses this in more detail.

Connection Managers

A connection manager is a logical representation of a connection. SSIS provides different types of connection managers, which use different data providers and enable packages to connect to a variety of data sources and servers. A package can have multiple connection managers, and one connection manager can be used by multiple tasks in the package. A few examples of connection managers are:

• ADO Connection Manager – Connects to ActiveX Data Objects (ADO) objects.

• ADO.NET Connection Manager – Connects to a data source by using a .NET provider.

• OLE DB Connection Manager – Connects to a data source by using an OLE DB provider.

• Flat File Connection Manager – Connects to data in a single flat file.

• FTP Connection Manager – Connects to an FTP server.

By default, every task that uses a connection manager opens a connection during execution, performs its operation, and closes the connection before the package moves to the next task; in other words, each task gets its own physical connection. So consider a scenario where three tasks in a package use the same connection manager: at runtime, three connections would be opened and closed at the source. If you don't want this behavior and instead want all three tasks to execute over a single connection (only one connection open to the source, irrespective of how many tasks use the connection manager), one property of the connection manager comes to your rescue: RetainSameConnection. Setting the RetainSameConnection property of the OLE DB Connection Manager to TRUE enables you to run multiple tasks over a single connection.
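
You would normally set this property in the BIDS Properties window, but as a hedged illustration using the runtime API from Part 4, it can also be set programmatically; the connection string below is a made-up example:

    using Microsoft.SqlServer.Dts.Runtime;

    class RetainSameConnectionDemo
    {
        static void Main(string[] args)
        {
            Package pkg = new Package();

            //"OLEDB" is the creation name for an OLE DB connection manager
            ConnectionManager cm = pkg.Connections.Add("OLEDB");
            cm.Name = "MyOleDbConnection";
            cm.ConnectionString =
                "Provider=SQLNCLI10;Data Source=.;Initial Catalog=AdventureWorks;Integrated Security=SSPI;";

            //Keep one physical connection open for every task that uses this manager
            cm.Properties["RetainSameConnection"].SetValue(cm, true);
        }
    }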

In the next article in this series we will look at Data Flow in SQL Server Integration Services.

SQL Server Integration Services an Introduction - Part 2

Data Flow

The Data Flow Task (DFT), using the SSIS pipeline engine, manages the flow of data from the data source adapters to the data destination adapters and lets the user perform the necessary transformations to clean and modify data along the way.

Note: The Data Flow Task is not a separate component outside the Control Flow; rather, it is placed inside the Control Flow. I have given it a separate heading/section only to give it more emphasis, as it is the most important task in SSIS.

A DFT can include multiple data flows. If a task copies several sets of data, and if the order in which the data is copied is not significant, it can be more convenient to include multiple data flows in a single DFT. In the first image below, the DFT has one data flow with two transformations along the way before writing to the destination, whereas in the second image the DFT has two data flows: the first has two transformations and the second has three.
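
As a short aside for readers who jump ahead to the API examples in Part 4, a Data Flow Task can also be added to a package programmatically; a minimal sketch, assuming the standard SSIS runtime and pipeline wrapper assemblies are referenced:

    using Microsoft.SqlServer.Dts.Runtime;
    using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

    class DataFlowTaskDemo
    {
        static void Main(string[] args)
        {
            Package pkg = new Package();

            //"STOCK:PipelineTask" is the creation moniker of the Data Flow Task
            Executable exec = pkg.Executables.Add("STOCK:PipelineTask");
            TaskHost th = exec as TaskHost;
            th.Name = "My Data Flow Task";

            //MainPipe exposes the pipeline engine: sources, transformations,
            //and destinations are added to the data flow through this interface
            MainPipe pipeline = th.InnerObject as MainPipe;
        }
    }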

Transformations

A transformation changes data into a desired format. It performs modifications to data through a variety of operations, such as aggregation (e.g. averages or sums), merging (of multiple input data sets), distribution (to different outputs), data type conversion, or reference table lookups (using exact or fuzzy comparisons). Below are some of the inbuilt transformations:
• Derived Column Transformation – Creates new column values by applying expressions to transformation input columns. The result can be added as a new column or inserted into an existing column as a replacement value.

• Lookup Transformation – Performs lookups by joining data in input columns with columns in a reference dataset. It is usually used in a scenario where you have a subset of a master data set and you want to pull the related transaction records.

• Union All Transformation – Combines multiple inputs and returns the UNION ALL of these multiple result-sets.

• Merge Transformation – Combines two sorted datasets into a single sorted dataset and is similar to the Union All transformation. Use the Union All transformation instead of the Merge transformation if the inputs are not sorted, if the combined output does not need to be sorted, or if the transformation has more than two inputs.

• Merge Join Transformation – Provides an output that is generated by joining two sorted datasets using a FULL, LEFT, or INNER join.

• Conditional Split Transformation – Routes data rows to different outputs depending on the content of the data. The implementation of the Conditional Split transformation is similar to a CASE decision structure in a programming language: the transformation evaluates expressions and, based on the results, directs the data row to the specified output. It also provides a default output, so that if a row matches no expression it is directed to the default output.

• Multicast Transformation – Distributes its input to one or more outputs. This transformation is similar to the Conditional Split transformation; both direct an input to multiple outputs. The difference between the two is that the Multicast transformation directs every row to every output, whereas the Conditional Split directs a row to a single output.

There are several inbuilt transformations available inside the SSIS Designer, though if required you can go beyond them and write your own custom transformations.

Data Paths

A data path connects data flow components inside a DFT. Though it looks like a precedence constraint of the Control Flow, it is not the same: a data path shows the flow of data from one component of the DFT to another, whereas a precedence constraint shows the control flow, or ordinal relationship, between control flow tasks. A data path contains the metadata of the data flowing through the path, for example the columns and their names, types, sizes, etc. While debugging, you can attach a data viewer to a data path to see the data flowing through it.

Note: The data viewer shows the data of one buffer at a time; you can click the next button to see the data from the next buffer. (I will discuss SSIS buffer management in my next article, "SQL Server Integration Services - An Inside View".)

Data Source Adapters

Data source adapters, or simply source adapters, facilitate the retrieval of data from various data sources. They use connection managers, which in turn use different data providers, to connect to heterogeneous sources, for example flat files, OLE DB sources, .NET Framework data providers, etc.

Data Destination Adapters

Data destination adapters, or simply destination adapters, load the output of the data flow into target stores such as flat files, databases, or in-memory ADODB record-sets. Like source adapters, they use connection managers, which in turn use different data providers, to connect to heterogeneous destinations. Now let's discuss some of the destination properties/settings in detail; I am taking the OLE DB destination as the example, as it is one of the most used.
• Data Access Mode – Allows you to define the method used to upload data into the destination. The fast load option uses a BULK INSERT statement instead of the INSERT statement that is used when fast load is not specified.

• Keep Identity – If selected, the identity values from the source are preserved and uploaded as-is into the destination table; otherwise the destination table creates its own identity values if it has a column of identity type.

• Keep Nulls – If selected, the null values from the source are preserved and uploaded into the destination table; otherwise, if a column at the destination has a default constraint defined and a NULL value comes from the source for that column, the default value is inserted into the destination table.

• Table Lock – If selected, a TABLOCK is acquired on the table during the data upload. This is the recommended option if the table is not being used by any other application at the time of the upload, as it removes the overhead of lock escalation.

• Check Constraints – If selected, the pipeline engine checks the table's constraints against the incoming data and fails if the data violates them. The recommendation is to uncheck this setting if constraint checking is not required, as doing so improves performance.

• Rows per batch – A blank text box indicates the default value of -1, which means all incoming rows are treated as a single batch. You can specify a positive, nonzero integer to direct the pipeline engine to break the incoming rows into multiple chunks of N rows. In other words, it specifies the number of rows in a batch.

• Maximum insert commit size – Specifies the batch size that the OLE DB destination tries to commit during fast load operations; it effectively splits up the chunks of data as they are inserted into your destination. If you provide a value for this property, the destination commits rows in batches that are the smaller of (a) the Maximum insert commit size, or (b) the remaining rows in the buffer that is currently being processed.

• Note: It’s good practice to set the value for the above two settings, because having a large batch or leaving the default value will negatively affect memory performance specially the tempdb, so recommendation is to test your scenario and specify optimum values for these settings depending on your environment, load and pull. In the next article in this series we will look at the Import and Export Wizard in SQL Server Integration Services.

SQL Server Integration Services an Introduction - Part 3

This article is part 3 of a 4 part series that introduces SQL Server Integration Services (SSIS). This article shows how to use the Import and Export Wizard.

The Import and Export Wizard provides the simplest method of copying data between data sources and of constructing basic packages. But it has a major limitation: it does not allow transformations along the way (though with SSIS 2008 you can choose to include a Data Conversion Transformation if there is a data type mismatch between source and destination). You could pull data from the source to a staging server, do the transformations there, and then transfer the data from staging to the production server, but that is a lot of work and takes a lot of resources. Let's run through an example.

Launch the Import and Export Wizard
• On the Start menu, point to All Programs, point to Microsoft SQL Server 2005/2008 (depending on the SQL Server version you have installed), and then click Import and Export Data, or

• In SQL Server Management Studio, connect to the Database Engine server type, expand Databases, right-click a database, point to Tasks, and then click Import Data or Export Data. (If you click Import Data, the destination server details default to the server on which you are performing the operation; likewise, if you click Export Data, the source server details default to that server.) Or

• In a command prompt window, run DTSWizard.exe.

After the welcome screen, the next screen comes up, where you have to specify the source server name, the credentials to use, and the database name, as given below:

After clicking next, you will be prompted to enter the destination server name, the credentials to use, and the database name (if you want to create a new database, you can do so by clicking the New command button), as given below:

After clicking next, you will be prompted to choose whether you want data from tables or views, or whether you want to write a query to pull the data; the next screen will vary depending on your choice. After clicking next (I chose the first option, to pull data from tables or views), you will be given a list of all the available tables and views at the source, and you need to choose the ones you want to pull data from. The Edit Mappings button allows you to change the mapping of columns between source and destination, whereas the Preview button allows you to preview the top 100 records of the selected table or view.

After clicking next, you will be presented with two options: the first asks whether you want to run the package immediately, and the second asks whether you want to save the package for later use and, if so, whether to save it on the file system or in SQL Server. The next screen asks for the location or server details for saving the package, depending on your choice.

After clicking next and then the Finish button, the Import and Export Wizard starts transferring the data, and the status is shown as below; on completion you can click the Close button:

SSIS Designer

The SSIS Designer is a very rich graphical tool that you can use to create and maintain Integration Services packages. Using this designer you can construct the control flow and data flows in a package and add event handlers to the package and its objects. It also shows the execution progress at run time. It has four permanent tabs, and in addition one more tab pops up during execution to show the package's progress, as given below:

Control Flow Tab – You construct the control flow of a package on the design surface of the Control Flow tab. Drag items from the Toolbox to the design surface and connect them into a control flow by clicking the icon for an item and then dragging the arrow from one item to another.

Data Flow Tab – This is used if a package contains a Data Flow Task; you construct the data flows of a package on the design surface of the Data Flow tab. When you double-click a Data Flow Task in the Control Flow tab, the details of that Data Flow Task open on the Data Flow designer surface, where you can define its data flows.

Event Handlers Tab – A package and its components have different events in their execution life-cycle. You can create event handlers for these events on the Event Handlers designer surface. (I will discuss event handlers further in my next article, "SQL Server Integration Services - Features and Properties".)

Package Explorer Tab – The Package Explorer tab displays the contents of the package. Packages can be complex, including many tasks, connection managers, variables, and other elements; the explorer view of the package lets you see a complete list of package elements.

Progress Tab – This tab appears when you execute a package in the designer and shows the execution progress. It changes to Execution Results once you stop executing the package, and it still contains the results of the last execution until you close the package or re-run it.

At the bottom there is a Connection Managers tray, which displays all the used and available connection managers for the package.

While executing a package in the designer, every task changes its color as given below:

• No color/white – Indicates the execution of the task has not started yet.

• Yellow – Indicates the execution of the task is in progress.

• Green – Indicates the execution of the task has completed successfully.

• Red – Indicates the execution of the task has completed but failed.

Now let's run through some examples of creating an SSIS package.

Note: In the examples below I have used one task per package for simplicity; in real life a package might contain several tasks.

Scenario 1

In this example I will create a very simple package; it uses an Execute Process Task to launch Notepad. Open a new package and drag an Execute Process Task from the Toolbox to the Control Flow.

Right-click the task, click Edit, and set the relevant properties as shown below. Now your package is all set to run: hit the F5 key and the package starts executing; click the Progress tab to see the execution progress.

Scenario 2

In this example I will create a very simple package; it uses a Data Flow Task to pull data from a source to a destination. Open a new package and drag a Data Flow Task from the Toolbox to the Control Flow. Double-click the Data Flow Task, and its details open in the Data Flow designer tab. Drag a source and a destination from the Toolbox (the Toolbox on the Data Flow tab changes its content; it now shows only source, transformation, and destination components) to the designer and specify the source and destination details. While configuring the source, you have to select one of the available connection managers, the data access method, and the columns you want to pass through the data path. While configuring the destination, you have to select one of the available connection managers, the data access method, and the mapping between source and destination columns; SSIS is smart enough to do this mapping on its own based on the similarity of column names and types between source and destination, but if required you can change it.

Scenario 3

In this example I take the same package created in Scenario 2 and add a Derived Column transformation (VendorDetails = AccountNumber + ":" + Name) and a Sort transformation (to sort the incoming record-set on the Name column before uploading) along the way. (You can apply one transformation or multiple transformations between source and destination, as per your need.)

In the next article in this series we will look at how to use the SSIS API Model.

SQL Server Integration Services an Introduction - Part 4

SSIS provides an API object model, which you can use in your programming language of choice to create SSIS packages programmatically. So what is the need for creating a package programmatically if you can do it using the SSIS Designer? Let's consider a scenario. Though you can create a package with multiple Data Flow Tasks, and multiple data flows in a single Data Flow Task, you cannot change the mapping between source and destination during runtime; that is, you cannot change the source, destination, and column-mapping metadata while executing a package. So what if you want to build a generic loading package that can load data from any data source to any destination as long as the metadata is known? What if you want to create a self-modifying package? In scenarios like these, you can use the SSIS API object model to write code in C#/VB.NET etc. to create a package programmatically, on the fly, and execute it. Here are some namespaces/assemblies that you will use frequently while creating packages programmatically.
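
For instance (a non-exhaustive list, with the assembly names as they ship with SQL Server 2005/2008; the first and last are the ones the code below uses):

    Microsoft.SqlServer.Dts.Runtime (Microsoft.SqlServer.ManagedDTS.dll) – packages, tasks, connection managers, variables.
    Microsoft.SqlServer.Dts.Pipeline.Wrapper (Microsoft.SqlServer.DTSPipelineWrap.dll) – the data flow (pipeline) engine.
    Microsoft.SqlServer.Dts.Tasks.* – task-specific namespaces, for example Microsoft.SqlServer.Dts.Tasks.ExecuteProcess.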

Now let's run through an example of using the SSIS API. In this example you will write code that uses the SSIS API to create a package, add a task to it, and save it on the file system (the same package that was created in Scenario 1 of the SSIS Designer section above). Later you will load that package from the file system and execute it, everything programmatically this time.

    using System;
    using Microsoft.SqlServer.Dts.Runtime;
    using Microsoft.SqlServer.Dts.Tasks.ExecuteProcess;

    namespace SSIS_API_Programming
    {
        class Program
        {
            static void Main(string[] args)
            {
                CreateAndSavePackage();
                LoadAndExecutePackage();
            }

            private static void CreateAndSavePackage()
            {
                Package pkg = new Package();
                pkg.Name = "MySSISAPIExamplePackage";

                //Adding ExecuteProcessTask to Package
                //STOCK is the moniker which is used most often in the
                //Microsoft.SqlServer.Dts.Runtime.Executables.Add(System.String) method,
                //though you can specify a task by name or by ID
                Executable exec = pkg.Executables.Add("STOCK:ExecuteProcessTask");

                //TaskHost class is a wrapper for every task
                TaskHost thExecuteProcessTask = exec as TaskHost;
                thExecuteProcessTask.Name = "Execute Process Task";

                //Set relevant properties of the task
                ExecuteProcess execPro = (ExecuteProcess)thExecuteProcessTask.InnerObject;
                execPro.Executable = @"C:\Windows\System32\notepad.exe";
                execPro.WorkingDirectory = @"C:\Windows\System32";

                Application app = new Application();
                //Save the package on the file system; you can choose to save on SQL Server as well
                app.SaveToXml(@"D:\ExecuteProcess.dtsx", pkg, null);
            }

            private static void LoadAndExecutePackage()
            {
                Package pkg;
                Application app;
                DTSExecResult pkgResults;

                app = new Application();
                //Load the package from the file system; you can choose to load from SQL Server as well
                pkg = app.LoadPackage(@"D:\ExecuteProcess.dtsx", null);

                //Execute the package
                pkgResults = pkg.Execute();

                Console.WriteLine(pkgResults.ToString());
                Console.ReadKey();
            }
        }
    }

Conclusion

In this series I discussed SSIS, a platform for building high-performance data integration and workflow solutions. To achieve this, it separates the data flow engine from the runtime engine and uses multithreading (allowing multiple executables/data flows to run in parallel). I then talked about the different ways to create SSIS packages and the different kinds of SSIS components. In the next article, I am going to write more about the features and properties of SSIS: event logging, event handlers, transaction support, checkpoint restartability, and the SSIS validation process. So stay tuned to see the power and capabilities of SSIS.
