Dino Esposito
Wintellect
June 18, 2002

ADO.NET unquestionably provides a rich, database-like object model to hold fetched data in memory. In particular, the DataSet class works as a central repository for tables of data, and also supports constraints and logical relations between tables. Furthermore, the DataSet object is a disconnected data container. At first sight, it seems that by combining these features, you can stop executing complex SQL queries full of (nested) INNER JOIN and GROUP BY clauses. You could split a complex query into two or more distinct subqueries, store the results in distinct DataTable objects, and then use in-memory data relations and constraints to restore the necessary referential integrity. For example, you could fill two distinct DataTable objects with customers and orders and use a DataRelation object to bind them together. The advantage is that you don't need Microsoft SQL Server™ (or any other DBMS system) to run potentially heavy INNER JOIN statements and, perhaps more importantly, much less redundant data is transferred across the wire. Albeit effective, though, this technique does not always turn out to be the best possible approach for complex queries when you have very large and highly volatile databases. In this article, I'll review an alternate technique that makes intensive use of ADO.NET in-memory objects without causing too much of a hit to users and the DBMS itself.
Split Your Query

Several ADO.NET books, including David Sceppa's excellent Programming ADO.NET Core Reference (Microsoft Press), suggest splitting a complex query into smaller steps, each of which populates a different DataTable object in the same DataSet container. Let's consider a practical example. Suppose that you need to obtain all the orders issued in a given year grouped by customer. Furthermore, you don't want all the orders, only those with at least 30 items. In addition, you want the name of the employee that issued the order and the name of the customer's company. The query can be written like this:

DECLARE @TheYear int
SET @TheYear = 1997

SELECT o.customerid, od.orderid, o.orderdate, o.shippeddate,
       SUM(od.quantity*od.unitprice) AS price, c.companyname, e.lastname
FROM Orders AS o
  INNER JOIN Customers AS c ON c.customerid=o.customerid
  INNER JOIN Employees AS e ON e.employeeid=o.employeeid
  INNER JOIN [Order Details] AS od ON o.orderid=od.orderid
WHERE Year(o.orderdate) = @TheYear AND od.orderid=o.orderid
GROUP BY o.customerid, c.companyname, od.orderid, o.orderdate,
         o.shippeddate, e.lastname
HAVING SUM(od.quantity) > 30
ORDER BY o.customerid
Run this query using a plain old command, regardless of whether you're using ADO or ADO.NET, and it produces the result set shown in Figure 1.
Figure 1. The output of the previous query as displayed by the SQL Server Query Analyzer.

The query has a core subquery and a couple of ancillary INNER JOIN statements. The core block takes care of selecting the orders issued in a given year with at least 30 items. The following listing presents the core query:

SELECT o.customerid, o.orderid, o.orderdate, o.shippeddate,
       SUM(od.quantity*od.unitprice) AS price, o.employeeid
FROM Orders AS o
  INNER JOIN [Order Details] AS od ON o.orderid=od.orderid
WHERE Year(o.orderdate) = @TheYear AND od.orderid=o.orderid
GROUP BY o.customerid, o.orderid, o.orderdate, o.shippeddate, o.employeeid
HAVING SUM(od.quantity) > 30
ORDER BY o.customerid
This result set contains only the IDs of the customer and the employee. What we need, though, is the last name of the employee and the name of the customer's company. The core query is simpler than the original one and, more importantly, returns a more compact result set in which potentially lengthy employee and customer names are not repeated. You can then split the original query into three subqueries: the core query for orders, plus two distinct queries that create look-up tables for employees and customers.

SELECT employeeid, lastname FROM Employees
SELECT customerid, companyname FROM Customers
The following ADO.NET code shows how to run the three queries to fill three different tables in the target DataSet.

Dim ds As New DataSet()
Dim conn As SqlConnection = New SqlConnection(connString)
Dim adapter As SqlDataAdapter = New SqlDataAdapter()

conn.Open()
adapter.SelectCommand = New SqlCommand(cmdCore, conn)
adapter.SelectCommand.Parameters.Add("@TheYear", 1997)
adapter.SelectCommand.Parameters.Add("@TheQuantity", 30)
adapter.Fill(ds, "Orders")

adapter.SelectCommand = New SqlCommand(cmdCust, conn)
adapter.Fill(ds, "Customers")

adapter.SelectCommand = New SqlCommand(cmdEmpl, conn)
adapter.Fill(ds, "Employees")
conn.Close()
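As an aside, the snippet assumes that cmdCore, cmdCust, and cmdEmpl are strings holding the command text of the three subqueries. The declarations below are only a sketch for illustration; the look-up queries match the ones shown above, while cmdCore is the parameterized variant of the core order query (the variable names are simply the ones the snippet uses).

Dim cmdCust As String = "SELECT customerid, companyname FROM Customers"
Dim cmdEmpl As String = "SELECT employeeid, lastname FROM Employees"

' Parameterized variant of the core query shown earlier
Dim cmdCore As String = _
    "SELECT o.customerid, o.orderid, o.orderdate, o.shippeddate, " & _
    "SUM(od.quantity*od.unitprice) AS price, o.employeeid " & _
    "FROM Orders AS o INNER JOIN [Order Details] AS od " & _
    "ON o.orderid=od.orderid " & _
    "WHERE Year(o.orderdate) = @TheYear " & _
    "GROUP BY o.customerid, o.orderid, o.orderdate, o.shippeddate, o.employeeid " & _
    "HAVING SUM(od.quantity) > @TheQuantity " & _
    "ORDER BY o.customerid"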
Notice that when you need to run one query after the next, you should manage the connection yourself to avoid unnecessary open/close operations. The adapter's Fill method, in fact, silently opens and closes the connection each time it runs unless you explicitly associate an open connection with its SelectCommand property. To establish an in-memory link between the various tables, you can add a couple of data relations connecting employeeid to lastname and customerid to companyname. A DataRelation object normally sets up a one-to-many relation between two tables in the same DataSet. In this case, though, you need a rather unusual many-to-one relation or, more precisely, a one-to-many relation in which the parent table (Orders) becomes the child and either of the two child tables (Employees and Customers) acts as the parent.
Figure 2. Parent and child tables exchange their role in the relation.

The ADO.NET DataRelation object is also flexible enough to handle many-to-one relations. When a DataRelation object gets created, the ADO.NET runtime silently sets up a unique constraint that prevents the parent table from containing duplicates of the key. Of course, if the designated parent table includes duplicates, an exception is thrown.

Dim relOrder2Employees As DataRelation
relOrder2Employees = New DataRelation("Orders2Employees", _
    ds.Tables("Orders").Columns("employeeid"), _
    ds.Tables("Employees").Columns("employeeid"))
ds.Relations.Add(relOrder2Employees)
The DataRelation constructor takes three arguments, the first being the name of the relation. The other two arguments identify the columns involved in the relation. The first DataColumn object represents the parent column, whereas the second DataColumn object is the child column. The constructor raises an ArgumentException whenever it detects that the parent column doesn't currently have unique values. The simplest workaround consists of adding a fourth Boolean argument to the constructor.

relOrder2Employees = New DataRelation("Orders2Employees", _
    ds.Tables("Orders").Columns("employeeid"), _
    ds.Tables("Employees").Columns("employeeid"), _
    False)
Pass False to the constructor and the unique constraint, which is the main culprit of the exception, will not be created. Having set these relations, you could then add a couple of computed columns to the Orders table just for display purposes. In theory, the following code should work.

Dim orders As DataTable = ds.Tables("Orders")
orders.Columns.Add("Employee", GetType(String), _
    "Child(Orders2Employees).lastname")
orders.Columns.Add("Customer", GetType(String), _
    "Child(Orders2Customers).companyname")
In practice, however, this code won't work. Worse yet, the code throws a rather misleading syntax error on the Child keyword. (See last month's column for more information about computed columns.) The real problem is that the Child keyword is accepted in a column expression only if a unique constraint exists on the parent column of the specified relation. This is not clearly stated in the documentation, but it still makes a lot of sense. Interestingly, if you programmatically access the children of any Orders row, nothing bad happens and you can read the corresponding single row in the Employees or Customers table. The code below shows this concept in action.

Dim orders As DataTable = ds.Tables("Orders")
' GetChildRows returns an array of rows; here there is exactly one child row
Dim employee As DataRow = orders.Rows(0).GetChildRows(relOrder2Employees)(0)
MsgBox(employee("lastname"))
So, the previously mentioned syntax error does not imply that you cannot handle many-to-one relations, so to speak. It just means that the Child expression cannot be used to set up expression-based columns when the relation has no unique constraint.
Although the Orders table is the logical parent of the data relationship, in the context of getting additional information about a customer or an employee ID, the roles are exchanged: the Orders table is the child, and the Employees and Customers tables are the parents. To solve this issue, invert the columns in the DataRelation constructors and use the following code to define the computed columns.

Dim orders As DataTable = ds.Tables("Orders")
orders.Columns.Add("Employee", GetType(String), _
    "Parent(Orders2Employees).lastname")
orders.Columns.Add("Customer", GetType(String), _
    "Parent(Orders2Customers).companyname")
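For completeness, here is a sketch of how the two relations could be declared once the columns are inverted. It assumes the same relation names used above, with the look-up tables now acting as parents; since employeeid and customerid are unique in the look-up tables, the automatically created unique constraints no longer cause trouble.

' Employees and Customers now act as parent tables; Orders is the child
Dim relOrder2Employees As New DataRelation("Orders2Employees", _
    ds.Tables("Employees").Columns("employeeid"), _
    ds.Tables("Orders").Columns("employeeid"))
ds.Relations.Add(relOrder2Employees)

Dim relOrder2Customers As New DataRelation("Orders2Customers", _
    ds.Tables("Customers").Columns("customerid"), _
    ds.Tables("Orders").Columns("customerid"))
ds.Relations.Add(relOrder2Customers)

' With the look-up tables as parents, each order row can also reach its
' related rows programmatically through GetParentRow (illustrative only)
Dim firstOrder As DataRow = ds.Tables("Orders").Rows(0)
MsgBox(firstOrder.GetParentRow("Orders2Employees")("lastname"))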
What is the moral of the story? You split a rather complex query into three simpler queries, saved the database a couple of INNER JOIN statements, and more importantly, moved much less redundant data from the server up to the client. So far, it seems the perfect solution.
Alternate Approaches

The previous solution is based on look-up tables that I fetched without filters. What if the look-up tables are significantly large? Can you afford to download 10,000 employee records to find a few hundred last names? Do you want to fetch all of that possibly redundant data? Worse yet, that data is not simply redundant, it's useless. On the other hand, consider that look-up tables can be useful in many circumstances during the lifetime of the application. This means that even though downloading the full table may be overkill for the goal of a single query, it could be a worthwhile trade-off in the broader context of the overall application. That said, let's consider alternate techniques to minimize the size of look-up tables. The most obvious approach is to add a WHERE clause to narrow the result set. Unfortunately, such a solution is not always easy or effective to implement, especially when the tables to look up are not the main actors of the query. In this case, for instance, the filter for employees involves checking conditions on other joined tables, Orders and Order Details. All in all, perhaps the best approach is extracting all the distinct customers from the result set generated by the previously executed query. You can run two similar queries and force the database to run the same subquery twice. This way you get exactly the data you need with minimal overlapping. In addition, thanks to the SQL Server query optimization engine, the cost of the repeated queries is minimized.

SELECT DISTINCT t.customerid, t.companyname FROM
  (SELECT o.customerid, o.orderid, o.orderdate, o.shippeddate,
          SUM(od.quantity*od.unitprice) AS price, c.companyname, o.employeeid
   FROM Orders AS o
     INNER JOIN Customers AS c ON c.customerid=o.customerid
     INNER JOIN [Order Details] AS od ON o.orderid=od.orderid
   WHERE Year(o.orderdate) = @TheYear AND od.orderid=o.orderid
   GROUP BY o.customerid, o.orderid, c.companyname, o.orderdate,
            o.shippeddate, o.employeeid
   HAVING SUM(od.quantity) > 30) AS t
In summary, the key reason to prefer data retrieval code based on smaller queries is that you move the burden of joining data onto the client. On the client, the data is partitioned into independent but easily manageable linked tables, providing for an unprecedented level of flexibility. Smaller and simpler queries move data directly into ad hoc DataTable objects, thus making life easier for client applications. However, what if splitting the query leads to some data inconsistency?
Transactions Needed

In general, each query command, no matter how complex, executes in the context of an implicit transaction. This ensures that as long as the command is running, no one is allowed to perform tasks that may affect the overall data consistency. But what if you split that logical monolithic query into smaller steps? Assuming that yours is a highly volatile environment with a significant level of data contention, chances are good that other users will place their update statements in the middle of your subqueries. If we consider the sample query command seen above, this risk is perhaps limited because the second and the third query only affect look-up tables. Nevertheless, if another user deletes a customer after the Orders data table has been populated, you may easily end up with inconsistent data. The only workaround for this situation consists of wrapping all the subqueries in an all-encompassing explicit transaction. Transactions are units of work that execute according to four principles:

• Atomicity
• Consistency
• Isolation
• Durability
These principles are often cumulatively known as ACID. For our purposes, the key element is isolation. The isolation property refers to the ability of the database to shield each running transaction from the effects of other concurrently running transactions. If our transacted query runs at the same time that another user updates the database, the final results may vary according to the level of isolation. Normally each transacted operation can have the level of isolation it needs. If the application needs absolute data consistency, and cannot afford phantom rows, the only possible isolation level is serializable. A serializable transaction locks all the involved tables, preventing other users from updating or inserting rows. The tables are unlocked only when the transaction is completed. It goes without saying that with this isolation level, dirty reads (reading uncommitted data) and phantom rows (rows added or deleted by other transactions that appear in, or disappear from, successive reads) are not possible, but the overall level of concurrency is affected.
Keep in mind that if a customer may be deleted or an order modified while the subqueries are running, and the application cannot tolerate the resulting inconsistency, you must group the subqueries in a serializable wrapper transaction.

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
  -- get Orders
  -- get Customers
  -- get Employees
COMMIT TRANSACTION
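On the client side, the same wrapper can be obtained with an explicit ADO.NET transaction. The following is only a sketch; it reuses the names of the earlier snippet (connString, cmdCore, cmdCust, cmdEmpl) and assumes a SQL Server data source.

Dim ds As New DataSet()
Dim conn As New SqlConnection(connString)
conn.Open()

' Run the three subqueries within a single serializable transaction
Dim tran As SqlTransaction = conn.BeginTransaction(IsolationLevel.Serializable)
Try
    Dim adapter As New SqlDataAdapter()

    adapter.SelectCommand = New SqlCommand(cmdCore, conn, tran)
    adapter.SelectCommand.Parameters.Add("@TheYear", 1997)
    adapter.SelectCommand.Parameters.Add("@TheQuantity", 30)
    adapter.Fill(ds, "Orders")

    adapter.SelectCommand = New SqlCommand(cmdCust, conn, tran)
    adapter.Fill(ds, "Customers")

    adapter.SelectCommand = New SqlCommand(cmdEmpl, conn, tran)
    adapter.Fill(ds, "Employees")

    tran.Commit()
Catch
    ' Release the locks if anything goes wrong
    tran.Rollback()
    Throw
Finally
    conn.Close()
End Try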
Once again, if the required level of scalability makes a serializable transaction affordable to you, then by all means use one. Otherwise, if you reckon that locking all the tables for the entire duration of the query is harmful, then ADO.NET in-memory objects can help you. We'll take a look at in-memory objects next.
Disjoin Joined Data

Let's move back and reconsider the first query that we analyzed at the beginning of this article. The goal is getting all the orders issued in a given year for more than a certain quantity of items. In addition, we want to know the total amount of the order, as well as the name of the customer and the employee who worked on it.

DECLARE @TheYear int
DECLARE @TheAmount int
SET @TheYear = 1997
SET @TheAmount = 30

SELECT o.customerid, od.orderid, o.orderdate, o.shippeddate,
       SUM(od.quantity*od.unitprice) AS price, c.companyname, e.lastname
FROM Orders AS o
  INNER JOIN Customers AS c ON c.customerid=o.customerid
  INNER JOIN Employees AS e ON e.employeeid=o.employeeid
  INNER JOIN [Order Details] AS od ON o.orderid=od.orderid
WHERE Year(o.orderdate) = @TheYear AND od.orderid=o.orderid
GROUP BY o.customerid, c.companyname, od.orderid, o.orderdate,
         o.shippeddate, e.lastname
HAVING SUM(od.quantity) > @TheAmount
ORDER BY o.customerid
The above SQL command returns, in a single shot, all the information we need. Since the query runs in an implicit transaction, we can be sure that we get consistent data. Why did we discard this sort of old-fashioned solution? The result set is made of rows extracted from three independent tables, not to mention [Order Details]. They are:

• Orders
• Customers
• Employees
In addition, we moved more data than necessary because of the INNER JOIN statements. The latter point cannot be improved, but with some ADO.NET code you can improve the former point and make the whole solution more compelling and performant. The idea is that you run the query, gather all the data in a single DataTable object, and then split it out into distinct and related DataTable objects. The final output is like running distinct queries. You don't pay the price of setting up an explicit, serializable transaction, and you don't download more rows than necessary, but each row carries a bit of redundant information. When does this approach sound reasonable? I've found it useful in scenarios in which you need to organize a complex master/detail view on the client with group-by functionality and filters. In these cases, having independent but related tables is incredibly advantageous, and ADO.NET provides many features on this side. Let's see how to proceed. The following code shows the main procedure:

Function SplitData(ByVal ds As DataSet) As DataSet
    ' Make a full worker copy of the DataSet
    Dim _dataset As DataSet = ds.Copy()

    CreateCustomers(_dataset, ds)
    CreateEmployees(_dataset, ds)

    ' Remove the look-up columns (companyname and lastname) from Orders
    _dataset.Tables("Orders").Columns.Remove("companyname")
    _dataset.Tables("Orders").Columns.Remove("lastname")

    Return _dataset
End Function
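Before examining the helper routines in detail, here is a rough usage sketch. It assumes that cmdJoin (a hypothetical name) holds the single JOIN query shown above with @TheYear and @TheAmount exposed as parameters, and that connString is the usual connection string.

' Fetch the joined result set into a single Orders table
Dim adapter As New SqlDataAdapter(cmdJoin, connString)
adapter.SelectCommand.Parameters.Add("@TheYear", 1997)
adapter.SelectCommand.Parameters.Add("@TheAmount", 30)

Dim ds As New DataSet()
adapter.Fill(ds, "Orders")

' Split the single table into Orders, Customers, and Employees
Dim result As DataSet = SplitData(ds)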
The code makes a complete deep copy of the source DataSet; its Orders table is used as-is in the resulting new DataSet. Tables for customers and employees are then created and added dynamically to this new DataSet. Finally, the columns now covered by the child tables are removed from the Orders table. The figure below shows the newly created Customers table, which contains only (and all of) the customers referenced in the Orders table. Both tables maintain a customerid field so that relations can still be set.
Figure 3. The new Customers table created from the result set of the single query.

Let's briefly discuss the code necessary to create and populate the customers and employees tables. To start out, you create a new DataTable object by cloning the original Orders table. Unlike Copy, the Clone method simply duplicates metadata. The DataTable interface does not allow you to clone individual columns, so this is the simplest way to create equivalent columns. The newly created table contains too many columns, so let's remove the unneeded ones. From the structure of the source DataSet, you know that the customerid and the companyname columns are the first two in the collection.

Dim _customers As DataTable = orig.Tables("Orders").Clone()
_customers.TableName = "Customers"

' Remove unneeded columns: keep only the first two (customerid, companyname)
Dim i As Integer
For i = 2 To _customers.Columns.Count - 1
    _customers.Columns.RemoveAt(2)
Next
Once the structure of the table is done, we need to fill it out. The first rub to consider is that the source table contains duplicated customers. On the other hand, the only way you have to get the customers is by filtering the source DataSet. The source is already sorted by customerid, so let's scroll the rows and implement a handcrafted mechanism to select only distinct rows.

Dim row As DataRow
Dim customerKey As String = ""
For Each row In _dataset.Tables("Orders").Rows
    ' Already sorted by customerid
    If customerKey <> row("customerid") Then
        ' Select distinct
        _customers.ImportRow(row)
        customerKey = row("customerid")
    End If
Next

' Add to the DataSet
_dataset.Tables.Add(_customers)
The fastest way to import a row into a new table is through the ImportRow method. Basically, ImportRow performs a bulk copy of all the columns that match the schema. Creating the Employees table is not much different. Of course, the columns to remove are different. In this case, in accordance with the structure of the source, the columns to preserve are the third and the fourth. The code below removes the first two columns and then implements a loop for all the others.

Dim _employees As DataTable = orig.Tables("Orders").Clone()
_employees.TableName = "Employees"

' Remove unneeded columns: keep only employeeid and lastname
_employees.Columns.RemoveAt(0)
_employees.Columns.RemoveAt(0)
Dim i As Integer
For i = 2 To _employees.Columns.Count - 1
    _employees.Columns.RemoveAt(2)
Next
While populating the Employees table, you must get rid of the duplicated rows. In this case, though, the source must be sorted by employeeid to simplify the operation. You can create a sorted view for the table and then scroll the elements.

Dim employeeKey As Integer = 0
Dim view As DataView = New DataView(_dataset.Tables("Orders"))
view.Sort = "employeeid"

Dim rowView As DataRowView
For Each rowView In view
    If employeeKey <> Convert.ToInt32(rowView("employeeid")) Then
        ' Select distinct
        _employees.ImportRow(rowView.Row)
        employeeKey = Convert.ToInt32(rowView("employeeid"))
    End If
Next

' Add to the DataSet
_dataset.Tables.Add(_employees)
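At this point the new DataSet contains three separate tables, and the in-memory relations discussed earlier can be re-created on it, for example right before SplitData returns. The following is only a sketch and assumes the relation names used previously, with the look-up tables acting as parents.

' Bind the look-up tables back to Orders (look-up tables act as parents)
_dataset.Relations.Add("Orders2Customers", _
    _dataset.Tables("Customers").Columns("customerid"), _
    _dataset.Tables("Orders").Columns("customerid"))

_dataset.Relations.Add("Orders2Employees", _
    _dataset.Tables("Employees").Columns("employeeid"), _
    _dataset.Tables("Orders").Columns("employeeid"))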
Summary

In this article, I presented a relatively complex query and discussed three ways to implement it effectively. As a matter of fact, with ADO you had fewer opportunities to improve the query implementation. With ADO.NET, you can exploit a far richer disconnected object model that gives you more programming power. Which of the options discussed here is the best? There is no certain answer, but several factors can affect the final result: the effective bandwidth you have, the inherent concurrency of the data, and the level of scalability required. To determine the best approach, you have to try them and measure the real performance you get.