Execution Plan

When SQL Server executes a query, it will use an execution plan if one already exists. If there is no preexisting plan, SQL Server follows a series of steps to create one. The objective is to find an acceptable plan, not necessarily the optimal plan: the effort of finding the best possible plan might take longer than executing an acceptable but suboptimal one. To arrive at a plan, the Query Optimizer takes the following steps. Each step might produce one or more plans, each plan is assigned a total cost for execution, and SQL Server uses the plan with the lowest cost.
• First, SQL Server checks if a trivial plan is available. If so, it goes ahead with this plan. For example, an INSERT statement using the VALUES clause has only one possible execution (see the sketch after this list).
• If no trivial plan is available, SQL Server will try to simplify the query so that a trivial plan becomes available. This will not result in a new plan, but helps SQL Server analyze the query. At this point SQL Server loads any statistics that will help it in the cost-based process that follows. This cost-based process has three phases:
  o First is the transaction processing phase. In this phase, SQL Server picks out plans for simple queries typical of transaction processing databases. If a plan is found with a total cost below a threshold, SQL Server uses this plan.
  o If a cheap plan cannot be found, the Quick Plan phase starts. In this phase, SQL Server includes choices that are often useful in more complex queries, such as the use of indexes and nested loop joins. Again, if a cheap enough plan is found, SQL Server uses it.
  o The last phase is the Full Optimization phase. In this phase, SQL Server compares every possible execution plan and then goes with the cheapest one. This phase has a time limit; when the time limit is reached, SQL Server goes ahead with the cheapest plan found up to that moment.
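For example, the following statement leaves the optimizer no choices, so a trivial plan results (a sketch; the Clients table and its columns are assumptions, in line with the examples later in this article):

USE MyDatabase
-- A single-row INSERT ... VALUES can only be executed one way,
-- so SQL Server settles on a trivial plan without a cost-based search
INSERT INTO Clients (gnaam, straat)
VALUES ('Smit', 'Kerkstraat')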
Statistics

This process -- except for the trivial plan -- is based solely on statistics. Bad statistics lead to bad execution plans. If an index is placed on a small table, the optimal plan might be to ignore the index and use a table scan. If the statistics are not updated while the table grows significantly, SQL Server will still assume that the table is small and use the table scan, even if this is no longer the optimal plan. In most cases it is wise to let SQL Server automatically update statistics. To turn on the automatic updates for all statistics in a database, if not already on, execute the following statement:

USE master
EXEC sp_dboption 'MyDatabase', 'auto update statistics', 'true'

To turn on the automatic updates for all statistics on a specific table, such as Clients, execute the following statement:

USE MyDatabase
EXEC sp_autostats Clients, 'ON'
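Calling sp_autostats with only a table name reports the current settings instead of changing them; a minimal sketch:

USE MyDatabase
-- Without the 'ON'/'OFF' flag, sp_autostats reports the current
-- automatic-statistics setting for each index on the table
EXEC sp_autostats Clients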
To manually update the statistics on a specific table, such as Clients, execute the following command:

USE MyDatabase
UPDATE STATISTICS Clients

Other options for updating statistics (such as the FULLSCAN and SAMPLE clauses of UPDATE STATISTICS) can be found in the SQL Server documentation.

How Statistics Work

Keeping statistics up to date is crucial for optimal performance of a database. In some cases, however, an available execution plan is not ideal for the given situation. To understand when and why this happens, it is important to understand how statistics work. The following command displays the statistics for a given index (index IX1a on table Clients):

USE MyDatabase
DBCC SHOW_STATISTICS (Clients, IX1a)

This returns all statistical information on this index. Along with the date and time of the last update and its density, it returns samples of the index. This sample is taken at random from the database and gives the following information:
Figure 1: DBCC SHOW_STATISTICS output

The last part of this output is important. On the basis of these records, SQL Server determines the expected number of hits for a search parameter. The output has the following entries:

RANGE_HI_KEY: The upper value of the histogram step. For example, 'Smit'.
RANGE_ROWS: The number of rows in the histogram step, excluding the occurrences of the upper bound. For example, there are 66,811 rows between 'Smit' and 'Snijders'.
EQ_ROWS: The number of rows equal to the value of the upper bound of the histogram step. For example, there are 48,760 rows with the value 'Smit'.
DISTINCT_RANGE_ROWS: The number of distinct values in the histogram step, excluding the upper bound. For example, there are 462 distinct values between 'Smit' and 'Snijders' (not including either).
AVG_RANGE_ROWS: The average number of occurrences of each unique value in the histogram step, calculated as RANGE_ROWS / DISTINCT_RANGE_ROWS. For example, between 'Smit' and 'Snijders' there are 66,811 rows with 462 distinct values, resulting in an expected 144.61255 rows per unique value between 'Smit' and 'Snijders'.

The Estimated Execution Plan for a search for 'Smits' (between 'Smit' and 'Snijders') confirms this number:
Figure 2: Estimated Execution Plan - estimated row count for 'Smits'

On the basis of this estimate, the subsequent steps of the execution plan are determined. If a search value is matched directly in the statistics, the value of EQ_ROWS is used as the estimated row count, as can be seen when searching for 'Smit':
Figure 3: Estimated Execution Plan - estimated row count for 'Smit'
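These estimates can also be inspected without the graphical plan. A minimal sketch using the textual showplan (assuming gnaam is the indexed last name column referred to later in this article):

USE MyDatabase
GO
-- SHOWPLAN_ALL returns the plan steps instead of executing the query;
-- the EstimateRows column holds the optimizer's row count estimate
SET SHOWPLAN_ALL ON
GO
SELECT * FROM Clients WHERE gnaam = 'Smits'  -- EstimateRows: about 144.6
GO
SET SHOWPLAN_ALL OFF
GO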
Variance

In the case of last names, there is a very large difference in the number of occurrences per distinct value. In our database, we host approximately 18 million records with about 500,000 different last names. This gives an average of 36 records per name. In fact, half of these names occur only once. Another 100,000 names occur fewer than 10 times. The most frequently used names occur up to 95,000 times. The variance is extremely high. In cases of such extremely high variance, the use of statistics in this manner is not very accurate. If a second indexed search parameter is provided, SQL Server might decide not to use that index, but to use a filter to extract the second parameter. When in practice the actual row count is much higher than estimated, this plan is less than ideal and might lead to extreme query times.

The entry for 'Smit' is found directly in the statistics. Therefore the estimated row count equals the actual row count of 48,760. The same applies to the street name 'Kerkstraat', with a row count of 42,722. If a search is made for last name 'Smit' and street name 'Kerkstraat', the following plan is used:
Figure 4: Execution Plan - Search for Smit and Kerkstraat

As can be seen, optimal use is made of the two available indexes. If the same query is made for 'Smits', where the estimated row count is 144 (as calculated above) and the actual row count is 25,718 (a given), and 'Kerkstraat', the following plan is used:
Figure 5: Execution Plan - Search for Smits and Kerkstraat

Because SQL Server expects 144 rows, the effort of filtering the records for the second parameter is predicted to be faster than using two indexes and cross-referencing them. However, the actual row count is much higher, causing this plan to be far from ideal. Even though the number of occurrences for 'Smits' is approximately half of the number for 'Smit', this query takes three and a half minutes to execute, while the query for 'Smit' takes only about 1 second.
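The two queries compared above have the following shape (a sketch: gnaam is the indexed last name column mentioned below; the street column name straat is an assumption):

USE MyDatabase
-- Accurate estimate for 'Smit': both indexes are used (Figure 4)
SELECT * FROM Clients WHERE gnaam = 'Smit' AND straat = 'Kerkstraat'
-- 'Smits' is estimated at 144 rows but actually returns 25,718,
-- so the street index is skipped in favor of a filter (Figure 5)
SELECT * FROM Clients WHERE gnaam = 'Smits' AND straat = 'Kerkstraat'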
Work Around

So how do you solve this problem? Statistics cannot be expanded. They are designed to be kept small, so that the process of determining an execution plan stays fast. Even if the statistics could be doubled in size, this problem would still arise. The option of using an index hint is only helpful if it is used in all the queries with a search parameter on this field. Adding an index hint is a manual job, requiring knowledge of the dataset. A human will encounter the same difficulties in determining the best use of indexes as SQL Server itself. Also, automating the use of index hints in a production environment is time-consuming and far from efficient.
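For reference, such an index hint could look like this (a sketch; index IX1a is taken from the earlier examples, the column names are assumptions):

-- Force the use of index IX1a, overriding the optimizer's estimate
SELECT *
FROM Clients WITH (INDEX(IX1a))
WHERE gnaam = 'Smits' AND straat = 'Kerkstraat'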
In these situations, a workaround is the best option. By changing the specific index into a clustered index, a bookmark lookup is no longer necessary. This saves considerable time and effort. If the statistics are off and the number of hits exceeds the expected number, all relevant records are clustered together and require little extra time. In the above examples, if the index on gnaam is a clustered index, the execution plan is as follows:
Figure 6: Execution Plan - Search for Smits and Kerkstraat with Clustered Index

This query takes 7 seconds to execute, a significant improvement.
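Recreating the index on gnaam as a clustered index could be done as follows (a sketch; remember that a table can have only one clustered index, so an existing clustered index would have to be dropped or rebuilt as non-clustered first):

USE MyDatabase
-- Drop the non-clustered index on the last name column and recreate
-- it as a clustered index, storing rows with equal names together
DROP INDEX Clients.IX1a
CREATE CLUSTERED INDEX IX1a ON Clients (gnaam)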
Index Planning

The effect of changing the currently existing clustered index into a non-clustered index should be examined, because there can be only one clustered index on a table. In general, the difference in performance between a clustered and a non-clustered index is minimal for a field containing unique values. If, however, this field is used in a one-to-many relationship, the general performance of the overall queries to the database might decline.
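sp_helpindex can be used to check which index on a table is currently the clustered one; a minimal sketch:

USE MyDatabase
-- Lists every index on the table; the index_description column
-- indicates which one is clustered
EXEC sp_helpindex Clients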
It is wise to take all these aspects into consideration when designing the index structure for your databases. It is often worth the effort to try different configurations and go with the one best suited. Of course, using the Index Tuning Wizard will help in this process.