
- Designing Indexes for Performance A number of ways are available to improve SQL Server performance, but the greatest speed improvement results from having indexes that the optimizer can use to avoid table scans and reduce the I/O costs of resolving queries. Proper index design is the most important issue in tuning SQL Server performance.
- Why Use Indexes? Two primary reasons exist for creating indexes in SQL Server: 1. To maintain uniqueness of the indexed column(s) 2. To provide fast access to the data in tables. If you have primary-key constraints, you need unique indexes to ensure the integrity of the primary key and avoid duplicates. However, you don't have to have indexes in order to access the data. SQL Server can always perform a table scan to retrieve data rows. But table scans aren't an efficient mechanism to retrieve a single row from a million-row table. If the table could store 50 rows per page, it would require 20,000 page reads to access a single row. You need a more direct access path to the data. Indexes can provide this direct access path. The tricky part is deciding which indexes to create, and also which type of index to create: clustered or nonclustered. To make the appropriate decisions, you need to know when SQL Server can use indexes and how they are used. You also need to understand the performance and size implications of indexes.
- Index Usage Criteria To effectively determine the indexes that should be created, you need to know whether they'll be used by the SQL Server. If an index isn't being used, it's just wasting space and creating unnecessary overhead during updates. The main criterion to remember is that SQL Server can't use an index defined on a table unless the query contains a column in a valid search argument or join clause that matches at least the first column of the index. You need to keep this in mind when choosing the column order for composite indexes. For example, if you had an index on an employee table as follows: Create index idx1 on employee (division, dept, eml_type) Each of the following queries could make use of the index: Select * from employee Where division = 'accounting' And dept = 102 And empl_type = 'exempt' Select * from employee Where division = 'accounting' And empl_type = 'exempt' Select * from employee Where division = 'accounting' And dept = 102 The following query couldn't use the index because it doesn't specify the first column of the index: Select * from employee Where empl_type = 'exempt' For the index idx1 to be used for the last query, you'd have to reorder the columns so that empl_type was first-but then the index wouldn't be useful for any queries specifying only division and/or dept. To satisfy all queries in this case require defining multiple indexes on the employee table. You may think that the easy solution is to index all columns of a table. While taking up a significant amount of space, this strategy might work in DSS environment. However, too many indexes can have an adverse impact on performance in an OLTP environment.
- Indexes and Performance Although providing a performance benefit for queries, indexes can be a hindrance to good performance for updates. This situation is due to the overhead incurred to keep indexes up-to-date when data is modified, inserted, or deleted. This problem is common in databases that must support both OLTP and Decision Support-type applications. In a Decision System (DSS), having too many indexes isn't much of a performance issue because the data is relatively static. You typically load the data, create the indexes, and forget about it until the next data load. As long as you have the indexes to support the user queries, and they're getting decent response time, the only penalty of having too many indexes is the space wasted for indexes that won't be used and the additional time required to create the excessive indexes. In an OLTP environment, too many indexes can lead to significant performance degradation, especially if the number of indexes on a table exceeds four or five. Every single row inserted is one data page write and one or more index page writes (depending on whether a page split occurs) for every index on the table. With eight nonclustered indexes, that would be a minimum of nine writes to the database. For an update not-in-place of a single row, which is a delete followed by an insert, you'd be looking at potential 17 writes to the database. Therefore, for an OLTP environment, you want as few indexes as possible- typically only the indexes required to support the update transactions and enforce your uniqueness constraints. Meeting the index needs of DSS and OLTP requirements is an obvious balancing act, with no easy solution. It often involves making hard decisions as to which queries have to live with a table scan and which updates have to contend with additional overhead. One solution is to have two separate databases-one for DSS applications and another for OLTP applications. Obviously, this method would require some mechanism to keep the databases in sync. Another possible alternative is to have only the required indexes in place during normal processing to support the OLTP requirements. At the end of the business day, create the indexes necessary to support the DSS queries, which can run as batch jobs after normal processing hours. When the DSS reports are complete, drop the additional indexes, and you're prepared for the next day's processing.
- Index Selection Determining which index to define involves performing a detailed query analysis. This involves examining the search clauses to see what columns are referenced, knowing the bias of the data to determine the usefulness of the index, and ranking the queries in order of importance. Because it's usually not possible to index for everything, index for the queries that are most critical to your applications or those are run frequently by a large number of users. If you have a query that's run only once a month, is it worth creating an index to support only that query, and having to maintain it throughout the rest of the month? If, due to processing requirements, you must have the index in place when the query is run, consider creating the index only when you run the query, and then drop the index for the remainder of the month. This is a feasible approach as long as the time to create the index and run the query that uses the index doesn't exceed the time needed to run the query without the index in place.
- Evaluating Index Usefulness When the SARGs or join clauses in a query match the indexes on a table, SQL Server evaluates all possible indexes, comparing them to each other and to the cost of a table scan to determine the least expensive method to process the query in terms of page I/Os. Also, SQL Server typically uses only one index per table to satisfy a query. The only exception to this rule is when a query contains an or clause. If a query has multiple where clauses that can be satisfied by an index, SQL Server evaluates all alternatives, but uses only the index that will result in the most efficient query plan. How does SQL Server determine the cheapest index to use? For example, how does it know how many rows will be returned by the following query? Select * from table Where key between 1,000,000 and 2,000,000 If the table contains 10,000,000 rows with values ranging between 0 and 20,000,000, how does the optimizer know whether to use an index or table scan? There could be 10 rows in the range, or 900,000. How does SQL Server estimate how many rows are between 1,000,000 and 2,000,000? The optimizer gets this information from the index distribution page.
- The Distribution Page SQL Server keeps distribution information for each index on a separate page in the database called the distribution page. The location of this page is stored in the sysindexes table in the distribution column. The optimizer uses this information to estimate the number of rows that would match the search argument or a join clause for a query. The optimizer stores two types of information in the distribution page: 1. Sample data values for the first column of the index, in a structure similar to (but not the same as) histogram 2. The density/selectivity of the index column The distribution page is built at index-creation time if the table contains data. If the table is empty, no distribution page is created. In addition, the distribution page isn't maintained by SQL Server as data is inserted/ updated/deleted in the table. To update or, if necessary, create the distribution page, the DBA or table owner needs to run the update statistics command. Update statistics table_name [index_name] If only the table name is specified, update statistics updates statistics for all indexes on the table. The general recommendation is that if your data volume changes by more than 10 to 20 percent, or more than 10 to 20 percent of your indexes values have been modified, it's a good time to update statistics.
- What's Stored on the Distribution Page? The distribution page is populated by dividing the width of the first column in the index (plus two bytes of overhead) into the space available on the distribution page, to determine the number of slots it can store on the distribution page. (See figure 12.1 Page 367). The number of distribution slots for a fixed-length column would then be calculated as: (2016 - 2 - (number of key columns * 2)) /(column width + 2) The number of distribution steps for a variable-length column would be determined as follows: (2016 - 2 - (number of key columns * 2)) / (maximum column width + 2) Let's consider an example of an index defined on a single integer column that doesn't allow null values. The number of steps that could stored for this column would be: (2016 - 2 - (1*2))/(4+2) = 2012 / 6 = 335 steps (slots) With 10,000,000 rows in the table and up to 335 steps on the page, we can store a sample value for every: 10,000,000 rows / 335 steps = 29, 850 rows SQL Server then populates the distribution table by walking the index and storing the key value in each step every 29,850 rows, starting with the first row in the table. How does the optimizer use this information to determine the effectiveness of the index? Consider the query: Select * from table Where key between 1,000,000 and 2,000,000 SQL Server estimates the number of rows within the range of values by multiplying the number of steps that the search values are found on or between times the number of rows per step. In this example, SQL Server compares the search values 1,000,000 and 2,000,000 to the values stored on the distribution page. If it finds 1,000,000 at step 157 and 2,000,000 at step 182, the estimated number of rows would be: 182-157=25 steps 25 steps * 29,850 rows per step = 746,250 rows Additionally, because distribution steps are kept only on the first column of the index, the optimizer must use a different method for determining the number of rows matching the specified portions of a multicolumn key. If there is an index on one of these other columns, it examines the distribution page on that index to estimate the selectivity of the search clause on that column alone. Otherwise, the optimizer uses the index density values.
- Index Densities When the optimizer doesn't use distribution steps for equality searches (column=constant), it uses a value called the density. The density is the average proportion of duplicates for the index key(s). For example, an index on a 10,000-row table with 2,500 unique values would have a density of 1/2500 = 0.0004 The index density is applied against the number of rows in the table to estimate the average number of rows that would match any given value. Therefore, any single value compared against the index key on a 10,000-row table with an index density of 0.0004 would be expected to match 10,000 * 0.0004 = 4 rows. The lower the density value, the more selective the index is; the higher the density, the less selective the index. If an index consisted of all duplicates, the density would be 1. For multicolumn indexes, SQL Server now stores multiple densities for each sequential combination of columns. For example, if you had an index on column A, B, and C, SQL Server would store densities for A alone A and B combined A, B, and C combined Typically, the index value should become smaller (that is, more selective) as you add more columns to the index. For example, if the densities were as follows: A 0.05 A, B 0.004 A, B, C 0.0001 With 10,000 rows in the table and a search value compared against A alone, you would estimate it to match: 0.05 * 10,000 = 500 rows if the query provided values for both A and B, you would expect it to match: 0.004 * 10,000 = 40 rows If A, B, C are all specified, the query should match only: 0.0001 * 10,000 = 1 row All density values are stored at the end of the distribution page. At least two densities are stores, even for single-column indexes. One is used for determining join selectivity or search argument with unknown values and is referred to as the alldensity value. This density value is based on all rows in the table, because the join or search value could match a value with a large number of duplicates. The other density is used for search clauses when the value being searched falls between two distribution steps. If there were a few values in the table with a large number of duplicates, the alldensity value could conceivably refer to a greater number of rows than would exist between the two steps. For example, consider a table with 10,000 rows with an index on a char(8) key: 8 bytes plus 2 = 10 bytes per step 1992 bytes / 10 bytes per step = 199 steps 10,000 rows / 199 steps = 50 rows per step Let's assume there are 150 unique values in the table. The alldensity would be 1 / 150 = 0.0067 If a search value was found to be between two steps and the alldensity value was applied, it would estimate 0.0067 * 10,000 = 67. For this reason, SQL Server stores a regular density value that excludes some values with a high number of duplicates-essentially those values that don't span multiple cells in the distribution.
- Index Design Guidelines Now that you have an understanding of how the optimizer uses indexes and index statistics to optimize queries, let's examine some guidelines to consider when developing your index strategy.
- Clustered Index Indications Assigning the clustered index to the primary key may be appropriate if the primary key is the primary access path for that table, but good candidates for the clustered index to consider are 1. Range searches 2. Columns containing a number of duplicate values 3. Columns frequently referenced in an order by 4. Columns other than the primary key referenced in join clauses In most applications, the primary-key column on a table is almost always retrieved in single-row lookups. For single-row lookups, a nonclustered index usually costs you only one I/O more than a similar clustered index. Clustered indexes can improve performance for range retrievals because the clustered index can be used to set the bounds of a search, even if the query involves a large percentage of the rows in the table. Because the data is in sorted order, SQL Server can use the clustered index to find the start and end points within the range, and scan only the data pages within the range. Let's assume that you have 2,000,000 titles in the table, with an estimated 1,000,000 books in the price range between $5 and $10, and you want to run the following query: Select title from titles Where price between $5.0 and $10.0 If you had a clustered index on the table, the rows would be grouped together by price; you could start your search at the first row where price is equal or greater than $5 and scan the rows in order until you find the last row within the range. If there were 40 rows per page, with 1,000,000 rows within the range, it would cost you approximately 1,000,000 / 40 = 25,000 data page reads Plus the number of index page reads required to find the first row in the range. This value should be equal to the number of levels in the index. The same concept holds true for indexes on columns with a large number of duplicates. With a clustered index, the duplicate values are grouped together, minimizing the number of pages that would need to be read to retrieve them. Another good candidate for a clustered index is a column used frequently in queries for sorting the result set. You also want to try to keep clustered indexes on relatively static columns to minimize the re-sorting of data when an indexed column is updated. Any time a clustered index row moves, all nonclustered indexes pointing to that row need to be updated. Clustered indexes can also be more efficient for joins than nonclustered indexes because they're usually much smaller in size. Typically at least one level less. If you require only a single index on a table, it's typically advantageous to make it a clustered index.
- Nonclustered Index Indications Until tables become extremely large, the actual space taken by a nonclustred index is a minor expense compared to the increased access performance. In an OLTP environment, however, you need to remember the impact on performance of each additional index defined on a table. Also, when defining nonclustered indexes, you typically want to define indexes on columns with a low number of duplicates (that is, with low-density values) so that they can be used effectively by the optimizer. A high number of duplicates on a nonclustered index can often make it more expensive (in terms of I/O) to process the query using the nonclustered index than a table scan. Let's look at an example: Select title from titles Where price between $5 and $10 Again, if you have 1,000,000 rows within the range, those 1,000,000 rows could be randomly scattered throughout the table. Although the index leaf level has all the index rows in sorted order, reading all data rows one at a time would require at least 1,000,000 page reads. Thus, the I/O estimate for range retrievals using a clustered index is Number of matching rows + number of index levels + number of index pages to be scanned to find rows In the preceding example, if you have 1,000,000 rows in the table, the page cost estimate would be 1,000,000 data pages + index pages Contrast this with the cost of a table scan. At 40 rows per page, and 2,000,000 rows in the table, a full table scan would cost only 50,000 pages. Therefore, a clustered index (25,000 pages) would be more efficient, but if a clustered index is already defined on a better candidate column, a table scan would actually be more efficient than a nonclustered index. This same principle holds true for nonclustered indexes with a large number of duplicate values (index density is high). As a rule of thumb, nonclustered indexes are more effective if less than 10 to 20 percent of the data is to be accessed via the nonclustered index. Nonclustered indexes can also help improve performance for certain range retrievals by avoiding the need for a worktable (sorting). For example, Select * from authors Where city in ("Boston", "San Francisco", "Chicago") Order by city In general, nonclustered indexes are useful for single-row lookups, joins, queries on columns that are highly selective, or queries with small range retrievals.
- Composite Indexes versus Multiple Indexes Composite indexes are selected by SQL Server to satisfy a query only if at least the first column of the index is specified in a where clause. Also, because of the increased width of composite indexes, the index structure will consist of more levels and pages than a narrower index. At times, it might be more beneficial to have many indexes than one or more larger composite indexes. Having more indexes gives the optimizer more alternatives to look at and possibly derive a more optimal plan. Remember that distribution steps are stored only for the first column in a composite index. On the flip side of things, each additional index has a negative impact on update performance. So how do you determine the optimal index(es) for a table? Let's look at an example: Select pub_id, title, notes from titles Where type = 'Computer' And price > $15 The index candidates include 1. Clustered or nonclustered index on type only 2. Clustered or nonclustered index on price only 3. Clustered index on type, nonclustered index on price 4. Nonclustered index on type, clustered index on price 5. Clustered or nonclustered index on type, price 6. Clustered or nonclustered index on price, type 7. Clustered or nonclustered index on pub_id, title, notes, type, price Which are the best options in which circumstances? The answer is entirely dependent on the data distribution and whether the indexes would be unique enough to be worthwhile (see the example on page 380). As you can see from many of the questions raised, you can't examine just one query and devise the best indexing strategy. You need to examine all queries going against the database and design your indexes appropriately to support multiple queries.
- Indexing for Multiple Queries Indexing for multiple queries involves asking many of the questions raised earlier. It really requires that you sit down and perform a detailed query analysis and transaction analysis for all applications going against a database. After gathering the information, you need to begin ranking the queries/ transactions in order of importance. The more critical queries should get a higher priority. After you've ranked your queries and decided for which ones you're going to tune the system, you need to examine the where clauses and see what types of indexes you need to support those queries. To evaluate the usefulness of the index, you also need to have a good understanding of the data and the data distribution. Don't assume that just because a column is referenced in a where clause, it needs an index. Consider also whether other, already-indexed columns are more selective or whether the index, based on the selectivity, would even be used. Start out by devising an indexing strategy for each query independently. Then begin comparing strategies, looking for overlaps between them, to find indexes that satisfy more than one query/transaction. At this point, you may find that a composite index is more useful, or that multiple smaller indexes would be better. After devising your indexing strategy, go ahead and implement and test it, but be prepared to make changes. Few can predict query performance accurately the first time out.
- Summary The most important aspect to improving SQL Server performance is proper index design. Choosing the appropriate indexes to be used by SQL Server to process queries involves thoroughly understanding the queries and transactions being run against the database, understand the bias of the data, understand how SQL Server uses indexes, and staying aware of the performance implications of overindexing tables in an OLTP environment. In general, consider using clustered indexes to support range retrievals or when data needs to be sorted in clustered index order; use nonclustered indexes for single or discrete row retrievals. Have a good Understanding of the SQL Server query optimizer to know how it uses indexes and index statistics to develop query plans. Keep your index statistics up to date.
Last update March 7, 1998.