Skip to main content

Business Intelligence

Indexing and statistics strategies for DB2 UDB for iSeries

Part 2: Indexing Strategies for Performance Tuning

Now that indexes and their functions have been discussed, let's talk about how to use them most effectively.

There are two approaches to index creation: proactive and reactive. As the name implies, proactive index creation involves anticipating which columns will be most often used for selection, joining, grouping, and ordering; and then building indexes over those columns. In the reactive approach, indexes are created based on optimizer feedback, query implementation plan, and system performance measurements.

In practice, both methods will be used iteratively. As the numbers of users increase, more indexes are useful. Also, as the users become more adept at using the application, they might start using additional columns that will require more indexes.

The following section provides a tested proactive approach for index creation. Use these techniques as a starting point, and then add or delete indexes reactively after user and system behavior have been monitored.

A General Approach
It is useful to initially build indexes based on the database model and the application(s), in lieu of creating the indexes because of the needs of any particular query. As a starting point, consider designing basic indexes founded on the following criteria:

· Primary and foreign key columns based on the database model
· Commonly used local selection columns, including columns that are dependent, such as the make and model of an automobile
· Commonly used join columns not considered to be primary or foreign key columns
· Commonly used grouping columns

After analyzing the database model, consider the database requests of the application and the actual SQL. A developer can add to the basic index design and consider building some perfect indexes that incorporate the selection, join, grouping, and ordering criteria.

The "perfect index" is defined as a binary radix index that provides the optimizer with useful and adequate statistics, and multiple implementation methods — taking into account the entire query request.

A Proactive Approach
Returning to the analogy of the recipe, the goals of indexing are to give the optimizer:

· Information about ingredients or the data contained within the tables, such as the number of distinct values, the distribution of data values, and the average number of duplicate values.
· Choices about which cooking instructions to assemble, or which methods to use to process the query. In many recipes, the cooking method could be steaming, frying, or broiling. The choice depends on the desired result. In the same way, the optimizer has different methods available and will pick the appropriate method based on what it knows about the available ingredients and the desired result.

Before beginning the proactive process, the database model and a set of sample queries that will run against the database are needed. These queries will generally have the following format:

With a query like this, the proactive index creation process can begin. The basic rules are:

· Custom-build a radix index for the largest or most commonly used queries.
        Example using the query above:
              radix index over join column(s) - a.join_col and b.join_col
              radix index over most commonly used local selection
              column(s) - b.col2
· For ad-hoc on-line analytical processing (OLAP) environments or less frequently invoked queries, build single-key EVIs over the local selection column(s) used in the queries.
        Example using the query above:
              EVI over non-unique local selection columns - b.col1 and b.col2

Clearly, these are general rules whose specific details depend on the environment. For example, the "most commonly used" queries can consist of three or 300 queries. How many indexes that are built depends on user expectations, available disk storage, and the maintenance overhead.

Perfect Radix Index Guidelines
In a perfect radix index, the order of the columns is important — even making a difference as to whether the optimizer uses the index for data retrieval at all. As a general rule, order the columns in an index in the following way:

· Equal predicates first. That is, any predicate that uses the "=" operator may narrow down the range of rows the fastest and should therefore be first in the index.
· If all predicates have an equal operator, then order the columns as follows:
· Selection predicates + join predicates
· Join predicates + selection predicates
· Selection predicates + group by columns
· Selection predicates + order by columns

In addition to the guidelines above, in general, the most selective key columns should be placed first in the index.

A binary radix index can be used for selection, joins, ordering, grouping, temporary tables, and statistics. When evaluating data access methods for queries, create binary radix indexes with keys that match the query's local selection and join predicates in the query WHERE clause. A binary radix index is the fastest data access method for a query that is highly selective and returns a small number of rows.

As stated earlier, when creating a binary radix index with composite keys, the order of the keys is important. The order of the keys can provide faster access to the rows. The order of the keys should normally be local selection and join predicates, or, local selection and grouping columns (equal operators first and then inequality operators). Binary radix indexes should be created for predetermined queries or for queries that produce a standard report. A binary radix index uses disk resources; therefore, the number of binary radix indexes to create is dependent upon the system resources, size of the table, and query optimization.

The following examples illustrate these guidelines:

Example 1: A one-table query
This query uses the table ITEMS and finds all the customers who returned orders at year end 2000 that were shipped via air. It is assumed that longstanding customers have the lowest customer numbers.

SELECT CUSTOMER, CUSTOMER_NUMBER, ITEM_NUMBER
	FROM ITEMS 
	WHERE YEAR = 2000
	AND QUARTER = 4    
	AND RETURNFLAG = 'R'                  
 	AND SHIPMODE = 'AIR'
	ORDER BY CUSTOMER_NUMBER, ITEM_NUMBER

The query has four local selection predicates and two ORDER BY columns. Following the guidelines, the perfect index would put the key columns covering the equal predicates first (YEAR, QUARTER, RETURNFLAG, SHIPMODE), followed by the ORDER BY columns CUSTOMER_NUMBER, ITEM_NUMBER.

To determine how to order the key columns covering equal local selection predicates, evaluate the other queries that will be running. Place the most commonly used columns first and/or the most selective columns first, based on the data distribution.

Example 2: A three-table query
Star schema join queries use joins to the dimension tables to narrow down the number of rows in the fact table to produce the result set in the report. This query finds the total first quarter revenue and profit for two years for each customer in a given sales territory.

SELECT T3.YEAR, T1.CUSTOMER_NAME,
SUM(T2.REVENUE_WO_TAX), SUM(T2.PROFIT_WO_TAX) 
	FROM CUST_DIM T1, SALES_FACT T2, TIME_DIM T3
	WHERE T2.CUSTKEY=T1.CUSTKEY
	AND T2.TIMEKEY = T3.TIMEKEY
	AND T3.YEAR IN (2001, 2000)
	AND T3.QUARTER = 1 
	AND T1.CONTINENT='AMERICA' 
	AND T1.COUNTRY='UNITED STATES' 
	AND T1.REGION='CENTRAL' 
	AND T1.TERRITORY='FIVE'
	GROUP BY T3.YEAR, T1.CUSTOMER_NAME 
	ORDER BY  T1.CUSTOMER_NAME, T3.YEAR

This query has two join predicates and six selection predicates. The first task is to focus on the selection predicates for each table in the query.

For the time dimension table TIME_DIM, the query specifies two local selection predicates. The index over the time dimension table should contain YEAR and QUARTER first, followed by the join predicate column TIMEKEY.

For the customer dimension table, the query specifies four local selection predicates. These predicates are related to each other along a geographical hierarchy (territory-region-country-continent). Since all the predicates are equal predicates, the order of the index keys for these predicates should follow the hierarchy of the database schema. The index over the customer dimension table should contain CONTINENT, COUNTRY, REGION, TERRITORY, followed by the join predicate column CUSTKEY.

For the fact table SALES_FACT, the two columns in the WHERE clause are TIMEKEY and CUSTKEY. Since both TIMEKEY and CUSTKEY are used as join predicates, the guidelines recommend two indexes, each with the respective join column: TIMEKEY and CUSTKEY. This will allow the optimizer to obtain statistics and to cost all the possible join orders.

According to the guidelines, an index should cover the columns from the group by clause and the order by clause. Because the group by and order by clauses use columns from two different tables, the query may be implemented in two steps (i.e., the selection and join must be completed prior to the grouping and ordering). The optimizer and database engine cannot take advantage of an existing index for grouping or order, so creating an index over GROUP BY and ORDER BY columns is required.

More information on star schema join optimization can be found in the paper, Star Schema Join Support within DB2 UDB for iSeries, at: ibm.com/servers/enable/site/education/ibo/record.html?star

Example 3: Non-equal predicates in a query
Predicates with inequalities tend to return more rows than predicates with equality operators. For example, if a user requests all the rows where the date is between a start and an end point, such as the beginning and end of a quarter, then the query may return more rows than if the user asked for a specific day or week. Because an inequality predicate implies a range of values instead of a specific value, the optimizer makes different decisions about how to build the access plan. The following query asks for the same report as the query in the previous example, but the YEAR local selection predicate is much less specific:

SELECT T3.YEAR, T1.CUSTOMER_NAME,
SUM(T2.REVENUE_WO_TAX), SUM(T2.PROFIT_WO_TAX) 
	FROM CUST_DIM T1, SALES_FACT T2, TIME_DIM T3
	WHERE T2.CUSTKEY=T1.CUSTKEY
	AND T2.TIMEKEY = T3.TIMEKEY
	AND T3.YEAR < 2001
	AND T3.QUARTER = 1 
	AND T1.CONTINENT='AMERICA' 
	AND T1.COUNTRY='UNITED STATES' 
	AND T1.REGION='CENTRAL' 
	AND T1.TERRITORY='FIVE'
	GROUP BY T3.YEAR, T1.CUSTOMER_NAME 
	ORDER BY T1.CUSTOMER_NAME, T3.YEAR

In the previous example, the best index over the time dimension table was one built over the two local selection predicates first, then the join predicate. Here, one of the selection predicates is using a 'less than' operator, while the join predicate is an 'equal' predicate. Because 'equal' predicates provide the most direct path to the key values, an index over time dimension for this query would be QUARTER, TIMEKEY, YEAR. This produces a logical range of key values that the database engine can position to and process contiguously.

Encoded Vector Index Guidelines
EVIs are primarily used for local selection on a table; they can also provide the query optimizer with accurate statistics regarding the selectivity of a given predicate value. EVIs cannot be used for grouping or ordering and have very limited use in joins. When executing queries that contain joins, grouping, and ordering; a combination of binary radix indexes and EVIs may be used to implement the query. When the selected row set is relatively small, a binary radix index will usually perform faster access. When the selected row set is roughly between 20% and 70% of the table being queried, table probe access using a bitmap, created from an EVI or binary radix index will be the best choice. Also, the optimizer and database engine have the ability to use more than one index to help with selecting the data. This technique may be used when: the local selection contains AND or OR conditions, a single index does not contain all the proper key columns, or a single index cannot meet all of the conditions. Single key EVIs can help in this scenario since the bitmaps or RRN lists created from the EVIs can be combined to narrow down the selection process.

Example 1: A one-table query
Recall that this query uses the table ITEMS and finds all of the customers who returned orders at year end 2000 that were shipped via air. It is assumed that longstanding customers have the lowest customer numbers.

SELECT CUSTOMER, CUSTOMER_NUMBER, ITEM_NUMBER
	FROM ITEMS 
	WHERE YEAR = 2000
	AND QUARTER = 4        
	AND RETURNFLAG = 'R'                  
 	AND SHIPMODE = 'AIR'
	ORDER BY CUSTOMER_NUMBER, ITEM_NUMBER

The query has four local selection predicates and two ORDER BY columns. Following the EVI guidelines, single key indexes would be created with key columns covering the equal predicates EVI1 — YEAR, EVI2 — QUARTER, EVI3 — RETURNFLAG, EVI4 — SHIPMODE. The optimizer will determine which of the indexes will be used to generate dynamic bitmaps. Based on this query, the bitmaps will be ANDed together, and table probe access will be used to locate and retrieve the rows from the ITEMS table.

If another similar query was requested, the same EVIs could be used.

SELECT CUSTOMER, CUSTOMER_NUMBER, ITEM_NUMBER
	FROM ITEMS 
	WHERE YEAR = 2000
	AND MONTH IN (1, 2, 3)       
	AND RETURNFLAG = 'R'                  
 	AND SHIPMODE = 'RAIL'

In this case, EVI1 — YEAR, EVI3 — RETURNFLAG, EVI4 — SHIPMODE could be used. The local selection on MONTH would be satisfied by reading and testing the data in the row identified by the other selection criteria.

Proactively Tuning Many Queries
The previous examples assume that an index is being built to satisfy a particular query. In many environments, there are hundreds of different query requests. In business intelligence and data warehousing environments, users have the ability to modify existing queries or even create new ad-hoc queries. For these environments, it is not possible to build the perfect index for every query.

By applying the indexing concepts previously discussed, it is possible to create an adequate number of binary radix indexes to cover the majority of problem areas, such as common local selection and join predicates. For ad-hoc query environments, it is also possible to create a set of radix and EVI indexes that can be combined using the index ANDing / ORing technique to achieve acceptable response times. The best approach will be to create an initial set of indexes based on the database model, the application and the user's behavior, and then monitor the database activity and implementation methods.

Reactive Query Tuning
The reactive approach is very similar to the Wright Brothers' initial airplane flight experiences. Basically, the query is put together, pushed off a cliff, and watched to see if it flies. In other words, build a prototype of the proposed application without any indexes and start running some queries. Or, build an initial set of indexes and start running the application to see what gets used and what does not. Even with a smaller database, the slow running queries will become obvious very quickly.

The reactive tuning method is also used when trying to understand and tune an existing application that is not performing up to expectations.

Using the appropriate debugging and monitoring tools, which are described in the next section, the database feedback messages that will tell basically three things can be viewed:

· Any indexes the optimizer recommends for local selection
· Any temporary indexes used for a query
· The implementation method(s) that the optimizer has chosen to run the queries

DB2 UDB for iSeries includes an index advisor, which is a built-in tool that recommends permanent indexes. The index advisor messages in the joblog can be viewed through iSeries Navigator or OS/400 command Display Job Log (DSPJOBLOG), by querying the Database Monitor data, or by using Visual Explain.

If the database engine is building temporary indexes to process joins or to perform grouping and selection over permanent tables, permanent indexes should be built over the same columns, and try and eliminate the temporary index creation. In some cases, a temporary index is built over a temporary table, so a permanent index will not be able to be built for those tables. The same tools can be used to note the creation of the temporary index, the reason the temporary index was created, and the key columns in the temporary index.

Understanding the queries implementation method(s) will also allow a focus on other areas that affect database performance, such as: system resources, application logic, and user behavior.

The following table outlines a few problem scenarios and offers suggestions for interpreting the recommendations of the optimizer and improving performance.

SituationOptimizer recommendsThe developer should:
There are no indexes built over the query's tables.Build an index for local selection.Build an index over the selection, join, and/or grouping fields.
You have built an index over some local selection columns or join columns.Build an index for local selection.Build an index over all the local selection columns, include the join columns.
You have built an index over all the selection fields and performance is a little better but still not acceptable.NothingReorder the columns in the index to place the most selective, equal predicates first, include the join columns.
You have built the perfect index and the optimizer will not use it.NothingUse a database evaluation tool to determine which access method the optimizer selects. The optimizer might have determined that a full table scan is more efficient.
You have built an index that contains all of the relevant columns but the optimizer does not use it.Build an index that contains the same columns but lists them in a different order.Build an index with the columns ordered according to the optimizer recommendations.
You have built all the recommended indexes, yet debug messages still indicate that the optimizer's query estimate is not at all close to actual query run times.NothingAdd the following clause to your SQL statement:

OPTIMIZE FOR ALL ROWS1
You have a query that contains several inequalities and/or the selection predicates are separated by OR conditions.Build a radix index over the first few columnsBuild single column indexes over each column, which will encourage the database to use dynamic bitmaps and index ANDing/ORing.

The main idea behind all of these recommendations is to give the optimizer as much information as possible about the tables and columns with which you are working. Remember, with DB2 UDB for iSeries, statistical information can be provided to the optimizer by creating indexes.

Note: The optimizer may generate a different access plan based on the user's information regarding optimizing for all rows or a subset of rows. Refer to the product documentation on how to use this clause.

Creating Indexes for Multi-table Queries (Joins)
When running queries that join tables, the need for the right indexes becomes even more critical. It is very important to analyze optimizer feedback messages to understand the query implementation and whether or not the query response time reflects building temporary indexes.

If the system is building temporary indexes over permanent tables, it is probably because the indexes are required to process a nested loop join. Nested loop join via index requires indexes over the join keys. The good news is that the optimizer will request an index be created to complete the query. The bad news is that the user must wait while the index is created, and the index is deleted when the query completes. If the same query is executed again, the temporary index will be recreated. If 20 users are running the same query, each user will have a temporary index created.

If a temporary index is being created over a permanent table, at a minimum, you should build a permanent index over the same key columns as the temporary index.

The Index Advisor might also recommend indexes that are different from the temporary indexes being created. The advisor looks only at the local selection predicates in the WHERE clause. For example, in a two-table join query, the first table might be accessed with a full table scan, and the second table might be accessed via a temporary index. The optimizer will recommend an index for the local selection of the table and provide feedback on the building of the temporary index for the nested loop join. Consider building an index over any columns recommended by the Index Advisor, and build an index like the temporary index for the nested loop join.

These recommendations may yield several indexes, all designed for the same query. As part of the analysis, run the query again with all of the recommended indexes present. If the desired results are still not being achieved, consider creating a radix index that combines all of the columns in the following priority: selection predicates with equalities, join predicates, then one selection predicate defined with inequalities.

If the WHERE clause contains only join predicates, ensure that a radix index exists over each table in the join. The join column(s) must be in the primary or left-most position of the key.

Tuning for One Query versus Tuning for Many
As you proceed through this iterative process, you will begin to see how indexes can be built that will tune many queries at one time. Start with one query and tune it. Then look at two or three. Find the columns that are used in all of the queries and build indexes over those fields. Picking the right columns and getting them in the right order will become more intuitive and productive.

Other Indexing Tips

· Avoid null-capable key columns if expecting to use index-only access. The index-only access method is not available in the Classic Query Engine (CQE) when any key in the index is null-capable.

· Avoid using derived expressions in a local selection or join condition. Access via an index may not be used for predicates that have derived values. Or, a temporary index will be created to provide key values and attributes that match the derivative. For example, if a query includes one of the following predicates:

WHERE SHIPDATE > (current_date - 10) or
UPPER(customer_name) = 'SMITH'


...the optimizer considers that predicate to be a derived value and may not use an index for local selection.

· Index access is not used for predicates where both operands are from the same table. For example, if a query includes the following statement:

WHERE SHIPDATE > ORDERDATE

...the optimizer will not use an index to retrieve the data since it must access the same row for both operands.

· Consider Index Only Access (IOA). If all of the columns used in the query are represented in the index as key columns, the optimizer can request index only access. With IOA, DB2 UDB for iSeries does not have to retrieve any data from the actual table. All of the information required to implement the query is available in the index. This may eliminate the random access to the table and may drastically improve query performance.

· Use the most selective columns as keys in the index, adding one key column used with inequality comparisons. Because of how the optimizer processes inequalities as ranges of values, there is little benefit in putting more than one inequality predicate value into an index.

· For key columns that are unique, specify UNIQUE when creating the index. A primary key constraint will produce an index with unique keys.

[Back | Next]