Skip to main content

Business Intelligence

Indexing and statistics strategies for DB2 UDB for iSeries

Part 5: Statistics Strategies for Performance Tuning

Let's review why a good statistics strategy is important.

If the correct statistics are collected and available, the SQE optimizer will more accurately estimate the number of rows to process. Better estimates will allow for better query optimization and the selection of the best (performing) query plan for your environment.

Statistics strategies can be separated into two main approaches, namely proactive and reactive. Both approaches can and should employ indexes along with column stats. Given that indexes are used for both statistics and implementation, you can say that the column stats complement a good indexing strategy.

A "proactive" strategy can be used to identify and create column statistics based on the data model, actual data, and the SQL requests expected. Proactively identifying the column stats to collect, and collecting the information is usually a manual process, but a developer can also employ some automatic techniques.

A "reactive" strategy can be used to identify and create column statistics based on the actual data, specific SQL requests and/or the lack of indexes. Reactively identifying the column stats to collect, and collecting the information is usually an automatic process, but a developer can also employ some manual techniques to control when and where this collection occurs.

Statistics on Demand!
As mentioned earlier, the enhancements to DB2 UDB for iSeries include the ability to autonomously recognize the need for a column statistic, generate a request for the statistical data, and then collect the information in the background. This feature can be used in both proactive and reactive statistics strategies.

The automatic statistic collection is controlled by the system value QDBFSTCCOL. This system value is set to *ALL by default. This value allows all requests for background stats collections, whether initiated by the system or initiated by the user.

The statistic collections are handled by the system job QDBFSTCCOL. This job is responsible for processing the requests at a very low run priority with the goal of collecting the column stat(s) as fast as possible without impacting other, higher priority work on the system. The collection process is basically a full table scan of the table that contains the column or columns in which the developer may be interested. Given no other competing work, the time it takes to gather a column stat is equal to or less than the time it takes to perform a query requiring a full table scan. The SQE stats engine has the ability to gather stats in parallel, even if the OS/400 SMP feature is not installed or not enabled. If the opportunity to collect stats presents itself, the engine will automatically take advantage of the available computing resources to collect as much information in the time allowed.

When the SQE optimizer builds a query plan, the plan and associated information is stored in a central location called the SQE plan cache. Information about the statistical questions, answers, and sources are also stored in the plan cache. This information is used to initiate the automatic stats collections. Since the plan cache is cleared during an IPL of the system or logical partition, any requests for column stat collections that have not been satisfied will be lost. Be sure to consider this behavior as the statistic collection strategy is constructed.

Proactive Strategy
When tuning a SQL-based application or queries proactively, the developer can use the automatic stats collection to their advantage. Before going "live" with the application or queries, ensure a good indexing strategy is in place, then exercise the application or SQL requests with the system value QDBFSTCCOL set to *NONE. This will allow the SQE optimizer to get a good look at the SQL requests, and more importantly the statistical sources available. If some statistics are unavailable, SQE will queue up the appropriate stats collection requests. Once all the application logic and/or queries have been executed, change the system value QDBFSTCCOL to *ALL and allow the system to go off and collect the column stats. This technique has the advantage of allowing different columns from the same table to be processed with one scan of that table object. Another value of this particular proactive approach is it allows the developer to "control" when the auto stats collection occurs — instead of just having it collecting in background with no one aware it is running.

Another proactive technique is to manually request the column stats based on analysis of the data model, data, and SQL within the application. Once a good indexing strategy is in place, consider creating column stats to complement the indexes and/or cover gaps in the indexing strategy (columns that are good candidates for stats collection will be covered later in this paper). Use iSeries Navigator - Database to select the table and column(s) on which the developer want statistics collected. The collection(s) can occur immediately (running within your OS/400 job, at your job's priority) or can be submitted to the background and run at a lower priority. When the statistics are collected, the developer are ready to go live with the application and queries.

A proactive approach to identifying and gathering column statistics is crucial if:

· The developer requires the application and queries to run their very best "right out-of-the-box."
· The developer is unsure of the indexing strategy or have a sub-optimal indexing strategy.
· The developer expects little or no computing resources to be available for the background stats collection(s) during normal operations.

Reactive Strategy
When tuning SQL-based applications or queries reactively, the developer can also use the automatic stats collection to their advantage. As a matter of fact, this is the main reason stats collection is in place. By default, the system will identify and gather stats as the application and queries are executed. Over time, the optimizer will have more, and better, information about the data.

Once again, before going "live" with the application or queries, ensure a good indexing strategy is in place. Set the system value QDBFSTCCOL to *ALL or *SYSTEM. As the SQL requests are optimized, the stats engine will interrogate the statistical sources available. If there are some statistics that are unavailable, SQE will queue up the appropriate stats collection requests.

Another reactive technique is to manually request the column stats based on analysis of the data and the SQL requests that have been optimized. Using one of the query analysis and feedback tools is very helpful. Visual Explain and the database monitor have both been enhanced to provide feedback on which columns should have statistics collected. In addition to an "index advisor," Visual Explain also has a "statistics advisor." These advisors are based on the optimizer feedback for a given SQL request. The database monitor can produce a new row (column QQRID = 3015) that contains feedback on which column(s) should have a stat collected. Once the query has been analyzed, use the feedback to determine whether an index will be best, or a column stat will be adequate. Request the column stat manually using iSeries Navigator - Database, or the Visual Explain statistics advisor. Rerun the query to see what changed based on the presence of the additional information.

Identifying Columns with Statistics
After proactive or reactive stats collections, the developer can identify and document which tables have column stats by using iSeries Navigator to display each table's statistical data. Better yet, the stats engine APIs can be used to programmatically identify and list the tables and column stats. An example utility can be found at: ibm.com/servers/eserver/iseries/db2/statsscriptprocedure.html

The APIs and this utility can also be used to build a script that creates a batch of column statistics requests. This will be helpful if the developer has tuned an application on one system, and then wants to migrate or install the application on another system. A script of DDL SQL statements can be used to create the physical data model (tables and indexes), and the utility can be used to construct and run a set of column stat collection requests.

What Columns Should Have Stats Available?
Now that you know when and how to collect stats, the next question is, what columns should have stats available?

The answer depends on the data model, the data, and the SQL requests. It also depends on the indexing strategy employed.

In general, statistics should be available for columns used in local selection, joining, grouping, ordering, and distinct processing. Keep in mind that providing the optimizer with a new column stat does not necessarily mean that the optimizer will re-optimize the query and build a new access plan.

Let's take a look at local selection (find all the rows where a column equals something in particular). If the data in the column has high cardinality (lots of different values) and the query is requesting one value or a very small set of values, a radix index is most likely the best object to use to identify and select the associated rows. As it turns out, this same radix index is also well suited to give the optimizer the statistics it needs for optimization. In this case, a column stat is redundant, and not as beneficial. If, on the other hand, the query is selecting a large set of values, using a radix index may not be the best method to find the statistical information since this process involves reading some number of keys from the index. The more values requested by the query, the more keys to read. In this case, a column stat would be beneficial since the statistical information can be obtained much faster.

In the following example the query is using four columns for local selection...

SELECT CUSTOMER, CUSTOMER_NUMBER, ITEM_NUMBER
	FROM ITEMS 
	WHERE YEAR = 2000
	AND QUARTER = 4        
	AND RETURNFLAG = 'R'                  
 	AND SHIPMODE = 'AIR'

Let's assume the developer has a radix index over two of the columns, YEAR and QUARTER. The optimizer can use this index to obtain statistics on YEAR = 2000 and QUARTER = 4, but what about RETURNFLAG = 'R' and SHIPMODE = 'AIR'? Since there are no stats sources available for these two columns, the optimizer will use default values. Given that the index is sub-optimal (only containing two of the four local selection columns), a set of column stats will be beneficial, and will complement the index. Creating a column stat for RETURNFLAG and SHIPMODE will allow the optimizer to understand the selectivity of all the columns and make the appropriate choice of using the two-key column index or using a table scan. Without the additional column stats, the optimize might incorrectly conclude that the selectivity is high. Thus, using the index will yield the best performance.

Another major use of column stats is to accurately predict the size of temporary data structures used when joining, grouping, and determining distinct sets of values. The optimizer can choose to use an index for joining, grouping, and distinct processing; or, it can use various data structures. Let's assume the optimizer chooses to use a hash table for grouping. To accurately cost the use of this method, and to determine the optimal size of the hash table, statistics about the associated grouping column(s) must be available. If the grouping columns are from more than one table, a permanent index cannot be created to satisfy the query. Once again, a set of column stats can be used to provide this information — without the overhead of carrying additional indexes just to provide stats. The following example illustrates this:

SELECT C.CUSTOMER, I.RETURNFLAG, SUM(I.QUANTITY)
	FROM		ITEMS I,
			CUSTOMERS C
	WHERE	I.YEAR = 2000
	AND 		I.QUARTER = 4        
	AND 		I.CUSTKEY = C.CUSTKEY
	GROUP BY 	C.CUSTOMER,
			I.RETURNFLAG

To assist the optimizer, create column stats on table CUSTOMERS - column CUSTOMER, and table ITEMS - column RETURNFLAG.

Since column stats are derived and stored on a per-column basis, correlation between columns is not well represented. If you know that there is a strong correlation between two or more columns, it would be best to create an index using those columns. This will provide the optimizer with a more complete picture of the data. An example of this concept is the relationship between automobile make and model. While there are many different makes, and many different models, there exists a strong correlation between the model of the automobile and the make of that model. Only Ford produces the Mustang, and only Chevrolet produces the Camaro.

Maintaining Statistics
Obviously as the data changes, the statistics about the data must also change. Otherwise the query plans can get outdated and become sub optimal. One big difference between indexes and column stats is the fact that indexes are maintained immediately — as rows are inserted, updated, or deleted. Column stats, on the other hand, are not maintained immediately, and this can factor into the statistics strategy.

To keep the column stats from getting too far out-of-date, the statistics manager keeps track of the point at which the column stat was generated, and the subsequent changes to the table. Once a significant delta has been reached, the stats are marked as "stale." Today, this delta is approximately 15%. When a column stat is marked stale, it is still used during optimization, but will cause a stats refresh operation to be requested. The scenario might go something like this:

· No stats exist for a particular column used by SQL query "X."
· SQL query "X" is issued and the optimizer uses default stats, and requests column stats be generated.
· Column stats are generated in the background.
· SQL query "X" is issued and the optimizer uses the new column stats.
· Data processing occurs that inserts, updates, and/or deletes rows.
· SQL query "X" is issued and the optimizer marks the stats stale, uses the stale stats, and requests the column stats to be refreshed.
· Column stats are refreshed in the background.
· SQL query "X" is issued and the optimizer uses the refreshed column stats.

Based on this behavior, here are some recommendations for different environments.

On-line Transaction processing (OLTP)
In this environment, the data is changing constantly, and many of the SQL requests will access a narrow range of rows. A good indexing strategy (employing radix indexes) is best for statistics and implementation. Use the automatic stats collection to supplement the indexes. If the developer wants to take a more hands-on approach and control or limit the stats collection(s) during the transaction processing period(s), consider setting the system value QDBFSTCCOL to *NONE or *USER. If there is a window of idle computing resources at the end of the day or week, set the system value back to *ALL or *SYSTEM during this period and allow the statistics manager to catch up the column stats. Set the system value back to *NONE or *USER after the stats have been generated or the window of time runs out. Analyze the column stats collected and consider using this information to modify or enhance the indexes.

Business Intelligence (data warehousing)
In this application scenario, the data is normally only changed at regular intervals of time, such as end-of-day, end-of-week, and / or end-of-month. The SQL requests will be ad-hoc and varied, accessing both narrow and wide ranges of rows. A good indexing strategy (employing radix and EVI indexes) is required for implementation, but a good stats strategy will be important for those cases where the optimizer chooses alternate methods of access and processing (i.e., table scans, joining and grouping via hashing, etc.). Use the automatic stats collection, to supplement the indexes, but plan on refreshing the column statistics as part of your daily, weekly, or monthly load process. This can be done programmatically using the stats APIs. After the data has been loaded, and the column stats refreshed, continue to use the automatic stats collection to pick up any column stats that might have been overlooked. Analyze the column stats collected and consider using this information to modify or enhance your indexes. It is plausible that some columns will be covered by both indexes and column stats.

Operational Data Store (ODS) and Transaction Reporting
In this situation, some of the data may be changing constantly as it trickles in from the transactional systems, while other data may be changed periodically. The SQL requests will be varied, accessing both narrow and wide ranges of rows. A good indexing strategy (employing radix and EVI indexes) is required for stats and implementation. Column stats will be important for those cases where the optimizer chooses alternate methods of access and processing (i.e., table scans, joining and grouping via hashing, etc.). Use the automatic stats collection to supplement the indexes. If the developer wants to take a more hands-on approach and control or limit the stats collection(s) during the transaction reporting period(s), consider setting the system value QDBFSTCCOL to *NONE or *USER. If there is a window of idle computing resources at the end of the day or week, set the system value back to *ALL or *SYSTEM during this period and allow the statistics manager to "catch up" the column stats. Set the system value back to *NONE or *USER after the stats have been generated or the window of time runs out. Analyze the column stats collected and consider using this information to modify or enhance the indexes. It is plausible that some columns will be covered by both indexes and column stats.

Temporary Tables
If the environment creates and uses temporary tables, and these tables come and go frequently, consider identifying and collecting column stats on the permanent tables prior to using the application(s). Turn off the automatic stats collection and avoid generating requests against the transient temporary tables. At the end of the day or week, consider turning on the automatic stats collection for some period of time. This will allow the statistics manager to handle any legitimate requests for collection of stats.

One way to do this is to use the job scheduling function built into OS/400. Here is a simple example that allows the automatic stats collection to occur everyday between 3:30 AM and 5:30 AM:

ADDJOBSCDE JOB(STARTSTATS)
   CMD(CHGSYSVAL SYSVAL(QDBFSTCCOL) VALUE(*ALL))
   FRQ(*WEEKLY) SCDDATE(*NONE) SCDDAY(*ALL) SCDTIME('3:30')     
   ADDJOBSCDE JOB(STARTSTATS)
   CMD(CHGSYSVAL SYSVAL(QDBFSTCCOL) VALUE(*USER))  
   FRQ(*WEEKLY) SCDDATE(*NONE) SCDDAY(*ALL) SCDTIME('5:30') 

More information about job scheduling and the CL commands can be found on the Web at the iSeries Information Center: http://publib.boulder.ibm.com/html/as400/infocenter.html

[Back | Next]