Improved reliability, robustness and diagnostics of the SA z/OS communication framework and a new performance option for the automation manager

In IBM Tivoli System Automation for z/OS (SA z/OS) versions 3.1 and 3.2, significant effort has gone into reworking the SA z/OS XCF communication framework to make SA z/OS more reliable and robust during periods of heavy workload. The automation manager (AM) has been enhanced to increase the throughput of work items. Additional diagnostic tools allow you to monitor work item performance and to collect work item tracking information, which makes it easier for IBM Service to analyze the work item life cycle over a very long period of time.

Improved reliability and robustness
With APARs OA24020, OA25016 and OA25744, SA z/OS improves the reliability and robustness of the XCF-based communication framework.
The following problem areas are addressed:

  • AM loops, abends and work item loss that occasionally occur when many resources are started or stopped.
  • Recurrent loss of work items during agent-manager communication, resulting in shutdown and startup failures.
  • Recurrent timeouts during agent-manager communication, resulting in message ING008I.

XCF buffer tuning
With APAR OA25744, the new message INGX1002I is issued if the SA z/OS communication framework cannot send data segments due to a Short On XCF Buffers condition:

INGX1002I SHORT ON XCF BUFFERS - RETRY

SA z/OS attempts to resend the data segment, but the resend might not complete in time, in which case the timeout message ING008I is issued. If this happens, tune the XCF buffer usage, for example, by increasing the number of XCF buffers. SA z/OS uses a buffer size that fits exactly into the XCF buffer size of 4028. For detailed information about how to tune XCF, see the IBM paper Parallel Sysplex Performance: XCF Performance Considerations V3.1.
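The resend-until-timeout behavior described above can be pictured with a small sketch. This is not SA z/OS code: the function send_with_retry, the callback try_send, and the delay values are hypothetical names chosen for illustration, and the real framework's retry policy is not documented here.

```python
import time
from typing import Callable


def send_with_retry(
    try_send: Callable[[], bool],
    timeout_s: float,
    retry_delay_s: float = 0.01,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry try_send() until it succeeds or timeout_s has elapsed.

    try_send returns False to signal a (hypothetical) short-on-buffers
    condition. The result is True if the segment was sent in time and
    False if the deadline passed first, which corresponds to the case
    where a timeout message such as ING008I would be seen.
    """
    deadline = clock() + timeout_s
    while True:
        if try_send():
            return True
        # Give up if another wait would take us past the deadline.
        if clock() + retry_delay_s > deadline:
            return False
        sleep(retry_delay_s)
```

The sketch shows why more buffers help: the sooner try_send succeeds, the less likely the deadline is reached.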

Increased automation manager throughput
With APAR OA20329, you can increase the number of resource status updates, queries, and so on that the automation manager can process. When the option is enabled, the automation manager delays the I/O operations to the takeover file for a specified number of seconds. During this interval, the in-storage pages are only marked to be written. The rationale is that the same page often needs to be written several times within the same interval, so deferring the I/O lets those writes be combined. This significantly reduces the number of I/Os to the takeover file and gives an update work item the same performance characteristics as a query work item.

You switch on the performance option via the following parameter in the parmlib member HSAPRMxx:

IOINTERVAL=5

This sets an I/O interval of 5 seconds: the primary automation manager (PAM) delays writing the buffered data to the takeover file by 5 seconds. You can specify a value between 0 and 10 seconds. The default value is 0, which switches off the performance option.

Be aware that work items processed by the PAM during the I/O interval delay cannot be recovered by AM takeover if the PAM terminates abnormally.
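The effect of the I/O interval can be illustrated with a small simulation. It is a sketch under assumptions, not the automation manager's implementation: the function count_takeover_writes and the flush timing (one flush when the interval that started with the first pending update expires) are hypothetical. Only the 0 to 10 range and the meaning of 0 come from the description above.

```python
from typing import Iterable, Tuple


def count_takeover_writes(
    updates: Iterable[Tuple[float, int]], io_interval: int
) -> int:
    """Count page writes for a stream of (time_in_seconds, page_id) updates.

    With io_interval == 0 every update is written immediately. Otherwise
    updated pages are only marked dirty, and each distinct dirty page is
    written once when the interval that started with the first pending
    update expires (an assumed flush policy).
    """
    if not 0 <= io_interval <= 10:
        raise ValueError("IOINTERVAL must be between 0 and 10 seconds")

    writes = 0
    dirty = set()      # pages marked to be written
    flush_at = None    # time at which the pending pages are written

    for when, page in sorted(updates):
        if io_interval == 0:
            writes += 1
            continue
        if flush_at is not None and when >= flush_at:
            writes += len(dirty)   # interval expired: one write per dirty page
            dirty.clear()
            flush_at = None
        if flush_at is None:
            flush_at = when + io_interval
        dirty.add(page)

    return writes + len(dirty)     # final flush of whatever is still pending
```

For example, five updates that touch only two pages cost five writes with an interval of 0 but fewer with an interval of 5. The pages held in dirty at any moment are the ones that would be lost if the process ended before the flush, which is the trade-off noted above.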

Work item statistics
The INGAMS command shows history information about the work items that have been processed by the automation manager. The automation manager keeps track of the last 500 work items processed by each of the tasks. The work item statistics show:

  • The number of work items queued by the PAM and not yet processed.
  • The CPU time consumed by the PAM.
  • The number of work items processed during the last 10 minutes.
  • The number of milliseconds taken to process a specific work item. This is shown for the last 500 work items.
  • The number of seconds that the task has been processing the current work item (elapsed time). If this number is unexpectedly high, it indicates that something is wrong and that the AM might be hanging while processing the work item.
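A bounded history like the one described above can be modelled with a fixed-size queue. The following is a minimal sketch, not the automation manager's data structure: the class TaskStats and its method names are hypothetical, and only the figures (500 work items, 10 minutes) are taken from the text.

```python
from collections import deque


class TaskStats:
    """Hypothetical per-task statistics, modelled on the list above."""

    def __init__(self, history_size: int = 500) -> None:
        # (completion_time_s, duration_ms) of the most recent work items;
        # the oldest entry is dropped automatically once the deque is full.
        self.history = deque(maxlen=history_size)
        self.current_started_at = None

    def start(self, now: float) -> None:
        """Record that the task began processing a work item."""
        self.current_started_at = now

    def finish(self, now: float) -> None:
        """Record completion of the current work item."""
        if self.current_started_at is None:
            raise RuntimeError("no work item in progress")
        duration_ms = (now - self.current_started_at) * 1000.0
        self.history.append((now, duration_ms))
        self.current_started_at = None

    def processed_in_last(self, now: float, window_s: float = 600.0) -> int:
        """Count retained work items completed within the window."""
        return sum(1 for done, _ in self.history if now - done <= window_s)

    def current_elapsed_s(self, now: float) -> float:
        """Elapsed time of the work item in progress, or 0.0 if idle."""
        if self.current_started_at is None:
            return 0.0
        return now - self.current_started_at
```

A monitoring check would compare current_elapsed_s against a threshold to flag a task that may be hanging. Note that this sketch counts only work items still held in the history, so the 10-minute count is capped at the history size.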

Work item life cycle recording
With APAR OA22431, SA z/OS provides a work item tracking facility that makes it easier for SA z/OS Service to analyze the life cycle of a work item or an order (an order flows from the PAM to the agent and, for example, starts one or more resources) over a very long period of time, with minimum overhead. It helps to track down lost requests during agent-manager communication and other AM-related problems.

SA z/OS customers can enable SA z/OS Life Cycle Recording (LCR) to collect checkpoints during the life cycle of:

  • Work items that flow from the agent to the PAM, including responses
  • Orders that flow from the PAM to the agent

Normally, SA z/OS Service advises whether a certain customer problem requires LCR to be enabled and provides guidance with the process of data collection.

By default, SA z/OS Life Cycle Recording (LCR) is disabled. When enabled, the automation agent (AA) and the primary automation manager (PAM) write life cycle records to a data space (DSP). Each AA and the PAM have their own data space. LCR must be enabled for the PAM and at least one AA. Before LCR can be enabled, the size of the data space must be defined. This can be done with the subcommand LCR ON (see the examples below). The specified number is the maximum size (in megabytes) of the data space. The data space has a small initial size and is extended as needed up to the maximum size. LCR is designed to run over a long period of time. If the data space becomes full, LCR wraps around and overwrites the oldest checkpoints.
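The wrap-around behavior can be sketched as a size-bounded buffer that discards its oldest records when a new one does not fit. This is an illustration only: the class CheckpointBuffer and its methods are hypothetical and say nothing about how the data space is actually laid out.

```python
from collections import deque


class CheckpointBuffer:
    """Size-bounded buffer that overwrites the oldest records when full."""

    def __init__(self, max_bytes: int) -> None:
        if max_bytes <= 0:
            raise ValueError("max_bytes must be positive")
        self.max_bytes = max_bytes
        self.used_bytes = 0          # grows as needed, never past max_bytes
        self._records = deque()

    def write(self, record: bytes) -> None:
        """Store a checkpoint, dropping the oldest ones if space runs out."""
        if len(record) > self.max_bytes:
            raise ValueError("record larger than the whole buffer")
        # Wrap around: drop the oldest checkpoints until the new one fits.
        while self.used_bytes + len(record) > self.max_bytes:
            self.used_bytes -= len(self._records.popleft())
        self._records.append(record)
        self.used_bytes += len(record)

    def save(self) -> list:
        """Return the surviving records, oldest first."""
        return list(self._records)
```

The practical consequence is that a larger maximum size keeps older checkpoints available for longer before they are overwritten.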

When the problem has been captured, the data spaces must be off-loaded into external data sets with the subcommand LCR SAVE. A sequential data set is therefore required for the PAM and for each AA for which LCR has been enabled. You can specify your own data sets or let LCR allocate them automatically.

Examples:
INGRXQRY LCR ON 250;MY.AGENT.DATA.SET
INGRXQRY LCRM ON 250;MY.PAM.DATA.SET
INGRXQRY LCR SAVE
INGRXQRY LCRM SAVE

Finally, the data sets must be forwarded to SA z/OS Service for evaluation and analysis.