zSeries Maintenance Suggestions to Help Improve Availability in a Parallel Sysplex Environment

Author: Barbara J. Bryant

IBM Corporation
zSeries Software Service
Poughkeepsie, NY
Phone (external): (914)-435-4027
Phone (internal): 8-295-4027
email: BRYANTB@US.IBM.COM
http://www.ibm.com/servers/eserver/zseries/library/whitepapers/psos390maint.html

July 20, 2001


Notices

IBM is providing information in this document to assist customers with planning a Preventive Maintenance process to help ensure the highest levels of availability. IBM does not guarantee the results.

All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice and represent goals and objectives only.


Trademarks

The following terms are trademarks of the IBM Corporation in the United States or other countries or both:

  • CBPDO
  • CICS
  • DB2
  • IMS
  • OS/390
  • ESO
  • IBM
  • Parallel Sysplex
  • S/390
  • ServicePac

Table of Contents

Introduction

Maintenance Considerations in a Parallel Sysplex

HIPER and Special Attention APARs

Reducing Amount of Maintenance to Install

RSU Integration Testing

OS/390 Enhanced HOLDDATA

Maintenance Process Suggestions for S/390 Products

Conclusion


Introduction

This paper contains information to assist customers with planning a preventive maintenance process to help attain higher availability in a Parallel Sysplex environment. The basic concepts apply to non-parallel environments, as well. This paper assumes that the reader has a basic understanding of the S/390 Software maintenance installation process and service deliverables.

Outage analysis has shown that about 15% of outages affecting multiple systems/subsystems could have been avoided by better preventive maintenance practices. That is, the cause of the outage was fixed by a PTF that had been available for six months or more.

IBM provides service for current S/390 brand products, including OS/390. PTFs are available for corrective and preventive maintenance. The most common service deliverables used for preventive maintenance are the Enhanced Service Offering (ESO), which provides monthly PER closed PTFs, and CBPDO, which, in addition to the monthly PER closed PTFs, includes COR closed reach-ahead fixes for HIPERs and PEs on a weekly basis. Maintenance can also be updated using the ServicePac Offering or with the SystemPac's Selective Follow On Service tapes (HIPERs and PTFs resolving PEs).

S/390 Service Update Facility (SUF) is a new Internet-based service maintenance tool to order and receive corrective and recommended preventive maintenance. SUF uses the customer's CSI as input to tailor the service to the customer's environment. For more information on SUF, visit the SUF web site at ibm.com/servers/eserver/zseries/zos/suf/.

IBM recommends that all OS/390 customers have a well-defined process to install preventive maintenance on a regular basis. Before defining a preventive maintenance process, it is helpful to understand the key aspects of S/390 Service, including HIPER and Special Attention APARs, notification of high impact APARs, and Recommended Service Upgrade (RSU), which have become an integral part of the OS/390 strategy.

Installing maintenance is a labor-intensive task for system programmers, many of whom are busy providing system support and fighting fires. Much "installation" time is spent researching PTFs that are held for special actions to be taken. Researching HIPER and PE APARs is also very time consuming. However, availability in a Parallel Sysplex environment depends on a proactive maintenance process. The resources required to support the maintenance process are a trade-off between the work effort and system availability.


Maintenance Considerations in a Parallel Sysplex

Since continuous availability is a prime objective of many Parallel Sysplex configurations, installing preventive maintenance to avoid known defects is a key component to meeting those objectives. To have the least impact on Parallel Sysplex availability, it is recommended that maintenance be installed and activated on one system at a time. This is known as rolling IPLs. The amount of time between rolling IPLs may vary depending on the amount of maintenance and the urgency to get the maintenance installed on all systems in the Parallel Sysplex configuration.

Note that the dynamic LPA function is not intended for IBM service updates. Pointers to LPA modules for system address spaces may be stored internally, making dynamic LPA updates ineffective in these cases. Dynamic LPA also requires CSA storage for the updated modules. For corrective service, an LLA refresh may be sufficient to activate a fix to a linklst module.


Cloned Applications in a Data Sharing Environment

Cloning applications to run in a datasharing environment on multiple systems can improve availability during scheduled and unscheduled outages. Data Sharing helps minimize the impact of outages that occur on a single system. Using rolling IPLs to activate maintenance on one system at a time avoids an outage to cloned applications because the applications are available on at least one other system in the Parallel Sysplex configuration.


No Sysplex-wide IPLs Intended for Service

A homogeneous Parallel Sysplex environment, where all systems are at the same release, and even the same maintenance level, is ideal for Parallel Sysplex availability. However, OS/390 allows up to four consecutive releases of OS/390 to coexist in a Parallel Sysplex configuration. Release migrations within four OS/390 releases can be accomplished without a sysplex-wide IPL. For more information regarding release compatibility and migration, see the 'Planning Guide for Multisystem Customers: OS/390 Coexistence and Planning Considerations' at http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/E0Z2B100/5.0. Similarly, the intention is that corrective or preventive service should not require a sysplex-wide IPL. A sysplex-wide IPL is defined as a concurrent IPL of all systems in the Parallel Sysplex configuration.
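
The four-release coexistence rule described above can be sketched as a simple check. This is an illustrative sketch only, not an IBM tool: it assumes each system's OS/390 level is represented by its sequential release number (for example, 4 for V2 R4), and the function name and constant are hypothetical.

```python
from typing import Iterable

# Assumption taken from the text: up to four consecutive releases may coexist.
COEXISTENCE_WINDOW = 4


def releases_can_coexist(release_numbers: Iterable[int],
                         window: int = COEXISTENCE_WINDOW) -> bool:
    """Return True if all release numbers fall within `window` consecutive releases."""
    levels = set(release_numbers)
    if not levels:
        # An empty configuration has nothing that could conflict.
        return True
    # Span from the oldest to the newest release, counted inclusively.
    return max(levels) - min(levels) + 1 <= window


print(releases_can_coexist([4, 5, 6, 7]))  # True
print(releases_can_coexist([4, 6, 8]))     # False
```

A check like this could be run against a planned migration sequence to confirm that no intermediate step leaves the configuration spanning more than four releases.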

It may be necessary to install a PTF on all systems in the Parallel Sysplex configuration to resolve a problem, but no PTF should be required on all systems concurrently. In some cases, to avoid a sysplex-wide IPL, a prerequisite PTF is required to be installed on all systems in the Parallel Sysplex configuration. The prerequisite PTF enables toleration or co-existence of mixed service levels in the Parallel Sysplex configuration. The toleration PTF and the subsequent PTF may be installed using rolling IPLs. PTFs that require toleration contain a ++HOLD for ACTION to indicate the requirement.
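
The ordering constraint for toleration PTFs can be sketched as follows. This is a minimal illustration under stated assumptions: the system names (SYSA, SYSB, SYSC) and the PTF number UW00001 are hypothetical, and the record of which PTFs are active on each system is assumed to be maintained by the customer.

```python
from typing import Dict, List, Set


def systems_missing_ptf(active: Dict[str, Set[str]], ptf: str) -> List[str]:
    """Return the systems on which the given PTF is not yet active."""
    return sorted(name for name, ptfs in active.items() if ptf not in ptfs)


def may_activate_dependent(active: Dict[str, Set[str]], toleration_ptf: str) -> bool:
    """The dependent PTF may start rolling out only once the toleration
    PTF is active on every system in the configuration."""
    return not systems_missing_ptf(active, toleration_ptf)


# Hypothetical three-way sysplex; SYSC has not yet been IPLed with the toleration PTF.
sysplex = {
    "SYSA": {"UW00001"},
    "SYSB": {"UW00001"},
    "SYSC": set(),
}

print(systems_missing_ptf(sysplex, "UW00001"))     # ['SYSC']
print(may_activate_dependent(sysplex, "UW00001"))  # False
```

Both the toleration PTF and the dependent PTF are still activated one system at a time; the check only guards the start of the second round of rolling IPLs.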

All Parallel Sysplex related products, including OS/390, DB2, IMS, IRLM, RLS, and CICS have the same requirement for no sysplex-wide IPLs for service.


Identifying Parallel Sysplex APARs

The keyword SYSPLEXDS is used in APARs to help identify problems specifically related to the Parallel Sysplex environment. The keyword is used in high and low impact APARs. Careful attention should be paid to HIPER APARs with the SYSPLEXDS keyword. The Datasharing PSP bucket contains all the APARs with the SYSPLEXDS keyword.


Parallel Sysplex Test Environment

It is important for customers to have a test system, similar to that of their production environment, in order to better test maintenance in their own application environment, prior to installation on production systems. The test system should not impact the production environment. A test Parallel Sysplex with application datasharing is suggested for proper testing of the Parallel Sysplex environment.


HIPER and Special Attention APARs

IBM provides PTFs to correct defects found in S/390 products. The HIPER process is used to help identify defects that could result in outages and/or impact end-user availability.

The definition of HIPER APARs has changed over the last few years to help improve the identification of high impact service. Special Attention APARs were introduced at the same time to help identify other APARs of interest that are recommended for installation. All PER closed PTFs for all HIPER and Special Attention APARs are included in the monthly Recommended Service Upgrade (RSU).


HIPER APARs

HIPER was originally the designation for High Impact and/or PERvasive APARs. Since it is difficult to predict how pervasive a problem might be, many high impact APARs were not marked HIPER. Pervasive, but non-critical APARs, do not require the same urgency as high impact problems. This led to inconsistencies across the S/390 products in identifying HIPER APARs.

The definition of HIPER APARs was changed a few years ago (1996). HIPER is now used to indicate that an APAR will (or possibly will) have a high impact on availability. Special Attention is now used to indicate recommended pervasive, but not high impact APARs. Symptom flags have been added to help identify why the APAR is HIPER as follows:

  • Loss of Data,
  • System Outage,
  • Loss of Major function or Subsystem,
  • Severe Performance Impact

Other flags can be turned on, along with the symptom flags, to provide additional information on the impact of the APAR. These are:

  • Pervasive
  • XSYSTEM
  • Product Specific Keyword: __________

The Pervasive flag indicates that an APAR is High Impact AND pervasive. One of the symptom flags, such as System Outage, must also be flagged.

The XSYSTEM flag can be used to indicate that an APAR provides cross-system toleration or coexistence support in a multi-system environment. Toleration APARs are usually identified in the Program Directory of the product requiring the toleration. For toleration or coexistence problems that are found after a product ships and that cause high impact problems, APARs are marked HIPER and the PSP buckets are updated. It is important to install toleration fixes in advance, not only to avoid the problem, but also to avoid a scheduled IPL for an upcoming migration of other systems in the Parallel Sysplex configuration that require the toleration.

The Product Specific keyword can be used to identify groups of APARs for a specific function or environment. The Product Specific keywords currently defined are:

  • YR2000
  • EURO99
  • SYSPLXDS

Each keyword has its own associated PSP bucket, YR2000MVS, EURO99, and DATASHARING, respectively.

Both the XSYSTEM and Product Specific keywords are very new. Prior to availability of these flags, this information was provided in the APAR descriptive text.

IBM suggests reviewing HIPER APARs weekly and installing the applicable PTFs on the test system, in preparation for production, as needed. HIPER maintenance should be rolled out to production systems on a regular basis, such as monthly. For customers with high availability requirements, it is suggested that specific individuals at the enterprise be identified by expertise (e.g., Operating System, Networking, Subsystem) to review HIPERs weekly and recommend installing those critical to the production environment as soon as possible. Severity 1 HIPERs and those HIPERs also marked pervasive should be strongly considered. Customers running in a Parallel Sysplex environment should pay particular attention to HIPER APARs which include the keyword SYSPLEXDS. This keyword identifies problems that have specific impact on the Parallel Sysplex and/or datasharing environment and can be found in the APAR text or the new Product Specific keyword field of HIPER or Special Attention APARs.


HIPER and PE Notification

There are several ways for customers to get information about HIPER APARs. PE APARs identify that a previous PTF is in error. PE APARs that cause high impact symptoms are also marked HIPER. The following tools are available which provide information on both HIPER and PE APARs:

  • PSP (Preventive Service Planning) Buckets are available by contacting the IBM Support Center or electronically through ServiceLink. The PSP buckets contain the HIPERs and PEs in the Service Recommendation list.
  • ALERT/ASAP is an electronic notification of HIPER and PE APARs through ServiceLink. Customers are notified of HIPERs and PEs for selected products, as they are identified by IBM support center.
  • OS/390 Enhanced HOLDDATA is an SMP/E installable file containing ++HOLDs for HIPERs and PEs, used to identify HIPER and PE fixes that are not installed on a system. After the file is received, an SMP/E REPORT ERRSYSMODS report can be run to identify the HIPERs and resolving PEs not installed. The SMP/E report includes the HIPER symptom flags. For more information on OS/390 Enhanced HOLDDATA, visit the OS/390 Enhanced HOLDDATA web site at http://service.boulder.ibm.com/390holddata.html.
  • S/390 Service Update Facility (SUF) is the new web-based service maintenance tool used to order and receive recommended maintenance. SUF can be used to receive the HIPER and PE fixes that are not currently installed. For more information on SUF, visit the SUF web site at ibm.com/servers/eserver/zseries/zos/suf/.
  • In addition to the HIPER and PE notifications above, some very critical HIPER APARs may be highlighted on the S/390 Software Support web site as Red Alerts. Red Alerts are a new way to help identify very critical HIPER APARs that may impact Parallel Sysplex availability. These are not intended to replace the HIPER process. Instead, Red Alerts are used to communicate a small number of very hot APARs quickly. Red Alerts can be found on the S/390 web site at http://www.ibm.com/servers/eserver/support/zseries/. Please check this site often for service information.

Special Attention APARs

Special Attention APARs are APARs that do not have a high impact but are recommended for other reasons such as:

  • Pervasive (not HIPER)
  • New Function Support
  • Serviceability
  • Installability
  • XSYSTEM (Cross System Toleration/Coexistence)
  • Product Specific Keyword (YR2000, EURO99, SYSPLXDS)

ALERT/ASAP of ServiceLink can be used to monitor for Special Attention APARs. No HOLDDATA is generated and no PSP updates are made for Special Attention APARs. Special Attention APARs can be installed during normal preventive maintenance windows or when the support is needed. The only exception to this is for YR2000 APARs. All YR2000 APARs are identified in OS/390 Enhanced HOLDDATA, even if they are not HIPER. It is suggested that YR2000 APARs be treated as HIPERs, reviewing them weekly for applicability and installing them with the HIPER maintenance roll-out.

As with HIPER APARs, XSYSTEM and Product Specific keyword are new flags on Special Attention APARs. The XSYSTEM flag for cross system toleration is intended to help identify maintenance needed for toleration/coexistence of multiple product levels in a Parallel Sysplex configuration. It will also be used for APARs which must be rolled out throughout the Parallel Sysplex configuration prior to installing subsequent APARs/PTFs. Although these APARs do not identify high impact problems, XSYSTEM Special Attention APARs are identified to help avoid the possibility of multiple rolling IPLs for corrective service.


Reducing Amount of Maintenance to Install

In order to reduce the amount of maintenance that customers need to manage and install, IBM introduced two changes for S/390 maintenance: FIN (Fixed If Next release) and RSU (Recommended Service Upgrade).


FIN APARs

For low impact APARs, where no immediate PTFs are required, fixes are deferred to future releases where better testing can be done during the product test cycles. With customer concurrence, the APARs for low impact, non pervasive problems are closed with a code of FIN indicating IBM's intention to provide a fix in a release available within 18 months. Based on the impact of the problem and the likelihood that other customers will experience the problem, IBM decides if an APAR is a good FIN candidate. If a fix is required, the APAR is closed PER and PTF(s) provided. Subsequently, if a customer experiences a problem which is described by a FIN APAR and a fix is really required, a new APAR will be created.


Recommended Service Upgrade (RSU)

In the past, IBM provided maintenance for all defects and recommended installing all service for preventive maintenance. With OS/390 Version 1 Release 2 (9/96), IBM changed this philosophy in order to reduce the amount of maintenance for customers to install. IBM announced RSU, a subset of the maintenance, as the recommended preventive maintenance to install. PTFs for low impact problems that are not likely to affect customers, but that could not be deferred, are not recommended. RSU can eliminate the installation of 20%-30% of the available PTFs.

RSUs are available monthly on the preventive service deliverables around the 15th of the month. ESO and CBPDO packages still contain all the available maintenance. In addition, a separate file containing ++ASSIGN statements identifies the RSU maintenance to be installed. For example, a ++ASSIGN statement to recommend UW12345 would contain the following:

            ++ASSIGN SOURCEID(RSUyymm) TO(UW12345).

The criteria for inclusion in the RSU is as follows:

  • Severity 1 & 2 APARs
  • HIPERs
  • Special Attentions
  • Fixes for PEs
  • Security/Integrity APARs
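
The inclusion criteria above can be expressed as a simple predicate. This is a hedged sketch, not IBM's selection logic: the `Apar` record and its field names are hypothetical, and it assumes that meeting any one criterion is sufficient for inclusion.

```python
from dataclasses import dataclass


@dataclass
class Apar:
    """Hypothetical record of the APAR attributes named in the criteria."""
    severity: int                     # 1 (most severe) through 4
    hiper: bool = False
    special_attention: bool = False
    fixes_pe: bool = False            # resolves a PTF in error
    security_integrity: bool = False


def qualifies_for_rsu(apar: Apar) -> bool:
    """Assumption: an APAR's PTF is an RSU candidate if ANY criterion holds."""
    return (
        apar.severity in (1, 2)
        or apar.hiper
        or apar.special_attention
        or apar.fixes_pe
        or apar.security_integrity
    )


print(qualifies_for_rsu(Apar(severity=2)))              # True
print(qualifies_for_rsu(Apar(severity=3)))              # False
print(qualifies_for_rsu(Apar(severity=3, hiper=True)))  # True
```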

In order to apply the RSU with SMP/E, the file containing the ++ASSIGN SOURCEID(RSUyymm) statements must first be RECEIVEd. Then an APPLY with SOURCEID(RSUyymm) is done to install the RSU and all its prerequisites.

IBM strongly suggests installing only the RSU maintenance, which is tested in a Parallel Sysplex environment. Being more selective in installing maintenance reduces the risk of change and also the risk of PEs, to attain a more stable system. IBM's intent is to avoid shipping fixes that are not recommended by deferring them to future releases by using the FIN process.

Please note that IBM is working on a new RSU recommendation based on a new test effort known as Consolidated Service Test (CST). The goal is to provide a consistent, installable, tested maintenance level for OS/390 and z/OS operating systems and key subsystems, such as DB2, IMS, CICS, and MQ. The CST recommendation will be available initially from a web site until the new RSU is available on the ESO and CBPDO deliverables by the 4th quarter of 2001.

The contents of the RSU will change based on quarterly and monthly recommendations that have been tested. For more information, visit the CST web site at ibm.com/servers/eserver/zseries/zos/servicetst/.

S/390 Service Update Facility (SUF) supports the OS/390 maintenance strategy and recommends the most current RSU that has been integration tested for preventive maintenance. SUF will electronically ship only the RSU PTFs, along with their prerequisites, that are not already received in the SMP/E target zone.

Customer experience has shown that the RSU is sufficient for system availability.


RSU Integration Testing

An 8 image Parallel Sysplex environment, called the Service Plex, is set up in Poughkeepsie to test S/390 maintenance. The Service Plex tests RSU preventive maintenance and corrective service PTFs for OS/390 and the major Parallel Sysplex products.

The Service Plex runs three releases of OS/390 concurrently and tests some systems with RSU maintenance only and other systems with all maintenance for selected products. Service for each release of OS/390 is tested for a year and a half after General Availability (GA) of the product. RSU maintenance is installed monthly when it becomes available from Software Manufacturing in Boulder. For example, RSU9812, which became available around 1/15/99, was installed and IPLed into the Service Plex on Monday, 1/18. Additional testing is done to validate service fixes for an initial period for new releases, across four releases within the Parallel Sysplex environment.

The RSU and corrective service systems run in a datasharing environment. The workloads include CICS, IMS, DB2 and VSAM/RLS online transaction processing. Stress testing is done on each of the workloads using TPNS scripts to generate high volumes of transactions. Also stress testing of a mixed workload is done on a weekly basis. The processors are run at high cpu utilization to simulate customer stress environments.

The goal of the RSU systems is to run for 30 days in a Parallel Sysplex environment, testing the maintenance. Any problems found are resolved. APARs are created and marked HIPER and/or PE, as appropriate. Currently, the Service Plex is testing OS/390 V2 R4, R5 and R6. When OS/390 V2 R7 is available, R5, R6, and R7 will be tested.

Three other systems in the Service Plex, running the same releases as the RSU systems, are used for testing corrective fixes. All PTFs that were shipped to Boulder for selected products are picked up weekly. These PTFs are installed prior to the PTFs being COR closed and available from Software Manufacturing. Problems found are isolated to the fixes installed, and corrective actions are taken. If a PTF in error can be rebuilt, it is retested. If a PTF in error is COR closed by the time a problem is found in test, the PTF must be marked PE and a superseding PTF provided.

Traps and circumventions can also be tested on these systems.


OS/390 Enhanced HOLDDATA

IBM provides OS/390 Enhanced HOLDDATA, which is HOLDDATA with additional information to identify the reason for the hold and a fixing PTF. Enhanced HOLDDATA provides a hold against the FMID for HIPER and YR2000 maintenance. For PEs, it provides a hold against the PTF in error. OS/390 Enhanced HOLDDATA can help manage all products on the OS/390 platform.

OS/390 Enhanced HOLDDATA is received into the SMP/E global zone. The SMP/E REPORT ERRSYSMODS command can then be used on any target zone to identify missing critical service that applies to the customer system. This allows for the identification of any missing HIPER, PE or YR2000 fixes. Additionally, the report identifies whether a corrective PTF is available, whether the corrective PTF is already in RECEIVE status, and any symptom flags for a HIPER.

OS/390 Enhanced HOLDDATA is available through ESO packages, with CBPDO as of OS/390 R7, and from the web. For more information on OS/390 Enhanced HOLDDATA, visit the OS/390 Enhanced HOLDDATA web site at http://service.boulder.ibm.com/390holddata.html.


Maintenance Process Suggestions for S/390 Products

A good maintenance process requires planning and coordination similar to installing a new release. Determining what maintenance to install, the roll-out schedule and how to reduce the risk of known problems that can affect availability are key elements of the maintenance process.


Maintenance Schedule and Roll-Out Plan

It is important to have a scheduled preventive maintenance plan. IBM suggests that preventive maintenance be installed quarterly (every 3 months) and that HIPERs and PEs be monitored weekly. Based on a risk assessment, fixes for HIPERs and PEs should be installed, along with corrective maintenance for problems actually experienced, as soon as possible, monthly, if practical.

In order to keep up with three month maintenance cycles, the roll-out plan needs to be less than thirteen weeks. If maintenance takes three months to roll-out, the production systems will be four to six months behind in service.


Deciding what Preventive Maintenance to Install

When preparing for a preventive maintenance upgrade, it is suggested to start with the latest RSU level that has been integration tested on the Service Plex. That is RSUyymm, where mm is the current month minus 2. This timing allows for the fact that the RSU becomes available around the 15th of the following month and gives the Service Plex at least 2 weeks to test. The actual test time ranges from 2 to 6 weeks depending on the date within the month.
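
The "current month minus 2" arithmetic, including the roll-over across a year boundary, can be sketched as follows. This is only an illustration of the date calculation; the function name is hypothetical and the two-digit year follows the RSUyymm naming used in this paper.

```python
from datetime import date


def recommended_rsu(today: date, lag_months: int = 2) -> str:
    """Return the RSUyymm source ID for (current month - lag_months)."""
    # Count months from year 0 so the subtraction rolls over year boundaries.
    months = today.year * 12 + (today.month - 1) - lag_months
    year, month_index = divmod(months, 12)
    return f"RSU{year % 100:02d}{month_index + 1:02d}"


print(recommended_rsu(date(2001, 7, 20)))  # RSU0105
print(recommended_rsu(date(2001, 1, 10)))  # RSU0011
```

For an upgrade prepared in July 2001, the sketch selects RSU0105, the May level, which became available around mid-June and has therefore had at least two weeks of Service Plex testing.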

IBM suggests installing only the RSU sourceids for preventive maintenance for OS/390. Corrective fixes for problems encountered, not included in the RSU, should also be installed. Although RSU is available for all S/390 brand products, Parallel Sysplex related products, such as DB2, IMS, CICS, and MQ currently suggest installing all preventive maintenance available. These products should be no more than 3 to 6 months behind in maintenance. HIPERs should be monitored weekly and installed monthly for these products, as well.

When new releases of OS/390 are installed, RSU maintenance should be included up to two months prior to the install (current month - 2), with a preventive maintenance update three months later.


Analysis and Risk Assessment

Reviewing HIPER and PE APARs is critical to avoiding outages due to known problems. Individuals should be identified from each of the product areas to monitor and review HIPER APARs for applicability to the customer environment. IBM recommends receiving OS/390 Enhanced HOLDDATA weekly and running the SMP/E REPORT ERRSYSMODS to identify the HIPERs that are missing and also any installed PTFs that have been found in error (PEs). Weekly meetings should be held to assess the risk of each applicable APAR and those APARs determined to be a high risk to availability should be scheduled for the next maintenance window. Maintenance windows should be scheduled to install HIPERs/PEs and corrective service, as necessary.

When doing a risk assessment on HIPERs, consider the following:

  • Is the failing environment applicable to your system?
  • What was the impact, system outage, loss of function, etc.?
  • How likely is it to occur (pervasive)?
  • Double Check Severity 1 APARs.

All maintenance, including HIPERs and resolving PEs should be tested in the customer's test environment prior to production.


Conclusion

The customer's maintenance process should be based on availability requirements. Staying current on maintenance can avoid outages caused by known defects. The preventive maintenance plan for improved availability in a Parallel Sysplex environment should be well-defined and maintenance should be installed regularly, preferably every 3 months. When installing the maintenance, the RSU for the current month minus 2 should be selected. Weekly monitoring of HIPERs is required to ensure that the necessary fixes that could affect availability are installed and rolled out to production as soon as possible. If preventive maintenance cannot be installed quarterly, it is even more important to monitor for HIPERs and PEs and install them monthly.

As a rule of thumb, if most of the defect problems experienced are already known APARs with available PTFs, then the preventive maintenance schedule is not aggressive enough.

July, 2001