WSC IBM Z
Abstract: IBM recommends using either cached DASD or a coupling facility for the JES2 checkpoint in a multi-access spool (MAS) complex. In MASes with four or more members, a coupling facility can provide performance benefits over DASD.
Advantages of Using the Coupling Facility
For JES2 checkpoint I/O operations, the coupling facility is faster than cached DASD for reads, but slightly slower for writes. The real advantage of the coupling facility lies in its FIFO queuing of lock requests. This ensures round-robin (equitable) sharing of the checkpoint, delivering it to the members in the order requested. This becomes more important as the number of members in the MAS increases, because contention for the primary checkpoint data set (CKPT1) increases with it.
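The effect of FIFO queuing can be illustrated with a toy model. The sketch below is not how the coupling facility implements its lock; it only assumes that every member (the names SYSA, SYSB, SYSC are hypothetical) re-requests the checkpoint as soon as it releases it, and shows that strict arrival-order service then produces round-robin sharing.

```python
from collections import deque

def simulate_fifo_lock(members, grants):
    """Toy model: grant the checkpoint lock in strict arrival order.

    Assumes every member requests the lock at the start and re-requests
    it immediately after each release (constant contention).
    """
    queue = deque(members)          # pending lock requests, oldest first
    order = []
    for _ in range(grants):
        holder = queue.popleft()    # oldest request is served first
        order.append(holder)
        queue.append(holder)        # member releases and queues again
    return order

print(simulate_fifo_lock(["SYSA", "SYSB", "SYSC"], 6))
# ['SYSA', 'SYSB', 'SYSC', 'SYSA', 'SYSB', 'SYSC']
```

Under this assumption no member can obtain the checkpoint twice before every other waiting member has had it once.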
MASDEF HOLD and DORMANCY Recommendations
Setting the HOLD and DORMANCY times is installation dependent, but in most cases a wide range of values is acceptable. Below are some observations based on recent experience. They apply equally to cached DASD and the coupling facility except where noted:
HOLD Time
The HOLD time should be between 20 and 50 (units are hundredths of a second). Less than 20 (0.2 second) causes excessive overhead from reading and writing the checkpoint, leaving little useful time for exclusive control. Most requests for the checkpoint are clustered in periods of less than 0.20 seconds. Some job tracking/scheduling subsystems and output retrieval/archival subsystems benefit from long hold times of up to a full second. Times over a second tend to lock out other members and are counter-productive.
Minimum DORMANCY Time
Most requests for the checkpoint can afford to wait for three or more seconds without noticeable degradation. This applies to heavy batch, TSO, NJE, RJE, or JES2/PSF printing workloads. However, some job scheduling subsystems and output retrieval subsystems benefit from more frequent access if they are managing large queues of JES2 work, such as submitting many jobs, status commands, or PSO requests. These subsystems tend to get behind occasionally if they don't have more frequent accesses and longer hold times.
The old rule of thumb, to make the minimum dormancy equal to the sum of the other members' hold times (plus I/O time), may sound good for a steady-state round-robin configuration. However, most members request checkpoint access in very erratic or clustered patterns, so the round-robin pattern does not apply. See below for more specific recommendations.
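For reference, the old rule of thumb is simple arithmetic. The sketch below works one case; the function name and the I/O time of 5 (0.05 second) are assumptions chosen for illustration, not measured values.

```python
def rule_of_thumb_min_dormancy(other_hold_times, io_time=0):
    """Old rule of thumb: sum of the other members' HOLD values plus I/O time.

    All values are in hundredths of a second, as on the MASDEF statement.
    """
    return sum(other_hold_times) + io_time

# Three other members, each with HOLD=30, plus an assumed I/O time of 5.
min_dormancy = rule_of_thumb_min_dormancy([30, 30, 30], io_time=5)
print(min_dormancy)         # 95
print(min_dormancy / 100)   # 0.95 (seconds)
```

Because real checkpoint requests arrive in clusters rather than in a steady rotation, treat the result as an upper bound to compare against, not as a setting to apply blindly.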
Maximum DORMANCY Time
The default is 500 (five seconds), which is fine for most members in most installations. Use a lower value only for relatively JES2-idle members that should pick up JES2 work quickly when it becomes available. Use a higher value only for members that you want to keep out of the way of more JES2-intensive members.
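The observations above can be collected into a small advisory check. This is a minimal sketch, not an IBM tool: the function name and message wording are hypothetical, and the thresholds (20, 50, 100, and the default 500) are taken directly from the text.

```python
def check_masdef(hold, dorm_min, dorm_max):
    """Return advisory notes for MASDEF values (hundredths of a second)."""
    notes = []
    if hold < 20:
        notes.append("HOLD below 20: checkpoint read/write overhead dominates")
    elif hold > 100:
        notes.append("HOLD above 100: tends to lock out other members")
    elif hold > 50:
        notes.append("HOLD above 50: usually only for scheduling or output "
                     "retrieval subsystems")
    if dorm_min > dorm_max:
        notes.append("minimum DORMANCY exceeds maximum DORMANCY")
    if dorm_max < 500:
        notes.append("maximum DORMANCY below 500: suits relatively "
                     "JES2-idle members")
    elif dorm_max > 500:
        notes.append("maximum DORMANCY above 500: keeps this member out of "
                     "the way of busier members")
    return notes

print(check_masdef(30, 90, 500))   # []
for note in check_masdef(10, 90, 300):
    print(note)
```

An empty list means the values fall inside the ranges discussed here; it does not mean they are optimal for a given installation.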
MASDEF MODE (DUAL vs. DUPLEX) Recommendations
DUAL mode processing alternates between CKPT1 and CKPT2 and transfers less data by using the change log at the front of the checkpoint. This provides slightly faster I/O times than DUPLEX mode, but cannot be used with a coupling facility. (You must perform an all-member warm start to change the MASDEF MODE setting.)
DUPLEX mode treats CKPT1 as the primary checkpoint data set and writes backup copies to CKPT2 in case there is a failure of CKPT1. In this mode, you may gain some performance benefits from setting DUPLEX=OFF on a member that needs a very short HOLD time (less than the "Primary Write" time). If you specify DUPLEX=OFF, make sure that at least one other member in the MAS with DUPLEX=ON is always active.
Rules of Thumb
As dangerous as it is to publish actual numbers instead of recommending customers develop their own numbers, here are some "starting values" you can use for guidance. (Your mileage will vary.)
Single Member MAS
If you do not share spool with any other members, use the default values provided by IBM for MASDEF, except for HOLD.
Here is a chart with some general recommendations based on the total number of members in the MAS and the type of workload on each member:
|System Workload|Two Members|Three Members|Four Members|Five or More Members|
|BATCH, NJE, RJE, TSO, Print|Hold=50, Dorm=(50,500)|Hold=40, Dorm=(80,500)|Hold=30, Dorm=(90,500)|Hold=20, Dorm=(100,500)|
|Heavy SSI Usage *|Hold=80, Dorm=(20,500)|Hold=80, Dorm=(20,500)|Hold=80, Dorm=(20,500)|Hold=80, Dorm=(20,500)|
|Little JES2 activity|Hold=30, Dorm=(80,500)|Hold=20, Dorm=(100,500)|Hold=20, Dorm=(100,500)|Hold=20, Dorm=(100,500)|
* Be aware of which members have heavy SSI usage, and try to limit them to as few members as possible. These members may need longer hold times or shorter dormancy times.
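The chart can also be expressed as a lookup that produces a MASDEF initialization statement for each member. This is a sketch only: the workload keys (batch, heavy_ssi, little) and the function name are hypothetical labels for the three rows, and the values are the starting points from the chart, not tuned settings.

```python
# Starting values from the chart: (HOLD, minimum DORMANCY, maximum DORMANCY).
# Column key 5 stands for "five or more members".
STARTING_VALUES = {
    "batch":     {2: (50, 50, 500), 3: (40, 80, 500),
                  4: (30, 90, 500), 5: (20, 100, 500)},
    "heavy_ssi": {2: (80, 20, 500), 3: (80, 20, 500),
                  4: (80, 20, 500), 5: (80, 20, 500)},
    "little":    {2: (30, 80, 500), 3: (20, 100, 500),
                  4: (20, 100, 500), 5: (20, 100, 500)},
}

def masdef_statement(workload, members):
    """Build a MASDEF statement from the chart for a multi-member MAS."""
    if members < 2:
        raise ValueError("the chart covers MASes with two or more members")
    column = min(members, 5)
    hold, dorm_min, dorm_max = STARTING_VALUES[workload][column]
    return f"MASDEF HOLD={hold},DORMANCY=({dorm_min},{dorm_max})"

print(masdef_statement("batch", 4))      # MASDEF HOLD=30,DORMANCY=(90,500)
print(masdef_statement("heavy_ssi", 7))  # MASDEF HOLD=80,DORMANCY=(20,500)
```

Each member carries its own HOLD and DORMANCY values, so a mixed MAS would call the function once per member with that member's workload type.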
The best measurement tool for your JES2 checkpoint is the (lack of) symptoms of JES2 delays by your applications. Here are some tools available for checkpoint analysis:
SDSF MAS Panel
This is a convenient display of the members' status, their hold and dormancy settings, and the actual times. It is also a convenient panel for adjusting the times and seeing immediate results. Be aware that the times shown are instantaneous values, not averages.
RMF Monitor III
See the Subsystem Display, then "JES Delays". Excessive delays here are often due to checkpoint delays.
RMF CF Structure Activity Report
JES2 writes many blocks of data at once, so you will often see "No Subchannel Available" in these reports. This is normal and should not alarm you. The service times for Sync and Async requests should be within the published guidelines for your environment.
This displays the count and average wait time for $QSUSE requests (access to the checkpoint). This command, provided in JES2 SP Version 5.2, is not yet a documented external, but is described in WSC Flash 9744.
Turn on $TRACE(17) records for ten to fifteen minutes during your most active time of the day from a JES2 perspective. This may be during peak TSO activity, when many jobs are submitted, when JES2 queues are longest, or during a JES2 restart.
Here are sample JES2 operator commands to trace to class x:
When through, spin off the $TRCLOG, and turn off tracing, then use the IBM external writer (XWTR) or SDSF to write the trace records to disk:
Then analyze the data with the JES2T17A sample program provided with JES2.
You should also be aware of the volatility of your coupling facility. If you lose power, you will lose data unless you have battery backup. The original recommendation was to have CKPT2 on DASD, but with careful planning, you can put both of the JES2 checkpoints on coupling facilities as long as you can prevent both from losing data at once.
Planning for Outages
Always have your back-up checkpoints (NEWCKPT1 and NEWCKPT2) defined on the JES2 CKPTDEF statement, and have the structures or data sets pre-allocated and defined. This will save time, confusion and possible outages in the event of a checkpoint error.
Reconfiguring your Checkpoint
Never change the CKPT1 and CKPT2 parameters and restart JES2 to change the checkpoint configuration. Always use the checkpoint reconfiguration dialog to move the checkpoint, and then change the CKPTDEF parameters afterwards.
Protecting your Checkpoint Data Sets
Use RACF(*) or your favorite security product to protect the JES2 Checkpoint data sets (and new checkpoint data sets) from inadvertent or unauthorized deletion.
IBM System z Family
Flash, Coupling Facility