Using a Coupling Facility for the JES2 Checkpoint



Document Author:

Kathy Walsh


Document ID:

FLASH10009


Doc. Organization:

Advanced Technical Sales


Document Revised:

08/23/2003


Product(s) covered:

JES2







Abstract: IBM recommends using either cached DASD or a coupling facility for the JES2 checkpoint in a multi-access spool (MAS) complex. In MASes with four or more members, a coupling facility can provide performance benefits over DASD.

Advantages of Using the Coupling Facility
When comparing JES2 checkpoint I/O operations, the coupling facility is faster than cached DASD for reads but slightly slower for writes. The real advantage of the coupling facility lies in its FIFO queuing of lock requests, which ensures round-robin (equitable) sharing of the checkpoint by delivering it to the members in the order requested. This matters more as the number of MAS members increases, because contention for the primary checkpoint data set (CKPT1) increases with it.
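The effect of FIFO queuing can be illustrated with a toy model. The sketch below is only an illustration of first-in, first-out granting, not a description of how the coupling facility implements its lock; the member names and the function `fifo_grants` are hypothetical.

```python
from collections import deque

def fifo_grants(members, accesses_each):
    """Return the order in which members are granted the checkpoint.

    Toy model: every member queues once at the start, and re-queues at the
    back after each release until it has had `accesses_each` turns.
    """
    queue = deque(members)                      # requests in arrival order
    remaining = {m: accesses_each for m in members}
    order = []
    while queue:
        member = queue.popleft()                # oldest request is served first
        order.append(member)
        remaining[member] -= 1
        if remaining[member] > 0:
            queue.append(member)                # new request goes to the back
    return order

print(fifo_grants(["SYSA", "SYSB", "SYSC"], 2))
# ['SYSA', 'SYSB', 'SYSC', 'SYSA', 'SYSB', 'SYSC']
```

Because each new request joins the back of the queue, no member can be granted the checkpoint twice while another member is still waiting.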

MASDEF HOLD and DORMANCY Recommendations

Setting the HOLD and DORMANCY times is installation dependent, but in most cases a wide range of values is acceptable. Below are some observations based on recent experience; they apply equally to cached DASD and the coupling facility except where noted:

HOLD Time

HOLD should generally be between 20 and 50 (the units are hundredths of a second). A value below 20 (0.2 second) causes excessive overhead from reading and writing the checkpoint, leaving little useful time for exclusive control. Most requests for the checkpoint are clustered in periods of less than 0.20 seconds. Some job tracking/scheduling subsystems and output retrieval/archival subsystems benefit from longer hold times, up to a full second. Times over a second tend to lock out other members and are counter-productive.

Minimum DORMANCY Time

Most requests for the checkpoint can afford to wait for three or more seconds without noticeable degradation. This applies to heavy batch, TSO, NJE, RJE, or JES2/PSF printing workloads. However, some job scheduling subsystems and output retrieval subsystems benefit from more frequent access if they are managing large queues of JES2 work, such as submitting many jobs, status commands, or PSO requests. These subsystems tend to get behind occasionally if they don't have more frequent accesses and longer hold times.

The old rule of thumb, to make the minimum dormancy equal to the sum of the other members' hold times (plus I/O time), may sound good for a steady-state round-robin configuration. However, most members request checkpoint access in very erratic or clustered patterns, so the round-robin pattern does not apply. See below for more specific recommendations.
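For reference, the arithmetic behind that old rule of thumb can be sketched as follows. The member names, hold times and I/O allowance are made-up values for illustration, and `min_dormancy_rule_of_thumb` is a hypothetical helper; as noted above, the result only describes an idealized steady state.

```python
def min_dormancy_rule_of_thumb(hold_times, member, io_time=0):
    """Old rule of thumb: sum of the OTHER members' HOLD values plus I/O time.

    All values are in hundredths of a second, the same units as MASDEF
    HOLD and DORMANCY.
    """
    others = sum(hold for name, hold in hold_times.items() if name != member)
    return others + io_time

# Hypothetical three-member MAS, each member holding for 0.5 second.
holds = {"SYSA": 50, "SYSB": 50, "SYSC": 50}
print(min_dormancy_rule_of_thumb(holds, "SYSA", io_time=5))  # 105
```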

Maximum DORMANCY Time

The default is 500 (five seconds), which is fine for most members in most installations. Use a lower value only for relatively JES2-idle members that should pick up JES2 work quickly when it becomes available. Use a higher value only for members that you want to keep out of the way of more JES2-intensive members.

MASDEF MODE (DUAL vs. DUPLEX) Recommendations

DUAL mode processing alternates between CKPT1 and CKPT2 and transfers less data by using the change log at the front of the checkpoint. This provides slightly faster I/O times than DUPLEX mode, but cannot be used with a coupling facility. (You must perform an all-member warm start to change the MASDEF MODE setting.)

DUPLEX mode treats CKPT1 as the primary checkpoint data set and writes backup copies to CKPT2 in case there is a failure of CKPT1. In this mode, you may gain some performance benefit from setting DUPLEX=OFF on a member that needs a very short HOLD time (less than the "Primary Write" time). If you specify DUPLEX=OFF, make sure that at least one other member in the MAS with DUPLEX=ON is always active.
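The two constraints in this section can be expressed as a small planning check. This is a minimal sketch, assuming a hand-maintained description of the MAS; `check_masdef` and its arguments are hypothetical and do not read any real JES2 configuration. It also cannot verify that a DUPLEX=ON member is always active, only that one is defined.

```python
def check_masdef(mode, uses_coupling_facility, duplex_by_member):
    """Return a list of problems found in a planned MASDEF configuration.

    mode                   -- "DUAL" or "DUPLEX"
    uses_coupling_facility -- True if a checkpoint resides on a coupling facility
    duplex_by_member       -- dict of member name -> "ON" or "OFF"
    """
    problems = []
    if mode == "DUAL" and uses_coupling_facility:
        problems.append("MODE=DUAL cannot be used with a coupling facility")
    if mode == "DUPLEX" and "ON" not in duplex_by_member.values():
        problems.append("no member specifies DUPLEX=ON")
    return problems

print(check_masdef("DUPLEX", True, {"SYSA": "OFF", "SYSB": "ON"}))  # []
print(check_masdef("DUAL", True, {"SYSA": "ON"}))
```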

Rules of Thumb

As dangerous as it is to publish actual numbers instead of recommending customers develop their own numbers, here are some "starting values" you can use for guidance. (Your mileage will vary.)

Single Member MAS

If you do not share spool with any other members, use the default values provided by IBM for MASDEF, except for HOLD.

MASDEF HOLD=60000,
DORMANCY=(5,500),
MODE=DUPLEX,
DUPLEX=ON


Multiple Members

Here is a chart with some general recommendations based on the total number of members in the MAS and the type of workload on each member:


System Workload              Two Members              Three Members            Four Members             Five or More Members
BATCH, NJE, RJE, TSO, Print  Hold=50, Dorm=(50,500)   Hold=40, Dorm=(80,500)   Hold=30, Dorm=(90,500)   Hold=20, Dorm=(100,500)
Heavy SSI Usage *            Hold=80, Dorm=(20,500)   Hold=80, Dorm=(20,500)   Hold=80, Dorm=(20,500)   Hold=80, Dorm=(20,500)
Little JES2 activity         Hold=30, Dorm=(80,500)   Hold=20, Dorm=(100,500)  Hold=20, Dorm=(100,500)  Hold=20, Dorm=(100,500)


* Be aware of which members have heavy SSI usage, and try to limit them to as few members as possible. These members may need longer hold times or shorter dormancy times.
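The chart above can also be kept as a small lookup that builds a starting MASDEF statement for a member. This is a sketch only: the workload keys (`batch`, `heavy_ssi`, `idle`) and the function `masdef_for` are hypothetical names chosen for illustration, and the values are the starting values from the chart, not tuned recommendations.

```python
# (HOLD, minimum DORMANCY) in hundredths of a second, keyed by workload type
# and number of MAS members; the key 5 stands for "five or more".
STARTING_VALUES = {
    "batch":     {2: (50, 50), 3: (40, 80),  4: (30, 90),  5: (20, 100)},
    "heavy_ssi": {2: (80, 20), 3: (80, 20),  4: (80, 20),  5: (80, 20)},
    "idle":      {2: (30, 80), 3: (20, 100), 4: (20, 100), 5: (20, 100)},
}
MAX_DORMANCY = 500  # the chart uses 500 for every entry

def masdef_for(workload, members):
    """Return a starting MASDEF statement for one member of a multi-member MAS."""
    if members < 2:
        raise ValueError("the chart applies to MASes with two or more members")
    hold, min_dormancy = STARTING_VALUES[workload][min(members, 5)]
    return f"MASDEF HOLD={hold},DORMANCY=({min_dormancy},{MAX_DORMANCY})"

print(masdef_for("batch", 3))      # MASDEF HOLD=40,DORMANCY=(80,500)
print(masdef_for("heavy_ssi", 7))  # MASDEF HOLD=80,DORMANCY=(20,500)
```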

Measurement Tools

The best measurement tool for your JES2 checkpoint is the (lack of) symptoms of JES2 delays by your applications. Here are some tools available for checkpoint analysis:

SDSF MAS Panel

This is a convenient display of the members' status, their hold and dormancy settings, and the actual times. It is also a convenient panel for adjusting the times and seeing immediate results. Be aware that the actual times shown are instantaneous values, not averages.

RMF Monitor III

See the Subsystem Display, then "JES Delays". Excessive delays here are often due to checkpoint delays.

RMF CF Structure Activity Report

JES2 writes many blocks of data at once, so you will often see "No Subchannel Available" in these reports. This is normal and should not alarm you. The service times for Synch and Asynch requests should be within the published guidelines for your environment.

$D PERFDATA(QSUSE)

This displays the count and average wait time for $QSUSE requests (access to the checkpoint). This command, provided in JES2 SP Version 5.2, is not yet a documented external, but is described in WSC Flash 9744.

$TRACE(17) data

Turn on $TRACE(17) records for ten to fifteen minutes during your most active time of the day from a JES2 perspective. This may be during peak TSO activity, when many jobs are submitted, when JES2 queues are longest, or during a JES2 restart.

Here are sample JES2 operator commands to trace to class x:

$S TRACE(17)
$T TRACEDEF,TABLES=20
$T TRACEDEF,ACTIVE=Y,LOG=(START=Y,CLASS=x,SIZE=92000)


When through, spin off the $TRCLOG, and turn off tracing, then use the IBM external writer (XWTR) or SDSF to write the trace records to disk:

$T TRACEDEF,ACTIVE=NO,SPIN
$P TRACE(17)

S XWTR.X,UNIT=SYSDA,DSNAME=JES2.TROUT,SPACE=(CYL,(1,3))
F X,CLASS=x


Then analyze the data with the JES2T17A sample program provided with JES2.

Availability Considerations

You should also be aware of the volatility of your coupling facility. If you lose power, you will lose data unless you have battery back-up. The original recommendation was to have CKPT2 on DASD, but with careful planning, you can put both of the JES2 checkpoints on coupling facilities as long as you can prevent both from losing data at once.

Planning for Outages

Always have your back-up checkpoints (NEWCKPT1 and NEWCKPT2) defined on the JES2 CKPTDEF statement, and have the structures or data sets pre-allocated and defined. This will save time, confusion and possible outages in the event of a checkpoint error.

Reconfiguring your Checkpoint

Never change the CKPT1 and CKPT2 parameters and restart JES2 to change the checkpoint configuration. Always use the checkpoint reconfiguration dialog to move the checkpoint, and then change the CKPTDEF parameters afterwards.

Protecting your Checkpoint Data Sets

Use RACF or your favorite security product to protect the JES2 checkpoint data sets (and new checkpoint data sets) from inadvertent or unauthorized deletion.



Classification:

Software

Category:

Operational Management




Platform(s):

IBM System z Family





Keywords:

Flash, Coupling Facility
