After the ported application is working, you can optimize performance, if necessary. To run the application efficiently, you may need to extend it to exploit facilities unique to z/OS. Here are some recommendations to consider. Keep in mind that if the code from the original platform is poorly written, it is still poor code after the port. As an example, in one poorly performing application, we discovered that the application was opening and closing a file inside a loop. Once we changed the code to move the open and close outside the loop, performance improved dramatically.
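The open-and-close example above can be sketched as follows. This is a minimal illustration, not code from the application described; the helper name write_records() and the record format are assumptions made for the example. The file is opened once before the loop and closed once after it.

```c
#include <stdio.h>

/* Hypothetical helper: write `count` numbered records to `path`.
 * The open and the close both sit outside the loop, so each happens once. */
int write_records(const char *path, int count)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    for (int i = 0; i < count; i++) {
        /* Only the write itself is repeated. */
        if (fprintf(fp, "record %d\n", i) < 0) {
            fclose(fp);
            return -1;
        }
    }

    /* Returns the number of records written, or -1 if the final flush fails. */
    return fclose(fp) == 0 ? count : -1;
}
```

The slow variant would call fopen() and fclose() inside the for loop, paying the open and close cost once per record.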
To improve performance in the z/OS environment, some recommendations are:
- Use spawn() rather than fork() where possible.
- Use a threading model rather than a process model where possible. In malloc()-intensive applications, use the HEAPPOOLS runtime option.
- Try to economize on file I/O.
- Switch to line or record I/O rather than character I/O, where possible.
- Perform character set conversions efficiently.
- If using shared memory, be aware of its extended system queue area (ESQA) requirements.
- Do not use spins with serialization.
- Compile your production application with optimization.
- For large load modules, consider using LPA or VLF.
- Scrutinize any pthread_yield() calls in mainline application paths.
- Consider using the HEAPPOOLS runtime option.
There is a profiling tool available that can provide detailed information about where an application is spending most of its instructions. Close examination can reveal questionable programming practices. You can further optimize high usage routines to improve performance.
There are memory leak detection tools available.
Use spawn() instead of fork()
The z/OS platform has some performance characteristics that are not common to other UNIX platforms. A fork() causes z/OS to create another address space and clone the running application. This is an expensive operation that can be avoided by changing the application if warranted. If the forked processes are long-running, fork() performance will be acceptable. If they are not long-running, you should replace the fork() with either a local spawn() or a pthread_create(). Both of these substitutions are nontrivial, except for the case of a fork() followed by an exec(). In this case, the substitution to a local spawn() is simple.
In the spawn() case, the arguments to spawn() are the name of an executable module plus other arguments. Because spawn() does not clone the heap or stack, you must pass data needed by the spawned module or move it to a shared memory segment. You need to change program logic to create a new main() that can be spawned, or change the existing main() so that it can get to the point of the fork(). In either case, you will have to initialize data areas, since the heap and stack are not cloned. Resist the temptation to pass in the parent's heap: heaps are not made to be shared by multiple processes.
If your application creates many processes, you can improve performance by setting the environment variable _BPX_SHAREAS to YES or REUSE and using spawn(). In that case spawn() does the work of fork() followed by exec() but runs much faster and saves resources, because it does not have to copy the address space. However, if you do not set _BPX_SHAREAS to YES or REUSE, spawn() does exactly what fork() and exec() do, and there is no performance improvement.
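One way to make sure the variable is set is for the parent to set it in its own environment before it spawns any children. The sketch below assumes that a value set with setenv() in the parent is what spawn() consults; the helper name request_shared_address_space() is hypothetical, and the spawn() call itself is omitted because its arguments depend on the application.

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv(); adjust for your compiler */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: ask for later spawn() calls to share the address
 * space. `mode` must be "YES" or "REUSE"; any other value is rejected so
 * that a typo does not silently fall back to fork()/exec() behavior. */
int request_shared_address_space(const char *mode)
{
    if (mode == NULL ||
        (strcmp(mode, "YES") != 0 && strcmp(mode, "REUSE") != 0))
        return -1;

    /* Overwrite any existing value of _BPX_SHAREAS. */
    return setenv("_BPX_SHAREAS", mode, 1);
}
```

Setting the variable in the shell or in the environment passed to the parent is an equally valid choice; doing it in code simply keeps the requirement next to the spawn() logic.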
If your application is multithreaded you must use spawn() instead of fork().
If your application is designed to create multiple copies, with each running the same program, then spawn() might not be useful. Many applications rely on having program initialization performed once by the parent process and propagated via fork() to all the child processes. The spawn() function only propagates a few things like open file descriptors. spawn()'s assumption is that the new process will run a different program, not another copy of the same one.
__spawn() provides the ability to do a spawn that has additional data in the inheritance structure. You can specify the userid, cwd, umask and some other things as well. For example, when used with a web server, __spawn() allows a web worker thread that has used pthread_security_np() to spawn a CGI script which will be set up with the correct security identity.
- Applications that use pipes
After changing from using fork() to spawn(), an application that uses pipes can appear to hang. Often one process will work correctly for a while, then get stuck in a blocking read of the pipe.
A pipe consists of two file descriptors (fd) such that data written to "fd B" of the pipe can be read from "fd A" of the pipe. When a process forks, the pipe gets copied as well. Data written to "fd B" in the parent can be read from "fd A" in the child. When using fork(), the parent and child both close their copy of the unused pipe file descriptor. Normally, when data flows from parent to child, the parent closes "fd A" and the child closes "fd B". When data flows the other way, from child to parent, the parent closes "fd B" and the child closes "fd A". In either case, each process uses and leaves open only one half of the pipe.
With spawn() there is no explicit way for the child to close its unused half of the pipe. Because both ends of the pipe are open in the child process, the child will never see EOF on a read of "fd A" -- the write half or "fd B" is open in the child. EOF is detected only on a read of "fd A" when the pipe is empty and all copies of "fd B" are closed.
The solution is for the parent to mark the file descriptor for its half of the pipe (normally "fd B") to be closed-on-exec. If this is done then "fd B" will not be open in the child. When the parent closes its "fd B" then EOF will be detected in the child after all available data has been read from "fd A".
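Marking the parent's half of the pipe close-on-exec can be sketched with fcntl(). The helper names below are hypothetical, and the sketch assumes data flows from parent to child, so the write end ("fd B") is the one the child must not inherit.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: mark `fd` close-on-exec, keeping its other flags. */
int set_close_on_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Create a parent-to-child pipe. pipefd[0] is the read end ("fd A"),
 * pipefd[1] is the write end ("fd B"). Only the write end is marked
 * close-on-exec, so the child still receives the read end. */
int make_parent_to_child_pipe(int pipefd[2])
{
    if (pipe(pipefd) == -1)
        return -1;

    if (set_close_on_exec(pipefd[1]) == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }
    return 0;
}
```

For a pipe that carries data from child to parent, the same helper would be applied to pipefd[0] instead.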
- Applications that use shared memory
When using spawn(), an application that uses shared memory may find that shmat() with a specified shmaddr returns -1 if both processes are in the same address space, although this appeared to work in previous tests. The -1 return would occur only if the application had previously used spawn() with _BPX_SHAREAS=NO and then switched to spawn() with _BPX_SHAREAS=YES or REUSE.
If you require the shared memory to be at the same address, you have two choices:
- Run the two processes in separate address spaces and do the shmat(), specifying the same starting address.
- When running in the same address space, pass the address of the shared memory from process 1 to process 2. Then in process 2, just use the shared memory and do not do the shmat(). If only 2 processes are involved, then regular memory will suffice, and a malloc() in the first process can replace the shmget(). Then just pass the address of the heap storage to the second process.
Use a threading model instead of a process model
Threads are a good alternative because they can be started and stopped more efficiently than processes. z/OS supports heavy-weight threads -- if you are using multiple threads, each thread can run on a different processor at the same time.
One limiting factor is the number of threads in the address space.
File I/O and memory
When doing file I/O, keep these guidelines in mind:
- Do your work in memory rather than in temporary files.
- If your application extensively uses temporary files to save data, consider replacing this logic to use memory instead. On some UNIX platforms, memory is limited, so some applications use temporary files to avoid out-of-memory errors. Take advantage of z/OS's abundant memory to do work.
- Use larger buffers for file I/O. For peak performance, use buffers sized in the range 64K to 256K.
- Don't open a file unless you are going to read or write to it, and don't close a file until you have finished working with it.
Many UNIX applications read data from files one byte at a time. For z/OS, consider changing the application to read "lines" or "records" instead of characters.
Likewise, many UNIX applications read data from terminals one byte at a time. If possible, consider reading "lines" instead of characters.
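The buffer-size and line-I/O guidelines can be combined in a short sketch. The helper name count_lines(), the 4096-byte line buffer, and the choice of 64K (the low end of the range given above) are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define IO_BUFFER_SIZE (64 * 1024)  /* low end of the 64K to 256K range */

/* Hypothetical helper: count newline-terminated lines in a text file.
 * Returns the line count, or -1 on error. */
long count_lines(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    /* Give the stream a large, fully buffered I/O buffer before any read. */
    char *iobuf = malloc(IO_BUFFER_SIZE);
    if (iobuf == NULL || setvbuf(fp, iobuf, _IOFBF, IO_BUFFER_SIZE) != 0) {
        fclose(fp);
        free(iobuf);
        return -1;
    }

    /* Read a line at a time instead of calling fgetc() per character. */
    char line[4096];
    long lines = 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        /* A very long line arrives in several pieces; count it once,
         * when the piece containing the newline is read. */
        if (strchr(line, '\n') != NULL)
            lines++;
    }

    int failed = ferror(fp);
    fclose(fp);    /* close the stream before freeing the buffer it uses */
    free(iobuf);
    return failed ? -1 : lines;
}
```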
Character set conversion
The guidelines for efficient data conversion are similar to those for efficient I/O:
- Call iconv_open() and iconv_close() at the same time as fopen() and fclose(), that is, during application initialization and termination. These services are expensive and are intended to be used that way. Sometimes, to simplify code development, iconv_open() and iconv_close() calls are issued every time translation is needed. We have seen performance greatly enhanced in some cases when an application was changed to do iconv_open() only once (during initialization) and iconv_close() only once (during termination).
- Buffer as many bytes of data as possible on calls to iconv(). For example, if a line of data is read, the entire line should be passed to iconv().
Many UNIX programs being ported to the S/390 platform were written to read and write a byte of data at a time. Hence, iconv() would be called for each byte. The overhead to call iconv() and set up for conversion of a buffer of data is fairly high (on the order of 100 instructions per call), whether there is one byte or many bytes in the buffer. However, once the setup is done, it only takes 5 or 6 instructions per byte for iconv() to convert buffered data.
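Both guidelines can be illustrated with a small wrapper that takes an already opened conversion descriptor and converts a whole buffer in one iconv() call. The function name convert_buffer() is hypothetical, and the sketch assumes the output buffer is large enough for the converted data; the caller is expected to call iconv_open() once at initialization and iconv_close() once at termination.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `in_len` bytes from `in` into `out` with a
 * single iconv() call, so the per-call setup cost is paid once per buffer
 * rather than once per byte.
 * Returns the number of bytes written to `out`, or (size_t)-1 on error
 * (for example, when `out_cap` is too small). */
size_t convert_buffer(iconv_t cd, const char *in, size_t in_len,
                      char *out, size_t out_cap)
{
    char  *inp      = (char *)in;   /* iconv() takes char **, but does not modify the input bytes */
    char  *outp     = out;
    size_t in_left  = in_len;
    size_t out_left = out_cap;

    if (iconv(cd, &inp, &in_left, &outp, &out_left) == (size_t)-1)
        return (size_t)-1;

    return out_cap - out_left;
}
```

A typical caller reads a line, passes the entire line to convert_buffer(), and reuses the same descriptor for every line. The code page names given to iconv_open() are platform-specific and are not shown here.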
Shared memory
Shared memory, attached with shmat(), is typically used between server processes or used by server address spaces to communicate with clients.
On z/OS, shared memory is as efficient as any other type of memory access. When you use it, you need to be aware of its impact on extended system queue area (ESQA) storage requirements. ESQA storage is common storage and is page-fixed, so it consumes real memory. A number of z/OS UNIX System Services (z/OS UNIX) functions use base z/OS functions that consume ESQA storage. Installations that have constraints on virtual storage or main memory can control the amount of ESQA storage consumed. Ensuring the appropriate size of ESQA and extended common service area (ECSA) storage is critical to the long-term operation of the system.
For each real page of shared storage, a 32-byte anchor block is allocated in ESQA. In addition, for every address space accessing that page, an additional control block is allocated -- let's call it a page block for this discussion. The anchor block and the page block are very similar in structure (both 32 bytes), but their fields are different. Both anchor blocks and page blocks are allocated in fixed ESQA storage and they consume real memory.
Example of shared-memory consumption of ESQA:
A server that allocates 8MB of shared memory and has 500 clients connected to it will consume about 33MB of ESQA:
8MB * 256 pages/MB * 503 connections * 32 bytes/page = 32,964,608 bytes, or about 33MB of ESQA
The 503 comes from 500 clients, 1 server, 1 anchor block, and 1 connection to a kernel data space used to manage the storage.
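The same arithmetic can be written as a small function for sizing other configurations. The function and macro names are hypothetical, and the sketch simply restates the figures above: 256 pages per megabyte, 32 bytes per control block, and three blocks per page in addition to one per client.

```c
#include <stdint.h>

#define PAGES_PER_MB       256u  /* pages per megabyte, as in the example above */
#define CONTROL_BLOCK_SIZE 32u   /* bytes per anchor block or page block */
#define FIXED_CONNECTIONS  3u    /* server + anchor block + kernel data space */

/* Hypothetical helper: estimate ESQA bytes consumed by `shared_mb`
 * megabytes of shared memory with `clients` connected clients. */
uint64_t esqa_bytes(uint64_t shared_mb, uint64_t clients)
{
    uint64_t pages           = shared_mb * PAGES_PER_MB;
    uint64_t blocks_per_page = clients + FIXED_CONNECTIONS;
    return pages * blocks_per_page * CONTROL_BLOCK_SIZE;
}
```

For the example above, esqa_bytes(8, 500) evaluates to 2048 * 503 * 32 = 32,964,608 bytes.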
For information about controlling the use of ESQA, see z/OS UNIX Planning.
If you are using memory mapping with large files or large shared memory segments, OS/390 V2R6 provides new programming options that significantly reduce the ESQA storage requirements. The shmget() and mmap() C functions have the new options that require the storage to be allocated in megabyte multiples and reside on megabyte boundaries. All processes sharing these megabytes have the same access to the storage.
- The __IPC_MEGA option of shmget() (BPX1MGT callable service) allows applications to use large quantities of shared memory without excessive system overhead.
- The __MAP_MEGA option of mmap() (BPX1MMP callable service) allows applications to map very large files without the overhead in ESQA.
- The functions munmap() (BPX1MUN callable service) and mprotect() (BPX1MPR callable service) have a different scope when they are used with memory maps that have been created with the __MAP_MEGA option. When munmap() is used to unmap a __MAP_MEGA mapping, entire segments are unmapped. When mprotect() is used to change the access protection of a __MAP_MEGA mapping, the change is system-wide. All active maps to the same file-offset range are affected by the request.
Do not use spins with serialization
If you are writing an application that runs in multiple processes or on multiple threads, it is not uncommon for these work units to need to share resources. Sharing resources also implies the need to serialize access to these resources. There are several ways to serialize access to shared resources:
- When sharing a resource across processes, use semaphores. See the explanation of the C functions semget(), semctl(), and semop().
- When sharing resources between threads, you can use mutexes or condition variables. See the explanation of the C functions pthread_mutex_init() and pthread_cond_init().
These serialization mechanisms are provided by the operating system or runtime library. Sometimes programmers feel these functions perform too slowly and create their own mechanisms to handle serialization. Avoid these common mistakes:
- Spin loops that check for a resource being available in an infinite while loop.
- Spin loops that check for a resource being available and then usleep() for a small amount of time before checking again.
In a z/OS system, these loops can consume excessive CPU cycles while preventing other users from running.
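For threads, a blocking wait on a mutex and condition variable can replace both kinds of spin loop. The sketch below is one possible shape, not a prescribed design; the type resource_gate and the gate_* function names are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical "gate": waiters block until another thread marks the
 * resource as available. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  available;
    bool            ready;
} resource_gate;

int gate_init(resource_gate *g)
{
    g->ready = false;
    if (pthread_mutex_init(&g->lock, NULL) != 0)
        return -1;
    if (pthread_cond_init(&g->available, NULL) != 0) {
        pthread_mutex_destroy(&g->lock);
        return -1;
    }
    return 0;
}

/* Block until the resource is available. The thread sleeps inside
 * pthread_cond_wait() instead of polling; the while loop only rechecks
 * the condition after a wakeup. */
void gate_wait(resource_gate *g)
{
    pthread_mutex_lock(&g->lock);
    while (!g->ready)
        pthread_cond_wait(&g->available, &g->lock);
    pthread_mutex_unlock(&g->lock);
}

/* Mark the resource available and wake every waiter. */
void gate_release(resource_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->ready = true;
    pthread_cond_broadcast(&g->available);
    pthread_mutex_unlock(&g->lock);
}

void gate_destroy(resource_gate *g)
{
    pthread_cond_destroy(&g->available);
    pthread_mutex_destroy(&g->lock);
}

/* Example thread entry point: releases the gate passed as its argument. */
void *gate_release_thread(void *arg)
{
    gate_release((resource_gate *)arg);
    return NULL;
}
```

Between processes, the equivalent is to block in semop() on a semaphore rather than polling a flag in shared memory.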
Compile your production application with optimization
For each release of the compiler, we have a web page that lists the various optimization options available to improve your application's performance. For example, using the IPA option with the compiler puts high usage routines inline where called. This eliminates the call overhead entirely.
For large load modules, consider using LPA or VLF
To lessen the impact of very large modules when you have thousands of users, turn on the sticky bit and put the module into the link pack area (LPA). Your users can then share a single copy of the load module. This greatly reduces the working set size for each user and reduces system paging activity. If the process does any forks, the forks will also be faster.
If you cannot put your module into LPA for any reason, but the module will be loaded into many address spaces or loaded repeatedly into a few address spaces, consider using the Virtual Lookaside Facility (VLF). To do this:
- Turn on the sticky bit.
- Put the module into a link list data set or a steplib.
- Define the load library to VLF so that the module gets cached.
VLF will then have the module in storage and you will avoid the I/O to fetch the module each time. However, the module will still consume storage in each address space using it.
pthread_yield() calls in mainline paths
pthread_yield() (Thread.yield() in Java) is intended to allow some thread other than the current thread to get control of the processor. On some platforms, calling this service gives the processor to another thread without any fixed-length delay in the calling thread. However, on z/OS, pthread_yield() gives the processor to another thread by putting the current thread in a timed wait. Sometimes the duration of this timed wait can cause delays in response time and drops in external throughput. Any pthread_yield() calls in mainline application paths should be scrutinized. In most cases, these pthread_yield() calls should be removed from mainline paths.
Using HEAPPOOLS for malloc and free requests
If you are running a multithreaded application and doing frequent calls to malloc(), free(), or other heap storage functions, consider turning on the HEAPPOOLS runtime option. HEAPPOOLS is designed to manage malloc() and free() requests without getting a lock. It uses compare and swap logic to accomplish a malloc() or free() in about 50 instructions. Without HEAPPOOLS, a malloc() or free() takes 300 instructions plus the lock, which may trigger a WAIT/POST. The type of application that will benefit most from HEAPPOOLS is a multithreaded application that obtains and frees lots of small (4K) pieces of storage.