A highly available operating system is not something that just happens. It must be designed and planned from the start. z/OS includes basic reliability, availability, and serviceability (RAS) features that are core components and have been so, not just for years, but for decades.
The z/OS operating system
z/OS prides itself on having good first failure data capture (FFDC) techniques, designed to capture data and identify problems that occur during production without impacting system availability. This functionality has been a core part of the operating system design since 1974. Because of its long history, sometimes we take this functionality for granted. In z/OS, if a task (comparable to a UNIX thread) fails, the system is designed to quickly make a copy of the running program, together with its data and system control blocks, to another location in memory. This data can then be dumped to disk or tape for later analysis. For user applications and many subsystems, the dump is designed to be created without impacting unrelated work on the system. Even for system-level problems, the focus is on collecting the data with the least impact on other work.
After a problem occurs and the dump is taken, z/OS and the subsystems typically either remain up or else can be restarted without a re-IPL. This capability is handled by recovery routines such as ESTAE and FRR being given control. These routines are designed to capture diagnostic data, identify and restore modified system structures, resources, and locks held by the failed user, and return control to the program at a specified point. These serviceability features are available to be used by ISV and user applications as well.
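The recovery flow described above (capture diagnostic data, release what the failed work held, resume at a chosen point) can be sketched in a few lines. This is only an analogy in Python, not the ESTAE or FRR interface; the names run_with_recovery, retry_point, held_locks, and diagnostics are hypothetical and exist only for illustration.

```python
import threading

def run_with_recovery(work, retry_point, held_locks, diagnostics):
    """Run work(); if it fails, behave like a simple recovery routine."""
    try:
        return work()
    except Exception as exc:
        # 1. Capture diagnostic data at the time of the first failure.
        diagnostics.append({
            "error": repr(exc),
            "locks_held": sum(1 for lock in held_locks if lock.locked()),
        })
        # 2. Release locks that the failed work still holds.
        for lock in held_locks:
            if lock.locked():
                lock.release()
        # 3. Return control at the caller-specified retry point.
        return retry_point(exc)

# Usage: the work acquires a lock and then fails before releasing it.
lock = threading.Lock()
diag = []

def failing_work():
    lock.acquire()
    raise ValueError("bad control block")

result = run_with_recovery(failing_work, lambda exc: "resumed", [lock], diag)
```

After the call, the lock is free again, one diagnostic record has been saved, and the caller continues with the value supplied by the retry point instead of terminating.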
Contrast this with the UNIX environment. When a problem occurs in operating system code, in a subsystem, or even in some applications, the UNIX system typically takes a system (kernel) dump and then stops, requiring a re-IPL.
One important mechanism to help in problem determination is using the system trace records. (Think of this as the z/OS flight recorder!) Various z/OS component and system level traces are active by default. Information about which module was entered and by which transaction is efficiently written to a trace table in memory where it can later be dumped and analyzed. In addition, there is flexibility associated with the ability to turn on additional levels of tracing on a component-by-component basis without a restart for supplemental data gathering. Key here is that this can be highly efficient without significant impact on overall performance.
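The flight recorder idea can be illustrated with a fixed-size in-memory table that overwrites its oldest entries, plus a per-component switch for extra detail that can be turned on without a restart. This is a minimal sketch in Python, not the z/OS system trace format; TraceTable, its methods, and the component names compA and compB are hypothetical.

```python
from collections import deque
import time

class TraceTable:
    """Fixed-size in-memory trace: the oldest entries are overwritten."""

    def __init__(self, capacity=4):
        self._entries = deque(maxlen=capacity)
        self._detail_enabled = set()

    def enable_detail(self, component):
        # Turn on additional tracing for one component while running.
        self._detail_enabled.add(component)

    def record(self, component, event, detail=False):
        # Detail events are skipped unless enabled for that component.
        if detail and component not in self._detail_enabled:
            return
        self._entries.append((time.monotonic(), component, event))

    def dump(self):
        # Return the surviving entries, oldest first, without timestamps.
        return [(component, event) for _, component, event in self._entries]

trace = TraceTable(capacity=3)
trace.record("compA", "module entered")
trace.record("compA", "buffer contents", detail=True)  # skipped: detail is off
trace.enable_detail("compA")
trace.record("compA", "buffer contents", detail=True)  # now recorded
trace.record("compB", "module entered")
trace.record("compB", "module exited")
snapshot = trace.dump()
```

Because the table is bounded and lives in memory, recording an entry is cheap, which is the property that makes it reasonable to leave such tracing active by default.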
UNIX systems typically do not have tracing enabled by default. Additionally, in most UNIX systems tracing can be prohibitive from a performance perspective. Furthermore, tracing is typically activated by specifying a program variable and restarting the application. This is disruptive and would be unacceptable in a z/OS environment. All this makes UNIX problem determination more difficult, which can result in a less serviceable environment than z/OS.
Processes and address spaces can hang for any number of reasons. One example is a deadlock, which can arise between applications or in the latches obtained by the UNIX kernel (operating system) when two threads each try to acquire system-level locks for resources held by the other.
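A deadlock of this kind can be pictured as a cycle in a "waits for" relationship between threads. The following Python sketch finds such a cycle under the simplifying assumption that each thread waits on at most one other thread; find_deadlock and the thread names are hypothetical and do not represent how any particular kernel tracks latches.

```python
def find_deadlock(waits_for):
    """Return the list of threads forming a wait cycle, or None.

    waits_for maps a thread to the thread that holds the resource it wants.
    """
    for start in waits_for:
        seen = []
        node = start
        # Follow the chain of waiters until it ends or repeats.
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            # The chain came back to an earlier thread: that is the cycle.
            return seen[seen.index(node):]
    return None

# T1 wants a lock held by T2 while T2 wants a lock held by T1.
cycle = find_deadlock({"T1": "T2", "T2": "T1"})
# T1 waits for T2, but T2 is not waiting on anyone, so no deadlock.
no_cycle = find_deadlock({"T1": "T2"})
```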
Part of the fundamental z/OS recovery process is the cleanup of any resources that programs might have acquired. By cleaning up these resources, programs can free memory that was obtained, hardware and software locks can be released, and data sets (files) can be closed. If a program encounters an error before it has the opportunity to clean up its resources on its own, or if the application does not invoke its own recovery routine, the system Recovery Termination Manager (RTM) is designed to invoke End of Task and End of Memory processing to release the locks, latches, and other system resources. In addition, z/OS has included extensive logic to be able to restart canceled UNIX System Services processes and (non-USS) address spaces without requiring re-IPLs.
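The principle behind End of Task processing is that the system, not the program, keeps the record of what each task holds, so cleanup does not depend on the failing program running any code of its own. The Python sketch below illustrates that principle only; ResourceManager and its methods are hypothetical names and do not correspond to RTM interfaces.

```python
import io
import threading

class ResourceManager:
    """Tracks resources per task so they can be released on the task's behalf."""

    def __init__(self):
        self._owned = {}  # task id -> list of (resource name, release function)

    def acquire(self, task, name, release):
        self._owned.setdefault(task, []).append((name, release))

    def end_of_task(self, task):
        # Release everything the task still holds, most recent first.
        released = []
        for name, release in reversed(self._owned.pop(task, [])):
            release()
            released.append(name)
        return released

manager = ResourceManager()
task_lock = threading.Lock()
data_set = io.StringIO()

# The task obtains a lock and opens a file, then fails without cleaning up.
task_lock.acquire()
manager.acquire("task1", "lock", task_lock.release)
manager.acquire("task1", "dataset", data_set.close)

released_names = manager.end_of_task("task1")
```

Once end_of_task has run, the lock is available to other work and the file is closed, even though the failing task never reached its own cleanup logic.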
On the other hand, the UNIX operating system typically does not track or recover from deadlock conditions on the latches. Even if one or both of these threads are canceled, the problem is not resolved. UNIX still considers those latches and locks held because it does not go through the code to return the resources. The only thing left to do is to reboot.
Two aspects of security that can affect availability are protecting the system from hackers who are actively trying to infiltrate it and, a more common problem, protecting it from internal employees who accidentally make unexpected changes. z/OS with the optional feature Security Server (RACF), aided by the design of the operating system, can address both of these concerns in an efficient, manageable package.
For example, if a hacker gains access by compromising a user's password, then the hacker has access to only those resources that that particular user controls. Since most users do not control general system resources, the risk of damage to the system is minimal. In addition, z/OS security has effective audit tools designed so that hackers poking around trying to discover weaknesses will often immediately draw attention to themselves. For example, RACF realtime functions can be set to disable a user ID that exceeds an administratively established failed password attempt threshold. In addition, automation can intercept system messages generated when failed resource access attempts occur and be programmed to take immediate action.
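The failed password threshold can be illustrated with a small counter per user ID. This Python sketch is a conceptual model only, not RACF code or configuration; LogonMonitor, its parameter max_failures, and the choice to revoke when the count reaches the limit are assumptions made for the example.

```python
class LogonMonitor:
    """Revokes a user ID after too many consecutive failed password attempts."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self._failures = {}
        self.revoked = set()

    def attempt(self, user_id, password_ok):
        if user_id in self.revoked:
            return "revoked"          # even a correct password is refused
        if password_ok:
            self._failures[user_id] = 0
            return "ok"
        self._failures[user_id] = self._failures.get(user_id, 0) + 1
        if self._failures[user_id] >= self.max_failures:
            self.revoked.add(user_id)  # an administrator must re-enable the ID
            return "revoked"
        return "failed"

monitor = LogonMonitor(max_failures=3)
outcomes = [monitor.attempt("USER1", False) for _ in range(3)]
after_revoke = monitor.attempt("USER1", True)
```

In this model a guessing attack on one user ID stops after three tries, and the revoked ID itself becomes a visible signal that someone has been probing.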
Security is built into the system
There are many functions that are critical to providing a security-rich environment. These include encryption of passwords as users sign on to the system, encrypting data being transmitted between servers, protecting system and user files from unauthorized access, auditing capabilities, and protecting the security manager database itself. Many of these functions are available on UNIX systems only by taking overt action and may require purchasing and installing independent software vendor products or open source applications. As such, many customers may not know they need to take these actions, possibly resulting in an insecure system. In addition, these may add unacceptable performance bottlenecks to the UNIX system. In the worst case, a security modification or addition could require rebuilding the kernel to install the packages, which may invalidate any service contracts with the operating system provider. z/OS security interfaces to an external security manager like RACF are designed, architected, and integrated into the product, helping to optimize performance, and are generally much easier to set up and manage. In addition, being a centralized server, this is designed to only have to be done once.
Digital certificates can be a basic building block of a trusted infrastructure supporting secure transactions over the Internet. Public Key Infrastructure (PKI) services provide for the life-cycle management of digital certificates. These services include creation (or signing) of digital certificates and renewal after a period of time. On other platforms, applications using digital certificates may require the customer to go to third-party vendors to purchase and manage certificates, which could be costly on a per-certificate basis. This may be a significant expense for a customer who wants to use digital certificates to establish secure identification of and communication with large numbers of its customers or business partners. With the PKI services that are built into z/OS, customers have the capability to create and manage their own digital certificates in whatever numbers they choose. Cost-effective z/OS PKI services can make digital certificate based security an option where it may not have otherwise been.
On distributed systems, there is system traffic traveling everywhere. Often, passwords can travel in the clear within a company's intranet (unless someone took the added effort to encrypt them as noted earlier). Using simple technology, a hacker could tap into a line to steal data and passwords. Hundreds of these distributed systems can be consolidated on a single zSeries server, with data being passed between them inside the box using virtual network technology such as HyperSockets and Guest LANs. Even between z/OS images on different servers within a Parallel Sysplex, data can be exchanged using the Coupling Facility (CF) links. All this helps protect z/OS and zSeries from physical network attacks.
Intrusion Detection Services (IDS) is part of the Communications Server, a base element of z/OS. IDS is built within the TCP/IP stack and is designed to discover and defend against attacks such as scans, single packet attacks, and flooding. It automatically logs this activity for later analysis.
z/OS and RACF are designed to protect resources by default. This security can be applied across the system and managed through access control lists. This applies equally to the standard z/OS files and to the z/OS hierarchical file systems.
The IBM eServer zSeries™ Parallel Sysplex® clustering environment is designed to allow concurrent read/write access to shared data from all processing nodes in a configuration without sacrificing performance or data integrity. Each node can be configured to concurrently cache shared data in local processor memory through hardware assisted cluster-wide serialization and coherency controls. As a result, work requests that are associated with a single workload, such as business transactions or database queries, can be dynamically distributed for parallel execution on nodes in the sysplex cluster based on available processor capacity. In the event of a hardware or software outage, either planned or unplanned, data sharing workloads can be dynamically redirected to available servers, providing near continuous application availability.
Another significant advantage of using Parallel Sysplex technology is the ability to enable nondisruptive hardware and software maintenance and installation. Through data sharing and dynamic workload management, servers can be dynamically removed from or added to the cluster, which can allow installation and maintenance activities to be performed while the remaining systems continue to process work. Furthermore, by adhering to the IBM software and hardware coexistence policy, software and hardware upgrades can be introduced one system at a time. This capability can allow you to roll changes through systems at a pace that makes sense for the business. The ability to perform rolling hardware and software maintenance nondisruptively can also help the business to implement critical functions and react to rapid growth without necessarily affecting application availability.
The Automatic Restart Manager (ARM) policy can be used to automatically restart failed applications on any image across the sysplex.
One of the key requirements for a Parallel Sysplex cluster is to provide a single system image to the end user. For the TCP/IP interface, this is done using Virtual IP Addressing (VIPA). If there is a TCP/IP or system outage, one of the backup stacks is designed to automatically activate (take over) the IP address from the stack that suffered the outage. Normal client recovery action will attempt to reconnect to the same IP address and the client will be able to quickly establish a new connection to the backup stack. In addition, since the backup stack immediately notifies the client that the old connection has been terminated, this can be done without waiting for the normal TCP timeouts. The Sysplex Distributor can help provide dynamic workload balancing as well as simplifying the management of the network.
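The takeover sequence can be modeled as a virtual address whose owner changes when the current owning stack fails, while the client keeps reconnecting to the same address. The Python sketch below is a toy simulation under that assumption, not Communications Server configuration; VipaCluster, the stack names, and the sample address are hypothetical.

```python
class VipaCluster:
    """Toy model: one virtual IP address owned by the first available stack."""

    def __init__(self, vipa, stacks):
        self.vipa = vipa
        self._available = list(stacks)  # first entry is the current owner

    def owner(self):
        return self._available[0] if self._available else None

    def fail(self, stack):
        # An outage removes the stack; the next backup takes over the address.
        self._available.remove(stack)

    def connect(self, address):
        if address != self.vipa or not self._available:
            raise ConnectionError("no stack is serving " + address)
        return self.owner()

cluster = VipaCluster("10.1.1.1", ["STACK1", "STACK2", "STACK3"])
first = cluster.connect("10.1.1.1")    # served by the primary stack
cluster.fail("STACK1")                 # TCP/IP or system outage
second = cluster.connect("10.1.1.1")   # same address, now served by a backup
```

The client never learns a new address; only the stack behind the address changes, which is what preserves the single system image.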
Parallel Sysplex technology builds on and extends the strengths of zSeries e-business servers with nearly linear scalability to create the industry's most powerful commercial processing clustered environment.
Since 1993 (with MVS/ESA SP V4.3), z/OS has also been a UNIX platform. It supports hierarchical file systems (HFS and the faster-performing zFS) and has the same look and feel as other UNIX platforms from a user and application point of view. Yet, there are differences. While UNIX applications can run on z/OS just as they can run on other platforms, they can also enjoy the added qualities of service already mentioned that z/OS provides, as well as the z/OS Workload Manager and other features.
z/OS can provide availability benefits even for simple functions like checking or recovering from a bad pointer. Although sigaction() or siglongjmp() UNIX validity checking calls can be utilized to check for a bad address, these methods are rarely used. In z/OS, even if the subsystem is given bad data, the recovery routines are designed to protect the operating system, UNIX kernel, and subsystems. On a UNIX platform, a bad pointer may cause an error in the subsystem code which could in turn cause a system dump, requiring a re-IPL.
If an application should get hung, the only UNIX recourse is to "Kill" the process. Sometimes Kill doesn't work. Since the UNIX implementation on z/OS uses z/OS services, there is an additional recovery step that can be taken: the z/OS UNIX "Superkill" command. Superkill invokes z/OS services designed to cancel the address space running the UNIX process, and z/OS End of Task and End of Memory services will release system resources that were held. In addition, many UNIX servers running on z/OS are designed to clean up after themselves so they can be restarted without requiring a re-IPL.
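The conventional UNIX escalation, a polite termination request followed by a forced kill, can be sketched with the Python standard library. This shows only the generic part of the sequence; the additional Superkill step is specific to z/OS UNIX and is not modeled here. The function name stop_process and the grace period are assumptions for the example.

```python
import subprocess
import sys

def stop_process(proc, grace_seconds=2.0):
    """Request termination first, then force a kill if the process lingers."""
    proc.terminate()                     # SIGTERM on UNIX platforms
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()                      # SIGKILL on UNIX platforms
        proc.wait()
        return "killed"

# Start a child that would otherwise sleep for a minute, then stop it.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
outcome = stop_process(child)
```

On a conventional UNIX system this is where the options end if the forced kill does not take effect, which is the gap the extra z/OS step is meant to close.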
In general, UNIX systems and subsystems usually do not take advantage of any specialized hardware because they are written to be general programs, running on any of the standardized UNIX and Linux platforms. DB2 for z/OS, on the other hand, has been designed to be optimized and tightly integrated with the z/OS operating system and the zSeries platform to help provide better performance and the ability to take advantage of platform-specific availability options such as Parallel Sysplex clustering.
The pressure that drove the evolution of the z/OS operating system came from requirements for a robust 24x7x365 environment to run mission critical workloads where frequent re-IPLs are unacceptable. Application outages (not just system outages) can mean thousands of dollars of lost business for each minute of outage for many business environments. Our customers demand a secure, reliable, serviceable, and available system, and z/OS delivers.