A highly available operating system is not something that just happens.
It must be designed and planned from the start.
z/OS includes basic reliability, availability, and serviceability (RAS)
features that have been core components, not just for years, but for decades.
The z/OS operating system
z/OS prides itself on having good first failure data capture (FFDC) techniques,
designed to capture data and identify problems that occur during production
without impacting system availability.
This functionality has been a core part of the operating system design since
1974.
Because of its long history, this functionality is sometimes taken for granted.
In z/OS, if a task (comparable to a UNIX thread) fails, the system is designed
to quickly make a copy of the running program, together with its data and
system control blocks, to another location in memory.
This data can then be dumped to disk or tape for later analysis.
For user applications and many subsystems, the dump is designed to be created
without impacting unrelated work on the system.
Even for system-level problems, the focus is on collecting the data with the
least impact on other work.
After a problem occurs and the dump is taken, z/OS and the subsystems typically
either remain up or else can be restarted without a re-IPL.
This capability is provided by recovery routines, such as ESTAE routines and
functional recovery routines (FRRs), that are given control when an error occurs.
These routines are designed to capture diagnostic data, identify and restore
modified system structures, resources, and locks held by the failed user, and
return control to the program at a specified point.
These serviceability features are available to be used by ISV and user
applications as well.
Contrast this with the UNIX environment. When a problem occurs in the
operating system code, in a subsystem, or even in some applications, the UNIX
system typically takes a system (kernel) dump and then stops, requiring a
re-IPL.
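The recovery flow described above (capture diagnostic data, release what the
failed work still holds, and resume at a specified point) can be sketched
conceptually as follows. This is only an illustration in Python, not the ESTAE
or FRR interface: real recovery routines are established through z/OS system
services, and every name here (RecoveryContext, run_with_recovery, QUEUE_LOCK)
is hypothetical.

```python
import traceback
from dataclasses import dataclass, field


@dataclass
class RecoveryContext:
    """Hypothetical stand-in for the state a recovery routine inspects."""
    held_locks: set = field(default_factory=set)
    diagnostics: list = field(default_factory=list)


def run_with_recovery(work, retry, ctx):
    """Run `work`; on failure capture diagnostics, release locks, then retry.

    Conceptual only: this models the idea with Python exceptions and is not
    how z/OS recovery routines are actually defined.
    """
    try:
        return work(ctx)
    except Exception as exc:
        # 1. Capture diagnostic data at the point of first failure.
        lines = traceback.format_exception_only(type(exc), exc)
        ctx.diagnostics.append("".join(lines).strip())
        # 2. Release resources that the failed work still holds.
        ctx.held_locks.clear()
        # 3. Return control at the specified retry point.
        return retry(ctx)


def failing_work(ctx):
    ctx.held_locks.add("QUEUE_LOCK")
    raise ValueError("bad control block")


def retry_point(ctx):
    return "resumed at retry point"


ctx = RecoveryContext()
result = run_with_recovery(failing_work, retry_point, ctx)
```

After the call, the work has resumed at the retry point, the diagnostic record
is preserved in ctx.diagnostics, and no locks remain held.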
One important aid to problem determination is the system trace. (Think of this
as the z/OS flight recorder!) Various z/OS component and system-level traces
are active by default. Information about which module was entered and by which
transaction is efficiently written to a trace table in memory, where it can
later be dumped and analyzed. In addition, more detailed tracing can be turned
on component by component, without a restart, to gather supplemental data.
The key point is that this tracing is designed to be efficient, without
significant impact on overall performance.
UNIX systems typically do not have tracing enabled by default. Additionally, on
most UNIX systems tracing can be prohibitively expensive from a performance
perspective. Furthermore, tracing is typically activated by setting a program
variable and restarting the application. This is disruptive and would be
unacceptable in a z/OS environment. All this makes UNIX problem determination
more difficult, which can result in a less serviceable environment than z/OS.
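The "flight recorder" idea, an always-on trace table in memory whose detail
level can be raised per component without a restart, can be illustrated with a
small ring buffer. This is a minimal sketch under stated assumptions, not the
z/OS system trace implementation; the class TraceTable, the component name
COMP1, and the module names are all hypothetical.

```python
from collections import deque
from itertools import count


class TraceTable:
    """Fixed-size in-memory trace table; the oldest entries are overwritten."""

    def __init__(self, capacity=4):
        self._entries = deque(maxlen=capacity)
        self._seq = count(1)
        self.detail_enabled = set()   # components with extra tracing turned on

    def trace(self, component, event, detail=False):
        # Detail records are kept only for components switched on at run time.
        if detail and component not in self.detail_enabled:
            return
        self._entries.append((next(self._seq), component, event))

    def dump(self):
        """Return a snapshot of the table for later analysis."""
        return list(self._entries)


table = TraceTable(capacity=4)
for module in ("MODA", "MODB", "MODC"):
    table.trace("COMP1", f"entered {module}")
table.trace("COMP1", "buffer contents", detail=True)   # dropped: detail is off
table.detail_enabled.add("COMP1")                      # enabled without a restart
table.trace("COMP1", "buffer contents", detail=True)   # now recorded
table.trace("COMP1", "entered MODD")
snapshot = table.dump()
```

Because the table has a fixed capacity, recording stays cheap and bounded: the
snapshot holds only the four most recent records, which is usually what is
needed to see what happened just before a failure.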
System cleanup
Processes and address spaces can hang for any number of reasons. One example
is a deadlock between applications, or in the latches obtained by the UNIX
kernel (operating system), when two threads each try to acquire system-level
locks for resources held by the other.
Part of the fundamental z/OS recovery process is the cleanup of any resources that
programs might have acquired. By cleaning up these resources, programs can free
memory that was obtained, hardware and software locks can be released, and
data sets (files) can be closed. If a program encounters an error before it
has the opportunity to clean up its resources on its own or if the application
does not invoke its own recovery routine, the system recovery termination
manager (RTM) is designed to invoke the End of Task and End of Memory
processing to release the locks, latches, and other system resources. In
addition, z/OS has included extensive logic to be able to restart canceled UNIX
System Services processes and (non-USS) address spaces without requiring
re-IPLs.
On the other hand, UNIX operating systems typically do not track or recover
from deadlock conditions on the latches. Even if one or both of these threads
are canceled, the problem is not resolved. UNIX still considers those latches
and locks held because it does not run the code that returns the resources.
The only remaining option is to reboot.
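The cleanup principle described above, in which the system releases whatever a
failed task still holds when the task never got the chance to do so itself, can
be sketched as a per-task resource registry. This is an illustrative model
only, not RTM or the End of Task and End of Memory services; ResourceTracker
and the task and resource names are hypothetical.

```python
class ResourceTracker:
    """Hypothetical registry of the resources owned by each task."""

    def __init__(self):
        self._owned = {}      # task id -> list of (resource name, release callable)
        self.released = []    # order in which resources were released

    def acquire(self, task_id, name, release):
        self._owned.setdefault(task_id, []).append((name, release))

    def end_of_task(self, task_id):
        """Release everything the task still holds, most recent first."""
        for name, release in reversed(self._owned.pop(task_id, [])):
            release()
            self.released.append(name)


tracker = ResourceTracker()
state = {"lock_held": True, "file_open": True}
tracker.acquire("TASK1", "lock", lambda: state.update(lock_held=False))
tracker.acquire("TASK1", "file", lambda: state.update(file_open=False))

try:
    raise RuntimeError("task failed before its own cleanup ran")
except RuntimeError:
    tracker.end_of_task("TASK1")   # system-driven cleanup on behalf of the task

tracker.end_of_task("TASK1")       # a second call finds nothing left to release
```

Because ownership is tracked centrally, canceling the task is enough to free
its lock and close its file, so the resources do not stay stranded.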
z/OS security
Two aspects of system security can affect availability: protecting the system
from hackers who are actively trying to infiltrate it and, a more common
problem, protecting it from internal employees who accidentally make
unexpected changes. z/OS with the optional Security Server (RACF) feature,
aided by the design of the operating system, can address both of these concerns
in an efficient, manageable package.
For example, if a hacker gains access by compromising a user's password, the
hacker has access to only those resources that the particular user controls.
Since most users do not control general system resources, the risk of damage
to the system is minimal. In addition, z/OS security has effective audit tools
designed so that hackers poking around trying to discover weaknesses will
often immediately draw attention to themselves. For example, RACF real-time
functions can be set to disable a user ID that exceeds an administratively
established threshold of failed password attempts. In addition, automation can
intercept the system messages generated when failed resource access attempts
occur and can be programmed to take immediate action.
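The failed-password threshold described above amounts to a per-user counter
with a revoke action. The following is a minimal sketch of that logic and does
not use any RACF interface. The class name, the return values, and the choice
to revoke when the count reaches the threshold (and to reset the count on a
successful logon) are assumptions made for illustration.

```python
class FailedLogonMonitor:
    """Revoke a user ID once consecutive failed attempts reach a threshold."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = {}    # user id -> consecutive failed attempts
        self.revoked = set()

    def record_attempt(self, user_id, success):
        if user_id in self.revoked:
            return "REVOKED"                   # stays revoked until an administrator acts
        if success:
            self.failures.pop(user_id, None)   # a good logon resets the count
            return "OK"
        self.failures[user_id] = self.failures.get(user_id, 0) + 1
        if self.failures[user_id] >= self.threshold:
            self.revoked.add(user_id)
            return "REVOKED"
        return "FAILED"


monitor = FailedLogonMonitor(threshold=3)
outcomes = [monitor.record_attempt("USER1", success=False) for _ in range(3)]
after_revoke = monitor.record_attempt("USER1", success=True)
```

Once the user ID is revoked, even a correct password is rejected, so a
password-guessing attack stops being productive after a small, fixed number of
attempts.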
Security is built into the system
There are many functions that are critical to providing a security-rich
environment. These include encrypting passwords as users sign on to the
system, encrypting data being transmitted between servers, protecting system
and user files from unauthorized access, providing auditing capabilities, and
protecting the security manager database itself. Many of these functions are
available on UNIX systems only by taking overt action, and they may require
purchasing and installing independent software vendor products or open source
applications. Many customers may not know they need to take these actions,
possibly resulting in an insecure system. In addition, these products may add
unacceptable performance bottlenecks to the UNIX system. In the worst case, a
security modification or addition could require rebuilding the kernel to
install the packages, which may invalidate any service contracts with the
operating system provider. The z/OS security interfaces to an external
security manager such as RACF are designed, architected, and integrated into
the product, helping to optimize performance, and are generally much easier to
set up and manage. In addition, because z/OS is a centralized server, this
setup is designed to be done only once.
Digital certificates can be a basic building block of a trusted infrastructure
supporting secure transactions over the Internet. Public Key Infrastructure
(PKI) services provide for the life-cycle management of digital certificates.
These services include creation (or signing) of digital certificates and
renewal after a period of time. On other platforms, applications that use
digital certificates may require the customer to go to third-party vendors to
purchase and manage certificates, which could be costly on a per-certificate
basis. This may be a significant expense for a customer who wants to use
digital certificates to establish secure identification of and communication
with large numbers of its customers or business partners. With the PKI
services that are built into z/OS, customers have the capability to create and
manage their own digital certificates in whatever numbers they choose.
Cost-effective z/OS PKI services can make digital certificate based security
an option where it may not otherwise have been.
On distributed systems, there is system traffic traveling everywhere. Often,
passwords can travel in the clear within a company's intranet (unless someone
took the added effort to encrypt them, as noted earlier). Using simple
technology, a hacker could tap into a line to steal data and passwords.
Hundreds of these distributed systems can be consolidated on a single zSeries
server, with data passed between them inside the box using virtual network
technology such as HyperSockets and Guest LANs. Even between z/OS images on
different servers within a Parallel Sysplex, data can be exchanged using the
CF links. All this helps protect z/OS and zSeries from physical network
attacks.
Intrusion Detection Services (IDS) is part of Communications Server, a base
element of z/OS. IDS is built within the TCP/IP stack and is designed to
discover and defend against attacks such as scans, single packet attacks, and
flooding. It automatically logs this activity for later analysis.
z/OS and RACF are designed to protect resources by default. This security can
be applied across the system and managed through access control lists. It
applies equally to the standard z/OS files and to the z/OS hierarchical file
systems.
Parallel Sysplex
The IBM eServer zSeries Parallel Sysplex® clustering environment is designed
to allow concurrent read/write access to shared data from all processing nodes
in a configuration without sacrificing performance or data integrity. Each node
can be configured to concurrently cache shared data in local processor memory
through hardware assisted cluster-wide serialization and coherency controls. As
a result, work requests that are associated with a single workload, such as
business transactions or database queries, can be dynamically distributed for
parallel execution on nodes in the sysplex cluster based on available processor
capacity. In the event of a hardware or software outage, either planned or
unplanned, data sharing workloads can be dynamically redirected to available
servers, providing near continuous application availability.
Another significant advantage of using Parallel Sysplex technology is the
ability to enable nondisruptive hardware and software maintenance and
installation. Through data sharing and dynamic workload management, servers can
be dynamically removed from or added to the cluster, which can allow
installation and maintenance activities to be performed while the remaining
systems continue to process work. Furthermore, by adhering to the IBM software
and hardware coexistence policy, software and hardware upgrades can be
introduced one system at a time. This capability can allow you to roll changes
through systems at a pace that makes sense for the business. The ability to
perform rolling hardware and software maintenance nondisruptively can also help
the business to implement critical functions and react to rapid growth without
necessarily affecting application availability.
The Automatic Restart Manager (ARM) policy can be used to automatically
restart failed applications on any image across the sysplex.
One of the key requirements for a Parallel Sysplex cluster is to provide a
single system image to the end user. For the TCP/IP interface, this is done
using Virtual IP Addressing (VIPA). If there is a TCP/IP or system outage, one
of the backup stacks is designed to automatically activate (take over) the IP
address from the stack that suffered the outage. Normal client recovery action
will attempt to reconnect to the same IP address and the client will be able to
quickly establish a new connection to the backup stack. In addition, since the
backup stack immediately notifies the client that the old connection has been
terminated, this can be done without waiting for the normal TCP timeouts. The
Sysplex Distributor can help provide dynamic workload balancing and simplify
the management of the network.
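The takeover behavior described above can be pictured as a virtual address
whose owning stack changes while the address itself stays the same. The sketch
below is a toy model under stated assumptions, not the Communications Server
implementation; VipaTable, the stack names, and the documentation-range
address 192.0.2.10 are hypothetical.

```python
class VipaTable:
    """Toy model: one virtual IP address, an owning stack, and ordered backups."""

    def __init__(self, address, owner, backups):
        self.address = address
        self.owner = owner
        self.backups = list(backups)

    def stack_failed(self, stack):
        if stack == self.owner:
            # The first available backup takes over the address.
            self.owner = self.backups.pop(0) if self.backups else None
        elif stack in self.backups:
            self.backups.remove(stack)

    def connect(self):
        """A client always targets the same address; the owner may differ."""
        return (self.address, self.owner)


vipa = VipaTable("192.0.2.10", "STACK_A", ["STACK_B", "STACK_C"])
before = vipa.connect()
vipa.stack_failed("STACK_A")   # outage on the owning stack
after = vipa.connect()         # the client reconnects to the same address
```

The client's recovery logic does not change: it reconnects to the address it
already knows, and the connection is now served by the backup stack.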
Parallel Sysplex technology builds on and extends the strengths of zSeries
e-business servers with nearly linear scalability to create the industry's most
powerful commercial processing clustered environment.
Middleware
Since 1993 (with MVS/ESA SP V4.3), z/OS has also been a UNIX platform. It
supports hierarchical file systems (HFS and the faster-performing zFS) and has
the same look and feel as other UNIX platforms from a user and application
point of view.
Yet, there are differences. While UNIX applications can run on z/OS just as
they can run on other platforms, they can also enjoy the added qualities of
service already mentioned that z/OS provides, as well as the z/OS Workload
Manager and other features.
z/OS can provide availability benefits even for simple functions like checking
or recovering from a bad pointer. Although sigaction() or siglongjmp() UNIX
validity checking calls can be utilized to check for a bad address, these
methods are rarely used. In z/OS, even if the subsystem is given bad data, the
recovery routines are designed to protect the operating system, UNIX kernel,
and subsystems. On a UNIX platform, a bad pointer may cause an error in the
subsystem code which could in turn cause a system dump, requiring a re-IPL.
If an application hangs, the only UNIX recourse is to "Kill" the process.
Sometimes Kill does not work. Because the UNIX implementation on z/OS uses z/OS
services, there is an additional recovery step that can be taken: the z/OS UNIX
"Superkill" command. Superkill invokes z/OS services designed to cancel the
address space running the UNIX process, and z/OS End of Task and End of Memory
services release the system resources that were held. In addition, many UNIX
servers running on z/OS are designed to clean up after themselves so that they
can be restarted without requiring a re-IPL.
In general, UNIX systems and subsystems usually do not take advantage of any
specialized hardware because they are written to be general programs, running
on any of the standardized UNIX and Linux platforms. DB2 for z/OS, on the other
hand, has been designed to be optimized and tightly integrated with the z/OS
operating system and the zSeries platform to help provide better performance
and the ability to take advantage of platform-specific availability options
such as Parallel Sysplex clustering.
z/OS delivers
The pressure that drove the evolution of the z/OS operating system came from
requirements for a robust 24x7x365 environment to run mission critical
workloads where frequent re-IPLs are unacceptable. In many business
environments, application outages (not just system outages) can mean thousands
of dollars of lost business for each minute of outage. Our customers demand a
secure, reliable, serviceable, and available system, and z/OS delivers.