What is a Parallel Sysplex?
The z Systems Parallel Sysplex cluster contains innovative multisystem data sharing technology. It allows direct, concurrent read/write access to shared data from all processing nodes in the configuration without sacrificing performance or data integrity. Each node can concurrently cache shared data in local processor memory through hardware-assisted cluster-wide serialization and coherency controls. As a result, work requests that are associated with a single workload, such as business transactions or database queries, can be dynamically distributed for parallel execution on nodes in the sysplex cluster based on available processor capacity.
Parallel Sysplex technology builds on and extends the strengths of z Systems e-business servers by linking up to 32 servers with near-linear scalability to create the industry's most powerful commercial processing clustered system. Every server in a Parallel Sysplex cluster has access to all data resources and every "cloned" application can run on every server. Using the z Systems "Coupling Technology," the Parallel Sysplex technology provides a "shared data" clustering technique that permits multi-system data sharing with high-performance read/write integrity. This "shared data" (as opposed to "shared nothing") approach enables workloads to be dynamically balanced across all servers in the Parallel Sysplex cluster. This approach allows critical business applications to take advantage of the aggregate capacity of multiple servers to help ensure maximum system throughput and performance during peak processing periods. In the event of a hardware or software outage, either planned or unplanned, workloads can be dynamically redirected to available servers, thus providing near-continuous application availability.
Another significant and unique advantage of using Parallel Sysplex technology is the ability to perform hardware and software maintenance and installations in a nondisruptive manner. Through data sharing and dynamic workload management, servers can be dynamically removed from or added to the cluster, allowing installation and maintenance activities to be performed while the remaining systems continue to process work. Furthermore, by adhering to IBM's software and hardware coexistence policy, software and hardware upgrades can be introduced one system at a time. This capability allows customers to roll changes through systems at a pace that makes sense for their business. The ability to perform rolling hardware and software maintenance in a nondisruptive manner allows businesses to implement critical business functions and react to rapid growth without affecting customer availability.
Parallel Sysplex technology is an enabling technology, allowing highly reliable, redundant, and robust z Systems technologies to achieve near continuous availability. A properly configured Parallel Sysplex cluster is designed to have no single points of failure, for example:
The Parallel Sysplex is a way of managing this multi-system environment, providing benefits of:
Within a Parallel Sysplex cluster it is possible to construct a parallel processing environment with no single points of failure. Since all systems in the Parallel Sysplex can have concurrent access to all critical applications and data, the loss of a system due to either hardware or software failure does not necessitate loss of application availability. Peer instances of a failing subsystem executing on remaining healthy system nodes can take over recovery responsibility for resources held by the failing instance. Alternatively, the failing subsystem can be automatically restarted on still-healthy systems using automatic restart capabilities to perform recovery for work in progress at the time of the failure. While the failing subsystem instance is unavailable, new work requests can be redirected to other data-sharing instances of the subsystem on other cluster nodes to provide continuous application availability across the failure and subsequent recovery. This provides the ability to mask planned as well as unplanned outages to the end user.
Because of the redundancy in the configuration, there is a significant reduction in the number of single points of failure. Without a Parallel Sysplex, the loss of a CEC could severely impact the performance of an application, as well as introduce system management difficulties in redistributing the workload or reallocating resources until the failure is repaired. In a Parallel Sysplex environment, the loss of a CEC may be transparent to the application, and the CEC's workload can be redistributed automatically within the Parallel Sysplex with little performance degradation. Therefore, events that otherwise would seriously impact application availability, such as failures in CEC hardware elements or critical operating system components, have reduced impact in a Parallel Sysplex environment.
Even though they work together and present a single image, the nodes in a Parallel Sysplex cluster remain individual systems, making installation, operation, and maintenance non-disruptive. You can introduce changes, such as software upgrades, one system at a time while the remaining systems continue to process work. This allows you to roll changes through your systems at a pace that makes sense for your business.
The Parallel Sysplex environment can scale near linearly from 2 to 32 systems. This can be a mix of any servers that support the Parallel Sysplex environment. The aggregated capacity of this configuration meets every processing requirement known today.
The entire Parallel Sysplex cluster can be viewed as a single logical resource to end users and business applications. Just as work can be dynamically distributed across the individual processors within a single SMP server, so too can work be directed to any node in a Parallel Sysplex cluster having available capacity. This avoids the need to partition data or applications among individual nodes in the cluster or to replicate databases across multiple servers.
Workload balancing also permits you to run diverse applications across a Parallel Sysplex cluster while maintaining the response levels critical to your business. You select the service level agreements required for each workload, and the z/OS Workload Manager (WLM), along with subsystems such as CP/SM or IMS, automatically balances tasks across all the resources of the Parallel Sysplex cluster to meet your business goals. Whether the work arrives from batch, SNA, TCP/IP, DRDA®, or MQSeries® (non-persistent) messages, dynamic session balancing routes each business request to the system best able to process the transaction. This provides the performance and flexibility you need to deliver the responsiveness your customers demand, and it is invisible to the users.
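The routing idea described above can be illustrated with a minimal Python sketch. It is not how WLM is implemented: the `Node` class, the `route` function, the system names, and the capacity units are all hypothetical, and real goal-based balancing also weighs service goals and business importance. The sketch only shows a request being sent to the healthy node with the most spare capacity.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One system in the cluster (hypothetical model, arbitrary capacity units)."""
    name: str
    capacity: float
    load: float = 0.0
    healthy: bool = True

    @property
    def available(self) -> float:
        return self.capacity - self.load


def route(nodes, cost):
    """Place one work request on the healthy node with the most spare capacity."""
    candidates = [n for n in nodes if n.healthy and n.available >= cost]
    if not candidates:
        raise RuntimeError("no healthy node has capacity for this request")
    target = max(candidates, key=lambda n: n.available)
    target.load += cost
    return target.name


# Hypothetical three-system cluster with uneven starting load.
nodes = [
    Node("SYSA", 100, load=80),
    Node("SYSB", 100, load=30),
    Node("SYSC", 100, load=55),
]
placements = [route(nodes, 10) for _ in range(4)]

# Simulate an outage of SYSB: new work is simply routed elsewhere.
nodes[1].healthy = False
after_failure = route(nodes, 10)
```

Because every node can reach all of the data, the router needs no knowledge of where data lives, which is the point of the shared data approach.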
There are several aspects to consider for recovery. First, when a failure occurs, it is important to bypass it by automatically redistributing the workload to utilize the remaining available resources. Secondly, it is necessary to recover the elements of work that were in progress at the time of the failure. Finally, when the failed element is repaired, it should be brought back into the configuration as quickly and transparently as possible to again start processing the workload. Parallel Sysplex technology enables all this to happen.
Once the failing element has been isolated, it is necessary to non-disruptively redirect the workload to the remaining available resources in the parallel sysplex. In the event of failure in the parallel sysplex environment, the OLTP workload is automatically and quickly redistributed without operator intervention.
Generic Resource Management provides the ability to specify to VTAM a common network interface. This can be used for CICS TORs, IMS TM, TSO, or DB2 DDF work.
One of the features of this support is that, for example, if one of the CICS TORs fails, only a subset of the network will be affected. The affected terminals will be able to immediately log on again and continue processing after being connected to a different TOR.
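The behavior of a generic resource can be sketched in Python as follows. This is an illustrative model only, not VTAM's algorithm: the `GenericResource` class, the member names, and the "fewest sessions" selection rule are assumptions made for the example.

```python
class GenericResource:
    """Hypothetical model: one generic name in front of several real instances."""

    def __init__(self, generic_name, members):
        self.generic_name = generic_name
        # Map each real instance (for example a CICS TOR) to its terminals.
        self.sessions = {member: set() for member in members}

    def logon(self, terminal):
        """Bind the terminal to the member that currently has the fewest sessions."""
        if not self.sessions:
            raise RuntimeError("no active member behind " + self.generic_name)
        member = min(self.sessions, key=lambda m: len(self.sessions[m]))
        self.sessions[member].add(terminal)
        return member

    def fail(self, member):
        """Remove a failed member and return the terminals that lost their session."""
        return self.sessions.pop(member)


cics = GenericResource("CICSPROD", ["CICSTOR1", "CICSTOR2", "CICSTOR3"])
for t in ["T1", "T2", "T3", "T4", "T5", "T6"]:
    cics.logon(t)

# Only the terminals bound to the failed TOR are affected.
affected = cics.fail("CICSTOR1")

# They log on again under the same generic name and land on a surviving TOR.
rebound = {t: cics.logon(t) for t in sorted(affected)}
```

Terminals always ask for the generic name, so they do not need to know which TOR they end up on before or after the failure.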
The parallel sysplex solution satisfies a major customer requirement for continuous 24-hour-a-day, 7-day-a-week, 365-days-a-year (24x7x365) availability, while providing techniques for achieving simplified Systems Management consistent with this requirement. Some of the features of the parallel sysplex solution that contribute to increased availability also help to eliminate some Systems Management tasks. Examples include:
The Workload Manager (WLM) provides sysplex-wide workload management capabilities based on installation-specified performance goals and the business importance of the workloads. WLM tries to attain the performance goals through dynamic resource distribution. It provides the Parallel Sysplex cluster with the intelligence to determine where work needs to be processed and at what priority. The priority is based on the customer's business goals and is managed by sysplex technology.
The Sysplex Failure Management policy allows the installation to specify failure detection intervals and recovery actions to be initiated in the event of the failure of a system in the sysplex.
Without Sysplex Failure Management (SFM), when one of the systems in the Parallel Sysplex fails, the operator is notified and prompted to take some recovery action. The operator may choose to partition the non-responding system from the Parallel Sysplex or may choose to take some action to try to recover the system. This period of operator intervention might tie up critical system resources required by the remaining active systems. SFM allows the installation to code a policy that defines the recovery actions to be initiated automatically when specific types of problems are detected. Such actions include fencing off the failed image so that it cannot access shared resources, deactivating a logical partition, and acquiring central storage and expanded storage.
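The difference between prompting the operator and acting on a coded policy can be sketched in Python. This is a conceptual model under stated assumptions, not the SFM policy syntax: `Action`, `SystemRule`, `recovery_action`, the `"*"` default entry, and all interval values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    PROMPT_OPERATOR = "prompt the operator"
    ISOLATE = "fence the image off from shared resources"
    DEACTIVATE_LPAR = "deactivate the logical partition"


@dataclass(frozen=True)
class SystemRule:
    detection_interval_s: int  # how long a system may be silent (hypothetical value)
    action: Action             # what to do once that interval has elapsed


# Behavior when no policy entry applies: the operator is prompted.
NO_POLICY = SystemRule(detection_interval_s=30, action=Action.PROMPT_OPERATOR)


def recovery_action(policy: dict, system: str, silent_for_s: int) -> Optional[Action]:
    """Return the action for a silent system, or None if it is not yet considered failed."""
    rule = policy.get(system, policy.get("*", NO_POLICY))
    if silent_for_s < rule.detection_interval_s:
        return None
    return rule.action


# Hypothetical policy: isolate by default, treat SYSC more leniently.
policy = {
    "*": SystemRule(detection_interval_s=25, action=Action.ISOLATE),
    "SYSC": SystemRule(detection_interval_s=60, action=Action.DEACTIVATE_LPAR),
}
```

With the policy in place, `recovery_action(policy, "SYSA", 30)` returns `Action.ISOLATE` with no operator involved, while an empty policy falls back to `Action.PROMPT_OPERATOR`.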
The Automatic Restart Manager (ARM) enables fast recovery of the subsystems that might hold critical resources at the time of failure. If other instances of the subsystem in the Parallel Sysplex need any of these critical resources, fast recovery will make these resources available more quickly. Even though automation packages are used today to restart the subsystem to resolve such deadlocks, ARM can be activated closer to the time of failure.
The Automatic Restart Manager reduces operator intervention in the following areas:
Cloning refers to replicating the hardware and software configurations across the different physical servers in the Parallel Sysplex. That is, an application that is going to take advantage of parallel processing might have identical instances running on all images in the Parallel Sysplex. The hardware and software supporting these applications could also be configured identically on all systems in the Parallel Sysplex to reduce the amount of work required to define and support the environment.
The concept of symmetry allows new systems to be easily introduced and enables automatic workload distribution in the event of failure or when an individual system is scheduled for maintenance. It also significantly reduces the amount of work required by the systems programmer in setting up the environment. Symmetry does not preclude the need for systems to have unique configuration requirements, such as the asymmetric attachment of printers and communications controllers, or asymmetric workloads that do not lend themselves to the parallel environment.
System symbolics help manage cloning. z/OS supports substitution values in startup parameters, JCL, system commands, and started tasks. These values can be used in parameter and procedure specifications to allow unique substitution when dynamically forming a resource name.
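The substitution mechanism can be illustrated with a small Python sketch. It models only the basic `&NAME.` form, where the trailing period ends the symbol name; the `substitute` function and the symbol values for the two systems are hypothetical, and z/OS itself performs this substitution natively.

```python
import re

# A symbol is an ampersand, a name, and a terminating period, e.g. "&SYSNAME."
SYMBOL = re.compile(r"&([A-Z0-9]+)\.")


def substitute(text, symbols):
    """Replace each known &NAME. with its value; leave unknown symbols untouched."""
    def replace(match):
        return symbols.get(match.group(1), match.group(0))
    return SYMBOL.sub(replace, text)


# Hypothetical per-system values; every other definition is shared by both systems.
sysa = {"SYSNAME": "SYSA", "SYSCLONE": "SA", "SYSPLEX": "PLEX1"}
sysb = {"SYSNAME": "SYSB", "SYSCLONE": "SB", "SYSPLEX": "PLEX1"}

template = "SYS1.&SYSNAME..LOGREC"
name_on_a = substitute(template, sysa)
name_on_b = substitute(template, sysb)
task_on_a = substitute("CICS&SYSCLONE.", sysa)
```

One shared definition therefore yields a unique resource name on each cloned system. The doubled period in the template is deliberate: the first period ends the symbol and the second remains as the qualifier separator.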
A number of base z/OS™ components use the IBM® S/390® Coupling Facility shared storage as a medium for sharing component information for the purpose of multi-system resource management. This exploitation, called IBM eServer z Systems™ Resource Sharing, enables sharing of physical resources such as files, tape drives, consoles, and catalogs, with significant improvements in cost, performance, and simplified systems management. This is not to be confused with Parallel Sysplex® data sharing by the database subsystems. z Systems Resource Sharing delivers immediate value even for customers who are not leveraging data sharing, through native system exploitation delivered with the base z/OS software stack.
One of the goals of the Parallel Sysplex solution is to provide simplified Systems Management by reducing complexity in managing, operating, and servicing a parallel sysplex, without requiring an increase in the number of support staff and without reducing availability.
Even though there could be multiple servers and z/OS images in the parallel sysplex and a mix of different technologies, it is essential that the collection of systems in the parallel sysplex appear as a single entity to the operator, the end-user, the database administrator, and so on. A single system image ensures reduced complexity from both operational and definition perspectives.
Regardless of the number of system images and the complexity of the underlying hardware, the parallel sysplex solution provides for a single system image from several perspectives:
Even though individual hardware elements or entire systems in the parallel sysplex fail, a single system image must be maintained. This means that, as with the concept of single point of control, the presentation of the single system image is not dependent on a specific physical element in the configuration. From the end-user point of view, the parallel nature of applications in the parallel sysplex environment must be transparent. An application should be accessible regardless of which physical z/OS image supports it.
It is a requirement that the collection of systems in the Parallel Sysplex can be managed from a logical single point of control. The term single point of control means the ability to access whatever interfaces are required for the task in question, without reliance on a physical piece of hardware. For example, in a Parallel Sysplex of many systems, it is necessary to be able to direct commands or operations to any system in the Parallel Sysplex, without the necessity for a console or control point to be physically attached to every system in the Parallel Sysplex.
One of the prime goals of Parallel Sysplex is continuous availability. Therefore, it is a requirement that changes such as new applications, software, or hardware can be introduced non-disruptively and that they be able to coexist with current levels in the Parallel Sysplex. In support of compatible change, the hardware and software components of the Parallel Sysplex solution will allow the coexistence of two levels, that is, level N and level N+1. This means, for example, that no IBM software product will make a change that cannot be tolerated by the previous release.
From a hardware perspective, a Parallel Sysplex supports n-2 compatibility for server families. Within any single Parallel Sysplex one can have the current server with the previous family and the family before that. An (n-3) server in the Parallel Sysplex, even if not directly connected to the (n) server, would not be a supported configuration. This is also true for the Server Time Protocol (STP) timing network.
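The n-2 rule can be expressed as a short check. The Python sketch below assumes server families are numbered by generation with plain integers; the `is_supported_mix` function and the generation numbers are hypothetical and do not refer to specific server models.

```python
def is_supported_mix(generations, window=2):
    """Return True if every server family is within `window` generations of the newest.

    `generations` holds one integer per server in the sysplex. The check is
    sysplex-wide, so direct connectivity between two servers does not matter.
    """
    if not generations:
        return True
    newest = max(generations)
    return all(newest - g <= window for g in generations)


# Current family (n) plus n-1 and n-2: supported.
ok = is_supported_mix([14, 13, 12, 14])

# Adding an n-3 server makes the whole configuration unsupported.
not_ok = is_supported_mix([14, 13, 11])
```

The function evaluates the sysplex as a whole, which mirrors the statement that an (n-3) server is unsupported even when it is not directly connected to the (n) server.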
A design goal of Parallel Sysplex clustering is that no application changes are required to take advantage of the technology. For the most part, this has held true, although some CICS affinities need to be investigated to get the maximum advantage from the configuration.
Through this state-of-the-art cluster technology, the power of multiple z/OS systems can be harnessed to work in concert on common workloads. The z Systems Parallel Sysplex cluster takes the commercial strengths of the z/OS platform to improved levels of system management, competitive price/performance, scalable growth, and continuous availability.