System z servers are designed to reduce not only unplanned outages but also planned outages, such as those for service or upgrades. Many of the RAS features of the server target the total picture, not just avoiding unplanned downtime.
In the past, a common reason for planned outages was to apply hardware configuration changes. Now, virtually all of these requirements have been removed. Gone are the days when hardware maintenance meant powering down the server and restarting it with a “Power On Reset” (POR). While production transactions are executing, you can now replace and upgrade key internal components of the server. At the largest scale are “books.” Each processor book contains CPs, memory, and I/O “fan-out” cards. On a server with at least two books, a book can be dynamically pulled and replaced for upgrade or repair while the workload continues to execute. At a finer level, each individual I/O fan-out card within a book is hot-pluggable without loss of I/O connectivity. Every card in the I/O domain, such as FICON port cards, OSA-Express, Crypto, and Coupling links, can be concurrently repaired or replaced. Even power and thermal components, as well as the HMC and Support Element, can be maintained concurrently. While this is happening, transactions continue without missing a beat.
Applying maintenance to software typically requires the software to be “recycled” (brought down and restarted) to pick up changes. In contrast, System z servers are designed so that microcode maintenance and even driver upgrades can be applied while applications continue to execute.
You can grow the server dynamically from a single-book sub-Uni 1-way to a 54-way to accommodate planned, or even unplanned, capacity requirements. Emergency upgrades are possible with Capacity Back-Up (CBU), for example in a disaster recovery situation or after the loss of a server. Since US laws require tests of disaster recovery capability, the CBU contract comes with five tests that can be renewed. CBU even supports specialized processors such as ICFs, IFLs, zIIPs, and zAAPs. For planned events, you can use the customer-initiated On-Off Capacity Upgrade on Demand (OOCUoD) offering. This allows you to upgrade the server to meet end-of-year capacity requirements and downgrade it in the first quarter. With the System z10 and z/OS R10, this is expanded further with the Provisioning Manager. Rules can be set up to define when additional capacity should be provisioned to meet your business needs. This provides a fast response to capacity and workload changes, and helps ensure that processing power is available to meet your business requirements.
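The idea of rule-driven provisioning can be illustrated with a minimal sketch. This is not the Provisioning Manager's actual policy format or interface; the Rule class, its fields, and the cps_to_add function are hypothetical names chosen only for illustration. The sketch assumes a rule of the form "activate extra processors when utilization stays above a threshold for several consecutive samples, without exceeding the server's maximum."

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Rule:
    """Hypothetical provisioning rule (illustrative only)."""
    threshold: float  # utilization fraction that counts as "too busy", e.g. 0.90
    sustained: int    # number of consecutive samples that must exceed the threshold
    add_cps: int      # processors to activate when the rule fires


def cps_to_add(samples: Sequence[float], rule: Rule, active: int, max_cps: int) -> int:
    """Return how many processors to activate, never going past max_cps."""
    if len(samples) < rule.sustained:
        return 0  # not enough history to decide
    recent = samples[-rule.sustained:]
    if all(u > rule.threshold for u in recent):
        # Clamp the request to the headroom left on the server.
        return max(0, min(rule.add_cps, max_cps - active))
    return 0
```

With a rule such as Rule(threshold=0.90, sustained=3, add_cps=2), three consecutive samples above 90% would request two more processors, while a single dip below the threshold within the window would request none.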
A balanced design requires memory and I/O to grow with the CP capacity. This is not a problem with the System z10, because you can concurrently add I/O and memory. After the I/O cards are added, you can modify the I/O configuration definitions, including channel paths, control units, and I/O devices. You can then add LPARs to, or remove them from, a new or existing logical channel subsystem, and dynamically add cryptographic features to existing LPARs. It is like changing an airplane’s engine, wings, fuselage, and cockpit while in flight, and at the same time growing it from a simple two-passenger plane to a 400-passenger luxury liner!
A Fault Tolerant Design
A “Fault Tolerant” design allows the system to continue running after the loss of a single component. Moreover, it is designed to do this without even impacting transactions. System z provides fault tolerance for all of its key components. This includes not just the CPs (Transparent CP sparing), memory (Dynamic memory sparing), and I/O (I/O Interconnect), but also the timing oscillator card, power supply, channel paths, OSA cards, Support Elements (SEs), and others. Internal monitoring detects possible problems, and the system is designed to switch away from a failing component without failing a single transaction.
This is in addition to IBM’s most robust processor design with intelligent retries and a very elaborate and successful design of internal recovery hardware.
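The sparing behavior described in this section can be sketched in a few lines. This is a conceptual model only, not how the System z hardware or firmware is implemented; the SparingPool class and the unit names are hypothetical. It assumes a set of active units plus a pool of spares, where a unit reported as faulty is replaced in the same slot by the next available spare.

```python
class SparingPool:
    """Conceptual model of transparent sparing (illustrative only)."""

    def __init__(self, active, spares):
        self.active = list(active)  # units currently doing work
        self.spares = list(spares)  # idle units held in reserve
        self.failed = []            # units taken out of service

    def report_fault(self, unit):
        """Retire a faulty unit; return its replacement, or None if no spare is left."""
        if unit not in self.active:
            raise ValueError(f"{unit} is not an active unit")
        slot = self.active.index(unit)
        self.failed.append(unit)
        if self.spares:
            replacement = self.spares.pop(0)
            self.active[slot] = replacement  # the spare takes over the same slot
            return replacement
        del self.active[slot]  # no spare available: run with reduced capacity
        return None
```

In this model the workload sees the same number of active units as long as a spare remains, which is the property the text attributes to transparent CP sparing and dynamic memory sparing.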
Server Phone Home
IBM System z servers are able to detect potential error situations before they become problems and phone home (including through web-enabled communications), calling IBM so that a CE can schedule a time to come and replace the potential problem part. This is fully automatic and done by the hardware, and the service itself can be performed without impacting the availability of applications running on z/OS or Linux. This long-standing design feature starts dispatching parts and personnel within seconds of a machine event.