IBM continues to introduce new and advanced RAS (Reliability, Availability, Serviceability) functions in IBM Power™ Systems to improve overall system availability. With advanced functions in fault resiliency, recovery and redundancy design, the impact of hardware failures on the system has been significantly reduced. Because of these attributes, Power Systems continue to be used for server consolidation. For customers experiencing rapid growth in computing needs, the ability to upgrade hardware capacity incrementally with limited disruption becomes an important system capability.
Concurrent add and repair capabilities for Power Systems have been introduced incrementally since 1997, starting with power supplies, fans, I/O devices, PCI adapters and I/O enclosures/drawers. In 2008, IBM introduced significant enhancements to the enterprise Power Systems 595 and 570 that featured the ability to add or upgrade system capacity and to repair the central electronic complex (CEC) without powering down the system. The CEC is the heart of a large computer system and includes the processors, memory and I/O hubs (GX adapters).
The IBM Power 770, 780 and 795 servers continue to improve on the CEC hot add and repair functions that were introduced with the Power Systems 595 and 570. Best-practice experience from client engagements in 2008-2010 calls for a new level of minimum enablement criteria that includes proper planning during system order, configuration and installation, as well as I/O optimization for RAS.
This paper provides a technical overview and description of the Power 770, 780 and 795 CEC Hot Add & Repair Maintenance (CHARM) functions. It also includes best practices, minimum enablement criteria and planning guidelines to help the system administrator obtain the maximum system availability benefits from these functions.
With proper advance planning and the minimum criteria met, the CHARM function on the Power 770, 780 and 795 allows system processor, memory and I/O hub capacity to be expanded, and those components to be repaired, with limited disruption to system operation.