Enterprise computing environments require constant availability. Time lost to unavailable compute resources can cost financial institutions millions of dollars [1,2]. Some of that lost time comes from unscheduled outages, which originate from influences outside the datacenter and from failing hardware components or software problems within it. Scheduled outages, however, are the larger contributor to enterprise system unavailability and lost time, and routine maintenance for system upgrades, code updates, or repairs is the biggest contributor to scheduled outages [3,4]. The two are linked: when an unscheduled outage occurs, one or more scheduled outages are generally needed afterward to repair the system. Upgrading system hardware capacity can require scheduled outages as well. Systems such as the IBM Power Systems [5] family of servers have robust availability features designed to address both unplanned and planned outages within customers' enterprise computing environments. One of the key maintenance-related feature sets of IBM Power Systems servers is known as CEC Hot Add and Repair Maintenance (CHARM) [6].
Because enterprise servers rarely experience significant lulls in utilization, the ability to add capacity while they are running is very valuable to the customer. However, adding physical processor, memory, or I/O hardware to a running machine presents a challenge. IBM's POWER enterprise-class servers meet this challenge with CHARM: additional nodes of compute resources can be added to the machine, and those resources can then be dynamically configured for use. Thus, hot node add, hot memory upgrade, and hot I/O drawer add enable Power Systems servers to avoid scheduled outages for capacity upgrades.
The second area of server unavailability addressed by CHARM is the repair of hardware in the rare instance of a hardware failure. When hardware in an enterprise computer system fails, the system automatically restarts with the failing components logically isolated from the rest of the system. This allows customers to continue operations and defer the maintenance until a more convenient time. Using the capabilities of hot repair, IBM service personnel can replace the failing hardware while the server is running. The repaired hardware can then be dynamically returned to use by the customer applications. Hot repair thus allows critical components within a Power Systems server to be repaired in the manner least disruptive to customer compute availability.
1 Arnold A, Chief Technology Officer, Vision Solutions. Assessing the Financial Impact of Downtime (April 26, 2010). IT-Director.com website. Available at: http://www.it-director.com/business/costs/content.php?cid=12043. Accessed June 27, 2011.
2 Martinez H. How Much Does Downtime Really Cost? (August 6, 2009). InfoManagement Direct website. Available at: http://www.informationmanagement.com/infodirect/2009_133/downtime_cost-10015855-1.html. Accessed June 27, 2011.
3 Weygant PA. Clusters for High Availability. 2nd Edition. Prentice Hall; 2005: Chapter 1.
4 Brodkin J. Amazon: Bad execution during planned upgrade caused outage (April 29, 2011). Network World website. Available at: http://www.networkworld.com/news/2011/042911-amazon-explanation.html. Accessed June 27, 2011.
5 Arroyo RX, Harrington RJ, Hartman SP, Nguyen T. IBM POWER7 systems. IBM Journal of Research and Development. 2011; Volume 55, Issue 3: Pages 2:1-2:13.
6 Eide C, Kitamorn A, Kumar A, Larson D, Lo E, Mehta C, Nayar N, Patel J, Stallman A. IBM Power 770/780 and 795 Servers CEC Hot Add & Repair Maintenance Technical Overview (May 2011). Available at: http://www.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&appname=STGE_PO_PO_USEN&htmlfid=POW03058USEN&attachment=POW03058USEN.PDF (PDF, 1.46MB). Accessed June 27, 2011.