The Decade of Smart
In 2008, the IBM leadership team, recognizing the vast impact of technology on our society, began a dialog on the best ways to create a smarter planet. In a smarter planet, intelligence is infused into the systems and processes that make the world work. These technologies encompass both traditional computing infrastructures and the rapidly growing world of intelligent devices: smart phones, global positioning systems (GPS), cars, appliances, roadways, power grids, water systems.
By 2012 we expect almost a trillion devices to be connected to the Internet. These Internet-connected devices allow both new modes of social interaction and new ways for businesses to connect to their clients, their employees, and their suppliers. At the same time, these interactions produce enormous amounts of data (raw information about the way people use their resources and how data flows through the marketplace) that can be used to understand how societies operate.
Servers in this environment must be able to rapidly process transactions from the Internet, store and process large amounts of data in backend databases, and support analytical tools that allow business and community leaders to gain the insight needed to make sound, timely decisions within their areas of responsibility.
Smarter Systems for a Smarter Planet™
In February 2010, IBM announced the first models in a new generation of Power Servers based on the POWER7® microprocessor. While these new servers are designed to meet traditional computing demands of improved performance and capacity, the transition to “smarter” solutions requires that smarter servers also: scale quickly and efficiently, automatically optimize workload performance and flexibly manage resources as application demands change, avoid downtime and save energy, and automate management tasks.
Therefore, IBM has vastly increased the parallel processing capabilities of POWER7 systems — integrated across hardware and software — a key requirement for managing millions of concurrent transactions. As expected, the new Power Systems continue to enhance the proud heritage of IBM Power servers — delivering industry-leading transaction processing speed in designs built to efficiently process the largest database workloads. In addition, these new offerings, optimized for running massive Internet workloads, deliver a leap forward in “throughput” computing.
Emerging business models will gather large amounts of data (for example, from the Internet, sensors in electric grids, roads, or the supply chain). This data will be deposited in large databases and scrutinized using advanced analytical tools, gleaning information that can provide competitive advantage. High-speed, multi-threaded POWER7 processor-based servers can be deployed in optimized pools to efficiently process Internet workloads, store and process large amounts of data in databases, and employ specialized analytic tools (like the IBM Smart Analytics System 7700) to derive information useful to the business. These three computing modes (massive parallel processing, "throughput" computing, and analytics capabilities) are integrated and managed consistently with IBM Systems Director software.
Describing Server Reliability, Availability, Serviceability (RAS)
Since the early 1990s, the Power Development team in Austin has aggressively pursued integrating industry-leading mainframe reliability technologies in Power servers. Arguably one of the most important capabilities, introduced first in 1997, is the inclusion of a hardware design methodology called First Failure Data Capture (FFDC) in all IBM Power System servers. This methodology uses hardware-based fault detectors to extensively instrument internal system components. Each detector is a diagnostic probe capable of reporting fault details to a dedicated service processor. FFDC, when coupled with automated firmware analysis, is used to quickly and accurately determine the root cause of a fault the first time it occurs, regardless of the phase of system operation and without the need to run “recreate” diagnostics. The overriding imperative is to identify which component caused a fault, on the first occurrence of the fault, and to prevent any recurrence of the error. This feature has been described in detail in a series of RAS technical articles and “white papers” that provide technical detail of the IBM Power design.
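The FFDC flow described above (detectors report to a service processor, and analysis names the failing component on the first occurrence) can be illustrated with a minimal sketch. This is a toy model, not IBM firmware: the class names, detector names, the logical `cycle` timestamp, and the "earliest detector to fire identifies the root cause" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class FaultRecord:
    """Data captured by one (hypothetical) hardware detector when it fires."""
    detector_id: str  # assumed detector name, e.g. "core0.l2.parity"
    component: str    # assumed name of the component the detector instruments
    cycle: int        # assumed logical timestamp at which the detector fired
    detail: str       # free-form fault detail


class ServiceProcessor:
    """Toy stand-in for a service processor that receives fault records."""

    def __init__(self) -> None:
        self._records: List[FaultRecord] = []
        self._deconfigured: Set[str] = set()

    def report(self, record: FaultRecord) -> None:
        # Detectors push their captured data here at the moment of failure,
        # so no later "recreate" run is needed to gather it.
        self._records.append(record)

    def analyze(self) -> Optional[str]:
        # Assumed heuristic: the detector that fired earliest points at the
        # root cause; later records are treated as side effects.
        if not self._records:
            return None
        first = min(self._records, key=lambda r: r.cycle)
        return first.component

    def isolate(self) -> Optional[str]:
        # Mark the suspected component as unavailable so the same fault
        # cannot recur from it.
        component = self.analyze()
        if component is not None:
            self._deconfigured.add(component)
        return component

    def is_available(self, component: str) -> bool:
        return component not in self._deconfigured


if __name__ == "__main__":
    sp = ServiceProcessor()
    # A downstream timeout is reported before the upstream parity error,
    # but the parity error carries the earlier timestamp.
    sp.report(FaultRecord("io_hub.timeout", "io_hub", 120, "request timed out"))
    sp.report(FaultRecord("core0.l2.parity", "core0", 100, "parity mismatch"))
    print("suspected component:", sp.isolate())
```

The point of the sketch is that the data needed for diagnosis is recorded when the fault happens, so analysis works from captured records rather than from an attempt to reproduce the failure. Real analysis would draw on far richer information than a single timestamp.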
The article “Fault-tolerant design of the IBM pSeries® 690 system using POWER4™ processor technology” [1] emphasized how POWER systems were designed, from initial RAS concept to full deployment. In the nine years since the introduction of POWER4, IBM has introduced several successive generations of POWER processors, each containing new RAS features. Subsequent white papers [2, 3] described how RAS attributes were defined and measured to ensure that RAS goals were in fact being met, and detailed how each new feature contributed to reliable system operations. In general, these documents outlined the core principles guiding IBM engineering design that are reflected in the RAS architecture. A user should expect a server to provide physical safety, system integrity, and automated fault detection and identification.
- Systems should be reliable in a broad sense
- Systems should be configurable to achieve required levels of availability
- Systems should diagnose problems automatically and proactively
The intent of this white paper, then, is to highlight POWER7 server design features that extend the inherent Power Architecture® RAS capabilities delivered in previous server generations. The POWER7 module is a much denser chip than its POWER6® predecessor, containing eight cores instead of two, and featuring 32 MB of integrated, not external, L3 cache. Compared to POWER6 it also includes substantially more function, with higher levels of simultaneous multi-threading (SMT) per core.
In addition to advances in density, virtualization, and performance, the processor module also contains substantial new features related to reliability and availability, building on the rich heritage of prior generations.
A system’s reliability, and the availability of the applications it supports, depends on much more than the reliability of the processors, or even of the entire system hardware. A full description of a system design for RAS must include all of the hardware, the firmware, the operating system, middleware, applications, operating environment, duty cycle, and so forth.
The POWER7+ module, built with 32 nm technology, dramatically increases the number of circuits available, supporting a larger L3 cache (at 80 MB, it is 2.5 times the size of its POWER7 predecessor’s) and new performance acceleration features in support of active memory expansion and hardware-based data encryption. Power Servers using this new module will be able to achieve higher frequencies within the same power envelope and improved performance per core when compared to POWER7-based offerings.
[1] D.C. Bossen, A. Kitamorn, K.F. Reick, and M.S. Floyd, “Fault-tolerant design of the IBM pSeries 690 system using POWER4 processor technology”, IBM Journal of Research and Development, Vol. 46, No. 1, January 2002.
[2] J. Mitchell, D. Henderson, and G. Ahrens, “IBM eServer p5: A Highly Available Design for Business-Critical Applications”, p5tiPOWER5RASwp121504.doc, December 2004.
[3] J. Mitchell, D. Henderson, G. Ahrens, and J. Villarreal, “Highly Available IBM Power Systems Servers for Business-Critical Applications”, PSW03003-USEN-00, April 2008.