
POWER2: Next Generation of the RS/6000 Family

  
Introduction
Architecture
Implementation
Chip and Packaging Technology
Performance
Summary

An early version of this paper has been submitted to the IBM Journal of Research and Development.

Introduction
In 1990, IBM announced the RS/6000 family of highly concurrent superscalar workstations and servers, supporting clock rates ranging from 20 MHz to 30 MHz [1]. The 25-MHz Model 530 achieved performance levels which exceeded those of many of its contemporaries (Sun 4/200, DECstation 3100, MIPS M/2000, and Apollo DN10000) by more than 40% on a variety of benchmarks (Dhrystones 1.1, Whetstones, Linpack [dp], Livermore Loops [geometric mean] and SPECmark) [2]. All models included an 8K-byte (KB) instruction cache (I-cache) and either a 32KB or 64KB data cache (D-cache). These POWER processors were the first implementations of the IBM POWER (Performance Optimized With Enhanced RISC) Architecture.

Over the years, the POWER-based RS/6000 offerings have improved incrementally. Desktop, deskside, and rack system clock rates increased up to 62.5 MHz. More than ten of these models support a 32KB I-cache. Additional compiler capability, especially in the area of restructuring data access patterns, has improved the performance of benchmarks and customers' code. Changes in the I/O area have increased peak Micro Channel bandwidth from 40 MB/s to 80 MB/s.

While these changes were taking place, the competition also improved. A dichotomy in design philosophies became apparent. RS/6000 systems and compilers aggressively exploit superscalar capabilities. Other designs, such as Sun SuperSPARC and Motorola 88110, also exhibit this philosophy. These superscalar capabilities involve multiple functional units and the hardware complexity to allow the units to function relatively autonomously. Some argue that the complexity makes high clock rates difficult to achieve and that more performance can be gained from a higher clock rate than from aggressive instruction-level parallelism. Examples of this alternative philosophy are the DEC 21064 (also known as Alpha), the HP PA7100, and the MIPS R4000. The debate about the advantages of each approach appears on electronic forums and in articles and editorials. The popularity of the topic has led to the coining of catchy synonyms for the two approaches, such as the "Speed Demons" (high clock rate) versus the "Brainiacs" (complexity) [3].

While it is desirable to pursue both approaches, the goals are often in conflict [4]. For a given technology, there are likely to be sets of clock rate/instruction-level parallelism pairs that provide near-optimal performance. Although many factors (compiler optimization, as well as chip and system designers' abilities) cloud a comparison, hardware measurement is the generally accepted method of judging the trade-offs. Benchmarks clearly illustrate that the optimal design point is very application specific.

Tracking the performance of various systems over the past few years makes two points apparent. First, performance improves at a healthy pace in the workstation and server markets; without continual improvements, leaders soon lose their position. Second, performance for a given vendor is a stair-step function. Often the competition is close, with several vendors jockeying for the lead position. While at any instant a system may be dominant, leaders change frequently.

This paper describes the next generation of implementations of the POWER Architecture, POWER2 processors and systems. The initial three models are the 55-MHz Model 58H, the 66.5-MHz Model 590, and the 71.5-MHz Model 990. The arrival of the POWER2-based systems moves the RS/6000 family into the lead on many industry standard benchmarks with a combination of increased clock rate, exploitation of architectural enhancements, doubled functional units, and increased cache capacity. The POWER2 enhanced superscalar capability further widens the gap between the instruction-level parallelism and clock rate approaches.

This paper consists of four major sections. The architecture section discusses enhancements to the programmer's view of the hardware, primarily new instructions which improve storage reference bandwidth, allow hardware square root, and speed floating-point to integer conversion. The implementation section describes the POWER2 processor, including its functional units, caches, and translation look-aside buffers (TLBs). The third section describes the fabrication technology. The performance section examines how these changes affect performance and compares the resulting POWER2 performance to that of several competitive systems on a variety of industry-standard benchmarks. The performance results demonstrate that superscalar capabilities are an attractive alternative to aggressive clock rates.

Architecture
The RS/6000 systems are implementations of a Reduced Instruction Set Computer (RISC) architecture. As is characteristic of many RISC architectures, loads and stores provide the only storage access; arithmetic instructions use only register operands. Several instructions, often considered more complex than a traditional RISC definition, enhance performance. The instructions include a floating-point multiply-add (FMA) instruction, a branch-on-count (BCT) operation, and update forms of storage references.

The FMA compound instruction consists of a floating-point multiply and a dependent add. On POWER and POWER2 implementations, the FMA operation performs the multiply and add with a total latency of only two cycles. Independent FMA instructions can start every cycle. The FMA operation allows a peak MFLOPS rate equal to two times the MHz rate while using a single functional unit. Many experts credit the FMA instruction as a key component of the RS/6000's outstanding floating-point performance. The HP PA7100 has a similar compound operation that allows a floating-point multiply and an independent add. Simple coding of common constructs, such as inner product or daxpy, often involves dependent pairs of operations, requiring additional compiler complexity to exploit the HP compound operations.
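The peak-rate arithmetic and the dependent multiply-add pattern can be sketched in C. This is an illustrative sketch, not code from the paper: the names peak_mflops and inner_product are hypothetical, and whether a compiler actually emits a single fused instruction for the marked statement depends on the target and the compiler settings.

```c
#include <stddef.h>

/* Hypothetical helper: peak MFLOPS when each FMA-capable unit can start
   one FMA (two floating-point operations) every cycle. */
double peak_mflops(double clock_mhz, int fma_units)
{
    return 2.0 * clock_mhz * (double)fma_units;
}

/* Inner product: every iteration is a multiply followed by an add that
   depends on the product, the pattern a single FMA instruction covers. */
double inner_product(const double *x, const double *y, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc = acc + x[i] * y[i];   /* candidate for one FMA */
    }
    return acc;
}
```

Under this model, a 25-MHz processor with a single FMA-capable unit has a peak rate of peak_mflops(25.0, 1), which is 50 MFLOPS.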

The BCT form of a conditional branch decrements and tests a special purpose register, the Count Register, to determine the outcome of the branch. Often a loop closing branch can be coded using the BCT form; the programmer loads the Count Register with an iteration count for the loop and the branch unit decrements and tests this value independently of other Fixed-Point Unit (FXU) work. In many other architectures, a general purpose register (GPR) is used to hold the iteration count, and the FXU performs the decrement and test. The FXU forwards the test result to the branch unit in the form of a condition code result. The RS/6000's BCT instruction and Count Register are examples of architectural separation of resources that enhance the implementer's ability to exploit instruction-level parallelism. The FXU can off-load the loop count decrement and test operations, while the branch unit can accurately determine the fetch path without FXU synchronization.

Both addressing forms of storage references, indexed and displacement, support an "update form." An update-form reference writes the effective address back to the base register; this pre-update greatly decreases the need for explicit address arithmetic. The multiple operations that make up each of the FMA, the BCT, and the update forms allow designers an opportunity to provide instruction-level parallelism beyond the number of functional units and available dispatch bandwidth.
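As a rough source-level picture of how these three features combine, the following C sketch writes an inner product so that each statement maps onto one of them. The function name dot_update_form is hypothetical, and the mapping described in the comments is an assumption about what a compiler could generate, not actual compiler output.

```c
#include <stddef.h>

/* Inner product written to mirror the three compound operations:
   - the down-counting loop variable plays the role of the Count Register
     that a BCT-style branch decrements and tests;
   - *++x and *++y mirror update-form loads, which advance the base
     address and load from the new address in one instruction;
   - sum + a * b is a multiply with a dependent add, an FMA candidate. */
double dot_update_form(const double *x, const double *y, size_t n)
{
    if (n == 0) {
        return 0.0;
    }
    double sum = x[0] * y[0];              /* first element handled up front */
    for (size_t count = n - 1; count != 0; --count) {
        double a = *++x;                   /* load with update */
        double b = *++y;                   /* load with update */
        sum = sum + a * b;                 /* multiply-add */
    }
    return sum;
}
```

The loop body contains no explicit index arithmetic and no explicit compare: only two loads, one multiply-add, and the loop-closing branch remain.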

POWER2 supports a superset of the POWER Architecture. New instructions provide performance opportunity: quad-word floating-point storage references, square root, and convert to integer. Virtual address translation changes improve performance and add capability. The architecture also adds hardware performance monitoring. Support of all POWER instructions maintains upward compatibility for programs.

New Instructions
The architecture adds high-performance floating-point storage access instructions, Load Quad and Store Quad, which support all of the addressing forms for double-precision storage references: indexed and displacement, with and without update forms. The quad-word (128 bits) loads move two adjacent double-precision storage operands into two adjacent floating-point registers (FPRs).

Due to the implicit register updates available in storage reference instructions and the BCT branch, most RS/6000 floating-point loops simply consist of storage references, floating-point arithmetic, and a branch. The FXU (which executes all storage reference instructions) and Floating-Point Unit (FPU) operate fairly autonomously. Therefore, either the number of storage references or the number of arithmetic instructions usually limits the number of cycles required to execute an iteration of a floating-point loop. When storage references limit the performance of a loop, Load Quad and Store Quad instructions can provide improvement.
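The bound described above can be written as a very small model. This is a simplified sketch under stated assumptions, not a description of the actual pipeline: it assumes one storage reference and one floating-point instruction can issue per cycle, that the two streams overlap perfectly, and that every floating-point instruction is an FMA. The function names are hypothetical.

```c
/* Cycles per loop iteration under the simplified model: whichever
   instruction stream is longer sets the bound. */
unsigned cycles_per_iteration(unsigned storage_refs, unsigned fp_instrs)
{
    return storage_refs > fp_instrs ? storage_refs : fp_instrs;
}

/* Floating-point operations per cycle, counting each FMA as two. */
double flops_per_cycle(unsigned storage_refs, unsigned fmas)
{
    unsigned cycles = cycles_per_iteration(storage_refs, fmas);
    if (cycles == 0) {
        return 0.0;                        /* empty loop body */
    }
    return (2.0 * (double)fmas) / (double)cycles;
}
```

In this model, a loop with three storage references and one FMA is bound by storage at three cycles per iteration. If the same three storage instructions could feed two FMAs, the cycle count would stay at three while the floating-point work per iteration doubles.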

The dominant loop from the Linpack benchmark [5] shown in Figure 1 illustrates the quad-word benefit. The top code block represents the dominant loop after inlining but without unrolling. The pseudo-assembly code shows three storage reference instructions, the performance limiter for this code on a POWER processor. The bottom code block represents the code after unrolling this stride-1 loop (by a factor of two). The pseudo-assembly code shows three pairs of storage references.

Each pair involves two adjacent storage locations and two adjacent FPRs. A quad-word reference can replace each pair, resulting in roughly the same number of cycles per iteration as in the upper loop. Since unrolling doubles the number of floating-point operations per iteration, and the unrolled loop requires the same number of cycles, the quad-word storage reference capability almost doubles the performance on the Linpack benchmark.
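The two loop shapes can be sketched in C, assuming the dominant Linpack loop is the daxpy kernel (y = y + a*x). The function names are hypothetical, and the comments about quad-word candidates describe what a compiler might do with the adjacent pairs, not guaranteed code generation.

```c
#include <stddef.h>

/* Rolled form: each iteration loads x[i], loads y[i], and stores y[i],
   giving three storage references per multiply-add. */
void daxpy_rolled(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = y[i] + a * x[i];
    }
}

/* Unrolled by two: each storage reference becomes a pair that touches
   two adjacent elements, so each pair is a quad-word candidate. */
void daxpy_unrolled2(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        double x0 = x[i], x1 = x[i + 1];   /* adjacent pair of loads  */
        double y0 = y[i], y1 = y[i + 1];   /* adjacent pair of loads  */
        y0 = y0 + a * x0;
        y1 = y1 + a * x1;
        y[i] = y0;                         /* adjacent pair of stores */
        y[i + 1] = y1;
    }
    if (i < n) {                           /* leftover element for odd n */
        y[i] = y[i] + a * x[i];
    }
}
```

Both functions compute the same result; the unrolled form only changes how the storage references are grouped.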

In addition to the quad-word storage reference instructions, the architecture adds a Square Root instruction. On previous RS/6000s, a library routine call provided the square root function. By replacing the call with a single instruction, the number of cycles per operation drops from about 50 to roughly 25. In the SPEC CFP92 suite, hardware square root provides a substantial gain on the ORA benchmark, which spends about 50% of its time in the library square root routine. Application areas that will exhibit performance gains from the Square Root instruction include computational physics and graphics.
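A back-of-the-envelope estimate of the overall effect can be made with Amdahl's law. The sketch below is an assumption-laden estimate rather than a measured result: it assumes the time spent in square root scales directly with the cycle counts quoted above, and the function name overall_speedup is hypothetical.

```c
/* Amdahl's law: overall speedup when a fraction of the run time
   (0.0 to 1.0) is accelerated by the given factor. */
double overall_speedup(double fraction, double factor)
{
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}
```

With about half of the run time in square root and the cost per operation roughly halved (a factor of 2), overall_speedup(0.5, 2.0) gives an estimate of about 1.33.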

Additional new instructions allow more efficient conversion of a floating-point value to an integer value. The fcir and fcirz instructions provide the conversion with default rounding and with round toward zero, respectively. They improve random number generation where the seed is a floating-point value but the modulo arithmetic calculations require integer inputs. Other examples of use include histogram updates and table look-up routines that convert a floating-point input value into an integer value for indexing a table or array. Furthermore, interpolation can use the floating-point-to-integer conversion to determine which two adjacent grid points (integer indices) surround a calculated point on a grid. The compiler can use fcirz to provide the Fortran INT intrinsic function.
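The interpolation use case can be sketched in C, where a cast from double to int truncates toward zero, the same rounding as the Fortran INT intrinsic. The function names are hypothetical, the grid is assumed to have unit spacing, and the comment about fcirz is an assumption about possible code generation.

```c
#include <stddef.h>

/* Round-toward-zero conversion. A compiler targeting POWER2 could use
   fcirz here (an assumption, not verified compiler output). */
int to_int_toward_zero(double v)
{
    return (int)v;
}

/* Linear interpolation in a table with unit grid spacing.
   Assumes n >= 1 and 0 <= pos <= n - 1. */
double interpolate(const double *table, size_t n, double pos)
{
    size_t lo = (size_t)to_int_toward_zero(pos);   /* lower grid point */
    if (lo + 1 >= n) {
        return table[n - 1];                       /* at the last point */
    }
    double frac = pos - (double)lo;                /* distance past lo */
    return table[lo] + frac * (table[lo + 1] - table[lo]);
}
```

The conversion yields the index of the lower grid point; the fractional remainder then weights the two surrounding table entries.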
