An early version of this paper has been submitted to the IBM Journal of Research and Development.
Introduction
The POWER version of the RS/6000 Floating-Point Unit (FPU) set a new standard for floating-point performance. Its innovative multiply-add fused (MAF) dataflow minimizes latency, rounding error, and chip busing [1]. The MAF unit performs a double-precision multiply in a single cycle and a double-precision add in the following cycle. A single rounding occurs in the final, bypassable stage of the pipeline. The FPU combines, in a single two-stage pipeline, capabilities that many other processors, such as the SuperSPARC microprocessor [2], provide with two units, usually a separate multiplier and adder. The simultaneous use of multiple execution units requires additional data buses as well as control logic for detecting dependencies across units. The architecture supports the exploitation of the MAF capability through a set of "Multiply-Add" instructions. The POWER processor's support of these instructions allows execution of a dependent pair of operations with a combined latency of only two cycles. This feature is unique in the industry.
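The effect of the single rounding step can be illustrated in software. The following is a minimal sketch, not a model of the MAF hardware: it assumes IEEE double-precision Python floats and emulates the fused operation by computing the exact product and sum with rational arithmetic, then rounding once. The function names and operand values are chosen only for illustration.

```python
from fractions import Fraction


def separate_multiply_add(a, b, c):
    """Multiply, round to double, then add and round again (two roundings)."""
    product = a * b          # first rounding happens here
    return product + c       # second rounding happens here


def fused_multiply_add(a, b, c):
    """Emulate a fused multiply-add: exact a*b + c, rounded once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # no rounding so far
    return float(exact)                              # the single rounding


# Illustrative operands: the exact product is 1 - 2**-60, which is not
# representable as a double and rounds to 1.0 when rounded on its own.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = separate_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print(unfused, fused)
```

With these operands the separately rounded product loses the low-order term before the add, so the two-step result is zero, while the fused result retains the small nonzero difference.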
The POWER2 FPU design goal is to build on these strengths to provide an FPU that sets new standards not only for number-crunching capability but also for data throughput and processor flexibility. The POWER2 FPU achieves a MFLOPS rate never before achieved by a personal workstation [3] by:
- Integrating dual generic MAF ALUs
- Doubling the instruction bandwidth and quadrupling the data bandwidth over that of the POWER FPU
- Adding support for additional functions
- Using dynamic instruction scheduling techniques [4] to maximize instruction-level parallelism, not only between its own internal units but also across the rest of the CPU.
A System Perspective
Floating-point computation has had a revolutionary role in the evolution of computer processing. First, in the early systems, fixed-point arithmetic was used to perform numerical computation. The need for a floating-point representation grew from the dynamic range limitations and portability concerns associated with the various fixed-point word lengths available in the industry. Integer emulation of floating-point numbers became standard.
Second, as silicon became cheaper, it became practical to dedicate hardware to the task of floating-point computation. This dedicated hardware could perform the standard arithmetic operations in significantly less time than the integer processor, which was customized for its own specific tasks. The first attempts involved a coprocessing element that was fed instructions once the core processor determined that they were floating-point operations. In early versions, the FPU and the fixed-point unit (FXU) could not run simultaneously.
The third evolutionary step was incorporating this dedicated hardware into the rest of the CPU in a way that maximized floating-point performance and minimized processor overhead. For example, processors such as the Intel 8087 coarsely overlapped floating-point and non-floating-point operations. As floating-point capabilities increased, the migration of floating-point-dominated applications further accelerated the demand for more advances. Integrating the floating-point processor with the rest of the CPU became imperative. Various methods attempted to integrate these units [5]. The POWER processor achieved much of its floating-point performance by tightly coupling the FPU to the rest of the CPU, particularly the FXU. Although this design point significantly advanced the state of the art in floating-point computation, the POWER2 FPU has since taken a further step by removing interlocks and increasing the autonomy of the multiple functional units.
POWER FPU Overview
Figure 1 shows a block diagram of the POWER FPU. The FPU receives two instructions from the instruction cache unit (ICU). These two instructions go through a predecode stage where the FPU discards non-floating-point instructions. The two instructions then go through a register renaming stage [6]. Register renaming allows hardware to remove any read-before-write or write-before-write conflicts between arithmetic and subsequent load operations. Register renaming, along with the pending store queue buffer, greatly increases the potential for the FXU and FPU to operate independently. The rename stage forwards the two instructions to the execution unit responsible for that class of instruction. The load unit receives load operations while the MAF execution unit receives both arithmetic and store operations.
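The general idea behind register renaming can be sketched as follows. This is a simplified illustration under stated assumptions, not the POWER FPU's actual rename logic: the register counts, the `rename` function, and the instruction encoding as `(destination, sources)` pairs are all hypothetical, and the sketch never returns physical registers to the free list.

```python
from collections import deque


def rename(instructions, num_arch=4, num_phys=8):
    """Map each architectural destination to a fresh physical register.

    instructions: list of (dest, sources) pairs of architectural register numbers.
    Returns the same program expressed in physical register numbers.
    """
    mapping = {r: r for r in range(num_arch)}  # architectural -> physical
    free = deque(range(num_arch, num_phys))    # physical registers not yet in use
    renamed = []
    for dest, sources in instructions:
        # Sources are looked up before the destination is remapped, so an
        # instruction reading and writing the same register sees the old value.
        phys_sources = tuple(mapping[s] for s in sources)
        new_dest = free.popleft()              # every write gets a new register
        mapping[dest] = new_dest
        renamed.append((new_dest, phys_sources))
    return renamed


# Hypothetical three-instruction program:
program = [
    (1, (2, 3)),  # arithmetic: r1 = r2 op r3
    (2, ()),      # load into r2 (must not overwrite r2 before it is read)
    (1, (2, 2)),  # arithmetic: r1 = r2 op r2 (second write to r1)
]
renamed_program = rename(program)
print(renamed_program)
```

After renaming, the load writes a different physical register from the one the first instruction reads, and the two writes to `r1` target different physical registers, so neither ordering conflict constrains when the load may complete.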
This MAF unit performs all of the floating-point arithmetic instructions, such as the multiply-add fused operation, as well as all floating-point store operations. All internal data representations use the IEEE [7] double-precision format (with an extended exponent field).
Dual Unit Motivation for POWER2
Three factors determine the time required by a processor to complete a program [8]:
- The number of instructions required to execute the program
- The processor cycle time
- The average number of processor cycles required to execute an instruction
Compiler capabilities determine the total number of instructions required for a given program. The second and third factors are under the CPU designers' control. The POWER2 FPU targets both factors. To decrease the cycle time, the POWER2 processor employs 0.5-micron CMOS technology. This process allows processor clock rates that are more than twice those of the initial versions of the POWER processor. To decrease the average number of cycles required to execute an instruction, one can either decrease the latency of the execution unit or add more execution units. Given a two-cycle latency for dependent multiply-add instructions, decreasing the latency for a single FPU instruction is unlikely. However, increasing instruction-level parallelism to decrease the average time for a group of instructions is viable. POWER2 achieves this by doubling the number of floating-point execution units. A fundamental challenge confronting the POWER2 FPU design team was feeding both units simultaneously to achieve maximum performance.
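The three factors combine multiplicatively: execution time is the instruction count times the average cycles per instruction times the cycle time. The short sketch below works through that product. All numbers in it are hypothetical values chosen for easy arithmetic, not measurements of POWER or POWER2, and the function name is an assumption.

```python
def execution_time_ns(instruction_count, cycles_per_instruction, cycle_time_ns):
    """Execution time as the product of the three factors listed above."""
    return instruction_count * cycles_per_instruction * cycle_time_ns


# Hypothetical workload and machine parameters (illustration only).
instructions = 1_000_000

baseline = execution_time_ns(instructions, 1.0, 20.0)
# Halving the cycle time (a faster clock) halves the execution time.
faster_clock = execution_time_ns(instructions, 1.0, 10.0)
# If a second execution unit could also halve the average cycles per
# instruction, the two improvements would multiply.
faster_clock_dual_unit = execution_time_ns(instructions, 0.5, 10.0)

print(baseline / faster_clock, baseline / faster_clock_dual_unit)
```

Halving the average cycles per instruction is an idealized upper bound for a second unit; it is reached only when both units can be kept busy, which is the feeding problem described above.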