
POWER2 Fixed-Point, Data Cache, and Storage Control Units

  

An early version of this paper has been submitted to the IBM Journal of Research and Development.

Introduction
The multichip POWER2 processor implementation provides industry leading performance in both floating-point and fixed-point applications [1]. Three of the chips on the multichip module, the Fixed-Point Unit (FXU), Data Cache Unit (DCU), and Storage Control Unit (SCU), provide a tightly integrated subsystem that avoids bottlenecks in the cache, memory, and I/O interfaces. The balanced system design allows POWER2 systems to excel on both technical and commercial applications.

The FXU, DCU, and SCU functionality and system structure are similar to those in a POWER implementation [2, 3]. This paper presents the FXU, DCU, and SCU designs, as well as the memory and I/O interfaces, found in the POWER2 system. The new implementation includes the following improvements:

  • An additional fixed-point execution unit
  • Improved Fixed-Point Unit / Floating-Point Unit (FPU) synchronization
  • New floating-point quad-word load and store instructions
  • Improved address translation
  • Faster fixed-point multiply/divide
  • Multiported data cache (D-cache)
  • Larger caches with longer cache line sizes
  • Increased bandwidth into and out of the caches through wider data buses
  • An improved external interrupt mechanism
  • An improved I/O DMA mechanism to support multiple streaming Micro Channels

Fixed-Point Unit (FXU) 
Figure 1 shows a block diagram of the POWER2 FXU. The FXU decodes and executes all instructions, except for branch and Condition Register logical instructions, which never leave the Instruction Cache Unit (ICU), and floating-point arithmetic instructions, which are executed by the FPU. Fixed-point and floating-point instructions are dispatched by the ICU on the instruction bus (IBUS) to the FXU and FPU simultaneously and are executed in parallel in the FXU and FPU. The FXU contains the address translation, data protection, and data cache directories for all load/store instructions.

Figure 1. Block diagram of the POWER2 FXU.

The FXU receives up to four instructions from the ICU over the IBUS shown in the upper left section of Figure 1. The instruction buffer unit queues instructions for the two decode units. The decode units decode the instructions and issue them to the two execution units. The decode units also control the general purpose registers (GPRs). The architecture defines thirty-two 32-bit GPRs. Hardware keeps two copies of the GPRs consistent, one for each execution unit. For load/store operations, the address translation logic converts virtual addresses to real addresses, and the data cache control unit controls the data cache and its directory. The Processor Bus (PBUS) unit provides the interface to other POWER2 processor chips.

Instruction Buffer
The ICU dispatches instructions to the FXU and FPU across the four-instruction-wide (4 x 36 bits) IBUS. Associated with each instruction are a valid bit and a set of three tag bits that provide additional information about the instruction. On each cycle, the FXU moves valid instructions and their tags from the IBUS to the eight-entry FXU instruction buffer and queues them for decoding and execution. The FXU limits the number of instructions transferred by informing the ICU of how many entries are available in its buffer.
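The flow control described above can be sketched as a small queue model. This is only an illustration, assuming the count reported to the ICU is the number of free entries and that instructions that do not fit are simply not accepted in that cycle; the class and method names are hypothetical and the real hardware protocol is not described here.

```python
from collections import deque

IBUS_WIDTH = 4      # instructions per cycle on the IBUS (from the text)
BUFFER_ENTRIES = 8  # FXU instruction buffer depth (from the text)


class FxuInstructionBuffer:
    """Toy model of the eight-entry FXU instruction buffer (hypothetical names)."""

    def __init__(self):
        self.entries = deque()

    def free_entries(self):
        # Assumption: this is the count the FXU reports back to the ICU.
        return BUFFER_ENTRIES - len(self.entries)

    def accept(self, ibus_slots):
        """Queue the valid instructions from one IBUS cycle.

        ibus_slots is a list of up to four (instruction, valid) pairs.
        Returns the number of instructions actually queued.
        """
        assert len(ibus_slots) <= IBUS_WIDTH
        taken = 0
        for instruction, valid in ibus_slots:
            if valid and self.free_entries() > 0:
                self.entries.append(instruction)
                taken += 1
        return taken

    def issue(self, count):
        """Remove up to `count` instructions, in order, for the decode units."""
        issued = []
        while self.entries and len(issued) < count:
            issued.append(self.entries.popleft())
        return issued
```

In this model, a cycle that presents four valid instructions to a buffer with one free entry queues only the first of them; the remaining three would have to be dispatched again once the decode units have drained the buffer.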

An instruction's valid bit from the ICU is further qualified by the status of pending branches, ICU instruction cancels, and other related conditions, creating a real valid bit for the instruction. The FXU may cancel the instruction by resetting the real valid bit when:

  • The instruction is canceled by the ICU in the cycle after its dispatch.
  • The instruction was conditionally dispatched and the branch is subsequently taken.
  • An interrupt has occurred.
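The qualification of the valid bit can be written as a simple predicate. This is a minimal sketch covering only the three cancel conditions listed above; the function and parameter names are hypothetical, and the "other related conditions" mentioned in the text are not modeled.

```python
def real_valid(icu_valid, cancelled_by_icu, conditional, branch_taken, interrupt):
    """Return the qualified ("real") valid bit for one buffered instruction.

    All parameter names are hypothetical; each stands for one of the
    conditions listed in the text.
    """
    if not icu_valid:
        return False
    if cancelled_by_icu:              # ICU cancel in the cycle after dispatch
        return False
    if conditional and branch_taken:  # conditionally dispatched, branch taken
        return False
    if interrupt:                     # an interrupt has occurred
        return False
    return True
```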

Instruction Decode
The two instruction decode units have the following responsibilities:

  • Decode instructions and read the GPRs
  • Control GPR bypass
  • Control sign extensions and inverters
  • Generate the immediate field bypass controls
  • Issue instructions to the two execution units

The two decode units are identical. For each instruction, a decode unit combines the primary and extended opcode fields into a single 10-bit field. The decode and execution units use this field to decode all instructions. At the end of the decode cycle, this combined opcode is latched for use during the execute cycle.

During the decode cycle, three (or four for the second execution unit) GPR values are read according to the specified register source and target fields of the instruction. If the data required during the execute cycle is not in the GPR, performance can be improved by routing the required data from its source directly to the execution unit (as well as to the GPR). Three instances of this technique, called a bypass, are implemented.

Results from the arithmetic logic unit (ALU) can bypass the GPRs when a register-to-register (RR) operation is dependent on the results of an RR operation that executed in the previous cycle. Data from the PBUS can bypass the GPRs when a load from I/O or a "move from special purpose register" instruction is followed by a dependent operation. Data from the cache can bypass the GPRs when a load from memory space occurs.
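The bypass idea can be illustrated with a small operand-selection model. It is a sketch under stated assumptions, not a description of the actual data paths: the `forwarded` dictionary is a hypothetical stand-in for the ALU, PBUS, and cache bypass paths, and all function names are invented for the example.

```python
MASK32 = 0xFFFFFFFF  # GPRs are 32 bits wide


def read_operand(reg, gprs, forwarded):
    """Select an operand for the execute cycle.

    gprs      -- list of 32 register values (the GPR file)
    forwarded -- dict mapping register number -> value still in flight
                 (hypothetical stand-in for the three bypass paths)
    """
    if reg in forwarded:
        return forwarded[reg]  # bypass: take the value straight from its source
    return gprs[reg]           # no pending result: read the GPR


def run_dependent_pair():
    """Execute two dependent RR adds in back-to-back cycles."""
    gprs = [0] * 32
    gprs[2], gprs[3], gprs[5] = 10, 20, 5

    # Cycle n: R1 = R2 + R3 executes; the result is not yet in the GPRs.
    result = (read_operand(2, gprs, {}) + read_operand(3, gprs, {})) & MASK32
    forwarded = {1: result}

    # Cycle n+1: the dependent add reads R1 through the bypass, not the GPR.
    dependent = (read_operand(1, gprs, forwarded)
                 + read_operand(5, gprs, forwarded)) & MASK32
    gprs[1] = result  # the value is written to the GPR as well
    return dependent
```

Without the `forwarded` entry, the second add would read the stale value of R1 from the register file, which is the case a hardware implementation would otherwise have to cover by stalling.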

The decode units also manage the issuing of instructions to the two execution units. In particular, the units resolve register dependencies and issue some operations (such as string and load/store multiple operations) to both execution units. The decode units also control the setup for the three-leg adder operations, as explained in the next section.

Execution Control Unit 
The two execution control units control load and store execution, manage holdoffs for operands that have not arrived to the execution unit, and write results to the GPRs. Each unit controls its corresponding execution unit.

Execution Unit
The FXU contains two fixed-point execution units, which provide the capability to execute two fixed-point instructions per cycle - twice that of the POWER processor. Both units contain one adder and one logic unit functional block, which give each unit the capability to execute all fixed-point arithmetic (except multiply and divide) and Boolean operations. Execution Unit 0 also performs special operations such as cache operations and all privileged operations. Execution Unit 1 performs multiply and divide. Other responsibilities for the execution units include performing the data transformations required by fixed-point RR operations, computing the effective address for all storage references, and providing data flow controls during the execution of "move to" and "move from" special purpose register instructions. Each execution unit is controlled by its corresponding execution control unit and decode unit, using the derived 10-bit field as the primary control interface. Performance enhancements to the POWER2 execution unit include improved multiply and divide performance, as well as support for parallel execution of dependent add operations.

The multiply/divide unit has been enhanced over POWER [4]. The multiply array supports 2-cycle operations for all multiply instructions (mul, muls, muli), an improvement over POWER, which takes three to five cycles for a multiply. The two divide instructions (div, divs) execute in 13-14 cycles on POWER2 compared to 19-20 cycles for POWER. The div instruction may require three extra cycles if the algorithm converges from above. When the divisor for the div instruction is the most negative number (0x80000000), two extra cycles are required.

A three-leg adder, implemented in the second execution unit, improves performance by allowing parallel execution of dependent add operations. In the following code sequence, Execution Unit 0 adds R2 and R3 and stores the result in R1. Instead of waiting for the result of the first add, the second execution unit's three-leg adder adds R2, R3, and R5, storing the result in R4.

   A R1,R2,R3

   A R4,R1,R5
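The equivalence that makes this possible can be checked with a short model. This is a sketch that assumes plain 32-bit wraparound addition and ignores any carry or overflow status the add instructions may record; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # GPRs are 32 bits wide


def add32(a, b):
    """Two-input 32-bit add with wraparound."""
    return (a + b) & MASK32


def three_leg_add(a, b, c):
    """Three-input 32-bit add with wraparound."""
    return (a + b + c) & MASK32


def sequential(r2, r3, r5):
    r1 = add32(r2, r3)  # A R1,R2,R3
    r4 = add32(r1, r5)  # A R4,R1,R5 (must wait for R1)
    return r1, r4


def parallel(r2, r3, r5):
    r1 = add32(r2, r3)              # Execution Unit 0
    r4 = three_leg_add(r2, r3, r5)  # second execution unit, same cycle
    return r1, r4
```

Both functions return the same pair of values for any 32-bit inputs, because truncating to 32 bits after each add gives the same result as truncating once after summing all three operands.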

Synchronization of FXU and FPU 
Synchronization between the FXU and FPU ensures the integrity of the association between data and the instruction that operates upon it. For example, on a floating-point load (lfd) instruction, it ensures that the data fetched by the FXU is loaded into the correct floating-point register (FPR). In both the POWER and POWER2 implementations, data integrity is maintained by synchronizing on all floating-point loads; a floating-point load executes in the FXU during the same cycle that the rename stage in the FPU is selecting a new physical register for the load's target register. Synchronization also helps preserve precise interrupts by ensuring that the FPU does not execute an interruptible operation (IOP), or subsequent instructions, before the FXU indicates that the execution may proceed. POWER implementations use two mechanisms to preserve precise interrupts [4]. An interruptible instruction latch in the FPU ensures that the FPU never executes an IOP ahead of the FXU. The FXU may not execute an IOP until the instruction reaches the FPU rename stage. A counter, indicating the relative execution positions of the FXU and FPU, limits how far either unit can be ahead of the other. The counter-based synchronization scheme relies on the FXU and FPU seeing all instructions on the IBUS.

In POWER2 implementations, the FXU does not see FPU arithmetic operations and the FPU does not see FXU arithmetic operations. Therefore, a queueing scheme was devised to allow precise interrupts. As in POWER, the FPU may not execute IOPs ahead of the FXU. However, the synchronization has been relaxed to allow the FXU to execute all operations, except the floating-point loads, ahead of the FPU [5]. Thus, the FXU can execute all operations except floating-point loads ahead of the FPU and the FPU can execute all operations except IOPs ahead of the FXU. As a result, the POWER2 FXU can execute further along the instruction stream and, under certain conditions, provide data to the FPU in fewer cycles.
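The relaxed POWER2 ordering rules can be summarized as two predicates. This is only a restatement of the rules in the paragraph above, not a model of the queueing hardware; the function names and the operation-kind strings are hypothetical.

```python
def fxu_may_run_ahead(op_kind):
    """True if the FXU may execute this operation ahead of the FPU.

    op_kind is a hypothetical label; only "fp_load" is treated specially,
    because floating-point loads remain synchronized with the FPU.
    """
    return op_kind != "fp_load"


def fpu_may_run_ahead(is_interruptible):
    """True if the FPU may execute this operation ahead of the FXU.

    The FPU must not execute an interruptible operation (IOP) before the FXU.
    """
    return not is_interruptible
```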





 