
POWER2 CPU-Intensive Workload Performance

  
 

Introduction
The POWER2 processor provides industry-leading performance across a broad range of applications in the workstation and server markets. One significant market segment consists of applications commonly characterized as CPU-intensive workloads. The Standard Performance Evaluation Corporation's (SPEC) integer (CINT92) and floating-point (CFP92) benchmarks represent these workloads. As of the announcement of the POWER2 products, the POWER2 processor has the highest performance results on both benchmarks. This achievement is possible because of the significant performance improvement of POWER2 over its predecessor, POWER. The SPECfp92 and SPECint92 ratings for POWER2 exceed those of the highest performing POWER implementation by factors of 1.9 and 1.7, respectively. (When models are significant to the discussion, we use the IBM RS/6000 Model 990 to represent the POWER2 processor and the Model 980 to represent the high-end POWER implementation.) These speedups result from a combination of compiler and hardware design enhancements.

First, we describe the performance improvements provided by the latest versions of the compilers (C Set ++ 2.1 for C code and XLF 3.1 for Fortran code) that implement new optimizations to exploit the new features of the POWER2. Next, we discuss the performance gains obtained from various POWER2 architectural and implementation improvements. These improvements include a faster clock, additional functional units, improved caches, and new instructions [1,2,3,4].

 

The SPEC Benchmark Suites
Throughout this paper, the SPEC integer and floating-point benchmark suites are the basis for evaluating the POWER2 performance improvements. These SPEC suites are widely accepted measures of workstation performance, especially among computer system users engaged in computationally intensive engineering and scientific work. The integer suite (CINT92) consists of six programs; the floating-point suite (CFP92) has fourteen programs. Tables 1 and 2 list some characteristics of these programs [5].

 

Compiler Improvements
To achieve the full potential of the POWER2 architectural features, the new compilers include enhancements such as performing more aggressive high-order transformations, scheduling instructions to take maximum advantage of the dual integer units (Fixed-Point Units, or FXUs) and dual Floating-Point Units (FPUs), and exploiting the new POWER2 instructions.

Loop Unrolling
Of the high-order transformations in the new compilers, loop unrolling is the most important optimization for the POWER2. Unrolling consists of replicating the body of a loop by some factor and reducing the iteration count by an equivalent factor. For example, the following simple loop is shown before and after being unrolled by a factor of four:

Before Unrolling:

  DO J=1,1000
      SUM(J) = OFFSET + X(J) * Y(J)
  ENDDO

After Unrolling:

  DO J=1,1000,4
      SUM(J) = OFFSET + X(J) * Y(J)
      SUM(J+1) = OFFSET + X(J+1) * Y(J+1)
      SUM(J+2) = OFFSET + X(J+2) * Y(J+2)
      SUM(J+3) = OFFSET + X(J+3) * Y(J+3)
  ENDDO

On some architectures, unrolling avoids the branch penalty overhead associated with each loop iteration. This justification for loop unrolling is not generally valid on POWER and POWER2 since both often achieve zero-cycle branches.

Unrolling does provide several other benefits on POWER2. First, unrolling provides an opportunity to expose the parallelism between successive loop iterations by creating a substantially larger basic block (the sequence of nonbranch instructions between branches) for the body of the loop. The larger basic block permits the compiler's instruction scheduler to make more efficient use of the multiple functional units because there are more opportunities to schedule instructions in otherwise "dead" cycles. In the preceding code example, four independent floating-point multiply-add (fma) instructions will keep both FPUs busy. Hence, unrolling enables the compiler to expose greater instruction-level parallelism to the POWER2 hardware.
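
The Fortran example above has a trip count that is an exact multiple of the unroll factor. As a minimal sketch of the more general case, the following C functions show the same computation in rolled form and unrolled by four, with a cleanup loop for any leftover iterations. The function names are ours and purely illustrative. The sketch shows only the source-level shape of the transformation; it is not the code that the XLF or C Set ++ compilers actually generate.

```c
#include <stddef.h>

/* Rolled form: sum[j] = offset + x[j] * y[j] for j in [0, n). */
void muladd_rolled(double *sum, const double *x, const double *y,
                   double offset, size_t n)
{
    for (size_t j = 0; j < n; j++)
        sum[j] = offset + x[j] * y[j];
}

/* Same computation, unrolled by a factor of four. */
void muladd_unrolled4(double *sum, const double *x, const double *y,
                      double offset, size_t n)
{
    size_t j = 0;

    /* Main body: four multiply-adds per trip, none of which depends
       on the result of another, so a scheduler may overlap them. */
    for (; j + 4 <= n; j += 4) {
        sum[j]     = offset + x[j]     * y[j];
        sum[j + 1] = offset + x[j + 1] * y[j + 1];
        sum[j + 2] = offset + x[j + 2] * y[j + 2];
        sum[j + 3] = offset + x[j + 3] * y[j + 3];
    }

    /* Cleanup loop: handles the n % 4 elements the main body skipped. */
    for (; j < n; j++)
        sum[j] = offset + x[j] * y[j];
}
```

Both functions should produce identical results for any n. The unrolled version simply presents a larger basic block per trip through the loop.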

Second, unrolling often creates opportunities to utilize quad-word storage reference instructions. These new POWER2 operations load (or store) two 64-bit floating-point operands into registers in a single memory access. In cases where successive iterations of a loop access sequential elements of an array, unrolling the loop by a factor of two or more allows two instances of a Load Double instruction (64-bit load) from successive iterations to be replaced by a single Load Quad (128-bit load). For floating-point codes dominated by storage reference instructions, this can result in a substantial performance improvement.

A third advantage of unrolling is that it exposes parallelism between long-latency instructions (divide and square root) across loop iterations. In programs dominated by long-latency instructions, the parallelism among such instructions significantly affects performance [6]. For instance, in a floating-point loop where a scalar is divided by each element of an array, the loop spends most of its time performing the 17-cycle divide. Unrolling often makes it easier for the compiler to expose the parallelism between two independent divide (or square root) operations. In this case, unrolling can result in effectively 8.5 cycles per floating-point divide. In loops with long-latency instructions, unrolling is a particularly useful technique.
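
The 8.5-cycle figure follows from overlapping two 17-cycle divides, one in each FPU: 17 / 2 = 8.5 cycles per divide. A minimal C sketch of such a loop, unrolled by two, might look as follows. The function name and the restrict qualifiers (which assume that the two arrays do not overlap) are illustrative assumptions, not taken from any benchmark.

```c
#include <stddef.h>

/* q[i] = s / x[i], unrolled by two. Assumes q and x do not overlap. */
void scalar_div_unrolled2(double *restrict q, const double *restrict x,
                          double s, size_t n)
{
    size_t i = 0;

    /* Each trip issues two divides that do not depend on each other,
       so two floating-point units could work on them concurrently. */
    for (; i + 2 <= n; i += 2) {
        double q0 = s / x[i];
        double q1 = s / x[i + 1];
        q[i]     = q0;
        q[i + 1] = q1;
    }

    /* Leftover element when n is odd. */
    if (i < n)
        q[i] = s / x[i];
}
```

Whether the two divides actually overlap depends on the instruction scheduler and the hardware; the sketch only makes the independence visible.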

Although loop unrolling increases code size, this does not greatly impact the SPEC performance on POWER2 because the POWER2 instruction cache is large with respect to the SPEC programs, which have small code footprints (the set of unique instruction cache lines touched during a program's execution). The more serious drawback to unrolling is an increase in register use. An unrolled loop has more unique variables, increasing the number of registers needed to retain these variables. This increased register usage may increase the amount of spill code (instructions that save register values to memory and later restore them, making the registers available for other variables). Because this spill code can often adversely affect performance, the compiler applies heuristics to determine how much unrolling should be applied to a given loop. For a more thorough discussion of loop unrolling and other high-order transforms, see [7].

To illustrate the importance of loop unrolling on actual code, consider an inner loop from the SPEC floating-point benchmark 052.alvinn:

  for (hu = 0; hu < (30+1); hu++)
  {
    psum_array[hu] += delta[ou] * h_o_weights[ou][hu];
    h_o_w_ch_sum_array[ou][hu] += delta[ou] * hidden_act[hu];
  }

Without any unrolling, the compiler might generate the following unoptimized code sequence for the main body of the loop. The Load Double with Update (lfdu) instructions load the successive elements of the h_o_weights and hidden_act arrays. The Load Double (lfd) instructions load the successive elements of the h_o_w_ch_sum_array and psum_array arrays, while the Store Double with Update (stfdu) instructions store back the results. The floating-point multiply-add (fma) instructions perform the required arithmetic operations. Because storage references dominate this loop, the ability of the FXUs to process loads and stores limits performance:

      CL.54:
      lfdu    fp5,gr3=hidden_act(gr3,8)
      lfd     fp4=h_o_w_ch_sum_array(gr7,8)
      lfdu    fp3,gr6=h_o_weights(gr6,8)
      lfd     fp2=psum_array(gr4,8)
      fma     fp4=fp4,fp1,fp5,fcr
      fma     fp2=fp2,fp1,fp3,fcr
      stfdu   gr7,h_o_w_ch_sum_array(gr7,8)=fp4
      stfdu   gr4,psum_array(gr4,8)=fp2
      bc      ctr=CL.54

Invoked with an appropriate optimization level, the compiler might unroll the loop by a factor of two and generate the code that follows. (In practice, the compiler or a preprocessor typically unrolls by a factor of four or more. For the sake of simplicity, this example was unrolled by hand.)

      CL.54:
      lfqu    fp4,fp5,gr6=hidden_act(gr6,16)
      lfq     fp8,fp9=h_o_w_ch_sum_array(gr4,16)
      lfqu    fp2,fp3,gr7=h_o_weights(gr7,16)
      lfq     fp6,fp7=psum_array(gr3,16)
      fma     fp4=fp8,fp0,fp4,fcr
      fma     fp5=fp9,fp0,fp5,fcr
      fma     fp2=fp6,fp0,fp2,fcr
      fma     fp3=fp7,fp0,fp3,fcr
      stfqu   gr4,h_o_w_ch_sum_array(gr4,16)=fp4,fp5
      stfqu   gr3,psum_array(gr3,16)=fp2,fp3
      bc      ctr=CL.54

The Load Quad (lfq) instructions load two successive array elements from each of the psum_array and h_o_w_ch_sum_array arrays; the Store Quad with Update (stfqu) instructions write back these array elements. Finally, the Load Quad with Update (lfqu) instructions load successive pairs of elements from the h_o_weights and hidden_act arrays. The unrolled loop requires three cycles for every two iterations, while the original loop requires three cycles for a single iteration. Thus, unrolling improves the performance of this storage-reference-limited loop by a factor of two.
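
At the source level, the unrolling applied above corresponds roughly to the following C sketch. The function names, the row-pointer parameters, and the NHU constant are assumptions introduced for illustration; only the array and index names come from the loop shown earlier. Because the loop runs 31 times, unrolling by two leaves one iteration for a cleanup loop, whereas the listings above show only the main loop body. The sketch also assumes that the four arrays do not overlap, since the unrolled body reorders the updates.

```c
#define NHU 30  /* assumed name for the 30 in the (30+1) loop bound */

/* Rolled form of the loop, for a single output unit ou.
   weights_row stands for h_o_weights[ou], ch_sum_row for
   h_o_w_ch_sum_array[ou], and delta_ou for delta[ou]. */
void update_rolled(double *psum_array, double *ch_sum_row,
                   const double *weights_row, const double *hidden_act,
                   double delta_ou)
{
    for (int hu = 0; hu < NHU + 1; hu++) {
        psum_array[hu] += delta_ou * weights_row[hu];
        ch_sum_row[hu] += delta_ou * hidden_act[hu];
    }
}

/* Unrolled by two: adjacent elements of each array are touched
   together, which is what lets a pair of double-word accesses be
   replaced by one quad-word access. */
void update_unrolled2(double *psum_array, double *ch_sum_row,
                      const double *weights_row, const double *hidden_act,
                      double delta_ou)
{
    int hu = 0;

    for (; hu + 2 <= NHU + 1; hu += 2) {
        psum_array[hu]     += delta_ou * weights_row[hu];
        psum_array[hu + 1] += delta_ou * weights_row[hu + 1];
        ch_sum_row[hu]     += delta_ou * hidden_act[hu];
        ch_sum_row[hu + 1] += delta_ou * hidden_act[hu + 1];
    }

    /* 31 iterations is odd, so one element remains. */
    for (; hu < NHU + 1; hu++) {
        psum_array[hu] += delta_ou * weights_row[hu];
        ch_sum_row[hu] += delta_ou * hidden_act[hu];
    }
}
```

The cycle counts quoted above are consistent with each loop body containing six storage references shared between the two FXUs, assuming each FXU handles one storage reference per cycle: six double-word references per iteration before unrolling, and six quad-word references per two iterations after.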
