Introduction
The new POWER2-based Model 590 brings more good news to technical professionals who require high computing performance at workstation prices. By adding new architectural and hardware features to the proven RS/6000 base, POWER2 improves the performance of many types of engineering/scientific applications. This paper compares the application performance of the new Model 590 (POWER2) to that of the Model 580 (POWER), and discusses the salient features of the POWER2 processor. The paper introduces the comparison within the context of fundamental loops that highlight the various improvements. Next, it examines the performance on a wide variety of frequently encountered constructs. Finally, measurements on a collection of full scientific/engineering application programs demonstrate the system performance of the Model 590. Related papers [1,2] discuss the performance of the SPEC benchmarks and commercial applications.
Some key architectural extensions and POWER2 hardware features, which a related paper describes in more depth [3], are:
- Quad-word floating-point load and store instructions
- Square Root instruction
- Dual floating-point execution units (FPU)
- Dual fixed-point execution units (FXU)
- Improved divide performance
- Additional physical floating-point registers for improved register renaming
- Decoupled floating-point arithmetic, floating-point load, and floating-point store pipelines
- Improved instruction fetch capabilities
- Larger, dual-ported data cache
- Wider bus interfaces
- Larger translation look-aside buffers
- Faster cycle time
- Improved branch processing
The intent of this paper is to relate these capabilities to measurable improvements exhibited in engineering and scientific application programs.
Loop Kernel Performance
Table 1 lists the performance for loops selected to demonstrate the effects of certain POWER2 enhancements. The data working sets for these loops fit within the cache; therefore, the measurements reflect the performance of the processor and not the memory subsystem. The loops consist of double-precision calculations that, except where noted, are coded in a straightforward manner such as:
      do j = 1,M
         do i = 1,N
            x(i) = sqrt(y(i))
         enddo
         call dummy(x,N)
      enddo
The values of variables N and M are 256 and 15360, respectively. The outer j-loop repeats the inner loop often enough to time it accurately. The call to subroutine dummy prevents the compiler from simply eliminating the outer loop.
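The conversion from a measured elapsed time to per-iteration cycle counts and MFLOPS rates can be sketched as follows. This is a minimal illustration in Python, not the measurement code used for this paper. The function names, the example elapsed time, and the count of one floating-point operation per iteration are assumptions; N, M, and the 66.5 MHz clock rate come from the text.

```python
N = 256             # inner-loop trip count (from the text)
M = 15360           # outer-loop repetitions (from the text)
CLOCK_HZ = 66.5e6   # Model 590 clock rate quoted in this section

def cycles_per_iteration(elapsed_s, n=N, m=M, clock_hz=CLOCK_HZ):
    """Average processor cycles per inner-loop iteration."""
    return elapsed_s * clock_hz / (n * m)

def mflops(elapsed_s, flops_per_iteration, n=N, m=M):
    """Millions of floating-point operations per second."""
    return flops_per_iteration * n * m / elapsed_s / 1e6

# Hypothetical elapsed time (roughly 0.887 s), chosen so that the loop
# averages 15 cycles per iteration; it is not a measured value.
elapsed = 15 * N * M / CLOCK_HZ

# Assumption: one floating-point operation per iteration.
print(cycles_per_iteration(elapsed))
print(mflops(elapsed, flops_per_iteration=1))
```

Because the timing spans the whole loop nest, any overhead from the call to dummy is folded into the per-iteration average.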
The columns in Table 1 are, left to right: the kernel description, the number of floating-point operations per loop iteration, the number of floating-point operands loaded per iteration, the number of floating-point operands stored per iteration, the average number of cycles per inner-loop iteration, and the corresponding MFLOPS (millions of floating-point operations per second) rate. The first two loops (rows 1 and 2 of Table 1) show the effect of the new divide and square root capabilities. Where possible, the compiler unrolls such loops to expose two independent Square Root instructions and so keep both FPU pipelines busy. As a result, the average number of cycles for the Square Root instruction is 15. Similarly, the average divide time is 10 cycles.
The daxpy (double precision a times x plus y) loop, in row 3 of Table 1, demonstrates the effects of the quad-word storage references and dual floating-point pipes. The compiler unrolls the loop to a depth of four and thus generates 11 instructions (four Load Quads, four floating-point multiply-adds (FMAs), two Store Quads, and a branch). Each iteration of the unrolled loop executes in three cycles, because the six quad-word storage references proceed at two load/store operations per cycle. The FMAs and the branch execute in the same three cycles. The measured performance confirms this: 168.5 MFLOPS, close to the 177.3 MFLOPS that eight floating-point operations per three cycles would yield at 66.5 MHz. This is greater than four times the performance of the Model 580, demonstrating the effectiveness of the 32-byte bandwidth of the Model 590 on storage-reference-limited loops.
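The arithmetic behind the three-cycle figure can be checked with a short calculation. The sketch below is illustrative only; the variable names are assumptions, and the instruction counts, clock rate, and measured rate are taken from the paragraph above.

```python
CLOCK_MHZ = 66.5

# Instruction mix of the unrolled daxpy loop (from the text).
load_quads, store_quads, fmas = 4, 2, 4

STORAGE_REFS_PER_CYCLE = 2   # load/store operations per cycle (from the text)
FPUS = 2                     # dual floating-point units

cycles_storage = (load_quads + store_quads) / STORAGE_REFS_PER_CYCLE  # 3.0
cycles_fpu = fmas / FPUS                                              # 2.0
cycles = max(cycles_storage, cycles_fpu)   # storage references dominate

flops = 2 * fmas                           # each FMA is a multiply plus an add
bound_mflops = flops / cycles * CLOCK_MHZ  # about 177.3
measured_mflops = 168.5                    # from the text
efficiency = measured_mflops / bound_mflops  # about 0.95

print(cycles, round(bound_mflops, 1), round(efficiency, 2))
```

The storage-reference count, not the FMA count, sets the cycle time, which is why this loop is described as storage-reference limited.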
Unrolling does not affect the dot product (row 4) because the FMA in each iteration depends on completion of the FMA in the previous iteration. Since the FPUs have latencies of two cycles, the processor performs only one FMA every two cycles; this loop will operate at one-fourth the peak performance of the machine.
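The one-fourth figure follows directly from the FMA latency and the number of FPUs. A minimal worked calculation, with assumed variable names and the 66.5 MHz clock rate quoted earlier in this section:

```python
CLOCK_MHZ = 66.5
FPUS = 2            # dual floating-point units
FLOPS_PER_FMA = 2   # one multiply plus one add
FMA_LATENCY = 2     # cycles before a dependent FMA can start (from the text)

# Peak: both FPUs start an independent FMA every cycle.
peak_flops_per_cycle = FPUS * FLOPS_PER_FMA            # 4

# Dependent chain: each FMA waits for the previous one to complete.
chain_flops_per_cycle = FLOPS_PER_FMA / FMA_LATENCY    # 1.0

fraction_of_peak = chain_flops_per_cycle / peak_flops_per_cycle  # 0.25
peak_mflops = peak_flops_per_cycle * CLOCK_MHZ         # 266.0
chain_mflops = chain_flops_per_cycle * CLOCK_MHZ       # 66.5

print(fraction_of_peak, peak_mflops, chain_mflops)
```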
The performance of the dot product improves greatly by using partial sums (see row 5 of Table 1). A code example for a dot product four-way sum reduction is:
      do i=1,n,4
         s1=s1+x(i)*y(i)
         s2=s2+x(i+1)*y(i+1)
         s3=s3+x(i+2)*y(i+2)
         s4=s4+x(i+3)*y(i+3)
      enddo
      s=s1+s2+s3+s4
The compiler generates nine instructions for the four-way sum reduction dot product (four Load Quads, four FMAs, and one branch). The instructions execute in three cycles (0.75 cycles per original iteration). As one might expect, measurements confirm that the loop obtains two-thirds of peak performance. In this case, instruction fetch capability limits the performance. Since the branch is the ninth instruction, the processor does not fetch and process the branch until the second fetch cycle. By the time it decodes the branch, a third sequential fetch cycle has begun.
If, in the preceding example, the reference to array y becomes another reference to x, the resulting loop calculates the sum of the squares for elements of the x array. A code example for a four-way sum of squares is:
      do i=1,n,4
         s1=s1+x(i)*x(i)
         s2=s2+x(i+1)*x(i+1)
         s3=s3+x(i+2)*x(i+2)
         s4=s4+x(i+3)*x(i+3)
      enddo
      s=s1+s2+s3+s4
In this case, the new loop eliminates two load instructions and increases data reuse within the loop. The loop instruction count for this loop becomes seven: two Load Quads, four FMAs, and one branch. Since the number of instructions fetched per cycle is 8, the processor sees the branch in the first cycle of the loop, permitting the loop to execute in two cycles and yielding nearly peak performance (see row 6 in Table 1).
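The cycle counts quoted for the unrolled loops can be reproduced with a small toy model. It is a sketch fitted to the three cases described in this section, not a description of the actual POWER2 pipeline. The function name and the rule that one extra sequential fetch cycle begins before the branch takes effect are assumptions made for illustration; the fetch width of eight and the instruction counts come from the text.

```python
import math

FETCH_WIDTH = 8              # instructions fetched per cycle (from the text)
STORAGE_REFS_PER_CYCLE = 2   # load/store operations per cycle
FPUS = 2                     # dual floating-point units

def loop_cycles(loads_stores, fmas, branches=1):
    """Toy estimate of cycles per unrolled-loop iteration (assumed model)."""
    n_instr = loads_stores + fmas + branches
    # Fetch cycle in which the closing branch arrives, plus one more
    # sequential fetch that has begun before the branch is decoded.
    fetch_bound = math.ceil(n_instr / FETCH_WIDTH) + 1
    storage_bound = math.ceil(loads_stores / STORAGE_REFS_PER_CYCLE)
    fpu_bound = math.ceil(fmas / FPUS)
    return max(fetch_bound, storage_bound, fpu_bound)

def fraction_of_peak(loads_stores, fmas):
    """Achieved FMAs per cycle relative to one FMA per FPU per cycle."""
    return fmas / loop_cycles(loads_stores, fmas) / FPUS

print(loop_cycles(6, 4))   # daxpy: 4 Load Quads + 2 Store Quads, 4 FMAs -> 3
print(loop_cycles(4, 4))   # four-way dot product: 9 instructions -> 3
print(loop_cycles(2, 4))   # four-way sum of squares: 7 instructions -> 2
```

Under this model the four-way dot product reaches two-thirds of peak and the sum of squares reaches peak, matching the behavior described above.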
In addition to increasing data bandwidth, the Load Quad instructions have minimized the instruction fetch requirements for the loop by combining four double-word references into two quad-word references.
The use of partial sums changes the order of the computations in the previous loops. Because floating-point arithmetic is not associative, this may produce a different floating-point result than the original version. The result of the unrolled loop tends to be more accurate: each partial sum grows more slowly than a single running sum, so small product terms are less likely to be lost to rounding.
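The effect of reordering can be seen in a small experiment. The sketch below is in Python rather than Fortran, and the input data are contrived so that the difference is visible; they are not taken from any measurement in this paper. Python rounds the multiply and the add separately, unlike a fused multiply-add, but here every product is exact, so only the order of the additions matters.

```python
def dot_sequential(x, y):
    """Single running sum, in original loop order."""
    s = 0.0
    for xi, yi in zip(x, y):
        s += xi * yi
    return s

def dot_four_way(x, y):
    """Four partial sums; assumes len(x) is a multiple of 4."""
    s1 = s2 = s3 = s4 = 0.0
    for i in range(0, len(x), 4):
        s1 += x[i] * y[i]
        s2 += x[i + 1] * y[i + 1]
        s3 += x[i + 2] * y[i + 2]
        s4 += x[i + 3] * y[i + 3]
    return s1 + s2 + s3 + s4

# Contrived data: the exact dot product is 6.
x = [1e16, 1.0, 1.0, 1.0, -1e16, 1.0, 1.0, 1.0]
y = [1.0] * 8

print(dot_sequential(x, y))   # 3.0: three small terms are absorbed by 1e16
print(dot_four_way(x, y))     # 6.0: the large terms cancel within one partial sum
```

In this example the partial-sum version happens to be exact, but that is a property of the chosen data; in general the two orderings simply round differently.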
