The IBM POWER2 Architecture Implementations

  
 
Abstract
POWER2 Architecture Implementations
POWER2 Architecture Implementation Differences
Characteristics of Different Classes of Codes
Performance Differences
Summary
  

Abstract 
This paper describes some of the important characteristics of the POWER2 architecture implementations used in products announced by the IBM RS/6000 Division in September 1993 and May 1994. All POWER2 architecture implementations are completely binary compatible. The POWER2 architecture is also upward binary compatible with the IBM POWER architecture.

The major differences discussed in this paper are: 1) data cache size, 2) second level cache (L2 cache) and 3) memory interface and processor interface (processor to data cache interface) widths. These characteristics result in different levels of performance impact for three classes of codes: Integer, Floating Point and Commercial Transaction Processing. The performance discussions are based on SPECint92, SPECfp92, LINPACK and TPC-C benchmark results.

 
 

POWER2 Architecture Implementations 
In September 1993, the IBM RS/6000 Division announced three RS/6000 servers based on the first implementation of the POWER2 architecture. This implementation is referred to as the original POWER2 in this paper. The September announcement was well received by IBM customers, many of whom requested a desktop implementation and greater commercial transaction processing capacity. Satisfying these two requests was a major challenge for the May 1994 IBM announcement.

Several design trade-offs were made to meet the stringent packaging and price targets of a desktop implementation. In this paper, the POWER2 architecture implementation for the desktop is referred to as the new-desktop POWER2. Another set of design trade-offs was required to achieve greater commercial transaction processing capacity at a reasonable cost. This effort led to a new POWER2 architecture implementation for the server models, referred to as the new-server POWER2. The three implementations of the POWER2 architecture and the corresponding RS/6000 model numbers are summarized in Table 1.

[Table 1. POWER2 architecture implementations and corresponding RS/6000 models]
 

POWER2 Architecture Implementation Differences 
At the processor level, all three implementations are identical and thus completely binary compatible. All implementations can execute up to six instructions (branch, condition register, two fixed-point, and two floating-point) per cycle and have the same instruction cache size (32KB). The major differences between the three implementations are summarized in Table 2.

[Table 2. Major differences between the three POWER2 implementations]

The new-server and the new-desktop implementations have smaller data caches (128KB and 64KB respectively). The L2 cache is an option on the model 390 and is standard on the 59H, R20 and R24 models. The L2 cache is a combined instruction and data cache. Existing application programs will, in general, exploit the L2 cache without re-compilation. Another significant difference is in the memory and the processor interface widths, as shown in the last two columns in Table 2.

The memory interface width is the width of the memory data bus. The processor interface is the width of the processor to data cache bus. The following section discusses how these design choices tailor the three implementations for different classes of applications.

For the new-server and the new-desktop implementations, the data cache to L2 cache interface width is the same as the respective memory interface width presented in column 5 of Table 2.
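
As a rough illustration of what an interface width means per machine cycle, the sketch below converts a width in words into bytes and operands per cycle. It assumes a 32-bit word (this paper describes the four-word interface as 4x32 bit) and ignores latency, contention and cache misses. The function names are illustrative only and do not come from any IBM tool.

```python
WORD_BYTES = 4  # assumption: one word is 32 bits, as in "4x32 bit"


def bytes_per_cycle(width_words):
    """Peak bytes moved per machine cycle over an interface of the given width."""
    return width_words * WORD_BYTES


def operands_per_cycle(width_words, operand_bytes):
    """Peak number of whole operands of the given size delivered per cycle."""
    return bytes_per_cycle(width_words) // operand_bytes


if __name__ == "__main__":
    for width in (4, 8):
        print(
            f"{width}-word interface: "
            f"{bytes_per_cycle(width)} bytes/cycle, "
            f"{operands_per_cycle(width, 4)} single precision or "
            f"{operands_per_cycle(width, 8)} double precision operands/cycle"
        )
```

Under these assumptions a four-word interface delivers at most two double precision operands per cycle, while an eight-word interface delivers four.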

 

Characteristics of Different Classes of Codes 
Integer codes as represented by SPECint92 typically do not access large amounts of data and mostly fit in any reasonably large data cache. A reduction in data cache size will reduce performance slightly but the addition of an L2 cache usually makes up for the loss up to a point. For integer codes a four-word processor interface is sufficient to keep the two fixed point units busy. Also, a 32KB instruction cache is usually sufficient for these types of codes.

Typical floating point codes represented by the SPECfp92 and LINPACK benchmarks access large amounts of data. These floating point codes are more significantly affected by the smaller data caches and the reduced interface widths. The L2 cache compensates for some of the lost performance, but not all. This is especially true on the new desktop models. On some floating point codes, the four-word processor interface may not be sufficient to keep the two floating point units busy. The 32KB instruction cache is usually sufficient for these types of codes.

Commercial transaction processing (TP) workloads are almost exclusively integer codes but their characteristics are different from SPECint92. TP codes typically contain a large instruction footprint, which will not fit in a 32KB instruction cache. In this environment, the addition of an L2 cache significantly reduces the instruction cache miss penalty. Since TP data is mostly 32-bit (integer) data, a four-word processor interface is sufficient to feed the two fixed point units. Thus, the new-desktop and new-server POWER2 implementations show significant performance gains on TP codes.

The behavior of classes of codes discussed above is fairly typical. However, customer application code behavior may be different.

 

Performance Differences 
For the discussions in this section, the model 590 (original POWER2), the model 59H (new-server POWER2) and the model 390 (new-desktop POWER2) are used in the configurations shown in Table 3.

[Table 3. Configurations and benchmark results for the models 590, 59H and 390]

Since these models operate at about the same frequency, the implementation differences are reflected in the performance results. The discussions are restricted to the performance of benchmarks: SPECint92, SPECfp92, LINPACK, and TPC-C. The benchmark results of all three models are presented in Table 3.

The models 590 and 59H have nearly identical performance on SPECint92. The model 390 SPECint92 result is about 7% lower. The four-word CPU cache interface (4x32 bit) should be sufficient to feed the two integer units of the model 390. The most likely source of the performance decline is the smaller data cache and/or the memory interface.

On LINPACK SP (100x100 Single Precision Benchmark), all three models offer the same performance. The LINPACK SP data fits in the 64KB Data cache. The L2 cache and the memory interface have minimal impact on this benchmark result. The four-word processor interface is sufficient to keep both floating point units busy.

On LINPACK DP (100x100 Double Precision Benchmark), the models 590 and 59H performance results are the same. The LINPACK DP data fits in the 128KB data cache. Thus, the differences in cache size and memory interface do not affect performance. The model 390 LINPACK DP performance is less than half that of the model 590. The eight-word processor interface can deliver two quad-word floating point storage reference instructions (two load quads or two store quads) per machine cycle. Since the model 390 four-word interface operates at half the capacity of the eight-word interface on the model 590, a 50% drop in performance should be expected. Further, the DP data does not fit in the 64KB data cache, causing the performance to drop substantially.
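
The cache-fit statements for the two LINPACK cases can be checked with simple arithmetic. The sketch below is a simplification: it counts only the 100x100 coefficient matrix (not the right-hand-side vector or other working data) and assumes 4-byte single precision and 8-byte double precision elements. The helper names are hypothetical.

```python
KB = 1024


def matrix_bytes(n, element_bytes):
    """Storage needed for an n x n dense matrix."""
    return n * n * element_bytes


def fits_in_cache(n, element_bytes, cache_kb):
    """True if the matrix alone fits in a cache of cache_kb kilobytes."""
    return matrix_bytes(n, element_bytes) <= cache_kb * KB


if __name__ == "__main__":
    for label, element_bytes in (("SP", 4), ("DP", 8)):
        size = matrix_bytes(100, element_bytes)
        for cache_kb in (64, 128):
            verdict = "fits" if fits_in_cache(100, element_bytes, cache_kb) else "does not fit"
            print(f"LINPACK {label}: {size} bytes {verdict} in a {cache_kb}KB data cache")
```

Under these assumptions the single precision matrix occupies 40,000 bytes and fits in a 64KB cache, while the double precision matrix occupies 80,000 bytes, which exceeds 64KB but fits in 128KB.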

The model 590 offers slightly better performance than the model 59H on SPECfp92. The reduction in memory interface width and data cache size is not made up by the addition of the L2 cache. The model 390 SPECfp92 result is about 20% lower than the model 59H result. The smaller data cache and narrower interfaces contribute to the decline in performance relative to the models 590 and 59H.

The new-desktop and the new-server POWER2 implementations have smaller data cache lines than the original POWER2 implementation. The smaller cache lines may affect the performance of floating point codes with contiguous memory access.

On double precision floating point codes (Engineering/Scientific applications) the original POWER2 implementation (such as the model 590) delivers better performance than the new-server implementation (such as the model 59H).

The models 59H and 390 show significant performance gains over the model 590 (Table 3) on the TPC-C benchmark. The significant gains are due to the addition of L2 cache, application tuning, and operating system enhancements. The addition of L2 cache accounts for a large part of the performance gain. The L2 cache, significantly larger than the instruction cache, reduces the effect of instruction cache misses by containing the majority of code paths executed by the transaction processing software. A four-word processor interface is sufficient to feed the two integer units. The performance loss, if any, due to a reduced memory interface is inconsequential compared to the performance boost due to the large L2 cache. The 59H performance gain over 590 due to the addition of L2 cache is estimated to be about 20 percent for workloads similar to TPC-C.
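
The way an L2 cache reduces the instruction cache miss penalty can be sketched with a simple weighted-average model. All latencies and hit rates below are hypothetical placeholders chosen for illustration. They are not measurements of any RS/6000 model, and the model ignores overlap between misses and execution.

```python
def average_miss_penalty(l2_hit_rate, l2_latency_cycles, memory_latency_cycles):
    """Average cycles to service an instruction cache miss.

    A fraction l2_hit_rate of misses is served from the L2 cache;
    the remainder goes to main memory. With no L2 cache, use l2_hit_rate=0.
    """
    if not 0.0 <= l2_hit_rate <= 1.0:
        raise ValueError("l2_hit_rate must be between 0 and 1")
    return (l2_hit_rate * l2_latency_cycles
            + (1.0 - l2_hit_rate) * memory_latency_cycles)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    memory_latency = 20.0
    l2_latency = 5.0
    without_l2 = average_miss_penalty(0.0, l2_latency, memory_latency)
    with_l2 = average_miss_penalty(0.9, l2_latency, memory_latency)
    print(f"average miss penalty without L2: {without_l2:.1f} cycles")
    print(f"average miss penalty with L2:    {with_l2:.1f} cycles")
```

The larger the share of the transaction processing code paths that the L2 cache holds, the closer the average penalty moves toward the L2 latency rather than the memory latency.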

The TPC-C benchmark results confirm that the new-desktop and the new-server POWER2 implementations are more suitable for commercial applications than the original POWER2 implementations.

Finally, an important aspect of the original POWER2 implementation needs to be discussed. The eight-word memory interface of the original POWER2 implementation requires a minimum of four memory cards to achieve its full bandwidth potential. With fewer than four memory cards, the effective memory interface width and the data cache size become 4 words and 128KB respectively. In other words, a model 590 with two memory cards effectively becomes a model 59H with no L2 cache. All 590 benchmark results included in Table 3 were measured on a machine with four memory cards installed.

 

Summary 
The IBM RS/6000 Division has introduced three different implementations of the POWER2 architecture. The original implementations (models 590, 58H and 990) are optimal for Engineering/Scientific codes that crunch double precision floating point data. The new implementations (models 59H, R20, R24, 380 and 390) are more suitable for integer and commercial applications. All three POWER2 architecture implementations have comparable performance on integer and single precision floating point codes.

 

Acknowledgments
The authors would like to thank E. Dee Prewit, Steve W. White, Maurice Franklin and other colleagues for their valuable suggestions.

References

  1. Steven W. White and Sudhir Dhawan, "POWER2: Next Generation of the RS/6000 Family", PowerPC and POWER2: Technical Aspects of the New IBM RS/6000, pp. 8-18.


Authors
Sohel R. Saiyed, IBM Corporation, 11400 Burnet Road, Austin, TX 78758. Mr. Saiyed has extensive experience in architecture, performance and compilers both inside and outside IBM. Currently he is an Advisory Engineer in Systems Architecture and Performance, RS/6000 Division. He has an M.S. degree in Computer Engineering from Clemson University and a B. Tech. degree in Electrical Engineering from the Indian Institute of Technology, Kharagpur, India. Internet address: saiyed@hwperform.austin.ibm.com.

Jacob Thomas, IBM Corporation, 11400 Burnet Road, Austin, TX 78758. Since 1981, Mr. Thomas has held technical positions in IBM development and marketing divisions. His interests include performance evaluation of supercomputers and high performance compilers. Currently, he is an Advisory Programmer in Processor Performance, RS/6000 Division. He holds an M.S. degree in Statistics from Michigan State University and an M.S. degree in Physics from Birla Institute of Technology and Science, Pilani, India. Internet address: thomasj@hwperform.austin.ibm.com.

Trademarks

All trademarks are the property of their respective owners.

IBM, PowerPC, POWER, POWER2 and RS/6000 are trademarks of International Business Machines Corporation.

SPECint92 and SPECfp92 are trademarks of Standard Performance Evaluation Corporation.

TPC Benchmark C is a trademark of the Transaction Processing Performance Council.