
POWER2 Commercial Workload Performance

  
   
Introduction
Workload Characteristics
Analysis Methodology
Commercial Benchmarks
System Performance
Features for High Performance
Summary
 

An early version of this paper has been submitted to the IBM Journal of Research and Development. 

Introduction
Data-intensive applications, such as transaction processing and file servers, form a major market segment for computer systems. Collectively, these are called "commercial" applications because banks, airlines, insurance companies, and other commercial enterprises were the initial users. Early microprocessor-based systems simply could not handle such data-intensive applications because they did not have enough processing power and I/O connectivity to handle many data-processing users concurrently. In recent years, the dramatic increase in the processing power available from microprocessors has made it possible for systems such as RISC-based workstations to compete in markets that were once exclusively the domain of mainframes. In fact, commercial applications represent one of the most rapidly growing market segments for RISC-based UNIX workstations today [1].

This article examines some of the features of the POWER2-based RS/6000 that make it well suited for the commercial arena. We begin by describing the performance characteristics of commercial workloads. We then discuss our analysis methodology and some common benchmarks used by the industry to compare commercial performance. We conclude with a discussion of the POWER2 hardware features that contribute to the enhanced commercial performance of the RS/6000.


Workload Characteristics 
Commercial workloads include a wide variety of applications; some of the more prominent applications include on-line transaction processing, other database management services (such as batch transaction processing and decision support), and file servers.

While different commercial applications stress the system in different ways, most share some common characteristics such as:

  • Many (up to hundreds or thousands of) concurrent users.
  • Long path lengths over a large set of instructions, with a substantial part of the path length in the operating system code.
  • Fewer loop iterations and more (nonloop) branches than in scientific applications (see [2] for an analysis of the branching characteristics of scientific benchmarks).
  • Extensive manipulations of data structures through pointers, requiring integer arithmetic for address resolution.
  • Relatively little floating-point arithmetic. Data manipulation consists primarily of string or integer comparison, updates, or insertions.
  • High random I/O rates, with data spread over many megabytes or gigabytes of disk. Disk I/Os are primarily short (4 KB or 8 KB) and successive disk I/Os are often randomly distributed over the total disk space used.

Note that the characteristics of commercial applications differ dramatically from those of scientific or numerically intensive programs, which often have small instruction working sets with tight loops, use floating-point arithmetic extensively, and often do sequential rather than random I/O.

Table 1 shows some characteristics of certain commercial benchmarks. The methodology used in determining these characteristics is discussed in the next section; the benchmarks themselves are described subsequently.

The first two numeric columns in Table 1 show the percentage of executed instructions that are branches, and the percentage of these that are taken branches. An analysis of branching behavior helps to explain one of the reasons why the instruction cache (I-cache) miss rates for commercial workloads are higher than the I-cache miss rates for scientific applications. Consider a typical scientific workload dominated by short- to medium-length loops. For such a workload, where most branch instructions return control to the head of a loop, the percentage of taken branches would be much higher than the percentages shown in Table 1. Also, if the largest instruction loop for this workload fits in the I-cache, the I-cache miss rate for the scientific application would be very low.
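The loop argument above can be made concrete with a little arithmetic. The sketch below is illustrative only and assumes the simplest case: a counted loop closed by a single conditional branch that is taken on every iteration except the last. The function name is hypothetical and does not come from any tool described in this paper.

```python
def loop_taken_fraction(iterations: int) -> float:
    """Fraction of executed loop-closing branches that are taken.

    Assumption: one conditional branch at the bottom of the loop,
    taken on every iteration except the final one.
    """
    if iterations < 1:
        raise ValueError("a loop executes at least one iteration")
    return (iterations - 1) / iterations


# Longer-running loops push the taken fraction toward 100%.
for n in (1, 2, 10, 100):
    print(f"{n:4d} iterations: {loop_taken_fraction(n):.0%} of branches taken")
```

Under this assumption, a loop that runs 100 iterations takes its closing branch 99 times out of 100, whereas code that executes only one or two iterations per loop takes it half the time or less.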

In contrast, the branching data in Table 1 shows that, for commercial applications, the percentage of taken branches is relatively low, indicating that these applications execute relatively few iterations per loop. This lack of dominant loops implies that the application has a lower probability of re-executing recent instructions, leading to a higher I-cache miss rate. Other characteristics of commercial applications that lead to high I-cache miss rates include many processes and a high context switch rate. The number of processes is often proportional to the number of concurrent users; frequent context switches are the result of frequent, short I/Os and interprocess synchronization. Commercial applications tend to have higher data cache (D-cache) miss rates as well, because their data exhibits less locality and sequentiality than scientific applications.

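[Table 1: characteristics of commercial benchmarks. Columns include the percentage of executed instructions that are branches, the percentage of branches that are taken, the average sequential block size, and the fraction of instructions executed in AIX operating system code. Table not reproduced here.]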

The "Average Sequential Block Size" column in Table 1 shows the branchy nature of commercial workloads in a different way. In this paper, we define the term sequential block to be the sequence of instructions executed between two taken branches (including the second branch instruction). (This is a dynamic measure, not something obtainable from a static analysis of the code.) These sequential blocks are much shorter than in the scientific and engineering applications that we have analyzed.
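As a minimal sketch of how the branch columns and the sequential block size could be derived from a dynamic instruction trace, consider the following. The trace representation (`Instr`) and the function `branch_stats` are hypothetical and are not the format used by the AIX tracing tool; the block-size logic follows the definition given above, counting the instructions after one taken branch up to and including the next taken branch.

```python
from typing import Iterable, NamedTuple


class Instr(NamedTuple):
    """One executed instruction in a (hypothetical) dynamic trace."""
    is_branch: bool
    taken: bool = False  # meaningful only when is_branch is True


def branch_stats(trace: Iterable[Instr]) -> dict:
    total = branches = taken = 0
    block_len = 0        # instructions executed since the last taken branch
    block_sizes = []     # lengths of completed sequential blocks
    seen_taken = False   # a block is counted only between two taken branches
    for ins in trace:
        total += 1
        block_len += 1
        if ins.is_branch:
            branches += 1
            if ins.taken:
                taken += 1
                if seen_taken:
                    block_sizes.append(block_len)  # includes this branch
                seen_taken = True
                block_len = 0
    return {
        "pct_branches": 100.0 * branches / total if total else 0.0,
        "pct_taken": 100.0 * taken / branches if branches else 0.0,
        "avg_seq_block": (sum(block_sizes) / len(block_sizes)
                          if block_sizes else 0.0),
    }


# Synthetic example: a 4-instruction loop body executed 5 times.
body_taken = [Instr(False)] * 3 + [Instr(True, True)]
body_exit = [Instr(False)] * 3 + [Instr(True, False)]
print(branch_stats(body_taken * 4 + body_exit))
```

Because the statistics depend on which branches are actually taken at run time, they can be computed only from an execution trace, not from the program text.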

The last column in Table 1 shows the fraction of the total number of instructions that the AIX operating system code executes, as opposed to the instructions that the user application and shared library code execute. These numbers indicate that much of the work in commercial applications is actually done by the operating system. One reason for this relatively high usage of the operating system code is that in these applications, there is frequent movement of small amounts of data between different levels of the system with few arithmetic computations on the data. (For example, an application might send queries from an on-line user's terminal to a database server, which then sends back responses; this involves the use of a lot of operating system communication code and very little arithmetic.) This is in contrast to many scientific applications, where once the operating system brings data to the application space, the application performs extensive arithmetic manipulation before it hands the data back to the operating system for storage. Operating system code typically has a high incidence of branches and few loops.

Clearly, hardware features such as the number of integer arithmetic units; the sizes, organization, and number of levels of instruction and data caches; and the latency to caches and to memory all have a strong bearing on how well commercial applications perform on a system. The I/O subsystem also has an important effect on the performance of most applications. However, in this paper, we focus on processors and memory subsystems, and we will not discuss I/O further.

Until recently, many RISC-based UNIX workstations focused on the scientific application environment, but as this article and others in this publication [3,4] show, the POWER2-based workstation provides superior performance in both scientific and commercial environments.


Analysis Methodology 
The results presented in this paper come from one of three sources: direct measurement on POWER2-based systems, analysis of traces taken on POWER-based systems, or simulations using these traces.

POWER2 Performance Measurement
The POWER2 processor has a built-in hardware performance monitoring capability that collects measurements which allow not only detailed performance analysis, but also detailed workload characterization and analysis of system behavior [5].

Traces and Simulation
The AIX Performance group has a tool that collects instruction traces. These traces record the sequence of instructions executed (for both system code and application code) along with the virtual address of each instruction and data reference. The traces can be postprocessed to reveal workload characteristics such as those shown in Table 1. They can also drive simulators of the processor and memory subsystem. From the simulations, we can determine many of the same system performance characteristics as from hardware measurements, including cache miss rates.

Determining these quantities from simulations, which are driven by software traces, is less accurate than direct hardware measurements, but the technique has the advantage of being applicable to system configurations that do not actually exist. For example, we can create a trace once and use it to investigate many different cache configurations. It also allowed us to begin analyzing POWER2 performance before the processor was actually available. We could also analyze workload characteristics for systems that do not have hardware counters.
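The idea of collecting one trace and replaying it against many cache configurations can be sketched as follows. This is a toy set-associative LRU model written for illustration, not the simulator used by the AIX Performance group; the function `miss_rate`, the 64-byte line size, the cache sizes, and the synthetic loop trace are all assumptions and do not describe POWER2 hardware.

```python
from collections import OrderedDict


def miss_rate(addresses, cache_bytes, line_bytes=64, ways=1):
    """Miss rate of a set-associative LRU cache over a list of byte addresses."""
    num_sets = cache_bytes // (line_bytes * ways)
    if num_sets < 1 or not addresses:
        raise ValueError("need at least one set and a non-empty trace")
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        s = sets[line % num_sets]
        if line in s:
            s.move_to_end(line)          # mark as most recently used
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)    # evict the least recently used line
            s[line] = True
    return misses / len(addresses)


# Hypothetical trace: a 2 KB loop of 4-byte instructions executed 50 times.
loop_trace = [pc for _ in range(50) for pc in range(0, 2048, 4)]

# Replay the same trace against several cache configurations.
for size in (1024, 4096, 16384):
    for ways in (1, 2):
        rate = miss_rate(loop_trace, size, ways=ways)
        print(f"{size:6d} bytes, {ways}-way: miss rate {rate:.5f}")
```

In this toy model, once the cache is large enough to hold the whole loop, only the initial cold misses remain, which matches the earlier observation that a loop fitting in the I-cache yields a very low miss rate.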

We have validated our trace-based simulation methodology against direct hardware measurements on both POWER-based and POWER2-based systems, and the results almost always agree to within 10%.

