An early version of this paper has been submitted to the IBM Journal of Research and Development.
Introduction
The POWER2 Performance Monitor, a tool used within IBM, provides the detailed hardware measurements needed to study hardware and software interaction on workloads. The monitor is integrated into the POWER2 processor and is fully software-accessible. It can selectively measure specific software processes with minimal disruption.
The monitor provides performance measures that include the number of executed instructions, elapsed cycles, counts and delays associated with cache and TLB misses, and utilization of the various execution elements. Such measures can help to efficiently locate and eliminate performance bottlenecks.
The Need for Hardware Measurement
Improving system performance increasingly depends on a detailed understanding of hardware and software interaction [1]. An effective measurement facility can provide the necessary understanding and a basis for evaluating design decisions of future systems by identifying existing bottlenecks. For example, it is useful to identify opportunities for better CPU execution utilization [2, 3] that are available through better instruction scheduling. Readily available information identifying scheduling opportunities can hasten the development of a compiler.
Of course, hardware measurement is not the only method of understanding system behavior. Simulation is an effective way to study hardware/software interaction in the earliest phases of the design and implementation of a system [4]. However, once the hardware is available, measuring it is clearly preferable to simulating it.
Previous Measurement Facilities
In previous RS/6000 systems, IBM development teams collected hardware performance information using special external instrumentation, which was adapted to the processor model under consideration. The external instrumentation approach was possible because the RS/6000 design provided access to some signals typically internal to a single chip [5]. The implementation also included special signals solely for performance monitoring, using either multiplexed functional pins or pins dedicated to measurement. Conversion of the externally available signals into useful performance information required additional logic.
Subtle differences between the various CPU models required each CPU to have different instrumentation. Additionally, the difficulty and cost of instrumenting any particular CPU model further constrained the number of systems instrumented.
Desirable Monitor Characteristics
A hardware measurement facility can be an asset to understanding hardware/software interaction and thus aid in improving the performance of current and future systems. A desirable characteristic is the ability to select specific times and specific threads of execution to measure. An analyst studying a numerical analysis application may be interested in particular subroutines or even particular loops [1]. On the other hand, an analyst studying a commercial application may want to be able to exclude I/O wait intervals [6].
For both internal and external instrumentation, a measurement facility is much easier to use if the software can efficiently access and control the measurements of its own execution. Desired signals are usually available within the chips; however, pin restrictions often limit externally available information. Therefore, ideally, CPU chips would incorporate monitoring instrumentation that is accessible by software in a simple yet effective manner. An example of such a capability is the software-accessible counters in the Cray Y-MP [7]. POWER2 implementations incorporate similar capabilities.
Implementation of Monitor
Systems are becoming more complex and highly integrated, allowing less data to be available for external hardware to intercept and record. One example in particular is the multichip module (MCM) packaging of the POWER2 processor. The MCM package includes five basic units, each being a distinct chip. These units are: the Fixed-Point Unit (FXU), the Floating-Point Unit (FPU), the Instruction Cache Unit (ICU or branch unit), the Storage Control Unit (SCU), and the Data Cache Units (DCUs) [8, 9, 10, 11].
Most of the chip inputs are interconnected on the MCM substrate and not externally accessible. The memory and I/O buses are the only external interfaces to the MCM [11, 12]. Consequently, there are few external POWER2 CPU signals suitable for deriving performance data.
Providing the measurement data through software-accessible registers can reduce the pin I/O requirements. To this end, the designers allocated the ICU, FPU, SCU, and FXU five counters each. Similarly, the designers provided each of these four basic units with a 4-bit control field in the Monitor Mode Control Register (MMCR) that selects the set of events to be counted. Thus, for each unit, it is possible to choose any one of sixteen groups of five events each for monitoring while requiring only nine pins per chip, as shown in Figure 1.
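The selection scheme described above amounts to packing one 4-bit group selector per unit into a control word. The following is a minimal sketch of that idea only. The field order and bit positions are hypothetical (the text does not give the MMCR layout), and the split of nine pins into four selector lines plus five event lines is one plausible reading of the pin count, not a statement from the source.

```python
# Hypothetical layout: the real MMCR bit positions are not given in the text.
UNITS = ("ICU", "FPU", "SCU", "FXU")  # assumed field order
FIELD_BITS = 4                         # 4-bit selector per unit (from the text)
EVENTS_PER_GROUP = 5                   # five counters per unit (from the text)
GROUPS_PER_UNIT = 1 << FIELD_BITS      # sixteen selectable groups


def pack_selectors(groups):
    """Pack one group selector per unit into a single integer."""
    word = 0
    for index, unit in enumerate(UNITS):
        group = groups[unit]
        if not 0 <= group < GROUPS_PER_UNIT:
            raise ValueError(f"{unit}: group {group} needs more than {FIELD_BITS} bits")
        word |= group << (index * FIELD_BITS)
    return word


def unpack_selectors(word):
    """Recover the per-unit group selectors from a packed word."""
    mask = GROUPS_PER_UNIT - 1
    return {unit: (word >> (index * FIELD_BITS)) & mask
            for index, unit in enumerate(UNITS)}


# Assumed pin budget per chip: 4 selector lines plus 5 event lines.
PINS_PER_CHIP = FIELD_BITS + EVENTS_PER_GROUP
```

Because each unit has its own field, changing one unit's selector leaves the other three fields untouched, which mirrors the independence of event selection across units.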
As illustrated by Figure 1, the monitor contains twenty-two 32-bit counters for CPU and storage-related performance events. The MMCR provides monitor control functions. The MMCR, along with two status bits in the POWER Machine State Register (MSR), also allows selective measurement of specific threads of execution. The counters and the MMCR are addressable for read and write operations using Programmed I/O (PIO). With one special exception discussed later, the selection of the events to count for any unit is independent of the selection of the other three units.
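Because the counters are 32 bits wide, software that reads a counter before and after a region of interest has to allow for the counter wrapping in between. The sketch below shows the usual modular-arithmetic approach. The function name is hypothetical, and the result is only correct if the counter wrapped at most once between the two reads.

```python
COUNTER_BITS = 32                        # counter width (from the text)
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def counter_delta(start, end):
    """Events counted between two reads of a free-running 32-bit counter.

    Subtracting modulo 2**32 gives the right answer even when the counter
    wrapped once between the reads; multiple wraps cannot be detected.
    """
    return (end - start) & COUNTER_MASK
```

For example, a first read of `0xFFFFFFF0` followed by a second read of `0x10` corresponds to 32 counted events, not a negative number.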
The FXU also adds an Instruction Match Register (IMR) to count the occurrence of specific instructions. By repetitively cycling through all desired instruction codes, one can obtain a sampling of instruction execution frequencies.
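The sampling approach can be illustrated with a small estimator. This is a sketch under stated assumptions, not the tool's actual procedure: it supposes that, for each interval in which the match register held a given instruction code, software recorded both the number of matches and the total number of instructions executed (read from a separate counter). The function name and the sample format are hypothetical.

```python
from collections import defaultdict


def estimate_frequencies(samples):
    """Estimate per-opcode execution frequency from time-multiplexed samples.

    Each sample is (opcode, matches, instructions): while the match register
    held `opcode`, `matches` of the `instructions` executed matched it.
    """
    matched = defaultdict(int)
    executed = defaultdict(int)
    for opcode, matches, instructions in samples:
        matched[opcode] += matches
        executed[opcode] += instructions
    # Skip opcodes whose intervals saw no instructions at all.
    return {op: matched[op] / executed[op] for op in matched if executed[op] > 0}
```

Each estimate only reflects the intervals in which that opcode was being matched, so the estimates are representative only if the workload behaves similarly across intervals.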
Finally, a Software Programmable Event (SPE) bit allows software to create "software events." The FXU counters can count the number of cycles in which the SPE bit is set to 1.
Monitoring States
The POWER Architecture defines the system-wide MSR as part of the process state. Two MSR bits, the Process Mark (PM) bit (bit 29) and the Problem (PR) bit, along with the MMCR, control the state of the monitor. Figure 2 shows the MMCR, which provides control modes that take advantage of the MSR PM and PR bits. Because the MSR is part of the process state, the operating system saves and restores it when processes pause and resume execution. Therefore, with low overhead, the MSR PM bit can selectively qualify processes for monitoring. Thus the MMCR and MSR together can efficiently control the state of the monitor.
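The effect of qualifying processes by the PM bit can be modeled in a few lines. This is a simplified sketch, not the hardware's behavior in full: it assumes a 32-bit MSR with IBM bit numbering (bit 0 is the most significant bit), ignores the PR bit, and uses a hypothetical `only_marked` flag as a stand-in for an MMCR control mode, since the actual modes appear only in Figure 2.

```python
MSR_BITS = 32                             # assumed MSR width
PM_BIT = 29                               # from the text; bit 0 assumed to be the MSB
PM_MASK = 1 << (MSR_BITS - 1 - PM_BIT)


def is_marked(msr):
    """True if the Process Mark bit is set in an MSR value."""
    return bool(msr & PM_MASK)


def count_cycles(timeline, only_marked=True):
    """Sum cycles over (msr, cycles) slices of execution.

    `timeline` lists the MSR in effect during each slice, as restored by the
    operating system on each context switch. With `only_marked` set, slices
    whose MSR lacks the PM bit are not counted.
    """
    total = 0
    for msr, cycles in timeline:
        if not only_marked or is_marked(msr):
            total += cycles
    return total
```

Since the mark travels with the saved MSR, no extra work is needed at context-switch time beyond what the operating system already does, which is the source of the low overhead noted above.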