PAPI aims to provide the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors.
PAPI enables software engineers to see, in near real time, the relation between software performance and processor events.
The Performance API (PAPI) project specifies a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors.
These counters exist as a small set of registers that count Events, occurrences of specific signals related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture.
This correlation has a variety of uses in performance analysis including hand tuning, compiler optimization, debugging, benchmarking, monitoring and performance modeling. In addition, it is hoped that this information will prove useful in the development of new compilation technology as well as in steering architectural development towards alleviating commonly occurring bottlenecks in high performance computing.
PAPI provides two interfaces to the underlying counter hardware; a simple, high level interface for the acquisition of simple measurements and a fully programmable, low level interface directed towards users with more sophisticated needs.
The low level PAPI interface deals with hardware events in groups called EventSets. EventSets reflect how the counters are most frequently used, such as taking simultaneous measurements of different hardware events and relating them to one another.
For example, relating cycles to memory references or flops to level 1 cache misses can indicate poor locality and memory management. In addition, EventSets allow a highly efficient implementation which translates to more detailed and accurate measurements.
EventSets are fully programmable and have features such as guaranteed thread safety, writing of counter values, multiplexing and notification on threshold crossing, as well as processor specific features. The high level interface simply provides the ability to start, stop and read specific events, one at a time.
PAPI provides portability across different platforms. It uses the same routines with similar argument lists to control and access the counters for every architecture. As part of PAPI, we have predefined a set of events that we feel represents the lowest common denominator of every good counter implementation.
Our intent is that the same source code will count similar and possibly comparable events when run on different platforms. If the programmer chooses to use this set of standardized events, then the source code need not be changed and only a fresh compilation and link is necessary. However, should the developer wish to access machine specific events, the low level API provides access to all available events and counting modes.
If an event or feature does not exist on the current platform, PAPI returns an appropriate error code. This significantly reduces the porting effort of code using PAPI because the semantics of each call to PAPI remains the same, just the argument lists need updating. In addition to the standard set, each PAPI implementation supports all native events through the ability to directly accept platform specific counter numbers. Definitions for most, if not all of these, are included as conditional macros in the header file. In this way, PAPI avoids having inefficient code to translate all events for all platforms into a uniform representation and back again.
This translation is only done for the relatively few events defined in the standardized set. Some processors like those in the POWER series have counter groups. They enable access to specific groups of counters, instead of individual events. This presents a serious portability problem, thus PAPI abstracts hardware counters from their groups with a packed naming scheme. Each counter control value or event is made up of the counter group number and the number of the specific counter in that group.
PAPI can be divided into two layers of software. The upper layer consists of the API and machine independent support functions. The lower layer defines and exports a machine independent interface to machine dependent functions and data structures. These functions access the substrate, which may consist of the operating system, a kernel extension or assembly functions to directly access the processors registers.
PAPI tries to use the most efficient and flexible of the three, depending on what is available. Naturally, the functionality of the upper layers heavily depends on that provided by the substrate. In cases where the substrates do not provide highly desirable features, PAPI attempts to emulate them as described below.
PAPI makes sure the underlying operating system or library guards against overflow of counter values.
Each counter can potentially be incremented multiple times in a single clock cycle. This combined with increasing clock speeds and the small precision of some of the physical counters means that overflow is likely to occur.
One of the more advanced features of PAPI is to provide a portable implementation of asynchronous notification when counters exceed a user specified value.
This functionality provides the basis for PAPI's SVR4 compatible profiling calls, that generate an accurate histogram of performance interrupts based on hardware metrics, not on time. Such functionality provides the basis for all line level performance analysis software, from the antiquated days of AT&T's prof to SGI's SpeedShop. Thus for any architecture with even the most rudimentary access to hardware performance counters, PAPI provides the foundation for a truly portable, source level, performance analysis tool based on real processor statistics.
What's New in This Release: [ read full changelog ]
· This version adds support for AMD Family 11 and 12 processors, and redefines the PAPI_FP_OPS event for Intel SandyBridge so that it only requires 4 counters and can run properly with hyperthreading enabled.
· A host of bugfixes and code clean-ups have also been implemented, including significant rewrites of several components for clarity and functionality