For each benchmark, a set of problem sizes is defined. Throughout this section, we refer to a kernel by its index k, and refer to particular data sets for a given kernel as di, where i = 1, 2, ..., Nk, and Nk varies from kernel to kernel. We assume that the data for the problem begins in a "staging area" accessible to the computation units ("main memory" or "an I/O stream") and must be moved into local memory.

There are two major metrics of interest for each problem size. The first is the total time or latency, L1(k, di), to perform kernel k for a data set size di using a single chip. This measurement should include both computation time and the time to move the data for the problem from the staging area (off the chip) to a computation or operation area (on the chip).
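As a concrete illustration, the single-problem latency L1(k, di) can be captured with a simple timing harness. This is a sketch only: the move_data and compute callables are hypothetical stand-ins for the staging-area transfer and the kernel itself, and a real measurement would use instrumented hardware counters rather than a wall clock.

```python
import time

def measure_latency(move_data, compute, data):
    """Measure L1(k, di): the time for one problem, including the
    staging-area -> local-memory transfer, not just the kernel."""
    start = time.perf_counter()
    local = move_data(data)   # off-chip to on-chip data movement (stand-in)
    compute(local)            # the kernel computation itself (stand-in)
    return time.perf_counter() - start
```

The essential accounting point is that the data movement sits inside the timed region, matching the definition above.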

The second major metric of interest is the sustained achievable throughput, T(k, di). For each kernel k and problem size di, a measure of the workload, W(k, di), is defined in an operation-dependent and system-independent way. (For floating-point computation operations, W is the floating-point operation count, while for communication operations, W is the number of bytes transferred.) The sustained achievable throughput is

    T(k, di) = n W(k, di) / Ln(k, di),

where Ln(k, di) is the total time to solve n problems of the given type using the chip. As above, Ln(k, di) includes the time to move the data from the staging area to an operation area.

There are clear trade-offs between throughput and latency. If the entire chip is being used to solve kernel k for data set di, then Ln = nL1 and T = W/L1. In some cases, however, an operation will be able to take advantage of pipelining and perform multiple computations of the same type at the same time, resulting in higher throughput. Obviously, the extent to which this can be accomplished will depend on the input bandwidth of the chip. To measure the throughput for our purposes, it is sufficient to measure Ln for a value of n that is sufficiently large (at least n > 10, and preferably n > 100).
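Under these definitions, the sustained throughput is simply n·W divided by the measured Ln. A minimal sketch follows; the 1024-point FFT workload formula and the timing value are illustrative assumptions, not measurements:

```python
import math

def sustained_throughput(workload, latency_n, n):
    """T(k, di) = n * W(k, di) / Ln(k, di)."""
    return n * workload / latency_n

# Illustrative numbers: a 1024-point FFT with the conventional
# 5 * N * log2(N) flop count, solved n = 100 times back to back.
W = 5 * 1024 * math.log2(1024)   # workload in floating-point operations
n = 100
Ln = 0.002                       # measured total time in seconds (placeholder)
print(f"{sustained_throughput(W, Ln, n) / 1e9:.2f} GFLOP/s")  # prints 2.56 GFLOP/s
```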

For embedded systems we are interested in the efficiency of the operation, that is, the use of resources relative to the potential of the device. In general, the efficiency E(k, di) is defined as

    E(k, di) = T(k, di) / U(k),

where U(k) is the kernel-dependent upper bound or peak performance of the chip. The definition of U(k) is linked to the definition of the workload. When W is in floating-point operations, U(k) is the theoretical peak floating-point computation rate (based on the clock rate and the number of floating-point units). For a communication operation, where workload is defined in bytes, U(k) is the theoretical peak bandwidth between the communicating units. For benchmarks other than the signal processing and communication benchmarks, efficiency is difficult to calculate because peak performance for the corresponding workloads cannot easily be defined. For example, the workload for the database benchmark is in transactions, and there does not exist an easily calculable peak performance for the number of transactions performed. In these situations, efficiency cannot be calculated.
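For a floating-point workload, U(k) follows directly from the clock rate and the number of floating-point units, and efficiency is the ratio T/U. The clock rate, unit count, and sustained figure below are made-up example values:

```python
def peak_flops(clock_hz, fp_units, flops_per_unit_cycle=1):
    """U(k): theoretical peak floating-point rate of the chip."""
    return clock_hz * fp_units * flops_per_unit_cycle

def efficiency(sustained, peak):
    """E(k, di) = T(k, di) / U(k)."""
    return sustained / peak

U = peak_flops(500e6, 4)              # 500 MHz, 4 FP units -> 2 GFLOP/s peak
print(f"{efficiency(1.2e9, U):.0%}")  # prints 60%
```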

Another key metric is the stability of the performance. Kuck [1, p. 168ff] defines stability as the ratio of the minimum achieved performance to the maximum achieved performance over a set of data set sizes and programs. Stability is defined in two senses for the kernel benchmarks, a per-kernel sense and an overall sense. The per-kernel stability is reflected by a metric called data set stability, Sd, defined as the stability for a particular kernel over all data sets for that kernel,

    Sd(k) = min_i T(k, di) / max_i T(k, di).

Stability across all kernels poses a problem, as the workloads and thus the throughput calculations differ from kernel to kernel. However, a good indication of the overall stability can be gleaned from the geometric mean of the kernel stabilities,

    S = (Sd(1) Sd(2) · · · Sd(K))^(1/K),

where K is the number of kernels.
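Both stability figures reduce to a few lines of arithmetic. The throughput lists below are hypothetical; in practice they would be the measured T(k, di) values for each kernel:

```python
import math

def data_set_stability(throughputs):
    """Sd(k): minimum over data sets of T(k, di) divided by the maximum."""
    return min(throughputs) / max(throughputs)

def overall_stability(stabilities):
    """Geometric mean of the per-kernel stabilities Sd(k)."""
    return math.exp(sum(math.log(s) for s in stabilities) / len(stabilities))

Sd_fft  = data_set_stability([1.8e9, 2.0e9, 1.5e9])    # 0.75
Sd_comm = data_set_stability([0.9e9, 1.2e9])           # 0.75
print(round(overall_stability([Sd_fft, Sd_comm]), 3))  # prints 0.75
```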
Finally, for embedded systems, an important metric is the achieved performance per unit power consumed by the chip,

    C(k, di) = T(k, di) / P(k, di),

where P(k, di) is the overall power consumed during the operation. This normalized quantity C gives some indication of the "cost" of executing the benchmark on the given chip. Obviously, this metric ignores power consumed by other elements of the system, but allows comparison with other processors using the same metric. Performance metrics per unit size and weight are omitted, as the processing unit is perceived to be less of a driver for either of these quantities than for power consumed by the system.
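The corresponding calculation is a single division; the sustained rate and power draw below are placeholder values:

```python
def perf_per_watt(throughput, power_watts):
    """C(k, di) = T(k, di) / P(k, di)."""
    return throughput / power_watts

# e.g. 1.2 GFLOP/s sustained while the chip draws 15 W (placeholders)
print(f"{perf_per_watt(1.2e9, 15.0) / 1e6:.0f} MFLOP/s per watt")  # prints 80 MFLOP/s per watt
```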

In summary, developers are requested to measure the latency, throughput, and power consumed for each kernel benchmark k and data set di. The theoretical peak floating-point, operation, and communication rates should be reported for the chip. All other metrics (efficiency, stability over problem size, stability over all kernels, and performance per unit power) are derived from chip parameters and the measured quantities. Other statistics, such as the variance, may also be appropriate and can be calculated from these results. The desired quantities are summarized in Table 1.



1. David J. Kuck. High Performance Computing: Challenges for Future Systems. Oxford University Press, New York, NY, 1996.