# Metrics

For each benchmark, a set of problem sizes is defined. Throughout this section, we refer to
a kernel by its index k, and to a particular data set for kernel k as d_{i}, where i =
1, 2, . . . , N_{k}; N_{k} varies from kernel to kernel. We assume that the data for the problem begins
in a "staging area" accessible to the computation units ("main memory" or "an I/O stream")
and must be moved into local memory.

There are two major metrics of interest for each problem size. The first is the total time or
latency, L_{1}(k, d_{i}), to perform kernel k on data set d_{i} using a single chip. This measurement should include both computation time and the
time to move the data for the problem from the staging area (off the chip) to a computation
or operation area (on the chip).

The second major metric of interest is the sustained achievable throughput, T(k, d_{i}). For each
kernel k and problem size d_{i}, a measure of the workload, W(k, d_{i}), is defined in an operation-dependent
and system-independent way. (For floating-point computation operations, W is the
floating-point operation count, while for communication operations, W is the number of bytes
transferred.) The sustained achievable throughput is

T(k, d_{i}) = n W(k, d_{i}) / L_{n}(k, d_{i}),

where L_{n}(k, d_{i}) is the total time to solve n problems of the given type using the chip. As
above, L_{n}(k, d_{i}) includes the time to move the data from the staging area to an operation area.
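As a concrete sketch, the throughput computation follows directly from this definition. The workload and timing numbers below are hypothetical illustrations, not measurements from any real device:

```python
# Sustained throughput T(k, d_i) = n * W(k, d_i) / L_n(k, d_i).
# All numbers below are hypothetical, for illustration only.

def throughput(n, workload, latency_n):
    """Sustained throughput over n back-to-back problems.

    workload  -- W(k, d_i): flops (or bytes) for ONE problem instance
    latency_n -- L_n(k, d_i): total time to solve n problems, seconds
    """
    return n * workload / latency_n

# Illustrative 1024-point FFT-style kernel: W ~ 5 * N * log2(N) flops.
W = 5 * 1024 * 10        # 51,200 flops per problem (assumed)
L_100 = 0.004            # assumed: 100 problems complete in 4 ms
T = throughput(100, W, L_100)
print(T)                 # sustained flop/s
```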

There are clear trade-offs between throughput and latency. If the entire chip is being used
to solve kernel k for data set d_{i}, then L_{n} = nL_{1} and T = W/L_{1}. In some cases, however, an
operation will be able to take advantage of pipelining and perform multiple computations of the
same type at the same time, resulting in higher throughput. The extent to which this can be
accomplished will depend on the input bandwidth of the chip. To measure the throughput for
our purposes, it is sufficient to measure L_{n} for a sufficiently large value of n (at least
n > 10, and preferably n > 100).
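The effect of pipelining on the measured rate can be illustrated numerically. In this sketch the single-problem latency and pipelined issue interval are assumed values, chosen only to show the shape of the trade-off:

```python
# Latency/throughput trade-off, with assumed (not measured) numbers.
# Unpipelined: L_n = n * L_1, so T = W / L_1.
# Pipelined: problems overlap, L_n ~ L_1 + (n - 1) * dt with dt < L_1,
# so for large n the sustained rate T approaches W / dt > W / L_1.

W = 1.0e6        # workload per problem, flops (assumed)
L_1 = 1.0e-3     # single-problem latency: 1 ms (assumed)
dt = 0.25e-3     # pipelined issue interval: 0.25 ms (assumed)

def latency_n(n, pipelined):
    return L_1 + (n - 1) * dt if pipelined else n * L_1

for n in (1, 10, 100):
    T = n * W / latency_n(n, pipelined=True)
    print(n, T)

# T climbs toward W/dt = 4e9 flop/s, four times the unpipelined
# W/L_1 = 1e9 -- which is why L_n should be measured at large n.
```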

For embedded systems we are also interested in the efficiency of an operation, that is, the use of resources relative to the potential of the device. In general, the efficiency E(k, d_{i}) is defined as

E(k, d_{i}) = T(k, d_{i}) / U(k),

where U(k) is the kernel-dependent upper bound or peak performance of the chip. The definition of U(k) is linked to the definition of the workload. When W is in floating-point operations, U(k) is the theoretical peak floating-point computation rate (based on the clock rate and the number of floating-point units). For a communication operation, where workload is defined in bytes, U(k) is the theoretical peak bandwidth between the communicating units.

For benchmarks other than the signal processing and communication benchmarks, efficiency is difficult to calculate because peak performance for the corresponding workloads cannot easily be defined. For example, the workload for the database benchmark is in transactions, and there does not exist an easily calculable peak performance for the number of transactions performed. In these situations, efficiency cannot be calculated.
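Under the natural reading of this definition, E is the sustained throughput divided by the peak rate. The chip parameters and measured throughput below are hypothetical:

```python
# Efficiency E(k, d_i) as sustained rate over peak rate U(k).
# Chip parameters and the measured throughput are hypothetical.

clock_hz = 500e6                 # assumed 500 MHz clock
n_fpus = 4                       # assumed 4 floating-point units
U = clock_hz * n_fpus            # peak rate: 2e9 flop/s

T_sustained = 1.28e9             # hypothetical measured flop/s
E = T_sustained / U
print(E)                         # fraction of peak achieved
```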

Another key metric is the stability of the performance. Kuck [1, p.168ff]
defines stability as the ratio of the minimum achieved performance to the maximum achieved
performance over a set of data set sizes and programs. Stability is defined in two senses for the
kernel benchmarks, a per-kernel sense and an overall sense. The per-kernel stability is reflected by
a metric called data set stability, S_{d}, defined as the stability for a particular kernel over all data sets
for that kernel,

S_{d}(k) = min_{i} T(k, d_{i}) / max_{i} T(k, d_{i}).

The overall stability, S, is defined analogously over all kernels and all data sets. A final metric is the performance per unit power,

C(k, d_{i}) = T(k, d_{i}) / P(k, d_{i}),

where P(k, d_{i}) is the overall power consumed during the operation. This normalized quantity C
gives some indication of the "cost" of executing the benchmark on the given chip. Obviously,
this metric ignores power consumed by other elements of the system, but allows comparison with
other processors using the same metric. Performance metrics per unit size and
weight are omitted, as the processing unit is perceived to be less of a driver for either of these
quantities than for power consumed by the system.
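Taken together, the derived per-kernel metrics can be computed from the measured throughputs and power draws. This sketch assumes the data set stability S_d is the min-to-max throughput ratio over a kernel's data sets (per Kuck's definition) and that the power-cost metric C is throughput per watt; all numbers are hypothetical:

```python
# Derived metrics for one kernel, from hypothetical measurements.

throughputs = [1.10e9, 1.28e9, 0.90e9, 1.20e9]  # T(k, d_i), flop/s
powers      = [10.0, 11.5, 9.0, 11.0]           # P(k, d_i), watts

# Data set stability: min/max of achieved throughput; 1.0 is perfectly stable.
S_d = min(throughputs) / max(throughputs)

# Performance per unit power for each data set (flops per joule).
C = [t / p for t, p in zip(throughputs, powers)]

print(S_d)
print(C)
```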

In summary, developers are requested to measure the latency, throughput, and power consumed
for each kernel benchmark k and data set d_{i}. The theoretical peak floating-point operation and
communication rates should be reported for the chip. All other metrics (efficiency, stability over
problem size, stability over all kernels, and performance per unit power) are derived from chip
parameters and the measured quantities. Other statistics such as variance may be appropriate also
and may be calculated from these results. The desired quantities are summarized in Table 1.