# Corner Turn Benchmark

Many signal and image processing applications operate on multi-dimensional data
in multiple stages, with operations focusing on a different dimension in each
subsequent stage. If the host platform is a parallel processor, the data are
usually distributed across the nodes to exploit data parallelism, so that each
node can operate in parallel on its portion of the data as the algorithm
transitions from one stage to the next. For efficiency reasons, it is desirable
to perform a *corner turn* of the data. A corner turn operation is defined
as a copy of the object with a change in the storage order of the underlying
data. This may or may not imply transposition of the computation object,
depending on the implementation. In this the following paragraphs, we describe this operation in
more detail and describe in general terms an abstracted, high-level application
that requires a corner-turn operation (presented in [1], some of which is repeated here for
completeness).

An application that requires a corner turn works first on the *rows* of
an input matrix, and then on the *columns* of the intermediate result
matrix. Mathematically, one of the most basic examples of such an operation is
a multiplication of three matrices,

^{H}X A,

^{1}Suppose, as in the previous description [1], that we perform the operations of Equation (1) into two stages, the first producing Y = XA and the second producing Z = B

^{H}Y. Then the first stage is an operation in which an entire row of X is desired, and the second is an operation in which an entire column of Y is desired. Thus, the two stages suggest different optimal data layouts. The idea behind the corner-turn operation is to preserve data locality in the dimension being operated on. Whether or not a mathematical transpose is performed is implementation-dependent. In multidimensional arrays in the C programming language, the last array index is continuous in memory. In order to perform a corner turn of a C language array, a transpose is required; that is, the order of some of the dimensions must be reversed (see Figure 1).

`A`into two-dimensional array

`B`is

// Notice dimensions of B are the reverse of those of A int A[NX][NY], B[NY][NX]; for (i = 0; i < NX; i++) for (j = 0; j < NY; j++) B[j][i] = A[i][j];If

`NX`and

`NY`were defined at compile time to be 4 and 3, respectively, and then the memory area underlying

`A`is

A = { 0 1 2 3 4 5 6 7 8 9 10 11 }.Mathematically

and after execution of the above code the memory area underlying

`B`is

B = { 0 3 6 9 1 4 7 10 2 5 8 11 }.Figure 1: C corner turn example. In matrix B, the data are stored in corner-turned fashion compared to matrix A, and B is the transpose of A.

Now consider an implementation with the property that storage order is independent of the order of the indexes. In such an implementation, it would be possible to do a corner turn without requiring a transpose of the computation object. The vector, signal, and image processing library (VSIPL, [2]) is an example of a library where this is possible: the stride parameters of a VSIPL view allow the VSIPL copy operation to re-arrange the underlying data without changing the mathematical properties (see Figure 2).

^{1}When using objects with this property, storage order may be considered to be a mapping issue, whereas when using standard C and C++ arrays, storage order is explicitly embedded in the application program.

`A`into VSIPL matrix view

`B`is

// Notice dimensions of B are the same as those of A: // storage order is different vsip_mview_i *A = vsip_mcreate_i(NX, NY, VSIP_ROW, VSIP_MEM_NONE); vsip_mview_i *B = vsip_mcreate_i(NX, NY, VSIP_COL, VSIP_MEM_NONE); vsip_mcopy_i_i(A, B);If

`NX`and

`NY`were defined at compile time to be 4 and 3, respectively, and then the block underlying

`A`is

A = { 0 1 2 3 4 5 6 7 8 9 10 11 }.After execution of the above code, B is mathematically the same as A, and the block area underlying

`B`is

B = { 0 3 6 9 1 4 7 10 2 5 8 11 }.Figure 2: VSIPL corner turn example. Matrix views A and B are mathematically the same even though the underlying data in B are stored in corner-turn fashion compared to matrix A.

The discussion above does not consider the distribution of data over processors. Distribution effectively adds a level of memory hierarchy to the performance of a corner turn: data must be copied to a new processor as well as re-arranged on the new processor. Frequently, an all-to-all communication operation, in which every processor communicates with every other processor, is required as part of a distributed corner turn.

Parameters for two corner turn sizes are given in Table 1. These corner turn sizes are based on current applications: they assume 32-bit data, either integers or single-precision floating-point data. The workload is twice the overall matrix size since the data is being copied and must therefore pass into and out of the processor.

Parameter names |
Description | Values | |

Set 1 | Set 2 | ||

M | Matrix rows | 50 | 750 |

N | Matrix columns | 5000 | 5000 |

k | Element size (bytes) | 4 | 4 |

Matrix size (Mbyte) | 1 | 15 | |

W | Workload (Mbyte) | 2 | 30 |

For this particular kernel benchmark, the idea of timing throughput and latency based on a single processor is acceptable, but it would be preferable to get a sense of the throughput and latency possible for a multi-processor corner turn of the data. In the most frequent occurrences of a distributed corner turn in signal processing, the source and destination processor groups are either identical or are completely disjoint. The derived statistics of most interest for multi-processor transfers are the stability of the throughput over the number of processors used, and the efficiency of the transfer versus the theoretical peak bandwidth.

## Footnotes

^{1}A filtering operation can be represented as a matrix-matrix multiply, but would usually not be implemented in such a fashion. Thus, the use of matrix multiplication here is an additional level of abstraction about the application.

^{2}It would also be possible to implement standard C/C++ data structures with this property: use of VSIPL here is purely a matter of convenience.

## References

1. James M. Lebak. Polymorphous computing architectures (PCA) example application 4: Corner-turn. External report, MIT Lincoln Laboratory, Lexington, MA, October 2001.2. David A. Schwartz, Randall R. Judd, William J. Harrod, and Dwight P. Manley. Vector, signal, and image processing library (VSIPL) 1.0 application programmerâ€™s interface. Technical report, Georgia Tech Research Corporation, 2000. http://www.vsipl.org.