logo

Corner Turn Benchmark

Many signal and image processing applications operate on multi-dimensional data in multiple stages, with operations focusing on a different dimension in each subsequent stage. If the host platform is a parallel processor, the data are usually distributed across the nodes to exploit data parallelism, so that each node can operate in parallel on its portion of the data as the algorithm transitions from one stage to the next. For efficiency reasons, it is desirable to perform a corner turn of the data. A corner turn operation is defined as a copy of the object with a change in the storage order of the underlying data. This may or may not imply transposition of the computation object, depending on the implementation. In this the following paragraphs, we describe this operation in more detail and describe in general terms an abstracted, high-level application that requires a corner-turn operation (presented in [1], some of which is repeated here for completeness).

An application that requires a corner turn works first on the rows of an input matrix, and then on the columns of the intermediate result matrix. Mathematically, one of the most basic examples of such an operation is a multiplication of three matrices,

Z = BH X A,
(1)
where B and A are application-dependent matrices, X is the matrix of input data, and Z is the matrix of output data. An example situation where this might occur would be a filtering operation followed by a beamforming operation.1 Suppose, as in the previous description [1], that we perform the operations of Equation (1) into two stages, the first producing Y = XA and the second producing Z = BHY. Then the first stage is an operation in which an entire row of X is desired, and the second is an operation in which an entire column of Y is desired. Thus, the two stages suggest different optimal data layouts. The idea behind the corner-turn operation is to preserve data locality in the dimension being operated on. Whether or not a mathematical transpose is performed is implementation-dependent. In multidimensional arrays in the C programming language, the last array index is continuous in memory. In order to perform a corner turn of a C language array, a transpose is required; that is, the order of some of the dimensions must be reversed (see Figure 1).
The C code to perform a corner turn of two-dimensional array A into two-dimensional array B is
// Notice dimensions of B are the reverse of those of A
int A[NX][NY], B[NY][NX];  
				
for (i = 0; i < NX; i++)
	for (j = 0; j < NY; j++)
		B[j][i] = A[i][j];
If NX and NY were defined at compile time to be 4 and 3, respectively, and equation then the memory area underlying A is
A = { 0 1 2 3 4 5 6 7 8 9 10 11 }.
Mathematically
equation and after execution of the above code the memory area underlying B is
B = { 0 3 6 9 1 4 7 10 2 5 8 11 }.
Figure 1: C corner turn example. In matrix B, the data are stored in corner-turned fashion compared to matrix A, and B is the transpose of A.

Now consider an implementation with the property that storage order is independent of the order of the indexes. In such an implementation, it would be possible to do a corner turn without requiring a transpose of the computation object. The vector, signal, and image processing library (VSIPL, [2]) is an example of a library where this is possible: the stride parameters of a VSIPL view allow the VSIPL copy operation to re-arrange the underlying data without changing the mathematical properties (see Figure 2).1 When using objects with this property, storage order may be considered to be a mapping issue, whereas when using standard C and C++ arrays, storage order is explicitly embedded in the application program.
The C code to perform a corner turn from VSIPL matrix view A into VSIPL matrix view B is
// Notice dimensions of B are the same as those of A: 
// storage order is different
vsip_mview_i *A = vsip_mcreate_i(NX,
                                 NY, 
                                 VSIP_ROW,
                                 VSIP_MEM_NONE);
vsip_mview_i *B = vsip_mcreate_i(NX,
                                 NY, 
                                 VSIP_COL,
                                 VSIP_MEM_NONE);
  
vsip_mcopy_i_i(A, B);
If NX and NY were defined at compile time to be 4 and 3, respectively, and equation then the block underlying A is
A = { 0 1 2 3 4 5 6 7 8 9 10 11 }.
After execution of the above code, B is mathematically the same as A, and the block area underlying B is
B = { 0 3 6 9 1 4 7 10 2 5 8 11 }.
Figure 2: VSIPL corner turn example. Matrix views A and B are mathematically the same even though the underlying data in B are stored in corner-turn fashion compared to matrix A.

The discussion above does not consider the distribution of data over processors. Distribution effectively adds a level of memory hierarchy to the performance of a corner turn: data must be copied to a new processor as well as re-arranged on the new processor. Frequently, an all-to-all communication operation, in which every processor communicates with every other processor, is required as part of a distributed corner turn.

Parameters for two corner turn sizes are given in Table 1. These corner turn sizes are based on current applications: they assume 32-bit data, either integers or single-precision floating-point data. The workload is twice the overall matrix size since the data is being copied and must therefore pass into and out of the processor.

Table 1: Corner turn parameters.
Parameter
names
Description Values
Set 1 Set 2
M Matrix rows 50 750
N Matrix columns 5000 5000
k Element size (bytes) 4 4
Matrix size (Mbyte) 1 15
W Workload (Mbyte) 2 30

For this particular kernel benchmark, the idea of timing throughput and latency based on a single processor is acceptable, but it would be preferable to get a sense of the throughput and latency possible for a multi-processor corner turn of the data. In the most frequent occurrences of a distributed corner turn in signal processing, the source and destination processor groups are either identical or are completely disjoint. The derived statistics of most interest for multi-processor transfers are the stability of the throughput over the number of processors used, and the efficiency of the transfer versus the theoretical peak bandwidth.

Footnotes

1A filtering operation can be represented as a matrix-matrix multiply, but would usually not be implemented in such a fashion. Thus, the use of matrix multiplication here is an additional level of abstraction about the application.
2It would also be possible to implement standard C/C++ data structures with this property: use of VSIPL here is purely a matter of convenience.

References

1. James M. Lebak. Polymorphous computing architectures (PCA) example application 4: Corner-turn. External report, MIT Lincoln Laboratory, Lexington, MA, October 2001.

2. David A. Schwartz, Randall R. Judd, William J. Harrod, and Dwight P. Manley. Vector, signal, and image processing library (VSIPL) 1.0 application programmer’s interface. Technical report, Georgia Tech Research Corporation, 2000. http://www.vsipl.org.