What is the fastest way to transpose a matrix in C

Question

I have a reasonably large matrix that I need to transpose.

Assume, for example, that my matrix is

a b c d e f
g h i j k l
m n o p q r

I want the result to be as follows:

a g m
b h n
c i o
d j p
e k q
f l r

What is the fastest way to do this?

Damon · Answer 1 · Jul 22, 2022

This is an excellent question.

There are several reasons why you might wish to transpose the matrix in memory rather than merely swap coordinates, such as matrix multiplication and Gaussian smearing.

First, let me list one of the transposition routines I employ.

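/* Transpose the N x M row-major matrix src into the M x N matrix dst.
   src and dst must not overlap. */
void transpose(const float *src, float *dst, const int N, const int M) {
    #pragma omp parallel for
    for(int n = 0; n<N*M; n++) {
        int i = n/N;   /* row in dst (column in src) */
        int j = n%N;   /* column in dst (row in src) */
        dst[n] = src[M*j + i];
    }
}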
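
As a quick check of the index arithmetic, the sketch below applies the same mapping to the 3 x 6 example from the question. The names transpose_serial and transpose_example are made up for illustration, and the OpenMP pragma is left out so the snippet builds without extra flags. The letters a..r are stored as their character codes.

```c
/* Same index mapping as the routine above, without OpenMP.
   src is N x M (row-major), dst is M x N (row-major). */
void transpose_serial(const float *src, float *dst, int N, int M) {
    for (int n = 0; n < N * M; n++) {
        int i = n / N;              /* row of dst, column of src */
        int j = n % N;              /* column of dst, row of src */
        dst[n] = src[M * j + i];
    }
}

/* Hypothetical helper: fills dst (18 floats) with the transpose of the
   3 x 6 matrix a..r from the question, stored as character codes. */
void transpose_example(float *dst) {
    float src[3 * 6];
    for (int n = 0; n < 3 * 6; n++) {
        src[n] = (float)('a' + n);  /* rows: a..f, g..l, m..r */
    }
    transpose_serial(src, dst, 3, 6);
}
```

After calling transpose_example(dst), dst[0..2] holds a, g, m (the first row of the result), dst[3..5] holds b, h, n, and so on.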

Now let's see why the transpose is useful. Consider the matrix multiplication C = A*B, where A is N x M, B is M x K and C is N x K, all stored row-major. We could do it this way.

for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*l+j];
        }
        C[K*i + j] = tmp;
    }
}

That way, however, is going to have a lot of cache misses, because the inner loop reads B with a stride of K elements. A much faster solution is to take the transpose of B first:

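/* Bt is a separate buffer of M*K floats holding the K x M transpose of B,
   so Bt[M*j+l] == B[K*l+j]. B itself is left unchanged, so there is no
   need to transpose it back afterwards. */
transpose(B, Bt, M, K);
for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*Bt[M*j+l];
        }
        C[K*i + j] = tmp;
    }
}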
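
For non-square matrices the indexing of the transposed operand is easy to get wrong, so here is a minimal, self-contained sketch with explicit dimensions: A is N x M, B is M x K, and the transposed copy Bt is K x M. The function names and the malloc'd scratch buffer are my own assumptions for illustration, not part of the routines above.

```c
#include <stdlib.h>

/* Reference version: C = A*B with A (N x M), B (M x K), C (N x K), row-major. */
void matmul_naive(const float *A, const float *B, float *C, int N, int M, int K) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < K; j++) {
            float tmp = 0;
            for (int l = 0; l < M; l++)
                tmp += A[M*i + l] * B[K*l + j];   /* B is read with stride K */
            C[K*i + j] = tmp;
        }
}

/* Same product, but B is first copied into Bt (K x M) so that the inner
   loop reads both operands contiguously. Returns 0 on success, -1 if the
   scratch buffer cannot be allocated. */
int matmul_transposed(const float *A, const float *B, float *C, int N, int M, int K) {
    float *Bt = malloc((size_t)M * K * sizeof *Bt);
    if (!Bt) return -1;
    for (int l = 0; l < M; l++)
        for (int j = 0; j < K; j++)
            Bt[M*j + l] = B[K*l + j];             /* Bt[j][l] = B[l][j] */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < K; j++) {
            float tmp = 0;
            for (int l = 0; l < M; l++)
                tmp += A[M*i + l] * Bt[M*j + l];  /* both reads are unit stride */
            C[K*i + j] = tmp;
        }
    free(Bt);
    return 0;
}
```

Both functions add the terms in the same order, so they should produce identical results. Whether the second one is actually faster depends on the matrix sizes and the machine, so it is worth timing both.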

Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n). Loop tiling is even more effective than taking the transpose for matrix multiplication, but it is considerably more complicated to implement.
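
To give an idea of what loop tiling looks like, here is a rough sketch of a tiled multiplication. It is only an illustration under my own assumptions: the function name matmul_tiled and the tile-size parameter bs are hypothetical, and a good value for bs depends on the cache sizes of the target machine.

```c
int min_int(int a, int b) { return a < b ? a : b; }

/* C = A*B with A (N x M), B (M x K), C (N x K), all row-major.
   The i, l and j loops are split into tiles of at most bs elements
   (bs must be positive) so that the parts of A, B and C being worked
   on can be reused while they are still in cache. */
void matmul_tiled(const float *A, const float *B, float *C,
                  int N, int M, int K, int bs) {
    for (int n = 0; n < N * K; n++) C[n] = 0;
    for (int ii = 0; ii < N; ii += bs)
        for (int ll = 0; ll < M; ll += bs)
            for (int jj = 0; jj < K; jj += bs)
                /* multiply one tile of A by one tile of B */
                for (int i = ii; i < min_int(ii + bs, N); i++)
                    for (int l = ll; l < min_int(ll + bs, M); l++) {
                        float a = A[M*i + l];
                        for (int j = jj; j < min_int(jj + bs, K); j++)
                            C[K*i + j] += a * B[K*l + j];
                    }
}
```

Note that the innermost loop walks B and C with unit stride, so this variant does not need a transposed copy of B. The partial sums are accumulated in a different order than in the simple triple loop, so floating-point results can differ slightly in the last bits.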