This is an excellent question.
There are several reasons why you might wish to transpose the matrix in memory rather than merely swap coordinates, such as matrix multiplication and Gaussian smearing.
First, let me list one of the transposition routines I employ.
void transpose(float *src, float *dst, const int N, const int M) {
#pragma omp parallel for
for(int n = 0; n<N*M; n++) {
int i = n/N;
int j = n%N;
dst[n] = src[M*j + i];
}
}
Now let's see why the transpose is useful. Consider matrix multiplication C = A*B. We could do it this way.
for(int i=0; i<N; i++) {
for(int j=0; j<K; j++) {
float tmp = 0;
for(int l=0; l<M; l++) {
tmp += A[M*i+l]*B[K*l+j];
}
C[K*i + j] = tmp;
}
}
That way, however, is going to have a lot of cache misses. A much faster solution is to take the transpose of B first
transpose(B);
for(int i=0; i<N; i++) {
for(int j=0; j<K; j++) {
float tmp = 0;
for(int l=0; l<M; l++) {
tmp += A[M*i+l]*B[K*j+l];
}
C[K*i + j] = tmp;
}
}
transpose(B);
Matrix multiplication is O(n3) and transpose is O(n2), taking the transpose should have no influence on calculation time (for large n). Loop tiling is more effective than obtaining the transpose in matrix multiplication, but it is considerably more difficult.