Post

Exercise 7

The effect of loop tiling on matrix transposition

In lectures, we saw a model for throughput of a matrix transpose operation. Here we’re going to look at the effect on throughput of loop tiling. You can find an implementation of naive transposition and one with one level of loop tiling below.

Compile the code

We’ll use the Intel compiler to build this code. So after logging in to Hamilton and downloading, load the relevant module

1
module load intel/2022.2

The untiled version of the code can be compiled with

1
icx -O1 -std=c99 -o transpose transpose.c

and the tiled version with

1
icx -O1 -std=c99 -o transpose-blocked transpose-blocked.c

Measure effective bandwidth as a function of matrix size

Exercise

 For both the blocked and unblocked code, measure the memory bandwidth as a function of the number of rows and columns (using square matrices is fine) from around $N=100$ to $N=20000$. Try both with $N$ a power of two, and $N$ a multiple of ten. 

Question

 What do you observe comparing the blocked and unblocked performance? 

Question

 Do you notice anything different when using power of two sizes compared to multiples of ten? 

The default blocking size is a $64\times64$ tile. You can override these sizes when compiling with

1
2
icx -O1 -std=c99 -o transpose-blocked transpose-blocked.c -DRSTRIDE=X
-DCSTRIDE=Y

by setting X and Y to appropriate numbers.

Question

 Given that a Hamilton CPU has a 32kB level one cache size. What is a good tile size if you want to block for level one cache? 

Question

 Do you notice any performance changes if you change the tile size? 

Measuring cache behaviour

The code is annotated with likwid markers, so we can run it with likwid-perfctr and measure the cache behaviour. To do this, load the likwid/5.2.0 module and recompile the two executables with

1
icx -O1 -std=c99 -DLIKWID_PERFMON -o transpose transpose.c -llikwid

and

1
icx -O1 -std=c99 -DLIKWID_PERFMON -o transpose-blocked transpose-blocked.c -llikwid

Exercise

 For a $4096\times4096$ matrix, measure the main memory bandwidth and data volume for both the blocked and unblocked cases with likwid-perfctr -g MEM -C 0 -m <executable\> 4096 4096

Question

What do you observe about the measured data volume (reported by likwid) compared to the effective data volume?

What about if you change to a $5000\times5000$ matrix?

Can you explain what you see?

This post is licensed under CC BY 4.0 by the author.

Trending Tags