Exercise 7

Posted Feb 3, 2025

2 min read

The effect of loop tiling on matrix transposition

In lectures, we saw a model for throughput of a matrix transpose operation. Here we’re going to look at the effect on throughput of loop tiling. You can find an implementation of naive transposition and one with one level of loop tiling below.

Compile the code

We’ll use the Intel compiler to build this code. So after logging in to Hamilton and downloading, load the relevant module

module load intel/2022.2

The untiled version of the code can be compiled with

icx -O1 -std=c99 -o transpose transpose.c

and the tiled version with

icx -O1 -std=c99 -o transpose-blocked transpose-blocked.c

Measure effective bandwidth as a function of matrix size

Exercise

For both the blocked and unblocked code, measure the memory bandwidth as a function of the number of rows and columns (using square matrices is fine) from around $N=100$ to $N=20000$. Try both with $N$ a power of two, and $N$ a multiple of ten.

Question

What do you observe comparing the blocked and unblocked performance?

Question

Do you notice anything different when using power of two sizes compared to multiples of ten?

The default blocking size is a $64\times64$ tile. You can override these sizes when compiling with

icx -O1 -std=c99 -o transpose-blocked transpose-blocked.c -DRSTRIDE=X
-DCSTRIDE=Y

by setting X and Y to appropriate numbers.

Question

Given that a Hamilton CPU has a 32kB level one cache size. What is a good tile size if you want to block for level one cache?

Question

Do you notice any performance changes if you change the tile size?

Measuring cache behaviour

The code is annotated with likwid markers, so we can run it with likwid-perfctr and measure the cache behaviour. To do this, load the likwid/5.2.0 module and recompile the two executables with

icx -O1 -std=c99 -DLIKWID_PERFMON -o transpose transpose.c -llikwid

and

icx -O1 -std=c99 -DLIKWID_PERFMON -o transpose-blocked transpose-blocked.c -llikwid

Exercise

For a $4096\times4096$ matrix, measure the main memory bandwidth and data volume for both the blocked and unblocked cases with likwid-perfctr -g MEM -C 0 -m <executable\> 4096 4096.

Question

What do you observe about the measured data volume (reported by likwid) compared to the effective data volume?

What about if you change to a $5000\times5000$ matrix?

Can you explain what you see?

Study, Master

DU PO

This post is licensed under CC BY 4.0 by the author.

The effect of loop tiling on matrix transposition

Compile the code

Measure effective bandwidth as a function of matrix size

Exercise

Question

Question

Question

Question

Measuring cache behaviour

Exercise

Question

Trending Tags