Post

POGPU Exercise 4

Exercise 4

Roofline analysis of matrix–vector multiplication

The goal of this exercise is to perform a roofline analysis of matrix–vector multiplication. We will look at the effect that compiler flags have on the performance of the generated code. Next, we will check whether the performance we obtain depends on the problem size. Finally, we will investigate loop blocking.

Background

Below you can find the file “dmvm.c”, which contains a C implementation of matrix–vector multiplication, which computes

\[y = Ax\]

for a $n_{row}\times n_{col}$rectangular matrix of doubles. The code accepts two command-line arguments: 

  • the number of rows and
  • the number of columns. 

When run, the program generates the matrices, performs the computation and prints out information about the performance. For example, after compiling and running with “./dmvm 1000 2000” you might see output similar to 1019 1000 2000 6491.23. 

The four columns are:

  1. the number of iterations the test was run for;
  2. the number of rows the matrix had;
  3. the number of columns the matrix had; and
  4. the performance in MFlops/s.

Downloading and compiling the code

We’ll use the gcc compiler for this exercise, thus in addition to the likwid module you should load the module with the compiler.

Compilation without any optimisations

First, we will compile without any optimisations.

You should now be able to run the dmvm binary with ./dmvm 4000 4000 for a $4000\times 4000$ matrix.

Obtain the machine and code characteristics

To perform a roofline analysis, we need the machine and code characteristics.

Exercise

Measure the main memory streaming memory bandwidth using the triad benchmark of likwid-bench.

Exercise

Compute the peak floating point throughput of a compute node. The maximum frequency of a single core of a compute node of Hamilton is 3.35GHz, and in ideal conditions each core can issues two double-precision FMAs (fused multiply-add instructions) per cycle.

Exercise

Read the code, in particular the dmvm function and determine the best case arithmetic intensity by counting data accesses (using the perfect cache model) and floating point operations.

Produce a roofline plot of -O0 compiled code

Exercise

Using a fixed number of columns $n_{col}=10000$, measure the performance of your dmvm binary over a range of row sizes from 1000 to 100000. Plot the resulting data using a roofline plot.

Question

What do you think your next step in optimising the code should be?

Better compiler options

We’ll stop crippling the compiler. Try increasing the compiler optimisation level. What other optimisation options does the compiler provide? In the discussion note which options produced the fastest results.

Exercise

Add the results you get from these runs to your roofline plots. What do you see?

Question

Is performance for any of these options independent of the problem size? What do you think the bottleneck for this algorithm is?

Loop blocking for out-of-cache performance

You should have observed that when the number of rows gets too large, the performance of the code drops by almost a factor of two. When the number of rows is very large, some of the vector data (which is accessed more than once) no longer fits in cache.

Exercise

Use the pessimal cache model to obtain a worst case arithmetic intensity and add these new data points to your roofline plot.

A mechanism to solve this problem is to traverse the data in an order that is aware of the cache. For loop-based code, this is called loop blocking or tiling. Have a look at the following implementation: dmvm-blocked.c. Compile it, using the set of compiler options that worked best for the original code. Use -o dmvm-blocked to produce a new binary.

Exercise

As before, using a fixed number of columns $n_{col}=10000$, measure the performance of your dmvm-blocked binary over a range of row sizes from 1000 to 100000. You should observe that the performance is now largely insensitive to the number of rows. Plot the resulting data using a roofline plot.

Question

What do you think your next step (if any) in optimising the code should be?

  1. This code is taken from examples developed at RRZE
This post is licensed under CC BY 4.0 by the author.

Trending Tags