GPU Programming Lecture 01

Parallel Computer Architectures

What is parallelism?

What is concurrency?

Concurrency

Concurrency: Two processes A and B are executed concurrently iff the execution of B may start before the termination of A, and vice versa.

Can we have concurrency on a single-core machine?

Under what circumstances is concurrency possible?

  • Time multiplexing
  • Space multiplexing

Types of Concurrency

→ Time multiplexing

Interleaving: Two concurrent processes A and B are executed in an interleaved manner iff A and B are executed alternately on the same execution unit.

→ Space multiplexing

Parallelism: Two concurrent processes A and B are executed in parallel iff A and B are executed simultaneously on different execution units.


Classification of Parallel Computer Architectures


Flynn’s Taxonomy

  • Classifies architectures by the multiplicity of interacting data and instruction streams they provide:
    • SISD: Single Instruction stream, Single Data stream
    • SIMD: Single Instruction stream, Multiple Data streams
    • MISD: Multiple Instruction streams, Single Data stream
    • MIMD: Multiple Instruction streams, Multiple Data streams


GPU Architecture

Background Information on GPUs

  • 1970s Specialized graphics circuits in arcade video game hardware
  • 1981 First dedicated graphics chip, the NEC µPD7220
  • 1990s Dissemination of OpenGL for GPU programming
  • 2000s (Mis)use of graphics pipeline for matrix computations
  • 2006/07 Nvidia introduces G80, the first general-purpose GPU, along with CUDA
  • 2010s Development and advancement of low-level unified graphics/compute APIs
  • Today GPU computing is common in AI and HPC applications
  • Soon Ubiquitous heterogeneous compute, from consumer devices to HPC

GPU Architecture

  • Different GPU vendors (AMD, Intel, Nvidia) use different terminology, but the architectures are similar
  • Hierarchy of schedulers
  • Memory and cache hierarchy
  • Compute units consisting of processing elements
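
These parameters can be inspected at runtime through the CUDA runtime API. The following is a minimal sketch (not code from the lecture) that prints a few of them for device 0 using standard cudaDeviceProp fields:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // query device 0

    // Compute units ("streaming multiprocessors" in Nvidia terminology)
    printf("Compute units (SMs):     %d\n", prop.multiProcessorCount);
    // Width of the SIMT execution unit
    printf("Warp size:               %d\n", prop.warpSize);
    // Memory and cache hierarchy
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("L2 cache size:           %d bytes\n", prop.l2CacheSize);
    printf("Global memory:           %zu bytes\n", prop.totalGlobalMem);
    return 0;
}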


GPU Architecture: Nvidia GA100 (Ampere) Chip

[Figure: Nvidia GA100 (Ampere) chip]

GPU Architecture: Nvidia GA100 SM

[Figure: Nvidia GA100 streaming multiprocessor (SM)]

SIMT Execution Model

  • Single Instruction, Multiple Thread (Nvidia speak)
  • Warp-based execution:
    • Warp: group of threads that execute the same instruction on different data elements concurrently
    • Lanes in a warp diverge at conditional statements: lanes are masked off depending on the execution path they take → effectively a SIMD model, but hidden from the programmer by the programming model (a sketch of a divergent kernel follows below)
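
As an illustration of this masking (not code from the lecture; the kernel and its names are made up), here is a sketch of a kernel in which even and odd lanes of the same warp take different branches:

// Even and odd lanes of a warp take different paths. The warp executes the
// two paths one after the other, and in each path the lanes belonging to the
// other branch are masked off, so divergence costs extra cycles.
__global__ void divergent(const float *in, float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    if (tid % 2 == 0)
        out[tid] = in[tid] * 2.0f;   // path of the even lanes
    else
        out[tid] = in[tid] + 1.0f;   // path of the odd lanes
}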


→ The GPU is essentially the successor of associative array processors

CPU vs. GPU Architectures

CPU vs. GPU Architectures: Compute Density


AMD EPYC 7702 CPU CCD:

  • 8 Zen 2 cores (white) and 4 L3 cache slices
  • Compute logic (blue) per core, including ALU, FPU, and SIMD units

AMD MI100:

  • 8 arrays (white) of 16 compute units each
  • Compute logic (blue) per array, including ALUs, FPUs, and matrix cores


                                    CPU                 GPU
Chip                                AMD EPYC 7702       AMD MI100
Die size [mm²]                      592                 750
Cores                               64                  128
Base Clock [GHz]                    2.0                 1.0
Instructions per cycle              16 (2x AVX2 FMA)    64 (32x FMA)
FP64 Peak Performance [GFLOP/s]     2048                8192
Compute density [GFLOP/s/mm²]       3.46                10.9
  • Compute density of GPUs is (at least) a factor of 3 higher because:
    • CPUs are optimized for latency → Do one thing as fast as possible.
    • GPUs are optimized for throughput → Do as many things as possible at once.

Parallel Programming Models: CUDA

Architecture Model

[Figure: CUDA architecture model]

Manual Memory Movement — gedankenexperiment

Let’s consider the impact of data movement on a CPU and GPU:

Doing compute on a CPU:

  • We initialise N elements in memory, which takes time $t_A$.
  • Then we do some computation, which takes time $t_C N / p$, where $t_C$ is the compute time per element and $p$ is the number of parallel execution units.

Doing compute on a GPU:

  • We initialise N elements in memory, which takes time $t_A$.
  • Then we transfer to the GPU, which takes $N t_T$, where $t_T$ is the transfer time per element.
  • Then we compute on the GPU's $P$ execution units: $t_C N / P$.
  • Then we transfer back to main memory, which again takes $N t_T$.

Manual Memory Movement Cost

We would like: $t_A + t_C N / p > t_A + t_C N / P + 2 N t_T$

Which requires: $p^{-1} - P^{-1} > 2 t_T / t_C$

We immediately see:

  1. Increasing $t_C$ (doing more compute per element) improves the likelihood that GPU computing is worthwhile.
  2. Reducing transfers between CPU and GPU is a priority.
  3. If $P \gg p$, then only $p$ matters for the comparison.
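
As a quick numerical illustration of this condition (all parameter values below are made up), a small host-side check could look like:

#include <cstdio>

int main() {
    // Hypothetical parameters, in arbitrary time units:
    const double t_T = 1.0;     // transfer time per element (host <-> device)
    const double t_C = 200.0;   // compute time per element on one execution unit
    const double p   = 64.0;    // parallel execution units on the CPU
    const double P   = 8192.0;  // parallel execution units on the GPU

    // Offloading pays off iff (1/p - 1/P) > 2 t_T / t_C.
    const bool worthwhile = (1.0 / p - 1.0 / P) > 2.0 * t_T / t_C;
    printf("GPU offload worthwhile: %s\n", worthwhile ? "yes" : "no");
    return 0;
}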

Execution Model

[Figure: CUDA execution model]

CUDA Workflow

Workflow controlled by master process on host:

  1. Initialise context
  2. Allocate host and device memory
  3. Copy data from host to device
  4. Configure and execute kernels on device
  5. Copy data from device to host
  6. Free memory on device and host
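
A minimal sketch of these six steps (the kernel do_work and the buffer size are placeholders; error checking is shown only for the allocation) could look as follows:

#include <cuda_runtime.h>

// Placeholder kernel standing in for real work: doubles every element.
__global__ void do_work(double *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0 * data[i];
}

int main() {
    const int N = 1 << 20;
    // 1. The context is created lazily by the runtime on the first CUDA call.
    // 2. Allocate host and device memory.
    double *h_buf = new double[N];
    for (int i = 0; i < N; i++) h_buf[i] = i;             // fill host buffer
    double *d_buf;
    if (cudaMalloc((void**)&d_buf, N * sizeof(double)) != cudaSuccess) return 1;
    // 3. Copy data from host to device.
    cudaMemcpy(d_buf, h_buf, N * sizeof(double), cudaMemcpyHostToDevice);
    // 4. Configure and execute the kernel (the launch is asynchronous).
    do_work<<<N / 256, 256>>>(d_buf, N);
    cudaDeviceSynchronize();                               // wait for the kernel
    // 5. Copy data from device to host.
    cudaMemcpy(h_buf, d_buf, N * sizeof(double), cudaMemcpyDeviceToHost);
    // 6. Free memory on device and host.
    cudaFree(d_buf);
    delete[] h_buf;
    return 0;
}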

CUDA Kernels

In CUDA,

  1. Kernel launches are asynchronous with respect to the host, similar to nonblocking ("immediate") MPI operations.
  2. The kernel launch must include size information in the form of <<<grid, block>>>, where grid and block are of type dim3.
  3. Kernel qualifiers (e.g. __global__) determine where the kernel can be called and which returns are allowed (e.g. __global__ implies void).
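
For instance, a two-dimensional launch configuration might be set up as follows (the kernel, matrix size, and scaling factor are illustrative, not from the lecture):

#include <cuda_runtime.h>

// Scale an nx-by-ny matrix stored in row-major order; one thread per element.
__global__ void scale2d(float *m, float a, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny) m[y * nx + x] *= a;
}

int main() {
    const int nx = 1000, ny = 700;
    float *d_m;
    cudaMalloc((void**)&d_m, nx * ny * sizeof(float));   // data init/copy omitted

    // dim3 holds up to three dimensions; unspecified ones default to 1.
    dim3 block(16, 16);                                  // 256 threads per block
    dim3 grid((nx + block.x - 1) / block.x,              // round up so the grid
              (ny + block.y - 1) / block.y);             // covers the whole matrix

    scale2d<<<grid, block>>>(d_m, 2.0f, nx, ny);         // asynchronous launch
    cudaDeviceSynchronize();                             // host waits here
    cudaFree(d_m);
    return 0;
}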

DEMO

Hello World in CUDA

Example: Hello World

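The slide shows this example as an image; a minimal sketch of what such a program could look like (using device-side printf; the exact code on the slide may differ) is:

#include <cstdio>

// Each thread prints its own global index.
__global__ void hello() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello World from thread %d\n", tid);
}

int main() {
    hello<<<2, 4>>>();          // 2 blocks of 4 threads -> 8 greetings
    cudaDeviceSynchronize();    // wait so device output is flushed before exit
    return 0;
}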

DEMO

Vector multiplication in CUDA

Example: Scalar Multiplication

#include <iostream>

// Kernel: multiply each vector element by the scalar a.
// Defined before main so the launch below sees its declaration.
__global__ void scalar_mul(double *v, double a, const int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)                            // guard against out-of-range threads
        v[tid] = a * v[tid];
}

int main(int argc, char **argv) {
    const int N = 256;
    double *v = new double[N];
    for (int i = 0; i < N; i++) v[i] = i;   // initialise host data

    double *v_d;
    cudaMalloc((void**)&v_d, sizeof(double) * N);                    // device memory
    cudaMemcpy(v_d, v, sizeof(double) * N, cudaMemcpyHostToDevice);  // host -> device

    scalar_mul<<<1, 256>>>(v_d, 5, N);      // one block of 256 threads

    cudaMemcpy(v, v_d, sizeof(double) * N, cudaMemcpyDeviceToHost);  // device -> host
    for (int i = 0; i < N; i++) std::cout << v[i] << "\n";

    cudaFree(v_d);
    delete[] v;
    return 0;
}

This post is licensed under CC BY 4.0 by the author.
