TDT4200 - Parallel Computing

01 Introduction

The main focus is on programming models:

Hardware went parallel around the turn of the century, and continues to pile on the resources. Software hasn’t really kept up: making code run better with extra processors is still an “advanced topic”. Figuring out how to present parallel computing to the masses is very much a work in progress.

02 The von Neumann computer

The strength of Neumann

The von Neumann computer is a bridging model:

The weakness(es) of Neumann

The bridging model for parallel computing

We don’t have one. It has been a work in progress for decades.

We write programs in the von Neumann style, and try to detect the parts that can be done in parallel. We will discuss explicit ways to run code in parallel.

03 Caches and virtual memory

Temporal locality and spatial locality

Memory: capacity vs. speed vs. cost

Fully associative and set-associative caches

Virtual memory and paging

04 Instruction-level parallelism

Pipelining

Out-of-order execution

Bernstein’s conditions define that statements S1, S2 are independent (i.e. will produce the same result when run in any order) if

  I(S1) ∩ O(S2) = ∅
  O(S1) ∩ I(S2) = ∅
  O(S1) ∩ O(S2) = ∅

where I(S) is the set of memory locations S reads and O(S) is the set of locations S writes.

We have three types of dependencies:

- Flow (true) dependence, read-after-write: S2 reads a location that S1 wrote.
- Anti-dependence, write-after-read: S2 writes a location that S1 read.
- Output dependence, write-after-write: S1 and S2 write the same location.

Superscalar processors: With automatic (in)dependence detection in place, we can replicate the ALU parts of the Neumann machine and make the control path dispatch several instructions at once. This is fondly known as multiple issue in computer architecture.

Prefetching & branch prediction

Many loops create regular access patterns in memory. By default, that kind of loop will create a cache miss every time it crosses into a new cache line. Upcoming misses can be avoided by detecting the pattern and starting the memory transfer early.

When pre-loading instructions, if() statements, loop tails, etc. make it hard to decide which branch to pre-load. Wrong guesses require the pipeline to be flushed. It is possible to store branch statistics next to a table of the branch instructions’ addresses in the code, so that past behaviour guides the next guess.

Vectorization

Vector registers are extra-wide registers that can store several consecutive values at once. With the vector registers loaded, there are instructions that do the same thing to all of their elements in parallel. Packing data into wide registers like this, we can do e.g. 4 times the work (four 32-bit values in a 128-bit register) in one of the read-modify-write cycles of Neumann.

05 Flynn's taxonomy

Flynn's taxonomy

According to Flynn, there are two things to deal with: instructions and data, and we can have either a single or multiple of each at a time. This gives us four combinations: SISD (the classic von Neumann machine), SIMD (vector units), MISD (rarely built in practice), and MIMD (multiprocessors and clusters).

Shared memory

Threads & race conditions

Distributed memory

Multi-computer parallelism

Interconnects

06 Shared and distributed memory, interconnects

07 Speedup and efficiency

08 The advection equation

09 Concepts of MPI

Message Passing Interface

MPI is a standard that specifies a number of function calls and what they are supposed to do (not how they do it).

MPI runs parallel processes.

Two-sided communication

For every send statement, there must be a matching receive statement in the receiving process.

rank = who_am_i()
if (rank == 1) {
  numbers = [1, 2, 3, 4]
  send(numbers, 4, rank_2)
}
else if (rank == 2) {
  numbers = [0, 0, 0, 0]
  recv(numbers, 4, rank_1)
}

Here, the participating ranks are hardcoded. We should rather decide each process’s role from its own rank and the total number of processes.

Communication modes

10 Six-function MPI with point-to-point and collective communication

11 MPI: Communication modes

12 MPI: Scattering and gathering data

13 MPI: Derived data types

14 MPI: Parallel I/O

15 Communicators

16 Cartesian communicators & Parallel I/O

17 Pthreads introduction

18 Creating and removing pthreads

19 Pthread operations, synchronization

20 Introduction to OpenMP

21 Atomicity in OpenMP

22 OpenMP worksharing directives

23 OpenMP loop scheduling & threads vs. caches

24 Cache coherence

25 Cache optimizations: loop tiling & vector operations

26 OpenMP: Tasks

27 Roofline analysis

28 Application types

29 Loose ends

30 The GPU and U

31 Cooperation Organisation

32 Shuffle Shenanigans

33 Thread Lightly