TDT4200 Mock Exam
True/False (10%)
The CUDA __syncthreads() operation synchronizes all active threads on the GPU
False. __syncthreads() only synchronizes the threads within a single thread block.
An MPI Allreduce call will use more total bandwidth than the corresponding MPI Reduce call
True. Allreduce distributes the result to every rank, and therefore uses more total bandwidth than Reduce, which delivers the result only to the root.
CUDA context switches can increase occupancy
True. Some threads must idle if they wait for a memory operation. By performing a context switch and running other threads, the hardware is better utilized.
Simultaneous Multithreading allows any pair of independent threads to run simultaneously
False. SMT only runs instructions from different threads in the same cycle when they need distinct functional units (ALUs), so not every pair of independent threads can actually execute simultaneously.
Any collective MPI operation can be replaced with a collection of point-to-point operations
True. Every collective operation can be expressed as a pattern of point-to-point sends and receives; the collectives are essentially convenience (and optimization) wrappers around such patterns.
The operational intensity of a program increases when it is run on a faster processor
False. Operational intensity is defined as I = W / Q: the number of operations W performed per byte of memory traffic Q.
A higher operational intensity indicates that a larger proportion of the algorithm's work involves computation rather than data movement. It is a property of the program and its memory traffic, so running on a faster processor does not change it.
Worksharing directives in OpenMP end with an implicit barrier by default
True. Worksharing directives such as for, sections and single end with an implicit barrier at the end of the associated structured block, unless nowait is specified.
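As an illustration, a minimal sketch (compile with -fopenmp) showing the implicit barrier after a for construct and the nowait clause that removes it:

#include <stdio.h>

int main(void) {
    int a[100], b[100];
    #pragma omp parallel
    {
        /* Implicit barrier at the end of this for construct:
           no thread starts the next loop until all iterations are done. */
        #pragma omp for
        for (int i = 0; i < 100; i++)
            a[i] = i;

        /* nowait removes the implicit barrier, so threads may proceed
           immediately; only safe here because b[] does not depend on a[]. */
        #pragma omp for nowait
        for (int i = 0; i < 100; i++)
            b[i] = 2 * i;
    }
    printf("%d %d\n", a[99], b[99]);
    return 0;
}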
A mutual exchange with standard mode Send and Recv operations cannot deadlock
False. If every process posts its Send before any posts the matching Recv, and the messages are too large for the system to buffer, the program deadlocks. Using MPI_Sendrecv instead lets the library schedule the two transfers so that the exchange cannot deadlock.
Strong scaling results suggest that an algorithm is suitable for use on higher processor counts than weak scaling results
False. Both may benefit from higher processor counts; the difference is that in a weak scaling study the problem size is also increased with the processor count.
Pthreads operations on a cond variable associate it with a mutex variable
True. Every pthread condition variable is used together with a mutex: pthread_cond_wait takes the mutex as an argument and atomically releases it while waiting.
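As an illustration, a minimal sketch of the standard pattern (link with -lpthread), where pthread_cond_wait takes both the condition variable and its mutex:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
int ready = 0;

void *producer(void *arg) {
    pthread_mutex_lock(&lock);
    ready = 1;                       /* update shared state under the mutex */
    pthread_cond_signal(&cond);      /* wake a waiter */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    pthread_mutex_lock(&lock);
    while (!ready)                         /* re-check: wakeups may be spurious */
        pthread_cond_wait(&cond, &lock);   /* atomically unlocks and waits */
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    printf("done\n");
    return 0;
}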
2 Message passing and MPI (15%)
2.1 In MPI terminology, what distinguishes the Ready and Synchronous communication modes from each other?
In ready mode (MPI_Rsend), the sender asserts that the matching receive has already been posted; if it has not, the send is erroneous. In synchronous mode (MPI_Ssend), the send does not complete until the receiver has started receiving the message, so its completion tells the sender that the matching receive exists.
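A sketch of how the two modes are typically used (hypothetical two-rank program, run with at least two ranks): ready mode needs a handshake guaranteeing that the receive is posted, while synchronous mode needs no such handshake but delays completion until the receive has started.

#include <mpi.h>
#include <stdio.h>

#define TAG_DATA  1
#define TAG_READY 2

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double buf[4] = {0};

    if (rank == 1) {
        MPI_Request req;
        /* Post the receive first, then tell the sender it is safe to Rsend. */
        MPI_Irecv(buf, 4, MPI_DOUBLE, 0, TAG_DATA, MPI_COMM_WORLD, &req);
        MPI_Send(NULL, 0, MPI_BYTE, 0, TAG_READY, MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("rank 1 got %f\n", buf[0]);
    } else if (rank == 0) {
        double data[4] = {1, 2, 3, 4};
        /* Wait for the notification that the matching receive exists ... */
        MPI_Recv(NULL, 0, MPI_BYTE, 1, TAG_READY, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... then a ready-mode send is legal. A synchronous send (MPI_Ssend)
           would need no handshake, but would not complete until the receive
           had started. */
        MPI_Rsend(data, 4, MPI_DOUBLE, 1, TAG_DATA, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}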
2.2 Using pseudo-code, write a barrier implementation for a message-passing program.
if (rank == 0) {
    // Wait until every other rank has arrived at the barrier
    for (int i = 1; i < number_of_ranks; i++) {
        MPI_Recv(source_rank = i);
    }
    // Release all ranks
    MPI_Broadcast(root = 0);
} else {
    // Signal arrival to rank 0, then wait for the release
    MPI_Send(dest_rank = 0);
    MPI_Broadcast(root = 0);
}
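A runnable C version of the same scheme, using only point-to-point calls (the release step is done with individual sends instead of a broadcast); this is a sketch, not the library's MPI_Barrier implementation:

#include <mpi.h>

/* Barrier built only from point-to-point operations:
   gather-to-root followed by a root-to-all release. */
void my_barrier(MPI_Comm comm) {
    int rank, size, token = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        /* Collect one message from every other rank ... */
        for (int i = 1; i < size; i++)
            MPI_Recv(&token, 1, MPI_INT, i, 0, comm, MPI_STATUS_IGNORE);
        /* ... then release them all. */
        for (int i = 1; i < size; i++)
            MPI_Send(&token, 1, MPI_INT, i, 1, comm);
    } else {
        MPI_Send(&token, 1, MPI_INT, 0, 0, comm);
        MPI_Recv(&token, 1, MPI_INT, 0, 1, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    my_barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}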
3 GPGPU programming (10%)
3.1 Briefly describe two differences between POSIX threads and CUDA threads
- POSIX threads run on the CPU, while CUDA threads run on the GPU.
- POSIX threads can be created and destroyed dynamically at run time, while the number of CUDA threads is fixed when a kernel is launched.
POSIX threads run with entirely separate instruction counters, and can directly make calls that access the operating system. CUDA threads run in SIMT mode where threads in a thread block share the same instruction, and can only directly access memory on the graphics device that executes them.
3.2 What differentiates shared and global memory spaces in CUDA programming?
Shared memory is a small, fast, on-chip memory that is private to a thread block and only lives as long as that block; it is typically used as a software-managed cache for data the block's threads cooperate on. Global memory is the large, off-chip device memory that is visible to all threads in all blocks (and to the host through cudaMemcpy), but has much higher access latency.
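As an illustration (hypothetical kernel, not part of the exam program), a block-wise sum that stages data in shared memory; it assumes blockDim.x == TILE with TILE a power of two:

#include <stdio.h>

#define TILE 256

// Each block stages a chunk of the global array in fast per-block shared
// memory, reduces it there, and writes one partial sum back to global memory.
__global__ void block_sum(const float *in, float *block_sums, int n) {
    __shared__ float buf[TILE];                       // on-chip, per-block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // index into global memory
    buf[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;    // global -> shared
    __syncthreads();

    // Tree reduction entirely within shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = buf[0];              // shared -> global
}

int main() {
    const int n = 1 << 20;
    float *in_d, *sums_d;
    cudaMalloc((void **) &in_d, n * sizeof(float));
    cudaMalloc((void **) &sums_d, (n / TILE) * sizeof(float));
    cudaMemset(in_d, 0, n * sizeof(float));           // all zeros, so each sum is 0

    block_sum<<<n / TILE, TILE>>>(in_d, sums_d, n);

    float first;
    cudaMemcpy(&first, sums_d, sizeof(float), cudaMemcpyDeviceToHost);
    printf("first block sum = %f\n", first);

    cudaFree(in_d);
    cudaFree(sums_d);
    return 0;
}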
4 Performance analysis (5%)
In a strong scaling study, what is theoretically the maximal speedup attainable by a program with 8% inherently sequential run time?
The maximal speedup in a strong scaling study is S_max = 1/f, where f is the inherently sequential fraction of the run time. With f = 0.08, S_max = 1/0.08 = 12.5.
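The bound comes from Amdahl's law; as a short derivation (LaTeX):

\[
S(p) = \frac{1}{f + \frac{1 - f}{p}}
\xrightarrow[p \to \infty]{}
\frac{1}{f} = \frac{1}{0.08} = 12.5
\]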
5 Threads (10%)
5.1 Give an example of a race condition
If multiple processes can access a shared resource at the same time, and at least one of them is an update, we have a race condition. (Because the result will depend on what process "wins the race").
Process 1 | Process 2
a = a + 1 | a = a + 1
This is a race condition because both processes may read the old value of a before either writes, such that the result is a + 1 instead of a + 2.
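A minimal pthreads sketch of this race (link with -lpthread), with two threads incrementing a shared counter as above; the final value is typically well below the expected 2000000:

#include <pthread.h>
#include <stdio.h>

#define N_ITERS 1000000
long a = 0;                       /* shared, unprotected */

void *worker(void *arg) {
    for (int i = 0; i < N_ITERS; i++)
        a = a + 1;                /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expected 2000000, but lost updates typically make it smaller. */
    printf("a = %ld (expected %d)\n", a, 2 * N_ITERS);
    return 0;
}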
5.2 Describe two ways to prevent race conditions using OpenMP
MPI
- MPI_Barrier
- MPI_Sendrecv (instead of MPI_Send followed by MPI_Recv)
Pthreads
- Mutexes
- Semaphores
- Condition variables
- Read-Write locks
OpenMP
- critical and atomic directives
- Locks (omp_init_lock, omp_set_lock, omp_unset_lock)
- reduction clauses
CUDA
- __syncthreads
- Cooperative groups
- atomicAdd and similar atomic functions
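A minimal OpenMP sketch (compile with -fopenmp) of two of these, protecting the increment from 5.1 with atomic and with critical:

#include <stdio.h>

int main(void) {
    long a = 0, b = 0;

    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        #pragma omp atomic          /* hardware-supported atomic update */
        a += 1;

        #pragma omp critical        /* mutual exclusion around the block */
        b += 1;
    }

    printf("a = %ld, b = %ld\n", a, b);   /* both are 1000000 */
    return 0;
}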
6 (15%)
The C program on the following page calculates whether each point in a 640×480 array lies inside a unit circle scaled to the array, and saves the array in a file where values are scaled to their point’s distance from the origin. The program partitions its domain along the vertical axis, and assumes that the number of ranks evenly divides the domain’s height. Modify the program so that it will also work with rank counts that are not divisors of the height.
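The program itself is not reproduced here, so the following is only a sketch of the usual remainder-distribution approach; the names height, rank and comm_size (and the demonstration values) are assumptions, not the exam program's variables:

#include <stdio.h>

/* Compute this rank's share of the rows when comm_size does not
   necessarily divide height: the first (height % comm_size) ranks
   get one extra row. */
void partition_rows(int height, int rank, int comm_size,
                    int *local_rows, int *row_offset) {
    int base      = height / comm_size;
    int remainder = height % comm_size;
    *local_rows = base + (rank < remainder ? 1 : 0);
    /* Ranks below us hold either (base + 1) or base rows each. */
    *row_offset = rank * base + (rank < remainder ? rank : remainder);
}

int main(void) {
    int rows, offset;
    for (int rank = 0; rank < 7; rank++) {   /* 7 does not divide 480 */
        partition_rows(480, rank, 7, &rows, &offset);
        printf("rank %d: rows %d..%d\n", rank, offset, offset + rows - 1);
    }
    return 0;
}

If the original program collects the result with MPI_Gather, it would correspondingly need MPI_Gatherv with per-rank counts and displacements, since the local row counts now differ between ranks.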
7 (15%)
The C program on the following page calculates whether each point in a 640×480 array lies inside a unit circle scaled to the array, and writes an image file where points inside are shaded according to their distance from the origin.
7.1 Create a version of the write_map function in the form of a CUDA kernel which can be called with thread blocks of size 4 × 4
__global__ void write_map ( float *map, int width, int height ) {
    // Row index i comes from the y dimension, column index j from the x
    // dimension, matching a grid of (width/4) x (height/4) blocks of 4x4 threads.
    int i = blockDim.y * blockIdx.y + threadIdx.y;
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    float y = (2*i - height) / (float)height;
    float x = (2*j - width)  / (float)width;
    float rad = sqrtf(y*y + x*x);
    if ( rad < 1.0f )
        MAP(i,j) = rad;     // MAP(i,j) is the row-major indexing macro from the original program
    else
        MAP(i,j) = 1.0f;
}
7.2 Create a version of the main() function which
- allocates the floating-point array in GPU device memory instead,
- calls your write_map kernel using 4 × 4 thread blocks, and
- copies the result into a host memory buffer, before passing it to save_image
#define BLOCK 4

int main() {
    int width = 640;
    int height = 480;

    // Host buffer for the result (heap-allocated: ~1.2 MB is too large for the stack)
    float *map_h = (float *) malloc(width * height * sizeof(float));

    // Device allocation
    float *map_d;
    cudaMalloc((void **) &map_d, width * height * sizeof(float));

    // 640 and 480 are both divisible by 4, so the grid covers the domain exactly
    dim3 gridBlock(width / BLOCK, height / BLOCK);
    dim3 threadBlock(BLOCK, BLOCK);
    write_map<<<gridBlock, threadBlock>>>(map_d, width, height);

    // cudaMemcpy waits for the kernel to finish before copying device -> host
    cudaMemcpy(map_h, map_d, width * height * sizeof(float), cudaMemcpyDeviceToHost);

    save_image(map_h);

    cudaFree(map_d);
    free(map_h);
    return 0;
}