TDT4200 Mock Exam
True/False (10%)
The CUDA __syncthreads() operation synchronizes all active threads on the GPU
False. __syncthreads() only synchronizes the threads within a single thread block.
An MPI Allreduce call will use more total bandwidth than the corresponding MPI Reduce call
True. Allreduce distributes the result to every rank, and therefore uses more total bandwidth than Reduce, which delivers the result only to the root.
CUDA context switches can increase occupancy
True. Some threads must idle if they wait for a memory operation. By performing a context switch and running other threads, the hardware is better utilized.
Simultaneous Multithreading allows any pair of independent threads to run simultaneously
False. SMT only runs instructions from different threads in the same cycle when they need distinct functional units (ALUs), so not every pair of independent threads can actually execute simultaneously.
Any collective MPI operation can be replaced with a collection of point-to-point operations
True. Every collective operation can be expressed as a pattern of point-to-point sends and receives; the collectives are essentially convenience (and optimization) wrappers around such patterns.
The operational intensity of a program increases when it is run on a faster processor
False. Operational intensity is defined as I = W / Q: the number of operations W performed per byte of memory traffic Q.
A higher operational intensity indicates that a larger proportion of the algorithm's work involves computation rather than data movement. It is a property of the program and its memory traffic, so running on a faster processor does not change it.
Worksharing directives in OpenMP end with an implicit barrier by default
True. Worksharing directives such as for, sections and single end with an implicit barrier at the end of the associated structured block, unless nowait is specified.
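As an illustration, a minimal sketch (compile with -fopenmp) showing the implicit barrier after a for construct and the nowait clause that removes it:

#include <stdio.h>

int main(void) {
    int a[100], b[100];
    #pragma omp parallel
    {
        /* Implicit barrier at the end of this for construct:
           no thread starts the next loop until all iterations are done. */
        #pragma omp for
        for (int i = 0; i < 100; i++)
            a[i] = i;

        /* nowait removes the implicit barrier, so threads may proceed
           immediately; only safe here because b[] does not depend on a[]. */
        #pragma omp for nowait
        for (int i = 0; i < 100; i++)
            b[i] = 2 * i;
    }
    printf("%d %d\n", a[99], b[99]);
    return 0;
}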
A mutual exchange with standard mode Send and Recv operations cannot deadlock
False. If every process posts its Send before any posts the matching Recv, and the messages are too large for the system to buffer, the program deadlocks. Using MPI_Sendrecv instead lets the library schedule the two transfers so that the exchange cannot deadlock.
Strong scaling results suggest that an algorithm is suitable for use on higher processor counts than weak scaling results
False. Both may benefit from higher processor counts; the difference is that in a weak scaling study the problem size is also increased with the processor count.
Pthreads operations on a cond variable associate it with a mutex variable
True. Every pthread condition variable is used together with a mutex: pthread_cond_wait takes the mutex as an argument and atomically releases it while waiting.
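As an illustration, a minimal sketch of the standard pattern (link with -lpthread), where pthread_cond_wait takes both the condition variable and its mutex:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
int ready = 0;

void *producer(void *arg) {
    pthread_mutex_lock(&lock);
    ready = 1;                       /* update shared state under the mutex */
    pthread_cond_signal(&cond);      /* wake a waiter */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    pthread_mutex_lock(&lock);
    while (!ready)                         /* re-check: wakeups may be spurious */
        pthread_cond_wait(&cond, &lock);   /* atomically unlocks and waits */
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    printf("done\n");
    return 0;
}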
2 Message passing and MPI (15%)
2.1 In MPI terminology, what distinguishes the Ready and Synchronous communication modes from each other?
In ready mode (MPI_Rsend), the sender asserts that the matching receive has already been posted; if it has not, the send is erroneous. In synchronous mode (MPI_Ssend), the send does not complete until the receiver has started receiving the message, so its completion tells the sender that the matching receive exists.
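A sketch of how the two modes are typically used (hypothetical two-rank program, run with at least two ranks): ready mode needs a handshake guaranteeing that the receive is posted, while synchronous mode needs no such handshake but delays completion until the receive has started.

#include <mpi.h>
#include <stdio.h>

#define TAG_DATA  1
#define TAG_READY 2

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double buf[4] = {0};

    if (rank == 1) {
        MPI_Request req;
        /* Post the receive first, then tell the sender it is safe to Rsend. */
        MPI_Irecv(buf, 4, MPI_DOUBLE, 0, TAG_DATA, MPI_COMM_WORLD, &req);
        MPI_Send(NULL, 0, MPI_BYTE, 0, TAG_READY, MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("rank 1 got %f\n", buf[0]);
    } else if (rank == 0) {
        double data[4] = {1, 2, 3, 4};
        /* Wait for the notification that the matching receive exists ... */
        MPI_Recv(NULL, 0, MPI_BYTE, 1, TAG_READY, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... then a ready-mode send is legal. A synchronous send (MPI_Ssend)
           would need no handshake, but would not complete until the receive
           had started. */
        MPI_Rsend(data, 4, MPI_DOUBLE, 1, TAG_DATA, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}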
2.2 Using pseudo-code, write a barrier implementation for a message-passing program.
if (rank == 0) {
    // Wait until every other rank has arrived at the barrier
    for (int i = 1; i < number_of_ranks; i++) {
        MPI_Recv(source_rank = i);
    }
    // Release all ranks
    MPI_Broadcast(root = 0);
} else {
    // Signal arrival to rank 0, then wait for the release
    MPI_Send(dest_rank = 0);
    MPI_Broadcast(root = 0);
}
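A runnable C version of the same scheme, using only point-to-point calls (the release step is done with individual sends instead of a broadcast); this is a sketch, not the library's MPI_Barrier implementation:

#include <mpi.h>

/* Barrier built only from point-to-point operations:
   gather-to-root followed by a root-to-all release. */
void my_barrier(MPI_Comm comm) {
    int rank, size, token = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        /* Collect one message from every other rank ... */
        for (int i = 1; i < size; i++)
            MPI_Recv(&token, 1, MPI_INT, i, 0, comm, MPI_STATUS_IGNORE);
        /* ... then release them all. */
        for (int i = 1; i < size; i++)
            MPI_Send(&token, 1, MPI_INT, i, 1, comm);
    } else {
        MPI_Send(&token, 1, MPI_INT, 0, 0, comm);
        MPI_Recv(&token, 1, MPI_INT, 0, 1, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    my_barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}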
3 GPGPU programming (10%)
3.1 Briefly describe two differences between POSIX threads and CUDA threads
- POSIX threads run on the CPU, while CUDA threads run on the GPU.
- POSIX threads can be created and destroyed dynamically at run time, while the number of CUDA threads is fixed when a kernel is launched.
POSIX threads run with entirely separate instruction counters, and can directly make calls that access the operating system. CUDA threads run in SIMT mode where threads in a thread block share the same instruction, and can only directly access memory on the graphics device that executes them.
3.2 What differentiates shared and global memory spaces in CUDA programming?
Shared memory is a small, fast, on-chip memory that is private to a thread block and only lives as long as that block; it is typically used as a software-managed cache for data the block's threads cooperate on. Global memory is the large, off-chip device memory that is visible to all threads in all blocks (and to the host through cudaMemcpy), but has much higher access latency.
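As an illustration (hypothetical kernel, not part of the exam program), a block-wise sum that stages data in shared memory; it assumes blockDim.x == TILE with TILE a power of two:

#include <stdio.h>

#define TILE 256

// Each block stages a chunk of the global array in fast per-block shared
// memory, reduces it there, and writes one partial sum back to global memory.
__global__ void block_sum(const float *in, float *block_sums, int n) {
    __shared__ float buf[TILE];                       // on-chip, per-block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // index into global memory
    buf[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;    // global -> shared
    __syncthreads();

    // Tree reduction entirely within shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = buf[0];              // shared -> global
}

int main() {
    const int n = 1 << 20;
    float *in_d, *sums_d;
    cudaMalloc((void **) &in_d, n * sizeof(float));
    cudaMalloc((void **) &sums_d, (n / TILE) * sizeof(float));
    cudaMemset(in_d, 0, n * sizeof(float));           // all zeros, so each sum is 0

    block_sum<<<n / TILE, TILE>>>(in_d, sums_d, n);

    float first;
    cudaMemcpy(&first, sums_d, sizeof(float), cudaMemcpyDeviceToHost);
    printf("first block sum = %f\n", first);

    cudaFree(in_d);
    cudaFree(sums_d);
    return 0;
}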
4 Performance analysis (5%)
In a strong scaling study, what is theoretically the maximal speedup attainable by a program with 8% inherently sequential run time?
The maximal speedup in a strong scaling study is S_max = 1/f, where f is the inherently sequential fraction of the run time. With f = 0.08, S_max = 1/0.08 = 12.5.
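The bound comes from Amdahl's law; as a short derivation (LaTeX):

\[
S(p) = \frac{1}{f + \frac{1 - f}{p}}
\xrightarrow[p \to \infty]{}
\frac{1}{f} = \frac{1}{0.08} = 12.5
\]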
5 Threads (10%)
5.1 Give an example of a race condition
If multiple processes can access a shared resource at the same time, and at least one of them is an update, we have a race condition. (Because the result will depend on what process "wins the race").
Process 1 | Process 2
a = a + 1 | a = a + 1
This is a race condition because both processes may read the old value of a before either writes, such that the result is a + 1 instead of a + 2.
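A minimal pthreads sketch of this race (link with -lpthread), with two threads incrementing a shared counter as above; the final value is typically well below the expected 2000000:

#include <pthread.h>
#include <stdio.h>

#define N_ITERS 1000000
long a = 0;                       /* shared, unprotected */

void *worker(void *arg) {
    for (int i = 0; i < N_ITERS; i++)
        a = a + 1;                /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expected 2000000, but lost updates typically make it smaller. */
    printf("a = %ld (expected %d)\n", a, 2 * N_ITERS);
    return 0;
}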
5.2 Describe two ways to prevent race conditions using OpenMP
MPI
- MPI_Barrier
- MPI_Sendrecv (instead of MPI_Send followed by MPI_Recv)
Pthreads
- Mutexes
- Semaphores
- Condition variables
- Read-Write locks
OpenMP
- critical and atomic directives
- Locks (omp_init_lock, omp_set_lock, omp_unset_lock)
- reduction clauses
CUDA
- __syncthreads
- Cooperative groups
- atomicAdd and similar atomic functions
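A minimal OpenMP sketch (compile with -fopenmp) of two of these, protecting the increment from 5.1 with atomic and with critical:

#include <stdio.h>

int main(void) {
    long a = 0, b = 0;

    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        #pragma omp atomic          /* hardware-supported atomic update */
        a += 1;

        #pragma omp critical        /* mutual exclusion around the block */
        b += 1;
    }

    printf("a = %ld, b = %ld\n", a, b);   /* both are 1000000 */
    return 0;
}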
6 (15%)
The C program on the following page calculates whether each point in a 640×480 array lies inside a unit circle scaled to the array, and saves the array in a file where values are scaled to their point’s distance from the origin. The program partitions its domain along the vertical axis, and assumes that the number of ranks evenly divides the domain’s height. Modify the program so that it will also work with rank counts that are not divisors of the height.
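The program itself is not reproduced here, so the following is only a sketch of the usual remainder-distribution approach; the names height, rank and comm_size (and the demonstration values) are assumptions, not the exam program's variables:

#include <stdio.h>

/* Compute this rank's share of the rows when comm_size does not
   necessarily divide height: the first (height % comm_size) ranks
   get one extra row. */
void partition_rows(int height, int rank, int comm_size,
                    int *local_rows, int *row_offset) {
    int base      = height / comm_size;
    int remainder = height % comm_size;
    *local_rows = base + (rank < remainder ? 1 : 0);
    /* Ranks below us hold either (base + 1) or base rows each. */
    *row_offset = rank * base + (rank < remainder ? rank : remainder);
}

int main(void) {
    int rows, offset;
    for (int rank = 0; rank < 7; rank++) {   /* 7 does not divide 480 */
        partition_rows(480, rank, 7, &rows, &offset);
        printf("rank %d: rows %d..%d\n", rank, offset, offset + rows - 1);
    }
    return 0;
}

If the original program collects the result with MPI_Gather, it would correspondingly need MPI_Gatherv with per-rank counts and displacements, since the local row counts now differ between ranks.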
7 (15%)
The C program on the following page calculates whether each point in a 640×480 array lies inside a unit circle scaled to the array, and writes an image file where points inside are shaded according to their distance from the origin.
7.1 Create a version of the write_map function in the form of a CUDA kernel which can be called with thread blocks of size 4 × 4
__global__ void write_map ( float *map, int width, int height ) {
    // Row index i comes from the y dimension, column index j from the x
    // dimension, matching a grid of (width/4) x (height/4) blocks of 4x4 threads.
    int i = blockDim.y * blockIdx.y + threadIdx.y;
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    float y = (2*i - height) / (float)height;
    float x = (2*j - width)  / (float)width;
    float rad = sqrtf(y*y + x*x);
    if ( rad < 1.0f )
        MAP(i,j) = rad;     // MAP(i,j) is the row-major indexing macro from the original program
    else
        MAP(i,j) = 1.0f;
}
7.2 Create a version of the main() function which
- allocates the floating-point array in GPU device memory instead,
- calls your write_map kernel using 4 × 4 thread blocks, and
- copies the result into a host memory buffer, before passing it to save_image
#define BLOCK 4

int main() {
    int width = 640;
    int height = 480;

    // Host buffer for the result (heap-allocated: ~1.2 MB is too large for the stack)
    float *map_h = (float *) malloc(width * height * sizeof(float));

    // Device allocation
    float *map_d;
    cudaMalloc((void **) &map_d, width * height * sizeof(float));

    // 640 and 480 are both divisible by 4, so the grid covers the domain exactly
    dim3 gridBlock(width / BLOCK, height / BLOCK);
    dim3 threadBlock(BLOCK, BLOCK);
    write_map<<<gridBlock, threadBlock>>>(map_d, width, height);

    // cudaMemcpy waits for the kernel to finish before copying device -> host
    cudaMemcpy(map_h, map_d, width * height * sizeof(float), cudaMemcpyDeviceToHost);

    save_image(map_h);

    cudaFree(map_d);
    free(map_h);
    return 0;
}