COSC 6374 Parallel Computation
Project I, Team 07
Neelima Paramatmuni, Padmaja Pentela, Shyamali Balasubramaniyan, Shravanthi Denthumdas, Samira Hemrajani
Problem Description: The objective of our project is to parallelize a sequential code so that it can be executed on N processes. The original source code performs image analysis on an image that is treated as a matrix of characters (NOTE: one character is one pixel of a black-and-white image). The most computation-intensive task in the code is the FFT operations; these were parallelized using the MPI-based FFTW routines. The major challenges of this task were:
• Understanding the sequential code and investigating parallelization approaches.
• Understanding the data-distribution requirements of the FFTW routines.
• Synchronizing tasks to enforce ordering if and when required.
We focused mainly on making our solution scalable, because we are required to test the parallelized code on larger images that cannot be handled on a single processor.

Parallelization Strategy: The following steps describe the stages of our task-parallelization strategy.

Setting up the 'parallel environment' and use of the MPI FFTW routines: First, we initialize the parallel environment using MPI_Init(). Next, we create plans for the forward and inverse transforms, of type fftwnd_mpi_plan, using the function:

fftwnd_mpi_plan fftw2d_mpi_create_plan(MPI_Comm comm, int nx, int ny, fftw_direction dir, int flags);

Distributing the problem/data: To obtain a load-balanced data distribution for the MPI FFTW routines, we use the following function:

fftwnd_mpi_local_sizes(fftplan, &local_height, &local_start, &local_ny_after_transpose, &local_y_start_after_transpose, &total_local_size);

This divides the image horizontally so that each process gets x rows and y columns, where x is a share of the image's first dimension and y always equals the image's second dimension (see Figure 1; a code sketch of this setup follows the figure).
Figure 1: Slab decomposition for data distribution. The image is divided along the X dimension; the Y dimension is always kept whole.
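The setup described above can be illustrated with the following minimal sketch. It is not the project's exact code: the padded dimensions PAD_NX and PAD_NY are hypothetical placeholders, and the FFTW 2.x MPI interface (header fftw_mpi.h), which provides fftw2d_mpi_create_plan and fftwnd_mpi_local_sizes, is assumed.

#include <mpi.h>
#include <fftw_mpi.h>

#define PAD_NX 1024   /* hypothetical padded image height */
#define PAD_NY 1024   /* hypothetical padded image width  */

int main(int argc, char **argv)
{
    fftwnd_mpi_plan fftplan, ifftplan;
    int local_height, local_start;
    int local_ny_after_transpose, local_y_start_after_transpose;
    int total_local_size;

    MPI_Init(&argc, &argv);

    /* Forward and inverse 2-D plans over MPI_COMM_WORLD. */
    fftplan  = fftw2d_mpi_create_plan(MPI_COMM_WORLD, PAD_NX, PAD_NY,
                                      FFTW_FORWARD,  FFTW_ESTIMATE);
    ifftplan = fftw2d_mpi_create_plan(MPI_COMM_WORLD, PAD_NX, PAD_NY,
                                      FFTW_BACKWARD, FFTW_ESTIMATE);

    /* Ask FFTW which slab of rows this process owns:
       local_height rows starting at global row local_start. */
    fftwnd_mpi_local_sizes(fftplan, &local_height, &local_start,
                           &local_ny_after_transpose,
                           &local_y_start_after_transpose,
                           &total_local_size);

    /* ... read this slab of the image, pad it, transform, convolve,
       depad and write, as described in the following sections ... */

    fftwnd_mpi_destroy_plan(ifftplan);
    fftwnd_mpi_destroy_plan(fftplan);
    MPI_Finalize();
    return 0;
}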
Parallelizing Reading the Image: In accordance with our task-parallelization strategy, every process carries out its task concurrently, and the image read operation is parallelized in the same fashion. The image is read based on local_start and local_height, the parameters returned by the data-layout function above, so each process reads only its own subset of rows.

Padding the image: Padding of the image and the filter is a major part of the image-analysis task. We need to introduce a 'border' around the original image due to the symmetry requirements of the FFT operation. This is achieved by padding the image with zeros on the right and the bottom.
Figure 2: Image padding (Cases 1, 2 and 3).

Since each process gets a subset of the rows of the data, the image padding during parallelization has to be carried out in one of the following three ways (refer to Figure 2):
Case 1: A process may have rows that require padding only on the right side.
Case 2: A process may have rows that require padding on the bottom as well as on the right-hand side.
Case 3: A process may hold a chunk of data that is only padding.
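All three cases can be covered by zero-filling the local slab and then copying in whichever image rows the process read. The sketch below is illustrative only, not the project's exact code: pixels, img_nx, img_ny and pad_ny are hypothetical names for the locally read pixel rows, the original image dimensions and the padded width.

/* Hypothetical sketch: zero the local slab of the padded image, then
   copy in the image rows (if any) that this process read. */
static void pad_local_slab(fftw_complex *padimg, const unsigned char *pixels,
                           int local_height, int local_start,
                           int img_nx, int img_ny, int pad_ny)
{
    int i, j;

    for (i = 0; i < local_height * pad_ny; i++)   /* start from all zeros */
        padimg[i].re = padimg[i].im = 0.0;

    for (i = 0; i < local_height; i++) {
        int grow = local_start + i;               /* global row index */
        if (grow >= img_nx)                       /* Case 3: rows of pure padding */
            break;
        for (j = 0; j < img_ny; j++)              /* Cases 1 and 2: copy pixels;
                                                     the rest of the row stays zero */
            padimg[i * pad_ny + j].re = (double) pixels[i * img_ny + j];
    }
}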
Padding the filter: The filter used for the image-analysis task needs to be blown up to the size (same dimensions) of the padded image. The filter is padded in the same manner as the image, i.e. by appending zeros to the right and the bottom.

FFT of image and filter: The actual computation of the transform is performed by the function fftwnd_mpi. We replaced the fftwnd_one() calls of the original sequential version with the corresponding parallel versions:

fftwnd_mpi(fftplan, 1, padimg, NULL, FFTW_NORMAL_ORDER);
fftwnd_mpi(fftplan, 1, padfilter, NULL, FFTW_NORMAL_ORDER);

Convolution: The FFTs of the padded image and filter are performed first; the results are then multiplied pointwise to obtain the convolved output. Since this task is now divided over multiple processes, we modified the convolution loop to range from 0 to the number of rows on each process (the image height on each process).

Inverse FFT: This is the inverse of the FFT operation. For parallelization we replace fftwnd_one() with fftwnd_mpi(ifftplan, 1, convolved, NULL, FFTW_NORMAL_ORDER).
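These three steps (forward FFTs, pointwise multiplication over the local rows, inverse FFT) are sketched below. This is a minimal sketch rather than the project's exact code; pad_ny is the padded width, and the arrays are the local slabs laid out by the data distribution.

/* Hypothetical sketch: convolve the local slabs of the padded image and
   filter in the frequency domain. */
static void convolve_local(fftwnd_mpi_plan fftplan, fftwnd_mpi_plan ifftplan,
                           fftw_complex *padimg, fftw_complex *padfilter,
                           fftw_complex *convolved,
                           int local_height, int pad_ny)
{
    int i, j, k;

    fftwnd_mpi(fftplan, 1, padimg,    NULL, FFTW_NORMAL_ORDER);
    fftwnd_mpi(fftplan, 1, padfilter, NULL, FFTW_NORMAL_ORDER);

    /* Pointwise complex multiplication over this process's rows only. */
    for (i = 0; i < local_height; i++) {
        for (j = 0; j < pad_ny; j++) {
            k = i * pad_ny + j;
            convolved[k].re = padimg[k].re * padfilter[k].re
                            - padimg[k].im * padfilter[k].im;
            convolved[k].im = padimg[k].re * padfilter[k].im
                            + padimg[k].im * padfilter[k].re;
        }
    }

    /* Inverse transform of the product; FFTW's backward transform is
       unnormalized, so a 1/(nx*ny) scaling is still needed afterwards. */
    fftwnd_mpi(ifftplan, 1, convolved, NULL, FFTW_NORMAL_ORDER);
}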
De-padding: The de-padding operation is carried out on the output of the inverse FFT. It separates the image from the padding that we had introduced for the FFT operation. The output of the inverse FFT that needs to be de-padded is shown in Figure 3; in our parallel environment the de-padding is defined by three main scenarios.

Figure 3: De-padding operation (Cases 1, 2 and 3).

Case 1: A process may hold only padding and no useful image information.
Case 2: A process may contain image information along with padding on the top and to the left; in this case we have to extract the image information accordingly.
Case 3: A process may contain image information along with padding to the left.
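The three cases can be handled with a single loop over the local rows, sketched below. This is not the project's exact code: it assumes the useful region of the inverse-FFT output starts pad_top rows down and pad_left columns in (hypothetical offsets suggested by Figure 3), that the result is stored as one byte per pixel, and that the 1/(pad_nx*pad_ny) factor compensates for FFTW's unnormalized inverse transform.

/* Hypothetical sketch: copy out the part of the useful image region that
   falls inside this process's slab. Returns the number of image rows
   this process contributes to the final output. */
static int depad_local_slab(const fftw_complex *convolved, unsigned char *out,
                            int local_height, int local_start,
                            int pad_nx, int pad_ny, int pad_top, int pad_left,
                            int img_nx, int img_ny)
{
    int i, j, out_rows = 0;
    double scale = 1.0 / ((double) pad_nx * (double) pad_ny);

    for (i = 0; i < local_height; i++) {
        int grow = local_start + i;                      /* global row index */
        if (grow < pad_top || grow >= pad_top + img_nx)  /* Case 1: padding only */
            continue;
        for (j = 0; j < img_ny; j++)                     /* Cases 2 and 3 */
            out[out_rows * img_ny + j] = (unsigned char)
                (convolved[i * pad_ny + pad_left + j].re * scale);
        out_rows++;
    }
    return out_rows;
}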
Writing the output Image: While writing the output image we need to address synchronization, i.e. enforce an ordering of the processes. If a process has rank zero, it writes its rows into the output buffer and passes the buffer on to process rank+1; every other process receives the buffer from rank-1, adds its rows, and sends it to rank+1; the last process only receives the buffer from rank-1. A sketch of this ordered hand-off is given below.
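One way to realize this ordering is sketched below; it is illustrative rather than the project's exact scheme. It assumes the standard headers stdio.h, stdlib.h, string.h and mpi.h, that the final image is img_nx x img_ny bytes, that out and out_rows come from the de-padding step, that row_offset (a hypothetical variable) is the global row at which this rank's rows belong, and that the last rank writes the file (the name output.img is hypothetical) once the travelling buffer is complete.

/* Hypothetical sketch: the output buffer travels from rank 0 to the last
   rank; each rank copies in its own rows, and the last rank writes the
   file. rank and size come from MPI_Comm_rank/MPI_Comm_size. */
unsigned char *buf = (unsigned char *) malloc((size_t) img_nx * img_ny);
MPI_Status status;

if (rank > 0)                      /* receive the partially filled buffer */
    MPI_Recv(buf, img_nx * img_ny, MPI_UNSIGNED_CHAR,
             rank - 1, 0, MPI_COMM_WORLD, &status);

memcpy(buf + (size_t) row_offset * img_ny,   /* add this rank's rows */
       out, (size_t) out_rows * img_ny);

if (rank < size - 1) {             /* pass the buffer on to the next rank */
    MPI_Send(buf, img_nx * img_ny, MPI_UNSIGNED_CHAR,
             rank + 1, 0, MPI_COMM_WORLD);
} else {                           /* last rank writes the complete image */
    FILE *fp = fopen("output.img", "wb");
    fwrite(buf, 1, (size_t) img_nx * img_ny, fp);
    fclose(fp);
}
free(buf);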
Testing and Verification: We tested the parallelized code as follows: we executed the sequential code and the parallel code for different numbers of processes, visualized the outputs in MATLAB, and compared them to ensure that our code was executing correctly. Additionally, we used the timing functions provided by the MPI library to calculate the speedup and efficiency of our parallel code.

Results: We obtained readings on the cluster using the image file provided by Dr. Gabriel under the folder /pvfs2. We took three readings each for 1, 2, 4 and 8 processes; these are tabulated below.

# proc    Iter1      Iter2      Iter3      Avg
1         981.964    987.536    981.474    983.658
2         660.945    682.134    671.897    671.659
4         365.442    366.828    374.905    369.0586
8         203.530    204.916    205.372    204.606

Table 1: Number of processes vs. average execution time over 3 iterations.
Please note: we obtained the execution time for the parallel code by taking the maximum execution time, i.e. the execution time of the slowest process in the group.

Speedup and Efficiency: We calculated the speedup and efficiency as follows:
Speedup = T(1) / T(n)
Efficiency = Speedup / n
where T(p) is the total execution time on p processes and n is the number of processes.
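As a worked example, using the Table 1 averages for one and two processes:

Speedup(2) = 983.658 / 671.659 ≈ 1.4645
Efficiency(2) = 1.4645 / 2 ≈ 0.7322

which matches the first row of Table 2 below.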
No. of processes    Speed up    Efficiency
2                   1.4645      0.7322
4                   2.6653      0.666
8                   4.8075      0.6009

Table 2: Speedup and Efficiency
Figure 4: Number of processes vs. speedup
Figure 5: Number of processes vs. efficiency
Conclusion: Implementing the computation in parallel with the MPI libraries allowed us to scale our solution greatly. We can perform image analysis on images with sizes in the range of hundreds to thousands of gigabytes, which would not have been possible sequentially. We also saw that performance improves as processes are added: the time taken to perform the analysis is reduced to a great extent, as can be inferred from the graphs plotted above.
Individual Contribution of team members:

Neelima Paramatmuni:
• Analyzed the sequential code and identified the parts that needed to be parallelized
• Studied the FFTW MPI routines
• Wrote the code to parallelize the sequential version (reading, padding and writing the image)
• Debugged the code
• Tested the code
• Calculated speedup and efficiency
• Documentation

Padmaja Pentela:
• Analyzed the code and the FFTW MPI routines
• Helped with the file reading, padding and file writing functions
• Documentation

Shyamali Balasubramaniyan:
• Analyzed the code
• Studied the FFTW MPI routines
• Tested the code
• Recorded measurements, produced graphs and documentation

Shravanthi Denthumdas:
• Analyzed the sequential code and studied the FFTW MPI routines
• Tested the code
• Documentation

Samira Hemrajani:
• Contributed to the analysis of the sequential code
• Addressed questions such as: What is image analysis? How is the FFT carried out? What are filters, Gabor filters, etc.?
• Investigated different parallelization approaches and reviewed fftwnd_mpi_local_sizes for data distribution in the MPI FFT
• Incorporated the timing functions, namely MPI_Wtime and gettimeofday, in the parallel and sequential codes respectively
• Carried out intensive testing
• Helped document the results