Due Mon April 13, 11:59pm
20 points total
This assignment is intended to help you develop an understanding of the two primary forms of parallel execution present in a modern multi-core CPU:
- SIMD execution within a single processing core
- Parallel execution using multiple cores
You will also gain experience measuring and reasoning about the performance of parallel programs (a challenging, but important, skill you will use throughout this class). This assignment involves only a small amount of programming, but a lot of analysis!
ISPC is needed to compile many of the programs used in this assignment. You can install ISPC using one of the following methods:
- Install ISPC using the package manager on your system.
  - Ubuntu users: sudo apt install ispc
  - Mac users with Homebrew: brew install ispc
- Download the latest release of ISPC from the ISPC GitHub repository and follow the installation instructions in the README.
Build and run the code in the prog1_mandelbrot_threads/ directory of
the code base. (Type make to build, and ./mandelbrot to run it.)
This program produces the image file mandelbrot-serial.ppm, which is a visualization of a famous set of
complex numbers called the Mandelbrot set. Most platforms have a viewer
for .ppm files. As you can see in the images below, the
result is a familiar and beautiful fractal. Each pixel in the image
corresponds to a value in the complex plane, and the brightness of
each pixel is proportional to the computational cost of determining
whether the value is contained in the Mandelbrot set. To get image 2,
use the command option --view 2. (See function mandelbrotSerial()
defined in mandelbrotSerial.cpp). You can learn more about the
definition of the Mandelbrot set at
http://en.wikipedia.org/wiki/Mandelbrot_set.
Your job is to parallelize the computation of the images using
std::thread. Starter
code that spawns one additional thread is provided in the function
mandelbrotThread() located in mandelbrotThread.cpp. In this function, the
main application thread creates one additional thread using the constructor
std::thread(function, args...). It waits for this thread to complete by calling
join on the thread object.
Currently the launched thread does not do any computation; it returns immediately.
You should add code to the workerThreadStart function to accomplish this task.
You will not need to make use of any other std::thread API calls in this assignment.
What you need to do:
- Modify the starter code to parallelize the Mandelbrot generation using two processors. Specifically, compute the top half of the image in thread 0, and the bottom half of the image in thread 1. This type of problem decomposition is referred to as spatial decomposition since different spatial regions of the image are computed by different processors.
- Extend your code to use up to the maximum number of threads reported by std::thread::hardware_concurrency(), partitioning the image generation work accordingly (threads should get blocks of the image). In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1. Is speedup linear in the number of threads used? In your write-up, hypothesize why this is (or is not) the case. (You may also wish to produce a graph for VIEW 2 to help you come up with a good answer. Hint: take a careful look at the three-thread datapoint.)
- To confirm (or disprove) your hypothesis, measure the amount of time each thread requires to complete its work by inserting timing code at the beginning and end of workerThreadStart(). How do your measurements explain the speedup graph you previously created?
- Modify the mapping of work to threads to improve speedup to about 7-8x on both views of the Mandelbrot set (if you're above 7x that's fine, don't sweat it). You may not use any synchronization between threads in your solution. We are expecting you to come up with a single work decomposition policy that will work well for all thread counts---hard coding a solution specific to each configuration is not allowed! (Hint: there is a very simple static assignment that will achieve this goal, and no communication/synchronization among threads is necessary.) In your write-up, describe your approach to parallelization and report the final speedup obtained.
- Now run your improved code with 2x the number of HW threads. Is performance noticeably greater than when running with 1x the number of HW threads? Why or why not?
This program also computes a Mandelbrot fractal image, but it achieves even greater speedups by utilizing both the CPU's cores and the SIMD execution units within each core.
In Program 1, you parallelized image generation by creating one thread for each processing core in the system. Then, you assigned parts of the computation to each of these concurrently executing threads. (Since threads were one-to-one with processing cores in Program 1, you effectively assigned work explicitly to cores.) Instead of specifying a specific mapping of computations to concurrently executing threads, Program 2 uses ISPC language constructs to describe independent computations. These computations may be executed in parallel without violating program correctness (and indeed they will!). In the case of the Mandelbrot image, computing the value of each pixel is an independent computation. With this information, the ISPC compiler and runtime system take on the responsibility of generating a program that utilizes the CPU's collection of parallel execution resources as efficiently as possible.
You will make a simple fix to Program 2 which is written in a combination of
C++ and ISPC (the error causes a performance problem, not a correctness one).
With the correct fix, you should observe performance that is over 32 times
greater than that of the original sequential Mandelbrot implementation from
mandelbrotSerial().
When reading ISPC code, you must keep in mind that although the code appears
much like C/C++ code, the ISPC execution model differs from that of standard
C/C++. In contrast to C, multiple program instances of an ISPC program are
always executed in parallel on the CPU's SIMD execution units. The number of
program instances executed simultaneously is determined by the compiler (and
chosen specifically for the underlying machine). This number of concurrent
instances is available to the ISPC programmer via the built-in variable
programCount. ISPC code can reference its own program instance identifier via
the built-in programIndex. Thus, a call from C code to an ISPC function can
be thought of as spawning a group of concurrent ISPC program instances
(referred to in the ISPC documentation as a gang). The gang of instances
runs to completion, then control returns back to the calling C code.
Stop. This is your friendly instructor. Please read the preceding paragraph again. Trust me.
As an example, the following program uses a combination of regular C code and ISPC code to add two 1024-element vectors. As we discussed in class, since each instance in a gang is independent and performs the exact same program logic, execution can be accelerated via implementation using SIMD instructions. The C code below calls the ISPC code that follows it:
------------------------------------------------------------------------
C program code: myprogram.cpp
------------------------------------------------------------------------
const int TOTAL_VALUES = 1024;
float a[TOTAL_VALUES];
float b[TOTAL_VALUES];
float c[TOTAL_VALUES];
// Initialize arrays a and b here.
sum(TOTAL_VALUES, a, b, c);
// Upon return from sum, result of a + b is stored in c.
The corresponding ISPC code:
------------------------------------------------------------------------
ISPC code: myprogram.ispc
------------------------------------------------------------------------
export void sum(uniform int N, uniform float* a, uniform float* b, uniform float* c)
{
// Assumption: programCount divides N evenly.
for (int i=0; i<N; i+=programCount)
{
c[programIndex + i] = a[programIndex + i] + b[programIndex + i];
}
}
The ISPC program code above interleaves the processing of array elements among program instances. Note the similarity to Program 1, where you statically assigned parts of the image to threads.
However, rather than thinking about how to divide work among program instances
(that is, how work is mapped to execution units), it is often more convenient,
and more powerful, to instead focus only on the partitioning of a problem into
independent parts. ISPC's foreach construct provides a mechanism to express
problem decomposition. Below, the foreach loop in the ISPC function sum2
defines an iteration space where all iterations are independent and therefore
can be carried out in any order. ISPC handles the assignment of loop iterations
to concurrent program instances. The difference between sum and sum2 below
is subtle, but very important. sum is imperative: it describes how to
map work to concurrent instances. sum2 is declarative: it
specifies only the set of work to be performed.
-------------------------------------------------------------------------
ISPC code:
-------------------------------------------------------------------------
export void sum2(uniform int N, uniform float* a, uniform float* b, uniform float* c)
{
foreach (i = 0 ... N)
{
c[i] = a[i] + b[i];
}
}
Before proceeding, you are encouraged to familiarize yourself with ISPC
language constructs by reading through the ISPC walkthrough available at
http://ispc.github.io/example.html. The example program in the walkthrough
is almost exactly the same as Program 2's implementation of mandelbrot_ispc()
in mandelbrot.ispc. In the assignment code, we have changed the bounds of
the foreach loop to yield a more straightforward implementation.
What you need to do:
- Compile and run mandelbrot_ispc. The ISPC compiler is currently configured to target AVX2 on x86-64 CPUs, which generates 8-wide SIMD vector instructions. On Apple Silicon systems (such as M1, M2, M3, and M4 machines), ISPC targets the ARM64/AArch64 architecture and uses NEON SIMD instructions instead, which are typically 4-wide for single-precision floating-point operations. Check the Makefile and update the ISPC target as needed when building on Apple Silicon. What is the maximum speedup you expect given what you know about these CPUs? Why might the number you observe be less than this ideal? (Hint: consider the characteristics of the computation you are performing. Describe the parts of the image that present challenges for SIMD execution. Comparing the performance of rendering the different views of the Mandelbrot set may help confirm your hypothesis.)
We remind you that for the code described in this subsection, the ISPC compiler maps gangs of program instances to SIMD instructions executed on a single core. This parallelization scheme differs from that of Program 1, where speedup was achieved by running threads on multiple cores.
Take a look at the technical details of your CPU to understand the SIMD capabilities of your machine. For example, you can use the command lscpu on Linux or sysctl -a | grep machdep.cpu on macOS to get detailed information about your CPU's architecture and supported instruction sets.
ISPC's SPMD execution model and mechanisms like foreach facilitate the creation
of programs that utilize SIMD processing. The language also provides an additional
mechanism for utilizing multiple cores in an ISPC computation:
launching ISPC tasks.
See the launch[2] command in the function mandelbrot_ispc_withtasks. This
command launches two tasks. Each task defines a computation that will be
executed by a gang of ISPC program instances. As given by the function
mandelbrot_ispc_task, each task computes a region of the final image. Similar
to how the foreach construct defines loop iterations that can be carried out
in any order (and in parallel by ISPC program instances), the tasks created by
this launch operation can be processed in any order (and in parallel on
different CPU cores).
What you need to do:
- Run mandelbrot_ispc with the parameter --tasks. What speedup do you observe on view 1? What is the speedup over the version of mandelbrot_ispc that does not partition the computation into tasks?
- There is a simple way to improve the performance of mandelbrot_ispc --tasks by changing the number of tasks the code creates. By only changing code in the function mandelbrot_ispc_withtasks(), you should be able to achieve performance that exceeds the sequential version of the code by over 32 times! How did you determine how many tasks to create? Why does the number you chose work best?
- What are the differences between the thread abstraction (used in Program 1) and the ISPC task abstraction? There are some obvious differences in semantics between the (create/join) and (launch/sync) mechanisms, but the implications of these differences are more subtle. Here's a thought experiment to guide your answer: what happens when you launch 10,000 ISPC tasks? What happens when you launch 10,000 threads? (For this thought experiment, please discuss the general case - i.e., don't tie your discussion to this given Mandelbrot program.)
The smart-thinking student's question: Hey wait! Why are there two different
mechanisms (foreach and launch) for expressing independent, parallelizable
work to the ISPC system? Couldn't the system just partition the many iterations
of foreach across all cores and also emit the appropriate SIMD code for the
cores?
Program 3 is an ISPC program that computes the square root of 20 million
random numbers between 0 and 3. It uses a fast, iterative implementation of
square root that uses Newton's method to converge to an accurate solution
for values in the (0-3) range. (The implementation does not converge for
inputs outside this range.) Notice that the speed of convergence depends on the
accuracy of the initial guess.
Note: This problem is a review to double-check your understanding, as it covers similar concepts as programs 1 and 2.
What you need to do:
- Build and run sqrt. Report the ISPC implementation speedup for a single CPU core (no tasks) and when using all cores (with tasks). What is the speedup due to SIMD parallelization? What is the speedup due to multi-core parallelization?
- Modify the contents of the array values to improve the relative speedup of the ISPC implementations. Construct a specific input that maximizes speedup over the sequential version of the code and report the resulting speedup achieved (for both the with- and without-tasks ISPC implementations). Does your modification improve SIMD speedup? Does it improve multi-core speedup (i.e., the benefit of moving from ISPC without tasks to ISPC with tasks)? Please explain why.
- Construct a specific input for sqrt that minimizes speedup for ISPC (without tasks) over the sequential version of the code. Describe this input, describe why you chose it, and report the resulting relative performance of the ISPC implementations. What is the reason for the loss in efficiency? (Keep in mind we are using the --target=avx2 option for ISPC, which generates 8-wide SIMD instructions.)
Program 4 is an implementation of the saxpy routine in the BLAS (Basic Linear
Algebra Subprograms) library that is widely used (and heavily optimized) on
many systems. saxpy computes the simple operation result = scale*X+Y, where X, Y,
and result are vectors of N elements (in Program 4, N = 20 million) and scale is a scalar. Note that
saxpy performs two math operations (one multiply, one add) for every three
elements used. saxpy is a trivially parallelizable computation and features predictable, regular data access and predictable execution cost.
What you need to do:
- Compile and run saxpy. The program will report the performance of the ISPC (without tasks) and ISPC (with tasks) implementations of saxpy. What speedup from using ISPC with tasks do you observe? Explain the performance of this program. Do you think it can be substantially improved? (For example, could you rewrite the code to achieve near-linear speedup? Yes or no? Please justify your answer.)
- Note that the total memory bandwidth consumed by the computation in main.cpp is TOTAL_BYTES = 4 * N * sizeof(float);. Even though saxpy loads one element from X, loads one element from Y, and writes one element to result, the multiplier of 4 is correct. Why is this the case? (Hint: think about how CPU caches work.)
- Extra Credit: (points handled on a case-by-case basis) Improve the performance of saxpy. We're looking for a significant speedup here, not just a few percentage points. If successful, describe how you did it and what a best-possible implementation on your system might achieve. Also, if successful, come and tell us; we'll be interested. ;-)
Want to know about ISPC and how it was created? One of the two creators of ISPC, Matt Pharr, wrote an amazing blog post on the history of its development called The story of ispc. It really touches on many issues of parallel system design -- in particular the value of limiting the scope of programming languages vs general purpose programming languages. And it gets at real-world answers to common questions like... "why can't the compiler just automatically parallelize my program for me?"
Submission Method: Submit via Moodle
Group Submission: Only one submission per group is required.
Required File:
- Single PDF file named
Assignment_1_Write_Up.pdf
PDF Contents:
- System Information
- CPU architecture
- Number of cores
- Supported SIMD instructions
- Number of hardware threads
- Written Answers
- Responses to all assignment questions
- Performance analysis and reasoning
- Graphs and visualizations
- Code Artifacts
- Relevant code snippets from your implementation
Code Submission: You do not need to submit source code files. However, ensure all code compiles and runs—be prepared to demonstrate it to the TA upon request.
- Please make sure both group members' names and Student IDs are in the document.
- Extensive ISPC documentation and examples can be found at http://ispc.github.io/
- Zooming into different locations of the Mandelbrot image can be quite fascinating.
- Intel provides a lot of supporting material about AVX2 vector instructions at http://software.intel.com/en-us/avx/.
- The Intel Intrinsics Guide is very useful.

