Parallel Programming: Fueling High Performance in Computing

Parallel Programming: Fueling High Performance in Computing

I am quite passionate about high performance computing. Given that it is a niche (yet indispensable) domain, I am writing this two-part series to give quasi-laymen an HPC-101. Eventually, I hope this dovetails into a more in-depth description of some problem(s) I am working on.

The Need For Speed

Conventionally, computers have been used to automate human tasks in the industry. In academia, however, computers have also carved a niche due to their reliability and “super-human” capabilities. Several domains involve unbelievably high amounts of number crunching. Given a sufficiently large data volume, the domain scientist might be able to extract meaningful patterns from this crunching process.

There is a tradeoff for the sake of time. You can either process your experimental data very carefully, in which case the amount of data you can use is small. Or, you can do some quick and dirty processing over a large amount of experimental data.

Each new step in high performance computing pushes this curve further ahead. The domain scientist then has the capability to collect more data, as well as run complex processing algorithms on this data. Needless to say, that greatly advances the underlying science.

This trend is now visible in the industry as well. Cutting edge industrial technology is no longer just automating human tasks. It is creating new tasks and setting benchmarks therein.

In aviation, pilot training is typically done on simulators. High quality simulators have also enabled investigators to create several “what-if” scenarios quickly, and then fly them to see what might have caused a crash.

The Boeing 777 aircraft was the first ever to make extensive use of simulations in the design process. Boeing saw a 7X reduction in the amount of physical prototyping required to fix the wing design of this beautiful jet.

The newfound predictive and discriminative power of AI has been explored simply because there was access to faster machines. The fact that top tier AI conferences today have over 10k paper submissions–majority of them being extensive epirical studies of incremental improvements–is a testament to the ubiquitousness of High Performance Computing!

Here is a document that talks about the Return-on-Investment on Supercomputers for several different industries.

The dichotomy of performance

Computer hardware has evolved mainly on two fronts: processing speed and data storage capacity (memory). While they fuse into one system model for the purposes of algorithm design or performance analysis, the architectural advancements therein have been quite orthogonal.

It is very common to pitch a new processor with expected performance gains on either scientific benchmarks (relevant to academia), or some selected software that people use (Photoshop, for example). The underlying argument being, if you buy the new processor, you get that much more speed (for everything).

However, the converse (IE. If you want that much more speed, then you have to buy the new processor) is false in this case (converse of a statement is in general neither trivially true nor false). Why so? Well, let’s look at how a computer runs a program:

    1. Fetch Instruction from memory.
    2. If instruction needs data, fetch it from memory.
    3. Execute instruction.
    4. If instruction produced new data, write it to the memory.

This is how a monolithic processor would work. These 4 steps happen one after the other (not exactly, but for the sake of simplicity) as long as the program has unexecuted instructions.

So you see that the net run-time of a program is influenced by:

  1. The rate at which one can read from and write to memory,
  2. The rate at which instructions can be executed by the processor.

If you buy a new processor, step 3 becomes even faster, but that has no effect on other steps. If you buy faster memory, everything but step 3 gets faster. But then what if your favourite programs typically don’t have steps 2 and 4?

Since it is infeasible to read and analyse all machine instructions, we need a mathematical model that would tell us how the speed of our program would change with a change in the performance of our hardware.

The roofline model

Let’s define some jargon before we get to the model:


The acronym FLOP stands for a FLoating point OPeration. We can hence define a speed metric (FLOPs) that measures the floating point operations a processor can do in one second.


The speed of RAM access is called memory bandwidth. It simply refers to the number of bytes you can read per unit time.

The roofline model quantifies the scalability of an algorithm with a new metric, arithmetic intensity. This is measured as the number of floating point operations done per byte of data read.

Now, for a given peak bandwidth B and a peak FLOPs rate P of a processor, the speed of a program is given by:

$$ S=\min{\begin{cases}P \\B \times I\end{cases}} $$

The intuition behind this is as follows. Either the implementation of the program is compute heavy (step 3 dominates), or memory heavy. In case it is compute heavy, then the speed of the processor is the only thing that influences the run-time of the program implementation. If not, then the memory bandwidth must be the bottleneck. The graph shows the two cases:

To make the most of a faster processor, you need to be in the right hand region of this curve. If you find yourself in the left side, you have two options. Rewrite your program to do the same thing with a higher arithmetic intensity (think about how you can reuse the data you read), or buy better memory. This nvidia developer blog that describes the CUTLASS library has a nice discussion on how it improves the arithmetic intensity of a matrix multiplication operation by simply changing the way data is accessed.

Read the second part of the series here: Making Processors Fast Again

I am a BITSian alumnus working with IBM Research. My work today is primarily aimed at making deep learning faster (and hence, potentially deeper). As a fallback, I make deep neural networks memory efficient.

This post has been taken from my personal blog. You can read more of my work here

Saurabh Raje’s personal page