The theory of multiprocessor computing shows that there is a mathematical limit on how fast a real parallelized program may go.

When a program runs on more than one CPU, its total run time should be less. But how much less? And what are the limits on the speedup?

Adding CPUs to Shorten Execution Time

You can make your program run faster by distributing its work over multiple CPUs. However, there is always some part of the program's logic that must be executed serially, by a single CPU.

There are two basic limits to the speedup you can achieve by parallel execution:

- The fraction of the program's work that can be run in parallel, p, is never 100%.
- Each added CPU carries some overhead and can interfere with the work of the other CPUs, for example by contending for shared resources.

Tuning for parallel execution comes down to battling these two limits. You strive to increase the parallel fraction, p, because in some cases even a small change in p (from 0.8 to 0.85, for example) makes a dramatic change in the effectiveness of added CPUs.

And you work to ensure that each added CPU does a full CPU's work and does not interfere with the work of other CPUs.

Parallel Speedup and Amdahl's Law

The speedup gained from applying n CPUs, Speedup(n), is the ratio of the one-CPU execution time to the n-CPU parallel execution time: Speedup(n) = T(1)/T(n). If you measure the one-CPU execution time of a program at 100 seconds, and the program runs in 60 seconds with 2 CPUs, Speedup(2) = 100/60 = 1.67.
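This definition is a one-line calculation; a minimal sketch in Python (the function name is illustrative):

```python
# Speedup(n) = T(1) / T(n): one-CPU time divided by n-CPU time.
def speedup(t1, tn):
    return t1 / tn

# The example from the text: 100 s on one CPU, 60 s on two CPUs.
print(round(speedup(100, 60), 2))  # 1.67
```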

This number captures the improvement from adding hardware to the system. We expect T(n) to be less than T(1); if it is not, adding CPUs has made the program slower, and something is wrong! So Speedup(n) should be a number greater than 1.0, and the greater it is, the more pleased we are. Intuitively you might hope that the speedup
would be equal to the number of CPUs --- twice as many CPUs, half the time --- but this ideal can never quite be achieved.
Normally, the number Speedup(n) must be less than n, reflecting the fact that not all parts of a program benefit from parallel execution. However, it is possible --- in rare situations --- for Speedup(n) to be larger than n. This is called a superlinear speedup --- the program has been sped up by more than the increase in CPUs.

There are always parts of a program that cannot be made parallel --- code that must run serially --- and these serial parts cannot be sped up by concurrency.

The mathematical statement of this idea is called Amdahl's law. Let p be the fraction of the program's code that can be made parallel (p is always a fraction less than 1.0). The remaining (1-p) of the code must run serially. In practical cases, p ranges from 0.2 to 0.99.

The potential speedup for a program is the reciprocal of the time that remains: the serial part, (1-p), plus the parallel part, p, divided among the n CPUs you can apply:

                     1
Speedup(n) = -----------------    (Amdahl's law: Speedup(n) given p)
             (1-p) + (p/n)
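Amdahl's law is easy to evaluate directly; a minimal Python sketch (the function name is illustrative) showing predicted speedups on 4 CPUs for several values of p:

```python
def amdahl_speedup(p, n):
    """Predicted Speedup(n) when fraction p of the work runs in parallel."""
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.5, 0.8, 0.95):
    print(p, round(amdahl_speedup(p, 4), 2))  # e.g. p = 0.8 gives 2.5
```

Note that even with p = 0.95, four CPUs deliver a speedup of only about 3.48, not the ideal 4.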

The fraction p has a strong effect on the possible speedup.

In particular, the more CPUs you have, the higher p must be to profit from them. With only 4 CPUs, you need p ≈ 0.67 to get half the ideal speedup. With 8 CPUs, you need p ≈ 0.86 to get half the ideal speedup.
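Setting Speedup(n) = n/2 in Amdahl's law and solving for p gives p = (n-2)/(n-1). A quick check of this half-ideal threshold:

```python
def p_for_half_ideal(n):
    # Solve 1/((1-p) + p/n) = n/2 for p, which gives p = (n-2)/(n-1).
    return (n - 2) / (n - 1)

print(round(p_for_half_ideal(4), 2))  # 0.67
print(round(p_for_half_ideal(8), 2))  # 0.86
```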

Calculating the Parallel Fraction of a Program

You do not have to guess at the value of p for a given program. Measure the execution times T(1) and T(2) to calculate a measured Speedup(2) = T(1)/T(2). The Amdahl's law equation can be rearranged to yield p when Speedup(2) is known:

        Speedup(2) - 1
p = 2 * --------------    (Amdahl's law: p given Speedup(2))
          Speedup(2)
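Rearranged this way, p falls out of two timing measurements; a minimal Python sketch (the function name is illustrative):

```python
def parallel_fraction(t1, t2):
    """p computed from measured one-CPU and two-CPU execution times."""
    s2 = t1 / t2                   # measured Speedup(2)
    return 2.0 * (s2 - 1.0) / s2

# With T(1) = 100 s and T(2) = 60 s, Speedup(2) = 1.67 and p = 0.8.
print(round(parallel_fraction(100, 60), 2))  # 0.8
```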

Predicting Execution Time with n CPUs

In some cases, the measured Speedup(2) = T(1)/T(2) is greater than 2; in other words, a superlinear speedup (described earlier). When this occurs, the formula in the preceding section returns a value of p greater than 1.0, which is clearly not useful. In this case, calculate p from two other, more realistic timings, for example T(2) and T(3). The general formula for p is:

                Speedup(n) - Speedup(m)
p = -----------------------------------------------    (Amdahl's law: p given Speedup(n) and Speedup(m))
    (1 - 1/n)*Speedup(n) - (1 - 1/m)*Speedup(m)

where n and m are the two processor counts whose speedups are known. You can use this calculated value of p to extrapolate the potential speedup with higher numbers of processors.
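The two-measurement calculation and the extrapolation can be combined; a minimal Python sketch using hypothetical timings (T(1) = 100 s, T(2) = 60 s, T(3) = 45 s):

```python
def parallel_fraction(n, sn, m, sm):
    """p from two measured speedups, Speedup(n) = sn and Speedup(m) = sm."""
    return (sn - sm) / ((1 - 1/n) * sn - (1 - 1/m) * sm)

def amdahl_speedup(p, n):
    """Predicted Speedup(n) for a given parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

s2, s3 = 100 / 60, 100 / 45          # measured Speedup(2) and Speedup(3)
p = parallel_fraction(3, s3, 2, s2)  # about 0.857
print(round(amdahl_speedup(p, 8), 2))  # extrapolated Speedup(8): 4.0
```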
