Introduction to Parallel Program Design


I. Introduction

This tutorial discusses in general terms some of the issues associated with developing a parallel programs for solving parallel problems. We will cover the following topics:

  • General Goals of Parallel Programming
  • Functional Parallelism
  • Data Parallelism
  • Sample Problem
  • Other Online Examples
  • References
  • Acknowledgements

    Forward


    II. Goals

    Some definitions:

    Speedup is the serial execution time divided by the parallel execution time, given a fixed problem size and number of processors. Perfect speedup equals the number of processors. For instance, if you were to run on 8 processors, perfect speedup would mean that you would run 8 times faster on 8 processors than on one.

    Scalability refers to how speedup changes as the problem size and number of processors increases. For instance, as you increase the problem size, you may spend more time in one part of your code. If this piece is computed in parallel, your speedup may increase realtive to the speedup you obtained when running the smaller problem.

    II A. Ideal speedup/scalability in a Parallel Program

    Ideal goals for writing a program with maximum speedup and scalability:

    Points to keep in mind:

    II. Common Parallel Programming Models

    There are many different models or ways to do parallel programming. We survey some of them here.

    II B. Data versus Functional Parallelism

    II C. SPMD versus Master Worker

    Forward
    Return to Introduction


    III. Functional Parallelism

    This section contains two examples of types of problems that could be solved using functional parallelism.

    III A Ecosystem Modeling

    The diagram shows five processes, each running a different program. In this case, each program calculates the population of a given group, where each group's growth depends on that of its neighbors. As time progresses, each process calculates its current state, then shares information with the dependent populations, so they can all go on to calculate the state at the next time step.

    The load balancing for this program is static (pre-scheduled) -- each process' load is determined and inflexible at the start of the application. It is also likely to be unequal, with the different programs requiring different amounts of computation before sharing state.

    The communication pattern is a ring. This will influence how the different programs are mapped to physical processors. Those programs that need to communicate should ideally be only one communication "hop" from each other.


    III B. Audio Signal Processing (pipeline)

    To process the audio signal, the data set is passed through three distinct computational filters. Each filter is a separate process. The first chunk of data must pass through the first filter before progressing to the second. When it does, the second chunk of data passes through the first filter. By the time the third chunk of data is in the first filter, all three processes are busy.

    Again, load balancing is static and will be unequal if different filters require different amounts of computation. The communication pattern is a 1-dimensional mesh.

    IV. Examples of Data Parallelism

    This section contains three examples of types of problems that could be solved using data parallelism.

    IV A. Image Processing

    The diagram above shows a SPMD solution. All four processes are running the same three-step program, processing different data. The dashed horizontal lines represent barriers or synchronization points. Each process must complete that step before all processes can proceed to the next step.

    The sketches below show an image that will be processed. The layout on the left shows a 2-D block decomposition, on the right a 1-D block decomposition.

    Data decomposition:

    3.2 Effect of Pollution on Forested Areas


    This program calculates the effect of pollution on tree growth and mortality for a geographic area.

    III C. Chess


    In this example, a computer is playing chess. In choosing its next move, it will analyze responses to all possible moves for one or more rounds into the future.

    Forward
    Backward
    Return to Introduction


    V. Sample Problem

    In this section we will walk through the parallel program development of a simple problem.

    V A. Problem Description

    This problem was developed from a description found in Fox et al. (1988) Solving Problems on Concurrent Processors, vol. 1. Prentice Hall.

    Calculate the amplitude along a uniform, vibrating string after a specified amount of time has elapsed.

    The numerical solution proceeds by discretization. That is, instead of trying to find the amplitude at any position along the string, at any time, we restrict ourselves to looking at position at discrete points in space at discrete time. If we have enough points, our solution will look continuous.

    The amplitude is represented along the y axis, and the position is labeled by discrete points i along the x-axis. The amplitude will then be computed at discrete time steps.

    The equation to be solved is the one-dimensional wave equation:

    A(i,t+1) = (2.0 * A(i,t)) - A(i,t-1)
    + (c * (A(i-1,t) - (2.0 * A(i,t)) + A(i+1,t)))

    Note that amplitude will depend on previous timesteps (t, t-1) and neighboring points (i-1, i+1).

    V B. Decomposition: Function versus Data

    Function: divide the computation into chunks of disjoint or unassociated work

    Data: Give each process a subset of a domain

    Data Decomposition

    Data Replication

    Load balancing

    Communication

    Block decomposition by position

    Each color represents a different process. The boxes at the edges of each color indicate that the endpoints will require communication at each time step with the neighboring process.

    V C. Code Outline

    1. Read in starting values
    2. Establish communication channels
    3. Divide data among processes

      For each time step{

    4. Exchange endpoints
    5. Calculate amplitude for new time step
      }

    6. Output results

    V D. SPMD Solution

    Our analysis leads to a SPMD solution:

    Reading/writing data

    Pseudo Code

          program wave_spmd
    
    C     Learn number of tasks and taskid
          call initialize task
          call get task identification and information
    
    C     Identify left and right neighbors
    
    C     Get program parameters
          read tpoints, nsteps
    
    C     Divide data amongst processes
          read values
    
    C     Update values for each point along string
          do t = 1, nsteps
    C        Send to left, receive from right
                call send left endpoint to left neighbor
                call receive left endpoint from right neighbor
    C        Send to right, receive from left
                call send right endpoint to right neighbor
                call receive right endpoint from left neighbor
    
    C        Update points along line
             do i = 1, npoints
               newval(i) = (2.0 * values(i)) - oldval(i) 
         &     + (sqtau * (values(i-1) - (2.0 * values(i)) + values(i+1))) 
             end do
    
          end do
    
    C     Write results out to file
          write values
    
          call terminate parallel environment
    

    Click here for a more fully-developed pseudo code using MPI calls.

    Click here for the complete program.

    V E. SPMD with Master Worker Embedded

    Another solution is to embed a Master/Worker model within SPMD:

    Reading/writing data

    Pseudo Code

    All
    
    
          program wave_mw
    
    C     Learn number of tasks and taskid
          call initialize task
          call get task identification and information
    
    Master
    
    C     Get program parameters
          if (taskid .eq. MASTER) then
             read tpoints, nsteps
    
    C        Master broadcasts total points, time steps
             call send two numbers to all Workers
    
    Workers
    
          else
    C        Workers receive total points, time steps
             call all Workers receive two numbers
          end if
    
    Master
    
          if (taskid .eq. MASTER) then
             do i = 1, tpoints
                read(10) values(i)
             end do
    
    C        Master sends chunks to Workers
             do i = 1, nproc-1
    C              Send first point and number of points handled to Worker
                   call send two numbers
    
    C              Send chunk of array to Worker
                   call send chunk of array
             end do
    
    Workers
    
          else
    C        Receive first point and number of points
             call receive two numbers
    
    C        Receive chunk of array 
             call receive chunk of array
    
          end if
    
    
    All
    
    C     Update values along the wave for nstep time steps
          do t = 1, nsteps
    
    C        Send to left, receive from right
                call send left endpoint to left neighbor
                call receive right  endpoint from right neighbor
    C        Send to right, receive from left
                call send right endpoint to right neighbor
                call receive left endpoint from left neighbor
    
    C        Update points along line
             do i = 1, npoints
                newval(i) = (2.0 * values(i)) - oldval(i) +
         &      (sqtau * (values(i-1) - (2.0 * values(i)) + values(i+1)))
             end do
    
          end do
    
    Master
    
    C     Master collects results from Workers and prints
          if (taskid .eq. MASTER) then
             do i = 1, nproc - 1
    C           Receive first point and number of points
                call receive two numbers
    
    C           Receive results
                call receive chunk of results
    
    C           Write out results
                write results(i)
    
    Workers
    
          else 
    C        Send first point and number of points handled to Master
             call send two numbers
    
    C        Send results to Master
             call send results
          end if
    
    All
    
          call terminate parallel environment
          end
    

    Click here for a more fully-developed pseudo code using MPI calls.

    Click here for the complete program.

    Forward
    Backward
    Return to Introduction


    VI. Other Online Examples

    Online locations of parallel programming solutions to a variety of computational problems.

    Forward
    Backward
    Return to Introduction


    VII. References

    Ian Foster
    "Designing and Building Parallel Programs"
    1995 Addison-Wesley Publishing Company, Inc
    Online version of this book

    Geoffrey C. Fox
    "Solving Problems on Concurrent Processors"
    1988 Prentice Hall

    Backward
    Return to Introduction


    VII. Acknowledgements

    This tutorial is based on the same tutorial developed at the Cornell Theory Center.

    Backward
    Return to Introduction