CMU-CS-98-128
Computer Science Department
School of Computer Science, Carnegie Mellon University



CMU-CS-98-128

Efficient Irregular Computation on High-Bandwidth Pipelined-Memory Multiprocessors

Marco Zagha

May 1998

Ph.D. Thesis

Unavailable Electronically

Keywords: Parallel algorithms, latency hiding, PRACm QUQW, BSP, Cray C90, Cray J90, vector multipocessor, memory contention, interleaved memories, memory banks, benchmarks, sorting, merging, sparse matrices, graph algorithms


A major goal in the parallel computing community has been to transfer results from the theory of parallel computing to the practice of developing efficient implementations of general-purpose parallel algorithms. However, the most widely-used theoretical model, the Parallel Random Access Machine (PRAM), is an impractical model of real parallel machines. Thus, much current research centers on redesigning algorithms for models that account for the communication bottlenecks present on most distributed memory machines, a task that is feasible for certain classes of algorithms, but is fundamentally difficult for algorithms that use irregular data structures and access patterns.

Instead, this thesis considers whether it is feasible to map PRAM algorithms onto a class of realistic parallel machines with only a subset of the features of an ideal PRAM. Our goal is to show that general-purpose PRAM algorithms can be implemented efficiently and systematically on a class of high-bandwidth shared-memory parallel machines that

1) have a high-bandwidth network between the processors
and memory banks,
2) allow for fine-grained memory accesses,
3) can tolerate latency to the memory from processors by allowing
for multiple outstanding memory requests, and
4) have memory banks that are slower than the processors
and compensate by having more memory banks than processors.

These features are currently provided on vector multiprocessors, and might be provided in future parallel architectures. To model these machines, we extend Valiant's Bulk Synchronous Parallel (BSP) model to include machines that compensate for slow memory banks by having more memory banks than processors. We evaluate this extended model on the Cray C90 and Cray J90 and explore experimentally how various PRAM variants can be emulated in this model, with particular emphasis on the Queued-Read Queued-Write PRAM.

The major focus of the thesis is a practical study of techniques for implementing and modeling irregular PRAM algorithms on vector multiprocessors. The thesis describes a set of case studies on fundamental PRAM algorithms, including sorting, merging, sparse matrix multiplication, and graph connectivity. Each case study describes the selection of a practical PRAM algorithm, the process of mapping the algorithm onto the machine model, a performance model for predicting running time, and practical issues and tradeoffs in developing fast, general-purpose implementations for the Cray C90 and Cray J90.

146 pages


Return to: SCS Technical Report Collection
School of Computer Science homepage

This page maintained by [email protected]