PHiPAC Fast Matrix Multiply Home Page

The PHiPAC (Portable High Performance ANSI C) Page
BLAS3 Compatible Fast Matrix Matrix Multiply

Jeff A Bilmes

Krste Asanovic

Rich Vuduc

Sriram Iyer

Jim Demmel

CheeWhye Chin

Dominic Lam

Portable automatic generation of fast BLAS-GEMM compatible matrix-matrix multiply using PHiPAC techniques.


BLAS3 matrix-matrix operations usually have great potential for aggressive optimization. Unfortunately, they usually need to be hand-coded for a specific machine and/or compiler to achieve near-peak performance. We have developed a methodology whereby near-peak performance on such routines can be achieved automatically. First, rather than coding by hand, we produce parameterized code generators whose parameters are germane to the resulting machine performance. Second, the generated code follows the PHiPAC (Portable High Performance ANSI C) coding suggestions, which include manual loop unrolling, explicit removal of unnecessary dependencies in code blocks (if not removed, C semantics would prohibit many optimizations), and use of machine-sympathetic C constructs. Third, we develop search scripts that, for a given code generator, find the best set of parameters for a given architecture/compiler. We have developed a BLAS-GEMM compatible multi-level cache-blocked matrix-matrix multiply code generator that has achieved performance around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, SGI Octane R10k, and 80% on the SGI Indigo R4k. On the IBM, HP, SGI R4k, and the Sun Ultra-170, the resulting DGEMM is, in fact, faster than the GEMM in the vendor-optimized BLAS. Other generators, search scripts, and performance results are under development.
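As a rough illustration of the coding suggestions listed above, the following sketch shows what a small register-blocked kernel might look like in ANSI C. It is not output from the PHiPAC generator: the function names `gemm_naive` and `gemm_2x2`, the fixed 2x2 block size, and the row-major layout are all assumptions made for this example. In the methodology described above, a generator would emit code of this general shape for block sizes chosen by the search scripts.

```c
#include <assert.h>

/* Reference version: C += A*B, all matrices row-major.
 * A is M x K, B is K x N, C is M x N. */
void gemm_naive(int M, int K, int N,
                const double *A, const double *B, double *C)
{
    int i, j, k;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < K; k++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}

/* Hypothetical 2x2 register-blocked version of the same operation.
 * Illustrates: manual unrolling over a 2x2 block of C, accumulation
 * in local scalars (so the compiler need not assume that stores to C
 * alias loads from A or B), and pointer updates instead of repeated
 * index arithmetic. Assumes M and N are even. */
void gemm_2x2(int M, int K, int N,
              const double *A, const double *B, double *C)
{
    int i, j, k;
    assert(M % 2 == 0 && N % 2 == 0);
    for (i = 0; i < M; i += 2) {
        for (j = 0; j < N; j += 2) {
            const double *a0 = A + i * K;   /* row i of A     */
            const double *a1 = a0 + K;      /* row i+1 of A   */
            const double *b = B + j;        /* columns j, j+1 */
            double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;
            for (k = 0; k < K; k++) {
                /* Load operands into locals before any multiply-add. */
                double x0 = *a0++;
                double x1 = *a1++;
                double y0 = b[0];
                double y1 = b[1];
                b += N;
                c00 += x0 * y0;
                c01 += x0 * y1;
                c10 += x1 * y0;
                c11 += x1 * y1;
            }
            /* Write the block back to memory once, after the k loop. */
            C[i * N + j]           += c00;
            C[i * N + j + 1]       += c01;
            C[(i + 1) * N + j]     += c10;
            C[(i + 1) * N + j + 1] += c11;
        }
    }
}
```

Both functions compute the same update; the second only changes how the work is expressed to the compiler. Whether a 2x2 block or some other shape is fastest depends on the machine and compiler, which is the question the parameter search is meant to answer.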

New! A report on Automatic assembly of highly tuned code fragments.

New! The technical report describing the 1.0 release is now available (and is also available in pdf format).

New! Take a peek at our upcoming new web page.

New! The completely re-written beta release of the PHiPAC matrix package, which includes faster parameter search, software pipelining, and a more efficient blocking strategy, is now available. A TR describing this release is also available.

Here is the old alpha release of the PHiPAC BLAS-compatible matrix-matrix multiply generator. If you have PHiPAC bugs or performance numbers to report, please send them to phipac AT icsi DOT berkeley DOT edu

Here is an example of the bunch-mode MLP2 Backpropagation training algorithm using PHiPAC matrix-matrix multipliers. This speeds up training and forward passes by up to 4 times. The code is not particularly robust, and is meant primarily as an example, but should still be useful.
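To see why bunch mode benefits from a fast matrix-matrix multiply, consider the linear part of one layer's forward pass. Processing one pattern at a time is a vector-matrix product; stacking a bunch of patterns as rows of a matrix turns the same work into a single matrix-matrix product. The sketch below is a minimal illustration of that idea, not the code from the example package: the names `matmul`, `forward_one`, and `forward_bunch` are hypothetical, the layout is assumed to be row-major, and biases and the activation function are omitted.

```c
#include <assert.h>

/* Plain C = A*B (row-major): A is M x K, B is K x N, C is M x N.
 * Stand-in for a tuned GEMM routine; this is the call one would
 * replace with a faster matrix-matrix multiply. */
static void matmul(int M, int K, int N,
                   const double *A, const double *B, double *C)
{
    int i, j, k;
    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            double sum = 0.0;
            for (k = 0; k < K; k++)
                sum += A[i * K + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
    }
}

/* One pattern at a time: y = x*W, a vector-matrix product.
 * x has n_in entries, W is n_in x n_out, y has n_out entries. */
void forward_one(int n_in, int n_out,
                 const double *x, const double *W, double *y)
{
    assert(n_in > 0 && n_out > 0);
    matmul(1, n_in, n_out, x, W, y);
}

/* Bunch mode: Y = X*W, where each row of X is one input pattern.
 * X is bunch x n_in, Y is bunch x n_out. The whole bunch is handled
 * by a single matrix-matrix product. */
void forward_bunch(int bunch, int n_in, int n_out,
                   const double *X, const double *W, double *Y)
{
    assert(bunch > 0 && n_in > 0 && n_out > 0);
    matmul(bunch, n_in, n_out, X, W, Y);
}
```

Calling `forward_bunch` on a matrix of patterns gives the same result as calling `forward_one` on each row in turn; the difference is that the bunch version exposes a matrix-matrix operation that a blocked, tuned multiply can exploit.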


  • New! Our most recent TR (in postscript or pdf) describing release version 1.0 in detail (also available as a UCB TR in postscript or PDF formats).
  • Our most recent conference paper (in postscript or pdf), to be presented at the 1997 International Conference on Supercomputing in Vienna, Austria.
  • New! A paper presented at ICASSP'97 reporting on how PHiPAC can be used to speed up the bunch-mode backpropagation algorithm.
  • A LAWN111 technical report (also available here) describing the PHiPAC methodology (PDF is also available, but you'll need Adobe Acrobat to view it).
  • Bibtex citation entries.
  • Slides
      • ICS97 Slides
      • ASCI Presentation Slides
      • BLAS Technical Workshop, 1995 PHiPAC Slides
      • PHiPAC slides from the Castle retreat.
  • Related links
      • BLAS Technical Workshop, 1995
      • NETLIB, a site containing lots of mathematical papers and software.
      • FFTW, fast FFT code

  • Jeff Bilmes
    <bilmes AT cs DOT berkeley DOT edu>
    Former Ph.D. Student (Now professor at UW)
    Computer Science Division
    Dept. of EECS
    U.C. Berkeley
    Berkeley CA, 94720
    <bilmes AT icsi DOT berkeley DOT edu>
    Former Research Assistant
    International Computer Science Institute
    1947 Center St. Suite 600
    Berkeley CA, 94704

    Return to the ICSI home page.