BLAS3 matrix-matrix operations usually have great potential for aggressive optimization. Unfortunately, they usually need to be hand-coded for a specific machine and/or compiler to achieve near-peak performance. We have developed a methodology whereby near-peak performance on such routines can be achieved automatically. First, rather than code by hand, we produce parameterized code generators whose parameters are germane to the resulting machine performance. Second, the generated code follows the PHiPAC (Portable High Performance ANSI C) coding suggestions, which include manual loop unrolling, explicit removal of unnecessary dependencies in code blocks (if not removed, C semantics would prohibit many optimizations), and use of machine-sympathetic C constructs. Third, we develop search scripts that, for a given code generator, find the best set of parameters for a given architecture/compiler. We have developed a BLAS-GEMM compatible multi-level cache-blocked matrix-matrix multiply code generator that has achieved performance around 90% of peak on the SPARCstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and 80% of peak on the SGI Indigo R4k. On the IBM, HP, SGI R4k, and the Sun Ultra-170, the resulting DGEMM is, in fact, faster than the GEMM in the vendor-optimized BLAS. Other generators, search scripts, and performance results are under development.
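To make the coding suggestions above concrete, here is a minimal sketch in C of the kind of inner loop such a generator might emit. It is not PHiPAC output: the function names `gemm_naive` and `gemm_2x2` and the fixed 2x2 register block are assumptions chosen for illustration, whereas a real generator would treat the block sizes as parameters to be searched. The sketch shows manual unrolling over a 2x2 block of C, with operands and accumulators held in local variables so that the compiler does not have to assume that stores to C alias the loads from A and B.

```c
#include <assert.h>
#include <stddef.h>

/* Reference triple loop: C += A * B, all matrices row-major.
   A is M x K, B is K x N, C is M x N. */
void gemm_naive(size_t M, size_t K, size_t N,
                const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < K; k++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}

/* Hypothetical 2x2 register-blocked variant (illustration only).
   Assumes M and N are even; a generator would also emit cleanup code
   for the fringes and choose the block shape by search. */
void gemm_2x2(size_t M, size_t K, size_t N,
              const double *A, const double *B, double *C)
{
    assert(M % 2 == 0 && N % 2 == 0);

    for (size_t i = 0; i < M; i += 2) {
        for (size_t j = 0; j < N; j += 2) {
            /* Accumulate the 2x2 block of C in locals, which the
               compiler is free to keep in registers. */
            double c00 = C[i * N + j];
            double c01 = C[i * N + j + 1];
            double c10 = C[(i + 1) * N + j];
            double c11 = C[(i + 1) * N + j + 1];

            const double *a0 = A + i * K;   /* row i of A     */
            const double *a1 = a0 + K;      /* row i + 1 of A */
            const double *b  = B + j;       /* walks down columns j, j + 1 of B */

            for (size_t k = 0; k < K; k++) {
                /* Load each operand once; the four multiply-adds below
                   are independent of one another. */
                const double a0k = a0[k];
                const double a1k = a1[k];
                const double b0 = b[0];
                const double b1 = b[1];

                c00 += a0k * b0;
                c01 += a0k * b1;
                c10 += a1k * b0;
                c11 += a1k * b1;

                b += N;  /* advance to the next row of B */
            }

            /* Write the block back once, after the k loop. */
            C[i * N + j]           = c00;
            C[i * N + j + 1]       = c01;
            C[(i + 1) * N + j]     = c10;
            C[(i + 1) * N + j + 1] = c11;
        }
    }
}
```

Both routines add the products for each element of C in the same order, so they should produce identical results on the same inputs. The blocked version reads each loaded element of A and B twice per load instead of once, which is the effect the register-blocking parameters are meant to tune.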
The technical report describing the 1.0 release is now available (and is also available in pdf format).
Take a peek at our upcoming new web page.
The completely rewritten beta release of the PHiPAC matrix package, which includes faster parameter search, software pipelining, and a more efficient blocking strategy, is now available. A TR describing this release is also available.
Here is the old alpha release of the PHiPAC BLAS-compatible matrix-matrix multiply generator. If you have PHiPAC bugs or performance numbers to report, please send them to phipac AT icsi DOT berkeley DOT edu.
Here is an example of the bunch-mode MLP2 Backpropagation training algorithm using PHiPAC matrix-matrix multipliers. This speeds up training and forward passes by up to 4 times. The code is not particularly robust, and is meant primarily as an example, but should still be useful.
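As a rough illustration of why bunch mode benefits from a fast matrix-matrix multiply, the sketch below computes the pre-activations of one layer for a whole bunch of input vectors at once. It is not taken from the example code linked above: the function name `layer_preact_bunch`, the row-major layout, and the argument names are assumptions. The point is that processing `bunch` inputs together turns a sequence of matrix-vector products into a single matrix-matrix product, which is the step an optimized GEMM could replace.

```c
#include <stddef.h>

/* Hypothetical helper (illustration only).
   X:    bunch x n_in   input vectors, one per row
   W:    n_in  x n_out  layer weights
   bias: n_out          per-unit bias
   Y:    bunch x n_out  pre-activations, Y = X * W + bias
   All matrices are row-major. */
void layer_preact_bunch(size_t bunch, size_t n_in, size_t n_out,
                        const double *X, const double *W,
                        const double *bias, double *Y)
{
    /* Start every output row from the bias vector. */
    for (size_t r = 0; r < bunch; r++)
        for (size_t j = 0; j < n_out; j++)
            Y[r * n_out + j] = bias[j];

    /* Y += X * W: one matrix-matrix product for the whole bunch.
       With bunch == 1 this degenerates to a matrix-vector product. */
    for (size_t r = 0; r < bunch; r++)
        for (size_t k = 0; k < n_in; k++) {
            const double x = X[r * n_in + k];
            for (size_t j = 0; j < n_out; j++)
                Y[r * n_out + j] += x * W[k * n_out + j];
        }
}
```

The layer's nonlinearity would then be applied element-wise to `Y`; it is omitted here to keep the sketch focused on the matrix product.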
<bilmes AT cs DOT berkeley DOT edu>
Former Ph.D. Student (Now professor at UW)
Computer Science Division
Dept. of EECS
Berkeley CA, 94720
<bilmes AT icsi DOT berkeley DOT edu>
Former Research Assistant
International Computer Science Institute
1947 Center St. Suite 600
Berkeley CA, 94704