Compiler-mediated performance of a Cell Processor

Organizers: Kathryn O'Brien, Alexandre Eichenberger, Kevin O'Brien, and Michael Gschwind, IBM

Developed for multimedia and game workloads, the Cell processor supports both highly parallel codes, which have high computation and memory requirements, and scalar codes, which require fast response times and full-featured programming environments. The first-generation Cell processor implements on a single chip a POWER processor with two levels of cache and eight attached streaming processors with their own local memories and globally consistent DMA engines. In addition to processor-level parallelism, each processing element has Single Instruction Multiple Data (SIMD) units that can each process from 2 double-precision floats up to 16 chars per cycle.

The complexities of the Cell processor span multiple dimensions. At the most elementary level, the Cell system has two distinct processor types, each with its own application-level ISA. One ISA (PPE) is the familiar 64-bit PowerPC with VMX; the other (SPE) is a new 128-bit SIMD instruction set for multimedia and general floating-point processing. Typical applications on the Cell processor will combine codes that exploit both processor types. The pipelines of both processor types must be taken into account, and the SPE presents several challenges not seen in the PPE, chief among them its instruction prefetch capabilities and the significant branch miss penalties that result from the lack of hardware branch prediction. At the next level, the SPE is a short-SIMD, or multimedia, processor with scant support for scalar operations. The final dimension is the parallelism of the machine when applications are deployed across all SPEs.
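To make the "scant support for scalar operations" concrete, the following is a minimal sketch in portable C of how a scalar access can be expressed when the only memory operations available move whole, aligned 16-byte quadwords. It is an illustration under stated assumptions, not the code our compiler emits: the type quad_t and the functions load_quad, store_quad, scalar_load, and scalar_store are hypothetical names, memory is modeled as a byte array, and addresses are assumed to be 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit register viewed as four 32-bit lanes. */
typedef struct { int32_t lane[4]; } quad_t;

/* Model of a quadword-only load: the low four address bits are ignored,
 * so the access always covers the enclosing aligned 16-byte block. */
static quad_t load_quad(const unsigned char *mem, size_t addr) {
    quad_t q;
    memcpy(&q, mem + (addr & ~(size_t)15), sizeof q);
    return q;
}

/* Model of a quadword-only store to the enclosing aligned 16-byte block. */
static void store_quad(unsigned char *mem, size_t addr, quad_t q) {
    memcpy(mem + (addr & ~(size_t)15), &q, sizeof q);
}

/* Scalar load: fetch the whole quadword, then pick out the lane that the
 * address points to (assumes addr is a multiple of 4). */
int32_t scalar_load(const unsigned char *mem, size_t addr) {
    quad_t q = load_quad(mem, addr);
    return q.lane[(addr & 15) / 4];
}

/* Scalar store: read-modify-write of the enclosing quadword, so the three
 * neighboring lanes keep their previous values. */
void scalar_store(unsigned char *mem, size_t addr, int32_t value) {
    quad_t q = load_quad(mem, addr);
    q.lane[(addr & 15) / 4] = value;
    store_quad(mem, addr, q);
}
```

In this model every scalar store costs a load, a lane insert, and a store, which suggests why a compiler would try to keep scalars in a known lane or lay out data so that such sequences are rare.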

It has been demonstrated that expert programmers can develop and hand-tune applications to exploit the full performance potential of this machine. We believe that sophisticated compiler optimization technology can bridge the gap between usability and performance in this arena. To this end we have developed a research prototype compiler targeting the Cell processor. In this tutorial we discuss a variety of compiler techniques we have investigated and implemented, and their associated performance benefits. These techniques are aimed at automatically generating high-quality code over the broad spectrum of heterogeneous parallelism available on the Cell processor.

The tutorial will begin with a brief overview of the Cell architecture to motivate the discussion of compiling to exploit specific novel features of the architecture. The techniques we describe include compiler-supported branch prediction, compiler-assisted instruction fetch, and the generation of scalar codes on SIMD units. We will then discuss our techniques for automatically generating SIMD codes and for automatically parallelizing single programs across the multiple heterogeneous processors. In particular, we will describe our technique for presenting the user with a single shared memory image through compiler-controlled memory management, and discuss its performance. We will also report and discuss the results we have achieved to date, which indicate that significant speedup can be achieved on this processor with a high level of support from the compiler.
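One common way to present a single memory image on top of small local memories is a software-managed cache, in which the compiler replaces each access to a global address with a call that looks the line up in local memory and fetches it on a miss. The sketch below illustrates that general idea only; it is not a description of our implementation. All names (soft_cache, cache_init, cache_read) and the sizes LINE_BYTES and NUM_LINES are assumptions chosen for illustration, memcpy stands in for a DMA transfer, and writes and write-back are omitted.

```c
#include <stddef.h>
#include <string.h>

/* Assumed parameters for illustration only. */
enum { LINE_BYTES = 128, NUM_LINES = 4 };

/* Hypothetical direct-mapped, read-only software cache. */
typedef struct {
    const unsigned char *global;                /* simulated system memory */
    unsigned char data[NUM_LINES][LINE_BYTES];  /* lines held in local memory */
    size_t tag[NUM_LINES];                      /* global line number per slot */
    int valid[NUM_LINES];                       /* nonzero once a slot is filled */
    unsigned misses;                            /* number of line fetches */
} soft_cache;

/* Empty the cache and attach it to a simulated global memory. */
void cache_init(soft_cache *c, const unsigned char *global) {
    memset(c, 0, sizeof *c);
    c->global = global;
}

/* Read one byte at a global address through the cache. A compiler could
 * emit a call like this in place of a direct load from shared memory. */
unsigned char cache_read(soft_cache *c, size_t addr) {
    size_t line = addr / LINE_BYTES;   /* which global line holds addr */
    size_t slot = line % NUM_LINES;    /* direct-mapped placement */
    if (!c->valid[slot] || c->tag[slot] != line) {
        /* Miss: memcpy stands in for a DMA get of the whole line. */
        memcpy(c->data[slot], c->global + line * LINE_BYTES, LINE_BYTES);
        c->tag[slot] = line;
        c->valid[slot] = 1;
        c->misses++;
    }
    return c->data[slot][addr % LINE_BYTES];
}
```

In this sketch, two reads that fall in the same 128-byte line cost one fetch, while reads that map to the same slot from different lines evict each other. The lookup overhead on every access is the kind of cost that compiler analysis would aim to reduce.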

Intended Audience

This tutorial is intended for those with a background in Computer Architecture and Compiler Writing. Some knowledge of parallelization techniques would also be useful.

The Speakers