CGO 2009 Workshops and Tutorials


EPHAMWorkshop on Exploiting Parallelism using GPUs and other Hardware-Assisted MethodsProgram Finalized
ODES7th Workshop on Optimizations for DSP and Embedded
OPEN64 2nd Annual Workshop on Open64
WISH Workshop on Infrastructures for Software/Hardware co-design Program finalized
STMCS canceled


SSA-RA Tutorial on SSA-based Register Allocation
STM canceled
TBB canceled

Preliminary Schedule

Sunday, 22 March 2009

7:30am - 5:30pmRegistration - registration desk
7:45am - 8:30amBreakfast
8:30am - 10:30amMorning Session 1
10:30am - 11:00amBreak
11:00am - 12:30pm Morning Session 2
12:30pm - 1:30pmLunch
1:30pm - 3:00pmAfternoon Session 1
3:00pm - 3:30pmBreak
3:30pm - 5:30pmAfternoon Session 2


Room ARoom BRoom ERoom F
Afternoon ODES Open64WISH


ODES - 7th Workshop on Optimizations for DSP and Embedded Systems

Program now finalized!

Optimizations are crucial to meet the performance, power and cost requirements that DSP and embedded systems have. The aim of the ODES workshop is to give the opportunity to researchers and practitioners working on this, to share their findings and get feedback. We think interacting with the community is crucial to do relevant research, and therefore ODES tries to maximize the interaction by carefully selecting the program committee members that review the submissions and at the workshop itself, reserving enough time for discussion.

Topics of interest include, but are not limited to:

Please find more information at this workshop's URL:

Open64 Workshop

We invite you to participate in the Open64 Workshop (associated with CGO 2009) to take place on Sunday, March 22nd and to present your recent research & development work using the Open64 compiler infrastructure. Active Open64 researchers & developers from academia and industry will report their research projects, experimental results, and development experiences using the Open64 compiler.

A considerable amount of work has been performed with the Open64 compiler infrastructure over the last few years, we anticipate hearing a number of significant research topics and encouraging results at this workshop.

Topics of discussion will be (but are not limited to) findings and results in

The workshop provides a forum for discussion of your findings and experiences with a broad range of Open64 researchers and developers. It is also the main opportunity for the participants to exchange their expectations and wishes for future development of Open64.

Please find more information at this workshop's URL:

EPHAM - Exploiting Parallelism using GPUs and other Hardware-Assisted Methods

This workshop will focus on compilation techniques for exploiting parallelism in emerging massively multi-threaded and multi-core architectures, with a particular focus on the use of general-purpose GPU computing techniques to overcome traditional barriers to parallelization. Recently, GPUs have evolved to address programming of general purpose computations, especially those exemplified by data parallel models. This change will have long-term implications for languages, compilers, and programming models. Development of higher level programming languages, models and compilers that exploit such processors will be important. Clearly, the economics and performance of applications is affected by a transition to general-purpose GPU computing. This will require new ideas and directions as well as recasting some older techniques to the new paradigm

8:45am - 9:00amWelcome
9:00am - 10:30amProgramming Models
Design and Implementation of a High Level Framework for GPUs
Michael Wolfe
A MapReduce Framework in Heterogenous GPU Environment
Dehao Chen, Chuntao Hong, Wenguang Chen, Haibo Lin, Weimin Zheng
GPU Kernels as Data-Parallel Array Computations in Haskell
Sean Lee, Manuel M.T. Chakravarty, Vinod Grover, Gabriele Keller
10:30am - 11:00amBreak
11:00am - 12:00 noonApplication and Performance Frameworks
Analytical Performance Prediction for Evaluation and Tuning of GPGPU Applications
Sara Kaghsorkhi, Wen-mei Hwu
GPU-Accelerated Text Mining
Yongpeng Zhang, Frank Mueller, Xiaohui Cui, Thomas Potok
12:00 noon - 12:15pmWrapup

Topics of Interest

We invite papers in this emerging discipline which include, but are not limited, to the following areas of interest.

Important Dates

Feb. 15th 2009: Paper submission deadline
Mar. 8th 2009: Notification of acceptance
Mar. 15th 2009: Camera-ready version of papers due
Mar. 22nd 2009: The workshop

Workshop Organizers

Vinod Grover, NVIDIA Corporation
Richard Johnson, NVIDIA Corporation

Program Committee

Manuel M T Chakravarty, University of New South Wales
Rudi Eigenman, Purdue University
Anwar Ghuloum, Intel
Naga Govindaraju, Microsoft
Wen-mei Hwu, University of Illinois, Urbana-Champaign
Miriam Leeser, Northeastern University
Dinesh Manocha, University of North Carolina
Shane Ryoo, ZeroSoft
Bratin Saha, Intel
Bixia Zheng, AMD

Submission Guidelines

Papers of 6-10 pages may be submitted using any format. The abstract should clearly state the problem being studied, the methods used, and the results. If the results are preliminary, the authors should state their expectation for the final results. To submit, please send a pdf of your submission to Final submissions should use the standard ACM conference format (two columns with 9 pt Times Roman font, etc.)

Intel© Threading Building Blocks: Programming for Current and Future Multicore Platforms

    Michael Voss, Ph.D.
    Performance, Analysis and Threading Lab, Intel Corporation


An important key to unlocking the performance potential of current and future multicore platforms is to choose a programming model that lets you express concurrency without requiring that you explicitly manage it. This half-day tutorial introduces the Intel© Threading Building Blocks concurrency library, which offers a rich and complete approach to expressing parallelism in C++ programs.

By using a high-level concurrency platform like Intel© Threading Building Blocks (Intel TBB), developers express the concurrency in their applications and let the runtime library manage the low-level, platform-specific details of implementing and scheduling the concurrency on the hardware. It is these implementation and scheduling details that will need to change as architectures change. By using a concurrency platform instead of using a native threading library directly, an application is generally easier to write, is more likely to be correct, and will continue to enjoy increased performance as platforms evolve.

After completing this half-day tutorial, attendees will (1) understand why a concurrency library such as TBB is often a better choice than using a native threading library directly, (2) have a good understanding of the features and capabilities of the TBB library and (3) be positioned to begin using the library for developing their own applications.


Michael Voss is a Senior Staff Software Engineer in the Intel Performance, Analysis and Threading Lab where he is one of the senior developers of the Intel© Threading Building Blocks library. His interests include languages and compilers for parallel computing, adaptive compilation, and optimization. He received his Ph.D. from the School of Electrical and Computer Engineering at Purdue University in 2001. He was an assistant professor in the Edwards S. Rogers Department of Electrical and Computer Engineering at the University of Toronto from 2001 - 2005 and an adjunct Professor from 2005 - 2007.


    1.  Exploiting Multicore Platforms (15 minutes)
        a.  Increasing core counts is the new free lunch
        b.  Threading to the platform is the wrong way to go
            i.  Achieving portable performance is very difficult
            ii. High potential to introduce bugs
            iii. It does not provide a path forward!
        c.  Concurrency platforms are the key to forward-scaling parallelism
            i.  Intel TBB / Microsoft* Parallel Pattern Library
            ii. the OpenMP API
            iii. Data parallel programming models

    2.  An overview of Intel Threading Building Blocks (15 minutes)
        a.  A generic C++ library
        b.  Available as a commercial and open-source package
        c.  Overview of the basic components available in the library

    3.  Concurrent Containers (30 minutes)
        a.  Introduction to concurrent_map, concurrent_queue and concurrent_vector
        b.  compare / contrast with other approaches
        c.  Provide code examples

    4.  Parallel Algorithms (30 minutes)
        a.  introduction to parallel_for, parallel_reduce, parallel_do and pipeline
        b.  discuss Range concept and Partitioner hint
        c.  compare and contrast with other approaches
        d.  Provide code examples

    ---- Mid-point Break ---

    5.  Tasks and the task scheduler (45 minutes)
        a.  Introduction to work-stealing task schedulers
        b.  How to define tasks using Intel TBB
        c.  Provide code examples

    6.  Exception handling and cancellation support (15 minutes)
        a.  The Intel TBB model exceptions and cancellation
        b.  Provide code examples

    7.  tbb_thread (15 minutes)
        a.  An overview of the thread interface
        b.  When to use task and when to use threads
        c.  Provide code examples

    8.  Wrap-up (15 minutes)
        a.  Future directions for TBB
        b.  How to get TBB
        c.  How to interact with the TBB community

Community WEB page:

Tutorial on Software Transactional Memory

    Yang Ni, Ph.D.
    Adam Welc, Ph.D.
    Programming Systems Lab, Intel Corporation


The advent of multi-core processors in the mainstream computing market is forcing programmers to shift from writing sequential code to building concurrent applications. Today, concurrent programs are typically synchronized using locks. However, it is hard to write correct and scalable programs using locks. Fine-grained locks may not compose and are prone to deadlocks. Coarse-grained locks may not deliver the performance promised by multi-core processors. Recently, Transactional Memory (TM) has been proposed as an alternative to locks for multi-core programming. TM has attracted a lot of attention as a safer, more modular, and more scalable concurrency control mechanism.

This tutorial will provide a comprehensive overview of transactional memory: description of transactional language extensions and their semantics as well as discussion of important aspects of a software transactional memory (STM) system implementation. We will show how to extend C/C++ with transactional constructs and how these constructs alleviate some of the problems related to programming with locks. We will also present the fundamentals of implementing STM algorithms for C/C++ in the TM runtime, describe how TM support can be elegantly expressed in a form of an application interface and discuss generation of the code by the TM compiler that makes appropriate use of such interface.


Yang Ni is a Research Scientist in Intel's Programming Systems Lab. He has been working on programming languages for platforms from mobile devices to chip multi processors. His current research focuses on transactional memory. He is a major contributor to the Intel C/C++ TM compiler. Yang received his Ph.D. in Computer Science from Rutgers University.

Adam Welc is a Research Scientist in Intel's Programming Systems Lab. His work is in the area of programming language design and implementation, with specific interests in concurrency control, compiler and run-time system optimizations, transactional processing as well as architectural support for programming languages and applications. Adam received the Master of Science in Computer Science from Poznan University of Technology, Poland, in July 1999. He continued his graduate studies at Purdue University, receiving the Master of Science in Computer Science in May 2003, and the Ph.D. in Computer Science in March 2006.

WISH - Workshop on Infrastructures for Software/Hardware co-design

A major hindrance to the development of co-designed hardware and software systems is the availability of fast, accurate, and reliable infrastructures for performance evaluation and analysis. Simultaneously varying both the software and hardware components of a system introduces complexities which render traditional evaluation methodologies unusable.

Traditional evaluation methodologies assume that either the hardware or the software components of a system are fixed. For example, an accepted methodology for evaluating the effectiveness of a compiler optimization is to compare the execution of two differently compiled binaries on the same hardware. Likewise, an accepted methodology for evaluating microarchitectural hardware changes has been to measure relatively short, but representative, samples of the same program's execution with multiple configurations of a detailed timing simulator.

Nonetheless, the co-design of hardware and software systems is being pursued as part of many industry and academic projects. Researchers have therefore been forced to build their own custom infrastructures and invent methodologies to demonstrate the viability of their ideas. The purpose of this workshop is for experienced practitioners in this area to share their gained expertise and knowledge to a wider audience in the hopes of broadening community understanding. Identifying readily-available building blocks and tools, as well as opportunities for further improvements in this area are the goals of this workshop.

Topics of interest include, but are not limited to:

Important Dates

Submissions guidelines

Authors should submit a 3 page double spaced extended abstract by February 16th, 2009 to Final submission is a slide set for a 20 minute presentation.

General Chair

Uma Srinivasan (Intel)

Program Chair

Anne Holler (VMware)

Program Committee

Jim Callister, Intel
Robert HUndt, Google
Richard Johnson, NVIDIA Corporation
Naveen Neelakantam, University of Illinois, Urbana-Champaign
Uma Srinivasan, Intel
Pratap Subrahmanyam, VMware
Craig Zilles, University of Illinois, Urbana-Champaign


Schedule for Workshop on Infrastructures for Software/Hardware co-design (WISH) Sunday, 22 March 2009, Afternoon, Room 4
1:30-2:30 Keynote
David R. Ditzel
2:30-3:00 "Dynamic Binary Translation from 32-bit to 64-bit mode for x86 Virtualization"
Yu-Hsin Joyce Chen
3:00-3:30 Break
3:30-4:00 "Run-time Data Dependence Analysis Using Detected Loop Regions in Binary Codes"
Yukinori SATO and Tadao NAKAMURA
4:00-4:30 "Cheetah: A Light-Weight, Super-Fast Tracing Infrastructure for Itanium Processors"
Ram Srinivasan, Chris Krieger, Jim Callister
4:30-5:00 "Simulation-based Performance Evaluation of Co-designed Hardware/Software"
Naveen Neelakantam and Craig Zilles
5:00-5:30 "Measure of Similarity Degree between loops Based on Graph Dependency"
Lamia Djoudi and William Jalby

Tutorial on SSA-based Register Allocation

    Philip Brisk, EPFL
    Alain Darte, ENS Lyon
    Jens Palsberg, UCLA
    Fabrice Rastello, ENS Lyon


In the past years, we discovered that the interference graphs of SSA-form programs are chordal. This has several attractive implications for register allocation: First, coloring the interference graph is no longer NP-complete. Second, the number of needed registers is generally less than in non SSA-form programs. Finally, this number is equal to maximal number of simultaneously live variables.

This opens the door to what we call SSA-based register allocation. The main advantage of this new approach, is that the two problems, spilling, then coloring/coalescing can be optimized separately in two consecutive (more clean and simpler) phases. This is especially interesting for the design of compilers (either aggressive or just in time) for embedded processors.

The goal of this tutorial is to explain the basics of this in-two-phases approach and to provide the elements to help compiler writers build both memory friendly, fast and competitive (fewer spill code) register allocators. To illustrate this, we will also present new possible approaches to spilling, coalescing, and SSA destruction, as well as experiments with an SSA-based register allocator.


We will start from scratch, explain the key ideas in detail via examples, and present both theoretical and experimental results. We have divided the material such that each speaker will present a well-rounded portion.

  1. Chaitin's et al. NP-completeness theorem, SSA form, polynomial time: Alain Darte
  2. Spilling, interval graphs, interprocedural: Philip Brisk
  3. Coalescing: Fabrice Rastello
  4. SSA destruction, experiments: Jens Palsberg


Philip Brisk received his B.S., M.S., and Ph.D. degrees, all in Computer Science, from UCLA in 2002, 2003, and 2006 respectively. Since 2006, he has been a postdoctoral scholar with the Ecole Polytechnique Federale de Lausanne (EPFL). His research interests include application-specific processor design, reconfigurable computing, and compilers.

Alain Darte is a CNRS research scientist at LIP, ENS Lyon, France. He received his Ph.D. degree at ENS-Lyon in 1993. His main scientific interests are in mathematical tools, automatic program transformations, and optimizations for parallelizing compilers and for compiler-based tools used to automatically synthesize hardware accelerators. His main contributions concern polyhedral techniques for loop transformations and memory optimizations, software pipelining, and register allocation, in connexion with industrial partners such as Hewlett Packard (past Pico project) and STMicroelectronics. He is member of the editorial board of ACM TECS and has served in many program committees such as ASAP, CASES, CC, CGO, DATE, Euro-Par, ICS, PLDI.

Jens Palsberg is a Professor of Computer Science at UCLA. He received a Ph.D. in Computer Science from University of Aarhus, Denmark in 1992. His research interests span the areas of compilers, embedded systems, programming languages, software engineering, and information security. He is an associate editor of ACM Transactions of Programming Languages and Systems, a member of the editorial board of Information and Computation, and a former member of the editorial board of IEEE Transactions on Software Engineering. He is serving as the vice chair of ACM SIGBED, Special Interest Group on Embedded Systems, and he has served as vice chair of computer science at UCLA, as associate head of computer science at Purdue University, as general chair of POPL, as conference chair of LICS, and as a program chair for POPL, EMSOFT, MEMOCODE, PASTE, SAS, SREIS, and TACAS.

Fabrice Rastello is a researcher of Computer Science at Inria, ENS Lyon, France. He has done his PhD thesis at LIP, ENS Lyon, France (received in 2000) in the field of automatic parallelization (tiling, heterogeneous computing). Then he worked for two years in a compiler group at STMicroelectronics (Grenoble, France) where he mainly developed back-end optimizations in LAO (linear assembly optimizer). His research topic is the compiler optimizations for DSP/VLIW/Media like embedded processors. He still works in collaboration with STmicroelectronics' compiler team for the ST200 family DSP processors. His current research work is focused on JIT compilation, SSA-based optimizations, code compression, register allocation and instruction-cache optimizations.