In this thesis we address the problem of integrated software pipelining for clustered VLIW architectures. The phases that are integrated and solved as one combined problem are: cluster assignment, instruction selection, scheduling, register allocation and spilling.

As a first step we describe two methods for integrated code generation of basic blocks. The first method is optimal and based on integer linear programming. The second method is a heuristic based on genetic algorithms.

We then extend the integer linear programming model to modulo scheduling. To the best of our knowledge, this is the first time the modulo scheduling problem has been solved optimally for clustered architectures with integrated instruction selection and cluster assignment.

We also show that, when the register files are clustered, optimal spilling is closely related to optimal register allocation. In fact, optimal spilling is as simple as adding an extra virtual register file representing the memory and having transfer instructions to and from this register file correspond to stores and loads.
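The idea can be illustrated with a small sketch (all names and capacities below are invented for illustration and are not part of the Optimist framework): memory is modeled as just another register file with effectively unlimited capacity, and transfers to or from it play the role of stores and loads.

```python
# Memory as a virtual register file: a spill is nothing more than a
# transfer whose source or destination is the MEM "register file".
# All instruction names here are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class RegisterFile:
    name: str
    capacity: int  # number of registers available in this file

# Two clustered register files plus a virtual one representing memory.
A = RegisterFile("A", 16)
B = RegisterFile("B", 16)
MEM = RegisterFile("MEM", 10**9)  # effectively unbounded

# Transfer instructions between files; transfers touching MEM are spill code.
TRANSFERS = {
    (A, B): "MV_AB",  (B, A): "MV_BA",
    (A, MEM): "ST_A", (MEM, A): "LD_A",
    (B, MEM): "ST_B", (MEM, B): "LD_B",
}

def transfer(src: RegisterFile, dst: RegisterFile) -> str:
    """Return the instruction that moves a value from src to dst."""
    return TRANSFERS[(src, dst)]

def is_spill(src: RegisterFile, dst: RegisterFile) -> bool:
    """A transfer is a spill (store or reload) iff it involves memory."""
    return MEM in (src, dst)

print(transfer(A, MEM), is_spill(A, MEM))  # ST_A True
print(transfer(A, B), is_spill(A, B))      # MV_AB False
```

With this view, a register allocator that already handles transfers between clustered register files handles spilling for free: no separate spill phase is needed.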

Our algorithm for modulo scheduling iteratively considers schedules with an increasing number of schedule slots. A problem with such an iterative method is that, if the initiation interval is not equal to the lower bound, there is no way to determine whether the solution found is optimal. We have proven that for a class of architectures, which we call transfer free, we can set an upper bound on the schedule length. That is, we can prove when a found modulo schedule with an initiation interval larger than the lower bound is optimal.
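The outer search can be sketched as follows. Here `try_schedule` stands in for the integer linear programming solver and `t_max_for` for the upper bound on the schedule length; both names are hypothetical, used only for illustration.

```python
# Schematic sketch of the iterative modulo scheduling search.
# min_ii would be the usual lower bound on the initiation interval;
# try_schedule(ii, t_max) is a placeholder for the actual solver.

def find_modulo_schedule(min_ii, max_ii, t_max_for, try_schedule):
    """Try initiation intervals in increasing order; for each II, search
    schedules with up to t_max_for(ii) slots. The upper bound on the
    schedule length is what turns "no schedule found for this II" into
    a proof of infeasibility rather than a guess, so the first II for
    which a schedule exists is provably optimal."""
    for ii in range(min_ii, max_ii + 1):
        schedule = try_schedule(ii, t_max_for(ii))
        if schedule is not None:
            return ii, schedule
    return None  # no schedule within the considered range

# Toy stand-in solver: pretend feasibility begins at II = 5.
result = find_modulo_schedule(
    min_ii=3, max_ii=10,
    t_max_for=lambda ii: 4 * ii,  # hypothetical upper bound
    try_schedule=lambda ii, t: "S" if ii >= 5 else None,
)
print(result)  # (5, 'S')
```

Without the proven bound on the schedule length, the inner search for a fixed II could in principle go on forever, and failing to find a schedule would prove nothing.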

We have conducted experiments that show the usefulness and limitations of our optimal methods. For the basic block case we compare the optimal method to the heuristic based on genetic algorithms.


This chapter gives a brief introduction to the area of integrated code generation and to the Optimist framework. We also list our contributions and describe the outline of the thesis.


A processor in an embedded device often spends the major part of its life executing a few lines of code over and over again. Finding ways to optimize these lines of code before the device is brought to market can make it possible to run the application on cheaper or more energy-efficient hardware. This fact motivates spending large amounts of time on aggressive code optimization. In this thesis we aim to improve current methods for code optimization by exploring ways to generate provably optimal code.

Compilers and code generation for VLIW architectures

This section contains a very brief description of compilers and VLIW architectures. For a more in-depth treatment of these topics, please refer to a good textbook in this area, such as the “Dragon book” [1].


Typically, a compiler is a program that translates computer programs from one language to another. In this thesis we consider compilers that translate human-readable code, e.g. C, into machine code for processors with static instruction-level parallelism. For such architectures, it is up to the compiler to expose the parallelism.

The front end of a compiler is the part that reads the input program and translates it into some intermediate representation (IR).

Code generation, which is the part of the compiler that we focus on in this thesis, is performed in the back end of a compiler. In essence, it is the process of creating executable code from the previously generated IR. A common way to do this is to perform three phases in some sequence:

    Instruction selection phase — Select target instructions matching the IR.

    Instruction scheduling phase — Map the selected instructions to time slots on which to execute them.

    Register allocation phase — Select registers in which intermediate values are to be stored.
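As a toy illustration of the three phases run in sequence, consider a two-operation IR. This is not the Optimist framework; all instruction names, latencies, and register names are invented for the example.

```python
# Three code generation phases applied one after another to a tiny IR.

IR = [("t1", "add", "a", "b"),   # t1 = a + b
      ("t2", "mul", "t1", "c")]  # t2 = t1 * c

def select_instructions(ir):
    """Phase 1: map each IR operation to a target instruction."""
    table = {"add": "ADD", "mul": "MUL"}
    return [(dst, table[op], x, y) for dst, op, x, y in ir]

def schedule(instrs, latency={"ADD": 1, "MUL": 2}):
    """Phase 2: assign each instruction a start cycle, honoring
    data dependences (an operand must be ready before use)."""
    ready, out = {}, []
    for dst, op, x, y in instrs:
        start = max(ready.get(x, 0), ready.get(y, 0))
        ready[dst] = start + latency[op]
        out.append((start, dst, op, x, y))
    return out

def allocate_registers(scheduled, regs=("r0", "r1")):
    """Phase 3: naive round-robin assignment of destination registers
    (a real allocator would consider live ranges and spilling)."""
    return {dst: regs[i % len(regs)]
            for i, (_, dst, *_rest) in enumerate(scheduled)}

code = schedule(select_instructions(IR))
print(code)                      # [(0, 't1', 'ADD', 'a', 'b'), (1, 't2', 'MUL', 't1', 'c')]
print(allocate_registers(code))  # {'t1': 'r0', 't2': 'r1'}
```

Note how each phase commits to decisions without knowing what the later phases will do; the choice of `MUL` fixes a latency that the scheduler must live with, and the schedule fixes live ranges that the allocator must live with. This is exactly the interdependence that motivates integrating the phases.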

While doing the phases in sequence is simpler and computationally cheaper, the phases are interdependent. Hence, integrating the phases of the code generator gives more opportunities for optimization. The cost of integrating the phases is that the size of the solution space increases greatly; there is a combinatorial explosion when decisions in all phases are considered simultaneously. This is especially the case for complicated processors with clustered register files and functional units, where many different target instructions may be applied to a single IR operation, and with both explicit and implicit transfers between the register clusters.