Phases of Compiler Design


Computer technology has evolved to a great extent and so have the programming languages. Today, a programmer can choose from a plethora of programming languages (High Level Languages) according to his needs. But the computer only understands the language of 1’s and 0’s (Low level language). And hence, it’s a daunting task to make the computer understand these languages. And who does this work? Yes, a compiler!

In simplest terms, a compiler is a program that converts a source language into a target language. For now, we can think of a compiler as a magic black box which takes in a program written by us (the programmer) and translates it into a form which can be understood by the computer. Simple enough, right?

Fig. 1

Now, any text file cannot be converted into an equivalent machine code. The source program must conform to all the specifications of the language in order to be transformed into an equivalent machine code. We will see how this works, if we look into the magic box a little. The compiler is designed into two parts.The first phase is the analysis phase while the second phase is called synthesis.

The main objective of the analysis phase is to break the source code into parts. It then arranges these pieces into a meaningful structure (or grammar of the language). Lexical analysis, syntax analysis and semantic analysis constitute the analysis phase.( Don’t panic if the terms don’t seem familiar, we’ll know what they mean in a while!)

Lexical Analysis:

Stream of characters in the source program is grouped into meaningful sequences called lexemes. Tokens are produced for each lexeme. A token is an abstract symbol generated during lexical analysis. We need not worry about how this is done. At this stage we simply try to understand the concept. Consider a simple C statement:

j = i+3;

The following token will be produced for the above code:

<id2>  <=> <id1> <+> <3><EOS>

  • id1 and id2 are used to denote identifiers or variables.
  • Rest of the tokens (<+>,  <=>, <EOS> & <3>) indicate “+” ,  “equal”, “End of Statement”  and the number 3 respectively.

Note that “if” is a keyword defined in C. Let us see what happens if we type the following code by mistake:

j = if+3;

<id2> <=> <if token > <+> <3>

The above series of tokens does not make sense in the true sense but is nevertheless generated by the lexical analyser.

Generally, a token has an attribute value attached to it. It denotes the position of the variable in a symbol table. A symbol table is a table which stores information about an identifier and is referred at various stages of compilation.

Syntax Analysis:

The syntax analyzer checks each line of the code and spots every tiny mistake that the programmer has committed while typing the code. In fact, it even suggests possible ways to rectify our errors. Ever wondered how that happens?  Well, of course it’s not magic! The compiler follows a detailed procedure using the tokens creates by the lexical analyzer and creates a tree-like structure called the syntax tree. However, at this point it is sufficient to understand exactly what type or errors are detected during syntax analysis.

The syntax analyzer checks whether the order of tokens conform to the rules of the programming language. Unmatched parenthesis, missing semicolons are some of the errors detected in this phase. For instance,


  1. System.out.println(“Hello World);        /* Unterminated string. “ expected after ‘World’*/
  2. x = y+5      // missing semicolon after the statement
  3. if( str==NULL)
    System.out.println(“invalid string”; 
    /* mismatched parenthesis */

If there are no errors in the code, the syntax analyser successfully constructs a syntax tree which is later used by the semantic analyser.

Semantic Analysis:

“Semantic” by definition is concerned with meanings. A semantic analyser is mainly concerned with what the program means and how it executes. Type checking is an important aspect of semantic analysis where each operator should be compatible with its operands.

Ex 1:

int a[4];
float x;

x = 1;
a[x]= 3; // invalid statement


The code given seems fine at the first glance, with all the colons and parenthesis in the right place, and you expect the answer to be 3. But there is a basic error: x is of type float and the index of an array should be of type int. These types of errors are detected in semantic analysis.

Fig: 2 Illustration of how different types of errors are caught by the compiler


Fig. 2 An illustration of how errors are caught at various stages of compilation

A compiler may construct intermediate representations while converting a source program to a target program. The representation should be easy to convert into a target language. It is then passed onto the second phase of compiler design: the synthesis phase. This phase involves the actual construction of target program and includes code optimisation and code generation.


Code Optimisation:  As the name suggests, this phase aims at optimising the target code. The code can be optimised in terms of time taken to execute, length of the code, memory utilised or any other criteria. Consider the following example:

int a, b, c, d;
a = 2;
c= 4;
d = b + c;
d = a +b;

Clearly, the statement “d = b + c;” is redundant and is immediately overwritten by “d = a+b;”. Such statements may be eliminated in the optimization phase. It is evident that this phase directly affects the performance of the program. Hence, a considerable amount of time is spent on this phase.

Code Generation: Target code is generated at this phase using the intermediate representation of the source program. The machine instructions perform the same tasks as the intermediate code. Registers are allocated to variables in the program. This has to be done carefully so as to avoid any clashes or repeated assignments. Various algorithms have been formulated to generate the most efficient machine code.

Fig 3. shows a flow chart of various phases of compilation. Each phase produces an output which is used by the subsequent phases to finally generate the target code.

self creation: inspired from "Compilers by Aho, Lam, Sethi & Ullman"

Fig 3. Structure of a Compiler


What we just learnt is the tip of the iceberg! The study of compilers is vast and interesting at the same time. Extensive research is being done at each of the above levels to process programs in the most optimal manner. So, it’s a long way to go!

Meanwhile, here is a small exercise for you: Discuss the stages at which the following errors will be detected.


x = 1;

while(x<5; x!=3)


var = a + b;


char x[] = “hello”;

switch (x)

You can post your answers in the comments.

SEE ALSO Programming: A Mind Map of Inheritance


Leave A Reply