Program Construction

AS Level — Unit 1: Fundamentals of Computer Science

Compilers, Interpreters and Assemblers

Before a program written in a high-level or assembly language can run on a computer, it must be translated into machine code. Three types of translator perform this task: compilers, interpreters, and assemblers.

Compilers

A compiler translates an entire high-level language program (the source code) into machine code (the object code) in one complete process before the program is run.

The source code is read as a whole and translated into a standalone executable.
Once compiled, the object code can be run repeatedly without the source code or the compiler.
Errors are reported after the entire program has been analysed, producing an error list.
Examples: C, C++, Pascal.

Interpreters

An interpreter translates and executes a high-level language program one line at a time, without producing a separate object code file.

Each statement is translated and executed immediately.
Execution stops as soon as an error is encountered and an error message is produced.
The source code and interpreter must both be present every time the program is run.
Generally slower than compiled programs because translation happens at runtime.
Examples: Python (in interactive mode), early BASIC.

Assemblers

An assembler translates assembly language (a low-level, mnemonic-based language) into machine code.

Assembly language uses mnemonics (e.g. MOV, ADD, JMP) that correspond closely to machine code instructions.
Each assembly language instruction typically translates to one machine code instruction.
The process is simpler than compilation because there is an almost one-to-one correspondence between assembly mnemonics and machine code.

Comparison

Feature	Compiler	Interpreter	Assembler
Input language	High-level	High-level	Assembly (low-level)
Translation method	Whole program at once	One line at a time	Whole program at once
Output	Standalone object code	No separate output	Machine code
Execution speed	Fast (compiled once)	Slower (translated at runtime)	Fast
Error reporting	Full list after analysis	Stops at first error	Error list
Source code needed at runtime	No	Yes	No

A compiler translates a complete high-level language program into machine code before execution. An interpreter translates and executes a high-level language program one statement at a time. An assembler translates assembly language mnemonics into machine code.

Stages of Compilation

The compilation process is not a single step — it is a pipeline of distinct stages, each transforming the source code into a form that is easier for the next stage to process.

1. Lexical Analysis

Lexical analysis (also called scanning or tokenising) is the first stage of compilation. The lexical analyser (or lexer) reads the raw source code character by character and converts it into a stream of tokens.

A token is a meaningful unit of the language: a keyword, identifier, operator, literal value, or punctuation symbol.

What the lexical analyser does:

Removes whitespace (spaces, tabs, newlines) and comments, as these have no meaning in the compiled output.
Identifies and classifies each token (e.g. IF → keyword, totalScore → identifier, + → operator, 42 → integer literal).
Reports lexical errors such as illegal characters.

Example:

Source code: total = price + tax

Token	Type
`total`	Identifier
`=`	Assignment operator
`price`	Identifier
`+`	Arithmetic operator
`tax`	Identifier

Lexical analysis converts raw source code into a stream of tokens — the smallest meaningful units of the language (keywords, identifiers, operators, literals). Whitespace and comments are discarded.

2. Symbol Table Construction

As the lexical analyser identifies tokens, the compiler builds a symbol table — a data structure that stores information about every identifier (variable, constant, function, procedure, class) found in the source code.

Information stored in the symbol table for each identifier:

Its name (e.g. totalScore)
Its type (e.g. integer, real, boolean)
Its scope (where in the program it is accessible: local or global)
Its memory address or location (assigned during code generation)
Other attributes such as whether it is a constant, parameter, or array

Why the symbol table is important:

It is consulted by later stages (particularly semantic analysis and code generation) to look up identifier details.
It allows the compiler to detect errors such as undeclared variables or type mismatches.
It enables the compiler to assign memory addresses efficiently.

The symbol table is a data structure built during compilation that records all identifiers in the source program, along with their name, data type, scope, and memory location. It is used throughout all subsequent compilation stages.

3. Syntax Analysis

Syntax analysis (also called parsing) checks whether the stream of tokens produced by the lexical analyser conforms to the grammar rules of the programming language.

The parser groups tokens into grammatical structures and builds a parse tree (or abstract syntax tree, AST) — a hierarchical representation of the program’s structure.

What the parser checks:

That statements follow the correct grammatical form (e.g. an IF statement must be followed by a condition, then THEN, then a body, and optionally ELSE).
That brackets, parentheses, and BEGIN/END blocks are correctly matched.
That expressions have the correct structure (e.g. 3 + * 5 would be a syntax error).

What it does NOT check:

Whether identifiers have been declared (that is semantic analysis).
Whether types are compatible (also semantic analysis).

Example of a syntax error: IF x > 5 PRINT "Yes" (missing THEN keyword).

Syntax analysis (parsing) checks that the token sequence conforms to the grammar of the language and builds a parse tree representing the program’s grammatical structure. It reports syntax errors where the grammar rules are violated.

4. Semantic Analysis

Semantic analysis checks the meaning of the program — things that are grammatically correct but logically invalid.

The semantic analyser works with the parse tree and the symbol table to perform checks including:

Check	Example Error
Type compatibility	Adding a string to an integer: `name + 5`
Variable declaration	Using a variable that has not been declared
Scope checking	Accessing a local variable outside its subroutine
Function call matching	Calling a function with the wrong number or types of arguments
Array bounds	Using a negative index or an index outside the declared range (in some languages)

The semantic analyser annotates the parse tree with type information and may also perform type coercion where permitted (e.g. automatically converting an integer to a real when required).

Example of a semantic error: total = "hello" + 5 — syntactically valid (an assignment with an expression), but semantically invalid because a string and an integer cannot be added.

Semantic analysis checks the meaning and logical consistency of the parse tree. It verifies type compatibility, that all identifiers are declared, and that operations are used correctly. It reports semantic errors that are syntactically valid but logically incorrect.

5. Code Generation

Code generation is the stage where the compiler translates the verified and annotated parse tree into machine code (or sometimes an intermediate code such as bytecode).

What the code generator does:

Translates each construct in the parse tree into the equivalent machine code instructions.
Allocates memory addresses for variables and other data, using the symbol table.
Generates code to handle subroutine calls, parameter passing, and return values.
Produces code for the target processor’s instruction set.

In some compilers, an intermediate code (such as three-address code or bytecode) is generated first, which is then translated to machine code in a final step. This makes it easier to target different processors.

Code generation translates the annotated parse tree into machine code (or intermediate code) that can be executed by the target processor. Memory addresses are assigned to variables using the symbol table.

6. Code Optimisation

Code optimisation improves the generated machine code to make the program run more efficiently (faster, using less memory, or both), without changing what the program does.

Optimisation can be applied at different stages (e.g. on the intermediate code or the final machine code). Common optimisation techniques include:

Technique	Description	Example
Removing redundant code	Eliminating instructions that have no effect	`x = x + 0` → removed
Constant folding	Evaluating constant expressions at compile time rather than at runtime	`area = 3.14159 * 4 * 4` → computed to `50.265...`
Loop optimisation	Moving calculations that do not change out of loops	Computing a constant value inside a loop is moved before the loop
Strength reduction	Replacing expensive operations with cheaper equivalents	Replacing multiplication by a power of 2 with a bit shift
Dead code elimination	Removing code that can never be reached or whose result is never used	A branch that can never be taken
Register allocation	Keeping frequently used values in CPU registers rather than RAM to reduce memory accesses	Keeping a loop counter in a register

Code optimisation transforms machine code or intermediate code to improve efficiency (speed or memory use) without changing the program’s results. Techniques include constant folding, removing redundant code, loop optimisation, and dead code elimination.

Summary of Compilation Stages

flowchart TD
    A[Source Code] --> B[Lexical Analysis\nTokenises source code\nRemoves whitespace & comments\nBuilds symbol table]
    B --> C[Syntax Analysis\nChecks grammar rules\nBuilds parse tree]
    C --> D[Semantic Analysis\nChecks meaning & types\nAnnotates parse tree]
    D --> E[Code Generation\nProduces machine code\nAllocates memory]
    E --> F[Code Optimisation\nImproves efficiency\nNo change to behaviour]
    F --> G[Object Code / Executable]

Stage	Input	Output	Errors Detected
Lexical analysis	Source code (characters)	Token stream	Illegal characters, invalid tokens
Symbol table construction	Identifiers from token stream	Symbol table	(Populated throughout compilation)
Syntax analysis	Token stream	Parse tree	Grammar violations (missing keywords, unmatched brackets)
Semantic analysis	Parse tree + symbol table	Annotated parse tree	Type mismatches, undeclared variables, scope errors
Code generation	Annotated parse tree	Machine/intermediate code	(Translation errors)
Code optimisation	Machine/intermediate code	Optimised machine code	N/A

A common exam question asks you to name and describe the stages of compilation in order. Remember the sequence: Lexical → Syntax → Semantic → Code Generation → Optimisation, with the symbol table being built during lexical analysis and consulted throughout. Be clear on what each stage checks: lexical = valid tokens, syntax = grammar structure, semantic = meaning and types.

← Software Engineering The Need for Different Types of Software Systems →