Program Construction
AS Level — Unit 1: Fundamentals of Computer Science
Compilers, Interpreters and Assemblers
Before a program written in a high-level or assembly language can run on a computer, it must be translated into machine code. Three types of translator perform this task: compilers, interpreters, and assemblers.
Compilers
A compiler translates an entire high-level language program (the source code) into machine code (the object code) in one complete process before the program is run.
- The source code is read as a whole and translated into a standalone executable.
- Once compiled, the object code can be run repeatedly without the source code or the compiler.
- Errors are reported after the entire program has been analysed, producing an error list.
- Examples: C, C++, Pascal.
Interpreters
An interpreter translates and executes a high-level language program one line at a time, without producing a separate object code file.
- Each statement is translated and executed immediately.
- Execution stops as soon as an error is encountered and an error message is produced.
- The source code and interpreter must both be present every time the program is run.
- Generally slower than compiled programs because translation happens at runtime.
- Examples: Python (in interactive mode), early BASIC.
Assemblers
An assembler translates assembly language (a low-level, mnemonic-based language) into machine code.
- Assembly language uses mnemonics (e.g.
MOV,ADD,JMP) that correspond closely to machine code instructions. - Each assembly language instruction typically translates to one machine code instruction.
- The process is simpler than compilation because there is an almost one-to-one correspondence between assembly mnemonics and machine code.
Comparison
| Feature | Compiler | Interpreter | Assembler |
|---|---|---|---|
| Input language | High-level | High-level | Assembly (low-level) |
| Translation method | Whole program at once | One line at a time | Whole program at once |
| Output | Standalone object code | No separate output | Machine code |
| Execution speed | Fast (compiled once) | Slower (translated at runtime) | Fast |
| Error reporting | Full list after analysis | Stops at first error | Error list |
| Source code needed at runtime | No | Yes | No |
A compiler translates a complete high-level language program into machine code before execution. An interpreter translates and executes a high-level language program one statement at a time. An assembler translates assembly language mnemonics into machine code.
Stages of Compilation
The compilation process is not a single step — it is a pipeline of distinct stages, each transforming the source code into a form that is easier for the next stage to process.
1. Lexical Analysis
Lexical analysis (also called scanning or tokenising) is the first stage of compilation. The lexical analyser (or lexer) reads the raw source code character by character and converts it into a stream of tokens.
A token is a meaningful unit of the language: a keyword, identifier, operator, literal value, or punctuation symbol.
What the lexical analyser does:
- Removes whitespace (spaces, tabs, newlines) and comments, as these have no meaning in the compiled output.
- Identifies and classifies each token (e.g.
IF→ keyword,totalScore→ identifier,+→ operator,42→ integer literal). - Reports lexical errors such as illegal characters.
Example:
Source code: total = price + tax
| Token | Type |
|---|---|
total |
Identifier |
= |
Assignment operator |
price |
Identifier |
+ |
Arithmetic operator |
tax |
Identifier |
Lexical analysis converts raw source code into a stream of tokens — the smallest meaningful units of the language (keywords, identifiers, operators, literals). Whitespace and comments are discarded.
2. Symbol Table Construction
As the lexical analyser identifies tokens, the compiler builds a symbol table — a data structure that stores information about every identifier (variable, constant, function, procedure, class) found in the source code.
Information stored in the symbol table for each identifier:
- Its name (e.g.
totalScore) - Its type (e.g. integer, real, boolean)
- Its scope (where in the program it is accessible: local or global)
- Its memory address or location (assigned during code generation)
- Other attributes such as whether it is a constant, parameter, or array
Why the symbol table is important:
- It is consulted by later stages (particularly semantic analysis and code generation) to look up identifier details.
- It allows the compiler to detect errors such as undeclared variables or type mismatches.
- It enables the compiler to assign memory addresses efficiently.
The symbol table is a data structure built during compilation that records all identifiers in the source program, along with their name, data type, scope, and memory location. It is used throughout all subsequent compilation stages.
3. Syntax Analysis
Syntax analysis (also called parsing) checks whether the stream of tokens produced by the lexical analyser conforms to the grammar rules of the programming language.
The parser groups tokens into grammatical structures and builds a parse tree (or abstract syntax tree, AST) — a hierarchical representation of the program’s structure.
What the parser checks:
- That statements follow the correct grammatical form (e.g. an
IFstatement must be followed by a condition, thenTHEN, then a body, and optionallyELSE). - That brackets, parentheses, and
BEGIN/ENDblocks are correctly matched. - That expressions have the correct structure (e.g.
3 + * 5would be a syntax error).
What it does NOT check:
- Whether identifiers have been declared (that is semantic analysis).
- Whether types are compatible (also semantic analysis).
Example of a syntax error: IF x > 5 PRINT "Yes" (missing THEN keyword).
Syntax analysis (parsing) checks that the token sequence conforms to the grammar of the language and builds a parse tree representing the program’s grammatical structure. It reports syntax errors where the grammar rules are violated.
4. Semantic Analysis
Semantic analysis checks the meaning of the program — things that are grammatically correct but logically invalid.
The semantic analyser works with the parse tree and the symbol table to perform checks including:
| Check | Example Error |
|---|---|
| Type compatibility | Adding a string to an integer: name + 5 |
| Variable declaration | Using a variable that has not been declared |
| Scope checking | Accessing a local variable outside its subroutine |
| Function call matching | Calling a function with the wrong number or types of arguments |
| Array bounds | Using a negative index or an index outside the declared range (in some languages) |
The semantic analyser annotates the parse tree with type information and may also perform type coercion where permitted (e.g. automatically converting an integer to a real when required).
Example of a semantic error: total = "hello" + 5 — syntactically valid (an assignment with an expression), but semantically invalid because a string and an integer cannot be added.
Semantic analysis checks the meaning and logical consistency of the parse tree. It verifies type compatibility, that all identifiers are declared, and that operations are used correctly. It reports semantic errors that are syntactically valid but logically incorrect.
5. Code Generation
Code generation is the stage where the compiler translates the verified and annotated parse tree into machine code (or sometimes an intermediate code such as bytecode).
What the code generator does:
- Translates each construct in the parse tree into the equivalent machine code instructions.
- Allocates memory addresses for variables and other data, using the symbol table.
- Generates code to handle subroutine calls, parameter passing, and return values.
- Produces code for the target processor’s instruction set.
In some compilers, an intermediate code (such as three-address code or bytecode) is generated first, which is then translated to machine code in a final step. This makes it easier to target different processors.
Code generation translates the annotated parse tree into machine code (or intermediate code) that can be executed by the target processor. Memory addresses are assigned to variables using the symbol table.
6. Code Optimisation
Code optimisation improves the generated machine code to make the program run more efficiently (faster, using less memory, or both), without changing what the program does.
Optimisation can be applied at different stages (e.g. on the intermediate code or the final machine code). Common optimisation techniques include:
| Technique | Description | Example |
|---|---|---|
| Removing redundant code | Eliminating instructions that have no effect | x = x + 0 → removed |
| Constant folding | Evaluating constant expressions at compile time rather than at runtime | area = 3.14159 * 4 * 4 → computed to 50.265... |
| Loop optimisation | Moving calculations that do not change out of loops | Computing a constant value inside a loop is moved before the loop |
| Strength reduction | Replacing expensive operations with cheaper equivalents | Replacing multiplication by a power of 2 with a bit shift |
| Dead code elimination | Removing code that can never be reached or whose result is never used | A branch that can never be taken |
| Register allocation | Keeping frequently used values in CPU registers rather than RAM to reduce memory accesses | Keeping a loop counter in a register |
Code optimisation transforms machine code or intermediate code to improve efficiency (speed or memory use) without changing the program’s results. Techniques include constant folding, removing redundant code, loop optimisation, and dead code elimination.
Summary of Compilation Stages
flowchart TD
A[Source Code] --> B[Lexical Analysis\nTokenises source code\nRemoves whitespace & comments\nBuilds symbol table]
B --> C[Syntax Analysis\nChecks grammar rules\nBuilds parse tree]
C --> D[Semantic Analysis\nChecks meaning & types\nAnnotates parse tree]
D --> E[Code Generation\nProduces machine code\nAllocates memory]
E --> F[Code Optimisation\nImproves efficiency\nNo change to behaviour]
F --> G[Object Code / Executable]
| Stage | Input | Output | Errors Detected |
|---|---|---|---|
| Lexical analysis | Source code (characters) | Token stream | Illegal characters, invalid tokens |
| Symbol table construction | Identifiers from token stream | Symbol table | (Populated throughout compilation) |
| Syntax analysis | Token stream | Parse tree | Grammar violations (missing keywords, unmatched brackets) |
| Semantic analysis | Parse tree + symbol table | Annotated parse tree | Type mismatches, undeclared variables, scope errors |
| Code generation | Annotated parse tree | Machine/intermediate code | (Translation errors) |
| Code optimisation | Machine/intermediate code | Optimised machine code | N/A |
A common exam question asks you to name and describe the stages of compilation in order. Remember the sequence: Lexical → Syntax → Semantic → Code Generation → Optimisation, with the symbol table being built during lexical analysis and consulted throughout. Be clear on what each stage checks: lexical = valid tokens, syntax = grammar structure, semantic = meaning and types.