Algorithms & Programs
AS — Unit 1: Fundamentals of Computer Science
What Is an Algorithm?
An algorithm is a finite, ordered sequence of unambiguous instructions that, given a set of inputs, produces a defined set of outputs and terminates in a finite amount of time.
An algorithm is essentially a step-by-step recipe for solving a problem. Before any code is written, a programmer designs an algorithm to describe the logic of the solution. Algorithms are language-independent – the same algorithm can be implemented in Python, VB.NET, Java, or any other language.
Characteristics of an Algorithm
- Input – An algorithm has zero or more inputs taken from a specified set.
- Output – An algorithm produces one or more outputs that have a defined relationship to the inputs.
- Finiteness – An algorithm must always terminate after a finite number of steps.
- Definiteness – Each step must be precisely and unambiguously defined.
- Effectiveness – Every step must be basic enough that it could, in principle, be carried out by a person using pencil and paper.
Why Algorithms Matter
- They allow us to plan a solution before writing code.
- They help us communicate solutions to other programmers regardless of programming language.
- They allow us to analyse the efficiency of a solution before implementing it.
- They form the basis for correctness proofs and testing strategies.
Representing Algorithms
Before implementing an algorithm in code, we represent it using one of three standard methods.
Pseudocode
Pseudocode is a structured, English-like notation for describing algorithms. It is not a real programming language – there is no single standard syntax – but it follows logical programming constructs such as sequence, selection, and iteration.
Example – Finding the largest value in a list:
BEGIN
SET max TO list[0]
FOR i FROM 1 TO LENGTH(list) - 1
IF list[i] > max THEN
SET max TO list[i]
END IF
END FOR
OUTPUT max
END
Pseudocode advantages:
- Quick to write and easy to read.
- Not tied to any programming language.
- Focuses on logic rather than syntax.
Flowcharts
A flowchart is a diagrammatic representation of an algorithm using standard symbols:
| Symbol | Shape | Purpose |
|---|---|---|
| Terminator | Rounded rectangle (oval) | Start or end of the algorithm |
| Process | Rectangle | An instruction or action |
| Decision | Diamond | A yes/no or true/false question |
| Input/Output | Parallelogram | Data input or output |
| Flow line | Arrow | Shows the direction of flow |
Flowchart advantages:
- Visual representation makes logic easy to follow.
- Good for showing branching and looping clearly.
- Useful for non-programmers to understand the logic.
Flowchart disadvantages:
- Can become very large and unwieldy for complex algorithms.
- Harder to modify than pseudocode.
Structured English
Structured English uses a restricted subset of natural language combined with programming constructs. It avoids technical jargon and is useful for communicating with non-technical stakeholders.
Example:
1. Start with the first item in the list.
2. Compare it with every other item.
3. If a larger item is found, remember it as the new largest.
4. After checking all items, display the largest item.
In the exam, you may be asked to write an algorithm in pseudocode or to trace through a flowchart. Make sure you are comfortable with all three representations and can convert between them.
Linear Search
Linear search (also called sequential search) is an algorithm that checks every element in a list, one by one, from the beginning until the target value is found or the end of the list is reached.
How It Works
- Start at the first element of the list.
- Compare the current element with the target value.
- If they match, return the position (index) of the element.
- If they do not match, move to the next element.
- If the end of the list is reached without finding the target, report that the item was not found.
Code Examples
def linear_search(data, target):
for i in range(len(data)):
if data[i] == target:
return i # Return the index where target is found
return -1 # Target not found
# Example usage
numbers = [4, 7, 2, 9, 1, 5, 8]
result = linear_search(numbers, 9)
if result != -1:
print(f"Found at index {result}")
else:
print("Not found")
Function LinearSearch(data() As Integer, target As Integer) As Integer
For i As Integer = 0 To data.Length - 1
If data(i) = target Then
Return i ' Return the index where target is found
End If
Next
Return -1 ' Target not found
End Function
' Example usage
Dim numbers() As Integer = {4, 7, 2, 9, 1, 5, 8}
Dim result As Integer = LinearSearch(numbers, 9)
If result <> -1 Then
Console.WriteLine("Found at index " & result)
Else
Console.WriteLine("Not found")
End If
Performance
- Best case: O(1) – the target is the first element.
- Worst case: O(n) – the target is the last element or not present.
- Average case: O(n) – on average, half the list is searched.
Advantages and Disadvantages
| Advantages | Disadvantages |
|---|---|
| Simple to understand and implement | Slow for large data sets |
| Works on unsorted data | Checks every element in the worst case |
| Works on any data structure (arrays, linked lists) | Not efficient compared to binary search on sorted data |
Binary Search
Binary search is a search algorithm that works on sorted data by repeatedly dividing the search interval in half. It compares the target value to the middle element and eliminates half of the remaining elements at each step.
How It Works
- Set
lowto the first index andhighto the last index. - Find the middle index:
mid = (low + high) // 2. - If
data[mid]equals the target, returnmid. - If the target is less than
data[mid], sethigh = mid - 1(search the left half). - If the target is greater than
data[mid], setlow = mid + 1(search the right half). - Repeat steps 2-5 until the target is found or
low > high(target not present).
Code Examples
def binary_search(data, target):
low = 0
high = len(data) - 1
while low <= high:
mid = (low + high) // 2
if data[mid] == target:
return mid
elif data[mid] < target:
low = mid + 1
else:
high = mid - 1
return -1 # Target not found
# Example usage
numbers = [1, 3, 5, 7, 9, 11, 13, 15]
result = binary_search(numbers, 7)
if result != -1:
print(f"Found at index {result}")
else:
print("Not found")
Function BinarySearch(data() As Integer, target As Integer) As Integer
Dim low As Integer = 0
Dim high As Integer = data.Length - 1
While low <= high
Dim mid As Integer = (low + high) \ 2
If data(mid) = target Then
Return mid
ElseIf data(mid) < target Then
low = mid + 1
Else
high = mid - 1
End If
End While
Return -1 ' Target not found
End Function
' Example usage
Dim numbers() As Integer = {1, 3, 5, 7, 9, 11, 13, 15}
Dim result As Integer = BinarySearch(numbers, 7)
If result <> -1 Then
Console.WriteLine("Found at index " & result)
Else
Console.WriteLine("Not found")
End If
Performance
- Best case: O(1) – the target is the middle element.
- Worst case: O(log n) – the search space is halved at each step.
- Average case: O(log n).
Why O(log n)?
Each comparison eliminates half the remaining elements. For a list of 1,000,000 items, binary search needs at most about 20 comparisons (since log2(1,000,000) is approximately 20). Linear search would need up to 1,000,000 comparisons.
Comparison of Searching Algorithms
| Feature | Linear Search | Binary Search |
|---|---|---|
| Data must be sorted? | No | Yes |
| Best case | O(1) | O(1) |
| Worst case | O(n) | O(log n) |
| Average case | O(n) | O(log n) |
| Implementation complexity | Very simple | Slightly more complex |
| Suitable for | Small or unsorted data sets | Large, sorted data sets |
| Works on linked lists? | Yes (efficiently) | Not efficiently (no random access) |
A common exam question asks you to explain when you would use each algorithm. If the data is unsorted and small, linear search is appropriate. If the data is sorted and large, binary search is far more efficient. Remember: binary search requires sorted data – if the data is unsorted, the cost of sorting it first must be considered.
Bubble Sort
Bubble sort is a simple sorting algorithm that repeatedly steps through the list, compares adjacent elements, and swaps them if they are in the wrong order. The largest unsorted value “bubbles up” to the end of the list on each pass.
How It Works
- Start at the beginning of the list.
- Compare the first two adjacent elements. If the first is greater than the second, swap them.
- Move to the next pair and repeat the comparison and swap.
- Continue until the end of the list – this completes one pass. The largest element is now in its correct position.
- Repeat passes for the remaining unsorted portion of the list.
- If a complete pass is made with no swaps, the list is sorted and the algorithm terminates early.
Code Examples
def bubble_sort(data):
n = len(data)
for i in range(n - 1):
swapped = False
for j in range(n - 1 - i):
if data[j] > data[j + 1]:
data[j], data[j + 1] = data[j + 1], data[j]
swapped = True
if not swapped:
break # List is already sorted
return data
# Example usage
numbers = [64, 34, 25, 12, 22, 11, 90]
print(bubble_sort(numbers))
# Output: [11, 12, 22, 25, 34, 64, 90]
Sub BubbleSort(data() As Integer)
Dim n As Integer = data.Length
For i As Integer = 0 To n - 2
Dim swapped As Boolean = False
For j As Integer = 0 To n - 2 - i
If data(j) > data(j + 1) Then
Dim temp As Integer = data(j)
data(j) = data(j + 1)
data(j + 1) = temp
swapped = True
End If
Next
If Not swapped Then
Exit For ' List is already sorted
End If
Next
End Sub
' Example usage
Dim numbers() As Integer = {64, 34, 25, 12, 22, 11, 90}
BubbleSort(numbers)
For Each num As Integer In numbers
Console.Write(num & " ")
Next
' Output: 11 12 22 25 34 64 90
Performance
- Best case: O(n) – the list is already sorted (only one pass with no swaps needed).
- Worst case: O(n^2) – the list is in reverse order.
- Average case: O(n^2).
- Space complexity: O(1) – it sorts in place (no additional memory needed beyond a temporary swap variable).
Insertion Sort
Insertion sort builds the sorted list one element at a time. It takes each element from the unsorted portion and inserts it into its correct position within the sorted portion, shifting elements as necessary.
How It Works
- Consider the first element as a sorted sub-list of one element.
- Take the next element (the “key”) from the unsorted portion.
- Compare the key with each element in the sorted portion, moving from right to left.
- Shift elements in the sorted portion one position to the right to make space.
- Insert the key into its correct position.
- Repeat until all elements have been processed.
Code Examples
def insertion_sort(data):
for i in range(1, len(data)):
key = data[i]
j = i - 1
while j >= 0 and data[j] > key:
data[j + 1] = data[j] # Shift element to the right
j -= 1
data[j + 1] = key # Insert key in correct position
return data
# Example usage
numbers = [12, 11, 13, 5, 6]
print(insertion_sort(numbers))
# Output: [5, 6, 11, 12, 13]
Sub InsertionSort(data() As Integer)
For i As Integer = 1 To data.Length - 1
Dim key As Integer = data(i)
Dim j As Integer = i - 1
While j >= 0 AndAlso data(j) > key
data(j + 1) = data(j) ' Shift element to the right
j -= 1
End While
data(j + 1) = key ' Insert key in correct position
Next
End Sub
' Example usage
Dim numbers() As Integer = {12, 11, 13, 5, 6}
InsertionSort(numbers)
For Each num As Integer In numbers
Console.Write(num & " ")
Next
' Output: 5 6 11 12 13
Performance
- Best case: O(n) – the list is already sorted (no shifts needed, just one comparison per element).
- Worst case: O(n^2) – the list is in reverse order.
- Average case: O(n^2).
- Space complexity: O(1) – it sorts in place.
When Is Insertion Sort Useful?
- When the data set is small.
- When the data is nearly sorted (it performs very well in this case, approaching O(n)).
- It is an online algorithm – it can sort a list as it receives new data.
Merge Sort
Merge sort is a divide and conquer sorting algorithm. It recursively splits the list into halves until each sub-list contains a single element, then merges the sub-lists back together in sorted order.
How It Works
- Divide: Split the list into two halves.
- Conquer: Recursively apply merge sort to each half.
- Merge: Combine the two sorted halves into a single sorted list by comparing elements from each half and placing the smaller one first.
Worked example with [38, 27, 43, 3]:
Step 1 (Divide): [38, 27, 43, 3]
/ \
[38, 27] [43, 3]
/ \ / \
[38] [27] [43] [3]
Step 2 (Merge):
[38] [27] [43] [3]
\ / \ /
[27, 38] [3, 43]
\ /
[3, 27, 38, 43]
Code Examples
def merge_sort(data):
if len(data) <= 1:
return data
mid = len(data) // 2
left_half = merge_sort(data[:mid])
right_half = merge_sort(data[mid:])
return merge(left_half, right_half)
def merge(left, right):
result = []
i = 0
j = 0
while i < len(left) and j < len(right):
if left[i] <= right[j]:
result.append(left[i])
i += 1
else:
result.append(right[j])
j += 1
# Add any remaining elements
result.extend(left[i:])
result.extend(right[j:])
return result
# Example usage
numbers = [38, 27, 43, 3, 9, 82, 10]
print(merge_sort(numbers))
# Output: [3, 9, 10, 27, 38, 43, 82]
Function MergeSort(data() As Integer) As Integer()
If data.Length <= 1 Then
Return data
End If
Dim mid As Integer = data.Length \ 2
Dim leftHalf(mid - 1) As Integer
Dim rightHalf(data.Length - mid - 1) As Integer
Array.Copy(data, 0, leftHalf, 0, mid)
Array.Copy(data, mid, rightHalf, 0, data.Length - mid)
leftHalf = MergeSort(leftHalf)
rightHalf = MergeSort(rightHalf)
Return Merge(leftHalf, rightHalf)
End Function
Function Merge(left() As Integer, right() As Integer) As Integer()
Dim result(left.Length + right.Length - 1) As Integer
Dim i As Integer = 0
Dim j As Integer = 0
Dim k As Integer = 0
While i < left.Length AndAlso j < right.Length
If left(i) <= right(j) Then
result(k) = left(i)
i += 1
Else
result(k) = right(j)
j += 1
End If
k += 1
End While
While i < left.Length
result(k) = left(i)
i += 1
k += 1
End While
While j < right.Length
result(k) = right(j)
j += 1
k += 1
End While
Return result
End Function
' Example usage
Dim numbers() As Integer = {38, 27, 43, 3, 9, 82, 10}
Dim sorted() As Integer = MergeSort(numbers)
For Each num As Integer In sorted
Console.Write(num & " ")
Next
' Output: 3 9 10 27 38 43 82
Performance
- Best case: O(n log n).
- Worst case: O(n log n).
- Average case: O(n log n).
- Space complexity: O(n) – it requires additional memory to hold the sub-lists during merging.
Why O(n log n)?
The list is divided in half at each level of recursion, giving log n levels. At each level, all n elements must be compared during the merge step. Therefore, the total work is n multiplied by log n.
Comparison of Sorting Algorithms
| Feature | Bubble Sort | Insertion Sort | Merge Sort |
|---|---|---|---|
| Best case | O(n) | O(n) | O(n log n) |
| Worst case | O(n^2) | O(n^2) | O(n log n) |
| Average case | O(n^2) | O(n^2) | O(n log n) |
| Space complexity | O(1) – in place | O(1) – in place | O(n) – extra memory needed |
| Stable? | Yes | Yes | Yes |
| Adaptive? | Yes (with early exit) | Yes (fast on nearly sorted data) | No |
| Suitable for | Small or nearly sorted data | Small or nearly sorted data | Large data sets |
| Approach | Comparison and swap | Shift and insert | Divide and conquer |
A stable sorting algorithm preserves the relative order of elements with equal keys. All three algorithms covered here (bubble, insertion, merge) are stable.
In the exam you may be asked to recommend a sorting algorithm for a specific scenario. For large data sets, merge sort is preferred because its worst case is O(n log n). For small or nearly sorted data, insertion sort is often the best practical choice. Bubble sort is mainly used for educational purposes due to its simplicity.
Big O Notation
Big O notation is a mathematical notation used to describe the upper bound of an algorithm’s time complexity or space complexity. It describes how the resource usage of an algorithm grows relative to the input size n, focusing on the worst-case scenario.
Big O ignores constants and lower-order terms. For example, an algorithm that takes 3n^2 + 5n + 2 steps is described as O(n^2), because the n^2 term dominates as n grows large.
Common Big O Complexities
| Notation | Name | Description | Example |
|---|---|---|---|
| O(1) | Constant | The time taken does not depend on the input size. | Accessing an array element by index; checking if a number is odd/even. |
| O(log n) | Logarithmic | The time grows logarithmically. The algorithm halves the problem size at each step. | Binary search. |
| O(n) | Linear | The time grows in direct proportion to the input size. | Linear search; summing all elements in a list. |
| O(n log n) | Linearithmic | A combination of linear and logarithmic growth. Common in efficient sorting. | Merge sort; the best comparison-based sorts. |
| O(n^2) | Quadratic | The time grows as the square of the input size. Typically caused by nested loops. | Bubble sort; insertion sort (worst case). |
| O(2^n) | Exponential | The time doubles with each additional input element. Very slow for even moderate n. | Recursive Fibonacci (naive); brute-force subset enumeration. |
| O(n!) | Factorial | The time grows factorially. Extremely slow; only practical for very small n. | Travelling Salesman Problem (brute-force); generating all permutations. |
Growth Rate Comparison
For an input of size n = 1,000:
| Complexity | Approximate Operations |
|---|---|
| O(1) | 1 |
| O(log n) | 10 |
| O(n) | 1,000 |
| O(n log n) | 10,000 |
| O(n^2) | 1,000,000 |
| O(2^n) | Astronomically large (impractical) |
Time Complexity vs Space Complexity
- Time complexity measures how the running time of an algorithm grows with input size.
- Space complexity measures how the memory usage of an algorithm grows with input size.
For example, merge sort has time complexity O(n log n) but space complexity O(n) because it requires additional arrays during merging. Bubble sort has time complexity O(n^2) but space complexity O(1) because it sorts in place.
You must be able to identify the Big O complexity of the standard algorithms: linear search is O(n), binary search is O(log n), bubble sort is O(n^2), insertion sort is O(n^2), and merge sort is O(n log n). You should also be able to explain why each algorithm has its particular complexity.
Recursion vs Iteration
Recursion is a technique where a function calls itself to solve a smaller instance of the same problem. Each recursive call works towards a base case that stops the recursion. Iteration uses loops (for, while) to repeat a block of code.
Structure of a Recursive Function
Every recursive function must have:
- A base case – a condition that stops the recursion and returns a value directly.
- A recursive case – the function calls itself with a modified argument that moves towards the base case.
Example: Factorial
The factorial of n (written n!) is defined as:
- 0! = 1 (base case)
- n! = n * (n-1)! (recursive case)
Recursive approach:
def factorial_recursive(n):
if n == 0: # Base case
return 1
else: # Recursive case
return n * factorial_recursive(n - 1)
print(factorial_recursive(5)) # Output: 120
Function FactorialRecursive(n As Integer) As Integer
If n = 0 Then ' Base case
Return 1
Else ' Recursive case
Return n * FactorialRecursive(n - 1)
End If
End Function
Console.WriteLine(FactorialRecursive(5)) ' Output: 120
Iterative approach:
def factorial_iterative(n):
result = 1
for i in range(1, n + 1):
result = result * i
return result
print(factorial_iterative(5)) # Output: 120
Function FactorialIterative(n As Integer) As Integer
Dim result As Integer = 1
For i As Integer = 1 To n
result = result * i
Next
Return result
End Function
Console.WriteLine(FactorialIterative(5)) ' Output: 120
Example: Fibonacci Sequence
# Recursive Fibonacci
def fibonacci_recursive(n):
if n <= 1: # Base case
return n
else: # Recursive case
return fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2)
# Iterative Fibonacci
def fibonacci_iterative(n):
if n <= 1:
return n
a, b = 0, 1
for i in range(2, n + 1):
a, b = b, a + b
return b
print(fibonacci_recursive(7)) # Output: 13
print(fibonacci_iterative(7)) # Output: 13
' Recursive Fibonacci
Function FibonacciRecursive(n As Integer) As Integer
If n <= 1 Then ' Base case
Return n
Else ' Recursive case
Return FibonacciRecursive(n - 1) + FibonacciRecursive(n - 2)
End If
End Function
' Iterative Fibonacci
Function FibonacciIterative(n As Integer) As Integer
If n <= 1 Then
Return n
End If
Dim a As Integer = 0
Dim b As Integer = 1
For i As Integer = 2 To n
Dim temp As Integer = b
b = a + b
a = temp
Next
Return b
End Function
Console.WriteLine(FibonacciRecursive(7)) ' Output: 13
Console.WriteLine(FibonacciIterative(7)) ' Output: 13
Comparison Table
| Feature | Recursion | Iteration |
|---|---|---|
| Uses | Function calls itself | Loop constructs (for, while) |
| Termination | Base case | Loop condition becomes false |
| Memory usage | Higher (each call adds to the call stack) | Lower (no call stack overhead) |
| Risk | Stack overflow if base case not reached or recursion too deep | Infinite loop if condition never met |
| Readability | Often more elegant and closer to mathematical definitions | Can be more straightforward for simple repetition |
| Speed | Often slower due to function call overhead | Usually faster |
| Best for | Problems with recursive structure (trees, divide and conquer) | Simple repetition and counting |
A common exam question asks you to explain the advantages and disadvantages of recursion versus iteration. Always mention the call stack: each recursive call adds a new frame to the call stack, which uses memory. If the recursion is too deep, this can cause a stack overflow error.
Trace Tables
A trace table is a technique used to dry-run an algorithm by recording the values of variables at each step of execution. It is used to check the correctness of an algorithm and to understand its behaviour.
How to Create a Trace Table
- Create a column for each variable in the algorithm.
- Add a column for any output.
- Add a column for the condition being tested (optional but helpful).
- Work through the algorithm step by step, recording the value of each variable after every instruction that changes it.
Example: Tracing a Linear Search
Algorithm to search for the value 5 in the list [3, 8, 5, 1]:
data = [3, 8, 5, 1]
target = 5
FOR i FROM 0 TO 3
IF data[i] == target THEN
OUTPUT i
END IF
END FOR
Trace table:
| Step | i | data[i] | data[i] == target? | Output |
|---|---|---|---|---|
| 1 | 0 | 3 | No | – |
| 2 | 1 | 8 | No | – |
| 3 | 2 | 5 | Yes | 2 |
| 4 | 3 | 1 | No | – |
Example: Tracing a Bubble Sort Pass
First pass of bubble sort on [5, 3, 8, 1]:
| Step | j | data[j] | data[j+1] | Swap? | List after step |
|---|---|---|---|---|---|
| 1 | 0 | 5 | 3 | Yes | [3, 5, 8, 1] |
| 2 | 1 | 5 | 8 | No | [3, 5, 8, 1] |
| 3 | 2 | 8 | 1 | Yes | [3, 5, 1, 8] |
After the first pass, the largest value (8) has “bubbled” to the end.
Example: Tracing Recursion
Tracing factorial_recursive(4):
| Call | n | n == 0? | Returns |
|---|---|---|---|
| factorial_recursive(4) | 4 | No | 4 * factorial_recursive(3) |
| factorial_recursive(3) | 3 | No | 3 * factorial_recursive(2) |
| factorial_recursive(2) | 2 | No | 2 * factorial_recursive(1) |
| factorial_recursive(1) | 1 | No | 1 * factorial_recursive(0) |
| factorial_recursive(0) | 0 | Yes | 1 |
Now the returns unwind: 1 * 1 = 1, then 2 * 1 = 2, then 3 * 2 = 6, then 4 * 6 = 24.
Trace tables are a very common exam question. Practise by tracing through algorithms by hand. Always work methodically – update one variable at a time and do not skip steps. For recursive algorithms, show the chain of calls and the order in which values are returned.
Data Compression
Compression reduces file size, making data faster to transmit and requiring less storage. At AS level you need to understand both categories and specific algorithms.
Lossless vs Lossy Compression
| Feature | Lossless | Lossy |
|---|---|---|
| Data loss | None — original data perfectly reconstructed | Some data permanently removed |
| File size | Larger than lossy | Smaller than lossless |
| Use cases | Text, code, spreadsheets, medical images, archives | Photos, music, video, streaming |
| File formats | ZIP, PNG, GIF, FLAC, 7Z | JPEG, MP3, MP4/MPEG, AAC |
| How it works | Identifies and encodes patterns/redundancy | Removes data humans are unlikely to notice |
Lossless compression reduces file size while allowing the original data to be perfectly reconstructed. Lossy compression achieves greater size reduction by permanently removing some data, meaning the original cannot be fully restored.
Run-Length Encoding (RLE)
RLE is a lossless algorithm that replaces consecutive runs of the same value with a count and the value.
Worked Example
Original data: AAAAAABBBBCCDDDDDDD
| Run | Count | Value | Encoded |
|---|---|---|---|
| AAAAAA | 6 | A | 6A |
| BBBB | 4 | B | 4B |
| CC | 2 | C | 2C |
| DDDDDDD | 7 | D | 7D |
Encoded: 6A4B2C7D
Space calculation:
- Original: 19 characters (19 bytes if 1 byte per character)
- Encoded: 8 characters (8 bytes)
- Saving: (19 - 8) / 19 × 100 = 57.9%
When RLE Works Well vs Poorly
- Works well: Data with long runs of repeated values — simple graphics, icons, diagrams, fax documents
- Works poorly: Data with little repetition — photographs, random data. The encoded version may be larger than the original (e.g.
ABCDEF→1A1B1C1D1E1Fis longer)
# Python — Simple RLE encoding
def rle_encode(data):
encoded = ""
i = 0
while i < len(data):
count = 1
while i + count < len(data) and data[i + count] == data[i]:
count += 1
encoded += str(count) + data[i]
i += count
return encoded
def rle_decode(encoded):
decoded = ""
i = 0
while i < len(encoded):
count = ""
while encoded[i].isdigit():
count += encoded[i]
i += 1
decoded += encoded[i] * int(count)
i += 1
return decoded
original = "AAAAAABBBBCCDDDDDDD"
compressed = rle_encode(original)
print(f"Original: {original} ({len(original)} chars)")
print(f"Compressed: {compressed} ({len(compressed)} chars)")
print(f"Decompressed: {rle_decode(compressed)}")
' VB.NET — Simple RLE encoding
Function RLEEncode(data As String) As String
Dim encoded As String = ""
Dim i As Integer = 0
While i < data.Length
Dim count As Integer = 1
While i + count < data.Length AndAlso data(i + count) = data(i)
count += 1
End While
encoded &= count.ToString() & data(i)
i += count
End While
Return encoded
End Function
Dim original As String = "AAAAAABBBBCCDDDDDDD"
Dim compressed As String = RLEEncode(original)
Console.WriteLine("Original: " & original & " (" & original.Length & " chars)")
Console.WriteLine("Compressed: " & compressed & " (" & compressed.Length & " chars)")
Dictionary-Based (LZ) Compression
Dictionary-based compression (named after Lempel and Ziv — LZ77, LZ78, LZW) builds a dictionary of recurring patterns in the data and replaces them with shorter codes.
How It Works
- The algorithm scans the data and identifies repeated sequences
- Each unique sequence is added to a dictionary with a short index code
- Subsequent occurrences of the sequence are replaced with the dictionary code
- The dictionary is stored with the compressed data so it can be decoded
Worked Example
Original text: THE CAT SAT ON THE MAT
| Dictionary Index | Entry |
|---|---|
| 0 | THE |
| 1 | CAT |
| 2 | SAT |
| 3 | ON |
| 4 | MAT |
Encoded: 0 1 2 3 0 4 (plus spaces/structure)
The word “THE” appears twice but is only stored once in the dictionary. The second occurrence is replaced by its index code (0), saving space.
Key characteristics:
- Lossless — original data is perfectly reconstructed using the dictionary
- More effective than RLE for text and general-purpose compression
- Used in ZIP, GIF, PNG, and PDF formats
- Effectiveness depends on how much repetition exists in the data
Huffman Encoding
Huffman encoding is a lossless algorithm that assigns variable-length binary codes to characters based on their frequency. More frequent characters get shorter codes, less frequent characters get longer codes.
How It Works
- Count the frequency of each character in the data
- Build a binary tree by repeatedly combining the two lowest-frequency characters
- Assign binary codes by traversing the tree (left = 0, right = 1)
- Replace each character with its Huffman code
Worked Example
Text: AABBBCCCCDDDD (14 characters)
Step 1 — Frequency count:
| Character | Frequency |
|---|---|
| A | 2 |
| B | 3 |
| C | 4 |
| D | 5 |
Step 2 — Build the tree (combine two smallest repeatedly):
- Combine A(2) + B(3) = AB(5)
- Combine C(4) + AB(5) = CAB(9)
- Combine D(5) + CAB(9) = Root(14)
Step 3 — Assign codes (traversing the tree):
| Character | Code | Code Length |
|---|---|---|
| D | 0 | 1 bit |
| C | 10 | 2 bits |
| A | 111 | 3 bits |
| B | 110 | 3 bits |
Step 4 — Encode:
AABBBCCCCDDDD → 111 111 110 110 110 10 10 10 10 0 0 0 0 0
Space calculation:
- Using fixed-length codes (e.g. ASCII 7-bit): 14 × 7 = 98 bits
- Using Huffman codes: (2×3) + (3×3) + (4×2) + (5×1) = 6 + 9 + 8 + 5 = 28 bits
- Saving: (98 - 28) / 98 × 100 = 71.4%
(The dictionary must also be stored, adding some overhead, but for large files the savings are substantial.)
In exam questions on Huffman coding, you may be asked to: (1) build a frequency table, (2) construct the Huffman tree, (3) determine the codes for each character, or (4) calculate the space saving compared to fixed-length encoding. Always show your working clearly. Remember that more frequent characters get shorter codes — this is the key insight that makes Huffman efficient.
Compression File Formats Summary
| Format | Type | Algorithm | Typical Use |
|---|---|---|---|
| ZIP | Lossless | LZ + Huffman (DEFLATE) | General file archiving |
| PNG | Lossless | LZ + filtering | Web graphics, screenshots, diagrams |
| GIF | Lossless | LZW | Simple animations, icons |
| FLAC | Lossless | Linear prediction + Rice coding | High-quality audio archiving |
| 7Z | Lossless | LZMA | High-ratio file archiving |
| JPEG | Lossy | DCT + quantisation + Huffman | Photographs |
| MP3 | Lossy | Psychoacoustic model + Huffman | Music, podcasts |
| MP4/MPEG | Lossy | Motion estimation + DCT | Video |
| AAC | Lossy | Improved psychoacoustic model | Music streaming (Apple, YouTube) |
When recommending a compression method, consider the type of data and whether quality loss is acceptable. Text files and program code must use lossless compression (any data loss would corrupt them). Photos and music can use lossy compression because small quality reductions are often imperceptible.