Databases & Distributed Systems
A2 Level — Unit 4: Architecture, Data, Communication & Applications
Data Consistency, Redundancy, and Independence
Data Consistency
Data consistency means that all copies and representations of a data item agree with each other. If a customer’s address is stored in multiple places, every instance must hold the same value at any given time.
In a flat-file system, the same data is often duplicated across many files. When one copy is updated but others are not, the data becomes inconsistent. A database approach addresses this by storing each data item once and providing controlled access to it.
Data Redundancy
Data redundancy is the unnecessary duplication of data. In a flat-file system, the same information (e.g. a customer name and address) may be stored in the orders file, the invoices file, and the accounts file.
Problems caused by redundancy:
- Wasted storage — the same data occupies space in multiple files
- Update anomalies — changing data in one file but not others leads to inconsistency
- Insert anomalies — data cannot be added without creating a related record (e.g. cannot store a new customer without an order)
- Delete anomalies — deleting a record may accidentally remove the only copy of important data
Data Independence
Data independence is the ability to change the way data is stored or organised without affecting the programs that use it.
| Type | Description | Example |
|---|---|---|
| Logical data independence | Changes to the logical structure (e.g. adding a field, creating a new view) do not require changes to application programs | Adding a “mobile number” field to a customer table does not break existing queries that do not use that field |
| Physical data independence | Changes to the physical storage (e.g. moving to a different disk, changing file organisation) do not affect the logical structure or applications | Moving a database from a hard drive to an SSD requires no changes to SQL queries |
Data independence is achieved through the three-schema architecture of a DBMS:
- External schema — individual user views
- Conceptual schema — the logical structure of the entire database
- Internal schema — the physical storage structure
Relational Database Concepts
Key Terminology
| Formal Term | Informal Equivalent | Meaning |
|---|---|---|
| Relation | Table | A two-dimensional structure storing data about one entity |
| Attribute | Column / Field | A named property of the entity (e.g. Surname, DateOfBirth) |
| Tuple | Row / Record | A single instance of the entity with values for each attribute |
| Domain | Data type / Range | The set of allowable values for an attribute |
Types of Keys
| Key Type | Definition | Example |
|---|---|---|
| Primary key | An attribute (or combination) that uniquely identifies each tuple in a relation | StudentID in a Students table |
| Foreign key | An attribute in one relation that references the primary key of another relation, creating a link between them | StudentID in an Enrolments table referencing Students |
| Composite key | A primary key made up of two or more attributes together | (StudentID, ModuleID) in an Enrolments table |
| Candidate key | Any attribute (or combination) that could serve as the primary key — there may be several candidates | Both StudentID and NationalInsuranceNumber could uniquely identify a student |
In an exam, when identifying keys, look for attributes that are unique and never null. If no single attribute is unique, consider whether a combination of attributes could form a composite key. The primary key is the candidate key that is actually chosen for use.
Normalisation
Normalisation is the process of organising data in a relational database to reduce redundancy and eliminate anomalies. Data is progressively restructured through a series of normal forms.
Worked Example
Consider a flat-file table for a school’s course enrolment system:
StudentCourses (Unnormalised)
| StudentID | StudentName | Tutor | TutorRoom | CourseID | CourseName | Grade |
|---|---|---|---|---|---|---|
| S1 | Anna Jones | Mr Smith | R101 | C1, C3 | Maths, Physics | A, B |
| S2 | Ben Lee | Ms Patel | R204 | C1, C2 | Maths, English | B, A |
| S3 | Clara Diaz | Mr Smith | R101 | C2 | English | C |
This table has repeating groups (CourseID, CourseName, Grade can have multiple values per student) — it is in UNF (Unnormalised Form).
UNF to 1NF — Remove Repeating Groups
First Normal Form (1NF): A relation is in 1NF if it contains no repeating groups and every field contains only atomic (indivisible) values. Each row must be uniquely identifiable.
To achieve 1NF, separate the repeating groups so each cell contains a single value. Each row must be unique, so we need a composite primary key: (StudentID, CourseID).
StudentCourses (1NF)
| StudentID | StudentName | Tutor | TutorRoom | CourseID | CourseName | Grade |
|---|---|---|---|---|---|---|
| S1 | Anna Jones | Mr Smith | R101 | C1 | Maths | A |
| S1 | Anna Jones | Mr Smith | R101 | C3 | Physics | B |
| S2 | Ben Lee | Ms Patel | R204 | C1 | Maths | B |
| S2 | Ben Lee | Ms Patel | R204 | C2 | English | A |
| S3 | Clara Diaz | Mr Smith | R101 | C2 | English | C |
Composite primary key: (StudentID, CourseID)
1NF to 2NF — Remove Partial Dependencies
Second Normal Form (2NF): A relation is in 2NF if it is in 1NF and every non-key attribute is fully functionally dependent on the whole primary key (no partial dependencies).
A partial dependency exists when a non-key attribute depends on only part of a composite key.
In our 1NF table:
- StudentName, Tutor, TutorRoom depend only on StudentID (not on CourseID) — partial dependency
- CourseName depends only on CourseID (not on StudentID) — partial dependency
- Grade depends on both StudentID and CourseID — full dependency
To achieve 2NF, split the table so that each non-key attribute depends on the entire key:
Students
| StudentID | StudentName | Tutor | TutorRoom |
|---|---|---|---|
| S1 | Anna Jones | Mr Smith | R101 |
| S2 | Ben Lee | Ms Patel | R204 |
| S3 | Clara Diaz | Mr Smith | R101 |
Courses
| CourseID | CourseName |
|---|---|
| C1 | Maths |
| C2 | English |
| C3 | Physics |
Enrolments
| StudentID | CourseID | Grade |
|---|---|---|
| S1 | C1 | A |
| S1 | C3 | B |
| S2 | C1 | B |
| S2 | C2 | A |
| S3 | C2 | C |
2NF to 3NF — Remove Transitive Dependencies
Third Normal Form (3NF): A relation is in 3NF if it is in 2NF and there are no transitive dependencies — no non-key attribute depends on another non-key attribute.
A transitive dependency exists when a non-key attribute determines another non-key attribute.
In the Students table: Tutor determines TutorRoom (each tutor is always in the same room). So TutorRoom is transitively dependent on StudentID via Tutor:
StudentID → Tutor → TutorRoom
To achieve 3NF, separate out the transitive dependency:
Students (3NF)
| StudentID | StudentName | Tutor |
|---|---|---|
| S1 | Anna Jones | Mr Smith |
| S2 | Ben Lee | Ms Patel |
| S3 | Clara Diaz | Mr Smith |
Tutors (3NF)
| Tutor | TutorRoom |
|---|---|
| Mr Smith | R101 |
| Ms Patel | R204 |
Final 3NF Schema:
- Students (**StudentID**, StudentName, Tutor*)
- Courses (**CourseID**, CourseName)
- Enrolments (**StudentID***, **CourseID***, Grade)
- Tutors (**Tutor**, TutorRoom)

Bold = primary key (or part of a composite primary key); asterisk (*) = foreign key
Normalisation questions are very common in the exam and carry significant marks. Always show each step clearly: identify the repeating groups (UNF to 1NF), identify partial dependencies (1NF to 2NF), then identify transitive dependencies (2NF to 3NF). Underline primary keys and mark foreign keys clearly.
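One way to convince yourself the decomposition is correct: if the four 3NF tables are joined back together, every row of the 1NF table reappears with nothing lost. A minimal sketch using Python's built-in sqlite3 module — the tables and rows follow the worked example above, but the verification query itself is just an illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Create the four 3NF tables from the worked example
cur.executescript("""
CREATE TABLE Tutors   (Tutor TEXT PRIMARY KEY, TutorRoom TEXT);
CREATE TABLE Students (StudentID TEXT PRIMARY KEY, StudentName TEXT,
                       Tutor TEXT REFERENCES Tutors(Tutor));
CREATE TABLE Courses  (CourseID TEXT PRIMARY KEY, CourseName TEXT);
CREATE TABLE Enrolments (
    StudentID TEXT REFERENCES Students(StudentID),
    CourseID  TEXT REFERENCES Courses(CourseID),
    Grade     TEXT,
    PRIMARY KEY (StudentID, CourseID)
);
""")

cur.executemany("INSERT INTO Tutors VALUES (?, ?)",
                [("Mr Smith", "R101"), ("Ms Patel", "R204")])
cur.executemany("INSERT INTO Students VALUES (?, ?, ?)",
                [("S1", "Anna Jones", "Mr Smith"),
                 ("S2", "Ben Lee", "Ms Patel"),
                 ("S3", "Clara Diaz", "Mr Smith")])
cur.executemany("INSERT INTO Courses VALUES (?, ?)",
                [("C1", "Maths"), ("C2", "English"), ("C3", "Physics")])
cur.executemany("INSERT INTO Enrolments VALUES (?, ?, ?)",
                [("S1", "C1", "A"), ("S1", "C3", "B"),
                 ("S2", "C1", "B"), ("S2", "C2", "A"),
                 ("S3", "C2", "C")])

# Joining the 3NF tables reproduces every row of the 1NF table
rows = cur.execute("""
    SELECT s.StudentID, s.StudentName, s.Tutor, t.TutorRoom,
           c.CourseID, c.CourseName, e.Grade
    FROM Enrolments e
    JOIN Students s ON e.StudentID = s.StudentID
    JOIN Tutors   t ON s.Tutor = t.Tutor
    JOIN Courses  c ON e.CourseID = c.CourseID
    ORDER BY s.StudentID, c.CourseID
""").fetchall()
print(len(rows))   # 5 rows, matching the 1NF table
```

Note that each piece of data (a student's name, a tutor's room) is now stored exactly once, yet the join recovers the full picture — that is the whole point of normalisation.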
Entity Relationship Modelling
Relationships Between Entities
An entity is a thing about which data is stored (e.g. Student, Course, Teacher). Entities have relationships with each other.
| Relationship | Meaning | Example |
|---|---|---|
| One-to-one (1:1) | Each instance of entity A is related to exactly one instance of entity B, and vice versa | One headteacher manages one school |
| One-to-many (1:M) | Each instance of entity A can be related to many instances of entity B, but each B relates to only one A | One tutor teaches many students; each student has one tutor |
| Many-to-many (M:M) | Each instance of entity A can be related to many instances of entity B, and vice versa | Many students enrol on many courses |
ER Diagram Notation
Entities are drawn as rectangles, relationships as lines between them, with the degree of the relationship marked:
[Student] 1 -------- M [Enrolment] M -------- 1 [Course]
Resolving Many-to-Many Relationships
Relational databases cannot directly implement many-to-many relationships. They must be resolved using a junction table (also called a linking table or associative entity).
Before resolution:
[Student] M -------- M [Course]
After resolution:
[Student] 1 -------- M [Enrolment] M -------- 1 [Course]
The Enrolment junction table contains the primary keys of both Student and Course as foreign keys (which together form a composite primary key), plus any attributes specific to the relationship (e.g. Grade, EnrolmentDate).
SQL — Structured Query Language
Data Query Language (DQL)
SELECT, FROM, WHERE
```sql
-- Select all columns from a table
SELECT * FROM Students;

-- Select specific columns
SELECT StudentName, Tutor FROM Students;

-- Filter with WHERE
SELECT StudentName, Grade
FROM Enrolments
INNER JOIN Students ON Enrolments.StudentID = Students.StudentID
WHERE Grade = 'A';

-- Multiple conditions
SELECT * FROM Students
WHERE Tutor = 'Mr Smith' AND StudentName LIKE 'A%';
```
ORDER BY and GROUP BY
```sql
-- Sort results
SELECT StudentName, Grade FROM Enrolments
INNER JOIN Students ON Enrolments.StudentID = Students.StudentID
ORDER BY StudentName ASC;

-- Group and aggregate
SELECT CourseID, COUNT(*) AS NumberOfStudents
FROM Enrolments
GROUP BY CourseID;

-- HAVING filters groups (WHERE filters individual rows)
SELECT CourseID, COUNT(*) AS NumberOfStudents
FROM Enrolments
GROUP BY CourseID
HAVING COUNT(*) > 1;
```
JOIN Operations
```sql
-- INNER JOIN: returns only matching rows from both tables
SELECT Students.StudentName, Courses.CourseName, Enrolments.Grade
FROM Enrolments
INNER JOIN Students ON Enrolments.StudentID = Students.StudentID
INNER JOIN Courses ON Enrolments.CourseID = Courses.CourseID;

-- LEFT JOIN: returns all rows from the left table, with NULLs where there is no match
SELECT Students.StudentName, Enrolments.CourseID
FROM Students
LEFT JOIN Enrolments ON Students.StudentID = Enrolments.StudentID;
```
| Join Type | Returns |
|---|---|
| INNER JOIN | Only rows where there is a match in both tables |
| LEFT JOIN | All rows from the left table, plus matched rows from the right (NULLs where no match) |
Data Manipulation Language (DML)
```sql
-- INSERT a new record
INSERT INTO Students (StudentID, StudentName, Tutor)
VALUES ('S4', 'David Park', 'Ms Patel');

-- UPDATE an existing record
UPDATE Enrolments
SET Grade = 'A'
WHERE StudentID = 'S3' AND CourseID = 'C2';

-- DELETE a record
DELETE FROM Enrolments
WHERE StudentID = 'S1' AND CourseID = 'C3';
```
Data Definition Language (DDL)
```sql
-- CREATE TABLE with primary and foreign key constraints
CREATE TABLE Students (
    StudentID   VARCHAR(5) PRIMARY KEY,
    StudentName VARCHAR(50) NOT NULL,
    Tutor       VARCHAR(30),
    FOREIGN KEY (Tutor) REFERENCES Tutors(Tutor)
);

CREATE TABLE Enrolments (
    StudentID VARCHAR(5),
    CourseID  VARCHAR(5),
    Grade     CHAR(1),
    PRIMARY KEY (StudentID, CourseID),
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID),
    FOREIGN KEY (CourseID) REFERENCES Courses(CourseID)
);
```
SQL questions in the exam expect correct syntax. Remember: WHERE filters individual rows before grouping; HAVING filters groups after aggregation. Use INNER JOIN when you only want matching records, and LEFT JOIN when you need all records from one table regardless of whether they match.
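The WHERE/HAVING distinction can be seen directly by running the queries above against the worked-example data. A small sketch using Python's built-in sqlite3 module (the ORDER BY is added only so the output order is predictable):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Enrolments (StudentID TEXT, CourseID TEXT, Grade TEXT)")
cur.executemany("INSERT INTO Enrolments VALUES (?, ?, ?)",
                [("S1", "C1", "A"), ("S1", "C3", "B"),
                 ("S2", "C1", "B"), ("S2", "C2", "A"),
                 ("S3", "C2", "C")])

# WHERE filters rows BEFORE grouping: only the grade-A rows are counted
before = cur.execute("""
    SELECT CourseID, COUNT(*) FROM Enrolments
    WHERE Grade = 'A'
    GROUP BY CourseID
    ORDER BY CourseID
""").fetchall()
print(before)   # [('C1', 1), ('C2', 1)]

# HAVING filters groups AFTER aggregation: only courses with > 1 enrolment survive
after = cur.execute("""
    SELECT CourseID, COUNT(*) FROM Enrolments
    GROUP BY CourseID
    HAVING COUNT(*) > 1
    ORDER BY CourseID
""").fetchall()
print(after)    # [('C1', 2), ('C2', 2)]
```

C3 has only one enrolment, so it is removed by HAVING in the second query; in the first query it disappears for a different reason — its single row (grade B) fails the WHERE test before grouping even begins.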
Purpose of a DBMS
A Database Management System (DBMS) is software that manages access to the database, acting as an intermediary between users/applications and the stored data.
Key Functions
| Function | Description |
|---|---|
| Data dictionary | Stores metadata — descriptions of tables, fields, data types, constraints, relationships, and access permissions. Sometimes called the “database about the database” |
| Security | Controls access through user authentication, authorisation levels, and access rights (read, write, modify, delete). Different users can have different views of the data |
| Backup and recovery | Provides tools for regular backups and transaction logs to restore the database to a consistent state after a failure |
| Concurrent access | Manages multiple users accessing the database simultaneously using locking, timestamps, or serialisation to prevent conflicts |
| Query processing | Translates high-level queries (SQL) into low-level operations and optimises the query execution plan for efficiency |
| Data integrity | Enforces constraints (e.g. primary key uniqueness, referential integrity, data validation rules) to maintain data accuracy |
A data dictionary (also called the system catalogue) contains metadata about every object in the database: table names, field names, data types, constraints, relationships, indexes, and user access rights. It is maintained automatically by the DBMS.
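You can inspect a real system catalogue directly. In SQLite, for example, the catalogue is exposed as a table called sqlite_master, which holds one row of metadata per table, index, view, or trigger and is maintained automatically by the DBMS:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Students (StudentID TEXT PRIMARY KEY, StudentName TEXT)")

# Query the data dictionary itself: metadata about every object in the database
meta = con.execute("SELECT type, name, sql FROM sqlite_master").fetchall()
for obj_type, name, sql in meta:
    print(obj_type, name)   # includes the Students table we just created
```

Other systems expose the same idea under different names (e.g. the INFORMATION_SCHEMA views in MySQL and SQL Server) — the principle of "a database about the database" is the same.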
Big Data
Characteristics of Big Data
Big Data is commonly described by the four Vs:
| Characteristic | Description | Example |
|---|---|---|
| Volume | Extremely large amounts of data — terabytes to petabytes | Social media generates petabytes of data daily |
| Velocity | Data is generated and must be processed at high speed, often in real time | Stock market transactions, sensor data from IoT devices |
| Variety | Data comes in many formats — structured (databases), semi-structured (XML, JSON), unstructured (text, images, video) | Combining sales databases with social media posts and customer reviews |
| Veracity | The quality, accuracy, and trustworthiness of the data varies | Sensor data may contain errors; social media data may be biased or false |
Predictive Analytics
Predictive analytics uses statistical techniques and machine learning on historical data to make predictions about future events or behaviours.
Applications of predictive analytics:
- Retail: Predicting which products a customer is likely to buy, optimising stock levels
- Healthcare: Predicting disease outbreaks, identifying patients at risk of readmission
- Finance: Credit scoring, fraud detection, market trend prediction
- Transport: Predicting traffic congestion, optimising delivery routes
Machine Learning Applications
Machine learning is a subset of artificial intelligence where systems learn patterns from data without being explicitly programmed for each task.
| Application | Description |
|---|---|
| Recommendation systems | Netflix suggests films; Amazon suggests products based on past behaviour |
| Natural language processing | Voice assistants (Siri, Alexa) understand and respond to speech |
| Image recognition | Medical imaging analysis, facial recognition, autonomous vehicles |
| Fraud detection | Banks identify unusual transaction patterns automatically |
| Spam filtering | Email systems classify messages as spam or legitimate |
Data Warehousing
What Is a Data Warehouse?
A data warehouse is a large, centralised repository that stores historical data from multiple operational sources, organised specifically for analysis and reporting rather than day-to-day transaction processing.
ETL Process
Data enters the warehouse through the ETL process:
| Stage | Action | Detail |
|---|---|---|
| Extract | Pull data from multiple sources | Operational databases, spreadsheets, external feeds, log files |
| Transform | Clean, standardise, and restructure the data | Remove duplicates, convert formats, apply business rules, handle missing values |
| Load | Write the processed data into the warehouse | May be done in bulk (batch) or incrementally |
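The three stages can be sketched in a few lines of Python. Everything here is illustrative — the source records, field names, and cleaning rules are invented for the example:

```python
# Extract: raw records pulled from two hypothetical operational sources
source_a = [{"id": "S1", "name": " Anna Jones ", "sale": "19.99"},
            {"id": "S2", "name": "Ben Lee", "sale": "5.00"}]
source_b = [{"id": "S2", "name": "Ben Lee", "sale": "5.00"},   # duplicate record
            {"id": "S3", "name": "clara diaz", "sale": "12.50"}]

# Transform: remove duplicates, standardise formats, convert types
seen, cleaned = set(), []
for rec in source_a + source_b:
    if rec["id"] in seen:
        continue                                # drop duplicate records
    seen.add(rec["id"])
    cleaned.append({
        "id": rec["id"],
        "name": rec["name"].strip().title(),    # standardise capitalisation/whitespace
        "sale": float(rec["sale"]),             # convert string to numeric
    })

# Load: write the processed rows into the warehouse (represented here by a list)
warehouse = cleaned
print(len(warehouse))   # 3 cleaned rows loaded
```

In a real pipeline the "warehouse" would of course be a database load (bulk or incremental), but the shape of the work — gather, clean, write — is exactly the table above.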
Star Schema
The star schema is the most common data warehouse design:
- A central fact table contains measurable business data (e.g. sales amount, quantity sold) and foreign keys to dimension tables
- Surrounding dimension tables contain descriptive attributes (e.g. product name, customer location, date)
```mermaid
erDiagram
    SALES_FACT }o--|| DATE_DIMENSION : ""
    SALES_FACT }o--|| PRODUCT_DIMENSION : ""
    SALES_FACT }o--|| CUSTOMER_DIMENSION : ""
    SALES_FACT }o--|| STORE_DIMENSION : ""
    SALES_FACT {
        int DateKey FK
        int ProductKey FK
        int CustomerKey FK
        int StoreKey FK
        int Quantity
        decimal Revenue
    }
```
The fact table holds: DateKey, ProductKey, CustomerKey, StoreKey, Quantity, Revenue
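A typical OLAP query against this design joins the fact table to one dimension and aggregates the measures. A minimal sqlite3 sketch — the sample rows are invented, and only the product dimension is modelled, but the keys follow the diagram above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE PRODUCT_DIMENSION (ProductKey INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE SALES_FACT (ProductKey INTEGER, Quantity INTEGER, Revenue REAL);
""")
cur.executemany("INSERT INTO PRODUCT_DIMENSION VALUES (?, ?)",
                [(1, "Widget"), (2, "Gadget")])
cur.executemany("INSERT INTO SALES_FACT VALUES (?, ?, ?)",
                [(1, 10, 100.0), (1, 5, 50.0), (2, 2, 80.0)])

# Roll up revenue by product: fact table joined to one dimension table
totals = cur.execute("""
    SELECT p.ProductName, SUM(f.Revenue)
    FROM SALES_FACT f
    JOIN PRODUCT_DIMENSION p ON f.ProductKey = p.ProductKey
    GROUP BY p.ProductName
    ORDER BY p.ProductName
""").fetchall()
print(totals)   # [('Gadget', 80.0), ('Widget', 150.0)]
```

Because the descriptive attributes live in small dimension tables and the fact table holds only keys and measures, queries like this stay simple even when the fact table has millions of rows.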
OLAP vs OLTP
| Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|---|---|
| Purpose | Day-to-day operations | Analysis and reporting |
| Queries | Simple, frequent (INSERT, UPDATE) | Complex, infrequent (aggregations, joins) |
| Data | Current, detailed | Historical, summarised |
| Users | Many concurrent users (clerks, customers) | Few users (analysts, managers) |
| Design | Normalised (3NF) for minimal redundancy | Denormalised (star schema) for query speed |
| Response time | Milliseconds | Seconds to minutes acceptable |
Data Mining
What Is Data Mining?
Data mining is the process of discovering patterns, correlations, and anomalies in large datasets using statistical and computational techniques to make predictions or gain insights.
Techniques
| Technique | Description | Example |
|---|---|---|
| Classification | Assigns data items to predefined categories based on attributes | Classifying emails as spam or not spam; classifying loan applicants as high or low risk |
| Clustering | Groups data items by similarity without predefined categories | Grouping customers by purchasing behaviour to identify market segments |
| Association rules | Finds relationships between items that frequently occur together | “Customers who buy bread and butter also tend to buy milk” (market basket analysis) |
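Association rules come from counting how often items appear together across transactions. A toy market-basket sketch in pure Python — the baskets are invented, and real systems compute support and confidence over far larger datasets:

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets (one set of items per transaction)
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk", "eggs"},
    {"milk", "eggs"},
]

# Support counts: how many baskets contain each pair of items together
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Confidence of the rule {bread, butter} -> {milk}:
# of the baskets containing bread AND butter, what fraction also contain milk?
both = sum(1 for b in baskets if {"bread", "butter"} <= b)
with_milk = sum(1 for b in baskets if {"bread", "butter", "milk"} <= b)
confidence = with_milk / both
print(confidence)   # 2 of the 3 bread-and-butter baskets also contain milk
```

A high-confidence rule like this is exactly the "customers who buy bread and butter also tend to buy milk" finding described in the table.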
Applications of Data Mining
- Retail: Market basket analysis to optimise product placement and promotions
- Healthcare: Identifying patterns in patient data to improve diagnoses and treatment plans
- Banking: Detecting fraudulent transactions by identifying unusual patterns
- Telecommunications: Predicting customer churn (customers likely to leave)
- Science: Discovering patterns in genomic data, climate data, or astronomical observations
Data mining and data warehousing are closely related. The warehouse provides the clean, organised historical data that mining techniques analyse. Be prepared to explain the difference: the warehouse stores the data; mining discovers patterns within it.
Distributed Databases
What Is a Distributed Database?
A distributed database is a database where the data is stored across multiple computers (nodes) at different physical locations, but appears to users as a single, unified database.
Fragmentation
Data can be distributed across nodes using two methods:
| Type | Description | Example |
|---|---|---|
| Horizontal fragmentation | Different rows of a table are stored at different nodes | Customer records for Wales stored on the Cardiff server; records for England stored on the London server |
| Vertical fragmentation | Different columns of a table are stored at different nodes | Customer name and address stored at one node; financial details stored at a more secure node |
A combination of both (mixed fragmentation) is also possible.
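The two fragmentation methods can be pictured as splitting a table by rows or by columns. A toy Python sketch — the customer rows, the region rule, and the node names are invented for the example:

```python
# Full customer table (invented sample rows)
customers = [
    {"id": 1, "name": "Evans",  "region": "Wales"},
    {"id": 2, "name": "Brown",  "region": "England"},
    {"id": 3, "name": "Hughes", "region": "Wales"},
]

# Horizontal fragmentation: whole ROWS are routed to nodes by an attribute value
cardiff_node = [row for row in customers if row["region"] == "Wales"]
london_node  = [row for row in customers if row["region"] == "England"]

# Vertical fragmentation: COLUMNS are split between nodes; both fragments
# keep the key so the original rows can be reconstructed with a join
names_node   = [{"id": r["id"], "name": r["name"]} for r in customers]
regions_node = [{"id": r["id"], "region": r["region"]} for r in customers]

print(len(cardiff_node), len(london_node))   # 2 rows in Cardiff, 1 in London
```

Note the key point about vertical fragmentation: every fragment must carry the primary key, otherwise the columns could never be matched back up into complete records.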
Replication
Replication means storing copies of the same data at multiple nodes.
| Replication Type | Description |
|---|---|
| Full replication | Every node holds a complete copy of the entire database |
| Partial replication | Frequently accessed data is copied to multiple nodes; other data exists at only one node |
| No replication | Each data item exists at exactly one node |
Advantages of Distributed Databases
| Advantage | Explanation |
|---|---|
| Reliability | If one node fails, others continue operating; data is not lost if replicated |
| Performance | Data is stored close to where it is used most, reducing network traffic and response times |
| Scalability | New nodes can be added to handle growth without redesigning the entire system |
| Local autonomy | Each site can control its own data while still participating in the larger system |
Challenges of Distributed Databases
| Challenge | Explanation |
|---|---|
| Data consistency | Keeping all copies of data synchronised across nodes is difficult, especially with replication |
| Complexity | The DBMS must handle distributed query processing, distributed transactions, and failure recovery across multiple nodes |
| Network dependency | The system relies on network connections; if a link fails, some data may become temporarily inaccessible |
| Security | Data transmitted between nodes must be protected; more nodes means more potential attack points |
| Cost | Hardware, software, and administration costs are higher than a centralised database |
Exam questions on distributed databases often ask you to weigh advantages against challenges for a given scenario. Consider factors like the geographic spread of users, the need for reliability, the importance of consistency, and the available budget. A company with offices worldwide benefits from distributed databases; a small local business may not.