Back to Projects
Project 05 Clinical Informatics ยท Relational DB Design ยท GDPR

Clinical Trials Database

A production-grade relational data model for managing hospital clinical trials โ€” built around pseudonymised participants, longitudinal visit data, flexible per-study variables, file assets, and a GDPR-compliant identity separation architecture.

MySQLER ModellingClinical DataGDPRNormalisationRelational Design

Purpose & Context

This data model was designed for the role of manager of a hospital clinical trials support service. The challenge was to build a single database that could serve multiple simultaneous clinical studies, each with its own set of measured variables, participant lists, and data-collection schedules โ€” while meeting strict legal and ethical requirements around patient identity protection.

Real clinical trial databases face a fundamental tension: they need to be flexible enough to accommodate completely different studies (a cardiology trial measuring blood pressure over 12 visits looks very different from an oncology trial collecting tumour imaging files at three time points), yet structured enough to guarantee data integrity, auditability, and GDPR compliance. This schema resolves that tension through a combination of a dynamic variable system, a visit-based longitudinal structure, and an external identity-link table stored in a separate database.

Entity-Relationship Diagram

The diagram below shows the full schema. Solid lines represent identifying relationships; dashed lines represent non-identifying ones. Crow's-foot notation indicates cardinality at each end.

ER Diagram โ€” Clinical Trials Database
Entity-Relationship diagram generated with MySQL Workbench ยท 12 tables ยท 2 database scopes

Table-by-Table Breakdown

The schema is composed of 12 tables organised into four logical groups: study management, participants & visits, clinical data, and users & access control. Each is explained below.

Study Management

STUDY

The central anchor of the schema. Every other entity ultimately traces back to a study.

๐Ÿ”‘study_idPrimary key (INT, auto-increment)
study_codeShort human-readable identifier (e.g. "CARDIO-2024")
nameFull study title
descriptionObjectives and protocol summary
๐Ÿ”—LAB_lab_idFK โ†’ LAB. The lab responsible for this study
๐Ÿ”—CONTACT_contact_idFK โ†’ CONTACT. The principal investigator / contact person

LAB

Represents the hospital laboratory or research unit responsible for running one or more studies.

๐Ÿ”‘lab_idPrimary key
nameLab name
departmentHospital department
emailLab contact email
phoneLab contact phone

One lab can run many studies (1:N with STUDY).

CONTACT

The principal investigator or designated contact for a study โ€” kept separate from LAB to allow different people to be responsible for different studies within the same lab.

๐Ÿ”‘contact_idPrimary key
nameFull name
emailContact email
phoneContact phone
organizationAffiliated institution

VARIABLE

A catalogue of all possible clinical data items (e.g. "systolic_bp", "tumour_volume", "haemoglobin"). The label of stored data is not static โ€” it comes from this table, making the schema flexible across completely different study types.

๐Ÿ”‘variable_idPrimary key
codeMachine-readable key (e.g. "sbp")
display_nameHuman-readable label
data_kind"numeric", "text", "file" โ€” determines which child table stores it
unitPhysical unit if applicable (e.g. "mmHg", "kg")

STUDY_VARIABLE  junction table

Bridges STUDY and VARIABLE in a many-to-many relationship: a study uses many variables, and the same variable (e.g. "age") can appear in many studies. This is where the metadata about how a variable is used in a specific study is stored.

๐Ÿ”‘study_idComposite PK + FK โ†’ STUDY
๐Ÿ”‘variable_idComposite PK + FK โ†’ VARIABLE
is_longitudinalWhether this variable is collected at every visit or just once at enrolment
is_requiredWhether this field is mandatory for data entry in this study
Participants & Visits

PARTICIPANT

Stores pseudonymised participant records. There is no name, date of birth, or national ID in this table โ€” only a system-generated code. The link to real patient identity is stored in a completely separate database (IDENTITY_LINK), as required by GDPR and clinical trial regulations.

๐Ÿ”‘participant_idAuto-generated surrogate primary key
participant_codeStudy-specific pseudonymous code (e.g. "PT-0042")
enrollment_dateDate the participant joined the study
status"active", "withdrawn", "completed"
๐Ÿ”—STUDY_study_idFK โ†’ STUDY. Which study this participant belongs to
Identity separation: the participant_code is the only link to the IDENTITY_LINK table stored in an external DB.

VISIT

Models the longitudinal structure of clinical trials. Each row is one hospital visit by one participant. Clinical observations and file assets are always attached to a visit, ensuring all data is time-stamped and traceable.

๐Ÿ”‘visit_idPrimary key
visit_dateDate of the visit
visit_labelProtocol label (e.g. "Baseline", "Month 3", "Follow-up")
visit_numberSequential number within the study protocol
๐Ÿ”—PARTICIPANT_participant_idFK โ†’ PARTICIPANT. Who attended this visit
Clinical Data

OBSERVATION

Stores scalar clinical measurements โ€” a single numeric or text value for a given variable at a given visit. This table is the main store for structured clinical data (blood pressure, lab results, survey scores, etc.).

๐Ÿ”‘observation_idPrimary key
observed_atExact timestamp of measurement
valueThe measured value (stored as VARCHAR; type is governed by VARIABLE.data_kind)
๐Ÿ”—VISIT_visit_idFK โ†’ VISIT. When and for whom this was measured
๐Ÿ”—VARIABLE_variable_idFK โ†’ VARIABLE. What was measured

FILE_ASSET

Handles binary clinical data โ€” MRI images, genomic sequencing files, ECG recordings, etc. Files are not stored in the database itself; instead, this table stores a URI pointing to the file in an object store or file system.

๐Ÿ”‘file_idPrimary key (UUID or auto-increment)
storage_uriPath or URL to the file in external storage
file_nameOriginal filename for display
๐Ÿ”—VISIT_visit_idFK โ†’ VISIT. Which visit produced this file
๐Ÿ”—VARIABLE_variable_idFK โ†’ VARIABLE. What type of data this file represents
Users & Access Control

app_user

Application users โ€” the people who log in to enter or review data. Kept entirely separate from clinical participants to avoid any data confusion.

๐Ÿ”‘user_idPrimary key
emailLogin identifier (unique)
full_nameDisplay name
is_activeSoft-delete / deactivation flag

ROLE

Defines the permission levels in the system. Three roles were specified: General Admin (full access), Supervisor (read all, approve entries), and Data Entry (write access to their assigned studies only).

๐Ÿ”‘role_idPrimary key
role_name"admin", "supervisor", "data_entry"

app_user_has_ROLE  junction

Many-to-many bridge between users and roles. A user can hold multiple roles (e.g. admin of one study, supervisor of another).

๐Ÿ”‘app_user_user_idComposite PK + FK โ†’ app_user
๐Ÿ”‘ROLE_role_idComposite PK + FK โ†’ ROLE

USER_STUDY_ACCESS

Controls which users can access which studies, and at what permission level. This allows a data-entry user to be restricted to exactly the studies they are responsible for, while admins have global access.

๐Ÿ”‘user_idComposite PK + FK โ†’ app_user
๐Ÿ”‘stuy_idComposite PK + FK โ†’ STUDY
access_level"read", "write", "admin" โ€” fine-grained per study

Key Design Decisions

GDPR Identity Separation

PARTICIPANT contains zero personally identifiable information. The participant_code is a system-generated opaque token. The mapping to real patient identity (MRN, name, national ID) lives in an IDENTITY_LINK table stored in a completely separate database โ€” physically and logically isolated, accessible only to authorised clinical staff, not to the research application.

EAV Dynamic Variables

Rather than hard-coding columns like blood_pressure or tumour_size, the schema uses a VARIABLE catalogue and OBSERVATION rows โ€” a controlled Entity-Attribute-Value pattern. This lets completely different studies reuse the same tables without schema migrations, while the known set of variables ensures data integrity.

1:N Longitudinal Visits

All clinical data โ€” both scalar OBSERVATION values and FILE_ASSET entries โ€” hangs off a VISIT row, not directly off PARTICIPANT. This means every measurement is automatically time-stamped and ordered. The is_longitudinal flag on STUDY_VARIABLE further distinguishes variables collected once (e.g. sex) from those collected at every visit (e.g. weight).

RBAC Role-Based Access

The combination of ROLE (global permission type), app_user_has_ROLE (user-role assignment), and USER_STUDY_ACCESS (per-study permission level) implements a two-layer access control system. A supervisor can see all data for their assigned studies; a data-entry user can only write to theirs; a general admin has unrestricted access.

What I Learned

This was the most complex data modelling exercise I had tackled, because the requirements were genuinely contradictory at first glance: the schema had to be flexible (different studies, different variables) yet rigid enough to be queryable (you can't analyse data you can't find). Resolving that tension through the VARIABLE catalogue and STUDY_VARIABLE junction โ€” rather than reaching for a schema-less NoSQL approach โ€” taught me when relational design is the right tool even for variable-structure data.

The GDPR identity-separation requirement was the most interesting constraint. It forced me to think beyond the database itself and consider system architecture: what lives where, who can access what, and how the two databases talk to each other only through controlled, auditable queries. This is a pattern I expect to encounter often in health data engineering.