Project 04 Bioinformatics · Web Application · Sequence Analysis

ClustalΩ Web Application

A PHP web interface for running multiple sequence alignments with Clustal-Omega directly from the browser, supporting FASTA sequences, UniProt IDs, PDB entry IDs and file uploads, with four dedicated input modes and downloadable results.

PHP · Clustal-Omega · HTML/CSS/JS · REST APIs · MSA


Overview

Multiple sequence alignment (MSA) is one of the most fundamental operations in comparative genomics and molecular evolution. It allows researchers to identify conserved regions across sequences, infer evolutionary relationships and find functional motifs shared across species or protein families.

This project wraps the industry-standard Clustal-Omega command-line tool in a clean web interface so that it can be used without any bioinformatics setup or command-line knowledge. The application runs on a university server and accepts sequences from four different sources, handling everything from raw FASTA input to live database lookups, then returns a formatted alignment with a one-click download.

Goals

Make Clustal-Omega accessible through a browser with no setup
Support 4 input methods with automatic format detection
Fetch sequences live from UniProt and RCSB PDB APIs
Validate input and give meaningful, specific error messages
Return results in 7 standard alignment formats, downloadable
Handle protein, DNA and RNA sequences correctly

Tech Stack

Backend: PHP 8 (no framework)
Alignment engine: Clustal-Omega 1.2.4 (exec)
Sequence APIs: UniProt REST, RCSB PDB REST
Frontend: Vanilla HTML, CSS, JavaScript
Deployment: PHP on university Apache server
Version control: GitHub

How It Works

The application is structured around a single PHP endpoint (align.php) that handles the entire alignment pipeline. When a user submits the form, the request passes through four sequential stages, from input detection to the rendered, downloadable result:

Stage 1 — Input Detection

The server inspects the raw input and classifies it automatically. Lines starting with > are treated as FASTA; 4-character tokens like 1HHO are PDB IDs; tokens matching the UniProt accession pattern are fetched from UniProt. No dropdown or selector is required from the user.
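The detection step can be sketched roughly as follows. This is a minimal illustration rather than the application's actual code: the function names and return labels are hypothetical, the PDB pattern (a digit followed by three alphanumerics and an optional chain letter) is an assumption, and the UniProt pattern is the accession regular expression published in the UniProt help pages.

```php
<?php
declare(strict_types=1);

// UniProt accession pattern as documented by UniProt (6 or 10 characters).
const UNIPROT_RE = '/^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$/';
// Assumed PDB pattern: digit + 3 alphanumerics, optionally followed by a chain letter.
const PDB_RE = '/^[0-9][A-Z0-9]{3}[A-Z]?$/';

/** Classify a single whitespace-delimited token (hypothetical helper). */
function classifyToken(string $token): string
{
    $t = strtoupper(trim($token));
    if (preg_match(PDB_RE, $t) === 1) {
        return 'pdb';
    }
    if (preg_match(UNIPROT_RE, $t) === 1) {
        return 'uniprot';
    }
    return 'unknown';
}

/** Classify the whole submission; the labels are illustrative only. */
function detectInputType(string $raw): string
{
    $text = trim($raw);
    if ($text === '') {
        return 'empty';
    }
    if ($text[0] === '>') {
        return 'fasta';
    }
    $tokens = preg_split('/[\s,;]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $kinds = array_unique(array_map('classifyToken', $tokens));
    if (!in_array('unknown', $kinds, true)) {
        return 'ids'; // every token is a UniProt or PDB identifier
    }
    // Letters, gaps and whitespace only: most likely a sequence pasted without a header.
    if (preg_match('/^[A-Za-z*\-\s]+$/', $text) === 1) {
        return 'headerless_sequence';
    }
    return 'unknown';
}
```

Returning a dedicated label for header-less sequences is what lets the caller explain the expected FASTA format instead of reporting a generic failure.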

Stage 2 — Validation

Every FASTA sequence is parsed line-by-line and checked against the IUPAC alphabet for the selected molecule type (protein, DNA, or RNA). Specific errors are raised for missing headers, empty sequences, invalid residue characters and fewer than 2 sequences, with messages that tell the user exactly what to fix.
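A line-by-line validator along these lines might look like the sketch below. The alphabets (IUPAC codes plus gap, and a stop symbol for protein), the function name and the error wording are assumptions made for illustration, not the application's real messages.

```php
<?php
declare(strict_types=1);

// Allowed characters per molecule type. The exact sets the app accepts are an assumption.
const ALPHABETS = [
    'protein' => 'ACDEFGHIKLMNPQRSTVWYBZXJUO*-',
    'dna'     => 'ACGTRYSWKMBDHVN-',
    'rna'     => 'ACGURYSWKMBDHVN-',
];

/** Returns a list of human-readable errors; an empty list means the input is valid. */
function validateFasta(string $fasta, string $type): array
{
    $allowed = ALPHABETS[$type] ?? throw new InvalidArgumentException("Unknown type: $type");
    $errors = [];
    $count = 0;       // records seen so far
    $current = null;  // label of the record being read
    $length = 0;      // residues collected for the current record

    foreach (preg_split('/\R/', $fasta) as $i => $line) {
        $lineNo = $i + 1;
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if ($line[0] === '>') {
            if ($current !== null && $length === 0) {
                $errors[] = "Sequence \"$current\" has no residues.";
            }
            $current = substr($line, 1);
            $length = 0;
            $count++;
            continue;
        }
        if ($current === null) {
            $errors[] = "Line $lineNo: sequence data found before any '>' header line.";
            $current = "line $lineNo"; // treat as one unnamed record so the error is reported once
            $count++;
        }
        $residues = strtoupper(preg_replace('/\s+/', '', $line));
        // Remove every allowed character; whatever remains is invalid.
        $invalid = str_replace(str_split($allowed), '', $residues);
        if ($invalid !== '') {
            $chars = implode(', ', array_unique(str_split($invalid)));
            $errors[] = "Line $lineNo: invalid $type character(s): $chars.";
        }
        $length += strlen($residues);
    }
    if ($current !== null && $length === 0) {
        $errors[] = "Sequence \"$current\" has no residues.";
    }
    if ($count < 2) {
        $errors[] = "At least 2 sequences are required; found $count.";
    }
    return $errors;
}
```

Collecting all errors in one pass, each with its line number, means the user can fix the whole submission at once rather than resubmitting after every message.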

Stage 3 — Execution

The validated FASTA is written to a temporary file with a unique random ID, and Clustal-Omega is called via PHP's exec() with the user's chosen options: output format, sequence type (--seqtype), guide-tree iterations, and any extra flags. The result file is stored for download and its content is returned to the frontend as JSON.
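A minimal sketch of this stage, under several assumptions: the function names, the msa_ file prefix and the use of the system temp directory are hypothetical, and --iter is used for the iteration count because the text does not say which of Clustal-Omega's iteration flags the application passes. The quoting shown is what escapeshellarg() produces on a POSIX system.

```php
<?php
declare(strict_types=1);

/** Build the clustalo command line with every argument escaped individually. */
function buildClustaloCommand(
    string $binary,
    string $inFile,
    string $outFile,
    string $outfmt,
    string $seqtype,
    int $iterations
): string {
    $args = [
        '-i', $inFile,
        '-o', $outFile,
        '--outfmt=' . $outfmt,    // e.g. clu, fa, phy, st, msf, selex, vie
        '--seqtype=' . $seqtype,  // Protein, DNA or RNA
        '--iter=' . $iterations,
        '--force',                // overwrite the output file if it already exists
    ];
    return escapeshellarg($binary) . ' ' . implode(' ', array_map('escapeshellarg', $args));
}

/** Write the input, run the aligner, and return a JSON-ready result array. */
function runAlignment(string $fasta, string $outfmt, string $seqtype, int $iterations): array
{
    $id = bin2hex(random_bytes(8)); // 16 hex characters, unique per request
    $dir = sys_get_temp_dir();
    $inFile = "$dir/msa_$id.fasta";
    $outFile = "$dir/msa_$id.out";
    file_put_contents($inFile, $fasta);

    $cmd = buildClustaloCommand('clustalo', $inFile, $outFile, $outfmt, $seqtype, $iterations);
    exec($cmd . ' 2>&1', $log, $exitCode); // capture stderr so failures can be reported

    $alignment = $exitCode === 0 ? file_get_contents($outFile) : null;
    unlink($inFile); // the input is no longer needed once the process has exited

    return [
        'ok'        => $exitCode === 0,
        'alignment' => $alignment,
        'log'       => implode("\n", $log),
        'file'      => basename($outFile),
    ];
}
```

The array returned by runAlignment() could be passed straight to json_encode() for the frontend.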

Stage 4 — Output & Download

The frontend renders the alignment in a monospace panel, colour-coding the Clustal conservation line (* fully conserved, : strongly similar, . weakly similar). A download button serves the raw file with the correct extension for the chosen format.

Input Modes

A key design decision was to avoid asking the user to declare what kind of input they are providing. Instead, the backend detects the format automatically using regular expression matching and heuristics:

FASTA sequences —> pasted directly; the first non-blank line must start with >. If the user pastes raw sequence letters without a header, the app detects this and explains the correct format rather than failing silently.
UniProt accession IDs —> tokens matching the UniProt pattern (e.g. P69905) trigger a live fetch from uniprot.org/uniprot/{ID}.fasta. Partial failures (one bad ID among many) surface as warnings rather than blocking the alignment.
PDB entry IDs —> 4-character codes (e.g. 1HHO) are fetched from rcsb.org/fasta/entry/{ID}. Chain suffixes such as 1SBIA are stripped automatically.
File upload —> drag-and-drop or file-picker upload of .fasta, .fa, .fas, .txt, or .seq files up to 16 MB.
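The two database lookups and the warning behaviour could be sketched as below. The URL paths come from the descriptions above; the https://www. prefixes, the function names and the injected $fetch callable (used here so the logic can be exercised without network access) are assumptions.

```php
<?php
declare(strict_types=1);

/** Build the FASTA download URL for a UniProt accession or PDB entry. */
function sequenceUrl(string $kind, string $id): string
{
    $id = strtoupper(trim($id));
    if ($kind === 'pdb') {
        $id = substr($id, 0, 4); // drop a chain suffix, e.g. 1SBIA -> 1SBI
        return "https://www.rcsb.org/fasta/entry/$id";
    }
    // Path as described in the text; UniProt's current REST documentation
    // lists rest.uniprot.org/uniprotkb/{ID}.fasta for the same lookup.
    return "https://www.uniprot.org/uniprot/$id.fasta";
}

/**
 * Fetch every entry; failed lookups become warnings instead of aborting.
 * $entries is a list of [kind, id] pairs, $fetch maps a URL to a body or null.
 */
function fetchSequences(array $entries, callable $fetch): array
{
    $records = [];
    $warnings = [];
    foreach ($entries as [$kind, $id]) {
        $body = $fetch(sequenceUrl($kind, $id));
        if ($body === null || !str_starts_with(ltrim($body), '>')) {
            $warnings[] = "Could not retrieve $kind entry $id; it was skipped.";
            continue;
        }
        $records[] = rtrim($body);
    }
    return ['fasta' => implode("\n", $records), 'warnings' => $warnings];
}

/** One possible fetcher, based on file_get_contents() with a timeout. */
function httpGet(string $url): ?string
{
    $ctx = stream_context_create(['http' => ['timeout' => 10]]);
    $body = @file_get_contents($url, false, $ctx);
    return $body === false ? null : $body;
}
```

A call such as fetchSequences([['uniprot', 'P69905'], ['pdb', '1HHO']], 'httpGet') would return the combined FASTA text together with any warnings to show next to the alignment.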

Output Formats

The application exposes all seven output formats supported by Clustal-Omega, selectable from a dropdown in the options panel. The result file is always offered for download with the correct file extension:

Clustal (.aln)

The classic Clustal format. Sequences are broken into blocks of 60 columns, with a conservation line below each block showing *, :, and . symbols for conservation level. Best for quick visual inspection.

FASTA / PHYLIP / Stockholm

FASTA is the most portable format for passing alignments to other tools. PHYLIP is required by many phylogenetics programs (e.g. RAxML). Stockholm is used by HMMER and Pfam for profile HMM construction.

MSF / SELEX / Vienna

MSF (GCG format) is accepted by older analysis pipelines. SELEX is used by some HMM tools. Vienna format is specific to RNA secondary structure alignment workflows.
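One way to tie the dropdown choice to both the command-line flag and the download extension is a single lookup table, sketched here. The --outfmt values are the abbreviations listed in Clustal-Omega's help output; apart from .aln, the file extensions and the function name are assumptions.

```php
<?php
declare(strict_types=1);

// Dropdown key => clustalo --outfmt value and download extension.
// Extensions other than "aln" are illustrative guesses.
const OUTPUT_FORMATS = [
    'clustal'   => ['flag' => 'clu',   'ext' => 'aln'],
    'fasta'     => ['flag' => 'fa',    'ext' => 'fasta'],
    'phylip'    => ['flag' => 'phy',   'ext' => 'phy'],
    'stockholm' => ['flag' => 'st',    'ext' => 'sto'],
    'msf'       => ['flag' => 'msf',   'ext' => 'msf'],
    'selex'     => ['flag' => 'selex', 'ext' => 'selex'],
    'vienna'    => ['flag' => 'vie',   'ext' => 'vie'],
];

/** Look up a format; unknown values are rejected, so the table doubles as an allowlist. */
function resolveFormat(string $choice): array
{
    $key = strtolower(trim($choice));
    return OUTPUT_FORMATS[$key]
        ?? throw new InvalidArgumentException("Unsupported output format: $choice");
}
```

Because only values from the table ever reach the command line, the format parameter never carries free-form user text.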

Security Considerations

Because the app runs an external binary on user-provided input, several layers of protection were added to prevent misuse:

Escaped arguments —> clustalo is called via PHP's exec(), which hands the command line to a shell, so every argument is individually escaped with escapeshellarg() to keep user input from being interpreted as shell syntax.
Extra-options blocklist —> user-supplied CLI flags are split into individual tokens and rejected if they contain shell metacharacters (; | & ` $ < >).
Download path validation —> the /download/<filename> route validates the filename against a strict regex before constructing any file path, preventing directory traversal attacks.
Upload size cap —> PHP's upload_max_filesize is set to 16 MB. Input files are deleted immediately after the alignment exits.
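Two of these checks are small enough to sketch directly. The function names and the result-file naming scheme (msa_ followed by 16 hex characters and a short extension) are hypothetical; only the metacharacter list is taken from the description above.

```php
<?php
declare(strict_types=1);

/** Reject extra options containing shell metacharacters, then split them into tokens. */
function splitExtraOptions(string $extra): array
{
    if (preg_match('/[;|&`$<>]/', $extra) === 1) {
        throw new InvalidArgumentException('Extra options contain forbidden shell characters.');
    }
    return preg_split('/\s+/', trim($extra), -1, PREG_SPLIT_NO_EMPTY);
}

/**
 * Return the full path for a requested download, or null if the name is not
 * exactly of the expected form. Slashes and dots outside the single extension
 * separator can never match, which rules out "../" traversal.
 */
function safeDownloadPath(string $dir, string $name): ?string
{
    // \z anchors at the very end of the string, so a trailing newline is rejected too.
    if (preg_match('/^msa_[0-9a-f]{16}\.[a-z]{2,6}\z/', $name) !== 1) {
        return null;
    }
    return rtrim($dir, '/') . '/' . $name;
}
```

Each token returned by splitExtraOptions() would still be passed through escapeshellarg() before it is appended to the command, so the blocklist acts as a second layer rather than the only defence.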

What I Learned

This project forced me to think about the full lifecycle of a bioinformatics tool: not just the science, but the engineering around it. Wrapping a command-line binary in a web interface is deceptively tricky. Everything the binary would normally communicate through interactive prompts or manual flag inspection has to be surfaced clearly in a UI designed for researchers who may not be comfortable with the terminal.

The most interesting technical challenge was the auto-detection logic. The naive approach (just check if it starts with >) breaks immediately when users paste sequences without headers, or when a PDB ID like 1SBI is ambiguous with a short sequence. Building the detection to be both robust and informative, giving users a specific, actionable error rather than "unrecognized format", required careful iteration.

On the deployment side, I learned the practical aspects of running a PHP application on a shared university Apache server: no virtual environments or process managers are needed, just files dropped into public_html and the path to the binary configured via .htaccess. Updates are a single git pull away.