Project 04 Bioinformatics · Web Application · Sequence Analysis

ClustalΩ Web Application

A Flask-based web interface for running multiple sequence alignments with Clustal-Omega directly from the browser. It supports FASTA sequences, UniProt IDs, PDB entry IDs, and file uploads, with automatic input detection and downloadable results.

Python · Flask · Clustal-Omega · HTML/CSS/JS · REST APIs · MSA

Overview

Multiple sequence alignment (MSA) is one of the most fundamental operations in comparative genomics and molecular evolution. It allows researchers to identify conserved regions across sequences, infer evolutionary relationships, and find functional motifs shared across species or protein families.

This project wraps the industry-standard Clustal-Omega command-line tool in a clean web interface so that it can be used without any bioinformatics setup or command-line knowledge. The application runs on a university server and accepts sequences from four different sources, handling everything from raw FASTA input to live database lookups, then returns a formatted alignment with a one-click download.

 Goals

  • Make Clustal-Omega accessible through a browser with no setup
  • Support 4 input methods with automatic format detection
  • Fetch sequences live from UniProt and RCSB PDB APIs
  • Validate input and give meaningful, specific error messages
  • Return results in 7 standard alignment formats, downloadable
  • Handle protein, DNA, and RNA sequences correctly

 Tech Stack

  • Backend: Python 3.10 + Flask
  • Alignment engine: Clustal-Omega 1.2.4 (subprocess)
  • Sequence APIs: UniProt REST, RCSB PDB REST
  • Frontend: Vanilla HTML, CSS, JavaScript
  • Deployment: Gunicorn on university server
  • Version control: GitHub

How It Works

The application is structured around a single Flask route (POST /align) that handles the entire alignment pipeline. When a user submits the form, the backend processes the request in three sequential stages and returns the result, which the frontend then renders in a fourth:

Stage 1: Input Detection

The server inspects the raw input and classifies it automatically. Lines starting with > are treated as FASTA; 4-character tokens like 1HHO are PDB IDs; tokens matching the UniProt accession pattern are fetched from UniProt. No dropdown or selector is required from the user.
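The classification step could be sketched roughly as follows. The project's actual regular expressions are not shown here, so the patterns and the function names classify_token and classify_input are assumptions for illustration; the UniProt pattern is the accession format published in the UniProt help pages.

```python
import re

# Assumed pattern: a classic 4-character PDB ID (digit + 3 alphanumerics),
# optionally followed by a one-letter chain suffix such as the A in 1SBIA.
PDB_ID = re.compile(r"^[1-9][A-Za-z0-9]{3}[A-Za-z]?$")

# UniProt accession format as documented by UniProt.
UNIPROT_ACC = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def classify_token(token):
    """Return 'pdb', 'uniprot', or None for a single whitespace-free token."""
    token = token.upper()
    if PDB_ID.match(token):
        return "pdb"
    if UNIPROT_ACC.match(token):
        return "uniprot"
    return None


def classify_input(raw):
    """Classify the whole submission (hypothetical labels)."""
    stripped = raw.strip()
    if not stripped:
        return "empty"
    if stripped.startswith(">"):
        return "fasta"
    tokens = re.split(r"[\s,;]+", stripped)
    if all(classify_token(t) for t in tokens):
        return "ids"
    # Neither FASTA nor a list of IDs: most likely raw residues pasted
    # without a '>' header, so the caller can explain the expected format.
    return "headerless_or_unknown"
```

Checking the PDB pattern first is safe because PDB IDs start with a digit while UniProt accessions start with a letter, so the two patterns cannot both match the same token.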

Stage 2: Validation

Every FASTA sequence is parsed line-by-line and checked against the IUPAC alphabet for the selected molecule type (protein, DNA, or RNA). Specific errors are raised for missing headers, empty sequences, invalid residue characters, and fewer than 2 sequences, with messages that tell the user exactly what to fix.
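A minimal sketch of this kind of validator is shown below. The names FastaError and validate_fasta, the error wording, and the decision to reject gap characters in unaligned input are all assumptions; only the four error conditions come from the description above.

```python
# IUPAC letter codes per molecule type (gap characters deliberately excluded,
# which is an assumption about how unaligned input should be treated).
ALPHABETS = {
    "protein": set("ACDEFGHIKLMNPQRSTVWYBZXJUO"),
    "dna": set("ACGTRYSWKMBDHVN"),
    "rna": set("ACGURYSWKMBDHVN"),
}


class FastaError(ValueError):
    """Raised with a message describing what the user should fix."""


def validate_fasta(text, seqtype="protein"):
    """Parse FASTA line by line; return a list of (header, sequence) tuples."""
    alphabet = ALPHABETS[seqtype]
    records = []  # each entry is [header, sequence]

    def check_last_not_empty():
        if records and not records[-1][1]:
            raise FastaError(f"Sequence '{records[-1][0]}' has no residues.")

    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            check_last_not_empty()
            records.append([line[1:].strip(), ""])
            continue
        if not records:
            raise FastaError(
                f"Line {lineno}: sequence data found before any '>' header."
            )
        residues = line.upper()
        bad = sorted(set(residues) - alphabet)
        if bad:
            raise FastaError(
                f"Line {lineno}: invalid {seqtype} character(s): {' '.join(bad)}"
            )
        records[-1][1] += residues

    check_last_not_empty()
    if len(records) < 2:
        raise FastaError(
            f"At least 2 sequences are required; found {len(records)}."
        )
    return [(header, seq) for header, seq in records]
```

Reporting the line number and the offending characters is one way to make the message actionable; the real application may word its errors differently.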

Stage 3: Execution

The validated FASTA is written to a temporary file with a unique UUID, and Clustal-Omega is called as a subprocess with the user's chosen options: output format, sequence type (--seqtype), guide-tree iterations, and any extra flags. The result file is stored for download and its content is returned to the frontend as JSON.
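The execution stage might look something like the sketch below. The function names, the 300-second timeout, and the use of --iter for the iteration count are assumptions; -i, -o, --outfmt, --seqtype, --iter, and --force are standard clustalo options.

```python
import subprocess
import uuid
from pathlib import Path


def build_clustalo_command(infile, outfile, outfmt="clustal",
                           seqtype="Protein", iterations=0, extra=()):
    """Build the argument list for clustalo (no shell string involved)."""
    cmd = [
        "clustalo",
        "-i", str(infile),
        "-o", str(outfile),
        f"--outfmt={outfmt}",
        f"--seqtype={seqtype}",
        "--force",  # overwrite the output file if it already exists
    ]
    if iterations > 0:
        cmd.append(f"--iter={iterations}")
    cmd.extend(extra)  # extra flags are assumed to be validated already
    return cmd


def run_alignment(fasta_text, workdir, **options):
    """Write the input to a UUID-named file, run clustalo, return the output."""
    job_id = uuid.uuid4().hex
    infile = Path(workdir) / f"{job_id}.fasta"
    outfile = Path(workdir) / f"{job_id}.aln"  # extension simplified here
    infile.write_text(fasta_text)
    try:
        subprocess.run(
            build_clustalo_command(infile, outfile, **options),
            check=True,           # raise CalledProcessError on non-zero exit
            capture_output=True,  # keep stderr available for error reporting
            text=True,
            timeout=300,          # assumed limit, not from the project
        )
    finally:
        infile.unlink(missing_ok=True)  # input is removed once clustalo exits
    return outfile.read_text()
```

Keeping command construction in its own function makes it possible to test the flag handling without having clustalo installed.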

Stage 4: Output & Download

The frontend renders the alignment in a monospace panel, colour-coding the Clustal conservation line (* fully conserved, : strongly similar, . weakly similar). A download button serves the raw file with the correct extension for the chosen format.

Input Modes

A key design decision was to avoid asking the user to declare what kind of input they are providing. Instead, the backend detects the format automatically using regular expression matching and heuristics:

  • FASTA sequences: pasted directly; the first non-blank line must start with >. If the user pastes raw sequence letters without a header, the app detects this and explains the correct format rather than failing silently.
  • UniProt accession IDs: tokens matching the UniProt pattern (e.g. P69905) trigger a live fetch from uniprot.org/uniprot/{ID}.fasta. Partial failures (one bad ID among many) surface as warnings rather than blocking the alignment.
  • PDB entry IDs: 4-character codes (e.g. 1HHO) are fetched from rcsb.org/fasta/entry/{ID}. A trailing chain letter, such as the A in 1SBIA, is stripped automatically.
  • File upload: drag-and-drop or file-picker upload of .fasta, .fa, .fas, .txt, or .seq files up to 16 MB.
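The ID-based modes can be illustrated with a short sketch of chain-suffix stripping and warning-based partial failure. The https scheme on the URLs, the 10-second timeout, and the names normalise_pdb_id, http_get, and fetch_all are assumptions; the fetch function is injectable so the logic can be exercised without network access.

```python
import re
from urllib.request import urlopen

# Hosts and paths as given above; the https:// scheme is an assumption.
UNIPROT_URL = "https://uniprot.org/uniprot/{id}.fasta"
PDB_URL = "https://rcsb.org/fasta/entry/{id}"


def normalise_pdb_id(token):
    """Return the 4-character entry ID, dropping an optional chain letter."""
    match = re.fullmatch(r"([1-9][A-Z0-9]{3})[A-Z]?", token.upper())
    return match.group(1) if match else None


def http_get(url):
    """Fetch a URL and return the body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def fetch_all(ids, url_template, fetch=http_get):
    """Fetch every ID; failures become warnings instead of aborting."""
    records, warnings = [], []
    for seq_id in ids:
        try:
            text = fetch(url_template.format(id=seq_id))
        except (OSError, ValueError) as exc:  # URLError/HTTPError are OSErrors
            warnings.append(f"{seq_id}: could not be fetched ({exc})")
            continue
        if not text.lstrip().startswith(">"):
            warnings.append(f"{seq_id}: response was not FASTA")
            continue
        records.append(text.strip())
    return "\n".join(records), warnings
```

The caller can then decide whether enough sequences were retrieved to proceed, showing the warnings alongside the alignment.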

Output Formats

The application exposes all seven output formats supported by Clustal-Omega, selectable from a dropdown in the options panel. The result file is always offered for download with the correct file extension:

Clustal (.aln)

The classic Clustal format. Sequences are broken into blocks of 60 columns, with a conservation line below each block showing *, :, and . symbols for conservation level. Best for quick visual inspection.

FASTA / PHYLIP / Stockholm

FASTA is the most portable format for passing alignments to other tools. PHYLIP is required by many phylogenetics programs (e.g. RAxML). Stockholm is used by HMMER and Pfam for profile HMM construction.

MSF / SELEX / Vienna

MSF (GCG format) is accepted by older analysis pipelines. SELEX is used by some HMM tools. Vienna format is specific to RNA secondary structure alignment workflows.
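One simple way to pair each format with a download extension is a lookup table, sketched below. The keys are the format names accepted by clustalo's --outfmt option; apart from .aln for Clustal, the extensions and the helper name output_filename are assumptions rather than the project's actual choices.

```python
# --outfmt value -> download extension.
OUTPUT_FORMATS = {
    "clustal": ".aln",    # stated above
    "fasta": ".fasta",    # assumed
    "phylip": ".phy",     # assumed
    "stockholm": ".sto",  # assumed
    "msf": ".msf",        # assumed
    "selex": ".slx",      # assumed
    "vienna": ".vie",     # assumed
}


def output_filename(job_id, outfmt):
    """Return the result filename for a job, rejecting unknown formats."""
    try:
        return job_id + OUTPUT_FORMATS[outfmt]
    except KeyError:
        raise ValueError(f"Unsupported output format: {outfmt!r}") from None
```

Using the same table to populate the dropdown and to validate the submitted value keeps the two from drifting apart.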

Security Considerations

Because the app runs an external binary on user-provided input, several layers of protection were added to prevent misuse:

  • No shell expansion: clustalo is called via a Python list argument to subprocess.run(), never through a shell string, so user input is never interpreted by a shell.
  • Extra-options blocklist: user-supplied CLI flags are parsed with shlex.split() and rejected if they contain shell metacharacters (; | & ` $ < >).
  • Download path validation: the /download/<filename> route validates the filename against a strict regex before constructing any file path, preventing directory traversal attacks.
  • Upload size cap: Flask's MAX_CONTENT_LENGTH is set to 16 MB. Input files are deleted immediately after the subprocess exits.
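Two of these checks can be sketched in a few lines. The filename pattern below assumes result files are named with a 32-character UUID hex string plus a short extension, which is a guess at the naming scheme; parse_extra_options and is_safe_download_name are hypothetical names.

```python
import re
import shlex

# Shell metacharacters that are rejected outright.
FORBIDDEN_CHARS = set(";|&`$<>")

# Assumed naming scheme: <uuid4 hex>.<short lowercase extension>
SAFE_NAME = re.compile(r"[0-9a-f]{32}\.[a-z0-9]{1,8}")


def parse_extra_options(raw):
    """Split user-supplied flags into a list, rejecting shell metacharacters."""
    if FORBIDDEN_CHARS & set(raw):
        raise ValueError("Extra options contain forbidden shell characters.")
    # shlex.split raises ValueError itself on unbalanced quotes.
    return shlex.split(raw)


def is_safe_download_name(filename):
    """True only if the whole filename matches the expected pattern."""
    # fullmatch avoids the trailing-newline loophole of '^...$' with match().
    return SAFE_NAME.fullmatch(filename) is not None
```

Because the pattern admits no slashes or dots other than the one before the extension, a name such as ../etc/passwd is rejected before any path is built.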

What I Learned

This project forced me to think about the full lifecycle of a bioinformatics tool: not just the science, but the engineering around it. Wrapping a command-line binary in a web interface is deceptively tricky: you have to handle everything the binary would normally tell the user via interactive prompts or manual flag inspection, and surface it clearly in a UI designed for researchers who may not be comfortable with the terminal.

The most interesting technical challenge was the auto-detection logic. The naive approach (just check if it starts with >) breaks immediately when users paste sequences without headers, or when a PDB ID like 1SBI is ambiguous with a short sequence. Building the detection to be both robust and informative, giving users a specific, actionable error rather than "unrecognized format", required careful iteration.

On the deployment side, I learned the practical aspects of running Flask in a production environment using Gunicorn, managing persistent processes on a shared university server, and using GitHub as the deployment pipeline so updates are a single git pull away.