Standard Operating Procedure: De Novo Genome Assembly

1KSA Genome Assembly Pipeline — SLURM / Conda

Version: 1.0
Date: 2026-02-19
Queue: agrp
Conda environment: 1ksa_assembly
Pipeline tools: KMC · NanoPlot · Chopper · Kraken2 · KrakenTools · Flye · Hifiasm · Racon · minimap2 · BUSCO · QUAST · samtools


Overview

This SOP describes how to assemble a draft genome from Oxford Nanopore long-read sequencing data using the 1KSA Nextflow pipeline, adapted for a SLURM cluster using a conda environment.

The pipeline has eight stages:

[0] K-mer analysis        → estimate genome size and coverage
[1] QC + trimming         → NanoPlot + Chopper
[2] Decontamination       → Kraken2 + extract_kraken_reads.py
[3] Assembly              → Flye (< 3 Gb) or Hifiasm (≥ 3 Gb)
[4] Mapping               → minimap2
[5] Polishing             → Racon (Flye only)
[6] Evaluation            → BUSCO + QUAST
[7] Report generation     → generate_report.sh

Stages 1, 2, 4, 5, and 6 are managed by Nextflow and submitted automatically to SLURM. Stages 0, 3, and 7 are submitted as standalone SLURM jobs or run directly.


Prerequisites

  • Access to the SLURM cluster with queue agrp
  • Conda environment 1ksa_assembly installed with all tools
  • BUSCO lineage database downloaded (see step 1.3)
  • Raw FASTQ file (concatenated, basecalling already done)
  • A logs/ directory in the pipeline folder (SLURM opens the log file before executing the script, so this must exist before any sbatch call):
mkdir -p logs

Step 0: Prepare Your Environment

0.1 Log in to the server

ssh username@your-server.ac.za

0.2 Clone the pipeline repository

cd /path/to/your/working/directory
git clone https://github.com/DIPLOMICS-SA/Genome-Assembly-Pipeline-Nextflow.git
cd Genome-Assembly-Pipeline-Nextflow

Or copy your adapted pipeline files into the working directory.

0.3 Download the BUSCO lineage database (once per server)

conda activate 1ksa_assembly
busco --download eukaryota_odb10   # change lineage if needed

# Verify download:
ls ./busco_downloads/lineages/eukaryota_odb10

Other lineage options: viridiplantae_odb10, insecta_odb10

0.4 Prepare your FASTQ file

If your reads are split across multiple files, or the files are compressed:

# Concatenate multiple FASTQ files
cat *.fastq > species_name_fastq_pass_con.fastq

# Or, if gzipped:
gunzip -c *.fastq.gz > species_name_fastq_pass_con.fastq
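As a quick sanity check on the concatenated file, you can count the reads (a FASTQ record is exactly four lines). The helper below is illustrative, not part of the pipeline:

```shell
# Count records in an (uncompressed) FASTQ file: one read = four lines.
count_fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Example:
# count_fastq_reads species_name_fastq_pass_con.fastq
```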

Step 1: K-mer Analysis

This is mandatory before assembly. It estimates genome size and read coverage — the two key parameters for Flye assembly.

1.1 Submit the k-mer job

Note: SLURM opens logs/kmer_<JOBID>.out before running the script, so the logs/ directory must exist first. If you have not already done so (see Prerequisites):

mkdir -p logs

sbatch submit_kmer.sh /path/to/species_name_fastq_pass_con.fastq species_name

1.2 Monitor the job

squeue -u $USER
# or watch the log:
tail -f logs/kmer_<JOBID>.out

1.3 Read the results

cat k_mers_Stats_species_name.txt

Example output:

Species_name: k-mer= 17
Total input bases 153410333621
Peaks or Plateaus detected=2
Ploidy = 2n
2n Genome Length=0.87 Gb at 176 X Coverage
Expected Assembly Length if fully collapsed=0.87 Gb at 176 X Coverage

View the k-mer plot (download to local machine):

# On your local machine:
scp username@your-server.ac.za:/path/to/pipeline/species_name_k_mers.png .

1.4 Record: genome size and coverage

Note the genome length (e.g. 0.87 Gb, entered as 0.87g in params.config) and coverage (e.g. 176) — you will need these in the next step.
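If you prefer to script this step, a small helper (illustrative; it assumes the "2n Genome Length" line format shown in step 1.3) can print the two values in the form params.config expects:

```shell
# Extract genome size and coverage from the k-mer stats file and print
# them as params.config-style assignments (e.g. genome_size='0.87g').
parse_kmer_stats() {
    local line size cov
    line=$(grep -m1 '2n Genome Length' "$1")
    size=$(printf '%s\n' "$line" | grep -o '[0-9.]\+ Gb' | sed 's/ Gb/g/')
    cov=$(printf '%s\n' "$line" | grep -o '[0-9]\+ X' | grep -o '[0-9]\+')
    printf "genome_size='%s'\nflye_coverage=%s\n" "$size" "$cov"
}

# Example:
# parse_kmer_stats k_mers_Stats_species_name.txt
```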


Step 2: Configure Assembly Parameters

nano params.config

Edit the following values:

species_name='species_name'    # Your species name (no spaces)
assembler='flye'               # 'flye' for < 3 Gb; 'hifiasm' for ≥ 3 Gb

threads=15
LINEAGE='eukaryota_odb10'      # Change if needed

# Flye only (from k-mer analysis):
genome_size='0.87g'            # From k_mers_Stats file
flye_coverage=176              # From k_mers_Stats file
flye_read_type='nano-raw'      # Use 'nano-hq' for Q15+ reads

Choosing the assembler

Genome size   Assembler   Notes
< 3 Gb        flye        Runs through full pipeline including Racon polishing
≥ 3 Gb        hifiasm     No polishing step; BUSCO + QUAST run directly on assembly
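The rule in this table is simple enough to script; a one-liner such as the following (illustrative only) applies it to the genome size estimated in step 1:

```shell
# Print the assembler implied by an estimated genome size in Gb.
choose_assembler() {
    # $1 = estimated genome size in Gb (from the k-mer analysis)
    awk -v g="$1" 'BEGIN { print (g >= 3) ? "hifiasm" : "flye" }'
}

# choose_assembler 0.87   # -> flye
```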

Choosing read type for Flye

Read quality                      Parameter
Q > 10 (standard Nanopore)        nano-raw
Q > 15 (high-accuracy Nanopore)   nano-hq
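The same table can be applied programmatically to the mean read quality reported by NanoPlot (a sketch; the threshold comes from the table above):

```shell
# Print the Flye read-type flag implied by a mean read quality score.
choose_read_type() {
    # $1 = mean read quality, e.g. from NanoPlot's NanoStats output
    awk -v q="$1" 'BEGIN { print (q > 15) ? "nano-hq" : "nano-raw" }'
}

# choose_read_type 12   # -> nano-raw
```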

Step 3: Run the Assembly Pipeline

3.1 Submit the main pipeline

sbatch submit_slurm.sh /path/to/species_name_fastq_pass_con.fastq

This single job runs all pipeline stages in order:

1. QC + trimming (Nextflow → SLURM)
2. Decontamination (Nextflow → SLURM)
3. Assembly — submitted as its own SLURM job (submit_flye.sh or submit_hifiasm.sh); master.sh waits for it
4. Mapping (Nextflow → SLURM)
5. Polishing, if Flye (Nextflow → SLURM)
6. BUSCO + QUAST evaluation (Nextflow → SLURM)

3.2 Monitor progress

# Top-level job
squeue -u $USER

# All SLURM jobs including Nextflow-submitted sub-jobs
watch squeue -u $USER

# Live log of the main job
tail -f genome_assembly_<JOBID>.out

3.3 Resume if interrupted

The pipeline uses Nextflow's -resume flag automatically. If the top-level job is killed, just resubmit:

sbatch submit_slurm.sh /path/to/reads.fastq

Assembly steps (Flye/Hifiasm) also have built-in resume logic.


Step 4: Check Assembly Outputs

After the pipeline finishes, check the results:

# BUSCO summary
cat results/Busco_results/Busco_output/short_summary.specific.*.txt

# QUAST report
cat results/quast_report/Quast_result/report.txt

# Final assembly
ls results/Racon_results/     # Flye path
ls results/Hifiasm_results/   # Hifiasm path

BUSCO completeness > 80% is generally acceptable for a draft assembly.
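To apply that threshold programmatically, the completeness percentage can be pulled from the summary's C: field (standard BUSCO short-summary notation; paths as above):

```shell
# Print the BUSCO completeness percentage from a short_summary file
# (the C: field, e.g. "C:95.2%[S:94.0%,D:1.2%],F:1.5%,M:3.3%,n:255").
busco_complete() {
    grep -o 'C:[0-9.]\+%' "$1" | head -n1 | tr -d 'C:%'
}

# Example:
# pct=$(busco_complete results/Busco_results/Busco_output/short_summary.specific.*.txt)
# awk -v p="$pct" 'BEGIN { exit !(p >= 80) }' || echo "completeness below 80%"
```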


Step 5: Generate the Final Report

Once you are satisfied with the assembly quality, generate the consolidated report:

bash generate_report.sh

This script (run from the pipeline root directory, no job submission needed):

1. Sorts the SAM file and calculates mapping coverage
2. Collects all key output files
3. Compiles a single report text file with all QC metrics
4. Appends software version information
5. Organises outputs into two folders:
   - results/<species_name>/ — primary outputs (assemblies, BUSCO, QUAST, NanoPlot, report)
   - results/<species_name>_other_results_outputs/ — raw pipeline directories

Primary output files in results/<species_name>/

File                                   Description
<species>_report.txt                   Consolidated assembly report
<species>_flye_assembly.fasta          Raw Flye assembly
<species>_racon_polished.fasta         Polished assembly (Flye only)
<species>_hifiasm_assembly.fasta       Hifiasm assembly
<species>_busco_summary.txt            BUSCO completeness summary
<species>_quast_report.txt             QUAST assembly statistics
<species>_NanoStats_before_trim.txt    Raw read QC
<species>_NanoStats_after_trim.txt     Trimmed read QC
<species>_NanoPlot_*.html              Interactive QC reports
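A quick loop can confirm that the key outputs listed above actually exist (file names assumed from the table; substitute your species name):

```shell
# check_outputs SPECIES: report any missing primary output files.
check_outputs() {
    local sp="$1" f
    for f in report.txt busco_summary.txt quast_report.txt; do
        [ -e "results/${sp}/${sp}_${f}" ] || echo "missing: results/${sp}/${sp}_${f}"
    done
}

# check_outputs species_name
```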

Parameter Reference

Assembly parameters (params.config)

Parameter        Description                             Example
species_name     Species identifier (no spaces)          Acacia_karroo
assembler        flye or hifiasm                         flye
threads          CPUs for Nextflow-managed steps         15
kraken_db        Path to Kraken2 database directory      /data/kraken2_db
LINEAGE          BUSCO lineage database                  eukaryota_odb10
genome_size      From k-mer analysis (Flye only)         0.87g
flye_coverage    From k-mer analysis (Flye only)         176
flye_read_type   nano-raw or nano-hq                     nano-raw

SLURM resource allocations (nextflow.config)

Process                        CPUs   Memory   Walltime
NANOCHECK1/2                   8      32 GB    4h
TRIM                           15     32 GB    8h
DECONTAMINATE                  15     64 GB    12h
MAPPINGS                       15     64 GB    12h
POLISH1                        15     64 GB    12h
BUSCOstat_final                15     64 GB    24h
QUAST_final                    8      32 GB    8h
Flye (submit_flye.sh)          40     300 GB   72h
Hifiasm (submit_hifiasm.sh)    40     200 GB   48h

Adjust these in nextflow.config and submit_flye.sh / submit_hifiasm.sh based on your actual genome size and cluster limits.
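For reference, the Flye row above corresponds to an #SBATCH header along these lines (a sketch only: directive values are taken from the table, the partition from the Queue listed at the top of this SOP, and the --output name is illustrative):

```shell
#!/bin/bash
#SBATCH --partition=agrp
#SBATCH --cpus-per-task=40
#SBATCH --mem=300G
#SBATCH --time=72:00:00
#SBATCH --output=logs/flye_%j.out
```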


Troubleshooting

Problem                          Solution
K-mer job finds no peaks         Check read quality; try increasing the k-mer size in kmer-Analysis.sh
Flye job runs out of memory      Increase --mem in submit_flye.sh
Flye job hits walltime           Resubmit submit_slurm.sh — Flye will resume automatically
BUSCO completeness < 50%         Check read depth and quality; consider re-basecalling with a newer model
Nextflow cannot find conda env   Verify conda activate 1ksa_assembly works interactively on the cluster
sacct shows FAILED state         Check logs/ files for the failing step
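When sacct reports FAILED, a quick scan of the log directory often finds the culprit; the helper below (the "error" pattern and logs/*.out naming are assumptions based on this SOP's conventions) lists recent logs that mention an error:

```shell
# find_error_logs DIR: list up to three log files under DIR mentioning "error".
find_error_logs() {
    grep -l -i 'error' "$1"/*.out 2>/dev/null | tail -n 3
}

# find_error_logs logs
```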

Quick-Start Checklist

[ ] 0. Log in to server and navigate to pipeline directory
[ ] 1. Concatenate/unzip FASTQ files
[ ] 2. mkdir -p logs   ← required before any sbatch call
[ ] 3. sbatch submit_kmer.sh reads.fastq species_name
[ ] 4. Check k_mers_Stats_<species>.txt — note genome size and coverage
[ ] 5. Edit params.config with correct values (including kraken_db path)
[ ] 6. sbatch submit_slurm.sh reads.fastq
[ ] 7. Monitor with: watch squeue -u $USER
[ ] 8. Check BUSCO and QUAST results
[ ] 9. bash generate_report.sh
[ ] 10. Review results/<species_name>/<species>_report.txt