Standard Operating Procedure: De Novo Genome Assembly

1KSA Genome Assembly Pipeline — SLURM / Conda

Version: 1.0
Date: 2026-02-19
Queue: agrp
Conda environment: 1ksa_assembly
Pipeline tools: KMC · NanoPlot · Chopper · Kraken2 · KrakenTools · Flye · Hifiasm · Racon · minimap2 · BUSCO · QUAST · samtools


Overview

This SOP describes how to assemble a draft genome from Oxford Nanopore long-read sequencing data using the 1KSA Nextflow pipeline, adapted for a SLURM cluster using a conda environment.

The pipeline has eight stages:

[0] K-mer analysis        → estimate genome size and coverage
[1] QC + trimming         → NanoPlot + Chopper
[2] Decontamination       → Kraken2 + extract_kraken_reads.py
[3] Assembly              → Flye (< 3 Gb) or Hifiasm (≥ 3 Gb)
[4] Mapping               → minimap2
[5] Polishing             → Racon (Flye only)
[6] Evaluation            → BUSCO + QUAST
[7] Report generation     → generate_report.sh

Stages 1, 2, 4, 5, and 6 are managed by Nextflow and submitted automatically to SLURM. Stages 0, 3, and 7 are submitted as standalone SLURM jobs or run directly.


Prerequisites

  • Access to the SLURM cluster with queue agrp
  • Conda environment 1ksa_assembly installed with all tools
  • BUSCO lineage database downloaded (see step 1.3)
  • Raw FASTQ file (concatenated, basecalling already done)
  • A logs/ directory in the pipeline folder (SLURM opens the log file before executing the script, so this must exist before any sbatch call):
mkdir -p logs

Step 0: Prepare Your Environment

0.1 Log in to the server

ssh username@your-server.ac.za

0.2 Clone the pipeline repository

cd /path/to/your/working/directory
git clone https://github.com/DIPLOMICS-SA/Genome-Assembly-Pipeline-Nextflow.git
cd Genome-Assembly-Pipeline-Nextflow

Or copy your adapted pipeline files into the working directory.

0.3 Download the BUSCO lineage database (once per server)

conda activate 1ksa_assembly
busco --download eukaryota_odb10   # change lineage if needed

# Verify download:
ls ./busco_downloads/lineages/eukaryota_odb10

Other lineage options: viridiplantae_odb10, insecta_odb10

0.4 Prepare your FASTQ file

If your reads are split across multiple files, or the files are compressed:

# Concatenate multiple FASTQ files
cat *.fastq > species_name_fastq_pass_con.fastq

# Or, if gzipped:
gunzip -c *.fastq.gz > species_name_fastq_pass_con.fastq
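As a quick sanity check on the concatenated file, you can count the reads (a FASTQ record is exactly four lines). The helper below is illustrative, not part of the pipeline:

```shell
# Count records in an (uncompressed) FASTQ file: one read = four lines.
count_fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Example:
# count_fastq_reads species_name_fastq_pass_con.fastq
```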

Step 1: K-mer Analysis

This is mandatory before assembly. It estimates genome size and read coverage — the two key parameters for Flye assembly.

1.1 Submit the k-mer job

Note: SLURM opens logs/kmer_<JOBID>.out before running the script, so the logs/ directory must exist first. If you have not already done so (see Prerequisites):

mkdir -p logs

sbatch submit_kmer.sh /path/to/species_name_fastq_pass_con.fastq species_name

1.2 Monitor the job

squeue -u $USER
# or watch the log:
tail -f logs/kmer_<JOBID>.out

1.3 Read the results

cat k_mers_Stats_species_name.txt

Example output:

Species_name: k-mer= 17
Total input bases 153410333621
Peaks or Plateaus detected=2
Ploidy = 2n
2n Genome Length=0.87 Gb at 176 X Coverage
Expected Assembly Length if fully collapsed=0.87 Gb at 176 X Coverage

View the k-mer plot (download to local machine):

# On your local machine:
scp username@your-server.ac.za:/path/to/pipeline/species_name_k_mers.png .

1.4 Record: genome size and coverage

Note the genome length (e.g. 0.87 Gb, entered as 0.87g in params.config) and coverage (e.g. 176) — you will need these in the next step.
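If you prefer to script this step, a small helper (illustrative; it assumes the "2n Genome Length" line format shown in step 1.3) can print the two values in the form params.config expects:

```shell
# Extract genome size and coverage from the k-mer stats file and print
# them as params.config-style assignments (e.g. genome_size='0.87g').
parse_kmer_stats() {
    local line size cov
    line=$(grep -m1 '2n Genome Length' "$1")
    size=$(printf '%s\n' "$line" | grep -o '[0-9.]\+ Gb' | sed 's/ Gb/g/')
    cov=$(printf '%s\n' "$line" | grep -o '[0-9]\+ X' | grep -o '[0-9]\+')
    printf "genome_size='%s'\nflye_coverage=%s\n" "$size" "$cov"
}

# Example:
# parse_kmer_stats k_mers_Stats_species_name.txt
```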


Step 2: Configure Assembly Parameters

nano params.config

Edit the following values:

species_name='species_name'    # Your species name (no spaces)
assembler='flye'               # 'flye' for < 3 Gb; 'hifiasm' for ≥ 3 Gb

threads=15
LINEAGE='eukaryota_odb10'      # Change if needed

# Flye only (from k-mer analysis):
genome_size='0.87g'            # From k_mers_Stats file
flye_coverage=176              # From k_mers_Stats file
flye_read_type='nano-raw'      # Use 'nano-hq' for Q15+ reads

Choosing the assembler

Genome size   Assembler   Notes
< 3 Gb        flye        Runs through full pipeline including Racon polishing
≥ 3 Gb        hifiasm     No polishing step; BUSCO + QUAST run directly on assembly
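The rule in this table is simple enough to script; a one-liner such as the following (illustrative only) applies it to the genome size estimated in step 1:

```shell
# Print the assembler implied by an estimated genome size in Gb.
choose_assembler() {
    # $1 = estimated genome size in Gb (from the k-mer analysis)
    awk -v g="$1" 'BEGIN { print (g >= 3) ? "hifiasm" : "flye" }'
}

# choose_assembler 0.87   # -> flye
```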

Choosing read type for Flye

Read quality                      Parameter
Q > 10 (standard Nanopore)        nano-raw
Q > 15 (high-accuracy Nanopore)   nano-hq
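The same table can be applied programmatically to the mean read quality reported by NanoPlot (a sketch; the threshold comes from the table above):

```shell
# Print the Flye read-type flag implied by a mean read quality score.
choose_read_type() {
    # $1 = mean read quality, e.g. from NanoPlot's NanoStats output
    awk -v q="$1" 'BEGIN { print (q > 15) ? "nano-hq" : "nano-raw" }'
}

# choose_read_type 12   # -> nano-raw
```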

Step 3: Run the Assembly Pipeline

3.1 Submit the main pipeline

sbatch submit_slurm.sh /path/to/species_name_fastq_pass_con.fastq

This single job runs all pipeline stages in order:

1. QC + trimming (Nextflow → SLURM)
2. Decontamination (Nextflow → SLURM)
3. Assembly — submitted as its own SLURM job (submit_flye.sh or submit_hifiasm.sh); master.sh waits for it
4. Mapping (Nextflow → SLURM)
5. Polishing, if Flye (Nextflow → SLURM)
6. BUSCO + QUAST evaluation (Nextflow → SLURM)

3.2 Monitor progress

# Top-level job
squeue -u $USER

# All SLURM jobs including Nextflow-submitted sub-jobs
watch squeue -u $USER

# Live log of the main job
tail -f genome_assembly_<JOBID>.out

3.3 Resume if interrupted

The pipeline uses Nextflow's -resume flag automatically. If the top-level job is killed, just resubmit:

sbatch submit_slurm.sh /path/to/reads.fastq

Assembly steps (Flye/Hifiasm) also have built-in resume logic.


Step 4: Check Assembly Outputs

After the pipeline finishes, check the results:

# BUSCO summary
cat results/Busco_results/Busco_output/short_summary.specific.*.txt

# QUAST report
cat results/quast_report/Quast_result/report.txt

# Final assembly
ls results/Racon_results/     # Flye path
ls results/Hifiasm_results/   # Hifiasm path

BUSCO completeness > 80% is generally acceptable for a draft assembly.
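To apply that threshold programmatically, the completeness percentage can be pulled from the summary's C: field (standard BUSCO short-summary notation; paths as above):

```shell
# Print the BUSCO completeness percentage from a short_summary file
# (the C: field, e.g. "C:95.2%[S:94.0%,D:1.2%],F:1.5%,M:3.3%,n:255").
busco_complete() {
    grep -o 'C:[0-9.]\+%' "$1" | head -n1 | tr -d 'C:%'
}

# Example:
# pct=$(busco_complete results/Busco_results/Busco_output/short_summary.specific.*.txt)
# awk -v p="$pct" 'BEGIN { exit !(p >= 80) }' || echo "completeness below 80%"
```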


Step 5: Generate the Final Report

Once you are satisfied with the assembly quality, generate the consolidated report:

bash generate_report.sh

This script (run from the pipeline root directory, no job submission needed):

1. Sorts the SAM file and calculates mapping coverage
2. Collects all key output files
3. Compiles a single report text file with all QC metrics
4. Appends software version information
5. Organises outputs into two folders:
   - results/<species_name>/ — primary outputs (assemblies, BUSCO, QUAST, NanoPlot, report)
   - results/<species_name>_other_results_outputs/ — raw pipeline directories

Primary output files in results/<species_name>/

File                                   Description
<species>_report.txt                   Consolidated assembly report
<species>_flye_assembly.fasta          Raw Flye assembly
<species>_racon_polished.fasta         Polished assembly (Flye only)
<species>_hifiasm_assembly.fasta       Hifiasm assembly
<species>_busco_summary.txt            BUSCO completeness summary
<species>_quast_report.txt             QUAST assembly statistics
<species>_NanoStats_before_trim.txt    Raw read QC
<species>_NanoStats_after_trim.txt     Trimmed read QC
<species>_NanoPlot_*.html              Interactive QC reports
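A quick loop can confirm that the key outputs listed above actually exist (file names assumed from the table; substitute your species name):

```shell
# check_outputs SPECIES: report any missing primary output files.
check_outputs() {
    local sp="$1" f
    for f in report.txt busco_summary.txt quast_report.txt; do
        [ -e "results/${sp}/${sp}_${f}" ] || echo "missing: results/${sp}/${sp}_${f}"
    done
}

# check_outputs species_name
```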

Parameter Reference

Assembly parameters (params.config)

Parameter        Description                             Example
species_name     Species identifier (no spaces)          Acacia_karroo
assembler        flye or hifiasm                         flye
threads          CPUs for Nextflow-managed steps         15
kraken_db        Path to Kraken2 database directory      /data/kraken2_db
LINEAGE          BUSCO lineage database                  eukaryota_odb10
genome_size      From k-mer analysis (Flye only)         0.87g
flye_coverage    From k-mer analysis (Flye only)         176
flye_read_type   nano-raw or nano-hq                     nano-raw

SLURM resource allocations (nextflow.config)

Process                        CPUs   Memory   Walltime
NANOCHECK1/2                   8      32 GB    4h
TRIM                           15     32 GB    8h
DECONTAMINATE                  15     64 GB    12h
MAPPINGS                       15     64 GB    12h
POLISH1                        15     64 GB    12h
BUSCOstat_final                15     64 GB    24h
QUAST_final                    8      32 GB    8h
Flye (submit_flye.sh)          40     300 GB   72h
Hifiasm (submit_hifiasm.sh)    40     200 GB   48h

Adjust these in nextflow.config and submit_flye.sh / submit_hifiasm.sh based on your actual genome size and cluster limits.
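For reference, the Flye row above corresponds to an #SBATCH header along these lines (a sketch only: directive values are taken from the table, the partition from the Queue listed at the top of this SOP, and the --output name is illustrative):

```shell
#!/bin/bash
#SBATCH --partition=agrp
#SBATCH --cpus-per-task=40
#SBATCH --mem=300G
#SBATCH --time=72:00:00
#SBATCH --output=logs/flye_%j.out
```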


Troubleshooting

Problem                          Solution
K-mer job finds no peaks         Check read quality; try increasing the k-mer size in kmer-Analysis.sh
Flye job runs out of memory      Increase --mem in submit_flye.sh
Flye job hits walltime           Resubmit submit_slurm.sh — Flye will resume automatically
BUSCO completeness < 50%         Check read depth and quality; consider re-basecalling with a newer model
Nextflow cannot find conda env   Verify conda activate 1ksa_assembly works interactively on the cluster
sacct shows FAILED state         Check logs/ files for the failing step
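When sacct reports FAILED, a quick scan of the log directory often finds the culprit; the helper below (the "error" pattern and logs/*.out naming are assumptions based on this SOP's conventions) lists recent logs that mention an error:

```shell
# find_error_logs DIR: list up to three log files under DIR mentioning "error".
find_error_logs() {
    grep -l -i 'error' "$1"/*.out 2>/dev/null | tail -n 3
}

# find_error_logs logs
```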

Quick-Start Checklist

[ ] 0. Log in to server and navigate to pipeline directory
[ ] 1. Concatenate/unzip FASTQ files
[ ] 2. mkdir -p logs   ← required before any sbatch call
[ ] 3. sbatch submit_kmer.sh reads.fastq species_name
[ ] 4. Check k_mers_Stats_<species>.txt — note genome size and coverage
[ ] 5. Edit params.config with correct values (including kraken_db path)
[ ] 6. sbatch submit_slurm.sh reads.fastq
[ ] 7. Monitor with: watch squeue -u $USER
[ ] 8. Check BUSCO and QUAST results
[ ] 9. bash generate_report.sh
[ ] 10. Review results/<species_name>/<species>_report.txt