Standard Operating Procedure: Storage Management
SAIAB HPC Server - lab417.saiab.ac.za
| Document Information | |
|---|---|
| Version | 1.0 |
| Date | 2026-02-10 |
| Author | SAIAB HPC Administration |
| Contact | EP.deVilliers@saiab.nrf.ac.za |
Table of Contents
- Purpose and Scope
- Prerequisites
- Storage System Overview
- Procedure: Running Disk Monitoring System
- Procedure: Responding to Low Disk Space Alerts
- Procedure: Assisting Users with Storage Optimization
- Scheduled Maintenance
- Troubleshooting
- Best Practices
- Appendix A: Disk Monitoring System Details
- Appendix B: Common Storage Optimization Commands
- Appendix C: Emergency Procedures
- Appendix D: Report Interpretation Guide
1. Purpose and Scope
1.1 Purpose
This Standard Operating Procedure (SOP) provides comprehensive instructions for managing storage resources on the SAIAB HPC bioinformatics server (lab417.saiab.ac.za). It ensures consistent monitoring, optimization, and maintenance of disk space across all filesystems.
1.2 Scope
This SOP covers:
- Automated disk monitoring and reporting system operations
- Manual storage assessments and interventions
- User support for storage optimization
- Emergency procedures for critical disk space situations
- Storage best practices for bioinformatics data
1.3 Intended Audience
- System administrators with sudo/root privileges
- HPC support staff
- Anyone responsible for maintaining server storage health
2. Prerequisites
2.1 Required Privileges
- Root or sudo access to lab417.saiab.ac.za
- SSH access to the server
- Permission to read user directories for scanning
2.2 System Requirements
- Disk monitoring system installed at: /home/evilliers/work/sysadmin/
- Mail server configured for email notifications
- Sufficient access to all monitored filesystems:
  - /home (user home directories)
  - /data (shared data storage)
  - /mnt/agrp/lab417 (NFS-mounted research storage)
2.3 Knowledge Requirements
- Basic Linux system administration
- Understanding of filesystem concepts (disk usage, inodes, permissions)
- Familiarity with bioinformatics file formats (FASTQ, SAM/BAM)
- SLURM job scheduler basics (for storage-intensive operations)
3. Storage System Overview
3.1 Filesystem Layout
| Mount Point | Type | Purpose | Monitored |
|---|---|---|---|
| /home | Local | User home directories | Yes |
| /data | Local | Shared bioinformatics data | Yes |
| /mnt/agrp/lab417 | NFS | Research group storage | Yes |
| /tmp | tmpfs | Temporary files | No |
3.2 Current Status (Baseline)
Based on typical operations:
- Disk Usage: 86-94% utilization across filesystems
- User Count: ~38 active users
- Potential Savings: ~796GB identified per scan
  - FASTQ compression: ~632GB (70-80% savings)
  - SAM to BAM conversion: ~164GB (50-80% savings)
3.3 Automated Monitoring
The disk monitoring system runs automatically every Sunday at 2:00 AM via cron job. It:
- Scans all user directories in monitored filesystems
- Identifies optimization opportunities
- Generates individual user reports
- Emails reports to users and administrators
- Creates backup reports in home directories
3.4 Alert Thresholds
- Large File: Files >100MB flagged for review
- Old File: Files >365 days flagged for archival
- User Reporting: Only users with >1GB total usage receive reports
- Critical Space: <10% free space requires immediate action
4. Procedure: Running Disk Monitoring System
4.1 Testing and Dry Runs
Step 1: Test on Single User
Before running full system scans, test on a single user account:
# SSH to server
ssh admin@lab417.saiab.ac.za
# Navigate to sysadmin directory
cd /home/evilliers/work/sysadmin
# Run dry-run test on single user (no emails sent)
sudo ./bin/disk_monitor.sh --dry-run --user evilliers --verbose
Step 2: Review Test Output
Check the output for:
- Files correctly identified (FASTQ, SAM, large files, old files)
- Accurate size calculations
- Proper space savings estimates
- Report formatting
- No errors or warnings
Step 3: Test Full System (Dry Run)
Once single-user test succeeds:
# Test full scan without NFS (faster for testing)
sudo ./bin/disk_monitor.sh --dry-run --skip-nfs --verbose
Step 4: Verify Report Generation
Check that reports are created but not sent:
# List generated reports
ls -lt reports/admin_summary_*.txt | head -1
# Review latest admin summary
less "$(ls -t reports/admin_summary_*.txt | head -1)"
4.2 Manual Production Execution
Step 1: Full System Scan with Notifications
When ready to run in production:
# Full system scan with email notifications (no --dry-run flag)
sudo ./bin/disk_monitor.sh
Expected Duration: 30-120 minutes depending on:
- Number of files to scan
- NFS mount responsiveness
- System load
- Number of users
Step 2: Monitor Progress
Open a separate terminal and watch the logs:
# Watch main execution log
tail -f logs/disk_monitor_*.log
# Watch notification log
tail -f logs/notifications_$(date +%Y%m%d).log
Step 3: Verify Completion
After execution completes, verify:
# Check for successful completion in log
grep "COMPLETED" logs/disk_monitor_*.log | tail -1
# Verify admin summary was created
ls -lt reports/admin_summary_*.txt | head -1
# Check notification delivery status
grep "SUCCESS\|FAILED" logs/notifications_*.log | tail -20
Step 4: Review Admin Summary
Review:
- Filesystem usage percentages
- Top 10 users by disk consumption
- Total potential space savings
- Any failed notifications
- Performance metrics
4.3 Targeted User Scans
For investigating specific user issues:
# Scan single user with full reporting
sudo ./bin/disk_monitor.sh --user username
# Check user's report
sudo cat /home/username/DISK_USAGE_REPORT.txt
# Verify email was sent
grep "username" logs/notifications_$(date +%Y%m%d).log
4.4 Quick Scans (Skip NFS)
For faster execution during testing or urgent checks:
# Skip the NFS mount during the scan
sudo ./bin/disk_monitor.sh --skip-nfs
Use when:
- NFS mount is slow or unresponsive
- Need quick local filesystem check
- Testing configuration changes
- Troubleshooting issues
5. Procedure: Responding to Low Disk Space Alerts
5.1 Immediate Assessment
Step 1: Check Current Disk Usage
# Check all filesystem usage
df -h
# Check specific filesystems
df -h /home /data /mnt/agrp/lab417
# Check inode usage (can fill up independently)
df -i
Step 2: Identify Severity Level
| Free Space | Severity | Action Required |
|---|---|---|
| >20% | Normal | Routine monitoring |
| 10-20% | Warning | Schedule cleanup within 1 week |
| 5-10% | Critical | Immediate action required |
| <5% | Emergency | Follow emergency procedures |
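The severity table can be encoded as a small helper for scripted checks. This is an illustrative sketch, not part of the installed monitoring system; the `severity` function name and the use of GNU `df --output=pcent` are assumptions:

```shell
#!/bin/bash
# Map a free-space percentage to the severity levels in the table above.
severity() {
    local free_pct=$1
    if   [ "$free_pct" -gt 20 ]; then echo "Normal"
    elif [ "$free_pct" -ge 10 ]; then echo "Warning"
    elif [ "$free_pct" -ge 5  ]; then echo "Critical"
    else                              echo "Emergency"
    fi
}

# Example: classify the root filesystem (requires GNU df)
used_pct=$(df --output=pcent / | tail -1 | tr -dc '0-9')
free_pct=$((100 - used_pct))
echo "/ has ${free_pct}% free -> $(severity "$free_pct")"
```

The same function can be looped over all monitored mount points to drive automated alerts.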
Step 3: Identify Top Consumers
# Find largest directories in /home
sudo du -h --max-depth=1 /home | sort -hr | head -20
# Find largest directories in /data
sudo du -h --max-depth=1 /data | sort -hr | head -20
# Find largest files across filesystem
sudo find /home -type f -size +1G -exec ls -lh {} \; | sort -k5 -hr | head -20
5.2 Run Targeted Analysis
Step 1: Run Immediate Disk Scan
# Full production scan with notifications
sudo ./bin/disk_monitor.sh
Step 2: Review Admin Summary
# Read latest admin summary
less "$(ls -t reports/admin_summary_*.txt | head -1)"
# Focus on:
# - Top users by consumption
# - Total potential savings
# - Quick win opportunities (uncompressed FASTQ)
5.3 Contact High-Usage Users
Step 1: Prioritize User Contacts
Based on admin summary, identify users with:
1. Largest disk usage
2. Most uncompressed FASTQ files
3. SAM files that should be BAM
4. Large old files
Step 2: Send Personalized Communication
# Check if user already received automated report
grep "username" logs/notifications_*.log
# If not recent, run targeted scan
sudo ./bin/disk_monitor.sh --user username
Email template:
Subject: Urgent: Storage Optimization Required on HPC Server
Dear [username],
Our HPC server is currently experiencing limited disk space. Your account has been
identified as a significant storage consumer.
Current usage: [X] GB
We have identified opportunities to reduce your storage footprint:
- [X] GB in uncompressed FASTQ files
- [X] GB in SAM files that should be BAM format
- [X] GB in files older than 1 year
Please review the detailed report in your home directory:
~/DISK_USAGE_REPORT.txt
Priority actions (within 48 hours):
1. Compress all uncompressed FASTQ files
2. Convert SAM files to BAM format
3. Archive or delete old files
Need assistance? Contact: evilliers@saiab.ac.za
Thank you for your prompt attention to this matter.
5.4 Monitor Improvement
Step 1: Track Disk Space Changes
# Record current usage
df -h > /tmp/disk_before.txt
# After users complete cleanup (wait 24-48 hours)
df -h > /tmp/disk_after.txt
# Compare
diff /tmp/disk_before.txt /tmp/disk_after.txt
Step 2: Re-scan to Verify
# Run follow-up scan
sudo ./bin/disk_monitor.sh
# Compare new admin summary to previous
diff reports/admin_summary_[previous].txt reports/admin_summary_[latest].txt
6. Procedure: Assisting Users with Storage Optimization
6.1 Compress FASTQ Files
Users should be guided through this process:
Interactive Compression
# Navigate to directory with FASTQ files
cd /path/to/fastq/files
# Compress single file
gzip filename.fastq
# Result: filename.fastq.gz (70-80% space savings)
# Compress all FASTQ files in directory
gzip *.fastq
gzip *.fq
# Verify compression worked
ls -lh *.gz
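To see the actual ratio for a given file rather than relying on the 70-80% rule of thumb, the measurement can be rehearsed on a throwaway file. The temp file and its repetitive content here are illustrative; real FASTQ compresses less dramatically than this toy example:

```shell
#!/bin/bash
# Create a throwaway FASTQ-like file and measure its gzip compression ratio.
f=$(mktemp --suffix=.fastq)
for i in $(seq 1 500); do
    printf '@read%d\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII\n' "$i"
done > "$f"

orig=$(stat -c %s "$f")
gzip "$f"                      # replaces $f with $f.gz
comp=$(stat -c %s "$f.gz")

awk -v o="$orig" -v c="$comp" 'BEGIN { printf "saved %.0f%% of %d bytes\n", 100 * (1 - c / o), o }'
rm -f "$f.gz"
```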
Batch Compression with SLURM
For large numbers of files:
# Create compression script
cat > compress_fastq.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=compress_fastq
#SBATCH --output=compress_%j.log
#SBATCH --partition=agrp
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
# Compress FASTQ files with parallel gzip
find . -name "*.fastq" -o -name "*.fq" | \
parallel -j 4 'gzip {}'
EOF
# Submit job
sbatch compress_fastq.sh
Safety Checks
# Test on one file first: compress while keeping the original (-k)
gzip -k filename.fastq
# Verify integrity of the compressed copy
gzip -t filename.fastq.gz
# If verification succeeds (exit status 0), remove the original
rm filename.fastq
# Note: plain gzip (without -k) replaces the original automatically
6.2 Convert SAM to BAM
Single File Conversion
# Install samtools (if not already in conda environment)
conda activate bioinformatics
conda install -c bioconda samtools
# Convert SAM to sorted BAM
samtools view -bS input.sam | samtools sort -o output.bam
# Index BAM file
samtools index output.bam
# Verify BAM file integrity
samtools quickcheck output.bam
echo $? # Should return 0 if OK
# If successful, remove SAM file
rm input.sam
Batch Conversion with SLURM
# Create conversion script
cat > sam_to_bam.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=sam2bam
#SBATCH --output=sam2bam_%j.log
#SBATCH --partition=agrp
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
# Load conda environment
source ~/miniconda3/etc/profile.d/conda.sh
conda activate bioinformatics
# Convert all SAM files to BAM
for sam in *.sam; do
bam="${sam%.sam}.bam"
echo "Converting $sam to $bam"
samtools view -@ 4 -bS "$sam" | samtools sort -@ 4 -o "$bam"
samtools index "$bam"
# Verify and remove SAM if successful
if samtools quickcheck "$bam"; then
echo "Success: $bam verified"
rm "$sam"
else
echo "ERROR: $bam verification failed, keeping $sam"
fi
done
EOF
# Submit job
sbatch sam_to_bam.sh
6.3 Archive Old Files
Identify Old Files
# Find files older than 1 year
find /home/username -type f -mtime +365 -size +100M
# Count old files and total size
find /home/username -type f -mtime +365 -exec du -ch {} + | tail -1
Create Archive
# Create dated archive
tar czf archive_$(date +%Y%m%d).tar.gz /path/to/old/files/
# Verify archive integrity
tar tzf archive_$(date +%Y%m%d).tar.gz > /dev/null
echo $? # Should return 0 if OK
# Move to archive location
mv archive_*.tar.gz /mnt/agrp/lab417/archives/username/
# Only remove originals after verifying archive
# rm -rf /path/to/old/files/
Best Practices for Archival
- Always verify archives before deleting originals
- Use dated filenames for easy identification
- Document archive contents with README files
- Store archives on long-term storage (NFS, not /home)
- Test extraction periodically to ensure archives remain readable
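The verify-before-delete rule can be rehearsed on throwaway data before applying it to real projects; all paths below are temporary and illustrative:

```shell
#!/bin/bash
# Build a scratch project, archive it, verify, and only then delete originals.
workdir=$(mktemp -d)
mkdir "$workdir/old_project"
echo "final results" > "$workdir/old_project/results.txt"

archive="$workdir/archive_$(date +%Y%m%d).tar.gz"
tar czf "$archive" -C "$workdir" old_project

# Verification gate: originals are removed only if the archive lists cleanly
if tar tzf "$archive" > /dev/null; then
    rm -rf "$workdir/old_project"
    echo "archive verified; originals removed"
else
    echo "archive FAILED verification; originals kept"
fi

rm -rf "$workdir"
```

The same if/else gate can be dropped into any archival script so a corrupt archive never triggers deletion.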
7. Scheduled Maintenance
7.1 Weekly Automated Scan
Verify Cron Job is Active
# Check if cron job is installed
sudo ls -l /etc/cron.d/disk-monitor
# View cron schedule
sudo cat /etc/cron.d/disk-monitor
Expected content:
# Disk monitoring system - runs every Sunday at 2:00 AM
0 2 * * 0 root /home/evilliers/work/sysadmin/bin/disk_monitor.sh >> /home/evilliers/work/sysadmin/logs/cron.log 2>&1
Review Weekly Execution
Every Monday morning:
# Check Sunday's cron execution
tail -100 logs/cron.log
# Review latest admin summary
less "$(ls -t reports/admin_summary_*.txt | head -1)"
# Check for any notification failures
grep "FAILED" logs/notifications_*.log | tail -20
7.2 Monthly Review
Perform these checks on the first Monday of each month:
Step 1: Trend Analysis
# Compare disk usage over time
df -h > /tmp/disk_usage_$(date +%Y%m).txt
# Compare to previous month
diff /tmp/disk_usage_[prev_month].txt /tmp/disk_usage_$(date +%Y%m).txt
Step 2: Review Top Users
# Check whether the same users appear in the top 10 across recent summaries
for f in $(ls -t reports/admin_summary_*.txt | head -4); do
    echo "=== $f ==="
    grep -A 10 "TOP 10 USERS" "$f"
done
Step 3: Effectiveness Metrics
- Total space reclaimed this month
- Number of users who compressed FASTQ files
- Number of SAM files converted to BAM
- Reduction in old file count
Step 4: Configuration Review
Consider adjusting if:
- Too many false positives in reports
- Missing significant storage issues
- Performance issues with scanning
7.3 Quarterly Maintenance
Every 3 months:
Clean Up Old Logs
# Logs are auto-cleaned after 90 days, verify:
find logs/ -name "*.log" -mtime +90 -ls
# Manually clean if needed
find logs/ -name "*.log" -mtime +90 -delete
# Clean old reports
find reports/ -name "admin_summary_*.txt" -mtime +90 -delete
Audit User Accounts
# Find users who haven't logged in for 200+ days (or ever)
sudo lastlog -b 200
# Check disk usage of accounts that have never logged in
for user in $(sudo lastlog | grep "Never logged in" | awk '{print $1}'); do
    echo -n "$user: "
    sudo du -sh "/home/$user" 2>/dev/null
done
Review System Capacity
- Assess if additional storage is needed
- Plan for capacity expansion if consistently >80% usage
- Consider implementing quotas if needed
8. Troubleshooting
8.1 Common Issues and Solutions
| Issue | Symptoms | Solution |
|---|---|---|
| Disk monitor not running | No recent logs or reports | Check cron job, verify symlink, review permissions |
| Email notifications failing | Users not receiving reports | Test mail server, check SMTP config, verify email addresses |
| NFS scanning hangs | Script runs >3 hours | Use --skip-nfs, check NFS mount health, adjust timeout |
| Lock file error | "Another instance running" | Verify no process running, remove stale lock file |
| Permission denied errors | Scan fails for some users | Ensure running as root, check directory permissions |
| Reports show 0 bytes savings | No optimization opportunities found | May be false negative, manually verify with find |
8.2 Detailed Troubleshooting Steps
Issue: No Email Notifications Received
Step 1: Verify email is enabled
Step 2: Test mail system
echo "Test email" | mail -s "Test from HPC" evilliers@saiab.ac.za
# Check mail logs
sudo tail -f /var/log/mail.log
Step 3: Check notification logs
tail -50 logs/notifications_$(date +%Y%m%d).log
Step 4: Verify email addresses
Note: Users always receive file notifications in home directory even if email fails.
Issue: Disk Monitor Runs Too Slowly
Step 1: Check current performance
Step 2: Disable NFS if slow
# Temporary: use --skip-nfs flag
sudo ./bin/disk_monitor.sh --skip-nfs
# Permanent: edit config
vim config/disk_monitor.conf
# Set: ENABLE_NFS_SCAN=false
Step 3: Reduce parallel scans
vim config/disk_monitor.conf
# Lower MAX_PARALLEL_SCANS (e.g. from 4 to 2)
Step 4: Limit scan depth
Step 5: Add exclusion patterns
vim config/disk_monitor.conf
# Add paths to EXCLUDE_PATTERNS
# Example: EXCLUDE_PATTERNS="*.tmp *.bak .cache* .singularity*"
Issue: Lock File Prevents Execution
Step 1: Check if process actually running
ps aux | grep "[d]isk_monitor"
Step 2: If no process found, remove stale lock
sudo rm -f /var/run/disk_monitor.lock
Step 3: Re-run script
sudo ./bin/disk_monitor.sh
Issue: Permission Denied Errors
Step 1: Verify running as root
whoami    # should print root; otherwise prefix commands with sudo
Step 2: Check log file permissions
ls -la logs/
Step 3: Verify scan path permissions
sudo ls -ld /home /data /mnt/agrp/lab417
8.3 Emergency Recovery
Disk Full Scenario
If filesystem reaches 100% capacity:
Step 1: Immediate space recovery
# Clear system temporary files (caution: may disrupt running jobs using /tmp)
sudo rm -rf /tmp/*
sudo rm -rf /var/tmp/*
# Clear old log files
sudo find /var/log -name "*.gz" -delete
sudo find /var/log -name "*.old" -delete
# Truncate large log files
sudo truncate -s 0 /var/log/syslog
Step 2: Identify largest consumers
# Find massive files quickly
sudo find /home -xdev -type f -size +10G -exec ls -lh {} \;
# Check for core dumps
sudo find /home -name "core.*" -exec rm {} \;
Step 3: Force user cleanup
# Temporarily move largest files to archive
sudo mkdir -p /mnt/agrp/lab417/emergency_archive/$(date +%Y%m%d)
sudo mv [large_files] /mnt/agrp/lab417/emergency_archive/$(date +%Y%m%d)/
# Document what was moved
echo "Moved files from emergency cleanup" > /mnt/agrp/lab417/emergency_archive/$(date +%Y%m%d)/README.txt
Step 4: Notify affected users Send immediate email explaining emergency action and where files were moved.
9. Best Practices
9.1 Proactive Storage Management
1. Regular Monitoring
   - Review weekly admin summaries every Monday
   - Track disk usage trends monthly
   - Set up alerts for >85% utilization
2. User Education
   - Include storage best practices in onboarding
   - Reference training materials in automated reports
   - Provide examples of good file organization
3. Preventive Actions
   - Encourage compression of FASTQ files immediately after generation
   - Recommend BAM format over SAM from the start
   - Promote archival of completed project data
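The >85% utilization alert can be scripted directly from `df`. This is a sketch assuming GNU `df --output`; it is not part of the installed monitoring system:

```shell
#!/bin/bash
# List filesystems whose usage exceeds an alert threshold (85% per the text).
threshold=85
df --output=target,pcent | tail -n +2 | while read -r mount pcent; do
    use=${pcent%\%}
    case $use in *[!0-9]*|"") continue ;; esac   # skip non-numeric rows
    if [ "$use" -gt "$threshold" ]; then
        echo "ALERT: $mount at ${use}% used"
    fi
done
```

Run from cron, the output can be piped to `mail` to produce the alert email.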
9.2 Bioinformatics File Management
File Format Best Practices
| File Type | Recommended Storage | Space Savings | Notes |
|---|---|---|---|
| FASTQ | Always compress (.gz) | 70-80% | Use pigz for faster compression |
| SAM | Convert to BAM immediately | 50-80% | Smaller and faster to process |
| BAM | Already compressed, keep | N/A | Always index (.bai) |
| VCF | Compress to VCF.gz | 60-75% | Use bgzip for compatibility |
| BED | Keep small files uncompressed | Variable | Compress large files |
| FASTA | Compress unless frequently accessed | 60-70% | Reference genomes can stay uncompressed |
Directory Organization
~/project_name/
├── raw_data/ # Original FASTQ (compressed)
├── alignments/ # BAM files (sorted and indexed)
├── variants/ # VCF files
├── scripts/ # Analysis scripts
├── results/ # Final outputs
└── README.md # Project documentation
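A new project skeleton matching this layout can be created in one step; the `project_name` path is a placeholder:

```shell
#!/bin/bash
# Create the recommended project layout under $HOME (path is illustrative).
base="$HOME/project_name"
mkdir -p "$base"/{raw_data,alignments,variants,scripts,results}
touch "$base/README.md"
ls "$base"
```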
9.3 Archive Strategy
When to Archive
- Project completed and published
- Data not accessed in >1 year
- Reference data superseded by newer version
- Intermediate files after final results generated
Where to Archive
1. Short-term: /mnt/agrp/lab417/archives/ (1-2 years)
2. Long-term: External backup systems or cold storage
3. Published data: Deposit to public repositories (SRA, ENA, etc.)
Archive Naming Convention
YYYY-MM-DD_ProjectName_Description.tar.gz
Examples:
2026-02-10_SeabreamGenome_RawFASTQ.tar.gz
2026-02-10_SeabreamGenome_Alignments.tar.gz
2026-02-10_SeabreamGenome_FinalResults.tar.gz
9.4 Quota Considerations
If implementing quotas (future enhancement):
Recommended quotas by user type:
- Students/Course participants: 50GB soft, 75GB hard
- Researchers: 200GB soft, 300GB hard
- Project groups: 1TB soft, 1.5TB hard
Grace period: 7 days for soft quota
9.5 Data Retention Policy
Establish clear policies:
| Data Type | Retention | Action After Retention |
|---|---|---|
| Raw sequencing data | 3 years | Archive or delete if published |
| Intermediate files | 1 year | Delete after results verified |
| Final results | Permanent | Keep in archived projects |
| Temporary analysis | 30 days | Auto-delete or user confirms |
| Published data | Permanent | Move to public repositories |
Appendix A: Disk Monitoring System Details
A.1 System Architecture
/home/evilliers/work/sysadmin/
├── bin/
│ └── disk_monitor.sh # Main orchestrator script
├── lib/
│ ├── file_scanner.sh # Filesystem scanning functions
│ ├── report_generator.sh # Report generation functions
│ └── notification_handler.sh # Email and file notification functions
├── config/
│ ├── disk_monitor.conf # Central configuration file
│ └── disk-monitor.cron # Cron schedule
├── logs/
│ ├── disk_monitor_*.log # Execution logs
│ ├── notifications_*.log # Email delivery logs
│ └── cron.log # Cron execution logs
├── reports/
│ └── admin_summary_*.txt # Admin summary archives
└── templates/
└── (Future: email templates)
A.2 Configuration Parameters
Key settings in config/disk_monitor.conf:
Scan Locations
Thresholds
LARGE_FILE_THRESHOLD="100M" # Flag files larger than this
OLD_FILE_AGE_DAYS=365 # Flag files older than this
MIN_DISK_USAGE="1G" # Minimum usage to generate report
Performance
NICE_LEVEL=19 # Process priority (19 = lowest)
IONICE_CLASS=3 # I/O priority (3 = idle)
MAX_PARALLEL_SCANS=4 # Concurrent user scans
Email Settings
A.3 Report Outputs
User Reports (~/DISK_USAGE_REPORT.txt):
- Current disk usage by filesystem
- List of uncompressed FASTQ files with compression commands
- List of SAM files with BAM conversion commands
- Large files (>100MB)
- Old files (>1 year)
- Total potential space savings
- Prioritized action recommendations
Admin Summary (reports/admin_summary_YYYYMMDD_HHMMSS.txt):
- System-wide filesystem status with usage graphs
- Top 10 users by disk consumption
- System-wide statistics (total files, potential savings)
- Notification delivery status
- Performance metrics
A.4 Execution Logs
Main Log (logs/disk_monitor_YYYYMMDD_HHMMSS.log):
- Script execution start/end times
- Users scanned
- Errors encountered
- Summary statistics
Notification Log (logs/notifications_YYYYMMDD.log):
- Email delivery attempts
- Success/failure status
- Error messages
Cron Log (logs/cron.log):
- Automated execution records
- Output from cron-triggered scans
Appendix B: Common Storage Optimization Commands
B.1 Finding Files
Find large files
# Files larger than 1GB
find /home/username -type f -size +1G -exec ls -lh {} \;
# Top 20 largest files
find /home/username -type f -exec ls -s {} \; | sort -n -r | head -20
Find old files
# Files not accessed in >1 year (atime may be coarse on relatime/noatime mounts)
find /home/username -type f -atime +365
# Files not modified in >1 year
find /home/username -type f -mtime +365
Find specific file types
# All FASTQ files (uncompressed)
find /home/username -name "*.fastq" -o -name "*.fq"
# All SAM files
find /home/username -name "*.sam"
# Count and size
find /home/username -name "*.fastq" -exec du -ch {} + | tail -1
B.2 Compression Commands
gzip (single-threaded)
# Compress single file
gzip file.fastq
# Compress with best compression (slower)
gzip -9 file.fastq
# Keep original file
gzip -k file.fastq
# Decompress
gunzip file.fastq.gz
pigz (parallel gzip, faster)
# Install if not available (conda-forge channel)
conda install -c conda-forge pigz
# Compress with 4 threads
pigz -p 4 file.fastq
# Compress all FASTQ files in parallel
find . -name "*.fastq" | parallel -j 4 pigz
Test compressed file integrity
# Test gzip integrity
gzip -t file.fastq.gz
# Will exit with error if corrupt
echo $? # 0 = OK, non-zero = error
B.3 SAM/BAM Operations
Convert SAM to BAM
# Basic conversion
samtools view -bS input.sam > output.bam
# Convert and sort in one step
samtools view -bS input.sam | samtools sort -o output.sorted.bam
# With multiple threads (faster)
samtools view -@ 4 -bS input.sam | samtools sort -@ 4 -o output.sorted.bam
Index BAM files
samtools index output.bam
Verify BAM integrity
samtools quickcheck output.bam && echo "BAM OK"
Get BAM statistics
samtools flagstat output.bam
B.4 Archive Creation
Create compressed archive
# Create tar.gz archive
tar czf archive.tar.gz /path/to/files/
# Create with progress display
tar czf archive.tar.gz /path/to/files/ --verbose
# Exclude certain patterns
tar czf archive.tar.gz --exclude="*.tmp" --exclude="*.log" /path/to/files/
Verify archive integrity
# List contents without extracting
tar tzf archive.tar.gz
# Verify integrity
tar tzf archive.tar.gz > /dev/null
echo $? # 0 = OK
Extract archive
# Extract all
tar xzf archive.tar.gz
# Extract to specific directory
tar xzf archive.tar.gz -C /destination/path/
# Extract single file
tar xzf archive.tar.gz path/to/specific/file
B.5 Disk Usage Analysis
Directory sizes
# Current directory subdirectories
du -h --max-depth=1 . | sort -hr
# Specific user's home
sudo du -h --max-depth=1 /home/username | sort -hr
# All users in /home
sudo du -sh /home/* | sort -hr
Find disk usage by file type
# All FASTQ files
find /home/username -name "*.fastq*" -exec du -ch {} + | tail -1
# All BAM files
find /home/username -name "*.bam" -exec du -ch {} + | tail -1
Filesystem usage
# All mounted filesystems
df -h
# Inode usage
df -i
Appendix C: Emergency Procedures
C.1 Critical Disk Space (<5% Free)
Immediate Actions (within 30 minutes)
1. Alert team
2. Prevent new writes (if absolutely necessary)
3. Clear immediate space
4. Identify top consumers
5. Contact top users immediately
   - Phone call or instant message
   - Request immediate cleanup
   - Provide specific file locations to remove
Short-term Actions (within 24 hours)
1. Run disk monitor
2. Force cleanup of top offenders
   - Work with users to compress FASTQ
   - Convert SAM to BAM
3. Move old files to archive
4. Temporary space expansion
   - Mount additional volumes if available
   - Move data to NFS temporarily
C.2 NFS Mount Failure
Symptoms
- Disk monitor hangs
- Users can't access /mnt/agrp/lab417
- Long delays on filesystem access
Diagnosis
# Check mount status
mount | grep agrp
# Check NFS server connectivity
ping [NFS_SERVER_IP]
# Try to access mount
ls -la /mnt/agrp/lab417
# Check NFS services
sudo systemctl status nfs-client.target
Resolution
1. Remount NFS
   sudo umount /mnt/agrp/lab417 && sudo mount /mnt/agrp/lab417   # assumes an /etc/fstab entry
2. If unmount fails (device busy), try a lazy unmount
   sudo umount -l /mnt/agrp/lab417
3. Skip NFS in monitoring until resolved
   sudo ./bin/disk_monitor.sh --skip-nfs
4. Contact NFS administrator
   - Report issue with NFS server
   - Provide error messages
   - Check /var/log/syslog for NFS errors
C.3 Disk Monitor Failure
Symptoms
- No reports generated
- Script exits with errors
- No recent logs
Diagnosis
# Check last execution
ls -lt logs/disk_monitor_*.log | head -1
# Review error messages
tail -100 logs/disk_monitor_*.log | grep -i error
# Check for lock file
ls -la /var/run/disk_monitor.lock
# Verify script exists and is executable
ls -la bin/disk_monitor.sh
Resolution
1. Remove stale locks
   sudo rm -f /var/run/disk_monitor.lock
2. Test script manually
   sudo ./bin/disk_monitor.sh --dry-run --verbose
3. Check permissions
   ls -la bin/disk_monitor.sh    # must be executable
4. Review configuration
   cat config/disk_monitor.conf
5. Manual scan alternative
   sudo du -sh /home/* | sort -hr | head -20
Appendix D: Report Interpretation Guide
D.1 Understanding User Reports
Sample User Report Structure
==========================================
DISK USAGE REPORT for username
Generated: 2026-02-10 03:15:22
==========================================
CURRENT DISK USAGE:
/home/username: 45.2 GB
/data: 102.8 GB
/mnt/agrp/lab417: 356.4 GB
TOTAL: 504.4 GB
OPTIMIZATION OPPORTUNITIES:
1. UNCOMPRESSED FASTQ FILES (Priority: HIGH)
Found: 45 files
Current size: 180.5 GB
After compression: ~45.1 GB (estimated)
Potential savings: ~135.4 GB
Example files:
/home/username/project1/sample1_R1.fastq (12.3 GB)
/home/username/project1/sample1_R2.fastq (12.1 GB)
...
Recommended command:
gzip *.fastq
2. SAM FILES (Priority: MEDIUM)
Found: 12 files
Current size: 89.2 GB
After BAM conversion: ~26.8 GB (estimated)
Potential savings: ~62.4 GB
Example files:
/data/alignments/sample1.sam (15.4 GB)
...
Recommended command:
samtools view -bS input.sam | samtools sort -o output.bam
3. OLD FILES (Priority: LOW)
Found: 234 files older than 1 year
Total size: 67.3 GB
TOTAL POTENTIAL SAVINGS: ~265.1 GB (53% reduction)
ACTION ITEMS:
1. [HIGH] Compress 45 FASTQ files (saves ~135 GB)
2. [MEDIUM] Convert 12 SAM files to BAM (saves ~62 GB)
3. [LOW] Review 234 old files for archival (saves ~67 GB)
Interpretation Guidelines
- Priority HIGH: Action will save significant space (>100GB or >50% of user total)
- Priority MEDIUM: Moderate savings (10-100GB or 10-50% of user total)
- Priority LOW: Good practice but smaller impact (<10GB or <10% of user total)
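These rules can be expressed as a small classifier; the `priority` helper below is an illustrative sketch of the guideline thresholds, not code from the report generator:

```shell
#!/bin/bash
# Classify an optimization opportunity by absolute savings (GB) and by its
# share of the user's total usage (%), per the guideline thresholds above.
priority() {
    local savings_gb=$1 pct_of_total=$2
    if   [ "$savings_gb" -gt 100 ] || [ "$pct_of_total" -gt 50 ]; then echo "HIGH"
    elif [ "$savings_gb" -ge 10 ]  || [ "$pct_of_total" -ge 10 ]; then echo "MEDIUM"
    else echo "LOW"
    fi
}

priority 135 27    # large absolute savings (FASTQ example from the sample report)
priority 62 12     # moderate savings (SAM example)
priority 5 3       # small impact
```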
D.2 Understanding Admin Summary
Sample Admin Summary Sections
1. Filesystem Status
FILESYSTEM USAGE:
/home : [########## ] 86% (1.2 TB / 1.4 TB)
/data : [############ ] 94% (4.7 TB / 5.0 TB)
/mnt/agrp/lab417: [######## ] 67% (6.7 TB / 10.0 TB)
Status Indicators:
- <70%: Normal (keep monitoring)
- 70-85%: Attention needed (plan cleanup)
- 85-95%: Warning (initiate user contact)
- >95%: Critical (immediate action required)
2. Top Users
TOP 10 USERS BY DISK CONSUMPTION:
1. researcher1 : 1,234 GB [****************************]
2. researcher2 : 892 GB [******************** ]
3. student1 : 567 GB [************* ]
...
What to do:
- Users at top of list are priority contacts
- Check if their usage is justified (active projects)
- Verify they received automated reports
- Contact directly if usage seems excessive
3. System-wide Statistics
SYSTEM-WIDE STATISTICS:
Total files scanned : 1,234,567
Total users scanned : 38
Users reported : 28
Uncompressed FASTQ : 632 GB (potential savings)
SAM files : 164 GB (potential savings)
Total potential savings: 796 GB
NOTIFICATION DELIVERY:
Email sent : 26
Email failed : 2
File notifications : 28
Success rate : 93%
Key metrics:
- Success rate <90%: Investigate email delivery issues
- Potential savings >500GB: High-priority cleanup campaign needed
- Users reported vs scanned: Shows how many have significant usage
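The success-rate line can be reproduced from the sent/failed counts; the numbers below are taken from the sample summary above:

```shell
#!/bin/bash
# Success rate = emails sent / (sent + failed), rounded to a whole percent.
sent=26
failed=2
awk -v s="$sent" -v f="$failed" \
    'BEGIN { printf "Success rate : %.0f%%\n", 100 * s / (s + f) }'
```

26 / 28 = 92.86%, which rounds to the 93% shown in the sample.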
4. Performance Metrics
Benchmarks:
- Scan duration should be <120 minutes
- If >180 minutes, consider optimization (skip NFS, reduce depth)
D.3 Trend Analysis
Week-over-Week Comparison
Compare multiple admin summaries:
# Extract key metrics from last 4 weeks
for f in reports/admin_summary_*.txt; do
echo "=== $f ==="
grep "TOTAL:" $f
grep "potential savings" $f
done
Red flags:
- Disk usage increasing >5% per week
- Same users in top 10 week after week
- Potential savings increasing (users ignoring reports)
- Email failure rate increasing
Document Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-02-10 | SAIAB HPC Admin | Initial release |
Document Approval
Prepared by: SAIAB HPC Administration Team Contact: evilliers@saiab.ac.za Server: lab417.saiab.ac.za
End of Standard Operating Procedure