BioSkills bio-sra-data
Download sequencing data from NCBI SRA using the SRA toolkit. Use when downloading FASTQ files from SRA accessions, prefetching large datasets, or validating SRA downloads.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/database-access/sra-data" ~/.claude/skills/gptomics-bioskills-bio-sra-data && rm -rf "$T"
database-access/sra-data/SKILL.md
Version Compatibility
Reference examples tested with: BioPython 1.83+, Entrez Direct 21.0+, SRA Toolkit 3.0+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>`, then `help(module.function)` to check signatures
- CLI: `<tool> --version`, then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
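The version check described above can be sketched in Python. This is a minimal sketch, assuming the tool prints a dotted version number somewhere in its `--version` output; the helper names `parse_version`, `installed_version`, and `meets_minimum` are hypothetical and not part of any library.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints, or None."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_version(tool="fasterq-dump"):
    """Run `<tool> --version` and parse the result; None if the tool is not on PATH."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    # Assumption: the version string appears on stdout or stderr
    return parse_version(result.stdout + result.stderr)


def meets_minimum(version, minimum=(3, 0)):
    """True if a parsed version tuple is at least `minimum`."""
    return version is not None and version >= minimum
```

Calling `meets_minimum(installed_version())` before running the examples gives an early signal that flags may differ from the ones documented here.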
SRA Data
Download raw sequencing data from the Sequence Read Archive using the SRA toolkit.
"Download FASTQ from SRA" → Fetch raw sequencing reads from an SRA accession as FASTQ files.
- CLI: `fasterq-dump SRR_ACCESSION` (SRA Toolkit)
- Python: `subprocess.run(['fasterq-dump', accession])`, or `Entrez.efetch()` for metadata
Installation
    # macOS
    brew install sratoolkit

    # Ubuntu/Debian
    sudo apt install sra-toolkit

    # conda (recommended)
    conda install -c bioconda sra-tools

    # Verify installation
    fasterq-dump --version
Core Commands
fasterq-dump - Download FASTQ (Recommended)
Fast, multithreaded FASTQ extraction. Preferred over `fastq-dump`.
    # Download a single SRA run as FASTQ
    fasterq-dump SRR12345678

    # Output: SRR12345678.fastq (single-end)
    # Or: SRR12345678_1.fastq, SRR12345678_2.fastq (paired-end)
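Because the output file names depend on the library layout, a script often has to check which files actually appeared. The following is a minimal sketch, assuming the naming scheme shown above; `find_fastq_outputs` is a hypothetical helper name.

```python
from pathlib import Path


def find_fastq_outputs(accession, outdir="."):
    """Classify fasterq-dump output for one run as 'paired', 'single', or 'missing'."""
    outdir = Path(outdir)
    single = outdir / f"{accession}.fastq"
    mate1 = outdir / f"{accession}_1.fastq"
    mate2 = outdir / f"{accession}_2.fastq"

    if mate1.exists() and mate2.exists():
        files = [mate1, mate2]
        # Assumption: an extra ACCESSION.fastq next to the mates holds unpaired reads
        if single.exists():
            files.append(single)
        return "paired", files
    if single.exists():
        return "single", [single]
    return "missing", []
```

For example, `find_fastq_outputs("SRR12345678", "./data")` returns the layout together with the list of paths that exist.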
Key Options:
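| Option | Description | Example |
|---|---|---|
| `-O` / `--outdir` | Output directory | `-O ./data/` |
| `-o` / `--outfile` | Output filename | `-o sample.fastq` |
| `-e` / `--threads` | Number of threads | `-e 8` |
| `-p` / `--progress` | Show progress bar | `-p` |
| `-S` / `--split-files` | Write each read of a spot to its own file | `-S` |
| `-3` / `--split-3` | Split paired reads and also output unpaired reads (default) | `-3` |
| `--skip-technical` | Skip technical reads | `--skip-technical` |
| `-t` / `--temp` | Temp directory | `-t /tmp/` |
| `-f` / `--force` | Overwrite existing | `-f` |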
    # Common usage with options
    fasterq-dump SRR12345678 -O ./data/ -e 8 -p --skip-technical

    # Force split files (paired-end)
    fasterq-dump SRR12345678 -S -O ./data/
prefetch - Download SRA Files First
For large files or unreliable connections, prefetch first, then convert.
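    # Prefetch the SRA file. With SRA Toolkit 3.x the default destination is
    # ./SRR12345678/SRR12345678.sra in the current directory; setups with a
    # configured user repository may store it there instead (e.g. under ~/ncbi/)
    prefetch SRR12345678

    # Then convert the downloaded file to FASTQ
    fasterq-dump ./SRR12345678/SRR12345678.sra

    # Or pass the accession; fasterq-dump will find the prefetched file
    fasterq-dump SRR12345678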
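Where prefetch stores the `.sra` file depends on the toolkit version and configuration, so a script may need to look in more than one place before converting. A minimal sketch, assuming the file is either `DIR/ACCESSION/ACCESSION.sra` or `DIR/ACCESSION.sra`; `locate_prefetched` and its `search_dirs` argument are hypothetical.

```python
from pathlib import Path


def locate_prefetched(accession, search_dirs=(".",)):
    """Return the first existing .sra path for accession, or None if none is found."""
    for base in search_dirs:
        base = Path(base).expanduser()
        candidates = (
            base / accession / f"{accession}.sra",  # one directory per accession
            base / f"{accession}.sra",              # flat layout
        )
        for candidate in candidates:
            if candidate.is_file():
                return candidate
    return None
```

The returned path can be passed directly to `fasterq-dump`; a `None` result means the run still has to be prefetched.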
Prefetch Options:
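| Option | Description |
|---|---|
| `-O` / `--output-directory` | Download location |
| `-p` / `--progress` | Show progress |
| `-f` / `--force` | Re-download if exists (takes a value, e.g. `-f yes`) |
| `-X` / `--max-size` | Max file size (e.g., `100G`) |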
    # Prefetch with size limit
    prefetch SRR12345678 --max-size 100G -p

    # Prefetch multiple accessions
    prefetch SRR12345678 SRR12345679 SRR12345680

    # Prefetch from a list file
    prefetch --option-file accessions.txt
vdb-validate - Verify Downloads
Check integrity of downloaded SRA files.
    # Validate a downloaded file
    vdb-validate SRR12345678

    # Validate with detailed output (merge stderr into stdout)
    vdb-validate SRR12345678 2>&1
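From Python, the same check can be driven through `subprocess`. The sketch below assumes that `vdb-validate` exits with a non-zero status when validation fails and writes its report to stderr; `check_run` is a hypothetical helper, and the `run` parameter exists only so the function can be exercised without the toolkit installed.

```python
import subprocess


def check_run(target, run=subprocess.run):
    """Run vdb-validate on an accession or .sra path.

    Returns (ok, report): ok is True when the exit status is 0,
    report is whatever the tool wrote to stderr.
    """
    result = run(["vdb-validate", str(target)], capture_output=True, text=True)
    return result.returncode == 0, result.stderr
```

A caller would typically skip conversion and log `report` when `ok` is false.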
sra-stat - Get Run Statistics
Get information about an SRA run without downloading.
    # Basic stats
    sra-stat --quick SRR12345678

    # Detailed XML output
    sra-stat --xml SRR12345678
Configuration
vdb-config - Configure SRA Toolkit
Set up cache location and other settings.
    # Interactive configuration
    vdb-config -i

    # Set cache directory
    vdb-config --set /repository/user/main/public/root=/path/to/cache

    # Check current configuration
    vdb-config --cfg
Cache Location
Default: `~/ncbi/` on Linux/macOS
    # Create dedicated cache
    mkdir -p /data/sra_cache
    vdb-config --set /repository/user/main/public/root=/data/sra_cache
Code Patterns
Download Single Run
    #!/bin/bash
    SRR="SRR12345678"
    OUTDIR="./fastq"

    mkdir -p "$OUTDIR"
    fasterq-dump "$SRR" -O "$OUTDIR" -e 8 -p
Download Multiple Runs
    #!/bin/bash
    # From a list of accessions, one per line
    while IFS= read -r SRR; do
        echo "Downloading $SRR..."
        fasterq-dump "$SRR" -O ./fastq/ -e 4 -p
    done < accessions.txt
Prefetch Then Convert (Large Files)
    #!/bin/bash
    SRR="SRR12345678"

    # Prefetch first (resumable); -O fixes where the .sra file lands
    prefetch "$SRR" -p -O ./sra/

    # Validate
    vdb-validate "./sra/${SRR}/${SRR}.sra"

    # Convert to FASTQ
    fasterq-dump "./sra/${SRR}/${SRR}.sra" -O ./fastq/ -e 8 -p

    # Optionally remove the downloaded .sra directory
    rm -rf "./sra/${SRR}"
Batch Download Script
Goal: Download, validate, and convert multiple SRA accessions from a list file in a single automated run.
Approach: Loop through accessions, prefetch each .sra file for resumable downloading, validate integrity with vdb-validate, then convert to FASTQ with fasterq-dump.
    #!/bin/bash
    # download_sra.sh - Download multiple SRA runs
    # Usage: download_sra.sh ACCESSIONS_FILE [OUTDIR] [THREADS]

    ACCESSIONS="$1"
    OUTDIR="${2:-./fastq}"
    THREADS="${3:-4}"

    mkdir -p "$OUTDIR"

    while IFS= read -r SRR; do
        # Skip blank lines and comment lines
        if [[ -z "$SRR" ]] || [[ "$SRR" == \#* ]]; then
            continue
        fi

        echo "Processing $SRR..."

        # Prefetch
        prefetch "$SRR" -p -O "$OUTDIR"

        # Validate
        if ! vdb-validate "${OUTDIR}/${SRR}/${SRR}.sra" 2>/dev/null; then
            echo "Validation failed for $SRR, skipping..."
            continue
        fi

        # Convert
        fasterq-dump "${OUTDIR}/${SRR}/${SRR}.sra" -O "$OUTDIR" -e "$THREADS" -p

        # Cleanup .sra
        rm -rf "${OUTDIR:?}/${SRR}"

        echo "Completed $SRR"
    done < "$ACCESSIONS"
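The script's rule for skipping blank lines and `#` comments can be mirrored in Python when the list needs checking before a long batch run. A minimal sketch: `read_accessions` is a hypothetical helper, and the pattern accepting `SRR`, `ERR`, and `DRR` run prefixes is an assumption that goes beyond the `SRR` examples used in this document.

```python
import re

# Assumption: run accessions look like SRR/ERR/DRR followed by digits
RUN_PATTERN = re.compile(r"^[SED]RR\d+$")


def read_accessions(lines):
    """Split an iterable of lines into (runs, rejected).

    Blank lines and lines starting with '#' are ignored; anything else
    that does not look like a run accession is returned in `rejected`.
    """
    runs, rejected = [], []
    for raw in lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue
        if RUN_PATTERN.match(entry):
            runs.append(entry)
        else:
            rejected.append(entry)
    return runs, rejected
```

An open file works as input, for example `with open("accessions.txt") as fh: runs, rejected = read_accessions(fh)`.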
Python Wrapper
    import os
    import subprocess

    def download_sra(accession, outdir='.', threads=4, skip_technical=True):
        """Run fasterq-dump for one accession and return its stdout."""
        os.makedirs(outdir, exist_ok=True)

        cmd = ['fasterq-dump', accession, '-O', outdir, '-e', str(threads), '-p']
        if skip_technical:
            cmd.append('--skip-technical')

        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"fasterq-dump failed: {result.stderr}")
        return result.stdout

    # Download a run
    download_sra('SRR12345678', outdir='./data', threads=8)
Find SRA Accessions with Entrez
Goal: Discover SRA run accessions for a BioProject or search query without browsing the SRA website.
Approach: Search the SRA database via Entrez, then fetch run info in CSV format and parse out the run accessions (SRR IDs).
    import csv
    import io

    from Bio import Entrez

    Entrez.email = 'your.email@example.com'

    def find_sra_runs(term, max_results=100):
        # Search SRA for matching record IDs
        handle = Entrez.esearch(db='sra', term=term, retmax=max_results)
        search = Entrez.read(handle)
        handle.close()

        if not search['IdList']:
            return []

        # Fetch run info for those IDs in CSV format
        handle = Entrez.efetch(db='sra', id=','.join(search['IdList']),
                               rettype='runinfo', retmode='text')
        runinfo = handle.read()
        handle.close()

        # Depending on the Biopython version, the handle may return bytes
        if isinstance(runinfo, bytes):
            runinfo = runinfo.decode('utf-8')

        # Parse the CSV output; the first field is the Run accession
        runs = []
        for row in csv.reader(io.StringIO(runinfo)):
            if row and row[0] and row[0] != 'Run':  # skip blank and header rows
                runs.append(row[0])
        return runs

    # Find runs for a project
    runs = find_sra_runs('PRJNA123456[bioproject]')
    print(f"Found {len(runs)} runs")
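The run info CSV carries more than the accession, and parsing it by column name makes it possible to plan downloads by layout or size. A minimal sketch, assuming the header contains the columns `Run`, `LibraryLayout`, and `size_MB`; `parse_runinfo` is a hypothetical helper and the sample rows in the usage note are made up for illustration.

```python
import csv
import io


def parse_runinfo(text):
    """Parse run info CSV text into a list of dicts with run, layout, and size_mb."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        run = (row.get("Run") or "").strip()
        # Skip blank rows and any repeated header line
        if not run or run == "Run":
            continue
        rows.append({
            "run": run,
            "layout": row.get("LibraryLayout") or "",
            "size_mb": row.get("size_MB") or "",
        })
    return rows
```

Given text such as `"Run,spots,size_MB,LibraryLayout\nSRR0000001,1000,12,PAIRED\n"` (illustrative values), the function returns one dict per run, which can then be filtered before calling `fasterq-dump`.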
SRA Accession Types
| Prefix | Type | Description |
|---|---|---|
| SRR | Run | Individual sequencing run |
| SRX | Experiment | Experimental design |
| SRS | Sample | Biological sample |
| SRP | Project/Study | Research project |
| PRJNA | BioProject | NCBI BioProject ID |
| SAMN | BioSample | NCBI BioSample ID |
Use Run accessions (SRR*) with fasterq-dump.
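A small lookup built from the prefix table can catch the common mistake of passing a non-run accession to `fasterq-dump`. This is a minimal sketch covering only the prefixes listed above; `accession_type` and `is_run` are hypothetical helper names.

```python
# Prefix-to-type mapping taken from the table above
ACCESSION_TYPES = {
    "SRR": "Run",
    "SRX": "Experiment",
    "SRS": "Sample",
    "SRP": "Project/Study",
    "PRJNA": "BioProject",
    "SAMN": "BioSample",
}


def accession_type(accession):
    """Return the accession type for a known prefix, or None if unrecognized."""
    acc = accession.strip().upper()
    # Check longer prefixes first so a short prefix never shadows a longer one
    for prefix in sorted(ACCESSION_TYPES, key=len, reverse=True):
        if acc.startswith(prefix):
            return ACCESSION_TYPES[prefix]
    return None


def is_run(accession):
    """True if the accession is a Run (SRR*) that fasterq-dump can take directly."""
    return accession_type(accession) == "Run"
```

Accessions for which `is_run` is false first need to be resolved to their runs, for example through an Entrez search.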
Common Errors
| Cause | Solution |
|---|---|
| Invalid accession | Check accession exists |
| Insufficient space | Check temp and output dirs |
| Network issues | Use prefetch first |
| Bad output path | Create output directory |
| Cache permission | Check vdb-config |
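The insufficient-space case can be checked before a download starts. A minimal sketch using the standard library: `has_free_space` and `estimate_required_bytes` are hypothetical helpers, and the multiplier passed as `safety_factor` is an assumption the caller must choose, since the temporary and output files take more space than the `.sra` file itself.

```python
import shutil


def has_free_space(path, required_bytes):
    """True if the filesystem containing `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


def estimate_required_bytes(sra_size_bytes, safety_factor):
    """Scale the .sra size by a caller-chosen factor to budget for temp and output files."""
    return sra_size_bytes * safety_factor
```

Both the temp directory and the output directory should be checked, because they can live on different filesystems.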
Comparison: fasterq-dump vs fastq-dump
| Feature | fasterq-dump | fastq-dump |
|---|---|---|
| Speed | Fast (multithreaded) | Slow (single-threaded) |
| Memory | Higher | Lower |
| Progress | Built-in | None |
| Recommended | Yes | Legacy only |
Always prefer `fasterq-dump` unless memory constrained.
Decision Tree
    Need SRA sequencing data?
    ├── Know the SRR accession?
    │   └── fasterq-dump SRR... -O ./fastq/ -p
    ├── Large file (>20GB)?
    │   └── prefetch first, then fasterq-dump
    ├── Multiple runs?
    │   └── Loop through accessions or use prefetch --option-file
    ├── Need to find accessions?
    │   └── Search SRA database with Entrez
    ├── Download interrupted?
    │   └── prefetch supports resume
    └── Verify integrity?
        └── vdb-validate SRR...
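The download branches of the tree can be expressed as a function that returns the commands to run without executing them. A minimal sketch: `plan_download` is a hypothetical helper, the 20 GB threshold comes from the tree, and the run size is assumed to be known in advance (for example from run info metadata).

```python
def plan_download(accessions, size_gb=None, large_threshold_gb=20, outdir="./fastq/"):
    """Return the commands suggested by the decision tree as argument lists.

    Runs larger than the threshold are prefetched and validated before
    conversion; everything else goes straight to fasterq-dump.
    """
    prefetch_first = size_gb is not None and size_gb > large_threshold_gb
    commands = []
    for acc in accessions:
        if prefetch_first:
            commands.append(["prefetch", acc, "-p"])
            commands.append(["vdb-validate", acc])
        commands.append(["fasterq-dump", acc, "-O", outdir, "-p"])
    return commands
```

Each returned list can be passed to `subprocess.run`, which keeps the planning step easy to inspect and test separately from the downloads.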
Related Skills
- entrez-search - Search SRA database to find accessions
- sequence-io - Read downloaded FASTQ files with Biopython
- sequence-io/paired-end-fastq - Handle paired R1/R2 files
- alignment-files - Align downloaded reads