```bash
# Clone the full repository
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Sequence_Analysis/sequence-io/read-sequences" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-read-sequences && rm -rf "$T"
```
Skills/Sequence_Analysis/sequence-io/read-sequences/SKILL.md

---
name: bio-read-sequences
description: Read biological sequence files (FASTA, FASTQ, GenBank, EMBL, ABI, SFF) using Biopython Bio.SeqIO. Use when parsing sequence files, iterating multi-sequence files, random access to large files, or high-performance parsing.
tool_type: python
primary_tool: Bio.SeqIO
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
---
# Read Sequences
Read biological sequence data from files using Biopython's Bio.SeqIO module.
## Required Import

```python
from Bio import SeqIO
```
## Core Functions
### SeqIO.parse() - Multiple Records (Iterator)
Use for files with one or more sequences. Returns an iterator of SeqRecord objects.
```python
for record in SeqIO.parse('sequences.fasta', 'fasta'):
    print(record.id, len(record.seq))
```
**Important:** Always specify the format explicitly as the second argument.
### SeqIO.read() - Single Record Only
Use when file contains exactly one sequence. Raises error if zero or multiple records.
```python
record = SeqIO.read('single.fasta', 'fasta')
```
### SeqIO.to_dict() - Load All Into Memory
Use for random access by record ID. Loads entire file into memory.
```python
records = SeqIO.to_dict(SeqIO.parse('sequences.fasta', 'fasta'))
seq = records['sequence_id'].seq
```
### SeqIO.index() - Large File Random Access
Use for large files when you need random access without loading everything into memory.
```python
records = SeqIO.index('large.fasta', 'fasta')
seq = records['sequence_id'].seq
records.close()
```
### SeqIO.index_db() - SQLite-Backed Indexing
Use for very large files or multiple files. Creates persistent SQLite index.
```python
# Create index (first time - parses file)
records = SeqIO.index_db('index.sqlite', 'large.fasta', 'fasta')
seq = records['sequence_id'].seq
records.close()

# Reuse existing index (instant load)
records = SeqIO.index_db('index.sqlite')

# Index multiple files together
records = SeqIO.index_db('combined.sqlite', ['file1.fasta', 'file2.fasta'], 'fasta')
```
Advantages over `SeqIO.index()`:
- Persistent index survives program restarts
- Can index multiple files as one database
- Lower memory use for extremely large files
- SQLite file can be shared across processes
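To see why a SQLite-backed index stays cheap on memory, here is a minimal sketch of the idea (not Biopython's actual implementation): store only each record's byte offset in SQLite, then seek and read on demand. File contents, record IDs, and the in-memory database are all illustrative; a real index would use an on-disk SQLite file.

```python
import sqlite3
import tempfile

# Write a tiny FASTA file to index (illustrative data)
fasta = tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False)
fasta.write('>seq1 first\nACGT\n>seq2 second\nGGCC\nTTAA\n')
fasta.close()

# Map record id -> byte offset of its '>' header line
db = sqlite3.connect(':memory:')  # a real index would be a file path
db.execute('CREATE TABLE offsets (id TEXT PRIMARY KEY, offset INTEGER)')
with open(fasta.name) as fh:
    pos = fh.tell()
    line = fh.readline()
    while line:
        if line.startswith('>'):
            db.execute('INSERT INTO offsets VALUES (?, ?)',
                       (line[1:].split()[0], pos))
        pos = fh.tell()
        line = fh.readline()

def fetch(record_id):
    """Seek to the stored offset and read one record's sequence."""
    (offset,) = db.execute('SELECT offset FROM offsets WHERE id = ?',
                           (record_id,)).fetchone()
    parts = []
    with open(fasta.name) as fh:
        fh.seek(offset)
        fh.readline()  # skip the '>' header line
        for line in fh:
            if line.startswith('>'):
                break
            parts.append(line.strip())
    return ''.join(parts)

print(fetch('seq2'))  # GGCCTTAA
```

Only the offsets live in memory (or in SQLite), so the index scales with record count, not file size.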
## High-Performance Parsing
For maximum throughput on large files, use the low-level parsers (3-6x faster than `SeqIO.parse()`):
### SimpleFastaParser
```python
from Bio.SeqIO.FastaIO import SimpleFastaParser

with open('large.fasta') as handle:
    for title, sequence in SimpleFastaParser(handle):
        if len(sequence) > 1000:
            print(title.split()[0])  # First word is usually ID
```
Returns `(title, sequence)` tuples as strings (no SeqRecord overhead).
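The speed comes from skipping SeqRecord construction entirely. A rough pure-Python equivalent of what SimpleFastaParser yields (a simplified sketch, not Biopython's code) makes the `(title, sequence)` contract concrete:

```python
import io

def simple_fasta_parse(handle):
    """Yield (title, sequence) string tuples, like SimpleFastaParser."""
    title, parts = None, []
    for line in handle:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if title is not None:
                yield title, ''.join(parts)  # emit the previous record
            title, parts = line[1:], []      # '>' stripped from the title
        elif title is not None:
            parts.append(line)               # join wrapped sequence lines
    if title is not None:
        yield title, ''.join(parts)          # emit the final record

data = io.StringIO('>seq1 demo\nACGT\nACGT\n>seq2\nGG\n')
print(list(simple_fasta_parse(data)))
# [('seq1 demo', 'ACGTACGT'), ('seq2', 'GG')]
```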
### FastqGeneralIterator
```python
from Bio.SeqIO.QualityIO import FastqGeneralIterator

with open('reads.fastq') as handle:
    for title, sequence, quality in FastqGeneralIterator(handle):
        avg_qual = sum(ord(c) - 33 for c in quality) / len(quality)
```
Returns `(title, sequence, quality_string)` tuples.
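For intuition, the common 4-line FASTQ layout can be read with a minimal sketch like the one below (illustrative only: real FASTQ may wrap lines, which FastqGeneralIterator handles and this sketch does not). The `ord(c) - 33` step decodes Sanger/Phred+33 quality characters:

```python
import io

def fastq_iter(handle):
    """Yield (title, sequence, quality) from strict 4-line FASTQ records."""
    while True:
        title = handle.readline().rstrip('\n')
        if not title:
            return
        seq = handle.readline().rstrip('\n')
        handle.readline()                    # '+' separator line, ignored
        qual = handle.readline().rstrip('\n')
        yield title[1:], seq, qual           # strip the leading '@'

reads = io.StringIO('@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n!!\n')
for title, seq, qual in fastq_iter(reads):
    avg = sum(ord(c) - 33 for c in qual) / len(qual)
    print(title, avg)
# r1 40.0  ('I' decodes to Phred 40)
# r2 0.0   ('!' decodes to Phred 0)
```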
## Common Formats
| Format | String | Typical Extension | Notes |
|---|---|---|---|
| FASTA | `fasta` | .fasta, .fa, .fna, .faa | Most common |
| FASTA 2-line | `fasta-2line` | .fasta | One line per sequence (no wrapping) |
| FASTQ | `fastq` | .fastq, .fq | With quality scores |
| FASTQ Solexa | `fastq-solexa` | .fastq | Old Solexa/Illumina (pre-1.3) |
| FASTQ Illumina | `fastq-illumina` | .fastq | Illumina 1.3-1.7 |
| GenBank | `genbank` or `gb` | .gb, .gbk | With features/annotations |
| EMBL | `embl` | .embl | European format with features |
| Swiss-Prot | `swiss` | .dat | UniProt format |
## Specialized Formats
| Format | String | Use Case |
|---|---|---|
| ABI | `abi` | Sanger sequencing trace files (.ab1) |
| ABI Trimmed | `abi-trim` | ABI with low-quality ends trimmed |
| SFF | `sff` | 454/Ion Torrent flowgram data |
| SFF Trimmed | `sff-trim` | SFF with adapter/quality trimming |
| QUAL | `qual` | Quality scores file (pairs with FASTA) |
| PHD | `phd` | Phred/Phrap/Consed output |
| ACE | `ace` | Assembly format (Consed) |
| PDB SEQRES | `pdb-seqres` | Protein sequences from PDB files |
| PDB ATOM | `pdb-atom` | Sequences from ATOM records in PDB |
| SnapGene | `snapgene` | SnapGene .dna files |
| GCK | `gck` | Gene Construction Kit files |
| XDNA | `xdna` | DNA Strider / SerialCloner files |
## Reading ABI Trace Files
```python
# Read Sanger sequencing trace with quality
record = SeqIO.read('sample.ab1', 'abi')
print(f'Sequence: {record.seq}')
qualities = record.letter_annotations['phred_quality']

# Auto-trim low quality ends
record_trimmed = SeqIO.read('sample.ab1', 'abi-trim')
```
## Reading 454/Ion Torrent SFF
```python
for record in SeqIO.parse('reads.sff', 'sff'):
    print(record.id, len(record.seq))

# With trimming applied
for record in SeqIO.parse('reads.sff', 'sff-trim'):
    print(record.id, len(record.seq))
```
## Reading PDB Sequences
```python
# Get sequences from SEQRES records
for record in SeqIO.parse('structure.pdb', 'pdb-seqres'):
    print(f'Chain {record.id}: {record.seq}')

# Get sequences from ATOM coordinates
for record in SeqIO.parse('structure.pdb', 'pdb-atom'):
    print(f'Chain {record.id}: {record.seq}')
```
## Alignment Formats (Read-Only)
| Format | String | Notes |
|---|---|---|
| PHYLIP | `phylip` | Interleaved phylip |
| PHYLIP Sequential | `phylip-sequential` | Sequential phylip |
| PHYLIP Relaxed | `phylip-relaxed` | Longer names allowed |
| Clustal | `clustal` | ClustalW output |
| Stockholm | `stockholm` | Rfam/Pfam alignments |
| NEXUS | `nexus` | PAUP/MrBayes format |
| MAF | `maf` | Multiple Alignment Format |
## SeqRecord Object Attributes
After parsing, each record has these key attributes:
```python
record.id                  # Sequence identifier (string)
record.name                # Sequence name (string)
record.description         # Full description line (string)
record.seq                 # Sequence data (Seq object)
record.features            # List of SeqFeature objects (GenBank/EMBL)
record.annotations         # Dictionary of annotations
record.letter_annotations  # Per-letter annotations (quality scores)
record.dbxrefs             # Database cross-references
```
## Code Patterns
### Collect All Sequences Into a List

```python
records = list(SeqIO.parse('sequences.fasta', 'fasta'))
```
### Count Records Without Loading All

```python
count = sum(1 for _ in SeqIO.parse('sequences.fasta', 'fasta'))
```
### Fast Count (FASTA only)

```python
from Bio.SeqIO.FastaIO import SimpleFastaParser

with open('sequences.fasta') as f:
    count = sum(1 for _ in SimpleFastaParser(f))
```
### Get Sequence IDs Only

```python
ids = [record.id for record in SeqIO.parse('sequences.fasta', 'fasta')]
```
### Read GenBank with Features

```python
for record in SeqIO.parse('sequence.gb', 'genbank'):
    for feature in record.features:
        if feature.type == 'CDS':
            print(feature.qualifiers.get('product', ['Unknown'])[0])
            cds_seq = feature.extract(record.seq)  # Get feature sequence
```
### Access FASTQ Quality Scores

```python
for record in SeqIO.parse('reads.fastq', 'fastq'):
    qualities = record.letter_annotations['phred_quality']
    avg_quality = sum(qualities) / len(qualities)
```
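When setting quality thresholds, it helps to remember that a Phred score Q corresponds to a base-call error probability of 10^(-Q/10), a standard relationship independent of Biopython:

```python
def phred_to_error_prob(q):
    """Probability that the base call is wrong for Phred score q."""
    return 10 ** (-q / 10)

print(phred_to_error_prob(10))  # 0.1   (1 error in 10 calls)
print(phred_to_error_prob(20))  # 0.01  (1 error in 100 calls)
print(phred_to_error_prob(30))  # 0.001 (1 error in 1000 calls)
```

So filtering reads at `avg_quality >= 30` keeps reads whose average per-base error rate is at most about 0.1%.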
### Read From File Handle

```python
with open('sequences.fasta', 'r') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id)
```
### Custom ID Function for Indexing

```python
def get_accession(identifier):
    return identifier.split('.')[0]  # Remove version

records = SeqIO.index('sequences.fasta', 'fasta', key_function=get_accession)
```
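The key_function idea in isolation: the function is applied to each raw identifier to produce the lookup key, so versioned accessions can be fetched without knowing the version suffix. The accession values below are illustrative:

```python
def get_accession(identifier):
    """Drop the '.N' version suffix from an accession."""
    return identifier.split('.')[0]

# Hypothetical versioned ids, as they might appear in a FASTA file
raw_ids = ['NM_000546.6', 'NM_007294.4']

# What an index keyed by get_accession effectively builds
index_keys = {get_accession(i): i for i in raw_ids}
print(index_keys['NM_000546'])  # NM_000546.6
```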
## Common Errors
| Error | Cause | Solution |
|---|---|---|
| `ValueError: More than one record found in handle` | `SeqIO.read()` used on a multi-record file | Use `SeqIO.parse()` instead |
| `ValueError: No records found in handle` | `SeqIO.read()` used on an empty file | Check file exists and has content |
| `ValueError: Unknown format '...'` | Typo in format string | Check format string spelling |
| `UnicodeDecodeError` | Binary file or wrong encoding | Open with the correct mode/encoding or check file type |
| `sqlite3.OperationalError: database is locked` | index_db file locked by another connection | Close other connections first |
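The first two errors both come from SeqIO.read()'s exactly-one contract, which can be sketched in plain Python (a simplified illustration, not Biopython's implementation):

```python
def read_one(records):
    """Return the single item from an iterable; raise ValueError otherwise,
    mirroring SeqIO.read()'s exactly-one contract (simplified sketch)."""
    it = iter(records)
    try:
        first = next(it)
    except StopIteration:
        raise ValueError('No records found in handle') from None
    try:
        next(it)
    except StopIteration:
        return first                     # exactly one item: success
    raise ValueError('More than one record found in handle')

print(read_one(['only']))  # only
```

This is why the fix for a multi-record file is switching to the iterator API rather than catching the exception.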
## Decision Tree
```
Need to read sequences?
├── Single record in file?
│   └── Use SeqIO.read()
├── Multiple records?
│   ├── Need all in memory at once?
│   │   └── Use list(SeqIO.parse()) or SeqIO.to_dict()
│   ├── Process one at a time (memory efficient)?
│   │   └── Use SeqIO.parse() iterator
│   ├── Large file, need random access by ID?
│   │   ├── Single session? → Use SeqIO.index()
│   │   └── Persistent/multi-file? → Use SeqIO.index_db()
│   └── Maximum throughput needed?
│       └── Use SimpleFastaParser or FastqGeneralIterator
├── Sanger sequencing trace?
│   └── Use 'abi' or 'abi-trim' format
├── 454/Ion Torrent data?
│   └── Use 'sff' or 'sff-trim' format
└── Protein from structure?
    └── Use 'pdb-seqres' or 'pdb-atom' format
```
## Related Skills
- write-sequences - Write parsed sequences to new files
- filter-sequences - Filter sequences by criteria after reading
- format-conversion - Convert between formats
- compressed-files - Read gzip/bzip2/BGZF compressed sequence files
- sequence-manipulation/seq-objects - Work with parsed SeqRecord objects
- database-access - Fetch sequences from NCBI instead of local files
- alignment-files - For SAM/BAM/CRAM alignment files, use samtools/pysam