LLMs-Universal-Life-Science-and-Clinical-Skills- read-sequences

<!--

install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Sequence_Analysis/sequence-io/read-sequences" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-read-sequences && rm -rf "$T"
manifest: Skills/Sequence_Analysis/sequence-io/read-sequences/SKILL.md
source content
<!-- # COPYRIGHT NOTICE # This file is part of the "Universal Biomedical Skills" project. # Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu> # All Rights Reserved. # # This code is proprietary and confidential. # Unauthorized copying of this file, via any medium is strictly prohibited. # # Provenance: Authenticated by MD BABU MIA -->

name: bio-read-sequences description: Read biological sequence files (FASTA, FASTQ, GenBank, EMBL, ABI, SFF) using Biopython Bio.SeqIO. Use when parsing sequence files, iterating multi-sequence files, random access to large files, or high-performance parsing. tool_type: python primary_tool: Bio.SeqIO measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools:

  • read_file
  • run_shell_command

Read Sequences

Read biological sequence data from files using Biopython's Bio.SeqIO module.

Required Import

from Bio import SeqIO

Core Functions

SeqIO.parse() - Multiple Records (Iterator)

Use for files with one or more sequences. Returns an iterator of SeqRecord objects.

for record in SeqIO.parse('sequences.fasta', 'fasta'):
    print(record.id, len(record.seq))

Important: Always specify the format explicitly as the second argument.

SeqIO.read() - Single Record Only

Use when file contains exactly one sequence. Raises error if zero or multiple records.

record = SeqIO.read('single.fasta', 'fasta')

SeqIO.to_dict() - Load All Into Memory

Use for random access by record ID. Loads entire file into memory.

records = SeqIO.to_dict(SeqIO.parse('sequences.fasta', 'fasta'))
seq = records['sequence_id'].seq

SeqIO.index() - Large File Random Access

Use for large files when you need random access without loading everything into memory.

records = SeqIO.index('large.fasta', 'fasta')
seq = records['sequence_id'].seq
records.close()

SeqIO.index_db() - SQLite-Backed Indexing

Use for very large files or multiple files. Creates persistent SQLite index.

# Create index (first time - parses file)
records = SeqIO.index_db('index.sqlite', 'large.fasta', 'fasta')
seq = records['sequence_id'].seq
records.close()

# Reuse existing index (instant load)
records = SeqIO.index_db('index.sqlite')

# Index multiple files together
records = SeqIO.index_db('combined.sqlite', ['file1.fasta', 'file2.fasta'], 'fasta')

Advantages over index():

  • Persistent index survives program restarts
  • Can index multiple files as one database
  • Lower memory for extremely large files
  • SQLite file can be shared across processes

High-Performance Parsing

For maximum throughput on large files, use low-level parsers (3-6x faster than SeqIO.parse):

SimpleFastaParser

from Bio.SeqIO.FastaIO import SimpleFastaParser

with open('large.fasta') as handle:
    for title, sequence in SimpleFastaParser(handle):
        if len(sequence) > 1000:
            print(title.split()[0])  # First word is usually ID

Returns

(title, sequence)
tuples as strings (no SeqRecord overhead).

FastqGeneralIterator

from Bio.SeqIO.QualityIO import FastqGeneralIterator

with open('reads.fastq') as handle:
    for title, sequence, quality in FastqGeneralIterator(handle):
        avg_qual = sum(ord(c) - 33 for c in quality) / len(quality)

Returns

(title, sequence, quality_string)
tuples.

Common Formats

FormatStringTypical ExtensionNotes
FASTA
'fasta'
.fasta, .fa, .fna, .faaMost common
FASTA 2-line
'fasta-2line'
.fastaOne line per sequence (no wrapping)
FASTQ
'fastq'
.fastq, .fqWith quality scores
FASTQ Solexa
'fastq-solexa'
.fastqOld Solexa/Illumina (pre-1.3)
FASTQ Illumina
'fastq-illumina'
.fastqIllumina 1.3-1.7
GenBank
'genbank'
or
'gb'
.gb, .gbkWith features/annotations
EMBL
'embl'
.emblEuropean format with features
Swiss-Prot
'swiss'
.datUniProt format

Specialized Formats

FormatStringUse Case
ABI
'abi'
Sanger sequencing trace files (.ab1)
ABI Trimmed
'abi-trim'
ABI with low-quality ends trimmed
SFF
'sff'
454/Ion Torrent flowgram data
SFF Trimmed
'sff-trim'
SFF with adapter/quality trimming
QUAL
'qual'
Quality scores file (pairs with FASTA)
PHD
'phd'
Phred/Phrap/Consed output
ACE
'ace'
Assembly format (Consed)
PDB SEQRES
'pdb-seqres'
Protein sequences from PDB files
PDB ATOM
'pdb-atom'
Sequences from ATOM records in PDB
SnapGene
'snapgene'
SnapGene .dna files
GCK
'gck'
Gene Construction Kit files
XDNA
'xdna'
DNA Strider / SerialCloner files

Reading ABI Trace Files

# Read Sanger sequencing trace with quality
record = SeqIO.read('sample.ab1', 'abi')
print(f'Sequence: {record.seq}')
qualities = record.letter_annotations['phred_quality']

# Auto-trim low quality ends
record_trimmed = SeqIO.read('sample.ab1', 'abi-trim')

Reading 454/Ion Torrent SFF

for record in SeqIO.parse('reads.sff', 'sff'):
    print(record.id, len(record.seq))

# With trimming applied
for record in SeqIO.parse('reads.sff', 'sff-trim'):
    print(record.id, len(record.seq))

Reading PDB Sequences

# Get sequences from SEQRES records
for record in SeqIO.parse('structure.pdb', 'pdb-seqres'):
    print(f'Chain {record.id}: {record.seq}')

# Get sequences from ATOM coordinates
for record in SeqIO.parse('structure.pdb', 'pdb-atom'):
    print(f'Chain {record.id}: {record.seq}')

Alignment Formats (Read-Only)

FormatStringNotes
PHYLIP
'phylip'
Interleaved phylip
PHYLIP Sequential
'phylip-sequential'
Sequential phylip
PHYLIP Relaxed
'phylip-relaxed'
Longer names allowed
Clustal
'clustal'
ClustalW output
Stockholm
'stockholm'
Rfam/Pfam alignments
NEXUS
'nexus'
PAUP/MrBayes format
MAF
'maf'
Multiple Alignment Format

SeqRecord Object Attributes

After parsing, each record has these key attributes:

record.id          # Sequence identifier (string)
record.name        # Sequence name (string)
record.description # Full description line (string)
record.seq         # Sequence data (Seq object)
record.features    # List of SeqFeature objects (GenBank/EMBL)
record.annotations # Dictionary of annotations
record.letter_annotations  # Per-letter annotations (quality scores)
record.dbxrefs     # Database cross-references

Code Patterns

Collect All Sequences Into a List

records = list(SeqIO.parse('sequences.fasta', 'fasta'))

Count Records Without Loading All

count = sum(1 for _ in SeqIO.parse('sequences.fasta', 'fasta'))

Fast Count (FASTA only)

from Bio.SeqIO.FastaIO import SimpleFastaParser
with open('sequences.fasta') as f:
    count = sum(1 for _ in SimpleFastaParser(f))

Get Sequence IDs Only

ids = [record.id for record in SeqIO.parse('sequences.fasta', 'fasta')]

Read GenBank with Features

for record in SeqIO.parse('sequence.gb', 'genbank'):
    for feature in record.features:
        if feature.type == 'CDS':
            print(feature.qualifiers.get('product', ['Unknown'])[0])
            cds_seq = feature.extract(record.seq)  # Get feature sequence

Access FASTQ Quality Scores

for record in SeqIO.parse('reads.fastq', 'fastq'):
    qualities = record.letter_annotations['phred_quality']
    avg_quality = sum(qualities) / len(qualities)

Read From File Handle

with open('sequences.fasta', 'r') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id)

Custom ID Function for Indexing

def get_accession(identifier):
    return identifier.split('.')[0]  # Remove version

records = SeqIO.index('sequences.fasta', 'fasta', key_function=get_accession)

Common Errors

ErrorCauseSolution
ValueError: More than one record
Used
read()
on multi-record file
Use
parse()
instead
ValueError: No records found
Used
read()
on empty file
Check file exists and has content
ValueError: unknown format
Typo in format stringCheck format string spelling
UnicodeDecodeError
Binary file or wrong encodingOpen with
encoding='latin-1'
or check file
sqlite3.OperationalError
index_db file lockedClose other connections first

Decision Tree

Need to read sequences?
├── Single record in file?
│   └── Use SeqIO.read()
├── Multiple records?
│   ├── Need all in memory at once?
│   │   └── Use list(SeqIO.parse()) or SeqIO.to_dict()
│   ├── Process one at a time (memory efficient)?
│   │   └── Use SeqIO.parse() iterator
│   ├── Large file, need random access by ID?
│   │   ├── Single session? → Use SeqIO.index()
│   │   └── Persistent/multi-file? → Use SeqIO.index_db()
│   └── Maximum throughput needed?
│       └── Use SimpleFastaParser or FastqGeneralIterator
├── Sanger sequencing trace?
│   └── Use 'abi' or 'abi-trim' format
├── 454/Ion Torrent data?
│   └── Use 'sff' or 'sff-trim' format
└── Protein from structure?
    └── Use 'pdb-seqres' or 'pdb-atom' format

Related Skills

  • write-sequences - Write parsed sequences to new files
  • filter-sequences - Filter sequences by criteria after reading
  • format-conversion - Convert between formats
  • compressed-files - Read gzip/bzip2/BGZF compressed sequence files
  • sequence-manipulation/seq-objects - Work with parsed SeqRecord objects
  • database-access - Fetch sequences from NCBI instead of local files
  • alignment-files - For SAM/BAM/CRAM alignment files, use samtools/pysam
<!-- AUTHOR_SIGNATURE: 9a7f3c2e-MD-BABU-MIA-2026-MSSM-SECURE -->