LLMs-Universal-Life-Science-and-Clinical-Skills- read-sequences

<!--

install

source · Clone the upstream repo

git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Sequence_Analysis/sequence-io/read-sequences" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-read-sequences && rm -rf "$T"

manifest: Skills/Sequence_Analysis/sequence-io/read-sequences/SKILL.md

source content

name: bio-read-sequences description: Read biological sequence files (FASTA, FASTQ, GenBank, EMBL, ABI, SFF) using Biopython Bio.SeqIO. Use when parsing sequence files, iterating multi-sequence files, random access to large files, or high-performance parsing. tool_type: python primary_tool: Bio.SeqIO measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools:

read_file
run_shell_command

Read Sequences

Read biological sequence data from files using Biopython's Bio.SeqIO module.

Required Import

from Bio import SeqIO

Core Functions

SeqIO.parse() - Multiple Records (Iterator)

Use for files with one or more sequences. Returns an iterator of SeqRecord objects.

for record in SeqIO.parse('sequences.fasta', 'fasta'):
    print(record.id, len(record.seq))

Important: Always specify the format explicitly as the second argument.

SeqIO.read() - Single Record Only

Use when file contains exactly one sequence. Raises error if zero or multiple records.

record = SeqIO.read('single.fasta', 'fasta')

SeqIO.to_dict() - Load All Into Memory

Use for random access by record ID. Loads entire file into memory.

records = SeqIO.to_dict(SeqIO.parse('sequences.fasta', 'fasta'))
seq = records['sequence_id'].seq

SeqIO.index() - Large File Random Access

Use for large files when you need random access without loading everything into memory.

records = SeqIO.index('large.fasta', 'fasta')
seq = records['sequence_id'].seq
records.close()

SeqIO.index_db() - SQLite-Backed Indexing

Use for very large files or multiple files. Creates persistent SQLite index.

# Create index (first time - parses file)
records = SeqIO.index_db('index.sqlite', 'large.fasta', 'fasta')
seq = records['sequence_id'].seq
records.close()

# Reuse existing index (instant load)
records = SeqIO.index_db('index.sqlite')

# Index multiple files together
records = SeqIO.index_db('combined.sqlite', ['file1.fasta', 'file2.fasta'], 'fasta')

Advantages over index():

Persistent index survives program restarts
Can index multiple files as one database
Lower memory for extremely large files
SQLite file can be shared across processes

High-Performance Parsing

For maximum throughput on large files, use low-level parsers (3-6x faster than SeqIO.parse):

SimpleFastaParser

from Bio.SeqIO.FastaIO import SimpleFastaParser

with open('large.fasta') as handle:
    for title, sequence in SimpleFastaParser(handle):
        if len(sequence) > 1000:
            print(title.split()[0])  # First word is usually ID

Returns

(title, sequence)

tuples as strings (no SeqRecord overhead).

FastqGeneralIterator

from Bio.SeqIO.QualityIO import FastqGeneralIterator

with open('reads.fastq') as handle:
    for title, sequence, quality in FastqGeneralIterator(handle):
        avg_qual = sum(ord(c) - 33 for c in quality) / len(quality)

Returns

(title, sequence, quality_string)

tuples.

Common Formats

Format	String	Typical Extension	Notes
FASTA	`'fasta'`	.fasta, .fa, .fna, .faa	Most common
FASTA 2-line	`'fasta-2line'`	.fasta	One line per sequence (no wrapping)
FASTQ	`'fastq'`	.fastq, .fq	With quality scores
FASTQ Solexa	`'fastq-solexa'`	.fastq	Old Solexa/Illumina (pre-1.3)
FASTQ Illumina	`'fastq-illumina'`	.fastq	Illumina 1.3-1.7
GenBank	`'genbank'` or `'gb'`	.gb, .gbk	With features/annotations
EMBL	`'embl'`	.embl	European format with features
Swiss-Prot	`'swiss'`	.dat	UniProt format

Specialized Formats

Format	String	Use Case
ABI	`'abi'`	Sanger sequencing trace files (.ab1)
ABI Trimmed	`'abi-trim'`	ABI with low-quality ends trimmed
SFF	`'sff'`	454/Ion Torrent flowgram data
SFF Trimmed	`'sff-trim'`	SFF with adapter/quality trimming
QUAL	`'qual'`	Quality scores file (pairs with FASTA)
PHD	`'phd'`	Phred/Phrap/Consed output
ACE	`'ace'`	Assembly format (Consed)
PDB SEQRES	`'pdb-seqres'`	Protein sequences from PDB files
PDB ATOM	`'pdb-atom'`	Sequences from ATOM records in PDB
SnapGene	`'snapgene'`	SnapGene .dna files
GCK	`'gck'`	Gene Construction Kit files
XDNA	`'xdna'`	DNA Strider / SerialCloner files

Reading ABI Trace Files

# Read Sanger sequencing trace with quality
record = SeqIO.read('sample.ab1', 'abi')
print(f'Sequence: {record.seq}')
qualities = record.letter_annotations['phred_quality']

# Auto-trim low quality ends
record_trimmed = SeqIO.read('sample.ab1', 'abi-trim')

Reading 454/Ion Torrent SFF

for record in SeqIO.parse('reads.sff', 'sff'):
    print(record.id, len(record.seq))

# With trimming applied
for record in SeqIO.parse('reads.sff', 'sff-trim'):
    print(record.id, len(record.seq))

Reading PDB Sequences

# Get sequences from SEQRES records
for record in SeqIO.parse('structure.pdb', 'pdb-seqres'):
    print(f'Chain {record.id}: {record.seq}')

# Get sequences from ATOM coordinates
for record in SeqIO.parse('structure.pdb', 'pdb-atom'):
    print(f'Chain {record.id}: {record.seq}')

Alignment Formats (Read-Only)

Format	String	Notes
PHYLIP	`'phylip'`	Interleaved phylip
PHYLIP Sequential	`'phylip-sequential'`	Sequential phylip
PHYLIP Relaxed	`'phylip-relaxed'`	Longer names allowed
Clustal	`'clustal'`	ClustalW output
Stockholm	`'stockholm'`	Rfam/Pfam alignments
NEXUS	`'nexus'`	PAUP/MrBayes format
MAF	`'maf'`	Multiple Alignment Format

SeqRecord Object Attributes

After parsing, each record has these key attributes:

record.id          # Sequence identifier (string)
record.name        # Sequence name (string)
record.description # Full description line (string)
record.seq         # Sequence data (Seq object)
record.features    # List of SeqFeature objects (GenBank/EMBL)
record.annotations # Dictionary of annotations
record.letter_annotations  # Per-letter annotations (quality scores)
record.dbxrefs     # Database cross-references

Code Patterns

Collect All Sequences Into a List

records = list(SeqIO.parse('sequences.fasta', 'fasta'))

Count Records Without Loading All

count = sum(1 for _ in SeqIO.parse('sequences.fasta', 'fasta'))

Fast Count (FASTA only)

from Bio.SeqIO.FastaIO import SimpleFastaParser
with open('sequences.fasta') as f:
    count = sum(1 for _ in SimpleFastaParser(f))

Get Sequence IDs Only

ids = [record.id for record in SeqIO.parse('sequences.fasta', 'fasta')]

Read GenBank with Features

for record in SeqIO.parse('sequence.gb', 'genbank'):
    for feature in record.features:
        if feature.type == 'CDS':
            print(feature.qualifiers.get('product', ['Unknown'])[0])
            cds_seq = feature.extract(record.seq)  # Get feature sequence

Access FASTQ Quality Scores

for record in SeqIO.parse('reads.fastq', 'fastq'):
    qualities = record.letter_annotations['phred_quality']
    avg_quality = sum(qualities) / len(qualities)

Read From File Handle

with open('sequences.fasta', 'r') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id)

Custom ID Function for Indexing

def get_accession(identifier):
    return identifier.split('.')[0]  # Remove version

records = SeqIO.index('sequences.fasta', 'fasta', key_function=get_accession)

Common Errors

Error	Cause	Solution
`ValueError: More than one record`	Used `read()` on multi-record file	Use `parse()` instead
`ValueError: No records found`	Used `read()` on empty file	Check file exists and has content
`ValueError: unknown format`	Typo in format string	Check format string spelling
`UnicodeDecodeError`	Binary file or wrong encoding	Open with `encoding='latin-1'` or check file
`sqlite3.OperationalError`	index_db file locked	Close other connections first

Decision Tree

Need to read sequences?
├── Single record in file?
│   └── Use SeqIO.read()
├── Multiple records?
│   ├── Need all in memory at once?
│   │   └── Use list(SeqIO.parse()) or SeqIO.to_dict()
│   ├── Process one at a time (memory efficient)?
│   │   └── Use SeqIO.parse() iterator
│   ├── Large file, need random access by ID?
│   │   ├── Single session? → Use SeqIO.index()
│   │   └── Persistent/multi-file? → Use SeqIO.index_db()
│   └── Maximum throughput needed?
│       └── Use SimpleFastaParser or FastqGeneralIterator
├── Sanger sequencing trace?
│   └── Use 'abi' or 'abi-trim' format
├── 454/Ion Torrent data?
│   └── Use 'sff' or 'sff-trim' format
└── Protein from structure?
    └── Use 'pdb-seqres' or 'pdb-atom' format

Related Skills

write-sequences - Write parsed sequences to new files
filter-sequences - Filter sequences by criteria after reading
format-conversion - Convert between formats
compressed-files - Read gzip/bzip2/BGZF compressed sequence files
sequence-manipulation/seq-objects - Work with parsed SeqRecord objects
database-access - Fetch sequences from NCBI instead of local files
alignment-files - For SAM/BAM/CRAM alignment files, use samtools/pysam