install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Sequence_Analysis/sequence-io/compressed-files" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-compressed-files && rm -rf "$T"
manifest: Skills/Sequence_Analysis/sequence-io/compressed-files/SKILL.md
<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA
-->
name: bio-compressed-files
description: Read and write compressed sequence files (gzip, bzip2, BGZF) using Biopython. Use when working with .gz or .bz2 sequence files. Use BGZF for indexable compressed files.
tool_type: python
primary_tool: Bio.bgzf
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
Compressed Files
Handle gzip, bzip2, and BGZF compressed sequence files with Biopython.
Required Imports
    import gzip
    import bz2

    from Bio import SeqIO
    from Bio import bgzf  # For BGZF (indexable compression)
Reading Compressed Files
Gzip (.gz)
    with gzip.open('sequences.fasta.gz', 'rt') as handle:
        for record in SeqIO.parse(handle, 'fasta'):
            print(record.id, len(record.seq))
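Important: Use 'rt' (read text) mode, not 'rb' (read binary). gzip.open() and bz2.open() default to binary mode, while SeqIO's text formats such as FASTA and FASTQ expect text handles.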
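To make the mode difference concrete, here is a minimal sketch that uses only the standard library. The helper `first_line` and the temporary file `demo.fasta.gz` are hypothetical names introduced for illustration; the point is only that the same file yields `str` in text mode and `bytes` in binary mode.

```python
import gzip
import os
import tempfile


def first_line(path, mode):
    """Return the first line of a gzip file opened in the given mode."""
    with gzip.open(path, mode) as handle:
        return handle.readline()


# Write a tiny gzipped FASTA file to a temporary location (illustrative data only).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.fasta.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write(">seq1\nACGT\n")

text_line = first_line(demo_path, "rt")    # text mode returns str
binary_line = first_line(demo_path, "rb")  # binary mode returns bytes

print(type(text_line).__name__, repr(text_line))
print(type(binary_line).__name__, repr(binary_line))
```

A text-based parser that compares lines against string markers such as `>` cannot work with the `bytes` variant, which is why the text-mode handle is the one to pass to SeqIO.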
Bzip2 (.bz2)
    with bz2.open('sequences.fasta.bz2', 'rt') as handle:
        for record in SeqIO.parse(handle, 'fasta'):
            print(record.id, len(record.seq))
BGZF (Block Gzip)
BGZF files are valid gzip, so they can be read sequentially like regular gzip, and they also support indexing:
    # Read like normal gzip. SeqIO.parse() does not decompress a filename
    # on its own, so pass an open text handle.
    with gzip.open('sequences.fasta.bgz', 'rt') as handle:
        for record in SeqIO.parse(handle, 'fasta'):
            print(record.id)

    # Or explicitly with the bgzf module
    with bgzf.open('sequences.fasta.bgz', 'rt') as handle:
        for record in SeqIO.parse(handle, 'fasta'):
            print(record.id)
Writing Compressed Files
Gzip (.gz)
    with gzip.open('output.fasta.gz', 'wt') as handle:
        SeqIO.write(records, handle, 'fasta')
Bzip2 (.bz2)
    with bz2.open('output.fasta.bz2', 'wt') as handle:
        SeqIO.write(records, handle, 'fasta')
BGZF (.bgz)
    with bgzf.open('output.fasta.bgz', 'wt') as handle:
        SeqIO.write(records, handle, 'fasta')
BGZF: Indexable Compression
BGZF is the only compressed format that supports SeqIO.index() and SeqIO.index_db().
BGZF (Block GZip Format) is a variant of gzip that allows random access. It's used by BAM files and tabix-indexed files.
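One way to see why BGZF is "a variant of gzip" is to look at the block header: each BGZF block is a gzip member whose extra field carries a `BC` subfield recording the block size. The sketch below checks for that marker using only the standard library. The function names `looks_like_bgzf` and `is_bgzf` are hypothetical, and the check only inspects the first block, so treat it as a heuristic rather than a validator.

```python
import gzip
import struct

# Empty BGZF block used as the end-of-file marker (28 bytes, per the SAM/BAM spec).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 "
    "1b 00 03 00 00 00 00 00 00 00 00 00"
)


def looks_like_bgzf(data: bytes) -> bool:
    """Return True if data starts with a gzip header carrying the BGZF 'BC' subfield."""
    # gzip magic (1f 8b), deflate method (08), FEXTRA flag set (04)
    if len(data) < 12 or data[:4] != b"\x1f\x8b\x08\x04":
        return False
    (xlen,) = struct.unpack("<H", data[10:12])
    extra = data[12:12 + xlen]
    pos = 0
    # Walk the extra subfields: two ID bytes, a 2-byte length, then the payload.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


def is_bgzf(path) -> bool:
    """Hypothetical helper: inspect the first block header of a file on disk."""
    with open(path, "rb") as handle:
        return looks_like_bgzf(handle.read(12 + 65535))


plain_gzip = gzip.compress(b">seq1\nACGT\n")
print(looks_like_bgzf(BGZF_EOF))    # header of a BGZF block
print(looks_like_bgzf(plain_gzip))  # header of an ordinary gzip member
print(gzip.decompress(BGZF_EOF))    # a BGZF block is still readable as gzip
```

Because every block is an independent gzip member of bounded size, a reader can jump to a block boundary and start decompressing there, which is what makes indexing possible.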
Create Indexable Compressed File
    from Bio import SeqIO, bgzf

    # Write as BGZF (can be indexed later)
    records = SeqIO.parse('input.fasta', 'fasta')
    with bgzf.open('output.fasta.bgz', 'wt') as handle:
        SeqIO.write(records, handle, 'fasta')
Index a BGZF File
    # SeqIO.index() works with BGZF!
    records = SeqIO.index('sequences.fasta.bgz', 'fasta')
    seq = records['target_id'].seq
    records.close()

    # SeqIO.index_db() also works
    records = SeqIO.index_db('index.sqlite', 'sequences.fasta.bgz', 'fasta')
Convert Gzip to BGZF
    from Bio import SeqIO, bgzf
    import gzip

    # Read from gzip, write to BGZF
    with gzip.open('input.fasta.gz', 'rt') as in_handle:
        with bgzf.open('output.fasta.bgz', 'wt') as out_handle:
            SeqIO.write(SeqIO.parse(in_handle, 'fasta'), out_handle, 'fasta')
Code Patterns
Read Gzipped FASTQ
    with gzip.open('reads.fastq.gz', 'rt') as handle:
        records = list(SeqIO.parse(handle, 'fastq'))
    print(f'Loaded {len(records)} reads')
Count Records in Gzipped File
    with gzip.open('sequences.fasta.gz', 'rt') as handle:
        count = sum(1 for _ in SeqIO.parse(handle, 'fasta'))
    print(f'{count} sequences')
Fast Count with Low-Level Parser
    from Bio.SeqIO.FastaIO import SimpleFastaParser
    import gzip

    with gzip.open('sequences.fasta.gz', 'rt') as handle:
        count = sum(1 for _ in SimpleFastaParser(handle))
Convert Compressed to Uncompressed
    with gzip.open('input.fasta.gz', 'rt') as in_handle:
        records = SeqIO.parse(in_handle, 'fasta')
        # Write inside the with block: SeqIO.parse() is lazy and needs the open handle
        SeqIO.write(records, 'output.fasta', 'fasta')
Convert Uncompressed to Compressed
    records = SeqIO.parse('input.fasta', 'fasta')
    with gzip.open('output.fasta.gz', 'wt') as out_handle:
        SeqIO.write(records, out_handle, 'fasta')
Auto-Detect Compression
    from pathlib import Path
    from Bio import SeqIO
    import gzip
    import bz2

    def open_sequence_file(filepath, format):
        filepath = Path(filepath)
        suffix = filepath.suffix.lower()

        if suffix in ('.gz', '.bgz'):
            # BGZF is valid gzip, so gzip.open() reads both;
            # bgzf.open() would reject a plain gzip file
            handle = gzip.open(filepath, 'rt')
        elif suffix == '.bz2':
            handle = bz2.open(filepath, 'rt')
        else:
            handle = open(filepath, 'r')

        # The handle stays open while the caller iterates over the records
        return SeqIO.parse(handle, format)
Process Large Gzipped File (Memory Efficient)
    with gzip.open('large.fastq.gz', 'rt') as handle:
        for record in SeqIO.parse(handle, 'fastq'):
            if len(record.seq) >= 100:
                process(record)  # process() is a placeholder for your own function
Compress Existing File (Raw Copy)
    import gzip
    import shutil

    # Byte-for-byte copy: no parsing, so both handles are opened in binary mode
    with open('sequences.fasta', 'rb') as f_in:
        with gzip.open('sequences.fasta.gz', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
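After a raw copy like the one above, it can be reassuring to confirm that the compressed file decompresses back to the original bytes before deleting the source. The following is a minimal sketch under that assumption; `sha256_of` and `gzip_matches_source` are hypothetical helper names, and the demo file is synthetic.

```python
import gzip
import hashlib
import os
import shutil
import tempfile


def sha256_of(handle, chunk_size=1 << 20):
    """Hash a binary file-like object in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def gzip_matches_source(source_path, gz_path):
    """Return True if the decompressed gzip content equals the source file."""
    with open(source_path, "rb") as raw, gzip.open(gz_path, "rb") as unzipped:
        return sha256_of(raw) == sha256_of(unzipped)


# Demo with a small synthetic FASTA file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
source = os.path.join(tmp_dir, "sequences.fasta")
compressed = source + ".gz"
with open(source, "w") as handle:
    handle.write(">seq1\nACGTACGT\n>seq2\nTTGGCCAA\n")

with open(source, "rb") as f_in, gzip.open(compressed, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

copy_is_intact = gzip_matches_source(source, compressed)
print(copy_is_intact)
```

Hashing both streams avoids loading either file fully into memory, which matters for the multi-gigabyte files this pattern is typically used on.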
Compression Comparison
| Format | Extension | Indexable | Speed | Compression |
|---|---|---|---|---|
| Gzip | .gz | No | Fast | Good |
| BGZF | .bgz | Yes | Fast | Good |
| Bzip2 | .bz2 | No | Slow | Better |
| LZMA | .xz | No | Slowest | Best |
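The ratings in this table are qualitative, and actual ratios depend heavily on the data. A rough way to check them on your own files is to compress the same bytes with each standard-library codec and compare sizes. The sketch below does that on a synthetic random DNA sequence, which is an assumption made for the demo and is not representative of real genomes or of FASTQ quality strings; `compressed_sizes` is a hypothetical helper name.

```python
import bz2
import gzip
import lzma
import random


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of data under each standard-library codec."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
        "xz": len(lzma.compress(data)),
    }


# Build a synthetic FASTA record with 60-column line wrapping (illustrative only).
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(60_000))
lines = [seq[i:i + 60] for i in range(0, len(seq), 60)]
fasta = (">synthetic\n" + "\n".join(lines) + "\n").encode("ascii")

sizes = compressed_sizes(fasta)
for name, size in sizes.items():
    print(f"{name:>5}: {size} bytes ({size / sizes['raw']:.0%} of raw)")
```

To compare speed as well, the same calls can be wrapped with `time.perf_counter()`; timings are machine-dependent, so none are shown here.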
When to Use Each Format
| Use Case | Recommended Format |
|---|---|
| Archive (no random access needed) | gzip or bzip2 |
| Need to index compressed file | BGZF |
| BAM files and tabix | BGZF (native) |
| Maximum compression | bzip2 or xz |
| Best speed | gzip or BGZF |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| | Used 'rb' mode | Use 'rt' for text mode |
| | Wrong encoding | Try passing an explicit encoding argument when opening the file |
| | Not a gzip file | Check file extension matches actual format |
| | Corrupt or wrong format | Verify file integrity |
| | Regular gzip not indexable | Convert to BGZF first |
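When the extension and the actual format disagree, checking the first few bytes of the file is usually more reliable than trusting the name. The sketch below identifies gzip, bzip2, and xz by their magic bytes and picks the matching standard-library opener. `sniff_compression` and `open_text` are hypothetical helpers, and the mislabeled demo file is created only to exercise them.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Leading magic bytes of each format (BGZF files also start with the gzip magic).
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bz2",
    b"\xfd7zXZ\x00": "xz",
}

OPENERS = {"gzip": gzip.open, "bz2": bz2.open, "xz": lzma.open, None: open}


def sniff_compression(path):
    """Return 'gzip', 'bz2', 'xz', or None based on the file's first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(6)
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return None


def open_text(path):
    """Open a possibly compressed file in text mode, ignoring its extension."""
    return OPENERS[sniff_compression(path)](path, "rt")


# Demo: bzip2 content stored under a misleading .gz name.
tmp_dir = tempfile.mkdtemp()
payload = ">seq1\nACGT\n"
mislabeled = os.path.join(tmp_dir, "sequences.fasta.gz")
with bz2.open(mislabeled, "wt") as handle:
    handle.write(payload)

detected = sniff_compression(mislabeled)
with open_text(mislabeled) as handle:
    recovered = handle.read()
print(detected, repr(recovered))
```

The handle returned by `open_text` can be passed to `SeqIO.parse()` in place of one opened by extension.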
Decision Tree
    Working with compressed sequence files?
    ├── Just reading sequentially?
    │   └── Use gzip.open() or bz2.open() with 'rt' mode
    ├── Need to index the compressed file?
    │   └── Convert to BGZF, then use SeqIO.index()
    ├── Writing compressed output?
    │   ├── Will need to index later? → Use bgzf.open()
    │   └── Just archiving? → Use gzip.open() or bz2.open()
    └── Converting between formats?
        └── Parse with SeqIO, write to new handle
Related Skills
- read-sequences - Core parsing functions used with compressed handles
- write-sequences - Write to compressed output files
- batch-processing - Process multiple compressed files
- alignment-files - BAM files use BGZF natively; samtools handles compression