OpenClaw-Medical-Skills bio-compressed-files
Read and write compressed sequence files (gzip, bzip2, BGZF) using Biopython. Use when working with .gz or .bz2 sequence files. Use BGZF for indexable compressed files.
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bio-compressed-files" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-bio-compressed-files && rm -rf "$T"
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/bio-compressed-files" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-bio-compressed-files && rm -rf "$T"
skills/bio-compressed-files/SKILL.md
Version Compatibility
Reference examples tested with: BioPython 1.83+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: pip show <package>, then help(module.function) to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
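The same check can be run from inside Python. The sketch below uses only the standard library; the helper names installed_version and show_signature are illustrative and not part of Biopython, and gzip.open is used purely as a stand-in for whichever function you need to inspect.

```python
import gzip
import inspect
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def show_signature(func):
    """Return the call signature of a function as text."""
    return f"{func.__name__}{inspect.signature(func)}"


# 'biopython' is the distribution name on PyPI; prints None if it is missing.
print("biopython:", installed_version("biopython"))
print(show_signature(gzip.open))
```

If the printed version is older than the one listed above, compare the signature against the example before adapting it.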
Compressed Files
Handle gzip, bzip2, and BGZF compressed sequence files with Biopython.
"Read a compressed sequence file" → Open a compressed file handle in text mode, then parse with the standard SeqIO interface.
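- gzip: gzip.open(path, 'rt') (Python stdlib)
- bzip2: bz2.open(path, 'rt') (Python stdlib)
- BGZF: bgzf.open(path, 'rt') (Biopython); gzip.open(path, 'rt') also works, because BGZF is valid gzip
Note that SeqIO.parse(path, fmt) does not decompress a bare path; always pass it an open handle for compressed input.
"Make a compressed file indexable" → Convert to BGZF format. Only BGZF supports SeqIO.index() on compressed data.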
Required Imports
    import gzip
    import bz2
    from Bio import SeqIO
    from Bio import bgzf
Reading Compressed Files
Goal: Parse sequence records from compressed files without decompressing to disk.
Approach: Open a decompression handle in text mode ('rt'), then pass the handle to SeqIO.parse(). The parser works identically to uncompressed input.
Gzip (.gz) (BioPython 1.83+)
    with gzip.open('sequences.fasta.gz', 'rt') as handle:
        for record in SeqIO.parse(handle, 'fasta'):
            print(record.id, len(record.seq))
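Important: Use 'rt' (read text) mode, not 'rb' (read binary). gzip.open() and bz2.open() default to binary mode, and the SeqIO text parsers need str lines rather than bytes.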
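The difference between the two modes is easy to see with the standard library alone. This is a small sketch on made-up sample data written to a temporary directory; the file name demo.fasta.gz is arbitrary.

```python
import gzip
import os
import tempfile

# Write a tiny gzipped FASTA file to a temporary directory (sample data is made up).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.fasta.gz")
with gzip.open(path, "wt") as handle:
    handle.write(">seq1\nACGT\n")

# Binary mode (the default) yields bytes, which text parsers cannot use.
with gzip.open(path, "rb") as handle:
    binary_line = handle.readline()

# Text mode decodes to str, which is what SeqIO.parse() expects.
with gzip.open(path, "rt") as handle:
    text_line = handle.readline()

print(type(binary_line).__name__, binary_line)
print(type(text_line).__name__, repr(text_line))
```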
Bzip2 (.bz2) (BioPython 1.83+)
    with bz2.open('sequences.fasta.bz2', 'rt') as handle:
        for record in SeqIO.parse(handle, 'fasta'):
            print(record.id, len(record.seq))
BGZF (Block Gzip) (BioPython 1.83+)
BGZF files can be read like regular gzip, but also support indexing:
    # SeqIO.parse() does not decompress a bare path, so open a handle first
    with bgzf.open('sequences.fasta.bgz', 'rt') as handle:
        for record in SeqIO.parse(handle, 'fasta'):
            print(record.id)
Writing Compressed Files
Goal: Save sequence records directly to compressed files without an intermediate uncompressed step.
Approach: Open a compression handle in text mode ('wt'), then pass it to SeqIO.write().
Gzip (.gz)
    with gzip.open('output.fasta.gz', 'wt') as handle:
        SeqIO.write(records, handle, 'fasta')
Bzip2 (.bz2)
    with bz2.open('output.fasta.bz2', 'wt') as handle:
        SeqIO.write(records, handle, 'fasta')
BGZF (.bgz)
    with bgzf.open('output.fasta.bgz', 'wt') as handle:
        SeqIO.write(records, handle, 'fasta')
BGZF: Indexable Compression
Goal: Enable random access to records in compressed sequence files.
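Approach: Write sequences in BGZF (Blocked GNU Zip Format), the only compressed format supporting SeqIO.index() and SeqIO.index_db(). BGZF is a gzip variant used by BAM and tabix-indexed files: the data is stored as a series of small, independently compressed gzip blocks, so a reader can jump to any block without decompressing everything before it.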
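Because a BGZF file is also a valid gzip file, the extension alone does not say whether a given .gz file is indexable. The sketch below checks the first gzip header for the 'BC' extra subfield that the SAM/BAM specification defines for BGZF. The helper name is_bgzf is hypothetical, and the check inspects only the first block.

```python
import gzip
import os
import struct
import tempfile

# The 28-byte empty BGZF block that the SAM/BAM specification uses as an EOF marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)


def is_bgzf(path):
    """Return True if the first gzip member carries the BGZF 'BC' extra subfield."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        # Magic bytes, deflate method, and the FEXTRA flag bit must all be present.
        if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)
    # Walk the extra subfields looking for identifier 'B', 'C' with a 2-byte payload.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


# Demo on two throwaway files: a bare BGZF EOF block and an ordinary gzip stream.
tmp_dir = tempfile.mkdtemp()
bgzf_path = os.path.join(tmp_dir, "empty.bgz")
plain_path = os.path.join(tmp_dir, "plain.gz")
with open(bgzf_path, "wb") as out:
    out.write(BGZF_EOF)
with open(plain_path, "wb") as out:
    out.write(gzip.compress(b">seq1\nACGT\n"))

print(is_bgzf(bgzf_path), is_bgzf(plain_path))
```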
Create Indexable Compressed File
    from Bio import SeqIO, bgzf

    records = SeqIO.parse('input.fasta', 'fasta')
    with bgzf.open('output.fasta.bgz', 'wt') as handle:
        SeqIO.write(records, handle, 'fasta')
Index a BGZF File
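    # In-memory index; BGZF compression is detected automatically
    records = SeqIO.index('sequences.fasta.bgz', 'fasta')
    seq = records['target_id'].seq
    records.close()

    # SQLite-backed index, stored on disk and reusable across runs
    records = SeqIO.index_db('index.sqlite', 'sequences.fasta.bgz', 'fasta')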
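An end-to-end round trip may make the indexing step more concrete. This sketch assumes Biopython is installed, writes two made-up records to a temporary BGZF file, and fetches one back by ID; the record IDs and file name are illustrative only.

```python
import os
import tempfile

from Bio import SeqIO, bgzf
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up sample records, used only for illustration.
sample = [
    SeqRecord(Seq("ACGTACGT"), id="seq1", description=""),
    SeqRecord(Seq("GGCCTTAA"), id="seq2", description=""),
]

# Write the records through a BGZF handle in text mode.
path = os.path.join(tempfile.mkdtemp(), "demo.fasta.bgz")
with bgzf.open(path, "wt") as handle:
    SeqIO.write(sample, handle, "fasta")

# Index the compressed file and fetch one record by ID without a full parse.
index = SeqIO.index(path, "fasta")
try:
    fetched = str(index["seq2"].seq)
    n_records = len(index)
finally:
    index.close()

print(n_records, fetched)
```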
Convert Gzip to BGZF
"Convert gzip to indexable format" → Parse from gzip handle, write through BGZF handle.
    from Bio import SeqIO, bgzf
    import gzip

    with gzip.open('input.fasta.gz', 'rt') as in_handle:
        with bgzf.open('output.fasta.bgz', 'wt') as out_handle:
            SeqIO.write(SeqIO.parse(in_handle, 'fasta'), out_handle, 'fasta')
Code Patterns
Read Gzipped FASTQ
    with gzip.open('reads.fastq.gz', 'rt') as handle:
        records = list(SeqIO.parse(handle, 'fastq'))
    print(f'Loaded {len(records)} reads')
Count Records in Gzipped File
    with gzip.open('sequences.fasta.gz', 'rt') as handle:
        count = sum(1 for _ in SeqIO.parse(handle, 'fasta'))
    print(f'{count} sequences')
Fast Count with Low-Level Parser
    from Bio.SeqIO.FastaIO import SimpleFastaParser
    import gzip

    with gzip.open('sequences.fasta.gz', 'rt') as handle:
        count = sum(1 for _ in SimpleFastaParser(handle))
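When only the record count matters, counting header lines with the standard library is a lighter alternative. This sketch assumes well-formed FASTA in which every record starts with '>' at the beginning of a line; it performs no validation, and the helper name count_fasta_headers is hypothetical.

```python
import gzip
import os
import tempfile


def count_fasta_headers(path):
    """Count lines starting with '>' in a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Build a small sample file (made-up data) and count its records.
path = os.path.join(tempfile.mkdtemp(), "demo.fasta.gz")
with gzip.open(path, "wt") as handle:
    handle.write(">a\nACGT\nACGT\n>b\nGG\n>c\nTT\n")

print(count_fasta_headers(path))
```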
Convert Compressed to Uncompressed
    with gzip.open('input.fasta.gz', 'rt') as in_handle:
        records = SeqIO.parse(in_handle, 'fasta')
        # Write inside the with block: records is lazy and needs the open handle
        SeqIO.write(records, 'output.fasta', 'fasta')
Convert Uncompressed to Compressed
    records = SeqIO.parse('input.fasta', 'fasta')
    with gzip.open('output.fasta.gz', 'wt') as out_handle:
        SeqIO.write(records, out_handle, 'fasta')
Auto-Detect Compression
    from pathlib import Path
    from Bio import SeqIO, bgzf
    import gzip
    import bz2

    def open_sequence_file(filepath, format):
        filepath = Path(filepath)
        suffix = filepath.suffix.lower()

        if suffix == '.gz':
            # Could be gzip or BGZF: gzip.open() reads both, bgzf.open() rejects plain gzip
            handle = gzip.open(filepath, 'rt')
        elif suffix == '.bgz':
            handle = bgzf.open(filepath, 'rt')
        elif suffix == '.bz2':
            handle = bz2.open(filepath, 'rt')
        else:
            handle = open(filepath, 'r')

        # The handle stays open while the returned iterator is being consumed
        return SeqIO.parse(handle, format)
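Suffix-based detection misfires when a file is misnamed. A complementary approach is to look at the leading magic bytes instead. The sketch below uses only the standard library; sniff_compression and open_text are hypothetical helper names, and BGZF files are reported as gzip because they share the gzip magic bytes.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Leading bytes that identify each container format.
MAGIC = {
    b"\x1f\x8b": "gzip",  # also matches BGZF, which is gzip-compatible
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
}


def sniff_compression(path):
    """Return 'gzip', 'bzip2', 'xz', or 'none' based on the first bytes of the file."""
    with open(path, "rb") as handle:
        head = handle.read(6)
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return "none"


def open_text(path):
    """Open a possibly compressed file in text mode, whatever its extension says."""
    openers = {"gzip": gzip.open, "bzip2": bz2.open, "xz": lzma.open}
    opener = openers.get(sniff_compression(path), open)
    return opener(path, "rt")


# Demo on throwaway files, including a gzip stream with a misleading name.
tmp_dir = tempfile.mkdtemp()
payload = b">seq1\nACGT\n"
samples = {
    "plain.fasta": payload,
    "misnamed.txt": gzip.compress(payload),
    "a.bz2": bz2.compress(payload),
    "a.xz": lzma.compress(payload),
}
detected = {}
for name, blob in samples.items():
    path = os.path.join(tmp_dir, name)
    with open(path, "wb") as out:
        out.write(blob)
    with open_text(path) as handle:
        detected[name] = (sniff_compression(path), handle.readline())

print(detected)
```

The handle returned by open_text can be passed to SeqIO.parse() in the same way as the handles opened above.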
Process Large Gzipped File (Memory Efficient)
    with gzip.open('large.fastq.gz', 'rt') as handle:
        for record in SeqIO.parse(handle, 'fastq'):
            if len(record.seq) >= 100:
                # process() is a placeholder for your own per-record function
                process(record)
Compress Existing File (Raw Copy)
    import gzip
    import shutil

    with open('sequences.fasta', 'rb') as f_in:
        with gzip.open('sequences.fasta.gz', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
Compression Comparison
| Format | Extension | Indexable | Speed | Compression |
|---|---|---|---|---|
| Gzip | .gz | No | Fast | Good |
| BGZF | .bgz | Yes | Fast | Good |
| Bzip2 | .bz2 | No | Slow | Better |
| LZMA | .xz | No | Slowest | Best |
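The ratings in the table are general tendencies, and actual ratios depend on the data. A quick way to check on your own input is to compress the same bytes with each standard-library codec and compare sizes. The sample below is made-up, highly repetitive FASTA text, so its numbers are not representative of real sequence data.

```python
import bz2
import gzip
import lzma

# Made-up, highly repetitive FASTA text; real data will compress differently.
data = (">seq\n" + "ACGTTGCAAGCT" * 50 + "\n").encode("ascii") * 200

# Compress the same bytes with each codec at its default settings.
sizes = {
    "raw": len(data),
    "gzip": len(gzip.compress(data)),
    "bzip2": len(bz2.compress(data)),
    "xz": len(lzma.compress(data)),
}
for name, size in sizes.items():
    print(f"{name:>6}: {size} bytes")
```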
When to Use Each Format
| Use Case | Recommended Format |
|---|---|
| Archive (no random access needed) | gzip or bzip2 |
| Need to index compressed file | BGZF |
| BAM files and tabix | BGZF (native) |
| Maximum compression | bzip2 or xz |
| Best speed | gzip or BGZF |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| | Used 'rb' mode | Use 'rt' for text mode |
| | Wrong encoding | Try a different encoding |
| | Not a gzip file | Check file extension matches actual format |
| | Corrupt or wrong format | Verify file integrity |
| | Regular gzip not indexable | Convert to BGZF first |
Decision Tree
    Working with compressed sequence files?
    ├── Just reading sequentially?
    │   └── Use gzip.open() or bz2.open() with 'rt' mode
    ├── Need to index the compressed file?
    │   └── Convert to BGZF, then use SeqIO.index()
    ├── Writing compressed output?
    │   ├── Will need to index later? → Use bgzf.open()
    │   └── Just archiving? → Use gzip.open() or bz2.open()
    └── Converting between formats?
        └── Parse with SeqIO, write to new handle
Related Skills
- read-sequences - Core parsing functions used with compressed handles
- write-sequences - Write to compressed output files
- batch-processing - Process multiple compressed files
- alignment-files - BAM files use BGZF natively; samtools handles compression