BioSkills bio-sequence-slicing

Slice, extract, and concatenate biological sequences using Biopython. Use when extracting subsequences, joining sequences, or manipulating sequence regions by position.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/sequence-manipulation/sequence-slicing" ~/.claude/skills/gptomics-bioskills-bio-sequence-slicing && rm -rf "$T"
manifest: sequence-manipulation/sequence-slicing/SKILL.md
source content

Version Compatibility

Reference examples tested with: BioPython 1.83+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:

  • Python:
    pip show <package>
    then
    help(module.function)
    to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Sequence Slicing

Extract, slice, and concatenate sequences using Biopython's Seq objects.

"Extract a subsequence" → Use Python slicing on Seq objects with 0-based half-open coordinates.

  • Python:
    seq[start:end]
    (BioPython Seq)

"Join exons into mRNA" → Extract multiple regions and concatenate them.

  • Python:
    sum((seq[s:e] for s, e in coords), Seq(''))
    or
    +
    operator

Required Import

from Bio.Seq import Seq

Core Operations

Indexing (Single Position)

seq = Seq('ATGCGATCG')
seq[0]      # 'A' - first base (0-indexed)
seq[-1]     # 'G' - last base
seq[3]      # 'C' - fourth base

Slicing (Extract Region)

seq = Seq('ATGCGATCGATCG')
seq[0:3]     # Seq('ATG') - first 3 bases
seq[3:6]     # Seq('CGA') - positions 3-5
seq[:5]      # Seq('ATGCG') - first 5
seq[-5:]     # Seq('GATCG') - last 5
seq[::2]     # Seq('AGGTGTG') - every 2nd base
seq[::-1]    # Seq('GCTAGCTAGCGTA') - reversed

Note: Slicing returns a Seq object, not a string.

Concatenation

seq1 = Seq('ATGC')
seq2 = Seq('GGGG')
combined = seq1 + seq2  # Seq('ATGCGGGG')

Can also concatenate with strings:

seq = Seq('ATGC')
extended = seq + 'NNNN'  # Seq('ATGCNNNN')

Code Patterns

Extract CDS by Coordinates

genome = Seq('NNNNATGCGATCGATCGTAANNN')
cds_start, cds_end = 4, 21
cds = genome[cds_start:cds_end]

Extract with 1-Based Coordinates

Biology often uses 1-based coordinates. Convert to 0-based:

def extract_1based(seq, start, end):
    '''Extract using 1-based inclusive coordinates'''
    return seq[start - 1:end]

genome = Seq('ATGCGATCGATCG')
region = extract_1based(genome, 1, 3)  # Seq('ATG')

Split Sequence into Codons

def split_codons(seq):
    return [seq[i:i+3] for i in range(0, len(seq) - len(seq) % 3, 3)]

seq = Seq('ATGCGATCGATCG')
codons = split_codons(seq)  # [Seq('ATG'), Seq('CGA'), ...]

Split into Fixed-Length Chunks

def chunk_sequence(seq, size):
    return [seq[i:i+size] for i in range(0, len(seq), size)]

seq = Seq('ATGCGATCGATCGATCGATCG')
chunks = chunk_sequence(seq, 10)

Join Sequences with Linker

seqs = [Seq('ATGC'), Seq('GGGG'), Seq('TTTT')]
linker = Seq('NNN')
joined = linker.join(seqs)  # Seq('ATGCNNNGGGGNNTTTT')

Or manually:

linker = 'NNN'
joined = Seq(linker.join(str(s) for s in seqs))

Extract Multiple Regions

Goal: Splice non-contiguous regions (e.g., exons) into a single continuous sequence.

Approach: Extract each region by coordinates and concatenate with

+
or
sum()
.

def extract_regions(seq, regions):
    '''Extract and concatenate multiple regions'''
    return sum((seq[start:end] for start, end in regions), Seq(''))

exon_coords = [(0, 50), (100, 150), (200, 250)]
mrna = extract_regions(genomic_seq, exon_coords)

Extract Flanking Regions

def get_flanking(seq, position, flank_size):
    '''Get sequence around a position'''
    start = max(0, position - flank_size)
    end = min(len(seq), position + flank_size + 1)
    return seq[start:end]

seq = Seq('ATGCGATCGATCGATCGATCG')
flanking = get_flanking(seq, 10, 5)  # 5 bp on each side of position 10

Tile Sequence into Overlapping Windows

def sliding_windows(seq, window_size, step=1):
    for i in range(0, len(seq) - window_size + 1, step):
        yield seq[i:i + window_size]

seq = Seq('ATGCGATCGATCG')
for window in sliding_windows(seq, 5, 2):
    print(window)

Extract Feature from SeqRecord

from Bio import SeqIO

for record in SeqIO.parse('sequence.gb', 'genbank'):
    for feature in record.features:
        if feature.type == 'CDS':
            cds_seq = feature.extract(record.seq)
            print(f'{feature.qualifiers.get("gene", ["?"])[0]}: {cds_seq[:30]}...')

Create New SeqRecord from Slice

from Bio.SeqRecord import SeqRecord

original = SeqRecord(Seq('ATGCGATCGATCGATCG'), id='full', description='Full sequence')
subset = SeqRecord(original.seq[5:15], id='subset', description=f'Positions 5-15 of {original.id}')

Coordinate Systems

SystemPosition 1Example
0-based (Python)Index 0
seq[0:3]
gets positions 0, 1, 2
1-based (Biology)Index 1Position 1-3 =
seq[0:3]
0-based half-openStart inclusive, end exclusiveStandard Python slicing

Common Errors

ErrorCauseSolution
IndexError
Index out of rangeCheck sequence length first
Unexpected lengthOff-by-one errorRemember end index is exclusive
Empty resultStart >= endCheck coordinate order
Wrong positions1-based vs 0-based confusionConvert coordinates explicitly

Decision Tree

Need to extract or combine sequences?
├── Single position?
│   └── Use indexing: seq[i]
├── Contiguous region?
│   └── Use slicing: seq[start:end]
├── Multiple non-contiguous regions?
│   └── Extract each, concatenate with +
├── Join sequences?
│   ├── No linker: seq1 + seq2
│   └── With linker: linker.join(seqs)
├── Split into parts?
│   └── List comprehension with slicing
└── From GenBank features?
    └── Use feature.extract(record.seq)

Related Skills

  • seq-objects - Create Seq and SeqRecord objects
  • sequence-io/read-sequences - Parse GenBank files with features to extract
  • transcription-translation - Translate extracted CDS regions
  • alignment-files - Extract sequences from BAM using samtools fasta/fastq