BioSkills bio-codon-usage
Analyze codon usage, calculate CAI (Codon Adaptation Index), and examine synonymous codon bias using Biopython. Use when analyzing coding sequences for expression optimization or evolutionary analysis.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/sequence-manipulation/codon-usage" ~/.claude/skills/gptomics-bioskills-bio-codon-usage && rm -rf "$T"
sequence-manipulation/codon-usage/SKILL.mdVersion Compatibility
Reference examples tested with: BioPython 1.83+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function)
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Codon Usage
Analyze codon usage patterns and calculate codon adaptation metrics using Biopython.
"Analyze codon usage" → Count codons in a coding sequence, compute frequencies and bias metrics.
- Python:
on 3-mers +Counter
(BioPython)CodonAdaptationIndex
"Optimize codons for expression" → Replace codons with host-preferred synonymous codons using a preference table.
- Python: custom mapping dict +
(BioPython)Seq()
Required Imports
from Bio.Seq import Seq from Bio.SeqUtils import GC123 from Bio.SeqUtils.CodonUsage import CodonAdaptationIndex from Bio.Data import CodonTable from collections import Counter
Basic Codon Counting
Goal: Tabulate codon frequencies in a coding sequence.
Approach: Split the sequence into triplets from the reading frame start, then count with
Counter.
Count Codons in Sequence
from collections import Counter def count_codons(seq): seq_str = str(seq).upper() codons = [seq_str[i:i+3] for i in range(0, len(seq_str) - 2, 3)] return Counter(codons) seq = Seq('ATGCGATCGATCGATCGTAA') codon_counts = count_codons(seq)
Codon Frequencies (Relative)
def codon_frequencies(seq): counts = count_codons(seq) total = sum(counts.values()) return {codon: count / total for codon, count in counts.items()}
Codon Adaptation Index (CAI)
Goal: Measure how well a gene's codon usage matches highly expressed genes in a target organism.
Approach: Train a CAI index from a reference set of highly expressed genes, then score query sequences (0-1 scale, higher = better adapted).
Using CodonUsage Module
from Bio.SeqUtils.CodonUsage import CodonAdaptationIndex # Create CAI calculator with reference set cai = CodonAdaptationIndex() # Generate index from highly expressed genes cai.generate_index('highly_expressed_genes.fasta') # Calculate CAI for a sequence seq = Seq('ATGCGATCGATCGATCGTAA') cai_value = cai.cai_for_gene(str(seq)) print(f'CAI: {cai_value:.3f}') # Range 0-1, higher = better adapted
CAI with Custom Codon Index
from Bio.SeqUtils.CodonUsage import CodonAdaptationIndex cai = CodonAdaptationIndex() # Set custom index (relative adaptiveness for each codon) custom_index = { 'TTT': 0.5, 'TTC': 1.0, # Phe 'TTA': 0.1, 'TTG': 0.5, 'CTT': 0.3, 'CTC': 1.0, 'CTA': 0.1, 'CTG': 1.0, # Leu # ... define all 64 codons } cai.set_cai_index(custom_index)
Synonymous Codon Usage
Goal: Quantify bias in synonymous codon usage to detect selection or mutational pressure.
Approach: Calculate RSCU — the ratio of observed to expected frequency assuming equal usage of synonymous codons. RSCU = 1.0 means no bias; >1 overused, <1 underused.
RSCU (Relative Synonymous Codon Usage)
from Bio.Data import CodonTable def calculate_rscu(seq, table_id=1): codon_table = CodonTable.unambiguous_dna_by_id[table_id] counts = count_codons(seq) # Group codons by amino acid aa_to_codons = {} for codon in counts: if codon in codon_table.stop_codons: continue try: aa = codon_table.forward_table[codon] aa_to_codons.setdefault(aa, []).append(codon) except KeyError: continue # Calculate RSCU for each codon rscu = {} for aa, codons in aa_to_codons.items(): total = sum(counts.get(c, 0) for c in codons) n_synonymous = len(codons) expected = total / n_synonymous if n_synonymous > 0 else 0 for codon in codons: observed = counts.get(codon, 0) rscu[codon] = observed / expected if expected > 0 else 0 return rscu
Identify Rare Codons
def find_rare_codons(seq, threshold=0.1): freq = codon_frequencies(seq) return {codon: f for codon, f in freq.items() if f < threshold}
Codon Bias by Position (GC123)
from Bio.SeqUtils import GC123 seq = Seq('ATGCGATCGATCGATCGATCGATCGATCGTAA') gc_total, gc_pos1, gc_pos2, gc_pos3 = GC123(seq) print(f'Total GC: {gc_total:.1f}%') print(f'1st position GC: {gc_pos1:.1f}%') print(f'2nd position GC: {gc_pos2:.1f}%') print(f'3rd position GC: {gc_pos3:.1f}% (wobble position)')
Codon Tables
Access Codon Tables
from Bio.Data import CodonTable # Get standard table std_table = CodonTable.unambiguous_dna_by_id[1] # List all available tables for id, table in CodonTable.unambiguous_dna_by_id.items(): print(f'{id}: {table.names[0]}')
Common Codon Tables
| ID | Name | Organism |
|---|---|---|
| 1 | Standard | Most organisms |
| 2 | Vertebrate Mitochondrial | Human, mouse mito |
| 4 | Mold Mitochondrial | Fungi, protozoa mito |
| 5 | Invertebrate Mitochondrial | Insects, worms mito |
| 11 | Bacterial/Plastid | E. coli, chloroplasts |
Codon Table Properties
table = CodonTable.unambiguous_dna_by_id[1] print(f'Start codons: {table.start_codons}') print(f'Stop codons: {table.stop_codons}') # Forward table: codon -> amino acid print(table.forward_table['ATG']) # 'M' # Back table: amino acid -> list of codons back_table = {} for codon, aa in table.forward_table.items(): back_table.setdefault(aa, []).append(codon) print(f'Leucine codons: {back_table["L"]}')
Code Patterns
Full Codon Usage Report
def codon_usage_report(seq, table_id=1): from Bio.Data import CodonTable table = CodonTable.unambiguous_dna_by_id[table_id] counts = count_codons(seq) total = sum(counts.values()) # Group by amino acid aa_groups = {} for codon, aa in table.forward_table.items(): aa_groups.setdefault(aa, []).append(codon) report = {} for aa, codons in sorted(aa_groups.items()): aa_total = sum(counts.get(c, 0) for c in codons) report[aa] = { 'total': aa_total, 'codons': {c: {'count': counts.get(c, 0), 'freq': counts.get(c, 0) / aa_total if aa_total > 0 else 0} for c in codons} } return report
Compare Codon Usage Between Sequences
def compare_codon_usage(seq1, seq2): freq1 = codon_frequencies(seq1) freq2 = codon_frequencies(seq2) all_codons = set(freq1.keys()) | set(freq2.keys()) comparison = {} for codon in sorted(all_codons): f1, f2 = freq1.get(codon, 0), freq2.get(codon, 0) comparison[codon] = {'seq1': f1, 'seq2': f2, 'diff': f1 - f2} return comparison
Optimize Codons for Expression
Goal: Replace codons with host-preferred synonymous alternatives to maximize protein expression.
Approach: Map each amino acid to its most-preferred codon in the target host, then reconstruct the DNA sequence.
def optimize_codons(protein_seq, preferred_codons): '''Replace codons with preferred synonymous codons''' optimized = [] for aa in str(protein_seq): if aa in preferred_codons: optimized.append(preferred_codons[aa]) else: optimized.append('NNN') # Unknown return Seq(''.join(optimized)) # E. coli preferred codons ecoli_preferred = { 'A': 'GCG', 'R': 'CGT', 'N': 'AAC', 'D': 'GAT', 'C': 'TGC', 'Q': 'CAG', 'E': 'GAA', 'G': 'GGT', 'H': 'CAC', 'I': 'ATT', 'L': 'CTG', 'K': 'AAA', 'M': 'ATG', 'F': 'TTC', 'P': 'CCG', 'S': 'TCT', 'T': 'ACC', 'W': 'TGG', 'Y': 'TAC', 'V': 'GTT', }
Codon Usage from FASTA File
from Bio import SeqIO def analyze_fasta_codon_usage(filename): all_counts = Counter() for record in SeqIO.parse(filename, 'fasta'): all_counts.update(count_codons(record.seq)) total = sum(all_counts.values()) return {codon: count / total for codon, count in all_counts.items()}
Effective Number of Codons (Nc)
A measure of codon bias (lower = more biased, range 20-61):
import math def effective_nc(seq, table_id=1): from Bio.Data import CodonTable table = CodonTable.unambiguous_dna_by_id[table_id] counts = count_codons(seq) # Group by degeneracy class aa_groups = {} for codon, aa in table.forward_table.items(): aa_groups.setdefault(aa, []).append(codon) # Calculate F for each amino acid nc_sum = 0 for aa, codons in aa_groups.items(): n = sum(counts.get(c, 0) for c in codons) if n <= 1: continue pi_sq_sum = sum((counts.get(c, 0) / n) ** 2 for c in codons) F = (n * pi_sq_sum - 1) / (n - 1) nc_sum += 1 / F if F > 0 else len(codons) return nc_sum if nc_sum > 0 else 61
Property Reference
| Metric | Range | Interpretation |
|---|---|---|
| CAI | 0-1 | Higher = better adapted to host |
| RSCU | 0-N | 1.0 = no bias, >1 = overused, <1 = underused |
| Nc | 20-61 | Lower = more biased |
| GC3 | 0-100% | GC at wobble position |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| Non-standard codon | Handle N-containing codons |
| Wrong counts | Sequence not in frame | Ensure length is multiple of 3 |
| No index set | Called CAI without training | Call first |
Decision Tree
Need to analyze codon usage? ├── Count codon frequencies? │ └── Use Counter on 3-mers ├── Calculate adaptation to host? │ └── Use CodonAdaptationIndex (CAI) ├── Identify synonymous bias? │ └── Calculate RSCU ├── Check wobble position bias? │ └── Use GC123() ├── Measure overall bias? │ └── Calculate Nc (effective number of codons) └── Optimize for expression? └── Replace with preferred synonymous codons
Related Skills
- transcription-translation - Translate sequences and understand codon tables
- sequence-properties - GC123 for wobble position GC content
- sequence-io/read-sequences - Parse CDS sequences from GenBank files
- database-access/entrez-fetch - Fetch reference gene sets from NCBI for CAI training