BioSkills bio-sequence-similarity

Find homologous sequences using iterative BLAST (PSI-BLAST), profile HMMs (HMMER), and reciprocal best hit analysis. Use when identifying orthologs, distant homologs, or protein family members where standard BLAST is not sensitive enough.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/database-access/sequence-similarity" ~/.claude/skills/gptomics-bioskills-bio-sequence-similarity && rm -rf "$T"
manifest: database-access/sequence-similarity/SKILL.md
source content

Version Compatibility

Reference examples tested with: BioPython 1.83+, NCBI BLAST+ 2.15+

Before using code patterns, verify installed versions match. If versions differ:

  • Python:
    pip show <package>
    then
    help(module.function)
    to check signatures
  • CLI:
    <tool> --version
    then
    <tool> --help
    to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Sequence Similarity Searches

Advanced methods for finding homologous sequences beyond standard BLAST.

"Find distant homologs" → Use iterative search (PSI-BLAST) or profile HMMs (HMMER) to detect remote sequence similarity that standard BLAST misses.

  • CLI:
    psiblast -query seq.fa -db nr -num_iterations 3
    or
    hmmsearch profile.hmm seqdb
  • Python:
    NcbipsiblastCommandline()
    (BioPython)

PSI-BLAST (Position-Specific Iterated BLAST)

Builds a position-specific scoring matrix (PSSM) through iterations to find distant homologs.

Basic PSI-BLAST

psiblast -query protein.fasta -db nr -out results.txt -num_iterations 3

Save PSSM for Reuse

psiblast -query protein.fasta -db nr \
    -out results.txt \
    -out_pssm pssm.asn \
    -out_ascii_pssm pssm.txt \
    -num_iterations 5

Use Existing PSSM

psiblast -in_pssm pssm.asn -db nr -out results.txt

Output Format

psiblast -query protein.fasta -db nr \
    -out results.txt \
    -outfmt 6 \
    -num_iterations 3 \
    -inclusion_ethresh 0.001

Key Parameters

psiblast -query protein.fasta -db nr \
    -num_iterations 5 \
    -inclusion_ethresh 0.001 \
    -evalue 0.01 \
    -num_threads 8 \
    -out results.txt

PSI-BLAST Parameters

ParameterDefaultDescription
-num_iterations1Number of iterations
-inclusion_ethresh0.002E-value for PSSM inclusion
-evalue10E-value threshold for reporting
-num_threads1CPU threads

HMMER for Profile Searches

HMMER uses profile hidden Markov models for sensitive sequence searches.

Search with Single Sequence

jackhmmer -o results.txt -A aligned.sto --cpu 8 query.fasta database.fasta

Build Profile from Alignment

hmmbuild profile.hmm alignment.sto

Search Database with Profile

hmmsearch -o results.txt --tblout hits.tbl profile.hmm database.fasta
hmmsearch -o results.txt --domtblout domains.tbl profile.hmm database.fasta

Download Pfam Profiles

wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm

Scan Sequence Against Pfam

hmmscan --tblout pfam_hits.tbl --domtblout domains.tbl Pfam-A.hmm query.fasta

Parse HMMER Output

grep -v "^#" hits.tbl | head
awk '$5 < 1e-10' hits.tbl

HMMER Output Columns (--tblout)

ColumnDescription
1Target name
2Accession
3Query name
4Query accession
5E-value (full sequence)
6Score (full sequence)
7Bias
8E-value (best domain)
9Score (best domain)

Reciprocal Best Hit (RBH) Analysis

Find orthologs using bidirectional best hits.

Create BLAST Databases

makeblastdb -in species_A.fasta -dbtype prot -out species_A_db
makeblastdb -in species_B.fasta -dbtype prot -out species_B_db

Bidirectional BLAST

blastp -query species_A.fasta -db species_B_db -outfmt 6 -evalue 1e-5 -max_target_seqs 1 > A_vs_B.txt
blastp -query species_B.fasta -db species_A_db -outfmt 6 -evalue 1e-5 -max_target_seqs 1 > B_vs_A.txt

Find Reciprocal Best Hits

awk 'FNR==NR {a[$1]=$2; next} $2 in a && a[$2]==$1 {print $1"\t"$2}' \
    A_vs_B.txt B_vs_A.txt > reciprocal_best_hits.txt

Python RBH Script

Goal: Identify orthologous gene pairs between two species using the reciprocal best hit criterion.

Approach: Parse forward and reverse BLAST results to extract the top hit per query, then retain only pairs where each sequence is the other's best match.

def find_rbh(forward_blast, reverse_blast):
    '''Find reciprocal best hits from BLAST results'''
    forward = {}
    with open(forward_blast) as f:
        for line in f:
            parts = line.strip().split('\t')
            query, subject = parts[0], parts[1]
            if query not in forward:
                forward[query] = subject

    reverse = {}
    with open(reverse_blast) as f:
        for line in f:
            parts = line.strip().split('\t')
            query, subject = parts[0], parts[1]
            if query not in reverse:
                reverse[query] = subject

    rbh = []
    for a, b in forward.items():
        if b in reverse and reverse[b] == a:
            rbh.append((a, b))

    return rbh

rbh_pairs = find_rbh('A_vs_B.txt', 'B_vs_A.txt')
for a, b in rbh_pairs:
    print(f'{a}\t{b}')

Delta-BLAST

Uses conserved domain database for more sensitive initial search.

deltablast -query protein.fasta -db nr -rpsdb cdd_delta -out results.txt

PHI-BLAST (Pattern-Hit Initiated)

Search with a pattern plus sequence.

phi_pattern="G-x(2)-[ST]-x-[RK]"
phiblast -query protein.fasta -db nr -pattern "$phi_pattern" -out results.txt

Iterative Search with Biopython

from Bio.Blast import NCBIWWW, NCBIXML

with open('query.fasta') as f:
    query = f.read()

result_handle = NCBIWWW.qblast('psiblast', 'nr', query, expect=0.001, word_size=3)

with open('psiblast_result.xml', 'w') as out:
    out.write(result_handle.read())
result_handle.close()

with open('psiblast_result.xml') as f:
    records = NCBIXML.parse(f)
    for record in records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < 1e-10:
                    print(f'{alignment.hit_def[:50]}: E={hsp.expect}')

HMMER with Biopython

from Bio import SearchIO

results = SearchIO.parse('hmmsearch_output.txt', 'hmmer3-text')
for query_result in results:
    print(f'Query: {query_result.id}')
    for hit in query_result:
        print(f'  Hit: {hit.id}, E-value: {hit.evalue}')
        for hsp in hit:
            print(f'    Domain: {hsp.bitscore} bits')

Jackhmmer (Iterative HMMER)

Similar to PSI-BLAST but uses HMM profiles.

jackhmmer -N 5 -o results.txt --tblout hits.tbl query.fasta database.fasta
jackhmmer -N 5 -A iterations.sto --chkhmm checkpoint query.fasta database.fasta

OrthoFinder for Multi-Species Orthologs

orthofinder -f proteomes/ -t 8
orthofinder -f proteomes/ -t 8 -M msa

Prepare Input

mkdir proteomes
cp species_*.fasta proteomes/

Output Files

FileContent
Orthogroups.tsvAll orthogroups
Orthogroups_SingleCopyOrthologues.txt1:1 orthologs
Species_Tree/Inferred species tree
Gene_Trees/Individual gene trees

E-value vs Bit Score

E-valueInterpretation
< 1e-50Highly significant, likely homolog
1e-50 to 1e-10Significant, probable homolog
1e-10 to 1e-3Marginal, possible remote homolog
> 0.01Not significant

Complete Ortholog Finding Pipeline

Goal: Run an end-to-end reciprocal best hit ortholog analysis from two proteome FASTA files.

Approach: Build BLAST databases for both species, run bidirectional best-hit searches, and extract reciprocal pairs using awk.

#!/bin/bash
SPECIES_A=$1
SPECIES_B=$2
EVALUE=1e-10
THREADS=8

echo "Building databases..."
makeblastdb -in $SPECIES_A -dbtype prot -out db_A
makeblastdb -in $SPECIES_B -dbtype prot -out db_B

echo "Running forward BLAST..."
blastp -query $SPECIES_A -db db_B -outfmt 6 -evalue $EVALUE \
    -max_target_seqs 1 -num_threads $THREADS > forward.txt

echo "Running reverse BLAST..."
blastp -query $SPECIES_B -db db_A -outfmt 6 -evalue $EVALUE \
    -max_target_seqs 1 -num_threads $THREADS > reverse.txt

echo "Finding reciprocal best hits..."
awk 'FNR==NR {best[$1]=$2; next}
     $2 in best && best[$2]==$1 {print $1"\t"$2}' \
     forward.txt reverse.txt > orthologs.txt

echo "Found $(wc -l < orthologs.txt) ortholog pairs"

rm -f db_A.* db_B.*

Related Skills

  • blast-searches - Basic remote BLAST
  • local-blast - Local BLAST databases
  • entrez-fetch - Download sequences
  • alignment - Align identified homologs