BioSkills bio-local-blast

Run local BLAST searches using BLAST+ command-line tools. Use when running fast unlimited searches, building custom databases, performing large-scale analysis, or when NCBI servers are slow or unavailable.

install

source · Clone the upstream repo

git clone https://github.com/GPTomics/bioSkills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/database-access/local-blast" ~/.claude/skills/gptomics-bioskills-bio-local-blast && rm -rf "$T"

manifest: database-access/local-blast/SKILL.md

source content

Version Compatibility

Reference examples tested with: NCBI BLAST+ 2.15+

Before using code patterns, verify installed versions match. If versions differ:

Python: `pip show <package>` then `help(module.function)` to check signatures
CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
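The version check above can be scripted. The following is a minimal sketch, assuming the text printed by `blastn -version` contains a dotted version number such as `2.15.0`; the helper names `parse_blast_version` and `meets_minimum` are hypothetical and not part of BLAST+.

```python
import re

def parse_blast_version(version_output):
    """Extract (major, minor, patch) from text printed by `<program> -version`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    return tuple(int(part) for part in match.groups())

def meets_minimum(version_output, minimum=(2, 15, 0)):
    """Compare the parsed version against the version the examples were tested with."""
    return parse_blast_version(version_output) >= minimum

# Illustrative input only; real output comes from running `blastn -version`.
sample = "blastn: 2.15.0+"
print(parse_blast_version(sample))
print(meets_minimum(sample))
```

In practice the input text could come from `subprocess.run(['blastn', '-version'], capture_output=True, text=True).stdout`.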

Local BLAST

"Run a BLAST search against my custom database" → Build a local BLAST database and search it with query sequences, returning tabular results with identity and e-value.

CLI: `makeblastdb`, `blastn`/`blastp` (NCBI BLAST+)
Python: `subprocess` wrapper for BLAST+

Run BLAST searches locally using NCBI BLAST+ command-line tools.

Installation (NCBI BLAST+)

# macOS
brew install blast

# Ubuntu/Debian
sudo apt install ncbi-blast+

# conda
conda install -c bioconda blast

# Verify installation
blastn -version
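Before running a pipeline, it can help to confirm that the BLAST+ executables are on `PATH`. This is a small sketch using the standard library's `shutil.which`; the function name `missing_tools` and the default tool list are assumptions chosen for illustration.

```python
import shutil

def missing_tools(tools=("makeblastdb", "blastn", "blastp", "blastdbcmd")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
if missing:
    print("Not on PATH:", ", ".join(missing))
else:
    print("All listed BLAST+ tools were found")
```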

BLAST+ Programs

Command	Query	Database	Description
`blastn`	DNA	DNA	Nucleotide-nucleotide
`blastp`	Protein	Protein	Protein-protein
`blastx`	DNA	Protein	Translated query vs protein
`tblastn`	Protein	DNA	Protein vs translated DB
`tblastx`	DNA	DNA	Translated vs translated
`makeblastdb`	-	-	Create BLAST database
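The table above maps directly onto a lookup keyed by query and database type. The sketch below is one way to encode it; `choose_program` and the `translated` flag are hypothetical names, and the flag is only needed to distinguish `tblastx` from `blastn`, which share the same input types.

```python
# (query type, database type) -> program, taken from the table above.
PROGRAMS = {
    ("nucl", "nucl"): "blastn",
    ("prot", "prot"): "blastp",
    ("nucl", "prot"): "blastx",
    ("prot", "nucl"): "tblastn",
}

def choose_program(query_type, db_type, translated=False):
    """Pick a BLAST+ program name for the given sequence types."""
    if translated:
        # tblastx translates both sides, so both must be nucleotide.
        if (query_type, db_type) != ("nucl", "nucl"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query_type} vs {db_type}")

print(choose_program("nucl", "prot"))
```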

Creating BLAST Databases

makeblastdb - Create Database (NCBI BLAST+)

# Create nucleotide database
makeblastdb -in sequences.fasta -dbtype nucl -out my_db

# Create protein database
makeblastdb -in proteins.fasta -dbtype prot -out my_proteins

# With title and parse sequence IDs
makeblastdb -in sequences.fasta -dbtype nucl -out my_db \
    -title "My Reference Database" -parse_seqids

Key Options:

Option	Description	Values
`-in`	Input FASTA file	Path
`-dbtype`	Database type	`nucl` , `prot`
`-out`	Output database name	Path prefix
`-title`	Database title	String
`-parse_seqids`	Enable ID-based retrieval	Flag
`-taxid`	Assign taxonomy ID	Integer
`-taxid_map`	Taxonomy ID mapping file	Path

Database Files Created

my_db.nhr  # Header file (nucl) / .phr (prot)
my_db.nin  # Index file (nucl) / .pin (prot)
my_db.nsq  # Sequence file (nucl) / .psq (prot)
my_db.ndb  # LMDB accession lookup (database version 5)
my_db.not  # OID-to-taxid lookup (database version 5)
my_db.ntf  # Taxid-to-offset lookup (database version 5)
my_db.nto  # Taxid-to-OID lookup (database version 5)
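A script can test for these files before searching, which avoids a database error at run time. The sketch below assumes that a usable database has either an index file (`.nin` or `.pin`) or an alias file (`.nal` or `.pal`, used for multi-volume databases) next to the given prefix; `blast_db_exists` is a hypothetical helper, not a BLAST+ command.

```python
from pathlib import Path

def blast_db_exists(prefix, db_type="nucl"):
    """Return True if an index or alias file exists for the database prefix."""
    letter = {"nucl": "n", "prot": "p"}[db_type]
    # "in" is the index file, "al" is the alias file for multi-volume databases.
    return any(Path(f"{prefix}.{letter}{suffix}").exists() for suffix in ("in", "al"))

print(blast_db_exists("my_db", "nucl"))
```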

Running BLAST Searches

Basic Usage (NCBI BLAST+)

# BLASTN
blastn -query query.fasta -db my_db -out results.txt

# BLASTP
blastp -query proteins.fasta -db my_proteins -out results.txt

# BLASTX (translate query, search protein DB)
blastx -query genes.fasta -db nr -out results.txt

Common Options

Option	Description	Example
`-query`	Query FASTA file	`-query seq.fa`
`-db`	Database name	`-db nt`
`-out`	Output file	`-out results.txt`
`-outfmt`	Output format	`-outfmt 6`
`-evalue`	E-value threshold	`-evalue 1e-5`
`-num_threads`	CPU threads	`-num_threads 8`
`-max_target_seqs`	Max hits	`-max_target_seqs 100`
`-max_hsps`	Max HSPs per hit	`-max_hsps 1`
`-word_size`	Word size	`-word_size 11`
`-dust`	Filter low complexity (nucl)	`-dust yes`
`-seg`	Filter low complexity (prot)	`-seg yes`

Output Formats (-outfmt)

Value	Format
`0`	Pairwise (default)
`1`	Query-anchored with identities
`5`	BLAST XML
`6`	Tabular
`7`	Tabular with comments
`10`	CSV

Tabular Output Fields (-outfmt 6)

Default columns:

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

Custom columns:

blastn -query query.fa -db my_db -outfmt "6 qseqid sseqid pident length evalue stitle"

Available Fields:

Field	Description
`qseqid`	Query ID
`sseqid`	Subject ID
`pident`	Percent identity
`length`	Alignment length
`mismatch`	Mismatches
`gapopen`	Gap openings
`qstart`	Query start
`qend`	Query end
`sstart`	Subject start
`send`	Subject end
`evalue`	E-value
`bitscore`	Bit score
`stitle`	Subject title
`qcovs`	Query coverage per subject (%)
`qcovhsp`	Query coverage per HSP (%)
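When custom columns are requested, the parser has to use the same column list as the `-outfmt` string. The following sketch derives column names from the format string and converts the numeric fields from the table above; `columns_from_outfmt` and `parse_row` are hypothetical helpers, and the sample row is made-up data.

```python
DEFAULT_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                   "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore", "qcovs", "qcovhsp"}

def columns_from_outfmt(outfmt):
    """Return column names for a tabular -outfmt value such as '6 qseqid sseqid'."""
    parts = str(outfmt).split()
    if not parts or parts[0] not in {"6", "7", "10"}:
        raise ValueError("expected a tabular format (6, 7, or 10)")
    return parts[1:] or list(DEFAULT_COLUMNS)

def parse_row(line, columns, sep="\t"):
    """Split one result line and convert known numeric fields."""
    values = line.rstrip("\n").split(sep)
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
    row = {}
    for name, value in zip(columns, values):
        if name in INT_FIELDS:
            row[name] = int(value)
        elif name in FLOAT_FIELDS:
            row[name] = float(value)
        else:
            row[name] = value
    return row

columns = columns_from_outfmt("6 qseqid sseqid pident length evalue stitle")
# Made-up example row, for illustration only.
print(parse_row("q1\tsubjA\t98.5\t200\t1e-50\texample subject title", columns))
```

Use `sep=","` for `-outfmt 10`; note that subject titles may themselves contain commas, so the tab-separated formats are safer when `stitle` is requested.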

Code Patterns

Create Database and Search

Goal: Build a custom BLAST database from reference sequences and search it with a query.

Approach: Index the reference with `makeblastdb`, then run the appropriate BLAST program with tabular output.

Reference (NCBI BLAST+ 2.15+):

#!/bin/bash
makeblastdb -in reference.fasta -dbtype nucl -out ref_db -parse_seqids

blastn -query query.fasta -db ref_db -out results.txt \
    -outfmt 6 -evalue 1e-10 -num_threads 4

head results.txt

BLAST with Tabular Output (NCBI BLAST+)

#!/bin/bash
blastn -query query.fasta -db my_db \
    -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle" \
    -evalue 1e-5 \
    -max_target_seqs 10 \
    -num_threads 8 \
    -out results.tsv

Filter and Sort Results (NCBI BLAST+)

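# Keep hits with at least 90% identity
awk -F'\t' '$3 >= 90' results.tsv

# Sort by E-value (ascending), using only column 11 as the key
sort -t$'\t' -k11,11g results.tsv

# Get best hit per query (lowest E-value; keep the first row seen for each query ID)
sort -t$'\t' -k1,1 -k11,11g results.tsv | awk -F'\t' '!seen[$1]++'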
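The same best-hit selection can be done in Python when the results feed further processing. This is a sketch assuming the default 12-column tabular layout, with E-value in column 11 and bit score in column 12; ties on E-value are broken by higher bit score, which is a design choice of this example rather than something BLAST+ prescribes. `best_hits` is a hypothetical helper.

```python
import csv
import io

def best_hits(handle):
    """Return {query_id: row} keeping the lowest E-value (then highest bit score)."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and -outfmt 7 comment lines
        key = (float(row[10]), -float(row[11]))
        if row[0] not in best or key < best[row[0]][0]:
            best[row[0]] = (key, row)
    return {query: row for query, (key, row) in best.items()}

# Made-up rows in the default -outfmt 6 layout, for illustration only.
sample = io.StringIO(
    "q1\ts1\t91.0\t100\t9\t0\t1\t100\t1\t100\t1e-5\t50\n"
    "q1\ts2\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-20\t180\n"
    "q2\ts3\t95.0\t80\t4\t0\t1\t80\t1\t80\t1e-10\t120\n"
)
for query, row in best_hits(sample).items():
    print(query, row[1], row[10])
```

For a file on disk, pass `open('results.tsv', newline='')` instead of the `StringIO` object.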

Batch BLAST Multiple Files (NCBI BLAST+)

#!/bin/bash
# The output directory must exist before blastn writes to it
mkdir -p results

for query_file in queries/*.fasta; do
    base=$(basename "$query_file" .fasta)
    echo "Processing $base..."

    blastn -query "$query_file" -db my_db \
        -outfmt 6 -evalue 1e-5 -num_threads 4 \
        -out "results/${base}_blast.tsv"
done
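A batch run can also be planned from Python, which makes it easy to inspect the commands before executing them. The sketch below only builds argument lists and does not call BLAST+; `plan_batch` is a hypothetical helper, and the flags mirror the shell loop above.

```python
from pathlib import Path

def plan_batch(query_dir, db, out_dir, program="blastn", evalue="1e-5", threads=4):
    """Build one BLAST+ argument list per *.fasta file in query_dir."""
    commands = []
    for query in sorted(Path(query_dir).glob("*.fasta")):
        output = Path(out_dir) / f"{query.stem}_blast.tsv"
        commands.append([
            program, "-query", str(query), "-db", db,
            "-outfmt", "6", "-evalue", evalue,
            "-num_threads", str(threads), "-out", str(output),
        ])
    return commands

for cmd in plan_batch("queries", "my_db", "results"):
    print(" ".join(cmd))
```

To execute the plan, create the output directory first and pass each list to `subprocess.run(cmd, check=True)`.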

Python Wrapper

Goal: Run a complete local BLAST workflow (build database, search, parse results) from Python.

Approach: Wrap `makeblastdb` and BLAST programs via `subprocess`, then parse the default tabular output into structured dictionaries.

Reference (NCBI BLAST+ 2.15+):

import subprocess

def make_blast_db(fasta_file, db_name, db_type='nucl'):
    # -parse_seqids enables ID-based retrieval with blastdbcmd
    cmd = ['makeblastdb', '-in', fasta_file, '-dbtype', db_type, '-out', db_name, '-parse_seqids']
    subprocess.run(cmd, check=True)

def run_blast(query, db, output, program='blastn', evalue=1e-5, threads=4, outfmt=6):
    # check=True raises CalledProcessError if the program exits with an error
    cmd = [program, '-query', query, '-db', db, '-out', output,
           '-outfmt', str(outfmt), '-evalue', str(evalue), '-num_threads', str(threads)]
    subprocess.run(cmd, check=True)

def parse_blast_tabular(filename):
    # Column names for the default -outfmt 6 layout
    columns = ['qseqid', 'sseqid', 'pident', 'length', 'mismatch', 'gapopen',
               'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']
    hits = []
    with open(filename) as f:
        for line in f:
            line = line.rstrip('\n')
            if not line or line.startswith('#'):
                continue  # skip blank lines and -outfmt 7 comment lines
            hit = dict(zip(columns, line.split('\t')))
            hit['pident'] = float(hit['pident'])
            hit['evalue'] = float(hit['evalue'])
            hit['bitscore'] = float(hit['bitscore'])
            hit['length'] = int(hit['length'])
            hits.append(hit)
    return hits

make_blast_db('reference.fasta', 'ref_db')
run_blast('query.fasta', 'ref_db', 'results.tsv')
hits = parse_blast_tabular('results.tsv')
for hit in hits[:5]:
    print(f"{hit['qseqid']} -> {hit['sseqid']}: {hit['pident']}% identity, E={hit['evalue']}")

Reciprocal Best BLAST

Goal: Identify putative orthologs between two species using reciprocal best BLAST hits.

Approach: BLAST species A against species B and B against A, keeping the best hit for each query, then report the pairs in which each sequence is the other's top hit.

Reference (NCBI BLAST+ 2.15+):

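#!/bin/bash
# Assumes species_A_db and species_B_db are protein databases built with makeblastdb.
# -max_hsps 1 keeps one row per query-subject pair, so each pair is reported once.
blastp -query species_A.fasta -db species_B_db -outfmt 6 -evalue 1e-5 \
    -max_target_seqs 1 -max_hsps 1 -out A_vs_B.tsv

blastp -query species_B.fasta -db species_A_db -outfmt 6 -evalue 1e-5 \
    -max_target_seqs 1 -max_hsps 1 -out B_vs_A.tsv

# Print B_vs_A rows whose query and subject are each other's hit
awk 'NR==FNR {a[$1]=$2; next} $2 in a && a[$2]==$1' A_vs_B.tsv B_vs_A.tsv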
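The reciprocal comparison can also be written in Python, which makes the "best hit" criterion explicit. This sketch assumes default 12-column tabular input and picks the best hit per query by highest bit score (column 12), so it also works on result files that list several hits per query; `best_by_bitscore` and `reciprocal_best_hits` are hypothetical helpers and the rows are made-up data.

```python
def best_by_bitscore(lines):
    """Map each query ID to the subject ID of its highest-scoring row."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, bits = fields[0], fields[1], float(fields[11])
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, bits) in best.items()}

def reciprocal_best_hits(a_vs_b_lines, b_vs_a_lines):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    a_to_b = best_by_bitscore(a_vs_b_lines)
    b_to_a = best_by_bitscore(b_vs_a_lines)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)

def row(query, subject, bits):
    # Build a made-up 12-column row; only the IDs and bit score matter here.
    return f"{query}\t{subject}\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t{bits}\n"

a_vs_b = [row("a1", "b1", 180), row("a1", "b2", 90), row("a2", "b2", 150), row("a3", "b1", 60)]
b_vs_a = [row("b1", "a1", 175), row("b2", "a1", 95), row("b2", "a2", 140)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

With real data, pass open file handles for `A_vs_B.tsv` and `B_vs_A.tsv` in place of the lists.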

Extract Hit Sequences (NCBI BLAST+)

# Get subject sequence by ID (requires -parse_seqids)
blastdbcmd -db my_db -entry "sequence_id" -out hit.fasta

# Get multiple sequences
blastdbcmd -db my_db -entry_batch ids.txt -out hits.fasta

# Get all sequences from database
blastdbcmd -db my_db -entry all -out all_seqs.fasta
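The `ids.txt` file for `-entry_batch` holds one sequence ID per line, so it can be generated from a tabular result file. The sketch below collects unique subject IDs from column 2, preserving first-seen order; `write_entry_batch` is a hypothetical helper, and it assumes the IDs in the results match the IDs stored in the database.

```python
def write_entry_batch(hits_tsv, ids_path, subject_column=1):
    """Write unique subject IDs from a tabular BLAST file, one per line."""
    ids = []
    with open(hits_tsv) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and -outfmt 7 comment lines
            ids.append(line.rstrip("\n").split("\t")[subject_column])
    unique_ids = list(dict.fromkeys(ids))  # de-duplicate, keep order
    with open(ids_path, "w") as out:
        out.writelines(f"{seq_id}\n" for seq_id in unique_ids)
    return unique_ids
```

After calling `write_entry_batch('results.tsv', 'ids.txt')`, the resulting file can be passed to `blastdbcmd -db my_db -entry_batch ids.txt -out hits.fasta`.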

Prebuilt Databases (NCBI BLAST+)

Download from NCBI:

# Download and extract (uses update_blastdb.pl)
update_blastdb.pl --decompress nt

# Or download manually from:
# https://ftp.ncbi.nlm.nih.gov/blast/db/

Common databases:

`nt` - All nucleotide sequences
`nr` - Non-redundant protein
`refseq_rna` - RefSeq RNA
`swissprot` - UniProt SwissProt

Common Errors

Error	Cause	Solution
`BLAST Database error`	Database not found	Check path, rebuild database
`No hits found`	No matches or wrong DB type	Verify database type matches query
`Sequence too short`	Query below word size	Lower word_size or use longer query
`Out of memory`	Large database	Reduce threads, use -num_threads 1

Local vs Remote BLAST

Aspect	Local	Remote
Speed	Fast	Can be slow
Databases	Must download/create	All NCBI DBs available
Throughput	Unlimited	Rate limited
Setup	Requires installation	Just Biopython
Updates	Manual	Automatic

Decision Tree

Running BLAST locally?
├── Have reference sequences?
│   └── makeblastdb to create database
├── Download NCBI database?
│   └── update_blastdb.pl or manual download
├── Need tabular output?
│   └── -outfmt 6 (or 7 with headers)
├── Filter low-complexity?
│   └── -dust yes (nucl) or -seg yes (prot)
├── Multiple queries?
│   └── Put all in one FASTA, use -num_threads
├── Need XML output?
│   └── -outfmt 5
└── Extract hit sequences?
    └── blastdbcmd -entry

Related Skills

blast-searches - Remote BLAST via NCBI (no installation needed)
sequence-io - Read/write FASTA files for queries
batch-downloads - Download sequences to build local databases