SciAgent-Skills ucsc-genome-browser

Query the UCSC Genome Browser REST API for DNA sequences, track annotations, gene models, and conservation scores across 100+ genome assemblies. Retrieve sequence for any genomic region, list or fetch BED/bigWig track data, get chromosome sizes, query RefSeq/GENCODE gene structures, and access PhyloP/PhastCons conservation scores. Use for programmatic access to UCSC annotations; use Ensembl REST API instead for Ensembl-native gene IDs and VEP variant annotation.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/ucsc-genome-browser" ~/.claude/skills/jaechang-hits-sciagent-skills-ucsc-genome-browser && rm -rf "$T"
manifest: skills/genomics-bioinformatics/ucsc-genome-browser/SKILL.md
source content

UCSC Genome Browser

Overview

The UCSC Genome Browser REST API at

https://api.genome.ucsc.edu/
provides programmatic access to genome sequences, annotation tracks, and hub data for 100+ assemblies including hg38, mm39, and dm6. The API is free, requires no authentication, and returns JSON. Use it with the
requests
library to fetch DNA sequences for genomic regions, retrieve track data (genes, repeats, conservation), list available tracks, and query chromosome sizes for genome-scale coordinate arithmetic.

When to Use

  • Fetching the reference DNA sequence for any genomic region (e.g., promoter, exon, CRISPR target) across human, mouse, or other assemblies
  • Retrieving RefSeq or GENCODE gene structure (exon coordinates, CDS boundaries, strand) for a locus of interest
  • Looking up PhyloP or PhastCons conservation scores to assess evolutionary constraint at a variant site
  • Listing and querying any of UCSC's 1000+ annotation tracks (repeats, regulatory elements, conservation) for a region
  • Getting chromosome sizes for a genome assembly to set up bedtools, pysam, or coverage pipelines
  • Accessing public UCSC track hubs (e.g., ENCODE, Roadmap Epigenomics) without downloading data locally
  • Use
    ensembl-database
    instead when you need Ensembl stable IDs, VEP variant annotation, or cross-species comparative genomics via the Ensembl REST API
  • For bulk local queries across millions of regions, use
    bedtools-genomic-intervals
    with pre-downloaded UCSC annotation files

Prerequisites

  • Python packages:
    requests
    ,
    matplotlib
    (for visualization)
  • Data requirements: genomic coordinates (chrom, start, end in 0-based half-open BED format), genome assembly name (e.g.,
    hg38
    ,
    mm39
    )
  • Environment: internet connection; no authentication required
  • Rate limits: no official published limit; add 0.5s delays for batch requests (>100 queries)
pip install requests matplotlib

Quick Start

import requests

BASE = "https://api.genome.ucsc.edu"

def get_sequence(genome, chrom, start, end):
    """Fetch DNA sequence for a genomic region (0-based, half-open)."""
    r = requests.get(f"{BASE}/getData/sequence",
                     params={"genome": genome, "chrom": chrom,
                             "start": start, "end": end})
    r.raise_for_status()
    return r.json()["dna"]

# Fetch 1 kb around the BRCA1 TSS on hg38
seq = get_sequence("hg38", "chr17", 43044294, 43045294)
print(f"Length: {len(seq)} bp")
print(f"Sequence: {seq[:60]}...")
# Length: 1000 bp
# Sequence: ATGATTGGTGGTTACATGCACAGTTGCTCTGGGAAGTTTCTTCTTCAGTTGAGAAAAGGT...

Core API

Query 1: Sequence Retrieval

Fetch the reference DNA sequence for any genomic region using the

getData/sequence
endpoint. Coordinates are 0-based, half-open (BED format).

import requests

BASE = "https://api.genome.ucsc.edu"

def get_sequence(genome, chrom, start, end):
    """Return DNA sequence string for the given region."""
    r = requests.get(f"{BASE}/getData/sequence",
                     params={"genome": genome, "chrom": chrom,
                             "start": start, "end": end})
    r.raise_for_status()
    data = r.json()
    return data["dna"]

# TP53 exon 4 region (hg38)
seq = get_sequence("hg38", "chr17", 7676520, 7676620)
print(f"Region: chr17:7,676,520-7,676,620 ({len(seq)} bp)")
print(f"Sequence: {seq}")
# Reverse-complement for minus-strand genes
def revcomp(seq):
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(comp)[::-1]

# BRCA2 on minus strand (hg38)
seq_fwd = get_sequence("hg38", "chr13", 32315086, 32315186)
seq_rc  = revcomp(seq_fwd)
print(f"Forward: {seq_fwd[:30]}...")
print(f"RevComp: {seq_rc[:30]}...")

Query 2: Track Data Query

Retrieve annotation data (BED records) from any UCSC track for a genomic region.

import requests

BASE = "https://api.genome.ucsc.edu"

def get_track_data(genome, track, chrom, start, end):
    """Fetch annotation records from a UCSC track for a region."""
    r = requests.get(f"{BASE}/getData/track",
                     params={"genome": genome, "track": track,
                             "chrom": chrom, "start": start, "end": end})
    r.raise_for_status()
    data = r.json()
    # Track data is under the key matching the track name
    return data.get(track, data.get("data", []))

# Fetch RepeatMasker annotations in the MYC locus (hg38)
repeats = get_track_data("hg38", "rmsk", "chr8", 127_735_434, 127_742_951)
print(f"Repeat elements in MYC locus: {len(repeats)}")
for r in repeats[:3]:
    print(f"  {r.get('repName', r.get('name'))} | {r['chromStart']}-{r['chromEnd']}")
# Fetch CpG islands near a promoter
cpg_islands = get_track_data("hg38", "cpgIslandExt", "chr17", 43_044_000, 43_050_000)
print(f"CpG islands found: {len(cpg_islands)}")
for island in cpg_islands:
    print(f"  {island['name']}: {island['chromStart']}-{island['chromEnd']}, "
          f"obsExp={island.get('obsExp', 'n/a')}")

Query 3: Track List

List all available annotation tracks for a genome assembly to discover what data is available.

import requests

BASE = "https://api.genome.ucsc.edu"

def list_tracks(genome):
    """Return a dict of {track_name: track_metadata} for a genome assembly."""
    r = requests.get(f"{BASE}/list/tracks", params={"genome": genome})
    r.raise_for_status()
    return r.json().get("tracks", {})

tracks = list_tracks("hg38")
print(f"Total tracks in hg38: {len(tracks)}")

# Find conservation-related tracks
conserv = {k: v for k, v in tracks.items() if "conserv" in k.lower() or "phylop" in k.lower()}
for name, meta in list(conserv.items())[:5]:
    print(f"  {name}: {meta.get('shortLabel', '')}")

Query 4: Chromosome Sizes

Get the length of every chromosome (or scaffold) for a genome assembly.

import requests

BASE = "https://api.genome.ucsc.edu"

def get_chrom_sizes(genome):
    """Return {chrom: size_in_bp} for a genome assembly."""
    r = requests.get(f"{BASE}/list/chromosomes", params={"genome": genome})
    r.raise_for_status()
    return r.json().get("chromosomeSizes", {})

sizes = get_chrom_sizes("hg38")
print(f"hg38 chromosome count: {len(sizes)}")

# Show canonical autosomes + sex chromosomes
canonical = {c: sizes[c] for c in sorted(sizes) if c in
             [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]}
for chrom, length in sorted(canonical.items(),
                             key=lambda x: int(x[0].replace("chr", "").replace("X", "23").replace("Y", "24").replace("M", "25"))):
    print(f"  {chrom}: {length:,} bp")

Query 5: Gene Annotation

Query RefSeq gene models (exon coordinates, CDS, strand) for a genomic region.

import requests

BASE = "https://api.genome.ucsc.edu"

def get_refgene(genome, chrom, start, end):
    """Retrieve RefSeq gene annotations for a region."""
    r = requests.get(f"{BASE}/getData/track",
                     params={"genome": genome, "track": "refGene",
                             "chrom": chrom, "start": start, "end": end})
    r.raise_for_status()
    return r.json().get("refGene", [])

# Query EGFR gene region (hg38)
genes = get_refgene("hg38", "chr7", 55_019_017, 55_211_628)
for g in genes:
    exon_count = g.get("exonCount", 0)
    print(f"  {g['name2']} ({g['name']}) | {g['strand']} | "
          f"tx: {g['txStart']}-{g['txEnd']} | exons: {exon_count}")
# Parse exon intervals from a refGene record
def parse_exons(gene_record):
    """Return list of (exon_start, exon_end) from a refGene record."""
    starts = [int(s) for s in gene_record["exonStarts"].strip(",").split(",") if s]
    ends   = [int(e) for e in gene_record["exonEnds"].strip(",").split(",") if e]
    return list(zip(starts, ends))

genes = get_refgene("hg38", "chr7", 55_019_017, 55_211_628)
if genes:
    g = genes[0]
    exons = parse_exons(g)
    print(f"{g['name2']}: {len(exons)} exons")
    for i, (s, e) in enumerate(exons[:4], 1):
        print(f"  Exon {i}: {s}-{e} ({e-s} bp)")

Query 6: Conservation Scores

Fetch per-base PhyloP or PhastCons conservation scores for a genomic region.

import requests

BASE = "https://api.genome.ucsc.edu"

def get_conservation(genome, track, chrom, start, end):
    """Retrieve per-base conservation scores from a bigWig track."""
    r = requests.get(f"{BASE}/getData/track",
                     params={"genome": genome, "track": track,
                             "chrom": chrom, "start": start, "end": end})
    r.raise_for_status()
    data = r.json()
    # bigWig tracks return a list of {start, end, value} intervals
    return data.get(track, [])

# PhyloP 100-way conservation at TP53 mutation hotspot (hg38)
# chr17:7,676,594 = codon 248 (R248W/Q common hotspot)
scores = get_conservation("hg38", "phyloP100way", "chr17", 7_676_580, 7_676_610)
print(f"PhyloP 100way scores ({len(scores)} intervals):")
for s in scores[:5]:
    print(f"  chr17:{s['start']}-{s['end']}: phyloP = {s['value']:.3f}")
# Positive scores = conserved; negative = fast-evolving

Query 7: Hub Access

Access public UCSC track hubs and list their available assemblies and tracks.

import requests

BASE = "https://api.genome.ucsc.edu"

def list_ucsc_genomes():
    """Return all UCSC-hosted genome assemblies."""
    r = requests.get(f"{BASE}/list/ucscGenomes")
    r.raise_for_status()
    return r.json().get("ucscGenomes", {})

genomes = list_ucsc_genomes()
print(f"Total UCSC genome assemblies: {len(genomes)}")

# Find all human assemblies
human = {k: v for k, v in genomes.items() if "Homo sapiens" in v.get("scientificName", "")}
for name, meta in sorted(human.items()):
    print(f"  {name}: {meta.get('description', '')}")

Key Concepts

0-Based vs. 1-Based Coordinates

The UCSC REST API uses 0-based, half-open intervals (BED format):

start
is inclusive,
end
is exclusive. This matches BED files and Python slicing. The UCSC Genome Browser web interface displays 1-based positions. To convert: API
start = browser_start - 1
, API
end = browser_end
.

# Browser position: chr17:7,676,521-7,676,620 (1-based, closed)
# API query (0-based, half-open):
start_api = 7_676_520   # browser_start - 1
end_api   = 7_676_620   # browser_end unchanged
seq = get_sequence("hg38", "chr17", start_api, end_api)
print(f"Fetched {len(seq)} bp (expected 100)")

Track Data Response Format

Track data returned from

/getData/track
is keyed by track name. BED-like tracks return a list of dicts with
chrom
,
chromStart
,
chromEnd
,
name
,
score
,
strand
. bigWig tracks (conservation, signal) return
{start, end, value}
intervals. Always check the actual key in the response JSON, which matches the
track
parameter name.

Common Workflows

Workflow 1: Extract Promoter Sequences for a Gene List

Goal: Retrieve 2 kb upstream of the TSS for each gene in a list, for motif analysis or primer design.

import requests
import time

BASE = "https://api.genome.ucsc.edu"
GENOME = "hg38"
PROMOTER_UP = 2000   # bp upstream of TSS

def get_refgene(genome, chrom, start, end):
    r = requests.get(f"{BASE}/getData/track",
                     params={"genome": genome, "track": "refGene",
                             "chrom": chrom, "start": start, "end": end})
    r.raise_for_status()
    return r.json().get("refGene", [])

def get_sequence(genome, chrom, start, end):
    r = requests.get(f"{BASE}/getData/sequence",
                     params={"genome": genome, "chrom": chrom,
                             "start": start, "end": end})
    r.raise_for_status()
    return r.json()["dna"]

def revcomp(seq):
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(comp)[::-1]

# Genes of interest: query a known locus for each
gene_loci = {
    "BRCA1": ("chr17", 43_044_294, 43_125_482),
    "TP53":  ("chr17",  7_661_779,  7_687_538),
    "EGFR":  ("chr7",  55_019_017, 55_211_628),
}

results = {}
for gene, (chrom, locus_start, locus_end) in gene_loci.items():
    records = get_refgene(GENOME, chrom, locus_start, locus_end)
    # Pick the longest transcript
    records = [r for r in records if r.get("name2") == gene]
    if not records:
        print(f"  {gene}: not found")
        continue
    g = max(records, key=lambda x: x["txEnd"] - x["txStart"])
    if g["strand"] == "+":
        prom_start = max(0, g["txStart"] - PROMOTER_UP)
        prom_end   = g["txStart"]
    else:
        prom_start = g["txEnd"]
        prom_end   = g["txEnd"] + PROMOTER_UP
    seq = get_sequence(GENOME, chrom, prom_start, prom_end)
    if g["strand"] == "-":
        seq = revcomp(seq)
    results[gene] = {"chrom": chrom, "start": prom_start, "end": prom_end,
                     "strand": g["strand"], "seq": seq}
    print(f"  {gene}: {chrom}:{prom_start}-{prom_end} | strand={g['strand']} | {len(seq)} bp")
    time.sleep(0.5)

# Write FASTA
with open("promoters.fa", "w") as fh:
    for gene, d in results.items():
        fh.write(f">{gene} {d['chrom']}:{d['start']}-{d['end']}({d['strand']})\n")
        fh.write(d["seq"] + "\n")
print(f"\nSaved {len(results)} promoter sequences → promoters.fa")

Workflow 2: Visualize Gene Structure from refGene Track

Goal: Draw an exon-intron diagram for a gene using matplotlib from refGene track data.

import requests
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

BASE = "https://api.genome.ucsc.edu"

def get_refgene(genome, chrom, start, end):
    r = requests.get(f"{BASE}/getData/track",
                     params={"genome": genome, "track": "refGene",
                             "chrom": chrom, "start": start, "end": end})
    r.raise_for_status()
    return r.json().get("refGene", [])

def parse_exons(rec):
    starts = [int(s) for s in rec["exonStarts"].strip(",").split(",") if s]
    ends   = [int(e) for e in rec["exonEnds"].strip(",").split(",") if e]
    return list(zip(starts, ends))

# Fetch BRCA1 transcripts (hg38)
genes = get_refgene("hg38", "chr17", 43_044_294, 43_125_482)
brca1 = [g for g in genes if g.get("name2") == "BRCA1"]
print(f"BRCA1 transcripts: {len(brca1)}")

# Plot the canonical transcript (longest)
g = max(brca1, key=lambda x: x["txEnd"] - x["txStart"])
exons = parse_exons(g)
tx_start, tx_end = g["txStart"], g["txEnd"]
cds_start, cds_end = g["cdsStart"], g["cdsEnd"]

fig, ax = plt.subplots(figsize=(12, 2.5))
ax.set_xlim(tx_start - 500, tx_end + 500)
ax.set_ylim(-0.5, 1.5)

# Intron line
ax.hlines(0.5, tx_start, tx_end, color="#555", lw=1.5, zorder=1)

# Exon boxes
for exon_s, exon_e in exons:
    # UTR portion (thin) vs CDS (thick)
    cds_s = max(exon_s, cds_start)
    cds_e = min(exon_e, cds_end)
    # Full exon box (UTR height)
    ax.add_patch(mpatches.FancyBboxPatch(
        (exon_s, 0.25), exon_e - exon_s, 0.5,
        boxstyle="square,pad=0", fc="#a8c4e0", ec="#2c6fad", lw=0.8, zorder=2))
    # CDS box (taller)
    if cds_s < cds_e:
        ax.add_patch(mpatches.FancyBboxPatch(
            (cds_s, 0.15), cds_e - cds_s, 0.7,
            boxstyle="square,pad=0", fc="#2c6fad", ec="#1a4a7a", lw=0.8, zorder=3))

strand_arrow = "→" if g["strand"] == "+" else "←"
ax.set_title(f"{g['name2']} ({g['name']}) {strand_arrow} — hg38 {g['chrom']}:"
             f"{tx_start:,}-{tx_end:,} | {g['exonCount']} exons", fontsize=11)
ax.set_xlabel("Genomic position (bp)")
ax.set_yticks([])
plt.tight_layout()
plt.savefig("brca1_gene_structure.png", dpi=150, bbox_inches="tight")
print("Saved: brca1_gene_structure.png")
plt.show()

Workflow 3: Batch Conservation Score Lookup

Goal: Retrieve mean PhyloP conservation for a list of variants or regions.

import requests
import time
import pandas as pd

BASE = "https://api.genome.ucsc.edu"

def get_conservation(genome, track, chrom, start, end):
    r = requests.get(f"{BASE}/getData/track",
                     params={"genome": genome, "track": track,
                             "chrom": chrom, "start": start, "end": end})
    r.raise_for_status()
    return r.json().get(track, [])

# Variants to score (1-based positions → convert to 0-based)
variants = [
    {"id": "rs28897672",  "chrom": "chr17", "pos": 7_676_594},  # TP53 R248
    {"id": "rs80357906",  "chrom": "chr17", "pos": 43_094_692}, # BRCA1
    {"id": "rs1042522",   "chrom": "chr17", "pos": 7_676_147},  # TP53 R72P (common)
]

results = []
for v in variants:
    # Query ±5 bp window around each variant
    scores = get_conservation("hg38", "phyloP100way",
                               v["chrom"], v["pos"] - 6, v["pos"] + 5)
    values = [s["value"] for s in scores]
    mean_score = sum(values) / len(values) if values else float("nan")
    results.append({**v, "phyloP100way_mean": round(mean_score, 3),
                    "n_intervals": len(scores)})
    print(f"  {v['id']}: mean phyloP = {mean_score:.3f}")
    time.sleep(0.5)

df = pd.DataFrame(results)
df.to_csv("variant_conservation.csv", index=False)
print(f"\nSaved → variant_conservation.csv\n{df.to_string(index=False)}")

Key Parameters

ParameterModuleDefaultRange / OptionsEffect
genome
All endpoints
hg38
,
mm39
,
dm6
, any UCSC assembly
Selects the genome assembly
chrom
Sequence, Track
chr1
chrY
,
chrM
Chromosome name (UCSC
chr
-prefix convention)
start
Sequence, Track0–chrom_sizeRegion start (0-based, inclusive)
end
Sequence, Track1–chrom_sizeRegion end (0-based, exclusive)
track
getData/track
Any track name from
list/tracks
Annotation track to retrieve
hubUrl
Hub endpointsURL to hub.txtAccess a public or private track hub

Best Practices

  1. Use 0-based coordinates throughout: The API is BED-format; always subtract 1 from 1-based browser positions. Mixing conventions causes silent off-by-one errors.

  2. Add delays for batch queries: There is no enforced rate limit, but UCSC's servers are shared resources. Insert

    time.sleep(0.5)
    between requests when processing >50 regions.

    import time
    for region in regions:
        seq = get_sequence("hg38", region["chrom"], region["start"], region["end"])
        time.sleep(0.5)
    
  3. Discover track names before querying: Track names (e.g.,

    refGene
    ,
    cpgIslandExt
    ) are not always obvious. Call
    list/tracks
    first to find the correct internal name, then query
    /getData/track
    .

  4. Handle missing track keys in response: The JSON key holding track records matches the

    track
    parameter name. Always use
    .get(track, [])
    to avoid
    KeyError
    when a track returns no data in a region.

  5. Download chromosome sizes once and cache: For pipelines that need sizes across many regions, call

    list/chromosomes
    once and store the result in a dict rather than re-requesting for each query.

Common Recipes

Recipe: Fetch Assembly List and Filter by Organism

When to use: Discover available genome assemblies for a specific species.

import requests

r = requests.get("https://api.genome.ucsc.edu/list/ucscGenomes")
r.raise_for_status()
genomes = r.json()["ucscGenomes"]

# All mouse assemblies
mouse = {k: v for k, v in genomes.items()
         if "Mus musculus" in v.get("scientificName", "")}
for name, meta in sorted(mouse.items()):
    print(f"  {name}: {meta.get('description', '')}")

Recipe: Write BED File from Track Query

When to use: Save UCSC track annotations as a BED file for downstream bedtools or IGV analysis.

import requests

BASE = "https://api.genome.ucsc.edu"

r = requests.get(f"{BASE}/getData/track",
                 params={"genome": "hg38", "track": "refGene",
                         "chrom": "chr7", "start": 55_019_017, "end": 55_211_628})
r.raise_for_status()
records = r.json().get("refGene", [])

with open("egfr_refgene.bed", "w") as fh:
    for rec in records:
        fh.write(f"{rec['chrom']}\t{rec['txStart']}\t{rec['txEnd']}\t"
                 f"{rec.get('name2', rec['name'])}\t0\t{rec['strand']}\n")
print(f"Wrote {len(records)} gene records → egfr_refgene.bed")

Recipe: GC Content for a Sequence

When to use: Compute GC content of a promoter or exon after fetching its sequence.

import requests

seq = requests.get(
    "https://api.genome.ucsc.edu/getData/sequence",
    params={"genome": "hg38", "chrom": "chr17",
            "start": 43_044_294, "end": 43_046_294}
).json()["dna"].upper()

gc = (seq.count("G") + seq.count("C")) / len(seq) * 100
print(f"Region length: {len(seq)} bp | GC content: {gc:.1f}%")

Recipe: Quick Coordinate Validation

When to use: Confirm coordinates are within chromosome bounds before submitting a batch.

import requests

sizes = requests.get("https://api.genome.ucsc.edu/list/chromosomes",
                     params={"genome": "hg38"}).json()["chromosomeSizes"]

def validate(chrom, start, end):
    if chrom not in sizes:
        return f"ERROR: {chrom} not in hg38"
    if start < 0 or end > sizes[chrom] or start >= end:
        return f"ERROR: {chrom}:{start}-{end} out of bounds (chrom size={sizes[chrom]})"
    return "OK"

print(validate("chr17", 43_044_294, 43_125_482))  # OK
print(validate("chr17", -1, 100))                  # ERROR

Troubleshooting

ProblemCauseSolution
HTTP 400
on sequence endpoint
Coordinates out of chromosome bounds or
start >= end
Check chromosome size with
list/chromosomes
; swap start/end if reversed
Track query returns empty listNo features in the region for that trackConfirm track exists with
list/tracks
; widen the query window
KeyError
on track response
Response key differs from track parameterUse
.get(track, data.get("data", []))
to handle variant key names
ConnectionError
or timeout
Network issue or server loadRetry with
requests.Session()
and set
timeout=30
; add
time.sleep(1)
Sequence is all lowercaseSoftmasked regions (RepeatMasker)Call
.upper()
on returned sequence if case is irrelevant to your use
Conservation track returns no dataTrack not available for that assemblyCheck
list/tracks
for the assembly;
phyloP100way
is hg38-only; use
phyloP60way
for mm10
Wrong gene retrievedMultiple transcripts at locusFilter by
name2
(gene symbol) and select the longest transcript

Related Skills

  • ensembl-database
    — Ensembl REST API for gene/transcript annotations with stable Ensembl IDs, VEP variant effects, and cross-species homologs; preferred for Ensembl-centric workflows
  • encode-database
    — ENCODE portal for regulatory element datasets (ChIP-seq peaks, ATAC-seq) that feed into UCSC track hubs
  • bedtools-genomic-intervals
    — Perform intersection, coverage, and arithmetic on BED files downloaded from UCSC
  • regulomedb-database
    — RegulomeDB for regulatory variant scoring, which overlaps UCSC regulatory tracks

References