SciAgent-Skills ucsc-genome-browser
Query the UCSC Genome Browser REST API for DNA sequences, track annotations, gene models, and conservation scores across 100+ genome assemblies. Retrieve sequence for any genomic region, list or fetch BED/bigWig track data, get chromosome sizes, query RefSeq/GENCODE gene structures, and access PhyloP/PhastCons conservation scores. Use for programmatic access to UCSC annotations; use Ensembl REST API instead for Ensembl-native gene IDs and VEP variant annotation.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/ucsc-genome-browser" ~/.claude/skills/jaechang-hits-sciagent-skills-ucsc-genome-browser && rm -rf "$T"
skills/genomics-bioinformatics/ucsc-genome-browser/SKILL.md
UCSC Genome Browser
Overview
The UCSC Genome Browser REST API at https://api.genome.ucsc.edu/ provides programmatic access to genome sequences, annotation tracks, and hub data for 100+ assemblies including hg38, mm39, and dm6. The API is free, requires no authentication, and returns JSON. Use it with the requests library to fetch DNA sequences for genomic regions, retrieve track data (genes, repeats, conservation), list available tracks, and query chromosome sizes for genome-scale coordinate arithmetic.
When to Use
- Fetching the reference DNA sequence for any genomic region (e.g., promoter, exon, CRISPR target) across human, mouse, or other assemblies
- Retrieving RefSeq or GENCODE gene structure (exon coordinates, CDS boundaries, strand) for a locus of interest
- Looking up PhyloP or PhastCons conservation scores to assess evolutionary constraint at a variant site
- Listing and querying any of UCSC's 1000+ annotation tracks (repeats, regulatory elements, conservation) for a region
- Getting chromosome sizes for a genome assembly to set up bedtools, pysam, or coverage pipelines
- Accessing public UCSC track hubs (e.g., ENCODE, Roadmap Epigenomics) without downloading data locally
- Use ensembl-database instead when you need Ensembl stable IDs, VEP variant annotation, or cross-species comparative genomics via the Ensembl REST API
- For bulk local queries across millions of regions, use bedtools-genomic-intervals with pre-downloaded UCSC annotation files
Prerequisites
- Python packages: requests, matplotlib (for visualization), pandas (used only in Workflow 3)
- Data requirements: genomic coordinates (chrom, start, end in 0-based half-open BED format), genome assembly name (e.g., hg38, mm39)
- Environment: internet connection; no authentication required
- Rate limits: no official published limit; add 0.5s delays for batch requests (>100 queries)
pip install requests matplotlib pandas
Quick Start

    import requests

    BASE = "https://api.genome.ucsc.edu"

    def get_sequence(genome, chrom, start, end):
        """Fetch DNA sequence for a genomic region (0-based, half-open)."""
        r = requests.get(f"{BASE}/getData/sequence",
                         params={"genome": genome, "chrom": chrom,
                                 "start": start, "end": end})
        r.raise_for_status()
        return r.json()["dna"]

    # Fetch the first 1 kb of the BRCA1 locus on hg38.
    # BRCA1 lies on the minus strand, so this window is the gene's 3' end, not its TSS.
    seq = get_sequence("hg38", "chr17", 43044294, 43045294)
    print(f"Length: {len(seq)} bp")
    print(f"Sequence: {seq[:60]}...")
    # Length: 1000 bp
    # Sequence: ATGATTGGTGGTTACATGCACAGTTGCTCTGGGAAGTTTCTTCTTCAGTTGAGAAAAGGT...

Core API
Query 1: Sequence Retrieval
Fetch the reference DNA sequence for any genomic region using the getData/sequence endpoint. Coordinates are 0-based, half-open (BED format).

    import requests

    BASE = "https://api.genome.ucsc.edu"

    def get_sequence(genome, chrom, start, end):
        """Return DNA sequence string for the given region."""
        r = requests.get(f"{BASE}/getData/sequence",
                         params={"genome": genome, "chrom": chrom,
                                 "start": start, "end": end})
        r.raise_for_status()
        data = r.json()
        return data["dna"]

    # 100 bp window inside the TP53 locus (hg38).
    # Check the exon number against a gene model before relying on it.
    seq = get_sequence("hg38", "chr17", 7676520, 7676620)
    print(f"Region: chr17:7,676,520-7,676,620 ({len(seq)} bp)")
    print(f"Sequence: {seq}")

    # Reverse-complement helper for minus-strand features
    def revcomp(seq):
        comp = str.maketrans("ACGTacgt", "TGCAtgca")
        return seq.translate(comp)[::-1]

    # First 100 bp of the BRCA2 locus (hg38). BRCA2 is a plus-strand gene,
    # so the reverse complement here only demonstrates the helper.
    seq_fwd = get_sequence("hg38", "chr13", 32315086, 32315186)
    seq_rc = revcomp(seq_fwd)
    print(f"Forward: {seq_fwd[:30]}...")
    print(f"RevComp: {seq_rc[:30]}...")

Query 2: Track Data Query
Retrieve annotation data (BED records) from any UCSC track for a genomic region.

    import requests

    BASE = "https://api.genome.ucsc.edu"

    def get_track_data(genome, track, chrom, start, end):
        """Fetch annotation records from a UCSC track for a region."""
        r = requests.get(f"{BASE}/getData/track",
                         params={"genome": genome, "track": track,
                                 "chrom": chrom, "start": start, "end": end})
        r.raise_for_status()
        data = r.json()
        # Track data is under the key matching the track name
        return data.get(track, data.get("data", []))

    # Fetch RepeatMasker annotations in the MYC locus (hg38)
    repeats = get_track_data("hg38", "rmsk", "chr8", 127_735_434, 127_742_951)
    print(f"Repeat elements in MYC locus: {len(repeats)}")
    for rep in repeats[:3]:
        # The rmsk table names its coordinates genoStart/genoEnd;
        # fall back to BED-style names for other tracks
        rep_start = rep.get("genoStart", rep.get("chromStart"))
        rep_end = rep.get("genoEnd", rep.get("chromEnd"))
        print(f"  {rep.get('repName', rep.get('name'))} | {rep_start}-{rep_end}")

    # Fetch CpG islands near a promoter
    cpg_islands = get_track_data("hg38", "cpgIslandExt", "chr17", 43_044_000, 43_050_000)
    print(f"CpG islands found: {len(cpg_islands)}")
    for island in cpg_islands:
        print(f"  {island['name']}: {island['chromStart']}-{island['chromEnd']}, "
              f"obsExp={island.get('obsExp', 'n/a')}")

Query 3: Track List
List all available annotation tracks for a genome assembly to discover what data is available.

    import requests

    BASE = "https://api.genome.ucsc.edu"

    def list_tracks(genome):
        """Return a dict of {track_name: track_metadata} for a genome assembly."""
        # trackLeavesOnly=1 flattens composite containers so subtracks
        # (e.g., individual conservation tracks) appear by name
        r = requests.get(f"{BASE}/list/tracks",
                         params={"genome": genome, "trackLeavesOnly": 1})
        r.raise_for_status()
        data = r.json()
        # The track dict is expected under the assembly name; "tracks" is kept as a fallback
        return data.get(genome, data.get("tracks", {}))

    tracks = list_tracks("hg38")
    print(f"Total tracks in hg38: {len(tracks)}")

    # Find conservation-related tracks
    conserv = {k: v for k, v in tracks.items()
               if "conserv" in k.lower() or "phylop" in k.lower()}
    for name, meta in list(conserv.items())[:5]:
        print(f"  {name}: {meta.get('shortLabel', '')}")

Query 4: Chromosome Sizes
Get the length of every chromosome (or scaffold) for a genome assembly.

    import requests

    BASE = "https://api.genome.ucsc.edu"

    def get_chrom_sizes(genome):
        """Return {chrom: size_in_bp} for a genome assembly."""
        r = requests.get(f"{BASE}/list/chromosomes", params={"genome": genome})
        r.raise_for_status()
        data = r.json()
        # Sizes are expected under "chromosomes"; "chromosomeSizes" is kept as a fallback
        return data.get("chromosomes", data.get("chromosomeSizes", {}))

    sizes = get_chrom_sizes("hg38")
    print(f"hg38 chromosome count: {len(sizes)}")

    # Show canonical autosomes, sex chromosomes, and chrM in karyotype order
    canonical_order = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]
    for chrom in canonical_order:
        if chrom in sizes:
            print(f"  {chrom}: {sizes[chrom]:,} bp")

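The "When to Use" list mentions setting up bedtools with chromosome sizes. A minimal sketch of that step follows, assuming the sizes are already in a {chrom: size} dict. The helper names natural_key and genome_file_lines are hypothetical, the example values are illustrative, and the underscore test for non-canonical sequences is an assumption about UCSC naming.

```python
import re

def natural_key(chrom):
    """Sort key that places chr2 before chr10 by comparing digit runs as integers."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", chrom)]

def genome_file_lines(sizes, canonical_only=True):
    """Format {chrom: size} as 'chrom<TAB>size' lines, the layout bedtools expects."""
    names = sorted(sizes, key=natural_key)
    if canonical_only:
        # Assumption: alt, random, and unplaced sequences carry an underscore in their name
        names = [n for n in names if "_" not in n]
    return [f"{n}\t{sizes[n]}" for n in names]

# Illustrative subset of a sizes dict (not fetched from the API here)
example_sizes = {
    "chr10": 133797422,
    "chr2": 242193529,
    "chrX": 156040895,
    "chr1_KI270706v1_random": 175055,
}
lines = genome_file_lines(example_sizes)
print("\n".join(lines))
```

Writing the lines to a file (for example hg38.genome) gives a file that can be passed to bedtools commands that take a genome file.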
Query 5: Gene Annotation
Query RefSeq gene models (exon coordinates, CDS, strand) for a genomic region.

    import requests

    BASE = "https://api.genome.ucsc.edu"

    def get_refgene(genome, chrom, start, end):
        """Retrieve RefSeq gene annotations for a region."""
        r = requests.get(f"{BASE}/getData/track",
                         params={"genome": genome, "track": "refGene",
                                 "chrom": chrom, "start": start, "end": end})
        r.raise_for_status()
        return r.json().get("refGene", [])

    # Query EGFR gene region (hg38)
    genes = get_refgene("hg38", "chr7", 55_019_017, 55_211_628)
    for g in genes:
        exon_count = g.get("exonCount", 0)
        print(f"  {g['name2']} ({g['name']}) | {g['strand']} | "
              f"tx: {g['txStart']}-{g['txEnd']} | exons: {exon_count}")

    # Parse exon intervals from a refGene record
    def parse_exons(gene_record):
        """Return list of (exon_start, exon_end) from a refGene record."""
        starts = [int(s) for s in gene_record["exonStarts"].strip(",").split(",") if s]
        ends = [int(e) for e in gene_record["exonEnds"].strip(",").split(",") if e]
        return list(zip(starts, ends))

    genes = get_refgene("hg38", "chr7", 55_019_017, 55_211_628)
    if genes:
        g = genes[0]
        exons = parse_exons(g)
        print(f"{g['name2']}: {len(exons)} exons")
        for i, (s, e) in enumerate(exons[:4], 1):
            print(f"  Exon {i}: {s}-{e} ({e-s} bp)")

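Once exons are parsed, intron coordinates follow from the gaps between consecutive exons. The sketch below works on a synthetic record so it runs offline. introns_from_exons is a hypothetical helper, not part of the UCSC API, and the record values are made up for illustration.

```python
def parse_exons(rec):
    """Return (start, end) pairs from the comma-separated refGene exon fields."""
    starts = [int(s) for s in rec["exonStarts"].split(",") if s]
    ends = [int(e) for e in rec["exonEnds"].split(",") if e]
    return list(zip(starts, ends))

def introns_from_exons(exons):
    """Return the gaps between consecutive exons as 0-based, half-open intervals."""
    ordered = sorted(exons)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start > prev_end  # skip touching or overlapping exons
    ]

# Synthetic record using the refGene field layout (trailing commas included)
record = {"exonStarts": "100,300,600,", "exonEnds": "200,400,700,"}
exons = parse_exons(record)
introns = introns_from_exons(exons)
print(introns)
```

Because exon ends are exclusive and exon starts inclusive, each intron interval can be passed to a sequence query without any further adjustment.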
Query 6: Conservation Scores
Fetch per-base PhyloP or PhastCons conservation scores for a genomic region.

    import requests

    BASE = "https://api.genome.ucsc.edu"

    def get_conservation(genome, track, chrom, start, end):
        """Retrieve per-base conservation scores from a bigWig track."""
        r = requests.get(f"{BASE}/getData/track",
                         params={"genome": genome, "track": track,
                                 "chrom": chrom, "start": start, "end": end})
        r.raise_for_status()
        records = r.json().get(track, [])
        # bigWig tracks return {start, end, value} intervals; some responses
        # nest them under the chromosome name, so unwrap that layer if present
        if isinstance(records, dict):
            records = records.get(chrom, [])
        return records

    # PhyloP 100-way conservation for a 30 bp window in the TP53 locus (hg38).
    # Look up hotspot coordinates (e.g., codon 248) in ClinVar or dbSNP
    # rather than assuming them from this example.
    scores = get_conservation("hg38", "phyloP100way", "chr17", 7_676_580, 7_676_610)
    print(f"PhyloP 100way scores ({len(scores)} intervals):")
    for s in scores[:5]:
        print(f"  chr17:{s['start']}-{s['end']}: phyloP = {s['value']:.3f}")
    # Positive scores = conserved; negative = fast-evolving

Query 7: Hub Access
List the genome assemblies hosted by UCSC. Public track hubs are reached through the related list/publicHubs and list/hubGenomes endpoints; the latter takes a hubUrl parameter pointing at the hub's hub.txt.

    import requests

    BASE = "https://api.genome.ucsc.edu"

    def list_ucsc_genomes():
        """Return all UCSC-hosted genome assemblies."""
        r = requests.get(f"{BASE}/list/ucscGenomes")
        r.raise_for_status()
        return r.json().get("ucscGenomes", {})

    genomes = list_ucsc_genomes()
    print(f"Total UCSC genome assemblies: {len(genomes)}")

    # Find all human assemblies
    human = {k: v for k, v in genomes.items()
             if "Homo sapiens" in v.get("scientificName", "")}
    for name, meta in sorted(human.items()):
        print(f"  {name}: {meta.get('description', '')}")

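For hub queries, the request carries a hubUrl parameter in addition to (or instead of) genome. The sketch below only builds the request URLs, so it runs without network access. hub_query_url is a hypothetical helper and the example.org hub address is a placeholder, not a real hub.

```python
from urllib.parse import urlencode

BASE = "https://api.genome.ucsc.edu"

def hub_query_url(endpoint, hub_txt_url, genome=None, **extra):
    """Build a hub request URL; hubUrl points at the hub's hub.txt file."""
    params = {"hubUrl": hub_txt_url}
    if genome:
        params["genome"] = genome
    params.update(extra)
    return f"{BASE}/{endpoint}?{urlencode(params)}"

# Placeholder hub address (assumption for illustration only)
HUB = "https://example.org/hub.txt"

genomes_url = hub_query_url("list/hubGenomes", HUB)
tracks_url = hub_query_url("list/tracks", HUB, genome="hg38")
print(genomes_url)
print(tracks_url)
```

The resulting URLs can be passed to requests.get in the same way as the other queries in this section.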
Key Concepts
0-Based vs. 1-Based Coordinates
The UCSC REST API uses 0-based, half-open intervals (BED format): start is inclusive, end is exclusive. This matches BED files and Python slicing. The UCSC Genome Browser web interface displays 1-based positions. To convert: API start = browser_start - 1, API end = browser_end.

    # Browser position: chr17:7,676,521-7,676,620 (1-based, closed)
    # API query (0-based, half-open):
    start_api = 7_676_520  # browser_start - 1
    end_api = 7_676_620    # browser_end unchanged
    seq = get_sequence("hg38", "chr17", start_api, end_api)
    print(f"Fetched {len(seq)} bp (expected 100)")

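The conversion rule can be wrapped in a small parser for position strings copied from the browser. This is a sketch under the assumption that positions look like chrom:start-end with optional thousands separators; browser_to_api is a hypothetical helper name.

```python
import re

def browser_to_api(position):
    """Convert 'chr17:7,676,521-7,676,620' (1-based, closed) to (chrom, start, end) in 0-based, half-open form."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", position.strip())
    if not match:
        raise ValueError(f"Unrecognized position string: {position!r}")
    chrom = match.group(1)
    start_1based = int(match.group(2).replace(",", ""))
    end_1based = int(match.group(3).replace(",", ""))
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError(f"Invalid coordinates in {position!r}")
    # Only the start shifts; the closed 1-based end equals the exclusive 0-based end
    return chrom, start_1based - 1, end_1based

chrom, start, end = browser_to_api("chr17:7,676,521-7,676,620")
print(chrom, start, end, end - start)
```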
Track Data Response Format
Track data returned from /getData/track is keyed by track name. BED-like tracks return a list of dicts with chrom, chromStart, chromEnd, name, score, strand. bigWig tracks (conservation, signal) return {start, end, value} intervals. Always check the actual key in the response JSON, which matches the track parameter name.
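A defensive way to read the response is to accept either a flat list under the track key or a dict of per-chromosome lists. The nested shape is an assumption about what some responses look like (for example when no chromosome is given), and extract_records is a hypothetical helper. The payloads below are synthetic.

```python
def extract_records(payload, track, chrom=None):
    """Return a flat list of records for `track` from a decoded JSON response."""
    block = payload.get(track, [])
    if isinstance(block, dict):
        # Assumed shape: {"chr1": [...], "chr2": [...]}
        if chrom is not None:
            return list(block.get(chrom, []))
        return [rec for recs in block.values() for rec in recs]
    return list(block)

# Synthetic payloads illustrating both shapes
flat = {"cpgIslandExt": [{"chrom": "chr17", "chromStart": 10, "chromEnd": 20}]}
nested = {"phyloP100way": {"chr17": [{"start": 5, "end": 6, "value": 1.5}]}}

flat_records = extract_records(flat, "cpgIslandExt")
nested_records = extract_records(nested, "phyloP100way", chrom="chr17")
missing_records = extract_records(flat, "rmsk")
print(len(flat_records), len(nested_records), len(missing_records))
```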
Common Workflows
Workflow 1: Extract Promoter Sequences for a Gene List
Goal: Retrieve 2 kb upstream of the TSS for each gene in a list, for motif analysis or primer design.

    import requests
    import time

    BASE = "https://api.genome.ucsc.edu"
    GENOME = "hg38"
    PROMOTER_UP = 2000  # bp upstream of TSS

    def get_refgene(genome, chrom, start, end):
        r = requests.get(f"{BASE}/getData/track",
                         params={"genome": genome, "track": "refGene",
                                 "chrom": chrom, "start": start, "end": end})
        r.raise_for_status()
        return r.json().get("refGene", [])

    def get_sequence(genome, chrom, start, end):
        r = requests.get(f"{BASE}/getData/sequence",
                         params={"genome": genome, "chrom": chrom,
                                 "start": start, "end": end})
        r.raise_for_status()
        return r.json()["dna"]

    def revcomp(seq):
        comp = str.maketrans("ACGTacgt", "TGCAtgca")
        return seq.translate(comp)[::-1]

    # Genes of interest: query a known locus for each
    gene_loci = {
        "BRCA1": ("chr17", 43_044_294, 43_125_482),
        "TP53":  ("chr17", 7_661_779, 7_687_538),
        "EGFR":  ("chr7", 55_019_017, 55_211_628),
    }

    results = {}
    for gene, (chrom, locus_start, locus_end) in gene_loci.items():
        records = get_refgene(GENOME, chrom, locus_start, locus_end)
        # Keep only transcripts of the requested gene, then pick the longest
        records = [r for r in records if r.get("name2") == gene]
        if not records:
            print(f"  {gene}: not found")
            continue
        g = max(records, key=lambda x: x["txEnd"] - x["txStart"])

        if g["strand"] == "+":
            prom_start = max(0, g["txStart"] - PROMOTER_UP)
            prom_end = g["txStart"]
        else:
            prom_start = g["txEnd"]
            prom_end = g["txEnd"] + PROMOTER_UP

        seq = get_sequence(GENOME, chrom, prom_start, prom_end)
        if g["strand"] == "-":
            seq = revcomp(seq)

        results[gene] = {"chrom": chrom, "start": prom_start, "end": prom_end,
                         "strand": g["strand"], "seq": seq}
        print(f"  {gene}: {chrom}:{prom_start}-{prom_end} | strand={g['strand']} | {len(seq)} bp")
        time.sleep(0.5)

    # Write FASTA
    with open("promoters.fa", "w") as fh:
        for gene, d in results.items():
            fh.write(f">{gene} {d['chrom']}:{d['start']}-{d['end']}({d['strand']})\n")
            fh.write(d["seq"] + "\n")
    print(f"\nSaved {len(results)} promoter sequences → promoters.fa")

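The strand logic in this workflow can be isolated into a pure function, which also makes it easy to clamp the window to the chromosome on both ends (the loop above clamps only the plus-strand start). promoter_interval is a hypothetical helper, and the chrom_size argument is assumed to come from a chromosome sizes lookup.

```python
def promoter_interval(strand, tx_start, tx_end, upstream=2000, chrom_size=None):
    """Return the 0-based, half-open interval `upstream` bp 5' of the TSS."""
    if strand == "+":
        # TSS is txStart; upstream lies at lower coordinates
        start, end = tx_start - upstream, tx_start
    elif strand == "-":
        # TSS is txEnd; upstream lies at higher coordinates
        start, end = tx_end, tx_end + upstream
    else:
        raise ValueError(f"Unknown strand: {strand!r}")
    start = max(0, start)
    if chrom_size is not None:
        end = min(end, chrom_size)
    return start, end

plus_window = promoter_interval("+", 1500, 9000)
minus_window = promoter_interval("-", 1000, 9000, upstream=2000, chrom_size=10000)
print(plus_window, minus_window)
```

For minus-strand genes the fetched sequence still needs reverse-complementing, as the workflow does, so that it reads 5' to 3' relative to the gene.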
Workflow 2: Visualize Gene Structure from refGene Track
Goal: Draw an exon-intron diagram for a gene using matplotlib from refGene track data.

    import requests
    import matplotlib.pyplot as plt
    import matplotlib.patches as mpatches

    BASE = "https://api.genome.ucsc.edu"

    def get_refgene(genome, chrom, start, end):
        r = requests.get(f"{BASE}/getData/track",
                         params={"genome": genome, "track": "refGene",
                                 "chrom": chrom, "start": start, "end": end})
        r.raise_for_status()
        return r.json().get("refGene", [])

    def parse_exons(rec):
        starts = [int(s) for s in rec["exonStarts"].strip(",").split(",") if s]
        ends = [int(e) for e in rec["exonEnds"].strip(",").split(",") if e]
        return list(zip(starts, ends))

    # Fetch BRCA1 transcripts (hg38)
    genes = get_refgene("hg38", "chr17", 43_044_294, 43_125_482)
    brca1 = [g for g in genes if g.get("name2") == "BRCA1"]
    print(f"BRCA1 transcripts: {len(brca1)}")

    # Plot the canonical transcript (longest)
    g = max(brca1, key=lambda x: x["txEnd"] - x["txStart"])
    exons = parse_exons(g)
    tx_start, tx_end = g["txStart"], g["txEnd"]
    cds_start, cds_end = g["cdsStart"], g["cdsEnd"]

    fig, ax = plt.subplots(figsize=(12, 2.5))
    ax.set_xlim(tx_start - 500, tx_end + 500)
    ax.set_ylim(-0.5, 1.5)

    # Intron line
    ax.hlines(0.5, tx_start, tx_end, color="#555", lw=1.5, zorder=1)

    # Exon boxes
    for exon_s, exon_e in exons:
        # UTR portion (thin) vs CDS (thick)
        cds_s = max(exon_s, cds_start)
        cds_e = min(exon_e, cds_end)
        # Full exon box (UTR height)
        ax.add_patch(mpatches.FancyBboxPatch(
            (exon_s, 0.25), exon_e - exon_s, 0.5,
            boxstyle="square,pad=0", fc="#a8c4e0", ec="#2c6fad", lw=0.8, zorder=2))
        # CDS box (taller)
        if cds_s < cds_e:
            ax.add_patch(mpatches.FancyBboxPatch(
                (cds_s, 0.15), cds_e - cds_s, 0.7,
                boxstyle="square,pad=0", fc="#2c6fad", ec="#1a4a7a", lw=0.8, zorder=3))

    strand_arrow = "→" if g["strand"] == "+" else "←"
    ax.set_title(f"{g['name2']} ({g['name']}) {strand_arrow} | hg38 {g['chrom']}:"
                 f"{tx_start:,}-{tx_end:,} | {g['exonCount']} exons", fontsize=11)
    ax.set_xlabel("Genomic position (bp)")
    ax.set_yticks([])
    plt.tight_layout()
    plt.savefig("brca1_gene_structure.png", dpi=150, bbox_inches="tight")
    print("Saved: brca1_gene_structure.png")
    plt.show()

Workflow 3: Batch Conservation Score Lookup
Goal: Retrieve mean PhyloP conservation for a list of variants or regions.

    import requests
    import time
    import pandas as pd

    BASE = "https://api.genome.ucsc.edu"

    def get_conservation(genome, track, chrom, start, end):
        r = requests.get(f"{BASE}/getData/track",
                         params={"genome": genome, "track": track,
                                 "chrom": chrom, "start": start, "end": end})
        r.raise_for_status()
        records = r.json().get(track, [])
        # Some responses nest the intervals under the chromosome name
        if isinstance(records, dict):
            records = records.get(chrom, [])
        return records

    # Variants to score (1-based positions → convert to 0-based).
    # The rsID labels and positions are illustrative; confirm them in dbSNP before use.
    variants = [
        {"id": "rs28897672", "chrom": "chr17", "pos": 7_676_594},   # in the TP53 locus
        {"id": "rs80357906", "chrom": "chr17", "pos": 43_094_692},  # in the BRCA1 locus
        {"id": "rs1042522",  "chrom": "chr17", "pos": 7_676_147},   # in the TP53 locus
    ]

    results = []
    for v in variants:
        # Query ±5 bp window around each variant:
        # 0-based start = pos - 1 - 5, exclusive end = pos - 1 + 5 + 1
        scores = get_conservation("hg38", "phyloP100way", v["chrom"],
                                  v["pos"] - 6, v["pos"] + 5)
        values = [s["value"] for s in scores]
        mean_score = sum(values) / len(values) if values else float("nan")
        results.append({**v, "phyloP100way_mean": round(mean_score, 3),
                        "n_intervals": len(scores)})
        print(f"  {v['id']}: mean phyloP = {mean_score:.3f}")
        time.sleep(0.5)

    df = pd.DataFrame(results)
    df.to_csv("variant_conservation.csv", index=False)
    print(f"\nSaved → variant_conservation.csv\n{df.to_string(index=False)}")

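The workflow averages one value per returned interval. If a response ever merges adjacent bases into longer intervals, a length-weighted mean is the safer summary. The sketch below assumes the {start, end, value} interval layout described earlier; weighted_mean is a hypothetical helper and the intervals are synthetic.

```python
def weighted_mean(intervals, start, end):
    """Mean of `value` weighted by the number of bases each interval covers within [start, end)."""
    total = 0.0
    covered = 0
    for iv in intervals:
        lo = max(iv["start"], start)
        hi = min(iv["end"], end)
        if hi > lo:  # interval overlaps the query window
            total += iv["value"] * (hi - lo)
            covered += hi - lo
    return total / covered if covered else float("nan")

# Synthetic intervals: two bases at 1.0, one base at 4.0
intervals = [
    {"start": 10, "end": 12, "value": 1.0},
    {"start": 12, "end": 13, "value": 4.0},
]
mean_all = weighted_mean(intervals, 10, 13)
mean_clipped = weighted_mean(intervals, 11, 13)
print(mean_all, mean_clipped)
```

When every interval is a single base, this gives the same result as the simple mean used in the workflow.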
Key Parameters
| Parameter | Module | Default | Range / Options | Effect |
|---|---|---|---|---|
| genome | All endpoints | none | Any UCSC assembly name (e.g., hg38, mm39, dm6) | Selects the genome assembly |
| chrom | Sequence, Track | none | chr1–chr22, chrX, and other names from list/chromosomes | Chromosome name (UCSC chr-prefix convention) |
| start | Sequence, Track | none | 0–chrom_size | Region start (0-based, inclusive) |
| end | Sequence, Track | none | 1–chrom_size | Region end (0-based, exclusive) |
| track | Track | none | Any track name from list/tracks | Annotation track to retrieve |
| hubUrl | Hub endpoints | none | URL to hub.txt | Access a public or private track hub |
Best Practices
- Use 0-based coordinates throughout: The API is BED-format; subtract 1 from a 1-based browser start and leave the end unchanged. Mixing conventions causes silent off-by-one errors.
- Add delays for batch queries: There is no enforced rate limit, but UCSC's servers are shared resources. Insert time.sleep(0.5) between requests when processing >50 regions.

        import time

        for region in regions:
            seq = get_sequence("hg38", region["chrom"], region["start"], region["end"])
            time.sleep(0.5)

- Discover track names before querying: Track names (e.g., refGene, cpgIslandExt) are not always obvious. Call list/tracks first to find the correct internal name, then query /getData/track.
- Handle missing track keys in response: The JSON key holding track records matches the track parameter name. Always use .get(track, []) to avoid KeyError when a track returns no data in a region.
- Download chromosome sizes once and cache: For pipelines that need sizes across many regions, call list/chromosomes once and store the result in a dict rather than re-requesting for each query.
Common Recipes
Recipe: Fetch Assembly List and Filter by Organism
When to use: Discover available genome assemblies for a specific species.

    import requests

    r = requests.get("https://api.genome.ucsc.edu/list/ucscGenomes")
    r.raise_for_status()
    genomes = r.json()["ucscGenomes"]

    # All mouse assemblies
    mouse = {k: v for k, v in genomes.items()
             if "Mus musculus" in v.get("scientificName", "")}
    for name, meta in sorted(mouse.items()):
        print(f"  {name}: {meta.get('description', '')}")

Recipe: Write BED File from Track Query
When to use: Save UCSC track annotations as a BED file for downstream bedtools or IGV analysis.

    import requests

    BASE = "https://api.genome.ucsc.edu"

    r = requests.get(f"{BASE}/getData/track",
                     params={"genome": "hg38", "track": "refGene",
                             "chrom": "chr7", "start": 55_019_017, "end": 55_211_628})
    r.raise_for_status()
    records = r.json().get("refGene", [])

    with open("egfr_refgene.bed", "w") as fh:
        for rec in records:
            fh.write(f"{rec['chrom']}\t{rec['txStart']}\t{rec['txEnd']}\t"
                     f"{rec.get('name2', rec['name'])}\t0\t{rec['strand']}\n")
    print(f"Wrote {len(records)} gene records → egfr_refgene.bed")

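The recipe above writes six-column BED, which drops the exon structure. If downstream tools need exons, the same refGene fields can be mapped onto the twelve BED12 columns. This is a sketch on a synthetic record; refgene_to_bed12 is a hypothetical helper, and the item colour column is set to 0 as a placeholder.

```python
def refgene_to_bed12(rec):
    """Format one refGene-style record as a tab-separated BED12 line."""
    starts = [int(s) for s in rec["exonStarts"].split(",") if s]
    ends = [int(e) for e in rec["exonEnds"].split(",") if e]
    tx_start = int(rec["txStart"])
    block_sizes = ",".join(str(e - s) for s, e in zip(starts, ends))
    # BED12 block starts are offsets relative to chromStart
    block_starts = ",".join(str(s - tx_start) for s in starts)
    fields = [
        rec["chrom"], tx_start, rec["txEnd"],
        rec.get("name2", rec["name"]), 0, rec["strand"],
        rec["cdsStart"], rec["cdsEnd"],  # thickStart, thickEnd
        0,                               # itemRgb placeholder
        len(starts), block_sizes, block_starts,
    ]
    return "\t".join(str(f) for f in fields)

# Synthetic record with the refGene field layout
record = {
    "chrom": "chr1", "txStart": 100, "txEnd": 700,
    "name": "NM_TEST", "name2": "GENE", "strand": "+",
    "cdsStart": 150, "cdsEnd": 650,
    "exonStarts": "100,300,600,", "exonEnds": "200,400,700,",
}
bed12_line = refgene_to_bed12(record)
print(bed12_line)
```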
Recipe: GC Content for a Sequence
When to use: Compute GC content of a promoter or exon after fetching its sequence.

    import requests

    seq = requests.get(
        "https://api.genome.ucsc.edu/getData/sequence",
        params={"genome": "hg38", "chrom": "chr17",
                "start": 43_044_294, "end": 43_046_294}
    ).json()["dna"].upper()

    gc = (seq.count("G") + seq.count("C")) / len(seq) * 100
    print(f"Region length: {len(seq)} bp | GC content: {gc:.1f}%")

Recipe: Quick Coordinate Validation
When to use: Confirm coordinates are within chromosome bounds before submitting a batch.

    import requests

    resp = requests.get("https://api.genome.ucsc.edu/list/chromosomes",
                        params={"genome": "hg38"})
    resp.raise_for_status()
    data = resp.json()
    # Sizes are expected under "chromosomes"; "chromosomeSizes" is kept as a fallback
    sizes = data.get("chromosomes", data.get("chromosomeSizes", {}))

    def validate(chrom, start, end):
        if chrom not in sizes:
            return f"ERROR: {chrom} not in hg38"
        if start < 0 or end > sizes[chrom] or start >= end:
            return f"ERROR: {chrom}:{start}-{end} out of bounds (chrom size={sizes[chrom]})"
        return "OK"

    print(validate("chr17", 43_044_294, 43_125_482))  # OK
    print(validate("chr17", -1, 100))                 # ERROR

Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| HTTP error on sequence endpoint | Coordinates out of chromosome bounds or start >= end | Check chromosome size with list/chromosomes; swap start/end if reversed |
| Track query returns empty list | No features in the region for that track | Confirm track exists with list/tracks; widen the query window |
| KeyError on track response | Response key differs from track parameter | Use data.get(track, data.get("data", [])) to handle variant key names |
| Connection error or timeout | Network issue or server load | Retry the request and set a timeout on requests.get; add time.sleep(0.5) between requests |
| Sequence is all lowercase | Softmasked regions (RepeatMasker) | Call .upper() on returned sequence if case is irrelevant to your use |
| Conservation track returns no data | Track not available for that assembly | Check list/tracks for the assembly; the phyloP100way name used here is for hg38, so look up the corresponding conservation track name for mm10 |
| Wrong gene retrieved | Multiple transcripts at locus | Filter by name2 (gene symbol) and select the longest transcript |
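The retry advice in the table can be sketched as a small wrapper with exponential backoff. The wrapper takes any zero-argument callable, so it is shown here with a simulated flaky function instead of a live request. fetch_with_retry and its parameters are hypothetical; in real use the except clause would be better narrowed to requests.RequestException.

```python
import time

def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times, doubling the wait after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # assumption: narrow this for production use
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error

# Simulated endpoint that fails twice, then succeeds
state = {"calls": 0}

def flaky_fetch():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated network failure")
    return {"dna": "ACGT"}

waits = []
result = fetch_with_retry(flaky_fetch, attempts=3, base_delay=1.0, sleep=waits.append)
print(result, waits)
```

With a real query the callable would be something like lambda: get_sequence("hg38", "chr17", 100, 200), with a timeout set on the underlying requests.get call.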
Related Skills
- ensembl-database: Ensembl REST API for gene/transcript annotations with stable Ensembl IDs, VEP variant effects, and cross-species homologs; preferred for Ensembl-centric workflows
- encode-database: ENCODE portal for regulatory element datasets (ChIP-seq peaks, ATAC-seq) that feed into UCSC track hubs
- bedtools-genomic-intervals: Perform intersection, coverage, and arithmetic on BED files downloaded from UCSC
- regulomedb-database: RegulomeDB for regulatory variant scoring, which overlaps UCSC regulatory tracks
References
- UCSC REST API documentation — full endpoint reference, parameters, and response formats
- UCSC API base URL — interactive endpoint explorer
- Kent WJ et al. (2002) Genome Res 12:996–1006 — original UCSC Genome Browser paper
- UCSC Track Database — Table Browser to explore track names and schemas