SciAgent-Skills archs4-database

Query uniformly processed RNA-seq gene expression profiles, tissue-specific expression patterns, and co-expression networks from the ARCHS4 database REST API. Retrieve z-score normalized expression across 1M+ human and mouse samples, find co-expressed genes, search samples by metadata, and download HDF5 expression matrices. For variant-level population genetics use gnomad-database; for pathway enrichment from gene lists use gget-genomic-databases (Enrichr).

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/archs4-database" ~/.claude/skills/jaechang-hits-sciagent-skills-archs4-database && rm -rf "$T"
manifest: skills/genomics-bioinformatics/archs4-database/SKILL.md

ARCHS4 Database

Overview

ARCHS4 (All RNA-seq and ChIP-seq Sample and Signature Search) is a resource of uniformly aligned and processed human and mouse RNA-seq data from NCBI GEO and SRA, covering 1 million+ samples. The REST API at https://maayanlab.cloud/archs4/api/ provides gene-level expression profiles, z-score normalized tissue expression, co-expression networks, and sample metadata search, all without authentication. Large-scale bulk queries can also use the downloadable HDF5 expression matrices.

When to Use

  • Retrieving tissue-specific or cell-type-specific expression z-scores for a gene of interest across hundreds of tissue types
  • Finding genes co-expressed with a query gene (co-expression network construction or guilt-by-association analysis)
  • Searching for RNA-seq samples by tissue, disease, or metadata keyword to identify candidate datasets for reanalysis
  • Comparing expression profiles of multiple genes across tissues to prioritize candidates for wet-lab follow-up
  • Accessing uniformly processed gene expression matrices (HDF5 format) for large-scale cross-study analysis
  • Validating differential expression results by checking whether a gene's expression direction matches population-level tissue profiles
  • For variant-level population allele frequencies use gnomad-database; ARCHS4 provides expression evidence only
  • For Enrichr pathway enrichment from a gene list use gget-genomic-databases (gget enrichr); ARCHS4 is for expression lookups

Prerequisites

  • Python packages: requests, pandas, matplotlib, seaborn
  • Data requirements: gene symbols (HGNC format, e.g., TP53, BRCA1); sample GEO/SRA IDs for direct sample queries
  • Environment: internet connection; no API key or account required
  • Rate limits: ~10 requests/second; add time.sleep(0.1) between sequential gene queries to avoid throttling
pip install requests pandas matplotlib seaborn
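
The rate-limit guidance above can be wrapped in a small helper so that every call site spaces its requests the same way. This is a minimal sketch, assuming the ~10 requests/second figure quoted above; the `Throttle` class and its `clock` and `sleep` parameters are hypothetical names introduced here for illustration and are not part of ARCHS4 or `requests`.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls (hypothetical helper)."""

    def __init__(self, min_interval: float = 0.1,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested offline
        self._sleep = sleep      # injectable so tests do not actually wait
        self._last = None        # time the previous call was released, if any

    def wait(self) -> float:
        """Sleep just long enough to honor min_interval; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


throttle = Throttle(min_interval=0.1)
```

Calling `throttle.wait()` immediately before each `requests.get(...)` keeps sequential queries at or below roughly ten per second without sleeping longer than necessary.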

Quick Start

import requests

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def archs4_get(endpoint: str, params: dict = None) -> dict:
    """Send a GET request to the ARCHS4 API and return parsed JSON."""
    r = requests.get(f"{ARCHS4_BASE}/{endpoint}", params=params, timeout=30)
    r.raise_for_status()
    return r.json()

# Quick check: top tissues expressing TP53
data = archs4_get("meta/genes/TP53/zscore")
tissues = data.get("values", [])
print(f"TP53 tissue expression entries: {len(tissues)}")
top5 = sorted(tissues, key=lambda x: x.get("zscore", 0), reverse=True)[:5]
for t in top5:
    print(f"  {t['tissue']:<40}  z={t['zscore']:.2f}")
# TP53 tissue expression entries: 200
#   thymus                                   z=2.81
#   testis                                   z=2.44

Core API

Query 1: Gene Expression Z-Scores Across Tissues

Retrieve z-score normalized expression for a gene across all available tissue types. Z-scores are computed per-sample relative to the population distribution; positive values indicate above-average expression.

import requests
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def get_gene_tissue_zscore(gene_symbol: str, species: str = "human") -> pd.DataFrame:
    """Return tissue z-score expression profile for a gene.

    Parameters
    ----------
    gene_symbol : str
        HGNC gene symbol (e.g., 'TP53').
    species : str
        'human' or 'mouse' (default: 'human').
    """
    endpoint = f"meta/genes/{gene_symbol}/zscore"
    r = requests.get(
        f"{ARCHS4_BASE}/{endpoint}",
        params={"species": species},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
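    records = data.get("values", [])
    df = pd.DataFrame(records)
    if df.empty:
        # Unknown symbol or wrong species: return an empty frame with the expected columns
        return pd.DataFrame(columns=["tissue", "zscore"])
    return df.sort_values("zscore", ascending=False).reset_index(drop=True)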

df = get_gene_tissue_zscore("MYC")
print(f"MYC tissue z-scores: {len(df)} tissue types")
print(df[["tissue", "zscore"]].head(10).to_string(index=False))
# MYC tissue z-scores: 200 tissue types
#                     tissue  zscore
#                      colon    3.12
#             small intestine    2.98
#                    placenta    2.74

# Query mouse tissues for a gene
df_mouse = get_gene_tissue_zscore("Myc", species="mouse")
print("Mouse Myc: top 5 tissues")
print(df_mouse[["tissue", "zscore"]].head(5).to_string(index=False))

Query 2: Co-expressed Genes

Find genes whose expression is most correlated with a query gene across all ARCHS4 samples. Useful for identifying pathway partners, regulators, or candidate targets.

import requests
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def get_coexpressed_genes(gene_symbol: str, top_n: int = 50,
                           species: str = "human") -> pd.DataFrame:
    """Return genes co-expressed with the query gene.

    Parameters
    ----------
    gene_symbol : str
        HGNC gene symbol.
    top_n : int
        Number of correlated genes to return (default: 50).
    species : str
        'human' or 'mouse' (default: 'human').
    """
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene_symbol}/correlations",
        params={"species": species, "limit": top_n},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
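    records = data.get("values", [])
    df = pd.DataFrame(records)
    if df.empty:
        # Unknown symbol or wrong species: return an empty frame with the expected columns
        return pd.DataFrame(columns=["gene", "correlation"])
    return df.sort_values("correlation", ascending=False).reset_index(drop=True)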

coexp = get_coexpressed_genes("PCNA", top_n=20)
print(f"Top co-expressed genes with PCNA (n={len(coexp)}):")
print(coexp[["gene", "correlation"]].head(10).to_string(index=False))
# Top co-expressed genes with PCNA (n=20):
#   gene  correlation
#   RFC4         0.91
#   RFC2         0.89
#   MCM6         0.87

# Extract gene list for downstream enrichment
gene_list = coexp["gene"].tolist()
print(f"Co-expression gene list: {gene_list[:10]}")
# Pass gene_list to Enrichr or pathway analysis tools
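
The Pearson coefficient returned with each gene makes it easy to keep only strongly co-expressed partners before passing the list downstream. This is a minimal sketch, assuming records shaped like the `values` entries used above (`gene` and `correlation` keys); `filter_coexpressed` and the 0.7 cutoff are illustrative choices, and the sample records are made up for demonstration.

```python
def filter_coexpressed(records, min_correlation=0.7):
    """Return gene names whose correlation meets the cutoff, strongest first."""
    kept = [
        rec for rec in records
        if rec.get("correlation") is not None
        and rec["correlation"] >= min_correlation
    ]
    kept.sort(key=lambda rec: rec["correlation"], reverse=True)
    return [rec["gene"] for rec in kept]


# Made-up records for demonstration only (not real ARCHS4 output)
example_records = [
    {"gene": "GENE_A", "correlation": 0.65},
    {"gene": "GENE_B", "correlation": 0.91},
    {"gene": "GENE_C", "correlation": None},
    {"gene": "GENE_D", "correlation": 0.78},
]
strong = filter_coexpressed(example_records)
print(strong)   # ['GENE_B', 'GENE_D']
```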

Query 3: Sample Search

Search for RNA-seq samples by metadata keyword (tissue, disease condition, cell type, treatment). Returns GEO/SRA sample identifiers with metadata fields.

import requests
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def search_samples(keyword: str, species: str = "human",
                   limit: int = 100) -> pd.DataFrame:
    """Search ARCHS4 samples by metadata keyword.

    Parameters
    ----------
    keyword : str
        Search term (e.g., 'breast cancer', 'liver', 'HeLa').
    species : str
        'human' or 'mouse'.
    limit : int
        Maximum number of samples to return.
    """
    r = requests.get(
        f"{ARCHS4_BASE}/samples/search",
        params={"query": keyword, "species": species, "limit": limit},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
    records = data.get("samples", [])
    return pd.DataFrame(records)

samples = search_samples("pancreatic cancer", limit=50)
print(f"Samples matching 'pancreatic cancer': {len(samples)}")
if len(samples) > 0:
    print(samples[["sample_id", "series_id", "title"]].head(5).to_string(index=False))
# Samples matching 'pancreatic cancer': 50
#   sample_id  series_id  title
#   GSM2345678  GSE123456  Pancreatic ductal adenocarcinoma - sample 1

Query 4: Gene-Level Metadata Summary

Retrieve summary statistics and metadata for a gene including the number of samples expressing it, expression percentile, and available annotation.

import requests

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def get_gene_metadata(gene_symbol: str, species: str = "human") -> dict:
    """Return metadata and expression summary for a gene."""
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene_symbol}",
        params={"species": species},
        timeout=30
    )
    r.raise_for_status()
    return r.json()

meta = get_gene_metadata("GAPDH")
print(f"Gene: {meta.get('gene_symbol', 'N/A')}")
print(f"Species: {meta.get('species', 'N/A')}")
print(f"Ensembl ID: {meta.get('ensembl_gene_id', 'N/A')}")
print(f"Description: {(meta.get('description') or 'N/A')[:80]}")

# Compare metadata for a panel of housekeeping genes
import time

housekeeping = ["GAPDH", "ACTB", "B2M", "HPRT1", "RPLP0"]
for gene in housekeeping:
    meta = get_gene_metadata(gene)
    print(f"  {gene:<8}  {meta.get('ensembl_gene_id', 'N/A')}")
    time.sleep(0.1)

Query 5: Visualization — Tissue Expression Barplot

Generate a publication-ready barplot of z-score expression across the top tissues for a gene.

import requests
import pandas as pd
import matplotlib.pyplot as plt

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def plot_tissue_expression(gene_symbol: str, top_n: int = 20,
                            species: str = "human",
                            output_file: str = None) -> None:
    """Plot top tissue z-score expression for a gene.

    Parameters
    ----------
    gene_symbol : str
        HGNC gene symbol.
    top_n : int
        Number of top tissues to display.
    species : str
        'human' or 'mouse'.
    output_file : str
        If provided, save figure to this path.
    """
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene_symbol}/zscore",
        params={"species": species},
        timeout=30
    )
    r.raise_for_status()
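    records = r.json().get("values", [])
    if not records:
        # Nothing to plot: unknown symbol or wrong species
        print(f"No z-score data returned for {gene_symbol} ({species})")
        return
    df = pd.DataFrame(records).sort_values("zscore", ascending=False).head(top_n)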

    fig, ax = plt.subplots(figsize=(10, 6))
    colors = ["#D73027" if z > 0 else "#4575B4" for z in df["zscore"]]
    bars = ax.barh(df["tissue"][::-1], df["zscore"][::-1], color=colors[::-1])
    ax.axvline(0, color="black", linewidth=0.8, linestyle="--")
    ax.set_xlabel("Expression Z-Score")
    ax.set_title(f"ARCHS4 Tissue Expression: {gene_symbol} ({species})\nTop {top_n} tissues")
    ax.bar_label(bars, fmt="%.2f", padding=3, fontsize=8)
    plt.tight_layout()
    fname = output_file or f"{gene_symbol}_tissue_expression.png"
    plt.savefig(fname, dpi=150, bbox_inches="tight")
    plt.close(fig)   # release the figure so repeated calls do not accumulate open figures
    print(f"Saved {fname}  ({len(df)} tissues plotted)")

plot_tissue_expression("BRCA1", top_n=15, output_file="BRCA1_tissue_expression.png")

Query 6: HDF5 Bulk Data Access

Download or stream from ARCHS4's precomputed HDF5 expression matrices for large-scale cross-sample analysis. The HDF5 files contain gene × sample count matrices for human and mouse.

import requests

# HDF5 files are available for bulk download from the ARCHS4 data portal
# URL pattern: https://maayanlab.cloud/archs4/download#expression
# Human gene-level: human_gene_v2.6.h5
# Mouse gene-level: mouse_gene_v2.6.h5

def get_h5_download_urls() -> dict:
    """Return download URLs for ARCHS4 HDF5 expression matrices."""
    base = "https://maayanlab.cloud/archs4"
    return {
        "human_gene": f"{base}/files/human_gene_v2.6.h5",
        "mouse_gene": f"{base}/files/mouse_gene_v2.6.h5",
        "human_transcript": f"{base}/files/human_transcript_v2.6.h5",
        "mouse_transcript": f"{base}/files/mouse_transcript_v2.6.h5",
    }

urls = get_h5_download_urls()
for key, url in urls.items():
    print(f"  {key:<22}  {url}")

# To work with a downloaded HDF5 file:
try:
    import h5py
    import numpy as np

    h5_path = "human_gene_v2.6.h5"   # after download

    def extract_gene_from_h5(h5_path: str, gene_symbol: str,
                              n_samples: int = 1000) -> dict:
        """Extract expression values for a gene from the HDF5 matrix."""
        with h5py.File(h5_path, "r") as f:
            genes = [g.decode() for g in f["meta"]["genes"]["gene_symbol"][:]]
            if gene_symbol not in genes:
                raise ValueError(f"{gene_symbol} not found in HDF5")
            idx = genes.index(gene_symbol)
            expr = f["data"]["expression"][idx, :n_samples]
            sample_ids = [s.decode() for s in f["meta"]["samples"]["geo_accession"][:n_samples]]
        return {"gene": gene_symbol, "expression": expr, "sample_ids": sample_ids}

    result = extract_gene_from_h5(h5_path, "TP53", n_samples=500)
    print(f"TP53 expression: mean={result['expression'].mean():.2f},"
          f" max={result['expression'].max():.2f} (n={len(result['expression'])} samples)")
except ImportError:
    print("h5py not installed. Install with: pip install h5py")
except FileNotFoundError:
    print("HDF5 file not downloaded yet. Use the URLs above to download first.")

Key Concepts

Z-Score Normalization

ARCHS4 reports gene expression as z-scores computed relative to all samples for that gene. A z-score of 0 means expression at the population mean; a z-score of 2.0 means expression 2 standard deviations above the mean. Z-scores are easier to compare across datasets than raw counts because they put every gene on a common, unitless scale. Uniform alignment removes pipeline differences between studies, but z-scoring does not by itself correct experimental batch effects.

# Example: Positive z-score = above-average expression for that gene
# z > 2.0 → roughly the top 2-2.5% of samples for that gene (if values are approximately normal)
# z < -2.0 → roughly the bottom 2-2.5% of samples for that gene
# Use absolute z-score thresholds consistently when comparing across genes
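
To make the definition concrete, the sketch below computes z-scores for a handful of made-up expression values and evaluates the normal-tail fraction behind the "z > 2.0" rule of thumb. It assumes the population standard deviation and an approximately normal distribution; ARCHS4's exact normalization procedure is not specified here, so treat this as an illustration of the general idea only.

```python
import math
from statistics import mean, pstdev


def zscores(values):
    """Standardize values: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mu) / sigma for x in values]


def upper_tail_fraction(z):
    """Fraction of a standard normal distribution lying above z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Made-up expression values for one gene across five samples
values = [2.0, 4.0, 4.0, 6.0, 9.0]
print([round(z, 2) for z in zscores(values)])
print(f"P(Z > 2.0) = {upper_tail_fraction(2.0):.4f}")   # P(Z > 2.0) = 0.0228
```

Under a normal distribution the tail above z = 2.0 is about 2.3%; the 2.5% figure corresponds to z ≈ 1.96.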

HDF5 vs REST API

| Access method | Best for | Limitations |
| --- | --- | --- |
| REST API (/zscore, /correlations) | Quick single-gene queries, exploration | Aggregated profiles only, no per-sample access |
| REST API (/samples/search) | Discovering relevant datasets | Returns metadata, not expression values |
| HDF5 download | Bulk analysis, custom co-expression, ML | Requires 30–60 GB disk; download once |
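
For the custom co-expression use case served by the HDF5 download, two genes' per-sample expression vectors can be correlated locally once they have been extracted from the matrix. This is a minimal pure-Python sketch of the Pearson coefficient; the function name and toy vectors are illustrative, and an analysis over many thousands of samples would more likely use NumPy. Whether to log-transform counts first is a separate choice not covered here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy per-sample values (not real ARCHS4 data)
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0]
print(round(pearson(gene_a, gene_b), 3))   # 1.0
```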

Species and Gene Symbol Conventions

ARCHS4 indexes human samples using HGNC gene symbols (uppercase, e.g., TP53) and mouse samples using MGI symbols (first letter uppercase, e.g., Trp53). The species parameter accepts "human" or "mouse". Symbols in the wrong case for the chosen species, or Ensembl IDs, will return empty results.
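
Because a symbol in the wrong case can come back empty rather than raise an error, it can help to normalize symbols before querying. The sketch below is a heuristic based only on the conventions described above; `normalize_symbol` is a hypothetical helper. Simple capitalization will not reproduce every official symbol (for example, human C9orf72 contains lowercase letters, and some MGI symbols contain internal capitals), so authoritative lookups should still go through HGNC or MGI.

```python
def normalize_symbol(symbol: str, species: str = "human") -> str:
    """Apply the case convention described above (heuristic only)."""
    cleaned = symbol.strip()
    if not cleaned:
        raise ValueError("empty gene symbol")
    if species == "human":
        return cleaned.upper()
    if species == "mouse":
        return cleaned[0].upper() + cleaned[1:].lower()
    raise ValueError(f"unsupported species: {species!r}")


print(normalize_symbol("tp53"))             # TP53
print(normalize_symbol("TRP53", "mouse"))   # Trp53
```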

Common Workflows

Workflow 1: Multi-Gene Tissue Expression Heatmap

Goal: Compare tissue expression profiles of a gene panel and visualize as a heatmap to identify tissue-specific vs ubiquitous expression patterns.

import requests, time
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

gene_panel = ["MYC", "TP53", "BRCA1", "EGFR", "KRAS", "CDK4"]
top_n_tissues = 25

def get_tissue_zscores(gene: str) -> pd.Series:
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene}/zscore",
        params={"species": "human"},
        timeout=30
    )
    r.raise_for_status()
    records = r.json().get("values", [])
    df = pd.DataFrame(records).set_index("tissue")["zscore"]
    return df

# Build expression matrix (genes × tissues)
all_data = {}
for gene in gene_panel:
    try:
        all_data[gene] = get_tissue_zscores(gene)
        print(f"  Fetched {gene}")
    except Exception as e:
        print(f"  Warning: {gene} failed — {e}")
    time.sleep(0.1)

matrix = pd.DataFrame(all_data).T   # genes × tissues
# Select top tissues by max absolute z-score
tissue_importance = matrix.abs().max(axis=0).sort_values(ascending=False)
top_tissues = tissue_importance.head(top_n_tissues).index
matrix_subset = matrix[top_tissues]

# Plot heatmap
fig, ax = plt.subplots(figsize=(14, 5))
sns.heatmap(
    matrix_subset,
    cmap="RdBu_r",
    center=0,
    vmin=-3,
    vmax=3,
    ax=ax,
    cbar_kws={"label": "Z-Score"},
    linewidths=0.5
)
ax.set_title("ARCHS4 Tissue Expression Profiles — Gene Panel")
ax.set_xlabel("Tissue")
ax.set_ylabel("Gene")
plt.xticks(rotation=45, ha="right", fontsize=8)
plt.tight_layout()
plt.savefig("archs4_panel_heatmap.png", dpi=150, bbox_inches="tight")
print(f"Saved archs4_panel_heatmap.png  ({matrix_subset.shape})")

Workflow 2: Co-expression Network Seed Expansion

Goal: Start from a seed gene, retrieve co-expressed partners, then query their co-expressed genes in turn to build a two-hop co-expression neighborhood.

import requests, time
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def get_coexp(gene: str, top_n: int = 20, species: str = "human",
              min_corr: float = 0.0) -> list:
    """Return co-expressed gene symbols with correlation >= min_corr."""
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene}/correlations",
        params={"species": species, "limit": top_n},
        timeout=30
    )
    r.raise_for_status()
    return [rec["gene"] for rec in r.json().get("values", [])
            if (rec.get("correlation") or 0) >= min_corr]

seed_gene = "PCNA"
min_correlation = 0.80

# Hop 1: direct co-expressed partners above the correlation cutoff
hop1_genes = get_coexp(seed_gene, top_n=30, min_corr=min_correlation)
print(f"Hop 1 partners of {seed_gene}: {len(hop1_genes)}")
time.sleep(0.1)

# Hop 2: co-expressed genes of each partner, using the same cutoff
edges = set()
for gene in hop1_genes[:10]:   # limit for demonstration
    partners = get_coexp(gene, top_n=20, min_corr=min_correlation)
    for partner in partners:
        if partner != seed_gene:
            edges.add((gene, partner))
    time.sleep(0.1)

# Summarize the network
network_df = pd.DataFrame(list(edges), columns=["source", "target"])
hub_counts = network_df["source"].value_counts()
print(f"\nTwo-hop network: {len(edges)} edges")
print(f"Top hub genes:")
print(hub_counts.head(5))

network_df.to_csv(f"{seed_gene}_coexp_network.csv", index=False)
print(f"\nSaved {seed_gene}_coexp_network.csv")

Workflow 3: Sample Discovery and Dataset Summary

Goal: Search for samples by disease keyword, summarize how many GEO series are available, and export sample metadata for downstream reanalysis selection.

import requests, time
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def search_and_summarize(keyword: str, species: str = "human",
                          limit: int = 200) -> pd.DataFrame:
    """Search samples and return a tidy metadata DataFrame."""
    r = requests.get(
        f"{ARCHS4_BASE}/samples/search",
        params={"query": keyword, "species": species, "limit": limit},
        timeout=30
    )
    r.raise_for_status()
    records = r.json().get("samples", [])
    return pd.DataFrame(records)

keyword = "colorectal cancer"
df = search_and_summarize(keyword, limit=150)
print(f"Samples matching '{keyword}': {len(df)}")

if len(df) > 0:
    # Summarize by GEO series
    series_counts = df["series_id"].value_counts()
    print(f"\nTop GEO series (by sample count):")
    print(series_counts.head(8).to_string())

    # Export sample list
    df.to_csv(f"{keyword.replace(' ', '_')}_samples.csv", index=False)
    print(f"\nSaved {keyword.replace(' ', '_')}_samples.csv ({len(df)} samples)")
    print(f"Unique GEO series: {df['series_id'].nunique()}")

Key Parameters

| Parameter | Endpoint | Default | Range / Options | Effect |
| --- | --- | --- | --- | --- |
| species | All gene endpoints | "human" | "human", "mouse" | Selects the species-specific sample index |
| limit | /correlations, /samples/search | 100 | 1–500 | Number of results returned |
| gene_symbol (path) | /meta/genes/{gene}/zscore, /correlations | | HGNC symbol (human) or MGI symbol (mouse) | Query gene; case-sensitive |
| query | /samples/search | | free-text string | Metadata keyword search across title, tissue, source fields |
| offset | /samples/search | 0 | integer | Pagination offset for large result sets |
| correlation (response field) | /correlations | | -1.0 to 1.0 | Pearson correlation coefficient; filter > 0.7 for high co-expression |
| zscore (response field) | /zscore | | continuous float | Expression z-score; > 2.0 = high expression |
| page_size (HDF5) | HDF5 slice | all | any integer | Number of samples to extract per read from HDF5 |
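
The `limit` and `offset` parameters can be combined to walk through a large result set. This is a minimal sketch of that loop, assuming the search endpoint returns fewer than `limit` records on its last page; `fetch_page` is a hypothetical callable standing in for the actual HTTP request so that the logic can be exercised offline.

```python
from typing import Callable, Iterator, List


def paginate(fetch_page: Callable[[int, int], List[dict]],
             limit: int = 100, max_records: int = 1000) -> Iterator[dict]:
    """Yield records page by page until a short page or max_records is reached."""
    offset = 0
    yielded = 0
    while yielded < max_records:
        page = fetch_page(offset, limit)
        for rec in page:
            if yielded >= max_records:
                return
            yield rec
            yielded += 1
        if len(page) < limit:   # short (or empty) page: nothing left to fetch
            return
        offset += limit


# Offline stand-in for the API: 250 fake samples served in slices
fake_samples = [{"sample_id": f"S{i}"} for i in range(250)]


def fake_fetch(offset: int, limit: int) -> List[dict]:
    return fake_samples[offset:offset + limit]


all_records = list(paginate(fake_fetch, limit=100))
print(len(all_records))   # 250
```

Against the real service, `fetch_page` would wrap the `requests.get` call to `/samples/search` with `limit` and `offset` in `params`, plus a short sleep between pages.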

Best Practices

  1. Use z-score thresholds consistently: Because z-scores are gene-specific, a z-score of 2.0 for a ubiquitous gene (GAPDH) and a tissue-restricted gene (TTR, liver) have different interpretive meaning. Always annotate which gene you are comparing and the tissue background.

  2. Sleep between batch queries: ARCHS4 enforces a soft rate limit of ~10 requests/second. Add time.sleep(0.1) between sequential gene queries to avoid 429 Too Many Requests errors.

  3. Download HDF5 for large-scale analyses: For queries covering 50+ genes or requiring per-sample expression values, the REST API is impractical. Download the HDF5 file once and use h5py slicing for fast matrix access; this avoids hitting rate limits and is 100× faster for bulk extraction.

  4. Match gene symbol conventions by species: Human queries require HGNC uppercase symbols (e.g., TP53); mouse queries require MGI-style symbols (e.g., Trp53). Using the wrong case returns empty results without an error.

  5. Validate co-expression findings across datasets: ARCHS4 co-expression aggregates across all tissue types. A high correlation may be driven by a single tissue or study. Cross-check with tissue-specific queries or manually inspect the top contributing GEO series.

Common Recipes

Recipe: Quick Tissue Specificity Check

When to use: Rapidly determine whether a gene is broadly expressed (housekeeping) or tissue-restricted before designing experiments.

import requests

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def tissue_specificity_summary(gene_symbol: str) -> None:
    """Print a summary of high and low expression tissues for a gene."""
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene_symbol}/zscore",
        params={"species": "human"},
        timeout=30
    )
    r.raise_for_status()
    # Keep only records that carry a numeric z-score
    records = [rec for rec in r.json().get("values", [])
               if rec.get("zscore") is not None]
    if not records:
        print(f"\n{gene_symbol}: no z-score data returned")
        return
    zscores = [rec["zscore"] for rec in records]
    top_high = sorted(records, key=lambda x: x["zscore"], reverse=True)[:5]
    top_low = sorted(records, key=lambda x: x["zscore"])[:3]
    print(f"\n{gene_symbol} — {len(zscores)} tissues")
    print(f"  Range: [{min(zscores):.2f}, {max(zscores):.2f}]  "
          f"Mean: {sum(zscores)/len(zscores):.2f}")
    print("  High expression:")
    for t in top_high:
        print(f"    {t['tissue']:<35}  z={t['zscore']:.2f}")
    print("  Low expression:")
    for t in top_low:
        print(f"    {t['tissue']:<35}  z={t['zscore']:.2f}")

tissue_specificity_summary("TTR")   # Transthyretin — liver-specific

Recipe: Batch Gene Co-Expression Table

When to use: Build a table of the top co-expressed genes for each gene in a panel, for example a list of differentially expressed genes.

import requests, time
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def batch_coexpr_table(gene_list: list, top_n: int = 10) -> pd.DataFrame:
    """For each gene in gene_list, return its top co-expressed genes."""
    rows = []
    for gene in gene_list:
        try:
            r = requests.get(
                f"{ARCHS4_BASE}/meta/genes/{gene}/correlations",
                params={"species": "human", "limit": top_n},
                timeout=30
            )
            r.raise_for_status()
            for rec in r.json().get("values", []):
                rows.append({
                    "query_gene": gene,
                    "coexp_gene": rec.get("gene"),
                    "correlation": rec.get("correlation"),
                })
            time.sleep(0.1)
        except Exception as e:
            print(f"Warning: {gene} skipped — {e}")
    return pd.DataFrame(rows)

deg_list = ["MYC", "CCND1", "CDK4", "RB1", "E2F1"]
coexp_table = batch_coexpr_table(deg_list, top_n=10)
print(f"Co-expression entries: {len(coexp_table)}")
print(coexp_table.groupby("query_gene")["coexp_gene"].count())
coexp_table.to_csv("deg_coexpression_table.csv", index=False)
print("Saved deg_coexpression_table.csv")

Recipe: Export Sample IDs for GEO Download

When to use: Identify relevant GEO accessions to download raw count matrices for a meta-analysis.

import requests
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

keyword = "glioblastoma"
r = requests.get(
    f"{ARCHS4_BASE}/samples/search",
    params={"query": keyword, "species": "human", "limit": 200},
    timeout=30
)
r.raise_for_status()
samples = pd.DataFrame(r.json().get("samples", []))
if len(samples) > 0:
    # Get unique GEO series accessions
    series = samples["series_id"].dropna().unique()
    print(f"Unique GEO series for '{keyword}': {len(series)}")
    for s in series[:10]:
        n = (samples["series_id"] == s).sum()
        print(f"  {s}  ({n} samples)")
    # Export series list for GEO download script
    pd.Series(series, name="geo_series").to_csv(
        f"{keyword}_geo_series.txt", index=False
    )
    print(f"\nSaved {keyword}_geo_series.txt")

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| HTTP 404 for gene query | Gene symbol not found in ARCHS4 index | Verify HGNC symbol spelling; check species parameter matches gene convention (human: uppercase, mouse: first-letter-upper) |
| HTTP 429 Too Many Requests | Exceeded ~10 req/s rate limit | Add time.sleep(0.1) between requests; for batch queries use a 0.5 s delay |
| Empty values list in z-score response | Gene is not expressed in any indexed tissue, or wrong species | Switch species; verify gene is protein-coding and has GEO coverage |
| Empty samples list from search | Keyword not matched in metadata fields | Try broader or alternative keywords (e.g., "liver" instead of "hepatic") |
| HDF5 gene not found | Symbol mismatch between HDF5 version and query | Check available genes in f["meta"]["genes"]["gene_symbol"][:]; try Ensembl ID or alias |
| requests.exceptions.Timeout | Slow API response under load | Increase timeout=60; retry with exponential backoff |
| Z-scores all near zero | Gene has very low or absent expression across tissues | Check the gene's expression in raw counts; the gene may be non-coding or very lowly expressed |
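
The "retry with exponential backoff" advice for timeouts can be sketched as follows. This is a generic pattern rather than an ARCHS4-specific mechanism; `with_backoff`, its defaults, and the injectable `sleep` argument are illustrative choices.

```python
import time


def with_backoff(func, retries: int = 4, base_delay: float = 0.5,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise               # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))


# Offline demonstration: a callable that fails twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays = []
result = with_backoff(flaky, sleep=delays.append)
print(result, delays)   # ok [0.5, 1.0]
```

With `requests`, one option is to pass `retry_on=(requests.exceptions.Timeout, requests.exceptions.ConnectionError)` and wrap the call as `lambda: requests.get(url, params=params, timeout=60)`, so that other errors still fail immediately.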

Related Skills

  • gnomad-database: Population variant frequencies; use after ARCHS4 to identify variants in highly expressed genes
  • gget-genomic-databases: Enrichr pathway enrichment for ARCHS4 co-expression gene lists (gget enrichr)
  • pydeseq2-differential-expression: Differential expression analysis on bulk RNA-seq; ARCHS4 HDF5 matrices can serve as reference cohorts

References