SciAgent-Skills cbioportal-database

Access TCGA and other cancer genomics datasets via cBioPortal REST API. Retrieve somatic mutations, copy number alterations, gene expression profiles, and clinical data (survival, stage, treatment) for thousands of cancer studies. Use for tumor mutation burden analysis, oncoprint queries, and survival analysis. For population variant frequencies use gnomad-database; for drug-gene interactions use dgidb-database.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/cbioportal-database" ~/.claude/skills/jaechang-hits-sciagent-skills-cbioportal-database && rm -rf "$T"
manifest: skills/genomics-bioinformatics/cbioportal-database/SKILL.md
source content

cBioPortal Database

Overview

cBioPortal for Cancer Genomics is a public repository of cancer genomics data including TCGA, ICGC, and hundreds of curated studies spanning 100+ cancer types. It provides somatic mutation profiles, copy number alterations (CNA), gene expression, clinical data (survival, stage, treatment history), and methylation data for tens of thousands of patient samples. Data is accessible via a REST API at https://www.cbioportal.org/api/ with no authentication required.

When to Use

  • Retrieving somatic mutation profiles (variant type, amino acid change) for a gene across TCGA studies
  • Querying copy number alteration data (amplification, deep deletion) for candidate cancer driver genes
  • Accessing clinical data — overall survival, disease-free survival, tumor stage — for survival curve analysis
  • Identifying which cancer studies have molecular profiling data for a specific cancer type (e.g., breast, lung)
  • Downloading gene expression (RNA-seq FPKM/RSEM) data from specific TCGA cohorts for differential expression analysis
  • Correlating genomic alterations with clinical outcomes in a specific study
  • Use gnomad-database instead when you need population-level variant allele frequencies in healthy individuals
  • For drug-gene interaction lookups use dgidb-database; cBioPortal provides the genomic alteration data, not drug interaction annotations

Prerequisites

  • Python packages: requests, pandas, matplotlib
  • Data requirements: Hugo gene symbols (e.g., TP53) and their Entrez Gene IDs (e.g., 7157), cBioPortal study IDs (e.g., brca_tcga_pan_can_atlas_2018), molecular profile IDs
  • Environment: internet connection; no API key required
  • Rate limits: no strict rate limits; use time.sleep(0.2) between batch requests for polite access
pip install requests pandas matplotlib
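The polite-access pause can be wrapped once rather than repeated after every call. The following is a minimal sketch, assuming a hypothetical helper class named Throttle (not part of requests or the cBioPortal API) that enforces a minimum interval between calls. The clock and sleep functions are injectable so the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, min_interval=0.2, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first call

    def wait(self):
        """Sleep just long enough to keep calls min_interval seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() on a shared Throttle instance before each request would space batch requests by at least 0.2 seconds without adding a delay when the previous call was already slow.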

Quick Start

import requests
import pandas as pd

BASE_URL = "https://www.cbioportal.org/api"

def cbio_get(endpoint, params=None):
    """GET request to cBioPortal REST API, returns parsed JSON."""
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

# List available cancer types
cancer_types = cbio_get("cancer-types")
print(f"Total cancer types: {len(cancer_types)}")
# Total cancer types: 87

# Find the TCGA breast cancer PanCancer Atlas study
studies = cbio_get("studies", params={"keyword": "breast"})
brca = [s for s in studies if s["studyId"].startswith("brca_tcga_pan_can_atlas")]
if brca:
    s = brca[0]
    print(f"Study: {s['name']}")
    print(f"  studyId: {s['studyId']}")
    print(f"  Samples: {s['allSampleCount']}")
# Study: Breast Invasive Carcinoma (TCGA, PanCancer Atlas)
#   studyId: brca_tcga_pan_can_atlas_2018
#   Samples: 1084

Core API

Query 1: Cancer Types and Studies

List available cancer types and find studies by cancer type or keyword.

import requests
import pandas as pd

BASE_URL = "https://www.cbioportal.org/api"

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

# Get all cancer types
cancer_types = cbio_get("cancer-types")
ct_df = pd.DataFrame(cancer_types)[["cancerTypeId", "name", "dedicatedColor"]]
print(f"Cancer types: {len(ct_df)}")
print(ct_df.head(5).to_string(index=False))

# Find all studies for a cancer type
lung_studies = cbio_get("studies", params={"keyword": "lung adenocarcinoma"})
print(f"\nLung adenocarcinoma studies: {len(lung_studies)}")
for s in lung_studies[:3]:
    print(f"  {s['studyId']:40s}  n={s['allSampleCount']}")
# Get detailed study metadata including available data types
study_id = "brca_tcga_pan_can_atlas_2018"
study = cbio_get(f"studies/{study_id}")
print(f"Study: {study['name']}")
print(f"  Reference genome: {study.get('referenceGenome', 'n/a')}")
print(f"  All sample count: {study['allSampleCount']}")

# List molecular profiles for the study
profiles = cbio_get("molecular-profiles", params={"studyId": study_id})
print(f"\nMolecular profiles ({len(profiles)} total):")
for p in profiles:
    print(f"  {p['molecularProfileId']:55s}  [{p['molecularAlterationType']}]")

Query 2: Somatic Mutations

Retrieve mutation data for a gene or set of genes in a study's mutation profile.

import requests
import pandas as pd

BASE_URL = "https://www.cbioportal.org/api"

def cbio_post(endpoint, body):
    """POST request to cBioPortal REST API."""
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=60)
    r.raise_for_status()
    return r.json()

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

# Get all samples for a study
study_id = "brca_tcga_pan_can_atlas_2018"
samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
sample_ids = [s["sampleId"] for s in samples]
print(f"Total samples: {len(sample_ids)}")

# Mutation profile ID follows pattern: {studyId}_mutations
profile_id = f"{study_id}_mutations"

# Fetch mutations for TP53 (Entrez gene ID = 7157)
body = {
    "sampleIds": sample_ids[:200],   # first 200 samples
    "entrezGeneIds": [7157]           # TP53
}
mutations = cbio_post(f"molecular-profiles/{profile_id}/mutations/fetch", body)
print(f"TP53 mutations in first 200 samples: {len(mutations)}")

# Summarize by mutation type
mut_df = pd.DataFrame(mutations)
print("\nMutation type distribution:")
print(mut_df["mutationType"].value_counts().head(8).to_string())
# Missense_Mutation    102
# Nonsense_Mutation     28
# Splice_Site           14
# Frame_Shift_Del       12

Query 3: Copy Number Alterations

Fetch discrete CNA data (amplification = 2, gain = 1, diploid = 0, loss = -1, deep deletion = -2).

import requests
import pandas as pd

BASE_URL = "https://www.cbioportal.org/api"

def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=60)
    r.raise_for_status()
    return r.json()

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

study_id = "brca_tcga_pan_can_atlas_2018"
# CNA profile: discrete copy number data
cna_profile_id = f"{study_id}_gistic"   # GISTIC-derived discrete CNA

samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
sample_ids = [s["sampleId"] for s in samples][:300]

# Fetch CNA for ERBB2 (Entrez 2064) and MYC (Entrez 4609)
body = {
    "sampleIds": sample_ids,
    "entrezGeneIds": [2064, 4609]   # ERBB2, MYC
}
cna_data = cbio_post(
    f"molecular-profiles/{cna_profile_id}/molecular-data/fetch", body
)
print(f"CNA records retrieved: {len(cna_data)}")

cna_df = pd.DataFrame(cna_data)
# CNA values: 2=amplification, 1=gain, 0=diploid, -1=loss, -2=deep deletion
cna_label = {2: "AMP", 1: "GAIN", 0: "DIPLOID", -1: "LOSS", -2: "HOMDEL"}

print("\nERBB2 CNA distribution:")
erbb2 = cna_df[cna_df["entrezGeneId"] == 2064]
erbb2_counts = erbb2["value"].map(lambda x: cna_label.get(int(x), str(x))).value_counts()
print(erbb2_counts.to_string())
# DIPLOID    210
# AMP         62
# GAIN        18
# LOSS        10

Query 4: Clinical Data

Retrieve per-sample or per-patient clinical attributes including survival, tumor stage, and treatment.

import requests
import pandas as pd

BASE_URL = "https://www.cbioportal.org/api"

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

study_id = "brca_tcga_pan_can_atlas_2018"

# List available clinical attributes for this study
attrs = cbio_get(f"studies/{study_id}/clinical-attributes")
attr_df = pd.DataFrame(attrs)[["clinicalAttributeId", "displayName", "datatype", "patientAttribute"]]
print(f"Clinical attributes: {len(attr_df)}")
# Show survival-related attributes
survival_attrs = attr_df[attr_df["clinicalAttributeId"].str.contains("SURVIVAL|MONTHS|STATUS", na=False)]
print(survival_attrs[["clinicalAttributeId", "displayName"]].to_string(index=False))

# Fetch all patient-level clinical data (includes OS_STATUS and OS_MONTHS)
clinical = cbio_get(f"studies/{study_id}/clinical-data",
                    params={"clinicalDataType": "PATIENT",
                            "projection": "DETAILED"})
clin_df = pd.DataFrame(clinical)
# Pivot to patient × attribute matrix
clin_pivot = clin_df.pivot_table(
    index="patientId", columns="clinicalAttributeId",
    values="value", aggfunc="first"
)
print(f"\nPatients: {len(clin_pivot)}")
if "OS_STATUS" in clin_pivot.columns:
    print("OS status counts:")
    print(clin_pivot["OS_STATUS"].value_counts().to_string())
# OS status counts:
# 0:LIVING    765
# 1:DECEASED  319
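Clinical data arrives in long format, one record per patient and attribute, and OS_STATUS values carry a numeric prefix (for example 1:DECEASED, as in the output above). As a rough illustration using only the standard library, the sketch below pivots such records and derives an event flag. The records are invented toy data, pivot_clinical and os_event are hypothetical helper names, and the handling of unprefixed values such as DECEASED is an assumption.

```python
from collections import defaultdict


def pivot_clinical(records):
    """Turn long-format clinical records into {patientId: {attribute: value}}."""
    table = defaultdict(dict)
    for rec in records:
        # Keep the first value seen per patient/attribute, like aggfunc="first"
        table[rec["patientId"]].setdefault(rec["clinicalAttributeId"], rec["value"])
    return dict(table)


def os_event(os_status):
    """Return 1 for a death event, 0 for censored, None if the status is missing."""
    if not os_status:
        return None
    prefix = os_status.split(":", 1)[0]
    if prefix in ("0", "1"):
        return int(prefix)  # "1:DECEASED" -> 1, "0:LIVING" -> 0
    # Assumption: values without a numeric prefix are plain status words
    return 1 if os_status.strip().upper() == "DECEASED" else 0


# Invented toy records, not real patient data
toy_records = [
    {"patientId": "P1", "clinicalAttributeId": "OS_STATUS", "value": "1:DECEASED"},
    {"patientId": "P1", "clinicalAttributeId": "OS_MONTHS", "value": "12.5"},
    {"patientId": "P2", "clinicalAttributeId": "OS_STATUS", "value": "0:LIVING"},
]
patients = pivot_clinical(toy_records)
print(os_event(patients["P1"]["OS_STATUS"]))  # 1
print(os_event(patients["P2"]["OS_STATUS"]))  # 0
```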

Query 5: Gene Expression Data

Retrieve mRNA expression values (RSEM or FPKM) from RNA-seq profiles.

import requests
import pandas as pd

BASE_URL = "https://www.cbioportal.org/api"

def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=60)
    r.raise_for_status()
    return r.json()

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

study_id = "brca_tcga_pan_can_atlas_2018"
# RNA-seq profile (RSEM normalized values)
rna_profile_id = f"{study_id}_rna_seq_v2_mrna_median_normed_log2"

samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
sample_ids = [s["sampleId"] for s in samples][:100]

# Fetch expression for ESR1 (Entrez 2099), ERBB2 (2064), PGR (5241)
body = {
    "sampleIds": sample_ids,
    "entrezGeneIds": [2099, 2064, 5241]   # ESR1, ERBB2, PGR
}
expr_data = cbio_post(
    f"molecular-profiles/{rna_profile_id}/molecular-data/fetch", body
)
expr_df = pd.DataFrame(expr_data)
print(f"Expression records: {len(expr_df)}")

# Pivot to sample × gene matrix
expr_pivot = expr_df.pivot_table(
    index="sampleId", columns="entrezGeneId", values="value"
)
# Rename columns by gene symbol via an explicit mapping (not by position)
expr_pivot = expr_pivot.rename(columns={2099: "ESR1", 2064: "ERBB2", 5241: "PGR"})
print(f"\nExpression matrix: {expr_pivot.shape}")
print(expr_pivot.describe().round(2))

Query 6: Gene Details and Batch Lookup

Look up gene metadata (symbol, Entrez ID, type) required to construct mutation and CNA queries.

import requests
import pandas as pd

BASE_URL = "https://www.cbioportal.org/api"

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=30)
    r.raise_for_status()
    return r.json()

# Single gene lookup by Hugo symbol
gene = cbio_get("genes/TP53")
print(f"TP53: entrezGeneId={gene['entrezGeneId']}, type={gene['type']}")
# TP53: entrezGeneId=7157, type=protein-coding

# Batch gene lookup: convert Hugo symbols to Entrez IDs.
# The request body is a plain JSON list of IDs; geneIdType is a query parameter.
gene_symbols = ["BRCA1", "BRCA2", "TP53", "PIK3CA", "PTEN", "KRAS", "EGFR"]
gene_list = cbio_post("genes/fetch?geneIdType=HUGO_GENE_SYMBOL", gene_symbols)

gene_map = {g["hugoGeneSymbol"]: g["entrezGeneId"] for g in gene_list}
gene_df = pd.DataFrame(gene_list)[["hugoGeneSymbol", "entrezGeneId", "type"]]
print(f"\nResolved {len(gene_df)} genes:")
print(gene_df.to_string(index=False))
# hugoGeneSymbol  entrezGeneId            type
#          BRCA1         672   protein-coding
#          BRCA2         675   protein-coding
#           TP53        7157   protein-coding

Query 7: Visualization — Mutation Frequency Barplot

Plot mutation frequency across TCGA studies for a cancer driver gene.

import requests, time
import pandas as pd
import matplotlib.pyplot as plt

BASE_URL = "https://www.cbioportal.org/api"

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=60)
    r.raise_for_status()
    return r.json()

# Focus on a curated set of TCGA PanCancer Atlas studies
STUDIES = {
    "brca_tcga_pan_can_atlas_2018": "BRCA",
    "luad_tcga_pan_can_atlas_2018": "LUAD",
    "coad_tcga_pan_can_atlas_2018": "COAD",
    "prad_tcga_pan_can_atlas_2018": "PRAD",
    "gbm_tcga_pan_can_atlas_2018": "GBM",
}

GENE_ENTREZ = 7157   # TP53
GENE_SYMBOL = "TP53"

rows = []
for study_id, label in STUDIES.items():
    try:
        samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
        sample_ids = [s["sampleId"] for s in samples]
        n_total = len(sample_ids)
        profile_id = f"{study_id}_mutations"
        body = {"sampleIds": sample_ids, "entrezGeneIds": [GENE_ENTREZ]}
        muts = cbio_post(f"molecular-profiles/{profile_id}/mutations/fetch", body)
        mutated_samples = len({m["sampleId"] for m in muts})
        rows.append({"study": label, "n_mutated": mutated_samples,
                     "n_total": n_total,
                     "freq": mutated_samples / n_total * 100})
        time.sleep(0.2)
    except Exception as e:
        print(f"  Skipping {study_id}: {e}")

df = pd.DataFrame(rows).sort_values("freq", ascending=True)

fig, ax = plt.subplots(figsize=(7, 4))
bars = ax.barh(df["study"], df["freq"], color="#C0392B", edgecolor="white")
ax.bar_label(bars, labels=[f"{v:.0f}%  (n={n})" for v, n in zip(df["freq"], df["n_mutated"])],
             padding=4, fontsize=9)
ax.set_xlabel(f"{GENE_SYMBOL} Mutation Frequency (%)")
ax.set_title(f"{GENE_SYMBOL} Somatic Mutation Frequency\nacross TCGA PanCancer Atlas Studies")
ax.set_xlim(0, df["freq"].max() * 1.3)
plt.tight_layout()
plt.savefig(f"{GENE_SYMBOL}_mutation_frequency.png", dpi=150, bbox_inches="tight")
print(f"Saved {GENE_SYMBOL}_mutation_frequency.png")
print(df[["study", "n_mutated", "n_total", "freq"]].to_string(index=False))
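The frequency computed above counts each mutated sample once, even when a sample carries several mutations in the same gene. A small sketch of that calculation on invented toy records (mutation_frequency is a hypothetical helper name):

```python
def mutation_frequency(mutations, n_total):
    """Percent of samples with at least one mutation record.

    mutations: list of dicts with a "sampleId" key
    n_total:   number of samples queried
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    mutated = {m["sampleId"] for m in mutations}  # de-duplicate per sample
    return len(mutated) / n_total * 100


# Invented toy records, not real data
toy_mutations = [
    {"sampleId": "S1", "proteinChange": "R175H"},
    {"sampleId": "S1", "proteinChange": "R248Q"},  # second hit, same sample
    {"sampleId": "S3", "proteinChange": "R273C"},
]
print(mutation_frequency(toy_mutations, n_total=4))  # 50.0
```

The denominator here is every sample passed in. If some samples in a study were never sequenced, dividing by all samples would understate the frequency.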

Key Concepts

cBioPortal Data Model

cBioPortal organizes data in a three-tier hierarchy: Cancer Studies → Molecular Profiles → Sample-level data. A single study (e.g., brca_tcga_pan_can_atlas_2018) contains multiple molecular profiles, each covering one data type. Before querying mutation or expression data, always retrieve the molecular profile list with GET /molecular-profiles?studyId={studyId} to confirm the correct profile ID.

Molecular Profile ID Conventions

| Data Type | Typical Profile ID Suffix | Alteration Type |
|---|---|---|
| Somatic mutations | _mutations | MUTATION_EXTENDED |
| Discrete CNA (GISTIC) | _gistic | COPY_NUMBER_ALTERATION |
| Continuous CNA (log2) | _log2CNA | COPY_NUMBER_ALTERATION |
| RNA-seq (log2 RSEM) | _rna_seq_v2_mrna_median_normed_log2 | MRNA_EXPRESSION |
| Methylation | _methylation_hm27 or _hm450 | METHYLATION |

Not all studies have all profile types. Always verify with GET /molecular-profiles?studyId={studyId}.
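Rather than guessing a suffix from the conventions above, a profile can be selected by its alteration type from the profile list. The following is a minimal sketch, assuming the list has the shape returned by GET /molecular-profiles (dicts with molecularProfileId and molecularAlterationType). The helper name find_profiles is hypothetical, and the list is trimmed toy data.

```python
def find_profiles(profiles, alteration_type, suffix=None):
    """Return profile IDs with the given alteration type, optionally by suffix."""
    ids = [p["molecularProfileId"] for p in profiles
           if p.get("molecularAlterationType") == alteration_type]
    if suffix is not None:
        ids = [pid for pid in ids if pid.endswith(suffix)]
    return ids


# Trimmed toy list in the shape of a molecular-profiles response
toy_profiles = [
    {"molecularProfileId": "brca_tcga_pan_can_atlas_2018_mutations",
     "molecularAlterationType": "MUTATION_EXTENDED"},
    {"molecularProfileId": "brca_tcga_pan_can_atlas_2018_gistic",
     "molecularAlterationType": "COPY_NUMBER_ALTERATION"},
    {"molecularProfileId": "brca_tcga_pan_can_atlas_2018_log2CNA",
     "molecularAlterationType": "COPY_NUMBER_ALTERATION"},
]

print(find_profiles(toy_profiles, "COPY_NUMBER_ALTERATION", suffix="_gistic"))
# ['brca_tcga_pan_can_atlas_2018_gistic']
```

An empty result signals that the study lacks that profile type, which is easier to handle than a 404 from a guessed ID.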

Entrez Gene IDs

The REST API mutation and molecular data endpoints require Entrez Gene IDs (integers), not Hugo symbols. Use GET /genes/{hugoSymbol} or POST /genes/fetch to resolve symbols to IDs before batch queries.
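A small sketch of turning a gene lookup response into a symbol-to-ID map while flagging symbols that did not resolve. It assumes the response is a list of dicts with hugoGeneSymbol and entrezGeneId, as used elsewhere in this document. The helper name build_gene_map is hypothetical, and the response below is hand-written toy data.

```python
def build_gene_map(gene_list, requested_symbols):
    """Return ({symbol: entrez_id}, [unresolved symbols])."""
    gene_map = {g["hugoGeneSymbol"]: int(g["entrezGeneId"]) for g in gene_list}
    missing = [s for s in requested_symbols if s not in gene_map]
    return gene_map, missing


# Hand-written toy response in the shape of a gene lookup result
toy_response = [
    {"hugoGeneSymbol": "TP53", "entrezGeneId": 7157},
    {"hugoGeneSymbol": "BRCA1", "entrezGeneId": 672},
]
gene_map, missing = build_gene_map(toy_response, ["TP53", "BRCA1", "NOT_A_GENE"])
print(gene_map)  # {'TP53': 7157, 'BRCA1': 672}
print(missing)   # ['NOT_A_GENE']
```

Checking the missing list before a batch query helps distinguish a misspelled symbol from a gene that simply has no alterations.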

Common Workflows

Workflow 1: Somatic Mutation Landscape for a Gene Panel

Goal: Retrieve mutations for multiple cancer driver genes across an entire TCGA study and export to CSV.

import requests, time
import pandas as pd

BASE_URL = "https://www.cbioportal.org/api"

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=120)
    r.raise_for_status()
    return r.json()

study_id = "luad_tcga_pan_can_atlas_2018"
profile_id = f"{study_id}_mutations"

# Resolve gene symbols to Entrez IDs.
# The request body is a plain JSON list of IDs; geneIdType is a query parameter.
gene_symbols = ["KRAS", "EGFR", "TP53", "BRAF", "STK11", "KEAP1", "RB1"]
gene_list = cbio_post("genes/fetch?geneIdType=HUGO_GENE_SYMBOL", gene_symbols)
gene_map = {g["entrezGeneId"]: g["hugoGeneSymbol"] for g in gene_list}
entrez_ids = list(gene_map.keys())

# Fetch all samples
samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
sample_ids = [s["sampleId"] for s in samples]
print(f"Study: {study_id} — {len(sample_ids)} samples")

# Batch mutations in chunks of 500 samples to avoid timeouts
chunk_size = 500
all_muts = []
for i in range(0, len(sample_ids), chunk_size):
    chunk = sample_ids[i:i + chunk_size]
    body = {"sampleIds": chunk, "entrezGeneIds": entrez_ids}
    muts = cbio_post(f"molecular-profiles/{profile_id}/mutations/fetch", body)
    all_muts.extend(muts)
    time.sleep(0.1)

mut_df = pd.DataFrame(all_muts)
mut_df["hugoSymbol"] = mut_df["entrezGeneId"].map(gene_map)
print(f"Total mutations: {len(mut_df)}")
print("\nMutation counts per gene:")
print(mut_df.groupby("hugoSymbol")["sampleId"].nunique()
      .sort_values(ascending=False).to_string())

mut_df.to_csv(f"{study_id}_driver_mutations.csv", index=False)
print(f"\nSaved: {study_id}_driver_mutations.csv")

Workflow 2: Survival Analysis — CNA Status vs. Overall Survival

Goal: Compare overall survival between patients with ERBB2 amplification and patients with diploid ERBB2 in TCGA BRCA.

import requests
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

BASE_URL = "https://www.cbioportal.org/api"

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=60)
    r.raise_for_status()
    return r.json()

study_id = "brca_tcga_pan_can_atlas_2018"
cna_profile_id = f"{study_id}_gistic"

# Get all samples
samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
sample_ids = [s["sampleId"] for s in samples]

# Fetch ERBB2 CNA (Entrez 2064)
cna_data = cbio_post(
    f"molecular-profiles/{cna_profile_id}/molecular-data/fetch",
    {"sampleIds": sample_ids, "entrezGeneIds": [2064]}
)
cna_df = pd.DataFrame(cna_data)[["sampleId", "value"]].rename(columns={"value": "erbb2_cna"})
cna_df["erbb2_cna"] = cna_df["erbb2_cna"].astype(int)
cna_df["erbb2_status"] = cna_df["erbb2_cna"].map(
    {2: "Amplified", 1: "Gain", 0: "Diploid", -1: "Loss", -2: "Deep Deletion"})

# Fetch all patient-level clinical data (includes OS_STATUS and OS_MONTHS)
clinical = cbio_get(f"studies/{study_id}/clinical-data",
                    params={"clinicalDataType": "PATIENT", "projection": "DETAILED"})
clin_df = pd.DataFrame(clinical)
clin_pivot = clin_df.pivot_table(
    index="patientId", columns="clinicalAttributeId", values="value", aggfunc="first"
).reset_index()

# Map samples to patients
sample_patient = cbio_get(f"studies/{study_id}/samples", params={"projection": "DETAILED"})
sp_df = pd.DataFrame(sample_patient)[["sampleId", "patientId"]]

# Merge CNA + clinical via patient ID
merged = (cna_df
          .merge(sp_df, on="sampleId")
          .merge(clin_pivot[["patientId", "OS_STATUS", "OS_MONTHS"]],
                 on="patientId", how="inner"))
merged = merged.dropna(subset=["OS_STATUS", "OS_MONTHS"])
merged["OS_MONTHS"] = pd.to_numeric(merged["OS_MONTHS"], errors="coerce")
merged["event"] = (merged["OS_STATUS"] == "1:DECEASED").astype(int)

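# Kaplan-Meier estimate (manual step function). Only deaths (event == 1)
# lower the curve; censored patients just leave the risk set.
def km_curve(df, time_col="OS_MONTHS", event_col="event"):
    d = df.dropna(subset=[time_col]).sort_values(time_col)
    surv = []
    s = 1.0
    n_at_risk = len(d)
    for t, grp in d.groupby(time_col):
        deaths = int(grp[event_col].sum())
        if deaths:
            s *= 1 - deaths / n_at_risk
        surv.append((t, s))
        n_at_risk -= len(grp)
    return surv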

fig, ax = plt.subplots(figsize=(8, 5))
colors = {"Amplified": "#C0392B", "Diploid": "#2980B9"}
for status, color in colors.items():
    grp = merged[merged["erbb2_status"] == status]
    if len(grp) < 10:
        continue
    km = km_curve(grp)
    times = [0] + [x[0] for x in km]
    surv  = [1.0] + [x[1] for x in km]
    ax.step(times, surv, where="post", color=color,
            label=f"ERBB2 {status} (n={len(grp)})", lw=2)

ax.set_xlabel("Overall Survival (months)")
ax.set_ylabel("Survival Probability")
ax.set_title("ERBB2 CNA Status vs. Overall Survival\nTCGA BRCA (PanCancer Atlas)")
ax.legend()
ax.set_ylim(0, 1.05)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("erbb2_survival.png", dpi=150, bbox_inches="tight")
print(f"Saved erbb2_survival.png")
print(f"ERBB2 Amplified: {(merged['erbb2_status']=='Amplified').sum()} samples")
print(f"ERBB2 Diploid:   {(merged['erbb2_status']=='Diploid').sum()} samples")
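For reference, the Kaplan-Meier estimate multiplies the survival probability by (1 - d/n) at each time where d deaths occur among the n patients still at risk. Censored patients leave the risk set without causing a drop. Below is a dependency-free sketch on invented toy data; kaplan_meier is a hypothetical helper, and for real analyses a dedicated survival library would normally be preferable.

```python
from itertools import groupby


def kaplan_meier(times, events):
    """Return [(time, survival)] steps; events are 1 = death, 0 = censored."""
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    for t, group in groupby(pairs, key=lambda p: p[0]):
        group = list(group)
        deaths = sum(e for _, e in group)
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= len(group)  # deaths and censored patients both leave the risk set
    return curve


# Four toy patients; the patient at t=2 is censored
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Treating the censored patient as a death would instead give a drop at t=2, which is why the event indicator has to enter the calculation.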

Workflow 3: Multi-Study Alteration Frequency Heatmap

Goal: Build a gene × cancer-type alteration frequency matrix across TCGA studies.

import requests, time
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

BASE_URL = "https://www.cbioportal.org/api"

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=90)
    r.raise_for_status()
    return r.json()

STUDIES = {
    "brca_tcga_pan_can_atlas_2018": "BRCA",
    "luad_tcga_pan_can_atlas_2018": "LUAD",
    "coad_tcga_pan_can_atlas_2018": "COAD",
    "gbm_tcga_pan_can_atlas_2018":  "GBM",
}
GENE_SYMBOLS = ["TP53", "KRAS", "PIK3CA", "EGFR", "PTEN"]

# Resolve genes. The request body is a plain JSON list of IDs;
# geneIdType is a query parameter.
gene_list = cbio_post("genes/fetch?geneIdType=HUGO_GENE_SYMBOL", GENE_SYMBOLS)
gene_map = {g["entrezGeneId"]: g["hugoGeneSymbol"] for g in gene_list}
entrez_ids = list(gene_map.keys())

freq_matrix = pd.DataFrame(index=GENE_SYMBOLS, columns=list(STUDIES.values()), dtype=float)

for study_id, label in STUDIES.items():
    try:
        samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
        sample_ids = [s["sampleId"] for s in samples]
        n_total = len(sample_ids)
        profile_id = f"{study_id}_mutations"
        body = {"sampleIds": sample_ids, "entrezGeneIds": entrez_ids}
        muts = cbio_post(f"molecular-profiles/{profile_id}/mutations/fetch", body)
        mut_df = pd.DataFrame(muts) if muts else pd.DataFrame()
        for eid, symbol in gene_map.items():
            if mut_df.empty:
                freq_matrix.loc[symbol, label] = 0.0
            else:
                n_mut = mut_df[mut_df["entrezGeneId"] == eid]["sampleId"].nunique()
                freq_matrix.loc[symbol, label] = n_mut / n_total * 100
        time.sleep(0.2)
    except Exception as e:
        print(f"  {label}: {e}")

freq_matrix = freq_matrix.fillna(0).astype(float)

fig, ax = plt.subplots(figsize=(7, 4))
im = ax.imshow(freq_matrix.values, cmap="YlOrRd", aspect="auto", vmin=0, vmax=80)
ax.set_xticks(range(len(freq_matrix.columns)))
ax.set_xticklabels(freq_matrix.columns, rotation=30, ha="right")
ax.set_yticks(range(len(freq_matrix.index)))
ax.set_yticklabels(freq_matrix.index)
for i in range(len(freq_matrix.index)):
    for j in range(len(freq_matrix.columns)):
        val = freq_matrix.iloc[i, j]
        ax.text(j, i, f"{val:.0f}%", ha="center", va="center", fontsize=9,
                color="white" if val > 40 else "black")
plt.colorbar(im, ax=ax, label="Mutation Frequency (%)")
ax.set_title("Somatic Mutation Frequency — TCGA PanCancer Atlas")
plt.tight_layout()
plt.savefig("mutation_frequency_heatmap.png", dpi=150, bbox_inches="tight")
print("Saved mutation_frequency_heatmap.png")
print(freq_matrix.to_string())

Key Parameters

| Parameter | Function/Endpoint | Default | Range / Options | Effect |
|---|---|---|---|---|
| studyId | All study endpoints | (none) | any valid cBioPortal study ID | Selects the cancer study |
| molecularProfileId | mutations/fetch, molecular-data/fetch | (none) | {studyId}_mutations, {studyId}_gistic, etc. | Selects the data type profile |
| entrezGeneIds | mutations/fetch, molecular-data/fetch | (none) | list of integer Entrez IDs | Genes to query; use POST /genes/fetch to resolve symbols |
| sampleIds | mutations/fetch, molecular-data/fetch | (none) | list of sample ID strings | Samples to retrieve; use GET /studies/{id}/samples for all |
| clinicalDataType | clinical-data | "SAMPLE" | "SAMPLE", "PATIENT" | Whether to return sample-level or patient-level clinical attributes |
| projection | samples, clinical-data | "SUMMARY" | "ID", "SUMMARY", "DETAILED", "META" | Response verbosity; "ID" is fastest for ID-only fetches |
| keyword | studies | "" | free text | Filter studies by name/cancer type keyword |

Best Practices

  1. Fetch sample IDs before data queries: All mutation and molecular data endpoints require explicit sampleIds. Retrieve them with GET /studies/{studyId}/samples?projection=ID before each query.

  2. Verify profile IDs from the API: Profile IDs are not guaranteed to follow the _mutations / _gistic pattern in every study. Always confirm with GET /molecular-profiles?studyId={studyId} rather than guessing.

  3. Chunk large sample sets: The API can time out on requests with thousands of sample IDs. Batch requests in chunks of 500 samples with time.sleep(0.1) between chunks.

  4. Use Entrez IDs, not Hugo symbols, in data fetch endpoints: The mutation and molecular data endpoints accept entrezGeneIds (integers). Resolve symbols first with POST /genes/fetch.

  5. Don't hard-code Entrez IDs: Gene IDs can be looked up dynamically via the API. Hard-coded IDs become incorrect if the gene model changes. Use POST /genes/fetch to resolve gene symbols at runtime.
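The chunking practice above can be factored into a small generator. This is a minimal sketch; chunked is a hypothetical helper name and the IDs are placeholders, not real sample IDs.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


toy_ids = [f"SAMPLE-{i}" for i in range(1200)]  # placeholder IDs
print([len(c) for c in chunked(toy_ids, 500)])  # [500, 500, 200]
```

Each yielded chunk can be passed as the sampleIds value of one fetch request, with a short pause between requests.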

Common Recipes

Recipe: List All Molecular Profiles for a Study

When to use: Before running any data query — verify which profile IDs are available.

import requests

BASE_URL = "https://www.cbioportal.org/api"

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

study_id = "brca_tcga_pan_can_atlas_2018"
profiles = cbio_get("molecular-profiles", params={"studyId": study_id})
for p in profiles:
    print(f"{p['molecularProfileId']:55s}  {p['molecularAlterationType']}")
# brca_tcga_pan_can_atlas_2018_mutations          MUTATION_EXTENDED
# brca_tcga_pan_can_atlas_2018_gistic             COPY_NUMBER_ALTERATION
# brca_tcga_pan_can_atlas_2018_log2CNA            COPY_NUMBER_ALTERATION
# brca_tcga_pan_can_atlas_2018_rna_seq_v2_mrna_median_normed_log2  MRNA_EXPRESSION

Recipe: Download Full Mutation MAF for a Study

When to use: Export all somatic mutations from a study into MAF-compatible format for downstream analysis.

import requests, time
import pandas as pd

BASE_URL = "https://www.cbioportal.org/api"

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=60)
    r.raise_for_status()
    return r.json()

def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=120)
    r.raise_for_status()
    return r.json()

study_id = "coad_tcga_pan_can_atlas_2018"
profile_id = f"{study_id}_mutations"

samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
sample_ids = [s["sampleId"] for s in samples]

all_mutations = []
for i in range(0, len(sample_ids), 300):
    chunk = sample_ids[i:i + 300]
    muts = cbio_post(f"molecular-profiles/{profile_id}/mutations/fetch",
                     {"sampleIds": chunk, "entrezGeneIds": []})  # empty = all genes
    all_mutations.extend(muts)
    time.sleep(0.1)

mut_df = pd.DataFrame(all_mutations)
cols = ["hugoGeneSymbol", "sampleId", "chr", "startPosition", "endPosition",
        "referenceAllele", "variantAllele", "mutationType",
        "proteinChange", "variantType"]
available = [c for c in cols if c in mut_df.columns]
mut_df[available].to_csv(f"{study_id}_mutations.csv", index=False)
print(f"Saved {len(mut_df)} mutations → {study_id}_mutations.csv")

Recipe: Query Patient-Level Clinical Attribute

When to use: Extract a specific clinical variable (e.g., tumor stage, age at diagnosis) for all patients.

import requests
import pandas as pd

BASE_URL = "https://www.cbioportal.org/api"

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

study_id = "brca_tcga_pan_can_atlas_2018"

# Fetch a specific clinical attribute for all patients
attr_id = "TUMOR_STAGE"
clinical = cbio_get(f"studies/{study_id}/clinical-data",
                    params={"clinicalDataType": "PATIENT",
                            "projection": "DETAILED"})

clin_df = pd.DataFrame(clinical)
if "clinicalAttributeId" in clin_df.columns:
    stage_df = clin_df[clin_df["clinicalAttributeId"] == attr_id][["patientId", "value"]]
    print(f"Patients with {attr_id} annotation: {len(stage_df)}")
    print(stage_df["value"].value_counts().head(10).to_string())

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| 404 Not Found on profile endpoint | Molecular profile does not exist for study | List profiles with GET /molecular-profiles?studyId={id}; confirm the profile ID |
| Empty mutations list | Gene has no mutations in the selected samples/profile | Verify study has a mutation profile; check sample IDs belong to the same study |
| requests.exceptions.Timeout | Large sample set (>1000) in a single request | Chunk requests to 300–500 samples; increase timeout to 120s |
| entrezGeneIds key error in response | Hugo symbol passed instead of Entrez ID | Use POST /genes/fetch to resolve symbols to integer Entrez IDs first |
| CNA values returned as strings | value field is string in JSON | Cast with pd.to_numeric() or int(value) before comparison |
| Expression profile not found | Study uses non-standard profile naming | Check profile list; look for MRNA_EXPRESSION alteration type in GET /molecular-profiles |
| Survival analysis has many NA values | Clinical attribute absent for some patients | Use dropna() on OS columns; check attribute availability with GET /studies/{id}/clinical-attributes |
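For the CNA casting issue listed above, a tolerant decoder avoids errors on string or missing values. This is a minimal sketch using the label mapping from the CNA query in this document; decode_cna is a hypothetical helper, and the input values are invented.

```python
CNA_LABELS = {2: "AMP", 1: "GAIN", 0: "DIPLOID", -1: "LOSS", -2: "HOMDEL"}


def decode_cna(value):
    """Map a discrete CNA value (int, float, or numeric string) to a label."""
    try:
        code = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None  # missing or non-numeric input, e.g. None or "NA"
    return CNA_LABELS.get(code)


print([decode_cna(v) for v in ["2", -2, "0.0", "NA", None]])
# ['AMP', 'HOMDEL', 'DIPLOID', None, None]
```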

Related Skills

  • gnomad-database: population variant allele frequencies for healthy cohorts (complement to cBioPortal somatic data)
  • cnvkit-copy-number: CNVkit pipeline for generating SEG/CNA files that can be loaded into cBioPortal
  • pydeseq2-differential-expression: differential expression analysis that can be applied to cBioPortal RNA-seq exports
