SciAgent-Skills arboreto-grn-inference
Gene regulatory network inference from expression data using GRNBoost2 (gradient boosting) or GENIE3 (Random Forest). Load expression matrix, optionally filter by transcription factors, infer TF-target-importance links, filter and save network. Dask-parallelized for single-cell scale. Core component of the SCENIC pipeline.
```shell
git clone https://github.com/jaechang-hits/SciAgent-Skills
```

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/arboreto-grn-inference" ~/.claude/skills/jaechang-hits-sciagent-skills-arboreto-grn-inference && rm -rf "$T"
```
skills/genomics-bioinformatics/arboreto-grn-inference/SKILL.md

Arboreto GRN Inference
Overview
Arboreto infers gene regulatory networks (GRNs) from gene expression data using parallelized tree-based regression. For each target gene, it trains a regression model with all other genes (or a specified TF list) as features and emits TF-target-importance triplets. It provides two interchangeable algorithms -- GRNBoost2 (gradient boosting, fast) and GENIE3 (Random Forest, classic) -- sharing identical input/output formats. Computation is Dask-parallelized, scaling from laptop cores to HPC clusters.
When to Use
- Inferring transcription factor-to-target gene regulatory relationships from bulk RNA-seq expression data
- Building gene regulatory networks from single-cell RNA-seq count matrices (cells as rows, genes as columns)
- Generating the adjacency matrix (Step 1) of the pySCENIC regulatory analysis pipeline
- Comparing regulatory network structure across experimental conditions (e.g., control vs treatment)
- Producing consensus regulatory networks by running inference across multiple random seeds
- Validating GRN results by comparing GRNBoost2 and GENIE3 outputs on the same dataset
- For downstream regulon identification and activity scoring, use arboreto output with pySCENIC
- For single-cell preprocessing (QC, normalization, clustering) before GRN inference, use scanpy-scrna-seq
Prerequisites
- Python packages: `arboreto`, `pandas`, `numpy`, `dask`, `distributed`, `scikit-learn`, `scipy`
- Data requirements: Gene expression matrix (genes as columns, observations as rows) in TSV/CSV; optionally a TF list file (one gene name per line)
- Environment: Python 3.8+; optional `networkx` and `matplotlib` for visualization

```shell
pip install arboreto distributed networkx matplotlib
```
Quick Start
Complete GRN inference in a single block. The `if __name__ == '__main__':` guard is required because Dask spawns worker processes via multiprocessing.

```python
import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names

if __name__ == '__main__':
    # Load expression matrix (observations x genes)
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
    tf_names = load_tf_names('tf_list.txt')  # optional TF filter

    # Infer GRN (uses all local cores by default)
    network = grnboost2(expression_data=expression_matrix,
                        tf_names=tf_names,
                        seed=777)

    # Filter top links and save
    top_network = network[network['importance'] > 1.0]
    top_network.to_csv('grn_output.tsv', sep='\t', index=False, header=False)
    print(f"Inferred {len(network)} links, kept {len(top_network)} above threshold")
    # Example: Inferred 185432 links, kept 12876 above threshold
```
Workflow
Step 1: Load Expression Data
Arboreto accepts a pandas DataFrame (recommended) or NumPy array. Rows are observations (cells or samples), columns are genes. Gene names must be column headers.
```python
import pandas as pd

# From TSV (genes as columns, observations as rows)
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
print(f"Shape: {expression_matrix.shape} "
      f"({expression_matrix.shape[0]} observations x {expression_matrix.shape[1]} genes)")
# Example: Shape: (5000, 18654) (5000 observations x 18654 genes)

# From AnnData (e.g., after scanpy preprocessing)
import anndata as ad

adata = ad.read_h5ad('preprocessed.h5ad')
expression_matrix = pd.DataFrame(
    adata.X.toarray() if hasattr(adata.X, 'toarray') else adata.X,
    columns=adata.var_names.tolist()
)
print(f"Converted AnnData: {expression_matrix.shape}")
# Example: Converted AnnData: (5000, 18654)
```
Step 2: Load Transcription Factor List
Providing a TF list restricts regulators to known transcription factors, reducing computation time and improving biological relevance. If omitted, all genes are treated as potential regulators.
```python
from arboreto.utils import load_tf_names

# From file (one TF name per line)
tf_names = load_tf_names('human_tfs.txt')
print(f"Loaded {len(tf_names)} transcription factors")
# Example: Loaded 1639 transcription factors

# Or define directly
tf_names = ['MYC', 'TP53', 'SOX2', 'NANOG', 'POU5F1']

# Verify TFs exist in expression matrix columns
tf_in_data = [tf for tf in tf_names if tf in expression_matrix.columns]
print(f"TFs found in expression data: {len(tf_in_data)}/{len(tf_names)}")
# Example: TFs found in expression data: 1583/1639
```
Step 3: Configure Dask Client (Optional)
By default arboreto creates an internal Dask client using all local cores. Create an explicit client for resource control, monitoring, or cluster deployment.
```python
from distributed import LocalCluster, Client

# Custom local client with resource limits
local_cluster = LocalCluster(
    n_workers=8,
    threads_per_worker=1,  # avoid GIL contention in scikit-learn
    memory_limit='4GB'     # per worker
)
client = Client(local_cluster)
print(f"Dashboard: {client.dashboard_link}")
# Example: Dashboard: http://127.0.0.1:8787/status
```
Step 4: Run GRN Inference
Call `grnboost2()` (recommended) or `genie3()`. Both share the same signature and output format.

```python
from arboreto.algo import grnboost2

if __name__ == '__main__':
    network = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,         # 'all' if no TF list
        client_or_address=client,  # omit to use default local scheduler
        seed=777,                  # for reproducibility
        verbose=True               # print progress
    )
    print(f"Inferred {len(network)} regulatory links")
    print(network.head())
    # Example:
    #      TF   target  importance
    # 0   MYC     CDK4       3.214
    # 1   MYC    CCND1       2.871
    # 2  TP53   CDKN1A       2.654
```
Step 5: Filter Results
Raw output contains links for every TF-target pair with non-zero importance. Filter to retain high-confidence regulatory interactions.
```python
# Strategy 1: Importance threshold
threshold = 1.0
filtered = network[network['importance'] > threshold]
print(f"Threshold {threshold}: {len(filtered)} links "
      f"({len(filtered)/len(network)*100:.1f}% retained)")
# Example: Threshold 1.0: 12876 links (6.9% retained)

# Strategy 2: Top N links per target gene
top_n = 10
top_per_target = (network.groupby('target')
                  .apply(lambda g: g.nlargest(top_n, 'importance'))
                  .reset_index(drop=True))
print(f"Top {top_n} per target: {len(top_per_target)} links")
# Example: Top 10 per target: 186540 links
```
Step 6: Save Network
Save as a TSV file with three columns: TF, target, importance.
```python
# Save full network
network.to_csv('full_network.tsv', sep='\t', index=False)
print(f"Saved full network: {len(network)} links")

# Save filtered network (without header for pySCENIC compatibility)
filtered.to_csv('filtered_network.tsv', sep='\t', index=False, header=False)
print(f"Saved filtered network: {len(filtered)} links")

# Clean up Dask client if explicitly created
client.close()
local_cluster.close()
```
Step 7: Visualize Results (Optional)
Plot the top regulatory interactions as a directed network graph.
```python
import networkx as nx
import matplotlib.pyplot as plt

# Build directed graph from top links
top_links = network.nlargest(50, 'importance')
G = nx.from_pandas_edgelist(
    top_links, source='TF', target='target',
    edge_attr='importance', create_using=nx.DiGraph()
)

# Draw network
fig, ax = plt.subplots(figsize=(12, 10))
pos = nx.spring_layout(G, k=2, seed=42)
nx.draw_networkx_nodes(G, pos, node_size=300, node_color='lightblue', ax=ax)
nx.draw_networkx_labels(G, pos, font_size=7, ax=ax)
edges = nx.draw_networkx_edges(
    G, pos,
    edge_color=[G[u][v]['importance'] for u, v in G.edges()],
    edge_cmap=plt.cm.Reds, width=1.5, arrows=True, ax=ax
)
plt.colorbar(edges, ax=ax, label='Importance')
ax.set_title('Top 50 Regulatory Interactions')
plt.tight_layout()
plt.savefig('grn_network.png', dpi=300, bbox_inches='tight')
print("Saved grn_network.png")
```
Key Parameters
| Parameter | Default | Range / Options | Effect |
|---|---|---|---|
| `expression_data` | (required) | DataFrame or ndarray | Expression matrix, observations x genes |
| `tf_names` | `'all'` | list of str or `'all'` | Restrict regulators to known TFs; `'all'` uses every gene |
| `gene_names` | `None` | list of str | Required when `expression_data` is a NumPy array |
| `client_or_address` | `'local_host'` | Client, str, or `'local_host'` | Dask client instance or scheduler address |
| `seed` | `None` | int | Random seed for reproducibility; always set for replicable results |
| `verbose` | `False` | bool | Print progress messages during inference |
Key Concepts
Algorithm Selection Guide
Both algorithms follow the same multiple-regression strategy (train one model per target gene, extract feature importances) but differ in the underlying regressor.
| Feature | GRNBoost2 | GENIE3 |
|---|---|---|
| Method | Stochastic gradient boosting with early stopping | Random Forest (or ExtraTrees) |
| Speed | Fast -- optimized for large datasets | Slower -- higher per-model cost |
| Memory | Lower (early stopping limits tree depth) | Higher (full forests per target) |
| Best for | Default choice; 10k+ observations, single-cell scale | Comparison with published GENIE3 results; validation |
| Import | `from arboreto.algo import grnboost2` | `from arboreto.algo import genie3` |
| Output format | TF, target, importance (identical) | TF, target, importance (identical) |
Decision rule: Start with GRNBoost2. Use GENIE3 only when reproducing published GENIE3 analyses or as an independent validation of GRNBoost2 results.
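One lightweight way to run such a cross-algorithm validation is to compare the top-ranked (TF, target) pairs from the two outputs. A minimal sketch with plain pandas, using made-up toy link tables and a hypothetical `top_link_overlap` helper (not part of arboreto):

```python
import pandas as pd

def top_link_overlap(net_a: pd.DataFrame, net_b: pd.DataFrame, n: int = 1000) -> float:
    """Jaccard overlap between the top-n (TF, target) pairs of two networks."""
    pairs_a = set(map(tuple, net_a.nlargest(n, 'importance')[['TF', 'target']].values))
    pairs_b = set(map(tuple, net_b.nlargest(n, 'importance')[['TF', 'target']].values))
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)

# Toy link tables standing in for grnboost2() and genie3() output
grnboost2_net = pd.DataFrame({'TF': ['A', 'A', 'B'], 'target': ['x', 'y', 'x'],
                              'importance': [3.0, 2.0, 1.0]})
genie3_net = pd.DataFrame({'TF': ['A', 'B', 'B'], 'target': ['x', 'x', 'y'],
                           'importance': [2.5, 2.0, 0.5]})
print(top_link_overlap(grnboost2_net, genie3_net, n=2))  # 1 shared of 3 pairs -> 0.333...
```

A high overlap between GRNBoost2 and GENIE3 top links suggests the ranking is driven by the data rather than by the choice of regressor.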
Output Format
The network DataFrame has three columns, sorted by descending importance:
| Column | Type | Description |
|---|---|---|
| `TF` | str | Transcription factor (regulator) gene name |
| `target` | str | Target gene name |
| `importance` | float | Regulatory importance score (higher = stronger predicted regulation) |
Importance scores are not p-values. They represent feature importance from the tree-based model. Filter by absolute threshold or top-N per target for downstream analysis.
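Because importance scores have no fixed scale, a quantile-based cutoff can be a more portable alternative to an absolute threshold like 1.0. A minimal sketch on a made-up toy network table:

```python
import pandas as pd

# Hypothetical network table; real output comes from grnboost2()/genie3()
network = pd.DataFrame({
    'TF':         ['MYC', 'MYC', 'TP53', 'TP53', 'SOX2'],
    'target':     ['CDK4', 'CCND1', 'CDKN1A', 'MDM2', 'NANOG'],
    'importance': [3.2, 2.9, 2.7, 0.4, 0.1],
})

# Keep links above the 90th percentile of importance rather than a fixed
# cutoff; quantile thresholds adapt to each run's score distribution
cutoff = network['importance'].quantile(0.90)
filtered = network[network['importance'] > cutoff]
print(f"Cutoff {cutoff:.2f}: kept {len(filtered)} of {len(network)} links")
```

The same idea extends to per-target quantiles when some targets have systematically higher scores than others.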
The `if __name__ == '__main__':` Requirement

Dask uses Python's `multiprocessing` module to spawn worker processes. Without the main guard, each spawned process re-executes the script, leading to infinite process spawning. Always wrap arboreto calls in `if __name__ == '__main__':` when running as a script. This is not needed inside Jupyter notebooks.
Common Recipes
Recipe: Distributed Computing on a Cluster
When to use: datasets too large for a single machine (e.g., 50k+ cells, 20k+ genes). Requires a Dask distributed scheduler running on the cluster.
```shell
# On head node: start scheduler
dask-scheduler
# Output: Scheduler at tcp://10.118.224.134:8786

# On each compute node: start workers
dask-worker tcp://10.118.224.134:8786 --nprocs 4 --nthreads 1 --memory-limit 16GB
```
```python
from distributed import Client
from arboreto.algo import grnboost2
import pandas as pd

if __name__ == '__main__':
    client = Client('tcp://10.118.224.134:8786')
    print(f"Connected to cluster: {len(client.scheduler_info()['workers'])} workers")
    # Example: Connected to cluster: 16 workers

    expression_data = pd.read_csv('large_scrna.tsv', sep='\t')
    tf_names = pd.read_csv('tf_list.txt', header=None)[0].tolist()

    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        client_or_address=client,
        seed=42,
        verbose=True
    )
    network.to_csv('cluster_grn.tsv', sep='\t', index=False)
    print(f"Cluster inference complete: {len(network)} links")
    client.close()
```
Recipe: Consensus Network from Multiple Seeds
When to use: improve robustness by aggregating networks from multiple random seeds. Links appearing consistently across runs are more reliable.
```python
import pandas as pd
from distributed import LocalCluster, Client
from arboreto.algo import grnboost2

if __name__ == '__main__':
    client = Client(LocalCluster(n_workers=8, threads_per_worker=1))
    expression_data = pd.read_csv('expression_data.tsv', sep='\t')
    tf_names = pd.read_csv('tf_list.txt', header=None)[0].tolist()

    seeds = [42, 123, 456, 789, 1001]
    all_networks = []
    for seed in seeds:
        net = grnboost2(expression_data=expression_data, tf_names=tf_names,
                        client_or_address=client, seed=seed)
        net['seed'] = seed
        all_networks.append(net)
        print(f"Seed {seed}: {len(net)} links")

    combined = pd.concat(all_networks, ignore_index=True)

    # Consensus: mean importance across seeds, keep links found in 3+ runs
    consensus = (combined.groupby(['TF', 'target'])
                 .agg(mean_importance=('importance', 'mean'),
                      n_seeds=('seed', 'nunique'))
                 .reset_index())
    consensus = consensus[consensus['n_seeds'] >= 3]
    consensus = consensus.sort_values('mean_importance', ascending=False)

    consensus.to_csv('consensus_network.tsv', sep='\t', index=False)
    print(f"Consensus network: {len(consensus)} links (found in 3+ of {len(seeds)} runs)")
    # Example: Consensus network: 28453 links (found in 3+ of 5 runs)
    client.close()
```
Recipe: pySCENIC Integration
When to use: downstream regulatory analysis after arboreto GRN inference. Arboreto produces the adjacency matrix (Step 1 of SCENIC); pySCENIC performs regulon identification (Step 2) and activity scoring (Step 3).
```python
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names
import pandas as pd

if __name__ == '__main__':
    # Step 1 (arboreto): GRN inference
    expression_matrix = pd.read_csv('scrna_expression.tsv', sep='\t')
    tf_names = load_tf_names('allTFs_hg38.txt')

    adjacencies = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,
        seed=777
    )
    adjacencies.to_csv('adjacencies.tsv', sep='\t', index=False, header=False)
    print(f"SCENIC Step 1 complete: {len(adjacencies)} adjacencies")
    # Example: SCENIC Step 1 complete: 234567 adjacencies

    # Steps 2-3 use pySCENIC CLI (requires pyscenic installation):
    # pyscenic ctx adjacencies.tsv hg38_ranking.feather \
    #   --annotations_fname motifs-v10nr.hgnc-m0.001-o0.0.tbl \
    #   --output regulons.csv
    #
    # pyscenic aucell scrna_expression.tsv regulons.csv \
    #   --output auc_matrix.csv
    print("Next: run pyscenic ctx (Step 2) and pyscenic aucell (Step 3)")
```
Expected Outputs
- `full_network.tsv` -- complete TF-target-importance table; columns: TF, target, importance
- `filtered_network.tsv` -- high-confidence subset after threshold or top-N filtering
- `consensus_network.tsv` -- aggregated network from multi-seed runs; columns: TF, target, mean_importance, n_seeds
- `adjacencies.tsv` -- arboreto output formatted for pySCENIC input (no header)
- `grn_network.png` -- directed graph visualization of top regulatory interactions
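The TSV outputs above can be sanity-checked with the standard library alone before handing them to downstream tools. A sketch with a hypothetical `validate_network_tsv` helper (not part of arboreto), exercised on a tiny headerless file:

```python
import csv
import os
import tempfile

def validate_network_tsv(path: str, expect_header: bool = False) -> int:
    """Check that every row has TF, target, and a numeric importance; return link count."""
    with open(path, newline='') as fh:
        rows = list(csv.reader(fh, delimiter='\t'))
    if expect_header:
        rows = rows[1:]
    for row in rows:
        assert len(row) == 3, f"expected 3 columns, got {len(row)}: {row}"
        float(row[2])  # raises ValueError if importance is not numeric
    return len(rows)

# Round-trip check on a tiny headerless file (pySCENIC-style)
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False) as tmp:
    tmp.write("MYC\tCDK4\t3.214\nTP53\tCDKN1A\t2.654\n")
print(f"Validated {validate_network_tsv(tmp.name)} links")
os.unlink(tmp.name)
```

Running this on `adjacencies.tsv` before `pyscenic ctx` catches truncated writes and accidental headers early.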
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| `RuntimeError` or infinite process spawning | Missing `if __name__ == '__main__':` guard | Wrap all arboreto calls in the main guard; not needed in Jupyter |
| `MemoryError` or killed workers during inference | Expression matrix too large for available RAM | Filter low-variance genes first, reduce to top 5-10k genes, or use a distributed cluster |
| Empty or near-empty network | TF names do not match expression matrix column names | Verify TF names overlap with `expression_matrix.columns`; check case sensitivity |
| Very slow inference | Using GENIE3 on a large dataset, or too few Dask workers | Switch to GRNBoost2; create an explicit Dask client with more workers |
| `ModuleNotFoundError: No module named 'arboreto'` | Package not installed | `pip install arboreto` |
| All importance scores are 0 or identical | Expression matrix has no variance (e.g., all zeros after filtering) | Check data quality; ensure matrix contains non-zero expression values with variance |
| `ValueError` with NumPy array input | Missing `gene_names` parameter | Provide a `gene_names` list matching the column count of the array |
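For the zero-variance case in the table, dropping constant genes before inference both avoids meaningless importances and shrinks the problem. A minimal pandas sketch on a made-up toy matrix:

```python
import pandas as pd

# Hypothetical matrix: gene 'g3' is constant and would yield uninformative importances
expr = pd.DataFrame({'g1': [1.0, 3.0, 2.0],
                     'g2': [0.0, 5.0, 1.0],
                     'g3': [2.0, 2.0, 2.0]})

# Drop zero-variance genes before calling grnboost2()/genie3()
variances = expr.var(axis=0)
expr_filtered = expr.loc[:, variances > 0]
print(list(expr_filtered.columns))  # ['g1', 'g2']
```

Raising the cutoff (e.g., keeping only the top 5-10k most variable genes) addresses the `MemoryError` row in the same way.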
Bundled Resources
This entry is fully self-contained with no `references/` directory.

Original reference file dispositions:
- references/algorithms.md (139 lines) -- Algorithm comparison, parameter details, custom regressor kwargs. Consolidated into Key Concepts (Algorithm Selection Guide table) and Key Parameters. Custom regressor kwargs (`regressor_type`, `regressor_kwargs`) omitted: advanced tuning rarely needed; users can pass scikit-learn kwargs directly per the arboreto API docs.
- references/basic_inference.md (152 lines) -- Input formats, TF loading, basic workflow, output format. Fully consolidated into Workflow Steps 1-2, Key Concepts (Output Format table), and Quick Start.
- references/distributed_computing.md (242 lines) -- Local client, cluster setup, monitoring, performance tips. Consolidated into Workflow Step 3 and Recipe: Distributed Computing. Omitted: Dask dashboard panel descriptions (monitoring UI detail beyond scope), detailed network/storage tuning (cluster-admin concern).
- scripts/basic_grn_inference.py (98 lines) -- CLI wrapper with argparse. Core load-infer-save logic absorbed into Workflow Steps 1-6 and Quick Start. Argparse/CLI boilerplate stripped.
Narrative use-case dispositions (from original Common Use Cases):
- Single-cell RNA-seq analysis -- absorbed into When to Use bullets and Workflow Steps 1/4
- Bulk RNA-seq with TF filtering -- absorbed into Workflow Step 2 and Quick Start
- Comparative analysis (multiple conditions) -- omitted as standalone recipe: trivial loop over conditions, pattern shown in Consensus Network recipe
Related Skills
- scanpy-scrna-seq -- upstream single-cell preprocessing (QC, normalization, HVG selection) before GRN inference
- scvi-tools-single-cell -- deep generative models for batch correction and imputation prior to network inference
- pyscenic (planned) -- downstream regulon identification and cellular activity scoring using arboreto adjacencies
- networkx-graph-analysis -- graph-theoretic analysis of inferred regulatory networks (centrality, communities)
References
- Arboreto GitHub repository -- source code, README, examples
- Arboreto PyPI -- installation and version history
- Moerman et al. (2019) "GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks" Bioinformatics 35(12):2159-2161 -- algorithm paper
- Aibar et al. (2017) "SCENIC: single-cell regulatory network inference and clustering" Nature Methods 14:1083-1086 -- SCENIC pipeline paper
- Dask distributed documentation -- Dask client, cluster setup, dashboard