BioSkills bio-expression-matrix-counts-ingest

Load gene expression count matrices from various formats including CSV, TSV, featureCounts, Salmon, kallisto, and 10X. Use when importing quantification results for downstream analysis.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/expression-matrix/counts-ingest" ~/.claude/skills/gptomics-bioskills-bio-expression-matrix-counts-ingest && rm -rf "$T"
manifest: expression-matrix/counts-ingest/SKILL.md
source content

Version Compatibility

Reference examples tested with: pandas 2.2+, tximport 1.30+, tximeta 1.20+

Before using code patterns, verify installed versions match. If versions differ:

  • Python: run pip show <package>, then help(module.function) to check signatures
  • R: run packageVersion('<pkg>'), then ?function_name to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
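For example, introspecting the installed package from Python (a minimal sketch; pandas and its comment parameter are just the illustration):

```python
import importlib.metadata
import inspect

import pandas as pd

# Confirm the installed version before trusting version-sensitive examples
version = importlib.metadata.version('pandas')
print(f'pandas {version}')

# Introspect the signature instead of guessing: does read_csv accept comment=?
params = inspect.signature(pd.read_csv).parameters
assert 'comment' in params
```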

Count Matrix Ingestion

Expression Units Decision Table

| Unit | Source | Use for DE | Use for visualization | Cross-sample comparison |
|---|---|---|---|---|
| Raw counts | featureCounts, HTSeq, STAR | Yes (DE tools normalize internally) | No | No |
| Estimated counts | Salmon, kallisto (via tximport) | Yes (with offset correction) | No | No |
| TPM | Salmon, kallisto | No | Within-sample gene comparison | Partially (same composition caveat) |
| FPKM/RPKM | Cufflinks, legacy tools | No | Within-sample only | No (composition bias) |
| Normalized counts | DESeq2, edgeR | Via DE tool | Prefer VST/rlog | Yes |

For differential expression, always provide raw integer counts. DE tools (DESeq2, edgeR, limma-voom) model count distributions and handle normalization internally.
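A quick integrity check before handing a matrix to a DE tool (a minimal sketch; the inline matrix is invented):

```python
import pandas as pd

# Toy genes-by-samples matrix (values invented); float dtype is fine as long
# as every value is a whole number, as some pipelines produce
counts = pd.DataFrame(
    {'ctrl': [10.0, 0.0, 5.0], 'treat': [3.0, 7.0, 2.0]},
    index=['geneA', 'geneB', 'geneC'],
)

# Raw counts must be non-negative whole numbers; a TPM/FPKM matrix fails this
assert (counts >= 0).all().all(), 'negative values: not raw counts'
assert (counts % 1 == 0).all().all(), 'fractional values: likely normalized'
counts = counts.round().astype('int64')  # safe cast before DESeq2/edgeR export
```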

Transcript-Level vs Gene-Level Quantification

Salmon and kallisto quantify at the transcript level. Naive summation of transcript counts to gene level introduces length bias: if treatment causes a shift from short to long isoforms, the gene appears upregulated even if the number of mRNA molecules is unchanged. Use tximport in R (or equivalent offset handling in Python) to correct for this.

For detailed tximport workflows, see rna-quantification/tximport-workflow. Key decisions:

  • countsFromAbundance='no' (default): raw estimated counts with length offsets for DESeq2 (recommended)
  • countsFromAbundance='lengthScaledTPM': required for limma-voom, which cannot use offset matrices
  • countsFromAbundance='scaledTPM': prevents differential transcript usage from creating spurious DGE
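The length bias described above can be illustrated with toy arithmetic (all numbers invented):

```python
# One gene, two isoforms, identical mRNA molecule counts in both conditions,
# but an isoform switch from short to long under treatment.
molecules = 100                      # true abundance in BOTH conditions
short_kb, long_kb = 0.5, 2.0         # isoform lengths

# Sequenced reads scale roughly with molecules * transcript length
reads_ctrl = molecules * short_kb    # control: all molecules use short isoform
reads_treat = molecules * long_kb    # treatment: all molecules use long isoform

# Naive count summation reports 4x "upregulation" from length alone;
# tximport's length offsets remove exactly this artifact.
fold_change = reads_treat / reads_ctrl
print(fold_change)  # 4.0
```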

Basic CSV/TSV Loading

Goal: Load a gene expression count matrix from a delimited text file into a pandas DataFrame.

Approach: Read CSV/TSV with gene IDs as the row index, handling comment lines if present.

"Load my count matrix" → Read a delimited file into a DataFrame with genes as rows and samples as columns.

import pandas as pd

# TSV with gene IDs as first column
counts = pd.read_csv('counts.tsv', sep='\t', index_col=0)

# CSV with header
counts = pd.read_csv('counts.csv', index_col=0)

# Skip comment lines
counts = pd.read_csv('counts.txt', sep='\t', index_col=0, comment='#')

featureCounts Output

Goal: Parse featureCounts output into a clean count matrix by stripping metadata columns.

Approach: Set Geneid as the row index, skip the five annotation columns (Chr, Start, End, Strand, Length), and clean BAM path suffixes from the sample column names.

import pandas as pd

# featureCounts format has 6 metadata columns before counts
fc = pd.read_csv('featurecounts.txt', sep='\t', comment='#')
counts = fc.set_index('Geneid').iloc[:, 5:]  # Skip Chr, Start, End, Strand, Length
counts.columns = [c.replace('.bam', '').split('/')[-1] for c in counts.columns]

Salmon Quant Files

Goal: Combine per-sample Salmon quantification files into a single count or TPM matrix.

Approach: Iterate over quant directories, extract the desired column from each quant.sf, and merge into a DataFrame.

import pandas as pd
from pathlib import Path

def load_salmon_quants(quant_dirs, column='NumReads'):
    '''Load multiple Salmon quant.sf files into a count matrix.'''
    dfs = {}
    for qdir in quant_dirs:
        sample = Path(qdir).name
        sf = pd.read_csv(f'{qdir}/quant.sf', sep='\t', index_col=0)
        dfs[sample] = sf[column]
    return pd.DataFrame(dfs)

# Usage
quant_dirs = ['salmon_out/sample1', 'salmon_out/sample2', 'salmon_out/sample3']
counts = load_salmon_quants(quant_dirs, column='NumReads')
tpm = load_salmon_quants(quant_dirs, column='TPM')

STAR ReadsPerGene Files

Goal: Load STAR gene-level count output, selecting the correct strandedness column.

Approach: STAR's ReadsPerGene.out.tab has 4 columns per gene: gene ID, unstranded counts, forward-strand counts, and reverse-strand counts. Picking a stranded column for an unstranded library silently halves the counts, and picking the wrong strand for a stranded library discards most reads.

import pandas as pd
from pathlib import Path

def load_star_genecounts(filepaths, strandedness='reverse'):
    '''Load STAR ReadsPerGene.out.tab files.

    strandedness: 'unstranded' (col 1), 'forward' (col 2), 'reverse' (col 3)
    For Illumina TruSeq stranded libraries, use 'reverse'.
    '''
    col_map = {'unstranded': 1, 'forward': 2, 'reverse': 3}
    col_idx = col_map[strandedness]
    dfs = {}
    for fp in filepaths:
        sample = Path(fp).name.replace('_ReadsPerGene.out.tab', '')
        df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
        dfs[sample] = df.iloc[4:, col_idx - 1]  # Skip first 4 summary rows
    return pd.DataFrame(dfs)

Strandedness verification: compare total counts in columns 2 vs 3 across samples. The column with much larger totals corresponds to the correct strand. If roughly equal, the library is unstranded (use column 1).

HTSeq Count Files

Goal: Load per-sample HTSeq count output into a combined matrix.

Approach: Read each file (two columns: gene ID and count), drop the 5 summary rows at the end (prefixed with __), then merge.

import pandas as pd
from pathlib import Path

def load_htseq_counts(filepaths):
    '''Load multiple HTSeq count files into a matrix.'''
    dfs = {}
    for fp in filepaths:
        sample = Path(fp).stem.replace('_counts', '')
        df = pd.read_csv(fp, sep='\t', header=None, index_col=0, names=['gene', 'count'])
        df = df[~df.index.str.startswith('__')]  # Remove __no_feature, __ambiguous, etc.
        dfs[sample] = df['count']
    return pd.DataFrame(dfs)

kallisto Abundance Files

Goal: Combine per-sample kallisto abundance files into a single count or TPM matrix.

Approach: Read each abundance.tsv, extract the target column, and assemble into a DataFrame keyed by sample name.

import pandas as pd
from pathlib import Path

def load_kallisto_quants(abundance_files, column='est_counts'):
    '''Load multiple kallisto abundance.tsv files.'''
    dfs = {}
    for f in abundance_files:
        sample = Path(f).parent.name
        ab = pd.read_csv(f, sep='\t', index_col=0)
        dfs[sample] = ab[column]
    return pd.DataFrame(dfs)

# Usage
files = ['kallisto_out/sample1/abundance.tsv', 'kallisto_out/sample2/abundance.tsv']
counts = load_kallisto_quants(files, column='est_counts')
tpm = load_kallisto_quants(files, column='tpm')

10X Genomics Sparse Matrix

Goal: Load single-cell count data from 10X Genomics format (MTX directory or H5 file).

Approach: Use scanpy to read the sparse matrix with gene and barcode metadata into an AnnData object.

import scanpy as sc

# Load 10X directory (contains matrix.mtx, genes.tsv/features.tsv, barcodes.tsv)
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')

# Load 10X H5 file
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Convert to dense DataFrame if needed
counts = adata.to_df()

AnnData H5AD Files

Goal: Load preprocessed expression data stored in the AnnData H5AD format.

Approach: Read the H5AD file with scanpy and access the count matrix, raw layer, or metadata as needed.

import anndata as ad
import scanpy as sc

# Load h5ad
adata = sc.read_h5ad('data.h5ad')

# Access count matrix
counts = adata.to_df()  # Dense DataFrame
sparse_counts = adata.X  # Sparse matrix (if stored sparse)

# Access raw counts if normalized data is in .X
raw_counts = adata.raw.to_adata().to_df()

RDS Files (from R)

Goal: Load R-serialized count data into Python.

Approach: Use pyreadr to read RDS files directly into a pandas DataFrame.

import pyreadr

# Read RDS file
result = pyreadr.read_r('counts.rds')
counts = result[None]  # Access the data

# For Seurat objects, use anndata2ri or convert in R first

Combine Multiple Files

Goal: Merge per-sample count files into a single genes-by-samples matrix.

Approach: Glob matching files, read the first data column from each, and concatenate into a DataFrame.

import pandas as pd
from pathlib import Path

def combine_count_files(file_pattern, index_col=0, sep='\t'):
    '''Combine multiple count files into one matrix.'''
    files = sorted(Path('.').glob(file_pattern))
    dfs = {}
    for f in files:
        sample = f.stem.replace('_counts', '')
        dfs[sample] = pd.read_csv(f, sep=sep, index_col=index_col).iloc[:, 0]
    return pd.DataFrame(dfs)

# Usage
counts = combine_count_files('counts/*_counts.tsv')

Filter Low-Count Genes

Goal: Remove genes with insufficient expression to reduce noise and multiple testing burden.

Approach: Apply expression thresholds appropriate for the downstream DE tool. edgeR requires explicit filtering; DESeq2 handles it internally but benefits from a speed-only pre-filter.

edgeR: Design-Aware Filtering (Recommended)

library(edgeR)

# filterByExpr uses smallest group size from design matrix
# Requires CPM above threshold in at least n samples (n = smallest group)
keep <- filterByExpr(y, design=model.matrix(~condition, data=metadata))
y <- y[keep, , keep.lib.sizes=FALSE]

Forgetting filterByExpr in edgeR inflates the multiple testing burden because edgeR's quasi-likelihood pipeline does not have DESeq2's automatic independent filtering.

DESeq2: Speed Pre-Filter

# Manual pre-filter for speed only -- independent filtering at results() handles statistics
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

Python

# Group-aware filter: require min_counts in at least min_samples
min_counts, min_samples = 10, 3
expressed = (counts >= min_counts).sum(axis=1) >= min_samples
counts_filtered = counts.loc[expressed]

# For min_samples, use the smallest experimental group size
# e.g., if 3 controls and 5 treated, use min_samples=3

Handle Gene ID Versions

Goal: Strip Ensembl version suffixes for consistent ID matching across datasets.

Approach: Split on the dot separator and keep only the stable identifier prefix.

# Remove Ensembl version numbers (ENSG00000123456.12 -> ENSG00000123456)
counts.index = counts.index.str.split('.').str[0]

# Or keep as-is for compatibility
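One caveat worth checking (a sketch with invented IDs): version stripping can collide, e.g. GENCODE PAR_Y entries share their base ID with the chrX copy, leaving duplicate index entries:

```python
import pandas as pd

# Toy matrix (IDs and values invented) where two versioned IDs share a base ID
counts = pd.DataFrame(
    {'s1': [5, 3, 7], 's2': [2, 1, 4]},
    index=['ENSG00000001.5', 'ENSG00000001.7', 'ENSG00000002.1'],
)
counts.index = counts.index.str.split('.').str[0]

# After stripping, collapse any duplicated gene IDs by summing their counts
if counts.index.duplicated().any():
    counts = counts.groupby(level=0).sum()
```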

Save Count Matrix

Goal: Export a count matrix in formats suitable for downstream tools or archival.

Approach: Write as TSV, compressed TSV, or AnnData H5AD depending on the target workflow.

# Save as TSV
counts.to_csv('count_matrix.tsv', sep='\t')

# Save as compressed
counts.to_csv('count_matrix.tsv.gz', sep='\t', compression='gzip')

# Save as AnnData
import anndata as ad
adata = ad.AnnData(counts)
adata.write_h5ad('counts.h5ad')

R Loading Equivalents

Goal: Load count data in R from common formats including featureCounts and Salmon/kallisto via tximport.

Approach: Use base R read functions for alignment-based counts, or tximport with a tx2gene mapping for pseudo-alignment outputs (Salmon/kallisto). tximport handles length-bias correction automatically.

# Basic CSV/TSV
counts <- read.csv('counts.csv', row.names=1)
counts <- read.delim('counts.tsv', row.names=1)

# featureCounts
fc <- read.delim('featurecounts.txt', comment.char='#', row.names=1)
counts <- fc[, 6:ncol(fc)]

# tximport for Salmon/kallisto (RECOMMENDED over manual loading)
# Use tx2gene for gene-level summarization with length-offset correction
library(tximport)
tx2gene <- read.csv('tx2gene.csv')  # columns: TXNAME, GENEID
files <- file.path('salmon_out', samples, 'quant.sf')
names(files) <- samples
txi <- tximport(files, type='salmon', tx2gene=tx2gene)

# For DESeq2: use DESeqDataSetFromTximport (handles offsets natively)
# For limma-voom: use countsFromAbundance='lengthScaledTPM' in tximport
# See rna-quantification/tximport-workflow for full details

Related Skills

  • rna-quantification/featurecounts-counting - Generate featureCounts output
  • rna-quantification/alignment-free-quant - Generate Salmon/kallisto output
  • rna-quantification/tximport-workflow - Import Salmon/kallisto with length-offset correction
  • expression-matrix/normalization - Normalize counts for downstream analysis
  • expression-matrix/sparse-handling - Memory-efficient storage
  • expression-matrix/gene-id-mapping - Convert gene identifiers