Claude-skill-registry go-db-query
Skills for querying Gene Ontology annotation databases in DuckDB format. Use this for queries about GO annotations, genes, terms, evidence codes, or taxonomic relationships in GO-DB databases (db/*.ddb files). Particularly useful for hierarchical queries using closure tables to find genes annotated to terms and their descendants.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/go-db-query" ~/.claude/skills/majiayu000-claude-skill-registry-go-db-query && rm -rf "$T"
skills/data/go-db-query/SKILL.mdGO-DB Query Skill
Overview
This skill provides expertise for querying GO-DB DuckDB databases containing Gene Ontology (GO) annotations. GO-DB databases store annotations linking genes/proteins to GO terms, along with the full GO ontology structure. The key feature is the use of closure tables that enable efficient hierarchical queries across the ontology graph.
Use this skill when working with queries involving:
- Finding genes annotated to specific GO terms (including descendants)
- Analyzing evidence codes and annotation sources
- Exploring ontology hierarchies and term relationships
- Computing annotation statistics by taxon, evidence, or other dimensions
- Identifying unique or redundant annotations using ontological reasoning
Core Concepts
Closure Tables
Closure tables are the heart of GO-DB querying. They contain the transitive closure of ontological relationships:
-
isa_partof_closure: Contains all is-a and part-of relationships, both direct and inferred
- Example: If "protein kinase" is-a "kinase" and "kinase" is-a "catalytic activity", the table includes all three relationships plus the transitive "protein kinase" → "catalytic activity"
-
How to use: Join annotations with closure tables to find all genes annotated to a term OR its descendants
-- Find all yeast kinases (including specific types like protein kinase) SELECT DISTINCT a.db_object_symbol, a.db_object_id FROM gaf_association a INNER JOIN isa_partof_closure ipc ON a.ontology_class_ref = ipc.subject WHERE ipc.object = 'GO:0016301' -- kinase activity AND a.db_object_taxon LIKE '%559292%'; -- yeast
Database Structure
Available databases are located in
db/*.ddb:
- Organism-specific:
(yeast),sgd.ddb
(fly),fb.ddb
(fission yeast)pombase.ddb - Taxonomic groups:
,mammal.ddb
,fungi.ddbplant.ddb - GOA databases:
,goa_human.ddbgoa_uniprot_all.ddb
Query Patterns
1. Finding Genes by GO Term (with Closure)
The most common pattern: find all genes annotated to a term or its descendants.
Pattern:
SELECT DISTINCT a.db_object_symbol, a.ontology_class_ref, t.label AS term_label FROM gaf_association a INNER JOIN isa_partof_closure ipc ON a.ontology_class_ref = ipc.subject INNER JOIN term_label t ON a.ontology_class_ref = t.id WHERE ipc.object = '<GO_TERM_ID>' AND a.db_object_taxon LIKE '%<TAXON_ID>%';
Why use closure: Without the closure join, only direct annotations are found. The closure captures all annotations to descendant terms (e.g., "protein kinase", "lipid kinase" when searching for "kinase").
2. Counting and Grouping Annotations
Aggregate annotations by dimensions like evidence type, taxon, or assigned_by.
Pattern:
SELECT evidence_type, COUNT(*) AS annotation_count, COUNT(DISTINCT db_object_id) AS unique_genes FROM gaf_association WHERE <filters> GROUP BY evidence_type ORDER BY annotation_count DESC;
Combine with closure tables to count within ontology subtrees.
3. Finding Unique Contributions
Identify annotations that are not redundant with more specific annotations from other sources.
Pattern:
SELECT a.* FROM gaf_association a WHERE NOT EXISTS ( SELECT 1 FROM gaf_association a2 INNER JOIN isa_partof_closure ipc ON a2.ontology_class_ref = ipc.subject WHERE a2.supporting_references != a.supporting_references AND ipc.object = a.ontology_class_ref -- a2 is to a child term AND a2.db_object_id = a.db_object_id );
Logic: An annotation is unique if no child-term annotation exists from a different source for the same gene.
4. Exploring Term Hierarchies
Navigate the ontology structure itself using edge and closure tables.
Find direct children:
SELECT DISTINCT e.subject, t.label FROM edge e INNER JOIN term_label t ON e.subject = t.id WHERE e.object = '<GO_TERM_ID>' AND e.predicate = 'rdfs:subClassOf';
Find all ancestors:
SELECT DISTINCT ipc.object, t.label FROM isa_partof_closure ipc INNER JOIN term_label t ON ipc.object = t.id WHERE ipc.subject = '<GO_TERM_ID>';
5. Genes Annotated to Multiple Terms
Find genes with annotations to both T1 and T2 (or their descendants).
Pattern:
SELECT DISTINCT a1.db_object_symbol, a1.db_object_id FROM gaf_association a1 INNER JOIN isa_partof_closure ipc1 ON a1.ontology_class_ref = ipc1.subject INNER JOIN gaf_association a2 ON a1.db_object_id = a2.db_object_id INNER JOIN isa_partof_closure ipc2 ON a2.ontology_class_ref = ipc2.subject WHERE ipc1.object = '<GO_TERM_1>' AND ipc2.object = '<GO_TERM_2>';
Logic: Self-join gaf_association on gene ID, then join each side with closure tables to check ancestry.
Executing Queries
Command Line Usage
# Query a specific database duckdb db/sgd.ddb "SELECT COUNT(*) FROM gaf_association" # Interactive mode duckdb db/sgd.ddb D SELECT * FROM term_label WHERE label LIKE '%kinase%' LIMIT 10; D .quit # Export results to CSV duckdb db/goa_human.ddb "COPY (SELECT ...) TO 'results.csv' (HEADER, DELIMITER ',')"
Finding the Right Database
- Organism-specific queries: Use organism database (e.g.,
for yeast)sgd.ddb - Cross-species analysis: Use taxonomic group (e.g.,
)mammal.ddb - Human-focused: Use
goa_human.ddb - Comprehensive queries: Use
(largest, >400M annotations)goa_uniprot_all.ddb
Check available databases:
ls -lh db/*.ddb
Key Tables Reference
gaf_association
Main annotation table with columns:
,db_object_symbol
: Gene identifier and symboldb_object_id
: GO term ID (e.g., "GO:0016301")ontology_class_ref
: Evidence code (e.g., "IEA", "IDA")evidence_type
: NCBI taxon ID (e.g., "taxon:9606")db_object_taxon
: GO aspect - "P" (process), "F" (function), "C" (component)aspect
: Reference IDssupporting_references
: Annotation sourceassigned_by
isa_partof_closure
Transitive closure table with columns:
: Descendant term IDsubject
: Relationship typepredicate
: Ancestor term IDobject
term_label
Term ID to label mapping:
: GO term IDid
: Human-readable labellabel
entailed_edge
All ontology relationships (including inferred):
,subject
,predicateobject
For complete schema documentation, refer to
references/schema.md.
Query Workflow
When handling a query request:
- Understand the question: Identify what data is being requested
- Determine if closure is needed: Most queries benefit from closure tables to capture hierarchical relationships
- Find the GO term ID: Use term_label to search by label if needed
- Select the right database: Choose based on organism/scope
- Build the query: Start with the appropriate pattern from
references/common_queries.md - Add filters: Refine by taxon, evidence, date, etc.
- Execute and verify: Run via
and check results make senseduckdb - Add labels for readability: Join with term_label to show human-readable names
Common Taxon IDs
- 9606: Human
- 10090: Mouse
- 559292: S. cerevisiae (yeast)
- 7227: D. melanogaster (fly)
- 284812: S. pombe (fission yeast)
Common Evidence Codes
Experimental: IDA, IMP, IGI, IPI, IEP Computational: IEA, ISS, ISO, ISA, ISM, IBA Curator/Author: TAS, NAS, IC, ND
Resources
references/schema.md
Complete schema documentation including:
- Detailed table structures and column descriptions
- Ontology table relationships
- Closure table explanations
- Index information
- Database statistics
references/common_queries.md
Comprehensive SQL examples for all query patterns:
- Pattern 1: Find genes by term (with closure)
- Pattern 2: Count/group annotations
- Pattern 3: Find unique/redundant annotations
- Pattern 4: Explore term hierarchies
- Pattern 5: Genes with multiple term annotations
- Pattern 6: Evidence analysis
- Pattern 7: Reference/citation analysis
Load these references when detailed examples or schema information is needed to construct queries.
Tips
- Start simple: Begin with basic queries and add complexity incrementally
- Use EXPLAIN: Check query plans for complex queries
- LIMIT during development: Add LIMIT to test queries on large databases
- Check indices: Closure tables have indices on subject/object pairs for performance
- Validate term IDs: Verify GO term IDs exist in term_label before running queries
- Consider performance: Closure joins can be expensive on very large databases; filter early when possible