# qsv infer-ontology
Infer a semantic ontology from all files in the working directory: entities, attributes, relationships, domain taxonomy, and cross-file join paths. Outputs `ONTOLOGY.md`.
```sh
# Clone the full repository...
git clone https://github.com/dathere/qsv

# ...or install just this skill into ~/.claude/skills:
T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skills/infer-ontology" ~/.claude/skills/dathere-qsv-infer-ontology && rm -rf "$T"
```
Source: `.claude/skills/skills/infer-ontology/SKILL.md`

## Infer Ontology
Scan all files in the current working directory, profile each one, then synthesize a semantic ontology describing the entities, their attributes, the relationships between files, and the domain taxonomy.
> **Cowork note**: If relative paths don't resolve, call `mcp__qsv__qsv_get_working_dir` and `mcp__qsv__qsv_set_working_dir` to sync the working directory.
### Steps

#### Phase 1: Discovery
1. **Sync working directory**: Call `mcp__qsv__qsv_get_working_dir` to confirm the current path. If needed, call `mcp__qsv__qsv_set_working_dir`.
2. **List files**: Call `mcp__qsv__qsv_list_files` to get all files in the working directory. Classify each file (a shell sketch follows these lists):

   **Tabular** (handled natively by the qsv MCP Server — auto-converted to CSV if needed):
   - CSV/TSV/SSV/TAB (`.csv`, `.tsv`, `.ssv`, `.tab` and `.sz` compressed variants)
   - Excel (`.xlsx`, `.xls`, `.xlsm`, `.xlsb`, `.ods`)
   - JSONL (`.jsonl`, `.ndjson`)
   - Parquet (`.parquet`, `.pq`)

   **Non-tabular** (best-effort extraction):
   - Markdown (`.md`), Text (`.txt`), README files — read for domain context
   - JSON config/schema files — extract field names and types
   - Data dictionaries, codebooks — extract controlled vocabularies
   - Other files — note their presence but skip deep analysis
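   A minimal extension-based classification sketch (illustrative only; `mcp__qsv__qsv_list_files` remains the authoritative source):

   ```sh
   # Hypothetical pass over the working directory, mirroring the lists above.
   for f in *; do
     case "$f" in
       *.csv|*.tsv|*.ssv|*.tab|*.sz)     echo "tabular: $f" ;;
       *.xlsx|*.xls|*.xlsm|*.xlsb|*.ods) echo "tabular (Excel): $f" ;;
       *.jsonl|*.ndjson)                 echo "tabular (JSONL): $f" ;;
       *.parquet|*.pq)                   echo "tabular (Parquet): $f" ;;
       *.md|*.txt)                       echo "context: $f" ;;
       *)                                echo "other: $f" ;;
     esac
   done
   ```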
#### Phase 2: Profile Each Tabular File
3. **Run data-profile on each tabular file**: For every tabular file discovered in step 2, execute the full `/data-profile` workflow (steps 1-13). This produces, for each file:
   - Format metadata (delimiter, encoding, row count, column count)
   - Full statistics with cardinality (`.stats.csv` cache)
   - Advanced statistics (kurtosis, Gini, entropy via moarstats)
   - Frequency distributions (top 10 values per column)
   - PII/PHI screening results
   - Injection screening results
   - Data preview (first 5 rows)
   - Data Dictionary with Label/Description per column, Dataset Description, and Tags

   Run profiles sequentially to avoid overwhelming the MCP server. After each profile completes, retain the key outputs (stats, frequencies, Data Dictionary, Tags) for cross-file analysis.
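   For orientation, rough CLI equivalents of the core profiling outputs (the actual workflow runs through the qsv MCP Server; the filename is a placeholder):

   ```sh
   qsv stats --everything customers.csv > customers.stats.csv   # full stats, incl. cardinality
   qsv frequency --limit 10 customers.csv > customers.freq.csv  # top 10 values per column
   qsv slice --len 5 customers.csv                              # data preview (first 5 rows)
   ```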
4. **Extract context from non-tabular files**: For each non-tabular file:
   - Read the file content (if text-based and reasonably sized)
   - Extract any domain terminology, field definitions, business rules, or relationship descriptions
   - Note any schema definitions, data dictionaries, or codebooks that describe the tabular files
#### Phase 3: Cross-File Relationship Detection
5. **Identify shared columns**: Compare column names across all profiled files. Flag columns that appear in multiple files (exact name match or close variants like `customer_id`/`cust_id`/`customerid`).
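   A quick way to surface exact-name matches from the shell (close variants still need fuzzy matching on top):

   ```sh
   # Print header names that appear in more than one CSV file.
   for f in *.csv; do
     qsv headers --just-names "$f" | sort -u
   done | sort | uniq -d
   ```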
6. **Validate join candidates**: For each pair of files sharing column names, read `.stats.csv` and check:
   - Compare `type` from stats — both must be the same type; if mismatched, the relationship is invalid unless one side needs casting
   - Compare `cardinality`: a foreign key column typically has cardinality ≤ the primary key's cardinality
   - Check `uniqueness_ratio` — a value of 1.0 identifies the primary key side of the relationship
   - Check `nullcount`/`sparsity` — join columns with sparsity > 0.3 are unreliable join candidates (nulls don't match)
   - Compare value ranges (`min`/`max`) — overlapping ranges suggest a real relationship
   - Compare frequency distributions — if top values in one appear in the other, the relationship is likely valid
   - Check `skewness` — highly skewed join columns (|skewness| > 2) may indicate data quality issues masking relationships
   - Use `mcp__qsv__qsv_sqlp` to test overlap when needed (see the overlap sketch below):
     `SELECT COUNT(DISTINCT a.col) AS overlap FROM read_csv('file1.csv') a INNER JOIN read_csv('file2.csv') b ON a.col = b.col`
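   To quantify the overlap strength recorded later (step 9), a follow-up sketch via the CLI (filenames and `col` are placeholders; `_t_1`/`_t_2` are sqlp's positional table aliases):

   ```sh
   # % of file1's distinct col values that also appear in file2.
   qsv sqlp file1.csv file2.csv "
     SELECT m.matched * 100.0 / t.total AS overlap_pct
     FROM (SELECT COUNT(DISTINCT a.col) AS matched
           FROM _t_1 a JOIN _t_2 b ON a.col = b.col) m
     CROSS JOIN (SELECT COUNT(DISTINCT col) AS total FROM _t_1) t
   "
   ```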
7. **Detect relationship types**: Classify each validated relationship (a cardinality sketch follows the list):
   - **One-to-One**: both sides have cardinality ≈ row count and overlap is high
   - **One-to-Many**: one side has cardinality ≈ row count (primary key), the other has cardinality < row count (foreign key)
   - **Many-to-Many**: both sides have cardinality < row count — suggests a junction/bridge table may exist
   - **Hierarchical**: a column's values are a subset of another column in the same or a different file (parent-child)
   - **Temporal**: date/time columns across files that suggest event sequences or time-series relationships
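   The per-side numbers behind this classification can be derived directly (placeholder filenames and column):

   ```sh
   # cardinality == row count -> primary-key side; cardinality < row count -> foreign-key side.
   qsv sqlp file1.csv file2.csv "
     SELECT 'file1' AS side, COUNT(*) AS n_rows, COUNT(DISTINCT col) AS cardinality FROM _t_1
     UNION ALL
     SELECT 'file2', COUNT(*), COUNT(DISTINCT col) FROM _t_2
   "
   ```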
#### Phase 4: Ontology Synthesis
8. **Define entities**: Each tabular file represents one entity (or multiple if it's a denormalized/wide table). For each entity:
   - **Name**: Derive from the filename (e.g., `customers.csv` → `Customer`)
   - **Description**: Use the Dataset Description from the data-profile
   - **Primary key**: The column with cardinality = row count and no nulls (if one exists); see the sketch below
   - **Attributes**: All columns, with their Label, Description, type, and nullability from the Data Dictionary
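   Because each `.stats.csv` cache is itself a CSV (one row per column), primary-key candidates can be pulled from it directly; a sketch assuming the cache carries the `field`, `nullcount`, and `uniqueness_ratio` columns referenced in step 6:

   ```sh
   # Fully populated, fully unique columns are PK candidates.
   qsv sqlp customers.stats.csv "
     SELECT field FROM _t_1 WHERE nullcount = 0 AND uniqueness_ratio = 1.0
   "
   ```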
9. **Define relationships**: From the cross-file analysis (steps 5-7), document each relationship:
   - Source entity → Target entity
   - Join columns (source.column → target.column)
   - Relationship type (one-to-one, one-to-many, many-to-many)
   - Overlap strength (percentage of source values found in target)
   - Join path (the SQL or joinp expression to connect them; see the sketch below)
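   A hedged `joinp` join-path sketch, using the positional `<columns1> <input1> <columns2> <input2>` form (all names are placeholders):

   ```sh
   # Left join orders to customers on customer_id.
   qsv joinp --left customer_id orders.csv customer_id customers.csv \
     --output orders_with_customers.csv
   ```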
10. **Infer domain taxonomy**: Using all collected information (column names, value distributions, Tags from each file, non-tabular file context):
    - **Domain**: The overarching subject area (e.g., "Healthcare", "E-commerce", "Finance")
    - **Subdomains**: More specific topic areas (e.g., "Patient Records", "Claims Processing")
    - **Entity hierarchy**: Group entities into logical categories
    - **Controlled vocabularies**: Columns with low cardinality that serve as classification dimensions (e.g., status codes, categories, types); a frequency sketch follows the list
    - **Temporal scope**: Date ranges across the dataset collection
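    Controlled vocabularies fall out of the frequency pass; for a low-cardinality column, an exhaustive value list can be produced like this (`status` and the filename are placeholders):

    ```sh
    # --limit 0 disables the top-N cap so every distinct value is listed.
    qsv frequency --select status --limit 0 orders.csv
    ```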
11. **Assess data quality across the collection**: Summarize cross-cutting quality concerns:
    - Inconsistent column naming conventions across files
    - Orphaned foreign keys (references to entities not present in any file)
    - PII/PHI exposure across the collection
    - Completeness gaps (entities with high null rates in key columns)
    - Type inconsistencies (same column name, different types across files; see the sketch below)
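    The type-consistency check can be scripted over the stats caches; a sketch assuming each cache has `field` and `type` columns:

    ```sh
    # Collect field,type pairs from every stats cache, then flag
    # fields that appear with more than one distinct type.
    for s in *.stats.csv; do
      qsv select field,type "$s" | qsv behead
    done | sort -u | cut -d, -f1 | uniq -d
    ```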
#### Phase 5: Output

12. **Write ONTOLOGY.md**: Generate the ontology document in the working directory using the template below.
### ONTOLOGY.md Template
````markdown
# Ontology: {Domain Name}

> Auto-generated ontology inferred from {N} files in `{working_directory}`.
> Generated: {date}

## Overview

{3-5 sentence summary of the dataset collection: what domain it covers, how many entities, total rows across all files, key relationships, and overall data quality.}

## Domain Taxonomy

**Domain**: {Primary domain}
**Subdomains**: {Comma-separated list}
**Temporal scope**: {Earliest date} to {Latest date} (if applicable)
**Tags**: {Merged and deduplicated tags from all file profiles}

## Entities

### {Entity Name} (`{filename}`)

{Dataset Description from data-profile}

| Field | Type | Label | Description | Nullable | Cardinality | Key |
|-------|------|-------|-------------|----------|-------------|-----|
| ... | ... | ... | ... | ... | ... | PK/FK/— |

**Quality notes**: {Any quality flags from profiling — PII, high nulls, injection, etc.}

{Repeat for each entity}

## Relationships

| Source | Target | Source Column | Target Column | Type | Overlap | Join Expression |
|--------|--------|---------------|---------------|------|---------|-----------------|
| ... | ... | ... | ... | 1:N | 98.5% | `joinp --left file1.csv file2.csv --columns col` |

### Relationship Diagram

{ASCII or text-based entity-relationship diagram showing entities as boxes and relationships as labeled arrows. Example:}

```
[Customer] 1──→N [Order] N←──1 [Product]
    │                              │
    └──────── N:M ─────────────────┘
              (via OrderItem)
```

## Controlled Vocabularies

{For columns with low cardinality that serve as classification dimensions}

### {Column Name} (`{filename}`)

| Value | Frequency | Description |
|-------|-----------|-------------|
| ... | ... | ... |

## Cross-Collection Quality Summary

| Dimension | Status | Details |
|-----------|--------|---------|
| Naming consistency | {OK/Warning} | {Details} |
| Referential integrity | {OK/Warning} | {Orphaned keys found in ...} |
| PII/PHI exposure | {OK/Warning} | {Columns flagged in ...} |
| Type consistency | {OK/Warning} | {Mismatched types for ...} |
| Completeness | {OK/Warning} | {High-null columns in ...} |

## Non-Tabular Context

{Summary of information extracted from non-tabular files that informed the ontology}

| File | Type | Contribution |
|------|------|--------------|
| ... | ... | {What domain context or definitions it provided} |
````
### Notes
- For Excel workbooks with multiple sheets, each sheet is treated as a separate entity
- Parquet files are queried in-place via `sqlp` with `read_parquet()` rather than converted to CSV (see the sketch below)
- The cross-file relationship detection uses column name matching and cardinality comparison as heuristics — always verify inferred relationships against domain knowledge
- For large directories (>20 tabular files), consider profiling in batches to manage context
- If a controlled Tag Vocabulary is provided, all Tags across the ontology are constrained to that vocabulary
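For the Parquet note, the in-place pattern looks like this, assuming sqlp's `SKIP_INPUT` placeholder (which skips reading a CSV input so the statement can reference files through `read_parquet()` directly):

```sh
# Query a Parquet file in place; no CSV conversion step.
qsv sqlp SKIP_INPUT "SELECT COUNT(*) FROM read_parquet('events.parquet')"
```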