Blockify-agentic-data-optimization blockify-integration

install
source · Clone the upstream repo
git clone https://github.com/iternal-technologies-partners/blockify-agentic-data-optimization
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/iternal-technologies-partners/blockify-agentic-data-optimization "$T" && mkdir -p ~/.claude/skills && cp -r "$T/blockify-skill-for-claude-code/skills/blockify-integration" ~/.claude/skills/iternal-technologies-partners-blockify-agentic-data-optimization-blockify-integr && rm -rf "$T"
manifest: blockify-skill-for-claude-code/skills/blockify-integration/SKILL.md

Blockify Integration Skill

Why This Exists

Problem: Traditional RAG systems chunk documents by character/token count, losing semantic coherence. A 500-token chunk may split a concept mid-sentence, contain unrelated paragraphs, or bury key facts in noise.

Solution: Blockify is a patented distillation platform that transforms raw text into IdeaBlocks—self-contained semantic knowledge units optimized for AI retrieval.

Metric                   Improvement
──────                   ───────────
Enterprise Performance   78X
Vector Search Accuracy   2.29X
Dataset Size Reduction   40X (to ~2.5%)
Token Efficiency         3.09X

End-to-End Process Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                         BLOCKIFY PIPELINE OVERVIEW                          │
└─────────────────────────────────────────────────────────────────────────────┘

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
  │  Source  │     │ Blockify │     │ ChromaDB │     │  Search  │
  │Documents │────▶│   API    │────▶│  Vector  │────▶│  Query   │
  │ .md .txt │     │ (ingest) │     │   Store  │     │ Results  │
  └──────────┘     └──────────┘     └──────────┘     └──────────┘
       │                │                │                │
       │                ▼                ▼                │
       │         ┌──────────┐     ┌──────────┐           │
       │         │IdeaBlocks│     │  OpenAI  │           │
       │         │   XML    │     │Embeddings│           │
       │         └──────────┘     │  1536-d  │           │
       │                          └──────────┘           │
       │                               │                 │
       │                               ▼                 │
       │                    ┌─────────────────┐          │
       │                    │   DISTILLATION  │          │
       │                    │  (deduplicate)  │          │
       │                    │                 │          │
       │                    │ raw_ideablocks  │          │
       │                    │       ▼         │          │
       │                    │ distilled_      │          │
       │                    │   ideablocks    │          │
       │                    └─────────────────┘          │
       │                                                 │
       └─────────────────────────────────────────────────┘

Complete Setup (Step-by-Step)

Prerequisites

Step 1: Create Environment File

cd /path/to/blockify-skill-for-claude-code

# Create .env file
cat > .env << 'EOF'
# Blockify API Keys
BLOCKIFY_API_KEY=blk_your_key_here
OPENAI_API_KEY=sk-your_key_here
EOF

Step 2: Load Environment Variables

IMPORTANT: You must load these before running any script:

export $(cat .env | grep -v '^#' | grep -v '^$' | xargs)

Or add them to your shell profile (~/.zshrc or ~/.bashrc):

# Blockify environment
export BLOCKIFY_API_KEY="blk_your_key_here"
export OPENAI_API_KEY="sk-your_key_here"
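When a script is launched from a context where the shell export has not run (an IDE, a scheduler), the same KEY=value file can be read directly in Python. The following is a minimal sketch, assuming the .env layout from Step 1; `parse_env_text` is a hypothetical helper name and is not part of the Blockify scripts.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as written in a shell profile.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

A caller could then apply the result with `os.environ.update(parse_env_text(open(".env").read()))` before invoking any Blockify code.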

Step 3: Install Dependencies

cd skills/blockify-integration
python3 scripts/setup_check.py --install

Expected output:

[OK] All packages installed
[OK] API keys configured
[--] ChromaDB not initialized (will create on first ingest)

Step 4: Ingest Documents

# Single file
python3 scripts/ingest_to_chromadb.py /path/to/document.md

# Directory (batch mode)
python3 scripts/ingest_to_chromadb.py /path/to/documents/ --batch

What happens:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Read File  │───▶│   Chunk     │───▶│  Blockify   │───▶│   Parse     │
│             │    │  (2000 chr) │    │   API       │    │   XML       │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                                               │
┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│   Store     │◀───│  Dedupe     │◀───│  Generate   │◀─────────┘
│  ChromaDB   │    │  (by ID)    │    │  Embeddings │
└─────────────┘    └─────────────┘    └─────────────┘
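The "Dedupe (by ID)" stage relies on content-derived IDs: the technical notes in this document describe IdeaBlock IDs as SHA256 hashes of name + question + answer. A minimal sketch of that idea follows. The plain concatenation, the UTF-8 encoding, the `ib_` prefix (suggested by the error message in Troubleshooting), and the 16-character digest length are all assumptions; the real script may differ on each point.

```python
import hashlib


def ideablock_id(name: str, question: str, answer: str) -> str:
    """Derive a deterministic ID from block content (format is an assumption)."""
    digest = hashlib.sha256((name + question + answer).encode("utf-8")).hexdigest()
    return "ib_" + digest[:16]


def dedupe_by_id(blocks: list[dict]) -> list[dict]:
    """Keep the first block seen for each ID, preserving input order."""
    seen = set()
    unique = []
    for block in blocks:
        block_id = ideablock_id(
            block["name"], block["critical_question"], block["trusted_answer"]
        )
        if block_id not in seen:
            seen.add(block_id)
            unique.append({**block, "id": block_id})
    return unique
```

Because the ID depends only on content, the same block extracted from two overlapping chunks collapses to one entry before it reaches the vector store.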

Step 5: Distill (Deduplicate)

Option A: Docker-based (full service)

cd /path/to/blockify-distillation-service
cp .env.example .env
# Add API keys to .env
docker-compose up -d
python3 scripts/run_distillation.py

Option B: Direct API (no Docker required)

python3 scripts/distill_chromadb.py

What happens:

┌─────────────────────────────────────────────────────────────────┐
│                    DISTILLATION PROCESS                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Pass 1: Within-Document Clustering                             │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐                         │
│  │  Doc A  │  │  Doc B  │  │  Doc C  │                         │
│  │ ┌─┐┌─┐  │  │ ┌─┐┌─┐  │  │ ┌─┐┌─┐  │  (cluster similar      │
│  │ └─┘└─┘  │  │ └─┘└─┘  │  │ └─┘└─┘  │   blocks per doc)      │
│  └─────────┘  └─────────┘  └─────────┘                         │
│       │            │            │                               │
│       ▼            ▼            ▼                               │
│  Pass 2: Cross-Document Clustering                              │
│  ┌──────────────────────────────────┐                          │
│  │  Compare representatives across  │  (find duplicates        │
│  │  all documents for global dedup  │   across documents)      │
│  └──────────────────────────────────┘                          │
│                    │                                            │
│                    ▼                                            │
│  Pass 3: Merge via Blockify Distill API                        │
│  ┌─────────┐    ┌─────────┐                                    │
│  │ Cluster │───▶│ Merged  │  (LLM combines similar blocks)     │
│  │ 5 blocks│    │ 1 block │                                    │
│  └─────────┘    └─────────┘                                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
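The clustering passes can be pictured as threshold-based grouping over embedding vectors. The sketch below uses a simple greedy scheme and is only an illustration: the actual clustering algorithm used by the distillation scripts is not specified here, and `cluster_by_threshold` is a hypothetical name. The 0.8 threshold is borrowed from the `--threshold 0.8` usage example in the Script Reference, not from a documented default.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_by_threshold(vectors: list[list[float]], threshold: float = 0.8) -> list[list[int]]:
    """Greedy grouping: a vector joins the first cluster whose representative
    (that cluster's first member) is at least `threshold` similar; otherwise
    it starts a new cluster. Returns lists of indices into `vectors`."""
    clusters: list[list[int]] = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Only clusters with more than one member would need a merge call in Pass 3. Raising the threshold produces more singleton clusters and therefore fewer merges.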

Step 6: Search

# Search distilled collection (recommended)
python3 scripts/search_chromadb.py "your query" --collection distilled

# Search raw collection
python3 scripts/search_chromadb.py "your query" --collection raw

# Filter by entity type
python3 scripts/search_chromadb.py "your query" --entity PRODUCT

# JSON output
python3 scripts/search_chromadb.py "your query" --json

Data Flow Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                              DATA FLOW                                       │
└─────────────────────────────────────────────────────────────────────────────┘

SOURCE FILES                    PROCESSING                      STORAGE
────────────                    ──────────                      ───────

  document1.md ─┐
  document2.md ─┼──▶ ingest_to_chromadb.py ──▶ raw_ideablocks (ChromaDB)
  document3.md ─┤         │                          │
       ...     ─┘         │                          │
                          │                          ▼
                          │               distill_chromadb.py
                          │                          │
                          ▼                          ▼
                   Blockify API              distilled_ideablocks
                   (ingest model)                    │
                          │                          │
                          ▼                          ▼
                   OpenAI Embeddings ◀──────── search_chromadb.py
                   (text-embedding-              (semantic search)
                    3-small, 1536d)

COLLECTIONS:
┌────────────────────────────────────────────────────────────────────────────┐
│ raw_ideablocks        │ Pre-distillation blocks, may have duplicates      │
├────────────────────────────────────────────────────────────────────────────┤
│ distilled_ideablocks  │ Production-ready, deduplicated (USE THIS)         │
└────────────────────────────────────────────────────────────────────────────┘

Core Concept: IdeaBlocks

An IdeaBlock is a complete, self-contained unit of knowledge that answers exactly one question:

<ideablock>
  <name>Title describing this knowledge unit</name>
  <critical_question>What specific question does this answer?</critical_question>
  <trusted_answer>The validated answer (2-3 sentences, complete).</trusted_answer>
  <tags>IMPORTANT, TECHNOLOGY, CATEGORY</tags>
  <entity>
    <entity_name>PRODUCT_NAME</entity_name>
    <entity_type>PRODUCT</entity_type>
  </entity>
  <keywords>keyword1, keyword2, keyword3</keywords>
</ideablock>

Entity types: PRODUCT, ORGANIZATION, PERSON, TECHNOLOGY, CONCEPT, LOCATION, EVENT
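To show how the structure above maps onto application data, here is a minimal parsing sketch using Python's standard library. It assumes one well-formed `<ideablock>` element per string; real API responses may contain several blocks or need wrapping in a root element first. `parse_ideablock` is a hypothetical helper, not one of the bundled scripts.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<ideablock>
  <name>Title describing this knowledge unit</name>
  <critical_question>What specific question does this answer?</critical_question>
  <trusted_answer>The validated answer (2-3 sentences, complete).</trusted_answer>
  <tags>IMPORTANT, TECHNOLOGY, CATEGORY</tags>
  <entity>
    <entity_name>PRODUCT_NAME</entity_name>
    <entity_type>PRODUCT</entity_type>
  </entity>
  <keywords>keyword1, keyword2, keyword3</keywords>
</ideablock>
"""


def _split_csv(value: str) -> list[str]:
    """Split a comma-separated field into trimmed, non-empty items."""
    return [part.strip() for part in value.split(",") if part.strip()]


def parse_ideablock(xml_text: str) -> dict:
    """Convert one <ideablock> element into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    if root.tag != "ideablock":
        raise ValueError(f"expected <ideablock>, got <{root.tag}>")
    return {
        "name": root.findtext("name", default="").strip(),
        "critical_question": root.findtext("critical_question", default="").strip(),
        "trusted_answer": root.findtext("trusted_answer", default="").strip(),
        "tags": _split_csv(root.findtext("tags", default="")),
        "entities": [
            {
                "name": entity.findtext("entity_name", default="").strip(),
                "type": entity.findtext("entity_type", default="").strip(),
            }
            for entity in root.findall("entity")
        ],
        "keywords": _split_csv(root.findtext("keywords", default="")),
    }
```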


Model Selection

Is the content ordered/sequential (manual, procedure)?
├─ YES → Use `technical-ingest` (preserves order context)
└─ NO → Is this raw source material?
         ├─ YES → Use `ingest` (creates new IdeaBlocks)
         └─ NO → Are these existing IdeaBlocks with duplicates?
                  └─ YES → Use `distill` (merges similar blocks)

Model              Input                    Output                 Use Case
─────              ─────                    ──────                 ────────
ingest             Raw text                 New IdeaBlocks         First-time processing
distill            IdeaBlocks XML           Merged IdeaBlocks      Deduplication
technical-ingest   Ordered text + context   Sequenced IdeaBlocks   Manuals, procedures
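The decision tree above translates directly into a small function. This is a sketch only; `choose_model` and its boolean parameters are hypothetical names, and the function returns `None` for the one path the tree leaves unspecified (existing blocks with no duplicates).

```python
from typing import Optional


def choose_model(
    is_sequential: bool, is_raw_source: bool, has_duplicate_blocks: bool
) -> Optional[str]:
    """Walk the model-selection tree from top to bottom."""
    if is_sequential:
        return "technical-ingest"  # preserves order context
    if is_raw_source:
        return "ingest"  # creates new IdeaBlocks
    if has_duplicate_blocks:
        return "distill"  # merges similar blocks
    return None  # nothing to process
```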

Script Reference

Scripts Overview

scripts/
├── setup_check.py          # Verify environment, install deps
├── ingest_to_chromadb.py   # Documents → IdeaBlocks → ChromaDB (parallel)
├── search_chromadb.py      # Semantic search with OpenAI embeddings
├── distill_chromadb.py     # Deduplication (NO Docker required)
├── run_distillation.py     # Deduplication (requires Docker service)
├── run_full_pipeline.py    # End-to-end: ingest + distill + benchmark (parallel)
├── run_benchmark.py        # Compare IdeaBlocks vs chunking, generate HTML report
├── blockify_ingest.py      # Documents → JSON (no ChromaDB)
├── blockify_distill.py     # JSON → distilled JSON
└── blockify_search.py      # Search JSON files

Note: Ingestion scripts use 5 parallel workers by default. Configure this with the --parallel N flag or the BLOCKIFY_PARALLEL_WORKERS environment variable.
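A minimal sketch of how a worker count might be resolved and used with a thread pool. The precedence shown (flag first, then environment variable, then the default of 5) is an assumption, as are the helper names `resolve_workers` and `process_all`; the ingestion scripts may be organized differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Mapping, Optional

DEFAULT_WORKERS = 5  # default stated in the note above


def resolve_workers(cli_value: Optional[int], env: Mapping[str, str]) -> int:
    """Assumed precedence: --parallel flag, then BLOCKIFY_PARALLEL_WORKERS, then 5."""
    if cli_value is not None:
        return max(1, cli_value)
    return max(1, int(env.get("BLOCKIFY_PARALLEL_WORKERS", DEFAULT_WORKERS)))


def process_all(paths: Iterable[str], handler: Callable[[str], object], workers: int) -> list:
    """Run `handler` on every path with a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handler, paths))
```

Threads suit this workload because each file spends most of its time waiting on API responses rather than on local computation.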

Detailed Script Usage

setup_check.py

python3 scripts/setup_check.py           # Check status
python3 scripts/setup_check.py --install # Install missing packages

ingest_to_chromadb.py

python3 scripts/ingest_to_chromadb.py input.txt              # Single file
python3 scripts/ingest_to_chromadb.py docs/ --batch          # Directory (5 parallel workers)
python3 scripts/ingest_to_chromadb.py docs/ --batch -p 10    # Use 10 parallel workers
python3 scripts/ingest_to_chromadb.py docs/ --batch -s       # Sequential processing
python3 scripts/ingest_to_chromadb.py input.txt -c distilled # Target collection

search_chromadb.py

python3 scripts/search_chromadb.py "query"                    # Auto-select collection
python3 scripts/search_chromadb.py "query" -c distilled       # Specific collection
python3 scripts/search_chromadb.py "query" -e PRODUCT         # Filter by entity
python3 scripts/search_chromadb.py "query" -n 20              # Limit results
python3 scripts/search_chromadb.py "query" --json             # JSON output

distill_chromadb.py (NO Docker)

python3 scripts/distill_chromadb.py                           # Default settings
python3 scripts/distill_chromadb.py --threshold 0.8           # Higher = fewer merges
python3 scripts/distill_chromadb.py --dry-run                 # Cluster only, no API calls

Troubleshooting

Common Errors and Solutions

┌────────────────────────────────────────────────────────────────────────────┐
│ ERROR                          │ CAUSE              │ SOLUTION             │
├────────────────────────────────────────────────────────────────────────────┤
│ DuplicateIDError               │ Same IdeaBlock     │ Script handles this  │
│ "found duplicates of: ib_..."  │ extracted twice    │ automatically now    │
├────────────────────────────────────────────────────────────────────────────┤
│ InvalidArgumentError           │ Embedding model    │ Use search_chromadb  │
│ "dimension 1536, got 384"      │ mismatch           │ (fixed in script)    │
├────────────────────────────────────────────────────────────────────────────┤
│ BLOCKIFY_API_KEY not set       │ Missing env var    │ export $(cat .env    │
│                                │                    │ | grep -v '^#' |     │
│                                │                    │ grep -v '^$' | xargs)│
├────────────────────────────────────────────────────────────────────────────┤
│ 429 Rate Limit                 │ Too many requests  │ Script retries with  │
│                                │                    │ exponential backoff  │
├────────────────────────────────────────────────────────────────────────────┤
│ Empty output from API          │ max_tokens too low │ Use 8000+ tokens     │
│                                │                    │ (default in scripts) │
├────────────────────────────────────────────────────────────────────────────┤
│ ChromaDB not found             │ Not initialized    │ Run ingest first     │
├────────────────────────────────────────────────────────────────────────────┤
│ Distillation service not       │ Docker not running │ Use distill_chromadb │
│ available                      │ OR no Docker       │ .py (no Docker)      │
└────────────────────────────────────────────────────────────────────────────┘
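For readers calling the API from their own code, the 429 row above describes retrying with exponential backoff. A generic sketch of that pattern follows. `RateLimitError` is a stand-in exception defined here for illustration, and the attempt count, base delay, and jitter are assumed values, not the scripts' actual settings.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response (hypothetical, not a Blockify class)."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `request` on RateLimitError, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without real waiting.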

Important Technical Notes

  1. Embedding Model Consistency

    • Ingestion uses text-embedding-3-small (OpenAI, 1536 dimensions)
    • Search MUST use the same model
    • The search_chromadb.py script handles this automatically

  2. Duplicate Handling

    • IdeaBlock IDs are SHA256 hashes of name + question + answer
    • Identical content = identical ID (by design)
    • ingest_to_chromadb.py deduplicates within each batch automatically

  3. Chunking Strategy

    • 2000 characters per chunk
    • 200 character overlap at sentence boundaries
    • Optimal for Blockify API processing
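The chunking strategy in note 3 can be sketched as sentence packing with a carried-over tail. This is an illustration under assumptions: the sentence splitter is a simple punctuation regex, and `chunk_text` is a hypothetical function, not the code used by ingest_to_chromadb.py.

```python
import re

# Split after sentence-ending punctuation followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Pack whole sentences into chunks of roughly `chunk_size` characters.

    Each new chunk starts with the trailing sentences of the previous one,
    up to `overlap` characters. A chunk can exceed `chunk_size` only when a
    single sentence (plus the carried overlap) is itself that long.
    """
    sentences = [s for s in SENTENCE_END.split(text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry whole trailing sentences into the next chunk as overlap.
            carried: list[str] = []
            size = 0
            for prev in reversed(current):
                if size + len(prev) > overlap:
                    break
                carried.insert(0, prev)
                size += len(prev) + 1
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Cutting at sentence boundaries keeps each chunk readable on its own, and the overlap gives the API context for ideas that straddle two chunks.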

Configuration

Environment Variables

Variable                    Required   Default                 Description
────────                    ────────   ───────                 ───────────
BLOCKIFY_API_KEY            Yes        -                       API key from console.blockify.ai
OPENAI_API_KEY              Yes        -                       API key from platform.openai.com
IDEABLOCK_DATA_DIR          No         ./data/ideablocks       Data storage directory
DISTILL_SERVICE_URL         No         http://localhost:8315   Distillation service URL
BLOCKIFY_PARALLEL_WORKERS   No         5                       Default parallel workers for ingestion
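The variables above could be gathered into one typed configuration object. The sketch below applies the listed defaults and fails fast when a required key is absent; `BlockifyConfig` and `load_config` are hypothetical names introduced for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BlockifyConfig:
    blockify_api_key: str
    openai_api_key: str
    data_dir: str
    distill_service_url: str
    parallel_workers: int


def load_config(env: Optional[Mapping[str, str]] = None) -> BlockifyConfig:
    """Read settings from `env` (defaults to os.environ), applying the listed defaults."""
    env = os.environ if env is None else env
    missing = [k for k in ("BLOCKIFY_API_KEY", "OPENAI_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return BlockifyConfig(
        blockify_api_key=env["BLOCKIFY_API_KEY"],
        openai_api_key=env["OPENAI_API_KEY"],
        data_dir=env.get("IDEABLOCK_DATA_DIR", "./data/ideablocks"),
        distill_service_url=env.get("DISTILL_SERVICE_URL", "http://localhost:8315"),
        parallel_workers=int(env.get("BLOCKIFY_PARALLEL_WORKERS", "5")),
    )
```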

API Settings (Do Not Change)

Parameter     Value        Reason
─────────     ─────        ──────
max_tokens    8000         Minimum for complete blocks
temperature   0.5          Calibrated for consistency
chunk_size    2000 chars   Optimal input chunking

Search Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           SEARCH FLOW                                        │
└─────────────────────────────────────────────────────────────────────────────┘

                    ┌─────────────────┐
   User Query ────▶ │ OpenAI Embedding│ ────▶ Query Vector (1536-d)
                    │ text-embedding- │
                    │ 3-small         │
                    └─────────────────┘
                                              │
                                              ▼
                                   ┌─────────────────┐
                                   │ ChromaDB Query  │
                                   │ (cosine sim)    │
                                   └─────────────────┘
                                              │
                                              ▼
                                   ┌─────────────────┐
                                   │ Top-K Results   │
                                   │ (no reranker)   │
                                   └─────────────────┘

CURRENT LIMITATIONS:
- Single-stage retrieval only (no reranking)
- No hybrid search (vector only, no BM25)
- No query expansion

POTENTIAL IMPROVEMENTS:
- Add cross-encoder reranker for top-100 → top-10
- Implement hybrid search with BM25
- Add query expansion via LLM
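The single-stage retrieval described above amounts to ranking stored vectors by cosine similarity against the query vector and keeping the top K. A brute-force sketch of that ranking follows, using toy 3-dimensional vectors in place of 1536-dimensional embeddings. It does not call ChromaDB or OpenAI; `top_k` is a hypothetical helper that only illustrates the scoring step.

```python
import heapq
import math
from typing import Mapping, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vector: Sequence[float],
    indexed: Mapping[str, Sequence[float]],
    k: int = 10,
) -> list[tuple[str, float]]:
    """Return the k (block_id, score) pairs most similar to the query, best first."""
    scored = (
        (cosine_similarity(query_vector, vector), block_id)
        for block_id, vector in indexed.items()
    )
    return [(block_id, score) for score, block_id in heapq.nlargest(k, scored)]
```

A reranker, as listed under potential improvements, would take a larger `k` from this stage and reorder those candidates with a second scoring function.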

Quick Reference Commands

# ═══════════════════════════════════════════════════════════════════════════
# SETUP
# ═══════════════════════════════════════════════════════════════════════════

# Load environment (run this first, every session)
export $(cat /path/to/.env | grep -v '^#' | grep -v '^$' | xargs)

# Check setup
python3 scripts/setup_check.py

# Install dependencies
python3 scripts/setup_check.py --install

# ═══════════════════════════════════════════════════════════════════════════
# INGEST (parallel by default, 5 workers)
# ═══════════════════════════════════════════════════════════════════════════

# Single file
python3 scripts/ingest_to_chromadb.py document.md

# Directory of files (5 parallel workers by default)
python3 scripts/ingest_to_chromadb.py /path/to/docs/ --batch

# Use more parallel workers for faster ingestion
python3 scripts/ingest_to_chromadb.py /path/to/docs/ --batch --parallel 10

# Sequential processing (disable parallelization)
python3 scripts/ingest_to_chromadb.py /path/to/docs/ --batch --sequential

# ═══════════════════════════════════════════════════════════════════════════
# DISTILL (DEDUPLICATE)
# ═══════════════════════════════════════════════════════════════════════════

# Without Docker (recommended for most users)
python3 scripts/distill_chromadb.py

# With Docker service
python3 scripts/run_distillation.py

# ═══════════════════════════════════════════════════════════════════════════
# SEARCH
# ═══════════════════════════════════════════════════════════════════════════

# Basic search (uses distilled if available)
python3 scripts/search_chromadb.py "your query"

# Search specific collection
python3 scripts/search_chromadb.py "your query" --collection distilled

# Filter by entity
python3 scripts/search_chromadb.py "your query" --entity PRODUCT

# JSON output
python3 scripts/search_chromadb.py "your query" --json --limit 5

# ═══════════════════════════════════════════════════════════════════════════
# BENCHMARK (compare IdeaBlocks vs traditional chunking)
# ═══════════════════════════════════════════════════════════════════════════

# Run benchmark (generates HTML report)
python3 scripts/run_benchmark.py

# With custom company name
python3 scripts/run_benchmark.py --company "My Company"

# With custom config
python3 scripts/run_benchmark.py --config ./config/benchmark_config.yaml

# Create default config file
python3 scripts/run_benchmark.py --init-config

# View generated report
open data/reports/benchmark_report_*.html

Required Execution Workflow (For Autonomous/Skill Use)

IMPORTANT: When running Blockify as a skill or autonomous task, you MUST complete ALL steps below in order. Do not skip any step.

┌─────────────────────────────────────────────────────────────────────────────┐
│                    REQUIRED EXECUTION STEPS (IN ORDER)                       │
└─────────────────────────────────────────────────────────────────────────────┘

Step 1: Environment Setup
    └─► Verify API keys configured
    └─► Run setup_check.py to confirm dependencies

Step 2: Document Ingestion
    └─► Run ingest_to_chromadb.py with --batch for directories
    └─► Record: file count, block count, any errors

Step 3: Distillation (Deduplication)
    └─► Run distill_chromadb.py (no Docker required)
    └─► Record: clusters found, blocks merged, reduction %

Step 4: Search Verification
    └─► Run at least 3 different test queries
    └─► Verify results are relevant (scores > 0.5)
    └─► Test both text and JSON output formats

Step 5: Benchmark (REQUIRED - DO NOT SKIP)          ◄── MANDATORY
    └─► Run: python3 scripts/run_benchmark.py --company "Company Name"
    └─► Record all metrics from output:
        - Vector Search Accuracy (X improvement)
        - Information Distillation (X reduction)
        - Aggregate Performance (X)
        - Enterprise Performance (X)
        - Token Efficiency (X)
        - Projected Annual Savings ($X)
    └─► Note the report file path for reference

Step 6: Documentation/Changelog
    └─► Create or update CHANGELOG.md in target directory
    └─► Include ALL metrics from Steps 2-5
    └─► Document any errors or issues encountered
    └─► Note any confusing steps for documentation improvement

Why Benchmark is Required

The benchmark compares IdeaBlocks performance against traditional chunking methods. Without running the benchmark:

  • You cannot quantify the improvement from using Blockify
  • You have no baseline for comparison
  • The value proposition cannot be demonstrated

Benchmark Output Metrics Explained

Metric                     What It Measures                                           Good Value
──────                     ────────────────                                           ──────────
Vector Search Accuracy     How much closer IdeaBlocks are to query intent vs chunks   > 2.0X
Information Distillation   Word count reduction while preserving meaning              > 1.2X
Aggregate Performance      Combined accuracy × distillation improvement               > 3.0X
Enterprise Performance     Aggregate × scale factor for enterprise workloads          > 40X
Token Efficiency           LLM token savings from using IdeaBlocks                    > 3.0X
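The two derived metrics are products of the others, which a short sketch makes concrete. The scale factor is not given in this document, so it is a required input here rather than a constant; `derive_metrics` and `below_good_value` are hypothetical helpers, not part of run_benchmark.py.

```python
# "Good value" floors from the table above (each must be strictly exceeded).
GOOD_VALUES = {
    "vector_search_accuracy": 2.0,
    "information_distillation": 1.2,
    "aggregate_performance": 3.0,
    "enterprise_performance": 40.0,
    "token_efficiency": 3.0,
}


def derive_metrics(
    accuracy_x: float, distillation_x: float, scale_factor: float, token_efficiency_x: float
) -> dict[str, float]:
    """Compute aggregate and enterprise figures from the measured inputs."""
    aggregate = accuracy_x * distillation_x  # accuracy × distillation
    return {
        "vector_search_accuracy": accuracy_x,
        "information_distillation": distillation_x,
        "aggregate_performance": aggregate,
        "enterprise_performance": aggregate * scale_factor,  # aggregate × scale factor
        "token_efficiency": token_efficiency_x,
    }


def below_good_value(metrics: dict[str, float]) -> list[str]:
    """List the metrics that do not exceed their 'good value' floor."""
    return [name for name, floor in GOOD_VALUES.items() if metrics[name] <= floor]
```

Note that inputs sitting exactly at their floors give 2.0 × 1.2 = 2.4X aggregate, so clearing the > 3.0X aggregate bar requires at least one component to exceed its own floor by a clear margin.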

Example Session (Complete Workflow)

# 1. Navigate to skill directory
cd /path/to/blockify-skill-for-claude-code/skills/blockify-integration

# 2. Create .env file with your API keys
cat > ../../.env << 'EOF'
BLOCKIFY_API_KEY=blk_your_key_here
OPENAI_API_KEY=sk-your_key_here
BLOCKIFY_PARALLEL_WORKERS=5
EOF

# 3. Load environment
export $(cat ../../.env | grep -v '^#' | grep -v '^$' | xargs)

# 4. Install dependencies
python3 scripts/setup_check.py --install

# 5. Ingest documents (parallel by default, 5 workers)
python3 scripts/ingest_to_chromadb.py /path/to/documents/ --batch

# Or use more workers for faster ingestion
python3 scripts/ingest_to_chromadb.py /path/to/documents/ --batch --parallel 10

# 6. Run distillation (no Docker needed)
python3 scripts/distill_chromadb.py

# 7. Search your knowledge base (run multiple test queries)
python3 scripts/search_chromadb.py "what are the key features?" --collection distilled
python3 scripts/search_chromadb.py "product benefits" --collection distilled
python3 scripts/search_chromadb.py "technical specifications" --collection distilled --json

# 8. Run benchmark (REQUIRED - generates HTML report with metrics)
python3 scripts/run_benchmark.py --company "Your Company Name"

# 9. View benchmark report
open data/reports/benchmark_report_*.html

# 10. Export results as JSON for further processing
python3 scripts/search_chromadb.py "important concepts" --json --limit 20 > results.json

Scale Considerations

Dataset Size        Recommended Approach         Storage   Search Time
────────────        ────────────────────         ───────   ───────────
< 1,000 blocks      JSON files                   ~10 MB    Instant
1K - 10K blocks     ChromaDB, no distill         ~50 MB    < 100ms
10K - 100K blocks   ChromaDB + distill           ~500 MB   < 100ms
100K+ blocks        ChromaDB + distill + FAISS   ~2 GB     < 50ms

Distillation time estimates (2,000+ blocks):

  • Pass 1 (within-document): ~30 seconds
  • Pass 2 (cross-document): ~10-15 minutes
  • Pass 3 (API merges): ~1-2 seconds per cluster
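The per-pass estimates above can be combined into a rough total. The sketch below simply adds the three ranges; it applies only to datasets near the stated 2,000-block size, since the Pass 1 and Pass 2 figures are not given per block. The cluster count is an input the reader must supply, and `distillation_time_range` is a hypothetical helper.

```python
def distillation_time_range(cluster_count: int) -> tuple[float, float]:
    """Return (low, high) total seconds for a roughly 2,000-block dataset."""
    pass1 = 30.0                                     # within-document: ~30 seconds
    pass2_low, pass2_high = 10 * 60.0, 15 * 60.0     # cross-document: ~10-15 minutes
    pass3_low = 1.0 * cluster_count                  # API merges: ~1 second per cluster
    pass3_high = 2.0 * cluster_count                 # API merges: ~2 seconds per cluster
    return (pass1 + pass2_low + pass3_low, pass1 + pass2_high + pass3_high)
```

For example, with an assumed 300 clusters the range is 930 to 1,530 seconds, or about 15.5 to 25.5 minutes.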
