# civil-judgment-taiwan-vectorstore

Ingest Taiwan civil court judgments (HTML or PDF) — Taiwan civil cases only — into Qdrant with Ollama embeddings, with traceability, deduplication, and incremental updates.
```shell
# Clone the skills repo and copy this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && \
  mkdir -p ~/.claude/skills && \
  cp -r "$T/skills/alex02131926/civil-judgment-taiwan-vectorstore" \
        ~/.claude/skills/clawdbot-skills-civil-judgment-taiwan-vectorstore && \
  rm -rf "$T"
```
`skills/alex02131926/civil-judgment-taiwan-vectorstore/SKILL.md`

# Taiwan Civil Judgment → Vector DB (Qdrant) Ingestion

Scope: Taiwan civil court judgments only (民事判決). This skill ingests Taiwan civil cases (HTML or PDF files) into Qdrant. All parsing, chunking, and embedding logic lives in `scripts/ingest.py` — your job is to run the script, not to reimplement the pipeline.
## Quick Start (follow these steps in order)

### Step 1 — Activate venv

```shell
source {baseDir}/.venv/bin/activate
```
### Step 2 — Identify the run folder

The user will provide an absolute path to a run folder. Example:

```
/path/to/output/judicialyuan/20260305_142030
```

Verify it exists and has HTML or PDF files:

```shell
ls <RUN_FOLDER>/archive/ | grep -E '\.(html|pdf)$' | head -5
```

If there are no `archive/*.html` or `archive/*.pdf` files → stop and tell the user the folder has no ingestible data.
### Step 3 — Run ingestion

Use absolute paths throughout — no `cd` needed:

```shell
python3 {baseDir}/scripts/ingest.py \
  --run-folder <RUN_FOLDER>
```

The script handles everything: pre-flight checks, collection auto-creation (creates `civil_case_doc` / `civil_case_chunk` if they don't exist), canonicalization, chunking, embedding, Qdrant upsert, manifest + report writing.

Re-running the same command on the same folder is always safe — deterministic IDs mean upsert = overwrite. No special `--resume` flag is needed; just run the same command again.
### Step 4 — Check the result

Successful output looks like:

```
OK files=42 processed=42 skipped=0 errored=0 doc_points=42 chunk_points=187 manifest=<RUN_FOLDER>/ingest_manifest.jsonl report=<RUN_FOLDER>/ingest_report.md
```

Read the report (human-readable stats summary):

```shell
cat <RUN_FOLDER>/ingest_report.md
```

If there are errors, check the manifest (machine-readable, one JSON line per file) for per-file diagnosis:

```shell
grep -E '"status":"(skipped|error|partial)"' <RUN_FOLDER>/ingest_manifest.jsonl
```
### Step 5 — Report to user

Tell the user:

- How many docs were ingested (`doc_points`)
- How many chunks were created (`chunk_points`)
- Whether any files were skipped or errored
- Where the report file is

Done. Do not proceed to additional steps unless the user asks.
## DO NOT rules (critical)

- DO NOT write your own HTML parsing, chunking, or embedding code. `ingest.py` handles all of this.
- DO NOT modify parsing/chunking logic casually. Only change heading detection or chunk fallback when the user explicitly asks to improve PDF/OCR robustness, and validate on a small sample before re-running a large batch.
- DO NOT call Qdrant or Ollama APIs directly. The script does this.
- DO NOT use `verify=False` or skip SSL verification for any HTTP request.
- DO NOT modify or delete files under `archive/`. Raw HTML is the immutable source of truth.
- DO NOT change chunking defaults (`--max-chars`, `--overlap-chars`) unless the user explicitly asks.
## Hard constraints

- Raw HTML/PDF is the source of truth; never overwrite it.
- Deterministic: same input → same canonical text → same SHA-256 → same Qdrant point IDs. Safe to re-run.
- Traceability: every Qdrant point carries `doc_url` + `local_path`.
- Batched upserts (≤ 64 points/batch) to stay under Qdrant's 32 MB payload limit.
- `parser_version` in every point's metadata. Current: `v3.5-sentence-boundary`.
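The determinism chain above (canonical text → SHA-256 → point ID) can be sketched in a few lines of Python. This is an illustration only, not the script's actual code; the namespace value and the `kind`/`index` naming are hypothetical:

```python
import hashlib
import uuid

# Hypothetical namespace UUID; the real script defines its own.
NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000000")

def point_id(canonical_text: str, kind: str, index: int = 0) -> str:
    """Derive a stable Qdrant point ID: the same canonical text always
    yields the same SHA-256, and therefore the same UUIDv5."""
    digest = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()
    return str(uuid.uuid5(NAMESPACE, f"{kind}:{digest}:{index}"))
```

Because the ID is a pure function of the canonical text, re-running an ingestion upserts onto the same points instead of creating duplicates, and the result is a valid UUID (which Qdrant requires).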
## Troubleshooting

### `PREFLIGHT_FAILED: Qdrant not reachable`

Qdrant is down or unreachable at the default/configured URL.

```shell
# Check if Qdrant is running
curl -s http://localhost:6333/collections | head -1
# If not running, start it (or ask the user)
```
### `PREFLIGHT_FAILED: Ollama not reachable`

```shell
# Check Ollama
curl -s http://localhost:11434/api/tags | head -5
```
### `PREFLIGHT_FAILED: Ollama model missing: bge-m3:latest`

```shell
ollama pull bge-m3:latest
```

Then re-run Step 3.
### `PREFLIGHT_FAILED: No archive/*.html or archive/*.pdf found`

The run folder exists but has no archived detail pages. Check:

- Is this the correct run folder?
### Output shows `skipped > 0` or `errored > 0`

Check `ingest_manifest.jsonl` for per-file details:

```shell
grep -E '"status":"(skipped|error|partial)"' "<RUN_FOLDER>/ingest_manifest.jsonl"
```
| Manifest status | Meaning | Action |
|---|---|---|
| `ok` | Doc + all chunks ingested | None |
| `partial` | Doc upserted, but some section chunks failed embedding | Check Ollama stability; can re-run safely |
| `skipped` | Doc-level embedding failed — nothing upserted for this doc | Check Ollama; re-run safely |
| `error` | HTML read/parse failed | Check if the HTML file is corrupted |
Re-running is always safe — use the exact same command. No special flags needed; deterministic IDs → upsert/overwrite.
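Re-runs are also cheap because upserts are batched. A minimal sketch of the ≤ 64-point batching described under Hard constraints (illustrative only; the script already implements this, so do not re-add it):

```python
from typing import Iterable, List

BATCH_SIZE = 64  # keeps each upsert request well under Qdrant's 32 MB payload limit

def batched(points: List[dict], size: int = BATCH_SIZE) -> Iterable[List[dict]]:
    """Yield successive slices of at most `size` points for upserting."""
    for start in range(0, len(points), size):
        yield points[start:start + size]
```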
## Override service endpoints

```shell
# Via environment variables
OLLAMA_URL=http://localhost:11434 QDRANT_URL=http://localhost:6333 \
  python3 scripts/ingest.py --run-folder "..."

# Via CLI flags (take precedence over env vars)
python3 scripts/ingest.py --run-folder "..." \
  --ollama http://localhost:11434 --qdrant http://localhost:6333
```
Default endpoints:

| Service | Default | Env override |
|---|---|---|
| Ollama | `http://localhost:11434` | `OLLAMA_URL` |
| Qdrant | `http://localhost:6333` | `QDRANT_URL` |
## Test with a small batch first

```shell
python3 scripts/ingest.py --run-folder "..." --limit 5
```
## Input folder structure (expected)

```
<run_folder>/
  archive/
    fjud_detail_001.html     ← HTML input
    fjud_detail_002.html
    fjud_detail_003.pdf      ← PDF input (also supported)
    fint_detail_001.html     (if system=both)
  results_fjud.jsonl         (optional)
  results_fint.jsonl         (optional)
```

The script discovers all `archive/*.html` and `archive/*.pdf` files automatically (sorted by filename). HTML and PDF files can coexist in the same run folder.

v1 limitation: The `system` metadata field is currently hardcoded to `FJUD`. If a run folder contains both FJUD and FINT files, FINT files will be ingested but mislabeled as FJUD. This does not affect chunking or embeddings — only the `system` metadata field on the resulting Qdrant points.
## CLI reference

```shell
python3 scripts/ingest.py --run-folder <PATH> [options]
```

| Flag | Default | Description |
|---|---|---|
| `--run-folder` | (required) | Path to an input folder |
| `--ollama` | `OLLAMA_URL` or `http://localhost:11434` | Ollama endpoint |
| `--qdrant` | `QDRANT_URL` or `http://localhost:6333` | Qdrant endpoint |
| | `bge-m3:latest` | Ollama embedding model |
| | | Vector dimension |
| `--max-chars` | | Max chars per chunk (500–1000) |
| `--overlap-chars` | | Overlap between chunks (10–20% of `--max-chars`) |
| `--limit` | (no limit) | Process only the first N files, sorted by filename (lexicographic order); for testing |
## Outputs

- Qdrant collections: `civil_case_doc` (1 point/doc), `civil_case_chunk` (many points/doc). Auto-created if they don't exist.
- `ingest_report.md`: human-readable summary (doc/chunk counts, error counts). Read this first after ingestion.
- `ingest_manifest.jsonl`: machine-readable, one JSON line per doc with status (`ok`/`partial`/`skipped`/`error`). Read this to diagnose specific file failures (grep for non-`ok` statuses). Both files overlap on aggregate counts; the manifest adds per-file detail.
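If you want a structured summary rather than a grep, a few lines of Python can tally the manifest. This sketch assumes only what is stated above: one JSON object per line, each with a `status` field.

```python
import json
from collections import Counter
from pathlib import Path

def manifest_status_counts(manifest_path: str) -> Counter:
    """Count per-file statuses (ok/partial/skipped/error) in ingest_manifest.jsonl."""
    counts = Counter()
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            counts[json.loads(line)["status"]] += 1
    return counts
```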
## Roadmap

- v1 (current): doc + section-aware chunks
- v2: candidate issue extraction (爭點抽取)
- v3: issue-level index (`civil_case_issue` collection)
## Internal details

For the metadata schema, canonicalization rules, section-splitting patterns, and chunking implementation, see `references/internals.md`.
## Lessons learned / operational gotchas

- Qdrant rejects non-UUID/non-integer point IDs (`400 Bad Request`). The script uses deterministic UUIDs — do not change the ID generation logic.
- Qdrant rejects payloads > 32 MB. The script batches at 64 points — do not increase the batch size.
- Re-running on the same folder is safe: deterministic IDs mean upsert = overwrite.
- Taiwan judgment section headings are inconsistently formatted (e.g. 「理 由」 with a fullwidth space, or compatibility characters such as 「⽂」). The parser normalizes headings first; if it still cannot split out sections, it falls back to chunking the `full` text, so a doc is never left with only doc-level points.
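The heading inconsistency above can be illustrated with a minimal normalization sketch. This is not the script's code (the real patterns live in `references/internals.md`); it only shows why NFKC folding plus whitespace stripping makes variant headings compare equal:

```python
import re
import unicodedata

def normalize_heading(raw: str) -> str:
    """Fold compatibility characters (e.g. Kangxi radical '⽂' -> '文') with NFKC,
    then drop spacing so variants like '理 由' and '理由' compare equal."""
    folded = unicodedata.normalize("NFKC", raw)
    # Remove ASCII whitespace and any remaining fullwidth (ideographic) spaces
    # inserted purely for visual alignment in the judgment text.
    return re.sub(r"[\s\u3000]+", "", folded)
```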