OpenSpace resilient-context-extraction
Ensures agents extract data from context files with validation and fallback strategies before resorting to assumptions or external searches.
git clone https://github.com/HKUDS/OpenSpace
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/prioritize-context-data-enhanced" ~/.claude/skills/hkuds-openspace-resilient-context-extraction && rm -rf "$T"
gdpval_bench/skills/prioritize-context-data-enhanced/SKILL.md

Resilient Context Extraction
Objective
Prevent data hallucination and inefficiency by mandating that agents inspect, validate, and fully extract data from provided reference files before attempting web searches or generating synthetic data. When extraction is incomplete, use fallback strategies before making assumptions.
Critical Rule
If a reference file is provided in the task context, it is the source of truth. Do not fabricate data or search the web for information that may exist within the provided attachments. If extraction appears incomplete, attempt alternative methods before proceeding.
Workflow Steps
1. Scan Context for Attachments
At the start of every task, explicitly list all files provided in the context window or attachment panel.
- Check for spreadsheets (`.csv`, `.xlsx`), documents (`.pdf`, `.docx`, `.pptx`), or data dumps (`.json`, `.txt`, `.xml`, `.yaml`).
- Note the filename, file size (if available), and inferred content type.
- Record this list for later verification.
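The inventory step can be sketched as follows, assuming attachments land in a local directory (the directory path and the helper name are hypothetical, not part of any agent API):

```python
import os
import mimetypes

def list_attachments(context_dir):
    """Inventory every file in the (hypothetical) attachment directory,
    recording filename, size, and inferred content type."""
    inventory = []
    for name in sorted(os.listdir(context_dir)):
        path = os.path.join(context_dir, name)
        if not os.path.isfile(path):
            continue
        mime, _ = mimetypes.guess_type(name)
        inventory.append({
            "filename": name,
            "size_bytes": os.path.getsize(path),
            "inferred_type": mime or "unknown",
        })
    return inventory
```

The resulting list is what Step 5 later checks citations against.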
2. Evaluate Relevance
Determine if any provided file contains the data required to complete the task.
- Match Keywords: Do filenames or expected column headers match task requirements?
- Check Scope: Does the data cover the necessary timeframe or region?
- Prioritize: Rank files by likelihood of containing required data.
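A minimal sketch of the ranking step, using a simple keyword-overlap score (the scoring scheme is illustrative, not prescribed by the skill):

```python
def rank_files_by_relevance(filenames, task_keywords):
    """Rank candidate files by how many task keywords their names contain."""
    def score(name):
        lowered = name.lower()
        return sum(1 for kw in task_keywords if kw.lower() in lowered)
    # Highest keyword overlap first; sorted() is stable, so ties keep order.
    return sorted(filenames, key=score, reverse=True)
```

In practice you would also score expected column headers and content snippets, not just filenames.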
3. Extract Data with Validation
If relevant files are found:
- Read the file content using appropriate tools (e.g., `read_file`, `pandas`, `pdf_reader`).
- Validate Extraction Completeness (NEW CRITICAL STEP):
- Check if output appears truncated (e.g., sudden cutoff mid-sentence, character limits hit).
- Compare expected data points vs. extracted data points (e.g., "Task mentions pricing tiers; did extraction include pricing numbers?").
- Look for structural indicators of incompleteness (e.g., unclosed tables, missing document endings).
- If extraction is incomplete or suspicious, proceed to Step 4 (Fallback Strategies) before using the data.
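The truncation checks above can be sketched as simple heuristics; this helper and its thresholds are illustrative assumptions, not a fixed part of the workflow:

```python
import re

def looks_truncated(text, expected_keywords=(), char_limit=5000):
    """Return a list of truncation indicators that fired (empty = none)."""
    reasons = []
    stripped = text.rstrip()
    # Heuristic 1: output ends mid-sentence (no terminal punctuation).
    if stripped and not re.search(r"[.!?:)\"']$", stripped):
        reasons.append("ends mid-sentence")
    # Heuristic 2: output sits at or beyond a known character limit.
    if len(text) >= char_limit:
        reasons.append("hit %d-char limit" % char_limit)
    # Heuristic 3: expected data points are absent from the output.
    missing = [kw for kw in expected_keywords if kw.lower() not in text.lower()]
    if missing:
        reasons.append("missing expected terms: %s" % missing)
    return reasons
```

Any non-empty result should route the file through Step 4 before its contents are used.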
4. Apply Fallback Extraction Strategies
When initial extraction is incomplete or critical data is missing, attempt these strategies in order before making assumptions:
4a. Alternative Library/Tool Approach
- For `.docx`: Try `python-docx` directly via `execute_code_sandbox` if `read_file` was truncated.
- For `.xlsx`/`.csv`: Try `pandas` with explicit sheet selection or `openpyxl` directly.
- For `.pdf`: Try `pdfplumber`, `PyPDF2`, or shell-based `pdftotext` if available.
- For any format: Try reading as raw bytes/text via shell commands (`cat`, `xxd`, `strings`).
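The raw-bytes fallback also works without shell tools; a stdlib-only sketch that mimics the `strings` utility (the 4-character minimum mirrors the `strings` default):

```python
import re

def extract_printable_strings(path, min_len=4):
    """Pull runs of printable ASCII out of an arbitrary binary file,
    roughly what the `strings` utility does."""
    with open(path, "rb") as f:
        data = f.read()
    # Runs of printable ASCII (space through tilde) of at least min_len bytes.
    pattern = rb"[ -~]{%d,}" % min_len
    return [run.decode("ascii") for run in re.findall(pattern, data)]
```

This is a last-resort pass: it recovers fragments, not structure, so treat its output as evidence of what the file contains rather than as the extraction itself.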
4b. Shell-Based Extraction
- Use `shell_agent` or `run_shell` for format-specific tools:
  - `unzip -p file.docx word/document.xml | xmllint --format -` (extract raw docx XML)
  - `in2csv file.xlsx` (convert Excel to CSV via shell)
  - `pdftotext file.pdf -` (extract PDF text via command line)
- This bypasses potential sandbox or library limitations in `read_file`.
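The same docx trick is available in pure Python when shell tools are missing, since a `.docx` is a zip archive; a stdlib-only sketch (the crude tag-strip is an assumption of this example, not equivalent to `python-docx`):

```python
import re
import zipfile

def extract_docx_xml_text(path):
    """Read word/document.xml straight out of a .docx (a zip archive)
    and strip the XML tags, leaving the raw text content."""
    with zipfile.ZipFile(path) as z:
        xml = z.read("word/document.xml").decode("utf-8")
    # Crude tag-strip; acceptable as a fallback, loses table structure.
    return re.sub(r"<[^>]+>", "", xml)
```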
4c. Targeted Re-Reading
- If specific sections are missing (e.g., "pricing table on page 3"), attempt to re-read with focus on that region.
- Use tools that support page/section ranges if available.
4d. Document What's Missing
- Before any assumptions or external searches, explicitly document:
- What data was expected but not found.
- What extraction methods were attempted.
- Why each method failed or what portions remain unavailable.
- Example: "Attempted `read_file` on `Pricing_email.docx` (truncated at 980 chars). Attempted `python-docx` extraction via sandbox; pricing table in section 2 not present in file. Pricing numbers unavailable from provided context."
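The documentation step can be kept as a small structured log; the field names here are illustrative, not a required schema:

```python
def record_failed_extraction(filename, expected, attempts):
    """Build the 'what is missing' record before any assumption is made."""
    return {
        "file": filename,
        "expected_but_missing": expected,
        "attempts": attempts,  # list of (method, outcome) pairs
        "status": "data unavailable from provided context",
    }
```

For the example above: `record_failed_extraction("Pricing_email.docx", ["pricing tiers"], [("read_file", "truncated at 980 chars"), ("python-docx", "table not present")])`.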
5. Cite Source Explicitly
When presenting data in the final output:
- Explicitly state which file the data came from.
- Example: "According to `Massabama_active_listings.xlsx`..."
- If fallback methods were used, note this: "Extracted via shell-based `pdftotext` from `report.pdf`..."
- This verifies to the user that real data was used, not hallucinated.
6. Fallback to Search (Last Resort Only)
If all extraction strategies fail to provide the specific data needed:
- State clearly what was missing from the files.
- Document all extraction methods attempted and why they failed.
- Then proceed with web search or estimation.
- Mark any non-file data clearly as "External Search" or "Estimated (file data unavailable)".
- Never present estimated data as if it came from the file.
Checklist
- Did I list all attached files with their types?
- Did I open and read the relevant files?
- Did I validate extraction completeness (check for truncation, missing sections)?
- If extraction was incomplete, did I try at least one fallback strategy?
- Did I explicitly document what data is unavailable before making assumptions?
- Did I cite the file name (and extraction method if non-standard) in my output?
- Did I avoid fabricating numbers that should have come from the file?
- Is any external/estimated data clearly marked as such?
Example Usage
Task: Create a report on active property listings. Context:
Massabama_active_listings.xlsx is attached.
Incorrect Approach:
- Ignore the Excel file.
- Search web for "Massabama property listings".
- Fabricate numbers based on search snippets.
- Result: Inaccurate data, ignored source of truth.
Correct Approach:
- Identify `Massabama_active_listings.xlsx` in context.
- Load the Excel file via `read_file` or `pandas`.
- Validate: Check row count matches expected scope, verify all columns present.
- Count rows and sum prices directly from the sheet.
- Output: "Based on the 50 entries in `Massabama_active_listings.xlsx`, the total value is..."
- Result: Accurate data, verified source.
Task: Extract pricing tiers from `Pricing_email.docx`.
Context: `Pricing_email.docx` attached, but `read_file` output cuts off mid-sentence.
Incorrect Approach:
- Accept truncated extraction as complete.
- Assume pricing tiers based on incomplete info.
- Proceed with fabricated numbers.
Correct Approach:
- Identify truncation: `read_file` output ends at 980 chars, mid-sentence.
- Attempt fallback: Use `execute_code_sandbox` with `python-docx` to read the full document.
- If fallback also incomplete: Document "Pricing table in section 2 not extractable; attempted `read_file` (truncated) and `python-docx` (table structure not preserved)."
- Only then: Either request clarification from the user, or mark pricing as "Estimated (file data unavailable)" if the task requires proceeding.
- Result: User informed of limitation; no false claim of file-sourced data.
Warnings
- Hallucination Risk: Ignoring provided files OR accepting incomplete extractions without validation are primary causes of data fabrication.
- Efficiency: Reading a local file is faster and more reliable than web scraping—but only if extraction is complete.
- User Expectation: Users attach files expecting them to be used. Ignoring them or failing to extract fully is a failure of instruction following.
- Truncation Blind Spot: Many tools have character/content limits. Always verify output completeness against expected data scope.
Tool-Specific Notes
| Tool | Known Limitations | Fallback Strategy |
|---|---|---|
| `read_file` | May truncate at ~1000-5000 chars depending on format | Use `execute_code_sandbox` with a format-specific library |
| `execute_code_sandbox` | May have missing dependencies or sandbox errors | Use `shell_agent` or `run_shell` for CLI tools |
| `shell_agent` / `run_shell` | Slower, but more flexible with system tools | Use for `pdftotext`, XML extraction, `in2csv`, etc. |
Recovery Protocol
If all extraction attempts fail:
- Stop and assess: Is this data critical to task completion?
- Document all failed attempts with specific error messages or observations.
- Communicate to user: "Critical data from [filename] could not be extracted despite [N] attempts using [methods]. Please clarify or provide alternative source."
- Proceed cautiously: If task must continue, clearly demarcate any estimated/synthesized data.
examples/extraction_fallback.sh

```bash
#!/bin/bash
# Fallback extraction strategies for common file formats.
# Use when read_file produces incomplete results.

# Extract raw XML from a .docx (a .docx is a zip archive).
extract_docx_raw() {
    local file="$1"
    unzip -p "$file" word/document.xml 2>/dev/null | xmllint --format - 2>/dev/null
}

# Convert Excel to CSV using in2csv (from csvkit).
extract_xlsx_to_csv() {
    local file="$1"
    if command -v in2csv &>/dev/null; then
        in2csv "$file" 2>/dev/null
    else
        echo "in2csv not available; try python pandas approach"
        return 1
    fi
}

# Extract text from a PDF using pdftotext.
extract_pdf_text() {
    local file="$1"
    if command -v pdftotext &>/dev/null; then
        pdftotext "$file" - 2>/dev/null
    else
        echo "pdftotext not available; try PyPDF2 via Python sandbox"
        return 1
    fi
}

# Extract printable strings from any binary file.
extract_raw_strings() {
    local file="$1"
    if command -v strings &>/dev/null; then
        strings "$file" 2>/dev/null | head -500
    else
        echo "strings not available"
        return 1
    fi
}

# Usage: ./extraction_fallback.sh <command> <file>
# Commands: docx_raw, xlsx_csv, pdf_text, raw_strings
case "$1" in
    docx_raw)    extract_docx_raw "$2" ;;
    xlsx_csv)    extract_xlsx_to_csv "$2" ;;
    pdf_text)    extract_pdf_text "$2" ;;
    raw_strings) extract_raw_strings "$2" ;;
    *)
        echo "Usage: $0 <docx_raw|xlsx_csv|pdf_text|raw_strings> <file>"
        exit 1
        ;;
esac
```