OpenSpace spreadsheet-validated-exec

Execute Python scripts for spreadsheets with prerequisite data validation and source accessibility checks

install

source · Clone the upstream repo

git clone https://github.com/HKUDS/OpenSpace

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/spreadsheet-direct-python-enhanced-enhanced-f0b1db" ~/.claude/skills/hkuds-openspace-spreadsheet-validated-exec && rm -rf "$T"

manifest: gdpval_bench/skills/spreadsheet-direct-python-enhanced-enhanced-f0b1db/SKILL.md

source content

Validated Python Execution for Spreadsheet Tasks

Overview

This skill extends direct Python execution for spreadsheet operations by adding a mandatory data validation phase before any processing begins. This prevents wasted iterations on inaccessible data sources and provides clear error documentation when data cannot be accessed.

When to Use This Skill

Use validated direct

run_shell

with Python scripts for spreadsheet operations when:

Reading or writing complex Excel files with multiple sheets
Data sources have already been validated as accessible
You have fallback sources identified in case of access failures
Applying formulas, formatting, or data transformations
Working with
```
openpyxl
```
,
```
pandas
```
, or similar libraries
The operation involves multiple steps that could exceed agent step limits
You need precise control over error handling and debugging
Complex scripts benefit from file-based execution for better reliability

Why Validated Direct Execution?

Beyond standard shell_agent limitations, unvalidated data access causes:

Wasted iterations attempting to process non-existent data
Unclear error messages when source files are inaccessible
No graceful degradation when primary sources fail
Missing documentation of why data operations couldn't complete

Direct

run_shell

with Python validation is more reliable because it:

Executes validation in a single step with no iteration limits
Provides clearer, immediate error messages for access failures
Handles complex operations without step constraints
Gives full control over library imports and execution flow
Writing scripts to
```
.py
```
files first avoids shell_agent parsing issues with heredocs
Documents all access failures for troubleshooting and reporting

Phase 0: Data Source Validation (REQUIRED)

Before writing any spreadsheet processing code, verify your data sources are accessible:

Step 0.1: Verify Data Availability

import os
import requests
from pathlib import Path

# For local files
def verify_local_file(file_path):
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"Data file not found: {file_path}")
    if not path.is_file():
        raise ValueError(f"Path is not a file: {file_path}")
    if path.stat().st_size == 0:
        raise ValueError(f"Data file is empty: {file_path}")
    print(f"✓ Local file verified: {file_path} ({path.stat().st_size} bytes)")
    return True

# For remote URLs
def verify_remote_url(url, timeout=30):
    try:
        # Try HEAD request first (lighter than GET)
        response = requests.head(url, timeout=timeout, allow_redirects=True)
        if response.status_code == 405:  # HEAD not allowed, try GET
            response = requests.get(url, timeout=timeout, stream=True)
        response.raise_for_status()
        
        # Check content type if available
        content_type = response.headers.get('Content-Type', '')
        if 'text/html' in content_type and 'excel' not in url.lower():
            print(f"⚠ Warning: URL may return HTML, not data file")
        
        print(f"✓ Remote URL verified: {url} (Status: {response.status_code})")
        return True
    except requests.exceptions.SSLError as e:
        print(f"✗ SSL Error: {str(e)}")
        return False
    except requests.exceptions.ConnectionError as e:
        print(f"✗ Connection Error: {str(e)}")
        return False
    except requests.exceptions.Timeout as e:
        print(f"✗ Timeout Error: {str(e)}")
        return False
    except Exception as e:
        print(f"✗ Verification Failed: {str(e)}")
        return False

Step 0.2: Handle Inaccessible Sources

If primary data source is unavailable:

Check alternative locations:
- Is data available from a backup URL?
- Is there a cached/local copy available?
- Can data be obtained from an API instead of web scraping?
- Is there an alternative data provider?

Document the failure:

def log_access_failure(source, error_type, timestamp=None):
    from datetime import datetime
    if not timestamp:
        timestamp = datetime.now().isoformat()
    
    error_log = {
        'timestamp': timestamp,
        'source': source,
        'error_type': error_type,
        'alternatives_attempted': [],
        'resolution': 'pending'
    }
    
    # Save to error log file
    import json
    with open('data_access_errors.json', 'a') as f:
        f.write(json.dumps(error_log) + '\n')
    
    return error_log

Implement fallback strategy:

def get_data_with_fallback(primary_source, fallback_sources):
    sources = [primary_source] + fallback_sources
    
    for i, source in enumerate(sources):
        print(f"Attempting source {i+1}/{len(sources)}: {source}")
        
        if source.startswith('http'):
            if verify_remote_url(source):
                return download_data(source)
        else:
            if verify_local_file(source):
                return load_local_data(source)
        
        print(f"Source {source} unavailable, trying next...")
    
    raise Exception(f"All {len(sources)} data sources unavailable")

Step 0.3: Pre-Execution Checklist

Before proceeding to spreadsheet operations, confirm:

Primary data source is accessible (file exists / URL responds)
Data file is not empty (has content to process)
Required permissions are in place (file not locked, API keys valid)
Alternative sources identified if primary fails
Error logging mechanism is configured

If any check fails: Do NOT proceed to spreadsheet processing. Report the blocking issue and either:

Resolve the data access problem first
Switch to an alternative data source
Abort the task with clear error documentation

How to Use (With Validation)

Complete Workflow Template

Recommended Pattern: Validate → Process → Report

# Step 1: Write validation + processing script to file
cat > validated_spreadsheet_process.py << 'EOF'
import sys
import os
from pathlib import Path
import pandas as pd
from openpyxl import load_workbook

# === PHASE 0: VALIDATION ===
def validate_sources():
    sources_to_check = [
        ('input_data.xlsx', 'local'),
        # ('https://backup-source.com/data.xlsx', 'remote')
    ]
    
    validated_sources = []
    for source, source_type in sources_to_check:
        try:
            if source_type == 'local':
                if not Path(source).exists():
                    print(f"ERROR: Local file not found: {source}")
                    continue
                if Path(source).stat().st_size == 0:
                    print(f"ERROR: Local file is empty: {source}")
                    continue
            validated_sources.append(source)
            print(f"✓ Validated: {source}")
        except Exception as e:
            print(f"ERROR validating {source}: {e}")
    
    if not validated_sources:
        print("FATAL: No valid data sources available. Aborting.")
        sys.exit(1)
    
    return validated_sources

# === PHASE 1: PROCESSING ===
def process_data(source_file):
    # Your spreadsheet operations here
    pass

# === PHASE 2: REPORTING ===
def generate_report(success, details):
    report = {
        'status': 'success' if success else 'failed',
        'details': details
    }
    print(f"Report: {report}")
    return report

if __name__ == '__main__':
    try:
        # Validate first
        sources = validate_sources()
        
        # Process validated sources
        for source in sources:
            process_data(source)
        
        generate_report(True, f"Processed {len(sources)} sources")
        
    except Exception as e:
        generate_report(False, str(e))
        sys.exit(1)
EOF

# Step 2: Execute the validated script
python3 validated_spreadsheet_process.py

Pattern 1: Write Script to File First (Complex Operations)

For complex multi-line scripts with multiple data sources:

# Step 1: Write the Python script to a file
cat > process_spreadsheet.py << 'EOF'
import openpyxl
from openpyxl import Workbook
from pathlib import Path
import sys

# Validate first
input_file = Path('file.xlsx')
if not input_file.exists():
    print(f"ERROR: Input file not found: {input_file}")
    sys.exit(1)
if input_file.stat().st_size == 0:
    print(f"ERROR: Input file is empty: {input_file}")
    sys.exit(1)

# Your spreadsheet code here
wb = openpyxl.load_workbook('file.xlsx')
# ... operations ...
wb.save('output.xlsx')
print('Success')
EOF

# Step 2: Execute the script
python3 process_spreadsheet.py

Pattern 2: Inline Heredoc (Simple Scripts with Validation)

For short scripts when NOT using shell_agent, include validation inline:

# Include quick validation before processing
python3 << 'EOF'
import os
import sys

# Quick validation
input_file = 'data.xlsx'
if not os.path.exists(input_file):
    print(f"ERROR: {input_file} not found")
    sys.exit(1)
if os.path.getsize(input_file) == 0:
    print(f"ERROR: {input_file} is empty")
    sys.exit(1)

# Continue with processing...
EOF

Examples

Example 1: Validated Read and Transform

With prerequisite validation:

import pandas as pd
import sys
from pathlib import Path

# Validate input file exists and has content
input_path = Path('input.xlsx')
if not input_path.exists():
    print(f"ERROR: Input file not found: {input_path}")
    sys.exit(1)
if input_path.stat().st_size == 0:
    print(f"ERROR: Input file is empty: {input_path}")
    sys.exit(1)

# Now safe to process
df = pd.read_excel('input.xlsx', sheet_name='Revenue')

# Apply transformations
df['Net_Revenue'] = df['Gross_Revenue'] * (1 - df['Tax_Rate'])

# Save results
df.to_excel('output.xlsx', index=False, sheet_name='Processed')

Example 2: Multi-Sheet with Source Validation

from openpyxl import load_workbook
import sys
from pathlib import Path

# Validate workbook accessibility
wb_path = Path('tour_data.xlsx')
if not wb_path.exists():
    print(f"ERROR: Workbook not found: {wb_path.absolute()}")
    sys.exit(1)

try:
    wb = load_workbook('tour_data.xlsx')
except Exception as e:
    print(f"ERROR: Cannot open workbook: {e}")
    print("Possible causes: file locked, corrupted, or wrong format")
    sys.exit(1)

# Iterate through sheets
for sheet_name in wb.sheetnames:
    ws = wb[sheet_name]
    # Apply formatting or calculations
    for row in ws.iter_rows(min_row=2, max_col=5):
        # Process cells
        pass

wb.save('tour_data_processed.xlsx')

Example 3: Formatting with Pre-Checks

from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment
from pathlib import Path
import sys

# Pre-check file
report_path = Path('report.xlsx')
if not report_path.exists():
    print(f"ERROR: Report file not found: {report_path}")
    sys.exit(1)

wb = load_workbook('report.xlsx')
ws = wb.active

# Apply header styling
header_fill = PatternFill(start_color='4472C4', fill_type='solid')
header_font = Font(bold=True, color='FFFFFF')

for cell in ws[1]:
    cell.fill = header_fill
    cell.font = header_font
    cell.alignment = Alignment(horizontal='center')

wb.save('report_formatted.xlsx')

Example 4: Comprehensive Error Handling with Logging

import sys
import json
from datetime import datetime
from pathlib import Path
from openpyxl import load_workbook

def log_error(source, error_msg, error_type):
    """Log errors for later analysis and reporting"""
    error_record = {
        'timestamp': datetime.now().isoformat(),
        'source': source,
        'error_type': error_type,
        'message': error_msg
    }
    
    with open('processing_errors.log', 'a') as f:
        f.write(json.dumps(error_record) + '\n')
    
    return error_record

try:
    # Validate file before opening
    if not Path('data.xlsx').exists():
        log_error('data.xlsx', 'File not found', 'ValidationError')
        raise FileNotFoundError('data.xlsx')
    
    wb = load_workbook('data.xlsx')
    ws = wb.active
    
    # Your operations here
    value = ws['A1'].value
    
    wb.save('output.xlsx')
    print(f"Success: Processed {ws.max_row} rows")
    
except FileNotFoundError as e:
    log_error('data.xlsx', str(e), 'FileNotFoundError')
    print(f"ERROR: {str(e)}", file=sys.stderr)
    print("ACTION: Verify file path and permissions")
    sys.exit(1)
except PermissionError as e:
    log_error('data.xlsx', str(e), 'PermissionError')
    print(f"ERROR: Permission denied - file may be open in another application")
    print("ACTION: Close the file in other applications and retry")
    sys.exit(1)
except Exception as e:
    log_error('data.xlsx', str(e), type(e).__name__)
    print(f"ERROR: {str(e)}", file=sys.stderr)
    sys.exit(1)

Best Practices (Enhanced)

Always validate before processing: Check data source availability before any spreadsheet operations
Log all access failures: Document when and why data sources become unavailable
Identify fallbacks upfront: Know your alternative sources before starting
Fail fast on validation errors: Don't waste iterations on impossible operations
Prefer file-based execution for complex scripts: write to
```
.py
```
file first, then execute via
```
run_shell
```
Import only needed libraries to reduce execution time
Print clear success/error messages for debugging
Save intermediate results for complex multi-step transformations
Test with small data before scaling to large spreadsheets
Use pandas for data manipulation and openpyxl for formatting when both are needed
Clean up temporary script files after execution if they won't be reused

Data Access Failure Protocols

When Primary Source Fails

Immediate actions:
- Log the failure with timestamp, source, and error type
- Check if error is temporary (timeout) or permanent (404, file doesn't exist)
- Attempt alternative sources if available

For temporary failures (timeouts, 5xx errors):

def retry_with_backoff(operation, max_retries=3, base_delay=1):
    import time
    for attempt in range(max_retries):
        try:
            return operation()
        except (TimeoutError, ConnectionError) as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Retry {attempt + 1}/{max_retries} after {delay}s delay")
            time.sleep(delay)

For permanent failures (file not found, 404):
- Switch to identified fallback source immediately
- Document the primary source failure
- Continue with fallback if available, otherwise abort cleanly

Error Reporting Template

def report_data_access_issue(issue_type, source, details, alternatives_tried=None):
    """Standardize error reporting for data access failures"""
    from datetime import datetime
    
    report = f"""
=== DATA ACCESS FAILURE REPORT ===
Type: {issue_type}
Source: {source}
Time: {datetime.now().isoformat()}
Details: {details}
Alternatives Attempted: {alternatives_tried or 'None'}
Recommended Action: {get_recommended_action(issue_type)}
==================================
"""
    print(report)
    return report

def get_recommended_action(issue_type):
    actions = {
        'FileNotFound': 'Verify file path, check if file was moved or deleted',
        'ConnectionError': 'Check network connectivity, try alternative source',
        'Timeout': 'Increase timeout, check source server status',
        'PermissionError': 'Check file permissions, ensure file not open elsewhere',
        'SSLError': 'Verify SSL certificate, try HTTP if appropriate',
        'default': 'Review error details, check source availability'
    }
    return actions.get(issue_type, actions['default'])

When NOT to Use This Skill

Simple single-cell reads/writes (validation overhead not warranted)
Operations that require interactive user input
Tasks where you need the agent to iteratively refine the approach
When you have no way to validate data sources (skip to error handling only)

Common Libraries

Library	Best For	Also Use For
`requests`	Web requests	Source URL validation, health checks
`openpyxl`	Reading/writing .xlsx files, formatting, formulas	-
`pandas`	Data manipulation, analysis, merging datasets	Chunked reading for large files
`xlrd`	Reading older .xls files (read-only)	-
`xlsxwriter`	Creating new .xlsx files with advanced formatting	-

Troubleshooting (Enhanced)

Issue: Data source unavailable / file not found

Solution: Run validation check first before processing. If validation fails:
1. Verify the path/URL is correct
2. Check if file was moved, renamed, or deleted
3. Try alternative sources if available
4. Document the failure and abort cleanly rather than proceeding

Issue: Source was accessible yesterday but not today

Solution: Implement fallback sources and log the change. Consider:
- Setting up monitoring for critical data sources
- Caching important data locally when possible
- Having contact information for data source maintainers

Issue: Validation passes but processing fails

Solution: Validation checks accessibility, not content validity. Add:
- Schema validation (check expected columns exist)
- Content validation (check data types, ranges)
- Sample verification (read first few rows to confirm structure)

Issue: Heredoc syntax fails with 'unknown error' when using shell_agent

Solution: Write the Python script to a
```
.py
```
file first with full validation logic, then execute with
```
python3 script.py
```
. This is significantly more reliable than inline heredoc when shell_agent is the executor.

Issue: FileNotFoundError (after validation passed)

Solution: File may have been moved/deleted between validation and processing, or path is relative and working directory changed. Use absolute paths.

Issue: PermissionError

Solution: Ensure the file is not open in another application. On Linux/Mac, check with
```
lsof | grep filename
```

Issue: ConnectionError / Timeout on remote sources

Solution: Check network connectivity, try with increased timeout, use retry logic with exponential backoff, or switch to cached/local copy if available

Issue: SSL Certificate errors

Solution: Verify the certificate is valid. As last resort for trusted internal sources only:
```
requests.get(url, verify=False)
```
- but log this as a security concern

Issue: All alternative sources fail

Solution: Abort with comprehensive error report documenting:
- All sources attempted
- Error types for each
- Timestamp of failures
- Recommended next steps for human operator

Issue: MemoryError on large files

Solution: Process data in chunks using pandas
```
chunksize
```
parameter. Validate chunk size during initial validation phase.

Issue: Formatting not applying

Solution: Ensure you're modifying cell styles before saving, and use
```
.copy()
```
for style objects. Verify workbook is not in read-only mode.