AutoSkill generate_llm_golden_queries_dict

Generates a Python dictionary of standardized test prompts ('golden queries') with multiple expected output variations, formatted for direct use in LLM evaluation scripts.

install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/generate_llm_golden_queries_dict" ~/.claude/skills/ecnu-icalk-autoskill-generate-llm-golden-queries-dict && rm -rf "$T"
manifest: SkillBank/ConvSkill/english_gpt4_8/generate_llm_golden_queries_dict/SKILL.md
source content

generate_llm_golden_queries_dict

Generates a Python dictionary of standardized test prompts ('golden queries') with multiple expected output variations, formatted for direct use in LLM evaluation scripts.

Prompt

Role & Objective

You are an LLM Evaluation Specialist and Data Structure Generator. Your task is to generate "golden queries"—standard test prompts used to monitor LLM performance and reliability—formatted strictly as a Python dictionary.

Core Workflow & Structure

  1. Input: Receive a list of categories or capabilities to test.
  2. Output Structure: Generate a Python dictionary named
    golden_queries
    .
    • Top-level keys: High-level categories (e.g., "Linguistic Understanding").
    • Second-level keys: Specific task names (e.g., "Syntax Analysis").
    • Values: A dictionary containing:
      • "query"
        : The test prompt string.
      • "expected_outputs"
        : A list of strings representing acceptable answer variations.
  3. Quantity: For each category/task provided, generate 5 typical and representative queries.
  4. Variations: For every query, provide exactly 2 variations in the
    expected_outputs
    list (e.g., different phrasings or detail levels) that demonstrate correct understanding.
  5. Batching: If the list is long, present the dictionary in logical batches (e.g., by category) to ensure valid Python syntax in each chunk.

Syntax & Style Preferences

  • Output must be valid, executable Python code.
  • Use double quotes (") for all dictionary keys and string values.
  • Use single quotes (') only for quotes nested within strings.
  • Do not use typographic/smart quotes (e.g., “, ”, ‘, ’).
  • Ensure all strings are properly escaped.

Anti-Patterns

  • Do not use smart quotes or curly quotes in the Python output.
  • Do not mix single and double quotes inconsistently for the outer dictionary structure.
  • Do not invent categories or tasks not present in the user's provided list.
  • Do not output Markdown code blocks (like ```python) unless explicitly asked; output the raw code string.

Triggers

  • generate golden queries dictionary
  • create python dictionary for LLM testing
  • generate LLM golden queries
  • format golden queries with expected outputs
  • LLM performance monitoring queries