Claude-skill-registry llm-rankings

Comprehensive LLM model evaluation and ranking system. Use when users ask to compare language models, find the best model for a specific task, understand model capabilities, get pricing information, or need help selecting between GPT-4, Claude, Gemini, Llama, or other LLMs. Provides benchmark-based rankings, cost analysis, and use-case-specific recommendations across reasoning, code generation, long context, multimodal, and other capabilities.

Install

Source · Clone the upstream repo:
git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llm-rankings" ~/.claude/skills/majiayu000-claude-skill-registry-llm-rankings && rm -rf "$T"

Manifest: skills/data/llm-rankings/SKILL.md

Source Content

LLM Rankings Skill

Comprehensive evaluation and ranking system for comparing language models across performance, cost, and technical dimensions.

Core Capabilities

This skill provides four main ranking methodologies:

  1. Benchmark-Based Rankings - Objective comparisons using MMLU, GSM8K, HumanEval scores
  2. Task-Specific Rankings - Weighted recommendations for code generation, creative writing, reasoning, etc.
  3. Cost-Effectiveness Rankings - Performance per dollar analysis
  4. Real-World Performance - API reliability, documentation quality, ease of integration
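The cost-effectiveness methodology, for example, reduces to benchmark points per dollar. A minimal sketch, assuming a single blended benchmark score and a blended per-million-token price; the model names and figures below are placeholders, not real data:

```python
# Illustrative cost-effectiveness scoring: benchmark points per dollar.
# Model names, scores, and prices are placeholders, not real data.
def cost_effectiveness(avg_benchmark: float, blended_price_per_mtok: float) -> float:
    """Benchmark points per dollar of blended (input + output) price per 1M tokens."""
    return avg_benchmark / blended_price_per_mtok

candidates = {
    "model-a": {"avg_benchmark": 85.0, "blended_price_per_mtok": 6.00},
    "model-b": {"avg_benchmark": 78.0, "blended_price_per_mtok": 0.60},
}

ranked = sorted(candidates.items(),
                key=lambda kv: cost_effectiveness(**kv[1]),
                reverse=True)
for name, specs in ranked:
    print(f"{name}: {cost_effectiveness(**specs):.1f} benchmark points per $")
```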

Standard Workflows

Simple Comparison Request

When a user asks "Which LLM is better for X?":

  1. Load relevant benchmark data from references/benchmarks.md
  2. Filter models matching requirements
  3. Calculate rankings with appropriate weighting
  4. Present top 3-5 recommendations with justification
  5. Include pricing information from references/pricing.md
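A minimal sketch of steps 2-4 above, assuming the benchmark scores from references/benchmarks.md have already been normalised to 0-1 and loaded into a dict; the weights, model names, and scores here are illustrative placeholders:

```python
# Hypothetical weighted ranking over normalised benchmark scores (0-1).
# The weights encode what "better for X" means for this particular query.
weights = {"humaneval": 0.5, "gsm8k": 0.3, "mmlu": 0.2}  # e.g. a coding-heavy task

models = {
    "model-a": {"humaneval": 0.90, "gsm8k": 0.95, "mmlu": 0.88},
    "model-b": {"humaneval": 0.84, "gsm8k": 0.91, "mmlu": 0.86},
    "model-c": {"humaneval": 0.72, "gsm8k": 0.80, "mmlu": 0.79},
}

def weighted_score(scores: dict[str, float]) -> float:
    return sum(weights[b] * scores[b] for b in weights)

top = sorted(models, key=lambda m: weighted_score(models[m]), reverse=True)[:5]
for m in top:
    print(f"{m}: {weighted_score(models[m]):.3f}")
```

Changing the weights is how the same benchmark data yields different "better for X" answers for different tasks.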

Detailed Analysis Request

When a user asks for a comprehensive comparison:

  1. Load model specifications from references/model-details.md
  2. Generate side-by-side comparison table
  3. Include benchmark scores across multiple tests
  4. Calculate cost projections for expected usage
  5. Provide deployment considerations
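Step 4 is plain arithmetic over the user's expected token volume. A minimal sketch, assuming per-million-token pricing; the prices and volumes are placeholders, not figures from references/pricing.md:

```python
# Hypothetical monthly cost projection from expected token volume.
def monthly_cost(input_mtok: float, output_mtok: float,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars, given millions of tokens per month and $/1M-token prices."""
    return input_mtok * input_price + output_mtok * output_price

# Example: 50M input tokens and 10M output tokens per month at placeholder prices.
print(monthly_cost(input_mtok=50, output_mtok=10, input_price=3.00, output_price=15.00))
# -> 300.0
```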

Best Model for Task Query

When a user describes a specific use case:

  1. Parse task requirements (performance needs, budget, technical constraints)
  2. Map to capability dimensions
  3. Load task-specific rankings from references/use-cases.md
  4. Return top 3 models with detailed reasoning
  5. Include caveats and alternative suggestions
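A minimal sketch of steps 1-2 above; the dimension names, task profiles, and weights are illustrative assumptions, not the scheme defined in references/use-cases.md:

```python
# Hypothetical mapping from a described use case to capability-dimension weights.
TASK_PROFILES = {
    "code generation":   {"code": 0.6, "reasoning": 0.2, "long_context": 0.1, "cost": 0.1},
    "creative writing":  {"writing": 0.6, "reasoning": 0.2, "long_context": 0.1, "cost": 0.1},
    "document analysis": {"long_context": 0.5, "reasoning": 0.3, "cost": 0.2},
}

def capability_weights(task: str, budget_sensitive: bool = False) -> dict[str, float]:
    weights = dict(TASK_PROFILES.get(task, {"reasoning": 0.5, "cost": 0.5}))
    if budget_sensitive:  # shift weight toward cost when budget is a stated constraint
        weights["cost"] = weights.get("cost", 0.0) + 0.2
        total = sum(weights.values())
        weights = {k: v / total for k, v in weights.items()}  # renormalise to 1.0
    return weights

print(capability_weights("code generation", budget_sensitive=True))
```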

Reference Resources

Load these files as needed to inform recommendations:

  • benchmarks.md - Comprehensive benchmark scores (MMLU, GSM8K, HumanEval, MMMU, etc.)
  • model-details.md - Technical specifications, context windows, API details, capabilities
  • use-cases.md - Task-specific recommendations organised by common use cases
  • pricing.md - Current pricing across all providers, cost optimisation strategies
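These are plain markdown files bundled with the skill, so loading one is a file read. A minimal sketch, assuming the install path used in the one-liner above:

```python
# Read a reference file from the installed skill directory; the path below
# assumes the install location used in the Claude Code one-liner above.
from pathlib import Path

SKILL_DIR = Path.home() / ".claude/skills/majiayu000-claude-skill-registry-llm-rankings"

def load_reference(name: str) -> str:
    """Return the raw markdown for one reference file, e.g. 'benchmarks.md'."""
    return (SKILL_DIR / "references" / name).read_text(encoding="utf-8")

# e.g. benchmarks = load_reference("benchmarks.md")
```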

Output Formats

Quick Recommendation

Present concise recommendations with model name, key strength, pricing snapshot, and one-sentence justification.

Comparison Table

Use markdown tables comparing models across relevant dimensions (performance, context window, pricing, best use).
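An illustrative skeleton; the column set is an assumption, and the cells are placeholders to be filled from the reference files rather than invented:

```markdown
| Model | MMLU | HumanEval | Context Window | Input $/1M | Output $/1M | Best For |
|-------|------|-----------|----------------|------------|-------------|----------|
| ...   | ...  | ...       | ...            | ...        | ...         | ...      |
| ...   | ...  | ...       | ...            | ...        | ...         | ...      |
```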

Detailed Analysis

Structure as:

  1. Executive summary (2-3 sentences)
  2. Top recommendations (ranked with justification)
  3. Performance comparison (benchmark scores)
  4. Cost analysis (usage projections)
  5. Implementation considerations
  6. Alternative options

Key Principles

  1. Evidence-Based - Support all rankings with benchmark data or documented performance
  2. Context-Aware - Consider user's specific requirements, budget, technical environment
  3. Transparent - Explain weighting decisions and ranking criteria clearly
  4. Current Information - Use web_search to verify latest releases, pricing changes, benchmark updates
  5. Practical Focus - Prioritise real-world usage factors over pure benchmark scores
  6. Balanced - Present strengths and weaknesses honestly for each model

Important Considerations

  • Benchmark Limitations - Benchmarks don't perfectly reflect real-world performance
  • Task Specificity - A model's ranking varies significantly by use case
  • Pricing Volatility - API pricing changes frequently; verify for important decisions
  • Access Availability - Some models have waitlists or geographic restrictions
  • Trade-offs - Larger context windows often mean slower processing

Usage Notes

  • Always verify current pricing and availability via web search for recent changes
  • Consider user's deployment environment (API vs self-hosted)
  • Account for additional costs (vision inputs, fine-tuning, enterprise features)
  • Recommend testing on user's specific use case before committing
  • Highlight when free tiers or trials are available

Model Coverage

Provides comprehensive coverage of:

  • Anthropic: Claude Opus 4.1/4, Sonnet 4.5/4, Haiku 4
  • OpenAI: GPT-4 Turbo, GPT-4o, GPT-4o-mini, o1-preview, o1-mini
  • Google: Gemini 1.5 Pro, Gemini 1.5 Flash
  • Meta: Llama 3.1 (405B, 70B, 8B)
  • Mistral: Large 2, Small
  • DeepSeek: Coder V2
  • Other providers as relevant to user queries