Claude-skill-registry keyword-extractor

Extract keywords and key phrases from text using TF-IDF, RAKE, and frequency analysis. Generate word clouds and export to various formats.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/keyword-extractor" ~/.claude/skills/majiayu000-claude-skill-registry-keyword-extractor && rm -rf "$T"

manifest: skills/data/keyword-extractor/SKILL.md

Keyword Extractor

Extract important keywords and key phrases from text documents using multiple algorithms. Supports TF-IDF, RAKE, and simple frequency analysis with word cloud visualization.

Quick Start

from scripts.keyword_extractor import KeywordExtractor

# Extract keywords
extractor = KeywordExtractor()
keywords = extractor.extract("Your long text document here...")
print(keywords[:10])  # Top 10 keywords

# From file
keywords = extractor.extract_from_file("document.txt")
extractor.to_wordcloud("keywords.png")

Features

Multiple Algorithms: TF-IDF, RAKE, frequency-based
Key Phrases: Extract multi-word phrases, not just single words
Scoring: Relevance scores for ranking
Stopword Filtering: Built-in + custom stopwords
N-gram Support: Unigrams, bigrams, trigrams
Word Cloud: Visualize keyword importance
Batch Processing: Process multiple documents

API Reference

Initialization

extractor = KeywordExtractor(
    method="tfidf",      # tfidf, rake, frequency
    max_keywords=20,     # Maximum keywords to return
    min_word_length=3,   # Minimum word length
    ngram_range=(1, 3)   # Unigrams to trigrams
)

Extraction Methods

# TF-IDF (best for comparing documents)
keywords = extractor.extract(text, method="tfidf")

# RAKE (best for key phrases)
keywords = extractor.extract(text, method="rake")

# Frequency (simple word counts)
keywords = extractor.extract(text, method="frequency")

Results Format

keywords = extractor.extract(text)
# Returns list of tuples: [(keyword, score), ...]
# [('machine learning', 0.85), ('data science', 0.72), ...]

# Get just keywords
keyword_list = extractor.get_keywords(text)
# ['machine learning', 'data science', ...]

Customization

# Add custom stopwords
extractor.add_stopwords(['company', 'product', 'service'])

# Set minimum frequency
extractor.min_frequency = 2

# Filter by part of speech (nouns only)
extractor.pos_filter = ['NN', 'NNS', 'NNP']

Visualization

# Generate word cloud
extractor.to_wordcloud("wordcloud.png", colormap="viridis")

# Bar chart of top keywords
extractor.plot_keywords("keywords.png", top_n=15)

Export

# To JSON
extractor.to_json("keywords.json")

# To CSV
extractor.to_csv("keywords.csv")

# To plain text
extractor.to_text("keywords.txt")

CLI Usage

# Extract from text
python keyword_extractor.py --text "Your text here" --top 10

# Extract from file
python keyword_extractor.py --input document.txt --method tfidf --output keywords.json

# Generate word cloud
python keyword_extractor.py --input document.txt --wordcloud cloud.png

# Batch process directory
python keyword_extractor.py --input-dir ./docs --output keywords_all.csv

CLI Arguments

Argument	Description	Default
`--text`	Text to analyze	-
`--input`	Input file path	-
`--input-dir`	Directory of files	-
`--output`	Output file	-
`--method`	Algorithm (tfidf, rake, frequency)	`tfidf`
`--top`	Number of keywords	20
`--ngrams`	N-gram range (e.g., "1,2")	`1,3`
`--wordcloud`	Generate word cloud	-
`--stopwords`	Custom stopwords file	-

Examples

Article Keyword Extraction

extractor = KeywordExtractor(method="tfidf")

article = """
Machine learning is transforming data science. Deep learning models
are achieving state-of-the-art results in natural language processing
and computer vision. Neural networks continue to advance...
"""

keywords = extractor.extract(article, top_n=10)
for keyword, score in keywords:
    print(f"{score:.3f}: {keyword}")

Compare Multiple Documents

extractor = KeywordExtractor(method="tfidf")

docs = [
    open("doc1.txt").read(),
    open("doc2.txt").read(),
    open("doc3.txt").read()
]

# Extract keywords from each
for i, doc in enumerate(docs):
    keywords = extractor.extract(doc, top_n=5)
    print(f"\nDocument {i+1}:")
    for kw, score in keywords:
        print(f"  {kw}: {score:.3f}")

SEO Keyword Research

extractor = KeywordExtractor(
    method="rake",
    ngram_range=(2, 4),  # Focus on phrases
    max_keywords=30
)

webpage_content = open("page.html").read()
keywords = extractor.extract(webpage_content)

# Filter by score threshold
high_value = [(kw, s) for kw, s in keywords if s > 0.5]
print("High-value keywords for SEO:")
for kw, score in high_value:
    print(f"  {kw}")

Algorithm Comparison

Algorithm	Best For	Strengths
TF-IDF	Document comparison	Finds unique terms, good for search
RAKE	Key phrases	Extracts multi-word concepts
Frequency	Quick overview	Simple, fast, interpretable

Dependencies

scikit-learn>=1.2.0
nltk>=3.8.0
pandas>=2.0.0
matplotlib>=3.7.0
wordcloud>=1.9.0

Limitations

English optimized (other languages need language-specific stopwords)
Very short texts may not have enough data for TF-IDF
Domain-specific jargon may need custom stopword handling