Claude-skill-registry keyword-extractor
Extract keywords and key phrases from text using TF-IDF, RAKE, and frequency analysis. Generate word clouds and export to various formats.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/keyword-extractor" ~/.claude/skills/majiayu000-claude-skill-registry-keyword-extractor && rm -rf "$T"
manifest: skills/data/keyword-extractor/SKILL.md
Keyword Extractor
Extract important keywords and key phrases from text documents using multiple algorithms. Supports TF-IDF, RAKE, and simple frequency analysis with word cloud visualization.
Quick Start
    from scripts.keyword_extractor import KeywordExtractor

    # Extract keywords
    extractor = KeywordExtractor()
    keywords = extractor.extract("Your long text document here...")
    print(keywords[:10])  # Top 10 keywords

    # From file
    keywords = extractor.extract_from_file("document.txt")
    extractor.to_wordcloud("keywords.png")
Features
- Multiple Algorithms: TF-IDF, RAKE, frequency-based
- Key Phrases: Extract multi-word phrases, not just single words
- Scoring: Relevance scores for ranking
- Stopword Filtering: Built-in + custom stopwords
- N-gram Support: Unigrams, bigrams, trigrams
- Word Cloud: Visualize keyword importance
- Batch Processing: Process multiple documents
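To make the frequency, stopword, and n-gram features concrete, here is a minimal standalone sketch of frequency-based extraction. It is not the skill's implementation: `frequency_keywords`, `STOPWORDS`, and the parameter defaults are hypothetical names and values chosen for illustration, and only the Python standard library is used.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; the skill's built-in list is not shown).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "in", "to", "for"}


def frequency_keywords(text, ngram_range=(1, 2), min_word_length=3, top_n=5):
    """Count n-grams built from runs of consecutive content words."""
    # Keep punctuation as tokens so n-grams never cross sentence boundaries.
    tokens = re.findall(r"[a-z0-9]+|[.!?,;:]", text.lower())
    runs, current = [], []
    for tok in tokens:
        is_boundary = (
            not tok.isalnum() or tok in STOPWORDS or len(tok) < min_word_length
        )
        if is_boundary:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        runs.append(current)

    counts = Counter()
    low, high = ngram_range
    for run in runs:
        for n in range(low, high + 1):
            for i in range(len(run) - n + 1):
                counts[" ".join(run[i:i + n])] += 1
    return counts.most_common(top_n)


sample = "Machine learning is transforming data science. Machine learning models are everywhere."
top_keywords = frequency_keywords(sample)
```

Splitting at stopwords before building n-grams keeps phrases such as "learning is transforming" out of the results, which is one simple way to get multi-word key phrases from plain counts.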
API Reference
Initialization
    extractor = KeywordExtractor(
        method="tfidf",        # tfidf, rake, frequency
        max_keywords=20,       # Maximum keywords to return
        min_word_length=3,     # Minimum word length
        ngram_range=(1, 3),    # Unigrams to trigrams
    )
Extraction Methods
    # TF-IDF (best for comparing documents)
    keywords = extractor.extract(text, method="tfidf")

    # RAKE (best for key phrases)
    keywords = extractor.extract(text, method="rake")

    # Frequency (simple word counts)
    keywords = extractor.extract(text, method="frequency")
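The internals of `method="tfidf"` are not shown in this document. As a rough sketch of the general technique, the example below computes TF-IDF over a small corpus with the standard library only. It assumes the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` and applies no vector normalization; `tfidf_scores` is a hypothetical helper, not part of the skill.

```python
import math
import re
from collections import Counter


def tfidf_scores(documents):
    """Return one {term: score} dict per document using smoothed TF-IDF."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        results.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return results


docs = [
    "neural networks learn representations",
    "neural networks need data",
    "gardening needs water and sunlight",
]
scores = tfidf_scores(docs)
# Rank the first document's terms from highest to lowest score.
ranked = sorted(scores[0].items(), key=lambda item: item[1], reverse=True)
```

In the first document, "learn" and "representations" outrank "neural" and "networks" because the latter pair also appears in the second document. This is why TF-IDF suits document comparison: it rewards terms that distinguish one document from the rest.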
Results Format
    keywords = extractor.extract(text)
    # Returns list of tuples: [(keyword, score), ...]
    # [('machine learning', 0.85), ('data science', 0.72), ...]

    # Get just keywords
    keyword_list = extractor.get_keywords(text)
    # ['machine learning', 'data science', ...]
Customization
    # Add custom stopwords
    extractor.add_stopwords(['company', 'product', 'service'])

    # Set minimum frequency
    extractor.min_frequency = 2

    # Filter by part of speech (nouns only)
    extractor.pos_filter = ['NN', 'NNS', 'NNP']
Visualization
    # Generate word cloud
    extractor.to_wordcloud("wordcloud.png", colormap="viridis")

    # Bar chart of top keywords
    extractor.plot_keywords("keywords.png", top_n=15)
Export
    # To JSON
    extractor.to_json("keywords.json")

    # To CSV
    extractor.to_csv("keywords.csv")

    # To plain text
    extractor.to_text("keywords.txt")
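The on-disk layout these export methods produce is not documented here. If you need a known layout, one option is to serialize the `(keyword, score)` tuples yourself, as in the sketch below. The field names `keyword` and `score` and the helpers `export_json` and `export_csv` are assumptions made for this example.

```python
import csv
import json
import tempfile
from pathlib import Path

# Sample data in the (keyword, score) shape described under Results Format.
keywords = [("machine learning", 0.85), ("data science", 0.72)]


def export_json(pairs, path):
    """Write keywords as a list of {"keyword", "score"} objects."""
    records = [{"keyword": kw, "score": score} for kw, score in pairs]
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


def export_csv(pairs, path):
    """Write keywords as CSV with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["keyword", "score"])
        writer.writerows(pairs)


# Write into a temporary directory so the example leaves the working tree alone.
out_dir = Path(tempfile.mkdtemp())
export_json(keywords, out_dir / "keywords.json")
export_csv(keywords, out_dir / "keywords.csv")
```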
CLI Usage
    # Extract from text
    python keyword_extractor.py --text "Your text here" --top 10

    # Extract from file
    python keyword_extractor.py --input document.txt --method tfidf --output keywords.json

    # Generate word cloud
    python keyword_extractor.py --input document.txt --wordcloud cloud.png

    # Batch process directory
    python keyword_extractor.py --input-dir ./docs --output keywords_all.csv
CLI Arguments
| Argument | Description | Default |
|---|---|---|
| --text | Text to analyze | - |
| --input | Input file path | - |
| --input-dir | Directory of files | - |
| --output | Output file | - |
| --method | Algorithm (tfidf, rake, frequency) | |
| --top | Number of keywords | 20 |
| | N-gram range (e.g., "1,2") | |
| --wordcloud | Generate word cloud | - |
| | Custom stopwords file | - |
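For readers wiring up a similar command line, the sketch below shows how the flags that appear under CLI Usage could be declared with `argparse`. It is an illustration, not the script's actual parser. Making the three input sources mutually exclusive is an assumption, the N-gram and stopwords options are omitted because their flag names are not shown, and `--method` is left without a default for the same reason.

```python
import argparse


def build_parser():
    """Declare only the flags that appear in the CLI usage examples."""
    parser = argparse.ArgumentParser(prog="keyword_extractor.py")
    # Assumption: exactly one input source is given per invocation.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--text", help="Text to analyze")
    source.add_argument("--input", help="Input file path")
    source.add_argument("--input-dir", help="Directory of files")
    parser.add_argument("--output", help="Output file")
    parser.add_argument(
        "--method",
        choices=["tfidf", "rake", "frequency"],
        help="Algorithm",  # the real default is not shown, so none is set here
    )
    parser.add_argument("--top", type=int, default=20, help="Number of keywords")
    parser.add_argument("--wordcloud", help="Path for the generated word cloud image")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(
    ["--input", "document.txt", "--method", "tfidf", "--top", "10"]
)
```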
Examples
Article Keyword Extraction
    extractor = KeywordExtractor(method="tfidf")

    article = """
    Machine learning is transforming data science. Deep learning models
    are achieving state-of-the-art results in natural language processing
    and computer vision. Neural networks continue to advance...
    """

    keywords = extractor.extract(article, top_n=10)
    for keyword, score in keywords:
        print(f"{score:.3f}: {keyword}")
Compare Multiple Documents
    from pathlib import Path

    extractor = KeywordExtractor(method="tfidf")

    # read_text() opens, reads, and closes each file
    docs = [Path(name).read_text() for name in ("doc1.txt", "doc2.txt", "doc3.txt")]

    # Extract keywords from each
    for i, doc in enumerate(docs, start=1):
        keywords = extractor.extract(doc, top_n=5)
        print(f"\nDocument {i}:")
        for kw, score in keywords:
            print(f"  {kw}: {score:.3f}")
SEO Keyword Research
    from pathlib import Path

    extractor = KeywordExtractor(
        method="rake",
        ngram_range=(2, 4),  # Focus on phrases
        max_keywords=30,
    )

    # read_text() opens, reads, and closes the file
    webpage_content = Path("page.html").read_text()
    keywords = extractor.extract(webpage_content)

    # Filter by score threshold
    high_value = [(kw, s) for kw, s in keywords if s > 0.5]

    print("High-value keywords for SEO:")
    for kw, score in high_value:
        print(f"  {kw}")
Algorithm Comparison
| Algorithm | Best For | Strengths |
|---|---|---|
| TF-IDF | Document comparison | Finds unique terms, good for search |
| RAKE | Key phrases | Extracts multi-word concepts |
| Frequency | Quick overview | Simple, fast, interpretable |
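To show why RAKE favors multi-word concepts, here is a compact sketch of the RAKE scoring idea: split the text into candidate phrases at stopwords and punctuation, score each word as degree divided by frequency, and sum the word scores per phrase. This is a generic illustration with a tiny hypothetical stopword list, not the skill's code, and its scores are unnormalized.

```python
import re
from collections import defaultdict

# Small illustrative stopword list (assumption, not the skill's built-in list).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "with"}


def rake(text):
    """Score candidate phrases RAKE-style: sum of degree/frequency per word."""
    # Words and punctuation become tokens; punctuation and stopwords end a phrase.
    tokens = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())
    phrases, current = [], []
    for tok in tokens:
        if tok in STOPWORDS or not tok.isalnum():
            if current:
                phrases.append(tuple(current))
            current = []
        else:
            current.append(tok)
    if current:
        phrases.append(tuple(current))

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself

    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text = "Keyword extraction is useful. Automatic keyword extraction of documents is fast."
ranked_phrases = rake(text)
```

Here "automatic keyword extraction" ranks first because its words co-occur in long phrases, while single words such as "useful" score lowest. Whether the skill rescales its RAKE scores is not stated in this document, so check the score range before applying a fixed threshold.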
Dependencies
    scikit-learn>=1.2.0
    nltk>=3.8.0
    pandas>=2.0.0
    matplotlib>=3.7.0
    wordcloud>=1.9.0
Limitations
- Optimized for English (other languages need language-specific stopwords)
- Very short texts may not provide enough data for TF-IDF to rank terms meaningfully
- Domain-specific jargon may need custom stopword handling
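The short-text limitation can be seen with a few lines of arithmetic. Assuming a smoothed IDF of `ln((1 + N) / (1 + df)) + 1` (an assumption about the formula, not a statement about this skill's internals), a single short document gives every term the same IDF, so TF-IDF reduces to plain term frequency:

```python
import math
from collections import Counter

tokens = "short note about keyword extraction".split()
n_docs = 1
counts = Counter(tokens)

# With one document, every term has document frequency 1, so IDF is the same for all.
idf = math.log((1 + n_docs) / (1 + 1)) + 1
tfidf = {term: (count / len(tokens)) * idf for term, count in counts.items()}

# Every term ends up with the same score, so the ranking carries no information.
distinct_scores = set(tfidf.values())
```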