taoguba-crawler
This skill should be used when the user asks to "crawl taoguba", "crawl tgb", "scrape taoguba articles", "run the crawler", "crawl bbs", "crawl home page", "generate article HTML", or needs to run the Taoguba (tgb.cn) web crawlers.
install
source · Clone the upstream repo
git clone https://github.com/lisniuse/taoguba-crawler-skill
Claude Code · Install into ~/.claude/skills/
git clone --depth=1 https://github.com/lisniuse/taoguba-crawler-skill ~/.claude/skills/lisniuse-taoguba-crawler-skill-taoguba-crawler
Taoguba Crawler
This skill runs the Taoguba (tgb.cn) article crawlers located in the project root.
Prerequisites
- Python 3 with requests, beautifulsoup4, and python-dotenv installed
- A .env file in the project root containing COOKIE and optionally USER_AGENT
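As a minimal sketch of how these two settings might become request headers: the helper name build_headers and the fallback User-Agent string are hypothetical and not taken from the crawler scripts. It assumes the values are already in the environment, for example after python-dotenv's load_dotenv() has read .env.

```python
import os
from typing import Mapping, Optional

# Hypothetical fallback; the default the real crawlers use is not documented here.
DEFAULT_USER_AGENT = "Mozilla/5.0"


def build_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build HTTP headers from COOKIE (required) and USER_AGENT (optional)."""
    env = os.environ if env is None else env
    cookie = env.get("COOKIE", "").strip()
    if not cookie:
        raise ValueError("COOKIE is missing or empty; add it to .env")
    # Fall back to the placeholder when USER_AGENT is unset or blank.
    user_agent = env.get("USER_AGENT", "").strip() or DEFAULT_USER_AGENT
    return {"Cookie": cookie, "User-Agent": user_agent}
```

Failing early on a missing COOKIE gives a clearer error than an empty article list later in the run.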
Available Crawlers
1. BBS Crawler (crawler_bbs.py)
Crawl the forum board at tgb.cn/bbs/1/1 using HTML scraping.
python crawler_bbs.py
- Extracts the article list by parsing a.overhide.mw300 elements
- Gets each article's main post and author replies
- Downloads images and embeds them as base64 in HTML
- Outputs: output/bbs_YYYY-MM-DD.json and output/bbs_YYYY-MM-DD_HHMMSS.html
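The output naming pattern can be illustrated with a short sketch. The function output_paths is hypothetical and only mirrors the documented file names; the scripts may build their paths differently.

```python
from datetime import datetime
from pathlib import Path


def output_paths(prefix: str, now: datetime, out_dir: str = "output"):
    """Return (json_path, html_path) following the documented naming pattern."""
    day = now.strftime("%Y-%m-%d")             # one JSON file name per day
    stamp = now.strftime("%Y-%m-%d_%H%M%S")    # HTML file name is timestamped per run
    base = Path(out_dir)
    return base / f"{prefix}_{day}.json", base / f"{prefix}_{stamp}.html"
```

For example, output_paths("bbs", datetime(2024, 1, 2, 3, 4, 5)) yields output/bbs_2024-01-02.json and output/bbs_2024-01-02_030405.html. Because the JSON name carries only the date, a second run on the same day produces the same JSON path but a new HTML path.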
2. Home Crawler (crawler_home.py)
Crawl the homepage recommendations via JSON API (/newIndex/getZh).
python crawler_home.py
- Fetches articles from the JSON API (default 2 pages)
- Same content extraction and HTML generation as BBS crawler
- Outputs: output/home_YYYY-MM-DD.json and output/home_YYYY-MM-DD_HHMMSS.html
Common Workflow
To run both crawlers:
python crawler_bbs.py && python crawler_home.py
Key Implementation Details
- Authentication: Both scripts read COOKIE from .env via python-dotenv
- Rate limiting: 0.5-1s delay between requests to avoid being blocked
- Image handling: Images are downloaded and embedded as base64 in the HTML output
- Article content: Extracts main post (#first) and author replies (.comment-data with author badge)
- Output directory: All results saved to output/ folder
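Two of the details above, the request delay and the base64 image embedding, might look roughly like the following sketch. The names polite_sleep and to_data_uri are hypothetical. A random delay within the 0.5-1s range and the default image/jpeg MIME type are both assumptions; the scripts may use a fixed delay or detect the image type.

```python
import base64
import random
import time


def polite_sleep(low: float = 0.5, high: float = 1.0) -> float:
    """Sleep for a random interval between low and high seconds; return it."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay


def to_data_uri(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URI for use in an <img src="..."> attribute."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Embedding images this way keeps the generated HTML self-contained, at the cost of a file roughly one third larger than the original image bytes.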
Scripts
The crawler scripts are bundled in scripts/:
- scripts/crawler_bbs.py: BBS forum crawler (HTML scraping)
- scripts/crawler_home.py: Homepage crawler (JSON API)
To run the bundled scripts directly:
python scripts/crawler_bbs.py
python scripts/crawler_home.py
Troubleshooting
- If no articles are returned, check that .env contains a valid COOKIE value
- If image downloads fail, the HTML will show error messages inline
- Network timeouts default to 10-15 seconds per request
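For the first check above, a quick way to confirm that .env defines a non-empty COOKIE is sketched below. The function env_has_cookie is hypothetical and uses a simplified KEY=VALUE reader rather than python-dotenv's full parsing rules. It only confirms that a value is present, not that the site still accepts the cookie.

```python
from pathlib import Path


def env_has_cookie(env_path: str = ".env") -> bool:
    """Return True if the file contains a non-empty COOKIE=... line."""
    path = Path(env_path)
    if not path.is_file():
        return False
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip comments and lines that are not KEY=VALUE pairs.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes before checking for emptiness.
        if key.strip() == "COOKIE" and value.strip().strip("'\""):
            return True
    return False
```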