AutoSkill IMDb TV Show Scraping and Analysis Pipeline
Scrapes TV show data (title, genres, episodes, rating) from a Next.js-based IMDb page, stores it in a MySQL database, and generates genre distribution bar charts.
install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/imdb-tv-show-scraping-and-analysis-pipeline" ~/.claude/skills/ecnu-icalk-autoskill-imdb-tv-show-scraping-and-analysis-pipeline && rm -rf "$T"
manifest:
SkillBank/ConvSkill/english_gpt4_8/imdb-tv-show-scraping-and-analysis-pipeline/SKILL.md
source content
Prompt
Role & Objective
Act as a Python developer specializing in web scraping and data analysis. Your task is to scrape TV show data from a specific URL structure, parse the embedded JSON, store the data in a MySQL database, and visualize the results.
Operational Rules & Constraints
- Scraping: Use `requests` and `BeautifulSoup`. Find the `<script id="__NEXT_DATA__">` tag within the HTML soup.
- Parsing: Extract the JSON string from the script tag and parse it using `json.loads()`.
- Data Extraction: Navigate the JSON to `data['props']['pageProps']['pageData']['chartTitles']['edges']`. For each edge, extract:
  - Title: `edge['node']['titleText']['text']`
  - Genres: A list of strings extracted from `edge['node']['titleGenres']['genres']` (get the `text` field for each genre).
  - Episodes: `edge['node']['episodes']['episodes']['total']`
  - Rating: `edge['node']['ratingsSummary']['aggregateRating']`
- Database Storage: Use `mysql.connector` to connect to the database. Create table `shows1` if it does not exist with columns: `id` (INT AUTO_INCREMENT PRIMARY KEY), `title` (VARCHAR), `episodes` (INTEGER), `rating` (DECIMAL), and `genres` (VARCHAR).
- Insertion: Insert the extracted data into the table. Handle missing ratings by converting them to `None` (NULL). Ensure `mydb.commit()` is called after the insertion loop to persist changes.
- Querying: To query titles by a specific genre (e.g., 'Thriller'), use SQL queries with `LIKE '%GenreName%'` or `FIND_IN_SET`.
- Visualization: Use `matplotlib` to create a bar graph showing the number of shows per genre. Use `plt.bar()`, set appropriate labels and titles, and rotate x-axis labels if necessary for readability.
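The scraping, parsing, and data extraction rules could be sketched roughly as follows. The function names (`fetch_next_data`, `genre_name`, `extract_shows`), the `User-Agent` header, and the timeout are illustrative assumptions, not part of the skill. The fallback to a nested `genre` key is also an assumption about how the payload may be shaped; the skill itself only says to read the `text` field of each genre.

```python
import json
from typing import Optional


def fetch_next_data(url: str) -> dict:
    """Download a page and return the parsed __NEXT_DATA__ JSON."""
    # Third-party imports live here so the pure parsing helpers below
    # can be used without requests or bs4 installed.
    import requests
    from bs4 import BeautifulSoup

    # Assumption: a browser-like User-Agent; the skill does not mention headers.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    script = soup.find("script", id="__NEXT_DATA__")
    if script is None or not script.string:
        raise ValueError("No __NEXT_DATA__ script tag found")
    return json.loads(script.string)


def genre_name(entry: dict) -> Optional[str]:
    """Return the genre label from one entry of titleGenres.genres."""
    # The skill says to read the "text" field of each genre entry.
    if "text" in entry:
        return entry["text"]
    # Assumption: some payloads may nest the label one level down under "genre".
    return (entry.get("genre") or {}).get("text")


def extract_shows(data: dict) -> list:
    """Walk the nested edges list and flatten each node into a plain dict."""
    edges = data["props"]["pageProps"]["pageData"]["chartTitles"]["edges"]
    shows = []
    for edge in edges:
        node = edge["node"]
        # Optional branches may be absent or null, so fall back to empty values.
        raw_genres = (node.get("titleGenres") or {}).get("genres") or []
        episodes = ((node.get("episodes") or {}).get("episodes") or {}).get("total")
        rating = (node.get("ratingsSummary") or {}).get("aggregateRating")
        shows.append({
            "title": node["titleText"]["text"],
            "genres": [name for name in map(genre_name, raw_genres) if name],
            "episodes": episodes,
            "rating": rating,  # stays None when the show has no rating
        })
    return shows
```

A caller would pass the target page URL to `fetch_next_data` and hand the result to `extract_shows`. Keeping extraction separate from the network call makes it possible to check the nested paths against a saved copy of the JSON.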
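For the visualization rule, a minimal sketch might count genres with `collections.Counter` and pass the totals to `plt.bar()`. It assumes each show is a dict with a `genres` list of strings; the helper names, figure size, axis labels, and the optional `output_path` parameter are hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Optional


def count_genres(shows: list) -> Counter:
    """Count how many shows list each genre."""
    counts = Counter()
    for show in shows:
        counts.update(show["genres"])
    return counts


def plot_genre_counts(counts: Counter, output_path: Optional[str] = None) -> None:
    """Draw a bar chart of shows per genre, most common genre first."""
    # Imported lazily so count_genres works without matplotlib installed.
    import matplotlib.pyplot as plt

    items = counts.most_common()
    genres = [genre for genre, _ in items]
    totals = [total for _, total in items]

    plt.figure(figsize=(10, 6))
    plt.bar(genres, totals)
    plt.xlabel("Genre")
    plt.ylabel("Number of shows")
    plt.title("Number of shows per genre")
    # Rotate the labels so long genre names do not overlap.
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()

    if output_path:
        plt.savefig(output_path)
    else:
        plt.show()
```

A show with several genres contributes to each of its bars, so the bar heights can sum to more than the number of shows.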
Anti-Patterns
- Do not forget to commit database transactions.
- Do not assume the JSON structure is flat; use the specific nested paths provided.
- Do not insert placeholder strings (like 'No rating') into DECIMAL columns; use NULL instead.
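The storage, insertion, and querying rules, together with the commit and NULL anti-patterns above, could look roughly like this. The column sizes (`VARCHAR(255)`, `DECIMAL(3, 1)`), the helper names, the comma-joined genre format, and the `connect` parameters are assumptions; the skill only names the table, the column types, and `mysql.connector`.

```python
# Assumption: column sizes are illustrative; the skill gives only the types.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS shows1 (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    episodes INTEGER,
    rating DECIMAL(3, 1),
    genres VARCHAR(255)
)
"""
INSERT_SQL = (
    "INSERT INTO shows1 (title, episodes, rating, genres) "
    "VALUES (%s, %s, %s, %s)"
)
GENRE_QUERY_SQL = "SELECT title FROM shows1 WHERE genres LIKE %s"


def connect(host: str, user: str, password: str, database: str):
    """Open a MySQL connection; all four values come from the caller."""
    import mysql.connector  # imported lazily; needs mysql-connector-python

    return mysql.connector.connect(
        host=host, user=user, password=password, database=database
    )


def to_row(show: dict) -> tuple:
    """Build one parameter tuple, mapping unusable ratings to None (NULL)."""
    rating = show.get("rating")
    try:
        rating = float(rating) if rating is not None else None
    except (TypeError, ValueError):
        rating = None  # e.g. a placeholder such as 'No rating'
    genres = ",".join(show.get("genres", []))
    return (show["title"], show.get("episodes"), rating, genres)


def store_shows(mydb, shows: list) -> None:
    """Create the table if needed, insert every show, then commit once."""
    cursor = mydb.cursor()
    cursor.execute(CREATE_TABLE_SQL)
    for show in shows:
        cursor.execute(INSERT_SQL, to_row(show))
    mydb.commit()  # without this the inserts are not persisted
    cursor.close()


def titles_by_genre(mydb, genre: str) -> list:
    """Return titles whose genres column contains the given genre name."""
    cursor = mydb.cursor()
    # The wildcards go in the parameter, not in the SQL string.
    cursor.execute(GENRE_QUERY_SQL, (f"%{genre}%",))
    titles = [row[0] for row in cursor.fetchall()]
    cursor.close()
    return titles
```

Joining genres with a bare comma keeps `FIND_IN_SET` usable as an alternative to `LIKE`, since it matches whole comma-separated items, whereas a `LIKE '%GenreName%'` pattern also matches genre names that merely contain the search string.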
Triggers
- scrape imdb tv show data
- parse NEXT_DATA json
- store scraped data in mysql
- plot genre distribution bar graph