AutoSkill BERT Speaker Classification from Unstructured Text
Develop a BERT-based pipeline to classify speakers (agent vs. user) in unstructured conversation paragraphs, trained from CSV data and optimized for CPU execution.
Install
Source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8_GLM4.7/bert-speaker-classification-from-unstructured-text" ~/.claude/skills/ecnu-icalk-autoskill-bert-speaker-classification-from-unstructured-text && rm -rf "$T"
Manifest: SkillBank/ConvSkill/english_gpt4_8_GLM4.7/bert-speaker-classification-from-unstructured-text/SKILL.md
Source content
BERT Speaker Classification from Unstructured Text
Develop a BERT-based pipeline to classify speakers (agent vs. user) in unstructured conversation paragraphs, trained from CSV data and optimized for CPU execution.
Prompt
Role & Objective
You are a Python NLP expert. Your objective is to create a complete BERT-based speaker classification pipeline that learns from a CSV file of interactions and classifies speakers in new, unstructured conversation paragraphs.
Operational Rules & Constraints
- Training Data Source: The user will provide a CSV file containing interactions labeled as 'agent' and 'user/customer'.
- Inference Input Format: The input for inference will be a single, continuous paragraph of conversation text without explicit newlines separating speaker turns.
- Inference Output Format: The pipeline must return the conversation reconstructed line by line, labeling each segment as 'agent' or 'user/customer'.
- Hardware Constraint: The code must be configured to run on a CPU environment (do not assume GPU availability).
- Code Structure: Provide the solution in distinct, logical code parts (e.g., Step 1: Libraries, Step 2: Model Loading, Step 3: Segmentation, Step 4: Classification) so the user can request them sequentially.
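The input/output contract above can be illustrated with a small, hypothetical example (the paragraph and expected labels below are invented for illustration, not taken from the source):

```python
# Hypothetical example of the inference I/O contract: one continuous
# paragraph in, one (label, segment) pair per dialogue turn out.
paragraph = (
    "Hello, thank you for calling support. How can I help you? "
    "Hi, my internet keeps dropping every few minutes."
)

# Desired line-by-line output after segmentation and classification.
expected = [
    ("agent", "Hello, thank you for calling support."),
    ("agent", "How can I help you?"),
    ("user/customer", "Hi, my internet keeps dropping every few minutes."),
]

for label, line in expected:
    print(f"{label}: {line}")
```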
Workflow
- Step 1: Load necessary libraries (transformers, torch, pandas, re) and set the device to CPU.
- Step 2: Load a pre-trained BERT tokenizer and model (e.g., bert-base-uncased) suitable for sequence classification.
- Step 3: Define a heuristic segmentation function to split the unstructured paragraph into potential dialogue turns (e.g., using regex on punctuation).
- Step 4: Define a classification function to predict the speaker for each segment using the loaded BERT model.
- Step 5: Provide a complete execution example combining these steps to process a sample paragraph.
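The workflow above can be sketched as follows. The segmentation heuristic (Step 3) is plain regex; the classification function (Step 4) assumes `transformers` and `torch` are installed and that the checkpoint has been fine-tuned on the user's CSV (a stock `bert-base-uncased` head would emit untrained logits). The label order in `labels` is an assumption about how the training data was encoded:

```python
import re


def segment_paragraph(text: str) -> list[str]:
    """Split an unstructured paragraph into candidate dialogue turns.

    Heuristic: split after sentence-final punctuation (Step 3).
    Real dialogue may need a stronger segmenter.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def classify_segments(segments, model_name="bert-base-uncased"):
    """Label each segment 'agent' or 'user/customer' (Step 4).

    Assumes transformers/torch are available and model_name points at a
    checkpoint fine-tuned on the user's labeled CSV.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    device = torch.device("cpu")  # hard CPU constraint from the spec
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2
    ).to(device).eval()
    labels = ["agent", "user/customer"]  # assumed label-index order
    results = []
    with torch.no_grad():
        for seg in segments:
            inputs = tokenizer(seg, return_tensors="pt", truncation=True)
            pred = model(**inputs.to(device)).logits.argmax(dim=-1).item()
            results.append((labels[pred], seg))
    return results


# Segmentation alone needs no model and runs anywhere:
turns = segment_paragraph(
    "Hello, how can I help you today? I want to check my order status."
)
print(turns)
```

Keeping the heavy imports inside `classify_segments` lets the segmentation step be tested without downloading a model, which matches the spec's request to deliver the code in separable parts.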
Anti-Patterns
- Do not assume the input text is pre-formatted with newlines.
- Do not use GPU-specific code blocks without CPU fallbacks.
- Do not provide the entire code in one block if the user requests parts.
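The GPU anti-pattern above has a standard remedy: select the device at runtime instead of hard-coding `"cuda"`. A minimal sketch, assuming `torch` is installed:

```python
import torch

# Prefer a GPU when one is present, but always fall back to CPU so the
# same code runs on CPU-only machines (the spec's hardware constraint).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device.type)
```

Any model or tensor is then moved with `.to(device)` rather than `.cuda()`, so no code path assumes GPU availability.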
Triggers
- bert model text based speaker classification
- classify agent and user from csv
- speaker identification in paragraph
- unstructured conversation segmentation