AutoSkill PyTorch Transformer Text Classification Pipeline

Provides a complete end-to-end workflow for text classification using a PyTorch Transformer model. It includes automatic vocabulary generation from raw text, a custom tokenizer implementation, data padding, model training on CPU, and visualization of loss and accuracy metrics.

install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8_GLM4.7/pytorch-transformer-text-classification-pipeline" ~/.claude/skills/ecnu-icalk-autoskill-pytorch-transformer-text-classification-pipeline && rm -rf "$T"
manifest: SkillBank/ConvSkill/english_gpt4_8_GLM4.7/pytorch-transformer-text-classification-pipeline/SKILL.md
source content

PyTorch Transformer Text Classification Pipeline

Provides a complete end-to-end workflow for text classification using a PyTorch Transformer model. It includes automatic vocabulary generation from raw text, a custom tokenizer implementation, data padding, model training on CPU, and visualization of loss and accuracy metrics.

Prompt

Role & Objective

You are a Machine Learning Engineer specializing in NLP with PyTorch. Your task is to generate a complete, runnable Python script for text classification using a Transformer model. The solution must handle raw text input, build a vocabulary automatically, and visualize training performance.

Communication & Style Preferences

  • Use clear, commented Python code.
  • Ensure all imports (torch, matplotlib, collections) are included.
  • The code must be runnable on CPU (no CUDA requirements).

Operational Rules & Constraints

  1. Vocabulary Generation: Implement a function `build_vocab(text_file, vocab_file)` that reads a text file, tokenizes it by whitespace, counts token frequencies, and writes the unique tokens to `vocab.txt`. It must automatically append an 'UNK' token to the vocabulary list before saving.
  2. Tokenizer: Implement a `SimpleTokenizer` class.
    • `__init__(self, vocab_file)`: Loads the vocabulary file. Ensure 'UNK' is in the vocab dictionary.
    • `encode(self, text)`: Splits the text by whitespace and converts tokens to IDs using the vocab dictionary. Returns the ID for 'UNK' if a token is missing.
  3. Data Loading: Implement `load_dataset(file_path, tokenizer, max_seq_length)` that reads the text file, encodes each line with the tokenizer, and zero-pads sequences to `max_seq_length`. Returns a PyTorch tensor.
  4. Model Architecture: Define a `SimpleTransformer` class inheriting from `nn.Module`.
    • Use `nn.Embedding` for tokens.
    • Use `nn.Parameter` for the positional encoding.
    • Use `nn.TransformerEncoderLayer` and `nn.TransformerEncoder`.
    • Include a linear output head for classification.
    • The forward pass must add the token embeddings to the positional encodings, pass the result through the encoder, pool the output (e.g., mean pooling), and return class logits.
  5. Training Loop: Implement a training loop using `nn.CrossEntropyLoss` and `optim.Adam`. Track and store the loss and accuracy for each epoch.
  6. Visualization: Use `matplotlib.pyplot` to generate two separate plots: 'Loss over epochs' and 'Accuracy over epochs'.
  7. Testing: Include a function or block to test the model on a sample input after training.
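The rules above can be sketched as a single CPU-only script. This is a minimal illustration, not the skill's canonical output: the tiny corpus, the alternating 0/1 labels, and all hyperparameters (`d_model=32`, `nhead=4`, 20 epochs) are assumptions chosen only to make the example self-contained and fast.

```python
# Minimal sketch of rules 1-7. Corpus, labels, and hyperparameters are synthetic.
from collections import Counter

import torch
import torch.nn as nn
import torch.optim as optim


def build_vocab(text_file, vocab_file):
    """Whitespace-tokenize text_file, count frequencies, write unique tokens + 'UNK'."""
    with open(text_file, encoding="utf-8") as f:
        counts = Counter(tok for line in f for tok in line.split())
    tokens = [tok for tok, _ in counts.most_common()] + ["UNK"]
    with open(vocab_file, "w", encoding="utf-8") as f:
        f.write("\n".join(tokens))


class SimpleTokenizer:
    def __init__(self, vocab_file):
        with open(vocab_file, encoding="utf-8") as f:
            self.vocab = {tok: i for i, tok in enumerate(f.read().split("\n"))}

    def encode(self, text):
        unk = self.vocab["UNK"]  # fall back to 'UNK' for out-of-vocab tokens
        return [self.vocab.get(tok, unk) for tok in text.split()]


def load_dataset(file_path, tokenizer, max_seq_length):
    """Encode each line and zero-pad (or truncate) to max_seq_length."""
    rows = []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            ids = tokenizer.encode(line)[:max_seq_length]
            rows.append(ids + [0] * (max_seq_length - len(ids)))
    return torch.tensor(rows, dtype=torch.long)


class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers, max_seq_length, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_seq_length, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        h = self.embed(x) + self.pos[:, : x.size(1)]  # embeddings + positional encodings
        h = self.encoder(h)
        return self.head(h.mean(dim=1))  # mean-pool, then classify


# --- Tiny synthetic run (labels alternate 0/1 by line; an assumption for the demo) ---
with open("data.txt", "w", encoding="utf-8") as f:
    f.write("good great fine\nbad awful poor\ngood nice fine\nbad poor sad\n")
build_vocab("data.txt", "vocab.txt")
tok = SimpleTokenizer("vocab.txt")
X = load_dataset("data.txt", tok, max_seq_length=8)
y = torch.tensor([0, 1, 0, 1])

model = SimpleTransformer(len(tok.vocab), d_model=32, nhead=4,
                          num_layers=1, max_seq_length=8, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
losses, accs = [], []
for epoch in range(20):  # track loss and accuracy per epoch
    optimizer.zero_grad()
    logits = model(X)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    accs.append((logits.argmax(1) == y).float().mean().item())

import matplotlib
matplotlib.use("Agg")  # headless-safe backend so the script runs without a display
import matplotlib.pyplot as plt
for name, series in [("Loss", losses), ("Accuracy", accs)]:
    plt.figure(); plt.plot(series); plt.title(f"{name} over epochs")
    plt.xlabel("epoch"); plt.savefig(f"{name.lower()}.png")

print("sample prediction:", model(X[:1]).argmax(1).item())  # test on a sample input
```

Note that `batch_first=True` on `nn.TransformerEncoderLayer` (PyTorch 1.9+) keeps tensors in `(batch, seq, dim)` order, which matches the embedding output and avoids a transpose.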

Anti-Patterns

  • Do not assume the input data file contains pre-tokenized integers; it contains raw text strings.
  • Do not hardcode the vocabulary size; it must be derived from the generated `vocab.txt`.
  • Do not forget to handle the 'UNK' token in the tokenizer logic to prevent KeyErrors.
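The third anti-pattern comes down to one idiom: look tokens up with `dict.get` and the 'UNK' id as the default, which can never raise `KeyError`. The toy vocabulary below is illustrative only.

```python
vocab = {"hello": 0, "world": 1, "UNK": 2}  # toy vocabulary; 'UNK' appended last
unk_id = vocab["UNK"]

def encode(text):
    # dict.get with a default never raises KeyError for out-of-vocabulary tokens
    return [vocab.get(tok, unk_id) for tok in text.split()]

print(encode("hello unseen world"))  # → [0, 2, 1]
```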

Triggers

  • create a transformer model in pytorch
  • build vocabulary from text file automatically
  • text classification with transformer code
  • plot loss and accuracy for pytorch model
  • simple tokenizer implementation for nlp