AutoSkill Text Preprocessing and Date Normalization for Embeddings

Preprocess text for embedding models by normalizing it (lowercasing, replacing hyphens with spaces) and rewriting dates in one standard format, filling in a default year when none is given, so that dataset text and user inputs are treated consistently.

install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/text-preprocessing-and-date-normalization-for-embeddings" ~/.claude/skills/ecnu-icalk-autoskill-text-preprocessing-and-date-normalization-for-embeddings && rm -rf "$T"
manifest: SkillBank/ConvSkill/english_gpt4_8/text-preprocessing-and-date-normalization-for-embeddings/SKILL.md
source content

Text Preprocessing and Date Normalization for Embeddings

Prompt

Role & Objective

You are a data preprocessing assistant. Your task is to prepare text data for embedding generation by applying specific normalization rules and handling date formats.

Operational Rules & Constraints

  1. Text Normalization:

    • Convert all text to lowercase.
    • Replace hyphens '-' with spaces.
  2. Date Normalization:

    • Identify dates in various formats within the text (e.g., "Jan 5", "5 Jan", "05/Jan", "January 5", "5th Jan").
    • If a date is parsed and the year is missing, default the year to <NUM> (or a specified default year).
    • Rewrite every recognized date in one standard format (e.g., "DD-Mon-YYYY").
  3. Consistency:

    • Apply the exact same preprocessing steps to both the dataset and user inputs during inference.
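
The rules above can be sketched in Python roughly as follows. This is a minimal illustration using only the standard library, not the skill's own implementation: the names `preprocess`, `_standardize`, and `_DATE_RE` are hypothetical, and the order of steps (text normalization first, then date normalization, so that the hyphens in "DD-Mon-YYYY" survive) is an assumption, since the rules do not fix one. The default year is a required argument because its value is not stated here; the 2024 in the usage lines is only a placeholder.

```python
import re
from datetime import date

# Three-letter month prefixes mapped to month numbers.
_MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Full or abbreviated English month names.
_MONTH = (
    r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
    r"|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)"
)

# Day-first ("5 jan", "05/jan", "5th jan") or month-first ("jan 5",
# "january 5"), each with an optional trailing four-digit year.
_DATE_RE = re.compile(
    rf"\b(?P<d1>\d{{1,2}})(?:st|nd|rd|th)?[\s/]*(?P<m1>{_MONTH})\b"
    rf"(?:[\s,/]+(?P<y1>\d{{4}})\b)?"
    rf"|\b(?P<m2>{_MONTH})\b[\s/]*(?P<d2>\d{{1,2}})(?:st|nd|rd|th)?\b"
    rf"(?:[\s,/]+(?P<y2>\d{{4}})\b)?",
    re.IGNORECASE,
)


def _standardize(match: re.Match, default_year: int) -> str:
    """Rewrite one matched date as DD-Mon-YYYY (hypothetical helper)."""
    day = int(match.group("d1") or match.group("d2"))
    month_key = (match.group("m1") or match.group("m2")).lower()[:3]
    year_text = match.group("y1") or match.group("y2")
    year = int(year_text) if year_text else default_year  # missing year -> default
    try:
        parsed = date(year, _MONTHS[month_key], day)
    except ValueError:
        # Not a real calendar date (e.g. "feb 30"): keep the text, do not drop it.
        return match.group(0)
    return f"{parsed.day:02d}-{month_key.capitalize()}-{parsed.year:04d}"


def preprocess(text: str, default_year: int) -> str:
    """Lowercase, replace hyphens with spaces, then standardize dates.

    Assumption: text normalization runs before date normalization.
    """
    text = text.lower().replace("-", " ")
    return _DATE_RE.sub(lambda m: _standardize(m, default_year), text)


# The same function is applied to dataset rows and to user queries.
rows = ["Meeting on 5th Jan about Follow-Up", "Invoice due January 5"]
cleaned_rows = [preprocess(r, default_year=2024) for r in rows]
cleaned_query = preprocess("what happened on 05/Jan", default_year=2024)
print(cleaned_rows[0])  # meeting on 05-Jan-2024 about follow up
```

Routing both the dataset and inference-time inputs through the one `preprocess` function is what keeps the two sides consistent. A pattern this simple will also match ordinary words that look like months (for example "may" directly after a number), so a production pipeline would likely need a stricter parser.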

Anti-Patterns

  • Do not remove or ignore dates.
  • Do not apply cleaning steps that are not specified here (such as stopword removal) unless explicitly requested.

Triggers

  • preprocess text for embedding
  • normalize dates in text
  • handle date formats in questions
  • prepare dataframe for retrieval model