AutoSkill Text Preprocessing and Date Normalization for Embeddings

Preprocess text data for embedding models by normalizing text (lowercasing, replacing hyphens with spaces) and standardizing date formats, filling in a default year when none is given, so that inputs stay consistent.

install

source · Clone the upstream repo

git clone https://github.com/ECNU-ICALK/AutoSkill

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/text-preprocessing-and-date-normalization-for-embeddings" ~/.claude/skills/ecnu-icalk-autoskill-text-preprocessing-and-date-normalization-for-embeddings && rm -rf "$T"

manifest: SkillBank/ConvSkill/english_gpt4_8/text-preprocessing-and-date-normalization-for-embeddings/SKILL.md

source content

Text Preprocessing and Date Normalization for Embeddings

Prompt

Role & Objective

You are a data preprocessing assistant. Your task is to prepare text data for embedding generation by applying specific normalization rules and handling date formats.

Operational Rules & Constraints

Text Normalization:
- Convert all text to lowercase.
- Replace hyphens '-' with spaces.
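
The two text rules above can be sketched in a few lines of Python. The function name `normalize_text` is hypothetical, and the sketch assumes "hyphen" means the ASCII hyphen-minus only, so en dashes and em dashes are left untouched.

```python
def normalize_text(text: str) -> str:
    """Lowercase the text and replace each ASCII hyphen with a space."""
    return text.lower().replace("-", " ")


if __name__ == "__main__":
    print(normalize_text("State-of-the-Art NLP"))
```
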
Date Normalization:
- Identify dates in various formats within the text (e.g., "Jan 5", "5 Jan", "05/Jan", "January 5", "5th Jan").
- If a parsed date has no year, set the year to <NUM> (or to a specified default year).
- Standardize the date format to ensure consistency (e.g., "DD-Mon-YYYY").
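
One possible way to implement the date rules above is a regular expression over English month names. This is a minimal sketch, not the skill's own implementation: `normalize_dates`, `DATE_RE`, and `MONTH_ABBR` are hypothetical names, and only day-month and month-day spellings like the listed examples are handled. Because the source leaves the default year as `<NUM>`, the function takes `default_year` as a required argument, and the value 2000 in the usage lines is an arbitrary placeholder.

```python
import re
from datetime import date

# English month abbreviations, indexed by month number minus one.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

_MON = (r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
        r"aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?")
_ORD = r"(?:st|nd|rd|th)?"  # optional ordinal suffix, as in "5th"

# Matches "5 Jan", "5th Jan", "05/Jan" (day first) or "Jan 5", "January 5"
# (month first), optionally followed by a four-digit year.
DATE_RE = re.compile(
    rf"\b(?:(?P<d1>\d{{1,2}}){_ORD}[\s/-]+(?P<m1>{_MON})"
    rf"|(?P<m2>{_MON})\.?\s+(?P<d2>\d{{1,2}}){_ORD})\b"
    rf"(?:,?[\s/-]+(?P<year>\d{{4}})\b)?",
    re.IGNORECASE,
)


def normalize_dates(text: str, default_year: int) -> str:
    """Rewrite recognized dates as DD-Mon-YYYY, filling in default_year if absent."""
    def _replace(m: re.Match) -> str:
        day = int(m.group("d1") or m.group("d2"))
        month_key = (m.group("m1") or m.group("m2"))[:3].capitalize()
        month = MONTH_ABBR.index(month_key) + 1
        year = int(m.group("year")) if m.group("year") else default_year
        try:
            parsed = date(year, month, day)  # rejects impossible dates
        except ValueError:
            return m.group(0)  # leave the original text in place
        return f"{parsed.day:02d}-{MONTH_ABBR[parsed.month - 1]}-{parsed.year:04d}"

    return DATE_RE.sub(_replace, text)


if __name__ == "__main__":
    # 2000 is a placeholder default year chosen only for this demo.
    print(normalize_dates("Meet on Jan 5 or 5th Jan", 2000))
    print(normalize_dates("Due 05/Jan, renewed January 5, 2024", 2000))
```

Text that looks like a date but is not a valid calendar day (for example "Feb 30") is left unchanged rather than dropped, in keeping with the rule that dates must not be removed. A pattern this simple will also match phrases such as "5 may" in ordinary prose, so a real pipeline may need stricter context checks.
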
Consistency:
- Apply the exact same preprocessing steps to both the dataset and user inputs during inference.
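
A simple way to honor the consistency rule is to build one preprocessing function and call that same function on both the stored documents and each incoming query. The sketch below assumes a hypothetical helper `make_preprocessor` and uses only the two text rules as steps. The source does not fix the order of the steps, so the order here is an assumption. Note that if hyphen replacement runs after date standardization, it will also split a "DD-Mon-YYYY" date into space-separated parts.

```python
from typing import Callable, Iterable

Step = Callable[[str], str]


def make_preprocessor(steps: Iterable[Step]) -> Step:
    """Compose the steps into a single function that applies them in order."""
    ordered = list(steps)

    def preprocess(text: str) -> str:
        for step in ordered:
            text = step(text)
        return text

    return preprocess


# One shared pipeline: lowercase, then replace hyphens with spaces.
preprocess = make_preprocessor([str.lower, lambda s: s.replace("-", " ")])

# Index time: preprocess every document before embedding it.
corpus = ["Check-In opens at noon", "Late Check-Out policy"]
corpus_ready = [preprocess(doc) for doc in corpus]

# Inference time: the user query goes through the identical function.
query_ready = preprocess("When is Check-In?")
```

Keeping a single `preprocess` object, rather than reimplementing the steps in two places, makes it harder for the dataset and query paths to drift apart.
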

Anti-Patterns

Do not remove or ignore dates.
Do not apply cleaning steps that are not specified here (such as stopword removal) unless explicitly requested.

Triggers

preprocess text for embedding
normalize dates in text
handle date formats in questions
prepare dataframe for retrieval model