Hacktricks-skills llm-data-sampling
How to prepare and sample text data for training large language models. Use this skill whenever the user mentions data preparation, tokenization, sliding windows, sequence generation, training data, LLM datasets, or needs to create input/target pairs for model training. This includes tasks like chunking text, creating dataloaders, applying sampling strategies, or optimizing training data quality.
git clone https://github.com/abelrguezr/hacktricks-skills
skills/AI/AI-llm-architecture/2.-data-sampling/SKILL.md
LLM Data Sampling
A skill for preparing and sampling text data for training large language models (LLMs). This covers tokenization, sequence generation, sliding windows, and advanced sampling strategies.
When to Use This Skill
Use this skill when the user needs to:
- Prepare text data for LLM training
- Create input/target token sequences
- Implement sliding window sampling
- Apply advanced sampling strategies (temperature weighting, sequence packing, deduplication)
- Optimize training data quality and security
- Create PyTorch datasets and dataloaders for LLM training
Core Concepts
1. Tokenization
Breaking text into smaller units (tokens) that the model processes. Common approaches:
- Word-level: Split by spaces
- Subword-level: BPE, WordPiece (used by GPT-2, BERT)
- Character-level: Individual characters
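For subword tokenization, a minimal sketch using the `tiktoken` GPT-2 BPE encoding (assumes `tiktoken` is installed; any BPE/WordPiece tokenizer follows the same encode/decode pattern):

```python
import tiktoken  # pip install tiktoken

# GPT-2's BPE (subword-level) tokenizer
enc = tiktoken.get_encoding("gpt2")

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
token_ids = enc.encode(text)

print(token_ids)               # list of integer token IDs
print(enc.decode(token_ids))   # round-trips back to the original text
```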
2. Sequence Length (max_length)
The number of tokens in each input sequence. Typical values:
- Small models: 256-512 tokens
- Medium models: 512-1024 tokens
- Large models: 1024-4096+ tokens
3. Sliding Window
A method to create overlapping input sequences by moving a window over tokenized text.
4. Stride
The number of tokens the sliding window moves forward. Key tradeoffs:
- Stride = 1: Maximum overlap, better context learning, higher overfitting risk
- Stride = max_length: No overlap, less redundancy, may miss dependencies
- Stride = 2-4×max_length: Recommended for most cases to balance context and efficiency
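A minimal sketch of the sliding window itself, showing how stride controls the overlap between windows (function and variable names are illustrative):

```python
def sliding_windows(token_ids, max_length, stride):
    """Yield (input, target) windows; target is the input shifted right by one token."""
    for start in range(0, len(token_ids) - max_length, stride):
        input_ids = token_ids[start:start + max_length]
        target_ids = token_ids[start + 1:start + max_length + 1]
        yield input_ids, target_ids

token_ids = list(range(20))  # stand-in for real tokenizer output
for x, y in sliding_windows(token_ids, max_length=4, stride=4):
    print(x, "->", y)
# stride=1 would produce heavily overlapping windows instead
```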
Step-by-Step Data Sampling
Basic Workflow
- Load and tokenize text
- Apply sliding window to create sequences
- Generate input/target pairs (target is input shifted by 1 token)
- Create dataset and dataloader for training
Example: Creating Input/Target Sequences
Given text:
"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
With max_length=4 and stride=1:
| Window | Input Sequence | Target Sequence |
|---|---|---|
| 1 | ["Lorem", "ipsum", "dolor", "sit"] | ["ipsum", "dolor", "sit", "amet,"] |
| 2 | ["ipsum", "dolor", "sit", "amet,"] | ["dolor", "sit", "amet,", "consectetur"] |
| 3 | ["dolor", "sit", "amet,", "consectetur"] | ["sit", "amet,", "consectetur", "adipiscing"] |
Implementation Guide
Using the Sampling Script
The bundled script scripts/sample_data.py handles the complete data sampling pipeline:
# Basic usage
python scripts/sample_data.py \
  --input "path/to/text.txt" \
  --output "path/to/output.jsonl" \
  --max-length 256 \
  --stride 128 \
  --batch-size 8

# With advanced options
python scripts/sample_data.py \
  --input "data/" \
  --output "processed/" \
  --max-length 512 \
  --stride 512 \
  --temperature 0.7 \
  --deduplicate \
  --shuffle
Key Parameters
| Parameter | Description | Recommended Value |
|---|---|---|
| max_length | Sequence length in tokens | 256-1024 |
| stride | Window step size | ≥ max_length for most cases |
| batch_size | Samples per batch | 8-32 (depends on GPU) |
| temperature | Sampling temperature (α) | 0.7 for mixed corpora |
| shuffle | Randomize order | True for training |
Advanced Sampling Strategies
1. Temperature-Based Mixture Weighting
When training on multiple data sources, use temperature weighting to balance corpus proportions:
p(i) = w_i^α / Σ_j w_j^α

- w_i: raw token percentage of corpus i
- α (temperature): value in (0, 1]. Lower α flattens the distribution, giving more weight to smaller high-quality corpora
- Llama 2 used α = 0.7 and showed improved evaluation scores
When to use: Training on heterogeneous data (code, web, academic papers, forums)
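A minimal sketch of the formula, assuming per-corpus token counts are already known (corpus names and counts here are made up):

```python
import numpy as np

def mixture_probs(token_counts, alpha=0.7):
    """p(i) = w_i^alpha / sum_j w_j^alpha, where w_i is corpus i's raw token share."""
    w = np.array(list(token_counts.values()), dtype=float)
    w = w / w.sum()            # raw token percentages
    p = w ** alpha
    return dict(zip(token_counts, p / p.sum()))

# Hypothetical corpora: web dominates by raw size
counts = {"web": 900e9, "code": 60e9, "papers": 30e9, "forums": 10e9}
print(mixture_probs(counts, alpha=0.7))   # smaller, high-quality corpora gain weight
print(mixture_probs(counts, alpha=1.0))   # alpha=1 reproduces the raw proportions
```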
2. Sequence Packing / Dynamic Batching
Concatenate multiple shorter sequences until exactly max_length is reached, with attention masks to prevent cross-segment attention.
Benefits:
- 20-40% throughput improvement
- No gradient change
- Reduces padding waste
Implementation: Use HuggingFace's DataCollatorForLanguageModeling(pad_to_multiple_of=...) to reduce padding waste, or a custom packing collator (sketched below).
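A hedged sketch of the packing idea itself: greedily concatenate tokenized documents into fixed-length blocks and record segment IDs, from which a block-diagonal attention mask can be built (names and the GPT-2 end-of-text ID default are assumptions):

```python
def pack_sequences(docs, max_length, eos_id=50256):
    """Greedily concatenate tokenized documents into fixed-length blocks.

    Returns (blocks, segment_ids); segment_ids let you build a block-diagonal
    attention mask so tokens cannot attend across document boundaries.
    """
    blocks, segments, buf, seg, seg_idx = [], [], [], [], 0
    for doc in docs:
        for tok in doc + [eos_id]:          # separate documents with an end-of-text token
            buf.append(tok)
            seg.append(seg_idx)
            if len(buf) == max_length:      # emit a full block and start a new one
                blocks.append(buf)
                segments.append(seg)
                buf, seg = [], []
        seg_idx += 1
    return blocks, segments

# Hypothetical tokenized documents of uneven length
docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
blocks, segments = pack_sequences(docs, max_length=6)
```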
3. Deduplication & Quality Filtering
Deduplication:
- MinHash/FAISS near-duplicate detection at document and n-gram level
- Llama 2 removed ~15% of CommonCrawl using 8-gram MinHash
- Target duplicate ratio: ≤0.04
Quality Filtering:
- Remove documents with perplexity > µ + 3σ (noisy OCR, garbled HTML)
- Block PII and sensitive content using regex & NER
- Filter by source quality scores
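A hedged sketch of 8-gram MinHash near-duplicate filtering using the `datasketch` library (assumes `datasketch` is installed; the Jaccard threshold and the `corpus` iterable are placeholders):

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash_8gram(text, num_perm=128):
    """MinHash signature over word 8-grams of a document."""
    words = text.split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - 7, 1)):
        m.update(" ".join(words[i:i + 8]).encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)   # Jaccard threshold is a tuning knob
kept = []
for doc_id, text in enumerate(corpus):           # corpus: iterable of raw documents
    sig = minhash_8gram(text)
    if not lsh.query(sig):                       # no near-duplicate seen so far
        lsh.insert(str(doc_id), sig)
        kept.append(doc_id)
```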
Security & Privacy Considerations
Data Poisoning / Backdoor Attacks
Risk: Inserting <1% backdoored sentences can create hidden triggers
Mitigations:
- Shuffled mixing: Ensure adjacent examples come from different sources
- Gradient similarity scoring: Remove outliers with high gradient divergence
- Dataset versioning: Freeze immutable tarballs, verify SHA-256 hashes
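For the dataset-versioning mitigation, a small sketch of verifying a frozen tarball against its recorded SHA-256 digest (paths and the expected digest are placeholders):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "..."  # digest recorded when the dataset snapshot was frozen
if sha256_of("datasets/train-v1.tar.gz") != expected:
    raise RuntimeError("Training data does not match the frozen snapshot")
```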
Membership Inference & Memorization
Risk: Long overlap between samples increases memorization of rare strings (phone numbers, keys)
Mitigations:
- Use stride ≥ max_length (except for <1B parameter models with scarce data)
- Random masking: Mask 1-3 tokens per window during training
- OpenAI 2024 finding: Raising stride from 1× to 4× max_length reduces verbatim leakage by ~50%
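A minimal sketch of the random-masking mitigation, replacing 1-3 token positions per window (the mask token ID is an assumption; use whatever ID your tokenizer reserves):

```python
import random

def randomly_mask(window, mask_id, n_min=1, n_max=3):
    """Return a copy of the window with 1-3 random positions replaced by mask_id."""
    window = list(window)
    k = random.randint(n_min, min(n_max, len(window)))
    for pos in random.sample(range(len(window)), k):
        window[pos] = mask_id
    return window

masked = randomly_mask([464, 3290, 318, 257, 1332], mask_id=0)
```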
Best Practices
For Training Data Preparation
- Start with stride = max_length for most cases
- Use stride = 1 only for small models (<1B params) with limited data
- Apply deduplication before sampling (8-gram MinHash recommended)
- Filter low-quality documents using perplexity thresholds
- Version your datasets with SHA-256 hashes
- Shuffle across sources to prevent gradient alignment attacks
For Production Pipelines
- Use temperature weighting (α=0.7) for mixed corpora
- Implement sequence packing for 20-40% throughput gains
- Monitor duplicate ratios (target ≤0.04)
- Apply PII filtering before training
- Log sampling statistics for reproducibility
Common Issues & Solutions
| Issue | Solution |
|---|---|
| GPU memory wasted on padding | Use sequence packing with attention masks |
| Model overfitting to repeated patterns | Increase stride, apply deduplication |
| Slow training throughput | Use sequence packing, optimize batch size |
| Memorization of sensitive data | Increase stride, add random masking |
| Poor performance on knowledge tasks | Use temperature weighting (α=0.7) |
References
- Build a Large Language Model from Scratch (Manning, 2024)
- Llama 2: Open Foundation and Fine-Tuned Chat Models (2023)
- PoisonGPT: Assessing Backdoor Vulnerabilities (BlackHat EU 2023)
- OpenAI Deduplicate Everything (2024)
Next Steps
After preparing your data:
- Validate the sampled sequences with scripts/validate_sampling.py
- Check for duplicates and quality issues
- Create a training dataloader with appropriate batch size
- Monitor for memorization during training
- Adjust stride and temperature based on validation performance