Learn-skills.dev feature-engineering
Feature construction from market data for ML trading models, including price, volume, on-chain, and microstructure features.
Clone the repository:
```bash
git clone https://github.com/NeverSight/learn-skills.dev
```
Or install the skill directly into your Claude skills directory:
```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/agiprolabs/claude-trading-skills/feature-engineering" ~/.claude/skills/neversight-learn-skills-dev-feature-engineering && rm -rf "$T"
```
Skill source: data/skills-md/agiprolabs/claude-trading-skills/feature-engineering/SKILL.md
Feature Engineering for Trading ML
Feature engineering is the single highest-leverage activity in building ML trading models. Model selection (XGBoost vs. neural net vs. logistic regression) matters far less than the quality and diversity of input features. A simple model on great features will outperform a complex model on raw prices every time.
This skill covers constructing, validating, and selecting features from market data for use in classification (signal-classification) and regression models targeting crypto/Solana token trading.
Why Features Beat Models
Raw OHLCV data is non-stationary, noisy, and high-dimensional. Models trained directly on price series will overfit. Feature engineering transforms raw data into stationary, informative signals that capture distinct aspects of market behavior:
- Compression: Reduce thousands of price bars to dozens of descriptive statistics
- Stationarity: Convert non-stationary prices into stationary returns and ratios
- Domain knowledge: Encode trader intuition (support/resistance, volume climax) as computable quantities
- Regime awareness: Features that behave differently in trending vs. ranging markets help models adapt
Feature Categories
1. Price Features
Derived purely from OHLCV price columns. These capture trend, momentum, and volatility from the price series itself.
The full table of price features, with formulas and lookbacks, is in references/feature_catalog.md. It spans single-bar measures (returns, ranges) through rolling 20-bar statistics, with multi-horizon entries at 5, 10, and 20 bars; a sketch of representative computations follows below.
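A minimal sketch of a few common features in this category, assuming a pandas DataFrame `df` with `high`, `low`, and `close` columns (the column names and the exact feature choices are assumptions, not the catalog's definitions):
```python
import numpy as np
import pandas as pd

def price_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative price features; see references/feature_catalog.md for the full set."""
    out = pd.DataFrame(index=df.index)
    # 1-bar log return: stationary transform of the raw price
    out["log_ret_1"] = np.log(df["close"]).diff()
    # Rolling 20-bar volatility of log returns
    out["vol_20"] = out["log_ret_1"].rolling(20).std()
    # Multi-horizon momentum as trailing returns
    for n in (5, 10, 20):
        out[f"mom_{n}"] = df["close"].pct_change(n)
    # Intrabar range relative to close (1 bar)
    out["hl_range"] = (df["high"] - df["low"]) / df["close"]
    return out
```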
2. Volume Features
Volume confirms or contradicts price movements. Divergences between price and volume are among the most reliable signals in short-term trading.
The full table is in references/feature_catalog.md. Representative entries include the volume ratio (volume / 20-bar average volume), the OBV slope, and dollar volume relative to its rolling mean (see the stationarity table below); a sample computation follows.
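A minimal sketch of these volume features, assuming `volume` and `close` columns; the 20-bar and 10-bar windows are illustrative choices:
```python
import numpy as np
import pandas as pd

def volume_features(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    # Volume ratio: stationary version of raw volume
    out["vol_ratio"] = df["volume"] / df["volume"].rolling(20).mean()
    # On-balance volume, then its slope over a 10-bar window (stationary transform)
    obv = (np.sign(df["close"].diff()).fillna(0) * df["volume"]).cumsum()
    out["obv_slope"] = obv.rolling(10).apply(
        lambda y: np.polyfit(np.arange(len(y)), y, 1)[0], raw=True
    )
    # Dollar volume relative to its rolling mean
    dollar_vol = df["close"] * df["volume"]
    out["dollar_vol_ratio"] = dollar_vol / dollar_vol.rolling(20).mean()
    return out
```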
3. Technical Features
Standard technical indicators computed via pandas-ta. Use the pandas-ta skill for full parameter documentation.
| Feature | Source | Lookback |
|---|---|---|
| RSI(14) | `ta.rsi` | 14 bars |
| MACD(12,26,9) histogram | `ta.macd` | 33 bars |
| Bollinger %B (position in bands) | `ta.bbands` | 20 bars |
| Bollinger bandwidth | `ta.bbands` | 20 bars |
| ATR(14) | `ta.atr` | 14 bars |
| ADX(14) | `ta.adx` | 14 bars |
| Stochastic %K(14,3) | `ta.stoch` | 14 bars |
| CCI(20) | `ta.cci` | 20 bars |
| MFI(14) | `ta.mfi` | 14 bars |
| Supertrend direction (+1/-1) | `ta.supertrend` | 10 bars |
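A minimal sketch of computing several of these with pandas-ta via its DataFrame accessor (the input file name is an assumption; output column names such as `RSI_14` vary by pandas-ta version, so verify against the pandas-ta skill):
```python
import pandas as pd
import pandas_ta as ta  # registers the .ta accessor on DataFrames

df = pd.read_csv("ohlcv.csv")  # assumed OHLCV file with open/high/low/close/volume
df.ta.rsi(length=14, append=True)                    # e.g. RSI_14
df.ta.macd(fast=12, slow=26, signal=9, append=True)  # MACD line, histogram, signal
df.ta.bbands(length=20, append=True)                 # bands plus %B and bandwidth
df.ta.adx(length=14, append=True)
df.ta.stoch(k=14, d=3, append=True)
df.ta.supertrend(length=10, append=True)             # includes a +1/-1 direction column
```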
4. Microstructure Features
Derived from trade-level data (individual swaps/transactions). Require on-chain or DEX API data.
| Feature | Description |
|---|---|
| Trade intensity | Trades this bar / avg trades per bar |
| Avg trade size | Mean trade size in USD |
| Whale volume share | % of volume from trades > $10k |
| Unique traders | Count of distinct wallet addresses |
| Buy ratio | Buy trades / total trades |
| Trade size entropy | Shannon entropy of trade size distribution |
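A minimal sketch for one bar's trades, assuming a DataFrame with hypothetical `size_usd`, `side`, and `wallet` columns (the column names and the 20-bin histogram are assumptions; the $10k cutoff matches the table above):
```python
import numpy as np
import pandas as pd
from scipy.stats import entropy

def microstructure_features(trades: pd.DataFrame) -> dict:
    """Features for one bar's trades; column names are illustrative."""
    total = len(trades)
    vol = trades["size_usd"].sum()
    # Share of volume from large ("whale") trades
    whale_share = trades.loc[trades["size_usd"] > 10_000, "size_usd"].sum() / vol
    # Buy-side pressure
    buy_ratio = (trades["side"] == "buy").sum() / total
    # Shannon entropy of the bucketed trade-size distribution
    counts, _ = np.histogram(trades["size_usd"], bins=20)
    size_entropy = entropy(counts / counts.sum())
    return {
        "avg_trade_size": vol / total,
        "unique_traders": trades["wallet"].nunique(),
        "whale_volume_share": whale_share,
        "buy_ratio": buy_ratio,
        "trade_size_entropy": size_entropy,
    }
```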
5. On-Chain Features
Derived from blockchain state changes. Require Helius or Solana RPC data.
| Feature | Description |
|---|---|
| Holder growth | Change in unique holders over N periods |
| Top-holder net flow | Net tokens moved by top-10 holders |
| Token velocity | Transfer volume / circulating supply |
| Liquidity change | Change in DEX liquidity pool TVL |
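Once the raw series are fetched, these reduce to differences and ratios; a sketch assuming pandas Series of holder counts and pool TVL indexed by bar (the series names and 12-period window are assumptions):
```python
import pandas as pd

def onchain_features(holders: pd.Series, pool_tvl: pd.Series, n: int = 12) -> pd.DataFrame:
    out = pd.DataFrame(index=holders.index)
    # Change in unique holders over N periods (stationary, unlike the raw count)
    out["holder_growth"] = holders.diff(n)
    # Percent change in DEX pool TVL over the same window
    out["liquidity_change"] = pool_tvl.pct_change(n)
    return out
```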
6. Cross-Asset Features
Capture relationships between the target token and broader market.
| Feature | Description |
|---|---|
| SOL correlation | Rolling correlation with SOL price |
| BTC beta | Rolling beta to BTC returns |
| Sector momentum | Average return of tokens in same sector |
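A sketch of rolling correlation and beta, assuming per-bar return Series for the token and the reference assets (the 60-bar window is an assumption):
```python
import pandas as pd

def cross_asset_features(token_ret: pd.Series, sol_ret: pd.Series,
                         btc_ret: pd.Series, window: int = 60) -> pd.DataFrame:
    out = pd.DataFrame(index=token_ret.index)
    # Rolling correlation of token returns with SOL returns
    out["sol_corr"] = token_ret.rolling(window).corr(sol_ret)
    # Rolling beta to BTC: cov(token, btc) / var(btc)
    cov = token_ret.rolling(window).cov(btc_ret)
    out["btc_beta"] = cov / btc_ret.rolling(window).var()
    return out
```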
7. Time Features
Cyclical encoding of calendar time. Use sin/cos encoding to preserve cyclical continuity (hour 23 is close to hour 0).
```python
import numpy as np

# hour (0-23) and day (0-6) come from each bar's timestamp
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)
day_sin = np.sin(2 * np.pi * day / 7)
day_cos = np.cos(2 * np.pi * day / 7)  # pair sin with cos so each day maps to a unique point
```
Stationarity
Non-stationary features will cause your model to fail on new data. A feature is stationary if its statistical properties (mean, variance) don't change over time.
Testing for Stationarity
Use the Augmented Dickey-Fuller (ADF) test:
```python
from statsmodels.tsa.stattools import adfuller  # adfuller lives in statsmodels, not scipy

result = adfuller(feature_series.dropna())
p_value = result[1]
is_stationary = p_value < 0.05  # reject the unit-root null at the 5% level
```
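To screen a whole feature matrix at once, the same test can be looped over columns (a sketch; `features` is an assumed DataFrame of computed features):
```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def stationarity_report(features: pd.DataFrame) -> pd.Series:
    """ADF p-value per feature; values >= 0.05 need a stationary transform."""
    return pd.Series(
        {col: adfuller(features[col].dropna())[1] for col in features.columns}
    ).sort_values(ascending=False)
```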
Making Features Stationary
| Non-Stationary | Stationary Transform |
|---|---|
| Price | Log return |
| Volume | Volume ratio (vol / avg vol) |
| OBV | OBV slope (regression coefficient) |
| Holder count | Holder count change |
| RSI | Already stationary (bounded 0-100) |
| Dollar volume | Dollar volume / rolling mean |
Rule: If a feature trends upward or downward over time, it is non-stationary. Transform it into a ratio, difference, or rate of change.
Normalization
After computing features, normalize them so that all features have comparable scales. This is critical for distance-based models (KNN, SVM) and helpful for tree models.
| Method | Formula | When to Use |
|---|---|---|
| Z-score | (x - mean) / std | Gaussian-like distributions |
| Min-max | (x - min) / (max - min) | Bounded features (RSI, BB position) |
| Rank | rank(x) / N | Heavy-tailed distributions |
Critical: Use rolling statistics for normalization. Never use full-sample mean/std — that introduces lookahead bias.
```python
# CORRECT: rolling z-score, computed from trailing data only
z = (feature - feature.rolling(60).mean()) / feature.rolling(60).std()

# WRONG: full-sample z-score leaks future statistics into every row (lookahead bias!)
z = (feature - feature.mean()) / feature.std()
```
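The rank method needs the same rolling discipline; a sketch using a trailing percentile rank (the 250-bar window is an assumption):
```python
# Fraction of the trailing window sitting below the latest value, in [0, 1]
rank = feature.rolling(250).apply(
    lambda w: (w[:-1] < w[-1]).mean(), raw=True
)
```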
No-Lookahead Guarantee
The most dangerous bug in trading ML is lookahead bias — using future information to compute features or targets. Follow these rules absolutely:
- Rolling calculations only: Never use `.mean()` or `.std()` on the full series. Always use `.rolling(N).mean()` and `.rolling(N).std()`.
- Shift targets forward, not features backward: The target is `close.shift(-N) / close - 1` (future return), not `close / close.shift(N) - 1` (past return used as target).
- No future index alignment: When joining feature and target DataFrames, verify that feature row `t` is paired with target row `t` (where the target already contains the forward shift).
- Train/test split by time: Never random split. Always `train = data[:split_idx]`, `test = data[split_idx:]` (see the consolidated sketch below).
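A consolidated sketch putting these rules together, assuming `features` is a DataFrame aligned to the same bar index as the `close` Series (the horizon N and the 80/20 split are assumptions):
```python
import pandas as pd

N = 12  # forward horizon in bars (assumption)

# Target: forward return, shifted so that row t predicts bars t+1..t+N
target = close.shift(-N) / close - 1

# Align features (known at t) with the forward-looking target at t,
# dropping the last N rows where the target is undefined
data = features.copy()
data["target"] = target
data = data.dropna(subset=["target"])

# Time-based split: train strictly precedes test, never a random shuffle
split_idx = int(len(data) * 0.8)
train, test = data.iloc[:split_idx], data.iloc[split_idx:]
```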
Feature Selection
After computing many features, select the most predictive and least redundant:
Step 1: Remove Low-Variance Features
```python
from sklearn.feature_selection import VarianceThreshold

# Drop near-constant features that carry no signal
selector = VarianceThreshold(threshold=0.01)
X_filtered = selector.fit_transform(X)
```
Step 2: Correlation Filter
Remove features with > 0.9 correlation to another feature (keep the one with higher target correlation):
```python
import numpy as np

corr_matrix = X.corr().abs()
# Upper triangle only, so each pair is checked once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.9)]
```
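The snippet above always drops the later column of each correlated pair; to honor the keep-the-stronger-feature rule from the text, compare target correlations first (a sketch; `y` is the label series):
```python
import numpy as np

target_corr = X.corrwith(y).abs()
corr_matrix = X.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

to_drop = set()
for col in upper.columns:
    for row in upper.index[upper[col] > 0.9]:
        # Of each highly correlated pair, drop the one less correlated with the target
        weaker = row if target_corr[row] < target_corr[col] else col
        to_drop.add(weaker)
X_reduced = X.drop(columns=list(to_drop))
```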
Step 3: Feature Importance
Train a random forest and rank by importance:
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(
    rf.feature_importances_, index=X.columns
).sort_values(ascending=False)
```
Step 4: Mutual Information
Non-linear alternative to correlation:
```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

mi = mutual_info_classif(X_train, y_train, random_state=42)
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
```
Label Creation
Labels (targets) define what the model learns to predict.
Binary Classification
```python
forward_return = close.shift(-N) / close - 1
label = (forward_return > threshold).astype(int)  # 1 = up, 0 = not up
```
Typical thresholds: 1% for 1h bars, 3% for 4h bars, 5% for daily bars.
Multi-Class Classification
```python
import numpy as np
import pandas as pd

# 0 = down, 1 = flat, 2 = up
label = pd.cut(
    forward_return,
    bins=[-np.inf, -threshold, threshold, np.inf],
    labels=[0, 1, 2],
)
```
Regression
```python
target = forward_return  # predict the exact return magnitude
```
Binary classification is recommended for initial models — it's simpler and more robust to noise.
Integration with Other Skills
- pandas-ta: Compute technical indicators that become features
- birdeye-api: Fetch OHLCV and trade data for feature computation
- helius-api: Fetch on-chain data for holder/whale features
- signal-classification: Use engineered features as model inputs
- regime-detection: Regime labels as features or for regime-conditional models
- ohlcv-processing: Clean and resample raw data before feature computation
Files
References
- references/feature_catalog.md: Complete catalog of ~40 features with formulas, lookbacks, stationarity status, and interpretation notes
- references/pitfalls.md: Common mistakes in trading feature engineering (lookahead bias, overfitting, survivorship bias, data snooping, non-stationarity)
Scripts
- scripts/build_features.py: Compute 25+ features from OHLCV data with stationarity testing and quality reporting; supports demo mode with synthetic data or live data via the Birdeye API
- scripts/feature_importance.py: Rank features by predictive power using tree-based importance and permutation importance; identifies redundant features via correlation analysis