Agent-almanac: build-feature-store

```bash
# Clone the full repository
git clone https://github.com/pjt222/agent-almanac

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/pjt222/agent-almanac "$T" && mkdir -p ~/.claude/skills && cp -r "$T/i18n/wenyan-lite/skills/build-feature-store" ~/.claude/skills/pjt222-agent-almanac-build-feature-store-aba177 && rm -rf "$T"
```

Source: `i18n/wenyan-lite/skills/build-feature-store/SKILL.md`

# Build Feature Store
See Extended Examples for complete configuration files and templates.
Implement centralized feature management with Feast, ensuring feature-serving consistency between training and inference.
## When to Use

- Managing features for multiple ML models across teams
- Ensuring feature consistency between training and serving
- Implementing point-in-time correct historical features
- Serving low-latency features for real-time inference
- Reusing feature definitions across projects
- Versioning feature transformations
- Building a discoverable, governed feature catalog
- Preventing feature leakage in training pipelines
## Inputs

- Required: raw data sources (databases, data lakes, data warehouses)
- Required: a Python environment with Feast installed
- Required: an offline store backend (BigQuery, Snowflake, Redshift, or Parquet files)
- Required: an online store backend (Redis, DynamoDB, Cassandra, or SQLite for development)
- Optional: feature transformation logic (Python, SQL, Spark)
- Optional: entity key definitions (user_id, product_id, etc.)
- Optional: a Kubernetes cluster for deploying Feast servers
## Steps

### Step 1: Initialize the Feast Feature Repository

Create the Feast project structure and configure the storage backends.
```bash
# Install Feast with required extras
pip install 'feast[redis,postgres]'  # Add backends as needed

# Initialize new feature repository
feast init my_feature_repo
cd my_feature_repo

# Directory structure created:
# my_feature_repo/
# ├── feature_store.yaml  # Configuration
# ├── features.py         # Feature definitions
# └── data/               # Sample data (dev only)
```
Configure feature_store.yaml:
```yaml
# feature_store.yaml
project: customer_analytics
registry: data/registry.db  # SQLite for dev, use S3/GCS for prod
provider: local

# Offline store for training data
offline_store:
  type: postgres
  # ... (see EXAMPLES.md for complete implementation)
```
Production configuration with cloud backends:
```yaml
# feature_store.prod.yaml
project: customer_analytics
registry: s3://feast-registry/prod/registry.db
provider: aws
offline_store:
  type: bigquery
  project_id: my-gcp-project
  # ... (see EXAMPLES.md for complete implementation)
```
Expected: the Feast repository is initialized with configuration files, sample feature definitions are created, the offline/online stores are configured, and the registry path is reachable.
On failure: verify database/Redis credentials (`psql -U feast_user -h localhost`), check the connection string format, confirm the database exists (`CREATE DATABASE feature_store`), verify cloud permissions for S3/BigQuery/DynamoDB, test connectivity to the storage backends, and check Feast version compatibility with the backends (`feast version`).
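Before moving on, a quick smoke test confirms the repository loads and the registry is reachable. A minimal sketch, assuming the development feature_store.yaml above sits in the current directory:

```python
# check_store.py: smoke-test the freshly initialized repository
from feast import FeatureStore

fs = FeatureStore(repo_path=".")  # reads feature_store.yaml
print(fs.project)                 # expect: customer_analytics
print(fs.list_entities())         # empty until Step 2's definitions are applied
print(fs.list_feature_views())    # likewise empty on a fresh repository
```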
### Step 2: Define Entities and Data Sources

Create entity definitions and connect the raw data sources.
```python
# entities.py
from feast import Entity, ValueType

# Define entities (primary keys for features)
customer = Entity(
    name="customer",
    description="Customer entity",
    value_type=ValueType.INT64,
    # ... (see EXAMPLES.md for complete implementation)
```
Define data sources:
```python
# data_sources.py
from feast import FileSource, BigQuerySource, RedshiftSource
from feast.data_format import ParquetFormat
from datetime import timedelta

# Development: File-based source
customer_transactions_source = FileSource(
    path="data/customer_transactions.parquet",
    # ... (see EXAMPLES.md for complete implementation)
```
Expected: entity definitions reference the correct ID columns, data sources connect to the raw data successfully, the event_timestamp_column exists in the source, and the created_timestamp_column enables point-in-time queries.
On failure: verify the source data files exist and are readable, check BigQuery/Redshift credentials and table access, confirm the timestamp column formats are correct (Unix timestamps or ISO8601), verify Kafka connectivity and that topics exist, and check schema compatibility between sources and entities.
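A minimal sketch of a paired entity and file source, assuming a Parquet file with customer_id, event_timestamp, and created columns; note that older Feast releases name the FileSource timestamp argument event_timestamp_column rather than timestamp_field:

```python
from feast import Entity, FileSource, ValueType

# The entity is the join key that features are looked up by
customer = Entity(
    name="customer",
    join_keys=["customer_id"],
    value_type=ValueType.INT64,
    description="Customer entity",
)

# File-based source for development; swap in BigQuerySource etc. for prod
customer_transactions_source = FileSource(
    path="data/customer_transactions.parquet",
    timestamp_field="event_timestamp",   # when the event happened
    created_timestamp_column="created",  # when the row was written
)
```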
### Step 3: Define Feature Views with Transformations

Create feature views that define how raw data becomes ML-ready features.
```python
# feature_views.py
from feast import FeatureView, Field
from feast.types import Float32, Int64, String, Bool
from datetime import timedelta
from entities import customer, product
from data_sources import customer_features_source

# Simple feature view without transformations
# ... (see EXAMPLES.md for complete implementation)
```
Expected: feature views register successfully, schemas match the source data, transformations execute without errors, TTL values fit the use case, and on-demand views combine batch and request-time features.
On failure: verify field names match the source columns exactly, check dtype compatibility (Int64 vs. Int32), confirm the referenced entities exist, validate the transformation logic on sample data, check for division by zero in computed features, and verify request source schemas match the inference payload.
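A minimal sketch of a batch feature view for recent Feast releases, reusing the customer entity and customer_transactions_source from the Step 2 sketch; the field names (total_spend_30d, txn_count_30d) are illustrative:

```python
from datetime import timedelta
from feast import FeatureView, Field
from feast.types import Float32, Int64

customer_stats = FeatureView(
    name="customer_stats",
    entities=[customer],
    ttl=timedelta(days=1),  # rows older than this are not served online
    schema=[
        Field(name="total_spend_30d", dtype=Float32),
        Field(name="txn_count_30d", dtype=Int64),
    ],
    online=True,
    source=customer_transactions_source,
)
```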
### Step 4: Apply Feature Definitions and Materialize Features

Deploy the feature definitions to the registry and materialize features into the online store.
```bash
# Apply feature definitions to registry
feast apply

# Expected output:
# Created entity customer
# Created feature view customer_stats
# Created on demand feature view customer_segments
# ... (see EXAMPLES.md for complete implementation)
```
Programmatic materialization:
```python
# materialize_features.py
from feast import FeatureStore
from datetime import datetime, timedelta

# Initialize feature store
fs = FeatureStore(repo_path=".")

# Materialize all feature views
# ... (see EXAMPLES.md for complete implementation)
```
Expected: feature definitions apply to the registry without conflicts, materialization jobs complete successfully, the online store is populated with features, and feature freshness stays within the configured TTL.
On failure: check that offline store queries succeed (`feast feature-views describe customer_stats`), verify data exists for the time range, confirm the online store is writable (Redis/DynamoDB permissions), check that feature names don't collide across views, verify entity keys exist in the source data, monitor the materialization job logs for errors, and check disk space for local stores.
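A minimal sketch of the programmatic path: backfill a window into the online store, then keep it fresh with incremental runs; the 30-day window is an assumption, so size it to your TTLs:

```python
from datetime import datetime, timedelta
from feast import FeatureStore

fs = FeatureStore(repo_path=".")

# One-time backfill of the last 30 days into the online store
end = datetime.utcnow()
fs.materialize(start_date=end - timedelta(days=30), end_date=end)

# Scheduled runs (cron/Airflow) only load the delta since the last run
fs.materialize_incremental(end_date=datetime.utcnow())
```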
### Step 5: Retrieve Features for Training

Retrieve point-in-time correct historical features for model training.
```python
# get_training_data.py
from feast import FeatureStore
import pandas as pd
from datetime import datetime

# Initialize feature store
fs = FeatureStore(repo_path=".")
# ... (see EXAMPLES.md for complete implementation)
```
Point-in-time correctness validation:
```python
# validate_pit_correctness.py
import pandas as pd
from datetime import datetime, timedelta

def validate_point_in_time_correctness(training_df, entity_df):
    """
    Ensure features don't leak future information.
    """
    # ... (see EXAMPLES.md for complete implementation)
```
Expected: historical features are retrieved successfully, entity_df timestamps are preserved, materialized features contain no NaNs, point-in-time correctness holds (no future data leaks in), and feature services combine logically grouped features.
On failure: check that entity_df has the required columns (the entity key plus event_timestamp), verify the feature view names match the registry, confirm the offline store has data for the requested time range, check for timezone mismatches (use UTC), verify the entity IDs exist in the source data, inspect the logs for SQL query errors, and verify the feature view TTLs cover the requested time range.
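A minimal sketch of point-in-time retrieval, assuming the customer_stats view from the Step 3 sketch; the entity IDs and timestamps are illustrative. Feast joins each row against the latest feature values as of that row's event_timestamp, which is what prevents future leakage:

```python
import pandas as pd
from feast import FeatureStore

fs = FeatureStore(repo_path=".")

# One row per training example: entity key plus the label's timestamp
entity_df = pd.DataFrame({
    "customer_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(
        ["2024-01-15 10:00:00", "2024-01-16 12:00:00"], utc=True
    ),
})

training_df = fs.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_stats:total_spend_30d",
        "customer_stats:txn_count_30d",
    ],
).to_df()
```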
### Step 6: Serve Features for Real-Time Inference

Retrieve low-latency features from the online store for model serving.
```python
# serve_features.py
from feast import FeatureStore
import time

# Initialize feature store
fs = FeatureStore(repo_path=".")

def get_inference_features(customer_ids: list, request_data: dict = None):
    # ... (see EXAMPLES.md for complete implementation)
```
FastAPI integration:
```python
# api.py
from fastapi import FastAPI
from pydantic import BaseModel
from feast import FeatureStore
import mlflow

app = FastAPI()
fs = FeatureStore(repo_path=".")
# ... (see EXAMPLES.md for complete implementation)
```
Expected: online features for a single entity are retrieved within ten milliseconds, batch retrieval scales efficiently, on-demand transformations execute correctly, request-time features match batch features, and the API responds quickly (under fifty milliseconds end to end).
On failure: check that the online store is populated (materialize if empty), verify Redis/DynamoDB connectivity and latency, confirm the entity keys exist in the online store, check for cold-start issues (warm the cache), validate the on-demand transformation logic, monitor the online store's memory/CPU, and check network latency between the service and the online store.
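A minimal sketch of the online lookup itself, assuming customer_stats has been materialized. Since this call sits on the request path, keep the FeatureStore instance long-lived (as in the FastAPI example above) instead of constructing it per request:

```python
from feast import FeatureStore

fs = FeatureStore(repo_path=".")  # construct once at service startup

features = fs.get_online_features(
    features=[
        "customer_stats:total_spend_30d",
        "customer_stats:txn_count_30d",
    ],
    entity_rows=[{"customer_id": 1001}],
).to_dict()
# Maps each feature name to a list of values, one per entity row;
# missing entities come back as None rather than raising
print(features["total_spend_30d"][0])
```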
## Validation

- Feast repository initialized and configured
- Offline/online stores connect successfully
- Entity definitions match the source data
- Feature views are registered in the registry
- On-demand transformations execute correctly
- Materialization completes without errors
- Historical feature retrieval is point-in-time correct
- Online features are served with low latency (under ten milliseconds)
- Feature freshness stays within the configured TTL
- Training/serving consistency is verified (see the sketch after this list)
- Feature catalog is accessible for discovery
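One way to check the training/serving consistency item is to fetch the same entity through both paths and compare. A hedged sketch, assuming the customer_stats view from earlier and a freshly materialized online store (a stale store will legitimately fail the assertion):

```python
import math
from datetime import datetime, timezone

import pandas as pd
from feast import FeatureStore

fs = FeatureStore(repo_path=".")
feature_refs = ["customer_stats:total_spend_30d"]

# Offline path: historical features as of "now"
entity_df = pd.DataFrame({
    "customer_id": [1001],
    "event_timestamp": [datetime.now(timezone.utc)],
})
offline = fs.get_historical_features(
    entity_df=entity_df, features=feature_refs
).to_df()

# Online path: the same entity from the online store
online = fs.get_online_features(
    features=feature_refs, entity_rows=[{"customer_id": 1001}]
).to_dict()

assert math.isclose(
    offline["total_spend_30d"][0],
    online["total_spend_30d"][0],
    rel_tol=1e-6,
), "training/serving feature values diverge"
```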
## Common Pitfalls

- Feature leakage: historical features built from future data; always validate point-in-time correctness and use created_timestamp columns
- Transformation inconsistency: different logic in training vs. serving; use Feast on-demand views for consistency
- Stale features: the online store is not materialized regularly; set up scheduled materialization jobs (cron/Airflow)
- Missing entity keys: entities from the training set are absent from the online store; materialize completely and handle missing keys gracefully
- Type mismatches: schema types don't match the source data; validate dtypes before apply and use explicit Field definitions
- Slow online retrieval: network latency or an overloaded online store; co-locate the feature store with the inference service and use connection pooling
- Huge feature views: materializing millions of entities is slow; partition by date, materialize incrementally, and optimize offline queries
- No feature versioning: breaking changes hit production models; version feature views and maintain backward compatibility
- Timezone confusion: mixed timezones cause incorrect joins; always use UTC timestamps (see the sketch after this list)
- Ignored TTLs: serving expired features; set appropriate TTLs and monitor feature freshness
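A small sketch of the UTC rule from the timezone bullet: normalize every timestamp column to timezone-aware UTC before it reaches an entity_df or a source file. The pandas calls are standard; the column name is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"event_timestamp": ["2024-01-15 10:00:00"]})

# Parse as timezone-aware UTC; naive local-time strings are the usual culprit
df["event_timestamp"] = pd.to_datetime(df["event_timestamp"], utc=True)

# If a column is already tz-aware in another zone, convert it instead:
# df["event_timestamp"] = df["event_timestamp"].dt.tz_convert("UTC")
```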
## Related Skills

- track-ml-experiments: record feature metadata in MLflow experiments
- orchestrate-ml-pipeline: schedule feature materialization jobs
- version-ml-data: version the raw data sources behind feature engineering
- deploy-ml-model-serving: integrate the feature store with model serving
- serialize-data-formats: choose efficient storage formats for features
- design-serialization-schema: design schemas for feature sources