Marketplace agentdb-reinforcement-learning-training
Train AI agents using AgentDB's 9 reinforcement learning algorithms including Q-Learning, DQN, PPO, and Actor-Critic. Build self-learning agents, implement RL training loops with experience replay, and deploy optimized models to production.
```bash
# Clone the full marketplace repository
git clone https://github.com/aiskillstore/marketplace

# Or install only this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiskillstore/marketplace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/dnyoussef/agentdb-reinforcement-learning-training" ~/.claude/skills/aiskillstore-marketplace-agentdb-reinforcement-learning-training && rm -rf "$T"
```
skills/dnyoussef/agentdb-reinforcement-learning-training/SKILL.md

AgentDB Reinforcement Learning Training
Overview
Train AI agents with AgentDB's learning plugin and its 9 reinforcement learning algorithms, including Decision Transformer, Q-Learning, SARSA, Actor-Critic, PPO, and more. Build self-learning agents, implement RL training loops, and optimize agent behavior through experience.
When to Use This Skill
Use this skill when you need to:
- Train autonomous agents that learn from experience
- Implement reinforcement learning systems
- Optimize agent behavior through trial and error
- Build self-improving AI systems
- Deploy RL agents in production environments
- Benchmark and compare RL algorithms
Available RL Algorithms
- Q-Learning - Value-based, off-policy
- SARSA - Value-based, on-policy
- Deep Q-Network (DQN) - Deep RL with experience replay
- Actor-Critic - Policy gradient with value baseline
- Proximal Policy Optimization (PPO) - Clipped-objective policy gradient (a first-order alternative to trust-region methods)
- Decision Transformer - Offline RL with transformers
- Advantage Actor-Critic (A2C) - Synchronous actor-critic with advantage estimation
- Twin Delayed DDPG (TD3) - Continuous control
- Soft Actor-Critic (SAC) - Maximum entropy RL
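To make the value-based entries above concrete, here is a minimal tabular Q-Learning update, which also shows the one-line difference from SARSA. This is a standalone sketch; none of the identifiers (`QTable`, `qLearningUpdate`, `alpha`, `gamma`) come from the AgentDB API.

```typescript
// Tabular Q-Learning update: off-policy, bootstraps from the greedy
// next-state action. qTable maps "state|action" keys to value estimates.
type QTable = Map<string, number>;

function qLearningUpdate(
  qTable: QTable,
  state: string,
  action: string,
  reward: number,
  nextState: string,
  actions: string[],
  alpha = 0.1,  // learning rate
  gamma = 0.99  // discount factor
): void {
  const key = `${state}|${action}`;
  const current = qTable.get(key) ?? 0;
  // Off-policy target: max over next-state actions. SARSA (on-policy)
  // would instead use the action the current policy actually takes next.
  const maxNext = Math.max(
    ...actions.map((a) => qTable.get(`${nextState}|${a}`) ?? 0)
  );
  const target = reward + gamma * maxNext;
  qTable.set(key, current + alpha * (target - current));
}
```

Deep variants such as DQN replace the table with a neural network but keep the same bootstrapped target.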
SOP Framework: 5-Phase RL Training Deployment
Phase 1: Initialize Learning Environment (1-2 hours)
Objective: Set up AgentDB learning infrastructure with environment configuration
Agent: ml-developer
Steps:
- Install AgentDB Learning Module
```bash
npm install agentdb-learning@latest
npm install @agentdb/rl-algorithms @agentdb/environments
```
- Initialize learning database
```typescript
import { AgentDB, LearningPlugin } from 'agentdb-learning';

const learningDB = new AgentDB({
  name: 'rl-training-db',
  dimensions: 512, // State embedding dimension
  learning: {
    enabled: true,
    persistExperience: true,
    replayBufferSize: 100000
  }
});

await learningDB.initialize();

// Create learning plugin
const learningPlugin = new LearningPlugin({
  database: learningDB,
  algorithms: ['q-learning', 'dqn', 'ppo', 'actor-critic'],
  config: {
    batchSize: 64,
    learningRate: 0.001,
    discountFactor: 0.99,
    explorationRate: 1.0,
    explorationDecay: 0.995
  }
});

await learningPlugin.initialize();
```
- Define environment
```typescript
import { Environment } from '@agentdb/environments';

const environment = new Environment({
  name: 'grid-world',
  stateSpace: {
    type: 'continuous',
    shape: [10, 10],
    bounds: [[0, 10], [0, 10]]
  },
  actionSpace: {
    type: 'discrete',
    actions: ['up', 'down', 'left', 'right']
  },
  rewardFunction: (state, action, nextState) => {
    // Distance-to-goal reward: negative distance to (9, 9), plus a bonus on arrival
    const goalDistance = Math.sqrt(
      Math.pow(nextState[0] - 9, 2) + Math.pow(nextState[1] - 9, 2)
    );
    return -goalDistance + (goalDistance === 0 ? 100 : 0);
  },
  terminalCondition: (state) => {
    return state[0] === 9 && state[1] === 9; // Reached goal
  }
});

await environment.initialize();
```
- Setup monitoring
```typescript
const monitor = learningPlugin.createMonitor({
  metrics: ['reward', 'loss', 'exploration-rate', 'episode-length'],
  logInterval: 100, // Log every 100 episodes
  saveCheckpoints: true,
  checkpointInterval: 1000
});

monitor.on('episode-complete', (episode) => {
  console.log('Episode:', episode.number, 'Reward:', episode.totalReward);
});
```
Memory Pattern:
```typescript
await agentDB.memory.store('agentdb/learning/environment', {
  name: environment.name,
  stateSpace: environment.stateSpace,
  actionSpace: environment.actionSpace,
  initialized: Date.now()
});
```
Validation:
- Learning database initialized
- Environment configured and tested
- Monitor capturing metrics
- Configuration stored in memory
Phase 2: Configure RL Algorithm (1-2 hours)
Objective: Select and configure RL algorithm for the learning task
Agent: ml-developer
Steps:
- Select algorithm
```typescript
// Example: Deep Q-Network (DQN)
const dqnAgent = learningPlugin.createAgent({
  algorithm: 'dqn',
  config: {
    networkArchitecture: {
      layers: [
        { type: 'dense', units: 128, activation: 'relu' },
        { type: 'dense', units: 128, activation: 'relu' },
        { type: 'dense', units: environment.actionSpace.size, activation: 'linear' }
      ]
    },
    learningRate: 0.001,
    batchSize: 64,
    replayBuffer: {
      size: 100000,
      prioritized: true,
      alpha: 0.6,
      beta: 0.4
    },
    targetNetwork: {
      updateFrequency: 1000,
      tauSync: 0.001 // Soft update
    },
    exploration: {
      initial: 1.0,
      final: 0.01,
      decay: 0.995
    },
    training: {
      startAfter: 1000, // Start training after 1000 experiences
      updateFrequency: 4
    }
  }
});

await dqnAgent.initialize();
```
- Configure hyperparameters
```typescript
const hyperparameters = {
  // Learning parameters
  learningRate: 0.001,
  discountFactor: 0.99, // Gamma
  batchSize: 64,
  // Exploration
  epsilonStart: 1.0,
  epsilonEnd: 0.01,
  epsilonDecay: 0.995,
  // Experience replay
  replayBufferSize: 100000,
  minReplaySize: 1000,
  prioritizedReplay: true,
  // Training
  maxEpisodes: 10000,
  maxStepsPerEpisode: 1000,
  targetUpdateFrequency: 1000,
  // Evaluation
  evalFrequency: 100,
  evalEpisodes: 10
};

dqnAgent.setHyperparameters(hyperparameters);
```
- Setup experience replay
```typescript
import { PrioritizedReplayBuffer } from '@agentdb/rl-algorithms';

const replayBuffer = new PrioritizedReplayBuffer({
  capacity: 100000,
  alpha: 0.6,           // Prioritization exponent
  beta: 0.4,            // Importance sampling
  betaIncrement: 0.001,
  epsilon: 0.01         // Small constant for stability
});

dqnAgent.setReplayBuffer(replayBuffer);
```
- Configure training loop
```typescript
const trainingConfig = {
  episodes: 10000,
  stepsPerEpisode: 1000,
  warmupSteps: 1000,
  trainFrequency: 4,
  targetUpdateFrequency: 1000,
  saveFrequency: 1000,
  evalFrequency: 100,
  earlyStoppingPatience: 500,
  earlyStoppingThreshold: 0.01
};

dqnAgent.setTrainingConfig(trainingConfig);
```
Memory Pattern:
```typescript
await agentDB.memory.store('agentdb/learning/algorithm-config', {
  algorithm: 'dqn',
  hyperparameters: hyperparameters,
  trainingConfig: trainingConfig,
  configured: Date.now()
});
```
Validation:
- Algorithm selected and configured
- Hyperparameters validated
- Replay buffer initialized
- Training config set
Phase 3: Train Agents (3-4 hours)
Objective: Execute training iterations and optimize agent behavior
Agent: safla-neural
Steps:
- Start training loop
```typescript
async function trainAgent() {
  console.log('Starting RL training...');

  const trainingStats = {
    episodes: [],
    totalReward: [],
    episodeLength: [],
    loss: [],
    explorationRate: []
  };

  for (let episode = 0; episode < trainingConfig.episodes; episode++) {
    let state = await environment.reset();
    let episodeReward = 0;
    let episodeLength = 0;
    let episodeLoss = 0;

    for (let step = 0; step < trainingConfig.stepsPerEpisode; step++) {
      // Select action
      const action = await dqnAgent.selectAction(state, { explore: true });

      // Execute action
      const { nextState, reward, done } = await environment.step(action);

      // Store experience
      await dqnAgent.storeExperience({ state, action, reward, nextState, done });

      // Train if enough experiences
      if (dqnAgent.canTrain()) {
        const loss = await dqnAgent.train();
        episodeLoss += loss;
      }

      episodeReward += reward;
      episodeLength += 1;
      state = nextState;

      if (done) break;
    }

    // Update target network
    if (episode % trainingConfig.targetUpdateFrequency === 0) {
      await dqnAgent.updateTargetNetwork();
    }

    // Decay exploration
    dqnAgent.decayExploration();

    // Log progress
    trainingStats.episodes.push(episode);
    trainingStats.totalReward.push(episodeReward);
    trainingStats.episodeLength.push(episodeLength);
    trainingStats.loss.push(episodeLoss / episodeLength);
    trainingStats.explorationRate.push(dqnAgent.getExplorationRate());

    if (episode % 100 === 0) {
      console.log(`Episode ${episode}:`, {
        reward: episodeReward.toFixed(2),
        length: episodeLength,
        loss: (episodeLoss / episodeLength).toFixed(4),
        epsilon: dqnAgent.getExplorationRate().toFixed(3)
      });
    }

    // Save checkpoint
    if (episode % trainingConfig.saveFrequency === 0) {
      await dqnAgent.save(`checkpoint-${episode}`);
    }

    // Evaluate (evaluateAgent, defined in Phase 4, returns an object with meanReward)
    if (episode % trainingConfig.evalFrequency === 0) {
      const evalResult = await evaluateAgent(dqnAgent, environment);
      console.log(`Evaluation at episode ${episode}: ${evalResult.meanReward.toFixed(2)}`);
    }

    // Early stopping
    if (checkEarlyStopping(trainingStats, episode)) {
      console.log('Early stopping triggered');
      break;
    }
  }

  return trainingStats;
}

const trainingStats = await trainAgent();
```
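The loop above calls a `checkEarlyStopping` helper that this skill never defines. A plausible sketch, assuming it mirrors the `checkConvergence` logic from the "Handle convergence" step and the `earlyStoppingPatience` / `earlyStoppingThreshold` values in `trainingConfig`:

```typescript
// Hypothetical helper: stop when the average reward over the last
// `patience` episodes has not improved (relatively) by more than
// `threshold` versus the preceding window.
interface TrainingStats {
  episodes: number[];
  totalReward: number[];
}

function checkEarlyStopping(
  stats: TrainingStats,
  episode: number,
  patience = 500,    // trainingConfig.earlyStoppingPatience
  threshold = 0.01   // trainingConfig.earlyStoppingThreshold
): boolean {
  // Need two full windows of history before deciding
  if (stats.totalReward.length < patience * 2) return false;
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const recent = mean(stats.totalReward.slice(-patience));
  const previous = mean(stats.totalReward.slice(-patience * 2, -patience));
  // Relative improvement below threshold => reward has plateaued
  return (recent - previous) / (Math.abs(previous) || 1) < threshold;
}
```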
- Monitor training progress
```typescript
monitor.on('training-update', (stats) => {
  // Calculate moving averages
  const window = 100;
  const recentRewards = stats.totalReward.slice(-window);
  const avgReward = recentRewards.reduce((a, b) => a + b, 0) / recentRewards.length;

  // Store metrics
  agentDB.memory.store('agentdb/learning/training-progress', {
    episode: stats.episodes[stats.episodes.length - 1],
    avgReward: avgReward,
    explorationRate: stats.explorationRate[stats.explorationRate.length - 1],
    timestamp: Date.now()
  });

  // Plot learning curve (if visualization enabled)
  if (monitor.visualization) {
    monitor.plot('reward-curve', stats.episodes, stats.totalReward);
    monitor.plot('loss-curve', stats.episodes, stats.loss);
  }
});
```
- Handle convergence
```typescript
function checkConvergence(stats, windowSize = 100, threshold = 0.01) {
  if (stats.totalReward.length < windowSize * 2) {
    return false;
  }

  const recent = stats.totalReward.slice(-windowSize);
  const previous = stats.totalReward.slice(-windowSize * 2, -windowSize);

  const recentAvg = recent.reduce((a, b) => a + b, 0) / recent.length;
  const previousAvg = previous.reduce((a, b) => a + b, 0) / previous.length;

  const improvement = (recentAvg - previousAvg) / Math.abs(previousAvg);
  return improvement < threshold;
}
```
- Save trained model
```typescript
await dqnAgent.save('trained-agent-final', {
  includeReplayBuffer: false,
  includeOptimizer: false,
  metadata: {
    trainingStats: trainingStats,
    hyperparameters: hyperparameters,
    finalReward: trainingStats.totalReward[trainingStats.totalReward.length - 1]
  }
});

console.log('Training complete. Model saved.');
```
Memory Pattern:
```typescript
await agentDB.memory.store('agentdb/learning/training-results', {
  algorithm: 'dqn',
  episodes: trainingStats.episodes.length,
  finalReward: trainingStats.totalReward[trainingStats.totalReward.length - 1],
  converged: checkConvergence(trainingStats),
  modelPath: 'trained-agent-final',
  timestamp: Date.now()
});
```
Validation:
- Training completed or converged
- Reward curve shows improvement
- Model saved successfully
- Training stats stored
Phase 4: Validate Performance (1-2 hours)
Objective: Benchmark trained agent and validate performance
Agent: performance-benchmarker
Steps:
- Load trained agent
```typescript
const trainedAgent = await learningPlugin.loadAgent('trained-agent-final');
```
- Run evaluation episodes
```typescript
async function evaluateAgent(agent, env, numEpisodes = 100) {
  const results = {
    rewards: [],
    episodeLengths: [],
    successRate: 0
  };

  for (let i = 0; i < numEpisodes; i++) {
    let state = await env.reset();
    let episodeReward = 0;
    let episodeLength = 0;
    let success = false;

    for (let step = 0; step < 1000; step++) {
      const action = await agent.selectAction(state, { explore: false });
      const { nextState, reward, done } = await env.step(action);

      episodeReward += reward;
      episodeLength += 1;
      state = nextState;

      if (done) {
        success = env.isSuccessful(state);
        break;
      }
    }

    results.rewards.push(episodeReward);
    results.episodeLengths.push(episodeLength);
    if (success) results.successRate += 1;
  }

  results.successRate /= numEpisodes;

  return {
    meanReward: results.rewards.reduce((a, b) => a + b, 0) / results.rewards.length,
    stdReward: calculateStd(results.rewards),
    meanLength: results.episodeLengths.reduce((a, b) => a + b, 0) / results.episodeLengths.length,
    successRate: results.successRate,
    results: results
  };
}

const evalResults = await evaluateAgent(trainedAgent, environment, 100);
console.log('Evaluation results:', evalResults);
```
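`evaluateAgent` calls a `calculateStd` helper that is not defined anywhere in this skill. One minimal implementation (population standard deviation over the reward list):

```typescript
// Population standard deviation of an array of episode rewards.
function calculateStd(values: number[]): number {
  if (values.length === 0) return 0;
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}
```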
- Compare with baseline
```typescript
// Random policy baseline
const randomAgent = learningPlugin.createAgent({ algorithm: 'random' });
const randomResults = await evaluateAgent(randomAgent, environment, 100);

// Calculate improvement
const improvement = {
  rewardImprovement: (evalResults.meanReward - randomResults.meanReward) / Math.abs(randomResults.meanReward),
  lengthImprovement: (randomResults.meanLength - evalResults.meanLength) / randomResults.meanLength,
  successImprovement: evalResults.successRate - randomResults.successRate
};

console.log('Improvement over random:', improvement);
```
- Run comprehensive benchmarks
```typescript
const benchmarks = {
  performanceMetrics: {
    meanReward: evalResults.meanReward,
    stdReward: evalResults.stdReward,
    successRate: evalResults.successRate,
    meanEpisodeLength: evalResults.meanLength
  },
  algorithmComparison: {
    dqn: evalResults,
    random: randomResults,
    improvement: improvement
  },
  inferenceTiming: {
    actionSelection: 0,
    totalEpisode: 0
  }
};

// Measure inference speed
const timingTrials = 1000;
const startTime = performance.now();
for (let i = 0; i < timingTrials; i++) {
  const state = await environment.randomState();
  await trainedAgent.selectAction(state, { explore: false });
}
const endTime = performance.now();
benchmarks.inferenceTiming.actionSelection = (endTime - startTime) / timingTrials;

await agentDB.memory.store('agentdb/learning/benchmarks', benchmarks);
```
Memory Pattern:
```typescript
await agentDB.memory.store('agentdb/learning/validation', {
  evaluated: true,
  meanReward: evalResults.meanReward,
  successRate: evalResults.successRate,
  improvement: improvement,
  timestamp: Date.now()
});
```
Validation:
- Evaluation completed (100 episodes)
- Mean reward exceeds threshold
- Success rate acceptable
- Improvement over baseline demonstrated
Phase 5: Deploy Trained Agents (1-2 hours)
Objective: Deploy trained agents to production environment
Agent: ml-developer
Steps:
- Export production model
```typescript
await trainedAgent.export('production-agent', {
  format: 'onnx', // or 'tensorflowjs', 'pytorch'
  optimize: true,
  quantize: 'int8', // Quantization for faster inference
  includeMetadata: true
});
```
- Create inference API
```typescript
import express from 'express';

const app = express();
app.use(express.json());

// Load production agent
const productionAgent = await learningPlugin.loadAgent('production-agent');

app.post('/api/predict', async (req, res) => {
  try {
    const { state } = req.body;
    const action = await productionAgent.selectAction(state, {
      explore: false,
      returnProbabilities: true
    });
    res.json({
      action: action.action,
      probabilities: action.probabilities,
      confidence: action.confidence
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(3000, () => {
  console.log('RL agent API running on port 3000');
});
```
- Setup monitoring
```typescript
import { ProductionMonitor } from '@agentdb/monitoring';

const prodMonitor = new ProductionMonitor({
  agent: productionAgent,
  metrics: ['inference-latency', 'action-distribution', 'reward-feedback'],
  alerting: {
    latencyThreshold: 100, // ms
    anomalyDetection: true
  }
});

await prodMonitor.start();
```
- Create deployment pipeline
```typescript
const deploymentPipeline = {
  stages: [
    {
      name: 'validation',
      steps: [
        'Load trained model',
        'Run validation suite',
        'Check performance metrics',
        'Verify inference speed'
      ]
    },
    {
      name: 'export',
      steps: [
        'Export to production format',
        'Optimize model',
        'Quantize weights',
        'Package artifacts'
      ]
    },
    {
      name: 'deployment',
      steps: [
        'Deploy to staging',
        'Run smoke tests',
        'Deploy to production',
        'Monitor performance'
      ]
    }
  ]
};

await agentDB.memory.store('agentdb/learning/deployment-pipeline', deploymentPipeline);
```
Memory Pattern:
```typescript
await agentDB.memory.store('agentdb/learning/production', {
  deployed: true,
  modelPath: 'production-agent',
  apiEndpoint: 'http://localhost:3000/api/predict',
  monitoring: true,
  timestamp: Date.now()
});
```
Validation:
- Model exported successfully
- API running and responding
- Monitoring active
- Deployment pipeline documented
Integration Scripts
Complete Training Script
```bash
#!/bin/bash
# train-rl-agent.sh
set -e

echo "AgentDB RL Training Script"
echo "=========================="

# Phase 1: Initialize
echo "Phase 1: Initializing learning environment..."
npm install agentdb-learning @agentdb/rl-algorithms

# Phase 2: Configure
echo "Phase 2: Configuring algorithm..."
node -e "require('./config-algorithm.js')"

# Phase 3: Train
echo "Phase 3: Training agent..."
node -e "require('./train-agent.js')"

# Phase 4: Validate
echo "Phase 4: Validating performance..."
node -e "require('./evaluate-agent.js')"

# Phase 5: Deploy
echo "Phase 5: Deploying to production..."
node -e "require('./deploy-agent.js')"

echo "Training complete!"
```
Quick Start Script
```typescript
// quickstart-rl.ts
import { setupRLTraining } from './setup';

async function quickStart() {
  console.log('Starting RL training quick setup...');

  // Setup
  const { learningDB, environment, agent } = await setupRLTraining({
    algorithm: 'dqn',
    environment: 'grid-world',
    episodes: 1000
  });

  // Train
  console.log('Training agent...');
  const stats = await agent.train(environment, {
    episodes: 1000,
    logInterval: 100
  });

  // Evaluate
  console.log('Evaluating agent...');
  const results = await agent.evaluate(environment, { episodes: 100 });
  console.log('Results:', results);

  // Save
  await agent.save('quickstart-agent');
  console.log('Quick start complete!');
}

quickStart().catch(console.error);
```
Evidence-Based Success Criteria
Training Convergence (Self-Consistency)
- Reward curve stabilizes
- Moving average improvement < 1%
- Agent achieves consistent performance
Performance Benchmarks (Quantitative)
- Mean reward exceeds baseline by 50%
- Success rate > 80%
- Inference time < 10ms per action
Algorithm Validation (Chain-of-Verification)
- Hyperparameters validated
- Exploration-exploitation balanced
- Experience replay functioning
Production Readiness (Multi-Agent Consensus)
- Model exported successfully
- API responds within latency threshold
- Monitoring active and alerting
- Deployment pipeline documented
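The quantitative criteria above can be enforced as a programmatic release gate. The thresholds below mirror the bullets; the input field names are assumptions chosen to match the Phase 4 `evalResults` and `benchmarks` values, not part of any AgentDB API, and the baseline check assumes a positive baseline reward.

```typescript
// Hypothetical release gate for the Phase 4 benchmark numbers.
interface GateInput {
  meanReward: number;
  baselineMeanReward: number; // e.g. random-policy mean reward (assumed > 0)
  successRate: number;        // 0..1
  actionSelectionMs: number;  // mean inference latency per action
}

function passesReleaseGate(r: GateInput): { pass: boolean; failures: string[] } {
  const failures: string[] = [];
  // Mean reward exceeds baseline by 50%
  if (r.meanReward < 1.5 * r.baselineMeanReward) {
    failures.push('mean reward does not exceed baseline by 50%');
  }
  // Success rate > 80%
  if (r.successRate <= 0.8) failures.push('success rate <= 80%');
  // Inference time < 10ms per action
  if (r.actionSelectionMs >= 10) failures.push('inference >= 10ms per action');
  return { pass: failures.length === 0, failures };
}
```

Running this in CI after Phase 4 turns "deployment readiness" from a checklist into a hard gate.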
Additional Resources
- AgentDB Learning Documentation: https://agentdb.dev/docs/learning
- RL Algorithms Guide: https://agentdb.dev/docs/rl-algorithms
- Training Best Practices: https://agentdb.dev/docs/training
- Production Deployment: https://agentdb.dev/docs/deployment