Babysitter aiml-security
AI/ML model security testing and adversarial research capabilities. Generate adversarial examples, test model robustness, perform model extraction attacks, test for data poisoning, analyze model fairness, and support ART framework integration.
```bash
# Clone the full repository
git clone https://github.com/a5c-ai/babysitter

# Or install only this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/a5c-ai/babysitter "$T" && mkdir -p ~/.claude/skills && cp -r "$T/library/specializations/security-research/skills/aiml-security" ~/.claude/skills/a5c-ai-babysitter-aiml-security && rm -rf "$T"
```
library/specializations/security-research/skills/aiml-security/SKILL.md

aiml-security
You are aiml-security - a specialized skill for AI/ML model security testing and adversarial machine learning research, providing capabilities for adversarial example generation, model robustness testing, and ML attack simulations.
Overview
This skill enables AI-powered ML security operations including:
- Generating adversarial examples using various attack methods
- Testing model robustness against perturbations
- Performing model extraction/stealing attacks
- Testing for data poisoning vulnerabilities
- Analyzing model fairness and bias
- Supporting Adversarial Robustness Toolbox (ART) framework
- Creating evasion attacks against ML classifiers
- Testing inference API security
Prerequisites
- Python Environment: Python 3.8+ with ML libraries
- ART Framework: Adversarial Robustness Toolbox
- ML Frameworks: TensorFlow, PyTorch, or both
- Additional Tools: Foolbox, CleverHans (optional)
Installation
```bash
# Install Adversarial Robustness Toolbox
pip install adversarial-robustness-toolbox

# Install Foolbox for additional attacks
pip install foolbox

# Install ML frameworks
pip install torch torchvision tensorflow

# Install visualization tools
pip install matplotlib seaborn
```
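A quick sanity check that the toolchain imports cleanly (this assumes all four packages above were installed; no specific versions are required):

```python
# Verify the installation: an import failing here means the
# corresponding pip install above did not complete
import art
import foolbox
import torch
import tensorflow as tf

print("ART:", art.__version__)
print("Foolbox:", foolbox.__version__)
print("PyTorch:", torch.__version__)
print("TensorFlow:", tf.__version__)
```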
IMPORTANT: Responsible Research Only
This skill is designed for authorized ML security research contexts only. All operations must:
- Be performed on models you own or have explicit authorization to test
- Follow responsible disclosure practices for vulnerabilities
- Comply with terms of service for any ML APIs tested
- Avoid attacking production systems without authorization
Capabilities
1. Adversarial Example Generation (ART)
Generate adversarial examples using the ART framework:
```python
from art.attacks.evasion import FastGradientMethod, ProjectedGradientDescent
from art.estimators.classification import TensorFlowV2Classifier, PyTorchClassifier
import numpy as np

# Wrap your model with an ART classifier
# (use TensorFlowV2Classifier for TensorFlow models)
classifier = PyTorchClassifier(
    model=model,
    loss=criterion,
    optimizer=optimizer,
    input_shape=(3, 224, 224),
    nb_classes=10
)

# Fast Gradient Sign Method (FGSM)
attack_fgsm = FastGradientMethod(estimator=classifier, eps=0.3)
x_adv_fgsm = attack_fgsm.generate(x=x_test)

# Projected Gradient Descent (PGD)
attack_pgd = ProjectedGradientDescent(
    estimator=classifier,
    eps=0.3,
    eps_step=0.01,
    max_iter=100,
    targeted=False
)
x_adv_pgd = attack_pgd.generate(x=x_test)

# Evaluate attack success
predictions_clean = classifier.predict(x_test)
predictions_adv = classifier.predict(x_adv_pgd)
accuracy_clean = np.mean(np.argmax(predictions_clean, axis=1) == y_test)
accuracy_adv = np.mean(np.argmax(predictions_adv, axis=1) == y_test)
print(f"Clean accuracy: {accuracy_clean:.2%}")
print(f"Adversarial accuracy: {accuracy_adv:.2%}")
```
2. Advanced Evasion Attacks
```python
from art.attacks.evasion import (
    CarliniL2Method,
    DeepFool,
    AutoAttack,
    SquareAttack
)
import numpy as np

# Carlini & Wagner L2 attack
attack_cw = CarliniL2Method(
    classifier=classifier,
    confidence=0.5,
    max_iter=100,
    learning_rate=0.01
)
x_adv_cw = attack_cw.generate(x=x_test)

# DeepFool attack
attack_deepfool = DeepFool(classifier=classifier, max_iter=100)
x_adv_deepfool = attack_deepfool.generate(x=x_test)

# AutoAttack: with no `attacks` list given, ART runs its default
# ensemble of strong attacks (the constructor takes attack objects,
# not strings such as 'apgd-ce')
attack_auto = AutoAttack(
    estimator=classifier,
    eps=0.3,
    eps_step=0.1
)
x_adv_auto = attack_auto.generate(x=x_test)

# Square attack (black-box)
attack_square = SquareAttack(
    estimator=classifier,
    eps=0.3,
    max_iter=5000,
    norm=np.inf
)
x_adv_square = attack_square.generate(x=x_test)
```
3. Model Extraction Attacks
```python
from art.attacks.extraction import CopycatCNN, KnockoffNets

# Copycat CNN - model stealing
copycat = CopycatCNN(
    classifier=victim_classifier,
    batch_size_fit=32,
    batch_size_query=32,
    nb_epochs=10,
    nb_stolen=1000
)

# Create the thief model architecture
thief_model = create_similar_model()
thief_classifier = PyTorchClassifier(model=thief_model, ...)

# Execute extraction
stolen_classifier = copycat.extract(
    x=query_dataset,
    y=None,  # labels are queried from the victim
    thieved_classifier=thief_classifier
)

# Knockoff Nets attack
knockoff = KnockoffNets(
    classifier=victim_classifier,
    batch_size_fit=32,
    batch_size_query=32,
    nb_epochs=10,
    nb_stolen=1000,
    sampling_strategy='random'
)
stolen_classifier = knockoff.extract(
    x=query_dataset,
    thieved_classifier=thief_classifier
)
```
4. Data Poisoning Attacks
```python
from art.attacks.poisoning import (
    PoisoningAttackBackdoor,
    PoisoningAttackCleanLabelBackdoor,
    PoisoningAttackSVM
)
import numpy as np

# Backdoor attack
def add_trigger(x):
    x_triggered = x.copy()
    x_triggered[:, -5:, -5:, :] = 1.0  # white patch trigger (NHWC layout)
    return x_triggered

backdoor_attack = PoisoningAttackBackdoor(add_trigger)

# Poison 10% of the training data: the attack poisons everything it is
# given, so select the subset yourself
n_poison = int(0.1 * len(x_train))
idx = np.random.choice(len(x_train), n_poison, replace=False)
x_poison, y_poison = backdoor_attack.poison(x_train[idx], y=y_train[idx])

# Clean-label backdoor (more stealthy); note it takes the backdoor
# attack object, not the raw trigger function
clean_label_attack = PoisoningAttackCleanLabelBackdoor(
    backdoor=backdoor_attack,
    proxy_classifier=proxy_model,
    target=target_class
)
x_poison_clean, y_poison_clean = clean_label_attack.poison(x_train, y_train)
```
5. Model Inversion Attacks
```python
from art.attacks.inference.model_inversion import MIFace

# Model inversion attack (reconstructs class-representative inputs,
# not literal training samples)
mi_attack = MIFace(
    classifier=classifier,
    max_iter=10000,
    window_length=100,
    threshold=0.99,
    learning_rate=0.1
)

# Attempt to reconstruct training-like samples
reconstructed = mi_attack.infer(
    x=None,  # start from ART's default (zero) initialization
    y=target_label
)
```
6. Membership Inference Attacks
```python
from art.attacks.inference.membership_inference import (
    MembershipInferenceBlackBox,
    MembershipInferenceBlackBoxRuleBased
)

# Black-box membership inference (pass the classifier positionally;
# the parameter name has changed across ART releases)
mi_attack = MembershipInferenceBlackBox(
    classifier,
    attack_model_type='rf'  # random forest attack model
)

# Train the attack model on known members and non-members
mi_attack.fit(
    x_train[:1000], y_train[:1000],  # members
    x_test[:1000], y_test[:1000]     # non-members
)

# Infer membership for held-out records
inferred_train = mi_attack.infer(x_train[1000:2000], y_train[1000:2000])
inferred_test = mi_attack.infer(x_test[1000:2000], y_test[1000:2000])

# Rule-based variant (no attack-model training required)
rule_attack = MembershipInferenceBlackBoxRuleBased(classifier)
```
7. Robustness Evaluation
```python
from art.metrics import empirical_robustness, clever_u, loss_sensitivity

# Empirical robustness (lower means more vulnerable); ART supports
# only a small set of attack names here, so use 'fgsm' rather than 'pgd'
robustness = empirical_robustness(
    classifier=classifier,
    x=x_test,
    attack_name='fgsm',
    attack_params={'eps': 0.3}
)
print(f"Empirical robustness: {robustness}")

# CLEVER score (an estimated lower bound on the minimal perturbation)
clever_score = clever_u(
    classifier=classifier,
    x=x_test[0],  # a single sample, without the batch dimension
    nb_batches=100,
    batch_size=100,
    radius=0.3,
    norm=2
)
print(f"CLEVER score: {clever_score}")
```
8. Defense Implementation
```python
from art.defences.preprocessor import (
    FeatureSqueezing,
    JpegCompression,
    SpatialSmoothing
)
from art.defences.trainer import AdversarialTrainer

# Adversarial training
attack_for_training = ProjectedGradientDescent(
    classifier,
    eps=0.3,
    eps_step=0.05,
    max_iter=10
)
trainer = AdversarialTrainer(classifier, attacks=attack_for_training)
trainer.fit(x_train, y_train, nb_epochs=10)

# Input preprocessing defenses
feature_squeeze = FeatureSqueezing(clip_values=(0, 1), bit_depth=8)
jpeg_compress = JpegCompression(clip_values=(0, 1), quality=75)
spatial_smooth = SpatialSmoothing(clip_values=(0, 1), window_size=3)

# Apply defenses (preprocessors return an (x, y) tuple)
x_defended = feature_squeeze(x_test)[0]
x_defended = jpeg_compress(x_defended)[0]
```
9. Foolbox Integration
```python
import foolbox as fb
import torch

# Wrap the model (in eval mode) with Foolbox
fmodel = fb.PyTorchModel(model.eval(), bounds=(0, 1))

# Run multiple attacks; these are the Foolbox 3 class names
# (DeepFoolAttack / CarliniWagnerL2Attack are Foolbox 2 names)
attacks = [
    fb.attacks.FGSM(),
    fb.attacks.PGD(),
    fb.attacks.L2DeepFoolAttack(),
    fb.attacks.L2CarliniWagnerAttack(),
]

epsilons = [0.01, 0.03, 0.1, 0.3]
for attack in attacks:
    raw, clipped, is_adv = attack(fmodel, images, labels, epsilons=epsilons)
    success_rate = is_adv.float().mean(axis=-1)  # per-epsilon success
    print(f"{attack.__class__.__name__}: {success_rate}")
```
Attack Categories Reference
Evasion Attacks
```yaml
evasion_attacks:
  white_box:
    - FGSM (Fast Gradient Sign Method)
    - PGD (Projected Gradient Descent)
    - C&W (Carlini & Wagner)
    - DeepFool
    - AutoAttack
  black_box:
    - Square Attack
    - HopSkipJump
    - Boundary Attack
    - SimBA
    - Transfer Attacks
  physical_world:
    - Adversarial Patches
    - Adversarial T-shirts
    - 3D Adversarial Objects
```
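Of the black-box attacks listed above, only Square Attack is demonstrated in the capability sections. HopSkipJump is worth a closer look because it needs nothing but the model's hard-label outputs. A minimal sketch using ART's HopSkipJump, assuming the `classifier` and `x_test` from the earlier examples:

```python
from art.attacks.evasion import HopSkipJump

# Decision-based black-box attack: queries only predicted labels,
# no gradients or confidence scores required
attack_hsj = HopSkipJump(
    classifier=classifier,
    targeted=False,
    norm=2,        # L2; np.inf is also supported
    max_iter=50,
    max_eval=10000,
    init_eval=100
)

# HopSkipJump is query-hungry, so start with a small batch
x_adv_hsj = attack_hsj.generate(x=x_test[:10])
```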
Privacy Attacks
```yaml
privacy_attacks:
  membership_inference:
    - Shadow model attacks
    - Label-only attacks
    - Metric-based attacks
  model_inversion:
    - Gradient-based reconstruction
    - GAN-based reconstruction
  attribute_inference:
    - Infer sensitive attributes from model behavior
```
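Attribute inference has no dedicated capability section above. A minimal sketch using ART's AttributeInferenceBlackBox, assuming a tabular `classifier` plus `x_train`/`x_test`; the feature index and candidate values are illustrative placeholders, not part of any real dataset:

```python
from art.attacks.inference.attribute_inference import AttributeInferenceBlackBox
import numpy as np

ATTACK_FEATURE = 2  # hypothetical: column index of the sensitive attribute

# Train the attack model on records where the sensitive feature is known
attack = AttributeInferenceBlackBox(classifier, attack_feature=ATTACK_FEATURE)
attack.fit(x_train)

# Infer the sensitive feature for records where it has been removed
x_no_feature = np.delete(x_test, ATTACK_FEATURE, axis=1)
pred = np.argmax(classifier.predict(x_test), axis=1).reshape(-1, 1)
inferred = attack.infer(x_no_feature, pred=pred, values=[0.0, 1.0])  # candidate values
```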
MCP Server Integration
This skill can leverage the following tools:
| Tool | Description | URL |
|---|---|---|
| Adversarial-Spec | Multi-model security threat modeling | https://github.com/zscole/adversarial-spec |
| ART Framework | IBM Adversarial Robustness Toolbox | https://github.com/Trusted-AI/adversarial-robustness-toolbox |
| Foolbox | Python toolbox for adversarial attacks | https://github.com/bethgelab/foolbox |
Process Integration
This skill integrates with the following processes:
- AI/ML security research workflows (ai-ml-security-research.js)
- ML model supply chain verification (supply-chain-security.js)
Output Format
When executing operations, provide structured output:
{ "attack_type": "evasion", "attack_name": "PGD", "target_model": "ResNet50", "dataset": "ImageNet", "parameters": { "epsilon": 0.03, "eps_step": 0.005, "max_iter": 100 }, "results": { "clean_accuracy": 0.92, "adversarial_accuracy": 0.15, "attack_success_rate": 0.84, "average_perturbation_l2": 1.23, "average_perturbation_linf": 0.03 }, "samples_generated": 1000, "adversarial_examples_path": "./adversarial/pgd_eps0.03/", "recommendations": [ "Consider adversarial training with PGD", "Add input preprocessing defense", "Implement certified defenses for critical applications" ] }
Error Handling
- Validate model compatibility with ART wrappers
- Handle GPU memory limitations gracefully
- Provide fallback to CPU for large-scale evaluations
- Log attack progress for long-running operations
- Save intermediate results for resumable evaluations (see the sketch below)
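A minimal sketch of the checkpointing idea, assuming any ART attack instance (such as `attack_pgd` above) and a NumPy test set; the per-batch file layout is an illustrative choice, not a babysitter convention:

```python
import os
import numpy as np

def generate_resumable(attack, x, out_dir, batch_size=100):
    """Generate adversarial examples in batches, skipping batches
    that already have a saved checkpoint."""
    os.makedirs(out_dir, exist_ok=True)
    batches = []
    for i in range(0, len(x), batch_size):
        path = os.path.join(out_dir, f"batch_{i:06d}.npy")
        if os.path.exists(path):           # resume: batch already done
            batches.append(np.load(path))
            continue
        x_adv = attack.generate(x=x[i:i + batch_size])
        np.save(path, x_adv)               # checkpoint before moving on
        batches.append(x_adv)
        print(f"[{i + len(x_adv)}/{len(x)}] batch saved to {path}")
    return np.concatenate(batches)

# Usage:
# x_adv_all = generate_resumable(attack_pgd, x_test, "./adversarial/pgd_eps0.3/")
```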
Constraints
- Only test models you own or have authorization to test
- Document all findings for responsible disclosure
- Do not use for malicious attacks on production systems
- Respect rate limits when testing ML APIs
- Follow ML fairness and ethics guidelines
- Consider computational costs for large-scale evaluations