install
source · Clone the upstream repo
git clone https://github.com/Aradotso/trending-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/thereisnospoon-ml-primer" ~/.claude/skills/aradotso-trending-skills-thereisnospoon-ml-primer && rm -rf "$T"
manifest:
skills/thereisnospoon-ml-primer/SKILL.md
---
name: thereisnospoon-ml-primer
description: A machine learning primer built from first principles for engineers, covering fundamentals through transformers using engineering analogies and visualizations.
triggers:
  - "explain machine learning concepts from first principles"
  - "help me understand neural networks as an engineer"
  - "walk me through the transformer architecture"
  - "regenerate the ML primer figures"
  - "explain backpropagation with analogies"
  - "help me understand when to use convolution vs attention"
  - "explain gradient flow and training problems"
  - "match architecture to my ML problem"
---

# There Is No Spoon: ML Primer Skill

> Skill by [ara.so](https://ara.so) (Daily 2026 Skills collection).

## What This Project Is

`thereisnospoon` is a machine learning primer built from first principles, written for software engineers who already have strong system-design intuition but lack the equivalent gut feel for ML. It uses physical and engineering analogies as the **primary** explanation vehicle, with math as supporting detail.

- **Neurons** → polarizing filters
- **Depth** → paper folding
- **Gradient flow** → pipeline valves
- **Chain rule** → gear train
- **Projections** → shadows

The repo is a single comprehensive markdown document (`ml-primer.md`) plus Python scripts that generate inline figures.

---

## Installation / Setup

This is a reading/reference project, not an installable library. Clone it and render the markdown locally or on GitHub.

```bash
git clone https://github.com/dreddnafious/thereisnospoon.git
cd thereisnospoon
```
### Generate all figures

Requires only `matplotlib` and `numpy`:

```bash
pip install matplotlib numpy
```

Then run each script individually:

```bash
python3 scripts/01_neuron_hyperplane.py
python3 scripts/02_activation_functions.py
python3 scripts/03_paper_folding.py
python3 scripts/04_derivatives.py
python3 scripts/05_chain_rule.py
python3 scripts/06_attention.py
python3 scripts/07_ffn_volumetric.py
python3 scripts/08_residual_connections.py
python3 scripts/09_dot_products.py
python3 scripts/10_loss_landscapes.py
python3 scripts/11_combination_rules.py
python3 scripts/12_gating_operations.py
```

Or regenerate all at once:

```bash
for f in scripts/*.py; do python3 "$f"; done
```

Figures are written to `figures/`.
## Project Structure

```
thereisnospoon/
├── ml-primer.md          # The full primer (primary content)
├── SYLLABUS.md           # Full topic map / table of contents
├── figures/              # SVG/PNG visualizations (auto-generated)
│   ├── logo.svg
│   ├── 01_neuron_hyperplane.*
│   └── ...
└── scripts/              # Python figure-generation scripts
    ├── 01_neuron_hyperplane.py
    ├── 02_activation_functions.py
    └── ...
```
## Key Concepts & Navigation

### Part 1: Fundamentals

| Section | Core Analogy | Key Insight |
|---|---|---|
| The Neuron | Polarizing filter | Dot product as directional agreement |
| Composition | Paper folding | Depth = exponential crease capacity |
| Learning | Pipeline valves | Gradient flow through the network |
| Generalization | Occam's razor | Why overparameterized nets generalize |
| Representation | Shadows/directions | Superposition in feature space |

### Part 2: Architectures

| Section | Core Analogy | When to Reach For It |
|---|---|---|
| Convolution | Sliding template | Spatial/local structure, translation invariance |
| Attention | Weighted spotlight | Long-range dependencies, variable-length sequences |
| Recurrence | State machine | Sequential state with bounded compute |
| Graph ops | Message passing | Relational / graph-structured data |
| SSMs | Continuous dynamics | Long sequences, efficient inference |
| Transformer | Full assembly | General-purpose sequence modeling |
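
To make the "sliding template" row concrete, here is a minimal 1-D sketch using NumPy only. The template and signal values are made up for illustration. Strictly speaking, what it demonstrates is translation equivariance (the response moves with the input); invariance usually comes from pooling on top. It also shows the weight sharing that makes convolution cheap compared with a dense layer of the same output size.

```python
import numpy as np

# A small template (kernel) and a signal that contains it once
template = np.array([1.0, 2.0, 1.0])
signal = np.zeros(12)
signal[3:6] = template

# Slide the template across the signal; each output is a local dot product.
# np.correlate with mode="valid" does not flip the kernel, which matches
# what deep learning libraries usually call "convolution".
response = np.correlate(signal, template, mode="valid")
peak = int(np.argmax(response))

# Shift the input by 4 positions: the response shifts by the same amount
shifted_signal = np.roll(signal, 4)
shifted_response = np.correlate(shifted_signal, template, mode="valid")
shifted_peak = int(np.argmax(shifted_response))

print(f"Peak response at index {peak}, after shift at index {shifted_peak}")

# The same 3 weights are reused at every position
n_conv_params = template.size
n_dense_params = signal.size * response.size  # dense layer with the same output size
print(f"Conv parameters: {n_conv_params}, dense parameters: {n_dense_params}")
```

The parameter count is the design argument in miniature: when the pattern can appear anywhere, reusing one template beats learning a separate detector per position.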

### Part 3: Gates as Control Systems

This part covers gate primitives (scalar, vector, matrix), soft logic composition, branching, routing, and recursion within a forward pass.
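
A minimal sketch of what soft logic, a vector gate, and soft routing might look like in NumPy. The function names are hypothetical, and the product-style AND/OR used here is one common choice of soft logic, assumed for illustration; the primer may define these operations differently.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(logits):
    shifted = logits - np.max(logits)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# Soft logic: gate values in [0, 1] composed like Boolean operators
def soft_and(a, b):
    return a * b


def soft_or(a, b):
    return a + b - a * b


def soft_not(a):
    return 1.0 - a


# Vector gate: one gate value per feature dimension
def vector_gate(value, gate_logits):
    return sigmoid(gate_logits) * value


# Soft routing: blend the outputs of several branches by softmax weights
def soft_route(x, branches, route_logits):
    weights = softmax(route_logits)
    outputs = np.stack([branch(x) for branch in branches])
    return weights @ outputs


x = np.array([1.0, -2.0, 3.0])
gated = vector_gate(x, np.array([10.0, 0.0, -10.0]))  # ~pass, half, ~block
routed = soft_route(x, [np.abs, np.negative], np.array([5.0, -5.0]))

print(f"soft_and(0.9, 0.8) = {soft_and(0.9, 0.8):.2f}")
print(f"soft_or(0.9, 0.8)  = {soft_or(0.9, 0.8):.2f}")
print(f"vector gate: {gated.round(3)}")
print(f"soft route (mostly abs branch): {routed.round(3)}")
```

Because every operation is differentiable, the "branch" taken is a blend rather than a hard choice, which is what lets gradients train the routing logits.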

## Code Examples

### Neuron from scratch (the primer's core primitive)

```python
import numpy as np


def neuron(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """
    Single neuron: dot product + bias + nonlinearity.
    Conceptually: how much does input x align with direction w?
    """
    pre_activation = np.dot(w, x) + b      # directional agreement
    return np.maximum(0, pre_activation)   # ReLU nonlinearity


# Example: 3-dimensional input
x = np.array([0.5, -0.3, 0.8])
w = np.array([1.0, 0.0, 0.5])  # "cares about" dims 0 and 2
b = -0.2

output = neuron(x, w, b)
print(f"Neuron output: {output:.4f}")
```
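
The same neuron can be read geometrically: `w·x + b = 0` defines a hyperplane, and ReLU passes only inputs on the side that `w` points toward. The sketch below uses a hypothetical helper, `signed_distance`, and hand-picked 2-D points so the arithmetic is easy to follow.

```python
import numpy as np


def signed_distance(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """Signed distance from point x to the hyperplane w.x + b = 0.

    Positive: x lies on the side w points toward (the neuron fires).
    Negative: x lies on the other side (ReLU outputs 0).
    """
    return float((np.dot(w, x) + b) / np.linalg.norm(w))


w = np.array([3.0, 4.0])  # normal vector of the boundary, length 5
b = -5.0

on_boundary = np.array([3.0, -1.0])  # 3*3 + 4*(-1) - 5 = 0
firing_side = np.array([3.0, 4.0])   # 9 + 16 - 5 = 20
silent_side = np.array([0.0, 0.0])   # 0 + 0 - 5 = -5

for name, point in [("on boundary", on_boundary),
                    ("firing side", firing_side),
                    ("silent side", silent_side)]:
    d = signed_distance(point, w, b)
    relu_out = max(0.0, float(np.dot(w, point) + b))
    print(f"{name:12s} distance={d:+.2f} relu={relu_out:.2f}")
```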

### Dense layer (width and composition)

```python
import numpy as np


def dense_layer(X: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """
    X: (batch, in_features)
    W: (in_features, out_features)
    b: (out_features,)
    Returns: (batch, out_features) after ReLU
    """
    return np.maximum(0, X @ W + b)


# Two-layer MLP: paper folding twice
np.random.seed(42)
X = np.random.randn(8, 4)  # 8 examples, 4 features

W1 = np.random.randn(4, 16) * 0.1
b1 = np.zeros(16)
W2 = np.random.randn(16, 2) * 0.1
b2 = np.zeros(2)

hidden = dense_layer(X, W1, b1)       # fold once
output = dense_layer(hidden, W2, b2)  # fold again
print(f"Output shape: {output.shape}")  # (8, 2)
```
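
The ReLU in `dense_layer` is what makes stacking worthwhile. As a quick check under made-up random weights, the sketch below shows that two layers with no activation between them collapse into a single linear layer, and that adding ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
W1, b1 = rng.standard_normal((4, 16)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((16, 2)), rng.standard_normal(2)

# Two linear layers with no activation in between...
two_linear = (X @ W1 + b1) @ W2 + b2

# ...equal one linear layer with merged weights and bias.
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
one_linear = X @ W_merged + b_merged
collapses = bool(np.allclose(two_linear, one_linear))

# With ReLU between the layers, the merged layer no longer matches.
with_relu = np.maximum(0, X @ W1 + b1) @ W2 + b2
relu_differs = not np.allclose(with_relu, one_linear)

print(f"Linear stack collapses to one layer: {collapses}")
print(f"ReLU stack differs from that layer:  {relu_differs}")
```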

### Scaled dot-product attention (the transformer's core op)

```python
from typing import Optional

import numpy as np


def scaled_dot_product_attention(
    Q: np.ndarray,
    K: np.ndarray,
    V: np.ndarray,
    mask: Optional[np.ndarray] = None,
) -> tuple[np.ndarray, np.ndarray]:
    """
    Q: (seq, d_k) - queries: what am I looking for?
    K: (seq, d_k) - keys: what do I offer?
    V: (seq, d_v) - values: what do I actually contain?
    mask: (seq, seq) boolean, True where attention is allowed

    Analogy: attention scores = spotlight intensity
             softmax = normalized routing weights
             output = weighted sum of values
    """
    d_k = Q.shape[-1]

    # Alignment scores (how much each query matches each key)
    scores = Q @ K.T / np.sqrt(d_k)

    # Optional mask (e.g. causal, for autoregressive decoding)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)

    # Softmax: turn scores into a probability distribution
    scores_exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn_weights = scores_exp / scores_exp.sum(axis=-1, keepdims=True)

    # Weighted aggregation of values
    output = attn_weights @ V
    return output, attn_weights


# Example: 4-token sequence, d_k=8, d_v=8
seq_len, d_k, d_v = 4, 8, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_v)

# Causal mask: position i can only attend to positions <= i
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

out, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(f"Attention output shape: {out.shape}")       # (4, 8)
print(f"Attention weights shape: {weights.shape}")  # (4, 4)
```
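
One way to see why the scores are divided by `sqrt(d_k)`: for queries and keys with unit-variance entries (an assumption made here for illustration), the variance of a raw dot product grows in proportion to `d_k`, which would push softmax toward near one-hot outputs. The sketch below measures that empirically; the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n_pairs = 10_000

# Random queries and keys with unit-variance entries (illustrative assumption)
Q = rng.standard_normal((n_pairs, d_k))
K = rng.standard_normal((n_pairs, d_k))

raw_scores = np.sum(Q * K, axis=1)       # one dot product per (q, k) pair
scaled_scores = raw_scores / np.sqrt(d_k)

raw_var = raw_scores.var()
scaled_var = scaled_scores.var()
print(f"Variance of raw scores:    {raw_var:.2f} (grows like d_k = {d_k})")
print(f"Variance of scaled scores: {scaled_var:.2f} (stays near 1)")
```

Keeping the score variance near 1 regardless of `d_k` keeps softmax in a range where its gradients are still useful.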

### Numerical gradient check (backprop intuition)

```python
import numpy as np


def numerical_gradient(f, x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """
    Approximate gradient using finite differences.
    Useful for verifying analytic gradients.

    Analogy: tilt-and-measure. How does output change per unit nudge?
    """
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus = x.copy()
        x_plus.flat[i] += eps
        x_minus = x.copy()
        x_minus.flat[i] -= eps
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad


# Test: gradient of sum-of-squares loss
def loss(w):
    return np.sum(w ** 2)


w = np.array([1.0, 2.0, -0.5])
grad_numerical = numerical_gradient(loss, w)
grad_analytic = 2 * w  # d/dw sum(w^2) = 2w

print(f"Numerical gradient: {grad_numerical}")
print(f"Analytic gradient:  {grad_analytic}")
print(f"Max error: {np.max(np.abs(grad_numerical - grad_analytic)):.2e}")
```
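
The same finite-difference idea can verify a hand-written backward pass. The sketch below is a made-up two-layer network (tanh hidden layer, squared-error loss), not code from the primer; it multiplies local derivatives back from the loss, one "gear" at a time, and compares the result for `W1` against central differences.

```python
import numpy as np


def forward(W1, w2, x, target):
    """Tiny network: x -> W1 @ x -> tanh -> w2 . h -> squared error."""
    h = np.tanh(W1 @ x)
    y = w2 @ h
    loss = 0.5 * (y - target) ** 2
    return loss, (h, y)


def backward(W1, w2, x, target):
    """Analytic gradients: multiply local derivatives from the loss back to the weights."""
    _, (h, y) = forward(W1, w2, x, target)
    dL_dy = y - target               # gear 1: loss wrt output
    dL_dw2 = dL_dy * h               # output wrt w2
    dL_dh = dL_dy * w2               # gear 2: output wrt hidden activations
    dL_dz1 = dL_dh * (1.0 - h ** 2)  # gear 3: tanh derivative
    dL_dW1 = np.outer(dL_dz1, x)     # gear 4: pre-activation wrt W1
    return dL_dW1, dL_dw2


rng = np.random.default_rng(0)
x = rng.standard_normal(3)
W1 = rng.standard_normal((4, 3))
w2 = rng.standard_normal(4)
target = 0.5

dW1, dw2 = backward(W1, w2, x, target)

# Finite-difference check on every entry of W1
eps = 1e-6
dW1_numeric = np.zeros_like(W1)
for idx in np.ndindex(*W1.shape):
    W_plus, W_minus = W1.copy(), W1.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    loss_plus, _ = forward(W_plus, w2, x, target)
    loss_minus, _ = forward(W_minus, w2, x, target)
    dW1_numeric[idx] = (loss_plus - loss_minus) / (2 * eps)

max_err = np.max(np.abs(dW1 - dW1_numeric))
print(f"Max |analytic - numeric| for W1: {max_err:.2e}")
```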

### Residual connection (the transformer's training enabler)

```python
import numpy as np


def residual_block(x: np.ndarray, sublayer_fn, *args) -> np.ndarray:
    """
    x + sublayer(x): the skip connection guarantees an identity path.
    Analogy: bypass valve. Gradient can always flow through unchanged.
    Critical for training deep networks (mitigates vanishing gradients).
    """
    return x + sublayer_fn(x, *args)


# Simulate: residual block around a stand-in sublayer
def mock_attention(x, W):
    """Stand-in for a real attention sublayer: a small, bounded update."""
    return np.tanh(x @ W) * 0.1


x = np.random.randn(4, 8)
W = np.random.randn(8, 8) * 0.1

out = residual_block(x, mock_attention, W)
print(f"Input norm:  {np.linalg.norm(x):.4f}")
print(f"Output norm: {np.linalg.norm(out):.4f}")
# Output is close to input: the residual preserves signal
```
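
The "bypass valve" claim can be checked numerically. The sketch below builds a made-up stack of 30 tanh layers with small random weights and computes the Jacobian of the final output with respect to the input, with and without skip connections. The layer sizes and weight scale are arbitrary choices for illustration.

```python
import numpy as np


def end_to_end_jacobian(x, weights, use_residual):
    """Jacobian of the final output wrt the input for a stack of tanh layers.

    Each layer computes tanh(W @ x), optionally added to its input.
    """
    dim = x.size
    J = np.eye(dim)
    for W in weights:
        h = np.tanh(W @ x)
        layer_jac = (1.0 - h ** 2)[:, None] * W  # d tanh(Wx) / dx
        if use_residual:
            layer_jac = np.eye(dim) + layer_jac  # identity path from the skip
            x = x + h
        else:
            x = h
        J = layer_jac @ J                        # chain rule across layers
    return J


rng = np.random.default_rng(0)
dim, depth = 16, 30
weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(depth)]
x0 = rng.standard_normal(dim)

plain_norm = np.linalg.norm(end_to_end_jacobian(x0, weights, use_residual=False))
resid_norm = np.linalg.norm(end_to_end_jacobian(x0, weights, use_residual=True))
print(f"Plain stack    |dOut/dIn| = {plain_norm:.2e}")
print(f"Residual stack |dOut/dIn| = {resid_norm:.2e}")
```

In the plain stack the product of small per-layer Jacobians shrinks toward zero; with the skip connection each factor contains an identity term, so the end-to-end gradient stays at a usable scale.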

### Scalar gate (Part 3 primitive)

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def scalar_gate(value: np.ndarray, gate_logit: float) -> np.ndarray:
    """
    g = sigmoid(logit) ∈ (0, 1)
    output = g * value

    Analogy: dimmer switch. How much of this value passes through?
    Used in: LSTMs, GRUs, mixture-of-experts routing
    """
    g = sigmoid(gate_logit)
    return g * value


# Interpolation gate (GRU-style update; an LSTM uses separate forget and input gates)
def interpolate_gate(
    new_val: np.ndarray,
    old_val: np.ndarray,
    gate_logit: float,
) -> np.ndarray:
    """How much to update vs. retain state."""
    g = sigmoid(gate_logit)
    return g * new_val + (1 - g) * old_val


state = np.array([0.8, -0.3, 0.5])
new_info = np.array([0.1, 0.9, 0.2])

# gate_logit=2.0 → mostly update; gate_logit=-2.0 → mostly retain
updated = interpolate_gate(new_info, state, gate_logit=2.0)
retained = interpolate_gate(new_info, state, gate_logit=-2.0)
print(f"Mostly update: {updated.round(3)}")
print(f"Mostly retain: {retained.round(3)}")
```

## Regenerating a Single Figure

Each script in `scripts/` is self-contained. To modify and regenerate figure 06 (attention):

```bash
# Edit the script
$EDITOR scripts/06_attention.py

# Regenerate
python3 scripts/06_attention.py
# Output written to figures/06_attention.*
```

## Common Patterns & Design Decisions

### Matching architecture to problem (from Topology section)

- Grid / spatial data (images) → Convolution
- Variable-length sequences → Transformer (attention)
- Sequential state, bounded compute → RNN / SSM
- Relational / graph structure → GNN (message passing)
- Tabular, low-dim, no structure → MLP
- Everything else at scale → Transformer
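
If the mapping above needs to live in code (for example in a project template or a checklist tool), a plain lookup is enough. This is a hypothetical sketch: the dictionary keys and the `suggest_architecture` name are illustrative labels, not part of the primer.

```python
# Hypothetical lookup mirroring the mapping above; keys are illustrative labels.
ARCHITECTURE_BY_STRUCTURE = {
    "grid": "Convolution",
    "variable_length_sequence": "Transformer (attention)",
    "sequential_state_bounded_compute": "RNN / SSM",
    "graph": "GNN (message passing)",
    "tabular": "MLP",
}


def suggest_architecture(structure: str) -> str:
    """Return the default architecture for a data-structure label.

    Unknown labels fall through to the mapping's catch-all entry.
    """
    return ARCHITECTURE_BY_STRUCTURE.get(structure, "Transformer")


print(suggest_architecture("graph"))
print(suggest_architecture("something_else"))
```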

### Choosing depth vs width

- More depth → more folds in representation space (exponential capacity)
  - better for hierarchical features
  - harder to train (use residuals + normalization)
- More width → more directions per layer (linear capacity)
  - better for parallel feature detection at the same level
  - easier to train, but returns diminish faster
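
The "exponential vs linear capacity" contrast can be checked in one dimension. A single hidden layer of n ReLU units on a 1-D input has at most n + 1 linear pieces, since each unit contributes at most one breakpoint. The sketch below instead composes a two-unit "tent" fold with itself and counts the pieces; the `tent` construction and the counting helper are illustrative choices, not code from the primer.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def tent(x):
    """One fold of [0, 1] built from two ReLU units: rises to 1 at x=0.5, then falls to 0."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def count_linear_pieces(y: np.ndarray) -> int:
    """Count linear pieces of a sampled 1-D function by counting slope changes."""
    slopes = np.diff(y)
    changes = np.sum(~np.isclose(slopes[1:], slopes[:-1]))
    return int(changes) + 1


# Dyadic grid so every breakpoint of the folded function lands on a sample
x = np.linspace(0.0, 1.0, 2 ** 12 + 1)

for depth in range(1, 6):
    y = x.copy()
    for _ in range(depth):
        y = tent(y)  # compose: fold the already-folded paper again
    print(f"depth={depth}: {2 * depth} ReLU units -> {count_linear_pieces(y)} pieces")
```

The piece count doubles with each added layer while the unit count grows by two, which is the paper-folding argument in numbers.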

### Loss curve diagnostics (from Appendix)

| Symptom | Likely Cause | Fix |
|---|---|---|
| Loss not decreasing | LR too low, dead ReLUs, bad init | Raise LR, check activations |
| Loss exploding | LR too high, no gradient clipping | Lower LR, add clipping |
| Train ↓ / Val ↑ (overfitting) | Too much capacity, too little data | Dropout, weight decay, more data |
| Train stuck high | Underfitting | More capacity, more epochs, lower LR |
| Loss oscillates | LR too high | LR schedule, lower base LR |
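
For the "add clipping" fix, one widely used form is clipping by global norm: all gradients are rescaled together so their combined L2 norm does not exceed a threshold. The sketch below is a minimal NumPy version; the function name, the threshold, and the example gradients are assumptions for illustration.

```python
import numpy as np


def clip_by_global_norm(grads: list[np.ndarray], max_norm: float) -> list[np.ndarray]:
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= max_norm:
        return grads
    scale = max_norm / global_norm
    return [g * scale for g in grads]


# Hypothetical gradients for two parameter tensors, one of them exploding
grads = [np.array([3.0, 4.0]), np.array([[120.0, -50.0]])]
clipped = clip_by_global_norm(grads, max_norm=1.0)

norm_before = np.sqrt(sum(np.sum(g ** 2) for g in grads))
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(f"Global norm before: {norm_before:.2f}, after: {norm_after:.2f}")
```

Scaling every tensor by the same factor preserves the direction of the update and only limits its size, which is why this is usually preferred over clipping each value independently.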

## Interactive Use with an AI Agent

Feed the primer to any AI coding assistant for conversational exploration:

> Read ml-primer.md. I'm an engineer learning ML fundamentals.
> Walk me through the section on [topic]. I want to understand it well enough
> to reason about design decisions, not just recite definitions.
> Push back if I get something wrong.

Effective question patterns:
- "Why does X work? What would break if we removed it?"
- "How do X and Y differ in terms of inductive bias?"
- "Give me a concrete example where I'd choose X over Y."
- "What's the failure mode of this approach?"

## Contributing

PRs welcome. Keep the tone:
- Direct, concrete — no hedging
- Analogies over notation — analogy is the primary explanation
- When-to-use over how-it-works — design decision focus

```bash
# Fork, clone, branch
git checkout -b improve/section-name

# Make changes to ml-primer.md or scripts/
# Regenerate affected figures if scripts changed
python3 scripts/XX_affected_figure.py

# Stage the changes, then commit and push
git add ml-primer.md scripts/ figures/
git commit -m "improve: clearer analogy for [concept]"
git push origin improve/section-name
# Open PR
```

## License

MIT. See `LICENSE`.