Awesome-Agent-Skills-for-Empirical-Research deep-learning-papers-guide

Annotated deep learning paper implementations with code walkthroughs

install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/domains/ai-ml/deep-learning-papers-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-deep-learning-pap && rm -rf "$T"
manifest: skills/43-wentorai-research-plugins/skills/domains/ai-ml/deep-learning-papers-guide/SKILL.md
source content

Deep Learning Papers Guide

Overview

Understanding deep learning architectures requires more than reading papers -- it requires reading and writing code. The annotated_deep_learning_paper_implementations repository (65,800+ stars) provides line-by-line annotated implementations of seminal deep learning papers in PyTorch, making it one of the most valuable learning resources in the field.

This guide organizes the key architectures by category, provides implementation patterns for the most important building blocks, and offers strategies for going from paper to working code. Whether you are implementing a Transformer variant for your research, understanding a GAN architecture for your experiments, or teaching a deep learning course, these patterns accelerate the process.

The focus is on practical understanding: what each component does, why it is designed that way, and how to implement it correctly in PyTorch.

Core Architecture Families

Transformer Architectures

The Transformer (Vaswani et al., 2017) is the foundation of modern NLP and increasingly of computer vision.

Multi-Head Self-Attention

import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Linear projections and reshape to (batch, heads, seq, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        context = torch.matmul(attn, V)

        # Concatenate heads and project
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(context)
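
A quick shape check catches wiring mistakes early. A minimal sketch with arbitrary illustrative sizes (not values from any particular paper):

# Sanity-check the attention module on random inputs
mha = MultiHeadAttention(d_model=512, n_heads=8)
x = torch.randn(2, 10, 512)           # (batch, seq_len, d_model)
out = mha(x, x, x)                    # self-attention: query = key = value
assert out.shape == (2, 10, 512)      # output keeps the input shape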

Transformer Encoder Block

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm variant (used in GPT-2, ViT, and most modern architectures)
        normed = self.norm1(x)
        attn_out = self.attention(normed, normed, normed, mask)
        x = x + self.dropout(attn_out)
        x = x + self.ffn(self.norm2(x))
        return x
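
In practice these blocks are stacked behind token and positional embeddings. A minimal sketch of that composition; the learned positional embeddings and layer count here are illustrative assumptions, not tied to a specific paper:

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512, n_heads: int = 8,
                 d_ff: int = 2048, n_layers: int = 6, max_len: int = 512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)    # learned positions (illustrative choice)
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )
        self.norm = nn.LayerNorm(d_model)                # final norm, standard in pre-norm stacks

    def forward(self, tokens, mask=None):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.token_emb(tokens) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x, mask)
        return self.norm(x)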

Convolutional Neural Networks

ResNet Bottleneck Block

class BottleneckBlock(nn.Module):
    expansion = 4

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        return self.relu(out)
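
When a block changes the spatial resolution or channel count, the identity path needs a matching projection. A minimal sketch of how the downsample argument is typically built (a 1x1 convolution plus batch norm; the sizes below are illustrative):

in_channels, out_channels, stride = 256, 128, 2
downsample = nn.Sequential(
    nn.Conv2d(in_channels, out_channels * BottleneckBlock.expansion, 1,
              stride=stride, bias=False),
    nn.BatchNorm2d(out_channels * BottleneckBlock.expansion),
)
block = BottleneckBlock(in_channels, out_channels, stride=stride, downsample=downsample)
x = torch.randn(1, in_channels, 56, 56)
assert block(x).shape == (1, out_channels * BottleneckBlock.expansion, 28, 28)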

Key Architecture Comparison

Architecture | Year | Parameters      | Key Innovation                 | Primary Domain
ResNet       | 2015 | 25M (ResNet-50) | Skip connections               | Vision
Transformer  | 2017 | Varies          | Self-attention                 | NLP
BERT         | 2018 | 340M (Large)    | Masked language modeling       | NLP
GPT-2        | 2019 | 1.5B            | Autoregressive generation      | NLP
ViT          | 2020 | 86M (Base)      | Patch-based image tokenization | Vision
Diffusion    | 2020 | Varies          | Iterative denoising            | Generation
LLaMA        | 2023 | 7B-70B          | Efficient open LLM             | NLP

Training Patterns

Standard Training Loop

def train_epoch(model, dataloader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for batch_idx, (data, targets) in enumerate(dataloader):
        data, targets = data.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, targets)
        loss.backward()

        # Gradient clipping (crucial for Transformers)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        total_loss += loss.item()

    return total_loss / len(dataloader)
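
The matching evaluation pass switches off dropout and gradient tracking. A minimal companion sketch, assuming a classification setup:

def evaluate(model, dataloader, criterion, device):
    model.eval()                       # disable dropout, use BatchNorm running stats
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():              # no gradients needed for evaluation
        for data, targets in dataloader:
            data, targets = data.to(device), targets.to(device)
            outputs = model(data)
            total_loss += criterion(outputs, targets).item()
            correct += (outputs.argmax(dim=-1) == targets).sum().item()
            total += targets.size(0)
    return total_loss / len(dataloader), correct / total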

Learning Rate Scheduling

# Cosine annealing with warmup (standard for Transformers)
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=1000)
cosine = CosineAnnealingLR(optimizer, T_max=50000)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[1000])
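
Because the warmup and cosine phases here are defined in optimizer steps rather than epochs, the scheduler is advanced once per update. A minimal sketch of how it slots into the training loop above:

# Step-based scheduling: advance the scheduler after every optimizer step
for data, targets in dataloader:
    data, targets = data.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = criterion(model(data), targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                   # 1000 warmup steps, then cosine decay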

From Paper to Code: A Methodology

  1. Read the paper twice. First pass for high-level understanding; second pass for implementation details.
  2. Identify the core algorithm. Usually in Section 3 or 4 of the paper.
  3. List all hyperparameters. Create a config dict before writing any code (see the sketch after this list).
  4. Implement bottom-up. Start with the smallest building blocks (attention, normalization), then compose.
  5. Test each component in isolation. Verify tensor shapes and gradients at each level.
  6. Reproduce a known result first. Match the paper's numbers on a small dataset before scaling.
  7. Use the official implementation as reference. But write your own code for understanding.
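
A minimal sketch of steps 3 and 5: hyperparameters gathered in one config before any model code, and a single component tested in isolation for shapes and gradient flow (the values are illustrative):

# Step 3: collect every hyperparameter in one place
config = dict(d_model=512, n_heads=8, d_ff=2048, n_layers=6,
              dropout=0.1, lr=3e-4, weight_decay=0.01, warmup_steps=1000)

# Step 5: test one building block in isolation
block = TransformerBlock(config["d_model"], config["n_heads"],
                         config["d_ff"], config["dropout"])
x = torch.randn(2, 16, config["d_model"], requires_grad=True)
out = block(x)
assert out.shape == x.shape            # shape preserved through the block
out.sum().backward()
assert x.grad is not None              # gradients reach the input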

Best Practices

  • Always verify tensor shapes. Add assert statements for dimensions during development.
  • Use mixed precision training. torch.cuda.amp can provide roughly a 2x speedup with minimal accuracy loss (see the sketch after this list).
  • Log everything. Use Weights & Biases or TensorBoard for experiment tracking.
  • Start small. Debug on a tiny dataset before running on the full one.
  • Read the appendix. Critical details (learning rates, initialization, data augmentation) are often in the supplementary material.
  • Join the community. Papers With Code, Reddit r/MachineLearning, and Twitter/X are where implementation details are discussed.
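
A minimal sketch of the mixed-precision bullet above, using the torch.cuda.amp API it mentions (assumes a CUDA device and the same loop structure as the earlier training example):

scaler = torch.cuda.amp.GradScaler()
for data, targets in dataloader:
    data, targets = data.to(device), targets.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # forward pass in float16 where safe
        loss = criterion(model(data), targets)
    scaler.scale(loss).backward()                 # scale loss to avoid fp16 underflow
    scaler.unscale_(optimizer)                    # unscale before gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()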

References