Awesome-Agent-Skills-for-Empirical-Research deep-learning-papers-guide

Annotated deep learning paper implementations with code walkthroughs

install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/domains/ai-ml/deep-learning-papers-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-deep-learning-pap && rm -rf "$T"
manifest: skills/43-wentorai-research-plugins/skills/domains/ai-ml/deep-learning-papers-guide/SKILL.md
source content

Deep Learning Papers Guide

Overview

Understanding deep learning architectures requires more than reading papers -- it requires reading and writing code. The annotated_deep_learning_paper_implementations repository (65,800+ stars) provides line-by-line annotated implementations of seminal deep learning papers in PyTorch, making it one of the most valuable learning resources in the field.

This guide organizes the key architectures by category, provides implementation patterns for the most important building blocks, and offers strategies for going from paper to working code. Whether you are implementing a Transformer variant for your research, understanding a GAN architecture for your experiments, or teaching a deep learning course, these patterns accelerate the process.

The focus is on practical understanding: what each component does, why it is designed that way, and how to implement it correctly in PyTorch.

Core Architecture Families

Transformer Architectures

The Transformer (Vaswani et al., 2017) is the foundation of modern NLP and increasingly of computer vision.

Multi-Head Self-Attention

import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Linear projections and reshape to (batch, heads, seq, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        context = torch.matmul(attn, V)

        # Concatenate heads and project
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(context)
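
A quick shape check catches wiring mistakes early. A minimal sketch with arbitrary illustrative sizes (not values from any particular paper):

# Sanity-check the attention module on random inputs
mha = MultiHeadAttention(d_model=512, n_heads=8)
x = torch.randn(2, 10, 512)           # (batch, seq_len, d_model)
out = mha(x, x, x)                    # self-attention: query = key = value
assert out.shape == (2, 10, 512)      # output keeps the input shape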

Transformer Encoder Block

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm variant (used in GPT-2, ViT, and most modern architectures)
        normed = self.norm1(x)
        attn_out = self.attention(normed, normed, normed, mask)
        x = x + self.dropout(attn_out)
        x = x + self.ffn(self.norm2(x))
        return x
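
In practice these blocks are stacked behind token and positional embeddings. A minimal sketch of that composition; the learned positional embeddings and layer count here are illustrative assumptions, not tied to a specific paper:

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512, n_heads: int = 8,
                 d_ff: int = 2048, n_layers: int = 6, max_len: int = 512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)    # learned positions (illustrative choice)
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )
        self.norm = nn.LayerNorm(d_model)                # final norm, standard in pre-norm stacks

    def forward(self, tokens, mask=None):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.token_emb(tokens) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x, mask)
        return self.norm(x)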

Convolutional Neural Networks

ResNet Bottleneck Block

class BottleneckBlock(nn.Module):
    expansion = 4

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        return self.relu(out)
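
When a block changes the spatial resolution or channel count, the identity path needs a matching projection. A minimal sketch of how the downsample argument is typically built (a 1x1 convolution plus batch norm; the sizes below are illustrative):

in_channels, out_channels, stride = 256, 128, 2
downsample = nn.Sequential(
    nn.Conv2d(in_channels, out_channels * BottleneckBlock.expansion, 1,
              stride=stride, bias=False),
    nn.BatchNorm2d(out_channels * BottleneckBlock.expansion),
)
block = BottleneckBlock(in_channels, out_channels, stride=stride, downsample=downsample)
x = torch.randn(1, in_channels, 56, 56)
assert block(x).shape == (1, out_channels * BottleneckBlock.expansion, 28, 28)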

Key Architecture Comparison

Architecture | Year | Parameters      | Key Innovation                 | Primary Domain
ResNet       | 2015 | 25M (ResNet-50) | Skip connections               | Vision
Transformer  | 2017 | Varies          | Self-attention                 | NLP
BERT         | 2018 | 340M (Large)    | Masked language modeling       | NLP
GPT-2        | 2019 | 1.5B            | Autoregressive generation      | NLP
ViT          | 2020 | 86M (Base)      | Patch-based image tokenization | Vision
Diffusion    | 2020 | Varies          | Iterative denoising            | Generation
LLaMA        | 2023 | 7B-70B          | Efficient open LLM             | NLP

Training Patterns

Standard Training Loop

def train_epoch(model, dataloader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for batch_idx, (data, targets) in enumerate(dataloader):
        data, targets = data.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, targets)
        loss.backward()

        # Gradient clipping (crucial for Transformers)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        total_loss += loss.item()

    return total_loss / len(dataloader)
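
The matching evaluation pass switches off dropout and gradient tracking. A minimal companion sketch, assuming a classification setup:

def evaluate(model, dataloader, criterion, device):
    model.eval()                       # disable dropout, use BatchNorm running stats
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():              # no gradients needed for evaluation
        for data, targets in dataloader:
            data, targets = data.to(device), targets.to(device)
            outputs = model(data)
            total_loss += criterion(outputs, targets).item()
            correct += (outputs.argmax(dim=-1) == targets).sum().item()
            total += targets.size(0)
    return total_loss / len(dataloader), correct / total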

Learning Rate Scheduling

# Cosine annealing with warmup (standard for Transformers)
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=1000)
cosine = CosineAnnealingLR(optimizer, T_max=50000)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[1000])
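
Because the warmup and cosine phases here are defined in optimizer steps rather than epochs, the scheduler is advanced once per update. A minimal sketch of how it slots into the training loop above:

# Step-based scheduling: advance the scheduler after every optimizer step
for data, targets in dataloader:
    data, targets = data.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = criterion(model(data), targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                   # 1000 warmup steps, then cosine decay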

From Paper to Code: A Methodology

  1. Read the paper twice. First pass for high-level understanding; second pass for implementation details.
  2. Identify the core algorithm. Usually in Section 3 or 4 of the paper.
  3. List all hyperparameters. Create a config dict before writing any code (see the sketch after this list).
  4. Implement bottom-up. Start with the smallest building blocks (attention, normalization), then compose.
  5. Test each component in isolation. Verify tensor shapes and gradients at each level.
  6. Reproduce a known result first. Match the paper's numbers on a small dataset before scaling.
  7. Use the official implementation as reference. But write your own code for understanding.
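
A minimal sketch of steps 3 and 5: hyperparameters gathered in one config before any model code, and a single component tested in isolation for shapes and gradient flow (the values are illustrative):

# Step 3: collect every hyperparameter in one place
config = dict(d_model=512, n_heads=8, d_ff=2048, n_layers=6,
              dropout=0.1, lr=3e-4, weight_decay=0.01, warmup_steps=1000)

# Step 5: test one building block in isolation
block = TransformerBlock(config["d_model"], config["n_heads"],
                         config["d_ff"], config["dropout"])
x = torch.randn(2, 16, config["d_model"], requires_grad=True)
out = block(x)
assert out.shape == x.shape            # shape preserved through the block
out.sum().backward()
assert x.grad is not None              # gradients reach the input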

Best Practices

  • Always verify tensor shapes. Add assert statements for dimensions during development.
  • Use mixed precision training. torch.cuda.amp can provide roughly a 2x speedup with minimal accuracy loss (see the sketch after this list).
  • Log everything. Use Weights & Biases or TensorBoard for experiment tracking.
  • Start small. Debug on a tiny dataset before running on the full one.
  • Read the appendix. Critical details (learning rates, initialization, data augmentation) are often in the supplementary material.
  • Join the community. Papers With Code, Reddit r/MachineLearning, and Twitter/X are where implementation details are discussed.
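
A minimal sketch of the mixed-precision bullet above, using the torch.cuda.amp API it mentions (assumes a CUDA device and the same loop structure as the earlier training example):

scaler = torch.cuda.amp.GradScaler()
for data, targets in dataloader:
    data, targets = data.to(device), targets.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # forward pass in float16 where safe
        loss = criterion(model(data), targets)
    scaler.scale(loss).backward()                 # scale loss to avoid fp16 underflow
    scaler.unscale_(optimizer)                    # unscale before gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()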

References