# Julia GPU Kernels Skill

KernelAbstractions.jl: backend-agnostic GPU kernel programming for Julia.

## Install

From source, clone the upstream repo:

```sh
git clone https://github.com/plurigrid/asi
```

For Claude Code, install into `~/.claude/skills/`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/plurigrid/asi "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/julia-gpu-kernels" ~/.claude/skills/plurigrid-asi-julia-gpu-kernels && rm -rf "$T"
```

Manifest: `skills/julia-gpu-kernels/SKILL.md`
## Core Kernel Macros

### `@kernel` - Define a kernel function

```julia
using KernelAbstractions

@kernel function vecadd!(A, @Const(B))
    I = @index(Global, Linear)
    @inbounds A[I] += B[I]
end

# Launch on a backend
kernel = vecadd!(backend)
kernel(A, B, ndrange=size(A))
KernelAbstractions.synchronize(backend)
```
### `@Const` - Read-only array annotation

Marks an array as read-only and non-aliasing, enabling compiler optimizations:

```julia
@kernel function copy_kernel(out, @Const(input))
    i = @index(Global, Linear)
    out[i] = input[i]
end
```
### `@index` - Query work item indices

```julia
@index(Global, Linear)    # Flat global index
@index(Global, Cartesian) # CartesianIndex in global space
@index(Local, Linear)     # Index within workgroup
@index(Group, Linear)     # Which workgroup
@index(Global, NTuple)    # As a tuple
```
### `@groupsize` / `@ndrange`

```julia
@kernel function info_kernel(out)
    gs = @groupsize()                # Workgroup dimensions
    nr = @ndrange()                  # Total computation range
    N  = @uniform prod(@groupsize()) # Total workgroup size
end
```
### `@localmem` - Shared memory within workgroup

```julia
@kernel function reduce_kernel(out, @Const(input))
    lid = @index(Local, Linear)
    gid = @index(Global, Linear)
    shared = @localmem Float32 (256,) # Shared across the workgroup
    shared[lid] = input[gid]
    @synchronize
    # Now all work items can read shared
end
```
### `@private` - Per-work-item memory

```julia
@kernel function accumulate_kernel(out, @Const(data))
    i = @index(Global, Linear)
    acc = @private Float32 (1,) # Private to this work item
    acc[1] = 0.0f0
    # ... accumulate into acc
    out[i] = acc[1]
end
```
### `@uniform` - Evaluate once per workgroup

```julia
@kernel function batched_kernel(out, @Const(input))
    @uniform begin
        groupsize = @groupsize()[1]
        scale = 2.0f0
    end
    # groupsize and scale are shared across work items
end
```
### `@synchronize` - Memory barrier

```julia
@synchronize # All work items in the workgroup must reach this point
```
## Backend System

### Backend Types

```julia
using KernelAbstractions

# Abstract hierarchy:
# Backend                # All backends
# ├── GPU                # GPU backends (deprecated in 1.0)
# │   ├── CUDABackend
# │   ├── ROCBackend
# │   └── oneAPIBackend
# └── CPU                # CPU backend

# Get the backend from an array
backend = get_backend(A)
```
### Kernel Type

A `Kernel` contains:

- Backend reference
- Workgroup size
- NDRange
- Transformed function

```julia
@kernel function my_kernel(A)
    # ...
end

# Create the kernel for a specific backend, then launch
kernel = my_kernel(CUDABackend())
kernel(A, ndrange=size(A), workgroupsize=256)
```
## CUDA.jl Integration

```julia
using CUDA, KernelAbstractions

# Get the CUDABackend directly...
backend = CUDABackend()

# ...or from an existing array
A_gpu = CUDA.rand(1024)
backend = get_backend(A_gpu)

@kernel function saxpy!(y, a, @Const(x)) # a is a scalar, so no @Const
    i = @index(Global, Linear)
    @inbounds y[i] += a * x[i]
end

x_gpu = CUDA.rand(Float32, 1024)
y_gpu = CUDA.rand(Float32, 1024)

kernel = saxpy!(backend)
kernel(y_gpu, 2.0f0, x_gpu, ndrange=length(y_gpu))
KernelAbstractions.synchronize(backend)
```
## Memory Model

| Memory Type | Scope | Lifetime | Speed | Macro |
|---|---|---|---|---|
| Global | All workgroups | Kernel | Slowest | Regular arrays |
| Local | Workgroup | Workgroup | Fast | `@localmem` |
| Private | Work item | Work item | Fastest | `@private` |
| Constant | All (read-only) | Kernel | Fast | `@Const` |
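A minimal sketch tying the table together. The kernel name `tile_scale!` is hypothetical, and it assumes a launch with `workgroupsize=64` so the `@localmem` allocation matches the workgroup:

```julia
using KernelAbstractions

# Hypothetical kernel touching each memory space from the table above
@kernel function tile_scale!(out, @Const(input)) # Constant: read-only input
    lid = @index(Local, Linear)
    gid = @index(Global, Linear)

    tile = @localmem Float32 (64,) # Local: shared within the workgroup
    tmp  = @private Float32 (1,)   # Private: one slot per work item

    tile[lid] = input[gid]         # Global read staged into local memory
    @synchronize                   # Make tile visible to the whole workgroup
    tmp[1] = 2.0f0 * tile[lid]
    out[gid] = tmp[1]              # Global: visible to all workgroups
end
```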
## Synchronization Rules

- `@localmem` requires `@synchronize` for visibility across work items
- `@synchronize` must be reached by ALL work items in a workgroup, or by NONE
- No synchronization is needed for `@private` memory
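A sketch of the all-or-none rule, using hypothetical kernels and assuming a 64-wide workgroup:

```julia
# WRONG: the barrier sits inside a divergent branch, so only part of the
# workgroup reaches it - undefined behavior
@kernel function bad_sync!(out, @Const(input))
    lid = @index(Local, Linear)
    shared = @localmem Float32 (64,)
    if lid <= 32
        shared[lid] = input[lid]
        @synchronize
    end
    out[lid] = shared[lid]
end

# RIGHT: no divergence before the barrier; every work item reaches it
@kernel function good_sync!(out, @Const(input))
    lid = @index(Local, Linear)
    shared = @localmem Float32 (64,)
    shared[lid] = input[lid]
    @synchronize                    # Uniform barrier: all work items arrive
    out[lid] = shared[64 - lid + 1] # Writes from other items are now visible
end
```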
## Enzyme Autodiff Integration

```julia
using KernelAbstractions, Enzyme

@kernel function square!(A)
    I = @index(Global, Linear)
    @inbounds A[I] *= A[I]
end

function square_caller(A, backend)
    kernel = square!(backend)
    kernel(A, ndrange=size(A))
    KernelAbstractions.synchronize(backend)
    return
end

# Differentiate with Enzyme
A = rand(Float32, 1024)
dA = ones(Float32, 1024) # Seed gradient

Enzyme.autodiff(Reverse, square_caller,
                Duplicated(A, dA), # Primal + tangent
                Const(CPU()))      # Backend is constant
```
### Enzyme Annotations

- `Duplicated(primal, tangent)` - Active array argument
- `Const(x)` - Constant, not differentiated
- `Active(x)` - Active scalar (not supported on GPU)
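A small sketch exercising all three annotations on a plain host function. `scale_shift` is illustrative, and `Active` is fine here because this runs on the CPU:

```julia
using Enzyme

scale_shift(a, x, b) = a * sum(x) + b # Illustrative scalar-valued function

x  = rand(Float32, 8)
dx = zeros(Float32, 8)

# The third argument declares the activity of the return value
grads = Enzyme.autodiff(Reverse, scale_shift,
                        Active,             # Return value is active
                        Active(2.0f0),      # Active scalar: gradient returned
                        Duplicated(x, dx),  # Active array: gradient lands in dx
                        Const(1.0f0))       # Constant: not differentiated

# grads[1] == (sum(x), nothing, nothing); dx == fill(2.0f0, 8)
```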
## MaxEnt Triad Testing Protocol

Three agents maximize mutual information through complementary verification:

| Agent | Role | Verifies |
|---|---|---|
| julia-gpu-kernels | Kernel definition | Correct `@kernel` syntax, backend dispatch |
| enzyme-autodiff | Differentiation | `autodiff` through the kernel launcher works |
| julia-tempering | RNG injection | `SplittableRandom` integrates with kernel |
### Test: Differentiable GPU Kernel with Splittable RNG

```julia
using KernelAbstractions, Enzyme, SplittableRandoms

# Agent A (julia-gpu-kernels): define kernel with RNG state
@kernel function monte_carlo_kernel(out, rng_states, @Const(params))
    i = @index(Global, Linear)
    rng = rng_states[i]
    # Sample and accumulate
    acc = @private Float32 (1,)
    acc[1] = 0.0f0
    for _ in 1:100
        u = rand(rng, Float32)
        acc[1] += params[1] * u
    end
    out[i] = acc[1]
end

# Launcher for Enzyme compatibility
function mc_launcher(out, rng_states, params, backend)
    kernel = monte_carlo_kernel(backend)
    kernel(out, rng_states, params, ndrange=length(out))
    KernelAbstractions.synchronize(backend)
    return
end
```
### Agent B (enzyme-autodiff): Differentiate the kernel

```julia
# Enzyme differentiates w.r.t. params
out = zeros(Float32, 1024)
dout = ones(Float32, 1024)
params = Float32[1.0]
dparams = Float32[0.0]

Enzyme.autodiff(Reverse, mc_launcher,
                Duplicated(out, dout),
                Const(rng_states), # RNG is not differentiated
                Duplicated(params, dparams),
                Const(backend))

# dparams now contains ∂loss/∂params
```
### Agent C (julia-tempering): Provide splittable RNG

```julia
using SplittableRandoms

# Create a splittable RNG hierarchy for the parallel kernel
master_rng = SplittableRandom(42)
n_threads = 1024

# Split into independent streams, one per work item
rng_states = [split(master_rng) for _ in 1:n_threads]

# Property: any permutation of splits yields the same distribution
# Property: parent-child independence for parallel safety
```
### Verification Matrix

| Property | A Provides | B Verifies | C Provides |
|---|---|---|---|
| Kernel correctness | `@kernel` syntax | Launches without error | - |
| Memory safety | `@Const`, `@private` | No aliasing violations | - |
| Differentiability | `mc_launcher` call | Gradients are correct | - |
| RNG independence | - | dRNG/dparams = 0 | Split semantics |
| Reproducibility | - | - | Deterministic splits |
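A quick check of the reproducibility row (illustrative; relies only on `SplittableRandom` and `split` from SplittableRandoms.jl):

```julia
using SplittableRandoms

# Deterministic splits: the same seed yields identical child streams
a = SplittableRandom(7)
b = SplittableRandom(7)
ca, cb = split(a), split(b)
@assert rand(ca, Float32) == rand(cb, Float32)
```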
### Integration Test

```julia
function test_triad_integration()
    backend = CPU()
    n = 256

    # C: set up RNG streams
    master = SplittableRandom(12345)
    rngs = [split(master) for _ in 1:n]

    # A: define arrays
    out = zeros(Float32, n)
    dout = ones(Float32, n)
    params = Float32[2.0]
    dparams = Float32[0.0]

    # B: differentiate
    Enzyme.autodiff(Reverse, mc_launcher,
                    Duplicated(out, dout),
                    Const(rngs),
                    Duplicated(params, dparams),
                    Const(backend))

    @assert !isnan(dparams[1]) "Gradient should be finite"
    @assert dparams[1] != 0.0 "Gradient should be non-zero"

    println("✓ Triad integration verified")
    println("  ∂loss/∂params = $(dparams[1])")
end
```
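To back the "gradients are correct" row of the matrix, Agent B could cross-check Enzyme against a central finite difference. A sketch under the assumption that `loss = sum(out)` (matching the all-ones `dout` seed); `fd_check` and the step size are illustrative:

```julia
using SplittableRandoms

# Central-difference estimate of ∂loss/∂params[1] (illustrative)
function fd_check(p0::Float32; n=256, h=1f-2, seed=12345)
    backend = CPU()
    function loss(p)
        master = SplittableRandom(seed)     # Same seed => same streams each call
        rngs = [split(master) for _ in 1:n]
        out = zeros(Float32, n)
        mc_launcher(out, rngs, Float32[p], backend)
        return sum(out)
    end
    return (loss(p0 + h) - loss(p0 - h)) / (2h) # ≈ dparams[1] from Enzyme
end

# Usage: @assert isapprox(fd_check(2.0f0), dparams[1]; rtol=1e-2)
```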
## Quick Reference

| Macro | Purpose |
|---|---|
| `@kernel` | Define kernel function |
| `@Const` | Read-only array |
| `@index` | Get work item index |
| `@groupsize` | Workgroup dimensions |
| `@ndrange` | Total range |
| `@localmem` | Shared memory |
| `@private` | Per-item memory |
| `@uniform` | Evaluate once per group |
| `@synchronize` | Memory barrier |
## Autopoietic Marginalia
The interaction IS the skill improving itself.
Every use of this skill is an opportunity for worlding:
- MEMORY (-1): Record what was learned
- REMEMBERING (0): Connect patterns to other skills
- WORLDING (+1): Evolve the skill based on use
Add Interaction Exemplars here as the skill is used.