Claude-skill-registry fal-serverless-guide

Complete fal.ai serverless deployment system. PROACTIVELY activate for: (1) Creating fal.App class, (2) GPU machine selection (T4/A10G/A100/H100), (3) setup() for model loading, (4) @fal.endpoint decorators, (5) Persistent volumes for weights, (6) Secrets management, (7) Scaling configuration (min/max concurrency), (8) Multi-GPU deployment, (9) fal deploy commands, (10) Local development with fal run. Provides: App structure, Dockerfile patterns, deployment commands, scaling config. Ensures production-ready serverless ML deployment.

Install

Source: clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code: install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/fal-serverless-guide" ~/.claude/skills/majiayu000-claude-skill-registry-fal-serverless-guide && rm -rf "$T"

Manifest: skills/data/fal-serverless-guide/SKILL.md

Quick Reference

| Machine Type | GPU | VRAM | Use Case |
|---|---|---|---|
| GPU-T4 | T4 | 16GB | Dev, small models |
| GPU-A10G | A10G | 24GB | 7B-13B models |
| GPU-A100 | A100 | 40/80GB | 13B-70B models |
| GPU-H100 | H100 | 80GB | Cutting-edge |

| App Attribute | Purpose | Example |
|---|---|---|
| machine_type | GPU selection | "GPU-A100" |
| requirements | Dependencies | ["torch", "transformers"] |
| keep_alive | Warm duration | 300 (5 min) |
| min_concurrency | Min instances | 0 (scale to zero) |
| max_concurrency | Max parallel | 4 |

| Command | Purpose |
|---|---|
| fal deploy app.py::MyApp | Deploy to fal |
| fal run app.py::MyApp | Run locally |
| fal logs <app-id> | View logs |
| fal secrets set KEY=value | Set secrets |

When to Use This Skill

Use for custom model deployment:

  • Deploying custom ML models on fal infrastructure
  • Configuring GPU instances and scaling
  • Setting up persistent storage for model weights
  • Creating multi-endpoint apps
  • Managing secrets and environment variables

Related skills:

  • For API integration: see fal-api-reference
  • For optimization: see fal-optimization
  • For using hosted models: see fal-model-guide

fal.ai Serverless Deployment Guide

Complete guide to deploying custom ML models on fal.ai's serverless infrastructure.

Overview

fal serverless provides:

  • Automatic scaling from zero to thousands of instances
  • GPU support (T4, A10G, A100, H100, H200, B200)
  • Persistent storage for model weights
  • Secrets management
  • Real-time logs and monitoring
  • Pay-per-use pricing

Installation

pip install fal

Authentication

# Login to fal
fal auth login

# Or set API key
export FAL_KEY="your-api-key"

Basic App Structure

import fal
from pydantic import BaseModel

class RequestModel(BaseModel):
    """Input schema for your endpoint"""
    prompt: str
    max_tokens: int = 100

class ResponseModel(BaseModel):
    """Output schema for your endpoint"""
    text: str
    tokens: int

class MyApp(fal.App):
    # Machine configuration
    machine_type = "GPU-A100"
    num_gpus = 1

    # Dependencies
    requirements = [
        "torch>=2.0.0",
        "transformers>=4.35.0",
        "accelerate"
    ]

    # Scaling configuration
    keep_alive = 300        # Keep instance warm (seconds)
    min_concurrency = 0     # Scale to zero when idle
    max_concurrency = 4     # Max concurrent requests

    def setup(self):
        """
        Called once when container starts.
        Load models and heavy resources here.
        """
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained("model-name")
        self.model = AutoModelForCausalLM.from_pretrained(
            "model-name",
            torch_dtype=torch.float16
        ).to(self.device)

    @fal.endpoint("/predict")
    def predict(self, request: RequestModel) -> ResponseModel:
        """
        Main inference endpoint.
        Called for each request.
        """
        inputs = self.tokenizer(request.prompt, return_tensors="pt")
        inputs = inputs.to(self.device)

        outputs = self.model.generate(
            **inputs,
            max_new_tokens=request.max_tokens
        )

        text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

        return ResponseModel(text=text, tokens=len(outputs[0]))

    @fal.endpoint("/health")
    def health(self):
        """Health check endpoint"""
        return {"status": "healthy", "device": self.device}

    def teardown(self):
        """Called when container shuts down (optional)"""
        if hasattr(self, 'model'):
            del self.model
        import torch
        torch.cuda.empty_cache()

Machine Types

| Type | GPU | VRAM | Use Case |
|---|---|---|---|
| CPU | None | - | Preprocessing, lightweight |
| GPU-T4 | NVIDIA T4 | 16GB | Development, small models |
| GPU-A10G | NVIDIA A10G | 24GB | Medium models (7B-13B) |
| GPU-A100 | NVIDIA A100 | 40/80GB | Large models (13B-70B) |
| GPU-H100 | NVIDIA H100 | 80GB | Cutting-edge performance |
| GPU-H200 | NVIDIA H200 | 141GB | Very large models |
| GPU-B200 | NVIDIA B200 | 192GB | Frontier models (100B+) |
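As a rough way to apply the table above, the sketch below estimates the memory needed for fp16 weights (2 bytes per parameter) and picks the smallest listed GPU that fits. The `pick_machine_type` helper, the 20% headroom factor, and the choice of the 40GB A100 variant are assumptions made for illustration; none of this is part of the fal SDK, and real memory use also depends on activations, KV cache, and batch size.

```python
# VRAM per machine type, copied from the table above.
# Assumption: "GPU-A100" is treated as the 40GB variant.
MACHINE_VRAM_GB = [
    ("GPU-T4", 16),
    ("GPU-A10G", 24),
    ("GPU-A100", 40),
    ("GPU-H100", 80),
    ("GPU-H200", 141),
    ("GPU-B200", 192),
]


def estimate_weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GB: 1e9 params * bytes per param / 1e9 bytes."""
    return params_billion * bytes_per_param


def pick_machine_type(params_billion: float, headroom: float = 1.2):
    """Return the smallest listed machine type whose VRAM covers the estimate.

    Returns None when no single GPU in the table is large enough, which is
    the case where a multi-GPU setup would be needed.
    """
    needed_gb = estimate_weights_gb(params_billion) * headroom
    for name, vram_gb in MACHINE_VRAM_GB:
        if vram_gb >= needed_gb:
            return name
    return None
```

For example, `pick_machine_type(7)` estimates about 16.8GB and therefore skips the 16GB T4 in favour of `"GPU-A10G"`.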

Multi-GPU Configuration

class MultiGPUApp(fal.App):
    machine_type = "GPU-H100"
    num_gpus = 4  # Use 4 H100s

    def setup(self):
        import torch
        from transformers import AutoModelForCausalLM

        self.model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Llama-2-70b-hf",
            torch_dtype=torch.float16,
            device_map="auto"  # Distribute across GPUs
        )

Persistent Storage

Use volumes to persist data across restarts:

class AppWithStorage(fal.App):
    machine_type = "GPU-A100"
    requirements = ["torch", "transformers"]

    # Define persistent volumes
    volumes = {
        "/data": fal.Volume("model-cache"),
        "/outputs": fal.Volume("generated-outputs")
    }

    def setup(self):
        import os
        from transformers import AutoModel

        cache_dir = "/data/models"
        os.makedirs(cache_dir, exist_ok=True)

        # Model weights persist across cold starts
        self.model = AutoModel.from_pretrained(
            "large-model",
            cache_dir=cache_dir
        )

    @fal.endpoint("/generate")
    def generate(self, request):
        output_path = "/outputs/result.png"
        # Save to persistent storage
        return {"path": output_path}
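The fixed path `/outputs/result.png` in the example above would be overwritten by every request. A minimal sketch of one way around that, using only the standard library, is to generate a unique file name per request. The `unique_output_path` helper is hypothetical and not part of fal.

```python
import uuid
from pathlib import Path


def unique_output_path(root: str, suffix: str = ".png") -> Path:
    """Return a collision-resistant file path under `root`.

    The directory is created if it does not exist yet, so the helper can be
    pointed at a freshly mounted volume such as /outputs.
    """
    directory = Path(root)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / f"{uuid.uuid4().hex}{suffix}"
```

Inside an endpoint this could be used as `output_path = unique_output_path("/outputs")` before saving the result.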

Secrets Management

# Set secrets via CLI
fal secrets set HF_TOKEN=hf_xxx API_KEY=sk_xxx

# List secrets
fal secrets list

# Delete secret
fal secrets delete HF_TOKEN
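
Secrets are exposed to the app as environment variables:

import os

import fal

class SecureApp(fal.App):
    def setup(self):
        # Secrets set via the CLI are read from the environment
        hf_token = os.environ["HF_TOKEN"]

        from huggingface_hub import login
        login(token=hf_token)

        # Gated models can now be downloaded. load_gated_model() is a
        # placeholder for your own loading code.
        self.model = load_gated_model()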
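Reading a secret with `os.environ["HF_TOKEN"]` raises a bare `KeyError` when the secret was never set. The sketch below shows one possible wrapper that fails with a more actionable message. `require_secret` and `MissingSecretError` are hypothetical names introduced here, not fal APIs.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent or empty."""


def require_secret(name: str) -> str:
    """Return the secret stored in the environment variable `name`.

    Surrounding whitespace is stripped; an unset or empty value is treated
    as missing.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise MissingSecretError(
            f"Secret {name!r} is not set. Set it with: fal secrets set {name}=<value>"
        )
    return value
```

In `setup()` this would replace the direct lookup, for example `hf_token = require_secret("HF_TOKEN")`.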

Deployment Commands

# Deploy application
fal deploy app.py::MyApp

# Deploy with options
fal deploy app.py::MyApp \
  --machine-type GPU-A100 \
  --num-gpus 2 \
  --min-concurrency 1 \
  --max-concurrency 8

# View deployments
fal list

# View logs
fal logs <app-id>

# View real-time logs
fal logs <app-id> --follow

# Delete deployment
fal delete <app-id>

# Run locally for testing
fal run app.py::MyApp

Advanced Patterns

Image Generation App

import fal
from pydantic import BaseModel
from typing import Optional
import io

class ImageRequest(BaseModel):
    prompt: str
    negative_prompt: Optional[str] = None
    width: int = 1024
    height: int = 1024
    steps: int = 28
    seed: Optional[int] = None

class ImageResponse(BaseModel):
    image_url: str
    seed: int

class ImageGenerator(fal.App):
    machine_type = "GPU-A100"
    requirements = [
        "torch",
        "diffusers",
        "transformers",
        "accelerate",
        "safetensors"
    ]
    keep_alive = 600
    max_concurrency = 2

    volumes = {
        "/data": fal.Volume("diffusion-models")
    }

    def setup(self):
        import torch
        from diffusers import StableDiffusionXLPipeline

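        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            cache_dir="/data/models"
        ).to("cuda")

        # The pipeline stays resident on the GPU. If VRAM is tight, drop
        # .to("cuda") and call self.pipe.enable_model_cpu_offload() instead;
        # the two should not be combined.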

    @fal.endpoint("/generate")
    def generate(self, request: ImageRequest) -> ImageResponse:
        import torch
        import random

        # Compare against None so that an explicit seed of 0 is respected
        seed = request.seed if request.seed is not None else random.randint(0, 2**32 - 1)
        generator = torch.Generator("cuda").manual_seed(seed)

        image = self.pipe(
            prompt=request.prompt,
            negative_prompt=request.negative_prompt,
            width=request.width,
            height=request.height,
            num_inference_steps=request.steps,
            generator=generator
        ).images[0]

        # Save and upload to CDN
        path = f"/tmp/output_{seed}.png"
        image.save(path)
        url = fal.upload_file(path)

        return ImageResponse(image_url=url, seed=seed)

Streaming Response

import fal
from typing import Generator

class StreamingApp(fal.App):
    machine_type = "GPU-A100"
    requirements = ["torch", "transformers"]

    def setup(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("model")
        # Move the model to the GPU so it matches the inputs in stream()
        self.model = AutoModelForCausalLM.from_pretrained("model").to("cuda")

    @fal.endpoint("/stream")
    def stream(self, prompt: str) -> Generator[str, None, None]:
        from transformers import TextIteratorStreamer
        from threading import Thread

        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True)

        # generate() blocks, so run it in a thread and consume the streamer here
        thread = Thread(
            target=self.model.generate,
            kwargs={**inputs, "streamer": streamer, "max_new_tokens": 256}
        )
        thread.start()

        for text in streamer:
            yield text

        # Wait for generation to finish before the request ends
        thread.join()
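The streaming example relies on a producer thread that pushes text chunks while the endpoint's generator consumes them. The sketch below reproduces that pattern with the standard library only, so it can be run without a model. `stream_from_thread` is a hypothetical stand-in written for illustration and is not how transformers implements its streamer.

```python
import queue
import threading

# Sentinel object marking the end of the stream.
_DONE = object()


def stream_from_thread(produce):
    """Yield items produced by `produce()` while it runs in a worker thread.

    `produce` is any zero-argument callable returning an iterable of chunks.
    The sentinel is always enqueued, even if the producer raises, so the
    consumer never blocks forever.
    """
    chunks = queue.Queue()

    def worker():
        try:
            for chunk in produce():
                chunks.put(chunk)
        finally:
            chunks.put(_DONE)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    while True:
        item = chunks.get()
        if item is _DONE:
            break
        yield item

    thread.join()
```

Calling `list(stream_from_thread(lambda: iter(["Hel", "lo"])))` collects the chunks in the order the worker produced them.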

Background Tasks

import fal

class BackgroundApp(fal.App):
    machine_type = "GPU-A100"

    @fal.endpoint("/process")
    async def process(self, data: str) -> dict:
        # Submit background work
        task_id = await self.start_background_task(data)
        return {"task_id": task_id, "status": "processing"}

    async def start_background_task(self, data: str) -> str:
        # Implement your background logic
        import uuid
        task_id = str(uuid.uuid4())
        # Save task to queue/database
        return task_id
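The "save task to queue/database" step is left open above. As a minimal sketch of what tracking could look like, the class below keeps task status in memory using asyncio. `TaskRegistry` is a hypothetical helper, and an in-memory store is an assumption that only works within a single instance: state is lost on restart and is not shared between instances, so a real deployment would need external storage.

```python
import asyncio
import uuid


class TaskRegistry:
    """Tracks background coroutines and their status in memory."""

    def __init__(self):
        self._status = {}
        self._tasks = {}

    def submit(self, coro) -> str:
        """Schedule `coro` on the running event loop and return a task id."""
        task_id = str(uuid.uuid4())
        self._status[task_id] = "processing"
        # Keep a reference so the task is not garbage collected mid-flight.
        self._tasks[task_id] = asyncio.create_task(self._run(task_id, coro))
        return task_id

    async def _run(self, task_id, coro):
        try:
            await coro
            self._status[task_id] = "completed"
        except Exception:
            self._status[task_id] = "failed"
        finally:
            self._tasks.pop(task_id, None)

    def status(self, task_id: str) -> str:
        return self._status.get(task_id, "unknown")

    async def wait_all(self):
        """Wait until every task submitted so far has finished."""
        await asyncio.gather(*list(self._tasks.values()))
```

An endpoint could call `registry.submit(some_coroutine())` and return the id immediately, with a second endpoint exposing `registry.status(task_id)`.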

Multiple Endpoints

import fal
from pydantic import BaseModel

class TextRequest(BaseModel):
    text: str

class ImageRequest(BaseModel):
    image_url: str

class MultiModalApp(fal.App):
    machine_type = "GPU-A100"
    requirements = ["torch", "transformers", "Pillow"]

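    def setup(self):
        # load_text_model() and load_vision_model() are placeholders for
        # your own loading code; define them on this class
        self.text_model = self.load_text_model()
        self.vision_model = self.load_vision_model()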

    @fal.endpoint("/analyze-text")
    def analyze_text(self, request: TextRequest) -> dict:
        result = self.text_model(request.text)
        return {"analysis": result}

    @fal.endpoint("/analyze-image")
    def analyze_image(self, request: ImageRequest) -> dict:
        result = self.vision_model(request.image_url)
        return {"analysis": result}

    @fal.endpoint("/")
    def info(self) -> dict:
        return {
            "name": "MultiModal Analyzer",
            "endpoints": ["/analyze-text", "/analyze-image"]
        }

    @fal.endpoint("/health")
    def health(self) -> dict:
        return {"status": "healthy"}

Scaling Configuration

class ScaledApp(fal.App):
    machine_type = "GPU-A100"

    # Scaling options
    min_concurrency = 0     # Scale to zero (cost savings)
    max_concurrency = 10    # Max parallel requests
    keep_alive = 300        # Keep warm for 5 minutes

    # For always-on endpoints
    # min_concurrency = 1   # Always have one instance ready
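A starting value for `max_concurrency` can be estimated with Little's law: the average number of requests in flight is roughly the arrival rate multiplied by the time each request takes. The sketch below applies that formula. `required_concurrency` and its `headroom` parameter are hypothetical, and the result is only a first guess to be checked against GPU memory limits and real traffic.

```python
import math


def required_concurrency(
    requests_per_second: float,
    seconds_per_request: float,
    headroom: float = 1.0,
) -> int:
    """Estimate how many requests are in flight at once (Little's law).

    `headroom` is a multiplier for bursts, e.g. 1.5 for 50% spare capacity.
    The result is rounded up and never lower than 1.
    """
    in_flight = requests_per_second * seconds_per_request * headroom
    return max(1, math.ceil(in_flight))
```

For example, 4 requests per second at 2.5 seconds each gives `required_concurrency(4, 2.5) == 10`.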

Concurrency Guidelines

| GPU Memory per Request | Suggested max_concurrency |
|---|---|
| < 4GB | 8-10 |
| 4-8GB | 4-6 |
| 8-16GB | 2-4 |
| 16-40GB | 1-2 |
| > 40GB | 1 |

Error Handling

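import fal
# Assumption: fal apps are served through FastAPI, so its HTTPException
# is used to return HTTP error responses
from fastapi import HTTPException

class MyApp(fal.App):
    @fal.endpoint("/predict")
    def predict(self, request: dict):
        try:
            # self.process() stands in for your own inference logic
            result = self.process(request)
            return {"result": result}
        except ValueError as e:
            # Client error
            raise HTTPException(status_code=400, detail=f"Invalid input: {e}")
        except RuntimeError as e:
            # Server error
            raise HTTPException(status_code=500, detail=f"Processing failed: {e}")
        except Exception:
            # Unexpected error: do not leak internals to the caller
            raise HTTPException(status_code=500, detail="Internal error")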
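The mapping from exception type to HTTP status can also be kept in one small function that is easy to test in isolation. The sketch below is one possible shape, using only the standard library. `to_http_error` is a hypothetical helper: it echoes validation messages to the client, logs everything else, and returns a generic message for server-side failures.

```python
import logging

logger = logging.getLogger("app.errors")


def to_http_error(exc: Exception) -> tuple:
    """Translate an exception into (status_code, client_safe_message)."""
    if isinstance(exc, ValueError):
        # Client error: the validation message is safe to return.
        return 400, f"Invalid input: {exc}"
    # Anything else is a server error. Log the details for debugging,
    # but return a generic message so internals are not exposed.
    logger.error("Unhandled error while processing request", exc_info=exc)
    return 500, "Internal error"
```

An endpoint's `except Exception as e:` block could call `status, message = to_http_error(e)` and raise the corresponding HTTP error.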

Local Development

# Run locally
fal run app.py::MyApp

# Test endpoint
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "test"}'

# Run with environment variables
FAL_KEY=xxx HF_TOKEN=yyy fal run app.py::MyApp

Calling Deployed Apps

JavaScript/TypeScript

import { fal } from "@fal-ai/client";

fal.config({ credentials: process.env.FAL_KEY });

const result = await fal.subscribe("your-username/your-app/predict", {
  input: {
    prompt: "Hello world",
    max_tokens: 100
  }
});

Python

import fal_client

result = fal_client.subscribe(
    "your-username/your-app/predict",
    arguments={
        "prompt": "Hello world",
        "max_tokens": 100
    }
)

REST API

curl -X POST "https://queue.fal.run/your-username/your-app/predict" \
  -H "Authorization: Key $FAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello world", "max_tokens": 100}'
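For environments where neither client library is available, the same request can be assembled with Python's standard library. The sketch below mirrors the curl call above (same URL pattern and `Authorization: Key ...` header) and only builds the request object. `build_queue_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_queue_request(app_path: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build a POST request for https://queue.fal.run/<app_path>.

    The request is not sent here; pass it to urllib.request.urlopen() to
    submit it.
    """
    return urllib.request.Request(
        f"https://queue.fal.run/{app_path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Usage would look like `urllib.request.urlopen(build_queue_request("your-username/your-app/predict", {"prompt": "Hello world"}, os.environ["FAL_KEY"]))`, with `os` imported by the caller.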

Best Practices

  1. Load models in setup()

    • Heavy initialization once, not per request
    • Use persistent volumes for large weights
  2. Use appropriate machine type

    • Match GPU memory to model size
    • Consider cost vs performance trade-offs
  3. Handle cold starts

    • Use keep_alive for frequently accessed endpoints
    • Use min_concurrency=1 for latency-critical apps
  4. Optimize memory

    • Use fp16/bf16 where possible
    • Enable memory-efficient attention
    • Clear GPU cache in teardown
  5. Monitor and debug

    • Check logs regularly: fal logs <app-id> --follow
    • Implement health checks
    • Use structured logging
  6. Security

    • Use secrets for API keys
    • Validate all inputs
    • Don't expose internal errors
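For the "use structured logging" point, one common approach is to emit each log record as a single JSON object so that log output can be filtered by field. The sketch below uses only the standard `logging` module. `JsonFormatter`, `get_json_logger`, and the `fields` attribute are names chosen for this example, not conventions required by fal.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra key/value pairs passed as extra={"fields": {...}}.
        fields = getattr(record, "fields", None)
        if isinstance(fields, dict):
            payload.update(fields)
        return json.dumps(payload)


def get_json_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `get_json_logger("app").info("generated image", extra={"fields": {"seed": 42}})` then produces one JSON line containing the message and the seed.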

Pricing

  • Pay per second of compute used
  • Different rates for different GPU types
  • No charge when scaled to zero
  • Check https://fal.ai/pricing for current rates
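Because billing is per second of compute, a back-of-the-envelope estimate is just billed seconds multiplied by the per-second rate. The sketch below shows that arithmetic. `estimate_compute_cost` is a hypothetical helper, the rate in the usage note is a placeholder rather than a real fal price, and the calculation ignores any time an instance is kept warm, which the list above does not specify.

```python
def estimate_compute_cost(
    requests: int,
    seconds_per_request: float,
    rate_per_second: float,
) -> float:
    """Estimate cost as (requests * seconds per request) * rate per second.

    Idle time at zero instances contributes nothing, matching the
    scale-to-zero behaviour described above.
    """
    if requests < 0 or seconds_per_request < 0 or rate_per_second < 0:
        raise ValueError("all inputs must be non-negative")
    billed_seconds = requests * seconds_per_request
    return billed_seconds * rate_per_second
```

With a placeholder rate of 0.0005 per second, 1000 requests of 2 seconds each come to about 1.0 in the same currency unit as the rate.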