Claude-skill-registry hugging-face-space-deployer
Create, configure, and deploy Hugging Face Spaces for showcasing ML models. Supports Gradio, Streamlit, and Docker SDKs with templates for common use cases like chat interfaces, image generation, and model comparisons.
```bash
# Clone the full registry
git clone https://github.com/majiayu000/claude-skill-registry

# Or install only this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/hugging-face-space-deployer" ~/.claude/skills/majiayu000-claude-skill-registry-hugging-face-space-deployer && rm -rf "$T"
```
skills/data/hugging-face-space-deployer/SKILL.md

Hugging Face Space Deployer
A skill for AI engineers to create, configure, and deploy interactive ML demos on Hugging Face Spaces.
CRITICAL: Pre-Deployment Checklist
Before writing ANY code, gather this information about the model:
1. Check Model Type (LoRA Adapter vs Full Model)
Use the HF MCP tool to inspect the model files:
hf-skills - Hub Repo Details (repo_ids: ["username/model"], repo_type: "model")
Look for these indicators:
| Files Present | Model Type | Action Required |
|---|---|---|
| `model.safetensors` or `pytorch_model.bin` | Full model | Load directly with `AutoModelForCausalLM.from_pretrained()` |
| `adapter_config.json` + `adapter_model.safetensors` | LoRA/PEFT adapter | Must load base model first, then apply adapter with `PeftModel.from_pretrained()` |
| Only config files, no weights | Broken/incomplete | Ask user to verify |
If `adapter_config.json` exists, check its `base_model_name_or_path` field to identify the base model.
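If you prefer to check this from a script rather than the MCP tool, here is a minimal sketch (the repo id is a placeholder) that reads `base_model_name_or_path` straight from the Hub:

```python
import json
from huggingface_hub import hf_hub_download

# Hypothetical adapter repo; replace with the model you are inspecting
cfg_path = hf_hub_download("username/my-lora-adapter", "adapter_config.json")
base_model = json.load(open(cfg_path))["base_model_name_or_path"]
print(base_model)  # e.g. "Qwen/Qwen2.5-Coder-1.5B-Instruct"
```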
2. Check Inference API Availability
Visit the model page on HF Hub and look for "Inference Providers" widget on the right side.
Indicators that model HAS Inference API:
- Inference widget visible on model page
- Model from known provider: `meta-llama`, `mistralai`, `HuggingFaceH4`, `google`, `stabilityai`, `Qwen`
- High download count (>10,000) with standard architecture
Indicators that model DOES NOT have Inference API:
- Personal namespace (e.g., `GhostScientist/my-model`)
- LoRA/PEFT adapter (adapters never have direct Inference API)
- Missing `pipeline_tag` in model metadata
- No inference widget on model page
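A rough programmatic check of these indicators (a sketch only; the model page widget remains the source of truth for serverless availability):

```python
from huggingface_hub import HfApi

info = HfApi().model_info("username/model")  # placeholder repo id
print("pipeline_tag:", info.pipeline_tag)    # None suggests no Inference API
print("downloads:", info.downloads)
print("tags:", info.tags)
```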
3. Check Model Metadata
- Ensure `pipeline_tag` is set (e.g., `text-generation`)
- Add the `conversational` tag for chat models
4. Determine Hardware Needs
| Model Size | Recommended Hardware |
|---|---|
| < 3B parameters | ZeroGPU (free) or CPU |
| 3B - 7B parameters | ZeroGPU or T4 |
| > 7B parameters | A10G or A100 |
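To estimate parameter count without downloading weights, a hedged sketch assuming a recent `huggingface_hub` that exposes `get_safetensors_metadata` and a repo that stores safetensors weights:

```python
from huggingface_hub import HfApi

# Placeholder repo id; raises if the repo has no safetensors files
meta = HfApi().get_safetensors_metadata("username/model")
total_params = sum(meta.parameter_count.values())
print(f"~{total_params / 1e9:.1f}B parameters")
```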
5. Ask User If Unclear
If you cannot determine the model type, ASK THE USER:
"I'm analyzing your model to determine the best deployment strategy. I found:
- [what you found about files]
- [what you found about inference API]
Is this model:
- A full model you trained/uploaded?
- A LoRA/PEFT adapter on top of another model?
- Something else?
Also, would you prefer:
A. Free deployment with ZeroGPU (may have queue times)
B. Paid GPU for faster response (~$0.60/hr)"
Hardware Options
| Hardware | Use Case | Cost |
|---|---|---|
| `cpu-basic` | Simple demos, Inference API apps | Free |
| `cpu-upgrade` | Faster CPU inference | ~$0.03/hr |
| ZeroGPU (`zero-a10g`) | Models needing GPU on-demand (recommended for most) | Free (with quota) |
| T4 small | Small GPU models (<7B) | ~$0.60/hr |
| T4 medium | Medium GPU models | ~$0.90/hr |
| A10G | Large models (7B-13B) | ~$1.50/hr |
| A10G large | Very large models (30B+) | ~$3.15/hr |
| A100 | Largest models | ~$4.50/hr |
ZeroGPU Note: ZeroGPU (`zero-a10g`) provides free GPU access on-demand. The Space runs on CPU, and when a user triggers inference, a GPU is allocated temporarily (~60-120 seconds). After deployment, you must manually set the runtime to "ZeroGPU" in Space Settings > Hardware.
Deployment Decision Tree
```
Analyze Model
│
├── Does it have adapter_config.json?
│   └── YES → It's a LoRA adapter
│       ├── Find base_model_name_or_path in adapter_config.json
│       └── Use Template 3 (LoRA + ZeroGPU)
│
├── Does it have model.safetensors or pytorch_model.bin?
│   └── YES → It's a full model
│       ├── Is it from a major provider with inference widget?
│       │   ├── YES → Use Inference API (Template 1)
│       │   └── NO → Use ZeroGPU (Template 2)
│
└── Neither found?
    └── ASK USER - model may be incomplete
```
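A minimal sketch of the same decision logic in Python (the inference-widget check stays a manual input, since it is not reliably exposed via the API):

```python
from huggingface_hub import HfApi

def pick_template(repo_id: str, has_inference_widget: bool) -> str:
    # List repo files and apply the decision tree above
    files = HfApi().list_repo_files(repo_id, repo_type="model")
    if "adapter_config.json" in files:
        return "Template 3 (LoRA + ZeroGPU)"
    if any(f.endswith((".safetensors", ".bin")) for f in files):
        return "Template 1 (Inference API)" if has_inference_widget else "Template 2 (ZeroGPU)"
    return "ASK USER - model may be incomplete"
```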
Dependencies
For Inference API (cpu-basic, free):
```
gradio>=5.0.0
huggingface_hub>=0.26.0
```
For ZeroGPU full models (zero-a10g, free with quota):
```
gradio>=5.0.0
torch
transformers
accelerate
spaces
```
For ZeroGPU LoRA adapters (zero-a10g, free with quota):
```
gradio>=5.0.0
torch
transformers
accelerate
spaces
peft
```
CLI Commands (CORRECT Syntax)
```bash
# Create Space
hf repo create my-space-name --repo-type space --space-sdk gradio

# Upload files
hf upload username/space-name ./local-folder --repo-type space

# Download model files to inspect
hf download username/model-name --local-dir ./model-check --dry-run

# Check what files exist in a model
hf download username/model-name --local-dir /tmp/check --dry-run 2>&1 | grep -E '\.(safetensors|bin|json)'
```
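The same steps can be done from Python with `huggingface_hub` (a sketch; repo and folder names are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()
# Create the Space (no-op if it already exists)
api.create_repo("username/space-name", repo_type="space", space_sdk="gradio", exist_ok=True)
# Upload the app files
api.upload_folder(folder_path="./local-folder", repo_id="username/space-name", repo_type="space")
```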
Template 1: Inference API (For Supported Models)
Use when: Model has inference widget, is from major provider, or explicitly supports serverless API.
```python
import gradio as gr
from huggingface_hub import InferenceClient

MODEL_ID = "HuggingFaceH4/zephyr-7b-beta"  # Must support Inference API!

client = InferenceClient(MODEL_ID)


def respond(message, history, system_message, max_tokens, temperature, top_p):
    messages = [{"role": "system", "content": system_message}]
    for user_msg, assistant_msg in history:
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if assistant_msg:
            messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    response = ""
    for token in client.chat_completion(
        messages,
        max_tokens=max_tokens,
        stream=True,
        temperature=temperature,
        top_p=top_p,
    ):
        delta = token.choices[0].delta.content or ""
        response += delta
        yield response


demo = gr.ChatInterface(
    respond,
    title="Chat Assistant",
    description="Powered by Hugging Face Inference API",
    additional_inputs=[
        gr.Textbox(value="You are a helpful assistant.", label="System message"),
        gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max tokens"),
        gr.Slider(minimum=0.1, maximum=2.0, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"),
    ],
    examples=[
        ["Hello! How are you?"],
        ["Write a Python function to sort a list"],
    ],
)

if __name__ == "__main__":
    demo.launch()
```
requirements.txt:
```
gradio>=5.0.0
huggingface_hub>=0.26.0
```
README.md:
```yaml
---
title: My Chat App
emoji: 💬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: apache-2.0
---
```
Template 2: ZeroGPU Full Model (For Models Without Inference API)
Use when: Full model (has model.safetensors) but no Inference API support.
```python
import gradio as gr
import spaces
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "username/my-full-model"

# Load tokenizer at startup
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Global model - loaded lazily on first GPU call for faster Space startup
model = None


def load_model():
    global model
    if model is None:
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID,
            torch_dtype=torch.float16,
            device_map="auto",
        )
    return model


@spaces.GPU(duration=120)
def generate_response(message, history, system_message, max_tokens, temperature, top_p):
    model = load_model()

    messages = [{"role": "system", "content": system_message}]
    for user_msg, assistant_msg in history:
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if assistant_msg:
            messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=int(max_tokens),
            temperature=float(temperature),
            top_p=float(top_p),
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return response


demo = gr.ChatInterface(
    generate_response,
    title="My Model",
    description="Powered by ZeroGPU (free!)",
    additional_inputs=[
        gr.Textbox(value="You are a helpful assistant.", label="System message", lines=2),
        gr.Slider(minimum=64, maximum=2048, value=512, step=64, label="Max tokens"),
        gr.Slider(minimum=0.1, maximum=1.5, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"),
    ],
    examples=[
        ["Hello! How are you?"],
        ["Help me write some code"],
    ],
)

if __name__ == "__main__":
    demo.launch()
```
requirements.txt:
```
gradio>=5.0.0
torch
transformers
accelerate
spaces
```
README.md:
```yaml
---
title: My Model
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: zero-a10g
---
```
Template 3: ZeroGPU LoRA Adapter (CRITICAL FOR FINE-TUNED MODELS)
Use when: Model has `adapter_config.json` and `adapter_model.safetensors` (NOT `model.safetensors`).
You MUST identify the base model from the `base_model_name_or_path` field in `adapter_config.json`.
```python
import gradio as gr
import spaces
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Your LoRA adapter
ADAPTER_ID = "username/my-lora-adapter"
# Base model (from adapter_config.json -> base_model_name_or_path)
BASE_MODEL_ID = "Qwen/Qwen2.5-Coder-1.5B-Instruct"

# Load tokenizer at startup
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)

# Global model - loaded lazily on first GPU call
model = None


def load_model():
    global model
    if model is None:
        base_model = AutoModelForCausalLM.from_pretrained(
            BASE_MODEL_ID,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
        model = model.merge_and_unload()  # Merge for faster inference
    return model


@spaces.GPU(duration=120)
def generate_response(message, history, system_message, max_tokens, temperature, top_p):
    model = load_model()

    messages = [{"role": "system", "content": system_message}]
    for item in history:
        if isinstance(item, (list, tuple)) and len(item) == 2:
            user_msg, assistant_msg = item
            if user_msg:
                messages.append({"role": "user", "content": user_msg})
            if assistant_msg:
                messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=int(max_tokens),
            temperature=float(temperature),
            top_p=float(top_p),
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return response


demo = gr.ChatInterface(
    generate_response,
    title="My Fine-Tuned Model",
    description="LoRA fine-tuned model powered by ZeroGPU (free!)",
    additional_inputs=[
        gr.Textbox(value="You are a helpful assistant.", label="System message", lines=2),
        gr.Slider(minimum=64, maximum=2048, value=512, step=64, label="Max tokens"),
        gr.Slider(minimum=0.1, maximum=1.5, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"),
    ],
    examples=[
        ["Hello! How are you?"],
        ["Help me with a coding task"],
    ],
)

if __name__ == "__main__":
    demo.launch()
```
requirements.txt (MUST include peft):
```
gradio>=5.0.0
torch
transformers
accelerate
spaces
peft
```
README.md:
```yaml
---
title: My Fine-Tuned Model
emoji: 🔧
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: zero-a10g
---
```
Post-Deployment Steps
After uploading your Space files:
1. Set the Runtime Hardware (REQUIRED for GPU models)
- Go to: https://huggingface.co/spaces/USERNAME/SPACE_NAME/settings
- Under "Space Hardware", select the appropriate option:
- ZeroGPU for free on-demand GPU (recommended)
- Or a dedicated GPU tier if needed
2. Verify the Space is Running
- Check the Space URL for any build errors
- Review container logs in Settings if issues occur
3. Common Post-Deploy Fixes
| Issue | Cause | Fix |
|---|---|---|
| "No API found" error | Hardware mismatch | Set runtime to ZeroGPU in Settings |
| Model not loading | LoRA vs full model confusion | Check if it's an adapter, use correct template |
| Inference API errors | Model not on serverless | Load directly with transformers instead |
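If you would rather set the hardware from code than from the Settings page (step 1 above), a hedged sketch using `huggingface_hub`, assuming your account is eligible for the requested tier:

```python
from huggingface_hub import HfApi, SpaceHardware

# Request ZeroGPU for the Space (placeholder repo id)
HfApi().request_space_hardware("USERNAME/SPACE_NAME", hardware=SpaceHardware.ZERO_A10G)
```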
Detecting Model Type - Quick Reference
Full Model
Files include: `model.safetensors`, `pytorch_model.bin`, or sharded versions
```python
# Can load directly
model = AutoModelForCausalLM.from_pretrained("username/model")
```
LoRA/PEFT Adapter
Files include: `adapter_config.json`, `adapter_model.safetensors`
```python
# Must load base model first, then apply adapter
base_model = AutoModelForCausalLM.from_pretrained("base-model-id")
model = PeftModel.from_pretrained(base_model, "username/adapter")
model = model.merge_and_unload()  # Optional: merge for faster inference
```
Inference API Available
Model page shows "Inference Providers" widget on the right side
```python
# Can use InferenceClient (simplest approach)
from huggingface_hub import InferenceClient
client = InferenceClient("username/model")
```
Fixing Missing pipeline_tag (To Enable Inference API)
If a model doesn't have an inference widget but should, it may be missing metadata:
```bash
# Download the README
hf download username/model-name README.md --local-dir /tmp/fix

# Edit to add pipeline_tag in YAML frontmatter:
# ---
# pipeline_tag: text-generation
# tags:
#   - conversational
# ---

# Upload the fix
hf upload username/model-name /tmp/fix/README.md README.md
```
Note: Even with correct tags, custom models may not get Inference API - it depends on HF's infrastructure decisions.
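A sketch of the same fix via `huggingface_hub`'s `metadata_update`, which edits the card metadata in place (repo id is a placeholder):

```python
from huggingface_hub import metadata_update

metadata_update(
    "username/model-name",  # placeholder repo id
    {"pipeline_tag": "text-generation", "tags": ["conversational"]},
    overwrite=True,  # replace existing values if already present
)
```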
CRITICAL: Gradio 5.x Requirements
Examples Format (MUST be nested lists)
```python
# CORRECT:
examples=[
    ["Example 1"],
    ["Example 2"],
]

# WRONG (causes ValueError):
examples=[
    "Example 1",
    "Example 2",
]
```
Version Requirements
```
gradio>=5.0.0
huggingface_hub>=0.26.0
```
Do NOT use `gradio==4.44.0` - it causes `ImportError: cannot import name 'HfFolder'`.
Troubleshooting
"No API found" Error
Cause: Gradio app isn't exposing the API correctly, often due to a hardware mismatch
Fix: Go to Space Settings and set the runtime to "ZeroGPU" or an appropriate GPU tier
"OSError: does not appear to have a file named pytorch_model.bin, model.safetensors"
Cause: Trying to load a LoRA adapter as a full model
Fix: Check for `adapter_config.json` - if present, use PEFT to load:
```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("base-model")
model = PeftModel.from_pretrained(base_model, "adapter-id")
```
Inference API Not Available
Cause: Model doesn't have a `pipeline_tag` or isn't deployed to serverless
Fix: Either:
a. Add `pipeline_tag: text-generation` to the model's README.md
b. Or load the model directly with transformers instead of InferenceClient
ImportError: cannot import name 'HfFolder'
Cause: gradio/huggingface_hub version mismatch
Fix: Use `gradio>=5.0.0` and `huggingface_hub>=0.26.0`
ValueError: examples must be nested list
Cause: Gradio 5.x format change
Fix: Use `[["ex1"], ["ex2"]]`, not `["ex1", "ex2"]`
Space builds but model doesn't load
Cause: Missing `peft` for adapters, or wrong base model
Fix: Check `adapter_config.json` for the correct `base_model_name_or_path`
Workflow Summary
- Analyze model (check for adapter_config.json, model files, inference widget)
- Determine strategy (Inference API vs ZeroGPU, full model vs LoRA)
- Ask user if unclear about model type or cost preferences
- Generate correct template based on analysis
- Create Space with correct requirements and README
- Upload files using `hf upload`
- Set hardware in Space Settings (ZeroGPU for free GPU access)
- Monitor build logs for any issues