Awesome-omni-skill localai
Run local AI models with LocalAI. Deploy an OpenAI-compatible API for LLMs, embeddings, audio, and images. Use it for self-hosted AI, offline inference, and privacy-focused deployments.
install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ai-agents/localai" ~/.claude/skills/diegosouzapw-awesome-omni-skill-localai && rm -rf "$T"
manifest:
skills/ai-agents/localai/SKILL.md
safety · automated scan (low risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
- makes HTTP requests (curl)
- references API keys
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
source content
LocalAI
Expert guidance for running a self-hosted, OpenAI-compatible AI API.
Installation
Docker
# Basic (CPU)
docker run -p 8080:8080 localai/localai:latest

# With GPU (CUDA)
docker run --gpus all -p 8080:8080 localai/localai:latest-gpu-nvidia-cuda-12

# With models directory
docker run -p 8080:8080 \
  -v /path/to/models:/models \
  localai/localai:latest
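Once the container is up, a quick way to verify the API is responding is to hit the OpenAI-compatible /v1/models endpoint. A minimal sketch using only the Python standard library, assuming the container above is listening on localhost:8080:

import json
import urllib.request

# List the models the server currently knows about.
with urllib.request.urlopen("http://localhost:8080/v1/models") as resp:
    models = json.load(resp)

for m in models.get("data", []):
    print(m["id"])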
Docker Compose
services:
  localai:
    image: localai/localai:latest-gpu-nvidia-cuda-12
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
    environment:
      - THREADS=8
      - CONTEXT_SIZE=4096
      - DEBUG=true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
Model Configuration
YAML Model Definition
# models/llama3.yaml
name: llama3
backend: llama-cpp
parameters:
  model: /models/llama-3-8b-instruct.gguf
  temperature: 0.7
  top_p: 0.9
  top_k: 40
context_size: 4096
threads: 8
f16: true
mmap: true
template:
  chat_message: |
    <|start_header_id|>{{.RoleName}}<|end_header_id|>
    {{.Content}}<|eot_id|>
  chat: |
    {{.Input}}
    <|start_header_id|>assistant<|end_header_id|>
Embedding Model
# models/embeddings.yaml
name: text-embedding
backend: bert-embeddings
parameters:
  model: /models/all-MiniLM-L6-v2
embeddings: true
Whisper (Audio)
# models/whisper.yaml
name: whisper-1
backend: whisper
parameters:
  model: /models/whisper-base.bin
  language: en
Stable Diffusion
# models/stablediffusion.yaml
name: stablediffusion
backend: stablediffusion
parameters:
  model: /models/sd-v1-5
step: 25
API Usage
OpenAI Python Client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # LocalAI doesn't require an API key
)

# Chat completion
response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
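If you would rather not hard-code the base URL, the v1 openai Python SDK also reads its configuration from the environment, so existing OpenAI code can point at LocalAI without edits. A small sketch:

import os
from openai import OpenAI

# The SDK picks these up when OpenAI() is constructed with no arguments.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8080/v1"
os.environ["OPENAI_API_KEY"] = "not-needed"  # any non-empty value works

client = OpenAI()
print([m.id for m in client.models.list().data])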
Embeddings
response = client.embeddings.create(
    model="text-embedding",
    input=["Hello world", "How are you?"]
)
embeddings = [e.embedding for e in response.data]
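The vectors come back in the standard OpenAI shape, so they plug straight into the usual similarity math. A pure-Python sketch that scores the two inputs above against each other, assuming the embeddings list from the previous snippet:

import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings[0], embeddings[1]))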
Image Generation
response = client.images.generate(
    model="stablediffusion",
    prompt="A beautiful sunset over mountains",
    n=1,
    size="512x512"
)
image_url = response.data[0].url
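To keep the result, fetch the returned URL and write it to disk. A minimal sketch, assuming the URL above is reachable from the machine running the client:

import urllib.request

# Download the generated image next to the script.
urllib.request.urlretrieve(image_url, "sunset.png")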
Audio Transcription
with open("audio.mp3", "rb") as f: response = client.audio.transcriptions.create( model="whisper-1", file=f ) print(response.text)
Gallery Models
# List available models
curl http://localhost:8080/models/available

# Install from gallery
curl http://localhost:8080/models/apply -d '{
  "id": "huggingface://TheBloke/Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf"
}'

# Or via config
curl http://localhost:8080/models/apply -d '{
  "url": "github:go-skynet/model-gallery/gpt4all-j.yaml"
}'
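The gallery endpoints are plain JSON over HTTP, so installs can be scripted and polled. A sketch in Python, assuming the endpoints above; the job UUID in the response and the /models/jobs/<uuid> status path follow LocalAI's gallery API, and the "processed" flag is an assumption about the status payload:

import json
import time
import urllib.request

# Kick off the download job.
req = urllib.request.Request(
    "http://localhost:8080/models/apply",
    data=json.dumps({
        "id": "huggingface://TheBloke/Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf"
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    job = json.load(resp)

# Poll until the gallery job reports completion.
while True:
    with urllib.request.urlopen(f"http://localhost:8080/models/jobs/{job['uuid']}") as resp:
        status = json.load(resp)
    if status.get("processed"):  # assumption: status payload carries this flag
        break
    time.sleep(5)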
Function Calling
# models/llama3-functions.yaml
name: llama3-functions
backend: llama-cpp
parameters:
  model: /models/llama-3-8b-instruct.gguf
function:
  disable_no_action: false
  grammar_prefix: |
    <|start_header_id|>assistant<|end_header_id|>
response = client.chat.completions.create(
    model="llama3-functions",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"}
                },
                "required": ["city"]
            }
        }
    }],
    tool_choice="auto"
)
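The response only contains the model's request to call the tool; your code still has to run it and send the result back. A sketch of the round trip using the standard OpenAI tool-message flow, where get_weather is a hypothetical local implementation:

import json

def get_weather(city):
    # Hypothetical stand-in for a real weather lookup.
    return {"city": city, "forecast": "sunny", "temp_c": 21}

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    result = get_weather(**json.loads(call.function.arguments))

    followup = client.chat.completions.create(
        model="llama3-functions",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"},
            message,  # the assistant turn containing the tool call
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
        ],
    )
    print(followup.choices[0].message.content)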
Performance Tuning
# Environment variables
THREADS=8             # Number of CPU threads
CONTEXT_SIZE=4096     # Context window size
F16=true              # Use FP16
MMAP=true             # Memory-map models
GPU_LAYERS=35         # Layers to offload to GPU
TENSOR_SPLIT=0.5,0.5  # Multi-GPU split
GPU Offloading
# models/llama3-gpu.yaml
name: llama3
backend: llama-cpp
parameters:
  model: /models/llama-3-8b-instruct.gguf
gpu_layers: 35
main_gpu: 0
tensor_split: ""
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: localai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: localai
  template:
    metadata:
      labels:
        app: localai
    spec:
      containers:
        - name: localai
          image: localai/localai:latest-gpu-nvidia-cuda-12
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models
              mountPath: /models
          env:
            - name: THREADS
              value: "8"
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models-pvc