## Install

### Source · Clone the upstream repo

```bash
git clone https://github.com/Aradotso/trending-skills
```

### Claude Code · Install into `~/.claude/skills/`

```bash
T=$(mktemp -d) &&
  git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" &&
  mkdir -p ~/.claude/skills &&
  cp -r "$T/skills/awesome-opensource-ai" ~/.claude/skills/aradotso-trending-skills-awesome-opensource-ai &&
  rm -rf "$T"
```

### Manifest

`skills/awesome-opensource-ai/SKILL.md` (source content below):
---
name: awesome-opensource-ai
description: Curated guide to the best open-source AI projects, models, tools, and infrastructure across the full ML stack
triggers:
  - show me open source AI tools
  - what are the best open source LLMs
  - recommend open source ML frameworks
  - find open source alternatives to closed AI APIs
  - what open source models should I use for my project
  - help me pick an open source inference engine
  - what are good open source RAG tools
  - open source AI stack for production
---

# Awesome Open Source AI

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

A curated reference for open-source AI models, libraries, infrastructure, and developer tools spanning the full ML/LLM stack — from training frameworks to production deployment.

---

## What This Resource Covers

The [awesome-opensource-ai](https://github.com/alvinunreal/awesome-opensource-ai) list organizes the open-source AI ecosystem into 14 categories:

1. Core Frameworks & Libraries
2. Open Foundation Models
3. Inference Engines & Serving
4. Agentic AI & Multi-Agent Systems
5. Retrieval-Augmented Generation (RAG) & Knowledge
6. Generative Media Tools
7. Training & Fine-tuning Ecosystem
8. MLOps / LLMOps & Production
9. Evaluation, Benchmarks & Datasets
10. AI Safety, Alignment & Interpretability
11. Specialized Domains
12. User Interfaces & Self-hosted Platforms
13. Developer Tools & Integrations
14. Resources & Learning

---

## Quick Decision Guide by Use Case

### "I need to run an LLM locally"

| Need | Recommended Tool |
|------|------------------|
| Simple local chat | [Ollama](https://github.com/ollama/ollama) |
| Max performance inference | [llama.cpp](https://github.com/ggerganov/llama.cpp) or [vLLM](https://github.com/vllm-project/vllm) |
| OpenAI-compatible API | [LocalAI](https://github.com/mudler/LocalAI) or [LM Studio](https://lmstudio.ai) |
| Production serving | [vLLM](https://github.com/vllm-project/vllm) or [TGI](https://github.com/huggingface/text-generation-inference) |

### "I need to train or fine-tune a model"

| Need | Recommended Tool |
|------|------------------|
| LoRA/QLoRA fine-tuning | [Unsloth](https://github.com/unslothai/unsloth) or [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) |
| Full training at scale | [DeepSpeed](https://github.com/microsoft/DeepSpeed) + [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) |
| Quick experiments | [Hugging Face Transformers](https://github.com/huggingface/transformers) + [Accelerate](https://github.com/huggingface/accelerate) |

### "I need to build a RAG pipeline"

| Need | Recommended Tool |
|------|------------------|
| Full RAG framework | [LlamaIndex](https://github.com/run-llama/llama_index) or [Haystack](https://github.com/deepset-ai/haystack) |
| Vector store | [Chroma](https://github.com/chroma-core/chroma), [Qdrant](https://github.com/qdrant/qdrant), or [Weaviate](https://github.com/weaviate/weaviate) |
| Embeddings model | [sentence-transformers](https://github.com/UKPLab/sentence-transformers) |

### "I need to build an AI agent"

| Need | Recommended Tool |
|------|------------------|
| General agents | [LangChain](https://github.com/langchain-ai/langchain) or [LlamaIndex Workflows](https://github.com/run-llama/llama_index) |
| Multi-agent orchestration | [AutoGen](https://github.com/microsoft/autogen) or [CrewAI](https://github.com/joaomdmoura/crewAI) |
| Code agents | [OpenHands](https://github.com/All-Hands-AI/OpenHands) or [SWE-agent](https://github.com/princeton-nlp/SWE-agent) |

---
## Model Selection Guide

### Open LLMs by Size & Use Case
**Small (1B–7B)** — Edge, mobile, low-resource:
- Phi-4-Mini (Microsoft) — best reasoning per parameter
- Gemma 3 2B/7B (Google) — strong efficiency
- Qwen3.5-3B/7B — excellent multilingual
**Medium (8B–30B)** — Balanced production use:
- Llama 4 8B — general purpose workhorse
- Qwen3.5-14B — coding + math
- Mistral Small — multilingual, tool use
**Large (70B+)** — Max capability open:
- Llama 4 405B — frontier open model
- DeepSeek-V3.2 (671B-parameter MoE, 37B active) — math/reasoning
- Qwen3.5-72B — top open coding/math
**Coding Specialists:**
- Qwen2.5-Coder-32B — #1 open coding
- DeepSeek-Coder-V2 — MoE coding powerhouse
- StarCoder2-15B — 600+ languages, transparent
**Vision-Language:**
- Qwen2.5-VL-72B — top open VLM
- InternVL 2.5 — charts, OCR, video
- LLaVA-Next — most popular/documented
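
As a rough way to map these size tiers onto hardware, weight memory scales with parameter count times bytes per parameter. The sketch below is only an approximation (the 1.2x overhead factor is an assumption, and real usage grows with context length and KV cache):

```python
def estimate_weight_vram_gb(params_billions: float, bits_per_param: int = 16,
                            overhead: float = 1.2) -> float:
    """Rough VRAM needed to hold model weights, with a fudge factor for runtime overhead."""
    bytes_per_param = bits_per_param / 8
    return params_billions * 1e9 * bytes_per_param * overhead / 1e9

print(f"8B  @ fp16: ~{estimate_weight_vram_gb(8, 16):.0f} GB")   # ~19 GB
print(f"8B  @ int4: ~{estimate_weight_vram_gb(8, 4):.0f} GB")    # ~5 GB
print(f"70B @ int4: ~{estimate_weight_vram_gb(70, 4):.0f} GB")   # ~42 GB
```

By this estimate, a 7B–8B model fits comfortably on a 24 GB consumer GPU at fp16, while 70B-class models generally need 4-bit quantization and/or multiple GPUs.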
---

## Core Framework Examples

### PyTorch — Basic Training Loop

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Define model
class SimpleNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.layers(x)

model = SimpleNet(784, 256, 10).to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

# Training loop
# dataloader: a DataLoader yielding (inputs, labels) batches (construction not shown)
for epoch in range(10):
    for batch_x, batch_y in dataloader:
        batch_x, batch_y = batch_x.to("cuda"), batch_y.to("cuda")
        optimizer.zero_grad()
        logits = model(batch_x)
        loss = criterion(logits, batch_y)
        loss.backward()
        optimizer.step()
```
### Hugging Face Transformers — Load & Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # auto-distributes across available GPUs
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain gradient descent in simple terms."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        input_ids,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
    )

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
### Hugging Face Accelerate — Multi-GPU Training
```python
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.utils.data import DataLoader
import torch

accelerator = Accelerator(mixed_precision="bf16")

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# train_dataloader: a DataLoader over your tokenized dataset (construction not shown)
# Accelerate handles device placement, mixed precision, distributed training
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

for epoch in range(3):
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

# Save — handles unwrapping DistributedDataParallel automatically
accelerator.wait_for_everyone()
unwrapped = accelerator.unwrap_model(model)
unwrapped.save_pretrained("./output", save_function=accelerator.save)
```
---

## Inference Engine Examples
### vLLM — Production OpenAI-Compatible Server
```bash
# Install
pip install vllm

# Start server (OpenAI-compatible)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --port 8000
```
```python
# Use with OpenAI client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a Python function to reverse a string."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
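
Besides the server mode above, vLLM can also be used directly as a Python library for offline batch generation; a minimal sketch (the model name and sampling settings mirror the server example and are illustrative):

```python
from vllm import LLM, SamplingParams

# Load the model once, then generate for a whole batch of prompts
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarize what a vector database does.",
    "Write a one-line docstring for a binary search function.",
]
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text.strip())
```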
### Ollama — Local Model Management
```bash
# Install (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run models
ollama pull llama3.1:8b
ollama pull qwen2.5-coder:14b
ollama pull mistral:7b

# Interactive chat
ollama run llama3.1:8b

# Serve API (default port 11434)
ollama serve
```
```python
import ollama

# Simple generation
response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is RAG in AI?"}]
)
print(response["message"]["content"])

# Streaming
for chunk in ollama.chat(
    model="qwen2.5-coder:14b",
    messages=[{"role": "user", "content": "Write a FastAPI CRUD app"}],
    stream=True
):
    print(chunk["message"]["content"], end="", flush=True)

# Embeddings
embedding = ollama.embeddings(
    model="nomic-embed-text",
    prompt="Represent this document for retrieval:"
)
vector = embedding["embedding"]  # list of floats
```
### llama.cpp — CPU/GPU Inference
```bash
# Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(nproc)                 # CPU only
make LLAMA_CUDA=1 -j$(nproc)    # NVIDIA GPU
make LLAMA_METAL=1 -j$(nproc)   # Apple Silicon

# Download a GGUF model (e.g. from HuggingFace)
# Then run:
./llama-cli -m ./models/llama-3.1-8b-instruct.Q4_K_M.gguf \
  -p "You are a helpful assistant." \
  --chat-template llama3 \
  -n 512 \
  --temp 0.7

# Start OpenAI-compatible server
./llama-server -m ./models/llama-3.1-8b-instruct.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 35   # layers to offload to GPU
```
---

## RAG Pipeline Examples
### LlamaIndex — Complete RAG Setup
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Configure models
Settings.llm = Ollama(model="llama3.1:8b", request_timeout=120.0)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Build index
index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True
)

# Persist index
index.storage_context.persist(persist_dir="./storage")

# Query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What are the main findings?")
print(response)
```
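
To reuse the persisted index in a later session without re-reading and re-embedding the documents, it can be reloaded from the storage directory; a short sketch, assuming the same `Settings` (LLM and embedding model) are configured first:

```python
from llama_index.core import StorageContext, load_index_from_storage

# Rebuild the index from ./storage instead of re-processing ./data
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("What are the main findings?"))
```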
### Chroma — Vector Store
```python
import chromadb
from chromadb.utils import embedding_functions

# Initialize client (persistent)
client = chromadb.PersistentClient(path="./chroma_db")

# Use sentence-transformers embeddings
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.get_or_create_collection(
    name="documents",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}
)

# Add documents
collection.add(
    documents=[
        "PyTorch is a machine learning framework.",
        "LangChain helps build LLM applications.",
        "Vector databases store embeddings for similarity search.",
    ],
    ids=["doc1", "doc2", "doc3"],
    metadatas=[
        {"source": "ml_docs", "category": "framework"},
        {"source": "llm_docs", "category": "framework"},
        {"source": "db_docs", "category": "database"},
    ]
)

# Query
results = collection.query(
    query_texts=["how do I train neural networks?"],
    n_results=2,
    where={"category": "framework"}  # optional metadata filter
)
for doc, score in zip(results["documents"][0], results["distances"][0]):
    print(f"Score: {1 - score:.3f} | {doc[:80]}...")
```
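
The RAG decision guide also lists Qdrant as a vector store option; here is a minimal sketch using the qdrant-client package with sentence-transformers embeddings (the in-memory client and MiniLM model are illustrative choices, not prescribed by the list):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
client = QdrantClient(":memory:")                   # or url="http://localhost:6333"

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

docs = [
    "PyTorch is a machine learning framework.",
    "LangChain helps build LLM applications.",
    "Vector databases store embeddings for similarity search.",
]
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=i, vector=encoder.encode(text).tolist(), payload={"text": text})
        for i, text in enumerate(docs)
    ],
)

hits = client.search(
    collection_name="documents",
    query_vector=encoder.encode("how do I train neural networks?").tolist(),
    limit=2,
)
for hit in hits:
    print(f"Score: {hit.score:.3f} | {hit.payload['text']}")
```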
---

## Agentic AI Examples
### LangChain — ReAct Agent with Tools
```python
from langchain_community.llms import Ollama
from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import tool
from langchain import hub

llm = Ollama(model="llama3.1:8b")

@tool
def search_docs(query: str) -> str:
    """Search internal documentation for information."""
    # Replace with your actual search logic
    return f"Documentation results for: {query}"

@tool
def run_python(code: str) -> str:
    """Execute Python code and return the output."""
    import io, contextlib
    output = io.StringIO()
    try:
        with contextlib.redirect_stdout(output):
            exec(code, {})
        return output.getvalue() or "Code executed successfully (no output)"
    except Exception as e:
        return f"Error: {str(e)}"

tools = [search_docs, run_python]
prompt = hub.pull("hwchase17/react")
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=5)

result = executor.invoke({
    "input": "Search for how to use pandas groupby, then write a code example."
})
print(result["output"])
```
### AutoGen — Multi-Agent Conversation
```python
import autogen

config_list = [{
    "model": "llama3.1:8b",
    "base_url": "http://localhost:11434/v1",
    "api_key": "ollama",
}]
llm_config = {"config_list": config_list, "temperature": 0.7}

# Create agents
assistant = autogen.AssistantAgent(
    name="Assistant",
    llm_config=llm_config,
    system_message="You are a helpful AI. Solve tasks step by step."
)
code_reviewer = autogen.AssistantAgent(
    name="CodeReviewer",
    llm_config=llm_config,
    system_message="You review code for bugs, security issues, and best practices."
)
user_proxy = autogen.UserProxyAgent(
    name="User",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=5,
    code_execution_config={"work_dir": "workspace", "use_docker": False},
)

# Group chat
groupchat = autogen.GroupChat(
    agents=[user_proxy, assistant, code_reviewer],
    messages=[],
    max_round=10
)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

user_proxy.initiate_chat(
    manager,
    message="Write a Python script that scrapes headlines from a news RSS feed and summarizes them."
)
```
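
CrewAI, the other multi-agent option in the decision guide, organizes work around role-based agents and explicit tasks. The sketch below shows only the core Agent/Task/Crew pattern; the roles, task text, and any LLM wiring (e.g. pointing at a local Ollama endpoint) are illustrative and depend on the CrewAI version you install:

```python
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Researcher",
    goal="Collect key facts about open-source inference engines",
    backstory="You research ML infrastructure and summarize findings concisely.",
)
writer = Agent(
    role="Technical Writer",
    goal="Turn research notes into a short comparison",
    backstory="You write clear, accurate developer documentation.",
)

research_task = Task(
    description="List the main trade-offs between vLLM, TGI, and llama.cpp.",
    expected_output="Bullet-point notes covering throughput, hardware, and ease of use.",
    agent=researcher,
)
writing_task = Task(
    description="Write a short markdown comparison based on the research notes.",
    expected_output="A markdown table comparing the three engines.",
    agent=writer,
)

crew = Crew(agents=[researcher, writer], tasks=[research_task, writing_task])
result = crew.kickoff()
print(result)
```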
---

## Fine-tuning Examples
### Unsloth — Fast LoRA Fine-tuning
```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load model with Unsloth optimizations (2x faster, 60% less VRAM)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

dataset = load_dataset("yahma/alpaca-cleaned", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
        output_dir="./output",
        save_steps=100,
        logging_steps=10,
    ),
)
trainer.train()

# Save LoRA weights
model.save_pretrained("./lora_model")
tokenizer.save_pretrained("./lora_model")

# Optionally merge and export to GGUF
model.save_pretrained_gguf("./gguf_model", tokenizer, quantization_method="q4_k_m")
```
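
To run inference later with the adapter saved above (without merging), the LoRA weights can be attached to the base model via PEFT; a minimal sketch reusing the paths from the example (loading the base in bf16 here is an illustrative choice):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./lora_model")

# Attach the LoRA adapter saved by the trainer above
model = PeftModel.from_pretrained(base, "./lora_model")
model.eval()

inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```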
---

## MLOps Examples
### MLflow — Experiment Tracking
```python
import mlflow
import mlflow.pytorch
from mlflow.models import infer_signature

mlflow.set_experiment("llm-fine-tuning")

with mlflow.start_run(run_name="llama3-lora-v1"):
    # Log hyperparameters
    mlflow.log_params({
        "model": "llama3.1-8b",
        "lora_rank": 16,
        "learning_rate": 2e-4,
        "epochs": 3,
        "batch_size": 4,
    })

    # Log metrics during training
    # training_losses: per-step loss values collected from your training loop
    for step, loss in enumerate(training_losses):
        mlflow.log_metric("train_loss", loss, step=step)

    mlflow.log_metric("eval_perplexity", 12.4)
    mlflow.log_metric("eval_bleu", 0.38)

    # Log artifacts
    mlflow.log_artifact("./lora_model", artifact_path="model")
    mlflow.log_artifact("./training_config.yaml")

    # Tag the run
    mlflow.set_tags({
        "task": "instruction-tuning",
        "dataset": "alpaca-cleaned",
        "framework": "unsloth",
    })

# Query runs programmatically
runs = mlflow.search_runs(
    experiment_names=["llm-fine-tuning"],
    filter_string="metrics.eval_perplexity < 15",
    order_by=["metrics.eval_perplexity ASC"],
)
print(runs[["run_id", "params.model", "metrics.eval_perplexity"]].head())
```
---

## Common Patterns
### Pattern 1: Local LLM with Fallback
```python
import os
from openai import OpenAI

def get_llm_client(prefer_local: bool = True):
    """Returns OpenAI-compatible client, preferring local vLLM/Ollama."""
    if prefer_local:
        try:
            client = OpenAI(
                base_url=os.getenv("LOCAL_LLM_URL", "http://localhost:11434/v1"),
                api_key="local"
            )
            # Test connection
            client.models.list()
            return client, os.getenv("LOCAL_MODEL", "llama3.1:8b")
        except Exception:
            pass
    # Fallback to OpenAI
    return OpenAI(api_key=os.environ["OPENAI_API_KEY"]), "gpt-4o-mini"

client, model = get_llm_client()
```
### Pattern 2: Embeddings + Similarity Search (No Vector DB)
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def build_index(texts: list[str]) -> np.ndarray:
    return model.encode(texts, normalize_embeddings=True)

def search(query: str, corpus_embeddings: np.ndarray, texts: list[str], top_k: int = 5):
    query_emb = model.encode([query], normalize_embeddings=True)
    scores = (query_emb @ corpus_embeddings.T)[0]
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [(texts[i], float(scores[i])) for i in top_indices]

# Usage
texts = ["doc 1 content", "doc 2 content", "doc 3 content"]
embeddings = build_index(texts)
results = search("my query", embeddings, texts)
```
### Pattern 3: Structured Output with Pydantic
````python
from pydantic import BaseModel
from transformers import pipeline
import json

class CodeReview(BaseModel):
    has_bugs: bool
    severity: str  # "low" | "medium" | "high" | "critical"
    issues: list[str]
    suggestions: list[str]

def review_code(code: str, llm_pipeline) -> CodeReview:
    prompt = f"""Review this code and respond with ONLY valid JSON matching this schema:
{CodeReview.model_json_schema()}

Code to review:
```python
{code}
```"""
    output = llm_pipeline(prompt, max_new_tokens=512)[0]["generated_text"]
    # Extract the last JSON object in the output (the model's reply follows the
    # prompt, which itself contains schema braces); works because the schema is flat
    json_start = output.rfind("{")
    json_end = output.rfind("}") + 1
    json_str = output[json_start:json_end]
    return CodeReview.model_validate_json(json_str)
````
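
A short usage sketch for the pattern above, building a transformers pipeline to pass as `llm_pipeline` (the model choice is illustrative; any instruction-tuned model that reliably emits JSON will do):

```python
from transformers import pipeline  # review_code and CodeReview are defined in the pattern above

llm = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

review = review_code("def add(a, b):\n    return a - b", llm)
print(review.has_bugs, review.severity, review.issues)
```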
---

## Troubleshooting
### CUDA Out of Memory
```python
# Reduce memory usage:

# 1. Use 4-bit quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
# model_id: the checkpoint you are loading (e.g. from the examples above)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# 2. Enable gradient checkpointing
model.gradient_checkpointing_enable()

# 3. Use smaller batch + gradient accumulation
#    Instead of batch_size=32, use batch_size=4, grad_accum=8

# 4. Clear cache between operations
import gc
torch.cuda.empty_cache()
gc.collect()
```
### vLLM Slow First Response
```bash
# Pre-warm the model after startup
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "hi", "max_tokens": 1}'
```
### Hugging Face Download Issues
```bash
# Use environment variables for auth and caching
export HUGGING_FACE_HUB_TOKEN="your_token_here"          # use env var, not hardcoded
export HF_HOME="/path/to/large/disk/.cache/huggingface"
export HF_HUB_OFFLINE=1                                  # use cached files only (after download)

# Download model files explicitly
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
  --local-dir ./models/llama3.1-8b \
  --include "*.safetensors" "*.json" "tokenizer*"
```
### Ollama Model Not Found
```bash
# List locally installed models
ollama list

# Browse/search available models in the library: https://ollama.com/library

# Pull specific version/quantization
ollama pull qwen2.5-coder:14b-instruct-q4_K_M

# Check running status
ollama ps
```
---

## Environment Setup
```bash
# Minimal ML environment
conda create -n ai-dev python=3.11
conda activate ai-dev

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate datasets peft trl
pip install vllm                           # production inference
pip install llama-index chromadb           # RAG
pip install langchain langchain-community  # agents
pip install mlflow                         # experiment tracking
pip install sentence-transformers          # embeddings
pip install unsloth                        # fast fine-tuning

# Environment variables (add to .env or shell profile)
export HUGGING_FACE_HUB_TOKEN="${HUGGING_FACE_HUB_TOKEN}"
export OPENAI_API_KEY="${OPENAI_API_KEY}"          # if using OpenAI fallback
export ANTHROPIC_API_KEY="${ANTHROPIC_API_KEY}"    # if using Anthropic
export HF_HOME="${HF_HOME:-~/.cache/huggingface}"
export TRANSFORMERS_CACHE="${HF_HOME}/hub"
```
---

## Key Resources
- Awesome List: https://github.com/alvinunreal/awesome-opensource-ai
- Hugging Face Hub: https://huggingface.co/models (model downloads)
- Ollama Library: https://ollama.com/library (curated GGUF models)
- Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- LMSYS Chatbot Arena: https://chat.lmsys.org (human preference rankings)
- Papers With Code: https://paperswithcode.com/sota (benchmark tracking)