Self-Hosted AI Tools To Run in Your Home Lab
- Who it’s for
- Why self-hosted AI matters
- What you’ll build
- Start here (90-minute setup)
- AI Tools
- Ollama: Your local AI runtime
- OpenWebUI: A familiar chat interface for local models
- n8n: Automation and agentic workflows
- LocalAI: The “one-container” option
- AnythingLLM: Client app with RAG and document chat
- Whisper and WhisperX: Local speech-to-text
- Stable Diffusion WebUI: Local image generation
- PrivateGPT: Offline RAG over your documents
- LibreChat: A flexible, shared AI workspace
- How the pieces fit together
- Resource Planning
- Quick reference table
- LLM model size guidance (Ollama/LocalAI/AnythingLLM backends)
- Practical sizing tiers
- Optimization tips
- Common pitfalls
- Privacy checklist
- Ready to build?
If you’ve been eyeing AI but want privacy, control, and the joy of tinkering, 2025 is the year to build your own stack at home. Thanks to open-source distilled models, GPU acceleration, and a maturing ecosystem, you can run powerful language and image models locally, with no cloud required. Here’s a practical guide to the best tools, how they fit together, and where to start.
Who it’s for
If you’re running Proxmox, Docker, or a single mini PC and want private AI you can tinker with, this guide is for you.
Why self-hosted AI matters
- Privacy: Your prompts, chats, and files stay on your hardware under your control. No third-party access or vendor lock-in.
- Learning: Hosting models teaches you how inference works, what GPU memory means for performance, and how different models behave under load.
- Integration: Local models plug into your existing self-hosted services, enabling automations, assistants, and end-to-end workflows.
- Fun: A few years ago, a private AI chat system sounded like sci-fi. Today, it’s a weekend project.
What you’ll build
A local stack with Ollama for models, OpenWebUI for chat, n8n for automations, WhisperX for transcription, Stable Diffusion for images, and PrivateGPT/AnythingLLM for offline RAG.
Start here (90-minute setup)
- Install Ollama; pull a 7B model (Mistral 7B or Llama‑3 8B quantized GGUF).
- Deploy OpenWebUI and connect to Ollama.
- Add n8n for one simple workflow (RSS → summarize → post).
- Optional: WhisperX for transcription and Stable Diffusion WebUI for images when you have GPU headroom.
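Once steps 1 and 2 are running, it’s worth a quick sanity check before wiring anything else up. Here’s a minimal sketch that lists the models Ollama has pulled via its local REST API; it assumes Ollama is on its default port (11434) and that you’ve already pulled at least one model.

```python
import requests  # pip install requests

# Ollama's default local endpoint; adjust if you mapped a different port.
OLLAMA = "http://localhost:11434"

# List pulled models (roughly equivalent to `ollama list`).
resp = requests.get(f"{OLLAMA}/api/tags", timeout=10)
resp.raise_for_status()
print("Installed models:", [m["name"] for m in resp.json()["models"]])
```

If this prints your 7B model, OpenWebUI and everything else in this guide can talk to the same endpoint.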
AI Tools
Ollama: Your local AI runtime
Think of Ollama as the engine for running large language models (LLMs) locally. It’s lightweight, easy to install, and runs great in Docker or LXC. You can pull and manage models like GPT-OSS, Gemma, Llama 3, Phi-3/4, Mistral, DeepSeek, and many more. Ollama exposes a local API endpoint, which other tools use to provide a chat interface, automations, or RAG.
Key points:
- Deployment: Docker/LXC friendly; simple to stand up on Proxmox, Docker Desktop, or a mini PC.
- Acceleration: NVIDIA and AMD GPU support; CPU fallback if you don’t have a discrete GPU.
- Role: The core engine you set up once, then connect everything else to it.
Resources:
- Acts as the core runtime; its footprint is dominated by the model(s) you load.
- More CPU cores help with token throughput when running CPU‑only; GPU offload gives the biggest gains.
- Concurrency increases memory pressure because each active context has its own KV cache.
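To see what “other tools use the API” looks like in practice, here is a minimal sketch that sends a single prompt to Ollama’s generate endpoint. The model name is just an example; use whatever you’ve pulled, and adjust the port if you changed the default.

```python
import requests  # pip install requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",   # example: any model you've pulled with `ollama pull`
        "prompt": "Explain what a KV cache is in two sentences.",
        "stream": False,      # return one JSON object instead of a token stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```

Every chat UI, automation, and RAG tool in this stack is ultimately doing some variation of this call.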
OpenWebUI: A familiar chat interface for local models
Ollama handles models; OpenWebUI gives you the chat experience. It looks and feels like ChatGPT, but it’s open-source and backend-agnostic. Connect it to Ollama’s API, choose your model, tweak parameters, and start chatting—entirely in your home lab.
Highlights:
- Multiple models, image generation, prompt templates, chat history, and custom instructions.
- Admin features to download and manage models directly from the UI.
- A full self-hosted ChatGPT alternative that lives inside your lab.
Resources:
- Primarily a lightweight UI; the heavy lifting happens in the backend (Ollama/LocalAI/OpenAI).
- Scale cores/RAM with the number of simultaneous users and background jobs (image generation plug‑ins, etc.).
n8n: Automation and agentic workflows
If you’ve used Zapier or Make, n8n will feel familiar—except you self-host it. It’s an open-source workflow tool you can wire into Ollama and other APIs to automate summaries, posts, notifications, and even CI/CD actions.
Example automations:
- Pull RSS items from FreshRSS, summarize with Ollama, and post to Mastodon.
- Analyze home lab logs daily, flag anomalies, and send a dashboard summary.
- Inspect CI/CD runs, explain failures, and re-trigger pipelines automatically.
Resources:
- CPU/RAM scale with the number and complexity of workflows, and the database (PostgreSQL) size.
- When orchestrating LLM and STT jobs, most resource impact occurs in the called services (Ollama, WhisperX).
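To make the first example automation concrete, here is a rough Python sketch of what such a workflow does end to end. In n8n you would build this with its RSS, HTTP Request, and Mastodon nodes instead; the feed URL, instance URL, access token, and model name below are all placeholders.

```python
import requests     # pip install requests
import feedparser   # pip install feedparser

FEED_URL = "https://example.com/feed.xml"           # placeholder: your FreshRSS feed
OLLAMA = "http://localhost:11434/api/generate"
MASTODON = "https://your.instance/api/v1/statuses"  # placeholder Mastodon instance
TOKEN = "YOUR_ACCESS_TOKEN"                         # placeholder API token

# 1. Pull the newest item from the feed.
entry = feedparser.parse(FEED_URL).entries[0]

# 2. Ask a local model for a short summary.
summary = requests.post(OLLAMA, json={
    "model": "mistral",
    "prompt": f"Summarize this article in two sentences:\n\n{entry.title}\n{entry.summary}",
    "stream": False,
}, timeout=300).json()["response"]

# 3. Post the summary plus the link to Mastodon.
requests.post(
    MASTODON,
    headers={"Authorization": f"Bearer {TOKEN}"},
    data={"status": f"{summary}\n\n{entry.link}"},
    timeout=30,
).raise_for_status()
```

The point of n8n is that you get this same flow with retries, scheduling, and credentials management, without maintaining a script.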
LocalAI: The “one-container” option
Want something simpler than pairing Ollama with OpenWebUI? LocalAI is a single container that packages model management, an OpenAI-compatible API, and a web interface. Under the hood it runs on similar backends (llama.cpp and friends), so you get a comparable experience without stitching multiple tools together.
What you get:
- One Docker command to launch; supports both CPU and GPU acceleration.
- Text and image model support; works with familiar sources and formats such as Hugging Face models and quantized GGUF files.
- Ideal for quick deployment without sacrificing local control.
Resources:
- “One‑container” convenience with resource profile similar to Ollama because it uses similar runtimes under the hood.
- GPU support and quantization choice determine latency.
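Because LocalAI speaks the OpenAI API, most existing OpenAI client libraries work against it by just changing the base URL. A minimal sketch, assuming LocalAI is listening on its default port (commonly 8080) and the model name matches one you’ve actually installed:

```python
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at LocalAI instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

reply = client.chat.completions.create(
    model="mistral-7b-instruct",  # example name; use a model installed in LocalAI
    messages=[{"role": "user", "content": "Give me three home lab project ideas."}],
)
print(reply.choices[0].message.content)
```

This compatibility is also why LocalAI slots in behind tools like OpenWebUI and LibreChat with no code changes.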
AnythingLLM: Client app with RAG and document chat
Mintplex Labs’ AnythingLLM is an all-in-one platform that shines for document interaction and RAG. It runs as a desktop app on your workstation (similar to LM Studio) or as a Docker container. You can upload PDFs and markdown, or sync a GitHub repo; it indexes everything locally so you can chat over your own content.
Notable features:
- Integrates with Ollama and OpenAI; keeps data local.
- Webhook support for automations; role-based access controls.
- NPU support on Snapdragon X Elite devices (Windows ARM64), delivering ~30% RAG performance gains, bringing otherwise-idle NPUs into play.
Resources:
- RAG indexing is memory‑ and disk‑intensive; larger corpora benefit from 16–32 GB RAM and fast NVMe.
- If available, NPU/GPU acceleration improves embedding and inference speed; otherwise expect CPU‑bound latency.
Whisper and WhisperX: Local speech-to-text
For transcription, OpenAI’s Whisper is accurate—even with noisy audio—and runs locally. WhisperX builds on it with better GPU acceleration, timestamp alignment, and speed.
Use cases:
- Transcribe YouTube videos, podcasts, or meeting recordings privately.
- Containerize and automate: watch a folder for new audio, transcribe with WhisperX, summarize with Ollama, and deliver results via email or dashboards.
Resources:
- CPU‑only runs are feasible but slow for long audio. GPU with 10–12 GB VRAM offers a big speedup.
- WhisperX adds alignment and typically more GPU memory use; plan higher VRAM and RAM for batch pipelines.
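As a concrete starting point, the open-source whisper package gives you transcription in a few lines. A minimal sketch follows; the audio filename is a placeholder, and note that WhisperX has its own, slightly different API for alignment and batching.

```python
import whisper  # pip install openai-whisper (requires ffmpeg on the system)

# Smaller models ("tiny", "base", "small") run acceptably on CPU;
# "medium" and "large" really want a GPU.
model = whisper.load_model("small")

result = model.transcribe("meeting_recording.mp3")  # placeholder file
print(result["text"])

# Segments carry timestamps you can feed into n8n or a summarization step.
for seg in result["segments"][:5]:
    print(f'[{seg["start"]:.1f}s - {seg["end"]:.1f}s] {seg["text"]}')
```

Wrap this in a watch-folder script or an n8n workflow and you have the private transcription pipeline described above.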
Stable Diffusion WebUI: Local image generation
Stable Diffusion remains the go-to for local image generation, and AUTOMATIC1111’s WebUI makes it accessible and powerful. Run it on your workstation or in Docker with GPU acceleration to create artwork, thumbnails, textures, and more.
Features:
- ControlNet, LoRA fine-tuning, and image upscaling.
- Easy to integrate with automations—for example, generating visuals for blog posts or documentation.
Resources:
- VRAM is the key limiter. 8 GB VRAM can run base models at modest resolutions; 12–24 GB handles ControlNet, high‑res, and upscalers comfortably.
- Disk usage grows quickly with checkpoints, LoRAs, ControlNet models, and outputs—budget 50–200+ GB if you experiment widely.
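For the automation angle, AUTOMATIC1111’s WebUI can expose a REST API when launched with its `--api` flag. Here’s a minimal sketch assuming it’s running locally on the default port 7860; the prompt and generation parameters are just examples.

```python
import base64
import requests  # pip install requests

payload = {
    "prompt": "isometric illustration of a tiny home lab server rack, soft lighting",
    "negative_prompt": "blurry, text, watermark",
    "steps": 25,
    "width": 768,
    "height": 512,
}

resp = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=600)
resp.raise_for_status()

# Images come back base64-encoded; save the first one to disk.
with open("blog_header.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["images"][0]))
```

Calling this from n8n is how “generate visuals for blog posts” becomes a scheduled job rather than a manual step.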
PrivateGPT: Offline RAG over your documents
PrivateGPT gives you a privacy-first chatbot that never touches the cloud. It pairs Ollama or LocalAI with a vector database for retrieval-augmented generation. Feed it your docs and ask questions—fully offline.
Why it’s useful:
- Perfect for querying tech PDFs, internal wikis, or project docs.
- Runs easily in Docker and integrates cleanly with Ollama.
Resources:
- Memory scales with index size and query complexity. For medium corpora, 32–64 GB RAM offers headroom.
- GPU isn’t strictly required but can accelerate the LLM layer; embedding generation benefits from GPU too.
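If you’re curious what RAG actually does before deploying PrivateGPT, the core loop is: embed your documents, retrieve the chunks most similar to the question, and hand them to the model as context. Below is a deliberately tiny sketch of that idea against Ollama’s embeddings and generate endpoints. It is not PrivateGPT’s actual pipeline (which adds chunking, a real vector database, and persistence), and the model names and documents are examples.

```python
import requests     # pip install requests
import numpy as np  # pip install numpy

OLLAMA = "http://localhost:11434"

def embed(text: str) -> np.ndarray:
    # "nomic-embed-text" is one embedding model available through Ollama; any will do.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text}, timeout=120)
    return np.array(r.json()["embedding"])

docs = [
    "The NAS backup job runs nightly at 02:00 and writes to the offsite mirror.",
    "Proxmox VMs live on the ZFS pool named tank; snapshots are kept for 14 days.",
    "The reverse proxy terminates TLS and forwards to the internal Docker network.",
]
question = "When do backups run?"

# Retrieve: rank documents by cosine similarity to the question.
q = embed(question)
scores = [float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
          for d in (embed(doc) for doc in docs)]
context = docs[int(np.argmax(scores))]

# Augment + generate: give the model only the retrieved context.
answer = requests.post(f"{OLLAMA}/api/generate", json={
    "model": "mistral",
    "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    "stream": False,
}, timeout=300).json()["response"]
print(answer)
```

PrivateGPT and AnythingLLM do exactly this at scale, which is why index size and embedding speed dominate their resource needs.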
LibreChat: A flexible, shared AI workspace
LibreChat is a customizable web UI modeled on the ChatGPT interface that connects to various backends—your local Ollama, OpenAI, or others—without vendor lock-in.
Capabilities:
- Multi-model setups, plugins, custom prompts, and chat memory.
- Configure as a shared AI workspace for your household, with controls over storage and access.
How the pieces fit together
Here’s a simple architecture you can run on Proxmox, Docker Swarm, or a single mini PC:
- Ollama: The central model engine
- OpenWebUI: Chat interface to interact with models
- n8n: Workflow automation to orchestrate tasks and summaries
- Whisper/WhisperX: Transcribe audio locally, feed results into n8n
- Stable Diffusion WebUI: Generate images for content workflows
- PrivateGPT or AnythingLLM: Index and query your documentation or code
Resource Planning
CPU, RAM, and GPU VRAM needs vary widely across these tools and models, and planning for them is essential.
Below is a practical baseline and recommended setup for the tools discussed, assuming light personal use (single‑user, low concurrency). Scale up cores/RAM for multi‑user or parallel jobs.
Quick reference table
| Tool | CPU (minimum → recommended) | System RAM | GPU/VRAM | Storage notes |
|---|---|---|---|---|
| Ollama (LLMs) | 4 cores → 8–16 cores | 16 GB → 32–64 GB | Strongly benefits from GPU; see model sizes below | Models: 4–30+ GB each (quantized), keep SSD fast |
| OpenWebUI | 2 cores → 4–8 cores | 4 GB → 8–16 GB | Optional (UI itself doesn’t need GPU) | Minimal; caches/config only |
| n8n (automation) | 2 cores → 4–8 cores | 4 GB → 8–16 GB | N/A | Depends on DB/log retention |
| LocalAI | 4 cores → 8–16 cores | 16 GB → 32 GB | Optional; GPU improves latency like Ollama | Model footprints similar to Ollama |
| AnythingLLM | 4 cores → 8 cores | 8 GB → 16–32 GB | Optional; NPU/GPU can accelerate | Index size grows with docs; SSD helps |
| Whisper (STT) | 4 cores → 8–12 cores | 8 GB → 16 GB | GPU strongly recommended (6–12 GB VRAM) | Models ~1–3 GB; temp files for audio |
| WhisperX | 8 cores → 16+ cores | 16 GB → 32 GB | Prefer 10–12+ GB VRAM for best speed | Alignment models add overhead |
| Stable Diffusion WebUI | 4 cores → 8–12 cores | 16 GB → 32 GB | 8 GB VRAM minimum; 12–24 GB ideal | Models/checkpoints 2–7 GB each; many extensions |
| PrivateGPT (RAG) | 4 cores → 8–16 cores | 16 GB → 32–64 GB | Optional; GPU speeds generation | Vector DB can grow large; fast NVMe helpful |
| LibreChat | 2 cores → 4–8 cores | 4 GB → 8–16 GB | Optional (UI only) | Minimal; relies on backend resources |
LLM model size guidance (Ollama/LocalAI/AnythingLLM backends)
| Class | Example Models | CPU-only RAM | GPU VRAM | Notes |
|---|---|---|---|---|
| 7B | Llama‑3 8B, Mistral 7B | 16–32 GB (quantized GGUF) | 6–8 GB workable for Q4/Q5; 8–12 GB faster | Comfortable inference with quantization |
| 13B | — | 24–48 GB (quantized) | 10–16 GB recommended; <10 GB needs heavier quantization or CPU offload | Balance of quality and speed |
| 34–70B | — | 64–128+ GB; high latency, low throughput | 24–48+ GB; often multiple GPUs or tensor parallelism | Practical use needs multi‑GPU setups |

Storage note: quantized GGUF files typically weigh ~3–30+ GB per model; keep models, embeddings, and RAG indices on fast NVMe.
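A useful rule of thumb behind these numbers: weight memory is roughly parameter count times bits-per-weight divided by 8, and the KV cache adds more on top as context length and concurrency grow. A rough back-of-the-envelope sketch (treat the constants as approximations, not vendor specs):

```python
def approx_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough size of the model weights alone, ignoring runtime overhead and KV cache."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at 4-bit quantization vs. full 16-bit precision:
print(f"7B @ Q4   ≈ {approx_weight_gb(7, 4):.1f} GB")    # ~3.5 GB
print(f"7B @ FP16 ≈ {approx_weight_gb(7, 16):.1f} GB")   # ~14 GB

# A 70B model shows why multi-GPU setups come up:
print(f"70B @ Q4  ≈ {approx_weight_gb(70, 4):.1f} GB")   # ~35 GB, before KV cache
```

This is why Q4/Q5 quantization is the difference between a 7B model fitting on an 8 GB GPU and not fitting at all.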
Practical sizing tiers
Minimal single‑user stack (CPU‑first):
- 8 cores CPU, 32 GB RAM, fast NVMe.
- Run 7B LLMs, Whisper on CPU for short audio clips, and Stable Diffusion at low settings (CPU‑only generation is slow).
Balanced home lab:
- 12–16 cores CPU, 64 GB RAM, GPU with 12 GB VRAM, NVMe storage (1–2 TB).
- Smooth 7–13B LLMs, WhisperX at good speed, Stable Diffusion with ControlNet, moderate RAG indexes.
Power user:
- 24+ cores CPU, 128 GB RAM, GPU with 24–48 GB VRAM, multiple NVMe drives.
- Larger LLMs, high‑concurrency chats, fast transcription, complex image pipelines, sizable document corpora.
Optimization tips
- Prefer quantized GGUF models (Q4/Q5) to reduce RAM/VRAM while keeping quality reasonable.
- Use NVMe SSDs for models, embeddings, and indices; avoid HDD for hot paths.
- Pin workloads to separate containers/VMs to prevent memory contention, and set VRAM offload limits explicitly.
- Monitor token throughput, context length, and KV cache size; they drive memory use more than raw parameters alone.
- Batch jobs (WhisperX, embedding generation) during off‑hours to smooth CPU/GPU contention.
Common pitfalls
- Slow inference on CPU: switch to Q4_K_M quantized models; reduce context window; disable unnecessary tools
- GPU OOM: lower batch size; reduce KV cache; offload layers to CPU
- WhisperX stutters: update CUDA/cuDNN; ensure ffmpeg is recent; use smaller model for long audio
- Stable Diffusion VRAM errors: lower resolution; disable ControlNet; use xformers
Privacy checklist
- Run everything on a private VLAN; block egress for model endpoints
- Disable telemetry; review container images and tags
- Encrypt disks for models/indices; back up RAG stores
- Use reverse proxy with auth (Authelia/Keycloak) for shared UIs
The payoff is control. You decide what data gets processed, where it’s stored, and which models you trust—while learning and having a lot of fun along the way.
Ready to build?
There’s never been a better time to build a private AI stack at home. With tools like Ollama, OpenWebUI, n8n, LocalAI, AnythingLLM, Whisper/WhisperX, Stable Diffusion WebUI, PrivateGPT, and LibreChat, you can run a complete AI ecosystem locally—no cloud required. Start with Ollama, add an interface, then automate. From there, layer in transcription, image generation, and private document chat. It’s powerful, practical, and entirely yours.
Start with Ollama + OpenWebUI today, then add n8n for one automation.