Self-Hosted AI Tools To Run in Your Home Lab
- Who it’s for
- Why self-hosted AI matters
- What you’ll build
- Start here (90-minute setup)
- AI Tools
- Ollama: Your local AI runtime
- OpenWebUI: A familiar chat interface for local models
- n8n: Automation and agentic workflows
- LocalAI: The “one-container” option
- AnythingLLM: Client app with RAG and document chat
- Whisper and WhisperX: Local speech-to-text
- Stable Diffusion WebUI: Local image generation
- PrivateGPT: Offline RAG over your documents
- LibreChat: A flexible, shared AI workspace
- How the pieces fit together
- Resource Planning
- Quick reference table
- LLM model size guidance (Ollama/LocalAI/AnythingLLM backends)
- Practical sizing tiers
- Optimization tips
- Common pitfalls
- Privacy checklist
- Ready to build?
If you’ve been eyeing AI but want privacy, control, and the joy of tinkering, 2025 is the year to build your own stack at home. Thanks to open-source distilled models, GPU acceleration, and a maturing ecosystem, you can run powerful language and image models locally, with no cloud required. Here’s a practical guide to the best tools, how they fit together, and where to start.
Who it’s for
If you’re running Proxmox, Docker, or a single mini PC and want private AI you can tinker with, this guide is for you.
Why self-hosted AI matters
- Privacy: Your prompts, chats, and files stay on your hardware under your control. No third-party access or vendor lock-in.
- Learning: Hosting models teaches you how inference works, what GPU memory means for performance, and how different models behave under load.
- Integration: Local models plug into your existing self-hosted services, enabling automations, assistants, and end-to-end workflows.
- Fun: A few years ago, a private AI chat system sounded like sci-fi. Today, it’s a weekend project.
What you’ll build
A local stack with Ollama for models, OpenWebUI for chat, n8n for automations, WhisperX for transcription, Stable Diffusion for images, and PrivateGPT/AnythingLLM for offline RAG.
Start here (90-minute setup)
- Install Ollama; pull a 7B model (Mistral 7B or Llama‑3 8B quantized GGUF).
- Deploy OpenWebUI and connect to Ollama.
- Add n8n for one simple workflow (RSS → summarize → post).
- Optional: WhisperX for transcription and Stable Diffusion WebUI for images when you have GPU headroom.
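Once steps 1 and 2 are running, it’s worth a quick sanity check before wiring anything else up. Here’s a minimal sketch that lists the models Ollama has pulled via its local REST API; it assumes Ollama is on its default port (11434) and that you’ve already pulled at least one model.

```python
import requests  # pip install requests

# Ollama's default local endpoint; adjust if you mapped a different port.
OLLAMA = "http://localhost:11434"

# List pulled models (roughly equivalent to `ollama list`).
resp = requests.get(f"{OLLAMA}/api/tags", timeout=10)
resp.raise_for_status()
print("Installed models:", [m["name"] for m in resp.json()["models"]])
```

If this prints your 7B model, OpenWebUI and everything else in this guide can talk to the same endpoint.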
AI Tools
Ollama: Your local AI runtime
Think of Ollama as the engine for running large language models (LLMs) locally. It’s lightweight, easy to install, and runs great in Docker or LXC. You can pull and manage models like GPT-OSS, Gemma, Llama 3, Phi-3/4, Mistral, DeepSeek, and many more. Ollama exposes a local API endpoint, which other tools use to provide a chat interface, automations, or RAG.
Key points:
- Deployment: Docker/LXC friendly; simple to stand up on Proxmox, Docker Desktop, or a mini PC.
- Acceleration: NVIDIA and AMD GPU support; CPU fallback if you don’t have a discrete GPU.
- Role: The core engine you set up once, then connect everything else to it.
Resources:
- Acts as the core runtime; its footprint is dominated by the model(s) you load.
- More CPU cores help with token throughput when running CPU‑only; GPU offload gives the biggest gains.
- Concurrency increases memory pressure because each active context has its own KV cache.
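To see what “other tools use the API” looks like in practice, here is a minimal sketch that sends a single prompt to Ollama’s generate endpoint. The model name is just an example; use whatever you’ve pulled, and adjust the port if you changed the default.

```python
import requests  # pip install requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",   # example: any model you've pulled with `ollama pull`
        "prompt": "Explain what a KV cache is in two sentences.",
        "stream": False,      # return one JSON object instead of a token stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```

Every chat UI, automation, and RAG tool in this stack is ultimately doing some variation of this call.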
OpenWebUI: A familiar chat interface for local models
Ollama handles models; OpenWebUI gives you the chat experience. It looks and feels like ChatGPT, but it’s open-source and backend-agnostic. Connect it to Ollama’s API, choose your model, tweak parameters, and start chatting—entirely in your home lab.
Highlights:
- Multiple models, image generation, prompt templates, chat history, and custom instructions.
- Admin features to download and manage models directly from the UI.
- A full self-hosted ChatGPT alternative that lives inside your lab.
Resources:
- Primarily a lightweight UI; the heavy lifting happens in the backend (Ollama/LocalAI/OpenAI).
- Scale cores/RAM with the number of simultaneous users and background jobs (image generation plug‑ins, etc.).
n8n: Automation and agentic workflows
If you’ve used Zapier or Make, n8n will feel familiar—except you self-host it. It’s an open-source workflow tool you can wire into Ollama and other APIs to automate summaries, posts, notifications, and even CI/CD actions.
Example automations:
- Pull RSS items from FreshRSS, summarize with Ollama, and post to Mastodon.
- Analyze home lab logs daily, flag anomalies, and send a dashboard summary.
- Inspect CI/CD runs, explain failures, and re-trigger pipelines automatically.
Resources:
- CPU/RAM scale with the number and complexity of workflows, and the database (PostgreSQL) size.
- When orchestrating LLM and STT jobs, most resource impact occurs in the called services (Ollama, WhisperX).
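To make the first example automation concrete, here is a rough Python sketch of what such a workflow does end to end. In n8n you would build this with its RSS, HTTP Request, and Mastodon nodes instead; the feed URL, instance URL, access token, and model name below are all placeholders.

```python
import requests     # pip install requests
import feedparser   # pip install feedparser

FEED_URL = "https://example.com/feed.xml"           # placeholder: your FreshRSS feed
OLLAMA = "http://localhost:11434/api/generate"
MASTODON = "https://your.instance/api/v1/statuses"  # placeholder Mastodon instance
TOKEN = "YOUR_ACCESS_TOKEN"                         # placeholder API token

# 1. Pull the newest item from the feed.
entry = feedparser.parse(FEED_URL).entries[0]

# 2. Ask a local model for a short summary.
summary = requests.post(OLLAMA, json={
    "model": "mistral",
    "prompt": f"Summarize this article in two sentences:\n\n{entry.title}\n{entry.summary}",
    "stream": False,
}, timeout=300).json()["response"]

# 3. Post the summary plus the link to Mastodon.
requests.post(
    MASTODON,
    headers={"Authorization": f"Bearer {TOKEN}"},
    data={"status": f"{summary}\n\n{entry.link}"},
    timeout=30,
).raise_for_status()
```

The point of n8n is that you get this same flow with retries, scheduling, and credentials management, without maintaining a script.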
LocalAI: The “one-container” option
Want something simpler than pairing Ollama with OpenWebUI? LocalAI is a single container that packages model management, an OpenAI-compatible API, and a web interface. Under the hood it runs on similar backends (llama.cpp and friends), so you get a comparable experience without stitching multiple tools together.
What you get:
- One Docker command to launch; supports both CPU and GPU acceleration.
- Text and image model support; works with familiar sources and formats such as Hugging Face models and quantized GGUF files.
- Ideal for quick deployment without sacrificing local control.
Resources:
- “One‑container” convenience with resource profile similar to Ollama because it uses similar runtimes under the hood.
- GPU support and quantization choice determine latency.
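Because LocalAI speaks the OpenAI API, most existing OpenAI client libraries work against it by just changing the base URL. A minimal sketch, assuming LocalAI is listening on its default port (commonly 8080) and the model name matches one you’ve actually installed:

```python
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at LocalAI instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

reply = client.chat.completions.create(
    model="mistral-7b-instruct",  # example name; use a model installed in LocalAI
    messages=[{"role": "user", "content": "Give me three home lab project ideas."}],
)
print(reply.choices[0].message.content)
```

This compatibility is also why LocalAI slots in behind tools like OpenWebUI and LibreChat with no code changes.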
AnythingLLM: Client app with RAG and document chat
Mintplex Labs’ AnythingLLM is an all-in-one platform that shines for document interaction and RAG. It runs as a desktop app on your workstation (similar to LM Studio) or as a Docker container. You can upload PDFs and markdown, or sync a GitHub repo; it indexes everything locally so you can chat over your own content.
Notable features:
- Integrates with Ollama and OpenAI; keeps data local.
- Webhook support for automations; role-based access controls.
- NPU support on Snapdragon X Elite devices (Windows ARM64), delivering ~30% RAG performance gains, bringing otherwise-idle NPUs into play.
Resources:
- RAG indexing is memory‑ and disk‑intensive; larger corpora benefit from 16–32 GB RAM and fast NVMe.
- If available, NPU/GPU acceleration improves embedding and inference speed; otherwise expect CPU‑bound latency.
Whisper and WhisperX: Local speech-to-text
For transcription, OpenAI’s Whisper is accurate—even with noisy audio—and runs locally. WhisperX builds on it with better GPU acceleration, timestamp alignment, and speed.
Use cases:
- Transcribe YouTube videos, podcasts, or meeting recordings privately.
- Containerize and automate: watch a folder for new audio, transcribe with WhisperX, summarize with Ollama, and deliver results via email or dashboards.
Resources:
- CPU‑only runs are feasible but slow for long audio. GPU with 10–12 GB VRAM offers a big speedup.
- WhisperX adds alignment and typically more GPU memory use; plan higher VRAM and RAM for batch pipelines.
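As a concrete starting point, the open-source whisper package gives you transcription in a few lines. A minimal sketch follows; the audio filename is a placeholder, and note that WhisperX has its own, slightly different API for alignment and batching.

```python
import whisper  # pip install openai-whisper (requires ffmpeg on the system)

# Smaller models ("tiny", "base", "small") run acceptably on CPU;
# "medium" and "large" really want a GPU.
model = whisper.load_model("small")

result = model.transcribe("meeting_recording.mp3")  # placeholder file
print(result["text"])

# Segments carry timestamps you can feed into n8n or a summarization step.
for seg in result["segments"][:5]:
    print(f'[{seg["start"]:.1f}s - {seg["end"]:.1f}s] {seg["text"]}')
```

Wrap this in a watch-folder script or an n8n workflow and you have the private transcription pipeline described above.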
Stable Diffusion WebUI: Local image generation
Stable Diffusion remains the go-to for local image generation, and AUTOMATIC1111’s WebUI makes it accessible and powerful. Run it on your workstation or in Docker with GPU acceleration to create artwork, thumbnails, textures, and more.
Features:
- ControlNet, LoRA fine-tuning, and image upscaling.
- Easy to integrate with automations—for example, generating visuals for blog posts or documentation.
Resources:
- VRAM is the key limiter. 8 GB VRAM can run base models at modest resolutions; 12–24 GB handles ControlNet, high‑res, and upscalers comfortably.
- Disk usage grows quickly with checkpoints, LoRAs, ControlNet models, and outputs—budget 50–200+ GB if you experiment widely.
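For the automation angle, AUTOMATIC1111’s WebUI can expose a REST API when launched with its `--api` flag. Here’s a minimal sketch assuming it’s running locally on the default port 7860; the prompt and generation parameters are just examples.

```python
import base64
import requests  # pip install requests

payload = {
    "prompt": "isometric illustration of a tiny home lab server rack, soft lighting",
    "negative_prompt": "blurry, text, watermark",
    "steps": 25,
    "width": 768,
    "height": 512,
}

resp = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=600)
resp.raise_for_status()

# Images come back base64-encoded; save the first one to disk.
with open("blog_header.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["images"][0]))
```

Calling this from n8n is how “generate visuals for blog posts” becomes a scheduled job rather than a manual step.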
PrivateGPT: Offline RAG over your documents
PrivateGPT gives you a privacy-first chatbot that never touches the cloud. It pairs Ollama or LocalAI with a vector database for retrieval-augmented generation. Feed it your docs and ask questions—fully offline.
Why it’s useful:
- Perfect for querying tech PDFs, internal wikis, or project docs.
- Runs easily in Docker and integrates cleanly with Ollama.
Resources:
- Memory scales with index size and query complexity. For medium corpora, 32–64 GB RAM offers headroom.
- GPU isn’t strictly required but can accelerate the LLM layer; embedding generation benefits from GPU too.
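If you’re curious what RAG actually does before deploying PrivateGPT, the core loop is: embed your documents, retrieve the chunks most similar to the question, and hand them to the model as context. Below is a deliberately tiny sketch of that idea against Ollama’s embeddings and generate endpoints. It is not PrivateGPT’s actual pipeline (which adds chunking, a real vector database, and persistence), and the model names and documents are examples.

```python
import requests     # pip install requests
import numpy as np  # pip install numpy

OLLAMA = "http://localhost:11434"

def embed(text: str) -> np.ndarray:
    # "nomic-embed-text" is one embedding model available through Ollama; any will do.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text}, timeout=120)
    return np.array(r.json()["embedding"])

docs = [
    "The NAS backup job runs nightly at 02:00 and writes to the offsite mirror.",
    "Proxmox VMs live on the ZFS pool named tank; snapshots are kept for 14 days.",
    "The reverse proxy terminates TLS and forwards to the internal Docker network.",
]
question = "When do backups run?"

# Retrieve: rank documents by cosine similarity to the question.
q = embed(question)
scores = [float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
          for d in (embed(doc) for doc in docs)]
context = docs[int(np.argmax(scores))]

# Augment + generate: give the model only the retrieved context.
answer = requests.post(f"{OLLAMA}/api/generate", json={
    "model": "mistral",
    "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    "stream": False,
}, timeout=300).json()["response"]
print(answer)
```

PrivateGPT and AnythingLLM do exactly this at scale, which is why index size and embedding speed dominate their resource needs.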
LibreChat: A flexible, shared AI workspace
LibreChat is a customizable web UI modeled on the ChatGPT interface that connects to various backends—your local Ollama, OpenAI, or others—without vendor lock-in.
Capabilities:
- Multi-model setups, plugins, custom prompts, and chat memory.
- Configure as a shared AI workspace for your household, with controls over storage and access.
How the pieces fit together
Here’s a simple architecture you can run on Proxmox, Docker Swarm, or a single mini PC:
- Ollama: The central model engine
- OpenWebUI: Chat interface to interact with models
- n8n: Workflow automation to orchestrate tasks and summaries
- Whisper/WhisperX: Transcribe audio locally, feed results into n8n
- Stable Diffusion WebUI: Generate images for content workflows
- PrivateGPT or AnythingLLM: Index and query your documentation or code
Resource Planning
CPU, RAM, and GPU VRAM needs vary widely across these tools and models, and planning for them is essential.
Below is a practical baseline and recommended setup for the tools discussed, assuming light personal use (single‑user, low concurrency). Scale up cores/RAM for multi‑user or parallel jobs.
Quick reference table
| Tool | CPU (minimum → recommended) | System RAM | GPU/VRAM | Storage notes |
|---|---|---|---|---|
| Ollama (LLMs) | 4 cores → 8–16 cores | 16 GB → 32–64 GB | Strongly benefits from GPU; see model sizes below | Models: 4–30+ GB each (quantized), keep SSD fast |
| OpenWebUI | 2 cores → 4–8 cores | 4 GB → 8–16 GB | Optional (UI itself doesn’t need GPU) | Minimal; caches/config only |
| n8n (automation) | 2 cores → 4–8 cores | 4 GB → 8–16 GB | N/A | Depends on DB/log retention |
| LocalAI | 4 cores → 8–16 cores | 16 GB → 32 GB | Optional; GPU improves latency like Ollama | Model footprints similar to Ollama |
| AnythingLLM | 4 cores → 8 cores | 8 GB → 16–32 GB | Optional; NPU/GPU can accelerate | Index size grows with docs; SSD helps |
| Whisper (STT) | 4 cores → 8–12 cores | 8 GB → 16 GB | GPU strongly recommended (6–12 GB VRAM) | Models ~1–3 GB; temp files for audio |
| WhisperX | 8 cores → 16+ cores | 16 GB → 32 GB | Prefer 10–12+ GB VRAM for best speed | Alignment models add overhead |
| Stable Diffusion WebUI | 4 cores → 8–12 cores | 16 GB → 32 GB | 8 GB VRAM minimum; 12–24 GB ideal | Models/checkpoints 2–7 GB each; many extensions |
| PrivateGPT (RAG) | 4 cores → 8–16 cores | 16 GB → 32–64 GB | Optional; GPU speeds generation | Vector DB can grow large; fast NVMe helpful |
| LibreChat | 2 cores → 4–8 cores | 4 GB → 8–16 GB | Optional (UI only) | Minimal; relies on backend resources |
LLM model size guidance (Ollama/LocalAI/AnythingLLM backends)
| Class | Example Models | CPU-only RAM | GPU VRAM | Notes |
|---|---|---|---|---|
| 7B | Llama‑3 8B, Mistral 7B | 16–32 GB (quantized GGUF) | 6–8 GB workable for Q4/Q5; 8–12 GB faster | Comfortable inference with quantization |
| 13B | — | 24–48 GB (quantized) | 10–16 GB recommended; <10 GB needs heavier quantization or CPU offload | Balance of quality and speed |
| 34–70B | — | 64–128+ GB; high latency, low throughput | 24–48+ GB; often multiple GPUs or tensor parallelism | Practical use needs multi‑GPU setups |

Storage note: quantized GGUF files typically weigh ~3–30+ GB per model; keep models, embeddings, and RAG indices on fast NVMe.
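A useful rule of thumb behind these numbers: weight memory is roughly parameter count times bits-per-weight divided by 8, and the KV cache adds more on top as context length and concurrency grow. A rough back-of-the-envelope sketch (treat the constants as approximations, not vendor specs):

```python
def approx_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough size of the model weights alone, ignoring runtime overhead and KV cache."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at 4-bit quantization vs. full 16-bit precision:
print(f"7B @ Q4   ≈ {approx_weight_gb(7, 4):.1f} GB")    # ~3.5 GB
print(f"7B @ FP16 ≈ {approx_weight_gb(7, 16):.1f} GB")   # ~14 GB

# A 70B model shows why multi-GPU setups come up:
print(f"70B @ Q4  ≈ {approx_weight_gb(70, 4):.1f} GB")   # ~35 GB, before KV cache
```

This is why Q4/Q5 quantization is the difference between a 7B model fitting on an 8 GB GPU and not fitting at all.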
Practical sizing tiers
Minimal single‑user stack (CPU‑first):
- 8 cores CPU, 32 GB RAM, fast NVMe.
- Run 7B LLMs, Whisper on CPU for short audio clips, and Stable Diffusion at low settings (CPU‑only generation is slow).
Balanced home lab:
- 12–16 cores CPU, 64 GB RAM, GPU with 12 GB VRAM, NVMe storage (1–2 TB).
- Smooth 7–13B LLMs, WhisperX at good speed, Stable Diffusion with ControlNet, moderate RAG indexes.
Power user:
- 24+ cores CPU, 128 GB RAM, GPU with 24–48 GB VRAM, multiple NVMe drives.
- Larger LLMs, high‑concurrency chats, fast transcription, complex image pipelines, sizable document corpora.
Optimization tips
- Prefer quantized GGUF models (Q4/Q5) to reduce RAM/VRAM while keeping quality reasonable.
- Use NVMe SSDs for models, embeddings, and indices; avoid HDD for hot paths.
- Pin workloads to separate containers/VMs to prevent memory contention, and set VRAM offload limits explicitly.
- Monitor token throughput, context length, and KV cache size; they drive memory use more than raw parameters alone.
- Batch jobs (WhisperX, embedding generation) during off‑hours to smooth CPU/GPU contention.
Common pitfalls
- Slow inference on CPU: switch to Q4_K_M quantized models; reduce context window; disable unnecessary tools
- GPU OOM: lower batch size; reduce KV cache; offload layers to CPU
- WhisperX stutters: update CUDA/cuDNN; ensure ffmpeg is recent; use smaller model for long audio
- Stable Diffusion VRAM errors: lower resolution; disable ControlNet; use xformers
Privacy checklist
- Run everything on a private VLAN; block egress for model endpoints
- Disable telemetry; review container images and tags
- Encrypt disks for models/indices; back up RAG stores
- Use reverse proxy with auth (Authelia/Keycloak) for shared UIs
The payoff is control. You decide what data gets processed, where it’s stored, and which models you trust—while learning and having a lot of fun along the way.
Ready to build?
There’s never been a better time to build a private AI stack at home. With tools like Ollama, OpenWebUI, n8n, LocalAI, AnythingLLM, Whisper/WhisperX, Stable Diffusion WebUI, PrivateGPT, and LibreChat, you can run a complete AI ecosystem locally—no cloud required. Start with Ollama, add an interface, then automate. From there, layer in transcription, image generation, and private document chat. It’s powerful, practical, and entirely yours.
Start with Ollama + OpenWebUI today, then add n8n for one automation.