OllamaChat
Self-Hosted AI Chat Platform
A self-hosted ChatGPT alternative that runs entirely on your own machine using Ollama. It can search your documents to answer questions, remember things across conversations, automatically switch to the right model for coding or vision tasks, use tools to search the web, and optionally speak and listen, all without sending anything to the cloud.

Project Overview
Role: Full-Stack Engineer
Duration: Ongoing
Team: Solo Project
Year: 2026
GitHub: View Code
Blog Post: Read Article
Technologies Used
Project Details
I built this project to truly understand how AI tools work under the hood, not just use them as a black box. What started as a simple Ollama playground grew into a fully featured local AI platform. It searches your own documents to give grounded answers with confidence scoring (RAG), remembers useful facts across sessions, detects coding questions and image attachments and routes them to the right model, runs an agentic tool-use loop for web search and URL fetching, supports extended reasoning with think blocks, and offers optional voice input and output via a locally hosted speech service. Everything runs on your own hardware: no subscriptions, no data leaving your machine.
Challenge
Build a self-hosted AI chat app that matches cloud tools in capability while running completely locally. The hard parts: implementing document search without a third-party vector database, building a memory system that captures useful context without flooding every message with noise, detecting coding and vision intent to route to the right model, adding an agentic tool-use loop, supporting extended model reasoning, and adding voice I/O, all in a clean, fast UI.
Solution
Document search is powered by SQLite with native vector support (libSQL), so there's no need for an external service like Pinecone. Every message goes through a multi-stage pipeline: system instructions, then relevant memories ranked by relevance and recency, then matching document excerpts with grounding confidence scores, then conversation history, all streamed live. An agentic loop lets the model call tools like web search and URL fetching across up to five rounds before composing a final answer. Model routing auto-detects coding patterns, image attachments, and vision capability to transparently switch models mid-conversation. The memory system auto-extracts facts per turn, scores them by relevance, recency, and frequency, and supports superseding outdated memories. Voice runs through a Docker sidecar using Whisper for speech-to-text and Kokoro for text-to-speech, with intelligent sentence splitting for natural speech pacing.
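A minimal sketch of the multi-stage pipeline order described above (system prompt → memories → document excerpts → history). The function and message shapes are illustrative stand-ins, not the project's actual API:

```python
def build_messages(system_prompt, memories, excerpts, history, user_text):
    """Assemble the context window in pipeline order:
    system -> memories -> document excerpts -> history -> new message.
    All names here are hypothetical, for illustration only."""
    context_parts = [system_prompt]
    if memories:
        context_parts.append(
            "Relevant memories:\n" + "\n".join(f"- {m}" for m in memories))
    if excerpts:
        context_parts.append(
            "Document excerpts:\n"
            + "\n".join(f"[{i + 1}] {e}" for i, e in enumerate(excerpts)))
    messages = [{"role": "system", "content": "\n\n".join(context_parts)}]
    messages.extend(history)  # prior turns, oldest first
    messages.append({"role": "user", "content": user_text})
    return messages
```

Keeping memories and excerpts inside a single system message keeps the turn structure clean for the model while still letting each stage be toggled off per conversation.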

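The memory ranking described above combines relevance, recency, and frequency. A sketch of one way to blend them; the weights and half-life below are illustrative assumptions, not the project's tuned values:

```python
import math
import time


def memory_score(similarity, last_used_ts, use_count,
                 now=None, half_life_days=30.0):
    """Blend relevance, recency, and frequency into one ranking score.
    Weights (0.6/0.3/0.1) and the 30-day half-life are illustrative."""
    now = now if now is not None else time.time()
    age_days = max(0.0, (now - last_used_ts) / 86400.0)
    recency = 0.5 ** (age_days / half_life_days)  # exponential decay
    frequency = math.log1p(use_count)             # diminishing returns
    return 0.6 * similarity + 0.3 * recency + 0.1 * frequency
```

Exponential decay lets stale memories fade gracefully instead of dropping out abruptly, while the log on use count stops a single frequently-recalled fact from dominating every turn.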


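The agentic loop above (tool calls across up to five rounds before a final answer) can be sketched like this. `chat` and the tool registry are hypothetical callables standing in for the real model client:

```python
MAX_ROUNDS = 5  # round cap from the description above


def run_agent(chat, tools, messages):
    """Ask the model; if it requests a tool, run it, append the result,
    and ask again; stop at a plain answer or the round cap.
    `chat` and `tools` are illustrative stand-ins, not a real API."""
    for _ in range(MAX_ROUNDS):
        reply = chat(messages)
        calls = reply.get("tool_calls")
        if not calls:
            return reply["content"]  # final answer, no more tools wanted
        messages.append(reply)
        for call in calls:
            result = tools[call["name"]](**call["arguments"])
            messages.append(
                {"role": "tool", "name": call["name"], "content": result})
    # Round cap reached: force a final answer from what was gathered.
    forced = messages + [{"role": "user",
                          "content": "Answer now with what you have."}]
    return chat(forced)["content"]
```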

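Separating extended reasoning from the visible answer can be sketched as a small token router. This simplified version assumes the `<think>` / `</think>` tags arrive as whole tokens, which sidesteps the cross-token buffering the real streaming path needs:

```python
def split_think(tokens):
    """Route streamed tokens into 'answer' vs 'think' channels.
    Assumes tags arrive as whole tokens (a simplifying assumption)."""
    answer, thoughts = [], []
    in_think = False
    for tok in tokens:
        if tok == "<think>":
            in_think = True
        elif tok == "</think>":
            in_think = False
        else:
            (thoughts if in_think else answer).append(tok)
    return "".join(answer), "".join(thoughts)
```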

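The sentence splitting that gives TTS its natural pacing can be sketched as below. The regex and length cap are illustrative choices, not the project's exact rules:

```python
import re


def split_sentences(text, max_len=200):
    """Split text into sentence-sized chunks so TTS can start speaking
    early and pause naturally. Breaks after ., !, or ? followed by
    whitespace; long run-ons are cut at the last space before max_len."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    for part in parts:
        while len(part) > max_len:
            cut = part.rfind(" ", 0, max_len)
            if cut <= 0:
                cut = max_len  # no space found: hard cut
            chunks.append(part[:cut].strip())
            part = part[cut:].strip()
        if part:
            chunks.append(part)
    return chunks
```

Feeding each chunk to the TTS engine as soon as it is complete keeps latency low: playback of the first sentence starts while the model is still generating the rest.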
Key Features
- Chat with any locally installed Ollama model with real-time streaming responses
- Automatically switches to a dedicated coding model when it detects coding questions
- Auto-routes image attachments to vision-capable models with drag-and-drop and clipboard paste support
- Upload documents, PDFs, code files, or web URLs to a searchable knowledge base
- Searches your documents and injects relevant excerpts into every answer, with confidence scoring and source citations
- Remembers preferences and facts across conversations with automatic extraction, relevance ranking, and memory superseding
- Agentic tool-use loop that can search the web and fetch URLs across multiple rounds before answering
- Supports extended reasoning with think-block streaming for compatible models
- Push-to-talk voice input and spoken responses via locally-hosted speech models with natural sentence pacing
- Per-conversation toggles for RAG, memory, agent mode, and custom system prompts
- Persistent conversation history with automatically generated titles
- Watch a folder and automatically index new files as they are added
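The coding/vision routing above can be sketched with simple heuristics. The patterns and model names below are assumptions for illustration; the source does not show the real detector:

```python
import re

# Illustrative heuristics only; the project's actual patterns
# and model names are not shown in the source.
CODE_PATTERNS = [
    r"```",                                           # fenced code block
    r"\b(def|class|import|function|const|let)\b",     # code keywords
    r"\b(fix|debug|refactor|implement)\b.*\b(code|function|bug|script)\b",
    r"\b(python|javascript|typescript|rust|sql)\b",   # language names
]


def pick_model(text, has_image, default="llama3.1",
               coding="qwen2.5-coder", vision="llava"):
    """Route to a vision model for image attachments, a coding model
    for coding-looking prompts, else the default chat model."""
    if has_image:
        return vision
    if any(re.search(p, text, re.IGNORECASE) for p in CODE_PATTERNS):
        return coding
    return default
```

Because routing happens per message, a conversation can move between models transparently: paste an image and the vision model answers, then ask a follow-up coding question and the coding model takes over.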
Results & Impact
- ✓ Built document search directly into SQLite using native vector embeddings, with no external vector database needed
- ✓ Implemented a multi-stage message pipeline (system prompt → memory → document context → history) with live streaming
- ✓ Created a memory system that automatically captures, ranks, and recalls preferences and facts across conversations
- ✓ Built smart model routing that detects coding intent, image attachments, and vision capability to switch models transparently
- ✓ Implemented an agentic tool-use loop with web search and URL fetching across up to five rounds per turn
- ✓ Added grounding confidence scoring (high/medium/low) to RAG answers with source citations
- ✓ Supported extended reasoning with streaming think-block detection and buffering
- ✓ Added optional voice I/O using locally hosted Whisper (STT) and Kokoro (TTS) with intelligent sentence splitting
- ✓ Supported document ingestion for Markdown, PDFs, code files, and live web URLs with language-aware chunking
- ✓ Built per-conversation toggles for RAG, memory, agent mode, and custom system prompts