OllamaChat

Self-Hosted AI Chat Platform

A self-hosted ChatGPT alternative that runs entirely on your own machine using Ollama. It can search your documents to answer questions, remember things across conversations, automatically switch to the right model for coding or vision tasks, use tools to search the web, and optionally speak and listen, all without sending anything to the cloud.

OllamaChat interface showing a conversation with RAG citations, memory badge, and dark-themed sidebar

Project Overview

Role: Full-Stack Engineer

Duration: Ongoing

Team: Solo Project

Year: 2026

GitHub: View Code

Blog Post: Read Article


Technologies Used

Next.js 16 · React 19 · TypeScript · Tailwind CSS v4 · Prisma v7 · libSQL · Ollama · Server-Sent Events · Chokidar · pdf-parse · Cheerio · Docker

Project Details

I built this project to truly understand how AI tools work under the hood, not just use them as a black box. What started as a simple Ollama playground grew into a fully featured local AI platform. It searches your own documents to give grounded answers with confidence scoring (RAG), remembers useful things you've told it across sessions, and detects when you're asking a coding question or sending an image so it can route to the right model. It also runs an agentic tool-use loop for web search and URL fetching, supports extended reasoning with think blocks, and offers optional voice input and output via a locally hosted speech service. Everything runs on your own hardware: no subscriptions, no data leaving your machine.
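
The streaming think-block handling mentioned above can be sketched as a small stateful parser that splits a live token stream into reasoning text and answer text. The `<think>`/`</think>` tag names are an assumption (the convention used by DeepSeek-R1-style models served through Ollama), not necessarily this project's exact format:

```typescript
// Minimal sketch of streaming think-block separation. The <think> tag
// convention is assumed, not confirmed from the project's source.
type ThinkChunk = { kind: "think" | "answer"; text: string };

class ThinkBlockParser {
  private buffer = "";
  private inThink = false;

  // Feed one streamed token; returns chunks that are safe to display.
  push(token: string): ThinkChunk[] {
    this.buffer += token;
    const out: ThinkChunk[] = [];
    for (;;) {
      const tag = this.inThink ? "</think>" : "<think>";
      const idx = this.buffer.indexOf(tag);
      if (idx === -1) {
        // Hold back enough characters to cover a partially streamed tag.
        const safe = this.buffer.length - (tag.length - 1);
        if (safe > 0) {
          out.push({
            kind: this.inThink ? "think" : "answer",
            text: this.buffer.slice(0, safe),
          });
          this.buffer = this.buffer.slice(safe);
        }
        return out;
      }
      if (idx > 0) {
        out.push({
          kind: this.inThink ? "think" : "answer",
          text: this.buffer.slice(0, idx),
        });
      }
      this.buffer = this.buffer.slice(idx + tag.length);
      this.inThink = !this.inThink;
    }
  }

  // Emit whatever remains once the stream ends.
  flush(): ThinkChunk[] {
    if (!this.buffer) return [];
    const text = this.buffer;
    this.buffer = "";
    return [{ kind: this.inThink ? "think" : "answer", text }];
  }
}
```

Buffering a partial tag at the end of each chunk is what lets the UI render reasoning and answer into separate panes without ever showing a half-received `</thi` to the user.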

Challenge

Build a self-hosted AI chat app that matches cloud tools in capability while running completely locally. The hard parts: implementing document search without a third-party vector database, building a memory system that captures useful context without flooding every message with noise, detecting coding and vision intent to route to the right model, adding an agentic tool-use loop, supporting extended model reasoning, and adding voice I/O, all in a clean, fast UI.

Solution

Document search is powered by SQLite with native vector support (libSQL), so there's no need for an external service like Pinecone. Every message goes through a multi-stage pipeline: system instructions, then memories ranked by relevance and recency, then matching document excerpts with grounding confidence scores, then conversation history, all streamed live. An agentic loop lets the model call tools like web search and URL fetching across up to five rounds before composing a final answer. Model routing auto-detects coding patterns, image attachments, and vision capability to transparently switch models mid-conversation. The memory system auto-extracts facts per turn, scores them by relevance, recency, and frequency, and supports superseding outdated memories. Voice runs through a Docker sidecar using Whisper for speech-to-text and Kokoro for text-to-speech, with intelligent sentence splitting for natural speech pacing.
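
The multi-stage pipeline described here can be sketched roughly as follows. The types, section headers, and field names are illustrative assumptions, not the project's actual schema:

```typescript
// Hypothetical sketch of the system → memory → document context → history
// prompt assembly. All names here are illustrative.
type Msg = { role: "system" | "user" | "assistant"; content: string };

interface PipelineInput {
  systemPrompt: string;
  memories: string[]; // already ranked by relevance and recency
  excerpts: {
    source: string;
    text: string;
    confidence: "high" | "medium" | "low";
  }[];
  history: Msg[]; // prior turns in this conversation
  userMessage: string;
}

function buildMessages(input: PipelineInput): Msg[] {
  const sections: string[] = [input.systemPrompt];

  if (input.memories.length > 0) {
    sections.push(
      "Relevant memories:\n" + input.memories.map(m => `- ${m}`).join("\n")
    );
  }
  if (input.excerpts.length > 0) {
    sections.push(
      "Document context (cite sources):\n" +
        input.excerpts
          .map(e => `[${e.source}] (${e.confidence} confidence)\n${e.text}`)
          .join("\n\n")
    );
  }

  return [
    { role: "system", content: sections.join("\n\n") },
    ...input.history,
    { role: "user", content: input.userMessage },
  ];
}
```

Folding memories and excerpts into a single system message keeps retrieved context out of the visible transcript, so the streamed reply can still cite sources without the chat log filling up with raw chunks.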

Chat interface showing a live conversation with streaming response and sidebar
Real-time streaming chat with conversation history, model selector, and RAG toggle
Memory page showing auto-captured facts and preferences across conversations
Memory manager with auto-captured items, search, filter, and usage tracking
Settings page showing model configuration, voice options, pipeline parameters, and watched folders
Configurable model settings, voice I/O, pipeline parameters, and file watcher paths
Clean overview of the OllamaChat interface without an active conversation
Full application overview with sidebar, model selector, and empty chat state
Knowledge base page showing uploaded documents with indexing status
Document management with drag-and-drop upload, URL ingestion, and chunk browsing

Key Features

  • Chat with any locally installed Ollama model with real-time streaming responses
  • Automatically switches to a dedicated coding model when it detects coding questions
  • Auto-routes image attachments to vision-capable models with drag-and-drop and clipboard paste support
  • Upload documents, PDFs, code files, or web URLs to a searchable knowledge base
  • Searches your documents and injects relevant excerpts into every answer, with confidence scoring and source citations
  • Remembers preferences and facts across conversations with automatic extraction, relevance ranking, and memory superseding
  • Agentic tool-use loop that can search the web and fetch URLs across multiple rounds before answering
  • Supports extended reasoning with think-block streaming for compatible models
  • Push-to-talk voice input and spoken responses via locally-hosted speech models with natural sentence pacing
  • Per-conversation toggles for RAG, memory, agent mode, and custom system prompts
  • Persistent conversation history with automatically generated titles
  • Watch a folder and automatically index new files as they are added
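
As a rough illustration of the agentic tool-use feature above, here is a minimal sketch of a bounded loop that lets the model request tools (web search, URL fetch) for up to five rounds before it must answer. The `chat` and tool interfaces are hypothetical stand-ins, not Ollama's real API shapes:

```typescript
// Sketch of a bounded agentic tool-use loop. Interfaces are assumptions.
type ToolCall = { name: string; args: Record<string, string> };
type ModelTurn = { toolCalls: ToolCall[]; content: string };

const MAX_ROUNDS = 5; // matches the five-round cap described above

async function agentLoop(
  chat: (transcript: string[]) => Promise<ModelTurn>,
  tools: Record<string, (args: Record<string, string>) => Promise<string>>,
  userMessage: string
): Promise<string> {
  const transcript = [userMessage];

  for (let round = 0; round < MAX_ROUNDS; round++) {
    const turn = await chat(transcript);
    if (turn.toolCalls.length === 0) return turn.content; // final answer

    // Run each requested tool and feed its result back to the model.
    for (const call of turn.toolCalls) {
      const tool = tools[call.name];
      const result = tool
        ? await tool(call.args)
        : `unknown tool: ${call.name}`;
      transcript.push(`[${call.name}] ${result}`);
    }
  }

  // Budget exhausted: force a final answer with no further tool use.
  const final = await chat([...transcript, "(answer now without tools)"]);
  return final.content;
}
```

The hard cap matters with local models: without it, a small model that keeps hallucinating tool calls can loop forever on your own GPU.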

Results & Impact

  • Built document search directly into SQLite using native vector embeddings with no external vector database needed

  • Implemented a multi-stage message pipeline (system prompt → memory → document context → history) with live streaming

  • Created a memory system that automatically captures, ranks, and recalls preferences and facts across conversations

  • Built smart model routing that detects coding intent, image attachments, and vision capability to switch models transparently

  • Implemented an agentic tool-use loop with web search and URL fetching across up to five rounds per turn

  • Added grounding confidence scoring (high/medium/low) on RAG answers with source citations

  • Supported extended reasoning with streaming think-block detection and buffering

  • Added optional voice I/O using locally-hosted Whisper (STT) and Kokoro (TTS) with intelligent sentence splitting

  • Supported document ingestion for Markdown, PDFs, code files, and live web URLs with language-aware chunking

  • Built per-conversation toggles for RAG, memory, agent mode, and custom system prompts
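
The relevance/recency/frequency ranking behind the memory system could look something like the sketch below. The weights, half-life, and saturation point are illustrative guesses, not the project's tuned values:

```typescript
// Hypothetical memory scoring: blend similarity to the current message
// with exponential recency decay and a saturating frequency bonus.
interface Memory {
  text: string;
  relevance: number;  // 0..1 similarity to the current message
  lastUsedAt: number; // epoch milliseconds
  useCount: number;
}

const HALF_LIFE_MS = 7 * 24 * 60 * 60 * 1000; // recency halves each week (guess)

function scoreMemory(m: Memory, now: number): number {
  const recency = Math.pow(0.5, (now - m.lastUsedAt) / HALF_LIFE_MS);
  // log1p saturates the frequency bonus around 50 uses.
  const frequency = Math.min(Math.log1p(m.useCount) / Math.log1p(50), 1);
  return 0.6 * m.relevance + 0.25 * recency + 0.15 * frequency;
}

function topMemories(memories: Memory[], now: number, k = 5): Memory[] {
  return [...memories]
    .sort((a, b) => scoreMemory(b, now) - scoreMemory(a, now))
    .slice(0, k);
}
```

Capping the injection at the top few memories is what keeps the feature from "flooding every message with noise": a barely relevant fact from a month ago scores low on all three axes and never makes the cut.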