
Building OllamaChat: A Self-Hosted ChatGPT Alternative with RAG, Memory, and Voice

February 26, 2026 · 15 min read
Next.js · React · TypeScript · AI integration

Most AI chat apps send your conversations to a cloud server. Every question you ask, every document you upload, every memory the assistant stores: it all leaves your machine. I wanted something different: a full-featured chat assistant that runs entirely on my hardware, keeps data in a local SQLite file, and never phones home.

The result is OllamaChat, an open-source web app built on Next.js 16, Ollama, and a handful of thoughtful architectural choices. This post walks through the interesting engineering decisions: storing vectors in SQLite without a separate vector database, automatic memory extraction without an LLM in the loop, smart model routing with regex, and optional self-hosted voice via Whisper and Kokoro.

The Stack

The core stack is deliberately boring where it can be:

  • Next.js 16 (App Router) for the full-stack framework
  • React 19 with Server Components for data-heavy pages
  • Tailwind CSS v4 for styling (with manual prose CSS since the typography plugin isn't v4-compatible yet)
  • Prisma v7 with @prisma/adapter-libsql and @libsql/client for the database layer
  • SQLite as the single storage primitive: conversations, documents, chunks, embeddings, memory, and config all live in one dev.db file

For AI, everything runs through Ollama on localhost. Chat uses streaming SSE from /api/chat. Embeddings use /api/embed with nomic-embed-text (768 dimensions). No OpenAI account required.

Storing Vectors in SQLite

The knowledge base needs vector similarity search. The obvious choices are Pinecone, Weaviate, or pgvector, but all of them require running a separate service. I wanted a single SQLite file.

libSQL (the fork of SQLite used by Turso) has native vector support. You can store F32_BLOB(768) columns, create an ANN index, and run cosine distance queries. The catch: Prisma v7 doesn't know about these types, so the vector column can't be in the schema. Instead, I add it with raw SQL after prisma db push:

// setup-vectors.ts, run once after migrations
await client.execute(`
  ALTER TABLE Chunk ADD COLUMN embedding F32_BLOB(768)
`);
await client.execute(`
  CREATE INDEX IF NOT EXISTS chunk_embedding_idx
  ON Chunk(libsql_vector_idx(embedding))
`);

Inserting embeddings also bypasses Prisma:

await client.execute({
  sql: `UPDATE Chunk SET embedding = vector32(?) WHERE id = ?`,
  args: [JSON.stringify(embedding), chunkId],
});

And querying for nearest neighbors:

const rows = await client.execute({
  sql: `
    SELECT c.id, c.content, c.documentId,
           vector_distance_cos(c.embedding, vector32(?)) AS distance
    FROM Chunk c
    WHERE c.embedding IS NOT NULL
    ORDER BY distance ASC
    LIMIT ?
  `,
  args: [JSON.stringify(queryEmbedding), topK],
});

Cosine distance here is 0 for identical vectors and 2 for opposite ones. I convert to similarity with 1 - distance before comparing against the configured threshold.
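
That conversion is trivial but easy to get backwards, so here's the shape of it (the threshold constant here is illustrative; in the app it's read from settings):

```typescript
// libSQL cosine distance: 0 = identical, 2 = opposite.
// Similarity = 1 - distance, giving the familiar [-1, 1] range.
const SIMILARITY_THRESHOLD = 0.72; // illustrative; configurable in the app

interface ScoredChunk {
  id: string;
  content: string;
  distance: number;
}

export function toSimilarity(distance: number): number {
  return 1 - distance;
}

// Attach similarity scores and drop chunks below the threshold.
export function filterByThreshold(rows: ScoredChunk[]) {
  return rows
    .map((r) => ({ ...r, similarity: toSimilarity(r.distance) }))
    .filter((r) => r.similarity >= SIMILARITY_THRESHOLD);
}
```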

The net result: full vector search in a 3 MB SQLite file with no extra daemon running.

RAG with Grounding Confidence

When RAG is enabled for a conversation, retrieved chunks are injected into the prompt context before the conversation history. That part is standard. What's less standard is the grounding confidence system.

After retrieval, the app computes a confidence level based on two signals:

function computeGrounding(chunks: RetrievedChunk[]): GroundingLevel {
  const avgSimilarity =
    chunks.reduce((sum, c) => sum + c.similarity, 0) / chunks.length;
  if (avgSimilarity >= 0.86 && chunks.length >= 2) return "HIGH";
  if (avgSimilarity >= 0.72) return "MEDIUM";
  return "LOW";
}

The thresholds (0.86 and 0.72) came from running an eval script against a test document set. For low-confidence retrievals, a guardrail system prompt is appended before the context:

The following context has low relevance to the question.
Answer based on your own knowledge where the context doesn't help.
If you're uncertain, say so.

This avoids the common failure mode where the model confidently hallucinates details from vaguely related chunks. Each assistant message also stores groundingLevel, avgSimilarity, and chunk citations in the database, so I can query grounding quality across conversations.
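
The guardrail wiring is only a few lines. A hypothetical sketch of the glue (buildRagBlock is not the project's actual function name), using the prompt text above:

```typescript
type GroundingLevel = "HIGH" | "MEDIUM" | "LOW";

// Guardrail text from the post, prepended only for low-confidence retrievals.
const LOW_CONFIDENCE_GUARDRAIL =
  "The following context has low relevance to the question. " +
  "Answer based on your own knowledge where the context doesn't help. " +
  "If you're uncertain, say so.";

// Hypothetical helper: wrap the retrieved context based on grounding level.
export function buildRagBlock(context: string, level: GroundingLevel): string {
  return level === "LOW"
    ? `${LOW_CONFIDENCE_GUARDRAIL}\n\n${context}`
    : context;
}
```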

Document Chunking

The chunker handles markdown, PDF, code files, and plain text. Markdown gets split by heading first, then by token window within each section; this preserves conceptual structure better than naive splitting. Code files detect language from extension and treat the file as a single prose segment (since function boundaries are rarely clean split points at ~512 tokens).

Chunk size, overlap, and token estimation are all configurable from the settings page. Token count is estimated as text.length / 4, which is accurate enough for budget decisions without spinning up a tokenizer.
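
As a sketch of how that estimate feeds budget decisions (packIntoBudget is a hypothetical helper, not the project's API):

```typescript
// Rough token estimate: ~4 characters per token. Good enough for budget
// decisions without pulling in a real tokenizer.
export function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily keep chunks in order until the token budget is exhausted.
export function packIntoBudget(chunks: string[], budget: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk);
    if (used + cost > budget) break;
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```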

File watching is handled by chokidar, started from instrumentation.ts:

export async function register() {
  if (process.env.NEXT_RUNTIME === "nodejs") {
    const { startWatcher } = await import("@/lib/rag/watcher");
    startWatcher();
  }
}

When a watched file changes, the watcher recomputes its SHA-256 hash. If the hash differs from what's stored, the document is re-chunked and re-embedded. This gives automatic re-indexing without polling.
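
The change check itself is small; a sketch using Node's crypto module (helper names are mine, not the project's):

```typescript
import { createHash } from "node:crypto";

// Hash the file contents; only a changed hash triggers re-indexing.
export function sha256(contents: string): string {
  return createHash("sha256").update(contents).digest("hex");
}

export function needsReindex(
  contents: string,
  storedHash: string | null
): boolean {
  return sha256(contents) !== storedHash;
}
```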

Automatic Memory

This is the part I'm most pleased with. After each assistant turn, the app scans the user's message for extractable memory without calling an LLM to do the extraction.

The logic is pattern-based:

const PATTERNS = {
  preference: [
    /\b(i prefer|i like|always use|i want|please (always|don't|never))\b/i,
    /\b(format (it|responses|code) as|respond in)\b/i,
  ],
  fact: [
    /\b(i('m| am)|i work (at|for|on)|my (team|company|stack|project))\b/i,
    /\b(we use|our (stack|setup|workflow) (is|uses))\b/i,
  ],
  decision: [
    /\b(i('ve| have) decided|going with|we're (going|switching|moving) to)\b/i,
  ],
};

Each user message is split into sentences. Sentences that match a pattern, pass length checks (15–200 characters, 4+ words), and don't already exist in the memory store are saved as MemoryItem rows. At most 3 items are extracted per turn to prevent noise accumulation.
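
Putting those rules together, the extractor's core loop looks roughly like this (a condensed sketch with an abbreviated pattern set; names are illustrative):

```typescript
// Abbreviated pattern set; the full one covers preference, fact, decision.
const PATTERNS: Record<string, RegExp[]> = {
  preference: [/\b(i prefer|i like|always use)\b/i],
  fact: [/\b(i work (at|for|on)|my (team|company|stack))\b/i],
};

const MAX_PER_TURN = 3;

export function extractMemories(
  message: string,
  existing: Set<string> // normalized contents already in the memory store
): { type: string; content: string }[] {
  const sentences = message.split(/(?<=[.!?])\s+/);
  const found: { type: string; content: string }[] = [];
  for (const raw of sentences) {
    const sentence = raw.trim();
    const words = sentence.split(/\s+/).length;
    // Length and word-count gates filter out fragments and walls of text.
    if (sentence.length < 15 || sentence.length > 200 || words < 4) continue;
    if (existing.has(sentence.toLowerCase())) continue; // text-based dedup
    for (const [type, regexes] of Object.entries(PATTERNS)) {
      if (regexes.some((r) => r.test(sentence))) {
        found.push({ type, content: sentence });
        break;
      }
    }
    if (found.length >= MAX_PER_TURN) break; // cap noise per turn
  }
  return found;
}
```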

For injection, memory items are ranked by a weighted score:

score = lexicalOverlap(item.content, userMessage) * 0.7
      + recencyScore(item.createdAt) * 0.2
      + frequencyScore(item.useCount) * 0.1;

The top items are packed into a token budget (configurable, default ~500 tokens). Items that fit are injected before the RAG context in the prompt. The injection order matters:

  1. System prompt (per-conversation override if set)
  2. Memory (ranked + budgeted)
  3. RAG context (top-K chunks + grounding)
  4. Conversation history
  5. User message

Memory before RAG means the model knows your preferences when interpreting retrieved documents. RAG before history means facts from your knowledge base take precedence over stale context window recall.

Memory items can be global (across all conversations) or scoped to a single conversation. There's a /memory page where you can review, archive, or manually delete items. Each item tracks useCount and lastUsedAt, which influences the frequency component of the ranking score.
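
The ranking formula can be made concrete like this; the three component functions are simplified stand-ins for the real implementations (I'm assuming exponential recency decay and diminishing-returns frequency, which may differ from the app's exact curves):

```typescript
interface MemoryItem {
  content: string;
  createdAt: Date;
  useCount: number;
}

// Fraction of the item's words that also appear in the user message.
function lexicalOverlap(itemContent: string, userMessage: string): number {
  const msgWords = new Set(userMessage.toLowerCase().split(/\W+/));
  const itemWords = itemContent.toLowerCase().split(/\W+/).filter(Boolean);
  if (itemWords.length === 0) return 0;
  return itemWords.filter((w) => msgWords.has(w)).length / itemWords.length;
}

// Assumed exponential decay: halves every 30 days.
function recencyScore(createdAt: Date, now = new Date()): number {
  const days = (now.getTime() - createdAt.getTime()) / 86_400_000;
  return Math.pow(0.5, days / 30);
}

// Assumed diminishing returns on repeated use.
function frequencyScore(useCount: number): number {
  return 1 - 1 / (1 + useCount);
}

export function scoreMemory(
  item: MemoryItem,
  userMessage: string,
  now = new Date()
): number {
  return (
    lexicalOverlap(item.content, userMessage) * 0.7 +
    recencyScore(item.createdAt, now) * 0.2 +
    frequencyScore(item.useCount) * 0.1
  );
}
```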

Smart Model Routing

Ollama supports many models. I run qwen3:14b for general chat and qwen3-coder:30b for code. Switching manually is annoying, so there's an "Auto" mode that detects code-heavy prompts:

const CODE_PATTERNS = [
  /```[\s\S]/,                           // code fences
  /\b(function|const|let|var|class|import|export|=>)\b/,
  /\.(ts|js|py|go|rs|java|cpp|tsx|jsx)\b/,
  /\b(bug|debug|refactor|implement|fix|error|exception)\b/i,
  /\b(react|next\.?js|vue|angular|django|fastapi|express)\b/i,
];
 
export function isCodePrompt(message: string): boolean {
  return CODE_PATTERNS.some((p) => p.test(message));
}

When the user's message matches, the router resolves the code model. Otherwise it uses the default. This runs synchronously before the Ollama request, so there's no latency overhead. It's a simple heuristic that works well in practice: most coding questions include a language keyword or code fence.
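
The resolver tying the heuristic to model selection is a couple of lines; a sketch (resolveModel is a hypothetical name, and the pattern set here is abbreviated from the one above):

```typescript
// Model names match the ones mentioned in the post; "auto" is a mode,
// not a model.
const DEFAULT_MODEL = "qwen3:14b";
const CODE_MODEL = "qwen3-coder:30b";

const CODE_PATTERNS = [
  /```/, // code fences
  /\b(function|const|let|var|class|import|export|=>)\b/,
  /\b(bug|debug|refactor|implement|fix|error|exception)\b/i,
];

export function isCodePrompt(message: string): boolean {
  return CODE_PATTERNS.some((p) => p.test(message));
}

export function resolveModel(selected: string, message: string): string {
  if (selected !== "auto") return selected; // explicit choice always wins
  return isCodePrompt(message) ? CODE_MODEL : DEFAULT_MODEL;
}
```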

Voice Mode

Voice is optional and uses two self-hosted models via Speaches, an OpenAI-compatible local speech API:

  • STT: Faster Whisper (tiny through large-v3 depending on hardware)
  • TTS: Kokoro 82M (fp32, fp16, or int8 quantization)

The voice endpoint validates the Speaches URL with SSRF protection before proxying requests:

function isAllowedHost(hostname: string): boolean {
  const BLOCKED = [
    /^169\.254\.169\.254$/, // AWS metadata
    /\.internal$/,
    /^10\.\d+\.\d+\.\d+$/,
    /^172\.(1[6-9]|2\d|3[01])\.\d+\.\d+$/,
    /^192\.168\.\d+\.\d+$/,
  ];
  const ALLOWED = ["localhost", "127.0.0.1", "host.docker.internal"];
  if (ALLOWED.includes(hostname)) return true;
  return !BLOCKED.some((r) => r.test(hostname));
}

This blocks private IP ranges (useful if the app is ever exposed on a local network) while still allowing localhost and Docker host references. Push-to-talk captures audio via the browser's MediaRecorder API, sends the blob to /api/voice/transcribe, and injects the transcript into the chat input. TTS runs after the assistant response is complete, converting the text to speech and playing it back through the browser's Audio API.

The Chat Pipeline End to End

It helps to trace a single message through the full system to see how the pieces fit together. When a user hits send, the request hits POST /api/chat. The server then:

  1. Resolves which model to use (auto-routing or explicit selection)
  2. Fetches active memory items for this conversation and ranks them against the incoming message
  3. If RAG is enabled, embeds the message and runs the vector similarity query
  4. Computes grounding confidence from the retrieved chunks
  5. Assembles the final prompt in order: system prompt, memory block, RAG context, conversation history, user message
  6. Opens a streaming connection to Ollama and pipes tokens back to the client via a ReadableStream
  7. After the stream closes, buffers the full response text for memory extraction and persists citations, grounding metadata, and the assistant message to the database
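
Step 5 is worth making concrete, since the ordering carries the design intent; a sketch (assemblePrompt and the option names are illustrative, not the project's API):

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assemble messages in the order described above:
// system prompt → memory → RAG context → history → user message.
export function assemblePrompt(opts: {
  systemPrompt: string;
  memoryBlock?: string;
  ragContext?: string;
  history: ChatMessage[];
  userMessage: string;
}): ChatMessage[] {
  const messages: ChatMessage[] = [
    { role: "system", content: opts.systemPrompt },
  ];
  if (opts.memoryBlock) {
    messages.push({
      role: "system",
      content: `Known about the user:\n${opts.memoryBlock}`,
    });
  }
  if (opts.ragContext) {
    messages.push({ role: "system", content: `Context:\n${opts.ragContext}` });
  }
  messages.push(...opts.history);
  messages.push({ role: "user", content: opts.userMessage });
  return messages;
}
```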

The streaming piece uses the Web Streams API directly rather than a helper library:

const encoder = new TextEncoder();
const stream = new ReadableStream({
  async start(controller) {
    for await (const chunk of ollamaStream) {
      controller.enqueue(encoder.encode(chunk.message.content));
    }
    // post-stream: memory extraction, DB writes
    await persistTurn(assistantText, citations, grounding);
    controller.close();
  },
});
return new Response(stream, {
  headers: { "Content-Type": "text/plain; charset=utf-8" },
});

The client reads this with a ReadableStreamDefaultReader, appending tokens to the message as they arrive. This keeps the UI responsive even on long responses from larger models.
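
The client side of that loop can be sketched against any ReadableStream of bytes (the callback-based shape here is illustrative, not the app's exact code):

```typescript
// Read a streamed response body, invoking a callback per decoded chunk
// and returning the full accumulated text.
export async function consumeStream(
  stream: ReadableStream<Uint8Array>,
  onToken: (text: string) => void
): Promise<string> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let full = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true handles multi-byte characters split across chunks
    const text = decoder.decode(value, { stream: true });
    full += text;
    onToken(text);
  }
  return full;
}
```

In the app the onToken callback appends to React state, which is what keeps the UI updating as tokens arrive.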

One design choice worth noting: memory extraction and DB persistence happen inside the ReadableStream constructor's start callback, after the stream is exhausted but before controller.close(). This means the HTTP response stays open until persistence is done. It's simple and eliminates the need for a separate job queue, at the cost of slightly longer time-to-stream-close on turns where extraction finds something to save.

Tailwind v4 in Practice

Tailwind v4 is a meaningful departure from v3. A few things that caught me:

The @tailwindcss/typography plugin is incompatible with v4 (the tailwindcss/plugin export was removed). Rather than waiting for an update, I wrote manual prose CSS scoped to a .prose class. It's around 80 lines and covers headings, paragraphs, code blocks, lists, and blockquotes: everything the markdown renderer needs.

Class-based dark mode requires a custom variant declaration instead of the darkMode: 'class' config key from v3:

@custom-variant dark (&:where(.dark, .dark *));

And the @theme inline directive for theme variables:

@theme inline {
  --color-surface: var(--surface);
  --color-text-primary: var(--text-primary);
}

These patterns weren't documented clearly at the time; they came from reading the Tailwind v4 source and changelog.

Prisma v7 Pain Points

Prisma v7 has some sharp edges worth knowing about. The constructor now requires a driver adapter; you can't call new PrismaClient() with no arguments:

import { createClient } from "@libsql/client";
import { PrismaLibSql } from "@prisma/adapter-libsql";
import { PrismaClient } from "@/generated/prisma/client";
 
const libsql = createClient({ url: "file:./dev.db" });
const adapter = new PrismaLibSql(libsql);
export const prisma = new PrismaClient({ adapter });

Note the import path: @/generated/prisma/client, not @/generated/prisma. There's no index.ts at the package root in v7. The adapter class is PrismaLibSql (lowercase 'ql'), not PrismaLibSQL. These are small things that caused me real confusion the first time.

The singleton pattern (one prisma instance reused across requests) is important in Next.js development mode, where hot reloading would otherwise leak connections:

declare global {
  var __prisma: PrismaClient | undefined;
}
 
export const prisma = global.__prisma ?? new PrismaClient({ adapter });
if (process.env.NODE_ENV !== "production") global.__prisma = prisma;

What I'd Change, and What's Coming Next

OllamaChat is an ongoing project, and there are a few things I'm actively working on improving.

Streaming and memory extraction are coupled awkwardly. Right now, the full streamed response is buffered in a string on the server, then memory extraction runs synchronously after the stream closes. It works, but persistence sits in the same request lifecycle as the streaming response. Moving this to a lightweight background queue would decouple concerns cleanly; this is on my list.

Token estimation is approximate. Using text.length / 4 is accurate enough for budget decisions in most cases, but it can be off by 20–30% for code-heavy content, where tokens tend to be longer. A proper tokenizer like tiktoken would be more precise. I'm evaluating whether the dependency cost is worth it for a local-first app.

Memory dedup is text-based. Normalized string comparison catches obvious duplicates, but semantically similar memories ("I prefer TypeScript" and "Always use TypeScript") can both get saved independently. Embedding-based dedup would fix this; I already have the embedding infrastructure, so this is a relatively small addition.

Beyond fixes, the bigger planned work is a bounded agent execution layer. The idea is an explicit plan-then-approve loop: the model proposes a sequence of tool calls (read_file, search_repo, fetch_url, write_patch), the user sees and approves the plan, and then individual steps execute with full audit logs. This is the natural next step now that the trust layer (citations, grounding confidence) and memory system are solid: you want those guarantees in place before giving the model any execution capability.

After that, I want to add a formal quality evaluation framework: a set of JSONL test cases that run against the live stack and produce groundedness scores, citation precision/recall, and latency p95. The goal is being able to say "this change degraded grounding by 8%" rather than relying on vibes. I already have a rough pnpm eval:grounding script; the next step is making it a proper regression gate.

Longer term, multi-document synthesis is interesting: retrieving from multiple sources and explicitly surfacing contradictions between them rather than silently blending. That one requires rethinking the retrieval strategy to enforce source diversity in the top-K results.

The project is genuinely more useful than I expected when I started it. It's my daily driver for coding questions and document Q&A, and the memory system means it actually gets better the more I use it.

Wrapping Up

OllamaChat is a good example of how far you can get with a single SQLite file and an HTTP call to a local model server. The interesting engineering isn't in the LLM call itself; it's in the surrounding system: vector storage without a separate database, memory extraction without burning tokens on an LLM step, confidence signals that tell the model when to be uncertain.

All of it runs without cloud services, API keys, or data leaving the machine. The full source is on GitHub at Maxkrvo/OllamaChat; you can also find it among my projects.

If you're building something similar or want to talk through the architecture, I'm happy to dig into the details. Check out what I offer if you're looking for frontend and AI integration consulting.
