AI Gateway

Connect SourcePrep to the LLMs that power deep enrichment, code reasoning, and large-context synthesis.

SourcePrep uses a tiered architecture where different model slots handle different jobs based on their strengths. The AI Gateway is where you wire each slot to an endpoint (Ollama Cloud, OpenRouter, a local Ollama instance, or any OpenAI-compatible server). You can run everything with a single model in a pinch, but a small stack gives you the best speed / cost / quality balance.

This is the simplest setup that performs well today. Ollama Cloud handles the bulk of the work; OpenRouter takes the Swarm Coordinator slot, where large-context synthesis benefits from its long context window.

Slot | Endpoint | Model | Why
Fast | Ollama Cloud | gemini-3-flash-preview:cloud | Cheap and fast for cataloguing, intent detection, and tagging.
Code | Ollama Cloud | kimi-k2.6:cloud | Strong code-aware reasoning for inferred-edge discovery.
Thinking | Ollama Cloud | kimi-k2.6:cloud | Same model handles deep reasoning + per-file swarm workers.
Swarm Coordinator | OpenRouter | qwen/qwen3.6-plus | Cluster routing + module synthesis benefit from a fast, JSON-reliable model with a different context profile.

Simpler stack (all Ollama Cloud)

One endpoint, no API keys to juggle, no OpenRouter account needed. Slightly slower on the synthesis stage; perfectly fine for day-to-day use.

Slot | Endpoint | Model
Fast | Ollama Cloud | gemini-3-flash-preview:cloud
Code | Ollama Cloud | kimi-k2.6:cloud
Thinking | Ollama Cloud | kimi-k2.6:cloud
Swarm Coordinator | Ollama Cloud | kimi-k2.6:cloud (inherit)

Local-only stack (no cloud)

Fully offline. Requires a GPU and Ollama installed locally. The Qwen3 family is the recommended baseline — best-in-class small models with reliable JSON output.

  • Fast: qwen3:4b (2.5 GB), for cataloguing, intent detection, and tagging. If you configure only one slot, this model also serves as the single model for everything else.
  • Code: qwen3-coder family, for inferred-edge discovery on AST gaps. Falls back to Fast if unset.
  • Thinking: qwen3:8b (5.2 GB), for epistemic enrichment and clustering. Step up to qwen3:14b (9.3 GB) or qwen3:30b MoE (19 GB) for higher quality.
  • Swarm Coordinator: inherits Thinking by default.
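
As a rough illustration of choosing a Thinking model for a given GPU, the sketch below picks the largest model from the list above whose listed size fits under a VRAM budget. It is only a heuristic under stated assumptions: the listed sizes are treated as a lower bound on memory use (real usage is higher once context is loaded), the 2 GB headroom is an arbitrary illustrative value, and `pick_thinking_model` is a hypothetical helper, not part of SourcePrep.

```python
# Sizes in GB exactly as listed above; used here only as a rough proxy
# for memory needs (an assumption, not a SourcePrep rule).
FAST_MODEL = "qwen3:4b"
THINKING_CANDIDATES = [
    ("qwen3:8b", 5.2),
    ("qwen3:14b", 9.3),
    ("qwen3:30b", 19.0),
]


def pick_thinking_model(vram_gb: float, headroom_gb: float = 2.0) -> str:
    """Return the largest Thinking candidate that fits in the budget.

    If nothing fits, return the Fast model, mirroring the single-model
    fallback described in this guide.
    """
    budget = vram_gb - headroom_gb
    best = FAST_MODEL
    for name, size_gb in THINKING_CANDIDATES:  # ordered smallest to largest
        if size_gb <= budget:
            best = name
    return best


if __name__ == "__main__":
    for vram in (4, 8, 12, 24):
        print(f"{vram} GB VRAM: {pick_thinking_model(vram)}")
```

Treat the result as a starting point and confirm actual memory use on your own hardware.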

See Dynamic Model Loading for how SourcePrep balances multiple local models against VRAM.


Model Slots Explained

SourcePrep defines five "slots" for AI models. You can configure these in the Settings > AI Models tab of the dashboard.


1. Embedding Model (Required)

Default: nomic-embed-text-v1.5 (built-in ONNX, CPU)

Converts code and documentation into vectors for semantic search. SourcePrep supports three tiers — pick the one that fits your hardware:

  • nomic-embed-code via Ollama: recommended for GPU users. Code-specialized model (7B Qwen2 backbone) with the best retrieval quality for code-heavy repos. ~4 GB download. Requires a GPU. Install with: ollama pull manutic/nomic-embed-code
  • nomic-embed-text via Ollama: good quality, much smaller (~274 MB). Works with or without a GPU. Good for mixed text/code repos. Install with: ollama pull nomic-embed-text
  • Built-in ONNX (default): the same nomic-embed-text model shipped as a ~132 MB quantized ONNX file. Runs entirely on CPU, with no GPU, no Ollama, and no external service. Downloads automatically on first build and is cached at ~/.cache/huggingface/. CPU inference is fast enough in practice: the bulk of the embedding work happens once at build time, and embedding a single query at search time takes under 10 ms.
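
To make "vectors for semantic search" concrete, here is a minimal sketch of how a query vector can be compared against stored chunk vectors using cosine similarity. This is the generic technique, not SourcePrep's actual index or ranking code; the three-dimensional vectors and file names are invented toy data (real embedding vectors have hundreds of dimensions), and `rank_chunks` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs, top_k=3):
    """Return the names of the top_k chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


if __name__ == "__main__":
    # Toy "index": in practice these vectors come from the embedding model.
    index = {
        "auth.py": [0.9, 0.1, 0.0],
        "README.md": [0.1, 0.9, 0.0],
        "db.py": [0.0, 0.2, 0.9],
    }
    print(rank_chunks([1.0, 0.0, 0.0], index, top_k=2))
```

The chunk vectors are computed once at build time; only the query vector has to be embedded at search time, which is why a CPU-only embedding model is workable.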

See the Embedding Models guide for setup instructions and a full comparison table.

2. Single / Fast Model

Recommended: qwen3:4b (2.5 GB)

A high-speed model used for background tasks like file cataloguing, tagging, and intent detection during indexing. If you only configure this one slot, it also handles Thinking-tier work as a fallback. Alternative: qwen3:1.7b for very limited VRAM.

3. Code Model

Recommended: qwen3-coder family or any code-specialized local/cloud model

A code-specialized slot used for code-aware analysis like inferred-edge discovery (which call edges the AST parser missed). Falls back to the Single / Fast Model if not configured.

4. Thinking Model

Recommended: qwen3:8b (5.2 GB) or qwen3:14b (9.3 GB)

BYOK examples: gpt-4.1-mini, claude-sonnet-4.6, gemini-2.5-flash

The reasoning model used for epistemic enrichment, clustering, and deep analysis. It takes each file with its neighbor context and produces extended summaries and domain tags. For BYOK, any mid-tier cloud model works well — you don't need the most expensive option.

5. Swarm Coordinator (optional)

Inherits from Thinking Model by default

Used during the Group Reasoning stage for cluster routing and large-context synthesis when swarm mode is enabled. Most users leave this inheriting from the Thinking slot; configure it explicitly only when you want a different model for the coordinator step.

Smart Compression is not a model slot. It is a built-in feature that runs alongside the slots above, using structural Level-of-Detail rendering (no model needed) plus an optional 178 MB BERT model for prose compression. See the compression guide for details.

💡 Single Model Fallback

If you only have resources to run one model (e.g., qwen3:4b), SourcePrep will use it for both "Fast" and "Thinking" tasks. You can simply select the same endpoint and model for both slots in the settings.
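
The fallback rules described in this section can be summarized as a small lookup chain. The sketch below is an illustration of that behavior, not SourcePrep's implementation: the slot keys, the `FALLBACK` table, and `resolve_slot` are hypothetical names, and chaining Swarm Coordinator through Thinking down to Fast is an assumption drawn from combining the individual rules above. The Embedding slot is left out because it has its own built-in default.

```python
from typing import Dict, Optional

# Which slot each slot falls back to when it has no model of its own
# (hypothetical table built from the rules described in this guide).
FALLBACK = {
    "code": "fast",
    "thinking": "fast",
    "swarm_coordinator": "thinking",
}


def resolve_slot(slot: str, configured: Dict[str, str]) -> Optional[str]:
    """Follow the fallback chain until a configured model is found."""
    seen = set()
    current: Optional[str] = slot
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles
        model = configured.get(current)
        if model:
            return model
        current = FALLBACK.get(current)
    return None  # nothing configured anywhere along the chain


if __name__ == "__main__":
    single_model = {"fast": "qwen3:4b"}
    for slot in ("fast", "code", "thinking", "swarm_coordinator"):
        print(slot, "->", resolve_slot(slot, single_model))
```

With only the Fast slot configured, every slot in this sketch resolves to qwen3:4b, which matches the single-model setup described above.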


Managing Endpoints

SourcePrep isn't tied to one provider. The Endpoint Manager at the bottom of the AI Models settings allows you to connect to any OpenAI-compatible API.

Adding a Custom Endpoint

To add a local server (like LM Studio or vLLM) or a cloud provider (like Groq or OpenRouter):

  1. Scroll to Saved Endpoints and click Add New Endpoint.
  2. Display Name: Give it a recognizable name (e.g., "LM Studio Local").
  3. Provider: Select OpenAI Compatible for most generic servers.
  4. URL: Enter the base URL.
    • Ollama: http://localhost:11434
    • LM Studio: http://localhost:1234/v1
    • vLLM: http://localhost:8000/v1
  5. API Key: Required for cloud providers. Local servers often ignore it; if the field must be filled in, a placeholder such as "sk-dummy" usually works.
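
The fields in the form above map naturally onto a small record. The sketch below shows one way to model an endpoint and derive the request headers and the model-listing URL for an OpenAI-compatible server. The `Endpoint` class and its method names are hypothetical, not SourcePrep's internal types; the `/models` path and `Bearer` header follow the OpenAI API convention and assume a base URL that already ends in /v1, as in the LM Studio and vLLM examples.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Endpoint:
    """One saved endpoint, mirroring the form fields described above."""

    display_name: str
    provider: str
    base_url: str
    api_key: Optional[str] = None  # optional for many local servers

    def models_url(self) -> str:
        # OpenAI-compatible servers list models at <base>/models.
        return self.base_url.rstrip("/") + "/models"

    def headers(self) -> Dict[str, str]:
        headers = {"Content-Type": "application/json"}
        if self.api_key:
            headers["Authorization"] = f"Bearer {self.api_key}"
        return headers


if __name__ == "__main__":
    lm_studio = Endpoint(
        display_name="LM Studio Local",
        provider="OpenAI Compatible",
        base_url="http://localhost:1234/v1",
        api_key="sk-dummy",
    )
    print(lm_studio.models_url())
    print(lm_studio.headers())
```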

Testing Connections

Before assigning an endpoint to a model slot, use the Test Connection button on the model card. This performs a lightweight "handshake" (usually a list models call or a tiny completion) to verify:

  • The server is reachable.
  • The API key is valid.
  • The specific model name you entered exists on that server.

Troubleshooting tip: If a test fails, check the server's CORS settings (a server on a different port counts as a different origin). If the server runs in a container, make sure the container publishes its port to localhost.
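
If you want to reproduce the three checks outside the dashboard, the sketch below performs a similar handshake by hand using a list-models call. It is an approximation, not the code behind the Test Connection button: `check_connection` and `model_listed` are hypothetical helpers, and the request assumes an OpenAI-compatible server whose base URL ends in /v1 and whose list-models response has a "data" array of objects with an "id" field.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def model_listed(payload: dict, model: str) -> bool:
    """True if the list-models response contains the given model id."""
    return any(item.get("id") == model for item in payload.get("data", []))


def check_connection(base_url: str, model: str,
                     api_key: Optional[str] = None,
                     timeout: float = 5.0) -> str:
    """Return a short status string covering the three checks above."""
    request = urllib.request.Request(base_url.rstrip("/") + "/models")
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.load(response)
    except urllib.error.HTTPError as exc:
        # The server answered, but rejected the request.
        if exc.code in (401, 403):
            return "api key rejected"
        return f"server error: HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        return f"unreachable: {exc}"
    except ValueError:
        return "unexpected response: not JSON"
    return "ok" if model_listed(payload, model) else "model not found"


if __name__ == "__main__":
    print(check_connection("http://localhost:1234/v1", "qwen3:8b", "sk-dummy"))
```

Because this runs outside the browser, CORS does not apply to it. If this script succeeds while the dashboard test fails, CORS is a likely cause.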