AI Gateway
Connect SourcePrep to the LLMs that power deep enrichment, code reasoning, and large-context synthesis.
SourcePrep uses a tiered architecture where different model slots handle different jobs based on their strengths. The AI Gateway is where you wire each slot to an endpoint (Ollama Cloud, OpenRouter, a local Ollama instance, or any OpenAI-compatible server). You can run everything with a single model in a pinch, but a small stack gives you the best speed / cost / quality balance.
Recommended stack (cloud-first)
The simplest setup that performs well today. Ollama Cloud handles the bulk of the work; OpenRouter picks up large-context synthesis where its long context window helps.
| Slot | Endpoint | Model | Why |
|---|---|---|---|
| Fast | Ollama Cloud | gemini-3-flash-preview:cloud | Cheap and fast for cataloguing, intent detection, and tagging. |
| Code | Ollama Cloud | kimi-k2.6:cloud | Strong code-aware reasoning for inferred-edge discovery. |
| Thinking | Ollama Cloud | kimi-k2.6:cloud | Same model handles deep reasoning + per-file swarm workers. |
| Swarm Coordinator | OpenRouter | qwen/qwen3.6-plus | Cluster routing + module synthesis benefit from a fast, JSON-reliable model with a different context profile. |
Simpler stack (all Ollama Cloud)
One endpoint, no API keys to juggle, no OpenRouter account needed. Slightly slower on the synthesis stage; perfectly fine for day-to-day use.
| Slot | Endpoint | Model |
|---|---|---|
| Fast | Ollama Cloud | gemini-3-flash-preview:cloud |
| Code | Ollama Cloud | kimi-k2.6:cloud |
| Thinking | Ollama Cloud | kimi-k2.6:cloud |
| Swarm Coordinator | Ollama Cloud | kimi-k2.6:cloud (inherit) |
Local-only stack (no cloud)
Fully offline. Requires a GPU and Ollama installed locally. The Qwen3 family is the recommended baseline — best-in-class small models with reliable JSON output.
- Fast:
qwen3:4b(2.5 GB) — cataloguing, intent, tagging. Falls back here as the single model if you only configure one slot. - Code:
qwen3-coderfamily — inferred-edge discovery on AST gaps. Falls back to Fast if unset. - Thinking:
qwen3:8b(5.2 GB) — epistemic enrichment, clustering. Step up toqwen3:14b(9.3 GB) orqwen3:30bMoE (19 GB) for higher quality. - Swarm Coordinator: inherits Thinking by default.
See Dynamic Model Loading for how SourcePrep balances multiple local models against VRAM.
Model Slots Explained
SourcePrep defines five "slots" for AI models. You can configure these in the Settings > AI Models tab of the dashboard.
1. Embedding Model (Required)
Default: nomic-embed-text-v1.5 (built-in ONNX, CPU)
Converts code and documentation into vectors for semantic search. SourcePrep supports three tiers — pick the one that fits your hardware:
- nomic-embed-code via Ollama — recommended for GPU users. Code-specialized model (7B Qwen2 backbone), best retrieval quality for code-heavy repos. ~4 GB download. Requires a GPU.
ollama pull manutic/nomic-embed-code - nomic-embed-text via Ollama — good quality, much smaller (~274 MB). Works with or without a GPU. Good for mixed text/code repos.
ollama pull nomic-embed-text - Built-in ONNX (default) — the same nomic-embed-text model shipped as a ~132 MB quantized ONNX file. Runs entirely on CPU — no GPU, no Ollama, no external service. Downloads automatically on first build and is cached at
~/.cache/huggingface/. CPU inference is perfectly fine: embedding happens at build time (not per-query), and query-time embedding takes under 10 ms regardless.
See the Embedding Models guide for setup instructions and a full comparison table.
2. Single / Fast Model
Recommended: qwen3:4b (2.5GB)
A high-speed model used for background tasks like file cataloguing, tagging, and intent detection during indexing. If you only configure this one slot, it also handles Thinking-tier work as a fallback. Alt: qwen3:1.7b for very limited VRAM.
3. Code Model
Recommended: qwen3-coder family or any code-specialized local/cloud model
A code-specialized slot used for code-aware analysis like inferred-edge discovery (which call edges the AST parser missed). Falls back to the Single / Fast Model if not configured.
4. Thinking Model
Recommended: qwen3:8b (5.2GB) or qwen3:14b (9.3GB)
BYOK examples: gpt-4.1-mini, claude-sonnet-4.6, gemini-2.5-flash
The reasoning model used for epistemic enrichment, clustering, and deep analysis. It takes each file with its neighbor context and produces extended summaries and domain tags. For BYOK, any mid-tier cloud model works well — you don't need the most expensive option.
5. Swarm Coordinator (optional)
Inherits from Thinking Model by default
Used during the Group Reasoning stage for cluster routing and large-context synthesis when swarm mode is enabled. Most users leave this inheriting from the Thinking slot; configure it explicitly only when you want a different model for the coordinator step.
💡Single Model Fallback
If you only have resources to run one model (e.g., qwen3:4b), SourcePrep will use it for both "Fast" and "Thinking" tasks. You can simply select the same endpoint and model for both slots in the settings.
Managing Endpoints
SourcePrep isn't tied to one provider. The Endpoint Manager at the bottom of the AI Models settings allows you to connect to any OpenAI-compatible API.
Adding a Custom Endpoint
To add a local server (like LM Studio or vLLM) or a cloud provider (like Groq or OpenRouter):
- Scroll to Saved Endpoints and click Add New Endpoint.
- Display Name: Give it a recognizable name (e.g., "LM Studio Local").
- Provider: Select
OpenAI Compatiblefor most generic servers. - URL: Enter the base URL.
- Ollama:
http://localhost:11434 - LM Studio:
http://localhost:1234/v1 - vLLM:
http://localhost:8000/v1
- Ollama:
- API Key: Required for cloud providers; often optional ("sk-dummy") for local servers.
Testing Connections
Before assigning an endpoint to a model slot, use the Test Connection button on the model card. This performs a lightweight "handshake" (usually a list models call or a tiny completion) to verify:
- The server is reachable.
- The API key is valid.
- The specific model name you entered exists on that server.
