Model Configuration
Configure local LLMs for analysis, reasoning, and compression.
SourcePrep uses a tiered architecture where different models handle specific tasks based on their strengths. While you can run everything with a single model, we recommend a specialized stack for the best balance of speed and intelligence.
New: Use the Model Setup Advisor to get personalized recommendations based on your GPU and preferences.
Recommended Stack
We recommend the Qwen3 family for the core analysis and reasoning loops. These models deliver excellent local inference performance at every size class.
qwen3:4b (Fast): used for fast file cataloguing, intent detection, and auto-tagging during indexing. Only 2.5GB, yet it rivals 72B models at this size.
qwen3:8b (Thinking): used for complex reasoning, epistemic enrichment, and deep analysis. 5.2GB. Alt: qwen3:14b (9.3GB) or qwen3:30b MoE (19GB) for better quality.
Why Qwen3?
- Best-in-class small models: Qwen3:4b rivals Qwen2.5-72B on benchmarks while being tiny enough for any GPU.
- MoE efficiency: The 30B model only activates 3B parameters per token, delivering outstanding reasoning with efficient VRAM use.
- Reliable JSON output: SourcePrep's pipeline needs structured JSON responses. Qwen3 excels at this.
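The structured-output point is worth a concrete illustration: local models sometimes wrap their JSON in markdown code fences, so a tolerant parser is useful downstream. A minimal sketch (a hypothetical helper, not SourcePrep's actual parser; the fence-stripping heuristic is an assumption):

```python
import json

FENCE = "`" * 3  # a literal triple-backtick, built indirectly

def parse_model_json(raw: str):
    """Parse a JSON object from model output, tolerating markdown fences.
    Hypothetical helper, not part of SourcePrep's API."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]          # drop the closing fence
        text = "\n".join(lines[1:])     # drop the opening fence line
    return json.loads(text)
```

Models that reliably emit clean JSON (the document's claim about Qwen3) make the fallback path above mostly unnecessary, but it is cheap insurance.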
Model Slots Explained
SourcePrep defines four "slots" for AI models. You can configure these in the Settings > AI Models tab of the dashboard.
1. Embedding Model (Required)
Default: nomic-embed-text-v1.5 (built-in ONNX, CPU)
Converts code and documentation into vectors for semantic search. SourcePrep supports three tiers; pick the one that fits your hardware:
- nomic-embed-code via Ollama: recommended for GPU users. Code-specialized model (7B Qwen2 backbone), best retrieval quality for code-heavy repos. ~4 GB download. Requires a GPU.
  ollama pull manutic/nomic-embed-code
- nomic-embed-text via Ollama: good quality, much smaller (~274 MB). Works with or without a GPU. Good for mixed text/code repos.
  ollama pull nomic-embed-text
- Built-in ONNX (default): the same nomic-embed-text model shipped as a ~132 MB quantized ONNX file. Runs entirely on CPU: no GPU, no Ollama, no external service. Downloads automatically on first build and is cached at ~/.cache/huggingface/. CPU inference is perfectly fine: embedding happens at build time (not per-query), and query-time embedding takes under 10 ms regardless.
See the Embedding Models guide for setup instructions and a full comparison table.
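Whichever tier you choose, the resulting vectors are compared the same way: semantic search ranks stored chunks by cosine similarity against the query embedding. A minimal sketch of that scoring step (illustrative only; SourcePrep's internals may differ):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 means
    identical direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank chunks by similarity to a query vector (toy 2-D vectors).
query = [0.9, 0.1]
chunks = {"parser.py": [0.8, 0.2], "README.md": [0.1, 0.9]}
ranked = sorted(chunks, key=lambda k: cosine_similarity(query, chunks[k]), reverse=True)
```

This is why the embedding model's quality, not its speed, dominates retrieval: the similarity math is trivial compared to producing good vectors.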
2. Fast Model
Recommended: qwen3:4b (2.5GB)
A high-speed model used for background tasks. When you import a project, this model (if enabled) scans files to generate tags and detect purpose without slowing down the indexing process. Alt: qwen3:1.7b for very limited VRAM.
3. Thinking Model
Recommended: qwen3:8b (5.2GB)
Better: qwen3:14b (9.3GB) or qwen3:30b MoE (19GB)
BYOK: gpt-4.1-mini, claude-sonnet-4.6, gemini-2.5-flash
The reasoning model used for epistemic enrichment, clustering, and deep analysis. It takes each file with its neighbor context and produces extended summaries and domain tags. For BYOK, any mid-tier cloud model works well β you don't need the most expensive option.
4. Smart Compression (built-in)
No GPU required
Two engines: structural compression extracts code at variable Levels of Detail (LOD 0-5) based on relevance (3-20×, no model needed), while language compression for docs uses a lightweight BERT model (~178 MB) to remove filler while preserving meaning. Both are tier-adaptive per client. Read the full compression guide →
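To give a feel for what structural compression at a low level of detail might look like, here is a toy analogue that reduces Python source to its top-level signatures. This is not SourcePrep's engine, just an illustration of the idea under assumed semantics:

```python
import ast

def to_signatures(source: str) -> str:
    """Toy LOD-style reduction: keep only top-level def/class headers,
    discarding bodies. Illustrative only, not SourcePrep's compressor."""
    tree = ast.parse(source)
    out = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            out.append(f"def {node.name}({args}): ...")
        elif isinstance(node, ast.ClassDef):
            out.append(f"class {node.name}: ...")
    return "\n".join(out)
```

A file compressed this way keeps what a caller needs (names and parameters) while dropping implementation detail, which is where the large ratios on low-relevance files come from.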
💡 Single Model Fallback
If you only have resources to run one model (e.g., qwen3:4b), SourcePrep will use it for both "Fast" and "Thinking" tasks. You can simply select the same endpoint and model for both slots in the settings.
Managing Endpoints
SourcePrep isn't tied to one provider. The Endpoint Manager at the bottom of the AI Models settings allows you to connect to any OpenAI-compatible API.
Adding a Custom Endpoint
To add a local server (like LM Studio or vLLM) or a cloud provider (like Groq or OpenRouter):
- Scroll to Saved Endpoints and click Add New Endpoint.
- Display Name: Give it a recognizable name (e.g., "LM Studio Local").
- Provider: Select OpenAI Compatible for most generic servers.
- URL: Enter the base URL.
  - Ollama: http://localhost:11434
  - LM Studio: http://localhost:1234/v1
  - vLLM: http://localhost:8000/v1
- API Key: Required for cloud providers; often optional ("sk-dummy") for local servers.
Testing Connections
Before assigning an endpoint to a model slot, use the Test Connection button on the model card. This performs a lightweight "handshake" (usually a list models call or a tiny completion) to verify:
- The server is reachable.
- The API key is valid.
- The specific model name you entered exists on that server.
