Embedding Models
SourcePrep supports three embedding tiers — from a zero-dependency CPU fallback to a GPU-accelerated code-specialized model. Pick the one that fits your hardware.
Three tiers at a glance
nomic-embed-code (via Ollama · ~4 GB)
Code-specialized model built on a 7B-parameter Qwen2 backbone and trained specifically on source code. Useful for very large, code-heavy codebases where the broader training data may help. In our benchmark the built-in ONNX model scored slightly higher, so this is a flexibility option, not a quality upgrade. Requires a GPU.
nomic-embed-text (via Ollama · ~274 MB)
General-purpose text + code embedding model. Excellent quality with a much smaller footprint than the code-specialized model. A good choice if you have Ollama running but lack a dedicated GPU, or if your codebase mixes text and code.
nomic-embed-text-v1.5 (built-in ONNX · ~132 MB)
The same nomic-embed-text model, shipped as a quantized ONNX file that SourcePrep downloads automatically from HuggingFace. Runs entirely on CPU: no GPU, no Ollama, no external service needed. CPU inference is perfectly fine for indexing and search; embedding speed is not a bottleneck in normal usage. This is the zero-config default for any machine without Ollama.
| Model | Size | GPU? | Accuracy (R@1) | Query speed | Best for |
|---|---|---|---|---|---|
| nomic-embed-code | ~4 GB | Required | 82.1% | ~148 ms | Large code repos, Ollama users |
| nomic-embed-text | ~274 MB | Optional | 82.1% | ~25 ms | Mixed repos, Ollama users |
| nomic-embed-text-v1.5 | ~132 MB | None | 84.6% | ~7 ms | Best accuracy, zero-config default |
Benchmark: 39 code-retrieval queries across a 22-file fixture. R@1 = percentage of queries where the correct file was the #1 result. All tiers achieve 97%+ Recall@5. Query speed = median (p50) time to embed the query and run the search.
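The benchmark metrics above can be sketched in a few lines (illustrative only, not SourcePrep's actual harness): each query has an expected file and a ranked result list, and Recall@k counts the queries whose expected file appears in the top k.

```python
def recall_at_k(results, expected, k):
    """Fraction of queries whose expected file is within the top-k results."""
    hits = sum(1 for ranked, want in zip(results, expected) if want in ranked[:k])
    return hits / len(expected)

# Hypothetical ranked results for three queries
ranked = [
    ["auth.py", "db.py"],    # correct file first
    ["util.py", "auth.py"],  # correct file second
    ["db.py", "util.py"],    # correct file first
]
expected = ["auth.py", "auth.py", "db.py"]

r_at_1 = recall_at_k(ranked, expected, 1)  # 2 of 3 queries hit at rank 1
r_at_5 = recall_at_k(ranked, expected, 5)  # all 3 within the top 5
```

R@1 is simply Recall@k with k = 1, which is why the table reports it as a percentage of queries.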
Why CPU inference is fine for the built-in model
The built-in ONNX model runs on CPU using ONNX Runtime — and that is intentional. Embedding happens at index build time (once when you add or change files), not at query time. A 50 000-file codebase takes a few minutes to embed on a modern CPU. Subsequent incremental builds only re-embed changed files.
During search, only the query is embedded on the fly (one tiny vector per search), which takes under 10 ms on CPU. There is no perceptible latency difference between CPU and GPU for query-time embedding.
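To make the query-time cost concrete, here is a sketch in plain NumPy (illustrative shapes, not SourcePrep's internals): the file vectors already exist from index build, so a search is one small embed plus a matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Precomputed at index-build time: one unit-normalized vector per file.
index = rng.normal(size=(1000, 768))
index /= np.linalg.norm(index, axis=1, keepdims=True)

# At query time, only this one vector is embedded (stand-in for the model).
query = rng.normal(size=768)
query /= np.linalg.norm(query)

scores = index @ query               # cosine similarity on unit vectors
top5 = np.argsort(scores)[::-1][:5]  # indices of the best-matching files
```

The matrix-vector product over even a large index is microseconds; the only model inference per search is the single query embedding, which is why CPU vs. GPU makes no perceptible difference here.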
Choose nomic-embed-code via Ollama if you want the code-specialized model and have a GPU available; otherwise the built-in default is the best starting point.
Configuring your embedding tier
Tier 1: nomic-embed-code via Ollama (GPU required)
- Install Ollama and ensure a GPU is available
- Run `ollama pull manutic/nomic-embed-code`
- In the dashboard: Settings → AI Models → Embedding → Use Endpoint
- Select your Ollama endpoint; the model is auto-selected if found
- Rebuild your project
Tier 2: nomic-embed-text via Ollama
- Run `ollama pull nomic-embed-text`
- In the dashboard: Settings → AI Models → Embedding → Use Endpoint
- Select your Ollama endpoint and choose `nomic-embed-text`
- Rebuild your project
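If you want to sanity-check that your Ollama endpoint actually serves the model before rebuilding, a short script against Ollama's public `/api/embeddings` route works (this is a sketch: `localhost:11434` assumes a default install, and the model name should match the tier you pulled).

```python
import json
import urllib.request

def build_payload(text, model="nomic-embed-text"):
    """Request body for Ollama's embeddings endpoint."""
    return {"model": model, "prompt": text}

def embed(text, model="nomic-embed-text", host="http://localhost:11434"):
    """Return one embedding vector from a running Ollama instance."""
    data = json.dumps(build_payload(text, model)).encode()
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

# Usage (requires Ollama running and the model pulled):
# vec = embed("def parse_config(path): ...")
```

If the call raises a connection error, Ollama is not running; if it returns a model-not-found error, the pull step above has not completed.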
Tier 3: Built-in ONNX (default, no setup required)
Nothing to configure: this is the default. On first build SourcePrep downloads the quantized ONNX model (~132 MB) from HuggingFace and caches it locally. To pre-download before your first build, see "Pre-downloading the built-in model" below.
Cached at ~/.cache/huggingface/hub/. No re-download on subsequent runs.
You can also switch to it explicitly in the dashboard: Settings → AI Models → Embedding → Download from HF, then rebuild with `prep build --full`.
Pre-downloading the built-in model
Useful for restricted networks or air-gapped environments.
Via CLI
Via API
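One way to warm the cache from Python is the `huggingface_hub` client, which downloads into the same `~/.cache/huggingface/hub/` location SourcePrep reads from. This is a sketch under an assumption: the repo id below is the upstream `nomic-embed-text-v1.5` repo, but SourcePrep's actual repo may differ, and the status endpoint in the API reference below returns the authoritative repo info.

```python
from pathlib import Path

# Assumption: the built-in model corresponds to this HuggingFace repo;
# verify against the repo info reported by SourcePrep's status endpoint.
REPO_ID = "nomic-ai/nomic-embed-text-v1.5"

# HuggingFace Hub caches snapshots under models--<org>--<name>
cache_root = Path.home() / ".cache" / "huggingface" / "hub"
snapshot_dir = cache_root / ("models--" + REPO_ID.replace("/", "--"))

def predownload():
    # Requires `pip install huggingface_hub`; blocking, and a no-op
    # when the snapshot is already cached.
    from huggingface_hub import snapshot_download
    snapshot_download(repo_id=REPO_ID)
```

Calling `predownload()` on a machine with network access, then copying `snapshot_dir` to the air-gapped host, avoids any download at first build.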
API reference
Returns whether native embeddings are available, the model cache path, and HuggingFace repo info.
Downloads the ONNX model (~132 MB) from HuggingFace Hub to the local cache. Blocking — returns when complete. Safe to call multiple times (no-op if already cached).
