Embedding Models

SourcePrep supports three embedding tiers — from a zero-dependency CPU fallback to a GPU-accelerated code-specialized model. Pick the one that fits your hardware.

Three tiers at a glance

GPU Option · nomic-embed-code · via Ollama · ~4 GB

Code-specialized model built on a 7B-parameter backbone (Qwen2). Trained specifically on source code. Useful for very large code-heavy codebases where the broader training data may help. In our benchmark the built-in ONNX model scored slightly higher, so this is a flexibility option, not a quality upgrade. Requires a GPU.

4096-dim embeddings · GPU required (Ollama) · ~4 GB download
ollama pull manutic/nomic-embed-code
GPU Optional · nomic-embed-text · via Ollama · ~274 MB

General-purpose text + code embedding model. Excellent quality, much smaller footprint than the code-specialized model. Good choice if you have Ollama running but lack a dedicated GPU, or if your codebase is mixed text and code.

768-dim embeddings · Ollama required · ~274 MB download
ollama pull nomic-embed-text
Recommended · Best Accuracy · nomic-embed-text-v1.5 · Built-in ONNX · ~132 MB

The same nomic-embed-text model, shipped as a quantized ONNX file that SourcePrep downloads automatically from HuggingFace. Runs entirely on CPU — no GPU, no Ollama, no external service needed. CPU inference is perfectly fine for indexing and search; embedding speed is not a bottleneck in normal usage. This is the zero-config default for any machine without Ollama.

768-dim embeddings · CPU only, no GPU needed · ~132 MB download (auto, one-time) · cached at ~/.cache/huggingface/
# No setup — downloads automatically on first build
prep build
| Model                 | Size    | GPU?     | Accuracy (R@1) | Query speed | Best for                           |
| --------------------- | ------- | -------- | -------------- | ----------- | ---------------------------------- |
| nomic-embed-code      | ~4 GB   | Required | 82.1%          | ~148 ms     | Large code repos, Ollama users     |
| nomic-embed-text      | ~274 MB | Optional | 82.1%          | ~25 ms      | Mixed repos, Ollama users          |
| nomic-embed-text-v1.5 | ~132 MB | None     | 84.6%          | ~7 ms       | Best accuracy, zero-config default |

Benchmark: 39 code-retrieval queries across a 22-file fixture. R@1 = percentage of queries where the correct file was the #1 result. All tiers achieve 97%+ Recall@5. Query speed = median embed + search (p50).
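The two metrics can be stated precisely: Recall@k is the fraction of queries whose correct file appears in the top k results, and R@1 is the k=1 case. A minimal sketch of the computation (the queries and file names below are made up for illustration, not from the benchmark fixture):

```python
def recall_at_k(results, expected, k):
    """Fraction of queries whose expected file appears in the top-k results."""
    hits = sum(1 for query, ranked in results.items()
               if expected[query] in ranked[:k])
    return hits / len(results)

# Hypothetical ranked results for three queries (paths are illustrative).
expected = {"parse config": "config.py",
            "open socket": "net.py",
            "render page": "view.py"}
results = {
    "parse config": ["config.py", "main.py"],   # correct file ranked #1
    "open socket":  ["util.py", "net.py"],      # correct file ranked #2
    "render page":  ["view.py", "tmpl.py"],     # correct file ranked #1
}

r_at_1 = recall_at_k(results, expected, 1)  # 2 of 3 queries hit at rank 1
r_at_5 = recall_at_k(results, expected, 5)  # all 3 hit within the top 5
```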

Why CPU inference is fine for the built-in model

The built-in ONNX model runs on CPU using ONNX Runtime — and that is intentional. Embedding happens at index build time (once when you add or change files), not at query time. A 50,000-file codebase takes a few minutes to embed on a modern CPU. Subsequent incremental builds only re-embed changed files.
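As a back-of-envelope check on the "few minutes" claim (the chunk count and throughput below are illustrative assumptions, not measured SourcePrep numbers):

```python
files = 50_000
chunks_per_file = 2    # assumed average number of text chunks per file
chunks_per_sec = 400   # assumed CPU embedding throughput with batched ONNX inference

# Full first build: every chunk in the repo is embedded once.
build_seconds = files * chunks_per_file / chunks_per_sec   # 250 s
build_minutes = build_seconds / 60                         # ~4.2 minutes

# Incremental build: only chunks from changed files are re-embedded.
changed_files = 200
incremental_seconds = changed_files * chunks_per_file / chunks_per_sec  # ~1 s
```

Under these assumptions the full build lands in the low single-digit minutes, and a typical incremental build finishes in about a second, which is why embedding speed is not the bottleneck.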

During search, only the query is embedded on the fly (one tiny vector per search), which takes under 10 ms on CPU. There is no perceptible latency difference between CPU and GPU for query-time embedding.

Bottom line: Use the built-in ONNX model unless you have a specific reason not to — it is the zero-config default and the accuracy leader in our benchmark. Switch to nomic-embed-code via Ollama when you have a GPU and want the code-specialized model for very large, code-heavy repos; as noted above, it is a flexibility option, not a quality upgrade.

Configuring your embedding tier

Tier 1: nomic-embed-code via Ollama

  1. Install Ollama and ensure a GPU is available
  2. ollama pull manutic/nomic-embed-code
  3. In the dashboard: Settings → AI Models → Embedding → Use Endpoint
  4. Select your Ollama endpoint — the model is auto-selected if found
  5. Rebuild your project

Tier 2: nomic-embed-text via Ollama

  1. ollama pull nomic-embed-text
  2. In the dashboard: Settings → AI Models → Embedding → Use Endpoint
  3. Select your Ollama endpoint and choose nomic-embed-text
  4. Rebuild your project

Tier 3: Built-in ONNX (default, no setup required)

Nothing to configure — this is the default. On first build SourcePrep downloads the quantized ONNX model (~132 MB) from HuggingFace and caches it locally. To pre-download before your first build:

prep models

Cached at ~/.cache/huggingface/hub/. No re-download on subsequent runs.

You can also switch to it explicitly in the dashboard: Settings → AI Models → Embedding → Download from HF.

Switching models requires a rebuild. Embeddings from different models have different dimensions and are not compatible. After changing the embedding tier, trigger a full rebuild from the dashboard or run prep build --full.
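The incompatibility is structural, not a policy choice: cosine similarity (and the dot product underlying it) is only defined for vectors of equal length, so a 768-dim vector stored by the built-in model cannot be compared with a 4096-dim query vector from nomic-embed-code. A minimal illustration:

```python
def cosine_similarity(a, b):
    """Cosine similarity; only defined for vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

old_vector = [0.1] * 768    # indexed with nomic-embed-text-v1.5 (768-dim)
new_query  = [0.1] * 4096   # embedded with nomic-embed-code (4096-dim)

try:
    cosine_similarity(old_vector, new_query)
except ValueError as e:
    mismatch = str(e)       # "dimension mismatch: 768 vs 4096"
```

Even same-dimension models (e.g. the two 768-dim nomic-embed-text tiers) produce vectors in different learned spaces, so a full rebuild is required in every case, not just when the dimensions differ.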

Pre-downloading the built-in model

Useful for restricted networks or air-gapped environments.

Via CLI

prep models

Via API

# Check status
curl http://localhost:8400/embedding/status
# Trigger download
curl -X POST http://localhost:8400/embedding/download

API reference

GET /embedding/status

Returns whether native embeddings are available, the model cache path, and HuggingFace repo info.

POST /embedding/download

Downloads the ONNX model (~132 MB) from HuggingFace Hub to the local cache. Blocking — returns when complete. Safe to call multiple times (no-op if already cached).
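The two endpoints can be scripted together, e.g. to pre-warm a CI image before the first `prep build`. A hedged sketch using only the Python standard library; the `available` field name in the status response is an assumption (the reference above only says the response indicates whether native embeddings are available):

```python
import json
import urllib.request

BASE = "http://localhost:8400"

def embedding_available(status_json: str) -> bool:
    """Parse a /embedding/status response body; 'available' is an assumed field name."""
    return bool(json.loads(status_json).get("available", False))

def ensure_model(base: str = BASE) -> None:
    """Download the built-in ONNX model unless the status endpoint says it is cached."""
    with urllib.request.urlopen(f"{base}/embedding/status") as resp:
        if embedding_available(resp.read().decode()):
            return  # already cached; the POST below would be a no-op anyway
    req = urllib.request.Request(f"{base}/embedding/download", method="POST")
    urllib.request.urlopen(req)  # blocks until the ~132 MB download completes
```

Calling a helper like `ensure_model()` before the first build guarantees the download never happens mid-build; since POST /embedding/download is idempotent, the status pre-check is an optimization rather than a correctness requirement.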