Embedding Models
SourcePrep supports three embedding tiers — from a zero-dependency CPU fallback to a GPU-accelerated code-specialized model. Pick the one that fits your hardware.
Three tiers at a glance
nomic-embed-code (via Ollama · ~4 GB)
Code-specialized model built on a 7B-parameter Qwen2 backbone and trained specifically on source code. Useful for very large, code-heavy codebases where the broader training data may help. In our benchmark the built-in ONNX model scored slightly higher, so this is a flexibility option, not a quality upgrade. Requires a GPU.
nomic-embed-text (via Ollama · ~274 MB)
General-purpose text + code embedding model. Excellent quality with a much smaller footprint than the code-specialized model. A good choice if you have Ollama running but lack a dedicated GPU, or if your codebase mixes text and code.
nomic-embed-text-v1.5 (built-in ONNX · ~132 MB)
The same nomic-embed-text model, shipped as a quantized ONNX file that SourcePrep downloads automatically from HuggingFace. Runs entirely on CPU: no GPU, no Ollama, no external service needed. CPU inference is perfectly fine for indexing and search; embedding speed is not a bottleneck in normal usage. This is the zero-config default for any machine without Ollama.
| Model | Size | GPU? | Accuracy (R@1) | Query speed | Best for |
|---|---|---|---|---|---|
| nomic-embed-code | ~4 GB | Required | 82.1% | ~148 ms | Large code repos, Ollama users |
| nomic-embed-text | ~274 MB | Optional | 82.1% | ~25 ms | Mixed repos, Ollama users |
| nomic-embed-text-v1.5 | ~132 MB | None | 84.6% | ~7 ms | Best accuracy, zero-config default |
Benchmark: 39 code-retrieval queries across a 22-file fixture. R@1 = percentage of queries where the correct file was the #1 result. All tiers achieve 97%+ Recall@5. Query speed = median (p50) time to embed the query and run the search.
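The benchmark metrics above can be sketched in a few lines (illustrative only, not SourcePrep's actual harness): each query has an expected file and a ranked result list, and Recall@k counts the queries whose expected file appears in the top k.

```python
def recall_at_k(results, expected, k):
    """Fraction of queries whose expected file is within the top-k results."""
    hits = sum(1 for ranked, want in zip(results, expected) if want in ranked[:k])
    return hits / len(expected)

# Hypothetical ranked results for three queries
ranked = [
    ["auth.py", "db.py"],    # correct file first
    ["util.py", "auth.py"],  # correct file second
    ["db.py", "util.py"],    # correct file first
]
expected = ["auth.py", "auth.py", "db.py"]

r_at_1 = recall_at_k(ranked, expected, 1)  # 2 of 3 queries hit at rank 1
r_at_5 = recall_at_k(ranked, expected, 5)  # all 3 within the top 5
```

R@1 is simply Recall@k with k = 1, which is why the table reports it as a percentage of queries.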
Why CPU inference is fine for the built-in model
The built-in ONNX model runs on CPU using ONNX Runtime — and that is intentional. Embedding happens at index build time (once when you add or change files), not at query time. A 50 000-file codebase takes a few minutes to embed on a modern CPU. Subsequent incremental builds only re-embed changed files.
During search, only the query is embedded on the fly (one tiny vector per search), which takes under 10 ms on CPU. There is no perceptible latency difference between CPU and GPU for query-time embedding.
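To make the query-time cost concrete, here is a sketch in plain NumPy (illustrative shapes, not SourcePrep's internals): the file vectors already exist from index build, so a search is one small embed plus a matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Precomputed at index-build time: one unit-normalized vector per file.
index = rng.normal(size=(1000, 768))
index /= np.linalg.norm(index, axis=1, keepdims=True)

# At query time, only this one vector is embedded (stand-in for the model).
query = rng.normal(size=768)
query /= np.linalg.norm(query)

scores = index @ query               # cosine similarity on unit vectors
top5 = np.argsort(scores)[::-1][:5]  # indices of the best-matching files
```

The matrix-vector product over even a large index is microseconds; the only model inference per search is the single query embedding, which is why CPU vs. GPU makes no perceptible difference here.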
Choose nomic-embed-code via Ollama if you want the code-specialized model and have a GPU available; otherwise the built-in default is the best starting point.
Configuring your embedding tier
Tier 1: nomic-embed-code via Ollama (GPU required)
- Install Ollama and ensure a GPU is available
- Run `ollama pull manutic/nomic-embed-code`
- In the dashboard: Settings → AI Models → Embedding → Use Endpoint
- Select your Ollama endpoint; the model is auto-selected if found
- Rebuild your project
Tier 2: nomic-embed-text via Ollama
- Run `ollama pull nomic-embed-text`
- In the dashboard: Settings → AI Models → Embedding → Use Endpoint
- Select your Ollama endpoint and choose `nomic-embed-text`
- Rebuild your project
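If you want to sanity-check that your Ollama endpoint actually serves the model before rebuilding, a short script against Ollama's public `/api/embeddings` route works (this is a sketch: `localhost:11434` assumes a default install, and the model name should match the tier you pulled).

```python
import json
import urllib.request

def build_payload(text, model="nomic-embed-text"):
    """Request body for Ollama's embeddings endpoint."""
    return {"model": model, "prompt": text}

def embed(text, model="nomic-embed-text", host="http://localhost:11434"):
    """Return one embedding vector from a running Ollama instance."""
    data = json.dumps(build_payload(text, model)).encode()
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

# Usage (requires Ollama running and the model pulled):
# vec = embed("def parse_config(path): ...")
```

If the call raises a connection error, Ollama is not running; if it returns a model-not-found error, the pull step above has not completed.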
Tier 3: Built-in ONNX (default, no setup required)
Nothing to configure: this is the default. On first build SourcePrep downloads the quantized ONNX model (~132 MB) from HuggingFace and caches it locally. To pre-download before your first build, see "Pre-downloading the built-in model" below.
Cached at ~/.cache/huggingface/hub/. No re-download on subsequent runs.
You can also switch to it explicitly in the dashboard: Settings → AI Models → Embedding → Download from HF, then rebuild with `prep build --full`.
Pre-downloading the built-in model
Useful for restricted networks or air-gapped environments.
Via CLI
Via API
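One way to warm the cache from Python is the `huggingface_hub` client, which downloads into the same `~/.cache/huggingface/hub/` location SourcePrep reads from. This is a sketch under an assumption: the repo id below is the upstream `nomic-embed-text-v1.5` repo, but SourcePrep's actual repo may differ, and the status endpoint in the API reference below returns the authoritative repo info.

```python
from pathlib import Path

# Assumption: the built-in model corresponds to this HuggingFace repo;
# verify against the repo info reported by SourcePrep's status endpoint.
REPO_ID = "nomic-ai/nomic-embed-text-v1.5"

# HuggingFace Hub caches snapshots under models--<org>--<name>
cache_root = Path.home() / ".cache" / "huggingface" / "hub"
snapshot_dir = cache_root / ("models--" + REPO_ID.replace("/", "--"))

def predownload():
    # Requires `pip install huggingface_hub`; blocking, and a no-op
    # when the snapshot is already cached.
    from huggingface_hub import snapshot_download
    snapshot_download(repo_id=REPO_ID)
```

Calling `predownload()` on a machine with network access, then copying `snapshot_dir` to the air-gapped host, avoids any download at first build.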
API reference
Returns whether native embeddings are available, the model cache path, and HuggingFace repo info.
Downloads the ONNX model (~132 MB) from HuggingFace Hub to the local cache. Blocking — returns when complete. Safe to call multiple times (no-op if already cached).
