
Vector Indexing

How SourcePrep finds the right code when you describe what you want instead of naming it exactly.

The Indexing Process

Vector indexing is what lets SourcePrep answer "fuzzy" questions, the kind where you describe intent instead of typing exact names. The Knowledge Pipeline's structural map gives you the skeleton of your code; vector indexing adds the muscle, so searching for "how authentication works" can surface the right files even if none of them contain the word "authentication" verbatim.

Unlike cloud-based tools, indexing runs entirely on your local machine.

1. Discovery

The prep-walker crate (Rust) scans your directory, respecting .gitignore and user-defined exclusions. It computes a BLAKE3 hash of each file for change detection.
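
As a rough illustration of the discovery step, the sketch below walks a directory, prunes excluded folders, and hashes each file so that a later scan can tell what changed. It is written in Python rather than Rust, uses hashlib.blake2b from the standard library as a stand-in for BLAKE3, and does not parse .gitignore. The names EXCLUDED_DIRS, scan, and changed_files are hypothetical and are not the prep-walker API.

```python
import hashlib
import os
from pathlib import Path

# Hypothetical exclusion set for illustration; not SourcePrep's actual config.
EXCLUDED_DIRS = {"node_modules", "dist", ".git"}


def hash_file(path: Path) -> str:
    """Return a hex digest of the file's bytes (BLAKE2b standing in for BLAKE3)."""
    digest = hashlib.blake2b(digest_size=32)
    with path.open("rb") as handle:
        # Read in blocks so large files are never loaded into memory at once.
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def scan(root: Path) -> dict:
    """Walk root, skip excluded directories, and map relative path -> content hash."""
    hashes = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in filenames:
            full = Path(dirpath) / name
            hashes[full.relative_to(root).as_posix()] = hash_file(full)
    return hashes


def changed_files(old: dict, new: dict) -> set:
    """Paths that are new, or whose hash differs from the previous scan."""
    return {path for path, digest in new.items() if old.get(path) != digest}
```

Comparing two scans with changed_files yields only the files that need to be re-parsed, which is the point of hashing during discovery.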

2. Parsing & Chunking

Files are parsed using Tree-sitter. Code is split into logical chunks (functions, classes) rather than arbitrary text windows. Markdown docs are split by headers.
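
To make the idea of logical chunks concrete, here is a minimal sketch that splits Python source into one chunk per top-level function or class, and Markdown into one chunk per header. It uses Python's built-in ast module as a stand-in for Tree-sitter, so it only handles Python source. The chunk dictionary layout and both function names are assumptions for illustration.

```python
import ast


def chunk_python_source(source: str) -> list:
    """Return one chunk per top-level function or class in the source."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks


def chunk_markdown(text: str) -> list:
    """Return one chunk per header; text before the first header gets an empty title.

    Simplification: a line starting with '#' inside a fenced code block
    is also treated as a header.
    """
    chunks = []
    title, lines = "", []

    def flush():
        # Emit the section collected so far, unless it is completely empty.
        if title or any(line.strip() for line in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in text.splitlines():
        if line.startswith("#"):
            flush()
            title, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk keeps its name and line range, so a search hit can point back to a specific function instead of an arbitrary text window.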

3. Embedding

Chunks are passed to the Native Embedder (ONNX/nomic-embed-text) or an optional Ollama instance. This converts text into 768-dimensional vectors.
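
Once chunks are vectors, a description-style query can be matched by comparing vectors rather than words. The sketch below ranks chunks by cosine similarity, a common choice for embedding search; this page does not say which metric SourcePrep uses, so treat that as an assumption. The vectors are hand-written 3-dimensional toys, not real 768-dimensional embeddings, and the chunk ids are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, indexed):
    """Return chunk ids sorted from most to least similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), chunk_id)
              for chunk_id, vec in indexed.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored]


# Toy vectors standing in for embeddings; the ids are hypothetical.
index = {
    "auth/login.py::verify_password": [0.9, 0.1, 0.0],
    "ui/theme.py::load_palette": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "how authentication works"
best_match = rank_chunks(query, index)[0]
```

Here best_match is the login chunk, because its vector points in nearly the same direction as the query even though no keyword was compared.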

4. Storage

Vectors and metadata are stored in a local LanceDB instance (or Qdrant/Chroma if configured). The raw text is never sent to the cloud.

Incremental Updates

SourcePrep includes a real-time file watcher (watchdog). When you save a file:

  • The watcher detects the modify event.
  • It debounces rapid changes (e.g. typing).
  • It re-hashes the file content.
  • If the hash changed, only that file is re-parsed and re-embedded.

This typically takes under 200 ms, so your AI sees the current state of your code almost immediately after each save.
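
The four steps above can be sketched as a small state machine. This is a simplified model, not SourcePrep's watcher: it takes timestamps and a file-reading callback as arguments instead of hooking into watchdog, hashes with BLAKE2b from the standard library, and uses a hypothetical debounce window of 0.1 seconds. The class and method names are assumptions.

```python
import hashlib


class IncrementalIndexer:
    """Debounces modify events and reports only files whose content actually changed."""

    def __init__(self, debounce_seconds=0.1):  # hypothetical default
        self.debounce_seconds = debounce_seconds
        self.pending = {}       # path -> timestamp of the latest modify event
        self.known_hashes = {}  # path -> hash of the content last indexed

    def on_modified(self, path, now):
        """Record a modify event; repeated events push the timestamp forward."""
        self.pending[path] = now

    def due_for_reindex(self, read_bytes, now):
        """Return paths that have been quiet for the debounce window and whose hash changed."""
        to_reindex = []
        for path, last_event in list(self.pending.items()):
            if now - last_event < self.debounce_seconds:
                continue  # still inside the debounce window, check again later
            del self.pending[path]
            digest = hashlib.blake2b(read_bytes(path)).hexdigest()
            if self.known_hashes.get(path) != digest:
                self.known_hashes[path] = digest
                to_reindex.append(path)  # only this file is re-parsed and re-embedded
        return to_reindex
```

Because the hash is checked after the debounce, a save that leaves the bytes unchanged (for example, a formatter that changes nothing) costs one hash and no re-embedding.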

Exclusions

You can control what gets indexed via the Dashboard or .sourceprep/ignore. Common patterns like node_modules/, dist/, and .git/ are ignored by default.
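
As an illustration of how such patterns might be applied, the sketch below treats a pattern ending in "/" as a directory name and any other pattern as a filename glob. The exact syntax accepted by .sourceprep/ignore is not described here, so this gitignore-like subset is an assumption, and is_excluded is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# The defaults named above; the matching rules below are assumed, not documented.
DEFAULT_PATTERNS = ["node_modules/", "dist/", ".git/"]


def is_excluded(relative_path, patterns=DEFAULT_PATTERNS):
    """Return True if a project-relative path matches any exclusion pattern."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return False
    directories, filename = parts[:-1], parts[-1]
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: excluded if any parent directory has that name.
            if pattern.rstrip("/") in directories:
                return True
        elif fnmatchcase(filename, pattern):
            # File pattern: shell-style glob matched against the filename only.
            return True
    return False
```

For example, is_excluded("node_modules/react/index.js") is true with the defaults, while is_excluded("src/app.py") is false.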


Dashboard Controls

The Knowledge Base column in the dashboard gives you visibility and control over this process.

Index Status Card

The top card provides a real-time health check. Watch for the status badge in the top right:

  • Fresh: The index is perfectly synced with your disk.
  • Stale: Files have changed, but the index hasn't updated yet (usually brief during debounce).
  • Building: The background worker is actively processing files.

It also breaks down the index composition:

  • Code: Source files (parsed into AST chunks).
  • Instructions: Markdown, text, and documentation files (parsed by headers).
  • Graph: Structural nodes (symbols, imports) used for graph traversal.

Manual Rebuild

The watcher handles the vast majority of changes, but you might need the Build Index card for:

  • Branch Switching: If you switch git branches and thousands of files change instantly, a manual rebuild ensures everything is caught.
  • Config Changes: If you change path_weights or exclusion patterns, a rebuild applies them to the entire codebase.
  • Troubleshooting: If search feels "off", a full rebuild (using the --full flag via CLI or the dashboard button) clears the vector store and starts fresh.