Local LLM Setup

Running SourcePrep's pipeline against models on your own hardware — when it's worth it, what to run, and how SourcePrep handles model swapping for you.

Should you run local LLMs at all?

For most users, no. SourcePrep is local-first by design, but a cloud service such as Ollama's cloud will usually be faster, give better results, and, importantly, provide the concurrency needed for swarm aggregation.

Running local makes sense when:

  • You have a hard requirement that no codebase data leaves your machine.
  • You're on a low-/no-connectivity setup and need to keep working.
  • You already own a Mac with 32GB+ unified memory or a 24GB+ NVIDIA GPU, and want to use it.

Hardware floor: Don't try to run SourcePrep's pipeline against local LLMs on a 16GB Mac (or anything similar). The pipeline drives 14B+ parameter models with long context windows; 16GB just doesn't have the headroom. Use cloud LLMs at that scale — they'll be faster, more reliable, and won't lock up your machine.

SourcePrep's pipeline runs reasoning and code-generation tasks that need real capability. Don't bother with sub-14B models: they don't hold up under multi-hop reasoning and produce noisy enrichments. Our primary recommendation is Qwen 3.5 35B-A3B (the MoE variant with 35B total parameters, 3B active); we benchmark against it in development and tune the prompt budgets for it.

| Model | VRAM (Q4) | When to pick it |
|---|---|---|
| qwen3.5:35b-a3b | ~24 GB | Recommended. MoE: 35B params, 3B active. Strong reasoning, fits on 32GB Mac (tight) or 48GB+ (comfortable). |
| qwen3.5:35b-a3b-q8_0 | ~39 GB | Higher-precision MoE. Needs 48GB+ unified memory or a real GPU. |
| qwen3.5:122b-a10b | ~81 GB | Flagship MoE. Only on workstations with 96GB+ unified memory or a multi-GPU rig. |

VRAM figures are SourcePrep's measured ceilings (see src/prep/core/context_config.py) including the 256K-token context window we provision per slot. Smaller context windows save a few GB but cap how much codebase the agent can reason over.
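
As a rough aid for reading the table, the sketch below checks which models fit a given amount of memory. The ceilings are the figures from the table; the function name and the default 8 GB reserve for the OS and other apps are illustrative assumptions, not SourcePrep settings.

```python
# Memory ceilings in GB, taken from the table in this section.
MODEL_CEILING_GB = {
    "qwen3.5:35b-a3b": 24,
    "qwen3.5:35b-a3b-q8_0": 39,
    "qwen3.5:122b-a10b": 81,
}


def models_that_fit(total_memory_gb: float, reserve_gb: float = 8.0) -> list:
    """Return the models whose ceiling fits once reserve_gb is set aside.

    reserve_gb is an assumed allowance for the OS and other apps.
    """
    budget = total_memory_gb - reserve_gb
    return [name for name, need in MODEL_CEILING_GB.items() if need <= budget]


if __name__ == "__main__":
    for memory in (16, 32, 48, 96):
        print(memory, "GB:", models_that_fit(memory) or "use a cloud endpoint")
```

With these numbers a 16 GB machine fits nothing and a 32 GB machine fits only the Q4 35B-A3B model, which matches the guidance above.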

How dynamic loading works

SourcePrep's pipeline runs several different LLM tasks per project — fast file cataloguing, deeper reasoning passes, code-aware refactor planning, etc. On a single machine you can't keep every model resident, so SourcePrep loads, unloads, and re-warms models between stages on your behalf.

  1. Stage starts: the orchestrator picks the model for the upcoming task.
  2. Room check: if the currently loaded model is not the one the next task needs, it is unloaded first.
  3. Preload: the new model is loaded into memory, with the right context length, before the stage executes.
  4. Ready: the stage runs.
  5. Next stage: if the same model is needed again, it stays put. Otherwise the cycle repeats.

None of this requires manual intervention — you configure your endpoints and models once and SourcePrep handles the choreography per run.
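
The cycle can be sketched in a few lines. This is a minimal illustration, not SourcePrep's orchestrator: `FakeBackend`, `ModelSwapper`, `run_stages`, and the model names are hypothetical, and the backend only records calls instead of loading anything.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeBackend:
    """Hypothetical stand-in for a local provider; records calls only."""
    calls: list = field(default_factory=list)

    def load(self, model: str, context_length: int) -> None:
        self.calls.append(f"load {model} ctx={context_length}")

    def unload(self, model: str) -> None:
        self.calls.append(f"unload {model}")


@dataclass
class ModelSwapper:
    backend: FakeBackend
    loaded: Optional[str] = None

    def ensure_ready(self, model: str, context_length: int) -> None:
        if self.loaded == model:
            return  # same model needed again: it stays put
        if self.loaded is not None:
            self.backend.unload(self.loaded)  # room check
        self.backend.load(model, context_length)  # preload
        self.loaded = model


def run_stages(stages, swapper: ModelSwapper, context_length: int = 262_144) -> None:
    """Run (stage_name, model) pairs, swapping models only when needed."""
    for stage_name, model in stages:
        swapper.ensure_ready(model, context_length)
        swapper.backend.calls.append(f"run {stage_name}")


if __name__ == "__main__":
    backend = FakeBackend()
    run_stages(
        [("catalogue", "model-a"), ("reasoning", "model-b"), ("refactor", "model-b")],
        ModelSwapper(backend),
    )
    print("\n".join(backend.calls))
```

In this run the swap happens once, between the first and second stage; the third stage reuses the model that is already loaded.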

The Advanced LLM Settings panel — keep-alive, VRAM headroom, and per-slot model overrides live here.

Provider support

Both Ollama and LM Studio are fully supported peers — SourcePrep's pipeline calls the same load/unload/ensure-ready logic against either backend.

| Provider | Dynamic Loading | MLX Backend | Notes |
|---|---|---|---|
| Ollama | ✅ Full | ✅ Supported | Cross-platform. Ollama's recent MLX support brings Apple Silicon performance close to LM Studio on equivalent models. |
| LM Studio | ✅ Full | ✅ Built-in | Apple Silicon only. SourcePrep auto-sets context length on load; no manual UI fiddling required. |
| Cloud APIs | N/A (always ready) | N/A | No VRAM management needed. Recommended for most users. |

On Apple Silicon, both backends now run MLX-quantized models. Pick whichever you're already comfortable with; the SourcePrep integration is the same either way.
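
For a sense of what load and unload can look like at the API level, here is a sketch of the request bodies Ollama's documented `/api/generate` endpoint accepts: a request with no prompt loads the model, and `keep_alive: 0` asks Ollama to unload it. Whether SourcePrep issues exactly these requests is an assumption; the helper names and the `10m` default are illustrative, and LM Studio is not covered here.

```python
import json


def ollama_preload_request(model: str, num_ctx: int, keep_alive: str = "10m") -> dict:
    """Body for POST /api/generate that loads a model without generating text.

    Omitting "prompt" makes Ollama load the model; num_ctx requests the
    context length; keep_alive controls how long the model stays resident.
    """
    return {"model": model, "keep_alive": keep_alive, "options": {"num_ctx": num_ctx}}


def ollama_unload_request(model: str) -> dict:
    """Body for POST /api/generate that asks Ollama to unload the model."""
    return {"model": model, "keep_alive": 0}


if __name__ == "__main__":
    # These bodies would be POSTed to http://localhost:11434/api/generate,
    # Ollama's default local address. Nothing is sent here.
    print(json.dumps(ollama_preload_request("qwen3.5:35b-a3b", 262_144)))
    print(json.dumps(ollama_unload_request("qwen3.5:35b-a3b")))
```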

Persistent models

One model can be marked Always Available in the AI Models settings — typically the smaller of your two slots, kept resident to skip the 5–15 second cold start when stages cycle.

If VRAM pressure forces an eviction, SourcePrep will:

  • Temporarily unload the persistent model to make room for a larger task.
  • Automatically reload it once the heavy task finishes.
  • Show an eviction warning in the AI Gateway so you know it happened.
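
The eviction behaviour can be pictured as a small planning function. This is a sketch under stated assumptions: `plan_heavy_task`, the model names, and the sizes are hypothetical, and unloading the heavy model before the reload is an assumed detail.

```python
def plan_heavy_task(total_gb: float, persistent: tuple, heavy: tuple) -> list:
    """Return the ordered actions for running a heavy model.

    persistent and heavy are (name, size_gb) pairs. The persistent model is
    evicted only if both models cannot be resident at the same time.
    """
    p_name, p_gb = persistent
    h_name, h_gb = heavy
    if h_gb > total_gb:
        raise ValueError(f"{h_name} needs {h_gb} GB but only {total_gb} GB exist")

    must_evict = p_gb + h_gb > total_gb
    actions = []
    if must_evict:
        actions.append(f"unload {p_name}")
        actions.append(f"warn: {p_name} evicted for {h_name}")
    actions.append(f"load {h_name}")
    actions.append(f"run {h_name}")
    if must_evict:
        # Assumed: the heavy model is released before the persistent one returns.
        actions.append(f"unload {h_name}")
        actions.append(f"reload {p_name}")
    return actions


if __name__ == "__main__":
    for step in plan_heavy_task(32, ("small-model", 10), ("big-model", 24)):
        print(step)
```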

Mac with 48–64GB unified memory

All slots: qwen3.5:35b-a3b via Ollama or LM Studio

The MoE 35B fits with comfortable headroom. Pick either backend — both now use MLX on Apple Silicon. With 48GB+ you can leave the model resident as Always Available and skip cold starts entirely.

Mac with 32GB unified memory

Primary model: qwen3.5:35b-a3b

Tight but workable — the 35B-a3b MoE consumes ~24 GB at the provisioned context window, leaving roughly 8 GB for the OS and your other apps. Run a single model and let SourcePrep cycle it across stages; don't try to keep a second model resident. If 32 GB feels constrained in practice, drop down to the hybrid setup and route reasoning to a cloud endpoint.

Workstation with 96GB+ memory or a 48GB+ NVIDIA GPU

Primary model: qwen3.5:35b-a3b-q8_0 or qwen3.5:122b-a10b

At this scale you can run the higher-precision Qwen variants or the flagship 122B MoE. Dynamic loading is still on, but mostly to swap between specialised models (reasoning vs code) rather than because you're out of memory.

Hybrid: local fast + cloud heavy

Fast tasks: local qwen3.5:35b-a3b
Reasoning + code tasks: cloud (Claude / GPT / Ollama Cloud)

Often the most pragmatic setup: keep a single local model resident for high-volume fast-sync work, and route the heavy reasoning / refactor stages to a cloud API. No VRAM pressure, frontier-quality results where they matter, and your codebase data still stays local for the fast-sync passes.
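
One way to picture the hybrid split is as a routing table from task kind to endpoint. The sketch below is illustrative only: the `ROUTES` shape, the function names, and the cloud model placeholder are assumptions, not SourcePrep's configuration format.

```python
# Hypothetical routing table: task kind -> (where it runs, which model).
ROUTES = {
    "fast": ("local", "qwen3.5:35b-a3b"),
    "reasoning": ("cloud", "<your-cloud-model>"),
    "code": ("cloud", "<your-cloud-model>"),
}


def route(task_kind: str) -> tuple:
    """Return the (location, model) pair for a task kind."""
    return ROUTES[task_kind]


def leaves_machine(task_kind: str) -> bool:
    """True when a task kind sends data to a cloud endpoint."""
    return ROUTES[task_kind][0] == "cloud"


if __name__ == "__main__":
    for kind in ROUTES:
        location, model = route(kind)
        print(f"{kind}: {location} ({model})")
```

A check like `leaves_machine` makes the trade-off explicit: only the fast-sync passes stay fully local in this setup.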

MLX vs GGUF (Apple Silicon)

On Apple Silicon both backends now offer MLX runtimes — Apple's ML framework optimised for unified memory. The performance gap that historically favored LM Studio has largely closed now that Ollama ships MLX support.

Practically: pick whichever you already have installed. If neither, Ollama's CLI + REST API is easier to script and works the same on Mac, Linux, and Windows; LM Studio is the better choice if you prefer a GUI for browsing and downloading models. Both give SourcePrep the same dynamic-loading behaviour.

Pipeline safety

Changing your model configuration while a pipeline is running could cause the next stage to resolve to a different (or missing) model. SourcePrep handles this with a pipeline-safe mode switch:

  1. When you save a configuration change, SourcePrep pauses any active pipeline stages.
  2. The new configuration is written atomically.
  3. SourcePrep verifies the next stage's model is available under the new config.
  4. The pipeline resumes from where it left off.

This means you can safely switch endpoints, swap models, or move between Structured and Assigned mode — even mid-pipeline — without data loss or crashes.
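
The four steps map onto a familiar pattern: pause, write to a temporary file, swap it into place with an atomic rename, verify, resume. The sketch below shows that pattern with Python's standard library. The `Pipeline` class, the `{"slots": ...}` config shape, and leaving the pipeline paused when verification fails are all assumptions for illustration, not SourcePrep's implementation.

```python
import json
import os
import tempfile


class Pipeline:
    """Hypothetical pipeline handle that only tracks pause state."""

    def __init__(self):
        self.paused = False
        self.log = []

    def pause(self):
        self.paused = True
        self.log.append("pause")

    def resume(self):
        self.paused = False
        self.log.append("resume")


def write_config_atomically(path: str, config: dict) -> None:
    """Write config so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(config, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def safe_switch(pipeline, path, new_config, next_stage_slot, available_models):
    """Pause, write, verify the next stage's model, then resume."""
    pipeline.pause()
    write_config_atomically(path, new_config)
    model = new_config["slots"][next_stage_slot]
    if model not in available_models:
        # Assumed behaviour: stay paused rather than run with a missing model.
        raise RuntimeError(f"{model} is not available; pipeline left paused")
    pipeline.resume()
```

Writing to a temporary file in the same directory matters here: `os.replace` is only atomic when source and destination live on the same filesystem.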