Dynamic Model Loading
How SourcePrep manages VRAM by loading and unloading local models on demand, and what that means for your hardware setup.
Overview
SourcePrep's trace pipeline runs up to 10 different LLM tasks — from fast file cataloguing to deep group reasoning. Each task may use a different model optimized for that job. On most consumer hardware, only one or two models fit in VRAM/RAM at a time.
Dynamic model loading is SourcePrep's system for automatically loading the right model before each pipeline stage and unloading the previous one to free memory. This happens transparently — you configure your models once and SourcePrep handles the rest.
Key concept: Dynamic model loading only applies to local model servers. Cloud API providers (OpenAI, Anthropic, Google) are always “ready” and don't require VRAM management.
Provider Support
| Provider | Dynamic Loading | MLX Engine | Best For |
|---|---|---|---|
| Ollama | ✅ Full support | ❌ Not yet (planned) | Multi-model pipelines, automation, headless servers |
| LM Studio | ⚠️ Manual only | ✅ Built-in | Mac users, single-model setups, maximum performance |
| Cloud APIs | N/A (always ready) | N/A | Zero-config, no local hardware needed |
How It Works (Ollama)
Ollama is the only local provider that supports fully automated dynamic model loading. Here's what happens during a pipeline run:
- Stage starts — SourcePrep checks which model is needed for the next task (e.g. `qwen3:4b` for cataloguing).
- Room check — If a different model is currently loaded, SourcePrep sends a `keep_alive=0` request to Ollama to unload it.
- Preload — SourcePrep sends an empty generate request to Ollama with the new model name. Ollama loads it into memory.
- Ready — Once loaded, the pipeline stage runs its LLM calls.
- Next stage — If the next stage uses the same model, it stays loaded. If it needs a different model, the cycle repeats.
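The swap cycle above maps onto two real Ollama `/api/generate` calls: a request with `keep_alive: 0` unloads a model, and a request naming a model with no prompt preloads it. A minimal Python sketch of that logic (the `swap_model` helper and the injected `post` callable are illustrative, not SourcePrep's actual code):

```python
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def unload_request(model):
    # keep_alive=0 tells Ollama to evict the model immediately
    return {"model": model, "keep_alive": 0}

def preload_request(model):
    # A generate request with no prompt just loads the model into memory
    return {"model": model}

def swap_model(current, needed, post):
    """Unload `current` (if different) and preload `needed`.
    `post` is any callable(url, payload), e.g. requests.post."""
    if current == needed:
        return current  # already loaded, nothing to do
    if current is not None:
        post(OLLAMA_URL, unload_request(current))
    post(OLLAMA_URL, preload_request(needed))
    return needed
```

Injecting the `post` callable keeps the sketch testable without a running Ollama server; in practice it would be an HTTP client call.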
The entire process is transparent. You never need to manually run `ollama run` or worry about which model is loaded.
Always Available (Persistent Models)
Some models — typically the Fast model used for cataloguing — benefit from staying loaded at all times. This avoids the 5–15 second cold-start delay every time a file changes and triggers a fast sync.
In the AI Models settings, check “Always available (Keep loaded)” on any model card. SourcePrep will:
- Skip unloading this model between pipeline stages
- Temporarily unload it if VRAM pressure forces an eviction (e.g. a large reasoning model needs to load), run the heavy task, then automatically reload it afterward
- Show a warning indicator in the AI Gateway if a persistent model was temporarily evicted
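The eviction-and-reload behaviour described above can be sketched as a simple planner. Everything here (the function name, the step tuples, the sizes) is a hypothetical illustration of the logic, not SourcePrep internals:

```python
def plan_heavy_task(persistent, heavy, vram_gb, sizes_gb):
    """Return the load/unload steps for running a heavy task while a
    persistent model is pinned. Sizes are in GB."""
    steps = []
    if sizes_gb[persistent] + sizes_gb[heavy] > vram_gb:
        # Not enough room for both: temporarily evict the pinned model...
        steps.append(("unload", persistent))
        steps.append(("load", heavy))
        steps.append(("run", heavy))
        steps.append(("unload", heavy))
        # ...then bring it back once the heavy task is done
        steps.append(("load", persistent))
    else:
        # Both fit: the persistent model never leaves memory
        steps.append(("load", heavy))
        steps.append(("run", heavy))
    return steps
```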
VRAM tip: On machines with limited RAM (16GB or less), avoid marking large models as “Always available”. A persistent 8B model (5GB) alongside a 30B reasoning model won't fit — SourcePrep will handle it gracefully via eviction, but the constant load/unload cycle defeats the purpose.
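A quick back-of-envelope check for the tip above. The headroom figure is an illustrative assumption: real residency runs higher than on-disk size once the KV cache is allocated, and the OS needs memory too:

```python
def fits_together(model_sizes_gb, ram_gb, headroom_gb=4.0):
    """Rough check: can these models be resident at the same time?
    Sizes are on-disk quantized sizes; headroom covers OS + KV cache."""
    return sum(model_sizes_gb) + headroom_gb <= ram_gb

# The tip's example: a persistent 8B (~5GB) plus a 30B reasoning
# model (~19GB) on a 16GB machine -> does not fit, eviction kicks in.
fits_together([5.0, 19.0], 16.0)  # -> False
```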
LM Studio & MLX
LM Studio is a graphical model server with a built-in MLX backend. MLX is Apple's machine learning framework, purpose-built for Apple Silicon. It uses unified memory more efficiently than the GGUF/llama.cpp backend that Ollama uses.
Why consider LM Studio?
- MLX performance: On Apple Silicon Macs (M1–M5), MLX models run faster and use less memory than the equivalent GGUF model in Ollama.
- Unified memory advantage: MLX is designed around Apple's shared CPU/GPU memory architecture. It can fit larger models into the same RAM.
- Great UI: Download, configure, and chat with models without touching the terminal.
Limitations with SourcePrep
LM Studio does not expose an API for loading or unloading models. This means SourcePrep cannot perform dynamic model loading with LM Studio. Specifically:
- No automatic model switching. If your pipeline uses different models for different tasks, you must manually load the correct model in LM Studio's UI. SourcePrep will warn you if the loaded model doesn't match what's configured.
- Context window is manual. LM Studio defaults to 4,096 tokens. Deep reasoning tasks routinely send 8K–32K token prompts. You must increase the context window in LM Studio's UI (Load → Context Length) before running the trace pipeline.
- “Always available” is implicit. Whatever model you load in LM Studio stays loaded. The checkbox in SourcePrep has no effect on LM Studio.
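A client can detect the model mismatch itself by querying LM Studio's OpenAI-compatible `GET /v1/models` endpoint and comparing IDs. A sketch under that assumption (the helper name and warning wording are illustrative; the response schema is the standard OpenAI `{"data": [{"id": ...}]}` shape):

```python
def check_loaded_model(models_response, expected):
    """Return a warning string if `expected` is not among the models the
    server reports, else None. `models_response` is the parsed JSON from
    GET /v1/models."""
    loaded = {m["id"] for m in models_response.get("data", [])}
    if expected not in loaded:
        return (f"Model mismatch: expected '{expected}' but the server "
                f"reports {sorted(loaded) or 'no models'}. Load it in "
                "LM Studio's UI before running the pipeline.")
    return None  # configured model is loaded
```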
Recommendation: LM Studio is ideal for Mac users with 32GB+ RAM who want to keep a single powerful model loaded at all times and leverage MLX performance. Use it as your Fast model endpoint, and optionally pair it with an Ollama endpoint (or cloud API) for the Thinking/Code slots that benefit from dynamic model loading.
MLX vs GGUF (llama.cpp)
The local AI landscape on Mac currently has two main inference backends:
| | MLX (LM Studio) | GGUF / llama.cpp (Ollama) |
|---|---|---|
| Platform | Apple Silicon only | Cross-platform (Mac, Linux, Windows) |
| Memory usage | Lower (unified memory optimized) | Higher (Metal backend, less optimized for shared memory) |
| Inference speed | Faster on Apple Silicon | Good, but slightly slower on Mac |
| Dynamic loading | ❌ No API | ✅ Full API |
| Model format | MLX (safetensors) | GGUF (quantized) |
| Server mode | GUI app with local server | Daemon with CLI + REST API |
Ollama uses llama.cpp under the hood with a Metal backend on Mac. It does not use MLX and there are no current plans for Ollama to ship MLX support (though it's been discussed in the community). Similarly, llama.cpp itself does not use MLX — it has its own Metal GPU acceleration layer.
If you want MLX performance, LM Studio is currently the only practical option with a SourcePrep-compatible OpenAI API endpoint.
Recommended Setups
Mac with 32GB+ RAM
- `qwen3:4b` — Always Available ✅
- `qwen3:8b` — Dynamic loading
- `qwen3-coder:30b` — Dynamic loading

The Fast model stays hot in LM Studio via MLX (low memory, fast inference). Ollama handles the heavier models with dynamic loading — swapping them in and out of VRAM as the pipeline progresses.
Mac with 16GB RAM
- `qwen3:4b` (2.5GB) — optionally Always Available
- `qwen3:8b` (5.2GB)

With 16GB, dynamic model loading is essential. Only one model fits comfortably at a time. Ollama's automated load/unload cycle keeps the pipeline moving without manual intervention.
Hybrid: Local + Cloud
Offload the heavy reasoning tasks (group reasoning, atlas generation, deepening) to a cloud API. This eliminates VRAM pressure entirely for those tasks and gets frontier-quality results. Local models handle the high-volume fast sync tasks.
Linux / Windows with NVIDIA GPU
Ollama with CUDA acceleration is the recommended setup. Dynamic model loading works identically to Mac. MLX is not available on Linux/Windows — it's Apple Silicon only.
Pipeline Safety
Changing your model configuration while a pipeline is running could cause the next stage to resolve to a different (or missing) model. SourcePrep handles this with a pipeline-safe mode switch:
- When you save a configuration change, SourcePrep pauses any active pipeline stages
- The new configuration is written atomically
- SourcePrep verifies the next stage's model is available under the new config
- The pipeline resumes from where it left off
This means you can safely switch between Structured and Assigned mode, change endpoints, or swap models — even mid-pipeline — without data loss or crashes.
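An atomic configuration write of the kind described above is conventionally done with a temp file plus `os.replace`, which is atomic on both POSIX and Windows. A generic Python sketch, not SourcePrep's actual implementation:

```python
import json
import os
import tempfile

def write_config_atomically(path, config):
    """Write `config` to a temp file in the same directory, then swap it
    over the old file. Readers never observe a half-written config."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(config, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the swap
        os.replace(tmp, path)     # atomic rename over the old config
    except BaseException:
        os.unlink(tmp)            # clean up the temp file on failure
        raise
```

The temp file must live in the same directory as the target, because `os.replace` is only atomic within a single filesystem.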
