Dynamic Model Loading

How SourcePrep manages VRAM by loading and unloading local models on demand, and what that means for your hardware setup.

Overview

SourcePrep's trace pipeline runs up to 10 different LLM tasks — from fast file cataloguing to deep group reasoning. Each task may use a different model optimized for that job. On most consumer hardware, only one or two models fit in VRAM/RAM at a time.

Dynamic model loading is SourcePrep's system for automatically loading the right model before each pipeline stage and unloading the previous one to free memory. This happens transparently — you configure your models once and SourcePrep handles the rest.

Key concept: Dynamic model loading only applies to local model servers. Cloud API providers (OpenAI, Anthropic, Google) are always “ready” and don't require VRAM management.

Provider Support

| Provider | Dynamic Loading | MLX Engine | Best For |
| --- | --- | --- | --- |
| Ollama | ✅ Full support | ❌ Not yet (planned) | Multi-model pipelines, automation, headless servers |
| LM Studio | ⚠️ Manual only | ✅ Built-in | Mac users, single-model setups, maximum performance |
| Cloud APIs | N/A (always ready) | N/A | Zero-config, no local hardware needed |

How It Works (Ollama)

Ollama is the only local provider that supports fully automated dynamic model loading. Here's what happens during a pipeline run:

  1. Stage starts — SourcePrep checks which model is needed for the next task (e.g. qwen3:4b for cataloguing).
  2. Room check — If a different model is currently loaded, SourcePrep sends a keep_alive=0 request to Ollama to unload it.
  3. Preload — SourcePrep sends an empty generate request to Ollama with the new model name. Ollama loads it into memory.
  4. Ready — Once loaded, the pipeline stage runs its LLM calls.
  5. Next stage — If the next stage uses the same model, it stays loaded. If it needs a different model, the cycle repeats.

The entire process is transparent. You never need to manually run ollama run or worry about which model is loaded.
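The unload/preload steps map directly onto Ollama's `/api/generate` endpoint: a request with `keep_alive: 0` evicts a model immediately, and a request with no prompt loads a model without generating anything. A minimal sketch of that cycle (this illustrates the Ollama API, not SourcePrep's actual implementation):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def unload_payload(model: str) -> dict:
    # keep_alive=0 tells Ollama to evict the model from memory immediately
    return {"model": model, "keep_alive": 0}

def preload_payload(model: str) -> dict:
    # A generate request with no prompt loads the model without producing output
    return {"model": model}

def send(payload: dict) -> bytes:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Swapping models between pipeline stages:
# send(unload_payload("qwen3:4b"))   # free VRAM used by the previous stage
# send(preload_payload("qwen3:8b"))  # warm up the next stage's model
```

Because the preload happens before the stage's first real LLM call, the cold-start cost is paid once per model switch rather than on the first user-visible request.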

Always Available (Persistent Models)

Some models — typically the Fast model used for cataloguing — benefit from staying loaded at all times. This avoids the 5–15 second cold-start delay every time a file changes and triggers a fast sync.

In the AI Models settings, check “Always available (Keep loaded)” on any model card. SourcePrep will:

  • Skip unloading this model between pipeline stages
  • Temporarily unload it if VRAM pressure forces an eviction (e.g. a large reasoning model needs to load), run the heavy task, then reload it automatically afterward
  • Show a warning indicator in the AI Gateway whenever a persistent model has been temporarily evicted

VRAM tip: On machines with limited RAM (16GB or less), avoid marking large models as “Always available”. A persistent 8B model (5GB) alongside a 30B reasoning model won't fit — SourcePrep will handle it gracefully via eviction, but the constant load/unload cycle defeats the purpose.

LM Studio & MLX

LM Studio is a graphical model server with a built-in MLX backend. MLX is Apple's machine learning framework, purpose-built for Apple Silicon. It uses unified memory more efficiently than the GGUF/llama.cpp backend that Ollama uses.

Why consider LM Studio?

  • MLX performance: On Apple Silicon Macs (M1–M5), MLX models run faster and use less memory than the equivalent GGUF model in Ollama.
  • Unified memory advantage: MLX is designed around Apple's shared CPU/GPU memory architecture. It can fit larger models into the same RAM.
  • Great UI: Download, configure, and chat with models without touching the terminal.

Limitations with SourcePrep

LM Studio does not expose an API for loading or unloading models. This means SourcePrep cannot perform dynamic model loading with LM Studio. Specifically:

  • No automatic model switching. If your pipeline uses different models for different tasks, you must manually load the correct model in LM Studio's UI. SourcePrep will warn you if the loaded model doesn't match what's configured.
  • Context window is manual. LM Studio defaults to 4,096 tokens. Deep reasoning tasks routinely send 8K–32K token prompts. You must increase the context window in LM Studio's UI (Load → Context Length) before running the trace pipeline.
  • “Always available” is implicit. Whatever model you load in LM Studio stays loaded. The checkbox in SourcePrep has no effect on LM Studio.
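The context-window pitfall above is easy to hit silently, because an oversized prompt is truncated rather than rejected. A rough pre-flight estimate, using the common ~4 characters-per-token heuristic (both the heuristic and the response reserve are assumptions, not exact token counts):

```python
def fits_context(prompt: str, context_tokens: int = 4096,
                 chars_per_token: float = 4.0, reserve: int = 1024) -> bool:
    """Rough check that a prompt fits the server's context window.

    `reserve` leaves room for the model's response; LM Studio's default
    window is 4,096 tokens unless raised in Load → Context Length.
    """
    est_tokens = len(prompt) / chars_per_token
    return est_tokens + reserve <= context_tokens

fits_context("x" * 10_000)                         # ≈2,500 tokens → fits in 4,096
fits_context("x" * 40_000)                         # ≈10,000 tokens → does not fit
fits_context("x" * 40_000, context_tokens=32_768)  # fits once the window is raised
```

For deep reasoning prompts in the 8K–32K range, this is why the guide recommends raising LM Studio's context length before running the trace pipeline.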

Recommendation: LM Studio is ideal for Mac users with 32GB+ RAM who want to keep a single powerful model loaded at all times and leverage MLX performance. Use it as your Fast model endpoint, and optionally pair it with an Ollama endpoint (or cloud API) for the Thinking/Code slots that benefit from dynamic model loading.

MLX vs GGUF (llama.cpp)

The local AI landscape on Mac currently has two main inference backends:

|  | MLX (LM Studio) | GGUF / llama.cpp (Ollama) |
| --- | --- | --- |
| Platform | Apple Silicon only | Cross-platform (Mac, Linux, Windows) |
| Memory usage | Lower (optimized for unified memory) | Higher (Metal backend, less optimized for shared memory) |
| Inference speed | Faster on Apple Silicon | Good, but slightly slower on Mac |
| Dynamic loading | ❌ No API | ✅ Full API |
| Model format | MLX (safetensors) | GGUF (quantized) |
| Server mode | GUI app with local server | Daemon with CLI + REST API |

Ollama uses llama.cpp under the hood with a Metal backend on Mac. It does not use MLX and there are no current plans for Ollama to ship MLX support (though it's been discussed in the community). Similarly, llama.cpp itself does not use MLX — it has its own Metal GPU acceleration layer.

If you want MLX performance, LM Studio is currently the only practical option with a SourcePrep-compatible OpenAI API endpoint.

Recommended Setups

Mac with 32GB+ RAM

Fast Model: LM Studio (MLX) — qwen3:4b — Always Available ✅
Thinking Model: Ollama — qwen3:8b — Dynamic loading
Code Model: Ollama — qwen3-coder:30b — Dynamic loading

The Fast model stays hot in LM Studio via MLX (low memory, fast inference). Ollama handles the heavier models with dynamic loading — swapping them in and out of VRAM as the pipeline progresses.
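The setup above can be summarized as a configuration sketch. All keys, endpoints, and model names here are illustrative assumptions (LM Studio serves on port 1234 by default and names MLX models differently from Ollama); SourcePrep's real settings format may differ:

```python
# Hypothetical shape of the three model slots for the 32GB+ Mac setup.
MODEL_SLOTS = {
    "fast": {
        "provider": "lmstudio",
        "endpoint": "http://localhost:1234/v1",  # LM Studio's default server port
        "model": "qwen3-4b-mlx",                 # assumed MLX model name
        "always_available": True,                # stays loaded for fast sync
    },
    "thinking": {
        "provider": "ollama",
        "endpoint": "http://localhost:11434",
        "model": "qwen3:8b",
        "always_available": False,               # dynamically loaded per stage
    },
    "code": {
        "provider": "ollama",
        "endpoint": "http://localhost:11434",
        "model": "qwen3-coder:30b",
        "always_available": False,
    },
}
```

The key point the sketch captures: only the slots pointed at Ollama participate in dynamic loading, while the LM Studio slot is implicitly persistent.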

Mac with 16GB RAM

All models: Ollama — Dynamic loading
Fast: qwen3:4b (2.5GB) — optionally Always Available
Thinking: qwen3:8b (5.2GB)

With 16GB, dynamic model loading is essential. Only one model fits comfortably at a time. Ollama's automated load/unload cycle keeps the pipeline moving without manual intervention.

Hybrid: Local + Cloud

Fast tasks: Ollama or LM Studio — local model
Reasoning tasks: Cloud API (Claude, GPT-4o) — no VRAM needed

Offload the heavy reasoning tasks (group reasoning, atlas generation, deepening) to a cloud API. This eliminates VRAM pressure entirely for those tasks and gets frontier-quality results. Local models handle the high-volume fast sync tasks.

Linux / Windows with NVIDIA GPU

All models: Ollama — Dynamic loading with CUDA

Ollama with CUDA acceleration is the recommended setup. Dynamic model loading works identically to Mac. MLX is not available on Linux/Windows — it's Apple Silicon only.

Pipeline Safety

Changing your model configuration while a pipeline is running could cause the next stage to resolve to a different (or missing) model. SourcePrep handles this with a pipeline-safe mode switch:

  1. When you save a configuration change, SourcePrep pauses any active pipeline stages
  2. The new configuration is written atomically
  3. SourcePrep verifies the next stage's model is available under the new config
  4. The pipeline resumes from where it left off

This means you can safely switch between Structured and Assigned mode, change endpoints, or swap models — even mid-pipeline — without data loss or crashes.
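Step 2 (the atomic write) is what guarantees the pipeline never observes a half-saved configuration. A minimal sketch of the standard write-to-temp-then-replace pattern (this is the general technique, not SourcePrep's actual code):

```python
import json
import os
import tempfile

def write_config_atomically(path: str, config: dict) -> None:
    """Write config to a temp file, then atomically swap it into place.

    A crash mid-save leaves either the old file or the new one intact --
    never a truncated mixture of both.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(config, f)
            f.flush()
            os.fsync(f.fileno())     # ensure bytes hit disk before the swap
        os.replace(tmp, path)        # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)               # clean up the temp file on failure
        raise
```

Readers of the config file always see a complete document, which is what lets step 3 safely verify the next stage's model against the new configuration.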