How It Works
Graph Enrichment
SourcePrep learns how your code actually connects — not just what words appear where. A 15-stage pipeline turns raw source files into a living knowledge graph your AI can reason over: fast structural parsing first, then deeper reasoning about meaning, and finally the guides, rules, and safeguards your tools consume.
The Journey
Indexing runs in three groups. Sync delivers a structural map in seconds and layers in the catalogue and search index over the next few minutes. Enrich reasons about what each module does and how it fits the whole. Finalize produces the atlas, rules, concepts, audit findings, and safeguards your tools actually consume.
Sync — Structure first
Structural map in seconds (Rust). Catalogue and search index follow in minutes — your agent works the whole time.
Enrich — Background
Reasons about what each piece of code does, clusters related modules, and scores what matters most.
Finalize — Deliver
Atlas, rules, concepts, audit findings, and guardrails. Runs in parallel where possible.
Sync
The structural map is ready in seconds. The catalogue and search index follow within minutes on most codebases — your agent is already working the whole time.
Parses every file via tree-sitter. Extracts imports, symbols, call sites. Builds the trace graph in seconds.
Finds edges static parsing misses — cross-language API calls, dynamic dispatch, interface satisfaction, implicit dependencies. Confidence-scored; never overrides parser-derived edges.
A one-line summary and tags for every file. The longest single Sync stage on a fresh index — runs while structural context is already serving your agent.
Integrity check. Verifies graph consistency, flags orphan nodes, discards hallucinated edges.
The catalogued graph becomes searchable. Deeper enrichment happens next, in Enrich.
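The rule that inferred edges never override parser-derived ones, and the integrity check that follows, can be sketched roughly as below. This is a minimal illustration, not SourcePrep's implementation: the `Edge` type, the `"parser"`/`"inferred"` labels, and the `min_confidence` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str                # e.g. "import" or "call" (hypothetical labels)
    source: str              # "parser" or "inferred" (hypothetical labels)
    confidence: float = 1.0  # parser-derived edges are treated as certain


def merge_edges(parser_edges, inferred_edges, min_confidence=0.5):
    """Add inferred edges without ever replacing a parser-derived edge."""
    merged = {(e.src, e.dst, e.kind): e for e in parser_edges}
    for e in inferred_edges:
        key = (e.src, e.dst, e.kind)
        existing = merged.get(key)
        if existing is not None and existing.source == "parser":
            continue  # parser-derived edges always win
        if e.confidence < min_confidence:
            continue  # assumed cutoff for weak guesses
        if existing is None or e.confidence > existing.confidence:
            merged[key] = e
    return list(merged.values())


def integrity_check(nodes, edges):
    """Drop edges whose endpoints are not real nodes; report orphan nodes."""
    valid = [e for e in edges if e.src in nodes and e.dst in nodes]
    connected = {e.src for e in valid} | {e.dst for e in valid}
    orphans = sorted(set(nodes) - connected)
    return valid, orphans
```

In this sketch an edge to a file that does not exist is treated as hallucinated and discarded, while a node with no surviving edges is only flagged, not removed.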
Enrich
Runs in the background with full LLM passes. Each stage builds on the previous, producing progressively richer structural understanding. Supports swarm mode — multiple LLM workers processing nodes in parallel.
Epistemic scoring: assigns layers, domains, and confidence ratings to every node, evaluated in graph context. Stage id `enrichment`.
LLM consensus across related nodes. Identifies patterns and architectural themes.
Module boundary discovery. Groups files into logical subsystems. Stage id `clustering`.
Iterative epistemic refinement with full graph context available.
Re-embed everything with enriched data. The search index now reflects deep understanding.
Finalize
Produces the deliverables your tools actually consume — the atlas document, IDE rules files, seeded concepts, audit findings, and immune system defenses. Most of these run in parallel once the atlas is ready; the safeguards are derived from your recorded concepts last.
Generates the architectural overview — segments, hub files, cross-cutting concerns, workspace map.
Generates IDE rules files — AGENTS.md, .cursor/, .windsurf/ — so your editors know about the MCP tools.
Seeds concepts from atlas, modules, and audit findings. The "why" behind the architecture.
Runs structural analyzers — coupling hotspots, import cycles, hub concentration, quality gaps.
Derives immune system defenses from concept assertions. Constraint violations surface as alerts.
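Two of the structural analyzers named above, import cycles and hub concentration, are standard graph computations. The sketch below shows one plausible way to compute them over an import map; the `imports` dictionary shape (file to list of imported files) and both function names are assumptions for illustration only.

```python
from collections import Counter


def find_import_cycles(imports):
    """Return one representative cycle per back edge found by depth-first search."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {n: WHITE for n in imports}
    cycles, stack = [], []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for dep in imports.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so this edge closes a cycle
                cycles.append(stack[stack.index(dep):] + [dep])
            elif state == WHITE:
                visit(dep)
        stack.pop()
        color[node] = BLACK

    for n in list(imports):
        if color[n] == WHITE:
            visit(n)
    return cycles


def hub_files(imports, top=3):
    """Rank files by how many other files import them (in-degree)."""
    indegree = Counter(dep for deps in imports.values() for dep in deps)
    return indegree.most_common(top)
```

The recursive search is fine for a sketch; a very deep import chain would need an iterative version to stay within Python's recursion limit.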
Always Running
The file watcher detects changes and triggers incremental rebuilds. Sync stages re-run in seconds. Enrich and Finalize queue in the background. Your agents always get fresh structural context.
Incremental Rebuilds
Only changed files re-enter the pipeline. The graph is patched, not rebuilt from scratch. Hub files and structural relationships update in real time.
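The patch-not-rebuild idea can be sketched with content hashes: compare each file's digest with the one stored at the last run, then re-parse only the files that differ. Everything here is an illustrative assumption, including the snapshot format and `toy_parse_imports`, which stands in for the real tree-sitter parse.

```python
import hashlib
import re


def digest(content):
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def toy_parse_imports(source):
    """Stand-in parser: collects names from lines of the form `import x`."""
    return sorted(set(re.findall(r"^import (\w+)", source, re.MULTILINE)))


def diff_snapshot(previous, files):
    """Compare stored digests with current contents."""
    current = {path: digest(text) for path, text in files.items()}
    changed = sorted(p for p, h in current.items() if previous.get(p) != h)
    removed = sorted(p for p in previous if p not in current)
    return current, changed, removed


def patch_graph(graph, changed, removed, files):
    """Update only the affected entries of the import graph, in place."""
    for path in removed:
        graph.pop(path, None)
    for path in changed:
        graph[path] = toy_parse_imports(files[path])
    return graph
```

On a second run with one edited file and one deleted file, only those two entries are touched and every other node keeps its existing edges.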
The Understanding Score
Every node in the graph gets an understanding score (0.0–1.0), internally called the epistemic score, that represents how well the trace comprehends that node in the context of the entire codebase. This is different from search relevance or summary accuracy: a file can have a perfect summary but a low understanding score if its connections to the rest of the codebase are unknown.
The score is a weighted composite of six dimensions:
Score Decay — Knowledge Has a Half-Life
Understanding scores aren't static. They decay when the code or documentation around a node changes, on the epistemological view that knowledge not re-verified against current evidence no longer counts as justified belief.
| Event | Effect | Rationale |
|---|---|---|
| Source file changed | Score → 0.0 | Everything might be wrong |
| Neighbor re-enriched | Score × 0.95 | Context shifted slightly |
| Referenced doc updated | Score × 0.90 | Documentation changed, claims may be invalid |
| Trace rebuilt (structural) | Score × 0.80 | Edges changed, relationships may differ |
| Module re-synthesized | Score × 0.97 | Module understanding refined |
Decay cascades through neighbors: if File A changes, File B (which imports A) decays slightly, and File C (which imports B) decays even less — up to a 3-hop propagation limit. Nodes with decayed scores enter a re-analysis queue ordered by score (lowest first); the Deepening stage processes this queue until the graph converges or the token budget is exhausted.
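The decay table and the 3-hop cascade can be sketched as follows. The multipliers come from the table above; the per-hop attenuation in `hop_multiplier` is an assumption, since the text only says that each hop decays less than the one before, and the `importers` map (file to the files that import it) is a hypothetical input format.

```python
import heapq
from collections import deque

# Multipliers from the decay table; a changed source file resets to 0.0.
DECAY = {
    "source_changed": 0.0,
    "neighbor_reenriched": 0.95,
    "doc_updated": 0.90,
    "trace_rebuilt": 0.80,
    "module_resynthesized": 0.97,
}
MAX_HOPS = 3  # propagation limit stated in the text


def hop_multiplier(hop, base=0.95):
    """Assumed attenuation: each hop loses half as much as the previous one."""
    return 1.0 - (1.0 - base) / (2 ** (hop - 1))


def apply_change(scores, importers, changed):
    """Reset the changed node, then cascade decay to its importers breadth-first."""
    scores[changed] *= DECAY["source_changed"]
    seen = {changed}
    touched = [changed]
    frontier = deque([(changed, 0)])
    while frontier:
        node, hop = frontier.popleft()
        if hop == MAX_HOPS:
            continue  # do not propagate past the hop limit
        for nxt in importers.get(node, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            scores[nxt] *= hop_multiplier(hop + 1)
            touched.append(nxt)
            frontier.append((nxt, hop + 1))
    return touched


def reanalysis_queue(scores, touched):
    """Order decayed nodes lowest score first."""
    heap = [(scores[n], n) for n in touched]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

With a chain where B imports A, C imports B, and so on, changing A resets A, scales B, C, and D by 0.95, 0.975, and 0.9875 under the assumed attenuation, and leaves a fourth-hop file untouched.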
Documentation Mining
Most indexing tools treat .md files as flat text blobs. SourcePrep extracts structure from documentation: section headers, code references, status markers, and cross-links. This enables:
- Doc↔code links — docs reference code files and vice versa
- Staleness detection — a doc referencing a renamed file is flagged as drifted
- Decision tracking — architecture decisions and their outcomes are captured
- Orphan detection — docs with no code references or incoming links are identified
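A rough sketch of how the capabilities listed above could fall out of simple structure extraction. The regular expressions, the convention that code references appear as backticked file paths, and the `audit_docs` helper are assumptions for illustration, not SourcePrep's actual extractor.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$", re.MULTILINE)
# Inline code spans that look like file paths with an extension.
CODE_REF_RE = re.compile(r"`([\w./-]+\.[A-Za-z0-9]+)`")
# Markdown links that point at another .md file.
DOC_LINK_RE = re.compile(r"\]\(([^)\s#]+\.md)\)")


def mine_doc(text):
    """Extract section headers, code references, and doc cross-links."""
    return {
        "headers": [m.group(2).strip() for m in HEADER_RE.finditer(text)],
        "code_refs": sorted(set(CODE_REF_RE.findall(text))),
        "doc_links": sorted(set(DOC_LINK_RE.findall(text))),
    }


def audit_docs(docs, code_files):
    """Flag docs whose code references no longer resolve, and orphan docs."""
    mined = {name: mine_doc(text) for name, text in docs.items()}
    linked_to = {target for m in mined.values() for target in m["doc_links"]}
    drifted = {}
    for name, m in mined.items():
        missing = [ref for ref in m["code_refs"] if ref not in code_files]
        if missing:
            drifted[name] = missing
    orphans = sorted(
        name for name, m in mined.items()
        if not m["code_refs"] and name not in linked_to
    )
    return drifted, orphans
```

This version can only tell that a referenced path is missing; distinguishing a rename from a deletion would need extra information such as version-control history.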
Why This Matters in Practice
Without enrichment: asking your AI "how does the ad framework work?" might return one file that mentions "ad".
With enrichment: the same query returns the module summary, the 6 files that compose the subsystem, their entry points, the 3 docs that describe the architecture, and a flag that one doc references a renamed file. The understanding score tells the AI that the framework is well understood, but one file's score has decayed because one neighbor was recently modified and hasn't been re-analyzed yet.
The difference isn't just better search results. It's the difference between a tool that retrieves and a tool that understands — and knows the boundary between the two.
Research Foundation
The pipeline draws on peer-reviewed research in knowledge graph construction, code intelligence, and probabilistic reasoning:
- Hierarchical graph + community summaries — Microsoft GraphRAG (2024)
- LLM → validator multi-agent enrichment — KARMA (2025)
- AST → code knowledge graph — KG-based Repo-Level Code Gen (2025)
- Bottom-up topological enrichment — RepoAgent, EMNLP 2024
- Iterative convergence via residual scheduling — Belief Propagation (Pearl 1988, Yedidia 2003)
- Score decay and re-enrichment — inspired by epistemic logic and truth-maintenance systems (Doyle 1979)
