Graph Enrichment
How SourcePrep keeps deepening what it knows about your code — parsing structure fast, reasoning about meaning slowly, and re-checking what it thinks it knows whenever the code changes.
SourcePrep doesn't just search your code — it actively tries to understand it. A fast Rust parser maps the structure in seconds. Then small and large language models layer in meaning: what each file does, how it connects to others, which modules it belongs to. When you change a file, the system notices its neighbors' understanding may have shifted and re-checks them. Your AI never acts on stale context. The rest of this page explains how — including the research it's built on, for readers who want that depth.
Why Epistemology?
Most code intelligence tools answer a simple question: "Which files match this query?" SourcePrep asks a fundamentally different one: "How well does the system understand this code — and how justified is that understanding?"
That question is epistemological. Epistemology is the branch of philosophy concerned with knowledge itself — what it means to know something, how knowledge is justified, and when it becomes stale. These aren't abstract concerns when you're building a code graph:
- A file summary can be accurate (the text is correct) but not understood (we don't know how it connects to anything)
- A node's understanding decays when its neighbors change — because context is relational, not intrinsic
- A doc referencing a renamed file is drifted knowledge — technically present, epistemically broken
- Re-enriching the least understood nodes first (residual scheduling) mirrors how belief propagation converges in probabilistic graphical models
The trace graph doesn't just contain knowledge about your code — it knows about its own knowledge.
Inside the Pipeline
Enrichment runs in two groups. Fast Sync runs on every file save (~seconds) to keep the structural map current. Deep Enrichment runs when the system is idle or on a schedule (~minutes) to layer in meaning and relationships.
Group A: Fast Sync
Group B: Deep Enrichment
The Understanding Score
Every node in the graph gets an understanding score (0.0–1.0) — internally called the epistemic score — that represents how well the trace comprehends this node in the context of the entire codebase. This is fundamentally different from search relevance or summary accuracy. A file can have a perfect summary but a low understanding score if we don't know how it connects to anything else.
The score is a weighted composite of six dimensions.
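As a rough sketch of what a weighted composite looks like, the snippet below combines per-dimension scores into one number. The dimension names and weights here are illustrative assumptions, not SourcePrep's actual internals:

```python
# Hypothetical dimensions and weights -- illustrative only, not SourcePrep's real ones.
DIMENSION_WEIGHTS = {
    "summary_quality": 0.25,
    "edge_coverage": 0.20,
    "module_context": 0.15,
    "doc_linkage": 0.15,
    "freshness": 0.15,
    "validation": 0.10,
}

def understanding_score(dimensions: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0.0-1.0) into one composite in 0.0-1.0."""
    return round(
        sum(
            DIMENSION_WEIGHTS[name] * min(max(value, 0.0), 1.0)
            for name, value in dimensions.items()
        ),
        4,
    )

score = understanding_score({
    "summary_quality": 0.9,
    "edge_coverage": 0.8,
    "module_context": 0.7,
    "doc_linkage": 0.5,   # weak doc links drag the composite down
    "freshness": 1.0,
    "validation": 0.6,
})
# A file with a great summary but poor linkage still scores below 0.8.
```

Because the weights sum to 1.0, the composite stays in the same 0.0–1.0 range as its inputs.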
Score Decay — Knowledge Has a Half-Life
Understanding scores aren't static. They decay when the world changes around a node — because in epistemology, knowledge that hasn't been re-verified against current evidence is no longer justified belief:
| Event | Effect | Rationale |
|---|---|---|
| Source file changed | Score → 0.0 | Everything might be wrong |
| Neighbor re-enriched | Score × 0.95 | Context shifted slightly |
| Referenced doc updated | Score × 0.90 | Documentation changed, claims may be invalid |
| Trace rebuilt (structural) | Score × 0.80 | Edges changed, relationships may differ |
| Module re-synthesized | Score × 0.97 | Module understanding refined |
Decay cascades through neighbors: if File A changes, File B (which imports A) decays slightly, and File C (which imports B) decays even less — up to a 3-hop propagation limit. This mirrors how uncertainty propagates through belief networks.
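The cascade can be sketched as a bounded breadth-first walk over the import graph. The graph shape, per-hop attenuation, and the `0.95` multiplier (taken from the table above) are illustrative assumptions:

```python
from collections import deque

NEIGHBOR_DECAY = 0.95   # "Neighbor re-enriched" multiplier from the decay table
MAX_HOPS = 3            # propagation limit described in the text

def cascade_decay(scores: dict[str, float],
                  importers: dict[str, list[str]],
                  changed: str) -> None:
    """Zero the changed node, then attenuate its importers up to MAX_HOPS away."""
    scores[changed] = 0.0
    queue = deque([(changed, 0)])
    seen = {changed}
    while queue:
        node, hops = queue.popleft()
        if hops == MAX_HOPS:
            continue
        for dependent in importers.get(node, []):
            if dependent in seen:
                continue
            seen.add(dependent)
            # Each extra hop applies a weaker multiplier: 0.95, then 0.95^2, ...
            scores[dependent] *= NEIGHBOR_DECAY ** (hops + 1)
            queue.append((dependent, hops + 1))

# File B imports A; File C imports B.
scores = {"a.rs": 0.9, "b.rs": 0.8, "c.rs": 0.7}
importers = {"a.rs": ["b.rs"], "b.rs": ["c.rs"]}
cascade_decay(scores, importers, "a.rs")
# a.rs drops to 0.0; b.rs decays slightly; c.rs decays even less.
```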
Nodes with decayed scores enter a re-analysis queue ordered by score (lowest first). The Continuous Deepening stage processes this queue until the graph converges or the token budget is exhausted.
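A lowest-score-first queue with a budget cutoff might look like the following. The `enrich` step, token costs, and convergence target are stand-ins for the real LLM pass:

```python
import heapq

def deepen(scores: dict[str, float],
           token_budget: int,
           cost_per_node: int = 1000,
           target: float = 0.8) -> list[str]:
    """Re-analyze the least-understood nodes first; return the processing order."""
    heap = [(score, node) for node, score in scores.items() if score < target]
    heapq.heapify(heap)  # min-heap: lowest understanding score pops first
    processed = []
    while heap and token_budget >= cost_per_node:
        score, node = heapq.heappop(heap)
        if scores[node] != score:   # stale heap entry after an update; skip it
            continue
        scores[node] = 1.0          # stand-in: assume enrichment fully restores the node
        token_budget -= cost_per_node
        processed.append(node)
    return processed

# a.rs was zeroed by a change, b.rs decayed; c.rs is above target and skipped.
order = deepen({"a.rs": 0.0, "b.rs": 0.76, "c.rs": 0.9}, token_budget=2000)
```

The loop terminates either because no node remains below the target (convergence) or because the token budget is exhausted, matching the two stopping conditions described above.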
Documentation Mining
Most indexing tools treat .md files as flat text blobs. SourcePrep's enrichment pipeline extracts structure from documentation: section headers, code references, status markers, and cross-links. This enables:
- Doc↔code links — docs reference code files and vice versa
- Staleness detection — a doc referencing a renamed file is flagged as drifted
- Decision tracking — architecture decisions and their outcomes are captured
- Orphan detection — docs with no code references or incoming links are identified
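A minimal sketch of the doc-mining idea: scan a markdown doc for backticked file paths, flag references to files that no longer exist (drift), and flag docs with no code references at all (orphans). The regex and flagging policy are assumptions for illustration:

```python
import re

# Matches backticked paths ending in a few common extensions -- illustrative only.
PATH_RE = re.compile(r"`([\w./-]+\.(?:rs|py|ts|md))`")

def mine_doc(doc_text: str, existing_files: set[str]) -> dict:
    """Extract code references from a doc and classify their health."""
    refs = PATH_RE.findall(doc_text)
    return {
        "code_refs": refs,
        "drifted": [p for p in refs if p not in existing_files],  # renamed/deleted
        "orphan": not refs,  # doc never references code at all
    }

report = mine_doc(
    "See `src/ads/frame.rs` and the old `src/ad_core.rs` for details.",
    existing_files={"src/ads/frame.rs"},
)
# report["drifted"] points at the stale reference to the renamed file.
```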
Research Foundation
The pipeline draws on peer-reviewed research in knowledge graph construction, code intelligence, and probabilistic reasoning:
- Hierarchical graph + community summaries — Microsoft GraphRAG (2024)
- LLM → validator multi-agent enrichment — KARMA (2025)
- AST → code knowledge graph — KG-based Repo-Level Code Gen (2025)
- Bottom-up topological enrichment — RepoAgent, EMNLP 2024
- Iterative convergence via residual scheduling — Belief Propagation (Pearl 1988, Yedidia 2003)
- Score decay and re-enrichment — inspired by epistemic logic and truth-maintenance systems (Doyle 1979)
Why This Matters in Practice
Without enrichment: asking your AI "how does the ad framework work?" might return one file that mentions "ad".
With enrichment: the same query returns the module summary, the 6 files that compose the subsystem, their entry points, the 3 docs that describe the architecture, and a flag that one doc references a renamed file. The understanding score tells the AI that the framework is well understood, but one file's score has decayed because one neighbor was recently modified and hasn't been re-analyzed yet.
The difference isn't just better search results. It's the difference between a tool that retrieves and a tool that understands — and knows the boundary between the two.
