How It Works

Graph Enrichment

SourcePrep learns how your code actually connects — not just what words appear where. A 15-stage pipeline turns raw source files into a living knowledge graph your AI can reason over: fast structural parsing first, then deeper reasoning about meaning, and finally the guides, rules, and safeguards your tools consume.

The Journey

Indexing runs in three groups. Sync delivers a structural map in seconds and layers in the catalogue and search index over the next few minutes. Enrich reasons about what each module does and how it fits the whole. Finalize produces the atlas, rules, concepts, audit findings, and safeguards your tools actually consume.

Sync — Structure first

Structural map in seconds (Rust). Catalogue and search index follow in minutes — your agent works the whole time.

Enrich — Background

Reasons about what each piece of code does, clusters related modules, and scores what matters most.

Finalize — Deliver

Atlas, rules, concepts, audit findings, and guardrails. Runs in parallel where possible.


The 15-stage pipeline in action — Sync → Enrich → Finalize, with per-stage progress and provenance.

Sync

The structural map is ready in seconds. The catalogue and search index follow within minutes on most codebases — your agent is already working the whole time.

1
STRUCTURAL

Parses every file via tree-sitter. Extracts imports, symbols, call sites. Builds the trace graph in seconds.

2
INFERRED_EDGES

Finds edges static parsing misses — cross-language API calls, dynamic dispatch, interface satisfaction, implicit dependencies. Confidence-scored; never overrides parser-derived edges.
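
The no-override rule can be pictured as a merge step. The sketch below is only an illustration, not SourcePrep's implementation: the `Edge` type, the `merge_edges` function, and the `min_confidence` threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str          # e.g. "imports", "calls"
    origin: str        # "parser" or "inferred"
    confidence: float  # parser-derived edges are treated as 1.0 here


def merge_edges(parser_edges, inferred_edges, min_confidence=0.5):
    """Add inferred edges without replacing any parser-derived edge."""
    merged = {(e.src, e.dst, e.kind): e for e in parser_edges}
    for edge in inferred_edges:
        key = (edge.src, edge.dst, edge.kind)
        existing = merged.get(key)
        if existing is not None and existing.origin == "parser":
            continue  # parser-derived edges always win
        if edge.confidence < min_confidence:
            continue  # too speculative to keep (threshold is an assumption)
        if existing is None or edge.confidence > existing.confidence:
            merged[key] = edge
    return list(merged.values())
```

With this shape, an inferred edge that duplicates a parsed import is ignored, while a cross-language call the parser could not see is added with its own confidence value.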

3
CATALOGUE

A one-line summary and tags for every file. The longest single Sync stage on a fresh index — runs while structural context is already serving your agent.

4
VALIDATION

Integrity check. Verifies graph consistency, flags orphan nodes, discards hallucinated edges.
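
As a rough sketch of what such an integrity check might look like, assume a hallucinated edge is one whose endpoint does not exist in the graph and an orphan is a node with no surviving edges. Both definitions, and the `validate_graph` name, are assumptions made for illustration.

```python
def validate_graph(nodes, edges):
    """Return (kept_edges, dropped_edges, orphan_nodes).

    nodes: iterable of node ids
    edges: iterable of (src, dst) pairs
    """
    node_set = set(nodes)
    kept, dropped = [], []
    for src, dst in edges:
        if src in node_set and dst in node_set:
            kept.append((src, dst))
        else:
            dropped.append((src, dst))  # endpoint is not a real node
    connected = {n for edge in kept for n in edge}
    orphans = sorted(node_set - connected)
    return kept, dropped, orphans
```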

5
KNOWLEDGE

The catalogued graph becomes searchable. Deeper enrichment happens next, in Enrich.

LLM Deep Enrichment

Enrich

Runs in the background with full LLM passes. Each stage builds on the previous, producing progressively richer structural understanding. Supports swarm mode — multiple LLM workers processing nodes in parallel.

6
DEEP_REASONING

Epistemic scoring: assigns layers, domains, and confidence ratings to every node in graph context. Stage id `enrichment`.

7
GROUP_REASONING

LLM consensus across related nodes. Identifies patterns and architectural themes.

8
MODULE_SYNTHESIS

Module boundary discovery. Groups files into logical subsystems. Stage id `clustering`.

9
DEEPENING

Iterative epistemic refinement with full graph context available.

10
DEEP_KNOWLEDGE

Re-embeds everything with the enriched data. The search index now reflects deep understanding.

Synthesis & Delivery

Finalize

Produces the deliverables your tools actually consume — the atlas document, IDE rules files, seeded concepts, audit findings, and immune system defenses. Most of these run in parallel once the atlas is ready; the safeguards are derived from your recorded concepts last.
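
The ordering described above (atlas first, a parallel middle, safeguards last) could be scheduled roughly as follows. This is a sketch under assumptions: which stages share the parallel group is inferred from the description, and `run_stage` is a hypothetical stand-in for the real work.

```python
import asyncio

PARALLEL_STAGES = ("RULES", "CONCEPTS", "AUDIT")  # assumed grouping


async def run_stage(name, log):
    await asyncio.sleep(0)  # placeholder for the actual stage work
    log.append(name)


async def finalize():
    log = []
    await run_stage("ATLAS", log)  # everything else waits for the atlas
    await asyncio.gather(*(run_stage(s, log) for s in PARALLEL_STAGES))
    await run_stage("ANTIBODIES", log)  # derived from concepts, so it runs last
    return log
```

Calling `asyncio.run(finalize())` returns the completion order, which always starts with `ATLAS` and ends with `ANTIBODIES`.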

11
ATLAS

Generates the architectural overview — segments, hub files, cross-cutting concerns, workspace map.

12
RULES

Generates IDE rules files — AGENTS.md, .cursor/, .windsurf/ — so your editors know about the MCP tools.

13
CONCEPTS

Seeds concepts from atlas, modules, and audit findings. The "why" behind the architecture.

14
AUDIT

Runs structural analyzers — coupling hotspots, import cycles, hub concentration, quality gaps.
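
Import-cycle detection is the most mechanical of these analyzers. One common way to do it is a depth-first search that reports the first back edge it meets. The sketch below shows that technique on a plain dictionary; the `find_import_cycle` name and the input shape are assumptions, and the recursive form would need to become iterative for very deep import chains.

```python
def find_import_cycle(imports):
    """Return one import cycle as a list of files, or None if there is none.

    imports: dict mapping each file to the files it imports
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in imports.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:  # back edge: dep is already on the current path
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in imports:
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```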

15
ANTIBODIES

Derives immune system defenses from concept assertions. Constraint violations surface as alerts.

Always Running

The file watcher detects changes and triggers incremental rebuilds. Sync stages re-run in seconds. Enrich and Finalize queue in the background. Your agents always get fresh structural context.

Incremental Rebuilds

Only changed files re-enter the pipeline. The graph is patched, not rebuilt from scratch. Hub files and structural relationships update in real time.
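
A minimal sketch of patching rather than rebuilding, assuming the graph is held as a mapping from each file to the set of files it imports. The `patch_graph` function and the `parse_imports` callback are hypothetical; the callback is assumed to return `None` for a deleted file.

```python
def patch_graph(graph, changed, parse_imports):
    """Re-parse only the changed files and update their outgoing edges."""
    for path in changed:
        imports = parse_imports(path)
        if imports is None:
            graph.pop(path, None)       # file was deleted
        else:
            graph[path] = set(imports)  # replace this file's edges only
    return graph
```

Entries for unchanged files are never touched, which is what keeps the update cost proportional to the size of the change. Cleaning up edges that still point at a deleted file is left out of this sketch.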

The Understanding Score

Every node in the graph gets an understanding score from 0.0 to 1.0 (internally called the epistemic score) that represents how well the trace comprehends that node in the context of the entire codebase. This is different from search relevance or summary accuracy: a file can have a perfect summary but a low understanding score if the graph does not yet capture how it connects to anything else.

The score is a weighted composite of six dimensions:

Summary confidence (20%)
How certain is the catalogue model about its own output?
Validation depth (15%)
Has a larger model verified and enriched the summary?
Neighbor coverage (20%)
Are connected nodes also enriched? Understanding is relational.
Cross-reference density (15%)
How many doc↔code bidirectional links exist?
Enrichment depth (15%)
How many reasoning passes has this node been through?
Temporal currency (15%)
Has the source changed since enrichment? Stale knowledge scores zero.
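
The weights above sum to 100%, so the composite can be read as a plain weighted average. The sketch below assumes each dimension has already been normalized to the range 0.0 to 1.0; how SourcePrep measures each one is not specified here, and the dictionary keys and `understanding_score` name are illustrative.

```python
WEIGHTS = {
    "summary_confidence": 0.20,
    "validation_depth": 0.15,
    "neighbor_coverage": 0.20,
    "cross_reference_density": 0.15,
    "enrichment_depth": 0.15,
    "temporal_currency": 0.15,
}


def understanding_score(dimensions):
    """Weighted sum of the six dimensions, each clamped to [0.0, 1.0]."""
    missing = WEIGHTS.keys() - dimensions.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(
        weight * min(max(dimensions[name], 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )
```

A node with a confident, validated summary but nothing else (the first two dimensions at 1.0, the rest at 0.0) would score about 0.35 under these weights, which matches the point that a good summary alone does not imply understanding.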

The Atlas Lens is what consumes the understanding score — projecting the codebase through role-specific lenses (security, refactor, onboarding).

Score Decay — Knowledge Has a Half-Life

Understanding scores aren't static. They decay when the world changes around a node: in epistemological terms, a belief that hasn't been re-verified against current evidence is no longer justified.

| Event | Effect | Rationale |
|---|---|---|
| Source file changed | Score → 0.0 | Everything might be wrong |
| Neighbor re-enriched | Score × 0.95 | Context shifted slightly |
| Referenced doc updated | Score × 0.90 | Documentation changed, claims may be invalid |
| Trace rebuilt (structural) | Score × 0.80 | Edges changed, relationships may differ |
| Module re-synthesized | Score × 0.97 | Module understanding refined |
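
Because every effect in the table is a multiplier (a reset to 0.0 is the same as multiplying by 0.0), successive events compound. A small sketch, with event keys that are illustrative names rather than SourcePrep identifiers:

```python
DECAY = {
    "source_changed": 0.0,
    "neighbor_re_enriched": 0.95,
    "referenced_doc_updated": 0.90,
    "trace_rebuilt": 0.80,
    "module_re_synthesized": 0.97,
}


def apply_events(score, events):
    """Apply each event's multiplier in turn and return the decayed score."""
    for event in events:
        score *= DECAY[event]
    return score
```

For example, a node at 0.9 whose neighbor is re-enriched and whose referenced doc is then updated ends up at 0.9 × 0.95 × 0.90, or roughly 0.77.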

Decay cascades through neighbors: if File A changes, File B (which imports A) decays slightly, and File C (which imports B) decays even less — up to a 3-hop propagation limit. Nodes with decayed scores enter a re-analysis queue ordered by score (lowest first); the Deepening stage processes this queue until the graph converges or the token budget is exhausted.
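
The cascade and the lowest-first queue can be sketched as a breadth-first walk over "who imports this file" edges. The per-hop factors below are hypothetical placeholders, since the text only says that each hop decays less than the one before; the function names and the `importers` mapping are likewise assumptions.

```python
from collections import deque

MAX_HOPS = 3
HOP_FACTORS = {1: 0.90, 2: 0.95, 3: 0.98}  # hypothetical values


def cascade_decay(scores, importers, changed):
    """Zero the changed file, then decay its importers up to MAX_HOPS away.

    importers: dict mapping a file to the files that import it
    Returns (new_scores, decayed_nodes).
    """
    scores = dict(scores)
    scores[changed] = 0.0
    decayed = {changed}
    frontier = deque([(changed, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == MAX_HOPS:
            continue  # propagation limit reached
        for dependent in importers.get(node, ()):
            if dependent in decayed:
                continue  # each node decays once per change
            decayed.add(dependent)
            scores[dependent] = scores.get(dependent, 0.0) * HOP_FACTORS[hops + 1]
            frontier.append((dependent, hops + 1))
    return scores, decayed


def reanalysis_order(scores, decayed):
    """Order decayed nodes for re-analysis, lowest score first."""
    return sorted(decayed, key=lambda n: (scores[n], n))
```

In a chain where each file imports the previous one, a change to the first file reaches the next three files and leaves the fifth untouched; the changed file itself is first in the re-analysis order.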

Documentation Mining

Most indexing tools treat .md files as flat text blobs. SourcePrep extracts structure from documentation: section headers, code references, status markers, and cross-links. This enables:

  • Doc↔code links — docs reference code files and vice versa
  • Staleness detection — a doc referencing a renamed file is flagged as drifted
  • Decision tracking — architecture decisions and their outcomes are captured
  • Orphan detection — docs with no code references or incoming links are identified
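
To make the idea concrete, here is a small sketch that pulls headers and backticked file references out of a Markdown document and flags references that no longer match a known file. The regular expressions, the `mine_doc` function, and the convention that code references appear in backticks are assumptions for illustration; the sketch also cannot tell a renamed file from one that never existed.

```python
import re

HEADER = re.compile(r"^(#{1,6})[ \t]+(.*)$", re.MULTILINE)
CODE_REF = re.compile(r"`([\w./-]+\.\w+)`")  # backticked path with an extension


def mine_doc(text, known_files):
    """Extract section headers and code references from a Markdown doc."""
    headers = [title.strip() for _, title in HEADER.findall(text)]
    refs = CODE_REF.findall(text)
    return {
        "headers": headers,
        "linked": [r for r in refs if r in known_files],
        "drifted": [r for r in refs if r not in known_files],
        "has_code_refs": bool(refs),
    }
```

A doc that mentions a path missing from `known_files` lands in `drifted`, and a doc with no code references at all is a candidate for the orphan check once incoming links are also considered.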

Why This Matters in Practice

Without enrichment: asking your AI "how does the ad framework work?" might return one file that mentions "ad".

With enrichment: the same query returns the module summary, the 6 files that compose the subsystem, their entry points, the 3 docs that describe the architecture, and a flag that one doc references a renamed file. The understanding score tells the AI that the framework is well understood, but one file's score has decayed because one neighbor was recently modified and hasn't been re-analyzed yet.

The difference isn't just better search results. It's the difference between a tool that retrieves and a tool that understands — and knows the boundary between the two.

Research Foundation

The pipeline draws on peer-reviewed research in knowledge graph construction, code intelligence, and probabilistic reasoning:

  • Hierarchical graph + community summaries — Microsoft GraphRAG (2024)
  • LLM → validator multi-agent enrichment — KARMA (2025)
  • AST → code knowledge graph — KG-based Repo-Level Code Gen (2025)
  • Bottom-up topological enrichment — RepoAgent, EMNLP 2024
  • Iterative convergence via residual scheduling — Belief Propagation (Pearl 1988, Yedidia 2003)
  • Score decay and re-enrichment — inspired by epistemic logic and truth-maintenance systems (Doyle 1979)