Context Assembly

Turning raw signals into an optimized LLM prompt.

Retrieving code is easy. Assembling it into a coherent prompt that fits within a context window while maximizing information density is hard.

The Assembly Process

1. Retrieval

SourcePrep gathers candidates from multiple sources:

  • Semantic Search: Top-K chunks via vector similarity.
  • Keyword Search: BM25 matches for exact terms.
  • Code Graph: Related definitions and call sites (if trace expansion is on).
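Candidates from these sources are merged before scoring. A minimal sketch of that union step, assuming chunks are deduplicated by id and the best score per chunk wins (the function and tuple shapes here are illustrative, not SourcePrep's actual API):

```python
def merge_candidates(semantic, keyword, graph=()):
    """Union (chunk_id, score) candidates from several retrievers,
    keeping the best score seen for each chunk."""
    merged = {}
    for source in (semantic, keyword, graph):
        for chunk_id, score in source:
            # Keep the highest score a chunk earned across sources.
            if score > merged.get(chunk_id, float("-inf")):
                merged[chunk_id] = score
    # Highest-scoring candidates first, ready for re-scoring.
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

For example, a chunk found by both semantic and keyword search keeps its higher score rather than being counted twice.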

2. Scoring & Weighting

Candidates are re-scored based on:

  • Relevance: The raw vector distance.
  • Query Intent: SourcePrep classifies your query (e.g. "docs", "tests", "code", or "default") and automatically adjusts role weights. For example, "how to use auth" boosts documentation, while "auth test failure" boosts test files.
  • Path Weights: User-defined multipliers (e.g. boost src/core by 1.5x, suppress tests/ by 0.5x).
  • Priming: Files named AGENTS.md, PREP_PRIMER.md, or PROJECT_PRIMER.md receive a global score boost (default +0.25). These files are ideal for high-level architectural overviews that should be considered relevant to most queries.
  • Recency: Slight boost for recently modified files (configurable).
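The factors above compose into a single final score per chunk. A hedged sketch of how such a re-scoring pass could combine them (field names, weight values, and the multiplicative-vs-additive split are assumptions for illustration; only the +0.25 priming default comes from this page):

```python
from fnmatch import fnmatch

PRIMER_FILES = {"AGENTS.md", "PREP_PRIMER.md", "PROJECT_PRIMER.md"}

def rescore(chunk, intent_weights, path_weights, recency_boost=0.05):
    """Combine relevance, intent role weights, path multipliers,
    priming, and recency into one score. Illustrative only."""
    score = chunk["relevance"]                       # raw similarity
    score *= intent_weights.get(chunk["role"], 1.0)  # e.g. "docs" boosted for a "how to" query
    for pattern, mult in path_weights.items():       # user-defined path multipliers
        if fnmatch(chunk["path"], pattern):
            score *= mult
    if chunk["path"].rsplit("/", 1)[-1] in PRIMER_FILES:
        score += 0.25                                # default priming boost
    if chunk.get("recently_modified"):
        score += recency_boost
    return score
```

A documentation chunk under src/core with relevance 0.8, a 1.2x docs intent weight, and a 1.5x path multiplier would score 0.8 × 1.2 × 1.5 = 1.44.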

3. Budgeting & Truncation

You specify a max_chars or max_tokens budget. SourcePrep:

  • Sorts chunks by their final score.
  • Greedily adds chunks until the budget is near full.
  • Ensures "glue" code (class headers, function signatures) is preserved for context.

4. Smart Compression

When compression is enabled, SourcePrep uses two engines. Code files are structurally compressed at a Level of Detail (LOD) determined by relevance score: top results stay full, mid-relevance files show signatures, and peripheral files show names only (3–20× reduction, no model needed). Documentation is compressed with a lightweight language model that removes filler while preserving meaning (~2.4×). Both engines run on CPU, and compression is tier-adaptive per client.
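The score-to-LOD mapping for code files can be pictured as a simple threshold function (the cutoff values below are assumptions for illustration, not SourcePrep's actual thresholds):

```python
def lod_for_score(score):
    """Map a relevance score to a level of detail.
    Thresholds are illustrative."""
    if score >= 0.7:
        return "full"        # top results keep full source
    if score >= 0.4:
        return "signatures"  # mid relevance keeps signatures only
    return "names"           # peripheral files keep names only
```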

5. Formatting

The final output is formatted as XML, Markdown, or JSON, complete with file path citations (@src/file.ts:10-20) that AI editors can parse to provide "Click to Open" links.
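For the Markdown case, rendering a chunk with its citation header might look like this sketch (the exact output layout is an assumption; only the @path:start-end citation shape comes from this page):

```python
def format_chunk_markdown(path, start, end, text):
    """Render one chunk as a @path:start-end citation header
    followed by a fenced code block."""
    return f"@{path}:{start}-{end}\n```\n{text}\n```"
```

An AI editor can then parse the leading @src/file.ts:10-20 line to offer a "Click to Open" link.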


Context Panel Controls

The Context Assembler panel in the dashboard lets you tune this pipeline for your specific needs.

Retrieval Settings

  • Chunks (k): Controls how many distinct code blocks are retrieved from the vector database.
    Default: 20. Increase for broad queries, decrease for precision.
  • Max Chars: The hard limit for the final output. SourcePrep will stop adding chunks once this budget is hit.
    Default: 24,000 chars (fits comfortably in most 32k context windows).

Output Toggles

  • Sources: Adds the @path/to/file:line-line citation header to each chunk.
    Essential for AI editors to provide clickable links.
  • Scores: Appends the relevance score (0.0-1.0) to each chunk.
    Useful for debugging why a specific piece of code was included.
  • Structured: Returns a JSON object instead of a text blob.
    Use this when building programmatic integrations.
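As a rough picture of what a structured response could contain, here is a hypothetical serializer; the real JSON schema may differ, and every field name below is an assumption:

```python
import json

def to_structured(chunks):
    """Serialize chunks as JSON with citation, score, and text.
    Field names are illustrative, not SourcePrep's actual schema."""
    return json.dumps({
        "chunks": [
            {
                "source": f"@{c['path']}:{c['start']}-{c['end']}",
                "score": round(c["score"], 2),
                "text": c["text"],
            }
            for c in chunks
        ]
    })
```

A programmatic integration would parse this instead of splitting a text blob on citation headers.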