📄 Reference: Boyle System Suite · Document 3 of 6

CORPUS MANAGEMENT

Source Ingestion, Ouroboros Technique & Notebook Segmentation

How to structure, maintain, and stay within the limits of a NotebookLM corpus. For researchers and team leads managing active notebooks.

Version 1.1 | March 2026

Medhavy AI, LLC  |  Bear Brown LLC  |  Humanitarians AI (501(c)(3))

Reviewed by Dev the Dev
What this reference covers: Source format performance, the 50-source ceiling and how to manage it, the Ouroboros note-to-source conversion workflow, source stitching strategies, and the five-notebook segmentation taxonomy. If you need to understand why corpus management matters structurally, see the System Overview.

Platform Limits: Know These Before You Ingest

Resource | Hard Limit | Management Strategy
Notebooks per account | 100 | Segment by project / domain / cohort (see §4)
Sources per notebook | 50 | Ouroboros technique + source stitching (see §2, §3)
Words per source | 500,000 | Stitching maximizes this: combine small PDFs into one
Total corpus per notebook | ~25 million words | Equivalent to ~25 large technical monographs
Context window (Gemini 1.5 Pro) | 1M tokens | Near-perfect recall (>99.7%) up to this limit
BD-001: Source Slot Ceiling. The 50-source limit is the primary operational constraint on long-running projects. At 40 sources, set an alert. At 45, begin Ouroboros conversion before you hit the ceiling under pressure. Never attempt Ouroboros at 50: you will lose citation data. See Roadmap AI-001.
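
The BD-001 thresholds can be written down as a small check. This is a minimal sketch, assuming the source count is read off the notebook by hand; the names `SlotStatus` and `slot_status` are hypothetical and not part of any NotebookLM interface.

```python
from enum import Enum

SOURCE_CEILING = 50  # hard limit per notebook (from the limits table)
ALERT_AT = 40        # BD-001: set an alert
CONVERT_AT = 45      # BD-001: begin Ouroboros conversion


class SlotStatus(Enum):
    OK = "ok"
    ALERT = "alert"
    CONVERT_NOW = "convert_now"
    AT_CEILING = "at_ceiling"


def slot_status(source_count: int) -> SlotStatus:
    """Classify a notebook's source count against the BD-001 thresholds."""
    if not 0 <= source_count <= SOURCE_CEILING:
        raise ValueError(f"source count must be between 0 and {SOURCE_CEILING}")
    if source_count == SOURCE_CEILING:
        return SlotStatus.AT_CEILING
    if source_count >= CONVERT_AT:
        return SlotStatus.CONVERT_NOW
    if source_count >= ALERT_AT:
        return SlotStatus.ALERT
    return SlotStatus.OK
```

A team lead might run `slot_status(count)` for each active notebook during a weekly review and schedule conversion for any notebook that reports `CONVERT_NOW`.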

Source Ingestion and Format Performance

Not all source formats perform equally in NotebookLM's RAG pipeline. Retrieval quality depends on how cleanly the document can be chunked and vectorized. High-noise formats, such as visual-heavy PDFs and scanned documents, degrade retrieval precision.

Format | Retrieval Quality | Technical Consideration | Recommendation
Markdown / Plain Text | ■■■■■ Highest | No layout noise; ideal for RAG chunking | Primary target format. Convert everything you can.
Google Docs / Word | ■■■■ High | Structured formatting facilitates parsing | Acceptable. Export to Markdown for critical long-term sources.
Text-Based PDF | ■■■ Strong | Multi-column layouts may cause chunking errors | Use. Convert to Markdown if passage retrieval is critical.
Scanned PDF | ■■ Mixed | Sensitive to scan resolution and lighting | Apply OCR preprocessing before ingestion.
Handwritten Notes (OCR) | ■ Variable | Cursive notation reduces reliability | Hybrid pipeline: OCR + Gemini self-correction pass.
Audio Overview (MP3) | ■■ High abstraction | Multi-modal, conversational perspective | Track lineage carefully. Avoid multi-generation re-upload (see §2.3).
Website URLs | ■■■ Variable | Dynamic content may not index correctly | Prefer static pages. Exclude dynamic URL patterns.
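
The recommendations in the table can be turned into a rough pre-ingestion triage. The sketch below is illustrative only: the extension-to-action mapping and the name `prep_action` are assumptions, and a `.pdf` extension alone cannot tell a scanned PDF from a text-based one, so that case is flagged for a manual check.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the table's recommendation.
PREP_BY_EXTENSION = {
    ".md": "ingest as-is",
    ".txt": "ingest as-is",
    ".docx": "ingest; export to Markdown if the source is critical long-term",
    ".pdf": "check for a text layer; apply OCR first if scanned",
    ".mp3": "record lineage; avoid multi-generation re-upload",
}


def prep_action(filename: str) -> str:
    """Return the suggested preparation step for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    return PREP_BY_EXTENSION.get(suffix, "review manually")
```

Anything the mapping does not recognize falls through to manual review rather than being ingested silently.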

The Ouroboros Technique

The Ouroboros technique converts accumulated AI-generated notes back into corpus sources, freeing slot capacity while preserving distilled knowledge. The name refers to the self-consuming quality of the cycle: the system feeds on its own outputs to compress and survive.

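When to run Ouroboros

Begin conversion when a notebook reaches 45 sources, and never wait until it reaches the 50-source ceiling (see BD-001).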

Ouroboros workflow

Research Session 1–N
      │
      ▼
Accumulated MVAL Entries + AI Responses (in NotebookLM Notes)
      │
      ▼  BEFORE CONVERTING: embed metadata manually (see step 2)
"Convert to Source" → New Dense Source Document
      │
      ├─── ✓ Delete original bulky source files (slot freed)
      │
      └─── ✓ Verify: original citations are preserved in new source text

Step-by-step procedure

  1. Select notes for conversion. In the NotebookLM UI, select the notes accumulated since the last Ouroboros cycle.
  2. Embed citation metadata manually before converting. Conversion strips inline citations. For every source referenced in those notes, manually add: original citation (author, date, document title), page numbers or section references, and the source's role in the corpus. This step is mandatory: skipping it means the citation data is permanently lost.
  3. Run "Convert to Source." NotebookLM generates a new source document from the selected notes.
  4. Delete the original sources that have now been distilled into the new source. This frees their source slots.
  5. Verify the new source. Query it for a specific passage from the original notes. Confirm the embedded metadata is retrievable.
BD-002: Citation Loss on Ouroboros Conversion (CRITICAL). Converting notes to sources strips original inline citations automatically. This is not recoverable after conversion. Mandate: manually embed original citation metadata before every conversion, without exception. A mandatory pre-conversion checklist is in development. See Roadmap AI-002.
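
Step 2 can be made harder to skip by preparing the note text outside NotebookLM first. The following is a minimal sketch, not the planned checklist from Roadmap AI-002: the `Citation` fields mirror the metadata listed in step 2, while the helper name `embed_citations` and the `SOURCE CITATIONS` block layout are assumptions. The result still has to be pasted back into the note by hand before running "Convert to Source."

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Citation:
    author: str
    date: str
    title: str
    locator: str  # page numbers or section reference
    role: str     # the source's role in the corpus


def embed_citations(note_text: str, citations: list[Citation]) -> str:
    """Append a plain-text citation block to note text before conversion."""
    if not citations:
        raise ValueError("refusing to prepare a note with no citation metadata")
    lines = [note_text.rstrip(), "", "SOURCE CITATIONS"]
    for i, c in enumerate(citations, start=1):
        # Every field listed in step 2 must be filled in.
        missing = [name for name, value in asdict(c).items() if not value.strip()]
        if missing:
            raise ValueError(f"citation {i} is missing: {', '.join(missing)}")
        lines.append(
            f"[{i}] {c.author} ({c.date}). {c.title}. {c.locator}. Role: {c.role}"
        )
    return "\n".join(lines)
```

Because the citations live in the body text rather than in inline citation markers, they should survive conversion as ordinary content, which is what step 5 then verifies.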

Audio Overview as Ouroboros input

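Re-uploading Audio Overview MP3s as corpus sources provides a multi-modal perspective and can surface connections the text-based corpus misses. However, errors creep in across generational summaries, so track the lineage of each audio source carefully and avoid multi-generation re-upload.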

Source Stitching

Source stitching combines multiple small documents into a single large source file before upload. This bypasses the 50-source count limit by treating a collection of PDFs as one corpus entry rather than many.
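
Deciding which documents to stitch together is a packing problem against the 500,000-word per-source limit. The sketch below uses a first-fit-decreasing heuristic, which is one reasonable choice rather than a prescribed method; the name `plan_stitching` is hypothetical, and word counts are assumed to have been measured beforehand. It only plans the bundles and does not merge any files.

```python
WORDS_PER_SOURCE_LIMIT = 500_000  # from the platform limits table


def plan_stitching(
    word_counts: dict[str, int], limit: int = WORDS_PER_SOURCE_LIMIT
) -> list[list[str]]:
    """Group documents into bundles whose total word count stays within limit.

    First-fit decreasing: take the largest documents first and place each
    one in the first bundle that still has room.
    """
    bundles: list[list[str]] = []
    totals: list[int] = []
    ordered = sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, words in ordered:
        if words > limit:
            raise ValueError(f"{name} alone exceeds the {limit}-word limit")
        for i, total in enumerate(totals):
            if total + words <= limit:
                bundles[i].append(name)
                totals[i] += words
                break
        else:
            # No existing bundle has room, so open a new one.
            bundles.append([name])
            totals.append(words)
    return bundles
```

Each returned bundle corresponds to one stitched file and therefore one source slot, so `len(plan_stitching(...))` estimates how many slots the collection will occupy.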

Strategy | Mechanism | Benefit | Risk
Source Stitching | Combining multiple PDFs into one file before upload | Bypasses the 50-source count limit; maximizes 500k-word-per-source capacity | Slightly slower retrieval for specific passages within a stitched document
Ouroboros (Note → Source) | Converting AI-generated notes into a new source document | Distills knowledge, compresses history, clears source slots | Loss of inline citations if metadata is not embedded before conversion
Audio as Source | Re-uploading Audio Overview MP3s as corpus sources | Multi-modal perspective; may surface cross-source connections | Creeping errors across generational summaries
Metadata Tagging | Including author, title, date in the text flow of each source | Improves citation accuracy and retrieval specificity | Manual overhead in document preparation
Notebook Segmentation | Splitting corpus by content type across multiple notebooks | 64% retrieval improvement (benchmarked); prevents cross-domain noise | Requires disciplined categorization at ingestion time

When to stitch vs. when to segment

Notebook Segmentation Strategy

A single notebook with all document types produces degraded retrieval quality across all queries, because the system cannot distinguish which context is relevant for which question. The five-notebook taxonomy separates documents by role, not just by topic.

📋 Notebook Type 1: Project Charter Notebook

Contents: Project charters, institutional standards, compliance protocols, team agreements, degree requirements.

Rationale: Grounds the Tutor role in authoritative institutional context. Isolated from research data so guidance is always drawn from the governance layer, not contaminated by experimental results.

Update frequency: Low. Update when standards or protocols change.

🔬 Notebook Type 2: Active Research Notebook

Contents: MVAL entries, experiment logs, pipeline documentation, session notes, working hypotheses.

Rationale: The primary working notebook. Updated continuously. Subject to Ouroboros compression when the source count approaches the ceiling.

Update frequency: High. Update after every research session.

📚 Notebook Type 3: Literature Notebook

Contents: Academic papers, stitched research surveys, external technical documentation, benchmark reports.

Rationale: Separates authoritative external sources from internal logs. Prevents internal working notes from contaminating citation-backed retrieval of external literature.

Update frequency: Medium. Update as new literature is reviewed.

🤝 Notebook Type 4: Handoff Notebook

Contents: Distilled MVAL summaries, onboarding guides, personnel transition documents, "state of the project" snapshots.

Rationale: Designed for new-reader optimization. Every document in this notebook should be readable by someone who has never seen the project before. This is the institutional memory artifact, the one that survives personnel transitions.

Update frequency: At transition events (role changes, project milestones, semester boundaries, OPT/visa transitions).

🗃️ Notebook Type 5: Failure Archive

Contents: Failed experiment logs, abandoned approach documentation, dead-end records, negative results.

Rationale: A searchable record of what did not work, preventing duplicate negative work across the team and across time. A failure archive pays for itself the first time someone asks "has anyone tried X?" before spending two weeks on X.

Update frequency: After every failure event, per the MVAL Failure Artifact Protocol.
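
The taxonomy above amounts to a routing rule applied at ingestion time. As a rough sketch, assuming each document is labeled with a kind before upload, a lookup table can make the rule explicit. The kind labels and the name `route` are hypothetical; the notebook labels reuse the suffixes suggested in the naming note at the end of this reference.

```python
# Hypothetical document kinds, each mapped to one of the five notebook types.
NOTEBOOK_FOR_KIND = {
    "project charter": "charter",
    "compliance protocol": "charter",
    "mval entry": "research",
    "experiment log": "research",
    "academic paper": "lit",
    "benchmark report": "lit",
    "onboarding guide": "handoff",
    "project snapshot": "handoff",
    "failed experiment log": "failures",
    "negative result": "failures",
}


def route(kind: str) -> str:
    """Return the notebook type a document of the given kind belongs in."""
    key = kind.strip().lower()
    if key not in NOTEBOOK_FOR_KIND:
        raise ValueError(f"unknown document kind: {kind!r}; categorize it first")
    return NOTEBOOK_FOR_KIND[key]
```

Raising on unknown kinds, rather than defaulting to the research notebook, reflects the discipline the taxonomy depends on: every document is categorized before it is ingested.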

Retrieval Quality by Segmentation Strategy

Configuration | Retrieval Quality | Citation Precision | Cross-domain Noise
Single mixed notebook (all types together) | Degraded | Low | High
Two notebooks (literature vs. logs) | Improved | Moderate | Reduced
Five-notebook taxonomy (full segmentation) | +64% vs. single | High | Minimal
Naming conventions: A standardized notebook naming taxonomy is in development. See Roadmap AI-006 for the planned taxonomy standard. Until then: use consistent prefixes per type (e.g., [PROJECT]-charter, [PROJECT]-research, [PROJECT]-lit, [PROJECT]-handoff, [PROJECT]-failures).
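
Until the AI-006 standard exists, the interim prefix convention can be generated rather than typed by hand. This is a minimal sketch: the function name `notebook_names` is hypothetical, and upper-casing and hyphenating the project name is an assumption about how `[PROJECT]` should be filled in, not part of the convention itself.

```python
import re

# Suffixes from the interim naming convention, one per notebook type.
NOTEBOOK_SUFFIXES = ("charter", "research", "lit", "handoff", "failures")


def notebook_names(project: str) -> list[str]:
    """Build the five notebook names for a project using the interim prefixes."""
    # Collapse anything that is not a letter or digit into a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", project.strip()).strip("-").upper()
    if not slug:
        raise ValueError("project name must contain letters or digits")
    return [f"{slug}-{suffix}" for suffix in NOTEBOOK_SUFFIXES]
```

Generating all five names at project start makes it obvious when a notebook type is missing, even if some notebooks stay empty for a while.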