Version 1.1 | March 2026 | Reviewed by Dev the Dev

What this reference covers: source format performance, the 50-source ceiling and how to manage it, the Ouroboros note-to-source conversion workflow, source stitching strategies, and the five-notebook segmentation taxonomy. If you need to understand why corpus management matters structurally, see the System Overview.

Platform Limits — Know These Before You Ingest

Resource	Hard Limit	Management Strategy
Notebooks per account	100	Segment by project / domain / cohort (see §4)
Sources per notebook	50	Ouroboros technique + source stitching (see §2, §3)
Words per source	500,000	Stitching maximizes this — combine small PDFs into one
Total corpus per notebook	~25 million words	Equivalent to ~25 large technical monographs
Context window (Gemini 1.5 Pro)	1M tokens	Near-perfect recall (>99.7%) up to this limit
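
The per-notebook corpus ceiling follows from the two limits above it in the table. A minimal sketch of that arithmetic, with the limits copied from the table; the constant and function names are illustrative and not part of any NotebookLM interface:

```python
# Limits copied from the table above. Names are illustrative only.
MAX_SOURCES_PER_NOTEBOOK = 50
MAX_WORDS_PER_SOURCE = 500_000


def max_words_per_notebook() -> int:
    """Upper bound on corpus size when every slot is filled to its word limit."""
    return MAX_SOURCES_PER_NOTEBOOK * MAX_WORDS_PER_SOURCE


def slots_needed(total_words: int) -> int:
    """Minimum number of source slots for total_words, assuming ideal stitching."""
    # Ceiling division without floats.
    return -(-total_words // MAX_WORDS_PER_SOURCE)


print(max_words_per_notebook())  # 25000000
print(slots_needed(1_200_000))   # 3
```

The second function gives only a lower bound: real documents rarely pack to exactly 500,000 words per slot.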

BD-001: Source Slot Ceiling. The 50-source limit is the primary operational constraint on long-running projects. At 40 sources, set an alert. At 45, begin Ouroboros conversion so that you are not forced to run it at the ceiling under pressure. Never attempt Ouroboros at 50: you will lose citation data. See Roadmap AI-001.
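
The thresholds in BD-001 can be written down as a small check. This is a sketch under the assumption that the source count is read off the notebook by hand and passed in; the function name and the status labels are hypothetical.

```python
ALERT_AT = 40    # BD-001: set an alert
CONVERT_AT = 45  # BD-001: begin Ouroboros conversion
CEILING = 50     # hard limit on sources per notebook


def slot_status(source_count: int) -> str:
    """Map a manually observed source count to the action BD-001 calls for."""
    if not 0 <= source_count <= CEILING:
        raise ValueError(f"source count must be between 0 and {CEILING}")
    if source_count == CEILING:
        return "ceiling"   # too late for a safe conversion
    if source_count >= CONVERT_AT:
        return "convert"   # begin Ouroboros now
    if source_count >= ALERT_AT:
        return "alert"     # plan the next conversion
    return "ok"


for count in (39, 40, 45, 50):
    print(count, slot_status(count))
```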

Source Ingestion and Format Performance

Not all source formats perform equally in NotebookLM's RAG pipeline. Retrieval quality depends on how cleanly the document can be chunked and vectorized. High-noise formats — visual-heavy PDFs, scanned documents — degrade retrieval precision.

Format	Retrieval Quality	Technical Consideration	Recommendation
Markdown / Plain Text	■■■■■ Highest	No layout noise; ideal for RAG chunking	Primary target format. Convert everything you can.
Google Docs / Word	■■■■ High	Structured formatting facilitates parsing	Acceptable. Export to Markdown for critical long-term sources.
Text-Based PDF	■■■ Strong	Multi-column layouts may cause chunking errors	Use. Convert to Markdown if passage retrieval is critical.
Scanned PDF	■■ Mixed	Sensitive to scan resolution and lighting	Apply OCR preprocessing before ingestion.
Handwritten Notes (OCR)	■ Variable	Cursive notation reduces reliability	Hybrid pipeline: OCR + Gemini self-correction pass.
Audio Overview (MP3)	■■ High abstraction	Multi-modal, conversational perspective	Track lineage carefully. Avoid multi-generation re-upload (see §2.3).
Website URLs	■■■ Variable	Dynamic content may not index correctly	Prefer static pages. Exclude dynamic URL patterns.
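
One way to use the table during ingestion planning is as a lookup that orders a batch of candidate sources by expected retrieval quality. The sketch below assumes the block counts in the table can be read as scores from 1 to 5; the dictionary keys and the `triage` helper are illustrative labels, not identifiers from any tool.

```python
# Scores are the block counts from the table above; keys are illustrative.
FORMAT_GUIDE = {
    "markdown": (5, "Primary target format"),
    "google_docs_word": (4, "Acceptable; export critical sources to Markdown"),
    "text_pdf": (3, "Use; convert to Markdown if passage retrieval is critical"),
    "web_url": (3, "Prefer static pages"),
    "scanned_pdf": (2, "Apply OCR preprocessing before ingestion"),
    "audio_overview": (2, "Track lineage; avoid multi-generation re-upload"),
    "handwritten_ocr": (1, "OCR plus a self-correction pass"),
}


def triage(formats: list[str]) -> list[tuple[str, int, str]]:
    """Return (format, score, recommendation) rows, best retrieval quality first."""
    rows = [(fmt, *FORMAT_GUIDE[fmt]) for fmt in formats]
    # sorted() is stable, so formats with equal scores keep their input order.
    return sorted(rows, key=lambda row: row[1], reverse=True)


for fmt, score, advice in triage(["scanned_pdf", "markdown", "text_pdf"]):
    print(f"{score} {fmt}: {advice}")
```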

The Ouroboros Technique

The Ouroboros technique converts accumulated AI-generated notes back into corpus sources, freeing slot capacity while preserving distilled knowledge. The name refers to the self-consuming quality of the cycle — the system feeds on its own outputs to compress and survive.

When to run Ouroboros

Source slot count reaches 40, the monitoring-alert threshold; begin conversion no later than 45 (see BD-001)
A research phase is complete and early working documents are no longer actively queried
A personnel handoff is approaching and knowledge needs to be distilled

Ouroboros workflow

Research Session 1–N
      │
      ▼
Accumulated MVAL Entries + AI Responses (in NotebookLM Notes)
      │
      ▼  BEFORE CONVERTING: embed metadata manually (see step 2)
"Convert to Source" → New Dense Source Document
      │
      ├─── ✓ Delete original bulky source files (slot freed)
      │
      └─── ✓ Verify: original citations are preserved in new source text

Step-by-step procedure

1. Select notes for conversion. In the NotebookLM UI, select the notes accumulated since the last Ouroboros cycle.
2. Embed citation metadata manually, before converting. Conversion strips inline citations. For every source referenced in those notes, manually add: the original citation (author, date, document title), page numbers or section references, and the source's role in the corpus. This step is mandatory: skipping it means the citation data is permanently lost.
3. Run "Convert to Source." NotebookLM generates a new source document from the selected notes.
4. Delete the original sources that have now been distilled into the new source. This frees their source slots.
5. Verify the new source. Query it for a specific passage from the original notes. Confirm the embedded metadata is retrievable.
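
The metadata in the embedding step has a fixed shape (citation, locator, role), so it can be prepared outside NotebookLM and pasted into the notes before conversion. The sketch below is one possible layout, not a prescribed format: the `Citation` class, the block header text, and the sample entry are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    author: str
    date: str
    title: str
    locator: str  # page numbers or section references
    role: str     # the source's role in the corpus


def missing_fields(citation: Citation) -> list[str]:
    """Names of fields left blank; conversion should wait until this is empty."""
    return [name for name, value in vars(citation).items() if not value.strip()]


def format_citation_block(citations: list[Citation]) -> str:
    """Render a plain-text block to paste into the notes before conversion."""
    lines = ["SOURCE METADATA (embedded before conversion)"]
    for index, c in enumerate(citations, start=1):
        lines.append(
            f"[{index}] {c.author} ({c.date}). {c.title}. {c.locator}. Role: {c.role}"
        )
    return "\n".join(lines)


example = Citation("A. Author", "2024", "Example Report", "pp. 12-18", "baseline method")
print(format_citation_block([example]))
```

Running `missing_fields` over every entry before conversion gives a simple manual gate for the mandatory step.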

BD-002: Citation Loss on Ouroboros Conversion (CRITICAL). Converting notes to sources strips original inline citations automatically, and they cannot be recovered after conversion. Mandate: manually embed original citation metadata before every conversion, without exception. A mandatory pre-conversion checklist is in development. See Roadmap AI-002.

Audio Overview as Ouroboros input

Re-uploading Audio Overview MP3s as corpus sources provides a multi-modal perspective and can surface connections the text-based corpus misses. However:

Track generational lineage — know which generation each audio source represents
Do not re-upload audio generated from a previous audio source (multi-generation summaries accumulate errors)
Audio sources count against the 50-source limit the same as any other source
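
Generational lineage can be tracked in a small registry kept alongside the notebook. The sketch below assumes each source records which sources it was generated from, and counts the audio steps in its ancestry; the `CorpusSource` class, the registry, and the sample names are hypothetical, and the ancestry is assumed to contain no cycles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusSource:
    name: str
    kind: str                            # "text" or "audio"
    derived_from: tuple[str, ...] = ()   # names of the sources it was generated from


def audio_generation(name: str, registry: dict[str, CorpusSource]) -> int:
    """Number of audio steps in the longest ancestry chain ending at this source."""
    source = registry[name]
    parent_depth = max(
        (audio_generation(parent, registry) for parent in source.derived_from),
        default=0,
    )
    return parent_depth + (1 if source.kind == "audio" else 0)


def safe_to_upload(name: str, registry: dict[str, CorpusSource]) -> bool:
    """True only for first-generation audio or plain text sources."""
    return audio_generation(name, registry) <= 1


registry = {
    "paper": CorpusSource("paper", "text"),
    "overview_1": CorpusSource("overview_1", "audio", ("paper",)),
    "overview_2": CorpusSource("overview_2", "audio", ("paper", "overview_1")),
}
print(safe_to_upload("overview_1", registry))  # True
print(safe_to_upload("overview_2", registry))  # False
```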

Source Stitching

Source stitching combines multiple small documents into a single large source file before upload. This bypasses the 50-source count limit by treating a collection of PDFs as one corpus entry rather than many.

Strategy	Mechanism	Benefit	Risk
Source Stitching	Combining multiple PDFs into one file before upload	Bypasses the 50-source count limit; maximizes 500k-word-per-source capacity	Slightly slower retrieval for specific passages within a stitched document
Ouroboros (Note → Source)	Converting AI-generated notes into a new source document	Distills knowledge, compresses history, clears source slots	Loss of inline citations if metadata is not embedded before conversion
Audio as Source	Re-uploading Audio Overview MP3s as corpus sources	Multi-modal perspective; may surface cross-source connections	Creeping errors across generational summaries
Metadata Tagging	Including author, title, date in the text flow of each source	Improves citation accuracy and retrieval specificity	Manual overhead in document preparation
Notebook Segmentation	Splitting corpus by content type across multiple notebooks	64% retrieval improvement (benchmarked); prevents cross-domain noise	Requires disciplined categorization at ingestion time
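
Stitching and metadata tagging combine naturally: each document gets a header line in the text flow, and documents are packed into as few files as the word limit allows. The sketch below works on already-extracted text (PDFs would need text extraction first) and counts words by whitespace splitting, which is an assumption and may not match how NotebookLM counts. The header format and the `stitch` function are illustrative.

```python
MAX_WORDS_PER_SOURCE = 500_000


def stitch(
    documents: list[tuple[str, str]], max_words: int = MAX_WORDS_PER_SOURCE
) -> list[str]:
    """Greedily pack (title, text) pairs into stitched sources under max_words each."""
    bundles: list[str] = []
    current: list[str] = []
    current_words = 0
    for title, text in documents:
        # The header keeps the document title in the text flow of the stitched file.
        section = f"=== SOURCE: {title} ===\n{text.strip()}\n"
        words = len(section.split())
        if words > max_words:
            raise ValueError(f"{title!r} alone exceeds the per-source word limit")
        if current and current_words + words > max_words:
            bundles.append("\n".join(current))
            current, current_words = [], 0
        current.append(section)
        current_words += words
    if current:
        bundles.append("\n".join(current))
    return bundles


docs = [("a", "one two three"), ("b", "four five six"), ("c", "seven")]
print(len(stitch(docs, max_words=12)))  # 2
```

Greedy packing preserves the input order, which keeps related papers adjacent at the cost of occasionally using one more slot than an optimal packing would.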

When to stitch vs. when to segment

Stitch when documents are the same type and domain — e.g., a set of academic papers on the same topic. Stitching keeps them in one notebook and saves source slots.
Segment when documents serve different roles — research literature, project charters, MVAL logs, and failure archives should each live in their own notebook. Mixing them degrades retrieval quality for all of them.

Notebook Segmentation Strategy

A single notebook with all document types produces degraded retrieval quality across all queries — the system cannot distinguish which context is relevant for which question. The five-notebook taxonomy separates documents by role, not just by topic.

📋 Notebook Type 1 — Project Charter Notebook

Contents: Project charters, institutional standards, compliance protocols, team agreements, degree requirements.

Rationale: Grounds the Tutor role in authoritative institutional context. Isolated from research data so guidance is always drawn from the governance layer, not contaminated by experimental results.

Update frequency: Low. Update when standards or protocols change.

🔬 Notebook Type 2 — Active Research Notebook

Contents: MVAL entries, experiment logs, pipeline documentation, session notes, working hypotheses.

Rationale: The primary working notebook. Updated continuously. Subject to Ouroboros compression when source count approaches ceiling.

Update frequency: High — after every research session.

📚 Notebook Type 3 — Literature Notebook

Contents: Academic papers, stitched research surveys, external technical documentation, benchmark reports.

Rationale: Separates authoritative external sources from internal logs. Prevents internal working notes from contaminating citation-backed retrieval of external literature.

Update frequency: Medium — as new literature is reviewed.

🤝 Notebook Type 4 — Handoff Notebook

Contents: Distilled MVAL summaries, onboarding guides, personnel transition documents, "state of the project" snapshots.

Rationale: Designed for new-reader optimization. Every document in this notebook should be readable by someone who has never seen the project before. This is the institutional memory artifact — the one that survives personnel transitions.

Update frequency: At transition events: role changes, project milestones, semester boundaries, OPT/visa transitions.

🗃️ Notebook Type 5 — Failure Archive

Contents: Failed experiment logs, abandoned approach documentation, dead-end records, negative results.

Rationale: A searchable record of what did not work, preventing duplicate negative work across the team and across time. A failure archive pays for itself the first time it is queried: the moment someone asks "has anyone tried X?" before spending two weeks on X.

Update frequency: After every failure event, per the MVAL Failure Artifact Protocol.
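
Because the taxonomy separates documents by role, routing at ingestion time can be a plain lookup. The sketch below maps a few of the document roles listed above to the five notebook types; the role keys are hypothetical labels, and a real mapping would cover every content type a project actually uses.

```python
from enum import Enum


class Notebook(Enum):
    CHARTER = "Project Charter Notebook"
    RESEARCH = "Active Research Notebook"
    LITERATURE = "Literature Notebook"
    HANDOFF = "Handoff Notebook"
    FAILURES = "Failure Archive"


# Illustrative subset of the contents listed for each notebook type.
ROLE_TO_NOTEBOOK = {
    "project_charter": Notebook.CHARTER,
    "compliance_protocol": Notebook.CHARTER,
    "mval_entry": Notebook.RESEARCH,
    "experiment_log": Notebook.RESEARCH,
    "academic_paper": Notebook.LITERATURE,
    "benchmark_report": Notebook.LITERATURE,
    "onboarding_guide": Notebook.HANDOFF,
    "project_snapshot": Notebook.HANDOFF,
    "failed_experiment_log": Notebook.FAILURES,
    "negative_result": Notebook.FAILURES,
}


def route(role: str) -> Notebook:
    """Return the notebook a document belongs in, refusing unknown roles."""
    try:
        return ROLE_TO_NOTEBOOK[role]
    except KeyError:
        raise ValueError(f"uncategorized document role: {role!r}") from None


print(route("academic_paper").value)         # Literature Notebook
print(route("failed_experiment_log").value)  # Failure Archive
```

Raising on unknown roles, rather than defaulting to the research notebook, forces the categorization decision to be made at ingestion time.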

Retrieval Quality by Segmentation Strategy

Configuration	Retrieval Quality	Citation Precision	Cross-domain noise
Single mixed notebook (all types together)	Degraded	Low	High
Two notebooks (literature vs. logs)	Improved	Moderate	Reduced
Five-notebook taxonomy (full segmentation)	+64% vs. single	High	Minimal

Naming conventions: A standardized notebook naming taxonomy is in development. See Roadmap AI-006 for the planned taxonomy standard. Until then, use consistent prefixes per type (e.g., [PROJECT]-charter, [PROJECT]-research, [PROJECT]-lit, [PROJECT]-handoff, [PROJECT]-failures).
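
Until the taxonomy standard lands, the interim convention is simple enough to generate mechanically. The sketch below uses the five suffixes from the example above; replacing whitespace in the project name with hyphens is an assumption of this sketch, not part of the stated convention.

```python
import re

# Suffixes taken from the interim convention above.
SUFFIXES = ("charter", "research", "lit", "handoff", "failures")


def notebook_name(project: str, suffix: str) -> str:
    """Build a notebook name of the form PROJECT-suffix."""
    if suffix not in SUFFIXES:
        raise ValueError(f"unknown notebook suffix: {suffix!r}")
    # Assumption: collapse whitespace to hyphens so names stay single tokens.
    slug = re.sub(r"\s+", "-", project.strip())
    if not slug:
        raise ValueError("project name must not be empty")
    return f"{slug}-{suffix}"


def all_notebook_names(project: str) -> list[str]:
    """The full five-notebook set for one project."""
    return [notebook_name(project, suffix) for suffix in SUFFIXES]


print(all_notebook_names("ATLAS"))
```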

CORPUS MANAGEMENT