How to structure, maintain, and stay within the limits of a NotebookLM corpus. For researchers and team leads managing active notebooks.
| Resource | Hard Limit | Management Strategy |
|---|---|---|
| Notebooks per account | 100 | Segment by project / domain / cohort (see §4) |
| Sources per notebook | 50 | Ouroboros technique + source stitching (see §2, §3) |
| Words per source | 500,000 | Stitching maximizes this: combine small PDFs into one |
| Total corpus per notebook | ~25 million words | Equivalent to ~25 large technical monographs |
| Context window (Gemini 1.5 Pro) | 1M tokens | Near-perfect recall (>99.7%) up to this limit |
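
The corpus figure in the table follows directly from the two per-notebook limits (50 sources × 500,000 words = 25 million words). A minimal sketch of that arithmetic, plus a pre-upload check, is shown here. The function names are illustrative, and the word counts are assumed to be supplied by the caller, since how NotebookLM itself counts words is not stated in this guide.

```python
# Limits copied from the table above; function names are hypothetical.
MAX_SOURCES_PER_NOTEBOOK = 50
MAX_WORDS_PER_SOURCE = 500_000


def notebook_capacity_words() -> int:
    """Upper bound on words in one notebook: sources x words per source."""
    return MAX_SOURCES_PER_NOTEBOOK * MAX_WORDS_PER_SOURCE


def check_notebook(source_word_counts: list[int]) -> list[str]:
    """Return a list of limit violations for a planned set of sources."""
    problems = []
    if len(source_word_counts) > MAX_SOURCES_PER_NOTEBOOK:
        problems.append(
            f"{len(source_word_counts)} sources exceeds the limit of "
            f"{MAX_SOURCES_PER_NOTEBOOK}"
        )
    for index, words in enumerate(source_word_counts):
        if words > MAX_WORDS_PER_SOURCE:
            problems.append(
                f"source {index} has {words} words, over the limit of "
                f"{MAX_WORDS_PER_SOURCE}"
            )
    return problems
```

An empty list from `check_notebook` means the planned sources fit within both limits.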
Not all source formats perform equally in NotebookLM's RAG pipeline. Retrieval quality depends on how cleanly the document can be chunked and vectorized. High-noise formats, such as visual-heavy PDFs and scanned documents, degrade retrieval precision.
| Format | Retrieval Quality | Technical Consideration | Recommendation |
|---|---|---|---|
| Markdown / Plain Text | ★★★★★ Highest | No layout noise; ideal for RAG chunking | Primary target format. Convert everything you can. |
| Google Docs / Word | ★★★★ High | Structured formatting facilitates parsing | Acceptable. Export to Markdown for critical long-term sources. |
| Text-Based PDF | ★★★ Strong | Multi-column layouts may cause chunking errors | Use. Convert to Markdown if passage retrieval is critical. |
| Scanned PDF | ★★ Mixed | Sensitive to scan resolution and lighting | Apply OCR preprocessing before ingestion. |
| Handwritten Notes (OCR) | ★ Variable | Cursive notation reduces reliability | Hybrid pipeline: OCR + Gemini self-correction pass. |
| Audio Overview (MP3) | ★★ High abstraction | Multi-modal, conversational perspective | Track lineage carefully. Avoid multi-generation re-upload (see §2.3). |
| Website URLs | ★★★ Variable | Dynamic content may not index correctly | Prefer static pages. Exclude dynamic URL patterns. |
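
The recommendations in the table can be applied mechanically when triaging a folder of candidate sources. The sketch below is one possible way to do that. The mapping from file extensions to table rows is an assumption made for illustration, and because a scanned PDF cannot be told apart from a text-based one by extension, the caller passes that in as a flag.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the table's recommendation.
RECOMMENDATIONS = {
    ".md": "use as-is (primary target format)",
    ".txt": "use as-is (primary target format)",
    ".docx": "acceptable; export to Markdown for critical long-term sources",
    ".pdf": "use; convert to Markdown if passage retrieval is critical",
    ".mp3": "track lineage; avoid multi-generation re-upload",
}


def triage(filename: str, scanned: bool = False) -> str:
    """Return the ingestion recommendation for a candidate source file."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf" and scanned:
        return "apply OCR preprocessing before ingestion"
    return RECOMMENDATIONS.get(suffix, "unknown format; review manually")
```

For example, `triage("survey.pdf", scanned=True)` returns the OCR recommendation, while an unrecognized extension falls through to manual review.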
The Ouroboros technique converts accumulated AI-generated notes back into corpus sources, freeing slot capacity while preserving distilled knowledge. The name refers to the self-consuming quality of the cycle: the system feeds on its own outputs to compress and survive.
OUROBOROS WORKFLOW
Research Session 1–N
│
▼
Accumulated MVAL Entries + AI Responses (in NotebookLM Notes)
│
▼ BEFORE CONVERTING: embed metadata manually (see step 2)
"Convert to Source" → New Dense Source Document
│
├─── Delete original bulky source files (slot freed)
│
└─── Verify: original citations are preserved in new source text
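
The "embed metadata manually" step in the workflow can be made repeatable by writing the metadata into the note text itself before using "Convert to Source". The sketch below assumes the note has been copied out as plain text. The header layout and the function name are illustrative choices, not a NotebookLM feature.

```python
from datetime import date


def embed_metadata(
    note_text: str,
    title: str,
    author: str,
    written: date,
    cited_sources: list[str],
) -> str:
    """Prepend a plain-text metadata header so it survives conversion."""
    header = [
        f"Title: {title}",
        f"Author: {author}",
        f"Date: {written.isoformat()}",
        "Cited sources:",
        *[f"- {source}" for source in cited_sources],
    ]
    return "\n".join(header) + "\n\n" + note_text.strip() + "\n"
```

Listing the cited sources by name in the header keeps a trace of the original citations in the new source text, which is what the final "Verify" step checks for.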
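Re-uploading Audio Overview MP3s as corpus sources provides a multi-modal perspective and can surface connections the text-based corpus misses. However, errors can creep in across generational summaries, so track each file's lineage and avoid re-uploading multi-generation audio.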
Source stitching combines multiple small documents into a single large source file before upload. This bypasses the 50-source count limit by treating a collection of PDFs as one corpus entry rather than many.
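
A minimal sketch of stitching is shown here, assuming the text has already been extracted from each small document. It packs documents greedily, in order, into as few stitched sources as possible without exceeding the per-source word limit. The separator line format is an illustrative choice, and whitespace-based word counting is an assumption that may not match how NotebookLM counts words.

```python
MAX_WORDS_PER_SOURCE = 500_000  # per-source limit from the limits table


def stitch(
    documents: list[tuple[str, str]],
    max_words: int = MAX_WORDS_PER_SOURCE,
) -> list[str]:
    """Pack (name, text) pairs into stitched sources under the word limit."""
    stitched: list[str] = []
    current_parts: list[str] = []
    current_words = 0
    for name, text in documents:
        # A labeled separator keeps each original document identifiable.
        section = f"=== SOURCE: {name} ===\n{text.strip()}\n"
        words = len(section.split())
        if words > max_words:
            raise ValueError(f"{name} alone exceeds {max_words} words")
        if current_parts and current_words + words > max_words:
            # Current stitched source is full; start a new one.
            stitched.append("\n".join(current_parts))
            current_parts, current_words = [], 0
        current_parts.append(section)
        current_words += words
    if current_parts:
        stitched.append("\n".join(current_parts))
    return stitched
```

Each returned string can be saved as one file and uploaded as a single source. Keeping the separator lines helps offset the slower passage retrieval inside stitched documents, since the original document name stays in the text flow.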
| Strategy | Mechanism | Benefit | Risk |
|---|---|---|---|
| Source Stitching | Combining multiple PDFs into one file before upload | Bypasses the 50-source count limit; maximizes 500k-word-per-source capacity | Slightly slower retrieval for specific passages within a stitched document |
| Ouroboros (Note → Source) | Converting AI-generated notes into a new source document | Distills knowledge, compresses history, clears source slots | Loss of inline citations if metadata is not embedded before conversion |
| Audio as Source | Re-uploading Audio Overview MP3s as corpus sources | Multi-modal perspective; may surface cross-source connections | Creeping errors across generational summaries |
| Metadata Tagging | Including author, title, date in the text flow of each source | Improves citation accuracy and retrieval specificity | Manual overhead in document preparation |
| Notebook Segmentation | Splitting corpus by content type across multiple notebooks | 64% retrieval improvement (benchmarked); prevents cross-domain noise | Requires disciplined categorization at ingestion time |
A single notebook with all document types produces degraded retrieval quality across all queries, because the system cannot distinguish which context is relevant for which question. The five-notebook taxonomy separates documents by role, not just by topic.
Contents: Project charters, institutional standards, compliance protocols, team agreements, degree requirements.
Rationale: Grounds the Tutor role in authoritative institutional context. Isolated from research data so guidance is always drawn from the governance layer, not contaminated by experimental results.
Update frequency: Low. Update when standards or protocols change.
Contents: MVAL entries, experiment logs, pipeline documentation, session notes, working hypotheses.
Rationale: The primary working notebook. Updated continuously. Subject to Ouroboros compression when source count approaches ceiling.
Update frequency: High, after every research session.
Contents: Academic papers, stitched research surveys, external technical documentation, benchmark reports.
Rationale: Separates authoritative external sources from internal logs. Prevents internal working notes from contaminating citation-backed retrieval of external literature.
Update frequency: Medium, as new literature is reviewed.
Contents: Distilled MVAL summaries, onboarding guides, personnel transition documents, "state of the project" snapshots.
Rationale: Designed for new-reader optimization. Every document in this notebook should be readable by someone who has never seen the project before. This is the institutional memory artifact, the one that survives personnel transitions.
Update frequency: At transition events: role changes, project milestones, semester boundaries, OPT/visa transitions.
Contents: Failed experiment logs, abandoned approach documentation, dead-end records, negative results.
Rationale: A searchable record of what did not work, preventing duplicate negative work across the team and across time. A failure archive pays for itself the first time someone asks "has anyone tried X?" before spending two weeks on X.
Update frequency: After every failure event, per the MVAL Failure Artifact Protocol.
| Configuration | Retrieval Quality | Citation Precision | Cross-domain noise |
|---|---|---|---|
| Single mixed notebook (all types together) | Degraded | Low | High |
| Two notebooks (literature vs. logs) | Improved | Moderate | Reduced |
| Five-notebook taxonomy (full segmentation) | +64% vs. single | High | Minimal |
Notebook names follow the pattern [PROJECT]-charter, [PROJECT]-research, [PROJECT]-lit, [PROJECT]-handoff, [PROJECT]-failures.
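
Routing at ingestion time can be sketched as a simple lookup. The notebook suffixes come from the names listed above. The document-type keys are hypothetical labels, chosen here to mirror the contents described for each of the five notebooks.

```python
# Hypothetical document-type labels mapped to the five notebook suffixes.
NOTEBOOK_SUFFIX_BY_DOC_TYPE = {
    "project_charter": "charter",
    "compliance_protocol": "charter",
    "experiment_log": "research",
    "session_notes": "research",
    "academic_paper": "lit",
    "benchmark_report": "lit",
    "onboarding_guide": "handoff",
    "project_snapshot": "handoff",
    "failed_experiment": "failures",
    "negative_result": "failures",
}


def notebook_for(project: str, doc_type: str) -> str:
    """Return the target notebook name for a document of the given type."""
    try:
        suffix = NOTEBOOK_SUFFIX_BY_DOC_TYPE[doc_type]
    except KeyError:
        raise ValueError(f"no notebook assigned for document type: {doc_type}")
    return f"{project}-{suffix}"
```

Rejecting unknown document types, instead of defaulting to one notebook, forces the categorization decision to be made at ingestion time.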