Wikipedia keyword detection for pop-ups, and NanoLex fine-tuning for lexical entry generation — now separated and specified.
The original v0.1 spec described a single pipeline from Wikipedia to MDX. On review, the work has two distinct purposes that should be treated as separate systems with different inputs, outputs, and technical requirements.
Pipeline A — Keyword pop-up generation answers: which terms in a Medhavi textbook page link to a Wikipedia entry, and what does that pop-up show? This is a detection and rendering problem. Wikipedia and the existing subwiki.py parser can handle it almost entirely. No language model fine-tuning is required.
Pipeline B — Lexical entry generation answers: can we automatically generate a NanoLex-style lexical dictionary entry for any nanomedicine term — including terms not yet in NanoLex? This is a generation problem. It requires fine-tuning a model on the approximately 1,000 mapped NanoLex entries so it can produce structurally correct, domain-accurate entries for the remaining Wikipedia corpus. The Wikipedia match provides the source text; the fine-tuned model extends it into a full lexical entry.
These two systems share the same JSONL schema and the same keyword corpus, but their downstream consumers are different: Pipeline A feeds the Medhavi MDX layer directly; Pipeline B produces training data and a runtime lexicographer for future content enrichment.
- subwiki.py — parse Wikipedia XML, extract articles by category, output JSONL with keyword, source URL, images, categories
- mdx_exporter.py — render matched keywords as MDX pop-up components (blocked: needs MDX contract from Dhruv)

The JSONL corpus produced by subwiki.py contains every term that has a Wikipedia article in the Biotechnology and Nanotechnology categories. Each record includes the term name (the keyword field), the source URL, and associated images. This corpus acts as the lookup table for keyword detection.
For any Medhavi textbook page, a detection pass scans the page text and identifies token sequences that match a keyword field in the corpus. A matched term becomes an interactive element in the rendered MDX: hovering or tapping it triggers a pop-up that shows a summary drawn from the Wikipedia entry or, where a lexical entry has been generated (Pipeline B), a structured definition from the NanoLex-style record.
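The detection pass can be sketched as follows. This is a minimal illustration, not the final implementation: it assumes the corpus is the subwiki.py JSONL with a lowercase-matchable `keyword` field, and uses regex word boundaries with longest-first ordering so multi-word terms (see Open Question 5) are found before their substrings.

```python
import json
import re

def load_corpus(jsonl_path):
    """Map lowercased keyword -> full JSONL record, for O(1) lookup."""
    corpus = {}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            corpus[record["keyword"].lower()] = record
    return corpus

def detect_keywords(page_text, corpus):
    """Return corpus records whose keyword occurs in the page text.

    Keywords are tried longest-first so multi-word terms such as
    "gold nanoparticle" are found before their substrings.
    """
    lowered = page_text.lower()
    matches = []
    for keyword in sorted(corpus, key=len, reverse=True):
        # \b word boundaries prevent matches inside longer tokens
        if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
            matches.append(corpus[keyword])
    return matches
```

A production pass would also need to record match positions for MDX rendering and decide how to handle overlapping matches; this sketch only answers "which terms appear on this page."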
The pop-up component must receive at minimum: the term name, a short definition (first sentence of the Wikipedia article or the NanoLex main definition), the source URL for attribution, and optionally an image filename. Whether this data is passed inline as props or fetched from a data store by the component is the open question for Dhruv (Decision 1, Section 4).
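The minimum prop payload could be assembled as below. Everything here is provisional until Decision 1: `<KeywordPopup>` is a placeholder component name, the prop spellings are guesses, and the `summary` field (holding the first sentence of the Wikipedia article) is an assumed addition to the corpus schema.

```python
import html
import json

def popup_props(record):
    """Assemble the minimum prop payload for the pop-up component.

    "summary" (first sentence of the Wikipedia article) is an assumed
    field; lexicon_entry takes precedence once Pipeline B has run.
    """
    return {
        "term": record["keyword"],
        "definition": record.get("lexicon_entry") or record.get("summary", ""),
        "sourceUrl": record["source_url"],
        "image": (record.get("images") or [None])[0],
    }

def to_mdx(record):
    """Render a matched term as an inline MDX component (inline-props
    variant). <KeywordPopup> is a placeholder name pending Decision 1."""
    props = popup_props(record)
    attrs = " ".join(f"{k}={json.dumps(v)}" for k, v in props.items() if v is not None)
    return f"<KeywordPopup {attrs}>{html.escape(record['keyword'])}</KeywordPopup>"
```

The data-store variant would instead emit only the term (e.g. `<KeywordPopup term="dendrimer" />`) and let the component fetch the rest, which is exactly the trade-off Decision 1 settles.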
| Field | Used for | Status |
|---|---|---|
| keyword | Match against textbook text to flag the term | Confirmed |
| source_url | Attribution link in the pop-up | Confirmed |
| images | Optional image display in the pop-up | Confirmed |
| categories | Filter or badge in the pop-up (e.g. "Biotechnology") | Confirmed |
| lexicon_entry | Extended definition text (available once Pipeline B runs) | Later |
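For reference, a single corpus record with these fields might look like the following (values are illustrative, not taken from the actual corpus; `lexicon_entry` stays null until Pipeline B runs):

```json
{"keyword": "liposome",
 "source_url": "https://en.wikipedia.org/wiki/Liposome",
 "images": ["Liposome_scheme-en.svg"],
 "categories": ["Nanotechnology"],
 "lexicon_entry": null}
```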
This is the single blocking item for Pipeline A. mdx_exporter.py cannot be built until the component name, required props, and data delivery pattern are confirmed. Below are the three architecture options Dhruv should select from.
NanoLex is a lexical database of approximately 1,000 nanomedicine terms, each mapped to a structured entry containing: main definition, Classification, Synonymy, Hypernymy, Hyponymy, Meronymy, Part Holonyms, Verb Relations, Antonymy, Relational Adjectives, and an Expanded Definition with application context. This is the schema the Gemini-generated JSONL entries currently follow.
These 1,000 entries become the fine-tuning training corpus. The training format is supervised fine-tuning (SFT): the input prompt is the raw Wikipedia article text for a given term; the completion target is the structured NanoLex entry for that term. After training, the model can generate a NanoLex-style entry for any Wikipedia nanomedicine article it has not seen before.
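The pairing step might look like the sketch below. The `"prompt"`/`"completion"` key names and the article `"text"` field are assumptions; the actual serialization depends on the fine-tuning framework chosen for the Discovery cluster run.

```python
import json

def build_sft_pairs(nanolex_entries, wiki_corpus):
    """Pair each NanoLex entry with its Wikipedia source text.

    Input prompt: raw article text. Completion target: the structured
    NanoLex entry, serialized as JSON. Terms with no Wikipedia source
    text are skipped (they cannot form a training pair).
    """
    pairs = []
    for term, entry in nanolex_entries.items():
        article = wiki_corpus.get(term.lower())
        if article is None:
            continue  # no Wikipedia source text for this term
        pairs.append({
            "prompt": article["text"],
            "completion": json.dumps(entry, ensure_ascii=False),
        })
    return pairs
```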
Before running the fine-tuned model on the full Wikipedia corpus, a matching pass identifies which Wikipedia terms already have a NanoLex entry. Terms with an existing entry do not need generation — they are used as-is. Terms without a matching entry are queued for the fine-tuned model to generate. This minimizes unnecessary inference calls and ensures that human-curated entries are never overwritten.
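The matching pass is essentially a set partition; a minimal sketch, assuming matching is done on case-insensitive term names (fuzzier matching, e.g. on synonyms, would need more than this):

```python
def split_matched(wiki_keywords, nanolex_terms):
    """Partition Wikipedia keywords by whether a NanoLex entry exists.

    Matched terms are used as-is and never regenerated; unmatched
    terms are queued for the fine-tuned model.
    """
    existing = {t.lower() for t in nanolex_terms}
    matched, queued = [], []
    for keyword in wiki_keywords:
        (matched if keyword.lower() in existing else queued).append(keyword)
    return matched, queued
```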
Llama 3 8B is the preferred model for its stronger instruction-following on structured output tasks. Mistral 7B is the fallback if Discovery cluster node memory is constrained. This decision belongs to Prof. Brown and Srinivas and does not block system prompt drafting or corpus preparation.
A held-out set of 100 NanoLex entries (10% of the corpus) should be reserved before training begins. The model's output should be scored against the NanoLex schema: every schema section present, no hallucinated synonyms, correct ontological relationships, and an expanded definition tied to actual biomedical application context.
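The structural part of that evaluation (section presence) is mechanically checkable; the semantic checks (hallucinated synonyms, ontological correctness) still need human or reference-based review. A sketch of the structural gate, with section names taken from the schema list above; the exact JSONL keys may differ from these human-readable labels:

```python
# Section names from the NanoLex schema described in this spec;
# the exact keys used in the generated JSONL may differ.
REQUIRED_SECTIONS = [
    "main definition", "Classification", "Synonymy", "Hypernymy",
    "Hyponymy", "Meronymy", "Part Holonyms", "Verb Relations",
    "Antonymy", "Relational Adjectives", "Expanded Definition",
]

def missing_sections(entry):
    """Return the schema sections that are absent or empty in a
    generated entry; an empty result passes the structural gate."""
    return [s for s in REQUIRED_SECTIONS if not entry.get(s)]
```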
The system prompt frames the fine-tuned model's role and output format at inference time. It is used both during fine-tuning evaluation and when running the model against the unmatched Wikipedia corpus in production. The draft below is based on the NanoLex schema and the Medhavi platform context — Prof. Brown should review whether the lexicographic framing and output constraints are correct.
| Action | Owner | Depends on | Status |
|---|---|---|---|
| Confirm MDX component contract (name, props, architecture option) | Dhruv | — | Blocked |
| Review and confirm system prompt draft (Section 6) | Prof. Brown | — | Blocked |
| Build mdx_exporter.py for Pipeline A pop-up generation | Thejus | Decision 1 | Pending |
| Prepare NanoLex SFT training corpus (~900 prompt-completion pairs after 100 held out) | Thejus | Decision 2 (schema confirmation) | Pending |
| Submit fine-tuning run on Discovery cluster (Llama 3 or Mistral) | Thejus | Decision 2 + cluster config from Srinivas | Pending |
| Run matching pass — identify Wikipedia terms without a NanoLex entry | Thejus | Fine-tuning complete | Pending |
| Run fine-tuned model on unmatched Wikipedia terms, extend JSONL | Thejus | Fine-tuning complete + matching pass | Pending |
| Image sourcing and pop-up integration | Natnicha | Decision 1 | Pending |
1. Should the pop-up (Pipeline A) display the NanoLex entry sections (Classification, Synonymy, etc.) in a structured layout, or just the main definition and Expanded Definition as readable prose?
2. What is the target volume of Medhavi textbook pages that need keyword pop-ups for v1? This determines the urgency of Pipeline A versus B.
3. Are the ~1,000 NanoLex entries already in a format that can be directly used as fine-tuning targets, or do they need cleaning and reformatting first?
4. Should the fine-tuned model's output go through a human review step before being added to the live JSONL corpus, or is automated quality gating sufficient?
5. Does the keyword detection need to handle multi-word terms (e.g., "gold nanoparticle," "bacterial conjugation") or single tokens only? Multi-word matching requires a different detection approach.