Wikipedia keyword detection for pop-ups, and NanoLex fine-tuning for lexical entry generation — now separated and specified.
The original v0.1 spec described a single pipeline from Wikipedia to MDX. On review, the work has two distinct purposes that should be treated as separate systems with different inputs, outputs, and technical requirements.
Pipeline A — Keyword pop-up generation answers: which terms in a Medhavi textbook page link to a Wikipedia entry, and what does that pop-up show? This is a detection and rendering problem. Wikipedia and the existing subwiki.py parser can handle it almost entirely. No language model fine-tuning is required.
Pipeline B — Lexical entry generation answers: can we automatically generate a NanoLex-style lexical dictionary entry for any nanomedicine term — including terms not yet in NanoLex? This is a generation problem. It requires fine-tuning a model on the approximately 1,000 mapped NanoLex entries so it can produce structurally correct, domain-accurate entries for the remaining Wikipedia corpus. The Wikipedia match provides the source text; the fine-tuned model extends it into a full lexical entry.
These two systems share the same JSONL schema and the same keyword corpus, but their downstream consumers are different: Pipeline A feeds the Medhavi MDX layer directly; Pipeline B produces training data and a runtime lexicographer for future content enrichment.
- subwiki.py — parse Wikipedia XML, extract articles by category, output JSONL with keyword, source URL, images, categories
- mdx_exporter.py — render matched keywords as MDX pop-up components (blocked: needs MDX contract from Dhruv)

The JSONL corpus produced by subwiki.py contains every term that has a Wikipedia article in the Biotechnology and Nanotechnology categories. Each record includes the term name (the keyword field), the source URL, and associated images. This corpus acts as the lookup table for keyword detection.
For any Medhavi textbook page, a detection pass scans the page text and identifies token sequences that match a keyword field in the corpus. A matched term becomes an interactive element in the rendered MDX: hovering or tapping it triggers a pop-up that shows a summary drawn from the Wikipedia entry or, where a lexical entry has been generated (Pipeline B), a structured definition from the NanoLex-style record.
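The detection pass can be sketched as follows. This is a minimal illustration, not the final implementation: it assumes the corpus is the subwiki.py JSONL with a lowercase-matchable `keyword` field, and uses regex word boundaries with longest-first ordering so multi-word terms (see Open Question 5) are found before their substrings.

```python
import json
import re

def load_corpus(jsonl_path):
    """Map lowercased keyword -> full JSONL record, for O(1) lookup."""
    corpus = {}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            corpus[record["keyword"].lower()] = record
    return corpus

def detect_keywords(page_text, corpus):
    """Return corpus records whose keyword occurs in the page text.

    Keywords are tried longest-first so multi-word terms such as
    "gold nanoparticle" are found before their substrings.
    """
    lowered = page_text.lower()
    matches = []
    for keyword in sorted(corpus, key=len, reverse=True):
        # \b word boundaries prevent matches inside longer tokens
        if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
            matches.append(corpus[keyword])
    return matches
```

A production pass would also need to record match positions for MDX rendering and decide how to handle overlapping matches; this sketch only answers "which terms appear on this page."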
The pop-up component must receive at minimum: the term name, a short definition (first sentence of the Wikipedia article or the NanoLex main definition), the source URL for attribution, and optionally an image filename. Whether this data is passed inline as props or fetched from a data store by the component is the open question for Dhruv (Decision 1, Section 4).
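The minimum prop payload could be assembled as below. Everything here is provisional until Decision 1: `<KeywordPopup>` is a placeholder component name, the prop spellings are guesses, and the `summary` field (holding the first sentence of the Wikipedia article) is an assumed addition to the corpus schema.

```python
import html
import json

def popup_props(record):
    """Assemble the minimum prop payload for the pop-up component.

    "summary" (first sentence of the Wikipedia article) is an assumed
    field; lexicon_entry takes precedence once Pipeline B has run.
    """
    return {
        "term": record["keyword"],
        "definition": record.get("lexicon_entry") or record.get("summary", ""),
        "sourceUrl": record["source_url"],
        "image": (record.get("images") or [None])[0],
    }

def to_mdx(record):
    """Render a matched term as an inline MDX component (inline-props
    variant). <KeywordPopup> is a placeholder name pending Decision 1."""
    props = popup_props(record)
    attrs = " ".join(f"{k}={json.dumps(v)}" for k, v in props.items() if v is not None)
    return f"<KeywordPopup {attrs}>{html.escape(record['keyword'])}</KeywordPopup>"
```

The data-store variant would instead emit only the term (e.g. `<KeywordPopup term="dendrimer" />`) and let the component fetch the rest, which is exactly the trade-off Decision 1 settles.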
| Field | Used for | Status |
|---|---|---|
| keyword | Match against textbook text to flag the term | Confirmed |
| source_url | Attribution link in the pop-up | Confirmed |
| images | Optional image display in the pop-up | Confirmed |
| categories | Filter or badge in the pop-up (e.g. "Biotechnology") | Confirmed |
| lexicon_entry | Extended definition text (available once Pipeline B runs) | Later |
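For reference, a single corpus record with these fields might look like the following (values are illustrative, not taken from the actual corpus; `lexicon_entry` stays null until Pipeline B runs):

```json
{"keyword": "liposome",
 "source_url": "https://en.wikipedia.org/wiki/Liposome",
 "images": ["Liposome_scheme-en.svg"],
 "categories": ["Nanotechnology"],
 "lexicon_entry": null}
```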
This is the single blocking item for Pipeline A. mdx_exporter.py cannot be built until the component name, required props, and data delivery pattern are confirmed. Below are the three architecture options Dhruv should select from.
NanoLex is a lexical database of approximately 1,000 nanomedicine terms, each mapped to a structured entry containing: main definition, Classification, Synonymy, Hypernymy, Hyponymy, Meronymy, Part Holonyms, Verb Relations, Antonymy, Relational Adjectives, and an Expanded Definition with application context. This is the schema the Gemini-generated JSONL entries currently follow.
These 1,000 entries become the fine-tuning training corpus. The training format is supervised fine-tuning (SFT): the input prompt is the raw Wikipedia article text for a given term; the completion target is the structured NanoLex entry for that term. After training, the model can generate a NanoLex-style entry for any Wikipedia nanomedicine article it has not seen before.
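The pairing step might look like the sketch below. The `"prompt"`/`"completion"` key names and the article `"text"` field are assumptions; the actual serialization depends on the fine-tuning framework chosen for the Discovery cluster run.

```python
import json

def build_sft_pairs(nanolex_entries, wiki_corpus):
    """Pair each NanoLex entry with its Wikipedia source text.

    Input prompt: raw article text. Completion target: the structured
    NanoLex entry, serialized as JSON. Terms with no Wikipedia source
    text are skipped (they cannot form a training pair).
    """
    pairs = []
    for term, entry in nanolex_entries.items():
        article = wiki_corpus.get(term.lower())
        if article is None:
            continue  # no Wikipedia source text for this term
        pairs.append({
            "prompt": article["text"],
            "completion": json.dumps(entry, ensure_ascii=False),
        })
    return pairs
```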
Before running the fine-tuned model on the full Wikipedia corpus, a matching pass identifies which Wikipedia terms already have a NanoLex entry. Terms with an existing entry do not need generation — they are used as-is. Terms without a matching entry are queued for the fine-tuned model to generate. This minimizes unnecessary inference calls and ensures that human-curated entries are never overwritten.
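The matching pass is essentially a set partition; a minimal sketch, assuming matching is done on case-insensitive term names (fuzzier matching, e.g. on synonyms, would need more than this):

```python
def split_matched(wiki_keywords, nanolex_terms):
    """Partition Wikipedia keywords by whether a NanoLex entry exists.

    Matched terms are used as-is and never regenerated; unmatched
    terms are queued for the fine-tuned model.
    """
    existing = {t.lower() for t in nanolex_terms}
    matched, queued = [], []
    for keyword in wiki_keywords:
        (matched if keyword.lower() in existing else queued).append(keyword)
    return matched, queued
```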
Llama 3 8B is the preferred model for its stronger instruction-following on structured output tasks. Mistral 7B is the fallback if Discovery cluster node memory is constrained. This decision belongs to Prof. Brown and Srinivas and does not block system prompt drafting or corpus preparation.
A held-out set of 100 NanoLex entries (10% of the corpus) should be reserved before training begins. The model's output should be scored against the NanoLex schema: every schema section present, no hallucinated synonyms, correct ontological relationships, and an expanded definition tied to actual biomedical application context.
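The structural part of that evaluation (section presence) is mechanically checkable; the semantic checks (hallucinated synonyms, ontological correctness) still need human or reference-based review. A sketch of the structural gate, with section names taken from the schema list above; the exact JSONL keys may differ from these human-readable labels:

```python
# Section names from the NanoLex schema described in this spec;
# the exact keys used in the generated JSONL may differ.
REQUIRED_SECTIONS = [
    "main definition", "Classification", "Synonymy", "Hypernymy",
    "Hyponymy", "Meronymy", "Part Holonyms", "Verb Relations",
    "Antonymy", "Relational Adjectives", "Expanded Definition",
]

def missing_sections(entry):
    """Return the schema sections that are absent or empty in a
    generated entry; an empty result passes the structural gate."""
    return [s for s in REQUIRED_SECTIONS if not entry.get(s)]
```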
The system prompt frames the fine-tuned model's role and output format at inference time. It is used both during fine-tuning evaluation and when running the model against the unmatched Wikipedia corpus in production. The draft below is based on the NanoLex schema and the Medhavi platform context — Prof. Brown should review whether the lexicographic framing and output constraints are correct.
| Action | Owner | Depends on | Status |
|---|---|---|---|
| Confirm MDX component contract (name, props, architecture option) | Dhruv | — | Blocked |
| Review and confirm system prompt draft (Section 6) | Prof. Brown | — | Blocked |
| Build mdx_exporter.py for Pipeline A pop-up generation | Thejus | Decision 1 | Pending |
| Prepare NanoLex SFT training corpus (~900 prompt-completion pairs after 100 held out) | Thejus | Decision 2 (schema confirmation) | Pending |
| Submit fine-tuning run on Discovery cluster (Llama 3 or Mistral) | Thejus | Decision 2 + cluster config from Srinivas | Pending |
| Run matching pass — identify Wikipedia terms without a NanoLex entry | Thejus | Fine-tuning complete | Pending |
| Run fine-tuned model on unmatched Wikipedia terms, extend JSONL | Thejus | Fine-tuning complete + matching pass | Pending |
| Image sourcing and pop-up integration | Natnicha | Decision 1 | Pending |
1. Should the pop-up (Pipeline A) display the NanoLex entry sections (Classification, Synonymy, etc.) in a structured layout, or just the main definition and Expanded Definition as readable prose?
2. What is the target volume of Medhavi textbook pages that need keyword pop-ups for v1? This determines the urgency of Pipeline A versus B.
3. Are the ~1,000 NanoLex entries already in a format that can be directly used as fine-tuning targets, or do they need cleaning and reformatting first?
4. Should the fine-tuned model's output go through a human review step before being added to the live JSONL corpus, or is automated quality gating sufficient?
5. Does the keyword detection need to handle multi-word terms (e.g., "gold nanoparticle," "bacterial conjugation") or single tokens only? Multi-word matching requires a different detection approach.