How the Boyle System selects, applies, and adapts instructional strategies in real time, and why each design decision was made. For engineers, learning scientists, and program designers.
Every instructional system faces the same fundamental tension. Too much assistance produces high immediate success rates but shallow cognitive structures: learners who can perform with help but cannot transfer without it. Too little assistance forces impasse-driven learning, which works only when the learner has sufficient self-regulation and prior knowledge to bridge the gap. When those conditions aren't met, the result is disengagement.
Static instructional systems solve this by picking a single mode and applying it uniformly: a course that is always Socratic, a platform that always provides direct instruction. The Boyle System's answer is a multi-armed bandit that treats each instructional mode as a discrete arm and learns, in real time, which arm is most effective for each learner at each moment.
Each mode is calibrated to a specific learner state, cognitive load condition, and desired learning outcome. Each has a theoretical basis from the learning sciences. None of them is universally optimal; that is the point.
Operational definition: Iterative probing using leading questions and progressive hints that elicit latent knowledge from the learner rather than delivering information directly.
Theoretical basis: Active retrieval and schema refinement. Knowledge elicited through productive struggle is retained significantly longer than knowledge delivered passively.
Optimal state: Learners with adequate foundational schemas; integrative or synthesis tasks; executive education case discussions.
Risk: Can induce frustration and cognitive overload when foundational schemas are absent. The bandit detects this via rising response latency without accuracy gains, a high-fidelity signal that Socratic mode has exceeded the learner's current capability.
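The latency-without-accuracy signal can be sketched as a simple detector. The window handling and the 1.5x latency ratio below are illustrative assumptions, not values from the spec:

```python
def socratic_overload(latencies, accuracies, latency_ratio=1.5):
    """Flag rising response latency without accuracy gains.

    latencies, accuracies: recent per-turn observations, oldest first.
    latency_ratio: how much the latest latency must exceed the running
    mean to count as "rising" (illustrative threshold).
    """
    if len(latencies) < 2:
        return False
    baseline = sum(latencies[:-1]) / len(latencies[:-1])
    rising_latency = latencies[-1] > latency_ratio * baseline
    flat_accuracy = accuracies[-1] <= accuracies[0]
    return rising_latency and flat_accuracy
```

When the detector fires, the bandit would receive a low reward for the Socratic arm and shift probability mass toward lower-load modes.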
Operational definition: Reducing degrees of freedom by removing distractors, pre-filling procedural steps, or providing structured templates that allow the learner to focus on the core knowledge component.
Theoretical basis: Vygotsky's Zone of Proximal Development (ZPD): the system maintains the learner at the edge of their capability without exceeding it. Scaffolding provides the temporary structure that allows the learner to perform above their current independent capability.
Optimal state: High cognitive load conditions; new procedural skills; onboarding scenarios in executive education.
Risk: The Expertise Reversal Effect: once mastery is achieved, continued scaffolding actively impedes fluency by adding unnecessary cognitive load. The bandit detects this transition via knowledge tracing and reduces scaffold weight accordingly.
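A minimal fading rule, assuming a traced mastery probability from DKT/BKT is available; the 0.85 threshold and the linear fade are illustrative choices, not spec values:

```python
def scaffold_weight(mastery_prob, threshold=0.85):
    """Fade scaffolding linearly as traced mastery approaches the
    threshold, then remove it entirely (Expertise Reversal guard).

    mastery_prob: P(mastered) for the current Knowledge Component.
    Returns a weight in [0, 1] applied to scaffold density.
    """
    if mastery_prob >= threshold:
        return 0.0
    return 1.0 - mastery_prob / threshold
```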
Operational definition: Explicit delivery of facts, definitions, or procedural rules. No elicitation; information is provided directly and efficiently.
Theoretical basis: Cognitive Load Theory β minimizes extraneous cognitive load when the learner lacks prerequisite schemas, enabling rapid acquisition of new Knowledge Components (KCs).
Optimal state: Prerequisite concept introduction; low-energy or high-stress learner states; situations where exploratory modes would cause disengagement before any learning occurs.
Risk: Passive dependency if used exclusively. The system enforces a minimum exploration rate across all other modes: every learner is regularly given the opportunity to engage with more demanding modes.
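The minimum exploration rate can be enforced by mixing the bandit's selection probabilities with a uniform floor; the 5% floor below is an assumption for illustration:

```python
def with_exploration_floor(probs, floor=0.05):
    """Guarantee each mode at least `floor` selection probability by
    mixing the bandit's policy with a uniform distribution.

    probs: mode selection probabilities summing to 1.
    """
    n = len(probs)
    assert floor * n <= 1.0, "floor too large for this many modes"
    scale = 1.0 - floor * n
    return [floor + scale * p for p in probs]
```

Even a policy fully committed to Direct Instruction retains a nonzero chance of offering each of the other four modes.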
Operational definition: Modeling expert processes via worked examples, "think-aloud" demonstrations, or "first letter" hints that reveal the structure of expert reasoning without completing the task for the learner.
Theoretical basis: Observational learning and expert visualization. Learners acquire procedural fluency and strategy adoption by watching expert processes made visible: seeing not just what the expert does, but how they think about it.
Optimal state: Complex multi-step procedures; professional practice domains (consulting, research methodology, case analysis); think tank workflows.
Risk: High LLM generation cost. The IC-Cache optimization routes Cognitive Apprenticeship requests to cached high-quality examples where possible, reducing cost without degrading quality.
| Mode | Theoretical Basis | Optimal Learner State | Primary Risk |
|---|---|---|---|
| Socratic Questioning | Active retrieval, schema refinement | Moderate–high prior knowledge | Frustration if schemas absent |
| Scaffolding | Zone of Proximal Development | Low–moderate; high cognitive load | Expertise Reversal Effect |
| Direct Instruction | Cognitive Load Theory | Novice; low energy; high stress | Passive dependency |
| Cognitive Apprenticeship | Observational learning | Intermediate; procedural tasks | High LLM generation cost |
| Meta-cognitive Feedback | Self-Regulated Learning | Advanced; near or post-mastery | Ineffective for novices |
Each of the five instructional modes is treated as a discrete "arm" of a Multi-Armed Bandit (MAB). The bandit engine selects which mode to apply at each instructional moment, balancing exploration (trying modes with uncertain effectiveness for this learner) against exploitation (using the mode currently estimated to be most effective).
For each instructional mode a ∈ {1, ..., 5}, the system maintains a belief state modeled as a Beta distribution Beta(αₐ, βₐ) for binary rewards, or a Gaussian distribution N(μₐ, σₐ²) for continuous learning-progress metrics.
Thompson Sampling draws a sample from each mode's posterior and selects the mode with the highest sample. This naturally produces high exploration early in a session, when uncertainty about the learner's response to each mode is high, and converges on the most effective personalized strategy as evidence accumulates.
Why not UCB or epsilon-greedy? Thompson Sampling handles non-stationarity more gracefully than UCB and produces a more natural exploration-exploitation balance than epsilon-greedy. For non-stationary learning trajectories, which all real learners produce, the Bayesian approach is more appropriate.
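A minimal Thompson Sampling loop over the five modes, assuming binary learning-progress rewards and uniform Beta(1, 1) priors (the production design seeds these from expert knowledge instead):

```python
import random

MODES = ["socratic", "scaffolding", "direct", "apprenticeship", "metacognitive"]

class ThompsonSampler:
    """Minimal Thompson Sampling over the five instructional modes."""

    def __init__(self):
        # Uniform Beta(1, 1) priors; the warm-start mechanism described
        # later would seed these from expert knowledge instead.
        self.alpha = {m: 1.0 for m in MODES}
        self.beta = {m: 1.0 for m in MODES}

    def select_mode(self):
        # Draw one sample from each mode's posterior; play the argmax.
        samples = {m: random.betavariate(self.alpha[m], self.beta[m])
                   for m in MODES}
        return max(samples, key=samples.get)

    def update(self, mode, reward):
        # Binary reward: 1 = learning progress observed, 0 = none.
        self.alpha[mode] += reward
        self.beta[mode] += 1 - reward
```

Early in a session every posterior is wide, so selections are close to uniform; as one mode accumulates successes its posterior tightens and it dominates the draws.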
A context-free bandit cannot achieve true personalization: it treats all learners identically and learns only from aggregated signal. The Contextual MAB (CMAB) incorporates a feature vector xₜ representing the learner's current state:
E[rₜ | xₜ, a] = xₜᵀ θₐ

where:
- xₜ = learner context vector at time t
- θₐ = learned weight vector for instructional mode a
- rₜ = reward (learning progress) at time t
| Feature Category | Features Included | Role in Bandit |
|---|---|---|
| Surface-Level (Stable) | Baseline education level, prior academic performance, domain background | Sets initial priors; "warm start" for new learners before interaction data accumulates |
| Deep-Level (Dynamic) | Current Knowledge Component mastery, error distributions, response latency | Primary signal for real-time mode switching; updated after every interaction |
| Affective State | Estimated mood, energy level, stress indicators | Temporarily biases toward lower-load modes (Direct, Scaffolding) when stress indicators are high |
| Knowledge Tracing (DKT/BKT) | Mastery probability per skill from sequence of prior responses | Detects Expertise Reversal; triggers mode drift away from scaffolding toward exploration |
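A sketch of how a context vector xₜ might be assembled from the four feature categories above and scored against per-mode weight vectors θₐ. The feature names are hypothetical, and the greedy argmax stands in for the posterior sampling a real CMAB would use:

```python
def build_context(learner):
    # Hypothetical flat feature layout covering the four categories.
    return [
        learner["education_level"],     # surface-level (stable)
        learner["kc_mastery"],          # deep-level (dynamic)
        learner["response_latency_z"],  # deep-level (dynamic)
        learner["stress_indicator"],    # affective state
        learner["dkt_mastery_prob"],    # knowledge tracing
    ]

def expected_reward(x, theta):
    # E[r | x, a] = x^T theta_a for one mode's weight vector.
    return sum(xi * ti for xi, ti in zip(x, theta))

def best_mode(x, thetas):
    # thetas: dict mode -> weight vector; greedy argmax for illustration
    # (the CMAB samples theta_a from its posterior instead).
    return max(thetas, key=lambda a: expected_reward(x, thetas[a]))
```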
Before the system has enough data to personalize, it uses expert knowledge to seed the bandit's priors. Direct Instruction is the default mode for prerequisite concepts; Socratic Questioning is prioritized for integrative tasks. This warm-start mechanism prevents detrimental random exploration in early sessions: a new learner should not be subjected to a Socratic interrogation on their first day.
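Warm-starting could look like seeding Beta pseudo-counts per task type. The numbers below are illustrative expert beliefs, not calibrated values:

```python
# Hypothetical expert priors as Beta(alpha, beta) pseudo-counts.
# Higher alpha relative to beta encodes "expected to work well here".
EXPERT_PRIORS = {
    "prerequisite": {
        "direct": (8, 2),       # default for prerequisite concepts
        "socratic": (2, 8),     # discouraged before schemas exist
        "scaffolding": (5, 5),
    },
    "integrative": {
        "direct": (3, 7),
        "socratic": (8, 2),     # prioritized for integrative tasks
        "scaffolding": (4, 6),
    },
}

def seed_priors(task_type):
    # Copy so per-learner updates never mutate the shared table.
    return {mode: list(ab) for mode, ab in EXPERT_PRIORS[task_type].items()}
```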
As learners interact with the system, the bandit refines its models. Local Clustering in Bandits (LOCB) groups learners by preference parameters θ. New learners whose initial behavior matches an existing cluster inherit that cluster's learned policy, dramatically accelerating personalization without requiring extended individual observation.
This collaborative filtering approach scales intelligence across entire cohorts. A think tank that has deployed the system across three cohorts has accumulated learner cluster data that benefits the fourth cohort from day one.
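A nearest-centroid stand-in for the cluster-assignment step; real LOCB maintains confidence sets around the centroids and revises assignments online, which this sketch omits:

```python
import math

def assign_cluster(theta_new, centroids):
    """Assign a new learner's early parameter estimate to the closest
    existing cluster centroid (Euclidean distance), so the learner can
    inherit that cluster's policy from day one."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centroids)),
               key=lambda k: dist(theta_new, centroids[k]))
```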
Learning is non-stationary: a learner's optimal mode changes as they develop. Sliding Window UCB or Discounted Thompson Sampling gives more weight to recent observations. As deep-level features indicate higher competence, rewards for Direct Instruction and heavy Scaffolding naturally decline, while rewards for Socratic Questioning and Meta-cognitive Feedback increase.
The bandit policy drifts with the learner: a seamless transition from guided structure to open-ended exploration, without requiring any explicit configuration change.
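Discounted Thompson Sampling can be sketched as a decay applied to the Beta pseudo-counts before each update; the discount of 0.98 is an illustrative choice:

```python
def discounted_update(alpha, beta, reward, gamma=0.98):
    """Decay accumulated evidence, then add the new observation, so the
    posterior tracks a drifting (non-stationary) learner rather than
    averaging over their entire history."""
    alpha = gamma * alpha + reward
    beta = gamma * beta + (1 - reward)
    return alpha, beta
```

With gamma = 1.0 this reduces to the standard stationary update; with gamma < 1, evidence from a learner's novice phase washes out as they develop.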
Simple correctness rewards create a perverse incentive: the bandit maximizes assistance to guarantee "success," producing over-assistance: helping learners get correct answers without learning anything. The Boyle System uses Learning Progress (LP) as its primary reward signal:
r = cᵢ(t) - cᵢ(t-1)

where cᵢ(t) is the probability of mastery for Knowledge Component i at time t.

- If a learner already knows a concept, cᵢ(t) - cᵢ(t-1) ≈ 0, and the bandit shifts to more challenging content or Meta-cognitive mode.
- If progress is rapid, the reward is high, and the bandit reinforces the current instructional mode.
| Reward Component | Metric | Purpose |
|---|---|---|
| Immediate Success | P(Correct \| Mode) | Maintains learner motivation and "flow" state |
| Knowledge Gain | ΞMastery | Ensures the mode is actually teaching, not just enabling correct answers |
| Efficiency | 1 / Time-on-Task | Penalizes unnecessarily verbose or slow modes |
| Persistence | Session completion rate | Encourages modes that sustain long-term engagement |
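The four components might be blended into a single scalar reward; the weights and the time normalization below are assumptions for illustration, not spec values:

```python
def composite_reward(p_correct, mastery_gain, time_on_task_s, completed,
                     weights=(0.2, 0.5, 0.1, 0.2)):
    """Weighted blend of the four reward components in the table above.
    mastery_gain is the LP term c_i(t) - c_i(t-1)."""
    w_succ, w_gain, w_eff, w_pers = weights
    efficiency = 1.0 / max(time_on_task_s, 1.0)  # clamp to avoid blowup
    return (w_succ * p_correct
            + w_gain * mastery_gain
            + w_eff * efficiency
            + w_pers * (1.0 if completed else 0.0))
```

Weighting knowledge gain above immediate success is what removes the over-assistance incentive: a mode that produces correct answers without mastery movement earns little.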
When instructional content is generated by LLMs in real time, a standard bandit faces a fundamental problem: it selects an action (e.g., "provide a Socratic hint") but the treatment delivered to the learner is the stochastic output of the LLM: variable, unpredictable, and not fully controlled by the action choice. GAMBITTS (Generator-Mediated Bandit-Thompson Sampling) explicitly models this action-treatment split.
GAMBITTS PIPELINE

Bandit Agent
└─ Selects: Instructional mode A + prompt template P
   (e.g., "Use Socratic questioning to explain concept X")
      │
      ▼
LLM Generator (stochastic)
└─ Produces: Specific text string Gₜ
      │
      ▼
Embedding Projection
└─ Projects: High-dim text Gₜ → Low-dim embedding Zₜ
   (Detects when different outputs deliver the same pedagogy)
      │
      ▼
Reward Signal
└─ Learner response → LP reward → Update θₐ posterior
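The pipeline can be sketched as a single step function. All five callables are stand-ins for the real components (bandit policy, LLM generator, embedding model, learner observation, posterior update):

```python
def gambitts_step(select_mode, llm_generate, embed, observe_reward, update):
    """One action -> treatment -> reward cycle. The bandit controls only
    the mode/template; the delivered treatment is the stochastic LLM
    output, summarized by its embedding z for reward modeling."""
    mode = select_mode()        # action: mode + prompt template
    text = llm_generate(mode)   # stochastic treatment
    z = embed(text)             # low-dim treatment representation
    reward = observe_reward()   # learning-progress signal
    update(mode, z, reward)     # posterior update sees (mode, z, r)
    return mode, text, z, reward
```

The key design point is that `update` receives the embedding z, not just the action: two different generations that deliver the same pedagogy produce similar z and are credited similarly.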
| System Component | Optimization Strategy | Pedagogical Impact |
|---|---|---|
| Example Selector | Caches high-utility request-response pairs from larger models | Enables smaller, faster models to emulate Cognitive Apprenticeship quality |
| Request Router | Routes simple queries to small models, complex ones (Socratic) to large models | Maintains low latency during critical "flow" states when response speed matters |
| Example Manager | Continuously refines cached examples based on learner rewards | Ensures scaffolding examples remain aligned with actual learner outcomes |
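A simplified version of the routing logic in the table; the cache keying and model names are hypothetical:

```python
def route_request(mode, query_key, cache):
    """Serve Cognitive Apprenticeship requests from cached high-utility
    examples when possible; send Socratic prompts to the large model;
    route everything else to the small, fast model."""
    if mode == "apprenticeship" and query_key in cache:
        return ("cache", cache[query_key])
    if mode == "socratic":
        return ("large_model", None)
    return ("small_model", None)
```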
Without fairness constraints, a bandit that learns from historical data can permanently route certain learners into a narrow set of modes, reflecting prior educational disadvantage rather than actual learning potential.
If a bandit observes that a demographic subgroup has historically responded well to Direct Instruction (potentially because prior educational experiences have conditioned them to passive reception), it may permanently route those learners into a Direct Instruction loop. This denies them access to higher-order modes like Socratic Questioning or Meta-cognitive Feedback. The system's decisions would then mirror and amplify the educational disadvantage it was meant to overcome.
The adaptive architecture described in this document represents the full design specification. Current deployment status:
| Component | Status | Notes |
|---|---|---|
| Five instructional modes (conceptual) | Defined | Deployed as manual selection in current pilot |
| Thompson Sampling MAB engine | Planned | See Roadmap (full MAB Engine) |
| Contextual learner vectors (CMAB) | Planned | Requires DKT/BKT integration |
| GAMBITTS LLM integration | Planned | Dependent on MAB engine deployment |
| IC-Cache optimization | Planned | Post-GAMBITTS |
| Fairness audit protocol | Planned | See Roadmap AI-008 |
| Phase 1 expert priors for exec education | In development | See Roadmap AI-004 |
For full roadmap with effort estimates and priorities, see the Roadmap document.