How the Boyle System selects, applies, and adapts instructional strategies in real time, and why each design decision was made. For engineers, learning scientists, and program designers.
Every instructional system faces the same fundamental tension. Too much assistance produces high immediate success rates but shallow cognitive structures: learners who can perform with help but cannot transfer without it. Too little assistance forces impasse-driven learning, which works only when the learner has sufficient self-regulation and prior knowledge to bridge the gap. When those conditions aren't met, the result is disengagement.
Static instructional systems solve this by picking a single mode and applying it uniformly: a course that is always Socratic, a platform that always provides direct instruction. The Boyle System's answer is a multi-armed bandit that treats each instructional mode as a discrete arm and learns, in real time, which arm is most effective for each learner at each moment.
Each mode is calibrated to a specific learner state, cognitive load condition, and desired learning outcome. Each has a theoretical basis from the learning sciences. None of them is universally optimal; that is the point.
Operational definition: Iterative probing using leading questions and progressive hints that elicit latent knowledge from the learner rather than delivering information directly.
Theoretical basis: Active retrieval and schema refinement. Knowledge elicited through productive struggle is retained significantly longer than knowledge delivered passively.
Optimal state: Learners with adequate foundational schemas; integrative or synthesis tasks; executive education case discussions.
Risk: Can induce frustration and cognitive overload when foundational schemas are absent. The bandit detects this via rising response latency without accuracy gains, a high-fidelity signal that Socratic mode has exceeded the learner's current capability.
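The latency-without-accuracy signal can be sketched as a simple detector. The window handling and the 1.5x latency ratio below are illustrative assumptions, not values from the spec:

```python
def socratic_overload(latencies, accuracies, latency_ratio=1.5):
    """Flag rising response latency without accuracy gains.

    latencies, accuracies: recent per-turn observations, oldest first.
    latency_ratio: how much the latest latency must exceed the running
    mean to count as "rising" (illustrative threshold).
    """
    if len(latencies) < 2:
        return False
    baseline = sum(latencies[:-1]) / len(latencies[:-1])
    rising_latency = latencies[-1] > latency_ratio * baseline
    flat_accuracy = accuracies[-1] <= accuracies[0]
    return rising_latency and flat_accuracy
```

When the detector fires, the bandit would receive a low reward for the Socratic arm and shift probability mass toward lower-load modes.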
Operational definition: Reducing degrees of freedom by removing distractors, pre-filling procedural steps, or providing structured templates that allow the learner to focus on the core knowledge component.
Theoretical basis: Vygotsky's Zone of Proximal Development (ZPD): the system maintains the learner at the edge of their capability without exceeding it. Scaffolding provides the temporary structure that allows the learner to perform above their current independent capability.
Optimal state: High cognitive load conditions; new procedural skills; onboarding scenarios in executive education.
Risk: The Expertise Reversal Effect: once mastery is achieved, continued scaffolding actively impedes fluency by adding unnecessary cognitive load. The bandit detects this transition via knowledge tracing and reduces scaffold weight accordingly.
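A minimal fading rule, assuming a traced mastery probability from DKT/BKT is available; the 0.85 threshold and the linear fade are illustrative choices, not spec values:

```python
def scaffold_weight(mastery_prob, threshold=0.85):
    """Fade scaffolding linearly as traced mastery approaches the
    threshold, then remove it entirely (Expertise Reversal guard).

    mastery_prob: P(mastered) for the current Knowledge Component.
    Returns a weight in [0, 1] applied to scaffold density.
    """
    if mastery_prob >= threshold:
        return 0.0
    return 1.0 - mastery_prob / threshold
```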
Operational definition: Explicit delivery of facts, definitions, or procedural rules. No elicitation; information is provided directly and efficiently.
Theoretical basis: Cognitive Load Theory β minimizes extraneous cognitive load when the learner lacks prerequisite schemas, enabling rapid acquisition of new Knowledge Components (KCs).
Optimal state: Prerequisite concept introduction; low-energy or high-stress learner states; situations where exploratory modes would cause disengagement before any learning occurs.
Risk: Passive dependency if used exclusively. The system enforces a minimum exploration rate across all other modes: every learner is regularly given the opportunity to engage with more demanding modes.
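The minimum exploration rate can be enforced by mixing the bandit's selection probabilities with a uniform floor; the 5% floor below is an assumption for illustration:

```python
def with_exploration_floor(probs, floor=0.05):
    """Guarantee each mode at least `floor` selection probability by
    mixing the bandit's policy with a uniform distribution.

    probs: mode selection probabilities summing to 1.
    """
    n = len(probs)
    assert floor * n <= 1.0, "floor too large for this many modes"
    scale = 1.0 - floor * n
    return [floor + scale * p for p in probs]
```

Even a policy fully committed to Direct Instruction retains a nonzero chance of offering each of the other four modes.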
Operational definition: Modeling expert processes via worked examples, "think-aloud" demonstrations, or "first letter" hints that reveal the structure of expert reasoning without completing the task for the learner.
Theoretical basis: Observational learning and expert visualization. Learners acquire procedural fluency and strategy adoption by watching expert processes made visible: seeing not just what the expert does, but how they think about it.
Optimal state: Complex multi-step procedures; professional practice domains (consulting, research methodology, case analysis); think tank workflows.
Risk: High LLM generation cost. The IC-Cache optimization routes Cognitive Apprenticeship requests to cached high-quality examples where possible, reducing cost without degrading quality.
| Mode | Theoretical Basis | Optimal Learner State | Primary Risk |
|---|---|---|---|
| Socratic Questioning | Active retrieval, schema refinement | Moderate–high prior knowledge | Frustration if schemas absent |
| Scaffolding | Zone of Proximal Development | Low–moderate; high cognitive load | Expertise Reversal Effect |
| Direct Instruction | Cognitive Load Theory | Novice; low energy; high stress | Passive dependency |
| Cognitive Apprenticeship | Observational learning | Intermediate; procedural tasks | High LLM generation cost |
| Meta-cognitive Feedback | Self-Regulated Learning | Advanced; near or post-mastery | Ineffective for novices |
Each of the five instructional modes is treated as a discrete "arm" of a Multi-Armed Bandit (MAB). The bandit engine selects which mode to apply at each instructional moment, balancing exploration (trying modes with uncertain effectiveness for this learner) against exploitation (using the mode currently estimated to be most effective).
For each instructional mode a ∈ {1, ..., 5}, the system maintains a belief state modeled as a Beta distribution Beta(αₐ, βₐ) for binary rewards, or a Gaussian distribution N(μₐ, σₐ²) for continuous learning-progress metrics.
Thompson Sampling draws a sample from each mode's posterior and selects the mode with the highest sample. This naturally produces high exploration early in a session, when uncertainty about the learner's response to each mode is high, and converges on the most effective personalized strategy as evidence accumulates.
Why not UCB or epsilon-greedy? Thompson Sampling handles non-stationarity more gracefully than UCB and produces a more natural exploration-exploitation balance than epsilon-greedy. For non-stationary learning trajectories, which all real learners produce, the Bayesian approach is more appropriate.
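A minimal Thompson Sampling loop over the five modes, assuming binary learning-progress rewards and uniform Beta(1, 1) priors (the production design seeds these from expert knowledge instead):

```python
import random

MODES = ["socratic", "scaffolding", "direct", "apprenticeship", "metacognitive"]

class ThompsonSampler:
    """Minimal Thompson Sampling over the five instructional modes."""

    def __init__(self):
        # Uniform Beta(1, 1) priors; the warm-start mechanism described
        # later would seed these from expert knowledge instead.
        self.alpha = {m: 1.0 for m in MODES}
        self.beta = {m: 1.0 for m in MODES}

    def select_mode(self):
        # Draw one sample from each mode's posterior; play the argmax.
        samples = {m: random.betavariate(self.alpha[m], self.beta[m])
                   for m in MODES}
        return max(samples, key=samples.get)

    def update(self, mode, reward):
        # Binary reward: 1 = learning progress observed, 0 = none.
        self.alpha[mode] += reward
        self.beta[mode] += 1 - reward
```

Early in a session every posterior is wide, so selections are close to uniform; as one mode accumulates successes its posterior tightens and it dominates the draws.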
A context-free bandit cannot achieve true personalization: it treats all learners identically and learns only from aggregated signal. The Contextual MAB (CMAB) incorporates a feature vector xₜ representing the learner's current state:
E[rₜ | xₜ, a] = xₜᵀ θₐ

where:
- xₜ = learner context vector at time t
- θₐ = learned weight vector for instructional mode a
- rₜ = reward (learning progress) at time t
| Feature Category | Features Included | Role in Bandit |
|---|---|---|
| Surface-Level (Stable) | Baseline education level, prior academic performance, domain background | Sets initial priors; "warm start" for new learners before interaction data accumulates |
| Deep-Level (Dynamic) | Current Knowledge Component mastery, error distributions, response latency | Primary signal for real-time mode switching; updated after every interaction |
| Affective State | Estimated mood, energy level, stress indicators | Temporarily biases toward lower-load modes (Direct, Scaffolding) when stress indicators are high |
| Knowledge Tracing (DKT/BKT) | Mastery probability per skill from sequence of prior responses | Detects Expertise Reversal; triggers mode drift away from scaffolding toward exploration |
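A sketch of how a context vector xₜ might be assembled from the four feature categories above and scored against per-mode weight vectors θₐ. The feature names are hypothetical, and the greedy argmax stands in for the posterior sampling a real CMAB would use:

```python
def build_context(learner):
    # Hypothetical flat feature layout covering the four categories.
    return [
        learner["education_level"],     # surface-level (stable)
        learner["kc_mastery"],          # deep-level (dynamic)
        learner["response_latency_z"],  # deep-level (dynamic)
        learner["stress_indicator"],    # affective state
        learner["dkt_mastery_prob"],    # knowledge tracing
    ]

def expected_reward(x, theta):
    # E[r | x, a] = x^T theta_a for one mode's weight vector.
    return sum(xi * ti for xi, ti in zip(x, theta))

def best_mode(x, thetas):
    # thetas: dict mode -> weight vector; greedy argmax for illustration
    # (the CMAB samples theta_a from its posterior instead).
    return max(thetas, key=lambda a: expected_reward(x, thetas[a]))
```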
Before the system has enough data to personalize, it uses expert knowledge to seed the bandit's priors. Direct Instruction is the default mode for prerequisite concepts; Socratic Questioning is prioritized for integrative tasks. This warm-start mechanism prevents detrimental random exploration in early sessions: a new learner should not be subjected to a Socratic interrogation on their first day.
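Warm-starting could look like seeding Beta pseudo-counts per task type. The numbers below are illustrative expert beliefs, not calibrated values:

```python
# Hypothetical expert priors as Beta(alpha, beta) pseudo-counts.
# Higher alpha relative to beta encodes "expected to work well here".
EXPERT_PRIORS = {
    "prerequisite": {
        "direct": (8, 2),       # default for prerequisite concepts
        "socratic": (2, 8),     # discouraged before schemas exist
        "scaffolding": (5, 5),
    },
    "integrative": {
        "direct": (3, 7),
        "socratic": (8, 2),     # prioritized for integrative tasks
        "scaffolding": (4, 6),
    },
}

def seed_priors(task_type):
    # Copy so per-learner updates never mutate the shared table.
    return {mode: list(ab) for mode, ab in EXPERT_PRIORS[task_type].items()}
```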
As learners interact with the system, the bandit refines its models. Local Clustering in Bandits (LOCB) groups learners by preference parameters θ. New learners whose initial behavior matches an existing cluster inherit that cluster's learned policy, dramatically accelerating personalization without requiring extended individual observation.
This collaborative filtering approach scales intelligence across entire cohorts. A think tank that has deployed the system across three cohorts has accumulated learner cluster data that benefits the fourth cohort from day one.
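A nearest-centroid stand-in for the cluster-assignment step; real LOCB maintains confidence sets around the centroids and revises assignments online, which this sketch omits:

```python
import math

def assign_cluster(theta_new, centroids):
    """Assign a new learner's early parameter estimate to the closest
    existing cluster centroid (Euclidean distance), so the learner can
    inherit that cluster's policy from day one."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centroids)),
               key=lambda k: dist(theta_new, centroids[k]))
```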
Learning is non-stationary: a learner's optimal mode changes as they develop. Sliding Window UCB or Discounted Thompson Sampling gives more weight to recent observations. As deep-level features indicate higher competence, rewards for Direct Instruction and heavy Scaffolding naturally decline, while rewards for Socratic Questioning and Meta-cognitive Feedback increase.
The bandit policy drifts with the learner: a seamless transition from guided structure to open-ended exploration, without requiring any explicit configuration change.
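Discounted Thompson Sampling can be sketched as a decay applied to the Beta pseudo-counts before each update; the discount of 0.98 is an illustrative choice:

```python
def discounted_update(alpha, beta, reward, gamma=0.98):
    """Decay accumulated evidence, then add the new observation, so the
    posterior tracks a drifting (non-stationary) learner rather than
    averaging over their entire history."""
    alpha = gamma * alpha + reward
    beta = gamma * beta + (1 - reward)
    return alpha, beta
```

With gamma = 1.0 this reduces to the standard stationary update; with gamma < 1, evidence from a learner's novice phase washes out as they develop.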
Simple correctness rewards create a perverse incentive: the bandit maximizes assistance to guarantee "success," producing over-assistance: helping learners get correct answers without learning anything. The Boyle System uses Learning Progress (LP) as its primary reward signal:
r = cᵢ(t) - cᵢ(t-1)

where cᵢ(t) is the probability of mastery for Knowledge Component i at time t.

- If a learner already knows a concept, cᵢ(t) - cᵢ(t-1) ≈ 0, and the bandit shifts to more challenging content or Meta-cognitive mode.
- If progress is rapid, the reward is high, and the bandit reinforces the current instructional mode.
| Reward Component | Metric | Purpose |
|---|---|---|
| Immediate Success | P(Correct \| Mode) | Maintains learner motivation and "flow" state |
| Knowledge Gain | ΞMastery | Ensures the mode is actually teaching, not just enabling correct answers |
| Efficiency | 1 / Time-on-Task | Penalizes unnecessarily verbose or slow modes |
| Persistence | Session completion rate | Encourages modes that sustain long-term engagement |
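The four components might be blended into a single scalar reward; the weights and the time normalization below are assumptions for illustration, not spec values:

```python
def composite_reward(p_correct, mastery_gain, time_on_task_s, completed,
                     weights=(0.2, 0.5, 0.1, 0.2)):
    """Weighted blend of the four reward components in the table above.
    mastery_gain is the LP term c_i(t) - c_i(t-1)."""
    w_succ, w_gain, w_eff, w_pers = weights
    efficiency = 1.0 / max(time_on_task_s, 1.0)  # clamp to avoid blowup
    return (w_succ * p_correct
            + w_gain * mastery_gain
            + w_eff * efficiency
            + w_pers * (1.0 if completed else 0.0))
```

Weighting knowledge gain above immediate success is what removes the over-assistance incentive: a mode that produces correct answers without mastery movement earns little.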
When instructional content is generated by LLMs in real time, a standard bandit faces a fundamental problem: it selects an action (e.g., "provide a Socratic hint") but the treatment delivered to the learner is the stochastic output of the LLM: variable, unpredictable, and not fully controlled by the action choice. GAMBITTS (Generator-Mediated Bandit-Thompson Sampling) explicitly models this action-treatment split.
GAMBITTS PIPELINE

Bandit Agent
└─ Selects: Instructional mode A + prompt template P
   (e.g., "Use Socratic questioning to explain concept X")
      │
      ▼
LLM Generator (stochastic)
└─ Produces: Specific text string Gₜ
      │
      ▼
Embedding Projection
└─ Projects: High-dim text Gₜ → Low-dim embedding Zₜ
   (Detects when different outputs deliver the same pedagogy)
      │
      ▼
Reward Signal
└─ Learner response → LP reward → Update θₐ posterior
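The pipeline can be sketched as a single step function. All five callables are stand-ins for the real components (bandit policy, LLM generator, embedding model, learner observation, posterior update):

```python
def gambitts_step(select_mode, llm_generate, embed, observe_reward, update):
    """One action -> treatment -> reward cycle. The bandit controls only
    the mode/template; the delivered treatment is the stochastic LLM
    output, summarized by its embedding z for reward modeling."""
    mode = select_mode()        # action: mode + prompt template
    text = llm_generate(mode)   # stochastic treatment
    z = embed(text)             # low-dim treatment representation
    reward = observe_reward()   # learning-progress signal
    update(mode, z, reward)     # posterior update sees (mode, z, r)
    return mode, text, z, reward
```

The key design point is that `update` receives the embedding z, not just the action: two different generations that deliver the same pedagogy produce similar z and are credited similarly.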
| System Component | Optimization Strategy | Pedagogical Impact |
|---|---|---|
| Example Selector | Caches high-utility request-response pairs from larger models | Enables smaller, faster models to emulate Cognitive Apprenticeship quality |
| Request Router | Routes simple queries to small models, complex ones (Socratic) to large models | Maintains low latency during critical "flow" states when response speed matters |
| Example Manager | Continuously refines cached examples based on learner rewards | Ensures scaffolding examples remain aligned with actual learner outcomes |
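A simplified version of the routing logic in the table; the cache keying and model names are hypothetical:

```python
def route_request(mode, query_key, cache):
    """Serve Cognitive Apprenticeship requests from cached high-utility
    examples when possible; send Socratic prompts to the large model;
    route everything else to the small, fast model."""
    if mode == "apprenticeship" and query_key in cache:
        return ("cache", cache[query_key])
    if mode == "socratic":
        return ("large_model", None)
    return ("small_model", None)
```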
Without fairness constraints, a bandit that learns from historical data can permanently route certain learners into a narrow set of modes, reflecting prior educational disadvantage rather than actual learning potential.
If a bandit observes that a demographic subgroup has historically responded well to Direct Instruction (potentially because prior educational experiences have conditioned them to passive reception), it may permanently route those learners into a Direct Instruction loop. This denies them access to higher-order modes like Socratic Questioning or Meta-cognitive Feedback. The system's decisions would then mirror and amplify the educational disadvantage it was meant to overcome.
The adaptive architecture described in this document represents the full design specification. Current deployment status:
| Component | Status | Notes |
|---|---|---|
| Five instructional modes (conceptual) | Defined | Deployed as manual selection in current pilot |
| Thompson Sampling MAB engine | Planned | See Roadmap (full MAB Engine) |
| Contextual learner vectors (CMAB) | Planned | Requires DKT/BKT integration |
| GAMBITTS LLM integration | Planned | Dependent on MAB engine deployment |
| IC-Cache optimization | Planned | Post-GAMBITTS |
| Fairness audit protocol | Planned | See Roadmap AI-008 |
| Phase 1 expert priors for exec education | In development | See Roadmap AI-004 |
For full roadmap with effort estimates and priorities, see the Roadmap document.