Research Paper Explainer
16 papers that power GlassBox AI - explained visually
agentic-trust-labs/glassbox-ai · February 2026
Multi-Agent Debate for Factuality
★★★★★ The foundational paper
Analogy
A courtroom. One lawyer says "guilty," another says "innocent." The jury hears BOTH sides and decides better than either lawyer alone. This paper does that with LLMs arguing about facts.
Diagram
Round 1: Agent A → "42"      Agent B → "37"       Agent C → "42"
Round 2: A sees B,C → "42"   B sees A,C → "hmm"   C sees A,B → "42"
Round 3: A: "42 ✓"           B: "Ok 42 ✓"         C: "42 ✓"
Converged → 42 (correct!)
The Core Trick
Multiple LLMs independently answer, share, debate, converge. Wrong answers don't survive cross-examination. Free accuracy boost - no fine-tuning needed. 10-20% improvement on math and factual QA.
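A minimal sketch of that loop, assuming a generic `ask_llm(prompt)` callable and caller-supplied agent names; GlassBox's actual rounds (Position → Reaction → Convergence) add structure on top of this:

```python
from typing import Callable, List

def debate(question: str, agents: List[str], ask_llm: Callable[[str], str],
           rounds: int = 3) -> List[str]:
    """Each agent answers independently, then revises after reading the others."""
    answers = [ask_llm(f"You are {a}. Answer concisely: {question}") for a in agents]
    for _ in range(rounds - 1):
        revised = []
        for i, agent in enumerate(agents):
            peers = "\n".join(f"- {agents[j]}: {ans}"
                              for j, ans in enumerate(answers) if j != i)
            revised.append(ask_llm(
                f"You are {agent}. Question: {question}\n"
                f"Your previous answer: {answers[i]}\n"
                f"Other agents said:\n{peers}\n"
                "Reconsider and give your (possibly updated) answer."
            ))
        answers = revised
    return answers  # ideally converged; any remaining disagreement is itself a signal
```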
→ GlassBox borrows: Our 3-round debate (Position → Reaction → Convergence) is a direct implementation of this insight.
01 / 16
ChatEval: Debate for Evaluation
★★★★ Debate for judging, not generating
Analogy
Movie critics. One says "masterpiece," another says "overrated." Their public disagreement is more informative than any single review. Readers get a nuanced picture.
Diagram
"Rate this essay" → Judge 1: 7/10 "Clear structure"
                  → Judge 2: 4/10 "Lacks evidence"
                  → Judge 3: 8/10 "Creative angle"
                        ↓
            Debate → Agreed: 6/10 (matches human rating)
The Core Trick
Use debate for evaluation, not generation. Multiple LLM personas judge quality, debate their scores, converge. Matches human judgments better than any single LLM judge.
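A sketch of debate-as-evaluation under the same assumptions (a generic `ask_llm` callable, illustrative persona names); the score parsing is deliberately naive:

```python
import statistics
from typing import Callable, List

def debate_judge(item: str, personas: List[str],
                 ask_llm: Callable[[str], str]) -> float:
    """Judge personas score independently, read each other's rationales, then re-score."""
    first_pass = [ask_llm(f"You are {p}. Rate this 1-10 and justify in one line:\n{item}")
                  for p in personas]
    shared = "\n".join(first_pass)
    finals = []
    for p in personas:
        reply = ask_llm(
            f"You are {p}. The other judges said:\n{shared}\n"
            f"Debate their points, then reply with ONLY your final 1-10 score for:\n{item}"
        )
        finals.append(float(reply.strip().split()[0]))  # assumes the score comes first
    return statistics.mean(finals)
```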
→ GlassBox borrows: Our LLM-as-judge after Round 3 uses debate-as-evaluation. Three agents debating > one agent scoring.
02 / 16
Society of Mind for LLMs
★★★★ Personality matters
Analogy
A brainstorm. If everyone is open ("what do you think?"), the group finds better ideas. If one person dominates ("I'm right, shut up"), the group converges on that person's idea, even if it's wrong.
Diagram
Easy-going agents      │  Overconfident agents
        ↓              │          ↓
Listen to each other   │  Bulldoze others
        ↓              │          ↓
Better answers         │  Worse answers
        ↓              │          ↓
Truth emerges          │  Loudest voice wins
The Core Trick
Agent personality matters as much as capability. Easy-going agents that reference each other by name produce better results. Overconfident agents kill group accuracy.
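One illustrative way to encode that finding as persona prompts. The wording here is hypothetical; only the @architect/@critic names and the reference-by-name rule come from the takeaway below:

```python
# Hypothetical persona prompts: distinct but non-domineering voices that must
# address peers by @name.
PERSONAS = {
    "@architect": "Think long-term about structure. Stay open to being wrong.",
    "@critic": "Hunt for failure modes. Argue with evidence, not volume.",
}

SHARED_RULES = (
    "Address other agents by name, e.g. 'I disagree with @architect because ...'. "
    "Never dismiss a peer without giving a reason."
)

def system_prompt(name: str) -> str:
    return f"You are {name}. {PERSONAS[name]} {SHARED_RULES}"
```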
→ GlassBox borrows: Agents say "I agree/disagree with @architect because..." - this referencing-by-name pattern comes directly from this research.
03 / 16
Tree of Thoughts
★★★★★ System-2 thinking for LLMs
Analogy
Chess. A grandmaster doesn't play the first move that looks good (System-1). They explore multiple sequences, evaluate each, prune bad ones, pick the best path (System-2). This paper gives LLMs that ability.
Diagram
          ┌── Thought A1 → A2 → A3 → Score: 0.8 ✓
Problem ──┤
          ├── Thought B1 → B2 → ✗ (pruned, dead end)
          │
          └── Thought C1 → C2 → C3 → Score: 0.6
                   ↓
Best path: A1 → A2 → A3
The Core Trick
Instead of one pass (chain-of-thought), explore a tree of reasoning paths. Evaluate each branch. Prune bad ones. Turns LLMs from "fast and sloppy" into "slow and deliberate."
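A minimal breadth-first sketch of the idea, with `propose` and `score` standing in for LLM calls (the names, depth, and beam width are illustrative):

```python
from typing import Callable, List, Tuple

def tree_of_thoughts(problem: str,
                     propose: Callable[[str, List[str]], List[str]],  # candidate next thoughts
                     score: Callable[[str, List[str]], float],        # rates a partial path 0..1
                     depth: int = 3, beam: int = 2) -> List[str]:
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(depth):
        candidates = []
        for _, path in frontier:
            for thought in propose(problem, path):
                new_path = path + [thought]
                candidates.append((score(problem, new_path), new_path))
        if not candidates:
            break  # nothing left to expand; keep the current frontier
        # prune: keep only the `beam` most promising partial paths
        frontier = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam]
    return max(frontier, key=lambda c: c[0])[1]  # best full reasoning path
```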
→ GlassBox borrows: Multi-round debate is a social version of tree search. Each agent explores a different branch, debate prunes bad paths.
04 / 16
EigenTrust: PageRank for Trust
(Kamvar et al.) · WWW 2003 · ACM DL
★★★★★ The OG trust algorithm
Analogy
Google PageRank. A website is important if important websites link to it. EigenTrust does the same for people: you're trustworthy if trustworthy people vouch for you. Reputation as linear algebra.
Diagram
Local trust:                       Global trust:
A trusts B: 0.9 ┐               ┌──────────────────────────────┐
C trusts B: 0.7 ├─ eigenvector →│ Global trust(B) = 0.82       │
D trusts B: 0.3 ┘    of matrix  │ (weighted by how trusted     │
                                │  A, C, D themselves are)     │
                                └──────────────────────────────┘
The Core Trick
Compute global trust from local ratings via eigenvector iteration. Malicious peers can't game the system - their votes are weighted by their own reputation. Same math as Google, but for trust.
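A power-iteration sketch of that computation (variable names and the pre-trust mixing weight are illustrative; the real algorithm adds normalization details and a distributed version):

```python
import numpy as np

def eigentrust(local: np.ndarray, pretrust: np.ndarray,
               alpha: float = 0.1, iters: int = 50) -> np.ndarray:
    """local[i, j] = how much peer i trusts peer j; pretrust = prior over known-good peers."""
    # Row-normalize so each peer's outgoing trust sums to 1.
    C = local / np.maximum(local.sum(axis=1, keepdims=True), 1e-12)
    t = pretrust.astype(float).copy()
    for _ in range(iters):
        # Each vote is weighted by how trusted the voter currently is.
        t = (1 - alpha) * C.T @ t + alpha * pretrust
    return t  # converges toward the principal eigenvector, damped by the pretrust prior
```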
→ GlassBox borrows: Our EMA trust scoring is a simplified version. Roadmap: full EigenTrust where agents rate each other bidirectionally.
05 / 16
Trust Models Survey
(Pinyol & Sabater-Mir) · ACM Computing Surveys 2013 · ACM DL
★★★★ The encyclopedia of trust
Analogy
A Michelin guide, but for trust algorithms. It doesn't invent a new restaurant - it visits every restaurant (trust model) that exists, rates them, tells you which one for which occasion.
Diagram
Trust Models Map:
┌───────────────┬────────────────────────────────────────┐
│ Cognitive     │ "I trust you because I understand you" │
│ Game Theory   │ "Cooperation is the rational choice"   │
│ Probabilistic │ "The data says you're reliable"        │
│ Social        │ "Others trust you, so I will too"      │
└───────────────┴────────────────────────────────────────┘
Pick based on: your agents, domain, constraints
The Core Trick
Not a single algorithm - it's the map of the entire design space. Every trust decision has tradeoffs. This survey tells you which model to pick and why.
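GlassBox's pick (see the takeaway below) sits in the probabilistic row: an exponential moving average over observed outcomes. A generic EMA update looks like this; the smoothing factor is illustrative, not the project's actual setting:

```python
def update_trust(old_trust: float, outcome: float, alpha: float = 0.2) -> float:
    """EMA update: outcome in [0, 1] (1.0 = the agent's claim held up, 0.0 = it didn't).
    Recent behavior counts more than old behavior; alpha controls how much more."""
    return (1 - alpha) * old_trust + alpha * outcome

# e.g. update_trust(0.70, 1.0) == 0.76; a string of failures decays the score smoothly.
```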
→ GlassBox borrows: We chose EMA (probabilistic) for v0.3.0 simplicity. This survey maps our upgrade path to cognitive/social trust.
06 / 16
Cognitive vs Emotional Trust
(Shang et al.) · AAAI/ACM AIES 2024
★★★★ Trust has two dimensions
Analogy
A surgeon. You trust them cognitively ("top of their class"). But you also need emotional trust ("I feel safe"). Capability without comfort = anxiety. Comfort without capability = negligence. You need both.
Diagram
        Comfort (Emotional Trust)
        ↑
        │  "Friendly but          ★ IDEAL
        │   incompetent"          "Capable AND
        │   (Clippy)               comfortable"
        │
        │  ✗ WORST                "Smart but scary"
        │  (nothing)              (early ChatGPT)
        └───────────────────────────────────────→
                      Capability (Cognitive Trust)
The Core Trick
Humans trust AI on two axes: cognitive ("it's capable") and emotional ("I'm comfortable"). Showing the reasoning (transparency) boosts emotional trust. You need both.
→ GlassBox borrows: Our entire thesis. Debate transcripts = emotional trust. Trust scores = cognitive trust. Glass box = both.
07 / 16
LLM-as-a-Judge Survey
★★★★ Can LLMs judge other LLMs?
Analogy
Figure skating judges who are also figure skaters. They know the sport, but have biases - they favor their own style, rate the last performer higher, prefer flashy over technically perfect. Useful, but calibrate carefully.
Diagram
Single judge:  LLM → Score   (⚠️ biased: position, verbosity, self-preference)
Multi-judge:   LLM₁ → Score₁ ┐
               LLM₂ → Score₂ ├→ Aggregate → Less biased
               LLM₃ → Score₃ ┘
Debate-judge:  LLMs debate quality → Converge → ✓ Best accuracy
The Core Trick
LLM judges are useful but systematically biased: position bias (prefer first/last), verbosity bias (longer = better), self-preference. Fix: multiple judges, structured rubrics, debate.
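A sketch of two of the cheap mitigations: randomize presentation order (against position bias) and average over several judges (against self-preference). The prompt wording and vote parsing are illustrative:

```python
import random
from typing import Callable, List

def judge_pair(a: str, b: str, judges: List[Callable[[str], str]],
               trials: int = 4) -> float:
    """Fraction of votes for `a`, with presentation order randomized per trial."""
    votes_for_a, total = 0, 0
    for judge in judges:
        for _ in range(trials):
            a_first = random.random() < 0.5
            first, second = (a, b) if a_first else (b, a)
            verdict = judge(
                f"Which answer is better?\n1) {first}\n2) {second}\nReply with 1 or 2 only."
            )
            picked_first = verdict.strip().startswith("1")
            votes_for_a += int(picked_first == a_first)  # map the pick back to `a` or `b`
            total += 1
    return votes_for_a / total
```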
→ GlassBox borrows: Debate-as-judge avoids single-judge bias. Three agents debating > one agent scoring.
08 / 16
FACTS Grounding Benchmark
★★★★★ The hallucination ruler
Analogy
A fact-check desk at a newspaper. Every claim must trace to a source. "Crime is up 40%" - "Source? Page? Date?" No source = claim gets cut. This benchmark does that for LLMs.
Diagram
Given: Source document (the truth)
Given: LLM response (claims facts)
        ↓
FACTS score = % of claims supported by the source
        ↓
Multi-judge evaluation (multiple LLMs verify each claim)
        ↓
Score: 0.0 (all hallucinated) → 1.0 (fully grounded)
The Core Trick
First industry-standard benchmark for measuring how well LLMs stick to source material. Uses multiple judges to verify each claim. The ruler by which all grounding is now measured.
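A sketch of a FACTS-style score: given extracted claims, let several verifier models vote on each and report the supported fraction. The verifier interface is an assumption, not the benchmark's actual harness:

```python
from typing import Callable, List

def grounding_score(source: str, claims: List[str],
                    verifiers: List[Callable[[str, str], bool]]) -> float:
    """Fraction of claims that a majority of verifiers judge as supported by the source."""
    if not claims:
        return 1.0  # nothing asserted, nothing hallucinated
    supported = 0
    for claim in claims:
        votes = sum(verifier(source, claim) for verifier in verifiers)
        if votes > len(verifiers) / 2:
            supported += 1
    return supported / len(claims)  # 0.0 = all hallucinated, 1.0 = fully grounded
```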
→ GlassBox can borrow: A future claim-verification layer should be measured against FACTS-style grounding metrics.
09 / 16
MiniCheck: Cheap Fact-Checking
★★★★ David beats Goliath
Analogy
Airport security. You don't need a PhD detective to check boarding passes. A well-trained guard with a scanner does the job at 1/400th the cost. You don't need GPT-4 to catch lies. A tiny model can do it.
Diagram
GPT-4 fact-check:        MiniCheck (770M params):
Cost: $$$$$              Cost: $
Accuracy: 94%            Accuracy: 93.7%
Speed: slow              Speed: 400x faster
API dependency: yes      Runs locally: yes
The Core Trick
770M parameter model achieves GPT-4-level fact-checking at 400x lower cost. Trained on 11 unified datasets. You don't need a giant model for verification - you need a specialist.
→ GlassBox can borrow: The critic agent could be replaced by a MiniCheck-style specialist - cheaper, faster, runs locally, no API needed.
10 / 16
Self-Refine
★★★★★ Edit your own essay
Analogy
Writing an essay. First draft: rough. Re-read, find weak spots, rewrite. Second draft: better. Third draft: polished. You're your own editor. This paper makes LLMs do the same - generate, critique, refine - no training needed.
Diagram
┌──────────┐       ┌──────────┐       ┌──────────┐
│ Generate │ ────→ │ Critique │ ────→ │  Refine  │
│ (draft)  │       │ (review) │       │ (better) │
└──────────┘       └──────────┘       └────┬─────┘
      ↑                                    │ loop
      └────────────────────────────────────┘
                        ↓
               5-40% improvement
The Core Trick
No training, no fine-tuning, just prompting. LLM generates, critiques its own output, refines. Free 5-40% improvement across code, math, and writing tasks. The simplest self-improvement loop.
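The whole loop fits in a few lines. This sketch assumes a generic `ask_llm` callable and a naive stop signal:

```python
from typing import Callable

def self_refine(task: str, ask_llm: Callable[[str], str], max_iters: int = 3) -> str:
    draft = ask_llm(f"Task: {task}\nWrite a first attempt.")
    for _ in range(max_iters):
        feedback = ask_llm(
            f"Task: {task}\nDraft:\n{draft}\n"
            "List concrete weaknesses. If there are none, reply with exactly: DONE"
        )
        if feedback.strip() == "DONE":
            break
        draft = ask_llm(
            f"Task: {task}\nDraft:\n{draft}\nFeedback:\n{feedback}\n"
            "Rewrite the draft, addressing every point of feedback."
        )
    return draft
```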
→ GlassBox borrows: Our debate is a multi-agent version of Self-Refine. Three agents critique each other instead of one critiquing itself - less echo chamber.
11 / 16
Reflexion: Verbal RL
★★★★★ Learning from failure, in English
Analogy
A student with a mistake journal. After every failed exam: "Got Q3 wrong because I confused velocity with acceleration." Next exam, re-read the journal. No tutoring needed - just honest self-reflection stored in memory.
Diagram
Attempt 1: Try → Fail → "Failed because I missed edge cases"
                              ↓ store in memory
Attempt 2: Read memory → Try → Fail → "Forgot the base case"
                                           ↓ store
Attempt 3: Read memory → Try with both → ✓ Pass!
Result: 91% on HumanEval (vs 80% baseline). No weight updates.
The Core Trick
Store verbal failure reflections in memory. Agent reads them before retrying. No gradients, no fine-tuning - just writing down why you failed. 91% pass@1 on HumanEval.
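A sketch of the memory loop; `attempt`, `evaluate`, and `reflect` stand in for the agent, an external check (e.g. unit tests), and the reflection prompt:

```python
from typing import Callable, List

def reflexion_loop(task: str,
                   attempt: Callable[[str, List[str]], str],
                   evaluate: Callable[[str], bool],
                   reflect: Callable[[str, str], str],
                   max_trials: int = 3) -> str:
    memory: List[str] = []              # verbal lessons, e.g. "I missed the edge cases"
    result = ""
    for _ in range(max_trials):
        result = attempt(task, memory)  # the agent reads past reflections before trying
        if evaluate(result):            # tests pass / answer verified
            return result
        memory.append(reflect(task, result))  # write down *why* it failed, in plain English
    return result
```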
→ GlassBox can borrow: Our feedback flywheel's failure log (F1-F15 taxonomy) is exactly this - verbal reflections the agent reads before tackling similar issues.
12 / 16
Self-Correct via RL
★★★★ Intrinsic self-correction
Analogy
Learning to catch your own typos. At first you need spell-check (external). After years of writing, you instinctively pause at a word that "looks wrong" - you've internalized the correction. This paper trains that instinct into LLMs.
Diagram
Before (external):
LLM → wrong answer → human says "wrong" → retry
After (intrinsic, this paper):
LLM → answer → "wait, that doesn't feel right" → self-corrects
                        ↑
        Trained via RL to recognize own mistakes
The Core Trick
First method that trains intrinsic self-correction into LLMs via RL. The model improves answers without any external feedback. Previous papers needed external signals - this one doesn't.
→ GlassBox can borrow: Long-term - if debate agents could intrinsically self-correct, we'd need fewer rounds.
13 / 16
Code Repair as Exploration
★★★★ Don't retry the same fix
Analogy
Lost your keys. Do you search the same pocket 3 times (exploit)? Or check the table, coat, drawer (explore)? Searching the same place repeatedly is insane - but that's exactly what coding agents do when they retry the same broken fix.
Diagram
✗ BAD (exploit):               ✓ GOOD (explore):
Attempt 1: Fix A → fail        Attempt 1: Fix A → fail
Attempt 2: Fix A → fail        Attempt 2: Fix B → fail
Attempt 3: Fix A → fail        Attempt 3: Fix C → pass!
        ↓                              ↓
Same bug, 3 times              Different strategies, found it
The Core Trick
Frame code repair as a tree search with exploration-exploitation tradeoff. Better expansion policies β better fixes. Agents that explore diverse strategies fix 2x more bugs.
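A sketch of an exploration-first retry policy in the spirit of REQ-07 (the function names are illustrative, not GlassBox's actual interfaces): the proposer sees every failed strategy and must try something different.

```python
from typing import Callable, List, Optional

def repair_with_exploration(bug_report: str,
                            propose_fix: Callable[[str, List[str]], str],
                            run_tests: Callable[[str], bool],
                            max_attempts: int = 3) -> Optional[str]:
    tried: List[str] = []
    for _ in range(max_attempts):
        # The proposer is shown all failed strategies and asked for a different one.
        fix = propose_fix(bug_report, tried)
        if fix in tried:
            continue                    # refuse to exploit the same broken fix again
        if run_tests(fix):
            return fix                  # explored our way to a passing fix
        tried.append(fix)
    return None                         # escalate instead of looping on one strategy
```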
→ GlassBox must borrow: Issue #18 failed exactly this way - the agent retried the same approach 3 times. REQ-07 now mandates different strategies per retry.
14 / 16
AI Safety via Debate
★★★★★ The paper that started it all
Analogy
The legal system. You (the judge) can't investigate every crime yourself. Two lawyers compete to convince you. Because it's adversarial, lies get exposed. It works even though the judge is weaker than the lawyers.
Diagram
     Human judge (limited, can't verify everything)
                       ↑
     Agent A ←── debate ──→ Agent B
     (argues X)             (argues Y)
                       ↓
     Zero-sum game: lying is punished
     Truth is the Nash equilibrium
                       ↓
     Human gets correct answer despite being weaker
The Core Trick
Two AIs debate to help a weak human judge - even on tasks too complex for the human alone. Debate as a zero-sum game where truth is the Nash equilibrium. Lying agents get caught by the opponent.
→ GlassBox borrows: This IS our foundational philosophy. Debate = safety. Transparency = trust. The human sees the reasoning, not just the answer.
15 / 16
Constitutional AI
★★★★★ Rules instead of humans
Analogy
Raising a child. Option A: hire a nanny to watch every move (RLHF - expensive, doesn't scale). Option B: teach principles - "be kind, tell the truth" - and let them self-correct (Constitutional AI - scales infinitely).
Diagram
RLHF (old way):              Constitutional AI (this paper):
Human labels: "harmful"      Principles: "Be helpful, harmless, honest"
✗ expensive, slow            ✓ free, instant
✗ 50K+ annotations           ✓ 0 annotations
✗ doesn't scale              ✓ scales infinitely
                                    ↓
                             AI: "Is my response harmful?"
                             AI: "Let me fix it per principles"
                                    ↓
                             RLAIF (AI feedback replaces human feedback)
The Core Trick
Replace human feedback (RLHF) with AI self-judgment against a constitution of principles (RLAIF). Zero human labels for safety alignment. AI self-improves by asking "does this follow my principles?"
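A sketch of the critique-and-revise step (the data-generation half of Constitutional AI, without the RL phase); the principles and prompt wording are illustrative:

```python
from typing import Callable, List

CONSTITUTION: List[str] = [
    "Be helpful.",
    "Be harmless: do not assist with harmful requests.",
    "Be honest: do not assert things you cannot support.",
]

def constitutional_revise(draft: str, ask_llm: Callable[[str], str]) -> str:
    principles = "\n".join(f"- {p}" for p in CONSTITUTION)
    critique = ask_llm(
        f"Principles:\n{principles}\n\nResponse:\n{draft}\n\n"
        "Does the response violate any principle? Answer briefly."
    )
    return ask_llm(
        f"Principles:\n{principles}\n\nResponse:\n{draft}\n\nCritique:\n{critique}\n\n"
        "Rewrite the response so it follows every principle."
    )
```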
→ GlassBox borrows: Agent prompts act as a mini-constitution: "@architect: think long-term", "@critic: find failures." Principles guide behavior without a human in the loop on every turn.
16 / 16
What GlassBox Borrows - and What's Next
✓ = already implemented in v0.3.0 · □ = on the roadmap
✓ 01 Multi-Agent Debate
3-round debate: Position → Reaction → Convergence
✓ 02 ChatEval
Debate-as-evaluation after Round 3 (LLM judge)
✓ 03 Society of Mind
Agents reference each other by @name, have distinct personalities
✓ 04 Tree of Thoughts
Debate as social tree search (each agent = different branch)
✓ 05 EigenTrust
EMA trust scoring (simplified). Scores persist in SQLite.
✓ 06 Trust Survey
Chose probabilistic trust (EMA) for v0.3.0
✓ 07 Cognitive/Emotional
Transcripts = emotional trust, scores = cognitive trust
✓ 08 LLM-as-Judge
Multi-agent debate avoids single-judge bias
□ 09 FACTS Grounding
Claim verification layer (roadmap item)
□ 10 MiniCheck
Replace critic with lightweight specialist model
✓ 11 Self-Refine
Debate IS multi-agent Self-Refine (generate → critique → refine)
□ 12 Reflexion
Failure log (F1-F15) as verbal reflections for agent memory
□ 13 Self-Correct RL
Intrinsic self-correction (long-term, needs fine-tuning)
□ 14 Code Repair
Exploration-exploitation retry strategy (REQ-07)
✓ 15 Safety via Debate
Core philosophy: debate = safety, transparency = trust
✓ 16 Constitutional AI
Agent prompts as mini-constitution (system prompts with principles)
Score: 11/16 borrowed · 5 more on the roadmap · Each one is a GitHub issue waiting to happen