Research Paper Explainer
16 papers that power GlassBox AI - explained visually
agentic-trust-labs/glassbox-ai · February 2026
Multi-Agent Debate for Factuality
★★★★★ The foundational paper
Analogy
A courtroom. One lawyer says "guilty," another says "innocent." The jury hears BOTH sides and decides better than either lawyer alone. This paper does that with LLMs arguing about facts.
Diagram
Round 1: Agent A → "42"      Agent B → "37"       Agent C → "42"
Round 2: A sees B,C → "42"   B sees A,C → "hmm"   C sees A,B → "42"
Round 3: A: "42 ✓"           B: "Ok 42 ✓"         C: "42 ✓"
Converged → 42 (correct!)
The Core Trick
Multiple LLMs independently answer, share, debate, converge. Wrong answers don't survive cross-examination. Free accuracy boost - no fine-tuning needed. 10-20% improvement on math and factual QA.
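A minimal sketch of that loop, assuming a generic `ask_llm(prompt)` callable and caller-supplied agent names; GlassBox's actual rounds (Position → Reaction → Convergence) add structure on top of this:

```python
from typing import Callable, List

def debate(question: str, agents: List[str], ask_llm: Callable[[str], str],
           rounds: int = 3) -> List[str]:
    """Each agent answers independently, then revises after reading the others."""
    answers = [ask_llm(f"You are {a}. Answer concisely: {question}") for a in agents]
    for _ in range(rounds - 1):
        revised = []
        for i, agent in enumerate(agents):
            peers = "\n".join(f"- {agents[j]}: {ans}"
                              for j, ans in enumerate(answers) if j != i)
            revised.append(ask_llm(
                f"You are {agent}. Question: {question}\n"
                f"Your previous answer: {answers[i]}\n"
                f"Other agents said:\n{peers}\n"
                "Reconsider and give your (possibly updated) answer."
            ))
        answers = revised
    return answers  # ideally converged; any remaining disagreement is itself a signal
```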
→ GlassBox borrows: Our 3-round debate (Position → Reaction → Convergence) is a direct implementation of this insight.
01 / 16
ChatEval: Debate for Evaluation
★★★★ Debate for judging, not generating
Analogy
Movie critics. One says "masterpiece," another says "overrated." Their public disagreement is more informative than any single review. Readers get a nuanced picture.
Diagram
"Rate this essay" → Judge 1: 7/10 "Clear structure"
                  → Judge 2: 4/10 "Lacks evidence"
                  → Judge 3: 8/10 "Creative angle"
                        ↓
            Debate → Agreed: 6/10 (matches human rating)
The Core Trick
Use debate for evaluation, not generation. Multiple LLM personas judge quality, debate their scores, converge. Matches human judgments better than any single LLM judge.
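A sketch of debate-as-evaluation under the same assumptions (a generic `ask_llm` callable, illustrative persona names); the score parsing is deliberately naive:

```python
import statistics
from typing import Callable, List

def debate_judge(item: str, personas: List[str],
                 ask_llm: Callable[[str], str]) -> float:
    """Judge personas score independently, read each other's rationales, then re-score."""
    first_pass = [ask_llm(f"You are {p}. Rate this 1-10 and justify in one line:\n{item}")
                  for p in personas]
    shared = "\n".join(first_pass)
    finals = []
    for p in personas:
        reply = ask_llm(
            f"You are {p}. The other judges said:\n{shared}\n"
            f"Debate their points, then reply with ONLY your final 1-10 score for:\n{item}"
        )
        finals.append(float(reply.strip().split()[0]))  # assumes the score comes first
    return statistics.mean(finals)
```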
→ GlassBox borrows: Our LLM-as-judge after Round 3 uses debate-as-evaluation. Three agents debating > one agent scoring.
02 / 16
Society of Mind for LLMs
★★★★ Personality matters
Analogy
A brainstorm. If everyone is open ("what do you think?"), the group finds better ideas. If one person dominates ("I'm right, shut up"), the group converges on that person's idea, even if it's wrong.
Diagram
Easy-going agents      │  Overconfident agents
        ↓              │          ↓
Listen to each other   │  Bulldoze others
        ↓              │          ↓
Better answers         │  Worse answers
        ↓              │          ↓
Truth emerges          │  Loudest voice wins
The Core Trick
Agent personality matters as much as capability. Easy-going agents that reference each other by name produce better results. Overconfident agents kill group accuracy.
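One illustrative way to encode that finding as persona prompts. The wording here is hypothetical; only the @architect/@critic names and the reference-by-name rule come from the takeaway below:

```python
# Hypothetical persona prompts: distinct but non-domineering voices that must
# address peers by @name.
PERSONAS = {
    "@architect": "Think long-term about structure. Stay open to being wrong.",
    "@critic": "Hunt for failure modes. Argue with evidence, not volume.",
}

SHARED_RULES = (
    "Address other agents by name, e.g. 'I disagree with @architect because ...'. "
    "Never dismiss a peer without giving a reason."
)

def system_prompt(name: str) -> str:
    return f"You are {name}. {PERSONAS[name]} {SHARED_RULES}"
```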
→ GlassBox borrows: Agents say "I agree/disagree with @architect because..." - this referencing-by-name pattern comes directly from this research.
03 / 16
Tree of Thoughts
★★★★★ System-2 thinking for LLMs
Analogy
Chess. A grandmaster doesn't play the first move that looks good (System-1). They explore multiple sequences, evaluate each, prune bad ones, pick the best path (System-2). This paper gives LLMs that ability.
Diagram
          ┌── Thought A1 → A2 → A3 → Score: 0.8 ✓
Problem ──┤
          ├── Thought B1 → B2 → ✗ (pruned, dead end)
          │
          └── Thought C1 → C2 → C3 → Score: 0.6
                   ↓
Best path: A1 → A2 → A3
The Core Trick
Instead of one pass (chain-of-thought), explore a tree of reasoning paths. Evaluate each branch. Prune bad ones. Turns LLMs from "fast and sloppy" into "slow and deliberate."
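A minimal breadth-first sketch of the idea, with `propose` and `score` standing in for LLM calls (the names, depth, and beam width are illustrative):

```python
from typing import Callable, List, Tuple

def tree_of_thoughts(problem: str,
                     propose: Callable[[str, List[str]], List[str]],  # candidate next thoughts
                     score: Callable[[str, List[str]], float],        # rates a partial path 0..1
                     depth: int = 3, beam: int = 2) -> List[str]:
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(depth):
        candidates = []
        for _, path in frontier:
            for thought in propose(problem, path):
                new_path = path + [thought]
                candidates.append((score(problem, new_path), new_path))
        if not candidates:
            break  # nothing left to expand; keep the current frontier
        # prune: keep only the `beam` most promising partial paths
        frontier = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam]
    return max(frontier, key=lambda c: c[0])[1]  # best full reasoning path
```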
→ GlassBox borrows: Multi-round debate is a social version of tree search. Each agent explores a different branch, debate prunes bad paths.
04 / 16
EigenTrust: PageRank for Trust
(Kamvar et al.) · WWW 2003 · ACM DL
★★★★★ The OG trust algorithm
Analogy
Google PageRank. A website is important if important websites link to it. EigenTrust does the same for people: you're trustworthy if trustworthy people vouch for you. Reputation as linear algebra.
Diagram
Local trust:                       Global trust:
A trusts B: 0.9 ┐               ┌──────────────────────────────┐
C trusts B: 0.7 ├─ eigenvector →│ Global trust(B) = 0.82       │
D trusts B: 0.3 ┘    of matrix  │ (weighted by how trusted     │
                                │  A, C, D themselves are)     │
                                └──────────────────────────────┘
The Core Trick
Compute global trust from local ratings via eigenvector iteration. Malicious peers can't game the system - their votes are weighted by their own reputation. Same math as Google, but for trust.
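A power-iteration sketch of that computation (variable names and the pre-trust mixing weight are illustrative; the real algorithm adds normalization details and a distributed version):

```python
import numpy as np

def eigentrust(local: np.ndarray, pretrust: np.ndarray,
               alpha: float = 0.1, iters: int = 50) -> np.ndarray:
    """local[i, j] = how much peer i trusts peer j; pretrust = prior over known-good peers."""
    # Row-normalize so each peer's outgoing trust sums to 1.
    C = local / np.maximum(local.sum(axis=1, keepdims=True), 1e-12)
    t = pretrust.astype(float).copy()
    for _ in range(iters):
        # Each vote is weighted by how trusted the voter currently is.
        t = (1 - alpha) * C.T @ t + alpha * pretrust
    return t  # converges toward the principal eigenvector, damped by the pretrust prior
```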
→ GlassBox borrows: Our EMA trust scoring is a simplified version. Roadmap: full EigenTrust where agents rate each other bidirectionally.
05 / 16
Trust Models Survey
(Pinyol & Sabater-Mir) · ACM Computing Surveys 2013 · ACM DL
★★★★ The encyclopedia of trust
Analogy
A Michelin guide, but for trust algorithms. It doesn't invent a new restaurant - it visits every restaurant (trust model) that exists, rates them, tells you which one for which occasion.
Diagram
Trust Models Map:
┌───────────────┬────────────────────────────────────────┐
│ Cognitive     │ "I trust you because I understand you" │
│ Game Theory   │ "Cooperation is the rational choice"   │
│ Probabilistic │ "The data says you're reliable"        │
│ Social        │ "Others trust you, so I will too"      │
└───────────────┴────────────────────────────────────────┘
Pick based on: your agents, domain, constraints
The Core Trick
Not a single algorithm - it's the map of the entire design space. Every trust decision has tradeoffs. This survey tells you which model to pick and why.
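GlassBox's pick (see the takeaway below) sits in the probabilistic row: an exponential moving average over observed outcomes. A generic EMA update looks like this; the smoothing factor is illustrative, not the project's actual setting:

```python
def update_trust(old_trust: float, outcome: float, alpha: float = 0.2) -> float:
    """EMA update: outcome in [0, 1] (1.0 = the agent's claim held up, 0.0 = it didn't).
    Recent behavior counts more than old behavior; alpha controls how much more."""
    return (1 - alpha) * old_trust + alpha * outcome

# e.g. update_trust(0.70, 1.0) == 0.76; a string of failures decays the score smoothly.
```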
→ GlassBox borrows: We chose EMA (probabilistic) for v0.3.0 simplicity. This survey maps our upgrade path to cognitive/social trust.
06 / 16
Cognitive vs Emotional Trust
(Shang et al.) · AAAI/ACM AIES 2024
★★★★ Trust has two dimensions
Analogy
A surgeon. You trust them cognitively ("top of their class"). But you also need emotional trust ("I feel safe"). Capability without comfort = anxiety. Comfort without capability = negligence. You need both.
Diagram
        Comfort (Emotional Trust)
        ↑
        │  "Friendly but          ★ IDEAL
        │   incompetent"          "Capable AND
        │   (Clippy)               comfortable"
        │
        │  ✗ WORST                "Smart but scary"
        │  (nothing)              (early ChatGPT)
        └───────────────────────────────────────→
                      Capability (Cognitive Trust)
The Core Trick
Humans trust AI on two axes: cognitive ("it's capable") and emotional ("I'm comfortable"). Showing the reasoning (transparency) boosts emotional trust. You need both.
→ GlassBox borrows: Our entire thesis. Debate transcripts = emotional trust. Trust scores = cognitive trust. Glass box = both.
07 / 16
LLM-as-a-Judge Survey
★★★★ Can LLMs judge other LLMs?
Analogy
Figure skating judges who are also figure skaters. They know the sport, but have biases - they favor their own style, rate the last performer higher, prefer flashy over technically perfect. Useful, but calibrate carefully.
Diagram
Single judge:  LLM → Score   (⚠️ biased: position, verbosity, self-preference)
Multi-judge:   LLM₁ → Score₁ ┐
               LLM₂ → Score₂ ├→ Aggregate → Less biased
               LLM₃ → Score₃ ┘
Debate-judge:  LLMs debate quality → Converge → ✓ Best accuracy
The Core Trick
LLM judges are useful but systematically biased: position bias (prefer first/last), verbosity bias (longer = better), self-preference. Fix: multiple judges, structured rubrics, debate.
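A sketch of two of the cheap mitigations: randomize presentation order (against position bias) and average over several judges (against self-preference). The prompt wording and vote parsing are illustrative:

```python
import random
from typing import Callable, List

def judge_pair(a: str, b: str, judges: List[Callable[[str], str]],
               trials: int = 4) -> float:
    """Fraction of votes for `a`, with presentation order randomized per trial."""
    votes_for_a, total = 0, 0
    for judge in judges:
        for _ in range(trials):
            a_first = random.random() < 0.5
            first, second = (a, b) if a_first else (b, a)
            verdict = judge(
                f"Which answer is better?\n1) {first}\n2) {second}\nReply with 1 or 2 only."
            )
            picked_first = verdict.strip().startswith("1")
            votes_for_a += int(picked_first == a_first)  # map the pick back to `a` or `b`
            total += 1
    return votes_for_a / total
```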
→ GlassBox borrows: Debate-as-judge avoids single-judge bias. Three agents debating > one agent scoring.
08 / 16
FACTS Grounding Benchmark
★★★★★ The hallucination ruler
Analogy
A fact-check desk at a newspaper. Every claim must trace to a source. "Crime is up 40%" - "Source? Page? Date?" No source = claim gets cut. This benchmark does that for LLMs.
Diagram
Given: Source document (the truth)
Given: LLM response (claims facts)
        ↓
FACTS score = % of claims supported by the source
        ↓
Multi-judge evaluation (multiple LLMs verify each claim)
        ↓
Score: 0.0 (all hallucinated) → 1.0 (fully grounded)
The Core Trick
First industry-standard benchmark for measuring how well LLMs stick to source material. Uses multiple judges to verify each claim. The ruler by which all grounding is now measured.
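A sketch of a FACTS-style score: given extracted claims, let several verifier models vote on each and report the supported fraction. The verifier interface is an assumption, not the benchmark's actual harness:

```python
from typing import Callable, List

def grounding_score(source: str, claims: List[str],
                    verifiers: List[Callable[[str, str], bool]]) -> float:
    """Fraction of claims that a majority of verifiers judge as supported by the source."""
    if not claims:
        return 1.0  # nothing asserted, nothing hallucinated
    supported = 0
    for claim in claims:
        votes = sum(verifier(source, claim) for verifier in verifiers)
        if votes > len(verifiers) / 2:
            supported += 1
    return supported / len(claims)  # 0.0 = all hallucinated, 1.0 = fully grounded
```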
→ GlassBox can borrow: A future claim-verification layer should be measured against FACTS-style grounding metrics.
09 / 16
MiniCheck: Cheap Fact-Checking
★★★★ David beats Goliath
Analogy
Airport security. You don't need a PhD detective to check boarding passes. A well-trained guard with a scanner does the job at 1/400th the cost. You don't need GPT-4 to catch lies. A tiny model can do it.
Diagram
GPT-4 fact-check:        MiniCheck (770M params):
Cost: $$$$$              Cost: $
Accuracy: 94%            Accuracy: 93.7%
Speed: slow              Speed: 400x faster
API dependency: yes      Runs locally: yes
The Core Trick
770M parameter model achieves GPT-4-level fact-checking at 400x lower cost. Trained on 11 unified datasets. You don't need a giant model for verification - you need a specialist.
→ GlassBox can borrow: The critic agent could be replaced by a MiniCheck-style specialist - cheaper, faster, runs locally, no API needed.
10 / 16
Self-Refine
★★★★★ Edit your own essay
Analogy
Writing an essay. First draft: rough. Re-read, find weak spots, rewrite. Second draft: better. Third draft: polished. You're your own editor. This paper makes LLMs do the same - generate, critique, refine - no training needed.
Diagram
┌──────────┐       ┌──────────┐       ┌──────────┐
│ Generate │ ────→ │ Critique │ ────→ │  Refine  │
│ (draft)  │       │ (review) │       │ (better) │
└──────────┘       └──────────┘       └────┬─────┘
      ↑                                    │ loop
      └────────────────────────────────────┘
                        ↓
               5-40% improvement
The Core Trick
No training, no fine-tuning, just prompting. LLM generates, critiques its own output, refines. Free 5-40% improvement across code, math, and writing tasks. The simplest self-improvement loop.
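The whole loop fits in a few lines. This sketch assumes a generic `ask_llm` callable and a naive stop signal:

```python
from typing import Callable

def self_refine(task: str, ask_llm: Callable[[str], str], max_iters: int = 3) -> str:
    draft = ask_llm(f"Task: {task}\nWrite a first attempt.")
    for _ in range(max_iters):
        feedback = ask_llm(
            f"Task: {task}\nDraft:\n{draft}\n"
            "List concrete weaknesses. If there are none, reply with exactly: DONE"
        )
        if feedback.strip() == "DONE":
            break
        draft = ask_llm(
            f"Task: {task}\nDraft:\n{draft}\nFeedback:\n{feedback}\n"
            "Rewrite the draft, addressing every point of feedback."
        )
    return draft
```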
→ GlassBox borrows: Our debate is a multi-agent version of Self-Refine. Three agents critique each other instead of one critiquing itself - less echo chamber.
11 / 16
Reflexion: Verbal RL
★★★★★ Learning from failure, in English
Analogy
A student with a mistake journal. After every failed exam: "Got Q3 wrong because I confused velocity with acceleration." Next exam, re-read the journal. No tutoring needed - just honest self-reflection stored in memory.
Diagram
Attempt 1: Try → Fail → "Failed because I missed edge cases"
                              ↓ store in memory
Attempt 2: Read memory → Try → Fail → "Forgot the base case"
                                           ↓ store
Attempt 3: Read memory → Try with both → ✓ Pass!
Result: 91% on HumanEval (vs 80% baseline). No weight updates.
The Core Trick
Store verbal failure reflections in memory. Agent reads them before retrying. No gradients, no fine-tuning - just writing down why you failed. 91% pass@1 on HumanEval.
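A sketch of the memory loop; `attempt`, `evaluate`, and `reflect` stand in for the agent, an external check (e.g. unit tests), and the reflection prompt:

```python
from typing import Callable, List

def reflexion_loop(task: str,
                   attempt: Callable[[str, List[str]], str],
                   evaluate: Callable[[str], bool],
                   reflect: Callable[[str, str], str],
                   max_trials: int = 3) -> str:
    memory: List[str] = []              # verbal lessons, e.g. "I missed the edge cases"
    result = ""
    for _ in range(max_trials):
        result = attempt(task, memory)  # the agent reads past reflections before trying
        if evaluate(result):            # tests pass / answer verified
            return result
        memory.append(reflect(task, result))  # write down *why* it failed, in plain English
    return result
```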
→ GlassBox can borrow: Our feedback flywheel's failure log (F1-F15 taxonomy) is exactly this - verbal reflections the agent reads before tackling similar issues.
12 / 16
Self-Correct via RL
★★★★ Intrinsic self-correction
Analogy
Learning to catch your own typos. At first you need spell-check (external). After years of writing, you instinctively pause at a word that "looks wrong" - you've internalized the correction. This paper trains that instinct into LLMs.
Diagram
Before (external):
LLM → wrong answer → human says "wrong" → retry
After (intrinsic, this paper):
LLM → answer → "wait, that doesn't feel right" → self-corrects
                        ↑
        Trained via RL to recognize own mistakes
The Core Trick
First method that trains intrinsic self-correction into LLMs via RL. The model improves answers without any external feedback. Previous papers needed external signals - this one doesn't.
→ GlassBox can borrow: Long-term - if debate agents could intrinsically self-correct, we'd need fewer rounds.
13 / 16
Code Repair as Exploration
★★★★ Don't retry the same fix
Analogy
Lost your keys. Do you search the same pocket 3 times (exploit)? Or check the table, coat, drawer (explore)? Searching the same place repeatedly is insane - but that's exactly what coding agents do when they retry the same broken fix.
Diagram
✗ BAD (exploit):               ✓ GOOD (explore):
Attempt 1: Fix A → fail        Attempt 1: Fix A → fail
Attempt 2: Fix A → fail        Attempt 2: Fix B → fail
Attempt 3: Fix A → fail        Attempt 3: Fix C → pass!
        ↓                              ↓
Same bug, 3 times              Different strategies, found it
The Core Trick
Frame code repair as a tree search with exploration-exploitation tradeoff. Better expansion policies β better fixes. Agents that explore diverse strategies fix 2x more bugs.
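A sketch of an exploration-first retry policy in the spirit of REQ-07 (the function names are illustrative, not GlassBox's actual interfaces): the proposer sees every failed strategy and must try something different.

```python
from typing import Callable, List, Optional

def repair_with_exploration(bug_report: str,
                            propose_fix: Callable[[str, List[str]], str],
                            run_tests: Callable[[str], bool],
                            max_attempts: int = 3) -> Optional[str]:
    tried: List[str] = []
    for _ in range(max_attempts):
        # The proposer is shown all failed strategies and asked for a different one.
        fix = propose_fix(bug_report, tried)
        if fix in tried:
            continue                    # refuse to exploit the same broken fix again
        if run_tests(fix):
            return fix                  # explored our way to a passing fix
        tried.append(fix)
    return None                         # escalate instead of looping on one strategy
```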
→ GlassBox must borrow: Issue #18 failed exactly this way - the agent retried the same approach 3 times. REQ-07 now mandates different strategies per retry.
14 / 16
AI Safety via Debate
★★★★★ The paper that started it all
Analogy
The legal system. You (the judge) can't investigate every crime yourself. Two lawyers compete to convince you. Because it's adversarial, lies get exposed. It works even though the judge is weaker than the lawyers.
Diagram
     Human judge (limited, can't verify everything)
                       ↑
     Agent A ←── debate ──→ Agent B
     (argues X)             (argues Y)
                       ↓
     Zero-sum game: lying is punished
     Truth is the Nash equilibrium
                       ↓
     Human gets correct answer despite being weaker
The Core Trick
Two AIs debate to help a weak human judge - even on tasks too complex for the human alone. Debate as a zero-sum game where truth is the Nash equilibrium. Lying agents get caught by the opponent.
→ GlassBox borrows: This IS our foundational philosophy. Debate = safety. Transparency = trust. The human sees the reasoning, not just the answer.
15 / 16
Constitutional AI
★★★★★ Rules instead of humans
Analogy
Raising a child. Option A: hire a nanny to watch every move (RLHF - expensive, doesn't scale). Option B: teach principles - "be kind, tell the truth" - and let them self-correct (Constitutional AI - scales infinitely).
Diagram
RLHF (old way):              Constitutional AI (this paper):
Human labels: "harmful"      Principles: "Be helpful, harmless, honest"
✗ expensive, slow            ✓ free, instant
✗ 50K+ annotations           ✓ 0 annotations
✗ doesn't scale              ✓ scales infinitely
                                    ↓
                             AI: "Is my response harmful?"
                             AI: "Let me fix it per principles"
                                    ↓
                             RLAIF (AI feedback replaces human feedback)
The Core Trick
Replace human feedback (RLHF) with AI self-judgment against a constitution of principles (RLAIF). Zero human labels for safety alignment. AI self-improves by asking "does this follow my principles?"
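A sketch of the critique-and-revise step (the data-generation half of Constitutional AI, without the RL phase); the principles and prompt wording are illustrative:

```python
from typing import Callable, List

CONSTITUTION: List[str] = [
    "Be helpful.",
    "Be harmless: do not assist with harmful requests.",
    "Be honest: do not assert things you cannot support.",
]

def constitutional_revise(draft: str, ask_llm: Callable[[str], str]) -> str:
    principles = "\n".join(f"- {p}" for p in CONSTITUTION)
    critique = ask_llm(
        f"Principles:\n{principles}\n\nResponse:\n{draft}\n\n"
        "Does the response violate any principle? Answer briefly."
    )
    return ask_llm(
        f"Principles:\n{principles}\n\nResponse:\n{draft}\n\nCritique:\n{critique}\n\n"
        "Rewrite the response so it follows every principle."
    )
```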
→ GlassBox borrows: Agent prompts act as a mini-constitution: "@architect: think long-term", "@critic: find failures." Principles guide behavior without a human in the loop on every turn.
16 / 16
What GlassBox Borrows - and What's Next
✓ = already implemented in v0.3.0 · □ = on the roadmap
✓ 01 Multi-Agent Debate
3-round debate: Position → Reaction → Convergence
✓ 02 ChatEval
Debate-as-evaluation after Round 3 (LLM judge)
✓ 03 Society of Mind
Agents reference each other by @name, have distinct personalities
✓ 04 Tree of Thoughts
Debate as social tree search (each agent = different branch)
✓ 05 EigenTrust
EMA trust scoring (simplified). Scores persist in SQLite.
✓ 06 Trust Survey
Chose probabilistic trust (EMA) for v0.3.0
✓ 07 Cognitive/Emotional
Transcripts = emotional trust, scores = cognitive trust
✓ 08 LLM-as-Judge
Multi-agent debate avoids single-judge bias
□ 09 FACTS Grounding
Claim verification layer (roadmap item)
□ 10 MiniCheck
Replace critic with lightweight specialist model
✓ 11 Self-Refine
Debate IS multi-agent Self-Refine (generate → critique → refine)
□ 12 Reflexion
Failure log (F1-F15) as verbal reflections for agent memory
□ 13 Self-Correct RL
Intrinsic self-correction (long-term, needs fine-tuning)
□ 14 Code Repair
Exploration-exploitation retry strategy (REQ-07)
✓ 15 Safety via Debate
Core philosophy: debate = safety, transparency = trust
✓ 16 Constitutional AI
Agent prompts as mini-constitution (system prompts with principles)
Score: 11/16 borrowed · 5 more on the roadmap · Each one is a GitHub issue waiting to happen