Multi-agent debate outperforms single-model chat
Argue · Critique · Converge
Multiple agents propose answers, cross-examine one another's reasoning, and a selector then aggregates the strongest chain into the final reply. Across tasks, this consistently beats the one-shot reply of the same base model; a minimal sketch of the loop follows the results below.
- Arithmetic accuracy rose from ~67% to ~82% when using multi-agent debate vs. single model output (same base LM).
- Grade-school math improved from ~77% to ~85% under debate protocols.
- Debate reduces invalid reasoning steps and improves factual consistency compared to single-pass generation or simple self-reflection.
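
A minimal sketch of the propose-critique-converge loop, assuming a generic `generate(prompt) -> reply` callable in place of any specific LLM API. The prompts, round count, and the majority-vote selector here are illustrative stand-ins, not the exact protocol behind the numbers above (stronger selectors rank whole reasoning chains rather than voting on final answers).

```python
# Sketch of multi-agent debate (assumed structure, not the reference implementation).
# `generate` stands in for any chat-completion call mapping a prompt to a reply.
from collections import Counter
from typing import Callable, List


def debate(question: str,
           generate: Callable[[str], str],
           n_agents: int = 3,
           n_rounds: int = 2) -> str:
    # Round 0: each agent proposes an independent answer with reasoning.
    answers: List[str] = [
        generate(f"Question: {question}\nGive your answer and reasoning.")
        for _ in range(n_agents)
    ]

    # Debate rounds: every agent sees the others' answers,
    # critiques them, and revises its own answer.
    for _ in range(n_rounds):
        revised = []
        for i, own in enumerate(answers):
            others = "\n\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (
                f"Question: {question}\n"
                f"Your previous answer:\n{own}\n\n"
                f"Other agents answered:\n{others}\n\n"
                "Critique their reasoning, then give your updated answer."
            )
            revised.append(generate(prompt))
        answers = revised

    # Converge: a simple majority vote over the final answers acts as the selector.
    return Counter(answers).most_common(1)[0][0]
```

Any chat model can be plugged in by wrapping its API call as `generate`; the loop itself stays model-agnostic.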