2106.12928
Exploration-Exploitation in Multi-Agent Competition: Convergence with Bounded Rationality
Stefanos Leonardos, Georgios Piliouras, Kelly Spendlove
correct (high confidence)
- Category
- math.DS
- Journal tier
- Strong Field
- Processed
- Sep 28, 2025, 12:56 AM
- arXiv Links
- Abstract ↗ · PDF ↗
Audit review
The uploaded paper proves that in weighted zero-sum polymatrix games with strictly positive exploration rates, Q-learning admits a unique QRE and all interior trajectories converge to it exponentially; the proof hinges on an exact dissipation identity for the weighted KL divergence (Theorem 4.1; see the statement and the identity $\frac{d}{dt} D_{\mathrm{KL}}^{(w)}(p \,\|\, x(t)) = -\sum_k w_k T_k \bigl[ D_{\mathrm{KL}}(p_k \,\|\, x_k) + D_{\mathrm{KL}}(x_k \,\|\, p_k) \bigr]$ in the main text and appendix). The derivation uses two key ingredients: (i) a time-derivative formula (Lemma 4.2) and (ii) a structural cancellation identity tailored to rescaled zero-sum polymatrix games (Lemma 4.3, via a payoff-equivalent transformation, Property 2). The candidate solution reproduces the same dissipation identity and conclusions once the correct cross-term cancellation is in place, and it correctly invokes the QRE fixed-point characterization and existence (Theorem 3.2). However, its opening step incorrectly asserts that the weighted zero-sum hypothesis implies the per-edge antisymmetry $w_k A_{kl} + (w_l A_{lk})^{\top} = 0$; the paper instead shows an equivalent pairwise constant-sum representation after a payoff-equivalent transformation, not antisymmetry of the original matrices ($\hat{a}_{kl} + \hat{a}_{lk}^{\top} = c_{kl}$ with $\sum c_{kl} = 0$). Because this incorrect equivalence is used to justify the key cancellation, the model's proof is logically invalid as written, even though its remaining steps align with the paper's argument (the derivative formula of Lemma 4.2 and the QRE identity) and would go through if Lemma 4.3 were invoked in place of antisymmetry.
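To make the distinction concrete, here is a minimal LaTeX sketch of the identity and of the two candidate cancellation hypotheses, in the notation of the audit above (the paper's exact statements may differ in detail):

% Dissipation identity (Theorem 4.1): for the QRE p and any interior trajectory x(t),
\frac{d}{dt}\, D_{\mathrm{KL}}^{(w)}\bigl(p \,\|\, x(t)\bigr)
  = -\sum_{k} w_k T_k \Bigl[\, D_{\mathrm{KL}}(p_k \,\|\, x_k) + D_{\mathrm{KL}}(x_k \,\|\, p_k) \Bigr]
% The candidate solution assumes per-edge antisymmetry of the original payoffs:
w_k A_{kl} + (w_l A_{lk})^{\top} = 0 \quad \text{for every edge } (k,l)
% The paper establishes instead (Lemma 4.3 via Property 2) a payoff-equivalent
% transformation under which each edge is merely constant-sum:
\hat{a}_{kl} + \hat{a}_{lk}^{\top} = c_{kl}, \qquad \textstyle\sum_{(k,l)} c_{kl} = 0
% The constant-sum form already suffices for the cross terms in the Lyapunov
% derivative to cancel; antisymmetry is a strictly stronger, unwarranted hypothesis.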
Referee report (LaTeX)
\textbf{Recommendation:} minor revisions
\textbf{Journal Tier:} strong field
\textbf{Justification:}
The paper establishes a clear and sharp convergence theorem for smooth Q-learning in weighted zero-sum polymatrix games, providing an exact Lyapunov identity, uniqueness of QRE, and exponential convergence. The argument is clean and modular (derivative identity + structural cancellation), and the result fills an important gap in the literature on learning in competitive multi-agent systems. Minor clarifications (e.g., highlighting the payoff-equivalent transformation and its role) would further improve readability.
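For concreteness, a minimal numerical sketch of the Lyapunov identity in the simplest two-player, unweighted zero-sum case (the payoff matrix, temperature, and the Boltzmann form of the Q-learning dynamics below are illustrative assumptions, not taken from the paper's experiments):

import numpy as np

def softmax(u, T):
    """Boltzmann distribution over payoffs u at temperature T."""
    z = u / T - np.max(u / T)
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL divergence D_KL(p || q) for strictly positive p, q."""
    return float(np.sum(p * np.log(p / q)))

def qdot(x, u, T):
    """Smooth Q-learning dynamics: replicator drift plus entropy regularization."""
    return x * ((u - x @ u) - T * (np.log(x) - x @ np.log(x)))

# Hypothetical 2x2 zero-sum game: player 1 gets A, player 2 gets -A^T.
A = np.array([[1.0, -0.5], [-1.0, 2.0]])
T = 1.0  # common exploration rate T_k = T; weights w_k = 1

# Unique QRE for T > 0, located by damped fixed-point iteration on the Boltzmann map.
p1, p2 = np.ones(2) / 2, np.ones(2) / 2
for _ in range(5000):
    p1 = 0.5 * p1 + 0.5 * softmax(A @ p2, T)
    p2 = 0.5 * p2 + 0.5 * softmax(-A.T @ p1, T)

# Compare d/dt D_KL(p || x(t)) against -T * sum_k [D(p_k||x_k) + D(x_k||p_k)].
rng = np.random.default_rng(0)
x1, x2 = rng.dirichlet(np.ones(2)), rng.dirichlet(np.ones(2))
dt = 1e-5
for _ in range(5):
    D0 = kl(p1, x1) + kl(p2, x2)
    rhs = -T * (kl(p1, x1) + kl(x1, p1) + kl(p2, x2) + kl(x2, p2))
    dx1, dx2 = qdot(x1, A @ x2, T), qdot(x2, -A.T @ x1, T)
    x1, x2 = x1 + dt * dx1, x2 + dt * dx2
    lhs = (kl(p1, x1) + kl(p2, x2) - D0) / dt
    print(f"finite-diff d/dt D_KL = {lhs:+.6f}   identity RHS = {rhs:+.6f}")

The forward-difference derivative of the KL potential should track the identity's right-hand side to within O(dt) here; in the general weighted polymatrix case, the analogous cancellation goes through only via the constant-sum representation of Lemma 4.3, which is the point at issue in the audit.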