arXiv:2011.11837

Adaptive Observation-Based Efficient Reinforcement Learning for Uncertain Systems

Maopeng Ran, Lihua Xie

Verdict: correct (high confidence)
Category
math.DS
Journal tier
Strong Field
Processed
Sep 28, 2025, 12:55 AM

Audit review

The paper formulates the plant as a chain-of-integrators with unknown parametric drift and designs a concurrent-learning adaptive extended observer (CL-AEO) coupled with an actor–critic RL scheme that leverages both instantaneous and extrapolated Bellman errors. The key elements appear explicitly: plant (2) and Problems 1–2, the CL-AEO (6) with two-time-scale structure, the normalized least-squares critic and damped actor updates (25)–(27), and the extrapolated-excitation Assumption A1 (28). The paper proves (i) practical convergence of the CL-AEO (Theorem 1) using a high-gain/singular-perturbation argument for the scaled error and a CL Lyapunov analysis for the parameter update; and (ii) uniform ultimate boundedness (UUB) of the state and NN weights (Theorem 2) via a composite Lyapunov function and gain conditions (40)–(41). The candidate solution mirrors these steps: a fast observer yielding O(ε) error, CL-based parameter convergence under rank(Z)=m, bounded critic/actor weights under A1 with a bounded Γ, and a composite Lyapunov argument for UUB. Minor differences: the candidate replaces the paper's ε-smallness condition in the W̃-inequality with a stronger spectral lower bound on the stored-data Gramian, which is not guaranteed by the recording algorithm but not essential to the paper's proof; and it de-emphasizes explicit approximation error terms ς, which the paper keeps and later argues can be made small. Overall, both arguments are logically consistent and essentially the same in structure and technique.

Referee report (LaTeX)

\textbf{Recommendation:} minor revisions

\textbf{Journal Tier:} strong field

\textbf{Justification:}

The paper unifies concurrent-learning adaptive observation with model-based actor–critic reinforcement learning driven by extrapolated Bellman errors. It rigorously relaxes persistent-excitation requirements and avoids derivative/integral estimation while providing UUB guarantees. The assumptions are standard for the area, though somewhat heavy, and several constants in the gain conditions are not directly measurable in practice. Clearer notation and a fuller discussion of the data-recording algorithm would strengthen the contribution.
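The extrapolated-Bellman-error mechanism that relaxes PE can be illustrated on a scalar LQR toy problem (all dynamics, costs, and gains here are hypothetical, not the paper's system): the Bellman residual is evaluated at sampled states rather than only along the measured trajectory, and the critic weight is fit by least squares over those samples.

```python
import numpy as np

# Scalar stand-in: xdot = a*x + b*u, running cost r(x,u) = q*x**2 + R*u**2.
a, b = 1.0, 1.0          # open-loop unstable
q, R = 1.0, 1.0

# Reference: scalar Riccati equation 2*a*p - (b**2/R)*p**2 + q = 0.
p_true = R * (a + np.sqrt(a**2 + b**2 * q / R)) / b**2

# Extrapolation points: evaluate the Bellman residual at sampled states
# x_k instead of waiting for trajectory excitation (the PE relaxation).
x_ext = np.linspace(-2, 2, 11)

w = 0.0                  # critic weight: V(x) = w * x**2
K = 2.0                  # initial stabilizing gain (a - b*K < 0)
for _ in range(20):
    # Bellman residual under u = -K*x:
    #   delta_k = dV/dx * (a - b*K) * x_k + (q + R*K**2) * x_k**2
    h = 2 * x_ext * (a - b * K) * x_ext   # d(delta_k)/dw
    r = (q + R * K**2) * x_ext**2         # running-cost part of delta_k
    w = -np.dot(r, h) / np.dot(h, h)      # least-squares zero of the residuals
    K = b * w / R                         # actor step: u = -(b/(2R)) dV/dx
```

Because the residual is linear in w, the least-squares fit over the extrapolated points reduces to exact policy evaluation, and the actor/critic loop recovers the Riccati solution; this is the same "excitation from sampled states" device that Assumption A1 formalizes.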