arXiv:2101.03418

Deep Reinforcement Learning with Function Properties in Mean Reversion Strategies

Sophia Gu

correct (medium confidence)
Category: Not specified
Journal tier: Specialist/Solid
Processed: Sep 28, 2025, 12:55 AM

Audit review

The paper explicitly derives (i) the per-timestep mean–variance objective and its maximizer E[δw_t]* = 1/κ, with optimal reward 1/(2κ); (ii) a Bayesian prior P(a) ∝ exp(−c1·perr) and likelihood ∝ exp(c2·(reward − optimal_reward)), yielding a log-posterior proportional to reward − (c1/c2)·perr; and (iii) with c1 = 10 and c2 = 2κ, the augmented reward R_t ≈ δw_t − (κ/2)(δw_t)^2 − (5/κ)·perr(t). The candidate solution matches each step. The only minor deviation is that the candidate informally suggests distributing perr across time in any way that sums to the total perr, whereas the paper defines perr(t) precisely via an averaging construction. Otherwise, the arguments and outcomes coincide.
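For reference, the chain of identities summarized above can be written out in full; this restatement uses only the quantities named in the audit (κ, c1, c2, perr) and introduces nothing beyond them. The per-timestep mean–variance objective is maximized by setting its derivative to zero:

\[
\frac{d}{d\,\mathbb{E}[\delta w_t]}\left(\mathbb{E}[\delta w_t] - \frac{\kappa}{2}\,\mathbb{E}[\delta w_t]^2\right) = 1 - \kappa\,\mathbb{E}[\delta w_t] = 0
\quad\Longrightarrow\quad
\mathbb{E}[\delta w_t]^{*} = \frac{1}{\kappa},
\qquad
R^{*} = \frac{1}{\kappa} - \frac{\kappa}{2}\cdot\frac{1}{\kappa^{2}} = \frac{1}{2\kappa}.
\]

Combining prior and likelihood, the log-posterior is, up to additive constants,

\[
\log P(a \mid \mathrm{reward}) \;\propto\; c_2\,(\mathrm{reward} - R^{*}) - c_1\,\mathrm{perr}
\;\propto\; \mathrm{reward} - \frac{c_1}{c_2}\,\mathrm{perr},
\]

and substituting c1 = 10, c2 = 2κ gives c1/c2 = 5/κ, hence

\[
R_t \;\approx\; \delta w_t - \frac{\kappa}{2}\,(\delta w_t)^2 - \frac{5}{\kappa}\,\mathrm{perr}(t).
\]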

Referee report (LaTeX)

\textbf{Recommendation:} minor revisions

\textbf{Journal Tier:} specialist/solid

\textbf{Justification:}

A clear and practically useful synthesis of a monotonicity prior with DRL for mean-reversion. The core derivations are correct and the empirical evidence suggests meaningful gains. Some calibration choices are heuristic and the definition of the per-timestep penalty could be reiterated where it is used; these are minor clarifications rather than substantive issues.