2104.03562
Efficient time stepping for numerical integration using reinforcement learning
Michael Dellnitz, Eyke Hüllermeier, Marvin Lücke, Sina Ober-Blöbaum, Christian Offen, Sebastian Peitz, Karlson Pfannschmidt
incomplete (medium confidence)
- Category: Not specified
- Journal tier: Strong Field
- Processed: Sep 28, 2025, 12:56 AM
- arXiv Links: Abstract, PDF
Audit review
The paper cleanly specifies the state, action, and transition structure for the Simpson-rule base learner and the meta-learner, and it motivates the reward designs and the γ≈0 training regime, but it does not supply the contraction/optimality proofs or the 'largest-feasible-step' structural result. The candidate solution supplies standard γ-contraction arguments for the Bellman operators of both the base and meta MDPs, and it correctly shows that, with the paper's simple reward r = h if ε ≤ tol and 0 otherwise, the γ = 0 optimum is to pick the largest feasible step. These results align with the paper's setup and reward choices for quadrature and the ODE extension, with one minor mismatch concerning the meta-learner's reward definition noted below. Key paper passages: the base state/action/transition structure and Q-target (eq. (8); Simpson states s_t, s_{t+1}); the reward designs, including the 0/feasible and h/feasible (scaled) alternatives; the γ = 0 training remark Q = r for quadrature; the meta-learner selection mechanism (two slightly different descriptions); and the ODE/RK state s_t = (h, k_1, …, k_s) (eq. (12)).
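For orientation, a minimal sketch of the γ = 0 greedy-optimality argument referred to above, assuming a finite set H of candidate step sizes so the maximum is attained (the symbols H, ε, and tol follow the summary above, not the paper verbatim):
\[
Q^*(s,h) \;=\; r(s,h) \;=\; h\,\mathbf{1}\{\varepsilon(s,h)\le \mathrm{tol}\},
\qquad
\pi^*(s) \;=\; \arg\max_{h\in H} Q^*(s,h) \;=\; \max\{\,h\in H : \varepsilon(s,h)\le \mathrm{tol}\,\}.
\]
With γ = 0 the Q-function reduces to the immediate reward, every feasible step earns reward equal to its size, and every infeasible step earns zero, so the greedy policy selects the largest feasible step.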
Referee report (LaTeX)
\textbf{Recommendation:} minor revisions
\textbf{Journal Tier:} strong field
\textbf{Justification:}
A well-motivated and competently executed application of hierarchical RL to adaptive step-size control in quadrature and ODE integration. The methodology is clear and empirically compelling. The main limitation is a lack of concise theoretical statements (contraction, existence/uniqueness of Q*, greedy optimality) and a small ambiguity in the meta-level reward. Addressing these would materially improve the paper's completeness without substantial rework.
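As an illustration of the kind of concise statement requested, a standard Bellman-operator contraction sketch in generic notation (not taken from the paper):
\[
(\mathcal{T}Q)(s,a) \;=\; r(s,a) \;+\; \gamma\,\mathbb{E}_{s'\sim p(\cdot\mid s,a)}\Big[\max_{a'} Q(s',a')\Big],
\qquad
\|\mathcal{T}Q_1 - \mathcal{T}Q_2\|_\infty \;\le\; \gamma\,\|Q_1 - Q_2\|_\infty .
\]
For bounded rewards and γ < 1, \(\mathcal{T}\) is a contraction on the space of bounded Q-functions, so Q* exists and is unique by the Banach fixed-point theorem, and any policy greedy with respect to Q* is optimal.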