2104.03562
Efficient time stepping for numerical integration using reinforcement learning
Michael Dellnitz, Eyke Hüllermeier, Marvin Lücke, Sina Ober-Blöbaum, Christian Offen, Sebastian Peitz, Karlson Pfannschmidt
incomplete (medium confidence)
- Category: Not specified
- Journal tier: Strong Field
- Processed: Sep 28, 2025, 12:56 AM
- arXiv Links: Abstract, PDF
Audit review
The paper cleanly specifies the state, action, and transition structure for the Simpson-rule base learner and the meta-learner, and it motivates the reward designs and the γ≈0 training regime, but it does not supply the contraction/optimality proofs or the 'largest-feasible-step' structural result. The candidate solution supplies standard γ-contraction arguments for the Bellman operators of both the base and meta MDPs, and it correctly shows that, with the paper's simple reward r = h if ε ≤ tol and 0 otherwise, the γ = 0 optimum is to pick the largest feasible step. These results align with the paper's setup and reward choices for quadrature and the ODE extension, with one minor mismatch concerning the meta-learner's reward definition noted below. Key paper passages: the base state/action/transition structure and Q-target (eq. (8); Simpson states s_t, s_{t+1}); the reward designs, including the 0/feasible and h/feasible (scaled) alternatives; the γ = 0 training remark Q = r for quadrature; the meta-learner selection mechanism (two slightly different descriptions); and the ODE/RK state s_t = (h, k_1, …, k_s) (eq. (12)).
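For orientation, a minimal sketch of the γ = 0 greedy-optimality argument referred to above, assuming a finite set H of candidate step sizes so the maximum is attained (the symbols H, ε, and tol follow the summary above, not the paper verbatim):
\[
Q^*(s,h) \;=\; r(s,h) \;=\; h\,\mathbf{1}\{\varepsilon(s,h)\le \mathrm{tol}\},
\qquad
\pi^*(s) \;=\; \arg\max_{h\in H} Q^*(s,h) \;=\; \max\{\,h\in H : \varepsilon(s,h)\le \mathrm{tol}\,\}.
\]
With γ = 0 the Q-function reduces to the immediate reward, every feasible step earns reward equal to its size, and every infeasible step earns zero, so the greedy policy selects the largest feasible step.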
Referee report (LaTeX)
\textbf{Recommendation:} minor revisions
\textbf{Journal Tier:} strong field
\textbf{Justification:}
A well-motivated and competently executed application of hierarchical RL to adaptive step-size control in quadrature and ODE integration. The methodology is clear and empirically compelling. The main limitation is a lack of concise theoretical statements (contraction, existence/uniqueness of Q*, greedy optimality) and a small ambiguity in the meta-level reward. Addressing these would materially improve the paper's completeness without substantial rework.
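As an illustration of the kind of concise statement requested, a standard Bellman-operator contraction sketch in generic notation (not taken from the paper):
\[
(\mathcal{T}Q)(s,a) \;=\; r(s,a) \;+\; \gamma\,\mathbb{E}_{s'\sim p(\cdot\mid s,a)}\Big[\max_{a'} Q(s',a')\Big],
\qquad
\|\mathcal{T}Q_1 - \mathcal{T}Q_2\|_\infty \;\le\; \gamma\,\|Q_1 - Q_2\|_\infty .
\]
For bounded rewards and γ < 1, \(\mathcal{T}\) is a contraction on the space of bounded Q-functions, so Q* exists and is unique by the Banach fixed-point theorem, and any policy greedy with respect to Q* is optimal.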