2110.04744
Long Expressive Memory for Sequence Modeling
T. Konstantin Rusch, S. Mishra, N. Benjamin Erichson, Michael W. Mahoney
correct, medium confidence
- Category: Not specified
- Journal tier: Specialist/Solid
- Processed: Sep 28, 2025, 12:56 AM
Audit review
The paper’s Proposition 4.4 states that LEM (3) can approximate any finite-horizon Lipschitz discrete-time system (8) under boundedness assumptions, with the proof deferred to SM§D.4. The construction there freezes the gates by setting Δt = 1 and taking large gate biases (so that σ̂(b_∞) ≈ 1), realizes each step as a two-layer tanh MLP, uses block structure so that only the state coordinates feed the next step, and controls error growth via Lipschitz bounds. It culminates in an explicit LEM parameterization that achieves ‖o_n − ω_n‖ ≤ ε for 1 ≤ n ≤ N. The candidate solution follows the same blueprint: freeze the time gates near 1, approximate the composite map with a two-layer tanh network on a compact set, enforce block sparsity so that outputs do not feed back, and bound the accumulated error with a discrete Grönwall/Lipschitz argument. The only technical difference is that the candidate explicitly quantifies the gate-induced error (ε_gate), whereas the paper takes the notational shortcut of setting the gate deviation δ → 0 for simplicity; both routes reach the same conclusion. Both arguments are therefore correct and conceptually the same proof.
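The gate-freezing step can be illustrated numerically. The sketch below is not the paper's construction. It assumes the usual form of the LEM recurrence (two sigmoid time gates scaling Δt, with tanh updates for z and y), which this review does not restate, and it uses random weights. The names `lem_step`, `frozen_step`, `make_params`, and `max_gap` are hypothetical helpers introduced only for this illustration. It compares a gated LEM rollout with Δt = 1 against the two-layer tanh map obtained when both gates are exactly 1.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lem_step(y, z, u, p, dt=1.0):
    """One gated LEM update (assumed form of the recurrence)."""
    gate1 = dt * sigmoid(p["W1"] @ y + p["V1"] @ u + p["b1"])
    gate2 = dt * sigmoid(p["W2"] @ y + p["V2"] @ u + p["b2"])
    z_new = (1.0 - gate1) * z + gate1 * np.tanh(p["Wz"] @ y + p["Vz"] @ u + p["bz"])
    y_new = (1.0 - gate2) * y + gate2 * np.tanh(p["Wy"] @ z_new + p["Vy"] @ u + p["by"])
    return y_new, z_new


def frozen_step(y, u, p):
    """Limit of lem_step when both gates equal 1: a two-layer tanh map."""
    hidden = np.tanh(p["Wz"] @ y + p["Vz"] @ u + p["bz"])
    return np.tanh(p["Wy"] @ hidden + p["Vy"] @ u + p["by"])


def make_params(d, m, gate_bias, seed=0):
    """Random illustrative weights; only the gate biases are set by hand."""
    rng = np.random.default_rng(seed)
    p = {}
    for name in ("W1", "W2", "Wz", "Wy"):
        p[name] = 0.2 * rng.standard_normal((d, d))
    for name in ("V1", "V2", "Vz", "Vy"):
        p[name] = 0.2 * rng.standard_normal((d, m))
    for name in ("bz", "by"):
        p[name] = 0.2 * rng.standard_normal(d)
    # Large gate biases push the sigmoid gates towards 1.
    p["b1"] = np.full(d, gate_bias)
    p["b2"] = np.full(d, gate_bias)
    return p


def max_gap(gate_bias, steps=10, d=4, m=2):
    """Largest entrywise gap between gated and frozen rollouts."""
    p = make_params(d, m, gate_bias)
    rng = np.random.default_rng(1)
    y, z, y_ref = np.zeros(d), np.zeros(d), np.zeros(d)
    gap = 0.0
    for _ in range(steps):
        u = rng.uniform(-1.0, 1.0, size=m)  # bounded inputs
        y, z = lem_step(y, z, u, p)
        y_ref = frozen_step(y_ref, u, p)
        gap = max(gap, float(np.max(np.abs(y - y_ref))))
    return gap


if __name__ == "__main__":
    for bias in (2.0, 10.0, 30.0):
        print(f"gate bias {bias:5.1f}: max gap {max_gap(bias):.3e}")
```

In this toy setting the gap shrinks as the gate bias grows. This is the quantity the candidate solution tracks explicitly as ε_gate and that the paper sets to zero for simplicity.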
Referee report (LaTeX)
\textbf{Recommendation:} minor revisions
\textbf{Journal Tier:} specialist/solid
\textbf{Justification:}
The proposition on universal approximation by LEM is correct: it follows from standard approximation theory together with a careful parameterization that freezes the gates and enforces block structure. The supplementary material provides the needed constructive details. Minor issues include the notational overlap for W\_y and the informal setting of the gate deviation to zero for simplicity; making both fully explicit would strengthen rigor and clarity.