Long Expressive Memory for Sequence Modeling

T. Konstantin Rusch, S. Mishra, N. Benjamin Erichson, Michael W. Mahoney

Correct (medium confidence)

Category: Not specified
Journal tier: Specialist/Solid
Processed: Sep 28, 2025, 12:56 AM
Audit review

The paper’s Proposition 4.4 states that LEM (3) can approximate any finite-horizon Lipschitz discrete-time system (8) under boundedness assumptions, with the proof deferred to SM §D.4. The construction there freezes the gates by setting Δt = 1 and taking large gate biases (so that σ̂(b_∞) ≈ 1), realizes each step as a two-layer tanh MLP, uses block structure so that only the state coordinates feed the next step, and controls error growth via Lipschitz bounds. It culminates in an explicit LEM parameterization that achieves ‖o_n − ω_n‖ ≤ ε for 1 ≤ n ≤ N. The candidate solution follows the same blueprint: freeze the time gates near 1, approximate the composite map with a two-layer tanh network on a compact set, enforce block sparsity so that outputs do not feed back into the recurrence, and bound the accumulated error with a discrete Grönwall/Lipschitz argument. The only technical difference is that the candidate explicitly quantifies the gate-induced error (ε_gate), whereas the paper takes the notational shortcut of letting the gate deviation δ → 0 for simplicity. Both routes yield the same conclusion. Hence both arguments are correct and are conceptually the same proof.
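The two quantitative ingredients of this argument, gate freezing and Lipschitz error accumulation, can be checked numerically. The sketch below is illustrative only. It assumes a generic gated update of the form z = (1 − g) ⊙ z_prev + g ⊙ tanh(·) with g = Δt · sigmoid(· + bias), which is modeled on the description above and is not taken verbatim from the paper's equation (3). The function names `gated_step`, `gate_error`, and `accumulated_bound` are hypothetical, and the inputs are random placeholders rather than trained LEM parameters.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_step(z_prev, pre_gate, pre_act, gate_bias, dt=1.0):
    # Assumed update form (not the paper's exact equation):
    #   g = dt * sigmoid(pre_gate + gate_bias)
    #   z = (1 - g) * z_prev + g * tanh(pre_act)
    g = dt * sigmoid(pre_gate + gate_bias)
    return (1.0 - g) * z_prev + g * np.tanh(pre_act)


def gate_error(gate_bias, seed=0, dim=8):
    # Max deviation between the gated step and the "frozen gate" step
    # tanh(pre_act), on bounded random inputs (placeholder data).
    rng = np.random.default_rng(seed)
    z_prev = rng.uniform(-1.0, 1.0, dim)
    pre_gate = rng.uniform(-1.0, 1.0, dim)
    pre_act = rng.uniform(-2.0, 2.0, dim)
    frozen = np.tanh(pre_act)
    gated = gated_step(z_prev, pre_gate, pre_act, gate_bias)
    return float(np.max(np.abs(gated - frozen)))


def accumulated_bound(lip, per_step_err, n_steps):
    # Unrolls the recursion e_n <= lip * e_{n-1} + per_step_err with e_0 = 0,
    # i.e. the discrete Gronwall-type bound on the accumulated error.
    e = 0.0
    for _ in range(n_steps):
        e = lip * e + per_step_err
    return e


if __name__ == "__main__":
    for b in (2.0, 5.0, 10.0, 20.0):
        print(f"gate bias {b:5.1f}: max gate-induced error {gate_error(b):.3e}")
    print("bound after 10 steps:", accumulated_bound(1.5, 1e-6, 10))
```

Under these assumptions the gate-induced error is at most 2·exp(1 − bias), since the inputs to the gate lie in [−1, 1] and both z_prev and tanh(·) lie in [−1, 1]. It therefore shrinks exponentially as the bias grows, which is the role ε_gate plays in the candidate's argument. For a Lipschitz constant L ≠ 1, the unrolled recursion equals per_step_err · (Lⁿ − 1)/(L − 1), so a finite horizon N keeps the total error controllable by choosing the per-step error small enough.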

Referee report (LaTeX)

\textbf{Recommendation:} minor revisions

\textbf{Journal Tier:} specialist/solid

\textbf{Justification:}

The proposition on universal approximation by LEM is correct: it follows from standard approximation theory together with a careful parameterization that freezes the gates and enforces block structure. The supplementary material provides the needed constructive details. Minor issues include the notational overlap for $W_y$ and the informal step of setting the gate deviation to zero for simplicity; making both fully explicit would strengthen rigor and clarity.