15 Mar 2025 · Python · PyTorch · SymPy

Seq2Seq Models for Symbolic Expression Recovery from Taylor Series

Neural Symbolic Regression from Truncated Taylor Series

This project investigates the problem of recovering compact, closed-form symbolic expressions from truncated Taylor series using neural sequence models. By training a two-layer LSTM encoder–decoder with a domain-specific loss, we show that it is possible to map local approximations back to their generating functions with high accuracy.

Note: The code is temporarily private while it supports a paper submission to the NeurIPS MATH-AI Workshop 2025. Access may be granted upon request once the review process is complete.


Motivation

Taylor expansions are everywhere: in numerical analysis, optimization, signal processing, partial differential equations, and mathematical physics. In many of these settings, truncated series are the only available representation of a solution or a signal.

Being able to reconstruct the original closed-form expression from such local approximations can:

  • Improve numerical accuracy beyond truncated approximations.
  • Aid in optimization and stability analysis.
  • Provide more robust signal reconstructions.
  • Enable deeper analytic insights into solutions of differential equations.

This task also highlights central challenges for neural-symbolic methods:

  • Deciding operator placement from local information.
  • Inferring nesting of transcendental functions.
  • Handling non-identifiability, since distinct expressions can share the same truncated Taylor expansion.
  • Managing instabilities caused by division near the expansion point.

Dataset Construction

  • Grammar: Functions are built from a minimal base set {x, x², x³, sin(x), cos(x), exp(x)} combined with operators {+, −, ×, /}. Expressions include 1–3 base terms.
  • Simplification: Each expression is simplified with SymPy. Degenerate forms (identically zero, invalid denominators) are resampled.
  • Expansions: For each function, a Taylor expansion around 0 is computed, truncated at orders 4–6. The remainder term is dropped.
  • Tokenization: Operators, parentheses, and function names are tokenized. A shared vocabulary of 161 tokens includes <SOS>, <EOS>, <PAD>, and <UNK>.
  • Dataset size: ~137k pairs of (expansion → function). Expansions are 5–45 tokens long; functions are 5–19 tokens long.
  • Examples:
    • x⁴/24 − x²/2 + 1 → cos(x)
    • x³/2 + x → x / cos(x)
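
As a rough illustration, the construction above can be sketched with SymPy. This is a minimal sketch under assumptions: the helper names (sample_expression, make_pair) are hypothetical, operators are applied left-to-right over distinct base terms, and the project's actual resampling and tokenization logic may differ.

```python
import random

import sympy as sp

x = sp.symbols("x")

# Base terms and binary operators from the grammar described above.
BASE_TERMS = [x, x**2, x**3, sp.sin(x), sp.cos(x), sp.exp(x)]
OPERATORS = [
    lambda a, b: a + b,
    lambda a, b: a - b,
    lambda a, b: a * b,
    lambda a, b: a / b,
]

def sample_expression(max_terms=3):
    """Combine 1-3 distinct base terms with randomly chosen operators, then simplify."""
    terms = random.sample(BASE_TERMS, k=random.randint(1, max_terms))
    expr = terms[0]
    for term in terms[1:]:
        expr = random.choice(OPERATORS)(expr, term)
    return sp.simplify(expr)

def make_pair(order=None):
    """Return a (truncated expansion, target expression) pair, or None for degenerate draws."""
    expr = sample_expression()
    if expr == 0:
        return None  # identically zero: resample upstream
    order = order if order is not None else random.randint(4, 6)
    try:
        # Taylor expansion around 0; removeO() drops the remainder term.
        expansion = sp.series(expr, x, 0, order).removeO()
    except (sp.PoleError, ZeroDivisionError):
        return None  # invalid denominator near x = 0: resample upstream
    return expansion, expr
```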

Method

Model Architecture

  • Encoder–decoder: Two-layer LSTM with hidden size 512, embedding dimension 150.
  • Dropout: Applied between layers for regularization.
  • Decoder: Autoregressively generates tokens until <EOS> (a minimal PyTorch sketch follows this list).
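
The sketch below shows one way to wire up this encoder–decoder with the hyperparameters listed above (embedding dimension 150, hidden size 512, two LSTM layers with dropout in between). The class names, and the assumption that <PAD> is token index 0, are mine rather than the project's.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=150, hidden=512, layers=2, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)  # assumes <PAD> = 0
        # Two-layer LSTM; dropout is applied between the stacked layers.
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers,
                            dropout=dropout, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len) token ids
        _, (h, c) = self.lstm(self.embed(src))
        return h, c                              # final state summarizes the expansion

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=150, hidden=512, layers=2, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers,
                            dropout=dropout, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, tgt, state):               # tgt: (batch, tgt_len) token ids
        out, state = self.lstm(self.embed(tgt), state)
        return self.proj(out), state             # logits over the 161-token vocabulary

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.encoder = Encoder(vocab_size)
        self.decoder = Decoder(vocab_size)

    def forward(self, src, tgt_in):
        # Teacher forcing: the decoder is fed the shifted gold target sequence.
        state = self.encoder(src)
        logits, _ = self.decoder(tgt_in, state)
        return logits
```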

Loss Function

A composite objective balances symbolic correctness with functional accuracy:

L_\text{total} = \alpha \, L_\text{CE} + \beta \, L_\text{WMSE}

  • Cross-entropy (CE): Ensures syntactic correctness and token-level accuracy.
  • Weighted Mean Squared Error (WMSE): Compares predicted and target functions as continuous objects over [−1, 1], with higher weights near x = 0. This encourages numerical fidelity at the expansion center.
  • Dynamic weighting: CE dominates early training to enforce syntax; the WMSE weight increases over epochs to refine functional accuracy (see the sketch after this list).
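
The sketch below illustrates one way to compute this composite objective, assuming the predicted and target functions have already been evaluated on a grid over [−1, 1]. The Gaussian weighting and the linear schedule are assumptions for illustration, not necessarily the project's exact choices.

```python
import torch
import torch.nn.functional as F

def weighted_mse(pred_vals, target_vals, xs, sigma=0.5):
    """WMSE between two functions sampled on a grid over [-1, 1].
    Weights peak at x = 0, emphasizing fidelity at the expansion center
    (Gaussian weighting is an assumption, not necessarily the project's choice)."""
    weights = torch.exp(-(xs ** 2) / (2 * sigma ** 2))
    return torch.mean(weights * (pred_vals - target_vals) ** 2)

def composite_loss(logits, tgt_out, pred_vals, target_vals, xs,
                   epoch, n_epochs, pad_idx=0):
    """L_total = alpha * L_CE + beta * L_WMSE with a simple dynamic schedule."""
    # Cross-entropy over tokens: enforces syntactic, token-level correctness.
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         tgt_out.reshape(-1), ignore_index=pad_idx)
    beta = epoch / max(n_epochs - 1, 1)   # WMSE weight grows over epochs
    alpha = 1.0                           # CE always on, so it dominates early
    return alpha * ce + beta * weighted_mse(pred_vals, target_vals, xs)
```

Here xs could be torch.linspace(-1, 1, 101), with pred_vals and target_vals obtained by evaluating the decoded and ground-truth expressions on that grid (e.g. via SymPy's lambdify); that evaluation step is omitted from the sketch.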

Training Setup

  • Framework: PyTorch.
  • Optimizer: AdamW (lr = 2e−3, weight decay = 1e−5).
  • Gradient clipping (norm ≤ 10).
  • Batch size: 64.
  • Epochs: 5.
  • Teacher forcing applied during training (a training-loop sketch follows this list).
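
A compact training-loop sketch reflecting these settings, assuming the loader yields batches of padded (expansion, function) token-id tensors. For brevity only the cross-entropy term is shown; the WMSE term from the composite loss above would be added in the same loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train(model, loader, n_epochs=5, pad_idx=0, device="cpu"):
    """Teacher-forced training with AdamW, gradient clipping, and the settings above."""
    model.to(device)
    optim = torch.optim.AdamW(model.parameters(), lr=2e-3, weight_decay=1e-5)
    for epoch in range(n_epochs):
        model.train()
        for src, tgt in loader:                       # tgt includes <SOS> ... <EOS>
            src, tgt = src.to(device), tgt.to(device)
            tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]  # teacher forcing: shifted targets
            logits = model(src, tgt_in)
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   tgt_out.reshape(-1), ignore_index=pad_idx)
            optim.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
            optim.step()
```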

Decoding

  • Greedy decoding: Efficient and sufficient for short sequences (sketched after this list).
  • Future extension: Beam search could capture more globally optimal sequences.
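
A greedy decoding sketch compatible with the Seq2Seq sketch above; the <SOS>/<EOS> token indices are assumptions.

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, sos_idx=1, eos_idx=2, max_len=20):
    """Greedily pick the highest-probability token at each step until <EOS>."""
    model.eval()
    state = model.encoder(src)                      # src: (1, src_len)
    token = torch.tensor([[sos_idx]], device=src.device)
    output = []
    for _ in range(max_len):
        logits, state = model.decoder(token, state)
        token = logits[:, -1].argmax(dim=-1, keepdim=True)
        if token.item() == eos_idx:
            break
        output.append(token.item())
    return output                                   # token ids of the predicted expression
```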

Results

On a held-out test set of 500 pairs:

  • Exact Match (EM): 93%
  • Token Accuracy (TA): ~96% (slightly reduced when WMSE is included)
  • WMSE: Inclusion of the WMSE term reduces error by ~17% (from 0.913×10⁻³ to 0.758×10⁻³).

These results show that the model not only learns to output the correct symbolic form but also approximates the true function more faithfully near the expansion point.

Examples include:

  • Correct prediction: Perfect syntactic and functional recovery.
  • Incorrect prediction: Minor token mismatches, yet the predicted function remains numerically close to the target due to WMSE emphasis.

Conclusion and Future Work

This project demonstrates that neural methods can recover closed-form symbolic expressions from truncated Taylor expansions, a task with practical applications in physics, engineering, and scientific computing.

Key findings:

  • LSTMs with a domain-specific loss achieve high accuracy despite the apparent information loss of truncation.
  • WMSE regularization improves local approximation quality with minimal trade-off in symbolic accuracy.

Limitations include:

  • Restricted grammar (limited set of base functions).
  • Fixed expansion point (always around 0).
  • Fully synthetic, noiseless data.

Future directions:

  • Allow longer expansions and shifts in expansion center.
  • Explore grammar-aware decoding.
  • Integrate attention or transformer-based models for longer sequences.
  • Adopt beam search for more robust inference.