Neural Symbolic Regression from Truncated Taylor Series
This project investigates the problem of recovering compact, closed-form symbolic expressions from truncated Taylor series using neural sequence models. By training a two-layer LSTM encoder–decoder with a domain-specific loss, we show that it is possible to map local approximations back to their generating functions with high accuracy.
Note: The code is temporarily private as it supports NeurIPS MATH-AI Workshop 2025 paper submission. Access may be granted upon request once the review process is complete.
Motivation
Taylor expansions are everywhere: in numerical analysis, optimization, signal processing, partial differential equations, and mathematical physics. In many of these settings, truncated series are the only available representation of a solution or a signal.
Being able to reconstruct the original closed-form expression from such local approximations can:
- Improve numerical accuracy beyond truncated approximations.
- Aid in optimization and stability analysis.
- Provide more robust signal reconstructions.
- Enable deeper analytic insights into solutions of differential equations.
This task also highlights central challenges for neural-symbolic methods:
- Deciding operator placement from local information.
- Inferring nesting of transcendental functions.
- Handling non-identifiability, since distinct expressions can share the same Taylor expansion.
- Managing instabilities caused by division near the expansion point.
Dataset Construction
- Grammar: Functions are built from a minimal base set {x, x², x³, sin(x), cos(x), exp(x)} combined with the operators {+, −, ×, /}. Expressions include 1–3 base terms.
- Simplification: Each expression is simplified with SymPy. Degenerate forms (identically zero, invalid denominators) are resampled.
- Expansions: For each function, a Taylor expansion around 0 is computed, truncated at orders 4–6. The remainder term is dropped.
- Tokenization: Operators, parentheses, and function names are tokenized. A shared vocabulary of 161 tokens includes <SOS>, <EOS>, <PAD>, and <UNK>.
- Dataset size: ~137k pairs of (expansion → function). Expansions are 5–45 tokens long; functions are 5–19 tokens long.
- Examples (generated as in the sketch after this list):
  - x⁴/24 − x²/2 + 1 → cos(x)
  - x³/2 + x → x / cos(x)
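
As a concrete illustration of this pipeline, the sketch below generates (expansion → function) pairs with SymPy. The function names, sampling scheme, and degeneracy checks are assumptions for illustration, not the project's actual code.

```python
# Illustrative sketch only -- names and sampling details are assumptions, not the project's code.
import operator
import random

import sympy as sp

x = sp.symbols("x")
BASE_TERMS = [x, x**2, x**3, sp.sin(x), sp.cos(x), sp.exp(x)]
OPERATORS = [operator.add, operator.sub, operator.mul, operator.truediv]

def sample_expression(max_terms: int = 3) -> sp.Expr:
    """Combine 1-3 distinct base terms with randomly chosen binary operators."""
    terms = random.sample(BASE_TERMS, k=random.randint(1, max_terms))
    expr = terms[0]
    for term in terms[1:]:
        expr = random.choice(OPERATORS)(expr, term)
    return sp.simplify(expr)

def make_pair(order: int = 6):
    """Return (truncated Taylor expansion around 0, generating function), resampling degenerate forms."""
    while True:
        f = sample_expression()
        if f == 0:
            continue
        try:
            expansion = sp.series(f, x, 0, order).removeO()  # drop the O(x**n) remainder
        except (sp.PoleError, ZeroDivisionError):
            continue
        # Resample if the expansion vanishes or contains negative powers (e.g. a pole at 0).
        if expansion == 0 or not expansion.is_polynomial(x):
            continue
        return sp.expand(expansion), f

if __name__ == "__main__":
    taylor, func = make_pair(order=5)
    print(f"{taylor}  ->  {func}")
```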
Method
Model Architecture
- Encoder–decoder: Two-layer LSTM with hidden size 512, embedding dimension 150.
- Dropout: Applied between layers for regularization.
- Decoder: Autoregressively generates tokens until <EOS>.
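
A minimal PyTorch sketch of an encoder–decoder with these dimensions follows; the class name, dropout value, and padding index are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Seq2SeqLSTM(nn.Module):
    """Two-layer LSTM encoder-decoder (hidden 512, embedding 150); a sketch, not the project's code."""

    def __init__(self, vocab_size: int = 161, emb_dim: int = 150,
                 hidden: int = 512, layers: int = 2, dropout: float = 0.2):  # dropout value assumed
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)  # padding index assumed
        self.encoder = nn.LSTM(emb_dim, hidden, num_layers=layers,
                               dropout=dropout, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, num_layers=layers,
                               dropout=dropout, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        """Teacher-forced pass: encode the expansion, decode the function tokens."""
        _, state = self.encoder(self.embed(src))            # (h, c) from the encoder
        dec_out, _ = self.decoder(self.embed(tgt), state)   # decoder conditioned on encoder state
        return self.out(dec_out)                            # (batch, tgt_len, vocab) logits
```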
Loss Function
A composite objective balances symbolic correctness with functional accuracy:
- Cross-entropy (CE): Ensures syntactic correctness and token-level accuracy.
- Weighted Mean Squared Error (WMSE): Compares predicted and target functions as continuous objects over [−1, 1], with higher weights near x = 0. This encourages numerical fidelity at the expansion center.
- Dynamic weighting: CE dominates early training to enforce syntax; WMSE weight increases over epochs to refine functional accuracy.
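
The sketch below shows one way such a composite objective could be assembled: token-level cross-entropy plus a weighted MSE over a grid on [−1, 1] with Gaussian weights peaked at x = 0, mixed by an epoch-dependent coefficient. The weighting scheme, the linear schedule, and the assumption that predicted function values are available on the grid (e.g. from evaluating the decoded expression) are illustrative, not the project's exact formulation.

```python
import torch
import torch.nn.functional as F

def wmse(pred_vals: torch.Tensor, target_vals: torch.Tensor,
         xs: torch.Tensor, sigma: float = 0.25) -> torch.Tensor:
    """Weighted MSE on a grid over [-1, 1]; Gaussian weights peaked at x = 0 (assumed scheme)."""
    weights = torch.exp(-(xs ** 2) / (2 * sigma ** 2))
    weights = weights / weights.sum()
    return (weights * (pred_vals - target_vals) ** 2).sum()

def composite_loss(logits, target_tokens, pred_vals, target_vals, xs,
                   epoch: int, total_epochs: int, pad_idx: int = 0) -> torch.Tensor:
    """CE dominates early; the WMSE weight ramps up linearly over epochs (assumed schedule)."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         target_tokens.reshape(-1), ignore_index=pad_idx)
    lam = epoch / max(total_epochs - 1, 1)   # 0 -> 1 across training
    return ce + lam * wmse(pred_vals, target_vals, xs)
```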
Training Setup
- Framework: PyTorch.
- Optimizer: AdamW (lr = 2e−3, weight decay = 1e−5).
- Gradient clipping (norm ≤ 10).
- Batch size: 64.
- Epochs: 5.
- Teacher forcing applied.
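
A training-loop sketch consistent with this setup (AdamW at lr 2e−3 and weight decay 1e−5, gradient-norm clipping at 10, teacher forcing via shifted targets) is shown below; for brevity it uses only the token-level loss, and `model` and `loader` are assumed to exist.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs: int = 5, pad_idx: int = 0, device: str = "cpu"):
    """Teacher-forced training with AdamW and gradient clipping (token loss only, for brevity)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3, weight_decay=1e-5)
    model.to(device).train()
    for epoch in range(epochs):
        for src, tgt in loader:                       # src: expansion tokens, tgt: function tokens
            src, tgt = src.to(device), tgt.to(device)
            logits = model(src, tgt[:, :-1])          # teacher forcing: feed the gold prefix
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   tgt[:, 1:].reshape(-1), ignore_index=pad_idx)
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
            optimizer.step()
```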
Decoding
- Greedy decoding: Efficient and sufficient for short sequences.
- Future extension: Beam search could capture more globally optimal sequences.
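
A greedy-decoding sketch, assuming the encoder–decoder interface from the architecture sketch above:

```python
import torch

@torch.no_grad()
def greedy_decode(model, src: torch.Tensor, sos_idx: int, eos_idx: int,
                  max_len: int = 25) -> list[int]:
    """Pick the argmax token at each step until <EOS> or max_len (assumed model interface)."""
    model.eval()
    tokens = [sos_idx]
    for _ in range(max_len):
        tgt = torch.tensor([tokens], device=src.device)
        logits = model(src, tgt)                 # (1, len(tokens), vocab)
        next_tok = int(logits[0, -1].argmax())
        if next_tok == eos_idx:
            break
        tokens.append(next_tok)
    return tokens[1:]                            # drop <SOS>
```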
Results
On a held-out test set of 500 pairs:
- Exact Match (EM): 93%
- Token Accuracy (TA): ~96% (slightly reduced when WMSE is included)
- WMSE: Inclusion of the WMSE term reduces error by ~17% (from 0.913×10⁻³ to 0.758×10⁻³).
These results show that the model not only learns to output the correct symbolic form but also approximates the true function more faithfully near the expansion point.
Examples include:
- Correct prediction: Perfect syntactic and functional recovery.
- Incorrect prediction: Minor token mismatches, yet the predicted function remains numerically close to the target due to WMSE emphasis.
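
For reference, the exact-match and token-accuracy metrics reported above can be computed directly from decoded token sequences; a minimal sketch with assumed helper names:

```python
def exact_match(pred_tokens: list[list[int]], gold_tokens: list[list[int]]) -> float:
    """Fraction of predictions that match the target sequence exactly."""
    hits = sum(p == g for p, g in zip(pred_tokens, gold_tokens))
    return hits / len(gold_tokens)

def token_accuracy(pred_tokens: list[list[int]], gold_tokens: list[list[int]]) -> float:
    """Fraction of target positions whose predicted token is correct (extra/missing tokens count as errors)."""
    correct = total = 0
    for p, g in zip(pred_tokens, gold_tokens):
        total += len(g)
        correct += sum(pt == gt for pt, gt in zip(p, g))
    return correct / total
```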
Conclusion and Future Work
This project demonstrates that neural methods can recover closed-form symbolic expressions from truncated Taylor expansions, a task with practical applications in physics, engineering, and scientific computing.
Key findings:
- LSTMs with a domain-specific loss achieve high accuracy despite the apparent information loss of truncation.
- WMSE regularization improves local approximation quality with minimal trade-off in symbolic accuracy.
Limitations include:
- Restricted grammar (limited set of base functions).
- Fixed expansion point (always around 0).
- Fully synthetic, noiseless data.
Future directions:
- Allow longer expansions and shifts in expansion center.
- Explore grammar-aware decoding.
- Integrate attention or transformer-based models for longer sequences.
- Adopt beam search for more robust inference.