Two ideas that have circulated in quantitative research for at least a decade are particularly tempting in combination. The first is that “sentiment” — variously defined as attention, tone, or positioning extracted from news, social media, regulatory filings, or order flow — contains information about future returns that classical fundamental and technical signals miss. The second is that deep sequence models — recurrent networks, temporal convolutions, and now transformer architectures — can extract structure from financial time series that linear models leave on the table. The combination is the obvious next step: feed sentiment factors into a deep model and let it learn the non-linearities. We have spent considerable effort on this combination. The honest result is more nuanced than either the AI-in-finance literature or the alternative-data vendors suggest.
The state of evidence on each piece, separately
Sentiment as a return-predictive factor has a real but modest track record in the academic literature. The early work of Tetlock (2007), the attention-as-search-volume measure of Da, Engelberg and Gao (2011), and the financial-text dictionary work of Loughran and McDonald (2011) collectively established that text-derived sentiment carries some forward signal, particularly at the cross-sectional level and at horizons measured in days rather than minutes. The effects survive transaction-cost adjustment in some specifications and do not survive it in others; the literature is genuinely mixed and the debate is ongoing.
Deep learning applied to financial time series has, separately, produced a much larger body of mostly disappointing forward results. Gu, Kelly and Xiu (2020) — probably the most comprehensive empirical asset-pricing-with-machine-learning study to date — find that gradient-boosted trees and shallow neural networks meaningfully outperform linear baselines on cross-sectional return prediction, but that adding depth to the network rarely helps and frequently hurts. The honest reading of the broader literature is that for monthly or weekly cross-sectional return prediction on the standard equity universe, the marginal benefit of moving from a well-regularized linear model to a deep sequence model is small and often negative once realistic sample-splitting discipline is applied.
Why combining them is harder than it looks
The signal-to-noise problem compounds. At any horizon shorter than a year, the R-squared of financial returns against even the best feasible predictors is in the single-digit percentages at best and indistinguishable from zero on most sub-universes. Sentiment measures derived from text or alternative data are themselves noisy estimates of an underlying latent quantity; dictionary-based and embedding-based approaches disagree meaningfully even on the same source text. Stacking a high-capacity model on top of a low-SNR signal does not magnify the signal. It magnifies the noise, because the extra capacity is spent fitting it. The resulting predictions look spectacular in sample and degrade quickly in any honest forward test.
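The in-sample versus forward gap can be reproduced on purely synthetic data. The sketch below is an illustration under stated assumptions, not our research pipeline: a wide unregularized linear regression stands in for a high-capacity model, the true R-squared is set to 2%, and every name and parameter (simulate, n_noise, true_r2) is hypothetical.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination; can be negative out of sample."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def simulate(n_train=500, n_test=5000, n_noise=200, true_r2=0.02, seed=0):
    """Compare a one-input model with a wide model on a low-SNR target.

    All parameters are illustrative assumptions. The wide model is a
    stand-in for "high capacity"; it is not a neural network.
    """
    rng = np.random.default_rng(seed)
    n = n_train + n_test
    signal = rng.standard_normal(n)                  # the one informative input
    noise_feats = rng.standard_normal((n, n_noise))  # inputs unrelated to y
    # With unit-variance residuals this slope gives a population R^2 of true_r2.
    beta = np.sqrt(true_r2 / (1.0 - true_r2))
    y = beta * signal + rng.standard_normal(n)

    designs = {
        "small": signal[:, None],
        "big": np.column_stack([signal, noise_feats]),
    }
    out = {}
    for name, X in designs.items():
        X_tr, X_te = X[:n_train], X[n_train:]
        y_tr, y_te = y[:n_train], y[n_train:]
        coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
        out[name] = (r_squared(y_tr, X_tr @ coef), r_squared(y_te, X_te @ coef))
    return out

results = simulate()
for name, (r2_in, r2_out) in results.items():
    print(f"{name:>5}: in-sample R2 = {r2_in:.3f}, out-of-sample R2 = {r2_out:.3f}")
```

In this setup the wide model's in-sample fit is far above the true 2% while its out-of-sample R-squared is negative. The one-input model looks unimpressive in sample and stays close to the truth out of sample.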
Effective sample size is much smaller than it appears. A daily panel of 3,000 names over twenty years contains, on paper, about fifteen million observations (3,000 names × roughly 252 trading days × 20 years). The effective number of independent observations, after accounting for the strong cross-sectional correlation in returns and the high temporal autocorrelation in sentiment factors, is several orders of magnitude smaller. A transformer with several million parameters therefore has, in practice, fewer effective examples than parameters. Such a model can fit the training set arbitrarily well and still learn nothing generalizable about the future.
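A back-of-the-envelope version of this arithmetic is sketched below. It assumes 252 trading days per year, an equicorrelation design effect for the cross-section, and an AR(1) adjustment for the time dimension. The correlation values (rho = 0.2, phi = 0.9) are illustrative assumptions, not estimates from our data, and multiplying the two adjustments is a crude order-of-magnitude heuristic rather than a rigorous result.

```python
def nominal_observations(n_names, trading_days_per_year, years):
    """Observation count on paper: names x days x years."""
    return n_names * trading_days_per_year * years

def effective_cross_section(n_names, avg_pairwise_corr):
    """Equicorrelation design effect: N / (1 + (N - 1) * rho)."""
    return n_names / (1.0 + (n_names - 1) * avg_pairwise_corr)

def effective_time_points(n_days, ar1_coeff):
    """AR(1) heuristic for independent time points: T * (1 - phi) / (1 + phi)."""
    return n_days * (1.0 - ar1_coeff) / (1.0 + ar1_coeff)

n_names, days_per_year, years = 3000, 252, 20
rho = 0.2  # assumed average pairwise return correlation (illustrative)
phi = 0.9  # assumed daily autocorrelation of the sentiment factor (illustrative)

nominal = nominal_observations(n_names, days_per_year, years)
n_eff = effective_cross_section(n_names, rho)
t_eff = effective_time_points(days_per_year * years, phi)
effective = n_eff * t_eff

print(f"nominal observations:    {nominal:,}")
print(f"effective cross-section: {n_eff:.1f} names")
print(f"effective time points:   {t_eff:.0f} days")
print(f"effective observations:  {effective:,.0f}")
```

Under these assumptions the nominal count is 15,120,000, while the heuristic effective count is on the order of a thousand, well below the parameter count of even a small transformer.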
The mapping is not stationary. The relationship between a sentiment input and forward returns is itself time-varying. Periods of structurally elevated retail participation, episodic short-squeeze dynamics, and regime shifts in news-flow intensity all change the mapping. A sequence model trained to learn complex non-linear dependencies between sentiment and returns is by construction learning the historical mapping; the more flexibly it learns, the more sensitive it is to that mapping's instability. Linear models are blunt enough to be partially robust to this; deep models are not.
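One simple way to see this kind of instability is to track the sentiment coefficient over a trailing window. The sketch below uses simulated data in which the slope flips sign halfway through the sample. The slope magnitude (0.5) is deliberately exaggerated so the effect is visible, and the function name rolling_slope and the 250-day window are hypothetical choices.

```python
import numpy as np

def rolling_slope(x, y, window):
    """OLS slope of y on x (with intercept) over each trailing window.

    Entries before the first full window are NaN.
    """
    slopes = np.full(len(x), np.nan)
    for end in range(window, len(x) + 1):
        xs = x[end - window:end]
        ys = y[end - window:end]
        xc = xs - xs.mean()
        slopes[end - 1] = np.dot(xc, ys - ys.mean()) / np.dot(xc, xc)
    return slopes

rng = np.random.default_rng(1)
n = 2000
sentiment = rng.standard_normal(n)
# Assumed regime change: the true slope is +0.5, then -0.5 (illustrative).
true_beta = np.where(np.arange(n) < n // 2, 0.5, -0.5)
returns = true_beta * sentiment + rng.standard_normal(n)

slopes = rolling_slope(sentiment, returns, window=250)

# A single full-sample fit averages the two regimes together.
sc = sentiment - sentiment.mean()
full_sample_slope = np.dot(sc, returns - returns.mean()) / np.dot(sc, sc)

print(f"slope at end of first regime:  {slopes[n // 2 - 1]:.2f}")
print(f"slope at end of second regime: {slopes[-1]:.2f}")
print(f"full-sample slope:             {full_sample_slope:.2f}")
```

The full-sample estimate sits near zero even though the relationship is strong within each regime. Any model fitted to the first half would carry the wrong sign into the second.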
What we have found to work
We use deep sequence models in this domain, but in a narrower role than the “DL replaces classical models” framing suggests. Three uses have repeatedly survived honest forward evaluation:
Deep models as feature extractors, linear models on top. A pre-trained language model produces dense embeddings of news or filings text. Those embeddings, projected into a modest number of dimensions and fed into a shrinkage-style linear cross-sectional regression, capture more of the sentiment effect than any dictionary-based approach we have tested. The non-linearity sits in the encoder, where it has been trained on enormous text corpora unrelated to returns; the predictive head stays simple and is the part we re-estimate.
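The shape of this pipeline can be sketched in a few lines. The example below assumes the encoder output is already available as a matrix of per-stock embeddings, replaced here by random numbers. It uses PCA as the projection and ridge regression as the shrinkage estimator; both are our illustrative choices for the sketch, and all names and dimensions are hypothetical.

```python
import numpy as np

def pca_project(embeddings, n_components):
    """Project centered embeddings onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def ridge_fit(X, y, penalty):
    """Closed-form ridge coefficients: (X'X + penalty * I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(2)
n_stocks, embed_dim, n_factors = 400, 64, 8

# Placeholder for the output of a pre-trained text encoder (assumption).
embeddings = rng.standard_normal((n_stocks, embed_dim))
# Synthetic forward returns with a weak dependence on one embedding axis.
forward_returns = 0.1 * embeddings[:, 0] + rng.standard_normal(n_stocks)

factors = pca_project(embeddings, n_factors)
target = forward_returns - forward_returns.mean()  # cross-sectional demeaning

coef_light = ridge_fit(factors, target, penalty=1.0)
coef_heavy = ridge_fit(factors, target, penalty=1000.0)

print("coefficient norm, light shrinkage:", np.linalg.norm(coef_light))
print("coefficient norm, heavy shrinkage:", np.linalg.norm(coef_heavy))
```

Only ridge_fit touches returns, so it is the only step that needs re-estimation as new data arrives. In a real backtest the projection would also have to be fitted on training dates only to avoid look-ahead.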
Sequence models for representation, not prediction. Temporal convolutional networks and small transformers used to learn unsupervised representations of price and volume sequences — autoencoder-style — produce features that improve downstream classical models. The supervised forward return prediction does not happen inside the deep model. The deep model learns structure; we predict returns from that structure with a transparent and inspectable head.
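The division of labour can be shown with a toy example. The sketch below substitutes a tiny dense autoencoder, trained by plain gradient descent in NumPy, for the temporal convolutional networks and transformers described above; it is not one of those architectures. The data is a simulated AR(1) series, and all names and hyperparameters are hypothetical.

```python
import numpy as np

def make_windows(series, length):
    """Stack overlapping windows and standardize each one."""
    windows = np.lib.stride_tricks.sliding_window_view(series, length)
    windows = windows - windows.mean(axis=1, keepdims=True)
    return windows / windows.std(axis=1, keepdims=True)

def train_autoencoder(X, n_hidden, n_steps=400, lr=0.05, seed=0):
    """Fit a tanh encoder and linear decoder to reconstruct X; no returns used."""
    rng = np.random.default_rng(seed)
    n_inputs = X.shape[1]
    W1 = 0.1 * rng.standard_normal((n_inputs, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = 0.1 * rng.standard_normal((n_hidden, n_inputs))
    b2 = np.zeros(n_inputs)
    losses = []
    for _ in range(n_steps):
        H = np.tanh(X @ W1 + b1)                    # encoder
        err = H @ W2 + b2 - X                       # reconstruction error
        losses.append(np.mean(err ** 2))
        d_out = 2.0 * err / err.size                # gradient of the mean squared error
        d_hidden = (d_out @ W2.T) * (1.0 - H ** 2)  # backpropagate through tanh
        W2 -= lr * (H.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hidden)
        b1 -= lr * d_hidden.sum(axis=0)

    def encode(Z):
        return np.tanh(Z @ W1 + b1)

    return encode, losses

# Simulated AR(1) series as a stand-in for a price or volume feature (assumption).
rng = np.random.default_rng(3)
shocks = rng.standard_normal(2000)
series = np.zeros(2000)
for t in range(1, 2000):
    series[t] = 0.8 * series[t - 1] + shocks[t]

window_length = 20
windows = make_windows(series, window_length)
encode, losses = train_autoencoder(windows, n_hidden=4)

# Transparent linear head: predict the next observation from the learned features.
features = encode(windows[:-1])
next_value = series[window_length:]
design = np.column_stack([np.ones(len(features)), features])
head, *_ = np.linalg.lstsq(design, next_value, rcond=None)

print(f"reconstruction loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("linear head coefficients:", np.round(head, 3))
```

The autoencoder never sees the prediction target. The only supervised component is the five-coefficient linear head, which can be inspected and re-estimated directly.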
Aggressive ensembling and uncertainty estimation. Where we do use deep predictive models directly, we use them as one input among many in an ensemble whose weights are themselves estimated with strict out-of-sample discipline. The deep model's predictions enter the ensemble with uncertainty estimates derived from a held-out validation set, and a model whose forward calibration drifts is automatically down-weighted.
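One simple scheme with this down-weighting property is inverse-MSE weighting on held-out errors. It is used here as an illustrative stand-in, not as a description of our actual weighting procedure. The error series are simulated, and the function names are hypothetical.

```python
import numpy as np

def inverse_mse_weights(validation_errors, floor=1e-12):
    """Weights proportional to 1 / held-out mean squared error.

    validation_errors has shape (n_periods, n_models) and must come from
    data that none of the models were fitted on.
    """
    mse = np.mean(np.square(validation_errors), axis=0)
    precision = 1.0 / np.maximum(mse, floor)
    return precision / precision.sum()

def combine(forecasts, weights):
    """Weighted ensemble forecast; forecasts has shape (n_assets, n_models)."""
    return forecasts @ weights

rng = np.random.default_rng(4)
n_periods = 250

# Early window: three models with comparable held-out error (simulated).
errors_early = rng.standard_normal((n_periods, 3))
# Later window: the third model's error scale triples (simulated drift).
errors_late = rng.standard_normal((n_periods, 3)) * np.array([1.0, 1.0, 3.0])

w_early = inverse_mse_weights(errors_early)
w_late = inverse_mse_weights(errors_late)

forecasts = rng.standard_normal((5, 3))  # placeholder model forecasts
ensemble_forecast = combine(forecasts, w_late)

print("weights before drift:", np.round(w_early, 3))
print("weights after drift: ", np.round(w_late, 3))
```

Recomputing the weights on a trailing held-out window makes the down-weighting automatic: a model whose forward errors grow loses influence without any manual intervention.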
The narrower claim we are willing to make
Deep learning on sentiment factors is a useful component of a modern systematic process when the deep model is asked to do what it is good at — learning representations from large amounts of weakly-supervised text or unsupervised time-series data — and a simpler, more inspectable model is asked to do the part the deep model is bad at, namely producing well-calibrated forward return predictions on a small effective sample with non-stationary dynamics. The framing that places a deep sequence model at the end of the pipeline, taking sentiment in and producing return forecasts out, is the framing that produces the bulk of the disappointing forward results in the published literature, and we have not been able to replicate the impressive in-sample results of that framing in any reasonable out-of-sample protocol.
We expect this conclusion to be revisited as both the data and the methods continue to evolve. We do not expect the underlying tension between high-capacity models and low-signal financial data to go away, and we treat the tension as a structural feature of the problem rather than as a temporary engineering challenge.
References. Tetlock, P. (2007). Giving content to investor sentiment: The role of media in the stock market. Journal of Finance. · Da, Z., Engelberg, J., and P. Gao (2011). In search of attention. Journal of Finance. · Loughran, T., and B. McDonald (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. Journal of Finance. · Heaton, J., Polson, N., and J. Witte (2017). Deep learning for finance: Deep portfolios. Applied Stochastic Models in Business and Industry. · Gu, S., Kelly, B., and D. Xiu (2020). Empirical asset pricing via machine learning. Review of Financial Studies. · Lim, B., and S. Zohren (2021). Time-series forecasting with deep learning: A survey. Philosophical Transactions of the Royal Society A.
This research note is provided for informational purposes only and does not constitute investment, legal, tax, or accounting advice. Nothing herein constitutes an offer to sell or a solicitation of an offer to buy any security. See our disclosures for the full notice.