A quantitative research monograph · Rajveer Singh Pall
FinSight.
Markets move on numbers.
They drift on words.
Every quarter, the leadership of corporate America spends a few thousand hours explaining itself to analysts. This study reads all of it — 14,584 earnings calls across 601 S&P 500 companies — and asks whether the language itself, measured carefully and tested honestly, predicts what the stock does next.
Chapter 01 · The Corpus
Fourteen thousand conversations,
shelved by year
An earnings call is a strange document. It is half theatre — prepared remarks polished by investor-relations teams — and half interrogation, as analysts probe for what the script left out. FinSight collects seven years of these conversations: every available quarterly call from the S&P 500, paired with more than a million rows of daily price data, so that every sentence can be held against what the market did in the days that followed.
The grains of ink drifting behind this page are that archive. Each point is one call. They have just arranged themselves into seven columns — one per year — and they will keep reorganising as you descend, because every chapter of this study is a different way of looking at the same fourteen thousand conversations.
Calls per year
Chapter 02 · Reading
Teaching a machine
to read the room
FinBERT · sentence-level classification · live
CEO · prepared remarks
We delivered record revenue of $2.4 billion this quarter, up 18% year over year.
Margins came in slightly below our expectations, reflecting elevated input costs.
We remain confident in our full-year outlook and the strength of our pipeline.
That said, there remain certain headwinds we continue to monitor closely.
Analyst · Q&A
Can you help us bridge the gap between the guidance you gave last quarter and what you are reporting today?
I guess what I am trying to understand is what actually changed.
CFO · response
Sure — the variance reflects timing dynamics that we view as transitory.
FinBERT — a BERT model trained on financial text — reads every sentence of every call and scores it positive, negative, or neutral. Crucially, FinSight scores the prepared remarks and the Q&A separately. Management speaks from a script; analysts do not. The difference between those two registers turns out to matter more than either one alone.
Aggregated per call, this produces 14 sentiment features: tone means, negativity ratios, sentence counts, and — most subtly — the volatility of tone across a call, which captures a management team that cannot keep its story steady.
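As a rough illustration of that aggregation step, the sketch below turns per-sentence FinBERT probabilities into per-call features, computed separately for prepared remarks and Q&A. The feature names echo the ones reported later on this page, but the exact definitions here (net sentiment as positive minus negative probability, tone volatility as its standard deviation, a sentence counted as negative when that is its most likely label) are assumptions, and the probabilities are invented. Seven features per section gives fourteen, which matches the count above but is not confirmed as the study's actual list.

```python
from statistics import mean, pstdev

def call_features(sentences):
    """sentences: list of (section, p_pos, p_neg, p_neu), section in {"mgmt", "qa"}."""
    feats = {}
    for section in ("mgmt", "qa"):
        rows = [s[1:] for s in sentences if s[0] == section]
        n = len(rows)
        pos = [r[0] for r in rows]
        neg = [r[1] for r in rows]
        neu = [r[2] for r in rows]
        # Assumed definition: net tone of a sentence = P(positive) - P(negative).
        net = [p - q for p, q in zip(pos, neg)]
        # Assumed definition: a sentence is "negative" when that label is the most likely.
        n_negative = sum(1 for r in rows if r[1] > r[0] and r[1] > r[2])
        feats[f"{section}_n_sentences"] = n
        feats[f"{section}_mean_pos"] = mean(pos) if n else 0.0
        feats[f"{section}_mean_neg"] = mean(neg) if n else 0.0
        feats[f"{section}_mean_neu"] = mean(neu) if n else 0.0
        feats[f"{section}_neg_ratio"] = n_negative / n if n else 0.0
        feats[f"{section}_net_sentiment"] = mean(net) if n else 0.0
        # Tone volatility: how much the net tone swings from sentence to sentence.
        feats[f"{section}_sent_vol"] = pstdev(net) if n else 0.0
    return feats

# Invented scores for a four-sentence call.
example = [
    ("mgmt", 0.90, 0.05, 0.05),
    ("mgmt", 0.10, 0.70, 0.20),
    ("qa", 0.10, 0.80, 0.10),
    ("qa", 0.20, 0.20, 0.60),
]
feats = call_features(example)
```

Keeping the two sections in separate feature families is the point of the design: the same negativity ratio can then mean one thing coming from the script and another coming from the analysts.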
Chapter 03 · Retrieval
Five questions,
asked of every call
Tone tells you how management spoke. It cannot tell you what they actually discussed. So every transcript is split into chunks, embedded into a vector space, and indexed — then five structured questions are put to each call through retrieval. The answers become features: how relevant the retrieved passages are, and what they contain.
Guidance specificity
Q1 · “Specific numerical guidance for next quarter revenue and earnings”
Vague guidance hides; specific guidance commits.
Management confidence
Q2 · “Management expressing strong confidence about future performance”
Conviction has a vocabulary. So does doubt.
Forward-looking
Q3 · “Forward looking statements about growth plans and strategy”
Calls that dwell on the past are often avoiding the future.
New risks
Q4 · “New risks headwinds or challenges disclosed this quarter”
First mention of a risk is worth more than its tenth.
Cost pressure
Q5 · “Rising costs inflation margin pressure and supply issues”
Margin language leaks before margin numbers do.
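The chunk, embed, and query loop can be sketched in a few lines. This is a toy stand-in, not the study's pipeline: word-count vectors replace the all-MiniLM-L6-v2 embeddings, a plain Python list replaces the ChromaDB index, and the chunk sizes, function names, and the choice of "mean similarity of the top passages" as the relevance feature are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def chunk(text, size=40, overlap=10):
    """Split a transcript into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a sentence encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=3):
    """Return the k (score, chunk) pairs most similar to the query."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(c)), c) for c in chunks), reverse=True)
    return scored[:k]

def relevance_feature(query, chunks, k=3):
    """Assumed feature: mean similarity of the top-k retrieved passages."""
    top = retrieve(query, chunks, k)
    return sum(score for score, _ in top) / len(top) if top else 0.0

# Invented passages; the query is Q5 from the list above.
passages = [
    "We expect next quarter revenue of 2.5 billion and earnings of 1.10 per share",
    "Our team culture remains a core strength",
    "Input costs and inflation pressure margins",
]
q5 = "Rising costs inflation margin pressure and supply issues"
best_score, best_passage = retrieve(q5, passages, k=1)[0]
```

A real sentence encoder would also match "margins" to "margin" and "elevated input costs" to "rising costs"; the word-count version only rewards exact token overlap, which is why it should be read as a shape, not a substitute.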
Chapter 04 · The Signal Field — explore
11,551 moments of truth
Each point is one earnings call, placed by its language on the horizontal axis and by the stock’s return over the next five trading days on the vertical. Green rose, red fell. Run your cursor through the field — every point answers.
x · qa_neg_ratio — share of negative analyst sentences
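The vertical axis, the return over the next five trading days, can be sketched as below. The timing convention is an assumption: entry at the first close on or after the call date, exit five trading days later. The function name and the sample prices are invented for illustration.

```python
from bisect import bisect_left

def forward_return(dates, closes, call_date, horizon=5):
    """Return close[t + horizon] / close[t] - 1, or None if prices run out.

    dates: sorted ISO date strings for one ticker's trading days.
    closes: closing prices aligned with dates.
    t is the first trading day on or after call_date (assumed convention).
    """
    i = bisect_left(dates, call_date)
    j = i + horizon
    if j >= len(closes):
        return None  # not enough price history after the call
    return closes[j] / closes[i] - 1.0

# Invented price series: eight trading days in January 2024.
dates = ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05",
         "2024-01-08", "2024-01-09", "2024-01-10", "2024-01-11"]
closes = [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 110.0, 111.0]

r = forward_return(dates, closes, "2024-01-03")        # 110 / 101 - 1
too_late = forward_return(dates, closes, "2024-01-06")  # weekend call, too near the end
```

Calls without a full forward window return `None` and would drop out of the plot under this convention.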
Chapter 05 · Honest Models
Six models,
no time travel
Most backtested “alpha” dies the moment you stop it from peeking at the future. Standard cross-validation shuffles time; a model trained on 2023 quietly learns things no trader could have known in 2021. FinSight forbids this with walk-forward validation: train on three years, test on the next, slide, repeat. Every result on this page was produced by a model that had never seen its test year.
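The slide-and-repeat schedule is simple enough to write down. A minimal sketch, assuming calendar-year folds: the text names 2021 through 2024 as test years, so the seven-year span is taken here to be 2018 through 2024, which is an inference and not something the study states. The function name is hypothetical.

```python
def walk_forward_splits(years, train_size=3, test_size=1):
    """Yield (train_years, test_years) pairs where training always precedes testing."""
    years = sorted(set(years))
    n_folds = len(years) - train_size - test_size + 1
    for start in range(max(n_folds, 0)):
        train = years[start:start + train_size]
        test = years[start + train_size:start + train_size + test_size]
        yield train, test

# Assumed span: 2018..2024 inclusive.
folds = list(walk_forward_splits(range(2018, 2025)))
for train, test in folds:
    print("train", train, "-> test", test)
```

Under that assumption the schedule produces four folds, and in every one the latest training year is earlier than the test year, which is the whole guarantee.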
The protocol
The logistic baseline looks brilliant in 2023, with an information coefficient (IC, the correlation between predictions and realised returns) of 0.165, and then loses money in 2021 and 2024. Its standard deviation across folds is 0.114. LightGBM’s is 0.009: thirteen times steadier, and positive in all four years. In quantitative research, consistency is the difference between a signal and a story.
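For readers who want the metric made concrete: IC (information coefficient) is sketched here as a Spearman rank correlation between predictions and realised returns, which is the common convention but an assumption about this study. The hit rate is taken to be the share of calls where the predicted and realised signs agree, also an assumption. All numbers in the example are invented.

```python
from statistics import mean, stdev

def ranks(xs):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va and vb else 0.0

def information_coefficient(pred, realised):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(pred), ranks(realised))

def hit_rate(pred, realised):
    """Share of observations where predicted and realised signs agree."""
    return mean(1.0 if (p > 0) == (r > 0) else 0.0 for p, r in zip(pred, realised))

# Invented predictions and returns.
ic_perfect = information_coefficient([0.1, 0.2, 0.3, 0.4], [1.0, 2.0, 3.0, 4.0])
ic_reversed = information_coefficient([0.1, 0.2, 0.3, 0.4], [4.0, 3.0, 2.0, 1.0])
hits = hit_rate([0.1, -0.2, 0.3, -0.1], [0.02, -0.01, -0.03, -0.04])

# Hypothetical per-fold ICs, summarised as mean and sample standard deviation.
fold_ics = [0.01, 0.03, 0.02, 0.02]
ic_mean, ic_std = mean(fold_ics), stdev(fold_ics)
```

Whether the study reports a sample or a population standard deviation across its four folds is not stated; the sketch uses the sample version.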
IC by test year · all models
| Model | IC mean | IC std | Hit rate |
|---|---|---|---|
| LightGBM · 34 features ★ | +0.0198 | 0.0085 | 53.3% |
| LSTM · 6-quarter sequences | +0.0153 | 0.0211 | 54.7% |
| XGBoost · 34 features | +0.0141 | 0.0180 | 53.2% |
| RAG features only | +0.0000 | 0.0295 | 53.5% |
| FinBERT features only | −0.0044 | 0.0117 | 53.1% |
| Logistic regression baseline | +0.0429 | 0.1141 | 53.1% |
The LSTM deserves a footnote: fed six-quarter sequences per company, it posted the strongest single fold of any model other than the erratic logistic baseline (2022, IC +0.047) and the best directional accuracy in the table (54.7%). That is evidence that the trajectory of a company’s language carries information its latest call alone does not.
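One plausible way to build those six-quarter inputs is a rolling window over each company's calls, with the target taken from the most recent call in the window. This is a sketch under assumptions: the tuple layout, the function name, and the decision to skip companies with fewer than six calls are illustrative, and gaps between quarters are not handled.

```python
def build_sequences(calls, seq_len=6):
    """calls: list of (ticker, quarter_index, feature_vector, target).

    Returns (ticker, [seq_len feature vectors, oldest first], target_of_last_call).
    """
    by_ticker = {}
    for ticker, quarter, feats, target in sorted(calls, key=lambda c: (c[0], c[1])):
        by_ticker.setdefault(ticker, []).append((feats, target))
    samples = []
    for ticker, rows in by_ticker.items():
        for end in range(seq_len - 1, len(rows)):
            window = rows[end - seq_len + 1:end + 1]
            samples.append((ticker, [f for f, _ in window], window[-1][1]))
    return samples

# Invented data: one company with eight calls, one with only four.
calls = [("AAA", q, [float(q), 0.5], q % 2) for q in range(8)]
calls += [("BBB", q, [float(q), 0.1], 1) for q in range(4)]
samples = build_sequences(calls)
```

Eight calls yield three overlapping windows; the four-call company contributes nothing, which is one reason a sequence model sees fewer training examples than a model fed one call at a time.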
Chapter 06 · Attribution
What the model
learned to hear
Analyst scepticism out-predicts management optimism.
SHAP attribution opens the model and asks which features actually moved its predictions. The answer is consistent and a little subversive: the single most influential feature in the entire system is qa_neg_ratio — the fraction of negative sentences in the analyst Q&A. The prepared remarks are theatre. The interrogation is evidence.
Close behind: management’s tone volatility (a story that keeps changing), the sheer length of the Q&A (how long analysts kept digging), and deliberate neutrality — the corporate art of saying nothing, which the market reads as hedging.
Mean |SHAP| · top 12 of 34 features
| # | Feature | Mean abs. SHAP | In plain English |
|---|---|---|---|
| 01 | qa_neg_ratio | 0.0541 | Share of negative analyst sentences in Q&A |
| 02 | mgmt_sent_vol | 0.0476 | Volatility of management tone across the call |
| 03 | qa_n_sentences | 0.0453 | Length of the Q&A: how long analysts kept digging |
| 04 | mgmt_mean_neu | 0.0445 | How deliberately neutral management stayed |
| 05 | rag_guidance_specificity_relevance | 0.0420 | Whether the call actually contained specific numerical guidance |
| 06 | qa_mean_neg | 0.0415 | Average negativity of analyst questions |
| 07 | qa_net_sentiment | 0.0403 | Net tone of the Q&A session |
| 08 | rag_management_confidence_score | 0.0368 | Confidence expressed in retrieved management passages |
| 09 | rag_cost_pressure_relevance | 0.0358 | How much of the call concerned cost pressure |
| 10 | rag_new_risks_relevance | 0.0355 | How much of the call disclosed new risks |
| 11 | qa_mean_pos | 0.0346 | Average positivity of analyst questions |
| 12 | mgmt_n_sentences | 0.0333 | Length of prepared remarks |
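The ranking itself is a small computation once per-prediction SHAP values exist. The sketch below assumes a matrix with one row per call and one column per feature, such as an explainer like `shap.TreeExplainer` would typically supply for a tree model; how the study produced its matrix is not stated, and the values here are invented.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across all predictions."""
    n = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)  # sign is dropped: only the size of the push counts
    scores = {name: total / n for name, total in zip(feature_names, totals)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented SHAP values for two calls and two features.
names = ["qa_neg_ratio", "mgmt_n_sentences"]
rows = [
    [0.10, -0.02],
    [-0.06, 0.04],
]
ranking = mean_abs_shap(rows, names)
```

Taking the absolute value first matters: a feature that pushes predictions up for some calls and down for others would average to nearly zero otherwise, and look unimportant when it is not.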
Chapter 07 · Where It Lives
Alpha has
a postcode
Walk-forward IC by GICS sector · whiskers = ±1 std
Run the same walk-forward protocol sector by sector and the signal stops being evenly spread — it concentrates violently. Energy calls predict their own stocks at IC +0.311, roughly 83× the signal in Technology, which sits at a statistical zero.
The pattern is exactly what efficient-market theory would sketch. Technology is the most-watched sector on earth — every syllable of an Apple call is priced before the CFO finishes the sentence. Energy firms live downstream of commodity prices, hedge books, and project timelines that management discusses in concrete, numerical language. More information asymmetry in; more predictability out.
Chapter 08 · The Verdict
The signal is real.
The market is fast.
Long–short quartile portfolio · net of 10bps round-trip costs
| Metric | 5-day hold | 20-day hold |
|---|---|---|
| Annualised return | −0.91% | −0.69% |
| Sharpe ratio | −0.81 | −0.23 |
| Max drawdown | −4.24% | −6.03% |
| Win rate | 37.5% | 31.3% |
We found a whisper, not a siren — and we report it as a whisper.
Here is the honest arithmetic. The signal exists — IC 0.0198, steady across every test year. But at a five-day horizon it is too small to outrun ten basis points of transaction costs, and the strategy loses money: Sharpe −0.81. This is the result most projects bury. It is the most informative number in the study.
Because the loss has structure. Stretch the holding period to twenty days and the Sharpe ratio moves from −0.81 to −0.23, a loss roughly 3.5× smaller, while the cost per trade stays fixed. That is the shape predicted by post-earnings-announcement drift (Bernard & Thomas, 1989): the market underreacts to earnings information and absorbs it over weeks, not days. The language signal is real; it simply needs patience the five-day strategy doesn’t have.
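The arithmetic behind that comparison can be sketched as follows. The portfolio construction (equal-weight, long the top prediction quartile, short the bottom) follows the table caption; charging the ten basis points once per holding period on the whole book, and 252 trading days per year, are assumptions. Function names and sample numbers are invented.

```python
from statistics import mean, stdev

def long_short_return(preds, rets, cost_bps=10.0):
    """Net return for one holding period: top quartile minus bottom quartile, less costs."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    q = len(order) // 4
    if q == 0:
        raise ValueError("need at least four stocks to form quartiles")
    short_leg = order[:q]   # lowest predictions
    long_leg = order[-q:]   # highest predictions
    gross = mean(rets[i] for i in long_leg) - mean(rets[i] for i in short_leg)
    return gross - cost_bps / 10_000  # assumed: one round-trip charge per period

def annualised_sharpe(period_returns, hold_days, trading_days=252):
    """Mean over standard deviation, scaled by the square root of periods per year."""
    sd = stdev(period_returns)
    if sd == 0:
        return 0.0
    return mean(period_returns) / sd * (trading_days / hold_days) ** 0.5

def annual_cost_drag(hold_days, cost_bps=10.0, trading_days=252):
    """Total cost paid per year if the book turns over once per holding period."""
    return (trading_days / hold_days) * cost_bps / 10_000

# Invented cross-section of eight stocks.
preds = [1, 2, 3, 4, 5, 6, 7, 8]
rets = [-0.02, -0.01, 0.00, 0.01, 0.00, 0.01, 0.02, 0.03]
net = long_short_return(preds, rets)

drag_5d = annual_cost_drag(5)    # about 5.0% a year
drag_20d = annual_cost_drag(20)  # about 1.3% a year
```

Under these assumptions the same ten basis points cost about 5.0% a year at a five-day hold and about 1.3% at twenty days, because the book turns over a quarter as often. A small edge per trade can be real and still sit entirely inside the first number.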
Chapter 09 · Colophon
The archive goes back to sleep
Fourteen thousand conversations, read end to end, yield a small, stable, honest signal — strongest where fewer people are listening, and priced away where everyone is. The field behind this page is drifting back to where you found it. The next earnings season will wake it up again.
Methods & stack
- Language model: FinBERT (ProsusAI)
- Embeddings: all-MiniLM-L6-v2 · ChromaDB
- Learners: LightGBM · XGBoost · PyTorch LSTM
- Attribution: SHAP
- Validation: Walk-forward, 4 folds, zero leakage
- Data: HuggingFace transcripts · yfinance prices
- Compute: Single RTX 4060 laptop GPU
- This site: Next.js · Three.js · WebGL
Limitations, plainly
- Net-of-cost returns are negative at both horizons; this is a study of signal existence, not a tradeable strategy.
- Sector results rest on small samples — Energy’s IC of +0.31 comes with a std of 0.24.
- Retrieval scoring is lexical-semantic, not generative; a stronger reader may find more.
Next
- Long-only 20-day backtest to remove short-selling costs
- Generative scoring of retrieved passages with a modern LLM
- Sector-stratified models — one specialist per GICS group
- Real-time pipeline for live earnings season
References
- [1] Araci, D. (2019). FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv:1908.10063.
- [2] Bernard, V. & Thomas, J. (1989). Post-Earnings-Announcement Drift: Delayed Price Response or Risk Premium? Journal of Accounting Research.
- [3] Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
- [4] Lundberg, S. & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS.
- [5] Loughran, T. & McDonald, B. (2011). When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks. Journal of Finance.
- [6] Chan, L., Jegadeesh, N. & Lakonishok, J. (1996). Momentum Strategies. Journal of Finance 51(5).
Rajveer Singh Pall
Independent quantitative research. Designed, built, and validated end to end — from transcript ingestion to this page.
© 2026 Rajveer Singh Pall · MIT License · Research, engineering & design by the author