The Arithmetic of When

Martin Russmann, Jun 25, 2026

The Bottom Line: This is a rigorous, unsentimental account of how to decide when to enter and exit a trade. Unlike most advice on the subject, it refuses to pretend that trading is free, that markets behave like coins, or that a hunch is a plan. It gives you a single formula for the moment action is justified, and the statistical scaffolding to keep that formula honest. Whether you trade for a living or merely watch the tickers the way other people watch weather, this is the bridge between the seminar room and the order book.

Introduction: The Question Everyone Skips

You have done the work. You have read the filings until the numbers blurred, traced the supply chains, argued with yourself at two in the morning about whether the market has already noticed what you think you alone have noticed. You are as certain as you ever get. The stock is going up. Your finger hovers over the button.

And then the only question that actually matters arrives, quietly, like a creditor at the door: when?

Not whether. Whether is the easy part, the part everyone has a loud opinion about at dinner. When is the part that empties accounts. Right now? At the open tomorrow? After the dip that may or may not be coming? Most trading books wave their hands here and mutter something soothing about "buying on weakness" or "respecting the trend," advice with roughly the precision of telling a lost hiker to "head generally downhill." It sounds wise. It will not get you off the mountain.

This article answers the question properly, with arithmetic. It does so without ever forgetting the thing the seminar room likes to forget: that markets are expensive, noisy, and entirely indifferent to how much research you did.

Why This Is Different

The usual approach treats timing as pattern recognition. The market drifts up on Mondays. Buy when the price kisses the fifty-day average. These rules feel like knowledge. They share three fatal weaknesses.

First, they ignore the cost of trading. A small statistical edge is a delicate thing, and the spread, the fees, and the market's habit of moving away from you will crush it without noticing. Second, they assume independence, as if each day's return were a fresh coin flip, when in fact today's move leans on yesterday's and breathes on tomorrow's. Third, and most damning, they never tell you when to act. A fifty-five percent win rate sounds like an edge. Whether it survives contact with reality is a different question entirely, and the slogans have nothing to say about it.

The framework here fixes all three by treating timing as what it has always quietly been: a decision made under uncertainty, against friction, with real money on the line.

Part 1: Setting Up the Problem

The Price You See Versus the Price You Get

Before we can talk about timing, we have to be honest about a small fraud built into every trading screen: the price you are looking at is not a price you can have.

What you usually see is the mid-price, the polite average of the best bid (what buyers will pay) and the best ask (what sellers demand):

$$S_t = \frac{A_t + B_t}{2}$$

where $S_t$ is the mid-price at time $t$, $A_t$ is the ask, and $B_t$ is the bid.

The mid-price is a useful fiction, the way "sea level" is a useful fiction. You cannot trade at it. Buy in a hurry and you pay the ask. Sell in a hurry and you receive the bid. The gap between them, the spread, is the toll you pay simply for the privilege of changing your mind quickly. It is the first cost of doing business, and it is charged before you have made a single cent.

What a Return Actually Is

When we say a stock "moved," we mean its return, the percentage change in price. There are two dialects for this.

The simple return is what most people picture:

$$r_t = \frac{S_t - S_{t-1}}{S_{t-1}}$$

The log return is what quants prefer, because it behaves itself mathematically:

$$\ell_t = \log S_t - \log S_{t-1}$$

For small moves the two are nearly indistinguishable. The log return has one charming property: it adds up over time. The return from Monday to Wednesday is exactly the Monday-to-Tuesday return plus the Tuesday-to-Wednesday return, no compounding gymnastics required.

For a prediction horizon of $\tau$ periods, say the next ninety minutes, we write:

$$\ell_{t \to t+\tau} = \log S_{t+\tau} - \log S_t$$

This is the thing we are trying to forecast: how far the stock will travel over the window we have chosen to care about.
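A small sketch makes the additivity claim concrete. The prices are invented for illustration, and the function names are this sketch's own, not part of any library.

```python
import math

def simple_returns(prices):
    """Simple returns r_t = (S_t - S_{t-1}) / S_{t-1}."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def log_returns(prices):
    """Log returns l_t = log S_t - log S_{t-1}."""
    return [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]

def horizon_log_return(prices, t, tau):
    """Log return over the window from index t to index t + tau."""
    return math.log(prices[t + tau]) - math.log(prices[t])

# Hypothetical mid-prices for Monday, Tuesday, Wednesday
mids = [100.0, 101.0, 99.5]

r = simple_returns(mids)
l = log_returns(mids)
two_day = horizon_log_return(mids, 0, 2)

print("sum of daily log returns:", sum(l))
print("two-day log return:      ", two_day)
print("sum of daily simple returns:", sum(r))
print("two-day simple return:      ", mids[2] / mids[0] - 1)
```

The two daily log returns sum to the two-day log return, up to floating-point rounding. The two daily simple returns do not sum to the two-day simple return, which is the compounding gymnastics the log form spares you.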

The One Rule You Cannot Break: No Peeking

It sounds too obvious to state, which is precisely why it is the trapdoor most timing research falls through. Every calculation may use only the information that existed at the moment of the decision.

Formally, we hang this on the idea of a filtration $\mathcal{F}_t$, a grand name for a simple thing: everything you could possibly have known at time $t$. Past prices, past trades, news already on the wire, every indicator you computed from history.

A legitimate timing signal must be $\mathcal{F}_t$-measurable, meaning it leans on nothing but what was available at time $t$. The alternative, letting a whisper of the future leak into the calculation, produces backtests of breathtaking beauty and live performance of breathtaking misery. The future is the most expensive thing you can accidentally put in a model.
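In code, the rule usually comes down to where a slice ends. The sketch below is one conservative way to build a rolling feature, assuming a plain list of returns indexed by time. The name `trailing_mean` and the choice to exclude the observation at index `t` itself are assumptions of this sketch. Including index `t` is also legitimate when that value is truly known at decision time.

```python
def trailing_mean(values, t, window):
    """Mean of the `window` observations strictly before index t.

    Only values[t - window : t] is read, so the value at t and anything
    after it cannot influence the feature. Returns None until enough
    history exists.
    """
    if t < window:
        return None
    past = values[t - window:t]
    return sum(past) / window

# Hypothetical log returns
returns = [0.001, -0.002, 0.003, 0.004, -0.001]
feature_at_3 = trailing_mean(returns, 3, 3)

# Leakage check: rewriting the future must leave the feature unchanged
tampered = returns[:3] + [9.9, 9.9]
assert trailing_mean(tampered, 3, 3) == feature_at_3
```

The final assertion is a cheap audit worth running on every feature: corrupt everything from the decision time onward and confirm that nothing computed at the decision time moves.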

Part 2: The True Cost of Trading

Your Fill Is Not the Mid

Let us model what actually happens when you press the button. Buying, your real fill price looks like this:

$$P_t^{\text{fill, buy}} = A_t \times (1 + \kappa_t^{\text{impact}} + \kappa_t^{\text{slip}}) + \text{fee}_t$$

Selling, like this:

$$P_t^{\text{fill, sell}} = B_t \times (1 - \kappa_t^{\text{impact}} - \kappa_t^{\text{slip}}) - \text{fee}_t$$

The pieces, and roughly what each one costs you:

Component	What it is	Typical size
$A_t$ or $B_t$	Ask or bid price	The starting line
$\kappa^{\text{impact}}$	Market impact: your own order shoving the price away from you	1 to 10 basis points
$\kappa^{\text{slip}}$	Slippage from latency and the lag between intent and execution	1 to 5 basis points
$\text{fee}_t$	Brokerage and exchange charges, entered in the formulas in price units per share	0 to 10 basis points of the price

A basis point (bp) is one hundredth of a percent, so 10 bp is 0.1%.

Market impact deserves a moment of respect. It is the universe's way of reminding you that you are not a passive observer of the price. The moment you reach for a large quantity, the price notices and steps back, like a crowd parting around someone who smells of desperation.

The Net Return: What You Keep

For a long trade, buy now and sell later, your net log return is what is left after the tolls:

$$g_{t,\tau}^{\text{long}} = \log P_{t+\tau}^{\text{fill, sell}} - \log P_t^{\text{fill, buy}}$$

This is the only number that pays the rent. Not the mid-price return, not the gross return, but the net return after everything has taken its cut.

Picture an edge you would be proud of: a predicted move of 0.3%, hit 60% of the time. On the whiteboard it is a small fortune. Now subtract 0.2% in round-trip costs. Every winning trade keeps only 0.1%, so two thirds of each win simply ceases to exist, quietly, before you have done anything wrong, and every losing trade costs 0.2% more than the chart suggested. This is the trapdoor under every "proven" strategy that works in the brochure and starves in the wild.
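The two fill formulas and the net return translate directly into code. This is a minimal sketch with invented inputs: a mid that moves from 100.00 to 100.30, a two-cent spread, 2 bp of impact, 1 bp of slippage, and a fee of one cent per share. The fee is entered in price units, as in the formulas above.

```python
import math

def buy_fill(ask, impact, slip, fee):
    """Fill price when buying: ask * (1 + impact + slip) + fee."""
    return ask * (1 + impact + slip) + fee

def sell_fill(bid, impact, slip, fee):
    """Fill price when selling: bid * (1 - impact - slip) - fee."""
    return bid * (1 - impact - slip) - fee

def net_long_log_return(entry_ask, exit_bid, impact, slip, fee):
    """Net log return of buying now and selling later, after all costs."""
    paid = buy_fill(entry_ask, impact, slip, fee)
    received = sell_fill(exit_bid, impact, slip, fee)
    return math.log(received) - math.log(paid)

# Hypothetical quotes: mid 100.00 -> 100.30 with a 0.02 spread
entry_ask, exit_bid = 100.01, 100.29
impact, slip, fee = 0.0002, 0.0001, 0.01

gross = math.log(100.30) - math.log(100.00)   # mid-to-mid log return
net = net_long_log_return(entry_ask, exit_bid, impact, slip, fee)

print(f"gross (mid-to-mid): {gross:.5f}")
print(f"net (fill-to-fill): {net:.5f}")
```

Even with these fairly gentle cost assumptions, the net figure comes out well below the mid-to-mid figure. A target built from mid-price returns would have labeled this trade more profitable than it was.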

Part 3: From Prediction to Decision

This is where most timing research lies down and goes to sleep. It builds a model, measures its accuracy, admires the accuracy, and stops. But accuracy is not action. A weather forecast that is right 70% of the time tells you nothing about whether to cancel the picnic until you know how much you hate getting wet and how much you were looking forward to the sandwiches.

Two Ways to Think About It

The first, the full distribution approach, is the ideal. If you can model the whole shape of future returns, not just the average but the spread and the lopsidedness, you can make textbook-optimal decisions through expected utility. It is elegant and demanding.

The second, the win/loss approach, is the one you can actually run on a Tuesday. You predict three things: the probability that the trade ends in profit, the size of the win when you are right, and the size of the loss when you are wrong.

Precisely:

$$y_t = \mathbb{1}\left[g_{t,\tau}^{\text{long}} > 0\right]$$

A binary verdict: 1 if the trade would have made money after costs, 0 if not. The model estimates the odds of a 1:

$$\hat{p}_t = P(y_t = 1 \mid \mathcal{F}_t)$$

the probability, given everything known at time $t$, that the trade pays off. It also estimates the size of victory,

$$\mu_t^+ = \mathbb{E}\left[g_{t,\tau}^{\text{long}} \mid \mathcal{F}_t, y_t = 1\right]$$

a positive number, the expected gain when you win, and the size of defeat,

$$\mu_t^- = -\mathbb{E}\left[g_{t,\tau}^{\text{long}} \mid \mathcal{F}_t, y_t = 0\right]$$

written as a positive number for convenience, the expected loss when you lose.

The Question Worth Money: How Sure Is Sure Enough?

Here is the heart of the thing. Given your probability $\hat{p}_t$ and your win and loss sizes $\mu_t^+$ and $\mu_t^-$, when should you actually pull the trigger?

The expected value of trading is:

$$\mathbb{E}[g_{t,\tau}^{\text{long}} \mid \mathcal{F}_t] = \hat{p}_t \cdot \mu_t^+ - (1 - \hat{p}_t) \cdot \mu_t^-$$

In plain speech: your expected profit is the chance of winning times the average win, minus the chance of losing times the average loss. The full and ancient logic of every bet ever placed, in one line.
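To see the three ingredients and the expected value on real numbers, the sketch below estimates them from a short, invented history of net returns. These are unconditional sample averages, a crude stand-in for the conditional estimates a model would supply. The function name `win_loss_stats` is this sketch's own.

```python
def win_loss_stats(net_returns):
    """Estimate (p_hat, mu_plus, mu_minus) from realized net log returns.

    A trade counts as a win only if its net return is strictly positive,
    matching the label y = 1[g > 0]. mu_minus is reported as a positive
    number (or zero if every loss was exactly break-even).
    """
    wins = [g for g in net_returns if g > 0]
    losses = [g for g in net_returns if g <= 0]
    if not wins or not losses:
        raise ValueError("need at least one win and one loss")
    p_hat = len(wins) / len(net_returns)
    mu_plus = sum(wins) / len(wins)
    mu_minus = -sum(losses) / len(losses)
    return p_hat, mu_plus, mu_minus

# Hypothetical net returns of five past trades
history = [0.0010, -0.0006, 0.0008, -0.0008, 0.0009]
p_hat, mu_plus, mu_minus = win_loss_stats(history)

expected_value = p_hat * mu_plus - (1 - p_hat) * mu_minus
sample_mean = sum(history) / len(history)
print(p_hat, mu_plus, mu_minus, expected_value, sample_mean)
```

On sample data the decomposition is an identity: the win/loss expected value equals the plain average of the net returns. That makes a handy check that the three estimates were computed from the same set of trades.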

But a merely positive expected value is not a good enough reason to risk capital. You want a margin of safety, a buffer $\lambda_t$ that absorbs the model's overconfidence, your risk limits, and the small operational indignities of real trading. So the rule tightens:

Trade only if: $\hat{p}_t \cdot \mu_t^+ - (1 - \hat{p}_t) \cdot \mu_t^- > \lambda_t$

The Formula: Your Required Win Rate

Solve that inequality for $\hat{p}_t$ and out falls the minimum probability threshold:

$$\boxed{\pi_t^* = \frac{\mu_t^- + \lambda_t}{\mu_t^+ + \mu_t^-}}$$

This is the whole article compressed into a fraction. It tells you, exactly, how confident you must be before acting. Everything else is in service of computing the three numbers inside it.
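For readers who want the algebra spelled out, the steps are short. They assume $\mu_t^+ + \mu_t^- > 0$, which holds whenever wins and losses are not both zero, so dividing by the sum does not flip the inequality:

$$\hat{p}_t \cdot \mu_t^+ - (1 - \hat{p}_t) \cdot \mu_t^- > \lambda_t$$

$$\hat{p}_t \cdot (\mu_t^+ + \mu_t^-) - \mu_t^- > \lambda_t$$

$$\hat{p}_t > \frac{\mu_t^- + \lambda_t}{\mu_t^+ + \mu_t^-} = \pi_t^*$$

One consequence is worth noticing: if $\lambda_t \geq \mu_t^+$, then $\pi_t^* \geq 1$ and no probability, however confident, clears the bar.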

Read it like a sentence. When the potential loss grows, the bar rises, because bigger losses demand more certainty. When the potential win grows, the bar falls, because a fat reward justifies acting on thinner odds. When you become more cautious, the bar rises with you.

When this rises...	$\pi^*$ does this...	Because
Expected loss $\mu^-$	rises	bigger losses demand more confidence
Expected win $\mu^+$	falls	bigger wins justify acting on thinner odds
Risk margin $\lambda$	rises	a more cautious stance raises the bar

A Worked Example

Say the model hands you: $\hat{p}_t = 0.62$ (a 62% chance of profit), $\hat{\mu}_t^+ = 0.0009$ (a 9 bp expected win), $\hat{\mu}_t^- = 0.0007$ (a 7 bp expected loss), and $\lambda_t = 0.0001$ (a 1 bp risk margin).

The threshold:

$$\pi_t^* = \frac{0.0007 + 0.0001}{0.0009 + 0.0007} = \frac{0.0008}{0.0016} = 0.50$$

Since $\hat{p}_t = 0.62 > 0.50 = \pi_t^*$, you trade.

Now the market grows nervous, volatility spikes, and you raise your margin to $\lambda_t = 0.0005$:

$$\pi_t^* = \frac{0.0007 + 0.0005}{0.0009 + 0.0007} = \frac{0.0012}{0.0016} = 0.75$$

Now $\hat{p}_t = 0.62 < 0.75 = \pi_t^*$, and you sit on your hands. Nothing about your prediction changed. The world did. This is the whole point: the bar for action must breathe with the conditions, never sit frozen at some number you liked the look of once.
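The same arithmetic as a pair of small functions, using the numbers from the example above. The function names are illustrative, not from any library.

```python
def required_win_rate(mu_plus, mu_minus, risk_margin):
    """Minimum probability pi* = (mu_minus + lambda) / (mu_plus + mu_minus)."""
    if mu_plus <= 0 or mu_minus < 0:
        raise ValueError("mu_plus must be positive and mu_minus non-negative")
    return (mu_minus + risk_margin) / (mu_plus + mu_minus)

def should_trade(p_hat, mu_plus, mu_minus, risk_margin):
    """Trade only when the estimated probability clears the threshold."""
    return p_hat > required_win_rate(mu_plus, mu_minus, risk_margin)

p_hat, mu_plus, mu_minus = 0.62, 0.0009, 0.0007

calm = required_win_rate(mu_plus, mu_minus, 0.0001)      # about 0.50
nervous = required_win_rate(mu_plus, mu_minus, 0.0005)   # about 0.75

print(calm, should_trade(p_hat, mu_plus, mu_minus, 0.0001))
print(nervous, should_trade(p_hat, mu_plus, mu_minus, 0.0005))
```

The prediction is identical in both calls. Only the margin changes, and with it the decision.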

Part 4: Finding Statistical Edges

We now know how to turn a probability into a decision. The next question is where the probabilities come from. The framework draws on three wells.

4.1 Temporal Patterns: Do Certain Times Work Better?

The question is innocent enough. Are there days, or hours, when a stock tends to behave differently? The trouble is that this is a statistical minefield wearing the costume of a simple question. Returns are not coin flips. They are serially correlated, today's leaning on yesterday's, and their volatility comes in clusters, calm stretches broken by storms. The standard textbook tests assume none of this and will hand you confident, beautiful, completely false answers.

For each weekday $d$, Monday through Friday, we want the average return,

$$\mu_d = \mathbb{E}\left[\ell_t \mid \text{day}(t) = d\right]$$

and we want to know whether it truly differs from the others, or whether the difference is noise wearing a pattern's clothing.

Doing this honestly means four disciplines, not one. Compute returns with a single consistent definition. Estimate the differences using HAC (Heteroskedasticity and Autocorrelation Consistent) standard errors (Newey and West, 1987), which take seriously that returns are correlated and that their variance wanders. Lean on block permutation tests, which shuffle the data while keeping its time structure intact, breaking only the day-of-week link you are testing. And apply multiple testing corrections (Benjamini and Hochberg, 1995), because once you are probing five days across many stocks across many horizons, something will look significant purely because you asked enough times. This is not an abstract worry. In finance specifically, a sweeping review of the published "factors" found that most of them fail once you account honestly for how many hypotheses the profession has quietly tested over the years (Harvey, Liu, and Zhu, 2016). The graveyard of dead anomalies is mostly filled with multiple-testing artifacts that nobody corrected for.
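Of the four disciplines, the HAC standard error is the easiest to sketch from scratch. The version below covers only the simplest case, the standard error of a sample mean, using Bartlett kernel weights as in Newey and West (1987). A full weekday study would more likely run a regression on day dummies with a HAC covariance matrix from a statistics library. The lag count and the toy series here are assumptions chosen for illustration.

```python
import math

def newey_west_se_of_mean(x, lags):
    """HAC (Newey-West) standard error of the sample mean of x.

    Long-run variance = gamma_0 + 2 * sum_j w_j * gamma_j, with Bartlett
    weights w_j = 1 - j / (lags + 1). With lags = 0 this reduces to the
    naive standard error sqrt(gamma_0 / n).
    """
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]

    def gamma(j):
        # Sample autocovariance at lag j (divided by n, not n - j)
        return sum(d[t] * d[t - j] for t in range(j, n)) / n

    long_run_var = gamma(0)
    for j in range(1, lags + 1):
        weight = 1.0 - j / (lags + 1)
        long_run_var += 2.0 * weight * gamma(j)
    return math.sqrt(long_run_var / n)

# Toy series with obvious positive autocorrelation: runs of up, runs of down
x = ([0.001] * 3 + [-0.001] * 3) * 2

naive_se = newey_west_se_of_mean(x, 0)
hac_se = newey_west_se_of_mean(x, 1)
print(naive_se, hac_se)
```

On this positively autocorrelated series the HAC standard error is larger than the naive one. That is the practical point: ignoring the dependence makes a weekday effect look more certain than it is.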

The same machinery handles intraday windows. Carve the day into segments, open, mid-morning, midday, afternoon, close, and sum the return inside each:

$$L_w(t) = \sum_{u \in w(t)} \ell_u$$

with the same statistical caution applied throughout.
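Once weekdays and intraday windows are both on the table, the number of tests climbs quickly, and the Benjamini-Hochberg step earns its place. Below is a minimal sketch of the procedure. The p-values are invented; in practice they would come from the HAC or block-permutation tests described above. The classical guarantee also assumes the tests are independent or positively dependent.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Sort the p-values, find the largest rank k with p_(k) <= (k / m) * fdr,
    and reject every hypothesis ranked at or below k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values for Monday..Friday effects
p_values = [0.04, 0.30, 0.008, 0.62, 0.03]

naive = [p < 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values, fdr=0.05)
print("naive:    ", naive)
print("corrected:", corrected)
```

Tested one at a time at the 5% level, three of the five days look significant. After the correction only one survives, which is the usual fate of calendar effects under honest accounting.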

A warning worth tattooing somewhere visible: temporal patterns are weak, and they wander with the regime. Feed them to your model as features if you like. Do not let them drive the car.

4.2 Price Levels: Support and Resistance

Chartists have insisted for a century that prices bounce off certain lines, the way a ball bounces off a floor that only the chartist can see. We can make the idea respectable.

Rather than drawing lines by hand and trusting the eye, we use statistical quantiles of recent prices:

$$S_t^{\text{support}} = \text{Quantile}_\alpha\{S_u : u \in [t-W, t-1]\}$$

$$R_t^{\text{resistance}} = \text{Quantile}_{1-\alpha}\{S_u : u \in [t-W, t-1]\}$$

Here $W$ is the lookback window, perhaps 5,000 to 20,000 bars, and $\alpha$ is something small like 0.1 or 0.15. In words: support is roughly the price's 10th percentile over the recent past, a floor it seldom drops through. Resistance is the 90th, a ceiling it seldom breaks.

To generate candidates we add a tolerance $\delta$, so we are not whipsawed by exact boundaries, and a momentum check $m_t = S_t - S_{t-k}$ over a short lookback of $k$ bars. A long candidate appears when the price sits near support and momentum has begun to turn up:

$$S_t \leq S_t^{\text{support}}(1 + \delta) \quad \text{and} \quad m_t > 0$$

A short candidate, the mirror image, when the price presses against resistance and momentum has rolled over:

$$S_t \geq R_t^{\text{resistance}}(1 - \delta) \quad \text{and} \quad m_t < 0$$

That momentum condition is there to keep you from doing the single most expensive thing in trading, which is reaching out to catch a knife on the way down because it was cheap. It makes you wait for evidence that the bounce is happening rather than merely overdue.
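The candidate rules fit in a few lines. The sketch below uses a tiny window and invented prices so the logic can be followed by eye. Real lookbacks would be far longer, and the linear-interpolation quantile is one common convention among several.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, 0 <= q <= 1."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def level_candidate(prices, t, window, alpha, delta, k):
    """Return 'long', 'short', or None for the bar at index t.

    Levels are computed from prices[t - window : t], that is u in [t-W, t-1],
    so the current bar never defines its own support or resistance.
    """
    if t < window or t < k:
        return None
    history = prices[t - window:t]
    support = quantile(history, alpha)
    resistance = quantile(history, 1 - alpha)
    momentum = prices[t] - prices[t - k]

    if prices[t] <= support * (1 + delta) and momentum > 0:
        return "long"
    if prices[t] >= resistance * (1 - delta) and momentum < 0:
        return "short"
    return None

# Hypothetical prices: a rally, a slide to new lows, then a first uptick
prices = [104, 105, 106, 107, 108, 109, 107, 105, 103, 101, 100.2, 100.6]

falling = level_candidate(prices, 10, window=10, alpha=0.1, delta=0.005, k=1)
turning = level_candidate(prices, 11, window=10, alpha=0.1, delta=0.005, k=1)
print(falling, turning)
```

At index 10 the price is already below support but still falling, so no candidate is raised. At index 11 the price is still near support and has ticked up, and only then does the long candidate appear.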

4.3 Multi-Horizon Momentum

A single moving-average crossover is a noisy oracle, prone to declaring a trend every time the wind changes. So instead of trusting one, we poll many and let them vote:

$$\text{Score}(t) = \sum_{k \in K} w_k \cdot \text{sign}\left(\text{MA}_{k,\text{short}}(t) - \text{MA}_{k,\text{long}}(t)\right)$$

The score runs positive when short averages sit above long ones across several horizons, the technical signature of an uptrend, and negative in the opposite case. The weights $w_k$ should be learned from history and then reined in with regularization, so the model does not fall in love with a coincidence from three years ago.
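A minimal version of the vote, assuming simple moving averages and equal weights. Both choices are placeholders here, since the weights are meant to be learned and regularized. The horizon pairs are invented for illustration.

```python
def moving_average(prices, t, n):
    """Mean of the n prices ending at index t, inclusive."""
    if t - n + 1 < 0:
        raise ValueError("not enough history for this horizon")
    return sum(prices[t - n + 1:t + 1]) / n

def sign(x):
    """Return 1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

def momentum_score(prices, t, horizons, weights):
    """Weighted vote of short-versus-long moving-average crossovers at t."""
    score = 0.0
    for (short, long_), w in zip(horizons, weights):
        diff = moving_average(prices, t, short) - moving_average(prices, t, long_)
        score += w * sign(diff)
    return score

# Hypothetical horizon pairs (short, long) in bars, equally weighted
horizons = [(2, 4), (3, 6), (4, 12)]
weights = [1 / 3, 1 / 3, 1 / 3]

rising = [100.0 + i for i in range(12)]
falling = list(reversed(rising))

score_up = momentum_score(rising, 11, horizons, weights)
score_down = momentum_score(falling, 11, horizons, weights)
print(score_up, score_down)
```

With weights that sum to one, the score lives between -1 and +1. It reaches an extreme only when every horizon agrees, which is the behavior you want from a committee.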

Part 5: The Machine Learning Layer

What We Are Predicting

The target is the binary outcome we already met: will this trade make money after costs?

$$y_t = \mathbb{1}\left[g_{t,\tau}^{\text{long}} > 0\right]$$

Note the quiet insistence on executable fill prices, not mid-prices. We are predicting profit, the kind you can spend, not direction, the kind you can frame.

The Inputs

Every feature must be built from the past and only the past. The usual cast: recent returns at various lags $\{\ell_{t-k}\}$; volatility measures, both realized and range-based; temporal features like day and time and rolling estimates; level features such as the distance to support and resistance; the aggregated momentum score; and microstructure signals like the spread, depth proxies, and volume patterns.

Which Model

For tabular data of this kind, gradient boosting (XGBoost, from Chen and Guestrin, 2016; LightGBM, from Ke et al., 2017) is a hard baseline to beat and a sensible place to start. This is not nostalgia for an older method. On exactly the kind of heterogeneous, medium-sized tabular data that markets produce, boosted trees still tend to match or beat deep neural networks while demanding far less tuning and far less data (Grinsztajn, Oyallon, and Varoquaux, 2022). Fancier machinery, transformers, recurrent nets, is welcome to enter the ring, but it must win on the only scoreboard that pays, net of costs, not merely on some accuracy metric that looks good in a slide.

The model returns the probability,

$$\hat{p}_t = P(y_t = 1 \mid \mathcal{F}_t)$$

and, if you ask nicely, the magnitude estimates $\hat{\mu}_t^+$ and $\hat{\mu}_t^-$.

Why Calibration Beats Accuracy

Here is a point that is subtle, dull-sounding, and absolutely load-bearing: your model's probabilities have to be true.

A model is well-calibrated if, among all the moments it announces "60% chance of profit," close to 60% of them actually pay off. Plenty of models manage good accuracy alongside catastrophic calibration, cheerfully printing "70%" when the honest answer is 55%. They are like a friend who is right about who will win the match but wildly wrong about how nervous you should be, which makes their advice useless precisely when you need it.

It matters because the threshold rule treats $\hat{p}_t$ as a real probability and compares it against $\pi_t^*$ without mercy. If the probability is a fantasy, that comparison is built on sand, and you will trade too much or too little with great conviction. The modern, high-capacity models everyone reaches for are especially guilty here, tending toward systematic overconfidence unless explicitly corrected (Guo et al., 2017). The repair is a calibration step after training, isotonic regression or Platt scaling on held-out data, to drag the model's confidence back toward honesty.
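Before calibration can be repaired it has to be measured. The sketch below computes the Expected Calibration Error with equal-width probability bins, the form used by Guo et al. (2017). The bin count and the two toy models are assumptions made for illustration.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE: weighted average gap between confidence and hit rate per bin.

    probs are predicted probabilities in [0, 1]; outcomes are 0 or 1.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)   # p = 1.0 goes in the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(p for p, _ in members) / len(members)
        hit_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(avg_confidence - hit_rate)
    return ece

# Twenty hypothetical trades, eleven of which paid off (55%)
outcomes = [1] * 11 + [0] * 9

overconfident = [0.70] * 20   # the model says 70% every time
honest = [0.55] * 20          # the model says 55% every time

ece_over = expected_calibration_error(overconfident, outcomes)
ece_honest = expected_calibration_error(honest, outcomes)
print(ece_over, ece_honest)
```

Both toy models rank the trades identically, so a ranking metric such as AUC cannot tell them apart. The ECE can: the overconfident model is off by about fifteen percentage points, and against a threshold of 0.60 it would trade every time while the honest one would never trade.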

Part 6: Validation, or Proving It Without Fooling Yourself

Why the Usual Split Fails

In ordinary machine learning you shuffle the data into training and test piles at random and call it fair. For time series this is not fair, it is theft from the future. Random shuffling can drop tomorrow's observations into today's training set, letting the model "learn" from events that had not happened yet. And when your prediction horizon is ninety minutes, two observations thirty minutes apart share a slab of the same future, so their labels are quietly entangled.

Walk-Forward, with Purging and Embargo

The remedy is walk-forward validation, which respects the arrow of time. Train on history up to $T_1$. Validate on the stretch from $T_1$ plus an embargo through $T_2$. Then roll the window forward and repeat, always testing on the future relative to what you trained on.

The embargo is a deliberate gap that stops information from seeping across the boundary. If your prediction reaches over $[t, t+\tau]$, any training sample within $\tau$ of the validation start could smuggle the future into the past, so a safe embargo is at least $\tau$ periods wide. Purging does the complementary job, removing training samples whose label windows overlap the test period at all. Both devices, and the wider discipline of leakage-safe cross-validation built specifically for financial data, are set out in detail by López de Prado (2018), whose combinatorial purged cross-validation extends the same logic to many resampled paths rather than the single historical one.
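A sketch of how the fold boundaries might be generated, assuming samples are indexed by integer time and that sample $i$ carries a label built from the window $[i, i+\tau]$. This is a simplified rolling-window illustration, not the combinatorial scheme of López de Prado (2018), and all the sizes are invented.

```python
def walk_forward_splits(n_samples, train_size, test_size, horizon, embargo):
    """Yield (train_indices, test_indices) for rolling walk-forward folds.

    Purging: drop training samples whose label window [i, i + horizon]
    reaches the first test sample. Embargo: leave `embargo` samples unused
    between the end of the training block and the start of the test block.
    """
    start = 0
    while True:
        train_end = start + train_size        # exclusive
        test_start = train_end + embargo
        test_end = test_start + test_size     # exclusive
        if test_end > n_samples:
            break
        train = [i for i in range(start, train_end) if i + horizon < test_start]
        test = list(range(test_start, test_end))
        yield train, test
        start += test_size                    # roll the window forward

folds = list(walk_forward_splits(n_samples=100, train_size=40,
                                 test_size=20, horizon=3, embargo=1))

for train, test in folds:
    print(f"train {train[0]}..{train[-1]}  test {test[0]}..{test[-1]}")
```

With an embargo shorter than the horizon, as here, the purge does real work and trims the tail of each training block. Once the embargo is at least as wide as the horizon, the purge removes nothing in this simple layout, which is the sense in which an embargo of $\tau$ periods is safe.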

The Metrics That Matter

Two kinds. The statistical scores, AUC, the Brier score (a strictly proper scoring rule in the sense of Gneiting and Raftery, 2007), and the Expected Calibration Error, tell you whether the model understands the world. The economic scores, net P&L after every cost, the Sharpe ratio, the maximum drawdown, and the fill rate, tell you whether that understanding is worth anything.

One rule overrides the rest. Every economic number must be computed net of execution costs. A strategy with glorious gross returns and miserable execution is not a strategy. It is an optical illusion that happens to be expensive.

Part 7: Execution, Where Theory Meets the Order Book

Implementation Shortfall

You can hold the finest predictions ever assembled and still hand back every cent of edge at the moment of execution. We measure how badly we are doing this with shortfall:

$$ES = \frac{P^{\text{fill}} - B(W)}{B(W)}$$

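where $B(W)$ is a benchmark price, usually VWAP (Volume-Weighted Average Price) over your execution window. As written, the sign convention is that of a buy: positive shortfall means you paid more than the benchmark, and negative means, for once, you did better. For a sell the sign flips, since a fill above the benchmark is good news.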
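A short sketch of the measurement, with invented trades. The `side` argument is an addition of this sketch, not part of the formula above. It flips the sign for sells so that a positive number always means the fill was worse than the benchmark.

```python
def vwap(trades):
    """Volume-weighted average price of (price, volume) pairs."""
    total_volume = sum(volume for _, volume in trades)
    return sum(price * volume for price, volume in trades) / total_volume

def shortfall(fill_price, benchmark, side):
    """Shortfall relative to a benchmark, signed so positive means worse.

    side is +1 for a buy (paying more is worse) and -1 for a sell
    (receiving more is better).
    """
    return side * (fill_price - benchmark) / benchmark

# Hypothetical market trades during the execution window: (price, volume)
window_trades = [(100.00, 500), (100.10, 300), (100.20, 200)]
benchmark = vwap(window_trades)

buy_cost = shortfall(100.12, benchmark, side=+1)
sell_cost = shortfall(100.12, benchmark, side=-1)

print(f"VWAP: {benchmark:.2f}")
print(f"buy shortfall:  {buy_cost * 1e4:.2f} bp")
print(f"sell shortfall: {sell_cost * 1e4:.2f} bp")
```

The same fill price is a cost of about five basis points for the buyer and a gain of the same size for the seller, which is why the side has to travel with every shortfall number you log.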

Transaction Cost Analysis

The total cost decomposes into its honest parts:

$$C = C_{\text{fees}} + C_{\text{spread}} + C_{\text{impact}} + C_{\text{slippage}}$$

Market impact is often modeled by a square-root law, in which the cost grows with the square root of your order size relative to the market's normal traded volume, not linearly as naive intuition expects. This is one of the most stubbornly universal regularities in all of market microstructure, holding across equities, futures, FX, options, and even crypto, across different decades, and seemingly untouched by the rise of high-frequency trading (Tóth et al., 2011). Recent large-scale empirical work, including a near-complete survey of the Tokyo Stock Exchange, continues to confirm it and to argue that the law is mechanical, a property of order flow and liquidity rather than of information (Sato and Kanazawa, 2024). The practical lesson is old and reliable: try to move a lot, quickly, and the market charges you a premium for the disturbance.
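In its common textbook form the law reads: impact is roughly $Y \cdot \sigma \cdot \sqrt{Q/V}$, with $Q$ the order size, $V$ the typical daily volume, $\sigma$ the daily volatility, and $Y$ a constant of order one. The sketch below uses that form with $Y = 1$ and invented inputs. The right constant for any given market is an empirical matter, and this is not a calibrated cost model.

```python
import math

def sqrt_impact(order_size, daily_volume, daily_volatility, y=1.0):
    """Estimated impact as a fraction of price: y * sigma * sqrt(Q / V)."""
    if order_size < 0 or daily_volume <= 0:
        raise ValueError("order_size must be >= 0 and daily_volume > 0")
    return y * daily_volatility * math.sqrt(order_size / daily_volume)

# Hypothetical stock: 2% daily volatility, one million shares a day
sigma, volume = 0.02, 1_000_000

small = sqrt_impact(1_000, volume, sigma)   # 0.1% of daily volume
large = sqrt_impact(4_000, volume, sigma)   # four times the size

print(f"small order: {small * 1e4:.1f} bp")
print(f"large order: {large * 1e4:.1f} bp")
print(f"ratio: {large / small:.2f}")
```

Quadrupling the order only doubles the estimated impact per share. That is gentler than a linear rule would suggest, though the total bill still grows faster than the order does, because the higher per-share cost applies to four times as many shares.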

The Feedback Loop

The crucial habit is this. Your cost assumptions should be corrected by what execution actually costs you. If the model budgeted 5 bp and the fills keep arriving at 8 bp, then the net returns behind $\mu_t^+$ and $\mu_t^-$ are too rosy, $\pi_t^*$ is set too low, and you are overtrading on an edge that was never there. Update the conservative cost buffer $\epsilon_c$, the extra allowance you add on top of modeled costs, from the realized shortfall, and let reality keep the model honest, which is the only thing reality is reliably good for.
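One simple way to implement the correction is to nudge the buffer toward the observed gap between realized and assumed costs, a little at a time. The exponential smoothing and the step size below are assumptions of this sketch, chosen so that a single bad day does not rewrite the cost model. All figures are in basis points.

```python
def update_cost_buffer(buffer, assumed_cost, realized_costs, step=0.2, floor=0.0):
    """Move the buffer part of the way toward (mean realized - assumed) cost.

    step controls how fast the buffer adapts; floor keeps it from going
    negative when fills happen to come in cheaper than assumed.
    """
    if not realized_costs:
        return buffer
    gap = sum(realized_costs) / len(realized_costs) - assumed_cost
    return max(floor, buffer + step * (gap - buffer))

# The model budgets 5 bp; recent fills have cost about 8 bp
assumed = 5.0
recent_fills = [8.0, 7.5, 8.5]

after_one = update_cost_buffer(0.0, assumed, recent_fills)

buffer = 0.0
for _ in range(50):   # the same gap persisting over fifty review periods
    buffer = update_cost_buffer(buffer, assumed, recent_fills)

print(after_one, buffer)
```

After one review the buffer has absorbed a fifth of the 3 bp gap. If the gap persists, the buffer settles at the full 3 bp, at which point the costs the threshold sees match the costs the market is charging.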

Part 8: Putting It All Together

The Pipeline, Start to Finish

Here is the whole machine running in real time, one decision at a time:

For each decision time t:
    1. UPDATE FEATURES
       - Compute all rolling statistics using only data up to t
       - No future information allowed

    2. GENERATE PREDICTION
       - Feed features into calibrated model
       - Output: probability p̂_t and magnitude estimates μ̂⁺, μ̂⁻

    3. COMPUTE THRESHOLD
       - Calculate π*_t = (μ̂⁻ + λ_t) / (μ̂⁺ + μ̂⁻)
       - λ_t depends on current volatility and risk limits

    4. DECIDE
       - If p̂_t > π*_t: proceed to execution
       - Otherwise: no action

    5. EXECUTE (if trading)
       - Select order type based on urgency and liquidity
       - Use TWAP/VWAP/POV algorithm to minimize impact
       - Respect participation limits

    6. RECORD AND LEARN
       - Log fill price, shortfall, latency
       - Update cost buffer if deviations are systematic

The Baseline Ladder, or Why Complexity Must Pay Its Way

Before you let a sophisticated model anywhere near real money, make it climb a ladder and prove it deserves the rung it is standing on.

Level	Strategy	What it tests
1	Random timing	Sanity check. Anything should beat a coin.
2	Simple momentum with fixed costs	A basic heuristic floor
3	Statistical timing only	Whether temporal patterns earn their keep
4	Gradient boosting with features	A strong tabular baseline
5	Sequence model (transformer or RNN)	Only if it beats Level 4 stably

The rule has no exceptions: never deploy a complex model unless it beats the simpler one net of costs, across several assets and several stretches of time. Sophistication is not a virtue. It is a cost, and like every other cost in this article it has to be earned back before it counts as profit.

Part 9: What Goes Wrong, and How to Stop It

Most timing systems do not die of bad luck. They die of one of a handful of recurring, avoidable mistakes.

Problem	How it shows up	The fix
Leakage	Glorious backtest, dismal live results	Audit every feature; enforce purging and embargo
Multiple testing	"Discovered" patterns that refuse to replicate	Control the false discovery rate; demand real effect sizes
Cost neglect	Profitable on paper, bleeding in practice	Use executable fills; keep an adaptive cost buffer
Miscalibration	Systematic over- or under-trading	Watch the ECE; recalibrate on fresh data
Regime change	Works, then abruptly stops working	Re-estimate on a rolling basis; detect regimes

What This Framework Does Not Promise

Let us be perfectly clear, because the brochures never are: no method guarantees profit. Anyone who tells you otherwise is selling something, probably to several people at once.

What the framework does guarantee is more modest and more useful. If you see good performance, it is less likely to be a mirage conjured from data snooping or cost-free fantasy. The danger is not vague, it is quantifiable: run enough strategy variants over the same history and a glorious backtest becomes almost guaranteed by luck alone, which is precisely the quantity the probability of backtest overfitting was invented to measure (Bailey et al., 2017). Your decision rule is interpretable and auditable, so you can explain exactly why you traded. And you have an honest mechanism for updating your beliefs as the world changes, which it will, without warning and without apology.

Part 10: Practical Takeaways

For the Individual Investor

Know your costs before you fall in love with an idea. The spread, the fees, the slippage: total them up, because that number is the height of the wall every edge has to clear. Demand calibration, not just accuracy. When someone boasts of "70% accuracy," ask the only question that matters: are the predicted probabilities actually true? Use the threshold. Even without a model, you can estimate your average win and loss from your own trading history, add a margin of safety, and compute

$$\pi^* = \frac{\mu^- + \lambda}{\mu^+ + \mu^-}$$

to get a principled floor for how sure you should be. And keep a healthy skepticism toward temporal folklore. "The market rises on Mondays" may have been true once, the way many things were true once, without surviving daylight or the years.

For the Quantitative Researcher

Report everything net of costs, because gross returns are a flattering lie and you should refuse to tell it. Use the baseline ladder, and make your clever models earn their complexity against dumb ones. Calibrate before you threshold, since for actual decisions the ECE matters more than the AUC. And document obsessively, every data source, every preprocessing step, every feature definition, every embargo rule and execution assumption, so that a future version of you, or a skeptical colleague, can rebuild the result and watch it hold.

For the Portfolio Manager

The threshold is not a constant, and treating it as one leaves money on the table. $\pi_t^*$ should move with volatility, liquidity, and the risk budget. Execution quality is itself alpha: two managers running identical models can post wildly different P&L on the strength, or weakness, of their fills alone, so measure shortfall and fight for it. And audit the loop relentlessly. Confirm that predicted probabilities match realized frequencies, and that your assumed costs match the ones the market is actually charging you.

Conclusion: From Signals to Decisions

The single idea beneath all of this is easy to state and easy to forget: timing is not pattern recognition. It is decision-making under uncertainty, against friction, with money at stake.

Hold that idea and the consequences fall into place. Statistical significance is necessary and nowhere near sufficient; what you need is economic significance after costs. Probability estimates must be calibrated, not merely accurate, because the decision rule weighs them directly against the threshold. The bar for action must be derived from your costs and your risk, not plucked from the air because it felt right. And execution is not a clerical afterthought. It is the ground on which alpha lives or, far more often, quietly dies.

The formula

$$\pi_t^* = \frac{\mu_t^- + \lambda_t}{\mu_t^+ + \mu_t^-}$$

holds the whole philosophy in a single fraction: how sure you need to be depends entirely on what stands to be won and lost. Follow the procedures here, dependence-aware statistics, leakage-safe validation, cost-conscious evaluation, an explicit threshold, and you can build timing systems that are at once scientifically credible and tradeable in the real world.

Or, just as valuable and far cheaper, you can kill a bad idea early, in the quiet of a backtest, before it ever has the chance to bill you for the lesson.

Glossary

Term	Definition
Basis point (bp)	One hundredth of a percent, 0.01%
Calibration	The property that predicted probabilities match actual frequencies
ECE	Expected Calibration Error, a measure of how well-calibrated a model is
Embargo	A gap between training and test data that prevents information leakage
Filtration ($\mathcal{F}_t$)	All information available at time $t$
HAC	Heteroskedasticity and Autocorrelation Consistent, a family of robust standard errors
Log return	$\log S_t - \log S_{t-1}$; returns that add cleanly over time
Market impact	The price move caused by your own trading
Purging	Removing training samples whose labels overlap the test period
Shortfall	The gap between your execution price and a benchmark
Slippage	Unfavorable price movement between decision and execution
VWAP	Volume-Weighted Average Price, a common execution benchmark

References

Classical foundations

Newey, W. K., and West, K. D. (1987). A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. Econometrica, 55(3), 703–708.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Series B, 57(1), 289–300.
Gneiting, T., and Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477), 359–378.

Validation and the multiple-testing problem in finance

López de Prado, M. (2018). Advances in Financial Machine Learning. Hoboken, NJ: Wiley. (Purging, embargo, walk-forward validation, and combinatorial purged cross-validation.)
Bailey, D. H., Borwein, J., López de Prado, M., and Zhu, Q. J. (2017). The Probability of Backtest Overfitting. Journal of Computational Finance, 20(4), 39–69.
Harvey, C. R., Liu, Y., and Zhu, H. (2016). … and the Cross-Section of Expected Returns. Review of Financial Studies, 29(1), 5–68.

Models and probability calibration

Chen, T., and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems, 30, 3149–3157.
Grinsztajn, L., Oyallon, E., and Varoquaux, G. (2022). Why Do Tree-Based Models Still Outperform Deep Learning on Typical Tabular Data? Advances in Neural Information Processing Systems, 35 (Datasets and Benchmarks Track).
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), PMLR 70, 1321–1330. (Temperature scaling, ECE, and the overconfidence of modern models.)

Execution and market impact

Tóth, B., Lempérière, Y., Deremble, C., de Lataillade, J., Kockelkoren, J., and Bouchaud, J.-P. (2011). Anomalous Price Impact and the Critical Nature of Liquidity in Financial Markets. Physical Review X, 1(2), 021006.
Sato, Y., and Kanazawa, K. (2024). Does the Square-Root Price-Impact Law Belong to the Strict Universal Scalings? Quantitative Support from a Complete Survey of the Tokyo Stock Exchange Market. Preprint. (Recent large-scale empirical confirmation of the square-root impact law.)

This article is for educational and research purposes only. It is not investment advice, and the past performance of any strategy does not guarantee future results.
