z-of-a.

We built a GA to predict the S&P 500. Then we built the machine that killed it.

A 70.5% accurate model meets a leakage audit, a permutation test, and a null distribution that refuses to move. The model loses. That's the point.

Last week we had a genetic algorithm that found 70.5% accuracy predicting S&P 500 direction. This week we learned that number was well within our GA’s overfitting floor.

This is the story of a hypothesis that looked promising, passed the eye test, and died under its first real examination. More importantly, it’s about why we built the firing squad before we built the trading system — and how that’s the kind of approach that will save us from ourselves.

The setup

We’ve been building an evolutionary system in Elixir that searches for relationships between macro signals and market direction. The genome is simple: five signal weights and a threshold.

if (w_gold × gold + w_oil × oil + w_sentiment × sentiment + w_vix × vix + w_wiki × wiki) > threshold
  then UP else DOWN

The GA evolves populations of these weight vectors using tournament selection, crossover, and mutation. Rule 30 cellular automata generate the initial population. The whole thing runs on the BEAM.
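As a minimal sketch of that loop, in Python rather than Elixir since the production code isn't shown here; the population size, tournament size, mutation rate, parameter count, and the bit-to-weight mapping are all invented for illustration:

```python
import random

# Illustrative sketch of the evolutionary loop, not the production Elixir code.
# A genome is a few signal weights plus a threshold (six parameters here, an
# illustrative count); fitness is directional accuracy; the initial population
# is seeded from the center column of a Rule 30 cellular automaton.

def rule30_bits(n, width=101):
    """Stream n bits from the center cell of a Rule 30 cellular automaton."""
    cells = [0] * width
    cells[width // 2] = 1
    out = []
    for _ in range(n):
        out.append(cells[width // 2])
        # Rule 30: new = left XOR (center OR right), with wraparound edges.
        cells = [cells[i - 1] ^ (cells[i] | cells[(i + 1) % width])
                 for i in range(width)]
    return out

def bits_to_genome(bits, n_params=6, bits_per=16):
    """Map 16-bit chunks of the CA stream to floats in [-1, 1]."""
    genome = []
    for k in range(n_params):
        chunk = bits[k * bits_per:(k + 1) * bits_per]
        val = int("".join(map(str, chunk)), 2)
        genome.append(val / (2 ** bits_per - 1) * 2 - 1)
    return genome

def accuracy(genome, X, y):
    """Fraction of days where sign(w · x - threshold) matches the label."""
    *w, threshold = genome
    hits = sum((sum(wi * xi for wi, xi in zip(w, x)) > threshold) == bool(label)
               for x, label in zip(X, y))
    return hits / len(y)

def evolve(X, y, pop_size=20, generations=30, seed=0):
    """Tournament selection + uniform crossover + Gaussian mutation."""
    rng = random.Random(seed)
    stream = rule30_bits(pop_size * 6 * 16)
    pop = [bits_to_genome(stream[i * 96:(i + 1) * 96]) for i in range(pop_size)]
    for _ in range(generations):
        scored = [(accuracy(g, X, y), g) for g in pop]
        def tournament():
            return max(rng.sample(scored, 3))[1]  # best of 3 random picks
        nxt = []
        while len(nxt) < pop_size:
            a, b = tournament(), tournament()
            child = [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]
            nxt.append([c + rng.gauss(0, 0.1) if rng.random() < 0.2 else c
                        for c in child])
        pop = nxt
    return max(pop, key=lambda g: accuracy(g, X, y))
```

The choice of sequential generations here is only for brevity; the actual system runs concurrently on the BEAM.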

These signals are deliberately simple. Gold, oil, VIX, and sentiment almost certainly don’t predict daily S&P 500 direction; if they did, someone would have found and traded away that edge long ago. The goal of this phase isn’t to find an edge. It’s to build a methodology that we trust to tell us whether an edge exists. If we can’t reliably distinguish signal from noise on a simple problem, we have no business running a complex one.

Previously, we’d validated the search mechanism on synthetic data — planted a hidden rule, let the GA find it blind, got 96.7% recovery. That proved the engine works. The question was whether the validation pipeline on the other side — the pipeline that kills bad results — works just as well.
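The planted-rule check can be sketched like this; the hidden weights, dataset size, and the random-search stand-in for the GA are all invented for illustration, so it won't reproduce the 96.7% figure exactly:

```python
import random

# Sketch of the synthetic-data validation: plant a hidden threshold rule,
# generate labels from it, and confirm a blind search recovers high accuracy.
# Random search stands in for the GA here; all constants are assumptions.

def plant_and_recover(n_days=120, n_candidates=2000, seed=7):
    rng = random.Random(seed)
    true_w = [0.8, -0.6, 0.0, 0.3, 0.1]   # the hidden rule (arbitrary choice)
    X = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(n_days)]
    y = [sum(w * x for w, x in zip(true_w, row)) > 0.0 for row in X]

    best = 0.0
    for _ in range(n_candidates):
        # Candidate model: random weights plus a random threshold.
        w = [rng.uniform(-1, 1) for _ in range(5)]
        t = rng.uniform(-1, 1)
        acc = sum((sum(wi * xi for wi, xi in zip(w, row)) > t) == label
                  for row, label in zip(X, y)) / n_days
        best = max(best, acc)
    return best
```

When a rule is actually planted, the recovered accuracy climbs well above anything the same search achieves on random labels, which is exactly the contrast the rest of this post is about.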

The promising result

First run: 61 trading days. The GA converged on a structure with negative oil and negative VIX weights. All 10 survivors agreed. Accuracy: 70.5%.

That number feels good. Twenty points above random. Clean convergence. A story you can tell yourself: the market in this window was an energy-and-volatility regime, and the GA found it.

We nearly started building the next phase — an island model using OTP to test whether independent populations would converge on the same structure. Five GenServers, ring topology migration, the whole distributed evolution setup. It would have been a beautiful piece of Elixir architecture.

We didn’t build it. First, we checked the data.

The bug we almost didn’t catch

Before testing whether the result was real, we ran a data integrity audit. Phase 0. The unglamorous one.

It found that our gold and oil normalization had look-ahead bias. The normalize_price function computed min/max over the entire history, including future dates. For each day T, the normalized value depended on prices from days T+1 through the end of the dataset.

How bad was it? Gold on day 6, raw price $4,000.30:

  • With full-window normalization: -0.923
  • With causal normalization (only data available on day 6): +0.910

The same price, at the same moment, flipping from near-minimum to near-maximum depending on whether the system could see the future. We fixed it by switching to fixed anchors — center and scale values chosen from domain knowledge rather than computed from data. The same approach VIX already used.
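In miniature, the bug and both fixes look like this (made-up prices; the original `normalize_price` isn't shown, so `leaky_normalize` simply mirrors the described behavior):

```python
# Sketch of the look-ahead bug and two repairs, with invented prices.

def leaky_normalize(prices, t):
    """BUG: min/max over the ENTIRE history, so day t sees days t+1..end."""
    lo, hi = min(prices), max(prices)
    return (prices[t] - lo) / (hi - lo) * 2 - 1

def causal_normalize(prices, t):
    """Fix #1: min/max over data available on day t only."""
    window = prices[:t + 1]
    lo, hi = min(window), max(window)
    if hi == lo:
        return 0.0
    return (prices[t] - lo) / (hi - lo) * 2 - 1

def anchored_normalize(price, center, scale):
    """Fix #2 (the approach adopted): fixed anchors from domain knowledge,
    never computed from the data."""
    return (price - center) / scale

# A price that is the running maximum on day 6, but sits near the all-time
# minimum once later prices are (wrongly) allowed into the window:
prices = [100, 101, 102, 103, 104, 105, 110, 200, 300, 400]
leaky = leaky_normalize(prices, 6)    # ≈ -0.93: "near the minimum"
causal = causal_normalize(prices, 6)  # +1.0: "at the maximum so far"
```

The same sign flip as in the audit: the leaky version calls the day-6 price a near-minimum only because it has already seen the rally that comes after it.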

There were two other findings: sentiment was always 0.0 in backtesting mode (the API was only called for live predictions), and Wikipedia pageview data was fetched but never wired into the historical pipeline. So the GA was effectively running on three signals, not five, with two weights free to drift meaninglessly.

After fixing normalization and rerunning, the GA still hit 70.5%. The data was now clean. If this was an artifact, it wasn’t a simple data leak — it was something more insidious.

So we built the test that could actually kill it.

The test that wasn’t optional

The permutation test asks one question: can the GA achieve this accuracy on random data?

Method: take the real features, shuffle only the target labels (UP/DOWN), and rerun the entire identical GA — same population size, same generations, same selection and mutation. Not rescoring a fixed model; rerunning the full evolutionary search from scratch. Do this 200 times. Build a distribution of what accuracy looks like when there’s no signal to find.
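A stripped-down version of that harness, with a cheap random search standing in for the GA (the real pipeline reruns the full evolutionary search; sizes and seeds here are illustrative):

```python
import random

# Minimal permutation-test harness, a sketch rather than the production code.
# `search` is a stand-in for the full GA rerun; the key property is that the
# SAME search runs on the real labels and on every shuffled copy.

def search(X, y, n_candidates=300, seed=0):
    """Best training accuracy found by random search over threshold rules."""
    rng = random.Random(seed)
    n = len(X[0])
    best = 0.0
    for _ in range(n_candidates):
        w = [rng.uniform(-1, 1) for _ in range(n)]
        t = rng.uniform(-1, 1)
        hits = sum((sum(wi * xi for wi, xi in zip(w, x)) > t) == bool(label)
                   for x, label in zip(X, y))
        best = max(best, hits / len(y))
    return best

def permutation_test(X, y, observed, n_perm=200, seed=1):
    """Shuffle labels, rerun the search from scratch, repeat.

    Returns (p-value, null mean): the fraction of no-signal reruns that match
    or beat `observed`, and the average accuracy found on shuffled labels.
    """
    rng = random.Random(seed)
    null = []
    for i in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)              # only the labels move
        null.append(search(X, shuffled, seed=i))
    p = sum(acc >= observed for acc in null) / n_perm
    return p, sum(null) / len(null)
```

The null mean returned here is the overfitting floor: what the search "discovers" when there is, by construction, nothing to discover.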

If 70.5% sits well above that distribution, maybe there’s something real. If it sits inside the distribution, the GA is just a very sophisticated coin-flipper.

Result on 61 days: the null distribution averaged 63.8%. Our 70.5% (43/61) had a p-value of 0.115. Not significant. But close enough to squint at — 11.5% of random runs matched or beat our result.

That null average should give you pause. 63.8% on random labels? That’s the search effect: given enough generations and enough candidate weight vectors, the GA will find patterns in pure noise. This is why raw accuracy is meaningless without a null baseline.

This is the danger zone. The number is high enough to feel meaningful and marginal enough to rationalize. “Maybe with more data…”

So we got more data.

Result on 174 days: accuracy dropped to 61.5% (107/174). The null distribution tightened to 60.4%. P-value: 0.335. Dead center of the noise.

The progression tells the whole story. On a small window, the GA finds what looks like a pattern. On a larger window, the “pattern” washes out while the overfitting floor holds steady. This is textbook: apparent edge shrinks with more data. The signal was never there.

The overfitting floor

The permutation test revealed the number we actually needed: the GA’s overfitting floor.

On datasets this small, a 5-parameter model evolved over 100 generations can reliably “discover” ~60% accuracy even when the labels are random. On 61 data points, the floor was about 64%. On 174 points, it tightened to 60%. Without that baseline, 70.5% looks like signal. With it, it looks like search.
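A quick way to see why the floor tightens with more data: measure the best accuracy a blind search can reach on pure coin-flip labels at both window sizes. Everything below is synthetic; only the 61 and 174 day counts come from this post.

```python
import random

# Why the floor shrinks as the dataset grows (sketch): binomial noise around
# 50% has standard deviation ~ 1 / (2 * sqrt(n)), so the best-of-many-tries
# accuracy on RANDOM labels is higher on small windows than large ones.

def overfitting_floor(n_days, n_candidates=500, n_trials=5):
    """Average best training accuracy a blind search finds on coin-flip labels."""
    total = 0.0
    for seed in range(n_trials):
        rng = random.Random(seed)
        labels = [rng.randint(0, 1) for _ in range(n_days)]
        best = 0.0
        for _ in range(n_candidates):
            # Each candidate is an independent random predictor; trying many
            # of them mimics a search exploring weight space for generations.
            acc = sum(rng.randint(0, 1) == lab for lab in labels) / n_days
            best = max(best, acc)
        total += best
    return total / n_trials

floor_small = overfitting_floor(61)    # comparable to the 61-day window
floor_large = overfitting_floor(174)   # comparable to the 174-day window
```

Run it and the small-window floor lands meaningfully above the large-window floor, mirroring the 64% → 60% tightening the permutation test measured.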

And remember — two of those five parameters were wasted on inactive features. The GA was squeezing 60% out of three live variables: gold, oil, and VIX. Evolutionary search is powerful enough to find that much structure in almost nothing.

What survives the failure

The hypothesis is dead. That was always an acceptable outcome — the signals were scaffolding, not the structure.

What survives is the methodology. The leakage audit catches real bugs. The permutation test kills false positives. The decision gates prevent wasted work. The GA core finds signal when signal exists (96.7% on synthetic data); the validation pipeline prevents us from mistaking noise for signal.

That was the point of this iteration. Now that the pipeline is validated, the next iteration can ask harder questions with better inputs.

What changes next

Daily binary direction might be the wrong target regardless of features. Predicting whether the market goes up or down on any given day is, almost by definition, the hardest version of this problem. Weekly direction, or continuous return magnitude, gives underlying features time to express themselves.

The model might also need structural changes. A single linear threshold can’t capture regime-dependent relationships — oil matters during energy crises, not during tech rallies. But those are questions worth asking only with a methodology you trust.

The next hypothesis will change the target and the signals, not the engine. The permutation test will be waiting for it.

The meta-lesson

The interesting work — distributed evolutionary architectures, representational experiments, regime analysis — is seductive. We had a full plan for the island model, the VIX representation comparison, the walk-forward validation. It would have been genuinely fun to build.

The tedious work — shuffling labels 200 times and checking if your normalization function looks at the future — is what keeps you honest. The permutation test took a day. The leakage audit took an afternoon. Together they saved us from weeks of architecture work on a nonexistent signal.

We built the firing squad before we built the trading system. The firing squad worked. The trading system didn’t survive it. That’s exactly how this is supposed to go.
