Not too long ago, my Blogspot.com stats flagged an incoming link from the Arimaa.com forum,
Re: Measure stereotyped openings.
The link was to a post from two years ago,
Waving a Yellow Flag,
where I listed a number of chess960 start positions that, according to CCRL experiments, seemed to produce superior results for White. The Arimaa.com poster concluded,
Under the assumption that every chess960 position has exactly the same first-move advantage, by natural variation I get results just as extreme as the ones our blogger has compiled. So perhaps some positions have just been lucky for White so far, and others unlucky, with no inherent bias. At a minimum, if these are the most conclusive stats available, we have to say there is so far no statistical evidence that some positions favor White more than others.
In other words, the CCRL results match the distribution one would expect from the number of games in the CCRL sample, assuming a 55%-45% theoretical advantage for White. I asked Ichabod, the chess960 expert and professional statistician last seen on this blog in
A Better Pawn Method,
whether he agreed with the post on the Arimaa forum, and he confirmed its methodology.
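The Arimaa poster's point — that pure sampling noise under a uniform 55%-45% split can produce extreme-looking per-position results — is easy to check with a quick simulation. This is only a sketch: the per-position game count and the win/draw/loss split below are assumptions for illustration, not the actual CCRL numbers.

```python
import random

random.seed(42)

GAMES_PER_SP = 100   # hypothetical sample size per start position
POSITIONS = 960

def simulate_sp(n_games):
    """White's observed score over n_games, counting a draw as half a point.
    Every position uses the SAME true probabilities, i.e. no inherent bias."""
    # An assumed split consistent with a 55% overall score: 0.40 + 0.30/2 = 0.55
    p_win, p_draw = 0.40, 0.30
    score = 0.0
    for _ in range(n_games):
        r = random.random()
        if r < p_win:
            score += 1.0
        elif r < p_win + p_draw:
            score += 0.5
    return score / n_games

scores = [simulate_sp(GAMES_PER_SP) for _ in range(POSITIONS)]
print(f"min {min(scores):.2f}, max {max(scores):.2f}")
```

Even though every simulated position has exactly the same 55% advantage, at this sample size some positions come out well below 50% and others well above 60% — precisely the 'lucky' and 'unlucky' positions the poster describes.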
Then I asked him,
'How big would the samples have to be to reduce the extremes to their theoretical minimum?'
He answered,
It's not a question of theoretical minimum. The question is, are the results you are seeing more extreme than you would expect from random chance? If they aren't, then there isn't statistical evidence of an effect.
To clarify, you need to think about how much of an advantage you want to detect. What you're seeing is that at your current sample size you can't detect an advantage of 12% because of the random noise. What advantage do you want to detect? 5%? 1%? From that you could back-calculate a necessary sample size from the multinomial win/loss/draw distribution.
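Ichabod's back-calculation can be sketched with the usual normal approximation, treating White's score as a single proportion rather than the full win/loss/draw multinomial. That is a simplification, but a conservative one: draws have less variance than a pure win/loss split, so the true requirement is somewhat smaller.

```python
import math

def games_needed(p=0.55, margin=0.05, z=1.96):
    """Games needed so a 95% confidence interval on White's score
    has half-width `margin` (normal approximation to a proportion)."""
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

print(games_needed(margin=0.05))  # detect a 5% difference -> 381 games
print(games_needed(margin=0.01))  # detect a 1% difference -> 9508 games
```

To pin down White's score within one percentage point, a single start position needs close to ten thousand games — which is why per-position verdicts from the CCRL sample are premature.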
Last year, on my main blog, I posted a series on
Practical Evaluation,
where I learned that the value of the first move in traditional chess is a half-tempo, which is worth 0.2 times the value of a Pawn.
In the last post in the series, I learned that
A Pawn Equals 200 Rating Points,
which gives White a theoretical advantage of 56%-44% based on the half-tempo. This is very close to the observed advantage for White over millions of games.
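The arithmetic behind that 56%-44% figure: a half-tempo worth 0.2 Pawns, at 200 rating points per Pawn, is a 40-point Elo edge, and the standard Elo expected-score formula turns 40 points into roughly a 56% score.

```python
def expected_score(elo_diff):
    """Standard Elo expected score for the higher-rated side."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

half_tempo_elo = 0.2 * 200   # 0.2 Pawns at 200 Elo per Pawn = 40 Elo
print(f"{expected_score(half_tempo_elo):.3f}")  # -> 0.557, i.e. roughly 56%-44%
```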
Given that all of the start positions in chess960 confer a half-tempo advantage on White, does that mean White always has an advantage of 56%-44%? Or perhaps the half-tempo advantage isn't equivalent to 0.2 times a Pawn for all 960 start positions. I suspect the latter is true, but how will we ever find out, given that we need so many games with each start position to provide a valid sample?
I asked Ichabod, 'How many games would I have to play in another start position to know that the new W%-B% is significantly different?' He answered,
Here we get into the issue of the two different kinds of significance: statistical significance and practical significance. Statistical significance is going to determine what sample size you need to detect a given difference. Practical significance is going to determine what difference you want to detect. Say we had a bazillion games for each position, and we could show that in some positions White had an advantage of 0.000001 pawns. No one would care. We would have statistical significance but we wouldn't have practical significance.
On the other hand, let's say in certain positions we could show with statistical significance that White had a full Pawn advantage. Then people would care, and would think that position is flawed. We would have both practical and statistical significance. Now, somewhere between 0.000001 pawns and a full Pawn is a minimum advantage that would be considered a practically significant difference between the standard position and a given chess960 position.
Determining that minimum advantage is not a statistics question, it's a chess question. That is, you have to determine what fraction of a Pawn advantage is a practically significant advantage.
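As for the question I actually asked — when are two observed scores significantly different? — a simple sketch is a two-proportion z-test on the White scores of two positions, again approximating the win/loss/draw outcome with a single proportion. The game counts and scores below are made up for illustration.

```python
import math

def two_proportion_z(score1, n1, score2, n2):
    """z-statistic for the difference between two observed White scores,
    using the pooled-proportion standard error."""
    p_pool = (score1 * n1 + score2 * n2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (score1 - score2) / se

# Hypothetical numbers: SP-A scored 60% over 400 games, SP-B scored 54% over 400.
z = two_proportion_z(0.60, 400, 0.54, 400)
print(f"z = {z:.2f}")   # |z| > 1.96 would be significant at the 5% level
```

With these made-up numbers z comes out near 1.71: even a six-point gap over 400 games per position falls short of the conventional 1.96 cutoff.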
Here's an idea for killing a large amount of time: Run an engine (any chess960-enabled engine) on all 960 start positions (SPs). Record the value of the top-10 first moves for each SP. Analyze the results. Can any information be derived from the observed value of the first moves?
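For anyone attempting that experiment, the first step can be done without an engine: the 960 start positions can be generated directly from their standard numbers under Scharnagl's numbering scheme, in which SP518 is the traditional setup. The resulting piece arrays could then be fed to any chess960-capable engine. A pure-Python sketch of the derivation:

```python
# Derive White's back rank for chess960 start position number n (0-959),
# following the standard (Scharnagl) numbering scheme.
KNIGHT_PAIRS = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2),
                (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]

def start_position(n):
    rank = [None] * 8
    n, b1 = divmod(n, 4)
    rank[2 * b1 + 1] = 'B'            # light-squared bishop: b, d, f or h
    n, b2 = divmod(n, 4)
    rank[2 * b2] = 'B'                # dark-squared bishop: a, c, e or g
    n, q = divmod(n, 6)
    free = [i for i in range(8) if rank[i] is None]
    rank[free[q]] = 'Q'               # queen on the q-th remaining square
    free = [i for i in range(8) if rank[i] is None]
    for i in KNIGHT_PAIRS[n]:
        rank[free[i]] = 'N'           # knights from the pair table
    for sq, piece in zip([i for i in range(8) if rank[i] is None], 'RKR'):
        rank[sq] = piece              # rooks and king fill the rest, R-K-R
    return ''.join(rank)

print(start_position(518))  # -> RNBQKBNR, the traditional setup
```

Because the king is always placed between the rooks and the bishops always land on opposite colors, every one of the 960 strings is a legal chess960 array, and each number yields a distinct position.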