## 11 October 2014

Not too long ago, my Blogspot.com stats flagged an incoming link from the Arimaa.com forum, Re: Measure stereotyped openings. The link was to a post from two years ago, Waving a Yellow Flag, where I listed a number of chess960 start positions that, according to CCRL experiments, seemed to produce superior results for White. The Arimaa.com poster concluded,
Under the assumption that every chess960 position has exactly the same first-move advantage, by natural variation I get results just as extreme as the ones our blogger has compiled. So perhaps some positions have just been lucky for White so far, and others unlucky, with no inherent bias. At a minimum, if these are the most conclusive stats available, we have to say there is so far no statistical evidence that some positions favor White more than others.
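The forum poster's check can be approximated with a short simulation. The sketch below is not the poster's actual calculation. It assumes every one of the 960 positions has the same true win/draw/loss probabilities for White, and the 40% wins, 30% draws, and 200 games per position are illustrative assumptions, not CCRL figures. It then reports how far the luckiest and unluckiest positions drift from the common 55% score.

```python
import random

def simulate_white_scores(n_positions=960, games_per_position=200,
                          p_win=0.40, p_draw=0.30, seed=2014):
    """Simulate White's score in each position when all positions are equal.

    All default values are hypothetical; they are not taken from CCRL data.
    """
    rng = random.Random(seed)
    p_loss = 1.0 - p_win - p_draw
    scores = []
    for _ in range(n_positions):
        # Each game is worth 1 (win), 0.5 (draw) or 0 (loss) for White.
        results = rng.choices([1.0, 0.5, 0.0],
                              weights=[p_win, p_draw, p_loss],
                              k=games_per_position)
        scores.append(sum(results) / games_per_position)
    return scores

scores = simulate_white_scores()
lowest, highest = min(scores), max(scores)
print(f"true White score in every position: {0.40 + 0.5 * 0.30:.1%}")
print(f"lowest simulated score:  {lowest:.1%}")
print(f"highest simulated score: {highest:.1%}")
```

Even with no real difference between positions, the extremes land well away from 55%, which is the poster's point about natural variation.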

In other words, the CCRL results match the distribution one would expect from the number of games in the CCRL sample, assuming a 55%-45% theoretical advantage for White. I asked Ichabod, the chess960 expert and professional statistician last seen on this blog in A Better Pawn Method, if he agreed with the post on the Arimaa forum and he confirmed its methodology. Then I asked him, 'How big would the samples have to be to reduce the extremes to their theoretical minimum?' He answered,

It's not a question of theoretical minimum. The question is, are the results you are seeing more extreme than you would expect with random chance? If they aren't, then there isn't statistical evidence of an effect.

To clarify, you need to think about how much of an advantage you want to detect. What you're seeing is that at your current sample size you can't detect an advantage of 12% because of the random noise. What advantage do you want to detect? 5%? 1%? From that you could back calculate a necessary sample size from the multinomial win/loss/draw distribution.
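As a rough illustration of that back-calculation, here is a minimal sketch using the normal approximation to the mean of a win/draw/loss distribution. Everything numeric in it is an assumption: the 40%/30%/30% baseline, the 5% significance level, the 80% power, and the reading of "advantage" as a shift in White's expected score measured in percentage points. The function name `games_needed` is hypothetical.

```python
import math
from statistics import NormalDist

def games_needed(delta, p_win=0.40, p_draw=0.30, alpha=0.05, power=0.80):
    """Approximate games needed to detect a shift `delta` in White's score.

    One-sample, two-sided z-test against a known baseline score.
    Baseline probabilities, alpha and power are illustrative assumptions.
    """
    mean = p_win + 0.5 * p_draw
    # Per-game variance of the score (1, 0.5 or 0) under the baseline.
    variance = p_win * 1.0 + p_draw * 0.25 - mean ** 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

for delta in (0.12, 0.05, 0.01):
    print(f"detect a {delta:.0%} shift: about {games_needed(delta)} games")
```

Because the required count scales with 1/delta squared, halving the difference you want to detect roughly quadruples the number of games needed for each start position.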

Last year, on my main blog, I posted a series on Practical Evaluation, where I learned that the value of the first move in traditional chess is a half-tempo, which is worth 0.2 times the value of a Pawn. In the last post in the series, I learned that A Pawn Equals 200 Rating Points, which gives White a theoretical advantage of 56%-44% based on the half-tempo. This is very close to the observed advantage for White over millions of games.
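The arithmetic behind the 56%-44% figure can be checked in a few lines. This sketch assumes the standard Elo logistic formula for expected score; the original series may have used a different table or approximation, so treat it as a plausibility check.

```python
def expected_score(rating_diff):
    """Standard Elo expected score for the side that is `rating_diff` points stronger."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))

half_tempo_in_pawns = 0.2      # value of the first move, from the series
rating_points_per_pawn = 200   # 'A Pawn Equals 200 Rating Points'
rating_diff = half_tempo_in_pawns * rating_points_per_pawn  # 40 points

white = expected_score(rating_diff)
print(f"White: {white:.0%}, Black: {1 - white:.0%}")
```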

Given that all of the start positions in chess960 confer a half-tempo advantage on White, does that mean White always has an advantage of 56%-44%? Or perhaps the half-tempo advantage isn't equivalent to 0.2 times a Pawn for all 960 start positions. I suspect the latter is true, but how will we ever find out, given that we need so many games with each start position to provide a valid sample?

I asked Ichabod, 'How many games would I have to play in another start position to know that the new W%-B% is significantly different?' He answered,

Here we get into the issue of the two different kinds of significance: statistical significance and practical significance. Statistical significance is going to determine what sample size you need to detect a given difference. Practical significance is going to determine what difference you want to detect. Say we had a bazillion games for each position, and we could show that in some positions White had an advantage of 0.000001 pawns. No one would care. We would have statistical significance but we wouldn't have practical significance.

On the other hand, let's say in certain positions we could show with statistical significance that White had a full Pawn advantage. Then people would care, and would think that position is flawed. We would have both practical and statistical significance. Now, somewhere between 0.000001 pawns and a full Pawn is a minimum advantage that would be considered a practically significant difference between the standard position and a given chess960 position.

Determining that minimum advantage is not a statistics question, it's a chess question. That is, you have to determine what fraction of a Pawn advantage is a practically significant advantage.

Here's an idea for killing a large amount of time: Run an engine (any chess960-enabled engine) on all 960 start positions (SPs). Record the value of the top-10 first moves for each SP. Analyze the results. Can any information be derived from the observed value of the first moves?

GeneM said...

I have studied some of the CCRL results for the various chess960 start setups. I do not believe the results can be taken to tell us anything about the relative size of White's (unfair) advantage between the different setups.

One reason I say this is because Fritz, when deprived of its opening book (as the chess960 engines are), does a lousy job of identifying the strongest opening moves and opening systems, especially at the rather fast rate of play used by CCRL. Therefore there is every reason to believe that the CCRL data does not involve strong opening moves (by grandmaster standards as we know them for the traditional setup). And we cannot ascertain anything about White's true inherent advantage when both colors are playing sub-grandmaster quality opening moves.

Another reason we cannot draw conclusions about White's advantages in chess960 setups is that within some mirror (or reciprocal) setup pairs, the size of White's advantage varies too greatly. I cannot recall which setup pair, but in one pair White had a large win % advantage when the kings start near column-a, but Black had a sizable win % advantage when the kings instead start near column-h. I do not believe the slight castling asymmetry can account for the large and opposite directions of the statistical advantage that CCRL reports.

Still, I certainly think the CCRL data is cool. The DRAW RATES data might be more reliable.

GeneM, 2014/Oct/20

Mark Weeks said...

GeneM - I completely agree with you on most points. Opening play is based on chess logic and engines don't select their moves through logic.

Another point worth noting is that engines don't select their openings in the traditional start position. Their repertoires ('books') are developed by strong players who take into account an engine's strengths and weaknesses.

Re the varying success rates for twins, most (all?) of these anomalies arise because castling is possible on the first move for one of the twins. - Mark