25 March 2023

Evolving Evaluations

The previous post Myth No.6 - 'Forced Wins for White' (March 2023) introduced 'the Molas study', a data scientist's effort 'to find if there’s a [chess960] *start position* that's better than the others'. One of the datasets used in the study was:-
Stockfish evaluation at depth ~40 for all the starting positions

This is also known as the 'Sesse' resource and I gave its URL in the post. The Molas study concluded,

Stockfish evaluations don’t predict actual winning rates for each variation

This didn't surprise me. If you consider that each start position (SP) leads to a mega-zillion possible games, while Sesse reduces each SP to a single two-digit number, it would be much more surprising to find a meaningful correlation between an SP's W-L-D percentages and its Sesse value.
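
To make that concrete, here is a minimal sketch, in Python with made-up numbers, of how such a correlation could be tested; the real test would pair all 960 Sesse evaluations with the observed score percentages:

    # Sketch of the correlation test; the data below is illustrative only.
    # 'sesse_evals' holds the engine evaluation for each SP; 'white_scores'
    # holds White's observed score (W + D/2, as a percentage) for the same SPs.
    from scipy.stats import spearmanr

    sesse_evals = [0.57, 0.20, 0.33, 0.08, 0.25]    # hypothetical values
    white_scores = [52.1, 54.0, 50.3, 53.2, 51.7]   # hypothetical values

    rho, p_value = spearmanr(sesse_evals, white_scores)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.2f}")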

I discussed the Sesse numbers once before in A Stockfish Experiment (February 2019). That post mentioned another discussion, What's the Most Unbalanced Chess960 Position? (chess.com; Mike Klein; March 2018 / February 2020). Fun Master (FM) Mike observed,

Let's now take the most extreme case the other way -- the position where Sesse claims White enjoys the most sizable advantage. The lineup BBNNRKRQ delivers a whopping +0.57 plus for the first move. The advantage is so marked that some chess960 events may even jettison this arrangement as a possible option (a total of four positions are +0.50 or better for White, but none are as lopsided as this one).

That position, also known as 'SP080 BBNNRKRQ', has received some notoriety thanks to Sesse, so I decided to investigate further. I downloaded the SP080 file from the CCRL (see link in the right sidebar), loaded it into SCID, and discovered that it contained 554 games. SCID gave me percentages for White's first moves, which I copied into the following chart.
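
For anyone who prefers a script to SCID, here is a rough sketch of the same first-move tally using the python-chess library. The file name 'sp080.pgn' is my assumption; use whatever name the CCRL download carries:

    # Tally White's first moves and score percentages from a PGN file.
    # python-chess reads the chess960 start position from the FEN header.
    from collections import defaultdict
    import chess.pgn

    stats = defaultdict(lambda: [0, 0.0])             # move -> [games, points]
    points = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}

    with open("sp080.pgn") as f:
        while (game := chess.pgn.read_game(f)) is not None:
            first = game.next()                       # node after White's move
            result = game.headers.get("Result")
            if first is None or result not in points:
                continue
            move = first.san()
            stats[move][0] += 1
            stats[move][1] += points[result]

    for move, (n, pts) in sorted(stats.items(), key=lambda kv: -kv[1][0]):
        print(f"{move:6} {n:4} games  {100 * pts / n:5.1f}%")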

There are 11 first moves for White listed in the top block of the chart. I then expanded the first two of those moves -- 1.g3 (65.7% overall score for White) and 1.Nd3 (59.7%) -- into the second and third blocks of the chart to see how Black has responded to those moves.

You might be wondering why I said there were 554 games in the file, but the SCID extract counts only 519 games. SCID was designed to handle the traditional start position (SP518 RNBQKBNR) and knows nothing about chess960 castling rules. SP080 allows 1.O-O on the first move, which SCID rejects. The 35 missing games (554 minus 519) are games that started 1.O-O. When I'm using SCID for a chess960 correspondence game, I have a technique to account for this anomaly, but I won't go into details here.
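
Where SCID drops those games, python-chess sees all of them. A small sketch that recovers the missing count, again assuming the file is named 'sp080.pgn':

    # Count how many games in the file open with 1.O-O -- the games SCID skips.
    import chess.pgn

    castled = total = 0
    with open("sp080.pgn") as f:
        while (game := chess.pgn.read_game(f)) is not None:
            total += 1
            first = game.next()
            if first is not None and first.san() == "O-O":
                castled += 1

    print(f"{castled} of {total} games began 1.O-O")  # expect 35 of 554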

Similarly, the charts for 1.g3 and 1.Nd3 show '[end]' as one of the first moves for Black. These are games where Black played 1...O-O on the first move. The corresponding percentage scores are among the worst for Black, showing once again that early castling is a risky strategy.

If I were playing SP080 in a correspondence game, I would analyze both 1.g3 and 1.Nd3. A promising continuation after 1.g3 is 1...c5, where White's overall score of 43.9% favors Black. Of course, I would have to look at White's second moves in this variation, where one move will appear to be superior to the others. And so on and so on.

To be useful, the SCID tool needs to be handled intelligently. I recently blundered into a wrong evaluation that I documented in The CCRL Is Unreliable (Not!) (December 2021). I'm hopeful that some day a tool will appear that rivals SCID functionality *and* that understands chess960 castling. For now, I make do with the software I have.

For a look at two more SPs where evaluations have shifted with experience, see SP864 - BBQRKRNN and SP868 - QBBRKRNN, which are both attachments to this blog. One lesson I've learned from playing chess960 for almost 15 years: nothing is set in stone.

18 March 2023

Myth No.6 - 'Forced Wins for White'

One of the first questions a new player asks upon encountering chess960 is 'Are all 960 positions fair?'. I included a statement of this concern in Top 10 Myths About Chess960 (May 2012), where one bullet said,
Some start positions are forced wins for White

Remember that I wrote this more than 10 years ago, at a time when I wasn't absolutely 100% sure that such unfair positions didn't exist; my standard response to the statement was, 'Which positions are forced wins? Please provide a specific example.' I never received a single example. Ten years later I can say with more confidence -- although still not 'absolutely 100% sure' -- that while some positions are difficult for Black to play, none of the 960 positions is lost before a single move is made.

In January, a new study, Analyzing Chess960 Data | Alex Molas | Towards Data Science (towardsdatascience.com), appeared. Its subtitle announced,

Using more than 14 million chess960 games to find if there’s a variation that's better than the others.

There is considerable knowledge presented in the study and I don't pretend to understand all of it. I might well need several posts to unravel its subtleties, so I'll start by summarizing its references; in the following discussion, '>>>' means a direct quote from 'Analyzing Chess960 Data'.

>>> 'The original post was published here...'

[NB: I'll come back to this reference later; see '(A)' below. First I need to point out that there's an important issue with terminology. When chess players use the term 'variation', they mean a sequence of play arising from a specific position; e.g. 'In this position I had two variations and I had to work out which variation was better for me.' • In the Molas study, I'm convinced that the word 'variation' refers to one of the well-defined 960 start positions that are legal for chess960. I read the subtitle of the towardsdatascience.com article as 'to find if there’s a *start position* that’s better than the others' and the title of the amolas.dev post as saying 'Discovering the best chess960 *start position*'. I won't repeat this caveat each time, but it's important and helps to understand the discussion.]

>>> 'Ryan Wiley wrote this blog post where he analyzes some data from lichess...'

>>> 'There’s also this repo with the statistics for 4.5 millions [sic] games (~4500 games per variation)...'

[NB: There's an issue with the word 'variant' here, but it's not as important as the previous 'NB'. Chess960 purists will know what I'm talking about.]

>>> 'In this spreadsheet there’s the Stockfish evaluation at depth ~40 for all the starting positions...'

>>> 'There’s also this database with Chess960 games between different computer engines. However, I’m currently only interested in analyzing human games, so I’ll not put a lot of attention to this type of games...'

>>> 'Lichess -- the greatest chess platform out -- maintains a database with all the games that have been played in their platform...'

>>> 'To do the analysis, I downloaded ALL the available Chess960 data (up until 31–12–2022). For all the games played I extracted the variation, the players Elo and the final result...'
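
[NB: This extraction step is easy to reproduce with the python-chess library. Here's a sketch under my own assumptions -- the file name is invented, and Lichess records the start position in each game's FEN header:]

    # Extract the start position (SP number), player Elos, and result
    # from a Lichess chess960 PGN dump (the file name is an assumption).
    import chess.pgn

    def extract(path):
        rows = []
        with open(path) as f:
            while (game := chess.pgn.read_game(f)) is not None:
                h = game.headers
                sp = game.board().chess960_pos()      # 0-959, or None
                rows.append((sp, h.get("WhiteElo"),
                             h.get("BlackElo"), h.get("Result")))
        return rows

    for sp, w_elo, b_elo, result in extract("lichess_chess960.pgn"):
        print(sp, w_elo, b_elo, result)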

>>> 'The scripts and notebooks to donwload [sic] and process the data are available on this repo...'

At this point the article launches into 'Mathematical framework', 'Bayesian A/B testing', [...]. This, of course, is the essence of the study and I won't go any further in this post than the toy sketch below.
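
For orientation only, here is a minimal illustration of Bayesian A/B testing in its simplest Beta-Binomial form. This is my own sketch with invented numbers, not necessarily the exact model used in the study:

    # Compare White's win rate in two SPs. With a uniform Beta(1, 1) prior,
    # the posterior for each win rate is Beta(1 + wins, 1 + losses); we
    # sample both posteriors to estimate P(rate_A > rate_B).
    import random

    def posterior_samples(wins, games, n=100_000):
        return [random.betavariate(1 + wins, 1 + games - wins)
                for _ in range(n)]

    a = posterior_samples(wins=260, games=500)    # hypothetical SP 'A'
    b = posterior_samples(wins=240, games=500)    # hypothetical SP 'B'
    prob = sum(x > y for x, y in zip(a, b)) / len(a)
    print(f"P(win rate A > win rate B) = {prob:.3f}")

Now let's get back to '(A)', where there's another key reference.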

>>> 'This post got some attention in Reddit...'

I could end the post here, but I need to make an admittedly subjective observation. There are two examples of bias in the above references.

The first bias is 'I’m currently only interested in analyzing human games'. Huge caveat here. In my not-so-humble opinion, the CCRL is the best source of chess960 opening theory. Period. Full stop. The CCRL engines are rated at least 1000 points higher than most human players on Lichess. The engines don't make simple tactical errors and they calculate deeper into every position than any human can. If there is an unfair chess960 start position, the engines will find it, just like they find errors in most games played between humans.

I can understand ignoring the engines because humans grapple with different challenges in chess960 openings, but the purpose of the study was 'to find if there’s a *start position* that’s better than the others'. Ignoring the experience of the best players on the planet is severely limiting.

The second bias is 'Lichess -- the greatest chess platform out'. The main alternative here is Chess.com. Why ignore games played on the world's largest chess platform? Maybe there's a good reason, but I can't think of one. On a personal note, last year I investigated which of the two sites would be better to continue my own chess960 correspondence play. I determined that Chess.com was more serious about eliminating human players who cheat by using engines in games with other humans. Since my goal was playing no-engine games, I went with Chess.com. How much of the Lichess data involves concealed engine use?

Biases notwithstanding, the Molas study is an important step in evaluating the fairness of all 960 positions in chess960/FRC. I'm looking forward to understanding it in more depth.