On the Kramnik vs Nakamura Dispute


Recently there has been a lot of discussion in the chess community about the dispute between two of the most famous chess players in the world, Vladimir Kramnik and Hikaru Nakamura. In this article, we will take a statistical dive into the dispute. My approach will be rather educational, so I’ll try to use other examples to explain the concepts.
What’s going on?
Some time ago, former World Chess Champion Vladimir Kramnik implied that Hikaru Nakamura’s exceptional online performance, particularly a streak of 45 wins and one draw in Chess.com’s Titled Tuesday blitz tournament, was statistically improbable and suggested potential cheating.
Kramnik’s insinuations were made through a cryptic post on his Chess.com profile, which many interpreted as targeting Nakamura.
In response, Nakamura dismissed the allegations as “garbage” and expressed disappointment over the unfounded accusations. Chess.com conducted an investigation into Nakamura’s games and found no evidence of cheating, stating that Kramnik’s accusations lacked statistical merit.
45 Wins and 1 Draw: How Unlikely Is It?
It is VERY unlikely. To put it into perspective, let’s consider a simpler example: flipping a coin. The probability of getting heads is 0.5, and so is the probability of getting tails. If you flip a coin 46 times, the probability of getting 45 heads and 1 tail (there are no draws in flipping a coin, unfortunately) is:
$$ \underset{\text{all possible positions of the single tail}}{46} \cdot \underset{\text{45 heads}}{\left(\frac{1}{2}\right)^{45}} \cdot \underset{\text{1 tail}}{\frac{1}{2}} \approx 0.0000000000006537. $$
Let’s break down the calculation:
- The probability of getting 45 heads is $\left(\frac{1}{2}\right)^{45}$, since each single head has probability $\frac{1}{2}$ and the flips are independent.
- The probability of getting 1 tail is $\frac{1}{2}$.
But getting 45 heads followed by 1 tail at the end is not the only way to get 45 heads and a tail. There are 46 possible positions for a single tail in 46 flips (the first flip, the second flip, etc.), meaning 46 different scenarios that we can put into one bag labeled “45/46 - lucky guy!”. Calculating the probability of such an event (45 heads and a tail, the coin version of 45 wins and a draw) just means adding up the probabilities of the separate scenarios (“separate” is an important word here: if the scenarios overlapped, we would need to be more careful to avoid counting the same thing twice). Each scenario has the same probability, so essentially all we need to do is multiply the probability of one scenario by 46.
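The counting argument can be checked by brute force on a smaller case. The sketch below is only an illustration: it uses 5 flips instead of 46 (an assumption made so that all sequences can be listed), counts the sequences with exactly one tail, and compares the result with the "multiply one scenario by the number of positions" shortcut.

```python
from itertools import product

n = 5  # small enough to enumerate all 2**5 = 32 sequences

# Every sequence of n fair flips is equally likely, with probability (1/2)**n.
sequences = list(product("HT", repeat=n))
one_tail = [s for s in sequences if s.count("T") == 1]

# Adding up the disjoint scenarios vs. the shortcut formula.
prob_by_enumeration = len(one_tail) / len(sequences)
prob_by_formula = n * 0.5 ** (n - 1) * 0.5

print(len(one_tail))  # number of separate scenarios
print(prob_by_enumeration, prob_by_formula)
```

Both approaches give 5 scenarios and a probability of 5/32. The same reasoning with 46 flips gives the 46 scenarios used above.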
Ok, mister, how about the situation where we have 44 heads and 2 tails? Or 43 heads and 3 tails? Or 42 heads and 4 tails? Or…? Well, you get the idea. Let’s focus on the case of two tails for a moment. We need to add up the probabilities of all the ways of getting 44 heads and 2 tails, and we need to count them properly. Luckily, there is a binomial coefficient to help us:
$$ \underset{\text{number of possible placements of 2 tails among 46 flips}}{\binom{46}{44}} \cdot \underset{\text{44 heads}}{\left(\frac{1}{2}\right)^{44}} \cdot \underset{\text{2 tails}}{\left(\frac{1}{2}\right)^{2}} \approx 0.0000000000147. $$
That is about 22 times more likely than a single tail, but still very low.
You might wonder why we split $\frac{1}{2}$ between 44 heads and 2 tails, or 45 heads and 1 tail. Why don’t we just write $\left(\frac{1}{2}\right)^{46}$? Well, we could, but then we would restrict ourselves to only one scenario: the one where heads and tails are equally likely. We want to stay flexible and allow the two outcomes to have different probabilities, as we’ll see in a moment.
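Keeping the two factors separate is what lets us write one small function for any coin, fair or not. The sketch below is a minimal illustration; the function name `binomial_probability` and the parameter `p_heads` are my own choices for this example, not anything standard.

```python
from math import comb

def binomial_probability(heads, flips, p_heads=0.5):
    """Probability of exactly `heads` heads in `flips` independent flips."""
    tails = flips - heads
    # placements * P(heads)^heads * P(tails)^tails
    return comb(flips, heads) * p_heads ** heads * (1 - p_heads) ** tails

print(binomial_probability(45, 46))  # about 6.5e-13
print(binomial_probability(44, 46))  # about 1.47e-11
```

With `p_heads=0.5` this reproduces the two numbers above; passing a different `p_heads` covers the case of an unfair coin without changing the formula.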
One might argue that Hikaru’s case is a bit different, since his chance of winning is well above 0.5 and his chance of a draw is certainly not 0.5. But even if we take this into account, the probability of getting 45 wins and 1 draw is still very low. Let’s calculate it assuming that Hikaru’s chance of winning a single game is 0.7 (extremely high, but chosen to show that the probability stays very low even then) and his chance of a draw is 0.05. The chance of a loss does not enter the formula, because the streak contains no losses. The probability of getting 45 wins and 1 draw is then higher:
$$ \underset{\text{number of possible placements of 1 draw among 46 games}}{\binom{46}{45}} \cdot \underset{\text{45 wins}}{0.7^{45}} \cdot \underset{\text{1 draw}}{0.05} \approx 0.000000246. $$
Way higher than in the case of flipping a coin, but still extremely low. This just shows how unlikely it is to get 45 wins and 1 draw in 46 games.
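The result is quite sensitive to the assumed chance of winning a single game. As a rough sketch, the code below evaluates the same kind of expression for a win probability of 0.7 and of 0.8, keeping the draw probability at 0.05. The function name `streak_probability` is hypothetical, and the per-game probabilities are illustrative assumptions, not estimates of Hikaru’s real results.

```python
from math import factorial

def streak_probability(wins, draws, losses, p_win, p_draw):
    """Multinomial probability of an exact win/draw/loss count."""
    p_loss = 1 - p_win - p_draw
    games = wins + draws + losses
    # Number of distinct orderings of the wins, draws and losses.
    arrangements = factorial(games) // (
        factorial(wins) * factorial(draws) * factorial(losses)
    )
    return arrangements * p_win ** wins * p_draw ** draws * p_loss ** losses

for p_win in (0.7, 0.8):
    print(p_win, streak_probability(45, 1, 0, p_win, p_draw=0.05))
```

This gives roughly 0.000000246 for a win probability of 0.7 and roughly 0.0001 for 0.8: raising the per-game win chance by 0.1 makes the streak about 400 times more likely, yet it remains a rare event for any fixed set of 46 games.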
What does statistics have to say?
Before we continue, let’s clarify what we mean by “statistically improbable.” In statistics, we often use the concept of a p-value to help us decide whether to reject some hypothesis (for instance, the hypothesis may be “this coin is fair” or “Hikaru is not cheating”). Before conducting the analysis we set a threshold, called the significance level, which is usually a small number. Then we observe reality, for lack of a better word: for instance, we toss a coin or let Hikaru play some games. Then we look at the outcome (the number of heads vs. tails, or the number of wins vs. losses) and ask the question: “Assuming the hypothesis is true, is the probability of observing such an outcome lower than the threshold we set?” That probability is the p-value (strictly speaking, it also counts all outcomes at least as extreme as the observed one, but the probability of the exact outcome is enough for our purposes). If the answer is “yes,” we reject the hypothesis. If this sounds complicated, wait a moment; I will show some examples.
Let’s go back to the coins and chess games to see how this works in practice. In the case of flipping a coin, let’s say we set the threshold to 0.05. Let’s flip a coin 46 times and assume that we get 45 heads and 1 tail. We already calculated the probability of this (assuming the coin is fair and has probability $\frac{1}{2}$ of landing on heads or tails) and it is equal to $0.0000000000006537$. This is way lower than 0.05, so we reject the hypothesis that the coin is fair. In other words, we did what someone without this whole math background would do, but we have a fancy name for it: hypothesis testing. We simply say “if the coin is fair, the probability of observing such an outcome is very low, so the coin is probably not fair.” On the other hand, if we get, say, 21 heads and 25 tails, the probability of observing such an outcome is about $0.1$ (the same calculation as in the case of 44 heads and 2 tails, but with $\binom{46}{21}$), which is higher than 0.05, so we do not reject the hypothesis that the coin is fair. In other words, we say: “if the coin is fair, the probability of observing such an outcome is not that low, so the coin is probably fair, or at least we cannot say that it is not fair.”
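The two coin examples can be written as a tiny test procedure. This is a sketch under the assumptions used above (46 flips, a fair coin as the hypothesis, 0.05 as the cutoff); the function names are made up for this illustration. Next to the probability of the exact outcome, it also computes the textbook two-sided p-value, which adds up all outcomes at least as far from 23 heads as the observed one.

```python
from math import comb

THRESHOLD = 0.05  # fixed before looking at the data

def exact_outcome_probability(heads, flips=46):
    """Probability of exactly `heads` heads with a fair coin."""
    return comb(flips, heads) / 2 ** flips

def two_sided_p_value(heads, flips=46):
    """Probability of a result at least as far from flips/2 as the observed one."""
    distance = abs(heads - flips / 2)
    favourable = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= distance
    )
    return favourable / 2 ** flips

for heads in (45, 21):
    exact = exact_outcome_probability(heads)
    p_value = two_sided_p_value(heads)
    verdict = "reject" if p_value < THRESHOLD else "do not reject"
    print(heads, exact, p_value, verdict)
```

For 45 heads both numbers are astronomically small and the fair-coin hypothesis is rejected; for 21 heads the exact probability is about 0.1 and the two-sided p-value is about 0.66, so the hypothesis is not rejected. In these two examples the simplified and the textbook versions agree on the verdict.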
In the case of Hikaru, we can do the same. Let’s sit him down and let him play, and let’s assume that he scores 45 wins and 1 draw in 46 games. We already calculated the probability and it is $0.000000246$ (the real value may be even smaller, since we generously gave him a 0.7 chance of winning). This is way lower than 0.05, so we reject the hypothesis that Hikaru is not cheating. In other words, we say “if Hikaru is not cheating, the probability of observing such an outcome is very low, so Hikaru is probably cheating.”
But is it really that simple?
Well, no. There are a lot of things that we need to take into account. Please note that when we talked about the experiment with Hikaru we said “Let’s sit him down and let him play”. What Vladimir did, however, was analyze 50,000 of Hikaru’s games and find in them a streak of 45 wins and 1 draw.
But, come on, is this an important distinction?
This is a COLOSSAL distinction. What Vladimir essentially asked is: “During 50,000 games (or any events with a random outcome) there will surely be some weird winning streaks (or losing streaks, or alternating win-loss-win-… sequences, or any other pattern that you can think of). What is the longest streak that we can find there?” Obviously, the longer the sequence of games, the longer the longest winning streak tends to be. In the case of Hikaru’s 50,000 games, Vladimir’s answer was 46. This is not something unusual. Here is Python code to simulate this:
```python
import random
sequence = [1 if random.random() < 0.5 else 0 for _ in range(50000)]  # 1 = win, 0 = non-win
max_streak = 0
for i in range(len(sequence)):
    if sequence[i] == 0:
        continue  # a streak has to start with a win
    zero_found = False
    current_streak = 1  # count the starting win itself
    for j in range(i + 1, len(sequence)):
        if sequence[j] == 0:
            if zero_found:
                break  # a second non-win ends the streak
            zero_found = True
        current_streak += 1
    max_streak = max(max_streak, current_streak)
print(max_streak)
```
We generate a random sequence of 1’s and 0’s and calculate the longest streak with at most one 0 in it. The results will vary from run to run, but you will get a number around 20. Keep in mind that Hikaru is one of the greatest to ever do this, so you might replace 0.5 with 0.7 in the second line to see the difference. Either way, this is not something unusual; this is just how random events work.
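To compare several win probabilities without editing the script by hand, the simulation can be wrapped in a function. The version below is a sketch with names of my own choosing (`longest_streak`, `simulate`); it uses a sliding window so that each game is visited only a couple of times, and a fixed seed so that a run can be repeated. As in the simulation above, a 0 stands for any non-win, so a loss is treated the same as a draw.

```python
import random

def longest_streak(results, max_non_wins=1):
    """Length of the longest run of games containing at most `max_non_wins` zeros."""
    best = 0
    start = 0      # left edge of the current window
    non_wins = 0   # number of zeros inside the current window
    for end, result in enumerate(results):
        if result == 0:
            non_wins += 1
        # Shrink the window from the left until it is valid again.
        while non_wins > max_non_wins:
            if results[start] == 0:
                non_wins -= 1
            start += 1
        best = max(best, end - start + 1)
    return best

def simulate(p_win, games=50_000, seed=0):
    """Longest streak in one simulated career of independent games."""
    rng = random.Random(seed)
    results = [1 if rng.random() < p_win else 0 for _ in range(games)]
    return longest_streak(results)

for p_win in (0.5, 0.7, 0.8):
    print(p_win, simulate(p_win))
```

You should see the longest streak grow quickly as the win probability goes up. Changing `seed` shows how much the answer varies between equally "honest" simulated careers.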
One other example that comes to mind is the day that a car breaks down. Let’s assume that it breaks down once every 4 years (I mean to the point that you can no longer drive it), so that the probability that it breaks down on a given day is $1/(4 \cdot 365 + 1) \approx 0.0007$ (the $+1$ accounts for the leap day), which is very unlikely. Imagine that you want to check whether the car dealership in your town is selling working cars (the analogue of “the coin is fair” or “Hikaru is not cheating”). Would you be convinced that the dealership is cheating if someone told you that the car broke down 2 times during 7 years of use? I hope not. However, if you were told that the car broke down after a month of use, you might start to think that something is fishy.
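The car example can be put into numbers as well. The sketch below assumes that each day is independent and uses the daily breakdown probability from the text; the function name `prob_at_least` and the choice of two leap days in the 7-year period are assumptions made for this illustration.

```python
from math import comb

P_BREAK = 1 / (4 * 365 + 1)  # daily breakdown probability from the text

def prob_at_least(k, days, p=P_BREAK):
    """Probability of at least `k` breakdowns in `days` independent days."""
    below = sum(
        comb(days, i) * p ** i * (1 - p) ** (days - i) for i in range(k)
    )
    return 1 - below

seven_years = 7 * 365 + 2  # assuming two leap days fall in the period
print(prob_at_least(2, seven_years))  # two or more breakdowns in 7 years
print(prob_at_least(1, 30))           # a breakdown within the first month
```

Two or more breakdowns in 7 years come out at roughly 0.52, so they are no surprise at all, while a breakdown within the first 30 days has a probability of roughly 0.02, which is where suspicion starts to make sense.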
The big difference here is the length of the period that we look at. The longer the period, the higher the chance that some unlikely things will happen. If we wait long enough, we will see some weird things happening everywhere.
Using a 46-game streak found this way to test whether Hikaru is cheating is a big mistake. This is called data snooping or p-hacking, and it is a big problem in statistics. Don’t get me wrong: it also doesn’t mean that Hikaru is not cheating. It just means that the low probability of scoring 45.5/46 points does not clarify anything. And since this is the only argument that Vladimir had against Hikaru, I feel this is the wrong way of thinking, and it may mislead many people who lack the mathematical background needed to reject such claims.
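A rough way to see how much the search matters is to count how many 46-game windows fit into 50,000 games and multiply by the probability of the streak in a single window. By linearity of expectation this gives the expected number of such windows even though they overlap, and that number is also an upper bound on the chance of seeing at least one. The sketch below uses the illustrative per-game probabilities from earlier (they are assumptions, not measurements), and the function name is hypothetical.

```python
def expected_streak_windows(p_win, p_draw=0.05, games=50_000, length=46):
    """Single-window probability and expected count of `length`-game windows
    consisting of `length - 1` wins and exactly one draw."""
    per_window = length * p_win ** (length - 1) * p_draw
    windows = games - length + 1  # number of possible starting points
    return per_window, windows * per_window

for p_win in (0.7, 0.8):
    per_window, expected = expected_streak_windows(p_win)
    print(p_win, per_window, expected)
```

With a win probability of 0.7 the expected count is about 0.012, roughly 50,000 times the single-window figure of 0.000000246; with 0.8 it is about 5, meaning several such streaks would be expected. The probability for one fixed set of 46 games and the probability of finding such a streak somewhere in a long history are very different numbers, and only the second one matches what was actually done.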
Conclusion
The dispute between Kramnik and Nakamura is a great example of how statistics can be misused. The probability of getting 45 wins and 1 draw in 46 games is very low, but on its own that number is not a good way to test whether someone is cheating. We need to be very careful when using statistics to make claims about the world.