Solving Wordle
Table of Contents
1. Introduction
2. Probability & Information
3. Entropy
4. Let's play a game
5. Code
6. Benchmark
7. Optimized Code
1. Introduction
3Blue1Brown's fantastic video Solving Wordle using Information Theory introduced me to the game of Wordle. In this article, I am going to explain what I learnt and implement a program to play this game.
Wordle is a word guessing game. You get 6 chances to guess a 5-letter word. For each guess, the game reveals the following information:
- if any letter in your guess word is at the exact same location as in the answer word, it is marked green,
- if any letter in your guess word exists in the answer word but at a different location, it is marked yellow,
- and the letters that don't exist in the answer word are marked black.
Using the revealed information, you narrow the possibilities and try to guess the correct word. As humans, we can intuitively feel that one guess might be better than another. "What exactly is a good guess?" is the main question. Or, put differently, "Which strategy for selecting guess words is the best?". Only when we answer this can we implement that strategy in code and see the result. To answer it, observe that we tend to think of the following strategies while playing the game ourselves:
- we would like to use words that have different letters than previous guesses,
- even if we know the correct letters at a few positions, it may be better to guess words with different letters at those positions, because we can then learn whether those letters exist at other positions of the answer word,
- we give priority to vowels and other letters that occur frequently across words.
There is one central theme to these thoughts: increase the amount of information we can extract from each guess. If we reuse letters from previous guesses, we get less information; if we test uncommon letters, on average we narrow down the possible words much less than if we had tested common letters. Now, we need to quantify these notions.
2. Probability & Information
Let's imagine we are playing a different game. I think of a random number between 0 and 255, and your task is to guess it. Initially, your probability of guessing the right one is \(1/256 = 0.4\%\). Suppose I say that the number is divisible by three. The probability of a number being divisible by 3 is about \(33\%\), and of not being divisible by 3 is \(67\%\). How much information did I reveal? And if I had instead said that the number is divisible by 2, how much information would I be revealing? Intuitively, you can guess that I would be revealing more information in the former case than in the latter. As another example, if I had said that the number is not greater than 400, I would not be revealing any information at all, because the probability of the number being not greater than 400 is already known to be 100%.
So, in this case where each number is equally likely to be the right one, the amount of information revealed is higher if that information shrinks the space of possible answers by a larger factor. And in general, the amount of information revealed is higher for events of lower probability and lower for events of higher probability, i.e. information is a decreasing function of probability.
Now, let's look at this in another way. A number between 0 and 255 can be represented in binary by 8 digits (8 binary digits = 8 bits), e.g. 56 is 00111000. And if I say that the number is divisible by 2, it is the same as saying that the last binary digit is 0. We need 8 bits to represent the number, and I revealed one of those 8 bits. So, we can say I revealed 1 bit of information. It shouldn't feel surprising that we can measure information in bits. After all, all the information (documents, images, audio, …) you store in your computer is stored in binary, and its size is measured in bits, bytes (8 bits = 1 byte), MBs (1 megabyte = \(2^{20}\) bytes), GBs, and so on.
Revealing that the number is divisible by 2 reduced the possible answers to 1/2 and gave 1 bit of information. If I said that the last two binary digits are 10, the possible answers would be reduced to 1/4, and this would be 2 bits of information. And if I revealed the right answer, it would reduce the possibilities to 1 out of all possible 256, i.e. 8 bits of information. So, the information \(I\) revealed by an event of probability \(P\) is:
\begin{equation} I = -\log_2 P \end{equation}
Indeed, \(-\log_2(1/2) = 1\) bit and \(-\log_2(1/4) = 2\) bits. Here, \(P\), the probability of the event, equals the fraction to which the possibilities are reduced (because in this case each number is equally likely).
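As an aside, the 8-bit binary strings used above are easy to produce at the REPL (a one-off illustration, not part of the final program):

(format nil "~8,'0b" 56) ;=> "00111000"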
Now, we can answer the previous questions:
- If I say that the number is divisible by 3, I reduce the possibilities to roughly a third of the numbers, so it would be \(-\log_2(1/3) = 1.58\) bits of information.
- If I say that the number is divisible by 2, then it would be \(-\log_2(1/2) = 1\) bit of information.
And as we had intuitively said previously, we can now also say numerically that the former reveals more information than the latter.
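This calculation is simple enough to transcribe directly into code. Here is a minimal sketch (the name information is mine, just for illustration):

(defun information (p)
  "information in bits revealed by an event of probability `p'"
  (- (log p 2)))

(information 1/3) ; ≈ 1.58 bits (divisible by 3)
(information 1/2) ; = 1 bit (divisible by 2)
(information 1)   ; = 0 bits (certain event, no information)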
Let's get back to Wordle. Our target is to increase the amount of information we can extract from each guess. There are 2315 words that can be right answers in the original Wordle game.
In a Wordle game where sweet is the right answer, using the guess word crane would show the following colors:
⬛⬛⬛⬛🟨
And if we check every possible answer word, there are 127 words that would produce the above colors:
beefy | befit | beget | belly | below | beset | betel | bevel | bezel | bleed |
bleep | bless | bowel | bused | debit | debug | debut | deity | depot | depth |
detox | devil | dopey | dowel | duvet | dwell | dwelt | edify | eight | elbow |
elegy | embed | empty | epoxy | equip | ethos | exist | expel | extol | exult |
fetid | fetus | field | filet | fleet | flesh | geeky | golem | gooey | guess |
guest | hefty | heist | helix | hello | hotel | hovel | impel | islet | jelly |
jetty | jewel | lefty | leggy | level | libel | model | modem | motel | pesky |
pesto | petty | piety | pixel | plied | poesy | quell | quest | quiet | seedy |
setup | sheep | sheet | sheik | shelf | shell | shied | sleek | sleep | sleet |
slept | smell | smelt | speed | spell | spelt | spied | spiel | steed | steel |
steep | sweep | sweet | swell | swept | teddy | teeth | tempo | tepid | testy |
theft | thief | totem | towel | tweed | tweet | upset | video | vowel | weedy |
weigh | welsh | wheel | whelp | wield | yield | zesty |
So, crane reduced the possibilities from 2315 words to 127 words, i.e. it revealed \(-\log_2 (127/2315) = 4.19\) bits of information. Let's denote this as:
\begin{equation} I(\text{crane}) = -\log_2\left(\frac{127}{2315}\right) = 4.19 \text{ bits} \end{equation}
However, if we were in a game where the right answer was rebar, then guessing crane would show
⬛🟨🟨⬛🟨
and there would be 50 possible valid answer words next. In this game, crane reveals \(-\log_2(50/2315) = 5.53\) bits of information.
However, if we had guessed pearl, then for the first game (answer sweet) it would show
⬛🟨⬛⬛⬛
and reduce the possibilities to 159 words (3.86 bits of information), while for the second game (answer rebar) it would show
⬛🟩🟨🟨⬛
and reduce the possibilities to 4 words (9.18 bits of information). So, crane is a better guess if the right answer is sweet, while pearl is a better guess if the right answer is rebar. But we don't know beforehand what the right answer is.
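All of these numbers follow directly from the information function sketched earlier:

(information (/ 127 2315)) ; ≈ 4.19 bits (crane, answer sweet)
(information (/ 50 2315))  ; ≈ 5.53 bits (crane, answer rebar)
(information (/ 4 2315))   ; ≈ 9.18 bits (pearl, answer rebar)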
3. Entropy
Equipped with this understanding of information, we can quantify how much a guess reduces the number of possibilities. We also saw that the amount of information a guess reveals depends on what the right answer is, which we don't know beforehand. So, for each guess, we can't say exactly how much information that word will reveal, but we can still compute the average (i.e. expected) value of the information we would get. In the above example, when we don't know what the right answer is, both sweet and rebar are equally likely, and so are all the other 2315 words. So, on average, crane would reveal
\begin{equation} E[I(\text{crane})] = \sum_{c} P(c) \, I(c) = 5.74 \end{equation}
bits of information, where the sum runs over all possible color patterns \(c\). \(E[I(w)]\) is the expected (average) value of the information that the word \(w\) would reveal. This is also the entropy of the event; the event here is guessing the word crane, and the entropy is the amount of information we would get from the result of that event on average.
If you look up Wikipedia, entropy is defined as:
Given a discrete random variable \(X\) with possible outcomes \(x_1, \ldots, x_n\) occurring with probabilities \(P(x_1), \ldots, P(x_n)\), the entropy is defined as
\begin{equation} H(X) = - \sum_{i=1}^{n} P(x_i) \log_2 P(x_i) \end{equation}
Here our discrete random variable \(X\) is the color pattern we get for a guess word (say crane), with possible outcomes ⬛⬛⬛⬛🟨, ⬛🟨🟨⬛🟨, … occurring with probabilities \(127/2315 = 5.5\%\), \(50/2315 = 2.2\%\), …, and the entropy of the guess crane would be \(H(X) = 5.74\) bits.
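As a sanity check, here is a direct transcription of this formula (distribution-entropy is an illustrative name, not used later):

(defun distribution-entropy (probabilities)
  "entropy in bits of a discrete distribution, given as a list of probabilities"
  (- (loop for p in probabilities
           unless (zerop p)
             sum (* p (log p 2)))))

;; two equally likely outcomes carry exactly 1 bit, like a fair coin toss
(distribution-entropy '(1/2 1/2)) ; = 1 bit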
Finally! We can answer the question `How do we increase the amount of information we can extract from each guess?': select the guess word whose entropy is highest.
4. Let's play a game
The right answer for this game is focus, but we don't know this yet. So, our first guess should be the word that has the highest entropy over all possible answer words:
- crane has an entropy of 5.74 bits,
- pearl has an entropy of 5.53 bits,
- soare has the highest entropy of 5.89 bits.
So, our first guess is soare.
🟨🟩⬛⬛⬛
Now, from this color pattern we can figure out, by going through all words in the dictionary, that there are 15 possible words:
bonus | boost | bosom | bossy | focus | foist | hoist | joist | joust | locus |
lousy | moist | mossy | noisy | posit |
Now, you might think we would be better off guessing one of the words from this list, but remember: we want to extract the maximum amount of information from each guess.
Assuming that each of the words above is equally likely to be the right answer, the average information that bonus would reveal is 2.2892463 bits, whereas the word thumb would reveal the highest amount of information on average: 3.4565654 bits.
Our next guess is therefore thumb, which results in the following colors:
⬛⬛🟨⬛⬛
We now have only 2 possible answers left:
focus | locus |
Now, the entropy of each of the two words is 1 bit, because each reduces the possibilities to half on average: the guess can be either right or wrong. At this point, we guess focus and complete the game in 3 steps.
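Once we implement the program in the next section, each of these steps can be replayed directly. For example, the 15 candidates after soare come from:

(valid-guesses (list (cons "soare" (color0 "soare" "focus")))) ;=> the 15 words above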
5. Code
5.1. Color
We can encode the colors a guess word produces for an actual answer word as an integer between 0 and \(3^5-1\), i.e. a number from 00000 to 22222 in base 3. If the color of the first letter is black, the first (least significant) digit is 0; it is 1 if the letter is yellow and 2 if it is green. Similarly, the second digit represents the color of the second letter, and so on.
(deftype color () '(integer 0 #.(1- (expt 3 5))))

(defun color0 (guess actual-word)
  "the sequence of colors obtained when `guess' word is used to guess `actual-word',
represented as a number from 0 to 3^5 - 1 in base 3

| digit | meaning         |
|-------|-----------------|
| 0     | wrong           |
| 1     | different place |
| 2     | right           |"
  (let ((total 0)
        (matches (make-array 5 :element-type 'bit :initial-element 0)))
    (flet ((match-position (char)
             ;; position of a yet-unmatched occurrence of `char' in the answer
             (loop for c across actual-word
                   for pos from 0
                   when (and (eql c char)
                             (eql (aref matches pos) 0))
                     return pos)))
      ;; mark green positions as already matched
      (map-into matches
                (lambda (c1 c2) (if (eql c1 c2) 1 0))
                guess actual-word)
      ;; compute the color of each letter and accumulate the base-3 number
      (loop for c1 across guess
            for c2 across actual-word
            for i from 0
            for color = (serapeum:cond-let p
                          ((eql c1 c2) 2)
                          ((match-position c1)
                           (setf (aref matches p) 1)
                           1)
                          (t 0))
            do (incf total (* color (expt 3 i))))
      total)))
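For example, for the game from section 2 where crane is guessed against the answer sweet, only the final e is yellow, so the encoded color is \(1 \times 3^4 = 81\):

(color0 "crane" "sweet") ;=> 81, i.e. 10000 in base 3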
5.2. Valid Guesses
Once we know the guess words and the colors the game showed for them, we iterate through all the words and select those that could produce those colors.
To do that, we first need to download the word lists for Wordle. In the original Wordle game, only a portion of the valid guess words can be actual answers:
- Possible Answer words : 2315
- Allowed guess words : 12972
(defparameter *answers*
  (uiop:read-file-lines "./wordle-answers-alphabetical.txt"))

(defparameter *words*
  (concatenate 'vector
               *answers*
               (uiop:read-file-lines "./wordle-allowed-guesses.txt")))

(defun valid-guesses (previous-guesses-and-colors)
  "return the words which are possible answers given the previous guesses
and the colors they showed

`previous-guesses-and-colors' is a list of pairs (`guess-word' . `color')"
  (let ((guesses (make-array 0 :fill-pointer 0 :adjustable t)))
    (loop for word in *answers*
          ;; keep the word only if it is consistent with every revealed color
          when (every #'(lambda (pair)
                          (destructuring-bind (guess-word . color) pair
                            (= (color0 guess-word word) color)))
                      previous-guesses-and-colors)
            do (vector-push-extend word guesses))
    guesses))
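For example, replaying the game from section 2, crane against the answer sweet leaves 127 possibilities:

(length (valid-guesses (list (cons "crane" (color0 "crane" "sweet"))))) ;=> 127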
5.3. Entropy
Now we can calculate the entropy (i.e. expected information) of a guess word for a given set of possible answer words. At first, the possible answer words are the whole answer word list (of 2315 words); later on, as we guess and colors are revealed, this list shrinks (we can generate it with the valid-guesses function).
(defun entropy (word possible-words)
  "compute the average information that `word' would reveal under the
given `possible-words'"
  (let ((possibilities (make-array #.(expt 3 5) :element-type 'fixnum
                                   :initial-element 0)))
    ;; count how many possible answers produce each color pattern
    (map 'nil
         #'(lambda (w) (incf (aref possibilities (color0 word w))))
         possible-words)
    ;; sum -p log2(p) over the occurring color patterns
    (loop for k across possibilities
          with l = (length possible-words)
          unless (= k 0)
            summing (* (/ k l) (log (/ k l))) into total
          finally (return (* -1 total (/ (log 2)))))))
There are only \(3^5 = 243\) possible color patterns. At first, we have 243 zeros in the array possibilities. Then, we map across all possible answer words (possible-words), find the color that would show on guessing word for each answer word, and increment the count in the array possibilities for that color. Finally, since each answer word is equally likely, the probability of a color showing for the guess word is equal to its count (k) divided by the total number of possible words (l). Summing \(- \frac{k}{l} \log_2(\frac{k}{l})\) over all occurring colors gives the entropy.
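We can check this against the numbers from sections 3 and 4:

(entropy "crane" *answers*) ; ≈ 5.74 bits
(entropy "soare" *answers*) ; ≈ 5.89 bits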
Now, to play the game we find the guess word which has the highest entropy.
(defun highest-entropy (previous-guesses-and-colors)
  "find the word that reveals the highest average information given
the previous guesses"
  (let ((guesses (valid-guesses previous-guesses-and-colors)))
    (if (<= (length guesses) 2)
        ;; with two or fewer candidates left, just guess the first one
        (aref guesses 0)
        (loop for word across *words*
              for entropy = (entropy word guesses)
              with max = -1
              with w = nil
              do (when (> entropy max)
                   (setf max entropy
                         w word))
              finally (return w)))))
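With no previous guesses, this recovers the first guess from section 4 (note that this unoptimized search takes a while, since it scores every allowed guess word):

(highest-entropy '()) ;=> "soare"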
That's it! In just about 75 lines of code, we have a program that solves Wordle games.
6. Benchmark
Let's check how effective this program is. For this, we let the program play all possible Wordle games and see how many guesses it takes to solve them.
Our first guess will always be soare, and each subsequent guess will be the word with the highest-entropy given the previous-guesses and their resulting colors.
(defun simulate-game (word)
  (let ((previous-guesses ()))
    (loop repeat 6
          for guess = "soare" then (highest-entropy previous-guesses)
          for color = (color0 guess word)
          do (push (cons guess color) previous-guesses)
             (when (string= guess word)
               (return)))
    ;; return the guesses in the order they were made
    (reverse (map 'list #'car previous-guesses))))
Simulating a game of sweet, it takes our program 4 steps to solve it:
(simulate-game "sweet")
soare | clipt | sheet | sweet |
For focus, it takes just 3 steps:
(simulate-game "focus")
soare | thumb | focus |
Now, let's iterate through all possible answer words and simulate the game for each, counting the number of guesses it takes to solve.
(defun histogram ()
  (let ((counts ()) (total 0))
    (loop for word in *answers*
          for guesses = (simulate-game word)
          ;; the game is solved if the last guess is the answer word
          do (if (string-equal word (first (last guesses)))
                 (incf (getf counts (length guesses) 0))
                 (incf (getf counts :fail 0)))
             (incf total))
    (loop for i from 1 to 6
          for percent = (* 100 (/ (or (getf counts i) 0) total))
          do (format t "~&~a : ~4,1f%~%" i percent))
    (format t "~&>6: ~4d~%" (or (getf counts :fail) 0))))
1 : 0.0%
2 : 1.3%
3 : 47.8%
4 : 47.9%
5 : 2.9%
6 : 0.0%
>6: 0
So, it solves almost all games in 3 or 4 steps, and only takes 5 guesses for about 2.9% of games.
We can slightly improve the performance of the program by selecting the guess word that maximizes entropy over two or more steps of the game. The best first guess word then changes from soare to something else. As elaborated in another of 3Blue1Brown's videos, trace performs slightly better.
1 : 0.0%
2 : 2.3%
3 : 48.6%
4 : 46.3%
5 : 2.7%
6 : 0.0%
>6: 0
With trace as the first guess word and no change in the entropy function, the program solves 48.6% of Wordle games in 3 steps and 46.3% of the games in 4 steps.
7. Optimized Code
Running the above benchmark takes a few minutes without optimization. With optimized code and multi-threading, it takes about 20 seconds to play all games. The optimizations I did were:
- cache the color of every pair of guess word and answer word (sketched below),
- represent words as numbers, using the index of each word in the *words* array to filter, check equality, and do other operations; this is faster than directly using strings,
- add type declarations,
- and run the benchmark in multiple threads.
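As an illustration of the first point, a minimal sketch of such a cache might look like this (the names *color-cache* and cached-color are illustrative, not necessarily those used in the final program):

;; precompute the color of every (guess, answer) pair once;
;; each color fits in a byte since it is below 3^5 = 243
(defparameter *color-cache*
  (let ((cache (make-array (list (length *words*) (length *answers*))
                           :element-type '(unsigned-byte 8))))
    (loop for guess across *words*
          for i from 0
          do (loop for answer in *answers*
                   for j from 0
                   do (setf (aref cache i j) (color0 guess answer))))
    cache))

(defun cached-color (guess-index answer-index)
  "color of the word at `guess-index' in *words* guessed against the
answer at `answer-index' in *answers*, read from the cache"
  (aref *color-cache* guess-index answer-index))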
Here's the optimized code.