Solving Wordle
Table of Contents
1. Introduction
2. Probability & Information
3. Entropy
4. Let's play a game
5. Code
6. Benchmark
7. Optimized Code
1. Introduction
3Blue1Brown's fantastic video Solving Wordle using Information Theory introduced me to the game of Wordle. In this article, I am going to explain what I learnt and implement a program to play this game.
Wordle is a word guessing game. You get 6 chances to guess a 5-letter word. For each guess, the game reveals the following information:
- if any letter in your guess word is at the exact same location as in the answer word, it is marked green,
- if any letter in your guess word exists in the answer word but at a different location, it is marked yellow,
- and the letters that don't exist in the answer word are marked black.
Using the revealed information, you narrow the possibilities and try to guess the correct word. As humans, we can intuitively feel that one guess might be better than another. "What exactly is a good guess?" is the main question. Or, put differently, "Which strategy for selecting guess words is the best?". Only when we answer this can we implement that strategy in code and see the result. To answer it, observe that we tend to think of the following strategies while playing the game ourselves:
- we would like to use words that have different letters than previous guesses,
- even if we know the correct letters at a few positions, it may be better to guess words with different letters at those positions, because we can then learn whether those letters exist at other positions of the answer word,
- we give priority to vowels and other letters that occur frequently across words.
There is one central theme to these thoughts: increase the amount of information we can extract from each guess. If we reuse letters from previous guesses, we get less information; if we test uncommon letters, on average we narrow down the possible words much less than if we had tested common letters. Now, we need to quantify these notions.
2. Probability & Information
Let's imagine we are playing a different game. I think of a random number between 0 and 255, and your task is to guess it. Initially, your probability of guessing the right one is \(1/256 = 0.4\%\). Suppose I say that the number is divisible by three. The probability of a number being divisible by 3 is about \(33\%\), and of not being divisible by 3 is \(67\%\). How much information did I reveal? And if I had instead said that the number is divisible by 2, how much information would I be revealing? Intuitively, you can guess that I would be revealing more information in the former case than in the latter. As another example, if I had said that the number is not greater than 400, I would not be revealing any information at all, because the probability of the number being not greater than 400 is already known to be 100%.
So, in this case where each number is equally likely to be the right one, the amount of information revealed is higher if that information shrinks the space of possible answers by a larger factor. And in general, the amount of information revealed is higher for events of lower probability and lower for events of higher probability, i.e. information is a decreasing function of probability.
Now, let's look at this in another way. A number between 0 and 255 can be represented in binary by 8 digits (8 binary digits = 8 bits), e.g. 56 is 00111000. And if I say that the number is divisible by 2, it is the same as saying that the last binary digit is 0. We need 8 bits to represent the number, and I revealed one of those 8 bits. So, we can say I revealed 1 bit of information. It shouldn't feel surprising that we can measure information in bits. After all, all the information (documents, images, audio, …) you store in your computer is stored in binary, and its size is measured in bits, bytes (8 bits = 1 byte), MBs (1 megabyte = \(2^{20}\) bytes), GBs, and so on.
Revealing that the number is divisible by 2 reduced the possible answers to 1/2 and gave 1 bit of information. If I said that the last two binary digits are 10, the possible answers would be reduced to 1/4, and this would be 2 bits of information. And if I revealed the right answer, it would reduce the possibilities to 1 out of all possible 256, i.e. 8 bits of information. So, the information \(I\) revealed by an event of probability \(P\) is:
\begin{equation} I = -\log_2 P \end{equation}
Indeed, \(-\log_2(1/2) = 1\) bit and \(-\log_2(1/4) = 2\) bits. Here, \(P\), the probability of the event, equals the fraction to which the possibilities are reduced (because in this case each number is equally likely).
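As an aside, the 8-bit binary strings used above are easy to produce at the REPL (a one-off illustration, not part of the final program):

(format nil "~8,'0b" 56) ;=> "00111000"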
Now, we can answer the previous questions:
- If I say that the number is divisible by 3, I reduce the possibilities to roughly a third of the numbers, so it would be \(-\log_2(1/3) = 1.58\) bits of information.
- If I say that the number is divisible by 2, then it would be \(-\log_2(1/2) = 1\) bit of information.
And as we had intuitively said previously, we can now also say numerically that the former reveals more information than the latter.
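This calculation is simple enough to transcribe directly into code. Here is a minimal sketch (the name information is mine, just for illustration):

(defun information (p)
  "information in bits revealed by an event of probability `p'"
  (- (log p 2)))

(information 1/3) ; ≈ 1.58 bits (divisible by 3)
(information 1/2) ; = 1 bit (divisible by 2)
(information 1)   ; = 0 bits (certain event, no information)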
Let's get back to Wordle. Our target is to increase the amount of information we can extract from each guess. There are 2315 words that can be right answers in the original Wordle game.
In a Wordle game where sweet is the right answer, using the guess word crane would show the following colors:
⬛⬛⬛⬛🟨
And if we check every possible answer word, there are 127 words that would produce the above colors:
beefy | befit | beget | belly | below | beset | betel | bevel | bezel | bleed |
bleep | bless | bowel | bused | debit | debug | debut | deity | depot | depth |
detox | devil | dopey | dowel | duvet | dwell | dwelt | edify | eight | elbow |
elegy | embed | empty | epoxy | equip | ethos | exist | expel | extol | exult |
fetid | fetus | field | filet | fleet | flesh | geeky | golem | gooey | guess |
guest | hefty | heist | helix | hello | hotel | hovel | impel | islet | jelly |
jetty | jewel | lefty | leggy | level | libel | model | modem | motel | pesky |
pesto | petty | piety | pixel | plied | poesy | quell | quest | quiet | seedy |
setup | sheep | sheet | sheik | shelf | shell | shied | sleek | sleep | sleet |
slept | smell | smelt | speed | spell | spelt | spied | spiel | steed | steel |
steep | sweep | sweet | swell | swept | teddy | teeth | tempo | tepid | testy |
theft | thief | totem | towel | tweed | tweet | upset | video | vowel | weedy |
weigh | welsh | wheel | whelp | wield | yield | zesty |
So, crane reduced the possibilities from 2315 words to 127 words, i.e. it revealed \(-\log_2 (127/2315) = 4.19\) bits of information. Let's denote this as:
\begin{equation} I(\text{crane}) = -\log_2\left(\frac{127}{2315}\right) = 4.19 \text{ bits} \end{equation}
However, if we were in a game where the right answer was rebar, then guessing crane would show
⬛🟨🟨⬛🟨
and there would be 50 possible valid answer words next. In this game, crane reveals \(-\log_2(50/2315) = 5.53\) bits of information.
However, if we had guessed pearl, then for the first game (answer sweet) it would show
⬛🟨⬛⬛⬛
and reduce the possibilities to 159 words (3.86 bits of information), while for the second game (answer rebar) it would show
⬛🟩🟨🟨⬛
and reduce the possibilities to 4 words (9.18 bits of information). So, crane is a better guess if the right answer is sweet, while pearl is a better guess if the right answer is rebar. But we don't know beforehand what the right answer is.
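All of these numbers follow directly from the information function sketched earlier:

(information (/ 127 2315)) ; ≈ 4.19 bits (crane, answer sweet)
(information (/ 50 2315))  ; ≈ 5.53 bits (crane, answer rebar)
(information (/ 4 2315))   ; ≈ 9.18 bits (pearl, answer rebar)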
3. Entropy
Equipped with this understanding of information, we can quantify how much a guess reduces the number of possibilities. We also saw that the amount of information a guess reveals depends on what the right answer is, which we don't know beforehand. So, for each guess, we can't say exactly how much information that word will reveal, but we can still compute the average (i.e. expected) value of the information we would get. In the above example, when we don't know what the right answer is, both sweet and rebar are equally likely, and so are all the other 2315 words. So, on average, crane would reveal
\begin{equation} E[I(\text{crane})] = \sum_{c} P(c) \, I(c) = 5.74 \end{equation}
bits of information, where the sum runs over all possible color patterns \(c\). \(E[I(w)]\) is the expected (average) value of the information that the word \(w\) would reveal. This is also the entropy of the event; the event here is guessing the word crane, and the entropy is the amount of information we would get from the result of that event on average.
If you look up Wikipedia, entropy is defined as:
Given a discrete random variable \(X\) with possible outcomes \(x_1, \ldots, x_n\) occurring with probabilities \(P(x_1), \ldots, P(x_n)\), the entropy is defined as
\begin{equation} H(X) = - \sum_{i=1}^{n} P(x_i) \log_2 P(x_i) \end{equation}
Here our discrete random variable \(X\) is the color pattern we get for a guess word (say crane), with possible outcomes ⬛⬛⬛⬛🟨, ⬛🟨🟨⬛🟨, … occurring with probabilities \(127/2315 = 5.5\%\), \(50/2315 = 2.2\%\), …, and the entropy of the guess crane would be \(H(X) = 5.74\) bits.
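As a sanity check, here is a direct transcription of this formula (distribution-entropy is an illustrative name, not used later):

(defun distribution-entropy (probabilities)
  "entropy in bits of a discrete distribution, given as a list of probabilities"
  (- (loop for p in probabilities
           unless (zerop p)
             sum (* p (log p 2)))))

;; two equally likely outcomes carry exactly 1 bit, like a fair coin toss
(distribution-entropy '(1/2 1/2)) ; = 1 bit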
Finally! We can answer the question `How do we increase the amount of information we can extract from each guess?': select the guess word whose entropy is highest.
4. Let's play a game
The right answer for this game is focus, but we don't know this yet. So, our first guess should be the word that has the highest entropy over all possible answer words:
- crane has an entropy of 5.74 bits,
- pearl has an entropy of 5.53 bits,
- soare has the highest entropy of 5.89 bits.
So, our first guess is soare.
🟨🟩⬛⬛⬛
Now, from this color pattern we can figure out, by going through all words in the dictionary, that there are 15 possible words:
bonus | boost | bosom | bossy | focus | foist | hoist | joist | joust | locus |
lousy | moist | mossy | noisy | posit |
Now, you might think we would be better off guessing one of the words from this list, but remember: we want to extract the maximum amount of information from each guess.
Assuming that each of the words above is equally likely to be the right answer, the average information that bonus would reveal is 2.2892463 bits, whereas the word thumb would reveal the highest amount of information on average: 3.4565654 bits.
Our next guess is therefore thumb, which results in the following colors:
⬛⬛🟨⬛⬛
We now have only 2 possible answers left:
focus | locus |
Now, the entropy of each of the two words is 1 bit, because each reduces the possibilities to half on average: the guess can be either right or wrong. At this point, we guess focus and complete the game in 3 steps.
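Once we implement the program in the next section, each of these steps can be replayed directly. For example, the 15 candidates after soare come from:

(valid-guesses (list (cons "soare" (color0 "soare" "focus")))) ;=> the 15 words above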
5. Code
5.1. Color
We can encode the colors a guess word produces for an actual answer word as an integer between 0 and \(3^5-1\), i.e. a number from 00000 to 22222 in base 3. If the color of the first letter is black, the first (least significant) digit is 0; it is 1 if the letter is yellow and 2 if it is green. Similarly, the second digit represents the color of the second letter, and so on.
(deftype color () '(integer 0 #.(1- (expt 3 5))))

(defun color0 (guess actual-word)
  "the sequence of colors obtained when `guess' word is used to guess `actual-word',
represented as a number from 0 to 3^5 - 1 in base 3

| digit | meaning         |
|-------|-----------------|
| 0     | wrong           |
| 1     | different place |
| 2     | right           |"
  (let ((total 0)
        (matches (make-array 5 :element-type 'bit :initial-element 0)))
    (flet ((match-position (char)
             ;; position of a yet-unmatched occurrence of `char' in the answer
             (loop for c across actual-word
                   for pos from 0
                   when (and (eql c char)
                             (eql (aref matches pos) 0))
                     return pos)))
      ;; mark green positions as already matched
      (map-into matches
                (lambda (c1 c2) (if (eql c1 c2) 1 0))
                guess actual-word)
      ;; compute the color of each letter and accumulate the base-3 number
      (loop for c1 across guess
            for c2 across actual-word
            for i from 0
            for color = (serapeum:cond-let p
                          ((eql c1 c2) 2)
                          ((match-position c1)
                           (setf (aref matches p) 1)
                           1)
                          (t 0))
            do (incf total (* color (expt 3 i))))
      total)))
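For example, for the game from section 2 where crane is guessed against the answer sweet, only the final e is yellow, so the encoded color is \(1 \times 3^4 = 81\):

(color0 "crane" "sweet") ;=> 81, i.e. 10000 in base 3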
5.2. Valid Guesses
Once we know the guess words and the colors the game showed for them, we iterate through all the words and select those that could produce those colors.
To do that, we first need to download the word lists for Wordle. In the original Wordle game, only a portion of the valid guess words can be actual answers:
- Possible Answer words : 2315
- Allowed guess words : 12972
(defparameter *answers*
  (uiop:read-file-lines "./wordle-answers-alphabetical.txt"))

(defparameter *words*
  (concatenate 'vector
               *answers*
               (uiop:read-file-lines "./wordle-allowed-guesses.txt")))

(defun valid-guesses (previous-guesses-and-colors)
  "return the words which are possible answers given the previous guesses
and the colors they showed

`previous-guesses-and-colors' is a list of pairs (`guess-word' . `color')"
  (let ((guesses (make-array 0 :fill-pointer 0 :adjustable t)))
    (loop for word in *answers*
          ;; keep the word only if it is consistent with every revealed color
          when (every #'(lambda (pair)
                          (destructuring-bind (guess-word . color) pair
                            (= (color0 guess-word word) color)))
                      previous-guesses-and-colors)
            do (vector-push-extend word guesses))
    guesses))
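For example, replaying the game from section 2, crane against the answer sweet leaves 127 possibilities:

(length (valid-guesses (list (cons "crane" (color0 "crane" "sweet"))))) ;=> 127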
5.3. Entropy
Now we can calculate the entropy (i.e. expected information) of a guess word for a given set of possible answer words. At first, the possible answer words are the whole answer word list (of 2315 words); later on, as we guess and colors are revealed, this list shrinks (we can generate it with the valid-guesses function).
(defun entropy (word possible-words)
  "compute the average information that `word' would reveal under the
given `possible-words'"
  (let ((possibilities (make-array #.(expt 3 5) :element-type 'fixnum
                                   :initial-element 0)))
    ;; count how many possible answers produce each color pattern
    (map 'nil
         #'(lambda (w) (incf (aref possibilities (color0 word w))))
         possible-words)
    ;; sum -p log2(p) over the occurring color patterns
    (loop for k across possibilities
          with l = (length possible-words)
          unless (= k 0)
            summing (* (/ k l) (log (/ k l))) into total
          finally (return (* -1 total (/ (log 2)))))))
There are only \(3^5 = 243\) possible color patterns. At first, we have 243 zeros in the array possibilities. Then, we map across all possible answer words (possible-words), find the color that would show on guessing word for each answer word, and increment the count in the array possibilities for that color. Finally, since each answer word is equally likely, the probability of a color showing for the guess word is equal to its count (k) divided by the total number of possible words (l). Summing \(- \frac{k}{l} \log_2(\frac{k}{l})\) over all occurring colors gives the entropy.
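We can check this against the numbers from sections 3 and 4:

(entropy "crane" *answers*) ; ≈ 5.74 bits
(entropy "soare" *answers*) ; ≈ 5.89 bits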
Now, to play the game we find the guess word which has the highest entropy.
(defun highest-entropy (previous-guesses-and-colors)
  "find the word that reveals the highest average information given
the previous guesses"
  (let ((guesses (valid-guesses previous-guesses-and-colors)))
    (if (<= (length guesses) 2)
        ;; with two or fewer candidates left, just guess the first one
        (aref guesses 0)
        (loop for word across *words*
              for entropy = (entropy word guesses)
              with max = -1
              with w = nil
              do (when (> entropy max)
                   (setf max entropy
                         w word))
              finally (return w)))))
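With no previous guesses, this recovers the first guess from section 4 (note that this unoptimized search takes a while, since it scores every allowed guess word):

(highest-entropy '()) ;=> "soare"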
That's it! In just about 75 lines of code, we have a program that solves Wordle games.
6. Benchmark
Let's check how effective this program is. For this, we let the program play all possible Wordle games and see how many guesses it takes to solve them.
Our first guess will always be soare, and each subsequent guess will be the word with the highest-entropy given the previous-guesses and their resulting colors.
(defun simulate-game (word)
  (let ((previous-guesses ()))
    (loop repeat 6
          for guess = "soare" then (highest-entropy previous-guesses)
          for color = (color0 guess word)
          do (push (cons guess color) previous-guesses)
             (when (string= guess word)
               (return)))
    ;; return the guesses in the order they were made
    (reverse (map 'list #'car previous-guesses))))
Simulating a game of sweet, it takes our program 4 steps to solve it:
(simulate-game "sweet")
soare | clipt | sheet | sweet |
For focus, it takes just 3 steps:
(simulate-game "focus")
soare | thumb | focus |
Now, let's iterate through all possible answer words and simulate the game for each, counting the number of guesses it takes to solve.
(defun histogram ()
  (let ((counts ()) (total 0))
    (loop for word in *answers*
          for guesses = (simulate-game word)
          ;; the game is solved if the last guess is the answer word
          do (if (string-equal word (first (last guesses)))
                 (incf (getf counts (length guesses) 0))
                 (incf (getf counts :fail 0)))
             (incf total))
    (loop for i from 1 to 6
          for percent = (* 100 (/ (or (getf counts i) 0) total))
          do (format t "~&~a : ~4,1f%~%" i percent))
    (format t "~&>6: ~4d~%" (or (getf counts :fail) 0))))
1 : 0.0%
2 : 1.3%
3 : 47.8%
4 : 47.9%
5 : 2.9%
6 : 0.0%
>6: 0
So, it solves almost all games in 3 or 4 steps, and only takes 5 guesses for about 2.9% of games.
We can slightly improve the performance of the program by selecting the guess word that maximizes entropy over two or more steps of the game. The best first guess word then changes from soare to something else. As elaborated in another of 3Blue1Brown's videos, trace performs slightly better.
1 : 0.0%
2 : 2.3%
3 : 48.6%
4 : 46.3%
5 : 2.7%
6 : 0.0%
>6: 0
With trace as the first guess word and no change in the entropy function, the program solves 48.6% of Wordle games in 3 steps and 46.3% of the games in 4 steps.
7. Optimized Code
Running the above benchmark takes a few minutes without optimization. With optimized code and multi-threading, it takes about 20 seconds to play all games. The optimizations I did were:
- cache the color of every pair of guess word and answer word (sketched below),
- represent words as numbers, using the index of each word in the *words* array to filter, check equality, and do other operations; this is faster than directly using strings,
- add type declarations,
- and run the benchmark in multiple threads.
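As an illustration of the first point, a minimal sketch of such a cache might look like this (the names *color-cache* and cached-color are illustrative, not necessarily those used in the final program):

;; precompute the color of every (guess, answer) pair once;
;; each color fits in a byte since it is below 3^5 = 243
(defparameter *color-cache*
  (let ((cache (make-array (list (length *words*) (length *answers*))
                           :element-type '(unsigned-byte 8))))
    (loop for guess across *words*
          for i from 0
          do (loop for answer in *answers*
                   for j from 0
                   do (setf (aref cache i j) (color0 guess answer))))
    cache))

(defun cached-color (guess-index answer-index)
  "color of the word at `guess-index' in *words* guessed against the
answer at `answer-index' in *answers*, read from the cache"
  (aref *color-cache* guess-index answer-index))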
Here's the optimized code.