The use of machine learning to teach computers to play board games has attracted a lot of interest recently. Big companies such as Facebook and Google have both made recent breakthroughs in teaching AI to play the complex board game Go. However, people have been using machine learning to teach computers board games since the mid-twentieth century. In the early 1960s, Donald Michie, a British computer scientist who helped break the German Tunny code during the Second World War, came up with Menace (the Machine Educable Noughts And Crosses Engine). Menace uses 304 matchboxes, all filled with coloured beads, in order to learn to play noughts and crosses.
How Menace works
Donald Michie’s original Menace.
Menace “learns” to play noughts and crosses by playing the game repeatedly against another player, each time refining its strategy, until after a certain number of games it becomes almost perfect and its opponent can only draw or lose against it. The learning process involves being “punished” for losing and “rewarded” for drawing or winning, in much the same way that a child learns. This type of machine learning is called reinforcement learning.
The 304 matchboxes that make up Menace represent all the possible layouts of a noughts and crosses board it might come across while playing. This is reduced from a much larger number by removing winning layouts, only allowing Menace to play first and treating rotations and reflections as the same board.
Each of these layouts would be represented by a single matchbox, as they are all either rotations or reflections of one another.
Each of these matchboxes contains a number of coloured beads, each colour representing a valid move Menace could play for the corresponding board layout.
Coloured beads representing possible moves.
The starting number of beads in each matchbox varies depending on the number of turns that have already been played. In Donald Michie’s original version of Menace, the box representing Menace’s first turn had four beads for each different move. The boxes representing the layouts of the board for Menace’s second turn contained three beads for each different move; there were two beads each for Menace’s third; and one of each in the boxes representing Menace’s fourth go. There are no boxes representing Menace’s fifth move, as only one space remains and Menace is forced to take it.
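These bead counts can be sketched in a few lines of Python (the function and variable names here are my own, not Michie’s):

```python
# Sketch of the initial bead counts: 4 beads per distinct move for
# Menace's first turn, then 3, 2 and 1 for its later turns.

def beads_per_move(menace_turn):
    """Beads per distinct move for Menace's nth move (n = 1 to 4)."""
    return {1: 4, 2: 3, 3: 2, 4: 1}[menace_turn]

def initial_box(board, menace_turn):
    """Fill a 'matchbox' for a board layout: one bead colour per empty
    cell, repeated according to the turn. board is a 9-character string
    of 'X', 'O' and '.' read row by row."""
    beads = []
    for cell, mark in enumerate(board):
        if mark == ".":
            beads.extend([cell] * beads_per_move(menace_turn))
    return beads

# Empty board, Menace's first move: 9 cells x 4 beads = 36 beads
# (before merging moves that are equivalent by symmetry).
box = initial_box("." * 9, 1)
```

In the real Menace, moves that are equivalent by symmetry share a single colour, so the empty-board box needs only three distinct colours, as described below.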
To speed up the learning process even more, only beads representing unique possible moves are used. So even though all the spaces are free on an empty board, only three need to be represented: centre, side and corner. All other positions are equivalent to one of these three.
The beads shown are the only ones required for each scenario: all other positions on the board are equivalent to one of the positions marked.
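One way to treat rotations and reflections as the same board is to pick a canonical representative of each symmetry class. Here is a sketch under my own conventions (a board is a 9-character string read row by row), not Michie’s actual bookkeeping:

```python
# Canonicalise a noughts and crosses board under the 8 symmetries of
# the square (4 rotations, each optionally reflected).

def rotate(board):
    """Rotate the 3x3 board 90 degrees clockwise."""
    return "".join(board[i] for i in (6, 3, 0, 7, 4, 1, 8, 5, 2))

def reflect(board):
    """Mirror the board left to right."""
    return "".join(board[i] for i in (2, 1, 0, 5, 4, 3, 8, 7, 6))

def canonical(board):
    """Lexicographically smallest of the board's 8 symmetries."""
    symmetries = []
    b = board
    for _ in range(4):
        symmetries.extend([b, reflect(b)])
        b = rotate(b)
    return min(symmetries)

# All four corner openings fall into the same symmetry class:
assert canonical("X........") == canonical("........X")
```

Two layouts then share a matchbox exactly when their canonical forms are equal, which is how the empty board collapses to just three distinct opening moves: corner, side and centre.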
When Menace makes its move one must find the box representing the current board layout and take a bead at random from that box. The bead represents the space in which Menace wishes to place its counter. This process is repeated every time it is Menace’s go until either somebody wins or the board is filled.
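That drawing procedure is simple to mimic in code. A minimal sketch, assuming boxes is a dictionary from board layouts to lists of beads, with each bead stored as the cell index it stands for:

```python
import random

def menace_move(boxes, board):
    """Open the box for the current layout and draw a bead at random.
    The drawn bead is the cell Menace plays."""
    return random.choice(boxes[board])

# Toy box for the empty board: four beads for a corner, four for the centre.
boxes = {".........": [0] * 4 + [4] * 4}
cell = menace_move(boxes, ".........")
assert cell in (0, 4)
```

Because the bead is drawn uniformly at random, moves backed by more beads are proportionally more likely, which is the whole learning mechanism.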
After a game is completed, Menace is punished or rewarded depending on the outcome. If Menace lost, the beads representing the moves Menace played are removed. If it was a draw, an extra bead of the colour played is added to each relevant matchbox; while if Menace won, three extra beads are added. This means that if Menace played badly, it has a smaller chance of playing the same game next time. However, if Menace played well, it is more likely to follow the same route the next time and win again.
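The reward scheme amounts to a single update function. A sketch, assuming each game records the (layout, bead) pairs Menace played; the names are my own:

```python
def reinforce(boxes, moves_played, outcome):
    """Punish or reward Menace after a game.
    outcome is 'win', 'draw' or 'loss'."""
    delta = {"win": 3, "draw": 1, "loss": -1}[outcome]
    for board, bead in moves_played:
        box = boxes[board]
        if delta > 0:
            box.extend([bead] * delta)   # reward: add 1 or 3 beads
        elif bead in box:
            box.remove(bead)             # punish: remove the bead played

boxes = {"b": [1, 1, 2]}
reinforce(boxes, [("b", 1)], "win")
assert boxes["b"].count(1) == 5   # 2 original beads + 3 reward beads
```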
An example game
When playing against Menace, Menace always goes first; otherwise the number of matchboxes needed would be greatly increased.
1st move, Menace’s turn: The operator finds the matchbox that displays the empty board, opens it and takes out a random bead. In this case the random bead is red, which means that Menace wants to place its counter in the centre-top space.
2nd move, player’s turn: The human player places the counter in the desired spot, which is in the centre.
3rd move, Menace’s turn: The operator finds the matchbox that displays the current board layout, opens it and takes out a random bead. The random bead is now blue, which means that Menace wants to place its counter in the top-left corner.
4th move, player’s turn: The human player places the counter in the desired spot, here blocking Menace from getting three in a row.
5th move, Menace’s turn: The operator finds the matchbox that displays the current board layout, opens it and takes out a random bead. The random bead is green, which means that Menace wants to place its counter in the middle row and left-most column.
6th move, player’s turn: The human player places the counter in the bottom-left corner and obtains three in a row. The human player wins!
With the game over, Menace, since it lost, must be punished. Hence every bead that represented a move played is removed from its corresponding box, as shown below.
I built a physical implementation of Menace so that I could play against the real thing myself. I wanted to create an experience similar to that which Michie must have had when he came up with the idea. What is most striking is the time it takes to play one game: finding the right box, taking a bead out and then placing Menace’s piece in the corresponding space on the board is very time-consuming.
To make things faster I therefore decided to write a Python script to work in the same way that Menace does, with the computer doing all the work for me. Using the script, I could also save the state of the boxes and try out new reward systems and bead arrangements without losing the results from a previous set of data.
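Saving the state of the boxes is straightforward with Python’s standard library; a sketch using pickle (the filename is arbitrary):

```python
import pickle

# Snapshot Menace's matchboxes so a training run can be saved and
# restored later without losing its learned bead counts.
boxes = {".........": [0, 0, 0, 0, 4, 4, 4, 4]}

with open("menace_boxes.pkl", "wb") as f:
    pickle.dump(boxes, f)

with open("menace_boxes.pkl", "rb") as f:
    restored = pickle.load(f)

assert restored == boxes
```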
On top of this, I could now train Menace by playing it against existing programmed strategies for noughts and crosses. I first had it play hundreds of games against a program that placed its counters randomly. After a certain number of games, you could see that Menace had improved. But when I played against Menace afterwards, it would sometimes make irrational moves. When Menace played against the random strategy, the random strategy would often place its counter somewhere other than where it could easily block Menace, letting Menace win and therefore reinforcing the sequence of moves that had been played. These moves, however, were not necessarily good ones; Menace might just have got lucky.
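The random opponent takes only a couple of lines; it simply picks any empty cell with equal probability, with no notion of blocking:

```python
import random

def random_move(board):
    """Pick an empty cell uniformly at random; board is a 9-character
    string of 'X', 'O' and '.' read row by row."""
    empty = [i for i, mark in enumerate(board) if mark == "."]
    return random.choice(empty)

move = random_move("XO.......")
assert move not in (0, 1)   # never plays an occupied cell
```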
I also wrote a program for a perfect strategy for noughts and crosses to train Menace against. This strategy only lets another player draw or lose against it. What struck me when training Menace against this program was the number of times it ran out of beads in certain boxes, having lost so many times that all of the beads from a particular box were removed. If there are no beads left in the box representing the current scenario, Menace is said to “resign” as it doesn’t “think” it can win in its current situation. This is OK if it’s a box representing a move that you rarely come across, but when it’s the box representing the empty board, the box that Menace always starts off with, there’s a problem! To fix this, all I needed to do was to add more beads to this box, although this had the consequence that it takes Menace longer to learn.
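The resignation rule, and the fix of seeding the opening box more heavily, can be sketched like this (the helper name and the bead counts in the example are my own):

```python
import random

def menace_move_or_resign(boxes, board):
    """Draw a bead from the box for this layout, or resign (return None)
    if that box has run out of beads."""
    box = boxes.get(board, [])
    if not box:
        return None
    return random.choice(box)

# Seeding the empty-board box with extra beads (here 8 per cell instead
# of Michie's 4) stops Menace resigning before it has even started, at
# the cost of slower learning.
boxes = {".........": [cell for cell in range(9) for _ in range(8)]}
assert menace_move_or_resign(boxes, ".........") is not None
assert menace_move_or_resign(boxes, "XO.......") is None  # empty box: resign
```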
Graphs showing Menace playing against a perfect strategy. The y-axis shows the number of games Menace has drawn minus the number it has lost. The top graph shows that after beads are added to the starting box of Michie’s original Menace, it takes longer for the number of draws to equal the number of losses (ie to cross the x-axis).
Machine learning is being used in many fields today, including in the work of technology giants such as Facebook and Google, which, as mentioned earlier, have made breakthroughs in the more complex game of Go. The AI programs learning to play Go do not store all the possible layouts of the game as we have done with noughts and crosses, as there are more of these than there are atoms in the universe. Instead, these programs detect similar patterns and use them as a starting point for their learning. However, the programs are then taught in a similar way to the way we taught Menace to play against different strategies, although the programs learning Go play against themselves repeatedly. So ideas that emerged from simple machines in the past are still being used today for much more complicated tasks.