‘Will machines outsmart humans?’ Ever since the emergence of artificial intelligence, that question has sparked countless public discussions marked by both excitement and fear. In 2015, the debut of AlphaGo brought this discussion to a peak. The fear of AI partly stems from a lack of understanding, so this article aims to give readers a fundamental explanation of how AlphaGo works.
Image Credit: SotaTek
Background
DeepMind is a company of scientists, engineers, and researchers working on artificial intelligence. It was acquired by Google in 2014, around the same time it began developing AlphaGo, a computer program that plays the ancient Chinese board game Go. In October 2015, AlphaGo became the first computer program to defeat a professional human Go player, and it subsequently went on to beat legendary players such as Lee Sedol and Ke Jie.
Image Credit: thebeijinger
So what is this game called “Go”? It is a board game in which two players take turns placing stones on a 19×19 grid, one playing with white stones and the other with black. A player scores points by capturing the opponent’s stones or expanding their own territory. Compared to other two-player board games such as chess, Go involves vastly more possible positions, so many that it seemed impossible at the time for an AI model to master the game. However, DeepMind challenged that notion.
What is Machine Learning?
Understanding AlphaGo requires some technical background. To start with, machine learning is a field in which computer programs learn to perform a task (prediction, classification, etc.) by training on data. There are three main types of machine learning:
Supervised learning involves training a model on labeled data. For instance, a model trained on animal photos and their names would be able to name the animal when presented with a new animal photo.
Unsupervised learning involves unlabeled data. For instance, a model trained only on photos of animals (with no labels for their names) could learn to cluster the photos into groups with similar patterns.
Reinforcement learning, which is the most important type for understanding this article, trains a model to maximize reward through trial and error. It is similar to training a dog: if I give my dog a treat whenever it responds to the command “sit,” the dog is likely to repeat that behavior in the future to maximize its reward. In reinforcement learning, an AI undergoes trial and error and adjusts its behavior in ways that increase its reward; a minimal code sketch of this idea appears after the figure below.
Image Credit: MathWorks
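To make the trial-and-error idea concrete, here is a minimal Python sketch (not part of AlphaGo) of an agent learning which of three actions yields the most reward, a toy “multi-armed bandit.” The action names and reward values are invented for illustration.

```python
import random

# Hypothetical example: three actions with unknown average rewards.
# The agent discovers the best one purely through trial and error.
true_rewards = {"sit": 1.0, "bark": 0.2, "roll": 0.5}  # hidden from the agent
estimates = {action: 0.0 for action in true_rewards}   # agent's learned values
counts = {action: 0 for action in true_rewards}

for step in range(1000):
    # Explore occasionally, otherwise exploit the best-known action.
    if random.random() < 0.1:
        action = random.choice(list(true_rewards))
    else:
        action = max(estimates, key=estimates.get)

    # Receive a noisy reward and update the running average for that action.
    reward = true_rewards[action] + random.gauss(0, 0.1)
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(max(estimates, key=estimates.get))  # usually prints "sit", the best-rewarded action
```

After enough trials, the agent’s own value estimates point to the action that earns the most reward, without anyone ever telling it which action was “correct.”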
What are Neural Networks?
A neural network is an AI algorithm that mimics the interaction between our brain’s neurons. Neural networks consist of multiple layers of nodes, where each node represents a neuron. A diagram of a possible neural network structure is shown below.
Image Credit: IBM
In a basic sense, nodes can be thought of as variables that each hold a numerical value. For instance, if we had a black-and-white image as an input, each node in the input layer would contain the gray-scale value of one pixel in the input image. The nodes are connected to nodes in the next layer, and each connection is assigned a weight that numerically represents how much a particular node in the first layer contributes to a particular node in the second layer.
The value of a node is determined by a calculation that combines the values from the previous layer, the weights on the corresponding connections, and another parameter called a bias. Through such calculations, each layer passes data to the next layer in the network until the output layer is reached; a small sketch of this forward pass is shown below. A neural network “learns” (in other words, improves its accuracy at a given task) by adjusting these weights and biases appropriately.
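As an illustration of that calculation, here is a minimal sketch of a forward pass through a tiny, made-up network with two inputs, three hidden nodes, and one output. The weights, biases, and the sigmoid activation are arbitrary choices for demonstration, not values from any real network.

```python
import numpy as np

def sigmoid(x):
    # Squashes each node's raw sum into a value between 0 and 1.
    return 1 / (1 + np.exp(-x))

# Made-up weights and biases for a network with 2 inputs, 3 hidden nodes, 1 output.
W1 = np.array([[0.5, -0.2, 0.8],
               [0.1,  0.4, -0.6]])     # connections: input layer -> hidden layer
b1 = np.array([0.0, 0.1, -0.1])        # one bias per hidden node
W2 = np.array([[0.3], [-0.7], [0.5]])  # connections: hidden layer -> output node
b2 = np.array([0.05])

x = np.array([0.9, 0.2])  # input values (e.g., two gray-scale pixel values)

# Each layer's values = activation(previous values x weights + biases)
hidden = sigmoid(x @ W1 + b1)
output = sigmoid(hidden @ W2 + b2)
print(output)  # the network's prediction for this input
```

“Learning” then consists of nudging W1, b1, W2, and b2 so that the output gets closer to the answers in the training data.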
The commonly used term “deep learning” actually refers to neural networks with many layers. There are also many complex variants of neural networks, including the well-known convolutional neural networks.
AlphaGo: Overall Mechanism
DeepMind used two different neural networks that take the Go board’s current state (the positions of black and white stones) as input. One was the “policy network,” which determines sensible moves to play, and the other was the “value network,” which estimates how advantageous the current state is for the player. These networks were first trained on data from human amateur games, specifically games from the KGS Go Server. This allowed AlphaGo to play somewhat reasonably, although it was not enough to defeat human professionals.
The key was having AlphaGo play against slightly different versions of itself millions of times. This was the trial-and-error step: AlphaGo gradually improved by learning what works well and what doesn’t, which is exactly the reinforcement learning described above.
AlphaGo: A Deeper Explanation
Let’s dive a little deeper. As a reminder, the policy network aims to find reasonable moves to play at a given board state. To achieve this, the network first learns how human players would play under certain circumstances. It is a 13-layer convolutional neural network trained through supervised learning: the data consist of board positions and labels, where each label is the move a human player made from the corresponding board position. At this point, when presented with a board state, the policy network can identify moves that a human player would sensibly play. This is called the SL (supervised learning) policy network.
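Below is a rough sketch, in PyTorch, of what this supervised setup looks like in code. It is heavily simplified: the toy network has only two convolutional layers instead of thirteen, the board is represented by a single plane of values rather than AlphaGo’s richer input features, and the “games” are random tensors standing in for real KGS data.

```python
import torch
import torch.nn as nn

# Toy stand-in for the SL policy network: given a board, output a score
# for each of the 19 x 19 = 361 possible moves.
class TinyPolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1),
        )

    def forward(self, board):
        # One score per board point, flattened into 361 move logits.
        return self.conv(board).flatten(1)

net = TinyPolicyNet()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Fake training data: 8 board states (1 = black, -1 = white, 0 = empty)
# and the move a human "played" in each, encoded as an index from 0 to 360.
boards = torch.randint(-1, 2, (8, 1, 19, 19)).float()
human_moves = torch.randint(0, 361, (8,))

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(net(boards), human_moves)  # distance between predictions and human moves
    loss.backward()
    optimizer.step()
```

The training objective is simply “predict the human’s move,” which is why this stage only makes the network imitate human play rather than exceed it.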
This SL policy network then plays against itself and learns from its mistakes, which is reinforcement learning. However, if the network only played against a single opponent, it would overfit, meaning it could not generalize what it had learned when meeting a new opponent. To prevent this, the network played against past versions of itself, each using slightly different parameters. Based on the result of each game, the network updated its parameters and played again against a different past version.
It must be noted that the network was not given any hand-crafted strategic knowledge about Go, such as the value of capturing the opponent’s stones or maximizing territory; it was simply instructed to maximize its reward (winning). At the end of this training process, the resulting model is called the RL (reinforcement learning) policy network. When tested, the RL policy network turned out to be a stronger player than the SL policy network.
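Here is a structural sketch of that self-play loop. Playing an actual game of Go and the policy-gradient update are stubbed out as placeholders; the point is the pool of frozen past versions that the current network keeps training against.

```python
import copy
import random

def play_game(current, opponent):
    # Placeholder: a real implementation would play a full game of Go
    # between the two networks and return +1 if `current` wins, -1 otherwise.
    return random.choice([1, -1])

def reinforcement_update(network, outcome):
    # Placeholder: a real implementation would adjust the network's
    # parameters to make winning moves more likely (policy gradient).
    pass

current = {"params": [0.0] * 10}          # the network being trained
opponent_pool = [copy.deepcopy(current)]  # snapshots of past versions

for game in range(10000):
    opponent = random.choice(opponent_pool)  # a randomly chosen past self
    outcome = play_game(current, opponent)
    reinforcement_update(current, outcome)

    # Periodically freeze a copy of the current network into the pool, so
    # future training faces a variety of opponents instead of just one.
    if game % 500 == 0:
        opponent_pool.append(copy.deepcopy(current))
```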
Now it’s time to talk about the value network. This network was trained on board positions paired with whether the player eventually won or lost the game, with the outcomes generated by having the RL policy network play games against itself. The value network was trained on 30 million board positions drawn from 30 million different games; to prevent overfitting, DeepMind did not use multiple board positions from the same game.
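The sketch below shows how such a training set could be assembled, with fake games standing in for the RL policy network’s self-play; the detail it illustrates is the “one position per game” rule.

```python
import random

# Each fake "game" is a list of positions plus the final winner. In reality,
# every game comes from the RL policy network playing against itself.
def fake_game():
    positions = [f"board_state_{i}" for i in range(random.randint(50, 200))]
    winner = random.choice([+1, -1])  # +1 if black won, -1 if white won
    return positions, winner

training_set = []
for _ in range(1000):  # DeepMind did this for ~30 million games
    positions, winner = fake_game()
    sampled = random.choice(positions)      # only ONE position per game...
    training_set.append((sampled, winner))  # ...since positions within a game are highly correlated

# The value network is then trained to predict `winner` from the sampled position.
print(len(training_set), training_set[0])
```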
Image Credit: Library of Congress
AlphaGo used a search tree to determine which move would be optimal, specifically through an algorithm called Monte Carlo tree search (MCTS). Since simulating every possible continuation in Go would be prohibitively expensive, AlphaGo used both the value network and the policy network to narrow down the options and explore only the sensible moves. In evaluating a candidate move, AlphaGo took into account both the probability of winning after the move is played (value network) and the probability that an expert would choose that move (policy network).
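The sketch below illustrates, in a highly simplified form, the kind of selection rule used inside the search: each candidate move is scored by combining its average value from previous simulations with its prior probability from the policy network, and the move explored most often is ultimately played. All numbers are invented, and the real algorithm builds a full tree of positions rather than scoring one fixed set of moves.

```python
import math
import random

priors = {"move_A": 0.6, "move_B": 0.3, "move_C": 0.1}  # from the policy network
value_sum = {m: 0.0 for m in priors}  # total value backed up to each move
visits = {m: 0 for m in priors}

def select(total_visits, c_puct=1.0):
    # Pick the move with the best mix of estimated value (exploitation)
    # and prior-weighted novelty (exploration), PUCT-style.
    def score(m):
        q = value_sum[m] / visits[m] if visits[m] else 0.0
        u = c_puct * priors[m] * math.sqrt(total_visits) / (1 + visits[m])
        return q + u
    return max(priors, key=score)

for simulation in range(1, 1001):
    move = select(simulation)
    # Placeholder for a real simulation: the value network (plus a fast rollout)
    # would evaluate the position reached after `move`. Here we fake it.
    value = random.gauss({"move_A": 0.5, "move_B": 0.55, "move_C": 0.2}[move], 0.1)
    visits[move] += 1
    value_sum[move] += value

print(max(visits, key=visits.get))  # the most-visited move is the one played
```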
Interestingly, the strongest combination DeepMind tested was the SL policy network used in tandem with the value network trained on the RL policy network’s games. This is the combination AlphaGo used to beat the strongest human players on Earth.
Significance
So why does any of this matter? To start with, the event publicly revealed the remarkable capacity of contemporary AI technology. It became evident that artificial intelligence had moved beyond simply mimicking humans at simple tasks; AI was able to learn by itself, not only surpassing the best human players in such a complex game but also coming up with moves that humans could not predict. AlphaGo’s inventive moves went against human conventions and were even studied by human players afterwards.
Remember that this was just 2016, and the breakthroughs in artificial intelligence didn’t stop coming. They include DeepMind’s AlphaGo Zero, which mastered Go without any data from human games, and its successor AlphaZero, which extended the approach to chess and shogi. AlphaGo was certainly not a culmination but rather the start of DeepMind’s journey towards general-purpose AI. We end with a quote from AlphaGo’s human opponent Lee Sedol:
"I thought AlphaGo was based on probability calculation and that it was merely a machine. But when I saw this move, I changed my mind. Surely, AlphaGo is creative."
References
“AlphaGo.” DeepMind, DeepMind Technologies, https://www.deepmind.com/research/highlighted-research/alphago.
Agarwal, Aman. “Explained Simply: How an AI Program Mastered the Ancient Game of Go.” FreeCodeCamp, 10 Mar. 2018, https://www.freecodecamp.org/news/explained-simply-how-an-ai-program-mastered-the-ancient-game-of-go-62b8940a9080/.
Heller, Martin. “Reinforcement Learning Explained.” InfoWorld, InfoWorld, 6 June 2019, https://www.infoworld.com/article/3400876/reinforcement-learning-explained.html.
Silver, David, et al. “Mastering the Game of Go with Deep Neural Networks and Tree Search.” Nature News, Nature Publishing Group, 27 Jan. 2016, https://www.nature.com/articles/nature16961.
“What Are Neural Networks?” IBM, IBM Cloud Education, 17 Aug. 2020, https://www.ibm.com/cloud/learn/neural-networks.