If you haven’t read it, check out my previous post that serves as a part 1, as this article is going to expand on what we learned there.

In our previous post, we investigated states, actions, and the core concept of reinforcement learning. From here we’re going to look into weights, policies, and one of my favourite modern-day implementations of the concept.

When we last left off, we assigned a “yes” or “no” weighting to our elevator example based on whether a certain action (a) took the elevator to a position (state: s) that would allow it to pick up a passenger. Here, 1 is “yes” and 0 is “no”.

This “yes” or “no” translates directly to probabilities, and to the magic that makes reinforcement learning work: weights. A weight is a number between 0 and 1, and you can think of it as a percentage. (Don’t confuse weights with the discount factor, which is a separate number between 0 and 1 that controls how much future rewards count for.) In our example, the action with a weight of 1 has a 100% chance of being chosen when the state F1 B2 is active, while the action with a weight of 0 has a 0% chance. The collection of weights that makes up a system is called a Policy. This example policy is perfect: we know with 100% certainty what the correct action to take is based on our current state. Once the policy is fixed like this, our MDP (Markov Decision Process) collapses into a simpler structure called a Markov Chain. In reality, if you already have a perfect policy for a problem, the problem is “solved” and Machine Learning is not (or no longer) necessary.
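As a sketch, a policy like this could be written as a small lookup table. Here’s a minimal Python example, where the state label and action names are placeholders I’ve made up for illustration:

```python
# Hypothetical policy table for the elevator example.
# "F1_B2" and the action names are illustrative labels, not a real API.
policy = {
    "F1_B2": {
        "move_to_pickup": 1.0,   # weight 1 -> "yes", 100% chance of being chosen
        "move_elsewhere": 0.0,   # weight 0 -> "no", 0% chance
    },
}

def best_action(state):
    """Return the highest-weighted action for the given state."""
    actions = policy[state]
    return max(actions, key=actions.get)

print(best_action("F1_B2"))  # -> move_to_pickup
```

With a perfect policy like this one, there’s nothing left to learn: the table answers every question directly.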

In complex systems, you will almost never be able to reach the Markov Chain. Time, computing power, finances, storage, and especially an imperfect definition of your reward make it nearly impossible. In most cases, there will be weights across hundreds or even thousands of actions moving between millions of states, each with a value between 0 (never pick) and 1 (always pick). These action weights do not have to add up to 1; i.e., we could rewrite our two weights above as 0.6 and 0.7 and this would be valid, even though 0.6 + 0.7 does not equal 1 (100%). What these numbers indicate is the relative probability that a certain route will be picked. For example, there’s a higher chance we will pick the route with the 0.7 weight than the one with 0.6, but that is not guaranteed.
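To make the 0.6/0.7 example concrete, here’s a sketch of weighted sampling (the route names are invented). Normalizing by the total (1.3) gives route_b roughly a 54% chance and route_a roughly 46%:

```python
import random

# Two hypothetical routes whose weights deliberately don't sum to 1.
weights = {"route_a": 0.6, "route_b": 0.7}
total = sum(weights.values())  # 1.3

def pick_route(rng):
    """Sample a route in proportion to its weight."""
    r = rng.random() * total
    for route, w in weights.items():
        r -= w
        if r <= 0:
            return route
    return route  # guard against floating-point edge cases

rng = random.Random(42)
counts = {"route_a": 0, "route_b": 0}
for _ in range(10_000):
    counts[pick_route(rng)] += 1

# route_b comes out ahead (~54% vs ~46%), but route_a still gets
# picked plenty of the time: neither route is guaranteed.
print(counts)
```

This is the sense in which the higher-weighted route is “more likely but not guaranteed”.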

The machine learning aspect comes in here via randomization, developing weights, and the overall policy we eventually use for the system. With our elevators, we randomly tried all possible routes from our state and said “yes” or “no”. But what if our action now affects a state or action in the future? For example, if we pick up a person on floor 1, will that affect the next person we have to pick up on the fifth floor 10 minutes from now? Reality is much more complicated, and the reward much vaguer and influenced by many more factors than in our example here, so we’re going to look into a personal favourite of mine, AlphaGo, and backpropagation to explain how developing a policy works when the reward starts getting convoluted.

AlphaGo is an AI made by DeepMind, a Google subsidiary, that has been taught to play the ancient Chinese board game Go using Reinforcement Learning. It uses MDPs and a “build as you go” strategy centered on backpropagation, called Monte Carlo Tree Search (MCTS). We’re going to look at backpropagation specifically before we look into the MCTS magic.

In Go, we start in one state, with nothing on the board, and can perform a huge number of actions from there, similar in bare concept to chess. In this case, our reward will be simply “I win the game”. Here, the reward isn’t defined by an action, but by a series of actions and states together. Hard to “yes” or “no” an individual action in this case! To solve this, we’re going to do selection (random search) and backpropagation.

Selection is randomly (or procedurally) picking an action given a state. We’re going to select actions and move between states randomly until we either win or lose a game, sort of like playing chess blind. When we win or lose, we’re going to go back (backpropagate) and mark every action we took with a +1 (we won!) or a +0 (we lost). Each action’s weight is then the fraction games won / games played. After doing this enough times over many games, we will eventually have created a Policy: a series of weights covering most or all actions. If an action is not covered, it is typical to give it either a 0 (we’re not even going to try) or a 0.5 (a 50% chance of being picked).
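Here’s how that loop might look on a toy game instead of Go (a made-up game: start at 0, each action adds 1 or 2, and we win only by landing exactly on 5):

```python
import random

# Toy stand-in for Go: start at 0, add 1 or 2 each turn, and win only
# by landing exactly on 5. Each (state, action) pair stores (wins, plays),
# so its weight is the fraction wins / plays.
stats = {}

def playout(rng):
    """Selection: pick random actions until the game ends."""
    state, path = 0, []
    while state < 5:
        action = rng.choice([1, 2])
        path.append((state, action))
        state += action
    return path, state == 5  # we won only if we hit 5 exactly

def backpropagate(path, won):
    """Walk back over every action taken and credit it with the result."""
    for key in path:
        wins, plays = stats.get(key, (0, 0))
        stats[key] = (wins + int(won), plays + 1)

rng = random.Random(1)
for _ in range(5_000):
    path, won = playout(rng)
    backpropagate(path, won)

# From state 4, adding 1 always wins and adding 2 always loses,
# and the learned weights reflect that.
for action in (1, 2):
    wins, plays = stats[(4, action)]
    print(f"state 4, +{action}: {wins}/{plays} = {wins/plays:.2f}")
```

Notice that no single move is marked right or wrong during play; credit only flows backwards once the whole game is decided, which is exactly what makes this work when the reward is “I win the game”.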

So what if we don’t have the time, memory, or other resources to try every single sequence of actions and reach every single state? Realistically, you won’t. This is where Monte Carlo Tree Search becomes extra useful: we never store more data than we’ve actually used. That process of playing through a game, then backpropagating whether we won or lost? In MCTS we do this in real time. We’ll already have some weights stored from previous games, and other actions will be weighted 0.5. When it’s our turn, our AI is going to simulate as many games as it can from our current state and take the move that won the most of them. We can only simulate so many games in that time period, so using the weightings learned from previous games to optimize which routes we try is critical. Every time it runs these simulations, it also updates those weightings for future games. Given AlphaGo beat Lee Sedol, one of the world’s strongest Go players, in a five-game match, I’d say it works pretty well!
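On the same kind of toy game (add 1 or 2, win by landing exactly on 5), the real-time step might look like this: spend a fixed simulation budget from the current state, then take the move that won the most simulated games. The game, the budget, and the even split of the budget between moves are all simplifying assumptions; real MCTS steers its budget using the stored weights rather than splitting it evenly.

```python
import random

# Toy stand-in for one real-time turn: add 1 or 2, win by landing exactly on 5.
def simulate(state, first_action, rng):
    """Play one random game to the end after committing to first_action."""
    state += first_action
    while state < 5:
        state += rng.choice([1, 2])
    return state == 5

def choose_move(state, budget=1_000, rng=None):
    """Split the simulation budget between moves; take the biggest winner."""
    rng = rng or random.Random(0)
    wins = {1: 0, 2: 0}
    for i in range(budget):
        action = 1 if i % 2 == 0 else 2  # naive even split of the budget
        wins[action] += simulate(state, action, rng)
    return max(wins, key=wins.get)

print(choose_move(3))  # from 3, +2 lands exactly on 5 every time -> 2
```

The budget is the whole game here: the better we spend those simulations, the stronger the move we find in the time we have.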

That’s it for the high/mid-level fundamentals of reinforcement learning. Look out for future blog posts in this series; next, we’re going to look into classifying images using regression.