In my last blog post, we learned about what machine learning is, when to use it, some business applications, and a few handy checklists on defining your machine learning problem and steps to solving it.

From here on we’re going to move away from high-level understanding and move into some of the math behind the magic that makes these methods work. For this post, we’re going to be looking into one of the simpler but powerful methods of machine learning: Reinforcement Learning (RL).

As a quick recap, we’ve learned that machine learning is, at its core, function approximation, or the art of making an educated guess. Essentially, you have a problem too complicated, such as guessing what the weather will be tomorrow, to know the answer with 100% certainty. Using all your knowledge of the weather – its history, the time of year, today’s conditions, meteorologist predictions, etc. – you make your best guess about whether you’re going to wear your new fleece sweater to work the next day. Machine learning attempts to take all of that experience you use to make your prediction and make predictions of its own, computationally.

Reinforcement Learning, if I had to distill it into a single sentence, asks the question: “What action should I take to maximize my reward?” We can use humans’ best friend as a good example here. To teach a dog to sit, you say “sit”, maybe use a gesture, and award the dog a treat when it does what you’ve asked. The dog takes the best action (sitting) to maximize its reward (a treat), and you, the owner, reinforce its good behaviour using that treat. As you may have guessed from this example, in Reinforcement Learning, the algorithm is the dog trying to maximize the reward, and you’re the owner reinforcing the good behaviour. The difference is that with a dog, a treat is clearly the reward, but with an algorithm, you have to define the reward based on the problem at hand.

In 1906 a Russian “mathemagician” named Andrey Markov published his first paper on the Markov Process. Mr. Markov said, essentially, that for certain processes we can predict the future based solely on our present state, regardless of the full history that led there (such a process satisfies the Markov Property). From this, in 1960, came Ronald A. Howard’s “Dynamic Programming and Markov Processes”, which birthed the Markov Decision Process (MDP) using Mr. Markov’s original theory. The MDP is the lifeblood of Reinforcement Learning, and although simple at heart, it has paved the way for the much more complicated and effective implementations of Reinforcement Learning in use today.

An MDP is a discrete, stochastic (non-deterministic) control process consisting of 4 main components:

- *s*: State
- *a*: Action
- *Ra(s, s’)*: Reward
- *γ* [0 ≤ γ < 1]: Discount Factor (weight)

And one additional component:

- *Pa(s, s’)*: Transition Probability

Discrete means that measurements are taken at specific instants in time (as opposed to over continuous time), and stochastic means, keeping to the definition of a machine learning problem, that the same input to the same system does not guarantee the same output every time. In technical terms, the MDP says that “a process, in some state *s*, will choose to take an action *a* to reach some state *s’*, to maximize reward *Ra*”. For those who are less technical, *s’* is called “s-prime” and in this case refers to an arbitrary future state that is different from our original state *s*.
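As a sketch, these components can be written down directly in code. The following is a minimal, hypothetical two-state MDP – every name and number here is illustrative, not from any particular library:

```python
# A minimal sketch of an MDP's components as plain Python data.
# All names and numbers are illustrative assumptions.

states = ["s", "s_prime"]   # the set of states
actions = ["stay", "move"]  # the set of actions

# P_a(s, s'): probability of landing in s' after taking action a in s
P = {
    ("s", "stay"): {"s": 1.0, "s_prime": 0.0},
    ("s", "move"): {"s": 0.1, "s_prime": 0.9},  # stochastic: moving may fail
}

# R_a(s, s'): reward for the transition s -> s' under action a
R = {
    ("s", "move", "s_prime"): 1.0,
}

gamma = 0.9  # discount factor, 0 <= gamma < 1

def expected_reward(s, a):
    """Expected immediate reward of taking action a in state s."""
    return sum(p * R.get((s, a, s2), 0.0) for s2, p in P[(s, a)].items())

print(expected_reward("s", "move"))  # 0.1 * 0.0 + 0.9 * 1.0 = 0.9
```

The stochastic part shows up in `P`: even when the process chooses “move”, there is a 10% chance it stays put, so the expected reward is 0.9 rather than 1.0.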

States, actions, weights, and rewards can be pretty tricky to understand just from their mathematical definitions (which come together into a mathematical formula called a policy, which we’ll not touch on in this post), so to simplify things, we’re going to learn what all these components mean and do via a simple example.

Consider 2 elevators in an apartment building with 5 floors. How do we optimize their behaviour?

First off, we decide that it is, in fact, a Reinforcement Learning problem. This problem has defined states, defined actions, and can be defined stochastically and discretely. But what is a state, or an action, and why are these relevant?

In the simplest breakdown of this problem, let’s say that a state *s* is a combination of elevator positions, and there are *S* total states, or elevator position combinations. For example, Elevator 1 on the first floor and Elevator 2 on the third floor would be one distinct state, *s13*. The table below helps visualize these states.

| Elevator 1 Position \ Elevator 2 Position | First | Second | Third … |
| --- | --- | --- | --- |
| First | s11 | s12 | s13 |
| Second | s21 | s22 | s23 |
| Third … | s31 | s32 | s33 |

We defined a state: a specific (discrete) point in time where we can clearly define the components of our system. Similarly, right now, at this exact instant of time, it is 18°C outside and sunny – that could be a state for the weather. So what about an action? An action is, formally, the process that takes us from one state to another. In this case, let’s say Elevator 1 on floor 1 moves (performs an action) to floor 2. We are now in a new state *s’* that is different from our original state. *s13* moved to *s’* = *s23*, and it did so by performing an action.
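One simple way to sketch this is to represent a state as a tuple of elevator positions, with an action as a function that produces the new state. The representation and function name here are illustrative assumptions:

```python
# Sketch: a state as a tuple of elevator positions (illustrative only).
state = (1, 3)  # Elevator 1 on floor 1, Elevator 2 on floor 3 -> "s13"

def move_elevator(state, elevator, new_floor):
    """An action: move one elevator, producing a new state s'."""
    positions = list(state)
    positions[elevator - 1] = new_floor
    return tuple(positions)

s_prime = move_elevator(state, elevator=1, new_floor=2)
print(s_prime)  # (2, 3), i.e. state s23
```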

Now we’ve answered what, but the next question, and where this gets more complicated, is why? Why does the elevator move from floor 1 to floor 2? We can start by redefining our state. Not only do we care about the position of the elevators, but also which buttons on which floors are pressed (or not). If we assume the button is generic – not up or down, just a single button to call the elevator – then for 5 floors we have 5 possible buttons. This adds 5 more dimensions to our original two (Elevator 1 position and Elevator 2 position), giving us 7 total dimensions to the problem. It’s easy to see how quickly a real-world problem can reach hundreds or thousands of dimensions and millions of related states.
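We can count just how fast this grows with a quick back-of-the-envelope calculation, using the numbers from the example above:

```python
# State-space growth for the elevator example above.
floors = 5
elevator_positions = floors ** 2   # two elevators: 5 * 5 = 25 combinations
button_patterns = 2 ** floors      # each of 5 buttons pressed or not = 32
total_states = elevator_positions * button_patterns
print(total_states)  # 25 * 32 = 800 states from just 7 dimensions
```

Eight hundred states for a toy problem – add floors, elevators, or up/down buttons and the count explodes combinatorially.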

With this we can finally define reward – remember, the core of RL is “what action should I take to maximize my reward?” Simply put, when a user presses the button on the fourth floor, and an elevator goes to floor four and, presumably, picks up the user, that was a good action. If an elevator doesn’t go to the user’s floor, that’s probably a bad action. We’re going to make it even simpler here: 1 elevator, 2 floors, 2 buttons (1 per floor), and exactly 1 button pressed. We don’t even care about dropping off our passenger; we just have to pick them up. This gives us 4 total states, and 2 possible actions per state (stay on the current floor, switch floors).
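The simplified setup is small enough to enumerate completely. In this sketch a state is an assumed `(elevator_floor, pressed_button)` pair, and the reward is 1 when the elevator ends up on the pressed button’s floor:

```python
from itertools import product

# Simplified setup: 1 elevator, 2 floors, exactly 1 pressed button.
# A state is (elevator_floor, pressed_button); names are illustrative.
states = list(product([1, 2], [1, 2]))   # 4 states
actions = ["stay", "switch"]             # 2 actions per state

def step(state, action):
    """Apply an action, returning the new state."""
    floor, button = state
    new_floor = floor if action == "stay" else 3 - floor  # 1 <-> 2
    return (new_floor, button)

def reward(state):
    """1 if the elevator is on the pressed button's floor, else 0."""
    floor, button = state
    return 1 if floor == button else 0

print(len(states))  # 4
```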

Thanks to our buddies mentioned earlier, Mr. Markov and Mr. Howard, we’re going to use the MDP. Here we randomly pick a route, judge it “good” or “bad” based on our definition of reward (did the elevator go to the floor with the pressed button?), and assign it a value accordingly – for now simply “yes – 1” or “no – 0”.

Step-by-step rationale: we start on Floor 1 (F1), and Button 2 (B2) is pressed. We randomly decide to stay on the same floor (take an action). Did we pick up the passenger? No. Assign that route 0. We randomly decide to try the only other path, and go to Floor 2. Did we pick up the passenger? Yes. Assign that route 1.
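That exploration step can be sketched in a few lines. This is an illustrative toy, not a full MDP solver: it tries both routes in a random order and records a 1 or 0 for each, exactly as described above.

```python
import random

# Sketch of the random-exploration step above (illustrative only).
# State: we are on Floor 1 (F1) and Button 2 (B2) is pressed.
start_floor, pressed_button = 1, 2
values = {}  # value assigned to each route (action) from this state

for action in random.sample(["stay", "switch"], 2):  # both routes, random order
    end_floor = start_floor if action == "stay" else 3 - start_floor
    picked_up = end_floor == pressed_button
    values[action] = 1 if picked_up else 0  # "yes - 1" or "no - 0"

print(values)  # 'stay' gets 0, 'switch' gets 1
```

After exploring, the best route is simply the one with the highest recorded value – here, “switch”.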

In practice, once the algorithm is fleshed out and using the results of this process, we’ll always pick the “yes” route; but ahead of time, we navigate all the conceptually possible routes (or as many as we realistically can) before coming to a final decision. It’s similar to the way a human thinks: we weigh our options given the current situation and pick what seems best based on our knowledge and experience.

Weights, policies, and modern-day implementations all make Reinforcement Learning much more complicated, so we’ll stop here for now – until my next post, when we’ll dive in even further.