An Introductory Hands-On With Reinforcement Learning


Artificial intelligence and machine learning are hot topics these days among people who work in and around the tech world. Our clients ask about them quite often too: everyone wants to know how artificial intelligence can help improve their digital experiences. The problem is that many people know the terms but don’t really understand what is just science fiction and what can (and should) be done in reality.

One area where we see a lot of promise for companies is using Reinforcement Learning to enhance their products. Reinforcement learning already happens behind the scenes in places you might not expect, such as online advertising bidding, where machines try to get the best price for the advertiser and progressively improve their tactics to do so. Another example is self-driving cars. These cars try to get passengers from point A to point B while keeping them, and everyone and everything around them, as safe as possible. To do so, the cars must learn which actions accomplish this and which ones cause them to fail their objectives.

So, how does a machine actually learn?

Reinforcement Learning (RL) is one approach to creating Artificial Intelligence by teaching a machine (known as the “agent”) to perform tasks without explicitly telling it how. The basic idea involves the agent interacting with its environment by taking actions for which it is given some reward value. The goal is to train the agent to maximize the reward value. Reinforcement Learning is particularly interesting because it demonstrates the ability to generalize well to new tasks and environments. This concept could be useful across a wide variety of optimization domains in the real world like robotics, industrial automation, and many more.

This mode of teaching computers is similar in concept to the way that humans learn: we first take random actions, and if by luck an action results in something positive, we are rewarded. Over time, we learn to associate certain actions with better outcomes. This interaction is usually formalized as a Markov Decision Process, which works as follows.

The agent begins in a current state (S_t). It takes an action (A_t). As a result, the environment changes and returns a new state (S_{t+1}) and a reward value (R_{t+1}) to the agent. The agent's state and reward are updated with the returned values. This cycle then repeats until the environment is solved or terminated.
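The cycle described above can be sketched in a few lines of Python. This is a minimal illustration only: `NumberLineEnv`, `run_episode`, and the reward scheme are hypothetical names and choices made up for this sketch, not part of any library.

```python
import random


class NumberLineEnv:
    """Hypothetical toy environment: walk along a number line to reach a goal."""

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps

    def reset(self):
        # begin a new episode and return the initial state
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # apply the action, then return the new state, the reward, and a done flag
        self.position += action
        self.steps += 1
        reached_goal = self.position == self.goal
        reward = 1 if reached_goal else 0
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done


def run_episode(env, choose_action):
    # the agent-environment loop: observe state, act, receive new state and reward
    state = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = choose_action(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


rng = random.Random(0)
random_total = run_episode(NumberLineEnv(), lambda state: rng.choice([-1, 1]))
always_right_total = run_episode(NumberLineEnv(), lambda state: 1)
print("random agent:", random_total)
print("always-right agent:", always_right_total)
```

The agent that always steps right reaches the goal in three steps and collects the reward, while the random agent may or may not get there before the step limit runs out.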

Consider the case of a computer trying to learn to play a video game. The environment in this instance is the world of the game and the rules that govern it. The state is an array of numbers representing everything about how the game world looks at one specific moment.

Let’s look at the classic game “Cart Pole”, in which the player tries to balance a falling pole on a moving cart.

The game laws are defined as follows:

A pole is attached by a joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pole starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
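As a rough illustration of these rules, the termination check and the scoring could be written as follows. The thresholds are taken from the description above; the function names and the sample trajectory are hypothetical, and the real environment's internals may differ.

```python
# thresholds as stated in the rules above
MAX_POLE_ANGLE_DEGREES = 15
MAX_CART_DISTANCE = 2.4


def episode_over(cart_position, pole_angle_degrees):
    # the episode ends when the pole tilts too far or the cart drifts too far
    return (abs(pole_angle_degrees) > MAX_POLE_ANGLE_DEGREES
            or abs(cart_position) > MAX_CART_DISTANCE)


def total_reward(trajectory):
    # +1 for every timestep in which the pole is still considered upright
    score = 0
    for cart_position, pole_angle_degrees in trajectory:
        if episode_over(cart_position, pole_angle_degrees):
            break
        score += 1
    return score


# made-up trajectory of (cart position, pole angle in degrees) pairs
sample_trajectory = [(0.0, 1.0), (0.1, 4.0), (0.3, 9.0), (0.6, 16.0)]
print(total_reward(sample_trajectory))
```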

The entire state of this very simple game can be described with 4 numbers: the cart position, the cart velocity, the angle of the pole, and the angular velocity of the pole. Our machine’s goal is to create a policy which decides what action to take depending on the current state.

A policy is a mapping that takes the current game state as its input and returns the action the agent should take (e.g. move the cart right or left). This cycle continues with the updated state from the chosen action, which is provided to the policy again to make the next decision. The policy is what we’re trying to fine-tune and represents the agent’s decision making intelligence.

In the context of Cart Pole, the policy is an array of weights which determines how important each of the 4 state components is to taking a specific action. We multiply each weight by its corresponding state component, then sum the products to come up with a single number. This number determines the action that we will take. For Cart Pole, if the number is positive, we move the cart to the right; otherwise, we move it to the left.

Imagine the cart is positioned in the center of the game screen and is moving to the right, while the pole is angled to the right and is falling further right. With its components in the order listed in the table below, the state could be represented as [0.1, 0.02, 0, 0.3]. An example policy could be [0.23, 0.45, -0.11, -0.09]. In this case, the game state and policy would correspond as follows:

State component                         | Current value | Policy weight (relative importance to the next decision)
pole_angle (radians)                    | 0.1           | 0.23
pole_angular_velocity (radians/frame)   | 0.02          | 0.45
cart_position (units)                   | 0             | -0.11
cart_velocity (units/frame)             | 0.3           | -0.09

To make a decision, each element in the state array is multiplied by the corresponding element in the policy array and the products are summed. We arbitrarily use 0 as the threshold that decides whether the action is a move to the left or to the right.

Action = (pole angle x relative importance of pole angle to action decision) + …
Action = (0.1 x 0.23) + (0.02 x 0.45) + (0 x -0.11) + (0.3 x -0.09)
Action = 0.005
Threshold = 0
Action > Threshold, therefore we move to the right by one unit.
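The same arithmetic can be checked with a short script. This is a sketch in plain Python using the example numbers above; `choose_action` is a hypothetical helper name, not something provided by Gym.

```python
# example state and policy from the worked example above
state = [0.1, 0.02, 0.0, 0.3]
policy = [0.23, 0.45, -0.11, -0.09]


def choose_action(policy, state, threshold=0.0):
    # multiply each weight by its state component and sum the products
    weighted_sum = sum(weight * value for weight, value in zip(policy, state))
    # above the threshold we move right, otherwise left
    direction = "right" if weighted_sum > threshold else "left"
    return direction, weighted_sum


action, weighted_sum = choose_action(policy, state)
print(round(weighted_sum, 3), action)  # prints: 0.005 right
```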

Now that we understand the concept, let’s try to build an agent to beat the game.

Our first step is to import the libraries we need to run a Cart Pole game in an OpenAI Gym environment (Gym is a software library developed to simulate and test RL algorithms). We will also import NumPy, a helpful mathematical computing library.

import gym
import numpy as np

Next, we’ll create the environment.

env = gym.make('CartPole-v1')

In order to run through episodes, let’s build a function that accepts the environment and a policy array as inputs. The function will play the game and return the score from an episode as output. We’ll also receive an observation of the game state after every action.

def play(env, policy):
    # note: this assumes the classic Gym API (before version 0.26),
    # where reset() returns only the observation and step() returns four values
    observation = env.reset()

    # create variables to track game status, score, and hold observations at each time step
    score = 0
    observations = []
    completed = False

    # play the game until it is done
    for i in range(3000):

        # record observations
        observations += [observation.tolist()]

        # stop once the previous step reported that the episode is over
        if completed:
            break

        # use the policy to decide on an action (weighted sum of the state components)
        result = np.dot(policy, observation)

        # a positive result pushes the cart right (1), anything else pushes it left (0)
        if result > 0:
            action = 1
        else:
            action = 0

        # take a step using the action (the env.step method returns a snapshot of the
        # environment after the action is taken, the reward from that action, whether
        # the episode is completed, and diagnostic data for debugging)
        observation, reward, completed, data = env.step(action)

        # record cumulative score
        score += reward

    return score, observations
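One caveat worth noting: from Gym 0.26 onward (and in its successor, Gymnasium), reset() returns an (observation, info) pair and step() returns five values, with the single completion flag split into `terminated` and `truncated`. The helpers below are a hedged sketch of how the play function could be made to cope with either style. `reset_env`, `step_env`, and the two stub classes are hypothetical names introduced here purely for illustration, and the tuple check assumes the observation itself is not a tuple, which holds for Cart Pole.

```python
def reset_env(env):
    # return only the observation, whichever reset() style the environment uses
    result = env.reset()
    if isinstance(result, tuple) and len(result) == 2:
        return result[0]  # newer style: (observation, info)
    return result         # classic style: observation only


def step_env(env, action):
    # always return (observation, reward, completed, info)
    result = env.step(action)
    if len(result) == 5:
        observation, reward, terminated, truncated, info = result
        return observation, reward, terminated or truncated, info
    return result


class OldStyleEnv:
    """Hypothetical stub that mimics the classic return shapes."""

    def reset(self):
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        return [0.0, 0.0, 0.0, 0.0], 1.0, False, {}


class NewStyleEnv:
    """Hypothetical stub that mimics the newer return shapes."""

    def reset(self):
        return [0.0, 0.0, 0.0, 0.0], {}

    def step(self, action):
        return [0.0, 0.0, 0.0, 0.0], 1.0, False, True, {}


for stub in (OldStyleEnv(), NewStyleEnv()):
    observation = reset_env(stub)
    observation, reward, completed, info = step_env(stub, 1)
    print(type(stub).__name__, reward, completed)
```

Inside play, `env.reset()` and `env.step(action)` would then be swapped for `reset_env(env)` and `step_env(env, action)`.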

Awesome! Now that our brave AI is able to play the game, let’s give it a policy to do so. In the absence of a clever strategy for devising a policy, we’ll start with random values centred around zero.

policy = np.random.rand(1,4) - 0.5
score, observations = play(env, policy)
print('Score:', score)

After running the script, how did our agent perform? Cart Pole has a maximum score of 500. In all likelihood, our agent yielded a very low score. A better strategy might be to generate lots of random policies and keep the one with the highest score. The approach is to use a variable that progressively retains the policy, observations, and score of the best-performing game so far.

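# best holds (score, observations, policy) for the best game so far
best = (0, [], [])

for _ in range(1000):

    # try a fresh random policy and play one episode with it
    policy = np.random.rand(1,4) - 0.5
    score, observations = play(env, policy)

    # keep this policy if it beats the best score seen so far
    if score > best[0]:
        best = (score, observations, policy)

print('Best score:', best[0])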

What is our best score now? Chances are, we have come up with a policy that is able to achieve the maximum score of 500. Our agent has beaten the game!

Where do we go from here? Well, that’s it for this post, but if we wanted to build a more robust system, we might consider some of the following approaches:

  • Using an optimization algorithm to find the best policy instead of randomly picking (e.g. Deep Q Learning, Proximal Policy Optimization, Monte Carlo Tree Search, etc.)
  • Testing the best policy that we obtained over many episodes to ensure that we didn’t just get lucky in the one episode
  • Testing our policy on a version of Cart Pole with a maximum score higher than 500 to see how sustainable the policy is
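As a starting point for the second idea, here is a minimal sketch of scoring one policy over many episodes. It assumes a function with the same shape as play above (it takes an environment and a policy and returns a score plus observations). `evaluate` and `make_noisy_play` are hypothetical names, and the noisy stand-in only produces random scores so that the snippet runs without Gym installed.

```python
import random
from statistics import mean


def evaluate(play, env, policy, episodes=100):
    # play many episodes with the same policy and summarize the scores
    scores = [play(env, policy)[0] for _ in range(episodes)]
    return mean(scores), min(scores), max(scores)


def make_noisy_play(seed=0):
    # hypothetical stand-in for play(): returns a random score and no observations
    rng = random.Random(seed)

    def noisy_play(env, policy):
        return float(rng.randint(20, 500)), []

    return noisy_play


average, worst, top = evaluate(make_noisy_play(), env=None, policy=None, episodes=50)
print('Average:', average, 'Worst:', worst, 'Best:', top)
```

With the real environment, the call would look like `evaluate(play, env, best[2])`. A policy that only got lucky once would show a low average and a wide gap between its worst and best scores.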

Thank you for reading.