Using Q-Learning for OpenAI’s CartPole-v1

Ali Fakhry · Published in The Startup · Nov 13, 2020

(Image by Author)

Background Information

Q-Learning is generally considered the simplest reinforcement learning algorithm, and I find myself agreeing with that assessment.

In another article, I compared Q-Learning to Deep Q Networks. I will pull from that discussion of what Q-Learning is, its positives and negatives, and its general equation, since that information is crucial background here.

Q-Learning is one of the more basic reinforcement learning algorithms, owing to its "model-free" nature. A model-free algorithm, as opposed to a model-based one, has the agent learn its policy directly from experience. Like many other algorithms, Q-Learning has both positives and negatives [1]. As mentioned, Q-Learning does not require a model, nor does it require a complicated system of operation. Instead, Q-Learning uses previously explored "states" to evaluate future moves, and it stores this information in a "Q-Table." For every action taken from a state, the policy table (the Q-table) records a positive or negative reward. The model starts with a fixed epsilon value, which represents how often moves are chosen at random [1]. Over time, that randomization decreases according to the epsilon decay value. Furthermore, if the current state of the agent is new or unexplored, the agent simply produces a randomly generated move in an attempt to better learn the environment.

This form of learning works well when there is a limited number of moves or the environment is not complicated, since the agent remembers past moves and repeats them with ease.

However, for more complex environments with a significantly larger number of states, the Q-Table fills up rapidly, causing long training times. The issue is that Q-Learning does not predict but rather deals in absolutes most of the time: either the state has already been visited and the action to take is already known, or the state is unknown and a random action has to be taken.
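To make that behavior concrete, here is a minimal sketch of the epsilon-greedy lookup described above; the names mirror the variables defined later in this post and are my own placeholders, not the author's exact code:

import numpy as np

def choose_action(q_table, discrete_state, epsilon, n_actions):
    # With probability epsilon, explore with a random move;
    # otherwise exploit the best action recorded in the Q-table for this state.
    if np.random.random() < epsilon:
        return np.random.randint(0, n_actions)
    return int(np.argmax(q_table[discrete_state]))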

The equation for this algorithm is depicted below and can be read from left to right. The formula is applied to determine the best action to take in the current state.

Equation: the Q-Learning update rule, from Wikipedia Contributors [3]:

Qnew(st, at) = Q(st, at) + α · (rt + γ · maxa Q(st+1, a) − Q(st, at))

where α is the learning rate, γ (gamma) is the discount factor, rt is the reward received for taking action at in state st, and maxa Q(st+1, a) is the highest estimated value available from the next state.

The "Q" value represents the quality of an action, or how well that action is perceived by the algorithm; the higher the quality value, the more likely the same action is to be performed again. The quality of an action is written as Qnew(st, at), where st represents the state and at represents the action. The model discounts future values using gamma and scales the size of each update step using the learning rate [2].
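As a sketch of that update in code (the names here are illustrative placeholders, not the exact variables used later in the walkthrough; states are assumed to be tuples of bucket indices):

import numpy as np

def q_update(q_table, state, action, reward, new_state, lr=0.1, gamma=0.95):
    # Nudge the current estimate toward the reward plus the discounted
    # best value reachable from the next state, as in the update rule above.
    best_future_q = np.max(q_table[new_state])
    current_q = q_table[state + (action,)]
    q_table[state + (action,)] = current_q + lr * (reward + gamma * best_future_q - current_q)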

CartPole-v1

CartPole-v1 is one of OpenAI's open-source Gym environments. The "cartpole" setup is an inverted pendulum in which the "cart" tries to keep the "pole" balanced vertically, despite small shifts in its angle.

The only forces that can be applied are +1 and -1, which translate to pushing the cart either left or right. If the cart moves more than 2.4 units from the center, the episode ends. If the pole tilts more than 15 degrees from vertical, the episode also ends. The reward is +1 for every timestep that the episode is not over.

More details can be found in the OpenAI Gym documentation.
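As a quick illustration of that interface, here is a random-action rollout (a sketch using the Gym API as it existed when this article was written, separate from the walkthrough below):

import gym

env = gym.make("CartPole-v1")
state = env.reset()
done = False
episode_reward = 0
while not done:
    action = env.action_space.sample()            # 0 pushes the cart left, 1 pushes it right
    state, reward, done, info = env.step(action)  # reward is +1 for every timestep the pole stays up
    episode_reward += reward
print(episode_reward)  # purely random play typically survives only around 20 timesteps
env.close()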

The Code and the Application

The first step is to get all the imports set up.

import numpy as np  # used for arrays
import gym  # pull the environment
import time  # to get the time
import math  # needed for calculations

The next step is to create the environment.

env = gym.make("CartPole-v1")
print(env.action_space.n)
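The printed action space size is 2 (push left, push right). The observation space can be inspected the same way; this is an optional check, not part of the original walkthrough:

print(env.observation_space.high)  # upper bounds: cart position, cart velocity, pole angle, pole velocity
print(env.observation_space.low)   # lower bounds (the velocity bounds are effectively unbounded)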

The next step would be to designate the variables needed.

LEARNING_RATE = 0.1  # alpha: how strongly each update shifts the Q-value
DISCOUNT = 0.95      # gamma: how much future rewards are valued
EPISODES = 60000

total = 0
total_reward = 0
prior_reward = 0

Observation = [30, 30, 50, 50]
np_array_win_size = np.array([0.25, 0.25, 0.01, 0.1])

epsilon = 1                    # start fully random
epsilon_decay_value = 0.99995  # how quickly the randomness fades

The majority of these variables are self-explanatory.

The "Observation" variable is slightly unique, however. The array was set manually because the first two values (cart position and cart velocity) are not as important as the other two (pole angle and pole velocity), so they are given fewer buckets.

The np_array_win_size array holds the bucket "step" sizes for cart position, cart velocity, pole angle, and pole velocity, in that order.

Next, set up the Q-table.

q_table = np.random.uniform(low=0, high=1, size=(Observation + [env.action_space.n]))
q_table.shape
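Since Observation is [30, 30, 50, 50] and CartPole-v1 has two actions, the table has shape (30, 30, 50, 50, 2), which works out to 4.5 million randomly initialized entries, one Q-value per state bucket and action:

print(q_table.shape)  # (30, 30, 50, 50, 2)
print(q_table.size)   # 4500000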

Then, define a method to get the discrete state.

def get_discrete_state(state):
    # Scale each observation by its bucket width, then shift so the indices are non-negative
    discrete_state = state / np_array_win_size + np.array([15, 10, 1, 10])
    return tuple(discrete_state.astype(int))
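As a quick sanity check with hypothetical observation values (not taken from the article):

state = np.array([0.10, -0.50, 0.02, 0.30])  # hypothetical cart position, cart velocity, pole angle, pole velocity
discrete = get_discrete_state(state)
print(discrete)           # a tuple of four integer bucket indices
print(q_table[discrete])  # the two Q-values (push left, push right) stored for that bucket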

After defining everything, run the Q-Learning algorithm!

I will post the full code on GitHub, as it is too long to include line by line here, but it is essential. Comments are included there to describe the important parts.

There are some key ideas that need to be pointed out before looking at the code, though.

First, I added a "timer" to measure how long the cart is able to balance the pole. As training progresses, it is evident that this time increases and the pole is balanced better.

Second, I want to point out that the "average reward" was computed over 1,000 episodes rather than the 100 that OpenAI asks for. 100 episodes seems too lenient and is not an accurate representation; reaching the reward threshold of 195.0 over 1,000 episodes is stronger evidence of success.

Another idea I want to elaborate on is that I let the agent train for the first 10,000 episodes with full epsilon. I believe this gives the agent time to explore and understand the environment.

Lastly, I prevented epsilon from decaying if the current episode did worse than the one before it, since that suggests more exploration is needed while training is not producing consistent results. That said, the "dip down" is usually a drop of only a few reward points, so it is often just due to randomness. A condensed sketch of how these ideas fit together is shown below.
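Since the full script lives on GitHub, the following is only a rough sketch of the training loop built from the variables above; the epsilon schedule and logging cadence are my own approximation of the description, not the author's exact code:

for episode in range(EPISODES):
    t0 = time.time()  # timer: measure how long this episode lasts
    discrete_state = get_discrete_state(env.reset())
    done = False
    episode_reward = 0

    while not done:
        # Epsilon-greedy action selection
        if np.random.random() > epsilon:
            action = np.argmax(q_table[discrete_state])
        else:
            action = np.random.randint(0, env.action_space.n)

        new_state, reward, done, _ = env.step(action)
        episode_reward += reward
        new_discrete_state = get_discrete_state(new_state)

        if not done:
            # Q-Learning update for the visited state-action pair
            max_future_q = np.max(q_table[new_discrete_state])
            current_q = q_table[discrete_state + (action,)]
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
            q_table[discrete_state + (action,)] = new_q

        discrete_state = new_discrete_state

    t1 = time.time()

    # Keep epsilon at 1 for the first 10,000 episodes, and only decay it
    # when the agent at least matched the previous episode's reward
    if epsilon > 0.05 and episode > 10000 and episode_reward > prior_reward:
        epsilon *= epsilon_decay_value

    prior_reward = episode_reward
    total_reward += episode_reward

    if episode % 1000 == 0 and episode > 0:
        print(f"Episode {episode}: mean reward {total_reward / 1000:.1f}, "
              f"epsilon {epsilon:.3f}, episode time {t1 - t0:.3f}s")
        total_reward = 0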

Conclusion

Overall, this is one application of Q-Learning. After around 40,000 episodes, the average reward reaches approximately 150; by around 55,000 episodes, it reaches approximately 195. Success!

Training to completion takes around 10 minutes, which is surprisingly fast for a reinforcement learning task. This speed is due to the simplicity of the Q-Learning algorithm.

Yet, while this is one use of Q-Learning, the algorithm has limited applicability; for more complex tasks, users should consider other algorithms such as Deep Q Networks.

References

[1] Shyalika, C. (2019, November 16). A Beginners Guide to Q-Learning. Retrieved September 14, 2020, from https://towardsdatascience.com/a-beginners-guide-to-q-learning-c3e2a30a653c

[2] Choudhary, A. (n.d.). [DQN Formula]. Retrieved September 15, 2020, from https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/

[3] Wikipedia contributors. (2020, October 20). Q-learning. In Wikipedia, The Free Encyclopedia. Retrieved 00:03, October 29, 2020, from https://en.wikipedia.org/w/index.php?title=Q-learning&oldid=984486286
