Applying a Deep Q Network for OpenAI’s Car Racing Game

Ali Fakhry
Towards Data Science
19 min read · Oct 28, 2020


Abstract. Using a classic environment from OpenAI, CarRacing-v0, a 2D autonomous vehicle environment, together with a custom modification of that environment, Deep Q-Networks (DQNs) were trained to solve both the classic and custom tracks. Two network designs were compared: a pre-trained Resnet18 backbone and a custom-made convolutional neural network. Overall, the custom environment did not allow for free movement, which led to catastrophic forgetting, making the classic environment more suitable for training. Additionally, the pre-trained model produced noisier results, while the custom CNN architecture showed a clearer correlation between episodes and rewards. Using this custom-designed architecture, the model was able to regularly surpass a reward count of 350.

Background

Gym, launched by OpenAI, is an open-source toolkit of reinforcement learning environments. It was put together so that developers can apply artificial intelligence techniques, such as reinforcement learning and computer vision, to solve these environments [5]. Numerous user-generated solutions to these tasks are publicly available, providing a benchmark for future work. The environment explored in this study was CarRacing-v0, a 2D autonomous vehicle environment. Using machine learning, a subset of artificial intelligence that relies on large amounts of data [6], an agent was trained to learn this track.

The specific machine learning algorithm used in this study was the Deep Q-Network (DQN), a reinforcement learning technique. It builds on Q-Learning, a simple model-free reinforcement learning algorithm [8], by adding neural networks. Plain Q-Learning chooses actions based on values learned from previously visited states; if a state has not been explored before, the agent acts randomly [8]. Neural networks are a tool for making estimates from previous data [7]. By using a neural network to estimate the Q-value, rather than relying on a lookup over previously visited states and purely random actions, more efficient actions are produced [3].

As explained previously, the Deep Q-Learning algorithm is a modified variant of Q-Learning. This unique formula is shown below.

Eq. 1 Deep Q-Network [13].

A deep convolutional neural network is used to approximate the optimal action-value function: the maximum sum of rewards, discounted at each time step, that can be achieved by a behavioural policy after making an observation and taking the appropriate action [13].
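
Written out from [13], the quantity being described in Eq. 1 is the optimal action-value function:

$$Q^{*}(s, a) = \max_{\pi} \, \mathbb{E}\left[ r_{t} + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots \mid s_{t} = s,\; a_{t} = a,\; \pi \right]$$

where gamma is the discount factor and pi is the behavioural policy.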

Eq. 2 Loss Function for Deep Q-Network [13].

The equation yields the Q-value for state s and action a: the immediate reward r(s, a) combined with the discounted maximum Q-value attainable from the future state s'. With experience replay, the agent's experience is stored in a data set, and during training Q-Learning updates are applied to random samples drawn from that stored experience (as displayed in Equation 2). In this equation, gamma is the discount factor that regulates the weight of future rewards.
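
In the same notation, the loss used with experience replay in [13] can be written as:

$$L_{i}(\theta_{i}) = \mathbb{E}_{(s, a, r, s') \sim U(D)}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i}^{-}) - Q(s, a; \theta_{i}) \right)^{2} \right]$$

where D is the replay memory of stored transitions, theta_i are the network parameters at iteration i, and theta_i^- are the parameters of the target network.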

Rationale and Justification

The objective of this study was to explore a real-world application of machine learning. Since the 1980s, developments in the field of autonomous vehicles have been widely publicized and recognized [15]. Yet these advancements have not truly met commercial and societal needs. Through progress in applied artificial intelligence, including machine learning, computer vision, reinforcement learning, and neural networks, autonomous vehicles can be produced for the betterment of society [15]. This paper, which applies these methods, contributes to the discussion of developments in this field. Using positive reinforcement to incentivize the vehicle to stay on the desired path is similar to what is being developed for real autonomous vehicles.

Likewise, this research paper also helps further the exploration of machine learning developments for diverse disciplines and uses. For instance, this study’s methods and outcomes can be applied in other fields such as natural language processing [18] and reinforcement learning for computer games [19].

Structures

1. Pytorch Framework

Using dynamic computational graphs and eager execution for deep learning, summed up by the phrase "define-by-run" rather than the classic "define-and-run," has added significant value when training models. However, some frameworks that implemented this approach did so at the cost of performance (Chainer [16]), while others relied on a less expressive language (DyNet [17]), limiting their applications. With the implementation options and design choices offered in the PyTorch library, dynamic execution can be used without sacrificing significant power or performance [9].

Additionally, PyTorch performs immediate execution of dynamic tensor computations, with automatic differentiation and GPU acceleration, while maintaining performance comparable to the leading deep learning libraries [9]. Its tensors are similar to NumPy's "ndarrays," with the advantage that they can be placed on a GPU, which speeds up training [14].
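
As a minimal illustration (not taken from the study's code), moving a NumPy-style array onto a GPU and differentiating through it takes only a few lines:

```python
import numpy as np
import torch

# A NumPy ndarray and a PyTorch tensor share the same "n-dimensional array" idea.
frame = np.random.rand(84, 84).astype(np.float32)
tensor = torch.from_numpy(frame)           # zero-copy conversion from the ndarray

# If a GPU is available, the tensor can be moved onto it for faster computation.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor = tensor.to(device)

# Autograd tracks operations on tensors that require gradients.
weights = torch.randn(84, 84, device=device, requires_grad=True)
loss = (tensor * weights).sum()
loss.backward()                            # gradients are computed automatically
print(weights.grad.shape)                  # torch.Size([84, 84])
```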

2. Pre-trained Models

Pre-trained models are models that were previously trained to solve similar problems. Their architecture and learned weights come ready to use and do not require much additional training. The torchvision package for PyTorch contains definitions of models for different tasks, including image classification, pixel-wise semantic segmentation, instance segmentation, object detection, person keypoint detection, and video classification [12].

The use of pre-trained models is, by definition, a form of transfer learning. While most of the layers are already trained, the final layers must be replaced and reshaped so that their input size matches the output of the pre-trained backbone. Furthermore, the user has to decide which layers are not retrained: retraining every layer largely defeats the purpose of transfer learning [10]. Thus, it is typical for the entirety of the pre-trained model to be frozen.
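
As a brief sketch of what freezing looks like in PyTorch (illustrative, not the study's exact code):

```python
from torchvision import models

# Load ResNet18 with its pre-trained ImageNet weights.
pretrained = models.resnet18(pretrained=True)

# Freeze every pre-trained parameter so only newly added layers are updated.
for param in pretrained.parameters():
    param.requires_grad = False
```

Only the parameters of the newly added layers would then be handed to the optimizer.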

Environments

1. Classic Environment

The classic CarRacing-v0 environment is simple and straightforward. Without any modifications, the state consists of 96x96 pixels in RGB. The reward is -0.1 for each frame and +1000/N for every track tile visited, where N is the total number of tiles on the track. To be considered solved, the agent must achieve a reward of 900 consistently, meaning the agent has at most 1000 frames to complete the track. There is also a barrier outside of the track; crossing it results in a -100 penalty and immediately ends the episode. Between the track and the barrier lies grass, which gives no reward and, due to its friction, makes it a struggle for the vehicle to move back onto the track. Overall, this is a classic 2D environment, significantly simpler than a 3D one.

Figure 1: A screenshot of the classic CarRacing-v0 environment.
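
For reference, a minimal interaction loop with this environment, using the Gym API as it existed for CarRacing-v0 (the random action here is purely illustrative), looks like this:

```python
import gym

env = gym.make("CarRacing-v0")
state = env.reset()                        # 96 x 96 x 3 RGB frame

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()     # [steering, gas, brake]
    state, reward, done, info = env.step(action)
    total_reward += reward                 # -0.1 per frame, +1000/N per new tile
env.close()

print("Episode reward:", total_reward)
```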

2. Custom Environment

The borders of the classic environment confine the agent within them. This led to a theory: replacing the grass with extended borders would force the vehicle to stay on the track, allowing quicker learning. Removing the grass leaves only two possible locations, track or border, and since touching the border immediately ends the episode, every remaining state has to be on the track. In the classic environment, by contrast, the vehicle has to spend extensive time learning the mechanics of the grass and the friction applied on it. Going over the more prominent barrier still gives the agent a -100 reward and still causes the agent to be considered "done" for that episode.

Figure 2: A screenshot of the custom CarRacing-v0 environment. Image by author.

CNN Models and Modifications

While the custom architecture and the pre-trained model differ in structure, they shared their training setup: both used the same learning rate and the same discount factor.

1. Custom Architecture

Convolutional layers in a neural network are well suited to image recognition. Since this environment is tackled through image recognition, Conv2d layers seemed to be the best fit. Using PyTorch's Conv2d, the following network was designed.

  • Conv2d(1, 6) (Kernel Size 4, Stride 4)
  • ReLU Activation (True)
  • Conv2d(6, 24) (Kernel Size 4, Stride 1)
  • ReLU Activation (True)
  • MaxPool2d (Kernel_Size 2)
  • Flatten Layer

While the environment initially provides frames in RGB, each frame was immediately converted with a grayscale filter, so the input channel count had to be 1. Linear layers follow the Conv2d layers:

  • Linear (((9 x 9)x 24), 1000)
  • ReLU Activation (True)
  • Linear (1000, 256)
  • ReLU Activation (True)
  • Linear (256, 4)

Because the input size of the first linear layer depends on the Conv2d layers, the calculation has to reflect them. The image fed into the network is 84 x 84 after cropping. The first Conv2d layer applies a 4 x 4 convolution with no padding and a stride of 4, dropping the size to 21 x 21. The second layer takes the 21 x 21 input and applies another 4 x 4 convolution, this time with a stride of 1, dropping the size to 18 x 18. Lastly, a MaxPool2d layer with a kernel size of 2 reduces the size to 9 x 9. This, multiplied by 24, the number of output channels of the last Conv2d layer, gives the input size of the linear layers.
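
Putting the layers and the shape calculation together, a sketch of this architecture in PyTorch might look as follows (class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class CustomDQN(nn.Module):
    def __init__(self, num_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=4, stride=4),    # 84x84 -> 21x21
            nn.ReLU(inplace=True),
            nn.Conv2d(6, 24, kernel_size=4, stride=1),   # 21x21 -> 18x18
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),                 # 18x18 -> 9x9
            nn.Flatten(),                                # 24 * 9 * 9 = 1944
        )
        self.head = nn.Sequential(
            nn.Linear(9 * 9 * 24, 1000),
            nn.ReLU(inplace=True),
            nn.Linear(1000, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_actions),                 # one Q-value per action
        )

    def forward(self, x):
        return self.head(self.features(x))

# Shape check: a single grayscale 84x84 frame.
q_values = CustomDQN()(torch.zeros(1, 1, 84, 84))
print(q_values.shape)   # torch.Size([1, 4])
```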

2. Resnet18 Pre-trained Model

Since its release, Resnet has become one of the most common pre-trained models in transfer learning due to its accurate results and representations, especially for computer vision tasks. Resnet18, from the family of Resnet pre-trained models, is the 18-layer variant, outputting 512 channels. Training an equivalent CNN from scratch would typically slow down the training process significantly. The model can be simplified as the following:

  • Resnet18 (18 layers)(Frozen)
  • Linear (512, 256)
  • ReLU (True)
  • Linear (256, 4)

Diagrammed, the model takes an input of size 512 from the Resnet18 pre-trained model, passes it through an added hidden layer of size 256, and produces a final output layer of length 4.
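
A sketch of this model in PyTorch (illustrative; note that ResNet18 expects a 3-channel input, so a grayscale frame would have to be repeated across channels, which is an assumption about the implementation detail):

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen ResNet18 backbone; its classification head is replaced with an identity
# so the backbone outputs the 512-dimensional feature vector described above.
backbone = models.resnet18(pretrained=True)
for param in backbone.parameters():
    param.requires_grad = False
backbone.fc = nn.Identity()

model = nn.Sequential(
    backbone,              # 512 output features
    nn.Linear(512, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, 4),     # one Q-value per discrete action
)

# A grayscale 84x84 frame repeated across three channels (assumed detail).
frame = torch.zeros(1, 1, 84, 84).repeat(1, 3, 1, 1)
print(model(frame).shape)   # torch.Size([1, 4])
```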

3. Customization

The network itself is similar across many DQNs; what varies is how the agent and the state are handled. Several strategies were employed to improve training time, with an impact that varied by method. Some of these approaches were described above; the rest are introduced here.

  • Cropping: 84 x 84 frame shape

Most of the image is needed, but the bottom bar is not. In fact, it can interfere with image recognition, since its black pixels could be mistaken for the border of the map. Cropping off the bottom is therefore necessary to optimize the model, and fixing the size of the image also helps stabilize it. In addition, the edges were cropped out of the PIL image: 6 pixels from the left and 6 pixels from the right.

  • Gray Scaling: 3 channels to 1 channel

Using color (RGB) for computer vision tasks typically complicates the model by introducing more channels. Using one channel rather than three makes processing the grayscale frames roughly three times lighter than working with full RGB images. Often, color adds no benefit, especially for an environment as simple as CarRacing-v0, where the image recognition portion is much less demanding than the learning itself, particularly because the environment is 2D rather than 3D.

  • Image Equalize: equalize image

PIL provides a function that applies a non-linear mapping to an input image in order to create a uniform distribution of grayscale values in the output, i.e., it equalizes the image histogram. This increases contrast in the images, which is an important step given the use of grayscale frames.
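
Taken together, the cropping, grayscaling, and equalization steps above can be sketched with PIL roughly as follows (the exact crop coordinates are inferred from the description and may differ from the original code):

```python
import numpy as np
from PIL import Image, ImageOps

def preprocess(frame):
    """Convert a raw 96x96x3 CarRacing frame to an 84x84 grayscale array."""
    img = Image.fromarray(frame)

    # Crop 6 pixels from the left and right and drop the bottom status bar,
    # leaving an 84 x 84 image (coordinates inferred from the description above).
    img = img.crop((6, 0, 90, 84))

    # Collapse the three RGB channels into a single grayscale channel.
    img = img.convert("L")

    # Equalize the image histogram to increase contrast.
    img = ImageOps.equalize(img)

    return np.asarray(img, dtype=np.float32) / 255.0   # normalized 84 x 84 array
```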

  • Epsilon Fluctuation: adjusting epsilon

Epsilon is needed for training, as it provides the exploration the model requires. However, the amount of exploration it produces is often not enough, so more epsilon and randomization are needed. Rather than adjusting epsilon manually, it was adjusted automatically. The rule was simple: if the last 50 episodes improved more than the 50 before them, epsilon was decreased by 0.025; if not, 0.05 was added instead, since more exploration appeared to be needed. At its maximum value of 1.0, all actions are random; epsilon then gradually decreases according to the epsilon decay.
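
A minimal sketch of this adjustment rule, using the thresholds described above (function and variable names are illustrative):

```python
def adjust_epsilon(epsilon, episode_rewards, window=50):
    """Raise or lower epsilon based on the last two 50-episode windows of rewards."""
    if len(episode_rewards) < 2 * window:
        return epsilon                      # not enough history yet

    recent = sum(episode_rewards[-window:]) / window
    previous = sum(episode_rewards[-2 * window:-window]) / window

    if recent > previous:
        epsilon -= 0.025                    # improving: explore a little less
    else:
        epsilon += 0.05                     # stagnating: explore more

    return max(0.0, min(epsilon, 1.0))      # keep epsilon within the 1.0 maximum
```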

  • Reward Modifications: adjusting rewards

The rewards allocated in this environment are not well aligned for this task. The penalty for crossing the border is larger than needed, causing the agent to limit its movements to avoid the penalty instead of exploring and attempting the track. The border reward was therefore changed from -100 to 0, and a penalty of -0.05 per step was added for staying on the grass. Reward modifications were also tested in the custom environment; however, because of the severity of catastrophic forgetting there, they had no effect on training.
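
A rough sketch of this reward shaping, written as a Gym wrapper (the on_grass() helper is hypothetical; detecting grass, for example from the pixels under the car, is not shown):

```python
import gym

class RewardShaping(gym.Wrapper):
    """Replace the -100 off-track penalty with 0 and penalize staying on grass."""

    def step(self, action):
        state, reward, done, info = self.env.step(action)

        # Remove the large -100 penalty applied when the car leaves the playfield.
        if done and reward < -50:
            reward += 100

        # Small penalty per step spent on grass.
        if self.on_grass(state):
            reward -= 0.05

        return state, reward, done, info

    def on_grass(self, state):
        # Placeholder: a real check might inspect the pixels under the car.
        return False
```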

Results and Evaluation

1. Theory: Catastrophic Forgetting

Catastrophic forgetting is one of the worst cycles a model can fall into, particularly when using neural networks. Catastrophic forgetting, sometimes called catastrophic interference, occurs when a neural network, while training on a new task or category of tasks, overwrites past information with present information [2]. In the custom environment, the death rate was extremely high: the agent would die almost immediately during training. For simpler environments, such as CartPole, this is not a major issue, but for a more complicated environment such as CarRacing it becomes a larger problem. While the issue was not prominent for most of the classic CarRacing-v0 training, catastrophic forgetting appeared to occur at a large scale throughout the custom CarRacing-v0 environment.

The car, the agent, appears to die too quickly to learn new information; it gradually forgets what it has learned and instead learns to remain still in an attempt to avoid crossing the border, since doing so incurs a heavy reward penalty that the model tries to avoid.

In the classic environment, the same behaviour initially occurred. The agent learned that the barrier produced a "die" state, so it "learned" to spin in a circle to avoid any deaths and the large penalty, forgetting its past experience on the track in the process. This appears to be a common issue for others attempting to solve the environment; however, allowing more episodes for training and exploration, along with adjusting the reward penalties, allowed the model to recognize that the track is the intended route.

Altering the rewards for the custom environment, however, changed nothing, and neither did additional training and exploration: in the custom environment, the agent suffered catastrophic forgetting regardless.

2. Comparing the Models

For the first two graphs below, the models were initially trained for 500 episodes at a fixed epsilon of 1 with no decay; the graphs were then produced from the results of that training.

The pre-trained model did not appear to train as quickly as the fully custom architecture. In an attempt to address this, the frozen layers were unfrozen and retrained, which in a way defeated the purpose of having a pre-trained model at all. The results of using transfer learning over many episodes are displayed below under the title "Transfer Learning Exploration." There was a significant amount of fluctuation in the learning curve: the rewards per episode dip, rise, and rise again. Still, there is a slight upward trend if the outliers are disregarded.

Graph 1: Using Pre-trained Model from Resnet18.

The other method, the custom architecture, produced similar results but with a more gradual change, as depicted in the graph below titled "CNN- Stable Exploration," which again plots rewards against episodes. Rewards increase gradually per episode, though not as quickly as hoped. The reward count for this model peaked at a value of 112.

Graph 2: Custom CNN with a Stable Epsilon.

Allocating more episodes to the initial training process sped up training: instead of 500 training episodes, there were 1000. The epsilon decay was also increased by 15% to allow for more randomized initial exploration. The results are displayed in the graph below, titled simply "CNN." The change is more immediate than in the other graphs, showing faster learning. Additionally, this model peaked at a reward count above 125, slightly higher than that of Graph 2, and reached that peak approximately 400 episodes sooner.

Graph 3: Custom CNN with less epsilon decay.

The last graph, Graph 4, had the best results. The number of initial training episodes was five times larger than in the first graphs, which used only 500: there were 2500 training episodes, and the initial epsilon allocation was increased from 1.0 to 1.5, allowing for more training and learning, as can be seen in the graph below. According to Graph 4, the reward count surpasses a value of 350 multiple times.

Graph 4: Custom CNN with more training and more epsilon. Image by author.

As demonstrated, the pre-trained model shows a more random pattern of rewards per episode while appearing to train faster at the start than the custom-built model. However, the models using the custom CNN gradually increase their rewards per episode and reach a higher reward peak than the model using the pre-trained backbone.

These observations remained consistent across many tests, and the graphs presented above are not outliers. Still, no matter the model, none achieved enough reward to be considered successful: in no episode did the car reach a total reward of 900, let alone average it over 100 consecutive episodes.

Discussion of Research and Future Outlook

1. Double DQNs (Double Learning)

While DQNs are useful for many environments, a modification called the Double DQN, which uses double learning, is typically better and more efficient. Standard DQNs tend to be optimistic, whereas Double DQNs are more skeptical of the chosen actions when calculating the target Q-value [11]. Overall, Double DQNs reduce the overestimation of Q-values, allowing faster and more stable learning [11]. Given this, using double learning would likely have been more effective and beneficial for this environment than the classic DQN that was applied.
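
To make the difference concrete, here is a sketch of how the two targets would be computed in PyTorch (online_net and target_net are assumed to be two copies of the Q-network, and the inputs are assumed to be batched tensors):

```python
import torch

def dqn_target(reward, next_state, done, target_net, gamma=0.99):
    # Classic DQN: the target network both selects and evaluates the next action,
    # which tends to overestimate Q-values.
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1).values
    return reward + gamma * next_q * (1 - done)

def double_dqn_target(reward, next_state, done, online_net, target_net, gamma=0.99):
    # Double DQN: the online network selects the action and the target network
    # evaluates it, reducing the overestimation bias [11].
    with torch.no_grad():
        best_action = online_net(next_state).argmax(dim=1, keepdim=True)
        next_q = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * next_q * (1 - done)
```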

2. Google Colab GPU (Training Time)

Training time is an essential consideration in deep learning, since learning whatever process the model is applied to takes a long time. In many situations a GPU is a must, as it dramatically increases training speed, freeing the remaining time for refining the model and tweaking additional changes.

Using Google Colab for its accessible cloud GPU allows quicker training, but the cloud environment is not well suited to environments that require rendering. For CarRacing-v0, calling the usual "env.render()" after an episode was not possible. After spending a fair amount of time configuring the environment on Google Colab, the best that could be achieved was rendering the last episode after the run by creating a "show_video()" helper. This was not practical, as it did not allow each iteration and episode to be checked and analyzed closely.
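
The workaround amounts to recording episodes to video files and playing them back afterwards. A rough sketch of the pattern commonly used at the time (the virtual-display setup and the helper names, mirroring the show_video() mentioned above, are assumptions about the Colab configuration):

```python
import base64
import glob
import io

import gym
from gym.wrappers import Monitor
from IPython.display import HTML, display
from pyvirtualdisplay import Display

# Colab has no real display, so rendering goes to a virtual one.
Display(visible=0, size=(600, 400)).start()

def wrap_env(env):
    # Record episodes as .mp4 files instead of rendering them live.
    return Monitor(env, "./video", force=True)

def show_video():
    # Embed the most recently recorded episode in the notebook output.
    mp4 = sorted(glob.glob("video/*.mp4"))[-1]
    encoded = base64.b64encode(io.open(mp4, "rb").read()).decode("ascii")
    display(HTML(f'<video autoplay controls width="400" '
                 f'src="data:video/mp4;base64,{encoded}"></video>'))

env = wrap_env(gym.make("CarRacing-v0"))
```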

3. Environmental Flaws

While OpenAI offers plenty of open-source environments that are easily accessible to the general public, CarRacing-v0 itself is not especially well developed. The awkward placement of the information bar at the bottom of the frame can hinder the agent's performance, the poorly tuned physics of the track push the vehicle into spins, and the reward structure, along with other flaws, undermines the overall design of the environment, making it challenging for the agent to train.

4. More Exploration (Epsilon)

For any variation of a Q-Learning algorithm, exploration is how learning occurs. Allowing more epsilon lets the model learn more about the environment and how to solve it. Theoretically, enough training and exploration should push any model to learn an environment well enough to solve it.

5. Referencing the Work of Others

This project occupies a niche within the "OpenAI gym environments" community. As explained, using a DQN for a task like this is not considered the best course of action by any metric. Yet other papers have tried this method with varying levels of success. For instance, "Reinforcement Learning for a Simple Racing Game," published in December 2018 by two Stanford researchers, Pablo Aldape and Samuel Sowell, attempted to use DQNs on this same environment.

Aldape and Sowell did not crop the image in their project; they kept the full 96 x 96 frame, with 9,216 input nodes. They also did not grayscale the image or keep the full RGB coloring; instead, each node took in the green channel of its respective pixel [1]. This seems like a complicated form of color manipulation when ordinary grayscale conversion would have been easier and possibly more effective. Furthermore, cropping could have been employed to better effect, as the colored meter bar pollutes the otherwise green frame, and an 84 x 84 crop would have removed it.

Yet, there were also a few similarities between their project and this one. For example, the usage of a pre-trained model as a form of training was shared. Additionally, both models did not surpass the 900 reward threshold, as neither had the computational power or time to solve the environment fully.

There are no strong published examples of solving the CarRacing-v0 environment with a DQN, so the next closest work on this environment is by Dancette, a French pre-PhD student. In his write-up, Dancette used a convolutional neural network and thoroughly described in the conclusion how the model was trained. He notes that the network learns to recognize shapes in order to keep the car on the desired path [4], an approach arguably better suited to this environment than a DQN, since the most important part of the task is recognizing the track and staying within its boundaries.

These three models converged to different maximum reward ranges, summarized below. While this DQN was superior to that of Sowell and Aldape, Dancette's neural network model was significantly more efficient and practical.

This DQN Model: 300–350

Sowell and Aldape’s DQN Model [1]: 150–200

Dancette’s Neural Network Model [4]: 450–500

6. Future Outlook

It may be beneficial to use other techniques for this environment, for instance Double DQNs or Proximal Policy Optimization (PPO). If DQNs were used again for this environment, more epsilon, more episodes, and more training would be required, though this would be less practical than switching to more reliable techniques. DQNs may be better suited to autonomous vehicle environments that are shorter and more predictable. These reinforcement learning techniques should also be tested in more advanced 3D environments: CarRacing-v0 is simple compared to other simulated autonomous vehicle environments. By training a wide variety of deep learning models, such as neural networks and policy gradients, on several 3D vehicle environments, advances in autonomous vehicles can be made.

References

[1] Aldape, P., & Sowell, S. (2018, December 18). Reinforcement Learning for a Simple Racing Game. Retrieved September 14, 2020, from https://web.stanford.edu/class/aa228/reports/2018/final150.pdf/

[2] Chen, Z., & Liu, B. (2018). Lifelong Machine Learning, Second Edition. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool. www.morganclaypool.com/doi/10.2200/S00832ED1V01Y201802AIM037

[3] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013, December 19). Playing Atari with Deep Reinforcement Learning. ArXiv.Org. https://arxiv.org/abs/1312.5602

[4] Dancette, C. (2018, April 09). [Tutoriel] Conduite autonome par imitation grâce à un réseau de con… Retrieved September 10, 2020, from https://cdancette.fr/2018/04/09/self-driving-CNN/

[5] Gym: A toolkit for developing and comparing reinforcement learning algorithms. (n.d.). OpenAI. Retrieved January 20, 2021, from https://gym.openai.com/

[6] Simeone, O. (2018). A Very Brief Introduction to Machine Learning With Applications to Communication Systems. IEEE Transactions on Cognitive Communications and Networking, 4(4), 648–664. https://doi.org/10.1109/tccn.2018.2881442

[7] Krose, B., Krose, B. J. A., Smagt, P. V., & Smagt, P. (1993). An introduction to neural networks. Journal of Computer Science, 1–135. https://www.researchgate.net/publication/272832321_An_introduction_to_neural_networks

[8] François-Lavet, V., Henderson, P., Islam, R., Bellemare, M. G., & Pineau, J. (2018). An Introduction to Deep Reinforcement Learning. Foundations and Trends® in Machine Learning, 11(3–4), 219–354. https://doi.org/10.1561/2200000071

[9] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., … Chintala, S. (2019, December 3). PyTorch: An Imperative Style, High-Performance Deep Learning Library. ArXiv.Org. https://arxiv.org/abs/1912.01703

[10] Fine Tuning Torchvision Models. (n.d.). Retrieved September 14, 2020, from https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html

[11] van Hasselt, H., Guez, A., & Silver, D. (2015, September 22). Deep Reinforcement Learning with Double Q-learning. ArXiv.Org. https://arxiv.org/abs/1509.06461

[12] Torchvision.models. (n.d.). Retrieved September 14, 2020, from https://pytorch.org/docs/stable/torchvision/models.html

[13] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236

[14] What is PyTorch? (n.d.). Retrieved September 14, 2020, from https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html

[15] Janai, J., Güney, F., Behl, A., & Geiger, A. (2017, April 18). Computer Vision for Autonomous Vehicles: Problems, Datasets and State of the Art. ArXiv.Org. https://arxiv.org/abs/1704.05519

[16] Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.

[17] G. Neubig, C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A. Anastasopoulos, M. Ballesteros, D. Chiang, D. Clothiaux, T. Cohn, K. Duh, M. Faruqui, C. Gan, D. Garrette, Y. Ji, L. Kong, A. Kuncoro, G. Kumar, C. Malaviya, P. Michel, Y. Oda, M. Richardson, N. Saphra, S. Swayamdipta, and P. Yin. DyNet: The Dynamic Neural Network Toolkit. ArXiv e-prints, January 2017.

[18] Narasimhan, K., Kulkarni, T., & Barzilay, R. (2015, June 30). Language Understanding for Text-based Games Using Deep Reinforcement Learning. ArXiv.Org. https://arxiv.org/abs/1506.08941

[19] Lin, Jhang, Lee, Lin, & Young. (2019). Using a Reinforcement Q-Learning-Based Deep Neural Network for Playing Video Games. Electronics, 8(10), 1128. https://doi.org/10.3390/electronics8101128
