We have seen that computers can now learn to play Atari games on their own and beat world champions at Go, simulated quadrupeds are learning to run and leap, and robots are learning to perform complex manipulation tasks that defy explicit programming. In this project, we aim to apply deep reinforcement learning, building on previous studies, to train a model that masters Pong from raw pixels, and to tune the parameters so that we obtain the best model in the shortest training time.
Reinforcement learning solves the difficult problem of correlating immediate actions with the delayed returns they produce. Like humans, reinforcement learning algorithms sometimes have to wait a while to see the fruit of their decisions: they operate in a delayed-return environment, where it can be difficult to tell which action led to which outcome over many time steps. We can expect reinforcement learning algorithms to perform better and better in ambiguous, real-life environments, choosing from an arbitrary number of possible actions rather than from the limited options of a video game. That is, with time we expect them to be valuable for achieving goals in the real world.
In the reinforcement learning framework, the agent receives an observation of the environment. According to its inner mechanics, the agent then takes an action based on this observation. That action changes the environment, and the environment feeds back a reward and a new observation to the agent. This cycle is shown in Figure 1.
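The toy sketch below illustrates this observation-action-reward cycle. The `Environment` and `Agent` classes are hypothetical stand-ins written only for illustration; they are not part of any library.

```python
# A toy version of the cycle in Figure 1. `Environment` and `Agent` are
# hypothetical stand-ins for illustration, not part of any library.

class Environment:
    def reset(self):
        return 0.0                              # initial observation

    def step(self, action):
        observation = 0.5 * action              # new observation caused by the action
        reward = 1.0 if action > 0 else -1.0    # feedback for that action
        return observation, reward


class Agent:
    def act(self, observation):
        return 1 if observation >= 0 else -1    # inner mechanics: observation -> action


env, agent = Environment(), Agent()
observation = env.reset()
for t in range(10):
    action = agent.act(observation)             # agent acts on its current observation
    observation, reward = env.step(action)      # environment feeds back reward and new observation
```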
In this project, we will use a policy gradient method, which produces an action directly from the observation (Murphy). A popular solution is the Actor-Critic framework, in which there is a policy function, denoted p(a|s), and a critic function, denoted Q(a, s), that evaluates the value of a given action in a given state. We use a deep neural network to approximate p(a|s); its output is the probability of adopting an action. For Q(a, s), we consider only the rewards. We denote

Q(a, S_i) = Σ_{m=i}^{i+k} γ^(m−i) r_m,

in which a is an action, S_i is an observation, γ is a discount factor, and r_m is the reward earned at position m of the series, whose length is i + k. The cost function we will use is

L = −Σ_i Q(a_i, S_i) log p(a_i | S_i).
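As a concrete sketch of these two quantities, the snippet below computes the discounted returns Q(a, S_i) for one series of rewards and the corresponding Q-weighted negative log-probability cost. The function names, gamma value, and sample numbers are our own illustrations, not taken from any particular library.

```python
import numpy as np

# Illustrative sketch of the discounted return and the policy gradient cost
# defined above. Names and sample values are placeholders.

def discounted_returns(rewards, gamma=0.99):
    """Q(a, S_i) = sum_{m=i}^{i+k} gamma^(m-i) * r_m, computed for every i."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for m in reversed(range(len(rewards))):
        running = rewards[m] + gamma * running
        returns[m] = running
    return returns

def policy_gradient_cost(taken_action_probs, returns):
    """L = -sum_i Q(a_i, S_i) * log p(a_i | S_i)."""
    return -np.sum(returns * np.log(taken_action_probs))

rewards = np.array([0.0, 0.0, 1.0])             # rewards of one short series
taken_action_probs = np.array([0.7, 0.6, 0.8])  # p(a_i | S_i) for the actions actually taken
Q = discounted_returns(rewards)
cost = policy_gradient_cost(taken_action_probs, Q)
print(Q, cost)
```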
Andrej trains the agent to beat the computer by building a neural network that takes in each image frame and outputs a command to move the paddle up or down.
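A rough sketch of such a network, in the spirit of Andrej's numpy implementation, is shown below. The 80x80 input size, the 200 hidden units, and the random weights are illustrative assumptions, not the exact values of his solution.

```python
import numpy as np

# Sketch of a two-layer policy network: a preprocessed frame goes in, the
# probability of the UP command comes out. Sizes and weights are assumptions.

D, H = 80 * 80, 200                          # flattened frame size, hidden units
model = {
    "W1": np.random.randn(H, D) / np.sqrt(D),
    "W2": np.random.randn(H) / np.sqrt(H),
}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def policy_forward(x):
    h = np.maximum(0, model["W1"] @ x)       # ReLU hidden layer
    p_up = sigmoid(model["W2"] @ h)          # probability of the UP command
    return p_up, h

x = np.random.rand(D)                        # stand-in for a preprocessed 80x80 frame
p_up, _ = policy_forward(x)
command = "UP" if np.random.uniform() < p_up else "DOWN"
```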
Our neural network, based heavily on Andrej’s solution, will do the following:
OpenAI Gym: Gym is a toolkit for developing and comparing reinforcement learning algorithms. It makes no assumptions about the structure of your agent, and is compatible with any numerical computation library, such as TensorFlow or Theano. The gym library is a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.
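A short usage sketch of this shared interface for Pong is given below, assuming the classic Gym API (step returning a 4-tuple) and that the Atari dependencies for "Pong-v0" are installed.

```python
import gym

# Usage sketch of the shared Gym interface, assuming the classic API and an
# installed Atari environment. A random agent is used just to exercise it.
env = gym.make("Pong-v0")
observation = env.reset()                    # RGB frame, shape (210, 160, 3)

done = False
while not done:
    action = env.action_space.sample()       # random action, no learning yet
    observation, reward, done, info = env.step(action)
env.close()
```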
Found some bugs? Please let us know!