We have seen that computers can now learn to play Atari games on their own and beat world champions at Go, simulated quadrupeds are learning to run and leap, and robots are learning to perform complex manipulation tasks that defy explicit programming. In this project, we aim to apply deep reinforcement learning, building on previous studies, to train a model that masters Pong from raw pixels, and to tune the parameters so that we obtain the best model in the shortest training time.
Reinforcement learning solves the difficult problem of correlating immediate actions with the delayed returns they produce. Like humans, reinforcement learning algorithms sometimes have to wait a while to see the fruit of their decisions: they operate in a delayed-return environment, where it can be difficult to tell which action led to which outcome over many time steps. We can expect reinforcement learning algorithms to perform better and better in ambiguous, real-life environments, choosing from an arbitrary number of possible actions rather than from the limited options of a video game. That is, with time we expect them to be valuable for achieving goals in the real world.
In the reinforcement learning framework, the agent receives an observation of the environment. According to its inner mechanics, the agent then takes an action based on this observation. That action changes the environment, and the environment feeds back a reward and a new observation to the agent. This cycle is shown in Figure 1.
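The toy sketch below illustrates this observation-action-reward cycle. The `Environment` and `Agent` classes are hypothetical stand-ins written only for illustration; they are not part of any library.

```python
# A toy version of the cycle in Figure 1. `Environment` and `Agent` are
# hypothetical stand-ins for illustration, not part of any library.

class Environment:
    def reset(self):
        return 0.0                              # initial observation

    def step(self, action):
        observation = 0.5 * action              # new observation caused by the action
        reward = 1.0 if action > 0 else -1.0    # feedback for that action
        return observation, reward


class Agent:
    def act(self, observation):
        return 1 if observation >= 0 else -1    # inner mechanics: observation -> action


env, agent = Environment(), Agent()
observation = env.reset()
for t in range(10):
    action = agent.act(observation)             # agent acts on its current observation
    observation, reward = env.step(action)      # environment feeds back reward and new observation
```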
In this project, we will use a policy gradient method, which produces an action directly from the observation (Murphy). A popular solution is the Actor-Critic framework, in which there is a policy function, denoted p(a|s), and a critic function, denoted Q(a, s), that evaluates the value of a given action in a given state. We use a deep neural network to approximate p(a|s); its output is the probability of adopting an action. For Q(a, s), we consider only the rewards. We denote

Q(a, S_i) = Σ_{m=i}^{i+k} γ^(m−i) r_m,

in which a is an action, S_i is an observation, γ is a discount factor, and r_m is the reward earned at position m of the series, whose length is i + k. The cost function we will use is

L = −Σ_i Q(a_i, S_i) log p(a_i | S_i).
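As a concrete sketch of these two quantities, the snippet below computes the discounted returns Q(a, S_i) for one series of rewards and the corresponding Q-weighted negative log-probability cost. The function names, gamma value, and sample numbers are our own illustrations, not taken from any particular library.

```python
import numpy as np

# Illustrative sketch of the discounted return and the policy gradient cost
# defined above. Names and sample values are placeholders.

def discounted_returns(rewards, gamma=0.99):
    """Q(a, S_i) = sum_{m=i}^{i+k} gamma^(m-i) * r_m, computed for every i."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for m in reversed(range(len(rewards))):
        running = rewards[m] + gamma * running
        returns[m] = running
    return returns

def policy_gradient_cost(taken_action_probs, returns):
    """L = -sum_i Q(a_i, S_i) * log p(a_i | S_i)."""
    return -np.sum(returns * np.log(taken_action_probs))

rewards = np.array([0.0, 0.0, 1.0])             # rewards of one short series
taken_action_probs = np.array([0.7, 0.6, 0.8])  # p(a_i | S_i) for the actions actually taken
Q = discounted_returns(rewards)
cost = policy_gradient_cost(taken_action_probs, Q)
print(Q, cost)
```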
Andrej trains the agent to beat the computer by building a neural network that takes in each image frame and outputs a command to move the paddle up or down.
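A rough sketch of such a network, in the spirit of Andrej's numpy implementation, is shown below. The 80x80 input size, the 200 hidden units, and the random weights are illustrative assumptions, not the exact values of his solution.

```python
import numpy as np

# Sketch of a two-layer policy network: a preprocessed frame goes in, the
# probability of the UP command comes out. Sizes and weights are assumptions.

D, H = 80 * 80, 200                          # flattened frame size, hidden units
model = {
    "W1": np.random.randn(H, D) / np.sqrt(D),
    "W2": np.random.randn(H) / np.sqrt(H),
}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def policy_forward(x):
    h = np.maximum(0, model["W1"] @ x)       # ReLU hidden layer
    p_up = sigmoid(model["W2"] @ h)          # probability of the UP command
    return p_up, h

x = np.random.rand(D)                        # stand-in for a preprocessed 80x80 frame
p_up, _ = policy_forward(x)
command = "UP" if np.random.uniform() < p_up else "DOWN"
```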
Our neural network, based heavily on Andrej’s solution, will do the following:
OpenAI Gym: Gym is a toolkit for developing and comparing reinforcement learning algorithms. It makes no assumptions about the structure of your agent, and is compatible with any numerical computation library, such as TensorFlow or Theano. The gym library is a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.
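A short usage sketch of this shared interface for Pong is given below, assuming the classic Gym API (step returning a 4-tuple) and that the Atari dependencies for "Pong-v0" are installed.

```python
import gym

# Usage sketch of the shared Gym interface, assuming the classic API and an
# installed Atari environment. A random agent is used just to exercise it.
env = gym.make("Pong-v0")
observation = env.reset()                    # RGB frame, shape (210, 160, 3)

done = False
while not done:
    action = env.action_space.sample()       # random action, no learning yet
    observation, reward, done, info = env.step(action)
env.close()
```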
Found some bugs? Please let us know!