Diving deep into Reinforcement Learning

By Antonio Lisi

Intro

Hello everyone, this is the third post on reinforcement learning and the start of a new series focusing on continuous action environments.

In this post, we will implement DDPG from scratch, and we will try to solve Pendulum and Lunar Lander.

Why Continuous Action Spaces?

Before proceeding further, we need to motivate why we're moving to continuous-action environments. Until now, we have always solved Pong, but you could use the same algorithms, approach, and even all the code to solve any other Atari game.

But many real-world applications of reinforcement learning require an agent to select actions from continuous spaces. Autonomous robotics is one example, and the same goes for autonomous driving, which is, by the way, one of the hottest topics in the automotive industry.

Environments to solve

As mentioned in the introduction, we'll solve two environments from the OpenAI Gym library.

Pendulum

We start with Pendulum, a classic environment to begin with.

The goal is to keep a frictionless pendulum standing up. We can see the observation and action characteristics with the following code:
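The inspection snippet isn't shown here, but a minimal reconstruction might look like this (assumption: the environment id is Pendulum-v0, or Pendulum-v1 on newer Gym releases):

```python
import gym

# Fall back to the newer id if the classic one is unavailable
try:
    env = gym.make("Pendulum-v0")
except Exception:
    env = gym.make("Pendulum-v1")

print(env.observation_space.shape)                  # 3 observation values
print(env.action_space.shape)                       # 1 action value
print(env.action_space.low, env.action_space.high)  # the action bounds
```

The same two attributes, observation_space and action_space, work for any Gym environment.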

The input consists of three observations. Looking at the documentation, we can see that they represent the angle of the pendulum (cosine and sine of theta) and its angular velocity (theta dot). On the other hand, the action is a single value between -2.0 and 2.0 representing the amount of left or right force applied to the pendulum. The precise equation for the reward is:

-(theta^2 + 0.1*theta_dot^2 + 0.001*action^2)
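A quick sanity check of this formula in plain Python (the helper name is mine, not from the article's code):

```python
def pendulum_reward(theta, theta_dot, action):
    # Negated cost: 0 is the best possible reward, reached when the
    # pendulum is upright (theta = 0), still, and unactuated
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * action ** 2)

print(pendulum_reward(0.0, 0.0, 0.0))      # best case: zero
print(pendulum_reward(3.14159, 0.0, 2.0))  # hanging straight down at full torque
```

So the agent is always penalized, and the penalty shrinks to zero only as it balances the pendulum upright with minimal effort.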

Lunar Lander

Another classical environment to solve is Lunar Lander (in its continuous version).

The game's main goal is to direct the agent to the landing pad as softly and fuel-efficiently as possible. From the documentation, we know that the landing pad is always at coordinates (0,0). As before, we can see the input and action characteristics using the same code:

The input consists of 8 values:

  1. x coordinate of the lander
  2. y coordinate of the lander
  3. vx, the horizontal velocity
  4. vy, the vertical velocity
  5. θ, the orientation in space
  6. vθ, the angular velocity
  7. Left leg touching the ground (Boolean)
  8. Right leg touching the ground (Boolean)

The action is an array of two values, each ranging from -1 to +1. The first controls the main engine: below 0 the engine is off, and from 0 to 1.0 its power scales from 50% to 100% (the engine can't work at less than 50% power). The second value controls the left and right engines: from -1.0 to -0.5 it fires the left engine; from 0.5 to 1.0 it fires the right engine; between -0.5 and 0.5 both are off.
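The throttle rules above can be sketched as a small helper (a hypothetical function, not part of the environment's API; it only mirrors the documented behaviour):

```python
def decode_lander_action(action):
    """Translate a two-value action into (main engine power, lateral engine)."""
    main, lateral = action
    # Main engine: off below 0; otherwise power scales linearly from 50% to 100%
    main_power = 0.0 if main < 0 else 0.5 + 0.5 * main
    # Lateral engines: left below -0.5, right above 0.5, off in between
    if lateral <= -0.5:
        side = "left"
    elif lateral >= 0.5:
        side = "right"
    else:
        side = "off"
    return main_power, side

print(decode_lander_action([0.0, 0.0]))   # (0.5, 'off')
print(decode_lander_action([-1.0, 0.8]))  # (0.0, 'right')
```

The environment itself does this decoding internally; the sketch is only to make the thresholds concrete.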

The reward for moving from the top of the screen to the landing pad with zero speed ranges from 100 to 140 points. If the lander moves away from the landing pad, it loses that reward. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg touching the ground is worth +10 points. Firing the main engine costs -0.3 points per frame.

Deep Deterministic Policy Gradients

The DDPG algorithm (Deep Deterministic Policy Gradients) was introduced in 2015 by Timothy P. Lillicrap et al. in the paper Continuous Control with Deep Reinforcement Learning. It belongs to the Actor-Critic family, but its policy is deterministic (same input, same output/action). DDPG also shares some ideas with DQN: in particular, the network is trained off-policy with samples from a replay buffer (to break correlations between samples) and uses a target Q-network.

In summary, DDPG shares with DQN the off-policy training from a replay buffer and the target networks, while adopting the Actor-Critic approach with a deterministic policy. All this may look a bit complex now, but it will become clearer in the next section when we see the code.

Replay Buffer

Let's start with the simplest part, the replay buffer:

class ReplayBuffer():
    def __init__(self, env, buffer_capacity=BUFFER_CAPACITY, batch_size=BATCH_SIZE, min_size_buffer=MIN_SIZE_BUFFER):
        self.buffer_capacity = buffer_capacity
        self.batch_size = batch_size
        self.min_size_buffer = min_size_buffer
        self.buffer_counter = 0
        self.n_games = 0

        self.states = np.zeros((self.buffer_capacity, env.observation_space.shape[0]))
        self.actions = np.zeros((self.buffer_capacity, env.action_space.shape[0]))
        self.rewards = np.zeros((self.buffer_capacity))
        self.next_states = np.zeros((self.buffer_capacity, env.observation_space.shape[0]))
        self.dones = np.zeros((self.buffer_capacity), dtype=bool)

    def __len__(self):
        return self.buffer_counter

    def add_record(self, state, action, reward, next_state, done):
        # Wrap the index around when the counter reaches buffer_capacity,
        # so the oldest entries are overwritten first
        index = self.buffer_counter % self.buffer_capacity

        self.states[index] = state
        self.actions[index] = action
        self.rewards[index] = reward
        self.next_states[index] = next_state
        self.dones[index] = done

        # Update the counter every time we record something
        self.buffer_counter += 1

    def check_buffer_size(self):
        return self.buffer_counter >= self.batch_size and self.buffer_counter >= self.min_size_buffer

    def update_n_games(self):
        self.n_games += 1

    def get_minibatch(self):
        # If the counter is below capacity, don't sample the empty (zero) records;
        # if it is above, don't index with the counter itself,
        # because older records have been overwritten by newer ones
        buffer_range = min(self.buffer_counter, self.buffer_capacity)
        batch_index = np.random.choice(buffer_range, self.batch_size, replace=False)

        # Take the sampled indices
        state = self.states[batch_index]
        action = self.actions[batch_index]
        reward = self.rewards[batch_index]
        next_state = self.next_states[batch_index]
        done = self.dones[batch_index]

        return state, action, reward, next_state, done

    def save(self, folder_name):
        """
        Save the replay buffer
        """
        if not os.path.isdir(folder_name):
            os.mkdir(folder_name)

        np.save(folder_name + '/states.npy', self.states)
        np.save(folder_name + '/actions.npy', self.actions)
        np.save(folder_name + '/rewards.npy', self.rewards)
        np.save(folder_name + '/next_states.npy', self.next_states)
        np.save(folder_name + '/dones.npy', self.dones)

        dict_info = {"buffer_counter": self.buffer_counter, "n_games": self.n_games}

        with open(folder_name + '/dict_info.json', 'w') as f:
            json.dump(dict_info, f)

    def load(self, folder_name):
        """
        Load the replay buffer
        """
        self.states = np.load(folder_name + '/states.npy')
        self.actions = np.load(folder_name + '/actions.npy')
        self.rewards = np.load(folder_name + '/rewards.npy')
        self.next_states = np.load(folder_name + '/next_states.npy')
        self.dones = np.load(folder_name + '/dones.npy')

        with open(folder_name + '/dict_info.json', 'r') as f:
            dict_info = json.load(f)
        self.buffer_counter = dict_info["buffer_counter"]
        self.n_games = dict_info["n_games"]

The concept is the same as in the DDDQN article. We store all the states, actions, rewards, next states, and terminal flags derived from the interaction with the environment by calling the add_record method. Then the get_minibatch method returns a random subset of these observations.

Note the check_buffer_size method, which makes sure the buffer contains at least as many records as both the batch size and the minimum size defined in the config file.

Finally, we have the save and load methods for the buffer. These are particularly useful when you need to stop and restart the training (if you're using Google Colab, this will save your life).

Networks

As mentioned when introducing the DDPG algorithm, we have the standard Actor and Critic, each in a trained and a target version. So we actually have four neural networks: two Actors and two Critics.

Let's start with the code to define the Actors.

Actor

class Actor(tf.keras.Model):
    def __init__(self, name, actions_dim, upper_bound, hidden_0=CRITIC_HIDDEN_0, hidden_1=CRITIC_HIDDEN_1,
                 init_minval=INIT_MINVAL, init_maxval=INIT_MAXVAL):
        super(Actor, self).__init__()
        self.hidden_0 = hidden_0
        self.hidden_1 = hidden_1
        self.actions_dim = actions_dim
        self.init_minval = init_minval
        self.init_maxval = init_maxval
        self.upper_bound = upper_bound

        self.net_name = name

        self.dense_0 = Dense(self.hidden_0, activation='relu')
        self.dense_1 = Dense(self.hidden_1, activation='relu')
        self.policy = Dense(self.actions_dim,
                            kernel_initializer=random_uniform(minval=self.init_minval, maxval=self.init_maxval),
                            activation='tanh')

    def call(self, state):
        x = self.dense_0(state)
        x = self.dense_1(x)
        policy = self.policy(x)

        return policy * self.upper_bound

As you can see, it's a pretty simple network, with two hidden layers and a final layer with a tanh activation. Just as a reminder, the tanh function is defined as:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

You can find the definition of all the activation functions here.

Its outputs range from -1 to 1, so we need to multiply by the action's upper bound. For example, for the Pendulum environment, we multiply the output of the tanh function by two to get an action between -2 and 2 as required by the environment. For Lunar Lander we don't actually need it, because the action space is already limited to -1 to 1 in both dimensions.
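For instance, rescaling a tanh output to the Pendulum bound looks like this (standalone numpy, standing in for the network's last layer):

```python
import numpy as np

upper_bound = 2.0                           # Pendulum's action bound
raw = np.tanh(np.array([-5.0, 0.0, 5.0]))   # tanh outputs land in (-1, 1)
action = raw * upper_bound                  # rescaled into (-2, 2)
print(action)
```

Whatever the pre-activation value, the final action can never leave the environment's valid range.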

As you can see, the Actor's output is deterministic: the same input (state) always produces the same output vector (action).

Something different here is the initialization of the last layer's weights: we draw them from a uniform distribution between a min and a max value (+/-0.005 in our case). Since we use a tanh activation, this keeps the pre-activations small and prevents the function from saturating, which would give near-zero gradients in the initial stages of training.

Critic

We can now see the code for the Critic Networks:

class Critic(tf.keras.Model):
    def __init__(self, name, hidden_0=CRITIC_HIDDEN_0, hidden_1=CRITIC_HIDDEN_1):
        super(Critic, self).__init__()
        self.hidden_0 = hidden_0
        self.hidden_1 = hidden_1

        self.net_name = name

        self.dense_0 = Dense(self.hidden_0, activation='relu')
        self.dense_1 = Dense(self.hidden_1, activation='relu')
        self.q_value = Dense(1, activation=None)

    def call(self, state, action):
        state_action_value = self.dense_0(tf.concat([state, action], axis=1))
        state_action_value = self.dense_1(state_action_value)
        q_value = self.q_value(state_action_value)

        return q_value

As always, the Critic estimates the Q-value: the discounted return of the action taken in some state. But here the action is an array, so the network takes both the state and the action as input. As you can see, in the call method the states and actions are concatenated using TensorFlow's concat function. The output is still a single number.

The action is produced by the Actor, so the Critic takes the Actor network's output as part of its input.

Agent

Let's now see the implementation of the Agent's logic.

class Agent:
    def __init__(self, env, actor_lr=ACTOR_LR, critic_lr=CRITIC_LR, gamma=GAMMA, max_size=BUFFER_CAPACITY,
                 tau=TAU, path_save=PATH_SAVE, path_load=PATH_LOAD):
        self.gamma = gamma
        self.tau = tau
        self.replay_buffer = ReplayBuffer(env, max_size)
        self.actions_dim = env.action_space.shape[0]
        self.upper_bound = env.action_space.high[0]
        self.lower_bound = env.action_space.low[0]
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        self.path_save = path_save
        self.path_load = path_load

        self.actor = Actor(name='actor', actions_dim=self.actions_dim, upper_bound=self.upper_bound)
        self.critic = Critic(name='critic')
        self.target_actor = Actor(name='target_actor', actions_dim=self.actions_dim, upper_bound=self.upper_bound)
        self.target_critic = Critic(name='target_critic')

        self.actor.compile(optimizer=opt.Adam(learning_rate=actor_lr))
        self.critic.compile(optimizer=opt.Adam(learning_rate=critic_lr))
        self.target_actor.compile(optimizer=opt.Adam(learning_rate=actor_lr))
        self.target_critic.compile(optimizer=opt.Adam(learning_rate=critic_lr))

        # Start the target networks as exact copies of the trained ones
        actor_weights = self.actor.get_weights()
        critic_weights = self.critic.get_weights()
        self.target_actor.set_weights(actor_weights)
        self.target_critic.set_weights(critic_weights)

        self.noise = np.zeros(self.actions_dim)

Looking at the class inputs, you can see how it takes the environment to derive the actions' dimension and the upper and lower bounds, and to build the replay buffer.

Then we use these dimensions to define the four networks: Actor, Target Actor, Critic, and Target Critic. At the beginning, we simply copy the Actor's and Critic's weights into the target networks.

Target Networks

The target networks are time-delayed copies of the original networks that slowly track their weights. As in DDDQN, we use these target networks to improve learning stability. We'll see how they enter the loss functions; for now, let's see how we update them:

    def update_target_networks(self, tau):
        actor_weights = self.actor.weights
        target_actor_weights = self.target_actor.weights
        for index in range(len(actor_weights)):
            target_actor_weights[index] = tau * actor_weights[index] + (1 - tau) * target_actor_weights[index]

        self.target_actor.set_weights(target_actor_weights)

        critic_weights = self.critic.weights
        target_critic_weights = self.target_critic.weights
        for index in range(len(critic_weights)):
            target_critic_weights[index] = tau * critic_weights[index] + (1 - tau) * target_critic_weights[index]

        self.target_critic.set_weights(target_critic_weights)

As you can see, we update the target networks' weights as a weighted average between the target weights and the trained networks' weights, with tau (0.05 in our config file) as the weight given to the trained networks. Given the low value of tau, this is a "soft" update, but we perform it every time we update the trained networks' weights, as we'll see in the training loop.
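A standalone numpy toy example shows how slowly the target creeps toward the trained weights with tau = 0.05 (the arrays stand in for network weights):

```python
import numpy as np

tau = 0.05
online = np.array([1.0, 1.0])   # trained network's weights
target = np.array([0.0, 0.0])   # target network's weights

for _ in range(3):
    target = tau * online + (1 - tau) * target

# After 3 soft updates the target has covered only 1 - 0.95**3 of the gap
print(target)
```

Even after three updates the target has moved less than 15% of the way, which is exactly the damping effect that stabilizes learning.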

Save and load

As always, we want methods to save and load the networks' weights and the replay buffer:

    def save(self):
        date_now = time.strftime("%Y%m%d%H%M")
        if not os.path.isdir(f"{self.path_save}/save_agent_{date_now}"):
            os.makedirs(f"{self.path_save}/save_agent_{date_now}")
        self.actor.save_weights(f"{self.path_save}/save_agent_{date_now}/{self.actor.net_name}.h5")
        self.target_actor.save_weights(f"{self.path_save}/save_agent_{date_now}/{self.target_actor.net_name}.h5")
        self.critic.save_weights(f"{self.path_save}/save_agent_{date_now}/{self.critic.net_name}.h5")
        self.target_critic.save_weights(f"{self.path_save}/save_agent_{date_now}/{self.target_critic.net_name}.h5")

        np.save(f"{self.path_save}/save_agent_{date_now}/noise.npy", self.noise)

        self.replay_buffer.save(f"{self.path_save}/save_agent_{date_now}")

    def load(self):
        self.actor.load_weights(f"{self.path_load}/{self.actor.net_name}.h5")
        self.target_actor.load_weights(f"{self.path_load}/{self.target_actor.net_name}.h5")
        self.critic.load_weights(f"{self.path_load}/{self.critic.net_name}.h5")
        self.target_critic.load_weights(f"{self.path_load}/{self.target_critic.net_name}.h5")

        self.noise = np.load(f"{self.path_load}/noise.npy")

        self.replay_buffer.load(f"{self.path_load}")

Nothing special here, but notice that we also save the noise variable, which we're about to discuss.

Exploration vs. Exploitation

The price we pay for a deterministic policy is that we need some strategy to explore the environment. The simplest approach would be to add some noise to the action returned by the Actor, for example drawn from a normal distribution. But the original paper used an Ornstein-Uhlenbeck process, so we're going to implement it the same way:

    def _ornstein_uhlenbeck_process(self, x, theta=THETA, mu=0, dt=DT, std=0.2):
        """
        Ornstein-Uhlenbeck process
        """
        return x + theta * (mu - x) * dt + std * np.sqrt(dt) * np.random.normal(size=self.actions_dim)

    def get_action(self, observation, evaluation=False):
        state = tf.convert_to_tensor([observation], dtype=tf.float32)
        actions = self.actor(state)
        if not evaluation:
            # Feed the previous noise value back into the process,
            # so consecutive noise samples are temporally correlated
            self.noise = self._ornstein_uhlenbeck_process(self.noise)
            actions += self.noise

        actions = tf.clip_by_value(actions, self.lower_bound, self.upper_bound)

        return actions[0]

For the Ornstein-Uhlenbeck process, I just used the standard formula you can easily find by googling it. In the get_action method, we add the noise to the action value to create randomness, and finally clip the action using the lower and upper bounds.
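To see what this process does compared to plain Gaussian noise, here's a standalone sketch with the same formula (theta = 0.15 and dt = 0.01 are my assumed values for the THETA and DT constants, not necessarily the config's): starting far from the mean, the noise is pulled back toward mu = 0 while staying temporally correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

def ou_step(x, theta=0.15, mu=0.0, dt=0.01, std=0.2):
    # Mean-reversion term plus a scaled Gaussian increment
    return x + theta * (mu - x) * dt + std * np.sqrt(dt) * rng.normal()

x, trace = 10.0, []
for _ in range(2000):
    x = ou_step(x)
    trace.append(x)

# The trace drifts from 10 back toward 0 instead of jumping around independently
print(trace[0], trace[-1])
```

This mean-reverting, smooth noise makes exploration look like a consistent push in one direction for a while, which suits physical control tasks.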

Loss functions and gradients

Finally, we need to update the weights of the Actor and Critic using the gradients computed from their loss functions.

The Critic loss is the mean squared error between the bootstrapped target computed with the target networks and the Q-value predicted by the Critic network.

The Actor loss is the mean of the Critic's value for the actions taken by the Actor network. We want to maximize this value (we want a higher Q-value), so we add a minus sign to turn it into a loss function.

Looking at the code:

    def learn(self):
        if not self.replay_buffer.check_buffer_size():
            return

        state, action, reward, new_state, done = self.replay_buffer.get_minibatch()

        states = tf.convert_to_tensor(state, dtype=tf.float32)
        new_states = tf.convert_to_tensor(new_state, dtype=tf.float32)
        rewards = tf.convert_to_tensor(reward, dtype=tf.float32)
        actions = tf.convert_to_tensor(action, dtype=tf.float32)
        dones = tf.convert_to_tensor(done, dtype=tf.float32)

        # Critic update: MSE between the bootstrapped target
        # (computed with the target networks) and the Critic's estimate
        with tf.GradientTape() as tape:
            target_actions = self.target_actor(new_states)
            target_critic_values = tf.squeeze(self.target_critic(new_states, target_actions), 1)
            critic_value = tf.squeeze(self.critic(states, actions), 1)
            target = rewards + self.gamma * target_critic_values * (1 - dones)
            critic_loss = tf.keras.losses.MSE(target, critic_value)

        critic_gradient = tape.gradient(critic_loss, self.critic.trainable_variables)
        self.critic.optimizer.apply_gradients(zip(critic_gradient, self.critic.trainable_variables))

        # Actor update: maximize the Critic's value of the Actor's actions
        # (hence the minus sign to turn it into a loss)
        with tf.GradientTape() as tape:
            policy_actions = self.actor(states)
            actor_loss = -self.critic(states, policy_actions)
            actor_loss = tf.math.reduce_mean(actor_loss)

        actor_gradient = tape.gradient(actor_loss, self.actor.trainable_variables)
        self.actor.optimizer.apply_gradients(zip(actor_gradient, self.actor.trainable_variables))

        self.update_target_networks(self.tau)

First, we check that the buffer contains more records than the batch size and the minimum size defined in the config script, using the check_buffer_size() method of the ReplayBuffer class. Then we sample a minibatch from the replay buffer, compute the two losses described above and their gradients, and apply the gradients to the Actor and Critic networks to update the weights. Finally, we "softly" update the target networks' weights using update_target_networks.

Training and results

We can now go to the training loop and the results for the two environments.

config = dict(
    learning_rate_actor=ACTOR_LR,
    learning_rate_critic=CRITIC_LR,
    batch_size=BATCH_SIZE,
    architecture="DDPG",
    infra="Ubuntu",
    env=ENV_NAME
)

wandb.init(
    project=f"tensorflow2_ddpg_{ENV_NAME.lower()}",
    tags=["DDPG", "FCL", "RL"],
    config=config,
)

env = gym.make(ENV_NAME)
agent = Agent(env)

scores = []
evaluation = False   # start with exploration noise enabled

if PATH_LOAD is not None:
    print("loading weights")
    # Call each network once with a dummy input to build it before loading weights
    observation = env.reset()
    action = agent.actor(observation[None, :])
    agent.target_actor(observation[None, :])
    agent.critic(observation[None, :], action)
    agent.target_critic(observation[None, :], action)
    agent.load()
    print(agent.replay_buffer.buffer_counter)
    print(agent.replay_buffer.n_games)
    print(agent.noise)

As always, we start by configuring wandb for logging. If we're loading a previously trained agent, we load the agent and the replay buffer; but before we can load anything, we need to call the four networks once to build them, so we use an initial observation from the environment.

Then we have the training loop:

for _ in tqdm(range(MAX_GAMES)):
    start_time = time.time()
    states = env.reset()
    done = False
    score = 0
    while not done:
        action = agent.get_action(states, evaluation)
        new_states, reward, done, info = env.step(action)
        score += reward
        agent.replay_buffer.add_record(states, action, reward, new_states, done)
        agent.learn()
        states = new_states

    agent.replay_buffer.update_n_games()

    scores.append(score)
    wandb.log({'Game number': agent.replay_buffer.n_games,
               '# Episodes': agent.replay_buffer.buffer_counter,
               "Average reward": round(np.mean(scores[-10:]), 2),
               "Time taken": round(time.time() - start_time, 2)})

    if (_ + 1) % EVALUATION_FREQUENCY == 0:
        # Play one game without exploration noise to measure the policy itself
        evaluation = True
        states = env.reset()
        done = False
        score = 0
        while not done:
            action = agent.get_action(states, evaluation)
            new_states, reward, done, info = env.step(action)
            score += reward
            states = new_states
        wandb.log({'Game number': agent.replay_buffer.n_games,
                   '# Episodes': agent.replay_buffer.buffer_counter,
                   'Evaluation score': score})
        evaluation = False

    if (_ + 1) % SAVE_FREQUENCY == 0:
        print("saving...")
        agent.save()
        print("saved")

In the training loop, we interact with the environment, store the states, actions, rewards, and terminal flags in the replay buffer, and train the agent using the learn() method seen before. When an episode is over, we log the mean score of the last ten games. Every 100 games we also log an evaluation score, where we simply don't add noise to the action, and every 200 games we save the networks' weights, the noise, and the replay buffer.

We can start looking at the Pendulum results, keeping in mind that the OpenAI documentation says that "Best 100-episode average reward was -138.98 ± 9.06. (Pendulum-v0 does not have a specified reward threshold at which it's considered solved.)".

As you can see, the algorithm converges very quickly. After just 300 games, we have performances similar to, if not better than, those in the documentation.

We finish by looking at the Lunar Lander results; in this case, the documentation says that "LunarLander-v2 defines "solving" as getting an average reward of 200 over 100 consecutive trials.".

Here it takes longer to obtain good results: we reach an average score of 200 after almost 150,000 steps. But the episodes are really short, so it wasn't much in training time; we solved the environment in around an hour.

You can find all the code on my GitHub. For any questions, you can reach me through LinkedIn.

If you enjoyed this article, share it with your friends and colleagues! I'll see you in the next post. In the meantime, take care, stay safe, and remember don't be another brick in the wall.

Anton.ai