Reinforcement Learning

📚 Lesson 7 of 15 ⏱️ 100 min


Reinforcement Learning (RL) is a type of machine learning in which an agent learns by interacting with an environment: it takes actions, receives rewards or penalties, and discovers an optimal policy through trial and error. Unlike supervised learning, RL requires no labeled examples; the agent learns from its own experience, much as humans and animals learn from interaction. RL has achieved remarkable success in game playing (AlphaGo, video game AI), robotics, and autonomous systems.

The reward signal is the primary learning signal: the agent receives rewards for good actions and penalties for bad ones, and learns which actions maximize cumulative reward over time. Reward design is crucial, since poorly chosen rewards lead to unintended behavior. Equally fundamental is the exploration-exploitation trade-off: the agent must explore to discover good actions while exploiting the actions it already knows to be good.
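A common way to handle the exploration-exploitation trade-off is an epsilon-greedy rule: with probability epsilon the agent picks a random action (explore), otherwise it picks the action with the highest estimated value (exploit). A minimal sketch, where the Q-values and epsilon settings are purely illustrative:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: any action
    return int(np.argmax(q_values))              # exploit: best-known action

q = np.array([0.1, 0.5, 0.2])       # illustrative Q-values for three actions
print(epsilon_greedy(q, epsilon=0.0))  # epsilon=0 always exploits: prints 1
```

In practice epsilon is often decayed over training, so the agent explores heavily at first and exploits more as its value estimates improve.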

Key approaches include Q-learning (value-based), policy gradients (policy-based), and deep reinforcement learning (combining deep learning with RL). Q-learning learns an action-value function whose Q-values estimate the expected future reward of taking an action in a state. Policy-gradient methods optimize the policy (the action-selection strategy) directly. Deep RL uses neural networks to approximate Q-functions or policies, making RL feasible in high-dimensional state spaces. Each approach has strengths for different problems.
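The Q-learning update itself is a one-line rule: move Q(s, a) toward the observed reward plus the discounted value of the best next action. A minimal sketch, with illustrative learning-rate and discount values:

```python
import numpy as np

def q_update(q_table, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])

q = np.zeros((2, 2))  # two states, two actions, all values start at zero
q_update(q, s=0, a=1, reward=1.0, s_next=1)
print(q[0, 1])  # 0.1 : moved 10% of the way toward the target of 1.0
```

The quantity in parentheses is the temporal-difference error; repeated updates shrink it toward zero as the Q-values converge.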

Markov Decision Processes (MDPs) formalize RL problems in terms of states, actions, rewards, and transition probabilities. MDPs assume the Markov property: the next state depends only on the current state and action, not on the full history. This structure provides the theoretical foundation for RL and enables mathematical analysis and algorithm design. Many RL algorithms assume an MDP, even though real problems may only approximately satisfy the Markov property.
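A small MDP can be written down explicitly as a table mapping each (state, action) pair to a distribution over next states with rewards. This sketch, with states and probabilities invented purely for illustration, computes the expected immediate reward of an action:

```python
# Transition model: (state, action) -> list of (probability, next_state, reward)
# States, actions, and numbers here are invented for illustration.
mdp = {
    ("sunny", "walk"): [(0.8, "sunny", 1.0), (0.2, "rainy", -1.0)],
    ("sunny", "stay"): [(1.0, "sunny", 0.0)],
    ("rainy", "stay"): [(1.0, "rainy", 0.0)],
}

def expected_reward(model, state, action):
    """Expected immediate reward: sum over outcomes of probability * reward."""
    return sum(p * r for p, _, r in model[(state, action)])

print(expected_reward(mdp, "sunny", "walk"))  # 0.8*1.0 + 0.2*(-1.0)
```

Algorithms like value iteration build on exactly this structure, repeatedly backing up expected rewards plus discounted next-state values.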

RL algorithms fall into three broad families. Value-based methods (Q-learning, DQN) learn value functions. Policy-based methods (REINFORCE, PPO) learn policies directly. Actor-critic methods combine the two, using a learned value function (the critic) to guide updates to the policy (the actor). Modern RL typically pairs these methods with deep learning (Deep Q-Networks, policy gradients with neural networks) for complex problems.
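To illustrate the policy-based family, here is a minimal REINFORCE-style update for a two-armed bandit with a softmax policy over logits. The bandit setup, step size, and iteration count are illustrative assumptions, not part of any standard API:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

rng = np.random.default_rng(0)
logits = np.zeros(2)      # policy parameters, one logit per arm
true_means = [0.2, 0.8]   # arm 1 pays more on average (illustrative)

for _ in range(2000):
    probs = softmax(logits)
    a = rng.choice(2, p=probs)                 # sample an action from the policy
    reward = rng.normal(true_means[a], 0.1)    # noisy reward from the chosen arm
    # REINFORCE: grad of log pi(a) is one_hot(a) - probs; step along reward * grad
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    logits += 0.1 * reward * grad_log_pi

print(softmax(logits))  # the policy should now favor the higher-paying arm 1
```

Even in this tiny example the hallmark of policy-gradient methods is visible: the policy is improved directly from sampled rewards, with no value function required.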

RL is used in game playing (chess, Go, video games), robotics (manipulation, locomotion), autonomous systems (self-driving cars, drones), recommendation systems, and resource allocation. It excels when a problem involves sequential decision-making, exploration, and adaptation. Best practices include careful reward design, balancing exploration and exploitation, choosing an algorithm suited to the problem, and recognizing that RL can be sample-inefficient, often requiring many environment interactions.

Key Concepts

  • Reinforcement Learning learns through interaction with environments.
  • RL uses rewards and penalties to guide learning.
  • Q-learning, policy gradients, and deep RL are key approaches.
  • Markov Decision Processes formalize RL problems.
  • RL is used in game playing, robotics, and autonomous systems.

Learning Objectives

Master

  • Understanding reinforcement learning concepts and terminology
  • Implementing Q-learning algorithms
  • Understanding policy gradients and deep RL
  • Applying RL to sequential decision-making problems

Develop

  • RL problem-solving thinking
  • Understanding when to use RL vs other ML approaches
  • Designing effective RL systems

Tips

  • Design rewards carefully—they guide learning behavior.
  • Balance exploration and exploitation for effective learning.
  • Start with simple environments before tackling complex problems.
  • Understand that RL can be sample-inefficient—be patient.

Common Pitfalls

  • Poor reward design, causing agents to learn unintended behavior.
  • Not balancing exploration and exploitation, causing poor learning.
  • Using RL when simpler methods would suffice.
  • Not understanding that RL requires many interactions.

Summary

  • Reinforcement Learning learns through interaction and rewards.
  • Q-learning, policy gradients, and deep RL are key approaches.
  • RL is used for sequential decision-making problems.
  • Understanding RL enables building adaptive, learning systems.
  • RL requires careful reward design and exploration-exploitation balance.

Exercise

Implement a simple Q-learning algorithm for a grid world environment.

import numpy as np
import matplotlib.pyplot as plt

class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.state = 0  # Start at top-left
        self.goal = size * size - 1  # Goal at bottom-right
        self.actions = [0, 1, 2, 3]  # Up, Right, Down, Left
        
    def reset(self):
        self.state = 0
        return self.state
    
    def step(self, action):
        # Convert state to coordinates
        row = self.state // self.size
        col = self.state % self.size
        
        # Apply action
        if action == 0:  # Up
            row = max(0, row - 1)
        elif action == 1:  # Right
            col = min(self.size - 1, col + 1)
        elif action == 2:  # Down
            row = min(self.size - 1, row + 1)
        elif action == 3:  # Left
            col = max(0, col - 1)
        
        # Update state
        self.state = row * self.size + col
        
        # Check if goal reached
        done = self.state == self.goal
        reward = 100 if done else -1
        
        return self.state, reward, done

class QLearningAgent:
    def __init__(self, state_size, action_size, learning_rate=0.1, gamma=0.9, epsilon=0.1):
        self.q_table = np.zeros((state_size, action_size))
        self.action_size = action_size
        self.lr = learning_rate
        self.gamma = gamma      # Discount factor for future rewards
        self.epsilon = epsilon  # Exploration rate

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.random() < self.epsilon:
            return np.random.choice(self.action_size)
        return np.argmax(self.q_table[state])

    def learn(self, state, action, reward, next_state, done):
        old_value = self.q_table[state, action]
        # Terminal states have no future value, so don't bootstrap past them
        next_max = 0.0 if done else np.max(self.q_table[next_state])
        self.q_table[state, action] = old_value + self.lr * (reward + self.gamma * next_max - old_value)

# Training
env = GridWorld(5)
agent = QLearningAgent(state_size=25, action_size=4)
episodes = 1000
rewards_history = []

for episode in range(episodes):
    state = env.reset()
    total_reward = 0
    done = False

    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward

    rewards_history.append(total_reward)

    if episode % 100 == 0:
        print(f"Episode {episode}, Total Reward: {total_reward}")

# Plot training progress
plt.plot(rewards_history)
plt.title('Training Progress')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.show()

print("Training completed!")
