PROJECT TITLE: Building a Superhuman Flappy Bird Agent with Double DQN

PROJECT OVERVIEW

----------------
This project explores Deep Reinforcement Learning by implementing a Double Deep Q-Network (DDQN) agent capable of mastering the Flappy Bird environment (FlappyBird-v0), as well as classic control tasks like CartPole, MountainCar, and LunarLander. The system is built with PyTorch and Gymnasium, featuring a modular design that separates the agent logic, network architecture, and hyperparameter configuration.

1. CORE ARCHITECTURE

--------------------
- Algorithm: The agent uses Double DQN, an improvement over standard DQN that decouples action selection (done by the policy network) from action evaluation (done by the target network) to reduce overestimation bias; see the sketch after this list.

- Neural Network: A flexible Multi-Layer Perceptron (MLP) serves as the function approximator.
  * Input: 12-dimensional state vector (e.g., bird position, velocity, pipe distance).
  * Hidden Layers: Configurable depth and width (e.g., 3 layers x 128 neurons for Flappy Bird), using ReLU activation.
  * Output: Q-values for discrete actions (Flap / Don't Flap).

- Experience Replay: A custom ReplayMemory class (using collections.deque) stores transitions (state, action, reward, next_state, done) to break correlation between consecutive samples during training.
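
A minimal PyTorch sketch of these three pieces is shown below. The layer sizes, the buffer capacity, and the helper name ddqn_targets are illustrative assumptions, not code copied from the repository:

    import random
    from collections import deque, namedtuple

    import torch
    import torch.nn as nn

    Transition = namedtuple("Transition", ("state", "action", "reward", "next_state", "done"))

    class DQN(nn.Module):
        """MLP function approximator: state vector in, one Q-value per action out."""
        def __init__(self, state_dim=12, num_actions=2, hidden_dim=128, num_hidden_layers=3):
            super().__init__()
            layers, in_dim = [], state_dim
            for _ in range(num_hidden_layers):
                layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
                in_dim = hidden_dim
            layers.append(nn.Linear(in_dim, num_actions))
            self.net = nn.Sequential(*layers)

        def forward(self, x):
            return self.net(x)

    class ReplayMemory:
        """Fixed-size FIFO buffer of transitions backed by collections.deque."""
        def __init__(self, capacity):
            self.memory = deque(maxlen=capacity)

        def append(self, transition):
            self.memory.append(transition)

        def sample(self, batch_size):
            return random.sample(self.memory, batch_size)

        def __len__(self):
            return len(self.memory)

    def ddqn_targets(policy_net, target_net, rewards, next_states, dones, gamma=0.99):
        """Double DQN target: the policy network selects the next action,
        the target network evaluates it."""
        with torch.no_grad():
            next_actions = policy_net(next_states).argmax(dim=1, keepdim=True)   # selection
            next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
        return rewards + gamma * next_q * (1.0 - dones)

The key detail is the pairing of argmax on the policy network with gather on the target network: the network that picks the next action is never the one that scores it, which is what suppresses the overestimation bias of vanilla DQN.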

2. KEY TECHNICAL FEATURES

-------------------------
- Dual-Buffer Strategy ("Important Experience"): Unlike standard implementations, this agent maintains a secondary "Important Experience" buffer. It selectively stores transitions from high-performing episodes and mixes them into training batches (e.g., 80% standard / 20% important), accelerating learning by replaying successful strategies; a sampling sketch follows this list.

- Dynamic Exploration (Epsilon-Greedy +):
  * Standard epsilon decay is augmented with an Exploration Probability Mask. Instead of purely uniform random actions during exploration, the agent uses a biased distribution (e.g., preferring "do nothing" slightly more than "flap") to mimic safer random behavior.
  * Recursive Exploration: An experimental "explore" function recursively searches for better rewards from a known good state.

- Target Network Synchronization: A separate target network stabilizes training, with weights synced from the policy network every N steps (configurable per environment).

- Adaptive Memory Management: The agent implements logic to flush a percentage of old memories when a new "best reward" is achieved, ensuring the replay buffer stays relevant to the current policy's capability.
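
The mixed sampling, the biased exploration mask, and the memory flushing could look roughly like the sketch below. The 80/20 split, the 90/10 action weights, the flush fraction, and the function names are assumptions for illustration, not values taken from the project:

    import random
    from collections import deque

    IMPORTANT_FRACTION = 0.2      # share of each training batch drawn from the "important" buffer
    ACTION_WEIGHTS = [0.9, 0.1]   # biased exploration: action 0 = do nothing, action 1 = flap

    def sample_mixed_batch(standard: deque, important: deque, batch_size: int):
        """Draw a batch mixing ordinary transitions with 'important' ones
        recorded during high-scoring episodes."""
        n_important = min(int(batch_size * IMPORTANT_FRACTION), len(important))
        n_standard = batch_size - n_important
        batch = random.sample(list(standard), n_standard) + random.sample(list(important), n_important)
        random.shuffle(batch)
        return batch

    def explore_action():
        """Biased random action: prefer 'do nothing' so random play survives longer
        than with uniform action sampling."""
        return random.choices([0, 1], weights=ACTION_WEIGHTS, k=1)[0]

    def flush_old_memories(memory: deque, fraction: float = 0.3):
        """Drop the oldest share of transitions after a new best reward, keeping
        the buffer relevant to what the current policy can achieve."""
        for _ in range(int(len(memory) * fraction)):
            memory.popleft()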

3. CONFIGURATION & TOOLING

--------------------------
- YAML-Based Hyperparameters: All training parameters (learning rate, discount factor, batch size, epsilon decay) are decoupled into a "hyperparameters.yml" file. This allows rapid experimentation across different environments (e.g., flappybird_1 vs. cartpole_1) without changing code; a loading sketch follows this list.

- Comprehensive Logging: The system automatically logs training metrics:
  * CSV/Pandas: Detailed episode-by-episode statistics.
  * Matplotlib: Real-time generation of Reward vs. Episode and Epsilon decay graphs.
  * Model Checkpointing: Automatically saves the "Best", "Intermediate", and "Young" model weights (.pt) and experience buffers (.pkl) for resuming training.
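
Loading such a file with PyYAML might look like the snippet below; the section and key names are assumptions based on the parameters listed above, not the exact schema of hyperparameters.yml:

    import yaml

    with open("hyperparameters.yml") as f:
        all_params = yaml.safe_load(f)

    params = all_params["flappybird_1"]         # or "cartpole_1", etc.
    learning_rate = params["learning_rate"]     # hypothetical key names
    discount_factor = params["discount_factor"]
    batch_size = params["batch_size"]
    epsilon_decay = params["epsilon_decay"]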

4. RESULTS

----------
The agent successfully converges to high scores in the Flappy Bird environment. The implementation demonstrates how enhancements like Double DQN and Prioritized/Important Experience Replay can significantly improve stability and sample efficiency in environments with sparse or delayed rewards.

Figure: Reward sign and magnitude distribution for different action-mask settings (action 0 = do nothing, action 1 = flap). The insight is that for environments like Flappy Bird, random exploration actions should be sampled from a non-uniform probability mask for faster learning, because uniform sampling most often causes the bird to flap into an upper pipe.

Figure: Epsilon-greedy scheduling strategy. The epsilon schedule alternately decays and rises indefinitely, mimicking how humans learn by trying new things and then refining existing skills.
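
The project's exact schedule is not reproduced here; one way to realize a schedule that decays overall yet keeps rising again is exponential decay modulated by a slow cosine cycle (all values illustrative):

    import math

    EPS_MIN, EPS_MAX, DECAY, PERIOD = 0.05, 1.0, 0.9995, 5000   # illustrative values

    def epsilon(step: int) -> float:
        """Exponential decay plus a slow cosine 'curiosity' cycle, so exploration
        never settles permanently at the floor."""
        base = max(EPS_MIN, EPS_MAX * DECAY ** step)
        bump = 0.5 * (1.0 + math.cos(2.0 * math.pi * step / PERIOD))   # oscillates in [0, 1]
        return min(EPS_MAX, base + 0.1 * bump)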

TECH STACK

----------
Python, PyTorch, Gymnasium, Flappy-Bird-Gymnasium, NumPy, Pandas.