Space_RL
DDQN Space Survival Shooter (Reinforcement Learning)
| *[GitHub] | [Demo Video]* |
A. Problem Overview
This project aims to train a Reinforcement Learning (RL) agent to play the “Space Survival” game implemented in Pygame. The agent learns to control a spaceship by steering left/right and shooting to avoid or destroy asteroids while collecting shield and weapon power-ups to extend survival time and accumulate higher scores. The agent’s state is represented by a 51-dimensional feature vector, comprising the player’s status, the eight nearest asteroids, and the two nearest power-ups. The Double DQN (DDQN) algorithm is employed for policy learning.
B. Environment Setup
- Hardware: Training was conducted on workstations equipped with one NVIDIA RTX 3080 and one RTX 4060.
- Software: Torch (2.7.1), CUDA (12.8).
- Detailed Training Environment Setup
C. State Design
The entire state is output as a 1D NumPy array (length 51) and fed into the MLP model. It is categorized into three groups: Player (5 dimensions), Asteroids (8×5 dimensions), and Power-ups (2×3 dimensions).
1. Player Information (5 dimensions)
| Dimension | Range/Norm | Description |
|---|---|---|
| px_norm | $[0, 1]$ | Player x-coordinate / screen width (WIDTH) |
| hp_norm | $[0, 1]$ | Player health / max health (100) |
| gun_oh_1 | $\{0, 1\}$ | Gun level one-hot: level 1 |
| gun_oh_2 | $\{0, 1\}$ | Gun level one-hot: level $\ge 2$ |
| cd_norm | $[0, 1]$ | Remaining cooldown / total cooldown delay (p.bullet_delay) |
p = self.game.player.sprite
px_norm = p.rect.centerx / WIDTH
hp_norm = p.health / 100
gun_oh = [1 if p.gun == 1 else 0, 1 if p.gun >= 2 else 0]
cd_norm = len(p.bullet_timer) / p.bullet_delay # 0~1
2. Asteroids (8 asteroids × 5 dimensions, 40 dimensions total)
- Only the 8 nearest asteroids to the player are tracked; if fewer than 8 exist, the vector is padded with zeros.
- Features: relative position (dx, dy), velocity (vx, vy), and radius rr.
- Normalization: velocity is normalized by MAX_SPD_X = 3 and MAX_SPD_Y = 10.
| Dimension | Range | Description |
|---|---|---|
| dx | $[-1, 1]$ | (Asteroid x - Player x) / WIDTH |
| dy | $[-1, 1]$ | (Asteroid y - Player y) / HEIGHT |
| vx | $[-1, 1]$ | Horizontal velocity / MAX_SPD_X |
| vy | $[0, 1]$ | Vertical velocity / MAX_SPD_Y |
| rr | Real (pixels) | Asteroid radius |
Implementation Snippet
rocks = sorted(self.game.rocks,
               key=lambda r: (r.rect.y-p.rect.y)**2 + (r.rect.x-p.rect.x)**2)[:MAX_ROCK]
rock_feats = []
for r in rocks:
    dx = (r.rect.centerx - p.rect.centerx) / WIDTH
    dy = (r.rect.centery - p.rect.centery) / HEIGHT
    vx = r.speedx / MAX_SPD_X
    vy = r.speedy / MAX_SPD_Y
    rr = r.radius
    rock_feats += [dx, dy, vx, vy, rr]
rock_feats += [0.0] * (MAX_ROCK*5 - len(rock_feats))  # zero-pad to 8 asteroids
3. Power-up Information (2 power-ups × 3 dimensions, 6 dimensions total)
- Only the 2 nearest power-ups to the player are included; if fewer than 2 exist, pad with zeros.
- Each power-up contributes: relative position (dx, dy) and type tp (shield → +1, gun → -1).
| Dimension | Range | Description |
|---|---|---|
| dx | $[-1, 1]$ | (Power-up x - Player x) / WIDTH |
| dy | $[-1, 1]$ | (Power-up y - Player y) / HEIGHT |
| tp | $\{+1, -1\}$ | shield → +1; gun → -1 |
powers = sorted(self.game.powers,
                key=lambda pw: abs(pw.rect.y-p.rect.y))[:MAX_POWER]
power_feats = []
for pw in powers:
    dx = (pw.rect.centerx - p.rect.centerx) / WIDTH
    dy = (pw.rect.centery - p.rect.centery) / HEIGHT
    tp = 1 if pw.type == 'shield' else -1
    power_feats += [dx, dy, tp]
power_feats += [0] * (MAX_POWER*3 - len(power_feats))  # zero-pad to 2 power-ups
4. Concatenate into Final State Vector
state_vec = np.array(
    [px_norm, hp_norm] + gun_oh + [cd_norm] +
    rock_feats + power_feats,
    dtype=np.float32
)
return state_vec  # shape=(51,)
Total features: 5 + 40 + 6 = 51 dimensions.
D. Reward Design
Each step (frame) starts with a fixed time penalty and is dynamically adjusted based on game events. Main parameters are defined at the top of Env.py:
# ---- reward parameters ----
ALPHA_HIT = 0.8 # Asteroid destruction reward coefficient
PSI_MIN = 0.4 # ψ(hp) = PSI_MIN + (1-PSI_MIN)*ϕ
BETA_COLL = 1.2 # Collision penalty coefficient
GAMMA_SHIELD = 1.4 # Health recovery reward coefficient
R_GUN = 16 # Gun upgrade reward
MISS_SHOT_PEN = -1.0 # Penalty for shooting during cooldown
COOLDOWN_BONUS = +0.5 # Cooldown completion bonus
0. Fixed Time Step Penalty
A small fixed penalty is applied every step, encouraging the agent to achieve high scores quickly:
reward = -0.25
1. Asteroid Destruction (Hit Bonus)
Condition: delta_score = self.game.score - score_before (indicates newly destroyed asteroids).
First, calculate the current health ratio $\phi$ and the reward coefficient $\psi$:
First, calculate the current health ratio $\phi$ and the reward coefficient $\psi$:

$$\phi = \frac{hp_{\mathrm{after}}}{100}, \qquad \psi = \psi_{\mathrm{MIN}} + (1 - \psi_{\mathrm{MIN}}) \times \phi$$

Reward formula:

$$\text{reward} \mathrel{+}= \alpha_{\mathrm{HIT}} \times \Delta score \times \psi$$

The higher the health (and hence $\psi$), the higher the reward, encouraging the agent to shoot actively while in good health.
delta_score = self.game.score - score_before
if delta_score:
    ϕ = hp_after / 100
    ψ = PSI_MIN + (1-PSI_MIN)*ϕ
    hit_bonus = ALPHA_HIT * delta_score * ψ
    reward += hit_bonus
2. Asteroid Collision (Collision Penalty)
Condition: self.game.is_collided == True
- Calculate damage: r = hp_before - hp_after
- Weight factor: factor = 2 - ϕ (heavier penalty when at low health)
Penalty formula:
$$\text{penalty} = \beta_{\mathrm{COLL}} \times r \times (2-\phi)$$
$$\text{reward} \mathrel{-}= \text{penalty}$$
if self.game.is_collided:
    r = hp_before - hp_after
    factor = 2 - (hp_after/100)
    penalty = BETA_COLL * r * factor
    reward -= penalty
3. Power-up Collection (Power-up Bonus)
Condition: self.game.is_power == True
- Shield: if the health gain hp_gain > 0, the reward is larger when health is low, encouraging timely recovery:

  $$\mathrm{reward} \mathrel{+}= hp\_gain \times \gamma_{\mathrm{SHIELD}} \times (1 - \phi)$$

- Gun upgrade: fixed reward of R_GUN = 16.
if self.game.is_power:
    hp_gain = hp_after - hp_before
    if hp_gain > 0:
        reward += hp_gain * GAMMA_SHIELD * (1-ϕ)
    else:
        reward += R_GUN
4. Cooldown Shaping
- Detect shooting action:
  - Pressing shoot (action==1 and ready_before==True) sets fired_now=True and marks entry into cooldown.
- Random-press penalty: if shoot is pressed but no bullet is fired, and the penalty has not yet been applied this cooldown, apply MISS_SHOT_PEN = -1.0 once.
- Cooldown completion reward: when the cooldown ends (in_cooldown==True and ready_after==True), give +0.5 and reset the state.
# Detect fired_now and in_cooldown
if was_shooting and ready_before:
    self.in_cooldown = True
    self.cooldown_penalized = False
if was_shooting and not ready_before and not self.cooldown_penalized:
    reward += MISS_SHOT_PEN
    self.cooldown_penalized = True
if self.in_cooldown and ready_after:
    reward += COOLDOWN_BONUS
    self.in_cooldown = False
5. Wall-hit Penalty
Condition: When the agent attempts to move outside the wall boundaries.
If (player.rect.left == 0 and action == LEFT) or (player.rect.right == WIDTH and action == RIGHT), a penalty of -0.5 is applied.
hit_left = (player.rect.left == 0 and action == LEFT)
hit_right = (player.rect.right == WIDTH and action == RIGHT)
if hit_left or hit_right:
    reward -= 0.5
E. Model Architecture
The original template_v2 provided a CNN-based neural network. Since the state has been changed to a 51-dimensional feature vector, a smaller, redesigned model is sufficient.
- Algorithm: Double DQN
- Network structure: a three-layer fully connected MLP (Multi-Layer Perceptron)
  - Input dimension input_dim → 128 → 128 → output dimension num_actions
  - All hidden layers use ReLU activation
  - The final layer outputs Q-values for each action
class MLPDDQN(nn.Module):
    def __init__(self, input_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions)
        )

    def forward(self, x):
        return self.net(x)
- Output: returns a tensor of shape (batch_size, num_actions) containing the Q-value of each action for every input state; the ε-greedy strategy then uses these Q-values to select actions.
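For reference, a minimal ε-greedy selection step might look like the sketch below; the function name select_action and the exact tensor handling are assumptions rather than the project's actual code.

```python
import random
import torch

def select_action(policy_net, state_vec, epsilon, num_actions, device):
    """Pick a random action with probability epsilon, otherwise the greedy (argmax-Q) action."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        # Add a batch dimension: (51,) -> (1, 51), then take the argmax over Q-values.
        state = torch.as_tensor(state_vec, dtype=torch.float32, device=device).unsqueeze(0)
        q_values = policy_net(state)  # shape: (1, num_actions)
        return int(q_values.argmax(dim=1).item())
```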
F. Training Process
0. Hyperparameters
| Hyperparameter | Value |
|---|---|
| num_episodes | 4000 |
| batch_size | 64 |
| gamma (discount factor) | 0.99 |
| lr (learning rate) | 1e-4 |
| epsilon_start | 1.0 |
| epsilon_end | 0.05 |
| epsilon_decay | 0.999 |
| memory_capacity | 60000 |
| target_update_freq | 2000 |
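With epsilon_start = 1.0, epsilon_end = 0.05, and epsilon_decay = 0.999, a multiplicative per-episode decay is assumed in the sketch below (the project may instead decay per step); under that assumption ε reaches the floor after roughly 3000 episodes.

```python
# Assumed schedule: multiply epsilon by epsilon_decay once per episode, clipped at epsilon_end.
epsilon = epsilon_start  # 1.0
for episode in range(num_episodes):
    # ... run one episode, store transitions, optimize ...
    epsilon = max(epsilon_end, epsilon * epsilon_decay)  # 0.999**3000 ≈ 0.05
```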
1. Initialization
env = SpaceShipEnv()
state_dim = env._extract_state().shape[0]
num_actions = len(env.action_space)
policy_net = MLPDDQN(state_dim, num_actions).to(device)
target_net = MLPDDQN(state_dim, num_actions).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()
optimizer = optim.Adam(policy_net.parameters(), lr=lr)
memory = ReplayMemory(memory_capacity)
epsilon = epsilon_start
total_steps = 0
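ReplayMemory is used above but not shown; a minimal deque-based buffer along these lines would suffice (a sketch, not necessarily the project's implementation):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size FIFO buffer of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```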
2. Episode Loop
The episode loop is for episode in range(start_episode, 4000). Based on template_v2, the main modification is changing the target computation from DQN to DDQN:
--- Original: DQN
+++ Modified: DDQN
- # Original DQN: directly use target_net's max Q value
- # target_q = rewards + γ * target_net(next_states).max(1)[0] * (1 - dones)
+ # DDQN: use policy_net to select action, then use target_net to evaluate that action's Q value
+ next_actions = policy_net(next_states).argmax(dim=1).unsqueeze(1)
+ next_q_values = target_net(next_states).gather(1, next_actions).squeeze(1)
+ target_q = rewards + γ * next_q_values * (1 - dones)
- DQN: target_q = r + γ · max_a Q_target(s′, a)
- DDQN:
  - Select the action a* = argmax_a Q_policy(s′, a) with the policy network
  - Evaluate that action with the target network: Q_target(s′, a*)
  - target_q = r + γ · Q_target(s′, a*)
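For context, one full optimization step built around that DDQN target might look like the sketch below; the Huber loss, the tensor conversions, and the function name optimize are assumptions, and only the double-Q target computation is taken from the diff above.

```python
import numpy as np
import torch
import torch.nn.functional as F

def optimize(policy_net, target_net, memory, optimizer, batch_size, gamma, device):
    """One DDQN update: sample a batch, build the double-Q target, take a gradient step."""
    if len(memory) < batch_size:
        return
    states, actions, rewards, next_states, dones = zip(*memory.sample(batch_size))

    states      = torch.as_tensor(np.array(states), dtype=torch.float32, device=device)
    actions     = torch.as_tensor(actions, dtype=torch.int64, device=device).unsqueeze(1)
    rewards     = torch.as_tensor(rewards, dtype=torch.float32, device=device)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32, device=device)
    dones       = torch.as_tensor(dones, dtype=torch.float32, device=device)

    # Q(s, a) for the actions actually taken
    q_values = policy_net(states).gather(1, actions).squeeze(1)

    with torch.no_grad():
        # DDQN: policy_net selects the next action, target_net evaluates it
        next_actions = policy_net(next_states).argmax(dim=1).unsqueeze(1)
        next_q_values = target_net(next_states).gather(1, next_actions).squeeze(1)
        target_q = rewards + gamma * next_q_values * (1 - dones)

    loss = F.smooth_l1_loss(q_values, target_q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Every target_update_freq (2000) steps, target_net.load_state_dict(policy_net.state_dict()) resynchronizes the target network with the policy network.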
G. Result Presentation
Using v2.4.5 as the main version for analysis
Score Distribution Charts
Reward Distribution Charts
Using checkpoint_ep1800 from v2.4.5 for analysis
Episode 1: total_reward=-147.77, score=724
Episode 2: total_reward=279.31, score=2522
Episode 3: total_reward=-11.90, score=2614
Episode 4: total_reward=3328.23, score=10002
Episode 5: total_reward=19.11, score=1526
Episode 6: total_reward=-204.31, score=516
Episode 7: total_reward=3232.96, score=10054
Episode 8: total_reward=-56.24, score=800
Episode 9: total_reward=-85.92, score=3346
Episode 10: total_reward=317.93, score=1950
Agent running results:
Best score: 10054, Best reward: 3328.23
Score — Mean: 3405.40, Std: 3422.08
Reward — Mean: 667.14, Std: 1316.51
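These statistics come from running the checkpoint for 10 evaluation episodes with a greedy policy; a sketch of such an evaluation loop is shown below. The checkpoint path, the checkpoint dictionary key, and the reset()/step() signatures of SpaceShipEnv are assumptions.

```python
import numpy as np
import torch

# Hypothetical checkpoint path and key; the actual file layout may differ.
checkpoint = torch.load("checkpoints/v2.4.5/checkpoint_ep1800.pth", map_location=device)
policy_net.load_state_dict(checkpoint["model_state_dict"])
policy_net.eval()

scores, rewards = [], []
for ep in range(10):
    state, total_reward, done = env.reset(), 0.0, False  # assumed reset() returns the state vector
    while not done:
        with torch.no_grad():
            q = policy_net(torch.as_tensor(state, dtype=torch.float32, device=device).unsqueeze(0))
        state, reward, done, _ = env.step(int(q.argmax(dim=1).item()))  # assumed 4-tuple return
        total_reward += reward
    scores.append(env.game.score)
    rewards.append(total_reward)

print(f"Score  - Mean: {np.mean(scores):.2f}, Std: {np.std(scores):.2f}")
print(f"Reward - Mean: {np.mean(rewards):.2f}, Std: {np.std(rewards):.2f}")
```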
Analysis and Discussion:
- Convergence speed: The model converges to the highest score around episode 1800, then starts to decline
- Failure cases: When asteroids appear quickly and block the original escape path, or when asteroids are too large with no escape route
H. Development Notes
The following records core modifications and training results for each version
v2.1
- Modification: used the default image (raw pixels) as the state; model architecture identical to the teaching assistant's example.
- Performance: training failed to converge; scores remained at random-play levels.
v2.2
- Modification:
  - Introduced a basic game strategy, making the reward highly correlated with the actual score
  - Added an asteroid collision penalty
  - Reward parameters: α_HIT = 1.4, β_COLL = 1.2, time_penalty = −0.5
- Performance: Total score significantly improved, reaching around 1400 points but unable to improve further.
v2.3
- Modification: to address slow CNN training and slow convergence, switched to an MLP fed with the np.array state directly; _extract_state() is called in step() to extract environment information
- Performance: training speed increased significantly, but the highest score did not improve
v2.4
- Modification:
  - Readjusted reward weights: α_HIT 1.4→1.8, β_COLL 1.2→1.4, time_penalty −0.3→−0.4
  - Changed the asteroid radius to a discrete input
  - Optimized Replay Buffer performance
- Performance: Training curve smoother, but highest score still did not break existing upper limit.
v2.4.3
- Modification:
  - Integrated TensorBoard for real-time monitoring of loss and reward curves
  - Learning rate adjustment: 1e-4 → 5e-5; ε decay: 0.999 → 0.997
  - All rewards multiplied by 0.6
- Performance: average score broke 2000 points, occasionally reaching 10000 points in a single run, but the power-up collection strategy remained unstable.
v2.4.4
- Modification:
  - Found that excessive reward differences affected learning; switched to a segmented time_penalty (see the sketch after this version's entry):
    - score < 1000: −0.25
    - 1000 ≤ score < 3000: −0.3
    - 3000 ≤ score < 6000: −0.35
  - Readjusted α_HIT 1.8→0.8, β_COLL 1.4→1.2
  - Conducted A/B testing for the health-recovery power-up: γ_SHIELD = 1.4 vs. 1; result: γ_SHIELD = 1.4 performed best
- Performance: Score growth trend became unstable after 3000 points, highest reached close to 4000 points, still unable to consistently break through
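A minimal sketch of that segmented time penalty; the function name and the value used above 6000 points are assumptions (the notes only list three brackets):

```python
def time_penalty(score):
    """Per-step time penalty in v2.4.4, segmented by the current score."""
    if score < 1000:
        return -0.25
    elif score < 3000:
        return -0.30
    elif score < 6000:
        return -0.35
    else:
        return -0.35  # assumed: reuse the last bracket's value beyond 6000 points
```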
v2.4.5 (Best Version)
- Modification:
  - Reduced asteroid feature dimensions: no longer a one-hot encoding per asteroid, only the radius rr is input
  - Added a "wall-hit penalty" for incorrect dodging at the screen edges
  - Fixed time_penalty = −0.25
- Performance: significantly increased the probability of a single run breaking 10000 points, but low-score runs still occur occasionally, likely an inherent limitation of the game's difficulty.
v2.4.6
- Modification: expanded the action space to 6 actions, allowing simultaneous movement and shooting (one plausible enumeration is sketched below)
- Performance: Converged to 2000–3000 points but unable to improve further
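One plausible enumeration of the 6-action space is sketched here; it is hypothetical, and the project's actual action ordering may differ.

```python
# Hypothetical action indices for v2.4.6 (movement and shooting can be combined).
ACTIONS = {
    0: "NOOP",         # no movement, no shot
    1: "LEFT",         # move left
    2: "RIGHT",        # move right
    3: "SHOOT",        # shoot without moving
    4: "LEFT_SHOOT",   # move left and shoot
    5: "RIGHT_SHOOT",  # move right and shoot
}
```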
v2.4.7
- Modification: Replaced original Replay Buffer with Prioritized Replay Buffer
- Performance: Score initially rose above 1000 points, then declined and fluctuated in the 1000–2000 point range
Future Optimization Directions:
I. Environment and Performance Optimization
- Remove FPS limitation: change clock.tick() to a frame counter that accumulates frame counts, so training is no longer constrained by the display frame rate, significantly improving data-collection speed.
- Parallel environments: adopt a vectorized environment or multi-process execution to collect multiple training trajectories in parallel, reducing training time.
II. Hyperparameters and Exploration Strategy
- Dynamic ε-greedy strategy: currently ε converges in the [0.1, 0.2] range, after which score performance declines. Could try a linear or adaptive dynamic decay of ε.
- Systematic hyperparameter search: introduce Grid Search, Bayesian Optimization, Hyperband, etc., to automatically tune the learning rate, batch_size, γ, gradient clipping, and other parameters.
III. Reward Mechanism Adjustment
- Enhanced dodging ability: the current agent still gets stuck at the screen edges or blocked by high-speed asteroids. Options to try:
  - Replace the edge penalty with a center reward, so the agent prefers staying near the center of the screen
I. Reflection
I found this final project both interesting and challenging. From deciding what information to include in the "state" to designing a reasonable reward, every hyperparameter change required understanding its underlying meaning, then examining graphs and gameplay videos to guess which parts fell short of expectations before designing the next version, much like running experiments. However, with so many parameters and such long single training runs, it was difficult to control all variables within the limited time, and I could not verify hypotheses quickly.
I learned several useful tools along the way:
- Git Branch Management: since I needed to constantly sync versions across different computers and create new branches, I chose Git as the version-control tool. The constant version updates made me much more familiar with gitflow, but by the later versions things got a bit messy and I spent considerable effort reorganizing. Next time I will plan branch workflows and PR reviews more rigorously.
- Matplotlib / TensorBoard: I initially used Matplotlib to plot training curves and had to rerun the plots after every update to compare trends. Later I switched to TensorBoard, which shows loss and reward curves in real time during training and lets me switch between version records with one click, saving a lot of comparison time. A very convenient tool.