
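Sparse-reward environments are a major challenge in reinforcement learning (RL). In such environments the agent often has to explore for a long time without any external reward before it stumbles on the behavior that leads to a high reward. Deep Q-Network (DQN), a classic RL algorithm, is also prone to getting stuck in local optima or exploring too little when rewards are sparse. Exploration strategies based on intrinsic motivation were proposed to address this: they give the agent an additional intrinsic reward that encourages effective exploration even when there is no clear external reward signal.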







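Common forms of intrinsic reward include:

  • Count-based bonus: counts how often the agent has visited each state and gives a higher intrinsic reward to states it has visited less often.
  • Prediction-error bonus: uses the error of the agent's prediction of the next state as the intrinsic reward, which pushes the agent toward states that are hard to predict.
  • Curiosity-driven bonus: trains a predictive model of the next state or of the outcome of an action, and uses that model's prediction error as the intrinsic reward.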
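As a rough illustration of the first two bonuses, here is a minimal NumPy sketch. The class names `CountBonus` and `LinearForwardModel`, the scale `beta`, the `beta / sqrt(N)` form, and the linear forward model are all assumptions made for this example, not something the text prescribes; a curiosity-driven bonus would typically swap the linear model for a learned network.

```python
from collections import defaultdict

import numpy as np


class CountBonus:
    """Count-based bonus: beta / sqrt(N(s)), one common choice among several."""

    def __init__(self, beta=1.0):
        self.beta = beta                  # hypothetical scale of the bonus
        self.counts = defaultdict(int)    # visit count per discrete state

    def visit(self, state):
        # Flatten to a hashable key; continuous states need discretizing first.
        key = tuple(np.asarray(state).ravel().tolist())
        self.counts[key] += 1
        return self.beta / np.sqrt(self.counts[key])


class LinearForwardModel:
    """Prediction-error bonus from a linear model s' ~ W [s; one_hot(a)]."""

    def __init__(self, state_size, action_size, lr=0.1):
        self.action_size = action_size
        self.W = np.zeros((state_size, state_size + action_size))
        self.lr = lr

    def _features(self, state, action):
        one_hot = np.zeros(self.action_size)
        one_hot[action] = 1.0
        return np.concatenate([np.asarray(state, dtype=float).ravel(), one_hot])

    def bonus_and_update(self, state, action, next_state):
        x = self._features(state, action)
        err = np.asarray(next_state, dtype=float).ravel() - self.W @ x
        bonus = 0.5 * float(err @ err)         # error measured before the update
        self.W += self.lr * np.outer(err, x)   # gradient step on 0.5 * ||err||^2
        return bonus
```

In both cases the bonus shrinks as a state or transition becomes familiar, so the agent is paid mostly for what it has not yet seen or cannot yet predict.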



The skeleton below sketches a DQN agent with a count-based intrinsic reward; the Q-network and its training step are omitted for brevity.

    import random
    from collections import defaultdict, deque

    import numpy as np


    class IntrinsicDQNAgent:
        def __init__(self, state_size, action_size, learning_rate=0.001,
                     discount_factor=0.99, epsilon=1.0, epsilon_decay=0.995,
                     epsilon_min=0.01, memory_size=10000, batch_size=64,
                     state_count_threshold=10):
            self.state_size = state_size
            self.action_size = action_size
            self.learning_rate = learning_rate
            self.discount_factor = discount_factor
            self.epsilon = epsilon
            self.epsilon_decay = epsilon_decay
            self.epsilon_min = epsilon_min
            # A bounded deque drops the oldest transition automatically.
            self.memory = deque(maxlen=memory_size)
            self.memory_size = memory_size
            self.batch_size = batch_size
            self.state_count = defaultdict(int)
            self.state_count_threshold = state_count_threshold
            # Initialize Q-network and target network with random weights
            # (code for initializing networks omitted for brevity)

        @staticmethod
        def _state_key(state):
            # Flatten so that states carrying a batch dimension stay hashable.
            # Continuous states would need to be discretized first.
            return tuple(np.asarray(state).ravel().tolist())

        def remember(self, state, action, reward, next_state, done):
            self.memory.append((state, action, reward, next_state, done))
            # Count the visit when it happens, not each time it is replayed.
            self.state_count[self._state_key(state)] += 1

        def act(self, state):
            if np.random.rand() <= self.epsilon:
                return random.randrange(self.action_size)
            # Choose the action with the highest Q-value
            return int(np.argmax(self.predict(state)[0]))

        def predict(self, state):
            # Forward pass through the Q-network (code omitted for brevity)
            raise NotImplementedError

        def update_epsilon(self):
            if self.epsilon > self.epsilon_min:
                self.epsilon *= self.epsilon_decay

        def intrinsic_reward(self, state):
            # .get avoids inserting unseen states into the counter
            count = self.state_count.get(self._state_key(state), 0)
            if count < self.state_count_threshold:
                return 1.0 / (count + 1)  # bonus decays with the visit count
            return 0.0

        def replay(self):
            if len(self.memory) < self.batch_size:
                return
            minibatch = random.sample(self.memory, self.batch_size)
            for state, action, reward, next_state, done in minibatch:
                target = reward
                if not done:
                    target = reward + self.discount_factor * np.amax(self.predict(next_state)[0])
                # Add intrinsic reward
                target += self.intrinsic_reward(state)
                # Update Q-network toward `target` (training step omitted for brevity)
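The listing above leaves out the network and the training step. To make the data flow concrete, the sketch below stands in a linear Q-function for the Q-network and shows how a target that includes the intrinsic reward could be formed and applied. Everything here is an assumption for illustration: `LinearQ`, `train_step`, `sync_from`, and `td_target` are hypothetical names, and a real DQN would use a neural network, minibatch updates, and a periodically synced target network.

```python
import numpy as np


class LinearQ:
    """Linear stand-in for the omitted Q-network: Q(s, .) = W s + b."""

    def __init__(self, state_size, action_size, learning_rate=0.001, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(action_size, state_size))
        self.b = np.zeros(action_size)
        self.learning_rate = learning_rate

    def predict(self, state):
        # Returns shape (1, action_size) so that predict(state)[0] works
        # the same way as in the agent listing.
        s = np.asarray(state, dtype=float).reshape(-1)
        return (self.W @ s + self.b)[np.newaxis, :]

    def train_step(self, state, action, target):
        # One SGD step on 0.5 * (target - Q(s, a))^2 for the taken action only.
        s = np.asarray(state, dtype=float).reshape(-1)
        td_error = float(target - self.predict(s)[0, action])
        self.W[action] += self.learning_rate * td_error * s
        self.b[action] += self.learning_rate * td_error
        return td_error

    def sync_from(self, other):
        # Copy weights, as one would when refreshing a target network.
        self.W = other.W.copy()
        self.b = other.b.copy()


def td_target(reward, bonus, next_q, done, discount_factor=0.99):
    """Extrinsic reward plus intrinsic bonus plus the bootstrapped value."""
    target = reward + bonus
    if not done:
        target += discount_factor * float(np.amax(next_q))
    return target
```

With these pieces, one replay update reads `q.train_step(state, action, td_target(reward, bonus, target_q.predict(next_state), done))`, where `bonus` comes from whichever intrinsic reward is in use.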
