Commit 6eb8b58

Update python code
1 parent 35b4131 commit 6eb8b58

10 files changed

Lines changed: 1398 additions & 8 deletions

.history/README_20251108180336.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
<div align="center">

<p><a href="https://wgyhhhh.github.io/Mathematical-Foundations-of-Reinforcement-Learning-Notes/Preface1/">📚 Read online</a></p>

<h3>🤖 "Mathematical Foundations of Reinforcement Learning" - Personal Notes and Reflections</h3>

<p><em>Understand the mathematics behind reinforcement learning, and master the core algorithms through worked examples</em></p>

</div>


### &#8627; Stargazers

[![Stargazers repo roster for @wgyhhhh/Mathematical-Foundations-of-Reinforcement-Learning-Notes](https://reporoster.com/stars/wgyhhhh/Mathematical-Foundations-of-Reinforcement-Learning-Notes)](https://github.com/wgyhhhh/Mathematical-Foundations-of-Reinforcement-Learning-Notes/stargazers)


### &#8627; Forkers

[![Forkers repo roster for @wgyhhhh/Mathematical-Foundations-of-Reinforcement-Learning-Notes](https://reporoster.com/forks/wgyhhhh/Mathematical-Foundations-of-Reinforcement-Learning-Notes)](https://github.com/wgyhhhh/Mathematical-Foundations-of-Reinforcement-Learning-Notes/network/members)


## 🎯 About These Notes

&emsp;&emsp;These notes are my personal reflections on, and summary of, *Mathematical Foundations of Reinforcement Learning* by Prof. Shiyu Zhao. **I have published them as a website, so readers can study them anytime, anywhere, on handheld devices.** On top of the notes themselves, I have also implemented the book's core algorithms, so that readers can build a more concrete understanding. The book starts from basic concepts, covering the Bellman equation and the Bellman optimality equation, then extends to model-based and model-free reinforcement learning algorithms, and finally generalizes to reinforcement learning with function approximation. Readers with no background in reinforcement learning need only some linear algebra and probability theory to follow the book, while readers who already know some reinforcement learning can use these notes to deepen their understanding of the underlying questions.


## 📖 Contents

| Chapter | Key content | Status |
| --- | --- | --- |
| [Preface](https://wgyhhhh.github.io/Mathematical-Foundations-of-Reinforcement-Learning-Notes/Preface1/) | Origin and background of these notes, with reading suggestions | |
| [Chapter 1: Basic Concepts](https://wgyhhhh.github.io/Mathematical-Foundations-of-Reinforcement-Learning-Notes/Chapter-1/intro/) | Basic concepts of reinforcement learning | |
| [Chapter 2: State Values and the Bellman Equation](https://wgyhhhh.github.io/Mathematical-Foundations-of-Reinforcement-Learning-Notes/Chapter-2/intro/) | Return, state value, Bellman equation | |
| [Chapter 3: Optimal State Values and the Bellman Optimality Equation](https://wgyhhhh.github.io/Mathematical-Foundations-of-Reinforcement-Learning-Notes/Chapter-3/intro/) | Optimal state value, optimal policy, Bellman optimality equation | |
| [Chapter 4: Value Iteration and Policy Iteration](https://wgyhhhh.github.io/Mathematical-Foundations-of-Reinforcement-Learning-Notes/Chapter-4/intro/) | Value iteration, policy iteration, truncated policy iteration | |
| [Chapter 5: Monte Carlo Methods](https://wgyhhhh.github.io/Mathematical-Foundations-of-Reinforcement-Learning-Notes/Chapter-5/intro/) | MC Basic, MC Exploring Starts, MC-Greedy | |
| [Chapter 6: Stochastic Approximation](https://wgyhhhh.github.io/Mathematical-Foundations-of-Reinforcement-Learning-Notes/Chapter-6/intro/) | Robbins-Monro algorithm, Dvoretzky's theorem, stochastic gradient descent | |
| [Chapter 7: Temporal-Difference Methods](https://wgyhhhh.github.io/Mathematical-Foundations-of-Reinforcement-Learning-Notes/Chapter-7/intro/) | Sarsa, n-step Sarsa, Q-learning, off-policy vs. on-policy | |
| [Chapter 8: Value Function Methods](https://wgyhhhh.github.io/Mathematical-Foundations-of-Reinforcement-Learning-Notes/Chapter-8/intro/) | Value-function-based TD algorithms, Sarsa, Q-learning | ✅ (being polished) |
| [Chapter 9: Policy Gradient Methods](https://wgyhhhh.github.io/Mathematical-Foundations-of-Reinforcement-Learning-Notes/Chapter-9/intro/) | Policy gradient, REINFORCE | ✅ (being polished) |
| [Chapter 10: Actor-Critic Methods](https://wgyhhhh.github.io/Mathematical-Foundations-of-Reinforcement-Learning-Notes/Chapter-10/intro/) | Advantage actor-critic, off-policy actor-critic, deterministic actor-critic | ✅ (being polished) |
| Algorithm implementations | Python implementations of the core algorithms | 🚧 |


### 🚧 Algorithm Implementations

I am implementing some of the book's core algorithms in Python; reading the notes together with the code gives a more intuitive understanding.
## 🤝 How to Contribute

If you are interested in reinforcement learning, you are welcome to help improve these notes! ❤️

- 💡 **Improve the content** - help refine the notes
- 📝 **Report issues** - if you spot a problem, please open an Issue

.history/docs/python/examples/arguments_20251108184311.py

Whitespace-only changes.
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
1+
__credits__ = ["Intelligent Unmanned Systems Laboratory at Westlake University."]
2+
'''
3+
Specify parameters of the env
4+
'''
5+
from typing import Union
6+
import numpy as np
7+
import argparse
8+
9+
parser = argparse.ArgumentParser("Grid World Environment")
10+
11+
## ==================== User settings ===================='''
12+
# specify the number of columns and rows of the grid world
13+
parser.add_argument("--env-size", type=Union[list, tuple, np.ndarray], default=(5,5) )
14+
15+
# specify the start state
16+
parser.add_argument("--start-state", type=Union[list, tuple, np.ndarray], default=(2,2))
17+
18+
# specify the target state
19+
parser.add_argument("--target-state", type=Union[list, tuple, np.ndarray], default=(4,4))
20+
21+
# sepcify the forbidden states
22+
parser.add_argument("--forbidden-states", type=list, default=[ (2, 1), (3, 3), (1, 3)] )
23+
24+
# sepcify the reward when reaching target
25+
parser.add_argument("--reward-target", type=float, default = 10)
26+
27+
# sepcify the reward when entering into forbidden area
28+
parser.add_argument("--reward-forbidden", type=float, default = -5)
29+
30+
# sepcify the reward for each step
31+
parser.add_argument("--reward-step", type=float, default = -1)
32+
## ==================== End of User settings ====================
33+
34+
35+
## ==================== Advanced Settings ====================
36+
parser.add_argument("--action-space", type=list, default=[(0, 1), (1, 0), (0, -1), (-1, 0), (0, 0)] ) # down, right, up, left, stay
37+
parser.add_argument("--debug", type=bool, default=False)
38+
parser.add_argument("--animation-interval", type=float, default = 0.2)
39+
## ==================== End of Advanced settings ====================
40+
41+
42+
args = parser.parse_args()
43+
def validate_environment_parameters(env_size, start_state, target_state, forbidden_states):
44+
if not (isinstance(env_size, tuple) or isinstance(env_size, list) or isinstance(env_size, np.ndarray)) and len(env_size) != 2:
45+
raise ValueError("Invalid environment size. Expected a tuple (rows, cols) with positive dimensions.")
46+
47+
for i in range(2):
48+
assert start_state[i] < env_size[i]
49+
assert target_state[i] < env_size[i]
50+
for j in range(len(forbidden_states)):
51+
assert forbidden_states[j][i] < env_size[i]
52+
try:
53+
validate_environment_parameters(args.env_size, args.start_state, args.target_state, args.forbidden_states)
54+
except ValueError as e:
55+
print("Error:", e)
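Since argparse calls whatever is passed as `type` on the raw command-line string, tuple-valued options like `--env-size` need a real converter (a `Union[...]` annotation is not callable). A minimal self-contained sketch of the pattern, using a small hypothetical `coord` helper:

```python
import argparse

def coord(s):
    # convert a "row,col" string into an int tuple
    return tuple(int(x) for x in s.split(","))

parser = argparse.ArgumentParser("Grid World Environment")
parser.add_argument("--env-size", type=coord, default=(5, 5))
parser.add_argument("--start-state", type=coord, default=(2, 2))

# no arguments: defaults are used as-is, without passing through coord()
args = parser.parse_args([])
print(args.env_size)   # (5, 5)

# a command-line override: the string "8,6" is fed through coord()
args = parser.parse_args(["--env-size", "8,6"])
print(args.env_size)   # (8, 6)
```

Passing an explicit argument list to `parse_args`, as above, is also a convenient way to exercise the parser without touching `sys.argv`.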
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
import argparse

import numpy as np


def coord(s):
    # parse a "row,col" command-line string into an int tuple, e.g. "2,2" -> (2, 2);
    # argparse's `type` must be a callable, so a typing annotation cannot be used here
    return tuple(int(x) for x in s.split(","))


parser = argparse.ArgumentParser("Grid World Environment")

## ==================== User settings ====================
# specify the number of columns and rows of the grid world
parser.add_argument("--env-size", type=coord, default=(5, 5))

# specify the start state
parser.add_argument("--start-state", type=coord, default=(2, 2))

# specify the target state
parser.add_argument("--target-state", type=coord, default=(4, 4))

# specify the forbidden states
parser.add_argument("--forbidden-states", type=coord, nargs="*", default=[(2, 1), (3, 3), (1, 3)])

# specify the reward when reaching the target
parser.add_argument("--reward-target", type=float, default=10)

# specify the reward when entering a forbidden area
parser.add_argument("--reward-forbidden", type=float, default=-5)

# specify the reward for each step
parser.add_argument("--reward-step", type=float, default=-1)
## ==================== End of User settings ====================


## ==================== Advanced settings ====================
# down, right, up, left, stay
parser.add_argument("--action-space", type=coord, nargs="*",
                    default=[(0, 1), (1, 0), (0, -1), (-1, 0), (0, 0)])
# a store_true flag gives a proper boolean; type=bool would parse "False" as True
parser.add_argument("--debug", action="store_true")
parser.add_argument("--animation-interval", type=float, default=0.2)
## ==================== End of Advanced settings ====================


args = parser.parse_args()


def validate_environment_parameters(env_size, start_state, target_state, forbidden_states):
    # `or` is required here: a value of the wrong length must also be rejected
    if not isinstance(env_size, (tuple, list, np.ndarray)) or len(env_size) != 2:
        raise ValueError("Invalid environment size. Expected a tuple (rows, cols) with positive dimensions.")

    for i in range(2):
        assert start_state[i] < env_size[i]
        assert target_state[i] < env_size[i]
        for j in range(len(forbidden_states)):
            assert forbidden_states[j][i] < env_size[i]


try:
    validate_environment_parameters(args.env_size, args.start_state, args.target_state, args.forbidden_states)
except ValueError as e:
    print("Error:", e)

.history/docs/python/gridworld_notebook_20251108185258.ipynb

Whitespace-only changes.
