We notice here that the final result of this code is a reward curve, but it shows an upward trend and does not converge to a seemingly dynamically stable value. I personally feel that the convergence effect is a bit poor. Do you have any solutions to this problem? My learning_max_episode has a value of 100 and max_ep_steps has a value of 500
We notice here that the final result of this code is a reward curve, but it shows an upward trend and does not converge to a seemingly dynamically stable value. I personally feel that the convergence effect is a bit poor. Do you have any solutions to this problem? My learning_max_episode has a value of 100 and max_ep_steps has a value of 500