- PyTorch 1.x Reinforcement Learning Cookbook
- Yuxi (Hayden) Liu
How it works...
The policy gradient algorithm trains an agent by taking small steps and, at the end of an episode, updating the weights based on the rewards associated with those steps. The technique of having the agent run through an entire episode and then updating the policy based on the rewards obtained is called Monte Carlo policy gradient.
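One common way to write this end-of-episode update (a standard REINFORCE formulation, not necessarily the recipe's exact notation) is as a gradient ascent step on the log-probabilities of the chosen actions, weighted by the returns:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

where $G_t$ is the total reward collected from step $t$ until the end of the episode and $\alpha$ is the learning rate.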
The action is sampled from the probability distribution computed from the current state and the model's weights. For example, if the probabilities for the left and right actions are [0.6, 0.4], the left action is selected 60% of the time; it is not always chosen, as it would be in the random search and hill-climbing algorithms.
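A minimal sketch of this sampling step, assuming a hypothetical two-action policy whose output probabilities are [0.6, 0.4] for left and right:

```python
import torch

# Probabilities produced by the policy for the current state (illustrative values).
probs = torch.tensor([0.6, 0.4])

# Sample an action index instead of always picking the most likely one.
action = torch.multinomial(probs, num_samples=1).item()
# action is 0 (left) about 60% of the time and 1 (right) about 40% of the time.
```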
We know that the reward is 1 for each step before an episode terminates. Hence, the future reward we use to calculate the policy gradient at each step is the number of steps remaining. After each episode, we multiply the gradient history by these future rewards and use the result to update the weights with stochastic gradient ascent. In this way, the longer an episode is, the larger the weight update. This will eventually increase the chance of getting a larger total reward.
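The sketch below illustrates this idea using PyTorch autograd; the function and variable names (compute_future_rewards, update_policy, log_probs) are hypothetical, and the recipe itself performs the equivalent update on its weight matrix directly rather than through an optimizer.

```python
import torch

def compute_future_rewards(rewards):
    # Future reward at step t = sum of rewards from t to the end of the episode.
    # With CartPole's reward of 1 per step, this is the number of steps remaining.
    running = 0.0
    future = []
    for r in reversed(rewards):
        running += r
        future.insert(0, running)
    return torch.tensor(future)

def update_policy(log_probs, rewards, optimizer):
    # End-of-episode update: gradient ascent on sum_t G_t * log pi(a_t | s_t),
    # implemented as gradient descent on its negative.
    future_rewards = compute_future_rewards(rewards)
    loss = -(torch.stack(log_probs) * future_rewards).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because every per-step log-probability is weighted by the number of steps remaining, longer episodes contribute larger gradients, which is exactly why the update grows with episode length.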
As we mentioned at the start of this section, the policy gradient algorithm might be overkill for a simple environment such as CartPole, but it should get us ready for more complicated problems.