官术网_书友最值得收藏!

Reinforcement learning

Even if there are no actual supervisors, reinforcement learning is also based on feedback provided by the environment. However, in this case, the information is more qualitative and doesn't help the agent in determining a precise measure of its error. In reinforcement learning, this feedback is usually called reward (sometimes, a negative one is defined as a penalty) and it's useful to understand whether a certain action performed in a state is positive or not. The sequence of most useful actions is a policy that the agent has to learn, so to be able to make always the best decision in terms of the highest immediate and cumulative reward. In other words, an action can also be imperfect, but in terms of a global policy it has to offer the highest total reward. This concept is based on the idea that a rational agent always pursues the objectives that can increase his/her wealth. The ability to see over a distant horizon is a distinction mark for advanced agents, while short-sighted ones are often unable to correctly evaluate the consequences of their immediate actions and so their strategies are always sub-optimal.

Reinforcement learning is particularly efficient when the environment is not completely deterministic, when it's often very dynamic, and when it's impossible to have a precise error measure. During the last few years, many classical algorithms have been applied to deep neural networks to learn the best policy for playing Atari video games and to teach an agent how to associate the right action with an input representing the state (usually a screenshot or a memory dump). 

In the following figure, there's a schematic representation of a deep neural network trained to play a famous Atari game. As input, there are one or more subsequent screenshots (this can often be enough to capture the temporal dynamics as well). They are processed using different layers (discussed briefly later) to produce an output that represents the policy for a specific state transition. After applying this policy, the game produces a feedback (as a reward-penalty), and this result is used to refine the output until it becomes stable (so the states are correctly recognized and the suggested action is always the best one) and the total reward overcomes a predefined threshold.

We're going to discuss some examples of reinforcement learning in the chapter dedicated to introducing deep learning and TensorFlow.

主站蜘蛛池模板: 濉溪县| 大丰市| 茶陵县| 普安县| 湖南省| 棋牌| 东港市| 华安县| 电白县| 仁布县| 广东省| 曲靖市| 龙山县| 石首市| 临邑县| 自治县| 广州市| 新田县| 临朐县| 通州市| 唐山市| 哈巴河县| 抚顺县| 叶城县| 西安市| 杭州市| 临潭县| 丹巴县| 甘孜| 香河县| 贞丰县| 喀喇沁旗| 临潭县| 昌平区| 仁寿县| 尚义县| 纳雍县| 阿拉尔市| 佛教| 南乐县| 晋江市|