
An overview of reinforcement learning

Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its rewards. It is about what to do and how to map situations to actions so as to maximize a numerical reward signal. The learner is not told which actions to take, as it would be in most other forms of machine learning, but instead must discover which actions yield the most reward by trying them. The two most important distinguishing features of reinforcement learning are trial-and-error search and delayed reward. Some examples of reinforcement learning are as follows:

  • A chess player making a move: the choice is informed by planning, anticipating possible replies and counter-replies.
  • An adaptive controller adjusts parameters of a petroleum refinery's operation in real time. The controller optimizes the yield/cost/quality trade-off on the basis of specified marginal costs without sticking strictly to the set points originally suggested by engineers.
  • A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.
  • Teaching a dog a new trick--one cannot tell it what to do, but one can reward/punish it if it does the right/wrong thing. It has to figure out what it did that made it get the reward/punishment, which is known as the credit assignment problem.

Reinforcement learning is like trial-and-error learning: the agent should discover a good policy from its experience of the environment without losing too much reward along the way. Exploration means gathering more information about the environment, while exploitation uses known information to maximize reward (see the sketch after the following list). For example:

  • Restaurant selection: Exploitation: go to your favorite restaurant. Exploration: try a new restaurant.
  • Oil drilling: Exploitation: drill at the best-known location. Exploration: drill at a new location.
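To make the exploration/exploitation trade-off concrete, here is a minimal epsilon-greedy sketch built around the restaurant example. The restaurant names, reward values, and the epsilon parameter are illustrative assumptions, not anything prescribed by the text; the same pattern applies to any choice among options with uncertain payoffs.

```python
import random

# Illustrative options and (initially unknown) reward estimates.
restaurants = ["favorite", "new_thai", "new_sushi"]
estimated_reward = {"favorite": 4.0, "new_thai": 0.0, "new_sushi": 0.0}
visit_count = {name: 0 for name in restaurants}

def choose_restaurant(epsilon=0.1):
    """Exploit the best-known option most of the time; explore occasionally."""
    if random.random() < epsilon:
        return random.choice(restaurants)               # exploration
    return max(restaurants, key=estimated_reward.get)   # exploitation

def update_estimate(name, reward):
    """Incrementally average the rewards observed for the chosen option."""
    visit_count[name] += 1
    estimated_reward[name] += (reward - estimated_reward[name]) / visit_count[name]

# One simulated visit with a made-up reward signal.
choice = choose_restaurant()
update_estimate(choice, reward=random.uniform(0, 5))
print(choice, estimated_reward)
```

With a small epsilon the agent mostly exploits its current best estimate, but the occasional random choice keeps it learning about the other options, so a genuinely better restaurant can eventually be discovered.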

Major components of reinforcement learning are as follows:

  • Policy: This is the agent's behavior function. It determines the mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus-response rules or associations.
  • Value Function: This is a prediction of future reward. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow, and the rewards available in those states.
  • Model: The model predicts what the environment will do next, that is, the next state and the next immediate reward. A toy sketch combining policy, value function, and model follows this list.
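The sketch below shows how these three components fit together on a tiny, hand-made two-state problem. The states, actions, rewards, and discount factor are assumptions made purely for illustration; the point is only that a policy maps states to actions, a model predicts the next state and reward, and a value function accumulates expected future reward.

```python
gamma = 0.9  # discount factor (illustrative choice)

# Policy: a mapping from perceived state to the action to take in that state.
policy = {"s0": "right", "s1": "stay"}

# Model: predicts the next state and immediate reward for a (state, action) pair.
model = {
    ("s0", "right"): ("s1", 0.0),
    ("s1", "stay"):  ("s1", 1.0),
}

def value(state, horizon=50):
    """Value function: total discounted reward expected from `state`,
    estimated here by rolling the model forward under the policy."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        state, reward = model[(state, policy[state])]
        total += discount * reward
        discount *= gamma
    return total

print({s: round(value(s), 2) for s in ("s0", "s1")})
```

Running this prints a higher value for s1 than for s0: s1 yields an immediate reward at every step, while s0 must first transition to s1, so its long-term desirability is lower even though the two states look similar at first glance.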