DEEP REINFORCEMENT LEARNING EXPLAINED - 12

Part 1 of the “Deep Reinforcement Learning Explained” series took a practical approach to the essential concepts of Reinforcement Learning (RL) and Deep Learning (DL), as a starting point for Deep Reinforcement Learning (DRL).

This post starts a new part, Part 2, in which we will introduce the implementation of classical Reinforcement Learning methods such as Monte Carlo, Temporal Difference Learning, SARSA, and Q-learning.

Reinforcement Learning book by Sutton and Barto

We will formalize the formulation presented in Part 1 a little further in order to align it with the notation used in the textbook Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto. This book is “the” classic text, with an excellent introduction to the fundamentals of Reinforcement Learning. Readers who wish to go deeper into the theory behind any of the subjects presented here can therefore turn to it. I think that, with the RL basics acquired in Part 1, this is now easy to do, and having a bibliographic resource like this to complement your study can be very helpful.

Dr. Richard S. Sutton is a distinguished research scientist at DeepMind and a renowned professor of computing science at the University of Alberta. He is considered one of the founding fathers of modern computational Reinforcement Learning. Dr. Andrew G. Barto is a professor emeritus at the University of Massachusetts Amherst and was Dr. Sutton's doctoral advisor.

Notation and definitions updated

In this section we will review the mathematical notation introduced in Part 1 of this series and update it slightly to match the notation used in Sutton and Barto's book.

Because the Medium editor has certain limitations for writing formulas, I am going to write these formulas in LaTeX and include them here as images. To make things clearer, I will also create a set of cheatsheets with the main formulas and definitions to guide your study of RL along with this series. The LaTeX code of these cheatsheets can be found in the GitHub repository of this series.

In Post 2 we saw that a Markov Decision Process (MDP) can be used as a formal definition of the problem that we would like to solve with Reinforcement Learning. An MDP is defined by 5 parameters <S,A,R,γ,p>, where each one indicates:
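As a sketch, assuming the usual reading of these symbols in Sutton and Barto's book, the five components can be summarized as follows (the range given for γ is the textbook convention, not something stated above):

```latex
\begin{aligned}
\mathcal{S} &: \text{the set of states} \\
\mathcal{A} &: \text{the set of actions} \\
\mathcal{R} &: \text{the set of rewards} \\
\gamma &: \text{the discount rate, } 0 \le \gamma \le 1 \\
p &: \text{the one-step dynamics of the Environment, } p(s', r \mid s, a)
\end{aligned}
```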

Other related definitions that we will use throughout this series are:

At an arbitrary time step t, the interaction between the Agent and the MDP Environment has evolved as a sequence of states, actions, and rewards (a trajectory) like this:
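In the textbook's indexing, where the reward that follows action A_t is written R_{t+1}, such a trajectory can be sketched as:

```latex
S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t
```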

Remember that when the Environment responds to the Agent at time step t, it considers only the state and action at the previous time step t-1. It does not care what states were presented to the Agent more than one step earlier, and it does not look at the actions that the Agent took before the last one. Finally, how much reward the Agent has collected so far has no effect on how the Environment chooses to respond.

Because of this, we can completely define how the Environment decides the state and reward by specifying the transition function p as we did here (the dot over the equals sign in the equation reminds us that it is a definition). The function p defines the dynamics of the MDP; these conditional probabilities are said to specify the one-step dynamics of the Environment.
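Written out in the notation of Sutton and Barto, and consistent with the description above (the response at step t depends only on the state and action at step t-1), the definition of p can be sketched as:

```latex
p(s', r \mid s, a) \doteq \Pr\{ S_t = s',\ R_t = r \mid S_{t-1} = s,\ A_{t-1} = a \}
```

For every pair (s, a), these probabilities sum to 1 over all next states s′ and rewards r.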

In summary, when we have a real problem in mind, we need to specify the MDP as a way to formally define the problem that we want our Agent to solve. The Agent will know the states and actions along with the discount factor. The function p, which specifies how the Environment works, will be unknown to the Agent. Despite not having this information, the Agent will still have to learn how to accomplish its goal from its interaction with the Environment.

Discount rate for continuing task

Before continuing, I recommend that the reader review Post 2 to refresh the basics. Let us briefly add how the discount rate behaves in a continuing task, which was not covered in Part 1.

Continuing Task Example

In Part 1 of this series we used an episodic task, the Frozen-Lake Environment, a simple grid-world Environment from OpenAI Gym, a toolkit for developing and comparing RL algorithms. In this section we will introduce a continuing task using another Environment, the Cart-Pole balancing problem:

As shown in the previous figure, a cart is positioned on a frictionless track along the horizontal axis, and a pole is anchored to the top of the cart. The objective is to keep the pole from falling over by moving the cart left or right, without the cart falling off the track.

The system is controlled by applying a force of +1 (left) or -1 (right) to the cart. The pole starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every time step that the pole remains upright, including the final step of the episode. The episode ends when the pole is more than 15 degrees from vertical, or when the cart moves more than 2.4 units from the center.

The observation space of this Environment is an array of 4 numbers. At every time step you can observe the cart's position and velocity and the pole's angle and angular velocity. These are the observable states of this world. You can look up what each of these numbers represents in this document. Notice the minimum (-Inf) and maximum (Inf) values for both Cart Velocity and Pole Velocity at Tip. Since the array entry at each of these indices can be any real number, the state space is infinite!

In any state, the cart has only two possible actions: move to the left or move to the right. In other words, the state space of Cart-Pole has four dimensions of continuous values, and the action space has one dimension with two discrete values.

Discount rate

Which discount rates would encourage the Agent to keep the pole balanced for as long as possible in our continuing task example?

With any discount rate γ>0, the Agent receives a positive reward for each time step in which the pole has not yet fallen. Thus, the Agent will try to keep the pole balanced for as long as possible.

However, imagine that the reward signal is amended to give the Agent a reward only at the end of an episode. That is, the reward is 0 for every time step except the final one, when the episode terminates and the Agent receives a reward of +1.

In this case, if the discount rate is γ=1, the Agent will always receive a total reward of +1 (no matter what actions it chooses during the episode), so the reward signal will not provide any useful feedback to the Agent.

If the discount rate is γ<1, the Agent will try to terminate the episode as soon as possible (by either dropping the pole quickly or moving off the edge of the track), because a reward that arrives sooner is discounted less. Thus, in this case, we must redesign the reward signal!
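The effect of the discount rate on both reward signals can be checked numerically. The following is a minimal sketch in plain Python; the helper names (discounted_return, dense_rewards, terminal_only_rewards) and the episode lengths are made up for illustration and do not come from the series' code.

```python
def discounted_return(rewards, gamma):
    """Return G = sum over k of gamma**k * rewards[k] for one episode."""
    g = 0.0
    # Accumulate backwards so each reward is discounted once per step of delay.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


def dense_rewards(steps):
    # Original reward signal: +1 on every time step the pole stays up.
    return [1.0] * steps


def terminal_only_rewards(steps):
    # Amended reward signal: 0 everywhere except +1 on the final step.
    return [0.0] * (steps - 1) + [1.0]


if __name__ == "__main__":
    for gamma in (1.0, 0.9):
        for steps in (5, 50):
            dense = discounted_return(dense_rewards(steps), gamma)
            sparse = discounted_return(terminal_only_rewards(steps), gamma)
            print(f"gamma={gamma} steps={steps} dense={dense:.4f} terminal_only={sparse:.4f}")
```

With the original signal, longer episodes always give a larger return. With the amended signal, the return is 1 for every episode length when γ=1, and it shrinks as the episode gets longer when γ<1, which is why the Agent would prefer to end the episode early.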

The solution to this problem, that is, the series of actions that the Agent needs to learn in pursuit of its goal, is determined by the Policy. In the next section we will go a little further into the formal definition of the solution to this problem.

Policy

The policy is the strategy (e.g. some set of rules) that the Agent employs to determine the next action based on the current state. Typically denoted by the Greek letter pi, 𝜋(𝑎|𝑠), a policy is a function that determines the next action a to take given a state s.

The simplest kind of policy is a mapping from the set of Environment states S to the set of possible actions A. We call this kind of policy a deterministic policy. But in Post 2 we also saw that the policy 𝜋(𝑎|𝑠) can be defined as a probability rather than as a concrete action. In other words, it is a stochastic policy: a probability distribution over the actions that the Agent can take in a given state.

A stochastic policy allows the Agent to choose actions randomly. More formally, we define a stochastic policy as a mapping that accepts an Environment state s and an action a and returns the probability that the Agent takes action a while in state s:
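In the notation of Sutton and Barto, this definition can be sketched as:

```latex
\pi(a \mid s) \doteq \Pr\{ A_t = a \mid S_t = s \}
```

A deterministic policy is the special case in which, for each state s, one action has probability 1.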

During the learning process, the policy 𝜋 may change as the Agent gains more experience. For example, the Agent may start from a random policy, where all actions are equally probable; as it gains experience, the Agent will hopefully learn to improve its policy until it reaches the optimal policy.
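To make the idea of a random starting policy concrete, here is a small sketch of a tabular stochastic policy in plain Python. The state and action names and the helpers uniform_policy and sample_action are hypothetical, chosen only for illustration.

```python
import random


def uniform_policy(states, actions):
    """Build pi[s][a] with the same probability for every action in every state."""
    prob = 1.0 / len(actions)
    return {s: {a: prob for a in actions} for s in states}


def sample_action(policy, state, rng=random):
    """Draw one action a with probability pi(a|s)."""
    actions = list(policy[state].keys())
    weights = list(policy[state].values())
    return rng.choices(actions, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Hypothetical two-state, two-action problem.
    pi = uniform_policy(states=["s0", "s1"], actions=["left", "right"])
    print(pi["s0"])
    print(sample_action(pi, "s0"))
```

Learning then amounts to moving these probabilities away from uniform, toward the actions that lead to higher return.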

Now that we know how to specify a policy, what steps can we take to make sure that the Agent's policy is the best one? We will use the state-value function and the action-value function already introduced in Post 2.

Value functions

The state-value function, also referred to as the value function or even the V-function, measures the goodness of each state: it tells us the total return we can expect in the future if we start from that state.

For each state s, the state-value function tells us the expected discounted return G if the Agent starts in that state s and then uses the policy to choose its actions for all subsequent time steps. It is important to note that the state-value function always corresponds to a particular policy, so if we change the policy, we change the state-value function. For this reason, we typically denote the function with a lowercase v with the corresponding policy 𝜋 in the subscript, and define it formally by:
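Following Sutton and Barto's notation, and writing the discounted return from time step t as G_t, the definition can be sketched as:

```latex
v_\pi(s) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s \right],
\qquad
G_t \doteq \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```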

where 𝔼[·] denotes the expected value of a random variable given that the Agent follows policy 𝜋, and t is any time step. As we introduced in Post 8, an expectation 𝔼[·] is used in this definition because the Environment transition function might act in a stochastic way.

Also in Post 2 we extended the definition of the state-value function to state-action pairs, defining a value for each state-action pair. This is called the action-value function, also known as the Q-function or simply Q. It defines the value of taking action a in state s under a policy π as the expected return G starting from s, taking the action a, and thereafter following policy π:
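In the same notation as the state-value function, this definition can be sketched as:

```latex
q_\pi(s, a) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s,\ A_t = a \right]
```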

Bellman Expectation Equation

For a general MDP we have to work in terms of an expectation, since it is not often the case that the immediate reward and next state can be predicted with certainty. Indeed, we saw in the previous post that the reward r and next state s′ are chosen according to the one-step dynamics of the MDP. In this case, where r and s′ are drawn from a (conditional) probability distribution p(s′,r∣s,a), the Bellman Expectation Equation expresses the value of any state s in terms of the expected immediate reward and the expected value of the next state (a recursive relationship).

For the general case, where the Agent's policy π is stochastic, the Agent selects action a with probability π(a∣s) when in state s, and the Bellman Expectation Equation can be expressed as:
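Using the symbols already introduced (π(a∣s), p(s′,r∣s,a), γ and vπ), the equation can be written as the following sketch in standard textbook form:

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right]
```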

In this case, we multiply the sum of the reward and the discounted value of the next state, (r+γvπ(s′)), by its corresponding probability π(a∣s)p(s′,r∣s,a) and sum over all possibilities to yield the expected value.
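The weighted sum described above can be turned into a few lines of code. The sketch below applies it repeatedly to a toy two-state MDP until the values stop changing (iterative policy evaluation). The MDP, its numbers, and the names P, bellman_backup and evaluate_policy are all invented for this illustration; they are assumptions, not part of the series' code.

```python
# P[(s, a)] lists (probability, next_state, reward) triples, i.e. p(s', r | s, a).
# Toy two-state MDP; every number here is made up for illustration.
P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "go"): [(1.0, "s0", 0.0)],
}
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]


def bellman_backup(v, policy, s, gamma):
    """Right-hand side of the Bellman expectation equation for state s."""
    return sum(
        policy[s][a] * prob * (reward + gamma * v[s_next])
        for a in ACTIONS
        for prob, s_next, reward in P[(s, a)]
    )


def evaluate_policy(policy, gamma=0.9, tol=1e-10):
    """Apply the backup to every state until v stops changing."""
    v = {s: 0.0 for s in STATES}
    while True:
        new_v = {s: bellman_backup(v, policy, s, gamma) for s in STATES}
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v


if __name__ == "__main__":
    uniform = {s: {a: 0.5 for a in ACTIONS} for s in STATES}
    print(evaluate_policy(uniform))
```

The returned values satisfy the Bellman expectation equation for the given policy up to the tolerance, which is exactly the recursive relationship stated above. Note that this needs full knowledge of p, which the Agent does not have in general.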

We also have a Bellman expectation equation for the action-value function:
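In the same notation, with a′ denoting the action taken in the next state s′, it can be sketched as:

```latex
q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \right]
```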

Optimal Policy

The goal of the Agent is to maximize the total cumulative reward in the long run. The policy that maximizes the total cumulative reward is called the optimal policy. In Post 8 we introduced the “optimal” value functions.

A policy π′ is defined to be better than or equal to a policy π if and only if vπ′(s)≥vπ(s) for all s∈S. An optimal policy π∗ satisfies π∗≥π for all policies π. An optimal policy is guaranteed to exist but may not be unique.

All optimal policies have the same state-value function v∗, called the optimal state-value function. A more formal definition of the optimal state-value function could be:
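In Sutton and Barto's notation, this definition can be sketched as:

```latex
v_*(s) \doteq \max_{\pi} v_\pi(s) \quad \text{for all } s \in \mathcal{S}
```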

and for the optimal action-value function:
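Again as a sketch in the textbook's notation:

```latex
q_*(s, a) \doteq \max_{\pi} q_\pi(s, a) \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}
```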

All optimal policies have the same action-value function q∗, called the optimal action-value function.

The optimal action-value function is very useful for obtaining the optimal policy. The Agent estimates it by interacting with the Environment. Once the Agent determines the optimal action-value function q∗, it can quickly obtain an optimal policy π∗ by setting:
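In symbols, the rule is π∗(s) = argmax over a of q∗(s, a): in each state, pick an action with the highest optimal action value. For a tabular q∗ this is a one-line lookup, sketched below in plain Python. The table values and the name greedy_policy are hypothetical, used only for illustration.

```python
def greedy_policy(q):
    """Return pi[s] = argmax_a q[s][a] for a tabular action-value function."""
    return {
        s: max(action_values, key=action_values.get)
        for s, action_values in q.items()
    }


if __name__ == "__main__":
    # Hypothetical optimal action values for a two-state problem.
    q_star = {
        "s0": {"left": 0.2, "right": 1.5},
        "s1": {"left": 3.0, "right": -1.0},
    }
    print(greedy_policy(q_star))
```

When several actions tie for the maximum, any of them can be chosen, which is one reason an optimal policy need not be unique.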

As we saw in Post 8, the Bellman equation is used in the algorithms that calculate the optimal value functions. A more formal expression could be:
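One way to write these expressions, following the Bellman optimality equations in Sutton and Barto's book and using the same symbols as the expectation equations above, is:

```latex
\begin{aligned}
v_*(s) &= \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_*(s') \right] \\
q_*(s, a) &= \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]
\end{aligned}
```

Compared with the expectation equations, the average over the policy's action probabilities is replaced by a maximum over actions.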

What is next?

We have reached the end of this post! In the following post, we are going to introduce the Monte Carlo Method, a learning method for estimating value functions and discovering optimal policies. Unlike the Value Iteration algorithm introduced in Posts 9, 10 and 11, here we do not assume complete knowledge of the Environment. Monte Carlo methods require only experience: sample sequences of states, actions, and rewards from actual or simulated interaction with the Environment, similar to what we did with the Cross-Entropy Method introduced in Post 6.

See you in the next post!

Translated from: https://towardsdatascience.com/reviewing-essential-concepts-from-part-1-e28234ee7f4f