
Dynamic Programming and Optimal Control: Chapter 1 Exercises


1 Consider the system x_{k+1}=x_k+u_k+w_k,\qquad k=0,1,2,3, with initial state x_0=5, and the cost function \sum_{k=0}^3(x_k^2+u_k^2). Apply the DP algorithm for the following three cases:
(a) The control constraint set U_k(x_k) is \{u\,|\,0\leq x_k+u\leq 5,\ u \text{: integer}\} for all x_k and k, and the disturbance w_k is equal to zero for all k.
(b) The control constraint and the disturbance w_k are as in (a), but there is in addition a constraint x_4=5 on the final state.
(c) The control constraint is as in part (a), and the disturbance w_k takes the values -1 and 1 with equal probability 1/2 for all x_k and u_k, except if x_k+u_k is equal to 0 or 5, in which case w_k=0 with probability 1.
Solution.
(a) Here N=4 and we take J_4(x_4)=g_4(x_4)=0. The system equation is x_{k+1}=x_k+u_k+w_k,\qquad k=0,1,2,3, and the stage cost is g_k=x_k^2+u_k^2. In this case w_k=0 for all k, so J_3(x_3)=\mathop{\min}\limits_{u_3\in U_3(x_3)} \mathop\mathbf{E}\limits_{w_3}\{g_3(x_3,u_3,w_3)+J_4(f_3(x_3,u_3,w_3))\}=\mathop{\min}\limits_{u_3=-x_3,1-x_3,\ldots,5-x_3}(x_3^2+u_3^2)=x_3^2 (for the reachable integer states 0\leq x_3\leq 5, where u_3=0 is feasible), i.e.
J_3(x_3)=x_3^2, \qquad \mu_3^*(x_3)=0.\qquad (1)
Then J_2(x_2)=\mathop{\min}\limits_{u_2=-x_2,1-x_2,\ldots,5-x_2}(x_2^2+u_2^2+(x_2+u_2)^2). Let h_{x_2}(u_2)=x_2^2+u_2^2+(x_2+u_2)^2; then h'_{x_2}(u_2)=4u_2+2x_2.

If x_2>10, then h'_{x_2}(u_2)=4(u_2+x_2)-2x_2\leq 20-2x_2<0 on the feasible set (since x_2+u_2\leq 5), so \mu_2^*(x_2)=5-x_2\;,\;J_2(x_2)=x_2^2+(5-x_2)^2+5^2=2x_2^2-10x_2+50.

If x_2<0, then h'_{x_2}(u_2)=4(u_2+x_2)-2x_2\geq-2x_2>0 (since x_2+u_2\geq 0), so \mu_2^*(x_2)=-x_2\;,\;J_2(x_2)=2x_2^2.

If 0\leq x_2\leq 10 and x_2 is even, then \mu_2^*(x_2)=-\frac{x_2}{2} and J_2(x_2) = \frac{3}{2}x_2^2.

If 0\leq x_2\leq 10 and x_2 is odd, then \mu_2^*(x_2)=-\frac{x_2-1}{2}\;\text{or}\;-\frac{x_2+1}{2} and J_2(x_2) = \frac{3}{2}x_2^2+\frac{1}{2}.

So
\begin{cases} \mu_2^*(x_2)=-x_2\;,\;J_2(x_2)=2x_2^2 & \qquad\text{if }\; x_2<0 \\ \mu_2^*(x_2)=-\frac{x_2}{2}\;,\;J_2(x_2) = \frac{3}{2}x_2^2 & \qquad\text{if }\;0\leq x_2\leq 10 \;\text{and} \;x_2\; \text{is even}\\ \mu_2^*(x_2)=-\frac{x_2-1}{2}\;\text{or}\;-\frac{x_2+1}{2}\;,\;J_2(x_2) = \frac{3}{2}x_2^2+\frac{1}{2} & \qquad\text{if }\;0\leq x_2\leq 10 \;\text{and} \;x_2\; \text{is odd}\\ \mu_2^*(x_2)=5-x_2\;,\;J_2(x_2)=x_2^2+(5-x_2)^2+5^2=2x_2^2-10x_2+50 & \qquad\text{if }\; x_2>10 \end{cases}

For k=1,
J_1(x_1)=\mathop{\min}\limits_{u_1=-x_1,1-x_1,\ldots,5-x_1}(x_1^2+u_1^2+J_2(x_1+u_1))=\mathop{\min}\limits_{u_1=-x_1,1-x_1,\ldots,5-x_1}\big(x_1^2+u_1^2+\frac{3}{2}(x_1+u_1)^2+\frac{1}{2}\cdot\mathbf{1}_{x_1+u_1\;\text{is odd}}\big)
(only the middle cases of J_2 are needed here, since 0\leq x_1+u_1\leq 5\leq 10). In this case we compute directly:

Let h_{x_1}(u_1)=x_1^2+u_1^2+\frac{3}{2}(x_1+u_1)^2+\frac{1}{2}\cdot\mathbf{1}_{x_1+u_1\;\text{is odd}}. Then
h_{x_1}(-x_1)=2x_1^2,\quad h_{x_1}(1-x_1)=2x_1^2-2x_1+3,\quad h_{x_1}(2-x_1)=2x_1^2-4x_1+10,
h_{x_1}(3-x_1)=2x_1^2-6x_1+23,\quad h_{x_1}(4-x_1)=2x_1^2-8x_1+40,\quad h_{x_1}(5-x_1)=2x_1^2-10x_1+63.
Comparing these values gives
\begin{cases} \mu_1^*(x_1)=-x_1\;,\;J_1(x_1)=2x_1^2 & \qquad\text{if }\; x_1\leq\frac{3}{2} \\ \mu_1^*(x_1)=1-x_1\;,\;J_1(x_1)=2x_1^2-2x_1+3 & \qquad\text{if }\; \frac{3}{2}<x_1\leq\frac{7}{2} \\ \mu_1^*(x_1)=2-x_1\;,\;J_1(x_1)=2x_1^2-4x_1+10 & \qquad\text{if }\; \frac{7}{2}<x_1\leq\frac{13}{2} \\ \mu_1^*(x_1)=3-x_1\;,\;J_1(x_1)=2x_1^2-6x_1+23 & \qquad\text{if }\; \frac{13}{2}<x_1\leq\frac{17}{2} \\ \mu_1^*(x_1)=4-x_1\;,\;J_1(x_1)=2x_1^2-8x_1+40 & \qquad\text{if }\; \frac{17}{2}<x_1\leq\frac{23}{2} \\ \mu_1^*(x_1)=5-x_1\;,\;J_1(x_1)=2x_1^2-10x_1+63 & \qquad\text{if }\; x_1>\frac{23}{2} \end{cases}

For k=0,\; x_0=5:
J_0(x_0)=\mathop{\min}\limits_{u_0=-x_0,1-x_0,\ldots,5-x_0}(x_0^2+u_0^2+J_1(x_0+u_0))=\mathop{\min}\limits_{u_0=-5,-4,\ldots,0}(25+u_0^2+J_1(5+u_0)).
Let h_0(u_0)=25+u_0^2+J_1(5+u_0); we have
h_0(-5)=50,\; h_0(-4)=43,\;h_0(-3)=41,\;h_0(-2)=44,\;h_0(-1)=52,\;h_0(0)=65. So \mu_0^*(x_0)=-3 and J_0(x_0)=41.
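The backward pass in part (a) can be checked mechanically. Below is a minimal Python sketch (the function name `solve_part_a` and the parametrization by the target state t = x_k + u_k are ours, not the book's); it runs the recursion over the reachable integer states 0..5.

```python
# Backward DP for part (a): x_{k+1} = x_k + u_k (w_k = 0), stage cost x_k^2 + u_k^2,
# integer controls with 0 <= x_k + u_k <= 5, horizon N = 4, x_0 = 5.
def solve_part_a(N=4, x0=5):
    states = range(0, 6)                       # reachable integer states 0..5
    J = {x: 0 for x in states}                 # terminal cost J_4(x) = 0
    policy = []
    for k in reversed(range(N)):
        Jk, mu = {}, {}
        for x in states:
            # choosing u is the same as choosing the target t = x + u in 0..5
            cost, t = min((x**2 + (t - x)**2 + J[t], t) for t in states)
            Jk[x], mu[x] = cost, t - x
        J = Jk
        policy.insert(0, mu)
    return J[x0], policy[0][x0]

J0, u0 = solve_part_a()
print(J0, u0)   # 41 -3
```

This reproduces J_0(5)=41 with the first move u_0 = -3, matching the table above.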

(b) In this case we still assume that J_4(x_4)=g_4(x_4)=0. By the additional constraint x_4=5, U_3(x_3)=\{5-x_3\}, and
J_3(x_3)=\mathop{\min}\limits_{u_3\in U_3(x_3)}(x_3^2+u_3^2)=x_3^2+(5-x_3)^2.
Then
J_2(x_2)=\mathop{\min}\limits_{u_2=-x_2,1-x_2,\ldots,5-x_2}\big(x_2^2+u_2^2+J_3(x_2+u_2)\big)=\mathop{\min}\limits_{u_2=-x_2,1-x_2,\ldots,5-x_2}\big(x_2^2+u_2^2+(x_2+u_2)^2+(5-x_2-u_2)^2\big).
Let h_{x_2}(u_2)=x_2^2+u_2^2+(x_2+u_2)^2+(5-x_2-u_2)^2; a short calculation gives
h_{x_2}(-x_2)=2x_2^2+25,\quad h_{x_2}(1-x_2)=2x_2^2-2x_2+18,\quad h_{x_2}(2-x_2)=2x_2^2-4x_2+17,
h_{x_2}(3-x_2)=2x_2^2-6x_2+22,\quad h_{x_2}(4-x_2)=2x_2^2-8x_2+33,\quad h_{x_2}(5-x_2)=2x_2^2-10x_2+50,
and we get
\begin{cases} \mu_2^*(x_2)=-x_2\;,\;J_2(x_2)=2x_2^2+25 & \qquad\text{if }\; x_2\leq-\frac{7}{2} \\ \mu_2^*(x_2)=1-x_2\;,\;J_2(x_2)=2x_2^2-2x_2+18 & \qquad\text{if }\; -\frac{7}{2}<x_2\leq-\frac{1}{2} \\ \mu_2^*(x_2)=2-x_2\;,\;J_2(x_2)=2x_2^2-4x_2+17 & \qquad\text{if }\; -\frac{1}{2}<x_2\leq\frac{5}{2} \\ \mu_2^*(x_2)=3-x_2\;,\;J_2(x_2)=2x_2^2-6x_2+22 & \qquad\text{if }\; \frac{5}{2}<x_2\leq\frac{11}{2} \\ \mu_2^*(x_2)=4-x_2\;,\;J_2(x_2)=2x_2^2-8x_2+33 & \qquad\text{if }\; \frac{11}{2}<x_2\leq\frac{17}{2} \\ \mu_2^*(x_2)=5-x_2\;,\;J_2(x_2)=2x_2^2-10x_2+50 & \qquad\text{if }\; x_2>\frac{17}{2} \end{cases}

Similarly, for k=1 we have
\begin{cases} \mu_1^*(x_1)=-x_1\;,\;J_1(x_1)=2x_1^2+17 & \qquad\text{if }\; x_1\leq-\frac{1}{2} \\ \mu_1^*(x_1)=1-x_1\;,\;J_1(x_1)=2x_1^2-2x_1+16 & \qquad\text{if }\; -\frac{1}{2}<x_1\leq\frac{5}{2} \\ \mu_1^*(x_1)=2-x_1\;,\;J_1(x_1)=2x_1^2-4x_1+21 & \qquad\text{if }\; \frac{5}{2}<x_1\leq 5 \\ \mu_1^*(x_1)=3-x_1\;,\;J_1(x_1)=2x_1^2-6x_1+31 & \qquad\text{if }\; 5<x_1\leq\frac{15}{2} \\ \mu_1^*(x_1)=4-x_1\;,\;J_1(x_1)=2x_1^2-8x_1+46 & \qquad\text{if }\; \frac{15}{2}<x_1\leq\frac{21}{2} \\ \mu_1^*(x_1)=5-x_1\;,\;J_1(x_1)=2x_1^2-10x_1+67 & \qquad\text{if }\; x_1>\frac{21}{2} \end{cases}

For k=0,\; x_0=5, let h_0(u_0)=25+u_0^2+J_1(5+u_0); then
h_0(-5)=66,\; h_0(-4)=57,\;h_0(-3)=54,\;h_0(-2)=56,\;h_0(-1)=63,\;h_0(0)=76.
So \mu_0^*(x_0)=-3 and J_0(x_0)=54.

(c) In this case w_k is no longer identically zero; it follows the distribution given in the problem statement. Still,
J_3(x_3)=\mathop{\min}\limits_{u_3\in U_3(x_3)} \mathop\mathbf{E}\limits_{w_3}\{g_3(x_3,u_3,w_3)+J_4(f_3(x_3,u_3,w_3))\}=\mathop{\min}\limits_{u_3\in U_3(x_3)}(x_3^2+u_3^2),
which is not affected by the distribution of w_3, so the result is the same as in case (a):

\begin{cases} \mu_3^*(x_3)=-x_3\;,\;J_3(x_3)=2x_3^2 & \qquad\text{if }\; x_3\leq\frac{1}{2} \\ \mu_3^*(x_3)=1-x_3\;,\;J_3(x_3)=2x_3^2-2x_3+1 & \qquad\text{if }\; \frac{1}{2}<x_3\leq\frac{3}{2} \\ \mu_3^*(x_3)=2-x_3\;,\;J_3(x_3)=2x_3^2-4x_3+4 & \qquad\text{if }\; \frac{3}{2}<x_3\leq\frac{5}{2} \\ \mu_3^*(x_3)=3-x_3\;,\;J_3(x_3)=2x_3^2-6x_3+9 & \qquad\text{if }\; \frac{5}{2}<x_3\leq\frac{7}{2} \\ \mu_3^*(x_3)=4-x_3\;,\;J_3(x_3)=2x_3^2-8x_3+16 & \qquad\text{if }\; \frac{7}{2}<x_3\leq\frac{9}{2} \\ \mu_3^*(x_3)=5-x_3\;,\;J_3(x_3)=2x_3^2-10x_3+25 & \qquad\text{if }\; x_3>\frac{9}{2} \end{cases}

In particular, we have J_3(0)=0, \,J_3(1)=1,\,J_3(2)=4,\,J_3(3)=9,\,J_3(4)=16,\,J_3(5)=25, which will be used in the case k=2. Now
J_2(x_2)=\mathop{\min}\limits_{u_2\in U_2(x_2)} \mathop\mathbf{E}\limits_{w_2}(x_2^2+u_2^2+J_3(x_2+u_2+w_2)).
We calculate the expectation on the right side for each of the possible values of u_2:
u_2=-x_2\;:\;\mathbf{E}\{\cdot\}=x_2^2+(-x_2)^2+J_3(0)=2x_2^2
u_2=1-x_2\;:\;\mathbf{E}\{\cdot\}=x_2^2+(1-x_2)^2+\frac{1}{2}[J_3(0)+J_3(2)]=2x_2^2-2x_2+3
u_2=2-x_2\;:\;\mathbf{E}\{\cdot\}=x_2^2+(2-x_2)^2+\frac{1}{2}[J_3(1)+J_3(3)]=2x_2^2-4x_2+9
u_2=3-x_2\;:\;\mathbf{E}\{\cdot\}=x_2^2+(3-x_2)^2+\frac{1}{2}[J_3(2)+J_3(4)]=2x_2^2-6x_2+19
u_2=4-x_2\;:\;\mathbf{E}\{\cdot\}=x_2^2+(4-x_2)^2+\frac{1}{2}[J_3(3)+J_3(5)]=2x_2^2-8x_2+33
u_2=5-x_2\;:\;\mathbf{E}\{\cdot\}=x_2^2+(5-x_2)^2+J_3(5)=2x_2^2-10x_2+50
Hence we have

\begin{cases} \mu_2^*(x_2)=-x_2\;,\;J_2(x_2)=2x_2^2 & \qquad\text{if }\; x_2\leq\frac{3}{2} \\ \mu_2^*(x_2)=1-x_2\;,\;J_2(x_2)=2x_2^2-2x_2+3 & \qquad\text{if }\; \frac{3}{2}<x_2\leq 3 \\ \mu_2^*(x_2)=2-x_2\;,\;J_2(x_2)=2x_2^2-4x_2+9 & \qquad\text{if }\; 3<x_2\leq 5 \\ \mu_2^*(x_2)=3-x_2\;,\;J_2(x_2)=2x_2^2-6x_2+19 & \qquad\text{if }\; 5<x_2\leq 7 \\ \mu_2^*(x_2)=4-x_2\;,\;J_2(x_2)=2x_2^2-8x_2+33 & \qquad\text{if }\; 7<x_2\leq\frac{17}{2} \\ \mu_2^*(x_2)=5-x_2\;,\;J_2(x_2)=2x_2^2-10x_2+50 & \qquad\text{if }\; x_2>\frac{17}{2} \end{cases}

In particular, J_2(0)=0,\,J_2(1)=2,\,J_2(2)=7,\,J_2(3)=15,\,J_2(4)=25,\,J_2(5)=39.

For k=1,
J_1(x_1)=\mathop{\min}\limits_{u_1\in U_1(x_1)} \mathop\mathbf{E}\limits_{w_1}(x_1^2+u_1^2+J_2(x_1+u_1+w_1)).
We calculate the expectation on the right side for each of the possible values of u_1:
u_1=-x_1\;:\;\mathbf{E}\{\cdot\}=x_1^2+(-x_1)^2+J_2(0)=2x_1^2
u_1=1-x_1\;:\;\mathbf{E}\{\cdot\}=x_1^2+(1-x_1)^2+\frac{1}{2}[J_2(0)+J_2(2)]=2x_1^2-2x_1+\frac{9}{2}
u_1=2-x_1\;:\;\mathbf{E}\{\cdot\}=x_1^2+(2-x_1)^2+\frac{1}{2}[J_2(1)+J_2(3)]=2x_1^2-4x_1+\frac{25}{2}
u_1=3-x_1\;:\;\mathbf{E}\{\cdot\}=x_1^2+(3-x_1)^2+\frac{1}{2}[J_2(2)+J_2(4)]=2x_1^2-6x_1+25
u_1=4-x_1\;:\;\mathbf{E}\{\cdot\}=x_1^2+(4-x_1)^2+\frac{1}{2}[J_2(3)+J_2(5)]=2x_1^2-8x_1+43
u_1=5-x_1\;:\;\mathbf{E}\{\cdot\}=x_1^2+(5-x_1)^2+J_2(5)=2x_1^2-10x_1+64
Hence we have

\begin{cases} \mu_1^*(x_1)=-x_1\;,\;J_1(x_1)=2x_1^2 & \qquad\text{if }\; x_1\leq\frac{9}{4} \\ \mu_1^*(x_1)=1-x_1\;,\;J_1(x_1)=2x_1^2-2x_1+\frac{9}{2} & \qquad\text{if }\; \frac{9}{4}<x_1\leq 4 \\ \mu_1^*(x_1)=2-x_1\;,\;J_1(x_1)=2x_1^2-4x_1+\frac{25}{2} & \qquad\text{if }\; 4<x_1\leq\frac{25}{4} \\ \mu_1^*(x_1)=3-x_1\;,\;J_1(x_1)=2x_1^2-6x_1+25 & \qquad\text{if }\; \frac{25}{4}<x_1\leq 9 \\ \mu_1^*(x_1)=4-x_1\;,\;J_1(x_1)=2x_1^2-8x_1+43 & \qquad\text{if }\; 9<x_1\leq\frac{21}{2} \\ \mu_1^*(x_1)=5-x_1\;,\;J_1(x_1)=2x_1^2-10x_1+64 & \qquad\text{if }\; x_1>\frac{21}{2} \end{cases}

In particular, J_1(0)=0,\,J_1(1)=2,\,J_1(2)=8,\,J_1(3)=\frac{33}{2},\,J_1(4)=\frac{57}{2},\,J_1(5)=\frac{85}{2}.

In the last step, the algorithm takes the form
J_0(x_0)=\mathop{\min}\limits_{u_0\in U_0(x_0)} \mathop\mathbf{E}\limits_{w_0}(x_0^2+u_0^2+J_1(x_0+u_0+w_0))=\mathop{\min}\limits_{u_0\in U_0(x_0)} \mathop\mathbf{E}\limits_{w_0}\big(25+u_0^2+J_1(5+u_0+w_0)\big).
u_0=-5\;:\;\mathbf{E}\{\cdot\}=25+25+J_1(0)=50
u_0=-4\;:\;\mathbf{E}\{\cdot\}=25+16+\frac{1}{2}[J_1(0)+J_1(2)]=45
u_0=-3\;:\;\mathbf{E}\{\cdot\}=25+9+\frac{1}{2}[J_1(1)+J_1(3)]=43\frac{1}{4}
u_0=-2\;:\;\mathbf{E}\{\cdot\}=25+4+\frac{1}{2}[J_1(2)+J_1(4)]=47\frac{1}{4}
u_0=-1\;:\;\mathbf{E}\{\cdot\}=25+1+\frac{1}{2}[J_1(3)+J_1(5)]=55\frac{1}{2}
u_0=0\;:\;\mathbf{E}\{\cdot\}=25+0+J_1(5)=67\frac{1}{2}
So \mu_0^*(x_0)=-3 and J_0(x_0)=43\frac{1}{4}.\qquad\Box
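The stochastic recursion of part (c) is easy to automate as a check. A minimal sketch (states 0..5, parametrized by the target t = x_k + u_k; the function name `solve_part_c` is ours) gives J_1(3)=33/2 and J_0(5)=43\frac{1}{4} with \mu_0^*(5)=-3.

```python
# Backward DP for part (c): after choosing the target t = x_k + u_k in 0..5,
# the next state is t + w_k with w_k = +/-1 (prob. 1/2 each), unless t is 0 or 5
# (then w_k = 0 with probability 1).
def solve_part_c(N=4, x0=5):
    states = range(0, 6)
    J = {x: 0.0 for x in states}            # terminal cost J_4 = 0
    mu = {}
    for k in reversed(range(N)):
        Jk = {}
        for x in states:
            def ecost(t):
                future = J[t] if t in (0, 5) else 0.5 * (J[t - 1] + J[t + 1])
                return x ** 2 + (t - x) ** 2 + future
            cost, t = min((ecost(t), t) for t in states)
            Jk[x], mu[(k, x)] = cost, t - x
        J = Jk
    return J[x0], mu[(0, x0)]

J0, u0 = solve_part_c()
print(J0, u0)    # 43.25 -3
```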
PS: In the book a hint is given: alternatively, you may use a terminal cost g_4(x_4) equal to a very large number for x_4\neq 5. Why a very large number? Is my calculation wrong?
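On the hint: the "very large number" only needs to exceed every cost achievable while meeting x_4=5; then the unconstrained DP never prefers ending at x_4\neq 5, and it reproduces the hard-constraint computation of part (b). A minimal sketch (penalty value and function name `solve_part_b` are ours):

```python
# Part (b) via a large terminal penalty instead of the hard constraint x_4 = 5.
BIG = 10**6                                  # any bound above all feasible costs works

def solve_part_b(N=4, x0=5):
    states = range(0, 6)
    J = {x: (0 if x == 5 else BIG) for x in states}   # g_4(x) = BIG for x != 5
    mu0 = None
    for k in reversed(range(N)):
        Jk, mu = {}, {}
        for x in states:
            # deterministic dynamics (w_k = 0); t = x + u is the target state
            cost, t = min((x**2 + (t - x)**2 + J[t], t) for t in states)
            Jk[x], mu[x] = cost, t - x
        J = Jk
        mu0 = mu
    return J[x0], mu0[x0]

print(solve_part_b())   # (54, -3)
```

So the direct constrained calculation is consistent with the penalty formulation; "very large" is just a safe way of saying "larger than any feasible cost".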

1.2 Carry out the calculations needed to verify that J_0(1)=2.67 and J_0(2)=2.608 in Example 1.3.2.
Solution. The values stated in the exercise appear to be wrong; they do not even match those in the text.
For x_0=1, we have J_0(1)=\min_{u_0=0,1} \mathop{\bf E}\limits_{w_0=0,1,2}\{u_0+(1+u_0-w_0)^2+J_1(\max(0,1+u_0-w_0))\}
u_0=0: \;\mathbf{E}\{\cdot\}=\mathop{\mathbf{E}}\limits_{w_0=0,1,2}\{(1-w_0)^2+J_1(\max(0,1-w_0))\}
\qquad\qquad\qquad=0.1\times[1+J_1(1)]+0.7\times J_1(0)+0.2\times[1+J_1(0)]
\qquad\qquad\qquad=0.1\times2.5+0.7\times 2.5 + 0.2\times 3.5
\qquad\qquad\qquad=2.7
u_0=1: \;\mathbf{E}(.)=\mathop{\mathbf{E}}\limits_{w_0=0,1,2}\{1+(2-w_0)^2+J_1(\max(0,2-w_0))\}
\qquad\qquad\qquad=0.1\times[5+J_1(2)]+0.7\times [2+J_1(1)]+0.2\times[1+J_1(0)]
\qquad\qquad\qquad=0.1\times6.68+0.7\times3.5 + 0.2\times 3.5
\qquad\qquad\qquad=3.818
So J_0(1)=2.7, \mu_0^*(1)=0.

For x_0=2, we have J_0(2)=\min_{u_0=0} \mathop{\bf E}\limits_{w_0=0,1,2}\{u_0+(2+u_0-w_0)^2+J_1(\max(0,2+u_0-w_0))\}=\mathop{\mathbf{E}}\limits_{w_0=0,1,2}\{(2-w_0)^2+J_1(\max(0,2-w_0))\}
\qquad\qquad\qquad\qquad=0.1\times[4+J_1(2)]+0.7\times[1+J_1(1)]+0.2\times J_1(0)
\qquad\qquad\qquad\qquad=0.1\times5.68+0.7\times2.5+0.2\times 2.5
\qquad\qquad\qquad\qquad=2.818
So J_0(2)=2.818,\mu_0^*(2)=0.\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\Box
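These two expectations are short enough to re-check numerically. A minimal Python sketch, using the stage-1 values quoted above from Example 1.3.2 (J_1(0)=2.5, J_1(1)=1.5, J_1(2)=1.68; the helper name `expected` is ours):

```python
# Re-check J_0(1) and J_0(2) with the J_1 values quoted from Example 1.3.2.
J1 = {0: 2.5, 1: 1.5, 2: 1.68}
p = {0: 0.1, 1: 0.7, 2: 0.2}                 # P(w_0 = 0, 1, 2)

def expected(x0, u0):
    # E{ u0 + (x0 + u0 - w)^2 + J_1(max(0, x0 + u0 - w)) }
    return sum(pw * (u0 + (x0 + u0 - w) ** 2 + J1[max(0, x0 + u0 - w)])
               for w, pw in p.items())

J0_1 = min(expected(1, u) for u in (0, 1))   # stock 1: u_0 in {0, 1}
J0_2 = expected(2, 0)                        # stock 2: only u_0 = 0 fits capacity
print(round(J0_1, 3), round(J0_2, 3))        # 2.7 2.818
```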

1.6 (Discounted Cost per Stage) In the framework of the basic problem, consider the case where the cost is of the form
\mathop\mathbf{E}\limits_{\mathop{w_k}\limits_{k=0,1,...,N-1}}\left\{\alpha^Ng_N(x_N)+\sum_{k=0}^{N-1}\alpha^kg_k(x_k,u_k,w_k)\right\}
where \alpha is a discount factor with 0<\alpha<1. Show that an alternate form of the DP algorithm is given by
V_N(x_N)=g_N(x_N),
V_k(x_k)=\mathop{\min}\limits_{u_k\in U_k(x_k)}\mathop\mathbf{E}\limits_{w_k}\left\{g_k(x_k,u_k,w_k)+\alpha V_{k+1}\big(f_k(x_k,u_k,w_k)\big)\right\}.
Solution. Arguing as in the proof of Proposition 1.3.1, in the discounted case the optimal cost takes the form
V^*(x_0)=\mathop{\min}\limits_{\pi\in\Pi}\mathop\mathbf{E}\limits_{\mathop{w_k}\limits_{k=0,1,...,N-1}}\left\{\alpha^Ng_N(x_N)+\sum_{k=0}^{N-1}\alpha^kg_k(x_k,u_k,w_k)\right\} Let V_k^*(x_k) be the optimal cost for the (N-k)-stage problem that starts at state x_k and time k, and ends at time N.
V_k^*(x_k)=\mathop{\min}\limits_{\pi^k}\mathop\mathbf{E}\limits_{w_k,...,w_{N-1}}\left\{\alpha^{N-k}g_N(x_N)+\sum_{i=k}^{N-1}\alpha^{i-k}g_i(x_i,u_i,w_i)\right\} Claim: \; V_k^*(x_k)=V_k(x_k).
We prove this claim by induction on k. Firstly, V_N^*(x_N)=g_N(x_N)=V_N(x_N). Suppose the claim is true for k+1,\ldots,N; then
V_k^*(x_k)=\mathop{\min}\limits_{\pi^k}\mathop\mathbf{E}\limits_{w_k,\ldots,w_{N-1}}\left\{g_k(x_k,u_k,w_k)+\alpha\left[\alpha^{N-k-1}g_N(x_N)+\sum_{i=k+1}^{N-1}\alpha^{i-k-1}g_i(x_i,u_i,w_i)\right]\right\}
=\mathop{\min}\limits_{u_k\in U_k(x_k)}\mathop\mathbf{E}\limits_{w_k}\left\{g_k(x_k,u_k,w_k)+\alpha\cdot\mathop{\min}\limits_{\pi^{k+1}}\mathop\mathbf{E}\limits_{w_{k+1},\ldots,w_{N-1}}\left[\alpha^{N-k-1}g_N(x_N)+\sum_{i=k+1}^{N-1}\alpha^{i-k-1}g_i(x_i,u_i,w_i)\right]\right\}
=\mathop{\min}\limits_{u_k\in U_k(x_k)}\mathop\mathbf{E}\limits_{w_k}\left\{g_k(x_k,u_k,w_k)+\alpha V_{k+1}^*\big(f_k(x_k,u_k,w_k)\big)\right\}
=\mathop{\min}\limits_{u_k\in U_k(x_k)}\mathop\mathbf{E}\limits_{w_k}\left\{g_k(x_k,u_k,w_k)+\alpha V_{k+1}\big(f_k(x_k,u_k,w_k)\big)\right\}\qquad\text{(induction hypothesis)}
=V_k(x_k).\qquad\Box
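As a numerical sanity check (not a proof), one can compare the V-recursion with ordinary DP applied to the time-weighted stage costs \alpha^k g_k: by induction, the latter produces \alpha^k V_k, so the two agree at k=0. The dynamics f, costs g, gN below are made-up examples, not from the book.

```python
alpha = 0.9
N = 3
S, U, W = (0, 1, 2), (0, 1), ((0, 0.5), (1, 0.5))   # states, controls, (w, prob)
f = lambda x, u, w: (x + u + w) % 3                  # made-up dynamics
g = lambda x, u, w: (x - u) ** 2 + w                 # made-up stage cost
gN = lambda x: x                                     # made-up terminal cost

# V-recursion from the exercise: V_k = min_u E{ g_k + alpha * V_{k+1} }
V = {x: gN(x) for x in S}
for k in reversed(range(N)):
    V = {x: min(sum(p * (g(x, u, w) + alpha * V[f(x, u, w)]) for w, p in W)
                for u in U) for x in S}

# ordinary DP on stage costs alpha^k g_k and terminal cost alpha^N g_N
J = {x: alpha ** N * gN(x) for x in S}
for k in reversed(range(N)):
    J = {x: min(sum(p * (alpha ** k * g(x, u, w) + J[f(x, u, w)]) for w, p in W)
                for u in U) for x in S}

assert all(abs(J[x] - V[x]) < 1e-9 for x in S)       # J_0 = alpha^0 V_0 = V_0
```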

1.7 (Exponential Cost Function) In the framework of the basic problem, consider the case where the cost is of the form
\mathop\mathbf{E}\limits_{\mathop{w_k}\limits_{k=0,1,...,N-1}}\left\{\exp\left(g_N(x_N)+\sum_{k=0}^{N-1}g_k(x_k,u_k,w_k)\right)\right\}
(a) Show that the optimal cost and an optimal policy can be obtained from the DP-like algorithm
J_N(x_N)=\exp\left(g_N(x_N)\right),
J_k(x_k)=\mathop{\min}\limits_{u_k\in U_k(x_k)}\mathop\mathbf{E}\limits_{w_k}\left\{J_{k+1}\left(f_k(x_k,u_k,w_k)\right)\exp\left(g_k(x_k,u_k,w_k)\right)\right\}
(b) Define the functions V_k(x_k)=\ln J_k(x_k). Assume also that g_k is a function of x_k and u_k only (and not of w_k). Show that the above algorithm can be rewritten as
V_N(x_N)=g_N(x_N),
V_k(x_k)=\mathop{\min}\limits_{u_k\in U_k(x_k)}\left\{g_k(x_k,u_k)+\ln \mathop\mathbf{E}\limits_{w_k}\left\{\exp\left(V_{k+1}\left(f_k(x_k,u_k,w_k)\right)\right)\right\}\right\}
Solution. (a) Define J_k^*(x_k)=\mathop{\min}\limits_{\pi^k}\mathop\mathbf{E}\limits_{\mathop{w_i}\limits_{i=k,k+1,...,N-1}}\left\{\exp\left(g_N(x_N)+\sum_{i=k}^{N-1}g_i(x_i,u_i,w_i)\right)\right\}
We show by induction that J_k^*(x_k)=J_k(x_k); the base case is J_N^*(x_N)=\exp(g_N(x_N))=J_N(x_N). For the induction step,
J_k^*(x_k)=\mathop{\min}\limits_{u_k,\pi^{k+1}}\mathop\mathbf{E}\limits_{w_k,\ldots,w_{N-1}}\left\{\exp\left(g_k(x_k,u_k,w_k)\right)\exp\left(g_N(x_N)+\sum_{i=k+1}^{N-1}g_i(x_i,u_i,w_i)\right)\right\}
=\mathop{\min}\limits_{u_k\in U_k(x_k)}\mathop\mathbf{E}\limits_{w_k}\left\{\exp\left(g_k(x_k,u_k,w_k)\right)\mathop{\min}\limits_{\pi^{k+1}}\mathop\mathbf{E}\limits_{w_{k+1},\ldots,w_{N-1}}\left\{\exp\left(g_N(x_N)+\sum_{i=k+1}^{N-1}g_i(x_i,u_i,w_i)\right)\right\}\right\}
(the interchange is justified because \exp(\cdot)\geq 0 and the inner minimization depends on w_k only through x_{k+1}=f_k(x_k,u_k,w_k))
=\mathop{\min}\limits_{u_k\in U_k(x_k)}\mathop\mathbf{E}\limits_{w_k}\left\{\exp\left(g_k(x_k,u_k,w_k)\right)J_{k+1}^*\big(f_k(x_k,u_k,w_k)\big)\right\}
=\mathop{\min}\limits_{u_k\in U_k(x_k)}\mathop\mathbf{E}\limits_{w_k}\left\{J_{k+1}\big(f_k(x_k,u_k,w_k)\big)\exp\left(g_k(x_k,u_k,w_k)\right)\right\}=J_k(x_k).
(b) Since g_k does not depend on w_k, the factor \exp(g_k(x_k,u_k)) can be pulled out of the expectation:
J_k(x_k)=\mathop{\min}\limits_{u_k\in U_k(x_k)}\left\{\exp\left(g_k(x_k,u_k)\right)\mathop\mathbf{E}\limits_{w_k}\left\{J_{k+1}\left(f_k(x_k,u_k,w_k)\right)\right\}\right\}
Taking logarithms (which commutes with the minimization, since \ln is increasing),
V_k(x_k)=\ln J_k(x_k)=\mathop{\min}\limits_{u_k\in U_k(x_k)}\left\{g_k(x_k,u_k)+\ln\mathop\mathbf{E}\limits_{w_k}\left\{\exp\left(V_{k+1}\left(f_k(x_k,u_k,w_k)\right)\right)\right\}\right\}\qquad\Box
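The (a)/(b) equivalence can likewise be checked on a toy problem. The dynamics and costs below are made up; the only property used is that g_k does not depend on w_k.

```python
import math

N = 3
S, U, W = (0, 1), (0, 1), ((0, 0.5), (1, 0.5))   # states, controls, (w, prob)
f = lambda x, u, w: (x + u + w) % 2              # made-up dynamics
g = lambda x, u: abs(x - u) * 0.3                # stage cost, independent of w
gN = lambda x: 0.1 * x

# (a): multiplicative recursion on J
J = {x: math.exp(gN(x)) for x in S}
for _ in range(N):
    J = {x: min(math.exp(g(x, u)) * sum(p * J[f(x, u, w)] for w, p in W)
                for u in U) for x in S}

# (b): log-domain recursion on V
V = {x: gN(x) for x in S}
for _ in range(N):
    V = {x: min(g(x, u) + math.log(sum(p * math.exp(V[f(x, u, w)]) for w, p in W))
                for u in U) for x in S}

assert all(abs(V[x] - math.log(J[x])) < 1e-9 for x in S)   # V_k = ln J_k
```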

1.8 (Terminating Process) In the framework of the basic problem, consider the case where the system evolution terminates at time i when a given value \bar{w_i} of the disturbance at time i occurs, or when a termination decision u_i is made by the controller. If termination occurs at time i, the resulting cost is T+\sum_{k=0}^ig_k(x_k,u_k,w_k), where T is a termination cost. If the process has not terminated up to the final time N, the resulting cost is g_N(x_N)+\sum_{k=0}^{N-1}g_k(x_k,u_k,w_k). Reformulate the problem into the framework of the basic problem.

1.9 (Multiplicative Cost) In the framework of the basic problem, consider the case where the cost has the multiplicative form \mathop\mathbf{E}\limits_{\mathop{w_k}\limits_{k=0,1,...,N-1}}\left\{g_N(x_N)\cdot g_{N-1}(x_{N-1}, u_{N-1}, w_{N-1})\ldots g_0(x_0,u_0,w_0)\right\} Develop a DP-like algorithm for this problem assuming that g_k(x_k,u_k,w_k)\ge0 for all x_k, u_k, w_k and k.
Solution. The DP-like algorithm for this problem is as follows:
J_N(x_N)=g_N(x_N),
J_k(x_k)=\mathop{\min}\limits_{u_k\in U_k(x_k)}\mathop\mathbf{E}\limits_{w_k}\left\{g_k(x_k,u_k,w_k)J_{k+1}\left(f_k(x_k,u_k,w_k)\right)\right\}
(the assumption g_k\geq 0 is what allows the minimization to be interchanged with the expectation, as in Exercise 1.7). The proof is standard, as above: we may define
J_k^*(x_k)=\mathop{\min}\limits_{\pi^k}\mathop\mathbf{E}\limits_{\mathop{w_i}\limits_{i=k,\ldots,N-1}} \left\{g_N(x_N)\,g_{N-1}(x_{N-1},u_{N-1},w_{N-1})\cdots g_k(x_k,u_k,w_k) \right\}
and prove J_k^*(x_k)=J_k(x_k) by induction. \qquad\Box
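For a quick check of the multiplicative recursion, take a deterministic toy problem (so that minimizing over open-loop control sequences is equivalent to minimizing over policies) and compare the DP value with brute force. All names and the instance below are made up.

```python
from itertools import product

N = 3
S, U = (0, 1, 2), (0, 1)
f = lambda x, u: (x + u) % 3          # deterministic, made-up dynamics
g = lambda x, u: 1 + abs(x - u)       # nonnegative stage cost
gN = lambda x: 1 + x                  # nonnegative terminal cost

# DP-like multiplicative recursion (with the minimization over u_k)
J = {x: gN(x) for x in S}
for _ in range(N):
    J = {x: min(g(x, u) * J[f(x, u)] for u in U) for x in S}

# brute force over all open-loop control sequences (valid here: no randomness)
def cost(x0, us):
    c, x = 1.0, x0
    for u in us:
        c, x = c * g(x, u), f(x, u)
    return c * gN(x)

brute = {x0: min(cost(x0, us) for us in product(U, repeat=N)) for x0 in S}
assert all(abs(J[x] - brute[x]) < 1e-9 for x in S)
```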

1.10 Assume that we have a vessel whose maximum weight capacity is z and whose cargo is to consist of different quantities of N different items. Let v_i denote the value of the i-th type of item, w_i the weight of i-th type of item, and x_i the number of items of type i that are loaded in the vessel. The problem is to find the most valuable cargo, i.e., to maximize \sum_{i=1}^Nx_iv_i subject to the constraints \sum_{i=1}^Nx_iw_i\leq z and x_i=0,1,2,\ldots Formulate this problem in terms of DP.

Solution. We formulate this problem in terms of DP as follows:
\qquad the state x'_k is the total weight already loaded, x'_k=\mathop{\sum}\limits_{i=1}^{k-1}x_iw_i, with x'_1=0;
\qquad an action u'_k is chosen from the set U_k(x'_k)=\left\{0,1,\ldots,\left[\frac{z-x'_k}{w_k}\right]\right\};
\qquad the disturbance w'_k is identically 0;
\qquad the state transition function is x'_{k+1}=f_k(x'_k,u'_k,w'_k)=x'_k+u'_kw_k;
\qquad the cost function is g_k(x'_k,u'_k,w'_k)=-u'_kv_k.
So the problem is reformulated as the minimization
\mathop{\min}\limits_{\pi=\{u'_1,\ldots,u'_N\}}\sum_{k=1}^Ng_k(x'_k,u'_k,w'_k)=\mathop{\min}\limits_{\pi}\Big(-\sum_{k=1}^Nu'_kv_k\Big),
which can be solved by the following DP algorithm:
J_{N+1}(x'_{N+1})=0,
J_k(x'_k)=\mathop{\min}\limits_{u'_k\in U_k(x'_k)}\{-u'_kv_k+J_{k+1}(f_k(x'_k,u'_k,w'_k))\}\;,\qquad k=1,2,\ldots,N.\qquad\Box
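The formulation can be exercised on a toy instance. The instance (v, w, z) and the helper `best_value` below are hypothetical; the sketch maximizes \sum u'_k v_k directly, which is equivalent to minimizing the negated cost g_k=-u'_kv_k, and cross-checks against brute force.

```python
from itertools import product

# Hypothetical instance: item values v, item weights w, capacity z
v, w, z = [6, 10, 12], [1, 2, 3], 5
N = len(v)

# DP over the accumulated weight x'_k, as formulated above
def best_value(k, used):
    if k == N:
        return 0
    # u ranges over 0 .. floor((z - used) / w_k), as in U_k(x'_k)
    return max(u * v[k] + best_value(k + 1, used + u * w[k])
               for u in range((z - used) // w[k] + 1))

dp = best_value(0, 0)

# brute-force check over all feasible loadings
brute = max(sum(x * vv for x, vv in zip(xs, v))
            for xs in product(range(z + 1), repeat=N)
            if sum(x * ww for x, ww in zip(xs, w)) <= z)
assert dp == brute
print(dp)   # 30 for this instance (five copies of the first item)
```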

1.11 Consider a device consisting of N stages connected in series, where each stage consists of a particular component. The components are subject to failure, and to increase the reliability of the device duplicate components are provided. For j=1,2,\ldots,N, let (1+m_j) be the number of components for the j-th stage, let p_j(m_j) be the probability of successful operation of the j-th stage when (1+m_j) components are used, and let c_j denote the cost of a single component at the j-th stage. Formulate in terms of DP the problem of finding the number of components at each stage that maximize the reliability of the device expressed by p_1(m_1)\cdot p_2(m_2)\ldots p_N(m_N) subject to the cost constraint \sum_{j=1}^N c_jm_j\leq A, where A>0 is given.

1.12 (Minimization over a Subset of Policies) This problem is primarily of theoretical interest. Consider a variant of the basic problem whereby we seek \mathop{\min}\limits_{\pi\in\tilde{\Pi}}J_{\pi}(x_0) where \tilde{\Pi} is some given subset of the set of sequences \{\mu_0,\mu_1,\ldots, \mu_{N-1}\} of functions \mu_k:S_k\rightarrow C_k with \mu_k(x_k)\in U_k(x_k) for all x_k\in S_k. Assume that \pi^*=\{\mu_0^*,\mu_1^*,\ldots, \mu_{N-1}^*\} belongs to \tilde{\Pi} and attains the minimum in the DP algorithm; that is, for all k=0,1,\ldots,N-1 and x_k\in S_k, J_k(x_k)=\mathop{\mathbf{E}}\limits_{w_k}\left\{g_k(x_k,\mu_k^*(x_k),w_k)+J_{k+1}\left(f_k(x_k,\mu_k^*(x_k),w_k)\right)\right\}=\mathop{\min}\limits_{u_k\in U_k(x_k)}\mathop{\mathbf{E}}\limits_{w_k}\left\{g_k(x_k,u_k,w_k)+J_{k+1}\left(f_k(x_k,u_k,w_k)\right)\right\} with J_N(x_N)=g_N(x_N). Assume further that the functions J_k are real-valued and that the preceding expectations are well-defined and finite. Show that \pi^* is optimal within \tilde{\Pi} and that the corresponding optimal cost is equal to J_0(x_0).

1.13 (Semilinear Systems) Consider a problem involving the system x_{k+1}=A_kx_k+f_k(u_k)+w_k where x_k\in\mathfrak{R}^n, f_k are given functions, and A_k and w_k are random n\times n matrices and n-vectors, respectively, with given probability distributions that do not depend on x_k, u_k or prior values of A_k and w_k. Assume that the cost is of the form \mathop{\mathbf{E}}\limits_{\mathop{A_k,w_k}\limits_{k=0,1,\ldots,N-1}}\left\{c'_Nx_N+\sum_{k=0}^{N-1}\left(c'_kx_k+g_k\left(\mu_k(x_k)\right)\right)\right\} where c_k are given vectors and g_k are given functions. Show that if the optimal cost for this problem is finite and the control constraint sets U_k(x_k) are independent of x_k, then the cost-to-go functions of the DP algorithm are affine (linear plus constant). Assuming that there is at least one optimal policy, show that there exists an optimal policy that consists of constant functions \mu_k^*; that is, \mu_k^*(x_k)=\text{constant} for all x_k\in\mathfrak{R}^n.

1.14 A farmer annually producing x_k units of a certain crop stores (1-u_k)x_k units of his production, where 0\leq u_k\leq 1, and invests the remaining u_kx_k units, thus increasing the next year’s production to a level x_{k+1} given by x_{k+1}=x_k+w_ku_kx_k,\qquad k=0,1,\cdots,N-1. The scalars w_k are independent random variables with identical probability distributions that do not depend either on x_k or u_k. Furthermore, \mathbf{E}\{w_k\}=\overline{w}>0. The problem is to find the optimal investment policy that maximizes the total expected product stored over N years, \mathop\mathbf{E}\limits_{\mathop{w_k}\limits_{k=0,1,\ldots,N-1}}\left\{x_N+\sum_{k=0}^{N-1}(1-u_k)x_k\right\}. Show the optimality of the following policy that consists of constant functions:
(1) If \overline{w}>1, \mu_0^*(x_0)=\cdots=\mu_{N-1}^*(x_{N-1})=1.
(2) If 0<\overline{w}<1/N, \mu_0^*(x_0)=\cdots=\mu_{N-1}^*(x_{N-1})=0.
(3) If 1/N\leq\overline{w}\leq 1, \mu_0^*(x_0)=\cdots=\mu_{N-\bar{k}-1}^*(x_{N-\bar{k}-1})=1\mu_{N-\bar{k}}^*(x_{N-\bar{k}})=\cdots=\mu_{N-1}^*(x_{N-1})=0 where \bar{k} is such that 1/{(\bar{k}+1)}<\overline{w}\leq 1/{\bar{k}}.
Solution. The DP algorithm takes the form
J_N(x_N)=x_N,
J_k(x_k)=\mathop{\max}\limits_{u_k}\mathop{\mathbf E}\limits_{w_k}\left\{(1-u_k)x_k+J_{k+1}(x_k+w_ku_kx_k)\right\}\,,\;\; k=0,1,\cdots,N-1.
(1) In the case \overline{w}>1,
J_{N-1}(x_{N-1})=\mathop{\max}\limits_{u_{N-1}}\mathop{\mathbf E}\limits_{w_{N-1}}\left\{(1-u_{N-1})x_{N-1}+x_{N-1}+w_{N-1}u_{N-1}x_{N-1}\right\}
=\mathop{\max}\limits_{u_{N-1}}\mathop{\mathbf E}\limits_{w_{N-1}}\left\{2x_{N-1}+(w_{N-1}-1)u_{N-1}x_{N-1}\right\}
=\mathop{\max}\limits_{u_{N-1}}\left\{2x_{N-1}+(\overline{w}-1)u_{N-1}x_{N-1}\right\},
so \mu_{N-1}^*(x_{N-1})=1 and J_{N-1}(x_{N-1})=(1+\overline{w})x_{N-1}. By induction (if J_{k+1}(x)=(1+\overline{w})^{N-k-1}x, the coefficient of u_k is \big((1+\overline{w})^{N-k-1}\overline{w}-1\big)x_k>0, so investing everything remains optimal), we get
\mu_k^*(x_k)=1\;,\;J_k(x_k)=(1+\overline{w})^{N-k}x_k\qquad\text{for all }\;k=0,1,\cdots,N-1.
(2) In the case \overline{w}<1/N,
J_{N-1}(x_{N-1})=\mathop{\max}\limits_{u_{N-1}}\left\{2x_{N-1}+(\overline{w}-1)u_{N-1}x_{N-1}\right\}=2x_{N-1} with \mu_{N-1}^*(x_{N-1})=0. And then
J_{N-2}(x_{N-2})=\mathop{\max}\limits_{u_{N-2}}\mathop{\mathbf E}\limits_{w_{N-2}}\left\{(1-u_{N-2})x_{N-2}+2[x_{N-2}+w_{N-2}u_{N-2}x_{N-2}]\right\}=\mathop{\max}\limits_{u_{N-2}}\left\{3x_{N-2}+(2\overline{w}-1)u_{N-2}x_{N-2}\right\}=3x_{N-2} with \mu_{N-2}^*(x_{N-2})=0. Since (N-k)\overline{w}\leq N\overline{w}<1 for every k, we can prove by induction that for all k=0,1,\cdots,N-1, \mu_k^*(x_k)=0\;,\;J_k(x_k)=(N-k+1)x_k. (3) In the case 1/N\leq\overline{w}\leq 1, there is some \bar{k} satisfying 1/{(\bar{k}+1)}<\overline{w}\leq 1/{\bar{k}}, and the induction argument of (2) remains valid only for the last \bar{k} stages.
For any k\ge N-\bar{k}, we have J_k(x_k)=\mathop{\max}\limits_{u_k}\left\{(N-k+1)x_k+((N-k)\overline{w}-1)u_kx_k\right\}Since (N-k)\overline{w}\leq\bar{k}\overline{w}\leq 1, then \mu_k^*(x_k)=0 and J_k(x_k)=(N-k+1)x_k.
Next for k= N-\bar{k}-1
J_{N-\bar{k}-1}(x_{N-\bar{k}-1})=\mathop{\max}\limits_{u_{N-\bar{k}-1}}\left\{(\bar{k}+2)x_{N-\bar{k}-1}+((\bar{k}+1)\overline{w}-1)u_{N-\bar{k}-1}x_{N-\bar{k}-1}\right\} In this case, (\bar{k}+1)\overline{w}-1>0, yielding \mu_{N-\bar{k}-1}^*(x_{N-\bar{k}-1})=1 and J_{N-\bar{k}-1}(x_{N-\bar{k}-1})=[(\bar{k}+1)\overline{w}+\bar{k}+1]x_{N-\bar{k}-1}.
And then by induction, we can prove that for all k<N-\bar{k}-1, \mu_k^*(x_k)=1.\qquad\qquad\qquad\qquad\Box
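Because the cost-to-go is linear, J_k(x)=c_kx, the whole solution reduces to the scalar recursion c_N=1, c_k=\max\big(1+c_{k+1},\,(1+\overline{w})c_{k+1}\big) (storing vs. investing; the maximum over u_k\in[0,1] of a linear function is attained at an endpoint). A minimal sketch (the function name `policy` is ours) checks the three cases numerically:

```python
# Scalar recursion for J_k(x) = c_k x: c_N = 1 and
# c_k = max(1 + c_{k+1}, (1 + wbar) * c_{k+1})  (u_k = 0 vs. u_k = 1).
def policy(wbar, N):
    c, us = 1.0, []
    for _ in range(N):
        store, invest = 1 + c, (1 + wbar) * c
        us.insert(0, 1 if invest > store else 0)   # ties resolved by storing
        c = max(store, invest)
    return us

assert policy(1.5, 4) == [1, 1, 1, 1]            # wbar > 1: always invest
assert policy(0.1, 5) == [0, 0, 0, 0, 0]         # wbar < 1/N: always store
assert policy(0.4, 5) == [1, 1, 1, 0, 0]         # 1/3 < 0.4 <= 1/2: kbar = 2
```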

1.15 Let x_k denote the number of educators in a certain country at time k and let y_k denote the number of research scientists at time k. New scientists (potential educators or research scientists) are produced during the kth period by educators at a rate \gamma_k per educator, while educators and research scientists leave the field due to death, retirement, and transfer at a rate \delta_k. The scalars \gamma_k, k = 0,1,\ldots,N-1, are independent identically distributed random variables taking values within a closed and bounded interval of positive numbers. Similarly \delta_k, k = 0,1,\ldots,N-1, are independent identically distributed and take values in an interval [\delta,\delta'] with 0<\delta\leq\delta'<1. By means of incentives, a science policy maker can determine the proportion u_k of new scientists produced at time k who become educators. Thus, the number of research scientists and educators evolves according to the equations
x_{k+1}=(1-\delta_k)x_k+u_k\gamma_k x_k,
y_{k+1}=(1-\delta_k)y_k+(1-u_k)\gamma_k x_k.
The initial numbers x_0,y_0 are known, and it is required to find a policy \{\mu_0^*(x_0,y_0),\cdots,\mu_{N-1}^*(x_{N-1},y_{N-1})\} with 0<\alpha\leq \mu_k^*(x_k,y_k)\leq\beta<1,\qquad\text{for all }x_k,y_k\text{ and }k, which maximizes \mathbf{E}_{\gamma_k,\delta_k}\{y_N\} (i.e. the expected final number of research scientists after N periods). The scalars \alpha and \beta are given.
(1) Show that the cost-to-go functions J_k(x_k,y_k) are linear; that is, for some scalars \xi_k,\zeta_k, J_k(x_k,y_k)=\xi_kx_k+\zeta_ky_k (2) Derive an optimal policy \{\mu_0^*,\cdots,\mu_{N-1}^*\}, under the assumption \mathbf{E}\{\gamma_k\}>\mathbf{E}\{\delta_k\} and show that this optimal policy can consist of constant functions.
(3) Assume that the proportion of new scientists who become educators at time k is u_k+\epsilon_k (rather than u_k), where \epsilon_k are identically distributed independent random variables that are also independent of \gamma_k,\delta_k and take values in the interval [-\alpha,1-\beta]. Derive the form of the cost-to-go functions and the optimal policy.
