Hello,
Here is my solution, based on the given problem data:
Terminal states:
V*(S1) = sqrt(3) ~= 1.732
V*(S5) = 100
V*(S9) = sqrt(3) ~= 1.732
gamma = 0.9, R(s,a,s') = 0
Actions:
L: 0.4 -> s-1, 0.5 -> s-2, 0.1 -> stay
R: 0.4 -> s+1, 0.5 -> s+2, 0.1 -> stay
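As a rough sketch, the action model above could be written as a small function. The names `transitions` and `step` are my own, and the function deliberately does not handle moves past either end of the chain, since the problem data above does not say what happens there.

```python
def transitions(s, action):
    """Return (probability, next_state) pairs for taking `action` in state s.

    Illustrative only: no boundary handling, because the problem statement
    does not specify what happens when s-2 or s+2 falls outside S1..S9.
    """
    step = -1 if action == "L" else 1  # L moves toward S1, R toward S9
    return [(0.4, s + step), (0.5, s + 2 * step), (0.1, s)]

# From S4, action R reaches S5 w.p. 0.4, S6 w.p. 0.5, and stays w.p. 0.1.
print(transitions(4, "R"))  # [(0.4, 5), (0.5, 6), (0.1, 4)]
```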
Since V*(S5) = 100 is by far the largest terminal value, the optimal policy moves toward S5:
S2, S3, S4 -> R
S6, S7, S8 -> L
By symmetry about S5 (L and R are mirror images and both end states have the same value):
V*(S4)=V*(S6)=a
V*(S3)=V*(S7)=b
V*(S2)=V*(S8)=c
S4:
a = 0.9*(0.4*100 + 0.5a + 0.1a)
a = 0.9*(40 + 0.6a)
a = 36 + 0.54a
0.46a = 36
a = 36/0.46 ~= 78.26
S3:
b = 0.9*(0.4a + 0.5*100 + 0.1b)
b = 0.9*(0.4*78.26 + 50 + 0.1b)
b = 0.9*(31.304 + 50 + 0.1b)
b = 73.1736 + 0.09b
0.91b = 73.1736
b ~= 80.41
S2:
c = 0.9*(0.4b + 0.5a + 0.1c)
c = 0.9*(0.4*80.41 + 0.5*78.26 + 0.1c)
c = 0.9*(32.164 + 39.13 + 0.1c)
c = 64.1646 + 0.09c
0.91c = 64.1646
c ~= 70.51
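To double-check the arithmetic, the three equations can be solved in closed form by moving the self-referencing terms to the left-hand side. This is only a quick sanity check under the policy assumed above (R from S2, S3, S4); the variable names are mine.

```python
gamma = 0.9
v5 = 100.0  # terminal value V*(S5)

# S4: a = gamma*(0.4*v5 + 0.5*a + 0.1*a)  =>  a*(1 - 0.6*gamma) = gamma*0.4*v5
a = gamma * 0.4 * v5 / (1 - 0.6 * gamma)

# S3: b = gamma*(0.4*a + 0.5*v5 + 0.1*b)  =>  b*(1 - 0.1*gamma) = gamma*(0.4*a + 0.5*v5)
b = gamma * (0.4 * a + 0.5 * v5) / (1 - 0.1 * gamma)

# S2: c = gamma*(0.4*b + 0.5*a + 0.1*c)   =>  c*(1 - 0.1*gamma) = gamma*(0.4*b + 0.5*a)
c = gamma * (0.4 * b + 0.5 * a) / (1 - 0.1 * gamma)

print(round(a, 2), round(b, 2), round(c, 2))  # 78.26 80.41 70.51
```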
Final values:
S1 = 1.732
S2 = 70.51
S3 = 80.41
S4 = 78.26
S5 = 100
S6 = 78.26
S7 = 80.41
S8 = 70.51
S9 = 1.732
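If you want to confirm both the policy and the values numerically, a short value-iteration sketch like the one below should do it. One assumption is mine and not part of the problem data: a move that would leave the chain (for example L from S2 with s-2) is clamped to the end state S1 or S9. The helper names `clamp` and `q_value` are also just illustrative.

```python
import math

GAMMA = 0.9
# Terminal states keep their given values; all rewards R(s,a,s') are 0.
TERMINAL = {1: math.sqrt(3), 5: 100.0, 9: math.sqrt(3)}

def clamp(s):
    # ASSUMPTION (not in the problem data): moves past either end land on S1 or S9.
    return max(1, min(9, s))

def q_value(V, s, step):
    # step = -1 for L, +1 for R
    return GAMMA * (0.4 * V[clamp(s + step)]
                    + 0.5 * V[clamp(s + 2 * step)]
                    + 0.1 * V[s])

V = {s: TERMINAL.get(s, 0.0) for s in range(1, 10)}
for _ in range(200):  # 0.9**200 is negligible, so this is more than enough sweeps
    V = {s: V[s] if s in TERMINAL else max(q_value(V, s, -1), q_value(V, s, +1))
         for s in V}

# Greedy policy with respect to the converged values.
policy = {s: "L" if q_value(V, s, -1) > q_value(V, s, +1) else "R"
          for s in V if s not in TERMINAL}

for s in range(1, 10):
    print(f"S{s} = {V[s]:.2f}", policy.get(s, "terminal"))
```

Under that boundary assumption, the greedy policy comes out as R for S2, S3, S4 and L for S6, S7, S8, and the values agree with the list above to two decimals.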