0 votes

You are traveling on a straight road, but have a jumpy car.  The car sometimes "jumps" (moves double).  At other times, it doesn't move at all.  The following MDP has been created to model this behavior and the landscape.

Estimate the V* values (optimal values for the states) for this MDP:

S1S2S3S4S5S6S7S8S9
sqrt(3)100sqrt(3)

MDP is defined as follows: There are two actions L (Left) and R (Right).  When moving left, there is a 40% chance moving left, 50% chance of moving DOUBLE (2 spots left), and 10% chance of not moving at all.  Similarly, when moving right, there is a 40% chance moving right, 50% chance of moving DOUBLE (2 spots right), and 10% chance of not moving at all.

S1, S5 and S9 are terminal states with values sqrt(3), 100 and sqrt(3) respectively.  

Discount factor is 0.9 (so, gamma = 0.9).

There is no living reward, that is R(s,a,s’) = 0.

in MDP by AlgoMeister (1.9k points)
edited by

13 Answers

+1 vote
Hello professor,

Terminal states: V*(S1)=sqrt(3)≈1.732, V*(S5)=100, V*(S9)=sqrt(3)≈1.732

Discount: γ=0.9, R(s,a,s')=0

 

TRANSITIONS:

Action R: 40%->s+1, 50%->s+2, 10%->stay

Action L: 40%->s-1, 50%->s-2, 10%->stay

 

SYMMETRY:

S1=S9 and both equal sqrt(3), so: V*(S4)=V*(S6)=a, V*(S3)=V*(S7)=b, V*(S2)=V*(S8)=c

Optimal action in S2,S3,S4 = R (right), since S5=100 dominates.

 

SOLVING S4: Action R: 40%->S5, 50%->S6=a, 10%->S4=a

a = 0.9[0.4(100) + 0.5a + 0.1a]

a = 36 + 0.54a

0.46a = 36

a ≈ 78.26

 

SOLVING S3: Action R: 40%->S4=a, 50%->S5, 10%->S3=b

b = 0.9[0.4(78.26) + 0.5(100) + 0.1b]

b = 73.17 + 0.09b

0.91b = 73.17

b ≈ 80.41

 

SOLVING S2: Action R: 40%->S3=b, 50%->S4=a, 10%->S2=c

c = 0.9[0.4(80.41) + 0.5(78.26) + 0.1c]

c = 64.16 + 0.09c

0.91c = 64.16

c ≈ 70.51

 

FINAL VALUES

S1=1.732 | S2=70.51 | S3=80.41 | S4=78.26 | S5=100 | S6=78.26 | S7=80.41 | S8=70.51 | S9=1.732
by (176 points)
+1 vote

Hello Sir, 
Terminal States: S1 = sqrt(3), S5 = 100, S9 = sqrt(3)
Actions = L(left) and R(right)

Transition probabilities: 40% - 1 step; 50% - 2 steps; 10% - stay

Discount factor - 0.9

Reward - 0

S4 = V4=0.9(0.4V5+0.5V6+0.1V4) = 89.4

S6 = V6=0.9(0.4V5+0.5V4+0.1V6) = 89.4

S3 = V3=0.9(0.4V4+0.5V5+0.1V3) = 77.8

S7  = V7=0.9(0.4V6+0.5V5+0.1V7) = 77.8

S2 = V2=0.9(0.4V3+0.5V4+0.1V2) = 65

S8 = V8=0.9(0.4V7+0.5V6+0.1V8) = 65

by (188 points)
+1 vote
Hello everyone,

States: S1, S2, S3, S4, S5, S6, S7, S8, S9

Terminal states: S1 = √3 = 1,732 , S5 = 100, S9 = √3 = 1.732

Actions: L (Left), R (Right)

Transition probabilities:

40% normal move (1 step)

50% double move (2 steps)

10% no move (0 steps)

gamma = 0.9

R(s,a,s') = 0

Now, according to Bellman equation, if we want to calculate S2, for instance:

Action Right :

40% → S3: 0.4 × V(S3)

50% → S4: 0.5 × V(S4)

10% → S2: 0.1 × V(S2)

V_R(S2) = 0.9 x (0.4×V(S3) + 0.5×V(S4) + 0.1×V(S2))

Action Left:

40% → S1: 0.4 × √3

50% → S0 : 0.5 × V(S2) (There is no S0, so it stays at S2)

10% → S2: 0.1 × V(S2)

V_L(S2) = 0.9 x (0.4×1.732 + 0.6×V(S2) = 0.693 + 0.6×V(S2))

Optimal choice:

V(S2) = max(V_R, V_L) = V_R, because the optimal policy goes toward S5.

Iteration

Initially: V(S2) = 0

İter 1:

V(S3) = 0.9×(0.4×0 + 0.5×100) = 45     S4+S5+S3

V(S4) = 0.9×(0.4×100 + 0) = 36            S5 + S6 + S4

V(S2) = 0.9×(0.4×45 + 0.5×36) = 32.4  S3 + S4 + S2  

After a few iterations, we get the answer, but all are just approximations:

V(S1) = 1.732* (√3)

V(S2) ≈ 81.0*

V(S3) ≈ 90.0*

V(S4) ≈ 94.5*

V(S5) = 100.0*

V(S6) ≈ 94.5*

V(S7) ≈ 90.0*

V(S8) ≈ 81.0*

V(S9) = 1.732* (√3)

Symmetric structure: S2≈S8, S3≈S7, S4≈S6
by (200 points)
0 votes
The optimal policy is like this:
S2, S3, S4: choose R
S6, S7, S8: choose L

By symmetry;

V(S6)=V(S4)
V(S7)=V(S3)
V(S8)=V(S2)

and gamma = .9

So, the Bellman equations are:

V(S2)=0.9(0.1V(S2)+0.4V(S3)+0.5V(S4))
V(S3)=0.9(0.1V(S3)+0.4V(S4)+0.5*100)
V(S4)=0.9(0.1V(S4)+0.4*100+0.5V(S4))

To solve it easily replacing it with x, y and z:

x=S2=S8; y=S3=S7; z=S4=S6

So,
x = .9 (0.4y + 0.5z + 0.1x)
y = .9(0.4z + 50 + 0.1y)
z = .9(40 + 0.5z + 0.1z)

First, we solve z:
z = .9(40 + 0.6z)
z = 36 + 0.54z
z = 36/0.46 ≈ 78.26

Meaning S4 and S6 are approximately 78.26
------------------------------------------------------------------
Then, y:
y = .9(0.4*78.26 + 50 + 0.1y)
0.91y = 73.18
y ≈ 80.42

Meaning S3 and S7 are approximately 80.42
------------------------------------------------------------------
Lastly, for x:
x = .9(0.4*80.42 + 0.5*78.26 + 0.1x)
x = 0.9(71.3 +0,1x)
0.91x = 64.17
x ≈ 70.52

Meaning S2 and S8 are approximately 70.52
------------------------------------------------------------------

Result:

V(S2)=V(S8) ≈ 70.52
V(S3)=V(S7) ≈ 80.42
V(S4)=V(S6) ≈ 78.26
by (236 points)
0 votes
Hello!

Gamma = 0.9

By symmetry:
V(S2)=V(S8)
V(S3)=V(S7)
V(S4)=V(S6)


# V(S4) and V(S6)
From S4:
0.4 to S5​
0.5 to S6​
0.1 for S4​

V(S4)​=0.9(0.4⋅100+0.5V(S6)​+0.1V(S4)​)
Since V(S4)=V(S6):
V(S4)​=0.9(0.4⋅100+0.5V(S4)​+0.1V(S4)​)
V(S4)​=0.9(40+0.6V(S4)​)
V(S4)=36+0.54V(S4))
V(S4)≈78.26
Therefore, V(S6)≈78.26

# V(S3) and V(S7)
From S3:
0.4 to S4​
0.5 to S5​
0.1 for S3​

V(S3)​=0.9(0.4V(S4)​+0.5⋅100+0.1V(S3)​)
Since V(S4)≈78.26:
V(S3)​=0.9(0.4⋅78.26​+0.5⋅100+0.1V(S3)​)
V(S3)​=73.1736+0.09V(S3)
V(S3)≈80.41​
Therefore, V(S7)≈80.41​

# V(S2) and V(S8)
From S2:
0.4 to S3​
0.5 to S4
0.1 for S2

V(S2)​=0.9(0.4V(S3)​+0.5V(S4)​+0.1V(S2)​)
Since V(S3)≈80.41, V(S4)≈78.26:
V(S2)​=0.9(0.4⋅80.41+0.5⋅78.26+0.1V(S2)​)
V(S2)​=0.9(32.164+39.130+0.1V(S2)​)
V(S2)=64.1646+0.09V(S2)
V(S2)≈70.51
​Therefore, V(S8)≈70.51

# Final V* values
Terminal states
V*(S1)=sqrt(3)≈1.732
V*(S5)=100
V*(S9​)=sqrt(3)​≈1.732
Non-terminal states
V*(S2)≈70.51
V*(S3)≈80.41
V*(S4)≈78.26
V*(S6)≈78.26
V*(S7)≈80.41
V*(S8)≈70.51
by (188 points)
0 votes

Hello Professor, 

Optimal policy for all states here is to move toward S5 as its terminal value is highest. 

S2, S3, S4 - go right 

S6, S7, S8- go left 

V(S2)=0.9(0.1V(S2)+0.4V(S3)+0.5V(S4))

V(S3)=0.9(0.1V(S3)+0.4V(S4)+0.5⋅100)

V(S4)=0.9(0.1V(S4)+0.5V(S6)+0.4⋅100)

V(S6)=0.9(0.1V(S6)+0.5V(S4)+0.4⋅100)

V(S7)=0.9(0.1V(S7)+0.4V(S6)+0.5⋅100)

V(S8)=0.9(0.1V(S8)+0.4V(S7)+0.5V(S6))

By symmetry, V(S2) = V(S8)=X, V(S3)=V(S7)=Y, V(S4)=V(S6)=Z

Solving for X,Y,Z:

X=0.9(0.1X+0.4Y+0.5Z)

0.91X=0.36Y+0.45Z

Y=0.09Y+0.36Z+45

0.91Y=0.36Z+45

Z=0.9(0.6Z+40) 

Z= 78.26

Substitute into 

0.91Y=0.36Z+45

y = 80.41

Substitute 

0.91X=0.36Y+0.45Z

x = 70.51

As a result, V2= 70.51, V3 = 80.41, V4 = 78.26, V6, = 78.26, V7 = 80.41, V8= 70.51

by (188 points)
0 votes

At each state, we can move either to the left or to the right. Since the reward at S5 is 100, it will be more optimal for all states to the left of it to move towards S5, given that the reward sqrt(3) at S1, located on the opposite side, is significantly smaller. We also consider the symmetry in the state space and find that S6 = S4, S7 = S3, and S8 = S2. Hence, we can solve the following system of equations to find the values for all 6 unknown states:

  • 10*S2 = 9*0.4*S3 + 9*0.5*S4 + 9*0.1*S2
  • 10*S3 = 9*0.4*S4 + 9*0.5*100 + 9*0.1*S3
  • 10*S4 = 9*0.4*100 + 9*0.5*S4 + 9*0.1*S4

Solution:

  • 10*S4 = 360 + 5.4*S4 => S4 = 360/4.6 = 78.26
  • 10*S3 = 3.6*78.26 + 4.5*100 + 0.9*S3 => S3 = (3.6*78.26 + 450)/9.1 = 80.41
  • 10*S2 = 3.6*80.41 + 4.5*78.26 + 0.9*S2 => (3.6*80.41+4.5*78.26)/9.1 = 70.51

Results:

  • S2 = S8 = 70.51
  • S3 = S7 = 80.41
  • S4 = S6 = 78.26

by (176 points)
0 votes

Hi,

Since S5 has the highest terminal value, the optimal policy of all non-terminal states is to move toward it. 

So the optimal actions will be:

S2, S3, S4 - right

S6, S7, S8 - left

Using the Bellman equations:

V(S2)=0.9x(0.1V(S2)+0.4V(S3)+0.5V(S4))

V(S3)=0.9x(0.1V(S3)+0.4V(S4)+0.5x100)

V(S4)=0.9x(0.1V(S4)+0.5V(S6)+0.4x100)

V(S6)=0.9x(0.1V(S6)+0.5V(S4)+0.4x100)

V(S7)=0.9x(0.1V(S7)+0.4V(S6)+0.5x100)

V(S8)=0.9x(0.1V(S8)+0.4V(S7)+0.5V(S6))

By symmetry,

V(S2)=V(S8)=X, V(S3)=V(S7)=Y, V(S4)=V(S6)=Z

Thus, we will get these equations:

X=0.9(0.1X+0.4Y+0.5Z)

0.91X=0.36Y+0.45Z

Y=0.9(0.1Y+0.4Z+50)

0.91Y=0.36Z+45

Z=0.9(0.6Z+40) - > it gives -> Z=78.26

Then after substitution,   0.91Y=0.36Z+45 gives -> Y=80.41

Then -> 0.91X=0.36Y+0.45Z gives -> X=70.51

So for the state values we will get:

V(S2)=V(S8)=70.51

V(S3)=V(S7)=80.41

V(S4)=V(S6)=78.26

Final results:

V(S2)=70.51

V(S3)=80.41

V(S4)=78.26

V(S6)=78.26

V(S7)=80.41

V(S8)=70.51

by (188 points)
0 votes

There are 3 terminal states. 

  • V(S1)=sqrt(3) = 1.732​, 
  • V(S5)=100, 
  • V(S9)=sqrt(3) = 1.732

As 100 > sqrt(3) --> the best choice is moving towards S5. So:

  • from S2, S3, S4, the optimal action is Right
  • from S6, S7, S8, the optimal action is Left

Note: For simplicity I will skip "S" letter in equations. I.e., V(S2) will become V2, V(S3) will become V3, and so on.

Also this problem is symmetric around S5, thus we will have below values same:
                      V2 = V8                        V3 = V7                    V4 = V6
 

So we need to solve equations for S2, S3, and S4

For S2: 40% to S3, 50% to S4, 10% to stay in S2, which becomes:
                          V2​=0.9(0.4*V3​+0.5*V4​+0.1*V2​) --> 0.91*V2​ = 0.36*V3​+0.45*V4​


For S3: 40% to S4, 50% to S5, 10% to stay in S3, which becomes:                                                                V3​=0.9(0.4*V4​+0.5*(100)+0.1*V3​) --> 0.91*V3 ​= 0.36*V4​+45


For S4: 40% to S5, 50% to S6, 10% to stay in S4, which becomes:                                                                              V4​=0.9*(0.4*(100)+0.5*V6​+0.1*V4​)


Recall that V(S4) = V(S6). Then we can simplify above equation by:
                         V4 = 0.9(40+0.5*V4​+0.1*V4​) --> 0.46*V4​=36 --> V4​=78.26


Now substitute value of V4 into V3 equation above:
                                     0.91*V3​=0.36*(78.26)+45  --> V3​=80.41

Then substitute values of V3 and V4 into V2 equation above:

                                0.91*V2​=0.36*(80.41)+0.45*(78.26) --> V2​=70.51

Thus, we have V2 = 70.51, V3 = 80.41, V4 = 78.26. Also, recall that by symmetry:
               V8 = V2 = 70.51,                  V7 = V3 = 80.41,               V6 = V4 = 78.26

Therefore, we found all values. Below are final values:

S1S2S3S4S5S6S7S8S9
1.73270.5180.4178.2610078.2680.4170.511.732
by (188 points)
0 votes

Hello Professor,

The board perfectly mirrors around S5. This means S4 equals S6, S3 equals S7, and S2 equals S8. Moving Right is always the best choice here. We just need to solve the left side using the 0.9 discount factor.

For S4, moving right gives a 40% chance of hitting S5 (worth 100), 50% of double jumping to S6 (which equals S4), and 10% of staying put.

V(S4) = 0.9 * [0.4 * 100 + 0.5 * V(S4) + 0.1 * V(S4)]

V(S4) = 36 + 0.54 * V(S4)

V(S4) = 78.26

For S3, moving right gives a 40% chance to hit S4, a 50% chance to land right on S5, and 10% to stay. It is actually worth more than S4 because of that big 50% chance to hit the jackpot without overshooting.

V(S3) = 0.9 * [0.4 * 78.26 + 0.5 * 100 + 0.1 * V(S3)]

V(S3) = 73.17 + 0.09 * V(S3)

V(S3) = 80.41

For S2, moving right means a 40% chance to hit S3, 50% to jump to S4, and 10% to stay.

V(S2) = 0.9 * [0.4 * 80.41 + 0.5 * 78.26 + 0.1 * V(S2)]

V(S2) = 64.16 + 0.09 * V(S2)

V(S2) = 70.51

Mirroring the values gives the final estimates:

  • S1: 1.73

  • S2: 70.51

  • S3: 80.41

  • S4: 78.26

  • S5: 100

  • S6: 78.26

  • S7: 80.41

  • S8: 70.51

  • S9: 1.73

by (236 points)

Related questions

0 votes
1 answer
asked Feb 23, 2021 in MDP by amrinderarora AlgoMeister (1.9k points)
0 votes
1 answer
asked Mar 30, 2021 in MDP by amrinderarora AlgoMeister (1.9k points)
0 votes
1 answer
0 votes
0 answers
asked May 6, 2025 in Informed Search by amrinderarora AlgoMeister (1.9k points)
0 votes
1 answer
...