0 votes

You are traveling on a straight road, but have a jumpy car.  The car sometimes "jumps" (moves double).  At other times, it doesn't move at all.  The following MDP has been created to model this behavior and the landscape.

Estimate the V* values (optimal values for the states) for this MDP:

[ S1 ][ S2 ][ S3 ][ S4 ][ S5 ][ S6 ][ S7 ][ S8 ][ S9 ]
sqrt(3)                      100                      sqrt(3)

The MDP is defined as follows: there are two actions, L (Left) and R (Right).  When moving left, there is a 40% chance of moving one spot left, a 50% chance of moving DOUBLE (2 spots left), and a 10% chance of not moving at all.  Similarly, when moving right, there is a 40% chance of moving one spot right, a 50% chance of moving DOUBLE (2 spots right), and a 10% chance of not moving at all.

S1, S5 and S9 are terminal states with values sqrt(3), 100 and sqrt(3) respectively.  

Discount factor is 0.9 (so, gamma = 0.9).

There is no living reward, that is, R(s,a,s') = 0.

in MDP by AlgoMeister (1.9k points)

13 Answers

0 votes
Hello professor,

The optimal state values are:
    •    S1 = sqrt(3) ≈ 1.732
    •    S2 ≈ 70.511
    •    S3 ≈ 80.411
    •    S4 ≈ 78.261
    •    S5 = 100
    •    S6 ≈ 78.261
    •    S7 ≈ 80.411
    •    S8 ≈ 70.511
    •    S9 = sqrt(3) ≈ 1.732

Since there is no living reward and gamma = 0.9, each non-terminal state's value is just the discounted expected value of the successor state, assuming we choose the best action.
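Spelled out, this is the Bellman optimality equation, which simplifies because the reward term vanishes:

V*(s) = max_a sum_{s'} T(s,a,s') * [ R(s,a,s') + gamma * V*(s') ]
      = gamma * max_a sum_{s'} T(s,a,s') * V*(s')        (since R(s,a,s') = 0)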

The terminal states are fixed:
    •    S1 = sqrt(3)
    •    S5 = 100
    •    S9 = sqrt(3)

Because S5 has a much larger value than S1 and S9, the best policy is always to move toward S5:
    •    From S2, S3, S4: go Right
    •    From S6, S7, S8: go Left

Using that policy, the Bellman equations become:
    •    V(S2) = 0.9 * [0.4V(S3) + 0.5V(S4) + 0.1V(S2)]
    •    V(S3) = 0.9 * [0.4V(S4) + 0.5(100) + 0.1V(S3)]
    •    V(S4) = 0.9 * [0.4(100) + 0.5V(S6) + 0.1V(S4)]
    •    V(S6) = 0.9 * [0.4(100) + 0.5V(S4) + 0.1V(S6)]
    •    V(S7) = 0.9 * [0.4V(S6) + 0.5(100) + 0.1V(S7)]
    •    V(S8) = 0.9 * [0.4V(S7) + 0.5V(S6) + 0.1V(S8)]

By symmetry:
    •    V(S4) = V(S6)
    •    V(S3) = V(S7)
    •    V(S2) = V(S8)

Solving gives:
    •    V(S4) = V(S6) ≈ 78.261
    •    V(S3) = V(S7) ≈ 80.411
    •    V(S2) = V(S8) ≈ 70.511

So the final answer is:

S1 = 1.732
S2 = 70.511
S3 = 80.411
S4 = 78.261
S5 = 100
S6 = 78.261
S7 = 80.411
S8 = 70.511
S9 = 1.732

One interesting detail is that S3 is slightly better than S4. From S3, going right gives a 50% chance of jumping directly into S5, which is worth 100; from S4, the 50% double move overshoots S5 and lands in S6 instead.
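As a sanity check, these values can be reproduced with a short value-iteration script. This is only a sketch: the problem does not say what happens when a double move would overshoot the ends of the road, so the code assumes such moves are clamped to the nearest end state.

```python
# Value iteration for the 9-state "jumpy car" MDP (a sanity-check sketch).
# Boundary assumption (not stated in the problem): a double move that would
# overshoot the road is clamped to the nearest end state (S1 or S9).
GAMMA = 0.9
TERMINAL = {1: 3 ** 0.5, 5: 100.0, 9: 3 ** 0.5}

def transitions(s, a):
    """(probability, next state) pairs for action a = -1 (Left) or +1 (Right)."""
    clamp = lambda x: max(1, min(9, x))
    return [(0.4, clamp(s + a)),       # normal single move
            (0.5, clamp(s + 2 * a)),   # "jump": double move
            (0.1, s)]                  # car does not move at all

V = {s: TERMINAL.get(s, 0.0) for s in range(1, 10)}
for _ in range(1000):  # far more sweeps than needed for gamma = 0.9
    V = {s: TERMINAL[s] if s in TERMINAL
            else max(GAMMA * sum(p * V[t] for p, t in transitions(s, a))
                     for a in (-1, +1))
         for s in range(1, 10)}

for s in range(1, 10):
    print(f"V*(S{s}) = {V[s]:.3f}")
```

This prints V*(S2) = 70.511, V*(S3) = 80.411, V*(S4) = 78.261 (and their mirror images on the right side), matching the hand computation above.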
by (232 points)
0 votes
Hello,

The terminal states are fixed:

V*(S1) = sqrt(3) ≈ 1.732
V*(S5) = 100
V*(S9) = sqrt(3) ≈ 1.732

The discount factor is:

gamma = 0.9

There is no living reward:

R(s,a,s') = 0

Transition model:

Action Right (R):

40% chance of moving one state right
50% chance of moving two states right
10% chance of staying in the same state

Action Left (L):

40% chance of moving one state left
50% chance of moving two states left
10% chance of staying in the same state

Because the reward at S5 = 100 is much larger than the terminal rewards at S1 and S9, the optimal action on the left side is to move right. By symmetry, the right side mirrors the left side.

Using symmetry:

V*(S4) = V*(S6) = a
V*(S3) = V*(S7) = b
V*(S2) = V*(S8) = c

Step 1: Solve for S4

From S4, choosing Right gives:

40% -> S5
50% -> S6 = a
10% -> stay at S4 = a

Bellman equation:

a = 0.9[0.4(100) + 0.5a + 0.1a]
a = 0.9[40 + 0.6a]
a = 36 + 0.54a
0.46a = 36
a ≈ 78.26

Step 2: Solve for S3

From S3, choosing Right gives:
40% -> S4 = a
50% -> S5 = 100
10% -> stay at S3 = b

b = 0.9[0.4(78.26) + 0.5(100) + 0.1b]
b = 0.9[31.30 + 50 + 0.1b]
b = 73.17 + 0.09b
0.91b = 73.17
b ≈ 80.41

Step 3: Solve for S2
From S2, choosing Right gives:

40% -> S3 = b
50% -> S4 = a
10% -> stay at S2 = c


c = 0.9[0.4(80.41) + 0.5(78.26) + 0.1c]
c = 0.9[32.16 + 39.13 + 0.1c]
c = 64.16 + 0.09c
0.91c = 64.16
c ≈ 70.51

Final optimal values:
V*(S1) = 1.732
V*(S2) = 70.51
V*(S3) = 80.41
V*(S4) = 78.26
V*(S5) = 100
V*(S6) = 78.26
V*(S7) = 80.41
V*(S8) = 70.51
V*(S9) = 1.732

Optimal policy:
S2, S3, S4: choose Right
S6, S7, S8: choose Left
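Because a depends on no other unknown, b only on a, and c only on a and b, the three equations above can be checked by direct back-substitution, e.g. with a quick Python sketch:

```python
# Back-substitution for the symmetric unknowns a = V*(S4), b = V*(S3), c = V*(S2).
# Each equation has the form x = gamma * (expected next value + 0.1x),
# so we move the x terms to the left and divide.
gamma = 0.9

# From S4: 0.4 -> S5 (100), 0.5 -> S6 (= a by symmetry), 0.1 -> stay (a)
a = gamma * (0.4 * 100) / (1 - gamma * 0.1 - gamma * 0.5)
b = gamma * (0.4 * a + 0.5 * 100) / (1 - gamma * 0.1)
c = gamma * (0.4 * b + 0.5 * a) / (1 - gamma * 0.1)

print(round(a, 2), round(b, 2), round(c, 2))  # 78.26 80.41 70.51
```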
by (164 points)
0 votes

Hello,

Here is my solution based on the given information:

Terminal states:

V*(S1) = sqrt(3) ~= 1.732
V*(S5) = 100
V*(S9) = sqrt(3) ~= 1.732

gamma = 0.9, R(s,a,s') = 0

Actions:

L: 0.4 -> s-1, 0.5 -> s-2, 0.1 -> stay
R: 0.4 -> s+1, 0.5 -> s+2, 0.1 -> stay

Since S5 = 100 is the largest terminal value, the optimal policy is:

S2, S3, S4 -> R
S6, S7, S8 -> L

By symmetry:

V*(S4) = V*(S6) = a
V*(S3) = V*(S7) = b
V*(S2) = V*(S8) = c

S4:

a = 0.9*(0.4(100) + 0.5a + 0.1a)
a = 0.9*(40 + 0.6a)
a = 36 + 0.54a
0.46a = 36
a = 36/0.46 ~= 78.26

S3:

b = 0.9*(0.4a + 0.5(100) + 0.1b)
b = 0.9*(0.4(78.26) + 50 + 0.1b)
b = 0.9*(31.304 + 50 + 0.1b)
b = 73.1736 + 0.09b
0.91b = 73.1736
b ~= 80.41

S2:

c = 0.9*(0.4b + 0.5a + 0.1c)
c = 0.9*(0.4(80.41) + 0.5(78.26) + 0.1c)
c = 0.9*(32.164 + 39.13 + 0.1c)
c = 64.1646 + 0.09c
0.91c = 64.1646
c ~= 70.51

Final values:

S1 = 1.732
S2 = 70.51
S3 = 80.41
S4 = 78.26
S5 = 100
S6 = 78.26
S7 = 80.41
S8 = 70.51
S9 = 1.732

by (176 points)
