0 votes

You are traveling on a straight road, but have a jumpy car.  The car sometimes "jumps" (moves double).  At other times, it doesn't move at all.  The following MDP has been created to model this behavior and the landscape.

Estimate the V* values (optimal values for the states) for this MDP:

[ S1 ][ S2 ][ S3 ][ S4 ][ S5 ][ S6 ][ S7 ][ S8 ][ S9 ]
sqrt(3)                      100                      sqrt(3)

The MDP is defined as follows: there are two actions, L (Left) and R (Right).  When moving left, there is a 40% chance of moving one spot left, a 50% chance of moving DOUBLE (2 spots left), and a 10% chance of not moving at all.  Similarly, when moving right, there is a 40% chance of moving one spot right, a 50% chance of moving DOUBLE (2 spots right), and a 10% chance of not moving at all.

S1, S5 and S9 are terminal states with values sqrt(3), 100 and sqrt(3) respectively.  

Discount factor is 0.9 (so, gamma = 0.9).

There is no living reward, that is, R(s,a,s') = 0.

in MDP by AlgoMeister (1.9k points)

13 Answers

0 votes
Hello professor,

The optimal state values are:
    •    S1 = sqrt(3) ≈ 1.732
    •    S2 ≈ 70.511
    •    S3 ≈ 80.411
    •    S4 ≈ 78.261
    •    S5 = 100
    •    S6 ≈ 78.261
    •    S7 ≈ 80.411
    •    S8 ≈ 70.511
    •    S9 = sqrt(3) ≈ 1.732

Since there is no living reward and gamma = 0.9, each non-terminal state's value is just the discounted expected value of the successor state, assuming we choose the best action.
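Spelled out, this is the Bellman optimality equation, which simplifies because the reward term vanishes:

V*(s) = max_a sum_{s'} T(s,a,s') * [ R(s,a,s') + gamma * V*(s') ]
      = gamma * max_a sum_{s'} T(s,a,s') * V*(s')        (since R(s,a,s') = 0)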

The terminal states are fixed:
    •    S1 = sqrt(3)
    •    S5 = 100
    •    S9 = sqrt(3)

Because S5 has a much larger value than S1 and S9, the best policy is always to move toward S5:
    •    From S2, S3, S4: go Right
    •    From S6, S7, S8: go Left

Using that policy, the Bellman equations become:
    •    V(S2) = 0.9 * [0.4V(S3) + 0.5V(S4) + 0.1V(S2)]
    •    V(S3) = 0.9 * [0.4V(S4) + 0.5(100) + 0.1V(S3)]
    •    V(S4) = 0.9 * [0.4(100) + 0.5V(S6) + 0.1V(S4)]
    •    V(S6) = 0.9 * [0.4(100) + 0.5V(S4) + 0.1V(S6)]
    •    V(S7) = 0.9 * [0.4V(S6) + 0.5(100) + 0.1V(S7)]
    •    V(S8) = 0.9 * [0.4V(S7) + 0.5V(S6) + 0.1V(S8)]

By symmetry:
    •    V(S4) = V(S6)
    •    V(S3) = V(S7)
    •    V(S2) = V(S8)

Solving gives:
    •    V(S4) = V(S6) ≈ 78.261
    •    V(S3) = V(S7) ≈ 80.411
    •    V(S2) = V(S8) ≈ 70.511

So the final answer is:

S1 = 1.732
S2 = 70.511
S3 = 80.411
S4 = 78.261
S5 = 100
S6 = 78.261
S7 = 80.411
S8 = 70.511
S9 = 1.732

One interesting detail is that S3 is slightly better than S4. From S3, going right gives a 50% chance of jumping directly into S5, which is worth 100; from S4, the 50% double move overshoots S5 and lands in S6 instead.
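As a sanity check, these values can be reproduced with a short value-iteration script. This is only a sketch: the problem does not say what happens when a double move would overshoot the ends of the road, so the code assumes such moves are clamped to the nearest end state.

```python
# Value iteration for the 9-state "jumpy car" MDP (a sanity-check sketch).
# Boundary assumption (not stated in the problem): a double move that would
# overshoot the road is clamped to the nearest end state (S1 or S9).
GAMMA = 0.9
TERMINAL = {1: 3 ** 0.5, 5: 100.0, 9: 3 ** 0.5}

def transitions(s, a):
    """(probability, next state) pairs for action a = -1 (Left) or +1 (Right)."""
    clamp = lambda x: max(1, min(9, x))
    return [(0.4, clamp(s + a)),       # normal single move
            (0.5, clamp(s + 2 * a)),   # "jump": double move
            (0.1, s)]                  # car does not move at all

V = {s: TERMINAL.get(s, 0.0) for s in range(1, 10)}
for _ in range(1000):  # far more sweeps than needed for gamma = 0.9
    V = {s: TERMINAL[s] if s in TERMINAL
            else max(GAMMA * sum(p * V[t] for p, t in transitions(s, a))
                     for a in (-1, +1))
         for s in range(1, 10)}

for s in range(1, 10):
    print(f"V*(S{s}) = {V[s]:.3f}")
```

This prints V*(S2) = 70.511, V*(S3) = 80.411, V*(S4) = 78.261 (and their mirror images on the right side), matching the hand computation above.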
by (232 points)
0 votes
Hello,

The terminal states are fixed:

V*(S1) = sqrt(3) ≈ 1.732
V*(S5) = 100
V*(S9) = sqrt(3) ≈ 1.732

The discount factor is:

gamma = 0.9

There is no living reward:

R(s,a,s') = 0

Transition model:

Action Right (R):

40% chance of moving one state right
50% chance of moving two states right
10% chance of staying in the same state

Action Left (L):

40% chance of moving one state left
50% chance of moving two states left
10% chance of staying in the same state

Because the reward at S5 = 100 is much larger than the terminal rewards at S1 and S9, the optimal action on the left side is to move right. By symmetry, the right side mirrors the left side.

Using symmetry:

V*(S4) = V*(S6) = a
V*(S3) = V*(S7) = b
V*(S2) = V*(S8) = c

Step 1: Solve for S4

From S4, choosing Right gives:

40% -> S5
50% -> S6 = a
10% -> stay at S4 = a

Bellman equation:

a = 0.9[0.4(100) + 0.5a + 0.1a]
a = 0.9[40 + 0.6a]
a = 36 + 0.54a
0.46a = 36
a ≈ 78.26

Step 2: Solve for S3

From S3, choosing Right gives:
40% -> S4 = a
50% -> S5 = 100
10% -> stay at S3 = b

b = 0.9[0.4(78.26) + 0.5(100) + 0.1b]
b = 0.9[31.30 + 50 + 0.1b]
b = 73.17 + 0.09b
0.91b = 73.17
b ≈ 80.41

Step 3: Solve for S2
From S2, choosing Right gives:

40% -> S3 = b
50% -> S4 = a
10% -> stay at S2 = c


c = 0.9[0.4(80.41) + 0.5(78.26) + 0.1c]
c = 0.9[32.16 + 39.13 + 0.1c]
c = 64.16 + 0.09c
0.91c = 64.16
c ≈ 70.51

Final optimal values:
V*(S1) = 1.732
V*(S2) = 70.51
V*(S3) = 80.41
V*(S4) = 78.26
V*(S5) = 100
V*(S6) = 78.26
V*(S7) = 80.41
V*(S8) = 70.51
V*(S9) = 1.732

Optimal policy:
S2, S3, S4: choose Right
S6, S7, S8: choose Left
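Because a depends on no other unknown, b only on a, and c only on a and b, the three equations above can be checked by direct back-substitution, e.g. with a quick Python sketch:

```python
# Back-substitution for the symmetric unknowns a = V*(S4), b = V*(S3), c = V*(S2).
# Each equation has the form x = gamma * (expected next value + 0.1x),
# so we move the x terms to the left and divide.
gamma = 0.9

# From S4: 0.4 -> S5 (100), 0.5 -> S6 (= a by symmetry), 0.1 -> stay (a)
a = gamma * (0.4 * 100) / (1 - gamma * 0.1 - gamma * 0.5)
b = gamma * (0.4 * a + 0.5 * 100) / (1 - gamma * 0.1)
c = gamma * (0.4 * b + 0.5 * a) / (1 - gamma * 0.1)

print(round(a, 2), round(b, 2), round(c, 2))  # 78.26 80.41 70.51
```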
by (164 points)
0 votes

Hello,

Here is my solution based on the given information:

Terminal states:

V*(S1) = sqrt(3) ~= 1.732
V*(S5) = 100
V*(S9) = sqrt(3) ~= 1.732

gamma = 0.9, R(s,a,s') = 0

Actions:

L: 0.4 -> s-1, 0.5 -> s-2, 0.1 -> stay
R: 0.4 -> s+1, 0.5 -> s+2, 0.1 -> stay

Since S5 = 100 is the largest terminal value, the optimal policy is:

S2, S3, S4 -> R
S6, S7, S8 -> L

By symmetry:

V*(S4) = V*(S6) = a
V*(S3) = V*(S7) = b
V*(S2) = V*(S8) = c

S4:

a = 0.9*(0.4(100) + 0.5a + 0.1a)
a = 0.9*(40 + 0.6a)
a = 36 + 0.54a
0.46a = 36
a = 36/0.46 ~= 78.26

S3:

b = 0.9*(0.4a + 0.5(100) + 0.1b)
b = 0.9*(0.4(78.26) + 50 + 0.1b)
b = 0.9*(31.304 + 50 + 0.1b)
b = 73.1736 + 0.09b
0.91b = 73.1736
b ~= 80.41

S2:

c = 0.9*(0.4b + 0.5a + 0.1c)
c = 0.9*(0.4(80.41) + 0.5(78.26) + 0.1c)
c = 0.9*(32.164 + 39.13 + 0.1c)
c = 64.1646 + 0.09c
0.91c = 64.1646
c ~= 70.51

Final values:

S1 = 1.732
S2 = 70.51
S3 = 80.41
S4 = 78.26
S5 = 100
S6 = 78.26
S7 = 80.41
S8 = 70.51
S9 = 1.732

by (176 points)
