Suppose you are given a '+'-shaped MDP with five states and a discount factor (gamma) of 1:

Given MDP

_ A _
B C D
_ E _

The input policy π is as follows:

A -> Terminal
B -> C
C -> D
D -> Terminal
E -> C

However, suppose these are the observed training episodes, where each line reads: state, action, next state, reward.

Episode 1:
B, east, C, -1
C, east, D, -1
D, exit, x, +10

Episode 2:
B, east, C, -1
C, east, D, -1
D, exit, x, +10

Episode 3:
E, north, C, -1
C, east, D, -1
D, exit, x, +10

Episode 4:
E, north, C, -1
C, north, A, -1
A, exit, x, -10

What state values are estimated from these episodes?
in MDP by AlgoMeister (568 points)

1 Answer

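Each state's value is estimated from the paths that continue from that state. For every episode that visits the state, sum the rewards from that state to the end of the episode (gamma is 1, so nothing is discounted), then average those sums over all such episodes.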

For A:
Episode 4: A -> x (sum = -10)
A = average( path sums ) = -10

For B:
Episode 1: B -> C -> D -> x (sum = +8)
Episode 2: B -> C -> D -> x (sum = +8)
B = average( path sums ) = (8 + 8)/2 = +8

For C:
Episode 1: C -> D -> x (sum = +9)
Episode 2: C -> D -> x (sum = +9)
Episode 3: C -> D -> x (sum = +9)
Episode 4: C -> A -> x (sum = -11)
C = average( path sums ) = (9 + 9 + 9 - 11)/4 = 4

For D:
Episode 1: D -> x (sum = +10)
Episode 2: D -> x (sum = +10)
Episode 3: D -> x (sum = +10)
D = average( path sums ) = +10

For E:
Episode 3: E -> C -> D -> x (sum = +8)
Episode 4: E -> C -> A -> x (sum = -12)
E = average( path sums ) = (8 - 12)/2 = -2
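
The averages above can be double-checked with a short script. This is only a minimal sketch, assuming the four episodes exactly as listed in the question and gamma = 1. The names EPISODES and direct_evaluation are made up for this illustration and do not come from any library.

```python
from collections import defaultdict

# Episodes copied from the question.
# Each step is (state, action, next_state, reward); "x" marks the end.
EPISODES = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("E", "north", "C", -1), ("C", "north", "A", -1), ("A", "exit", "x", -10)],
]


def direct_evaluation(episodes, gamma=1.0):
    """Average the observed return from each state over all visits.

    Hypothetical helper for this example: every visit to a state
    contributes one path sum (no state repeats within an episode here).
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g is the (discounted) reward sum from this
        # step to the end of the episode.
        for state, _action, _next_state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in sorted(returns.items())}


values = direct_evaluation(EPISODES)
print(values)
# {'A': -10.0, 'B': 8.0, 'C': 4.0, 'D': 10.0, 'E': -2.0}
```

Walking each episode backwards means the path sum for every state comes out of a single pass, instead of re-adding the tail of the episode for each state.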
by AlgoMeister (568 points)
