TD Learning model simulation

By TD learning, the agent learns from future rewards and back-propagates prediction errors by updating estimation values(keep updating beliefs of future rewards at every moment approaching the future). It is one of the core concepts of model-free reinforcement learning.


Conditioned stimulus: $ u \in \{ 0,1 \} $.

Unconditioned stimulus: $ r\in \{ 0,1 \} $.

The linear filter used to build convolutionary kernel: $ w\in \mathbb{R} $.

With learning, the prediction error at reward onset time will gradually approach zero, meaning there will be less and eventually no more prediction error(surprise) when obtaining the reward.

The prediction error will backpropagate to the cue onset time. The prediction error(surprise) occurs once the animal hears the cue.

Learning process

Similar to the Rescorla-Wagner rule - designed to minimise $\frac{1}{2} \delta^2$, adjust the TD learning rule:

$$V_t = \sum_{\tau=0}^t w(\tau)u(t-\tau)$$

$$V_{t+1} = \sum_{\tau=0}^{t+1} w(\tau)u(t+1-\tau)$$

$$ \delta_t = r_t + V_{t+1} - V_t $$

$$ w(\tau) \leftarrow w(\tau) + \alpha\delta_tu(t-\tau)$$

$\alpha$ is the learning rate. $\delta_t$ is the prediction error at time $t$.

A demo

Assume that the cue onset at time 2($u_2=1$ and $u_{others}=0$), and reward onset at time 5($r_5=1$ and $r_{others}=0$). Set learning rate $\alpha$ as $0.5$.

In the first trial, update prediction error $\delta_t$ at time $t=5$, when the reward is obtained. $$\delta_5 = r_5 + V_6 - V_5 = 1+0-0=1$$

Then update parameter

$$w_3 \leftarrow w_3 + \alpha\delta_5 u_2 = 0+0.5\times1\times1=0.5$$

which is equvalent to update value

$$V_5 \leftarrow V_5 + \alpha\delta_5 = 0+0.5\times1=0.5$$

Since $$V_5 = w_0u_5 + w_1u_4 + w_2u_3 + w_3u_2 + w_4u_1 + w_5u_0 = w_3 u_2 $$

$$dV_5/dw_3 = u_2$$

In the second trial,

at time $t=4$, update

$$\delta_4 = r_4 + V_5 - V_4 = 0+0.5-0=0.5$$

$$w_2 \leftarrow w_2 + \alpha\delta_4 u_2 = 0+0.5\times0.5\times1 = 0.25$$

at time $t=5$, update

$$\delta_5 = r_5 + V_6 - V_5 = 1+0-0.5=0.5$$

$$w_3 \leftarrow w_3 + \alpha\delta_5 u_2 = 0.5+0.5\times0.5\times1 = 0.75$$

In the third trial, update $w_3, w_2, w_1$. In the fourth and following trials, update $w_3, w_2, w_1, w_0$, until prediction error converges to zero.


# TD Learning model

import matplotlib.pyplot as plt
import numpy as np

def td_learn(lr=0.2, trials_n=100, steps_n=40, cs=10, us=20):
    """TD Learning

        lr (float): learning rate. Defaults to 0.2.
        trials_n (int): number of trials. Defaults to 100.
        steps_n (int): number of steps per trial. Defaults to 40.
        cs (int): conditioned stimulus onset. Defaults to 10.
        us (int): unconditioned stimulus onset step. Defaults to 20.
    # cue(unconditioned stimulus) onset at step cs
    u = np.zeros(steps_n)
    u[cs] = 1

    # reward(conditioned stimulus) onset at step us
    r = np.zeros(steps_n)
    r[us] = 1

    # w is the linear filter for convolutionary kernel
    w = np.zeros((steps_n, trials_n))

    # init to store v and delta
    v = np.zeros((steps_n, trials_n))
    delta = np.zeros((steps_n, trials_n))

    # learning session
    for t_i in range(trials_n - 1):
        for s_i in range(steps_n - 1):
            # update v(values) by linear filter w
            v[s_i, t_i] = w[:s_i+1, t_i] @ u[s_i::-1]
            v[s_i+1, t_i] = w[:s_i+1+1, t_i] @ u[s_i+1::-1]

            # prediction error delta
            delta[s_i, t_i] = r[s_i] + v[s_i+1, t_i] - v[s_i, t_i]

            # update parameters
            # v[s_i, t_i] += lr * delta[s_i, t_i]
            w[:s_i+1, t_i] += lr * delta[s_i, t_i] * u[s_i::-1]

        # v[:, t_i+1] = v[:, t_i]
        # save w for the next trial
        w[:, t_i+1] = w[:, t_i]

    return delta, w, v

def plot(trials_n, steps_n, num):

    x = np.arange(0, trials_n)
    y = np.arange(0, steps_n)

    X, Y = np.meshgrid(x, y)
    Z = num

    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.plot_surface(X, Y, Z)


    ax.set_title('TD Learning Simulation')

lr = 0.2
trials_n = 100
steps_n = 40

delta, w, v = td_learn(lr, trials_n, steps_n, cs=10, us=20)
# print(delta)
plot(trials_n, steps_n, delta)



Dayan, Peter, and Laurence F. Abbott. Theoretical neuroscience: computational and mathematical modeling of neural systems. MIT press, 2005.