Pratham Patel

Reinforcement Learning from Scratch

Building RL from the ground up — actions, rewards, policies, expected reward, the policy gradient theorem, and REINFORCE — all derived step by step with concrete examples.

This is Part 2. If you haven’t read Mathematical Prerequisites for Reinforcement Learning, start there — we will use the same mathematical tools (expected value, derivatives, the log trick, and Monte Carlo estimation) throughout this post.


The Single-Step World

A tiny AI

Imagine you are building a very small AI. It sees a math question — “2 + 3 = ?” — and must output one of two answers. Action 0 means the model answers “4”, and Action 1 means the model answers “5”. That’s the entire world: one question, two possible actions. We will build Reinforcement Learning completely inside this tiny setup before scaling up to anything more complex.


What Is an Action?

An action is simply a choice the model makes, denoted $a$. In our example, $a \in \{0, 1\}$, where $a = 0$ means the model answers “4” and $a = 1$ means the model answers “5”. There is nothing deep about this definition — an action is just a number that labels a choice.


What Is Reward?

After the model outputs an answer, we check whether it was correct. If the answer is right, we assign a reward of 1. If it is wrong, the reward is 0. We denote reward as $r$, so if $a = 1$ (the model answers “5”, which is correct), then $r = 1$, and if $a = 0$ (the model answers “4”, which is wrong), then $r = 0$.

Reward is a number that tells us how good the action was. In more complex settings, rewards can be negative (penalties), fractional, or delayed, but the idea is always the same: a scalar signal indicating quality.


The Policy

The model does not always pick the same action deterministically. Instead, it chooses randomly according to a probability distribution over actions. We call this distribution the policy, written as:

$$\pi_\theta(a)$$

Read this as “the probability of choosing action $a$ under parameter $\theta$.” The policy is the central object in RL — it is the thing we are trying to improve.


What Is $\theta$?

$\theta$ is a number (or, in general, a vector of numbers) that controls the model’s behaviour. We define the policy using the sigmoid function from Part 1:

$$\pi_\theta(1) = \sigma(\theta), \qquad \pi_\theta(0) = 1 - \sigma(\theta)$$

where $\sigma(\theta) = \frac{1}{1 + e^{-\theta}}$.

So $\theta$ acts as a knob. When $\theta = 0$, the sigmoid outputs 0.5, meaning the model picks each answer with equal probability — a completely random policy. As $\theta$ grows large and positive, $\sigma(\theta)$ approaches 1, and the model almost always answers “5” (the correct answer). As $\theta$ grows large and negative, the model almost always answers “4” (the wrong answer). Training the model means finding a value of $\theta$ that makes the policy produce the correct answer reliably.
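To make this concrete, here is the policy as a short sketch in plain Python (the helper names are my own, not standard API):

```python
import math
import random

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def sample_action(theta):
    """Sample from the policy: action 1 ("5") with probability sigma(theta)."""
    return 1 if random.random() < sigmoid(theta) else 0

print(sigmoid(0.0))   # 0.5    -> a completely random policy
print(sigmoid(5.0))   # ~0.993 -> almost always answers "5"
```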


Expected Reward

Because the model chooses its action randomly, the reward it receives is also random. Sometimes it will answer correctly and get reward 1, other times it will answer incorrectly and get reward 0. The natural question is: on average, how much reward does this policy earn?

Using the expected value formula from Part 1:

$$\mathbb{E}[r] = \pi_\theta(1) \times 1 + \pi_\theta(0) \times 0 = \pi_\theta(1) = \sigma(\theta)$$

We give this quantity a name — the objective function:

$$J(\theta) = \sigma(\theta)$$

$J(\theta)$ represents the average reward the model earns under the current policy. Our goal is to make $J$ as large as possible, which means finding the $\theta$ that pushes $\sigma(\theta)$ close to 1.


Computing the Gradient

To maximise $J$, we compute its derivative with respect to $\theta$. From Part 1, the derivative of the sigmoid is $\sigma(\theta)(1 - \sigma(\theta))$, so:

$$\frac{dJ}{d\theta} = \sigma(\theta)(1 - \sigma(\theta))$$

This derivative is always positive (since $\sigma$ is always between 0 and 1), which confirms that increasing $\theta$ always increases $J$. We can use gradient ascent — the opposite of gradient descent — to iteratively improve the policy:

$$\theta \leftarrow \theta + \alpha \frac{dJ}{d\theta}$$

where $\alpha$ is a small positive learning rate. Each update nudges $\theta$ upward, which increases the probability of the correct action. Repeat this enough times and $\sigma(\theta)$ will approach 1, meaning the model almost always answers “5”.
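As a minimal sketch (reusing the sigmoid helper above; the learning rate is an arbitrary choice), the entire training loop for this one-question world fits in a few lines:

```python
theta = 0.0   # start at the 50/50 policy
alpha = 0.5   # learning rate, chosen arbitrarily

for _ in range(1000):
    p = sigmoid(theta)
    grad = p * (1 - p)      # dJ/dtheta, computed in closed form
    theta += alpha * grad   # gradient ascent step

print(theta, sigmoid(theta))   # sigma(theta) has crept close to 1
```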


Deriving the Policy Gradient (Log Trick)

The direct derivative above works perfectly for this two-action example. But it relies on us being able to differentiate $J$ in closed form, which requires knowing the reward for every possible action and summing over all of them. In real problems with enormous action spaces, that sum is intractable. So let’s derive the same gradient a different way — the way that generalises.

Start from the definition of expected reward:

$$J(\theta) = \sum_a \pi_\theta(a) \cdot r(a)$$

Differentiate with respect to $\theta$:

$$\nabla_\theta J = \sum_a \nabla_\theta \pi_\theta(a) \cdot r(a)$$

Now apply the log trick from Part 1. We multiply and divide by $\pi_\theta(a)$ inside the sum:

$$\nabla_\theta \pi_\theta(a) = \pi_\theta(a) \cdot \nabla_\theta \log \pi_\theta(a)$$

Substituting this into the gradient expression:

$$\nabla_\theta J = \sum_a \pi_\theta(a) \cdot r(a) \cdot \nabla_\theta \log \pi_\theta(a)$$

This sum has the form $\sum_a P(a) \cdot f(a)$, which is exactly the definition of an expectation. So we can rewrite it as:

$$\boxed{\nabla_\theta J = \mathbb{E}_{a \sim \pi_\theta}\left[ r(a) \cdot \nabla_\theta \log \pi_\theta(a) \right]}$$

This is the policy gradient formula, and it emerged purely from algebra — no approximations, no tricks beyond the log identity. The reason this form is so powerful is that it is an expectation under the policy. That means we can estimate it by sampling actions from the policy and averaging, rather than summing over every possible action. For our two-action example this distinction is trivial, but for a language model choosing among 50,000 tokens, or a robot choosing among continuous motor torques, it is the difference between feasible and impossible.
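Here is a sketch of that sampling-based estimate for our two-action policy, checked against the closed-form gradient $\sigma(\theta)(1 - \sigma(\theta))$ (it reuses the sigmoid and sample_action helpers from above):

```python
def grad_log_pi(theta, a):
    """Gradient of log pi_theta(a): 1 - sigma(theta) for a = 1, -sigma(theta) for a = 0."""
    p = sigmoid(theta)
    return (1 - p) if a == 1 else -p

def reward(a):
    return 1.0 if a == 1 else 0.0   # answering "5" is correct

theta, N = 0.3, 100_000
samples = [sample_action(theta) for _ in range(N)]
estimate = sum(reward(a) * grad_log_pi(theta, a) for a in samples) / N

exact = sigmoid(theta) * (1 - sigmoid(theta))
print(estimate, exact)   # the two agree up to Monte Carlo noise
```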


The Multi-Step World

Everything so far involved a single decision followed by a single reward. Real Reinforcement Learning is different — you take multiple actions over time, like playing a game where each move leads to a new position, and the reward might only arrive at the very end. Let’s extend our framework to handle this.


A Tiny Grid World

Consider a number line. You start at position 0 and want to reach position 2. At each time step, you can either move right (+1) or move left (−1). We will stop after exactly 2 steps, regardless of where you end up.

The action at time $t$ is $a_t \in \{-1, +1\}$, and the state $s_t$ simply represents your current position. So $s_0 = 0$. If you move right at step 0, then $s_1 = 1$. If you then move right again, $s_2 = 2$. The reward structure is sparse: you receive a reward of 1 only if your final position is 2, and 0 otherwise.

$$r_t = \begin{cases} 1 & \text{if } s_t = 2 \\ 0 & \text{otherwise} \end{cases}$$

This is a minimal example of a multi-step problem. You don’t get any feedback until the end, and you have to make two sequential decisions, each of which affects what states are available in the future.


What Is a Trajectory?

A trajectory is the complete record of what happened during one episode — every state visited and every action taken:

$$\tau = (s_0, a_0, s_1, a_1, s_2)$$

For example, the trajectory $\tau = (0, +1, 1, +1, 2)$ represents starting at 0, moving right to 1, then moving right again to reach 2. The trajectory $\tau = (0, +1, 1, -1, 0)$ represents moving right and then left, ending back at the start. Each complete sequence from start to finish is one trajectory, denoted $\tau$.


Probability of a Trajectory

In our grid world, the environment is deterministic — if you are at position 1 and choose +1, you always end up at position 2, never somewhere random. This means all the randomness in a trajectory comes from the policy’s action choices. Suppose the policy assigns the same probabilities at every state:

$$\pi_\theta(+1 \mid s) = p, \qquad \pi_\theta(-1 \mid s) = 1 - p$$

The probability of an entire trajectory is simply the product of the probabilities of each action taken along the way:

$$P_\theta(\tau) = \prod_t \pi_\theta(a_t \mid s_t)$$

This is the same multiplication rule you use for independent coin tosses — the probability of getting Heads then Heads is $p \times p$. Let’s enumerate all four possible 2-step trajectories.

Right, right $(0, +1, 1, +1, 2)$: ends at position 2, earns reward 1, with probability $p^2$.

Right, left $(0, +1, 1, -1, 0)$: ends at position 0, earns reward 0, with probability $p(1-p)$.

Left, right $(0, -1, -1, +1, 0)$: ends at position 0, earns reward 0, with probability $(1-p) \cdot p$.

Left, left $(0, -1, -1, -1, -2)$: ends at position $-2$, earns reward 0, with probability $(1-p)^2$.

Only the first trajectory reaches position 2 and earns a reward. All others end at 0 or $-2$ and earn nothing.


The Objective

The total reward for a trajectory is the sum of all rewards collected along the way:

$$R(\tau) = \sum_t r_t$$

In our example, only the trajectory $(0, +1, 1, +1, 2)$ has $R(\tau) = 1$. All other trajectories have $R(\tau) = 0$.

The objective function $J(\theta)$ is the expected total reward — the average reward you would get if you ran the policy many times:

$$J(\theta) = \sum_\tau P_\theta(\tau) \cdot R(\tau)$$

This is exactly the same expected value formula from Part 1 — $\mathbb{E}[X] = \sum P(x) \cdot x$ — except now the “outcomes” are entire trajectories instead of individual coin tosses, and the “values” are total rewards instead of winnings.

Numerical verification

Let $p = 0.5$, so the policy is completely random. Each of the four trajectories has probability $0.25$, and only one of them has reward 1. So:

$$J = 0.25 \times 1 + 0.25 \times 0 + 0.25 \times 0 + 0.25 \times 0 = 0.25$$

This makes intuitive sense: with a random policy, you have a 25% chance of going right twice, which is the only way to reach position 2. If we increase $p$ to 0.8 (a policy that strongly prefers moving right), then $J = 0.8^2 = 0.64$, which is much better. And if $p = 1$, then $J = 1$ — the policy always reaches the goal.
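A short enumeration sketch confirms these numbers (the function name is mine):

```python
from itertools import product

def expected_reward(p):
    """J(p): enumerate all four 2-step trajectories, weight reward by probability."""
    J = 0.0
    for actions in product([+1, -1], repeat=2):
        prob = 1.0
        for a in actions:
            prob *= p if a == +1 else (1 - p)
        final_position = sum(actions)   # we start at position 0
        J += prob * (1.0 if final_position == 2 else 0.0)
    return J

print(expected_reward(0.5))  # 0.25
print(expected_reward(0.8))  # 0.64 (up to float rounding)
print(expected_reward(1.0))  # 1.0
```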


Deriving the Multi-Step Policy Gradient

Now we derive the gradient of $J$ with respect to $\theta$ for the multi-step case. The derivation follows the exact same structure as the single-step case, just with trajectories instead of individual actions.

Step 1 — Differentiate. We start from $J(\theta) = \sum_\tau P_\theta(\tau) \cdot R(\tau)$ and move the derivative inside the sum:

$$\nabla_\theta J = \sum_\tau \nabla_\theta P_\theta(\tau) \cdot R(\tau)$$

Step 2 — Apply the log trick. Using the identity $\nabla P = P \cdot \nabla \log P$ (derived in Part 1):

$$\nabla_\theta J = \sum_\tau P_\theta(\tau) \cdot R(\tau) \cdot \nabla_\theta \log P_\theta(\tau)$$

Step 3 — Recognise as expectation. The sum $\sum_\tau P_\theta(\tau) \cdot (\cdots)$ is the definition of expectation over trajectories:

$$\nabla_\theta J = \mathbb{E}_\tau \left[ R(\tau) \cdot \nabla_\theta \log P_\theta(\tau) \right]$$

Step 4 — Expand the log probability. Recall that the probability of a trajectory is $P_\theta(\tau) = \prod_t \pi_\theta(a_t \mid s_t) \times (\text{environment transitions})$. Taking the logarithm turns this product into a sum:

$$\log P_\theta(\tau) = \sum_t \log \pi_\theta(a_t \mid s_t) + \text{terms that don't depend on } \theta$$

The environment transition probabilities (like “moving right from position 1 always leads to position 2”) do not depend on $\theta$ at all — they are fixed properties of the environment. So when we differentiate, those terms vanish:

$$\nabla_\theta \log P_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Step 5 — Substitute back. This gives us:

$$\nabla_\theta J = \mathbb{E}_\tau \left[ R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

This is already a valid policy gradient formula — it is correct and you could use it as-is. But it has a problem: every action in the trajectory is weighted by the total reward $R(\tau)$, including rewards that happened before that action was taken. An action at time step $t$ cannot possibly have caused a reward at time step $k < t$, so those past rewards are adding noise to the gradient without providing useful signal. We can do better.

From total reward to per-time-step credit

Recall that $R(\tau) = \sum_{k=0}^{T} r_k$. Substituting this into the gradient:

$$\nabla_\theta J = \mathbb{E}_\tau \left[ \left(\sum_{k=0}^{T} r_k\right) \left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) \right]$$

Expanding the product of these two sums gives a double sum — every reward term $r_k$ multiplies every gradient term $\nabla \log \pi_\theta(a_t \mid s_t)$:

$$= \mathbb{E}_\tau \left[ \sum_{t=0}^{T} \sum_{k=0}^{T} r_k \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

For each time step $t$, we can split the inner sum over $k$ into past rewards ($k < t$) and present-or-future rewards ($k \geq t$):

$$\nabla_\theta J = \mathbb{E}_\tau \left[ \sum_{t=0}^{T} \left(\sum_{k < t} r_k + \sum_{k \geq t} r_k \right) \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

Now comes the key observation: for a fixed time step $t$, the rewards $r_k$ with $k < t$ happened in the past. They were determined by earlier actions and do not depend on the action $a_t$ at all. So the past-reward terms can be factored out of the expectation over $a_t$, leaving only $\mathbb{E}_{a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)]$ behind. We need to show that this expectation is zero.

The score function identity

For any probability distribution, the expected value of the gradient of its log-probability is zero:

$$\mathbb{E}_{a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a) \right] = 0$$

The proof is short. Start from the fact that probabilities sum to 1:

$$\sum_a \pi_\theta(a) = 1$$

Differentiate both sides with respect to $\theta$:

$$\sum_a \nabla_\theta \pi_\theta(a) = 0$$

Now apply the log trick — replace $\nabla \pi$ with $\pi \nabla \log \pi$:

$$\sum_a \pi_\theta(a) \nabla_\theta \log \pi_\theta(a) = 0$$

But the left side is exactly $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a)]$. So the identity holds. This is pure calculus — it follows directly from the constraint that probabilities must sum to 1.
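For our sigmoid policy you can check this identity with one line of arithmetic, since the two terms cancel exactly:

```python
theta = 1.7   # any value of theta works
p = sigmoid(theta)

# E[grad log pi] = pi(1) * grad log pi(1) + pi(0) * grad log pi(0)
#               = p * (1 - p)            + (1 - p) * (-p)
print(p * (1 - p) + (1 - p) * (-p))   # 0.0
```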

Applying the identity

Consider the past-reward contribution to the gradient at time step $t$:

$$\mathbb{E}_\tau \left[ \sum_{k < t} r_k \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

The rewards $r_k$ for $k < t$ are already determined by the time we reach step $t$ — they depend on earlier states and actions, not on $a_t$. So with respect to the expectation over $a_t$, the past rewards are constants that can be factored out:

$$= \mathbb{E}_{\tau \setminus a_t} \left[ \sum_{k < t} r_k \cdot \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \right]$$

By the score function identity we just proved, the inner expectation is zero. So the entire expression vanishes:

$$\mathbb{E}_\tau \left[ \sum_{k < t} r_k \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] = 0$$

Past rewards contribute nothing to the gradient. Only future rewards matter.

The refined policy gradient theorem

Dropping the past-reward terms and defining $R_t := \sum_{k \geq t} r_k$ as the reward-to-go from time step $t$ onward, we obtain:

$$\boxed{\nabla_\theta J = \mathbb{E}_\tau \left[ \sum_{t=0}^{T} R_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]}$$

This is the refined form of the policy gradient theorem. Each action at time $t$ is weighted only by the rewards that occur at or after time $t$ — never by rewards from the past. Compared to the total-reward version, this form has the same expected value (it is still unbiased) but lower variance, because we have removed the noise contributed by past rewards that the action could not have influenced.
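In code, reward-to-go is a single backward pass over the reward list. A small helper sketch:

```python
def rewards_to_go(rewards):
    """R_t = r_t + r_{t+1} + ... + r_T, computed with one backward pass."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out

print(rewards_to_go([0.0, 0.0, 1.0]))  # [1.0, 1.0, 1.0]
print(rewards_to_go([1.0, 0.0, 0.0]))  # [1.0, 0.0, 0.0] -- past rewards never flow forward
```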

The key insight from the full derivation is that we never needed to know the environment’s transition dynamics. The environment terms dropped out when we took the derivative. This is what makes policy gradient methods so general — they work even when you have no model of the environment, as long as you can run episodes and observe rewards.


From Theory to Algorithm: REINFORCE

The policy gradient formula tells us the direction to update $\theta$, but it is written as an expectation over all possible trajectories. In practice, we cannot sum over every possible trajectory — even in our tiny grid world there are 4, and in real problems the number is astronomical. This is where Monte Carlo estimation from Part 1 comes in.

The algorithm

The idea is to sample trajectories by actually running the policy, then use those samples to estimate the gradient. The procedure, known as REINFORCE, works as follows.

First, fix the current policy parameters $\theta$. Then generate $N$ trajectories by running the policy in the environment: $\tau_1, \tau_2, \ldots, \tau_N$. For each trajectory and each time step $t$, compute the reward-to-go $R_t^{(i)} = \sum_{k \geq t} r_k^{(i)}$ and the gradient term $\nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$. The gradient estimate is:

$$\widehat{\nabla J} = \frac{1}{N} \sum_{i=1}^{N} \sum_t R_t^{(i)} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$$

Finally, update the parameters using gradient ascent: $\theta \leftarrow \theta + \alpha \widehat{\nabla J}$, where $\alpha$ is the learning rate. Then repeat the whole process — fix the new $\theta$, sample fresh trajectories, estimate the gradient, update.
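Putting the pieces together, here is a minimal, self-contained REINFORCE sketch for the grid world. It is an illustration under the assumptions of this post (a single parameter $\theta$ with $p = \sigma(\theta)$ as the probability of moving right at every state), not a general-purpose implementation:

```python
import math
import random

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def run_episode(theta):
    """One 2-step episode on the number line; returns actions and per-step rewards."""
    position, actions, rewards = 0, [], []
    for _ in range(2):
        a = +1 if random.random() < sigmoid(theta) else -1
        position += a
        actions.append(a)
        rewards.append(1.0 if position == 2 else 0.0)   # sparse goal reward
    return actions, rewards

theta, alpha, N = 0.0, 0.1, 500   # initial parameter, learning rate, batch size

for _ in range(200):
    grad_estimate = 0.0
    p = sigmoid(theta)                       # theta stays fixed for the whole batch
    for _ in range(N):
        actions, rewards = run_episode(theta)
        for t, a in enumerate(actions):
            reward_to_go = sum(rewards[t:])  # only rewards at or after step t count
            # grad of log pi(a | s): 1 - p for "right" (+1), -p for "left" (-1)
            grad_log_pi = (1 - p) if a == +1 else -p
            grad_estimate += reward_to_go * grad_log_pi
    theta += alpha * grad_estimate / N       # update only after the batch is done

print(sigmoid(theta))   # probability of moving right -- should be close to 1
```

Note that the update to theta happens strictly between batches, which is exactly the point made in the next section.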

Why $\theta$ must be fixed during sampling

This is a subtle but critical point. All $N$ trajectories in a single batch must be sampled from the same policy $\pi_\theta$. If you updated $\theta$ after generating the first trajectory and before generating the second, those two trajectories would come from different distributions. As we discussed in Part 1, the unbiasedness proof requires all samples to come from the same distribution — mixing samples from different policies is like measuring heights from different classes and averaging them together.

So the correct training loop has this structure: fix $\theta$, collect an entire batch of trajectories, compute the gradient estimate from that batch, update $\theta$, and only then start collecting a new batch. The update happens between batches, never within a batch.

Unbiasedness

The Monte Carlo estimator $\widehat{\nabla J}$ is unbiased, meaning $\mathbb{E}[\widehat{\nabla J}] = \nabla J$. The proof is identical to the one in Part 1: each trajectory $\tau_i$ is drawn from the same distribution $P_\theta$, so $\mathbb{E}[X(\tau_i)] = \nabla J$ for every $i$, and averaging over $N$ such terms does not change the expected value.

This means that even though any single estimate might be noisy (because we only sampled $N$ trajectories instead of considering all possible ones), the estimate is correct on average. Over many batches and many updates, the noise averages out and the policy converges to one that earns high reward.

Intuition

For each action in a sampled trajectory, the gradient estimate includes the term $R_t \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t)$. If the reward-to-go from that point onward is large, this term pushes the parameters $\theta$ in a direction that makes that action more probable in that state. If the reward-to-go is zero, the action contributes nothing to the gradient. Over many samples, trajectories that led to good outcomes get reinforced (made more likely), while trajectories that led to poor outcomes are effectively ignored. This is the fundamental mechanism of policy gradient methods, and it is why the field is called reinforcement learning — good behaviour is reinforced through probability increases.


Summary

An action $a$ is a choice the model makes, and the reward $r$ is the scalar signal telling us how good that choice was. The policy $\pi_\theta(a \mid s)$ defines the probability of taking action $a$ in state $s$, controlled by the parameter $\theta$ — a knob (or vector of knobs) that we tune during training. The objective $J(\theta)$ is the average reward earned under the current policy, and the gradient $\nabla_\theta J$ tells us the direction to change $\theta$ to increase that average. A trajectory $\tau$ is the complete sequence of states and actions in one episode. The log trick — the identity $\nabla P = P \nabla \log P$ — converts the gradient into an expectation we can estimate by sampling. And REINFORCE is the algorithm that ties it all together: sample trajectories, estimate the gradient, update the parameters, repeat.

Everything was derived from two small examples — a math quiz for the single-step case and a grid world for the multi-step case. The same structure scales to language models with vocabularies of 50,000 tokens, game-playing agents navigating millions of states, and robots learning to walk. The algebra is identical; only the size of the action space and the length of the trajectories change.
