Reinforcement Learning from Scratch
Building RL from the ground up — actions, rewards, policies, expected reward, the policy gradient theorem, and REINFORCE — all derived step by step with concrete examples.
This is Part 2. If you haven’t read Mathematical Prerequisites for Reinforcement Learning, start there — we will use the same mathematical tools (expected value, derivatives, the log trick, and Monte Carlo estimation) throughout this post.
The Single-Step World
A tiny AI
Imagine you are building a very small AI. It sees a math question — “2 + 3 = ?” — and must output one of two answers. Action 0 means the model answers “4”, and Action 1 means the model answers “5”. That’s the entire world: one question, two possible actions. We will build Reinforcement Learning completely inside this tiny setup before scaling up to anything more complex.
What Is an Action?
An action is simply a choice the model makes, denoted $a$. In our example, $a \in \{0, 1\}$, where $a = 0$ means the model answers “4” and $a = 1$ means the model answers “5”. There is nothing deep about this definition — an action is just a number that labels a choice.
What Is Reward?
After the model outputs an answer, we check whether it was correct. If the answer is right, we assign a reward of 1. If it is wrong, the reward is 0. We denote reward as $r$, so if $a = 1$ (the model answers “5”, which is correct), then $r = 1$, and if $a = 0$ (the model answers “4”, which is wrong), then $r = 0$.
Reward is a number that tells us how good the action was. In more complex settings, rewards can be negative (penalties), fractional, or delayed, but the idea is always the same: a scalar signal indicating quality.
The Policy
The model does not always pick the same action deterministically. Instead, it chooses randomly according to a probability distribution over actions. We call this distribution the policy, written as:

$$\pi_\theta(a)$$

Read this as “the probability of choosing action $a$ under parameter $\theta$.” The policy is the central object in RL — it is the thing we are trying to improve.
What Is $\theta$?
$\theta$ is a number (or, in general, a vector of numbers) that controls the model’s behaviour. We define the policy using the sigmoid function from Part 1:

$$\pi_\theta(a = 1) = \sigma(\theta) = \frac{1}{1 + e^{-\theta}}$$

where $\pi_\theta(a = 0) = 1 - \sigma(\theta)$.
So $\theta$ acts as a knob. When $\theta = 0$, the sigmoid outputs 0.5, meaning the model picks each answer with equal probability — a completely random policy. As $\theta$ grows large and positive, $\sigma(\theta)$ approaches 1, and the model almost always answers “5” (the correct answer). As $\theta$ grows large and negative, the model almost always answers “4” (the wrong answer). Training the model means finding a value of $\theta$ that makes the policy produce the correct answer reliably.
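In code, this policy is only a few lines. A minimal Python sketch — the helper names `sigmoid` and `sample_action` are ours, not from the post:

```python
import math
import random

def sigmoid(theta):
    """Probability that the policy picks action 1 (answering "5")."""
    return 1.0 / (1.0 + math.exp(-theta))

def sample_action(theta):
    """Draw one action from pi_theta: 1 with probability sigmoid(theta), else 0."""
    return 1 if random.random() < sigmoid(theta) else 0

# theta = 0 is the 50/50 policy; large positive theta almost always answers "5".
print(sigmoid(0.0), sigmoid(5.0))  # 0.5 and roughly 0.993
```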
Expected Reward
Because the model chooses its action randomly, the reward it receives is also random. Sometimes it will answer correctly and get reward 1, other times it will answer incorrectly and get reward 0. The natural question is: on average, how much reward does this policy earn?
Using the expected value formula from Part 1:

$$\mathbb{E}[r] = \sum_a \pi_\theta(a)\, r(a) = \pi_\theta(a = 1) \cdot 1 + \pi_\theta(a = 0) \cdot 0 = \sigma(\theta)$$

We give this quantity a name — the objective function:

$$J(\theta) = \mathbb{E}_{a \sim \pi_\theta}[r(a)] = \sigma(\theta)$$

$J(\theta)$ represents the average reward the model earns under the current policy. Our goal is to make $J(\theta)$ as large as possible, which means finding the $\theta$ that pushes $\sigma(\theta)$ close to 1.
Computing the Gradient
To maximise $J(\theta)$, we compute its derivative with respect to $\theta$. From Part 1, the derivative of the sigmoid is $\sigma'(\theta) = \sigma(\theta)(1 - \sigma(\theta))$, so:

$$\frac{dJ}{d\theta} = \sigma(\theta)(1 - \sigma(\theta))$$

This derivative is always positive (since $\sigma(\theta)$ is always between 0 and 1), which confirms that increasing $\theta$ always increases $J(\theta)$. We can use gradient ascent — the opposite of gradient descent — to iteratively improve the policy:

$$\theta \leftarrow \theta + \alpha \frac{dJ}{d\theta}$$

where $\alpha$ is a small positive learning rate. Each update nudges $\theta$ upward, which increases the probability of the correct action. Repeat this enough times and $\sigma(\theta)$ will approach 1, meaning the model almost always answers “5”.
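The whole single-step training procedure fits in a short loop. A sketch, using the closed-form sigmoid gradient and an illustrative learning rate:

```python
import math

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

theta = 0.0   # start from the 50/50 policy
alpha = 1.0   # learning rate, chosen just for this demo

for _ in range(500):
    grad = sigmoid(theta) * (1.0 - sigmoid(theta))  # dJ/dtheta in closed form
    theta = theta + alpha * grad                    # gradient ascent step

print(sigmoid(theta))  # probability of the correct answer, now close to 1
```

Note how the updates slow down as the policy improves: the gradient $\sigma(1-\sigma)$ shrinks toward zero as $\sigma(\theta)$ approaches 1.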
Deriving the Policy Gradient (Log Trick)
The direct derivative above works perfectly for this two-action example. But it relies on us being able to differentiate $J(\theta)$ in closed form, which requires knowing the reward for every possible action and summing over all of them. In real problems with enormous action spaces, that sum is intractable. So let’s derive the same gradient a different way — the way that generalises.
Start from the definition of expected reward:

$$J(\theta) = \sum_a \pi_\theta(a)\, r(a)$$

Differentiate with respect to $\theta$:

$$\nabla_\theta J(\theta) = \sum_a \nabla_\theta \pi_\theta(a)\, r(a)$$

Now apply the log trick from Part 1. We multiply and divide by $\pi_\theta(a)$ inside the sum:

$$\nabla_\theta \pi_\theta(a) = \pi_\theta(a)\, \frac{\nabla_\theta \pi_\theta(a)}{\pi_\theta(a)} = \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)$$

Substituting this into the gradient expression:

$$\nabla_\theta J(\theta) = \sum_a \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)\, r(a)$$

This sum has the form $\sum_a \pi_\theta(a)\, f(a)$, which is exactly the definition of an expectation. So we can rewrite it as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a)\, r(a)\big]$$
This is the policy gradient formula, and it emerged purely from algebra — no approximations, no tricks beyond the log identity. The reason this form is so powerful is that it is an expectation under the policy. That means we can estimate it by sampling actions from the policy and averaging, rather than summing over every possible action. For our two-action example this distinction is trivial, but for a language model choosing among 50,000 tokens, or a robot choosing among continuous motor torques, it is the difference between feasible and impossible.
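We can check numerically that the sampled form agrees with the closed-form derivative $\sigma(\theta)(1 - \sigma(\theta))$. A sketch, using the fact that for the sigmoid policy $\nabla_\theta \log \pi_\theta(1) = 1 - \sigma(\theta)$ and $\nabla_\theta \log \pi_\theta(0) = -\sigma(\theta)$:

```python
import math
import random

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def grad_log_pi(a, theta):
    """Gradient of log pi_theta(a) for the two-action sigmoid policy."""
    s = sigmoid(theta)
    return (1.0 - s) if a == 1 else -s

random.seed(0)
theta = 0.3
N = 200_000

# Monte Carlo: sample actions from the policy, average grad-log-prob times reward.
est = 0.0
for _ in range(N):
    a = 1 if random.random() < sigmoid(theta) else 0
    r = 1.0 if a == 1 else 0.0   # reward 1 only for the correct answer "5"
    est += grad_log_pi(a, theta) * r
est /= N

exact = sigmoid(theta) * (1.0 - sigmoid(theta))  # closed-form dJ/dtheta
print(est, exact)  # the two agree to a couple of decimal places
```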
The Multi-Step World
Everything so far involved a single decision followed by a single reward. Real Reinforcement Learning is different — you take multiple actions over time, like playing a game where each move leads to a new position, and the reward might only arrive at the very end. Let’s extend our framework to handle this.
A Tiny Grid World
Consider a number line. You start at position 0 and want to reach position 2. At each time step, you can either move right (+1) or move left (−1). We will stop after exactly 2 steps, regardless of where you end up.
The action at time $t$ is $a_t \in \{+1, -1\}$, and the state $s_t$ simply represents your current position. So $s_0 = 0$. If you move right at step 0, then $s_1 = 1$. If you then move right again, $s_2 = 2$. The reward structure is sparse: you receive a reward of 1 only if your final position is 2, and 0 otherwise.
This is a minimal example of a multi-step problem. You don’t get any feedback until the end, and you have to make two sequential decisions, each of which affects what states are available in the future.
What Is a Trajectory?
A trajectory is the complete record of what happened during one episode — every state visited and every action taken:

$$\tau = (s_0, a_0, s_1, a_1, s_2)$$

For example, the trajectory $(0, +1, 1, +1, 2)$ represents starting at 0, moving right to 1, then moving right again to reach 2. The trajectory $(0, +1, 1, -1, 0)$ represents moving right and then left, ending back at the start. Each complete sequence from start to finish is one trajectory, denoted $\tau$.
Probability of a Trajectory
In our grid world, the environment is deterministic — if you are at position 1 and choose +1, you always end up at position 2, never somewhere random. This means all the randomness in a trajectory comes from the policy’s action choices. Suppose the policy assigns the same probabilities at every state:

$$\pi_\theta(a = +1 \mid s) = p, \qquad \pi_\theta(a = -1 \mid s) = 1 - p$$

The probability of an entire trajectory is simply the product of the probabilities of each action taken along the way:

$$P(\tau) = \prod_{t=0}^{1} \pi_\theta(a_t \mid s_t)$$

This is the same multiplication rule you use for independent coin tosses — the probability of getting Heads then Heads is $0.5 \times 0.5 = 0.25$. Let’s enumerate all four possible 2-step trajectories.
Right, right $(+1, +1)$: ends at position 2, earns reward 1, with probability $p \cdot p = p^2$.
Right, left $(+1, -1)$: ends at position 0, earns reward 0, with probability $p(1 - p)$.
Left, right $(-1, +1)$: ends at position 0, earns reward 0, with probability $(1 - p)p$.
Left, left $(-1, -1)$: ends at position $-2$, earns reward 0, with probability $(1 - p)^2$.
Only the first trajectory reaches position 2 and earns a reward. All others end at 0 or $-2$ and earn nothing.
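The enumeration is easy to verify in code. A sketch with an illustrative $p = 0.7$:

```python
from itertools import product

p = 0.7  # illustrative probability of moving right

trajectories = []
for a0, a1 in product([+1, -1], repeat=2):
    prob = (p if a0 == +1 else 1 - p) * (p if a1 == +1 else 1 - p)
    final_pos = a0 + a1                 # start at 0, apply both moves
    reward = 1 if final_pos == 2 else 0
    trajectories.append((a0, a1, final_pos, reward, prob))
    print(f"actions=({a0:+d},{a1:+d}) end={final_pos:+d} reward={reward} prob={prob:.2f}")

# The four trajectories exhaust every outcome, so their probabilities sum to 1,
# and only (+1, +1) is rewarded.
assert abs(sum(t[4] for t in trajectories) - 1.0) < 1e-9
assert sum(t[3] for t in trajectories) == 1
```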
The Objective
The total reward for a trajectory is the sum of all rewards collected along the way:

$$R(\tau) = \sum_t r_t$$

In our example, only the trajectory $(+1, +1)$ has $R(\tau) = 1$. All other trajectories have $R(\tau) = 0$.

The objective function is the expected total reward — the average reward you would get if you ran the policy many times:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_\tau P(\tau)\, R(\tau)$$

This is exactly the same expected value formula from Part 1 — $\mathbb{E}[X] = \sum_x P(X = x)\, x$ — except now the “outcomes” are entire trajectories instead of individual coin tosses, and the “values” are total rewards instead of winnings.
Numerical verification
Let $p = 0.5$, so the policy is completely random. Each of the four trajectories has probability 0.25, and only one of them has reward 1. So:

$$J = 0.25 \cdot 1 + 0.25 \cdot 0 + 0.25 \cdot 0 + 0.25 \cdot 0 = 0.25$$

This makes intuitive sense: with a random policy, you have a 25% chance of going right twice, which is the only way to reach position 2. If we increase $p$ to 0.8 (a policy that strongly prefers moving right), then $J = 0.8^2 = 0.64$, which is much better. And if $p = 1$, then $J = 1$ — the policy always reaches the goal.
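The same verification in code — enumerating the four trajectories reproduces $J = p^2$ (the function name is ours):

```python
def expected_reward(p):
    """J(p): enumerate the four 2-step trajectories, weight rewards by probability."""
    J = 0.0
    for a0 in (+1, -1):
        for a1 in (+1, -1):
            prob = (p if a0 == +1 else 1 - p) * (p if a1 == +1 else 1 - p)
            J += prob * (1 if a0 + a1 == 2 else 0)
    return J

print(expected_reward(0.5))  # 0.25
print(expected_reward(0.8))  # 0.64 up to floating-point rounding, i.e. p squared
```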
Deriving the Multi-Step Policy Gradient
Now we derive the gradient of $J(\theta)$ with respect to $\theta$ for the multi-step case. The derivation follows the exact same structure as the single-step case, just with trajectories instead of individual actions.
Step 1 — Differentiate. We start from $J(\theta) = \sum_\tau P(\tau)\, R(\tau)$ and move the derivative inside the sum:

$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta P(\tau)\, R(\tau)$$

Step 2 — Apply the log trick. Using the identity $\nabla_\theta P(\tau) = P(\tau)\, \nabla_\theta \log P(\tau)$ (derived in Part 1):

$$\nabla_\theta J(\theta) = \sum_\tau P(\tau)\, \nabla_\theta \log P(\tau)\, R(\tau)$$

Step 3 — Recognise as expectation. The sum $\sum_\tau P(\tau)\,(\cdot)$ is the definition of expectation over trajectories:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log P(\tau)\, R(\tau)\big]$$

Step 4 — Expand the log probability. Recall that the probability of a trajectory is $P(\tau) = \prod_t \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$. Taking the logarithm turns this product into a sum:

$$\log P(\tau) = \sum_t \log \pi_\theta(a_t \mid s_t) + \sum_t \log P(s_{t+1} \mid s_t, a_t)$$

The environment transition probabilities (like “moving right from position 1 always leads to position 2”) do not depend on $\theta$ at all — they are fixed properties of the environment. So when we differentiate, those terms vanish:

$$\nabla_\theta \log P(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Step 5 — Substitute back. This gives us:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$
This is already a valid policy gradient formula — it is correct and you could use it as-is. But it has a problem: every action in the trajectory is weighted by the total reward $R(\tau)$, including rewards that happened before that action was taken. An action at time step $t$ cannot possibly have caused a reward at time step $t' < t$, so those past rewards are adding noise to the gradient without providing useful signal. We can do better.
From total reward to per-time-step credit
Recall that $R(\tau) = \sum_{t'} r_{t'}$. Substituting this into the gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\left(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t'} r_{t'}\right)\right]$$

Expanding the product of these two sums gives a double sum — every reward term $r_{t'}$ multiplies every gradient term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \sum_{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_{t'}\right]$$

For each time step $t$, we can split the inner sum over $t'$ into past rewards ($t' < t$) and present-or-future rewards ($t' \geq t$):

$$\sum_{t'} r_{t'} = \sum_{t' < t} r_{t'} + \sum_{t' \geq t} r_{t'}$$

Now comes the key observation: for a fixed time step $t$, the rewards $r_{t'}$ with $t' < t$ happened in the past. They were determined by earlier actions and do not depend on the action $a_t$ at all. So the past-reward terms can be factored out of the expectation over $a_t$, leaving only $\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]$ behind. We need to show that this expectation is zero.
The score function identity
For any probability distribution, the expected value of the gradient of its log-probability is zero:

$$\mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a)\big] = 0$$

The proof is short. Start from the fact that probabilities sum to 1:

$$\sum_a \pi_\theta(a) = 1$$

Differentiate both sides with respect to $\theta$:

$$\sum_a \nabla_\theta \pi_\theta(a) = 0$$

Now apply the log trick — replace $\nabla_\theta \pi_\theta(a)$ with $\pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)$:

$$\sum_a \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a) = 0$$

But the left side is exactly $\mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a)\big]$. So the identity holds. This is pure calculus — it follows directly from the constraint that probabilities must sum to 1.
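For our two-action sigmoid policy, the identity can be confirmed directly, since $\nabla_\theta \log \pi_\theta(1) = 1 - \sigma(\theta)$ and $\nabla_\theta \log \pi_\theta(0) = -\sigma(\theta)$:

```python
import math

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

theta = 1.3  # any value works; the identity holds for every theta
s = sigmoid(theta)

# pi(a) and the gradient of log pi(a) for each of the two actions
pi = {1: s, 0: 1.0 - s}
grad_log = {1: 1.0 - s, 0: -s}

# E[grad log pi(a)] = sum_a pi(a) * grad log pi(a)
expectation = sum(pi[a] * grad_log[a] for a in (0, 1))
print(expectation)  # 0.0
```

The two terms are $\sigma(1-\sigma)$ and $-(1-\sigma)\sigma$, which cancel exactly.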
Applying the identity
Consider the past-reward contribution to the gradient at time step $t$:

$$\mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t' < t} r_{t'}\right]$$

The rewards $r_{t'}$ for $t' < t$ are already determined by the time we reach step $t$ — they depend on earlier states and actions, not on $a_t$. So with respect to the expectation over $a_t$, the past rewards are constants that can be factored out:

$$\left(\sum_{t' < t} r_{t'}\right) \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]$$

By the score function identity we just proved, the inner expectation is zero. So the entire expression vanishes:

$$\mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t' < t} r_{t'}\right] = 0$$
Past rewards contribute nothing to the gradient. Only future rewards matter.
The refined policy gradient theorem
Dropping the past-reward terms and defining $G_t = \sum_{t' \geq t} r_{t'}$ as the reward-to-go from time step $t$ onward, we obtain:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$
This is the refined form of the policy gradient theorem. Each action at time $t$ is weighted only by the rewards that occur at or after time $t$ — never by rewards from the past. Compared to the total-reward version, this form has the same expected value (it is still unbiased) but lower variance, because we have removed the noise contributed by past rewards that the action could not have influenced.
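Computing the reward-to-go for a sampled episode is a single backward pass over the reward list. A small helper (the naming is ours):

```python
def rewards_to_go(rewards):
    """G_t = sum of rewards from time t to the end of the episode."""
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        G[t] = running
    return G

# Sparse rewards from our grid world: nothing until the final step.
print(rewards_to_go([0, 1]))     # [1.0, 1.0]
print(rewards_to_go([1, 2, 3]))  # [6.0, 5.0, 3.0]
```

The backward pass makes the computation linear in the episode length, instead of the quadratic cost of summing a slice at every step.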
The key insight from the full derivation is that we never needed to know the environment’s transition dynamics. The environment terms dropped out when we took the derivative. This is what makes policy gradient methods so general — they work even when you have no model of the environment, as long as you can run episodes and observe rewards.
From Theory to Algorithm: REINFORCE
The policy gradient formula tells us the direction to update $\theta$, but it is written as an expectation over all possible trajectories. In practice, we cannot sum over every possible trajectory — in our tiny grid world there are only 4, but in real problems the number is astronomical. This is where Monte Carlo estimation from Part 1 comes in.
The algorithm
The idea is to sample trajectories by actually running the policy, then use those samples to estimate the gradient. The procedure, known as REINFORCE, works as follows.
First, fix the current policy parameters $\theta$. Then generate $N$ trajectories by running the policy in the environment: $\tau_1, \tau_2, \ldots, \tau_N \sim \pi_\theta$. For each trajectory $\tau_i$ and each time step $t$, compute the reward-to-go $G_t^{(i)}$ and the gradient term $\nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$. The gradient estimate is:

$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_t \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, G_t^{(i)}$$

Finally, update the parameters using gradient ascent: $\theta \leftarrow \theta + \alpha \hat{g}$, where $\alpha$ is the learning rate. Then repeat the whole process — fix the new $\theta$, sample fresh trajectories, estimate the gradient, update.
Why $\theta$ must be fixed during sampling
This is a subtle but critical point. All trajectories in a single batch must be sampled from the same policy $\pi_\theta$. If you updated $\theta$ after generating the first trajectory and before generating the second, those two trajectories would come from different distributions. As we discussed in Part 1, the unbiasedness proof requires all samples to come from the same distribution — mixing samples from different policies is like measuring heights from different classes and averaging them together.

So the correct training loop has this structure: fix $\theta$, collect an entire batch of trajectories, compute the gradient estimate from that batch, update $\theta$, and only then start collecting a new batch. The update happens between batches, never within a batch.
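Putting the pieces together for the grid world gives a complete, if tiny, REINFORCE implementation. This is a sketch with illustrative hyperparameters, not the post’s reference code:

```python
import math
import random

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def run_episode(theta):
    """Roll out one 2-step episode; return the actions and per-step rewards."""
    pos, actions, rewards = 0, [], []
    for t in range(2):
        a = +1 if random.random() < sigmoid(theta) else -1
        pos += a
        actions.append(a)
        rewards.append(1.0 if t == 1 and pos == 2 else 0.0)
    return actions, rewards

def reinforce(theta=0.0, alpha=0.5, batches=200, batch_size=32):
    for _ in range(batches):
        grad = 0.0
        for _ in range(batch_size):          # theta is FIXED for the whole batch
            actions, rewards = run_episode(theta)
            G = [sum(rewards[t:]) for t in range(len(rewards))]  # reward-to-go
            s = sigmoid(theta)
            for t, a in enumerate(actions):
                # d/dtheta log pi(a_t): (1 - s) for +1, -s for -1
                grad += ((1.0 - s) if a == +1 else -s) * G[t]
        theta += alpha * grad / batch_size   # update BETWEEN batches
    return theta

random.seed(0)
theta = reinforce()
print(sigmoid(theta))  # probability of moving right, now close to 1
```

Note that only the $(+1, +1)$ trajectory contributes a nonzero gradient term, so every update pushes the right-move probability upward — exactly the reinforcement mechanism described above.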
Unbiasedness
The Monte Carlo estimator $\hat{g}$ is unbiased, meaning $\mathbb{E}[\hat{g}] = \nabla_\theta J(\theta)$. The proof is identical to the one in Part 1: each trajectory $\tau_i$ is drawn from the same distribution $P(\tau)$, so each per-trajectory term has expectation $\nabla_\theta J(\theta)$, and averaging over $N$ such terms does not change the expected value.

This means that even though any single estimate might be noisy (because we only sampled $N$ trajectories instead of considering all possible ones), the estimate is correct on average. Over many batches and many updates, the noise averages out and the policy converges to one that earns high reward.
Intuition
For each action in a sampled trajectory, the gradient estimate includes the term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$. If the reward-to-go from that point onward is large, this term pushes the parameters in a direction that makes that action more probable in that state. If the reward-to-go is zero, the action contributes nothing to the gradient. Over many samples, trajectories that led to good outcomes get reinforced (made more likely), while trajectories that led to poor outcomes are effectively ignored. This is the fundamental mechanism of policy gradient methods, and it is why the field is called reinforcement learning — good behaviour is reinforced through probability increases.
Summary
An action is a choice the model makes, and the reward is the scalar signal telling us how good that choice was. The policy $\pi_\theta(a \mid s)$ defines the probability of taking action $a$ in state $s$, controlled by the parameter $\theta$ — a knob (or vector of knobs) that we tune during training. The objective $J(\theta)$ is the average reward earned under the current policy, and the gradient $\nabla_\theta J(\theta)$ tells us the direction to change $\theta$ to increase that average. A trajectory $\tau$ is the complete sequence of states and actions in one episode. The log trick — the identity $\nabla_\theta P = P\, \nabla_\theta \log P$ — converts the gradient into an expectation we can estimate by sampling. And REINFORCE is the algorithm that ties it all together: sample trajectories, estimate the gradient, update the parameters, repeat.
Everything was derived from two small examples — a math quiz for the single-step case and a grid world for the multi-step case. The same structure scales to language models with vocabularies of 50,000 tokens, game-playing agents navigating millions of states, and robots learning to walk. The algebra is identical; only the size of the action space and the length of the trajectories change.