Why dividing by the number of successes instead of the batch size changes what your gradient estimator optimizes — and how this connects REINFORCE, maximum likelihood, and pass@k through one clean mathematical identity.
This post explains the core ideas behind the MaxRL paper — a surprisingly clean result that connects Reinforcement Learning, maximum likelihood training, and pass@k evaluation through a single mathematical identity. We will derive everything from scratch using one running example, the same way we built RL from the ground up in the previous posts.
We will use one prompt for the entire post: x = “2 + 3 = ?”. Our model mθ can generate three possible outputs when given this prompt:
Output z=A: the correct answer “5”
Output z=B: the wrong answer “4”
Output z=C: the wrong answer “6”
The model assigns probabilities to each output (these depend on the parameters θ):
mθ(A∣x) = 0.1,  mθ(B∣x) = 0.6,  mθ(C∣x) = 0.3
There is a verifier f(⋅) that checks whether an output is correct. We define a binary reward:
r(x,z) = 1 if z is correct, 0 if z is wrong
In our example: r(x,A)=1, r(x,B)=0, r(x,C)=0. The model currently assigns only 10% probability to the correct answer — it is a weak model that usually gets this question wrong.
What Are We Optimizing?
Fix one prompt x. The model has a probability of producing a correct answer, which we write as:
pθ(x) = ∑_{z : r(x,z)=1} mθ(z∣x)
In our example, only output A is correct, so pθ(x)=mθ(A∣x)=0.1. This is the probability of success on a single try.
Different training objectives amount to different functions of this probability.
RL objective. Standard RL maximizes the probability of success directly:
JRL(x)=pθ(x)
If pθ(x)=0.1, then JRL(x)=0.1.
ML objective. Maximum likelihood maximizes the log of the probability:
JML(x)=logpθ(x)
If pθ(x)=0.1, then JML(x)=log(0.1)≈−2.3.
MaxRL objective. The paper defines a new family of objectives parameterized by N:
JMaxRL(N)(x) = −∑_{k=1}^{N} (1/k) (1−pθ(x))^k
This is a truncated version of logpθ(x), and N controls how many terms you keep. When N=1 you get the RL objective up to an additive constant (same gradient). As N→∞ you recover the ML objective exactly. The reason this truncation matters — and why N corresponds to the number of samples you draw — is what the rest of this post will explain.
pass@k and the Bridge to Logarithms
Before touching any gradients, we need one key mathematical connection. It starts with a simple idea: what if instead of asking the model once, you ask it k times and accept if at least one answer is correct?
pass@k
If the model has probability p of being correct on a single try, then the probability that all k tries fail is (1−p)^k. So the probability that at least one try succeeds is:
pass@k = 1 − (1−p)^k
In our example with p=0.1: pass@1 is just 0.1, pass@3 is 1−0.9^3 = 1−0.729 = 0.271, and pass@10 is 1−0.9^10 ≈ 0.651. Even a weak model starts looking decent if you give it enough tries.
We can also define the complement — the probability that all k tries fail:
fail@k = (1−p)^k
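These formulas are one-liners to verify. A minimal sketch in Python (the helper names are mine, not from any library):

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent tries succeeds."""
    return 1.0 - (1.0 - p) ** k

def fail_at_k(p, k):
    """Probability that all k independent tries fail."""
    return (1.0 - p) ** k

p = 0.1  # our weak model's single-try success probability
for k in (1, 3, 10):
    print(k, round(pass_at_k(p, k), 3))  # 1 -> 0.1, 3 -> 0.271, 10 -> 0.651
```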
The power series identity
There is a standard identity in mathematics known as the Maclaurin expansion (Taylor series expanded around 0) of log(1−q):
log(1−q) = −∑_{k=1}^{∞} q^k / k,  for ∣q∣ < 1
Now let q=1−p. Then 1−q=p, and:
log(p) = log(1 − (1−p)) = −∑_{k=1}^{∞} (1−p)^k / k
But we already said that (1−p)^k = fail@k. So:
log(p) = −∑_{k=1}^{∞} fail@k / k
This is a surprising identity. It says that the log probability of success — which is the ML objective — secretly encodes information about failure probabilities across all possible numbers of attempts. Each term fail@k / k measures how likely it is that k independent tries all fail, weighted down by 1/k.
Numerical verification
Take p=0.1, so 1−p=0.9. The first few terms of the series are 0.9/1 = 0.9, 0.81/2 = 0.405, 0.729/3 = 0.243, 0.6561/4 ≈ 0.164, and so on.
The partial sums with the minus sign: −0.9, −1.305, −1.548, −1.712, and so on. The true value is log(0.1) ≈ −2.303. The series converges slowly because p is small (the model is weak), but it does converge. For stronger models with p closer to 1, the series converges much faster because (1−p)^k shrinks rapidly.
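The convergence claim is easy to reproduce. A short sketch of the partial sums, matching the numbers above:

```python
import math

p = 0.1
q = 1.0 - p  # 0.9
partial = 0.0
for k in range(1, 5):
    partial -= q ** k / k  # subtract the k-th term (1 - p)^k / k
    print(k, round(partial, 3))  # -0.9, -1.305, -1.548, -1.712
print(round(math.log(p), 3))  # true value: -2.303
```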
Differentiating the Series: Why ML Mixes pass@k Gradients
Now we take the derivative of logp with respect to the model parameters θ. This is where the connection between ML and pass@k becomes concrete.
Start from the series:
log pθ(x) = −∑_{k=1}^{∞} (1/k) (1−pθ(x))^k
Focus on one term and differentiate. We want:
∇θ( −(1/k) (1−pθ(x))^k )
First, differentiate the inner expression (1−pθ(x))^k using the chain rule. Let u = 1−pθ(x), so we need ∇θ u^k. The chain rule gives:
∇θ u^k = k u^{k−1} ∇θ u = −k (1−pθ(x))^{k−1} ∇θ pθ(x)
Meanwhile, pass@k = 1 − (1−pθ(x))^k, so ∇θ pass@k = k (1−pθ(x))^{k−1} ∇θ pθ(x). From this we can solve for the factor that appears in our gradient sum:
(1−pθ(x))^{k−1} ∇θ pθ(x) = (1/k) ∇θ pass@k
Substituting this into the gradient:
∇θ log pθ(x) = ∑_{k=1}^{∞} (1/k) ∇θ pass@k
This is the core identity. The ML gradient is an infinite weighted sum of pass@k gradients, where the weight on the k-th term is 1/k.
Standard RL only optimizes ∇θpθ(x)=∇θpass@1. It uses only the first-order information — how to improve single-try success. Maximum likelihood, on the other hand, simultaneously pushes the model to improve pass@1, pass@2, pass@3, and so on, with decreasing weights. ML is a richer training signal because it cares about multi-try behavior, not just single-try performance.
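Since every quantity here is a function of the scalar p, the identity can be checked with plain numbers: d/dp log p = 1/p, while d/dp pass@k = k(1−p)^{k−1}. Summing the 1/k-weighted pass@k derivatives should recover 1/p. A sketch, truncating the infinite sum:

```python
p = 0.1
N = 500  # truncation level; plenty of terms for p = 0.1

# d/dp pass@k = k * (1 - p)**(k - 1); the identity weights each term by 1/k.
series = sum((1.0 / k) * k * (1.0 - p) ** (k - 1) for k in range(1, N + 1))
print(round(series, 6), 1.0 / p)  # both ≈ 10: the series recovers the ML weight 1/p
```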
If we truncate the series at N terms instead of going to infinity, we get the gradient of the MaxRL objective:
∇θ JMaxRL(N)(x) = ∑_{k=1}^{N} (1/k) ∇θ pass@k
This is a finite approximation to the ML gradient that keeps the first N pass@k terms. The paper’s key result is that there is a simple estimator — one you can compute from samples — whose expectation equals exactly this truncated gradient.
Theorem 1: The ML Gradient Is the Conditional Expectation Over Successes
Before proving the main result about the estimator, we need a foundational fact about what the ML gradient actually looks like. This is Theorem 1 in the paper, and it says something elegant: to compute the gradient of logpθ(x), you only need to look at successful outputs.
Setup
Define the success set S:={z:f(z)=y∗(x)} — all outputs that are correct. In our example, S={A}. The probability of success is:
pθ(x) = ∑_{z∈S} mθ(z∣x)
Derivation
We want ∇θlogpθ(x). Start by differentiating pθ(x):
∇θ pθ(x) = ∇θ ∑_{z∈S} mθ(z∣x) = ∑_{z∈S} ∇θ mθ(z∣x)
The gradient passes inside the sum because differentiation is linear. Now apply the log-derivative identity — the same log trick from the previous posts. For any function mθ:
∇θ log mθ(z∣x) = (1/mθ(z∣x)) ∇θ mθ(z∣x)
Multiply both sides by mθ(z∣x):
mθ(z∣x)⋅∇θlogmθ(z∣x)=∇θmθ(z∣x)
This lets us replace each ∇θmθ(z∣x) in the sum:
∇θ pθ(x) = ∑_{z∈S} mθ(z∣x) ∇θ log mθ(z∣x)
We want ∇θ log pθ(x), not ∇θ pθ(x). The chain rule for logarithms says ∇θ log u = (1/u) ∇θ u, so:
∇θ log pθ(x) = (1/pθ(x)) ∇θ pθ(x)
Now substitute the expression for ∇θpθ(x) that we just derived:
= (1/pθ(x)) ∑_{z∈S} mθ(z∣x) ∇θ log mθ(z∣x)
We can push the 1/pθ(x) inside the sum and group it with mθ(z∣x):
= ∑_{z∈S} (mθ(z∣x) / pθ(x)) ∇θ log mθ(z∣x)
Now recognize what mθ(z∣x) / pθ(x) means. The definition of conditional probability says:
Pr(z∣S) = Pr(z and S) / Pr(S)
If z∈S, then choosing output z automatically means success, so the event ”z and success” is just the event "z". Therefore Pr(z and S)=mθ(z∣x). And Pr(S)=pθ(x). So:
mθ(z∣x, S) = mθ(z∣x) / pθ(x),  for z∈S
Substituting this into our expression:
∇θ log pθ(x) = ∑_{z∈S} mθ(z∣x, S) ∇θ log mθ(z∣x)
The right side is the definition of a conditional expectation — it sums a function of z weighted by the conditional probability of z given success. So:
∇θ log pθ(x) = E_{z∼mθ(⋅∣x)}[ ∇θ log mθ(z∣x) ∣ S ]
The ML gradient is the expected score function, conditioned on the output being correct. In plain language: to increase logpθ(x), look only at correct outputs and push up their log-probabilities on average. This is the mathematical reason why, in the estimator, we will average only over successes rather than over the entire batch.
Numerical check
In our running example, S={A} and pθ(x)=0.1. Let’s verify both sides of the identity match.
Left side (direct gradient). Since A is the only correct output, pθ(x) = mθ(A∣x), so ∇θ log pθ(x) = ∇θ log mθ(A∣x).
Right side (conditional expectation). The conditional distribution given success puts all its mass on A, because A is the only correct output: mθ(A∣x, S) = mθ(A∣x)/pθ(x) = 0.1/0.1 = 1. The conditional expectation is therefore just ∇θ log mθ(A∣x).
Both sides give ∇θ log mθ(A∣x). The identity checks out.
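Theorem 1 can also be checked numerically with a concrete parameterization. The sketch below is my own toy setup, not from the paper: a softmax over three logits plays the role of mθ, the left side is computed by finite differences, and the right side by the conditional expectation of the score over successes.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

# Toy 3-logit parameterization of m_theta(.|x); outputs are [A, B, C].
theta = [math.log(0.1), math.log(0.6), math.log(0.3)]  # softmax gives (0.1, 0.6, 0.3)
success = [True, False, False]  # only A ("5") is correct

def log_p(th):
    probs = softmax(th)
    return math.log(sum(pr for pr, s in zip(probs, success) if s))

# Left side: finite-difference gradient of log p_theta(x).
eps = 1e-6
lhs = []
for j in range(3):
    up, down = theta[:], theta[:]
    up[j] += eps
    down[j] -= eps
    lhs.append((log_p(up) - log_p(down)) / (2 * eps))

# Right side: conditional expectation of the score, given success.
# Softmax score of output z: d(log probs[z]) / d(theta_j) = 1{j=z} - probs[j].
probs = softmax(theta)
p_succ = sum(pr for pr, s in zip(probs, success) if s)
rhs = [0.0, 0.0, 0.0]
for z in range(3):
    if success[z]:
        w = probs[z] / p_succ  # conditional probability of z given success
        for j in range(3):
            rhs[j] += w * ((1.0 if j == z else 0.0) - probs[j])

print([round(v, 4) for v in lhs])  # [0.9, -0.6, -0.3]
print([round(v, 4) for v in rhs])  # matches the left side
```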
Theorem 2: The Divide-by-K Estimator
This is the main result of the paper. It says that a simple modification to REINFORCE — dividing by the number of successes K instead of the batch size N — produces an unbiased estimator for the truncated MaxRL gradient. The number of samples N is not just a variance knob; it literally determines how many terms of the infinite series you are optimizing.
The estimator
Fix a prompt x. Sample N outputs independently from the model (i.i.d.). All expectations below are over the randomness of this sampled batch.
z1,z2,…,zN∼mθ(⋅∣x)
For each sample, compute:
The success indicator: ri=1{f(zi)=y∗(x)}, which is 1 if the output is correct and 0 otherwise.
The score function: Si=∇θlogmθ(zi∣x), the gradient of the log-probability with respect to parameters.
The total number of successes: K = ∑_{i=1}^{N} ri.
The paper’s estimator is:
gNb(x) = (1/K) ∑_{i=1}^{N} ri Si  if K ≥ 1,  and  0  if K = 0
Compare this to what REINFORCE would do. Standard REINFORCE averages over the entire batch: (1/N) ∑_{i=1}^{N} ri Si. The MaxRL estimator gNb(x) instead averages only over the successful samples — the numerator ∑ ri Si picks out the score functions of correct outputs (since ri=0 zeroes out the wrong ones), and the denominator K counts how many correct outputs there were.
Concrete example
Suppose we sample N=10 outputs from our model on the prompt “2 + 3 = ?”. With pθ(x)=0.1, we typically get 0 or 1 correct answer.
Case: exactly one success. Suppose only sample 7 is correct (z7=A), so r7=1 and all other ri=0. Then K=1, the numerator is r7S7=S7, and:
REINFORCE gives S7/10 — the correct gradient signal diluted by a factor of 10 because nine failures are in the average.
MaxRL gives S7/1=S7 — the full gradient signal from the one success.
MaxRL amplifies the rare success by a factor of 10 in this case. This factor is not arbitrary — it corresponds to 1/pθ(x)=1/0.1=10, which is exactly the amplification factor in the ML gradient from Theorem 1.
Case: two successes. Suppose samples 3 and 9 are correct. Then K=2, and:
REINFORCE gives (S3+S9)/10.
MaxRL gives (S3+S9)/2 — the average over the two successes only.
Again, MaxRL does not dilute the signal by counting failures in the denominator.
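The two estimators are easy to compare on a toy batch. In this sketch (scalar stand-ins for the score vectors, helper names are mine), sample 7 is the lone success with score S7 = 2.0:

```python
def reinforce_estimate(rewards, scores):
    """Batch average: divide by the batch size N."""
    return sum(r * s for r, s in zip(rewards, scores)) / len(rewards)

def maxrl_estimate(rewards, scores):
    """Divide by the number of successes K (return 0 if the batch has none)."""
    K = sum(rewards)
    if K == 0:
        return 0.0
    return sum(r * s for r, s in zip(rewards, scores)) / K

# Sample 7 is the lone success; its (scalar stand-in) score is 2.0.
rewards = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
scores = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
print(reinforce_estimate(rewards, scores))  # 0.2  (S7 diluted by N = 10)
print(maxrl_estimate(rewards, scores))      # 2.0  (full signal from the success)
```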
The theorem
The claim is:
E[gNb(x)]=∇θJMaxRL(N)(x)
We already know the right side equals (∑_{k=1}^{N} (1−pθ(x))^{k−1}) ∇θ pθ(x) from differentiating the truncated series. So our job is to show the left side equals the same expression.
Proof
Step 1 — Symmetry. Start from the definition and take the expectation:
E[gNb(x)] = E[ (1/K) ∑_{i=1}^{N} ri Si ] = ∑_{i=1}^{N} E[ ri Si / K ]
All N samples are i.i.d., so each term in the sum has the same expectation. Therefore:
E[gNb(x)] = N · E[ r1 S1 / K ]
Step 2 — Condition on sample 1. We split the expectation into two cases using the law of total expectation:
E[ r1 S1 / K ] = Pr(r1=1) · E[ r1 S1 / K ∣ r1=1 ] + Pr(r1=0) · E[ r1 S1 / K ∣ r1=0 ]
When r1=0, the numerator r1 S1 = 0 · S1 = 0, so the entire second term vanishes. When r1=1, we have r1 S1 = S1. So:
E[ r1 S1 / K ] = Pr(r1=1) · E[ S1 / K ∣ r1=1 ]
Since Pr(r1=1)=pθ(x):
= pθ(x) · E[ S1 / K ∣ r1=1 ]
Step 3 — Split K into sample 1 and the rest. When r1=1, the total number of successes is K = 1 + K₋₁, where K₋₁ = ∑_{i=2}^{N} ri counts successes among the other N−1 samples. Since the samples are independent, S1 (which depends only on z1) is independent of K₋₁ (which depends only on z2,…,zN). So the expectation factors:
E[ S1 / (1 + K₋₁) ∣ r1=1 ] = E[ S1 ∣ r1=1 ] · E[ 1 / (1 + K₋₁) ]
Step 4 — Recognize pθ(x)⋅E[S1∣r1=1]=∇θpθ(x). This follows from the single-sample policy gradient identity, which we derive now. Take one sample z∼mθ(⋅∣x) with r=1{z∈S} and S=∇θlogmθ(z∣x). Write out E[rS] as a sum over all outputs:
E[rS] = ∑_z mθ(z∣x) · r(x,z) · ∇θ log mθ(z∣x)
Since r(x,z)=0 for z ∉ S, the sum reduces to correct outputs only:
= ∑_{z∈S} mθ(z∣x) · ∇θ log mθ(z∣x)
Apply the log trick in reverse — replace m⋅∇logm with ∇m:
= ∑_{z∈S} ∇θ mθ(z∣x)
Pull the gradient outside the sum:
= ∇θ ∑_{z∈S} mθ(z∣x) = ∇θ pθ(x)
So we have established that E[rS]=∇θpθ(x). But we can also decompose E[rS] using the law of total expectation:
E[rS] = Pr(r=1) · E[S ∣ r=1] + Pr(r=0) · E[0 · S ∣ r=0]
The second term vanishes because r=0 zeroes it out. So:
E[rS]=pθ(x)⋅E[S∣r=1]
Combining both expressions for E[rS]:
pθ(x)⋅E[S1∣r1=1]=∇θpθ(x)
Now substitute this, along with the result from Step 3, into our running expression:
E[gNb(x)] = N · ∇θ pθ(x) · E[ 1 / (1 + K₋₁) ]
Everything now reduces to computing one scalar quantity: E[ 1 / (1 + K₋₁) ].
Step 5 — Compute E[ 1 / (1 + K₋₁) ] exactly. Since each of the N−1 remaining samples succeeds independently with probability pθ(x), we have K₋₁ ∼ Binomial(N−1, pθ(x)). Writing out the expectation, and using the integral identity 1/(1+j) = ∫₀¹ t^j dt:
E[ 1 / (1 + K₋₁) ] = ∑_{j=0}^{N−1} C(N−1, j) pθ(x)^j (1−pθ(x))^{N−1−j} · (1/(1+j)) = ∫₀¹ ∑_{j=0}^{N−1} C(N−1, j) (pθ(x)·t)^j (1−pθ(x))^{N−1−j} dt
The sum inside the integral has the form ∑_{j=0}^{n} C(n, j) a^j b^{n−j} with n=N−1, a=pθ(x)·t, and b=1−pθ(x). By the binomial theorem, this equals (a+b)^n = ((1−pθ(x)) + pθ(x)·t)^{N−1}. So:
E[ 1 / (1 + K₋₁) ] = ∫₀¹ ((1−pθ(x)) + pθ(x)·t)^{N−1} dt
To evaluate this integral, substitute u = (1−pθ(x)) + pθ(x)·t. Then du = pθ(x) dt, which means dt = du / pθ(x). The limits change: when t=0, u=1−pθ(x); when t=1, u=1. The integral becomes:
E[ 1 / (1 + K₋₁) ] = (1/pθ(x)) ∫_{1−pθ(x)}^{1} u^{N−1} du = (1 − (1−pθ(x))^N) / (N · pθ(x))
Substituting back into the expression from Step 4:
E[gNb(x)] = N · ∇θ pθ(x) · (1 − (1−pθ(x))^N) / (N · pθ(x)) = ((1 − (1−pθ(x))^N) / pθ(x)) · ∇θ pθ(x)
Since (1 − (1−pθ(x))^N) / pθ(x) = ∑_{k=1}^{N} (1−pθ(x))^{k−1} (a finite geometric series), this is exactly the gradient of the truncated objective, which completes the proof.
The divide-by-K estimator gNb(x) is an unbiased estimator for the gradient of the truncated MaxRL objective JMaxRL(N)(x). The number of samples N is not just a computational budget — it determines how many terms of the infinite series log pθ(x) = −∑_{k=1}^{∞} (1−p)^k / k you are unbiasedly optimizing. With N=1, you optimize only ∇θ pass@1 (standard RL). With N=10, you optimize a weighted combination of ∇θ pass@1 through ∇θ pass@10. As N→∞, you recover the full ML gradient.
The intuition for why N shows up as the truncation level is that your batch of N samples can directly witness events like “first success happens at try k” only up to k=N. The estimator naturally encodes pass@k information for k≤N because those are the multi-try success events that N samples can represent.
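The binomial expectation at the heart of Step 5 is easy to sanity-check by simulation. A minimal sketch (variable names are mine) comparing a Monte Carlo estimate of E[1/(1+K₋₁)] against the closed form (1−(1−p)^N)/(Np):

```python
import random

random.seed(0)
p, N, trials = 0.1, 10, 200_000

# Monte Carlo estimate of E[1 / (1 + K_rest)], K_rest ~ Binomial(N - 1, p).
acc = 0.0
for _ in range(trials):
    k_rest = sum(1 for _ in range(N - 1) if random.random() < p)
    acc += 1.0 / (1.0 + k_rest)
mc = acc / trials

closed_form = (1.0 - (1.0 - p) ** N) / (N * p)
print(round(mc, 2), round(closed_form, 2))  # both ≈ 0.65
```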
Variance Reduction: The Control Variate
The estimator gNb(x) is unbiased, but it can have high variance — especially when pθ(x) is small and most batches contain zero successes. The paper introduces a modified estimator that reduces this variance without changing the expected value:
gNe(x) = gNb(x) − (1/N) ∑_{i=1}^{N} Si
The first term is the original divide-by-K estimator. The second term is the average of all score functions across the entire batch — both successes and failures. This second term is the control variate.
Why the control variate has expectation zero
We need to show that E[ (1/N) ∑_{i=1}^{N} Si ] = 0. By linearity, it suffices to show E[Si] = 0 for a single sample.
Take one sample z∼mθ(⋅∣x) and expand the expectation of S=∇θlogmθ(z∣x):
E[S] = ∑_z mθ(z∣x) ∇θ log mθ(z∣x)
Now apply the log trick in reverse to each term. Recall that m⋅∇logm=∇m:
E[S] = ∑_z ∇θ mθ(z∣x)
Pull the gradient outside the sum (differentiation is linear):
= ∇θ ( ∑_z mθ(z∣x) )
The sum ∑_z mθ(z∣x) equals 1 for any value of θ — this is the normalization constraint of a probability distribution. Differentiating a constant gives zero:
=∇θ(1)=0
So E[S]=0. This is the same score function identity we used in the REINFORCE derivation: the expected score under any distribution is always zero.
Therefore E[ (1/N) ∑_{i=1}^{N} Si ] = 0, and:
E[gNe(x)]=E[gNb(x)]−0=∇θJMaxRL(N)(x)
The modified estimator is still unbiased for the same truncated objective.
Why subtracting a zero-mean term helps
This is a general principle in statistics. If you have an estimator G of some quantity and you subtract a zero-mean random variable V that is correlated with G, the variance of G−V can be smaller than the variance of G alone. The formula is:
Var(G−V)=Var(G)−2Cov(G,V)+Var(V)
If G and V are positively correlated — meaning they tend to fluctuate in the same direction — then the covariance term is positive, and the reduction from −2Cov(G,V) can outweigh the addition of Var(V), making the overall variance smaller.
The intuition for why (1/N) ∑ Si correlates with (1/K) ∑ ri Si is that both terms are driven by the same underlying samples. When the batch happens to contain outputs with large score function magnitudes, both the weighted (success-only) average and the unweighted (all-sample) average tend to be large. Subtracting the unweighted average removes some of this common fluctuation, leaving a cleaner signal.
Compact form
The paper also writes the estimator in a combined form (for K≥1):
gNe(x) = ∑_{i=1}^{N} (ri/K − 1/N) Si
When K=0, use the piecewise definition above (equivalently, adopt the convention ri/K := 0 in that case).
The weight ri/K − 1/N is positive for successful samples (they get upweighted) and negative for failed samples (they get slightly downweighted). This is reminiscent of a baseline in REINFORCE — except here the “baseline” is 1/N rather than an estimated value function.
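As a quick check that the compact form matches the piecewise definition, here is a sketch with scalar stand-ins for the score functions (helper names are mine):

```python
def g_piecewise(rewards, scores):
    """Divide-by-K estimator minus the batch-average control variate."""
    N, K = len(rewards), sum(rewards)
    base = sum(r * s for r, s in zip(rewards, scores)) / K if K > 0 else 0.0
    return base - sum(scores) / N

def g_compact(rewards, scores):
    """Combined form: sum of (r_i/K - 1/N) * S_i, with r_i/K := 0 when K = 0."""
    N, K = len(rewards), sum(rewards)
    return sum(((r / K if K > 0 else 0.0) - 1.0 / N) * s
               for r, s in zip(rewards, scores))

rewards = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]  # successes at samples 3 and 9
scores = [0.5, -1.0, 2.0, 0.3, -0.7, 1.1, -0.2, 0.8, 1.5, -0.4]
print(abs(g_piecewise(rewards, scores) - g_compact(rewards, scores)) < 1e-12)   # True
print(abs(g_piecewise([0] * 10, scores) - g_compact([0] * 10, scores)) < 1e-12)  # True
```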
The Unified Weight Function View
There is another way to see the relationship between REINFORCE, ML, and MaxRL — one that makes the conceptual difference between the three methods completely transparent. All three objectives produce gradients that can be written in the form:
∇θJ=Ex[w(pθ(x))⋅∇θpθ(x)]
Here pθ(x) is the single-sample pass rate, ∇θpθ(x) is the basic REINFORCE direction that increases success probability, and w(p) is a weight function that depends only on how hard the prompt currently is. The gradient direction is the same across all three methods — it is always ∇θpθ(x), the raw signal for improving single-try success. What differs is how strongly each prompt contributes to the overall update, depending on its current difficulty.
For REINFORCE, the objective is JRL=Ex[pθ(x)], which differentiates to:
∇JRL=Ex[∇pθ(x)]
The weight function is w(p)=1 — all prompts contribute equally regardless of whether they are easy or hard.
For maximum likelihood, the objective is JML = E_x[log pθ(x)], and differentiating using the chain rule ∇ log p = (1/p) ∇p gives:
∇JML = E_x[ (1/pθ(x)) ∇pθ(x) ]
The weight function is w(p)=1/p, which means a prompt with p=0.1 gets weight 10, while a prompt with p=0.01 gets weight 100. ML aggressively amplifies the gradient signal from hard prompts.
For MaxRL with N samples, we showed in Theorem 2 that:
E[gNb(x)] = ((1 − (1−p)^N) / p) · ∇θ pθ(x)
So the weight function is w(p) = (1 − (1−p)^N) / p. When N=1, this reduces to:
w(p) = (1 − (1−p)) / p = p/p = 1
recovering REINFORCE. As N→∞, (1−p)N→0, so w(p)→1/p, recovering ML. For any finite N in between, the weight function interpolates smoothly. Taking our running example with p=0.1 and N=10:
w(0.1) = (1 − 0.9^10) / 0.1 = (1 − 0.3487) / 0.1 ≈ 6.5
Compare this to REINFORCE’s weight of 1 and ML’s weight of 10.
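The interpolation is easy to tabulate. A minimal sketch of the weight function (function name is mine):

```python
def w(p, N):
    """MaxRL weight function: 1 at N = 1 (REINFORCE), -> 1/p as N -> inf (ML)."""
    return (1.0 - (1.0 - p) ** N) / p

p = 0.1
for N in (1, 3, 10, 100):
    print(N, round(w(p, N), 2))  # 1 -> 1.0, 3 -> 2.71, 10 -> 6.51, 100 -> 10.0
```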
This unified view reveals that the entire paper reduces to a single question: how aggressively should you emphasize hard examples? REINFORCE treats all prompts uniformly. ML pushes hardest on the prompts the model currently struggles with, which can destabilize training. MaxRL provides a controlled middle ground — the parameter N directly governs how much extra attention hard prompts receive, letting you dial the difficulty emphasis up or down depending on your training needs.
The Full Picture
Let’s step back and see how all the pieces fit together.
Standard RL (REINFORCE) computes (1/N) ∑ ri Si. This estimates ∇θ pθ(x) — the gradient of pass@1. It treats each sample equally and only cares about single-try success.
Maximum likelihood computes (1/pθ(x)) ∇θ pθ(x), which is equivalent to E[S ∣ success]. This implicitly optimizes a weighted combination of all pass@k gradients. But it requires knowing pθ(x) exactly, which we don’t have.
MaxRL bridges the gap. By dividing by K (the observed number of successes) instead of N (the batch size), we get an estimator that is computable from samples alone — no knowledge of pθ(x) required — and whose expectation equals the gradient of a truncated approximation to logpθ(x). The truncation level equals N, the number of samples. More samples means a closer approximation to ML, which means richer multi-try training signal.
The control variate (subtracting the average score) keeps the estimator unbiased while reducing the variance that comes from the randomness of sampling. The result is a practical algorithm: sample N outputs, check which ones are correct, average the score functions of the correct ones, subtract the average score function of all outputs, and use that as your gradient estimate.
The paper’s central message is that REINFORCE and maximum likelihood are not separate training paradigms — they are endpoints of a single spectrum parameterized by N. At N=1, you get REINFORCE. At N=∞, you get ML. And for any finite N in between, you get a well-defined objective JMaxRL(N) with a simple, unbiased gradient estimator. The choice of N lets you smoothly trade off between the computational cost of generating more samples and the richness of the training signal you extract from them. As the unified weight function view showed, this entire spectrum boils down to one question: how aggressively should you emphasize hard examples? The weight function w(p) goes from uniform (w=1) at N=1 to maximal hard-example amplification (w=1/p) at N=∞, with N controlling where you sit on that curve.