SVM: Learning Process Rewards via Success Visitation Matching for Efficient RL

In many modern applications of reinforcement learning (RL), the natural reward for a task of interest is inherently sparse: a reward of 0 is given everywhere except upon task completion, where a reward of +1 is given. Training a policy to maximize such a sparse reward requires solving a challenging credit assignment problem, leading to slow or ineffective RL improvement. We propose a simple approach to transform a sparse outcome reward into a dense process reward. Our approach relies on training a discriminator to distinguish between previous successful and unsuccessful episodes, and using this discriminator to incentivize the RL-learned policy to match the state-action visitations of successful episodes while avoiding those of unsuccessful episodes. By incentivizing the policy to match the visitations over all states, not just those that correspond to task success, this reward provides dense feedback on whether progress is being made towards task completion and, as we show, provably does so without changing the optimal policy. Focusing on finetuning of robotic control policies, we demonstrate that our approach leads to significantly faster RL finetuning on both simulated and real-world manipulation tasks, as compared to simply maximizing the sparse outcome reward.

We consider the problem of finetuning a pretrained policy $\pi_{\text{pre}}$ to maximize a sparse outcome reward $r^{\mathrm{out}}$. To enable efficient RL finetuning, we seek to design a dense process reward that:

(a) Provides dense step-level feedback guiding the learner toward successful states, and

(b) Preserves the optimal policies under the original outcome reward.

Our approach is based on a simple principle: given past episodes labeled by the outcome reward $r^{\mathrm{out}}$, reward the learner for visiting state-action pairs that are associated with successful episodes, and penalize it for visiting those associated with unsuccessful episodes. Concretely, denoting by $\mathcal{D}^+$ and $\mathcal{D}^-$ the state-action pairs from successful and unsuccessful episodes, respectively, we train a discriminator to distinguish these two sets: $$\widehat{f} = {\text{argmax}_{f}}\; \mathbb{E}_{(s, a)\sim \mathcal{D}^+} [\log f(s,a)]+ \mathbb{E}_{(s, a)\sim \mathcal{D}^-}[\log (1-f(s,a))]$$ We then define the success visitation matching (SVM) process reward using the discriminator logit: $$r^{\mathrm{svm}}(s,a)= r^{\mathrm{out}}(s) + \lambda \cdot \mathrm{clip}_\beta \left ( \log \frac{\widehat{f}(s,a)}{1-\widehat{f}(s,a)} \right )$$ where $\mathrm{clip}_\beta(\cdot)$ clips its argument to the range $[-\beta,\beta]$. This yields a dense process reward that encourages the learner to visit state-action pairs likely to lead to success based on past observations (those where $\widehat{f}(s,a)$ is large), while balancing this with the true outcome reward. We show the following:

Theorem (informal). Assuming our environment has deterministic transitions, the optimal policy under $r^{\mathrm{svm}}$ is the same as the optimal policy under $r^{\mathrm{out}}$.

Thus, not only does $r^{\mathrm{svm}}$ provide dense supervision guiding the learner to the goal, but it does not change the optimal policy.
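As a concrete illustration of the reward definition above, the following minimal sketch computes $r^{\mathrm{svm}}$ from a discriminator probability. The function name `svm_reward`, the default values of `lam` and `beta`, and the `eps` guard against probabilities of exactly 0 or 1 are assumptions made for illustration, not values taken from the method.

```python
import math

def svm_reward(f_hat, r_out, lam=0.1, beta=5.0, eps=1e-6):
    """Return r_out + lam * clip_beta(log(f_hat / (1 - f_hat))).

    f_hat is the discriminator's probability that (s, a) comes from a
    successful episode. lam, beta, and eps are illustrative defaults.
    """
    # Keep the probability away from 0 and 1 so the logit stays finite.
    f = min(max(f_hat, eps), 1.0 - eps)
    logit = math.log(f / (1.0 - f))
    # clip_beta: restrict the logit to [-beta, beta].
    clipped = max(-beta, min(beta, logit))
    return r_out + lam * clipped
```

With an uninformative discriminator ($\widehat{f} = 0.5$) the logit is zero and the reward reduces to $r^{\mathrm{out}}$; clipping bounds the shaping term by $\lambda\beta$ in magnitude no matter how confident the discriminator is.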

Algorithm: Reinforcement Learning with SVM Process Reward

Input: $r^{\mathrm{svm}}$ weight $\lambda$, $r^{\mathrm{svm}}$ clipping $\beta$, pretrained $\pi_{\mathrm{pre}}$ (optional), initial rollouts $N_0$ (optional)
Collect $N_0$ trajectories with $\pi_{\mathrm{pre}}$, set $\mathcal{D}^+$ to successful trajectories, $\mathcal{D}^-$ to all others
Initialize discriminator $\widehat{f}$ on $\mathcal{D}^+$ and $\mathcal{D}^-$
Initialize $\pi_1$ to $\pi_{\mathrm{pre}}$ or a random policy
for $t = 1, 2, 3, \ldots$ do
Run $\pi_t$ for one episode, add trajectory to $\mathcal{D}^+$ if it is successful, otherwise to $\mathcal{D}^-$
Update $\widehat{f}$ on $\mathcal{D}^+$ and $\mathcal{D}^-$
Update $\pi_t$ to $\pi_{t+1}$ by maximizing reward $r^{\mathrm{svm}}(s,a) = r^{\mathrm{out}}(s) + \lambda \cdot \mathrm{clip}_\beta\left(\log \frac{\widehat{f}(s,a)}{1-\widehat{f}(s,a)}\right)$
end for
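The loop above can be sketched end to end on a toy problem. Everything concrete here is an assumption for illustration: a six-state deterministic chain stands in for the environment, tabular Q-learning stands in for the RL update, and the discriminator is tabular. For a tabular $f$, the discriminator objective is maximized pointwise by $f = p^+/(p^+ + p^-)$, where $p^\pm$ are the empirical state-action frequencies in $\mathcal{D}^\pm$, so the sketch computes this ratio (with smoothing) instead of training a network. It also skips the optional initial rollouts and starts from a random policy.

```python
import math
import random
from collections import Counter

N_STATES, GOAL, HORIZON = 6, 5, 8  # toy chain: states 0..5, start in state 0
MOVES = (-1, +1)                   # action 0 moves left, action 1 moves right

def step(s, a):
    """Deterministic transition, clamped to the ends of the chain."""
    return min(max(s + MOVES[a], 0), N_STATES - 1)

def rollout(policy, rng):
    """Run one episode; return the visited (s, a) pairs and a success flag."""
    s, traj = 0, []
    for _ in range(HORIZON):
        a = policy(s, rng)
        traj.append((s, a))
        s = step(s, a)
        if s == GOAL:
            return traj, True
    return traj, False

def fit_discriminator(pos, neg):
    """Tabular maximizer f = p+ / (p+ + p-), smoothed to stay inside (0, 1)."""
    n_pos, n_neg = sum(pos.values()) + 1, sum(neg.values()) + 1
    def f_hat(s, a):
        p_pos = (pos[(s, a)] + 0.5) / n_pos
        p_neg = (neg[(s, a)] + 0.5) / n_neg
        return p_pos / (p_pos + p_neg)
    return f_hat

def train(episodes=200, lam=0.1, beta=5.0, alpha=0.5, gamma=0.95, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    pos, neg = Counter(), Counter()  # counts of (s, a) in D+ and D-

    def policy(s, rng):
        # Epsilon-greedy over Q, breaking ties at random.
        if rng.random() < eps or q[s][0] == q[s][1]:
            return rng.randrange(2)
        return 0 if q[s][0] > q[s][1] else 1

    f_hat = fit_discriminator(pos, neg)
    for _ in range(episodes):
        # Run the current policy for one episode and file it by outcome.
        traj, success = rollout(policy, rng)
        (pos if success else neg).update(traj)
        # Refit the discriminator on the updated D+ and D-.
        f_hat = fit_discriminator(pos, neg)
        # Policy update: one Q-learning pass over the episode using r_svm.
        for i, (s, a) in enumerate(traj):
            last = i == len(traj) - 1
            r_out = 1.0 if (last and success) else 0.0
            f = f_hat(s, a)
            logit = math.log(f / (1.0 - f))
            r_svm = r_out + lam * max(-beta, min(beta, logit))
            bootstrap = 0.0 if last else gamma * max(q[step(s, a)])
            q[s][a] += alpha * (r_svm + bootstrap - q[s][a])
    return q, f_hat, pos, neg
```

Calling `q, f_hat, pos, neg = train()` returns the learned Q-table, the final discriminator, and the two visitation counters. In this sketch the outcome reward is paid on the transition that reaches the goal, and an episode cut off at the horizon is treated as terminal; both are simplifications of the toy setup.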

We evaluate SVM rewards in the RL finetuning regime on the $\texttt{LIBERO}$ benchmarks and $\texttt{RoboCasa}$ pick-and-place tasks, using DSRL and Residual RL as the finetuning algorithms. Across all tasks, SVM rewards lead to significantly faster finetuning compared to finetuning on only outcome rewards, as well as other reward shaping baselines.

To understand what the SVM reward learns, we visualize the reward values along several trajectories throughout training on five different $\texttt{LIBERO}$ tasks. We see that after a relatively small number of environment steps collected during training, the SVM reward has already converged to rewarding trajectories that are close to the optimal behavior, while penalizing trajectories that deviate from this.

We demonstrate that SVM scales to real-world robotic RL finetuning. We run DSRL on three real-world WidowX tasks: $\texttt{Pick and Place}$, $\texttt{Open Drawer}$, and $\texttt{Cover Knife}$. Across all three tasks, SVM rewards lead to significantly faster RL finetuning compared to using outcome rewards alone. The video provides a side-by-side comparison of DSRL with and without SVM rewards.

We further show that SVM can improve the efficiency of RL finetuning on pretrained generalist policies. Here, we consider running DSRL on $\pi_0$ on selected $\texttt{LIBERO}$ tasks. We see that the SVM process reward again yields substantial gains in sample efficiency, requiring roughly half as many environment steps to reach 90% success as RL without reward shaping.

Next, we seek to understand whether SVM rewards can also lead to improvement when running RL from scratch given a set of successful demonstrations. In the $\texttt{Robomimic}$ benchmark with RLPD, we see that SVM leads to substantial gains over running RL on only the outcome reward. However, the performance of SVM is essentially identical to that of GAIL. We hypothesize that in this case the rewards obtained by GAIL and SVM are approximately equivalent, since the discriminators learned by SVM and GAIL are trained on similar sets of positive and negative examples.

Given the discriminator $\widehat{f}$ or the successful episodes, one could in principle extract a policy in several other ways. We compare SVM against three such alternatives:

DSRL-BC: We add a behavioral cloning regularizer on successful transitions from $\mathcal{D}^+$.
$\widehat{f}$ Sampling: We use the discriminator for best-of-$N$ action sampling.
$\widehat{f}$ Maximization: We train the policy to maximize the discriminator score directly instead of a learned Q-function.

We find that SVM is significantly more effective than these alternative policy extraction methods.
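To make the $\widehat{f}$ Sampling alternative concrete, here is a minimal sketch of best-of-$N$ action selection with a discriminator. The names `best_of_n_action` and `sample_action`, their signatures, and the default `n` are hypothetical; this illustrates the baseline described above (which is found to be less effective than SVM), not the SVM method itself.

```python
import random

def best_of_n_action(state, sample_action, f_hat, n=8, rng=None):
    """Draw n candidate actions and keep the one the discriminator scores highest.

    sample_action(state, rng) is a hypothetical sampler for the base policy;
    f_hat(state, action) returns the discriminator's success probability.
    """
    rng = rng or random.Random(0)
    candidates = [sample_action(state, rng) for _ in range(n)]
    return max(candidates, key=lambda a: f_hat(state, a))
```

Unlike the SVM reward, this selection is greedy with respect to the discriminator at each step and involves no learned Q-function.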


Abstract

Method

RL Finetuning with SVM Process Rewards

Results

SVM Process Rewards Enable Efficient RL Finetuning

SVM Process Rewards Enable Efficient Real-World RL

Finetuning VLAs with SVM Process Rewards

RL from Demonstrations with SVM Process Rewards

How can we most effectively extract a policy from the SVM discriminator?
