Learning Process Rewards via Success Visitation Matching for Efficient RL

University of California, Berkeley

Abstract

In many modern applications of reinforcement learning (RL), the natural reward for a task of interest is inherently sparse: a reward of 0 is given everywhere except when the task is completed, when a reward of +1 is given. Training a policy to maximize such a sparse reward requires solving a challenging credit assignment problem, leading to slow or ineffective RL improvement. We propose a simple approach to transform a sparse outcome reward into a dense process reward. Our approach relies on training a discriminator to distinguish between previous successful and unsuccessful episodes, and using this discriminator to incentivize the RL-learned policy to match the state-action visitations of successful episodes, while avoiding those of unsuccessful episodes. By incentivizing the policy to match the visitations over all states, not just those that correspond to task success, this reward provides dense feedback on whether progress is being made towards task completion, and, we show, provably achieves this without changing the optimal policy. Focusing on finetuning of robotic control policies, we demonstrate that our approach leads to significantly faster RL finetuning performance on both simulated and real-world manipulation tasks, as compared to simply maximizing the sparse outcome reward.

Method

Overview of Success Visitation Matching: discriminator training, visitation matching reward, RL finetuning, and real-world deployment.

We consider the problem of finetuning a pretrained policy $\pi_{\text{pre}}$ to maximize a sparse outcome reward $r^{\mathrm{out}}$. To enable efficient RL finetuning, we seek to design a dense process reward that:

(a) Provides dense step-level feedback guiding the learner toward successful states, and

(b) Preserves the optimal policies under the original outcome reward.

Our approach is based on a simple principle: given past episodes labeled by outcome reward $r^{\mathrm{out}}$, reward the learner for visiting state-actions that are associated with successful episodes, and penalize it for visiting those associated with unsuccessful episodes. Concretely, denoting by $\mathcal{D}^+$ and $\mathcal{D}^-$ the state-action pairs from successful and unsuccessful episodes, respectively, we train a discriminator to distinguish these two sets by maximizing the log-likelihood of the success labels:

$$\widehat{f} = {\text{argmax}_{f}}\; \mathbb{E}_{(s, a)\sim \mathcal{D}^+} [\log f(s,a)]+ \mathbb{E}_{(s, a)\sim \mathcal{D}^-}[\log (1-f(s,a))]$$

We then define the success visitation matching (SVM) process reward with the discriminator logit:

$$r^{\mathrm{svm}}(s,a)= r^{\mathrm{out}}(s) + \lambda \cdot \mathrm{clip}_\beta \left ( \log \frac{\widehat{f}(s,a)}{1-\widehat{f}(s,a)} \right )$$

where $\mathrm{clip}_\beta(\cdot)$ clips the value to be within the range $[-\beta,\beta]$ and $\lambda$ weights the shaping term against the outcome reward. This yields a dense process reward that encourages the learner to visit state-actions likely to lead to success based on past observations (those where $\widehat{f}(s,a)$ is large), while balancing this with the true outcome reward. We show the following:

Theorem (informal). Assuming our environment has deterministic transitions, the optimal policy under $r^{\mathrm{svm}}$ is the same as the optimal policy under $r^{\mathrm{out}}$.
Thus, not only does $r^{\mathrm{svm}}$ provide dense supervision guiding the learner to the goal, but it does not change the optimal policy.
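
To make the reward concrete, the following minimal Python sketch computes $r^{\mathrm{svm}}$ from a discriminator output. It assumes $\widehat{f}(s,a)$ is available as a probability in $[0,1]$; the function name `svm_reward`, the default values of `lam` and `beta`, and the `eps` guard against infinite logits are illustrative choices, not taken from the paper.

```python
import math


def svm_reward(f_hat, r_out, lam=0.1, beta=20.0, eps=1e-8):
    """Sparse outcome reward plus the weighted, clipped discriminator logit.

    f_hat: discriminator probability that (s, a) comes from a successful episode.
    r_out: sparse outcome reward at this step (0 or 1).
    lam, beta: shaping weight and clipping range (placeholder values).
    """
    # Keep the probability away from 0 and 1 so the logit stays finite (assumption).
    p = min(max(f_hat, eps), 1.0 - eps)
    logit = math.log(p / (1.0 - p))
    # clip_beta: restrict the logit to [-beta, beta].
    clipped = min(max(logit, -beta), beta)
    return r_out + lam * clipped
```

For instance, with `lam=1.0` and `beta=1.0`, a non-terminal state-action pair with $\widehat{f} = 0.9$ has logit $\log 9 \approx 2.2$, which is clipped to $1$, so the resulting reward is $1$. A pair with $\widehat{f} = 0.5$ receives no shaping term at all.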

RL Finetuning with SVM Process Rewards

We propose running RL with the SVM process reward, iteratively updating $\widehat{f}$ online as new observations are collected.

Algorithm: Reinforcement Learning with SVM Process Reward

  1. Input: $r^{\mathrm{svm}}$ weight $\lambda$, $r^{\mathrm{svm}}$ clipping $\beta$, pretrained $\pi_{\mathrm{pre}}$ (optional), initial rollouts $N_0$ (optional)
  2. Collect $N_0$ trajectories with $\pi_{\mathrm{pre}}$, set $\mathcal{D}^+$ to successful trajectories, $\mathcal{D}^-$ to all others
  3. Initialize discriminator $\widehat{f}$ on $\mathcal{D}^+$ and $\mathcal{D}^-$
  4. Initialize $\pi_1$ to $\pi_{\mathrm{pre}}$ or to a random policy
  5. for $t = 1, 2, 3, \ldots$ do
  6. Run $\pi_t$ for one episode, add trajectory to $\mathcal{D}^+$ if it is successful, otherwise to $\mathcal{D}^-$
  7. Update $\widehat{f}$ on $\mathcal{D}^+$ and $\mathcal{D}^-$
  8. Update $\pi_t$ to $\pi_{t+1}$ by maximizing reward $r^{\mathrm{svm}}(s,a) = r^{\mathrm{out}}(s) + \lambda \cdot \mathrm{clip}_\beta\left(\log \frac{\widehat{f}(s,a)}{1-\widehat{f}(s,a)}\right)$
  9. end for
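
The loop above can be sketched on a toy problem. The following Python code is an illustration under stated assumptions, not the authors' implementation: it uses a hypothetical five-state deterministic chain, replaces the learned discriminator with smoothed visitation counts, uses a uniformly random policy in place of $\pi_{\mathrm{pre}}$, and uses tabular Q-learning as the policy update. All names and hyperparameter values are placeholders.

```python
import math
import random
from collections import Counter

N_STATES, GOAL, HORIZON = 5, 4, 8    # toy deterministic chain (hypothetical)
ACTIONS = (0, 1)                     # 0 = left, 1 = right


def step(s, a):
    """Deterministic transition; the outcome reward is 1 only on reaching GOAL."""
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == GOAL)


def rollout(policy, rng):
    """Run one episode; return its (s, a, s', r_out) transitions and a success flag."""
    s, traj = 0, []
    for _ in range(HORIZON):
        a = policy(s, rng)
        s2, r = step(s, a)
        traj.append((s, a, s2, r))
        s = s2
        if r > 0:
            return traj, True
    return traj, False


def fit_discriminator(pos, neg, smooth=1.0):
    """Count-based stand-in for f_hat: p_plus / (p_plus + p_minus), with smoothing."""
    n_cells = N_STATES * len(ACTIONS)
    z_pos = sum(pos.values()) + smooth * n_cells
    z_neg = sum(neg.values()) + smooth * n_cells
    pos, neg = dict(pos), dict(neg)  # snapshot of the current datasets

    def f_hat(s, a):
        p_pos = (pos.get((s, a), 0) + smooth) / z_pos
        p_neg = (neg.get((s, a), 0) + smooth) / z_neg
        return p_pos / (p_pos + p_neg)

    return f_hat


def svm_reward(f_hat, s, a, r_out, lam, beta):
    """Outcome reward plus the weighted, clipped discriminator logit."""
    p = f_hat(s, a)  # strictly inside (0, 1) because of the smoothing
    logit = math.log(p / (1.0 - p))
    return r_out + lam * max(-beta, min(beta, logit))


def train(n_init=10, n_iters=200, lam=0.1, beta=5.0,
          lr=0.5, gamma=0.95, eps=0.2, seed=0):
    rng = random.Random(seed)
    pos, neg = Counter(), Counter()
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    def store(traj, success):
        (pos if success else neg).update((s, a) for s, a, _, _ in traj)

    def pre_policy(s, rng):          # stand-in for pi_pre: uniformly random
        return rng.choice(ACTIONS)

    def policy(s, rng):              # epsilon-greedy on the current Q-table
        if rng.random() < eps:
            return rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: q[(s, a)])

    for _ in range(n_init):                      # step 2: initial rollouts
        store(*rollout(pre_policy, rng))
    f_hat = fit_discriminator(pos, neg)          # step 3
    successes = 0
    for _ in range(n_iters):                     # step 5
        traj, success = rollout(policy, rng)     # step 6
        store(traj, success)
        successes += success
        f_hat = fit_discriminator(pos, neg)      # step 7
        for s, a, s2, r_out in traj:             # step 8: Q-learning on r_svm
            r = svm_reward(f_hat, s, a, r_out, lam, beta)
            bootstrap = 0.0 if r_out > 0 else gamma * max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] += lr * (r + bootstrap - q[(s, a)])
    return q, f_hat, pos, neg, successes
```

Calling `train()` returns the Q-table, the final count-based discriminator, both datasets, and the number of successful training episodes. In this sketch the discriminator is refit from scratch each iteration, which is cheap for counts; a neural discriminator would more plausibly take a few gradient steps per iteration.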

Results

SVM Process Rewards Enable Efficient RL Finetuning

Figure: $\texttt{LIBERO}$ kitchen scenes 1 to 3 and $\texttt{RoboCasa}$ banana, tomato, and mushroom scenes, followed by simulation results for all tasks.

We evaluate SVM rewards in the RL finetuning regime on the $\texttt{LIBERO}$ benchmarks and $\texttt{RoboCasa}$ pick-and-place tasks, using DSRL and Residual RL as the finetuning algorithms. Across all tasks, SVM rewards lead to significantly faster finetuning than both finetuning on outcome rewards alone and other reward shaping baselines.

To understand what the SVM reward learns, we visualize the reward values along several trajectories throughout training on five different $\texttt{LIBERO}$ tasks. We see that after a relatively small number of environment steps collected during training, the SVM reward has already converged to rewarding trajectories that are close to the optimal behavior, while penalizing trajectories that deviate from this.

Figure: SVM reward values along trajectories over the course of training (indexed by environment steps) for five $\texttt{LIBERO}$ tasks: open the bottom drawer, open the top drawer, put the black bowl on the plate, put the moka pot on the stove, and turn on the stove. The displayed scale runs from −20 to +20.

SVM Process Rewards Enable Efficient Real-World RL

Figure and video: real-world WidowX robot results for the pick and place, open drawer, and cover knife tasks, comparing RL finetuning with the outcome reward against RL finetuning with the SVM reward, with a DSRL training timelapse.

We demonstrate that SVM scales to real-world robotic RL finetuning. We run DSRL on three real-world WidowX tasks: $\texttt{Pick and Place}$, $\texttt{Open Drawer}$, and $\texttt{Cover Knife}$. Across all three tasks, SVM rewards lead to significantly faster RL finetuning compared to using outcome rewards alone. The video provides a side-by-side comparison of DSRL with and without SVM rewards.

Finetuning VLAs with SVM Process Rewards

Figure: results for finetuning VLAs with SVM process rewards.

We further show that SVM can improve the efficiency of RL finetuning for pretrained generalist policies. Here, we consider running DSRL on $\pi_0$ on selected $\texttt{LIBERO}$ tasks. We see that the SVM process reward again yields substantial gains in sample efficiency, requiring roughly half as many environment steps to reach 90% success as RL without reward shaping.

RL from Demonstrations with SVM Process Rewards

Figure: $\texttt{Robomimic}$ results on the can and square tasks.

Next, we seek to understand whether SVM rewards can also lead to improvement when running RL from scratch given a set of successful demonstrations. In the $\texttt{Robomimic}$ benchmark with RLPD, we see that SVM leads to substantial gains over running RL on only the outcome reward. However, the performance of SVM is essentially identical to that of GAIL. We hypothesize that in this case, the rewards obtained by GAIL and SVM are approximately equivalent, since the discriminators learned by SVM and GAIL are trained on similar sets of positive and negative examples.
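
A standard property of the cross-entropy discriminator objective may help make this hypothesis concrete. The following is a sketch under assumptions introduced here for illustration: the discriminator class is unrestricted, and $d^+$ and $d^-$ denote the state-action distributions from which $\mathcal{D}^+$ and $\mathcal{D}^-$ are drawn. Maximizing $\mathbb{E}_{d^+}[\log f] + \mathbb{E}_{d^-}[\log(1-f)]$ pointwise in $f(s,a)$ gives

$$f^\star(s,a) = \frac{d^+(s,a)}{d^+(s,a) + d^-(s,a)} \quad\Longrightarrow\quad \log \frac{f^\star(s,a)}{1-f^\star(s,a)} = \log \frac{d^+(s,a)}{d^-(s,a)}$$

so the logit used in $r^{\mathrm{svm}}$ is, in this idealized setting, a log density ratio between successful and unsuccessful visitations. If $\mathcal{D}^+$ consists mostly of demonstrations and $\mathcal{D}^-$ mostly of learner rollouts, this ratio compares roughly the same two distributions that a GAIL-style discriminator separates, which would be consistent with the similar performance observed here.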

How can we most effectively extract a policy from the SVM discriminator?

Figure: policy extraction results for scenes 1 to 3.

Given the discriminator $\widehat{f}$ or the successful episodes, one could in principle extract a policy in several other ways. We compare SVM against three such alternatives:

  1. DSRL-BC: We add a behavioral cloning regularizer on successful transitions from $\mathcal{D}^+$.
  2. $\widehat{f}$ Sampling: We use the discriminator for best-of-$N$ action sampling.
  3. $\widehat{f}$ Maximization: We train the policy to maximize the discriminator score directly instead of a learned Q-function.

We find that SVM is significantly more effective than these alternative policy extraction methods.
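
For reference, the $\widehat{f}$ Sampling alternative can be sketched as follows. This is a minimal illustration of best-of-$N$ selection in general, not the authors' baseline implementation: the function name `best_of_n_action`, the `sample_action` callable standing in for the base policy, and the default `n` are all hypothetical.

```python
import random


def best_of_n_action(state, sample_action, f_hat, n=8, rng=None):
    """Draw n candidate actions and return the one the discriminator scores highest.

    sample_action(state, rng) -> action: stand-in for sampling from the base policy.
    f_hat(state, action) -> float: discriminator score for the state-action pair.
    """
    rng = rng or random.Random()
    candidates = [sample_action(state, rng) for _ in range(n)]
    return max(candidates, key=lambda a: f_hat(state, a))


# Toy usage: a uniform base policy on [-1, 1] and a discriminator that
# happens to prefer actions near 0.5 (both invented for this example).
toy_policy = lambda s, rng: rng.uniform(-1.0, 1.0)
toy_f_hat = lambda s, a: 1.0 / (1.0 + abs(a - 0.5))
action = best_of_n_action(0, toy_policy, toy_f_hat, n=16, rng=random.Random(0))
```

Unlike running RL on $r^{\mathrm{svm}}$, this selection rule is greedy with respect to the per-step discriminator score and does not account for the long-horizon outcome reward.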

BibTeX

@article{tsao2026learning,
  author    = {Tsao, Raymond and Wagenmaker, Andrew and Levine, Sergey},
  title     = {Learning Process Rewards via Success Visitation Matching for Efficient RL},
  year      = {2026},
}