
Process supervision

Process supervision is a technique for training large language models (LLMs) by providing rewards or feedback on intermediate steps of a reasoning process, rather than solely on the final outcome. This approach, pioneered in research on mathematical reasoning, encourages models to generate human-like, verifiable reasoning traces, improving performance and alignment in complex, multi-step tasks. Introduced by OpenAI in May 2023, process supervision builds on chain-of-thought prompting techniques and has demonstrated superior results compared to outcome supervision—where only correct final answers are rewarded—particularly on benchmarks like the MATH dataset. For instance, process-supervised models achieve higher accuracy by avoiding reasoning shortcuts and producing more interpretable outputs. Key benefits include enhanced model alignment with human reasoning, reduced reliance on potentially deceptive strategies, and scalability to domains such as code generation and scientific problem-solving. As of 2025, it remains a foundational method for advancing AI capabilities in reasoning-intensive applications.

Introduction

Definition and Overview

Process supervision is a training approach in machine learning that provides labeled feedback or rewards at each intermediate step of a model's reasoning process, rather than solely on the final output, to guide the model toward correct reasoning throughout multi-step tasks. This approach emphasizes the supervision of the underlying reasoning chain, enabling models to learn structured, verifiable paths to solutions in complex domains such as mathematical problem-solving or logical inference. Unlike traditional supervised learning, which relies on input-output pairs to map data directly to end results, process supervision targets sequential processes like chain-of-thought reasoning, where the model generates and refines intermediate steps under targeted guidance. For instance, in addressing a math problem, supervision might label and reward each algebraic manipulation or logical deduction, ensuring accuracy at every stage rather than evaluating only the ultimate answer. As a baseline alternative, outcome supervision focuses exclusively on the correctness of the final result, potentially overlooking flawed intermediate reasoning. Within AI alignment research, process supervision serves as a key technique for enhancing model reliability and interpretability in intricate tasks, by aligning the model's internal processes with human-approved reasoning structures and mitigating risks from opaque or misaligned pathways. This method promotes more transparent and robust behavior, particularly in scenarios requiring long-horizon planning or factual accuracy.
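
To make the contrast concrete, the following minimal sketch (with hypothetical step texts and labels) shows the difference in feedback granularity: outcome supervision yields one scalar per solution, while process supervision yields one signal per step and therefore exposes a flawed intermediate step even when the final answer happens to be right.

```python
# Minimal sketch contrasting outcome- and process-level feedback
# for one chain-of-thought solution (illustrative values only).
from dataclasses import dataclass

@dataclass
class Step:
    text: str
    correct: bool              # human or verifier judgment of this step

@dataclass
class Solution:
    steps: list[Step]
    final_answer: str

def outcome_reward(sol: Solution, gold_answer: str) -> float:
    """Outcome supervision: a single signal for the whole trace."""
    return 1.0 if sol.final_answer == gold_answer else 0.0

def process_rewards(sol: Solution) -> list[float]:
    """Process supervision: one signal per intermediate step."""
    return [1.0 if s.correct else 0.0 for s in sol.steps]

sol = Solution(
    steps=[Step("The problem gives 2x + 3 = 11.", True),
           Step("Subtract 3 from both sides: 2x = 9, so x = 4.", False),  # algebra slip
           Step("Therefore x = 4.", True)],
    final_answer="4",
)
print(outcome_reward(sol, gold_answer="4"))  # 1.0 despite the flawed middle step
print(process_rewards(sol))                  # [1.0, 0.0, 1.0] flags the error
```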

Historical Development

Process supervision in AI, particularly for large language models, traces its roots to the early 2020s within the framework of reinforcement learning from human feedback (RLHF), which was pioneered in OpenAI's InstructGPT work to align models with human preferences through iterative fine-tuning. This approach built upon the chain-of-thought (CoT) prompting technique introduced in 2022, which encouraged models to generate intermediate reasoning steps to enhance performance on complex tasks like arithmetic and symbolic reasoning. Prior to the formalization of process supervision, related concepts appeared in pre-2023 research contrasting it with outcome supervision. In 2022, Uesato et al. explored outcome-based feedback, which evaluates only the final answer in math word problems, and process-based feedback, which assesses each reasoning step, demonstrating that the latter could steer models toward more reliable intermediate outputs. This work highlighted the limitations of focusing solely on end results, setting the stage for more granular supervision methods.

A pivotal milestone occurred in 2023 with OpenAI's publication of "Let's Verify Step by Step," which explicitly formalized process supervision as a training paradigm superior to outcome supervision for mathematical reasoning. The paper demonstrated that models trained with process reward models (PRMs) achieved substantially higher accuracy on the MATH dataset—reaching 78.2% compared to 72.4% for outcome supervision on a representative subset—by providing step-level feedback to correct errors early in the reasoning chain. To facilitate further research, the authors released the PRM800K dataset, comprising 800,000 human-annotated step-level correctness labels across 75,000 solutions derived from the MATH dataset.

Following this introduction, process supervision rapidly gained adoption in 2023 and 2024, particularly for enhancing mathematical reasoning in language models, with subsequent studies leveraging PRM800K to improve performance on benchmarks such as GSM8K. This period marked its establishment as a core technique, influencing a wave of research on step-wise verification in LLM training pipelines. These ideas also informed subsequent models, such as OpenAI's o1 series released in 2024, which built on reasoning-focused training to achieve state-of-the-art mathematical performance.

Theoretical Foundations

Comparison with Outcome Supervision

Outcome supervision involves training reward models based solely on the final result of a model's chain-of-thought reasoning, utilizing input-output pairs without evaluating intermediate steps. In contrast, process supervision provides feedback for each individual reasoning step, enabling more granular guidance during training. A primary methodological difference lies in their handling of cases where a model arrives at a correct final answer through flawed or incorrect intermediate reasoning. Process supervision can identify and penalize such errors by assessing step-wise correctness, thereby promoting reliable reasoning paths, whereas outcome supervision cannot distinguish between valid and spurious solutions, potentially reinforcing unreliable processes. This distinction is particularly evident in complex tasks like mathematical problem-solving, where intermediate steps are crucial for ensuring logical consistency. Empirical evidence from experiments on the MATH dataset demonstrates the superiority of process supervision. Models trained with process supervision achieved 78.2% accuracy, compared to 72.4% for those trained with outcome supervision, highlighting improved performance even when evaluated solely on final outcomes. Theoretically, process supervision mitigates risks associated with reward hacking by enforcing correctness at every step, which discourages models from taking unaligned shortcuts to achieve high final rewards. Outcome supervision, by focusing only on end results, may inadvertently encourage such exploitative behaviors, leading to misaligned model outputs. This approach was notably advanced in OpenAI's work on mathematical reasoning.
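
The practical difference shows up when scoring candidate solutions. Following the setup in "Let's Verify Step by Step," a PRM's solution-level score can be taken as the product of its per-step correctness probabilities, so a single implausible step collapses the score, whereas an ORM assigns one holistic score to the finished solution. The snippet below stubs out both models with hypothetical numbers purely to illustrate the contrast.

```python
# Hedged sketch: PRM vs. ORM scoring of the same candidate solution.
import math

def prm_step_probs(steps: list[str]) -> list[float]:
    # Placeholder for a trained PRM; would return P(step correct) for each step.
    return [0.98, 0.12, 0.97]          # hypothetical values

def prm_solution_score(steps: list[str]) -> float:
    """Solution score as the product of step-level correctness probabilities."""
    return math.prod(prm_step_probs(steps))

def orm_solution_score(steps: list[str], final_answer: str) -> float:
    # Placeholder for a trained ORM; scores the finished solution as a whole.
    return 0.91                        # hypothetical value

steps = ["Set up the equation.", "Divide both sides incorrectly.", "Report x = 4."]
print(prm_solution_score(steps))            # ~0.11: the flawed step drags it down
print(orm_solution_score(steps, "x = 4"))   # 0.91: only the final answer looks right
```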

Alignment and Interpretability Benefits

Process supervision enhances AI alignment by training models to follow human-endorsed reasoning patterns at each intermediate step, thereby promoting behaviors that more closely mirror human values in complex, multi-step tasks such as mathematical reasoning and code generation. Unlike approaches that focus solely on final outputs, this method directly rewards aligned chains-of-thought, reducing the risk of models developing misaligned strategies that achieve correct results through unintended paths. For instance, in reinforcement learning from human feedback (RLHF) extensions, step-level supervision improves policy adherence by providing granular feedback that guides the model toward logical and ethical reasoning processes.

A key interpretability benefit arises from the visibility of intermediate reasoning steps, which allows humans to inspect, audit, and debug the model's thought process more effectively than with opaque final outputs. This transparency enables identification of errors at specific points in the chain, fostering trust and facilitating iterative improvements in model behavior. By encouraging models to articulate and justify their decisions step by step, process supervision yields reasoning traces that are more human-interpretable and less prone to hidden biases or shortcuts.

Empirically, process supervision demonstrates superior sample efficiency for reward model training, achieving up to 2.6 times better data utilization through techniques like active learning, which prioritizes informative examples for human annotation. Recent RLHF-based methods, such as process-supervised policy optimization, further validate these gains, showing improved performance on reasoning benchmarks—e.g., 86.76% accuracy on AwpNLI—while outperforming outcome-supervised baselines.

Methods and Implementation

Data Annotation Techniques

Data annotation techniques for process supervision involve creating step-level labels for intermediate reasoning steps in tasks such as mathematical problem-solving, enabling models to learn verifiable trajectories rather than just final outcomes. Human annotation remains a foundational method, in which annotators label each step in a solution chain as correct, incorrect, or neutral based on its logical validity and reasonableness. For instance, in the creation of the PRM800K dataset, contracted labelers coordinated through Scale AI evaluated over 101,000 solutions from the MATH dataset, assigning labels up to the first incorrect step while referencing only the ground-truth final answer to avoid bias from full reference solutions. This process was divided into two phases: an initial exploratory phase collecting diverse step alternatives, followed by a scaled phase producing the bulk of the 800,000 high-quality labels across 75,000 solutions.

To enhance efficiency, active learning is integrated into the annotation pipeline by iteratively selecting uncertain or high-value steps for human labeling, thereby reducing the overall annotation burden. In the PRM800K development, active learning was employed using a preliminary process reward model (PRM) to generate and prioritize "convincing wrong-answer" solutions, which improved data diversity and efficiency by a factor of 2.6 compared to random sampling. This approach focuses annotation effort on steps where model uncertainty is highest, maximizing the informational gain per label and allowing for more targeted data collection.

Automated assistance complements human efforts by leveraging weaker models or large language models (LLMs) to draft initial step-level annotations or synthetic trajectories, which human annotators then refine. Methods like AutoPSV use an outcome-supervised verifier to assign confidence scores to steps and detect errors via confidence deltas, generating process labels without manual intervention and enabling hybrid human refinement for complex cases. Similarly, the OmegaPRM framework employs Monte Carlo Tree Search with binary search to automate error localization in chain-of-thought outputs, producing over 1.5 million synthetic process supervision annotations efficiently. The SPARE method further streamlines this by aligning model-generated steps with reference solutions in a single pass, using in-context learning to evaluate and annotate correctness, achieving 2.3 times faster annotation than prior automated techniques while supporting reward model training. These automated tools are particularly applied in mathematical reasoning datasets to scale process supervision beyond human limits.

Quality control protocols are essential to ensure the reliability of step-level labels, incorporating labeler screening, ongoing monitoring, and agreement metrics. In PRM800K's construction, initial labeler screening required 75% agreement on screening tasks, with continuous monitoring via random checks on 10-20 problems per batch to maintain accuracy above 90%; underperforming annotators were removed, and instructions were iteratively refined based on error patterns. Cross-annotator agreement is quantified using metrics such as Cohen's kappa, targeting values above 0.6 for ambiguous steps, while automated consistency checks filter out incomplete or outlier labels prior to dataset finalization. Such measures uphold the verifiability of annotations, which is critical for training robust process-supervised models.
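
Two of the pipeline pieces described above lend themselves to a compact illustration: truncating labels at the first incorrect step, and the active-learning preference for "convincing wrong-answer" solutions. The sketch below assumes hypothetical helpers (a `prm_score` callable and a simple candidate format) and is not the PRM800K tooling itself.

```python
# Hedged sketch of two annotation-pipeline steps described above.
# Labels follow a PRM800K-style convention: +1 correct, 0 neutral, -1 incorrect.

def truncate_at_first_error(step_labels: list[int]) -> list[int]:
    """Keep labels up to and including the first incorrect step."""
    kept = []
    for label in step_labels:
        kept.append(label)
        if label == -1:
            break
    return kept

def select_for_annotation(candidates, prm_score, gold_answer, budget):
    """Active-learning filter: surface solutions the current PRM finds convincing
    but that end in a wrong final answer, which are the most informative to label."""
    wrong = [c for c in candidates if c["final_answer"] != gold_answer]
    wrong.sort(key=lambda c: prm_score(c["steps"]), reverse=True)
    return wrong[:budget]

print(truncate_at_first_error([1, 1, -1, 1, 0]))   # -> [1, 1, -1]
```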

Training Paradigms

Process supervision extends traditional reinforcement learning from human feedback (RLHF) by incorporating step-wise rewards that evaluate intermediate actions rather than solely final outcomes, enabling more precise alignment during training. In this paradigm, algorithms like Proximal Policy Optimization (PPO) are adapted to optimize policies based on aggregated process rewards, where a step-wise reward model (SRM) assigns scores to each reasoning step using pairwise comparison losses, and generalized advantage estimation propagates these signals across the trajectory. For instance, STEP-RLHF applies step-PPO to update the policy at each step with a clipped surrogate objective, balancing exploration and exploitation while using the SRM to provide fine-grained feedback on mathematical reasoning chains. This approach has demonstrated improvements in problem-solving accuracy, achieving 20.40% on the MATH dataset compared to 19.26% for standard outcome-based RLHF.

A supervised fine-tuning (SFT) variant leverages step-labeled reasoning chains to train models directly on intermediate correctness, employing sequence-to-sequence loss over the full trajectory while aggregating errors from incorrect steps into unified gradient updates. This method initializes the policy on datasets like PRM800K, which contains 800,000 step-level annotations, allowing the model to learn verifiable step transitions without explicit reward modeling during this stage. By focusing on token-level predictions aligned with labeled correct steps, it enhances the model's ability to generate coherent multi-step processes, outperforming outcome-only SFT in generalization to out-of-distribution tasks.

Hybrid approaches integrate process and outcome supervision signals to balance detailed guidance with global correctness, often tuning weights via hyperparameter search to weigh step rewards against final-answer scores during optimization. In STEP-RLHF, hybrid weighting in the SRM combines step-wise and outcome signals, yielding a 1.14% absolute gain in MATH solve rates over standard outcome-based RLHF. These methods mitigate the limitations of isolated supervision signals, such as the sparse rewards and misalignment risks of outcome-only training.

Evaluation of process-supervised training emphasizes step-level accuracy, measuring the reward model's precision in classifying individual steps as correct (e.g., 84.7% on test sets for SRMs), and chain completion rates, which assess the proportion of fully solved reasoning trajectories (e.g., 78.2% on MATH using process rewards versus 72.4% with outcome supervision). These metrics serve as proxies for overall performance, highlighting improved reliability in long-horizon tasks without relying solely on end-to-end success rates.
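
The step-wise reward aggregation described above can be sketched compactly: each reasoning step receives a score from the SRM, and generalized advantage estimation (GAE) propagates those signals backward over the trajectory. The snippet below is a minimal illustration with made-up numbers, not the STEP-RLHF implementation.

```python
# Minimal GAE sketch over step-level rewards (illustrative values only).

def gae_advantages(step_rewards, values, gamma=1.0, lam=0.95):
    """`values` holds critic estimates for each step state plus a terminal value,
    so len(values) == len(step_rewards) + 1."""
    advantages, gae = [], 0.0
    for t in reversed(range(len(step_rewards))):
        delta = step_rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    return advantages

step_rewards = [0.9, 0.1, 0.8]        # SRM scores; the second step is judged weak
values = [0.5, 0.6, 0.4, 0.0]         # hypothetical critic values, terminal = 0
print(gae_advantages(step_rewards, values))
```

In a PPO-style update, these per-step advantages would then weight the clipped surrogate objective for the tokens belonging to each step.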

Applications and Case Studies

Mathematical Reasoning

Process supervision enhances AI models' mathematical reasoning by providing feedback on intermediate steps rather than solely on final outcomes, enabling more reliable step-by-step problem-solving. In a seminal 2023 study by OpenAI, a model trained with process supervision achieved a 78.2% solve rate on a representative 500-problem subset of the MATH dataset, a benchmark of competition-level problems spanning algebra, geometry, number theory, and precalculus. This outperformed outcome supervision, which reached 72.4% on the same subset, by pinpointing errors in algebraic manipulations and logical transitions, thus improving overall accuracy in multi-step derivations.

Key datasets for process supervision in mathematical reasoning include GSM8K, which covers grade-school word problems requiring arithmetic operations and basic algebra, and PRM800K, a collection of 800,000 step-level annotations derived from MATH problems for advanced competition-style tasks. These datasets incorporate human-verified annotations for specific operations, such as setting up equations, applying theorems, or simplifying expressions, allowing models to learn verifiable intermediate results like partial sums or geometric constructions. For instance, on GSM8K, process supervision reduced final-answer errors to 12.9% while minimizing reasoning trace errors to 3.8%, compared to higher error rates with outcome-focused methods.

A notable application involves training models to verify intermediate proof steps in process supervision frameworks, which significantly reduces errors in complex, multi-step problems such as proofs or integrations. By rewarding correct verification of each step—for example, confirming that each algebraic manipulation or rule application is valid—models in these experiments exhibited fewer propagated errors, with active learning on high-uncertainty steps yielding up to 2.6 times greater data efficiency. This approach not only corrects flawed reasoning paths but also enhances robustness in out-of-distribution scenarios, such as novel STEM problems.

The broader impact of process supervision lies in its ability to generalize reasoning patterns, enabling models to tackle unseen mathematical problems by decomposing them into verifiable sub-steps rather than relying on pattern-matched final answers. This fosters deeper conceptual understanding, as evidenced by improved performance on held-out MATH subsets where models extrapolated algebraic strategies from annotated training traces. Such generalization supports applications such as educational tools, where step fidelity is paramount.
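
At inference time, process reward models are typically used to rerank sampled solutions: generate several candidate derivations, score each with the PRM, and keep the best one. The sketch below uses hypothetical stand-ins (`sample_solutions`, `prm_score`) for the generator and the trained PRM.

```python
# Hedged best-of-N reranking sketch with a toy sampler and toy PRM.

def best_of_n(problem, sample_solutions, prm_score, n=16):
    """Sample n candidate step-by-step solutions and keep the PRM's favorite."""
    candidates = sample_solutions(problem, n)
    return max(candidates, key=prm_score)

# Toy stand-ins for illustration only.
fake_sampler = lambda problem, n: [["2x = 8", "x = 4"], ["2x = 9", "x = 4.5"]]
fake_prm = lambda steps: 0.95 if steps[0] == "2x = 8" else 0.20
print(best_of_n("Solve 2x + 3 = 11", fake_sampler, fake_prm, n=2))  # -> ['2x = 8', 'x = 4']
```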

Code Generation and Other Domains

Process supervision has been effectively applied to code generation tasks, where it provides granular feedback on intermediate steps such as program design, syntax validation, and test execution to guide language models toward more accurate outputs. In this approach, a Process Reward Model (PRM) evaluates generated code at the line level for correctness, enabling reinforcement learning to refine the generation process iteratively. For instance, on the HumanEval benchmark, which consists of 164 programming problems requiring functional correctness, process-supervised methods have demonstrated improvements over outcome-based supervision by rewarding correct intermediate reasoning traces.

Beyond textual domains, process supervision extends to multi-modal applications, offering step-wise feedback in tasks like video question-answering and path-planning. In video QA, it breaks reasoning down into sequential decisions, such as identifying key frames and inferring temporal relationships, allowing models to correct errors mid-reasoning. Similarly, in embodied and robotic settings, process supervision decomposes path-planning into verifiable sub-steps, including obstacle detection and trajectory adjustment, enhancing decision-making in dynamic environments. The VisualPRM, an 8B-parameter PRM, exemplifies this by achieving a 5.9-point improvement across seven multimodal benchmarks when integrated with models like InternVL2.5-78B.

In other domains, process supervision proves valuable for tasks emphasizing intermediate logic, such as natural language inference (NLI) chains and scientific hypothesis testing. For NLI chains, it supervises the step-by-step entailment verification between premises and hypotheses, ensuring logical consistency throughout multi-hop reasoning. In scientific hypothesis testing, emerging frameworks employ process supervision to validate intermediate experimental steps, such as data interpretation and assumption checking, fostering more rigorous AI-assisted discovery. These applications highlight process supervision's versatility in structuring complex, logic-driven workflows.

Empirical results underscore these benefits, with process-supervised reward models yielding up to 19% improvement in rank@1 accuracy on the ToolComp benchmark for multi-step tool-use tasks, outperforming outcome supervision by providing denser guidance. This enhancement also supports better alignment in complex domains by promoting interpretable, human-verified reasoning paths.
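
One cheap source of intermediate feedback for code, in the spirit of the line-level signals described above, is checking whether each growing prefix of the program still parses before the finished program is run against unit tests. The sketch below is an illustration of that idea for straight-line Python code, not the PRM from the cited work.

```python
# Hedged sketch: per-line "process" signal from syntax validation of prefixes.
import ast

def prefix_syntax_rewards(code: str) -> list[float]:
    """1.0 for each line whose cumulative prefix still parses, else 0.0.
    (Simplification: meaningful for straight-line code without multi-line blocks.)"""
    lines = code.splitlines()
    rewards = []
    for i in range(1, len(lines) + 1):
        try:
            ast.parse("\n".join(lines[:i]))
            rewards.append(1.0)
        except SyntaxError:
            rewards.append(0.0)
    return rewards

sample = "x = 10\ny = x * 2\nz = y +\nprint(z)"
print(prefix_syntax_rewards(sample))   # [1.0, 1.0, 0.0, 0.0] — later lines inherit the break
```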

Challenges and Limitations

Annotation and Scalability Issues

One major hurdle in implementing process supervision is the high cost associated with data annotation, which demands extensive human effort for labeling individual steps in reasoning chains. For instance, the PRM800K dataset, comprising 800,000 step-level correctness labels across 75,000 model-generated solutions to 12,000 mathematical problems (averaging approximately 10.7 steps per solution), was created through a large-scale effort involving multiple phases of labeling by a team at Scale AI. This process included rigorous quality-control measures, such as labeler screening for 75% inter-annotator agreement and continuous monitoring, yet it highlighted the labor-intensive nature of step-wise annotation compared to outcome supervision, where only final answers require verification. Such efforts underscore the resource demands, with active learning techniques employed to improve data efficiency by a factor of 2.6 during collection.

Scalability issues further compound these challenges, particularly as reasoning chains lengthen, leading to a proportional increase in the number of labels needed per solution and in overall dataset size. While current datasets like PRM800K handle average chain lengths of around 10-20 steps effectively, extending process supervision to problems requiring chains exceeding 50 steps becomes inefficient due to the growing annotation burden and the need to cover diverse intermediate paths without prohibitive resource escalation. This limitation restricts widespread adoption for complex, multi-stage tasks, as the linear scaling with chain length amplifies costs and time, making exhaustive labeling impractical for longer sequences.

Training under process supervision also introduces computational overhead relative to outcome supervision, as it involves calculating and aggregating step-wise losses across entire reasoning trajectories, which demands more memory and processing time during model optimization. In practice, this granularity results in larger training datasets—such as the 800,000 step labels in PRM800K versus far fewer outcome samples for the same problems.

To mitigate these issues, researchers have explored partial automation of annotation pipelines, such as using Monte Carlo Tree Search algorithms to generate synthetic step-level labels, as in the OmegaPRM approach, which produced over 1.5 million annotations without full human involvement. However, these automated methods often suffer reduced label accuracy without human oversight, necessitating hybrid strategies that combine synthetic data with selective manual verification to balance cost and reliability. Active learning frameworks, originally demonstrated in PRM800K, continue to play a key role by prioritizing uncertain steps for annotation, though they do not fully eliminate the need for human intervention in high-stakes applications.
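
The automated labeling idea referenced above (as in OmegaPRM and related methods) can be illustrated with a simple Monte Carlo estimator: a step's quality is approximated by the fraction of rollouts continued from that step that reach the gold answer, so values drop sharply after the first flawed step. The sketch below uses a hypothetical `sample_completion` stand-in for the policy model; OmegaPRM additionally applies binary search over step positions to cut the number of rollouts needed.

```python
# Hedged sketch of Monte Carlo step labeling with a toy policy stand-in.
import random

def mc_step_values(steps, sample_completion, gold_answer, rollouts=8):
    """Estimate, for each prefix of the solution, the chance of still reaching
    the gold answer when the policy continues from that prefix."""
    values = []
    for i in range(1, len(steps) + 1):
        hits = sum(sample_completion(steps[:i]) == gold_answer for _ in range(rollouts))
        values.append(hits / rollouts)
    return values

# Toy policy: reliable before the flawed second step, a coin flip afterwards.
toy_policy = lambda prefix: "42" if len(prefix) < 2 else random.choice(["42", "41"])
print(mc_step_values(["step 1", "step 2 (flawed)", "step 3"], toy_policy, "42", rollouts=100))
```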

Applicability to Diverse Tasks

Process supervision demonstrates particular strengths in objective tasks that involve verifiable intermediate steps, such as mathematical reasoning, where it enables precise credit assignment and outperforms outcome supervision by guiding models through structured, decomposable processes. In mathematical problem-solving, for instance, it rewards alignment at each reasoning step, leading to improved accuracy on complex, multi-step problems by reducing errors propagated from ambiguous intermediates. Similarly, in code generation, process supervision facilitates strategic exploration of algorithmic paths, verified through executable feedback, resulting in higher correctness rates compared to methods focused solely on final outputs. However, it proves less effective for subjective domains like creative writing, where intermediate quality lacks clear, objective criteria for evaluation, potentially hindering the model's ability to capture nuanced or artistic intent.

A key limitation arises in tasks involving ambiguous reasoning, where open-ended intermediates defy straightforward verification and may introduce misalignment if supervision relies on incomplete or biased proxies. Without well-defined correctness signals, process supervision struggles to enforce consistent guidance, often deferring to neutral or conservative outputs rather than fostering robust deliberation. This is particularly evident in scenarios lacking inherent decomposability, where the absence of clear stepwise objectives amplifies the challenge of aligning model behavior with multifaceted human values.

Domain transfer poses additional issues, as process patterns acquired in structured domains like mathematics do not readily generalize to visual or temporal tasks without targeted adaptations, such as domain-specific verification strategies. For example, while effective in STEM-related reasoning, the reliance on textual, sequential verification limits seamless application to multimodal problems, necessitating hybrid approaches to bridge representational gaps. Empirically, process supervision yields lower gains on non-decomposable problems, as observed in early reinforcement learning applications where benefits diminish without modular subtasks, underscoring its dependence on tasks amenable to hierarchical oversight. In such cases, the method's advantages in alignment and interpretability are offset by the difficulty of defining rewardable processes, highlighting the need for complementary techniques in holistic task environments.

Recent Advancements and Future Directions

Key Developments Post-2023

In 2024, significant progress was made in automating process supervision data generation to scale beyond human-annotated datasets. Liu et al. introduced OmegaPRM, a divide-and-conquer algorithm that generates high-quality synthetic annotations by identifying errors in chain-of-thought reasoning through binary search, producing over 1.5 million step-level labels without manual intervention. This approach enhanced mathematical reasoning in models like Gemini Pro, boosting MATH500 accuracy from 51% to 69.4% and GSM8K from 86.4% to 93.6%.

Building on the 2023 foundational process supervision techniques, 2025 saw innovations in reward modeling and optimization paradigms. The Bi-directional Reward Model (BiRM) extended traditional process reward models by incorporating forward-looking signals, evaluating both the correctness of past steps and the probability of future success, inspired by A* search, and improved MATH-500 accuracy by 3.8% over standard PRMs in sampling-based search settings. Similarly, Self-traced Step-wise Preference Optimization (SSPO) introduced a framework using verbal value probing for self-assessment of steps, enabling efficient training without auxiliary models and reducing response lengths by up to 37% on AIME24 while maintaining or slightly improving accuracy (e.g., from 18.12% to 19.58%).

Benchmark development advanced evaluation of process-supervised tool use in 2025 with the release of ToolComp, a benchmark comprising 485 prompts and 1,731 step-wise annotations across 11 tools for multi-tool reasoning tasks. Models trained with process supervision on ToolComp outperformed outcome-supervised baselines by 19% in rank@1 accuracy for base models, highlighting gains in intermediate step correctness for complex planning.

Frameworks integrating process supervision with reinforcement learning emerged to enhance sample efficiency in reasoning tasks. Setlur et al.'s 2024 work scaled automated process verifiers in reinforcement learning, achieving 5-6x greater sample efficiency compared to outcome-only rewards on reasoning benchmarks. In 2025, Turn-level Adjudicated Reinforcement Learning (TARL) combined LLM-based turn-level rewards with GRPO for interactive agent training, yielding over 20% pass-rate improvements for multimodal agents on τ-bench tasks through finer credit assignment.

Recent advancements in process supervision for large language models (LLMs) increasingly focus on hybrid models that integrate process supervision with outcome supervision and self-supervised techniques to achieve balanced efficiency in reasoning tasks. For instance, the Principle Process Reward (PPR) framework employs reward normalization to calibrate process-based evaluations of intermediate steps with outcome verification, enabling reliable performance in non-verifiable agentic tasks where final results are hard to assess directly. This fusion mitigates the limitations of pure process supervision, such as overemphasis on local step quality at the expense of global coherence, while self-supervised components reduce reliance on human annotation by leveraging internal model predictions for step validation. Such approaches have demonstrated state-of-the-art results on multi-step reasoning benchmarks, improving generalization across diverse domains by harmonizing local and end-to-end rewards through reward normalization.

Scalability solutions in process supervision are advancing through automated annotation methods powered by stronger foundation models, which significantly lower human annotation costs for generating high-quality process traces.
Techniques like OmegaPRM use divide-and-conquer Monte Carlo Tree Search to autonomously collect over 1.5 million process supervision annotations for mathematical reasoning, eliminating manual intervention and enabling efficient training of Process Reward Models (PRMs). Similarly, the AgentPro framework incorporates automated process supervision for LLM agents, aligning candidate solution steps with reference traces to enhance decision-making without domain-specific human expertise. These innovations, together with methods such as Self-traced Step-wise Preference Optimization (SSPO), combine process supervision with preference optimization to scale to long-horizon tasks, achieving up to 93.6% accuracy on GSM8K while reducing annotation expenses by orders of magnitude compared to traditional methods.

In the context of broader AI alignment, process supervision plays a pivotal role in scalable oversight mechanisms designed for superintelligent systems, particularly for long-horizon tasks where human evaluators cannot directly verify all steps. By providing granular feedback on reasoning trajectories, process-based methods amplify weaker overseers to evaluate stronger AI behaviors, ensuring alignment with human intent in complex, multi-step scenarios. This approach supports weak-to-strong generalization, where models trained on simpler oversight signals robustly scale to advanced capabilities, as evidenced in frameworks that partition supervision across verifiable subprocesses to handle emergent behaviors in increasingly capable systems.

Key research gaps in process supervision include the lack of standardized benchmarks for subjective domains, where intermediate step quality is inherently ambiguous, and insufficient studies on real-world deployment beyond controlled mathematical or coding tasks. Current evaluations predominantly focus on objective metrics in verifiable settings, leaving open challenges in domains like creative writing or ethical reasoning, where process traces require nuanced, context-dependent assessment. Additionally, while automated methods like BiRM introduce bi-directional rewarding for improved step evaluation, broader empirical investigations into deployment scalability, such as integration with production agents, remain underexplored and are needed to validate long-term robustness. Addressing these gaps through new benchmarks and interdisciplinary studies could accelerate the adoption of process supervision in diverse, high-stakes applications.
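
As a rough illustration of the hybrid weighting discussed above, the sketch below blends a per-response process score (here, the minimum step score from a PRM) with an outcome check, normalizing each across a batch of candidate responses. It is a generic sketch under those assumptions, not the PPR implementation.

```python
# Generic hedged sketch of hybrid process + outcome reward shaping.
import statistics

def z_norm(xs):
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs) or 1.0
    return [(x - mu) / sigma for x in xs]

def hybrid_rewards(step_score_lists, outcomes, alpha=0.5):
    """Blend per-response process scores (min over step scores) with outcome
    scores, each z-normalized over the batch of candidate responses."""
    process = z_norm([min(scores) for scores in step_score_lists])
    outcome = z_norm(outcomes)
    return [alpha * p + (1 - alpha) * o for p, o in zip(process, outcome)]

# Toy batch: three candidates with PRM step scores and 0/1 outcome checks.
print(hybrid_rewards([[0.9, 0.8], [0.9, 0.2], [0.4, 0.3]], [1.0, 1.0, 0.0]))
```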

References

    [2305.20050] Let's Verify Step by Step - arXiv
    May 31, 2023 · This paper compares outcome and process supervision for training models, finding process supervision outperforms outcome supervision, and ...
  7. [7]
    Improving mathematical reasoning with process supervision - OpenAI
    May 31, 2023 · We've trained a model to achieve a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”).
  8. [8]
    Supervise Process, not Outcomes | Ought
    Apr 6, 2022 · Process-based systems are built on human-understandable task decompositions, with direct supervision of reasoning steps. Outcome-based systems ...
  9. [9]
    Training language models to follow instructions with human feedback
    Mar 4, 2022 · In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback.
  10. [10]
    Chain-of-Thought Prompting Elicits Reasoning in Large Language ...
    We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to ...
  11. [11]
    Solving math word problems with process- and outcome-based ...
    This paper compares process- and outcome-based approaches for language models, finding process-based supervision is needed for correct reasoning steps.
  12. [12]
    openai/prm800k: 800,000 step-level correctness labels on ... - GitHub
    Jan 22, 2023 · PRM800K is a process supervision dataset containing 800,000 step-level correctness labels for model-generated solutions to problems from the ...
  13. [13]
    [PDF] arXiv:2305.20050v1 [cs.LG] 31 May 2023
    May 31, 2023 · Let's Verify Step by Step. Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman.
  14. [14]
    [PDF] arXiv:2411.11681v3 [cs.AI] 14 May 2025
    May 14, 2025 · Process supervision enhances the performance of large language ... We are the first to assert that the reward score in the reasoning alignment ...
  15. [15]
    [2406.06592] Improve Mathematical Reasoning in Language ... - arXiv
    Jun 5, 2024 · This paper uses automated process supervision with a new MCTS algorithm to improve LLM math reasoning, achieving improved success rates on MATH ...
  16. [16]
    SPARE: Single-Pass Annotation with Reference-Guided Evaluation ...
    Jun 18, 2025 · SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling. Authors: Md Imbesat Hassan ...
  17. [17]
    Training Paradigms for STEP-RLHF
  18. [18]
  19. [19]
    Process Supervision-Guided Policy Optimization for Code Generation
    Oct 23, 2024 · The paper proposes a Process Reward Model (PRM) for code generation, providing line-level feedback during generation, mimicking human code ...
  20. [20]
    VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
  21. [21]
    ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
    Jan 2, 2025 · ToolComp is a benchmark to evaluate multi-step tool-use reasoning, with human-edited prompts, answers, and process supervision labels.
  22. [22]
    [PDF] Better Process Supervision with Bi-directional Rewarding Signals
    Jul 27, 2025 · 5.0% improvement at K = 100 in MATH-500 dataset. These results emphasize the valuable bidirectional supervision signals provided by BiRM.
  23. [23]
    SSPO: Self-traced Step-wise Preference Optimization for Process ...
    Aug 18, 2025 · In this paper, we empirically reveal that the incorrect answers partially stem from verbose reasoning processes lacking correct self-fix, where ...
  24. [24]
    Hybrid Reward Normalization for Process-supervised Non-verifiable ...
    Sep 29, 2025 · Large Language Models (LLMs) increasingly rely on external tools such as search engines to solve complex agentic tasks that require reasoning ...
  25. [25]
    A Survey of Process Reward Models: From Outcome Signals ... - arXiv
    Oct 9, 2025 · A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models. Authors:Congming Zheng, Jiachen Zhu, ...
  26. [26]
    A Survey of Reinforcement Learning for Large Reasoning Models
    ... Process Supervision and Reasoning Compression, Paper ... PURE, Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning ...
  27. [27]
    [PDF] Enhancing LLM Agents with Automated Process Supervision
    In this paper, we present AgentPro, a novel framework for LLM Agents that incorporates an Automated Process Supervision mechanism to address ...
  28. [28]
    Scalable Oversight and Weak-to-Strong Generalization
    Dec 15, 2023 · Scalable oversight amplifies overseers, while weak-to-strong generalization ensures the AI generalizes from imperfect labels, both addressing ...
  29. [29]
    [PDF] Easy-to-Hard Generalization: Scalable Alignment Beyond Human ...
    Our study advances the field of AI alignment by demonstrating the potential of easy-to-hard generalization, where models trained on simpler tasks can be ...
  30. [30]
    Reasoning beyond limits: Advances and open problems for LLMs
    Sep 22, 2025 · ... process supervision models for mathematical reasoning. Let us ... Model (ORM) and Process Reward Model (PRM) [157]. ORM assigns a ...
  31. [31]
    Better Process Supervision with Bi-directional Rewarding Signals
    Mar 6, 2025 · BiRM is a process supervision model that evaluates past steps and models the probability of future success, unlike one-directional PRMs.Missing: accuracy | Show results with:accuracy