- Generative language modeling for automated theorem proving, Polu et al.,
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei et al.,
- Reflection of thought: Inversely eliciting numerical reasoning in language models via solving linear systems, Zhou et al.,
- MathPrompter: Mathematical Reasoning using Large Language Models, Imani et al.,
- Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al.,
- Deductive Verification of Chain-of-Thought Reasoning, Ling et al.,
- Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks, Chen et al.,
- Mistral 7B, Jiang et al.,
- Llama 2: Open foundation and fine-tuned chat models, Touvron et al.,
- Guiding language model reasoning with planning tokens, Wang et al.,
- Tinygsm: achieving >80\% on gsm8k with small language models, Liu et al.,
- Chain of Code: Reasoning with a Language Model-Augmented Code Emulator, Li et al.,
- AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought, Zhang et al.,
- Planning in Natural Language Improves LLM Search for Code Generation, Wang et al.,
- MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Sprague et al.,
- DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving, Tong et al.,
- Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus, Morishita et al.,
- Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards, Hwang et al.,
- AlphaMath Almost Zero: Process Supervision without Process, Chen et al.,
- Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al.,
- Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes, Chen et al.,
- Quiet-star: Language models can teach themselves to think before speaking, Zelikman et al.,
- Common 7b language models already possess strong math capabilities, Li et al.,
- MathDivide: Improved mathematical reasoning by large language models, Srivastava et al.,
- Certified Deductive Reasoning with Language Models, Poesia et al.,
- From explicit cot to implicit cot: Learning to internalize cot step by step, Deng et al.,
- Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models, Xu et al.,
- Lean-star: Learning to interleave thinking and proving, Lin et al.,
- The llama 3 herd of models, Dubey et al.,
- Qwen2 Technical Report, Yang et al.,
- Siam: Self-improving code-assisted mathematical reasoning of large language models, Yu et al.,
- Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al.,
- TPO: Aligning Large Language Models with Multi-branch \& Multi-step Preference Trees, Liao et al.,
- O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?, Huang et al.,
- Formal mathematical reasoning: A new frontier in ai, Yang et al.,
- Training large language models to reason in a continuous latent space, Hao et al.,
- Qwen2.5 technical report, Yang et al.,
- System-2 Mathematical Reasoning via Enriched Instruction Tuning, Cai et al.,
- Acemath: Advancing frontier math reasoning with post-training and reward modeling, Liu et al.,
- Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems, Min et al.,
- SKIntern: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models, Liao et al.,
- STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving, Dong et al.,
- Sky-T1: Train your own O1 preview model within \$450, Team et al.,
- QwQ: Reflect Deeply on the Boundaries of the Unknown, Team et al.,
- Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation, Labs et al.,
- Think Less, Achieve More: Cut Reasoning Costs by 50\% Without Sacrificing Accuracy, Team et al.,
- Unlocking the Potential of Reinforcement Learning in Improving Reasoning Models, Team et al.,
- Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, Wang et al.,
- Efficient Reasoning with Hidden Thinking, Shen et al.,
- Advancing Math Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages, Chen et al.,
- s1: Simple test-time scaling, Muennighoff et al.,
- Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
- Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions, Ranaldi et al.,
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, Geiping et al.,
- CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction, Li et al.,
- Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments, Payoungkhamdee et al.,
- Theorem Prover as a Judge for Synthetic Data Generation, Leang et al.,
- Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation, Zhang et al.,
- Scalable Language Models with Posterior Inference of Latent Thought Vectors, Kong et al.,
- Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, Chen et al.,
- Demystifying Long Chain-of-Thought Reasoning in LLMs, Yeo et al.,
- FastMCTS: A Simple Sampling Strategy for Data Synthesis, Li et al.,
- LIMO: Less is More for Reasoning, Ye et al.,
- Thinking fast and slow with deep learning and tree search, Anthony et al.,
- Training verifiers to solve math word problems, Cobbe et al.,
- Chain of Thought Imitation with Procedure Cloning, Yang et al.,
- Star: Bootstrapping reasoning with reasoning, Zelikman et al.,
- Large Language Models Are Reasoning Teachers, Ho et al.,
- The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning, Kim et al.,
- Training Chain-of-Thought via Latent-Variable Inference, Hoffman et al.,
- Instruction tuning for large language models: A survey, Zhang et al.,
- Reinforced self-training (rest) for language modeling, Gulcehre et al.,
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, Singh et al.,
- V-STaR: Training Verifiers for Self-Taught Reasoners, Hosseini et al.,
- Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus, Morishita et al.,
- DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving, Tong et al.,
- Weak-to-Strong Reasoning, Yang et al.,
- Iterative Reasoning Preference Optimization, Pang et al.,
- Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs, Zhang et al.,
- ReAct Meets ActRe: Autonomous Annotation of Agent Trajectories for Contrastive Self-Training, Yang et al.,
- Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards, Hwang et al.,
- AlphaMath Almost Zero: Process Supervision without Process, Chen et al.,
- Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes, Chen et al.,
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models, Puerto et al.,
- Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al.,
- TPO: Aligning Large Language Models with Multi-branch \& Multi-step Preference Trees, Liao et al.,
- Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning, Wang et al.,
- O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?, Huang et al.,
- System-2 Mathematical Reasoning via Enriched Instruction Tuning, Cai et al.,
- Acemath: Advancing frontier math reasoning with post-training and reward modeling, Liu et al.,
- Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems, Min et al.,
- Openai o1 system card, Jaech et al.,
- OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning, Zhang et al.,
- Proposing and solving olympiad geometry with guided tree search, Zhang et al.,
- Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, Bansal et al.,
- Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search, Li et al.,
- Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, Wang et al.,
- Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
- Sft memorizes, rl generalizes: A comparative study of foundation model post-training, Chu et al.,
- Advancing Math Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages, Chen et al.,
- s1: Simple test-time scaling, Muennighoff et al.,
- RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?, Xu et al.,
- FastMCTS: A Simple Sampling Strategy for Data Synthesis, Li et al.,
- LLMs Can Teach Themselves to Better Predict the Future, Turtel et al.,
- Policy Guided Tree Search for Enhanced LLM Reasoning, Li et al.,
- Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls, Wang et al.,
- Distillation Scaling Laws, Busbridge et al.,
- Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization, Yao et al.,
- Demystifying Long Chain-of-Thought Reasoning in LLMs, Yeo et al.,
- LIMO: Less is More for Reasoning, Ye et al.,
- BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation, Pang et al.,
- Greedy Policy Search: A Simple Baseline for Learnable Test-Time Augmentation, Lyzhov et al.,
- Scaling scaling laws with board games, Jones et al.,
- Show Your Work: Scratchpads for Intermediate Computation with Language Models, Nye et al.,
- Making large language models better reasoners with step-aware verifier, Li et al.,
- Complexity-Based Prompting for Multi-step Reasoning, Fu et al.,
- Self-Consistency Improves Chain of Thought Reasoning in Language Models, Wang et al.,
- Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al.,
- Deductive Verification of Chain-of-Thought Reasoning, Ling et al.,
- Learning to Reason via Program Generation, Emulation, and Search, Weir et al.,
- Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts, Luo et al.,
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Welleck et al.,
- Scaling Inference Computation: Compute-Optimal Inference for Problem-Solving with Language Models, Wu et al.,
- Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization, Zhou et al.,
- Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information, Zhang et al.,
- Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision, Wang et al.,
- Stepwise self-consistent mathematical reasoning with large language models, Zhao et al.,
- General purpose verification for chain of thought prompting, Vacareanu et al.,
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision, Luo et al.,
- Large language monkeys: Scaling inference compute with repeated sampling, Brown et al.,
- Scaling llm test-time compute optimally can be more effective than scaling model parameters, Snell et al.,
- Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, Wu et al.,
- What are the essential factors in crafting effective long context multi-hop instruction datasets? Insights and best practices, Chen et al.,
- MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning, Chen et al.,
- Scaling llm inference with optimized sample compute allocation, Zhang et al.,
- Rlef: Grounding code llms in execution feedback with reinforcement learning, Gehring et al.,
- Beyond examples: High-level automated reasoning paradigm in in-context learning via mcts, Wu et al.,
- From medprompt to o1: Exploration of run-time strategies for medical challenge problems and beyond, Nori et al.,
- A simple and provable scaling law for the test-time compute of large language models, Chen et al.,
- Openai o1 system card, Jaech et al.,
- Lachesis: Predicting LLM Inference Accuracy using Structural Properties of Reasoning Paths, Kim et al.,
- Seed-cts: Unleashing the power of tree search for superior performance in competitive coding tasks, Wang et al.,
- Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving, AbdElhameed et al.,
- ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning, Yu et al.,
- s1: Simple test-time scaling, Muennighoff et al.,
- From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning, Li et al.,
- Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective, Yu et al.,
- Test-time Computing: from System-1 Thinking to System-2 Thinking, Ji et al.,
- SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling, Chen et al.,
- Instantiation-based Formalization of Logical Reasoning Tasks using Language Models and Logical Solvers, Raza et al.,
- The lessons of developing process reward models in mathematical reasoning, Zhang et al.,
- ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al.,
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, Geiping et al.,
- Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, Chen et al.,
- Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling, Liu et al.,
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?, Zeng et al.,
- Optimizing Temperature for Language Models with Multi-Sample Inference, Du et al.,
- Bag of Tricks for Inference-time Computation of LLM Reasoning, Liu et al.,
- Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, Yang et al.,
- (Mis) Fitting: A Survey of Scaling Laws, Li et al.,
- METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling, Li et al.,
- Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment, Li et al.,
- Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification, Zhao et al.,
- TestNUC: Enhancing Test-Time Computing Approaches through Neighboring Unlabeled Data Consistency, Zou et al.,
- Confidence Improves Self-Consistency in LLMs, Taubenfeld et al.,
- S*: Test Time Scaling for Code Generation, Li et al.,
- Is Depth All You Need? An Exploration of Iterative Reasoning in LLMs, Wu et al.,
- Self-Evaluation Guided Beam Search for Reasoning, Xie et al.,
- PATHFINDER: Guided Search over Multi-Step Reasoning Paths, Golovneva et al.,
- Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, Zhou et al.,
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al.,
- No train still gain. Unleash mathematical reasoning of large language models with monte carlo tree search guided by energy function, Xu et al.,
- Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning, Zhu et al.,
- Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping, Lehnert et al.,
- Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding, Liu et al.,
- Making PPO even better: Value-Guided Monte-Carlo Tree Search decoding, Liu et al.,
- On the Empirical Complexity of Reasoning and Planning in LLMs, Kang et al.,
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing, Tian et al.,
- Tree of Uncertain Thoughts Reasoning for Large Language Models, Mo et al.,
- Graph of Thoughts: Solving Elaborate Problems with Large Language Models, Besta et al.,
- GraphReason: Enhancing Reasoning Capabilities of Large Language Models through A Graph-Based Verification Approach, Cao et al.,
- Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search, Light et al.,
- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al.,
- Demystifying chains, trees, and graphs of thoughts, Besta et al.,
- Mindstar: Enhancing math reasoning in pre-trained llms at inference time, Kang et al.,
- Tree search for language model agents, Koh et al.,
- Agent q: Advanced reasoning and learning for autonomous ai agents, Putta et al.,
- RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation, Li et al.,
- Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning, Zhang et al.,
- Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination, Chen et al.,
- Treebon: Enhancing inference-time alignment with speculative tree-search and best-of-n sampling, Qiu et al.,
- Aflow: Automating agentic workflow generation, Zhang et al.,
- Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models, Wang et al.,
- Deliberate reasoning for llms as structure-aware planning with accurate world model, Xiong et al.,
- Enhancing multi-step reasoning abilities of language models through direct q-function optimization, Liu et al.,
- Process reward model with q-value rankings, Li et al.,
- Scattered Forest Search: Smarter Code Space Exploration with LLMs, Light et al.,
- AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning, Xiang et al.,
- CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models, Li et al.,
- Marco-o1: Towards open reasoning models for open-ended solutions, Zhao et al.,
- Technical report: Enhancing llm reasoning with reward-guided tree search, Jiang et al.,
- SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Enhanced Code Generation, Xu et al.,
- GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection, Kadam et al.,
- MC-NEST--Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree, Rabby et al.,
- SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models, Cheng et al.,
- Forest-of-thought: Scaling test-time compute for enhancing LLM reasoning, Bi et al.,
- Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search, Yao et al.,
- Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation and Execution in Complex Task Handling, Ni et al.,
- Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning, Jiang et al.,
- Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning, Park et al.,
- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking, Guan et al.,
- Evolving Deeper LLM Thinking, Lee et al.,
- A Roadmap to Guide the Integration of LLMs in Hierarchical Planning, Puerta-Merino et al.,
- Monte Carlo Tree Search for Comprehensive Exploration in LLM-Based Automatic Heuristic Design, Zheng et al.,
- Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning, Lin et al.,
- A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods, Puri et al.,
- Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models, Kim et al.,
- Atom of Thoughts for Markov LLM Test-Time Scaling, Teng et al.,
- CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning, Pan et al.,
- QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search, Lin et al.,
- CritiQ: Mining Data Quality Criteria from Human Preferences, Guo et al.,
- Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al.,
- Proximal policy optimization algorithms, Schulman et al.,
- Training verifiers to solve math word problems, Cobbe et al.,
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, Singh et al.,
- RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold, Setlur et al.,
- A Small Step Towards Reproducing OpenAI o1: Progress Report on the Steiner Open Source Models, Ji et al.,
- Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback, Ivison et al.,
- Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al.,
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search, Zhang et al.,
- Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al.,
- Stepcoder: Improve code generation with reinforcement learning from compiler feedback, Dou et al.,
- ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, Li et al.,
- AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training, Wan et al.,
- CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks, Wang et al.,
- A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications, Xiao et al.,
- Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability, Lin et al.,
- Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization, Liu et al.,
- o1-coder: an o1 replication for coding, Zhang et al.,
- Offline Reinforcement Learning for LLM Multi-Step Reasoning, Wang et al.,
- DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL, Luo et al.,
- Sft memorizes, rl generalizes: A comparative study of foundation model post-training, Chu et al.,
- REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models, Hu et al.,
- Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling, Hou et al.,
- Diverse Preference Optimization, Lanchantin et al.,
- COS (M+ O) S: Curiosity and RL-Enhanced MCTS for Exploring Story Space via Language Models, Materzok et al.,
- Kimi k1.5: Scaling reinforcement learning with llms, Team et al.,
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search, Shen et al.,
- Demystifying Long Chain-of-Thought Reasoning in LLMs, Yeo et al.,
- LIMR: Less is More for RL Scaling, Li et al.,
- Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning, Vassoyan et al.,
- Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance, Huang et al.,
- Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al.,
- Training Language Models to Reason Efficiently, Arora et al.,
- Process reinforcement through implicit rewards, Cui et al.,
- Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment, Sun et al.,
- Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points, Zhang et al.,
- Reasoning with Reinforced Functional Token Tuning, Zhang et al.,
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning, Lyu et al.,
- Competitive Programming with Large Reasoning Models, El-Kishky et al.,
- SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution, Wei et al.,
- Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al.,
- Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation, Kim et al.,
- On the Emergence of Thinking in LLMs I: Searching for the Right Intuition, Ye et al.,
- The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks, Cuadron et al.,
- STeCa: Step-level Trajectory Calibration for LLM Agent Learning, Wang et al.,
- Thinking Preference Optimization, Yang et al.,
- Concrete problems in AI safety, Amodei et al.,
- Training verifiers to solve math word problems, Cobbe et al.,
- Star: Bootstrapping reasoning with reasoning, Zelikman et al.,
- Goal misgeneralization in deep reinforcement learning, Di Langosco et al.,
- The effects of reward misspecification: Mapping and mitigating misaligned models, Pan et al.,
- Self-critiquing models for assisting human evaluators, Saunders et al.,
- Solving math word problems with process- and outcome-based feedback, Uesato et al.,
- Constitutional AI: Harmlessness from AI Feedback, Bai et al.,
- Towards Mitigating LLM Hallucination via Self Reflection, Ji et al.,
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al.,
- Reasoning with Language Model is Planning with World Model, Hao et al.,
- LEVER: Learning to Verify Language-to-Code Generation with Execution, Ni et al.,
- Large Language Models are Better Reasoners with Self-Verification, Weng et al.,
- Self-verification improves few-shot clinical information extraction, Gero et al.,
- ReAct: Synergizing Reasoning and Acting in Language Models, Yao et al.,
- Reflexion: language agents with verbal reinforcement learning, Shinn et al.,
- Critic: Large language models can self-correct with tool-interactive critiquing, Gou et al.,
- Reinforced self-training (rest) for language modeling, Gulcehre et al.,
- Shepherd: A critic for language model generation, Wang et al.,
- Let's reward step by step: Step-Level reward model as the Navigators for Reasoning, Ma et al.,
- ReFT: Reasoning with Reinforced Fine-Tuning, Trung et al.,
- Large Language Models Cannot Self-Correct Reasoning Yet, Huang et al.,
- Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives, Zhang et al.,
- LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, Hao et al.,
- Let's verify step by step, Lightman et al.,
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations, Wang et al.,
- Advancing Process Verification for Large Language Models via Tree-Based Preference Learning, He et al.,
- Reasoning in Flux: Enhancing Large Language Models Reasoning through Uncertainty-aware Adaptive Guidance, Yin et al.,
- When is Tree Search Useful for LLM Planning? It Depends on the Discriminator, Chen et al.,
- Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification, Zhou et al.,
- SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning, Miao et al.,
- Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution, Fernando et al.,
- Small Language Models Need Strong Verifiers to Self-Correct Reasoning, Zhang et al.,
- Step-level Value Preference Optimization for Mathematical Reasoning, Chen et al.,
- Skywork-o1 open series, Team et al.,
- OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning, Yu et al.,
- AutoPSV: Automated Process-Supervised Verifier, Lu et al.,
- Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models, Hu et al.,
- Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al.,
- VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search, Brandfonbrener et al.,
- Can We Verify Step by Step for Incorrect Answer Detection?, Xu et al.,
- Monte carlo tree search boosts reasoning via iterative preference learning, Xie et al.,
- Self-reflection in llm agents: Effects on problem-solving performance, Renze et al.,
- Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models, Xu et al.,
- Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback, Gao et al.,
- Step-dpo: Step-wise preference optimization for long-chain reasoning of llms, Lai et al.,
- Llm critics help catch llm bugs, McAleese et al.,
- Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models, Lee et al.,
- Tlcr: Token-level continuous reward for fine-grained reinforcement learning from human feedback, Yoon et al.,
- Selective Preference Optimization via Token-Level Reward Function Estimation, Yang et al.,
- Generative verifiers: Reward modeling as next-token prediction, Zhang et al.,
- Critic-cot: Boosting the reasoning abilities of large language model via chain-of-thoughts critic, Zheng et al.,
- On designing effective rl reward at training time for llm reasoning, Gao et al.,
- Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up, Yuan et al.,
- Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment, Kazemnejad et al.,
- Self-generated critiques boost reward modeling for language models, Yu et al.,
- From generation to judgment: Opportunities and challenges of llm-as-a-judge, Li et al.,
- Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering, Guan et al.,
- Entropy-Regularized Process Reward Model, Zhang et al.,
- Llms-as-judges: a comprehensive survey on llm-based evaluation methods, Li et al.,
- o1-coder: an o1 replication for coding, Zhang et al.,
- Hunyuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving, Li et al.,
- Acemath: Advancing frontier math reasoning with post-training and reward modeling, Liu et al.,
- Free process rewards without process labels, Yuan et al.,
- Outcome-Refining Process Supervision for Code Generation, Yu et al.,
- What Makes Large Language Models Reason in (Multi-Turn) Code Generation?, Zheng et al.,
- Advancing LLM Reasoning Generalists with Preference Trees, Yuan et al.,
- Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems, Ye et al.,
- QwQ: Reflect Deeply on the Boundaries of the Unknown, Team et al.,
- Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning, Setlur et al.,
- Learning to Plan \& Reason for Evaluation with Thinking-LLM-as-a-Judge, Saha et al.,
- Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
- Dynamic Scaling of Unit Tests for Code Reward Modeling, Ma et al.,
- Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models, Liu et al.,
- Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework, Sun et al.,
- The lessons of developing process reward models in mathematical reasoning, Zhang et al.,
- Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback, Lin et al.,
- Zero-Shot Verification-guided Chain of Thoughts, Chowdhury et al.,
- Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al.,
- Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models, Gu et al.,
- Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?, Zhang et al.,
- Uncertainty-Aware Step-wise Verification with Generative Reward Models, Ye et al.,
- Unveiling and Causalizing CoT: A Causal Perspective, Fu et al.,
- Diverse Inference and Verification for Advanced Reasoning, Drori et al.,
- Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges, Shrestha et al.,
- A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics, Wei et al.,
- Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models, Zhou et al.,
- ACECODER: Acing Coder RL via Automated Test-Case Synthesis, Zeng et al.,
- RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation, Zhou et al.,
- Process Reward Models for LLM Agents: Practical Framework and Directions, Choudhury et al.,
- Process reinforcement through implicit rewards, Cui et al.,
- Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning, Xu et al.,
- VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data, Zeng et al.,
- Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values, Zhang et al.,
- Teaching Language Models to Critique via Reinforcement Learning, Xie et al.,
- Uncertainty-Aware Search and Value Models: Mitigating Search Scaling Flaws in LLMs, Yu et al.,
- AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification, Tan et al.,
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems, Peng et al.,
- Self-critiquing models for assisting human evaluators, Saunders et al.,
- Self-Refine: Iterative Refinement with Self-Feedback, Madaan et al.,
- Reflexion: language agents with verbal reinforcement learning, Shinn et al.,
- Towards Mitigating LLM Hallucination via Self Reflection, Ji et al.,
- Grace: Discriminator-guided chain-of-thought reasoning, Khalifa et al.,
- Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies, Pan et al.,
- Learning from mistakes makes llm better reasoner, An et al.,
- Reflection-tuning: Data recycling improves llm instruction-tuning, Li et al.,
- Toward Adaptive Reasoning in Large Language Models with Thought Rollback, Chen et al.,
- Progressive-Hint Prompting Improves Reasoning in Large Language Models, Zheng et al.,
- SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning, Miao et al.,
- Advancing Large Language Model Attribution through Self-Improving, Huang et al.,
- REFINER: Reasoning Feedback on Intermediate Representations, Paul et al.,
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search, Zhang et al.,
- LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints, Ferraz et al.,
- Teaching Large Language Models to Self-Debug, Chen et al.,
- Recursive Introspection: Teaching Language Model Agents How to Self-Improve, Qu et al.,
- Learning to check: Unleashing potentials for self-correction in large language models, Zhang et al.,
- GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements, Havrilla et al.,
- Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic, Zhao et al.,
- General purpose verification for chain of thought prompting, Vacareanu et al.,
- Enhancing visual-language modality alignment in large vision language models via self-improvement, Wang et al.,
- Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback, Gao et al.,
- Large language models have intrinsic self-correction ability, Liu et al.,
- Accessing gpt-4 level mathematical olympiad solutions via monte carlo tree self-refine with llama-3 8b, Zhang et al.,
- CoT Rerailer: Enhancing the Reliability of Large Language Models in Complex Reasoning Tasks through Error Detection and Correction, Wan et al.,
- Mutual reasoning makes smaller llms stronger problem-solvers, Qi et al.,
- S$^3$c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners, Yan et al.,
- Training language models to self-correct via reinforcement learning, Kumar et al.,
- Enhancing Mathematical Reasoning in LLMs by Stepwise Correction, Wu et al.,
- O1 Replication Journey: A Strategic Progress Report--Part 1, Qin et al.,
- Enhancing llm reasoning via critique models with test-time and training-time supervision, Xi et al.,
- Vision-language models can self-improve reasoning via reflection, Cheng et al.,
- Confidence vs Critique: A Decomposition of Self-Correction Capability for LLMs, Yang et al.,
- LLM2: Let Large Language Models Harness System 2 Reasoning, Yang et al.,
- Understanding the Dark Side of LLMs' Intrinsic Self-Correction, Zhang et al.,
- 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient, Zeng et al.,
- BackMATH: Towards Backward Reasoning for Solving Math Problems Step by Step, Zhang et al.,
- Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents, He et al.,
- CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis, Zhang et al.,
- Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
- ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding, Sun et al.,
- Critique fine-tuning: Learning to critique is more effective than learning to imitate, Wang et al.,
- RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques, Tang et al.,
- ProgCo: Program Helps Self-Correction of Large Language Models, Song et al.,
- URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics, Luo et al.,
- S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning, Ma et al.,
- ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates, Yang et al.,
- ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification, Lee et al.,
- Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models, Yang et al.,
- Iterative Deepening Sampling for Large Language Models, Chen et al.,
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!, Li et al.,
- MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification, Sun et al.,
- ARIES: Stimulating Self-Refinement of Large Language Models by Iterative Preference Optimization, Zeng et al.,
- Greedy Policy Search: A Simple Baseline for Learnable Test-Time Augmentation, Lyzhov et al.,
- Can language models learn from explanations in context?, Lampinen et al.,
- Star: Bootstrapping reasoning with reasoning, Zelikman et al.,
- Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters, Wang et al.,
- What Makes Chain-of-Thought Prompting Effective? A Counterfactual Study, Madaan et al.,
- The Expressive Power of Transformers with Chain of Thought, Merrill et al.,
- Chain of Thought Empowers Transformers to Solve Inherently Serial Problems, Li et al.,
- Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective, Feng et al.,
- Causal Abstraction for Chain-of-Thought Reasoning in Arithmetic Word Problems, Tan et al.,
- How Large Language Models Implement Chain-of-Thought?, Wang et al.,
- How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, Hanna et al.,
- Why think step by step? Reasoning emerges from the locality of experience, Prystawski et al.,
- Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data, Shum et al.,
- MoT: Memory-of-Thought Enables ChatGPT to Self-Improve, Li et al.,
- LAMBADA: Backward Chaining for Automated Reasoning in Natural Language, Kazemi et al.,
- MathPrompter: Mathematical Reasoning using Large Language Models, Imani et al.,
- System 2 Attention (is something you might need too), Weston et al.,
- Chain of Thoughtlessness? An Analysis of CoT in Planning, Stechly et al.,
- Chain-of-Thought Reasoning Without Prompting, Wang et al.,
- When Do Program-of-Thought Works for Reasoning?, Bi et al.,
- Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al.,
- The Impact of Reasoning Step Length on Large Language Models, Jin et al.,
- Explainable AI in Large Language Models: A Review, Sauhandikaa et al.,
- Do Large Language Models Latently Perform Multi-Hop Reasoning?, Yang et al.,
- MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Sprague et al.,
- Not All LLM Reasoners Are Created Equal, Hosseini et al.,
- DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models, Pan et al.,
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Welleck et al.,
- How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning, Dutta et al.,
- How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Huang et al.,
- Exploring the compositional deficiency of large language models in mathematical reasoning, Zhao et al.,
- Xai meets llms: A survey of the relation between explainable ai and large language models, Cambria et al.,
- Large language monkeys: Scaling inference compute with repeated sampling, Brown et al.,
- Compositional Hardness of Code in Large Language Models--A Probabilistic Perspective, Wolf et al.,
- What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective, Li et al.,
- When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1, McCoy et al.,
- Thinking llms: General instruction following with thought generation, Wu et al.,
- What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-Boosted Mathematical Reasoning, Ma et al.,
- Do not think that much for 2+3=? on the overthinking of o1-like llms, Chen et al.,
- Openai o1 system card, Jaech et al.,
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models, Song et al.,
- Open R1, Team et al.,
- There May Not be Aha Moment in R1-Zero-like Training -- A Pilot Study, Liu et al.,
- Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers?, Jin et al.,
- OverThink: Slowdown Attacks on Reasoning LLMs, Kumar et al.,
- Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al.,
- Sft memorizes, rl generalizes: A comparative study of foundation model post-training, Chu et al.,
- On the reasoning capacity of ai models and how to quantify it, Radha et al.,
- Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, Xiang et al.,
- Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning, Gan et al.,
- Complexity Control Facilitates Reasoning-Based Compositional Generalization in Transformers, Zhang et al.,
- Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al.,
- GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?, Zhou et al.,
- Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers, Amiri et al.,
- When More is Less: Understanding Chain-of-Thought Length in LLMs, Wu et al.,
- ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al.,
- Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al.,
- Examining False Positives under Inference Scaling for Mathematical Reasoning, Wang et al.,
- Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective, Jia et al.,
- Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models, Yu et al.,
- Think or Step-by-Step? UnZIPping the Black Box in Zero-Shot Prompts, Sadr et al.,
- Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking, Zhang et al.,
- The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It, Bertolazzi et al.,
- How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training, Ou et al.,
- Language Models Can Predict Their Own Behavior, Ashok et al.,
- Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning, Ma et al.,
- The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks, Cuadron et al.,
- PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models, Anderson et al.,
- Scaling Test-Time Compute Without Verification or RL is Suboptimal, Setlur et al.,
- On the measure of intelligence, Chollet et al.,
- Measuring Mathematical Problem Solving With the MATH Dataset, Hendrycks et al.,
- What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams, Jin et al.,
- Training verifiers to solve math word problems, Cobbe et al.,
- ScienceWorld: Is your Agent Smarter than a 5th Grader?, Wang et al.,
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents, Yao et al.,
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering, Lu et al.,
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, Suzgun et al.,
- ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning, Golovneva et al.,
- ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness, Prasad et al.,
- Making Language Models Better Reasoners with Step-Aware Verifier, Li et al.,
- A Multi-Modal Neural Geometric Solver with Textual Clauses Parsed from Diagram, Zhang et al.,
- AI for Math or Math for AI? On the Generalization of Learning Mathematical Problem Solving, Zhou et al.,
- AIME 2024, AI-MO et al.,
- Let's verify step by step, Lightman et al.,
- AMC 2023, AI-MO et al.,
- OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems, He et al.,
- Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning, Gulati et al.,
- SWE-bench: Can Language Models Resolve Real-world Github Issues?, Jimenez et al.,
- GPQA: A Graduate-Level Google-Proof Q\&A Benchmark, Rein et al.,
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark, Wang et al.,
- Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al.,
- Evaluating LLMs at Detecting Errors in LLM Responses, Kamoi et al.,
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs, Zeng et al.,
- CriticBench: Benchmarking LLMs for Critique-Correct Reasoning, Lin et al.,
- WebArena: A Realistic Web Environment for Building Autonomous Agents, Zhou et al.,
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, Xie et al.,
- CogAgent: A Visual Language Model for GUI Agents, Hong et al.,
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, Yue et al.,
- MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts, Lu et al.,
- Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset, Wang et al.,
- Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?, Zhang et al.,
- M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought, Chen et al.,
- Benchmarking large language models on answering and explaining challenging medical questions, Chen et al.,
- How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Huang et al.,
- Achieving >97\% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems, Zhong et al.,
- Mhpp: Exploring the capabilities and limitations of language models beyond basic code generation, Dai et al.,
- Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots, Wu et al.,
- Can llms generate novel research ideas? a large-scale human study with 100+ nlp researchers, Si et al.,
- Mle-bench: Evaluating machine learning agents on machine learning engineering, Chan et al.,
- EVOLvE: Evaluating and Optimizing LLMs For Exploration, Nie et al.,
- Judgebench: A benchmark for evaluating llm-based judges, Tan et al.,
- Errorradar: Benchmarking complex mathematical reasoning of multimodal large language models via error detection, Yan et al.,
- HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks, Zhang et al.,
- Chain of ideas: Revolutionizing research via novel idea development with llm agents, Li et al.,
- Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai, Glazer et al.,
- HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation, Yu et al.,
- Medec: A benchmark for medical error detection and correction in clinical notes, Abacha et al.,
- CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al.,
- AIME 2025, OpenCompass et al.,
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, Jain et al.,
- CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models, Li et al.,
- Open Deep Research, Team et al.,
- JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models, Chen et al.,
- ToolComp: A Multi-Tool Reasoning \& Process Supervision Benchmark, Nath et al.,
- HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI, Pricope et al.,
- Humanity's Last Exam, Phan et al.,
- MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding, Zuo et al.,
- PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models, Song et al.,
- Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks, Wang et al.,
- PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models, Anderson et al.,
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning, Lin et al.,
- Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring, Heyman et al.,
- Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models, Yasunaga et al.,
- PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning, Zhang et al.,
- Text2World: Benchmarking Large Language Models for Symbolic World Model Generation, Hu et al.,
- Generating Symbolic World Models via Test-time Scaling of Large Language Models, Yu et al.,
- MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations, Huang et al.,
- EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking, Wei et al.,
- SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines, Du et al.,
- Evaluating Step-by-step Reasoning Traces: A Survey, Lee et al.,
- Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges, Shrestha et al.,
- Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al.,
- CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models, Zhang et al.,
- WebGames: Challenging General-Purpose Web-Browsing AI Agents, Thomas et al.,
- VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model, Zheng et al.,
- Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration, Wang et al.,
- EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges, Wang et al.,
- Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities, Wang et al.,
- Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research, Wu et al.,
- Large language models as commonsense knowledge for large-scale task planning, Zhao et al.,
- Solving Math Word Problems via Cooperative Reasoning induced Language Models, Zhu et al.,
- Reasoning with language model is planning with world model, Hao et al.,
- Tree-Planner: Efficient Close-loop Task Planning with Large Language Models, Hu et al.,
- Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search, Light et al.,
- Agents Thinking Fast and Slow: A Talker-Reasoner Architecture, Christakopoulou et al.,
- MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems, Lei et al.,
- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al.,
- ADaPT: As-Needed Decomposition and Planning with Language Models, Prasad et al.,
- Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models, Zhou et al.,
- Mixture-of-agents enhances large language model capabilities, Wang et al.,
- Tree search for language model agents, Koh et al.,
- Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large language model, Hu et al.,
- EVOLvE: Evaluating and Optimizing LLMs For Exploration, Nie et al.,
- Titans: Learning to memorize at test time, Behrouz et al.,
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success, Kim et al.,
- Guiding language model reasoning with planning tokens, Wang et al.,
- Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al.,
- DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models, Pan et al.,
- Synergy-of-thoughts: Eliciting efficient reasoning in hybrid language models, Shang et al.,
- Distilling system 2 into system 1, Yu et al.,
- Concise thoughts: Impact of output length on llm reasoning and cost, Nayab et al.,
- Litesearch: Efficacious tree search for llm, Wang et al.,
- Uncertainty-Guided Optimization on Large Language Model Search Trees, Grosse et al.,
- Kvsharer: Efficient inference via layer-wise dissimilar KV cache sharing, Yang et al.,
- Interpretable contrastive monte carlo tree search reasoning, Gao et al.,
- Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces, Su et al.,
- Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding, Chen et al.,
- Token-budget-aware llm reasoning, Han et al.,
- B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners, Zeng et al.,
- C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness, Kang et al.,
- Training large language models to reason in a continuous latent space, Hao et al.,
- CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al.,
- On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes, Chang et al.,
- O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning, Luo et al.,
- Reward-Guided Speculative Decoding for Efficient LLM Reasoning, Liao et al.,
- Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization, Yu et al.,
- Efficient Reasoning with Hidden Thinking, Shen et al.,
- On the Query Complexity of Verifier-Assisted Language Generation, Botta et al.,
- TokenSkip: Controllable Chain-of-Thought Compression in LLMs, Xia et al.,
- Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation, Du et al.,
- Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE, Huang et al.,
- Towards Reasoning Ability of Small Language Models, Srivastava et al.,
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs, Ji et al.,
- Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models, Chijiwa et al.,
- MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification, Sun et al.,
- Language Models Can Predict Their Own Behavior, Ashok et al.,
- CoT-Valve: Length-Compressible Chain-of-Thought Tuning, Ma et al.,
- Training Language Models to Reason Efficiently, Arora et al.,
- Chain of Draft: Thinking Faster by Writing Less, Xu et al.,
- Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning, Wang et al.,
- Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking, Ziabari et al.,
- Dynamic Parallel Tree Search for Efficient LLM Reasoning, Ding et al.,
- Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models, Cui et al.,
- SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs, Xu et al.,
- LightThinker: Thinking Step-by-Step Compression, Zhang et al.,
- Best of Both Worlds: Harmonizing LLM Capabilities in Decision-Making and Question-Answering for Treatment Regimes, Liu et al.,
- Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation, Wang et al.,
- Stream of search (sos): Learning to search in language, Gandhi et al.,
- CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing, Yang et al.,
- Disentangling memory and reasoning ability in large language models, Jin et al.,
- Huatuogpt-o1, towards medical complex reasoning with llms, Chen et al.,
- RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement, Jiang et al.,
- Open Deep Research, Team et al.,
- Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study, Zhao et al.,
- O1 Replication Journey--Part 3: Inference-time Scaling for Medical Reasoning, Huang et al.,
- MedS$^3$: Towards Medical Small Language Models with Self-Evolved Slow Thinking, Jiang et al.,
- Search-o1: Agentic search-enhanced large reasoning models, Li et al.,
- Chain-of-Retrieval Augmented Generation, Wang et al.,
- ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al.,
- Large Language Models for Recommendation with Deliberative User Preference Alignment, Fang et al.,
- DeepRAG: Thinking to Retrieval Step by Step for Large Language Models, Guan et al.,
- HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation, Liu et al.,
- O1 Embedder: Let Retrievers Think Before Action, Yan et al.,
- Towards Robust Legal Reasoning: Harnessing Logical LLMs in Law, Kant et al.,
- OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning, Lu et al.,
- Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al.,
- Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting, Huang et al.,
- A Tree-of-Thoughts to Broaden Multi-step Reasoning across Languages, Ranaldi et al.,
- Enhancing Advanced Visual Reasoning Ability of Large Language Models, Li et al.,
- AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought, Zhang et al.,
- xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning, Chai et al.,
- Multilingual large language model: A survey of resources, taxonomy and frontiers, Qin et al.,
- DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought, Wang et al.,
- A survey of multilingual large language models, Qin et al.,
- Demystifying Multilingual Chain-of-Thought in Process Reward Modeling, Wang et al.,
- The Multilingual Mind: A Survey of Multilingual Reasoning in Language Models, Ghosh et al.,
- Multimodal Chain-of-Thought Reasoning in Language Models, Zhang et al.,
- M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought, Chen et al.,
- Enhancing Advanced Visual Reasoning Ability of Large Language Models, Li et al.,
- ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback, Byun et al.,
- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al.,
- Large Language Models Can Self-Correct with Minimal Effort, Wu et al.,
- Q*: Improving multi-step reasoning for llms with deliberative planning, Wang et al.,
- A survey on evaluation of multimodal large language models, Huang et al.,
- What factors affect multi-modal in-context learning? an in-depth exploration, Qin et al.,
- Insight-v: Exploring long-chain visual reasoning with multimodal large language models, Dong et al.,
- AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning, Xiang et al.,
- Enhancing the reasoning ability of multimodal large language models via mixed preference optimization, Wang et al.,
- Llava-o1: Let vision language models reason step-by-step, Xu et al.,
- Slow Perception: Let's Perceive Geometric Figures Step-by-step, Wei et al.,
- Diving into Self-Evolving Training for Multimodal Reasoning, Liu et al.,
- Scaling inference-time search with vision value model for improved visual comprehension, Wang et al.,
- CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al.,
- Visual Agents as Fast and Slow Thinkers, Sun et al.,
- Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model, Ma et al.,
- BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning, Zhang et al.,
- Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark, Hao et al.,
- Virgo: A Preliminary Exploration on Reproducing o1-like MLLM, Du et al.,
- Llamav-o1: Rethinking step-by-step visual reasoning in llms, Thawakar et al.,
- Inference-time scaling for diffusion models beyond scaling denoising steps, Ma et al.,
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought, Li et al.,
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step, Guo et al.,
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking, Wu et al.,
- Larger and more instructable language models become less reliable, Zhou et al.,
- The Impact of Reasoning Step Length on Large Language Models, Jin et al.,
- Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al.,
- Can Large Language Models Understand You Better? An MBTI Personality Detection Dataset Aligned with Population Traits, Li et al.,
- OverThink: Slowdown Attacks on Reasoning LLMs, Kumar et al.,
- o3-mini vs DeepSeek-R1: Which One is Safer?, Arrieta et al.,
- Efficient Reasoning with Hidden Thinking, Shen et al.,
- Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking, Cheng et al.,
- Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection, Zhao et al.,
- Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies, Parmar et al.,
- Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation, Arrieta et al.,
- International AI Safety Report, Bengio et al.,
- MetaSC: Test-Time Safety Specification Optimization for Language Models, Gallego et al.,
- Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment, Wang et al.,
- The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1, Zhou et al.,
- Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models, Lu et al.,
- Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?, Bengio et al.,
- Emergent Response Planning in LLM, Dong et al.,
- Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models, Kharinaev et al.,
- H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking, Kuo et al.,
- BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack, Zhu et al.,
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities, Jiang et al.,