Analysis and Evaluation
Analysis & Explanation for Long CoT
- Concrete problems in AI safety, Amodei et al., 
 - Greedy Policy Search: A Simple Baseline for Learnable Test-Time Augmentation, Lyzhov et al., 
 - The effects of reward misspecification: Mapping and mitigating misaligned models, Pan et al., 
 - Goal misgeneralization in deep reinforcement learning, Di Langosco et al., 
 - Star: Bootstrapping reasoning with reasoning, Zelikman et al., 
 - Can language models learn from explanations in context?, Lampinen et al., 
 - The Expressive Power of Transformers with Chain of Thought, Merrill et al., 
 - Chain of Thought Empowers Transformers to Solve Inherently Serial Problems, Li et al., 
 - Mathprompter: Mathematical reasoning using large language models, Imani et al., 
 - Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters, Wang et al., 
 - LAMBADA: Backward Chaining for Automated Reasoning in Natural Language, Kazemi et al., 
 - Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective, Feng et al., 
 - Why think step by step? Reasoning emerges from the locality of experience, Prystawski et al., 
 - How Large Language Models Implement Chain-of-Thought?, Wang et al., 
 - How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, Hanna et al., 
 - System 2 Attention (is something you might need too), Weston et al., 
 - What Makes Chain-of-Thought Prompting Effective? A Counterfactual Study, Madaan et al., 
 - Causal Abstraction for Chain-of-Thought Reasoning in Arithmetic Word Problems, Tan et al., 
 - Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data, Shum et al., 
 - MoT: Memory-of-Thought Enables ChatGPT to Self-Improve, Li et al., 
 - When Do Program-of-Thoughts Work for Reasoning?, Bi et al.,
 - Explainable AI in Large Language Models: A Review, Sauhandikaa et al., 
 - MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Sprague et al., 
 - How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Huang et al., 
 - Large language monkeys: Scaling inference compute with repeated sampling, Brown et al., 
 - Xai meets llms: A survey of the relation between explainable ai and large language models, Cambria et al., 
 - The llama 3 herd of models, Dubey et al., 
 - How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning, Dutta et al., 
 - The Impact of Reasoning Step Length on Large Language Models, Jin et al., 
 - Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, Wu et al., 
 - Do Large Language Models Latently Perform Multi-Hop Reasoning?, Yang et al., 
 - An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs, Rai et al., 
 - Chain of Thoughtlessness? An Analysis of CoT in Planning, Stechly et al., 
 - Chain-of-Thought Reasoning Without Prompting, Wang et al., 
 - Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., 
 - Compositional Hardness of Code in Large Language Models--A Probabilistic Perspective, Wolf et al., 
 - What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective, Li et al., 
 - When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1, McCoy et al., 
 - Not All LLM Reasoners Are Created Equal, Hosseini et al., 
 - Thinking llms: General instruction following with thought generation, Wu et al., 
 - Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning Through Trap Problems, Zhao et al., 
 - DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models, Pan et al., 
 - From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Welleck et al., 
 - What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-Boosted Mathematical Reasoning, Ma et al., 
 - Qwen2.5 technical report, Yang et al., 
 - Do not think that much for 2+3=? on the overthinking of o1-like llms, Chen et al.,
 - Openai o1 system card, Jaech et al., 
 - Processbench: Identifying process errors in mathematical reasoning, Zheng et al., 
 - There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study, Liu et al., 
 - Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., 
 - Open R1, Team et al., 
 - On the reasoning capacity of ai models and how to quantify it, Radha et al., 
 - Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers?, Jin et al., 
 - Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, Xiang et al.,
 - Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning, Gan et al., 
 - Complexity Control Facilitates Reasoning-Based Compositional Generalization in Transformers, Zhang et al., 
 - Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models, Xu et al., 
 - PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models, Song et al., 
 - Think or Step-by-Step? UnZIPping the Black Box in Zero-Shot Prompts, Sadr et al., 
 - Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al., 
 - GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?, Zhou et al., 
 - Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers, Amiri et al., 
 - The Lookahead Limitation: Why Multi-Operand Addition is Hard for LLMs, Baeumel et al., 
 - When More is Less: Understanding Chain-of-Thought Length in LLMs, Wu et al., 
 - ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al., 
 - Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al., 
 - Examining False Positives under Inference Scaling for Mathematical Reasoning, Wang et al., 
 - Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective, Jia et al., 
 - Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking, Zhang et al., 
 - The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It, Bertolazzi et al., 
 - How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training, Ou et al., 
 - Language Models Can Predict Their Own Behavior, Ashok et al., 
 - Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning, Ma et al., 
 - The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks, Cuadron et al., 
 - OVERTHINKING: Slowdown Attacks on Reasoning LLMs, Kumar et al., 
 - PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models, Anderson et al., 
 - Scaling Test-Time Compute Without Verification or RL is Suboptimal, Setlur et al., 
 - The Relationship Between Reasoning and Performance in Large Language Models--o3 (mini) Thinks Harder, Not Longer, Ballon et al., 
 - Unveiling and Causalizing CoT: A Causal Perspective, Fu et al.,
 - Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems, Peng et al., 
 - Layer by Layer: Uncovering Hidden Representations in Language Models, Skean et al., 
 - Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models, Yu et al., 
 - Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs, Gandhi et al., 
 - R1-Zero's" Aha Moment" in Visual Reasoning on a 2B Non-SFT Model, Zhou et al., 
 - MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning, Meng et al., 
 - Reasoning Beyond Limits: Advances and Open Problems for LLMs, Ferrag et al., 
 - Process-based Self-Rewarding Language Models, Zhang et al., 
 - Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation, Baker et al., 
 - Rewarding Curse: Analyze and Mitigate Reward Modeling Issues for LLM Reasoning, Li et al., 
 - Enhancing llm reliability via explicit knowledge boundary modeling, Zheng et al., 
 - Style over Substance: Distilled Language Models Reason Via Stylistic Replication, Lippmann et al., 
 - Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning, Wang et al., 
 - Understanding Aha Moments: from External Observations to Internal Mechanisms, Yang et al., 
 - Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead, Balachandran et al., 
 
Long CoT Evaluations
- On the measure of intelligence, Chollet et al., 
 - What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams, Jin et al., 
 - Training verifiers to solve math word problems, Cobbe et al., 
 - Measuring Mathematical Problem Solving With the MATH Dataset, Hendrycks et al., 
 - WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents, Yao et al., 
 - Competition-Level Code Generation with AlphaCode, Li et al., 
 - Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering, Lu et al., 
 - ScienceWorld: Is your Agent Smarter than a 5th Grader?, Wang et al., 
 - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning, Golovneva et al., 
 - A Multi-Modal Neural Geometric Solver with Textual Clauses Parsed from Diagram, Zhang et al., 
 - Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, Suzgun et al., 
 - Making Language Models Better Reasoners with Step-Aware Verifier, Li et al., 
 - Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning, Bao et al., 
 - ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness, Prasad et al., 
 - AI for Math or Math for AI? On the Generalization of Learning Mathematical Problem Solving, Zhou et al., 
 - OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI, Huang et al., 
 - Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning, Gulati et al., 
 - Let's verify step by step, Lightman et al., 
 - SWE-bench: Can Language Models Resolve Real-world Github Issues?, Jimenez et al., 
 - WebArena: A Realistic Web Environment for Building Autonomous Agents, Zhou et al., 
 - MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts, Lu et al., 
 - Benchmarking large language models on answering and explaining challenging medical questions, Chen et al., 
 - Rewardbench: Evaluating reward models for language modeling, Lambert et al., 
 - How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Huang et al., 
 - Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems, Zhong et al.,
 - Mhpp: Exploring the capabilities and limitations of language models beyond basic code generation, Dai et al., 
 - Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots, Wu et al., 
 - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs, Zeng et al., 
 - CogAgent: A Visual Language Model for GUI Agents, Hong et al., 
 - AIME 2024, AI-MO et al., 
 - AMC 2023, AI-MO et al., 
 - GPQA: A Graduate-Level Google-Proof Q&A Benchmark, Rein et al., 
 - Evaluating LLMs at Detecting Errors in LLM Responses, Kamoi et al., 
 - M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought, Chen et al., 
 - OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems, He et al., 
 - CriticBench: Benchmarking LLMs for Critique-Correct Reasoning, Lin et al., 
 - PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns, Chia et al., 
 - Can LLMs Solve Molecule Puzzles? A Multimodal Benchmark for Molecular Structure Elucidation, Guo et al., 
 - Can llms generate novel research ideas? a large-scale human study with 100+ nlp researchers, Si et al., 
 - MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark, Wang et al., 
 - Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., 
 - OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, Xie et al., 
 - Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset, Wang et al., 
 - Mle-bench: Evaluating machine learning agents on machine learning engineering, Chan et al., 
 - EVOLvE: Evaluating and Optimizing LLMs For Exploration, Nie et al., 
 - Judgebench: A benchmark for evaluating llm-based judges, Tan et al., 
 - Errorradar: Benchmarking complex mathematical reasoning of multimodal large language models via error detection, Yan et al., 
 - Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?, Zhang et al., 
 - HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks, Zhang et al., 
 - Chain of ideas: Revolutionizing research via novel idea development with llm agents, Li et al., 
 - Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai, Glazer et al., 
 - HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation, Yu et al., 
 - Processbench: Identifying process errors in mathematical reasoning, Zheng et al., 
 - Medec: A benchmark for medical error detection and correction in clinical notes, Abacha et al., 
 - A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges, Yan et al., 
 - CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al., 
 - LiveBench: A Challenging, Contamination-Limited LLM Benchmark, White et al., 
 - ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark, Nath et al., 
 - HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI, Pricope et al., 
 - LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, Jain et al., 
 - JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models, Chen et al., 
 - Humanity's Last Exam, Phan et al., 
 - MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding, Zuo et al., 
 - PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models, Song et al., 
 - Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks, Wang et al., 
 - CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models, Li et al., 
 - ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation, Yang et al., 
 - Theoretical Physics Benchmark (TPBench)--a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics, Chung et al., 
 - ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning, Lin et al., 
 - Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring, Heyman et al., 
 - Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models, Yasunaga et al., 
 - CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models, Zhang et al., 
 - PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning, Zhang et al., 
 - Text2World: Benchmarking Large Language Models for Symbolic World Model Generation, Hu et al., 
 - Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios, Wang et al., 
 - DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking, Li et al., 
 - AIME 2025, OpenCompass et al., 
 - ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning, Huang et al., 
 - MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations, Huang et al., 
 - ProBench: Benchmarking Large Language Models in Competitive Programming, Yang et al., 
 - EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking, Wei et al., 
 - DivIL: Unveiling and Addressing Over-Invariance for Out-of-Distribution Generalization, Wang et al.,
 - SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines, Du et al., 
 - DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning, Xu et al., 
 - Evaluating Step-by-step Reasoning Traces: A Survey, Lee et al., 
 - Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges, Shrestha et al., 
 - Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al., 
 - Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?, He et al., 
 - FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving, Chen et al., 
 - WebGames: Challenging General-Purpose Web-Browsing AI Agents, Thomas et al., 
 - VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model, Zheng et al., 
 - Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration, Wang et al., 
 - Generating Symbolic World Models via Test-time Scaling of Large Language Models, Yu et al., 
 - EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges, Wang et al., 
 - Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities, Wang et al., 
 - Large Language Models Penetration in Scholarly Writing and Peer Review, Zhou et al., 
 - Towards an AI co-scientist, Gottweis et al., 
 - Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research, Wu et al., 
 - Open Deep Research, Team et al., 
 - QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?, Li et al., 
 - Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad, Petrov et al., 
 - Benchmarking Reasoning Robustness in Large Language Models, Yu et al., 
 - From Code to Courtroom: LLMs as the New Software Judges, He et al., 
 - Interacting with AI Reasoning Models: Harnessing" Thoughts" for AI-Driven Software Engineering, Treude et al., 
 - Can Frontier LLMs Replace Annotators in Biomedical Text Mining? Analyzing Challenges and Exploring Solutions, Zhao et al., 
 - An evaluation of DeepSeek Models in Biomedical Natural Language Processing, Zhan et al., 
 - Cognitive-Mental-LLM: Leveraging Reasoning in Large Language Models for Mental Health Prediction via Online Text, Patil et al., 
 - Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models, Zhou et al., 
 - UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning, Lu et al., 
 - Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models, Jia et al., 
 - MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems, Ye et al., 
 - LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?, Tang et al., 
 - Enabling AI Scientists to Recognize Innovation: A Domain-Agnostic Algorithm for Assessing Novelty, Wang et al., 
 
Deep Reasoning
Deep Reasoning Format
- Generative language modeling for automated theorem proving, Polu et al., 
 - Multi-step deductive reasoning over natural language: An empirical study on out-of-distribution generalisation, Bao et al., 
 - Reflection of thought: Inversely eliciting numerical reasoning in language models via solving linear systems, Zhou et al., 
 - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei et al., 
 - Star: Bootstrapping reasoning with reasoning, Zelikman et al., 
 - Gpt-4 technical report, Achiam et al., 
 - Mathprompter: Mathematical reasoning using large language models, Imani et al., 
 - Llama 2: Open foundation and fine-tuned chat models, Touvron et al., 
 - PAL: Program-aided Language Models, Gao et al., 
 - Code llama: Open foundation models for code, Roziere et al., 
 - Mammoth: Building math generalist models through hybrid instruction tuning, Yue et al., 
 - Tora: A tool-integrated reasoning agent for mathematical problem solving, Gou et al., 
 - Deductive Verification of Chain-of-Thought Reasoning, Ling et al., 
 - Mistral 7B, Jiang et al., 
 - Guiding language model reasoning with planning tokens, Wang et al., 
 - Faithful Chain-of-Thought Reasoning, Lyu et al., 
 - Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks, Chen et al., 
 - Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al., 
 - Tinygsm: achieving >80% on gsm8k with small language models, Liu et al.,
 - ChatLogic: Integrating Logic Programming with Large Language Models for Multi-step Reasoning, Wang et al., 
 - NuminaMath, LI et al., 
 - MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models, Yu et al., 
 - MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning, Wang et al., 
 - DeepSeek-Coder: When the Large Language Model Meets Programming--The Rise of Code Intelligence, Guo et al., 
 - MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Sprague et al., 
 - Internlm-math: Open math large language models toward verifiable reasoning, Ying et al., 
 - Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al., 
 - Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes, Chen et al., 
 - Quiet-star: Language models can teach themselves to think before speaking, Zelikman et al., 
 - From explicit cot to implicit cot: Learning to internalize cot step by step, Deng et al., 
 - MathDivide: Improved mathematical reasoning by large language models, Srivastava et al., 
 - Certified Deductive Reasoning with Language Models, Poesia et al., 
 - Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models, Xu et al., 
 - OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning, Yu et al., 
 - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models, Chen et al., 
 - The llama 3 herd of models, Dubey et al., 
 - Qwen2 Technical Report, Yang et al., 
 - Lean-star: Learning to interleave thinking and proving, Lin et al., 
 - Chain of Code: Reasoning with a Language Model-Augmented Code Emulator, Li et al., 
 - Siam: Self-improving code-assisted mathematical reasoning of large language models, Yu et al., 
 - AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought, Zhang et al., 
 - Large language models are not strong abstract reasoners, Gendron et al., 
 - Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al., 
 - Qwen2.5-coder technical report, Hui et al., 
 - CoMAT: Chain of mathematically annotated thought improves mathematical reasoning, Leang et al., 
 - Planning in Natural Language Improves LLM Search for Code Generation, Wang et al., 
 - Formal mathematical reasoning: A new frontier in ai, Yang et al., 
 - Training large language models to reason in a continuous latent space, Hao et al., 
 - SKIntern: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models, Liao et al., 
 - Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., 
 - CodePlan: Unlocking Reasoning Potential in Large Language Models by Scaling Code-form Planning, Wen et al., 
 - Efficient Reasoning with Hidden Thinking, Shen et al., 
 - Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions, Ranaldi et al., 
 - Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, Geiping et al., 
 - Reasoning with Latent Thoughts: On the Power of Looped Transformers, Saunshi et al., 
 - CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction, Li et al., 
 - Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments, Payoungkhamdee et al., 
 - Beyond Limited Data: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving, Dong et al., 
 - Theorem Prover as a Judge for Synthetic Data Generation, Leang et al., 
 - Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation, Zhang et al., 
 - LLM Pretraining with Continuous Concepts, Tack et al., 
 - Scalable Language Models with Posterior Inference of Latent Thought Vectors, Kong et al., 
 - Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, Chen et al., 
 - Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences, Chen et al., 
 - Reasoning to Learn from Latent Thoughts, Ruan et al., 
 
Deep Reasoning Learning
- Thinking fast and slow with deep learning and tree search, Anthony et al., 
 - Training verifiers to solve math word problems, Cobbe et al., 
 - Chain of Thought Imitation with Procedure Cloning, Yang et al., 
 - Star: Bootstrapping reasoning with reasoning, Zelikman et al., 
 - Large Language Models Are Reasoning Teachers, Ho et al., 
 - Llama 2: Open foundation and fine-tuned chat models, Touvron et al., 
 - Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models, Shao et al., 
 - Instruction tuning for large language models: A survey, Zhang et al., 
 - Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct, Luo et al., 
 - Reinforced self-training (rest) for language modeling, Gulcehre et al., 
 - Training Chain-of-Thought via Latent-Variable Inference, Hoffman et al., 
 - Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al., 
 - Mistral 7B, Jiang et al., 
 - RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment, Dong et al., 
 - The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning, Kim et al., 
 - Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al., 
 - Training large language models for reasoning through reverse curriculum reinforcement learning, Xi et al., 
 - Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes, Chen et al., 
 - Exploring Iterative Enhancement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models, Bao et al., 
 - Common 7b language models already possess strong math capabilities, Li et al., 
 - Key-point-driven data synthesis with its enhancement on mathematical reasoning, Huang et al., 
 - Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, Singh et al., 
 - The llama 3 herd of models, Dubey et al., 
 - Qwen2 Technical Report, Yang et al., 
 - V-STaR: Training Verifiers for Self-Taught Reasoners, Hosseini et al., 
 - ReAct Meets ActRe: Autonomous Annotation of Agent Trajectories for Contrastive Self-Training, Yang et al., 
 - Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models, Puerto et al., 
 - Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation, Liu et al., 
 - Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al., 
 - Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus, Morishita et al., 
 - DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving, Tong et al., 
 - AlphaMath Almost Zero: Process Supervision without Process, Chen et al., 
 - Iterative Reasoning Preference Optimization, Pang et al., 
 - Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs, Zhang et al., 
 - On memorization of large language models in logical reasoning, Xie et al., 
 - Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data, Toshniwal et al., 
 - TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees, Liao et al., 
 - Cream: Consistency Regularized Self-Rewarding Language Models, Wang et al., 
 - Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning, Wang et al., 
 - O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?, Huang et al., 
 - Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards, Hwang et al., 
 - On the impact of fine-tuning on chain-of-thought reasoning, Lobo et al., 
 - Weak-to-Strong Reasoning, Yang et al., 
 - System-2 Mathematical Reasoning via Enriched Instruction Tuning, Cai et al., 
 - Acemath: Advancing frontier math reasoning with post-training and reward modeling, Liu et al., 
 - Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems, Min et al., 
 - Openai o1 system card, Jaech et al., 
 - Qwen2.5 technical report, Yang et al., 
 - OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning, Zhang et al., 
 - Proposing and solving olympiad geometry with guided tree search, Zhang et al., 
 - Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, Bansal et al., 
 - Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, Wang et al., 
 - Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., 
 - Sft memorizes, rl generalizes: A comparative study of foundation model post-training, Chu et al., 
 - Advancing Math Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages, Chen et al., 
 - Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training, Yuan et al., 
 - s1: Simple test-time scaling, Muennighoff et al., 
 - RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?, Xu et al., 
 - Sky-T1: Train your own O1 preview model within $450, Team et al.,
 - Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation, Labs et al., 
 - Think Less, Achieve More: Cut Reasoning Costs by 50% Without Sacrificing Accuracy, Team et al., 
 - Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search, Li et al., 
 - FastMCTS: A Simple Sampling Strategy for Data Synthesis, Li et al., 
 - LLMs Can Teach Themselves to Better Predict the Future, Turtel et al., 
 - Policy Guided Tree Search for Enhanced LLM Reasoning, Li et al., 
 - Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls, Wang et al., 
 - SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers, Li et al., 
 - Distillation Scaling Laws, Busbridge et al., 
 - Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization, Yao et al., 
 - CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers, Le et al., 
 - Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning, Chen et al., 
 - Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision, Zhu et al., 
 - Demystifying Long Chain-of-Thought Reasoning in LLMs, Yeo et al., 
 - LIMO: Less is More for Reasoning, Ye et al., 
 - Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment, Li et al., 
 - BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation, Pang et al., 
 - PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models, Zhao et al., 
 - Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners, Peng et al., 
 - Process-based Self-Rewarding Language Models, Zhang et al., 
 - Entropy-Based Adaptive Weighting for Self-Training, Wang et al., 
 - Entropy-based Exploration Conduction for Multi-step Reasoning, Zhang et al., 
 - OpenCodeReasoning: Advancing Data Distillation for Competitive Coding, Ahmad et al., 
 
Feasible Reflection
Feedback
- Learning to summarize with human feedback, Stiennon et al., 
 - Training verifiers to solve math word problems, Cobbe et al., 
 - Self-critiquing models for assisting human evaluators, Saunders et al., 
 - Language models (mostly) know what they know, Kadavath et al., 
 - Star: Bootstrapping reasoning with reasoning, Zelikman et al., 
 - Solving math word problems with process- and outcome-based feedback, Uesato et al., 
 - Constitutional AI: Harmlessness from AI Feedback, Bai et al., 
 - ReAct: Synergizing Reasoning and Acting in Language Models, Yao et al., 
 - Gpt-4 technical report, Achiam et al., 
 - Palm 2 technical report, Anil et al., 
 - Critic: Large language models can self-correct with tool-interactive critiquing, Gou et al., 
 - Contrastive learning with logic-driven data augmentation for logical reasoning over text, Bao et al., 
 - Self-verification improves few-shot clinical information extraction, Gero et al., 
 - LEVER: Learning to Verify Language-to-Code Generation with Execution, Ni et al., 
 - Llama 2: Open foundation and fine-tuned chat models, Touvron et al., 
 - Reinforced self-training (rest) for language modeling, Gulcehre et al., 
 - Shepherd: A critic for language model generation, Wang et al., 
 - Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al., 
 - Mistral 7B, Jiang et al., 
 - Let's reward step by step: Step-Level reward model as the Navigators for Reasoning, Ma et al., 
 - Camels in a changing climate: Enhancing lm adaptation with tulu 2, Ivison et al., 
 - Towards Mitigating LLM Hallucination via Self Reflection, Ji et al., 
 - Reasoning with Language Model is Planning with World Model, Hao et al., 
 - Large Language Models are Better Reasoners with Self-Verification, Weng et al., 
 - Reflexion: language agents with verbal reinforcement learning, Shinn et al., 
 - Large Language Models Cannot Self-Correct Reasoning Yet, Huang et al., 
 - Let's verify step by step, Lightman et al., 
 - Mixtral of experts, Jiang et al., 
 - Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification, Zhou et al., 
 - SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning, Miao et al., 
 - Deepseek llm: Scaling open-source language models with longtermism, Bi et al., 
 - Llemma: An Open Language Model for Mathematics, Azerbayev et al., 
 - Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al., 
 - VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search, Brandfonbrener et al., 
 - Can We Verify Step by Step for Incorrect Answer Detection?, Xu et al., 
 - Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, Team et al., 
 - Internlm2 technical report, Cai et al., 
 - Rewardbench: Evaluating reward models for language modeling, Lambert et al., 
 - Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models, Hu et al., 
 - Evaluating Mathematical Reasoning Beyond Accuracy, Xia et al., 
 - Monte carlo tree search boosts reasoning via iterative preference learning, Xie et al., 
 - Improving reward models with synthetic critiques, Ye et al., 
 - Self-reflection in llm agents: Effects on problem-solving performance, Renze et al., 
 - Rlhf workflow: From reward modeling to online rlhf, Dong et al., 
 - Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models, Xu et al., 
 - Nemotron-4 340b technical report, Adler et al., 
 - OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning, Yu et al., 
 - The Reason behind Good or Bad: Towards a Better Mathematical Verifier with Natural Language Feedback, Gao et al., 
 - Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback, Gao et al., 
 - Step-dpo: Step-wise preference optimization for long-chain reasoning of llms, Lai et al., 
 - Llm critics help catch llm bugs, McAleese et al., 
 - LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, Hao et al., 
 - Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models, Lee et al., 
 - Tlcr: Token-level continuous reward for fine-grained reinforcement learning from human feedback, Yoon et al., 
 - The llama 3 herd of models, Dubey et al., 
 - Mistral-NeMo-12B-Instruct, Team et al., 
 - OffsetBias: Leveraging Debiased Data for Tuning Evaluators, Park et al., 
 - Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution, Fernando et al., 
 - ReFT: Reasoning with Reinforced Fine-Tuning, Trung et al., 
 - Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives, Zhang et al., 
 - Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations, Wang et al., 
 - Selective Preference Optimization via Token-Level Reward Function Estimation, Yang et al., 
 - When is Tree Search Useful for LLM Planning? It Depends on the Discriminator, Chen et al., 
 - Self-taught evaluators, Wang et al., 
 - Gemma 2: Improving open language models at a practical size, Team et al., 
 - Generative verifiers: Reward modeling as next-token prediction, Zhang et al., 
 - OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement, Zheng et al., 
 - Critic-cot: Boosting the reasoning abilities of large language model via chain-of-thoughts critic, Zheng et al., 
 - Small Language Models Need Strong Verifiers to Self-Correct Reasoning, Zhang et al., 
 - Abstract Meaning Representation-Based Logic-Driven Data Augmentation for Logical Reasoning, Bao et al., 
 - Reasoning in Flux: Enhancing Large Language Models Reasoning through Uncertainty-aware Adaptive Guidance, Yin et al., 
 - Direct Judgement Preference Optimization, Wang et al., 
 - Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback, Ivison et al., 
 - HelpSteer 2: Open-source dataset for training top-performing reward models, Wang et al., 
 - Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al., 
 - Critique-out-Loud Reward Models, Ankner et al., 
 - Skywork-reward: Bag of tricks for reward modeling in llms, Liu et al., 
 - On designing effective rl reward at training time for llm reasoning, Gao et al., 
 - Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up, Yuan et al., 
 - Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment, Kazemnejad et al., 
 - Self-generated critiques boost reward modeling for language models, Yu et al., 
 - Advancing Process Verification for Large Language Models via Tree-Based Preference Learning, He et al., 
 - From generation to judgment: Opportunities and challenges of llm-as-a-judge, Li et al., 
 - Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering, Guan et al., 
 - Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models, Kim et al., 
 - Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation, Vu et al., 
 - Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts, Wang et al., 
 - Step-level Value Preference Optimization for Mathematical Reasoning, Chen et al., 
 - Skywork-o1 open series, Team et al., 
 - Entropy-Regularized Process Reward Model, Zhang et al., 
 - Llms-as-judges: a comprehensive survey on llm-based evaluation methods, Li et al., 
 - Lmunit: Fine-grained evaluation with natural language unit tests, Saad-Falcon et al., 
 - o1-coder: an o1 replication for coding, Zhang et al., 
 - Hunyuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving, Li et al., 
 - Acemath: Advancing frontier math reasoning with post-training and reward modeling, Liu et al., 
 - Free process rewards without process labels, Yuan et al., 
 - AutoPSV: Automated Process-Supervised Verifier, Lu et al., 
 - Processbench: Identifying process errors in mathematical reasoning, Zheng et al., 
 - Qwen2.5 technical report, Yang et al., 
 - Openai o1 system card, Jaech et al., 
 - Outcome-Refining Process Supervision for Code Generation, Yu et al., 
 - Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., 
 - Dynamic Scaling of Unit Tests for Code Reward Modeling, Ma et al., 
 - What Makes Large Language Models Reason in (Multi-Turn) Code Generation?, Zheng et al., 
 - Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models, Liu et al., 
 - Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge, Saha et al., 
 - Scaling Autonomous Agents via Automatic Reward Modeling And Planning, Chen et al., 
 - Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework, Sun et al., 
 - The lessons of developing process reward models in mathematical reasoning, Zhang et al., 
 - Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback, Lin et al., 
 - Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems, Ye et al., 
 - Advancing LLM Reasoning Generalists with Preference Trees, Yuan et al., 
 - Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning, Setlur et al., 
 - Enabling Scalable Oversight via Self-Evolving Critic, Tang et al., 
 - PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models, Song et al., 
 - Zero-Shot Verification-guided Chain of Thoughts, Chowdhury et al., 
 - Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation, Xie et al., 
 - Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al., 
 - SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning, Ma et al., 
 - Training an LLM-as-a-Judge Model: Pipeline, Insights, and Practical Lessons, Hu et al., 
 - Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models, Gu et al., 
 - Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?, Zhang et al., 
 - Adaptivestep: Automatically dividing reasoning step through model confidence, Liu et al., 
 - Unveiling and Causalizing CoT: A Causal Perspective, Fu et al.,
 - Diverse Inference and Verification for Advanced Reasoning, Drori et al., 
 - Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges, Shrestha et al., 
 - A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics, Wei et al., 
 - Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models, Zhou et al., 
 - ACECODER: Acing Coder RL via Automated Test-Case Synthesis, Zeng et al., 
 - A Study on Leveraging Search and Self-Feedback for Agent Reasoning, Yuan et al., 
 - RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation, Zhou et al., 
 - Process Reward Models for LLM Agents: Practical Framework and Directions, Choudhury et al., 
 - Process reinforcement through implicit rewards, Cui et al., 
 - Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning, Xu et al., 
 - VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data, Zeng et al., 
 - Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values, Zhang et al., 
 - Teaching Language Models to Critique via Reinforcement Learning, Xie et al., 
 - Uncertainty-Aware Search and Value Models: Mitigating Search Scaling Flaws in LLMs, Yu et al., 
 - AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification, Tan et al., 
 - Dyve: Thinking Fast and Slow for Dynamic Process Verification, Zhong et al., 
 - Uncertainty-Aware Step-wise Verification with Generative Reward Models, Ye et al., 
 - Visualprm: An effective process reward model for multimodal reasoning, Wang et al., 
 - Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation, Baker et al., 
 - An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning, Sun et al., 
 - JudgeLRM: Large Reasoning Models as a Judge, Chen et al., 
 - QwQ: Reflect Deeply on the Boundaries of the Unknown, Team et al., 
 
Refinement
- Self-critiquing models for assisting human evaluators, Saunders et al., 
 - Self-Refine: Iterative Refinement with Self-Feedback, Madaan et al., 
 - Grace: Discriminator-guided chain-of-thought reasoning, Khalifa et al., 
 - Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies, Pan et al., 
 - Learning from mistakes makes llm better reasoner, An et al., 
 - Reflection-tuning: Data recycling improves llm instruction-tuning, Li et al., 
 - Reflexion: language agents with verbal reinforcement learning, Shinn et al., 
 - Towards Mitigating LLM Hallucination via Self Reflection, Ji et al., 
 - SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning, Miao et al., 
 - Teaching Large Language Models to Self-Debug, Chen et al., 
 - Learning to check: Unleashing potentials for self-correction in large language models, Zhang et al., 
 - REFINER: Reasoning Feedback on Intermediate Representations, Paul et al., 
 - GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements, Havrilla et al., 
 - Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic, Zhao et al., 
 - General purpose verification for chain of thought prompting, Vacareanu et al., 
 - Enhancing visual-language modality alignment in large vision language models via self-improvement, Wang et al., 
 - Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback, Gao et al., 
 - Large language models have intrinsic self-correction ability, Liu et al., 
 - Progressive-Hint Prompting Improves Reasoning in Large Language Models, Zheng et al., 
 - Accessing gpt-4 level mathematical olympiad solutions via monte carlo tree self-refine with llama-3 8b, Zhang et al., 
 - Toward Adaptive Reasoning in Large Language Models with Thought Rollback, Chen et al., 
 - CoT Rerailer: Enhancing the Reliability of Large Language Models in Complex Reasoning Tasks through Error Detection and Correction, Wan et al., 
 - Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement, Xu et al., 
 - Mutual reasoning makes smaller llms stronger problem-solvers, Qi et al., 
 - S³c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners, Yan et al.,
 - Training language models to self-correct via reinforcement learning, Kumar et al., 
 - A Theoretical Understanding of Self-Correction through In-context Alignment, Wang et al., 
 - ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search, Zhang et al., 
 - Recursive Introspection: Teaching Language Model Agents How to Self-Improve, Qu et al., 
 - Enhancing Mathematical Reasoning in LLMs by Stepwise Correction, Wu et al., 
 - LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints, Ferraz et al., 
 - O1 Replication Journey: A Strategic Progress Report--Part 1, Qin et al., 
 - Advancing Large Language Model Attribution through Self-Improving, Huang et al., 
 - Enhancing llm reasoning via critique models with test-time and training-time supervision, Xi et al., 
 - Vision-language models can self-improve reasoning via reflection, Cheng et al., 
 - Confidence vs Critique: A Decomposition of Self-Correction Capability for LLMs, Yang et al., 
 - LLM2: Let Large Language Models Harness System 2 Reasoning, Yang et al., 
 - Understanding the Dark Side of LLMs' Intrinsic Self-Correction, Zhang et al., 
 - Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents, He et al., 
 - CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis, Zhang et al., 
 - Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., 
 - 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient, Zeng et al., 
 - BackMATH: Towards Backward Reasoning for Solving Math Problems Step by Step, Zhang et al., 
 - ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding, Sun et al., 
 - Critique fine-tuning: Learning to critique is more effective than learning to imitate, Wang et al., 
 - RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques, Tang et al., 
 - ProgCo: Program Helps Self-Correction of Large Language Models, Song et al., 
 - URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics, Luo et al., 
 - S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning, Ma et al., 
 - ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates, Yang et al., 
 - ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification, Lee et al., 
 - Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models, Yang et al., 
 - Iterative Deepening Sampling for Large Language Models, Chen et al., 
 - LLMs Can Easily Learn to Reason from Demonstrations; Structure, not content, is what matters!, Li et al.,
 - MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification, Sun et al., 
 - ARIES: Stimulating Self-Refinement of Large Language Models by Iterative Preference Optimization, Zeng et al., 
 - Optimizing generative AI by backpropagating language model feedback, Yuksekgonul et al., 
 - DLPO: Towards a Robust, Efficient, and Generalizable Prompt Optimization Framework from a Deep-Learning Perspective, Peng et al., 
 - Instruct-of-Reflection: Enhancing Large Language Models Iterative Reflection Capabilities via Dynamic-Meta Instruction, Liu et al., 
 - The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement, Yang et al., 
 
Extensive Exploration
Exploration Scaling
- Greedy Policy Search: A Simple Baseline for Learnable Test-Time Augmentation, Lyzhov et al., 
 - Scaling scaling laws with board games, Jones et al., 
 - Show Your Work: Scratchpads for Intermediate Computation with Language Models, Nye et al., 
 - Complexity-Based Prompting for Multi-step Reasoning, Fu et al., 
 - Self-Consistency Improves Chain of Thought Reasoning in Language Models, Wang et al., 
 - Making Language Models Better Reasoners with Step-Aware Verifier, Li et al., 
 - Deductive Verification of Chain-of-Thought Reasoning, Ling et al., 
 - Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al., 
 - Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization, Zhou et al., 
 - Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision, Wang et al., 
 - Stepwise self-consistent mathematical reasoning with large language models, Zhao et al., 
 - General purpose verification for chain of thought prompting, Vacareanu et al., 
 - Improve Mathematical Reasoning in Language Models by Automated Process Supervision, Luo et al., 
 - Large language monkeys: Scaling inference compute with repeated sampling, Brown et al., 
 - Scaling llm test-time compute optimally can be more effective than scaling model parameters, Snell et al., 
 - Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, Wu et al., 
 - Learning to Reason via Program Generation, Emulation, and Search, Weir et al., 
 - What are the essential factors in crafting effective long context multi-hop instruction datasets? insights and best practices, Chen et al., 
 - MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning, Chen et al., 
 - Scaling llm inference with optimized sample compute allocation, Zhang et al., 
 - Rlef: Grounding code llms in execution feedback with reinforcement learning, Gehring et al., 
 - Planning in Natural Language Improves LLM Search for Code Generation, Wang et al., 
 - Beyond examples: High-level automated reasoning paradigm in in-context learning via mcts, Wu et al., 
 - Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts, Luo et al., 
 - From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Welleck et al., 
 - From medprompt to o1: Exploration of run-time strategies for medical challenge problems and beyond, Nori et al., 
 - Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information, Zhang et al., 
 - A simple and provable scaling law for the test-time compute of large language models, Chen et al., 
 - Openai o1 system card, Jaech et al., 
 - Lachesis: Predicting LLM Inference Accuracy using Structural Properties of Reasoning Paths, Kim et al., 
 - Seed-cts: Unleashing the power of tree search for superior performance in competitive coding tasks, Wang et al., 
 - Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving, AbdElhameed et al., 
 - s1: Simple test-time scaling, Muennighoff et al., 
 - From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning, Li et al., 
 - Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective, Yu et al., 
 - Test-time Computing: from System-1 Thinking to System-2 Thinking, Ji et al., 
 - SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling, Chen et al., 
 - Instantiation-based Formalization of Logical Reasoning Tasks using Language Models and Logical Solvers, Raza et al., 
 - The lessons of developing process reward models in mathematical reasoning, Zhang et al., 
 - ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning, Yu et al., 
 - ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al., 
 - Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, Geiping et al., 
 - Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, Chen et al., 
 - Scalable Best-of-N Selection for Large Language Models via Self-Certainty, Kang et al., 
 - Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling, Liu et al., 
 - Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?, Zeng et al., 
 - Optimizing Temperature for Language Models with Multi-Sample Inference, Du et al., 
 - Bag of Tricks for Inference-time Computation of LLM Reasoning, Liu et al., 
 - Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, Yang et al., 
 - (Mis)Fitting: A Survey of Scaling Laws, Li et al.,
 - METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling, Li et al., 
 - Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment, Li et al., 
 - Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification, Zhao et al., 
 - TestNUC: Enhancing Test-Time Computing Approaches through Neighboring Unlabeled Data Consistency, Zou et al., 
 - Confidence Improves Self-Consistency in LLMs, Taubenfeld et al., 
 - S*: Test Time Scaling for Code Generation, Li et al., 
 - Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning, Zhou et al., 
 - Is Depth All You Need? An Exploration of Iterative Reasoning in LLMs, Wu et al., 
 - Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking, Tian et al., 
 - What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models, Zhang et al., 
 - Metascale: Test-time scaling with evolving meta-thoughts, Liu et al., 
 - Multidimensional Consistency Improves Reasoning in Language Models, Lai et al., 
 - Efficient test-time scaling via self-calibration, Huang et al., 
 - Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scale Test-Time Compute, Chen et al., 
 
External Exploration
- Competition-Level Code Generation with AlphaCode, Li et al., 
 - Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, Zhou et al., 
 - Llama: Open and efficient foundation language models, Touvron et al., 
 - Gpt-4 technical report, Achiam et al., 
 - Llama 2: Open foundation and fine-tuned chat models, Touvron et al., 
 - Code llama: Open foundation models for code, Roziere et al., 
 - Self-Evaluation Guided Beam Search for Reasoning, Xie et al., 
 - No train still gain. Unleash mathematical reasoning of large language models with Monte Carlo tree search guided by energy function, Xu et al., 
 - Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al., 
 - Tora: A tool-integrated reasoning agent for mathematical problem solving, Gou et al., 
 - Mistral 7B, Jiang et al., 
 - Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning, Wang et al., 
 - PATHFINDER: Guided Search over Multi-Step Reasoning Paths, Golovneva et al., 
 - Reflexion: language agents with verbal reinforcement learning, Shinn et al., 
 - Reasoning with Language Model is Planning with World Model, Hao et al., 
 - The claude 3 model family: Opus, sonnet, haiku, Anthropic et al., 
 - NuminaMath, Li et al., 
 - Demystifying chains, trees, and graphs of thoughts, Besta et al., 
 - MARIO: MAth Reasoning with code Interpreter Output--A Reproducible Pipeline, Liao et al., 
 - Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al., 
 - Mathgenie: Generating synthetic data with question back-translation for enhancing mathematical reasoning of llms, Lu et al., 
 - Graph of Thoughts: Solving Elaborate Problems with Large Language Models, Besta et al., 
 - Tree of Uncertain Thoughts Reasoning for Large Language Models, Mo et al., 
 - Mindstar: Enhancing math reasoning in pre-trained llms at inference time, Kang et al., 
 - Mapcoder: Multi-agent code generation for competitive problem solving, Islam et al., 
 - AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training, Wan et al., 
 - Monte carlo tree search boosts reasoning via iterative preference learning, Xie et al., 
 - Accessing gpt-4 level mathematical olympiad solutions via monte carlo tree self-refine with llama-3 8b, Zhang et al., 
 - Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search, Light et al., 
 - Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning, Zhu et al., 
 - Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping, Lehnert et al., 
 - Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding, Liu et al., 
 - Qwen2 Technical Report, Yang et al., 
 - The llama 3 herd of models, Dubey et al., 
 - Litesearch: Efficacious tree search for llm, Wang et al., 
 - LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, Hao et al., 
 - Tree search for language model agents, Koh et al., 
 - GraphReason: Enhancing Reasoning Capabilities of Large Language Models through A Graph-Based Verification Approach, Cao et al., 
 - Agent q: Advanced reasoning and learning for autonomous ai agents, Putta et al., 
 - Making PPO even better: Value-Guided Monte-Carlo Tree Search decoding, Liu et al., 
 - Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing, Tian et al., 
 - On the diagram of thought, Zhang et al., 
 - Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al., 
 - AlphaMath Almost Zero: Process Supervision without Process, Chen et al., 
 - RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation, Li et al., 
 - Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al., 
 - Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning, Zhang et al., 
 - Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination, Chen et al., 
 - Treebon: Enhancing inference-time alignment with speculative tree-search and best-of-n sampling, Qiu et al., 
 - Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning, Wang et al., 
 - Aflow: Automating agentic workflow generation, Zhang et al., 
 - Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models, Wang et al., 
 - Deliberate reasoning for llms as structure-aware planning with accurate world model, Xiong et al., 
 - Enhancing multi-step reasoning abilities of language models through direct q-function optimization, Liu et al., 
 - Process reward model with q-value rankings, Li et al., 
 - Scattered Forest Search: Smarter Code Space Exploration with LLMs, Light et al., 
 - AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning, Xiang et al., 
 - On the Empirical Complexity of Reasoning and Planning in LLMs, Kang et al., 
 - CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models, Li et al., 
 - Technical report: Enhancing llm reasoning with reward-guided tree search, Jiang et al., 
 - SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Enhanced Code Generation, Xu et al., 
 - Marco-o1: Towards open reasoning models for open-ended solutions, Zhao et al., 
 - GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection, Kadam et al., 
 - MC-NEST--Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree, Rabby et al., 
 - SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models, Cheng et al., 
 - Forest-of-thought: Scaling test-time compute for enhancing LLM reasoning, Bi et al., 
 - Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search, Yao et al., 
 - Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation and Execution in Complex Task Handling, Ni et al., 
 - LLM2: Let Large Language Models Harness System 2 Reasoning, Yang et al., 
 - Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning, Jiang et al., 
 - Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning, Park et al., 
 - rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking, Guan et al., 
 - Evolving Deeper LLM Thinking, Lee et al., 
 - A Roadmap to Guide the Integration of LLMs in Hierarchical Planning, Puerta-Merino et al., 
 - BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning, Zhang et al., 
 - Monte Carlo Tree Search for Comprehensive Exploration in LLM-Based Automatic Heuristic Design, Zheng et al., 
 - Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning, Lin et al., 
 - A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods, Puri et al., 
 - Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models, Kim et al., 
 - SIFT: Grounding LLM Reasoning in Contexts via Stickers, Zeng et al., 
 - Atom of Thoughts for Markov LLM Test-Time Scaling, Teng et al., 
 - Reasoning with Reinforced Functional Token Tuning, Zhang et al., 
 - CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning, Pan et al., 
 - MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning, Park et al., 
 - QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search, Lin et al., 
 - CritiQ: Mining Data Quality Criteria from Human Preferences, Guo et al., 
 - START: Self-taught Reasoner with Tools, Li et al., 
 - Better Process Supervision with Bi-directional Rewarding Signals, Chen et al., 
 
Internal Exploration
- Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al., 
 - Proximal policy optimization algorithms, Schulman et al., 
 - Training verifiers to solve math word problems, Cobbe et al., 
 - Direct preference optimization: Your language model is secretly a reward model, Rafailov et al., 
 - Gpt-4 technical report, Achiam et al., 
 - The claude 3 model family: Opus, sonnet, haiku, Anthropic et al., 
 - Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al., 
 - Kto: Model alignment as prospect theoretic optimization, Ethayarajh et al., 
 - Stepcoder: Improve code generation with reinforcement learning from compiler feedback, Dou et al., 
 - Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, Singh et al., 
 - ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, Li et al., 
 - AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training, Wan et al., 
 - Chatglm: A family of large language models from glm-130b to glm-4 all tools, GLM et al., 
 - The llama 3 herd of models, Dubey et al., 
 - RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold, Setlur et al., 
 - CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks, Wang et al., 
 - Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback, Ivison et al., 
 - Qwen2.5-coder technical report, Hui et al., 
 - Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al., 
 - Building math agents with multi-turn iterative preference learning, Xiong et al., 
 - Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., 
 - ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search, Zhang et al., 
 - A Small Step Towards Reproducing OpenAI o1: Progress Report on the Steiner Open Source Models, Ji et al., 
 - A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications, Xiao et al., 
 - OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data, Toshniwal et al., 
 - Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability, Lin et al., 
 - Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization, Liu et al., 
 - o1-coder: an o1 replication for coding, Zhang et al., 
 - Offline Reinforcement Learning for LLM Multi-Step Reasoning, Wang et al., 
 - Qwen2.5 technical report, Yang et al., 
 - Deepseek-v3 technical report, Liu et al., 
 - Openai o1 system card, Jaech et al., 
 - Sft memorizes, rl generalizes: A comparative study of foundation model post-training, Chu et al., 
 - REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models, Hu et al., 
 - Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., 
 - Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling, Hou et al., 
 - Diverse Preference Optimization, Lanchantin et al., 
 - COS(M+O)S: Curiosity and RL-Enhanced MCTS for Exploring Story Space via Language Models, Materzok et al., 
 - 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient, Zeng et al., 
 - Search-o1: Agentic search-enhanced large reasoning models, Li et al., 
 - rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking, Guan et al., 
 - Kimi k1.5: Scaling reinforcement learning with llms, Team et al., 
 - Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search, Shen et al., 
 - Demystifying Long Chain-of-Thought Reasoning in LLMs, Yeo et al., 
 - LIMR: Less is More for RL Scaling, Li et al., 
 - Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning, Vassoyan et al., 
 - Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance, Huang et al., 
 - Process reinforcement through implicit rewards, Cui et al., 
 - SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin, Yi et al., 
 - Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al., 
 - Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, Yang et al., 
 - Training Language Models to Reason Efficiently, Arora et al., 
 - LLM Post-Training: A Deep Dive into Reasoning Large Language Models, Kumar et al., 
 - Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment, Sun et al., 
 - Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points, Zhang et al., 
 - Reasoning with Reinforced Functional Token Tuning, Zhang et al., 
 - Qsharp: Provably Optimal Distributional RL for LLM Post-Training, Zhou et al., 
 - Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning, Lyu et al., 
 - Competitive Programming with Large Reasoning Models, El-Kishky et al., 
 - SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution, Wei et al., 
 - Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al., 
 - Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation, Kim et al., 
 - On the Emergence of Thinking in LLMs I: Searching for the Right Intuition, Ye et al., 
 - The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks, Cuadron et al., 
 - DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL, Luo et al., 
 - STeCa: Step-level Trajectory Calibration for LLM Agent Learning, Wang et al., 
 - Thinking Preference Optimization, Yang et al., 
 - Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training, Bartoldson et al., 
 - Dapo: An open-source llm reinforcement learning system at scale, Yu et al., 
 - Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, Hu et al., 
 - Optimizing Test-Time Compute via Meta Reinforcement Finetuning, Qu et al., 
 - Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't, Dang et al., 
 - SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks, Zhou et al., 
 - START: Self-taught Reasoner with Tools, Li et al., 
 - Expanding RL with Verifiable Rewards Across Diverse Domains, Su et al., 
 - R-PRM: Reasoning-Driven Process Reward Modeling, She et al., 
 - VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks, Yue et al., 
 - Z1: Efficient Test-time Scaling with Code, Yu et al., 
 - QwQ: Reflect Deeply on the Boundaries of the Unknown, Team et al., 
 
Future and Frontiers
Agentic & Embodied Long CoT
- Solving Math Word Problems via Cooperative Reasoning induced Language Models, Zhu et al., 
 - Reasoning with Language Model is Planning with World Model, Hao et al., 
 - Large language models as commonsense knowledge for large-scale task planning, Zhao et al., 
 - Robotic Control via Embodied Chain-of-Thought Reasoning, Zawalski et al., 
 - Tree-Planner: Efficient Close-loop Task Planning with Large Language Models, Hu et al., 
 - Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models, Zhou et al., 
 - Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search, Light et al., 
 - Mixture-of-agents enhances large language model capabilities, Wang et al., 
 - ADaPT: As-Needed Decomposition and Planning with Language Models, Prasad et al., 
 - Tree search for language model agents, Koh et al., 
 - Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large language model, Hu et al., 
 - S3 agent: Unlocking the power of VLLM for zero-shot multi-modal sarcasm detection, Wang et al., 
 - MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems, Lei et al., 
 - Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al., 
 - EVOLvE: Evaluating and Optimizing LLMs For Exploration, Nie et al., 
 - Agents Thinking Fast and Slow: A Talker-Reasoner Architecture, Christakopoulou et al., 
 - Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation, Xie et al., 
 - Titans: Learning to memorize at test time, Behrouz et al., 
 - Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success, Kim et al., 
 - World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning, Wang et al., 
 - Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks, Zhang et al., 
 - Cosmos-reason1: From physical common sense to embodied reasoning, Azzolini et al., 
 - Improving Retrospective Language Agents via Joint Policy Gradient Optimization, Feng et al., 
 - Haste Makes Waste: Evaluating Planning Abilities of LLMs for Efficient and Feasible Multitasking with Time Constraints Between Actions, Wu et al., 
 - MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents, Zhu et al., 
 - ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning, Wan et al., 
 - MAS-GPT: Training LLMs To Build LLM-Based Multi-Agent Systems, Ye et al., 
 - Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems, Liu et al., 
 
Efficient Long CoT
- Guiding language model reasoning with planning tokens, Wang et al., 
 - Synergy-of-thoughts: Eliciting efficient reasoning in hybrid language models, Shang et al., 
 - Distilling system 2 into system 1, Yu et al., 
 - Concise thoughts: Impact of output length on llm reasoning and cost, Nayab et al., 
 - Litesearch: Efficacious tree search for llm, Wang et al., 
 - Uncertainty-Guided Optimization on Large Language Model Search Trees, Grosse et al., 
 - CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks, Wang et al., 
 - Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., 
 - Kvsharer: Efficient inference via layer-wise dissimilar KV cache sharing, Yang et al., 
 - Interpretable contrastive monte carlo tree search reasoning, Gao et al., 
 - Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces, Su et al., 
 - DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models, Pan et al., 
 - Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding, Chen et al., 
 - Token-budget-aware llm reasoning, Han et al., 
 - B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners, Zeng et al., 
 - C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness, Kang et al., 
 - Training large language models to reason in a continuous latent space, Hao et al., 
 - CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al., 
 - Kimi k1.5: Scaling reinforcement learning with llms, Team et al., 
 - O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning, Luo et al., 
 - Reward-Guided Speculative Decoding for Efficient LLM Reasoning, Liao et al., 
 - Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization, Yu et al., 
 - Efficient Reasoning with Hidden Thinking, Shen et al., 
 - On the Query Complexity of Verifier-Assisted Language Generation, Botta et al., 
 - TokenSkip: Controllable Chain-of-Thought Compression in LLMs, Xia et al., 
 - Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation, Du et al., 
 - Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE, Huang et al., 
 - Towards Reasoning Ability of Small Language Models, Srivastava et al., 
 - Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs, Ji et al., 
 - Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models, Chijiwa et al., 
 - MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification, Sun et al., 
 - Language Models Can Predict Their Own Behavior, Ashok et al., 
 - On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes, Chang et al., 
 - CoT-Valve: Length-Compressible Chain-of-Thought Tuning, Ma et al., 
 - Training Language Models to Reason Efficiently, Arora et al., 
 - Chain of Draft: Thinking Faster by Writing Less, Xu et al., 
 - Learning to Stop Overthinking at Test Time, Bao et al., 
 - Self-Training Elicits Concise Reasoning in Large Language Models, Munkhbat et al., 
 - Length-Controlled Margin-Based Preference Optimization without Reference Model, Li et al., 
 - Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking, Ziabari et al., 
 - Dynamic Parallel Tree Search for Efficient LLM Reasoning, Ding et al., 
 - Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models, Cui et al., 
 - Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning, Wang et al., 
 - SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs, Xu et al., 
 - LightThinker: Thinking Step-by-Step Compression, Zhang et al., 
 - Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning, Yan et al., 
 - Stepwise Informativeness Search for Improving LLM Reasoning, Wang et al., 
 - Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning, Li et al., 
 - Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking, Ge et al., 
 - Understanding r1-zero-like training: A critical perspective, Liu et al., 
 - The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models, Ji et al., 
 - L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning, Aggarwal et al., 
 - DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models, Shen et al., 
 - ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning, Hou et al., 
 
Knowledge-Augmented Long CoT
- Best of Both Worlds: Harmonizing LLM Capabilities in Decision-Making and Question-Answering for Treatment Regimes, Liu et al., 
 - Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation, Wang et al., 
 - Stream of search (sos): Learning to search in language, Gandhi et al., 
 - CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing, Yang et al., 
 - Disentangling memory and reasoning ability in large language models, Jin et al., 
 - Huatuogpt-o1, towards medical complex reasoning with llms, Chen et al., 
 - RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement, Jiang et al., 
 - O1 Replication Journey--Part 3: Inference-time Scaling for Medical Reasoning, Huang et al., 
 - MedS3: Towards Medical Small Language Models with Self-Evolved Slow Thinking, Jiang et al., 
 - Search-o1: Agentic search-enhanced large reasoning models, Li et al., 
 - Chain-of-Retrieval Augmented Generation, Wang et al., 
 - Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study, Zhao et al., 
 - Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support, Wang et al., 
 - ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al., 
 - Large Language Models for Recommendation with Deliberative User Preference Alignment, Fang et al., 
 - ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models, Chen et al., 
 - DeepRAG: Thinking to Retrieval Step by Step for Large Language Models, Guan et al., 
 - Open Deep Research, Team et al., 
 - HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation, Liu et al., 
 - O1 Embedder: Let Retrievers Think Before Action, Yan et al., 
 - MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning, Pan et al., 
 - Towards Robust Legal Reasoning: Harnessing Logical LLMs in Law, Kant et al., 
 - OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning, Lu et al., 
 - R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning, Song et al., 
 - RARE: Retrieval-Augmented Reasoning Modeling, Wang et al., 
 - Graph-Augmented Reasoning: Evolving Step-by-Step Knowledge Graph Retrieval for LLM Reasoning, Wu et al., 
 - Learning to Reason with Search for LLMs via Reinforcement Learning, Chen et al., 
 - Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning, Liu et al., 
 - m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models, Huang et al., 
 
Multilingual Long CoT
- Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al., 
 - Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting, Huang et al., 
 - xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning, Chai et al., 
 - Multilingual large language model: A survey of resources, taxonomy and frontiers, Qin et al., 
 - A Tree-of-Thoughts to Broaden Multi-step Reasoning across Languages, Ranaldi et al., 
 - AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought, Zhang et al., 
 - Enhancing Advanced Visual Reasoning Ability of Large Language Models, Li et al., 
 - DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought, Wang et al., 
 - A survey of multilingual large language models, Qin et al., 
 - Demystifying Multilingual Chain-of-Thought in Process Reward Modeling, Wang et al., 
 - The Multilingual Mind: A Survey of Multilingual Reasoning in Language Models, Ghosh et al., 
 
Multimodal Long CoT
- Large Language Models Can Self-Correct with Minimal Effort, Wu et al., 
 - Multimodal Chain-of-Thought Reasoning in Language Models, Zhang et al., 
 - Q*: Improving multi-step reasoning for llms with deliberative planning, Wang et al., 
 - M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought, Chen et al., 
 - A survey on evaluation of multimodal large language models, Huang et al., 
 - Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al., 
 - What factors affect multi-modal in-context learning? an in-depth exploration, Qin et al., 
 - Enhancing Advanced Visual Reasoning Ability of Large Language Models, Li et al., 
 - Insight-v: Exploring long-chain visual reasoning with multimodal large language models, Dong et al., 
 - Llava-o1: Let vision language models reason step-by-step, Xu et al., 
 - AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning, Xiang et al., 
 - ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback, Byun et al., 
 - Enhancing the reasoning ability of multimodal large language models via mixed preference optimization, Wang et al., 
 - Slow Perception: Let's Perceive Geometric Figures Step-by-step, Wei et al., 
 - Diving into Self-Evolving Training for Multimodal Reasoning, Liu et al., 
 - Scaling inference-time search with vision value model for improved visual comprehension, Wang et al., 
 - CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al., 
 - Inference Retrieval-Augmented Multi-Modal Chain-of-Thoughts Reasoning for Language Models, He et al., 
 - Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model, Ma et al., 
 - BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning, Zhang et al., 
 - InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model, Zang et al., 
 - Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark, Hao et al., 
 - Visual Agents as Fast and Slow Thinkers, Sun et al., 
 - Virgo: A Preliminary Exploration on Reproducing o1-like MLLM, Du et al., 
 - Llamav-o1: Rethinking step-by-step visual reasoning in llms, Thawakar et al., 
 - Inference-time scaling for diffusion models beyond scaling denoising steps, Ma et al., 
 - Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step, Guo et al., 
 - Imagine while Reasoning in Space: Multimodal Visualization-of-Thought, Li et al., 
 - Monte Carlo Tree Diffusion for System 2 Planning, Yoon et al., 
 - Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking, Wu et al., 
 - Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models, Xie et al., 
 - Visual-RFT: Visual Reinforcement Fine-Tuning, Liu et al., 
 - Qwen2.5-Omni Technical Report, Xu et al., 
 - Vision-r1: Incentivizing reasoning capability in multimodal large language models, Huang et al., 
 - Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl, Peng et al., 
 - Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning, Tan et al., 
 - OThink-MR1: Stimulating multimodal generalized reasoning capabilities through dynamic reinforcement learning, Liu et al., 
 - Grounded Chain-of-Thought for Multimodal Large Language Models, Wu et al., 
 - Test-Time View Selection for Multi-Modal Decision Making, Jain et al., 
 - Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme, Ma et al., 
 
Safety and Stability for Long CoT
- Larger and more instructable language models become less reliable, Zhou et al., 
 - On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models, Tanneru et al., 
 - The Impact of Reasoning Step Length on Large Language Models, Jin et al., 
 - Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., 
 - Can Large Language Models Understand You Better? An MBTI Personality Detection Dataset Aligned with Population Traits, Li et al., 
 - o3-mini vs DeepSeek-R1: Which One is Safer?, Arrieta et al., 
 - Efficient Reasoning with Hidden Thinking, Shen et al., 
 - Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking, Cheng et al., 
 - Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection, Zhao et al., 
 - Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies, Parmar et al., 
 - Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation, Arrieta et al., 
 - International AI Safety Report, Bengio et al., 
 - GuardReasoner: Towards Reasoning-based LLM Safeguards, Liu et al., 
 - OVERTHINKING: Slowdown Attacks on Reasoning LLMs, Kumar et al., 
 - A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos, Yao et al., 
 - MetaSC: Test-Time Safety Specification Optimization for Language Models, Gallego et al., 
 - Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment, Wang et al., 
 - The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1, Zhou et al., 
 - Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models, Lu et al., 
 - Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?, Bengio et al., 
 - Emergent Response Planning in LLM, Dong et al., 
 - Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models, Kharinaev et al., 
 - Safety Evaluation of DeepSeek Models in Chinese Contexts, Zhang et al., 
 - Reasoning Does Not Necessarily Improve Role-Playing Ability, Feng et al., 
 - H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking, Kuo et al., 
 - BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack, Zhu et al., 
 - " Nuclear Deployed!": Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents, Xu et al., 
 - SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities, Jiang et al., 
 - Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking, Zhu et al., 
 - CER: Confidence Enhanced Reasoning in LLMs, Razghandi et al., 
 - Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps, Tutek et al., 
 - The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis, Pan et al., 
 - Policy Frameworks for Transparent Chain-of-Thought Reasoning in Large Language Models, Chen et al., 
 - Do Chains-of-Thoughts of Large Language Models Suffer from Hallucinations, Cognitive Biases, or Phobias in Bayesian Reasoning?, Araya et al., 
 - Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps, Cui et al., 
 - Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable, Huang et al., 
 - Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?, Yan et al., 
 - Reasoning Models Don’t Always Say What They Think, Chen et al., 
 
Resources
Open-Sourced Training Framework
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework, Hu et al., 
 - LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, Hao et al., 
 - OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models, Wang et al., 
 - TinyZero, Pan et al., 
 - R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than $3, Chen et al., 
 - VL-Thinking: An R1-Derived Visual Instruction Tuning Dataset for Thinkable LVLMs, Chen et al., 
 - VLM-R1: A stable and generalizable R1-style Large Vision-Language Model, Shen et al., 
 - 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient, Zeng et al., 
 - Open R1, Team et al., 
 - DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL, Luo et al., 
 - X-R1, Team et al., 
 - Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, Hu et al., 
 - Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al., 
 - R1-Multimodal-Journey, Shao et al., 
 - Open-R1-Multimodal, Lab et al., 
 - Video-R1, Team et al., 
 - Dapo: An open-source llm reinforcement learning system at scale, Yu et al., 
 - VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks, Yue et al.,