Towards Reasoning Era: A Survey of Long Chain-of-Thought

Harbin Institute of Technology, Central South University, The University of Hong Kong, Fudan University

Abstract

Recent advances in logical reasoning tasks are often attributed to test-time scaling, with many researchers suggesting that allocating more inference-time computation to longer reasoning sequences improves performance. However, this idea is challenged on simpler tasks, such as commonsense reasoning and basic mathematics, where test-time scaling can lead to “overthinking” that hampers model performance. This paradox remains underexplored, and existing research is limited by two main shortcomings: a failure to distinguish Long Chain-of-Thought (Long CoT) from Short Chain-of-Thought (Short CoT) reasoning, and the absence of a comprehensive review of the topic. To address these issues, this survey first distinguishes Long CoT from Short CoT and introduces a new taxonomy for categorizing these reasoning paradigms. We then examine the key characteristics of Long CoT, namely Deep Reasoning, Extensive Exploration, and Feasible Reflection, and highlight how these features enable deeper and more efficient reasoning than the shallower Short CoT. Our review synthesizes the current state of Long CoT research, identifies critical gaps, and suggests future research directions. We also discuss open challenges for Long CoT, such as multi-modal reasoning, efficiency, and knowledge integration, and recommend resources, including open-source software, corpora, and key publications, to support further study. Through this survey, we aim to offer a unified perspective on Long CoT, propose strategies to overcome existing limitations, and inspire future research that pushes the boundaries of logical reasoning in artificial intelligence.

Paper List

    Deep Reasoning

    Deep Reasoning Execution

    • Generative language modeling for automated theorem proving, Polu et al., arXiv Badge
    • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei et al., PDF Badge
    • Reflection of thought: Inversely eliciting numerical reasoning in language models via solving linear systems, Zhou et al., arXiv Badge
    • MathPrompter: Mathematical Reasoning using Large Language Models, Imani et al., No Link Badge
    • Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al., No Link Badge
    • Deductive Verification of Chain-of-Thought Reasoning, Ling et al., PDF Badge
    • Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks, Chen et al., PDF Badge
    • Mistral 7B, Jiang et al., PDF Badge
    • Llama 2: Open foundation and fine-tuned chat models, Touvron et al., arXiv Badge
    • Guiding language model reasoning with planning tokens, Wang et al., arXiv Badge
    • Tinygsm: achieving >80% on gsm8k with small language models, Liu et al., arXiv Badge
    • Chain of Code: Reasoning with a Language Model-Augmented Code Emulator, Li et al., PDF Badge
    • AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought, Zhang et al., PDF Badge
    • Planning in Natural Language Improves LLM Search for Code Generation, Wang et al., PDF Badge
    • MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Sprague et al., PDF Badge
    • DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving, Tong et al., PDF Badge
    • Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus, Morishita et al., PDF Badge
    • Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards, Hwang et al., PDF Badge
    • AlphaMath Almost Zero: Process Supervision without Process, Chen et al., PDF Badge
    • Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al., arXiv Badge
    • Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes, Chen et al., arXiv Badge
    • Quiet-star: Language models can teach themselves to think before speaking, Zelikman et al., arXiv Badge
    • Common 7b language models already possess strong math capabilities, Li et al., arXiv Badge
    • MathDivide: Improved mathematical reasoning by large language models, Srivastava et al., arXiv Badge
    • Certified Deductive Reasoning with Language Models, Poesia et al., PDF Badge
    • From explicit cot to implicit cot: Learning to internalize cot step by step, Deng et al., arXiv Badge
    • Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models, Xu et al., arXiv Badge
    • Lean-star: Learning to interleave thinking and proving, Lin et al., arXiv Badge
    • The llama 3 herd of models, Dubey et al., arXiv Badge
    • Qwen2 Technical Report, Yang et al., arXiv Badge
    • Siam: Self-improving code-assisted mathematical reasoning of large language models, Yu et al., arXiv Badge
    • Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al., arXiv Badge
    • TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees, Liao et al., arXiv Badge
    • O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?, Huang et al., arXiv Badge
    • Formal mathematical reasoning: A new frontier in ai, Yang et al., arXiv Badge
    • Training large language models to reason in a continuous latent space, Hao et al., arXiv Badge
    • Qwen2.5 technical report, Yang et al., arXiv Badge
    • System-2 Mathematical Reasoning via Enriched Instruction Tuning, Cai et al., arXiv Badge
    • Acemath: Advancing frontier math reasoning with post-training and reward modeling, Liu et al., arXiv Badge
    • Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems, Min et al., arXiv Badge
    • SKIntern: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models, Liao et al., No Link Badge
    • STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving, Dong et al., No Link Badge
    • Sky-T1: Train your own O1 preview model within $450, Team et al., No Link Badge
    • QwQ: Reflect Deeply on the Boundaries of the Unknown, Team et al., No Link Badge
    • Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation, Labs et al., No Link Badge
    • Think Less, Achieve More: Cut Reasoning Costs by 50% Without Sacrificing Accuracy, Team et al., No Link Badge
    • Unlocking the Potential of Reinforcement Learning in Improving Reasoning Models, Team et al., No Link Badge
    • Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, Wang et al., arXiv Badge
    • Efficient Reasoning with Hidden Thinking, Shen et al., arXiv Badge
    • Advancing Math Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages, Chen et al., arXiv Badge
    • s1: Simple test-time scaling, Muennighoff et al., arXiv Badge
    • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., arXiv Badge
    • Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions, Ranaldi et al., arXiv Badge
    • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, Geiping et al., arXiv Badge
    • CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction, Li et al., arXiv Badge
    • Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments, Payoungkhamdee et al., arXiv Badge
    • Theorem Prover as a Judge for Synthetic Data Generation, Leang et al., arXiv Badge
    • Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation, Zhang et al., arXiv Badge
    • Scalable Language Models with Posterior Inference of Latent Thought Vectors, Kong et al., arXiv Badge
    • Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, Chen et al., arXiv Badge
    • Demystifying Long Chain-of-Thought Reasoning in LLMs, Yeo et al., arXiv Badge
    • FastMCTS: A Simple Sampling Strategy for Data Synthesis, Li et al., arXiv Badge
    • LIMO: Less is More for Reasoning, Ye et al., arXiv Badge

    Deep Reasoning Learning

    • Thinking fast and slow with deep learning and tree search, Anthony et al., No Link Badge
    • Training verifiers to solve math word problems, Cobbe et al., arXiv Badge
    • Chain of Thought Imitation with Procedure Cloning, Yang et al., PDF Badge
    • Star: Bootstrapping reasoning with reasoning, Zelikman et al., No Link Badge
    • Large Language Models Are Reasoning Teachers, Ho et al., PDF Badge
    • The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning, Kim et al., PDF Badge
    • Training Chain-of-Thought via Latent-Variable Inference, Hoffman et al., PDF Badge
    • Instruction tuning for large language models: A survey, Zhang et al., arXiv Badge
    • Reinforced self-training (rest) for language modeling, Gulcehre et al., arXiv Badge
    • Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, Singh et al., No Link Badge
    • V-STaR: Training Verifiers for Self-Taught Reasoners, Hosseini et al., PDF Badge
    • Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus, Morishita et al., PDF Badge
    • DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving, Tong et al., PDF Badge
    • Weak-to-Strong Reasoning, Yang et al., PDF Badge
    • Iterative Reasoning Preference Optimization, Pang et al., PDF Badge
    • Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs, Zhang et al., PDF Badge
    • ReAct Meets ActRe: Autonomous Annotation of Agent Trajectories for Contrastive Self-Training, Yang et al., PDF Badge
    • Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards, Hwang et al., PDF Badge
    • AlphaMath Almost Zero: Process Supervision without Process, Chen et al., PDF Badge
    • Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes, Chen et al., arXiv Badge
    • Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models, Puerto et al., arXiv Badge
    • Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al., arXiv Badge
    • TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees, Liao et al., arXiv Badge
    • Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning, Wang et al., arXiv Badge
    • O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?, Huang et al., arXiv Badge
    • System-2 Mathematical Reasoning via Enriched Instruction Tuning, Cai et al., arXiv Badge
    • Acemath: Advancing frontier math reasoning with post-training and reward modeling, Liu et al., arXiv Badge
    • Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems, Min et al., arXiv Badge
    • Openai o1 system card, Jaech et al., arXiv Badge
    • OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning, Zhang et al., arXiv Badge
    • Proposing and solving olympiad geometry with guided tree search, Zhang et al., arXiv Badge
    • Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, Bansal et al., PDF Badge
    • Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search, Li et al., PDF Badge
    • Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, Wang et al., arXiv Badge
    • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., arXiv Badge
    • Sft memorizes, rl generalizes: A comparative study of foundation model post-training, Chu et al., arXiv Badge
    • Advancing Math Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages, Chen et al., arXiv Badge
    • s1: Simple test-time scaling, Muennighoff et al., arXiv Badge
    • RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?, Xu et al., arXiv Badge
    • FastMCTS: A Simple Sampling Strategy for Data Synthesis, Li et al., arXiv Badge
    • LLMs Can Teach Themselves to Better Predict the Future, Turtel et al., arXiv Badge
    • Policy Guided Tree Search for Enhanced LLM Reasoning, Li et al., arXiv Badge
    • Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls, Wang et al., arXiv Badge
    • Distillation Scaling Laws, Busbridge et al., arXiv Badge
    • Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization, Yao et al., arXiv Badge
    • Demystifying Long Chain-of-Thought Reasoning in LLMs, Yeo et al., arXiv Badge
    • LIMO: Less is More for Reasoning, Ye et al., arXiv Badge
    • BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation, Pang et al., arXiv Badge

    Extensive Exploration

    Exploration Scaling

    • Greedy Policy Search: A Simple Baseline for Learnable Test-Time Augmentation, Lyzhov et al., PDF Badge
    • Scaling scaling laws with board games, Jones et al., arXiv Badge
    • Show Your Work: Scratchpads for Intermediate Computation with Language Models, Nye et al., PDF Badge
    • Making large language models better reasoners with step-aware verifier, Li et al., arXiv Badge
    • Complexity-Based Prompting for Multi-step Reasoning, Fu et al., PDF Badge
    • Self-Consistency Improves Chain of Thought Reasoning in Language Models, Wang et al., PDF Badge
    • Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al., No Link Badge
    • Deductive Verification of Chain-of-Thought Reasoning, Ling et al., PDF Badge
    • Learning to Reason via Program Generation, Emulation, and Search, Weir et al., PDF Badge
    • Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts, Luo et al., PDF Badge
    • From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Welleck et al., PDF Badge
    • Scaling Inference Computation: Compute-Optimal Inference for Problem-Solving with Language Models, Wu et al., PDF Badge
    • Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization, Zhou et al., PDF Badge
    • Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information, Zhang et al., PDF Badge
    • Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision, Wang et al., arXiv Badge
    • Stepwise self-consistent mathematical reasoning with large language models, Zhao et al., arXiv Badge
    • General purpose verification for chain of thought prompting, Vacareanu et al., arXiv Badge
    • Improve Mathematical Reasoning in Language Models by Automated Process Supervision, Luo et al., arXiv Badge
    • Large language monkeys: Scaling inference compute with repeated sampling, Brown et al., arXiv Badge
    • Scaling llm test-time compute optimally can be more effective than scaling model parameters, Snell et al., arXiv Badge
    • Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, Wu et al., arXiv Badge
    • What are the essential factors in crafting effective long context multi-hop instruction datasets? insights and best practices, Chen et al., arXiv Badge
    • MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning, Chen et al., arXiv Badge
    • Scaling llm inference with optimized sample compute allocation, Zhang et al., arXiv Badge
    • Rlef: Grounding code llms in execution feedback with reinforcement learning, Gehring et al., arXiv Badge
    • Beyond examples: High-level automated reasoning paradigm in in-context learning via mcts, Wu et al., arXiv Badge
    • From medprompt to o1: Exploration of run-time strategies for medical challenge problems and beyond, Nori et al., arXiv Badge
    • A simple and provable scaling law for the test-time compute of large language models, Chen et al., arXiv Badge
    • Openai o1 system card, Jaech et al., arXiv Badge
    • Lachesis: Predicting LLM Inference Accuracy using Structural Properties of Reasoning Paths, Kim et al., arXiv Badge
    • Seed-cts: Unleashing the power of tree search for superior performance in competitive coding tasks, Wang et al., arXiv Badge
    • Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving, AbdElhameed et al., arXiv Badge
    • ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning, Yu et al., PDF Badge
    • s1: Simple test-time scaling, Muennighoff et al., arXiv Badge
    • From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning, Li et al., arXiv Badge
    • Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective, Yu et al., arXiv Badge
    • Test-time Computing: from System-1 Thinking to System-2 Thinking, Ji et al., arXiv Badge
    • SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling, Chen et al., arXiv Badge
    • Instantiation-based Formalization of Logical Reasoning Tasks using Language Models and Logical Solvers, Raza et al., arXiv Badge
    • The lessons of developing process reward models in mathematical reasoning, Zhang et al., arXiv Badge
    • ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al., arXiv Badge
    • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, Geiping et al., arXiv Badge
    • Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, Chen et al., arXiv Badge
    • Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling, Liu et al., arXiv Badge
    • Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?, Zeng et al., arXiv Badge
    • Optimizing Temperature for Language Models with Multi-Sample Inference, Du et al., arXiv Badge
    • Bag of Tricks for Inference-time Computation of LLM Reasoning, Liu et al., arXiv Badge
    • Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, Yang et al., arXiv Badge
    • (Mis)Fitting: A Survey of Scaling Laws, Li et al., arXiv Badge
    • METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling, Li et al., arXiv Badge
    • Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment, Li et al., arXiv Badge
    • Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification, Zhao et al., arXiv Badge
    • TestNUC: Enhancing Test-Time Computing Approaches through Neighboring Unlabeled Data Consistency, Zou et al., arXiv Badge
    • Confidence Improves Self-Consistency in LLMs, Taubenfeld et al., arXiv Badge
    • S*: Test Time Scaling for Code Generation, Li et al., arXiv Badge
    • Is Depth All You Need? An Exploration of Iterative Reasoning in LLMs, Wu et al., arXiv Badge

    External Exploration

    • Self-Evaluation Guided Beam Search for Reasoning, Xie et al., PDF Badge
    • PATHFINDER: Guided Search over Multi-Step Reasoning Paths, Golovneva et al., PDF Badge
    • Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, Zhou et al., PDF Badge
    • Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al., PDF Badge
    • No train still gain. unleash mathematical reasoning of large language models with monte carlo tree search guided by energy function, Xu et al., arXiv Badge
    • Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning, Zhu et al., PDF Badge
    • Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping, Lehnert et al., PDF Badge
    • Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding, Liu et al., PDF Badge
    • Making PPO even better: Value-Guided Monte-Carlo Tree Search decoding, Liu et al., PDF Badge
    • On the Empirical Complexity of Reasoning and Planning in LLMs, Kang et al., PDF Badge
    • Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing, Tian et al., PDF Badge
    • Tree of Uncertain Thoughts Reasoning for Large Language Models, Mo et al., No Link Badge
    • Graph of Thoughts: Solving Elaborate Problems with Large Language Models, Besta et al., PDF Badge
    • GraphReason: Enhancing Reasoning Capabilities of Large Language Models through A Graph-Based Verification Approach, Cao et al., PDF Badge
    • Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search, Light et al., PDF Badge
    • Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al., PDF Badge
    • Demystifying chains, trees, and graphs of thoughts, Besta et al., arXiv Badge
    • Mindstar: Enhancing math reasoning in pre-trained llms at inference time, Kang et al., arXiv Badge
    • Tree search for language model agents, Koh et al., arXiv Badge
    • Agent q: Advanced reasoning and learning for autonomous ai agents, Putta et al., arXiv Badge
    • RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation, Li et al., arXiv Badge
    • Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning, Zhang et al., arXiv Badge
    • Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination, Chen et al., arXiv Badge
    • Treebon: Enhancing inference-time alignment with speculative tree-search and best-of-n sampling, Qiu et al., arXiv Badge
    • Aflow: Automating agentic workflow generation, Zhang et al., arXiv Badge
    • Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models, Wang et al., arXiv Badge
    • Deliberate reasoning for llms as structure-aware planning with accurate world model, Xiong et al., arXiv Badge
    • Enhancing multi-step reasoning abilities of language models through direct q-function optimization, Liu et al., arXiv Badge
    • Process reward model with q-value rankings, Li et al., arXiv Badge
    • Scattered Forest Search: Smarter Code Space Exploration with LLMs, Light et al., arXiv Badge
    • AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning, Xiang et al., arXiv Badge
    • CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models, Li et al., arXiv Badge
    • Marco-o1: Towards open reasoning models for open-ended solutions, Zhao et al., arXiv Badge
    • Technical report: Enhancing llm reasoning with reward-guided tree search, Jiang et al., arXiv Badge
    • SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Enhanced Code Generation, Xu et al., arXiv Badge
    • GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection, Kadam et al., arXiv Badge
    • MC-NEST--Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree, Rabby et al., arXiv Badge
    • SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models, Cheng et al., arXiv Badge
    • Forest-of-thought: Scaling test-time compute for enhancing LLM reasoning, Bi et al., arXiv Badge
    • Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search, Yao et al., arXiv Badge
    • Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation and Execution in Complex Task Handling, Ni et al., arXiv Badge
    • Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning, Jiang et al., arXiv Badge
    • Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning, Park et al., arXiv Badge
    • rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking, Guan et al., arXiv Badge
    • Evolving Deeper LLM Thinking, Lee et al., arXiv Badge
    • A Roadmap to Guide the Integration of LLMs in Hierarchical Planning, Puerta-Merino et al., arXiv Badge
    • Monte Carlo Tree Search for Comprehensive Exploration in LLM-Based Automatic Heuristic Design, Zheng et al., arXiv Badge
    • Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning, Lin et al., arXiv Badge
    • A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods, Puri et al., arXiv Badge
    • Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models, Kim et al., arXiv Badge
    • Atom of Thoughts for Markov LLM Test-Time Scaling, Teng et al., arXiv Badge
    • CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning, Pan et al., arXiv Badge
    • QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search, Lin et al., arXiv Badge
    • CritiQ: Mining Data Quality Criteria from Human Preferences, Guo et al., arXiv Badge

    Internal Exploration

    • Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al., PDF Badge
    • Proximal policy optimization algorithms, Schulman et al., arXiv Badge
    • Training verifiers to solve math word problems, Cobbe et al., arXiv Badge
    • Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, Singh et al., No Link Badge
    • RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold, Setlur et al., PDF Badge
    • A Small Step Towards Reproducing OpenAI o1: Progress Report on the Steiner Open Source Models, Ji et al., PDF Badge
    • Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback, Ivison et al., PDF Badge
    • Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., PDF Badge
    • ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search, Zhang et al., PDF Badge
    • Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al., arXiv Badge
    • Stepcoder: Improve code generation with reinforcement learning from compiler feedback, Dou et al., arXiv Badge
    • ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, Li et al., PDF Badge
    • AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training, Wan et al., PDF Badge
    • CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks, Wang et al., arXiv Badge
    • A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications, Xiao et al., arXiv Badge
    • Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability, Lin et al., arXiv Badge
    • Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization, Liu et al., arXiv Badge
    • o1-coder: an o1 replication for coding, Zhang et al., arXiv Badge
    • Offline Reinforcement Learning for LLM Multi-Step Reasoning, Wang et al., arXiv Badge
    • DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL, Luo et al., No Link Badge
    • Sft memorizes, rl generalizes: A comparative study of foundation model post-training, Chu et al., arXiv Badge
    • REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models, Hu et al., arXiv Badge
    • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., arXiv Badge
    • Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling, Hou et al., arXiv Badge
    • Diverse Preference Optimization, Lanchantin et al., arXiv Badge
    • COS (M+ O) S: Curiosity and RL-Enhanced MCTS for Exploring Story Space via Language Models, Materzok et al., arXiv Badge
    • Kimi k1.5: Scaling reinforcement learning with llms, Team et al., arXiv Badge
    • Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search, Shen et al., arXiv Badge
    • Demystifying Long Chain-of-Thought Reasoning in LLMs, Yeo et al., arXiv Badge
    • LIMR: Less is More for RL Scaling, Li et al., arXiv Badge
    • Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning, Vassoyan et al., arXiv Badge
    • Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance, Huang et al., arXiv Badge
    • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al., arXiv Badge
    • Training Language Models to Reason Efficiently, Arora et al., arXiv Badge
    • Process reinforcement through implicit rewards, Cui et al., arXiv Badge
    • Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment, Sun et al., arXiv Badge
    • Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points, Zhang et al., arXiv Badge
    • Reasoning with Reinforced Functional Token Tuning, Zhang et al., arXiv Badge
    • Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning, Lyu et al., arXiv Badge
    • Competitive Programming with Large Reasoning Models, El-Kishky et al., arXiv Badge
    • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution, Wei et al., arXiv Badge
    • Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al., arXiv Badge
    • Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation, Kim et al., arXiv Badge
    • On the Emergence of Thinking in LLMs I: Searching for the Right Intuition, Ye et al., arXiv Badge
    • The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks, Cuadron et al., arXiv Badge
    • STeCa: Step-level Trajectory Calibration for LLM Agent Learning, Wang et al., arXiv Badge
    • Thinking Preference Optimization, Yang et al., arXiv Badge

    Feasible Reflection

    Feedback

    • Concrete problems in AI safety, Amodei et al., arXiv Badge
    • Training verifiers to solve math word problems, Cobbe et al., arXiv Badge
    • Star: Bootstrapping reasoning with reasoning, Zelikman et al., No Link Badge
    • Goal misgeneralization in deep reinforcement learning, Di Langosco et al., No Link Badge
    • The effects of reward misspecification: Mapping and mitigating misaligned models, Pan et al., arXiv Badge
    • Self-critiquing models for assisting human evaluators, Saunders et al., arXiv Badge
    • Solving math word problems with process- and outcome-based feedback, Uesato et al., arXiv Badge
    • Constitutional AI: Harmlessness from AI Feedback, Bai et al., arXiv Badge
    • Towards Mitigating LLM Hallucination via Self Reflection, Ji et al., PDF Badge
    • Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al., PDF Badge
    • Reasoning with Language Model is Planning with World Model, Hao et al., PDF Badge
    • LEVER: Learning to Verify Language-to-Code Generation with Execution, Ni et al., PDF Badge
    • Large Language Models are Better Reasoners with Self-Verification, Weng et al., PDF Badge
    • Self-verification improves few-shot clinical information extraction, Gero et al., PDF Badge
    • ReAct: Synergizing Reasoning and Acting in Language Models, Yao et al., PDF Badge
    • Reflexion: language agents with verbal reinforcement learning, Shinn et al., PDF Badge
    • Critic: Large language models can self-correct with tool-interactive critiquing, Gou et al., arXiv Badge
    • Reinforced self-training (rest) for language modeling, Gulcehre et al., arXiv Badge
    • Shepherd: A critic for language model generation, Wang et al., arXiv Badge
    • Let's reward step by step: Step-Level reward model as the Navigators for Reasoning, Ma et al., arXiv Badge
    • ReFT: Reasoning with Reinforced Fine-Tuning, Trung et al., PDF Badge
    • Large Language Models Cannot Self-Correct Reasoning Yet, Huang et al., PDF Badge
    • Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives, Zhang et al., PDF Badge
    • LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, Hao et al., PDF Badge
    • Let's verify step by step, Lightman et al., PDF Badge
    • Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations, Wang et al., PDF Badge
    • Advancing Process Verification for Large Language Models via Tree-Based Preference Learning, He et al., PDF Badge
    • Reasoning in Flux: Enhancing Large Language Models Reasoning through Uncertainty-aware Adaptive Guidance, Yin et al., PDF Badge
    • When is Tree Search Useful for LLM Planning? It Depends on the Discriminator, Chen et al., PDF Badge
    • Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification, Zhou et al., PDF Badge
    • SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning, Miao et al., PDF Badge
    • Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution, Fernando et al., PDF Badge
    • Small Language Models Need Strong Verifiers to Self-Correct Reasoning, Zhang et al., PDF Badge
    • Step-level Value Preference Optimization for Mathematical Reasoning, Chen et al., PDF Badge
    • Skywork-o1 open series, Team et al., No Link Badge
    • OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning, Yu et al., No Link Badge
    • AutoPSV: Automated Process-Supervised Verifier, Lu et al., PDF Badge
    • Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models, Hu et al., PDF Badge
    • Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al., arXiv Badge
    • VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search, Brandfonbrener et al., arXiv Badge
    • Can We Verify Step by Step for Incorrect Answer Detection?, Xu et al., arXiv Badge
    • Monte carlo tree search boosts reasoning via iterative preference learning, Xie et al., arXiv Badge
    • Self-reflection in llm agents: Effects on problem-solving performance, Renze et al., arXiv Badge
    • Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models, Xu et al., arXiv Badge
    • Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback, Gao et al., arXiv Badge
    • Step-dpo: Step-wise preference optimization for long-chain reasoning of llms, Lai et al., arXiv Badge
    • Llm critics help catch llm bugs, McAleese et al., arXiv Badge
    • Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models, Lee et al., arXiv Badge
    • Tlcr: Token-level continuous reward for fine-grained reinforcement learning from human feedback, Yoon et al., arXiv Badge
    • Selective Preference Optimization via Token-Level Reward Function Estimation, Yang et al., arXiv Badge
    • Generative verifiers: Reward modeling as next-token prediction, Zhang et al., arXiv Badge
    • Critic-cot: Boosting the reasoning abilities of large language model via chain-of-thoughts critic, Zheng et al., arXiv Badge
    • On designing effective rl reward at training time for llm reasoning, Gao et al., arXiv Badge
    • Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up, Yuan et al., arXiv Badge
    • Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment, Kazemnejad et al., arXiv Badge
    • Self-generated critiques boost reward modeling for language models, Yu et al., arXiv Badge
    • From generation to judgment: Opportunities and challenges of llm-as-a-judge, Li et al., arXiv Badge
    • Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering, Guan et al., arXiv Badge
    • Entropy-Regularized Process Reward Model, Zhang et al., arXiv Badge
    • Llms-as-judges: a comprehensive survey on llm-based evaluation methods, Li et al., arXiv Badge
    • o1-coder: an o1 replication for coding, Zhang et al., arXiv Badge
    • Hunyuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving, Li et al., arXiv Badge
    • Acemath: Advancing frontier math reasoning with post-training and reward modeling, Liu et al., arXiv Badge
    • Free process rewards without process labels, Yuan et al., arXiv Badge
    • Outcome-Refining Process Supervision for Code Generation, Yu et al., arXiv Badge
    • What Makes Large Language Models Reason in (Multi-Turn) Code Generation?, Zheng et al., PDF Badge
    • Advancing LLM Reasoning Generalists with Preference Trees, Yuan et al., PDF Badge
    • Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems, Ye et al., PDF Badge
    • QwQ: Reflect Deeply on the Boundaries of the Unknown, Team et al., No Link Badge
    • Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning, Setlur et al., PDF Badge
    • Learning to Plan \& Reason for Evaluation with Thinking-LLM-as-a-Judge, Saha et al., arXiv Badge
    • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., arXiv Badge
    • Dynamic Scaling of Unit Tests for Code Reward Modeling, Ma et al., arXiv Badge
    • Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models, Liu et al., arXiv Badge
    • Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework, Sun et al., arXiv Badge
    • The lessons of developing process reward models in mathematical reasoning, Zhang et al., arXiv Badge
    • Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback, Lin et al., arXiv Badge
    • Zero-Shot Verification-guided Chain of Thoughts, Chowdhury et al., arXiv Badge
    • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al., arXiv Badge
    • Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models, Gu et al., arXiv Badge
    • Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?, Zhang et al., arXiv Badge
    • Uncertainty-Aware Step-wise Verification with Generative Reward Models, Ye et al., arXiv Badge
    • Unveiling and Causalizing CoT: A Causal Perspective, Fu et al., arXiv Badge
    • Diverse Inference and Verification for Advanced Reasoning, Drori et al., arXiv Badge
    • Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges, Shrestha et al., arXiv Badge
    • A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics, Wei et al., arXiv Badge
    • Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models, Zhou et al., arXiv Badge
    • ACECODER: Acing Coder RL via Automated Test-Case Synthesis, Zeng et al., arXiv Badge
    • RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation, Zhou et al., arXiv Badge
    • Process Reward Models for LLM Agents: Practical Framework and Directions, Choudhury et al., arXiv Badge
    • Process reinforcement through implicit rewards, Cui et al., arXiv Badge
    • Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning, Xu et al., arXiv Badge
    • VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data, Zeng et al., arXiv Badge
    • Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values, Zhang et al., arXiv Badge
    • Teaching Language Models to Critique via Reinforcement Learning, Xie et al., arXiv Badge
    • Uncertainty-Aware Search and Value Models: Mitigating Search Scaling Flaws in LLMs, Yu et al., arXiv Badge
    • AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification, Tan et al., arXiv Badge
    • Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems, Peng et al., arXiv Badge

    Refinement

    • Self-critiquing models for assisting human evaluators, Saunders et al., arXiv Badge
    • Self-Refine: Iterative Refinement with Self-Feedback, Madaan et al., PDF Badge
    • Reflexion: language agents with verbal reinforcement learning, Shinn et al., PDF Badge
    • Towards Mitigating LLM Hallucination via Self Reflection, Ji et al., PDF Badge
    • Grace: Discriminator-guided chain-of-thought reasoning, Khalifa et al., arXiv Badge
    • Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies, Pan et al., arXiv Badge
    • Learning from mistakes makes llm better reasoner, An et al., arXiv Badge
    • Reflection-tuning: Data recycling improves llm instruction-tuning, Li et al., arXiv Badge
    • Toward Adaptive Reasoning in Large Language Models with Thought Rollback, Chen et al., PDF Badge
    • Progressive-Hint Prompting Improves Reasoning in Large Language Models, Zheng et al., PDF Badge
    • SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning, Miao et al., PDF Badge
    • Advancing Large Language Model Attribution through Self-Improving, Huang et al., PDF Badge
    • REFINER: Reasoning Feedback on Intermediate Representations, Paul et al., PDF Badge
    • ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search, Zhang et al., PDF Badge
    • LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints, Ferraz et al., PDF Badge
    • Teaching Large Language Models to Self-Debug, Chen et al., PDF Badge
    • Recursive Introspection: Teaching Language Model Agents How to Self-Improve, Qu et al., PDF Badge
    • Learning to check: Unleashing potentials for self-correction in large language models, Zhang et al., arXiv Badge
    • GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements, Havrilla et al., PDF Badge
    • Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic, Zhao et al., PDF Badge
    • General purpose verification for chain of thought prompting, Vacareanu et al., arXiv Badge
    • Enhancing visual-language modality alignment in large vision language models via self-improvement, Wang et al., arXiv Badge
    • Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback, Gao et al., arXiv Badge
    • Large language models have intrinsic self-correction ability, Liu et al., arXiv Badge
    • Accessing gpt-4 level mathematical olympiad solutions via monte carlo tree self-refine with llama-3 8b, Zhang et al., arXiv Badge
    • CoT Rerailer: Enhancing the Reliability of Large Language Models in Complex Reasoning Tasks through Error Detection and Correction, Wan et al., arXiv Badge
    • Mutual reasoning makes smaller llms stronger problem-solvers, Qi et al., arXiv Badge
    • S$^3$c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners, Yan et al., arXiv Badge
    • Training language models to self-correct via reinforcement learning, Kumar et al., arXiv Badge
    • Enhancing Mathematical Reasoning in LLMs by Stepwise Correction, Wu et al., arXiv Badge
    • O1 Replication Journey: A Strategic Progress Report--Part 1, Qin et al., arXiv Badge
    • Enhancing llm reasoning via critique models with test-time and training-time supervision, Xi et al., arXiv Badge
    • Vision-language models can self-improve reasoning via reflection, Cheng et al., arXiv Badge
    • Confidence vs Critique: A Decomposition of Self-Correction Capability for LLMs, Yang et al., arXiv Badge
    • LLM2: Let Large Language Models Harness System 2 Reasoning, Yang et al., arXiv Badge
    • Understanding the Dark Side of LLMs' Intrinsic Self-Correction, Zhang et al., arXiv Badge
    • 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient, Zeng et al., No Link Badge
    • BackMATH: Towards Backward Reasoning for Solving Math Problems Step by Step, Zhang et al., PDF Badge
    • Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents, He et al., arXiv Badge
    • CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis, Zhang et al., arXiv Badge
    • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., arXiv Badge
    • ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding, Sun et al., arXiv Badge
    • Critique fine-tuning: Learning to critique is more effective than learning to imitate, Wang et al., arXiv Badge
    • RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques, Tang et al., arXiv Badge
    • ProgCo: Program Helps Self-Correction of Large Language Models, Song et al., arXiv Badge
    • URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics, Luo et al., arXiv Badge
    • S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning, Ma et al., arXiv Badge
    • ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates, Yang et al., arXiv Badge
    • ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification, Lee et al., arXiv Badge
    • Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models, Yang et al., arXiv Badge
    • Iterative Deepening Sampling for Large Language Models, Chen et al., arXiv Badge
    • LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!, Li et al., arXiv Badge
    • MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification, Sun et al., arXiv Badge
    • ARIES: Stimulating Self-Refinement of Large Language Models by Iterative Preference Optimization, Zeng et al., arXiv Badge

    Analysis

    Analysis & Explanation for Long CoT

    • Greedy Policy Search: A Simple Baseline for Learnable Test-Time Augmentation, Lyzhov et al., PDF Badge
    • Can language models learn from explanations in context?, Lampinen et al., PDF Badge
    • Star: Bootstrapping reasoning with reasoning, Zelikman et al., No Link Badge
    • Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters, Wang et al., PDF Badge
    • What Makes Chain-of-Thought Prompting Effective? A Counterfactual Study, Madaan et al., No Link Badge
    • The Expressive Power of Transformers with Chain of Thought, Merrill et al., No Link Badge
    • Chain of Thought Empowers Transformers to Solve Inherently Serial Problems, Li et al., No Link Badge
    • Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective, Feng et al., PDF Badge
    • Causal Abstraction for Chain-of-Thought Reasoning in Arithmetic Word Problems, Tan et al., PDF Badge
    • How Large Language Models Implement Chain-of-Thought?, Wang et al., No Link Badge
    • How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, Hanna et al., PDF Badge
    • Why think step by step? Reasoning emerges from the locality of experience, Prystawski et al., PDF Badge
    • Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data, Shum et al., PDF Badge
    • MoT: Memory-of-Thought Enables ChatGPT to Self-Improve, Li et al., PDF Badge
    • LAMBADA: Backward Chaining for Automated Reasoning in Natural Language, Kazemi et al., PDF Badge
    • MathPrompter: Mathematical Reasoning using Large Language Models, Imani et al., No Link Badge
    • System 2 Attention (is something you might need too), Weston et al., arXiv Badge
    • Chain of Thoughtlessness? An Analysis of CoT in Planning, Stechly et al., PDF Badge
    • Chain-of-Thought Reasoning Without Prompting, Wang et al., PDF Badge
    • When Do Program-of-Thought Works for Reasoning?, Bi et al., No Link Badge
    • Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., PDF Badge
    • The Impact of Reasoning Step Length on Large Language Models, Jin et al., PDF Badge
    • Explainable AI in Large Language Models: A Review, Sauhandikaa et al., No Link Badge
    • Do Large Language Models Latently Perform Multi-Hop Reasoning?, Yang et al., PDF Badge
    • MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Sprague et al., PDF Badge
    • Not All LLM Reasoners Are Created Equal, Hosseini et al., PDF Badge
    • DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models, Pan et al., PDF Badge
    • From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Welleck et al., PDF Badge
    • How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning, Dutta et al., arXiv Badge
    • How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Huang et al., arXiv Badge
    • Exploring the compositional deficiency of large language models in mathematical reasoning, Zhao et al., arXiv Badge
    • Xai meets llms: A survey of the relation between explainable ai and large language models, Cambria et al., arXiv Badge
    • Large language monkeys: Scaling inference compute with repeated sampling, Brown et al., arXiv Badge
    • Compositional Hardness of Code in Large Language Models--A Probabilistic Perspective, Wolf et al., arXiv Badge
    • What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective, Li et al., arXiv Badge
    • When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1, McCoy et al., arXiv Badge
    • Thinking llms: General instruction following with thought generation, Wu et al., arXiv Badge
    • What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-Boosted Mathematical Reasoning, Ma et al., arXiv Badge
    • Do not think that much for 2+3=? on the overthinking of o1-like llms, Chen et al., arXiv Badge
    • Openai o1 system card, Jaech et al., arXiv Badge
    • Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models, Song et al., PDF Badge
    • Open R1, Team et al., No Link Badge
    • There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study, Liu et al., No Link Badge
    • Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers?, Jin et al., PDF Badge
    • OverThink: Slowdown Attacks on Reasoning LLMs, Kumar et al., No Link Badge
    • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., arXiv Badge
    • Sft memorizes, rl generalizes: A comparative study of foundation model post-training, Chu et al., arXiv Badge
    • On the reasoning capacity of ai models and how to quantify it, Radha et al., arXiv Badge
    • Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, Xiang et al., arXiv Badge
    • Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning, Gan et al., arXiv Badge
    • Complexity Control Facilitates Reasoning-Based Compositional Generalization in Transformers, Zhang et al., arXiv Badge
    • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al., arXiv Badge
    • GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?, Zhou et al., arXiv Badge
    • Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers, Amiri et al., arXiv Badge
    • When More is Less: Understanding Chain-of-Thought Length in LLMs, Wu et al., arXiv Badge
    • ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al., arXiv Badge
    • Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al., arXiv Badge
    • Examining False Positives under Inference Scaling for Mathematical Reasoning, Wang et al., arXiv Badge
    • Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective, Jia et al., arXiv Badge
    • Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models, Yu et al., arXiv Badge
    • Think or Step-by-Step? UnZIPping the Black Box in Zero-Shot Prompts, Sadr et al., arXiv Badge
    • Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking, Zhang et al., arXiv Badge
    • The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It, Bertolazzi et al., arXiv Badge
    • How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training, Ou et al., arXiv Badge
    • Language Models Can Predict Their Own Behavior, Ashok et al., arXiv Badge
    • Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning, Ma et al., arXiv Badge
    • The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks, Cuadron et al., arXiv Badge
    • PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models, Anderson et al., arXiv Badge
    • Scaling Test-Time Compute Without Verification or RL is Suboptimal, Setlur et al., arXiv Badge

    Long CoT Evaluations

    • On the measure of intelligence, Chollet et al., arXiv Badge
    • Measuring Mathematical Problem Solving With the MATH Dataset, Hendrycks et al., PDF Badge
    • What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams, Jin et al., PDF Badge
    • Training verifiers to solve math word problems, Cobbe et al., arXiv Badge
    • ScienceWorld: Is your Agent Smarter than a 5th Grader?, Wang et al., PDF Badge
    • WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents, Yao et al., PDF Badge
    • Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering, Lu et al., PDF Badge
    • Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, Suzgun et al., PDF Badge
    • ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning, Golovneva et al., PDF Badge
    • ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness, Prasad et al., PDF Badge
    • Making Language Models Better Reasoners with Step-Aware Verifier, Li et al., PDF Badge
    • A Multi-Modal Neural Geometric Solver with Textual Clauses Parsed from Diagram, Zhang et al., PDF Badge
    • AI for Math or Math for AI? On the Generalization of Learning Mathematical Problem Solving, Zhou et al., PDF Badge
    • AIME 2024, AI-MO et al., No Link Badge
    • Let's verify step by step, Lightman et al., PDF Badge
    • AMC 2023, AI-MO et al., No Link Badge
    • OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems, He et al., PDF Badge
    • Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning, Gulati et al., PDF Badge
    • SWE-bench: Can Language Models Resolve Real-world Github Issues?, Jimenez et al., PDF Badge
    • GPQA: A Graduate-Level Google-Proof Q&A Benchmark, Rein et al., PDF Badge
    • MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark, Wang et al., PDF Badge
    • Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., PDF Badge
    • Evaluating LLMs at Detecting Errors in LLM Responses, Kamoi et al., PDF Badge
    • MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs, Zeng et al., PDF Badge
    • CriticBench: Benchmarking LLMs for Critique-Correct Reasoning, Lin et al., PDF Badge
    • WebArena: A Realistic Web Environment for Building Autonomous Agents, Zhou et al., PDF Badge
    • OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, Xie et al., PDF Badge
    • CogAgent: A Visual Language Model for GUI Agents, Hong et al., No Link Badge
    • MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, Yue et al., No Link Badge
    • MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts, Lu et al., PDF Badge
    • Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset, Wang et al., PDF Badge
    • MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?, Zhang et al., No Link Badge
    • M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought, Chen et al., PDF Badge
    • Benchmarking large language models on answering and explaining challenging medical questions, Chen et al., arXiv Badge
    • How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Huang et al., arXiv Badge
    • Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems, Zhong et al., arXiv Badge
    • MHPP: Exploring the capabilities and limitations of language models beyond basic code generation, Dai et al., arXiv Badge
    • Plot2Code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots, Wu et al., arXiv Badge
    • Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers, Si et al., arXiv Badge
    • MLE-bench: Evaluating machine learning agents on machine learning engineering, Chan et al., arXiv Badge
    • EVOLvE: Evaluating and Optimizing LLMs For Exploration, Nie et al., arXiv Badge
    • JudgeBench: A benchmark for evaluating LLM-based judges, Tan et al., arXiv Badge
    • ErrorRadar: Benchmarking complex mathematical reasoning of multimodal large language models via error detection, Yan et al., arXiv Badge
    • HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks, Zhang et al., arXiv Badge
    • Chain of Ideas: Revolutionizing research via novel idea development with LLM agents, Li et al., arXiv Badge
    • FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI, Glazer et al., arXiv Badge
    • HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation, Yu et al., arXiv Badge
    • MEDEC: A benchmark for medical error detection and correction in clinical notes, Abacha et al., arXiv Badge
    • CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al., arXiv Badge
    • AIME 2025, OpenCompass et al., No Link Badge
    • LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, Jain et al., PDF Badge
    • CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models, Li et al., PDF Badge
    • Open Deep Research, Team et al., No Link Badge
    • JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models, Chen et al., arXiv Badge
    • ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark, Nath et al., arXiv Badge
    • HardML: A Benchmark for Evaluating Data Science and Machine Learning Knowledge and Reasoning in AI, Pricope et al., arXiv Badge
    • Humanity's Last Exam, Phan et al., arXiv Badge
    • MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding, Zuo et al., arXiv Badge
    • PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models, Song et al., arXiv Badge
    • Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks, Wang et al., arXiv Badge
    • PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models, Anderson et al., arXiv Badge
    • ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning, Lin et al., arXiv Badge
    • Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring, Heyman et al., arXiv Badge
    • Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models, Yasunaga et al., arXiv Badge
    • PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning, Zhang et al., arXiv Badge
    • Text2World: Benchmarking Large Language Models for Symbolic World Model Generation, Hu et al., arXiv Badge
    • Generating Symbolic World Models via Test-time Scaling of Large Language Models, Yu et al., arXiv Badge
    • MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations, Huang et al., arXiv Badge
    • EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking, Wei et al., arXiv Badge
    • SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines, Du et al., arXiv Badge
    • Evaluating Step-by-step Reasoning Traces: A Survey, Lee et al., arXiv Badge
    • Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges, Shrestha et al., arXiv Badge
    • Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al., arXiv Badge
    • CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models, Zhang et al., arXiv Badge
    • WebGames: Challenging General-Purpose Web-Browsing AI Agents, Thomas et al., arXiv Badge
    • VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model, Zheng et al., arXiv Badge
    • Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration, Wang et al., arXiv Badge
    • EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges, Wang et al., arXiv Badge
    • Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities, Wang et al., arXiv Badge
    • Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research, Wu et al., arXiv Badge

    Future

    Agentic & Embodied Long CoT

    • Large language models as commonsense knowledge for large-scale task planning, Zhao et al., No Link Badge
    • Solving Math Word Problems via Cooperative Reasoning induced Language Models, Zhu et al., PDF Badge
    • Reasoning with language model is planning with world model, Hao et al., arXiv Badge
    • Tree-Planner: Efficient Close-loop Task Planning with Large Language Models, Hu et al., PDF Badge
    • Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search, Light et al., PDF Badge
    • Agents Thinking Fast and Slow: A Talker-Reasoner Architecture, Christakopoulou et al., PDF Badge
    • MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems, Lei et al., PDF Badge
    • Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al., PDF Badge
    • ADaPT: As-Needed Decomposition and Planning with Language Models, Prasad et al., PDF Badge
    • Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models, Zhou et al., PDF Badge
    • Mixture-of-agents enhances large language model capabilities, Wang et al., arXiv Badge
    • Tree search for language model agents, Koh et al., arXiv Badge
    • HiAgent: Hierarchical working memory management for solving long-horizon agent tasks with large language model, Hu et al., arXiv Badge
    • EVOLvE: Evaluating and Optimizing LLMs For Exploration, Nie et al., arXiv Badge
    • Titans: Learning to memorize at test time, Behrouz et al., arXiv Badge
    • Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success, Kim et al., arXiv Badge

    Efficient Long CoT

    • Guiding language model reasoning with planning tokens, Wang et al., arXiv Badge
    • Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., PDF Badge
    • DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models, Pan et al., PDF Badge
    • Synergy-of-thoughts: Eliciting efficient reasoning in hybrid language models, Shang et al., arXiv Badge
    • Distilling system 2 into system 1, Yu et al., arXiv Badge
    • Concise thoughts: Impact of output length on LLM reasoning and cost, Nayab et al., arXiv Badge
    • LiteSearch: Efficacious tree search for LLM, Wang et al., arXiv Badge
    • Uncertainty-Guided Optimization on Large Language Model Search Trees, Grosse et al., arXiv Badge
    • KVSharer: Efficient inference via layer-wise dissimilar KV cache sharing, Yang et al., arXiv Badge
    • Interpretable contrastive Monte Carlo tree search reasoning, Gao et al., arXiv Badge
    • Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces, Su et al., arXiv Badge
    • Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding, Chen et al., arXiv Badge
    • Token-budget-aware LLM reasoning, Han et al., arXiv Badge
    • B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners, Zeng et al., arXiv Badge
    • C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness, Kang et al., arXiv Badge
    • Training large language models to reason in a continuous latent space, Hao et al., arXiv Badge
    • CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al., arXiv Badge
    • On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes, Chang et al., No Link Badge
    • O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning, Luo et al., arXiv Badge
    • Reward-Guided Speculative Decoding for Efficient LLM Reasoning, Liao et al., arXiv Badge
    • Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization, Yu et al., arXiv Badge
    • Efficient Reasoning with Hidden Thinking, Shen et al., arXiv Badge
    • On the Query Complexity of Verifier-Assisted Language Generation, Botta et al., arXiv Badge
    • TokenSkip: Controllable Chain-of-Thought Compression in LLMs, Xia et al., arXiv Badge
    • Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation, Du et al., arXiv Badge
    • Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE, Huang et al., arXiv Badge
    • Towards Reasoning Ability of Small Language Models, Srivastava et al., arXiv Badge
    • Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs, Ji et al., arXiv Badge
    • Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models, Chijiwa et al., arXiv Badge
    • MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification, Sun et al., arXiv Badge
    • Language Models Can Predict Their Own Behavior, Ashok et al., arXiv Badge
    • CoT-Valve: Length-Compressible Chain-of-Thought Tuning, Ma et al., arXiv Badge
    • Training Language Models to Reason Efficiently, Arora et al., arXiv Badge
    • Chain of Draft: Thinking Faster by Writing Less, Xu et al., arXiv Badge
    • Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning, Wang et al., arXiv Badge
    • Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking, Ziabari et al., arXiv Badge
    • Dynamic Parallel Tree Search for Efficient LLM Reasoning, Ding et al., arXiv Badge
    • Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models, Cui et al., arXiv Badge
    • SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs, Xu et al., arXiv Badge
    • LightThinker: Thinking Step-by-Step Compression, Zhang et al., arXiv Badge

    Knowledge-Augmented Long CoT

    • Best of Both Worlds: Harmonizing LLM Capabilities in Decision-Making and Question-Answering for Treatment Regimes, Liu et al., PDF Badge
    • Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation, Wang et al., PDF Badge
    • Stream of Search (SoS): Learning to search in language, Gandhi et al., No Link Badge
    • CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing, Yang et al., arXiv Badge
    • Disentangling memory and reasoning ability in large language models, Jin et al., arXiv Badge
    • HuatuoGPT-o1, towards medical complex reasoning with LLMs, Chen et al., arXiv Badge
    • RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement, Jiang et al., arXiv Badge
    • Open Deep Research, Team et al., No Link Badge
    • Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study, Zhao et al., PDF Badge
    • O1 Replication Journey--Part 3: Inference-time Scaling for Medical Reasoning, Huang et al., arXiv Badge
    • MedS$^3$: Towards Medical Small Language Models with Self-Evolved Slow Thinking, Jiang et al., arXiv Badge
    • Search-o1: Agentic search-enhanced large reasoning models, Li et al., arXiv Badge
    • Chain-of-Retrieval Augmented Generation, Wang et al., arXiv Badge
    • ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al., arXiv Badge
    • Large Language Models for Recommendation with Deliberative User Preference Alignment, Fang et al., arXiv Badge
    • DeepRAG: Thinking to Retrieval Step by Step for Large Language Models, Guan et al., arXiv Badge
    • HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation, Liu et al., arXiv Badge
    • O1 Embedder: Let Retrievers Think Before Action, Yan et al., arXiv Badge
    • Towards Robust Legal Reasoning: Harnessing Logical LLMs in Law, Kant et al., arXiv Badge
    • OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning, Lu et al., arXiv Badge

    Multilingual Long CoT

    • Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al., No Link Badge
    • Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting, Huang et al., PDF Badge
    • A Tree-of-Thoughts to Broaden Multi-step Reasoning across Languages, Ranaldi et al., PDF Badge
    • Enhancing Advanced Visual Reasoning Ability of Large Language Models, Li et al., PDF Badge
    • AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought, Zhang et al., PDF Badge
    • xCoT: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning, Chai et al., arXiv Badge
    • Multilingual large language model: A survey of resources, taxonomy and frontiers, Qin et al., arXiv Badge
    • DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought, Wang et al., arXiv Badge
    • A survey of multilingual large language models, Qin et al., No Link Badge
    • Demystifying Multilingual Chain-of-Thought in Process Reward Modeling, Wang et al., arXiv Badge
    • The Multilingual Mind: A Survey of Multilingual Reasoning in Language Models, Ghosh et al., arXiv Badge

    Multimodal Long CoT

    • Multimodal Chain-of-Thought Reasoning in Language Models, Zhang et al., PDF Badge
    • M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought, Chen et al., PDF Badge
    • Enhancing Advanced Visual Reasoning Ability of Large Language Models, Li et al., PDF Badge
    • ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback, Byun et al., PDF Badge
    • Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al., PDF Badge
    • Large Language Models Can Self-Correct with Minimal Effort, Wu et al., PDF Badge
    • Q*: Improving multi-step reasoning for LLMs with deliberative planning, Wang et al., arXiv Badge
    • A survey on evaluation of multimodal large language models, Huang et al., arXiv Badge
    • What factors affect multi-modal in-context learning? An in-depth exploration, Qin et al., arXiv Badge
    • Insight-V: Exploring long-chain visual reasoning with multimodal large language models, Dong et al., arXiv Badge
    • AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning, Xiang et al., arXiv Badge
    • Enhancing the reasoning ability of multimodal large language models via mixed preference optimization, Wang et al., arXiv Badge
    • LLaVA-o1: Let vision language models reason step-by-step, Xu et al., arXiv Badge
    • Slow Perception: Let's Perceive Geometric Figures Step-by-step, Wei et al., arXiv Badge
    • Diving into Self-Evolving Training for Multimodal Reasoning, Liu et al., arXiv Badge
    • Scaling inference-time search with vision value model for improved visual comprehension, Wang et al., arXiv Badge
    • CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al., arXiv Badge
    • Visual Agents as Fast and Slow Thinkers, Sun et al., PDF Badge
    • Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model, Ma et al., arXiv Badge
    • BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning, Zhang et al., arXiv Badge
    • Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark, Hao et al., arXiv Badge
    • Virgo: A Preliminary Exploration on Reproducing o1-like MLLM, Du et al., arXiv Badge
    • LlamaV-o1: Rethinking step-by-step visual reasoning in LLMs, Thawakar et al., arXiv Badge
    • Inference-time scaling for diffusion models beyond scaling denoising steps, Ma et al., arXiv Badge
    • Imagine while Reasoning in Space: Multimodal Visualization-of-Thought, Li et al., arXiv Badge
    • Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step, Guo et al., arXiv Badge
    • Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking, Wu et al., arXiv Badge

    Safety for Long CoT

    • Larger and more instructable language models become less reliable, Zhou et al., No Link Badge
    • The Impact of Reasoning Step Length on Large Language Models, Jin et al., PDF Badge
    • Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., PDF Badge
    • Can Large Language Models Understand You Better? An MBTI Personality Detection Dataset Aligned with Population Traits, Li et al., arXiv Badge
    • OverThink: Slowdown Attacks on Reasoning LLMs, Kumar et al., No Link Badge
    • o3-mini vs DeepSeek-R1: Which One is Safer?, Arrieta et al., arXiv Badge
    • Efficient Reasoning with Hidden Thinking, Shen et al., arXiv Badge
    • Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking, Cheng et al., arXiv Badge
    • Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection, Zhao et al., arXiv Badge
    • Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies, Parmar et al., arXiv Badge
    • Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation, Arrieta et al., arXiv Badge
    • International AI Safety Report, Bengio et al., arXiv Badge
    • MetaSC: Test-Time Safety Specification Optimization for Language Models, Gallego et al., arXiv Badge
    • Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment, Wang et al., arXiv Badge
    • The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1, Zhou et al., arXiv Badge
    • Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models, Lu et al., arXiv Badge
    • Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?, Bengio et al., arXiv Badge
    • Emergent Response Planning in LLM, Dong et al., arXiv Badge
    • Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models, Kharinaev et al., arXiv Badge
    • H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking, Kuo et al., arXiv Badge
    • BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack, Zhu et al., arXiv Badge
    • SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities, Jiang et al., arXiv Badge

    Proper Reward Design

    • Demystifying Long Chain-of-Thought Reasoning in LLMs, Yeo et al., arXiv Badge
