Towards Reasoning Era: A Survey of Long Chain-of-Thought

Harbin Institute of Technology, Central South University, The University of Hong Kong, Fudan University

Abstract

Recent advances in logical reasoning tasks are often attributed to test-time scaling, with many researchers suggesting that allocating more inference-time computation to longer reasoning sequences improves performance. However, this idea is challenged in simpler tasks, such as commonsense reasoning and basic mathematics, where test-time scaling may lead to “overthinking” that hampers model performance. This paradox remains underexplored, with existing research limited by two main shortcomings: a failure to distinguish between Long Chains of Thought (Long CoT) and Short Chains of Thought (Short CoT), and the absence of a comprehensive review on the topic. To address these issues, this survey first distinguishes between Long CoT and Short CoT, introducing a new taxonomy to categorize these reasoning paradigms. We examine the key characteristics of Long CoT—Deep Reasoning, Extensive Exploration, and Feasible Reflection—and highlight how these features enable deeper and more efficient reasoning compared to the shallow, redundancy-prone Short CoT. Our review synthesizes the current state of Long CoT research, identifies critical gaps, and suggests future research directions. We also address challenges in Long CoT, such as multi-modal reasoning, efficiency, and knowledge integration, and recommend resources, including open-source software, corpora, and key publications, to support further studies. Through this survey, we aim to offer a unified perspective on Long CoT, propose strategies to overcome existing limitations, and inspire future research to push the boundaries of logical reasoning in artificial intelligence.
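To make the test-time scaling idea above concrete, the sketch below shows one of its simplest instantiations: self-consistency majority voting over sampled reasoning chains (Wang et al., listed under Exploration Scaling). It is an illustrative sketch only, not code from the survey; `query_model` and `extract_answer` are hypothetical placeholders to be swapped for a real LLM client and an answer parser.

```python
# Illustrative sketch of test-time scaling via self-consistency majority voting.
# Not from the survey; `query_model` and `extract_answer` are hypothetical stubs.
from collections import Counter

def query_model(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical LLM call; swap in a real client (e.g., an OpenAI or vLLM wrapper)."""
    raise NotImplementedError

def extract_answer(completion: str) -> str:
    """Pull the final answer out of a chain-of-thought completion (format-dependent)."""
    return completion.strip().splitlines()[-1]

def self_consistency(prompt: str, n_samples: int = 8) -> str:
    """Sample several reasoning chains and return the most frequent final answer."""
    answers = [extract_answer(query_model(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Raising `n_samples` trades extra inference compute for accuracy, which is precisely the regime where the overthinking papers collected below report diminishing or even negative returns on simpler tasks.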

Paper List

Analysis and Evaluation

Analysis & Explanation for Long CoT

  • Concrete problems in AI safety, Amodei et al., arXiv Badge
  • Greedy Policy Search: A Simple Baseline for Learnable Test-Time Augmentation, Lyzhov et al., PDF Badge
  • The effects of reward misspecification: Mapping and mitigating misaligned models, Pan et al., arXiv Badge
  • Goal misgeneralization in deep reinforcement learning, Di Langosco et al., PDF Badge
  • Star: Bootstrapping reasoning with reasoning, Zelikman et al., PDF Badge
  • Can language models learn from explanations in context?, Lampinen et al., PDF Badge
  • The Expressive Power of Transformers with Chain of Thought, Merrill et al., PDF Badge
  • Chain of Thought Empowers Transformers to Solve Inherently Serial Problems, Li et al., PDF Badge
  • Mathprompter: Mathematical reasoning using large language models, Imani et al., arXiv Badge
  • Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters, Wang et al., PDF Badge
  • LAMBADA: Backward Chaining for Automated Reasoning in Natural Language, Kazemi et al., PDF Badge
  • Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective, Feng et al., PDF Badge
  • Why think step by step? Reasoning emerges from the locality of experience, Prystawski et al., PDF Badge
  • How Large Language Models Implement Chain-of-Thought?, Wang et al., PDF Badge
  • How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, Hanna et al., PDF Badge
  • System 2 Attention (is something you might need too), Weston et al., arXiv Badge
  • What Makes Chain-of-Thought Prompting Effective? A Counterfactual Study, Madaan et al., PDF Badge
  • Causal Abstraction for Chain-of-Thought Reasoning in Arithmetic Word Problems, Tan et al., PDF Badge
  • Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data, Shum et al., PDF Badge
  • MoT: Memory-of-Thought Enables ChatGPT to Self-Improve, Li et al., PDF Badge
  • When Do Program-of-Thought Works for Reasoning?, Bi et al., PDF Badge
  • Explainable AI in Large Language Models: A Review, Sauhandikaa et al., PDF Badge
  • MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Sprague et al., PDF Badge
  • How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Huang et al., arXiv Badge
  • Large language monkeys: Scaling inference compute with repeated sampling, Brown et al., arXiv Badge
  • Xai meets llms: A survey of the relation between explainable ai and large language models, Cambria et al., arXiv Badge
  • The llama 3 herd of models, Dubey et al., arXiv Badge
  • How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning, Dutta et al., PDF Badge
  • The Impact of Reasoning Step Length on Large Language Models, Jin et al., PDF Badge
  • Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, Wu et al., arXiv Badge
  • Do Large Language Models Latently Perform Multi-Hop Reasoning?, Yang et al., PDF Badge
  • An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs, Rai et al., PDF Badge
  • Chain of Thoughtlessness? An Analysis of CoT in Planning, Stechly et al., PDF Badge
  • Chain-of-Thought Reasoning Without Prompting, Wang et al., PDF Badge
  • Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., PDF Badge
  • Compositional Hardness of Code in Large Language Models--A Probabilistic Perspective, Wolf et al., arXiv Badge
  • What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective, Li et al., arXiv Badge
  • When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1, McCoy et al., arXiv Badge
  • Not All LLM Reasoners Are Created Equal, Hosseini et al., PDF Badge
  • Thinking llms: General instruction following with thought generation, Wu et al., arXiv Badge
  • Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning Through Trap Problems, Zhao et al., PDF Badge
  • DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models, Pan et al., PDF Badge
  • From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Welleck et al., PDF Badge
  • What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-Boosted Mathematical Reasoning, Ma et al., arXiv Badge
  • Qwen2.5 technical report, Yang et al., arXiv Badge
  • Do not think that much for 2+3=? on the overthinking of o1-like llms, Chen et al., arXiv Badge
  • Openai o1 system card, Jaech et al., arXiv Badge
  • Processbench: Identifying process errors in mathematical reasoning, Zheng et al., arXiv Badge
  • There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study, Liu et al., Notion Badge
  • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., arXiv Badge
  • Open R1, Team et al., Github Badge
  • On the reasoning capacity of ai models and how to quantify it, Radha et al., arXiv Badge
  • Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers?, Jin et al., PDF Badge
  • Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, Xiang et al., arXiv Badge
  • Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning, Gan et al., arXiv Badge
  • Complexity Control Facilitates Reasoning-Based Compositional Generalization in Transformers, Zhang et al., arXiv Badge
  • Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models, Xu et al., arXiv Badge
  • PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models, Song et al., arXiv Badge
  • Think or Step-by-Step? UnZIPping the Black Box in Zero-Shot Prompts, Sadr et al., arXiv Badge
  • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al., arXiv Badge
  • GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?, Zhou et al., arXiv Badge
  • Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers, Amiri et al., arXiv Badge
  • The Lookahead Limitation: Why Multi-Operand Addition is Hard for LLMs, Baeumel et al., arXiv Badge
  • When More is Less: Understanding Chain-of-Thought Length in LLMs, Wu et al., arXiv Badge
  • ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al., arXiv Badge
  • Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al., arXiv Badge
  • Examining False Positives under Inference Scaling for Mathematical Reasoning, Wang et al., arXiv Badge
  • Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective, Jia et al., arXiv Badge
  • Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking, Zhang et al., arXiv Badge
  • The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It, Bertolazzi et al., arXiv Badge
  • How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training, Ou et al., arXiv Badge
  • Language Models Can Predict Their Own Behavior, Ashok et al., arXiv Badge
  • Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning, Ma et al., arXiv Badge
  • The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks, Cuadron et al., arXiv Badge
  • OVERTHINKING: Slowdown Attacks on Reasoning LLMs, Kumar et al., arXiv Badge
  • PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models, Anderson et al., arXiv Badge
  • Scaling Test-Time Compute Without Verification or RL is Suboptimal, Setlur et al., arXiv Badge
  • The Relationship Between Reasoning and Performance in Large Language Models--o3 (mini) Thinks Harder, Not Longer, Ballon et al., arXiv Badge
  • Unveiling and Causalizing CoT: A Causal Perspective, Fu et al., arXiv Badge
  • Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems, Peng et al., arXiv Badge
  • Layer by Layer: Uncovering Hidden Representations in Language Models, Skean et al., arXiv Badge
  • Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models, Yu et al., arXiv Badge
  • Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs, Gandhi et al., arXiv Badge
  • R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model, Zhou et al., arXiv Badge
  • MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning, Meng et al., arXiv Badge
  • Reasoning Beyond Limits: Advances and Open Problems for LLMs, Ferrag et al., arXiv Badge
  • Process-based Self-Rewarding Language Models, Zhang et al., arXiv Badge
  • Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation, Baker et al., PDF Badge
  • Rewarding Curse: Analyze and Mitigate Reward Modeling Issues for LLM Reasoning, Li et al., arXiv Badge
  • Enhancing llm reliability via explicit knowledge boundary modeling, Zheng et al., arXiv Badge
  • Style over Substance: Distilled Language Models Reason Via Stylistic Replication, Lippmann et al., arXiv Badge
  • Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning, Wang et al., arXiv Badge
  • Understanding Aha Moments: from External Observations to Internal Mechanisms, Yang et al., arXiv Badge
  • Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead, Balachandran et al., arXiv Badge

Long CoT Evaluations

  • On the measure of intelligence, Chollet et al., arXiv Badge
  • What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams, Jin et al., PDF Badge
  • Training verifiers to solve math word problems, Cobbe et al., arXiv Badge
  • Measuring Mathematical Problem Solving With the MATH Dataset, Hendrycks et al., PDF Badge
  • WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents, Yao et al., PDF Badge
  • Competition-Level Code Generation with AlphaCode, Li et al., arXiv Badge
  • Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering, Lu et al., PDF Badge
  • ScienceWorld: Is your Agent Smarter than a 5th Grader?, Wang et al., PDF Badge
  • ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning, Golovneva et al., PDF Badge
  • A Multi-Modal Neural Geometric Solver with Textual Clauses Parsed from Diagram, Zhang et al., PDF Badge
  • Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, Suzgun et al., PDF Badge
  • Making Language Models Better Reasoners with Step-Aware Verifier, Li et al., PDF Badge
  • Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning, Bao et al., arXiv Badge
  • ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness, Prasad et al., PDF Badge
  • AI for Math or Math for AI? On the Generalization of Learning Mathematical Problem Solving, Zhou et al., PDF Badge
  • OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI, Huang et al., PDF Badge
  • Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning, Gulati et al., PDF Badge
  • Let's verify step by step, Lightman et al., PDF Badge
  • SWE-bench: Can Language Models Resolve Real-world Github Issues?, Jimenez et al., PDF Badge
  • WebArena: A Realistic Web Environment for Building Autonomous Agents, Zhou et al., PDF Badge
  • MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts, Lu et al., PDF Badge
  • Benchmarking large language models on answering and explaining challenging medical questions, Chen et al., arXiv Badge
  • Rewardbench: Evaluating reward models for language modeling, Lambert et al., arXiv Badge
  • How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Huang et al., arXiv Badge
  • Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems, Zhong et al., arXiv Badge
  • Mhpp: Exploring the capabilities and limitations of language models beyond basic code generation, Dai et al., arXiv Badge
  • Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots, Wu et al., arXiv Badge
  • MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs, Zeng et al., PDF Badge
  • CogAgent: A Visual Language Model for GUI Agents, Hong et al., PDF Badge
  • AIME 2024, AI-MO et al., Huggingface Badge
  • AMC 2023, AI-MO et al., Huggingface Badge
  • GPQA: A Graduate-Level Google-Proof Q&A Benchmark, Rein et al., PDF Badge
  • Evaluating LLMs at Detecting Errors in LLM Responses, Kamoi et al., PDF Badge
  • M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought, Chen et al., PDF Badge
  • OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems, He et al., PDF Badge
  • CriticBench: Benchmarking LLMs for Critique-Correct Reasoning, Lin et al., PDF Badge
  • PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns, Chia et al., PDF Badge
  • Can LLMs Solve Molecule Puzzles? A Multimodal Benchmark for Molecular Structure Elucidation, Guo et al., PDF Badge
  • Can llms generate novel research ideas? a large-scale human study with 100+ nlp researchers, Si et al., arXiv Badge
  • MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark, Wang et al., PDF Badge
  • Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., PDF Badge
  • OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, Xie et al., PDF Badge
  • Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset, Wang et al., PDF Badge
  • Mle-bench: Evaluating machine learning agents on machine learning engineering, Chan et al., arXiv Badge
  • EVOLvE: Evaluating and Optimizing LLMs For Exploration, Nie et al., arXiv Badge
  • Judgebench: A benchmark for evaluating llm-based judges, Tan et al., arXiv Badge
  • Errorradar: Benchmarking complex mathematical reasoning of multimodal large language models via error detection, Yan et al., arXiv Badge
  • Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?, Zhang et al., PDF Badge
  • HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks, Zhang et al., arXiv Badge
  • Chain of ideas: Revolutionizing research via novel idea development with llm agents, Li et al., arXiv Badge
  • Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai, Glazer et al., arXiv Badge
  • HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation, Yu et al., arXiv Badge
  • Processbench: Identifying process errors in mathematical reasoning, Zheng et al., arXiv Badge
  • Medec: A benchmark for medical error detection and correction in clinical notes, Abacha et al., arXiv Badge
  • A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges, Yan et al., arXiv Badge
  • CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al., arXiv Badge
  • LiveBench: A Challenging, Contamination-Limited LLM Benchmark, White et al., PDF Badge
  • ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark, Nath et al., arXiv Badge
  • HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI, Pricope et al., arXiv Badge
  • LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, Jain et al., PDF Badge
  • JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models, Chen et al., arXiv Badge
  • Humanity's Last Exam, Phan et al., arXiv Badge
  • MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding, Zuo et al., arXiv Badge
  • PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models, Song et al., arXiv Badge
  • Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks, Wang et al., arXiv Badge
  • CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models, Li et al., PDF Badge
  • ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation, Yang et al., PDF Badge
  • Theoretical Physics Benchmark (TPBench)--a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics, Chung et al., arXiv Badge
  • ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning, Lin et al., arXiv Badge
  • Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring, Heyman et al., arXiv Badge
  • Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models, Yasunaga et al., arXiv Badge
  • CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models, Zhang et al., arXiv Badge
  • PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning, Zhang et al., arXiv Badge
  • Text2World: Benchmarking Large Language Models for Symbolic World Model Generation, Hu et al., arXiv Badge
  • Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios, Wang et al., arXiv Badge
  • DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking, Li et al., arXiv Badge
  • AIME 2025, OpenCompass et al., Huggingface Badge
  • ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning, Huang et al., arXiv Badge
  • MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations, Huang et al., arXiv Badge
  • ProBench: Benchmarking Large Language Models in Competitive Programming, Yang et al., arXiv Badge
  • EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking, Wei et al., arXiv Badge
  • DivIL: Unveiling and Addressing Over-Invariance for Out-of-Distribution Generalization, Wang et al., PDF Badge
  • SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines, Du et al., arXiv Badge
  • DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning, Xu et al., arXiv Badge
  • Evaluating Step-by-step Reasoning Traces: A Survey, Lee et al., arXiv Badge
  • Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges, Shrestha et al., arXiv Badge
  • Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al., arXiv Badge
  • Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?, He et al., arXiv Badge
  • FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving, Chen et al., arXiv Badge
  • WebGames: Challenging General-Purpose Web-Browsing AI Agents, Thomas et al., arXiv Badge
  • VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model, Zheng et al., arXiv Badge
  • Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration, Wang et al., arXiv Badge
  • Generating Symbolic World Models via Test-time Scaling of Large Language Models, Yu et al., arXiv Badge
  • EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges, Wang et al., arXiv Badge
  • Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities, Wang et al., arXiv Badge
  • Large Language Models Penetration in Scholarly Writing and Peer Review, Zhou et al., arXiv Badge
  • Towards an AI co-scientist, Gottweis et al., arXiv Badge
  • Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research, Wu et al., arXiv Badge
  • Open Deep Research, Team et al., Github Badge
  • QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?, Li et al., arXiv Badge
  • Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad, Petrov et al., arXiv Badge
  • Benchmarking Reasoning Robustness in Large Language Models, Yu et al., arXiv Badge
  • From Code to Courtroom: LLMs as the New Software Judges, He et al., arXiv Badge
  • Interacting with AI Reasoning Models: Harnessing "Thoughts" for AI-Driven Software Engineering, Treude et al., arXiv Badge
  • Can Frontier LLMs Replace Annotators in Biomedical Text Mining? Analyzing Challenges and Exploring Solutions, Zhao et al., arXiv Badge
  • An evaluation of DeepSeek Models in Biomedical Natural Language Processing, Zhan et al., arXiv Badge
  • Cognitive-Mental-LLM: Leveraging Reasoning in Large Language Models for Mental Health Prediction via Online Text, Patil et al., arXiv Badge
  • Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models, Zhou et al., arXiv Badge
  • UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning, Lu et al., arXiv Badge
  • Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models, Jia et al., arXiv Badge
  • MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems, Ye et al., arXiv Badge
  • LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?, Tang et al., arXiv Badge
  • Enabling AI Scientists to Recognize Innovation: A Domain-Agnostic Algorithm for Assessing Novelty, Wang et al., arXiv Badge

Deep Reasoning

Deep Reasoning Format

  • Generative language modeling for automated theorem proving, Polu et al., arXiv Badge
  • Multi-step deductive reasoning over natural language: An empirical study on out-of-distribution generalisation, Bao et al., arXiv Badge
  • Reflection of thought: Inversely eliciting numerical reasoning in language models via solving linear systems, Zhou et al., arXiv Badge
  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei et al., PDF Badge
  • Star: Bootstrapping reasoning with reasoning, Zelikman et al., PDF Badge
  • Gpt-4 technical report, Achiam et al., arXiv Badge
  • Mathprompter: Mathematical reasoning using large language models, Imani et al., arXiv Badge
  • Llama 2: Open foundation and fine-tuned chat models, Touvron et al., arXiv Badge
  • PAL: Program-aided Language Models, Gao et al., PDF Badge
  • Code llama: Open foundation models for code, Roziere et al., arXiv Badge
  • Mammoth: Building math generalist models through hybrid instruction tuning, Yue et al., arXiv Badge
  • Tora: A tool-integrated reasoning agent for mathematical problem solving, Gou et al., arXiv Badge
  • Deductive Verification of Chain-of-Thought Reasoning, Ling et al., PDF Badge
  • Mistral 7B, Jiang et al., arXiv Badge
  • Guiding language model reasoning with planning tokens, Wang et al., arXiv Badge
  • Faithful Chain-of-Thought Reasoning, Lyu et al., PDF Badge
  • Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks, Chen et al., PDF Badge
  • Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al., PDF Badge
  • Tinygsm: achieving >80% on gsm8k with small language models, Liu et al., arXiv Badge
  • ChatLogic: Integrating Logic Programming with Large Language Models for Multi-step Reasoning, Wang et al., PDF Badge
  • NuminaMath, Li et al., Huggingface Badge
  • MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models, Yu et al., PDF Badge
  • MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning, Wang et al., PDF Badge
  • DeepSeek-Coder: When the Large Language Model Meets Programming--The Rise of Code Intelligence, Guo et al., arXiv Badge
  • MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Sprague et al., PDF Badge
  • Internlm-math: Open math large language models toward verifiable reasoning, Ying et al., arXiv Badge
  • Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al., arXiv Badge
  • Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes, Chen et al., arXiv Badge
  • Quiet-star: Language models can teach themselves to think before speaking, Zelikman et al., arXiv Badge
  • From explicit cot to implicit cot: Learning to internalize cot step by step, Deng et al., arXiv Badge
  • MathDivide: Improved mathematical reasoning by large language models, Srivastava et al., arXiv Badge
  • Certified Deductive Reasoning with Language Models, Poesia et al., PDF Badge
  • Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models, Xu et al., arXiv Badge
  • OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning, Yu et al., PDF Badge
  • Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models, Chen et al., PDF Badge
  • The llama 3 herd of models, Dubey et al., arXiv Badge
  • Qwen2 Technical Report, Yang et al., arXiv Badge
  • Lean-star: Learning to interleave thinking and proving, Lin et al., arXiv Badge
  • Chain of Code: Reasoning with a Language Model-Augmented Code Emulator, Li et al., PDF Badge
  • Siam: Self-improving code-assisted mathematical reasoning of large language models, Yu et al., arXiv Badge
  • AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought, Zhang et al., PDF Badge
  • Large language models are not strong abstract reasoners, Gendron et al., PDF Badge
  • Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al., arXiv Badge
  • Qwen2.5-coder technical report, Hui et al., arXiv Badge
  • CoMAT: Chain of mathematically annotated thought improves mathematical reasoning, Leang et al., arXiv Badge
  • Planning in Natural Language Improves LLM Search for Code Generation, Wang et al., PDF Badge
  • Formal mathematical reasoning: A new frontier in ai, Yang et al., arXiv Badge
  • Training large language models to reason in a continuous latent space, Hao et al., arXiv Badge
  • SKIntern: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models, Liao et al., PDF Badge
  • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., arXiv Badge
  • CodePlan: Unlocking Reasoning Potential in Large Language Models by Scaling Code-form Planning, Wen et al., PDF Badge
  • Efficient Reasoning with Hidden Thinking, Shen et al., arXiv Badge
  • Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions, Ranaldi et al., arXiv Badge
  • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, Geiping et al., arXiv Badge
  • Reasoning with Latent Thoughts: On the Power of Looped Transformers, Saunshi et al., arXiv Badge
  • CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction, Li et al., arXiv Badge
  • Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments, Payoungkhamdee et al., arXiv Badge
  • Beyond Limited Data: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving, Dong et al., arXiv Badge
  • Theorem Prover as a Judge for Synthetic Data Generation, Leang et al., arXiv Badge
  • Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation, Zhang et al., arXiv Badge
  • LLM Pretraining with Continuous Concepts, Tack et al., arXiv Badge
  • Scalable Language Models with Posterior Inference of Latent Thought Vectors, Kong et al., arXiv Badge
  • Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, Chen et al., arXiv Badge
  • Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences, Chen et al., arXiv Badge
  • Reasoning to Learn from Latent Thoughts, Ruan et al., arXiv Badge

Deep Reasoning Learning

  • Thinking fast and slow with deep learning and tree search, Anthony et al., PDF Badge
  • Training verifiers to solve math word problems, Cobbe et al., arXiv Badge
  • Chain of Thought Imitation with Procedure Cloning, Yang et al., PDF Badge
  • Star: Bootstrapping reasoning with reasoning, Zelikman et al., PDF Badge
  • Large Language Models Are Reasoning Teachers, Ho et al., PDF Badge
  • Llama 2: Open foundation and fine-tuned chat models, Touvron et al., arXiv Badge
  • Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models, Shao et al., PDF Badge
  • Instruction tuning for large language models: A survey, Zhang et al., arXiv Badge
  • Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct, Luo et al., arXiv Badge
  • Reinforced self-training (rest) for language modeling, Gulcehre et al., arXiv Badge
  • Training Chain-of-Thought via Latent-Variable Inference, Hoffman et al., PDF Badge
  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al., PDF Badge
  • Mistral 7B, Jiang et al., arXiv Badge
  • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment, Dong et al., PDF Badge
  • The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning, Kim et al., PDF Badge
  • Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al., arXiv Badge
  • Training large language models for reasoning through reverse curriculum reinforcement learning, Xi et al., arXiv Badge
  • Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes, Chen et al., arXiv Badge
  • Exploring Iterative Enhancement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models, Bao et al., PDF Badge
  • Common 7b language models already possess strong math capabilities, Li et al., arXiv Badge
  • Key-point-driven data synthesis with its enhancement on mathematical reasoning, Huang et al., arXiv Badge
  • Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, Singh et al., PDF Badge
  • The llama 3 herd of models, Dubey et al., arXiv Badge
  • Qwen2 Technical Report, Yang et al., arXiv Badge
  • V-STaR: Training Verifiers for Self-Taught Reasoners, Hosseini et al., PDF Badge
  • ReAct Meets ActRe: Autonomous Annotation of Agent Trajectories for Contrastive Self-Training, Yang et al., PDF Badge
  • Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models, Puerto et al., arXiv Badge
  • Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation, Liu et al., PDF Badge
  • Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al., arXiv Badge
  • Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus, Morishita et al., PDF Badge
  • DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving, Tong et al., PDF Badge
  • AlphaMath Almost Zero: Process Supervision without Process, Chen et al., PDF Badge
  • Iterative Reasoning Preference Optimization, Pang et al., PDF Badge
  • Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs, Zhang et al., PDF Badge
  • On memorization of large language models in logical reasoning, Xie et al., arXiv Badge
  • Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data, Toshniwal et al., arXiv Badge
  • TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees, Liao et al., arXiv Badge
  • Cream: Consistency Regularized Self-Rewarding Language Models, Wang et al., PDF Badge
  • Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning, Wang et al., arXiv Badge
  • O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?, Huang et al., arXiv Badge
  • Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards, Hwang et al., PDF Badge
  • On the impact of fine-tuning on chain-of-thought reasoning, Lobo et al., arXiv Badge
  • Weak-to-Strong Reasoning, Yang et al., PDF Badge
  • System-2 Mathematical Reasoning via Enriched Instruction Tuning, Cai et al., arXiv Badge
  • Acemath: Advancing frontier math reasoning with post-training and reward modeling, Liu et al., arXiv Badge
  • Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems, Min et al., arXiv Badge
  • Openai o1 system card, Jaech et al., arXiv Badge
  • Qwen2.5 technical report, Yang et al., arXiv Badge
  • OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning, Zhang et al., arXiv Badge
  • Proposing and solving olympiad geometry with guided tree search, Zhang et al., arXiv Badge
  • Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, Bansal et al., PDF Badge
  • Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, Wang et al., arXiv Badge
  • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., arXiv Badge
  • Sft memorizes, rl generalizes: A comparative study of foundation model post-training, Chu et al., arXiv Badge
  • Advancing Math Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages, Chen et al., arXiv Badge
  • Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training, Yuan et al., arXiv Badge
  • s1: Simple test-time scaling, Muennighoff et al., arXiv Badge
  • RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?, Xu et al., arXiv Badge
  • Sky-T1: Train your own O1 preview model within $450, Team et al., Github Badge
  • Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation, Labs et al., Other Source Badge
  • Think Less, Achieve More: Cut Reasoning Costs by 50% Without Sacrificing Accuracy, Team et al., Github Badge
  • Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search, Li et al., PDF Badge
  • FastMCTS: A Simple Sampling Strategy for Data Synthesis, Li et al., arXiv Badge
  • LLMs Can Teach Themselves to Better Predict the Future, Turtel et al., arXiv Badge
  • Policy Guided Tree Search for Enhanced LLM Reasoning, Li et al., arXiv Badge
  • Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls, Wang et al., arXiv Badge
  • SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers, Li et al., arXiv Badge
  • Distillation Scaling Laws, Busbridge et al., arXiv Badge
  • Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization, Yao et al., arXiv Badge
  • CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers, Le et al., arXiv Badge
  • Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning, Chen et al., arXiv Badge
  • Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision, Zhu et al., arXiv Badge
  • Demystifying Long Chain-of-Thought Reasoning in LLMs, Yeo et al., arXiv Badge
  • LIMO: Less is More for Reasoning, Ye et al., arXiv Badge
  • Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment, Li et al., arXiv Badge
  • BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation, Pang et al., arXiv Badge
  • PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models, Zhao et al., arXiv Badge
  • Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners, Peng et al., arXiv Badge
  • Process-based Self-Rewarding Language Models, Zhang et al., arXiv Badge
  • Entropy-Based Adaptive Weighting for Self-Training, Wang et al., arXiv Badge
  • Entropy-based Exploration Conduction for Multi-step Reasoning, Zhang et al., arXiv Badge
  • OpenCodeReasoning: Advancing Data Distillation for Competitive Coding, Ahmad et al., arXiv Badge

Feasible Reflection

Feedback

  • Learning to summarize with human feedback, Stiennon et al., PDF Badge
  • Training verifiers to solve math word problems, Cobbe et al., arXiv Badge
  • Self-critiquing models for assisting human evaluators, Saunders et al., arXiv Badge
  • Language models (mostly) know what they know, Kadavath et al., arXiv Badge
  • Star: Bootstrapping reasoning with reasoning, Zelikman et al., PDF Badge
  • Solving math word problems with process- and outcome-based feedback, Uesato et al., arXiv Badge
  • Constitutional AI: Harmlessness from AI Feedback, Bai et al., arXiv Badge
  • ReAct: Synergizing Reasoning and Acting in Language Models, Yao et al., PDF Badge
  • Gpt-4 technical report, Achiam et al., arXiv Badge
  • Palm 2 technical report, Anil et al., arXiv Badge
  • Critic: Large language models can self-correct with tool-interactive critiquing, Gou et al., arXiv Badge
  • Contrastive learning with logic-driven data augmentation for logical reasoning over text, Bao et al., arXiv Badge
  • Self-verification improves few-shot clinical information extraction, Gero et al., PDF Badge
  • LEVER: Learning to Verify Language-to-Code Generation with Execution, Ni et al., PDF Badge
  • Llama 2: Open foundation and fine-tuned chat models, Touvron et al., arXiv Badge
  • Reinforced self-training (rest) for language modeling, Gulcehre et al., arXiv Badge
  • Shepherd: A critic for language model generation, Wang et al., arXiv Badge
  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al., PDF Badge
  • Mistral 7B, Jiang et al., arXiv Badge
  • Let's reward step by step: Step-Level reward model as the Navigators for Reasoning, Ma et al., arXiv Badge
  • Camels in a changing climate: Enhancing lm adaptation with tulu 2, Ivison et al., arXiv Badge
  • Towards Mitigating LLM Hallucination via Self Reflection, Ji et al., PDF Badge
  • Reasoning with Language Model is Planning with World Model, Hao et al., PDF Badge
  • Large Language Models are Better Reasoners with Self-Verification, Weng et al., PDF Badge
  • Reflexion: language agents with verbal reinforcement learning, Shinn et al., PDF Badge
  • Large Language Models Cannot Self-Correct Reasoning Yet, Huang et al., PDF Badge
  • Let's verify step by step, Lightman et al., PDF Badge
  • Mixtral of experts, Jiang et al., arXiv Badge
  • Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification, Zhou et al., PDF Badge
  • SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning, Miao et al., PDF Badge
  • Deepseek llm: Scaling open-source language models with longtermism, Bi et al., arXiv Badge
  • Llemma: An Open Language Model for Mathematics, Azerbayev et al., PDF Badge
  • Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al., arXiv Badge
  • VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search, Brandfonbrener et al., arXiv Badge
  • Can We Verify Step by Step for Incorrect Answer Detection?, Xu et al., arXiv Badge
  • Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, Team et al., arXiv Badge
  • Internlm2 technical report, Cai et al., arXiv Badge
  • Rewardbench: Evaluating reward models for language modeling, Lambert et al., arXiv Badge
  • Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models, Hu et al., PDF Badge
  • Evaluating Mathematical Reasoning Beyond Accuracy, Xia et al., arXiv Badge
  • Monte carlo tree search boosts reasoning via iterative preference learning, Xie et al., arXiv Badge
  • Improving reward models with synthetic critiques, Ye et al., arXiv Badge
  • Self-reflection in llm agents: Effects on problem-solving performance, Renze et al., arXiv Badge
  • Rlhf workflow: From reward modeling to online rlhf, Dong et al., arXiv Badge
  • Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models, Xu et al., arXiv Badge
  • Nemotron-4 340b technical report, Adler et al., arXiv Badge
  • OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning, Yu et al., PDF Badge
  • The Reason behind Good or Bad: Towards a Better Mathematical Verifier with Natural Language Feedback, Gao et al., arXiv Badge
  • Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback, Gao et al., arXiv Badge
  • Step-dpo: Step-wise preference optimization for long-chain reasoning of llms, Lai et al., arXiv Badge
  • Llm critics help catch llm bugs, McAleese et al., arXiv Badge
  • LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, Hao et al., PDF Badge
  • Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models, Lee et al., arXiv Badge
  • Tlcr: Token-level continuous reward for fine-grained reinforcement learning from human feedback, Yoon et al., arXiv Badge
  • The llama 3 herd of models, Dubey et al., arXiv Badge
  • Mistral-NeMo-12B-Instruct, Team et al., Huggingface Badge
  • OffsetBias: Leveraging Debiased Data for Tuning Evaluators, Park et al., arXiv Badge
  • Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution, Fernando et al., PDF Badge
  • ReFT: Reasoning with Reinforced Fine-Tuning, Trung et al., PDF Badge
  • Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives, Zhang et al., PDF Badge
  • Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations, Wang et al., PDF Badge
  • Selective Preference Optimization via Token-Level Reward Function Estimation, Yang et al., arXiv Badge
  • When is Tree Search Useful for LLM Planning? It Depends on the Discriminator, Chen et al., PDF Badge
  • Self-taught evaluators, Wang et al., arXiv Badge
  • Gemma 2: Improving open language models at a practical size, Team et al., arXiv Badge
  • Generative verifiers: Reward modeling as next-token prediction, Zhang et al., arXiv Badge
  • OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement, Zheng et al., PDF Badge
  • Critic-cot: Boosting the reasoning abilities of large language model via chain-of-thoughts critic, Zheng et al., arXiv Badge
  • Small Language Models Need Strong Verifiers to Self-Correct Reasoning, Zhang et al., PDF Badge
  • Abstract Meaning Representation-Based Logic-Driven Data Augmentation for Logical Reasoning, Bao et al., PDF Badge
  • Reasoning in Flux: Enhancing Large Language Models Reasoning through Uncertainty-aware Adaptive Guidance, Yin et al., PDF Badge
  • Direct Judgement Preference Optimization, Wang et al., arXiv Badge
  • Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback, Ivison et al., PDF Badge
  • HelpSteer 2: Open-source dataset for training top-performing reward models, Wang et al., PDF Badge
  • Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al., arXiv Badge
  • Critique-out-Loud Reward Models, Ankner et al., PDF Badge
  • Skywork-reward: Bag of tricks for reward modeling in llms, Liu et al., arXiv Badge
  • On designing effective rl reward at training time for llm reasoning, Gao et al., arXiv Badge
  • Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up, Yuan et al., arXiv Badge
  • Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment, Kazemnejad et al., arXiv Badge
  • Self-generated critiques boost reward modeling for language models, Yu et al., arXiv Badge
  • Advancing Process Verification for Large Language Models via Tree-Based Preference Learning, He et al., PDF Badge
  • From generation to judgment: Opportunities and challenges of llm-as-a-judge, Li et al., arXiv Badge
  • Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering, Guan et al., arXiv Badge
  • Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models, Kim et al., PDF Badge
  • Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation, Vu et al., PDF Badge
  • Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts, Wang et al., PDF Badge
  • Step-level Value Preference Optimization for Mathematical Reasoning, Chen et al., PDF Badge
  • Skywork-o1 open series, Team et al., Huggingface Badge
  • Entropy-Regularized Process Reward Model, Zhang et al., arXiv Badge
  • Llms-as-judges: a comprehensive survey on llm-based evaluation methods, Li et al., arXiv Badge
  • Lmunit: Fine-grained evaluation with natural language unit tests, Saad-Falcon et al., arXiv Badge
  • o1-coder: an o1 replication for coding, Zhang et al., arXiv Badge
  • Hunyuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving, Li et al., arXiv Badge
  • Acemath: Advancing frontier math reasoning with post-training and reward modeling, Liu et al., arXiv Badge
  • Free process rewards without process labels, Yuan et al., arXiv Badge
  • AutoPSV: Automated Process-Supervised Verifier, Lu et al., PDF Badge
  • Processbench: Identifying process errors in mathematical reasoning, Zheng et al., arXiv Badge
  • Qwen2.5 technical report, Yang et al., arXiv Badge
  • Openai o1 system card, Jaech et al., arXiv Badge
  • Outcome-Refining Process Supervision for Code Generation, Yu et al., arXiv Badge
  • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., arXiv Badge
  • Dynamic Scaling of Unit Tests for Code Reward Modeling, Ma et al., arXiv Badge
  • What Makes Large Language Models Reason in (Multi-Turn) Code Generation?, Zheng et al., PDF Badge
  • Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models, Liu et al., arXiv Badge
  • Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge, Saha et al., arXiv Badge
  • Scaling Autonomous Agents via Automatic Reward Modeling And Planning, Chen et al., PDF Badge
  • Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework, Sun et al., arXiv Badge
  • The lessons of developing process reward models in mathematical reasoning, Zhang et al., arXiv Badge
  • Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback, Lin et al., arXiv Badge
  • Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems, Ye et al., PDF Badge
  • Advancing LLM Reasoning Generalists with Preference Trees, Yuan et al., PDF Badge
  • Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning, Setlur et al., PDF Badge
  • Enabling Scalable Oversight via Self-Evolving Critic, Tang et al., arXiv Badge
  • PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models, Song et al., arXiv Badge
  • Zero-Shot Verification-guided Chain of Thoughts, Chowdhury et al., arXiv Badge
  • Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation, Xie et al., arXiv Badge
  • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al., arXiv Badge
  • SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning, Ma et al., arXiv Badge
  • Training an LLM-as-a-Judge Model: Pipeline, Insights, and Practical Lessons, Hu et al., arXiv Badge
  • Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models, Gu et al., arXiv Badge
  • Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?, Zhang et al., arXiv Badge
  • Adaptivestep: Automatically dividing reasoning step through model confidence, Liu et al., arXiv Badge
  • Unveiling and Causalizing CoT: A Causal Perspective, Fu et al., arXiv Badge
  • Diverse Inference and Verification for Advanced Reasoning, Drori et al., arXiv Badge
  • Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges, Shrestha et al., arXiv Badge
  • A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics, Wei et al., arXiv Badge
  • Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models, Zhou et al., arXiv Badge
  • ACECODER: Acing Coder RL via Automated Test-Case Synthesis, Zeng et al., arXiv Badge
  • A Study on Leveraging Search and Self-Feedback for Agent Reasoning, Yuan et al., arXiv Badge
  • RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation, Zhou et al., arXiv Badge
  • Process Reward Models for LLM Agents: Practical Framework and Directions, Choudhury et al., arXiv Badge
  • Process reinforcement through implicit rewards, Cui et al., arXiv Badge
  • Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning, Xu et al., arXiv Badge
  • VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data, Zeng et al., arXiv Badge
  • Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values, Zhang et al., arXiv Badge
  • Teaching Language Models to Critique via Reinforcement Learning, Xie et al., arXiv Badge
  • Uncertainty-Aware Search and Value Models: Mitigating Search Scaling Flaws in LLMs, Yu et al., arXiv Badge
  • AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification, Tan et al., arXiv Badge
  • Dyve: Thinking Fast and Slow for Dynamic Process Verification, Zhong et al., arXiv Badge
  • Uncertainty-Aware Step-wise Verification with Generative Reward Models, Ye et al., arXiv Badge
  • Visualprm: An effective process reward model for multimodal reasoning, Wang et al., arXiv Badge
  • Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation, Baker et al., PDF Badge
  • An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning, Sun et al., arXiv Badge
  • JudgeLRM: Large Reasoning Models as a Judge, Chen et al., arXiv Badge
  • QwQ: Reflect Deeply on the Boundaries of the Unknown, Team et al., Github Badge

Refinement

  • Self-critiquing models for assisting human evaluators, Saunders et al., arXiv Badge
  • Self-Refine: Iterative Refinement with Self-Feedback, Madaan et al., PDF Badge
  • Grace: Discriminator-guided chain-of-thought reasoning, Khalifa et al., arXiv Badge
  • Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies, Pan et al., arXiv Badge
  • Learning from mistakes makes llm better reasoner, An et al., arXiv Badge
  • Reflection-tuning: Data recycling improves llm instruction-tuning, Li et al., arXiv Badge
  • Reflexion: language agents with verbal reinforcement learning, Shinn et al., PDF Badge
  • Towards Mitigating LLM Hallucination via Self Reflection, Ji et al., PDF Badge
  • SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning, Miao et al., PDF Badge
  • Teaching Large Language Models to Self-Debug, Chen et al., PDF Badge
  • Learning to check: Unleashing potentials for self-correction in large language models, Zhang et al., arXiv Badge
  • REFINER: Reasoning Feedback on Intermediate Representations, Paul et al., PDF Badge
  • GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements, Havrilla et al., PDF Badge
  • Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic, Zhao et al., PDF Badge
  • General purpose verification for chain of thought prompting, Vacareanu et al., arXiv Badge
  • Enhancing visual-language modality alignment in large vision language models via self-improvement, Wang et al., arXiv Badge
  • Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback, Gao et al., arXiv Badge
  • Large language models have intrinsic self-correction ability, Liu et al., arXiv Badge
  • Progressive-Hint Prompting Improves Reasoning in Large Language Models, Zheng et al., PDF Badge
  • Accessing gpt-4 level mathematical olympiad solutions via monte carlo tree self-refine with llama-3 8b, Zhang et al., arXiv Badge
  • Toward Adaptive Reasoning in Large Language Models with Thought Rollback, Chen et al., PDF Badge
  • CoT Rerailer: Enhancing the Reliability of Large Language Models in Complex Reasoning Tasks through Error Detection and Correction, Wan et al., arXiv Badge
  • Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement, Xu et al., PDF Badge
  • Mutual reasoning makes smaller llms stronger problem-solvers, Qi et al., arXiv Badge
  • S³c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners, Yan et al., arXiv Badge
  • Training language models to self-correct via reinforcement learning, Kumar et al., arXiv Badge
  • A Theoretical Understanding of Self-Correction through In-context Alignment, Wang et al., PDF Badge
  • ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search, Zhang et al., PDF Badge
  • Recursive Introspection: Teaching Language Model Agents How to Self-Improve, Qu et al., PDF Badge
  • Enhancing Mathematical Reasoning in LLMs by Stepwise Correction, Wu et al., arXiv Badge
  • LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints, Ferraz et al., PDF Badge
  • O1 Replication Journey: A Strategic Progress Report--Part 1, Qin et al., arXiv Badge
  • Advancing Large Language Model Attribution through Self-Improving, Huang et al., PDF Badge
  • Enhancing llm reasoning via critique models with test-time and training-time supervision, Xi et al., arXiv Badge
  • Vision-language models can self-improve reasoning via reflection, Cheng et al., arXiv Badge
  • Confidence vs Critique: A Decomposition of Self-Correction Capability for LLMs, Yang et al., arXiv Badge
  • LLM2: Let Large Language Models Harness System 2 Reasoning, Yang et al., arXiv Badge
  • Understanding the Dark Side of LLMs' Intrinsic Self-Correction, Zhang et al., arXiv Badge
  • Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents, He et al., arXiv Badge
  • CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis, Zhang et al., arXiv Badge
  • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., arXiv Badge
  • 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient, Zeng et al., Notion Badge
  • BackMATH: Towards Backward Reasoning for Solving Math Problems Step by Step, Zhang et al., PDF Badge
  • ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding, Sun et al., arXiv Badge
  • Critique fine-tuning: Learning to critique is more effective than learning to imitate, Wang et al., arXiv Badge
  • RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques, Tang et al., arXiv Badge
  • ProgCo: Program Helps Self-Correction of Large Language Models, Song et al., arXiv Badge
  • URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics, Luo et al., arXiv Badge
  • S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning, Ma et al., arXiv Badge
  • ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates, Yang et al., arXiv Badge
  • ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification, Lee et al., arXiv Badge
  • Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models, Yang et al., arXiv Badge
  • Iterative Deepening Sampling for Large Language Models, Chen et al., arXiv Badge
  • LLMs Can Easily Learn to Reason from Demonstrations. Structure, not content, is what matters!, Li et al., arXiv Badge
  • MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification, Sun et al., arXiv Badge
  • ARIES: Stimulating Self-Refinement of Large Language Models by Iterative Preference Optimization, Zeng et al., arXiv Badge
  • Optimizing generative AI by backpropagating language model feedback, Yuksekgonul et al., PDF Badge
  • DLPO: Towards a Robust, Efficient, and Generalizable Prompt Optimization Framework from a Deep-Learning Perspective, Peng et al., arXiv Badge
  • Instruct-of-Reflection: Enhancing Large Language Models Iterative Reflection Capabilities via Dynamic-Meta Instruction, Liu et al., arXiv Badge
  • The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement, Yang et al., arXiv Badge

Extensive Exploration

Exploration Scaling

  • Greedy Policy Search: A Simple Baseline for Learnable Test-Time Augmentation, Lyzhov et al., PDF Badge
  • Scaling scaling laws with board games, Jones et al., arXiv Badge
  • Show Your Work: Scratchpads for Intermediate Computation with Language Models, Nye et al., PDF Badge
  • Complexity-Based Prompting for Multi-step Reasoning, Fu et al., PDF Badge
  • Self-Consistency Improves Chain of Thought Reasoning in Language Models, Wang et al., PDF Badge
  • Making Language Models Better Reasoners with Step-Aware Verifier, Li et al., PDF Badge
  • Deductive Verification of Chain-of-Thought Reasoning, Ling et al., PDF Badge
  • Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al., PDF Badge
  • Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization, Zhou et al., PDF Badge
  • Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision, Wang et al., arXiv Badge
  • Stepwise self-consistent mathematical reasoning with large language models, Zhao et al., arXiv Badge
  • General purpose verification for chain of thought prompting, Vacareanu et al., arXiv Badge
  • Improve Mathematical Reasoning in Language Models by Automated Process Supervision, Luo et al., arXiv Badge
  • Large language monkeys: Scaling inference compute with repeated sampling, Brown et al., arXiv Badge
  • Scaling llm test-time compute optimally can be more effective than scaling model parameters, Snell et al., arXiv Badge
  • Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, Wu et al., arXiv Badge
  • Learning to Reason via Program Generation, Emulation, and Search, Weir et al., PDF Badge
  • What are the essential factors in crafting effective long context multi-hop instruction datasets? insights and best practices, Chen et al., arXiv Badge
  • MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning, Chen et al., arXiv Badge
  • Scaling llm inference with optimized sample compute allocation, Zhang et al., arXiv Badge
  • Rlef: Grounding code llms in execution feedback with reinforcement learning, Gehring et al., arXiv Badge
  • Planning in Natural Language Improves LLM Search for Code Generation, Wang et al., PDF Badge
  • Beyond examples: High-level automated reasoning paradigm in in-context learning via mcts, Wu et al., arXiv Badge
  • Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts, Luo et al., PDF Badge
  • From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Welleck et al., PDF Badge
  • From medprompt to o1: Exploration of run-time strategies for medical challenge problems and beyond, Nori et al., arXiv Badge
  • Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information, Zhang et al., PDF Badge
  • A simple and provable scaling law for the test-time compute of large language models, Chen et al., arXiv Badge
  • Openai o1 system card, Jaech et al., arXiv Badge
  • Lachesis: Predicting LLM Inference Accuracy using Structural Properties of Reasoning Paths, Kim et al., arXiv Badge
  • Seed-cts: Unleashing the power of tree search for superior performance in competitive coding tasks, Wang et al., arXiv Badge
  • Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving, AbdElhameed et al., arXiv Badge
  • s1: Simple test-time scaling, Muennighoff et al., arXiv Badge
  • From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning, Li et al., arXiv Badge
  • Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective, Yu et al., arXiv Badge
  • Test-time Computing: from System-1 Thinking to System-2 Thinking, Ji et al., arXiv Badge
  • SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling, Chen et al., arXiv Badge
  • Instantiation-based Formalization of Logical Reasoning Tasks using Language Models and Logical Solvers, Raza et al., arXiv Badge
  • The lessons of developing process reward models in mathematical reasoning, Zhang et al., arXiv Badge
  • ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning, Yu et al., PDF Badge
  • ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al., arXiv Badge
  • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, Geiping et al., arXiv Badge
  • Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, Chen et al., arXiv Badge
  • Scalable Best-of-N Selection for Large Language Models via Self-Certainty, Kang et al., arXiv Badge
  • Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling, Liu et al., arXiv Badge
  • Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?, Zeng et al., arXiv Badge
  • Optimizing Temperature for Language Models with Multi-Sample Inference, Du et al., arXiv Badge
  • Bag of Tricks for Inference-time Computation of LLM Reasoning, Liu et al., arXiv Badge
  • Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, Yang et al., arXiv Badge
  • (Mis)Fitting: A Survey of Scaling Laws, Li et al., arXiv Badge
  • METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling, Li et al., arXiv Badge
  • Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment, Li et al., arXiv Badge
  • Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification, Zhao et al., arXiv Badge
  • TestNUC: Enhancing Test-Time Computing Approaches through Neighboring Unlabeled Data Consistency, Zou et al., arXiv Badge
  • Confidence Improves Self-Consistency in LLMs, Taubenfeld et al., arXiv Badge
  • S*: Test Time Scaling for Code Generation, Li et al., arXiv Badge
  • Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning, Zhou et al., arXiv Badge
  • Is Depth All You Need? An Exploration of Iterative Reasoning in LLMs, Wu et al., arXiv Badge
  • Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking, Tian et al., arXiv Badge
  • What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models, Zhang et al., arXiv Badge
  • Metascale: Test-time scaling with evolving meta-thoughts, Liu et al., arXiv Badge
  • Multidimensional Consistency Improves Reasoning in Language Models, Lai et al., arXiv Badge
  • Efficient test-time scaling via self-calibration, Huang et al., arXiv Badge
  • Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scale Test-Time Compute, Chen et al., arXiv Badge
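
The common thread in the exploration-scaling work listed above is to sample many candidate reasoning paths at test time and aggregate them. As a rough, self-contained illustration of that repeated-sampling idea (not the method of any single paper here), the Python sketch below implements majority-vote self-consistency over a stubbed sampling function; in practice `sample_chain` would call an LLM with temperature sampling and parse out the final answer.

```python
# Minimal sketch of self-consistency / repeated sampling at test time.
# `sample_chain` is a hypothetical stand-in for an LLM call that returns
# a final answer; here it is stubbed with simulated noisy answers.
import random
from collections import Counter

def sample_chain(question: str, temperature: float = 0.8) -> str:
    """Stub: in practice, sample a reasoning chain from an LLM and extract the answer."""
    return random.choice(["42", "42", "41"])  # simulated noisy answers

def self_consistency(question: str, n_samples: int = 16) -> str:
    """Sample n chains independently and return the majority-vote answer."""
    answers = [sample_chain(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    print(self_consistency("What is 6 * 7?"))
```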

External Exploration

  • Competition-Level Code Generation with AlphaCode, Li et al., arXiv Badge
  • Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, Zhou et al., PDF Badge
  • Llama: Open and efficient foundation language models, Touvron et al., arXiv Badge
  • Gpt-4 technical report, Achiam et al., arXiv Badge
  • Llama 2: Open foundation and fine-tuned chat models, Touvron et al., arXiv Badge
  • Code llama: Open foundation models for code, Roziere et al., arXiv Badge
  • Self-Evaluation Guided Beam Search for Reasoning, Xie et al., PDF Badge
  • No train still gain. unleash mathematical reasoning of large language models with monte carlo tree search guided by energy function, Xu et al., arXiv Badge
  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al., PDF Badge
  • Tora: A tool-integrated reasoning agent for mathematical problem solving, Gou et al., arXiv Badge
  • Mistral 7B, Jiang et al., arXiv Badge
  • Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning, Wang et al., arXiv Badge
  • PATHFINDER: Guided Search over Multi-Step Reasoning Paths, Golovneva et al., PDF Badge
  • Reflexion: language agents with verbal reinforcement learning, Shinn et al., PDF Badge
  • Reasoning with Language Model is Planning with World Model, Hao et al., PDF Badge
  • The claude 3 model family: Opus, sonnet, haiku, Anthropic et al., PDF Badge
  • NuminaMath, LI et al., Huggingface Badge
  • Demystifying chains, trees, and graphs of thoughts, Besta et al., arXiv Badge
  • MARIO: MAth Reasoning with code Interpreter Output--A Reproducible Pipeline, Liao et al., arXiv Badge
  • Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al., arXiv Badge
  • Mathgenie: Generating synthetic data with question back-translation for enhancing mathematical reasoning of llms, Lu et al., arXiv Badge
  • Graph of Thoughts: Solving Elaborate Problems with Large Language Models, Besta et al., PDF Badge
  • Tree of Uncertain Thoughts Reasoning for Large Language Models, Mo et al., PDF Badge
  • Mindstar: Enhancing math reasoning in pre-trained llms at inference time, Kang et al., arXiv Badge
  • Mapcoder: Multi-agent code generation for competitive problem solving, Islam et al., arXiv Badge
  • AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training, Wan et al., PDF Badge
  • Monte carlo tree search boosts reasoning via iterative preference learning, Xie et al., arXiv Badge
  • Accessing gpt-4 level mathematical olympiad solutions via monte carlo tree self-refine with llama-3 8b, Zhang et al., arXiv Badge
  • Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search, Light et al., PDF Badge
  • Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning, Zhu et al., PDF Badge
  • Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping, Lehnert et al., PDF Badge
  • Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding, Liu et al., PDF Badge
  • Qwen2 Technical Report, Yang et al., arXiv Badge
  • The llama 3 herd of models, Dubey et al., arXiv Badge
  • Litesearch: Efficacious tree search for llm, Wang et al., arXiv Badge
  • LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, Hao et al., PDF Badge
  • Tree search for language model agents, Koh et al., arXiv Badge
  • GraphReason: Enhancing Reasoning Capabilities of Large Language Models through A Graph-Based Verification Approach, Cao et al., PDF Badge
  • Agent q: Advanced reasoning and learning for autonomous ai agents, Putta et al., arXiv Badge
  • Making PPO even better: Value-Guided Monte-Carlo Tree Search decoding, Liu et al., PDF Badge
  • Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing, Tian et al., PDF Badge
  • On the diagram of thought, Zhang et al., arXiv Badge
  • Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al., arXiv Badge
  • AlphaMath Almost Zero: Process Supervision without Process, Chen et al., PDF Badge
  • RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation, Li et al., arXiv Badge
  • Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al., PDF Badge
  • Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning, Zhang et al., arXiv Badge
  • Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination, Chen et al., arXiv Badge
  • Treebon: Enhancing inference-time alignment with speculative tree-search and best-of-n sampling, Qiu et al., arXiv Badge
  • Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning, Wang et al., arXiv Badge
  • Aflow: Automating agentic workflow generation, Zhang et al., arXiv Badge
  • Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models, Wang et al., arXiv Badge
  • Deliberate reasoning for llms as structure-aware planning with accurate world model, Xiong et al., arXiv Badge
  • Enhancing multi-step reasoning abilities of language models through direct q-function optimization, Liu et al., arXiv Badge
  • Process reward model with q-value rankings, Li et al., arXiv Badge
  • Scattered Forest Search: Smarter Code Space Exploration with LLMs, Light et al., arXiv Badge
  • AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning, Xiang et al., arXiv Badge
  • On the Empirical Complexity of Reasoning and Planning in LLMs, Kang et al., PDF Badge
  • CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models, Li et al., arXiv Badge
  • Technical report: Enhancing llm reasoning with reward-guided tree search, Jiang et al., arXiv Badge
  • SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Enhanced Code Generation, Xu et al., arXiv Badge
  • Marco-o1: Towards open reasoning models for open-ended solutions, Zhao et al., arXiv Badge
  • GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection, Kadam et al., arXiv Badge
  • MC-NEST--Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree, Rabby et al., arXiv Badge
  • SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models, Cheng et al., arXiv Badge
  • Forest-of-thought: Scaling test-time compute for enhancing LLM reasoning, Bi et al., arXiv Badge
  • Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search, Yao et al., arXiv Badge
  • Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation and Execution in Complex Task Handling, Ni et al., arXiv Badge
  • LLM2: Let Large Language Models Harness System 2 Reasoning, Yang et al., arXiv Badge
  • Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning, Jiang et al., arXiv Badge
  • Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning, Park et al., arXiv Badge
  • rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking, Guan et al., arXiv Badge
  • Evolving Deeper LLM Thinking, Lee et al., arXiv Badge
  • A Roadmap to Guide the Integration of LLMs in Hierarchical Planning, Puerta-Merino et al., arXiv Badge
  • BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning, Zhang et al., arXiv Badge
  • Monte Carlo Tree Search for Comprehensive Exploration in LLM-Based Automatic Heuristic Design, Zheng et al., arXiv Badge
  • Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning, Lin et al., arXiv Badge
  • A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods, Puri et al., arXiv Badge
  • Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models, Kim et al., arXiv Badge
  • SIFT: Grounding LLM Reasoning in Contexts via Stickers, Zeng et al., arXiv Badge
  • Atom of Thoughts for Markov LLM Test-Time Scaling, Teng et al., arXiv Badge
  • Reasoning with Reinforced Functional Token Tuning, Zhang et al., arXiv Badge
  • CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning, Pan et al., arXiv Badge
  • MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning, Park et al., arXiv Badge
  • QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search, Lin et al., arXiv Badge
  • CritiQ: Mining Data Quality Criteria from Human Preferences, Guo et al., arXiv Badge
  • START: Self-taught Reasoner with Tools, Li et al., arXiv Badge
  • Better Process Supervision with Bi-directional Rewarding Signals, Chen et al., arXiv Badge
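
Most entries above instantiate some form of verifier- or reward-guided search over partial reasoning chains, whether beam search, MCTS, or best-of-N over trees. As a hedged illustration of that general pattern (a simple best-first search with hypothetical `expand` and `score` stubs, not the algorithm of any particular paper), consider the sketch below; a real system would have `expand` sample next steps from an LLM and `score` query a process reward model or verifier.

```python
# Minimal best-first ("external") search over partial reasoning chains.
# `expand` and `score` are illustrative stubs, not a real LLM or verifier.
import heapq
from itertools import count

def expand(chain: list[str]) -> list[str]:
    """Stub: propose candidate next reasoning steps for a partial chain."""
    return [f"step-{len(chain)}-a", f"step-{len(chain)}-b"]

def score(chain: list[str]) -> float:
    """Stub: higher is better; a real system would use a verifier or PRM."""
    return -len(chain)  # placeholder heuristic

def best_first_search(max_depth: int = 3, beam: int = 4) -> list[str]:
    tie = count()  # tie-breaker so heapq never compares chains directly
    frontier = [(-score([]), next(tie), [])]  # max-heap via negated scores
    while frontier:
        _, _, chain = heapq.heappop(frontier)
        if len(chain) >= max_depth:
            return chain
        children = sorted(expand(chain), key=lambda s: score(chain + [s]), reverse=True)
        for step in children[:beam]:
            new_chain = chain + [step]
            heapq.heappush(frontier, (-score(new_chain), next(tie), new_chain))
    return []

if __name__ == "__main__":
    print(best_first_search())
```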

Internal Exploration

  • Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al., PDF Badge
  • Proximal policy optimization algorithms, Schulman et al., arXiv Badge
  • Training verifiers to solve math word problems, Cobbe et al., arXiv Badge
  • Direct preference optimization: Your language model is secretly a reward model, Rafailov et al., PDF Badge
  • Gpt-4 technical report, Achiam et al., arXiv Badge
  • The claude 3 model family: Opus, sonnet, haiku, Anthropic et al., PDF Badge
  • Deepseekmath: Pushing the limits of mathematical reasoning in open language models, Shao et al., arXiv Badge
  • Kto: Model alignment as prospect theoretic optimization, Ethayarajh et al., arXiv Badge
  • Stepcoder: Improve code generation with reinforcement learning from compiler feedback, Dou et al., arXiv Badge
  • Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, Singh et al., PDF Badge
  • ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, Li et al., PDF Badge
  • AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training, Wan et al., PDF Badge
  • Chatglm: A family of large language models from glm-130b to glm-4 all tools, GLM et al., arXiv Badge
  • The llama 3 herd of models, Dubey et al., arXiv Badge
  • RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold, Setlur et al., PDF Badge
  • CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks, Wang et al., arXiv Badge
  • Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback, Ivison et al., PDF Badge
  • Qwen2.5-coder technical report, Hui et al., arXiv Badge
  • Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, Yang et al., arXiv Badge
  • Building math agents with multi-turn iterative preference learning, Xiong et al., arXiv Badge
  • Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., PDF Badge
  • ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search, Zhang et al., PDF Badge
  • A Small Step Towards Reproducing OpenAI o1: Progress Report on the Steiner Open Source Models, Ji et al., PDF Badge
  • A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications, Xiao et al., arXiv Badge
  • OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data, Toshniwal et al., arXiv Badge
  • Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability, Lin et al., arXiv Badge
  • Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization, Liu et al., arXiv Badge
  • o1-coder: an o1 replication for coding, Zhang et al., arXiv Badge
  • Offline Reinforcement Learning for LLM Multi-Step Reasoning, Wang et al., arXiv Badge
  • Qwen2.5 technical report, Yang et al., arXiv Badge
  • Deepseek-v3 technical report, Liu et al., arXiv Badge
  • Openai o1 system card, Jaech et al., arXiv Badge
  • Sft memorizes, rl generalizes: A comparative study of foundation model post-training, Chu et al., arXiv Badge
  • REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models, Hu et al., arXiv Badge
  • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, Guo et al., arXiv Badge
  • Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling, Hou et al., arXiv Badge
  • Diverse Preference Optimization, Lanchantin et al., arXiv Badge
  • COS (M+ O) S: Curiosity and RL-Enhanced MCTS for Exploring Story Space via Language Models, Materzok et al., arXiv Badge
  • 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient, Zeng et al., Notion Badge
  • Search-o1: Agentic search-enhanced large reasoning models, Li et al., arXiv Badge
  • rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking, Guan et al., arXiv Badge
  • Kimi k1.5: Scaling reinforcement learning with llms, Team et al., arXiv Badge
  • Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search, Shen et al., arXiv Badge
  • Demystifying Long Chain-of-Thought Reasoning in LLMs, Yeo et al., arXiv Badge
  • LIMR: Less is More for RL Scaling, Li et al., arXiv Badge
  • Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning, Vassoyan et al., arXiv Badge
  • Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance, Huang et al., arXiv Badge
  • Process reinforcement through implicit rewards, Cui et al., arXiv Badge
  • SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin, Yi et al., arXiv Badge
  • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al., arXiv Badge
  • Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, Yang et al., arXiv Badge
  • Training Language Models to Reason Efficiently, Arora et al., arXiv Badge
  • LLM Post-Training: A Deep Dive into Reasoning Large Language Models, Kumar et al., arXiv Badge
  • Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment, Sun et al., arXiv Badge
  • Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points, Zhang et al., arXiv Badge
  • Reasoning with Reinforced Functional Token Tuning, Zhang et al., arXiv Badge
  • Qsharp: Provably Optimal Distributional RL for LLM Post-Training, Zhou et al., arXiv Badge
  • Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning, Lyu et al., arXiv Badge
  • Competitive Programming with Large Reasoning Models, El-Kishky et al., arXiv Badge
  • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution, Wei et al., arXiv Badge
  • Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, Parashar et al., arXiv Badge
  • Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation, Kim et al., arXiv Badge
  • On the Emergence of Thinking in LLMs I: Searching for the Right Intuition, Ye et al., arXiv Badge
  • The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks, Cuadron et al., arXiv Badge
  • DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL, Luo et al., PDF Badge
  • STeCa: Step-level Trajectory Calibration for LLM Agent Learning, Wang et al., arXiv Badge
  • Thinking Preference Optimization, Yang et al., arXiv Badge
  • Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training, Bartoldson et al., arXiv Badge
  • Dapo: An open-source llm reinforcement learning system at scale, Yu et al., arXiv Badge
  • Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, Hu et al., arXiv Badge
  • Optimizing Test-Time Compute via Meta Reinforcement Finetuning, Qu et al., PDF Badge
  • Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't, Dang et al., arXiv Badge
  • SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks, Zhou et al., arXiv Badge
  • START: Self-taught Reasoner with Tools, Li et al., arXiv Badge
  • Expanding RL with Verifiable Rewards Across Diverse Domains, Su et al., arXiv Badge
  • R-PRM: Reasoning-Driven Process Reward Modeling, She et al., arXiv Badge
  • VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks, Yue et al., arXiv Badge
  • Z1: Efficient Test-time Scaling with Code, Yu et al., arXiv Badge
  • QwQ: Reflect Deeply on the Boundaries of the Unknown, Team et al., Github Badge
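
A recurring ingredient in the RL-based ("internal exploration") line above is a group-relative baseline: sample several completions per prompt, score them with an outcome reward, and normalize within the group, in the spirit of the GRPO objective used by DeepSeekMath and DeepSeek-R1. The fragment below computes only such group-normalized advantages from given rewards; it is a simplified, hedged sketch, not a full training loop or the official implementation.

```python
# Group-relative advantage computation (GRPO-style), simplified.
# Rewards are assumed to come from an outcome verifier (e.g., answer
# correctness); the advantages would weight token log-probabilities in a
# policy-gradient update. Illustrative fragment only.
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

if __name__ == "__main__":
    # Four sampled chains for one prompt: 1.0 = correct answer, 0.0 = incorrect.
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```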

Future and Frontiers

Agentic & Embodied Long CoT

  • Solving Math Word Problems via Cooperative Reasoning induced Language Models, Zhu et al., PDF Badge
  • Reasoning with Language Model is Planning with World Model, Hao et al., PDF Badge
  • Large language models as commonsense knowledge for large-scale task planning, Zhao et al., PDF Badge
  • Robotic Control via Embodied Chain-of-Thought Reasoning, Zawalski et al., PDF Badge
  • Tree-Planner: Efficient Close-loop Task Planning with Large Language Models, Hu et al., PDF Badge
  • Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models, Zhou et al., PDF Badge
  • Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search, Light et al., PDF Badge
  • Mixture-of-agents enhances large language model capabilities, Wang et al., arXiv Badge
  • ADaPT: As-Needed Decomposition and Planning with Language Models, Prasad et al., PDF Badge
  • Tree search for language model agents, Koh et al., arXiv Badge
  • Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large language model, Hu et al., arXiv Badge
  • S3 agent: Unlocking the power of VLLM for zero-shot multi-modal sarcasm detection, Wang et al., PDF Badge
  • MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems, Lei et al., PDF Badge
  • Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al., PDF Badge
  • EVOLvE: Evaluating and Optimizing LLMs For Exploration, Nie et al., arXiv Badge
  • Agents Thinking Fast and Slow: A Talker-Reasoner Architecture, Christakopoulou et al., PDF Badge
  • Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation, Xie et al., arXiv Badge
  • Titans: Learning to memorize at test time, Behrouz et al., arXiv Badge
  • Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success, Kim et al., arXiv Badge
  • World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning, Wang et al., arXiv Badge
  • Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks, Zhang et al., arXiv Badge
  • Cosmos-reason1: From physical common sense to embodied reasoning, Azzolini et al., arXiv Badge
  • Improving Retrospective Language Agents via Joint Policy Gradient Optimization, Feng et al., arXiv Badge
  • Haste Makes Waste: Evaluating Planning Abilities of LLMs for Efficient and Feasible Multitasking with Time Constraints Between Actions, Wu et al., arXiv Badge
  • MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents, Zhu et al., arXiv Badge
  • ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning, Wan et al., arXiv Badge
  • MAS-GPT: Training LLMs To Build LLM-Based Multi-Agent Systems, Ye et al., PDF Badge
  • Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems, Liu et al., arXiv Badge

Efficient Long CoT

  • Guiding language model reasoning with planning tokens, Wang et al., arXiv Badge
  • Synergy-of-thoughts: Eliciting efficient reasoning in hybrid language models, Shang et al., arXiv Badge
  • Distilling system 2 into system 1, Yu et al., arXiv Badge
  • Concise thoughts: Impact of output length on llm reasoning and cost, Nayab et al., arXiv Badge
  • Litesearch: Efficacious tree search for llm, Wang et al., arXiv Badge
  • Uncertainty-Guided Optimization on Large Language Model Search Trees, Grosse et al., arXiv Badge
  • CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks, Wang et al., arXiv Badge
  • Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., PDF Badge
  • Kvsharer: Efficient inference via layer-wise dissimilar KV cache sharing, Yang et al., arXiv Badge
  • Interpretable contrastive monte carlo tree search reasoning, Gao et al., arXiv Badge
  • Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces, Su et al., arXiv Badge
  • DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models, Pan et al., PDF Badge
  • Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding, Chen et al., arXiv Badge
  • Token-budget-aware llm reasoning, Han et al., arXiv Badge
  • B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners, Zeng et al., arXiv Badge
  • C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness, Kang et al., arXiv Badge
  • Training large language models to reason in a continuous latent space, Hao et al., arXiv Badge
  • CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al., arXiv Badge
  • Kimi k1.5: Scaling reinforcement learning with llms, Team et al., arXiv Badge
  • O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning, Luo et al., arXiv Badge
  • Reward-Guided Speculative Decoding for Efficient LLM Reasoning, Liao et al., arXiv Badge
  • Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization, Yu et al., arXiv Badge
  • Efficient Reasoning with Hidden Thinking, Shen et al., arXiv Badge
  • On the Query Complexity of Verifier-Assisted Language Generation, Botta et al., arXiv Badge
  • TokenSkip: Controllable Chain-of-Thought Compression in LLMs, Xia et al., arXiv Badge
  • Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation, Du et al., arXiv Badge
  • Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE, Huang et al., arXiv Badge
  • Towards Reasoning Ability of Small Language Models, Srivastava et al., arXiv Badge
  • Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs, Ji et al., arXiv Badge
  • Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models, Chijiwa et al., arXiv Badge
  • MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification, Sun et al., arXiv Badge
  • Language Models Can Predict Their Own Behavior, Ashok et al., arXiv Badge
  • On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes, Chang et al., PDF Badge
  • CoT-Valve: Length-Compressible Chain-of-Thought Tuning, Ma et al., arXiv Badge
  • Training Language Models to Reason Efficiently, Arora et al., arXiv Badge
  • Chain of Draft: Thinking Faster by Writing Less, Xu et al., arXiv Badge
  • Learning to Stop Overthinking at Test Time, Bao et al., arXiv Badge
  • Self-Training Elicits Concise Reasoning in Large Language Models, Munkhbat et al., arXiv Badge
  • Length-Controlled Margin-Based Preference Optimization without Reference Model, Li et al., arXiv Badge
  • Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking, Ziabari et al., arXiv Badge
  • Dynamic Parallel Tree Search for Efficient LLM Reasoning, Ding et al., arXiv Badge
  • Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models, Cui et al., arXiv Badge
  • Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning, Wang et al., arXiv Badge
  • SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs, Xu et al., arXiv Badge
  • LightThinker: Thinking Step-by-Step Compression, Zhang et al., arXiv Badge
  • Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning, Yan et al., arXiv Badge
  • Stepwise Informativeness Search for Improving LLM Reasoning, Wang et al., arXiv Badge
  • Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning, Li et al., arXiv Badge
  • Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking, Ge et al., arXiv Badge
  • Understanding r1-zero-like training: A critical perspective, Liu et al., arXiv Badge
  • The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models, Ji et al., arXiv Badge
  • L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning, Aggarwal et al., arXiv Badge
  • DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models, Shen et al., arXiv Badge
  • ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning, Hou et al., arXiv Badge
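
Many of the efficiency methods above share one mechanism: penalizing chain length in the training or selection objective so that the model is rewarded for being both correct and concise. Below is a hedged toy version of such a length-aware reward; the coefficient and token budget are illustrative placeholders, not values taken from any listed paper.

```python
# Toy length-penalized reward for concise reasoning.
# `correct` would come from an answer check; `budget` caps the token count
# after which the penalty saturates. All constants are illustrative.
def length_aware_reward(correct: bool, n_tokens: int,
                        budget: int = 1024, alpha: float = 0.2) -> float:
    penalty = alpha * min(n_tokens / budget, 1.0)
    return (1.0 if correct else 0.0) - penalty

if __name__ == "__main__":
    print(length_aware_reward(True, 256))   # short and correct -> high reward
    print(length_aware_reward(True, 4096))  # correct but verbose -> penalized
```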

Knowledge-Augmented Long CoT

  • Best of Both Worlds: Harmonizing LLM Capabilities in Decision-Making and Question-Answering for Treatment Regimes, Liu et al., PDF Badge
  • Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation, Wang et al., PDF Badge
  • Stream of search (sos): Learning to search in language, Gandhi et al., PDF Badge
  • CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing, Yang et al., arXiv Badge
  • Disentangling memory and reasoning ability in large language models, Jin et al., arXiv Badge
  • Huatuogpt-o1, towards medical complex reasoning with llms, Chen et al., arXiv Badge
  • RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement, Jiang et al., arXiv Badge
  • O1 Replication Journey--Part 3: Inference-time Scaling for Medical Reasoning, Huang et al., arXiv Badge
  • MedS³: Towards Medical Small Language Models with Self-Evolved Slow Thinking, Jiang et al., arXiv Badge
  • Search-o1: Agentic search-enhanced large reasoning models, Li et al., arXiv Badge
  • Chain-of-Retrieval Augmented Generation, Wang et al., arXiv Badge
  • Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study, Zhao et al., PDF Badge
  • Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support, Wang et al., arXiv Badge
  • ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model, Chen et al., arXiv Badge
  • Large Language Models for Recommendation with Deliberative User Preference Alignment, Fang et al., arXiv Badge
  • ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models, Chen et al., arXiv Badge
  • DeepRAG: Thinking to Retrieval Step by Step for Large Language Models, Guan et al., arXiv Badge
  • Open Deep Research, Team et al., Github Badge
  • HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation, Liu et al., arXiv Badge
  • O1 Embedder: Let Retrievers Think Before Action, Yan et al., arXiv Badge
  • MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning, Pan et al., arXiv Badge
  • Towards Robust Legal Reasoning: Harnessing Logical LLMs in Law, Kant et al., arXiv Badge
  • OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning, Lu et al., arXiv Badge
  • R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning, Song et al., arXiv Badge
  • RARE: Retrieval-Augmented Reasoning Modeling, Wang et al., arXiv Badge
  • Graph-Augmented Reasoning: Evolving Step-by-Step Knowledge Graph Retrieval for LLM Reasoning, Wu et al., arXiv Badge
  • Learning to Reason with Search for LLMs via Reinforcement Learning, Chen et al., arXiv Badge
  • Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning, Liu et al., arXiv Badge
  • m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models, Huang et al., arXiv Badge

Multilingual Long CoT

  • Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages, Qin et al., PDF Badge
  • Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting, Huang et al., PDF Badge
  • xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning, Chai et al., arXiv Badge
  • Multilingual large language model: A survey of resources, taxonomy and frontiers, Qin et al., arXiv Badge
  • A Tree-of-Thoughts to Broaden Multi-step Reasoning across Languages, Ranaldi et al., PDF Badge
  • AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought, Zhang et al., PDF Badge
  • Enhancing Advanced Visual Reasoning Ability of Large Language Models, Li et al., PDF Badge
  • DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought, Wang et al., arXiv Badge
  • A survey of multilingual large language models, Qin et al., PDF Badge
  • Demystifying Multilingual Chain-of-Thought in Process Reward Modeling, Wang et al., arXiv Badge
  • The Multilingual Mind: A Survey of Multilingual Reasoning in Language Models, Ghosh et al., arXiv Badge

Multimodal Long CoT

  • Large Language Models Can Self-Correct with Minimal Effort, Wu et al., PDF Badge
  • Multimodal Chain-of-Thought Reasoning in Language Models, Zhang et al., PDF Badge
  • Q*: Improving multi-step reasoning for llms with deliberative planning, Wang et al., arXiv Badge
  • M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought, Chen et al., PDF Badge
  • A survey on evaluation of multimodal large language models, Huang et al., arXiv Badge
  • Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al., PDF Badge
  • What factors affect multi-modal in-context learning? an in-depth exploration, Qin et al., arXiv Badge
  • Enhancing Advanced Visual Reasoning Ability of Large Language Models, Li et al., PDF Badge
  • Insight-v: Exploring long-chain visual reasoning with multimodal large language models, Dong et al., arXiv Badge
  • Llava-o1: Let vision language models reason step-by-step, Xu et al., arXiv Badge
  • AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning, Xiang et al., arXiv Badge
  • ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback, Byun et al., PDF Badge
  • Enhancing the reasoning ability of multimodal large language models via mixed preference optimization, Wang et al., arXiv Badge
  • Slow Perception: Let's Perceive Geometric Figures Step-by-step, Wei et al., arXiv Badge
  • Diving into Self-Evolving Training for Multimodal Reasoning, Liu et al., arXiv Badge
  • Scaling inference-time search with vision value model for improved visual comprehension, Wang et al., arXiv Badge
  • CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models, Cheng et al., arXiv Badge
  • Inference Retrieval-Augmented Multi-Modal Chain-of-Thoughts Reasoning for Language Models, He et al., PDF Badge
  • Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model, Ma et al., arXiv Badge
  • BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning, Zhang et al., arXiv Badge
  • InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model, Zang et al., arXiv Badge
  • Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark, Hao et al., arXiv Badge
  • Visual Agents as Fast and Slow Thinkers, Sun et al., PDF Badge
  • Virgo: A Preliminary Exploration on Reproducing o1-like MLLM, Du et al., arXiv Badge
  • Llamav-o1: Rethinking step-by-step visual reasoning in llms, Thawakar et al., arXiv Badge
  • Inference-time scaling for diffusion models beyond scaling denoising steps, Ma et al., arXiv Badge
  • Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step, Guo et al., arXiv Badge
  • Imagine while Reasoning in Space: Multimodal Visualization-of-Thought, Li et al., arXiv Badge
  • Monte Carlo Tree Diffusion for System 2 Planning, Yoon et al., arXiv Badge
  • Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking, Wu et al., arXiv Badge
  • Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models, Xie et al., arXiv Badge
  • Visual-RFT: Visual Reinforcement Fine-Tuning, Liu et al., arXiv Badge
  • Qwen2.5-Omni Technical Report, Xu et al., arXiv Badge
  • Vision-r1: Incentivizing reasoning capability in multimodal large language models, Huang et al., arXiv Badge
  • Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl, Peng et al., arXiv Badge
  • Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning, Tan et al., arXiv Badge
  • OThink-MR1: Stimulating multimodal generalized reasoning capabilities through dynamic reinforcement learning, Liu et al., arXiv Badge
  • Grounded Chain-of-Thought for Multimodal Large Language Models, Wu et al., arXiv Badge
  • Test-Time View Selection for Multi-Modal Decision Making, Jain et al., PDF Badge
  • Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme, Ma et al., arXiv Badge

Safety and Stability for Long CoT

  • Larger and more instructable language models become less reliable, Zhou et al., PDF Badge
  • On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models, Tanneru et al., arXiv Badge
  • The Impact of Reasoning Step Length on Large Language Models, Jin et al., PDF Badge
  • Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought, Chen et al., PDF Badge
  • Can Large Language Models Understand You Better? An MBTI Personality Detection Dataset Aligned with Population Traits, Li et al., arXiv Badge
  • o3-mini vs DeepSeek-R1: Which One is Safer?, Arrieta et al., arXiv Badge
  • Efficient Reasoning with Hidden Thinking, Shen et al., arXiv Badge
  • Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking, Cheng et al., arXiv Badge
  • Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection, Zhao et al., arXiv Badge
  • Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies, Parmar et al., arXiv Badge
  • Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation, Arrieta et al., arXiv Badge
  • International AI Safety Report, Bengio et al., arXiv Badge
  • GuardReasoner: Towards Reasoning-based LLM Safeguards, Liu et al., arXiv Badge
  • OVERTHINKING: Slowdown Attacks on Reasoning LLMs, Kumar et al., arXiv Badge
  • A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos, Yao et al., arXiv Badge
  • MetaSC: Test-Time Safety Specification Optimization for Language Models, Gallego et al., arXiv Badge
  • Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment, Wang et al., arXiv Badge
  • The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1, Zhou et al., arXiv Badge
  • Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models, Lu et al., arXiv Badge
  • Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?, Bengio et al., arXiv Badge
  • Emergent Response Planning in LLM, Dong et al., arXiv Badge
  • Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models, Kharinaev et al., arXiv Badge
  • Safety Evaluation of DeepSeek Models in Chinese Contexts, Zhang et al., arXiv Badge
  • Reasoning Does Not Necessarily Improve Role-Playing Ability, Feng et al., arXiv Badge
  • H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking, Kuo et al., arXiv Badge
  • BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack, Zhu et al., arXiv Badge
  • " Nuclear Deployed!": Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents, Xu et al., arXiv Badge
  • SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities, Jiang et al., arXiv Badge
  • Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking, Zhu et al., arXiv Badge
  • CER: Confidence Enhanced Reasoning in LLMs, Razghandi et al., arXiv Badge
  • Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps, Tutek et al., arXiv Badge
  • The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis, Pan et al., arXiv Badge
  • Policy Frameworks for Transparent Chain-of-Thought Reasoning in Large Language Models, Chen et al., arXiv Badge
  • Do Chains-of-Thoughts of Large Language Models Suffer from Hallucinations, Cognitive Biases, or Phobias in Bayesian Reasoning?, Araya et al., arXiv Badge
  • Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps, Cui et al., arXiv Badge
  • Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable, Huang et al., arXiv Badge
  • Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?, Yan et al., arXiv Badge
  • Reasoning Models Don’t Always Say What They Think, Chen et al., PDF Badge

Resources

Open-Sourced Training Framework

  • OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework, Hu et al., arXiv Badge
  • LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, Hao et al., PDF Badge
  • OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models, Wang et al., arXiv Badge
  • TinyZero, Pan et al., Github Badge
  • R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than $3, Chen et al., Github Badge
  • VL-Thinking: An R1-Derived Visual Instruction Tuning Dataset for Thinkable LVLMs, Chen et al., Github Badge
  • VLM-R1: A stable and generalizable R1-style Large Vision-Language Model, Shen et al., Github Badge
  • 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient, Zeng et al., Notion Badge
  • Open R1, Team et al., Github Badge
  • DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL, Luo et al., PDF Badge
  • X-R1, Team et al., Github Badge
  • Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, Hu et al., Github Badge
  • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, Xie et al., arXiv Badge
  • R1-Multimodal-Journey, Shao et al., Github Badge
  • Open-R1-Multimodal, Lab et al., Github Badge
  • Video-R1, Team et al., Github Badge
  • Dapo: An open-source llm reinforcement learning system at scale, Yu et al., arXiv Badge
  • VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks, Yue et al., arXiv Badge
