April 9, 2025

Thinking Machines: A Survey of LLM-based Reasoning Strategies

  1. Introduction

LLMs such as GPT-3 and GPT-4 have shown outstanding ability in understanding and generating human-like text. Trained on massive volumes of textual data, they predict likely continuations and produce coherent answers, which allows them to perform well in tasks like translation, summarization, and question answering. However, LLM reasoning, which refers to their capacity to process information logically and work out solutions to problems, continues to face major challenges.

Yet, even though these models are very good at language, they often struggle with tasks that require genuine logical judgment and complex thinking. This gap arises because LLMs rely on pattern recognition rather than actual understanding, which limits their ability to reason.

Researchers are investigating various techniques, such as process optimization and prompt engineering, to improve LLM reasoning. These methods are designed to enhance the models' capacity to manage complex tasks that require logical reasoning.

Researchers have also developed post-hoc methods to encourage LLMs to think, with varying performance across methods:

  • Chain of Thought (CoT) Prompting: This method prompts models to generate intermediate reasoning steps, breaking complex problems into more manageable components and improving performance on tasks that require logical progression.

  • ReAct (Reasoning+Acting) Pattern: ReAct interleaves the generation of reasoning traces with task-specific actions, allowing models to improve their performance on complicated tasks.

  • Self-Reflection Mechanisms: These include self-evaluation and internal critique mechanisms that let models check and improve their outputs, making their reasoning more accurate and easier to follow.

  • Integration with Knowledge Graphs: Uses structured data so that models can explicitly represent the relationships between entities, leading to more precise and understandable reasoning processes.

Recently, the development of advanced reasoning models like OpenAI's o1 and DeepSeek's R1 has substantially changed how these problems are approached. OpenAI's o1 is designed to spend more time processing information before responding, which allows it to address complex tasks in science, coding, and mathematics more effectively than previous models. It achieves this using reinforcement learning techniques that strengthen the model's reasoning, encourage the exploration of alternative solutions, and help it recognize errors.

Similarly, the DeepSeek R1 model improves reasoning capabilities by employing large-scale reinforcement learning without requiring initial supervised fine-tuning. This method enables the model to acquire reasoning behaviours like reflection and self-verification. To enhance performance, DeepSeek added further training phases and data, which led to a model that is comparable to OpenAI's o1 across a variety of reasoning tasks.

In this article, we will look into the technical framework of reasoning in LLMs, including preliminary concepts, a taxonomy of methods, in-depth examinations of key algorithms, and powerful reasoning models like OpenAI o1 and DeepSeek R1.


  2. Preliminary Concepts

Fundamentals of Language Modelling

Language modelling is the foundational task behind LLMs: models are trained on large-scale data to predict the next token in a sequence. To develop a context-aware understanding of language, they rely on objectives such as next-token prediction and the causal language modelling loss. These objectives help the models produce text that is consistent with human-like patterns.

Important aspects consist of:

  • Next-token Prediction: The model predicts the next token by analysing the context that precedes it.

  • Causal Language Modelling Loss: This loss function guides learning by measuring prediction errors one token at a time.

  • Sampling Techniques: During generation, the output is shaped by decoding techniques such as greedy decoding, beam search, top-k sampling, and top-p sampling (see the sketch after this list).

  • Inference Limitations: These sampling techniques frequently fail to effectively manage multi-step reasoning and logical structuring.
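
To make these decoding choices concrete, below is a minimal sketch, assuming a toy five-token vocabulary and a raw logits vector rather than a real model, of greedy decoding, top-k sampling, and top-p (nucleus) sampling:

```python
import numpy as np

def softmax(logits):
    """Convert raw logits into a probability distribution over the vocabulary."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def greedy(probs):
    """Greedy decoding: always pick the single most likely token."""
    return int(np.argmax(probs))

def top_k_sample(probs, k=3, rng=np.random.default_rng(0)):
    """Top-k sampling: keep the k most likely tokens, renormalize, then sample."""
    top = np.argsort(probs)[-k:]
    return int(rng.choice(top, p=probs[top] / probs[top].sum()))

def top_p_sample(probs, p=0.9, rng=np.random.default_rng(0)):
    """Top-p (nucleus) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalize, then sample."""
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    nucleus = order[:cutoff]
    return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))

# Toy example: hypothetical logits for a 5-token vocabulary.
probs = softmax(np.array([2.0, 1.0, 0.5, 0.2, -1.0]))
print(greedy(probs), top_k_sample(probs), top_p_sample(probs))
```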

This foundation makes it possible to understand both the strengths of language models and the problems they face.

Defining Reasoning in AI

Reasoning in AI involves producing sound outputs by drawing logical conclusions from the data that is provided. It is grounded in long-standing logical principles that govern how inferences are made, and it highlights how people and machines approach problem-solving in different ways.

Important aspects consist of the following:

  • Building on Evidence: We are guided to reliable conclusions by clear, organized evidence that serves as the foundation for our reasoning.

  • Human Reasoning: We solve problems by splitting them up into smaller parts and going over our plans again and again until they're perfect.

  • Machine Reasoning: Machines follow a comparable approach: a hard task is broken into smaller, easier subtasks, each is solved separately, and the results are then combined.

  • Iterative Refinement: When new information comes in, both humans and machines review and change their methods to get better results.

For better decision-making, AI can learn from human strategies while still following strict algorithmic paths.  

Reinforcement Learning (RL) Basics for LLMs

LLMs become more proficient by engaging with their surroundings and receiving feedback through reinforcement learning. RLHF (reinforcement learning from human feedback) is a method that enables models to evaluate their actions more accurately by learning from actual user responses through trial and error. This process is further refined by techniques such as PPO (proximal policy optimization) and DPO (direct preference optimization), which modify the model's actions to maximize the rewards it receives.

Important elements consist of:

  • Policies: Define how the model chooses actions in different states.

  • Rewards: Offer feedback that assists the model in understanding the significance of its actions.

  • Value Functions: Provide direction for decision-making and estimate future rewards.

  • Policy Gradient Methods: Use mathematical formulas to modify actions to maximize rewards.

These foundations enable LLMs to learn from their past and improve their outputs, which improves their efficacy in dynamic tasks.
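
As a rough illustration rather than any particular model's training recipe, the sketch below shows the clipped surrogate objective at the heart of PPO; the log-probabilities and advantage estimates are made-up values:

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: mean of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A),
    where ratio = pi_new(action) / pi_old(action). Maximizing this favours
    high-advantage actions while keeping the updated policy close to the old one."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Hypothetical per-sample log-probabilities and advantage estimates.
print(ppo_clipped_objective(
    logp_new=np.array([-1.0, -0.7, -2.1]),
    logp_old=np.array([-1.2, -0.9, -1.8]),
    advantages=np.array([0.5, 1.0, -0.3]),
))
```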

Monte Carlo Tree Search (MCTS)

Monte Carlo Tree Search provides a planning method that examines many possible outcomes before committing to an approach. It helps LLMs manage their internal thought processes during reasoning tasks, and it makes decisions more accurate by following a clear set of steps.

Important points include:  

  • Selection: The algorithm selects the node that is most promising, as determined by the most recent estimates.

  • Expansion: It adds new child nodes that represent possible next steps.

  • Simulation: The algorithm estimates the outcome of a decision by running a simulated rollout to completion.

  • Backpropagation: It updates the value of each node along the path based on the simulation's outcome.

MCTS helps LLMs plan their lines of thinking so they come up with well-thought-out answers that make sense.
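
The sketch below walks through the four MCTS phases on a deliberately tiny toy problem (choosing four binary "steps" against a hidden target); in a real reasoning system the actions would be candidate reasoning steps and the rollout reward would come from a verifier or value model:

```python
import math
import random

class Node:
    """One node in the search tree: a partial sequence of choices."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}        # action -> child Node
        self.visits, self.value = 0, 0.0

def uct(child, parent_visits, c=1.4):
    """Upper Confidence Bound for Trees: trade off value and exploration."""
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(root_state, actions, is_terminal, step, rollout_reward, iters=200):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # 1. Selection: descend through fully expanded nodes using UCT.
        while node.children and len(node.children) == len(actions):
            node = max(node.children.values(), key=lambda n: uct(n, node.visits))
        # 2. Expansion: add one untried child, unless the state is terminal.
        if not is_terminal(node.state):
            action = random.choice([a for a in actions if a not in node.children])
            node.children[action] = Node(step(node.state, action), parent=node)
            node = node.children[action]
        # 3. Simulation: random rollout from the new node to a terminal state.
        state = node.state
        while not is_terminal(state):
            state = step(state, random.choice(actions))
        reward = rollout_reward(state)
        # 4. Backpropagation: push the rollout reward back up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Recommend the root action with the highest average value.
    return max(root.children.items(), key=lambda kv: kv[1].value / kv[1].visits)[0]

# Toy domain: pick 4 binary "steps"; reward = matches with a hidden target.
target = (1, 0, 1, 1)
best_first_move = mcts(root_state=(), actions=[0, 1],
                       is_terminal=lambda s: len(s) == 4,
                       step=lambda s, a: s + (a,),
                       rollout_reward=lambda s: sum(int(x == t) for x, t in zip(s, target)))
print("first move chosen by MCTS:", best_first_move)
```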

  3. Taxonomy of LLM-Based Reasoning Strategies

a. Reinforcement Learning Paradigm

Verbal Reinforcement

LLMs enhance their reasoning abilities by generating chains of thought and receiving feedback in plain language. For example, ReAct divides the process into three components: thought, action, and observation. Reflexion generates multiple reasoning paths and refines them according to the feedback it receives. A self-reflector updates the reasoning stored in memory by using the step-by-step remarks provided by an evaluator. This closed feedback loop leads the model to produce more logical and refined responses with each cycle.
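
A minimal sketch of the thought/action/observation loop is shown below; `call_llm` and `run_tool` are hypothetical stand-ins for a real model API and a tool executor, and the toy stubs exist only so the loop runs end to end:

```python
# Minimal ReAct-style loop. `call_llm` and `run_tool` are hypothetical stand-ins
# for a real model API and a tool executor.
def react_loop(question, call_llm, run_tool, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript + "Thought:")          # model proposes thought + action
        transcript += f"Thought:{step['thought']}\nAction: {step['action']}\n"
        if step["action"].startswith("Finish"):
            return step["action"]                         # final answer reached
        observation = run_tool(step["action"])            # execute the action
        transcript += f"Observation: {observation}\n"     # feed result back as context
    return "No answer within the step budget"

# Toy stubs so the sketch runs end to end.
def fake_llm(prompt):
    if "Observation" in prompt:
        return {"thought": " I now know the answer.", "action": "Finish[Paris]"}
    return {"thought": " I should look this up.", "action": "Search[capital of France]"}

def fake_tool(action):
    return "France's capital is Paris."

print(react_loop("What is the capital of France?", fake_llm, fake_tool))
```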

Reward‑Based Reinforcement

Reward-based learning strategies modify the behavior of the model by assigning a score to each step or to the final response. With process supervision, the model receives a reward at each intermediate step; with outcome supervision, it receives a single reward for the final answer. Value functions predict the total future reward from each state, while reward functions convert reasoning traces into a simple score. Proximal Policy Optimization (PPO) enhances the policy by increasing the anticipated total rewards. Group Relative Policy Optimization (GRPO) reduces variance by normalizing scores across groups of outputs. Direct Preference Optimization (DPO) refines the policy directly from pairwise comparisons at the step level, removing the need for a distinct value model.
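
As a simplified illustration of the group-relative idea (details differ from the published GRPO method), each sampled answer's reward can be normalized against the mean and standard deviation of its own group:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each sampled answer's reward against the
    mean and standard deviation of the group sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical rewards for four answers sampled for one prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```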


Figure 1: Reasoning with Reinforcement Learning: Source

Search/Planning Techniques

The search and planning process combines tree search with LLM reasoning to test many reasoning paths before arriving at a conclusion. In Monte Carlo Tree Search (MCTS), the system selects, expands, simulates, and updates value predictions over a tree of reasoning steps. World models help the LLM plan by letting it predict what will happen next. Hybrid methods combine reinforcement learning with tree search to balance exploring new options against exploiting proven ones. To enhance the clarity and precision of the final answer, the system evaluates numerous chains of thought.

b. Test‑Time Compute (TTC) Paradigm

Feedback Guided Improvement

At test time, additional compute methods enhance the quality of the output by scoring and filtering the reasoning paths as they are generated. Step-feedback, which employs techniques such as Monte Carlo Tree Search (MCTS) or beam search, scores each token or partial answer so that only high-scoring sequences continue. The outcome-feedback method generates several complete answers and then ranks them with the help of an outside reviewer. The scores assigned by verification modules, such as trained analyzers or code execution engines like CodeT and LEVER, help determine the correct answer. These feedback loops rapidly cut off low-value reasoning, which decreases errors and ensures a consistent final output.
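
The sketch below illustrates step-feedback as a beam search over partial reasoning chains; `propose_steps` and `score_partial` are hypothetical stand-ins for an LLM step generator and a process reward model or verifier, with toy stubs so the example runs:

```python
# Step-feedback as beam search over partial reasoning chains. `propose_steps`
# and `score_partial` are hypothetical stand-ins for an LLM step generator and
# a process reward model or verifier.
def step_feedback_beam_search(question, propose_steps, score_partial,
                              beam_width=3, max_depth=4):
    beams = [[]]                                   # each beam is a list of steps
    for _ in range(max_depth):
        candidates = [chain + [step]
                      for chain in beams
                      for step in propose_steps(question, chain)]
        # Keep only the highest-scoring partial chains; low-value ones are pruned.
        candidates.sort(key=lambda c: score_partial(question, c), reverse=True)
        beams = candidates[:beam_width]
    return beams[0]                                # best-scoring chain of steps

# Toy stubs so the sketch runs end to end.
demo_steps = lambda q, chain: [f"step{len(chain)}a", f"step{len(chain)}b"]
demo_score = lambda q, chain: sum(1.0 if s.endswith("a") else 0.5 for s in chain)
print(step_feedback_beam_search("toy question", demo_steps, demo_score))
```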

Scaling and Adaptation

Scaling inference compute refers to the use of additional compute resources during testing to explore more complex reasoning paths without modifying the model's size. Chain-of-thought prompting encourages the model to reason more deeply, while forest-of-thought and graph-of-thought methods search multiple branches simultaneously and combine their insights. This additional "thinking time" enhances accuracy and minimizes errors, improving the quality of reasoning without the need to retrain the model.

Self‑feedback and Iterative Refinement

LLMs have the ability to enhance their responses by verifying and correcting their own errors. They evaluate their work, identify mistakes, and revise the sections that are inaccurate. The model can also generate numerous answer paths, evaluate them, and select the one that appears most frequently. This process of drafting, reviewing, and revising the answer is repeated until it is clear and consistent, which reduces errors and enhances reliability without the need for external assistance.
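
A minimal sketch of such a generate-critique-revise loop, with `generate`, `critique`, and `revise` as hypothetical stand-ins for prompts sent to the same underlying model, might look like this:

```python
# Generate -> critique -> revise loop. `generate`, `critique`, and `revise` are
# hypothetical stand-ins for prompts sent to the same underlying model.
def self_refine(question, generate, critique, revise, max_rounds=3):
    answer = generate(question)
    for _ in range(max_rounds):
        feedback = critique(question, answer)
        if feedback == "OK":                  # the model finds no remaining errors
            break
        answer = revise(question, answer, feedback)
    return answer

# Toy stubs so the loop runs end to end.
gen = lambda q: "draft answer"
crit = lambda q, a: "OK" if a.startswith("revised") else "missing a step"
rev = lambda q, a, f: "revised " + a
print(self_refine("example question", gen, crit, rev))
```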

c. Self-Training Paradigm

Bootstrapping with Generated Reasoning Traces

Self-training enhances the performance of a model by using its own reasoning. The model generates chain-of-thought answers to different questions and then picks out the best ones. It is then fine-tuned on these high-quality traces, which reinforces accurate reasoning. Over repeated generate-and-train cycles, the model acquires deeper reasoning. Because the model's own outputs become its training data, the need for human annotation is greatly reduced.
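
A simplified sketch of one bootstrapping round, in the spirit of self-training methods such as STaR, is shown below; `sample_cot` and `finetune` are hypothetical stand-ins for sampling chain-of-thought completions and running a fine-tuning job:

```python
# One bootstrapping round over generated reasoning traces. `sample_cot` and
# `finetune` are hypothetical stand-ins for sampling chain-of-thought
# completions and running a fine-tuning job.
def bootstrap_round(dataset, sample_cot, finetune, samples_per_q=8):
    kept = []
    for question, gold_answer in dataset:
        for trace, answer in sample_cot(question, n=samples_per_q):
            if answer == gold_answer:          # keep only traces that reached the right answer
                kept.append((question, trace))
    return finetune(kept)                      # refined model feeds the next round

# Toy stubs so the sketch runs.
demo_data = [("2+2?", "4"), ("3*3?", "9")]
demo_sampler = lambda q, n: [("think step by step ...", "4")] * n
demo_finetune = lambda examples: f"model fine-tuned on {len(examples)} traces"
print(bootstrap_round(demo_data, demo_sampler, demo_finetune))
```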

Self‑Consistency & Ensemble Methods

Self-consistency works by generating multiple reasoning paths for each query. The model then groups comparable responses and selects the most prevalent one using likelihood scores or voting. It can also use internal checks or external verifiers to enhance the reliability of its responses. The final answer produced by this method is more consistent and accurate.
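
The aggregation step itself is simple; a minimal majority-vote sketch over final answers parsed from several sampled chains might look like this:

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over the final answers parsed from several sampled
    reasoning paths; the most frequent answer wins."""
    answer, votes = Counter(final_answers).most_common(1)[0]
    return answer, votes / len(final_answers)

# Hypothetical final answers extracted from five sampled chains of thought.
print(self_consistency(["42", "42", "41", "42", "40"]))   # -> ('42', 0.6)
```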

  4. DeepSeek-R1 and OpenAI o1-1217 Benchmark Evaluations

DeepSeek-R1 and OpenAI o1-1217 both use test-time computation methods, such as chain-of-thought generation, self-consistency, and iterative refinement, to address challenging math and coding problems. DeepSeek-R1 achieves a pass@1 score of 79.8% on AIME 2024 by generating multiple reasoning paths and then selecting the most common answer through voting or likelihood aggregation. OpenAI o1-1217, using a comparable approach, achieves a slightly lower pass@1 score of 74.4%. In summary, both models explore multiple reasoning trajectories and select the most effective one, but DeepSeek-R1 appears to hold a slight edge because it filters and reinforces correct reasoning steps.

DeepSeek-R1 also scores 97.3% on the MATH-500 benchmark, comparable to o1-1217's score on hard proof-style math questions. In coding competitions, DeepSeek-R1 outperforms 96.3% of human participants on Codeforces, with an Elo rating of 2,029 compared to o1-1217's 1,843. On LiveCodeBench, which tests algorithmic code under time limits, DeepSeek-R1 has a 73.3% pass rate, while o1-1217 has a 77.3% pass rate. These results show that reasoning-focused LLMs can do expert-level work on both structured mathematical proofs and real-world coding problems.
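
For reference, pass@k metrics of this kind are commonly computed with the unbiased estimator popularized by the HumanEval evaluation; a minimal version (not necessarily the exact protocol used for these specific models) is sketched below:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (as popularized by the HumanEval evaluation):
    n = samples generated per problem, c = correct samples, k = attempt budget.
    Returns the estimated probability that at least one of k attempts passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 4 of them correct; estimate pass@1.
print(pass_at_k(n=16, c=4, k=1))   # -> 0.25
```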

Both models handle multi-step logic well, but this comes at the cost of speed. o1-1217 tends to find answers more quickly on coding tasks, while DeepSeek-R1 is slightly more accurate on complex math. This trade-off between accuracy and response latency means organizations can choose the model that better fits their task priorities. As benchmarks continue to improve, these results set a new standard for how well LLMs can reason.

You can read more about the DeepSeek model and see how it stands up against its rivals. 

Table 1: Reasoning Model Benchmark

  5. Challenges

Challenges regarding Automation of Process Supervision Signals

Process reward models need step-by-step labels for each reasoning trace, which is difficult to automate and expensive to produce, making the approach hard to scale. Fully automating the creation of accurate supervision signals remains out of reach because synthetic labels often lack the detail needed for precise credit assignment.

Computational Overhead and Overthinking

Search-based methods such as MCTS reduce dependence on explicit value networks but explore large decision trees, resulting in unnecessary computation and "overthinking" that can degrade results. This inefficiency increases latency and resource costs, which makes reasoning-enhanced LLMs harder to deploy in practice.

Expensive Step‑Level Preference Optimization

Step-level preference tuning provides fine-grained rewards for each reasoning step, but it requires detailed annotations that are significantly more expensive than outcome-level labels. High annotation costs and annotator bias mean this effective but resource-intensive form of supervision is used less often than it could be.

Test-Time Compute Scaling Depends on Robust Pre-training

Inference-time scaling can only unlock reasoning gains if the base model has robust pre-training; additional compute cannot compensate for weak foundational capabilities. Models that lack extensive pre-training show minimal improvement from an extended chain of thought, which limits the benefit of test-time computation.

Test‑Time Scaling Limitations

Chain-of-thought and similar scaling methods make very large models (>100B parameters) more accurate, but they can make smaller models (<10B parameters) less accurate. This parameter-dependent effectiveness restricts test-time scaling to high-end models and limits broader accessibility.

  6. Conclusion

Large language model reasoning strategies fall into three distinct categories: reinforcement learning, test-time computation, and self-training. These strategies collectively move LLMs from surface-level pattern matching toward genuine multi-step inference. DeepSeek-R1 and OpenAI o1 exemplify the reinforcement learning approach, refining models with step-level and outcome rewards.

At test time, methods like chain-of-thought prompting, Monte Carlo tree search, and forest-of-thought use dynamic search to produce better answers. Self-training loops improve model weights by converting generated reasoning traces into training data, without the need for human labels. This taxonomy indicates that optimizing reasoning performance increasingly requires smart inference techniques rather than more parameters alone.

Advanced reasoning LLMs can now match or beat human accuracy on difficult math and coding benchmarks, but they must carefully balance compute costs, latency, and interpretability. For high accuracy at low cost, further development requires combining robust pre-training with adaptive inference, using lightweight models enhanced by focused search. Finding the right balance between scaling and smart inference will shape the next generation of AI by allowing it to use its resources efficiently and transparently.

FAQs

What is the core reasoning gap in current large language models?

How does reinforcement learning enhance reasoning in LLMs?

What role does test-time computing play in improving reasoning?

What are the benefits of self-training for AI models?
