Exploring DeepSeek-R1 s Agentic Capabilities Through Code Actions

Aus Philo Wiki
Wechseln zu:Navigation, Suche


I ran a fast experiment examining how DeepSeek-R1 out on agentic jobs, regardless of not supporting tool usage natively, and I was rather pleased by preliminary outcomes. This experiment runs DeepSeek-R1 in a single-agent setup, where the design not just prepares the actions however likewise creates the actions as executable Python code. On a subset1 of the GAIA recognition split, DeepSeek-R1 exceeds Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% right, and other designs by an even bigger margin:


The experiment followed model use standards from the DeepSeek-R1 paper and bybio.co the design card: Don't use few-shot examples, prevent adding a system timely, and engel-und-waisen.de set the temperature level to 0.5 - 0.7 (0.6 was used). You can find additional evaluation details here.


Approach


DeepSeek-R1's strong coding abilities allow it to act as an agent without being clearly trained for tool usage. By permitting the design to produce actions as Python code, it can flexibly connect with environments through code execution.


Tools are carried out as Python code that is consisted of straight in the prompt. This can be a simple function meaning or a module of a larger plan - any valid Python code. The design then creates code actions that call these tools.


Results from carrying out these actions feed back to the model as follow-up messages, driving the next actions until a final answer is reached. The agent framework is an easy iterative coding loop that moderates the conversation in between the design and suvenir51.ru its environment.


Conversations


DeepSeek-R1 is used as chat model in my experiment, where the design autonomously pulls extra context from its environment by utilizing tools e.g. by using a search engine or fetching data from websites. This drives the discussion with the environment that continues until a final answer is reached.


In contrast, o1 designs are understood to perform badly when used as chat designs i.e. they do not attempt to pull context throughout a conversation. According to the linked article, o1 models carry out best when they have the complete context available, with clear directions on what to do with it.


Initially, I also tried a complete context in a single timely method at each step (with arise from previous steps consisted of), however this caused significantly lower ratings on the GAIA subset. Switching to the conversational approach explained above, I had the ability to reach the reported 65.6% performance.


This raises an intriguing concern about the claim that o1 isn't a chat model - maybe this observation was more relevant to older o1 designs that lacked tool use abilities? After all, isn't tool usage support an important mechanism for enabling models to pull extra context from their environment? This conversational method certainly appears reliable for DeepSeek-R1, though I still require to perform comparable try outs o1 models.


Generalization


Although DeepSeek-R1 was mainly trained with RL on mathematics and coding tasks, dokuwiki.stream it is amazing that generalization to agentic tasks with tool use through code actions works so well. This capability to generalize to agentic tasks reminds of recent research by DeepMind that reveals that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't investigated because work.


Despite its capability to generalize to tool usage, systemcheck-wiki.de DeepSeek-R1 often produces very long thinking traces at each step, compared to other models in my experiments, limiting the effectiveness of this model in a single-agent setup. Even easier tasks often take a long period of time to complete. Further RL on agentic tool use, be it by means of code actions or not, might be one choice to enhance effectiveness.


Underthinking


I likewise observed the underthinking phenomon with DeepSeek-R1. This is when a thinking model frequently switches between various thinking thoughts without adequately checking out appealing paths to reach a right solution. This was a significant reason for extremely long reasoning traces produced by DeepSeek-R1. This can be seen in the tape-recorded traces that are available for download.


Future experiments


Another common application of reasoning models is to use them for planning just, while utilizing other models for producing code actions. This could be a possible new feature of freeact, if this separation of functions shows helpful for more complex jobs.


I'm likewise curious about how thinking designs that already support tool usage (like o1, o3, ...) perform in a single-agent setup, with and without producing code actions. Recent advancements like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which likewise uses code actions, hb9lc.org look interesting.