Local agentic workflows are getting crazy good

New open models provide amazing value for local inference

By Toni Sagrista Selles

4 minute read

Building on the previous post about how good current open TTS models are, this post focuses on local LLMs and their explosion in quality in recent months. I have been using OpenRouter with, for the most part, DeepSeek V4 Flash. It is a very capable model, dirt-cheap (it seems to be heavily subsidized, at $0.14/M input tokens and $0.28/M output tokens), and it works very well when it is allowed to browse the internet. I have also been tinkering with local models like the Qwen3.5 family, Gemma 3/4, Mistral, and more. They are good, but I hadn't been able to use the versions that run reasonably on consumer hardware for anything too productive. That is, until now. The brand new Qwen3.6 models, particularly the 27B dense and the 35B A3B MoE, have really impressed the community. They are trivial to run on common setups (e.g. a 16GB RTX card and ~32-64GB of RAM), and when paired with agentic capabilities, they perform very well.
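
For reference, this is roughly what talking to a remote model through OpenRouter looks like in code. OpenRouter exposes an OpenAI-compatible API, so the standard `openai` Python client works as-is; the model slug below is my guess at the identifier, so check the OpenRouter model list before using it:

```python
# Minimal sketch: querying DeepSeek V4 Flash through OpenRouter's
# OpenAI-compatible endpoint. The model slug is an assumption; see
# https://openrouter.ai/models for the exact identifier.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek/deepseek-v4-flash",  # hypothetical slug
    messages=[
        {"role": "user", "content": "Give me a bird's-eye view of RAG pipelines."},
    ],
)
print(response.choices[0].message.content)
```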

My use case is quite modest. I typically use LLMs for research: I explain the problem, propose a couple of solutions, and ask for feedback. When I'm new to a topic, I like to get a bird's-eye-view summary first, and then ask for further explanation and deeper insight into the areas that interest me. If possible, I add sources that may help the model gain more knowledge on the subject. Sometimes I ask it to proofread my writing, but I restrict it to fixing grammatical and spelling errors, with the occasional wording suggestion. I find the act of writing valuable in itself, as it helps me structure my thoughts and organize my ideas. That's why I frown upon AI-generated blog posts. Additionally, I don't like the overly verbose, rigidly structured text these models tend to generate. I don't do much AI-assisted coding either; at most, I ask for the odd Python script to solve a specific task that would otherwise get tedious.

My setup looks like this: for casual questions and general chatting, I use the OpenRouter chat for remote models and LM Studio for local models. Usually, I run the models on a gaming laptop (RTX 5080 16GB, 32GB RAM) and serve them to my main machine via LM Link. For research and more complicated workflows, I use the Hermes Agent. It is open source (MIT), fully featured, self-improving (it learns between sessions!), and offers everything I need with minimal configuration. It works very, very well. I connect it to either DeepSeek V4 Flash via OpenRouter, or to my local LM Studio instance, which is tethered to the inference machine via LM Link.
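
The nice part of this arrangement is that both backends speak the same OpenAI-compatible API, so switching between remote and local is essentially a one-line change. A minimal sketch, assuming LM Studio's server is reachable at its default port (1234) on the laptop; the hostname and model identifier are placeholders for my setup:

```python
# Same client as before, pointed at the LM Studio server on the laptop.
# LM Studio serves an OpenAI-compatible API (default port 1234); the
# hostname and model identifier below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://gaming-laptop.local:1234/v1",  # assumed LAN address
    api_key="lm-studio",  # the local server does not check the key
)

response = client.chat.completions.create(
    model="qwen3.6-35b-a3b",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize this paper for me."}],
)
print(response.choices[0].message.content)
```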

The Qwen3.6 27B and 35B A3B models, released in April 2026, have made waves in the community. They are extremely capable for their small size. They don't have the breadth of knowledge of the near-trillion-parameter frontier models, but when paired with the right toolset, they get very close. They don't fit entirely in most consumer GPUs, but their layers can be split between GPU and system RAM. The MoE model in particular, with only 3B parameters active during inference, is very fast: I get ~35-40 tok/s with the 35B A3B MoE, and about 4.5 tok/s with the 27B dense model. At that rate the dense model is borderline unusable, so I almost always rely on the MoE. I offload 19 layers to the GPU (~12.5 GB out of a total of 26.63 GB) and use the maximum context size of ~262K tokens. This is plenty for agent sessions and works very well. The model is able to plan and execute actions very efficiently. It tends to overthink a lot, but that is an acceptable trade-off given the output quality. On OpenRouter, this exact same model is more expensive than DeepSeek V4 Flash, at $0.15 per million input tokens and $1.00 per million output tokens. Being able to run such a model locally feels like a superpower.
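
To put those numbers in perspective, here is a quick back-of-the-envelope script. The model size, offload size, throughput, and prices come straight from the figures above; the session token counts are made up purely for illustration:

```python
# Back-of-the-envelope numbers for the 35B A3B MoE setup described above.
# Model size, offload size, prices, and tok/s come from the post; the
# session token counts are an assumed example.

model_size_gb = 26.63    # total model weights
gpu_offload_gb = 12.50   # 19 layers offloaded to the 16GB RTX card
print(f"Fraction of weights on GPU: {gpu_offload_gb / model_size_gb:.0%}")  # ~47%

# What the same model would cost on OpenRouter for a hypothetical agent session.
input_tokens, output_tokens = 400_000, 60_000
price_in, price_out = 0.15, 1.00  # $ per million tokens
cost = input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
print(f"API cost for this session: ${cost:.2f}")  # $0.12

# Rough wall-clock time to generate those output tokens locally at ~37 tok/s.
print(f"Local generation time: {output_tokens / 37 / 60:.0f} min")  # ~27 min
```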

So, the question is now obvious: will local LLMs get good enough to replace frontier models? I'd argue that it depends on the task. Frontier models have a vast amount of knowledge and can one-shot many complex problems successfully. They will remain better suited for bleeding-edge research for years to come. But they are also very expensive, very taxing on the environment, and a primary reason for the current sky-high hardware prices. There is certainly a case to be made for local, more economical small models. For day-to-day agentic workflows, I think open models are very close already. I would bet that within the next two years they will be on par for most everyday tasks. Is this prediction grounded in reality, or just wishful thinking? I'd like to think the former. In any case, I'll be here, watching and tinkering.
