Local agentic workflows are getting crazy good

New open models provide amazing value for local inference

By Toni Sagristà Sellés

4 minute read

Building on the previous post about how good current open TTS models are, this post focuses on local LLMs and their explosion in quality in recent months. I have been using OpenRouter with, for the most part, DeepSeek v4 Flash. It is a very capable model, dirt-cheap (it seems to be heavily subsidized, at $0.14/M input tokens and $0.28/M output tokens), and it works very well when it is allowed to browse the internet. I have also been tinkering with local models like the Qwen3.5 family, Gemma 3/4, Mistral, and more. They are good, but I hadn't been able to use the versions that run reasonably on consumer hardware for anything too productive. That is, until now. The brand new Qwen3.6 models, particularly the 27B dense and the 35B A3B MoE, have really impressed the community. They are trivial to run on common setups (e.g. a 16 GB RTX card and 32-64 GB of RAM), and when boosted with agentic capabilities, they perform very well.
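As an aside, OpenRouter exposes an OpenAI-compatible endpoint, so wiring any of these hosted models into a script takes only a few lines of Python. A minimal sketch; the model slug below is a placeholder, check OpenRouter's model list for the exact id:

```python
# Minimal OpenRouter call via its OpenAI-compatible API.
# The model slug is a placeholder; look up the exact id on openrouter.ai.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter API key
)

resp = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # placeholder slug
    messages=[{"role": "user", "content": "Summarize the GGUF format in two sentences."}],
)
print(resp.choices[0].message.content)
```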

Local TTS is getting very capable and accessible

It is now possible to do high-quality TTS on local hardware very easily.

By Toni Sagristà Sellés

4 minute read

Around 2007 I spent half a year at the University of Aberdeen working on my final-year project, which involved NLP. The project consisted of a series of modules to build a graphical adventure-type game controlled by language input in the form of text. It also featured speech output generated by an early TTS system. To achieve this, we managed to partner with a group at La Salle University that was working on a TTS system for Catalan. It was a closed system accessible via a web API, but that option was far too slow for real-time use. I ended up preprocessing the audio of all dialogs in the project into WAV files just to be able to play them in sync with mouth movements. At the time, I was amazed that a computer could so easily convert text to understandable audio. The voice was very plain and robotic, with no emotion whatsoever. The results were hit or miss, but it worked.

Fast forward to today, TTS systems are everywhere. Several groups have released low-parameter TTS models that run very well on consumer hardware. I have been using the lightweight Kitten TTS for a while with fantastic results.
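For reference, getting audio out of it takes only a few lines. A minimal sketch assuming the kittentts Python package; the model id, voice name, and 24 kHz sample rate follow the project's README and may differ between releases:

```python
# Minimal KittenTTS example; model id, voice name, and sample rate
# follow the project's README and may vary between releases.
import soundfile as sf
from kittentts import KittenTTS

tts = KittenTTS("KittenML/kitten-tts-nano-0.1")
audio = tts.generate("Local text to speech is finally easy.", voice="expr-voice-2-f")
sf.write("output.wav", audio, 24000)
```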

Fine-tuning Qwen3.5 for Gaia Sky

How I fine-tuned Qwen3.5 4/9B models to become Gaia Sky experts

By Toni Sagristà Sellés

24 minute read

A little over a year ago I set up a local pipeline that used different LLMs to answer Gaia Sky questions using RAG. In that post, I built a dynamic scraper that parsed the Gaia Sky website and documentation and ingested the content into a vector database. Then, I built a minimal terminal chatbot interface that received the user prompt, queried the database for semantically similar data, and built up the context for each LLM call. The results were promising, and I found that they (obviously) depended strongly on the model used.
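The retrieval step at the heart of that pipeline is conceptually simple. Here is a hedged sketch of the query flow, using sentence-transformers for embeddings and plain cosine similarity in place of a real vector database; the docs list stands in for the scraped documentation chunks:

```python
# Sketch of the RAG query step: embed the question, rank documentation
# chunks by cosine similarity, and build the context for the LLM call.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["Gaia Sky supports loading custom datasets...",   # placeholder chunks
        "Camera modes include free, focus, and game..."]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def build_context(question: str, k: int = 3) -> str:
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec               # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]
    return "\n\n".join(docs[i] for i in top)

question = "How do I load a dataset?"
prompt = f"Context:\n{build_context(question)}\n\nQuestion: {question}"
```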

Fast forward a few months, and the Qwen 3.5 models were released by Alibaba. The general consensus is that they are quite good for their size. I’ve been testing them for local inference with a similar impression. I thought that it would be interesting to repeat the exercise of creating a Gaia Sky AI assistant, but using a radically different approach: Instead of RAG, I would fine-tune the model itself. In this post, I describe this fine-tuning project, from the creation and engineering of the training dataset to the fine-tuning and production of the final GGUF models.
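The full write-up covers the details, but the shape of the fine-tuning step looks roughly like this sketch using Hugging Face TRL; the base model id and dataset path are placeholders, not the exact setup from the post:

```python
# Rough shape of the supervised fine-tuning step with Hugging Face TRL.
# Model id and dataset path are placeholders, not the post's exact setup.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Q&A pairs about Gaia Sky, in a chat format SFTTrainer understands.
dataset = load_dataset("json", data_files="gaia_sky_qa.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder base model
    train_dataset=dataset,
    args=SFTConfig(output_dir="gaia-sky-sft", num_train_epochs=3),
)
trainer.train()
```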

7Artisans EF-FX adapter review

Second attempt at adapting Canon EF lenses to Fuji X, this time with much better results

By Toni Sagristà Sellés

5 minute read

A while back I reviewed the Viltrox EF-FX1 adapter, my first attempt at bridging my old Canon EF and EF-S lenses over to my Fujifilm X-S10. The abridged version of that experience: it was frustrating. Firmware roulette, random errors, camera freezes requiring battery pulls, and autofocus performance that varied wildly between versions. I kept it because it was just good enough, but it never stopped feeling like a workaround rather than a solution. Moreover, Viltrox never released any further firmware versions, so the 2.29 blob tested in that post really is the last firmware.

The 7Artisans EF-FX lens adapter: a good and cheap solution for using your old EF/EF-S lenses on a Fuji X-mount camera.

Fast forward to recently, and I decided to give the whole thing another shot, this time with the 7Artisans EF-FX adapter. Same concept: an electronic adapter with autofocus support, aperture control, and EXIF data transmission, and at 119€ it sits right in the same price bracket as the Viltrox. But is it actually better? Spoiler: yes. Let me walk you through it.

GGUF quantization guide

A quick guide to understanding modern and legacy quantization methods in LLMs for local inference with GGUF/llama.cpp

By Toni Sagristà Sellés

5 minute read

I like running my own LLMs locally. Open models are becoming more and more powerful, with exciting releases like the latest Qwen 3.5 family scoring highly in benchmarks even in their smaller variants. This makes managing and running your own models more viable, as it becomes increasingly easy to repurpose old hardware for local inference with progressively better results. For local users and modest purposes, the GGUF format introduced by llama.cpp is the de facto default.

Since local inference is typically heavily restricted by the available hardware, several optimization techniques have been implemented to make models leaner and faster. Perhaps the most important of these is quantization, which trims down the bit count per parameter to achieve lower memory usage and (sometimes) faster inference. The challenge is that there are many different formats and strategies for quantization. In this post, I summarize them, providing a bird's-eye view of the available techniques, their strengths, and their weaknesses.
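To make the memory savings concrete, here is a back-of-the-envelope file-size estimate per quantization level; the bits-per-weight figures are approximate llama.cpp values and vary slightly between versions:

```python
# Back-of-the-envelope GGUF size estimate: parameters x bits-per-weight.
# BPW values are approximate llama.cpp figures; they vary between versions.
BPW = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85, "Q3_K_M": 3.9}

def gguf_size_gib(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * BPW[quant] / 8 / 1024**3

for q in BPW:
    print(f"27B @ {q}: {gguf_size_gib(27, q):5.1f} GiB")
```

Under these assumptions, a 27B model drops from roughly 50 GiB at F16 to about 15 GiB at Q4_K_M, which is what makes a 16 GB card plus some system RAM viable.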

LM Studio on systemd linger

How I set up an old laptop as a persistent inference machine using LM Studio, system-level services, and systemd lingering.

By Toni Sagristà Sellés

3 minute read

The release of LM Studio 0.4.5 introduces a much-needed feature that makes this local LLM suite much more attractive compared to other similar projects. LM Link allows you to connect multiple LM Studio instances across your network to share models and perform inference seamlessly.
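The other half of the setup is systemd lingering, which keeps user services alive without an active login session. A sketch of the kind of user unit involved; the ExecStart line assumes LM Studio's lms CLI and its headless server mode, so check your install for the actual binary path and flags:

```
# ~/.config/systemd/user/lmstudio-server.service
# Sketch only: ExecStart assumes the `lms` CLI and its headless mode.
[Unit]
Description=LM Studio headless inference server
After=network-online.target

[Service]
ExecStart=%h/.lmstudio/bin/lms server start
Restart=on-failure

[Install]
WantedBy=default.target
```

Enable it with `systemctl --user enable --now lmstudio-server.service`, then run `loginctl enable-linger $USER` so the service keeps running after logout and comes back on reboot.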
