Category: AI
Fine-tuning Qwen3.5 for Gaia Sky
How I fine-tuned Qwen3.5 4/9B models to become Gaia Sky experts
A little over a year ago I set up a local pipeline to use different LLMs to respond to Gaia Sky questions using RAG. In that post, I built a dynamic scrapper that parsed the Gaia Sky website and documentation and ingested the content it into a vector database. Then, I built a minimal terminal chatbot interface that received the user prompt, queried the database for semantically similar data, and built up the context for each LLM call. The results were promising, and I found that they (obviously) strongly depended on the model used.
Fast forward a few months, and the Qwen 3.5 models were released by Alibaba. The general consensus is that they are quite good for their size. I’ve been testing them for local inference with a similar impression. I thought that it would be interesting to repeat the exercise of creating a Gaia Sky AI assistant, but using a radically different approach: Instead of RAG, I would fine-tune the model itself. In this post, I describe this fine-tuning project, from the creation and engineering of the training dataset to the fine-tuning and production of the final GGUF models.
GGUF quantization guide
A quick guide to understanding modern and legacy quantization methods in LLMs for local inference with GGUF/llama.cpp
I like running my own LLMs locally. Open models are becoming more and more powerful, with exciting releases like the latest Qwen 3.5 family scoring highly in benchmarks even in their smaller variants. This makes managing and running your own models more viable, as it becomes increasingly easy to repurpose old hardware for local inference with progressively better results. For local users and modest purposes, the GGUF format introduced by llama.cpp is the de-facto default.
Since local inference is typically heavily restricted by the available hardware, several optimization techniques have been implemented to make the models leaner and faster. Perhaps the most important of these is quantization, which trims down the bit count per parameter to achieve lower memory usage and (sometimes) faster inference1. The challenge is that there are many different formats and strategies for quantization. In this post, I summarize them, providing a bird’s-eye view on the available techniques, their strengths, and their weaknesses.
LM Studio on systemd linger
How I set up an old laptop as a persistent inference machine using LM Studio, system-level services, and systemd lingering.
The release of LM Studio 0.4.5 has introduced a much needed feature in this local LLM suite that has it much more attractive with respect to other similar projects. LM Link allows you to connect multiple LM Studio instances across your network to share models and perform inference seamlessly.
Building a local AI assistant with user context
We use Ollama and Chroma DB to build a personalized assistant from scraped web content
In my last post, I explored the concept of Retrieval-Augmented Generation (RAG) to enable a locally running generative AI model to access and incorporate new information. To achieve this, I used hardcoded documents as context, which were then embedded as vectors and persisted into Chroma DB. These vectors are used during inference to use as context for a local LLM chatbot. But using a few hardcoded sentences is hardly elegant or particularly exciting. It’s alright for educational purposes, but that’s as far as it goes. However, if we need to build a minimally useful system, we need to be more sophisticated than this. In this new post, I set out to create a local Gaia Sky assistant by using the Gaia Sky documentation site and the Gaia Sky homepage as supplementary information, and leveraging Ollama to generate context-aware responses. So, let’s dive into the topic and explain how it all works.
The source code used in this post is available in this repository.
Local LLM with Retrieval-Augmented Generation
Let’s build a simple RAG application using a local LLM through Ollama.
Edit (2025-03-26): Added some words about next steps in conclusion.
Edit (2025-03-25): I re-ran the example with a clean database and the results are better. I also cleaned up the code a bit.
Over the past few months I have been running local LLMs on my computer with various results, ranging from ‘unusable’ to ‘pretty good’. Local LLMs are becoming more powerful, but they don’t inherently “know” everything. They’re trained on massive datasets, but those are typically static. To make LLMs truly useful for specific tasks, you often need to augment them with your own data–data that’s constantly changing, specific to your domain, or not included in the LLM’s original training. The technique known as RAG aims to bridge this problem by embedding context information into a vector database that is later used to provide context to the LLM, so that it can expand its knowledge beyond the original training dataset. In this short article, we’ll see how to build a very primitive local AI chatbot powered by Ollama with RAG capabilities.
The source code used in this post is available here.