Qwen3-235B-A22B-Instruct-2507-GGUF and llama.cpp

(The English version of this article is machine translated; please forgive any mistakes.) The llama.cpp project has introduced formats such as GGML, GGUF, and MMPROJ, making large language models far more convenient for end users. The purpose of this article is to examine the llama.cpp project through Qwen3-235B-A22B-Instruct-2507-GGUF (Unsloth, UD-Q3_K_XL), from a practical, user-focused perspective.

Introduction

Today, running large language models on our personal computers (especially on machines with unified memory, such as Apple Silicon) has become commonplace. Many 8-16 GB ARM devices (tablets, laptops, and so on) can comfortably handle 8B models. llama.cpp has made a significant contribution to running large language models on our own computers.

What is llama.cpp?

Given how widespread they are, llama.cpp and the GGUF format may not need an introduction; most people probably already have them on their computers in the form of ollama and/or LM Studio. Still, to summarize the project very briefly and roughly: llama.cpp lets us run large language models stored in a file format called GGUF on the CPU and system RAM and/or on NVIDIA and other graphics cards, with a strong focus on Apple Silicon and no requirement for an NVIDIA graphics card.

llama.cpp is designed both to be run on its own and to serve as a base for building other software. Because using it directly requires some technical knowledge, friendlier wrappers such as ollama and LM Studio have emerged. Even so, llama.cpp itself is easy to use via its llama-server program.

llama.cpp does nothing by itself: it needs a model file, plus an mmproj file for multimodal models. Since this article focuses on Qwen3-235B-A22B-Instruct-2507, which does not support multimodality, multimodal use is not covered here.

What is Qwen3?

Qwen3 is an open-weight (not open-source) large language model series released a few months ago (April 2025), in which reasoning (in the style of QwQ) is enabled by default and has to be turned off manually (it can be disabled by adding "/no_think" to the prompt). Many variants from 0.6B to 235B have been released, and the 235B variant was updated and re-released in July (with reasoning removed). The model this article focuses on is that updated release, Qwen3-235B-A22B-Instruct-2507-GGUF.

The Qwen3 models have received considerable attention from the open-weight community, overshadowing the Llama 4 series (Scout and Maverick) released around the same time. Because the model is well liked and widely used, and gave good results in my own tests, it was taken as the basis for this article.

Requirements

First of all, llama.cpp itself is not picky about hardware or computing power; it runs on most hardware available today.

Qwen3 is an open-weight model: as the name suggests, the weights (tensors) are stored in the model files. When the model is not in GGUF format (plain, or as with MLX), this is easy to see as a pile of safetensors files. And because the model has weight, it also has requirements: the computer must be able to carry that weight.

For the best results there is normally a more involved formula: it determines how the weights should be split between CPU and GPU so that the model runs at the best performance it can reach. I lost my own copy of that formula and could not find it again while writing this article, so I will fall back on a cruder rule of thumb: "the model runs if the GGUF file size fits into VRAM + RAM." That is not far from the truth, although it will not give full performance.

Let's take the 235B model. For Q3_K_XL, the file size alone tells us it needs roughly 104 GB of RAM at a context length (the number of "tokens" the model can keep in mind; a token is not exactly a syllable, but thinking of it as one makes it easier to grasp) of around 2048-4096, which used to be the standard but is barely usable today. If we want a longer context length, RAM usage increases.
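
As a very rough sketch of where that extra memory goes: the context lives in the KV cache, and its size scales linearly with the context length. The layer and head counts below are quoted from memory for Qwen3-235B-A22B, so verify them against the model card before relying on them.

"""
KV cache per token ≈ 2 (K and V) × 94 layers × 4 KV heads × 128 head dim × 2 bytes (f16)
                   ≈ 190 KB per token
At a  4,096-token context: ≈ 0.8 GB on top of the ~104 GB of weights
At a 32,768-token context: ≈ 6 GB on top of the ~104 GB of weights
"""

The weights themselves do not grow; only this context-dependent part does.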

Problem

By this reasoning, a 235B model does run on a computer with 104 GB of RAM, but we hit a problem:

The model runs, but it runs very slowly; think of one token every 2, 3, or even 4 seconds. With GGUF and llama.cpp, graphics card acceleration matters: having no GPU acceleration, or offloading only a very small number of layers to the GPU, becomes more and more of a problem as the model grows.

Since most users expect fluent, natural conversation, this problem makes heavy models effectively unusable on ordinary consumer computers, because no ordinary user outside the field is going to buy server or workstation hardware unless they have very serious privacy concerns.

Qwen3-235B-A22B-Instruct-2507-GGUF

The most important reason I chose Qwen3-235B-A22B-Instruct-2507-GGUF for this article is that it minimizes these problems and is perhaps the most comfortable model of its size to use on consumer hardware (at least for advanced users).

With a mid-range graphics card (an RTX 3080, say) and a consumer motherboard with its RAM slots filled (128 GB), it can be used quite comfortably; in my tests I got between 4 and 10 tokens per second.

"""
prompt eval time = 8309.64 ms / 68 tokens ( 122.20 ms per token, 8.18 tokens per second)
eval time = 2775.31 ms / 14 tokens ( 198.24 ms per token, 5.04 tokens per second)
"""

So how is Qwen3-235B-A22B-Instruct-2507-GGUF different from the other Qwen3 models, and why did I prefer it on a computer where most models (especially once they creep past roughly 10B) run poorly?

One reason I prefer a large MoE model like Qwen3-235B-A22B-Instruct-2507-GGUF over other very large models such as Llama 4 Maverick, or over smaller models such as Qwen3 32B (another model in the series) and Llama 4 Scout, is that it was updated very recently (the day before this was written) and has had the reasoning process removed entirely. Reasoning can sometimes be useful, but it can also be harmful and waste a great deal of time; the non-updated Qwen3 got stuck in its reasoning on a difficult query that has become a traditional test of mine. Most importantly, thanks to the high parameter count, its knowledge is more complex, more interconnected, and richer and more capable.

Solution

Qwen3-235B-A22B-Instruct-2507-GGUF offers us a solution called MoE. The MoE (Mixture of Experts) technique first made a big splash with the Mixtral model, was then forgotten for a while, and then made an even bigger splash with DeepSeek.

Because a Mixture of Experts model is not monolithic and only the necessary experts (think of them as small specialist sub-models) are activated at any moment, it feels lighter and runs faster and more comfortably, like anything in real life where the density is low.

To run a MoE model we still need enough memory for all of its weights, but per token only the active experts do any work (here, 22B of the 235B parameters), which means far less computation per token and far more speed.

Our other, and most important, advantage is an argument: -ot ".ffn_.*_exps.=CPU". It makes usage even more comfortable and faster by sending the bulky, "unnecessary" expert tensors to the CPU and keeping everything the GPU really needs on the graphics card.
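
To make the regex less magical: inside the GGUF, the MoE expert tensors follow a naming pattern roughly like the one below (illustrative names; inspect your own file with a GGUF viewer to confirm), and that is exactly what ".ffn_.*_exps." matches.

"""
blk.0.ffn_gate_exps.weight   <- matched, stays in system RAM
blk.0.ffn_up_exps.weight     <- matched, stays in system RAM
blk.0.ffn_down_exps.weight   <- matched, stays in system RAM
blk.0.attn_q.weight          <- not matched, offloaded to the GPU by -ngl 99
"""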

Installation and Command

Most users in Turkey use Windows, so I will explain focusing on Windows. macOS users can also easily install it with just brew, while Linux users are expected to already know the installation.

Download llama.cpp from its GitHub releases page, and download a GGUF of the desired MoE large language model, in a size suitable for your computer, from a known source on Hugging Face. unsloth and bartowski are the best known, and TheBloke is still worth checking for older models; unsloth focuses on base/official models, while bartowski also covers many fine-tuned models (which, ironically, are often made with Unsloth's own libraries).
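
As an aside, recent llama.cpp builds can also fetch a GGUF straight from Hugging Face with the -hf flag. A sketch, assuming the unsloth repository name and that the quant tag after the colon matches the file naming used in that repository:

llama-server.exe -hf unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF:UD-Q3_K_XL -ngl 99 -ot ".ffn_.*_exps.=CPU"

Downloading the files manually, as described above, works just as well.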

Let's assume the files are all in the same folder; in my case, Q3_K_XL came as 3 GGUF files. Unless we are going to use ollama there is no need to merge them; we load only the first file with the following command:

llama-server.exe -m "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf" -ngl 99 -ot ".ffn_.*_exps.=CPU"

Here:

-m "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf" -> The first file of the model is specified. It will also find other models in the same folder.
-ngl 99 -> All layers are loaded to the graphics card
-ot ".ffn_.*_exps.=CPU" -> Offload Tensors command. Everything "unnecessary" inactive is loaded to CPU.

We wait for llama-server to finish loading. Once it is up, we open 127.0.0.1:8080 in a web browser (the local loopback address, port 8080) and can use the built-in interface right away.
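
The same server also exposes an OpenAI-compatible HTTP API, so other tools can talk to it. A minimal sketch with curl (Unix-style quoting shown; on Windows cmd the JSON quotes need escaping):

"""
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "user", "content": "Hello, briefly introduce yourself."}
  ]
}'
"""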

We should not expect spectacular speed, but the result is noticeably faster and more usable than before.

My Own Opinion and Recommendation

If possible, offload as many model layers to the graphics card as you can (for speed), and if possible try not to go below 8B parameters, nor below Q4_K_M regardless of model size. As quantization gets more aggressive, quality and accuracy drop. Q4_K_M stores weights at roughly 4-bit precision, while unquantized ("normal") models typically use between 8 and 32 bits per weight (higher is better: more quality, precision, and diversity).

As a general preference, a high-parameter model (235B) even at low quantization (Q3) tends to give better results than a much lower-parameter model (32B) at high quantization (Q6-Q8), so that is what I prefer.

For low quantizations (anything below Q4_K_M), Unsloth's quantizations are worth preferring because they are not uniform, single-precision casts (in UD-Q4_K_XL, for example, not every layer is quantized to 4 bits). That is why I chose Qwen3-235B-A22B-Instruct-2507-GGUF in UD-Q3_K_XL.