Large Language Models in GGUF Format and Their Implementation

This article explores GGUF format Large Language Models through a presentation I prepared for my father on October 7, 2024, aimed at introducing these technologies to newcomers.

Large Language Models (LLMs) and Their Implementation

A presentation on Large Language Models. It briefly covers LLMs, provides information about the GGUF format, and shows two usage examples. Note: the video content is in Turkish.

Introduction

In this presentation on GGUF, my goal was to demonstrate how any Large Language Model can be used effectively and knowledgeably. I originally prepared it for a specific audience, and it wasn't much help to them at the time; I hope it will be useful to you today.

Video Purpose

This video is designed for complete beginners or those new to the field. The main question addressed is "How can I think about using Large Language Models for my purposes?" The goal is to serve as an introductory resource for newcomers and help them gain familiarity with this technology.

If you don't practice, research, or experiment with the technology while watching or after watching, the video won't give you much beyond basic familiarity. It aims to explain not just how to apply the technology, but also, at a surface level, the technology itself.

Prerequisites

Creating an AI-powered application with Large Language Models is not difficult. You don't even need to know programming, but you do need to understand programming principles.

English knowledge is helpful but not required. Intermediate or advanced computer literacy is beneficial. You'll need a server machine (I connected remotely to my Mac Studio with an M2 Ultra and 64GB RAM). For this presentation, my server had oobabooga's text-generation-webui and Miniconda installed.

Large Language Models: What They Are and Aren't

Large Language Models (LLMs) have a relatively simple and easy-to-understand architecture. They're not new technology: their modern form emerged in 2017 with the introduction of the Transformer architecture in the paper "Attention Is All You Need."

LLMs are a type of artificial intelligence, but AI doesn't necessarily mean Large Language Models. They're also not intelligent ChatBots, nor perfect entities that can do everything flawlessly.

There are numerous closed-source (ChatGPT) and open-source models (DialoGPT, BLOOM, Command-R-Plus, LLaMa, Mistral, Qwen, Gemma), meaning there isn't just one LLM or one type of LLM. More information can be found at https://huggingface.co/docs/transformers/index.
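
As a tiny sketch of that library in action (the model name "gpt2" is my own placeholder, chosen only because it is small and well known):

```python
from transformers import pipeline

# "gpt2" is just a small, widely available placeholder model,
# not a recommendation for real applications.
generator = pipeline("text-generation", model="gpt2")

result = generator("Large Language Models are", max_new_tokens=20)
print(result[0]["generated_text"])
```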

How to Train a Large Language Model: Pre-Training and Fine-Tuning

All training types can be done with short, ready-made code snippets and numerous libraries (LLaMa.c, Unsloth, LLaMa Factory). Until recently you could even train on a CPU with llama.cpp, but today a powerful graphics card is practically essential for every kind of training.

LLM training is divided into two parts: Pre-Training and Fine-Tuning. Pre-Training is the complete training of a model from scratch, while Fine-Tuning is training on a smaller scale. The Pre-Training phase is generally seen as both unnecessary and expensive (financially and computationally) for general use and projects, so Fine-Tuning is usually preferred.

The data used in Pre-Training differs from Fine-Tuning data. Pre-Training uses raw text to learn language and language structure, while Fine-Tuning uses formatted data. Most applications may not need either. Pre-Training is for knowledge, Fine-Tuning is for style. Knowledge and sometimes style can be injected later using techniques like RAG (Retrieval Augmented Generation).
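
To make that data difference concrete, here is an illustrative sketch; the exact fields depend on the training library you use (the instruction/input/output layout below follows the common Alpaca-style format, which is my own assumption, not something the presentation prescribes):

```python
# Pre-training data: plain, unstructured text the model learns language from.
pretraining_sample = "The Transformer architecture was introduced in 2017 ..."

# Fine-tuning data: formatted examples, here in an Alpaca-style
# instruction/input/output layout (one of several possible formats).
finetuning_sample = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "The Transformer architecture was introduced in 2017 ...",
    "output": "The Transformer, introduced in 2017, reshaped language modeling.",
}
```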

GGUF: Why Choose This Format?

The GGUF format has significant advantages over other formats. GGUF-based Large Language Models ship as a single file, which makes them easy to download and move around. GGUF files are also typically smaller than the original model weights because the tensors are quantized.

A wide range of quantization options is available (not just 8-bit and 4-bit, but Q8_0, Q4_K_M, IQ3_XS, Q6_K, and so on). GGUF models don't require a graphics card (though they run faster with one), so they work at acceptable speeds almost anywhere, with multi-platform support for Windows, macOS, and Linux.
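
To see why quantized files are so much smaller, here is a back-of-the-envelope calculation (my own arithmetic; the assumption that Q4_K_M averages roughly 4.5 bits per weight is approximate and varies by model):

```python
# Rough size estimate for an 8B-parameter model at two precisions.
params = 8e9                            # 8 billion parameters
size_fp16_gb = params * 16 / 8 / 1e9    # ~16 GB at 16 bits per weight
size_q4_gb = params * 4.5 / 8 / 1e9     # ~4.5 GB at ~4.5 bits per weight
print(f"FP16: ~{size_fp16_gb:.1f} GB, Q4_K_M: ~{size_q4_gb:.1f} GB")
```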

GGUF: Essential Knowledge

GGUF is the model file format of llama.cpp, a program originally optimized for Apple Silicon chips that also runs on other ARM chips, Intel processors, and similar environments. In this presentation, we'll use text-generation-webui as the backend.

Key terms to understand (illustrated in the sketch after this list):
- model: What we call the "AI" or "Large Language Model" entity
- n-gpu-layers: Number of layers loaded onto the graphics card
- n_ctx: Also known as "Context length" - the maximum input & output length the model can handle
- cpu: Forces use of processor only, no graphics card (slower but functional)
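
For illustration, here is how the same settings look in the llama-cpp-python bindings, which share these option names; this is my own sketch (the presentation itself sets them in the text-generation-webui Model tab), and the model path is a placeholder:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/example-model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=0,  # 0 = CPU only; raise to offload layers onto the GPU
    n_ctx=4096,      # context length: maximum combined input & output tokens
)

out = llm("Explain the GGUF format in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```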

These are experience-based essentials for any situation. Your specific needs may require different terms and applications. More information available at: https://github.com/oobabooga/text-generation-webui/wiki/04-%E2%80%90-Model-Tab

GGUF: Model Selection Strategy

How do we choose a Large Language Model for our application? We need to decide on the model itself and our purpose.

Model Types:
- Multimodal: Models with image processing capabilities (LLaMa 3.2)
- Instruct: Models ready for human-like communication out of the box
- Pretrained: Models requiring fine-tuning for specific use cases

Key Considerations:
- Context Size (n_ctx): Total input & output length capacity
- Quantization: Quality vs. speed & compatibility (roughly, Q8 = highest quality, Q2 = lowest)

For quick starts: LLaMa 3 series (3.1 and 3.2) are widely used today, appeal to broad audiences, and come in various sizes. Q4_K_M quantization is commonly preferred.
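
As a hedged sketch of downloading a single GGUF file from the Hugging Face Hub (the repository and file names below are illustrative, not part of the presentation; check the repository you actually use for its exact quantization file names):

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",  # example repository
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",    # example Q4_K_M file
)
print(path)  # local path to the downloaded .gguf file
```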

General Terminology

Essential terms for understanding LLMs, not specific to GGUF:

Parameter Count (Size): Think of it as levels of variety - water, bottled water, pink-labeled bottled water, pink-labeled bottled water with a red cap - the same base product in progressively more specific variants, just as a model family ships in different sizes.

Important Principles:
- Newer models in the LLM scene generally perform better due to advancing techniques
- The field moves very rapidly
- Newer doesn't always mean better!
- Parameter count affects model quality: a recent LLaMa 3.2 3B won't match a recent Mistral 123B in quality
- Quantization type affects model complexity and coherence
- LLMs can't "know" what wasn't in their training data; techniques like RAG (Retrieval Augmented Generation) are needed to inject new information (see the sketch after this list)
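
As a minimal sketch of the retrieval half of RAG (my own example, assuming the sentence-transformers library for embeddings; the documents and query are made up for illustration):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Our store opens at 09:00 and closes at 18:00.",
    "Refunds are accepted within 14 days with a receipt.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)
q_vec = model.encode(["When do you open?"], normalize_embeddings=True)[0]

# With normalized vectors, the dot product equals cosine similarity;
# pick the closest document and prepend it to the prompt as context.
best = docs[int(np.argmax(doc_vecs @ q_vec))]
prompt = f"Context: {best}\n\nQuestion: When do you open?\nAnswer:"
print(prompt)
```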

Technical Essentials for Implementation

Once you understand these details, you don't need to know much more for application purposes.

Key Technical Terms:
- Token: Think of it as a syllable for intuition's sake, though that's not exactly what it is (see the quick check after this list)
- max_new_tokens: The number of tokens the model may generate in a response (text-generation-webui limits this to 4096)
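
For a quick check of what tokens actually look like (assuming the GPT-2 tokenizer from transformers; other models split text differently, so counts vary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("Large Language Models")
print(len(ids))                        # token count - note: not syllables
print(tok.convert_ids_to_tokens(ids))  # e.g. ['Large', 'ĠLanguage', 'ĠModels']
```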

Advanced Parameters (for sophisticated use): temperature, top_p, top_k, typical_p, min_p, repetition_penalty, frequency_penalty, presence_penalty. I believe these aren't necessary for general use and applications, but they affect how the LLM generates text.
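
Here is a hedged sketch of a few of these parameters using llama-cpp-python (names vary slightly by backend: llama-cpp-python calls repetition_penalty "repeat_penalty"; the values below are common defaults, not recommendations, and the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="models/example-model.Q4_K_M.gguf")  # placeholder path

out = llm(
    "Write one sentence about autumn.",
    max_tokens=64,       # upper bound on generated tokens
    temperature=0.8,     # higher = more random output
    top_p=0.9,           # nucleus sampling: keep top 90% of probability mass
    top_k=40,            # consider only the 40 most likely next tokens
    min_p=0.05,          # drop tokens far less likely than the best candidate
    repeat_penalty=1.1,  # discourage repeating the same tokens
)
print(out["choices"][0]["text"])
```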

Large Language Models: Beyond ChatBots

The use of Large Language Models isn't limited to ChatBots. Given the right tools and technological developments, they can behave like humans and perform human-like tasks.

Chain of Thought is one such development that enables more sophisticated reasoning.
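
A minimal chain-of-thought style prompt might look like this (my own illustrative wording, not a fixed recipe; the model path is again a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="models/example-model.Q4_K_M.gguf")  # placeholder path

# Nudge the model to reason step by step before giving its final answer.
prompt = (
    "Q: A shop sells pens for 3 lira each. How much do 7 pens cost?\n"
    "A: Let's think step by step."
)
print(llm(prompt, max_tokens=128)["choices"][0]["text"])
```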

Task Examples:
- With conversation history: Customer service representatives
- Without conversation history: Quality control, content management and moderation (sketched after this list)
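
As a sketch of the history-free case, a stateless moderation call might look like this (the prompt wording and labels are my own illustration, and the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="models/example-model.Q4_K_M.gguf")  # placeholder path

def moderate(comment: str) -> str:
    """Classify a single comment with no conversation history."""
    prompt = (
        "Classify the following user comment as SAFE or UNSAFE.\n"
        f"Comment: {comment}\n"
        "Label:"
    )
    out = llm(prompt, max_tokens=3, temperature=0.0)  # near-deterministic
    return out["choices"][0]["text"].strip()

print(moderate("This product broke after two days."))
```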

The potential applications extend far beyond simple question-and-answer interactions, opening possibilities for complex, goal-oriented AI assistance.