AI-related learning notes

Things I learned as I did more AI-related work

How to Read Model Names

  1. Developer / Model Name (Gemma 4)

  2. Total Parameters (26B)

  3. MoE Structure (A4B, Active 4B)

  4. Training Type (Instruct)

  5. Format (GGUF)

  • GGUF: The most widely used format; supports Windows/Linux and is strong at CPU inference

  • MLX: Format for Apple Silicon

  • Safetensors: The format in which the original (unquantized) model weights are typically distributed

  • TensorRT: NVIDIA GPU, performance is best when the entire model is loaded onto the GPU without offloading

  6. Quantization Method (UD-Q4_K_XL)
  • UD: Unsloth Dynamic (Unsloth's own quantization method), which applies a dynamic, per-tensor quantization policy

  • Measures the quality degradation caused by quantization using actual input data

  • Heuristically decides how to quantize each tensor based on those measurements

  • Q4: 4-bit (Importance-based quantization methods like IQ and AWQ also exist)

  • FP32 (Floating Point 32): Standard float data type, uses 32 bits (sign 1 / exponent 8 / mantissa 23 bits)

  • FP16 (Floating Point 16): Reduced-precision float data type, uses 16 bits (sign 1 / exponent 5 / mantissa 10 bits)

  • BF16 (Brain Floating Point 16): Reduced-precision float data type, uses 16 bits (sign 1 / exponent 8 / mantissa 7 bits), so it keeps the FP32 range at lower precision

  • INT2 ~ INT8: Reduced-precision integer data types

  • K: K-Quant family

  • XL: Profile

  • Ranges from S to XL depending on how much tensor precision is preserved

  • Some important tensors are kept at higher precision (e.g., BF16/FP16); the profile indicates how many of them are kept that way.
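As a rough illustration of the naming scheme above, the sketch below splits a name into its parts and estimates the nominal weight size. The name string, the regular expressions, and the helper names `parse_model_name` and `nominal_weight_gb` are all assumptions made for this example; real repositories order and spell these parts differently, so treat it as a sketch rather than a general parser.

```python
import re

# Hypothetical layout: model-totalB-AactiveB-training-format-quant.
# Real model names vary, so this pattern is only an illustration.
NAME_PATTERN = re.compile(
    r"^(?P<model>.+?)-"                    # developer / model name, e.g. Gemma-4
    r"(?P<total>\d+(?:\.\d+)?)B-"          # total parameters, in billions
    r"(?:A(?P<active>\d+(?:\.\d+)?)B-)?"   # optional MoE active parameters
    r"(?P<training>[A-Za-z]+)-"            # training type, e.g. Instruct
    r"(?P<format>[A-Za-z]+)-"              # file format, e.g. GGUF
    r"(?P<quant>.+)$"                      # quantization tag, e.g. UD-Q4_K_XL
)

# Assumed shape of the quantization tag: [UD-]Q<bits>[_K][_<profile>]
QUANT_PATTERN = re.compile(
    r"^(?:(?P<prefix>UD)-)?(?P<kind>I?Q)(?P<bits>\d)"
    r"(?:_(?P<family>K))?(?:_(?P<profile>XS|S|M|L|XL))?$"
)


def parse_model_name(name: str) -> dict:
    """Split a model name into the six parts described in the notes."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    info = match.groupdict()
    info["total"] = float(info["total"])
    info["active"] = float(info["active"]) if info["active"] else None
    quant = QUANT_PATTERN.match(info["quant"])
    info["bits"] = int(quant.group("bits")) if quant else None
    info["profile"] = quant.group("profile") if quant else None
    return info


def nominal_weight_gb(total_billion: float, bits: int) -> float:
    """Nominal size in GB: parameters (in billions) * bits per weight / 8."""
    return total_billion * bits / 8


info = parse_model_name("Gemma-4-26B-A4B-Instruct-GGUF-UD-Q4_K_XL")
print(info["model"], info["total"], info["active"], info["bits"], info["profile"])
print(nominal_weight_gb(info["total"], info["bits"]))  # 13.0
```

The 13 GB figure is only a lower bound. Actual files are larger because quantized formats also store scaling data and keep some tensors at higher precision.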

Glossary of Fine-Tuning Terms

  1. PEFT (Parameter-Efficient Fine-Tuning)
  • LoRA (Low-Rank Adaptation)

  • The academic basis is Low-Rank Approximation (representing one large matrix as the product of two smaller matrices)

  • SVD is a representative example.

  • LLM weight matrices are inherently very large.

  • The fine-tuned model is treated as W (existing weights) + ΔW, where ΔW is the update learned during tuning.

  • Traditionally, fine-tuning trained the entire ΔW, a full-size M x N matrix.

  • Decomposing ΔW = A x B (A is M x R, B is R x N, with R much smaller than M and N) reduces the number of trainable parameters from O(MN) to O(R(M + N)).

  • QLoRA (Quantized LoRA)

  • A technique for tuning large models by quantizing the base model itself to 4-bit (or similar) and training LoRA adapters on top of it.

  • DoRA (Weight-Decomposed Low-Rank Adaptation)

  • When ΔW = A x B is trained as in LoRA, the magnitude and direction of the weights are updated together, mixed in one term.

  • Therefore, magnitude is separated and managed as a separate parameter.
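To make the O(MN) versus O(R(M + N)) comparison from the LoRA notes concrete, here is a minimal pure-Python sketch. The sizes (4096 x 4096, R = 16) and the helper names are illustrative assumptions, not values from any particular model or library.

```python
def full_update_params(m: int, n: int) -> int:
    """Trainable parameters when the whole M x N update is learned."""
    return m * n


def lora_params(m: int, n: int, r: int) -> int:
    """Trainable parameters for A (M x R) plus B (R x N)."""
    return r * (m + n)


def matmul(a, b):
    """Plain nested-list matrix product, enough for a small demo."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def apply_update(w, a, b):
    """Return W + A x B without modifying W."""
    delta = matmul(a, b)
    return [
        [w_val + d_val for w_val, d_val in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Illustrative sizes: one 4096 x 4096 weight matrix, rank 16.
print(full_update_params(4096, 4096))  # 16777216
print(lora_params(4096, 4096, 16))     # 131072

# Tiny example with M = N = 3 and R = 1.
w = [[0.0, 0.0, 0.0] for _ in range(3)]
a = [[1.0], [2.0], [3.0]]   # 3 x 1
b = [[1.0, 0.0, 1.0]]       # 1 x 3
print(apply_update(w, a, b))
```

With these assumed sizes the low-rank form trains 128 times fewer parameters for that one matrix, at the cost of restricting ΔW to rank at most R.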

  2. Alignment
  • RLHF (Reinforcement Learning from Human Feedback)

  • Humans rank or grade the model's responses, and a reward model is built from those judgments

  • Train the model through reinforcement learning to answer in a way that maximizes rewards

  • DPO (Direct Preference Optimization)

  • Skips building a separate reward model and optimizes the model directly on the preference dataset
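As a sketch of what "optimizing directly on preferences" looks like, the function below computes the DPO loss for a single preference pair from log-probabilities. The notes do not mention the frozen reference model or the strength parameter `beta`; both come from the standard DPO formulation. The default `beta = 0.1` and the input numbers are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not a recommendation
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a response under
    the model being trained (policy) or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits)
    return softplus(-logits)


# Policy identical to the reference: loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy prefers the chosen response more than the reference does:
# the loss drops below log(2).
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

No reward model appears anywhere in this computation. The preference signal enters only through which response is labeled chosen and which rejected.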

Explanation of Option Values

  • GPU Offload: Determines how many Transformer Layers to load onto the GPU

  • CPU Thread Pool Size: Number of CPU threads to use

  • Evaluation Batch Size: Determines how many prompt tokens are processed at once

  • Larger values increase VRAM usage and speed up prompt ingestion

  • Max Concurrent Predictions: Number of responses that can be generated simultaneously; 1 should be sufficient in LM Studio.

  • RoPE Frequency Base / Scale: Values adjusted when you want to handle contexts longer than the context length the model was trained on; default is Auto.

  • Offload KV Cache to GPU Memory: Determines whether the KV cache (the cached attention keys and values for the conversation so far) is kept in GPU memory.

  • Keep Model in Memory: Whether to keep the model in memory without unloading it.

  • Try mmap(): Load the model using memory mapping; effective for improving loading speed and efficiency when RAM is sufficient.

  • Number of Experts: The number of experts to activate in the MoE (Mixture of Experts) model.

  • Number of layers for which to force MoE weights onto CPU: A setting that forces the MoE weights of some layers onto the CPU.

  • Flash Attention

  • Computes attention by splitting the matrices into tiles that fit in the GPU's fast on-chip SRAM.

  • Saves memory and can dramatically increase speed because the full attention matrix is never stored in memory.

  • KV Cache Quantization

  • Memory saving through KV cache quantization; default is usually FP16.
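The tiling idea in the Flash Attention notes above can be sketched in pure Python for a single query vector. This is only an illustration of the running-softmax bookkeeping, under the assumption of one query and small lists. The real technique also tiles over queries and runs as a fused GPU kernel. The function names and `tile_size` are made up for this example.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: computes every score before the softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, tile_size=2):
    """Processes keys/values tile by tile, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum
        # (exp(-inf) is 0.0, so the first tile starts from zero).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -0.3], [0.5, 0.5], [-0.4, 0.9], [0.0, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.3, 0.7]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile_size=2))
```

Both functions should print the same vector up to floating-point rounding. The tiled version never holds more than `tile_size` scores at once, which is why the full attention matrix does not need to be stored.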

TODO

  • Transformer Architecture Analysis

  • Comparison with RNN

  • Self Attention?
