AI-related learning notes
Things I learned as I did more AI-related work
How to Read Model Names
-
Developer / Model Name (Gemma 4)
-
Total Parameters (26B)
-
MoE Structure (A4B, Active 4B)
-
Training Type (Instruct)
-
Format (GGUF)
-
GGUF: The most widely used format for local inference (used by llama.cpp); supports Windows/Linux, strong at CPU inference
-
MLX: Format for Apple Silicon (Apple's MLX framework)
-
Safetensors: The format in which the original (unquantized) weights are typically distributed
-
TensorRT: For NVIDIA GPUs; performance is best when the entire model is loaded onto the GPU without offloading
- Quantization Method (UD-Q4_K_XL)
-
UD: Unsloth Dynamic (Unsloth's own quantization method); the quantization policy is chosen dynamically rather than fixed
-
Measures the quality degradation caused by quantization using actual input data
-
Heuristically decides how to quantize each tensor based on those measurements
-
Q4: 4-bit (Importance-based quantization methods like IQ and AWQ also exist)
-
FP32 (Floating Point 32): Standard single-precision float type, uses 32 bits (1 sign / 8 exponent / 23 mantissa bits)
-
FP16 (Floating Point 16): Half-precision float type, uses 16 bits (1 sign / 5 exponent / 10 mantissa bits)
-
BF16 (Brain Floating Point 16): Reduced-precision float type, uses 16 bits (1 sign / 8 exponent / 7 mantissa bits); keeps the exponent range of FP32 at lower precision
-
INT2 ~ INT8: Reduced-width integer data types (2 to 8 bits)
-
K: K-Quant family
-
XL: Profile
-
Profiles range from S to XL according to how much precision is preserved per tensor
-
Some important tensors are kept at higher precision such as BF16/FP16; the profile indicates how many of them are kept that way.
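As a rough illustration of the breakdown above, the sketch below splits a file name into the same parts and estimates the weight size from the parameter count and bit width. The file name is a hypothetical one assembled from the examples in this section, and the regular expression assumes that exact ordering; real repositories do not follow a single naming scheme. The size estimate is nominal only: actual files are somewhat larger because some tensors are kept at higher precision.

```python
import re

# Hypothetical file name built from the parts described above;
# real model repositories name their files in many different ways.
NAME = "gemma-4-26B-A4B-Instruct-UD-Q4_K_XL.gguf"

PATTERN = re.compile(
    r"(?P<model>.+?)-"                    # developer / model name
    r"(?P<total>\d+(?:\.\d+)?)B-"         # total parameters, in billions
    r"(?:A(?P<active>\d+(?:\.\d+)?)B-)?"  # optional MoE active parameters
    r"(?P<training>[A-Za-z]+)-"           # training type
    r"(?P<quant>.+)\."                    # quantization method
    r"(?P<format>\w+)$"                   # file format
)

def parse_model_name(name):
    """Split a model file name into its labeled parts."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name}")
    return match.groupdict()

def approx_weight_gb(total_billion, bits_per_weight):
    """Nominal weight size in GB: parameters * bits / 8 bits per byte."""
    return total_billion * bits_per_weight / 8

parts = parse_model_name(NAME)
print(parts)
print(approx_weight_gb(float(parts["total"]), 16))  # 52.0 (BF16/FP16)
print(approx_weight_gb(float(parts["total"]), 4))   # 13.0 (nominal 4-bit)
```

The gap between 52 GB and 13 GB is the main reason quantized files are popular for local inference.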
Glossary of Fine-Tuning Terms
- PEFT (Parameter-Efficient Fine-Tuning)
-
LoRA (Low-Rank Adaptation)
-
The academic basis is Low-Rank Approximation (representing one large matrix as the product of two smaller matrices)
-
SVD is a representative example.
-
The LLM weight matrix is fundamentally very large.
-
The tuned model is treated as W (existing weights) + ΔW, where ΔW is the change learned by tuning.
-
Traditionally, fine-tuning learned the entire ΔW, a full-size M x N matrix.
-
Decomposing ΔW = A x B (A is M x R, B is R x N, with R much smaller than M and N) reduces the number of trainable values from O(MN) to O(R(M + N)).
-
QLoRA (Quantized LoRA)
-
A technique that quantizes the base model itself to 4-bit or similar and trains LoRA adapters on top, so that large models can be tuned with less memory.
-
DoRA (Weight-Decomposed Low-Rank Adaptation)
-
When learning ΔW = A x B, the magnitude and direction of the weights are mixed together during training.
-
Therefore, magnitude is separated and managed as a separate parameter.
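To make the O(MN) versus O(R(M + N)) comparison from the LoRA notes concrete, here is a minimal sketch in plain Python. The sizes M, N, and R are illustrative assumptions and are not taken from any specific model, and the helper names are made up for this example.

```python
def full_update_params(m, n):
    """A full fine-tune learns every entry of the M x N matrix ΔW."""
    return m * n

def lora_update_params(m, n, r):
    """LoRA learns A (M x R) and B (R x N) instead of ΔW itself."""
    return r * (m + n)

def matmul(a, b):
    """Plain nested-list matrix product, enough for a tiny demonstration."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Illustrative sizes (assumptions for this example only).
M, N, R = 4096, 4096, 8
print(full_update_params(M, N))     # 16777216
print(lora_update_params(M, N, R))  # 65536

# Tiny case with R = 1: A is 3 x 1 and B is 1 x 2,
# yet their product ΔW is still a full 3 x 2 matrix.
A = [[1.0], [2.0], [3.0]]
B = [[0.5, -1.0]]
delta_w = matmul(A, B)
print(delta_w)  # [[0.5, -1.0], [1.0, -2.0], [1.5, -3.0]]
```

With these sizes the low-rank form trains 256 times fewer values, at the cost of restricting ΔW to rank R.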
- Alignment
-
RLHF (Reinforcement Learning from Human Feedback)
-
Humans rank (grade) the model's responses, and a reward model is built from those rankings
-
Train the model through reinforcement learning to answer in a way that maximizes rewards
-
DPO (Direct Preference Optimization)
-
Skips building a separate reward model and optimizes the model directly on the preference dataset
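The DPO idea can be sketched as a loss on a single preference pair. The version below follows the commonly cited DPO objective, computed on summed log-probabilities of a chosen and a rejected response under the model being tuned and under a frozen reference model. The function name, argument names, sample numbers, and the default beta are assumptions made for this illustration.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a whole response.
    """
    # How much more (or less) likely each response became versus the reference.
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Untuned model: identical to the reference, so the margin is zero.
baseline = dpo_loss(-12.0, -12.0, -12.0, -12.0)

# Tuned model: the chosen answer became more likely, the rejected one less.
improved = dpo_loss(-10.0, -14.0, -12.0, -12.0)

print(baseline, improved)
```

No reward model appears anywhere: the preference data enters only through which response is labeled as chosen. This simple form can overflow for very negative margins, which a real training loop would guard against.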
Explanation of Option Values
-
GPU Offload: Determines how many Transformer Layers to load onto the GPU
-
CPU Thread Pool Size: Number of CPU threads to use
-
Evaluation Batch Size: Determines how many prompt tokens are processed at once
-
Larger values increase VRAM usage and speed up prompt ingestion
-
Max Concurrent Predictions: Number of answers that can be generated simultaneously; 1 should be sufficient in LM Studio.
-
RoPE Frequency Base / Scale: Values to adjust when you want to handle contexts longer than the model's trained context length; default is Auto.
-
Offload KV Cache to GPU Memory: Determines whether the KV cache (the attention keys and values cached for the conversation so far) is kept in GPU memory.
-
Keep Model in Memory: Whether to keep the model in memory without unloading it.
-
Try mmap(): Load the model using memory mapping; effective for improving loading speed and efficiency when RAM is sufficient.
-
Number of Experts: The number of experts to activate per token in an MoE (Mixture of Experts) model.
-
Number of layers for which to force MoE weights onto CPU: A setting that forces the MoE weights of that many layers to stay on the CPU.
-
Flash Attention
-
Computes attention by splitting the matrices into tiles that fit into the GPU's fast on-chip SRAM.
-
Saves memory and can dramatically increase speed because the full attention matrix is never stored in GPU memory.
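The tiling idea behind Flash Attention can be sketched in plain Python for a single query vector. This is only a toy illustration of the running-softmax bookkeeping, not how the real GPU kernel is written; the function names, tile size, and sample data are assumptions made for this example.

```python
import math

def naive_attention(q, keys, values):
    """Reference version: builds the full row of scores before the softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def tiled_attention(q, keys, values, tile=2):
    """Processes keys/values tile by tile, keeping only a running max,
    a running softmax denominator, and a running weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [x * correction for x in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc = [x + w * vd for x, vd in zip(acc, v)]
        running_max = new_max
    return [x / denom for x in acc]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

naive = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile=2)
print(naive)
print(tiled)
```

Both functions produce the same output up to floating-point rounding, but the tiled version never holds more than one tile of scores at a time.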
-
KV Cache Quantization
-
Saves memory by quantizing the KV cache; the default is usually FP16.
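A back-of-the-envelope estimate shows why KV cache quantization matters. The sketch below counts two cached vectors (key and value) per layer, per KV head, per token. The model dimensions are hypothetical round numbers chosen for the example, and the estimate ignores any per-block scale overhead that real quantized formats add.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_value):
    """Approximate KV cache size in bytes."""
    # 2x for keys and values; one vector per layer, per KV head, per token.
    n_values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return n_values * bits_per_value // 8

# Hypothetical model shape (assumption, not a specific model).
LAYERS, KV_HEADS, HEAD_DIM, CONTEXT = 32, 8, 128, 8192

fp16_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 16)
q8_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 8)

print(fp16_bytes / 2**30)  # 1.0 (GiB)
print(q8_bytes / 2**30)    # 0.5 (GiB)
```

The cache grows linearly with context length, so the saving becomes more noticeable as the context gets longer.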
TODO
-
Transformer Architecture Analysis
-
Comparison with RNN
-
Self Attention?