Jun 27, 2026

AI-related learning notes

Things I learned as I did more AI-related work

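GGUF: Most widely used model file format for local inference (llama.cpp ecosystem); runs on Windows/Linux, strong CPU inference
MLX: Model format for Apple's MLX framework, which targets Apple Silicon
Safetensors: Format in which the original (unquantized) model weights are usually distributed
TensorRT: NVIDIA's GPU inference engine; performance is best when the entire model is loaded onto the GPU without offloading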
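
As a rough illustration of how these formats differ on disk, the sketch below guesses the format from a file's first bytes. It relies on two layout facts: a GGUF file starts with the 4-byte magic `GGUF` followed by a little-endian uint32 version, and a safetensors file starts with a little-endian uint64 header length followed by a JSON header. The function name `sniff_model_file` and the sanity bound on the header size are my own assumptions, not part of either format.

```python
import json
import struct


def sniff_model_file(path):
    """Guess whether a file is GGUF or safetensors from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
        if head[:4] == b"GGUF" and len(head) == 8:
            # GGUF: 4-byte magic, then a little-endian uint32 format version.
            (version,) = struct.unpack("<I", head[4:8])
            return ("gguf", version)
        if len(head) == 8:
            # safetensors: little-endian uint64 header length, then JSON.
            (header_len,) = struct.unpack("<Q", head)
            if 0 < header_len <= 100_000_000:  # arbitrary sanity bound (assumption)
                raw = f.read(header_len)
                try:
                    header = json.loads(raw.decode("utf-8"))
                except (UnicodeDecodeError, json.JSONDecodeError):
                    return ("unknown", None)
                if isinstance(header, dict):
                    # Every key except the optional metadata entry is a tensor name.
                    names = [k for k in header if k != "__metadata__"]
                    return ("safetensors", names)
    return ("unknown", None)
```

Calling `sniff_model_file("model.gguf")` on a real file would return the format name plus either the GGUF version or the list of tensor names; MLX and TensorRT artifacts are not covered by this sketch.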

UD: Unsloth Dynamic (Unsloth's own quantization method), which assigns quantization dynamically per tensor
Measures how much quality degrades after quantization by running actual input data through the model
Decides heuristically, from those measurements, how each tensor is quantized
Q4: 4-bit (Importance-based quantization methods like IQ and AWQ also exist)
FP32 (Floating Point 32): Standard float data type, uses 32 bits (1 sign / 8 exponent / 23 mantissa bits)
FP16 (Floating Point 16): Half-precision float data type, uses 16 bits (1 sign / 5 exponent / 10 mantissa bits)
BF16 (Brain Floating Point 16): Reduced-precision float data type, uses 16 bits (1 sign / 8 exponent / 7 mantissa bits); same exponent range as FP32
INT2 ~ INT8: Low-bit integer data types (2 to 8 bits)
K: K-Quant family
XL: Size profile
Profiles run from S to XL according to how much tensor precision is preserved
Some important tensors are left as BF16/FP16; this refers to the extent to which they are retained.
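
To make the idea of measuring degradation per tensor concrete, here is a toy sketch of block-wise 4-bit quantization with an error measurement. It is only an illustration under my own assumptions (symmetric absmax scaling, blocks of 32 values, mean squared error as the metric); it is not the actual K-Quant or Unsloth Dynamic algorithm, and all function names are hypothetical.

```python
def quantize_block_4bit(values):
    """Symmetric absmax quantization of one block to integers in [-7, 7]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 7.0 if absmax > 0 else 1.0
    q = [max(-7, min(7, round(v / scale))) for v in values]
    return q, scale


def dequantize_block(q, scale):
    """Map the stored integers back to approximate float values."""
    return [qi * scale for qi in q]


def quantization_error(tensor, block_size=32):
    """Mean squared error from quantizing a flat tensor block by block."""
    total = 0.0
    for start in range(0, len(tensor), block_size):
        block = tensor[start:start + block_size]
        q, scale = quantize_block_4bit(block)
        restored = dequantize_block(q, scale)
        total += sum((a - b) ** 2 for a, b in zip(block, restored))
    return total / len(tensor)


# Two made-up tensors: one smooth, one with a single large outlier.
smooth = [i / 100 for i in range(-32, 32)]
spiky = list(smooth)
spiky[5] = 50.0

errors = {"smooth": quantization_error(smooth), "spiky": quantization_error(spiky)}
# A tensor whose error is much larger is a candidate for higher precision.
keep_high_precision = max(errors, key=errors.get)
```

The outlier forces a large scale on its whole block, so the small values around it collapse to zero; that is the kind of tensor a mixed-precision profile would leave at BF16/FP16 or give more bits.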

Low-Rank Adaptation (LoRA)
The academic basis is low-rank approximation (representing one large matrix as the product of two smaller matrices).
SVD is a representative example.
The LLM weight matrix is fundamentally very large.
The fine-tuned model is treated as W (existing weights) + ΔW, where ΔW is the tuning update.
Traditionally, fine-tuning trained the entire ΔW, an M x N matrix at full size.
Decomposing it as ΔW = A x B (A is M x R, B is R x N, with R much smaller than M and N) reduces the trainable parameters from O(MN) to O(R(M + N)).
QLoRA (Quantized LoRA)
A technique for tuning large models on limited memory: the base model itself is quantized (to 4-bit, for example) and kept frozen while LoRA adapters are trained on top.
DoRA (Weight-Decomposed Low-Rank Adaptation)
With ΔW = A x B alone, changes to a weight's magnitude and direction are mixed together during training.
DoRA therefore separates the magnitude and manages it as its own trainable parameter.
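
The parameter saving from the LoRA decomposition above can be checked with a few lines of plain Python. This is a minimal sketch using the notes' own shapes (W is M x N, A is M x R, B is R x N); the helper names and the `scaling` argument are illustrative assumptions, and real implementations work on tensors rather than nested lists.

```python
def full_update_params(m, n):
    """Trainable values when the whole M x N update is learned."""
    return m * n


def lora_params(m, n, r):
    """Trainable values when the update is factored as (M x R) @ (R x N)."""
    return r * (m + n)


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_forward(x, w, a, b, scaling=1.0):
    """Compute x @ W + scaling * (x @ A) @ B without forming the M x N update."""
    base = matmul(x, w)
    low_rank = matmul(matmul(x, a), b)
    return [[p + scaling * q for p, q in zip(row_base, row_lr)]
            for row_base, row_lr in zip(base, low_rank)]


# For a 4096 x 4096 weight with rank 16, the update shrinks by a factor of 128.
full = full_update_params(4096, 4096)   # 16_777_216
small = lora_params(4096, 4096, 16)     # 131_072
ratio = full // small                   # 128
```

Computing `(x @ A) @ B` instead of `x @ (A @ B)` is the reason the large matrix never has to exist in memory during training.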

RLHF (Reinforcement Learning from Human Feedback)
Humans rank or grade the model's responses, and a reward model is built from those judgments
Train the model through reinforcement learning to answer in a way that maximizes rewards
DPO (Direct Preference Optimization)
Skips building a separate reward model and optimizes the model directly on the preference pairs in the dataset
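
For reference, the DPO objective for a single preference pair can be written in a few lines. This is a sketch of the standard loss, -log sigmoid(beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))), where each argument is the summed log-probability of a whole response; the function name, argument names, and the default `beta=0.1` are my own choices for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy likes each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss is log(2), the neutral starting point.
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

No reward model appears anywhere: the preference data and a frozen reference copy of the model are enough to compute the gradient signal.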

GPU Offload: Determines how many Transformer Layers to load onto the GPU
CPU Thread Pool Size: Number of CPU threads to use
Evaluation Batch Size: Option that determines how many prompt tokens are processed at once
Larger values increase VRAM usage and speed up prompt ingestion
Max Concurrent Predictions: Number of answers that can be generated simultaneously; 1 should be sufficient for ordinary LM Studio use.
RoPE Frequency Base / Scale: Values adjusted when you want to use a context longer than the length the model was trained on; default is Auto.
Offload KV Cache to GPU Memory: Determines whether the KV cache (the attention state for the conversation so far) is kept in GPU memory.
Keep Model in Memory: Whether to keep the model in memory without unloading it.
Try mmap(): Load the model using memory mapping; effective for improving loading speed and efficiency when RAM is sufficient.
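
To get a feel for how much memory the KV cache option above is moving around, the usual back-of-the-envelope estimate is 2 (keys and values) x layers x KV heads x head dimension x context tokens x bytes per value. The sketch below applies it to a made-up model shape; the numbers are assumptions for illustration, not the specs of any particular model, and runtimes may add overhead on top.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2.0):
    """Approximate KV cache size: keys and values for every layer and token."""
    return int(2 * n_layers * n_kv_heads * head_dim
               * context_tokens * bytes_per_value)


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 8192-token context.
fp16_cache = kv_cache_bytes(32, 8, 128, 8192)                      # FP16, 2 bytes
q8_cache = kv_cache_bytes(32, 8, 128, 8192, bytes_per_value=1.0)   # about 8-bit

fp16_gib = fp16_cache / 1024 ** 3   # 1.0 GiB
q8_gib = q8_cache / 1024 ** 3       # 0.5 GiB, ignoring per-block scale overhead
```

The size grows linearly with context length, which is why long contexts can exhaust VRAM even when the model weights themselves fit.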

Number of Experts: The number of experts to activate in the MoE (Mixture of Experts) model.
Number of layers for which to force MoE weights onto CPU: A setting that keeps the MoE expert weights of some layers on the CPU.
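
What "number of experts to activate" means can be sketched with a toy router: score every expert, keep the top k, and mix their outputs by normalized weight. This is a common top-k scheme written under my own assumptions (softmax over the selected experts only, scalar inputs, experts as plain functions); actual models differ in how they normalize and balance the routing.

```python
import math


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts; return (expert_index, weight) pairs."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_output(x, experts, router_logits, k):
    """Weighted sum of the outputs of only the k active experts."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Four toy experts; only two of them run for this input.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
logits = [0.1, 2.0, -1.0, 2.0]
active = route_top_k(logits, k=2)
y = moe_output(1.0, experts, logits, k=2)
```

Only the selected experts are evaluated per token, so raising the number of active experts increases compute per token without changing the total size of the model.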

Flash Attention
Computes attention by splitting the matrices into tiles that fit in the GPU's fast on-chip SRAM.
Saves memory and can dramatically increase speed because the full attention matrix is never stored in GPU main memory.
KV Cache Quantization
Memory saving through KV cache quantization; default is usually FP16.
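
The tiling idea behind Flash Attention can be illustrated on the CPU with an "online softmax": process the keys tile by tile, keep a running maximum and running sum, and rescale the accumulated output whenever the maximum changes. The sketch below does this for a single query vector and compares against the naive version. It shows the arithmetic only; the real technique is a fused GPU kernel that also tiles over queries, and the function names here are my own.

```python
import math


def naive_attention(q, keys, values):
    """Reference: build the full row of attention weights, then mix the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) * scale for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, tile_size=2):
    """Same result, but only one tile of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [sum(a * b for a, b in zip(q, k)) * scale for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because each tile only needs the running maximum, the running sum, and the accumulator, the full sequence-by-sequence score matrix never has to be written out, which is where the memory saving comes from.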
