A raw, pretrained LLM is just a next-token predictor: it completes text in whatever style its training data suggests, helpful or not. RLHF (Reinforcement Learning from Human Feedback) is the training process that tames it into a helpful assistant.

Humans compare pairs of model outputs and pick the better one. A reward model learns to predict these preferences, and a reinforcement learning algorithm (typically PPO) then tunes the main model to maximize the reward model's score. It's effectively dog training at massive scale.
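
To make the reward-model step concrete, here is a minimal PyTorch sketch of the standard preference (Bradley-Terry) loss used to train reward models from human comparisons. The tiny `RewardModel` and the random embeddings are illustrative stand-ins; in real RLHF the reward model is a full LLM with a scalar head scoring entire prompt-response pairs.

```python
import torch
import torch.nn as nn

# Toy reward model: a real one is an LLM with a scalar output head.
class RewardModel(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # one scalar reward per response
        )

    def forward(self, response_embedding):
        return self.scorer(response_embedding).squeeze(-1)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in data: embeddings of the human-preferred ("chosen")
# response and the rejected one for the same prompt.
chosen = torch.randn(32, 64)
rejected = torch.randn(32, 64)

# Bradley-Terry preference loss: push the chosen response's reward
# above the rejected one's.
r_chosen = model(chosen)
r_rejected = model(rejected)
loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

loss.backward()
optimizer.step()
```

Once trained, the reward model scores fresh responses from the main model, and the RL step nudges the main model toward outputs that earn higher scores, usually with a penalty that keeps it from drifting too far from the original model.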