
Post-training Methods for LLMs

This repo collects post-training methods for Large Language Models (LLMs), each with a small, focused implementation and a runnable example. The goal is to make alignment and reinforcement-learning-based post-training practical, understandable, and reproducible.

Scope

Quickstart

python3 -m venv .venv
. .venv/bin/activate
python -m pip install -U pip
pip install -r requirements.txt

Repository Layout

Current Examples

Gymnasium: CartPole Random Policy

Runs a single random rollout to verify environment setup.

Code: gymnasium/cartpole_random.py

python gymnasium/cartpole_random.py
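
A minimal sketch of what such a rollout looks like with the Gymnasium API (an illustration, not the repo script itself):

import gymnasium as gym

# Create the environment; render_mode="human" opens a window if a display is available.
env = gym.make("CartPole-v1", render_mode="human")

obs, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # uniformly random action
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode return: {total_reward}")
env.close()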

Gymnasium: CartPole Q-Learning

Trains a tabular Q-learning agent on a discretized version of the CartPole state space, then evaluates it.

Code: gymnasium/cartpole_q_learning.py

python gymnasium/cartpole_q_learning.py

Evaluation renders by default.
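
The core idea, sketched below as an illustration rather than the repo code: bucket the continuous CartPole observation into discrete bins and apply the tabular Q-learning update. Bin edges and hyperparameters here are placeholder values.

import numpy as np

# Hypothetical discretization: map each observation dimension to a bucket index.
BINS = [np.linspace(-2.4, 2.4, 9), np.linspace(-3.0, 3.0, 9),
        np.linspace(-0.21, 0.21, 9), np.linspace(-3.0, 3.0, 9)]

def discretize(obs):
    return tuple(int(np.digitize(x, b)) for x, b in zip(obs, BINS))

alpha, gamma = 0.1, 0.99
Q = {}  # maps (state, action) -> value, defaulting to 0.0

def q(s, a):
    return Q.get((s, a), 0.0)

def update(s, a, r, s_next, n_actions=2):
    # Tabular Q-learning: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    target = r + gamma * max(q(s_next, a2) for a2 in range(n_actions))
    Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))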

Chess: KQ vs K Q-Learning

Trains a Q-learning agent on a toy chess endgame (King + Queen vs King).

Code: chess/chess_q_learning.py

python chess/chess_q_learning.py
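
One possible encoding for a tabular agent on this endgame, sketched with the python-chess library (an assumption about the setup, not necessarily how the repo script represents states and actions):

import chess

# Key the Q-table by the board's FEN string and by UCI move strings.
board = chess.Board("8/8/8/4k3/8/8/8/KQ6 w - - 0 1")  # example KQ-vs-K position

Q = {}  # (fen, uci_move) -> value

def greedy_action(board):
    fen = board.fen()
    moves = list(board.legal_moves)
    return max(moves, key=lambda m: Q.get((fen, m.uci()), 0.0))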

Learning Path (Planned)

Each step builds on the previous one; diagrams show simplified dataflow.

1) Reinforcement Learning Foundations

Topics:

Diagram:

flowchart LR
  Env["Environment"] -->|"State s(t)"| Agent["Agent"]
  Agent -->|"Action a(t)"| Env
  Env -->|"Reward r(t+1)"| Agent
  Env -->|"State s(t+1)"| Agent

2) Supervised Fine-tuning (SFT) / Instruction Tuning

Topics:

Diagram:

flowchart LR
  Pretrained["Pretrained Model"] --> Train["SFT Training"]
  Data["SFT Data (Instruction + Response)"] --> Train
  Train --> Instruction["Instruction-Following Model"]
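
A sketch of the SFT objective in plain PyTorch (an illustration of the loss only, not a training script): next-token cross-entropy computed on the response tokens, with prompt positions masked out.

import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    # logits: [batch, seq_len, vocab], input_ids: [batch, seq_len]
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100                 # ignore the instruction/prompt tokens
    shift_logits = logits[:, :-1, :]              # predict token t+1 from position t
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )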

3) Preference Optimization (No-RL Alignment)

Includes:

Focus:

Diagram:

flowchart LR
  Prompt[User Prompt] --> Base[Base Model]
  Base --> Responses[Candidate Responses]
  Responses --> Prefs[Preference Labels]
  Prefs --> Opt[Preference Optimization]
  Base --> Opt
  Opt --> Aligned[Aligned Model]
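
As a concrete example of this family, a sketch of the DPO loss in PyTorch, assuming you already have summed sequence log-probabilities from the policy and a frozen reference model:

import torch
import torch.nn.functional as F

# Push the policy to assign relatively more log-probability to the chosen response
# than to the rejected one, measured against the reference model.
def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()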

4) Reward Modeling + RLHF (PPO / GRPO Variants)

Coverage:

Diagram:

flowchart LR
  SFT[Supervised Fine-Tuning] --> PPO[PPO RLHF]
  Pref[Preference Data] --> RM[Reward Model Training]
  RM --> PPO
  PPO --> Aligned[Final Aligned Model]
  SFT --> GRPO[GRPO Optimization]
  Pref --> GRPO
  GRPO --> Aligned
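
Two building blocks from the diagram, sketched in PyTorch with illustrative names and shapes: a Bradley-Terry pairwise loss for reward-model training, and the group-relative advantage GRPO uses in place of a learned critic.

import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    # Pairwise preference loss: the chosen response should score higher than the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def grpo_advantages(group_rewards, eps=1e-6):
    # group_rewards: [group_size] rewards for responses sampled from the same prompt;
    # normalizing within the group replaces the value/critic baseline.
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)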

5) RLAIF (AI-Generated Preferences)

Diagram:

flowchart LR
  Prompt[User Prompt] --> Policy[Policy Model]
  Policy --> Responses[Candidate Responses]
  Responses --> Judge[Judge Model]
  Judge --> Prefs[AI Preference Labels]
  Prefs --> Update[Preference Optimization]
  Policy --> Update
  Update --> Updated[Updated Policy Model]
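
A sketch of the labeling step, with policy_sample and judge_prefers as hypothetical helpers standing in for calls to the policy and judge models:

def label_preferences(prompts, policy_sample, judge_prefers):
    pairs = []
    for prompt in prompts:
        a, b = policy_sample(prompt), policy_sample(prompt)   # two candidate responses
        chosen, rejected = (a, b) if judge_prefers(prompt, a, b) else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

The resulting pairs feed the same preference-optimization step as in section 3, with the judge model replacing human annotators.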

6) Constitutional and Safety Tuning

Diagram:

flowchart LR
  Output[Model Output] --> Critique[Self-Critique]
  Constitution[Constitution / Rules] --> Critique
  Critique --> Revise[Revision]
  Revise --> Safer[Safer Output]
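
A sketch of one critique-and-revise pass, where generate is a hypothetical text-generation helper and the constitution is a plain list of rules:

def constitutional_revision(generate, prompt, draft, constitution):
    rules = "\n".join(f"- {rule}" for rule in constitution)
    critique = generate(
        f"Rules:\n{rules}\n\nPrompt: {prompt}\nResponse: {draft}\n"
        "Point out any way the response violates the rules."
    )
    revised = generate(
        f"Rules:\n{rules}\n\nPrompt: {prompt}\nResponse: {draft}\n"
        f"Critique: {critique}\nRewrite the response so it follows the rules."
    )
    return revised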

7) Evaluation and Comparison Harnesses

Diagram:

flowchart LR
  Model --> Bench[Benchmark Suite]
  Bench --> Metrics[Metrics + Regression Tracking]
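
A minimal sketch of such a harness, with exact match standing in for real benchmark metrics and all names illustrative:

def evaluate(model_fn, tasks):
    # tasks: list of {"prompt": ..., "expected": ...} records.
    correct = sum(model_fn(t["prompt"]).strip() == t["expected"] for t in tasks)
    return correct / len(tasks)

def check_regression(score, baseline, tolerance=0.01):
    # Flag runs that fall more than `tolerance` below the stored baseline score.
    return score >= baseline - tolerance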

8) Distillation and Post-training Compression

Diagram:

flowchart LR
  Teacher[Large Aligned Model] --> Data[Distillation Data]
  Student[Smaller Model] --> Distill[Distillation Training]
  Data --> Distill
  Distill --> Small[Smaller Aligned Model]
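
A sketch of a standard distillation objective in PyTorch (one common choice, not necessarily the one this repo will use): temperature-softened KL divergence between teacher and student next-token distributions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean + T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2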