
Multi-Agent RL
Agent Orchestrator
A research-oriented multi-agent reinforcement learning environment for language-model agents on GSM8K-style math tasks. Multiple agents solve the same problem, judge each other, and receive rewards from a transparent pipeline that combines objective correctness with recoverability-aware peer signals.
Role
RL Systems Engineer
LLM Evaluation Researcher
Focus
Agent orchestration
Auditable reward verification
Duration
Research prototype
Tools
Python
Multi-Agent RL
LLM Evaluation
GSM8K
Core Loop
Agents independently answer a math question, then provide directed peer evaluations of every other agent's response. Peer scores are normalized per judge so a generous judge cannot dominate the reward signal. Historical trust, attention weights, and objective correctness are then combined into a final scalar reward.
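A minimal sketch of that normalization, assuming peer scores arrive as an agents-by-agents numpy matrix and each judge's row is rescaled to unit mass (the function name and scheme are illustrative, not the repository's exact code):

import numpy as np

def normalize_peer_scores(raw_scores: np.ndarray) -> np.ndarray:
    # raw_scores[i, j] is the score judge i assigns to agent j's
    # response; self-scores on the diagonal are excluded.
    scores = raw_scores.astype(float)
    np.fill_diagonal(scores, 0.0)
    # Rescale each judge's row to sum to 1, so a uniformly generous
    # judge distributes the same total mass as a strict one.
    row_sums = scores.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0  # guard judges who scored nothing
    return scores / row_sums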
Recoverability-Aware Reward
The reward is not a simple blend of ground truth and peer judgment. It tracks whether a reasoning path can be salvaged over time:

R_j = alpha * R_final + beta * sum_t(delta_u) + gamma * mean_t(b_t) - delta * mean_t(f_t) + eta * R_peer_salv_j + zeta * branch_bonus
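A direct Python transcription of that formula (the per-step sequences delta_u, b, f stand in for whatever episode-level signals the pipeline logs; names mirror the symbols above and are illustrative):

from statistics import mean

def recoverability_reward(alpha, beta, gamma, delta, eta, zeta,
                          R_final, delta_u, b, f,
                          R_peer_salv, branch_bonus):
    # One scalar reward for agent j, term by term.
    return (alpha * R_final          # ground-truth term R_final
            + beta * sum(delta_u)    # summed per-step delta_u terms
            + gamma * mean(b)        # mean per-step b_t term
            - delta * mean(f)        # mean per-step f_t, subtracted
            + eta * R_peer_salv      # peer salvageability signal
            + zeta * branch_bonus)   # branch bonus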
Peer salvageability is computed from attention-weighted and trust-weighted evaluator scores, so the system can learn from useful critiques without letting unreliable judges steer the episode.
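One plausible reading of that weighting, sketched below; the array shapes and the weighted-average form are assumptions, not the exact pipeline code:

import numpy as np

def peer_salvageability(norm_scores: np.ndarray,
                        trust: np.ndarray,
                        attention: np.ndarray) -> np.ndarray:
    # norm_scores[i, j]: normalized score judge i gives agent j
    # trust[i]:          historical reliability of judge i
    # attention[i, j]:   attention weight on judge i's view of agent j
    weights = trust[:, None] * attention       # per-(judge, agent) weight
    weighted = (weights * norm_scores).sum(axis=0)
    denom = weights.sum(axis=0)
    denom[denom == 0.0] = 1.0                  # guard empty columns
    return weighted / denom                    # R_peer_salv per agent

Under this form, a judge with low historical trust contributes little to the salvageability signal even when its raw scores are extreme.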
Reward Verification
Every episode logs a strict verification report with stage order, ground-truth rewards, raw and normalized peer-score matrices, trust weights, attention weights, combined peer rewards, final rewards, and sanity checks. This makes reward failures visible immediately instead of hiding them inside unstable learning behavior.
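A minimal sketch of what such a report could look like, assuming numpy arrays and the per-judge row normalization above (field and check names are illustrative, not the repository's schema):

from dataclasses import dataclass, field
import numpy as np

@dataclass
class VerificationReport:
    # Per-episode audit record, one field per logged quantity.
    stage_order: list
    ground_truth_rewards: np.ndarray
    raw_peer_scores: np.ndarray
    normalized_peer_scores: np.ndarray
    trust_weights: np.ndarray
    attention_weights: np.ndarray
    combined_peer_rewards: np.ndarray
    final_rewards: np.ndarray
    failures: list = field(default_factory=list)

    def sanity_check(self) -> bool:
        # Surface reward failures in the log now, not later as
        # unstable learning curves.
        if not np.all(np.isfinite(self.final_rewards)):
            self.failures.append("non-finite final reward")
        row_sums = self.normalized_peer_scores.sum(axis=1)
        if not np.allclose(row_sums[row_sums > 0], 1.0):
            self.failures.append("peer-score rows not normalized")
        return not self.failures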
System Components
environment/: multi-agent environment, ranking, reward pipeline, trust weighting, attention weighting, and diagnostics.
agents/: heuristic math agents, Ollama-backed agents, self-refine agents, and ICL memory agents.
experiment/: resumable batch runs, manifests, and summaries.
analysis/: statistics, learning curves, and paper-ready tables.