RL Environment

Peer Reward Pipeline

Multi Agent RL

Agent Orchestrator

A research-oriented multi-agent reinforcement learning environment for language-model agents on GSM8K-style math tasks. Multiple agents solve the same problem, judge each other, and receive rewards from a transparent pipeline that combines objective correctness with recoverability-aware peer signals.

Role

RL Systems Engineer

LLM Evaluation Researcher

Focus

Agent orchestration

Auditable reward verification

Duration

Research prototype

Tools

Python

Multi-Agent RL

LLM Evaluation

GSM8K


Core Loop

Agents independently answer a math question, then provide directed peer evaluations for other responses. Peer scores are normalized so a generous judge cannot dominate the reward signal. Historical trust, attention weights, and objective correctness are then combined into a final scalar reward.
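
The per-judge normalization step could look like the sketch below. The source does not say which normalization is used, so z-scoring each judge's row is an assumption, as are the function name `normalize_peer_scores` and the `raw[i][j]` matrix layout (judge i scoring agent j, with self-evaluations ignored). It expects at least two agents.

```python
from statistics import mean, pstdev


def normalize_peer_scores(raw):
    """Z-score each judge's row so every judge contributes on the same scale.

    raw[i][j] is the score judge i gave agent j. The diagonal (self-scores)
    is ignored and stays 0.0 in the output. Hypothetical layout, not taken
    from the project code.
    """
    n = len(raw)
    normalized = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(raw):
        # Scores this judge handed out to the other agents.
        given = [row[j] for j in range(n) if j != i]
        mu = mean(given)
        sigma = pstdev(given)
        for j in range(n):
            if j == i:
                continue
            # A judge who scores everyone identically carries no signal.
            normalized[i][j] = 0.0 if sigma == 0 else (row[j] - mu) / sigma
    return normalized


# Judge 0 is generous (9 and 10), judge 1 is harsh (2 and 4).
raw_scores = [
    [None, 9, 10],
    [2, None, 4],
    [5, 5, None],
]
normalized_scores = normalize_peer_scores(raw_scores)
```

After normalization the generous and the harsh judge express the same relative preference on the same scale, which is the property the paragraph above describes.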


Recoverability-Aware Reward

The reward is not a simple blend of ground truth and peer judgment. It tracks whether a reasoning path can be salvaged over time:

R_j = alpha * R_final + beta * sum_t(delta_u) + gamma * mean_t(b_t) - delta * mean_t(f_t) + eta * R_peer_salv_j + zeta * branch_bonus

Peer salvageability is computed from attention-weighted and trust-weighted evaluator scores, so the system can learn from useful critiques without letting unreliable judges steer the episode.
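
As a minimal sketch, the formula can be evaluated term by term as shown below. The coefficient names mirror the formula; everything else is an assumption made for illustration: the class and function names, treating delta_u, b_t, and f_t as non-empty per-step lists, and combining trust and attention as a product-weighted mean of evaluator scores.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RewardWeights:
    """Coefficients of the reward formula; values are chosen by the caller."""
    alpha: float
    beta: float
    gamma: float
    delta: float
    eta: float
    zeta: float


def peer_salvageability(scores, trust, attention):
    """Weighted mean of evaluator scores for one agent.

    Assumption: each evaluator's weight is trust * attention. The source
    only says both weightings are applied, not how they are combined.
    """
    weights = [t * a for t, a in zip(trust, attention)]
    total = sum(weights)
    if total == 0:
        return 0.0  # no trusted, attended evaluator: no peer signal
    return sum(w * s for w, s in zip(weights, scores)) / total


def recoverability_reward(w, r_final, delta_u, b, f, r_peer_salv, branch_bonus):
    """Evaluate R_j from its six terms; delta_u, b, f are per-step lists."""
    return (
        w.alpha * r_final
        + w.beta * sum(delta_u)
        + w.gamma * mean(b)
        - w.delta * mean(f)
        + w.eta * r_peer_salv
        + w.zeta * branch_bonus
    )


weights = RewardWeights(alpha=1.0, beta=0.5, gamma=0.2, delta=0.3, eta=0.4, zeta=0.1)
r_peer = peer_salvageability(scores=[0.5, 1.0], trust=[1.0, 0.5], attention=[0.5, 1.0])
reward = recoverability_reward(
    weights,
    r_final=1.0,
    delta_u=[0.2, -0.1, 0.3],
    b=[1, 0, 1],
    f=[0, 1, 0],
    r_peer_salv=r_peer,
    branch_bonus=1.0,
)
```

Keeping each term as a separate argument makes it straightforward to log the individual contributions alongside the final scalar.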


Reward Verification

Every episode logs a strict verification report with stage order, ground-truth rewards, raw and normalized peer-score matrices, trust weights, attention weights, combined peer rewards, final rewards, and sanity checks. This makes reward failures visible immediately instead of hiding them inside unstable learning behavior.
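
A few of the sanity checks on such a report might be sketched as follows. The dictionary keys, the stage names in `EXPECTED_STAGES`, and the specific checks are hypothetical; the source lists what the report contains but not its schema, and this sketch covers only a subset of those fields.

```python
import math

# Hypothetical stage names; the real pipeline may use different ones.
EXPECTED_STAGES = ["answer", "peer_evaluate", "normalize", "weight", "combine"]


def sanity_checks(report):
    """Return a list of failure messages; an empty list means the report passed."""
    failures = []
    n = len(report["final_rewards"])

    # Stages must have run in the expected order.
    if report["stage_order"] != EXPECTED_STAGES:
        failures.append("stages ran out of order")

    # Peer-score matrices must be square with one row and column per agent.
    for name in ("raw_peer_scores", "normalized_peer_scores"):
        matrix = report[name]
        if len(matrix) != n or any(len(row) != n for row in matrix):
            failures.append(f"{name} is not {n}x{n}")

    # Per-agent vectors must have one finite entry per agent.
    for name in ("ground_truth_rewards", "trust_weights",
                 "combined_peer_rewards", "final_rewards"):
        values = report[name]
        if len(values) != n:
            failures.append(f"{name} has {len(values)} entries, expected {n}")
        elif not all(math.isfinite(v) for v in values):
            failures.append(f"{name} contains NaN or inf")

    return failures


example_report = {
    "stage_order": list(EXPECTED_STAGES),
    "ground_truth_rewards": [1.0, 0.0],
    "raw_peer_scores": [[0.0, 7.0], [4.0, 0.0]],
    "normalized_peer_scores": [[0.0, 0.0], [0.0, 0.0]],
    "trust_weights": [0.6, 0.4],
    "combined_peer_rewards": [0.1, 0.2],
    "final_rewards": [1.1, 0.2],
}
problems = sanity_checks(example_report)
```

Returning every failure rather than stopping at the first one keeps the report useful when several stages break in the same episode.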


System Components

environment/: multi-agent environment, ranking, reward pipeline, trust weighting, attention weighting, and diagnostics.

agents/: heuristic math agents, Ollama-backed agents, self-refine agents, and ICL memory agents.

experiment/: resumable batch runs, manifests, and summaries.

analysis/: statistics, learning curves, and paper-ready tables.