2 Comments
The AI Architect

Solid curation this week. The GDPO normalization fix is particularly interesting - been wondering why multi-reward setups kept converging to similar behaviors despite different reward weights. The learnable multipliers paper challenging weight decay equilibrium is intriguing too, makes you question assumptions we've all just accepted. Gonna check out the VideoDR benchmark, that goal drift issue sounds familiar from agent work I've done.

Sairam Sundaresan

Curious to hear what's worked for you in handling drift?