(Then again, apparently the president of the local Diplomacy Society attends my church; I discovered this when another friend whom I'd invited saw him, and quipped that he was surprised he hadn't been struck by lightning at the door.)
DeepSeek and Gemini 2.5 both had low betrayer and betrayed rates.
o3-mini and DeepSeek had the highest number of first-place finishes, but were only in the upper quartile of the TrueSkill leaderboard; presumably because they played riskier strategies that led either to outright wins or to early elimination?
Also interesting that o1 was only able to sway the final jury a bit more than 50% of the time, while o3-mini managed it 63% of the time.
Anyway, really cool stuff!
This one did make me laugh though: 'Claude 3.5 Sonnet 2024-10-22: "Adjusts seat with a confident yet approachable demeanor"' - an AI communicating with other AIs through a textual description of non-verbal behaviour is hilarious.
Some thoughts about the setup:
- the setup seems to give reasoning models an inherent advantage, because only they get a private plan and a public text in the same output. I feel like giving all models the option to formulate plans and keep track of other players inside <think> or <secret> tags would level the playing field more.
- from personal experience with social tasks for LLMs, it helps both reasoning and non-reasoning models to be explicitly asked to plan their next steps, with the assurance that the plan stays hidden from all other players. That might be a good addition here, either before or after the public subround (a rough sketch of what I mean is below this list).
- the individual rounds are pretty short; humans would struggle to coordinate in so few exchanges with so few words. If this was done because of context limitations, a better strategy might be to ask each model to summarize the game state from its own perspective, and then give it only the current round, the previous round, and its own summary of everything before that (second sketch below).
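
To make the private-planning idea concrete, here is a minimal sketch (not the benchmark's actual code): every model writes a plan inside <secret>...</secret> tags, which is stripped before the rest of the message is broadcast to the other players. `call_model()` and the `private_notes` dict are hypothetical stand-ins for whatever API wrapper and state the harness already uses.

```python
import re

# Anything inside <secret>...</secret> is kept for the author only.
SECRET_RE = re.compile(r"<secret>.*?</secret>", re.DOTALL)

def split_reply(raw_reply: str) -> tuple[str, str]:
    """Separate the private plan from the public message in one model reply."""
    private_plan = "\n".join(SECRET_RE.findall(raw_reply))
    public_message = SECRET_RE.sub("", raw_reply).strip()
    return private_plan, public_message

def public_subround(player_names, call_model, game_state, private_notes):
    """One public subround: each player speaks once, plans stay hidden."""
    broadcasts = {}
    for name in player_names:
        prompt = (
            f"{game_state}\n\n"
            "First write your private plan inside <secret>...</secret> tags; "
            "no other player will ever see it. Then write your public message."
        )
        plan, message = split_reply(call_model(name, prompt))
        private_notes.setdefault(name, []).append(plan)  # visible only to this player
        broadcasts[name] = message                        # only this part is shared
    return broadcasts
```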
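
And a second sketch for the context idea, again assuming the same hypothetical `call_model()` helper: instead of feeding the full transcript, each player keeps a rolling summary and only ever sees that summary plus the previous and current rounds.

```python
def build_player_context(rolling_summary, previous_round, current_round):
    """Compose the trimmed context a player sees for its next turn."""
    return (
        f"Your summary of the game so far:\n{rolling_summary}\n\n"
        f"Previous round:\n{previous_round}\n\n"
        f"Current round so far:\n{current_round}"
    )

def update_summary(name, call_model, rolling_summary, finished_round):
    """Ask the model to fold the round that just ended into its own summary."""
    prompt = (
        f"Your previous summary:\n{rolling_summary}\n\n"
        f"The round that just ended:\n{finished_round}\n\n"
        "Rewrite your summary of the game from your own perspective, "
        "keeping it under roughly 200 words."
    )
    return call_model(name, prompt)
```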
It would be cool to have some code to play around with, to test how changes to the setup change the results. I guess it isn't that difficult to write, but it's odd to publish the benchmark without the code to run it yourself.