(Then again, apparently the president of the local Diplomacy Society attends my church; I discovered this when another friend whom I'd invited saw him, and quipped that he was surprised he hadn't been struck by lightning at the door.)
DeepSeek and Gemini 2.5 both had low betrayer and betrayed rates.
o3-mini and DeepSeek had the highest number of first-place finishes, but were only in the upper quartile of the TrueSkill leaderboard; presumably because they played riskier strategies that led either to outright wins or to early elimination?
Also interesting that o1 was only able to sway the final jury a bit more than 50% of the time, while o3-mini managed it 63% of the time.
Anyway, really cool stuff!
This one did make me laugh though: 'Claude 3.5 Sonnet 2024-10-22: "Adjusts seat with a confident yet approachable demeanor"' - an AI communicating with other AIs through a textual description of non-verbal behaviour is hilarious.
Some thoughts about the setup:
- the setup seems to give reasoning models an inherent advantage, because only they get a private plan and a public text in the same output. I feel like giving all models the option to formulate plans and keep track of other players inside <think> or <secret> tags would level the playing field more.
- from personal experience with social tasks for LLMs, it helps both reasoning and non-reasoning models to be explicitly asked to plan their next steps, with the assurance that the plan stays hidden from all other players. That might be a good addition here, either before or after the public subround (a rough sketch of what I mean is below this list).
- the individual rounds are pretty short; humans would struggle to coordinate in so few exchanges with so few words. If this was done because of context limitations, a better strategy might be to ask each model to summarize the game state from its own perspective, and then give it only the current round, the previous round, and its own summary of everything before that (second sketch below).
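
To make the private-planning idea concrete, here is a minimal sketch (not the benchmark's actual code): every model writes a plan inside <secret>...</secret> tags, which is stripped before the rest of the message is broadcast to the other players. `call_model()` and the `private_notes` dict are hypothetical stand-ins for whatever API wrapper and state the harness already uses.

```python
import re

# Anything inside <secret>...</secret> is kept for the author only.
SECRET_RE = re.compile(r"<secret>.*?</secret>", re.DOTALL)

def split_reply(raw_reply: str) -> tuple[str, str]:
    """Separate the private plan from the public message in one model reply."""
    private_plan = "\n".join(SECRET_RE.findall(raw_reply))
    public_message = SECRET_RE.sub("", raw_reply).strip()
    return private_plan, public_message

def public_subround(player_names, call_model, game_state, private_notes):
    """One public subround: each player speaks once, plans stay hidden."""
    broadcasts = {}
    for name in player_names:
        prompt = (
            f"{game_state}\n\n"
            "First write your private plan inside <secret>...</secret> tags; "
            "no other player will ever see it. Then write your public message."
        )
        plan, message = split_reply(call_model(name, prompt))
        private_notes.setdefault(name, []).append(plan)  # visible only to this player
        broadcasts[name] = message                        # only this part is shared
    return broadcasts
```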
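
And a second sketch for the context idea, again assuming the same hypothetical `call_model()` helper: instead of feeding the full transcript, each player keeps a rolling summary and only ever sees that summary plus the previous and current rounds.

```python
def build_player_context(rolling_summary, previous_round, current_round):
    """Compose the trimmed context a player sees for its next turn."""
    return (
        f"Your summary of the game so far:\n{rolling_summary}\n\n"
        f"Previous round:\n{previous_round}\n\n"
        f"Current round so far:\n{current_round}"
    )

def update_summary(name, call_model, rolling_summary, finished_round):
    """Ask the model to fold the round that just ended into its own summary."""
    prompt = (
        f"Your previous summary:\n{rolling_summary}\n\n"
        f"The round that just ended:\n{finished_round}\n\n"
        "Rewrite your summary of the game from your own perspective, "
        "keeping it under roughly 200 words."
    )
    return call_model(name, prompt)
```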
It would be cool to have some code to play around with, to test how changes to the setup change the results. I guess it isn't that difficult to write, but it's odd to publish the benchmark without the code to run it yourself.