Replies (6)

Anton 1 week ago
We need sovereign benchmarks that remain novel over time and/or are updated after they become training data.
It would make sense to squeeze a model as much as possible. The early beta-testers get plenty of compute - maybe some extra "thinking" - and later, when the floodgates open, the model gets tuned down to just barely satisfy most users. It's hard to test what's going on when not even the provider knows why an LLM produces the reply it does.