Benchmarks
Sifting through hundreds of thousands of hours of indexed videos
10 mentions
3.5M views

Arcmira media summary
Arcmira tracks where "benchmarks" is discussed across indexed YouTube videos, transcripts, channels, and related entities.
Discussed as existing evaluation methods, often focusing on functional correctness rather than reusability.
Most benchmarks don't optimize for fun. Why is it important that ARC has this in its acceptance criteria? I think, in a very basic way, the benchmark will be more successful if it's fun, if it's catchy, if people actually enjoy interacting with it and playing the games. It will catch on more, more people will be attracted to it, and more people will want to work on it. The human story, like, inspiring. Exactly. It needs to be catchy as a cultural artifact, not just as a scientific artifact. And I also think that in order to make progress towards AGI we need to do a lot of thinking about how humans play these games. To get this data, like the human testing data for instance, or just your own introspection as you play the games, you ask how you're figuring things out: what's your strategy? You try to leverage metacognition to produce AGI insights. In order to do that, the games need to be fun. They should not be boring. If they're boring, you're not going to want to play them, and human testers are not going to give it their best. All right. So, last question, looking towards the future. Even beyond V3, we've already started having conversations about V4, V5, whatever, even at lunch today. What are your hopes and dreams for what's not captured on this slide? What other interesting things can you envision that we might want to test for, capabilities-wise, in future AI systems? Right. So, I think V3 has basically the right ingredients, like on-the-fly learning, interactive learning, and goal acquisition, but at a very small scale. For instance, this sort of interactive learning is not really what you would call continual learning, because even though there is a curriculum in each game, it's at the scale of 5 minutes of play. And as humans, you know, we're doing continual learning over decades.
Code Rabbit uses benchmarks for model testing.
I think that benchmarks are easy to game. I think all the other big labs have teams whose whole job is to make the benchmark scores good, and we don't have such a team.
Measures of AI model capabilities; an evolving challenge due to rapid model progress.
Arcmira tracks 10 indexed media appearances or mentions for "benchmarks", tied to source videos, channels, and transcript-derived context.
Arcmira uses indexed YouTube videos and transcripts. Representative source evidence on this page includes "How New Libraries Saw a 50% Improvement | Maria Gorinova" with transcript-derived context and links when available.
"Benchmarks" is connected to OpenAI, Twitter, and Google in Arcmira's media graph.