Getting it right, like a human would
So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.
Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe, sandboxed environment.
To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
Finally, it hands over all this evidence (the original request, the AI’s code, and the screenshots) to a Multimodal LLM (MLLM), which acts as a judge.
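The flow described so far (task, generated code, sandboxed run, timed screenshots, hand-off to a judge) can be sketched roughly like this. It is only an illustration, not ArtifactsBench's actual code: `Evidence`, `collect_evidence`, `generate`, and `run_and_capture` are hypothetical names, and both stages are stubbed so the example runs without a model or a browser.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Evidence:
    """Everything the judge receives for one task (hypothetical structure)."""
    request: str              # the original task prompt
    code: str                 # the code the model produced
    screenshots: List[bytes]  # frames captured over time, oldest first


def collect_evidence(
    request: str,
    generate: Callable[[str], str],
    run_and_capture: Callable[[str], List[bytes]],
) -> Evidence:
    """Generate code for a task, run it, and bundle the results for judging."""
    code = generate(request)
    screenshots = run_and_capture(code)  # assumed to execute inside a sandbox
    if not screenshots:
        raise ValueError("no screenshots captured; nothing for the judge to inspect")
    return Evidence(request=request, code=code, screenshots=screenshots)


# Stub stages standing in for a real model and a real sandboxed browser.
def fake_generate(request: str) -> str:
    return f"<!-- page for: {request} --><button>Click</button>"


def fake_run_and_capture(code: str) -> List[bytes]:
    # Pretend three frames were captured: initial, after a click, settled.
    return [b"frame-0", b"frame-1", b"frame-2"]


evidence = collect_evidence("Build a counter button", fake_generate, fake_run_and_capture)
print(len(evidence.screenshots), "screenshots bundled for the judge")
```

Passing the stages in as callables keeps the bundle logic testable on its own; a real setup would swap the stubs for a model call and a headless-browser runner.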
This MLLM judge doesn’t just give a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring covers functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
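To make the checklist idea concrete, here is a small sketch of how per-item judge scores might be rolled up into per-metric and overall scores. Everything in it is an assumption for illustration: the `ChecklistItem` type, the 0-10 scale, the unweighted averaging, and the sample questions. Only the three metric names come from the description above; the other seven are not listed there, so they are not guessed at here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class ChecklistItem:
    metric: str    # which metric this item belongs to
    question: str  # what the judge is asked to verify
    score: float   # judge's score for this item, assumed to lie in 0-10


def score_by_metric(items: List[ChecklistItem]) -> Dict[str, float]:
    """Average the checklist scores within each metric."""
    grouped: Dict[str, List[float]] = {}
    for item in items:
        if not 0 <= item.score <= 10:
            raise ValueError(f"score out of range for {item.question!r}")
        grouped.setdefault(item.metric, []).append(item.score)
    return {metric: mean(scores) for metric, scores in grouped.items()}


def overall_score(per_metric: Dict[str, float]) -> float:
    """Unweighted mean across metrics (an assumption; real weights may differ)."""
    return mean(list(per_metric.values()))


# Made-up judge output for a single task.
items = [
    ChecklistItem("functionality", "Does the button update the counter?", 8.0),
    ChecklistItem("functionality", "Does the page load without errors?", 6.0),
    ChecklistItem("user experience", "Is feedback visible after a click?", 7.0),
    ChecklistItem("aesthetic quality", "Is the layout visually balanced?", 9.0),
]
per_metric = score_by_metric(items)
total = overall_score(per_metric)
print(per_metric, round(total, 2))
```

Averaging within each metric first stops a metric with many checklist items from dominating the overall score.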
The big question is, does this automated judge actually have good taste? The results suggest it does.
When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard human platform where real people vote on the best AI creations, they matched up with 94.4% consistency. This is a massive leap from older automated benchmarks, which only managed around 69.4% consistency.
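The post does not say how "consistency" between two rankings is computed. One common way to compare rankings, used here purely as an assumption, is pairwise agreement: the share of model pairs that both rankings put in the same order. The function name and the model names below are made up for the example.

```python
from itertools import combinations
from typing import Sequence


def pairwise_agreement(ranking_a: Sequence[str], ranking_b: Sequence[str]) -> float:
    """Fraction of item pairs that both rankings order the same way."""
    if (
        len(ranking_a) != len(ranking_b)
        or set(ranking_a) != set(ranking_b)
        or len(set(ranking_a)) != len(ranking_a)
    ):
        raise ValueError("rankings must contain the same unique items")
    pos_a = {name: i for i, name in enumerate(ranking_a)}
    pos_b = {name: i for i, name in enumerate(ranking_b)}
    pairs = list(combinations(ranking_a, 2))
    if not pairs:
        raise ValueError("need at least two items to compare")
    # A pair agrees when both rankings place x above y, or both place y above x.
    agree = sum((pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs)
    return agree / len(pairs)


# Hypothetical rankings, best model first.
benchmark_ranking = ["model_a", "model_b", "model_c", "model_d"]
human_ranking = ["model_a", "model_c", "model_b", "model_d"]

agreement = pairwise_agreement(benchmark_ranking, human_ranking)
print(f"{agreement:.1%} of pairs ordered the same way")
```

Here the two rankings disagree only on the order of `model_b` and `model_c`, so five of the six pairs match.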
On top of this, the framework’s judgments showed over 90% agreement with professional human developers.
https://www.artifici...gence-news.com/

Tencent improves testing creative AI models with new benchmark
Thread started by AntonioPlaps, Aug 16 2025 01:40 AM