arXiv AI recent: Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking
Researchers conducted a process-level analysis of web agents using a new benchmark called WebStep. The analysis revealed differences in agent performance that were not visible through tra...
The WebStep benchmark consists of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Because the tracking is automatic, agent performance can be analyzed at a fine-grained level without manual annotation, which makes it possible to identify the specific skills that need improvement.
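To make the idea of process-level evaluation concrete, here is a minimal sketch of how ordered semantic checkpoints could locate the point where an agent run stalls. It is an illustration only, not WebStep's actual implementation or API: the `Milestone` class, the `track_progress` function, the state keys, and the example task are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical representation: a page state is a flat dict of semantic facts.
State = Dict[str, object]


@dataclass
class Milestone:
    """A named semantic checkpoint (hypothetical, not part of WebStep)."""
    name: str
    check: Callable[[State], bool]


def track_progress(
    milestones: List[Milestone], trajectory: List[State]
) -> Tuple[int, Optional[str]]:
    """Walk the trajectory and count milestones satisfied in order.

    Returns the number of milestones reached and the name of the first
    unmet milestone (None if every milestone was reached).
    """
    reached = 0
    for state in trajectory:
        # One state may satisfy several consecutive milestones at once.
        while reached < len(milestones) and milestones[reached].check(state):
            reached += 1
    first_unmet = milestones[reached].name if reached < len(milestones) else None
    return reached, first_unmet


# Assumed example task: search for a laptop, apply a price filter, add to cart.
milestones = [
    Milestone("query_entered", lambda s: s.get("query") == "laptop"),
    Milestone("filter_applied", lambda s: s.get("filter") == "under_500"),
    Milestone("item_in_cart", lambda s: bool(s.get("cart"))),
]

# Assumed agent run: the query is entered, but the filter is never applied.
trajectory = [
    {"query": ""},
    {"query": "laptop"},
    {"query": "laptop", "filter": None},
]

reached, first_unmet = track_progress(milestones, trajectory)
print(reached, first_unmet)
```

A final-outcome metric would record this run only as a failure. The step-level view additionally reports that the agent stalled at the filtering step, which is the kind of skill-specific signal the summary describes.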