arXiv AI recent: LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings
Researchers introduced LongWebBench, a benchmark for evaluating long-horizon webpage generation from structural and functional perspectives. LongWebBench contains real-world long webpages...
LongWebBench has two complementary protocols: a multi-dimensional VLM-based metric and a DOM-augmented agent-based pipeline. The benchmark contains 490 real-world long webpages for structural fidelity evaluation and 507 goal-oriented interaction tasks over 129 webpages for functional evaluation.