arXiv AI recent: Dissecting model behavior through agent trajectories
Researchers developed a customizable harness called 'Simple Strands Agent' (SSA) to study the 'intent-execution gap' in AI agents. They applied it to 138k trajectories to...
The authors define the 'intent-execution gap' as the mismatch between a model's intentions and the harness's execution. Using SSA, the researchers reproduced or improved pass@1 performance on benchmarks including SWE-Pro, SWE-Verified, and Terminal-Bench-2 across model families such as Claude, Ge...
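For readers unfamiliar with the metric, pass@1 is commonly computed as the fraction of attempts that solve a task, averaged over tasks. The sketch below illustrates that common definition only. The summary does not say how the authors computed their scores, and the function name `pass_at_1`, the `(task_id, passed)` record layout, and the sample data are all hypothetical.

```python
from collections import defaultdict


def pass_at_1(trajectories):
    """Return the mean per-task success rate.

    `trajectories` is an iterable of (task_id, passed) pairs, a
    hypothetical record layout chosen for this illustration.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for task_id, passed in trajectories:
        attempts[task_id] += 1
        successes[task_id] += int(passed)
    if not attempts:
        raise ValueError("no trajectories given")
    # Average the per-task success fractions so that tasks with many
    # attempts do not dominate the score.
    return sum(successes[t] / attempts[t] for t in attempts) / len(attempts)


# Made-up sample data: three tasks, one of them attempted twice.
sample = [
    ("task-a", True),
    ("task-a", False),
    ("task-b", True),
    ("task-c", False),
]
score = pass_at_1(sample)
print(f"pass@1 = {score:.2f}")
```

Averaging per task rather than per trajectory keeps a task that was sampled many times from being weighted more heavily than one sampled once.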