arXiv AI recent: SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety
The authors introduced SciRisk-Bench, a benchmark designed to evaluate the safety of large language models (LLMs) used in AI for Science (AI4Science) workflows. SciRisk-Bench assesses models...
Large language models are increasingly embedded in AI4Science tasks such as scientific question answering, literature analysis, laboratory planning, and autonomous discovery. Existing AI4Science safety datasets cover several disciplines and task formats but leave the underlying risk dimensions un...