arXiv AI recent: How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech
Researchers proposed cross-attention attribution for speech diffusion models to understand how individual words influence acoustic output in style-captioned text-to-speech systems. The me...
The study used a dataset of 120 style captions, each conditioning the generation of 30 text transcripts, resulting in 3,600 combinations. The method extracts per-token heatmaps across 25 layers and 24 ODE steps, providing detailed information about the influence of individual words on acoustic out...
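
To make the numbers above concrete, here is a minimal sketch of how per-token heatmaps might be pooled over layers and ODE steps. Only the counts (120 captions, 30 transcripts, 25 layers, 24 steps) come from the summary. The tensor layout `attn[step][layer][frame][token]`, the simple averaging rule, the frame and token counts, and the helper names `count_pairs`, `token_heatmap`, and `random_attention` are all assumptions made for illustration. The paper's actual aggregation is not described here and may differ.

```python
import itertools
import random

# Counts taken from the summary.
N_CAPTIONS, N_TRANSCRIPTS = 120, 30
N_LAYERS, N_STEPS = 25, 24


def count_pairs(n_captions, n_transcripts):
    """Enumerate every (caption, transcript) pairing and count them."""
    pairs = itertools.product(range(n_captions), range(n_transcripts))
    return sum(1 for _ in pairs)


def token_heatmap(attn):
    """Average cross-attention weights over ODE steps and layers.

    attn[step][layer][frame][token] is a hypothetical layout.
    Returns heat[frame][token], one weight per audio frame and caption token.
    """
    n_steps, n_layers = len(attn), len(attn[0])
    n_frames, n_tokens = len(attn[0][0]), len(attn[0][0][0])
    heat = [[0.0] * n_tokens for _ in range(n_frames)]
    for step in attn:
        for layer in step:
            for f in range(n_frames):
                for t in range(n_tokens):
                    heat[f][t] += layer[f][t]
    scale = 1.0 / (n_steps * n_layers)
    return [[value * scale for value in row] for row in heat]


def random_attention(n_steps, n_layers, n_frames, n_tokens, seed=0):
    """Build stand-in attention maps whose rows sum to 1 (not real model output)."""
    rng = random.Random(seed)

    def row():
        raw = [rng.random() for _ in range(n_tokens)]
        total = sum(raw)
        return [r / total for r in raw]

    return [
        [[row() for _ in range(n_frames)] for _ in range(n_layers)]
        for _ in range(n_steps)
    ]


# Frame and token counts below are arbitrary placeholders.
attn = random_attention(N_STEPS, N_LAYERS, n_frames=8, n_tokens=5)
heat = token_heatmap(attn)
```

Averaging is only one possible pooling choice. Keeping the step and layer axes separate instead would show when during sampling, and at what depth, a given caption word has the most influence.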