evaluation harness
Automated framework executing tests to score model or agent performance across metrics.