evaluation harness

Automated framework executing tests to score model or agent performance across metrics.