Exam Certified Generative AI Engineer Associate topic 1 question 62 discussion

Actual exam question from Databricks's Certified Generative AI Engineer Associate

Question #: 62
Topic #: 1

A Generative AI Engineer has built an LLM-based system that will automatically translate user text between two languages. They now want to benchmark multiple LLMs on this task and pick the best one. They have an evaluation set with known high-quality translation examples. They want to evaluate each LLM using the evaluation set with a performant metric.

Which metric should they choose for this evaluation?
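
For context, translation output is commonly scored against reference translations with an n-gram overlap metric such as BLEU. The sketch below is a simplified, single-reference, sentence-level version written for illustration; it is not the official answer key and not a library implementation. The function name `simple_bleu`, the model names, and the example sentences are all hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. No smoothing is applied,
    so any n-gram order with zero overlap yields a score of 0.0."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical evaluation example: one reference, two model outputs.
reference = "the cat sits on the mat"
outputs = {
    "model_a": "the cat sits on the mat",
    "model_b": "a cat is on a mat",
}

scores = {name: simple_bleu(text, reference) for name, text in outputs.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

In practice one would average a corpus-level score over the whole evaluation set for each LLM and use an established implementation rather than this sketch.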