Use LLMs to evaluate model outputs with structured judgments
LLM-as-a-judge is a powerful pattern for evaluating AI outputs using another LLM as the evaluator. This approach is particularly effective for subjective criteria like helpfulness, tone, or creativity that are difficult to measure programmatically.
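The core loop is simple: build a prompt from the output under evaluation, ask a judge model for a label, and constrain the reply to an allowed label set. A minimal framework-free sketch, with `call_llm` stubbed out as a hypothetical stand-in for a real model client so the example runs offline:

```python
def call_llm(prompt: str) -> str:
    # Stub: pretend the judge model replied with a single label.
    return "professional"

def judge(output: str, choices: list[str]) -> str:
    prompt = (
        f"Classify the tone of this message: {output}\n"
        f"Respond with exactly one of: {', '.join(choices)}."
    )
    raw = call_llm(prompt).strip().lower()
    # Constrain the reply to the allowed label set; flag anything else.
    return raw if raw in choices else "unparseable"

label = judge("Dear team, please find the report attached.",
              ["professional", "casual", "aggressive"])
print(label)  # -> professional
```

The "unparseable" fallback matters in practice: judge models occasionally reply with prose instead of a bare label, and silently accepting free-form output corrupts downstream metrics.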
For chat models, you can use a list of message dictionaries:
```python
from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o")

evaluator = ClassificationEvaluator(
    name="tone",
    llm=llm,
    prompt_template=[
        {"role": "system", "content": "You are an expert at evaluating tone."},
        {"role": "user", "content": "Classify the tone of this message: {output}"}
    ],
    choices=["professional", "casual", "aggressive"]
)
```
Templates can also be plain strings, and stronger templates spell out the evaluation criteria explicitly:

```python
prompt_template = """Evaluate if the response is concise.

Response: {output}

A concise response:
- Answers the question directly
- Uses minimal words without sacrificing clarity
- Avoids repetition and unnecessary elaboration

Classify as concise or verbose."""
```
```python
# Too vague - "good" is undefined (see the prompt-writing tips below)
prompt_template = "Is this response good? {output}"
```
Choices can also be written as a mapping from label to a (score, description) pair, but this format is not recommended: LLMs do not reliably use the descriptions when classifying.
```python
# Not recommended - descriptions are often ignored
choices = {
    "accurate": (1.0, "Response is factually correct"),
    "inaccurate": (0.0, "Response contains errors")
}
```
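If you want numeric scores attached to labels, a simpler alternative is to keep plain label choices and map them to scores after classification. A sketch with an illustrative `score_label` helper (not the library's API):

```python
choices = ["accurate", "inaccurate"]
scores = {"accurate": 1.0, "inaccurate": 0.0}

def score_label(label: str) -> float:
    """Map a judge label to a numeric score; NaN for unknown labels."""
    return scores.get(label, float("nan"))

print(score_label("accurate"))  # -> 1.0
```

This keeps the prompt the LLM sees free of score metadata it would ignore anyway, while still letting you aggregate results numerically.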
When choosing your label set, think through the edge cases up front:

- Off-topic responses: should these get a separate label?
- Ambiguous cases: how will you handle borderline examples?
```python
prompt_template = """Classify this response as helpful or not_helpful.

Question: {input}
Response: {output}

Classify as:
- helpful: Response directly answers the question
- not_helpful: Response is off-topic, empty, or unhelpful"""
```
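Placeholders like {input} and {output} are filled from the payload you pass at evaluation time. To see exactly what the judge model receives, you can render a template by hand; this sketch assumes standard `str.format`-style substitution:

```python
template = (
    "Classify this response as helpful or not_helpful.\n"
    "Question: {input}\n"
    "Response: {output}"
)
filled = template.format(
    input="What is the capital of France?",
    output="Paris",
)
print(filled)
```

Rendering a few examples by hand is a cheap way to catch missing placeholders or formatting surprises before running a full evaluation.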
Avoid subjective terms like “good”, “bad”, or “quality” without defining them.
```python
# Bad - undefined criteria
"Is this response good?"

# Good - specific criteria
"Does this response answer all parts of the question with accurate information?"
```
Avoid leading language that biases the evaluation:
```python
# Bad - leading question
"This response is probably wrong. Is it incorrect?"

# Good - neutral framing
"Based on the context, is this response factually correct?"
```
Always pass the evaluator every field your template references; a judge cannot verify an answer without the question it answers:

```python
# Bad - missing context
evaluator.evaluate({"output": "Paris"})

# Good - includes question
evaluator.evaluate({
    "input": "What is the capital of France?",
    "output": "Paris"
})
```