LLM Instruction Quality Evaluation
Evaluation guide: blakecutler.com/ai/llm.md
[Add your LLM instructions here]
You are evaluating a set of LLM instructions (system prompt, eval rubric, or task spec) for clarity, scorability, and completeness. Score it 0–10 and suggest improvements.
0 = criterion clearly violated | 1 = partially met, with notable lapses | 2 = fully met throughout
1. Task Definition. Is it immediately clear what the model should do?
- 2: A single, specific task is stated upfront; no interpretation required
- 1: Task is stated but a key detail (format, audience, scope) is left implicit, or it could reasonably produce 2+ different outputs
- 0: Task is absent, purely descriptive, or so broad any output could qualify
2. Checkable Success Criteria. Can a human scorer determine pass/fail without judgment calls?
- 2: Every criterion maps to an observable output property (e.g., "cites a source," "under 100 words," "returns valid JSON")
- 1: Some criteria are concrete but others rely on subjective calls like "sounds natural" or "is helpful"
- 0: No concrete criteria; correctness is left entirely to scorer intuition
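The "observable output property" bar above can be read as: a property a script could verify. A minimal sketch of mechanical checks for the two example criteria (function names and the 100-word limit are illustrative, not part of the rubric):

```python
import json

def under_word_limit(text: str, limit: int = 100) -> bool:
    # Observable property: output is under `limit` words.
    return len(text.split()) < limit

def is_valid_json(text: str) -> bool:
    # Observable property: output parses as JSON.
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False
```

A criterion like "sounds natural" admits no such check, which is why it scores 1 at best.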
3. Edge Case Handling. Does the instruction define behavior when inputs are ambiguous, off-topic, or malformed?
- 2: At least one edge case is named with explicit handling instructions
- 1: Edge cases are acknowledged but given no explicit handling, or are covered only implicitly
- 0: Undefined; a model following these instructions would behave unpredictably on non-ideal inputs
4. Internal Consistency. Do the instructions contradict themselves?
- 2: No directives conflict or create impossible tradeoffs
- 1: Minor tension exists in low-stakes parts of the instruction
- 0: Instructions directly conflict (e.g., "always explain your reasoning" paired with "respond in one sentence")
5. Appropriate Constraint Level. Are the instructions scoped correctly — neither over-specified nor under-specified?
- 2: Instructions constrain what matters and leave appropriate latitude elsewhere
- 1: Noticeably over-specified (constraining irrelevant details) or under-specified (leaving critical decisions open)
- 0: So rigid the model can't handle normal variation, or so loose the task is undefined
Note: This rubric is for structured task instructions and eval prompts. It is not suited for open-ended creative or conversational system prompts where intentional flexibility is the design.
Output format
Total: [score]/10 [One-sentence verdict]
Suggestions: for each deduction, [quoted passage] → [suggested rewrite]
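A hypothetical filled-in example (the scores, verdict, and quoted passage are invented for illustration):

```
Total: 7/10 Clear task with checkable criteria, but undefined edge cases and one contradiction cost points.
Suggestions: "Be concise but explain every step in detail" → "Explain each step in at most one sentence."
```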