LLM Instruction Quality Evaluation
Evaluation guide: blakecutler.com/ai/llm.md
[Add your LLM instructions here]
You are evaluating a set of LLM instructions (system prompt, eval rubric, or task spec) for clarity, scorability, and completeness. Score it 0–10 and suggest improvements.
0 = criterion clearly violated | 1 = partially met, with notable lapses | 2 = fully met throughout
1. Task Definition. Is it immediately clear what the model should do?
- 2: A single, specific task is stated upfront; no interpretation required
- 1: Task is stated but a key detail (format, audience, scope) is left implicit, or it could reasonably produce 2+ different outputs
- 0: Task is absent, purely descriptive, or so broad any output could qualify
2. Checkable Success Criteria. Can a human scorer determine pass/fail without judgment calls?
- 2: Every criterion maps to an observable output property (e.g., "cites a source," "under 100 words," "returns valid JSON")
- 1: Some criteria are concrete but others rely on subjective calls like "sounds natural" or "is helpful"
- 0: No concrete criteria; correctness is left entirely to scorer intuition
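The "observable output property" bar above can be read as: a property a script could verify. A minimal sketch of mechanical checks for the two example criteria (function names and the 100-word limit are illustrative, not part of the rubric):

```python
import json

def under_word_limit(text: str, limit: int = 100) -> bool:
    # Observable property: output is under `limit` words.
    return len(text.split()) < limit

def is_valid_json(text: str) -> bool:
    # Observable property: output parses as JSON.
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False
```

A criterion like "sounds natural" admits no such check, which is why it scores 1 at best.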
3. Edge Case Handling. Does the instruction define behavior when inputs are ambiguous, off-topic, or malformed?
- 2: At least one edge case is named with explicit handling instructions
- 1: Edge cases are acknowledged but given no explicit handling, or are covered only implicitly
- 0: Undefined; a model following these instructions would behave unpredictably on non-ideal inputs
4. Internal Consistency. Do the instructions contradict themselves?
- 2: No directives conflict or create impossible tradeoffs
- 1: Minor tension exists in low-stakes parts of the instruction
- 0: Instructions directly conflict (e.g., "always explain your reasoning" paired with "respond in one sentence")
5. Appropriate Constraint Level. Are the instructions scoped correctly — neither over-specified nor under-specified?
- 2: Instructions constrain what matters and leave appropriate latitude elsewhere
- 1: Noticeably over-specified (constraining irrelevant details) or under-specified (leaving critical decisions open)
- 0: So rigid the model can't handle normal variation, or so loose the task is undefined
Note: This rubric is for structured task instructions and eval prompts. It is not suited for open-ended creative or conversational system prompts where intentional flexibility is the design.
Output format
Total: [score]/10 [One-sentence verdict]
Suggestions: for each deduction, [quoted passage] → [suggested rewrite]
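A hypothetical filled-in example (the scores, verdict, and quoted passage are invented for illustration):

```
Total: 7/10 Clear task with checkable criteria, but undefined edge cases and one contradiction cost points.
Suggestions: "Be concise but explain every step in detail" → "Explain each step in at most one sentence."
```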