Topic 5.2

Evaluating AI Output

How to judge what's good enough

⏱️ 12 minutes 📋 Prompt Templates ✓ Quality Checklist

The Problem

AI just generated a learning objective. Or a scenario. Or an entire module.

Question: Is it good?

You have 60 seconds to decide: ship it, fix it, or scrap it.

This matters. Accept bad output and you waste time polishing garbage. Reject good output and you waste time regenerating what was fine.

You need a fast, repeatable process for evaluation.

The Five Quality Criteria

Every AI output gets evaluated on these five dimensions:

✓ Five quality criteria with priority levels
  • Accuracy – Is the information factually correct? – Critical
  • Relevance – Does it fit your specific context and audience? – Critical
  • Completeness – Does it cover what's needed without gaps? – Important
  • Clarity – Is it clear and understandable? – Important
  • Tone – Does it match our organizational voice? – Nice-to-have

Decision logic:

  • Accuracy or Relevance fails → Reject immediately
  • Completeness or Clarity fails → Refine with follow-up prompts
  • Tone fails → Quick manual edit

Critical issues get rejected. Fixable issues get refined.
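
If it helps to see this triage in concrete terms, here's a minimal Python sketch of the decision logic, assuming each criterion has already been scored pass/fail. Every name in it (the triage function, the criterion sets) is hypothetical, not from any tool:

    # Minimal sketch of the triage logic, assuming pass/fail scores per
    # criterion. All names are illustrative, not from a real library.

    CRITICAL = {"accuracy", "relevance"}    # fail -> reject immediately
    FIXABLE  = {"completeness", "clarity"}  # fail -> refine with follow-up prompts
    COSMETIC = {"tone"}                     # fail -> quick manual edit

    def triage(results):
        """Map per-criterion pass/fail results to ship/refine/edit/reject."""
        failed = {name for name, passed in results.items() if not passed}
        if failed & CRITICAL:
            return "reject"   # wrong foundation: regenerate, don't polish
        if failed & FIXABLE:
            return "refine"   # good bones: fix with a follow-up prompt
        if failed & COSMETIC:
            return "edit"     # quick manual tone pass
        return "ship"

    # Example: accurate and relevant, but incomplete -> "refine"
    print(triage({"accuracy": True, "relevance": True,
                  "completeness": False, "clarity": True, "tone": True}))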

Red Flags: Reject Immediately

Some output isn't worth fixing. Recognize these and start over.

🚩 Five red flags: When to reject output
  • Factual errors – Compliance training misrepresents a regulation – You can't build on a wrong foundation
  • Wrong audience level – Asked for beginner, got expert-level technical content – Easier to regenerate than to simplify
  • Generic and vague – "Employees should follow proper procedures" – No value; could apply to anyone
  • Misunderstood request – Asked for a scenario, got a case study – It didn't do what you asked
  • Wrong tone – Casual for compliance; jargon for frontline staff – A tone mismatch undermines credibility

💡 The rule

If the foundation is wrong, rebuild. Don't polish garbage.

Green Flags: Refine and Use

These outputs are worth improving.

✅ Four green flags: Worth refining
  • Accurate but incomplete – Right action verb, missing context – "Add context about the manufacturing floor"
  • Relevant but unclear – Realistic scenario, confusing structure – "Simplify. Use shorter sentences."
  • Complete but generic – Correct questions, could apply anywhere – "Make it specific to our retail environment"
  • Clear but wrong tone – Understandable, just too formal – Quick edit, or "Rewrite this conversationally"

💡 The rule

Good bones, needs refinement. That's what AI is for.

The 60-Second Process

Run this workflow for every AI output:

⏱️ Five-step evaluation (60 seconds total)
  1. Accuracy – Factually correct? (10 sec) – If it fails: reject, regenerate
  2. Relevance – Fits our context? (10 sec) – If it fails: reject, regenerate
  3. Completeness – Any gaps? (15 sec) – If it fails: note what's missing
  4. Clarity – Understandable? (15 sec) – If it fails: note what's confusing
  5. Tone – Sounds like us? (10 sec) – If it fails: quick edit or follow-up prompt

60 seconds total. Decision: ship, refine, or reject.
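
If your team automates any part of review, the same pass can be expressed as an ordered checklist with an early exit on the two critical checks. Again, this is a hypothetical Python sketch, assuming a reviewer (or a rubric) supplies yes/no answers; none of these names come from a real tool:

    # The five checks in priority order; the first two are critical.
    CHECKS = [
        ("accuracy",     "Factually correct?", True),   # critical
        ("relevance",    "Fits our context?",  True),   # critical
        ("completeness", "Any gaps?",          False),
        ("clarity",      "Understandable?",    False),
        ("tone",         "Sounds like us?",    False),
    ]

    def sixty_second_pass(answers):
        """Walk checks in order; 'answers' maps check name -> passed (bool)."""
        to_fix = []
        for name, question, critical in CHECKS:
            if answers[name]:
                continue                       # passed: move on
            if critical:
                return f"reject ({question})"  # critical fail: stop, regenerate
            to_fix.append(name)                # fixable fail: note it, keep going
        if to_fix:
            return "refine: " + ", ".join(to_fix)
        return "ship"

    # Example: accurate, relevant, clear, right tone, but has gaps
    print(sixty_second_pass({"accuracy": True, "relevance": True,
                             "completeness": False, "clarity": True,
                             "tone": True}))   # -> "refine: completeness"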

What "Good Enough" Looks Like

You're not aiming for perfection. You're aiming for "better than starting from scratch."

✅ Good Enough

  • Saves you time overall (even with refinement)
  • Gets you 70-80% there
  • Gives you something to react to
  • Requires expertise to finish, not to create

❌ Not Good Enough

  • Takes longer to fix than to write yourself
  • Requires complete reconstruction
  • Misses the mark so badly you're starting over

💡 The test

If you're spending more time fighting with AI than you would have spent writing it yourself, stop. Write it yourself or try a completely different prompt.

Example 1: Learning Objective

AI output: "By the end of this training, learners will understand customer service best practices."

60-second evaluation:

  • ✅ Accuracy – Yes, nothing in it is factually wrong
  • ✅ Relevance – Yes, matches our training topic
  • ❌ Completeness – "Understand" isn't measurable
  • ✅ Clarity – Clear, just not specific
  • ✅ Tone – Fine

Decision: Refine.

Follow-up prompt: "Make this specific. Use an action verb from Bloom's taxonomy. Focus on handling difficult customers in retail."

Result: "By the end of this training, retail associates will de-escalate difficult customer situations using the three-step resolution framework."

Time: 90 seconds total. Better than writing from scratch.

Key Takeaways

  1. 60-second evaluation saves hours. Fast triage prevents wasting time polishing bad output.
  2. Critical fails = reject. Wrong facts or context? Start over immediately.
  3. Fixable fails = refine. Incomplete or unclear? That's what follow-up prompts are for.
  4. Good enough = 70-80% there. You're not aiming for perfection from AI—just a strong starting point.

Try It Now

🎯 Your task:

Generate a learning objective for your current project. Run the 60-second evaluation. Does it pass? If not, what failed? Refine or reject accordingly.

The test: Can you make the decision in under 60 seconds?

📥 Download: Quality evaluation checklist (PDF)

One-page checklist for evaluating AI output in 60 seconds.
