1. Why Human Evaluation Matters
A tech company once deployed an AI with 98% test accuracy. Users hated it. The system was technically correct but missed what people actually needed. This is the paradox: perfect metrics can mean real-world failure.
1.1 The Human Touch in AI
Human evaluation bridges mathematics and messy human reality:
- Validity Gap: BLEU scores say a translation is great, but it still sounds robotic
- Subjective Qualities: “Helpful” vs “patronizing” - metrics can’t capture this
- Ultimate Truth: If humans don’t find it useful, metrics don’t matter
1.2 Measurement Fundamentals
Key principle: Measuring the right thing badly beats measuring the wrong thing perfectly.
Reliability vs Validity
- Reliability: Consistency (same result each time)
- Validity: Truth (the right result)
- You can have reliability without validity, never validity without reliability
Four Types of Validity
- Construct: Are we measuring what we think? (Intelligence vs test-taking)
- Criterion: Does it predict real outcomes? (Quality scores vs user satisfaction)
- Content: Are we covering everything? (Not just knife skills, but taste too)
- Face: Does it make sense? (Common sense check)
1.3 Real Example: AFib Detection via Smartwatch
The Challenge: Can your smartwatch really detect atrial fibrillation and prevent strokes?
Construct Validity Issue:
- AFib can be constant or intermittent (paroxysmal)
- Smartwatch takes 30-second readings
- Result: Excellent for active AFib, misses intermittent episodes
Clinical Reality:
- Moderate construct validity - detects active AFib well
- Limited for full clinical picture
- Not a failure, but users need to understand limitations
2. Inter-Rater Reliability: When Experts Disagree
2.1 Cohen’s Kappa: Beyond Simple Agreement
Cohen’s insight (1960): two raters answering “yes/no” at random agree about 50% of the time, so 70% observed agreement is only 20 points better than coin flips.
Formula: $$\kappa = \frac{P_o - P_e}{1 - P_e}$$
Where: $P_o$ = observed agreement, $P_e$ = expected agreement by chance
Interpretation:
- 0.00-0.20: Slight
- 0.21-0.40: Fair
- 0.41-0.60: Moderate
- 0.61-0.80: Substantial
- 0.81-1.00: Almost perfect
```python
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(rater1, rater2)

# Use weighted kappa for ordinal data
weighted_kappa = cohen_kappa_score(rater1, rater2, weights='quadratic')
```
2.1.1 Sleep Stage Scoring Example
Two technicians scoring EEG sleep stages often disagree on boundaries:
Results:
- Unweighted κ = 0.65: Substantial agreement
- Weighted κ = 0.72: Better when considering ordinal nature
- Common disagreements: Wake/N1 boundary, N2/N3 distinction
2.2 Fleiss’ Kappa
Extends Cohen’s approach to more than two raters: observed agreement is averaged over all rater pairs and chance agreement is computed from the overall category proportions (a minimal sketch follows).
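A minimal sketch using statsmodels, with a small hypothetical ratings matrix (rows are items, columns are raters; the labels and category codes are illustrative):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: one row per item, one column per rater,
# integer category labels (0, 1, 2).
ratings = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [2, 2, 2],
    [0, 0, 0],
    [1, 2, 2],
])

# Convert (items x raters) labels into (items x categories) counts.
table, categories = aggregate_raters(ratings)
print(f"Fleiss' kappa = {fleiss_kappa(table, method='fleiss'):.3f}")
```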
2.3 Krippendorff’s Alpha
Advantages:
- Handles missing data
- Works with any number of raters
- Supports various data types
$$\alpha = 1 - \frac{D_o}{D_e}$$
Where: $D_o$ = observed disagreement, $D_e$ = expected disagreement
2.3.1 Multi-Radiologist Tumor Segmentation
Five radiologists, same brain MRI, different tumor boundaries. Krippendorff’s Alpha handles this mess:
Why Krippendorff’s Alpha:
- Handles missing data (Radiologist 3 missed 2 cases)
- Works with ordinal scales and multiple raters
- α = 0.743: Good reliability for treatment planning
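A minimal sketch of how such an analysis might be run, assuming the third-party `krippendorff` package and hypothetical ordinal grades, with `np.nan` marking the cases Radiologist 3 skipped:

```python
import numpy as np
import krippendorff  # pip install krippendorff (third-party package)

# Hypothetical data: rows are the five radiologists, columns are cases,
# values are ordinal grades, np.nan marks missing annotations.
reliability_data = np.array([
    [1, 2, 3, 3, 2, 1],
    [1, 2, 3, 3, 2, 2],
    [np.nan, 2, 3, 3, np.nan, 1],  # Radiologist 3 missed two cases
    [1, 2, 3, 3, 2, 1],
    [2, 2, 3, 3, 2, 1],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement='ordinal')
print(f"Krippendorff's alpha = {alpha:.3f}")
```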
3. Scale Design Essentials
3.1 Likert Scales: The Psychology of Ratings
Key Decisions:
- Odd (5-point): Allows neutral middle - good when true neutrality exists
- Even (6-point): Forces choice - prevents fence-sitting
- Optimal range: 5-7 points (matches cognitive capacity for distinctions)
3.1.1 Example: Athletic Recovery Scale
Key insights:
- 5-point preferred for daily logging
- Anchors must include physical + physiological markers
- Validate against objective metrics (HRV)
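One way to run that validation check is to correlate the ordinal recovery ratings with an objective marker such as HRV. A minimal sketch with hypothetical paired measurements:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical daily data: 5-point recovery ratings and HRV (rMSSD, ms).
recovery_rating = np.array([2, 3, 4, 5, 3, 2, 4, 5, 1, 3])
hrv_rmssd_ms = np.array([38, 45, 52, 61, 47, 35, 55, 64, 30, 44])

# Spearman's rho suits an ordinal rating vs a continuous criterion.
rho, p_value = spearmanr(recovery_rating, hrv_rmssd_ms)
print(f"rho = {rho:.2f}, p = {p_value:.3f}")
```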
3.2 Semantic Differential
Rate between bipolar adjectives:
```
Safe |---|---|---|---|---| Dangerous
       1   2   3   4   5
```
4. Classical Test Theory (CTT) vs Item Response Theory (IRT)
4.1 CTT: Simple but Limited
$$X = T + E$$
Observed score = True score + Error
Cronbach’s Alpha: $$\alpha = \frac{k}{k-1} \left(1 - \frac{\sum_i \text{Var}(X_i)}{\text{Var}(X_{\text{total}})}\right)$$
Assumptions: Random error, fixed difficulty, parallel items
4.1.1 CTT in Practice
```python
import numpy as np

def calculate_cronbach_alpha(ratings_matrix):
    """Cronbach's alpha for a matrix with items as rows and respondents as columns."""
    n_items = ratings_matrix.shape[0]
    item_var = np.sum(np.var(ratings_matrix, axis=1))    # sum of per-item variances
    total_var = np.var(np.sum(ratings_matrix, axis=0))   # variance of total scores
    return (n_items / (n_items - 1)) * (1 - item_var / total_var)

# α > 0.9: Excellent
# α > 0.8: Good
# α > 0.7: Acceptable
```
Limitations: Fixed difficulty assumption, no rater-item interactions
4.2 IRT: The Modern Approach
Rasch Model: $$P(X=1 \mid \theta, \beta) = \frac{\exp(\theta-\beta)}{1 + \exp(\theta-\beta)}$$
- θ = ability, β = difficulty
- Models each item separately
- Enables adaptive testing
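A minimal sketch of the Rasch probability and the item information it implies (the ability and difficulty values are illustrative):

```python
import numpy as np

def rasch_probability(theta, beta):
    """P(correct response) for ability theta and item difficulty beta."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

def item_information(theta, beta):
    """Fisher information p * (1 - p); maximal when difficulty matches ability."""
    p = rasch_probability(theta, beta)
    return p * (1 - p)

print(rasch_probability(theta=0.5, beta=0.0))  # ~0.62
print(item_information(theta=0.5, beta=0.5))   # 0.25, the maximum possible
```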
4.2.1 Adaptive Testing with IRT
```python
def select_next_item(ability_estimate, item_bank):
    """Select the most informative item for the current ability estimate.

    Fisher information p * (1 - p) is maximal when p ≈ 0.5, i.e. when an
    item's difficulty matches the examinee's ability, so pick the closest match.
    """
    return min(item_bank,
               key=lambda item: abs(item['difficulty'] - ability_estimate))
```
4.3 When to Use Which?
- CTT: Small samples (<200), quick reliability checks, homogeneous items
- IRT: Large scale (>500), adaptive testing, varied difficulty, item banks
5. Annotation Guidelines That Work
5.1 Essential Structure
1. **Task Definition**: Clear objective
2. **Categories**: Exhaustive, mutually exclusive
3. **Decision Tree**: If X then Y logic
4. **Edge Cases**: Explicit handling
5. **Examples**: Positive, negative, borderline
5.1.1 Example: ECG Annotation Guide
Key elements of the guide:
1. Clear task definition
2. Exhaustive categories
3. Decision tree for ambiguous cases
4. Edge case handling
5. Concrete examples
Focus on decision trees - they prevent analysis paralysis
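To make the decision-tree idea concrete, here is a purely illustrative sketch; the flags, categories, and rules are hypothetical and not clinical guidance:

```python
def classify_rhythm(readable: bool, regular_rhythm: bool, p_waves_present: bool) -> str:
    """Hypothetical decision tree for an annotation guide (not a clinical protocol)."""
    if not readable:
        return "uninterpretable"        # edge case: artifact-dominated segment
    if regular_rhythm and p_waves_present:
        return "normal_sinus"
    if not regular_rhythm and not p_waves_present:
        return "possible_afib"
    return "flag_for_expert_review"     # ambiguous: escalate rather than guess

print(classify_rhythm(readable=True, regular_rhythm=False, p_waves_present=False))
```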
6. Sample Size: Do the Math First
6.1 The Formula That Saves Budgets
$$n = \frac{(Z_\alpha + Z_\beta)^2 \, \sigma^2}{\delta^2}$$
Where: $Z_\alpha$ and $Z_\beta$ are the critical values for the chosen significance level and power, $\sigma$ is the outcome standard deviation, and $\delta$ is the minimum difference worth detecting.
```python
import numpy as np
import statsmodels.stats.power as smp

def calculate_sample_size(effect_size=0.5, alpha=0.05, power=0.8):
    """Per-group n for a two-sample t-test at the given effect size, alpha, and power."""
    n = smp.TTestIndPower().solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )
    return int(np.ceil(n * 1.2))  # 20% buffer for dropouts and unusable ratings
```
6.1.1 Items Needed
Quick rules of thumb:
- Pilot: 30+ items
- Research: 100+ items
- Publication: 200+ items
- Clinical: 300+ items
6.1.2 Number of Raters
- Exploratory: 2-3 raters
- Research: 3-5 raters
- Clinical: 5-7 raters
- Gold standard: 7-9 raters
Diminishing returns after 5 raters
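The diminishing returns can be illustrated with the Spearman-Brown prophecy formula, applied here to a hypothetical single-rater reliability of 0.60:

```python
def spearman_brown(r_single, k):
    """Reliability of the average of k raters, given single-rater reliability."""
    return (k * r_single) / (1 + (k - 1) * r_single)

for k in (1, 3, 5, 7, 9):
    print(k, round(spearman_brown(0.60, k), 2))
# 1 -> 0.6, 3 -> 0.82, 5 -> 0.88, 7 -> 0.91, 9 -> 0.93: gains shrink past ~5 raters
```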
6.2 Stopping Rules
- Fixed n: Stop after N items
- Precision-based: Stop when CI width < threshold (see the sketch after this list)
- Sequential: Stop when reliability stabilizes
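A minimal sketch of the precision-based rule, using a bootstrap confidence interval for Cohen’s kappa; the bootstrap settings and width threshold are hypothetical choices:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_ci_width(rater1, rater2, n_boot=1000, seed=0):
    """Width of the bootstrap 95% CI for Cohen's kappa."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    rng = np.random.default_rng(seed)
    n = len(rater1)
    kappas = [cohen_kappa_score(rater1[idx], rater2[idx])
              for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    low, high = np.percentile(kappas, [2.5, 97.5])
    return high - low

def should_stop(rater1, rater2, max_width=0.2):
    return kappa_ci_width(rater1, rater2) < max_width
```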
6.3 Missing Data Strategies
- Complete case: Use only fully annotated (<5% missing)
- Pairwise deletion: Use available pairs (5-15% missing)
- Krippendorff’s α: Built-in handling (any pattern)
- Multiple imputation: Model-based (15-30% missing)
7. Implementation Tips
- Drift Detection: Monitor agreement over time with sliding windows (see the sketch below)
- Calibration: Identify and correct systematic rater biases
- Quality Control: Weight aggregation by rater reliability
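A minimal sketch of the sliding-window idea for two raters; the window size and alert threshold are hypothetical:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def sliding_kappa(rater1, rater2, window=50):
    """Cohen's kappa over consecutive windows of annotated items."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    return [cohen_kappa_score(rater1[i:i + window], rater2[i:i + window])
            for i in range(0, len(rater1) - window + 1, window)]

def drift_alert(kappas, drop=0.15):
    """Flag drift if the latest window falls well below the first window."""
    return len(kappas) > 1 and (kappas[0] - kappas[-1]) > drop
```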
8. Health-Specific Considerations
- Risk stratification: Higher agreement needed for high-risk decisions
- Clinical significance > statistical significance
- Safety thresholds: κ > 0.8 for critical decisions
9. Common Pitfalls
9.1 The Prevalence Paradox
95% agreement but κ = 0.3? When one category dominates (95% negative), chance agreement is high. Kappa adjusts for this.
Solution: Use prevalence-adjusted metrics (PABAK) or report multiple metrics
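For the two-rater case, PABAK can be computed alongside the standard kappa; a minimal sketch with hypothetical, heavily skewed labels:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def pabak(rater1, rater2, n_categories=2):
    """Prevalence-adjusted, bias-adjusted kappa: (k*P_o - 1)/(k - 1)."""
    p_o = np.mean(np.asarray(rater1) == np.asarray(rater2))  # observed agreement
    return (n_categories * p_o - 1) / (n_categories - 1)     # 2*P_o - 1 when binary

# Hypothetical labels where 95% of items are negative.
r1 = [0] * 95 + [1] * 5
r2 = [0] * 93 + [1, 1, 0, 0, 1, 0, 1]
print(cohen_kappa_score(r1, r2))  # ~0.42 despite 95% raw agreement
print(pabak(r1, r2))              # 0.90
```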
9.2 Rater Fatigue
Signs: Decreasing variance, increasing speed, default responses
Solution: Shorter sessions, regular breaks, quality checks
10. Key Takeaways
The Framework That Works
- Start with purpose: What decision will this inform?
- Validity over reliability: Right thing badly > wrong thing perfectly
- Calculate sample size first: Save money, avoid embarrassment
- Embrace disagreement: It reveals edge cases
The Golden Rule
Human evaluation isn’t about eliminating subjectivity - it’s about harnessing it systematically.
Annotator disagreement often reveals the most interesting problems in your task definition.
Quick Reference
| Metric | Use Case | Formula | Threshold |
|---|---|---|---|
| Cohen’s κ | 2 raters, categorical | (P_o - P_e)/(1 - P_e) | >0.6 good |
| Fleiss’ κ | Multiple raters, categorical | Multi-rater extension of κ | >0.6 good |
| Krippendorff’s α | Any scenario, missing data | 1 - (D_o/D_e) | >0.667 acceptable |
| Cronbach’s α | Internal consistency | k/(k-1) × (1 - Σσ²ᵢ/σ²ₜ) | >0.7 acceptable |
© 2025 Seyed Yahya Shirazi. All rights reserved.