1. Why Human Evaluation Matters

A tech company once deployed an AI with 98% test accuracy. Users hated it. The system was technically correct but missed what people actually needed. This is the paradox: perfect metrics can mean real-world failure.

1.1 The Human Touch in AI

Human evaluation bridges mathematics and messy human reality:

  • Validity Gap: BLEU scores say a translation is great, yet it still reads as robotic
  • Subjective Qualities: “helpful” vs “patronizing” - metrics can’t capture the difference
  • Ultimate Truth: if humans don’t find it useful, the metrics don’t matter

1.2 Measurement Fundamentals

Key principle: Measuring the right thing badly beats measuring the wrong thing perfectly.

Reliability vs Validity

  • Reliability: Consistency (same result each time)
  • Validity: Truth (the right result)
  • You can have reliability without validity, never validity without reliability

Four Types of Validity

  1. Construct: Are we measuring what we think? (Intelligence vs test-taking)
  2. Criterion: Does it predict real outcomes? (Quality scores vs user satisfaction)
  3. Content: Are we covering everything? (Not just knife skills, but taste too)
  4. Face: Does it make sense? (Common sense check)

1.3 Real Example: AFib Detection via Smartwatch

The Challenge: Can your smartwatch really detect atrial fibrillation and prevent strokes?

Construct Validity Issue:

  • AFib can be constant or intermittent (paroxysmal)
  • Smartwatch takes 30-second readings
  • Result: Excellent for active AFib, misses intermittent episodes

Clinical Reality:

  • Moderate construct validity - detects active AFib well
  • Limited for full clinical picture
  • Not a failure, but users need to understand limitations

2. Inter-Rater Reliability: When Experts Disagree

2.1 Cohen’s Kappa: Beyond Simple Agreement

Cohen’s insight (1960): two raters assigning “yes/no” labels at random already agree about 50% of the time, so 70% observed agreement is only 20 percentage points above chance.

Formula: $$κ = \frac{P_o - P_e}{1 - P_e}$$

Where: $P_o$ = observed agreement, $P_e$ = expected agreement by chance

Interpretation:

  • κ ≤ 0.20: Slight agreement
  • 0.21-0.40: Fair
  • 0.41-0.60: Moderate
  • 0.61-0.80: Substantial
  • 0.81-1.00: Almost perfect
from sklearn.metrics import cohen_kappa_score

# Illustrative labels from two raters (swap in your own annotations)
rater1 = [1, 0, 2, 1, 0, 2, 1, 1, 0, 2]
rater2 = [1, 0, 2, 0, 0, 2, 1, 2, 0, 2]
kappa = cohen_kappa_score(rater1, rater2)
# Use weighted kappa for ordinal data
weighted_kappa = cohen_kappa_score(rater1, rater2, weights='quadratic')

2.1.1 Sleep Stage Scoring Example

Two technicians scoring EEG sleep stages often disagree on boundaries:

Results:

  • Unweighted κ = 0.65: Substantial agreement
  • Weighted κ = 0.72: Better when considering ordinal nature
  • Common disagreements: Wake/N1 boundary, N2/N3 distinction

2.2 Fleiss’ Kappa

For more than two raters, Fleiss’ kappa extends the same chance-corrected agreement idea to the whole rater pool.
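
A minimal sketch with statsmodels, whose inter_rater module provides aggregate_raters and fleiss_kappa; the three-rater labels below are made up for illustration:

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy data: 8 items labeled by 3 raters with categories 0/1/2 (illustrative only)
ratings = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [2, 2, 2],
    [0, 1, 0],
    [1, 1, 1],
    [2, 1, 2],
    [0, 0, 1],
    [2, 2, 2],
])
counts, _ = aggregate_raters(ratings)   # items x categories count table
print(f"Fleiss' kappa: {fleiss_kappa(counts):.3f}")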

2.3 Krippendorff’s Alpha

Advantages:

  • Handles missing data
  • Works with any number of raters
  • Supports various data types

$$α = 1 - \frac{D_o}{D_e}$$

Where: $D_o$ = observed disagreement, $D_e$ = expected disagreement

2.3.1 Multi-Radiologist Tumor Segmentation

Five radiologists, same brain MRI, different tumor boundaries. Krippendorff’s Alpha handles this mess:

Why Krippendorff’s Alpha:

  • Handles missing data (Radiologist 3 missed 2 cases)
  • Works with ordinal scales and multiple raters
  • α = 0.743: Good reliability for treatment planning
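
In code, this kind of analysis can be sketched with the third-party krippendorff package (pip install krippendorff): raters are rows, items are columns, and np.nan marks missing annotations. The toy matrix below is illustrative, not the radiology data:

import numpy as np
import krippendorff  # third-party package: pip install krippendorff

# Rows = raters, columns = items; np.nan marks missing annotations (toy data)
reliability_data = np.array([
    [1, 2, 3, 3, 2, 1, 4, 1, 2, np.nan],
    [1, 2, 3, 3, 2, 2, 4, 1, 2, 5],
    [np.nan, 3, 3, 3, 2, 3, 4, 2, 2, 5],
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha (ordinal): {alpha:.3f}")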

3. Scale Design Essentials

3.1 Likert Scales: The Psychology of Ratings

Key Decisions:

  • Odd (5-point): Allows neutral middle - good when true neutrality exists
  • Even (6-point): Forces choice - prevents fence-sitting
  • Optimal range: 5-7 points (matches cognitive capacity for distinctions)

3.1.1 Example: Athletic Recovery Scale

Key insights:

  • 5-point preferred for daily logging
  • Anchors must include physical + physiological markers
  • Validate against objective metrics such as HRV (see the correlation sketch below)
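
One way to run that check is a rank correlation between the subjective ratings and the objective marker; a minimal sketch with scipy, using made-up numbers for the recovery scores and morning HRV:

import numpy as np
from scipy.stats import spearmanr

# Hypothetical daily data: 1-5 recovery ratings and morning HRV (RMSSD, ms)
recovery = np.array([4, 3, 5, 2, 4, 1, 3, 5, 2, 4])
hrv_rmssd = np.array([68, 55, 74, 41, 63, 35, 52, 80, 44, 60])
rho, p = spearmanr(recovery, hrv_rmssd)  # rank correlation suits ordinal ratings
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")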

3.2 Semantic Differential

Rate between bipolar adjectives:

Safe |---|---|---|---|---| Dangerous
  1     2     3     4     5

4. Classical Test Theory (CTT) vs Item Response Theory (IRT)

4.1 CTT: Simple but Limited

$$X = T + E$$

Observed score = True score + Error

Cronbach’s Alpha: $$α = \frac{k}{k-1} \left(1 - \frac{\sum \text{Var}(X_i)}{\text{Var}(X_{total})}\right)$$

Assumptions: Random error, fixed difficulty, parallel items

4.1.1 CTT in Practice

import numpy as np

def calculate_cronbach_alpha(ratings_matrix):
    """Cronbach's alpha for a (n_items x n_respondents) matrix."""
    n_items = ratings_matrix.shape[0]
    item_var = np.sum(np.var(ratings_matrix, axis=1, ddof=1))   # variance of each item across respondents
    total_var = np.var(np.sum(ratings_matrix, axis=0), ddof=1)  # variance of respondents' total scores
    return (n_items / (n_items - 1)) * (1 - item_var / total_var)

# α > 0.9: Excellent
# α > 0.8: Good
# α > 0.7: Acceptable

Limitations: Fixed difficulty assumption, no rater-item interactions

4.2 IRT: The Modern Approach

Rasch Model: $$P(X=1|θ,β) = \frac{\exp(θ-β)}{1 + \exp(θ-β)}$$

  • θ = ability, β = difficulty
  • Models each item separately
  • Enables adaptive testing
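
A small sketch of the Rasch model above, plus the Fisher information p(1 − p) that drives adaptive item selection; both are standard formulas, and the numbers are arbitrary:

import numpy as np

def rasch_probability(theta, beta):
    """P(correct response) under the Rasch model: exp(θ-β) / (1 + exp(θ-β))."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

def item_information(theta, beta):
    """Fisher information p * (1 - p); maximal when item difficulty ≈ ability."""
    p = rasch_probability(theta, beta)
    return p * (1 - p)

theta = 0.5                        # current ability estimate
for beta in (-1.0, 0.5, 2.0):      # an easy, a matched, and a hard item
    print(beta, round(rasch_probability(theta, beta), 2), round(item_information(theta, beta), 3))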

4.2.1 Adaptive Testing with IRT

def select_next_item(ability_estimate, item_bank):
    """Select the most informative item from a bank of {'difficulty': ...} dicts."""
    # Fisher Information p * (1 - p) is maximal when p ≈ 0.5,
    # which happens when item difficulty ≈ current ability estimate
    return min(item_bank,
               key=lambda item: abs(item['difficulty'] - ability_estimate))

4.3 When to Use Which?

  • CTT: Small samples (<200), quick reliability checks, homogeneous items
  • IRT: Large scale (>500), adaptive testing, varied difficulty, item banks

5. Annotation Guidelines That Work

5.1 Essential Structure

1. **Task Definition**: Clear objective
2. **Categories**: Exhaustive, mutually exclusive
3. **Decision Tree**: If X then Y logic
4. **Edge Cases**: Explicit handling
5. **Examples**: Positive, negative, borderline

5.1.1 Example: ECG Annotation Guide

An ECG annotation guide follows the same skeleton:

  1. Clear task definition
  2. Exhaustive, mutually exclusive categories
  3. Decision tree for ambiguous cases
  4. Explicit edge-case handling
  5. Concrete positive, negative, and borderline examples

Focus on the decision tree; it prevents analysis paralysis (see the sketch below).
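
To make the “if X then Y” logic concrete, here is a toy decision tree as code; the categories and cutoffs are invented for illustration and are not a clinical protocol:

def annotate_rhythm(regular_rhythm: bool, p_waves_present: bool, rate_bpm: float) -> str:
    """Toy annotation decision tree; labels and thresholds are illustrative only."""
    if regular_rhythm and p_waves_present:
        if rate_bpm > 100:
            return "sinus_tachycardia"
        if rate_bpm < 60:
            return "sinus_bradycardia"
        return "normal_sinus_rhythm"
    if not regular_rhythm and not p_waves_present:
        return "possible_afib_send_for_review"
    return "other_escalate_to_adjudicator"

print(annotate_rhythm(regular_rhythm=True, p_waves_present=True, rate_bpm=72))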

6. Sample Size: Do the Math First

6.1 The Formula That Saves Budgets

$$n = \frac{(Z_α + Z_β)^2 \, σ^2}{δ^2}$$

import numpy as np
import statsmodels.stats.power as smp

def calculate_sample_size(effect_size=0.5, alpha=0.05, power=0.8):
    """Per-group n for a two-sided, two-sample t-test, plus a 20% attrition buffer."""
    n = smp.TTestIndPower().solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )
    return int(np.ceil(n * 1.2))  # 20% buffer
# With the defaults (d = 0.5), solve_power gives roughly 64 per group; the buffer brings this to 77

6.1.1 Items Needed

Quick rules of thumb:

  • Pilot: 30+ items
  • Research: 100+ items
  • Publication: 200+ items
  • Clinical: 300+ items

6.1.2 Number of Raters

  • Exploratory: 2-3 raters
  • Research: 3-5 raters
  • Clinical: 5-7 raters
  • Gold standard: 7-9 raters

Diminishing returns set in after 5 raters; the Spearman-Brown sketch below shows why.
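
The diminishing returns follow directly from the Spearman-Brown prophecy formula, which predicts the reliability of the average of k parallel raters from a single-rater reliability; a minimal sketch with an assumed single-rater reliability of 0.6:

def spearman_brown(single_rater_reliability, n_raters):
    """Predicted reliability of the mean of n_raters parallel raters."""
    r = single_rater_reliability
    return (n_raters * r) / (1 + (n_raters - 1) * r)

for k in (1, 2, 3, 5, 7, 9):
    print(k, round(spearman_brown(0.6, k), 3))
# 1: 0.6, 2: 0.75, 3: 0.818, 5: 0.882, 7: 0.913, 9: 0.931 - gains flatten after ~5 raters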

6.2 Stopping Rules

  • Fixed n: Stop after N items
  • Precision-based: Stop when CI width < threshold (see the bootstrap sketch after this list)
  • Sequential: Stop when reliability stabilizes
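
A hedged sketch of the precision-based rule: bootstrap a 95% confidence interval for Cohen's kappa by resampling items, and keep annotating until the interval is narrow enough. The 0.10 width threshold is an arbitrary choice:

import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_ci_width(rater1, rater2, n_boot=2000, seed=0):
    """Width of a bootstrap 95% CI for Cohen's kappa, resampling items with replacement."""
    rng = np.random.default_rng(seed)
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    n = len(rater1)
    kappas = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        kappas.append(cohen_kappa_score(rater1[idx], rater2[idx]))
    lo, hi = np.nanpercentile(kappas, [2.5, 97.5])  # nan from degenerate resamples is ignored
    return hi - lo

# Hypothetical stopping check:
# while kappa_ci_width(rater1, rater2) > 0.10: collect_more_annotations()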

6.3 Missing Data Strategies

  • Complete case: Use only fully annotated (<5% missing)
  • Pairwise deletion: Use available pairs (5-15% missing)
  • Krippendorff’s α: Built-in handling (any pattern)
  • Multiple imputation: Model-based (15-30% missing)

7. Implementation Tips

Drift Detection: Monitor agreement over time with sliding windows
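
A minimal sketch of drift monitoring: recompute Cohen's kappa over a sliding window of items in annotation order and flag windows that drop below a floor. Window size, step, and the 0.6 floor are arbitrary choices here:

import numpy as np
from sklearn.metrics import cohen_kappa_score

def sliding_window_kappa(rater1, rater2, window=50, step=10):
    """Yield (start_index, kappa) over consecutive windows of items, in annotation order."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    for start in range(0, len(rater1) - window + 1, step):
        yield start, cohen_kappa_score(rater1[start:start + window],
                                       rater2[start:start + window])

# Hypothetical alerting loop:
# for start, k in sliding_window_kappa(rater1, rater2):
#     if k < 0.6:
#         print(f"Agreement drift after item {start}: kappa = {k:.2f}")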

Calibration: Identify and correct systematic rater biases

Quality Control: Weight aggregation by rater reliability
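
A minimal sketch of reliability-weighted aggregation for a single item: each rater's vote counts in proportion to an externally estimated reliability (for example, agreement with a gold set); the labels and weights below are placeholders:

def weighted_majority(labels, reliabilities):
    """Return the label with the highest reliability-weighted vote for one item."""
    scores = {}
    for label, weight in zip(labels, reliabilities):
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)

print(weighted_majority(["afib", "normal", "afib"], [0.9, 0.5, 0.7]))  # -> "afib"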

8. Health-Specific Considerations

  • Risk stratification: Higher agreement needed for high-risk decisions
  • Clinical significance > statistical significance
  • Safety thresholds: κ > 0.8 for critical decisions

9. Common Pitfalls

9.1 The Prevalence Paradox

95% agreement but κ = 0.3? When one category dominates (95% negative), chance agreement is high. Kappa adjusts for this.

Solution: Use prevalence-adjusted metrics (PABAK) or report multiple metrics
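
PABAK depends only on observed agreement: for k categories it is (k·P_o − 1)/(k − 1), which reduces to 2·P_o − 1 in the binary case. A small sketch:

import numpy as np

def pabak(rater1, rater2, n_categories=2):
    """Prevalence-adjusted bias-adjusted kappa: (k * P_o - 1) / (k - 1)."""
    p_o = np.mean(np.asarray(rater1) == np.asarray(rater2))  # observed agreement
    return (n_categories * p_o - 1) / (n_categories - 1)

# 19 of 20 binary labels agree (P_o = 0.95) -> PABAK = 0.90,
# even though extreme prevalence drags Cohen's kappa toward zero here
print(pabak([0] * 19 + [1], [0] * 20))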

9.2 Rater Fatigue

Signs: Decreasing variance, increasing speed, default responses

Solution: Shorter sessions, regular breaks, quality checks

10. Key Takeaways

The Framework That Works

  1. Start with purpose: What decision will this inform?
  2. Validity over reliability: Right thing badly > wrong thing perfectly
  3. Calculate sample size first: Save money, avoid embarrassment
  4. Embrace disagreement: It reveals edge cases

The Golden Rule

Human evaluation isn’t about eliminating subjectivity - it’s about harnessing it systematically.

Annotator disagreement often reveals the most interesting problems in your task definition.

Quick Reference

| Metric | Use Case | Formula | Threshold |
| --- | --- | --- | --- |
| Cohen’s κ | 2 raters, categorical | (P_o - P_e) / (1 - P_e) | > 0.6 good |
| Fleiss’ κ | Multiple raters, categorical | Complex | > 0.6 good |
| Krippendorff’s α | Any scenario, missing data | 1 - D_o/D_e | > 0.667 acceptable |
| Cronbach’s α | Internal consistency | k/(k-1) × (1 - Σσ²ᵢ/σ²ₜ) | > 0.7 acceptable |

© 2025 Seyed Yahya Shirazi. All rights reserved.