1. Why Human Evaluation Matters

A tech company once deployed an AI with 98% test accuracy. Users hated it. The system was technically correct but missed what people actually needed. This is the paradox: perfect metrics can mean real-world failure.

1.1 The Human Touch in AI

Human evaluation bridges mathematics and messy human reality:

  • Validity Gap: BLEU scores say a translation is great, yet it still reads as robotic
  • Subjective Qualities: “helpful” vs “patronizing” - metrics can’t capture the difference
  • Ultimate Truth: if humans don’t find it useful, the metrics don’t matter

1.2 Measurement Fundamentals

Key principle: Measuring the right thing badly beats measuring the wrong thing perfectly.

Reliability vs Validity

  • Reliability: Consistency (same result each time)
  • Validity: Truth (the right result)
  • You can have reliability without validity, never validity without reliability

Four Types of Validity

  1. Construct: Are we measuring what we think? (Intelligence vs test-taking)
  2. Criterion: Does it predict real outcomes? (Quality scores vs user satisfaction)
  3. Content: Are we covering everything? (Not just knife skills, but taste too)
  4. Face: Does it make sense? (Common sense check)

1.3 Real Example: AFib Detection via Smartwatch

The Challenge: Can your smartwatch really detect atrial fibrillation and prevent strokes?

Construct Validity Issue:

  • AFib can be constant or intermittent (paroxysmal)
  • Smartwatch takes 30-second readings
  • Result: Excellent for active AFib, misses intermittent episodes

Clinical Reality:

  • Moderate construct validity - detects active AFib well
  • Limited for full clinical picture
  • Not a failure, but users need to understand limitations

2. Inter-Rater Reliability: When Experts Disagree

2.1 Cohen’s Kappa: Beyond Simple Agreement

Cohen’s insight (1960): two raters assigning “yes/no” labels at random already agree about 50% of the time, so 70% observed agreement is only 20 percentage points above chance.

Formula: $$κ = \frac{P_o - P_e}{1 - P_e}$$

Where: $P_o$ = observed agreement, $P_e$ = expected agreement by chance

Interpretation:

  • κ ≤ 0.20: Slight agreement
  • 0.21-0.40: Fair
  • 0.41-0.60: Moderate
  • 0.61-0.80: Substantial
  • 0.81-1.00: Almost perfect
from sklearn.metrics import cohen_kappa_score

# Illustrative labels from two raters (swap in your own annotations)
rater1 = [1, 0, 2, 1, 0, 2, 1, 1, 0, 2]
rater2 = [1, 0, 2, 0, 0, 2, 1, 2, 0, 2]
kappa = cohen_kappa_score(rater1, rater2)
# Use weighted kappa for ordinal data
weighted_kappa = cohen_kappa_score(rater1, rater2, weights='quadratic')

2.1.1 Sleep Stage Scoring Example

Two technicians scoring EEG sleep stages often disagree on boundaries:

Results:

  • Unweighted κ = 0.65: Substantial agreement
  • Weighted κ = 0.72: Better when considering ordinal nature
  • Common disagreements: Wake/N1 boundary, N2/N3 distinction

2.2 Fleiss’ Kappa

For more than two raters, Fleiss’ kappa extends the same chance-corrected agreement idea to the whole rater pool.
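
A minimal sketch with statsmodels, whose inter_rater module provides aggregate_raters and fleiss_kappa; the three-rater labels below are made up for illustration:

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy data: 8 items labeled by 3 raters with categories 0/1/2 (illustrative only)
ratings = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [2, 2, 2],
    [0, 1, 0],
    [1, 1, 1],
    [2, 1, 2],
    [0, 0, 1],
    [2, 2, 2],
])
counts, _ = aggregate_raters(ratings)   # items x categories count table
print(f"Fleiss' kappa: {fleiss_kappa(counts):.3f}")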

2.3 Krippendorff’s Alpha

Advantages:

  • Handles missing data
  • Works with any number of raters
  • Supports various data types

$$α = 1 - \frac{D_o}{D_e}$$

Where: $D_o$ = observed disagreement, $D_e$ = expected disagreement

2.3.1 Multi-Radiologist Tumor Segmentation

Five radiologists, same brain MRI, different tumor boundaries. Krippendorff’s Alpha handles this mess:

Why Krippendorff’s Alpha:

  • Handles missing data (Radiologist 3 missed 2 cases)
  • Works with ordinal scales and multiple raters
  • α = 0.743: Good reliability for treatment planning
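
In code, this kind of analysis can be sketched with the third-party krippendorff package (pip install krippendorff): raters are rows, items are columns, and np.nan marks missing annotations. The toy matrix below is illustrative, not the radiology data:

import numpy as np
import krippendorff  # third-party package: pip install krippendorff

# Rows = raters, columns = items; np.nan marks missing annotations (toy data)
reliability_data = np.array([
    [1, 2, 3, 3, 2, 1, 4, 1, 2, np.nan],
    [1, 2, 3, 3, 2, 2, 4, 1, 2, 5],
    [np.nan, 3, 3, 3, 2, 3, 4, 2, 2, 5],
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha (ordinal): {alpha:.3f}")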

3. Scale Design Essentials

3.1 Likert Scales: The Psychology of Ratings

Key Decisions:

  • Odd (5-point): Allows neutral middle - good when true neutrality exists
  • Even (6-point): Forces choice - prevents fence-sitting
  • Optimal range: 5-7 points (matches cognitive capacity for distinctions)

3.1.1 Example: Athletic Recovery Scale

Key insights:

  • 5-point preferred for daily logging
  • Anchors must include physical + physiological markers
  • Validate against objective metrics such as HRV (see the correlation sketch below)
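
One way to run that check is a rank correlation between the subjective ratings and the objective marker; a minimal sketch with scipy, using made-up numbers for the recovery scores and morning HRV:

import numpy as np
from scipy.stats import spearmanr

# Hypothetical daily data: 1-5 recovery ratings and morning HRV (RMSSD, ms)
recovery = np.array([4, 3, 5, 2, 4, 1, 3, 5, 2, 4])
hrv_rmssd = np.array([68, 55, 74, 41, 63, 35, 52, 80, 44, 60])
rho, p = spearmanr(recovery, hrv_rmssd)  # rank correlation suits ordinal ratings
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")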

3.2 Semantic Differential

Rate between bipolar adjectives:

Safe |---|---|---|---|---| Dangerous
  1     2     3     4     5

4. Classical Test Theory (CTT) vs Item Response Theory (IRT)

4.1 CTT: Simple but Limited

$$X = T + E$$

Observed score = True score + Error

Cronbach’s Alpha: $$α = \frac{k}{k-1} \left(1 - \frac{\sum \text{Var}(X_i)}{\text{Var}(X_{total})}\right)$$

Assumptions: Random error, fixed difficulty, parallel items

4.1.1 CTT in Practice

import numpy as np

def calculate_cronbach_alpha(ratings_matrix):
    """Cronbach's alpha for a (n_items x n_respondents) matrix."""
    n_items = ratings_matrix.shape[0]
    item_var = np.sum(np.var(ratings_matrix, axis=1, ddof=1))   # variance of each item across respondents
    total_var = np.var(np.sum(ratings_matrix, axis=0), ddof=1)  # variance of respondents' total scores
    return (n_items / (n_items - 1)) * (1 - item_var / total_var)

# α > 0.9: Excellent
# α > 0.8: Good
# α > 0.7: Acceptable

Limitations: Fixed difficulty assumption, no rater-item interactions

4.2 IRT: The Modern Approach

Rasch Model: $$P(X=1|θ,β) = \frac{\exp(θ-β)}{1 + \exp(θ-β)}$$

  • θ = ability, β = difficulty
  • Models each item separately
  • Enables adaptive testing
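
A small sketch of the Rasch model above, plus the Fisher information p(1 − p) that drives adaptive item selection; both are standard formulas, and the numbers are arbitrary:

import numpy as np

def rasch_probability(theta, beta):
    """P(correct response) under the Rasch model: exp(θ-β) / (1 + exp(θ-β))."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

def item_information(theta, beta):
    """Fisher information p * (1 - p); maximal when item difficulty ≈ ability."""
    p = rasch_probability(theta, beta)
    return p * (1 - p)

theta = 0.5                        # current ability estimate
for beta in (-1.0, 0.5, 2.0):      # an easy, a matched, and a hard item
    print(beta, round(rasch_probability(theta, beta), 2), round(item_information(theta, beta), 3))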

4.2.1 Adaptive Testing with IRT

def select_next_item(ability_estimate, item_bank):
    """Select the most informative item from a bank of {'difficulty': ...} dicts."""
    # Fisher Information p * (1 - p) is maximal when p ≈ 0.5,
    # which happens when item difficulty ≈ current ability estimate
    return min(item_bank,
               key=lambda item: abs(item['difficulty'] - ability_estimate))

4.3 When to Use Which?

  • CTT: Small samples (<200), quick reliability checks, homogeneous items
  • IRT: Large scale (>500), adaptive testing, varied difficulty, item banks

5. Annotation Guidelines That Work

5.1 Essential Structure

1. **Task Definition**: Clear objective
2. **Categories**: Exhaustive, mutually exclusive
3. **Decision Tree**: If X then Y logic
4. **Edge Cases**: Explicit handling
5. **Examples**: Positive, negative, borderline

5.1.1 Example: ECG Annotation Guide

An ECG annotation guide follows the same skeleton:

  1. Clear task definition
  2. Exhaustive, mutually exclusive categories
  3. Decision tree for ambiguous cases
  4. Explicit edge-case handling
  5. Concrete positive, negative, and borderline examples

Focus on the decision tree; it prevents analysis paralysis (see the sketch below).
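
To make the “if X then Y” logic concrete, here is a toy decision tree as code; the categories and cutoffs are invented for illustration and are not a clinical protocol:

def annotate_rhythm(regular_rhythm: bool, p_waves_present: bool, rate_bpm: float) -> str:
    """Toy annotation decision tree; labels and thresholds are illustrative only."""
    if regular_rhythm and p_waves_present:
        if rate_bpm > 100:
            return "sinus_tachycardia"
        if rate_bpm < 60:
            return "sinus_bradycardia"
        return "normal_sinus_rhythm"
    if not regular_rhythm and not p_waves_present:
        return "possible_afib_send_for_review"
    return "other_escalate_to_adjudicator"

print(annotate_rhythm(regular_rhythm=True, p_waves_present=True, rate_bpm=72))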

6. Sample Size: Do the Math First

6.1 The Formula That Saves Budgets

$$n = \frac{(Z_α + Z_β)^2 \, σ^2}{δ^2}$$

import numpy as np
import statsmodels.stats.power as smp

def calculate_sample_size(effect_size=0.5, alpha=0.05, power=0.8):
    """Per-group n for a two-sided, two-sample t-test, plus a 20% attrition buffer."""
    n = smp.TTestIndPower().solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )
    return int(np.ceil(n * 1.2))  # 20% buffer
# With the defaults (d = 0.5), solve_power gives roughly 64 per group; the buffer brings this to 77

6.1.1 Items Needed

Quick rules of thumb:

  • Pilot: 30+ items
  • Research: 100+ items
  • Publication: 200+ items
  • Clinical: 300+ items

6.1.2 Number of Raters

  • Exploratory: 2-3 raters
  • Research: 3-5 raters
  • Clinical: 5-7 raters
  • Gold standard: 7-9 raters

Diminishing returns set in after 5 raters; the Spearman-Brown sketch below shows why.
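
The diminishing returns follow directly from the Spearman-Brown prophecy formula, which predicts the reliability of the average of k parallel raters from a single-rater reliability; a minimal sketch with an assumed single-rater reliability of 0.6:

def spearman_brown(single_rater_reliability, n_raters):
    """Predicted reliability of the mean of n_raters parallel raters."""
    r = single_rater_reliability
    return (n_raters * r) / (1 + (n_raters - 1) * r)

for k in (1, 2, 3, 5, 7, 9):
    print(k, round(spearman_brown(0.6, k), 3))
# 1: 0.6, 2: 0.75, 3: 0.818, 5: 0.882, 7: 0.913, 9: 0.931 - gains flatten after ~5 raters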

6.2 Stopping Rules

  • Fixed n: Stop after N items
  • Precision-based: Stop when CI width < threshold (see the bootstrap sketch after this list)
  • Sequential: Stop when reliability stabilizes
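
A hedged sketch of the precision-based rule: bootstrap a 95% confidence interval for Cohen's kappa by resampling items, and keep annotating until the interval is narrow enough. The 0.10 width threshold is an arbitrary choice:

import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_ci_width(rater1, rater2, n_boot=2000, seed=0):
    """Width of a bootstrap 95% CI for Cohen's kappa, resampling items with replacement."""
    rng = np.random.default_rng(seed)
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    n = len(rater1)
    kappas = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        kappas.append(cohen_kappa_score(rater1[idx], rater2[idx]))
    lo, hi = np.nanpercentile(kappas, [2.5, 97.5])  # nan from degenerate resamples is ignored
    return hi - lo

# Hypothetical stopping check:
# while kappa_ci_width(rater1, rater2) > 0.10: collect_more_annotations()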

6.3 Missing Data Strategies

  • Complete case: Use only fully annotated (<5% missing)
  • Pairwise deletion: Use available pairs (5-15% missing)
  • Krippendorff’s α: Built-in handling (any pattern)
  • Multiple imputation: Model-based (15-30% missing)

7. Implementation Tips

Drift Detection: Monitor agreement over time with sliding windows
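
A minimal sketch of drift monitoring: recompute Cohen's kappa over a sliding window of items in annotation order and flag windows that drop below a floor. Window size, step, and the 0.6 floor are arbitrary choices here:

import numpy as np
from sklearn.metrics import cohen_kappa_score

def sliding_window_kappa(rater1, rater2, window=50, step=10):
    """Yield (start_index, kappa) over consecutive windows of items, in annotation order."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    for start in range(0, len(rater1) - window + 1, step):
        yield start, cohen_kappa_score(rater1[start:start + window],
                                       rater2[start:start + window])

# Hypothetical alerting loop:
# for start, k in sliding_window_kappa(rater1, rater2):
#     if k < 0.6:
#         print(f"Agreement drift after item {start}: kappa = {k:.2f}")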

Calibration: Identify and correct systematic rater biases

Quality Control: Weight aggregation by rater reliability
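
A minimal sketch of reliability-weighted aggregation for a single item: each rater's vote counts in proportion to an externally estimated reliability (for example, agreement with a gold set); the labels and weights below are placeholders:

def weighted_majority(labels, reliabilities):
    """Return the label with the highest reliability-weighted vote for one item."""
    scores = {}
    for label, weight in zip(labels, reliabilities):
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)

print(weighted_majority(["afib", "normal", "afib"], [0.9, 0.5, 0.7]))  # -> "afib"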

8. Health-Specific Considerations

  • Risk stratification: Higher agreement needed for high-risk decisions
  • Clinical significance > statistical significance
  • Safety thresholds: κ > 0.8 for critical decisions

9. Common Pitfalls

9.1 The Prevalence Paradox

95% agreement but κ = 0.3? When one category dominates (95% negative), chance agreement is high. Kappa adjusts for this.

Solution: Use prevalence-adjusted metrics (PABAK) or report multiple metrics
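
PABAK depends only on observed agreement: for k categories it is (k·P_o − 1)/(k − 1), which reduces to 2·P_o − 1 in the binary case. A small sketch:

import numpy as np

def pabak(rater1, rater2, n_categories=2):
    """Prevalence-adjusted bias-adjusted kappa: (k * P_o - 1) / (k - 1)."""
    p_o = np.mean(np.asarray(rater1) == np.asarray(rater2))  # observed agreement
    return (n_categories * p_o - 1) / (n_categories - 1)

# 19 of 20 binary labels agree (P_o = 0.95) -> PABAK = 0.90,
# even though extreme prevalence drags Cohen's kappa toward zero here
print(pabak([0] * 19 + [1], [0] * 20))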

9.2 Rater Fatigue

Signs: Decreasing variance, increasing speed, default responses

Solution: Shorter sessions, regular breaks, quality checks

10. Key Takeaways

The Framework That Works

  1. Start with purpose: What decision will this inform?
  2. Validity over reliability: Right thing badly > wrong thing perfectly
  3. Calculate sample size first: Save money, avoid embarrassment
  4. Embrace disagreement: It reveals edge cases

The Golden Rule

Human evaluation isn’t about eliminating subjectivity - it’s about harnessing it systematically.

Annotator disagreement often reveals the most interesting problems in your task definition.

Quick Reference

| Metric | Use Case | Formula | Threshold |
| --- | --- | --- | --- |
| Cohen’s κ | 2 raters, categorical | (P_o - P_e) / (1 - P_e) | > 0.6 good |
| Fleiss’ κ | Multiple raters, categorical | Complex | > 0.6 good |
| Krippendorff’s α | Any scenario, missing data | 1 - D_o/D_e | > 0.667 acceptable |
| Cronbach’s α | Internal consistency | k/(k-1) × (1 - Σσ²ᵢ/σ²ₜ) | > 0.7 acceptable |

© 2025 Seyed Yahya Shirazi. All rights reserved.