1. Core Concepts & Theoretical Foundation

1.1 What is Human Evaluation in AI Context?

Human evaluation bridges the gap between computational metrics and real-world utility. It’s essential because:

  • Validity Gap: Automatic metrics often don’t correlate with human judgment
  • Subjective Qualities: Aspects like helpfulness, safety, and coherence require human assessment
  • Ground Truth: Humans provide the ultimate validation for AI outputs

1.2 Measurement Theory Fundamentals

Reliability vs. Validity

  • Reliability: Consistency of measurement (same results under same conditions)
  • Validity: Measuring what you intend to measure
  • Remember: A measure can be reliable without being valid, but not vice versa

Types of Validity:

  1. Construct Validity: Does the measure capture the theoretical construct?
  2. Criterion Validity: Does it correlate with external criteria?
  3. Content Validity: Does it cover all aspects of the construct?
  4. Face Validity: Does it appear to measure what it claims?

2. Inter-Rater Reliability Metrics

2.1 Cohen’s Kappa (κ)

Algorithm Logic: Measures agreement between two raters, accounting for chance agreement

Formula:

κ = (P_o - P_e) / (1 - P_e)

Where:
- P_o = observed agreement (proportion of items where raters agree)
- P_e = expected agreement by chance

Algorithmic Steps:

  1. Create confusion matrix of rater agreements
  2. Calculate observed agreement (diagonal sum / total)
  3. Calculate expected agreement (marginal probabilities)
  4. Apply formula
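
These steps translate directly into a short from-scratch implementation (a sketch for two raters and categorical labels; cohens_kappa_from_scratch is an illustrative name, not a library function):

import numpy as np

def cohens_kappa_from_scratch(rater1, rater2):
    """Follows the steps above: confusion matrix → P_o → P_e → κ."""
    categories = sorted(set(rater1) | set(rater2))
    idx = {c: i for i, c in enumerate(categories)}
    # 1. Confusion matrix of rater agreements
    cm = np.zeros((len(categories), len(categories)))
    for a, b in zip(rater1, rater2):
        cm[idx[a], idx[b]] += 1
    # 2. Observed agreement: diagonal sum / total
    p_o = np.trace(cm) / cm.sum()
    # 3. Expected agreement from the marginal probabilities
    p_e = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / cm.sum() ** 2
    # 4. Apply the kappa formula
    return (p_o - p_e) / (1 - p_e)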

Interpretation:

  • κ < 0: Less than chance agreement
  • 0 ≤ κ ≤ 0.20: Slight agreement
  • 0.21 ≤ κ ≤ 0.40: Fair agreement
  • 0.41 ≤ κ ≤ 0.60: Moderate agreement
  • 0.61 ≤ κ ≤ 0.80: Substantial agreement
  • 0.81 ≤ κ ≤ 1.00: Almost perfect agreement

Code Snippet:

from sklearn.metrics import cohen_kappa_score

def cohens_kappa(rater1, rater2):
    """
    Algorithm:
    1. Build contingency table
    2. Calculate observed agreement
    3. Calculate expected agreement from marginals
    4. Apply kappa formula
    """
    return cohen_kappa_score(rater1, rater2)

# Weighted version for ordinal data
def weighted_kappa(rater1, rater2, weights='quadratic'):
    """
    weights='quadratic' penalizes disagreements more as distance increases;
    weights='linear' applies a linear penalty.
    """
    return cohen_kappa_score(rater1, rater2, weights=weights)

2.2 Fleiss’ Kappa

Use Case: Multiple raters (>2), categorical ratings

Algorithm Logic:

  1. Calculate proportion of rater pairs in agreement for each subject
  2. Average these proportions across all subjects
  3. Adjust for chance agreement

Formula:

κ = (P̄ - P̄_e) / (1 - P̄_e)

Where P̄ is the mean observed (pairwise) agreement across subjects and P̄_e is the expected agreement by chance, computed from the overall category proportions
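
A minimal sketch using statsmodels (assumed available), with ratings arranged as an items × raters array of category labels:

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# items × raters matrix of categorical labels (toy data)
ratings = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [2, 2, 2],
])

# aggregate_raters converts it into an items × categories count table
count_table, _ = aggregate_raters(ratings)
print(fleiss_kappa(count_table, method='fleiss'))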

2.3 Krippendorff’s Alpha

Advantages:

  • Works with any number of raters
  • Handles missing data
  • Supports various data types (nominal, ordinal, interval, ratio)

Algorithm:

α = 1 - (D_o / D_e)

Where:
- D_o = observed disagreement
- D_e = expected disagreement by chance

Key Insight: Alpha quantifies disagreement rather than agreement, which is what lets it generalize across data types, rater counts, and missing data

Code Pattern:

import krippendorff

def calculate_alpha(data_matrix, level='nominal'):
    """
    data_matrix: raters × items matrix (use np.nan for missing ratings)
    Algorithm:
    1. Calculate pairwise differences (observed disagreement)
    2. Weight differences by data type (nominal, ordinal, interval, ratio)
    3. Compare against the disagreement expected from random pairing
    """
    return krippendorff.alpha(reliability_data=data_matrix,
                              level_of_measurement=level)

3. Scale Development & Design

3.1 Likert Scales

Design Principles:

  • Odd vs Even: Odd allows neutral midpoint, even forces choice
  • Anchoring: Clear labels for endpoints and midpoint
  • Range: 5-7 points optimal (balance granularity vs cognitive load)

Common Patterns:

# Standard 5-point Likert
likert_5 = {
    1: "Strongly Disagree",
    2: "Disagree", 
    3: "Neutral",
    4: "Agree",
    5: "Strongly Agree"
}

# 7-point for more granularity
likert_7 = {1: "Strongly Disagree", ..., 7: "Strongly Agree"}

3.2 Semantic Differential

Concept: Rate between bipolar adjectives

Safe |---|---|---|---|---| Dangerous
  1     2     3     4     5
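
In code, this is just a list of bipolar anchor pairs plus a numeric position between them (a sketch; the adjective pairs are illustrative):

# 5-point semantic differential: 1 = left anchor, 5 = right anchor
semantic_differential_items = [
    ("Safe", "Dangerous"),
    ("Clear", "Confusing"),
    ("Helpful", "Unhelpful"),
]

# Example response: one rating per adjective pair
response = {
    ("Safe", "Dangerous"): 2,
    ("Clear", "Confusing"): 1,
    ("Helpful", "Unhelpful"): 2,
}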

4. Classical Test Theory (CTT) vs Item Response Theory (IRT)

4.1 Classical Test Theory

Core Equation:

X = T + E

Where:
- X = Observed score
- T = True score
- E = Error

Reliability in CTT:

Reliability = Var(T) / Var(X) = Var(T) / (Var(T) + Var(E))
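
Worked example: if Var(T) = 8 and Var(E) = 2, then Reliability = 8 / (8 + 2) = 0.80, i.e., 80% of the observed score variance reflects true differences rather than measurement error.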

4.2 Item Response Theory

Key Advantage: Models probability of response as function of latent trait

Basic IRT Model (Rasch):

P(X=1|θ,β) = exp(θ-β) / (1 + exp(θ-β))

Where:
- θ = person ability
- β = item difficulty
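
A minimal numeric sketch of this probability (rasch_probability is an illustrative helper, not a library function):

import numpy as np

def rasch_probability(theta, beta):
    """P(X=1 | θ, β) under the Rasch (one-parameter logistic) model."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

# Person of average ability on an average-difficulty item: P = 0.5
print(rasch_probability(theta=0.0, beta=0.0))
# Same person on an easier item (β = -1): P ≈ 0.73
print(rasch_probability(theta=0.0, beta=-1.0))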

When to Use IRT:

  • Large sample sizes (>500)
  • Need item-level analysis
  • Adaptive testing scenarios

5. Annotation Guidelines Design

5.1 Structure Template

1. **Task Definition**
   - Clear objective
   - What constitutes success

2. **Categories/Labels**
   - Exhaustive definitions
   - Mutually exclusive when possible

3. **Decision Trees**
   - If X, then check Y
   - Hierarchical logic

4. **Edge Cases**
   - Explicit handling
   - "When in doubt" rules

5. **Examples**
   - Positive examples
   - Negative examples
   - Borderline cases

5.2 Quality Control Algorithm

from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score


def annotation_quality_pipeline(annotations_df, threshold=0.4):
    """
    Algorithmic approach:
    1. Training phase: high-agreement subset
    2. Calibration: identify systematic differences
    3. Production: monitor drift
    4. Adjudication: resolve disagreements

    annotations_df: items × raters DataFrame of categorical labels.
    """
    raters = list(annotations_df.columns)

    # Step 1: Calculate pairwise agreement metrics (Cohen's kappa)
    agreement_scores = {
        (r1, r2): cohen_kappa_score(annotations_df[r1], annotations_df[r2])
        for r1, r2 in combinations(raters, 2)
    }

    # Step 2: Identify outlier annotators (low mean agreement with peers)
    mean_agreement = {
        r: np.mean([k for pair, k in agreement_scores.items() if r in pair])
        for r in raters
    }
    outliers = [r for r in raters if mean_agreement[r] < threshold]

    # Step 3: Reliability-weighted vote per item
    def weighted_vote(row):
        totals = {}
        for r, label in row.items():
            totals[label] = totals.get(label, 0.0) + max(mean_agreement[r], 0.0)
        return max(totals, key=totals.get)

    final_labels = annotations_df.apply(weighted_vote, axis=1)

    quality_metrics = {
        "pairwise_kappa": agreement_scores,
        "mean_agreement": mean_agreement,
        "outlier_raters": outliers,
    }
    return final_labels, quality_metrics

6. Statistical Considerations

6.1 Sample Size Calculation

For Inter-rater Reliability:

import math

def sample_size_for_kappa(expected_kappa=0.6, k=3, alpha=0.05, power=0.8):
    """
    Approximation based on Sim & Wright (2005): n ≈ 2k² / κ²,
    where k depends on the number of categories. alpha and power are
    kept in the signature; an exact power analysis requires the full
    Sim & Wright procedure or simulation.
    """
    calculated_n = math.ceil(2 * k**2 / expected_kappa**2)
    # Rule of thumb: 50-100 items minimum; 200+ for publication quality
    return max(100, calculated_n)

6.2 Handling Missing Annotations

Strategies:

  1. Complete Case: Only use fully annotated items (reduces n)
  2. Pairwise Deletion: Use available pairs (different n for each pair)
  3. Imputation: Model-based (risky for agreement metrics)
  4. Krippendorff’s Alpha: Naturally handles missing data (see the snippet below)
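
A quick illustration of the last strategy with the krippendorff package (assumed installed), where missing annotations are simply np.nan:

import numpy as np
import krippendorff

# raters × items; np.nan marks a missing annotation
reliability_data = np.array([
    [1,      2, 3, np.nan, 1],
    [1,      2, 3, 3,      np.nan],
    [np.nan, 2, 3, 3,      1],
])
print(krippendorff.alpha(reliability_data=reliability_data,
                         level_of_measurement='nominal'))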

7. Practical Implementation Patterns

7.1 Annotation Drift Detection

import numpy as np

def detect_annotation_drift(annotations_timeline, calculate_agreement,
                            window_size=50, ewma_lambda=0.2, n_sigmas=3):
    """
    Algorithm:
    1. Sliding window over the annotation stream
    2. Calculate agreement in each window (caller supplies the metric)
    3. Statistical process control (EWMA control chart) on the series
    """
    agreements = []
    for i in range(len(annotations_timeline) - window_size):
        window = annotations_timeline[i:i + window_size]
        agreements.append(calculate_agreement(window))

    # EWMA chart: baseline from the earliest windows, flag excursions
    agreements = np.asarray(agreements)
    mu0, sigma = agreements[:window_size].mean(), agreements.std()
    limit = n_sigmas * sigma * np.sqrt(ewma_lambda / (2 - ewma_lambda))
    z, drift_points = mu0, []
    for i, a in enumerate(agreements):
        z = ewma_lambda * a + (1 - ewma_lambda) * z
        if abs(z - mu0) > limit:
            drift_points.append(i)
    return drift_points

7.2 Rater Calibration

import numpy as np

def calibrate_raters(gold_standard, rater_responses):
    """
    Steps:
    1. Measure individual rater bias against a gold standard
    2. Quantify systematic differences (lenient vs strict tendencies)
    3. Apply a correction (here: simple additive bias removal)

    gold_standard: array of reference scores
    rater_responses: dict mapping rater -> array of scores on the same items
    """
    gold = np.asarray(gold_standard, dtype=float)

    biases = {}
    for rater, scores in rater_responses.items():
        # Positive bias = lenient (scores above gold), negative = strict
        biases[rater] = float(np.mean(np.asarray(scores, dtype=float) - gold))

    # Other calibration options: z-score normalization per rater,
    # rank-based alignment, or a fitted linear transformation
    calibrated_scores = {
        rater: np.asarray(scores, dtype=float) - biases[rater]
        for rater, scores in rater_responses.items()
    }
    return calibrated_scores

8. Health-Specific Considerations

8.1 Clinical Validity Framework

def clinical_validity_check(annotations_df, agreement_fn):
    """
    Beyond statistical agreement:
    1. Clinical significance vs statistical significance
    2. Safety-first evaluation: stricter thresholds for higher-risk items
    3. Expert adjudication required where a stratum fails its threshold

    Assumed inputs: annotations_df is a DataFrame with a 'risk_level'
    column ('high_risk' / 'medium_risk' / 'low_risk') plus rater columns;
    agreement_fn returns an agreement score (e.g., kappa) for a subset.
    """
    # Require higher agreement for higher-risk strata
    thresholds = {
        'high_risk': 0.8,   # κ > 0.8
        'medium_risk': 0.6,
        'low_risk': 0.4,
    }

    results = {}
    for risk_level, threshold in thresholds.items():
        stratum = annotations_df[annotations_df['risk_level'] == risk_level]
        score = agreement_fn(stratum) if len(stratum) else None
        results[risk_level] = {
            'agreement': score,
            'passes': score is not None and score >= threshold,
        }
    return results

9. Common Pitfalls & Solutions

9.1 Prevalence Problem

Issue: Kappa can be low even with high raw agreement when categories are imbalanced (the so-called kappa paradox)

Solution: Use prevalence-adjusted metrics (PABAK) or report multiple metrics
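
A minimal sketch of PABAK for two raters (pabak is an illustrative helper; for k categories it uses (k·P_o - 1) / (k - 1), which reduces to 2·P_o - 1 in the binary case):

def pabak(rater1, rater2, n_categories=2):
    """Prevalence- and bias-adjusted kappa: (k·P_o - 1) / (k - 1)."""
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
    return (n_categories * p_o - 1) / (n_categories - 1)

# Imbalanced binary labels: raw agreement is high, PABAK reflects it directly
r1 = ["safe"] * 18 + ["unsafe", "safe"]
r2 = ["safe"] * 18 + ["safe", "unsafe"]
print(pabak(r1, r2))  # P_o = 0.9 → PABAK = 0.8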

9.2 Rater Fatigue

Detection:

import numpy as np

def detect_fatigue(scores, timestamps, window=50):
    """Fatigue signals (assumed inputs: per-item numeric scores and completion timestamps)."""
    recent, early = np.asarray(scores[-window:]), np.asarray(scores[:window])
    gaps = np.diff(np.asarray(timestamps, dtype=float))
    vals, counts = np.unique(recent, return_counts=True)
    return {
        "variance_drop": recent.std() < 0.5 * early.std(),               # 1. Decreasing variance over time
        "speed_up": gaps[-window:].mean() < 0.5 * gaps[:window].mean(),  # 2. Increasing speed
        "default_rate": float((recent == vals[counts.argmax()]).mean()), # 3. Default (modal) responses
    }

10. Interview Talking Points

When asked about human evaluation design:

  1. Start with the purpose: What decision will this evaluation inform?
  2. Discuss validity before reliability
  3. Mention power analysis for sample size
  4. Address practical constraints (cost, time, expertise needed)
  5. Describe quality control measures

Key Insight to Remember: “Perfect agreement isn’t always the goal—systematic disagreements can reveal important edge cases or ambiguities in the task definition.”

Quick Reference Formulas

| Metric            | Use Case                     | Formula               | Threshold          |
|-------------------|------------------------------|-----------------------|--------------------|
| Cohen’s κ         | 2 raters, categorical        | (P_o - P_e)/(1 - P_e) | >0.6 good          |
| Weighted κ        | 2 raters, ordinal            | Similar, with weights | >0.7 good          |
| Fleiss’ κ         | Multiple raters, categorical | Complex               | >0.6 good          |
| Krippendorff’s α  | Any scenario, missing data   | 1 - (D_o/D_e)         | >0.667 acceptable  |
| ICC               | Continuous ratings           | Various forms         | >0.75 excellent    |

© 2025 Seyed Yahya Shirazi. All rights reserved.