1. Core Concepts & Theoretical Foundation
1.1 What is Human Evaluation in AI Context?
Human evaluation bridges the gap between computational metrics and real-world utility. It’s essential because:
- Validity Gap: Automatic metrics often don’t correlate with human judgment
- Subjective Qualities: Aspects like helpfulness, safety, and coherence require human assessment
- Ground Truth: Humans provide the ultimate validation for AI outputs
1.2 Measurement Theory Fundamentals
Reliability vs. Validity
- Reliability: Consistency of measurement (same results under same conditions)
- Validity: Measuring what you intend to measure
- Remember: A measure can be reliable without being valid, but not vice versa
Types of Validity:
- Construct Validity: Does the measure capture the theoretical construct?
- Criterion Validity: Does it correlate with external criteria?
- Content Validity: Does it cover all aspects of the construct?
- Face Validity: Does it appear to measure what it claims?
2. Inter-Rater Reliability Metrics
2.1 Cohen’s Kappa (κ)
Algorithm Logic: Measures agreement between two raters, accounting for chance agreement
Formula:
κ = (P_o - P_e) / (1 - P_e)
Where:
- P_o = observed agreement (proportion of items where raters agree)
- P_e = expected agreement by chance
Algorithmic Steps:
- Create confusion matrix of rater agreements
- Calculate observed agreement (diagonal sum / total)
- Calculate expected agreement (marginal probabilities)
- Apply formula
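To make these steps concrete, here is a minimal from-scratch sketch (pure NumPy; it assumes two equal-length lists of categorical labels, and `cohens_kappa_from_scratch` is an illustrative name, not a library function):
```python
import numpy as np

def cohens_kappa_from_scratch(rater1, rater2):
    """Cohen's kappa computed directly from the confusion matrix."""
    labels = sorted(set(rater1) | set(rater2))
    index = {label: i for i, label in enumerate(labels)}
    # 1. Build the confusion (contingency) matrix
    matrix = np.zeros((len(labels), len(labels)))
    for a, b in zip(rater1, rater2):
        matrix[index[a], index[b]] += 1
    n = matrix.sum()
    # 2. Observed agreement: proportion of items on the diagonal
    p_o = np.trace(matrix) / n
    # 3. Expected agreement from the marginal distributions
    p_e = np.sum(matrix.sum(axis=0) * matrix.sum(axis=1)) / n**2
    # 4. Kappa formula
    return (p_o - p_e) / (1 - p_e)
```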
Interpretation:
- κ < 0: Less than chance agreement
- 0 ≤ κ ≤ 0.20: Slight agreement
- 0.21 ≤ κ ≤ 0.40: Fair agreement
- 0.41 ≤ κ ≤ 0.60: Moderate agreement
- 0.61 ≤ κ ≤ 0.80: Substantial agreement
- 0.81 ≤ κ ≤ 1.00: Almost perfect agreement
Code Snippet:
```python
from sklearn.metrics import cohen_kappa_score

def cohens_kappa(rater1, rater2):
    """
    Algorithm:
    1. Build contingency table
    2. Calculate observed agreement
    3. Calculate expected agreement from marginals
    4. Apply kappa formula
    """
    return cohen_kappa_score(rater1, rater2)

# Weighted version for ordinal data
def weighted_kappa(rater1, rater2, weights='quadratic'):
    """
    weights='quadratic' penalizes disagreements more as their distance increases;
    weights='linear' applies a linear penalty.
    """
    return cohen_kappa_score(rater1, rater2, weights=weights)
```
2.2 Fleiss’ Kappa
Use Case: Multiple raters (>2), categorical ratings
Algorithm Logic:
- Calculate proportion of rater pairs in agreement for each subject
- Average these proportions across all subjects
- Adjust for chance agreement
Formula:
κ = (P̄ - P̄_e) / (1 - P̄_e)
Where P̄ is the mean per-item agreement across all subjects and P̄_e is the mean agreement expected by chance
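A short sketch using the statsmodels implementation: `aggregate_raters` converts raw labels (subjects in rows, raters in columns) into the subjects × categories count table that `fleiss_kappa` expects. The ratings below are made-up numbers for illustration:
```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = subjects/items, columns = raters, values = category labels
ratings = np.array([[0, 0, 1],
                    [1, 1, 1],
                    [2, 2, 1],
                    [0, 1, 0]])

table, _ = aggregate_raters(ratings)          # subjects x categories count table
print(fleiss_kappa(table, method='fleiss'))
```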
2.3 Krippendorff’s Alpha
Advantages:
- Works with any number of raters
- Handles missing data
- Supports various data types (nominal, ordinal, interval, ratio)
Algorithm:
α = 1 - (D_o / D_e)
Where:
- D_o = observed disagreement
- D_e = expected disagreement by chance
Key Insight: Alpha is built from observed vs. expected disagreement rather than agreement, which is what lets it generalize across data types and handle missing ratings gracefully
Code Pattern:
```python
import numpy as np
import krippendorff

def calculate_alpha(data_matrix, level_of_measurement='nominal'):
    """
    data_matrix: raters x items matrix (use np.nan for missing ratings)
    Algorithm:
    1. Calculate pairwise differences (disagreement)
    2. Weight by data type (nominal, ordinal, etc.)
    3. Compare to the disagreement expected from random pairing
    """
    return krippendorff.alpha(reliability_data=data_matrix,
                              level_of_measurement=level_of_measurement)
```
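A quick usage example with one missing rating (`np.nan`), which alpha handles natively (hypothetical numbers):
```python
# 3 raters x 5 items, one missing rating for rater 2
ratings = np.array([
    [1, 2, 3, 3, 2],
    [1, 2, 3, 3, np.nan],
    [1, 2, 3, 2, 2],
], dtype=float)

print(calculate_alpha(ratings, level_of_measurement='ordinal'))
```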
3. Scale Development & Design
3.1 Likert Scales
Design Principles:
- Odd vs Even: Odd allows neutral midpoint, even forces choice
- Anchoring: Clear labels for endpoints and midpoint
- Range: 5-7 points optimal (balance granularity vs cognitive load)
Common Patterns:
```python
# Standard 5-point Likert
likert_5 = {
    1: "Strongly Disagree",
    2: "Disagree",
    3: "Neutral",
    4: "Agree",
    5: "Strongly Agree",
}

# 7-point for more granularity
likert_7 = {1: "Strongly Disagree", 2: "Disagree", 3: "Somewhat Disagree",
            4: "Neither Agree nor Disagree", 5: "Somewhat Agree",
            6: "Agree", 7: "Strongly Agree"}
```
3.2 Semantic Differential
Concept: Rate between bipolar adjectives
```
Safe |---|---|---|---|---| Dangerous
       1   2   3   4   5
```
4. Classical Test Theory (CTT) vs Item Response Theory (IRT)
4.1 Classical Test Theory
Core Equation:
X = T + E
Where:
- X = Observed score
- T = True score
- E = Error
Reliability in CTT:
Reliability = Var(T) / Var(X) = Var(T) / (Var(T) + Var(E))
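A small simulation illustrating the decomposition (the true-score and error variances below are arbitrary choices for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_scores = rng.normal(loc=50, scale=10, size=n)   # Var(T) = 100
errors = rng.normal(loc=0, scale=5, size=n)          # Var(E) = 25
observed = true_scores + errors                      # X = T + E

# Reliability = Var(T) / Var(X); expected about 100 / 125 = 0.8
print(true_scores.var() / observed.var())
```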
4.2 Item Response Theory
Key Advantage: Models probability of response as function of latent trait
Basic IRT Model (Rasch):
P(X=1|θ,β) = exp(θ-β) / (1 + exp(θ-β))
Where:
- θ = person ability
- β = item difficulty
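The Rasch probability is just a logistic function of (θ - β); a minimal sketch (`rasch_probability` is an illustrative name):
```python
import numpy as np

def rasch_probability(theta, beta):
    """P(X=1) under the Rasch (1PL) model."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

# Person ability equal to item difficulty -> probability 0.5
print(rasch_probability(theta=0.0, beta=0.0))   # 0.5
# Higher-ability person on the same item
print(rasch_probability(theta=1.5, beta=0.0))   # ~0.82
```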
When to Use IRT:
- Large sample sizes (>500)
- Need item-level analysis
- Adaptive testing scenarios
5. Annotation Guidelines Design
5.1 Structure Template
1. **Task Definition**
- Clear objective
- What constitutes success
2. **Categories/Labels**
- Exhaustive definitions
- Mutually exclusive when possible
3. **Decision Trees**
- If X, then check Y
- Hierarchical logic
4. **Edge Cases**
- Explicit handling
- "When in doubt" rules
5. **Examples**
- Positive examples
- Negative examples
- Borderline cases
5.2 Quality Control Algorithm
```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

def annotation_quality_pipeline(annotations_df, threshold=0.4):
    """
    annotations_df: items x raters DataFrame of categorical labels.
    Algorithmic approach:
    1. Training phase: high-agreement subset
    2. Calibration: identify systematic differences
    3. Production: monitor drift
    4. Adjudication: resolve disagreements
    """
    raters = list(annotations_df.columns)
    # Step 1: calculate pairwise agreement metrics
    agreement_scores = {}
    for r1, r2 in combinations(raters, 2):
        agreement_scores[(r1, r2)] = cohen_kappa_score(
            annotations_df[r1], annotations_df[r2])
    # Step 2: identify outlier annotators (low mean agreement with peers)
    mean_agreement = {
        r: np.mean([k for pair, k in agreement_scores.items() if r in pair])
        for r in raters}
    outliers = [r for r in raters if mean_agreement[r] < threshold]
    # Step 3: weighted aggregation (weight = mean kappa, floored at 0; one simple choice)
    weights = {r: max(mean_agreement[r], 0.0) for r in raters}
    def weighted_vote(row):
        totals = {}
        for rater, label in row.items():
            totals[label] = totals.get(label, 0.0) + weights[rater]
        return max(totals, key=totals.get)
    final_labels = annotations_df.apply(weighted_vote, axis=1)
    quality_metrics = {'pairwise_kappa': agreement_scores,
                       'mean_agreement': mean_agreement,
                       'outliers': outliers}
    return final_labels, quality_metrics
```
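A quick usage example for the pipeline above (hypothetical labels; `rater_a`, `rater_b`, `rater_c` are made-up names):
```python
import pandas as pd

annotations = pd.DataFrame({
    'rater_a': ['yes', 'no', 'yes', 'yes', 'no', 'no'],
    'rater_b': ['yes', 'no', 'yes', 'no',  'no', 'no'],
    'rater_c': ['yes', 'no', 'no',  'yes', 'no', 'yes'],
})
labels, metrics = annotation_quality_pipeline(annotations, threshold=0.4)
print(labels.tolist())
print(metrics['mean_agreement'])
```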
6. Statistical Considerations
6.1 Sample Size Calculation
For Inter-rater Reliability:
```python
def sample_size_for_kappa(expected_kappa=0.6, alpha=0.05, power=0.8):
    """
    Rough guidance only; exact power calculations for kappa
    (e.g., Sim & Wright, 2005) depend on the number of categories
    and each rater's marginal prevalences, not just the expected kappa.
    Rules of thumb: 50-100 items minimum; 200+ for publication quality.
    """
    # Placeholder heuristic: enforce the rule-of-thumb floor of 100 items
    calculated_n = 100  # swap in a proper power calculation when available
    return max(100, calculated_n)
```
6.2 Handling Missing Annotations
Strategies:
- Complete Case: Only use fully annotated items (reduces n)
- Pairwise Deletion: Use available pairs (different n for each pair)
- Imputation: Model-based (risky for agreement metrics)
- Krippendorff’s Alpha: Naturally handles missing data
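A small illustration of how complete-case and pairwise deletion change the effective n (hypothetical items × raters matrix, with `np.nan` marking a missing annotation; `r1`-`r3` are made-up rater names):
```python
import numpy as np
import pandas as pd

# items x raters, NaN = missing annotation
ann = pd.DataFrame({'r1': [1, 2, np.nan, 3, 2],
                    'r2': [1, 2, 2,      3, np.nan],
                    'r3': [1, np.nan, 2, 3, 2]})

mask = ann.notna().astype(int)
complete_n = int(ann.dropna().shape[0])   # complete-case n: rows with no missing values
pairwise_n = mask.T @ mask                # raters x raters matrix of co-annotated item counts
print(complete_n)
print(pairwise_n)
```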
7. Practical Implementation Patterns
7.1 Annotation Drift Detection
```python
def detect_annotation_drift(annotations_timeline, window_size=50):
    """
    Algorithm:
    1. Slide a window over annotations ordered by time
    2. Calculate inter-rater agreement within each window
    3. Flag drift via statistical process control (SPC)
    """
    agreements = []
    for i in range(len(annotations_timeline) - window_size):
        window = annotations_timeline[i:i + window_size]
        # calculate_agreement: any per-window agreement metric
        # (e.g., mean pairwise kappa); defined elsewhere in the project
        agreements.append(calculate_agreement(window))
    # Detect drift using CUSUM or EWMA; a minimal EWMA sketch follows below
    drift_points = detect_changepoints(agreements)
    return drift_points
```
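One way to fill in the `detect_changepoints` placeholder is an EWMA control chart: smooth the per-window agreement series and flag points that fall outside control limits. A sketch, with the smoothing factor `lam` and limit width `n_sigma` as tuning choices; in practice the baseline mean and standard deviation should come from a calibration period rather than the series itself:
```python
import numpy as np

def detect_changepoints(agreements, lam=0.2, n_sigma=3.0):
    """EWMA control chart over a series of per-window agreement scores."""
    x = np.asarray(agreements, dtype=float)
    mu, sigma = x.mean(), x.std(ddof=1)          # baseline estimated from the series here
    # Asymptotic EWMA standard deviation and control limits
    ewma_sigma = sigma * np.sqrt(lam / (2 - lam))
    lower, upper = mu - n_sigma * ewma_sigma, mu + n_sigma * ewma_sigma

    flagged, z = [], mu
    for i, value in enumerate(x):
        z = lam * value + (1 - lam) * z          # exponentially weighted mean
        if z < lower or z > upper:
            flagged.append(i)                    # agreement drifted out of control
    return flagged
```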
7.2 Rater Calibration
```python
import numpy as np

def calibrate_raters(gold_standard, rater_responses):
    """
    Steps:
    1. Measure each rater's bias against a gold standard
    2. Quantify systematic differences (lenient vs. strict tendency)
    3. Apply correction factors
    Other calibration options: z-score normalization per rater,
    rank-based alignment, or a fitted linear transformation.
    """
    gold = np.asarray(gold_standard, dtype=float)
    biases, calibrated_scores = {}, {}
    for rater, responses in rater_responses.items():
        responses = np.asarray(responses, dtype=float)
        # Tendency (lenient vs. strict): mean signed deviation from gold
        biases[rater] = np.mean(responses - gold)
        # Simplest correction: subtract the rater's systematic bias
        calibrated_scores[rater] = responses - biases[rater]
    return calibrated_scores
```
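For comparison, z-score normalization per rater (calibration option 1 in the docstring) removes location and scale differences without requiring a gold standard; a minimal sketch (`zscore_calibrate` is an illustrative name):
```python
import numpy as np

def zscore_calibrate(rater_responses):
    """Standardize each rater's scores to mean 0, standard deviation 1."""
    calibrated = {}
    for rater, responses in rater_responses.items():
        responses = np.asarray(responses, dtype=float)
        calibrated[rater] = (responses - responses.mean()) / responses.std(ddof=1)
    return calibrated
```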
8. Health-Specific Considerations
8.1 Clinical Validity Framework
```python
from sklearn.metrics import cohen_kappa_score

def clinical_validity_check(annotations):
    """
    Beyond statistical agreement:
    1. Clinical significance vs. statistical significance
    2. Safety-first evaluation
    3. Expert validation requirement

    annotations: DataFrame with 'risk_level', 'rater1', 'rater2' columns
    (an assumed schema for this sketch).
    """
    # Stratify by risk level and require higher agreement for high-risk items
    thresholds = {
        'high_risk': 0.8,    # kappa > 0.8
        'medium_risk': 0.6,
        'low_risk': 0.4,
    }
    results = {}
    for risk_level, threshold in thresholds.items():
        stratum = annotations[annotations['risk_level'] == risk_level]
        if len(stratum) == 0:
            continue
        kappa = cohen_kappa_score(stratum['rater1'], stratum['rater2'])
        results[risk_level] = {'kappa': kappa, 'passes': kappa >= threshold}
    return results
```
9. Common Pitfalls & Solutions
9.1 Prevalence Problem
Issue: Kappa can be low even with high agreement if categories are imbalanced
Solution: Use prevalence-adjusted metrics (PABAK) or report multiple metrics
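A minimal sketch of PABAK, which replaces the chance-agreement term with 1/k (equiprobable categories), so it depends only on observed agreement; for two categories it reduces to 2*P_o - 1 (`pabak` is an illustrative name, not a library function):
```python
import numpy as np

def pabak(rater1, rater2):
    """Prevalence- and bias-adjusted kappa: assumes chance agreement of 1/k."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    k = len(set(rater1) | set(rater2))        # number of categories
    p_o = np.mean(rater1 == rater2)           # observed agreement
    return (k * p_o - 1) / (k - 1)            # 2*p_o - 1 when k = 2
```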
9.2 Rater Fatigue
Detection:
```python
def detect_fatigue(annotations_sequence):
    """Look for session-level fatigue patterns (sketch)."""
    # Indicators to compute:
    # 1. Decreasing variance of labels over time
    # 2. Increasing annotation speed
    # 3. Long runs of identical ("default") responses
    fatigue_indicators = {}   # placeholder: populate with the indicators above
    return fatigue_indicators
```
10. Interview Talking Points
When asked about human evaluation design:
- Start with the purpose: What decision will this evaluation inform?
- Discuss validity before reliability
- Mention power analysis for sample size
- Address practical constraints (cost, time, expertise needed)
- Describe quality control measures
Key Insight to Remember: “Perfect agreement isn’t always the goal—systematic disagreements can reveal important edge cases or ambiguities in the task definition.”
Quick Reference Formulas
| Metric | Use Case | Formula | Threshold |
|---|---|---|---|
| Cohen’s κ | 2 raters, categorical | (P_o - P_e) / (1 - P_e) | >0.6 good |
| Weighted κ | 2 raters, ordinal | Same, with disagreement weights | >0.7 good |
| Fleiss’ κ | Multiple raters, categorical | (P̄ - P̄_e) / (1 - P̄_e) | >0.6 good |
| Krippendorff’s α | Any number of raters, missing data | 1 - (D_o / D_e) | >0.667 acceptable |
| ICC | Continuous ratings | Various forms | >0.75 excellent |
© 2025 Seyed Yahya Shirazi. All rights reserved.