1. Power Analysis for Annotation Studies
Conceptual Understanding
Power analysis determines the sample size needed to detect a meaningful effect with sufficient confidence. In annotation studies, this means: “How many annotations do I need to reliably detect differences between models/conditions?”
Key Formula
n = (Z_α + Z_β)² × σ² / δ²
Where:
- n = sample size needed
- Z_α = critical value for the significance level (1.96 for a two-sided α=0.05)
- Z_β = critical value for power (0.84 for 80% power)
- σ = standard deviation
- δ = effect size (minimum detectable difference)
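As a quick sanity check, the formula can be computed by hand. A minimal sketch, using the health AI numbers from the example further below (σ = 0.15, δ = 0.10); note this is the one-sample/paired form, so comparing two independent groups roughly doubles the per-group n:

import numpy as np
from scipy.stats import norm

# Hand computation of n = (Z_α + Z_β)² × σ² / δ²
z_alpha = norm.ppf(1 - 0.05 / 2)   # 1.96 for two-sided α = 0.05
z_beta = norm.ppf(0.80)            # 0.84 for 80% power
sigma, delta = 0.15, 0.10          # illustrative values from the health AI example
n = (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
print(int(np.ceil(n)))             # ≈ 18 annotations (one-sample/paired form)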
Implementation Logic
import statsmodels.stats.power as smp
import numpy as np

def calculate_annotation_sample_size(effect_size=0.5, alpha=0.05, power=0.8):
    """
    Algorithm:
    1. Define expected effect size (Cohen's d)
    2. Set significance level and desired power
    3. Solve for the required sample size
    4. Add a buffer for dropout/invalid annotations
    """
    # For comparing two models (independent samples); returns n per group
    n = smp.TTestIndPower().solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )
    # 20% buffer for dropout/invalid annotations
    n_adjusted = n * 1.2
    return int(np.ceil(n_adjusted))
# Practical example: Health AI evaluation
# "How many patient cases need evaluation to detect 10% improvement?"
baseline_accuracy = 0.75
expected_improvement = 0.10
std_dev = 0.15
effect_size = expected_improvement / std_dev # Cohen's d
sample_size = calculate_annotation_sample_size(effect_size)
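With these inputs (d ≈ 0.67), solve_power returns roughly 36 cases per model, or about 44 once the 20% dropout buffer is applied. This is the two-sample, independent-groups figure, which is roughly double the one-sample estimate worked out after the formula above.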
2. Mixed-Effects Models for Nested Data
Conceptual Framework
Mixed-effects models handle hierarchical/nested data structure common in evaluations:
- Fixed effects: Consistent across all observations (e.g., model type)
- Random effects: Vary across groups (e.g., individual annotator biases)
Mathematical Representation
Y_ij = β₀ + β₁X_ij + u_i + ε_ij
Where:
- Y_ij = rating for item j by annotator i
- β₀ = intercept (fixed)
- β₁ = effect of condition X (fixed)
- u_i = random effect for annotator i
- ε_ij = residual error
Implementation Pattern
import statsmodels.formula.api as smf
import pandas as pd
def analyze_nested_annotations(df):
    """
    Algorithm:
    1. Identify the hierarchical structure (annotations nested within annotators)
    2. Specify fixed effects (conditions to compare)
    3. Specify random effects (annotator variation)
    4. Fit the model and extract insights
    """
    # statsmodels separates the fixed-effects formula from the grouping structure;
    # this is equivalent to lme4's: quality_score ~ model_type + response_length + (1 | annotator_id)
    model = smf.mixedlm(
        formula="quality_score ~ model_type + response_length",
        data=df,
        groups=df["annotator_id"],  # Random intercepts by annotator
        re_formula="~1"             # Random intercepts only
    )
    result = model.fit()

    # Extract key insights
    fixed_effects = result.fe_params                   # Model differences
    random_variance = float(result.cov_re.iloc[0, 0])  # Annotator (between-group) variance

    # Intraclass correlation: how much variance is due to annotators?
    icc = random_variance / (random_variance + result.scale)

    return {
        'model_effects': fixed_effects,
        'annotator_consistency': 1 - icc,  # Higher = more consistent
        'needs_calibration': icc > 0.2     # Flag if >20% of variance comes from annotators
    }
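A minimal usage sketch with a synthetic long-format DataFrame (the column names match the function above; the generated values are purely illustrative):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rows = []
for annotator in range(8):
    bias = rng.normal(0, 0.3)  # annotator-specific offset (random effect)
    for item in range(30):
        model = rng.choice(["baseline", "new"])
        score = 3.0 + (0.5 if model == "new" else 0.0) + bias + rng.normal(0, 0.5)
        rows.append({"annotator_id": annotator,
                     "model_type": model,
                     "response_length": int(rng.integers(50, 300)),
                     "quality_score": score})
df = pd.DataFrame(rows)

summary = analyze_nested_annotations(df)
print(summary['model_effects'])
print("Needs calibration:", summary['needs_calibration'])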
3. Bootstrap Confidence Intervals
Why Bootstrap?
- Works without distributional assumptions
- Handles complex statistics (medians, ratios, custom metrics)
- Provides robust uncertainty estimates
Algorithm Logic
def bootstrap_confidence_interval(data, statistic_func, n_bootstrap=1000, ci=95):
    """
    Core Algorithm:
    1. Resample data WITH replacement
    2. Calculate the statistic on each resample
    3. Use the percentile method for the CI
    """
    bootstrap_stats = []
    n = len(data)
    for _ in range(n_bootstrap):
        # Resample with replacement
        resample = np.random.choice(data, size=n, replace=True)
        # Calculate the statistic on the resample
        stat = statistic_func(resample)
        bootstrap_stats.append(stat)

    # Percentile confidence interval
    alpha = (100 - ci) / 2
    lower = np.percentile(bootstrap_stats, alpha)
    upper = np.percentile(bootstrap_stats, 100 - alpha)

    return {
        'estimate': statistic_func(data),
        'ci_lower': lower,
        'ci_upper': upper,
        'std_error': np.std(bootstrap_stats)
    }
# Practical use case: LLM evaluation metric uncertainty
def median_response_quality(scores):
    return np.median(scores)

scores = [0.7, 0.8, 0.6, 0.9, 0.75, ...]  # Annotation scores
result = bootstrap_confidence_interval(scores, median_response_quality)
print(f"Median quality: {result['estimate']:.3f} [{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]")
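The same helper also works for comparison metrics. A sketch, assuming paired scores for the same items from two models (illustrative values), bootstrapping the mean per-item difference:

import numpy as np

# Illustrative paired scores on the same items
model_a = np.array([0.72, 0.81, 0.64, 0.90, 0.77, 0.69, 0.85, 0.74])
model_b = np.array([0.68, 0.79, 0.61, 0.88, 0.70, 0.66, 0.80, 0.71])

# Bootstrap the mean of the paired differences; a CI excluding 0 suggests a real gap
diff_result = bootstrap_confidence_interval(model_a - model_b, np.mean)
print(f"Mean improvement: {diff_result['estimate']:.3f} "
      f"[{diff_result['ci_lower']:.3f}, {diff_result['ci_upper']:.3f}]")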
4. Multiple Comparison Corrections
The Problem
Testing multiple hypotheses inflates the false-positive rate: if you test 20 model comparisons at α=0.05, you expect about one false positive by chance alone.
Correction Methods
Bonferroni Correction (Conservative)
α_adjusted = α / m
Where m = number of comparisons
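A one-line sanity check of the adjustment (m chosen to match the example further below):

alpha, m = 0.05, 5           # 5 model-vs-baseline comparisons
alpha_adjusted = alpha / m   # 0.01; each individual test must clear this stricter threshold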
False Discovery Rate (FDR) - Benjamini-Hochberg (Less Conservative)
from statsmodels.stats.multitest import multipletests
def apply_multiple_testing_correction(p_values, method='fdr_bh'):
    """
    Algorithm (Benjamini-Hochberg):
    1. Sort p-values in ascending order
    2. Find the largest i where P(i) ≤ (i/m) × α
    3. Reject H₀ for P(1)...P(i)
    """
    rejected, corrected_pvals, _, _ = multipletests(
        p_values,
        alpha=0.05,
        method=method  # 'bonferroni' or 'fdr_bh'
    )
    return {
        'significant': rejected,
        'corrected_p': corrected_pvals,
        'n_significant': int(rejected.sum())
    }
# Example: Comparing 5 models against baseline
p_values = [0.001, 0.03, 0.04, 0.15, 0.02]
results = apply_multiple_testing_correction(p_values)
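With these p-values, Benjamini-Hochberg flags four of the five comparisons as significant (only p = 0.15 is not), whereas the Bonferroni threshold of 0.05/5 = 0.01 would retain only p = 0.001, illustrating how much less conservative FDR control is.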
5. Effect Sizes for Different Data Types
Cohen’s d (Continuous Data)
d = (μ₁ - μ₂) / σ_pooled
Interpretation: 0.2=small, 0.5=medium, 0.8=large
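A minimal sketch of the pooled-SD computation with toy quality scores (the helper and values below are illustrative, not from the source):

import numpy as np

def cohens_d(x, y):
    # Pooled standard deviation (unbiased, n-1 denominator)
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

# Toy quality scores for two models
d = cohens_d([0.82, 0.75, 0.90, 0.78, 0.85], [0.70, 0.68, 0.77, 0.72, 0.74])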
Cliff’s Delta (Ordinal Data)
Non-parametric effect size for ordinal ratings (1-5 scales)
def cliffs_delta(x, y):
    """
    Algorithm:
    1. Count all pairwise comparisons
    2. Calculate the dominance probability
    3. Delta = P(X > Y) - P(X < Y)
    """
    n1, n2 = len(x), len(y)
    greater = sum(1 for xi in x for yi in y if xi > yi)
    less = sum(1 for xi in x for yi in y if xi < yi)
    delta = (greater - less) / (n1 * n2)

    # Commonly used interpretation thresholds
    if abs(delta) < 0.147:
        effect = "negligible"
    elif abs(delta) < 0.33:
        effect = "small"
    elif abs(delta) < 0.474:
        effect = "medium"
    else:
        effect = "large"

    return delta, effect
# Example: Comparing user satisfaction ratings (1-5 scale)
model_a_ratings = [4, 5, 3, 4, 5, 4, 3]
model_b_ratings = [3, 3, 2, 4, 3, 2, 3]
delta, magnitude = cliffs_delta(model_a_ratings, model_b_ratings)
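For these ratings, delta works out to 34/49 ≈ 0.69, so the function reports a large effect in favor of model A.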
6. Practical Evaluation Pipeline
Complete Statistical Analysis Workflow
class EvaluationStatistics:
    """
    Workflow skeleton: check_normality() and recommend_sample_size() are
    implemented below; the remaining helper methods (variance checks, tests,
    effect sizes, bootstrap, power) are placeholders to fill in per project.
    """

    def __init__(self, data, data_type='continuous'):
        self.data = data
        self.data_type = data_type
        self.effect_size = None  # set by the pipeline in Step 3

    def full_analysis_pipeline(self):
        """
        Comprehensive statistical evaluation:
        1. Check assumptions
        2. Choose the appropriate test
        3. Calculate effect sizes
        4. Compute confidence intervals
        5. Check power and generate a report
        """
        # Step 1: Assumption checking
        normality = self.check_normality()
        homogeneity = self.check_variance_homogeneity()

        # Step 2: Select a test based on the assumptions
        if normality and homogeneity:
            test_result = self.parametric_test()      # t-test, ANOVA
        else:
            test_result = self.nonparametric_test()   # Mann-Whitney, Kruskal-Wallis

        # Step 3: Effect size
        if self.data_type == 'continuous':
            self.effect_size = self.calculate_cohens_d()
        else:
            self.effect_size = self.calculate_cliffs_delta()

        # Step 4: Confidence intervals
        ci = self.bootstrap_confidence_intervals()

        # Step 5: Power analysis
        achieved_power = self.post_hoc_power_analysis()

        return {
            'test_used': test_result['method'],
            'p_value': test_result['p'],
            'effect_size': self.effect_size,
            'confidence_interval': ci,
            'statistical_power': achieved_power,
            'sample_size_recommendation': self.recommend_sample_size()
        }

    def check_normality(self):
        """Shapiro-Wilk test for normality"""
        from scipy import stats
        _, p = stats.shapiro(self.data)
        return p > 0.05

    def recommend_sample_size(self):
        """Heuristic based on the observed effect size"""
        if self.effect_size < 0.3:
            return "Need 200+ samples per condition for small effects"
        elif self.effect_size < 0.5:
            return "Need 50+ samples per condition for medium effects"
        else:
            return "Need 20+ samples per condition for large effects"
Key Takeaways for Interviews
- Always consider the data structure: nested? paired? independent?
- Effect size > p-value: a significant p-value with a tiny effect size may not be practically meaningful
- Account for multiple comparisons: essential when comparing multiple models/conditions
- Bootstrap for robustness: when in doubt about distributions, bootstrap
- Mixed models for real-world data: annotators introduce hierarchy; account for it
- Power analysis prevents waste: calculate sample size BEFORE collecting annotations
Quick Reference Formulas
- Standard Error: SE = σ/√n
- 95% CI: mean ± 1.96×SE
- Cohen’s d: d = (μ₁-μ₂)/σ_pooled
- ICC: ρ = σ²_between/(σ²_between + σ²_within)
- Bonferroni α: α_adj = α/m
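A quick sketch applying the first two reference formulas to a vector of scores (illustrative data):

import numpy as np

scores = np.array([0.70, 0.80, 0.60, 0.90, 0.75, 0.82, 0.68])
se = scores.std(ddof=1) / np.sqrt(len(scores))                  # Standard error: SE = s / sqrt(n)
ci_95 = (scores.mean() - 1.96 * se, scores.mean() + 1.96 * se)  # Normal-approximation 95% CI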
Python Libraries Checklist
import statsmodels.api as sm # Regression, mixed models
import statsmodels.formula.api as smf # R-style formulas
from scipy import stats # Statistical tests
import pingouin as pg # Effect sizes, power analysis
from statsmodels.stats.multitest import multipletests # Multiple comparisons
import numpy as np
import pandas as pd
© 2025 Seyed Yahya Shirazi. All rights reserved.