1. Sample Size Determination: Foundation of Reliable Evaluation
The Critical Role of Statistical Power
Statistical power analysis represents a fundamental component of experimental design that determines the minimum sample size required to detect meaningful differences between conditions. Without adequate power calculations, evaluation studies risk producing inconclusive results despite substantial resource investment.
Understanding Statistical Power
Statistical power quantifies the probability of correctly detecting a true effect when one exists. Power analysis addresses the essential question: “What sample size is required to reliably detect a clinically or practically meaningful difference?”
The relationship between sample size, effect size, and statistical power forms the foundation of robust experimental design.
The Formula That Saves Budgets
$$ n = \frac{(Z_\alpha + Z_\beta)^2 \times \sigma^2}{\delta^2} $$ Where:
- $n$ = sample size needed
- $Z_\alpha$ = critical value for the significance level (1.96 for two-sided $\alpha=0.05$)
- $Z_\beta$ = critical value for power (0.84 for 80% power)
- $\sigma$ = standard deviation
- $\delta$ = effect size (minimum detectable difference)
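Before reaching for a library, the formula can be checked directly. The sketch below (with illustrative values for $\sigma$ and $\delta$, not from any specific study) evaluates the one-sample/paired form of the formula using `scipy.stats.norm`; for comparing two independent groups, the required n per group is roughly doubled.

import math
from scipy.stats import norm

def sample_size_from_formula(sigma, delta, alpha=0.05, power=0.80):
    """Closed-form sample size for detecting a mean difference delta (one-sample form)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for two-sided alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    n = ((z_alpha + z_beta) ** 2 * sigma ** 2) / delta ** 2
    return math.ceil(n)

# Illustrative numbers: sigma = 0.15, minimum detectable difference = 0.10
print(sample_size_from_formula(sigma=0.15, delta=0.10))  # -> 18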
Implementation Logic
import statsmodels.stats.power as smp
import numpy as np

def calculate_annotation_sample_size(effect_size=0.5, alpha=0.05, power=0.8):
    """
    Algorithm:
    1. Define expected effect size (Cohen's d)
    2. Set significance level and desired power
    3. Calculate required sample size
    4. Add a buffer for dropout/invalid annotations
    """
    # For comparing two models (independent samples per group)
    n = smp.TTestIndPower().solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )
    # Adjustment for dropout and invalid annotations
    n_adjusted = n * 1.2  # 20% buffer
    return int(np.ceil(n_adjusted))
# Practical example: Health AI evaluation
# "How many patient cases need evaluation to detect 10% improvement?"
baseline_accuracy = 0.75
expected_improvement = 0.10
std_dev = 0.15
effect_size = expected_improvement / std_dev # Cohen's d
sample_size = calculate_annotation_sample_size(effect_size)
1.1 Blood Pressure Estimation: Clinical Significance Drives Sample Size
Consider the validation of PPG-based blood pressure estimation in wearable devices. The clinical significance threshold of 5 mmHg difference in blood pressure measurements directly influences required sample sizes for meaningful validation.
Power Analysis Implementation:
def bp_estimation_sample_size():
    """
    Calculate samples needed to validate PPG-based BP estimation
    """
    # Clinical context
    clinical_significance = 5   # 5 mmHg is a clinically meaningful difference
    population_std = 12         # Typical BP measurement std dev
    # Calculate effect size
    effect_size = clinical_significance / population_std  # d ≈ 0.42
    # Power analysis for comparing PPG against the gold standard
    # (independent-samples calculation; conservative for a paired design)
    from statsmodels.stats.power import TTestIndPower
    n_subjects = TTestIndPower().solve_power(
        effect_size=effect_size,
        alpha=0.05,   # Type I error rate
        power=0.80,   # 80% chance of detecting a true difference
        alternative='two-sided'
    )
    # Adjust for expected dropout and invalid measurements
    dropout_rate = 0.15  # 15% wearable data-quality issues
    n_adjusted = n_subjects / (1 - dropout_rate)
    # Multiple measurements per subject for reliability
    measurements_per_subject = 3  # Morning, afternoon, evening
    total_measurements = n_adjusted * measurements_per_subject
    return {
        'subjects_needed': int(np.ceil(n_adjusted)),             # ~110 subjects
        'total_measurements': int(np.ceil(total_measurements)),  # ~330 measurements
        'rationale': 'Detect 5 mmHg difference with 80% power, accounting for PPG noise'
    }
# Practical consideration for wearables
results = bp_estimation_sample_size()
print(f"Need {results['subjects_needed']} participants wearing devices for 24 hours")
print(f"Collect {results['total_measurements']} BP readings paired with PPG data")
Key Insights:
- Effect Size Context: 5 mmHg difference matters clinically (stroke risk reduction)
- Wearable Challenges: Account for motion artifacts, skin tone variations
- Temporal Factors: Multiple readings reduce measurement variability
2. Hierarchical Data Structure: Addressing Nested Dependencies
The Challenge of Nested Observations
Hierarchical data structures present significant challenges in evaluation analysis. When multiple annotators evaluate the same responses, or when measurements are collected across different institutions, traditional statistical approaches that assume independence produce invalid results.
Consider an evaluation scenario where five annotators rate 100 AI responses. Significant variation in average scores between annotators (e.g., 4.2 vs 2.8) indicates systematic differences that must be accounted for in the analysis.
Mixed-Effects Modeling Framework
Hierarchical data requires statistical models that explicitly account for nested structure. In educational assessment, students nested within classrooms within schools exemplify this structure - individual performance is influenced by both classroom-level and school-level factors.
Mixed-effects models separate these layers:
- Fixed effects: systematic factors whose influence you want to estimate (e.g., the teaching method being tested)
- Random effects: sources of variation you want to account for rather than estimate directly (e.g., which teacher a student was assigned)
Mathematical Representation
$$ Y_{ij} = \beta_0 + \beta_1 X_{ij} + u_i + \varepsilon_{ij} $$ Where:
- $Y_{ij}$ = rating for item j by annotator i
- $\beta_0$ = intercept (fixed)
- $\beta_1$ = effect of condition X (fixed)
- $u_i$ = random effect for annotator i
- $\varepsilon_{ij}$ = residual error
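To make the notation concrete, here is a minimal simulation (with invented parameter values) that generates ratings exactly as the equation describes: each annotator i draws a random intercept $u_i$, and the fixed effect $\beta_1$ shifts scores for one condition.

import numpy as np

rng = np.random.default_rng(42)
n_annotators, n_items = 5, 100
beta_0, beta_1 = 3.0, 0.5      # fixed intercept and condition effect (invented)
sigma_u, sigma_e = 0.6, 0.4    # annotator SD and residual SD (invented)

u = rng.normal(0, sigma_u, n_annotators)   # random intercept per annotator
rows = []
for i in range(n_annotators):
    for j in range(n_items):
        x_ij = j % 2                                                   # condition indicator
        y_ij = beta_0 + beta_1 * x_ij + u[i] + rng.normal(0, sigma_e)
        rows.append({'annotator_id': i, 'item': j, 'condition': x_ij, 'score': y_ij})

# With these values, sigma_u^2 / (sigma_u^2 + sigma_e^2) ≈ 0.69, i.e. most score
# variance comes from who did the rating, not what was rated. The `rows` list can be
# loaded into a DataFrame and fed to the mixed-model code in the next block.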
Implementation Pattern
import statsmodels.formula.api as smf
import pandas as pd

def analyze_nested_annotations(df):
    """
    Algorithm:
    1. Identify hierarchical structure (annotations nested within annotators)
    2. Specify fixed effects (conditions to compare)
    3. Specify random effects (annotator variation)
    4. Fit model and extract insights
    """
    # Conceptually: outcome ~ fixed_effects + (1 | annotator); in statsmodels the
    # grouping variable is passed separately rather than written in the formula
    model = smf.mixedlm(
        formula="quality_score ~ model_type + response_length",
        data=df,
        groups=df["annotator_id"],  # Random intercepts by annotator
        re_formula="~1"             # Random intercepts only
    )
    result = model.fit()
    # Extract key insights
    fixed_effects = result.fe_params             # Model differences
    random_variance = result.cov_re.iloc[0, 0]   # Between-annotator variance (scalar)
    # Intraclass correlation: How much variance is due to annotators?
    icc = random_variance / (random_variance + result.scale)
    return {
        'model_effects': fixed_effects,
        'annotator_consistency': 1 - icc,   # Higher = more consistent
        'needs_calibration': icc > 0.2      # Flag if >20% of variance comes from annotators
    }
2.1 Multi-Site Validation: Equipment and Protocol Variations
Multi-center studies in healthcare AI reveal the critical importance of hierarchical modeling. EEG seizure detection algorithms may show substantial performance variation across sites due to equipment differences, patient populations, and clinical protocols.
Consider validation results showing:
- Site A (modern equipment): 94% accuracy
- Site E (legacy systems): 71% accuracy
This 23-percentage-point performance gap demonstrates how institutional factors significantly impact algorithm performance, necessitating mixed-effects approaches.
Scenario:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def analyze_multicenter_eeg_study(eeg_annotations_df):
    """
    Mixed-effects model for hierarchical EEG data:
    - Seizures nested within patients
    - Patients nested within hospitals
    - Annotations from multiple technicians
    """
    # Data structure example
    # columns: seizure_id, patient_id, hospital_id, technician_id,
    #          ai_detection, true_label, eeg_quality, seizure_duration
    df = eeg_annotations_df.copy()
    # Encode ordinal EEG quality numerically so it enters the model as a single slope
    if df['eeg_quality'].dtype == object:
        df['eeg_quality'] = df['eeg_quality'].map({'poor': 0, 'fair': 1, 'good': 2})

    # Model 1: Simple mixed model with random intercepts (shown for contrast, not fit here)
    model_simple = smf.mixedlm(
        formula="ai_accuracy ~ seizure_duration + eeg_quality",
        data=df,
        groups=df["hospital_id"],  # Random effect for hospital
        re_formula="~1"            # Random intercepts only
    )

    # Model 2: Nested random effects (more realistic)
    # Note: ai_accuracy is binary here, so this linear mixed model acts as a
    # linear-probability approximation used for illustration
    model_nested = smf.mixedlm(
        formula="ai_accuracy ~ seizure_duration + eeg_quality + equipment_type",
        data=df,
        groups=df["hospital_id"],
        re_formula="~1",
        vc_formula={"patient_id": "0 + C(patient_id)"}  # Patients nested within hospitals
    )
    result = model_nested.fit()

    # Extract clinically relevant insights
    insights = {
        'fixed_effects': {
            'seizure_duration_impact': result.params['seizure_duration'],
            'quality_impact': result.params['eeg_quality'],
            'interpretation': 'Longer seizures easier to detect' if result.params['seizure_duration'] > 0 else 'Duration not predictive'
        },
        'random_effects': {
            'hospital_variance': result.cov_re.iloc[0, 0],
            'explanation': 'High variance suggests site-specific training needed' if result.cov_re.iloc[0, 0] > 0.1 else 'Good generalization across sites'
        },
        'clinical_recommendation': generate_deployment_strategy(result)
    }
    return insights

def generate_deployment_strategy(model_result):
    """
    Translate statistical findings to clinical deployment
    """
    hospital_variance = model_result.cov_re.iloc[0, 0]
    if hospital_variance > 0.2:
        return "Require site-specific calibration before deployment"
    elif hospital_variance > 0.1:
        return "Recommend 1-week local validation period"
    else:
        return "Can deploy with standard configuration across sites"

# Example usage with simulated data
data = pd.DataFrame({
    'seizure_id': range(500),
    'patient_id': np.repeat(range(50), 10),
    'hospital_id': np.repeat(['Hospital_A', 'Hospital_B', 'Hospital_C'], [200, 200, 100]),
    'ai_accuracy': np.random.binomial(1, 0.85, 500),   # 85% baseline accuracy
    'seizure_duration': np.random.gamma(2, 15, 500),   # Seconds
    'eeg_quality': np.random.choice(['poor', 'fair', 'good'], 500),
    'equipment_type': np.repeat(['Brand_X', 'Brand_Y', 'Brand_X'], [200, 200, 100])
})
results = analyze_multicenter_eeg_study(data)
print(f"Hospital effect: {results['random_effects']['explanation']}")
print(f"Deployment strategy: {results['clinical_recommendation']}")
Key Insights:
- Fixed Effects: Seizure characteristics (duration, type) affect detection consistently
- Random Effects: Hospital differences due to equipment, protocols, patient populations
- Clinical Impact: Determines if one model works everywhere or needs site-specific tuning
3. Bootstrap Methods: Robust Confidence Interval Estimation
Limitations of Classical Confidence Intervals
Classical statistical methods for confidence interval calculation rely on distributional assumptions that may not hold for complex evaluation metrics. Metrics such as percentiles, ratios, or custom composite scores often lack established analytical solutions for confidence interval computation.
Bootstrap Resampling Methodology
Bootstrap methods provide a non-parametric approach to uncertainty quantification by creating multiple resampled datasets from observed data. This technique estimates the sampling distribution of any statistic without requiring distributional assumptions.
The Magic:
- No assumptions about normal distributions (real data is rarely normal)
- Works for any statistic you can calculate (median, weird ratios, custom metrics)
- Gives you confidence intervals that actually make sense
Algorithm Logic
import numpy as np

def bootstrap_confidence_interval(data, statistic_func, n_bootstrap=1000, ci=95):
    """
    Core Algorithm:
    1. Resample data WITH replacement
    2. Calculate statistic on each resample
    3. Use percentile method for CI
    """
    bootstrap_stats = []
    n = len(data)
    for _ in range(n_bootstrap):
        # Resample with replacement
        resample = np.random.choice(data, size=n, replace=True)
        # Calculate statistic
        stat = statistic_func(resample)
        bootstrap_stats.append(stat)
    # Calculate confidence intervals (percentile method)
    alpha = (100 - ci) / 2
    lower = np.percentile(bootstrap_stats, alpha)
    upper = np.percentile(bootstrap_stats, 100 - alpha)
    return {
        'estimate': statistic_func(data),
        'ci_lower': lower,
        'ci_upper': upper,
        'std_error': np.std(bootstrap_stats)
    }
# Practical use case: LLM evaluation metric uncertainty
def median_response_quality(scores):
return np.median(scores)
scores = [0.7, 0.8, 0.6, 0.9, 0.75, ...] # Annotation scores
result = bootstrap_confidence_interval(scores, median_response_quality)
print(f"Median quality: {result['estimate']:.3f} [{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]")
3.1 Stress Detection Validation: Regulatory Requirements for Uncertainty
Wearable stress detection algorithms require precise confidence interval estimates for regulatory approval. An algorithm showing 75% accuracy requires uncertainty bounds to determine clinical utility - whether the true performance ranges from 73-77% or 60-90% fundamentally impacts regulatory and clinical decisions.
Bootstrap methods provide the necessary precision for such critical assessments.
Scenario:
import numpy as np
import pandas as pd

def hrv_stress_detection_confidence(hrv_features, stress_labels):
    """
    Bootstrap CI for stress detection using HRV metrics
    Real-world challenge: HRV has high individual variability
    """
    # Custom metric: Stress detection sensitivity at fixed specificity
    def stress_detection_metric(indices):
        """
        Clinical metric: Sensitivity at 90% specificity
        (Minimize false alarms in consumer wearables)
        """
        # Resample data (positional indexing so duplicated rows are allowed)
        resampled_hrv = hrv_features.iloc[indices]
        resampled_labels = stress_labels[indices]
        # Simple threshold model on RMSSD (lower HRV suggests stress)
        rmssd_values = resampled_hrv['rmssd'].to_numpy()
        # Find threshold for 90% specificity
        non_stress_values = rmssd_values[resampled_labels == 0]
        threshold = np.percentile(non_stress_values, 10)  # 90% of non-stress values stay above
        # Calculate sensitivity
        stress_values = rmssd_values[resampled_labels == 1]
        sensitivity = np.mean(stress_values < threshold)
        return sensitivity

    # Bootstrap with 2000 iterations for stable estimates
    n_bootstrap = 2000
    n_samples = len(hrv_features)
    bootstrap_sensitivities = []
    for _ in range(n_bootstrap):
        # Resample with replacement
        indices = np.random.choice(n_samples, n_samples, replace=True)
        sensitivity = stress_detection_metric(indices)
        bootstrap_sensitivities.append(sensitivity)
    # Calculate confidence intervals
    ci_lower = np.percentile(bootstrap_sensitivities, 2.5)
    ci_upper = np.percentile(bootstrap_sensitivities, 97.5)
    point_estimate = np.median(bootstrap_sensitivities)
    # Clinical interpretation
    interpretation = interpret_stress_detection_ci(point_estimate, ci_lower, ci_upper)
    return {
        'sensitivity_at_90_spec': point_estimate,
        '95_ci': (ci_lower, ci_upper),
        'ci_width': ci_upper - ci_lower,
        'clinical_interpretation': interpretation,
        'sample_size_recommendation': recommend_sample_size(ci_upper - ci_lower)
    }

def interpret_stress_detection_ci(estimate, lower, upper):
    """
    Translate statistical CI to clinical meaning
    """
    if lower > 0.8:
        return "Excellent: Can reliably detect stress episodes"
    elif lower > 0.6:
        return "Good: Useful for tracking trends, not single events"
    elif upper < 0.5:
        return "Poor: Not clinically useful for stress detection"
    else:
        return "Uncertain: Need more data or better features"

def recommend_sample_size(ci_width):
    """
    Based on CI width, recommend data collection
    """
    if ci_width > 0.2:
        return "Collect 2x more data for stable estimates"
    elif ci_width > 0.1:
        return "Consider 50% more data for precision"
    else:
        return "Sample size adequate for deployment"
# Example with real-world HRV data structure
hrv_data = pd.DataFrame({
    'rmssd': np.random.gamma(30, 2, 500),  # HRV RMSSD values
    'sdnn': np.random.gamma(50, 2, 500),   # HRV SDNN values
    'pnn50': np.random.beta(2, 5, 500)     # HRV pNN50 percentage
})
stress_labels = np.random.binomial(1, 0.3, 500) # 30% stress prevalence
results = hrv_stress_detection_confidence(hrv_data, stress_labels)
print(f"Stress detection sensitivity: {results['sensitivity_at_90_spec']:.2%}")
print(f"95% CI: [{results['95_ci'][0]:.2%}, {results['95_ci'][1]:.2%}]")
print(f"Clinical use: {results['clinical_interpretation']}")
Key Insights:
- Why Bootstrap: HRV distributions are often non-normal (skewed)
- Clinical Metric: Sensitivity at fixed specificity (avoid false alarms)
- Individual Variability: Wide CIs expected due to person-to-person HRV differences
- Practical Application: Determines if wearable ready for consumer stress tracking
4. Multiple Comparisons: Controlling Family-wise Error Rates
The Multiple Testing Problem
Simultaneous testing of multiple hypotheses dramatically inflates the probability of Type I errors. When testing 20 independent hypotheses at α=0.05, the probability of at least one false positive reaches 64%. With 100 tests, false discoveries become virtually certain.
Statistical Consequences of Multiple Testing
The family-wise error rate, $1 - (1 - \alpha)^m$ for $m$ independent tests, climbs rapidly toward certainty as the number of comparisons grows, producing spurious significant results that fail to replicate. This fundamental statistical principle requires systematic correction methods.
Correction Methods
Bonferroni Correction (Conservative)
$$ \alpha_{\text{adjusted}} = \frac{\alpha}{m} $$ Where $m$ = number of comparisons
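The 64% figure quoted earlier follows directly from FWER = $1 - (1 - \alpha)^m$; the short check below (illustrative numbers only) also shows how testing each hypothesis at the Bonferroni-adjusted threshold keeps the family-wise rate at or below $\alpha$.

alpha, m = 0.05, 20

# Family-wise error rate without correction: P(at least one false positive)
fwer_uncorrected = 1 - (1 - alpha) ** m
print(f"Uncorrected FWER for {m} tests: {fwer_uncorrected:.2f}")  # ~0.64

# Bonferroni: test each hypothesis at alpha / m instead
alpha_adjusted = alpha / m
fwer_bonferroni = 1 - (1 - alpha_adjusted) ** m
print(f"Adjusted threshold {alpha_adjusted:.4f} gives FWER ≈ {fwer_bonferroni:.3f}")  # ≤ 0.05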
False Discovery Rate (FDR) - Benjamini-Hochberg (Less Conservative)
from statsmodels.stats.multitest import multipletests

def apply_multiple_testing_correction(p_values, method='fdr_bh'):
    """
    Algorithm (Benjamini-Hochberg):
    1. Sort p-values in ascending order
    2. Find largest i where P(i) ≤ (i/m) × α
    3. Reject all H₀ for P(1)...P(i)
    """
    rejected, corrected_pvals, _, _ = multipletests(
        p_values,
        alpha=0.05,
        method=method  # 'bonferroni' or 'fdr_bh'
    )
    return {
        'significant': rejected,
        'corrected_p': corrected_pvals,
        'n_significant': sum(rejected)
    }
# Example: Comparing 5 models against baseline
p_values = [0.001, 0.03, 0.04, 0.15, 0.02]
results = apply_multiple_testing_correction(p_values)
5. Effect Size: Quantifying Practical Significance
Limitations of Statistical Significance
Statistical significance testing alone provides insufficient information for practical decision-making. With sufficiently large sample sizes, trivial differences achieve statistical significance, creating a disconnect between statistical and practical importance. Effect size metrics bridge this gap by quantifying the magnitude of observed differences.
Cohen’s d: The Universal Language of “How Much?”
$$ d = \frac{\mu_1 - \mu_2}{\sigma_{\text{pooled}}} $$ Interpretation: 0.2=small, 0.5=medium, 0.8=large
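Cohen's d is straightforward to compute from the pooled standard deviation; here is a minimal sketch for reference (the score arrays are invented for illustration):

import numpy as np

def cohens_d(x, y):
    """Cohen's d using the pooled standard deviation (sample variances, ddof=1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Invented quality scores for two models
model_a = [0.82, 0.78, 0.85, 0.80, 0.79]
model_b = [0.74, 0.76, 0.73, 0.77, 0.75]
print(f"Cohen's d = {cohens_d(model_a, model_b):.2f}")  # well above 0.8, i.e. a large effect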
Cliff’s Delta (Ordinal Data)
Non-parametric effect size for ordinal ratings (1-5 scales)
def cliffs_delta(x, y):
    """
    Algorithm:
    1. Count all pairwise comparisons
    2. Calculate dominance probability
    3. Delta = P(X>Y) - P(X<Y)
    """
    n1, n2 = len(x), len(y)
    greater = sum(1 for xi in x for yi in y if xi > yi)
    less = sum(1 for xi in x for yi in y if xi < yi)
    delta = (greater - less) / (n1 * n2)
    # Common interpretation thresholds
    if abs(delta) < 0.147:
        effect = "negligible"
    elif abs(delta) < 0.33:
        effect = "small"
    elif abs(delta) < 0.474:
        effect = "medium"
    else:
        effect = "large"
    return delta, effect
# Example: Comparing user satisfaction ratings (1-5 scale)
model_a_ratings = [4, 5, 3, 4, 5, 4, 3]
model_b_ratings = [3, 3, 2, 4, 3, 2, 3]
delta, magnitude = cliffs_delta(model_a_ratings, model_b_ratings)
6. Comprehensive Statistical Evaluation Pipeline
Systematic Analysis Framework
A systematic approach to statistical evaluation integrates multiple methodological components to ensure robust and reliable results. This comprehensive pipeline addresses common analytical pitfalls while providing actionable insights.
class EvaluationStatistics:
    def __init__(self, data, data_type='continuous'):
        self.data = data
        self.data_type = data_type  # 'continuous' or 'ordinal'; selects the effect-size metric
        self.effect_size = None     # Populated during the pipeline run

    def full_analysis_pipeline(self):
        """
        Comprehensive statistical evaluation:
        1. Check assumptions
        2. Choose appropriate tests
        3. Calculate effect sizes
        4. Apply corrections
        5. Generate report
        (Helper methods such as parametric_test() and bootstrap_confidence_intervals()
        are assumed to be implemented elsewhere in the evaluation codebase.)
        """
        # Step 1: Assumption checking
        normality = self.check_normality()
        homogeneity = self.check_variance_homogeneity()
        # Step 2: Select test based on assumptions
        if normality and homogeneity:
            test_result = self.parametric_test()     # t-test, ANOVA
        else:
            test_result = self.nonparametric_test()  # Mann-Whitney, Kruskal-Wallis
        # Step 3: Effect size
        if self.data_type == 'continuous':
            self.effect_size = self.calculate_cohens_d()
        else:
            self.effect_size = self.calculate_cliffs_delta()
        # Step 4: Confidence intervals
        ci = self.bootstrap_confidence_intervals()
        # Step 5: Power analysis
        achieved_power = self.post_hoc_power_analysis()
        return {
            'test_used': test_result['method'],
            'p_value': test_result['p'],
            'effect_size': self.effect_size,
            'confidence_interval': ci,
            'statistical_power': achieved_power,
            'sample_size_recommendation': self.recommend_sample_size()
        }

    def check_normality(self):
        """Shapiro-Wilk test for normality"""
        from scipy import stats
        _, p = stats.shapiro(self.data)
        return p > 0.05

    def recommend_sample_size(self):
        """Heuristic based on the observed effect size"""
        if self.effect_size < 0.3:
            return "Need 200+ samples per condition for small effects"
        elif self.effect_size < 0.5:
            return "Need 50+ samples per condition for medium effects"
        else:
            return "Need 20+ samples per condition for large effects"
Core Principles for Statistical Evaluation
Essential Guidelines for Robust Analysis
Effective statistical evaluation requires adherence to fundamental principles that ensure valid and reliable results:
“What’s your data structure?” - The first question that separates novices from experts. Nested data? Paired observations? The structure determines everything else.
“Statistical significance without practical significance is just expensive noise” - I’ve seen teams celebrate p < 0.001 improvements that users couldn’t even notice. Effect size is your reality check.
“Multiple comparisons are like lottery tickets” - Buy enough and you’ll win (find significance) by chance. Always correct for multiple testing, or prepare to explain why your “discovery” didn’t replicate.
“When classical statistics gets complicated, bootstrap” - The bootstrap is like duct tape - not always elegant, but it works when nothing else will.
“Hierarchical data needs hierarchical models” - Ignoring the fact that annotators are different people with different biases is like ignoring that patients come from different hospitals. Mixed models aren’t fancy - they’re honest.
“Sample size calculations are boring until you run out of money” - I’ve never regretted doing a power analysis. I’ve frequently regretted not doing one.
Common Evaluation Failures: A Case Study
Consider a medical AI system that demonstrated promising results (p < 0.05) in initial validation with 100 cases. However, subsequent large-scale validation with 10,000 cases showed no significant effect. This failure illustrates multiple methodological errors:
- Inadequate power analysis resulting in insufficient sample size
- Multiple comparison testing without appropriate corrections
- Failure to account for hierarchical data structure (hospital effects)
- Overreliance on p-values without effect size consideration
Core principle: Rigorous statistical methodology prevents erroneous conclusions and ensures reproducible results.
Your Statistical Toolkit Checklist
Before any evaluation:
- Did I calculate sample size needed?
- Did I identify my data structure?
- Did I plan for multiple comparisons?
- Did I choose appropriate effect size metrics?
- Did I consider hierarchical effects?
- Did I plan for confidence intervals, not just point estimates?
Get these right, and you’ll never have to explain why your “breakthrough” didn’t replicate.
Quick Reference Formulas
- Standard Error: $\text{SE} = \sigma/\sqrt{n}$
- 95% CI: $\text{mean} \pm 1.96 \times \text{SE}$
- Cohen’s d: $d = (\mu_1 - \mu_2)/\sigma_{\text{pooled}}$
- ICC: $\rho = \sigma^2_{\text{between}}/(\sigma^2_{\text{between}} + \sigma^2_{\text{within}})$
- Bonferroni $\alpha$: $\alpha_{\text{adj}} = \alpha/m$
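A quick numeric check of these formulas, using invented scores and invented variance components:

import numpy as np

scores = np.array([0.72, 0.68, 0.75, 0.80, 0.77, 0.70, 0.74, 0.79])  # invented ratings
se = scores.std(ddof=1) / np.sqrt(len(scores))                        # standard error
ci_low, ci_high = scores.mean() - 1.96 * se, scores.mean() + 1.96 * se  # normal-approx 95% CI

sigma_between, sigma_within = 0.30, 0.60  # invented between- and within-group SDs
icc = sigma_between**2 / (sigma_between**2 + sigma_within**2)  # = 0.20

print(f"mean {scores.mean():.3f}, SE {se:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}], ICC {icc:.2f}")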
Python Libraries Checklist
import statsmodels.api as sm # Regression, mixed models
import statsmodels.formula.api as smf # R-style formulas
from scipy import stats # Statistical tests
import pingouin as pg # Effect sizes, power analysis
from statsmodels.stats.multitest import multipletests # Multiple comparisons
import numpy as np
import pandas as pd
© 2025 Seyed Yahya Shirazi. All rights reserved.