Clinical Evidence and Validation
This page explains how the acne severity scoring technology is validated, what the performance metrics mean, and why the results demonstrate that the AI is as reliable as an expert dermatologist.
How the ground truth is established
Before looking at any performance numbers, it is critical to understand what the numbers are compared against. This is where most misunderstandings occur.
The ground truth is the mathematical consensus of 2–3 independent expert dermatologists per image. Each dermatologist scored every image independently using the IGA scale, and the consensus (ground truth) was computed as the mathematical aggregate of their scores.
Why this approach: In acne grading, there is no objective "true" severity; severity is a clinical judgement. The best approximation of truth is the consensus of multiple experts. This is the same approach used by the FDA and EMA for establishing reference standards in dermatology clinical trials.
Matching the consensus IS the ceiling: it represents effectively 100% of the achievable performance. Individual dermatologists (Pearson ~0.56–0.65, Cohen's κ ~0.46–0.62) do not achieve perfect agreement with their own consensus. Any system that matches these values is performing at the level of an expert dermatologist. A system that exceeds these values (as ALADIN does on Cohen's κ: 0.53 vs. 0.46) is outperforming the average individual expert.
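As a toy illustration of the consensus step: the text above describes a "mathematical aggregate" without naming the exact function, so this sketch assumes the arithmetic mean rounded to the nearest grade. Function name and aggregation choice are illustrative, not taken from the study protocol.

```python
def consensus_iga(scores):
    """Aggregate independent IGA grades (0-4) into one ground-truth grade.

    ASSUMPTION: the page says "mathematical aggregate" without naming the
    function; here we use the arithmetic mean rounded to the nearest
    integer grade.
    """
    return round(sum(scores) / len(scores))

# three dermatologists grade the same image independently
print(consensus_iga([2, 3, 3]))  # 3
```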
Why there is no "perfect" score
In acne severity grading, there is no objective measurement; severity is a clinical judgement. Unlike measuring blood pressure (where 120/80 is the same on every device), acne severity depends on a dermatologist's interpretation of vague descriptors like "some" or "many" inflammatory lesions.
This means:
- The ground truth is the consensus of multiple expert dermatologists, not an absolute truth
- Individual dermatologists themselves disagree with the consensus; their own Pearson correlation is ~0.56–0.65, their Cohen's κ is ~0.46–0.62
- These dermatologist-level numbers are the realistic ceiling: no scoring system, human or AI, can reliably exceed the agreement that experts have with each other
- A system that matches dermatologist performance is clinically excellent: it means the AI is as reliable as adding another expert to the panel
Matching the consensus is effectively a perfect score. The consensus IS the best available approximation of truth. When ALADIN achieves a Cohen's κ of 0.53 and the average dermatologist achieves 0.46 on the same dataset against the same consensus, ALADIN is not "merely moderate"; it is outperforming the typical inter-rater agreement of trained dermatologists. The correct frame of reference is not "how close to 1.0?" but "how close to what expert dermatologists achieve?" By that measure, ALADIN is at or above 100%.
Understanding the metrics
Four metrics are used to evaluate the agreement between ALADIN scores and the expert consensus. Each captures a different aspect of agreement:
Pearson correlation
The strength and direction of the linear relationship between two sets of scores. In this context: how closely ALADIN's scores track the expert consensus scores.
Perfect score: 1.0 would mean perfect linear correlation with the consensus. However, individual dermatologists themselves only achieve ~0.56–0.65 against the consensus, because acne grading is inherently subjective.
Why it matters: In a clinical trial, you need the scoring system to track the same severity trend as expert dermatologists. High Pearson correlation means ALADIN and dermatologists rank patients in the same order.
Spearman correlation
The strength of the monotonic (rank-order) relationship between two sets of scores. Less sensitive to outliers than Pearson. Measures whether ALADIN ranks patients in the same severity order as the consensus.
Perfect score: 1.0 would mean perfect rank agreement. Individual dermatologists achieve ~0.58 against the consensus.
Why it matters: For clinical trials, rank-order agreement is critical: you need to reliably distinguish ‘improving’ from ‘worsening’ patients. Spearman tells you whether the AI preserves this ordering.
Cohen’s kappa (κ)
Agreement between two raters beyond what would be expected by chance. Uses quadratic weights, meaning a 2-grade disagreement is penalised more than a 1-grade disagreement. This is the primary measure of clinical agreement.
Perfect score: 1.0 would mean perfect agreement. In acne grading, individual dermatologists achieve κ = 0.46–0.62 against the consensus — this is the realistic ceiling set by the subjectivity of the task.
Why it matters: Cohen’s κ is the gold standard for measuring inter-rater reliability in clinical research. Regulatory bodies and journal reviewers expect this metric. It tells you: ‘Is this scoring system as reliable as a human expert?’
Mean absolute error
The average magnitude of scoring errors in IGA points. An MAE of 0.63 means that, on average, ALADIN’s score differs from the consensus by less than 1 IGA grade.
Perfect score: 0.0 would mean zero error. Individual dermatologists achieve MAE of 0.44–0.65 against the consensus — even experts don’t agree perfectly.
Why it matters: For a 5-point scale (IGA 0–4), an MAE under 1.0 means the AI is never more than one grade off on average. This is critical for clinical trials where you need to detect 1–2 grade changes as treatment effects.
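The two headline metrics, quadratic-weighted Cohen's κ and MAE, can be sketched in plain Python. This is an illustrative implementation on hypothetical IGA grades (0–4), not the study's analysis code:

```python
from collections import Counter

def mean_absolute_error(a, b):
    """Average absolute scoring difference in IGA points."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def quadratic_weighted_kappa(a, b, k=5):
    """Cohen's kappa with quadratic weights for k ordinal grades (0..k-1).

    A 2-grade disagreement is penalised four times more than a 1-grade
    one, matching the weighting described for the IGA agreement analysis.
    """
    n = len(a)
    obs = Counter(zip(a, b))          # observed joint counts
    ma, mb = Counter(a), Counter(b)   # marginal counts per rater
    # weighted observed vs. chance-expected disagreement; the constant
    # (k-1)^2 in the quadratic weights cancels in the ratio
    observed = sum((i - j) ** 2 * obs.get((i, j), 0) / n
                   for i in range(k) for j in range(k))
    expected = sum((i - j) ** 2 * ma.get(i, 0) * mb.get(j, 0) / (n * n)
                   for i in range(k) for j in range(k))
    return 1.0 - observed / expected

# hypothetical example: AI scores vs. expert consensus on 8 images
consensus = [0, 1, 1, 2, 2, 3, 3, 4]
ai_scores = [0, 1, 2, 2, 2, 3, 4, 4]
print(mean_absolute_error(consensus, ai_scores))  # 0.25
print(quadratic_weighted_kappa(consensus, ai_scores))
```

Pearson and Spearman correlations are standard library fare (e.g. `scipy.stats.pearsonr` / `spearmanr`) and are omitted here for brevity.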
Severity scoring performance: ALADIN vs. dermatologists
The following results are from the peer-reviewed validation study (Sabater et al., Skin Health and Disease, 2026). They evaluate the ALADIN scoring formula, the mathematical model that converts lesion count and density into an IGA-aligned severity score. Both ALADIN and individual dermatologists are measured against the same ground truth (dermatologist consensus). Values in bold with a check mark indicate where ALADIN meets or exceeds the dermatologist benchmark.
IGA agreement
| Metric | ALADIN (mild-to-moderate) | Dermatologists (mild-to-moderate) | ALADIN (severe) | Dermatologists (severe) |
|---|---|---|---|---|
| Pearson | **0.58** ✓ | 0.56 | **0.65** ✓ | 0.65 |
| Spearman | 0.56 | 0.58 | 0.54 | 0.58 |
| MAE | **0.63 (0.49)** ✓ | 0.65 (0.63) | 0.64 (0.71) | 0.44 (0.50) |
| Cohen’s κ | **0.53** ✓ | 0.46 | 0.61 | 0.62 |
How to read this table
- **Pearson 0.58 vs. 0.56**: ALADIN's linear correlation with the consensus slightly exceeds the average individual dermatologist's. This means ALADIN tracks severity trends at least as accurately as a single expert.
- **Cohen's κ 0.53 vs. 0.46**: This is the most important result. ALADIN has better categorical agreement with the consensus than individual dermatologists do. On a scale where 0.41–0.60 is "moderate" (the typical range for IGA inter-rater agreement), ALADIN sits at the top of that range while the dermatologists sit at the bottom.
- **Cohen's κ 0.61 vs. 0.62 (severe cases)**: On the independent test set of severe cases, ALADIN essentially matches dermatologist agreement. A κ of 0.61 falls in the "substantial agreement" range.
- **MAE 0.63 vs. 0.65**: ALADIN's average error is less than 1 IGA grade, comparable to the error of individual dermatologists. Even when the AI is wrong, it is wrong by less than one severity level on average.
- **MAE 0.64 vs. 0.44 (severe cases)**: On the severe-case dataset, dermatologists achieve lower error. This is expected: severe acne is easier for human experts to grade consistently, while the AI model was primarily trained on mild-to-moderate cases.
What this means for your trial
- The AI is as reliable as an expert dermatologist: it matches or exceeds their agreement with the consensus across most metrics
- Unlike a human rater, the AI is perfectly reproducible: the same image always produces the same score, with zero intra-rater variability
- Across every site, every time: no calibration drift, no fatigue, no subjective inconsistency between sites
- This reduces the noise in your endpoint data, potentially enabling smaller sample sizes to detect the same treatment effect
For Phase 3 registration trials, sponsors typically use the standard IGA (0–4) and inflammatory lesion counts as primary and co-primary endpoints, both of which the system provides with full reproducibility. The ALADIN composite (0–10) is positioned for exploratory and secondary endpoints: its higher-resolution continuous scale is particularly valuable for Phase 2 dose-finding studies, proof-of-concept trials, and internal go/no-go decisions where sensitivity to subtle treatment effects is critical. AI-computed endpoints from Legit.Health have been accepted by regulators as secondary endpoints and for adverse event detection in clinical submissions.
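The sample-size point can be made concrete with the standard two-arm formula n ≈ 2(z₁₋α/₂ + z_power)²σ²/δ². The σ and δ values below are purely illustrative and are not taken from any Legit.Health study; they only show how reducing endpoint noise shrinks the required cohort:

```python
import math
from statistics import NormalDist

def n_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-arm comparison of means.

    n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2
    sigma: endpoint standard deviation (rater noise inflates this)
    delta: treatment effect to detect, in the same units
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

# hypothetical: detecting a 0.5-grade effect; cutting endpoint noise
# from sigma = 1.0 to sigma = 0.8 shrinks the required sample size
print(n_per_arm(1.0, 0.5))  # 63 per arm
print(n_per_arm(0.8, 0.5))  # 41 per arm
```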
Lesion detection performance
The following metrics are validated per the Quality Management System (IEC 62304 / ISO 14971).
- mAP@50: 0.45 (95% CI: 0.43–0.47)
- rMAE Papule: 0.58
- rMAE Pustule: 0.28
- rMAE Comedone: 0.62
- rMAE Nodule/Cyst: 0.33
How to read these metrics
- mAP@50 = 0.45 (95% CI: 0.43–0.47): The inflammatory lesion detector achieves mean average precision of 0.45 at the standard 50% IoU threshold. The acceptance criterion (≥0.21) is based on non-inferiority to published acne detection studies, where the literature range is 0.21–0.54.
- rMAE (relative Mean Absolute Error): For per-type detection, each lesion type's counting error is compared against the inter-rater variability of expert dermatologists on the same task. The acceptance criterion is that the model's rMAE must not exceed the experts' rMAE. All lesion types pass, meaning the AI counts each lesion type at least as consistently as dermatologists disagree with each other.
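The rMAE comparison can be sketched as follows. The exact normalisation used in the QMS validation is not stated above, so this sketch assumes MAE normalised by the mean reference count; only the non-inferiority comparison itself is taken from the text:

```python
def rmae(pred_counts, ref_counts):
    """Relative MAE for per-type lesion counting.

    ASSUMPTION: MAE normalised by the mean reference count; the QMS
    validation may define the normalisation differently.
    """
    n = len(ref_counts)
    mae = sum(abs(p - r) for p, r in zip(pred_counts, ref_counts)) / n
    return mae / (sum(ref_counts) / n)

def passes_acceptance(model_rmae, expert_rmae):
    """Non-inferiority criterion: the model's counting error must not
    exceed expert inter-rater error on the same task."""
    return model_rmae <= expert_rmae

# hypothetical papule counts on two images vs. the reference counts
print(rmae([8, 12], [10, 10]))  # 0.2
```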
The lesion detector is not used in isolation; it feeds into the ALADIN/IGA formula, which was co-optimised with the detector. The formula constants (a and b in IGA = N^a · (D + b)) were calibrated to produce correct IGA scores given the detector's actual output, not given perfect counts. This means the end-to-end system (detector + formula) achieves dermatologist-level IGA agreement even though the raw detection metrics appear modest in isolation.
Additionally, the acceptance criteria are based on non-inferiority to expert inter-rater variability, a stronger validation methodology than simple accuracy benchmarks. The question is not "is the AI perfect?" but "is the AI at least as consistent as dermatologists are with each other?"
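A minimal sketch of the end-to-end idea: detector output (lesion count N and spatial density D) feeds the published form IGA = N^a · (D + b). The constants and the clamping to the 0–4 range below are placeholders, not the calibrated values from the paper:

```python
def aladin_iga(n_lesions, density, a=0.3, b=0.5):
    """IGA-aligned severity from lesion count N and spatial density D.

    Implements the published form IGA = N^a * (D + b). The constants
    a and b here are HYPOTHETICAL placeholders; the real values were
    calibrated against the dermatologist consensus using the
    detector's actual output. The clamp to 4.0 is also an assumption
    made for this sketch.
    """
    if n_lesions == 0:
        return 0.0  # clear skin maps to IGA 0
    return min(4.0, n_lesions ** a * (density + b))

# more lesions and higher density => higher severity
print(aladin_iga(10, 0.2) < aladin_iga(30, 0.4))  # True
```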
Validation datasets
| Dataset | Images | Source | Severity profile | Annotators | Role |
|---|---|---|---|---|---|
| Mild-to-moderate | 331 | Private (Clínica Dermatológica Internacional, Spain) | Predominantly mild-to-moderate facial acne (IGA 0: 14, IGA 1: 65, IGA 2: 150, IGA 3: 83, IGA 4: 19) | 3 | Primary dataset for ALADIN formula design and evaluation |
| Severe cases | 27 | Public dermatology atlases (DermAtlas, Atlas Dermatológico) | Predominantly severe acne (IGA 0: 1, IGA 1: 3, IGA 2: 10, IGA 3: 11, IGA 4: 2) | 2 | Exploratory test set for evaluating generalisation to severe cases and third-party images |
The two datasets complement each other:
- 331 images: Primary evaluation dataset with a realistic distribution of mild-to-moderate acne, annotated by 3 specialists
- 27 images: Tests generalisation to severe cases from public atlases, annotated by 2 specialists
ALADIN peer-reviewed publication
Sabater A, et al. “ALADIN: Acne Lesion And Density INdex. A Novel Tool for Automatic Acne Severity Assessment.” Skin Health and Disease (British Association of Dermatologists). 2026. (Provisionally accepted)
- Development of the IGA = N^a · (D + b) formula combining lesion count and spatial density into an IGA-aligned severity score
- Calibration of formula constants against the consensus of three board-certified dermatologists
- Validation of the CNN-based inflammatory lesion detection model (papules, pustules, nodules)
- Evidence that incorporating spatial density alongside lesion count improves alignment with clinical severity perception
The paper was presented as a poster at the AEDV 2025 conference in Paris prior to journal submission.
DIQA validation
Hernández Montilla I, Mac Carthy T, Aguilar A, Medela A. “Dermatology Image Quality Assessment (DIQA): Artificial intelligence to ensure the clinical utility of images for remote consultations and clinical trials.” J Am Acad Dermatol. 2023;88(4):927–928. doi:10.1016/j.jaad.2022.11.002
- Pearson correlation ≥0.70 with expert image quality assessment
- Real-time evaluation of focus, lighting, framing, and resolution
- Applicability to both clinical practice and clinical trial settings
DIQA is the image quality assessment algorithm that acts as a quality gate in the clinical trial workflow. It is critical for maintaining consistent image quality across investigator sites in multi-center trials.
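The quality-gate role can be illustrated with a minimal sketch. The score scale, threshold, and result structure below are hypothetical; DIQA's actual output format is not specified on this page:

```python
from dataclasses import dataclass, field

@dataclass
class QualityResult:
    score: float                      # HYPOTHETICAL 0-1 quality score
    issues: list = field(default_factory=list)  # e.g. ["blur", "framing"]

def quality_gate(result, threshold=0.7):
    """Accept the image for scoring, or ask the site to retake it.

    The threshold is a HYPOTHETICAL placeholder; the real DIQA model
    evaluates focus, lighting, framing, and resolution in real time.
    """
    if result.score >= threshold:
        return "accept"
    return "retake: " + ", ".join(result.issues)

print(quality_gate(QualityResult(0.9)))                      # accept
print(quality_gate(QualityResult(0.4, ["blur", "framing"])))  # retake: blur, framing
```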
Related validation studies
The same AI architecture and methodology used for acne scoring has been validated across multiple dermatological conditions:
Legit.Health. “Automatic International Hidradenitis Suppurativa Severity Score System (AIHS4): A Novel Tool to Assess the Severity of Hidradenitis Suppurativa Using Artificial Intelligence.” 2025. (Published)
- Inter-observer ICC ≥ 0.727 (95% CI: 0.66–0.79) for objective severity assessment
- State-of-the-art comparison: ICC of 0.47 without the device vs. 0.727 with the device
- Same object detection + scoring methodology as acne
Validation has also been completed for APASI (psoriasis), ASCORAD (atopic dermatitis), and multiple MRMC studies (BI_2024, SAN_2024).
Regulatory-grade validation pathway
The clinical evidence follows a structured regulatory pathway:
| Standard | Scope | Application to acne scoring |
|---|---|---|
| IEC 62304 | Software lifecycle processes | The AI scoring pipeline follows a documented development lifecycle with risk-based classification |
| ISO 14971 | Risk management | Systematic risk analysis including failure modes (missed lesions, false positives, image quality) |
| IEC 62366-1 | Usability engineering | The mobile capture application has been validated for usability at investigator sites |
| MEDDEV 2.7/1 Rev 4 | Clinical evaluation | Clinical evidence compiled following the structured methodology for clinical evaluation reports |
| MDR Annex XIV | Clinical evaluation and PMCF | Post-market clinical follow-up ensures ongoing validation as the technology evolves |
Ongoing clinical validation program
| Study | Condition | Endpoints | Status |
|---|---|---|---|
| ALADIN observational study | Acne vulgaris | Lesion count, density, IGA | Completed; paper provisionally accepted |
| ALADIN clinical investigation (DermoMedic) | Acne vulgaris | Prospective validation | In preparation (2026) |
| DIQA validation | All conditions | Image quality assessment | Published (JAAD 2023) |
| AIHS4 validation | Hidradenitis suppurativa | IHS4 severity scoring | Published |
| MRMC study (BI_2024) | Multiple conditions | Diagnostic accuracy, severity | Completed |
| MRMC study (SAN_2024) | Multiple conditions | Diagnostic accuracy, severity | Completed |
| APASI validation | Psoriasis | PASI severity scoring | Published |
| ASCORAD validation | Atopic dermatitis | SCORAD severity scoring | Published |
For the full list of clinical evidence, see the clinical validation section.