Clinical Evidence and Validation
This page explains how the acne severity scoring technology is validated, what the performance metrics mean, and why the results demonstrate that the AI is as reliable as an expert dermatologist.
How the ground truth is established
Before looking at any performance numbers, it is critical to understand what the numbers are compared against. This is where most misunderstandings occur.
The ground truth is the mathematical consensus of 2–3 independent expert dermatologists per image. Each dermatologist scored every image independently using the IGA scale, and the consensus (ground truth) was computed as the mathematical aggregate of their scores.
Why this approach: In acne grading, there is no objective "true" severity; severity is a clinical judgement. The best approximation of truth is the consensus of multiple experts. This is the same approach used by the FDA and EMA for establishing reference standards in dermatology clinical trials.
Matching the consensus IS the ceiling: it represents effectively 100% of the achievable performance. Individual dermatologists (Pearson ~0.56–0.65, Cohen's κ ~0.46–0.62) do not achieve perfect agreement with their own consensus. Any system that matches these values is performing at the level of an expert dermatologist. A system that exceeds these values (as ALADIN does on Cohen's κ: 0.53 vs. 0.46) is outperforming the average individual expert.
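As a toy illustration of the consensus step: the text above describes a "mathematical aggregate" without naming the exact function, so this sketch assumes the arithmetic mean rounded to the nearest grade. Function name and aggregation choice are illustrative, not taken from the study protocol.

```python
def consensus_iga(scores):
    """Aggregate independent IGA grades (0-4) into one ground-truth grade.

    ASSUMPTION: the page says "mathematical aggregate" without naming the
    function; here we use the arithmetic mean rounded to the nearest
    integer grade.
    """
    return round(sum(scores) / len(scores))

# three dermatologists grade the same image independently
print(consensus_iga([2, 3, 3]))  # 3
```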
Why there is no "perfect" score
In acne severity grading, there is no objective measurement; severity is a clinical judgement. Unlike measuring blood pressure (where 120/80 is the same on every device), acne severity depends on a dermatologist's interpretation of vague descriptors like "some" or "many" inflammatory lesions.
This means:
- The ground truth is the consensus of multiple expert dermatologists, not an absolute truth
- Individual dermatologists themselves disagree with the consensus; their own Pearson correlation is ~0.56–0.65, their Cohen's κ is ~0.46–0.62
- These dermatologist-level numbers are the realistic ceiling: no scoring system, human or AI, can reliably exceed the agreement that experts have with each other
- A system that matches dermatologist performance is clinically excellent: it means the AI is as reliable as adding another expert to the panel
Matching the consensus is effectively a perfect score. The consensus IS the best available approximation of truth. When ALADIN achieves a Cohen's κ of 0.53 and the average dermatologist achieves 0.46 on the same dataset against the same consensus, ALADIN is not "merely moderate"; it is outperforming the typical inter-rater agreement of trained dermatologists. The correct frame of reference is not "how close to 1.0?" but "how close to what expert dermatologists achieve?" By that measure, ALADIN is at or above 100%.
Understanding the metrics
Four metrics are used to evaluate the agreement between ALADIN scores and the expert consensus. Each captures a different aspect of agreement:
Pearson correlation
The strength and direction of the linear relationship between two sets of scores. In this context: how closely ALADIN's scores track the expert consensus scores.
Perfect score: 1.0 would mean perfect linear correlation with the consensus. However, individual dermatologists themselves only achieve ~0.56–0.65 against the consensus, because acne grading is inherently subjective.
Why it matters: In a clinical trial, you need the scoring system to track the same severity trend as expert dermatologists. High Pearson correlation means ALADIN and dermatologists rank patients in the same order.
Spearman correlation
The strength of the monotonic (rank-order) relationship between two sets of scores. Less sensitive to outliers than Pearson. Measures whether ALADIN ranks patients in the same severity order as the consensus.
Perfect score: 1.0 would mean perfect rank agreement. Individual dermatologists achieve ~0.58 against the consensus.
Why it matters: For clinical trials, rank-order agreement is critical: you need to reliably distinguish ‘improving’ from ‘worsening’ patients. Spearman tells you whether the AI preserves this ordering.
Cohen’s kappa (κ)
Agreement between two raters beyond what would be expected by chance. Uses quadratic weights, meaning a 2-grade disagreement is penalised more than a 1-grade disagreement. This is the primary measure of clinical agreement.
Perfect score: 1.0 would mean perfect agreement. In acne grading, individual dermatologists achieve κ = 0.46–0.62 against the consensus — this is the realistic ceiling set by the subjectivity of the task.
Why it matters: Cohen’s κ is the gold standard for measuring inter-rater reliability in clinical research. Regulatory bodies and journal reviewers expect this metric. It tells you: ‘Is this scoring system as reliable as a human expert?’
Mean absolute error
The average magnitude of scoring errors in IGA points. An MAE of 0.63 means that, on average, ALADIN’s score differs from the consensus by less than 1 IGA grade.
Perfect score: 0.0 would mean zero error. Individual dermatologists achieve MAE of 0.44–0.65 against the consensus — even experts don’t agree perfectly.
Why it matters: For a 5-point scale (IGA 0–4), an MAE under 1.0 means the AI is never more than one grade off on average. This is critical for clinical trials where you need to detect 1–2 grade changes as treatment effects.
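The two headline metrics, quadratic-weighted Cohen's κ and MAE, can be sketched in plain Python. This is an illustrative implementation on hypothetical IGA grades (0–4), not the study's analysis code:

```python
from collections import Counter

def mean_absolute_error(a, b):
    """Average absolute scoring difference in IGA points."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def quadratic_weighted_kappa(a, b, k=5):
    """Cohen's kappa with quadratic weights for k ordinal grades (0..k-1).

    A 2-grade disagreement is penalised four times more than a 1-grade
    one, matching the weighting described for the IGA agreement analysis.
    """
    n = len(a)
    obs = Counter(zip(a, b))          # observed joint counts
    ma, mb = Counter(a), Counter(b)   # marginal counts per rater
    # weighted observed vs. chance-expected disagreement; the constant
    # (k-1)^2 in the quadratic weights cancels in the ratio
    observed = sum((i - j) ** 2 * obs.get((i, j), 0) / n
                   for i in range(k) for j in range(k))
    expected = sum((i - j) ** 2 * ma.get(i, 0) * mb.get(j, 0) / (n * n)
                   for i in range(k) for j in range(k))
    return 1.0 - observed / expected

# hypothetical example: AI scores vs. expert consensus on 8 images
consensus = [0, 1, 1, 2, 2, 3, 3, 4]
ai_scores = [0, 1, 2, 2, 2, 3, 4, 4]
print(mean_absolute_error(consensus, ai_scores))  # 0.25
print(quadratic_weighted_kappa(consensus, ai_scores))
```

Pearson and Spearman correlations are standard library fare (e.g. `scipy.stats.pearsonr` / `spearmanr`) and are omitted here for brevity.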
Severity scoring performance: ALADIN vs. dermatologists
The following results are from the peer-reviewed validation study (Sabater et al., Skin Health and Disease, 2026). They evaluate the ALADIN scoring formula, the mathematical model that converts lesion count and density into an IGA-aligned severity score. Both ALADIN and individual dermatologists are measured against the same ground truth (dermatologist consensus). Values in bold with a check mark indicate where ALADIN meets or exceeds the dermatologist benchmark.
IGA agreement
| Metric | ALADIN (mild-to-moderate) | Dermatologists (mild-to-moderate) | ALADIN (severe) | Dermatologists (severe) |
|---|---|---|---|---|
| Pearson | **0.58** ✓ | 0.56 | **0.65** ✓ | 0.65 |
| Spearman | 0.56 | 0.58 | 0.54 | 0.58 |
| MAE | **0.63 (0.49)** ✓ | 0.65 (0.63) | 0.64 (0.71) | 0.44 (0.50) |
| Cohen’s κ | **0.53** ✓ | 0.46 | 0.61 | 0.62 |
How to read this table
- **Pearson 0.58 vs. 0.56**: ALADIN's linear correlation with the consensus slightly exceeds the average individual dermatologist's. This means ALADIN tracks severity trends at least as accurately as a single expert.
- **Cohen's κ 0.53 vs. 0.46**: This is the most important result. ALADIN has better categorical agreement with the consensus than individual dermatologists do. On a scale where 0.41–0.60 is "moderate" (the typical range for IGA inter-rater agreement), ALADIN sits at the top of that range while the dermatologists sit at the bottom.
- **Cohen's κ 0.61 vs. 0.62 (severe cases)**: On the independent test set of severe cases, ALADIN essentially matches dermatologist agreement. A κ of 0.61 falls in the "substantial agreement" range.
- **MAE 0.63 vs. 0.65**: ALADIN's average error is less than 1 IGA grade, comparable to the error of individual dermatologists. Even when the AI is wrong, it is wrong by less than one severity level on average.
- **MAE 0.64 vs. 0.44 (severe cases)**: On the severe-case dataset, dermatologists achieve lower error. This is expected: severe acne is easier for human experts to grade consistently, while the AI model was primarily trained on mild-to-moderate cases.
What this means for your trial
- The AI is as reliable as an expert dermatologist: it matches or exceeds their agreement with the consensus across most metrics
- Unlike a human rater, the AI is perfectly reproducible: the same image always produces the same score, with zero intra-rater variability
- Across every site, every time: no calibration drift, no fatigue, no subjective inconsistency between sites
- This reduces the noise in your endpoint data, potentially enabling smaller sample sizes to detect the same treatment effect
For Phase 3 registration trials, sponsors typically use the standard IGA (0–4) and inflammatory lesion counts as primary and co-primary endpoints, both of which the system provides with full reproducibility. The ALADIN composite (0–10) is positioned for exploratory and secondary endpoints: its higher-resolution continuous scale is particularly valuable for Phase 2 dose-finding studies, proof-of-concept trials, and internal go/no-go decisions where sensitivity to subtle treatment effects is critical. AI-computed endpoints from Legit.Health have been accepted by regulators as secondary endpoints and for adverse event detection in clinical submissions.
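The sample-size point can be made concrete with the standard two-arm formula n ≈ 2(z₁₋α/₂ + z_power)²σ²/δ². The σ and δ values below are purely illustrative and are not taken from any Legit.Health study; they only show how reducing endpoint noise shrinks the required cohort:

```python
import math
from statistics import NormalDist

def n_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-arm comparison of means.

    n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2
    sigma: endpoint standard deviation (rater noise inflates this)
    delta: treatment effect to detect, in the same units
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

# hypothetical: detecting a 0.5-grade effect; cutting endpoint noise
# from sigma = 1.0 to sigma = 0.8 shrinks the required sample size
print(n_per_arm(1.0, 0.5))  # 63 per arm
print(n_per_arm(0.8, 0.5))  # 41 per arm
```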
Lesion detection performance
The following metrics are validated per the Quality Management System (IEC 62304 / ISO 14971).
- mAP@50: 0.45 (95% CI: 0.43–0.47)
- rMAE Papule: 0.58
- rMAE Pustule: 0.28
- rMAE Comedone: 0.62
- rMAE Nodule/Cyst: 0.33
How to read these metrics
- mAP@50 = 0.45 (95% CI: 0.43–0.47): The inflammatory lesion detector achieves mean average precision of 0.45 at the standard 50% IoU threshold. The acceptance criterion (≥0.21) is based on non-inferiority to published acne detection studies, where the literature range is 0.21–0.54.
- rMAE (relative Mean Absolute Error): For per-type detection, each lesion type's counting error is compared against the inter-rater variability of expert dermatologists on the same task. The acceptance criterion is that the model's rMAE must not exceed the experts' rMAE. All lesion types pass, meaning the AI counts each lesion type at least as consistently as dermatologists disagree with each other.
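The rMAE comparison can be sketched as follows. The exact normalisation used in the QMS validation is not stated above, so this sketch assumes MAE normalised by the mean reference count; only the non-inferiority comparison itself is taken from the text:

```python
def rmae(pred_counts, ref_counts):
    """Relative MAE for per-type lesion counting.

    ASSUMPTION: MAE normalised by the mean reference count; the QMS
    validation may define the normalisation differently.
    """
    n = len(ref_counts)
    mae = sum(abs(p - r) for p, r in zip(pred_counts, ref_counts)) / n
    return mae / (sum(ref_counts) / n)

def passes_acceptance(model_rmae, expert_rmae):
    """Non-inferiority criterion: the model's counting error must not
    exceed expert inter-rater error on the same task."""
    return model_rmae <= expert_rmae

# hypothetical papule counts on two images vs. the reference counts
print(rmae([8, 12], [10, 10]))  # 0.2
```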
The lesion detector is not used in isolation; it feeds into the ALADIN/IGA formula, which was co-optimised with the detector. The formula constants (a and b in IGA = N^a · (D + b)) were calibrated to produce correct IGA scores given the detector's actual output, not given perfect counts. This means the end-to-end system (detector + formula) achieves dermatologist-level IGA agreement even though the raw detection metrics appear modest in isolation.
Additionally, the acceptance criteria are based on non-inferiority to expert inter-rater variability, a stronger validation methodology than simple accuracy benchmarks. The question is not "is the AI perfect?" but "is the AI at least as consistent as dermatologists are with each other?"
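A minimal sketch of the end-to-end idea: detector output (lesion count N and spatial density D) feeds the published form IGA = N^a · (D + b). The constants and the clamping to the 0–4 range below are placeholders, not the calibrated values from the paper:

```python
def aladin_iga(n_lesions, density, a=0.3, b=0.5):
    """IGA-aligned severity from lesion count N and spatial density D.

    Implements the published form IGA = N^a * (D + b). The constants
    a and b here are HYPOTHETICAL placeholders; the real values were
    calibrated against the dermatologist consensus using the
    detector's actual output. The clamp to 4.0 is also an assumption
    made for this sketch.
    """
    if n_lesions == 0:
        return 0.0  # clear skin maps to IGA 0
    return min(4.0, n_lesions ** a * (density + b))

# more lesions and higher density => higher severity
print(aladin_iga(10, 0.2) < aladin_iga(30, 0.4))  # True
```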
Validation datasets
| Dataset | Images | Source | Severity profile | Annotators | Role |
|---|---|---|---|---|---|
| Mild-to-moderate | 331 | Private (Clínica Dermatológica Internacional, Spain) | Predominantly mild-to-moderate facial acne (IGA 0: 14, IGA 1: 65, IGA 2: 150, IGA 3: 83, IGA 4: 19) | 3 | Primary dataset for ALADIN formula design and evaluation |
| Severe cases | 27 | Public dermatology atlases (DermAtlas, Atlas Dermatológico) | Predominantly severe acne (IGA 0: 1, IGA 1: 3, IGA 2: 10, IGA 3: 11, IGA 4: 2) | 2 | Exploratory test set for evaluating generalisation to severe cases and third-party images |
The two datasets complement each other:
- 331 images: Primary evaluation dataset with a realistic distribution of mild-to-moderate acne, annotated by 3 specialists
- 27 images: Tests generalisation to severe cases from public atlases, annotated by 2 specialists
ALADIN peer-reviewed publication
Sabater A, et al. “ALADIN: Acne Lesion And Density INdex. A Novel Tool for Automatic Acne Severity Assessment.” Skin Health and Disease (British Association of Dermatologists). 2026. (Provisionally accepted)
- Development of the IGA = N^a · (D + b) formula combining lesion count and spatial density into an IGA-aligned severity score
- Calibration of formula constants against the consensus of three board-certified dermatologists
- Validation of the CNN-based inflammatory lesion detection model (papules, pustules, nodules)
- Evidence that incorporating spatial density alongside lesion count improves alignment with clinical severity perception
The paper was presented as a poster at the AEDV 2025 conference in Paris prior to journal submission.
DIQA validation
Hernández Montilla I, Mac Carthy T, Aguilar A, Medela A. “Dermatology Image Quality Assessment (DIQA): Artificial intelligence to ensure the clinical utility of images for remote consultations and clinical trials.” J Am Acad Dermatol. 2023;88(4):927–928. doi:10.1016/j.jaad.2022.11.002
- Pearson correlation ≥0.70 with expert image quality assessment
- Real-time evaluation of focus, lighting, framing, and resolution
- Applicability to both clinical practice and clinical trial settings
DIQA is the image quality assessment algorithm that acts as a quality gate in the clinical trial workflow. It is critical for maintaining consistent image quality across investigator sites in multi-center trials.
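The quality-gate role can be illustrated with a minimal sketch. The score scale, threshold, and result structure below are hypothetical; DIQA's actual output format is not specified on this page:

```python
from dataclasses import dataclass, field

@dataclass
class QualityResult:
    score: float                      # HYPOTHETICAL 0-1 quality score
    issues: list = field(default_factory=list)  # e.g. ["blur", "framing"]

def quality_gate(result, threshold=0.7):
    """Accept the image for scoring, or ask the site to retake it.

    The threshold is a HYPOTHETICAL placeholder; the real DIQA model
    evaluates focus, lighting, framing, and resolution in real time.
    """
    if result.score >= threshold:
        return "accept"
    return "retake: " + ", ".join(result.issues)

print(quality_gate(QualityResult(0.9)))                      # accept
print(quality_gate(QualityResult(0.4, ["blur", "framing"])))  # retake: blur, framing
```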
Related validation studies
The same AI architecture and methodology used for acne scoring has been validated across multiple dermatological conditions:
Legit.Health. “Automatic International Hidradenitis Suppurativa Severity Score System (AIHS4): A Novel Tool to Assess the Severity of Hidradenitis Suppurativa Using Artificial Intelligence.” 2025. (Published)
- Inter-observer ICC ≥ 0.727 (95% CI: 0.66–0.79) for objective severity assessment
- State-of-the-art comparison: ICC of 0.47 without the device vs. 0.727 with the device
- Same object detection + scoring methodology as acne
Validation has also been completed for APASI (psoriasis), ASCORAD (atopic dermatitis), and multiple MRMC studies (BI_2024, SAN_2024).
Regulatory-grade validation pathway
The clinical evidence follows a structured regulatory pathway:
| Standard | Scope | Application to acne scoring |
|---|---|---|
| IEC 62304 | Software lifecycle processes | The AI scoring pipeline follows a documented development lifecycle with risk-based classification |
| ISO 14971 | Risk management | Systematic risk analysis including failure modes (missed lesions, false positives, image quality) |
| IEC 62366-1 | Usability engineering | The mobile capture application has been validated for usability at investigator sites |
| MEDDEV 2.7/1 Rev 4 | Clinical evaluation | Clinical evidence compiled following the structured methodology for clinical evaluation reports |
| MDR Annex XIV | Clinical evaluation and PMCF | Post-market clinical follow-up ensures ongoing validation as the technology evolves |
Ongoing clinical validation program
| Study | Condition | Endpoints | Status |
|---|---|---|---|
| ALADIN observational study | Acne vulgaris | Lesion count, density, IGA | Completed; paper provisionally accepted |
| ALADIN clinical investigation (DermoMedic) | Acne vulgaris | Prospective validation | In preparation (2026) |
| DIQA validation | All conditions | Image quality assessment | Published (JAAD 2023) |
| AIHS4 validation | Hidradenitis suppurativa | IHS4 severity scoring | Published |
| MRMC study (BI_2024) | Multiple conditions | Diagnostic accuracy, severity | Completed |
| MRMC study (SAN_2024) | Multiple conditions | Diagnostic accuracy, severity | Completed |
| APASI validation | Psoriasis | PASI severity scoring | Published |
| ASCORAD validation | Atopic dermatitis | SCORAD severity scoring | Published |
For the full list of clinical evidence, see the clinical validation section.