
Clinical Evidence and Validation

This page explains how the acne severity scoring technology is validated, what the performance metrics mean, and why the results demonstrate that the AI is as reliable as an expert dermatologist.

How the ground truth is established

Before looking at any performance numbers, it is critical to understand what the numbers are compared against. This is where most misunderstandings occur.

How the ground truth was established

The ground truth is the mathematical consensus of two to three independent expert dermatologists per image. Each dermatologist scored every image independently on the IGA scale, and the consensus (ground truth) was computed as the mathematical aggregate of those scores.
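
The exact aggregation rule is not spelled out on this page; as a minimal sketch, assuming a rounded-median consensus (the study may use a different aggregate):

```python
from statistics import median

def consensus_iga(scores):
    """Aggregate independent IGA grades (0-4) into one consensus grade.
    A rounded median is one plausible rule; the aggregate actually used
    in the study may differ."""
    return round(median(scores))

# Three dermatologists grade the same image independently:
print(consensus_iga([2, 3, 3]))  # -> 3
print(consensus_iga([1, 2, 2]))  # -> 2
```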

Why this approach: In acne grading, there is no objective "true" severity; severity is a clinical judgement. The best approximation of truth is the consensus of multiple experts. This is the same approach used by the FDA and EMA for establishing reference standards in dermatology clinical trials.

Matching the consensus IS the ceiling: it represents effectively 100% of the achievable performance. Individual dermatologists (Pearson ~0.56–0.65, Cohen's κ ~0.46–0.62) do not achieve perfect agreement with their own consensus. Any system that matches these values is performing at the level of an expert dermatologist. A system that exceeds these values (as ALADIN does on Cohen's κ: 0.53 vs. 0.46) is outperforming the average individual expert.

Why there is no "perfect" score

In acne severity grading, there is no objective measurement; severity is a clinical judgement. Unlike measuring blood pressure (where 120/80 is the same on every device), acne severity depends on a dermatologist's interpretation of vague descriptors like "some" or "many" inflammatory lesions.

This means:

  1. The ground truth is the consensus of multiple expert dermatologists, not an absolute truth
  2. Individual dermatologists themselves disagree with the consensus; their own Pearson correlation is ~0.56–0.65, their Cohen's κ is ~0.46–0.62
  3. These dermatologist-level numbers are the realistic ceiling: no scoring system, human or AI, can reliably exceed the agreement that experts have with each other
  4. A system that matches dermatologist performance is clinically excellent: it means the AI is as reliable as adding another expert to the panel

The key insight for sponsors

Matching the consensus is effectively a perfect score. The consensus IS the best available approximation of truth. When ALADIN achieves a Cohen's κ of 0.53 and the average dermatologist achieves 0.46 on the same dataset against the same consensus, ALADIN is not "merely moderate"; it is outperforming the typical inter-rater agreement of trained dermatologists. The correct frame of reference is not "how close to 1.0?" but "how close to what expert dermatologists achieve?" By that measure, ALADIN is at or above 100%.

Understanding the metrics

Four metrics are used to evaluate the agreement between ALADIN scores and the expert consensus. Each captures a different aspect of agreement:

Pearson correlation

The strength and direction of the linear relationship between two sets of scores. In this context: how closely ALADIN's scores track the expert consensus scores.

Perfect score: 1.0 would mean perfect linear correlation with the consensus. However, individual dermatologists themselves only achieve ~0.56–0.65 against the consensus, because acne grading is inherently subjective.

Why it matters: In a clinical trial, you need the scoring system to track the same severity trend as expert dermatologists. High Pearson correlation means ALADIN and dermatologists rank patients in the same order.

Scale: weak / moderate / strong / very strong
ALADIN: 0.58 · Dermatologists: 0.56

Spearman correlation

The strength of the monotonic (rank-order) relationship between two sets of scores. Less sensitive to outliers than Pearson. Measures whether ALADIN ranks patients in the same severity order as the consensus.

Perfect score: 1.0 would mean perfect rank agreement. Individual dermatologists achieve ~0.58 against the consensus.

Why it matters: For clinical trials, rank-order agreement is critical: you need to reliably distinguish ‘improving’ from ‘worsening’ patients. Spearman tells you whether the AI preserves this ordering.

Scale: weak / moderate / strong / very strong
ALADIN: 0.56 · Dermatologists: 0.58
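
Both correlations are simple enough to compute from scratch; a minimal sketch in pure Python (the toy scores below are illustrative, not study data):

```python
def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def _ranks(x):
    """Average ranks (midrank method for ties), 1-based."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(_ranks(x), _ranks(y))

# Toy IGA grades for eight images (NOT study data):
ai        = [0, 1, 2, 2, 3, 4, 1, 2]
consensus = [0, 1, 2, 3, 3, 4, 2, 2]
print(f"Pearson:  {pearson(ai, consensus):.2f}")
print(f"Spearman: {spearman(ai, consensus):.2f}")
```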

Cohen’s kappa (κ)

Agreement between two raters beyond what would be expected by chance. Uses quadratic weights, meaning a 2-grade disagreement is penalised more than a 1-grade disagreement. This is the primary measure of clinical agreement.

Perfect score: 1.0 would mean perfect agreement. In acne grading, individual dermatologists achieve κ = 0.46–0.62 against the consensus — this is the realistic ceiling set by the subjectivity of the task.

Why it matters: Cohen’s κ is the gold standard for measuring inter-rater reliability in clinical research. Regulatory bodies and journal reviewers expect this metric. It tells you: ‘Is this scoring system as reliable as a human expert?’

Scale: fair / moderate / substantial / almost perfect
ALADIN: 0.53 · Dermatologists: 0.46
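
The quadratic weighting can be made concrete in a few lines; a pure-Python sketch with illustrative toy grades:

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_grades=5):
    """Cohen's kappa with quadratic weights for ordinal grades 0..n_grades-1:
    a 2-grade disagreement costs 4x as much as a 1-grade one."""
    n = len(rater_a)
    # Observed (co-)grading frequencies.
    observed = [[0.0] * n_grades for _ in range(n_grades)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1 / n
    # Chance agreement from each rater's marginal grade frequencies.
    pa = [rater_a.count(g) / n for g in range(n_grades)]
    pb = [rater_b.count(g) / n for g in range(n_grades)]
    num = den = 0.0
    for i in range(n_grades):
        for j in range(n_grades):
            w = (i - j) ** 2 / (n_grades - 1) ** 2  # quadratic penalty
            num += w * observed[i][j]
            den += w * pa[i] * pb[j]
    return 1 - num / den

# Toy IGA grades for ten images (NOT study data):
ai        = [0, 1, 2, 2, 3, 4, 1, 2, 3, 0]
consensus = [0, 1, 2, 3, 3, 4, 2, 2, 3, 1]
print(f"{quadratic_weighted_kappa(ai, consensus):.2f}")
```

scikit-learn's `cohen_kappa_score(ai, consensus, weights='quadratic')` computes the same quantity.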

Mean absolute error

The average magnitude of scoring errors in IGA points. An MAE of 0.63 means that, on average, ALADIN’s score differs from the consensus by less than 1 IGA grade.

Perfect score: 0.0 would mean zero error. Individual dermatologists achieve MAE of 0.44–0.65 against the consensus — even experts don’t agree perfectly.

Why it matters: For a 5-point scale (IGA 0–4), an MAE under 1.0 means the AI is never more than one grade off on average. This is critical for clinical trials where you need to detect 1–2 grade changes as treatment effects.

Scale: excellent / good / acceptable / poor
ALADIN: 0.63 · Dermatologists: 0.65

Severity scoring performance: ALADIN vs. dermatologists

The following results are from the peer-reviewed validation study (Sabater et al., Skin Health and Disease, 2026). They evaluate the ALADIN scoring formula, the mathematical model that converts lesion count and density into an IGA-aligned severity score. Both ALADIN and individual dermatologists are measured against the same ground truth (dermatologist consensus). Values in bold with a check mark indicate where ALADIN meets or exceeds the dermatologist benchmark.

IGA agreement

| Metric | ALADIN (mild-to-moderate) | Dermatologists (mild-to-moderate) | ALADIN (severe cases) | Dermatologists (severe cases) |
|---|---|---|---|---|
| Pearson | **0.58** ✓ | 0.56 | **0.65** ✓ | 0.65 |
| Spearman | 0.56 | 0.58 | 0.54 | 0.58 |
| MAE | **0.63 (0.49)** ✓ | 0.65 (0.63) | 0.64 (0.71) | 0.44 (0.5) |
| Cohen’s κ | **0.53** ✓ | 0.46 | 0.61 | 0.62 |

How to read this table

  • Pearson 0.58 vs. 0.56: ALADIN's linear correlation with the consensus slightly exceeds the average individual dermatologist's. This means ALADIN tracks severity trends at least as accurately as a single expert.

  • Cohen's κ 0.53 vs. 0.46: This is the most important result. ALADIN has better categorical agreement with the consensus than individual dermatologists do. On a scale where 0.41–0.60 is "moderate" (the typical range for IGA inter-rater agreement), ALADIN sits at the top of that range while the dermatologists sit at the bottom.

  • Cohen's κ 0.61 vs. 0.62 (severe cases): On the independent test set of severe cases, ALADIN essentially matches dermatologist agreement. A κ of 0.61 falls in the "substantial agreement" range.

  • MAE 0.63 vs. 0.65: ALADIN's average error is less than 1 IGA grade, comparable to the error of individual dermatologists. Even when the AI is wrong, it is wrong by less than one severity level on average.

  • MAE 0.64 vs. 0.44 (severe cases): On the severe-case dataset, dermatologists achieve lower error. This is expected: severe acne is easier for human experts to grade consistently, while the AI model was primarily trained on mild-to-moderate cases.

What this means for your trial

  1. The AI is as reliable as an expert dermatologist: it matches or exceeds their agreement with the consensus across most metrics
  2. But the AI is perfectly reproducible: unlike a human rater, the same image always produces the same score, with zero intra-rater variability
  3. Across every site, every time: no calibration drift, no fatigue, no subjective inconsistency between sites
  4. This reduces the noise in your endpoint data, potentially enabling smaller sample sizes to detect the same treatment effect
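
The sample-size point can be illustrated with the standard two-sample formula. The standard deviations below are hypothetical, chosen only to show the direction of the effect, not derived from the validation data:

```python
import math
from statistics import NormalDist

def n_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for detecting a mean difference
    delta on a continuous endpoint with SD sigma:
    n = 2 * (z_{1-alpha/2} + z_{power})^2 * (sigma / delta)^2."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 * (sigma / delta) ** 2)

# Detecting a 0.5-grade treatment effect. The endpoint SD mixes true
# biological variability with rater noise, so removing intra-rater
# variability lowers sigma (both SDs are purely illustrative).
print(n_per_arm(sigma=1.00, delta=0.5))  # noisier human scoring -> 63
print(n_per_arm(sigma=0.85, delta=0.5))  # same trial, less rater noise -> 46
```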

Regulatory positioning of ALADIN vs. IGA

For Phase 3 registration trials, sponsors typically use the standard IGA (0–4) and inflammatory lesion counts as primary and co-primary endpoints, both of which the system provides with full reproducibility. The ALADIN composite (0–10) is positioned for exploratory and secondary endpoints: its higher-resolution continuous scale is particularly valuable for Phase 2 dose-finding studies, proof-of-concept trials, and internal go/no-go decisions where sensitivity to subtle treatment effects is critical. AI-computed endpoints from Legit.Health have been accepted by regulators as secondary endpoints and for adverse event detection in clinical submissions.

Lesion detection performance

The following metrics are validated per the Quality Management System (IEC 62304 / ISO 14971).

  • mAP@50: 0.45 (95% CI: 0.43–0.47)
  • rMAE Papule (95% CI): 0.58
  • rMAE Pustule (95% CI): 0.28
  • rMAE Comedone (95% CI): 0.62
  • rMAE Nodule/Cyst (95% CI): 0.33

How to read these metrics

  • mAP@50 = 0.45 (95% CI: 0.43–0.47): The inflammatory lesion detector achieves mean average precision of 0.45 at the standard 50% IoU threshold. The acceptance criterion (≥0.21) is based on non-inferiority to published acne detection studies, where the literature range is 0.21–0.54.
  • rMAE (relative Mean Absolute Error): For per-type detection, each lesion type's counting error is compared against the inter-rater variability of expert dermatologists on the same task. The acceptance criterion is that the model's rMAE must not exceed the experts' rMAE. All lesion types pass, meaning the AI counts each lesion type at least as consistently as dermatologists disagree with each other.
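
The acceptance rule reduces to a non-inferiority comparison of counting errors. A sketch under the assumption that rMAE is a mean absolute error computed against the consensus counts (the exact normalisation in the QMS documentation may differ, and all counts below are hypothetical):

```python
def mae(pred, ref):
    """Mean absolute counting error across images."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)

def passes_criterion(model_counts, expert_counts, consensus_counts):
    """Non-inferiority: the model's error for a lesion type must not
    exceed the experts' own error against the same consensus."""
    return mae(model_counts, consensus_counts) <= mae(expert_counts, consensus_counts)

# Hypothetical papule counts on five images:
consensus = [12, 4, 7, 20, 9]
model     = [11, 5, 7, 18, 10]
expert    = [10, 5, 8, 23, 9]
print(passes_criterion(model, expert, consensus))  # -> True
```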

Why detection precision matters less than you think

The lesion detector is not used in isolation; it feeds into the ALADIN/IGA formula, which was co-optimised with the detector. The formula constants (a, b) were calibrated to produce correct IGA scores given the detector's actual output, not given perfect counts. This means the end-to-end system (detector + formula) achieves dermatologist-level IGA agreement even though the raw detection metrics appear modest in isolation.

Additionally, the acceptance criteria are based on non-inferiority to expert inter-rater variability, a stronger validation methodology than simple accuracy benchmarks. The question is not "is the AI perfect?" but "is the AI at least as consistent as dermatologists are with each other?"

Validation datasets

| Dataset | Images | Source | Severity profile | Annotators | Role |
|---|---|---|---|---|---|
| Mild-to-moderate | 331 | Private (Clínica Dermatológica Internacional, Spain) | Predominantly mild-to-moderate facial acne (IGA 0: 14, IGA 1: 65, IGA 2: 150, IGA 3: 83, IGA 4: 19) | 3 | Primary dataset for ALADIN formula design and evaluation |
| Severe cases | 27 | Public dermatology atlases (DermAtlas, Atlas Dermatológico) | Predominantly severe acne (IGA 0: 1, IGA 1: 3, IGA 2: 10, IGA 3: 11, IGA 4: 2) | 2 | Exploratory test set for evaluating generalisation to severe cases and third-party images |

The two datasets complement each other:

  • 331 images: Primary evaluation dataset with a realistic distribution of mild-to-moderate acne, annotated by 3 specialists
  • 27 images: Tests generalisation to severe cases from public atlases, annotated by 2 specialists

ALADIN peer-reviewed publication

Sabater A, et al. ALADIN: Acne Lesion And Density INdex. A Novel Tool for Automatic Acne Severity Assessment. Skin Health and Disease (British Association of Dermatologists). 2026. (Provisionally accepted)

  • Development of the IGA = N^a · (D + b) formula combining lesion count and spatial density into an IGA-aligned severity score
  • Calibration of formula constants against the consensus of three board-certified dermatologists
  • Validation of the CNN-based inflammatory lesion detection model (papules, pustules, nodules)
  • Evidence that incorporating spatial density alongside lesion count improves alignment with clinical severity perception
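
In code, the published formula is a one-liner. The constants below are placeholders, since the calibrated values of a and b come from the consensus fitting described in the paper:

```python
def aladin_iga(n_lesions, density, a=0.35, b=0.9):
    """Severity from lesion count N and spatial density D via
    IGA = N^a * (D + b), clipped to the IGA 0-4 range.
    The defaults for a and b are ILLUSTRATIVE, not the calibrated values."""
    return max(0.0, min(4.0, n_lesions ** a * (density + b)))

# More lesions or higher density never lowers the score:
print(aladin_iga(5, 0.2) <= aladin_iga(30, 0.2))   # -> True
print(aladin_iga(30, 0.1) <= aladin_iga(30, 0.6))  # -> True
```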

The paper was presented as a poster at the AEDV 2025 conference in Paris prior to journal submission.

DIQA validation

Hernández Montilla I, Mac Carthy T, Aguilar A, Medela A. Dermatology Image Quality Assessment (DIQA): Artificial intelligence to ensure the clinical utility of images for remote consultations and clinical trials. J Am Acad Dermatol. 2023;88(4):927-928. doi:10.1016/j.jaad.2022.11.002

  • Pearson correlation ≥0.70 with expert image quality assessment
  • Real-time evaluation of focus, lighting, framing, and resolution
  • Applicability to both clinical practice and clinical trial settings

DIQA is the image quality assessment algorithm that acts as a quality gate in the clinical trial workflow. It is critical for maintaining consistent image quality across investigator sites in multi-center trials.

The same AI architecture and methodology used for acne scoring has been validated across multiple dermatological conditions:

Legit.Health. Automatic International Hidradenitis Suppurativa Severity Score System (AIHS4): A Novel Tool to Assess the Severity of Hidradenitis Suppurativa Using Artificial Intelligence. 2025.

  • Inter-observer ICC ≥ 0.727 (95% CI: 0.66–0.79) for objective severity assessment
  • State-of-the-art comparison: ICC of 0.47 without the device vs. 0.727 with the device
  • Same object detection + scoring methodology as acne

Validation has also been completed for APASI (psoriasis), ASCORAD (atopic dermatitis), and multiple MRMC studies (BI_2024, SAN_2024).

Regulatory-grade validation pathway

The clinical evidence follows a structured regulatory pathway:

| Standard | Scope | Application to acne scoring |
|---|---|---|
| IEC 62304 | Software lifecycle processes | The AI scoring pipeline follows a documented development lifecycle with risk-based classification |
| ISO 14971 | Risk management | Systematic risk analysis including failure modes (missed lesions, false positives, image quality) |
| IEC 62366-1 | Usability engineering | The mobile capture application has been validated for usability at investigator sites |
| MEDDEV 2.7/1 Rev 4 | Clinical evaluation | Clinical evidence compiled following the structured methodology for clinical evaluation reports |
| MDR Annex XIV | Clinical evaluation and PMCF | Post-market clinical follow-up ensures ongoing validation as the technology evolves |

Ongoing clinical validation program

| Study | Condition | Endpoints | Status |
|---|---|---|---|
| ALADIN observational study | Acne vulgaris | Lesion count, density, IGA | Completed; paper provisionally accepted |
| ALADIN clinical investigation (DermoMedic) | Acne vulgaris | Prospective validation | In preparation (2026) |
| DIQA validation | All conditions | Image quality assessment | Published (JAAD 2023) |
| AIHS4 validation | Hidradenitis suppurativa | IHS4 severity scoring | Published |
| MRMC study (BI_2024) | Multiple conditions | Diagnostic accuracy, severity | Completed |
| MRMC study (SAN_2024) | Multiple conditions | Diagnostic accuracy, severity | Completed |
| APASI validation | Psoriasis | PASI severity scoring | Published |
| ASCORAD validation | Atopic dermatitis | SCORAD severity scoring | Published |

For the full list of clinical evidence, see the clinical validation section.