Clinical Evidence and Validation

This page explains how the psoriasis severity scoring technology is validated, what the performance metrics mean, and why the results demonstrate that each AI component model performs at the level of expert dermatologists.

How the ground truth is established

For intensity models: mathematical consensus of 2–3 independent expert dermatologists scoring each component (erythema, desquamation, induration) on ordinal scales. For surface models: expert pixel-level segmentation masks.
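
The exact consensus rule is not spelled out here beyond "mathematical consensus". As an illustrative sketch only, assuming a simple mean of the independent expert scores (the function name and scores are hypothetical):

```python
import statistics

def consensus_score(expert_scores: list[float]) -> float:
    """Combine the scores of 2-3 independent experts into one
    ground-truth value. The mean is used here as an illustrative
    consensus rule; the production rule may differ."""
    if not 2 <= len(expert_scores) <= 3:
        raise ValueError("expected scores from 2-3 experts")
    return statistics.mean(expert_scores)

# Three dermatologists score erythema for one image on an ordinal scale
print(consensus_score([6, 7, 6]))  # -> 6.333...
```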

Why this approach: PASI component scoring is inherently subjective. The acceptance criterion is non-inferiority to expert inter-rater variability: the AI's error must not exceed the disagreement between dermatologists.

Matching expert inter-rater variability is effectively a perfect score. Dermatologists themselves disagree on erythema by RMAE ~0.14, on desquamation and induration by ~0.17. Any AI system achieving RMAE at or below these values is performing at expert level; there is no meaningful way to do better on a subjective clinical assessment.

Non-inferiority to expert variability

For PASI components, the acceptance criterion is based on non-inferiority to expert inter-rater variability:

  • Expert dermatologists disagree on erythema intensity with RMAE ~0.14
  • Expert dermatologists disagree on desquamation and induration with RMAE ~0.17
  • If the AI's error is at or below these values, it is performing at expert level

Matching inter-rater variability = 100% of achievable performance

For each PASI component, the expert inter-rater variability is the realistic ceiling. Dermatologists cannot agree with each other any better than this on a subjective clinical assessment. The AI matching this ceiling means it is as reliable as adding another dermatologist to the panel.
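
The acceptance criterion amounts to a per-component threshold comparison. A minimal sketch, using the expert thresholds stated on this page (the function name is illustrative, not part of the production system):

```python
# Expert inter-rater RMAE per component, as stated on this page;
# each value is the non-inferiority threshold for the matching AI model.
EXPERT_RMAE = {"erythema": 0.14, "desquamation": 0.17, "induration": 0.17}

def passes_non_inferiority(ai_rmae: dict[str, float]) -> dict[str, bool]:
    """A component passes when the AI error does not exceed the
    expert inter-rater variability."""
    return {c: ai_rmae[c] <= EXPERT_RMAE[c] for c in EXPERT_RMAE}

print(passes_non_inferiority(
    {"erythema": 0.13, "desquamation": 0.153, "induration": 0.151}))
# -> {'erythema': True, 'desquamation': True, 'induration': True}
```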

Why there is no "perfect" score

In PASI scoring, there is no objective measurement; severity is a clinical judgement. Unlike measuring blood pressure (where 120/80 is the same on every device), PASI component scores depend on a dermatologist's subjective assessment of erythema intensity, scaling severity, and plaque thickness.

This means:

  1. The ground truth is the consensus of multiple expert dermatologists, not an absolute truth
  2. Individual dermatologists themselves disagree with the consensus, and their RMAE against it quantifies that disagreement
  3. These dermatologist-level numbers are the realistic ceiling: no scoring system, human or AI, can reliably exceed the agreement that experts have with each other
  4. A system that matches dermatologist performance is clinically excellent: it means the AI is as reliable as adding another expert to the panel

The key insight for sponsors

Matching the consensus is effectively a perfect score. The consensus IS the best available approximation of truth. When the AI achieves an RMAE of 0.13 for erythema and the average dermatologist achieves 0.14 on the same task, the AI is not "merely close"; it is outperforming the typical inter-rater agreement of trained dermatologists. The correct frame of reference is not "how close to zero?" but "how close to what expert dermatologists achieve?" By that measure, the AI is at or above 100% of achievable performance.

Understanding the metrics

RMAE (Relative Mean Absolute Error)

The model’s average error as a proportion of the scale range, compared against expert inter-rater variability. For a 0–9 intensity scale, RMAE of 0.13 means the average error is 13% of the scale range (~1.2 points).

Perfect score: 0.0 would mean zero error. Expert dermatologists themselves have RMAE of 0.14–0.17 against their own consensus — this is the realistic ceiling.

Why it matters: PASI accuracy depends on accurate component scoring. If each component (erythema, desquamation, induration) is scored within the range of expert variability, the composite PASI score will be clinically reliable.
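
For illustration, RMAE as described above (mean absolute error divided by the scale range) can be computed as follows; the function and sample scores are assumptions for the sketch, not the production implementation:

```python
def rmae(predicted, reference, scale_range: float) -> float:
    """Relative Mean Absolute Error: the mean absolute difference
    between predicted and reference scores, divided by the width
    of the scoring scale."""
    errors = [abs(p - r) for p, r in zip(predicted, reference)]
    return sum(errors) / len(errors) / scale_range

# Hypothetical scores on a 0-9 intensity scale: an average error of
# 0.75 points corresponds to RMAE 0.75 / 9 ~ 0.083
print(round(rmae([5, 7, 2, 6], [6, 6, 3, 6], scale_range=9), 3))  # -> 0.083
```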

[Gauge: scale Insufficient / Acceptable / At expert level / Excellent; ALADIN: 0.13, dermatologists: 0.14]

Intersection over Union (IoU)

The overlap between the AI’s segmentation of psoriatic area and the expert’s segmentation. IoU of 0.61 means 61% of the AI’s identified area matches the expert’s annotation.

Perfect score: 1.0 would mean perfect pixel-level agreement. For skin lesion segmentation, IoU above 0.5 is considered good, and above 0.7 is excellent.

Why it matters: Affected area is a critical PASI component. Accurate segmentation ensures the BSA percentage is reliable, which directly impacts the total PASI score.
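
IoU itself is a simple set computation over segmented pixels. A minimal sketch (the pixel coordinates and function name are illustrative):

```python
def iou(mask_a: set, mask_b: set) -> float:
    """Intersection over Union between two segmentation masks,
    each given as a set of (row, col) pixel coordinates."""
    inter = len(mask_a & mask_b)
    union = len(mask_a | mask_b)
    return inter / union if union else 1.0  # two empty masks agree trivially

ai_mask     = {(0, 0), (0, 1), (1, 0), (1, 1)}  # AI-segmented pixels
expert_mask = {(0, 1), (1, 0), (1, 1), (2, 1)}  # expert annotation
print(iou(ai_mask, expert_mask))  # -> 0.6 (3 shared pixels / 5 in the union)
```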

[Gauge: scale Poor / Moderate / Good / Excellent; ALADIN: 0.61, dermatologists: 0.61]

Production model performance

The following results are from the production models validated per the Quality Management System (IEC 62304 / ISO 14971). Each model is tested against expert inter-rater variability as the acceptance criterion.

PASI component model performance

Production validation results for each component model:

Component                Metric   ALADIN   Dermatologists
Erythema intensity       RMAE     0.13     0.14
Desquamation intensity   RMAE     0.153    0.17
Induration intensity     RMAE     0.151    0.17
Erythema surface         IoU      0.61     0.61

At a glance: AI vs Expert across all components

[Chart: AI score vs. expert threshold per component; erythema 0.13, desquamation 0.153, induration 0.151, BSA (IoU) 0.61]

Summary

Component                AI RMAE/IoU   Expert RMAE/IoU   Verdict
Erythema intensity       0.13          0.14              AI is better than expert inter-rater agreement
Desquamation intensity   0.153         0.17              AI is within expert inter-rater range
Induration intensity     0.151         0.17              AI is within expert inter-rater range
Erythema surface         IoU 0.61      IoU 0.61          AI matches expert segmentation

All four components pass the acceptance criteria. The composite APASI score inherits this accuracy: when each component is scored at expert level, the aggregate PASI score is reliable.

How to read this table

  • Erythema RMAE 0.13 vs. 0.14: The AI's erythema scoring error is lower than the disagreement between expert dermatologists: on redness assessment, the AI agrees with the consensus more closely than dermatologists agree with each other.

  • Desquamation RMAE 0.153 vs. 0.17: The AI's scaling assessment error is well within the range of expert disagreement. The acceptance criterion is met with margin.

  • Induration RMAE 0.151 vs. 0.17: Same pattern as desquamation; the AI scores plaque thickness within expert-level variability.

  • Erythema surface IoU 0.61 vs. 0.61: The AI's pixel-level segmentation of affected areas agrees with expert masks to the same degree that experts agree with each other. This replaces the most subjective component of manual PASI (BSA estimation) with an objective, reproducible measurement.

What this means for your trial

  1. Every PASI component is scored at expert level: no single component is a weak link in the composite score
  2. BSA estimation is objectified: the largest source of manual PASI inter-rater variability (affected area estimation) is replaced by pixel-level segmentation
  3. The AI is perfectly reproducible: unlike a human rater, the same image always produces the same component scores, with zero intra-rater variability
  4. Across every site, every time: no calibration drift, no fatigue, no subjective inconsistency between investigators
  5. This reduces noise in PASI change-from-baseline, potentially enabling smaller sample sizes or greater statistical power to detect PASI 75/90/100 response rates
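
To illustrate point 5, a standard two-sample normal-approximation sample-size formula shows how lower endpoint noise translates into smaller arms. All numbers below are hypothetical for the sketch, not taken from any trial programme:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(delta: float, sd: float, alpha: float = 0.05,
              power: float = 0.8) -> int:
    """Per-arm sample size for comparing mean change-from-baseline
    between two groups (two-sided test, normal approximation):
    n = 2 * (z_{1-alpha/2} + z_{power})^2 * sd^2 / delta^2."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * (z_a + z_b) ** 2 * sd ** 2 / delta ** 2)

# Hypothetical effect of removing rater noise: sd shrinks from 6.0 to 5.0
print(n_per_arm(delta=3.0, sd=6.0))  # -> 63 per arm with rater noise
print(n_per_arm(delta=3.0, sd=5.0))  # -> 44 per arm without
```

The point of the sketch: sample size scales with the square of the endpoint's standard deviation, so even a modest reduction in measurement noise compounds into a meaningfully smaller trial.
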

Regulatory positioning

AI-computed APASI endpoints from Legit.Health have been accepted by regulators as secondary endpoints and for adverse event detection in clinical submissions. For primary registration endpoints, the system provides standard PASI scores aligned with the established 0–72 scale. The APASI scoring is validated in a real-world Phase 3 programme for moderate-to-severe plaque psoriasis across 130+ sites in 12 countries.

APASI peer-reviewed publication

Mac Carthy T, Dagnino D, Medela A, Fernández G, Aguilar A, Martorell A, Gómez-Tejerina P, Roustán-Gullón G. Artificial Intelligence-Based Quantification to Assess the Automatic Psoriasis Area and Severity Index. JEADV Clinical Practice. 2025. doi:10.1002/jvc2.70143

JEADV Clin Pract. 2025

  • APASI provides a robust AI-driven framework for psoriasis severity assessment
  • Delivers rapid, objective, and precise evaluations of all PASI components
  • Integration into clinical and research workflows enhances disease monitoring
  • Reduces evaluation costs compared to manual multi-reader assessments

DIQA validation

Hernández Montilla I, Mac Carthy T, Aguilar A, Medela A. Dermatology Image Quality Assessment (DIQA): Artificial intelligence to ensure the clinical utility of images for remote consultations and clinical trials. Journal of the American Academy of Dermatology. 2023. doi:10.1016/j.jaad.2022.11.002

J Am Acad Dermatol. 2023;88(4):927-928

  • Pearson correlation ≥0.70 with expert image quality assessment
  • Real-time evaluation of focus, lighting, framing, and resolution
  • Critical for standardising image quality in multi-centre psoriasis trials

Regulatory-grade validation pathway

The clinical evidence follows a structured regulatory pathway:

Standard             Scope                          Application to PASI scoring
IEC 62304            Software lifecycle processes   Each component model follows a documented development lifecycle
ISO 14971            Risk management                Systematic risk analysis for each scoring component
IEC 62366-1          Usability engineering          Validated for both in-clinic and decentralised patient capture
MEDDEV 2.7/1 Rev 4   Clinical evaluation            Clinical evidence compiled per structured methodology
MDR Annex XIV        Clinical evaluation and PMCF   Post-market clinical follow-up

Ongoing clinical validation program

Study                        Condition                             Endpoints                                                   Status
APASI validation study       Plaque psoriasis                      PASI components (erythema, desquamation, induration, BSA)   Published (JEADV Clin Pract 2025)
JNJ-77242113 Phase 3 trial   Moderate-to-severe plaque psoriasis   APASI scoring via decentralised imaging                     Ongoing (real-world deployment)
DIQA validation              All conditions                        Image quality assessment                                    Published (JAAD 2023)

For the full list of clinical evidence, see the clinical validation section.