Clinical Evidence and Validation

This page explains how the alopecia severity scoring technology is validated, what the performance metrics mean, and how the evidence pathway supports regulatory-grade use.

How the ground truth is established

Before looking at any performance numbers, it is critical to understand what the numbers are compared against.

The ground truth comes from expert SALT assessments performed independently by 2–3 dermatologists specialising in hair disorders. Each dermatologist evaluates the same set of scalp photographs and assigns a SALT score per quadrant and globally. The consensus (ground truth) is the arithmetic mean of their scores.
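The averaging step can be sketched as follows. This is a minimal illustration, not the study's actual tooling: the dictionary layout, the `global` key, and the function name `consensus` are assumptions made for the example.

```python
from statistics import mean

# Quadrant names follow the four scalp views used elsewhere on this page.
# The "global" key and this dictionary layout are hypothetical.
KEYS = ("left", "right", "top", "back", "global")


def consensus(assessments):
    """Average each quadrant score and the global score across raters."""
    if len(assessments) < 2:
        raise ValueError("a consensus needs at least two independent raters")
    return {key: mean(a[key] for a in assessments) for key in KEYS}


rater_a = {"left": 40.0, "right": 30.0, "top": 20.0, "back": 10.0, "global": 23.0}
rater_b = {"left": 50.0, "right": 30.0, "top": 30.0, "back": 10.0, "global": 28.8}
print(consensus([rater_a, rater_b]))
```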

Why this approach: SALT scoring, while more objective than many dermatological scales due to its percentage-based approach, still involves subjective estimation of hair loss extent. The best approximation of truth is the consensus of multiple experts.

SALT is more objective than scales like IGA because it measures percentage of hair loss rather than qualitative descriptors. However, visual estimation of percentage from photographs introduces variability. Expert inter-rater ICC for SALT is typically 0.80–0.90, which sets the achievable ceiling for any automated system.

Why SALT is more objective than other dermatological scales

Unlike scales such as IGA (which rely on qualitative descriptors like "mild" or "moderate"), SALT measures a quantitative outcome: the percentage of scalp affected by hair loss. This makes SALT inherently more objective and less susceptible to inter-rater variability than most dermatological severity scales.

However, variability still exists because dermatologists must visually estimate the percentage of hair loss from photographs or direct inspection. Studies report inter-rater ICC values of 0.80–0.90 for SALT, which is good but not perfect. Automated scoring removes this rater-to-rater variability, because the same image always produces the same score.

The advantage of automation for SALT

Unlike subjective scales where the AI must replicate clinical judgement, SALT scoring is fundamentally a measurement task: what percentage of this area lacks hair? This is precisely the kind of task where pixel-level image analysis excels. The AI computes exact percentages from segmentation rather than estimating them visually, providing higher granularity (continuous percentages vs. 5–10% increments) and perfect reproducibility.

AI model performance

The AI models that power automated SALT scoring are validated per the Quality Management System (IEC 62304 / ISO 14971). Because SALT is fundamentally a surface area measurement, the AI validation focuses on the segmentation model that quantifies hair loss — this is where accuracy matters. Computing the SALT score from segmentation percentages is a deterministic formula (weighted sum of quadrant percentages), not an AI step.

RMAE: 0.0708 (95% CI: 0.0563–0.0893)
mAP@50: 0.8162 (95% CI: 0.7503–0.8686)

Visual summary

Hair Loss Surface RMAE: 7.08% (acceptance: ≤ 9.6%, passed)
Follicle Detection mAP@50: 0.816 (acceptance: ≥ 0.72, passed)

How to read these metrics

RMAE = 7.08% (95% CI: 5.63%–8.93%): The hair loss surface quantification model achieves a relative mean absolute error of 7.08% against expert consensus on 800 test images. The acceptance criterion (≤ 9.6%) is based on non-inferiority to published automated SALT studies. Read against the 0–100 scale, an RMAE of 7.08% means the model's hair loss percentage estimate differs from the expert consensus by about 7 SALT points on average.
mAP@50 = 0.8162 (95% CI: 0.7503–0.8686): The hair follicle detection model localises individual follicles in trichoscopy images with a mean average precision of 0.8162 at an intersection-over-union threshold of 0.5, exceeding the criterion (≥ 0.72) derived from published hair follicle detection studies.

From segmentation to SALT score

The hair loss surface model segments each scalp quadrant image into hair-bearing and non-hair-bearing regions, then computes the percentage of hair loss. These per-quadrant percentages are combined using the standard SALT weights (left 18%, right 18%, top 40%, back 24%) to produce the total SALT score. Because this aggregation is a deterministic formula, the segmentation RMAE directly reflects the accuracy of the SALT score.
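The aggregation described above can be sketched in a few lines. The weights are the ones stated in the text; the function name `salt_score` and the input dictionary layout are assumptions made for illustration and do not describe the product's actual interface.

```python
# Standard SALT regional weights, as stated in the text above.
SALT_WEIGHTS = {"left": 0.18, "right": 0.18, "top": 0.40, "back": 0.24}


def salt_score(hair_loss_pct):
    """Combine per-quadrant hair loss percentages (0-100) into a SALT score (0-100)."""
    missing = SALT_WEIGHTS.keys() - hair_loss_pct.keys()
    if missing:
        raise ValueError(f"missing quadrants: {sorted(missing)}")
    for region in SALT_WEIGHTS:
        if not 0.0 <= hair_loss_pct[region] <= 100.0:
            raise ValueError(f"{region} must be a percentage between 0 and 100")
    # Weighted sum: each quadrant contributes in proportion to its scalp area.
    return sum(SALT_WEIGHTS[r] * hair_loss_pct[r] for r in SALT_WEIGHTS)


example = {"left": 50.0, "right": 25.0, "top": 10.0, "back": 0.0}
print(round(salt_score(example), 2))  # 17.5
```

Because the weights sum to 1, an error of a given size in one quadrant shifts the total by that error multiplied by the quadrant's weight, so the top view influences the total most.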

Validation datasets

Dataset	Images	Source	Severity profile	Annotators	Role
Hair loss segmentation	800	Private (multi-batch collection: alopecia cases, healthy subjects, and standardised 5-angle captures)	Various alopecia types including alopecia areata and androgenetic alopecia, across a range of severity levels	3	Independent test set for hair loss surface quantification model evaluation (from 1826-image total dataset)

Prospective clinical study metrics

The following metrics will be used in prospective clinical studies to evaluate the agreement between AI-computed SALT scores and investigator SALT assessments. These measure the clinical utility of the complete workflow (AI segmentation + SALT formula + investigator review), not the AI model in isolation.

Intraclass correlation coefficient

The degree of agreement between two or more raters measuring the same quantity. In this context: how closely ASALT (automated SALT) scores agree with expert dermatologist SALT assessments.

Perfect score: 1.0 would mean perfect agreement. For SALT scoring, individual dermatologists achieve ICC ~0.80–0.90 against each other, making this the realistic ceiling.

Why it matters: ICC is the standard measure of inter-rater reliability for continuous scales in clinical research. It tells you whether the AI's SALT estimates are as consistent as expert assessments.

0.9–1: Excellent
0.75–0.9: Good
0.5–0.75: Moderate
0–0.5: Poor
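The text does not say which ICC form the studies use. As an illustration only, the sketch below assumes ICC(2,1) (two-way random effects, absolute agreement, single rater), which is a common choice when the raters are treated as interchangeable. The function name `icc_2_1` and the sample scores are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per subject and one column per rater."""
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 subjects and 2 raters, with no missing scores")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA decomposition of the total sum of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical scores: column 0 is the automated score, column 1 the expert score.
scores = [[12.0, 10.0], [48.0, 55.0], [80.0, 78.0], [33.0, 30.0]]
print(round(icc_2_1(scores), 3))
```

An absolute-agreement form penalises a systematic offset between the automated and expert scores, which a pure correlation would not.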

Mean absolute error

The average magnitude of scoring errors in SALT points. An MAE of 5.0 means that, on average, ASALT's score differs from the expert consensus by 5 percentage points on the 0–100 scale.

Perfect score: 0.0 would mean zero error. Individual dermatologists achieve MAE of 5–10 SALT points against each other.

Why it matters: For a 0–100 scale, an MAE under 10 means the AI is within the range of expert disagreement. This is critical for detecting treatment responses such as SALT 50 or SALT 75.

0–5: Excellent
5–10: Good
10–20: Acceptable
20–100: Poor
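A minimal sketch of the calculation and of the interpretation bands listed above. The function names and sample scores are hypothetical, and because the listed bands share their edges, treating each upper edge as exclusive is an assumption.

```python
def mean_absolute_error(ai_scores, expert_scores):
    """Average absolute difference, in SALT points, between paired scores."""
    if not ai_scores or len(ai_scores) != len(expert_scores):
        raise ValueError("need two non-empty score lists of equal length")
    return sum(abs(a - e) for a, e in zip(ai_scores, expert_scores)) / len(ai_scores)


def mae_band(mae):
    """Map an MAE to the bands above (upper edges exclusive: an assumption)."""
    if mae < 5:
        return "Excellent"
    if mae < 10:
        return "Good"
    if mae < 20:
        return "Acceptable"
    return "Poor"


ai = [12.0, 48.0, 80.0]       # hypothetical automated scores
expert = [10.0, 55.0, 78.0]   # hypothetical expert consensus
mae = mean_absolute_error(ai, expert)
print(round(mae, 2), mae_band(mae))  # 3.67 Excellent
```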

Pearson correlation

The strength and direction of the linear relationship between ASALT scores and expert SALT scores. Measures whether the AI tracks the same severity trend as dermatologists.

Perfect score: 1.0 would mean perfect linear correlation. Individual dermatologists achieve ~0.85–0.95 against each other for SALT.

Why it matters: High Pearson correlation means ASALT and dermatologists rank patients in the same severity order, which is essential for detecting treatment response.

0.9–1: Very strong
0.7–0.9: Strong
0.5–0.7: Moderate
0–0.5: Weak
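The coefficient can be computed directly from paired scores. The sketch below is illustrative, with hypothetical data and a hypothetical function name.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two paired score lists."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


expert = [10.0, 30.0, 55.0, 78.0]      # hypothetical expert scores
shifted = [s + 10.0 for s in expert]   # same ranking, constant 10-point offset
print(round(pearson_r(expert, shifted), 6))  # 1.0
```

The shifted example shows why correlation is reported alongside ICC and MAE: a constant 10-point bias leaves the correlation at 1.0 even though every score is off by 10 SALT points.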

Data availability

Automated SALT scoring is currently deployed in a Phase 3 clinical trial for adverse event monitoring. Agreement metrics from prospective clinical studies will be published as data become available. The acceptance criterion for the ongoing ALADIN clinical investigation is ICC > 0.75.

Clinical trial deployment

Phase 3 deployment for adverse event monitoring

Automated SALT scoring has been deployed in a Phase 3 clinical trial (MASH indication) for monitoring drug-induced alopecia as an adverse event. In this context:

4-perspective scalp imaging (left, right, top, back) is performed at baseline and subsequent visits
The system monitors SALT score evolution and triggers automated email notifications when hair loss increases by ≥ 25% from baseline
Site investigators confirm the alert by clinical assessment and, if confirmed as an adverse event, initiate enhanced monitoring with scalp photographs every 2 months until resolution
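The alert rule above can be sketched as a simple threshold check. The text does not say whether the 25% threshold is an absolute change in SALT points or a change relative to the baseline score, so the sketch supports both readings through a `relative` flag. The function name and its parameters are hypothetical.

```python
def hair_loss_alert(baseline_salt, current_salt, threshold=25.0, relative=False):
    """Return True when the increase from baseline meets the threshold.

    relative=False: the threshold is an absolute increase in SALT points.
    relative=True: the threshold is a percentage of the baseline score.
    Which reading the trial protocol uses is an assumption either way.
    """
    increase = current_salt - baseline_salt
    if not relative:
        return increase >= threshold
    if baseline_salt <= 0:
        raise ValueError("relative change is undefined for a baseline of 0")
    return 100.0 * increase / baseline_salt >= threshold


print(hair_loss_alert(5.0, 30.0))                  # True: 25-point increase
print(hair_loss_alert(20.0, 25.0, relative=True))  # True: 25% above baseline
```

The absolute reading stays defined when a participant starts with no hair loss, which is why it is the default in this sketch.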

This deployment validates the technology's ability to:

Produce consistent, reproducible SALT scores across multiple investigator sites
Detect clinically meaningful changes in hair loss over time
Support automated safety monitoring workflows in large multi-centre trials

Positioning for severity measurement

While the initial deployment focused on adverse event monitoring, the technology is equally applicable — and more commonly needed — for alopecia areata treatment trials where SALT is the primary efficacy endpoint. The same scoring methodology provides:

Continuous SALT scores with sub-percentage resolution
Automated SALT response classification (SALT 50/75/90/100) at each visit
Longitudinal severity tracking with visual comparison across visits
Zero inter-rater variability, eliminating a significant source of noise in multi-site treatment trials
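Response classification can be sketched as below, assuming the usual clinical-trial reading in which SALT 50, 75, 90, and 100 denote at least 50%, 75%, 90%, and 100% improvement from the baseline SALT score. The function name and return convention are hypothetical.

```python
RESPONSE_LEVELS = (100, 90, 75, 50)  # checked from most to least stringent


def salt_response(baseline_salt, current_salt):
    """Return the highest response level reached (50, 75, 90, 100), or None."""
    if baseline_salt <= 0:
        raise ValueError("percentage improvement needs a baseline above 0")
    improvement = 100.0 * (baseline_salt - current_salt) / baseline_salt
    for level in RESPONSE_LEVELS:
        if improvement >= level:
            return level
    return None


print(salt_response(80.0, 40.0))  # 50
print(salt_response(80.0, 20.0))  # 75
print(salt_response(80.0, 60.0))  # None
```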

Regulatory-grade validation pathway

The clinical evidence follows a structured regulatory pathway:

Standard	Scope	Application to alopecia scoring
IEC 62304	Software lifecycle processes	The AI scoring pipeline follows a documented development lifecycle with risk-based classification
ISO 14971	Risk management	Systematic risk analysis including failure modes (misclassification of hair-bearing regions, shadow artefacts, non-alopecia hair loss)
IEC 62366-1	Usability engineering	The mobile capture application has been validated for usability at investigator sites
MEDDEV 2.7/1 Rev 4	Clinical evaluation	Clinical evidence compiled following the structured methodology for clinical evaluation reports
MDR Annex XIV	Clinical evaluation and PMCF	Post-market clinical follow-up ensures ongoing validation as the technology evolves

Ongoing clinical validation program

Study	Condition	Endpoints	Status
ASALT Phase 3 deployment	Alopecia (adverse event monitoring)	ASALT score, severity classification	Active (Phase 3 MASH trial)
DIQA validation	All conditions	Image quality assessment	Published (JAAD 2023)
AIHS4 validation	Hidradenitis suppurativa	IHS4 severity scoring	Published
APASI validation	Psoriasis	PASI severity scoring	Published
ASCORAD validation	Atopic dermatitis	SCORAD severity scoring	Published

For the full list of clinical evidence across all indications, see the clinical validation section.

DIQA validation

Hernández Montilla I, Mac Carthy T, Aguilar A, Medela A. "Dermatology Image Quality Assessment (DIQA): Artificial intelligence to ensure the clinical utility of images for remote consultations and clinical trials". J Am Acad Dermatol. 2023;88(4):927-928. doi:10.1016/j.jaad.2022.11.002

Pearson correlation ≥0.70 with expert image quality assessment
Real-time evaluation of focus, lighting, framing, and resolution
Applicability to both clinical practice and clinical trial settings

DIQA is the image quality assessment algorithm that acts as a quality gate in the clinical trial workflow. For scalp imaging, DIQA ensures consistent focus, lighting, and framing across all four quadrant images and all investigator sites.

Related validation studies

The same AI architecture and methodology used for alopecia scoring have been validated across multiple dermatological conditions:

Legit.Health “Automatic International Hidradenitis Suppurativa Severity Score System (AIHS4): A Novel Tool to Assess the Severity of Hidradenitis Suppurativa Using Artificial Intelligence” Published. 2025.

Inter-observer ICC of 0.727 (95% CI: 0.66–0.79) for objective severity assessment
State-of-the-art comparison: ICC of 0.47 without the device vs. 0.727 with the device
Same deep learning architecture validated across multiple dermatological conditions

Validation has also been completed for APASI (psoriasis), ASCORAD (atopic dermatitis), and multiple multi-reader multi-case (MRMC) studies.
