Clinical Evidence and Validation
This page explains how the alopecia severity scoring technology is validated, what the performance metrics mean, and how the evidence pathway supports regulatory-grade use.
How the ground truth is established
Before looking at any performance numbers, it is critical to understand what the numbers are compared against.
The reference standard is a set of expert SALT (Severity of Alopecia Tool) assessments performed independently by 2–3 dermatologists specialising in hair disorders. Each dermatologist evaluates the same set of scalp photographs and assigns a SALT score per quadrant and globally. The consensus (ground truth) is the arithmetic mean of their scores.
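The consensus step can be illustrated with a minimal sketch. The function name, rater identifiers, image identifiers, and scores below are hypothetical and chosen only for illustration; the source states only that the consensus is the average of the experts' scores.

```python
from statistics import mean


def consensus_salt(scores_by_rater):
    """Average each image's SALT score across the raters who scored it.

    scores_by_rater: dict of rater id -> dict of image id -> SALT score (0-100).
    Returns a dict of image id -> mean SALT score.
    """
    image_ids = set()
    for scores in scores_by_rater.values():
        image_ids.update(scores)
    return {
        image_id: mean(
            scores[image_id]
            for scores in scores_by_rater.values()
            if image_id in scores
        )
        for image_id in sorted(image_ids)
    }


# Hypothetical global SALT scores from three dermatologists.
ratings = {
    "derm_a": {"img_001": 32.0, "img_002": 78.0},
    "derm_b": {"img_001": 36.0, "img_002": 74.0},
    "derm_c": {"img_001": 34.0, "img_002": 79.0},
}
consensus = consensus_salt(ratings)
```

The same averaging would apply per quadrant if quadrant-level scores are kept alongside the global score.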
Why this approach: SALT scoring, while more objective than many dermatological scales due to its percentage-based approach, still involves subjective estimation of hair loss extent. The best approximation of truth is the consensus of multiple experts.
SALT is more objective than scales like IGA because it measures percentage of hair loss rather than qualitative descriptors. However, visual estimation of percentage from photographs introduces variability. Expert inter-rater ICC for SALT is typically 0.80–0.90, which sets the achievable ceiling for any automated system.
Why SALT is more objective than other dermatological scales
Unlike scales such as IGA (which rely on qualitative descriptors like "mild" or "moderate"), SALT measures a quantitative outcome: the percentage of scalp affected by hair loss. This makes SALT inherently more objective and less susceptible to inter-rater variability than most dermatological severity scales.
However, variability still exists because dermatologists must visually estimate the percentage of hair loss from photographs or direct inspection. Studies report inter-rater ICC values of 0.80–0.90 for SALT, which is good but not perfect. Automated scoring removes this rater-to-rater component, because the same image always receives the same score.
Unlike subjective scales where the AI must replicate clinical judgement, SALT scoring is fundamentally a measurement task: what percentage of this area lacks hair? This is precisely the kind of task where pixel-level image analysis excels. The AI computes percentages from a segmentation of the image rather than estimating them visually, which provides finer granularity (continuous percentages rather than 5–10% increments) and fully reproducible output for a given image.
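A minimal sketch of how a SALT score could be derived from segmentation masks follows. The region weights are those of the published SALT definition (top 40%, back 24%, left and right 18% each). The mask format, function names, and example values are assumptions for illustration and do not describe the product's actual pipeline.

```python
# Standard SALT region weights: each region's share of total scalp area.
SALT_WEIGHTS = {"top": 0.40, "back": 0.24, "left": 0.18, "right": 0.18}


def hairless_fraction(scalp_mask, hairless_mask):
    """Fraction of scalp pixels flagged as hairless.

    Both masks are equally sized 2D lists of 0/1 values (assumed format).
    """
    scalp_pixels = 0
    hairless_pixels = 0
    for scalp_row, hairless_row in zip(scalp_mask, hairless_mask):
        for is_scalp, is_hairless in zip(scalp_row, hairless_row):
            if is_scalp:
                scalp_pixels += 1
                if is_hairless:
                    hairless_pixels += 1
    if scalp_pixels == 0:
        raise ValueError("scalp mask contains no scalp pixels")
    return hairless_pixels / scalp_pixels


def salt_score(fraction_by_region):
    """Weighted sum of per-region hair-loss fractions, scaled to 0-100."""
    return 100.0 * sum(
        weight * fraction_by_region[region]
        for region, weight in SALT_WEIGHTS.items()
    )


# Toy 2x4 masks for the top view: 2 of 8 scalp pixels are hairless.
top_scalp = [[1, 1, 1, 1], [1, 1, 1, 1]]
top_hairless = [[1, 1, 0, 0], [0, 0, 0, 0]]
top_fraction = hairless_fraction(top_scalp, top_hairless)

# Hypothetical fractions for the other three views.
score = salt_score(
    {"top": top_fraction, "back": 0.5, "left": 0.0, "right": 1.0}
)  # approximately 40 on the 0-100 scale
```

Because the fraction is a ratio of pixel counts, the result is continuous rather than limited to the 5–10% steps typical of visual estimation.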
Understanding the metrics
The following metrics are used to evaluate the agreement between ASALT scores and expert SALT assessments:
Intraclass correlation coefficient
The degree of agreement between two or more raters measuring the same quantity. In this context: how closely ASALT scores agree with expert dermatologist SALT assessments.
Perfect score: 1.0 would mean perfect agreement. For SALT scoring, individual dermatologists achieve ICC ~0.80–0.90 against each other, making this the realistic ceiling.
Why it matters: ICC is the standard measure of inter-rater reliability for continuous scales in clinical research. It tells you whether the AI's SALT estimates are as consistent as expert assessments.
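As an illustration, the sketch below computes one common form of the coefficient, ICC(2,1) (two-way random effects, absolute agreement, single rater), from a complete subjects-by-raters table. The source does not state which ICC form is reported, so this choice is an assumption, and the data are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one per subject; each row has one score per rater.
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete n x k table with n >= 2 and k >= 2")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Columns: expert consensus, automated score (hypothetical values).
identical = [[10, 10], [50, 50], [90, 90]]
offset_by_ten = [[10, 20], [50, 60], [90, 100]]

icc_identical = icc_2_1(identical)
icc_offset = icc_2_1(offset_by_ten)
```

In this form a constant offset between the two columns lowers the coefficient below 1.0, because absolute agreement is penalised for systematic bias as well as random disagreement.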
Mean absolute error
The average magnitude of scoring errors in SALT points. An MAE of 5.0 means that, on average, ASALT's score differs from the expert consensus by 5 percentage points on the 0–100 scale.
Perfect score: 0.0 would mean zero error. Individual dermatologists achieve MAE of 5–10 SALT points against each other.
Why it matters: For a 0–100 scale, an MAE under 10 means the AI is within the range of expert disagreement. This is critical for detecting treatment responses such as SALT 50 or SALT 75.
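A short worked example of the calculation, using hypothetical scores:

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired scores, in SALT points."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical automated scores and expert consensus for four patients.
automated = [12.0, 48.0, 81.0, 30.0]
expert_consensus = [10.0, 55.0, 78.0, 34.0]

# Absolute errors are 2, 7, 3 and 4 points, so the MAE is 16 / 4 = 4.0.
mae = mean_absolute_error(automated, expert_consensus)
```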
Pearson correlation
The strength and direction of the linear relationship between ASALT scores and expert SALT scores. Measures whether the AI tracks the same severity trend as dermatologists.
Perfect score: 1.0 would mean perfect linear correlation. Individual dermatologists achieve ~0.85–0.95 against each other for SALT.
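Why it matters: High Pearson correlation means ASALT scores rise and fall in step with dermatologists' scores across patients, which is essential for detecting treatment response. Pearson correlation does not detect a systematic offset between the two, which is why it is reported alongside ICC and MAE.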
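The sketch below computes Pearson's r from its definition and applies it to hypothetical scores in which the automated values sit a constant 8 points above the expert values. The function name and data are illustrative only.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)


expert = [10.0, 30.0, 50.0, 70.0, 90.0]
automated_biased = [score + 8.0 for score in expert]  # constant +8 offset

r = pearson_r(expert, automated_biased)
mean_gap = sum(abs(a - e) for a, e in zip(automated_biased, expert)) / len(expert)
```

Here r is essentially 1.0 even though every score is 8 points off, which shows why correlation on its own is not evidence of agreement.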
Clinical trial deployment
Phase 3 deployment for adverse event monitoring
ASALT has been deployed in a Phase 3 clinical trial for metabolic dysfunction-associated steatohepatitis (MASH) to monitor drug-induced alopecia as an adverse event. In this context:
- 4-perspective scalp imaging (left, right, top, back) is performed at baseline and subsequent visits
- The system monitors ASALT score evolution and triggers automated email notifications when hair loss increases by ≥25% from baseline
- Site investigators confirm the alert by clinical assessment and, if confirmed as an adverse event, initiate enhanced monitoring with scalp photographs every 2 months until resolution
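The alert rule in the workflow above could be sketched as follows. The source does not specify whether the 25% threshold is an absolute increase in SALT points or a relative increase over the baseline score, so the sketch supports both and defaults to the absolute reading. The function name, parameters, and values are hypothetical.

```python
def exceeds_alert_threshold(baseline, current, threshold=25.0, relative=False):
    """Return True when hair loss has increased by at least `threshold`.

    relative=False: increase measured in SALT points (assumed default).
    relative=True: increase measured as a percentage of the baseline score.
    """
    increase = current - baseline
    if not relative:
        return increase >= threshold
    if baseline <= 0:
        # A relative change from a zero baseline is undefined.
        raise ValueError("relative mode needs a baseline greater than zero")
    return 100.0 * increase / baseline >= threshold


# Hypothetical scores: baseline of 5, then three follow-up visits.
baseline_score = 5.0
visit_scores = [8.0, 18.0, 31.0]
alerts = [exceeds_alert_threshold(baseline_score, s) for s in visit_scores]
# Increases of 3, 13 and 26 points: only the third visit raises an alert.
```

A raised alert would only start the process. Per the workflow, a site investigator still confirms the finding clinically before it is recorded as an adverse event.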
This deployment validates the technology's ability to:
- Produce consistent, reproducible SALT scores across multiple investigator sites
- Detect clinically meaningful changes in hair loss over time
- Support automated safety monitoring workflows in large multi-centre trials
Positioning for severity measurement
While the initial deployment focused on adverse event monitoring, the technology is equally applicable, and more commonly needed, for alopecia areata treatment trials where SALT is the primary efficacy endpoint. The same scoring methodology provides:
- Continuous SALT scores with sub-percentage resolution
- Automated SALT response classification (SALT 50/75/90/100) at each visit
- Longitudinal severity tracking with visual comparison across visits
- Zero inter-rater variability, eliminating a significant source of noise in multi-site treatment trials
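Response classification can be sketched as below, assuming the convention usual in alopecia areata trials that SALT 50 means at least a 50% reduction in SALT score from baseline, and likewise for 75, 90 and 100. The function name and example scores are hypothetical.

```python
RESPONSE_THRESHOLDS = (100, 90, 75, 50)  # percent improvement, highest first


def salt_response(baseline, current):
    """Return the highest SALT response category reached, or None.

    Improvement is the percentage reduction in SALT score from baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline SALT score must be greater than zero")
    improvement = 100.0 * (baseline - current) / baseline
    for threshold in RESPONSE_THRESHOLDS:
        if improvement >= threshold:
            return f"SALT {threshold}"
    return None


# Hypothetical patient with a baseline SALT score of 60.
partial = salt_response(60.0, 45.0)   # 25% improvement: no category reached
strong = salt_response(60.0, 12.0)    # 80% improvement: "SALT 75"
complete = salt_response(60.0, 0.0)   # 100% improvement: "SALT 100"
```

Because the input is a continuous score, a patient just below a threshold can be distinguished from one well below it, which coarse visual increments may not allow.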
Regulatory-grade validation pathway
The clinical evidence follows a structured regulatory pathway:
| Standard | Scope | Application to alopecia scoring |
|---|---|---|
| IEC 62304 | Software lifecycle processes | The AI scoring pipeline follows a documented development lifecycle with risk-based classification |
| ISO 14971 | Risk management | Systematic risk analysis including failure modes (misclassification of hair-bearing regions, shadow artefacts, non-alopecia hair loss) |
| IEC 62366-1 | Usability engineering | The mobile capture application has been validated for usability at investigator sites |
| MEDDEV 2.7/1 Rev 4 | Clinical evaluation | Clinical evidence compiled following the structured methodology for clinical evaluation reports |
| MDR Annex XIV | Clinical evaluation and PMCF | Post-market clinical follow-up ensures ongoing validation as the technology evolves |
Ongoing clinical validation programme
| Study | Condition | Endpoints | Status |
|---|---|---|---|
| ASALT Phase 3 deployment | Alopecia (adverse event monitoring) | ASALT score, severity classification | Active (Phase 3 MASH trial) |
| DIQA validation | All conditions | Image quality assessment | Published (JAAD 2023) |
| AIHS4 validation | Hidradenitis suppurativa | IHS4 severity scoring | Published |
| APASI validation | Psoriasis | PASI severity scoring | Published |
| ASCORAD validation | Atopic dermatitis | SCORAD severity scoring | Published |
For the full list of clinical evidence across all indications, see the clinical validation section.
DIQA validation
Hernández Montilla I, Mac Carthy T, Aguilar A, Medela A. “Dermatology Image Quality Assessment (DIQA): Artificial intelligence to ensure the clinical utility of images for remote consultations and clinical trials.” J Am Acad Dermatol. 2023;88(4):927-928. doi:10.1016/j.jaad.2022.11.002
- Pearson correlation ≥0.70 with expert image quality assessment
- Real-time evaluation of focus, lighting, framing, and resolution
- Applicability to both clinical practice and clinical trial settings
DIQA is the image quality assessment algorithm that acts as a quality gate in the clinical trial workflow. For scalp imaging, DIQA ensures consistent focus, lighting, and framing across all four quadrant images and all investigator sites.
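A quality gate of this kind could look like the sketch below. It is purely illustrative: the 0–1 quality scale, the threshold value, and the function name are assumptions, and it does not represent DIQA's actual output or interface. Only the four view names come from the imaging protocol described on this page.

```python
REQUIRED_VIEWS = ("left", "right", "top", "back")


def views_to_recapture(quality_by_view, min_quality):
    """List required views that are missing or score below min_quality.

    quality_by_view: dict of view name -> quality score (assumed 0-1 scale).
    """
    return [
        view
        for view in REQUIRED_VIEWS
        if quality_by_view.get(view, float("-inf")) < min_quality
    ]


# Hypothetical visit: the top view is blurry and the back view is missing.
visit_quality = {"left": 0.90, "right": 0.85, "top": 0.40}
rejected = views_to_recapture(visit_quality, min_quality=0.70)
visit_accepted = not rejected
```

Checking at capture time means a poor image can be retaken while the patient is still on site, rather than being found unusable at scoring.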
Related validation studies
The same AI architecture and methodology used for alopecia scoring have been validated across multiple dermatological conditions:
Legit.Health “Automatic International Hidradenitis Suppurativa Severity Score System (AIHS4): A Novel Tool to Assess the Severity of Hidradenitis Suppurativa Using Artificial Intelligence” Published. 2025.
- Inter-observer ICC ≥ 0.727 (95% CI: 0.66–0.79) for objective severity assessment
- State-of-the-art comparison: ICC of 0.47 without the device vs. 0.727 with the device
- Same deep learning architecture validated across multiple dermatological conditions
Validation has also been completed for APASI (psoriasis) and ASCORAD (atopic dermatitis), and in multiple multi-reader multi-case (MRMC) studies.