Performance Across Skin Types

Dermatology AI must perform reliably across the full spectrum of skin tones. This page documents how Legit.Health addresses skin type diversity in training data, validation, and ongoing monitoring.

Training data diversity

The AI models are trained on 280,000+ clinical images drawn from 55+ curated datasets, including collections specifically designed for skin tone diversity:

Dataset	Focus
Black & Brown Skin	Clinical images of dermatological conditions on darker skin tones
Diverse Dermatology Images (DDI)	Stanford dataset with balanced representation across Fitzpatrick types
Full Spectrum Dermatology	Wide skin tone coverage across common conditions
SkinDeep	Crowdsourced diverse dermatology images

Stratified performance by Fitzpatrick type

The following metrics come from the production model's bias analysis and fairness evaluation, stratified by Fitzpatrick skin type (FST) groupings. Each value is a point estimate followed by its 95% confidence interval in parentheses.

Diagnostic accuracy

Metric	Overall	FST I–II	FST III–IV	FST V–VI
Top-1 accuracy	0.658 (0.654–0.663)	0.686 (0.680–0.691)	0.615 (0.606–0.624)	0.535 (0.514–0.557)
Top-3 accuracy	0.821 (0.817–0.825)	0.850 (0.846–0.855)	0.774 (0.766–0.782)	0.694 (0.674–0.714)
Top-5 accuracy	0.864 (0.861–0.868)	0.891 (0.887–0.895)	0.822 (0.815–0.830)	0.746 (0.726–0.765)
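
As a rough illustration of how a table like this can be produced, the sketch below computes top-k accuracy per Fitzpatrick group and attaches a 95% interval. The percentile bootstrap is an assumption: this page does not state which interval method was used. The function names and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def bootstrap_ci(hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(hits)
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples))
    low = means[int(alpha / 2 * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def stratified_top_k(records, k):
    """Top-k accuracy with a 95% interval for each Fitzpatrick group.

    records: iterable of (fst_group, ranked_labels, true_label), where
    ranked_labels lists the model's predictions from most to least likely.
    """
    hits_by_group = defaultdict(list)
    for group, ranked_labels, true_label in records:
        hits_by_group[group].append(int(true_label in ranked_labels[:k]))
    results = {}
    for group, hits in hits_by_group.items():
        accuracy = sum(hits) / len(hits)
        results[group] = (accuracy, *bootstrap_ci(hits))
    return results


# Toy records, invented purely to exercise the function.
records = [
    ("I-II", ["psoriasis", "eczema", "acne"], "psoriasis"),
    ("I-II", ["eczema", "psoriasis", "acne"], "psoriasis"),
    ("I-II", ["acne", "eczema", "psoriasis"], "acne"),
    ("I-II", ["eczema", "acne", "psoriasis"], "psoriasis"),
    ("V-VI", ["eczema", "acne", "psoriasis"], "eczema"),
    ("V-VI", ["acne", "psoriasis", "eczema"], "eczema"),
]
top1 = stratified_top_k(records, k=1)
```

With small groups the bootstrap interval becomes very wide, which is the same effect that widens the FST V–VI intervals in the table.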

Clinical safety metrics

Metric	Overall	FST I–II	FST III–IV	FST V–VI
AUC malignant	0.918 (0.914–0.922)	0.918 (0.913–0.923)	0.919 (0.910–0.928)	0.836 (0.794–0.877)
AUC pre-malignant	0.878 (0.872–0.884)	0.882 (0.875–0.889)	0.879 (0.868–0.890)	0.801 (0.763–0.840)
AUC urgent referral	0.900 (0.889–0.911)	0.913 (0.899–0.926)	0.884 (0.868–0.900)	0.827 (0.785–0.865)
AUC high-priority referral	0.888 (0.884–0.892)	0.890 (0.885–0.895)	0.883 (0.876–0.891)	0.855 (0.833–0.877)
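
For readers less familiar with AUC, the sketch below shows one standard way to compute it per group: as the probability that a randomly chosen positive case (for example, a malignant lesion) receives a higher score than a randomly chosen negative one, with ties counted as half. The function names and toy records are hypothetical, and the pairwise loop is written for clarity rather than speed.

```python
from collections import defaultdict


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_auc(records):
    """records: iterable of (fst_group, score, label), label 1 = positive class."""
    by_group = defaultdict(lambda: ([], []))
    for group, score, label in records:
        by_group[group][0].append(score)
        by_group[group][1].append(label)
    return {group: auc(s, y) for group, (s, y) in by_group.items()}


# Toy records, invented purely to exercise the functions.
records = [
    ("I-II", 0.9, 1), ("I-II", 0.8, 0), ("I-II", 0.3, 1), ("I-II", 0.1, 0),
    ("V-VI", 0.7, 1), ("V-VI", 0.7, 0),
]
auc_by_group = stratified_auc(records)
```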

How to read these results

FST I–II: Performance is highest for lighter skin types, consistent with their larger representation in the training set
FST III–IV: Performance remains strong, with no clinically significant degradation: the safety AUCs stay within 0.03 of FST I–II, while top-1 accuracy is 0.615 versus 0.686
FST V–VI: Performance is lower, particularly for diagnostic accuracy (top-1 accuracy 0.535 versus 0.686 for FST I–II). Confidence intervals are wider due to smaller sample sizes. Point estimates for the safety-critical metrics (AUC malignant 0.836, AUC urgent referral 0.827) remain above 0.80, although their lower confidence bounds (0.794 and 0.785) fall just below it
All Fitzpatrick groups exceed the minimum acceptance thresholds, so the system meets its performance criteria across the full spectrum
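
The reading above can be reproduced directly from the clinical safety table. The sketch below transcribes the published values and, for each group, reports the gap to FST I–II, the width of the confidence interval, and whether the lower bound clears 0.80. The 0.80 reference is simply the level mentioned above, used here as an assumption; the actual acceptance thresholds are not stated on this page.

```python
# (point estimate, CI lower bound, CI upper bound), copied from the table above.
SAFETY_AUC = {
    "AUC malignant": {
        "I-II": (0.918, 0.913, 0.923),
        "III-IV": (0.919, 0.910, 0.928),
        "V-VI": (0.836, 0.794, 0.877),
    },
    "AUC pre-malignant": {
        "I-II": (0.882, 0.875, 0.889),
        "III-IV": (0.879, 0.868, 0.890),
        "V-VI": (0.801, 0.763, 0.840),
    },
    "AUC urgent referral": {
        "I-II": (0.913, 0.899, 0.926),
        "III-IV": (0.884, 0.868, 0.900),
        "V-VI": (0.827, 0.785, 0.865),
    },
    "AUC high-priority referral": {
        "I-II": (0.890, 0.885, 0.895),
        "III-IV": (0.883, 0.876, 0.891),
        "V-VI": (0.855, 0.833, 0.877),
    },
}

REFERENCE_LEVEL = 0.80  # assumption: illustrative level, not a stated threshold


def summarize(table, reference_group="I-II", level=REFERENCE_LEVEL):
    """Gap to the reference group, CI width, and lower-bound check per cell."""
    summary = {}
    for metric, groups in table.items():
        ref_estimate = groups[reference_group][0]
        summary[metric] = {
            group: {
                "gap_vs_ref": round(estimate - ref_estimate, 3),
                "ci_width": round(high - low, 3),
                "lower_bound_above_level": low > level,
            }
            for group, (estimate, low, high) in groups.items()
        }
    return summary


summary = summarize(SAFETY_AUC)
below_level = sorted(
    metric for metric, groups in summary.items()
    if not groups["V-VI"]["lower_bound_above_level"]
)
```

Here `below_level` lists the metrics whose FST V–VI lower bound sits under the reference level, which makes the effect of the wider intervals explicit.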

Platform-level vs. scoring-model metrics

These metrics are from the platform's diagnostic classification model, which processes the broadest range of conditions and skin types. The clinical trial scoring models (ALADIN for acne, APASI for psoriasis) inherit the same image processing pipeline and benefit from the same training data diversity. Performance monitoring is active across all model families.

Known limitations and mitigation

FST V–VI representation: Darker skin types are underrepresented in dermatology image datasets globally. This is an industry-wide challenge, not specific to Legit.Health. The impact is visible in wider confidence intervals and lower absolute accuracy for FST V–VI.

Mitigation strategy:

Active sourcing of diverse datasets (DDI, SkinDeep, Full Spectrum Dermatology)
Ongoing post-market clinical follow-up (PMCF) that monitors performance by skin type
Transparent reporting of stratified metrics, including limitations

For clinical trial sponsors: The scoring models used in clinical trials are validated against expert inter-rater variability as the acceptance criterion. This non-inferiority approach ensures that the AI is at least as consistent as dermatologists, regardless of skin type.
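
One way to picture this acceptance criterion is sketched below: the AI's average disagreement with the expert raters is compared against the experts' average disagreement with each other. This is only an illustration under stated assumptions. The choice of mean absolute difference, the absence of a margin, the function names, and the toy scores are all hypothetical, and the actual statistical protocol is not described on this page.

```python
from itertools import combinations
from statistics import mean


def inter_rater_mad(ratings):
    """Mean absolute difference between expert pairs, averaged over images.

    ratings: one list of expert severity scores per image.
    """
    return mean(
        mean(abs(a - b) for a, b in combinations(scores, 2))
        for scores in ratings
    )


def ai_to_rater_mad(ai_scores, ratings):
    """Mean absolute difference between the AI score and each expert score."""
    return mean(
        mean(abs(ai - score) for score in scores)
        for ai, scores in zip(ai_scores, ratings)
    )


def within_expert_variability(ai_scores, ratings):
    """True if the AI disagrees with experts no more than experts do with each other."""
    return ai_to_rater_mad(ai_scores, ratings) <= inter_rater_mad(ratings)


# Toy scores, invented purely to exercise the functions.
expert_ratings = [[2, 4, 3], [5, 5, 6]]
ai_scores = [3, 5]
accepted = within_expert_variability(ai_scores, expert_ratings)
```

Running the same comparison separately for each Fitzpatrick group would show whether the criterion holds regardless of skin type.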

Ongoing monitoring

Performance across skin types is monitored as part of the post-market clinical follow-up (PMCF) programme under MDR Annex XIV. Any performance degradation detected for specific Fitzpatrick groups triggers a formal risk assessment per ISO 14971 and, if necessary, targeted model retraining.
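
A monitoring check of this kind might look like the sketch below. The rule used here, flagging a group when its entire current confidence interval lies below the baseline interval, is a hypothetical choice: this page does not define what counts as degradation. The baseline values are the AUC malignant figures from the table above, while the "current" values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    value: float
    ci_low: float
    ci_high: float


def degraded(baseline: Estimate, current: Estimate) -> bool:
    """Hypothetical rule: the whole current interval sits below the baseline one."""
    return current.ci_high < baseline.ci_low


def groups_needing_review(baseline, current):
    """Return the Fitzpatrick groups whose metric would trigger a risk assessment."""
    return sorted(
        group for group in baseline
        if group in current and degraded(baseline[group], current[group])
    )


# Baseline: AUC malignant as published above.
baseline = {
    "I-II": Estimate(0.918, 0.913, 0.923),
    "III-IV": Estimate(0.919, 0.910, 0.928),
    "V-VI": Estimate(0.836, 0.794, 0.877),
}
# Current period: made-up numbers, not real monitoring data.
current = {
    "I-II": Estimate(0.915, 0.909, 0.921),
    "III-IV": Estimate(0.917, 0.907, 0.927),
    "V-VI": Estimate(0.760, 0.720, 0.790),
}
flagged = groups_needing_review(baseline, current)
```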

