Known Limitations

Clinical trial sponsors need to understand not just what an AI scoring system can do, but where its boundaries are. This page documents the known limitations of the acne severity scoring technology, explains why each limitation exists, and describes how it is managed. Transparency about limitations is a regulatory expectation (ISO 14971, EU AI Act Article 13) and a prerequisite for informed protocol design.

Facial acne only

The model is trained and validated exclusively for facial acne. Body acne (back, chest, shoulders) is not assessed.

This is a deliberate scope decision, not an oversight. The FDA guidance on acne drug development (FDA, 2005) recommends that efficacy assessment be limited to the face, as it is the most frequent site of involvement. Body acne imaging protocols differ substantially from facial imaging (different distances, angles, and lighting requirements), and the Hayashi Criterion counting methodology is defined for half-face assessment.

For sponsors requiring body acne endpoints

If your protocol requires body acne assessment, this would need to be handled as a separate endpoint with a dedicated imaging protocol. Contact us during protocol design to discuss feasibility.

Inflammatory lesions only

The lesion detector counts papules, pustules, and nodules. Comedones (blackheads and whiteheads) are excluded by design.

This follows the IGA methodology and FDA guidance, which recommends that inflammatory and non-inflammatory lesions be counted and reported separately. The current model addresses the inflammatory component, which is the basis for both the IGA score and the ALADIN composite.

Mitigation: Non-inflammatory lesion counts can be added as a separate endpoint if required by a specific protocol. The imaging protocol captures the same photographs needed for both assessments.
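The split between counted and excluded lesion classes can be sketched as a simple filter over detector output. The class names, confidence field, and threshold below are illustrative assumptions, not the product's actual schema:

```python
# Hypothetical detector output: one dict per detected lesion.
INFLAMMATORY = {"papule", "pustule", "nodule"}        # counted for IGA/ALADIN
NON_INFLAMMATORY = {"open_comedo", "closed_comedo"}   # excluded by design

detections = [
    {"cls": "papule", "conf": 0.81},
    {"cls": "closed_comedo", "conf": 0.77},
    {"cls": "pustule", "conf": 0.64},
    {"cls": "nodule", "conf": 0.58},
]

def count_by_group(dets, classes, min_conf=0.5):
    """Count detections belonging to a class group above a confidence cut-off."""
    return sum(1 for d in dets if d["cls"] in classes and d["conf"] >= min_conf)

inflammatory_count = count_by_group(detections, INFLAMMATORY)    # 3
comedo_count = count_by_group(detections, NON_INFLAMMATORY)      # 1
```

Because both counts come from the same photographs, adding a non-inflammatory endpoint is a reporting change, not an imaging change.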

Performance on severe acne

The primary validation dataset (n=331) is predominantly mild-to-moderate acne. On the independent severe-case dataset (n=27):

Metric       ALADIN        Dermatologists   Verdict
MAE          0.64 ± 0.71   0.44 ± 0.50      AI slightly worse
Cohen's κ    0.61          0.62             AI essentially matches

The AI's mean absolute error on severe cases is higher than the dermatologist benchmark (0.64 vs. 0.44), though it still falls within one IGA grade on average. Cohen's κ of 0.61 indicates "substantial agreement" even on severe cases.
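For readers who want to see exactly what these two metrics measure, here is a minimal computation of MAE and unweighted Cohen's κ over paired IGA grades. The grade lists are invented for illustration and are not study data:

```python
from collections import Counter

def mae(pred, truth):
    """Mean absolute error between predicted and reference IGA grades."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)

def cohens_kappa(pred, truth):
    """Unweighted Cohen's kappa: observed agreement corrected for chance."""
    n = len(pred)
    labels = set(pred) | set(truth)
    observed = sum(p == t for p, t in zip(pred, truth)) / n
    pc, tc = Counter(pred), Counter(truth)
    expected = sum(pc[l] * tc[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative IGA grades (0–4) for six cases — not study data
ai     = [3, 4, 3, 2, 4, 3]
expert = [3, 4, 4, 2, 4, 2]
print(mae(ai, expert))           # average error in grades
print(cohens_kappa(ai, expert))  # agreement beyond chance
```

Note that κ can be "substantial" even when MAE is non-zero, which is why the table reports both.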

Why this happens: Severe acne features (large nodules, confluent lesions, deep cystic involvement) are underrepresented in the training data relative to the mild-to-moderate cases that dominate clinical practice. Human experts find severe cases easier to grade because they are visually distinctive.

Mitigation: The prospective ALADIN clinical investigation (DermoMedic, 2026) will include a broader severity range. Additionally, the ALADIN formula was co-optimised with the detection model, so the end-to-end system compensates for individual detection characteristics.

Raw detection precision

The inflammatory lesion detector achieves mAP@50 = 0.45 (95% CI: 0.43–0.47). In isolation, this appears modest. However, this metric does not reflect end-to-end clinical performance.

Why this is acceptable: The ALADIN formula constants (a, b) were calibrated to produce correct IGA scores given the detector's actual output, not given perfect counts. The detector and the scoring formula are a co-optimised system. The clinically relevant metric is the end-to-end IGA agreement (Cohen's κ = 0.53, exceeding the dermatologist benchmark of 0.46), not the raw detection precision.
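The idea of a calibrated count-to-grade mapping can be sketched as follows. The functional form and the constants below are placeholders standing in for the calibrated (a, b) pair; the real ALADIN formula is not reproduced here:

```python
import math

# Placeholder constants standing in for the calibrated ALADIN pair (a, b).
# These are NOT the real values; they only illustrate the calibration idea.
A, B = 1.4, -0.2

def iga_from_count(n_lesions: int) -> int:
    """Map a raw inflammatory lesion count to a clipped 0–4 IGA grade."""
    raw = A * math.log1p(n_lesions) + B   # log compresses high counts
    return int(max(0, min(4, round(raw))))
```

The key point is that (a, b) are fitted against the detector's real output distribution, so systematic under- or over-detection is absorbed by the calibration rather than propagating into the final grade.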

The right question to ask

The question is not "does the AI detect every lesion perfectly?" but "does the AI produce severity scores that agree with expert dermatologists?" The answer to the second question is yes — at or above expert level.

Facial hair, accessories, and makeup

Dense facial hair, heavy makeup, or accessories (glasses, piercings) that cover skin can occlude lesions and degrade detection performance. The model cannot detect lesions that are not visible in the photograph.

Mitigation: The patient preparation protocol requires patients to remove makeup, pull hair back, and remove glasses before image capture. Sites are trained on these requirements during the study startup process.

Cross-cutting limitations

The following limitations apply to all indications scored by the platform, not just acne.

Photograph-based assessment

The AI analyses clinical photographs, not live patients. Certain clinical features that require palpation (e.g., induration or plaque thickness) or observation under specific conditions are estimated from visual cues only. This is an inherent limitation of any remote or image-based assessment method.

Mitigation: The imaging protocol standardises capture conditions (lighting, distance, angle), and the DIQA quality gate rejects images that do not meet minimum quality standards for focus, lighting, framing, and resolution. The acceptance criterion for each AI model is non-inferiority to expert inter-rater variability on the same photographs, ensuring the AI is at least as consistent as dermatologists working from the same modality.
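A quality gate of this kind can be sketched as a set of per-image checks that either pass the image or return rejection reasons. The specific metrics and thresholds below are assumptions for illustration, not the DIQA product's actual criteria:

```python
def quality_gate(sharpness: float, brightness: float,
                 face_coverage: float, min_px: int) -> list[str]:
    """Return rejection reasons; an empty list means the image is accepted."""
    reasons = []
    if sharpness < 0.4:                  # e.g. normalised Laplacian variance
        reasons.append("out of focus")
    if not 0.25 <= brightness <= 0.85:   # mean luma on a 0–1 scale
        reasons.append("under/over-exposed")
    if face_coverage < 0.5:              # fraction of frame occupied by the face
        reasons.append("framing: face too small")
    if min_px < 1024:                    # shortest image side in pixels
        reasons.append("insufficient resolution")
    return reasons
```

Returning all failed checks at once (rather than the first) lets the site operator fix every problem before recapturing the image.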

Fitzpatrick skin type V–VI performance

Performance is lower for darker skin types due to the global underrepresentation of Fitzpatrick V–VI skin in dermatology image datasets. This is an industry-wide challenge that affects both AI systems and human assessors.

Mitigation: Stratified performance metrics are published transparently (see Performance Across Skin Types). Active dataset diversification is ongoing through targeted data sourcing (DDI, SkinDeep, Full Spectrum Dermatology) and post-market clinical follow-up (PMCF) monitoring. All Fitzpatrick groups currently exceed minimum acceptance thresholds.

Subjective ground truth

The reference standard for severity scoring is the mathematical consensus of multiple expert dermatologists — not an objective measurement. The AI cannot be “more correct” than the experts it was trained against.

Mitigation: This is not a limitation of the AI specifically, but of clinical assessment itself. The same constraint applies to any human rater. The consensus of 2–3 independent experts is the best available approximation of truth and the same approach the FDA and EMA accept as a reference standard in dermatology clinical trials. The AI matching this consensus represents the realistic ceiling of performance.
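One simple form such a mathematical consensus can take is the median of the independent expert grades; the exact aggregation rule used in the studies is not specified here, so treat this as an assumption:

```python
import statistics

def consensus_grade(expert_grades: list[int]) -> float:
    """Aggregate independent expert IGA grades into one reference value (median)."""
    return statistics.median(expert_grades)

consensus_grade([3, 2, 3])   # with three raters, the majority view wins
```

A median is robust to a single outlying rater, which matters when only 2–3 experts grade each case.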

Model version specificity

All performance metrics reported in this documentation apply to a specific validated model version. Model updates — including retraining, architecture changes, or threshold adjustments — require full re-validation per IEC 62304 before deployment.

Mitigation: The model version is locked at study initiation. No mid-study model updates occur. This ensures that every patient in a trial is scored by the same model, preserving endpoint integrity throughout the study.
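One common way to enforce such a lock is to pin a cryptographic hash of the model weights at study initiation and refuse to score against anything else. The function below is a hypothetical sketch of that guard, not the platform's actual mechanism:

```python
import hashlib

def verify_model_lock(model_bytes: bytes, locked_sha256: str) -> None:
    """Refuse to score unless the deployed weights match the version locked at study start."""
    actual = hashlib.sha256(model_bytes).hexdigest()
    if actual != locked_sha256:
        raise RuntimeError(
            "Deployed model does not match the locked version; "
            "full re-validation per IEC 62304 is required before use."
        )
```

Run at every scoring session, a check like this makes an accidental mid-study model swap fail loudly instead of silently altering the endpoint.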

Decision support, not autonomous diagnosis

The system provides severity scoring to support clinical decisions. It does not replace clinical judgement, and it does not make autonomous diagnostic or treatment decisions. All AI-generated scores should be interpreted by qualified healthcare professionals within the context of the patient’s overall clinical presentation.

How limitations are managed

All limitations documented on this page are tracked within the formal risk management process (ISO 14971) and the software development lifecycle (IEC 62304). Each limitation has been assessed for clinical risk, and mitigations have been implemented wherever needed to bring residual risk to an acceptable level.

The post-market clinical follow-up (PMCF) programme under MDR Annex XIV continuously monitors real-world performance. Any new limitation identified through post-market surveillance triggers a formal risk assessment and, if necessary, corrective action.