Known Limitations

Clinical trial sponsors need to understand not just what an AI scoring system can do, but where its boundaries are. This page documents the known limitations of the APASI psoriasis severity scoring technology, explains why each limitation exists, and describes how it is managed. Transparency about limitations is a regulatory expectation (ISO 14971, EU AI Act Article 13) and a prerequisite for informed protocol design.

Induration estimated from visual cues only

Plaque thickness (induration) is inherently a tactile property — clinicians assess it by palpation, pressing on the plaque to feel its elevation and firmness. The AI estimates induration from visual cues in photographs: shadow patterns, relief, colour gradients, and surface texture.

This is the fundamental limitation of any photograph-based PASI scoring system, whether human or AI. A dermatologist rating PASI from photographs faces the same constraint.

Current performance: The AI achieves RMAE 0.151 for induration vs. expert inter-rater variability of RMAE 0.17 — meaning the AI's induration estimates from photographs are more consistent than dermatologists' estimates from the same photographs.
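To make the metric concrete, the following is a minimal sketch of an RMAE computation, assuming RMAE here means mean absolute error normalised by the width of the score range (0–4 for induration). The function name, the sample scores, and this reading of "RMAE" are illustrative assumptions, not the validated implementation.

```python
def rmae(predicted, reference, score_range=4.0):
    """Mean absolute error divided by the width of the score range.

    Assumes RMAE = MAE / range; the 0-4 default reflects the PASI
    induration scale.
    """
    errors = [abs(p - r) for p, r in zip(predicted, reference)]
    return sum(errors) / len(errors) / score_range

# Hypothetical induration scores (0-4 scale) for five plaques
ai_scores = [2.1, 1.0, 3.2, 0.4, 2.8]
consensus = [2.0, 1.5, 3.0, 0.0, 3.0]

print(round(rmae(ai_scores, consensus), 3))  # 0.07
```

Under this reading, an RMAE of 0.151 corresponds to an average absolute error of about 0.6 points on the 0–4 induration scale.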

Mitigation: The acceptance criterion is non-inferiority to expert inter-rater variability on the same modality (photographs). The AI meets this criterion with margin. For studies where palpation-based induration is critical, the protocol can specify that induration be assessed manually while the other three PASI components (erythema, desquamation, BSA) are scored by the AI.

Full-body photography required for complete PASI

A total PASI score requires BSA estimation across all four body regions (head, trunk, upper extremities, lower extremities). Target lesion photographs alone cannot provide the area component of the PASI formula.

Without BSA, only severity intensity scores (erythema, desquamation, induration) can be computed per region. These are clinically valuable for validating the AI against live assessments but do not constitute a complete PASI score.
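The dependency on area scores can be seen directly in the published PASI formula, sketched below. The region weights (0.1/0.2/0.3/0.4) and the 0–6 area grading come from the standard PASI definition; the variable names and the example patient are illustrative.

```python
# Standard PASI: weighted sum over four body regions of
# (erythema + induration + desquamation) x area grade.
REGION_WEIGHTS = {"head": 0.1, "upper": 0.2, "trunk": 0.3, "lower": 0.4}

def area_score(pct_bsa):
    """Map percent involvement of a region to the PASI 0-6 area grade."""
    if pct_bsa == 0:
        return 0
    for grade, upper in enumerate([10, 30, 50, 70, 90], start=1):
        if pct_bsa < upper:
            return grade
    return 6

def pasi(scores):
    """scores: region -> (erythema, induration, desquamation, pct_bsa)."""
    return sum(REGION_WEIGHTS[region] * (e + i + d) * area_score(pct)
               for region, (e, i, d, pct) in scores.items())

# Hypothetical patient. Without the pct_bsa values, only the per-region
# intensity sums (e + i + d) could be reported -- no total PASI.
example = {"head": (2, 1, 2, 15), "upper": (3, 2, 2, 25),
           "trunk": (2, 2, 1, 5),  "lower": (3, 3, 2, 40)}
print(round(pasi(example), 1))  # 14.9
```

Because every term of the sum is multiplied by an area grade, a missing BSA estimate zeroes out the entire region's contribution: there is no way to compute a total PASI from intensity scores alone.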

Mitigation: The standard 11-perspective full-body protocol provides complete coverage for total PASI calculation. For studies where full-body photography is not feasible (e.g., patient comfort concerns), two approaches are available:

  1. Reduced 4-perspective close-up protocol: Provides per-region intensity scores only; BSA is assessed manually by the investigator
  2. Target lesion validation: Severity scores from target lesion photographs are compared against the clinician's live scores for the same lesion — this validates the AI's severity assessment without requiring full-body photography

The BSA trade-off

BSA estimation is often the largest source of inter-rater variability in manual PASI scoring. The AI's pixel-level segmentation (IoU 0.61) provides objective area measurement. Omitting full-body photography saves patient burden but sacrifices this objectivity advantage.
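For readers unfamiliar with the segmentation metric, the following sketch shows how pixel-level intersection-over-union (IoU) is computed between a predicted lesion mask and a reference mask. The toy masks are illustrative; the production model operates on full-resolution photographs.

```python
def iou(pred, ref):
    """Intersection-over-union of two equal-shape binary masks
    (given here as nested lists of 0/1)."""
    inter = union = 0
    for row_p, row_r in zip(pred, ref):
        for p, r in zip(row_p, row_r):
            inter += p and r   # pixel lesional in both masks
            union += p or r    # pixel lesional in either mask
    return inter / union if union else 1.0

pred = [[1, 1, 0],
        [1, 0, 0]]
ref  = [[1, 1, 1],
        [0, 0, 0]]
print(iou(pred, ref))  # 2 shared pixels / 4 total lesion pixels = 0.5
```

An IoU of 0.61 therefore means that, on average, the overlap between predicted and reference lesion areas is 61% of their combined extent.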

Erythema assessment on dark skin

Erythema (redness) is harder to detect visually on darker skin tones. On Fitzpatrick V–VI skin, erythema may present as violaceous or hyperpigmented rather than classically red, requiring the recognition of different visual features.

This limitation affects both human assessors and AI equally. Studies have documented that dermatologists themselves have lower inter-rater agreement on erythema scoring for darker skin types.

Current performance: The erythema model achieves RMAE 0.13 overall (vs. expert inter-rater RMAE of 0.14). Stratified performance by Fitzpatrick type is monitored as part of the PMCF programme; targeted dataset expansion for FST V–VI is ongoing. See Performance Across Skin Types for detailed stratified metrics.

Lighting sensitivity for erythema

Erythema intensity scoring is particularly sensitive to imaging conditions. Harsh shadows, warm or cool colour casts, and inconsistent illumination between visits can introduce variability in erythema scores — not because the AI is inconsistent, but because the input signal (the photograph) has changed.

Mitigation:

  • The imaging protocol specifies even, neutral illumination
  • The DIQA quality gate rejects images with lighting issues before they reach the scoring models
  • For studies partnering with QuantifiCare, calibration stickers provide a fixed colour reference that the AI uses to normalise colour perception, substantially reducing lighting-related variability
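The calibration-sticker idea can be illustrated with a simple diagonal (von Kries-style) colour correction: each channel is scaled so that the sticker's measured RGB matches its known reference value. The sticker values, function names, and the diagonal model itself are illustrative assumptions, not the QuantifiCare or APASI implementation.

```python
def channel_gains(measured_ref, true_ref):
    """Per-channel gains mapping the measured sticker colour to its
    known reference colour."""
    return tuple(t / m for t, m in zip(true_ref, measured_ref))

def normalise(pixel, gains):
    """Apply the per-channel gains to one RGB pixel, clamped to 255."""
    return tuple(min(255.0, p * g) for p, g in zip(pixel, gains))

# A sticker known to be neutral grey (128, 128, 128), photographed
# under a warm colour cast as (160, 128, 100):
gains = channel_gains((160, 128, 100), (128, 128, 128))
print(normalise((160, 128, 100), gains))  # sticker maps back to grey
```

Because the same gains are applied to every pixel, skin and lesion colours are corrected by the same transform that restores the sticker, removing much of the visit-to-visit lighting variation before erythema is scored.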

Nail and scalp-only psoriasis

Nail psoriasis (assessed by NAPSI) is not part of the PASI score and is not evaluated by APASI. Scalp psoriasis contributes to the head region of PASI but is not assessed as a standalone endpoint.

Mitigation: For scalp-specific studies, the ASALT methodology (from the alopecia scoring system) can be adapted for scalp conditions. A dedicated NAPSI model would need to be developed and validated separately for nail psoriasis endpoints.

Cross-cutting limitations

The following limitations apply to all indications scored by the platform, not just psoriasis.

Photograph-based assessment

The AI analyses clinical photographs, not live patients. Certain clinical features that require palpation (e.g., induration, i.e., plaque thickness) or observation under specific conditions are estimated from visual cues only. This is an inherent limitation of any remote or image-based assessment method.

Mitigation: The imaging protocol standardises capture conditions (lighting, distance, angle), and the DIQA quality gate rejects images that do not meet minimum quality standards for focus, lighting, framing, and resolution. The acceptance criterion for each AI model is non-inferiority to expert inter-rater variability on the same photographs, ensuring the AI is at least as consistent as dermatologists working from the same modality.
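As an illustration of the quality-gate concept, the sketch below rejects images that fail a minimum resolution or fall outside an acceptable brightness band. The thresholds, parameter names, and the two checks shown are assumptions for illustration; the actual DIQA gate also evaluates focus and framing and uses its own validated criteria.

```python
def quality_gate(width, height, mean_brightness,
                 min_pixels=1_000_000, bright_range=(60, 200)):
    """Return (accepted, reasons) for one candidate image.

    Hypothetical checks: minimum pixel count and a mean-brightness
    band (0-255 scale). Real DIQA criteria differ.
    """
    reasons = []
    if width * height < min_pixels:
        reasons.append("resolution below minimum")
    lo, hi = bright_range
    if not lo <= mean_brightness <= hi:
        reasons.append("lighting outside acceptable range")
    return (not reasons, reasons)

print(quality_gate(3000, 4000, 130))  # accepted: (True, [])
print(quality_gate(640, 480, 230))    # rejected: resolution + lighting
```

The key design point is that rejection happens before scoring: a photograph that fails the gate never produces a score, so poor capture conditions surface as a recapture request rather than as a noisy endpoint value.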

Fitzpatrick skin type V–VI performance

Performance is lower for darker skin types due to the global underrepresentation of Fitzpatrick V–VI skin in dermatology image datasets. This is an industry-wide challenge that affects both AI systems and human assessors.

Mitigation: Stratified performance metrics are published transparently (see Performance Across Skin Types). Active dataset diversification is ongoing through targeted data sourcing (DDI, SkinDeep, Full Spectrum Dermatology) and post-market clinical follow-up (PMCF) monitoring. All Fitzpatrick groups currently exceed minimum acceptance thresholds.

Subjective ground truth

The reference standard for severity scoring is the mathematical consensus of multiple expert dermatologists — not an objective measurement. The AI cannot be “more correct” than the experts it was trained against.

Mitigation: This is not a limitation of the AI specifically, but of the clinical assessment itself. The same constraint applies to any human rater. The consensus of 2–3 independent experts is the best available approximation of truth and the same standard used by the FDA and EMA for reference standards in dermatology clinical trials. The AI matching this consensus represents the realistic ceiling of performance.
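A minimal sketch of such a reference standard, assuming the "mathematical consensus" is the arithmetic mean of the independent expert scores (the aggregation function is an assumption; a median or adjudicated consensus would work the same way):

```python
def consensus(ratings):
    """Consensus reference for one lesion/component, taken here as the
    mean of independent expert scores (illustrative assumption)."""
    return sum(ratings) / len(ratings)

# Three experts score erythema on a 0-4 scale for one plaque:
experts = [2, 3, 2]
print(round(consensus(experts), 2))  # 2.33
```

The AI's error is then measured against this consensus value, which is why its accuracy can approach, but by construction never exceed, the agreement of the expert panel itself.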

Model version specificity

All performance metrics reported in this documentation apply to a specific validated model version. Model updates — including retraining, architecture changes, or threshold adjustments — require full re-validation per IEC 62304 before deployment.

Mitigation: The model version is locked at study initiation. No mid-study model updates occur. This ensures that every patient in a trial is scored by the same model, preserving endpoint integrity throughout the study.
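One common way to enforce such a lock is to pin a cryptographic digest of the validated model artifact at study initiation and verify it before every scoring run. The workflow, names, and byte strings below are illustrative assumptions, not the platform's actual deployment mechanism.

```python
import hashlib

def artifact_digest(data: bytes) -> str:
    """SHA-256 digest of a model artifact's bytes."""
    return hashlib.sha256(data).hexdigest()

# At study initiation: pin the digest of the validated model artifact.
model_bytes = b"validated-model-weights-v1"   # stand-in for the real file
pinned = artifact_digest(model_bytes)

# Before each scoring run: refuse to score if the artifact has changed.
def verify_locked_model(data: bytes, pinned_digest: str) -> bool:
    return artifact_digest(data) == pinned_digest

print(verify_locked_model(model_bytes, pinned))            # True
print(verify_locked_model(b"retrained-model-v2", pinned))  # False
```

A digest check of this kind turns "no mid-study model updates" from a process promise into a technically enforced invariant: any change to the artifact, however small, fails verification.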

Decision support, not autonomous diagnosis

The system provides severity scoring to support clinical decisions. It does not replace clinical judgement, and it does not make autonomous diagnostic or treatment decisions. All AI-generated scores should be interpreted by qualified healthcare professionals within the context of the patient’s overall clinical presentation.

How limitations are managed

All limitations documented on this page are tracked within the formal risk management process (ISO 14971) and the software development lifecycle (IEC 62304). Each limitation has been assessed for clinical risk, and mitigations have been implemented wherever needed to bring residual risk to an acceptable level.

The post-market clinical follow-up (PMCF) programme under MDR Annex XIV continuously monitors real-world performance. The APASI system is currently deployed in a Phase 3 programme across 130+ sites in 12 countries, providing continuous real-world validation data. Any new limitation identified through post-market surveillance triggers a formal risk assessment and, if necessary, corrective action.