
Known Limitations

Clinical trial sponsors need to understand not just what an AI scoring system can do, but where its boundaries are. This page documents the known limitations of the AIHS4 hidradenitis suppurativa severity scoring technology, explains why each limitation exists, and describes how it is managed.

Draining fistula detection

Draining tunnels (fistulae) are the highest-weighted lesion type in IHS4 (×4). Their detection is the most technically challenging component: fistulae can be partially occluded, have non-obvious surface openings, and vary substantially in appearance across patients and disease duration.
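To make the ×4 weighting concrete, the published IHS4 formula and its severity bands can be sketched as follows (function names are illustrative):

```python
def ihs4(nodules: int, abscesses: int, draining_tunnels: int) -> int:
    """IHS4 = 1 x nodules + 2 x abscesses + 4 x draining tunnels."""
    return nodules + 2 * abscesses + 4 * draining_tunnels

def severity_band(score: int) -> str:
    """Published IHS4 bands: <= 3 mild, 4-10 moderate, >= 11 severe."""
    if score <= 3:
        return "mild"
    if score <= 10:
        return "moderate"
    return "severe"

# Missing a single draining fistula shifts the score by 4 points,
# enough to cross a severity band:
print(severity_band(ihs4(1, 1, 1)))  # 1 + 2 + 4 = 7 -> "moderate"
print(severity_band(ihs4(1, 1, 0)))  # 1 + 2 + 0 = 3 -> "mild"
```

This is why fistula detection accuracy dominates the overall scoring error: no other lesion type can move the score as far per miscounted lesion.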

Mitigation: The AIHS4 model was validated specifically within the M-27134-01 clinical trial, so the reported ICC of 0.727 already reflects fistula detection performance under realistic clinical conditions. In practice, the AI's ability to objectively detect and count fistulae removes the most rater-dependent source of variability in manual IHS4 scoring.

Abscess vs. nodule differentiation

Distinguishing abscesses from inflammatory nodules is difficult even for experienced clinicians. Abscesses are characterised by fluctuance (a tactile property) and specific visual features (surrounding erythema, irregular surface), but the boundary between a large nodule and a small abscess is often ambiguous.

Mitigation: This ambiguity is a fundamental characteristic of HS lesion assessment, not specific to AI. The AIHS4 model is trained on expert-annotated images with consensus labelling. The ICC of 0.727 reflects the AI's ability to handle this ambiguity at least as well as the manual inter-rater agreement (ICC 0.47), which suffers from the same classification difficulty.
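One common way to derive a consensus label from 2–3 independent raters is a majority vote with an adjudication fallback for ties. The sketch below is illustrative of that general technique, not the actual annotation pipeline used for AIHS4:

```python
from collections import Counter

def consensus_class(labels: list) -> str:
    """Majority vote over expert labels; ties are flagged for adjudication."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "adjudicate"  # no majority: escalate to a senior rater
    return counts[0][0]

print(consensus_class(["abscess", "abscess", "nodule"]))  # "abscess"
print(consensus_class(["abscess", "nodule"]))             # "adjudicate"
```

The value of consensus labelling is exactly in the ambiguous cases: a borderline large-nodule/small-abscess lesion that splits the raters is resolved explicitly rather than labelled by a single opinion.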

Depth not assessable from photographs

HS lesion depth (deep nodules, subcutaneous tracts) is not assessable from surface photography alone. The AI assesses surface lesion characteristics; deep involvement must be assessed by clinical palpation as a complementary evaluation.

Mitigation: For studies requiring depth assessment, the protocol can specify that palpation-based findings be recorded separately while the AI provides the surface-level IHS4 score. The two assessments are complementary.

Anatomical coverage depends on photographic protocol

Regions not photographed cannot contribute to the score. For patients with widely distributed HS, the imaging protocol must be designed to cover all affected anatomical regions.

Mitigation: The protocol design phase includes a discussion of the patient population's expected lesion distribution. The imaging protocol is configured to cover all relevant anatomical regions. The DIQA quality gate ensures all required perspectives are captured before submission.

Scar tissue and post-inflammatory changes

Chronic HS can leave significant scarring and post-inflammatory changes. The AI must distinguish active lesions (which contribute to IHS4) from inactive scars (which do not). Dense scar tissue can also obscure new lesions.

Mitigation: The model is trained to differentiate active inflammatory lesions from post-inflammatory scars. However, in areas of very dense scarring with superimposed active disease, some detection difficulty may persist. Longitudinal tracking helps: baseline scarring is documented and changes from baseline reflect active disease.
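The longitudinal approach can be sketched as a simple change-from-baseline computation over per-lesion-type counts: stable findings (including documented baseline scarring) cancel out, and only changes in active lesion counts remain. This is an illustrative sketch, not the platform's actual tracking logic:

```python
def change_from_baseline(baseline: dict, followup: dict) -> dict:
    """Per-lesion-type change in counts between two visits."""
    keys = set(baseline) | set(followup)
    return {k: followup.get(k, 0) - baseline.get(k, 0) for k in sorted(keys)}

# Hypothetical counts: scarring documented at baseline stays constant,
# so the delta isolates active-disease change.
baseline = {"nodules": 3, "abscesses": 1, "fistulae": 2}
followup = {"nodules": 1, "abscesses": 1, "fistulae": 2}
delta = change_from_baseline(baseline, followup)
print(delta["nodules"])  # -2: only the active nodule count changed
```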

Cross-cutting limitations

The following limitations apply to all indications scored by the platform, not just hidradenitis suppurativa.

Photograph-based assessment

The AI analyses clinical photographs, not live patients. Certain clinical features that require palpation (e.g., induration or plaque thickness) or observation under specific conditions are estimated from visual cues only. This is an inherent limitation of any remote or image-based assessment method.

Mitigation: The imaging protocol standardises capture conditions (lighting, distance, angle), and the DIQA quality gate rejects images that do not meet minimum quality standards for focus, lighting, framing, and resolution. The acceptance criterion for each AI model is non-inferiority to expert inter-rater variability on the same photographs, ensuring the AI is at least as consistent as dermatologists working from the same modality.
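A quality gate of this kind typically checks each metric against a minimum threshold and rejects the image if any check fails. The sketch below is an assumption-laden illustration: the metric names, thresholds, and minimum resolution are invented for the example and are not the actual DIQA criteria.

```python
from dataclasses import dataclass

@dataclass
class ImageQuality:
    focus: float     # 0-1, higher is sharper (illustrative scale)
    lighting: float  # 0-1, exposure adequacy
    framing: float   # 0-1, region-of-interest coverage
    width_px: int
    height_px: int

THRESHOLDS = {"focus": 0.6, "lighting": 0.5, "framing": 0.7}  # assumed values
MIN_WIDTH, MIN_HEIGHT = 1024, 768                             # assumed minimum

def quality_gate(q: ImageQuality) -> list:
    """Return the names of failed checks; an empty list means the image passes."""
    failures = [name for name, t in THRESHOLDS.items() if getattr(q, name) < t]
    if q.width_px < MIN_WIDTH or q.height_px < MIN_HEIGHT:
        failures.append("resolution")
    return failures

print(quality_gate(ImageQuality(0.9, 0.8, 0.9, 2048, 1536)))  # [] -> accepted
print(quality_gate(ImageQuality(0.4, 0.8, 0.9, 640, 480)))    # ['focus', 'resolution']
```

Returning the list of failed checks, rather than a bare pass/fail, lets the capture app tell the operator exactly what to fix before resubmitting.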

Fitzpatrick skin type V–VI performance

Performance is lower for darker skin types due to the global underrepresentation of Fitzpatrick V–VI skin in dermatology image datasets. This is an industry-wide challenge that affects both AI systems and human assessors.

Mitigation: Stratified performance metrics are published transparently (see Performance Across Skin Types). Active dataset diversification is ongoing through targeted data sourcing (DDI, SkinDeep, Full Spectrum Dermatology) and post-market clinical follow-up (PMCF) monitoring. All Fitzpatrick groups currently exceed minimum acceptance thresholds.

Subjective ground truth

The reference standard for severity scoring is the mathematical consensus of multiple expert dermatologists — not an objective measurement. The AI cannot be “more correct” than the experts it was trained against.

Mitigation: This is not a limitation of the AI specifically, but of the clinical assessment itself; the same constraint applies to any human rater. The consensus of 2–3 independent experts is the best available approximation of truth and is the reference-standard approach accepted by the FDA and EMA in dermatology clinical trials. The AI matching this consensus represents the realistic ceiling of performance.

Model version specificity

All performance metrics reported in this documentation apply to a specific validated model version. Model updates — including retraining, architecture changes, or threshold adjustments — require full re-validation per IEC 62304 before deployment.

Mitigation: The model version is locked at study initiation. No mid-study model updates occur. This ensures that every patient in a trial is scored by the same model, preserving endpoint integrity throughout the study.
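A minimal sketch of how a locked model version can be enforced at scoring time: a checksum of the weights is recorded at study initiation, and scoring refuses to run if the deployed weights differ. The version identifier and checksum handling here are illustrative, not the platform's actual mechanism.

```python
import hashlib

LOCKED_MODEL_VERSION = "1.4.2"  # hypothetical version identifier

def weights_digest(weights_path: str) -> str:
    """SHA-256 of the weights file, computed in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(weights_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_model(weights_path: str, locked_digest: str) -> bool:
    """True only if the deployed weights match the study-locked checksum."""
    return weights_digest(weights_path) == locked_digest
```

Tying the lock to a content checksum, not just a version string, means even a silent re-export of the same nominal version is caught before it can score a patient.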

Decision support, not autonomous diagnosis

The system provides severity scoring to support clinical decisions. It does not replace clinical judgement, and it does not make autonomous diagnostic or treatment decisions. All AI-generated scores should be interpreted by qualified healthcare professionals within the context of the patient’s overall clinical presentation.

How limitations are managed

All limitations documented on this page are tracked within the formal risk management process (ISO 14971) and the software development lifecycle (IEC 62304). Each limitation has been assessed for clinical risk, and mitigations have been implemented wherever needed to bring residual risk to an acceptable level.

The post-market clinical follow-up (PMCF) programme under MDR Annex XIV continuously monitors real-world performance. Any new limitation identified through post-market surveillance triggers a formal risk assessment and, if necessary, corrective action.