NEW research suggests that integrating pathogen genomics with patient demographic data using supervised machine learning can substantially improve prediction of gastric cancer risk in people infected with Helicobacter pylori.
H. pylori infection is common worldwide and is a well-established risk factor for gastric cancer. However, only a small proportion of infected individuals go on to develop malignancy, reflecting the complex interplay between bacterial virulence, host factors, and environmental influences. Existing risk prediction approaches rely largely on clinical and lifestyle variables, limiting their ability to identify high-risk individuals early.
Integrating Genomics Into Risk Prediction
In this study, researchers assembled a large dataset of 1,363 publicly available H. pylori genomes collected between 1991 and 2024, each linked to host demographic information. Genomic features included known virulence genes as well as sequence-derived and variant-based characteristics. These data were combined with host metadata and used to train supervised machine learning models to classify infection outcomes as gastric cancer or non-gastric cancer.
Logistic regression was used as an interpretable baseline model, while more complex ensemble approaches, including XGBoost and Random Forest, were evaluated for performance gains. Models were trained using internal cross-validation on 80% of the dataset, with final performance assessed on a held-out test set.
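The evaluation design described above can be sketched without any libraries. This is a minimal illustration, not the authors' code: it assumes the held-out test set is the remaining 20% of the data, and the fold count of five, the helper names (stratified_split, k_folds), and the toy labels are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio similar in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

def k_folds(indices, k=5, seed=0):
    """Yield (fit_idx, val_idx) pairs for plain (non-stratified) k-fold cross-validation."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        fit_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield fit_idx, val_idx

# Illustrative labels only (not study data): 1 = gastric cancer, 0 = non-gastric cancer.
labels = [1] * 30 + [0] * 70

# Hold out 20% once; the test indices are never used during model selection.
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

# Cross-validation happens inside the training portion only.
folds = list(k_folds(train_idx, k=5))
```

Keeping the test indices out of every fold is the point of the design: model comparison uses only the training portion, so the final test-set score is not influenced by model selection.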
Strong Predictive Performance
The baseline logistic regression model demonstrated solid predictive ability, achieving a recall of approximately 74% for gastric cancer and an area under the receiver operating characteristic curve (AUROC) of 0.83. Both ensemble models substantially outperformed this baseline, with AUROC values exceeding 0.95 and notable improvements in recall for gastric cancer detection.
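For readers less familiar with the two metrics quoted here, the following sketch shows how each is computed. The labels and scores are invented toy values, not study data, and the 0.5 decision threshold is an assumption made for illustration. Recall is the share of true cancer cases that the model flags; AUROC is the probability that a randomly chosen cancer case receives a higher score than a randomly chosen non-cancer case.

```python
def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def auroc(y_true, scores):
    """Pairwise (Mann-Whitney) form of AUROC; tied scores count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example only: 1 = gastric cancer, 0 = non-gastric cancer.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

# Assumed threshold of 0.5 turns scores into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

toy_recall = recall(y_true, y_pred)
toy_auroc = auroc(y_true, scores)
```

With these toy values, three of the four cancer cases are flagged (recall 0.75) and 13 of the 16 cancer/non-cancer pairs are ranked correctly (AUROC 0.8125). Recall depends on the chosen threshold, whereas AUROC summarises ranking quality across all thresholds.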
Across all models, patient age consistently emerged as the strongest predictor of cancer risk. Importantly, several genomic features derived directly from sequence data also contributed meaningfully to prediction, beyond well-characterised virulence genes. This finding suggests that previously underappreciated aspects of H. pylori genetic variation may influence clinical outcomes.
Interpretability and Clinical Relevance
To address the “black box” challenge of machine learning, the researchers applied explainability methods, enabling clearer interpretation of how individual features influenced predictions. This approach may help bridge the gap between high-performing algorithms and clinical decision-making by improving transparency and trust among healthcare professionals.
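The summary does not name the explainability methods used, so the sketch below substitutes one common, model-agnostic technique, permutation importance, purely to illustrate the idea: shuffle one feature at a time and measure how much the model's accuracy drops. The toy model, the age cut-off of 60, and the data are hypothetical and are not taken from the study.

```python
import random

def accuracy(model, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, column, repeats=20, seed=0):
    """Mean drop in accuracy after shuffling the values of one column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / repeats

# Hypothetical toy model: flags a case when age (column 0) is 60 or above
# and ignores column 1 entirely.
def toy_model(row):
    return 1 if row[0] >= 60 else 0

# Invented rows of (age, some binary feature); labels follow the toy rule exactly.
rows = [(45, 0), (72, 1), (66, 0), (38, 1), (81, 1), (52, 0), (69, 0), (41, 1)]
labels = [toy_model(r) for r in rows]

age_importance = permutation_importance(toy_model, rows, labels, column=0)
other_importance = permutation_importance(toy_model, rows, labels, column=1)
```

Shuffling the column the toy model relies on lowers its accuracy, while shuffling the ignored column changes nothing, so its importance is exactly zero. Real explainability tooling reports analogous per-feature scores for fitted models.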
Looking Ahead
While the results demonstrate strong internal performance, the authors emphasise the need for external validation in independent and more diverse datasets. Future work incorporating additional host, environmental, and lifestyle variables will be essential before such models can be translated into routine clinical practice.
Overall, the study highlights the promise of combining pathogen genomics with patient data to move toward more personalised risk assessment in H. pylori infection, with the potential to support earlier detection and targeted surveillance for gastric cancer.
Reference
Narasimhan V et al. Predicting clinical outcomes in Helicobacter pylori-positive patients using supervised learning through the integration of demographic and genomic features. BMC Gastroenterol. 2026;DOI: 10.1186/s12876-025-04595-3.






