MULTIPLE predictors of breast cancer have previously been identified; these include lifestyle, reproductive, and inherited genetic factors. Prior studies have investigated the etiological differences that exist between pre- and post-menopausal breast cancer, and various approaches have been used in combination to accurately predict this common cancer type in females.
Machine learning can analyse large sets of predictor data and can model the complex non-linear relationships between predictors. Whilst previous studies have used machine learning to predict breast cancer risk, the method has not previously been used to identify the predictors themselves.
The study used machine learning methods to select features associated with breast cancer, and Cox models for risk prediction. It aimed to demonstrate that machine learning was effective in this scenario and could assist existing methods. Polygenic risk scores (PRS), a focus of recent research, estimate the combined effect of thousands of genetic variants associated with specific traits or diseases, using results from genome-wide association studies. PRS can identify patients at high risk of disease, with the intent of targeting them for early statin prescriptions.
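As an illustration of how a PRS combines many variants into one number, the following minimal sketch computes a score as a weighted sum of risk-allele counts. The function name, the dosage matrix, and the per-variant weights are all hypothetical values chosen for the example; they are not taken from the study, and real scores typically involve thousands of variants with weights drawn from genome-wide association study results.

```python
import numpy as np


def polygenic_risk_score(dosages, weights):
    """Return one score per individual: the weighted sum of allele dosages.

    dosages: (n_individuals, n_variants) array of risk-allele counts (0, 1, or 2).
    weights: (n_variants,) array of per-variant effect sizes (hypothetical here).
    """
    dosages = np.asarray(dosages, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if dosages.ndim != 2 or dosages.shape[1] != weights.shape[0]:
        raise ValueError("dosages must have one column per weight")
    return dosages @ weights


# Hypothetical toy data: three individuals, three variants.
dosages = np.array([
    [0, 1, 2],
    [2, 2, 1],
    [1, 0, 0],
])
weights = np.array([0.10, -0.05, 0.20])  # illustrative effect sizes only

scores = polygenic_risk_score(dosages, weights)
print(scores)
```

In practice such scores are usually standardised across the cohort before being compared between individuals or entered into a risk model.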
Contradictory findings have previously been reported on the interaction between PRS and observable characteristics, such as gene-environment interactions in breast cancer analysis. This study employed SHapley Additive exPlanations (SHAP) to explore the interaction between PRS and phenotypic features. Baseline data were collected using biological samples, physical examination, questionnaires, and verbal interviews with a trained medical professional. Incident breast cancer was identified using International Classification of Diseases (ICD) codes.
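SHAP attributes a model's prediction to its input features using Shapley values. The sketch below computes exact Shapley values by brute force for a small toy risk function, to show how an interaction between a PRS and a phenotypic feature is shared between the two. Everything here is an assumption made for illustration: the `toy_risk` function, its coefficients, the feature values, and the use of a single baseline vector (SHAP itself usually averages over a background dataset). It is not the study's model or pipeline.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are set to their baseline value.
    Cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def toy_risk(z):
    """Hypothetical risk score with a PRS-by-phenotype interaction term."""
    prs, age, phenotype = z
    return 0.5 * prs + 0.02 * age + 0.1 * prs * phenotype


x = [1.0, 60.0, 2.0]         # hypothetical individual: PRS, age, phenotype
baseline = [0.0, 55.0, 0.0]  # hypothetical reference individual

phi = shapley_values(toy_risk, x, baseline)
print(dict(zip(["prs", "age", "phenotype"], phi)))
```

The attributions sum to the difference between the individual's prediction and the baseline prediction, and the contribution of the interaction term is split equally between the PRS and the phenotypic feature.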
Data from the United Kingdom Biobank (UKB), which includes over half a million individuals across England, Scotland, and Wales, were used in this study. The UKB offers researchers the opportunity to adopt hypothesis-free approaches to identify novel breast cancer predictors.
Post-menopausal women aged 40–69 years at baseline were recruited. In total, 104,313 women participated in this study. Of these, 4,010 developed breast cancer during the follow-up period of 11.9 years. Using machine learning alongside traditional statistical approaches, the study identified several known and previously unreported risk factors for post-menopausal breast cancer. The known factors included age, age at menopause, and testosterone levels. The study also identified five novel predictors, including urine biomarkers and blood counts, which are strongly associated with the incidence of post-menopausal breast cancer.