TY  - JOUR
AU  - Vu, Thien
AU  - Kokubo, Yoshihiro
AU  - Inoue, Mai
AU  - Yamamoto, Masaki
AU  - Mohsen, Attayeb
AU  - Martin-Morales, Agustin
AU  - Dawadi, Research
AU  - Inoue, Takao
AU  - Tay, Jie Ting
AU  - Yoshizaki, Mari
AU  - Watanabe, Naoki
AU  - Kuriya, Yuki
AU  - Matsumoto, Chisa
AU  - Arafa, Ahmed
AU  - Nakao, Yoko M
AU  - Kato, Yuka
AU  - Teramoto, Masayuki
AU  - Araki, Michihiro
PY  - 2025
DA  - 2025/5/12
TI  - Machine Learning Model for Predicting Coronary Heart Disease Risk: Development and Validation Using Insights From a Japanese Population–Based Study
JO  - JMIR Cardio
SP  - e68066
VL  - 9
KW  - coronary heart disease
KW  - machine learning
KW  - logistic regression
KW  - random forest
KW  - support vector machine
KW  - Extreme Gradient Boosting
KW  - Light Gradient-Boosting Machine
KW  - Shapley Additive Explanations
KW  - CHD
KW  - SVM
KW  - XGBoost
KW  - LightGBM
KW  - SHAP
AB  - Background: Coronary heart disease (CHD) is a major cause of morbidity and mortality worldwide. Identifying key risk factors is essential for effective risk assessment and prevention. A data-driven approach using machine learning (ML) offers advanced techniques to analyze complex, nonlinear, and high-dimensional datasets, uncovering novel predictors of CHD that go beyond the limitations of traditional models, which rely on predefined variables. Objective: This study aims to evaluate the contribution of various risk factors to CHD, focusing on both established and novel markers using ML techniques. Methods: The study recruited 7672 participants aged 30-84 years from Suita City, Japan, between 1989 and 1999. Participants were monitored for cardiovascular events over an average of 15 years. After excluding individuals with missing outcome data and eliminating unnecessary variables, 7260 participants and 28 variables were included in the analysis. Five ML models were applied to predict CHD incidence: logistic regression, random forest (RF), support vector machine, Extreme Gradient Boosting, and Light Gradient-Boosting Machine. Model performance was evaluated using accuracy, sensitivity, specificity, precision, area under the curve, F1-score, calibration curves, observed-to-expected ratios, and decision curve analysis. Additionally, Shapley Additive Explanations (SHAP) were used to interpret the prediction models and understand the contribution of various risk factors to CHD. Results: Among 7260 participants, 305 (4.2%) were diagnosed with CHD. The RF model demonstrated the highest performance, with an accuracy of 0.73 (95% CI 0.64‐0.80), sensitivity of 0.74 (95% CI 0.62‐0.84), specificity of 0.72 (95% CI 0.61‐0.83), and an area under the curve of 0.73 (95% CI 0.65‐0.80). RF also showed excellent calibration, with predicted probabilities closely aligning with observed outcomes, and provided substantial net benefit across a range of risk thresholds, as demonstrated by decision curve analysis. SHAP analysis elucidated key predictors of CHD, including the intima-media thickness (IMT_cMax) of the common carotid artery, blood pressure, lipid profiles (non–high-density lipoprotein cholesterol, high-density lipoprotein cholesterol, and triglycerides), and estimated glomerular filtration rate. Novel risk factors identified as significant contributors to CHD risk included lower calcium levels, elevated white blood cell counts, and body fat percentage. Furthermore, a protective effect was observed in women, suggesting the potential necessity for gender-specific risk assessment strategies in future cardiovascular health evaluations. Conclusions: We developed a model to predict CHD using ML and applied SHAP methods for interpretation. This approach highlights the multifactorial nature of CHD risk evaluation, aiming to support health care professionals in identifying risk factors and formulating effective prevention strategies.
SN  - 2561-1011
UR  - https://cardio.jmir.org/2025/1/e68066
UR  - https://doi.org/10.2196/68066
DO  - 10.2196/68066
ID  - info:doi/10.2196/68066
ER  - 
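
Note on the evaluation metrics named in the abstract: the sketch below shows, under stated assumptions, how the threshold-based metrics (accuracy, sensitivity, specificity, precision, F1-score), the observed-to-expected ratio, and the net benefit used in decision curve analysis are conventionally computed from binary outcomes and predicted probabilities. It is not the authors' code. The function names, the 0.5 default threshold, and the toy data in the usage note are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    precision: float
    f1: float


def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, TN, FN when predicting 1 for probabilities >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Threshold-based metrics; a ratio with a zero denominator is reported as 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return BinaryMetrics(accuracy, sensitivity, specificity, precision, f1)


def observed_to_expected(y_true, y_prob):
    """Observed event count divided by the sum of predicted probabilities."""
    return sum(y_true) / sum(y_prob)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit at one risk threshold: TP/n - FP/n * (pt / (1 - pt))."""
    tp, fp, _, _ = confusion_counts(y_true, y_prob, threshold)
    n = len(y_true)
    return tp / n - (fp / n) * (threshold / (1 - threshold))
```

As a usage illustration with made-up values, calling `binary_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])` classifies one event and one non-event correctly at the default threshold. An observed-to-expected ratio near 1 indicates that the predicted risks sum to roughly the observed number of events, and a decision curve is obtained by evaluating `net_benefit` over a range of thresholds.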