This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Cardio, is properly cited. The complete bibliographic information, a link to the original publication on https://cardio.jmir.org, as well as this copyright and license information must be included.

Many machine learning approaches are limited to classifying outcomes rather than making longitudinal predictions. One strategy for using machine learning in clinical risk prediction is to classify outcomes over a given time horizon; however, how to identify the optimal time horizon for risk prediction is not well established.

In this study, we aim to identify an optimal time horizon for classification of incident myocardial infarction (MI) by looping machine learning models over outcomes defined at increasing time horizons. Additionally, we sought to compare the performance of these models with the traditional Framingham Heart Study (FHS) coronary heart disease gender-specific Cox proportional hazards regression model.

We analyzed data from a single clinic visit of 5201 participants of a cardiovascular health study. We examined 61 variables collected from this baseline exam, including demographic and biologic data, medical history, medications, serum biomarkers, and electrocardiographic and echocardiographic data. We compared several machine learning methods (eg, random forest, L1 regression, gradient boosted decision tree, support vector machine, and k-nearest neighbor) trained to predict incident MI occurring within time horizons ranging from 500 to 10,000 days of follow-up. Models were compared on a 20% held-out testing set using the area under the receiver operating characteristic curve (AUROC). Variable importance analysis was performed for the random forest and L1 regression models across time points. We compared results with the FHS coronary heart disease gender-specific Cox proportional hazards regression functions.

There were 4190 participants included in the analysis, with 2522 (60.2%) female participants and an average age of 72.6 years. Over 10,000 days of follow-up, there were 813 incident MI events. The machine learning models were most predictive over moderate follow-up time horizons (ie, 1500-2500 days). Overall, the L1 (Lasso) logistic regression demonstrated the strongest classification accuracy across all time horizons. This model was most predictive at 1500 days of follow-up, with an AUROC of 0.71. The most influential variables differed by follow-up time and model, with gender being the most important feature for the L1 regression and weight for the random forest model across all time frames. Compared with the Framingham Cox function, the L1 and random forest models performed better across all time frames beyond 1500 days.

In a population free of coronary heart disease, machine learning techniques can be used to predict incident MI at varying time horizons with reasonable accuracy, with the strongest prediction accuracy in moderate follow-up periods. Validation across additional populations is needed to confirm the validity of this approach in risk prediction.

Cardiovascular disease (CVD) is the leading cause of morbidity and mortality in the United States and worldwide. The prevalence of CVD in adults within the United States has reached 48% and greater than 130 million adults in the United States are projected to have CVD by 2035, with total costs expected to reach US $1.1 trillion [

Historically, risk prediction models have been developed by applying traditional statistical models (ie, regression-based models and Cox) to cohort data [

Machine learning has been introduced as a novel method for processing large amounts of data, focused primarily on accurate prediction rather than understanding the relative effect of risk factors on disease. In some applications, machine learning methods have been found to improve upon traditional regression models for predicting various cardiovascular outcomes [

With this investigation, we examined the impact of varying time horizons on the prediction of incident MI. Using data from the Cardiovascular Health Study (CHS) [

Data were approved for use by the Cardiovascular Health Study Policies and Procedures Committee with accompanying data and materials distribution agreement.

We used anonymized data from the CHS [

We excluded patients with a baseline history of prior MI from the cohort. We examined 61 variables collected from the baseline exam, including demographic and biologic data (Table S1 in

Using an end point of incident MI, we applied multiple machine learning methods across varying time horizons to define an optimal risk prediction. Missing data were uncommon for baseline demographic and laboratory variables and, although still infrequent, were more common for electrocardiogram variables. Missing values were imputed using median replacement for continuous variables and most-frequent-value replacement for categorical variables (
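As a concrete illustration of this imputation scheme, the following sketch (in Python, although the published analysis used R; the variable names are hypothetical) fills missing continuous values with the column median and missing categorical values with the most common category:

```python
from collections import Counter
from statistics import median

def impute(rows, continuous, categorical):
    """Replace missing values (None): median for continuous columns,
    most common value for categorical columns."""
    fills = {}
    for col in continuous:
        observed = [row[col] for row in rows if row[col] is not None]
        fills[col] = median(observed)
    for col in categorical:
        observed = [row[col] for row in rows if row[col] is not None]
        fills[col] = Counter(observed).most_common(1)[0][0]
    return [{col: (fills.get(col) if val is None else val) for col, val in row.items()}
            for row in rows]

# Hypothetical mini-cohort: systolic BP (continuous) and smoking status (categorical)
rows = [
    {"sbp": 120, "smoker": "no"},
    {"sbp": None, "smoker": "yes"},
    {"sbp": 140, "smoker": None},
    {"sbp": 130, "smoker": "no"},
]
filled = impute(rows, continuous=["sbp"], categorical=["smoker"])
# filled[1]["sbp"] is the median 130; filled[2]["smoker"] is the mode "no"
```

Median replacement has the advantage of being robust to outlying laboratory values, which is one reason it is commonly preferred over mean replacement for simple imputation.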

Analysis flowchart. CHS: Cardiovascular Health Study.

The data set was randomly split into a training set (80%) and a testing or validation set (20%). The training set was used to construct 5 machine learning models: random forest, L1 (Lasso) regression, support vector machine, k-nearest neighbor, and gradient boosted decision tree. Hyperparameter tuning, to identify optimal values for parameters that are not learned during the training process, was performed using the validation set. These models were then applied to the test set to examine model performance, which was assessed using the area under the receiver operating characteristic curve (AUROC). Additionally, we used the FHS coronary heart disease Cox proportional hazards regression model as a comparison to the machine learning models (Table S2 in
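The split-train-evaluate pipeline described above can be sketched as follows in Python with scikit-learn (the published analysis used R, and the data here are a synthetic stand-in, since the CHS variables are not reproduced in this example). The sketch shows the L1 (Lasso) logistic regression evaluated by AUROC on the 20% held-out set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for baseline cohort variables; the minority class
# plays the role of incident MI within a given time horizon.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.85], random_state=0)

# 80/20 split, as in the analysis
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# L1 (Lasso) logistic regression; the liblinear solver supports the L1 penalty.
# C (inverse regularization strength) is the kind of hyperparameter tuned in the study.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X_train, y_train)

# Evaluate on the held-out 20% using AUROC
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

The other models (random forest, support vector machine, k-nearest neighbor, gradient boosted trees) would slot into the same pipeline, differing only in the estimator and its hyperparameters.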

Starting at 500 days, we looped each model over 500-day time horizons to identify the optimal predictive horizon up through 10,000 days of follow-up. For each time horizon, variable importance algorithms were applied to the L1 regression and random forest models. In the L1 regression model, coefficients of variables that are less helpful to the model are shrunk to zero, removing those variables altogether; the variables with remaining nonzero coefficients are those selected. Because the models use normalized inputs, coefficients can be compared directly based on the absolute value of the average coefficient for each input. In the random forest algorithm, we performed permutation feature selection, which measures the prediction strength of each variable as the decrease in accuracy when that variable's values are randomly permuted, essentially voiding it within the model.
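The outcome-labeling step of this loop can be sketched as follows (a minimal Python illustration with toy follow-up data; the real analysis would also refit each model at every horizon). Each participant is labeled positive if an incident MI occurred within the horizon, and the horizon advances in 500-day steps up to 10,000 days:

```python
def label_within_horizon(time_to_event, had_event, horizon_days):
    """Binary outcome: 1 if an incident MI occurred within the horizon, else 0."""
    return [1 if (event and days <= horizon_days) else 0
            for days, event in zip(time_to_event, had_event)]

# 500-day steps up through 10,000 days of follow-up
horizons = range(500, 10001, 500)

# Toy follow-up data: days to event (or censoring) and MI indicator
followup_days = [300, 800, 4000, 9000]
had_mi = [True, False, True, True]

labels = {h: label_within_horizon(followup_days, had_mi, h) for h in horizons}
# At 500 days only the first event counts; by 10,000 days all three events count.
```

Note that participants censored before a given horizon are treated here as event-free at that horizon; this is the simplification inherent in recasting a time-to-event problem as a series of classification problems.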

Preliminary analyses showed that results were highly sensitive to which cases were selected into the held-out sample, so we performed 50 analyses with different random seeds, storing separate results for each model, time horizon, and seed number (a total of 1000 separate models for each type of model). Results were compiled based on the average AUROC, coefficient value (L1 regression), and impurity or accuracy (random forest) for each model. Model comparison was performed using linear mixed effects models, with seed number as the random effect and an unstructured covariance pattern.
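The compilation step can be sketched as follows (the AUROC values shown are hypothetical; the actual analysis stored 50 values per model and horizon). Results keyed by model and horizon are averaged across random seeds:

```python
from statistics import mean

# Hypothetical per-seed AUROCs, keyed by (model, horizon in days);
# the actual analysis stored 50 values per key.
results = {
    ("l1", 1500): [0.70, 0.72, 0.71],
    ("rf", 1500): [0.66, 0.68, 0.67],
}

# Compile by averaging across random seeds
avg_auroc = {key: mean(aucs) for key, aucs in results.items()}
```

Averaging over many random splits reduces the dependence of the reported performance on any single held-out sample, which is exactly the sensitivity the preliminary analyses uncovered.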

All modeling was performed using publicly available packages in R (version 1.1.463; The R Foundation for Statistical Computing). The code used for analysis is provided in

Baseline characteristics of the study participants are presented in

Baseline characteristics of the study participants.

Characteristics | Values (N=4190) |
Age (years), mean (SD) | 72.6 (5.6) |
Gender (male), n (%) | 1668 (39.8) |
Tobacco consumption, n (%) | 2201 (53) |
Hypertension, n (%) | 2300 (55) |
Diabetes, n (%) | 389 (9.3) |
Total cholesterol (mg/dL), mean (SD) | 211 (38) |
BMI (kg/m²), mean (SD) | 26.4 (1.9) |

Relative performance of the machine learning methods and FHS model is displayed in

In addition to examining AUROC, we also examined the area under the precision-recall curve (

The L1 logistic regression was overall the most predictive across all time points (

When compared with the FHS model, the L1 model performed worse at 500 days of follow-up but had superior prediction accuracy at all subsequent follow-up times. The random forest model performed better than the FHS model starting at 1500 days of follow-up and longer. The remaining machine learning models were less predictive than the FHS model across all time frames (

Predictive accuracy over varying time horizons. FHS: Framingham Heart Study; KNN: k-nearest neighbor; RF: random forest; ROC: receiver operating characteristics; SVM: support vector machine.

Predictive accuracy using area under the precision-recall curve. KNN: k-nearest neighbor; PR: precision-recall; RF: random forest; SVM: support vector machine.

Prediction accuracy across all time horizons. AUC: area under the curve; KNN: k-nearest neighbor; RF: random forest; SVM: support vector machine.

Some machine learning algorithms allow for analysis of variable contributions to the model. For this analysis, feature importance was performed across all time points for the L1 regression and random forest models (

Feature selection (top features).

Model | Short-term follow-up (500-1000 days) | Intermediate follow-up (1500-2500 days) | Long-term follow-up (>2500 days) |
L1 regression | Gender (0.90), Calcium channel blockers (0.47), IVCD^{a}, Diabetes mellitus (0.32), Smoking (0.22), Systolic blood pressure (0.21) | Gender (1.03), Diabetes mellitus (0.33), Calcium channel blockers (0.42), Hypertension (0.27), Alcohol (per week) (–0.21) | Gender (0.50), Calcium channel blockers (0.33), Diabetes mellitus (0.20) |
Random forest | Weight, FEV1^{c}, BMI, Height, LDL-C^{d} | Weight, FEV1^{c}, BMI, Height, Gender | Weight, Total cholesterol, BMI, Height, LDL-C^{d} |

^{a}IVCD: intraventricular conduction delay.

^{b}ECG: electrocardiogram.

^{c}FEV1: forced expiratory volume in one second.

^{d}LDL-C: low-density lipoprotein cholesterol.

For the L1 regression, the most important variables (based on the absolute value of coefficients applied to normalized inputs) at short-term follow-up intervals (ie, <1000 days) were gender, history of diabetes, use of calcium channel blockers or β-blockers, and having a ventricular conduction defect by electrocardiogram. At intermediate follow-up interval (ie, 1500-2500 days), the most important variables were gender, use of calcium-channel blocker, history of diabetes, and history of hypertension. At longer follow-up times (ie, >2500 days), the most important variables were gender, use of calcium channel blocker, and history of diabetes.

For the random forest variable selection based on accuracy, the most important variables at short-term follow-up intervals (ie, <1000 days) were weight, forced expiratory volume (FEV) by pulmonary function testing, BMI, height, and low-density lipoprotein (LDL) cholesterol. At intermediate follow-up interval (1500-2500 days), the most important variables were weight, FEV, BMI, height, and gender. At longer follow-up times (ie, >2500 days), the most important variables were weight, height, BMI, LDL cholesterol, and total cholesterol.

This study demonstrates the ability to use machine learning methods for the prediction of incident MI over varying time horizons in cohort data. Using AUROC as the primary metric for model performance, prediction across all models was most accurate in the moderate (ie, 1500-2500 day) follow-up horizon. The L1 regularized regression provided the most accurate prediction across all time frames, followed by the random forest algorithms. These two models compared favorably to the FHS coronary heart disease prediction variables, especially at longer follow-up intervals. Applying ranked variable importance algorithms demonstrated how the variables selected differed over time and in different models.

Prediction was most accurate in the moderate follow-up horizon. We suspect that this was due to the balance of accumulating enough events while still being close in time to the baseline data collected. A predictor that is measured closer in time to the outcome is more likely to be relevant in prediction, and as more events accumulate over time, the power to identify a predictive model increases. Prior studies have looked at machine learning prediction of coronary heart disease at short and intermediate follow-up times; however, to our knowledge, this is the first study to apply models to annual time horizons from short- to long-term follow-up [

The L1 regularized regression generally provided the most accurate prediction across all time frames. Regularized regression models expand upon traditional regression models by penalizing coefficient magnitude during model fitting, which effectively selects a best subset of predictors from all candidate variables. An L1 (Lasso) regression differs from other regularized regression models in that it can shrink the coefficients of many variables to exactly zero, allowing for feature selection in addition to preventing overfitting. As such, it is very useful when working with many variables, as in cohort or electronic health record data. Prior studies have found these models to be comparable to more advanced machine learning methods for predicting clinical outcomes [

With machine learning models, the relationship between any one variable and the outcome is not as clear as with standard regression models. However, some methods can provide the relative importance of each variable to the model creation. We performed ranked variable analysis for the L1 regression and random forest models. We found that, generally, the models found traditional risk factors to be the most important; however, these most important variables changed over time.

The random forest variable importance found weight, height, LDL cholesterol, and BMI to be highly important across time frames. FEV was important in short- and medium-term follow-up but less important in longer-term follow-up. For the L1 regression, gender, history of diabetes, and the use of calcium channel blockers were important variables across all time horizons. Although these associations are interesting, these analyses cannot establish causation and can only suggest further study of the importance of these variables.

This study has some notable limitations. First, the CHS [

In a population free of coronary heart disease, machine learning techniques can be used to accurately predict development of incident MI at varying time horizons. Moderate follow-up time horizons appear to yield the most accurate prediction given the balance between proximity to baseline data collection and allowing an ample number of events to accrue. Future studies are needed to validate this technique in additional populations.

Tables and code used for model analysis.

area under the receiver operating characteristic curve

Cardiovascular Health Study

cardiovascular disease

forced expiratory volume

Framingham Heart Study

myocardial infarction

This research was supported by contracts 75N92021D00006, HHSN268201200036C, HHSN268200800007C, N01HC55222, N01HC85079, N01HC85080, N01HC85081, N01HC85082, N01HC85083, and N01HC85086, as well as grants U01HL080295 and U01HL130114 from the National Heart, Lung, and Blood Institute (NHLBI), with additional contribution from the National Institute of Neurological Disorders and Stroke (NINDS). Additional support was provided by R01AG023629 from the National Institute on Aging (NIA). A full list of principal CHS investigators and institutions can be found at the CHS-NHLBI.org website. This work was also funded by grants from the National Institutes of Health/NHLBI (MAR: 5K23 HL127296, R01 HL146824).

None declared.