This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Cardio, is properly cited. The complete bibliographic information, a link to the original publication on http://cardio.jmir.org, as well as this copyright and license information must be included.
Current atherosclerotic cardiovascular disease (ASCVD) predictive models have limitations; thus, efforts are underway to improve the discriminatory power of ASCVD models.
We sought to evaluate the discriminatory power of social media posts to predict the 10-year risk for ASCVD as compared to that of pooled cohort risk equations (PCEs).
We consented patients receiving care in an urban academic emergency department to share access to their Facebook posts and electronic medical records (EMRs). We retrieved Facebook status updates up to 5 years prior to study enrollment for all consenting patients. We identified patients (N=181) who had no prior history of coronary heart disease, had an ASCVD score in their EMR, and had more than 200 words in their Facebook posts. Using Facebook posts from these patients, we applied a machine-learning model to evaluate whether language from posts alone could predict 10-year ASCVD risk scores, and we used a psycholinguistic dictionary, Linguistic Inquiry and Word Count, to evaluate the association of certain words with risk categories.
The machine-learning model predicted the 10-year ASCVD risk scores for the categories <5%, 5%-7.4%, 7.5%-9.9%, and ≥10% with area under the curve (AUC) values of 0.78, 0.57, 0.72, and 0.61, respectively. The machine-learning model distinguished between low risk (<10%) and high risk (≥10%) with an AUC of 0.69. Additionally, the machine-learning model predicted the ASCVD risk score with Pearson
Language used on social media can provide insights about an individual’s ASCVD risk and inform approaches to risk modification.
Secondary prevention approaches have improved the longevity of patients with cardiovascular (CV) disease; however, risk factors and adverse health behaviors (eg, physical inactivity, smoking) are highly prevalent, and <1% of adults in most contemporary series meet all factors of ideal CV health [
Social media data in the form of posts, photos, and “likes” can provide information about individuals’ daily activities and behaviors. Social media has been used to track heart disease mortality rates [
The potential of social media data for CV health lies in tracking, codifying, and better understanding the hard-to-measure lifestyle choices, along with exposures related to diet, exercise, smoking, and other factors that can significantly contribute to the development and progression of heart disease. At present, measuring many of these behaviors is dependent on self-report and recall [
We sought to use social media data from consenting individuals to predict ASCVD risk reported in an electronic medical record (EMR), and to characterize differences in posts relative to four categories based on the 10-year primary risk from ASCVD risk scores.
This was a retrospective analysis of social media and EMR data of consenting patients. This study was approved by the University of Pennsylvania Institutional Review Board.
Recruitment for compiling the social mediome dataset began in March 2014 and included patients from inpatient and outpatient settings across two urban academic medical centers. Participants in this dataset consented to share access to select historical data from their social media accounts (eg, Facebook, Twitter, Instagram) and to their medical record data. Data are stored on a Health Insurance Portability and Accountability Act (HIPAA)-compliant secure server, and participants can elect to discontinue sharing data at any time. Additional information about this dataset is published elsewhere [
Following recommendations of Goff et al [
Demographic information of patients included in our analysis (N=181).
Characteristic | Value |
Race, n (%) | |
African American | 104 (57.5) |
White | 64 (35.4) |
Other | 13 (7.2) |
Men, n (%) | 48 (26.5) |
We used two approaches to process language from social media posts for inclusion in a regression model. Specifically, language features from posts were derived using (a) open vocabulary topics and (b) dictionary-based psycholinguistic features. These derived language features were then used to predict the patients’ 10-year ASCVD risk scores and to distinguish patients with different ASCVD risk scores.
The open vocabulary approach uses latent Dirichlet allocation (LDA) [
The dictionary-based psycholinguistic approach uses language from Facebook posts to identify the prevalence of predefined word categories represented in the Linguistic Inquiry and Word Count (LIWC) dictionary [
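The dictionary-based idea can be illustrated with a minimal sketch: for each user, count how often words from a predefined category appear and divide by the user's total word count. The category names, word lists, tokenizer, and the function `category_frequencies` below are hypothetical placeholders; the actual LIWC dictionary is licensed and far larger, and the study's exact preprocessing is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical toy dictionary (NOT the real LIWC word lists).
TOY_CATEGORIES = {
    "sadness": {"sad", "cry", "grief", "lonely"},
    "positive_emotion": {"happy", "love", "great"},
}


def category_frequencies(posts, categories=TOY_CATEGORIES):
    """Return each category's share of all word tokens across one user's posts."""
    # Simple lowercase tokenizer; an assumption, not the study's tokenizer.
    tokens = [t for post in posts for t in re.findall(r"[a-z']+", post.lower())]
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in categories}
    counts = Counter(tokens)
    return {
        name: sum(counts[word] for word in words) / total
        for name, words in categories.items()
    }


# Made-up posts for one user.
posts = ["So sad today, I could cry", "Love my happy dog"]
features = category_frequencies(posts)
```

Each user then contributes one relative-frequency value per category, which is the kind of feature that can be correlated with a risk group.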
We sought to investigate the discriminatory power of predicting a patient’s 10-year ASCVD risk using language features derived from Facebook posts. We extracted the features described using an open-vocabulary approach and trained a logistic regression model, as implemented in Python 3.4 scikit-learn [
In 2013, the American Heart Association and American College of Cardiology put forth the ASCVD PCEs [
For Model 2, the ASCVD risk score of patients was treated as a continuous variable rather than as the categorical variable used in Model 1. The performance of Model 2 was assessed using the Pearson correlation coefficient (
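As a reminder of what this metric measures, the following sketch computes the Pearson correlation coefficient between recorded and predicted risk scores. The function name `pearson_r` and all numeric values are made up for illustration and are not study data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


# Made-up 10-year risk scores (percent): recorded in the EMR vs predicted.
recorded = [2.1, 4.8, 7.9, 12.5, 15.0]
predicted = [3.0, 5.5, 6.5, 10.0, 13.2]
r = pearson_r(recorded, predicted)
```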
Identifying patients with high risk (≥10%) of ASCVD is of interest to clinicians. Therefore, in Model 3, we treated ASCVD risk as a dichotomous variable and built a logistic regression model that uses language to distinguish the high-risk category from low ASCVD scores (ie, <10%). Additionally, we used LIWC to identify the features associated with high-risk patients by correlating each LIWC category feature derived from patients’ social media posts with whether they were in the high-risk (≥10%) or low-risk (<10%) category; we measured the effect size using Cohen
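For readers less familiar with this effect size, the sketch below computes Cohen's d as the difference in group means divided by the pooled standard deviation. The pooled-SD formulation is an assumption about which variant was used, and the function name `cohens_d` and the example values are hypothetical.

```python
import math


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    if pooled_sd == 0:
        raise ValueError("effect size is undefined when both groups are constant")
    return (mean_a - mean_b) / pooled_sd


# Made-up relative frequencies of one word category in two risk groups.
high_risk = [0.030, 0.025, 0.040, 0.035]
low_risk = [0.020, 0.015, 0.025, 0.020]
d = cohens_d(high_risk, low_risk)
```

A positive value indicates that the category is used more in the first group than in the second.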
The multiclass logistic regression model on Facebook posts was trained to classify patients in four different categories (<5%, 5%-7.4%, 7.5%-9.9%, ≥10%) based on their ASCVD risk scores. The model was able to delineate patients in the lowest risk category (<5%) from patients in other categories with an AUC of 0.78. The model delineated patients in the categories 5%-7.4%, 7.5%-9.9%, and ≥10% from those in other categories, as shown in
Area under the curve (AUC) scores for each category of atherosclerotic cardiovascular disease risk scores from Model 1.
Category | AUC (age only) | AUC (language only) |
<5% | 0.52 | 0.78 |
5%-7.4% | 0.55 | 0.57 |
7.5%-9.9% | 0.45 | 0.72 |
≥10% | 0.59 | 0.61 |
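The per-category AUCs in a table like this are one-vs-rest values: each category is treated in turn as the positive class and all other categories as negative. A minimal sketch follows, assuming risk scores are rounded to one decimal place and using the rank-based (Mann-Whitney) definition of AUC. The helper names and all scores are made up for illustration; they are not the study's implementation or data.

```python
def risk_category(score_percent):
    """Map a 10-year risk score (percent) to one of the four categories."""
    if score_percent < 5:
        return "<5%"
    if score_percent < 7.5:
        return "5%-7.4%"
    if score_percent < 10:
        return "7.5%-9.9%"
    return "≥10%"


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for is_pos, s in zip(labels, scores) if is_pos]
    neg = [s for is_pos, s in zip(labels, scores) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(true_categories, class_scores):
    """Compute one AUC per category, treating that category as the positive class."""
    return {
        category: auc([t == category for t in true_categories], scores)
        for category, scores in class_scores.items()
    }


# Made-up recorded risk scores and made-up model scores for each category.
recorded = [3.2, 6.0, 8.1, 12.4, 4.1, 11.0]
true_categories = [risk_category(s) for s in recorded]
class_scores = {
    "<5%": [0.7, 0.2, 0.1, 0.1, 0.4, 0.3],
    "5%-7.4%": [0.1, 0.5, 0.3, 0.1, 0.2, 0.2],
    "7.5%-9.9%": [0.1, 0.1, 0.3, 0.2, 0.1, 0.3],
    "≥10%": [0.1, 0.2, 0.3, 0.6, 0.3, 0.2],
}
per_category_auc = one_vs_rest_auc(true_categories, class_scores)
```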
Using the linear regression model on Facebook posts, we predicted the ASCVD risk score of patients with
The logistic regression model delineated patients with a high risk (≥10%) of ASCVD from those with a low risk (<10%) with an AUC of 0.69.
The sadness LIWC category was most strongly associated with the high ASCVD risk category (≥10%) at a Benjamini-Hochberg–corrected significance level of
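The Benjamini-Hochberg correction mentioned here controls the false discovery rate when many word categories are tested at once. The sketch below shows the standard step-up procedure. The threshold `alpha=0.05` is an assumption rather than the study's reported level, and the p-values are made up.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis whose rank is at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Made-up p-values for four hypothetical word categories.
p_values = [0.001, 0.04, 0.03, 0.20]
significant = benjamini_hochberg(p_values)
```

Note that a p-value below the raw threshold (such as 0.03 or 0.04 in this toy input) can still fail the corrected test, which is the point of the adjustment.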
Language from Facebook posts has the potential to distinguish patients based on their calculated 10-year ASCVD risk score categorization and actual risk score. Although social media data are unlikely to replace traditional approaches for predicting CV risk, these findings suggest that such data can potentially provide supplemental information about an individual’s lifestyle and behavior, which can complement our understanding of contributors to long-term CV risk. More than 2 billion people share information about their daily lives on social media platforms, which can include information about what they eat and drink, if they smoke, when they exercise, what their lab results are, and other factors associated with Life’s Simple 7 [
The potential opportunity in exploring social media data is that this emerging data source could include data about behavior and lifestyle that might not have been reported to clinicians. There is still a gap in how this would be implemented in clinical practice, and implementation would require further evaluation of feasibility, acceptability, and interpretability. These data are unlikely to replace the existing risk score input but rather may provide complementary adjunct data. Prior work has explored the contribution of nonclinical factors (eg, patient interviews about socioeconomic status, health status, adherence, psychosocial characteristics) in predicting CV outcomes (eg, congestive heart failure readmissions). The model performance overall was poor, although patient-reported information extended the predicted ranges of rates of readmission and slightly improved model discrimination [
In our patient cohort, a high ASCVD risk score was associated with increased use of “sad” language on Facebook. This is consistent with research demonstrating that depression is more prevalent in populations with CV disease, and is predictive of adverse outcomes (such as myocardial infarction and death) among populations with preexisting CV disease [
In our analysis, the AUC for Model 1 indicated low accuracy. A potential reason is that we used data from individuals between the ages of 40 and 79 years, who post less on social media than younger individuals. Accordingly, some users in our dataset had few posts, which likely lowered the AUC. We hypothesize that with more posts (ie, more words), our models will perform better.
We compared Models 1 and 3 to determine which performed better at predicting the ASCVD risk of individuals. Toward this end, we computed the micro-averaged AUC of Model 1 and compared it to the AUC of Model 3; these were 0.66 and 0.69, respectively. This suggests that Model 3 is more reliable at predicting ASCVD risk than Model 1.
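One common way to obtain a single micro-averaged AUC from a multiclass model is to pool every (patient, category) pair into one binary problem and compute one AUC over the pooled pairs. Whether this matches the exact computation used for Model 1 is an assumption. The sketch below uses only two of the four categories, and all names and values are made up for illustration.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for is_pos, s in zip(labels, scores) if is_pos]
    neg = [s for is_pos, s in zip(labels, scores) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def micro_average_auc(true_categories, class_scores):
    """Pool all (patient, category) pairs, then compute a single AUC."""
    pooled_labels, pooled_scores = [], []
    for category, scores in class_scores.items():
        # A pair is positive when the patient truly belongs to this category.
        pooled_labels.extend(t == category for t in true_categories)
        pooled_scores.extend(scores)
    return auc(pooled_labels, pooled_scores)


# Made-up true categories and per-category model scores for four patients.
true_categories = ["<5%", "≥10%", "≥10%", "<5%"]
class_scores = {
    "<5%": [0.8, 0.3, 0.6, 0.4],
    "≥10%": [0.2, 0.7, 0.4, 0.6],
}
micro_auc = micro_average_auc(true_categories, class_scores)
```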
The findings of this study offer promise for using emerging digital data sources for identifying risk factors. This moves beyond what is simply reported by patients to what may be revealed when looking at a diary of information over multiple time points. This could aid clinicians in providing individualized recommendations for managing risk factors that contribute to heart disease.
This study has several limitations. The study cohort was primarily female and African American. Our analysis used posts from patients with at least 200 words in their Facebook posts, and therefore we cannot extrapolate to those who used social media less or did not consent to share. We chose the 200-word threshold because prior work on using social media to predict individuals’ traits found that good and stable predictive performance requires data from users with 200 words or more on Facebook [
We show that language from Facebook posts can be used to predict an individual’s 10-year risk for ASCVD. Specific information in posts could help to guide clinicians in better understanding lifestyles and behaviors, and in counseling patients about heart disease risk.
Twenty topics generated from our dataset.
ASCVD: atherosclerotic cardiovascular disease
AUC: area under the receiver operating characteristic curve
CV: cardiovascular
EMR: electronic medical record
LDA: latent Dirichlet allocation
LIWC: Linguistic Inquiry and Word Count
PCE: pooled cohort equations
This study is funded by the National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health (NIH) (1R01HL141844-01A1) to RM (principal investigator), DA, LA, and PG (coinvestigators).
DA is a partner and part owner of VAL Health, and is a US government employee. The other authors have no conflicts of interest to declare.