Examining Smoking Behavior and Perceptions of New and Emerging Tobacco Products Using Twitter and Natural Language Processing

Mike Conway*, University of California, San Diego, San Diego, United States
Mark Myslin, University of California, San Diego, San Diego, United States
Shu-Hong Zhu, University of California, San Diego, San Diego, United States
Wendy Chapman, University of California, San Diego, San Diego, United States

Track: Research
Presentation Topic: Blogs, Microblogs, Twitter
Presentation Type: Poster presentation
Submission Type: Single Presentation

Last modified: 2013-09-25

Abstract

Background: Social media platforms such as Twitter are rapidly becoming key resources for public health surveillance, yet little is currently known about Twitter users' attitudes towards smoking and tobacco products, especially with regard to the emerging tobacco control challenges presented by hookah and e-cigarettes.

Objective: To develop an iterative content analysis of tobacco-related Twitter posts and build machine learning classifiers to automatically detect tobacco-relevant Twitter posts and sentiment towards tobacco, with a particular focus on new and emerging products like hookah and electronic cigarettes.

Methods: We collected 7,362 Twitter posts at 15-day intervals from December 2011 to July 2012 through the Twitter API, using tobacco-related keywords (e.g. cigarette, tobacco, hookah). Each Twitter post was manually classified using a tri-axial scheme capturing genre, theme, and sentiment. The genre category consisted of nine subcategories (e.g. news, marketing), the theme category consisted of nineteen subcategories (e.g. cessation, hookah), and the sentiment category consisted of two categories (positive and negative sentiment). Using the collected data, machine-learning classifiers were trained to detect tobacco-related vs. irrelevant tweets, as well as positive vs. negative sentiment, using Naïve Bayes, k-Nearest Neighbors, and Support Vector Machine algorithms. Finally, phi contingency coefficients were computed between each pair of categories to discover emergent patterns.
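The phi contingency coefficient mentioned above measures association between two binary annotations, for example whether a tweet carries the hookah theme and whether it carries positive sentiment. The following is a minimal sketch of that calculation, not the authors' code; the function name `phi_coefficient` and the example labels are hypothetical and invented purely for illustration.

```python
import math


def phi_coefficient(a, b):
    """Phi coefficient between two equal-length sequences of 0/1 labels.

    Hypothetical helper for illustration: each position is one tweet,
    and each sequence marks whether a given category applies to it.
    """
    if len(a) != len(b):
        raise ValueError("label sequences must have the same length")

    # Cell counts of the 2x2 contingency table.
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)

    # Product of the four marginal totals.
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # One category never (or always) applies; association is undefined,
        # and this sketch reports 0.0 in that case.
        return 0.0
    return (n11 * n00 - n10 * n01) / denom


# Invented toy annotations for eight tweets (not data from the study).
hookah_theme = [1, 1, 1, 0, 0, 0, 0, 0]
positive_sentiment = [1, 1, 0, 1, 0, 0, 0, 0]
print(round(phi_coefficient(hookah_theme, positive_sentiment), 2))
```

Values near 1 indicate that the two categories tend to co-occur, values near -1 that they tend to exclude each other, and values near 0 that they are roughly independent.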

Results: The most prevalent genres were first- and second-hand experience and opinion, and the most frequent themes were hookah, cessation, and pleasure. Sentiment toward tobacco was overall more positive (46% of tweets) than negative (32%) or neutral, even excluding the 9% of tweets categorized as marketing. Three separate metrics converged in supporting an emergent distinction between hookah and e-cigarettes, which corresponded to positive sentiment, and traditional tobacco products and more general references, which corresponded to negative sentiment. These metrics were: correlations between categories in the content analysis scheme (hookah/positive = 0.39; e-cigarette/positive = 0.19); correlations between search keywords and sentiment (chi-square = 414.50, df = 4, p < 0.001, Cramér's V = 0.36); and the most discriminating unigram features for positive and negative sentiment, ranked by log-odds ratio in the machine learning component of the study. In the automated classification tasks, Support Vector Machines using a relatively small number of unigram features (500) achieved the best performance in discriminating tobacco-related from unrelated tweets (F-score = 0.85).
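For readers unfamiliar with the keyword/sentiment statistics reported above, the sketch below shows how a chi-square statistic, its degrees of freedom, and Cramér's V are related for a contingency table of counts. The abstract does not give the underlying table, so the function names and the counts here are assumptions made up for illustration; the sketch also assumes every row and column contains at least one observation.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and sample size for a table of counts.

    `table` is a list of rows; assumes no row or column sums to zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, columns) - 1)))."""
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Invented counts: rows are three hypothetical keywords, columns are
# positive / negative / neutral sentiment. Not data from the study.
example = [
    [60, 20, 20],
    [25, 50, 25],
    [30, 40, 30],
]
chi2, n = chi_square(example)
print(round(chi2, 2), degrees_of_freedom(example), round(cramers_v(example), 2))
```

Cramér's V rescales chi-square to the range 0 to 1, which makes the strength of association comparable across tables of different sizes and sample counts.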

Conclusions: Novel insights available through Twitter for tobacco surveillance are attested through the high prevalence of positive sentiment. This positive sentiment is correlated in complex ways with social image, personal experience, and recently popular products such as hookah and e-cigarettes. Several apparent perceptual disconnects between these products and their health effects suggest opportunities for tobacco control education. Finally, machine classification of tobacco-related posts shows a promising edge over strictly keyword-based approaches (keyword accuracy=57%; NLP accuracy=82%), yielding an improved signal-to-noise ratio in Twitter data and paving the way for automated tobacco surveillance applications.

This work is licensed under a Creative Commons Attribution 3.0 License.