The Skeletal Structure of Health Forum Content
Abstract
Background
As people increasingly turn to the Internet for medical information, a wealth of Patient Authored Text (PAT) is accumulating on Online Health Forums (OHFs). PAT contains rich, descriptive data on a variety of conditions and treatments. Specifically, PAT encodes relationships between medical concepts, such as co-morbidities and drug-treatment effects, as experienced by hundreds of patients in real time. Despite this, efforts to conduct large-scale analysis of PAT have been limited; we believe this is due to the difficulty of automatically identifying medical terms in PAT.
Objective
In this research, we elicit the skeletal, or “backbone,” structure of PAT corpora by distilling out irrelevant relationships from the set of medical concept relationships in the corpus; we then visualize the residual network. Applying this method to ~160 OHFs from our collaborators, MedHelp, we present a preliminary analysis of insights drawn from content skeleton structures and patterns.
Methods
Our dataset comprises the entire anonymized discussion history of 160 MedHelp communities, current to mid-2011, each of which contains more than 50,000 sentences.
To identify medical concepts in the text, we first used Apache Lucene to tokenize the original text into sentences. We then used ADePT (Automatic Detector of Patient Terminology), a classifier we developed in prior work specifically to identify medical words in PAT, to extract the medical concepts.
We say that two terms t1 and t2 co-occur if they appear together in the same sentence. For each community, we build a term co-occurrence frequency table. To find co-occurring concept pairs that are particularly relevant to the community, we next score each pair using a G² (log-likelihood ratio) test. This effectively awards a high rank to notable term pairs (words that occur together unusually often) and a low rank to common term pairs (words that we expect to co-occur, e.g. “and then”). Finally, we synthesize our scored term pairs into a network, which can be explored visually and interactively.
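The scoring step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each sentence has already been reduced to a list of extracted medical terms (the role ADePT plays above), and the function names `cooccurrence_counts`, `g2_score`, and `score_pairs` are hypothetical. The G² statistic is computed over the 2x2 contingency table of sentence counts for each pair as 2 Σ O ln(O/E).

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count sentences containing each term and each unordered term pair."""
    term_counts = Counter()
    pair_counts = Counter()
    for terms in sentences:
        unique = sorted(set(terms))  # a term counts once per sentence
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return term_counts, pair_counts


def g2_score(k11, k12, k21, k22):
    """G^2 = 2 * sum(O * ln(O / E)) over a 2x2 contingency table.

    k11: sentences with both terms, k12: first term only,
    k21: second term only,          k22: neither term.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


def score_pairs(sentences):
    """Return {(term_a, term_b): G^2 score} for every co-occurring pair."""
    term_counts, pair_counts = cooccurrence_counts(sentences)
    n = len(sentences)
    scores = {}
    for (a, b), k11 in pair_counts.items():
        k12 = term_counts[a] - k11  # sentences with a but not b
        k21 = term_counts[b] - k11  # sentences with b but not a
        k22 = n - k11 - k12 - k21   # sentences with neither
        scores[(a, b)] = g2_score(k11, k12, k21, k22)
    return scores
```

Note that G² measures departure from independence in either direction, so a pair that co-occurs unusually rarely also scores highly; a fuller version would likely keep only pairs whose observed co-occurrence count exceeds the expected count.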
Results
Our results are primarily visual; we present a summary here. First, we note that content skeletons differ markedly in network structure from one OHF to another. Moreover, content skeletons appear both to summarize the main “topics” of the OHF condition and to reveal how community members interact with the content itself. Much of the arthritis community network, for example, is reminiscent of a diagnostic map: fanning out from the concept PAIN are body parts such as NECK, SHOULDER, and FUNNY BONE; fanning out from SHOULDER are specific conditions and treatments, including BURSITIS, ROTATOR CUFF, and SHOULDER REPLACEMENT. The Lupus network, however, is more web-like: concepts tend to be connected to several discursive “hub” topics, including RASH, PAIN, ANAEMIA, ANXIETY, TESTING, and DERMATOLOGY.
Conclusions
Our approach elicits “content skeletons” from OHFs by scoring co-occurring term relationships and filtering out insignificant ones. The “content skeleton” structures are community-specific, and appear to encode insights into both the sub-conditions of the forum topic and how community members interact with the OHF content.
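The filtering step that yields a content skeleton can be sketched as below. This is an illustration under stated assumptions rather than the method used in the study: the abstract does not report a cutoff, so the default threshold of 3.84 (the chi-square critical value for one degree of freedom at p = 0.05) is only a placeholder, and `build_skeleton`, `hubs`, and the example scores are hypothetical.

```python
from collections import defaultdict


def build_skeleton(scores, threshold=3.84):
    """Keep term pairs scoring at or above the threshold.

    Returns an undirected graph as {term: set of neighbouring terms}.
    The default threshold is an illustrative assumption, not a value
    taken from the abstract.
    """
    graph = defaultdict(set)
    for (a, b), score in scores.items():
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def hubs(graph, top=3):
    """Return the `top` most connected terms (ties broken alphabetically)."""
    return sorted(graph, key=lambda term: (-len(graph[term]), term))[:top]


# Hypothetical scores, for illustration only.
example_scores = {
    ("pain", "shoulder"): 12.0,
    ("neck", "pain"): 9.5,
    ("bursitis", "shoulder"): 7.1,
    ("and", "then"): 0.4,
}
skeleton = build_skeleton(example_scores)
central_terms = hubs(skeleton, top=2)
```

Ranking terms by degree is one simple way to surface the “hub” topics described in the Results; other centrality measures would be equally reasonable choices.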

This work is licensed under a Creative Commons Attribution 3.0 License.