Social Media platforms, primarily used for networking, also serve as an invaluable source of knowledge on a variety of topics, including population health. Healthcare professionals and decision makers can use this knowledge effectively if appropriate techniques are employed to deal with the high-volume, high-velocity, high-variety user-generated content online, whose veracity is often questionable. In a research article published this month in BMC Medical Informatics and Decision Making, the authors investigated the potential of Twitter data for Hay Fever surveillance and validated the effectiveness of relevant content curation using state-of-the-art Deep Learning models.

Why health surveillance from Social Media?

Numerous studies have already demonstrated that Twitter users openly share health-related information (e.g. symptoms, treatments) with their online networks. So far, real-time surveillance of infectious diseases (e.g. Influenza) from Social Media has been extensively investigated in the literature, while allergic conditions remain largely unexplored. At the same time, 1 in 5 Australians suffered from Hay Fever in 2014-15, making it one of the most common chronic respiratory diseases. Due to environmental changes and increasing pollution, Pollen Allergy is dangerously on the rise, not only in Australia but worldwide.

Current approaches to estimating Hay Fever prevalence rely on either official statistics or marketing polls. More recently, data obtained from GP prescriptions, hospital admissions, pollen rates, and antihistamine sales have also been utilised. These approaches are both time-consuming and cost-intensive, and provide only a high-level view of the condition (usually answers to pre-specified questions). Given these limitations, Social Media has become an attractive alternative, as real-time data can be extracted automatically in an unobtrusive manner.

Challenges associated with user-generated content

Despite the wealth of knowledge available on Social Media platforms, extracting relevant content (i.e. actual Hay Fever self-reports) from raw user-generated content is highly challenging, because many posts are advertisements, news, warnings, etc. that mention Hay Fever without reporting a personal experience. What is more, user-written posts frequently abound in grammatical errors, ambiguous phrases, creative expressions, and so on. For example, how can a system automatically identify that the tweet ‘I’m not crying, it’s my Hay Fever playing up’ refers to the most common Hay Fever symptom (watery eyes)? Or how can it be trained to recognise that Telfast is the commercial name of a popular Hay Fever medication without an extensive list of all potential medications provided a priori?

Deep Learning as a promising solution for challenging content curation

Recent advances in Machine Learning, in particular its sub-field called Deep Learning, show great promise for curating highly challenging user-generated content. Providing the system with even a relatively small sample of positive and negative examples (in natural language), and letting the model identify the features that best distinguish the two classes, has already proved successful in the healthcare domain and beyond.
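To make the idea of learning from labelled examples concrete, here is a minimal sketch in plain Python. It is a deliberately simple stand-in: a linear bag-of-words logistic regression rather than a deep network, and not the study's method. The tweets are invented for illustration, and the names TRAINING_TWEETS, tokenize, predict_proba, and train are hypothetical.

```python
import math

# Hypothetical toy tweets written for this sketch (not taken from the study's data).
# Label 1 = personal Hay Fever self-report, 0 = other Hay Fever-related content.
TRAINING_TWEETS = [
    ("my hay fever is killing me today", 1),
    ("cant stop sneezing hay fever is back", 1),
    ("itchy eyes and runny nose thanks hay fever", 1),
    ("took telfast for my hay fever", 1),
    ("pollen count warning issued for melbourne", 0),
    ("buy hay fever tablets now on sale", 0),
    ("news hay fever season starts early this year", 0),
    ("new study on hay fever published", 0),
]


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def predict_proba(weights, bias, text):
    """Return the estimated probability that the text is a self-report."""
    score = bias + sum(weights.get(tok, 0.0) for tok in set(tokenize(text)))
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, epochs=50, learning_rate=0.5):
    """Fit one weight per word with stochastic gradient descent on the log loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            error = label - predict_proba(weights, bias, text)
            for tok in set(tokenize(text)):
                weights[tok] = weights.get(tok, 0.0) + learning_rate * error
            bias += learning_rate * error
    return weights, bias


weights, bias = train(TRAINING_TWEETS)
print(predict_proba(weights, bias, "hay fever tablets on sale now"))
```

No rule says that "sale" signals an advertisement or that "sneezing" signals a symptom; the weights are inferred from the labels alone. A deep model follows the same principle but can also learn from word order and context.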

Moreover, incorporating word embeddings, a state-of-the-art technique in Natural Language Processing, into model training further improves the accuracy and robustness of the approach. A word-to-vector representation (word embedding) accounts for syntactic and semantic linkages between words, on the principle that similar words occur in similar contexts. Conceptually related terms therefore lie close to each other in the projected vector space (e.g. bee and honey are closer to pollen than spores and fossils are). As a result, tears can be linked to watery eyes, and sniffles to a runny nose, without explicit rule definitions.
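Closeness in the vector space is commonly measured with cosine similarity. The sketch below illustrates this with toy 3-dimensional vectors whose values are invented purely for illustration; real pre-trained embeddings have far more dimensions and are learned from large text corpora. The names TOY_EMBEDDINGS and rank_neighbours are hypothetical.

```python
import math

# Toy vectors with made-up values, chosen only to mirror the example in the text.
TOY_EMBEDDINGS = {
    "pollen":  [0.9, 0.8, 0.1],
    "bee":     [0.8, 0.9, 0.2],
    "honey":   [0.7, 0.9, 0.3],
    "spores":  [0.6, 0.2, 0.7],
    "fossils": [0.1, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbours(word, embeddings=TOY_EMBEDDINGS):
    """Return the other words sorted from most to least similar to `word`."""
    target = embeddings[word]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


for neighbour, score in rank_neighbours("pollen"):
    print(f"{neighbour}: {score:.3f}")
```

With these toy values, the neighbours of pollen are ranked bee, honey, spores, fossils. A classifier that receives such vectors as input can treat an unseen word much like its known neighbours.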

Case study of Hay Fever surveillance from Twitter in Australia

The study, conducted in Australia, aimed to investigate the potential of Twitter data for Hay Fever surveillance and to validate the effectiveness of relevant content curation using state-of-the-art Deep Learning models (still in their infancy in the health informatics domain). The primary data was extracted over a 6-month period covering the high pollen season. As expected, the number of Hay Fever self-reports from Twitter users peaked around October and November. The accuracy of relevant post detection (e.g. symptoms, treatments) was up to 88% for the highest-performing model (GRU) with pre-trained word embeddings (GloVe). The major contribution of our work is the automatic detection of implicit symptoms and emerging treatments without pre-defined rules. The results prove promising for real-time health surveillance from an alternative source such as Social Media, and serve as an attractive complement to the currently limited approaches to estimating Pollen Allergy prevalence and severity.
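For readers curious about the GRU named above: it is a recurrent network that reads a post one word vector at a time and keeps a running summary in a hidden state. The sketch below shows one common formulation of a single GRU update in plain Python. It is illustrative only and not the study's implementation; the names gru_step, encode_sequence, and zero_params and the parameter layout are hypothetical, and the zero weights are untrained placeholders for values a real model would learn.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, params):
    """One GRU update: combine the input x with the previous hidden state."""
    def gate(name, hidden, activation):
        wx = matvec(params["W_" + name], x)
        uh = matvec(params["U_" + name], hidden)
        return [activation(a + b + c) for a, b, c in zip(wx, uh, params["b_" + name])]

    z = gate("z", h_prev, sigmoid)  # update gate: how much new content to accept
    r = gate("r", h_prev, sigmoid)  # reset gate: how much history feeds the candidate
    reset_hidden = [ri * hi for ri, hi in zip(r, h_prev)]
    candidate = gate("h", reset_hidden, math.tanh)
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, candidate)]


def encode_sequence(vectors, params, hidden_size):
    """Run the GRU over a sequence of word vectors and return the final state."""
    h = [0.0] * hidden_size
    for x in vectors:
        h = gru_step(x, h, params)
    return h


def zero_params(input_size, hidden_size):
    """Placeholder parameters (all zeros); a trained model would learn these."""
    params = {}
    for name in ("z", "r", "h"):
        params["W_" + name] = [[0.0] * input_size for _ in range(hidden_size)]
        params["U_" + name] = [[0.0] * hidden_size for _ in range(hidden_size)]
        params["b_" + name] = [0.0] * hidden_size
    return params


params = zero_params(input_size=2, hidden_size=2)
print(gru_step([1.0, -1.0], [0.4, -0.2], params))
```

With all-zero weights, both gates equal 0.5 and the candidate state is zero, so each step simply halves the hidden state. In a trained classifier, the final state would be passed to an output layer that decides whether the post is a relevant self-report.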