Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter

Eiji ARAMAKI *, Sachiko MASKAWA *, Mizuki MORITA ** (* The University of Tokyo)

Transcript and Presenter's Notes

Title: Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter


1
Twitter Catches The Flu: Detecting Influenza
Epidemics using Twitter
  • Eiji ARAMAKI
  • Sachiko MASKAWA
  • Mizuki MORITA
  • The University of Tokyo
  • National Institute of Biomedical Innovation

EMNLP2011
2
Why did we develop this system?
Let me show you several existing systems
3
Centers for Disease Control and Prevention (CDC)
4
Infectious Disease Surveillance Center (IDSC)
5
European Influenza Surveillance Network (EISN)
6
Why does each country have its own surveillance system?
  • Influenza epidemics are a major public health
    concern, because they cause tens of millions of
    illnesses each year.
  • To reduce the number of victims, the early detection
    of influenza epidemics is a national mission in
    every country.
  • BUT these surveillance systems
  • basically rely on hospital reports
  • (written manually).

7
Two Problems and the Recent Approach
  • (1) Small Scale
  • For example, IDSC gathers influenza patient data
    from 5,000 clinics, but it does not cover all
    cities (especially local cities).
  • (2) Time Delay (Time lag)
  • For example, the data gathering process typically
    has a 1-2 week reporting lag
  • To deal with these problems
  • Recently, various approaches that directly
    capture people's behavior have been proposed

8
Recent Approaches
  • using Phone Call data
  • Espino et al. (2003) used data from a telephone
    triage service, a public service that gives
    advice to users by telephone. They reported that
    the number of telephone calls correlates with
    influenza epidemics.
  • using Drug sale data
  • Magruder (2003) used the amount of drug sales.

Among various approaches
9
The State of the Art: Web-based Approach
  • Ginsberg et al. (Nature 2009) used Google web
    search queries that correlate with an influenza
    epidemic, such as "flu" or "fever".
  • Polgreen et al. (2008) used a Yahoo! query log.
  • Hulth et al. (2009) used a query log of a
    Swedish web search engine.

10
This Study
  • Web search queries are an extremely large-scale and
    real-time data resource.
  • BUT the query data is closed (not freely
    available): it is available only to a few
    companies, such as Google, Yahoo, or Microsoft.
  • → This study examines Twitter data, which is
    widely available.

11
OUTLINE
  • Background
  • Objective
  • Method
  • Experiment
  • Discussion
  • Conclusion

Detailed Task Definition
12
Simple Word Frequency in Twitter: Cold, Fever, Influenza
[Figure: word frequency over time, with Winter and Summer marked]
The actual influenza curve is smoother.
Simple word frequency contains various kinds of
noise, because:
13
The word "influenza" does not always indicate an
influenza patient
Positive Influenza Tweet
Negative Influenza Tweet
14
Two Types of Influenza Tweets
  • Positive influenza tweet
  • indicates an influenza patient
  • Negative influenza tweet
  • includes a mention of influenza, but does not
    indicate that an influenza patient is present
  • Not only general news, but also various other
    phenomena generate negative influenza tweets

Positive Influenza Tweet
Negative Influenza Tweet
15
Various Negative Influenza Tweets (1/2)
  • Prevention
  • You need to get an influenza shot sometime
    soon.
  • Modality (just suspicion)
  • @John might be suffering from influenza
  • Question
  • Did you catch influenza?

16
Various Negative Influenza Tweets (2/2)
  • Influenza in a Cat or Dog
  • Today, I couldn't go home late. My cat caught the
    influenza...
  • Influenza in a TV Character
  • In the last episode of that TV Series, Ritsu-chan
    caught the flu

17
Research Questions
  • In total, half of influenza-related tweets are
    negative, which motivates automatic filtering.
  • RQ1: Could an NLP system filter out negative
    influenza tweets?
  • RQ2: Could this filtering contribute to
    surveillance accuracy?

18
OUTLINE
  • Background
  • Method
  • Experiment
  • Discussion
  • Conclusion

19
Basic Idea: Binary Classification
  • We regard this task as a binary classification
    task, similar to spam mail filtering

[Diagram: input tweet → classifier trained on a Training Corpus → Positive / Negative]
Design questions:
(1) What kind of Corpus?
(2) What kind of Feature?
(3) What kind of Machine Learning Method?
20
Corpus (5k Sentences with Labels)
See the proceedings for details.
Average annotator agreement ratio: 0.85
21
What kind of Feature?
Twitter contains many ungrammatical expressions
  • Surrounding Words (BOW, no stemming, no POS)

Example: I think the influenza is going around
         L3 = I, L2 = think, L1 = the, R1 = is, R2 = going, R3 = around
  • Among various settings, Window size 6 achieved
    the highest accuracy
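The surrounding-word feature can be sketched as follows. This is only an illustration under stated assumptions: the function name `window_features`, the regex tokenizer, and the reading of "window size 6" as three words on each side (L1 to L3, R1 to R3) are not taken from the slides, and a different tokenizer would be needed for non-English tweets.

```python
import re

def window_features(tweet, keyword="influenza", window=3):
    """Bag of words around the first occurrence of `keyword`.

    `window` is the number of words taken on EACH side, so window=3
    yields up to 6 surrounding words (an assumed reading of the slide).
    """
    # Plain lowercase tokenization: no stemming, no POS tagging.
    tokens = re.findall(r"[\w@#']+", tweet.lower())
    if keyword not in tokens:
        return set()
    i = tokens.index(keyword)
    left = tokens[max(0, i - window):i]       # L3, L2, L1
    right = tokens[i + 1:i + 1 + window]      # R1, R2, R3
    return set(left + right)

features = window_features("I think the influenza is going around")
print(sorted(features))
```

Each resulting word set would then be turned into a binary feature vector for the classifier.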

22
What kind of Machine Learning Method?
Classifier               F-Measure  Time
AdaBoost                 0.592       40.192
Bagging                  0.739      530.310
Decision Tree            0.698      239.446
Logistic Regression      0.729      696.704
Nearest Neighbor         0.695       22.441
Random Forest            0.729       38.683
SVM (polynomial, d=2)    0.738       92.723
  • Among the various settings, SVM achieved an
    F-measure close to the best (Bagging) in a
    feasible time
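For reference, the F-measure reported in the table can be computed as below. This is a minimal sketch that assumes the balanced F-measure (F1) on the positive class; the slides do not state which variant was used, and the labels in the example are made up.

```python
def f_measure(gold, predicted):
    """Balanced F-measure (F1) for the positive class (label 1)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # true positives
    fp = sum(1 for g, p in pairs if not g and p)    # false positives
    fn = sum(1 for g, p in pairs if g and not p)    # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 1 = positive influenza tweet, 0 = negative.
gold = [1, 1, 1, 0, 0, 1]
pred = [1, 0, 1, 1, 0, 1]
score = f_measure(gold, pred)
print(round(score, 3))
```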

23
OUTLINE
  • Background
  • Method
  • Experiment
  • Discussion
  • Conclusion

24
Twitter Data (2008-2010)
Season I
Season II
Season III
Season IV
  • The first month is used as the training corpus
  • We divide the remaining data into 4 seasons
  • The Twitter API specification sometimes changes,
    leading to dropout periods.

25
Method Comparison and Evaluation
  • (1) TWEET-SVM (The proposed method)
  • (2) TWEET-RAW
  • Based on the simple word frequency of "influenza"
  • (3) GOOGLE (Ginsberg et al., 2009)
  • Based on Google web-search queries
  • The previous estimation data is available at the
    Google Flu Trends website.
  • (4) DRUG-SALE (Magruder, 2003)
  • Evaluation is based on
  • Average correlation with the GOLD_STANDARD data,
    that is, the actual number of influenza patients
    reported by the Infectious Disease Surveillance
    Center (IDSC)
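The evaluation measure can be sketched as a correlation between an estimated series and the reported patient counts. The sketch below assumes the Pearson correlation coefficient, which the slides do not name explicitly, and the two weekly series are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical weekly values: positive tweets vs. reported patients.
estimated = [3, 5, 9, 14, 8, 4]
reported = [30, 52, 88, 141, 83, 39]
r = pearson(estimated, reported)
print(round(r, 3))
```

Computing this per season for each method gives a table of the kind shown on the result slides.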

26
Result: Correlation Ratio
            TWEET-RAW  TWEET-SVM  GOOGLE  DRUG
Season I     0.683      0.816     0.817   -0.208
Season II   -0.009     -0.018     0.232    0.406
Season III   0.382      0.474     0.881    0.684
Season IV    0.390      0.957     0.976    0.130
Bold indicates a correlation > the statistical
significance level.
In most seasons, the proposed method achieved a
higher correlation than the simple word-frequency-based
method, demonstrating the advantage of SVM-based
filtering
27
Result Correlation Ratio
Except for Season II, the proposed method
achieved almost the same accuracy as GOOGLE.
28
Why does Twitter suffer in Season II? Because it
includes a pandemic!
WHO declared a pandemic in Jul 2009 (Season II).
  • This suggests that Twitter might be biased by news media

                 TWEET-RAW  TWEET-SVM  GOOGLE  DRUG
Normal Season    0.831      0.890      0.847   0.308
Pandemic Season  0.001      0.060      0.918   0.844
29
Season I: TWEET-SVM ≈ GOOGLE
[Figure: relative number over time]
30
Season II: TWEET-SVM << GOOGLE
[Figure: relative number over time]
31
OUTLINE
  • Background
  • Method
  • Experiment
  • Discussion
  • Conclusion

Extra Experiment
32
Frequent Question
  • Could an influenza patient REALLY use Twitter
    or Google Search?
  • That seems an unnatural situation!

"I'd like to sleep ..."
For that reason, we modified the system under the
following assumption:
People use Twitter or Google at the first
sign of influenza
33
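Implemented Using an Infectious Model (Kermack 1927)
(≈ Markov model)
[Diagram: three states. S (Susceptible, BEFORE FLU) moves to I (Infectious, UNDER FLU) on "catch the flu". I moves to R (Recover, AFTER FLU) on "recover" with probability 0.38 and stays in I with probability 0.62.]
  • The S-to-I transition is observed by Twitter / Google
  • 38% of influenza patients recover each day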
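One way to read this model is as a daily recurrence: the people who are infectious today are the 62% from yesterday who have not recovered, plus today's new onsets. The sketch below is an assumed, deterministic (expected-value) version of that idea; the function name `infected_counts` and the sample numbers are hypothetical and not from the slides.

```python
def infected_counts(new_cases, recovery_rate=0.38):
    """Estimate the number of infectious people per day.

    new_cases[t] is the number of S-to-I transitions observed on day t
    (for example, positive influenza tweets). Each day a fraction
    `recovery_rate` of the currently infectious people recovers.
    """
    infected = 0.0
    series = []
    for new in new_cases:
        # Survivors of yesterday's infectious pool plus today's onsets.
        infected = infected * (1.0 - recovery_rate) + new
        series.append(infected)
    return series

# Hypothetical daily onset counts.
daily_onsets = [10, 0, 0, 5]
estimate = infected_counts(daily_onsets)
print([round(v, 3) for v in estimate])
```

The resulting series, rather than the raw onset counts, would be the one compared with the reported patient numbers.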

34
BUT it ALSO improves the Google-based approach
  • This model improves the correlation of
  • BOTH Twitter and GOOGLE.
  • This result suggests that there is room for
    collaboration between medical research and
    web/NLP research

35
OUTLINE
  • Background
  • Method
  • Experiment
  • Discussion
  • Conclusion

36
Answer to Research Questions
  • This study proposed a new influenza surveillance
    system using Twitter
  • RQ1: Could a system filter out negative
    influenza tweets?
  • Yes, but NOT perfectly
  • RQ2: Could this filtering contribute to
    surveillance performance?
  • YES. It increases the correlation (except for
    the pandemic period).
  • We could achieve almost the same accuracy as
    GOOGLE using freely available data.

37
Conclusion
  • Even now, more than 100 (sometimes over 1,000)
    people die from influenza in Japan
  • We hope that this study might help people

38
Thank you
NLP could save a life!
Eiji ARAMAKI, Ph.D., University of Tokyo
http://mednlp.jp