A Quality Focused Crawler for Health Information



1
  • A Quality Focused Crawler for Health Information
  • Tim Tang

2
Outline
  • Overview
  • Contributions
  • Experiments and results
  • Issues for discussion
  • Future work
  • Questions? Suggestions?

3
Overview
  • Many people use the Internet to search for health
    information
  • But health web pages may contain low quality
    information, and may lead to personal
    endangerment. (example)
  • It is important to find means to evaluate the
    quality of health websites and to provide high
    quality results in health search.

4
Motivation
  • Web users can search for health information using
    general engines or domain-specific engines like
    health portals
  • 79% of Web users in the U.S. search for health
    information on the Internet (Fox S. Health Info
    Online, 2005)
  • No measurement technique is available for
    measuring the quality of Web health search
    results.
  • Also, there is no method for automatically
    enhancing the quality of health search results
  • Therefore, people building a high quality health
    portal have to do it manually and, without work
    on measurement, we can't tell how good a job they
    are doing
  • An example of such a health portal is BluePages
    Search, developed by the ANU's Centre for Mental
    Health Research.

5
BluePages Search (BPS)
6
BPS result list
7
Research Objectives
  • To produce a health portal search that:
  • Is built automatically to save time, effort, and
    expert knowledge (cost saving).
  • Contains (only) high quality information in the
    index by applying some quality criteria
  • Satisfies users' demand for good advice
    (evidence-based medicine) about specific health
    topics from the Internet

8
Contributions
  • New and effective quality indicators for health
    websites using some IR-related techniques
  • Techniques to automate the manual quality
    assessment of health websites
  • Techniques to automate the process of building
    high quality health search engines

9
Expt1: General vs. domain-specific search engines
  • Aim: To compare the performance of general search
    engines (Google, GoogleD) and domain-specific
    engines (BPS) for domain relevance and quality.
  • Details: 100 depression queries were run on these
    engines. The top 10 results for each query from
    each engine were evaluated.
  • Results: next slide.

10
Expt1: Results
Engine    Relevance (Mean MAP)   Relevance (NDCG)   Quality score
GoogleD   0.407                  0.609              78
BPS       0.319                  0.553              127
Google    0.195                  0.349              28
MAP = Modified Average Precision; NDCG = Normalised Discounted Cumulative Gain
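As a rough illustration of the NDCG column, here is a minimal sketch of one common formulation (gain equal to the relevance grade, log2 rank discount, normalised by the ideal ordering of the same list). The slides do not say which NDCG variant or grading scale was used, so the function names and the toy grades are assumptions.

```python
import math

def dcg(grades):
    """Discounted cumulative gain with a log2 rank discount (one common variant)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades, start=1))

def ndcg(grades):
    """DCG divided by the DCG of the same grades sorted best-first."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades for one top-10 result list (not data from the study).
toy_grades = [2, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(round(ndcg(toy_grades), 3))
```

A per-engine figure like those in the table would then be the mean of `ndcg` over all queries, again assuming this variant.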
11
Expt1: Findings
  • Findings: GoogleD retrieves more relevant pages
    but fewer high quality pages than BPS.
    Domain-specific engines (BPS) have poor
    coverage (causing worse performance in
    relevance).
  • What next: How to improve coverage for
    domain-specific engines? How to automate the
    process of constructing a domain-specific engine?

12
Expt2: Prospect of focused crawling in building
domain-specific engines
  • Aim: To investigate the prospect of using
    focused crawling (FC) techniques to build health
    portals. In particular:
  • Seed list: BPS uses a seed list (start list for a
    crawl) that was manually selected by experts in
    the field. Can we automate this process?
  • Relevance of outgoing links: Is it feasible to
    follow outgoing links from the currently crawled
    pages to obtain more relevant links?
  • Link prediction: Can we successfully predict
    relevant links from available link information?

13
Expt2: Results and Findings
  • Out of 227 URLs from DMOZ, 186 were relevant
    (81%)
  • => DMOZ provides a good starting list of URLs for
    an FC
  • An unrestricted crawler starting from the BPS
    crawl can reach 25.3% more known relevant pages
    in a single step from the currently crawled
    pages.
  • => Outgoing links from a constrained crawl lead
    to additional relevant content
  • The C4.5 decision tree machine learning algorithm
    can predict link relevance with a precision of
    88.15%
  • => A decision tree created using features like
    anchor text, URL words and link anchor context
    can help a focused crawler obtain new relevant
    pages

14
Expt3: Automatic evaluation of websites
  • Aim: To investigate whether the Relevance Feedback
    (RF) technique can help in the automatic
    evaluation of health websites.
  • Details: RF is used to learn terms (words and
    phrases) representing high quality documents and
    their weights. This weighted query is then
    compared with the text of web pages to find the
    degree of similarity. We call this the Automatic
    Quality Tool (AQT).
  • Findings: Significant correlation was found
    between human-rated (EBM) results and AQT
    results.
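The AQT idea described above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not give the term-selection or similarity formulas, so the smoothed log-ratio weight, the length-normalised score, single-word terms only (no phrases), and all function names are hypothetical stand-ins.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def learn_weighted_query(good_docs, other_docs, top_k=5):
    """Learn terms that are more typical of high quality documents.

    Weight = smoothed log ratio of document frequencies (an assumed
    stand-in for the RF weighting, which the slides do not specify).
    """
    good_df = Counter(t for d in good_docs for t in set(tokenize(d)))
    other_df = Counter(t for d in other_docs for t in set(tokenize(d)))
    weights = {}
    for term, df in good_df.items():
        p_good = (df + 0.5) / (len(good_docs) + 1)
        p_other = (other_df[term] + 0.5) / (len(other_docs) + 1)
        w = math.log(p_good / p_other)
        if w > 0:
            weights[term] = w
    best = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(best)

def quality_score(page_text, weighted_query):
    """Length-normalised sum of query-term weights over the page's tokens."""
    tokens = tokenize(page_text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(w * counts[t] for t, w in weighted_query.items()) / len(tokens)
```

A site-level score could then be an average of `quality_score` over the site's pages, which is also an assumption rather than something the slides state.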

15
Expt3: Results. Correlation between AQT score
and EBM score
16
Expt3: Results. Correlation between Google
PageRank and EBM score
  • Correlation: small, non-significant
  • r = 0.23, P = 0.22, n = 30
  • Excluding sites with a PageRank of 0, we obtained
    a better correlation, but still significantly lower
    than the correlation between AQT and EBM.
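For readers who want to reproduce this kind of comparison, a minimal sketch of a correlation coefficient follows. The slide reports "r" without naming the statistic, so treating it as Pearson's r is an assumption, and the PageRank and EBM values below are made-up placeholders, not the study's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

# Hypothetical per-site scores (placeholders only).
pagerank = [0, 3, 5, 0, 6, 4]
ebm = [2.0, 4.5, 5.0, 6.0, 7.5, 3.0]

r_all = pearson_r(pagerank, ebm)

# Same comparison after dropping sites whose PageRank is 0.
kept = [(p, e) for p, e in zip(pagerank, ebm) if p > 0]
r_nonzero = pearson_r([p for p, _ in kept], [e for _, e in kept])
```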

17
Expt4: Building a health portal using FC
  • Aim: To build a high quality health portal
    automatically, using FC techniques
  • Details:
  • Relevance scores for links are predicted using
    the decision tree found in Expt2. Relevance
    scores are transformed into probability scores
    using the Laplace correction formula
  • We found that machine learning didn't work well
    for predicting quality, but RF helps.
  • The quality of a target page is predicted using
    the mean of the quality scores of all its known
    (visited) source pages
  • Combination of relevance and quality scores: The
    product of the relevance score and the quality
    score is used to determine crawling priority
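The prioritisation steps above can be sketched as follows. This assumes the usual Laplace correction for a decision-tree leaf, (k + 1) / (n + C) with C = 2 classes, and quality scores on a 0 to 1 scale. The slides give neither detail, and the class and function names are hypothetical.

```python
import heapq

def laplace_probability(n_relevant, n_total, n_classes=2):
    """Laplace-corrected probability that a link reaching this leaf is relevant."""
    return (n_relevant + 1) / (n_total + n_classes)

def predicted_quality(source_scores):
    """Mean quality of the known (visited) pages that link to the target."""
    return sum(source_scores) / len(source_scores) if source_scores else 0.0

def crawl_priority(leaf_relevant, leaf_total, source_scores):
    """Product of relevance probability and predicted quality."""
    return laplace_probability(leaf_relevant, leaf_total) * predicted_quality(source_scores)

class Frontier:
    """Crawl frontier that always yields the highest-priority URL next."""

    def __init__(self):
        self._heap = []

    def push(self, url, priority):
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority
```

Because the scores are multiplied, a link with a near-zero value on either dimension drops to the back of the queue, which is one way to read the "balance" the slides aim for.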

18
Expt4: Results. Quality scores
Three crawls were built: BF, Relevance and Quality
19
Expt4: Results. Below Average Quality (BAQ)
pages in each crawl
20
Expt4: Findings
  • RF is a good technique to be used in predicting
    quality of web pages based on the quality of
    known source pages.
  • Quality is an important measure in health search
    because a lot of relevant information is of poor
    quality (e.g. the relevance crawler)
  • Further analysis shows that quality of content
    might be further improved by post-filtering a
    very big BF crawl but at the cost of
    substantially increased network traffic.

21
Issues for discussion
  • Combination of scores
  • Untrusted sites
  • Quality evaluation
  • Relevance threshold choice
  • Coverage
  • Combination of quality indicators
  • RF vs Machine learning

22
Issue: Combination of scores
  • The decision to multiply the relevance and
    quality scores was arbitrary. The idea
    was to keep a balance between relevance and
    quality, to make sure both quality and coverage
    are maintained.
  • Question: Would addition (or another linear
    combination) be a better way to calculate this
    score? Or should only the quality score
    be considered? In general, how should relevance
    and quality scores be combined?

23
Issue: Untrusted sites
  • RF was used for predicting high quality, but
    analysis showed that low quality health sites are
    often untrusted sites, such as commercial sites,
    chat sites, forums, bulletins and message boards.
    Our results don't seem to exclude some of these
    sites.
  • Question: Is it feasible to use RF somehow, or
    any other means, to detect these sources? How
    should that be incorporated into the crawler?

24
Issue: Quality evaluation experiment
  • Expensive, because manual evaluation of quality
    requires a lot of expert knowledge and effort. To
    know the quality of a site, we have to judge all
    the pages of that site.
  • Question: How to design a cheaper but effective
    evaluation experiment for quality? Can lay
    judgment of quality be used somehow?

25
Issue: Relevance threshold choice
  • A relevance classifier was built to help reduce
    the relevance judging effort. A cut-off point for
    the relevance score needs to be identified. The
    classifier was run on 2000 pre-judged documents,
    half of which are relevant. I chose the cut-off
    threshold as the score at which the total number of
    false positives and false negatives is minimised.
  • Question: Is this a reasonable way to decide a
    relevance threshold? Any alternative?
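The threshold rule described above can be sketched as a simple search over candidate cut-offs. The function name, the "relevant when score >= cut-off" convention and the tie-breaking (lowest cut-off wins) are assumptions. The brute-force loop is quadratic in the number of documents, which should be acceptable for about 2000 of them.

```python
def best_cutoff(scores, labels):
    """Return (cutoff, errors) minimising false positives + false negatives.

    scores: classifier relevance scores for pre-judged documents.
    labels: True where the human judgment is "relevant".
    A document is classified relevant when its score >= cutoff.
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal length")
    candidates = sorted(set(scores))
    candidates.append(candidates[-1] + 1.0)  # a cut-off that rejects everything
    best = None
    for c in candidates:
        fp = sum(1 for s, y in zip(scores, labels) if s >= c and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < c and y)
        if best is None or fp + fn < best[1]:
            best = (c, fp + fn)
    return best
```

An alternative criterion, such as weighting false negatives more heavily when coverage matters most, would only change the quantity minimised inside the loop.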

26
Issue: Coverage
  • The FC may not explore all directions of the
    Web, resulting in low coverage. It's important
    to know how much of the high quality Web
    content the FC can index.
  • Question:
  • How to design an experiment that evaluates
    coverage? (How to measure recall?)

27
Issue: Combination of quality indicators
  • Health experts have identified several quality
    indicators that may help in the evaluation of
    quality, such as content currency, authoring
    information, information about disclosure, etc.
  • Question: How can/should these indicators be used
    in my work to predict quality?

28
Issue: RF vs Machine Learning
  • Compared to RF, ML has the flexibility of adding
    more features, such as an inherited quality score
    (from source pages), to the learning process to
    predict the quality of the results.
  • However, we initially tried ML to predict
    quality but found that RF is much better. Maybe
    because we didn't do it right!?
  • Question: Could ML be used in a similar way to
    RF? Does it promise better results?

29
Future work
  • Better combination of quality and relevance
    scores to improve quality
  • Incorporate a quality dimension into the ranking
    of health search results (create something similar
    to BM25 that incorporates a quality measure?)
  • Move to another topic in the health domain or an
    entirely new topic?
  • Combine heuristics and other medical quality
    indicators with RF?

30
Suggestions
  • Any more suggestions to improve my work?
  • Any more suggestions for future work?
  • Other suggestions?
  • The end!