A Quality Focused Crawler for Health Information

1

A Quality Focused Crawler for Health Information
Tim Tang

2
Outline

Overview
Contributions
Experiments and results
Issues for discussion
Future work
Questions and suggestions?

3
Overview

Many people use the Internet to search for health
information.
However, health web pages may contain low-quality
information, which may put users at personal
risk (example).
It is important to find ways to evaluate the
quality of health websites and to provide
high-quality results in health search.

4
Motivation

Web users can search for health information using
general search engines or domain-specific engines
such as health portals.
79% of Web users in the U.S. search for health
information on the Internet (Fox S. Health Info
Online, 2005).
No technique is available for measuring the
quality of Web health search results.
Nor is there a method for automatically
enhancing the quality of health search results.
Therefore, people building a high-quality health
portal have to do it manually and, without work
on measurement, we can't tell how good a job they
are doing.
An example of such a health portal is BluePages
Search, developed by the ANU's Centre for Mental
Health Research.

5
BluePages Search (BPS)
6
BPS result list
7
Research Objectives

To produce a health portal search that:
Is built automatically, saving time, effort, and
expert knowledge (cost saving).
Contains (only) high-quality information in the
index by applying quality criteria.
Satisfies users' demand for good advice
(evidence-based medicine) about specific health
topics from the Internet.

8
Contributions

New and effective quality indicators for health
websites using some IR-related techniques
Techniques to automate the manual quality
assessment of health websites
Techniques to automate the process of building
high quality health search engines

9
Expt1: General vs. domain-specific search engines

Aim: To compare the performance of general search
engines (Google, GoogleD) and domain-specific
engines (BPS) for domain relevance and quality.
Details: 100 depression queries are run in these
engines. The top 10 results for each query from
each engine are evaluated.
Results: next slide.

10
Expt1 Results
Engine    Relevance (Mean MAP)   Relevance (NDCG)   Quality score
GoogleD   0.407                  0.609              78
BPS       0.319                  0.553              127
Google    0.195                  0.349              28

MAP = Modified Average Precision
NDCG = Normalised Discounted Cumulative Gain
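
As a reminder of what the NDCG column measures, here is a minimal sketch. The slides do not say which gain and discount variant was used, so the log2 discount, the graded relevance labels, and the choice to build the ideal ordering from the same result list are all assumptions made for illustration.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: the gain at rank i (1-based) is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains, k=10):
    """DCG of the top-k results divided by the DCG of the best possible ordering.

    Assumption: the ideal ordering is built from the same judged list,
    not from a separate pool of all judged documents.
    """
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments for one query's ranked results (higher = more relevant).
perfect_ranking = [3, 2, 1, 0]
reversed_ranking = [0, 1, 2, 3]
print(ndcg(perfect_ranking), ndcg(reversed_ranking))
```

A list that is already in the best order scores 1.0, and any list that places relevant results lower scores less, which is why NDCG rewards engines that put relevant pages near the top of the top 10.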
11
Expt1 Findings

Findings: GoogleD retrieves more relevant
pages but fewer high-quality pages than
BPS. Domain-specific engines (BPS) have poor
coverage (causing worse performance in
relevance).
What next: How to improve coverage for
domain-specific engines? How to automate the
process of constructing a domain-specific engine?

12
Expt2: Prospect of focused crawling in building
domain-specific engines

Aim: To investigate the prospect of using
focused crawling (FC) techniques to build health
portals. In particular:
Seed list: BPS uses a seed list (start list for a
crawl) that was manually selected by experts in
the field. Can we automate this process?
Relevance of outgoing links: Is it feasible to
follow outgoing links from the currently crawled
pages to obtain more relevant links?
Link prediction: Can we successfully predict
relevant links from available link information?

13
Expt2 Results and Findings

Out of 227 URLs from DMOZ, 186 were relevant
(81%).
=> DMOZ provides a good starting list of URLs for
an FC.
An unrestricted crawler starting from the BPS
crawl can reach 25.3% more known relevant pages
in a single step from the currently crawled
pages.
=> Outgoing links from a constrained crawl lead
to additional relevant content.
The C4.5 decision tree learning algorithm can
predict link relevance with a precision of 88.15%.
=> A decision tree built on features such as
anchor text, URL words and link anchor context
can help a focused crawler obtain new relevant
pages.

14
Expt3: Automatic evaluation of websites

Aim: To investigate whether the Relevance Feedback
(RF) technique can help in the automatic
evaluation of health websites.
Details: RF is used to learn terms (words and
phrases) representing high-quality documents and
their weights. This weighted query is then
compared with the text of web pages to find the
degree of similarity. We call this the Automatic
Quality Tool (AQT).
Findings: Significant correlation was found
between human-rated (EBM) results and AQT
results.
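
The comparison of a weighted query with page text can be sketched as follows. This is not the actual AQT: the slides do not name the similarity measure, so cosine similarity over raw term counts is an assumption, phrases are left out for brevity, and the terms and weights in the example query are hypothetical.

```python
import math
import re
from collections import Counter

def similarity_score(weighted_query, page_text):
    """Cosine similarity between a weighted term query and a page's term counts."""
    tokens = re.findall(r"[a-z]+", page_text.lower())
    tf = Counter(tokens)  # a missing term counts as 0
    dot = sum(weight * tf[term] for term, weight in weighted_query.items())
    q_norm = math.sqrt(sum(w * w for w in weighted_query.values()))
    d_norm = math.sqrt(sum(c * c for c in tf.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)

# Hypothetical weighted query, as RF might learn it from high-quality documents.
query = {"cognitive": 2.0, "therapy": 1.5, "evidence": 1.0}

page_a = "Evidence shows cognitive behaviour therapy is effective."
page_b = "Buy our miracle supplement today."
print(similarity_score(query, page_a), similarity_score(query, page_b))
```

A page that shares heavily weighted terms with the query scores higher than one that shares none, which is the property a quality score of this kind relies on.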

15
Expt3 Results: Correlation between AQT score
and EBM score
16
Expt3 Results: Correlation between Google
PageRank and EBM score

Correlation: small and non-significant
(r = 0.23, P = 0.22, n = 30).
Excluding sites with a PageRank of 0, we obtained
a better correlation, but still significantly lower
than the correlation between AQT and EBM.

17
Expt4: Building a health portal using FC

Aim: To build a high-quality health portal
automatically, using FC techniques.
Details:
Relevance scores for links are predicted using
the decision tree found in Expt2. Relevance
scores are transformed into probability scores
using the Laplace correction formula.
We found that machine learning didn't work well
for predicting quality, but RF helps.
Quality of target pages is predicted using the
mean of the quality scores of all the known
(visited) source pages.
Combination of relevance and quality scores: The
product of the relevance score and the quality
score is used to determine crawling priority.
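
The scoring steps above can be sketched as a small priority-queue frontier. The Laplace correction is shown in its common two-class form, (k + 1) / (n + 2), which is assumed to be the formula meant here; the class name, function names, URLs and numbers are hypothetical.

```python
import heapq
from statistics import mean

def laplace_probability(positives, total, classes=2):
    """Laplace-corrected probability for a decision-tree leaf that holds
    `positives` relevant training examples out of `total`."""
    return (positives + 1) / (total + classes)

def predicted_quality(source_scores):
    """Quality of an unvisited target: mean quality of its visited source pages."""
    return mean(source_scores)

def crawl_priority(relevance_prob, quality):
    """Crawling priority is the product of the relevance and quality scores."""
    return relevance_prob * quality

class Frontier:
    """Crawl frontier that always pops the highest-priority URL next."""

    def __init__(self):
        self._heap = []

    def push(self, url, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority

frontier = Frontier()
# Hypothetical links: (leaf counts from the tree, quality scores of source pages).
frontier.push("http://example.org/a",
              crawl_priority(laplace_probability(8, 10), predicted_quality([0.6, 0.8])))
frontier.push("http://example.org/b",
              crawl_priority(laplace_probability(2, 10), predicted_quality([0.9])))
print(frontier.pop())
```

With a product, a link needs both a reasonable relevance probability and a reasonable predicted quality to be crawled early; a near-zero value on either side pushes it to the back of the queue.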

18
Expt4 Results: Quality scores
Three crawls were built: BF, Relevance and Quality.
19
Expt4 Results: Below Average Quality (BAQ)
pages in each crawl
20
Expt4 Findings

RF is a good technique for predicting the
quality of web pages based on the quality of
known source pages.
Quality is an important measure in health search
because a lot of relevant information is of poor
quality (e.g. in the relevance crawl).
Further analysis shows that quality of content
might be further improved by post-filtering a
very big BF crawl, but at the cost of
substantially increased network traffic.

21
Issues for discussion

Combination of scores
Untrusted sites
Quality evaluation
Relevance threshold choice
Coverage
Combination of quality indicators
RF vs Machine learning

22
Issue: Combination of scores

The decision to multiply the relevance and
quality scores was arbitrary; the idea
was to keep a balance between relevance and
quality, to make sure both quality and coverage
are maintained.
Question: Would addition (or another linear
combination) be a better way to calculate this
score? Or should only the quality score
be considered? In general, how should
relevance and quality scores be combined?

23
Issue: Untrusted sites

RF was used for predicting high quality, but
analysis showed that low-quality health sites are
often untrusted sites, such as commercial sites,
chat sites, forums, bulletins and message boards.
Our results don't seem to exclude some of these
sites.
Question: Is it feasible to use RF somehow, or
any other means, to detect these sources? How
should that be incorporated into the crawler?

24
Issue: Quality evaluation experiment

The experiment is expensive because manual
evaluation of quality requires a lot of expert
knowledge and effort. To know the quality of a
site, we have to judge all the pages of that site.
Question: How to design a cheaper but effective
evaluation experiment for quality? Can lay
judgment of quality be used somehow?

25
Issue: Relevance threshold choice

A relevance classifier was built to help reduce
the relevance judging effort. A cut-off point for
the relevance score needs to be identified. The
classifier was run on 2000 pre-judged documents,
half of which are relevant. I chose the cut-off
threshold as the score at which the total number
of false positives and false negatives is minimised.
Question: Is this a reasonable way to decide a
relevance threshold? Any alternative?
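
The cut-off rule described above can be sketched as an exhaustive search over candidate thresholds. The function name, the "relevant if score >= threshold" convention, and the toy scores and labels are assumptions for illustration, not the classifier's real output.

```python
def best_threshold(scores, labels):
    """Return (threshold, errors) minimising false positives + false negatives.

    A document is predicted relevant when its score is >= threshold.
    Candidates are the observed scores plus +inf (predict nothing relevant).
    """
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        if best is None or fp + fn < best[1]:
            best = (t, fp + fn)  # ties keep the lowest threshold
    return best

# Toy classifier scores with their human relevance judgments (1 = relevant).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(best_threshold(scores, labels))
```

Minimising false positives plus false negatives is the same as maximising accuracy on the judged set. Because that set is balanced (half relevant), the rule weights both kinds of error equally; a cost-weighted sum would be the natural alternative if one kind of error matters more.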

26
Issue: Coverage

The FC may not explore all the directions of the
Web, resulting in low coverage. It's important
to know how much of the high-quality Web
content the FC can index.
Question:
How to design an experiment that evaluates
coverage? (How to measure recall?)

27
Issue: Combination of quality indicators

Health experts have identified several quality
indicators that may help in the evaluation of
quality, such as content currency, authoring
information, information about disclosure, etc.
Question: How can/should these indicators be used
in my work to predict quality?

28
Issue: RF vs Machine Learning

Compared to RF, ML has the flexibility of adding
more features, such as an inherited quality score
(from source pages), into the learning process to
predict the quality of the results.
However, we initially tried ML to predict
quality and found that RF is much better. Maybe
because we didn't do it right!?
Question: Could ML be used in a similar way to
RF? Does ML promise better
results?

29
Future work

Better combination of quality and relevance
scores to improve quality.
Include a quality dimension in the ranking of health
search results (create something similar to BM25
that incorporates a quality measure?).
Move to another topic in the health domain or an
entirely new topic?
Combine heuristics and other medical quality
indicators with RF?

30
Suggestions