Title: A Quality-Focused Crawler for Health Information
1. A Quality-Focused Crawler for Health Information
- Tim Tang
2. Outline
- Overview
- Contributions
- Experiments and results
- Issues for discussion
- Future work
- Questions and suggestions
3. Overview
- Many people use the Internet to search for health information.
- But health web pages may contain low-quality information, which can lead to personal endangerment. (example)
- It is therefore important to find ways to evaluate the quality of health websites and to return high-quality results in health search.
4. Motivation
- Web users can search for health information using general engines or domain-specific engines such as health portals.
- 79% of Web users in the U.S. search for health information on the Internet (Fox S., Health Info Online, 2005).
- No technique is available for measuring the quality of Web health search results.
- There is also no method for automatically enhancing the quality of health search results.
- Therefore, people building a high-quality health portal have to do it manually, and without work on measurement we can't tell how good a job they are doing.
- An example of such a health portal is BluePages Search, developed by the ANU's Centre for Mental Health Research.
5. BluePages Search (BPS)
6. BPS result list
7. Research Objectives
- To produce a health portal search that:
  - Is built automatically, to save time, effort, and expert knowledge (cost saving).
  - Contains (only) high-quality information in the index, by applying quality criteria.
  - Satisfies users' demand for good advice (evidence-based medicine) about specific health topics from the Internet.
8. Contributions
- New and effective quality indicators for health websites, using IR-related techniques.
- Techniques to automate the manual quality assessment of health websites.
- Techniques to automate the process of building high-quality health search engines.
9. Expt 1: General vs. domain-specific search engines
- Aim: to compare the performance of general search engines (Google, GoogleD) and a domain-specific engine (BPS) on domain relevance and quality.
- Details: 100 depression queries were run on each engine; the top 10 results for each query from each engine were evaluated.
- Results: next slide.
10. Expt 1: Results

Engine    Relevance (Mean MAP)   Relevance (NDCG)   Quality Score
GoogleD   0.407                  0.609              78
BPS       0.319                  0.553              127
Google    0.195                  0.349              28

MAP = Modified Average Precision; NDCG = Normalised Discounted Cumulative Gain.
11. Expt 1: Findings
- Findings: GoogleD retrieves more relevant pages but fewer high-quality pages than BPS. Domain-specific engines (BPS) have poor coverage, which hurts their relevance performance.
- What next: how can we improve coverage for domain-specific engines? How can we automate the process of constructing a domain-specific engine?
12. Expt 2: Prospect of focused crawling in building domain-specific engines
- Aim: to investigate the prospect of using focused crawling (FC) techniques to build health portals. In particular:
  - Seed list: BPS uses a seed list (the start list for a crawl) that was manually selected by experts in the field. Can we automate this process?
  - Relevance of outgoing links: is it feasible to follow outgoing links from the currently crawled pages to obtain more relevant links?
  - Link prediction: can we successfully predict relevant links from the available link information?
13. Expt 2: Results and Findings
- Out of 227 URLs from DMOZ, 186 were relevant (81%).
  => DMOZ provides a good starting list of URLs for a FC.
- An unrestricted crawler starting from the BPS crawl can reach 25.3% more known relevant pages in a single step from the currently crawled pages.
  => Outgoing links from a constrained crawl lead to additional relevant content.
- The C4.5 machine-learning algorithm (a decision tree) can predict link relevance with a precision of 88.15%.
  => A decision tree built from features such as anchor text, URL words, and link anchor context can help a focused crawler obtain new relevant pages.
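The link-prediction step above can be sketched as follows. This is a minimal illustration, not the actual trained model: the three feature groups (anchor text, URL words, anchor context) follow the slide, but the keyword list is invented and the hand-written rules merely stand in for a learned C4.5 tree.

```python
import re

# Hypothetical topic keywords; the real crawler learns its model from judged data.
TOPIC_WORDS = {"depression", "mental", "health", "treatment", "therapy"}

def link_features(anchor_text, url, anchor_context):
    """Extract the three feature groups named on the slide:
    anchor text, URL words, and link anchor context."""
    tokens = lambda s: set(re.findall(r"[a-z]+", s.lower()))
    return {
        "anchor_hits": len(tokens(anchor_text) & TOPIC_WORDS),
        "url_hits": len(tokens(url) & TOPIC_WORDS),
        "context_hits": len(tokens(anchor_context) & TOPIC_WORDS),
    }

def predict_relevant(feats):
    """Hand-written stand-in for the learned C4.5 decision tree."""
    if feats["anchor_hits"] > 0:
        return True
    return feats["url_hits"] > 0 and feats["context_hits"] > 0

f = link_features("Treating depression", "http://example.org/sport", "football scores")
print(predict_relevant(f))  # anchor text mentions the topic -> True
```

In the real system the tree itself is induced from labelled links, so the split conditions come from the data rather than being written by hand.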
14. Expt 3: Automatic evaluation of websites
- Aim: to investigate whether the Relevance Feedback (RF) technique can help in the automatic evaluation of health websites.
- Details: RF is used to learn the terms (words and phrases) that represent high-quality documents, together with their weights. The resulting weighted query is then compared with the text of web pages to compute a degree of similarity. We call this the Automatic Quality Tool (AQT).
- Findings: a significant correlation was found between human-rated (EBM) results and AQT results.
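The AQT scoring step can be sketched as follows, assuming a cosine-similarity match between the RF-learned weighted query and a page's term counts. The query terms and weights here are invented for illustration; the real tool learns them from known high-quality documents.

```python
import math
import re

# Hypothetical RF-learned weighted query (term -> weight); illustrative only.
WEIGHTED_QUERY = {"evidence": 2.0, "treatment": 1.5, "clinical": 1.2, "depression": 1.0}

def aqt_score(page_text):
    """Cosine similarity between the weighted query and the page's term counts."""
    counts = {}
    for tok in re.findall(r"[a-z]+", page_text.lower()):
        counts[tok] = counts.get(tok, 0) + 1
    dot = sum(w * counts.get(t, 0) for t, w in WEIGHTED_QUERY.items())
    q_norm = math.sqrt(sum(w * w for w in WEIGHTED_QUERY.values()))
    p_norm = math.sqrt(sum(c * c for c in counts.values()))
    return dot / (q_norm * p_norm) if p_norm else 0.0

good = aqt_score("Clinical evidence supports this treatment for depression.")
bad = aqt_score("Buy our miracle cure now, limited offer!")
print(good > bad)  # True: the evidence-based page matches the quality query
```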
15. Expt 3 Results: Correlation between AQT score and EBM score
16. Expt 3 Results: Correlation between Google PageRank and EBM score
- Correlation: small and non-significant.
  - r = 0.23, P = 0.22, n = 30
- Excluding sites with a PageRank of 0, we obtained a better correlation, but still significantly lower than the correlation between AQT and EBM.
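The r values on these slides are Pearson correlation coefficients; a minimal implementation for checking such numbers (the sample data below are made up, not the study's):

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative scores only, not the study's data.
aqt = [0.2, 0.4, 0.5, 0.7, 0.9]
ebm = [1.0, 2.0, 2.0, 3.0, 4.0]
print(round(pearson_r(aqt, ebm), 3))
```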
17. Expt 4: Building a health portal using FC
- Aim: to build a high-quality health portal automatically, using FC techniques.
- Details:
  - Relevance scores for links are predicted using the decision tree from Expt 2, then transformed into probability scores using the Laplace correction formula.
  - We found that machine learning didn't work well for predicting quality, but RF helps.
  - The quality of a target page is predicted as the mean of the quality scores of all its known (visited) source pages.
  - Combination of relevance and quality scores: the product of the two is used to determine crawling priority.
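The scoring pipeline above can be sketched as follows. The leaf counts, quality values, and URLs are invented, and the Laplace correction is assumed to be the standard add-one form (n_relevant + 1) / (n_total + 2); the real crawler's details may differ.

```python
import heapq

def laplace_probability(n_relevant, n_total):
    """Laplace (add-one) correction: turns a decision-tree leaf's raw
    relevant/total counts into a smoothed probability estimate."""
    return (n_relevant + 1) / (n_total + 2)

def target_quality(source_quality_scores):
    """Predicted quality of a target page: mean quality of its known sources."""
    return sum(source_quality_scores) / len(source_quality_scores)

def crawl_priority(n_relevant, n_total, source_quality_scores):
    """Priority = relevance probability x predicted quality."""
    return laplace_probability(n_relevant, n_total) * target_quality(source_quality_scores)

# Frontier as a max-priority queue (heapq is a min-heap, so priorities are negated).
frontier = []
heapq.heappush(frontier, (-crawl_priority(45, 50, [0.8, 0.6]), "http://example.org/a"))
heapq.heappush(frontier, (-crawl_priority(5, 50, [0.9]), "http://example.org/b"))
print(heapq.heappop(frontier)[1])  # the higher-priority link is crawled first
```

Note that with zero observations the Laplace estimate is 0.5, so unseen leaves neither dominate nor vanish from the frontier.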
18. Expt 4 Results: Quality scores
- Three crawls were built: BF, Relevance, and Quality.
19. Expt 4 Results: Below-Average-Quality (BAQ) pages in each crawl
20. Expt 4: Findings
- RF is a good technique for predicting the quality of web pages based on the quality of known source pages.
- Quality is an important measure in health search because a lot of relevant information is of poor quality (e.g. the relevance crawler).
- Further analysis shows that content quality might be further improved by post-filtering a very big BF crawl, but at the cost of substantially increased network traffic.
21. Issues for discussion
- Combination of scores
- Untrusted sites
- Quality evaluation
- Relevance threshold choice
- Coverage
- Combination of quality indicators
- RF vs Machine learning
22. Issue: Combination of scores
- The decision to multiply the relevance and quality scores was made arbitrarily; the idea was to keep a balance between relevance and quality, so that both quality and coverage are maintained.
- Question: would addition (or another linear combination) be a better way to calculate this score? Or should only the quality score be considered? In general, how should relevance and quality scores be combined?
23. Issue: Untrusted sites
- RF was used for predicting high quality, but analysis showed that low-quality health sites are often untrusted sites, such as commercial sites, chat sites, forums, bulletins, and message boards. Our results don't seem to exclude some of these sites.
- Question: is it feasible to use RF, or any other means, to detect these sources? How should that be incorporated into the crawler?
24. Issue: Quality evaluation experiment
- Manual evaluation for quality is expensive because it requires a lot of expert knowledge and effort. To know the quality of a site, we have to judge all the pages of that site.
- Question: how can we design a cheaper but effective evaluation experiment for quality? Can lay judgments of quality be used somehow?
25. Issue: Relevance threshold choice
- A relevance classifier was built to help reduce the relevance-judging effort, so a cut-off point for the relevance score needs to be identified. The classifier was run on 2,000 pre-judged documents, half of which are relevant. I chose the cut-off threshold as the score at which the total number of false positives and false negatives is minimised.
- Question: is this a reasonable way to decide a relevance threshold? Are there alternatives?
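The cut-off selection described above can be sketched as follows (toy scores and judgements, not the 2,000-document set):

```python
def best_threshold(scored_docs):
    """Pick the relevance-score cut-off that minimises the total number of
    false positives + false negatives on pre-judged documents.

    scored_docs: list of (score, is_relevant) pairs.
    """
    candidates = sorted({s for s, _ in scored_docs})
    best_t, best_errors = None, None
    for t in candidates:
        fp = sum(1 for s, rel in scored_docs if s >= t and not rel)
        fn = sum(1 for s, rel in scored_docs if s < t and rel)
        if best_errors is None or fp + fn < best_errors:
            best_t, best_errors = t, fp + fn
    return best_t, best_errors

# Toy pre-judged documents: (classifier score, human relevance judgement).
docs = [(0.1, False), (0.2, False), (0.4, True), (0.35, False), (0.6, True), (0.8, True)]
t, errors = best_threshold(docs)
print(t, errors)  # 0.4 0
```

One alternative worth discussing is weighting the two error types differently (e.g. penalising false positives more, since they admit irrelevant pages into the index).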
26. Issue: Coverage
- The FC may not explore all directions of the Web, resulting in low coverage. It's important to know how much of the high-quality Web the FC can index.
- Question: how can we design an experiment that evaluates coverage? (How do we measure recall?)
27. Issue: Combination of quality indicators
- Health experts have identified several quality indicators that may help in the evaluation of quality, such as content currency, authoring information, and disclosure information.
- Question: how can/should these indicators be used in my work to predict quality?
28. Issue: RF vs. machine learning
- Compared with RF, ML has the flexibility of adding more features, such as an inherited quality score (from source pages), into the learning process to predict the quality of the results.
- However, we tried ML initially to predict quality and found that RF is much better. Maybe because we didn't do it right!?
- Question: could ML be used in a similar way to RF? Does it promise better results?
29. Future work
- A better combination of quality and relevance scores, to improve quality.
- Involve the quality dimension in the ranking of health search results (create something similar to BM25, incorporating a quality measure?).
- Move to another topic in the health domain, or an entirely new topic?
- Combine heuristics and other medical quality indicators with RF?
30. Suggestions
- Any more suggestions to improve my work?
- Any more suggestions for future work?
- Other suggestions?
- The end!