Introduction to MedIEQ - PowerPoint PPT Presentation

Slides: 22
Provided by: Janne75
Transcript and Presenter's Notes

1
Introduction to MedIEQ
  • Quality Labelling of Medical Web content using
    Multilingual Information Extraction
  • http://zeus.iit.demokritos.gr/medieq

Martin Labský, labsky_at_vse.cz
Knowledge Engineering Group (KEG), University of Economics,
Prague (UEP)
2
Purpose of MedIEQ
  • Medical web sites are increasingly popular
  • Content strongly affects users' decisions
  • Therefore, quality labeling is very important
  • Agencies invest large effort into labeling
    websites manually
  • We develop tools to minimize their effort
  • Tools will be multilingual and will support
    different and evolving labeling criteria

3
Agenda
  • Partners
  • Description of the 3 relevant work packages:
  • Web content collection, Information Extraction,
    Lexical and semantic resources
  • Goals, tasks, partners
  • Existing tools (to be extended)
  • New tools (to be developed)
  • Existing resources (to be made accessible)
  • Milestones and deliverables
  • References
  • Questions

4
Partners
  • Agencies
  • WMA: Web Médica Acreditada (Es)
  • assigns a quality label that is shown on medical
    websites
  • websites ask for the label, are suggested
    changes, then get it
  • AQuMed: Agency for Quality Labeling in medicine
    (De)
  • maintains a web directory organized by topics
  • only good-quality websites are present
  • Developers
  • NCSR Demokritos and I-Sieve (spin-off) (Gr)
  • UEP: University of Economics Prague (Cz)
  • UNED: National University of Distance Education
    (Es)
  • HUT: Helsinki University of Technology (Fi)

5
Web Content Collection (WP5)
6
Website monitoring
  • Regular visits to labeled website
  • Checking pages
  • for relevant changes
  • which changes are relevant?
  • manual rules, machine learning...
  • alert agency when significant changes occur
  • or increase the website's (web page's) priority
    in a list of to-be-checked resources
  • show what has changed, suggest solution
  • Needed by WMA, AQuMed
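
The "which changes are relevant?" question on this slide could start from a simple manual rule. The sketch below is only an illustration, not the project's implementation: it compares two stored versions of a page, ignores whitespace and case differences, and flags a change when more than a chosen share of the words differs. The function names and the 0.2 threshold are assumptions made for this example.

```python
import difflib
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase so cosmetic edits are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(text):
    """Hash of the normalized text; cheap to store for every monitored page."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def change_ratio(old, new):
    """Share of word-level content that differs between two versions, in [0, 1]."""
    matcher = difflib.SequenceMatcher(
        None, normalize(old).split(), normalize(new).split()
    )
    return 1.0 - matcher.ratio()


def is_relevant_change(old, new, threshold=0.2):
    """Hypothetical manual rule: alert when more than `threshold` of the words changed."""
    if fingerprint(old) == fingerprint(new):
        return False  # identical after normalization, nothing to report
    return change_ratio(old, new) > threshold
```

A learned model could later replace the fixed threshold, for example by weighting changes in labeled parts of the page (contact, sponsors) more heavily than changes elsewhere.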

7
Web focused crawling
  • Find new medical websites
  • Use multiple existing search engines
  • specify lists of keywords / keyphrases
  • give sample similar documents
  • use Google/Yahoo API and filter their results
  • NCSR already has a focused crawler
  • we should contribute to its development
  • Needed by WMA
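
The "filter their results" step can be pictured as a keyphrase filter over whatever a search engine returns. The sketch below assumes the results have already been fetched; it does not call any real search API, and `SearchResult`, `keyphrase_score`, `filter_results`, and the `min_hits` parameter are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    url: str
    snippet: str


def keyphrase_score(text, keyphrases):
    """Count how many distinct keyphrases occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for phrase in keyphrases if phrase.lower() in lowered)


def filter_results(results, keyphrases, min_hits=2):
    """Keep results whose snippet mentions at least `min_hits` keyphrases, best first."""
    scored = [(keyphrase_score(r.snippet, keyphrases), r) for r in results]
    kept = [(score, r) for score, r in scored if score >= min_hits]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in kept]
```

Substring matching is deliberately crude; a focused crawler would more likely score the full page text with a trained classifier.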

8
Website spidering
  • Walk pages of a single website
  • Classify each page
  • in order to choose relevant docs for quality
    labeling
  • e.g. contact page, page containing treatment
    description, page with sponsors
  • use machine learning, e.g. based on a
    bag-of-words (unigram, bigram) document
    representation
  • Spidering strategy
  • which documents belong together (e.g. page 1/7)
  • which links to follow next
  • NCSR has a spider
  • uses classifiers from Weka for doc classification
  • we should contribute
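
Page classification over a bag-of-words (unigram, bigram) representation can be sketched as follows. The slide says the NCSR spider uses classifiers from Weka; the code below is not that implementation but a small stand-in, a multinomial Naive Bayes with add-one smoothing written in plain Python. The class name and the page labels are assumptions for this example.

```python
import math
import re
from collections import Counter, defaultdict


def features(text):
    """Bag-of-words features: unigrams plus adjacent-word bigrams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams


class NaiveBayesPageClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative stand-in)."""

    def __init__(self):
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.doc_counts = Counter()                 # label -> number of training pages
        self.vocabulary = set()

    def train(self, labeled_pages):
        for text, label in labeled_pages:
            feats = features(text)
            self.feature_counts[label].update(feats)
            self.doc_counts[label] += 1
            self.vocabulary.update(feats)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.feature_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for feat in features(text):
                score += math.log((counts[feat] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on a few labeled pages per type (contact page, treatment description, sponsors page), `classify` returns the most likely page type, which the spider could use to decide which documents to pass on for quality labeling.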

9
Information Extraction (WP6)
10
IE introduction
  • Documents to extract from
  • pages retrieved and classified by the spider
  • from known websites
  • from crawler
  • monitored labeled pages that have changed
  • Information to be extracted
  • derived from agencies' labeling criteria
  • e.g. contact information of responsible persons,
    sponsor names, privacy warning texts...
  • Questions
  • how much human intervention needed?
  • complexity of label sets to be supported?
  • methodology of porting to a new language?

11
Example extracted information I.
  • Transparency and honesty
  • site provider (company name, contact)
  • site purpose, type of target audience
  • funding (grants, sponsors)
  • Authority
  • source citation for information provided, its
    type and date
  • names and credentials of all information
    providers
  • Privacy and data protection
  • privacy policy description
  • Timeliness of information
  • dates of publication/modification
  • Accountability
  • names (and roles) of people responsible for
    presented information
  • editorial policy description

12
Example extracted information II.
  • Content
  • medical terms, e.g. disease and drug names
  • statements recommending a certain product/method
  • advertisements
  • disallowed combinations (e.g. advertisement for X
    adjacent to an article related to X)
  • Formal
  • mandatory statements (e.g. importance of physical
    examination, privacy warnings when posting data
    into chats)

13
Sources of extraction knowledge
  • Training data
  • scarcity will be a problem for most extracted
    attributes
  • different types: labeled documents, sample
    extracted data, data previously extracted from
    the same website, domain dictionaries
  • Extraction patterns
  • induced (semi)automatically from scarce training
    data
  • or even authored manually
  • Background domain knowledge
  • relations between extracted attributes,
    cardinalities ...
  • e.g. typically just one company is the website's
    provider, but there are often multiple sponsors
  • Web site structure
  • exploit common formatting of a group of documents
    within a website
  • exploit common formatting used for a particular
    type of extracted data across different websites
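
Two of the knowledge sources above, manually authored extraction patterns and cardinality constraints, can be combined in a small sketch. Everything here is an assumption made for illustration: the regular expressions, the attribute names, and the cardinality table are hypothetical and are not taken from the Ex system or any MedIEQ tool.

```python
import re

# Hypothetical hand-written patterns; real ones would be tuned per language and site.
NAME = r"([A-Z][\w& -]+?)(?:[.,;]|$)"
PATTERNS = {
    "provider": re.compile(r"(?i:provided|published|operated) by " + NAME, re.M),
    "sponsor": re.compile(r"(?i:sponsored|funded) by " + NAME, re.M),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
}

# Background knowledge as (min, max) occurrences; None means unbounded.
CARDINALITY = {"provider": (1, 1), "sponsor": (0, None), "date": (0, None)}


def extract(text):
    """Apply every pattern and collect the captured values per attribute."""
    return {
        name: [m.group(1).strip() for m in pattern.finditer(text)]
        for name, pattern in PATTERNS.items()
    }


def cardinality_violations(extracted):
    """Return the attributes whose number of extracted values breaks the constraints."""
    problems = []
    for attribute, (low, high) in CARDINALITY.items():
        count = len(extracted.get(attribute, []))
        if count < low or (high is not None and count > high):
            problems.append(attribute)
    return problems
```

A violation such as two extracted providers does not prove the page is wrong; it is a signal that one of the extractions is probably spurious and deserves a second look.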

14
IE tools
  • Ex (UEP)
  • IE system under development using extraction
    ontologies
  • extracts instances from semi-structured documents
  • utilizes training data and manually defined
    patterns, includes spider
  • old version based on HMMs:
    http://eso.vse.cz/labsky/client/
  • Named entity recognizer (UNED)
  • extracts dates, person/institution names
  • 3rd party IE tools
  • wrapper management systems
  • e.g. LP2-based IE tool or annotation editor from
    Sheffield

15
Website assessment
  • Check website's technical correctness
  • SEO (findability in search engines with respect
    to some keyphrases)
  • accessibility (possibility of font enlargement,
    blind access, pages hidden deep in website
    structure, color schemes perceivable by anybody)
  • formal correctness (dead links, violations of
    HTML standards, failure to display well under at
    least the 3 most popular browsers)
  • Check non-technical correctness
  • e.g. typos, clear, easy-to-understand language
  • more: check for black-listed phrases, claims, etc.
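
A few of these checks can be run offline on a page's HTML. The sketch below is a minimal illustration using only Python's standard `html.parser`; it collects link targets (candidates for a later dead-link check), images without alternative text (one accessibility signal), and black-listed phrases. The `PageAudit` class, the `audit` function, and the sample blacklist are assumptions for this example, not part of Relaxed or any other tool named in the slides.

```python
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Collects link targets, images lacking alt text, and the page's text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images_without_alt = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and not attributes.get("alt"):
            self.images_without_alt.append(attributes.get("src", ""))

    def handle_data(self, data):
        self.text_parts.append(data)


def audit(html, blacklist):
    """Run the offline checks and return the findings as a dictionary."""
    parser = PageAudit()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.text_parts).split()).lower()
    return {
        "links": parser.links,
        "images_without_alt": parser.images_without_alt,
        "blacklisted": [p for p in blacklist if p.lower() in text],
    }
```

Dead-link detection would additionally request each collected URL and inspect the response status; that step is left out here because it needs network access.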

16
Website assessment tools
  • Relaxed (UEP)
  • HTML validator based on Relax NG and Schematron
    patterns
  • can perform formal checks of website content
    beyond DTDs
  • http://relaxed.sourceforge.net/
  • SEO tool (UEP)
  • could Honza's SEO tool be extended?

17
IE Deliverables
  • Duration: M1-M28
  • Deliverables
  • D8: Methodology and architecture of IE (M9)
  • D9.1: First version of IE toolkit (M15)
  • D9.2: Final version of IE toolkit (M24)

18
Lexical and semantic resources (WP7)
19
Lexical and semantic resources
  • Sp, De, En, Cz, Gr, Fi, Catalan (7!)
  • We are in charge of Cz, De(!)
  • Semantic
  • thesauri, ontologies (MESH)
  • lists of cures, vaccine names, lists of medical
    companies, illnesses, diagnoses
  • generic ontologies and translation dictionaries
    (e.g. EuroWordNet)
  • Lexical
  • lemmatizers/morphology analyzers, part-of-speech
    taggers, chunkers, syntactic parsers
  • medical document collections (for classification)

20
References
  • MedIEQ
  • http://www.iit.demokritos.gr/vangelis/MedIEQ/
  • http://zeus.iit.demokritos.gr/medieq
  • Related projects
  • WRAPIN: http://debussy.hon.ch/cgi-bin/Wrapin/ClientWrapin.pl
  • Quatro: http://www.quatro-project.org/DC2005.htm
  • CROSSMARC: http://www.iit.demokritos.gr/skel/crossmarc/
  • Relaxed
  • http://badame.vse.cz/validator/
  • Ex
  • http://eso.vse.cz/labsky/doc/ex.pdf
  • Ellogon
  • http://www.ellogon.org/

21
Questions
  • ?