Implementing Automatic Value Extraction from Structured Web Pages

Provided by: sup879

1
Implementing Automatic Value Extraction from
Structured Web Pages
  • Varun Ganapathi, Jonathan Pines, Josh Wiseman

2
Problem
  • Context
  • Many web pages are generated by applying a
    template to structured data
  • Goal
  • Given a set of pages generated from a template,
    infer the template.
  • Extract values from previously unseen pages
    generated from the template
  • Why?
  • The template encodes structure that usually has
    semantic meaning.
  • The structured values that back a page are all
    the important information in the page.
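
As a rough illustration of the context above, the sketch below renders two records through one shared page template. The template string, field names, and records are invented for this example and are not taken from the presentation.

```python
from string import Template

# Hypothetical page template: the fixed markup is shared by every page,
# and only the $-placeholders change from record to record.
PAGE_TEMPLATE = Template(
    "<html><body><h1>$title</h1><p>Rating: $rating</p></body></html>"
)

# Hypothetical structured data backing the pages.
records = [
    {"title": "Movie A", "rating": "6.8"},
    {"title": "Movie B", "rating": "7.4"},
]


def render_pages(template, rows):
    """Apply one template to each record, producing one page per record."""
    return [template.substitute(row) for row in rows]


pages = render_pages(PAGE_TEMPLATE, records)
```

The extraction problem runs this process in reverse: given only `pages`, recover something equivalent to `PAGE_TEMPLATE` and then read `records` back out of new pages.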

3
What is a Template?
  • A template is a special case of a context-free
    grammar
  • Tuples (fixed-length ordered lists)
  • Sets (arbitrary-length lists whose items are
    delimited by separators)
  • Example of Instantiated Template
  • <elem>Ethan Hunt comes face to face with a
    dangerous and </elem>
  • <elem>6.8</elem>
  • <set>
  • <tuple><elem>Tom Cruise</elem><elem>Ethan
    Hunt</elem></tuple>
  • <tuple><elem>Ving Rhames</elem><elem>Luther
    Strickell</elem></tuple>
  • </set>
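
One way to read the instantiated template above is as nested Python values: a tuple becomes a `tuple` and a set becomes a `list`. The sketch below is only an illustration. The function names and the synthetic `<root>` wrapper are assumptions of this example, not part of the presented system, and the truncated plot-summary element is left out.

```python
import xml.etree.ElementTree as ET


def to_value(node):
    """Convert an <elem>, <tuple>, or <set> node into a Python value."""
    if node.tag == "elem":
        return node.text or ""
    children = [to_value(child) for child in node]
    if node.tag == "tuple":
        return tuple(children)  # fixed-length, ordered
    if node.tag == "set":
        return children  # arbitrary-length list
    raise ValueError(f"unexpected tag: {node.tag}")


def parse_instance(markup):
    """Parse a sequence of top-level nodes into a list of values."""
    # The example has several top-level nodes, so this sketch wraps them
    # in a synthetic <root> element to obtain well-formed XML.
    root = ET.fromstring(f"<root>{markup}</root>")
    return [to_value(child) for child in root]


instance = (
    "<elem>6.8</elem>"
    "<set>"
    "<tuple><elem>Tom Cruise</elem><elem>Ethan Hunt</elem></tuple>"
    "<tuple><elem>Ving Rhames</elem><elem>Luther Strickell</elem></tuple>"
    "</set>"
)
values = parse_instance(instance)
```

Here `values` holds the rating string followed by a list of (actor, character) pairs.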

4
Learning Templates
  • Use the following observations
  • When tokens occur frequently together, it might
    be because they are derived from the same
    template
  • The strings derived from templates have certain
    properties
  • Ordered
  • Nested
  • Loop
  • Find equivalence classes of differentiated tokens
  • Extend the partial template
  • Differentiate tokens based on the partial template
  • Construct the template using patterns
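
The first observation above (tokens that occur together frequently probably come from the same template) can be sketched as grouping tokens by how often they occur on each page. The slides do not spell out the exact procedure, so the whitespace tokenization, the function names, and the `min_size` and `min_support` thresholds below are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def occurrence_vectors(pages):
    """Map each token to its tuple of per-page occurrence counts."""
    counts = [Counter(page.split()) for page in pages]
    vocabulary = set().union(*counts)
    return {tok: tuple(c[tok] for c in counts) for tok in vocabulary}


def equivalence_classes(pages, min_size=2, min_support=2):
    """Group tokens that share the same occurrence vector.

    Tokens seen on fewer than `min_support` pages are treated as data
    values rather than template tokens and are skipped.
    """
    groups = defaultdict(list)
    for tok, vec in occurrence_vectors(pages).items():
        support = sum(1 for n in vec if n > 0)
        if support >= min_support:
            groups[vec].append(tok)
    return {
        vec: sorted(toks)
        for vec, toks in groups.items()
        if len(toks) >= min_size
    }


# Hypothetical, pre-tokenized pages (tokens separated by spaces).
pages = [
    "<html> <b> Tom Cruise </b> <b> Ving Rhames </b> </html>",
    "<html> <b> Brad Pitt </b> </html>",
]
classes = equivalence_classes(pages)
```

In this toy input the page-level tags form one class and the repeated `<b>` pair forms another, which hints at a tuple nested inside a set. Differentiating tokens by their position in the partial template, as the later steps describe, is not covered by this sketch.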

5
Evaluation
  • We manually extracted interesting data from
    several IMDB movie pages.
  • <elem>Ethan Hunt comes face to face with a
    dangerous and </elem>
  • <elem>6.8</elem>
  • <set>
  • <tuple><elem>Tom Cruise</elem><elem>Ethan
    Hunt</elem></tuple>
  • <tuple><elem>Ving Rhames</elem><elem>Luther
    Strickell</elem></tuple>
  • </set>
  • Some attributes: title, writers, directors, plot
    summary, rating, actors, languages, trivia, ...
  • Each attribute was graded as one of:
  • Correct: our system was perfect.
  • Partially correct: our system got a bit too much.
  • Incorrect: our system missed some data.
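
The three grades above could be assigned mechanically by comparing the extracted values for an attribute with the manually extracted ones. The sketch below is one possible reading, assuming values are compared as sets of exact strings. The function name and sample values are hypothetical.

```python
def grade(extracted, expected):
    """Grade one attribute against the manually extracted values."""
    extracted, expected = set(extracted), set(expected)
    if extracted == expected:
        return "correct"
    if expected < extracted:
        # Everything expected was found, plus some extra material.
        return "partially correct"
    # At least one expected value is missing.
    return "incorrect"


perfect = grade(["Tom Cruise", "Ving Rhames"], ["Tom Cruise", "Ving Rhames"])
too_much = grade(["6.8", "Rating"], ["6.8"])
missed = grade(["Tom Cruise"], ["Tom Cruise", "Ving Rhames"])
```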

6
Results

7
Results

8
Results
  • Attributes
  • 5 correct
  • 5 partially correct
  • 6 incorrect
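
For reference, the counts above sum to 16 graded attributes. The short calculation below turns them into shares and uses only the numbers reported on this slide.

```python
# Counts as reported on the results slide.
counts = {"correct": 5, "partially correct": 5, "incorrect": 6}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
```

That works out to 31.25% correct, 31.25% partially correct, and 37.5% incorrect.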