Learning to Extract Symbolic Knowledge from the World Wide Web


1
Learning to Extract Symbolic Knowledge from the
World Wide Web
  • Changho Choi
  • Source: http://www.cs.cmu.edu/~knigam/
  • Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew
    McCallum
  • Carnegie Mellon University, J. Stefan Institute
  • AAAI-98

2
Abstract
Information on the Web is understandable to humans, but not directly usable as structured knowledge.
Goal: extract information from Web pages to populate a knowledge base (KB).
3
Introduction (1/4)
  • Two types of inputs to the information extraction system (a toy sketch of both follows below)
  • Ontology
  • Specifying the classes and relations of interest
  • For example, a hierarchy of classes including Person, Student, Research.Project, Course, etc.
  • Training examples
  • Represent instances of the ontology classes and relations
  • For example, a course web page for the Course class, faculty web pages for the Faculty class, a pair of pages for Courses.Taught.By, etc.
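A minimal sketch (not from the paper) of how these two inputs could be represented; the class names and URLs come from the examples above and elsewhere in this presentation, everything else is illustrative:

    # Toy representation of the two inputs: an ontology and labeled training examples.
    from dataclasses import dataclass

    @dataclass
    class Ontology:
        classes: set        # e.g. {"Person", "Student", "Course", ...}
        relations: dict     # relation name -> (domain class, range class)

    @dataclass
    class TrainingExample:
        label: str          # class or relation name from the ontology
        pages: tuple        # URL(s) of the page(s) representing the instance

    ontology = Ontology(
        classes={"Person", "Student", "Faculty", "Research.Project", "Course"},
        relations={"Courses.Taught.By": ("Course", "Person")},
    )
    examples = [
        TrainingExample("Course", ("http://www.cse.buffalo.edu/courses.html",)),
        TrainingExample("Student", ("http://www.cs.buffalo.edu/grads.html",)),
    ]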

4
(No Transcript)
5
Introduction (3/4)
  • Assumptions about the mapping between the ontology and the Web
  • 1. Each instance of an ontology class is
  • a single Web page,
  • a contiguous string of text,
  • or a collection of several Web pages.
  • 2. Each instance of a relation is
  • a segment of hypertext,
  • a contiguous segment of text,
  • or the hypertext segment.

6
Introduction (4/4)
  • Three primary learning tasks
  • Involved in extracting knowledge-base instances from the Web
  • 1. Recognizing class instances by classifying bodies of hypertext.
  • 2. Recognizing relation instances by classifying chains of hyperlinks.
  • 3. Recognizing class and relation instances by extracting small fields of text from Web pages.

7
Experimental Testbed
  • Experiments
  • Based on the ontology
  • Classes: department, faculty, staff, student, research_project, course, other
  • Relations: Instructors.Of.Course (251), Members.Of.Project (392), Department.Of.Person (748)
  • Data sets
  • A set of 4,127 pages and 10,945 hyperlinks from four CS departments
  • A set of 4,120 pages from numerous other CS departments
  • Evaluation
  • Four-fold cross-validation (sketched below)
  • Three folds for training, one for testing
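A minimal sketch of this evaluation protocol, assuming (as in the paper's leave-one-university-out setup) that each fold corresponds to one of the four departments; the department names below are placeholders:

    # Leave-one-department-out (four-fold) cross-validation.
    def four_fold_splits(folds):
        """Yield (train, test) pairs: train on three folds, test on the held-out one."""
        for held_out in folds:
            train = [f for f in folds if f != held_out]
            yield train, [held_out]

    for train, test in four_fold_splits(["dept_1", "dept_2", "dept_3", "dept_4"]):
        print("train on:", train, " test on:", test)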

8
Statistical Text Classification
  • Process
  • Building a probabilistic model of each class using labeled training data
  • Classifying newly seen pages by selecting the class that is most probable given the evidence of the words describing the new page.
  • Train three classifiers
  • Full-text
  • Title/Heading
  • Hyperlink

9
Statistical Text Classification
  • Approach
  • Naïve Bayes, with minor modifications
  • Based on Kullback-Leibler divergence
  • Given a document d to classify, a score is calculated for each class c as follows (sketched below).
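A hedged sketch, not the authors' code: a naive Bayes text classifier whose per-class score has the Kullback-Leibler flavor described above. The assumed scoring form is Score_c(d) = log Pr(c)/n + sum over words w of (n(w,d)/n) * log(Pr(w|c)/Pr(w)), where n is the number of words in document d and n(w,d) the count of word w in d; the Laplace smoothing and class handling are illustrative.

    import math
    from collections import Counter, defaultdict

    class NaiveBayesTextClassifier:
        """Naive Bayes with a KL-divergence-style score (assumed form, see above)."""

        def fit(self, documents, labels):
            # documents: list of token lists; labels: list of class names
            class_counts = Counter(labels)
            word_counts = defaultdict(Counter)
            background = Counter()
            for tokens, label in zip(documents, labels):
                word_counts[label].update(tokens)
                background.update(tokens)
            self.vocab = set(background)
            v = len(self.vocab)
            self.prior = {c: class_counts[c] / len(labels) for c in class_counts}
            self.word_prob = {                      # Pr(w | c) with Laplace smoothing
                c: {w: (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + v)
                    for w in self.vocab}
                for c in class_counts
            }
            total = sum(background.values())
            self.background = {w: (background[w] + 1) / (total + v)   # Pr(w)
                               for w in self.vocab}
            return self

        def score(self, tokens, c):
            counts = Counter(t for t in tokens if t in self.vocab)
            n = max(sum(counts.values()), 1)
            s = math.log(self.prior[c]) / n
            for w, k in counts.items():
                s += (k / n) * math.log(self.word_prob[c][w] / self.background[w])
            return s

        def classify(self, tokens):
            return max(self.prior, key=lambda c: self.score(tokens, c))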

10
Statistical Text Classification
  • Experimental evaluation

Rows: predicted class. Columns: actual class.

Predicted \ Actual  course  student  faculty  staff  research_project  department  other  Accuracy (%)
course                 202       17        0      0                 1           0    552          26.2
student                  0      421       14     17                 2           0    519          43.3
faculty                  5       56      118     16                 3           0    264          17.9
staff                    0       15        0      4                 0           0     45           6.2
research_project         8        9       10      5                62           0    384          13.0
department              10        8        3      1                 5           4    209           1.7
other                   19       32        7      3                12           0   1064          93.6
Coverage (%)          82.8     72.4     77.1    8.7              72.9       100.0   35.0
11
Accuracy/coverage
  • Coverage
  • The percentage of pages of a given class that are correctly classified as belonging to that class
  • Accuracy
  • The percentage of pages classified into a given class that are actually members of that class
  • (Both measures are computed from a confusion matrix in the sketch below.)
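A minimal sketch of these two measures, assuming the confusion matrix is stored as {predicted_class: {actual_class: count}}, matching the orientation of the table on the previous slide:

    def accuracy(confusion, cls):
        """Fraction of pages classified as `cls` that actually belong to `cls`."""
        row = confusion.get(cls, {})                 # pages predicted as cls, by actual class
        total_predicted = sum(row.values())
        return row.get(cls, 0) / total_predicted if total_predicted else 0.0

    def coverage(confusion, cls):
        """Fraction of pages that truly belong to `cls` and are classified as `cls`."""
        total_actual = sum(row.get(cls, 0) for row in confusion.values())
        return confusion.get(cls, {}).get(cls, 0) / total_actual if total_actual else 0.0

    # Tiny two-class matrix with made-up counts, for illustration only.
    confusion = {"course": {"course": 20, "other": 5}, "other": {"course": 4, "other": 100}}
    print(round(accuracy(confusion, "course"), 3), round(coverage(confusion, "course"), 3))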

12
Accuracy/coverage tradeoff
1. Full-text classifiers
2. Hyperlink classifiers
3. Title/heading classifiers
Hyperlink information can provide strong evidence for page classification.
13
First-Order Text Classification
  • Second approach for text classification
  • Learn first-order rules for classifying pages
  • First-order: rules with variables; Prolog-like, function-free Horn clauses
  • FOIL is the well-known algorithm for first-order learning.
  • Zeroth-order: no variables (propositional)
  • C4.5 is the well-known algorithm for zeroth-order learning.

14
FOIL's input for text classification
  • For each distinct word
  • has_word(Page), one predicate per distinct word
  • Words are stemmed.
  • For every hyperlink
  • link_to(Page, Page)
  • Training data (a sketch of this encoding follows below)
  • Student(http://www.cs.buffalo.edu/grads.html),
  • Course(http://www.cse.buffalo.edu/courses.html),
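A rough sketch of this encoding, with a toy stemmer standing in for whatever stemming algorithm was actually used:

    import re

    def stem(word):
        # crude stand-in stemmer, for illustration only
        return re.sub(r"(ing|ed|es|s)$", "", word.lower())

    def page_facts(url, text, outlinks):
        """Build has_<word>(Page) and link_to(Page, Page) facts for one page."""
        facts = {f"has_{stem(w)}({url})" for w in re.findall(r"[A-Za-z]+", text)}
        facts |= {f"link_to({url}, {target})" for target in outlinks}
        return facts

    print(sorted(page_facts(
        "http://www.cs.buffalo.edu/grads.html",
        "Graduate students and courses",
        ["http://www.cse.buffalo.edu/courses.html"],
    )))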

15
FOIL's results
  • Sample learned rules (a toy evaluation of the first rule follows below)
  • Student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B)).   Test set: 126 (+), 5 (-)
  • Faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).   Test set: 18 (+), 3 (-)
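A hedged sketch of what such a rule means operationally, checking the Student(A) rule above against a toy fact base; the page names and word sets below are made up for illustration:

    def student_rule(page, words_on, links_to):
        """Student(A) :- not(has_data(A)), not(has_comment(A)),
           link_to(B, A), has_jame(B), has_paul(B), not(has_mail(B))."""
        wa = words_on.get(page, set())
        if "data" in wa or "comment" in wa:
            return False
        for b, targets in links_to.items():
            if page in targets:
                wb = words_on.get(b, set())
                if "jame" in wb and "paul" in wb and "mail" not in wb:
                    return True
        return False

    words_on = {"advisor_page": {"jame", "paul", "student"}, "grad_page": {"research", "thesi"}}
    links_to = {"advisor_page": {"grad_page"}}
    print(student_rule("grad_page", words_on, links_to))   # True under these toy facts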

16
FOIL's results
  • Compared to statistical classification
  • More accurate
  • But lower coverage

17
Classifying Hyperlinks
  • Use a first-order representation
  • because this task involves discovering hyperlink paths of unknown and variable size,
  • and because we want to discover patterns such as the following:
  • The ProjectMember(A,B) relation holds if A is a Person, B is a ResearchProject, and B includes a link to A near the word People.

18
FOIL's input for classifying hyperlinks
  • Predicates
  • class(Page)
  • link_to(Hyperlink, Page, Page)
  • has_word(Hyperlink)
  • all_words_capitalized(Hyperlink)
  • has_alphanumeric_word(Hyperlink)
  • has_neighborhood_word(Hyperlink)
  • Training examples
  • Department.Of.Person(CSE, Changho Choi),
  • Instructors.Of.Course(Sargur N. Srihari,
    CSE711),

19
FOIL's results
  • Sample learned rules (a toy evaluation of the first rule follows below)
  • members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B), neighborhood_word_people(C).   Test set: 18 (+), 0 (-)
  • department_of_person(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), neighborhood_word_graduate(E).   Test set: 371 (+), 4 (-)
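A hedged sketch of how the members_of_project rule above reads, with hyperlinks modeled as link_to(hyperlink, from_page, to_page) triples; all page and hyperlink names are made up for illustration:

    def members_of_project(a, b, page_class, links, neighborhood_words):
        """members_of_project(A,B) :- research_project(A), person(B),
           link_to(C, A, D), link_to(E, D, B), neighborhood_word_people(C)."""
        if page_class.get(a) != "research_project" or page_class.get(b) != "person":
            return False
        for c, (src_c, d) in links.items():            # hyperlink C from A to some page D
            if src_c != a or "people" not in neighborhood_words.get(c, set()):
                continue
            for e, (src_e, dst_e) in links.items():    # hyperlink E from D to B
                if src_e == d and dst_e == b:
                    return True
        return False

    page_class = {"proj_page": "research_project", "alice_page": "person"}
    links = {"h1": ("proj_page", "people_page"), "h2": ("people_page", "alice_page")}
    neighborhood_words = {"h1": {"people"}}
    print(members_of_project("proj_page", "alice_page", page_class, links, neighborhood_words))  # True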

20
FOIL's results
  • Fairly high accuracy
  • Limited coverage
  • because of the limited coverage of the page classifiers

21
Extracting Text Fields
  • Uses a richer set of predicates
  • length(Fragment, Relop, N)
  • some(Fragment, Var, Path, Attr, Value)
  • position(Fragment, Var, From, Relop, N)
  • relpos(Fragment, Var1, Var2, Relop, N)
  • Sample learned rule (a simplified reading follows below)
  • ownername(Fragment) :- some(Fragment, B, [], in_title, true), length(Fragment, lt, 3), some(Fragment, B, prev_token, word, gmt), some(Fragment, A, [], longp, true), some(Fragment, B, [], word, unknown), some(Fragment, B, [], quadrupletonp, false)
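A simplified, hedged reading of two of these predicates; the variable-binding argument Var is dropped here, the only path handled is prev_token, and the token attributes are invented for illustration:

    def length(fragment, relop, n):
        """length(Fragment, Relop, N): compare the fragment's length to N."""
        return {"lt": len(fragment) < n, "gt": len(fragment) > n, "eq": len(fragment) == n}[relop]

    def some(fragment, tokens, path, attr, value):
        """some(Fragment, _, Path, Attr, Value): some token of the fragment,
        after following Path (empty or prev_token here), has Attr == Value."""
        for i in fragment:                             # fragment = indices into tokens
            j = i - 1 if path == "prev_token" else i
            if 0 <= j < len(tokens) and tokens[j].get(attr) == value:
                return True
        return False

    tokens = [
        {"word": "gmt", "in_title": False},
        {"word": "unknown", "in_title": True, "longp": True},
        {"word": "owner", "in_title": True, "longp": True},
    ]
    fragment = [1, 2]                                  # a candidate owner-name field
    print(length(fragment, "lt", 3),
          some(fragment, tokens, "prev_token", "word", "gmt"),
          some(fragment, tokens, "", "in_title", True))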

22
FOIL's results
23
Conclusions
  • The approach we propose in this paper is to
    construct a system that can be trained to
    automatically populate such a KB.
  • We have presented a variety of approaches that
    take advantage of the special structure of
    hypertext
  • By considering relationships among Web pages,
  • Their hyperlinks,
  • And specific words on individual pages and
    hyperlinks.