Title: Learning to Extract Symbolic Knowledge from the World Wide Web
1. Learning to Extract Symbolic Knowledge from the World Wide Web
- Presenter: Changho Choi
- Source: http://www.cs.cmu.edu/~knigam/
- Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, Seán Slattery
- Carnegie Mellon University / J. Stefan Institute
- AAAI-98
2. Abstract
- Information on the Web is understandable to humans, but not readily usable by machines.
- Goal: automatically extract that information to populate a machine-understandable knowledge base (KB).
3. Introduction (1/4)
- Two types of inputs to the information extraction system:
- Ontology
  - Specifies the classes and relations of interest.
  - For example, a hierarchy of classes including Person, Student, ResearchProject, Course, etc.
- Training examples
  - Represent instances of the ontology classes and relations.
  - For example, course web pages for the Course class, faculty web pages for the Faculty class, a pair of pages for the Courses.Taught.By relation, etc.
- A minimal sketch of both inputs appears below.
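A minimal sketch of how these two inputs could be represented; the class, relation, and example names are taken from later slides, while the Python encoding itself is illustrative, not from the paper:

    # Ontology: the classes and relations of interest.
    ontology = {
        "classes": ["department", "faculty", "staff", "student",
                    "research_project", "course", "other"],
        # relation name -> (type of 1st argument, type of 2nd argument);
        # argument order follows the slides' learned rules and examples.
        "relations": {
            "instructors_of_course": ("person", "course"),
            "members_of_project": ("research_project", "person"),
            "department_of_person": ("person", "department"),
        },
    }

    # Training examples: labeled instances of classes and relations.
    class_examples = [
        ("http://www.cse.buffalo.edu/courses.html", "course"),
        ("http://www.cs.buffalo.edu/grads.html", "student"),
    ]
    relation_examples = [
        ("instructors_of_course", "Sargur N. Srihari", "CSE711"),
    ]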
4. Introduction (2/4) (no transcript)
5. Introduction (3/4)
- Assumptions about the mapping between the ontology and the Web:
- 1. Each instance of an ontology class is represented by
  - a single Web page,
  - a contiguous string of text, or
  - a collection of several Web pages.
- 2. Each instance of a relation is represented by
  - a segment of hypertext (e.g., a chain of hyperlinks between the pages of the two instances),
  - a contiguous segment of text, or
  - the hypertext segment representing one of the related instances.
6. Introduction (4/4)
- Three primary learning tasks involved in extracting knowledge-base instances from the Web:
- 1. Recognizing class instances by classifying bodies of hypertext.
- 2. Recognizing relation instances by classifying chains of hyperlinks.
- 3. Recognizing class and relation instances by extracting small fields of text from Web pages.
7. Experimental Testbed
- Experiments are based on an ontology with
  - Classes: department, faculty, staff, student, research_project, course, other
  - Relations: instructors_of_course (251 instances), members_of_project (392), department_of_person (748)
- Data sets
  - A set of 4,127 pages and 10,945 hyperlinks from 4 CS departments
  - A set of 4,120 pages from numerous other CS departments
- Evaluation
  - Four-fold cross-validation, leaving one department out per fold: 3 departments for training, 1 for testing (a split sketch follows below)
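The four folds correspond to the four departments: each fold trains on pages from three of them and tests on the fourth. A minimal sketch, with placeholder department names (the slides do not name the four universities):

    # Leave-one-department-out cross-validation: 4 folds,
    # each training on 3 departments and testing on the held-out one.
    departments = ["dept_a", "dept_b", "dept_c", "dept_d"]  # placeholders

    def four_fold_splits(depts):
        for held_out in depts:
            train = [d for d in depts if d != held_out]
            yield train, held_out

    for train, test in four_fold_splits(departments):
        print("train:", train, "-> test:", test)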
8. Statistical Text Classification
- Process
  - Build a probabilistic model of each class using labeled training data.
  - Classify newly seen pages by selecting the class that is most probable given the evidence of the words describing the new page.
- Train three classifiers:
  - Full-text
  - Title/heading
  - Hyperlink
9. Statistical Text Classification
- Approach
  - Naive Bayes, with minor modifications based on Kullback-Leibler divergence.
  - Given a document d to classify, we calculate a score for each class c as follows (see the reconstruction below):
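The formula itself did not survive transcription; the following is a reconstruction of the KL-divergence-based score from the AAAI-98 paper, where n is the number of words in d and Pr(w|d) = N(w,d)/n is the empirical frequency of word w in d:

    \mathrm{Score}(c, d) \;=\; \frac{\log \Pr(c)}{n}
        \;+\; \sum_{w \in d} \Pr(w \mid d)\,\log\frac{\Pr(w \mid c)}{\Pr(w \mid d)}

Ranking classes by this score is equivalent to ranking by the naive Bayes posterior, since it differs from (1/n) log Pr(c|d) only by the document's word entropy, which is the same for every class; the summation is the negative KL divergence between the document's and the class's word distributions. A minimal Python sketch (the class_prior and class_word_prob interfaces are assumptions about how the trained model is stored):

    import math
    from collections import Counter

    def score(doc_words, class_prior, class_word_prob):
        # Score(c, d) = log Pr(c)/n + sum_w Pr(w|d) * log(Pr(w|c) / Pr(w|d))
        n = len(doc_words)
        freq = Counter(doc_words)
        s = math.log(class_prior) / n
        for w, count in freq.items():
            p_wd = count / n                          # Pr(w|d)
            s += p_wd * math.log(class_word_prob(w) / p_wd)
        return s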
10. Statistical Text Classification
- Confusion matrix (rows: predicted class; columns: actual class). Accuracy is per row, Coverage per column, both in %.

    Predicted \ Actual    course  student  faculty  staff  research_project  department  other   Accuracy
    course                   202       17        0      0                 1           0    552       26.2
    student                    0      421       14     17                 2           0    519       43.3
    faculty                    5       56      118     16                 3           0    264       17.9
    staff                      0       15        0      4                 0           0     45        6.2
    research_project           8        9       10      5                62           0    384       13.0
    department                10        8        3      1                 5           4    209        1.7
    other                     19       32        7      3                12           0   1064       93.6
    Coverage                82.8     72.4     77.1    8.7              72.9       100.0   35.0
11. Accuracy/coverage
- Coverage
  - The percentage of pages of a given class that are correctly classified as belonging to that class.
- Accuracy
  - The percentage of pages classified into a given class that actually are members of that class.
- Both measures can be read off the confusion matrix above, as in the sketch below.
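A minimal sketch recomputing both measures from the slide-10 confusion matrix (matrix values copied from the table; the code itself is illustrative):

    # Rows = predicted class, columns = actual class (slide 10).
    classes = ["course", "student", "faculty", "staff",
               "research_project", "department", "other"]
    M = [
        [202,  17,   0,  0,  1, 0,  552],
        [  0, 421,  14, 17,  2, 0,  519],
        [  5,  56, 118, 16,  3, 0,  264],
        [  0,  15,   0,  4,  0, 0,   45],
        [  8,   9,  10,  5, 62, 0,  384],
        [ 10,   8,   3,  1,  5, 4,  209],
        [ 19,  32,   7,  3, 12, 0, 1064],
    ]

    for i, c in enumerate(classes):
        accuracy = 100.0 * M[i][i] / sum(M[i])                # row-wise
        coverage = 100.0 * M[i][i] / sum(r[i] for r in M)     # column-wise
        print(f"{c}: accuracy {accuracy:.1f}%, coverage {coverage:.1f}%")

    # E.g. course: accuracy 26.2%, coverage 82.8% -- matching slide 10.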
12. Accuracy/coverage tradeoff
- [Figure: accuracy/coverage tradeoff curves for (1) full-text, (2) hyperlink, and (3) title/heading classifiers.]
- Hyperlink information can provide strong evidence for classification.
13. First-Order Text Classification
- A second approach to text classification: learn first-order rules for classifying pages.
  - First-order: rules with variables, written as Prolog-like, function-free Horn clauses; FOIL is the well-known algorithm for first-order learning.
  - Zeroth-order: rules without variables; C4.5 is the well-known algorithm for zeroth-order learning.
14. FOIL's input for text classification
- For each distinct word, a predicate has_word(Page); words are stemmed.
- For every hyperlink, a predicate link_to(Page, Page).
- Training data, e.g.:
  - student(http://www.cs.buffalo.edu/grads.html),
  - course(http://www.cse.buffalo.edu/courses.html), ...
- A sketch of this fact generation follows below.
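A minimal sketch of turning a page into FOIL facts; the fact format mirrors the slides, while the tokenizer and (identity) stemmer are placeholders for whatever preprocessing the authors actually used:

    import re

    def stem(word):
        # Placeholder: the slides say words are stemmed
        # (e.g. "faculty" -> "faculti"), but no stemmer is specified.
        return word

    def page_to_facts(url, html):
        """Emit has_<word>(url) for each distinct stemmed word and
        link_to(url, target) for each hyperlink on the page."""
        facts = set()
        for word in re.findall(r"[a-z]+", html.lower()):
            facts.add(f"has_{stem(word)}({url})")
        for target in re.findall(r'href="([^"]+)"', html):
            facts.add(f"link_to({url}, {target})")
        return facts

    print(page_to_facts("http://www.cse.buffalo.edu/courses.html",
                        '<a href="cse711.html">CSE711 Course</a>'))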
15. FOIL's result
- Sample learned rules (word arguments are stemmed; (+)/(-) are test-set counts):
  - student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B)).
    Test set: 126 (+), 5 (-)
  - faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).
    Test set: 18 (+), 3 (-)
16. FOIL's result
- Compared to statistical classification:
  - More accurate
  - Less coverage
17. Classifying Hyperlinks
- Use a first-order representation
  - because this task involves discovering hyperlink paths of unknown and variable size,
  - and because we want to capture patterns such as: the ProjectMember(A,B) relation holds if A is a Person, B is a ResearchProject, and B's page includes a link to A near the word "People".
18. FOIL's input for classifying hyperlinks
- Predicates
  - class(Page)
  - link_to(Hyperlink, Page, Page)
  - has_word(Hyperlink)
  - all_words_capitalized(Hyperlink)
  - has_alphanumeric_word(Hyperlink)
  - has_neighborhood_word(Hyperlink)
- Training examples, e.g.:
  - department_of_person(CSE, Changho Choi),
  - instructors_of_course(Sargur N. Srihari, CSE711), ...
19. FOIL's result
- Sample learned rules:
  - members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B), neighborhood_word_people(C).
    Test set: 18 (+), 0 (-)
  - department_of_person(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), neighborhood_word_graduate(E).
    Test set: 371 (+), 4 (-)
- A sketch of how the first rule fires over link_to facts follows below.
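To see how the first rule fires, here is a minimal sketch that checks it against hand-made facts; all page names, hyperlink IDs, and the relational encoding are illustrative:

    # link_to(C, FromPage, ToPage) facts, page classes, and the set of
    # hyperlinks with "people" in their neighborhood text.
    link_to = [("l1", "proj_page", "people_page"),
               ("l2", "people_page", "person_page")]
    research_project = {"proj_page"}
    person = {"person_page"}
    neighborhood_word_people = {"l1"}

    def members_of_project(a, b):
        # members_of_project(A,B) :- research_project(A), person(B),
        #   link_to(C,A,D), link_to(E,D,B), neighborhood_word_people(C).
        if a not in research_project or b not in person:
            return False
        return any(c in neighborhood_word_people
                   for (c, a2, d) in link_to if a2 == a
                   for (e, d2, b2) in link_to if d2 == d and b2 == b)

    print(members_of_project("proj_page", "person_page"))  # True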
20. FOIL's result
- Fairly high accuracy
- Limited coverage
  - because of the limited coverage of the page classifiers
21. Extracting Text Fields
- Uses a richer set of predicates:
  - length(Fragment, Relop, N)
  - some(Fragment, Var, Path, Attr, Value)
  - position(Fragment, Var, From, Relop, N)
  - relpos(Fragment, Var1, Var2, Relop, N)
- Sample learned rule:
  - ownername(Fragment) :- some(Fragment, B, [], in_title, true), length(Fragment, <, 3), some(Fragment, B, [prev_token], word, "gmt"), some(Fragment, A, [], longp, true), some(Fragment, B, [], word, "unknown"), some(Fragment, B, [], quadrupletonp, false).
- A sketch of how two of these conjuncts are evaluated follows below.
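A minimal sketch of evaluating two conjuncts of that rule against a candidate fragment; the token attributes are illustrative guesses at what names like longp and quadrupletonp test, and variable binding across conjuncts is simplified away:

    # Page tokens and a candidate fragment (a span of token indices).
    tokens = ["gmt", "Changho", "Choi"]     # illustrative page content
    fragment = [1, 2]                       # candidate owner-name span

    def attr(tok, name):
        return {"word": tok.lower(),
                "longp": len(tok) > 10,          # "long token" test (guess)
                "quadrupletonp": len(tok) == 4,  # 4-character test (guess)
                "in_title": False}[name]         # would come from page markup

    def length(frag, relop, n):
        return len(frag) < n if relop == "<" else len(frag) > n

    def some(frag, path, name, value):
        # some(Fragment, Var, Path, Attr, Value): some fragment token,
        # after following Path ([] = the token itself), has Attr == Value.
        for i in frag:
            j = i - 1 if path == ["prev_token"] else i
            if 0 <= j < len(tokens) and attr(tokens[j], name) == value:
                return True
        return False

    print(length(fragment, "<", 3),                       # True: 2 tokens
          some(fragment, ["prev_token"], "word", "gmt"))  # True: token before "Changho"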
22. FOIL's result (no transcript)
23. Conclusions
- The approach proposed in this paper is to construct a system that can be trained to automatically populate such a KB.
- A variety of approaches were presented that take advantage of the special structure of hypertext:
  - relationships among Web pages,
  - their hyperlinks,
  - and specific words on individual pages and hyperlinks.