Title: Learning to Extract Symbolic Knowledge from the World Wide Web
1. Learning to Extract Symbolic Knowledge from the World Wide Web
- Presenter: Changho Choi
- Source: http://www.cs.cmu.edu/~knigam/
- Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, Seán Slattery
- Carnegie Mellon University / J. Stefan Institute
- AAAI-98
2. Abstract
- Information on the Web is understandable to humans, but not readily usable by machines.
- Goal: automatically extract that information to populate a machine-understandable knowledge base (KB).
3. Introduction (1/4)
- Two types of inputs to the information extraction system:
- Ontology
  - Specifies the classes and relations of interest.
  - For example, a hierarchy of classes including Person, Student, ResearchProject, Course, etc.
- Training examples
  - Represent instances of the ontology classes and relations.
  - For example, course web pages for the Course class, faculty web pages for the Faculty class, a pair of pages for the Courses.Taught.By relation, etc.
- A minimal sketch of both inputs appears below.
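A minimal sketch of how these two inputs could be represented; the class, relation, and example names are taken from later slides, while the Python encoding itself is illustrative, not from the paper:

    # Ontology: the classes and relations of interest.
    ontology = {
        "classes": ["department", "faculty", "staff", "student",
                    "research_project", "course", "other"],
        # relation name -> (type of 1st argument, type of 2nd argument);
        # argument order follows the slides' learned rules and examples.
        "relations": {
            "instructors_of_course": ("person", "course"),
            "members_of_project": ("research_project", "person"),
            "department_of_person": ("person", "department"),
        },
    }

    # Training examples: labeled instances of classes and relations.
    class_examples = [
        ("http://www.cse.buffalo.edu/courses.html", "course"),
        ("http://www.cs.buffalo.edu/grads.html", "student"),
    ]
    relation_examples = [
        ("instructors_of_course", "Sargur N. Srihari", "CSE711"),
    ]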
4. Introduction (2/4) (no transcript)
5. Introduction (3/4)
- Assumptions about the mapping between the ontology and the Web:
- 1. Each instance of an ontology class is represented by
  - a single Web page,
  - a contiguous string of text, or
  - a collection of several Web pages.
- 2. Each instance of a relation is represented by
  - a segment of hypertext (e.g., a chain of hyperlinks between the pages of the two instances),
  - a contiguous segment of text, or
  - the hypertext segment representing one of the related instances.
6. Introduction (4/4)
- Three primary learning tasks involved in extracting knowledge-base instances from the Web:
- 1. Recognizing class instances by classifying bodies of hypertext.
- 2. Recognizing relation instances by classifying chains of hyperlinks.
- 3. Recognizing class and relation instances by extracting small fields of text from Web pages.
7. Experimental Testbed
- Experiments are based on an ontology with
  - Classes: department, faculty, staff, student, research_project, course, other
  - Relations: instructors_of_course (251 instances), members_of_project (392), department_of_person (748)
- Data sets
  - A set of 4,127 pages and 10,945 hyperlinks from 4 CS departments
  - A set of 4,120 pages from numerous other CS departments
- Evaluation
  - Four-fold cross-validation, leaving one department out per fold: 3 departments for training, 1 for testing (a split sketch follows below)
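The four folds correspond to the four departments: each fold trains on pages from three of them and tests on the fourth. A minimal sketch, with placeholder department names (the slides do not name the four universities):

    # Leave-one-department-out cross-validation: 4 folds,
    # each training on 3 departments and testing on the held-out one.
    departments = ["dept_a", "dept_b", "dept_c", "dept_d"]  # placeholders

    def four_fold_splits(depts):
        for held_out in depts:
            train = [d for d in depts if d != held_out]
            yield train, held_out

    for train, test in four_fold_splits(departments):
        print("train:", train, "-> test:", test)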
8. Statistical Text Classification
- Process
  - Build a probabilistic model of each class using labeled training data.
  - Classify newly seen pages by selecting the class that is most probable given the evidence of the words describing the new page.
- Train three classifiers:
  - Full-text
  - Title/heading
  - Hyperlink
9. Statistical Text Classification
- Approach
  - Naive Bayes, with minor modifications based on Kullback-Leibler divergence.
  - Given a document d to classify, we calculate a score for each class c as follows (see the reconstruction below):
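The formula itself did not survive transcription; the following is a reconstruction of the KL-divergence-based score from the AAAI-98 paper, where n is the number of words in d and Pr(w|d) = N(w,d)/n is the empirical frequency of word w in d:

    \mathrm{Score}(c, d) \;=\; \frac{\log \Pr(c)}{n}
        \;+\; \sum_{w \in d} \Pr(w \mid d)\,\log\frac{\Pr(w \mid c)}{\Pr(w \mid d)}

Ranking classes by this score is equivalent to ranking by the naive Bayes posterior, since it differs from (1/n) log Pr(c|d) only by the document's word entropy, which is the same for every class; the summation is the negative KL divergence between the document's and the class's word distributions. A minimal Python sketch (the class_prior and class_word_prob interfaces are assumptions about how the trained model is stored):

    import math
    from collections import Counter

    def score(doc_words, class_prior, class_word_prob):
        # Score(c, d) = log Pr(c)/n + sum_w Pr(w|d) * log(Pr(w|c) / Pr(w|d))
        n = len(doc_words)
        freq = Counter(doc_words)
        s = math.log(class_prior) / n
        for w, count in freq.items():
            p_wd = count / n                          # Pr(w|d)
            s += p_wd * math.log(class_word_prob(w) / p_wd)
        return s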
10. Statistical Text Classification
- Confusion matrix (rows: predicted class; columns: actual class). Accuracy is per row, Coverage per column, both in %.

    Predicted \ Actual    course  student  faculty  staff  research_project  department  other   Accuracy
    course                   202       17        0      0                 1           0    552       26.2
    student                    0      421       14     17                 2           0    519       43.3
    faculty                    5       56      118     16                 3           0    264       17.9
    staff                      0       15        0      4                 0           0     45        6.2
    research_project           8        9       10      5                62           0    384       13.0
    department                10        8        3      1                 5           4    209        1.7
    other                     19       32        7      3                12           0   1064       93.6
    Coverage                82.8     72.4     77.1    8.7              72.9       100.0   35.0
11. Accuracy/coverage
- Coverage
  - The percentage of pages of a given class that are correctly classified as belonging to that class.
- Accuracy
  - The percentage of pages classified into a given class that actually are members of that class.
- Both measures can be read off the confusion matrix above, as in the sketch below.
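A minimal sketch recomputing both measures from the slide-10 confusion matrix (matrix values copied from the table; the code itself is illustrative):

    # Rows = predicted class, columns = actual class (slide 10).
    classes = ["course", "student", "faculty", "staff",
               "research_project", "department", "other"]
    M = [
        [202,  17,   0,  0,  1, 0,  552],
        [  0, 421,  14, 17,  2, 0,  519],
        [  5,  56, 118, 16,  3, 0,  264],
        [  0,  15,   0,  4,  0, 0,   45],
        [  8,   9,  10,  5, 62, 0,  384],
        [ 10,   8,   3,  1,  5, 4,  209],
        [ 19,  32,   7,  3, 12, 0, 1064],
    ]

    for i, c in enumerate(classes):
        accuracy = 100.0 * M[i][i] / sum(M[i])                # row-wise
        coverage = 100.0 * M[i][i] / sum(r[i] for r in M)     # column-wise
        print(f"{c}: accuracy {accuracy:.1f}%, coverage {coverage:.1f}%")

    # E.g. course: accuracy 26.2%, coverage 82.8% -- matching slide 10.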
12. Accuracy/coverage tradeoff
- [Figure: accuracy/coverage tradeoff curves for (1) full-text, (2) hyperlink, and (3) title/heading classifiers.]
- Hyperlink information can provide strong evidence for classification.
13. First-Order Text Classification
- A second approach to text classification: learn first-order rules for classifying pages.
  - First-order: rules with variables, written as Prolog-like, function-free Horn clauses; FOIL is the well-known algorithm for first-order learning.
  - Zeroth-order: rules without variables; C4.5 is the well-known algorithm for zeroth-order learning.
14. FOIL's input for text classification
- For each distinct word, a predicate has_word(Page); words are stemmed.
- For every hyperlink, a predicate link_to(Page, Page).
- Training data, e.g.:
  - student(http://www.cs.buffalo.edu/grads.html),
  - course(http://www.cse.buffalo.edu/courses.html), ...
- A sketch of this fact generation follows below.
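A minimal sketch of turning a page into FOIL facts; the fact format mirrors the slides, while the tokenizer and (identity) stemmer are placeholders for whatever preprocessing the authors actually used:

    import re

    def stem(word):
        # Placeholder: the slides say words are stemmed
        # (e.g. "faculty" -> "faculti"), but no stemmer is specified.
        return word

    def page_to_facts(url, html):
        """Emit has_<word>(url) for each distinct stemmed word and
        link_to(url, target) for each hyperlink on the page."""
        facts = set()
        for word in re.findall(r"[a-z]+", html.lower()):
            facts.add(f"has_{stem(word)}({url})")
        for target in re.findall(r'href="([^"]+)"', html):
            facts.add(f"link_to({url}, {target})")
        return facts

    print(page_to_facts("http://www.cse.buffalo.edu/courses.html",
                        '<a href="cse711.html">CSE711 Course</a>'))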
15. FOIL's result
- Sample learned rules (word arguments are stemmed; (+)/(-) are test-set counts):
  - student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B)).
    Test set: 126 (+), 5 (-)
  - faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).
    Test set: 18 (+), 3 (-)
16. FOIL's result
- Compared to statistical classification:
  - More accurate
  - Less coverage
17. Classifying Hyperlinks
- Use a first-order representation
  - because this task involves discovering hyperlink paths of unknown and variable size,
  - and because we want to capture patterns such as: the ProjectMember(A,B) relation holds if A is a Person, B is a ResearchProject, and B's page includes a link to A near the word "People".
18. FOIL's input for classifying hyperlinks
- Predicates
  - class(Page)
  - link_to(Hyperlink, Page, Page)
  - has_word(Hyperlink)
  - all_words_capitalized(Hyperlink)
  - has_alphanumeric_word(Hyperlink)
  - has_neighborhood_word(Hyperlink)
- Training examples, e.g.:
  - department_of_person(CSE, Changho Choi),
  - instructors_of_course(Sargur N. Srihari, CSE711), ...
19. FOIL's result
- Sample learned rules:
  - members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B), neighborhood_word_people(C).
    Test set: 18 (+), 0 (-)
  - department_of_person(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), neighborhood_word_graduate(E).
    Test set: 371 (+), 4 (-)
- A sketch of how the first rule fires over link_to facts follows below.
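To see how the first rule fires, here is a minimal sketch that checks it against hand-made facts; all page names, hyperlink IDs, and the relational encoding are illustrative:

    # link_to(C, FromPage, ToPage) facts, page classes, and the set of
    # hyperlinks with "people" in their neighborhood text.
    link_to = [("l1", "proj_page", "people_page"),
               ("l2", "people_page", "person_page")]
    research_project = {"proj_page"}
    person = {"person_page"}
    neighborhood_word_people = {"l1"}

    def members_of_project(a, b):
        # members_of_project(A,B) :- research_project(A), person(B),
        #   link_to(C,A,D), link_to(E,D,B), neighborhood_word_people(C).
        if a not in research_project or b not in person:
            return False
        return any(c in neighborhood_word_people
                   for (c, a2, d) in link_to if a2 == a
                   for (e, d2, b2) in link_to if d2 == d and b2 == b)

    print(members_of_project("proj_page", "person_page"))  # True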
20. FOIL's result
- Fairly high accuracy
- Limited coverage
  - because of the limited coverage of the page classifiers
21. Extracting Text Fields
- Uses a richer set of predicates:
  - length(Fragment, Relop, N)
  - some(Fragment, Var, Path, Attr, Value)
  - position(Fragment, Var, From, Relop, N)
  - relpos(Fragment, Var1, Var2, Relop, N)
- Sample learned rule:
  - ownername(Fragment) :- some(Fragment, B, [], in_title, true), length(Fragment, <, 3), some(Fragment, B, [prev_token], word, "gmt"), some(Fragment, A, [], longp, true), some(Fragment, B, [], word, "unknown"), some(Fragment, B, [], quadrupletonp, false).
- A sketch of how two of these conjuncts are evaluated follows below.
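A minimal sketch of evaluating two conjuncts of that rule against a candidate fragment; the token attributes are illustrative guesses at what names like longp and quadrupletonp test, and variable binding across conjuncts is simplified away:

    # Page tokens and a candidate fragment (a span of token indices).
    tokens = ["gmt", "Changho", "Choi"]     # illustrative page content
    fragment = [1, 2]                       # candidate owner-name span

    def attr(tok, name):
        return {"word": tok.lower(),
                "longp": len(tok) > 10,          # "long token" test (guess)
                "quadrupletonp": len(tok) == 4,  # 4-character test (guess)
                "in_title": False}[name]         # would come from page markup

    def length(frag, relop, n):
        return len(frag) < n if relop == "<" else len(frag) > n

    def some(frag, path, name, value):
        # some(Fragment, Var, Path, Attr, Value): some fragment token,
        # after following Path ([] = the token itself), has Attr == Value.
        for i in frag:
            j = i - 1 if path == ["prev_token"] else i
            if 0 <= j < len(tokens) and attr(tokens[j], name) == value:
                return True
        return False

    print(length(fragment, "<", 3),                       # True: 2 tokens
          some(fragment, ["prev_token"], "word", "gmt"))  # True: token before "Changho"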
22. FOIL's result (no transcript)
23. Conclusions
- The approach proposed in this paper is to construct a system that can be trained to automatically populate such a KB.
- A variety of approaches were presented that take advantage of the special structure of hypertext:
  - relationships among Web pages,
  - their hyperlinks,
  - and specific words on individual pages and hyperlinks.