Title: Knowledge Discovery
1. Knowledge Discovery
ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY
- Marko Grobelnik, Dunja Mladenic
- J. Stefan Institute
- Slovenia
2. Contents
- Knowledge Discovery
- Large Scale Topic Ontology population
- Extraction of Semantic Networks from Text
- Active Learning for efficient use of human interventions
- Methods Addressing Different Aspects of Ontology Construction
- Final Remarks
3. Why is Knowledge Discovery appropriate for the Semantic Web?
- Idea: let a computer search for knowledge, while humans give only broad directions about where and how to search
- Knowledge discovery (KD) could be defined as a research area with several subfields: Machine Learning, Data Mining and Databases (Mitchell, 1997; Fayyad et al., 1996; Witten and Frank, 1999; Hand et al., 2001)
- KD techniques
  - are mainly about discovering structure in the data
  - can serve as one of the key mechanisms for structuring knowledge into an ontological structure that is then used in the knowledge management process
- Data and the corresponding semantic structures change over time
  - a sub-field of KD called stream mining deals with these kinds of problems
- The Semantic Web is ultimately concerned with real-life data on the web, which grows exponentially
  - scalability is one of the central issues in KD
4. Machine Learning view of Ontology Generation
5. Knowledge Discovery Techniques
- Knowledge discovery technologies can be used to support different phases and scenarios of ontology generation
- Observations
  - Completely automatic construction of ontologies is in general not possible for
    - theoretical reasons (e.g., information bottleneck) and
    - practical reasons (e.g., the soft nature of the knowledge being conceptualized).
  - Human interventions are necessary but costly in terms of resources
    - therefore the technology should help in the efficient utilization of human interventions.
  - Document databases are the most common data type conceptualized in the form of ontologies
6. What is Ontology?
- In most ML contexts we can refer to an ontology as a graph/network structure consisting of
  - a set of concepts (vertices in a graph)
    - each concept C_i is described by a membership function c_i(x)
  - a set of relations connecting concepts (directed edges in a graph)
    - each relation R_i is described by a membership function r_i(C_i, C_j)
  - a set of instances (data records assigned to concepts or relations)
    - each instance I_i is described by a set of features F_{i,j}
7. Example: we have 7 concepts (C1 to C7) and 3 relations (R1 to R3); each concept and relation is populated by a number of instances (data records)
[Figure: graph with concept nodes C1 to C7 connected by directed edges labeled R1, R2 and R3]
8. Ontology Definition
- Ontology is defined as a tuple with 5 sets of objects
  - Ontology = <Classes, Relations, Instances, Class-Definitions, Relation-Definitions>
  - in short O = <C, R, I, CD, RD>
- where
  - Classes: set of labels C_i
  - Relations: set of labels R_i
  - Instances: set of instance feature vectors I_i
  - Class-Definitions: set of class membership functions CD_i
  - Relation-Definitions: set of relation membership functions RD_i
- the idea is to describe ontology learning tasks in the above terms
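The five-part tuple can be written down directly as a data structure. The following is only a minimal sketch: the class name `Ontology`, the field names and the toy data are illustrative assumptions, not part of the original slides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

Instance = Sequence[float]  # one instance = one feature vector


@dataclass
class Ontology:
    """O = <C, R, I, CD, RD> (field names are illustrative)."""
    classes: List[str] = field(default_factory=list)         # C: class labels
    relations: List[str] = field(default_factory=list)       # R: relation labels
    instances: List[Instance] = field(default_factory=list)  # I: feature vectors
    # CD: class label -> membership function over a single instance
    class_defs: Dict[str, Callable[[Instance], bool]] = field(default_factory=dict)
    # RD: relation label -> membership function over a pair of class labels
    relation_defs: Dict[str, Callable[[str, str], bool]] = field(default_factory=dict)

    def members(self, label: str) -> List[Instance]:
        """Instances accepted by the membership function of class `label`."""
        accept = self.class_defs[label]
        return [x for x in self.instances if accept(x)]


# Toy example: two classes over 1-D instances and one relation between them.
toy = Ontology(
    classes=["small", "large"],
    relations=["smaller_than"],
    instances=[(1.0,), (2.0,), (9.0,)],
    class_defs={
        "small": lambda x: x[0] < 5.0,
        "large": lambda x: x[0] >= 5.0,
    },
    relation_defs={
        "smaller_than": lambda a, b: (a, b) == ("small", "large"),
    },
)
```

Here `toy.members("small")` returns the two instances below the threshold, which is how a class is "populated" by its membership function.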
9. Ontology Learning
- Ontology learning is a set of tasks based on the previous ontology definition
- We define ontology learning tasks in terms of mappings between ontology components, where some of the components are given, some are missing, and we want to induce the missing ones
- Some typical scenarios
  - Inducing classes / clustering of instances
    - <C, CD> = f(I)
  - Ontology population
    - <CD, RD> = f(C, R, I)
  - Ontology generation
    - <C, R, CD, RD> = f(I) (hardest task)
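As an illustration of the first scenario (inducing classes from instances), the sketch below uses simple leader clustering. The algorithm choice, the function name `induce_classes` and the `radius` parameter are assumptions made for the example; the slides do not prescribe a clustering method.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Instance = Sequence[float]


def induce_classes(
    instances: List[Instance], radius: float
) -> Tuple[List[str], Dict[str, Callable[[Instance], bool]]]:
    """Return (C, CD) induced from I using leader clustering.

    An instance that lies within `radius` of an existing leader is covered
    by that leader's class; otherwise it becomes the leader of a new class.
    """
    leaders: List[Instance] = []
    for x in instances:
        if not any(math.dist(x, leader) <= radius for leader in leaders):
            leaders.append(x)

    classes = [f"C{i + 1}" for i in range(len(leaders))]
    class_defs = {
        # Bind `leader` as a default argument so each lambda keeps its own.
        label: (lambda x, leader=leader: math.dist(x, leader) <= radius)
        for label, leader in zip(classes, leaders)
    }
    return classes, class_defs


points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.0, 10.5)]
classes, class_defs = induce_classes(points, radius=1.0)
```

The returned membership functions play the role of the class definitions CD: each one decides whether a new instance belongs to its class.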
10. Representational language
- When learning the function f, we need to select a language for representing the membership function
- Examples
  - Linear functions (Support Vector Machines, ...)
  - Propositional logic (decision trees, rules, ...)
  - First-order logic (Inductive Logic Programming)
- by selecting different representation languages we decide on
  - the power of the descriptions
  - the complexity of computation
11. Ontology Quality
- For the same set of instances I we can have multiple ontologies O_I = <C, R, I, CD, RD>
- We need a function q for measuring the quality of a given ontology O_I
  - function q returns a numerical value
  - the best ontology is the one with the highest quality
- Possible evaluation measures
  - (1) analysis of statistical properties of structured data,
  - (2) agreement with the properties derived from manually built ontologies,
  - (3) optimization of the efficiency of the user's behaviour when using an ontology,
  - (4) using background knowledge, and
  - (5) building hybrid measures (combination of various approaches).
12. Search for optimal Ontology
- Given a set of instances I, we develop a series of ontologies
  - O1, O2, O3, ...
  - where we have a set of transformation operators (refinement operators) going from O_i to O_{i+1}
- A good search procedure would select transformations that lead efficiently towards the highest quality q(O_i)
  - this formulation is in line with machine learning with structured output
  - we could use a human in the loop by applying active learning techniques
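The search formulation can be sketched as greedy hill climbing over refinement operators. This is one possible search strategy, assumed here for illustration; the names `greedy_search`, `refinements` and `q` are hypothetical, and the toy "ontology" is just an integer.

```python
from typing import Callable, Iterable, List, TypeVar

O = TypeVar("O")  # any ontology representation


def greedy_search(
    start: O,
    refinements: Callable[[O], Iterable[O]],
    q: Callable[[O], float],
    max_steps: int = 100,
) -> List[O]:
    """Follow the best-scoring refinement until q stops improving.

    Returns the visited series O1, O2, O3, ...; the last element is the
    best ontology found (a local optimum of q).
    """
    series = [start]
    current, current_q = start, q(start)
    for _ in range(max_steps):
        candidates = list(refinements(current))
        if not candidates:
            break
        best = max(candidates, key=q)
        best_q = q(best)
        if best_q <= current_q:
            break  # no refinement operator improves the quality
        current, current_q = best, best_q
        series.append(current)
    return series


# Toy stand-in: an "ontology" is just its number of classes k,
# refinement merges or splits one class, and quality peaks at k = 4.
series = greedy_search(1, lambda k: [k - 1, k + 1], lambda k: -(k - 4) ** 2)
```

A human in the loop would fit in where `q` is evaluated or where the candidate refinements are ranked.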
14. Large Scale Topic Ontology population
15. Text categorization into large topic ontology
- Categorization of documents into a large topic ontology is one of the problems in text mining
  - needs to be scalable
    - e.g. being able to handle DMoz's 600K categories and 4M docs.
  - needs to be accurate
    - having accuracy on the level of inter-human agreement (60-80%)
  - needs to be robust
    - taking into account the nature of web pages (typically mixed-quality content and often high-quality context)
16. Approaches for handling hierarchy of categories
- There are several topic ontologies (taxonomies) of textual documents
  - Yahoo, DMoz, Medline, ...
- Different people use different approaches
  - series of hierarchically organized classifiers
  - set of independent classifiers just for leaves
  - set of independent classifiers for all nodes
17. Yahoo! topic ontology (taxonomy)
- human-constructed hierarchy of Web documents
- exists in several languages
- easy to access and regularly updated
- captures most of the Web topics
- English version includes over 2M pages categorized into 50,000 categories
  - contains about 250 MB of HTML files
18. Document to categorize: CFP for CoNLL-2000
19. Some predicted categories
20. System architecture
[Figure: system architecture. Labeled documents (from the Yahoo! hierarchy) are taken from the Web; feature construction turns them into vectors of n-grams; subproblem definition, feature selection and classifier construction produce the document classifier, which assigns a category (label) to an unlabeled document]
21. Content categories
- For each content category, generate a separate classifier that predicts the probability that a new document belongs to that category
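A minimal sketch of "one probabilistic classifier per category" is given below. It assumes a category-versus-rest multinomial Naive Bayes over word n-grams with Laplace smoothing; this is a stand-in chosen for brevity, and all names (`ngrams`, `BinaryNaiveBayes`, `train_per_category`) and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def ngrams(text: str, max_n: int = 2) -> List[str]:
    """Word n-grams of length 1..max_n."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


class BinaryNaiveBayes:
    """Category-vs-rest multinomial Naive Bayes with Laplace smoothing."""

    def __init__(self, docs: List[Tuple[str, bool]]):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for text, in_category in docs:
            self.counts[in_category].update(ngrams(text))
            self.n_docs[in_category] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])

    def _log_score(self, feats: List[str], label: bool) -> float:
        total = sum(self.counts[label].values()) + len(self.vocab)
        prior = (self.n_docs[label] + 1) / (sum(self.n_docs.values()) + 2)
        return math.log(prior) + sum(
            math.log((self.counts[label][f] + 1) / total)
            for f in feats if f in self.vocab  # unseen n-grams are ignored
        )

    def probability(self, text: str) -> float:
        """Estimated probability that the document belongs to the category."""
        feats = ngrams(text)
        diff = self._log_score(feats, False) - self._log_score(feats, True)
        return 1.0 / (1.0 + math.exp(diff)) if diff < 700 else 0.0


def train_per_category(labeled: List[Tuple[str, str]]) -> Dict[str, BinaryNaiveBayes]:
    """Build one separate classifier per category."""
    categories = {cat for _, cat in labeled}
    return {
        cat: BinaryNaiveBayes([(text, c == cat) for text, c in labeled])
        for cat in categories
    }


labeled = [
    ("telescope galaxy star", "Science"),
    ("hubble telescope images", "Science"),
    ("football match score", "Sports"),
    ("tennis match result", "Sports"),
]
models = train_per_category(labeled)
```

Each model answers independently, so a document can score highly in several categories at once.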
22. Summary of experimental results on Yahoo!
23. DMoz / ODP is the largest topic ontology on the web: 4M sites, 68k editors, 600k concepts
24. Categorization into DMoz
- On input we take the DMoz RDF taxonomy data
  - from http://rdf.dmoz.org/
  - we preprocess it into an efficient binary structure
  - next, we build a classification model consisting of models for individual categories
- We take the hierarchical nature into account
- Using the classification model we classify new documents into the taxonomy
- On output we get, for a given document text and URL,
  - a set of the most relevant categories from DMoz
  - a set of the most relevant keywords calculated from DMoz category names (segments from the path names)
25. What is used for learning?
- Currently the system uses hierarchical nearest neighbor
  - in the past we experimented with Naïve Bayes for the Yahoo taxonomy (http://kt.ijs.si/Dunja/yplanet.html)
    - heavy feature selection was needed
  - we plan to experiment with Support Vector Machine (SVM) algorithms
  - we plan to use this for the ACM KDD Cup 2005 Challenge
- Scalability is a problem for learning and classification when dealing with 600K classes and 4M documents
- The approaches still need to be properly evaluated
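The slides do not spell out how the hierarchical nearest neighbor works. One plausible reading, assumed here purely for illustration, is a top-down descent that follows the most similar category centroid at each level; the names `Node`, `bag`, `cosine` and `classify` and the toy taxonomy are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def bag(text: str) -> Counter:
    """Bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class Node:
    """A category whose centroid pools its own and its descendants' documents."""

    def __init__(self, name: str, docs: Sequence[str] = (),
                 children: Sequence["Node"] = ()):
        self.name = name
        self.children = list(children)
        self.centroid = Counter()
        for d in docs:
            self.centroid.update(bag(d))
        for c in self.children:
            self.centroid.update(c.centroid)


def classify(root: Node, text: str) -> List[str]:
    """Descend from the root, following the most similar child at each level."""
    query, path, node = bag(text), [root.name], root
    while node.children:
        node = max(node.children, key=lambda c: cosine(query, c.centroid))
        path.append(node.name)
    return path


astronomy = Node("Astronomy", ["hubble telescope galaxy"])
biology = Node("Biology", ["cell protein gene"])
science = Node("Science", children=[astronomy, biology])
sports = Node("Sports", ["football match score"])
root = Node("Top", children=[science, sports])
path = classify(root, "hubble space telescope")
```

Only the children of the current node are compared at each step, so the number of similarity computations grows with the depth and branching of the taxonomy rather than with the total number of categories. This sketch always descends to a leaf; a real system would also need a stopping rule for documents that belong to an inner node.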
26. Performance issues
- Preprocessing of DMoz (from RDF to classification model) takes approx. 1h
- For classification into the whole DMoz we need Win64 with at least 6 GB of memory
  - subsets of DMoz run on Win32 with 2 GB
- Classification into DMoz is fast
  - 20 document classifications per second
  - e.g. the whole Wikipedia was classified into DMoz in several hours
27. Demos
- Demo software for classification into http://dmoz.org/Science/ is available at http://agava.ijs.si/marko/DMozClassifyDemo.zip (40 MB)
  - includes an AVI file with a demo movie
  - the demo runs at http://alchemist.ijs.si:11111/
- Demo for classification into the whole DMoz (all 600K classes) runs at http://alchemist.ijs.si:22222/
28. Example: classification of the URL of a web page
[Figure: keywords and categories returned for the classification of a Hubble telescope web page]
29. Example: classification of the URL and text of a web page
31. Extracting Semantic Graph from text
32. Summarization with semantic graph (Leskovec, Grobelnik, Milic-Frayling 2005)
- Idea: extract a semantic network from text documents and identify the relevant parts of the semantic network to represent the summary
- The semantic graph representation is used for the summarization task (DUC Challenge)
- The main research result is the finding that the topology of the extracted semantic graph helps in determining the importance of the content triples (which Subject-Predicate-Object triples are relevant)
- joint collaboration with Microsoft Research, Cambridge
33. Approach Description
- Approach
  - Learn a machine learning model for selecting sentences
  - Use information about the semantic structure of the document (concepts and relations among concepts)
- Results are promising
  - achieved 70% recall and 25% precision on extracted Subject-Predicate-Object triples on DUC (Document Understanding Conference) data
34. Summarization
[Figure: summarization pipeline, illustrated on the news article "Cracks Appear in U.N. Trade Embargo Against Iraq" and its human-built summary; the article and the summary are reproduced in full in the "Example of summarization" slide]
- Manual summarization: original document -> human-built document summary
- Creation of semantic network: document or summary text -> semantic net of Subj-Pred-Obj triples
- Automatic summarization by selecting relevant triples: document semantic net -> summary semantic net (mapping between graphs learned with ML methods)
  - 70% recall, 40% precision of selected triples according to human-generated summaries
- Nat. Lang. Generation: summary semantic net -> automatically built document summary (not done by us)
35. Detailed Summarization Procedure
- Linguistic analysis of the text
  - Deep parsing of sentences
- Refinement of the text parse
  - Named-entity consolidation
    - Determine that "George Bush", "Bush" and "U.S. president" refer to the same entity
  - Anaphora resolution
    - Link pronouns with named entities
- Extract Subject-Predicate-Object triples
- Compose a graph from triples
- Describe each triple with a set of features for learning
- Learn a model to classify triples into the summary
- Generate a summary graph
- Use the summary graph to generate a textual document summary
Example
- Input text: Tom Sawyer went to town. He met a friend. Tom was happy.
- After refinement: Tom Sawyer went to town. He [Tom Sawyer] met a friend. Tom [Tom Sawyer] was happy.
- Extracted triples: Tom -> go -> town; Tom -> meet -> friend; Tom -> is -> happy
36. Named entities consolidation
- Consolidating different surface forms that refer to the same entity; only for names of people, places, companies, etc.
- Example
  - Hillary Rodham Clinton, Hillary Clinton, Hillary Rodham, Mrs. Clinton -> Hillary Clinton
- Heuristic based on the overlap in the surface forms of name variants
- Accuracy on a subset of the data set: 90%.
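A surface-form overlap heuristic could look roughly like the sketch below. The details are assumptions made for illustration: the honorific list, the grouping rule (any shared token) and the choice of the longest form as canonical are not taken from the slides, which map the example group to "Hillary Clinton".

```python
from typing import Dict, List

TITLES = {"mr.", "mrs.", "ms.", "dr."}  # assumed list of honorifics to ignore


def tokens(name: str) -> frozenset:
    """Lower-cased name tokens without honorifics."""
    return frozenset(t for t in name.lower().split() if t not in TITLES)


def consolidate(names: List[str]) -> Dict[str, str]:
    """Map each surface form to a canonical form of its group.

    Two forms are grouped when their token sets overlap; the canonical
    form is simply the longest member of the group (an assumption).
    """
    groups: List[List[str]] = []
    for name in names:
        for group in groups:
            if any(tokens(name) & tokens(other) for other in group):
                group.append(name)
                break
        else:
            groups.append([name])  # no overlapping group found
    return {name: max(group, key=len) for group in groups for name in group}


mapping = consolidate([
    "Hillary Rodham Clinton",
    "Hillary Clinton",
    "Hillary Rodham",
    "Mrs. Clinton",
    "George Bush",
])
```

Such a simple rule would also merge, for example, two different people sharing a surname, which is one reason the reported accuracy is below 100%.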
37. Pronominal anaphora resolution
- Link pronouns with their referents
  - Mary likes Paul. She went to buy him a present.
  - -> Mary likes Paul. She [Mary] went to buy him [Paul] a present.
- Method
  - restrict to 5 pronouns: she, he, who, I, they.
  - from the pronoun, traverse the text searching for candidate referents and assign a score
    - the score is based on the distance from the pronoun and on semantic information
  - assume that pronouns refer only to named entities found in the document
- Problem
  - One passenger in King's car said they had been drinking liquor.
- Average accuracy on 1,500 hand-labeled pronouns: 81.2%
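The scoring idea (distance plus semantic information) can be sketched as follows. The weights, the use of gender as the only semantic cue, the backward-only search and all names are assumptions for illustration, not the authors' actual scoring function.

```python
from typing import List, Optional, Tuple

# (token position in the document, entity name, gender or None if unknown)
Entity = Tuple[int, str, Optional[str]]

PRONOUN_GENDER = {"she": "female", "he": "male"}  # other pronouns: no constraint


def resolve(pronoun: str, position: int, entities: List[Entity]) -> Optional[str]:
    """Pick the best-scoring named entity that precedes the pronoun."""
    wanted = PRONOUN_GENDER.get(pronoun.lower())
    best_name, best_score = None, float("-inf")
    for ent_pos, name, gender in entities:
        if ent_pos >= position:
            continue  # this sketch only looks backwards from the pronoun
        score = -float(position - ent_pos)  # closer candidates score higher
        if wanted and gender:
            # crude semantic check: reward agreement, punish disagreement
            score += 10.0 if gender == wanted else -10.0
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# "Mary likes Paul. She went to buy him a present."
entities = [(0, "Mary", "female"), (2, "Paul", "male")]
referent = resolve("She", 3, entities)
```

In the example the nearer candidate (Paul) loses to Mary because the gender cue outweighs the distance penalty.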
38. Anaphora resolution evaluation
- Accuracy on the 5 selected pronouns: 81.2% (55.2% if counting all pronouns)
39. Extracting triples
- The enhanced parse tree is traversed to identify Subject-Predicate-Object triples
- Example
  - Conservatives embraced the nomination while liberals were cautious or hostile
- Resulting triples
  - conservative -> embrace -> nomination
  - liberal -> is -> cautious
  - liberal -> is -> hostile
41. Training of summarization model
- The model ranks Subject-Predicate-Object triples according to their importance
[Figure: document semantic network and summary semantic network]
42. Composing a graph
- The graph consists of nodes, referred to as concepts, which can be subjects or objects, and edges, which are predicates and capture relations among concepts.
- Use WordNet to identify and merge synonym nodes, as they correspond to the same concepts.
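Composing the graph from triples can be sketched as below. The synonym lookup is a plain dictionary standing in for WordNet (no WordNet API is used), and the function name `compose_graph` and the sample triples are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def compose_graph(
    triples: List[Triple], synonyms: Dict[str, str]
) -> Dict[str, Set[Tuple[str, str]]]:
    """Build node -> {(predicate, target node)} with synonym nodes merged.

    `synonyms` maps a word to the representative of its synonym set; in the
    described system this information would come from WordNet.
    """
    def canon(node: str) -> str:
        return synonyms.get(node, node)

    graph: Dict[str, Set[Tuple[str, str]]] = {}
    for subj, pred, obj in triples:
        s, o = canon(subj), canon(obj)
        graph.setdefault(s, set()).add((pred, o))
        graph.setdefault(o, set())  # object-only nodes are kept as nodes too
    return graph


triples = [
    ("Tom", "go", "town"),
    ("Tom", "meet", "friend"),
    ("Tom", "is", "happy"),
    ("boy", "meet", "pal"),
]
graph = compose_graph(triples, synonyms={"pal": "friend"})
```

After merging, "pal" no longer appears as a separate node, so both "meet" edges point at the same concept.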
43. Feature construction
- For learning, each triple is described by the following attributes
  - Positional information
    - of the sentence from which the triple was derived, relative to the document text
    - of the triple, relative to the beginning of the sentence
  - Linguistic attributes of the nodes in the triple (NLP)
    - 18 syntactic attributes
    - 100 semantic attributes
  - 14 graph attributes: PageRank, In/Out Degree, reachable neighbours, etc.
- On the dataset this yields
  - a TOTAL of 466 attributes
  - on average 72 non-zero attributes per triple.
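Two of the graph attributes named above, in/out degree and PageRank, can be computed as in the following sketch. It uses a plain power-iteration PageRank with a damping factor of 0.85 and uniform handling of dangling nodes; these settings and the function names are assumptions, since the slides do not give the exact variant used.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (source node, target node)


def degrees(edges: List[Edge]) -> Dict[str, Tuple[int, int]]:
    """Return node -> (in-degree, out-degree)."""
    deg: Dict[str, List[int]] = {}
    for src, dst in edges:
        deg.setdefault(src, [0, 0])[1] += 1
        deg.setdefault(dst, [0, 0])[0] += 1
    return {node: (i, o) for node, (i, o) in deg.items()}


def pagerank(edges: List[Edge], damping: float = 0.85,
             iters: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank; dangling nodes spread their rank uniformly."""
    nodes = sorted({n for e in edges for n in e})
    out: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        dangling = sum(rank[n] for n in nodes if not out[n])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new = {n: base for n in nodes}
        for src in nodes:
            for dst in out[src]:
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank


edges = [("Tom", "town"), ("Tom", "friend"), ("Mary", "friend")]
deg = degrees(edges)
pr = pagerank(edges)
```

The resulting per-node values would then be attached to every triple whose subject or object is that node.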
44. Experiments
- Machine learning with a linear SVM to classify triples as relevant or not relevant for the summary
  - Positive examples are triples from the sentences which were marked as summary sentences by experts
  - Negative examples are all other triples
- Data
  - 147 documents from DUC 2002 for which we had extracted summaries.
- Evaluation
  - Report micro-averaged values of precision, recall and F1 for the extracted triples using 10-fold cross validation.
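Micro-averaging pools the counts over all folds (or documents) before computing the ratios. A minimal sketch, with the function name `micro_prf` and the toy data being illustrative assumptions:

```python
from typing import List, Set, Tuple


def micro_prf(folds: List[Tuple[Set[str], Set[str]]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1.

    Each fold is (predicted triples, gold triples). True positives and
    totals are summed over all folds before the ratios are computed.
    """
    tp = sum(len(pred & gold) for pred, gold in folds)
    n_pred = sum(len(pred) for pred, _ in folds)
    n_gold = sum(len(gold) for _, gold in folds)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


folds = [
    ({"a", "b"}, {"a"}),            # 1 of 2 predictions correct, 1 of 1 found
    ({"c"}, {"c", "d", "e"}),       # 1 of 1 predictions correct, 1 of 3 found
]
precision, recall, f1 = micro_prf(folds)
```

With these two folds the pooled counts are 2 true positives, 3 predictions and 4 gold triples, so precision is 2/3 and recall is 1/2.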
45. Performance for various attribute sets
46. Performance for various attribute sets
- Baseline performance (sentence position and selected terms from the sentence), F1 = 32.46, is lower than in any of the other runs, except for the run with only linguistic attributes (F1 = 30.29).
- The "only linguistic" run includes only generic syntactic and semantic labels, which are not expected to be good discriminators on their own.
47. Performance for various attribute sets
- Adding generic linguistic attributes reduces precision but consistently increases recall
  - Position of triples and sentences -> P = 31.05
  - Adding linguistic attributes -> P = 28.67
48. Performance for various attribute sets
- Information about the graph structure helps
  - Position of triples and sentences -> F1 = 39.05
  - Adding structure information -> F1 = 43.07
49. Insights
- Most highly ranked features in the SVM normal (the weight vector of the linear model)
  - We determine the median and quartiles of the ranks across 10 runs.
50Example of summarization
- Cracks Appear in U.N. Trade Embargo Against
Iraq. - Cracks appeared Tuesday in the U.N. trade
embargo against Iraq as Saddam Hussein sought to
circumvent the economic noose around his country.
Japan, meanwhile, announced it would increase its
aid to countries hardest hit by enforcing the
sanctions. Hoping to defuse criticism that it is
not doing its share to oppose Baghdad, Japan said
up to 2 billion in aid may be sent to nations
most affected by the U.N. embargo on Iraq.
President Bush on Tuesday night promised a joint
session of Congress and a nationwide radio and
television audience that Saddam Hussein will
fail'' to make his conquest of Kuwait permanent.
America must stand up to aggression, and we
will,'' said Bush, who added that the U.S.
military may remain in the Saudi Arabian desert
indefinitely. I cannot predict just how long it
will take to convince Iraq to withdraw from
Kuwait,'' Bush said. More than 150,000 U.S.
troops have been sent to the Persian Gulf region
to deter a possible Iraqi invasion of Saudi
Arabia. Bush's aides said the president would
follow his address to Congress with a televised
message for the Iraqi people, declaring the world
is united against their government's invasion of
Kuwait. Saddam had offered Bush time on Iraqi TV.
The Philippines and Namibia, the first of the
developing nations to respond to an offer Monday
by Saddam of free oil _ in exchange for sending
their own tankers to get it _ said no to the
Iraqi leader. Saddam's offer was seen as a
none-too-subtle attempt to bypass the U.N.
embargo, in effect since four days after Iraq's
Aug. 2 invasion of Kuwait, by getting poor
countries to dock their tankers in Iraq. But
according to a State Department survey, Cuba and
Romania have struck oil deals with Iraq and
companies elsewhere are trying to continue trade
with Baghdad, all in defiance of U.N. sanctions.
Romania denies the allegation. The report, made
available to The Associated Press, said some
Eastern European countries also are trying to
maintain their military sales to Iraq. A
well-informed source in Tehran told The
Associated Press that Iran has agreed to an Iraqi
request to exchange food and medicine for up to
200,000 barrels of refined oil a day and cash
payments. There was no official comment from
Tehran or Baghdad on the reported food-for-oil
deal. But the source, who requested anonymity,
said the deal was struck during Iraqi Foreign
Minister Tariq Aziz's visit Sunday to Tehran, the
first by a senior Iraqi official since the
1980-88 gulf war. After the visit, the two
countries announced they would resume diplomatic
relations. Well-informed oil industry sources in
the region, contacted by The AP, said that
although Iran is a major oil exporter itself, it
currently has to import about 150,000 barrels of
refined oil a day for domestic use because of
damages to refineries in the gulf war. Along
similar lines, ABC News reported that following
Aziz's visit, Iraq is apparently prepared to give
Iran all the oil it wants to make up for the
damage Iraq inflicted on Iran during their
conflict. Secretary of State James A. Baker III,
meanwhile, met in Moscow with Soviet Foreign
Minister Eduard Shevardnadze, two days after the
U.S.-Soviet summit that produced a joint demand
that Iraq withdraw from Kuwait. During the
summit, Bush encouraged Mikhail Gorbachev to
withdraw 190 Soviet military specialists from
Iraq, where they remain to fulfill contracts.
Shevardnadze told the Soviet parliament Tuesday
the specialists had not reneged on those
contracts for fear it would jeopardize the 5,800
Soviet citizens in Iraq. In his speech, Bush said
his heart went out to the families of the
hundreds of Americans held hostage by Iraq, but
he declared, Our policy cannot change, and it
will not change. America and the world will not
be blackmailed.'' The president added Vital
issues of principle are at stake. Saddam Hussein
is literally trying to wipe a country off the
face of the Earth.'' In other developments _A
U.S. diplomat in Baghdad said Tuesday up to 800
Americans and Britons will fly out of
Iraqi-occupied Kuwait this week, most of them
women and children leaving their husbands behind.
Saddam has said he is keeping foreign men as
human shields against attack. On Monday, a
planeload of 164 Westerners arrived in Baltimore
from Iraq. Evacuees spoke of food shortages in
Kuwait, nighttime gunfire and Iraqi roundups of
young people suspected of involvement in the
resistance. There is no law and order,'' said
Thuraya, 19, who would not give her last name.
A soldier can rape a father's daughter in front
of him and he can't do anything about it.'' _The
State Department said Iraq had told U.S.
officials that American males residing in Iraq
and Kuwait who were born in Arab countries will
be allowed to leave. Iraq generally has not let
American males leave. It was not known how many
men the Iraqi move could affect. _A Pentagon
spokesman said some increase in military
activity'' had been detected inside Iraq near its
borders with Turkey and Syria. He said there was
little indication hostilities are imminent.
Defense Secretary Dick Cheney said the cost of
the U.S. military buildup in the Middle East was
rising above the 1 billion-a-month estimate
generally used by government officials. He said
the total cost _ if no shooting war breaks out _
could total 15 billion in the next fiscal year
beginning Oct. 1. Cheney promised disgruntled
lawmakers a significant increase'' in help from
Arab nations and other U.S. allies for Operation
Desert Shield. Japan, which has been accused of
responding too slowly to the crisis in the gulf,
said Tuesday it may give 2 billion to Egypt,
Jordan and Turkey, hit hardest by the U.N.
prohibition on trade with Iraq. The pressure
from abroad is getting so strong,'' said Hiroyasu
Horio, an official with the Ministry of
International Trade and Industry. Local news
reports said the aid would be extended through
the World Bank and International Monetary Fund,
and 600 million would be sent as early as
mid-September. On Friday, Treasury Secretary
Nicholas Brady visited Tokyo on a world tour
seeking 10.5 billion to help Egypt, Jordan and
Turkey. Japan has already promised a 1 billion
aid package for multinational peacekeeping forces
in Saudi Arabia, including food, water, vehicles
and prefabricated housing for non-military uses.
But critics in the United States have said Japan
should do more because its economy depends
heavily on oil from the Middle East. Japan
imports 99 percent of its oil. Japan's
constitution bans the use of force in settling
international disputes and Japanese law restricts
the military to Japanese territory, except for
ceremonial occasions. On Monday, Saddam offered
developing nations free oil if they would send
their tankers to pick it up. The first two
countries to respond Tuesday _ the Philippines
and Namibia _ said no. Manila said it had already
fulfilled its oil requirements, and Namibia said
it would not sell its sovereignty'' for Iraqi
oil. Venezuelan President Carlos Andres Perez
dismissed Saddam's offer of free oil as a
propaganda ploy.'' Venezuela, an OPEC member,
has led a drive among oil-producing nations to
boost production to make up for the shortfall
caused by the loss of Iraqi and Kuwaiti oil from
the world market. Their oil makes up 20 percent
of the world's oil reserves. Only Saudi Arabia
has higher reserves. But according to the State
Department, Cuba, which faces an oil deficit
because of reduced Soviet deliveries, has
received a shipment of Iraqi petroleum since U.N.
sanctions were imposed five weeks ago. And
Romania, it said, expects to receive oil
indirectly from Iraq. Romania's ambassador to the
United States, Virgil Constantinescu, denied that
claim Tuesday, calling it ``absolutely false and
without foundation.''
Human written summary
Cracks appeared in the U.N. trade embargo against
Iraq. The State Department reports that Cuba and
Romania have struck oil deals with Iraq as others
attempt to trade with Baghdad in defiance of the
sanctions. Iran has agreed to exchange food and
medicine for Iraqi oil. Saddam has offered
developing nations free oil if they send their
tankers to pick it up. Thus far, none has
accepted. Japan, accused of responding too slowly
to the Gulf crisis, has promised $2 billion in
aid to countries hit hardest by the Iraqi trade
embargo. President Bush has promised that
Saddam's aggression will not succeed.
7800 chars, 1300 words
51Full document semantic graph
52Automatically generated summary graph
53Findings on summarization with semantic graphs
- Experiments show that attributes that characterize the document semantic graph improve selection of triples for summarization
- These results need to be verified on additional data sets
- Need to perform comparison with additional summarization methods
- Explore various strategies for extracting and generating summaries based on extracted triples
- No combination of features that was examined led to good separation of positive and negative triples in the feature space
- Opportunity for further investigations and improvements
54Contents
- Knowledge Discovery
- Large Scale Topic Ontology population
- Extraction of Semantic Networks from Text
- Active Learning for efficient use of human interventions
- Methods Addressing Different Aspects of Ontology Construction
- Final Remarks
55Active Learning /Dealing with unlabeled data
56The idea of Active Learning
- The idea of Active Learning is that a student who asks smart questions arrives at the required model of knowledge faster than one who asks random questions
- The goal is to use Active Learning algorithms for semi-automatic
- construction of models for labeling data and
- for ontology learning
57Quick Intro to Active Learning
- We use these methods whenever hand-labeled data are rare or expensive to obtain
- Interactive method
- Requests labeling only of interesting objects
- Much less human work is needed for the same result compared to labeling arbitrary examples
[Figure: a passive student receives data and labels from the teacher; an active student sends a query to the teacher and receives a label]
[Plot: performance against number of questions, with one curve for the active student asking smart questions and one for the passive student asking random questions]
58Algorithms tested
- Uncertainty sampling (efficient)
- select the example closest to the decision hyperplane (or the one with classification probability closest to P = 0.5) (Tong & Koller 2000, Stanford)
- Maximum margin ratio change
- select the example with the largest predicted impact on the margin size if selected (Tong & Koller 2000, Stanford)
- Monte Carlo Estimation of Error Reduction
- select the example that reinforces our current beliefs (Roy & McCallum 2001, CMU)
- Random sampling as baseline
- Experimental evaluation (using F1-measure) of the four listed approaches, shown on three categories from the Reuters-2000 dataset
- average over 10 random samples of 5000 training (out of 500k) and 10k testing (out of 300k) examples
- the last two methods are rather time consuming, thus we run them only for including the first 50 unlabeled examples
- experiments show that active learning is especially useful for unbalanced data
59Category with balanced class distribution having 47% of positive examples: limited advantage over random sampling
60Category with fairly unbalanced class distribution having 20% of positive examples: best performance with Uncertainty and MarginRatio; Uncertainty is simpler and much more efficient
61Category with very unbalanced class distribution having 2.7% of positive examples: Uncertainty seems to outperform MarginRatio
62Illustration of Active learning
- starting with one labeled example from each class (red and blue)
- select one example for labeling (green circle)
- request its label, add it, and re-generate the model using the extended labeled data
- Illustration of linear SVM model using
- arbitrary selection of unlabeled examples (random)
- active learning selecting the most uncertain examples (closest to the decision hyperplane)
63Uncertainty sampling of unlabeled example
65Contents
- Knowledge Discovery
- Large Scale Topic Ontology population
- Extraction of Semantic Networks from Text
- Active Learning for efficient use of human interventions
- Methods Addressing Different Aspects of Ontology Construction
- Final Remarks
66Methods Addressing Different Aspects of Ontology
Construction
67Methods addressing different aspects of ontology
construction
- Collecting data
- focused crawling with Google and DMoz in the loop
- Dealing with different natural languages
- map the documents into a language-independent semantic space
- Going directly from the data
- semi-automatic creation of an ontology directly from the data under predefined conditions/scenarios
- Annotation of text
68Focused Crawler
- Focused crawler which finds, in a relatively short time, web pages related to the given web page
- The solution uses the DMoz topic ontology to get content context, and Google to get web linkage context
- the main idea is to browse the web graph as a bi-directional graph using the "link:" query in Google
- Algorithm
- For an efficient initial set of candidate pages we use Google and DMoz
- From the initial set, pages are crawled in breadth-first fashion
- priority in the crawler queue is given to more similar pages
- after some stopping condition is met, the crawler returns the list of candidate web pages
- Usage: serves as a technique for collecting the data for the next stages of data processing, such as building and populating ontologies for the Semantic Web and improved knowledge access
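The crawl loop described above can be sketched roughly as follows. This is only an illustration under strong assumptions: an in-memory link graph and page texts stand in for the web, Google and DMoz; word-overlap (Jaccard) similarity stands in for whatever similarity measure the real crawler uses; and the stopping condition is a simple page budget. All names (`focused_crawl`, `jaccard`, the toy pages) are hypothetical.

```python
import heapq

def jaccard(a, b):
    """Word-overlap similarity between two texts."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def focused_crawl(seeds, target_text, links, texts, max_pages=3):
    """Visit pages in order of similarity to target_text; return visited URLs."""
    queue, seen, visited = [], set(seeds), []
    for order, url in enumerate(seeds):
        heapq.heappush(queue, (-jaccard(texts[url], target_text), order, url))
    counter = len(seeds)
    while queue and len(visited) < max_pages:  # stopping condition: page budget
        _, _, url = heapq.heappop(queue)       # most similar page first
        visited.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                score = jaccard(texts[nxt], target_text)
                heapq.heappush(queue, (-score, counter, nxt))
                counter += 1
    return visited

# Toy web: page texts and outgoing links (illustrative only).
texts = {
    "a": "telecom broadband phone",
    "b": "telecom mobile phone network",
    "c": "cooking recipes",
    "d": "broadband telecom network provider",
}
links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
result = focused_crawl(["a"], "telecom broadband network", links, texts)
```

Here the unrelated page "c" is never visited within the budget because the more similar pages "b" and "d" are always ahead of it in the queue.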
69Example Focused Crawl
- Focused crawl for the BT home page (http://www.bt.com)
- 1. www.bt.co.uk/ - BT
- 2. www.yell.com/ucs/HomePageAction.do - UK's local search engine
- 3. www.att.com/ - AT&T The World's Networking Company
- 4. www.cisco.com/ - Cisco Systems, Inc
- 5. www.microsoft.com/ - Microsoft Corporation
- 6. www.bbc.co.uk/ - BBC
- 7. www.hp.com/ - HP United States
- 8. www.ntl.com/ - Broadband cable internet access
- 9. www.telekom.de/ - Deutsche Telekom
- 10. www.epsrc.ac.uk/ - EPSRC
- 11. www.cw.com/ - Cable & Wireless
- 12. www.royalmail.com/ - Royal Mail
- 13. www.ericsson.com/ - Ericsson
- 14. www.bp.com/home.do?categoryId1 - BP Global
- 15. www.telewest.co.uk/ - Telewest Broadband PLC
- 16. www.verizon.com/ - Verizon
- 17. www.nokia.com/ - Nokia
- 18. www.bt.com/at_home.jsp - BT.com At Home
70Language-independent document representation
- From aligned corpora we learn mappings from documents to a language-independent representation using the Kernel Canonical Correlation Analysis (KCCA) method
- such a representation could be used for multilingual classification, multilingual IR, ...
- On-going work on learning mappings between all European languages using the CELEX corpus of European legislation in 21 languages
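For reference, the objective behind canonical correlation analysis can be written as below. This is the standard textbook formulation rather than a transcription of the slides, and all symbols are introduced here for illustration: $C_{xx}$ and $C_{yy}$ are the covariance matrices of the two views (for example, the same documents in two languages), $C_{xy}$ is their cross-covariance, and $K_x$, $K_y$ are the kernel matrices of the two views in the kernel variant.

```latex
% Linear CCA: directions w_x, w_y whose projections are maximally correlated
\rho = \max_{w_x, w_y}
  \frac{w_x^\top C_{xy}\, w_y}
       {\sqrt{\left(w_x^\top C_{xx}\, w_x\right)\left(w_y^\top C_{yy}\, w_y\right)}}

% Kernel CCA: the same objective in terms of dual coefficients \alpha, \beta
\rho = \max_{\alpha, \beta}
  \frac{\alpha^\top K_x K_y\, \beta}
       {\sqrt{\left(\alpha^\top K_x^2\, \alpha\right)\left(\beta^\top K_y^2\, \beta\right)}}
```

In practice the kernel form is usually regularized, since without regularization it can reach perfect correlation trivially on the training data.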
71Two views of the same data: find the direction with maximal correlation
72Correlation 0.17
View 2
73Correlation 0.44
View 2
74Correlation 0.97
View 2
75Correlated directions found with KCCA when
applied to financial news articles
76Modelling directly from the data: getting semantic classes with LSI
77Visualization of 6FP IST project (English)
78Modeling relationships between companies from the
news
79Annotation of text
- Annotation based on examples
- Annotation using clustering
- Annotation based on thesaurus
80Annotate text based on examples
- Problem: Annotation of text by assigning predefined labels to text fragments
- Given: examples of annotated text fragments
- learn annotation rules from already annotated documents (.xml, ...), similar to learning IE
- learn to classify sentences into semantic roles
81Annotate text using clustering
- Problem: Annotation of text by finding labels and assigning them to text fragments
- Given: text to annotate
- split documents into sentences, represent each sentence as a word vector
- cluster sentences and label them by the most characteristic words from the sentences
- e.g., using local frequency of words, clustering with SOM and using neural network weights of words
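The clustering-based annotation steps above could look roughly like the sketch below. Note the substitutions: a simple greedy "leader" clustering with cosine similarity is used in place of the SOM mentioned on the slide, and cluster labels are just the most frequent words in each cluster. The stopword list, threshold and function names are assumptions made for this example.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "and", "in", "to", "for"}

def sentence_vectors(text):
    """Split text into sentences and build a word-count vector for each."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    vectors = [Counter(w for w in re.findall(r"[a-z]+", s.lower())
                       if w not in STOPWORDS) for s in sentences]
    return sentences, vectors

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def leader_cluster(vectors, threshold=0.3):
    """Put each vector into the first cluster whose summed vector is similar enough."""
    clusters = []  # list of (summed word counts, member indices)
    for i, v in enumerate(vectors):
        for centroid, members in clusters:
            if cosine(centroid, v) >= threshold:
                centroid.update(v)
                members.append(i)
                break
        else:
            clusters.append((Counter(v), [i]))
    return clusters

def annotate(text, top_k=2):
    """Return (label words, sentences) pairs, one per cluster."""
    sentences, vectors = sentence_vectors(text)
    return [([w for w, _ in centroid.most_common(top_k)],
             [sentences[i] for i in members])
            for centroid, members in leader_cluster(vectors)]

text = ("Oil prices rose sharply. The oil embargo cut oil exports. "
        "The crawler visits web pages. Web pages link to other web pages.")
annotations = annotate(text)
```

On this toy text the four sentences fall into two clusters, one about oil and one about web pages, and each cluster's label words can then be attached to its sentences as annotations.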
82Annotate text based on thesaurus
- Problem: Annotation of text by finding labels and assigning them to text fragments
- Given: text to annotate, thesaurus
- a) apply NLP on text to find noun-groups and map them onto concepts of a (medical) thesaurus
- b) split document into sentences, cluster them and map clusters onto concepts of a general thesaurus (WordNet)
- the concepts are used as semantic labels (XML tags) for annotating documents
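A minimal sketch of variant a), assuming plain phrase matching in place of real NLP noun-group detection, and a tiny made-up dictionary in place of a medical thesaurus. The concept names and the `annotate_with_thesaurus` helper are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical mini-thesaurus: surface phrase -> concept label.
THESAURUS = {
    "heart attack": "MyocardialInfarction",
    "aspirin": "Analgesic",
    "blood pressure": "BloodPressure",
}

def annotate_with_thesaurus(text, thesaurus):
    """Wrap every thesaurus phrase found in text in an XML tag named after its concept."""
    # Longest phrases first so multi-word terms win over shorter overlapping ones.
    phrases = sorted(thesaurus, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE)
    out, last = [], 0
    for m in pattern.finditer(text):
        concept = thesaurus[m.group(1).lower()]
        out.append(escape(text[last:m.start()]))   # untouched text before the match
        out.append(f"<{concept}>{escape(m.group(1))}</{concept}>")
        last = m.end()
    out.append(escape(text[last:]))
    return "".join(out)

annotated = annotate_with_thesaurus(
    "Aspirin lowers the risk of heart attack.", THESAURUS)
```

The result is the original sentence with each recognized phrase wrapped in a tag carrying its concept, which is the form of semantic labeling the slide describes.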
83Ontology evaluation directions
- Analysis of information-theoretic properties of structured data instances
- Measure of the agreement with the characteristics derived from manually built ontologies
- Optimization of the efficiency of the user's behaviour when using an ontology (e.g., minimizing the number of user clicks)
84Contents
- Knowledge Discovery
- Large Scale Topic Ontology population
- Extraction of Semantic Networks from Text
- Active Learning for efficient use of human interventions
- Methods Addressing Different Aspects of Ontology Construction
- Final Remarks
85Ontology Learning Challenge
- Academic challenge on DMoz data (Science part) for 3 tasks
- Taxonomy Population
- Given a taxonomy with documents, the task is to classify new documents into taxonomic categories
- Naming Categories
- Given taxonomic categories with documents, the task is to (semi)automatically propose names for categories
- Constructing Taxonomy from Documents
- Given a set of documents, the task is to (semi)automatically propose a taxonomic structure
- The goal is to model human skills when dealing with large amounts of data
- Data
- DMoz/Science (10k concepts, 100k instances)
- Tourist ontology (from KU) (70 concepts, 1000 instances)
- The challenge will be funded through the PASCAL Network of Excellence European project (http://www.pascal-network.org/)
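To make the Taxonomy Population task concrete, here is a small sketch that assigns a new document to the category whose documents it most resembles. It assumes a nearest-centroid classifier over raw word counts, which is only one simple baseline and not necessarily what challenge participants would use; the toy taxonomy and function names are invented for this example.

```python
import math
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_centroids(taxonomy):
    """taxonomy: category path -> list of documents; returns summed word counts per category."""
    centroids = {}
    for category, docs in taxonomy.items():
        total = Counter()
        for doc in docs:
            total.update(bag_of_words(doc))
        centroids[category] = total
    return centroids

def classify(doc, centroids):
    """Return the category whose centroid is most similar to the document."""
    vec = bag_of_words(doc)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

# Toy taxonomy with documents already assigned to categories.
taxonomy = {
    "Science/Physics": ["quantum particle energy", "energy of a particle wave"],
    "Science/Biology": ["cell protein gene", "gene expression in a cell"],
}
centroids = build_centroids(taxonomy)
predicted = classify("protein folding in the cell", centroids)
```

With data at the scale mentioned above (10k concepts, 100k instances), the taxonomy structure itself would normally be exploited as well, which this flat sketch ignores.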
86Ideas / Future plans (1)
- DMoz categories as standard web meta-data dictionary
- the idea is to use DMoz categories/keywords as a standardized dictionary for meta-data labeling of general Web pages
- because of the dynamic and adaptive nature of DMoz categorization (reflecting all major topics on the web) this could be interesting as a baseline for semantic web style annotation
- e.g. could be deployed as a tool for (semi)automatic generation of <META> tags for web pages
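As a small illustration of the META-tag idea, the sketch below turns a list of category paths (as a classifier might assign them to a page) into a keywords tag. The category paths and the `meta_keywords_tag` helper are made up for this example; how the categories are obtained is outside its scope.

```python
from html import escape

def meta_keywords_tag(categories):
    """Build a META keywords tag from category paths such as 'Science/Physics'."""
    keywords = []
    for path in categories:
        for part in path.split("/"):
            if part and part not in keywords:  # keep first occurrence only
                keywords.append(part)
    content = escape(", ".join(keywords), quote=True)
    return f'<META NAME="keywords" CONTENT="{content}">'

tag = meta_keywords_tag(["Science/Physics/Quantum_Mechanics", "Science/Math"])
```

The resulting tag lists each path component once, in the order it first appears.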
87Ideas / Future plans (2)
- DMoz classifier as an annotation tool
- the idea is to use the DMoz-classifier tool for meta-data (keyword) generation
- some other popular databases (e.g. Wikipedia) could have automatically generated DMoz categories attached
- could be accessible as a web service (e.g. SOAP interface)
88Ideas / Future plans (3)
- DMoz Visualizer
- the idea is to create a tool for visualization and browsing through the DMoz structure
- browsing tools could combine other public and commercial sources (such as Wikipedia, Google, Amazon, eBay, ...)
- could appear as e.g. a web-browser toolbar
89Ideas / Future plans (4)
- Analysis of DMoz Dynamics
- Future research plan is to model the dynamics of the DMoz taxonomy based on data from the DMoz Archive (http://rdf.dmoz.org/rdf/archive/)
- the idea is to model the decision process of when and how the editors decide to split the category nodes
- currently the repository includes 120 snapshots of DMoz from year 2000 on
90Ideas / Future plans (5)
- Focused crawling for DMoz
- the idea is to use a focused crawler for proposing new web sites for particular categories (as an editorial tool)
- at JSI we developed a focused crawler for fast and efficient crawling for focused content; it can be further extended
- to use Google and DMoz in the loop
- to use user hints (positive and negative examples of content pages)
- based on the Corpus-Builder project at CMU
- http://www-2.cs.cmu.edu/afs/cs/project/theo-4/text-learning/www/corpusbuilder/
91Ideas / Future plans (6)
- Classification of non-English documents
- we use string kernels for avoiding problems with morphology
- submitted paper at ECML/PKDD2005 (Fortuna & Mladenic) for classification into major Slovenian and Croatian taxonomies
- we plan to use Canonical Correlation Analysis (CCA) for efficient identification of similar content written in different languages
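To show why string kernels help with morphology, here is a sketch of one simple kind of string kernel, a character n-gram (spectrum) kernel. This particular kernel and the choice n = 3 are assumptions for illustration; the slides do not say which string kernel was used. Two inflected forms of the same word share most of their character n-grams and therefore stay similar without any stemming or lemmatization.

```python
import math
from collections import Counter

def ngram_counts(text, n=3):
    """Count all contiguous character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def spectrum_kernel(s, t, n=3):
    """Inner product of the n-gram count vectors of s and t."""
    a, b = ngram_counts(s, n), ngram_counts(t, n)
    return sum(a[g] * b[g] for g in a if g in b)

def normalized_kernel(s, t, n=3):
    """Spectrum kernel scaled to [0, 1] so that k(s, s) = 1 for non-trivial s."""
    denom = math.sqrt(spectrum_kernel(s, s, n) * spectrum_kernel(t, t, n))
    return spectrum_kernel(s, t, n) / denom if denom else 0.0

# Two inflected forms of one Slovenian word versus an unrelated word.
same_stem = normalized_kernel("knjiga", "knjige")
different = normalized_kernel("knjiga", "drevo")
```

Here `same_stem` is high because the two forms share three of their four trigrams, while `different` is zero because no trigram is shared.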
92Text-Garden software library (in development over the last 5 years)
93Text-Garden data
- Set of C++ classes for industrial-strength text mining problem solving
- Currently organized in 50 command line utilities covering
- Machine learning/Data mining on text
- Web related functionality
- Profiling, Visualization, ...
- Currently works on Windows, to be ported to Linux
94Text Garden Architecture of clustering,
visualization, classification
95Text Garden Web site: www.textmining.net