Title: Information Extraction from Text
1 Information Extraction from Text
2 Why study it?
- It is a particularly complex application.
- It draws on most of the resources used in analysis tasks.
- Studying it therefore gives a good overview of the problems and
technologies involved in natural language analysis.
3 What is Information Extraction from Text?
- Information retrieval (IR): searching for and finding information in
texts in response to specific requests.
- Passage retrieval: searching for and finding passages (paragraphs,
sentences) within a text that can answer given questions.
- Information extraction (IE): finding information that can fill
predefined schemas (templates).
- Question answering: giving answers to general questions posed by a
user (IE + IR).
- Text understanding: modelling how humans understand texts.
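The difference between IR and passage retrieval can be made concrete with a small sketch. It assumes a toy two-document collection and a plain term-overlap score; the names `DOCS`, `overlap`, `retrieve_documents` and `retrieve_passages` are illustrative and do not come from the slides.

```python
import re

# Toy collection (an assumption made for this illustration).
DOCS = {
    "d1": "Bridgestone set up a joint venture in Taiwan. "
          "The venture will produce golf clubs.",
    "d2": "Microsoft claims to love the open-source concept. "
          "Bill Gates is its CEO.",
}

def tokens(text):
    """Lower-cased set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(query, text):
    """Number of distinct terms shared by the query and the text."""
    return len(tokens(query) & tokens(text))

def retrieve_documents(query):
    """IR: return the ids of whole documents that share terms with the query."""
    scored = [(overlap(query, text), doc_id) for doc_id, text in DOCS.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

def retrieve_passages(query):
    """Passage retrieval: same score, but the unit returned is a sentence."""
    passages = [
        sentence
        for text in DOCS.values()
        for sentence in re.split(r"(?<=\.)\s+", text)
    ]
    scored = [(overlap(query, p), p) for p in passages]
    return [p for score, p in sorted(scored, reverse=True) if score > 0]
```

With the query "who produces golf clubs", `retrieve_documents` returns the whole document `d1`, while `retrieve_passages` returns only the sentence about golf clubs. IE would go one step further and fill a predefined template from that sentence.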
4 Types of questions
- IR
- Passage retrieval
- IE
- Question answering
- Text understanding
Predefined: fixed aspects of the textual information.
5 What is Information Extraction
As a task: filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying...
NAME              TITLE    ORGANIZATION
6 What is Information Extraction
IE
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
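The slot-filling view can be sketched in a few lines of Python. This is a minimal illustration under strong assumptions: three hand-written regular expressions that happen to fit the phrasings in the passage, a tiny fixed list of titles, and a hypothetical `Row` type standing in for a database row. It is not the method behind the slide, and it would not generalise to other text.

```python
import re
from typing import NamedTuple

class Row(NamedTuple):
    """One database row; the field names mirror the table header."""
    name: str
    title: str
    organization: str

# Shortened version of the passage (kept to the sentences that fill slots).
TEXT = (
    "For years, Microsoft Corporation CEO Bill Gates railed against the "
    "economic philosophy of open-source software. "
    '"We love the concept of shared source," said Bill Veghte, a Microsoft VP. '
    "Richard Stallman, founder of the Free Software Foundation, countered."
)

CAP = r"[A-Z][a-z]+"          # one capitalised word
NAME = rf"{CAP} {CAP}"        # assumption: names are exactly two such words
TITLE = r"CEO|VP|founder"     # assumption: a closed list of titles

PATTERNS = [
    # "Microsoft Corporation CEO Bill Gates"
    re.compile(rf"(?P<org>{CAP})(?: Corporation)? (?P<title>{TITLE}) (?P<name>{NAME})"),
    # "Bill Veghte, a Microsoft VP"
    re.compile(rf"(?P<name>{NAME}), an? (?P<org>{CAP}) (?P<title>{TITLE})"),
    # "Richard Stallman, founder of the Free Software Foundation"
    re.compile(rf"(?P<name>{NAME}), (?P<title>{TITLE}) of the (?P<org>{CAP}(?: {CAP})*)"),
]

def fill_slots(text):
    """Return one Row per pattern match, in pattern order."""
    rows = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            rows.append(Row(match["name"], match["title"], match["org"]))
    return rows
```

Calling `fill_slots(TEXT)` yields three rows, one each for Bill Gates, Bill Veghte and Richard Stallman, matching the table above.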
7 What is Information Extraction
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
Segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates,
Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder,
Free Software Foundation
aka named entity extraction
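The four techniques can be chained into a toy pipeline. The sketch below is only an illustration under stated assumptions: segmentation and classification use a hand-built lexicon (`LEXICON`), association links the first name, title and organization found in the same sentence, and clustering groups a mention with the longest same-class mention that contains it. Real systems replace each of these heuristics with learned or hand-engineered models.

```python
import re

TEXT = (
    "Microsoft Corporation CEO Bill Gates railed against open-source software. "
    "Gates himself says Microsoft will disclose its code. "
    "Bill Veghte, a Microsoft VP, agreed. "
    "Richard Stallman, founder of the Free Software Foundation, countered."
)

# Hypothetical gazetteer: surface form -> class label.
LEXICON = {
    "Microsoft Corporation": "ORG", "Microsoft": "ORG",
    "Free Software Foundation": "ORG",
    "Bill Gates": "NAME", "Gates": "NAME",
    "Bill Veghte": "NAME", "Richard Stallman": "NAME",
    "CEO": "TITLE", "VP": "TITLE", "founder": "TITLE",
}

def segment(text):
    """Segmentation: find (start, end, surface) spans, preferring longer forms."""
    forms = sorted(LEXICON, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, forms)) + r")\b")
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]

def classify(spans):
    """Classification: attach a class label to every segment."""
    return [(start, end, form, LEXICON[form]) for start, end, form in spans]

def associate(text, mentions):
    """Association: one (NAME, TITLE, ORG) triple per sentence containing all three."""
    relations = []
    for sentence in re.finditer(r"[^.]+\.", text):
        by_class = {}
        for start, _, form, label in mentions:
            if sentence.start() <= start < sentence.end():
                by_class.setdefault(label, form)  # keep the first of each class
        if {"NAME", "TITLE", "ORG"}.issubset(by_class):
            relations.append((by_class["NAME"], by_class["TITLE"], by_class["ORG"]))
    return relations

def cluster(mentions):
    """Clustering: group each form under the longest same-class form containing it."""
    forms = {(form, label) for _, _, form, label in mentions}
    clusters = {}
    for form, label in forms:
        canonical = max((f for f, l in forms if l == label and form in f), key=len)
        clusters.setdefault(canonical, set()).add(form)
    return clusters

mentions = classify(segment(TEXT))
relations = associate(TEXT, mentions)
clusters = cluster(mentions)
```

On this text, `segment` recovers the twelve segments listed above except that the shortened passage contains one fewer "Microsoft", `relations` holds the three name/title/organization triples, and `clusters` puts "Gates" with "Bill Gates" and "Microsoft" with "Microsoft Corporation".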
11 An example: FASTUS (1993)
- Bridgestone Sports Co. said Friday it had set up
a joint venture in Taiwan with a local concern
and a Japanese trading house to produce golf
clubs to be supplied to Japan.
- The joint venture, Bridgestone Sports Taiwan Co.,
capitalized at 20 million new Taiwan dollars,
will start production in January 1990 with
production of 20,000 iron and metal wood clubs
a month.
15 How FASTUS works
1. Complex words and proper names
   e.g. "set up", "new Taiwan dollars"
2. Simple phrases: noun groups, verb groups, particles
   e.g. "a Japanese trading house", "had set up"
3. Complex phrases
4. Relevant events: construction of simple templates
5. Template merging, when templates carry information
   about the same event
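Steps 4 and 5 can be illustrated with a short sketch. FASTUS itself is built as a cascade of finite-state transducers; the code below does not reproduce it. It collapses steps 1 to 3 into two hand-written regular expressions (the `COMPANY` pattern stands in for proper-name recognition), builds one partial template per matching event pattern, and then merges templates whose shared slots agree. The slot names (`partner`, `venture`, `capital`, `location`) and the `compatible` test are assumptions made for the example.

```python
import re

SENTENCES = [
    "Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan "
    "with a local concern and a Japanese trading house to produce golf clubs.",
    "The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million "
    "new Taiwan dollars, will start production in January 1990.",
]

# Stand-in for step 1: a proper-name pattern for company names ending in "Co."
COMPANY = r"(?:[A-Z][a-z]+ )+Co\."

# Stand-in for steps 2-4: each pattern describes one way the event is expressed.
EVENT_PATTERNS = [
    re.compile(
        rf"(?P<partner>{COMPANY}) .*?set up a joint venture in (?P<location>[A-Z][a-z]+)"
    ),
    re.compile(
        rf"[Tt]he joint venture, (?P<venture>{COMPANY}), "
        rf"capitalized at (?P<capital>[\d,]+ million new Taiwan dollars)"
    ),
]

def extract_events(sentence):
    """Step 4: every matching event pattern yields one partial template."""
    templates = []
    for pattern in EVENT_PATTERNS:
        match = pattern.search(sentence)
        if match:
            templates.append({"event": "joint venture", **match.groupdict()})
    return templates

def compatible(a, b):
    """Two templates may describe the same event if their shared slots agree."""
    return all(a[key] == b[key] for key in a.keys() & b.keys())

def merge_templates(templates):
    """Step 5: fold each template into the first compatible one seen so far."""
    merged = []
    for template in templates:
        for target in merged:
            if compatible(target, template):
                target.update(template)
                break
        else:
            merged.append(dict(template))
    return merged

partial = [t for sentence in SENTENCES for t in extract_events(sentence)]
result = merge_templates(partial)
```

Each sentence contributes one partial template, and because the two agree on the only slot they share (`event`), `result` contains a single merged template with five slots. Two templates that disagreed on a shared slot, for example on `location`, would be kept apart.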
17 Another example: a wrong template
... Jurgen Pfrang, 51, reportedly stumbled upon
the robbers on the second floor of his Nanjing
home early on Sunday. The deputy general manager
of Yaxing Benz, a Sino-German joint venture that
makes buses and bus chassis in nearby
Yangzhou, was hacked to death with 45 cm
watermelon knives. ...
Wrong template:
Name of the Venture: Yaxing Benz
Products: buses and bus chassis
Location: Yangzhou, China
Companies involved:
  (1) Name: X?  Country: German
  (2) Name: Y?  Country: China
18 The right template
A German vehicle-firm executive was stabbed to
death ... Jurgen Pfrang, 51, reportedly
stumbled upon the robbers on the second floor of
his Nanjing home early on Sunday. The deputy
general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis
in nearby Yangzhou, was hacked to death with 45
cm watermelon knives. ...
Crime-Type: Murder
Type: Stabbing
The killed:
  Name: Jurgen Pfrang
  Age: 51
  Profession: Deputy general manager
Location: Nanjing, China
19 Who performs the interpretation?
(1) IR
(2) Passage retrieval
(3) IE
(4) Question answering
(5) Text understanding
20 IR system
[Diagram: set of texts]
22 Passage retrieval, IR
[Diagram: set of texts]
23 Passage retrieval, IR
[Diagram: IE system; set of texts; texts]
24 IE system
[Diagram: templates; texts]
25 IE: a pragmatic approach to NLP
[Diagram: interpretation; IE; templates; texts; predefined]
26 Performance evaluation
(1) IR
(2) Passage retrieval
(3) IE
(4) Question answering
(5) Text understanding
27 Set of documents
28 Set of documents
Evaluation is further complicated by the possibility of
partially filled templates.
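One common way to handle partially filled templates is to score at the level of individual slots instead of whole templates. The sketch below assumes slot-level precision, recall and F1 as the metrics; the slides do not name a metric, so this choice, the flat slot names, and the example prediction are all assumptions. An empty slot is represented by `None`.

```python
# Gold template, flattened from the "right template" example.
GOLD = {
    "crime_type": "Murder",
    "type": "Stabbing",
    "victim_name": "Jurgen Pfrang",
    "victim_age": "51",
    "victim_profession": "Deputy general manager",
    "location": "Nanjing, China",
}

# Hypothetical system output: three slots filled (one wrongly), three left empty.
PREDICTED = {
    "crime_type": None,
    "type": None,
    "victim_name": "Jurgen Pfrang",
    "victim_age": "51",
    "victim_profession": None,
    "location": "Yangzhou, China",
}

def slot_scores(predicted, gold):
    """Return (precision, recall, f1) computed over individual slots."""
    filled = {slot: value for slot, value in predicted.items() if value is not None}
    correct = sum(1 for slot, value in filled.items() if gold.get(slot) == value)
    # Precision: share of the filled slots that are right.
    precision = correct / len(filled) if filled else 0.0
    # Recall: share of the gold slots that were recovered.
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = slot_scores(PREDICTED, GOLD)
```

Here two of the three filled slots are correct, so precision is 2/3, while only two of the six gold slots are recovered, so recall is 1/3. A template that is half empty is thus penalised on recall without being counted as entirely wrong.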