Recent advances in multi-document summarization

Dragomir Radev, University of Michigan, Ann Arbor, radev@umich.edu. Presentation at UC Berkeley SIMS, November 10, 2004.

1
Recent advances in multi-document summarization
  • Dragomir Radev, University of Michigan, Ann Arbor, radev@umich.edu
  • Presentation at UC Berkeley SIMS, November 10, 2004

2
WWW as a textual database
  • Large: $10^{10}$ pages, 200 TB [Lyman & Varian 03]; cf. brain ($10^{11}$ neurons)
  • Multilingual: English 56.4% of sites, German 7.7%, French 5.6%, Japanese 4.9%, Chinese 2.4%
  • Evolving: 22% of sites change every day, another 31% change every month [Cho & Garcia-Molina 00]
  • Uneven: importance at different levels
  • Adequate representations are needed for user-friendly access

3
Outline
  • Introduction
  • Random walks and social networks
  • LexRank
  • Projects in language modeling and machine learning

5
Natural Language Processing (NLP)
  • Typical NLP problems
  • Entity extraction
  • Relation extraction
  • Text classification
  • Summarization
  • Information retrieval
  • Machine translation
  • Question answering
  • Text understanding
  • Parsing
  • Word sense disambiguation
  • Lexical acquisition
  • Paraphrasing
  • NLP is very hard!
  • The pen is in the box.
  • Every American has a mother.
  • Boston called.
  • I saw Zoe. The poor girl looked tired.
  • Mary and Sue bought each other a book.
  • The spirit is willing but the flesh is weak.
  • Children make delicious snacks.
  • Army head seeks arms.
  • Czech President and playwright Havel to receive
    honors

6
Recent trends in NLP
  • Multidisciplinary
  • Statistical
  • Well founded
  • Scalable

7
Finding structure
  • Language doesn't have a regular structure (like a database)
  • Sentences are very unlike each other
  • Linguistic analysis: parse trees
  • Hard to generalize
  • Finding structure
  • Across sentences
  • Across sites/sources/documents
  • Over time
  • Representations
  • Graphs everywhere!

8
NewsInEssence
  • MEAD: salience-based extractive summarization
  • Centroid-based summarization (single and multi document)
  • Vector space model
  • Additional features: position, length, LexRank
  • (1000 downloads)
  • Cross-document structure theory (CST)
  • NIE: first robust news summarization system (2001)

18
Outline
  • Introduction
  • Random walks and social networks
  • LexRank
  • Projects in language modeling and machine
    learning

19
Social networks
  • Induced by a relation
  • Symmetric or not
  • Examples
  • Friendship networks
  • Board membership
  • Citations
  • Power grid of the US
  • WWW

20
Krebs 2004
21
Graph-based representations
Square connectivity (incidence) matrix P
Graph G = (V, E)
[Matrix: 8 × 8, nodes 1 to 8; nonzero entries per row: 3, 1, 2, 1, 4, 2, 0, 0]
22
Markov chains
  • A homogeneous Markov chain is defined by an
    initial distribution x and a Markov kernel P.
  • Path: sequence $(x_0, x_1, \dots, x_n)$.
  • The probability of a path can be computed as a product of probabilities for each step i.
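
The path probability described above can be sketched in a few lines. This is a minimal illustration, assuming the kernel is stored as a row-stochastic list of lists; the function name `path_probability` and the two-state example chain are hypothetical and not taken from the talk.

```python
from math import prod

def path_probability(x0, P, path):
    """Probability of a path (s0, s1, ..., sn) under initial distribution x0
    and Markov kernel P, where P[i][j] is the probability of stepping i -> j."""
    steps = zip(path, path[1:])
    return x0[path[0]] * prod(P[i][j] for i, j in steps)

# Hypothetical two-state chain, for illustration only.
x0 = [0.5, 0.5]
P = [[0.9, 0.1],
     [0.4, 0.6]]
p = path_probability(x0, P, [0, 0, 1, 1])  # 0.5 * 0.9 * 0.1 * 0.6
```

Because the chain is homogeneous, the same kernel `P` is used at every step; only the first factor comes from the initial distribution.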

23
Random walks
  • Access time $H_{ij}$: expected number of steps to go from i to j.
  • Example [Lovász 1993]: what is $H_{ij}$ on a path with nodes 0, 1, ..., n-1?
  • $H(k-1,k) = 2k-1$
  • $H(i,k) = H(i,k-1) + 2k-1$
  • $H(i,k) = (2i+1) + (2i+3) + \dots + (2k-1) = k^2 - i^2$
  • $H(0,k) = k^2$
  • (Brownian motion: travel distance $\sqrt{t}$ in time t)
  • Electrical networks
  • $R_{st}$ is the resistance between two nodes s and t. The round-trip travel time between s and t is exactly $2mR_{st}$, where m is the number of edges.
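
The access times on a path can be checked numerically. The sketch below assumes a simple random walk (each neighbour equally likely, reflecting at the two ends) and solves the standard first-step equations exactly with fractions; the helper names `access_times_to` and `path_kernel` are hypothetical.

```python
from fractions import Fraction

def access_times_to(target, P):
    """Solve h[j] = 1 + sum_m P[j][m] * h[m] for j != target, with h[target] = 0,
    by Gauss-Jordan elimination over exact fractions."""
    n = len(P)
    idx = [j for j in range(n) if j != target]
    # Augmented system (I - Q) h = 1, where Q is P restricted to non-target nodes.
    A = [[(Fraction(1) if a == b else Fraction(0)) - Fraction(P[a][b]) for b in idx]
         + [Fraction(1)] for a in idx]
    m = len(idx)
    for col in range(m):
        pivot = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(m):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    h = {target: Fraction(0)}
    for row, j in zip(A, idx):
        h[j] = row[-1]
    return h

def path_kernel(n):
    """Simple random walk on the path 0 - 1 - ... - (n-1)."""
    P = [[Fraction(0)] * n for _ in range(n)]
    for j in range(n):
        nbrs = [m for m in (j - 1, j + 1) if 0 <= m < n]
        for m in nbrs:
            P[j][m] = Fraction(1, len(nbrs))
    return P

n, k = 8, 5
h = access_times_to(k, path_kernel(n))  # h[i] is H(i, k)
```

For every start node i at or below k, `h[i]` equals k² − i², in agreement with the closed form on the slide.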

24
Stationary solutions
  • The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:
  • E is stochastic
  • E is irreducible
  • E is aperiodic
  • To make these conditions true:
  • All rows of E add up to 1 (and no value is negative)
  • Make sure that E is strongly connected
  • Make sure that E is not bipartite
  • Example: PageRank [Brin and Page 1998]: use teleportation
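
One way to satisfy all three conditions at once is to mix the walk with uniform jumps. The sketch below is an illustration under stated assumptions, not the construction used in the talk: the function name `with_teleportation`, the jump probability `d=0.15`, and the treatment of nodes without out-links (replaced by a uniform row) are all assumptions.

```python
def with_teleportation(adj, d=0.15):
    """Turn a 0/1 adjacency matrix into a row-stochastic kernel.

    Each row is normalised to sum to 1 (a row with no out-links becomes uniform),
    then mixed with the uniform kernel: with probability d the walker jumps to a
    node chosen uniformly at random. Every entry ends up positive, so the chain
    is irreducible and aperiodic.
    """
    n = len(adj)
    kernel = []
    for row in adj:
        out = sum(row)
        base = [v / out for v in row] if out else [1.0 / n] * n
        kernel.append([d / n + (1 - d) * b for b in base])
    return kernel

# Hypothetical 3-node graph: 0 -> 1, 1 -> 0, and node 2 has no out-links.
E = with_teleportation([[0, 1, 0],
                        [1, 0, 0],
                        [0, 0, 0]])
```

Without the mixing step this example graph would be both disconnected and periodic (nodes 0 and 1 alternate forever); after mixing, every transition has positive probability.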

25
Example
This graph E has a second graph superimposed on it: the uniform transition graph.
26
Eigenvectors
  • An eigenvector is an implicit direction for a matrix.
  • $Ev = \lambda v$, where v is non-zero, though $\lambda$ can be any complex number in principle.
  • The largest eigenvalue of a stochastic matrix E is $\lambda_1 = 1$.
  • For $\lambda_1$, the left (principal) eigenvector is p, the right eigenvector is 1 (the all-ones vector).
  • In other words, $E^T p = p$.

27
Prestige and centrality
  • Degree centrality: how many neighbors each node has.
  • Closeness centrality: how close an actor is to all of the other nodes.
  • Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes.
  • Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.
  • Prestige: same as centrality but for directed graphs.
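
The first two measures can be sketched directly. This is a minimal illustration assuming a connected, undirected, unweighted graph stored as adjacency lists, with closeness taken as (n − 1) divided by the sum of shortest-path distances (one common normalisation; the slide does not specify one). The star graph is a hypothetical example.

```python
from collections import deque

def degree_centrality(graph):
    """Number of neighbours of each node."""
    return {v: len(nbrs) for v, nbrs in graph.items()}

def closeness_centrality(graph):
    """(n - 1) divided by the sum of shortest-path distances to the other nodes.
    Assumes a connected, undirected, unweighted graph."""
    n = len(graph)
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        scores[source] = (n - 1) / sum(dist.values())
    return scores

# Hypothetical star graph: node "a" is linked to every other node.
star = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
degree = degree_centrality(star)
closeness = closeness_centrality(star)
```

In the star, the hub scores highest on both measures: degree 3 and closeness 1.0, against degree 1 and closeness 0.6 for each leaf.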

28
Computing the stationary distribution
Solution for the stationary distribution:
function PowerStatDist(E)
begin
  p(0) = u; i = 1
  repeat
    p(i) = E^T p(i-1)
    L = ||p(i) - p(i-1)||_1
    i = i + 1
  until L < ε
end
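
A runnable version of the power method might look as follows. It is a sketch under assumptions: the start vector u is taken to be the uniform distribution, an iteration cap `max_iter` is added as a safeguard, and the two-state matrix is a hypothetical example.

```python
def power_stat_dist(E, eps=1e-10, max_iter=10_000):
    """Power iteration for the stationary distribution of a row-stochastic
    matrix E (list of rows): repeat p <- E^T p until the L1 change is below eps."""
    n = len(E)
    p = [1.0 / n] * n  # assumed start: the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < eps:
            break
    return p

# Hypothetical two-state chain; its stationary distribution is (0.8, 0.2).
E = [[0.9, 0.1],
     [0.4, 0.6]]
p = power_stat_dist(E)
```

The result can be checked by hand: the balance condition 0.1 · p[0] = 0.4 · p[1] together with p[0] + p[1] = 1 gives (0.8, 0.2).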
29
Example
30
Outline
  • Introduction
  • Random walks and social networks
  • LexRank

31
Centrality in summarization
  • Motivation: capture the most central words in a document or cluster
  • Centroid score [Radev et al. 2000, 2004a]
  • Alternative methods for computing centrality?

32
Sample multidocument cluster
(DUC cluster d1003t)
1 (d1s1) Iraqi Vice President Taha Yassin Ramadan announced today, Sunday, that Iraq refuses to back down from its decision to stop cooperating with disarmament inspectors before its demands are met.
2 (d2s1) Iraqi Vice president Taha Yassin Ramadan announced today, Thursday, that Iraq rejects cooperating with the United Nations except on the issue of lifting the blockade imposed upon it since the year 1990.
3 (d2s2) Ramadan told reporters in Baghdad that "Iraq cannot deal positively with whoever represents the Security Council unless there was a clear stance on the issue of lifting the blockade off of it."
4 (d2s3) Baghdad had decided late last October to completely cease cooperating with the inspectors of the United Nations Special Commission (UNSCOM), in charge of disarming Iraq's weapons, and whose work became very limited since the fifth of August, and announced it will not resume its cooperation with the Commission even if it were subjected to a military operation.
5 (d3s1) The Russian Foreign Minister, Igor Ivanov, warned today, Wednesday against using force against Iraq, which will destroy, according to him, seven years of difficult diplomatic work and will complicate the regional situation in the area.
6 (d3s2) Ivanov contended that carrying out air strikes against Iraq, who refuses to cooperate with the United Nations inspectors, "will end the tremendous work achieved by the international group during the past seven years and will complicate the situation in the region."
7 (d3s3) Nevertheless, Ivanov stressed that Baghdad must resume working with the Special Commission in charge of disarming the Iraqi weapons of mass destruction (UNSCOM).
8 (d4s1) The Special Representative of the United Nations Secretary-General in Baghdad, Prakash Shah, announced today, Wednesday, after meeting with the Iraqi Deputy Prime Minister Tariq Aziz, that Iraq refuses to back down from its decision to cut off cooperation with the disarmament inspectors.
9 (d5s1) British Prime Minister Tony Blair said today, Sunday, that the crisis between the international community and Iraq "did not end" and that Britain is still "ready, prepared, and able to strike Iraq."
10 (d5s2) In a gathering with the press held at the Prime Minister's office, Blair contended that the crisis with Iraq "will not end until Iraq has absolutely and unconditionally respected its commitments" towards the United Nations.
11 (d5s3) A spokesman for Tony Blair had indicated that the British Prime Minister gave permission to British Air Force Tornado planes stationed in Kuwait to join the aerial bombardment against Iraq.
33
Cosine between sentences
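  • Let s1 and s2 be two sentences.
  • Let x and y be their representations in an n-dimensional vector space.
  • The cosine between them is then computed from the inner product of the two, normalized by their lengths: $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$.
  • With non-negative term weights, the cosine ranges from 0 to 1.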
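
A minimal sketch of the sentence cosine follows. It assumes raw term-frequency vectors over lowercased whitespace tokens, with no stemming, stop-word removal, or idf weighting; the weighting actually used in the talk is not stated on this slide, and the example sentences are made up.

```python
import math
from collections import Counter

def cosine(s1, s2):
    """Cosine between two sentences represented as raw term-frequency vectors.
    Assumption: lowercased whitespace tokens, no stemming and no idf weighting."""
    x, y = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0

same = cosine("Iraq refuses to cooperate", "Iraq refuses to cooperate")
none = cosine("Iraq refuses to cooperate", "smoke poured from the opening")
some = cosine("Iraq refuses to cooperate", "Iraq must resume cooperation")
```

Identical sentences score 1, sentences with no shared token score 0, and the third pair shares only "iraq", giving 1 / (2 · 2) = 0.25.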

34
LexRank (Cosine centrality)
1 2 3 4 5 6 7 8 9 10 11
1 1.00 0.45 0.02 0.17 0.03 0.22 0.03 0.28 0.06 0.06 0.00
2 0.45 1.00 0.16 0.27 0.03 0.19 0.03 0.21 0.03 0.15 0.00
3 0.02 0.16 1.00 0.03 0.00 0.01 0.03 0.04 0.00 0.01 0.00
4 0.17 0.27 0.03 1.00 0.01 0.16 0.28 0.17 0.00 0.09 0.01
5 0.03 0.03 0.00 0.01 1.00 0.29 0.05 0.15 0.20 0.04 0.18
6 0.22 0.19 0.01 0.16 0.29 1.00 0.05 0.29 0.04 0.20 0.03
7 0.03 0.03 0.03 0.28 0.05 0.05 1.00 0.06 0.00 0.00 0.01
8 0.28 0.21 0.04 0.17 0.15 0.29 0.06 1.00 0.25 0.20 0.17
9 0.06 0.03 0.00 0.00 0.20 0.04 0.00 0.25 1.00 0.26 0.38
10 0.06 0.15 0.01 0.09 0.04 0.20 0.00 0.20 0.26 1.00 0.12
11 0.00 0.00 0.00 0.01 0.18 0.03 0.01 0.17 0.38 0.12 1.00
35
Cosine centrality (t = 0.3)
36
Cosine centrality (t = 0.2)
37
Cosine centrality (t = 0.1)
Sentences vote for the most central sentence!
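
The voting idea can be tried on the cosine matrix from the "LexRank (Cosine centrality)" slide. The sketch below counts, for each sentence, how many other sentences exceed the threshold t. Two details are assumptions: the comparison is strict (greater than t) and a sentence does not vote for itself. The function name `cosine_degree` is hypothetical.

```python
# Cosine matrix copied from the "LexRank (Cosine centrality)" slide; row i is sentence i + 1.
MATRIX = """
1.00 0.45 0.02 0.17 0.03 0.22 0.03 0.28 0.06 0.06 0.00
0.45 1.00 0.16 0.27 0.03 0.19 0.03 0.21 0.03 0.15 0.00
0.02 0.16 1.00 0.03 0.00 0.01 0.03 0.04 0.00 0.01 0.00
0.17 0.27 0.03 1.00 0.01 0.16 0.28 0.17 0.00 0.09 0.01
0.03 0.03 0.00 0.01 1.00 0.29 0.05 0.15 0.20 0.04 0.18
0.22 0.19 0.01 0.16 0.29 1.00 0.05 0.29 0.04 0.20 0.03
0.03 0.03 0.03 0.28 0.05 0.05 1.00 0.06 0.00 0.00 0.01
0.28 0.21 0.04 0.17 0.15 0.29 0.06 1.00 0.25 0.20 0.17
0.06 0.03 0.00 0.00 0.20 0.04 0.00 0.25 1.00 0.26 0.38
0.06 0.15 0.01 0.09 0.04 0.20 0.00 0.20 0.26 1.00 0.12
0.00 0.00 0.00 0.01 0.18 0.03 0.01 0.17 0.38 0.12 1.00
"""
COS = [[float(v) for v in line.split()] for line in MATRIX.strip().splitlines()]

def cosine_degree(cos, threshold):
    """For each sentence, count the other sentences whose cosine exceeds the threshold."""
    n = len(cos)
    return [sum(1 for j in range(n) if j != i and cos[i][j] > threshold) for i in range(n)]

votes = cosine_degree(COS, 0.1)
winner = votes.index(max(votes)) + 1  # 1-based sentence number
```

At t = 0.1, sentence 8 (d4s1) collects the most votes, with 8 of the 10 other sentences above the threshold.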
38
LexRank
  • T1...Tn are pages that link to A, c(Ti) is the outdegree of page Ti, and N is the total number of pages.
  • d is the damping factor, or the probability that we jump to a far-away node during the random walk. It accounts for disconnected components or periodic graphs.
  • When d = 1, we have a strict uniform distribution. When d = 0, the method is not guaranteed to converge to a unique solution.
  • Typical value for d is in [0.1, 0.2] (Brin and Page, 1998).
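
The formula itself did not survive in the transcript. The sketch below assumes the PageRank-style form p(A) = d/N + (1 − d) · Σ p(Ti)/c(Ti), reading d as the jump probability as in the slide's definition; this form, the function name `lexrank`, the fixed iteration count, and the 4-node graph are all assumptions.

```python
def lexrank(links, d=0.15, iters=100):
    """Iterate p(A) = d/N + (1 - d) * sum(p(T) / c(T) for T linking to A).

    links[T] lists the nodes T points to, so c(T) = len(links[T]).
    Assumption: every node has at least one outgoing link.
    """
    nodes = list(links)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: d / n for v in nodes}
        for t in nodes:
            share = (1 - d) * p[t] / len(links[t])  # T splits its score over its links
            for a in links[t]:
                nxt[a] += share
        p = nxt
    return p

# Hypothetical 4-sentence similarity graph (undirected, so links go both ways).
graph = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}
scores = lexrank(graph)
uniform = lexrank(graph, d=1.0)
```

Under this assumed form the scores always sum to 1, the best-connected node (node 1 here) ranks first, and setting d = 1 yields the uniform distribution.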

39
Cosine centrality vs. centroid centrality
40
Evaluation metrics
  • Difficult to evaluate summaries
  • Intrinsic vs. extrinsic evaluations
  • Extractive vs. non-extractive evaluations
  • Manual vs. automatic evaluations
  • ROUGE: mixture of n-gram recall for different values of n.
  • Example
  • Reference: The cat in the hat
  • System: The cat wears a top hat
  • 1-gram recall: 3/5; 2-gram recall: 1/4; 3,4-gram recall: 0
  • ROUGE-W: (weighted) longest common subsequence
  • Example above: 3/5
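
The numbers in the example can be reproduced with a short sketch. It is not the official ROUGE toolkit: it assumes lowercased whitespace tokens, clipped n-gram counts, and a plain (unweighted) longest common subsequence for the last figure. All function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(reference, system, n):
    """Clipped n-gram matches divided by the number of reference n-grams."""
    ref, hyp = ngrams(reference, n), ngrams(system, n)
    total = sum(ref.values())
    return sum(min(c, hyp[g]) for g, c in ref.items()) / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

reference = "the cat in the hat".split()
system = "the cat wears a top hat".split()
r1 = ngram_recall(reference, system, 1)   # matches: the, cat, hat
r2 = ngram_recall(reference, system, 2)   # match: "the cat"
r3 = ngram_recall(reference, system, 3)
lcs_recall = lcs_length(reference, system) / len(reference)  # LCS: the cat hat
```

Clipping matters for the unigram case: "the" occurs twice in the reference but only once in the system output, so it is credited once.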

41
CODE ROUGE-1 ROUGE-2 ROUGE-W
C0.5 0.39013 0.10459 0.12202
C10 0.38539 0.10125 0.11870
C1.5 0.38074 0.09922 0.11804
C1 0.38181 0.10023 0.11909
C2.5 0.37985 0.10154 0.11917
C2 0.38001 0.09901 0.11772
Degree0.5T0.1 0.39016 0.10831 0.12292
Degree0.5T0.2 0.39076 0.11026 0.12236
Degree0.5T0.3 0.38568 0.10818 0.12088
Degree1.5T0.1 0.38634 0.10882 0.12136
Degree1.5T0.2 0.39395 0.11360 0.12329
Degree1.5T0.3 0.38553 0.10683 0.12064
Degree1T0.1 0.38882 0.10812 0.12286
Degree1T0.2 0.39241 0.11298 0.12277
Degree1T0.3 0.38412 0.10568 0.11961
Lpr0.5T0.1 0.39369 0.10665 0.12287
Lpr0.5T0.2 0.38899 0.10891 0.12200
Lpr0.5t0.3 0.38667 0.10255 0.12244
Lpr1.5t0.1 0.39997 0.11030 0.12427
Lpr1.5t0.2 0.39970 0.11508 0.12422
Lpr1.5t0.3 0.38251 0.10610 0.12039
Lpr1T0.1 0.39312 0.10730 0.12274
Lpr1T0.2 0.39614 0.11266 0.12350
Lpr1T0.3 0.38777 0.10586 0.12157
42
Evaluation results
  • Centroid: C0.5, C10, C1.5, C1, C2.5, C2
  • Degree: D0.5T0.1, D0.5T0.2, D0.5T0.3, D1.5T0.1, D1.5T0.2, D1.5T0.3, D1T0.1, D1T0.2, D1T0.3
  • LexRank: Lr0.5T0.1, Lr0.5T0.2, Lr0.5t0.3, Lr1.5t0.1, Lr1.5t0.2, Lr1.5t0.3, Lr1T0.1, Lr1T0.2, Lr1T0.3

ROUGE-2: Lr1.5t0.2 0.115, D1.5T0.2 0.114, D1T0.2 0.113, C1.5 0.099
ROUGE-1: Lr1.5t0.1 0.400, Lr1.5t0.2 0.400, Lr1T0.2 0.396, C1 0.382
ROUGE-W: Lr1.5t0.1 0.124, Lr1.5t0.2 0.124, Lr1T0.2 0.124, C2 0.118
43
DUC results
Peer code Task ROUGE-1 ROUGE-2 ROUGE-3 ROUGE-4 ROUGE-L ROUGE-W
141 3 5 2 1 1 2 2
142 3 5 1 1 1 4 3
143 4 1 2 1 1 6 6
144 4 3 1 1 1 7 7
145 4 1 2 2 2 4 4
44
Results and applications
  • DUC results (MU recall, ROUGE)
  • 1st place 2003 (duc.nist.gov)
  • 1-2 place 2004
  • applications
  • Web page summarization (WIE)
  • Topical crawling
  • Answer focused
  • wireless access
  • Cross-lingual
  • IR-based evaluation
  • Knowledge based
  • Beyond summarization
  • Classification
  • WSD
  • Spam recognition

46
Outline
  • Introduction
  • Random walks and social networks
  • LexRank
  • Projects in language modeling and machine
    learning

47
Syntax in Statistical Machine Translation
  • Noisy channel model: assume that a source sentence has to be translated into a target language sentence
  • Goal: find the most probable target sentence given the source sentence
  • Obvious problems can be fixed with syntax (?)
  • JHU 02 and 03 projects
  • (Franz Och, Jan Hajic, Dan Gildea, and others)
  • Solution: using log-linear combination of features

48
Setup
  • Given a Chinese sentence
  • The top 1000 candidate translations in English
  • Parse all of these
  • Compute features: monolingual, bilingual, syntax-free, and syntactic
  • Evaluation using BLEU (BiLingual Evaluation Understudy)
  • Example
  • Is the number of constituents across languages the same?
  • Is the English tree grammatical?
  • Are the two sentences of comparable length?
  • Feature combination
  • Use a greedy maxbleu algorithm
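
Reranking an n-best list with a log-linear combination of features can be sketched as below. Everything concrete here is hypothetical: the feature names, feature values, and weights are made up for illustration, and the greedy maxbleu search that would tune the weights is not shown.

```python
def rerank(candidates, weights):
    """Return the candidate with the highest log-linear score sum_k w_k * f_k."""
    def score(candidate):
        return sum(weights[name] * value for name, value in candidate["features"].items())
    return max(candidates, key=score)

# Hypothetical 3-best list with made-up feature values (log-probability-like scores).
nbest = [
    {"text": "candidate A", "features": {"baseline": -12.0, "parse": -40.0}},
    {"text": "candidate B", "features": {"baseline": -12.5, "parse": -35.0}},
    {"text": "candidate C", "features": {"baseline": -14.0, "parse": -34.0}},
]
baseline_only = rerank(nbest, {"baseline": 1.0, "parse": 0.0})
with_syntax = rerank(nbest, {"baseline": 1.0, "parse": 0.2})
```

With the baseline feature alone, candidate A wins; giving the syntactic feature a small weight promotes candidate B, which is the kind of reordering the added features are meant to produce.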

49
Chinese parse tree
[Figure: parse tree of the Chinese source sentence, with node labels IP, NP, QP, VP, CLP, NR, CD, M, NN, VV; word-by-word gloss: China 14 border open cities economic achievements marked]
50
Multiple references
1. fourteen chinese open border cities make significant achievements in economic construction
2. xinhua news agency report of february 12 from beijing - the fourteen chinese border cities that have been opened to foreigners achieved satisfactory results in their economic construction in 1995 .
3. according to statistics , the cities achieved a combined gross domestic product of rmb 19 billion last year , an increase of more than 90 % over 1991 before their opening .
4. the state council successively approved the opening of fourteen border cities to foreigners in 1992 , including heihe , pingxiang , hunchun , yining and ruili , and permitted them to set up 14 border economic cooperation zones .

1. significant accomplishment achieved in the economic construction of the fourteen open border cities in china
2. xinhua news agency , beijing , feb. 12 - exciting accomplishment has been achieved in 1995 in the economic construction of china 's fourteen border cities open to foreigners .
3. statistics have indicated that these cities produced a combined gdp of over 19 billion yuan last year , an increase of more than 90 % , compared with that in 1991 before the cities were open to foreigners .
4. in 1992 , the state council successively opened fourteen border cities to foreigners . these included heihe , pingxiang , huichun , yining , and ruili . meanwhile , the state council also gave its approval to these cities to establish fourteen border zones for economic cooperation .

1. in china , fourteen cities along the border opened to foreigners achieved remarkable economic development
2. xinhua news agency , beijing , february 12 - the economic development in china 's fourteen cities along the border opened to foreigners achieved gratifying results in 1995 .
3. according to statistics , these cities completed a gross domestic product in excess of rmb 19 billion in last year , an increase of more than 90 % over 1991 ( the year before they were opened ) .
4. in 1992 , the state council successively approved fourteen cities along the border to be opened to foreigners , which included hei he , pingxiang , hunchun , yining and ruili etc. at the same time , these cities were also given approvals to set up fourteen border @-@ economic @-@ cooperation zones .

1. economic construction achievement is prominent in china 's fourteen border opening up cities .
2. xinhua news agency , beijing , february 12 - delightful economic construction result was achieved in china 's fourteen border opening up cities in 1995 .
3. according to statistics , gdp registered over 19 billion yuan last year in those cities , over 90 % higher than those of year 1991 before opening up .
4. fourteen border cities like heihe , pingxiang , huichun , yinin , and ruili etc were approved successively by the state council in 1992 as the cities opening to the outside world , setting up of fourteen border economic cooperation zones in these cities were also approved simultaneously .

1. china 's 14 open border cities marked economic achievements
2. xinhua news agency , beijing , february 12 chinese 14 border an open city 1995 economic development to achieve good results
3. according to statistics , the city last year 's gross domestic product ( gdp ) over 19 billion yuan , and opening up of more than 90 % growth in 1991 .
4. the state council in 1992 has approved the heihe , pingxiang , huichun , yining and ruili , 14 border cities as an open city , and the city also approved a total of 14 border economic cooperation .
51
Syntactic features
(S1 (S (PP (IN in) (NP (NNP china))) (, ,) (NP (NP (CD fourteen) (NNS cities)) (PP (IN along) (NP (DT the) (NN border)))) (VP (VBN opened) (PP (TO to) (NP (NP (NNS foreigners)) (VP (VBN achieved) (NP (JJ remarkable) (JJ economic) (NN development))))))))

(S1 (NP (NP (JJ significant) (NN accomplishment)) (VP (VBN achieved) (PP (IN in) (NP (NP (DT the) (JJ economic) (NN construction)) (PP (IN of) (NP (NP (DT the) (CD fourteen) (JJ open) (NN border) (NNS cities)) (PP (IN in) (NP (NNP china))))))))))

(S1 (S (NP (CD fourteen) (ADJP (JJ chinese) (JJ open)) (NN border) (NNS cities)) (VP (VBP make) (NP (JJ significant) (NNS achievements)) (PP (IN in) (NP (JJ economic) (NN construction))))))

(S1 (S (NP (JJ economic) (NN construction) (NN achievement)) (VP (AUX is) (ADJP (JJ prominent) (PP (IN in) (S (NP (NP (NNP china) (POS 's)) (NP (CD fourteen) (NN border))) (VP (VBG opening) (PRT (RP up)) (NP (NNS cities)))))))))

(S1 (S (NP (NP (NNP china) (POS 's)) (CD 14) (ADJP (JJ open)) (NN border) (NNS cities)) (VP (VBD marked) (NP (JJ economic) (NNS achievements)))))
52
Flipdeps
54
Results
  • BLEU baseline: 31.6
  • Most features: 30.0-31.8
  • Flipdeps: 31.8
  • Best single feature: 32.5
  • Best combination: 32.9 (statistically significant improvement)
  • Results in [Och et al. 04]

55
Phylogenetic Text Modeling
Machine translation identification
??????????????????????????
1. Other Party, governmental and law enforcement authorities must take similar actions beginning from the start of next year.
2. Other Party and government agencies and judicial departments must also take similar actions early next year.
3. All other Party, Government and Judicial Departments must start similar actions at the beginning of next year.
4. Other Party, government, and judicatory departments must take similar action at the beginning of next year.
5. Other party and government departments as well as judicial departments must take similar action from the beginning of next year.
6. All other party government and judicial departments must also take similar measures from the beginning of next year.
7. Other party and judicial authorities should take similar actions from the beginning of next year.
8. Other departments of the Party, the government and the judicial departments must also take similar actions early next year.
9. Other Party and Government departments as well as judicial departments must also take similar measures from the beginning of next year.
10. The other law enforcement agencies and departments will also take part in similar proceedings from the beginning of next year.
11. Other party, governmental and judicial departments will have to take similar action from the beginning of next year.
12. Other party politics and judicial department also will have to start from next year beginning of the year to adopt similar motion.
13. Other party and judicial section must start from the beginning of year of next year taking similar action also
14. The beginning of a year for and res judiciaria as welling must from next year of other party commences assuming is similar toing the proceeding.
15. At the beginning of next year politics and judicial department other parties must also start to pick to take similar action.
16. Other party politics and the judicial department also will have to start from at the beginning of next year to take the similar action.
17. Other party policies and judicial department must also begin from early next year to take similar action.
56
t-test, p < 0.05
Chinese: Levenshtein 50/50, BLEU 50/50
Arabic: Levenshtein 50/50, BLEU 48/50
57
Chronological ordering
S1: Italian TV says the crash put a hole in the 25th floor of the Pirelli building, and that smoke is pouring from the opening. (04/18/02 12:22)
S2: Italian TV showed a hole in the side of the Pirelli building with smoke pouring from the opening. (04/18/02 12:32)
S3: Italian state television said the crash put a hole in the 25th floor of the Pirelli building. (04/18/02 12:42)
S4: Italian state television said the crash put a hole in the 25th floor of the 30-story building. (04/18/02 12:44)
58
Best representation: stop words removed
60
CNN, 4/18/02 12:22pm
A small plane has hit a skyscraper in central Milan, setting the top floors of the 30-story building on fire, an Italian journalist told CNN. The crash by the Piper tourist plane into the 26th floor occurred at 5:50 p.m. (1450 GMT) on Thursday, said journalist Desideria Cavina. The building houses government offices and is next to the city's central train station. Several storeys of the building were engulfed in fire, she said. Italian TV says the crash put a hole in the 25th floor of the Pirelli building, and that smoke is pouring from the opening. Police and ambulances are at the scene. Many people were on the streets as they left work for the evening at the time of the crash. Police were trying to keep people away, and many ambulances were on the scene. There is no word yet on casualties.

CNN, 4/18/02 12:32pm
A small plane has hit a skyscraper in central Milan, setting the top floors of the 30-story building on fire, an Italian journalist told CNN. The crash by the Piper tourist plane into the 26th floor occurred at 5:50 p.m. (1450 GMT) on Thursday, said journalist Desideria Cavina. The building houses government offices and is next to the city's central train station. Several storeys of the building were engulfed in fire, she said. Italian TV showed a hole in the side of the Pirelli building with smoke pouring from the opening. RAI state TV reported that the plane had apparently radioed an SOS because of engine trouble. Earlier though, in Rome, the senate's president, Marcello Pera, said it "very probably" appeared to be a terrorist attack. Police and ambulances are at the scene. Many people were on the streets as they left work for the evening at the time of the crash. Police were trying to keep people away, and many ambulances were on the scene. There is no word yet on casualties. TV pictures from the scene evoked horrific memories of the September 11 attacks on the World Trade Center in New York and the collapse of the building's twin towers. "I heard a strange bang so I went to the window and outside I saw the windows of the Pirelli building blown out and then I saw smoke coming from them," said Gianluca Liberto, an engineer who was working in the area told Reuters. The building is known as the Pirelli skyscraper but the Italian tyre and cable company does not operate out of the building. It is one of the symbols of Italy's financial capital and is one of the world's tallest concrete buildings, designed between 1955 and 1960.

ABCNews, 4/18/02 1:00pm
A small plane crashed into a skyscraper in downtown Milan today, setting several floors of the 30-story building on fire. The plane crashed into the 25th floor of the Pirelli building in downtown Milan. The weather was clear at the time of the crash. Smoke poured from the opening as police and ambulances rushed to the area. The president of the Italian Senate, Marcello Pera, told Italian television it "very probably" appeared to be a terrorist attack but soon afterwards his spokesman said it was probably an accident. A transport official told Reuters the plane had reported problems with its undercarriage and was circling the city ahead of trying to land at a local airport. The Pirelli building houses the administrative offices of the local Lombardy region and sits next to the city's central train station. It is constructed of concrete and glass. The crash happened just before rush hour, as office workers were closing their day.

MSNBC, 4/18/02 1:00pm
A small airplane crashed into a government building in heart of Milan, setting the top floors on fire, Italian police reported. There were no immediate reports on casualties as rescue workers attempted to clear the area in the city's financial district. Few details of the crash were available, but news reports about it immediately set off fears that it might be a terrorist act akin to the Sept. 11 attacks in the United States. Those fears sent U.S. stocks tumbling to session lows in late morning trading. Witnesses reported hearing a loud explosion from the 30-story office building, which houses the administrative offices of the local Lombardy region and sits next to the city's central train station. Italian state television said the crash put a hole in the 25th floor of the Pirelli building. News reports said smoke poured from the opening. Police and ambulances rushed to the building in downtown Milan. No further details were immediately available.

La Stampa, 4/18/02 12:45pm
Un aereo da turismo, un Piper si è schiantato questo pomeriggio a Milano, poco prima delle 18, contro il grattacielo Pirelli, sede anche della Regione Lombardia (il presidente della Regione, Roberto Formigoni, è in missione ufficiale in India con una delegazione della regione). Lo si è appreso in ambienti investigativi. L' impatto sarebbe avvenuto attorno al 25/o piano dei 30 del grattacielo. Almeno sei piani alla vista risultano sventrati. I detriti sono stati lanciati dal'esplosione a una quarantina di metri intorno all'edificio. In tutta l'area attorno al grattacielo Pirelli le comunicazioni telefoniche anche via cellulare sono interrotte o quasi impossibili. La Borsa ha sospeso la seduta serale a Piazza Affari dopo lo schianto dell'aereo da turismo, anche il presidente Bush è stato subito avvertito dell'espolosione al Pirellone. Con molta probabilità si tratta di un attentato. Lo ha detto Marcello Pera aprendo la seduta a Palazzo Madama. Ma secondo quanto si è appreso, l'aereo da turismo era probabilmente in avaria: il pilota, infatti, avrebbe lanciato l'SOS, raccolto dalla torre di controllo di Linate.
61
Fact tracking
  • 04/18/02 13:17 (CNN)
  • The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building's 26th floor at 5:50 p.m. (1450 GMT) on Thursday.
  • 04/18/02 13:42 (ABCNews)
  • The plane was destined for Italy's capital Rome, but there were conflicting reports as to whether it had come from Locarno, Switzerland or Sofia, Bulgaria.
  • 04/18/02 13:42 (CNN)
  • The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building's 26th floor at 5:50 p.m. (1450 GMT) on Thursday.
  • 04/18/02 13:42 (FoxNews)
  • The plane had taken off from Locarno, Switzerland, and was heading to Milan's Linate airport, De Simone said.

62
Questions from Milan corpus
1. How many people were injured?
2. How many people were killed? (age, number, gender, description)
3. Was the pilot killed?
4. Where was the plane coming from?
5. Was it an accident (technical problem, illness, terrorist act)?
6. Who was the pilot? (age, number, gender, description)
7. When did the plane crash?
8. How tall is the Pirelli building?
9. Who was on the plane with the pilot?
10. Did the plane catch fire before hitting the building?
11. What was the weather like at the time of the crash?
12. When was the building built?
13. What direction was the plane flying?
14. How many people work in the building?
15. How many people were in the building at the time of the crash?
16. How many people were taken to the hospital?
17. What kind of aircraft was used?
63
  • Changing answers
  • How many people were injured? 40 different
    answers!
  • "no word yet on casualties/injuries", "20 people were taken to a nearby hospital", "20 to 30 people were hospitalized with injuries", "many people were injured", "there was no official word on the number of people injured in the building", "at least 20 injured were taken to hospital from the scene", "dozens of people had been taken to the hospital", "injuring dozens", "injuring at least 30", "injuring 60", "dozens were injured", "60 others were injured", "the number of injured, originally at 60, was revised downward Friday to 36".
  • Only 24 hours after the crash do agencies settle on the accurate number, namely "36 people".

64
[Figure: reported death toll by source (ABCNews, CNN, FoxNews, MSNBC, USAToday) against time (EST), from 9:50 to 18:40 on the day of the crash and from 9:31 to 18:02 the next day. Each report, e.g. "no word yet on casualties", "one dead", "two deaths", "at least three", "four dead", "five reported dead", "Fasulo and two others killed", is marked as incorrect, partial, or correct.]
65
Syntactic Alignment
  • Sequence alignment for (near) paraphrasing [Barzilay & Lee 03]
  • No syntax used
  • Dynamic programming
  • Different penalties for alignment depending on
    the syntactic similarity
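
The dynamic-programming alignment can be sketched with a pluggable penalty function, which is the place where a syntax-aware cost would go. This is a generic Needleman-Wunsch style sketch, not the method from the talk; `lexical_cost`, the gap penalty, and the two example sentences (built from the words in the slide's figure) are assumptions.

```python
def align_cost(a, b, substitution_cost, gap=1.0):
    """Minimum-cost global alignment of token sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * gap
    for j in range(1, cols):
        cost[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            cost[i][j] = min(
                cost[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),  # align two words
                cost[i - 1][j] + gap,                                        # skip a word of a
                cost[i][j - 1] + gap,                                        # skip a word of b
            )
    return cost[-1][-1]

def lexical_cost(x, y):
    """Purely lexical penalty; a syntactic variant could compare parse contexts instead."""
    return 0.0 if x == y else 2.0

s1 = "Mary talked with John".split()
s2 = "Mary had a chat with John".split()
total = align_cost(s1, s2, lexical_cost)
```

Here "Mary", "with", and "John" align at no cost, and the remaining mismatch ("talked" against "had a chat") costs 4.0 under these penalties.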

[Figure: word lattice aligning two paraphrases, with nodes: Mary, talked, with, John, had, a, chat]
66
Syntactic Alignment
A police official said it was a Piper tourist plane and that the crash had set the top floors on fire.
According to ABCNEWS aviation expert John Nance, Piper planes have no history of mechanical troubles or other problems that would lead a pilot to lose control.
April 18, 2002 - A small Piper aircraft crashes into the 417-foot-tall Pirelli skyscraper in Milan, setting the top floors of the 32-story building on fire.
Authorities said the pilot of a small Piper plane called in a problem with the landing gear to the Milan's Linate airport at 5:54 p.m., the smaller airport that has a landing strip for private planes.
Initial reports described the plane as a Piper, but did not note the specific model.
Italian rescue officials reported that at least two people were killed after the Piper aircraft struck the 32-story Pirelli building, which is in the heart of the city's financial district.
A small piper plane with only the pilot on board crashed Thursday into a 30-story landmark skyscraper, killing at least two people and injuring at least 30.
Police officer Celerissimo De Simone said the pilot of the Piper Air Commander plane had sent out a distress call at 5:50 p.m. just before the crash near Milan's main train station.
Police officer Celerissimo De Simone said the pilot of the Piper aircraft had sent out a distress call at 5:50 p.m. (11:50 a.m.)
Police officer Celerissimo De Simone said the pilot of the Piper aircraft had sent out a distress call at 5:50 p.m. just before the crash near Milan's main train station.
Police officer Celerissimo De Simone said the pilot of the Piper aircraft sent out a distress call at 5:50 p.m. just before the crash near Milan's main train station.
Police officer Celerissimo De Simone told The AP the pilot of the Piper aircraft had sent out a distress call at 5:50 p.m. just before crashing.
Police say the aircraft was a Piper tourism plane with only the pilot on board.
Police say the plane was an Air Commando - a small plane similar to a Piper.
Rescue officials said that at least three people were killed, including the pilot, while dozens were injured after the Piper aircraft struck the Pirelli high-rise in the heart of the city's financial district.
The crash by the Piper tourist plane into the 26th floor occurred at 5:50 p.m. (1450 GMT) on Thursday, said journalist Desideria Cavina.

Police officer Celerissimo De Simone said the pilot of the Piper aircraft, en route from Switzerland, sent out a distress call at 5:54 p.m. just before the crash near Milan's main train station.
67
Algorithm and results
  • Three lexical methods
  • Two syntactic methods
  • Generate new sentences
  • Method 4 (syntactic alignment except for stop
    words): grammaticality 3.74, fidelity 3.77
  • Best lexical method: grammaticality 3.12,
    fidelity 3.07
  • Scores are on a scale from 1 to 4

68
Web-based QA
  • TREC questions
  • Where is Inoco based?
  • When was London's Docklands Light Railway
    constructed?
  • Who followed Willy Brandt as chancellor of the
    Federal Republic of Germany?
  • What is Grenada's main commodity export?
  • TREC evaluation
  • Earliest conference papers (Radev et al.,
    ANLP 2000; Prager et al., SIGIR 2000)
  • Reranking models

69
Question Modulation
  • TREC question set
  • Start with initial formulation
  • TRDR: Total Reciprocal Document Rank (range 0
    to 2.92)
  • Evolutionary operators: mutation, permutation,
    crossover, drop, insert, phrase
  • What country is the biggest producer of tungsten?
    0.44
  • What country biggest producer of tungsten? 1.11
  • country biggest producer of tungsten? 1.98
  • Web results using Google as the backend search
    engine
  • 0.4 MRR (mean reciprocal rank)
  • Query modulation results
  • 42% increase in TRDR (from 0.79 to 1.12)
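The two metrics on this slide can be sketched in a few lines. The sketch assumes that TRDR is the sum of the reciprocal ranks of all retrieved documents that contain a correct answer, and that MRR averages the reciprocal rank of the first correct document per question. The function names and the sample ranks are illustrative, not taken from the talk. The stated upper bound of 2.92 is consistent with a cutoff of 10 documents, since 1 + 1/2 + ... + 1/10 is about 2.93.

```python
def trdr(relevant_ranks):
    """Total Reciprocal Document Rank: sum of 1/rank over all relevant hits."""
    return sum(1.0 / rank for rank in relevant_ranks)


def mrr(first_relevant_ranks):
    """Mean reciprocal rank over questions.

    Each entry is the rank of the first relevant document for one question,
    or None when no relevant document was retrieved (contributes 0).
    """
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank is not None)
    return total / len(first_relevant_ranks)


# Hypothetical query: relevant documents at ranks 2, 8 and 10.
example_trdr = trdr([2, 8, 10])

# Upper bound when all of the top 10 documents are relevant.
max_trdr_top10 = trdr(range(1, 11))

# Relative improvement reported on the slide (0.79 to 1.12).
relative_increase = (1.12 - 0.79) / 0.79
```

Here `relative_increase` is about 0.418, which matches the 42% figure on the slide.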

70
Models of the Web
  • Evolving networks: a fundamental object of
    statistical physics, social networks,
    mathematical biology, and epidemiology
  • Erdős/Rényi 59, 60
  • Barabási/Albert 99
  • Watts/Strogatz 98
  • Kleinberg 98
  • Menczer 02
  • Radev 03

71
Self-triggerability across hyperlinks
  • Document closures for information retrieval
  • Self-triggerability [Mosteller & Wallace 84] →
    Poisson distribution
  • Two-Poisson [Bookstein & Swanson 74]
  • Negative Binomial, K-mixture [Church & Gale 95]
  • Triggerability across hyperlinks?

72
Evolving Word-based Web
  • Observations
  • Links are made based on topics
  • Topics are expressed with words
  • Words are distributed very unevenly (Zipf,
    Benford, self-triggerability laws)
  • Model
  • Pick n
  • Generate n lengths according to a power-law
    distribution
  • Generate n documents using a trigram model
  • Model (cont'd)
  • Pick words in decreasing order of r.
  • Generate hyperlinks with random directionality
  • Outcome
  • Generates power-law degree distributions
  • Generates topical communities
  • Natural variation of PageRank: LexRank
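One step of the model, generating n document lengths from a power-law distribution, can be sketched with inverse-transform sampling. The slide does not give the exponent or the minimum length, so `alpha=2.5` and `min_len=10` below are placeholder assumptions, as is the function name. The trigram generation and hyperlink steps are not covered here.

```python
import random


def power_law_lengths(n, alpha=2.5, min_len=10, seed=0):
    """Draw n integer document lengths with a power-law tail.

    Uses inverse-transform sampling of a continuous Pareto density
    p(x) ~ x**(-alpha) for x >= min_len, then truncates to an integer.
    alpha and min_len are illustrative values, not from the talk.
    """
    rng = random.Random(seed)
    lengths = []
    for _ in range(n):
        u = rng.random()  # uniform in [0, 1)
        x = min_len * (1.0 - u) ** (-1.0 / (alpha - 1.0))
        lengths.append(int(x))
    return lengths


lengths = power_law_lengths(1000)
```

Most sampled lengths stay close to `min_len`, while a few are much larger, which is the uneven distribution the model relies on.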

[Figure panels: PageRank, HITS]
73
Tripartite updating
  • Modeling classification problems using bipartite
    graphs
  • Weakly supervised learning: why?
  • bootstrapping, co-training, active learning
  • Spectral partitioning
  • Fiedler vector
  • Singular value decomposition
  • Random walks
  • Tripartite updating
  • Matrix representation
  • Iterative power method
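The "matrix representation" and "iterative power method" bullets can be illustrated with a generic power iteration over a random walk. This is a sketch of the general technique only, not of tripartite updating itself, whose update rules are not given on the slide. The 3-state transition matrix and the function name are made up for the example.

```python
def power_method(P, tol=1e-10, max_iter=1000):
    """Approximate the stationary distribution of a row-stochastic matrix P.

    Starts from the uniform distribution and repeatedly applies p <- p P
    until successive iterates differ by less than tol.
    """
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy random walk on three states (each row sums to 1).
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
stationary = power_method(P)
```

For this toy chain the iteration converges to roughly (0.25, 0.5, 0.25), so the middle state receives the most weight.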

74
Tripartite updating
  • Four-way or three-way classification
  • For the same accuracy of spectral partitioning
    (SP) and tripartite updating (TU), TU handles
    twice as many labeled examples with ten times as
    many unlabeled examples
  • Tasks
  • Spam detection
  • Named entity classification
  • PP attachment
  • Number classification
  • Features
  • Number classification: 5 classes based on
    context and Hobbs class

75
Relation extraction
  • User gives examples of entity E1 and entity E2.
  • Example: song "Let It Be", singer the
    Beatles.
  • System finds other songs and singers with a very
    minimal number of training examples.
  • The relation may be quite different, e.g.,
    protein-protein, organization-leader,
    book-author, drug-disease.
  • Weakly supervised learning based on graphs is
    used.

76
Protein Regulatory Network Recognition
  • Wnt signaling
  • Glycogen synthase kinase-3 (GSK-3) and CK1
    (casein kinase 1) alpha phosphorylate Arm
    (Armadillo, β-catenin) and cause it to degrade.
  • Axin also binds to the phosphatase PP2A
  • PP2A activity inhibits Wnt signaling

Hsu 1999, Li 2001, Yanagawa 2002, Liu 2002, Nusse
2003
77
Method and Results
  • Medline
  • "signal transduction" as MeSH major topic and
    "Wnt" or "AKT" or "Beta-catenin" as words
  • 3300 papers extracted by Carlos Santos
  • 441 putative proteins (patterns: "X is a
    protein", "the X protein", "X verbs")
  • Verbs: bind, associate, interact, activate,
    repress, inhibit, upregulate, regulate,
    downregulate, complex, dimerize, localize,
    bound, regulate, stabilize, control,
    translocate, antagonize, amplify, transduce,
    trigger
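The three surface patterns listed above can be sketched with regular expressions. This is a rough illustration under assumptions: candidate names are restricted to tokens that start with a capital letter, verbs are matched in their base or third-person form, and the sample text is a toy passage written for this example rather than Medline data. The names `PATTERNS` and `putative_proteins` are hypothetical.

```python
import re

# Verb list copied from the slide (duplicates removed).
VERBS = [
    "bind", "associate", "interact", "activate", "repress", "inhibit",
    "upregulate", "regulate", "downregulate", "complex", "dimerize",
    "localize", "bound", "stabilize", "control", "translocate",
    "antagonize", "amplify", "transduce", "trigger",
]

# Assumption: a candidate name is a token starting with a capital letter.
NAME = r"([A-Z][\w-]*)"
VERB_ALT = "|".join(VERBS)

PATTERNS = [
    re.compile(r"\b" + NAME + r" is a protein\b"),                    # X is a protein
    re.compile(r"\b[Tt]he " + NAME + r" protein\b"),                  # the X protein
    re.compile(r"\b" + NAME + r" (?:" + VERB_ALT + r")(?:e?s)?\b"),   # X verbs
]


def putative_proteins(text):
    """Return the set of candidate protein names matched by any pattern."""
    found = set()
    for pattern in PATTERNS:
        found.update(pattern.findall(text))
    return found


# Toy text for illustration only.
sample = ("CK1 is a protein. The Axin protein binds to PP2A. "
          "PP2A inhibits Wnt signaling.")
candidates = putative_proteins(sample)
```

On the toy text, each of the three patterns contributes one candidate: CK1, Axin, and PP2A.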

78

Columns: manual evaluation, uneven importance,
graph structure, hard to train, unstructured,
multilingual, multisource, redundant, evolving

Number classification X X X X X
Summarization MEAD/CST/NIE X X X X X X X X X
Lexical Web models X X X X X
Statistical MT X X X X X
Protein networks X X X X X X
Relation extraction X X X X X X X
Phylogenetic alignment X X X X X X X X
QA/NSIR X X X X X
Topical crawling X X X X X X
XML retrieval X X X
Fact tracking X X X X X
79
A grab bag of research problems
  • Finding adequate representations for dynamic
    texts
  • Integrating user models
  • Using self-triggering for information retrieval
  • Weakly supervised and active learning
  • Robust semantic analysis
  • Adequate models of the Web
  • Relation extraction
  • Syntax-based machine translation and
    summarization
  • Automatic knowledge acquisition from the Web

80
Conclusion
  • New approaches to natural language processing and
    information retrieval using graph-based
    techniques such as random walks
  • Applications beyond NLP
  • Highest ranked system at DUC
  • Promising results in semi-supervised machine
    learning
  • Acknowledgments
  • CLAIR (Günes Erkan, Jahna Otterbacher, Siwei
    Shen, Zhu Zhang)
  • UROP program
  • NSF and NIH
  • Mark Newman
  • To read more
  • http://tangra.si.umich.edu/clair
  • http://www.summarization.com
  • http://www.newsinessence.com
  • Papers: CACM 2005; JAIR 2004; EMNLP 2004; IPM
    2004; JASIST 2002, 2004, 2005; WWW 2002; AAAI
    2002; SIGIR 1995, 2000; ACL 1998, 2003; HLT
    2001; HLT-NAACL 2004; CIKM 2001, 2003; ANLP
    1997, 2000; LREC 2002, 2004; IJCNLP 2004; CL
    1998, 2002; COLING 2000, 2004

81
ACL 2005, www.aclweb.org
June 25-30, 2005, Ann Arbor, MI
General chair: Kevin Knight, ISI
Program co-chairs: Kemal Öflazer, Sabanci U.;
Hwee Tou Ng, NUS
Local chair: Dragomir Radev, U. Michigan
Submission deadline: January 14
82
Thank you for your attention!
(S (VP (VB Thank) (NP (PRP you))
   (PP (IN for) (NP (PRP your) (NN attention)))))
tangra.si.umich.edu/clair