Title: Time Identification and Normalization
1Time Identification and Normalization
- Heng Ji
- hengji_at_cs.qc.cuny.edu
- March 9, 2009
Acknowledgement: some slides from Nicolas Nicolov
2Outline
- QA about Assignment 3 and 4
- Background: temporal reasoning, why
- Time analysis: an IE problem
- Timex2 and TERN
- Temporal IE Toolkit
- Time analysis for temporal reasoning
- how, infrastructure
- TimeML: a mark-up language for time
- TimeML analysis: an IE problem
- Timex3, Events, Links
- Temporal IE: status, challenges, future
3Question answering
- When did the war with Iraq start?
- How many illegal immigrants have crossed the
border this year?
- How long does it take to light a fire?
- Who is France's Prime Minister?
4Trajectory
- TIDES
- Timex2
- TERN
- ACE
- TimeML ( Timex3 )
- Context (NLP apps)
- Resource management
- Technology toolkit
- State-of-the-art (status)
- Issues
5Temporal reasoning applications
- Summarisation
- Search, with temporal constraints
- Hypothesis generation and management
- Question answering
- Temporal augmentation of news
- Temporal monitoring
6Summarisation
- Dangling references
- recent summers
- was the source of the virus last week
- where Morris was a computer science
undergraduate till last June
- Need resolution (normalization) of temporal
expressions
7Summarisation
- Temporally anchoring and ordering events in news
- Assigning time stamps to event-clauses
- Correct chronological sequence from diverse
information sources
Mani, Wilson, Schiffman, Filatova, Hovy
8The Devil's Advocate
- The polygamist axiom
- An object cannot be in two different places at
the same time; furthermore, there must be an
event that moves it from one place to another.
S. Makarios
9Question answering
- When did the war with Iraq start?
- How many illegal immigrants have crossed the
border this year?
- How long does it take to light a fire?
- Who is France's Prime Minister?
10Temporal augmentation of news
- MIT's News of the Future
- Personalisation of news
- Article augmentation
- Historical context
- Questioning news
- A readership of one
Koen and Bender
11Temporal monitoring
- Did AT&T common stock go up today?
- Yes. Shall I let you know if it keeps rising?
- No. But shall I let you know if it starts a comeback?
E. Mays
12Time Expression Recognition and Normalization
13Examples
- Last Updated: Friday, 15 July, 2005, 10:19 GMT
11:19 UK
- He will make a three-day visit to Romania next
week
- Police said the 31-year-old Briton died Thursday
- The president attended the meeting for the fifth
straight day in a row
14Timex2 annotation scheme
- Time points
- Third week of October: val="2000-W42"
- Durations
- Half an hour long: val="PT30M"
- Indexicality
- Tomorrow: val="2005-07-30"
- Sets
- Every Tuesday: val="XXXX-WXX-2"
set="YES" periodicity="F1W" granularity="G1D"
- Fuzziness
- Summer of 2000: val="2000-SU"
- This morning: val="2005-07-29-TMO"
- Non-specificity
- April is usually wet: val="XXXX-04"
non_specific="YES"
15Interannotator agreement
Percent of tags with extent errors:
- Lisa 0.1
- Liz 10.5
- Janet 3.2
16Data (Training)
- bnews, npaper, nwire
- 794 docs
- 293K tokens
- 8047 expressions (sparse)
- 10.5 TIMEX2 tags / doc
Chinese data also
2002
2003
2004
17Data Format
- <DOC>
- <DOCNO> APW20001102.1223.0376 </DOCNO>
- <DOCTYPE SOURCE="newswire"> NEWS STORY </DOCTYPE>
- <DATE_TIME> <TIMEX2 val="2000-11-02T13:27:23"
mod="" set="" non_specific="" anchor_val=""
anchor_dir="" comment="">2000-11-02
13:27:23</TIMEX2> </DATE_TIME>
- <BODY>
- <SLUG> Israel-Explosion </SLUG>
- <HEADLINE> Explosion Rocks Jerusalem Market
</HEADLINE>
- <TEXT>
- JERUSALEM (AP) _ A car bomb exploded <TIMEX2
val="2000-11-02" mod="" set="" non_specific=""
anchor_val="" anchor_dir="" comment="">Thursday</TIMEX2>
in a crowded outdoor market in the heart
of Jerusalem, killing at least two people, police
said.
- Ambulances raced to the Mahane Yehuda market,
which sells food, vegetables and clothing in
Jewish west Jerusalem. Huge plumes of black smoke
rose into the sky.
- The blast came shortly before Israeli Prime
Minister Ehud Barak and Palestinian leader Yasser
Arafat were scheduled to make separate
announcements of steps toward a cease-fire.
- </TEXT></BODY></DOC>
18Why it is easy
- Calendar points 2004-10-04 55.28
- Days, Months
- Contextual cues: SGML tags
19Why it is difficult
- the two weeks since Mrs. Daniel first held a
news conference admonishing the hospital,
University of California-San Francisco Stanford
Health Care, for even debating the transplant
issue
- just days after another court dismissed other
corruption charges against his 79-year-old father
on grounds of ill health, enraging pro-democracy
activists and triggering violent demonstrations
20Why it is difficult (cont)
- a month of delays following the disclosure that
new evidence surfaced on another group, the
Damascus-based Palestinian Front for the
Liberation of Palestine-General Command
- an era in which Speakers have been defined by
belligerent partisanship -- particularly Mr.
Gingrich and the Democrat he hounded from office,
Jim Wright of Texas
21Why it is difficult (cont)
- in recent years
- BEFORE or ENDING this year?
- After a disastrous decade of trying to impose
North Vietnam's austere and repressive communism
on the more freewheeling South, the country's
leaders reversed course in the late 1980s and
began encouraging Southern style
entrepreneurship.
- Point or Duration?
- If duration, how to anchor it?
- P1DE ENDING 198X ? P1DE BEFORE 1990?
22TERN English Training Data V1.0
- Available from LDC, catalog LDC2005T07
- 306K words annotated for TIMEX2
- Chinese data to be released soon
23Outline
- Background: temporal reasoning, why
- Time analysis: an IE problem
- Timex2 and TERN
- Temporal IE Toolkit
- Time analysis for temporal reasoning
- how, infrastructure
- TimeML: a mark-up language for time
- TimeML analysis: an IE problem
- Timex3, Events, Links
- Temporal IE: status, challenges, future
24Timex2 recognition
- Strategies
- Rule-based: patterns, grammars
- Machine-learning based: tagging, chunking
25A grammar and a parser for date expressions
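1to9 = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
0to9 = 0 | 1to9
SP = ", "
Day = Monday | Tuesday | ... | Sunday
Month = January | February | ... | December
Date = 1to9 | [ 1 | 2 ] 0to9 | 3 [ 0 | 1 ]
Year = 1to9 ( 0to9 ( 0to9 ( 0to9 ) ) )
DExpr = Day | ( Day SP ) Month Date ( SP Year )
(Parentheses mark optional parts; square brackets group alternatives.)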
Today is Friday, July 29, 2005 because
yesterday was Thursday July 28 tomorrow will
be Saturday, July 30 and not July 31...
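The date grammar on this slide maps naturally onto a regular expression. A minimal sketch in Python, assuming SP is an optional comma followed by whitespace and that day and month names are spelled out in full. The names `DEXPR` and `find_date_expressions` are illustrative and do not come from the slides.

```python
import re

# Building blocks mirroring the grammar's non-terminals.
DAY = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
DATE = r"(?:3[01]|[12][0-9]|[1-9])"   # 1..31, two-digit alternatives first
YEAR = r"[1-9][0-9]{0,3}"             # one to four digits, no leading zero
SP = r",?\s+"                         # assumption: optional comma, then whitespace

# DExpr = Day | ( Day SP ) Month Date ( SP Year )
DEXPR = re.compile(
    rf"\b(?:(?:{DAY}{SP})?{MONTH}\s+{DATE}\b(?:{SP}{YEAR}\b)?|{DAY})\b"
)

def find_date_expressions(text):
    """Return every non-overlapping date expression, scanning left to right."""
    return [match.group(0) for match in DEXPR.finditer(text)]

example = ("Today is Friday, July 29, 2005 because yesterday was Thursday "
           "July 28 tomorrow will be Saturday, July 30 and not July 31...")
print(find_date_expressions(example))
```

Like the grammar, the pattern accepts strings such as "February 31"; ruling those out is a separate validation step.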
26A grammar and a parser for valid date expressions
ValidDate = DateExpression & MaxDaysInMonth & LeapDays & WeekDayDates
MaxDaysInMonth rules out: February 30,
[ February | April | June | September | November ] 31, ...
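The three validity constraints named on this slide (maximum days per month, leap days, weekday/date agreement) can also be checked procedurally once a date expression has been parsed. The sketch below leans on Python's `datetime.date` instead of a finite-state grammar, which is a swapped-in technique and not the slide's method. The function name and signature are assumptions made for illustration.

```python
import datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
            "Saturday", "Sunday"]  # same order as date.weekday()

def is_valid_date(month, day, year=None, weekday=None):
    """Check one parsed date against the three constraints on the slide."""
    month_number = MONTHS.index(month) + 1
    # With no year given, probe a leap year so that "February 29" passes.
    probe_year = 2000 if year is None else year
    try:
        # datetime.date enforces MaxDaysInMonth and LeapDays.
        parsed = datetime.date(probe_year, month_number, day)
    except ValueError:
        return False
    # WeekDayDates: the weekday can only be verified when the year is known.
    if weekday is not None and year is not None:
        return WEEKDAYS[parsed.weekday()] == weekday
    return True

print(is_valid_date("July", 29, 2005, "Friday"))
print(is_valid_date("February", 30))
```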
27Learning how to tag
- Chunking (Abney91) using a token tagging model
- Output tags B-Timex2, I-Timex2, O
- IOB2 encoding (Tjong Kim Sang & Veenstra 99)
Elections are on <TIMEX2> November 2nd </TIMEX2> .
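To make the IOB2 encoding concrete, the following sketch converts a tokenised sentence with inline TIMEX2 markers into (token, tag) pairs. It assumes whitespace-separated tokens and flat, non-nested markers. The function name `to_iob2` is illustrative.

```python
def to_iob2(tagged_tokens):
    """Convert tokens with inline <TIMEX2> ... </TIMEX2> markers to (token, tag) pairs."""
    pairs = []
    inside = False  # are we between an opening and a closing marker?
    first = False   # is the next token the first one of its chunk?
    for token in tagged_tokens:
        if token == "<TIMEX2>":
            inside, first = True, True
        elif token == "</TIMEX2>":
            inside = False
        elif inside:
            pairs.append((token, "B-Timex2" if first else "I-Timex2"))
            first = False
        else:
            pairs.append((token, "O"))
    return pairs

sentence = "Elections are on <TIMEX2> November 2nd </TIMEX2> .".split()
for token, tag in to_iob2(sentence):
    print(token, tag)
```

Nested expressions need more than this flat encoding can express.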
28Machine learning technologies
- Hidden Markov Models (HMM)
- Transformation-Based Learning (TBL)
- Decision Trees (DT)
- Memory-Based Learning (MBL)
- Maximum Entropy (MaxEnt)
- Robust Risk Minimization (RRM)
- Support Vector Machines (SVM)
29Embeddedness
- 1.9% (167 out of a total of 8,429)
- a year after the 15 EU leaders decided in
Helsinki, Finland, to create a corps of 60,000
troops capable by 2003 of deploying within
60 days and remaining on the ground for up to
a year
- 3-level embeddedness: only 3 occurrences
- < the first of two days of talks aimed at
laying the groundwork for a future agreement >
- < 60 days a period that ended Saturday
with no such plan >
- < one of the toughest days that traders that
I've had a chance to talk to have seen in quite
a long time >
30Features
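(Figure: features attached to the tokens of an example sentence)
- Example tokens: Monday morning; The meeting is moved to 1 / 2 / 2005 .
- Chunk and class features: startNP, insideNP, DET, TIME
- Context features: t1 TIME, t2 TIME; t1 move
- Token features: lowerCase; shape XX/XX/XXXX; prefixes and suffixes m, mo, ed, d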
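A few of the token-level features listed on this slide (lower-cased form, digit shape, short prefixes and suffixes, previous token) can be computed as in the sketch below. The feature names and the dictionary layout are assumptions for illustration. The slide does not specify how the original system encoded them.

```python
import re

def token_shape(token):
    """Replace every digit with X and keep everything else, e.g. for date patterns."""
    return re.sub(r"\d", "X", token)

def affixes(token, max_len=2):
    """Return prefixes (shortest first) and suffixes (longest first) up to max_len."""
    lowered = token.lower()
    prefixes = [lowered[:n] for n in range(1, max_len + 1) if n <= len(lowered)]
    suffixes = [lowered[-n:] for n in range(max_len, 0, -1) if n <= len(lowered)]
    return prefixes, suffixes

def token_features(tokens, i):
    """Collect simple features for tokens[i]; "<s>" marks the sentence start."""
    token = tokens[i]
    prefixes, suffixes = affixes(token)
    return {
        "lowerCase": token.lower(),
        "shape": token_shape(token),
        "prefixes": prefixes,
        "suffixes": suffixes,
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }

tokens = "The meeting is moved to 01/02/2005 .".split()
print(token_features(tokens, 3))   # features for "moved"
print(token_shape("01/02/2005"))
```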
31Time Zones (aka GMT)
32Time hierarchies
(Figure: hierarchy of time concepts)
- Node labels: TIME, PERIOD, POINT, DAY, YEAR, DAY-SPAN
- Example leaves: Jurassic; daybreak, noon, sunset; Monday, today;
year of the dragon; morning, afternoon
33Challenges
- Sentential complements fair enough
- First names
- Jan de Bont (director, Cradle of Life)
- January Sorensen (Wish upon a Star)
- April Webster (casting director, Patriot)
- Miss November
- Friday (Robinson Crusoe's island companion)
- Tuesday Weld (Dudley Moore's wife)
- Titles: The Day After Tomorrow, Friday After
Next
- ORG: Sunday Times, Time Magazine
- LOC: Times Square
- time schedule
- three bombings in as many years (BBC on
Indonesia)
34Unexpected challenge: Tokenization
- Remember the myopic vision of the systems
- Second 1/2 of 2004
- Second 1/2 of the white source
- Second 1/2 of the fiscal year
- Nine 1/2 weeks
- For recognition: keep as much as possible together
( 1/2 vs. 1_/_2)
- Break further for normalization
(MM/DD/YYYY vs. DD/MM/YYYY formats)
35Statistical Tokenization
- List of unambiguous characters ?
- List of ambiguous characters , . !
- Go through text
- Split at unambiguous char
- Make a classification decision at ambiguous char by
considering the local context
- Classifications are actions: keep (http://www.) /
split before ('s) / after (Dr.) / around (Yes!) /
oneBefore (don't)
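The control flow described on this slide can be sketched as follows. The function `decide` is a hand-written stand-in for the learned classifier, and it only distinguishes two actions ("keep" and "split"), not the full inventory listed above. The character sets and the abbreviation list are assumptions chosen for illustration.

```python
UNAMBIGUOUS = set("()[]\"")          # assumption: always separate tokens
AMBIGUOUS = set(",.!")               # the ambiguous characters from the slide
ABBREVIATIONS = {"dr", "mr", "mrs"}  # assumption: tiny illustrative list

def decide(text, i):
    """Stand-in for the classifier: choose 'keep' or 'split' for text[i] from local context."""
    before = text[i - 1] if i > 0 else " "
    after = text[i + 1] if i + 1 < len(text) else " "
    if before.isdigit() and after.isdigit():
        return "keep"                # inside a number: 8,429 or 10.5
    if not after.isspace():
        return "keep"                # inside a token: www.example
    words = text[:i].split()
    word = words[-1].lower() if words else ""
    if text[i] == "." and word in ABBREVIATIONS:
        return "keep"                # abbreviation: Dr.
    return "split"

def tokenize(text):
    """Split at unambiguous characters; consult decide() at ambiguous ones."""
    pieces = []
    for i, ch in enumerate(text):
        if ch in UNAMBIGUOUS or (ch in AMBIGUOUS and decide(text, i) == "split"):
            pieces.append(f" {ch} ")  # isolate the character as its own token
        else:
            pieces.append(ch)
    return "".join(pieces).split()

print(tokenize("Dr. Smith paid 8,429 dollars (in cash)!"))
```

In a statistical tokenizer the body of `decide` would be replaced by a model trained on labelled decisions, with features drawn from the same local window.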
36Outline
- Background: temporal reasoning, why
- Time analysis: an IE problem
- Timex2 and TERN
- Temporal IE Toolkit
- Time analysis for temporal reasoning
- how, infrastructure
- TimeML: a mark-up language for time
- TimeML analysis: an IE problem
- Timex3, Events, Links
- Temporal IE: status, challenges, future
37The TIMEBANK Corpus (2003)
TIMEBANK contains 186 newswire articles with
careful, detailed annotations of terms denoting
events, temporal expressions, and temporal
signals, and, most importantly, of links between
them denoting temporal relations. This
collection, the largest temporal-event annotated
corpus to date, provides a solid empirical basis
for future research into the way texts actually
express and connect series of events. It will
support research into areas as diverse as the
semantics of tense and aspect, the explicit
versus implicit communication of temporal
relational information, and the variation in
typical event structure across narrative domains.
From a practical computational perspective, it
will become possible to consider training and
evaluating algorithms which determine event
ordering and time-stamping.
38Methodology of corpus design
- TimeBank vs. TERN corpus
- Relationship between
- Corpus and task
- Corpus and implementation
39TimeBank data characteristics
- (Some) TimeBank statistics: 186 documents,
68.5K words
- At 10% test data, barely over 60K words for
training
40TimeBank data characteristics
- Cf. Penn TreeBank / POS tagging corpus
- More than 1M words (> 16 times larger than
TimeBank), and a simple tagging scheme.
41TimeBank data characteristics
- Cf. CoNLL NE chunking
- Extreme paucity of positive examples, e.g.
- fewer than 2K examples over 13 classes
- cf. >200K words / 23K examples (15 times >
TimeBank) over just 4 name classes
42TimeBank data characteristics
- Cf. TERN's training set: almost 800 docs / 300K
words, with >8K Timex2 tags, considered to be
somewhat sparse!
Timex3 counts:
- DATE 975
- DURATION 314
- TIME 80
- SET 7
- (subtotal 1376)
- Total 1423
- (in document) 1245
43Detour: Timex normalisation
44ISO 8601 standard: Numeric representation of
dates and times
- Easily readable/writable by software
- Easily comparable / sortable (str compare)
- Ditto for dates followed by times
- Language independent
- Unambiguous (wrt other date formats)
- Notation is short and of constant length
- year, month, day order widely used
- Japan, Korea, Hungary, Sweden, Finland, DK,
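The "sortable by string comparison" property can be checked directly: because ISO 8601 puts the most significant field first and pads each field to a fixed width, plain lexicographic order coincides with chronological order. A small sketch follows. The sample dates and the function name are illustrative.

```python
def sort_chronologically(iso_strings):
    """Plain string sort; chronological only if all strings share one ISO 8601 layout."""
    return sorted(iso_strings)

iso_dates = ["2005-07-29", "2004-10-04", "2005-02-04", "1993-12-31"]
us_dates = ["07/29/2005", "10/04/2004", "02/04/2005", "12/31/1993"]

print(sort_chronologically(iso_dates))  # earliest date first
print(sorted(us_dates))                 # month-first strings: order is not chronological
```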
45ISO 8601 Dates
- YYYY-MM-DD: 2005-07-29 (20050729),
2005-07, 2005
- YYYY-WNN: 2005-W30 (2005W30),
2005-W30-5 = 2005-07-29
- YYYY-DDD: 2005-001 (2005001),
2005-210 = 2005-07-29
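The calendar, week and ordinal forms of a date can be derived with Python's standard `datetime` module, as in the sketch below. For 29 July 2005 it yields week 30, weekday 5 and day-of-year 210. The function name and the dictionary keys are illustrative.

```python
import datetime

def iso_date_forms(day):
    """Return the ISO 8601 calendar, week and ordinal representations of a date."""
    iso_year, iso_week, iso_weekday = day.isocalendar()
    return {
        "calendar": day.isoformat(),                  # YYYY-MM-DD
        "compact": day.strftime("%Y%m%d"),            # YYYYMMDD
        "month": day.strftime("%Y-%m"),               # reduced precision
        "week": f"{iso_year}-W{iso_week:02d}",        # YYYY-WNN
        "week_day": f"{iso_year}-W{iso_week:02d}-{iso_weekday}",
        "ordinal": f"{day.year}-{day.timetuple().tm_yday:03d}",  # YYYY-DDD
    }

for name, value in iso_date_forms(datetime.date(2005, 7, 29)).items():
    print(name, value)
```

Note that the ISO week-numbering year can differ from the calendar year for dates close to 1 January, which is why the week forms use `iso_year` and not `day.year`.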
46ISO 8601 Times
- hh:mm:ss: 23:59:59 (235959), 23:59,
23, 23:59:59.9942
(5.8 ms before midnight)
- 2005-02-04 24:00 = 2005-02-05 00:00;
2005-02-04T23:59:59, 20050204T235959
- fixes many disadvantages of the old English 12h
notation
47ISO 8601 Durations
- 6-dimensional space: Gregorian year, month, day,
hour, minute, second
- PnYnMnDTnHnMnS: P1Y2M3DT10H30M, -120D
- reduced precision, truncation allowed:
P0.5Y; P1D, PT24H
- order relation: partial order
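A duration in the PnYnMnDTnHnMnS form can be read as a point in the six-dimensional space mentioned above. The sketch below parses integer-valued durations into a 6-tuple and compares two durations field by field. This component-wise comparison is one simple partial order chosen for illustration, and it is an assumption that it matches what the slide intends. Fractions, weeks and negative durations are not handled.

```python
import re

FIELDS = ("years", "months", "days", "hours", "minutes", "seconds")

DURATION = re.compile(
    r"P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?"
)

def parse_duration(text):
    """Parse an integer-valued ISO 8601 duration into a 6-tuple of ints."""
    match = DURATION.fullmatch(text)
    # A bare "P" or a dangling "T" matches the pattern but carries no value.
    if match is None or text.endswith(("P", "T")):
        raise ValueError(f"not a supported duration: {text!r}")
    return tuple(int(match.group(name) or 0) for name in FIELDS)

def no_longer_than(a, b):
    """Component-wise comparison: True when every field of a is <= that of b."""
    return all(x <= y for x, y in zip(parse_duration(a), parse_duration(b)))

print(parse_duration("P1Y2M3DT10H30M"))
print(no_longer_than("P1D", "PT24H"), no_longer_than("PT24H", "P1D"))
```

Under this ordering P1D and PT24H are incomparable, since neither is component-wise smaller than the other, so the order is only partial.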
48Timex value attribute
- July 29, 2005 → 2005-07-29
- Friday → 2005-07-29
- today → 2005-07-29
- 1993 → 1993
- the 1990's → 199X
- midnight, July 16, 2005 → 2005-07-16T00:00:00
- 5pm → 2005-07-29T17:00
- the previous day → 2005-12-24
- last October → 2004-10
- last autumn → 2004-FA
- last week → 2005-W29
- Thursday evening → 2005-07-28TEV
- three months ago → 2005-04
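Several of these values depend only on a reference date, here 2005-07-29. The sketch below computes a handful of them with Python's `datetime` module, using ISO week numbering for "last week". The lookup table, the function names and the reading of "last October" as the most recent October before the reference month are simplifying assumptions. A real normaliser would parse the expression and not look it up verbatim.

```python
import datetime

def week_value(day):
    """ISO 8601 week value (YYYY-WNN) of the week containing `day`."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"

def months_ago(reference, count):
    """Month value (YYYY-MM) lying `count` months before the reference month."""
    index = reference.year * 12 + (reference.month - 1) - count
    return f"{index // 12:04d}-{index % 12 + 1:02d}"

def last_named_month(reference, month):
    """Most recent occurrence of `month` strictly before the reference month."""
    year = reference.year if month < reference.month else reference.year - 1
    return f"{year:04d}-{month:02d}"

def normalise(expression, reference):
    """Resolve a few context-dependent expressions against a reference date."""
    one_day = datetime.timedelta(days=1)
    table = {
        "today": lambda: reference.isoformat(),
        "tomorrow": lambda: (reference + one_day).isoformat(),
        "last week": lambda: week_value(reference - datetime.timedelta(weeks=1)),
        "three months ago": lambda: months_ago(reference, 3),
        "last October": lambda: last_named_month(reference, 10),
        "5pm": lambda: reference.isoformat() + "T17:00",
    }
    return table[expression]()

reference = datetime.date(2005, 7, 29)
for phrase in ("today", "tomorrow", "last week", "three months ago",
               "last October", "5pm"):
    print(phrase, normalise(phrase, reference))
```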
49Timex value attribute
- the early 1990's: value="199X" mod="START"
- the past 10 years: value="P10Y"
anchor_val="2005" anchor_dir="BEFORE"
- the next week: value="P1W"
anchor_val="2005-W30" anchor_dir="AFTER"
- the previous day: cf. point above
- recent: value="PAST_REF"
anchor_val="2005-07-29T12:00"
anchor_dir="BEFORE"
50TempEx
Mani & Wilson
- identify temporal expressions
- resolve self-contained expressions
- interpret within a discourse processor
- context tracker (for context dependent
expressions)
- explicit offsets from reference time
- positional offsets from reference time
- implicit offsets based on verb tense
- further use of lexical markers
- nearby dates
51Machine learning in normalisation
Ahn, Adafre & de Rijke
- Staged normalisation architecture: classifiers
for some disambiguation subtasks
- cf. TERN-2004, e.g. Cymphony
- Derivative of Mani & Wilson, 2000
- only one classifier learnt: generic vs. specific
use of "today"
- only date-valued expressions; not considering
point/duration ambiguities
- Separate context-independent interpretation from
context-dependent processing
52Decomposing normalisation
Ahn, Adafre & de Rijke
- Lexical lookup (names → numbers, units → ISO
values)
- Context-independent composition: combining values
of tokens within a timex
- Context-based disambiguation: timex is a point
or a duration?
- Reference time (temporal focus) tracking, for
anaphoric timexes
- Computation of (final) normalised value
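The first two stages, lexical lookup and context-independent composition, can be illustrated for simple two-token durations such as "three months". This is a hypothetical sketch and not the implementation of Ahn, Adafre and de Rijke. The lookup tables, the function names and the restriction to "number unit" patterns are all assumptions.

```python
# Lexical lookup tables (illustrative and deliberately small).
NUMBERS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4,
           "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
# unit -> (ISO 8601 designator, whether it follows the "T" time marker)
UNITS = {"year": ("Y", False), "month": ("M", False), "week": ("W", False),
         "day": ("D", False), "hour": ("H", True), "minute": ("M", True),
         "second": ("S", True)}

def lookup_number(token):
    """Map a digit string or a number word to an int."""
    return int(token) if token.isdigit() else NUMBERS[token.lower()]

def lookup_unit(token):
    """Map a singular or plural unit word to its designator and time flag."""
    word = token.lower()
    if word not in UNITS and word.endswith("s"):
        word = word[:-1]  # naive plural stripping
    return UNITS[word]

def compose_duration(tokens):
    """Combine the looked-up values of a [number, unit] timex into an ISO value."""
    count = lookup_number(tokens[0])
    designator, is_time = lookup_unit(tokens[1])
    return f"P{'T' if is_time else ''}{count}{designator}"

print(compose_duration(["three", "months"]))
print(compose_duration(["an", "hour"]))
```

The later stages (point or duration, reference time tracking) need context that this step deliberately ignores.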
53Interesting issues
- Data for the "today" classifier is sparse and heavily
skewed (90% of instances are specific)
- use the frequency baseline classifier for the task
- Training data for classifiers?
- mediate between pre-normalisation and truth
- straightforward for p-d and today tasks; dir
task crucially requires a temporal reference
model