Time Identification and Normalization - PowerPoint PPT Presentation

1
Time Identification and Normalization
  • Heng Ji
  • hengji_at_cs.qc.cuny.edu
  • March 9, 2009

Acknowledgement: some slides from Nicolas Nicolov
2
Outline
  • Q&A about Assignments 3 and 4
  • Background: temporal reasoning, why
  • Time analysis: an IE problem
  • Timex2 and TERN
  • Temporal IE Toolkit
  • Time analysis for temporal reasoning
  • how, infrastructure
  • TimeML: a mark-up language for time
  • TimeML analysis: an IE problem
  • Timex3, Events, Links
  • Temporal IE: status, challenges, future

3
Question answering
  • When did the war with Iraq start?
  • How many illegal immigrants have crossed the
    border this year?
  • How long does it take to light a fire?
  • Who is France's Prime Minister?

4
Trajectory
  • TIDES
  • Timex2
  • TERN
  • ACE
  • TimeML ( Timex3 )
  • Context (NLP apps)
  • Resource management
  • Technology toolkit
  • State-of-the-art (status)
  • Issues

5
Temporal reasoning applications
  • Summarisation
  • Search, with temporal constraints
  • Hypothesis generation and management
  • Question answering
  • Temporal augmentation of news
  • Temporal monitoring


6
Summarisation
  • Dangling references
  • recent summers
  • was the source of the virus last week
  • where Morris was a computer science
    undergraduate till last June
  • Need resolution (normalization) of temporal
    expressions

7
Summarisation
  • Temporally anchoring and ordering events in news
  • Assigning time stamps to event-clauses
  • Correct chronological sequence from diverse
    information sources

Mani, Wilson, Schiffman, Filatova, Hovy
8
The Devil's Advocate
  • The 'polygamist' axiom:
  • An object cannot be in two different places at
    the same time; furthermore, there must be an
    event that moves it from one place to another.

S. Makarios
9
Question answering
  • When did the war with Iraq start?
  • How many illegal immigrants have crossed the
    border this year?
  • How long does it take to light a fire?
  • Who is France's Prime Minister?

10
Temporal augmentation of news
  • MIT's News of the Future
  • Personalisation of news
  • Article augmentation
  • Historical context
  • Questioning news
  • A readership of one

Koen and Bender
11
Temporal monitoring
  • Did AT&T common stock go up today?
  • Yes.
  • No.

Shall I let you know if it keeps rising?
But shall I let you know if it starts a comeback?
E. Mays
12
Time Expression Recognition and Normalization
13
Examples
  • Last Updated: Friday, 15 July, 2005, 10:19 GMT
    11:19 UK
  • He will make a three-day visit to Romania next
    week
  • Police said the 31-year-old Briton died Thursday
  • The president attended the meeting for the fifth
    straight day in a row

14
Timex2 annotation scheme
  • Time points
  • Third week of October: val="2000-W42"
  • Durations
  • Half an hour long: val="PT30M"
  • Indexicality
  • Tomorrow: val="2005-07-30"
  • Sets
  • Every Tuesday: val="XXXX-WXX-2"
    set="YES" periodicity="F1W" granularity="G1D"
  • Fuzziness
  • Summer of 2000: val="2000-SU"
  • This morning: val="2005-07-29TMO"
  • Non-specificity
  • April is usually wet: val="XXXX-04"
    non_specific="YES"

15
Interannotator agreement
Percent of tags with extent errors: Lisa 0.1, Liz 10.5, Janet 3.2
16
Data (Training)
  • bnews, npaper, nwire
  • 794 docs
  • 293K tokens
  • 8047 expressions (sparse)
  • 10.5 TIMEX2 tags / doc

Chinese data also
2002
2003
2004
17
Data Format
  • <DOC>
  • <DOCNO> APW20001102.1223.0376 </DOCNO>
  • <DOCTYPE SOURCE="newswire"> NEWS STORY </DOCTYPE>
  • <DATE_TIME> <TIMEX2 val="2000-11-02T13:27:23"
    mod="" set="" non_specific="" anchor_val=""
    anchor_dir="" comment="">2000-11-02
    13:27:23</TIMEX2> </DATE_TIME>
  • <BODY>
  • <SLUG> Israel-Explosion </SLUG>
  • <HEADLINE> Explosion Rocks Jerusalem Market
    </HEADLINE>
  • <TEXT>
  • JERUSALEM (AP) _ A car bomb exploded <TIMEX2
    val="2000-11-02" mod="" set="" non_specific=""
    anchor_val="" anchor_dir="" comment="">Thursday</TIMEX2>
    in a crowded outdoor market in the heart
    of Jerusalem, killing at least two people, police
    said.
  • Ambulances raced to the Mahane Yehuda market,
    which sells food, vegetables and clothing in
    Jewish west Jerusalem. Huge plumes of black smoke
    rose into the sky.
  • The blast came shortly before Israeli Prime
    Minister Ehud Barak and Palestinian leader Yasser
    Arafat were scheduled to make separate
    announcements of steps toward a cease-fire.
  • </TEXT></BODY></DOC>

18
Why it is easy
  • Calendar points: 2004-10-04 55.28
  • Days, Months
  • Contextual cues: SGML tags

19
Why it is difficult
  • the two weeks since Mrs. Daniel first held a
    news conference admonishing the hospital,
    University of California-San Francisco Stanford
    Health Care, for even debating the transplant
    issue
  • just days after another court dismissed other
    corruption charges against his 79-year-old father
    on grounds of ill health, enraging pro-democracy
    activists and triggering violent demonstrations

20
Why it is difficult (cont)
  • a month of delays following the disclosure that
    new evidence surfaced on another group, the
    Damascus-based Palestinian Front for the
    Liberation of Palestine-General Command
  • an era in which Speakers have been defined by
    belligerent partisanship -- particularly Mr.
    Gingrich and the Democrat he hounded from office,
    Jim Wright of Texas

21
Why it is difficult (cont)
  • in recent years
  • BEFORE or ENDING this year?
  • After a disastrous decade of trying to impose
    North Vietnam's austere and repressive communism
    on the more freewheeling South, the country's
    leaders reversed course in the late 1980s and
    began encouraging Southern style
    entrepreneurship.
  • Point or Duration?
  • If duration, how to anchor it?
  • P1DE ENDING 198X ? P1DE BEFORE 1990?

22
TERN English Training Data V1.0
  • Available from LDC, catalog LDC2005T07
  • 306K words annotated for TIMEX2
  • Chinese data to be released soon

23
Outline
  • Background: temporal reasoning, why
  • Time analysis: an IE problem
  • Timex2 and TERN
  • Temporal IE Toolkit
  • Time analysis for temporal reasoning
  • how, infrastructure
  • TimeML: a mark-up language for time
  • TimeML analysis: an IE problem
  • Timex3, Events, Links
  • Temporal IE: status, challenges, future

24
Timex2 recognition
  • Strategies
  • Rule-based: patterns, grammars
  • Machine-learning based: tagging, chunking

25
A grammar and a parser for date expressions
1to9 = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
0to9 = 0 | 1to9
SP = ","
Day = Monday | Tuesday | ... | Sunday
Month = January | February | ... | December
Date = 1to9 | [ 1 | 2 ] 0to9 | 3 [ 0 | 1 ]
Year = 1to9 ( 0to9 ( 0to9 ( 0to9 ) ) )
DExpr = Day | ( Day SP ) Month Date ( SP Year )
(parentheses mark optional parts, square brackets group alternatives)
Today is Friday, July 29, 2005 because
yesterday was Thursday July 28 tomorrow will
be Saturday, July 30 and not July 31...
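The grammar above can be tried out as a regular expression. The following is a minimal sketch rather than the parser used in the slides: the names `DEXPR` and `find_dates` are made up for illustration, the months and weekdays elided by "..." are spelled out, and the separator is assumed to be an optional comma followed by a space so that "Thursday July 28" is also accepted.

```python
import re

DAY = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
DATE = r"(?:3[01]|[12][0-9]|[1-9])"   # Date = 1to9 | [1|2] 0to9 | 3 [0|1]
YEAR = r"[1-9][0-9]{0,3}"             # Year = 1to9 (0to9 (0to9 (0to9)))
SP = r",? "                           # assumption: the comma is optional

# DExpr = Day | (Day SP) Month Date (SP Year)
# The longer alternative is listed first so that a bare Day is the fallback.
DEXPR = re.compile(
    rf"\b(?:(?:{DAY}{SP})?{MONTH} {DATE}(?!\d)(?:{SP}{YEAR})?|{DAY})\b"
)

def find_dates(text):
    """Return every substring of text that matches DExpr, left to right."""
    return [m.group(0) for m in DEXPR.finditer(text)]

sentence = ("Today is Friday, July 29, 2005 because yesterday was "
            "Thursday July 28 tomorrow will be Saturday, July 30 "
            "and not July 31...")
print(find_dates(sentence))
```

Like the grammar, this accepts well-formed but impossible dates such as "February 30"; validity is a separate constraint.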
26
A grammar and a parser for valid date expressions
Valid Date = DateExpression & MaxDaysInMonth & LeapDays & WeekDayDates
MaxDaysInMonth excludes: February 30 |
[ February | April | June | September | November ] 31
.........
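The slide expresses validity as constraints combined with the date grammar. As a rough stand-in, the same three checks (days per month, leap days, weekday agreement) can be delegated to Python's calendar arithmetic. This is a sketch of the idea, not the finite-state construction itself; `is_valid_date` is a hypothetical helper name.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
            "Saturday", "Sunday"]

def is_valid_date(month, day, year, weekday=None):
    """True if the date exists and, when a weekday is given, falls on it."""
    try:
        # date() raises ValueError for February 30, April 31, or
        # February 29 in a non-leap year; index() does so for bad names.
        d = date(year, MONTHS.index(month) + 1, day)
    except ValueError:
        return False
    # date.weekday() counts Monday as 0, matching the WEEKDAYS list.
    return weekday is None or WEEKDAYS[d.weekday()] == weekday

print(is_valid_date("July", 29, 2005, "Friday"))
print(is_valid_date("February", 30, 2005))
```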
27
Learning how to tag
  • Chunking (Abney 91) using a token tagging model
  • Output tags: B-Timex2, I-Timex2, O
  • IOB2 encoding (Tjong Kim Sang & Veenstra 99)

Elections are on <TIMEX2> November 2nd </TIMEX2> .
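To make the IOB2 encoding concrete, here is a small sketch that converts the bracketed example into token/tag pairs. It assumes whitespace-separated tokens and attribute-free, non-nested tags; `to_iob2` is an illustrative name, not part of any toolkit mentioned in the slides.

```python
def to_iob2(tagged, label="Timex2"):
    """Convert a whitespace-tokenised string with <TIMEX2> ... </TIMEX2>
    markers into (token, tag) pairs using the IOB2 scheme."""
    pairs = []
    inside = False   # currently within a timex
    begin = False    # next token is the first token of the timex
    for tok in tagged.split():
        if tok == "<TIMEX2>":
            inside, begin = True, True
        elif tok == "</TIMEX2>":
            inside = False
        elif inside:
            pairs.append((tok, ("B-" if begin else "I-") + label))
            begin = False
        else:
            pairs.append((tok, "O"))
    return pairs

for token, tag in to_iob2("Elections are on <TIMEX2> November 2nd </TIMEX2> ."):
    print(token, tag)
```

In IOB2 every chunk starts with a B- tag, so a token tagger can recover the chunk boundaries from the tag sequence alone.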
28
Machine learning technologies
  • Hidden Markov Models (HMM)
  • Transformation-Based Learning (TBL)
  • Decision Trees (DT)
  • Memory-Based Learning (MBL)
  • Maximum Entropy (MaxEnt)
  • Robust Risk Minimization (RRM)
  • Support Vector Machines (SVM)

29
Embeddedness
  • 1.9% (167 out of a total of 8,429)
  • a year after the 15 EU leaders decided in
    Helsinki, Finland, to create a corps of 60,000
    troops capable by 2003 of deploying within
    60 days and remaining on the ground for up to
    a year
  • 3-level embeddedness: only 3 occurrences
  • < the first of two days of talks aimed at
    laying the groundwork for a future agreement
    >
  • < 60 days a period that ended Saturday
    with no such plan >
  • < one of the toughest days that traders that
    I've had a chance to talk to have seen in quite
    a long time >

30
Features
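[Figure: the example sentence "The meeting is moved to 1 / 2 / 2005 ." and the phrase "Monday morning", annotated with token features: startNP, insideNP, DET, TIME, t1TIMEt2TIME, lowerCase, t1 move, XX/XX/XXXX, and the affixes m, mo, ed, d]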
31
Time Zones (aka GMT)
32
Time hierarchies
[Figure: hierarchy rooted at TIME, with the node labels PERIOD, POINT, DAY, YEAR and DAY-SPAN, and the example expressions Jurassic; daybreak, noon, sunset; Monday, today; year of the dragon; morning, afternoon]
33
Challenges
  • Sentential complements: fair enough
  • First names
  • Jan de Bont (director, Cradle of Life)
  • January Sorensen (Wish upon a Star)
  • April Webster (casting director, Patriot)
  • Miss November
  • Friday (Robinson Crusoe's island companion)
  • Tuesday Weld (Dudley Moore's wife)
  • Titles: The Day After Tomorrow, Friday After
    Next
  • ORG: Sunday Times, Time Magazine
  • LOC: Times Square
  • time schedule
  • three bombings in as many years (BBC on
    Indonesia)

34
Unexpected challenge: Tokenization
  • Remember the myopic vision of the systems

Second 1/2 of 2004
Second 1/2 of the white source
Second 1/2 of the fiscal year
Nine 1/2 weeks
For recognition: keep as much as possible together
( 1/2 vs. 1_/_2)
Break further for normalization
(MM/DD/YYYY vs. DD/MM/YYYY formats)
35
Statistical Tokenization
  • List of unambiguous characters ?
  • List of ambiguous characters , . !
  • Go through text
  • Split at unambiguous char
  • Make a classification decision at each ambiguous
    char by considering the local context
  • Classifications are actions: keep (http://www.) /
    split before ('s) / after (Dr.) / around (Yes!) /
    oneBefore (don't)
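The action inventory above can be illustrated with a toy tokenizer. In this sketch the learned classifier is replaced by a few hand-written rules in `decide`, whitespace is assumed to be the only unambiguous split character, and `ABBREVIATIONS` is a made-up mini-lexicon; none of these details come from the slides.

```python
AMBIGUOUS = set(",.!'")
ABBREVIATIONS = {"Dr", "Mr", "Mrs"}  # hypothetical mini-lexicon

def char_at(text, k):
    """Character at position k, or a space when k is out of range."""
    return text[k] if 0 <= k < len(text) else " "

def decide(text, i):
    """Hand-written stand-in for the classifier: choose an action for the
    ambiguous character at position i from its local context."""
    ch, prev, nxt = text[i], char_at(text, i - 1), char_at(text, i + 1)
    if ch == "'":
        if nxt == "s" and not char_at(text, i + 2).isalpha():
            return "split_before"            # Smith's -> Smith 's
        if prev == "n" and nxt == "t":
            return "one_before"              # don't -> do n't
        return "keep"
    if ch == ".":
        j = i
        while j > 0 and text[j - 1].isalpha():
            j -= 1
        if text[j:i] in ABBREVIATIONS:
            return "split_after"             # Dr. stays one token
        return "keep" if nxt.isalnum() else "split_around"
    if ch == ",":
        return "keep" if prev.isdigit() and nxt.isdigit() else "split_around"
    return "split_around"                    # "!"

def tokenize(text):
    """Split at whitespace, and at ambiguous characters as decide() says."""
    cuts = {0, len(text)}
    for i, ch in enumerate(text):
        if ch.isspace():
            cuts.update((i, i + 1))
        elif ch in AMBIGUOUS:
            action = decide(text, i)
            if action in ("split_before", "split_around"):
                cuts.add(i)
            if action in ("split_after", "split_around"):
                cuts.add(i + 1)
            if action == "one_before":
                cuts.add(i - 1)
    bounds = sorted(cuts)
    pieces = (text[a:b] for a, b in zip(bounds, bounds[1:]))
    return [p for p in pieces if p.strip()]

print(tokenize("Dr. Smith's dog didn't bark!"))
```

A trained model would replace `decide` while the cut bookkeeping in `tokenize` stays the same.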

36
Outline
  • Background: temporal reasoning, why
  • Time analysis: an IE problem
  • Timex2 and TERN
  • Temporal IE Toolkit
  • Time analysis for temporal reasoning
  • how, infrastructure
  • TimeML: a mark-up language for time
  • TimeML analysis: an IE problem
  • Timex3, Events, Links
  • Temporal IE: status, challenges, future

37
The TIMEBANK Corpus (2003)
TIMEBANK contains 186 newswire articles with
careful, detailed annotations of terms denoting
events, temporal expressions, and temporal
signals, and, most importantly, of links between
them denoting temporal relations. This
collection, the largest temporal-event annotated
corpus to date, provides a solid empirical basis
for future research into the way texts actually
express and connect series of events. It will
support research into areas as diverse as the
semantics of tense and aspect, the explicit
versus implicit communication of temporal
relational information, and the variation in
typical event structure across narrative domains.
From a practical computational perspective, it
will become possible to consider training and
evaluating algorithms which determine event
ordering and time-stamping.
38
Methodology of corpus design
  • TimeBank vs. TERN corpus
  • Relationship between
  • Corpus and task
  • Corpus and implementation

39
TimeBank data characteristics
  • (Some) TimeBank statistics: 186
    documents, 68.5K words
  • At 10% test data, barely over 60K words for
    training

40
TimeBank data characteristics
  • Cf. Penn TreeBank / POS tagging corpus
  • More than 1M words (> 16 times larger than
    TimeBank), and a simple tagging scheme.

41
TimeBank data characteristics
  • Cf. CoNLL NE chunking
  • Extreme paucity of positive examples, e.g.
  • fewer than 2K examples over 13 classes
  • cf. >200K words / 23K examples (15 times >
    TimeBank) over just 4 name classes

42
TimeBank data characteristics
  • Cf. TERN's training set: almost 800 docs / 300K
    words, with >8K timex2, considered to be
    somewhat sparse!

Timex3
DATE 975
DURATION 314
TIME 80
SET 7 (1376)
Total 1423
(in document) 1245
43
Detour: Timex normalisation
44
ISO 8601 standard: Numeric representation of
dates and times
  • Easily readable/writable by software
  • Easily comparable / sortable (str compare)
  • Ditto for dates followed by times
  • Language independent
  • Unambiguous (wrt other date formats)
  • Notation is short and of constant length
  • year, month, day order widely used
  • Japan, Korea, Hungary, Sweden, Finland, DK,

45
ISO 8601 Dates
  • YYYY-MM-DD: 2005-07-29 (20050729),
    2005-07, 2005
  • YYYY-WNN: 2005-W30 (2005W30),
    2005-W30-5 = 2005-07-29
  • YYYY-DDD: 2005-001 (2005001),
    2005-210 = 2005-07-29
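As a check on the week and ordinal forms, the three ISO 8601 date notations for a given day can be derived with Python's standard `datetime` module. This is only a sketch; `iso_forms` is an illustrative helper name. It also shows the "sortable by string compare" property of the calendar form.

```python
from datetime import date

def iso_forms(d):
    """Return the calendar, week and ordinal ISO 8601 forms of a date."""
    iso_year, iso_week, iso_weekday = d.isocalendar()
    return {
        "calendar": d.isoformat(),                              # YYYY-MM-DD
        "week": f"{iso_year:04d}-W{iso_week:02d}-{iso_weekday}",
        "ordinal": f"{d.year:04d}-{d.timetuple().tm_yday:03d}",  # YYYY-DDD
    }

print(iso_forms(date(2005, 7, 29)))
# {'calendar': '2005-07-29', 'week': '2005-W30-5', 'ordinal': '2005-210'}

# Plain string sorting of the calendar form is chronological sorting.
stamps = ["2005-07-29", "2004-12-31", "2005-01-02"]
print(sorted(stamps))
```

Note that the ISO week-numbering year can differ from the calendar year for days close to 1 January, which is why the week form uses the year returned by `isocalendar()`.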

46
ISO 8601 Times
  • hh:mm:ss 23:59:59 (235959) 23:59
    23 23:59:59.9942
    (5.8 ms before midnight)
  • 2005-02-04 24:00 = 2005-02-05 00:00
    2005-02-04T23:59:59 (20050204T235959)
  • fixes many disadvantages of the old English 12h
    notation

47
ISO 8601 Durations
  • 6-dimensional space: Gregorian year, month, day,
    hour, minute, second
    PnYnMnDTnHnMnS
    P1Y2M3DT10H30M, -P120D
  • reduced precision, truncation allowed:
    P.5Y, P1D, P24H
  • order-relation: partial order

48
Timex value attribute
  • July 29, 2005 → 2005-07-29
  • Friday → 2005-07-29
  • today → 2005-07-29
  • 1993 → 1993
  • the 1990's → 199X
  • midnight, July 16, 2005 → 2005-07-16T00:00:00
  • 5pm → 2005-07-29T17:00
  • the previous day → 2005-12-24
  • last October → 2004-10
  • last autumn → 2004-FA
  • last week → 2005-W29
  • Thursday evening → 2005-07-28TEV
  • three months ago → 2005-04
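Several of the values above depend on a reference date, which the examples suggest is 2005-07-29. The following sketch resolves a handful of such expressions against an explicit reference date. The function `resolve` and its tiny expression inventory are hypothetical, and the reading of "last October" (the most recent October that is already over) is an assumption.

```python
from datetime import date, timedelta

def shift_months(d, n):
    """Year-month value n months away from date d, as YYYY-MM."""
    index = d.year * 12 + (d.month - 1) + n
    return f"{index // 12:04d}-{index % 12 + 1:02d}"

def resolve(expr, ref):
    """Normalise a context-dependent expression relative to ref."""
    if expr == "today":
        return ref.isoformat()
    if expr == "tomorrow":
        return (ref + timedelta(days=1)).isoformat()
    if expr == "three months ago":
        return shift_months(ref, -3)
    if expr == "last October":
        # Assumption: October of the current year counts only once it is over.
        year = ref.year - 1 if ref.month <= 10 else ref.year
        return f"{year:04d}-10"
    raise KeyError(f"no rule for {expr!r}")

ref = date(2005, 7, 29)
for expr in ("today", "tomorrow", "three months ago", "last October"):
    print(expr, "->", resolve(expr, ref))
```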

49
Timex value attribute
  • the early 1990's: value="199X" MOD="START"
  • the past 10 years: value="P10Y"
    anchor_val="2005" anchor_dir="BEFORE"
  • the next week: value="P1W"
    anchor_val="2005-W30" anchor_dir="AFTER"
  • the previous day: cf. point above
  • recent: value="PAST_REF"
    anchor_val="2005-07-29T12:00"
    anchor_dir="BEFORE"

50
TempEx
Mani & Wilson
  • identify temporal expressions
  • resolve self-contained expressions
  • interpret within a discourse processor
  • context tracker (for context dependent
    expressions)
  • explicit offsets from reference time
  • positional offsets from reference time
  • implicit offsets based on verb tense
  • further use of lexical markers
  • nearby dates

51
Machine learning in normalisation
Ahn, Adafre & de Rijke
  • Staged normalisation architecture: classifiers
    for some disambiguation subtasks
  • cf. TERN-2004, e.g. Cymphony
  • Derivative of Mani & Wilson, 2000
  • only one classifier learnt: generic vs. specific
    use of today
  • only date-valued expressions; not considering
    point/duration ambiguities
  • Separate context-independent interpretation from
    context-dependent processing

52
Decomposing normalisation
Ahn, Adafre & de Rijke
  • Lexical lookup (names → numbers, units → ISO
    values)
  • Context-independent composition: combining values
    of tokens within a timex
  • Context-based disambiguation: timex is a point
    or a duration?
  • Reference time (temporal focus) tracking, for
    anaphoric timexes
  • Computation of (final) normalised value

53
Interesting issues
  • Data for today classifier is sparse, heavily
    skewed (90% of instances are specific)
  • use the frequency baseline classifier for task
  • Training data for classifiers?
  • mediate between pre-normalisation and truth
  • straightforward for p-d and today tasks; dir
    task crucially requires a temporal reference
    model