Homing in on the Text-Initial Cluster

1
Homing in on the Text-Initial Cluster
  • Mike Scott
  • School of English
  • University of Liverpool
  • Aston Corpus Symposium
  • Friday May 4th 2007
  • This presentation is at
    www.lexically.net/downloads/corpus_linguistics

2
Starting Questions
  • Are clusters like "Once upon a time" and "lived
    happily ever after" oddities in marking text
    position?
  • Or do many n-grams characterise the beginnings,
    middles or ends of certain kinds of text?
  • If so, are there any common patterns in
    text-initial clusters?

3
Context
  • Textual Priming Project, University of Liverpool
  • Michael Hoey
  • Michaela Mahlberg
  • Matthew O'Donnell
  • Mike Scott

4
Textual Priming Project Aims
  • to investigate how many (and what types of)
    lexical items are primed to appear in
    text-initial or paragraph-initial position
  • to identify lexico-grammatical patterns and see
    how these patterns can be functionally
    interpreted in the textual contexts.
  • to relate these lexical and corpus-driven facts
    to current textual descriptions of (hard) news
    stories that might provide explanations for the
    positive primings of relevant lexis.

(from O'Donnell et al. 2007)
5
Hard News Corpus
  • Home News sections of the Guardian and Observer
  • 1998 to 2004
  • 115,654 articles
  • divided thus:
  • headline & lead
  • 1st sentence of 1st paragraph (TISC)
  • all other sentences
  • TISC contains 3.2 million tokens
  • The rest: 51.2 million tokens
  • About 470 words per article
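The per-article figure follows from the token counts on this slide. A minimal Python check, using only the numbers given above (the variable and function names are illustrative, not from the talk):

```python
# Figures taken from the slide; names are illustrative only.
ARTICLES = 115_654
TISC_TOKENS = 3_200_000    # 1st sentence of 1st paragraph
REST_TOKENS = 51_200_000   # everything else


def tokens_per_article(total_tokens: int, articles: int) -> float:
    """Mean number of tokens per article."""
    return total_tokens / articles


total = TISC_TOKENS + REST_TOKENS
print(f"whole corpus: {tokens_per_article(total, ARTICLES):.0f} tokens per article")
print(f"TISC only:    {tokens_per_article(TISC_TOKENS, ARTICLES):.1f} tokens per article")
```

On these figures a text-initial sentence is well under a tenth of the average article.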

6
Research Questions
  • Using the hard news corpus,
  • How many 3-5 word clusters are found to be key in
    TISC sections?
  • How many are positively and how many are
    negatively key?
  • What recurrent patterns can be found in the two
    types of key cluster?

7
Methods (1)
  • Format the corpus in XML and separate out all
    TISC sections (done by Matt O'Donnell)
  • Use WordSmith's WordList tool to compute wordlist
    indexes of
  • all the text
  • all the TISC sections
  • Using WordList, compute 3-5 word clusters for
    each index, save as .lst
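The cluster computation itself was done inside WordSmith. As a rough illustration of what "3-5 word clusters" means, here is a minimal Python sketch that counts every contiguous 3-, 4- and 5-word sequence in a text. The tokenisation rule and the sample sentence are assumptions made for the example, and the sketch does not reproduce WordSmith's own settings (it also lets clusters run across sentence boundaries).

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Upper-case the text and keep runs of letters, digits and apostrophes.

    This is an assumed tokenisation, not WordSmith's.
    """
    return re.findall(r"[A-Z0-9']+", text.upper())


def clusters(tokens: list[str], n_min: int = 3, n_max: int = 5) -> Counter:
    """Count every contiguous n-gram with n_min <= n <= n_max."""
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Invented sample text, for illustration only.
sample = ("The minister was last night given the go ahead. "
          "It was last night confirmed.")
counts = clusters(tokenize(sample))
for cluster, freq in counts.most_common(3):
    print(freq, cluster)
```

Running the same function once over all the text and once over the TISC sections would give the two frequency lists that the next step compares.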

8
Top clusters, all sections
  • GUARDIAN CO UK
  • ONE OF THE
  • A HREF HTTP, WWW GUARDIAN CO and similar web
    links
  • THE PRIME MINISTER
  • THE END OF
  • AS WELL AS
  • THE NUMBER OF
  • THERE IS A
  • SOME OF THE
  • THERE IS NO

9
Top clusters, TISC
  • ONE OF THE
  • ACCORDING TO A
  • LAST NIGHT AFTER
  • FOR THE FIRST
  • THE FIRST TIME
  • IS TO BE
  • FOR THE FIRST TIME
  • THE MURDER OF
  • ARE TO BE
  • THE DEATH OF
  • OF THE MOST
  • THE HOME SECRETARY
  • WAS LAST NIGHT
  • IT EMERGED YESTERDAY
  • AS PART OF
  • AN ATTEMPT TO
  • THE UNITED STATES
  • THE NUMBER OF
  • ONE OF THE MOST
  • ACCORDING TO THE

10
Methods (2)
  • Use KeyWords tool to compute KWs for the TISC 3-5
    word clusters using all the text as a reference
    corpus
  • Identify patterns in the KW clusters

11
TISC key clusters
  • ACCORDING TO A
  • LAST NIGHT AFTER
  • IT EMERGED YESTERDAY
  • WAS LAST NIGHT
  • ARE TO BE
  • THE MURDER OF
  • LAST NIGHT WHEN
  • THE GOVERNMENT YESTERDAY
  • LAST NIGHT AS
  • IS TO BE

  • WERE LAST NIGHT
  • YESTERDAY AFTER A
  • TONY BLAIR YESTERDAY
  • COURT HEARD YESTERDAY
  • WAS TOLD YESTERDAY
  • WAS JAILED FOR
  • THE DEATH OF
  • YEAR OLD BOY
  • YESTERDAY WHEN THE
  • WITH THE MURDER OF

12
Numbers of Key Clusters
13
RQs 1 & 2: Numbers of KW clusters
  • using a p value of 0.0000001 and minimum
    frequency of 3 and log likelihood statistic,
  • 8,132 key clusters altogether (in 3.2 million
    words of text)
  • of which 7,631 were positively key
  • and 501 negatively key
  • though there is repetition as these are 3-5 word
    n-grams
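To make the thresholds concrete, the sketch below computes a signed log-likelihood score for one cluster and converts it to a p value. It uses the common two-term formulation of the statistic, which is an assumption here and may differ in detail from the calculation in WordSmith's KeyWords tool. The function names and the example counts are hypothetical; only the corpus sizes, the p value and the minimum frequency come from the slides.

```python
import math


def log_likelihood(freq_study: int, size_study: int,
                   freq_ref: int, size_ref: int) -> float:
    """Signed log-likelihood (G2) of one item, study corpus vs reference corpus.

    Positive means relatively more frequent in the study corpus,
    negative means relatively less frequent.
    """
    combined = freq_study + freq_ref
    total = size_study + size_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    g2 = 0.0
    if freq_study > 0:
        g2 += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    g2 *= 2
    more_frequent = freq_study / size_study >= freq_ref / size_ref
    return g2 if more_frequent else -g2


def p_value(g2: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(abs(g2) / 2))


def is_key(freq_study: int, size_study: int, freq_ref: int, size_ref: int,
           max_p: float = 0.0000001, min_freq: int = 3) -> bool:
    """Apply the minimum-frequency and p-value thresholds named on the slide."""
    if freq_study < min_freq:
        return False
    g2 = log_likelihood(freq_study, size_study, freq_ref, size_ref)
    return p_value(g2) <= max_p


# Hypothetical counts for a single cluster; corpus sizes from the slides.
TISC, ALL_TEXT = 3_200_000, 54_400_000
score = log_likelihood(900, TISC, 1_000, ALL_TEXT)
print("positively key" if is_key(900, TISC, 1_000, ALL_TEXT) and score > 0 else "not positively key")
```

A negative score that passes the same threshold would correspond to a negatively key cluster.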

Research Question 2
14
Repetition
  • YESTERDAY FOUND GUILTY
  • YESTERDAY FOUND GUILTY OF
  • YESTERDAY FROM A
  • YESTERDAY FROM THE
  • YESTERDAY GAVE A
  • YESTERDAY GAVE HIS
  • YESTERDAY GAVE THE
  • YESTERDAY GIVEN A
  • YESTERDAY GIVEN THE
  • YESTERDAY GIVEN THE GO
  • YESTERDAY GIVEN THE GO AHEAD
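One way to gauge how much of the total is repetition is to drop every cluster that is wholly contained in a longer cluster from the same list. The sketch below does this for the clusters on this slide. It is an illustration only, not a step reported in the talk, and the function name is hypothetical.

```python
def collapse_subsumed(clusters: list[str]) -> list[str]:
    """Keep only clusters that are not a contiguous part of a longer one."""
    tokenised = [c.split() for c in clusters]

    def contains(longer: list[str], shorter: list[str]) -> bool:
        n = len(shorter)
        return any(longer[i:i + n] == shorter
                   for i in range(len(longer) - n + 1))

    kept = []
    for i, words in enumerate(tokenised):
        subsumed = any(
            len(other) > len(words) and contains(other, words)
            for j, other in enumerate(tokenised) if j != i
        )
        if not subsumed:
            kept.append(clusters[i])
    return kept


# The clusters listed on the slide.
slide_clusters = [
    "YESTERDAY FOUND GUILTY", "YESTERDAY FOUND GUILTY OF",
    "YESTERDAY FROM A", "YESTERDAY FROM THE",
    "YESTERDAY GAVE A", "YESTERDAY GAVE HIS", "YESTERDAY GAVE THE",
    "YESTERDAY GIVEN A", "YESTERDAY GIVEN THE",
    "YESTERDAY GIVEN THE GO", "YESTERDAY GIVEN THE GO AHEAD",
]
collapsed = collapse_subsumed(slide_clusters)
print(len(slide_clusters), "->", len(collapsed))
```

Collapsing in this way discards the separate frequencies of the shorter clusters, so it suits counting distinct patterns rather than keyness testing.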

15
Negatively key
  • A LOT OF
  • A SPOKESMAN FOR
  • THERE IS NO
  • HE SAID THE
  • SAID IT WAS
  • THERE IS A
  • THIS IS A
  • THE FACT THAT
  • AS WELL AS
  • IT WOULD BE

  • SPOKESMAN FOR THE
  • PER CENT OF
  • WE HAVE TO
  • SAID THAT THE
  • BUT IT IS
  • AT A TIME
  • A SPOKESMAN FOR THE
  • SAID HE WAS
  • IT IS NOT
  • THERE WAS NO

16
RQ 1: Numbers of KW clusters
  • Is 8 thousand a large number of distinct key
    text-initial clusters?
  • In the same amount of text there are 84 thousand
    3-5 word clusters of frequency at least 5
    altogether
  • so about one in 10 is associated with text-initial
    position at the .0000001 level of significance

17
RQ 1, continued
  • is 1 in 10 a large number to be key?
  • In the case of SISC (sentences from paragraphs
    with only one sentence in), we get
  • 507 thousand clusters, of which
  • 2,192 are key (1,747 positively and 445
    negatively)
  • which is about 1 in 230
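The two proportions can be reproduced from the rounded totals on these slides. A short Python check (the function name is illustrative):

```python
def one_in(total_clusters: int, key_clusters: int) -> float:
    """Return x such that roughly one cluster in x is key."""
    return total_clusters / key_clusters


# TISC: 84 thousand clusters, 8,132 key (7,631 positive + 501 negative).
tisc_key = 7_631 + 501
tisc_ratio = one_in(84_000, tisc_key)

# SISC: 507 thousand clusters, 2,192 key (1,747 positive + 445 negative).
sisc_key = 1_747 + 445
sisc_ratio = one_in(507_000, sisc_key)

print(f"TISC: about 1 in {tisc_ratio:.0f}")
print(f"SISC: about 1 in {sisc_ratio:.0f}")
```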

18
PATTERNS
19
RQ 3: patterns
  • recency
  • in the top 200, seventy express time, generally
    using "yesterday" or "last night"
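A count of this kind can be approximated by matching the two time expressions against a list of key clusters. The sketch below applies a regular expression to the ten TISC key clusters listed earlier in the talk. The pattern is an assumption that covers only "yesterday" and "last night", so it would miss other time expressions.

```python
import re

# Assumed pattern: only the two expressions named on the slide.
TIME_PATTERN = re.compile(r"\b(YESTERDAY|LAST NIGHT)\b")


def expresses_recency(cluster: str) -> bool:
    """True if the cluster contains 'yesterday' or 'last night'."""
    return bool(TIME_PATTERN.search(cluster.upper()))


# The ten TISC key clusters shown earlier in the presentation.
key_clusters = [
    "ACCORDING TO A", "LAST NIGHT AFTER", "IT EMERGED YESTERDAY",
    "WAS LAST NIGHT", "ARE TO BE", "THE MURDER OF", "LAST NIGHT WHEN",
    "THE GOVERNMENT YESTERDAY", "LAST NIGHT AS", "IS TO BE",
]
recency = [c for c in key_clusters if expresses_recency(c)]
print(f"{len(recency)} of {len(key_clusters)} clusters express recency")
```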

20
Recency clusters
  • COURT HEARD YESTERDAY
  • TONY BLAIR YESTERDAY
  • YESTERDAY AFTER A
  • WERE LAST NIGHT
  • LAST NIGHT AS
  • THE GOVERNMENT YESTERDAY
  • LAST NIGHT WHEN
  • WAS LAST NIGHT
  • IT EMERGED YESTERDAY
  • LAST NIGHT AFTER

  • YESTERDAY IN A
  • IT EMERGED LAST NIGHT
  • A COURT HEARD YESTERDAY
  • YESTERDAY WHEN A
  • YESTERDAY AFTER THE
  • EMERGED LAST NIGHT
  • LAST NIGHT TO
  • YESTERDAY AS THE
  • YESTERDAY WHEN THE
  • WAS TOLD YESTERDAY

21
Superlatives
  • ONE OF BRITAIN'S MOST
  • ONE OF THE MOST
  • OF THE WORLD'S
  • THE FIRST TIME
  • OF BRITAIN'S MOST
  • FOR THE FIRST
  • FOR THE FIRST TIME

22
Research, Report etc.
  • ACCORDING TO A REPORT
  • A COURT HEARD (YESTERDAY)
  • ACCORDING TO RESEARCH
  • TO A SURVEY
  • IT EMERGED LAST NIGHT
  • IT WAS ANNOUNCED YESTERDAY
  • IT WAS REVEALED YESTERDAY
  • A REPORT PUBLISHED
  • ACCORDING TO A STUDY
  • TO RESEARCH PUBLISHED

23
Attention-grabbers
  • IT EMERGED THAT
  • OBSERVER CAN REVEAL
  • THE OBSERVER CAN REVEAL

24
Indefinite articles positively key.
  • A BABY GIRL
  • A BAN ON
  • A BEACH IN
  • A BID TO
  • A BITTER ROW
  • A BLACK MAN
  • A BLISTERING ATTACK ON
  • A JURY WAS TOLD YESTERDAY

  • A LABOUR MP
  • A LANDMARK RULING
  • A LAST DITCH ATTEMPT TO
  • A LAST MINUTE
  • A LEADING BRITISH
  • A LEADING SCIENTIST
  • A LEGAL BATTLE
  • A LEGAL CHALLENGE

25
Indefinite articles negatively key
  • A KIND OF
  • A COUPLE OF
  • A GREAT DEAL
  • A LOT MORE

26
IT + reporting verb positively key
  • IT WAS ANNOUNCED LAST NIGHT
  • IT WAS CLAIMED LAST NIGHT
  • IT WAS CONFIRMED LAST NIGHT
  • IT IS REVEALED TODAY

27
IT otherwise negatively key
  • IT IS A
  • IT IS ABOUT
  • IT IS EXPECTED
  • IT IS GOING
  • IT IS ONLY
  • IT IS POSSIBLE
  • IT SEEMS TO

28
SAID YESTERDAY positively key
  • SAID YESTERDAY AFTER
  • SAID YESTERDAY THAT HE
  • SAID YESTERDAY THEY HAD

29
SAID without time negatively key
  • SAID AT THE
  • SAID HE HAD
  • SAID HE WOULD
  • SAID THE GOVERNMENT
  • SAID THERE WAS NO

30
Conclusions
  • The "once upon a time" syndrome seems to be much
    more common than might be thought.
  • In text-initial sections of 115 thousand hard
    news stories (3.2 m. words), out of 84 thousand
    3-5 word clusters, about 8 thousand (1 in 10) had
    text-initial significance
  • whereas in the non-text-initial comparison
    sections (SISC) only about 1 in 230 was key

31
Other patterns
  • recency
  • superlatives
  • research, report
  • attention-grabbers
  • indefinite articles
  • IT + reporting verb
  • SAID + time

32
References
  • O'Donnell, Matthew, Mike Scott, Michaela Mahlberg
    & Michael Hoey (forthcoming). When the text
    counts: Exploring the implications of text as
    unit in corpus linguistics. Paper presented at
    PALC, Lodz, April 2007.