Title: Homing in on the Text-Initial Cluster
1Homing in on the Text-Initial Cluster
- Mike Scott
- School of English
- University of Liverpool
- Aston Corpus Symposium
- Friday May 4th 2007
- This presentation is at www.lexically.net/downloads/corpus_linguistics
2Starting Questions
- Are clusters like "Once upon a time" and "lived happily ever after" oddities in marking text position?
- Or do many n-grams characterise the beginnings, middles or ends of certain kinds of text?
- If so, are there any common patterns in text-initial clusters?
3Context
- Textual Priming Project, University of Liverpool
- Michael Hoey
- Michaela Mahlberg
- Matthew O'Donnell
- Mike Scott
4Textual Priming Project Aims
- to investigate how many (and what types of) lexical items are primed to appear in text-initial or paragraph-initial position
- to identify lexico-grammatical patterns and see how these patterns can be functionally interpreted in the textual contexts
- to relate these lexical and corpus-driven facts to current textual descriptions of (hard) news stories that might provide explanations for the positive primings of relevant lexis
from O'Donnell et al. 2007
5Hard News Corpus
- Home News sections of the Guardian and Observer
- 1998 to 2004
- 115,654 articles
- divided thus
- headline and lead
- 1st sentence of 1st paragraph (TISC)
- all other sentences
- TISC contains 3.2 million tokens
- The rest contains 51.2 million tokens
- About 470 words per article
6Research Questions
- Using the hard news corpus,
- How many 3-5 word clusters are found to be key in TISC sections?
- How many are positively and how many are negatively key?
- What recurrent patterns can be found in the two types of key cluster?
7Methods (1)
- Format the corpus in XML and separate out all TISC sections (done by Matt O'Donnell)
- Use WordSmith's WordList tool to compute wordlist indexes of
- all the text
- all the TISC sections
- Using WordList, compute 3-5 word clusters for each index, save as .lst
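The cluster-counting step can be illustrated outside WordSmith with a minimal sketch. It assumes a simple tokeniser (letters and apostrophes only, upper-cased) and ignores sentence boundaries; the function name `count_clusters` and the sample sentence are invented for illustration, and WordSmith's own tokenisation and index format may well differ.

```python
import re
from collections import Counter

def count_clusters(text, min_n=3, max_n=5):
    """Count every contiguous run of min_n..max_n words (a 'cluster')."""
    # Hypothetical tokeniser: upper-case, keep letters and apostrophes only.
    tokens = re.findall(r"[A-Z']+", text.upper())
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Invented sample text, not taken from the corpus.
sample = "It emerged yesterday that it emerged yesterday"
clusters = count_clusters(sample)
for cluster, freq in clusters.most_common(3):
    print(freq, cluster)
```

Because every 3-word run inside a 4- or 5-word run is counted too, longer clusters always bring their shorter sub-clusters with them.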
8Top clusters, all sections
- GUARDIAN CO UK
- ONE OF THE
- A HREF HTTP, WWW GUARDIAN CO and similar web links
- THE PRIME MINISTER
- THE END OF
- AS WELL AS
- THE NUMBER OF
- THERE IS A
- SOME OF THE
- THERE IS NO
9Top clusters, TISC
- ONE OF THE
- ACCORDING TO A
- LAST NIGHT AFTER
- FOR THE FIRST
- THE FIRST TIME
- IS TO BE
- FOR THE FIRST TIME
- THE MURDER OF
- ARE TO BE
- THE DEATH OF
- OF THE MOST
- THE HOME SECRETARY
- WAS LAST NIGHT
- IT EMERGED YESTERDAY
- AS PART OF
- AN ATTEMPT TO
- THE UNITED STATES
- THE NUMBER OF
- ONE OF THE MOST
- ACCORDING TO THE
10Methods (2)
- Use the KeyWords tool to compute KWs for the TISC 3-5 word clusters using all the text as a reference corpus
- Identify patterns in the KW clusters
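As a rough illustration of what a keyness computation involves, here is a sketch of one common formulation of the log-likelihood statistic for comparing a frequency in a study corpus with its frequency in a reference corpus. It is not a reproduction of the KeyWords tool: the function names are hypothetical, corpus sizes are taken as token counts, and all the counts in the usage lines are invented.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-corpus log-likelihood: 2 * sum(observed * ln(observed / expected))."""
    combined = freq_study + freq_ref
    total = size_study + size_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    if freq_study > 0:  # a zero count contributes nothing to the sum
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

def signed_keyness(freq_study, size_study, freq_ref, size_ref):
    """Positive if relatively more frequent in the study corpus, else negative."""
    ll = log_likelihood(freq_study, size_study, freq_ref, size_ref)
    over_used = freq_study / size_study >= freq_ref / size_ref
    return ll if over_used else -ll

# Invented counts: a cluster over-used and a cluster under-used in the study corpus.
print(signed_keyness(50, 1000, 50, 10000))
print(signed_keyness(1, 1000, 100, 10000))
```

A cluster with the same relative frequency in both corpora scores zero; the sign is added here only to separate positively from negatively key items.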
11TISC key clusters
- ACCORDING TO A
- LAST NIGHT AFTER
- IT EMERGED YESTERDAY
- WAS LAST NIGHT
- ARE TO BE
- THE MURDER OF
- LAST NIGHT WHEN
- THE GOVERNMENT YESTERDAY
- LAST NIGHT AS
- IS TO BE
- WERE LAST NIGHT
- YESTERDAY AFTER A
- TONY BLAIR YESTERDAY
- COURT HEARD YESTERDAY
- WAS TOLD YESTERDAY
- WAS JAILED FOR
- THE DEATH OF
- YEAR OLD BOY
- YESTERDAY WHEN THE
- WITH THE MURDER OF
12Numbers of Key Clusters
13RQs 1 & 2: Numbers of KW clusters
- using a p value of 0.0000001, a minimum frequency of 3 and the log likelihood statistic,
- 8,132 key clusters altogether (in 3.2 million words of text)
- of which 7,631 were positively key
- and 501 negatively key
- though there is repetition as these are 3-5 word n-grams
Research Question 2
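The thresholds on this slide can be sketched as a filter over already-scored clusters. The sketch assumes that a log-likelihood score is referred to a chi-square distribution with one degree of freedom and that the minimum frequency applies to the TISC frequency; both are assumptions about the procedure, the function names are hypothetical, and the scores in the sample list are invented.

```python
import math

def p_value_from_ll(ll):
    """Chi-square (1 d.f.) tail probability for a log-likelihood score."""
    return math.erfc(math.sqrt(abs(ll) / 2.0))

def select_key_clusters(scored, max_p=1e-7, min_freq=3):
    """Split (cluster, study_freq, signed_ll) triples into positive and negative keys."""
    positive, negative = [], []
    for cluster, freq, signed_ll in scored:
        if freq < min_freq or p_value_from_ll(signed_ll) > max_p:
            continue  # too rare or not significant at the chosen level
        (positive if signed_ll > 0 else negative).append(cluster)
    return positive, negative

# Invented frequencies and scores, for illustration only.
scored = [
    ("IT EMERGED YESTERDAY", 40, 120.0),
    ("A LOT OF", 12, -60.0),
    ("THE END OF", 30, 4.0),          # fails the p-value threshold
    ("RARE CLUSTER HERE", 2, 90.0),   # fails the minimum frequency
]
positive, negative = select_key_clusters(scored)
print(positive, negative)
```

Under the one-degree-of-freedom assumption, p = 0.0000001 corresponds to a score of a little over 28, which is a very strict cut-off.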
14Repetition
- YESTERDAY FOUND GUILTY
- YESTERDAY FOUND GUILTY OF
- YESTERDAY FROM A
- YESTERDAY FROM THE
- YESTERDAY GAVE A
- YESTERDAY GAVE HIS
- YESTERDAY GAVE THE
- YESTERDAY GIVEN A
- YESTERDAY GIVEN THE
- YESTERDAY GIVEN THE GO
- YESTERDAY GIVEN THE GO AHEAD
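The repetition can be made explicit by listing which clusters are contained in longer ones. The following is a minimal sketch using the eleven clusters on this slide; the function name `nested_pairs` is hypothetical and the containment test (contiguous word sequence) is an assumption about what counts as repetition.

```python
def nested_pairs(clusters):
    """Return (shorter, longer) pairs where shorter is a contiguous part of longer."""
    tokenised = [tuple(c.split()) for c in clusters]
    pairs = []
    for short in tokenised:
        for longer in tokenised:
            if len(short) >= len(longer):
                continue
            span = len(short)
            if any(longer[i:i + span] == short
                   for i in range(len(longer) - span + 1)):
                pairs.append((" ".join(short), " ".join(longer)))
    return pairs

slide_clusters = [
    "YESTERDAY FOUND GUILTY", "YESTERDAY FOUND GUILTY OF",
    "YESTERDAY FROM A", "YESTERDAY FROM THE",
    "YESTERDAY GAVE A", "YESTERDAY GAVE HIS", "YESTERDAY GAVE THE",
    "YESTERDAY GIVEN A", "YESTERDAY GIVEN THE",
    "YESTERDAY GIVEN THE GO", "YESTERDAY GIVEN THE GO AHEAD",
]
pairs = nested_pairs(slide_clusters)
for shorter, longer in pairs:
    print(shorter, "<", longer)
```

Counting only clusters that are not contained in a longer key cluster would give a lower, less repetitive total than the raw count.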
15Negatively key
- A LOT OF
- A SPOKESMAN FOR
- THERE IS NO
- HE SAID THE
- SAID IT WAS
- THERE IS A
- THIS IS A
- THE FACT THAT
- AS WELL AS
- IT WOULD BE
- SPOKESMAN FOR THE
- PER CENT OF
- WE HAVE TO
- SAID THAT THE
- BUT IT IS
- AT A TIME
- A SPOKESMAN FOR THE
- SAID HE WAS
- IT IS NOT
- THERE WAS NO
16RQ 1: Numbers of KW clusters
- Is 8 thousand a large number of distinct key text-initial clusters?
- In the same amount of text there are 84 thousand 3-5 word clusters of frequency at least 5 altogether
- about one in 10 is associated with text-initial position at the .0000001 level of significance
17RQ 1, continued
- is 1 in 10 a large number to be key?
- In the case of SISC (sentences from paragraphs with only one sentence in), we get
- 507 thousand clusters, of which
- 2,192 are key (1,747 positively and 445 negatively)
- which is about 1 in 230
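The two proportions can be checked with a few lines of arithmetic. The figures are the ones quoted on the slides; the totals of 84 thousand and 507 thousand are rounded there, so the results are approximate. The helper name `one_in` is made up for this sketch.

```python
def one_in(total_clusters, key_clusters):
    """Express a proportion as 'one in N': how many clusters per key cluster."""
    return total_clusters / key_clusters

tisc_ratio = one_in(84_000, 8_132)    # text-initial sentences (TISC)
sisc_ratio = one_in(507_000, 2_192)   # single-sentence paragraphs (SISC)

print(f"TISC: about 1 in {tisc_ratio:.0f}")
print(f"SISC: about 1 in {sisc_ratio:.0f}")
```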
18PATTERNS
19RQ 3: patterns
- recency
- in the top 200 key clusters, seventy express time, generally using "yesterday" or "last night"
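A count of this kind could be approximated automatically by flagging clusters that contain a time marker. The sketch below assumes only two markers, YESTERDAY and LAST NIGHT, and is run on the ten key clusters listed first on the TISC key clusters slide rather than on the full top 200; the names `TIME_MARKERS` and `expresses_recency` are hypothetical, and the figure of seventy may have been reached by other means.

```python
TIME_MARKERS = ("YESTERDAY", "LAST NIGHT")

def expresses_recency(cluster, markers=TIME_MARKERS):
    """True if the cluster contains any marker as whole, consecutive words."""
    padded = f" {cluster} "
    return any(f" {marker} " in padded for marker in markers)

top_key_clusters = [
    "ACCORDING TO A", "LAST NIGHT AFTER", "IT EMERGED YESTERDAY",
    "WAS LAST NIGHT", "ARE TO BE", "THE MURDER OF", "LAST NIGHT WHEN",
    "THE GOVERNMENT YESTERDAY", "LAST NIGHT AS", "IS TO BE",
]
recency = [c for c in top_key_clusters if expresses_recency(c)]
print(len(recency), "of", len(top_key_clusters), "express recency")
```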
20Recency clusters
- COURT HEARD YESTERDAY
- TONY BLAIR YESTERDAY
- YESTERDAY AFTER A
- WERE LAST NIGHT
- LAST NIGHT AS
- THE GOVERNMENT YESTERDAY
- LAST NIGHT WHEN
- WAS LAST NIGHT
- IT EMERGED YESTERDAY
- LAST NIGHT AFTER
- YESTERDAY IN A
- IT EMERGED LAST NIGHT
- A COURT HEARD YESTERDAY
- YESTERDAY WHEN A
- YESTERDAY AFTER THE
- EMERGED LAST NIGHT
- LAST NIGHT TO
- YESTERDAY AS THE
- YESTERDAY WHEN THE
- WAS TOLD YESTERDAY
21Superlatives
- ONE OF BRITAIN'S MOST
- ONE OF THE MOST
- OF THE WORLD'S
- THE FIRST TIME
- OF BRITAIN'S MOST
- FOR THE FIRST
- FOR THE FIRST TIME
22Research, Report etc.
- ACCORDING TO A REPORT
- A COURT HEARD (YESTERDAY)
- ACCORDING TO RESEARCH
- TO A SURVEY
- IT EMERGED LAST NIGHT
- IT WAS ANNOUNCED YESTERDAY
- IT WAS REVEALED YESTERDAY
- A REPORT PUBLISHED
- ACCORDING TO A STUDY
- TO RESEARCH PUBLISHED
23Attention-grabbers
- IT EMERGED THAT
- OBSERVER CAN REVEAL
- THE OBSERVER CAN REVEAL
24Indefinite articles positively key.
- A BABY GIRL
- A BAN ON
- A BEACH IN
- A BID TO
- A BITTER ROW
- A BLACK MAN
- A BLISTERING ATTACK ON
- A JURY WAS TOLD YESTERDAY
- A LABOUR MP
- A LANDMARK RULING
- A LAST DITCH ATTEMPT TO
- A LAST MINUTE
- A LEADING BRITISH
- A LEADING SCIENTIST
- A LEGAL BATTLE
- A LEGAL CHALLENGE
25Indefinite articles negatively key
- A KIND OF
- A COUPLE OF
- A GREAT DEAL
- A KIND OF
- A LOT MORE
26IT + reporting verb positively key
- IT WAS ANNOUNCED LAST NIGHT
- IT WAS CLAIMED LAST NIGHT
- IT WAS CONFIRMED LAST NIGHT
- IT IS REVEALED TODAY
27IT otherwise negatively key
- IT IS A
- IT IS ABOUT
- IT IS EXPECTED
- IT IS GOING
- IT IS ONLY
- IT IS POSSIBLE
- IT SEEMS TO
28SAID YESTERDAY positively key
- SAID YESTERDAY AFTER
- SAID YESTERDAY THAT HE
- SAID YESTERDAY THEY HAD
29SAID without time negatively key
- SAID AT THE
- SAID HE HAD
- SAID HE WOULD
- SAID THE GOVERNMENT
- SAID THERE WAS NO
30Conclusions
- The "once upon a time" syndrome seems to be much more common than might be thought.
- In text-initial sections of 115 thousand hard news stories (3.2 m. words), out of 84 thousand 3-5 word clusters, about 1 in 10 (some 8 thousand) had text-initial significance
- whereas in non text-initial sections only 1 in 230 was key
31Other patterns
- recency
- superlatives
- research, report
- attention-grabbers
- indefinite articles
- IT + reporting verb; SAID + time
32References
- O'Donnell, Matthew, Mike Scott, Michaela Mahlberg & Michael Hoey (forthcoming). When the text counts: Exploring the implications of text as unit in corpus linguistics. Paper presented at PALC, Lodz, April 2007.