Title: Malcolm Clark
1 Genre Analysis of Structured E-mails for Corpus Profiling
Workshop on Corpus Profiling for NLP/IR
Malcolm Clark
Supervisors: Professor Patrik O'Brian Holt, Dr Ian Ruthven
2 Presentation Outline
- Introduction
- The Problems
- Information Retrieval (IR), Genre and Perception
- Experiment: Research Questions, Setup, How do People use Textual Features?
- Conclusions
- Contributions and Implications
- Future Work
3 Introduction
- Focuses on IR and cognitive psychology.
- Corpora contain exemplar documents called genres, which are useful for profiling corpora.
- E-mail exchanges have socially constructed communicative behaviours, which exist to improve the efficiency of a community of practice and are useful for profiling corpora.
- Investigate these types of genres and how people use e-mails in terms of genre and perception for filtering.
4 The Problems
- Identifying genres for profiling a corpus
- Filtering the correct types of documents to the user by genre
- E-mail filtering
- Understanding user tasks
- How to rapidly understand a text without having to parse the whole document?
5 The Project Examines
- The value of structure.
- How form or layout is perceived in structured texts.
- Whether constructivist (recognition) or ecological (action afforded) approaches apply, or whether both are used.
- If and how the objects of a community of practice (COP) can be comprehended and exploited.
- How readers react to genre features in document collections.
6 Information Retrieval
Division of IR into computer science lab experiments vs user-orientated social studies (Järvelin, 2006)
7 Genre Background
[Diagram: a typical genre (Orlikowski and Yates 1994)]
- Form (readily observable features): structural features, comms medium, language or symbol system (formality, specialised vocab)
- Purpose (communicative purpose): themes, topics, arguments, discourse structure
8 Corpus - Genre Example from E-mail: Call for Papers
- Header (title etc.)
- Abstract
- Titles, topics (list)
- Dates and submission
9 Genre: What are Communities of Practice (COP)?
- What?
  - Social institutions/sites.
- When?
  - Human agents draw on genre rules to engage in organizational communication.
- How?
  - Produced, reproduced, or modified.
- But how are they perceived and used?
10 Human Perceptual Systems
- Two prominent fields in perception research
[Diagram: Perceive; final goal? Recognition or Action]
11 Experiment Pilot - Research Questions
- How do human beings use genre features, and what do they perceive?
- How can genre categorization be performed by using current skimming methods?
- How do genres evolve in communities of practice (e.g. e-mail)?
- How are the document genres and structural attributes used?
12 Experiment Pilot - How do People Use Texts?
- By eye tracking, i.e. the position and movement of the eye.
- Collect and analyse the empirical data produced by experiments in an e-mail community of practice.
- Locating the strategies and features for profiling corpora, e.g. centred blocks of text, invariant cues. Taking into account features, strategies etc.
- How do humans view genre?
13 Experiment Pilot
14 Pilot - Setup
- Method - 4 x 16 image blocks (4 genres in each two blocks).
- Measurements
  - Number of genres identified correctly - purpose
  - Structure vs non-structure form - form
  - Genre identification response time - form
  - Strategies and distinguishing features - purpose and form
- Variables
  - Purpose/type of genre
  - Form in 4 representations.
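
A minimal sketch of how the stimulus set could be enumerated. It assumes the 4 form representations are the 2 x 2 crossing of content and structure shown for the CFP genre on the following slides, and that each block of 16 images holds 4 genres in all 4 representations. The placeholder genre labels and the `build_block` helper are hypothetical, not taken from the pilot.

```python
from itertools import product

# Assumption: the 4 representations cross "content present?" with
# "structure present?", giving (content, structure) pairs.
REPRESENTATIONS = list(product([True, False], repeat=2))


def build_block(genres):
    """Return one block: every given genre in every representation."""
    return [
        {"genre": g, "content": c, "structure": s}
        for g in genres
        for (c, s) in REPRESENTATIONS
    ]


# Hypothetical placeholder labels; the slide only says 4 genres per block.
block = build_block(["genre_a", "genre_b", "genre_c", "genre_d"])
images_per_block = len(block)        # 4 genres x 4 representations
total_images = 4 * images_per_block  # 4 blocks in the pilot
print(images_per_block, total_images)
```

Under these assumptions the totals agree with the 4 x 16 blocks stated above.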
15 CFP - Content AND Structure
16 CFP - Structure and No Content
17 CFP - Content and No Structure
18 CFP - No Content AND No Structure
19 Setup
- Task and procedure
  - Shown 64 images.
  - Vocally identify each image.
  - Eye tracker records features and strategies used.
- Data recorded
  - X/Y location of saccades and fixations.
  - Features and strategies
  - Desktop video recording (Wink)
  - Timed and vocal responses
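
The slides do not say how fixations were separated from saccades in the X/Y data. As an illustration only, the sketch below uses a dispersion-threshold approach (commonly called I-DT), which is one standard way to do this and not necessarily what the eye tracker in the pilot used. The function names, the threshold values and the sample data are all assumptions.

```python
def dispersion(points):
    """Spread of a window of gaze points: x-range plus y-range."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(samples, max_dispersion=30.0, min_samples=5):
    """Group consecutive (x, y) samples into fixations.

    Returns a list of (centre_x, centre_y, n_samples). Samples that do not
    fit in any low-dispersion window are treated as saccade movement.
    """
    fixations = []
    i, n = 0, len(samples)
    while i + min_samples <= n:
        j = i + min_samples
        if dispersion(samples[i:j]) <= max_dispersion:
            # Grow the window while the points stay close together.
            while j < n and dispersion(samples[i:j + 1]) <= max_dispersion:
                j += 1
            window = samples[i:j]
            cx = sum(x for x, _ in window) / len(window)
            cy = sum(y for _, y in window) / len(window)
            fixations.append((cx, cy, len(window)))
            i = j
        else:
            i += 1
    return fixations


# Made-up gaze samples: a cluster near (100, 100), then a jump to (400, 300).
samples = [(100, 100), (101, 100), (100, 101), (102, 101), (101, 102), (100, 100),
           (400, 300), (401, 300), (400, 301), (402, 301), (401, 302), (400, 300)]
fixations = detect_fixations(samples)
```

With the made-up samples, the two clusters come out as two fixations and the jump between them is the saccade.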
20 Results after 5 Participants
- Number of genres identified correctly - purpose
  - 11.5 per block out of 16.
- Unstructured vs structured: 41.6 / 72.9
  - Original (87.5), original with no content (77), content with no structure (68), neither (27)
- Structure vs non-structure (form) - average response time (sec)
  - 2.22 vs 2.72
- How was it done?
- Clues to strategies
  - Skimmed shape: left-aligned (Sem) / centred (CFP)
  - Aligned text and blocks of text/numerics
  - No structure, or no structure or content: wide spirals of scanning behaviour, possibly looking for keywords?
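
A small worked calculation on the figures above, as a sketch only. It assumes 2.22 s belongs to the structured condition and 2.72 s to the unstructured one, following the order in which the slide lists them; the variable names are illustrative.

```python
# Figures reported on the slide after 5 participants.
correct_per_block = 11.5
images_per_block = 16
accuracy = correct_per_block / images_per_block

# Assumption: 2.22 s is the structured condition, 2.72 s the unstructured one.
rt_structured, rt_unstructured = 2.22, 2.72
rt_gap = round(rt_unstructured - rt_structured, 2)

# Scores per representation as listed on the slide (units are not stated there).
scores = {
    "original": 87.5,
    "original, no content": 77,
    "content, no structure": 68,
    "no content, no structure": 27,
}
ranking = sorted(scores, key=scores.get, reverse=True)

print(f"accuracy per block: {accuracy:.1%}")
print(f"response time gap: {rt_gap} s")
print("ranking:", ranking)
```

On these figures, removing structure costs more than removing content, and the structured images were identified about half a second faster.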
21 Results - Distinguishing Features
Genre    Features
CFP      Dates, centred blocks
Cinema   Block numerical content
ITS      Inconclusive (participants ignore them?)
Lib      List of book(s) info at bottom
Nl       Paragraph/summary of item, then URL
Ord      Left alignment/currency
Sem      Inconclusive
Spam     Keywords (LOTTO/address) and uppercase emboldened text
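
One way a few of these cues might be turned into machine-checkable rules, as a rough sketch. The regular expressions are illustrative stand-ins chosen for this example (plain text cannot capture emboldening or true visual centring), the genres with inconclusive features are left out, and `score_genres` and `guess_genre` are hypothetical helpers, not part of the pilot.

```python
import re

# Cues loosely based on the table above; every pattern is an assumption.
CUES = {
    "CFP": [r"[Dd]eadline", r"^\s{8,}\S"],      # dates; indented ("centred") lines
    "Ord": [r"[£$€]\s?\d"],                      # currency amounts
    "Nl": [r"https?://\S+"],                     # item summary followed by a URL
    "Spam": [r"\bLOTTO\b", r"\b[A-Z]{5,}\b"],    # keyword; uppercase text
}


def score_genres(text):
    """Count how many cues of each genre fire at least once in the text."""
    return {
        genre: sum(bool(re.search(p, text, re.MULTILINE)) for p in patterns)
        for genre, patterns in CUES.items()
    }


def guess_genre(text):
    """Return the genre with the most cues, or None if no cue fires."""
    scores = score_genres(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_genre("CONGRATULATIONS! You have won the LOTTO"))
print(guess_genre("Your order total is £12.50"))
```

Ties go to whichever genre is listed first in `CUES`, so a real system would need a better tie-break and cues drawn from the rendered layout rather than the raw text.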
22 Conclusions
- Genre is largely overlooked but momentum is building.
- Our approach is useful for filtering e-mails and identifying features for characterising datasets.
- Purpose and form are very useful for using texts.
- Clues to perception processes were found, but familiarity needs to be added to the mix.
- Train a machine to emulate human behaviour and understand textual input without reading the whole text?
23 Contributions and Implications
- Development of a language/perception theory/framework of
  - How people use different types of texts.
  - Modelling user tasks and behaviour in relation to genre and perception.
- Extend laboratory IR/user-orientated IR approach
  - From algorithms and machines.
  - To a user-oriented and contextual level.
24 Future Work
- Focus on narrowing down my work domains.
- Investigate domains
  - Academic document collections - CSIRO Enterprise
  - Legal documents - Enron
  - Weblogs - TREC Blog
  - Web domains - Wikipedia
- Consider multi-genres (e.g. course books) and large documents (e.g. social work report)
25 Malcolm Clark
26 Motivation
- Useful features for profiling corpora.
- Adds another type of filtering to large data collections to take advantage of genre, e.g. news, biographical etc.
- Genre benefits organisations financially and administratively, e.g. rapid retrieval of information.
- Embrace genre and perception to understand and examine these structures!
27 Evaluation System
- Model the findings based on FERRET and McFRUMP's Predictor and Substantiator.
- Our system: Genre Retrieval and Understanding Memory Program, or GRUMP.
- Similar features to Clark and Watt (2007)?
28 Skimming Categorisation
- Skimming
  - Used to identify the main points in a text much quicker than normal reading, without having to understand every word.
  - Normally used when a reader has a large amount of text to read within a limited time.
- Categorisation
  - Automatically labelled or classified.
  - No need for manual organisation, labelling or sorting.
29 Evaluation System - How it Works
[Figure taken from Mauldin 1991. Components: Queries, Texts, Query Parser, McFRUMP Parser, Abstracts, Case Frame Matcher, Case frame patterns, Relevant Texts]
McFRUMP parser contains the Predictor/Substantiator, Scripts etc.
30 Evaluation System - Script Example
- Using Schank's (1981, ch 3) Conceptual Dependency theory of Scripts, Plans and Goals and DeJong's (1982) FRUMP, make different genre scripts.
- "John Doe was arrested last Saturday morning after holding up the New Haven Savings Bank"
- ARREST SCRIPT
  - Police arrive at suspect location
  - Suspect apprehended
  - Taken to police station
  - Charged
  - Incarcerated or bailed
- Using this type of script format to understand stories, genre rules/features can be specified in scripts to understand texts.
Modify script with genre rules
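
A toy sketch of the script idea: the ARREST script is held as an ordered list of steps, and a sentence is checked against hand-picked cue words to see which steps it states and which the script would let a reader infer. This is not FRUMP's or McFRUMP's implementation; the cue words and the `apply_script` helper are hypothetical and exist only for this illustration.

```python
# Steps of the ARREST script from the slide; the cue words per step are
# hypothetical and chosen only for this illustration.
ARREST_SCRIPT = [
    ("police arrive at suspect location", {"police", "arrive", "arrived"}),
    ("suspect apprehended", {"arrested", "apprehended"}),
    ("taken to police station", {"station"}),
    ("charged", {"charged"}),
    ("incarcerated or bailed", {"jailed", "incarcerated", "bailed"}),
]


def apply_script(script, sentence):
    """Split script steps into those the sentence states and those left to infer."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    stated = [step for step, cues in script if cues & words]
    inferred = [step for step, cues in script if not (cues & words)]
    return stated, inferred


sentence = ("John Doe was arrested last Saturday morning after "
            "holding up the New Haven Savings Bank")
stated, inferred = apply_script(ARREST_SCRIPT, sentence)
print("stated:", stated)
print("left to infer:", inferred)
```

A genre script could follow the same shape, with steps replaced by expected genre features (for example, the blocks of a call for papers) and cue words replaced by layout or keyword cues.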