Title: Automated indexing of survey questionnaires and interviews
1Automated indexing of survey questionnaires and
interviews
- Louise Corti
- UK Data Archive, University of Essex
- NaCTeM, Manchester
- 25 January 2008
2Data collections at UKDA
- social, economic and historical data collections
- around 5000 data collections (studies)
- studies hold primary data derived from social research methods: surveys, collated statistics, qualitative interviewing, fieldwork and observation
3Resource discovery at UKDA
- how do our users currently locate data?
- at the highest level, a study level metadata record is compiled for each study following the DDI XML-based metadata schema
- free text fields plus controlled vocabulary
- DDI - all the national social science data archives in the world use this standard, so harvesting across collections is possible
6Resource discovery at UKDA
- the simplest resource discovery is a free text search, or a search on some key fields from the catalogue records, e.g. health
7Free text search
8Use of key words
- UKDA manually assigns key words to (index) data for resource discovery purposes
- surveys: study description and question level
- qualitative data: study level, methods
- key words used thus describe the methodology but not the research data per se
9Key word search
Key words
10Key words
11Key words are manually assigned at the survey
question level but captured in a database at the
study level
This is very laborious as there can be hundreds
per study!
BUT key words are NOT linked to questions, i.e. at UKDA there is NO correspondence between the two in the current metadata schema.
12Keyword searches can be refined by the user
making use of the UKDA thesaurus of terms
Select this term and a new search is run
13UKDA Thesaurus
- HASSET (Humanities and Social Science Thesaurus) is a subject thesaurus which has been developed by the UKDA over the past 20 years
- initially based on the UNESCO thesaurus, it has been continuously expanded and updated for use in the UKDA's online catalogue
- display of the hierarchical relationships of terms can help users to broaden a search or make it more specific. Cross referencing to synonyms suggests alternative search terms, as does the provision of links to other conceptually related terms
- it employs the conventional range of term relationships: equivalence (preferred and non-preferred terms), hierarchical relationships (broader and narrower terms) and associative relationships (related terms)
- stored in SQL tables; a multilingual version, ELSST, was developed for the EC
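The three relationship types listed above map naturally onto relational tables. The following is a minimal sketch only: the table layout, the relationship codes (USE, BT, RT) and the sample terms are assumptions made for illustration, not the actual HASSET schema or its entries.

```python
import sqlite3

# Hypothetical two-table layout: one row per term, one row per relationship.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (id INTEGER PRIMARY KEY, label TEXT UNIQUE, preferred INTEGER);
CREATE TABLE relation (source INTEGER, kind TEXT, target INTEGER);
""")

# Illustrative terms only (preferred = 1, non-preferred = 0).
conn.executemany("INSERT INTO term VALUES (?, ?, ?)", [
    (1, "HEALTH", 1),
    (2, "CHRONIC ILLNESS", 1),
    (3, "LONG-TERM ILLNESS", 0),
    (4, "DISABILITIES", 1),
])

# USE = equivalence (non-preferred -> preferred), BT = broader term, RT = related term.
conn.executemany("INSERT INTO relation VALUES (?, ?, ?)", [
    (3, "USE", 2),
    (2, "BT", 1),
    (2, "RT", 4),
])


def related(label, kind):
    """Labels reachable from `label` through one relationship of type `kind`."""
    rows = conn.execute(
        """SELECT t2.label FROM term t1
           JOIN relation r ON r.source = t1.id
           JOIN term t2 ON t2.id = r.target
           WHERE t1.label = ? AND r.kind = ?
           ORDER BY t2.label""",
        (label, kind),
    ).fetchall()
    return [row[0] for row in rows]


def narrower(label):
    """Narrower terms are read off as the inverse of the stored BT rows."""
    rows = conn.execute(
        """SELECT t1.label FROM term t1
           JOIN relation r ON r.source = t1.id
           JOIN term t2 ON t2.id = r.target
           WHERE t2.label = ? AND r.kind = 'BT'
           ORDER BY t1.label""",
        (label,),
    ).fetchall()
    return [row[0] for row in rows]
```

A search front end could call `related(term, "USE")` to redirect a non-preferred term, `related(term, "BT")` to broaden a search and `narrower(term)` to make it more specific.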
14Metadata
- DDI allows for
- study level description: methodology and data description, authors, rights management and access etc.
- file level: description of individual files e.g. SPSS files, work file, audio file (but not currently used in-house)
- question (variable) level: question description, text, var names and values, groups PLUS key words (again not used)
15Variable search
16But key words NOT linked to the variable!
17Indexing survey questions
- survey questions
- Do you suffer from any long-standing limiting illness?
- Keywords assigned: long-term illness
- Government survey questions are often standardised to provide comparability across surveys
- Have large databases of individual questions
18Semi-automated solutions
- UKDA indexing: there must be an easier way!
- first a database of questions linked to key terms (controlled vocab) must be built to test any automated assignment
- methodology and coder reliability should also be investigated
- in-house guidelines are in place but assignment is still subjective
- no stringent quality control on key word assignment
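Once questions are linked to manually assigned key terms, any automated assignment can be scored against them. The sketch below assumes a very simple phrase-matching assigner and a tiny made-up test set; the vocabulary, trigger phrases and questions (apart from the long-standing illness question) are hypothetical, and a real test would use the in-house database.

```python
import re

# Hypothetical controlled vocabulary: preferred term -> trigger phrases.
VOCAB = {
    "LONG-TERM ILLNESS": ["long-standing", "long-term illness", "limiting illness"],
    "EMPLOYMENT": ["paid work", "job", "employed"],
}

# Hypothetical test set: question text paired with manually assigned key words.
GOLD = [
    ("Do you suffer from any long-standing limiting illness?", {"LONG-TERM ILLNESS"}),
    ("Did you do any paid work last week?", {"EMPLOYMENT"}),
    ("How many rooms does your household have?", {"HOUSING"}),
]


def assign(question):
    """Return every vocabulary term with a trigger phrase in the question."""
    text = question.lower()
    found = set()
    for term, phrases in VOCAB.items():
        if any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in phrases):
            found.add(term)
    return found


def evaluate(gold):
    """Precision and recall of automated terms against the manual key words."""
    tp = fp = fn = 0
    for question, manual in gold:
        auto = assign(question)
        tp += len(auto & manual)   # terms both methods agree on
        fp += len(auto - manual)   # automated terms the indexer did not assign
        fn += len(manual - auto)   # manual terms the automation missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

The same tallies could be computed between two human indexers instead of human versus machine, which gives a first, rough view of coder reliability.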
19Key words for qualitative data
- a different challenge
- indexing is done at study level and is largely conceptual
- work needs to be done on how researchers assign key words to data and how they search for qualitative data
- analysis of data processing methods in house
- analysis of UKDA search logs: what terms do users enter?
- Can utilise named entity recognition, term extraction and document summarisation tools on these kinds of data (e.g. an unstructured transcribed interview)
20What about using NaCTeM tools?
- given that we can provide databases of terms linked to data (study, file and parts)
- could test NaCTeM tools on data to assign terms or concepts/summarise text
- nice front end processing tools are essential
- processors must have option to agree or edit any terms
- terms should be output to DDI XML metadata at the study, file and variable level
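As a rough illustration of the last point, agreed terms could be written out at study and variable level along these lines. This is a sketch under stated assumptions: the element names are loosely modelled on the DDI Codebook structure, the output is not validated against the DDI schema, the file level is omitted, and `build_record` and its arguments are hypothetical names.

```python
import xml.etree.ElementTree as ET


def build_record(study_title, study_terms, variables, vocab="HASSET"):
    """Build a DDI-like record.

    variables is a list of (name, question_text, [terms]) tuples.
    Element names are assumptions modelled on DDI Codebook, not a validated profile.
    """
    root = ET.Element("codeBook")

    # Study level: title plus key words from the controlled vocabulary.
    study = ET.SubElement(root, "stdyDscr")
    title_stmt = ET.SubElement(ET.SubElement(study, "citation"), "titlStmt")
    ET.SubElement(title_stmt, "titl").text = study_title
    subject = ET.SubElement(ET.SubElement(study, "stdyInfo"), "subject")
    for term in study_terms:
        ET.SubElement(subject, "keyword", vocab=vocab).text = term

    # Variable (question) level: question text with its own terms attached,
    # so each key word stays linked to the question it describes.
    data = ET.SubElement(root, "dataDscr")
    for name, question, terms in variables:
        var = ET.SubElement(data, "var", name=name)
        ET.SubElement(ET.SubElement(var, "qstn"), "qstnLit").text = question
        for term in terms:
            ET.SubElement(var, "concept", vocab=vocab).text = term
    return root


record = build_record(
    "Example health survey",                      # hypothetical study title
    ["HEALTH"],
    [("illness",
      "Do you suffer from any long-standing limiting illness?",
      ["LONG-TERM ILLNESS"])],
)
xml_text = ET.tostring(record, encoding="unicode")
```

Keeping the terms inside each variable element is what would give the question-to-key-word correspondence that the current metadata lacks.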
21Structural and content mark up of textual
interview data
- for spoken interview texts, useful encoding features are
- utterance, specific turn taker, defining idiosyncrasies in transcription
- links to analytic annotation and other data types (e.g. thematic codes, concepts, audio or video links, researcher memos, maps, images, URLs etc.)
- identifying information such as real names, company names, place names, occupations, temporal information
22A sample interview
- ID 001
- Sex M
- YOB 1921
- Place Oldham
- Finalocc Postman
- <u id='1' who='interviewer'>Right, it starts with your grandparents. So give me the names and dates of birth of both. Do you remember those sets of grandparents?</u>
- <u id='2' who='subject'>Yes.</u>
- <u id='3' who='interviewer'>Well, we'll start with your mum's parents? Where did they live?</u>
- <u id='4' who='subject'>They lived in Widness, Lancashire.</u>
- <u id='5' who='interviewer'>How do you remember them?</u>
- <u id='6' who='subject'>When we Mum used to take me to see them and me Grandma came to live with us in the end, didn't she?</u>
- <u id='7' who='Welham'>Welham Yes, when Granddad died - '48.</u>
- <u id='8' who='interviewer'>So he died when he was 48?</u>
- <u id='9' who='Welham'>Welham No, he was 52. He died in 1948.</u>
- <u id='10' who='interviewer'>But I remember it. How old would I be then?</u>
- <u id='11' who='Welham'>Welham Oh, you would have been little then.</u>
- <u id='12' who='subject'>I remember him, he used to have whiskers. He used to put me on his knee and give me a kiss.</u>
23ESRC SQUAD project
- developed and tested universal standards and technologies
- long-term digital archiving
- publishing
- data exchange
- investigated user-friendly tools for semi-automating processes already used to prepare qualitative data and materials
- formatted text documents ready for output
- mark-up of structural features of textual data
- annotation and anonymisation tool
- automated coding/indexing linked to a domain ontology
24Identifying elements
- Identify atomic elements of information in text
- Person names
- Company/Organisation names
- Occupations
- Locations
- Dates and times
- Example
- Italy's business world was rocked by the
announcement last Thursday that Mr. Verdi would
leave his job as vice-president of Music Masters
of Milan, Inc to become operations director of
Arthur Anderson
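A rule-based pass over the example sentence might look like the sketch below. It is only an illustration of the idea: the patterns and the small gazetteers are assumptions written for this one sentence, not the rules used in any of the tools discussed, and a real system would need far larger lists and more careful rules.

```python
import re

SENTENCE = ("Italy's business world was rocked by the announcement last Thursday "
            "that Mr. Verdi would leave his job as vice-president of Music Masters "
            "of Milan, Inc to become operations director of Arthur Anderson")

# Hypothetical gazetteers (look-up lists) for three of the entity types.
ORGANISATIONS = ["Music Masters of Milan, Inc", "Arthur Anderson"]
LOCATIONS = ["Italy", "Milan"]
OCCUPATIONS = ["vice-president", "operations director"]

# Pattern rules: a title followed by a capitalised word; an optionally qualified weekday.
PERSON = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.? [A-Z][a-z]+")
DATE = re.compile(r"\b(?:last |next )?(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b")


def find_entities(text):
    """Return (start, end, label, surface text) tuples in reading order."""
    entities = []
    for label, pattern in (("PERSON", PERSON), ("DATE", DATE)):
        for match in pattern.finditer(text):
            entities.append((match.start(), match.end(), label, match.group()))
    for label, names in (("ORGANISATION", ORGANISATIONS),
                         ("LOCATION", LOCATIONS),
                         ("OCCUPATION", OCCUPATIONS)):
        for name in names:
            for match in re.finditer(re.escape(name), text):
                entities.append((match.start(), match.end(), label, name))
    # Drop an entity nested inside a longer one, e.g. "Milan" inside the company name.
    kept = [e for e in entities
            if not any(o is not e and o[0] <= e[0] and e[1] <= o[1]
                       and (o[1] - o[0]) > (e[1] - e[0])
                       for o in entities)]
    return sorted(kept)
```

Calling `find_entities(SENTENCE)` yields one tuple per atomic element, each with character offsets that a later annotation or anonymisation step could use.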
25Testing NLP tools
- UKDA have investigated some basic NLP tools, with a nice GUI, to identify named entities
- part of ESRC SQUAD award
- rules can be written but are obviously geared to domain specificity. Individual interviews can cover almost any subject!
- system tuned to a sample of routine interview data, not jargon-laden
27XML schema - TEI
- main aim to tag data with key XML elements
- work on an XML schema has specified a reduced set of Text Encoding Initiative (TEI) elements
- core tag set for transcription
- names, numbers, dates <persName>
- links and cross references <ref>
- text structure <body>
- unique to spoken texts <kinesic>
- contextual information (participants, setting, text)
- New XML schema developed under JISC funding (DEXT) called QuDEx to describe annotation, linking, segmentation and alignment of qualitative data (www.data-archive.ac.uk/dext)
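To make the reduced tag set concrete, the sketch below wraps speaker turns in utterance elements and tags known names inside them. It is an assumption-laden illustration: the element names follow TEI conventions (`body`, `u`, `placeName`), but the attribute choices, the function names and the look-up of entities by exact string are hypothetical, and the result is not a validated TEI document.

```python
import re
import xml.etree.ElementTree as ET


def mark_up_utterance(parent, n, who, words, entities):
    """Add one <u> element; `entities` maps a surface string to an element name."""
    u = ET.SubElement(parent, "u", n=str(n), who=who)
    if not entities:
        u.text = words
        return u
    # Longest strings first so a short name cannot pre-empt a longer one.
    names = sorted(entities, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(name) for name in names) + ")")
    parts = pattern.split(words)      # even positions: plain text, odd: matched names
    u.text = parts[0]
    for i in range(1, len(parts), 2):
        tagged = ET.SubElement(u, entities[parts[i]])
        tagged.text = parts[i]
        tagged.tail = parts[i + 1]    # text that follows the tagged name
    return u


def build_body(turns, entities=None):
    """turns is a list of (speaker, text) pairs in the order they were spoken."""
    body = ET.Element("body")
    for n, (who, words) in enumerate(turns, start=1):
        mark_up_utterance(body, n, who, words, entities or {})
    return body


body = build_body(
    [("interviewer", "Where did they live?"),
     ("subject", "They lived in Widness, Lancashire.")],
    {"Widness": "placeName", "Lancashire": "placeName"},
)
```

Serialising `body` with `ET.tostring(body, encoding="unicode")` gives the kind of input file that downstream NLP tools could then enrich.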
28Transcript with manual XML mark-up
29Automated XML mark-up input data file for NLP
tools
30Data processed through Edinburgh LT-XML and CME
tools
The main Graphical User Interface (GUI)
Invokes the SQUADCoder in NXT
31NXT tool
Locate the NXT metadata file which must be set
up with named entity types
The NXT generic window running the SQUAD Coder
32The SQUADCoder Window
All the references to a particular entity
The Named Entity Hierarchy
Transcription view
33Annotation tool - anonymise
The Coreference Action Panel
34Annotation tool
Enter pseudonym
35Anonymised data
The Anonymised Transcription View
36Annotated data in the NXT: what formats and how stored?
- NXT uses stand-off annotation: annotation is linked to, or references, individual words
- uses the NITE NXT XML model
- creates new anonymised version of the text
- save original file
- save matrix of references - names to pseudonyms
- outputs annotations, who worked on the file etc.
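The idea of stand-off annotation can be shown with plain data structures. This sketch is not the NITE NXT XML model: the token ids, the annotation dictionaries and the pseudonyms are all hypothetical, and it only demonstrates that annotations point at words rather than being embedded in them, so the original text survives untouched.

```python
def tokenise(text):
    """Give every whitespace-separated word an id that annotations can point at."""
    return [{"id": f"w{i}", "text": word}
            for i, word in enumerate(text.split(), start=1)]


def anonymise(tokens, annotations, pseudonyms):
    """Return a new anonymised string; tokens and annotations are left unchanged.

    annotations: list of {"type": ..., "entity": ..., "tokens": [token ids]}
    pseudonyms:  the matrix of references, entity name -> pseudonym
    """
    by_id = {token["id"]: token["text"] for token in tokens}
    replace_at, skip = {}, set()
    for annotation in annotations:
        pseudonym = pseudonyms.get(annotation["entity"])
        if pseudonym is None:
            continue  # no pseudonym agreed for this entity, so keep the words
        first, last = annotation["tokens"][0], annotation["tokens"][-1]
        last_text = by_id[last]
        # Keep punctuation attached to the final word of the span.
        trailing = last_text[len(last_text.rstrip(".,;:!?")):]
        replace_at[first] = pseudonym + trailing
        skip.update(annotation["tokens"][1:])
    return " ".join(replace_at.get(token["id"], token["text"])
                    for token in tokens if token["id"] not in skip)


ORIGINAL = "They lived in Widness, Lancashire."
tokens = tokenise(ORIGINAL)
annotations = [
    {"type": "place", "entity": "Widness", "tokens": ["w4"]},
    {"type": "place", "entity": "Lancashire", "tokens": ["w5"]},
]
pseudonyms = {"Widness": "Milltown", "Lancashire": "Northshire"}  # made-up pseudonyms
anonymised = anonymise(tokens, annotations, pseudonyms)
```

Because the pseudonym matrix is stored separately, the same annotations can regenerate the anonymised version at any time, or be reversed by someone with access to the matrix.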
37Next steps
- these are all demo tools; none has been taken any further as project funding ended
- I would like collaboration on annotation of data through semantic tagging and document mark-up
- automatic term recognition and XML element tagging
- automatic document classification: indexing
- auto summarisation of text: document reduction
- possibly detecting structural relationships
- can NaCTeM (ASSERT) tools be used to undertake
- key word assignment for survey questions and structured catalogue records
- term extraction, summarisation and mark-up of spoken interview data
- coreferencing in interviews?
38- Louise Corti
- UK Data Archive
- 01206 872145
- corti_at_essex.ac.uk