Automated indexing of survey questionnaires and interviews


1
Automated indexing of survey questionnaires and
interviews
  • Louise Corti
  • UK Data Archive, University of Essex
  • NaCTeM, Manchester
  • 25 January 2008

2
Data collections at UKDA
  • social, economic and historical data
    collections
  • approximately 5,000 data collections (studies)
  • studies: primary data derived from social research
    methods
  • surveys, collated statistics, qualitative
    interviewing, fieldwork and observation

3
Resource discovery at UKDA
  • how do our users currently locate data?
  • at the highest level, a study level metadata
    record is compiled for each study following the
    DDI XML-based metadata schema
  • free text fields plus controlled vocabulary
  • DDI - all the national social science data
    archives in the world use this standard, so
    harvesting across collections is possible

6
Resource discovery at UKDA
  • the simplest form of resource discovery is a free
    text search, or a search on some key fields from
    the catalogue records, e.g. health

7
Free text search
8
Use of key words
  • UKDA manually assign key words to (index) data
    for resource discovery purposes
  • surveys: study description and question level
  • qualitative data: study level, methods
  • key words used thus describe the methodology but
    not the research data per se

9
Key word search
Key words
10
Key words
11
Key words are manually assigned at the survey
question level but captured in a database at the
study level
This is very laborious as there can be hundreds
per study!
BUT key words are NOT linked to questions! I.e.
at UKDA there is NO correspondence between them in
the current metadata schema.
12
Keyword searches can be refined by the user
making use of the UKDA thesaurus of terms
Select this term and a new search is run
13
UKDA Thesaurus
  • HASSET (Humanities and Social Science Thesaurus)
    is a subject thesaurus which has been developed
    by the UKDA over the past 20 years
  • Initially based on the UNESCO thesaurus, it has
    been continuously expanded and updated for use in
    the UKDA's online catalogue
  • display of the hierarchical relationships of
    terms can help users to broaden a search or make
    it more specific. Cross referencing to synonyms
    suggests alternative search terms, as does the
    provision of links to other conceptually related
    terms
  • it employs the conventional range of term
    relationships of equivalence (preferred and
    non-preferred terms), the hierarchical
    relationships (broader and narrower terms) and
    the associative relationships (related terms)
  • stored in SQL tables; a multilingual version,
    ELSST, was developed for the EC
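
The three kinds of term relationship described on this slide can be sketched as a small in-memory structure. This is only an illustration: the class and method names are invented here, the sample terms are made up rather than taken from HASSET, and the real thesaurus is held in SQL tables.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    broader: set = field(default_factory=set)   # hierarchical: broader terms
    narrower: set = field(default_factory=set)  # hierarchical: narrower terms
    related: set = field(default_factory=set)   # associative: related terms


class Thesaurus:
    """Toy thesaurus; not the HASSET data model."""

    def __init__(self):
        self.terms = {}  # preferred label -> Term
        self.use = {}    # non-preferred label -> preferred label (equivalence)

    def add(self, label):
        return self.terms.setdefault(label, Term(label))

    def add_hierarchy(self, broader, narrower):
        self.add(broader).narrower.add(narrower)
        self.add(narrower).broader.add(broader)

    def add_related(self, a, b):
        self.add(a).related.add(b)
        self.add(b).related.add(a)

    def add_synonym(self, non_preferred, preferred):
        self.add(preferred)
        self.use[non_preferred] = preferred

    def resolve(self, label):
        """Map a non-preferred term to its preferred term."""
        return self.use.get(label, label)

    def expand(self, label):
        """Return the options a user could follow to refine a search."""
        term = self.terms[self.resolve(label)]
        return {
            "preferred": term.label,
            "broader": sorted(term.broader),
            "narrower": sorted(term.narrower),
            "related": sorted(term.related),
        }


# Invented sample entries, for illustration only.
thesaurus = Thesaurus()
thesaurus.add_hierarchy("HEALTH", "CHRONIC ILLNESS")
thesaurus.add_related("CHRONIC ILLNESS", "DISABILITIES")
thesaurus.add_synonym("LONG-TERM ILLNESS", "CHRONIC ILLNESS")

options = thesaurus.expand("LONG-TERM ILLNESS")
```

Here a search on the non-preferred term is redirected to the preferred one, and the broader and related terms are offered as ways to widen or shift the search.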

14
Metadata
  • DDI allows for:
  • study level description: methodology and data
    description, authors, rights management and
    access etc.
  • file level: description of individual files, e.g.
    SPSS files, work file, audio file (but not
    currently used in-house)
  • question (variable) level: question description,
    text, var names and values, groups, PLUS key
    words (again not used)
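
A minimal sketch of how key words could sit at both the study level and the variable level of a DDI-style record. The element names (codeBook, stdyDscr, stdyInfo, subject, keyword, dataDscr, var, labl, qstn, qstnLit, concept) are intended to follow the DDI Codebook 2.x vocabulary, but the output is simplified and not validated against the schema. Using concept to carry variable-level terms, the function name, the variable name LLTI and the study title are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_codebook(title, study_keywords, variables, vocab="HASSET"):
    """Build a simplified DDI-style record (illustrative, not schema-valid)."""
    codebook = ET.Element("codeBook")

    # Study level: title plus controlled-vocabulary key words.
    study = ET.SubElement(codebook, "stdyDscr")
    citation = ET.SubElement(study, "citation")
    title_stmt = ET.SubElement(citation, "titlStmt")
    ET.SubElement(title_stmt, "titl").text = title
    subject = ET.SubElement(ET.SubElement(study, "stdyInfo"), "subject")
    for keyword in study_keywords:
        ET.SubElement(subject, "keyword", vocab=vocab).text = keyword

    # Variable (question) level: each term is attached to its own variable.
    data = ET.SubElement(codebook, "dataDscr")
    for variable in variables:
        var = ET.SubElement(data, "var", name=variable["name"])
        ET.SubElement(var, "labl").text = variable["label"]
        question = ET.SubElement(var, "qstn")
        ET.SubElement(question, "qstnLit").text = variable["question"]
        for keyword in variable["keywords"]:
            ET.SubElement(var, "concept", vocab=vocab).text = keyword
    return codebook


# Hypothetical study and variable, for illustration only.
codebook = build_codebook(
    title="Example Health Survey",
    study_keywords=["health"],
    variables=[
        {
            "name": "LLTI",
            "label": "Long-standing limiting illness",
            "question": "Do you suffer from any long-standing limiting illness?",
            "keywords": ["long-term illness"],
        }
    ],
)
xml_text = ET.tostring(codebook, encoding="unicode")
```

Because the term is a child of the var element, a search on the key word can lead directly to the question it describes.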

15
Variable search
16
But key words NOT linked to the variable!
17
Indexing survey questions
  • survey questions
  • Do you suffer from any long-standing limiting
    illness?
  • Keywords assigned: long-term illness
  • Government survey questions are often
    standardised to provide comparability across
    surveys
  • Have large databases of individual questions

18
Semi-automated solutions
  • UKDA indexing: there must be an easier way!
  • first a database of questions linked to key terms
    (controlled vocab) must be built to test any
    automated assignment
  • methodology and coder reliability should also be
    investigated
  • in-house guidelines are in place but assignment
    is still subjective
  • no stringent quality control on key word
    assignment
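
A minimal sketch of the test database described above: questions linked to key terms, with each link marked as manual or automatic so the two can be compared. The table layout, the sample rows and the choice of precision and recall as measures are assumptions for illustration, not an existing UKDA schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE question (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
CREATE TABLE assignment (
    question_id INTEGER NOT NULL REFERENCES question(id),
    term        TEXT NOT NULL,
    source      TEXT NOT NULL,   -- 'manual' or 'auto'
    PRIMARY KEY (question_id, term, source)
);
"""


def pairs(conn, source):
    """Set of (question_id, term) pairs assigned by one source."""
    cursor = conn.execute(
        "SELECT question_id, term FROM assignment WHERE source = ?", (source,)
    )
    return set(cursor)


def precision_recall(conn):
    """Score automatic assignments against the manual ones."""
    manual = pairs(conn, "manual")
    auto = pairs(conn, "auto")
    agreed = manual & auto
    precision = len(agreed) / len(auto) if auto else 0.0
    recall = len(agreed) / len(manual) if manual else 0.0
    return precision, recall


# Invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO question VALUES (?, ?)",
    [
        (1, "Do you suffer from any long-standing limiting illness?"),
        (2, "Are you currently in paid work?"),
    ],
)
conn.executemany(
    "INSERT INTO assignment VALUES (?, ?, ?)",
    [
        (1, "long-term illness", "manual"),
        (2, "employment", "manual"),
        (1, "long-term illness", "auto"),
        (2, "wages", "auto"),
    ],
)
precision, recall = precision_recall(conn)
```

The same table could hold assignments from two human coders instead of one human and one tool, which would give a simple check on coder reliability.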

19
Key words for qualitative data
  • a different challenge
  • indexing done at study level and is largely
    conceptual
  • work needs to be done on how researchers assign
    key words to data and how they search for
    qualitative data
  • analysis of data processing methods in house
  • analysis of UKDA search logs: what terms do
    users enter?
  • Can utilise named entity recognition, term
    extraction and document summarisation tools on
    these kinds of data (eg an unstructured
    transcribed interview)
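
The search log analysis could start with something as simple as counting the terms users enter and checking which of them are missing from the controlled vocabulary. The log format (one query string per entry), the sample queries and the vocabulary below are assumptions for illustration.

```python
from collections import Counter


def clean(query):
    """Trim whitespace and lower-case a raw query string."""
    return query.strip().lower()


def top_search_terms(queries, n=3):
    """Most frequent queries, ignoring case, padding and empty entries."""
    counts = Counter(clean(q) for q in queries if q.strip())
    return counts.most_common(n)


def not_in_vocabulary(queries, vocabulary):
    """Distinct queries that match no controlled-vocabulary term."""
    known = {clean(term) for term in vocabulary}
    return sorted({clean(q) for q in queries if q.strip()} - known)


# Invented log entries and vocabulary.
log = ["health", "Health ", "unemployment", "oral history", "health", "unemployment"]
vocabulary = ["HEALTH", "UNEMPLOYMENT"]

frequent = top_search_terms(log)
gaps = not_in_vocabulary(log, vocabulary)
```

Queries that recur but match no thesaurus term are candidates for new terms or new synonyms.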

20
What about using NaCTeM tools?
  • given that we can provide databases of terms
    linked to data (study, file and parts)
  • could test NaCTeM tools on these data to assign
    terms or concepts, or to summarise text
  • nice front end processing tools are essential
  • processors must have option to agree or edit any
    terms
  • terms should be output to DDI XML metadata at the
    study, file and variable level

21
Structural and content mark up of textual
interview data
  • for spoken interview texts, useful encoding
    features are
  • utterance, specific turn taker, defining
    idiosyncrasies in transcription
  • links to analytic annotation and other data types
    (e.g. thematic codes, concepts, audio or video
    links, researcher memos, maps, images, URLs etc.)
  • identifying information such as real names,
    company names, place names, occupations, temporal
    information

22
A sample interview
  • ID 001
  • Sex M
  • YOB 1921
  • Place Oldham
  • Finalocc Postman
  • <u id='1' who='interviewer'>Right, it starts with
    your grandparents. So give me the names and dates
    of birth of both. Do you remember those sets of
    grandparents?</u>
  • <u id='2' who='subject'>Yes.</u>
  • <u id='3' who='interviewer'>Well, we'll start with
    your mum's parents? Where did they live?</u>
  • <u id='4' who='subject'>They lived in Widnes,
    Lancashire.</u>
  • <u id='5' who='interviewer'>How do you remember
    them?</u>
  • <u id='6' who='subject'>When we Mum used to take
    me to see them and me Grandma came to live with
    us in the end, didn't she?</u>
  • <u id='7' who='Welham'>Welham: Yes, when Granddad
    died - '48.</u>
  • <u id='8' who='interviewer'>So he died when he was
    48?</u>
  • <u id='9' who='Welham'>Welham: No, he was 52. He
    died in 1948.</u>
  • <u id='10' who='interviewer'>But I remember it.
    How old would I be then?</u>
  • <u id='11' who='Welham'>Welham: Oh, you would have
    been little then.</u>
  • <u id='12' who='subject'>I remember him, he used
    to have whiskers. He used to put me on his knee
    and give me a kiss.</u>
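
Once each turn is marked up as an utterance with an identifier and a speaker, the transcript becomes easy to process. The sketch below assumes well-formed XML with u elements carrying id and who attributes inside a wrapper element; the wrapper, the renumbered sample and the function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Short well-formed sample in the style of the interview above.
SAMPLE = """<body>
  <u id="1" who="interviewer">Well, we'll start with your mum's parents? Where did they live?</u>
  <u id="2" who="subject">They lived in Widnes, Lancashire.</u>
  <u id="3" who="interviewer">How do you remember them?</u>
</body>"""


def read_turns(xml_text):
    """Return (id, speaker, text) for every utterance, in document order."""
    root = ET.fromstring(xml_text)
    return [(u.get("id"), u.get("who"), (u.text or "").strip()) for u in root.iter("u")]


turns = read_turns(SAMPLE)
turns_per_speaker = Counter(who for _, who, _ in turns)
subject_text = [text for _, who, text in turns if who == "subject"]
```

Separating the subject's speech from the interviewer's in this way means indexing or term extraction can be run on the respondent's words alone.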

23
ESRC SQUAD project
  • developed and tested universal standards and
    technologies
  • long-term digital archiving
  • publishing
  • data exchange
  • investigated user-friendly tools for
    semi-automating processes already used to prepare
    qualitative data and materials
  • formatted text documents ready for output
  • mark-up of structural features of textual data
  • annotation and anonymisation tool
  • automated coding/indexing linked to a domain
    ontology

24
Identifying elements
  • Identify atomic elements of information in text
  • Person names
  • Company/Organisation names
  • Occupations
  • Locations
  • Dates and times
  • Example
  • Italy's business world was rocked by the
    announcement last Thursday that Mr. Verdi would
    leave his job as vice-president of Music Masters
    of Milan, Inc to become operations director of
    Arthur Andersen.
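
To make the idea of identifying atomic elements concrete, here is a toy rule-based tagger applied to the example sentence. It is a sketch under strong assumptions: the patterns and word lists are hand-written for this one sentence and are not the rules used by the tools discussed in this talk. Real systems rely on much larger gazetteers and grammars.

```python
import re

# Hand-written patterns, tried in priority order (organisations first, so
# that "Milan" inside a company name is not also tagged as a location).
PATTERNS = [
    # Tolerates both spellings of the company name.
    ("organisation", re.compile(r"\b(?:Music Masters of Milan, Inc|Arthur Anders[eo]n)\b")),
    ("person", re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\. [A-Z][a-z]+")),
    ("location", re.compile(r"\b(?:Italy|Milan)\b")),
    ("date", re.compile(r"\b(?:last|next) (?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b")),
    ("occupation", re.compile(r"\b(?:vice-president|operations director)\b")),
]


def find_entities(text):
    """Return (type, matched text) pairs in order of appearance."""
    accepted = []  # (start, end, type) spans already claimed
    for entity_type, pattern in PATTERNS:
        for match in pattern.finditer(text):
            # Skip any match that overlaps a span claimed by an earlier rule.
            if all(match.end() <= s or match.start() >= e for s, e, _ in accepted):
                accepted.append((match.start(), match.end(), entity_type))
    return [(entity_type, text[s:e]) for s, e, entity_type in sorted(accepted)]


SENTENCE = (
    "Italy's business world was rocked by the announcement last Thursday "
    "that Mr. Verdi would leave his job as vice-president of Music Masters "
    "of Milan, Inc to become operations director of Arthur Andersen."
)
entities = find_entities(SENTENCE)
```

Rules like these are quick to write but tied to their domain, which matters for interviews that can cover almost any subject.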

25
Testing NLP tools
  • UKDA have investigated some basic NLP tools to
    identify named entities with a nice GUI
  • part of ESRC SQUAD award
  • rules can be written but obviously geared to
    domain specificity. Individual interviews can
    cover almost any subject!
  • system tuned to a sample of routine interview
    data, not jargon-laden

26
27
XML schema - TEI
  • main aim to tag data with key XML elements
  • work on an XML schema has specified a reduced
    set of Text Encoding Initiative (TEI) elements
  • core tag set for transcription
  • names, numbers, dates: <persName>
  • links and cross references: <ref>
  • text structure: <body>
  • unique to spoken texts: <kinesic>
  • contextual information (participants, setting,
    text)
  • New XML schema developed under JISC funding
    (DEXT) called QuDEx to describe annotation,
    linking, segmentation and alignment of
    qualitative data (www.data-archive.ac.uk/dext)

28
Transcript with manual XML mark-up
29
Automated XML mark-up: input data file for NLP
tools
30
Data processed through Edinburgh LT-XML and CME
tools
The main Graphical User Interface (GUI)
Invokes the SQUADCoder in NXT
31
NXT tool
Locate the NXT metadata file which must be set
up with named entity types
The NXT generic window running the SQUAD Coder
32
The SQUADCoder Window
All the references to a particular entity
The Named Entity Hierarchy
Transcription view
33
Annotation tool - anonymise
The Coreference Action Panel
34
Annotation tool
Enter pseudonym
35
Anonymised data
The Anonymised Transcription View
36
Annotated data in NXT: what formats, and how is
it stored?
  • NXT uses stand-off annotation: each annotation
    is linked to, or references, individual words
  • uses the NITE NXT XML model
  • creates new anonymised version of the text
  • save original file
  • save matrix of references - names to pseudonyms
  • outputs annotations, who worked on the file, etc.
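
The stand-off approach can be illustrated with a small sketch: the words keep their own identifiers, annotations only point at those identifiers, and the anonymised version is generated from the original plus a table of pseudonyms. This does not reproduce the NITE NXT XML format; the class, the identifiers and the sample sentence are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    entity_type: str   # e.g. "person" or "place"
    token_ids: tuple   # ids of the words this annotation points at
    entity_id: str     # shared by all mentions of the same entity


def tokenise(text):
    """Give every whitespace-separated word a stable id: w1, w2, ..."""
    return {f"w{i}": word for i, word in enumerate(text.split(), start=1)}


def anonymise(tokens, annotations, pseudonyms):
    """Build a new text; the original tokens are left untouched."""
    replaced = {}
    for annotation in annotations:
        if annotation.entity_id in pseudonyms:
            first, *rest = annotation.token_ids
            replaced[first] = pseudonyms[annotation.entity_id]
            for token_id in rest:
                replaced[token_id] = None  # remaining words of the mention are dropped
    output = []
    for token_id, word in tokens.items():
        new_word = replaced.get(token_id, word)
        if new_word is not None:
            output.append(new_word)
    return " ".join(output)


# Invented example sentence and pseudonyms.
tokens = tokenise("Welham lived in Widnes and Welham worked in Oldham")
annotations = [
    Annotation("person", ("w1",), "p1"),
    Annotation("place", ("w4",), "l1"),
    Annotation("person", ("w6",), "p1"),  # coreferent with the first mention
    Annotation("place", ("w9",), "l2"),
]
pseudonyms = {"p1": "Smith", "l1": "Townville"}  # the saved names-to-pseudonyms table

anonymised = anonymise(tokens, annotations, pseudonyms)
```

Because both mentions of the person share one entity identifier, a single pseudonym entry replaces every reference, while the place with no pseudonym is left as it was.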

37
Next steps
  • these are all demo tools; none has been taken any
    further, as project funding ended
  • I would like collaboration on annotation of data
    through semantic tagging and document mark-up
  • automatic term recognition and XML element
    tagging
  • automatic document classification: indexing
  • auto summarisation of text: document reduction
  • possibly detecting structural relationships
  • can NaCTeM (ASSERT) tools be used to undertake:
  • key word assignment for survey questions and
    structured catalogue records
  • term extraction, summarisation and mark-up of
    spoken interview data
  • coreferencing in interviews?

38
  • Louise Corti
  • UK Data Archive
  • 01206 872145
  • corti@essex.ac.uk