Tools for Text and Data - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Tools for Text and Data
  • some examples
  • Lou Burnard (HCU, Oxford)

lou.burnard_at_oucs.ox.ac.uk
2
Finding aids and analysis tools
  • Text and Data: are they really so different?
  • SARA: a text search engine
  • Phelix: an XML database
  • An architecture for getting the best of both
    worlds

3
Resources
encoding
abstract model
digital resources
analysis
4
Transmitting the hermeneutic
  • scholarship depends on continuity
  • it is not enough to preserve the bytes of an
    encoding
  • there must also be a continuity of
    comprehension: the encoding must be
    self-descriptive

5
Text and data form a continuum
It's <date value='20010713'> Friday 13 July </date>
It's <date> Friday 13 July </date>
It's <dateStruct value='20010713'> <day type='name'>Friday</day>
<day type='number'>13</day> <month>July</month> </dateStruct>
  • Text is not a special kind of data.
  • Data is a special kind of text.
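The continuum can be illustrated with a minimal Python sketch. It assumes the three encodings are well-formed XML with quoted attribute values, wrapped in a hypothetical <s> element so that each has a single root; the helper names date_value and plain_text are illustrative, not part of any tool described here.

```python
import xml.etree.ElementTree as ET

# The three encodings from the slide, wrapped in an assumed <s> root element.
ENCODINGS = [
    "<s>It's <date value='20010713'>Friday 13 July</date></s>",
    "<s>It's <date>Friday 13 July</date></s>",
    "<s>It's <dateStruct value='20010713'><day type='name'>Friday</day> "
    "<day type='number'>13</day> <month>July</month></dateStruct></s>",
]

def date_value(xml_text):
    """Return the value attribute of the first date-like element, or None."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag in ("date", "dateStruct"):
            return elem.get("value")
    return None

def plain_text(xml_text):
    """Return the character data only, with all markup removed."""
    return "".join(ET.fromstring(xml_text).itertext())

for enc in ENCODINGS:
    print(repr(plain_text(enc)), date_value(enc))
```

All three read the same as text; only the encodings that carry a value attribute can also be treated as data without further interpretation.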

6
Corpus linguistics: a case in point
  • How do we find out what words mean?
  • algorithm
  • authority
  • usage
  • Corpus linguistics re-centres the last of these

7
British National Corpus
Think of it as an archive!
  • A snapshot of British English at the end of the
    20th century
  • 100 million words in approx. 4000 different text
    samples, both spoken (10%) and written (90%)
  • synchronic (1990-4), sampled, general purpose
    corpus

8
Who produced the BNC and why?
  • a consortium of dictionary publishers and
    academic researchers
  • OUP, Longman, Chambers
  • OUCS, UCREL, BL R&D
  • with funding from DTI/SERC under JFIT 1990-1994
  • Lexicographers, NLP researchers,
  • But not language teachers!

9
Distinctive features of the BNC
  • non-opportunistic design
  • word class annotation
  • contextual information
  • uniform (SGML) markup
  • general availability

10
Written Domains
11
Spoken domains
12
Architecture
[Diagram: <bnc> contains 4124 <bncdoc> elements; each <bncdoc> has a <header> and either a <text> (written) or an <stext> (spoken, 863 of them)]
13
Sample written text
  • <text complete=Y decls='CN000 HN001 QN000
    SN000'>
  • <div1 complete=Y org=SEQ>
  • <head type=MAIN>
  • <s n=001>
  • <w NP0>CAMRA <w NN1>FACT <w NN1>SHEET
  • <w AT0>No <w CRD>1 </head>
  • <head r=it type=SUB>
  • <s n=002>
  • <w AVQ>How <w NN1>beer <w VBZ>is
  • <w AJ0-VVN>brewed </head>
  • <p><s n=003>
  • <w NN1>Beer <w VVZ>seems <w DT0>such
  • <w AT0>a <w AJ0>simple <w NN1>drink <w CJT>that
  • <w PNP>we <w VVB>tend <w TO0>to <w VVI>take
  • <w PNP>it <w CJS-PRP>for <w VVD-VVN>granted
  • <c PUN>.
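As a rough illustration of how the word class annotation can be consumed, here is a minimal Python sketch. It assumes the SGML form in which each word is introduced by an unclosed <w TAG> tag and each punctuation mark by <c TAG>; the regular expression and the function names are illustrative assumptions, not part of SARA or any BNC toolkit.

```python
import re
from collections import Counter

# One token: <w TAG>form or <c TAG>form, where the form runs up to the next tag.
TOKEN = re.compile(r"<(w|c) ([A-Z0-9-]+)>([^<]*)")

def tokens(sgml):
    """Return a list of (form, tag) pairs in document order."""
    return [(m.group(3).strip(), m.group(2)) for m in TOKEN.finditer(sgml)]

def tag_counts(sgml):
    """Count how often each word class code occurs."""
    return Counter(tag for _, tag in tokens(sgml))

# Sentence 003 of the written sample.
SAMPLE = (
    "<p><s n=003><w NN1>Beer <w VVZ>seems <w DT0>such "
    "<w AT0>a <w AJ0>simple <w NN1>drink <w CJT>that "
    "<w PNP>we <w VVB>tend <w TO0>to <w VVI>take "
    "<w PNP>it <w CJS-PRP>for <w VVD-VVN>granted <c PUN>."
)

for form, tag in tokens(SAMPLE):
    print(f"{tag:8} {form}")
```

Ambiguity tags such as CJS-PRP are kept as single codes here; a real application would have to decide how to treat them.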

14
Sample spoken text
  • <u who=PS04Y>
  • <s n=01296><w ITJ>Mm <pause> <w ITJ>yes <pause
    dur=7>
  • <w PNP>I <w VVD>told <w NP0>Paul <pause>
  • <w CJT>that <w PNP>he <w VM0>can <w VVI>bring
  • <w AT0>a <w NN1>lady <w AVP>up <pause> <w PRP>at
  • <w NN1>Christmas-time<c PUN>.</u>
  • <u who=PS04U>
  • <s n=01297><w VBZ>Is <w PNP>he <w XX0>not
  • <w VVG>going <w AV0>home <w AV0>then<c PUN>?</u>
  • <u who=PS04Y>
  • <s n=01298><w ITJ>No <pause dur=8> <w CJC>and
  • <w UNC>erm <pause dur=7> <w PNP>I<w VBB>'m
  • <w VVG>leaving <w AT0>a <w NN1>turkey <w PRP>in
  • <w AT0>the <w NN1>freezer<c PUN>
  • <s n=01299><w NP0>Paul <w VBZ>is <w AV0>quite
  • <w AJ0>good <w PRP>at <w NN1-VVG>cooking <pause>
  • <w AJ0>standard <w NN1>cooking<c PUN>.</u>

15
Who needs this?
  • lexicographers
  • NLP researchers
  • teachers and learners of English
  • social science, cultural studies...

... in short, anywhere there is a need for real
life data about the English language
16
Software: the choice
  • specialist high performance toolkits
  • generic SGML browse and display engines
  • plaintext concordancing tools
  • or roll your own...

17
How do you spell Maastricht?
18
(No Transcript)
19
(No Transcript)
20
What words co-occur with it?
And finally
21
Defining a subcorpus
  • content-defined
  • by searching the texts
  • metadata defined
  • by searching classification data in the headers

22
Content-defined subcorpora
  • may consist of
  • all texts containing solutions to a previous
    query
  • manually selected texts
  • combinations of the above
  • Subcorpora can be saved and indexed (offline)
  • cf. IR result sets
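One way to picture a content-defined subcorpus is as a saved set of text identifiers, much like an IR result set. The Python sketch below assumes a tiny hypothetical corpus held in memory; the identifiers, texts and the function texts_containing are invented for illustration and do not reflect SARA's actual interface.

```python
# Hypothetical miniature corpus: text identifier -> plain text.
CORPUS = {
    "T1": "sunny spells in the south, mostly dry",
    "T2": "the minister was mostly silent",
    "T3": "rain at first, sunny later",
}

def texts_containing(word, corpus, within=None):
    """Return the ids of texts containing `word` as a whole token.

    If `within` is given, only texts in that subcorpus are searched.
    """
    ids = corpus.keys() if within is None else within
    return {tid for tid in ids
            if word in corpus[tid].replace(",", " ").split()}

# A subcorpus made of all texts containing solutions to a previous query ...
sunny = texts_containing("sunny", CORPUS)
# ... a manually selected one ...
manual = {"T2"}
# ... and a combination of the two.
combined = sunny | manual

# A later query can then be restricted to a saved subcorpus.
mostly_in_sunny = texts_containing("mostly", CORPUS, within=sunny)
print(sorted(sunny), sorted(combined), sorted(mostly_in_sunny))
```

Because a subcorpus is just a set, saving it, indexing it offline and combining it with others reduce to ordinary set operations.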

23
Weather language
24
sun. in weather forecasts
25
sunny in weather forecasts
26
sunny in all texts
27
mostly in weather forecasts
28
And incidentally, mostly in all?
29
Metadata-defined subcorpora
  • Selection features
  • written domain (<catRef>) e.g. imaginative / arts
    / leisure / world affairs
  • predefined criteria
  • Descriptive features
  • <activity> e.g. lecture, interview, meeting
  • <classCode> e.g. written, written academic,
    written academic medicine
  • post hoc criteria

30
collocates assumptions? in wac
31
XML databases: the claim
  • In the old days, text and data were different
  • Now, with XML, you can have your cake and eat it
    too
  • text can become data
  • data can become text
  • Datacentric and docucentric worlds converge

32
A case in point: descriptive bibliography
Perec, Georges. Life - a user's manual. Collins,
1988. Translated from the French La vie mode
d'emploi by David Bellos. xviii+581 pp. 841.941
Literature - French - 20th century
  • What's the author's name?
  • What translators are there?
  • Which 20th c. French works have more than 400
    pages?
  • List titles containing fewer than 6 words
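Questions like these become answerable once the entry is marked up. The Python sketch below assumes a hypothetical XML encoding of the Perec entry; the element and attribute names (bibl, extent, pages and so on) are invented for illustration and are not taken from any schema named in this presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured version of the prose entry above.
RECORD = """<bibl>
  <author><surname>Perec</surname><forename>Georges</forename></author>
  <title>Life - a user's manual</title>
  <publisher>Collins</publisher><date>1988</date>
  <translator>David Bellos</translator>
  <origTitle lang="fr">La vie mode d'emploi</origTitle>
  <extent pages="581"/>
  <subject code="841.941">Literature - French - 20th century</subject>
</bibl>"""

def author_name(bibl):
    """What's the author's name?"""
    a = bibl.find("author")
    return f"{a.findtext('forename')} {a.findtext('surname')}"

def translators(bibl):
    """What translators are there?"""
    return [t.text for t in bibl.findall("translator")]

def page_count(bibl):
    """Number of pages, for questions such as 'more than 400 pages?'."""
    return int(bibl.find("extent").get("pages"))

def title_words(bibl):
    """Words of the title, ignoring tokens that are pure punctuation."""
    return [w for w in bibl.findtext("title").split()
            if any(c.isalpha() for c in w)]

bibl = ET.fromstring(RECORD)
print(author_name(bibl), translators(bibl),
      page_count(bibl) > 400, len(title_words(bibl)) < 6)
```

None of these questions can be answered reliably from the unstructured prose form without re-parsing it by eye.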

33
even more so, for mss
  • long tradition of descriptive belle lettriste
    approach
  • necessitated by cultural complexity and lack of
    standardization
  • mss are unique objects, sometimes very valuable,
    sometimes the reverse
  • and spread across many different locations

34
for example
and so on, for two more pages
Lithuania, Vilnius, Lietuvos nacionaline Martyno
Mavydo biblioteka F101-26 Jurij Fiodorovic
Chrebtovicia (юрьи федорович хребтовича) emes
pardavimo ratas Vilnius, 1538 rugpjucio 23 (
kitoje lapo puseje prieraai lenku
kalba) Incipit Я юрьи федорович
хребтовича. Dvarionis Jurijus Fiodorovicius
Chrebtovicia parduoda Kareivikiu dvara su
tarnais (ivardyti) ir Ininkovskio dykyne savo
seseriai Hanai Martinovnai Chreptovicia
Martinovajai Podcentkovskajai Kareivikiu dvara
Jurijui Chrebtovicia padovanojo jo senele
idininkiene Hana Andriejevaja Aleksandrovicia .
Dvara Jurijus Chrebtovicia tvarke atskirai nuo
kitu tevonijos valdu, turedamas teise i dvara
valdyti savo nuoiura. Visos dvaro valdymo ir
palikimo teises perduodamos Jurijaus Chrebtovicia
seseriai. Dovanojimo ratu nustatyta, kad
deimtine nuo Kareivikiu dvaro turi buti mokama
Papikiu dvaro (karaliaus dvaro idininko,
Vilniaus laikytojo Ivano Andriejeviciaus valdos)
v. Mikalojaus banyciai Aatskiras lapas
pergamentas 38790 x 660 Textas raytas per visa
lapa, pusustavis, pereinantis i greitrati.
Tekstas raytas vieno ratininko aikiu
greitraciu, tas pats ratininkas parae F101-28
35
The MASTER project
  • Partners
  • Centre for Technology in the Arts, De Montfort
    University
  • Humanities Computing Unit, Oxford
  • Arnamagnæan Institute, Copenhagen
  • Institut de recherche et d'histoire des textes, Paris
  • Royal Dutch Library, The Hague
  • Czech National Library, Prague
  • Funded under EU Libraries Programme Sept 1998 -
    July 2001

36
The MASTER project
  • Partners
  • Centre for Technology in the Arts, De Montfort
    University
  • Humanities Computing Unit, Oxford
  • Arnamagnæan Institute, Copenhagen
  • Institut de recherche et d'histoire des textes,
    Paris
  • Royal Dutch Library, The Hague
  • Národní knihovna ČR, Praha
  • Funded under EU Libraries Programme Sept 1998 -
    July 2001

37
The MASTER plan
  • Develop a European standard for manuscript
    description
  • compatible with other relevant standards
  • Implement demonstrator systems allowing
  • distributed data capture
  • integrated searchability
  • Disseminate and concertate

38
Consequences
  • emphasis on distributed data, from many different
    partners
  • emphasis on structuring of text, some
    pre-existing, some not
  • short time scale
  • support issues

39
The <msDescription> element
  • What is a manuscript description?
  • a text (MLE record)
  • a bit of a text (MLE crystal)
  • a description of a text? (MLE header field)
  • The answer depends on whether you're making
  • a finding aid
  • a catalogue raisonné
  • a digital surrogate
  • http://www.hcu.ox.ac.uk/TEI/Master/Reference/

40
Collections of msDescriptions
  • may be output from legacy systems
  • may be created de novo
  • traditional DBMS functionality
  • concurrent, multiple update
  • referential integrity
  • resilience
  • simple file system is inadequate

41
Referential integrity
  • authority files for
  • language codes
  • bibliographic sources
  • persons referenced
  • classification scheme
  • etc.
  • essential for distributed collection, desirable
    for single large system
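A check of this kind might look like the following Python sketch. It assumes authority files reduced to sets of permitted keys and hypothetical attribute names (langKey, persKey) that point into them; neither the attribute names nor the function dangling_references come from the MASTER specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical authority files: attribute name -> set of permitted keys.
AUTHORITY = {
    "langKey": {"LAT", "LIT", "POL"},
    "persKey": {"P001", "P002"},
}

def dangling_references(xml_text, authority):
    """Return sorted (attribute, value) pairs with no authority file entry."""
    root = ET.fromstring(xml_text)
    missing = set()
    for elem in root.iter():
        for attr, allowed in authority.items():
            value = elem.get(attr)
            if value is not None and value not in allowed:
                missing.add((attr, value))
    return sorted(missing)

DOC = """<msDescription>
  <textLang langKey="LIT">Lithuanian</textLang>
  <name persKey="P001">Jurij</name>
  <name persKey="P999">Hana</name>
</msDescription>"""

print(dangling_references(DOC, AUTHORITY))
```

In a distributed collection the same check would have to run against authority files shared by all partners, which is why a plain file system falls short.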

42
Design goals for Phelix
  • Open source
  • Document repository functionality
  • No document editing or updating
  • Multi-document searching using XML structure
  • Adaptable and customizable

43
Implementation issues
  • Re-use existing tools where possible
  • Functionality above performance
  • Assumes networked academic environment
  • Single DTD system

44
Why use an RDBMS?
  • mature, stable, portable, scalable
  • widespread, easy to integrate
  • foreign key access (usually) optimized
  • but assembling XML fragments is inherently
    slow (see
    http://www.cs.wisc.edu/niagara/papers/vldb00XML.pdf)
  • OODBMS cost serious money and cannot be freely
    relicenced

45
What is modelled in the RDBMS?
  • Not the semantics of the data but its XML
    representation
  • XML serializes a tree structure
  • Phelix models the tree, not its meaning

[Diagram labels: XML fragments, RDBMS]
46
The Phelix architecture
[Diagram: components are an RDBMS, a scripting language, a validating parser, an XSLT engine, a user agent and one or more servers]
47
Current Phelix implementation
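[Diagram: the same architecture with current components: RDBMS: mySQL; scripting language: PHP; validating parser: expat/rxp; XSLT engine: sablotron; user agent: IE5, Opera; server: janus.oucs.ox.ac.uk/master; formats: XML, HTML]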
48
Current functionality
  • You can
  • upload a valid XML msDescription, or an archive
    of them
  • validate (some) content against external
    authority files
  • publish them to other partners
  • search all published msDescriptions
  • view and download selected msDescriptions using
    your own stylesheet
  • save and review query results
  • You cannot validate or modify existing records

49
Under development
  • Initial spec: <msDescription>s
  • Additional support for bibliography, onomastic
    authority files
  • Optional stylesheet interface
  • Generalized interface
  • Mapping Asia project
  • Free text search component?

50
Storage of XML documents
  • The parser decomposes the document tree into
    atomic nodes
  • element
  • pcdata fragments
  • attribute-value pairs
  • Each node is stored as a row in the DBMS
  • Relationships between nodes are represented by
    pointers (aka foreign keys)
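The decomposition can be sketched in a few lines of Python. This is an illustration of the idea only: it uses sqlite3 and ElementTree from the standard library rather than the mySQL, PHP and expat/rxp components named for Phelix, and the table layout (one node table with a parent foreign key) is an assumption, not the actual Phelix schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

def store(conn, xml_text):
    """Decompose a document into node rows; return the id of the root row."""
    conn.execute("""CREATE TABLE IF NOT EXISTS node (
        id INTEGER PRIMARY KEY,
        parent INTEGER REFERENCES node(id),  -- pointer to the parent node
        kind TEXT NOT NULL,                  -- element, pcdata or attribute
        name TEXT,
        value TEXT)""")

    def add(parent, kind, name, value):
        cur = conn.execute(
            "INSERT INTO node (parent, kind, name, value) VALUES (?, ?, ?, ?)",
            (parent, kind, name, value))
        return cur.lastrowid

    def walk(elem, parent):
        me = add(parent, "element", elem.tag, None)
        for key, val in elem.attrib.items():
            add(me, "attribute", key, val)
        if elem.text and elem.text.strip():
            add(me, "pcdata", None, elem.text)
        for child in elem:
            walk(child, me)
            # Text following a child belongs to the enclosing element.
            if child.tail and child.tail.strip():
                add(me, "pcdata", None, child.tail)
        return me

    return walk(ET.fromstring(xml_text), None)

conn = sqlite3.connect(":memory:")
root_id = store(conn, "<date value='20010713'>Friday <day>13</day> July</date>")
rows = conn.execute("SELECT kind, name, value FROM node ORDER BY id").fetchall()
for row in rows:
    print(row)
```

Reassembling a fragment means walking the parent pointers back into a tree, which is the step the RDBMS slide describes as inherently slow.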

51
XML Queries
  • XML Query or XPath?
  • choice was unclear at design time
  • QueryExpression and Query objects
  • encapsulated in PHP layer
  • access to ancestors, parents, attributes, content

52
Interfaces
  • There's nothing as powerful as a good metaphor
  • The Walkthrough
  • The Basket
  • Picking and Choosing
  • Designing the user interface is harder than
    designing the engine

53
Other interfaces
  • Sorry, no full DOM support (yet)
  • Nodes are stored in database
  • Interface customization layer
  • form design
  • user supplied stylesheets
  • much remains to be done

54
(No Transcript)
55
(No Transcript)
56
(No Transcript)
57
(No Transcript)
58
(No Transcript)
59
Messages
  • Text is data structured in ways we don't
    immediately recognise
  • Textual enrichment and text markup are two sides
    of the same coin
  • Text retrieval and database management tools are
    complementary, not competitive