Title: Tools for Text and Data
1Tools for Text and Data
- some examples
- Lou Burnard (HCU, Oxford)
lou.burnard_at_oucs.ox.ac.uk
2Finding aids and analysis tools
- Text and Data: are they really so different?
- SARA: a text search engine
- Phelix: an XML database
- An architecture for getting the best of both worlds
3Resources
encoding
abstract model
digital resources
analysis
4Transmitting the hermeneutic
- scholarship depends on continuity
- it is not enough to preserve the bytes of an encoding
- there must also be a continuity of comprehension: the encoding must be self-descriptive
5Text and data form a continuum
It's <date value='20010713'>Friday 13 July</date>
It's <date>Friday 13 July</date>
It's <dateStruct value='20010713'><day type='name'>Friday</day> <day type='number'>13</day> <month>July</month></dateStruct>
- Text is not a special kind of data.
- Data is a special kind of text.
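A minimal Python sketch of the continuum, assuming the first form of the date markup shown above (a `<date>` element carrying a normalized `value` attribute). The function names are illustrative only. The same string is read once as text, by discarding the tags, and once as data, by discarding the wording.

```python
import re
from datetime import date

SENTENCE = "It's <date value='20010713'>Friday 13 July</date>"


def read_as_text(marked_up: str) -> str:
    """Strip the tags and keep only what a human reader would see."""
    return re.sub(r"<[^>]+>", "", marked_up)


def read_as_data(marked_up: str) -> date:
    """Use the normalized value attribute and ignore the wording."""
    match = re.search(r"<date value='(\d{4})(\d{2})(\d{2})'>", marked_up)
    if match is None:
        raise ValueError("no normalized <date> element found")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


print(read_as_text(SENTENCE))
print(read_as_data(SENTENCE).isoformat())
```

The second form of the markup, without `value`, can only be read as text: the data reading depends on the encoder having made the interpretation explicit.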
6Corpus linguistics: a case in point
- How do we find out what words mean?
- algorithm
- authority
- usage
- Corpus linguistics re-centres the last of these
7British National Corpus
Think of it as an archive!
- A snapshot of British English at the end of the 20th century
- 100 million words in approx. 4000 different text samples, both spoken (10%) and written (90%)
- synchronic (1990-4), sampled, general purpose corpus
8Who produced the BNC and why?
- a consortium of dictionary publishers and academic researchers
- OUP, Longman, Chambers
- OUCS, UCREL, BL R&D
- with funding from DTI/SERC under JFIT 1990-1994
- Lexicographers, NLP researchers,
- But not language teachers!
9Distinctive features of the BNC
- non-opportunistic design
- word class annotation
- contextual information
- uniform (SGML) markup
- general availability
10Written Domains
11Spoken domains
12Architecture
[Diagram: the corpus as a tree of elements. Labels shown: bnc, bncdoc, header, text, stext, with the counts 4124 and 863.]
13Sample written text
<text complete=Y decls='CN000 HN001 QN000 SN000'>
<div1 complete=Y org=SEQ>
<head type=MAIN>
<s n=001>
<w NP0>CAMRA <w NN1>FACT <w NN1>SHEET
<w AT0>No <w CRD>1 </head>
<head r=it type=SUB>
<s n=002>
<w AVQ>How <w NN1>beer <w VBZ>is
<w AJ0-VVN>brewed </head>
<p><s n=003>
<w NN1>Beer <w VVZ>seems <w DT0>such
<w AT0>a <w AJ0>simple <w NN1>drink <w CJT>that
<w PNP>we <w VVB>tend <w TO0>to <w VVI>take
<w PNP>it <w CJS-PRP>for <w VVD-VVN>granted
<c PUN>.
14Sample spoken text
<u who=PS04Y>
<s n=01296><w ITJ>Mm <pause> <w ITJ>yes <pause dur=7>
<w PNP>I <w VVD>told <w NP0>Paul <pause>
<w CJT>that <w PNP>he <w VM0>can <w VVI>bring
<w AT0>a <w NN1>lady <w AVP>up <pause> <w PRP>at
<w NN1>Christmas-time<c PUN>.</u>
<u who=PS04U>
<s n=01297><w VBZ>Is <w PNP>he <w XX0>not
<w VVG>going <w AV0>home <w AV0>then<c PUN>?</u>
<u who=PS04Y>
<s n=01298><w ITJ>No <pause dur=8> <w CJC>and
<w UNC>erm <pause dur=7> <w PNP>I<w VBB>'m
<w VVG>leaving <w AT0>a <w NN1>turkey <w PRP>in
<w AT0>the <w NN1>freezer<c PUN>
<s n=01299><w NP0>Paul <w VBZ>is <w AV0>quite
<w AJ0>good <w PRP>at <w NN1-VVG>cooking <pause>
<w AJ0>standard <w NN1>cooking<c PUN>.</u>
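The word class annotation in these samples can be read mechanically. The following Python sketch assumes the minimized SGML form in which each word is written `<w CODE>form` and each punctuation mark `<c CODE>form`; the regular expression and function name are illustrative and are not taken from any BNC software.

```python
import re
from collections import Counter

# One utterance from the spoken sample, in minimized SGML form.
UTTERANCE = ("<u who=PS04U><s n=01297><w VBZ>Is <w PNP>he <w XX0>not "
             "<w VVG>going <w AV0>home <w AV0>then<c PUN>?</u>")

# A <w> or <c> tag, its word-class code, then the text up to the next tag.
TOKEN = re.compile(r"<[wc] ([A-Z0-9-]+)>([^<]*)")


def tagged_tokens(sgml):
    """Return (word-class code, form) pairs in document order."""
    return [(code, form.strip()) for code, form in TOKEN.findall(sgml)]


pairs = tagged_tokens(UTTERANCE)
adverbs = [form for code, form in pairs if code == "AV0"]
counts = Counter(code for code, _form in pairs)

print(pairs)
print(adverbs)
```

Ambiguity codes such as `AJ0-VVN` are kept whole here; a fuller tool would need to decide whether to split them.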
15Who needs this?
- lexicographers
- NLP researchers
- teachers and learners of English
- social science, cultural studies...
... in short, anywhere there is a need for real
life data about the English language
16Software: the choice
- specialist high performance toolkits
- generic SGML browse and display engines
- plaintext concordancing tools
- or roll your own...
17How do you spell Maastricht?
20What words co-occur with it?
And finally
21Defining a subcorpus
- content-defined
- by searching the texts
- metadata defined
- by searching classification data in the headers
22Content-defined subcorpora
- may consist of
- all texts containing solutions to a previous query
- manually selected texts
- combinations of the above
- Subcorpora can be saved and indexed (offline)
- cf. IR result sets
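The two ways of defining a subcorpus can be sketched in Python as set operations over text identifiers. Everything in this example is invented for illustration (the `Text` class, the three tiny texts and their domain labels); it is not how SARA stores or indexes the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    ident: str
    domain: str    # classification data, as recorded in a header
    words: tuple   # the content of the text


CORPUS = [
    Text("T1", "leisure", ("sunny", "spells", "later")),
    Text("T2", "imaginative", ("a", "sunny", "smile")),
    Text("T3", "world affairs", ("the", "treaty", "was", "signed")),
]


def content_defined(corpus, word):
    """Subcorpus of texts containing at least one solution to a query."""
    return {text.ident for text in corpus if word in text.words}


def metadata_defined(corpus, domain):
    """Subcorpus of texts selected by their header classification."""
    return {text.ident for text in corpus if text.domain == domain}


sunny = content_defined(CORPUS, "sunny")
leisure = metadata_defined(CORPUS, "leisure")
combined = sunny & leisure   # combinations are ordinary set operations

print(sorted(sunny), sorted(leisure), sorted(combined))
```

Because a subcorpus is just a set of identifiers, it can be saved and reused like an IR result set.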
23Weather language
24sun. in weather forecasts
25sunny in weather forecasts
26sunny in all texts
27mostly in weather forecasts
28And incidentally, mostly in all?
29Metadata-defined subcorpora
- Selection features
- written domain (<catRef>), e.g. imaginative / arts / leisure / world affairs
- predefined criteria
- Descriptive features
- <activity>, e.g. lecture, interview, meeting
- <classCode>, e.g. written, written academic, written academic medicine
- post hoc criteria
30collocates assumptions? in wac
31XML databases: the claim
- In the old days, text and data were different
- Now, with XML, you can have your cake and eat it too
- text can become data
- data can become text
- Datacentric and docucentric worlds converge
32A case in point: descriptive bibliography
Perec, Georges. Life - a user's manual. Collins, 1988. Translated from the French La vie mode d'emploi by David Bellos. xviii+581 pp. 841.941 Literature - French - 20th century
- What's the author's name?
- What translators are there?
- Which 20th c. French works have more than 400 pages?
- List titles containing fewer than 6 words
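Once the entry is structured, the four questions become simple queries. The Python sketch below holds the Perec entry as a record; the field names and the split of the pagination into preliminary and main pages are assumptions made for this example, not a published schema.

```python
RECORDS = [
    {
        "author": {"surname": "Perec", "forename": "Georges"},
        "title": "Life - a user's manual",
        "publisher": "Collins",
        "year": 1988,
        "translator": "David Bellos",
        "original": {"language": "French", "title": "La vie mode d'emploi"},
        "pages": {"preliminary": 18, "main": 581},   # xviii+581 pp.
        "subject": ["Literature", "French", "20th century"],
    }
]


def author_name(record):
    author = record["author"]
    return f"{author['forename']} {author['surname']}"


def translators(records):
    return sorted({r["translator"] for r in records if r.get("translator")})


def long_french_20c(records, min_pages=400):
    """Titles of 20th century French works with more than min_pages pages."""
    return [r["title"] for r in records
            if {"French", "20th century"} <= set(r["subject"])
            and r["pages"]["main"] > min_pages]


def word_count(title):
    """Count tokens that contain a letter, so a bare dash is not a word."""
    return sum(1 for token in title.split() if any(c.isalpha() for c in token))


def short_titles(records, max_words=6):
    return [r["title"] for r in records if word_count(r["title"]) < max_words]


print(author_name(RECORDS[0]), translators(RECORDS))
print(long_french_20c(RECORDS), short_titles(RECORDS))
```

None of these answers can be read reliably from the prose form of the entry without first parsing it, which is the point of the slide.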
33 even more so, for mss
- long tradition of descriptive belle lettriste approach
- necessitated by cultural complexity and lack of standardization
- mss are unique objects, sometimes very valuable, sometimes the reverse
- and spread across many different locations
34for example
and so on, for two more pages
Lithuania, Vilnius, Lietuvos nacionaline Martyno
Mavydo biblioteka F101-26 Jurij Fiodorovic
Chrebtovicia (þðüè ôåäîðîâè õðåáòîâèà) emes
pardavimo ratas Vilnius, 1538 rugpjucio 23 (
kitoje lapo puseje prieraai lenku
kalba) Incipit ß þðüè ôåäîðîâè
õðåáòîâèà. Dvarionis Jurijus Fiodorovicius
Chrebtovicia parduoda Kareivikiu dvara su
tarnais (ivardyti) ir Ininkovskio dykyne savo
seseriai Hanai Martinovnai Chreptovicia
Martinovajai Podcentkovskajai Kareivikiu dvara
Jurijui Chrebtovicia padovanojo jo senele
idininkiene Hana Andriejevaja Aleksandrovicia .
Dvara Jurijus Chrebtovicia tvarke atskirai nuo
kitu tevonijos valdu, turedamas teise i dvara
valdyti savo nuoiura. Visos dvaro valdymo ir
palikimo teises perduodamos Jurijaus Chrebtovicia
seseriai. Dovanojimo ratu nustatyta, kad
deimtine nuo Kareivikiu dvaro turi buti mokama
Papikiu dvaro (karaliaus dvaro idininko,
Vilniaus laikytojo Ivano Andriejeviciaus valdos)
v. Mikalojaus banyciai Aatskiras lapas
pergamentas 38790 x 660 Textas raytas per visa
lapa, pusustavis, pereinantis i greitrati.
Tekstas raytas vieno ratininko aikiu
greitraciu, tas pats ratininkas parae F101-28
35The MASTER project
- Partners
- Centre for Technology in the Arts, De Montfort University
- Humanities Computing Unit, Oxford
- Arnamagnæan Institute, Copenhagen
- Institut de recherche et d'histoire des textes, Paris
- Royal Dutch Library, The Hague
- Czech National Library, Prague
- Funded under EU Libraries Programme Sept 1998 - July 2001
36The MASTER project
- Partners
- Centre for Technology in the Arts, De Montfort University
- Humanities Computing Unit, Oxford
- Arnamagnæan Institute, Copenhagen
- Institut de recherche et d'histoire des textes, Paris
- Royal Dutch Library, The Hague
- Národní knihovna ČR, Praha
- Funded under EU Libraries Programme Sept 1998 - July 2001
37The MASTER plan
- Develop a European standard for manuscript description
- compatible with other relevant standards
- Implement demonstrator systems allowing
- distributed data capture
- integrated searchability
- Disseminate and concertate
38Consequences
- emphasis on distributed data, from many different partners
- emphasis on structuring of text, some pre-existing, some not
- short time scale
- support issues
39The <msDescription> element
- What is a manuscript description?
- a text (MLE record)
- a bit of a text (MLE crystal)
- a description of a text? (MLE header field)
- The answer depends on whether you're making
- a finding aid
- a catalogue raisonné
- a digital surrogate
- http://www.hcu.ox.ac.uk/TEI/Master/Reference/
40Collections of msDescriptions
- may be output from legacy systems
- may be created de novo
- traditional DBMS functionality
- concurrent, multiple update
- referential integrity
- resilience
- simple file system is inadequate
41Referential integrity
- authority files for
- language codes
- bibliographic sources
- persons referenced
- classification scheme
- etc.
- essential for distributed collection, desirable
for single large system
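A minimal Python sketch of this kind of check: every coded reference in a record must point at an entry in the relevant authority file. The attribute names (`langKey`, `key`), the codes and the sample record are hypothetical stand-ins chosen for the example and should not be read as the MASTER encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical authority files, reduced to sets of permitted values.
AUTHORITY = {
    "langKey": {"LAT", "LIT", "RUS"},   # language codes
    "key": {"P001", "P002"},            # persons referenced
}

# Hypothetical record with one reference that has no authority entry.
RECORD = """<msDescription>
  <textLang langKey="LAT">Latin</textLang>
  <name type="person" key="P001">First Person</name>
  <name type="person" key="P999">Unknown Person</name>
</msDescription>"""


def dangling_references(xml_text, authority):
    """List (element, attribute, value) for values absent from the authority."""
    problems = []
    for element in ET.fromstring(xml_text).iter():
        for attribute, allowed in authority.items():
            value = element.get(attribute)
            if value is not None and value not in allowed:
                problems.append((element.tag, attribute, value))
    return problems


print(dangling_references(RECORD, AUTHORITY))
```

In a distributed collection the authority sets would be shared between partners, which is why a plain file system is not enough.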
42Design goals for Phelix
- Open source
- Document repository functionality
- No document editing or updating
- Multi-document searching using XML structure
- Adaptable and customizable
43Implementation issues
- Re-use existing tools where possible
- Functionality above performance
- Assumes networked academic environment
- Single DTD system
44Why use a RDBMS?
- mature, stable, portable, scalable
- widespread, easy to integrate
- foreign key access (usually) optimized
- but assembling XML fragments is inherently slow (see http://www.cs.wisc.edu/niagara/papers/vldb00XML.pdf)
- OODBMS cost serious money and cannot be freely relicensed
45What is modelled in the RDBMS?
- Not the semantics of the data but its XML representation
- XML serializes a tree structure
- Phelix models the tree, not its meaning
[Diagram labels: XML fragments, RDBMS]
46The Phelix architecture
[Diagram: components of the architecture: rdbms, scripting language, validating parser, xslt engine, user agent, server/s]
47Current Phelix implementation
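[Diagram: the architecture with its current components: mySQL (rdbms), PHP (scripting language), expat/rxp (parser), sablotron (xslt engine), IE5 or Opera (user agent), exchanging XML and HTML, hosted at janus.oucs.ox.ac.uk/master]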
48Current functionality
- You can
- upload a valid XML msDescription, or an archive of them
- validate (some) content against external authority files
- publish them to other partners
- search all published msDescriptions
- view and download selected msDescriptions using your own stylesheet
- save and review query results
- You cannot validate or modify existing records
49Under development
- Initial spec <msDescription>s
- Additional support for bibliography, onomastic authority files
- Optional stylesheet interface
- Generalized interface
- Mapping Asia project
- Free text search component?
50Storage of XML documents
- The parser decomposes the document tree into atomic nodes
- element
- pcdata fragments
- attribute-value pairs
- Each node is stored as a row in the DBMS
- Relationships between nodes are represented by pointers (aka foreign keys)
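The storage scheme can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Phelix code: it uses SQLite and `xml.etree.ElementTree` in place of mySQL and expat/rxp, and the single `node` table with its column names is invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

SCHEMA = """
CREATE TABLE node (
    id     INTEGER PRIMARY KEY,
    parent INTEGER REFERENCES node(id),  -- pointer to the containing element
    kind   TEXT NOT NULL,                -- 'element', 'pcdata' or 'attribute'
    name   TEXT,
    value  TEXT
);
"""

DOCUMENT = "<p>It's <date value='20010713'>Friday 13 July</date></p>"


def add(conn, parent, kind, name=None, value=None):
    cursor = conn.execute(
        "INSERT INTO node (parent, kind, name, value) VALUES (?, ?, ?, ?)",
        (parent, kind, name, value))
    return cursor.lastrowid


def store(conn, element, parent=None):
    """Store one element and everything below it, one row per node."""
    me = add(conn, parent, "element", name=element.tag)
    for name, value in element.attrib.items():
        add(conn, me, "attribute", name=name, value=value)
    if element.text and element.text.strip():
        add(conn, me, "pcdata", value=element.text)
    for child in element:
        store(conn, child, me)
        if child.tail and child.tail.strip():
            add(conn, me, "pcdata", value=child.tail)   # text after the child
    return me


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
root_id = store(conn, ET.fromstring(DOCUMENT))
for row in conn.execute("SELECT id, parent, kind, name, value FROM node ORDER BY id"):
    print(row)
```

Rows are inserted in document order, so sorting by `id` is enough to reassemble a fragment in this sketch; a real system would need an explicit ordering column once updates are allowed.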
51XML Queries
- XML Query or XPath?
- choice was unclear at design time
- QueryExpression and Query objects
- encapsulated in PHP layer
- access to ancestors, parents, attributes, content
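Access to ancestors and content over a node table amounts to following parent pointers. The Python sketch below shows the idea with SQLite; it does not reproduce the QueryExpression and Query objects, whose interfaces are not shown here, and the table layout, element nesting and sample values are all assumptions made for the example.

```python
import sqlite3

# id, parent, kind, name, value (a hand-built node table for illustration)
ROWS = [
    (1, None, "element", "msDescription", None),
    (2, 1, "element", "msContents", None),
    (3, 2, "element", "title", None),
    (4, 3, "pcdata", None, "First title"),
    (5, 1, "element", "history", None),
    (6, 5, "element", "title", None),
    (7, 6, "pcdata", None, "Second title"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER, "
             "kind TEXT, name TEXT, value TEXT)")
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", ROWS)


def ancestors(conn, node_id):
    """Names of the enclosing elements, nearest first."""
    names = []
    (parent,) = conn.execute(
        "SELECT parent FROM node WHERE id = ?", (node_id,)).fetchone()
    while parent is not None:
        name, parent = conn.execute(
            "SELECT name, parent FROM node WHERE id = ?", (parent,)).fetchone()
        names.append(name)
    return names


def content(conn, node_id):
    """Concatenated pcdata directly inside an element."""
    rows = conn.execute(
        "SELECT value FROM node WHERE parent = ? AND kind = 'pcdata' ORDER BY id",
        (node_id,)).fetchall()
    return "".join(value for (value,) in rows)


def find(conn, name, within):
    """Ids of elements called name that have an ancestor called within."""
    candidates = conn.execute(
        "SELECT id FROM node WHERE kind = 'element' AND name = ? ORDER BY id",
        (name,)).fetchall()
    return [node_id for (node_id,) in candidates
            if within in ancestors(conn, node_id)]


for node_id in find(conn, "title", "history"):
    print(node_id, ancestors(conn, node_id), content(conn, node_id))
```

Each ancestor step costs one query, which illustrates why assembling structure from rows is slow compared with a foreign key lookup.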
52Interfaces
- There's nothing as powerful as a good metaphor
- The Walkthrough
- The Basket
- Picking and Choosing
- Designing the user interface is harder than
designing the engine
53Other interfaces
- Sorry, no full DOM support (yet)
- Nodes are stored in database
- Interface customization layer
- form design
- user supplied stylesheets
- much remains to be done
59Messages
- Text is data structured in ways we don't immediately recognise
- Textual enrichment and text markup are two sides of the same coin
- Text retrieval and database management tools are complementary, not competitive