The Simplest Query Language That Could Possibly Work

1
The Simplest Query Language That Could Possibly Work
  • Richard A. O'Keefe
  • Andrew Trotman
  • Department of Computer Science
  • University of Otago
  • Dunedin, New Zealand

2
Background
3
Old and New
  • Old
  • INEX '02 and INEX '03 use different languages
  • Currently
  • Extensions to XPath 1.0
  • Database language
  • Hard and complex
  • New
  • Proposal for INEX '04 and future INEXes
  • Designed for INEX
  • Information retrieval query language
  • Simple, extensible, clearly defined

4
Usage Problems
  • 19 CAS topics needed title corrections
  • That's 63%
  • 3 CO topics needed title corrections
  • That's 8%
  • Corrections took 12 rounds
  • That's 40 days of corrections
  • Proved hard to implement
  • 5 institutions submitted correct topics
  • topics were submitted with answers
  • yes, we submitted bad ones too

5
XPath issues
  • Overkill
  • Tag instancing never used, e.g. p[2]
  • Child (/) not needed; descendant (//) was better
  • Over design
  • <fn>Joe</fn><ln>Bloggs</ln>
  • Gives the string "JoeBloggs", not "Joe Bloggs"
  • <yr>2000,</yr>
  • Is not a number
  • Operators must be followed faithfully
  • AND, OR, <, >, etc.
  • Could progress in directions that are not for us
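
The two over-design points can be reproduced with a minimal Python sketch. The wrapper element <au> and the helper xpath_number are assumptions made for illustration; the slide shows only the inner tags, and the helper merely approximates XPath 1.0 number() conversion.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element <au>; the slide shows only the two children.
author = ET.fromstring("<au><fn>Joe</fn><ln>Bloggs</ln></au>")

# The XPath 1.0 string-value of an element is the concatenation of its
# descendant text nodes, with nothing inserted between them.
string_value = "".join(author.itertext())

def xpath_number(text):
    """Approximate XPath 1.0 number(): NaN when the string is not a number."""
    try:
        return float(text.strip())
    except ValueError:
        return float("nan")

year = xpath_number(ET.fromstring("<yr>2000,</yr>").text)

print(string_value)   # JoeBloggs
print(year != year)   # True: NaN is the only value not equal to itself
```

A comparison such as yr < 2000 against that NaN is false for every document, which is presumably why the trailing comma matters.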

6
DB and IR are different
  • Languages based on a database view
  • SQL, OQL, etc.
  • DB and IR are different

7
So Now What?
  • What we need
  • Tag equivalence
  • Precise syntax for imprecise queries
  • Not tied to one approach
  • Not tied to DB, IR, Probabilistic, etc
  • Extensible for future INEXes

8
Massaging the XML
9
Architectural Forms
  • Idea taken from HyTime
  • Extend the DTD
  • <!ENTITY % old-dtd PUBLIC "..." "oldarticle.dtd">
  • %old-dtd;
  • <!ATTLIST ...>
  • ...
  • <!ATTLIST ...>
  • Add three new attributes
  • INEXname - what to do with tags
  • INEXatts - what to do with attributes
  • INEXscan - how to index tags

10
  • INEXname
  • <!ATTLIST ip1 INEXname NMTOKEN #FIXED "p">
  • Process all <ip1> tags as if they were <p> tags
  • This gives us the same power as tag equivalences, but
    definable in the DTD before the documents are
    indexed
  • <p> restrictions will not return <lp> answers
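
A rough sketch of how an indexer might apply INEXname at index time. The table INEX_NAME and both function names are hypothetical; a real system would read the fixed attribute values from the DTD.

```python
# Hypothetical table, as if read from the #FIXED INEXname attributes in the DTD.
INEX_NAME = {"ip1": "p"}

def canonical_tag(tag):
    """Return the name a tag is indexed under (itself when no INEXname is declared)."""
    return INEX_NAME.get(tag, tag)

def canonical_path(path):
    """Rewrite every step of a list of tag names before indexing."""
    return [canonical_tag(step) for step in path]

print(canonical_path(["art", "sec", "ip1"]))
# ['art', 'sec', 'p']
```

Because the renaming happens before indexing, the query processor never needs to know that ip1 existed.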

11
  • INEXatts
  • <!ATTLIST foo INEXatts NMTOKEN #FIXED "yr year">
  • Process all yr attributes of foo as if they were
    year attributes of foo
  • <!ATTLIST au INEXatts NMTOKEN #FIXED "sequence -">
  • Do not index sequence attributes of au tags
  • Tag equivalences now apply to attributes too
  • Some attributes are not indexed (so can't be
    found as answers)
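
The INEXatts values above can be read as old-name/new-name pairs, with "-" meaning "do not index". The sketch below assumes that reading and assumes a value may hold several whitespace-separated pairs; the function names and the id attribute are illustrative only.

```python
def parse_inexatts(value):
    """Turn an INEXatts value such as "yr year" into {old_name: new_name}.

    A new name of "-" is mapped to None, meaning "do not index".
    Assumes the value is a whitespace-separated list of old/new pairs.
    """
    tokens = value.split()
    if len(tokens) % 2:
        raise ValueError("INEXatts needs old/new pairs: %r" % value)
    pairs = zip(tokens[0::2], tokens[1::2])
    return {old: (None if new == "-" else new) for old, new in pairs}

def index_attributes(attrs, rules):
    """Apply the rules to one element's attributes, dropping the unindexed ones."""
    indexed = {}
    for name, val in attrs.items():
        new_name = rules.get(name, name)   # unlisted attributes keep their name
        if new_name is not None:
            indexed[new_name] = val
    return indexed

print(index_attributes({"yr": "1999"}, parse_inexatts("yr year")))
# {'year': '1999'}
print(index_attributes({"sequence": "first", "id": "a1"}, parse_inexatts("sequence -")))
# {'id': 'a1'}
```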

12
  • INEXscan
  • <!ATTLIST p INEXscan NMTOKEN #FIXED "all">
  • Index the contents of a p tag as belonging to
    that p tag
  • Index as expected
  • <!ATTLIST scp INEXscan NMTOKEN #FIXED "content">
  • Index the content of scp tags as if the tag were
    not present
  • Removes the <st>V<scp>oice</scp></st> problem
  • <!ATTLIST ref INEXscan NMTOKEN #FIXED "nothing">
  • Do not index this tag or the contents of this
    tag
  • Removes the <ref>3</ref> problem
  • <!ATTLIST art INEXscan NMTOKEN #FIXED "element">
  • Index this tag but not its own content; do index
    its children
  • Included as the remaining case
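
One way the four INEXscan modes might drive an indexer, sketched in Python. The function scan, the (tag, text) postings format, and the choice that an "all" element passes its text up to its parent are all assumptions; the slides define only the four mode names.

```python
import xml.etree.ElementTree as ET

def scan(elem, modes, postings):
    """Index elem according to its INEXscan mode; return the text it passes upward."""
    mode = modes.get(elem.tag, "all")
    if mode == "nothing":
        return ""                        # neither the tag nor its contents
    parts = [elem.text or ""]
    for child in elem:
        parts.append(scan(child, modes, postings))
        parts.append(child.tail or "")   # text after the child belongs to elem
    text = "".join(parts)
    if mode == "content":
        return text                      # tag is transparent: text joins the parent's
    if mode == "element":
        postings.append((elem.tag, ""))  # the tag is recorded, its content is not
        return ""
    postings.append((elem.tag, " ".join(text.split())))   # mode "all"
    return text

modes = {"art": "element", "scp": "content", "ref": "nothing"}
doc = ET.fromstring(
    "<art><st>V<scp>oice</scp></st><p>See <ref>3</ref> above</p></art>")
postings = []
scan(doc, modes, postings)
print(postings)
# [('st', 'Voice'), ('p', 'See above'), ('art', '')]
```

With scp treated as "content", the small-caps markup no longer splits "Voice" into two tokens, and with ref treated as "nothing", the reference number 3 never reaches the index.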

13
Searching the XML
14
Query Syntax
  • The complete grammar is given in the paper
  • A, e.g. (electron)
  • answer about A
  • P A, e.g. sec(electron)
  • return P tags about A
  • P A Q B, e.g. art(density)sec(electron)
  • Where P is about A, return P/Q tags about B
  • And so on
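
The overall shape of a query, a sequence of path-plus-terms steps, can be split apart with a few lines of Python. This is only a sketch of that outer shape: it ignores filters and operators, it is not the grammar from the paper, and the names STEP and parse_query are made up for the example.

```python
import re

# One step of the simplified shape: an optional path followed by "(terms)".
STEP = re.compile(r"\s*([^()\s]*)\s*\(([^()]*)\)")

def parse_query(query):
    """Split "P(A)Q(B)..." into [(path, terms), ...]; path is "" when absent."""
    query = query.strip()
    steps, pos = [], 0
    while pos < len(query):
        match = STEP.match(query, pos)
        if match is None:
            raise ValueError("cannot parse query at offset %d" % pos)
        steps.append((match.group(1), match.group(2).strip()))
        pos = match.end()
    return steps

print(parse_query("(electron)"))
# [('', 'electron')]
print(parse_query("art(density)sec(electron)"))
# [('art', 'density'), ('sec', 'electron')]
```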

15
P and Q syntax
  • Tags are / (descendant) separated
  • art/sec
  • means sec descendant of art tag
  • There is no child; it's not needed
  • Attributes can't have descendants so
  • art/au@sequence
  • means the sequence attribute of an au tag
    anywhere under an art tag
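
Since every / means "descendant", matching a path against an element reduces to an ordered-subsequence test over its chain of ancestors. A minimal sketch, assuming the first step need not be the document root and leaving attribute steps aside; the function name and the bdy tag are hypothetical.

```python
def matches(pattern, ancestry):
    """True if the element whose root-to-self tag list is `ancestry` matches `pattern`.

    Every "/" means "descendant", so the pattern steps must occur in order
    within the ancestry, and the last step must be the element itself.
    """
    steps = pattern.split("/")
    if not ancestry or steps[-1] != ancestry[-1]:
        return False
    remaining = iter(ancestry[:-1])
    # `step in remaining` consumes the iterator up to the first match,
    # which is the usual idiom for an ordered-subsequence test.
    return all(step in remaining for step in steps[:-1])

print(matches("art/sec", ["art", "bdy", "sec"]))       # True
print(matches("art/sec", ["art", "bdy", "sec", "p"]))  # False
```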

16
P and Q Filtering
  • art[yr 1990..2000]
  • Restrict art to where the yr tag is between 1990
    and 2000 (whatever the IR system chooses that to
    mean)
  • Also
  • art[yr 1990]
  • art[yr ..2000]
  • art[yr 1990,1993..]
  • The .. syntax is used because it is NOT strict arithmetic;
    it means whatever the IR system chooses
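
The range values themselves are easy to parse. The sketch below takes the strictest possible reading (a comma means "or", a missing bound means open-ended, values are integers), which is only one of the interpretations an IR system is free to choose; both function names are hypothetical.

```python
def parse_ranges(spec):
    """Parse "1990,1993.." into [(1990, 1990), (1993, None)]; None means open-ended."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if ".." in part:
            low, high = part.split("..", 1)
            ranges.append((int(low) if low else None,
                           int(high) if high else None))
        else:
            ranges.append((int(part), int(part)))
    return ranges

def in_ranges(value, ranges):
    """Strict reading: the value lies inside at least one range."""
    return any((low is None or low <= value) and (high is None or value <= high)
               for low, high in ranges)

print(parse_ranges("1990,1993.."))                   # [(1990, 1990), (1993, None)]
print(in_ranges(1991, parse_ranges("1990,1993..")))  # False
print(in_ranges(1995, parse_ranges("1990,1993..")))  # True
```

A ranking system could just as well replace in_ranges with a score that decays outside the interval.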

17
A and B syntax
  • Search terms can be mandatory (+), exclusionary
    (-) or optional
  • fox +in -sox
  • Means must include in, must not include sox
    and can include fox. Again, these are hints to
    the IR system.
  • Phrases are quoted (single or double)
  • "knox in box"
  • 'knox in box'
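
A small tokenizer shows how little machinery the term syntax needs. This sketch assumes + marks mandatory terms, - marks excluded terms, and phrases use straight single or double quotes; the names TERM and parse_terms are made up for the example.

```python
import re

# An optional + or - prefix, then a double-quoted phrase, a single-quoted
# phrase, or a bare word.
TERM = re.compile(r"""([+-]?)(?:"([^"]*)"|'([^']*)'|(\S+))""")

def parse_terms(text):
    """Sort search terms into mandatory (+), excluded (-) and optional."""
    result = {"mandatory": [], "excluded": [], "optional": []}
    kinds = {"+": "mandatory", "-": "excluded", "": "optional"}
    for sign, double_quoted, single_quoted, word in TERM.findall(text):
        result[kinds[sign]].append(double_quoted or single_quoted or word)
    return result

print(parse_terms("fox +in -sox"))
# {'mandatory': ['in'], 'excluded': ['sox'], 'optional': ['fox']}
```

How strongly the three classes influence ranking is left to the IR system, in keeping with the "hints" wording above.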

18
Operators
  • These can then be joined with operators
  • a b
  • a or b or both should hold true
  • a b
  • a and b should both hold true
  • a
  • a should not hold true
  • ab
  • a should appear in b
  • Operators are also not precise (hence "should
    hold true")
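
One possible reading of "should hold true" is that operators combine relevance scores instead of filtering documents. The sketch below uses fuzzy-logic min, max and complement over scores in [0, 1]; that choice, the function names, and the example scores are assumptions, not something the slides prescribe.

```python
def soft_or(a, b):
    """a or b or both should hold: take the stronger of the two scores."""
    return max(a, b)

def soft_and(a, b):
    """a and b should both hold: limited by the weaker of the two scores."""
    return min(a, b)

def soft_not(a):
    """a should not hold: a high score for a becomes a low one."""
    return 1.0 - a

# Hypothetical per-document scores for two sub-queries a and b.
scores = {"doc1": (0.75, 0.25), "doc2": (0.5, 0.5)}
ranked = sorted(scores, key=lambda d: soft_and(*scores[d]), reverse=True)
print(ranked)   # ['doc2', 'doc1']
```

No document is discarded: doc1 fails the conjunction only partially and is still returned, just ranked lower.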

19
Examples
  • Topic 61
  • //article[about(.,'clustering distributed') and
    about(.//sec,'java')]
  • article(clustering distributed javasec)
  • Topic 64
  • //article[about(./, 'hollerith')]//sec[about(./,
    'DEHOMAG')]
  • article(hollerith)sec(DEHOMAG)
  • Topic 66
  • /article[./fm//yr < '2000']//sec[about(.,'"search
    engines"')]
  • article[fm/yr ..1999]sec("search engines")

20
Examples
  • Topic 76
  • //article[(./fm//yr = '2000' OR ./fm//yr = '1999')
    AND about(., '"intelligent transportation
    system"')]//sec[about(., 'automation vehicle')]
  • article[fm/yr 1999..2000]("intelligent
    transportation system")sec(automation vehicle)
  • Topic 91
  • Internet traffic
  • (Internet traffic)

21
Questions?