Information Retrieval and XML Data - PowerPoint PPT Presentation

Provided by: hichamge

1
Information Retrieval and XML Data
  • Hicham Elmongui

2
Databases, IR, XML
  • IR has studied the problem of searching
    collections of text documents.
  • DBMS traditionally dealt with simple tabular
    data. ORDBMS (Object-Relational DBMS) were
    designed to support complex data types such as
    videos.
  • XML sits in the middle between them

3
Searching the Web
  • Search on the Web differs from IR
  • Scalability (billions vs. thousands)
  • Documents carefully prepared in IR
  • Existence of metadata (data about data) for XML
  • Documents on the Web can be images, video clips,

4
DBMS vs. IR systems
  • Searches vs. Queries
  • IR: a specialized class of queries called searches
  • Searches are specified by search terms
  • Underlying data usually unstructured text
  • Search result may be ranked (how well?)
  • DB
  • Traditionally unranked set of results
  • Relational queries are precise (Yes/No)

5
DBMS vs. IR systems
  • Updates and transactions
  • IR
  • optimized for read-mostly workload
  • No notion of transaction
  • New documents are added from time to time, index
    structures are periodically rebuilt/updated →
    documents might be highly relevant to a search but
    not yet retrievable
  • DB
  • Handle a wide range of workloads, including
    update-intensive transaction processing workloads

6
Indexing for Text Search
  • Text database: a collection of text documents
  • Important class of queries: keyword searches
  • Boolean queries: query terms connected with AND,
    OR and NOT. The result is the list of documents that
    satisfy the Boolean expression.
  • Ranked queries: the result is a list of documents
    ranked by their relevance.

7
Inverted files
  • For each possible query term, store an ordered
    list (the inverted list) of document identifiers
    that contain the term.
  • Query evaluation: intersection or union of
    inverted lists.

8
Inverted files: query example
  • Example: Agent AND James
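
A minimal sketch of how the Agent AND James example could be evaluated with inverted lists. The four toy documents and the function names are assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a, b):
    """AND: merge two sorted inverted lists, keeping ids found in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: ids found in either inverted list, in sorted order."""
    return sorted(set(a) | set(b))


# Hypothetical toy collection (document id -> text).
docs = {
    1: "agent James Bond",
    2: "agent mobile computer",
    3: "James Madison movie",
    4: "James Bond movie",
}
index = build_inverted_index(docs)
and_result = intersect(index.get("agent", []), index.get("james", []))
or_result = union(index.get("agent", []), index.get("james", []))
print(and_result, or_result)
```

Keeping each inverted list sorted is what allows the AND to be computed with a single linear merge rather than a nested scan.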

9
Signature files
  • Index structure (the signature file) with one
    data entry for each document
  • Hash function hashes words to bit-vector.
  • Data entry for a document (the signature of the
    document) is the OR of all hashed words.
  • Signature S1 matches signature S2 if S2 & S1 = S2
    (every bit set in S2 is also set in S1)
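
A small sketch of the signature construction and the match test. The signature width of 16 bits and the use of SHA-256 as the word hash are assumptions chosen for illustration; the slides do not name a specific hash function.

```python
import hashlib

SIGNATURE_BITS = 16  # assumed width; real systems use wider signatures


def word_signature(word):
    """Hash a word to a bit-vector with exactly one bit set (an assumption)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    bit = int.from_bytes(digest[:4], "big") % SIGNATURE_BITS
    return 1 << bit


def document_signature(text):
    """The document signature is the bitwise OR of all its word signatures."""
    sig = 0
    for word in text.split():
        sig |= word_signature(word)
    return sig


def matches(s1, s2):
    """S1 matches S2 if every bit set in S2 is also set in S1."""
    return s2 & s1 == s2


doc_sig = document_signature("agent James Bond")
print(matches(doc_sig, word_signature("agent")))
```

A document signature always matches the signature of any word the document contains. The converse does not hold, because unrelated words can hash to the same bit.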

10
Signature files: query evaluation
  • Boolean query consisting of conjunction of words
  • Generate query signature Sq
  • Scan signatures of all documents.
  • If signature S matches Sq, then retrieve document
    and check for false positives.
  • Boolean query consisting of disjunction of k
    words
  • Generate k query signatures S1, ..., Sk
  • Scan signature file to find documents whose
    signature matches any of S1, ..., Sk, then check
    for false positives.
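
The two evaluation strategies can be sketched as follows. The deliberately weak 8-bit hash and the toy documents are assumptions for illustration; the final check against the document text is what removes false positives.

```python
SIGNATURE_BITS = 8  # assumed, deliberately small so that collisions can occur


def word_signature(word):
    """Toy hash: one bit chosen from the sum of the character codes."""
    return 1 << (sum(ord(c) for c in word.lower()) % SIGNATURE_BITS)


def signature(words):
    """OR together the signatures of all given words."""
    sig = 0
    for word in words:
        sig |= word_signature(word)
    return sig


def matches(s1, s2):
    """S1 matches S2 if every bit set in S2 is also set in S1."""
    return s2 & s1 == s2


def build_signature_file(docs):
    """One data entry (signature) per document."""
    return {doc_id: signature(text.split()) for doc_id, text in docs.items()}


def conjunctive_search(docs, sig_file, terms):
    """AND query: one query signature Sq, scan, then verify candidates."""
    sq = signature(terms)
    hits = []
    for doc_id, doc_sig in sig_file.items():
        if matches(doc_sig, sq):
            words = docs[doc_id].lower().split()
            if all(t.lower() in words for t in terms):  # drop false positives
                hits.append(doc_id)
    return hits


def disjunctive_search(docs, sig_file, terms):
    """OR query: k query signatures, a document qualifies if it matches any."""
    query_sigs = [word_signature(t) for t in terms]
    hits = []
    for doc_id, doc_sig in sig_file.items():
        if any(matches(doc_sig, s) for s in query_sigs):
            words = docs[doc_id].lower().split()
            if any(t.lower() in words for t in terms):  # drop false positives
                hits.append(doc_id)
    return hits


# Hypothetical toy collection.
docs = {
    1: "agent James Bond",
    2: "agent mobile computer",
    3: "James Madison movie",
    4: "James Bond movie",
}
sig_file = build_signature_file(docs)
and_hits = conjunctive_search(docs, sig_file, ["agent", "James"])
or_hits = disjunctive_search(docs, sig_file, ["Bond", "mobile"])
print(and_hits, or_hits)
```

Note that every query scans the whole signature file, which is the main cost difference compared with inverted lists.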

11
Signature files: example

12
Web Search Engines
  • Contend with an extremely large number of documents
    → have to be highly scalable
  • Documents are linked to each other → link
    information is valuable in finding relevant pages
  • Different from IR, but rely on some form of
    inverted indexes as the basic indexing mechanism

13
Search engine architecture
  • Web search engines crawl the web to collect
    documents to index
  • Crawling is done with a graph traversal algorithm.
    However, there are practical issues:
  • Details of connecting to millions of sites
  • Minimizing network latencies
  • Parallelizing the crawling
  • Dealing with timeouts
  • The task of indexing is parallelizable
  • Index is partitioned across several machines (it
    can also be replicated)
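
As a rough illustration of crawling as graph traversal, the sketch below runs a breadth-first traversal over an in-memory link graph. The `WEB` dictionary and the `fetch_links` callback are hypothetical stand-ins for real HTTP fetching, and the sketch ignores latency, timeouts and parallelism.

```python
from collections import deque


def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first traversal of the link graph, starting from the seed page."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)  # a real crawler would hand the page to the indexer
        for link in fetch_links(url):
            if link not in seen:  # never enqueue the same page twice
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory "web": page -> outgoing links.
WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}
order = crawl("a", lambda url: WEB.get(url, []))
print(order)
```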

14
Managing text in a DBMS
  • A special data type called FullText
    (SQL/MM standard)
  • Methods of FullText can be used in the WHERE
    clause to retrieve rows containing text objects
    that match an IR-style search criterion
  • The relevance rank of a FullText can be
    explicitly retrieved using the RANK method and
    this can be used to sort results by relevance

15
Managing text in a DBMS
  • General approach; performance is inferior to that of
    specialized IR systems
  • This model does not adequately reflect documents
    with additional metadata → another column? → RANK
    only accesses FullText
  • What about updates? Requiring a system to update
    the indexes before the updating transaction commits
    can impose a severe performance penalty

16
Loosely coupled inverted index
  • Current relational DBMSs that support text fields
    have a separate text-search engine that is
    loosely coupled to the DBMS
  • Engine periodically updates the indexes, but
    provides no transactional guarantees
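
The consequence of loose coupling can be sketched with a toy class in which rows are committed immediately but the text index only changes on a periodic refresh. The class and method names are hypothetical and do not correspond to any particular DBMS.

```python
class LooselyCoupledTextIndex:
    """Toy model: a committed table plus a text index refreshed separately."""

    def __init__(self):
        self.rows = {}   # committed rows: row id -> text
        self.index = {}  # term -> set of row ids, as of the last refresh

    def insert(self, row_id, text):
        """Commit a row. The text index is NOT updated here."""
        self.rows[row_id] = text

    def refresh(self):
        """Periodic rebuild, run outside of any transaction."""
        index = {}
        for row_id, text in self.rows.items():
            for term in text.lower().split():
                index.setdefault(term, set()).add(row_id)
        self.index = index

    def search(self, term):
        """Keyword search against the (possibly stale) index."""
        return sorted(self.index.get(term.lower(), set()))


store = LooselyCoupledTextIndex()
store.insert(1, "XML query processing")
store.refresh()
store.insert(2, "XML storage")        # committed, but not yet indexed
before_refresh = store.search("xml")
store.refresh()
after_refresh = store.search("xml")
print(before_refresh, after_refresh)
```

Between the second insert and the next refresh, a committed row is invisible to text search, which is the missing transactional guarantee.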

17
XQuery: querying XML data
  • A standard proposed by the World Wide Web
    Consortium (W3C)
  • In parallel, standards committees developing the
    SQL standards have been working on SQL/XML
    (http://sqlx.org/)

18
XQuery: path expressions
  • FOR $l IN doc(www.ourbookstore.com/books.xml)//AUTHOR/LASTNAME
  • RETURN <RESULT> $l </RESULT>
  • <RESULT><LASTNAME>Feynman</LASTNAME></RESULT>
  • <RESULT><LASTNAME>Narayan</LASTNAME></RESULT>
  • FOR is analogous to FROM
  • RETURN is analogous to SELECT
  • //: nested anywhere in the document
  • /: nested immediately
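
For readers without an XQuery engine, the same path expression can be approximated with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The sample document below is hypothetical (the titles in particular are placeholders), so the number of results may differ from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element names follow the queries on the slides.
BOOKS_XML = """
<BOOKLIST>
  <BOOK genre="Science" format="Hardcover">
    <AUTHOR><FIRSTNAME>Richard</FIRSTNAME><LASTNAME>Feynman</LASTNAME></AUTHOR>
    <TITLE>Title A</TITLE><PUBLISHED>1980</PUBLISHED>
  </BOOK>
  <BOOK genre="Fiction">
    <AUTHOR><FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME></AUTHOR>
    <TITLE>Title B</TITLE><PUBLISHED>1981</PUBLISHED>
  </BOOK>
  <BOOK genre="Fiction">
    <AUTHOR><FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME></AUTHOR>
    <TITLE>Title C</TITLE><PUBLISHED>1980</PUBLISHED>
  </BOOK>
</BOOKLIST>
"""

root = ET.fromstring(BOOKS_XML.strip())

# ".//" plays the role of "//": AUTHOR may be nested anywhere below the root.
# "AUTHOR/LASTNAME" requires LASTNAME to be an immediate child of AUTHOR.
last_names = [e.text for e in root.findall(".//AUTHOR/LASTNAME")]

# Wrap each binding in a RESULT element, like the RETURN clause.
results = [f"<RESULT><LASTNAME>{name}</LASTNAME></RESULT>" for name in last_names]
for line in results:
    print(line)
```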

19
XQuery: FLWR expressions
  • FOR, LET: bind variables to values through path
    expressions
  • WHERE
  • RETURN
  • LET $l IN doc(www.ourbookstore.com/books.xml)//AUTHOR/LASTNAME
  • RETURN <RESULT> $l </RESULT>
  • <RESULT>
  • <LASTNAME>Feynman</LASTNAME>
  • <LASTNAME>Narayan</LASTNAME>
  • </RESULT>

20
XQuery: FLWR expressions
  • FOR $b IN doc(www.ourbookstore.com/books.xml)/BOOKLIST/BOOK
  • WHERE $b/PUBLISHED='1980'
  • RETURN
  • <RESULT>
  • $b/AUTHOR/FIRSTNAME,
  • $b/AUTHOR/LASTNAME
  • </RESULT>
  • <RESULT>
  • <FIRSTNAME>Richard</FIRSTNAME>
  • <LASTNAME>Feynman</LASTNAME>
  • </RESULT>
  • <RESULT>
  • <FIRSTNAME>R.K.</FIRSTNAME>
  • <LASTNAME>Narayan</LASTNAME>
  • </RESULT>
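
The FOR / WHERE / RETURN structure maps naturally onto a loop, a filter and an element constructor. The sketch below mimics the query in Python over a hypothetical sample document whose titles are placeholders; it is an analogy, not an XQuery implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element names follow the queries on the slides.
BOOKS_XML = """
<BOOKLIST>
  <BOOK>
    <AUTHOR><FIRSTNAME>Richard</FIRSTNAME><LASTNAME>Feynman</LASTNAME></AUTHOR>
    <TITLE>Title A</TITLE><PUBLISHED>1980</PUBLISHED>
  </BOOK>
  <BOOK>
    <AUTHOR><FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME></AUTHOR>
    <TITLE>Title B</TITLE><PUBLISHED>1981</PUBLISHED>
  </BOOK>
  <BOOK>
    <AUTHOR><FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME></AUTHOR>
    <TITLE>Title C</TITLE><PUBLISHED>1980</PUBLISHED>
  </BOOK>
</BOOKLIST>
"""

root = ET.fromstring(BOOKS_XML.strip())  # root is the BOOKLIST element

results = []
for book in root.findall("BOOK"):                  # FOR: each /BOOKLIST/BOOK
    if book.findtext("PUBLISHED") == "1980":       # WHERE: published in 1980
        result = ET.Element("RESULT")              # RETURN: build a new element
        ET.SubElement(result, "FIRSTNAME").text = book.findtext("AUTHOR/FIRSTNAME")
        ET.SubElement(result, "LASTNAME").text = book.findtext("AUTHOR/LASTNAME")
        results.append(ET.tostring(result, encoding="unicode"))

for line in results:
    print(line)
```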

21
XQuery: ordering
  • The semantics of XQuery is that a path expression
    returns results sorted in document order.
  • SORT BY
  • FOR $b IN doc(www.ourbookstore.com/books.xml)/BOOKLIST/BOOK
  • RETURN
  • <BOOKTITLES> $b/TITLE </BOOKTITLES>
  • SORT BY TITLE

22
XQuery: grouping
  • FOR $p IN DISTINCT doc(www.ourbookstore.com/books.xml)/BOOKLIST/BOOK/PUBLISHED
  • RETURN
  • <RESULT>
  • $p,
  • FOR $a IN DISTINCT
  • /BOOKLIST/BOOK[PUBLISHED=$p]/AUTHOR/LASTNAME
  • RETURN $a
  • </RESULT>
  • <RESULT>
  • <PUBLISHED>1980</PUBLISHED>
  • <LASTNAME>Feynman</LASTNAME>
  • <LASTNAME>Narayan</LASTNAME>
  • </RESULT>
  • <RESULT>
  • <PUBLISHED>1981</PUBLISHED>
  • <LASTNAME>Narayan</LASTNAME>
  • </RESULT>
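
The nested FOR with DISTINCT amounts to grouping author last names by publication year. A minimal Python sketch of that logic follows; the sample document is hypothetical (titles are placeholders) but was chosen so that the groups agree with the result shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element names follow the queries on the slides.
BOOKS_XML = """
<BOOKLIST>
  <BOOK>
    <AUTHOR><FIRSTNAME>Richard</FIRSTNAME><LASTNAME>Feynman</LASTNAME></AUTHOR>
    <TITLE>Title A</TITLE><PUBLISHED>1980</PUBLISHED>
  </BOOK>
  <BOOK>
    <AUTHOR><FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME></AUTHOR>
    <TITLE>Title B</TITLE><PUBLISHED>1981</PUBLISHED>
  </BOOK>
  <BOOK>
    <AUTHOR><FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME></AUTHOR>
    <TITLE>Title C</TITLE><PUBLISHED>1980</PUBLISHED>
  </BOOK>
</BOOKLIST>
"""

root = ET.fromstring(BOOKS_XML.strip())

# Outer FOR ... IN DISTINCT: distinct publication years, in document order.
years = []
for published in root.findall("BOOK/PUBLISHED"):
    if published.text not in years:
        years.append(published.text)

# Inner FOR ... IN DISTINCT: distinct last names among books of that year.
groups = {}
for year in years:
    names = []
    for book in root.findall("BOOK"):
        if book.findtext("PUBLISHED") == year:   # BOOK[PUBLISHED=$p]
            for last in book.findall("AUTHOR/LASTNAME"):
                if last.text not in names:
                    names.append(last.text)
    groups[year] = names

print(groups)
```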

23
Efficient evaluation of XML queries
  • Storage
  • CLOB
  • Indexing
  • Relational system → B-tree
  • Native storage engine → novel structures
  • Query optimization
  • Still an open area for development
  • An algebra for XQuery
  • Statistics for path expression queries

24
Storing XML in RDBMS
  • Choice of relational schema
  • Queries in XQuery need to be translated into SQL
  • The result of a SQL query needs to be converted
    back to XML

25
XML data → relations
  • Author, title, published, genre, format can occur
    at most once within a book
  • BOOKLIST (id integer)
  • BOOK ( booklistid integer, author_firstname
    string,
  • author_lastname string,
  • title string,
  • published string,
  • genre string,
  • format string)

26
XML data → relations
  • Allowing more than one author
  • BOOKLIST (id integer)
  • BOOK ( id integer,
  • booklistid integer,
  • title string,
  • published string,
  • genre string,
  • format string)
  • AUTHOR ( bookid integer,
  • firstname string,
  • lastname string)
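
A sketch of how a document might be shredded into the multi-author schema, using SQLite from Python's standard library. The sample XML, the generated ids, the TEXT column types and the key constraints are assumptions added for illustration; the slides only give the table and column names.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Table and column names follow the slide; types and keys are assumptions.
SCHEMA = """
CREATE TABLE BOOKLIST (id INTEGER PRIMARY KEY);
CREATE TABLE BOOK (id INTEGER PRIMARY KEY,
                   booklistid INTEGER REFERENCES BOOKLIST(id),
                   title TEXT, published TEXT, genre TEXT, format TEXT);
CREATE TABLE AUTHOR (bookid INTEGER REFERENCES BOOK(id),
                     firstname TEXT, lastname TEXT);
"""

# Hypothetical sample data; titles are placeholders.
BOOKS_XML = """
<BOOKLIST>
  <BOOK genre="Science" format="Hardcover">
    <AUTHOR><FIRSTNAME>Richard</FIRSTNAME><LASTNAME>Feynman</LASTNAME></AUTHOR>
    <TITLE>Title A</TITLE><PUBLISHED>1980</PUBLISHED>
  </BOOK>
  <BOOK genre="Fiction" format="Paperback">
    <AUTHOR><FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME></AUTHOR>
    <TITLE>Title B</TITLE><PUBLISHED>1981</PUBLISHED>
  </BOOK>
  <BOOK genre="Fiction" format="Paperback">
    <AUTHOR><FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME></AUTHOR>
    <TITLE>Title C</TITLE><PUBLISHED>1980</PUBLISHED>
  </BOOK>
</BOOKLIST>
"""


def shred(xml_text, conn):
    """Load one BOOKLIST document into the three relations."""
    root = ET.fromstring(xml_text.strip())
    conn.executescript(SCHEMA)
    booklist_id = 1  # a single document, so a single BOOKLIST row
    conn.execute("INSERT INTO BOOKLIST (id) VALUES (?)", (booklist_id,))
    for book_id, book in enumerate(root.findall("BOOK"), start=1):
        conn.execute(
            "INSERT INTO BOOK VALUES (?, ?, ?, ?, ?, ?)",
            (book_id, booklist_id, book.findtext("TITLE"),
             book.findtext("PUBLISHED"), book.get("genre"), book.get("format")),
        )
        # One AUTHOR row per AUTHOR element, so a book may have several.
        for author in book.findall("AUTHOR"):
            conn.execute(
                "INSERT INTO AUTHOR VALUES (?, ?, ?)",
                (book_id, author.findtext("FIRSTNAME"), author.findtext("LASTNAME")),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
shred(BOOKS_XML, conn)
print(conn.execute("SELECT * FROM AUTHOR").fetchall())
```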

27
Query processing
  • FOR $b IN doc(www.ourbookstore.com/books.xml)/BOOKLIST/BOOK
  • WHERE $b/PUBLISHED='1980'
  • RETURN
  • <RESULT>
  • $b/AUTHOR/FIRSTNAME,
  • $b/AUTHOR/LASTNAME
  • </RESULT>
  • Using the first schema
  • SELECT BOOK.author_firstname,
  • BOOK.author_lastname
  • FROM BOOK, BOOKLIST
  • WHERE BOOKLIST.id = BOOK.booklistid
  • AND BOOK.published = '1980'

28
Query processing
  • Using the second schema
  • SELECT BOOK.id,
  • AUTHOR.firstname,
  • AUTHOR.lastname
  • FROM BOOK, BOOKLIST, AUTHOR
  • WHERE BOOKLIST.id = BOOK.booklistid
  • AND BOOK.id = AUTHOR.bookid
  • AND BOOK.published = '1980'
  • GROUP BY BOOK.id
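
The translated query and the conversion of its rows back to XML can be sketched with SQLite. One deliberate change, stated as an assumption: the sketch uses ORDER BY BOOK.id instead of GROUP BY BOOK.id, because stricter SQL engines reject a GROUP BY whose select list contains ungrouped, non-aggregated columns, and ordering is sufficient to bring the authors of one book together. The rows and titles are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BOOKLIST (id INTEGER);
CREATE TABLE BOOK (id INTEGER, booklistid INTEGER, title TEXT,
                   published TEXT, genre TEXT, format TEXT);
CREATE TABLE AUTHOR (bookid INTEGER, firstname TEXT, lastname TEXT);
-- Hypothetical rows; titles are placeholders.
INSERT INTO BOOKLIST VALUES (1);
INSERT INTO BOOK VALUES (1, 1, 'Title A', '1980', 'Science', 'Hardcover');
INSERT INTO BOOK VALUES (2, 1, 'Title B', '1981', 'Fiction', 'Paperback');
INSERT INTO BOOK VALUES (3, 1, 'Title C', '1980', 'Fiction', 'Paperback');
INSERT INTO AUTHOR VALUES (1, 'Richard', 'Feynman');
INSERT INTO AUTHOR VALUES (2, 'R.K.', 'Narayan');
INSERT INTO AUTHOR VALUES (3, 'R.K.', 'Narayan');
""")

rows = conn.execute("""
    SELECT BOOK.id, AUTHOR.firstname, AUTHOR.lastname
    FROM BOOK, BOOKLIST, AUTHOR
    WHERE BOOKLIST.id = BOOK.booklistid
      AND BOOK.id = AUTHOR.bookid
      AND BOOK.published = '1980'
    ORDER BY BOOK.id
""").fetchall()

# Convert the flat rows back to XML: one RESULT element per book.
results = []
for _, book_rows in groupby(rows, key=lambda row: row[0]):
    result = ET.Element("RESULT")
    for _, firstname, lastname in book_rows:
        ET.SubElement(result, "FIRSTNAME").text = firstname
        ET.SubElement(result, "LASTNAME").text = lastname
    results.append(ET.tostring(result, encoding="unicode"))

for line in results:
    print(line)
```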

29
Summary
  • Databases, IR, XML
  • Indexing for Text Search
  • Web Search Engines
  • Managing text in a DBMS
  • XQuery: querying XML data
  • Storing XML in RDBMS
  • Query processing

30
References
  • Raghu Ramakrishnan and Johannes Gehrke, Database
    Management Systems, 3rd edition, 2003
  • http://www.w3.org/XML/

31
Example DTD
  • <!DOCTYPE BOOKLIST [
  • <!ELEMENT BOOKLIST (BOOK)*>
  • <!ELEMENT BOOK (AUTHOR, TITLE, PUBLISHED?)>
  • <!ELEMENT AUTHOR (FIRST, LAST)>
  • <!ELEMENT FIRST (#PCDATA)>
  • <!ELEMENT LAST (#PCDATA)>
  • <!ELEMENT TITLE (#PCDATA)>
  • <!ELEMENT PUBLISHED (#PCDATA)>
  • <!ATTLIST BOOK genre (Science|Fiction) #REQUIRED>
  • <!ATTLIST BOOK format (Paperback|Hardcover)
    "Paperback">
  • ]>

32
Booklist Example