Indexing Strategies for the Linguist's Search Engine

About This Presentation

Slides: 44

Provided by: Aaron219

Transcript and Presenter's Notes

Title: Indexing Strategies for the Linguist's Search Engine

1
Indexing Strategies for the Linguist's Search Engine

Aaron Elkiss and Philip Resnik
UMIACS

2
Why a Linguist's Search Engine?

Goal for linguists: use naturally occurring data to support theories
Bag-of-words searches are not sufficient
Structural searches of parse trees would be better

3
Constituency Parse
4
A Web Search Tool for the Ordinary Working
Linguist

Database
Must permit real-time interaction
Must permit large-scale searches
Must allow search on linguistic criteria
Interface
Must have linguist-friendly look and feel
Must minimize learning/ramp-up time
Must be reliable
Must evolve with real use

5
Querying Parse Trees

Find all trees containing a particular subtree
We use Query by Example: edit an example sentence into the structure we're interested in

6
Query Properties

Typically concerned with structure near the
leaves of the tree
Relationship can be ancestorship rather than
immediate dominance

7
LSE Design Criteria

Must permit arbitrary structural searches
multiple branches with wildcards
in real time
on a large collection of sentences
1 GB, scaling up to 10 GB or more

8
Existing Techniques

Convert data to a relational model
Streaming techniques (tgrep2 (Rohde), XSQ (Chawathe et al.))
Index, but permit only simple searches (DataGuides, Widom et al.)
Indexing techniques work best with a simple schema

9
Goals

Must handle a dataset with a very large schema
17 million paths from root to terminal
XMark 1 GB has 2.4 million
Path lengths are also longer in LSE
Set of paths from root to preterminal is fixed in XMark, grows without bound in LSE
Must handle queries with wildcards well
Must retrieve all results (100% recall)

10
Assumptions

Indexing can be slow (overnight)
Doesn't need to support online update
Can overgenerate results
< 100% precision
Use tgrep2 as a filter

11
Baseline Solution

VIST: A dynamic index method for querying XML data by tree structures (Wang et al. (IBM Watson), SIGMOD 2003)
Suffix-tree based approach
Indexes structure and content together
Supports branching queries well

12
Suffix Trees

Index all suffixes of a given string
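
As a rough illustration of what "index all suffixes" means, here is a minimal sketch that enumerates the suffixes of a string and stores them in a naive trie of nested dicts. This is not a real suffix-tree construction (those use compressed edges and linear-time algorithms), and the function names are made up for this example.

```python
def suffixes(s):
    """Return every suffix of s, longest first."""
    return [s[i:] for i in range(len(s))]


def build_suffix_trie(s):
    """Naive suffix trie as nested dicts (quadratic space, illustration only)."""
    root = {}
    for suffix in suffixes(s):
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """A pattern occurs in s exactly when it is a prefix of some suffix of s."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("banana")
found = contains(trie, "nan")
```

Because every substring is a prefix of some suffix, a single walk from the root answers a substring query.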

13
Structure Encoded Sequences

Represent each node in DFS order with the
complete path from the root to the node
One parse tree = one document = one structure-encoded sequence

S1 S_S1 NP_S_S1 NNP_NP_S_S1
Jared_NNP_NP_S_S1 VP_S_S1 VBD_VP_S_S1
laughed_VBD_VP_S_S1
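
A minimal sketch of how such a sequence could be generated for the "Jared laughed" parse, assuming a tree is represented as nested `(label, children)` tuples. That representation and the function name are assumptions for illustration, not part of the LSE code.

```python
def structure_encoded_sequence(tree, ancestors=()):
    """Emit each node in DFS (preorder), written as node_parent_..._root."""
    label, children = tree
    path = (label,) + ancestors          # node first, root last
    sequence = ["_".join(path)]
    for child in children:
        sequence.extend(structure_encoded_sequence(child, path))
    return sequence


# (S1 (S (NP (NNP Jared)) (VP (VBD laughed))))
tree = ("S1", [("S", [("NP", [("NNP", [("Jared", [])])]),
                      ("VP", [("VBD", [("laughed", [])])])])])

sequence = structure_encoded_sequence(tree)
```

Each element carries the complete path back to the root, so the sequence encodes structure and content together.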
14
VIST Trees

Insert structure encoded sequences instead of
suffixes of a string

15
Node Identification

(n, d) = (DFS order / node ID, number of descendants)
DFS order uniquely identifies a node
Together with the number of descendants, it identifies which nodes are descendants of a given node
Can be produced without using a lot of memory, using Perl and the UNIX sort utility
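
The (n, d) labelling can be sketched as follows. This is an in-memory Python illustration on a single parse tree, not the Perl and sort pipeline mentioned on the slide, and the tree representation and function names are assumptions.

```python
def assign_ids(tree):
    """Return a list of (label, n, d): DFS order n and number of descendants d."""
    labelled = []

    def visit(node):
        label, children = node
        n = len(labelled)                # next free DFS number
        labelled.append(None)            # reserve the slot, fill in once d is known
        d = sum(visit(child) for child in children)
        labelled[n] = (label, n, d)
        return d + 1                     # size of this subtree, including the node

    visit(tree)
    return labelled


def is_descendant(ancestor, node):
    """node lies below ancestor exactly when n < m <= n + d."""
    n, d = ancestor
    m, _ = node
    return n < m <= n + d


tree = ("S1", [("S", [("NP", [("NNP", [("Jared", [])])]),
                      ("VP", [("VBD", [("laughed", [])])])])])
labelled = assign_ids(tree)
```

The descendant test needs only two integer comparisons, which is what makes range lookups in an ordered index possible.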

16
VIST Indexes

Two B-tree indexes using BerkeleyDB
Structural Sequence Index
Document Index

17
Structural Sequence Index

Structural Sequence Element → (n, d)
S1 → (0,12)
VP_S_S1 → (5,2), (10,2)
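
A hedged sketch of this mapping, using an in-memory dict of sorted lists as a stand-in for the BerkeleyDB B-tree. The class and method names are hypothetical; only the example entries come from the slide.

```python
import bisect
from collections import defaultdict


class StructuralIndex:
    """Maps a structural sequence element to a sorted list of (n, d) labels."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, element, n, d):
        bisect.insort(self._postings[element], (n, d))

    def lookup(self, element, within=None):
        """All labels for element, optionally only those below the node `within`."""
        entries = self._postings.get(element, [])
        if within is None:
            return list(entries)
        n, d = within
        # Keep entries whose node ID m satisfies n < m <= n + d.
        lo = bisect.bisect_right(entries, (n, float("inf")))
        hi = bisect.bisect_right(entries, (n + d, float("inf")))
        return entries[lo:hi]


index = StructuralIndex()
index.add("S1", 0, 12)
index.add("VP_S_S1", 5, 2)
index.add("VP_S_S1", 10, 2)
below_root = index.lookup("VP_S_S1", within=(0, 12))
```

Restricting a lookup to the descendants of an already matched node is the step that lets a query be matched one element at a time.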

18
Document Index

Documents are inserted at the node ID of the last element

7 →
12 →
19
Search
Query

Order of branches in query is important

20
Recursion Base Case

After the last branch of the query
Retrieve documents with descendant node IDs

7 →
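
The base case can be sketched as a range scan over the document index. This assumes the index is a sorted list of `(node_id, doc_id)` pairs and that the matched node's own ID is included in the range; both are assumptions for illustration, and the document names are invented.

```python
import bisect


def documents_under(doc_index, node):
    """Return doc IDs stored at node IDs in the range n .. n + d (inclusive)."""
    n, d = node
    node_ids = [node_id for node_id, _ in doc_index]
    lo = bisect.bisect_left(node_ids, n)
    hi = bisect.bisect_right(node_ids, n + d)
    return [doc_id for _, doc_id in doc_index[lo:hi]]


# Two documents, stored at the node IDs of their last elements.
doc_index = [(7, "doc-A"), (12, "doc-B")]
hits = documents_under(doc_index, (5, 2))
```

With a real B-tree the two bisect calls would become a single cursor range scan.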
21
Peculiarities of VIST

Precision is not 100%!
Query
matches both these documents

22
Problematic Query - Wildcards

Wildcards can still be a problem
Recursion isn't deep but can be very wide
End up looking at the same nodes over and over again with different wildcard instantiations from previous branches

23
Problematic Query - Wildcards
24
Problematic Query - Common Terminal

VIST's structural index actually stores each element as
terminal, length, root ... preterminal
the 6 S1 S VP FRAG X DT
so that it can find instantiated prefixes of structural sequence elements
We'd look for
JJR 5 S1 S VP FRAG X

25
Problematic Query - Common Terminal

To find structural sequence elements like the_DT_X_FRAG_ we have to look at every element with the terminal "the"
220,284 for the_ vs. 121 for the_DT_X_FRAG_

26
Solution Overview

Ignore insufficiently selective query branches
Reorder processing of query branches
Different ordering for structural index
Create in-memory tree for the query
Memoization of nodes matching subtree of query

27
Ignore query branches

Generate statistics for each pair of tokens
Calculate estimated selectivity of each branch
Discard insufficiently selective branches
Use tgrep2 as filter
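
One way the selectivity step might look, as a sketch only. The slides do not say how the pair statistics are combined, so this assumes a branch is estimated by its rarest adjacent token pair; the counts, the threshold, and the function names are all invented for illustration.

```python
def estimate_matches(branch, pair_counts):
    """Assumed estimate: a branch matches at most as often as its rarest adjacent pair."""
    pairs = zip(branch, branch[1:])
    return min((pair_counts.get(pair, 0) for pair in pairs), default=float("inf"))


def split_branches(branches, pair_counts, max_matches):
    """Keep selective branches for the index; leave the rest to the tgrep2 filter."""
    kept, dropped = [], []
    for branch in branches:
        if estimate_matches(branch, pair_counts) <= max_matches:
            kept.append(branch)
        else:
            dropped.append(branch)
    return kept, dropped


# Invented counts of (ancestor label, descendant token) pairs.
pair_counts = {("S1", "VP"): 900_000, ("VP", "laughs"): 40,
               ("S1", "NP"): 950_000, ("NP", "the"): 700_000}
branches = [("S1", "VP", "laughs"), ("S1", "NP", "the")]
kept, dropped = split_branches(branches, pair_counts, max_matches=10_000)
```

Dropping a branch only lowers precision, never recall, because the tgrep2 pass still checks the full query.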

Still problematic
28
Reorder query branches

Start processing with most selective branch
Join to preceding branches, then following branches
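
A small sketch of this ordering. It assumes each branch already has an estimated match count, and it assumes preceding branches are visited nearest first; the slide does not specify that detail, and the function name is hypothetical.

```python
def processing_order(estimates):
    """Indices of query branches in the order they would be processed."""
    start = min(range(len(estimates)), key=estimates.__getitem__)
    preceding = list(range(start - 1, -1, -1))      # nearest preceding branch first (assumed)
    following = list(range(start + 1, len(estimates)))
    return [start] + preceding + following


# Estimated match counts for three branches, in query order (invented numbers).
order = processing_order([900, 40, 5000])
```

Starting from the smallest candidate set keeps every later join small.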

29
Reorder structural index

Store as
terminal, preterminal ... root
the DT X FRAG VP S S1
Immediately find paths with particular suffix
Terminals occurring in similar contexts are
clustered together
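
To see why this key order helps, here is a minimal sketch that stores keys as tuples in terminal-first order and answers a suffix-of-path query with one range scan. A sorted Python list stands in for the B-tree, and the sample keys other than the slide's example are invented.

```python
import bisect


def prefix_scan(sorted_keys, prefix):
    """Return all keys that start with prefix, using one seek plus a short scan."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[lo:]:
        if key[:len(prefix)] != prefix:
            break                        # keys are sorted, so nothing later can match
        matches.append(key)
    return matches


keys = sorted([
    ("the", "DT", "X", "FRAG", "VP", "S", "S1"),   # example from the slide
    ("the", "DT", "NP", "S", "S1"),                # invented
    ("a", "DT", "X", "FRAG", "S1"),                # invented
])
matches = prefix_scan(keys, ("the", "DT", "X", "FRAG"))
```

With the earlier root-first layout the same query would have to visit every key whose terminal is "the".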

30
Reorder structural index

Now we have to look at every JJR_X_FRAG_ instead of just those with the same prefix as the_DT_X_FRAG_
But we'll only do so once, and only keep those the_DT_X_FRAG_ and JJR_X_FRAG_ that have matching prefixes

31
Create Query Tree

Keep relevant instantiations of each branch in memory

S1__NP__robot
    robot_NN_NP_NP_S_SBAR_S_X_X_S1
    robot_NN_NP_NP_S_SBAR_VP_FRAG_S1
    robot_NN_NP_NP_S_SBAR_VP_S_S_S1
S1__VP
    VP_S_S1
        _laughs: laughs_VBZ_VP_VP_S_SBAR_NP_PP_NP_PP
        _us: us_PRP_NP
    VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1
        _laughs: laughs_VBZ
        _us: us_PRP_NP
32
Subtree Memoization

Create sorted list of all nodes for a particular branch of the query

S1__NP__robot
    robot_NN_NP_NP_S_SBAR_S_X_X_S1: (1,15) (30,10)
S1__VP
    VP_S_S1
        _laughs: laughs_VBZ_VP_VP_S_SBAR_NP_PP_NP_PP (5,5)
    VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1
        _laughs: laughs_VBZ (20,0)
S1__VP__laughs: (5,5) (20,0)
33
Subtree Memoization

Specifier for memoized list includes wildcard instantiations

S1__VP
    VP_S_S1
        _laughs: laughs_VBZ_VP_VP_S_SBAR_NP_PP_NP_PP (5,5) (10,0)
        _us: us_PRP_NP (6,0)
             us_PRP_NP_NP (50,0)
    VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1
        _laughs: laughs_VBZ (20,20)
        _us: us_PRP_NP (60,0)
S1__VP__us / VP_S_S1: (6,0) (50,0)
S1__VP__us / VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1: (60,0)
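
The memoization idea can be sketched as a cache keyed by the query subtree together with its wildcard instantiation. The class name and the callback are hypothetical; the toy data reuses the node lists from slide 33 in place of real index lookups.

```python
class SubtreeMemo:
    """Cache of sorted (n, d) node lists, keyed by (query subtree, wildcard instantiation)."""

    def __init__(self, compute):
        self._compute = compute      # hypothetical callback that does the real index work
        self._cache = {}
        self.computed = 0            # how many times the callback actually ran

    def nodes(self, subtree, instantiation):
        key = (subtree, instantiation)
        if key not in self._cache:
            self.computed += 1
            self._cache[key] = sorted(self._compute(subtree, instantiation))
        return self._cache[key]


# Toy stand-in for the index lookups.
FAKE_RESULTS = {
    ("S1__VP__us", "VP_S_S1"): [(50, 0), (6, 0)],
    ("S1__VP__us", "VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1"): [(60, 0)],
}

memo = SubtreeMemo(lambda subtree, inst: FAKE_RESULTS[(subtree, inst)])
first = memo.nodes("S1__VP__us", "VP_S_S1")
second = memo.nodes("S1__VP__us", "VP_S_S1")   # served from the cache
```

Including the instantiation in the key matters: the same query subtree under a different wildcard binding yields a different node list.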
34
Evaluation

Original VIST scalability
XMark
LSE data

35
Original VIST scalability
Random queries over a synthetic data set
From Haixun Wang, Sanghyun Park, Wei Fan, and Philip S. Yu. VIST: A dynamic index method for querying XML data by tree structures. In SIGMOD, 2003. http://citeseer.nj.nec.com/wang03vist.html
36
Evaluation - VIST