IAW602 Fulltext Search in SQL Anywhere - PowerPoint PPT Presentation

1 / 56

About This Presentation

Title:

IAW602 Fulltext Search in SQL Anywhere

Description:

Searching for hyphenated phrases is particularly problematic ... Stoplist examples for different Latin languages (English, French, Spanish, etc) ... – PowerPoint PPT presentation

Number of Views:71

Avg rating:3.0/5.0

Slides: 57

Provided by: iablog1

Category:

more less

Transcript and Presenter's Notes

Title: IAW602 Fulltext Search in SQL Anywhere

1
IAW602 Fulltext Search in SQL Anywhere

Glenn Paulley
Director, Engineering
http//iablog.sybase.com/paulley

2
One-slide overview

Text indexes
CREATE TEXT INDEX tix_news_article ON
news_article( subject )
Text queries
SELECT
FROM news_article
WHERE CONTAINS( subject, ianywhere )

3
Contrast with other text search

LIKE predicates
Can use non-text index for word
Search for patterns rather than words
REGEXP, SIMILAR predicates
New in version 11
Can never use any kind of index
Search for patterns rather than words
Unlike the three predicates above, CONTAINS
predicates (or CONTAINS table expression) require
the existence of a full-text index

4
Outline

Querying
Index Maintenance
Index Creation and Configuration

5
Two forms of CONTAINS

Predicate
SELECT
FROM news_article
WHERE CONTAINS( subject, ianywhere )
FROM clause
SELECT
FROM news_article CONTAINS( subject, ianywhere
)
FROM version returns an implicit score column
Think of this as a table expression that joins
the rows of news_article with their corresponding
relevance scores

6
Scoring results

Higher score gt more relevant result
To get the most relevant results
SELECT TOP 10
FROM news_article CONTAINS( subject, ianywhere)
ORDER BY score DESC

7
Scoring results (continued)

Weighting is with BM25, a commonly-used
information retrieval formula
Probabilistic weighting scheme that uses the
inverse frequency of the query terms appearing in
each document
terms that appear rarely in a document are more
meaningful than if they appear more often.
The weighting scheme disregards any implied
relationship between query terms within a single
row, for example due to their relative proximity.
All else being equal, higher scores result from
Shorter matches
Less common words
More repetitions of a word

8
Scoring results (continued)

N number of documents (rows) in the collection
q query terms
nq vocabulary of the collection being searched
f(q,D) frequency of q terms in the document D
D - length of the document D
ADL average document length
k1 1.2, b 0.75

9
FROM clause syntax

Can use a different correlation name for the
table columns and the score column
SELECT TOP 10 na.
FROM news_article AS na
CONTAINS( subject, ianywhere ) AS cn
ORDER BY cn.score
Permits multiple CONTAINS procedures per query,
each with a different correlation name for
identifying the score column

10
Multiple term queries

Separate by spaces
SELECT FROM news_article
WHERE CONTAINS( subject, ianywhere solutions
developer)
A row matches if it contains all three words
Equivalently, use AND
CONTAINS( subject, ianywhere AND solutions AND
developer )

11
Multiple-column queries

Search more than one column at once
SELECT FROM news_article
CONTAINS( subject, body, ianywhere solutions )
Matching is as if the two columns were
concatenated
A matching row may have both words in one column
or one word in one column and one word in the
other column
Both columns must be covered by a (text) index

12
Prefix search

A word in a query can be a prefix
SELECT FROM news_article
CONTAINS( subject, com )
Matches the words compact, command,
compatibility, comments, etc.

13
Phrase search

Use double quotes in the query
SELECT FROM news_article
CONTAINS( subject, text compression )
The prefix operator can be used on words in a
phrase
CONTAINS( subject, text compress )
These forms of searching can be executed
efficiently
position information is stored in the text index
no need to scan the raw text

14
OR operator

Use OR to get some, but not necessarily all, of
the words
SELECT
FROM news_article
CONTAINS( subject, 8 OR 9 )
Score will be higher for matches with more words

15
AND NOT operator

Use AND NOT to exclude rows matching a word
SELECT
FROM news_article
WHERE CONTAINS( subject, text AND NOT file )

16
Mixing operators

Use brackets to make the meaning clear
SELECT
FROM news_article
WHERE CONTAINS( subject, (full AND text) OR
fulltext )

17
NEAR operator

Use to find words close together
SELECT FROM news_article
WHERE CONTAINS( subject, dbisql NEAR2 error )
Matches dbisql within 2 words of error
Distance is optional it defaults to 10

18
Alternate syntax

Some operators have equivalent symbols
AND
OR
AND NOT -
NEAR

19
Keywords and special characters

Keywords AND, OR, NOT, NEAR, FUZZY
Need to quote these as a single word phrase to
search for them AND OR
Special characters -
Be careful of special characters in queries,
especially hyphenated words getting treated as
AND NOT
Searching for hyphenated phrases is particularly
problematic
Search for the hyphenated word as a phrase with
the hyphen omitted, eg.
CONTAINS( subject, body, 'lichtner and
checkpoint only ' )

20
Vocabulary

A procedure is available that returns the list of
all unique words in an index
One possible use is to analyze the corpus for the
occurrence of stop words
Another is to construct queries with patterns
that arent directly supported
SELECT LIST( term, OR )
FROM sa_text_index_vocab( ti_news_article,
news_article, dba )
WHERE term LIKE anywhere

21
Fast first matches

The OPTIMIZATION_GOAL option setting applies as
usual
If you really want the first few matches quickly,
then use first-row

22
Fast first matches (continued)

Note that TOP N ORDER BY score DESC queries
currently compute scores for all potential
matches and sort them
This cannot be instantaneous if there are many
potential matches
A sequential scan is not an option, because
term existence and position is contained within
the text index
Hence the computation may involve joins of very
large intermediate results
A workaround is to try the query first as a
predicate to find if the number of potential
matches is too high

23
Querying simple views

Can query a simple view with CONTAINS
Simple means that the view is a single table with
a text index
The view cant use aggregates, window functions,
etc.
Useful if you create simple views for access
control

24
Integration with SQL query processing

Operations such as AND, OR, AND NOT are merged
into the main query as joins between the
referenced table(s) and the text index(es)
Complete statement is optimized as a unit on a
cost basis
Can examine the graphical plan to see where the
time is going

25
Integration with SQL query processing
26
Outline

Querying
Index Maintenance
Index Creation and Configuration

27
Refresh Modes

Three types of text index refresh
AUTO
MANUAL
IMMEDIATE (default)
Different characteristics with respect to
Ease of use
Update efficiency
Consistency and recency

28
IMMEDIATE REFRESH

The default if nothing is specified for a create
text index
Ease of use the simplest choice
index updates happen automatically when the table
is changed
Update efficiency the most expensive choice
change a row of the table, and every word in the
row causes a change to the index
One implication is that shorter rows are cheaper

29
IMMEDIATE REFRESH (continued)

Consistency and freshness
Always consistent with the most recent version of
the data
Anomalies possible with concurrent updates and
isolation levels less than 3
These anomalies are identical to those
encountered in ordinary SQL applications with
lower isolation levels (eg. phantom rows).
Almost full support for snapshot isolation
Scores may vary slightly when there are
concurrent updates from multiple transactions

30
MANUAL REFRESH

Specify at creation
CREATE TEXT INDEX ti ON
news_article( subject)
MANUAL REFRESH
In the main, MANUAL REFRESH is a performance
optimization
Ease of use
More work than immediate
Index is not refreshed at creation
Every refresh must be done manually
REFRESH TEXT INDEX ti ON news_article

31
MANUAL REFRESH (continued)

Update efficiency
Changes to the table update change tracking
information in the text index
Changes are tracked merely by noting the RIDs of
the modified base rows
No changes are made to the text index itself
This is about the same cost as updating a regular
index on an integer or timestamp column
Refreshes are batched and may use much less total
time than doing many immediate, individual
updates

32
MANUAL REFRESH (continued)

Consistency and Freshness
Rows that have been added or changed since the
last refresh are never returned by queries
When a row is changed, it disappears immediately
from all subsequent queries
This is not affected by isolation level
It is also not affected by ROLLBACKs or COMMITs
On REFRESH, either
Entire index is rebuilt from scratch due to
number of changes, or
Individual rows are re-indexed, in a batch,
using the rows identified in the change list

33
AUTO REFRESH

Similar to MANUAL REFRESH
CREATE TEXT INDEX ti on
news_article( subject)
AUTO REFRESH
Can optionally specify schedule
AUTO REFRESH EVERY integer MINUTES HOURS
EVERY 1 HOUR is used if omitted
If alternate scheduling is desired, can establish
an event to perform the REFRESH on an arbitrary
schedule

34
AUTO REFRESH (continued)

Ease of Use
Refreshes happen automatically at the specified
time interval
They may also happen more often if a significant
portion of the table is changed
Can use the REFRESH statement to manually refresh
an AUTO index before the next scheduled refresh
Update Efficiency
Same as MANUAL REFRESH
Consistency and Freshness
Same as MANUAL REFRESH

35
Refresh Isolation level

Refresh isolation level is not meaningful for
IMMEDIATE REFRESH text indexes
The first refresh happens at creation, and the
table is locked exclusively during this (same as
for non-text indexes)
By default, all refreshes for MANUAL and AUTO
indexes are at isolation level 0

36
Refresh Isolation Level (continued)

Can specify other isolation levels on a refresh
REFRESH TEXT INDEX text-index-name
ON owner.table-name
WITH ISOLATION LEVEL isolation-level
EXCLUSIVE MODE
SHARE MODE
EXCLUSIVE and SHARE mode are the same as for
materialized views and may occasionally be useful
Other isolation levels will rarely be useful
given that queries do not return changed rows

37
Incremental refresh

Refreshes can be incremental or full
By default the choice is automatic, based on the
number of deleted entries in the index
Can force one or the other if you need more
predictable refresh times
REFRESH TEXT INDEX text-index-name
ON owner.table-name
FORCE BUILD INCREMENTAL

38
Cost of refresh

Cost of a full refresh is comparable to creating
a normal index on the equivalent amount of data
(counting just the size of the indexed columns)
Incremental refresh is cheaper, depending on how
much of the table has changed

39
Online refresh

All refreshes are online
This means that you can continue to query or
update the indexed table during the refresh
If a complete rebuild, a clone of the text index
is constructed
Permits uninterrupted access to the existing text
index by queries while the next index is built
Locks created by choosing EXCLUSIVE MODE, SHARE
MODE, or an isolation level other than 0 can
interfere with this operation

40
Finding refresh status

Use sa_text_index_stats()
This gives summary information for every index in
the database, including
Last refresh time
Number of values indexed, and their total length
Total length of pending and deleted values

41
Outline

Querying
Index Maintenance
Index Creation and Configuration

42
CREATE syntax

CREATE TEXT INDEX text-index-name
ON owner.table-name( column-name, ... )
IN dbspace-name
CONFIGURATION owner.text-configuration-name
IMMEDIATE REFRESH
MANUAL REFRESH
AUTO REFRESH
EVERY integer MINUTES HOURS

43
Other DDL

Can also
ALTER TEXT INDEX
DROP TEXT INDEX
TRUNCATE TEXT INDEX
Meanings are mostly obvious. See the
documentation
UNLOAD/RELOAD will cause the reconstruction of
all text indexes
IMMEDIATEly REFRESHed indexes are built during
CREATE TEXT INDEX
MANUAL indexes are not refreshed as part of
reload unless the -g dbunload option is used
AUTO indexes get refreshed on their next refresh
cycle

44
What can you index?

Anything
If its not a string, then its converted to one
Some options affect conversion to strings
DATE_FORMAT
TIME_FORMAT
TIMESTAMP_FORMAT
Settings of these options are saved when the
configuration object for the index is created

45
Collations

Comparisons are done using the database
nchar_collation if any of the index columns are
NCHAR
Otherwise, they are done using the database
char_collation
This is not currently configurable
Therefore, the case sensitivity (for example) is
the same as for the database

46
Space Requirements

A text index typically uses about 30 of the
space of the indexed text
This depends mostly on uncontrollable factors
such as the number of unique words and their
frequency distribution
Everything is stored in the database and included
in backups, checkpoints, etc.
A text index is represented by several permanent
tables
These appear in the catalog with names like
4470pos1, 4443stat2, etc.
Access is restricted they cant be queried or
changed
Ignore them

47
Text Configuration objects

Stores information that is used in the creation
and querying of text indexes
Stoplist
Minimum and maximum word length
Term breaker
Stoplist examples for different Latin languages
(English, French, Spanish, etc) are installed
within the Samples directory

48
Specifying a configuration

On the CREATE statement
CREATE TEXT INDEX ti ON t
CONFIGURATION config1
When not specified, one of the default
configurations is used (default_char or
default_nchar)

49
DDL on Text configurations

Create a text configuration by copying a existing
one
CREATE TEXT CONFIGURATION config1 FROM
default_char
Can also DROP or ALTER a configuration as long as
a text index isnt using it

50
Altering text configurations

Change the stoplist
ALTER TEXT CONFIGURATION config1 STOPLIST and
the
Change the minimum or maximum word length
ALTER TEXT CONFIGURATION MINIMUM TERM LENGTH 2
ALTER TEXT CONFIGURATION MAXIMUM TERM LENGTH 40
Stoplist words and those shorter/longer than the
min/max term length are not indexed

51
Term breaking

Can use GENERIC (the default) or NGRAM
ALTER TEXT CONFIGURATION config1
TERM BREAKER NGRAM
ALTER TEXT CONFIGURATION config1
TERM BREAKER GENERIC

52
NGRAM term breaker

Words are possibly-overlapping strings of
alphanumerics
Length is always equal to the configured maximum
term length (minimum term length is ignored)
Example with N3
A quick brown fox gt qui, uic, ick, bro, row,
own, fox

53
Why an NGRAM term breaker?

Allows search of languages where words arent
separated by spaces
Allows an additional query operator
SELECT TOP 10
FROM news_article CONTAINS( subject, FUZZY
ianywhere )
ORDER BY score DESC
If N4, the above query is equivalent to iany OR
anyw OR nywh OR ywhe OR wher OR here
Usually want the highest scoring matches for a
fuzzy query

54
Term breaking in queries

Words in queries may be converted to phrases
Example with GENERIC
isnt gt isn t
Example with NGRAM (N3)
iany gt ian any

55
Unsupported features

For example
Stemming
Extraction of text from PDF files
Case insensitive search in a case sensitive
database
Text indexes over materialized views
Workarounds
Copy the text into another column, transform it
(e.g. with case folding, stemming, or plain text
extraction) and index that
Another alternative transform the queries (e.g.
using a query on the vocabulary procedure)
Use a COMPUTED COLUMN with the SOUNDEX built-in
function to handle potential spelling errors
within search terms