Title: IAW602 Fulltext Search in SQL Anywhere
1IAW602 Fulltext Search in SQL Anywhere
- Glenn Paulley
- Director, Engineering
- http//iablog.sybase.com/paulley
2One-slide overview
- Text indexes
- CREATE TEXT INDEX tix_news_article ON
- news_article( subject )
- Text queries
- SELECT
- FROM news_article
- WHERE CONTAINS( subject, ianywhere )
3Contrast with other text search
- LIKE predicates
- Can use non-text index for word
- Search for patterns rather than words
- REGEXP, SIMILAR predicates
- New in version 11
- Can never use any kind of index
- Search for patterns rather than words
- Unlike the three predicates above, CONTAINS
predicates (or CONTAINS table expression) require
the existence of a full-text index
4Outline
- Querying
- Index Maintenance
- Index Creation and Configuration
5Two forms of CONTAINS
- Predicate
- SELECT
- FROM news_article
- WHERE CONTAINS( subject, ianywhere )
- FROM clause
- SELECT
- FROM news_article CONTAINS( subject, ianywhere
) - FROM version returns an implicit score column
- Think of this as a table expression that joins
the rows of news_article with their corresponding
relevance scores
6Scoring results
- Higher score gt more relevant result
- To get the most relevant results
- SELECT TOP 10
- FROM news_article CONTAINS( subject, ianywhere)
- ORDER BY score DESC
7Scoring results (continued)
- Weighting is with BM25, a commonly-used
information retrieval formula - Probabilistic weighting scheme that uses the
inverse frequency of the query terms appearing in
each document - terms that appear rarely in a document are more
meaningful than if they appear more often. - The weighting scheme disregards any implied
relationship between query terms within a single
row, for example due to their relative proximity. - All else being equal, higher scores result from
- Shorter matches
- Less common words
- More repetitions of a word
8Scoring results (continued)
- N number of documents (rows) in the collection
- q query terms
- nq vocabulary of the collection being searched
- f(q,D) frequency of q terms in the document D
- D - length of the document D
- ADL average document length
- k1 1.2, b 0.75
9FROM clause syntax
- Can use a different correlation name for the
table columns and the score column - SELECT TOP 10 na.
- FROM news_article AS na
- CONTAINS( subject, ianywhere ) AS cn
- ORDER BY cn.score
- Permits multiple CONTAINS procedures per query,
each with a different correlation name for
identifying the score column
10Multiple term queries
- Separate by spaces
- SELECT FROM news_article
- WHERE CONTAINS( subject, ianywhere solutions
developer) - A row matches if it contains all three words
- Equivalently, use AND
- CONTAINS( subject, ianywhere AND solutions AND
developer )
11Multiple-column queries
- Search more than one column at once
- SELECT FROM news_article
- CONTAINS( subject, body, ianywhere solutions )
- Matching is as if the two columns were
concatenated - A matching row may have both words in one column
or one word in one column and one word in the
other column - Both columns must be covered by a (text) index
12Prefix search
- A word in a query can be a prefix
- SELECT FROM news_article
- CONTAINS( subject, com )
- Matches the words compact, command,
compatibility, comments, etc.
13Phrase search
- Use double quotes in the query
- SELECT FROM news_article
- CONTAINS( subject, text compression )
- The prefix operator can be used on words in a
phrase - CONTAINS( subject, text compress )
- These forms of searching can be executed
efficiently - position information is stored in the text index
- no need to scan the raw text
14OR operator
- Use OR to get some, but not necessarily all, of
the words - SELECT
- FROM news_article
- CONTAINS( subject, 8 OR 9 )
- Score will be higher for matches with more words
15AND NOT operator
- Use AND NOT to exclude rows matching a word
- SELECT
- FROM news_article
- WHERE CONTAINS( subject, text AND NOT file )
16Mixing operators
- Use brackets to make the meaning clear
- SELECT
- FROM news_article
- WHERE CONTAINS( subject, (full AND text) OR
fulltext )
17NEAR operator
- Use to find words close together
- SELECT FROM news_article
- WHERE CONTAINS( subject, dbisql NEAR2 error )
- Matches dbisql within 2 words of error
- Distance is optional it defaults to 10
18Alternate syntax
- Some operators have equivalent symbols
- AND
- OR
- AND NOT -
- NEAR
19Keywords and special characters
- Keywords AND, OR, NOT, NEAR, FUZZY
- Need to quote these as a single word phrase to
search for them AND OR - Special characters -
- Be careful of special characters in queries,
especially hyphenated words getting treated as
AND NOT - Searching for hyphenated phrases is particularly
problematic - Search for the hyphenated word as a phrase with
the hyphen omitted, eg. - CONTAINS( subject, body, 'lichtner and
checkpoint only ' )
20Vocabulary
- A procedure is available that returns the list of
all unique words in an index - One possible use is to analyze the corpus for the
occurrence of stop words - Another is to construct queries with patterns
that arent directly supported - SELECT LIST( term, OR )
- FROM sa_text_index_vocab( ti_news_article,
- news_article, dba )
- WHERE term LIKE anywhere
21Fast first matches
- The OPTIMIZATION_GOAL option setting applies as
usual - If you really want the first few matches quickly,
then use first-row
22Fast first matches (continued)
- Note that TOP N ORDER BY score DESC queries
currently compute scores for all potential
matches and sort them - This cannot be instantaneous if there are many
potential matches - A sequential scan is not an option, because
term existence and position is contained within
the text index - Hence the computation may involve joins of very
large intermediate results - A workaround is to try the query first as a
predicate to find if the number of potential
matches is too high
23Querying simple views
- Can query a simple view with CONTAINS
- Simple means that the view is a single table with
a text index - The view cant use aggregates, window functions,
etc. - Useful if you create simple views for access
control
24Integration with SQL query processing
- Operations such as AND, OR, AND NOT are merged
into the main query as joins between the
referenced table(s) and the text index(es) - Complete statement is optimized as a unit on a
cost basis - Can examine the graphical plan to see where the
time is going
25Integration with SQL query processing
26Outline
- Querying
- Index Maintenance
- Index Creation and Configuration
27Refresh Modes
- Three types of text index refresh
- AUTO
- MANUAL
- IMMEDIATE (default)
- Different characteristics with respect to
- Ease of use
- Update efficiency
- Consistency and recency
28IMMEDIATE REFRESH
- The default if nothing is specified for a create
text index - Ease of use the simplest choice
- index updates happen automatically when the table
is changed - Update efficiency the most expensive choice
- change a row of the table, and every word in the
row causes a change to the index - One implication is that shorter rows are cheaper
29IMMEDIATE REFRESH (continued)
- Consistency and freshness
- Always consistent with the most recent version of
the data - Anomalies possible with concurrent updates and
isolation levels less than 3 - These anomalies are identical to those
encountered in ordinary SQL applications with
lower isolation levels (eg. phantom rows). - Almost full support for snapshot isolation
- Scores may vary slightly when there are
concurrent updates from multiple transactions
30MANUAL REFRESH
- Specify at creation
-
- CREATE TEXT INDEX ti ON
- news_article( subject)
- MANUAL REFRESH
- In the main, MANUAL REFRESH is a performance
optimization - Ease of use
- More work than immediate
- Index is not refreshed at creation
- Every refresh must be done manually
- REFRESH TEXT INDEX ti ON news_article
31MANUAL REFRESH (continued)
- Update efficiency
- Changes to the table update change tracking
information in the text index - Changes are tracked merely by noting the RIDs of
the modified base rows - No changes are made to the text index itself
- This is about the same cost as updating a regular
index on an integer or timestamp column - Refreshes are batched and may use much less total
time than doing many immediate, individual
updates
32MANUAL REFRESH (continued)
- Consistency and Freshness
- Rows that have been added or changed since the
last refresh are never returned by queries - When a row is changed, it disappears immediately
from all subsequent queries - This is not affected by isolation level
- It is also not affected by ROLLBACKs or COMMITs
- On REFRESH, either
- Entire index is rebuilt from scratch due to
number of changes, or - Individual rows are re-indexed, in a batch,
using the rows identified in the change list
33AUTO REFRESH
- Similar to MANUAL REFRESH
- CREATE TEXT INDEX ti on
- news_article( subject)
- AUTO REFRESH
- Can optionally specify schedule
- AUTO REFRESH EVERY integer MINUTES HOURS
- EVERY 1 HOUR is used if omitted
- If alternate scheduling is desired, can establish
an event to perform the REFRESH on an arbitrary
schedule
34AUTO REFRESH (continued)
- Ease of Use
- Refreshes happen automatically at the specified
time interval - They may also happen more often if a significant
portion of the table is changed - Can use the REFRESH statement to manually refresh
an AUTO index before the next scheduled refresh - Update Efficiency
- Same as MANUAL REFRESH
- Consistency and Freshness
- Same as MANUAL REFRESH
35Refresh Isolation level
- Refresh isolation level is not meaningful for
IMMEDIATE REFRESH text indexes - The first refresh happens at creation, and the
table is locked exclusively during this (same as
for non-text indexes) - By default, all refreshes for MANUAL and AUTO
indexes are at isolation level 0
36Refresh Isolation Level (continued)
- Can specify other isolation levels on a refresh
- REFRESH TEXT INDEX text-index-name
- ON owner.table-name
- WITH ISOLATION LEVEL isolation-level
- EXCLUSIVE MODE
- SHARE MODE
- EXCLUSIVE and SHARE mode are the same as for
materialized views and may occasionally be useful - Other isolation levels will rarely be useful
given that queries do not return changed rows
37Incremental refresh
- Refreshes can be incremental or full
- By default the choice is automatic, based on the
number of deleted entries in the index - Can force one or the other if you need more
predictable refresh times - REFRESH TEXT INDEX text-index-name
- ON owner.table-name
- FORCE BUILD INCREMENTAL
38Cost of refresh
- Cost of a full refresh is comparable to creating
a normal index on the equivalent amount of data
(counting just the size of the indexed columns) - Incremental refresh is cheaper, depending on how
much of the table has changed
39Online refresh
- All refreshes are online
- This means that you can continue to query or
update the indexed table during the refresh - If a complete rebuild, a clone of the text index
is constructed - Permits uninterrupted access to the existing text
index by queries while the next index is built - Locks created by choosing EXCLUSIVE MODE, SHARE
MODE, or an isolation level other than 0 can
interfere with this operation
40Finding refresh status
- Use sa_text_index_stats()
- This gives summary information for every index in
the database, including - Last refresh time
- Number of values indexed, and their total length
- Total length of pending and deleted values
41Outline
- Querying
- Index Maintenance
- Index Creation and Configuration
42CREATE syntax
- CREATE TEXT INDEX text-index-name
- ON owner.table-name( column-name, ... )
- IN dbspace-name
- CONFIGURATION owner.text-configuration-name
- IMMEDIATE REFRESH
- MANUAL REFRESH
- AUTO REFRESH
- EVERY integer MINUTES HOURS
43Other DDL
- Can also
- ALTER TEXT INDEX
- DROP TEXT INDEX
- TRUNCATE TEXT INDEX
- Meanings are mostly obvious. See the
documentation - UNLOAD/RELOAD will cause the reconstruction of
all text indexes - IMMEDIATEly REFRESHed indexes are built during
CREATE TEXT INDEX - MANUAL indexes are not refreshed as part of
reload unless the -g dbunload option is used - AUTO indexes get refreshed on their next refresh
cycle
44What can you index?
- Anything
- If its not a string, then its converted to one
- Some options affect conversion to strings
- DATE_FORMAT
- TIME_FORMAT
- TIMESTAMP_FORMAT
- Settings of these options are saved when the
configuration object for the index is created
45Collations
- Comparisons are done using the database
nchar_collation if any of the index columns are
NCHAR - Otherwise, they are done using the database
char_collation - This is not currently configurable
- Therefore, the case sensitivity (for example) is
the same as for the database
46Space Requirements
- A text index typically uses about 30 of the
space of the indexed text - This depends mostly on uncontrollable factors
such as the number of unique words and their
frequency distribution - Everything is stored in the database and included
in backups, checkpoints, etc. - A text index is represented by several permanent
tables - These appear in the catalog with names like
4470pos1, 4443stat2, etc. - Access is restricted they cant be queried or
changed - Ignore them
47Text Configuration objects
- Stores information that is used in the creation
and querying of text indexes - Stoplist
- Minimum and maximum word length
- Term breaker
- Stoplist examples for different Latin languages
(English, French, Spanish, etc) are installed
within the Samples directory
48Specifying a configuration
- On the CREATE statement
- CREATE TEXT INDEX ti ON t
- CONFIGURATION config1
- When not specified, one of the default
configurations is used (default_char or
default_nchar)
49DDL on Text configurations
- Create a text configuration by copying a existing
one - CREATE TEXT CONFIGURATION config1 FROM
default_char - Can also DROP or ALTER a configuration as long as
a text index isnt using it
50Altering text configurations
- Change the stoplist
- ALTER TEXT CONFIGURATION config1 STOPLIST and
the - Change the minimum or maximum word length
- ALTER TEXT CONFIGURATION MINIMUM TERM LENGTH 2
- ALTER TEXT CONFIGURATION MAXIMUM TERM LENGTH 40
- Stoplist words and those shorter/longer than the
min/max term length are not indexed
51Term breaking
- Can use GENERIC (the default) or NGRAM
- ALTER TEXT CONFIGURATION config1
- TERM BREAKER NGRAM
- ALTER TEXT CONFIGURATION config1
- TERM BREAKER GENERIC
52NGRAM term breaker
- Words are possibly-overlapping strings of
alphanumerics - Length is always equal to the configured maximum
term length (minimum term length is ignored) - Example with N3
- A quick brown fox gt qui, uic, ick, bro, row,
own, fox
53Why an NGRAM term breaker?
- Allows search of languages where words arent
separated by spaces - Allows an additional query operator
- SELECT TOP 10
- FROM news_article CONTAINS( subject, FUZZY
ianywhere ) - ORDER BY score DESC
- If N4, the above query is equivalent to iany OR
anyw OR nywh OR ywhe OR wher OR here - Usually want the highest scoring matches for a
fuzzy query
54Term breaking in queries
- Words in queries may be converted to phrases
- Example with GENERIC
- isnt gt isn t
- Example with NGRAM (N3)
- iany gt ian any
55Unsupported features
- For example
- Stemming
- Extraction of text from PDF files
- Case insensitive search in a case sensitive
database - Text indexes over materialized views
- Workarounds
- Copy the text into another column, transform it
(e.g. with case folding, stemming, or plain text
extraction) and index that - Another alternative transform the queries (e.g.
using a query on the vocabulary procedure) - Use a COMPUTED COLUMN with the SOUNDEX built-in
function to handle potential spelling errors
within search terms
56Questions