IAW602 Fulltext Search in SQL Anywhere - PowerPoint PPT Presentation

1 / 56
About This Presentation
Title:

IAW602 Fulltext Search in SQL Anywhere

Description:

Searching for hyphenated phrases is particularly problematic ... Stoplist examples for different Latin languages (English, French, Spanish, etc) ... – PowerPoint PPT presentation

Number of Views:71
Avg rating:3.0/5.0
Slides: 57
Provided by: iablog1
Category:

less

Transcript and Presenter's Notes

Title: IAW602 Fulltext Search in SQL Anywhere


1
IAW602 Fulltext Search in SQL Anywhere
  • Glenn Paulley
  • Director, Engineering
  • http//iablog.sybase.com/paulley

2
One-slide overview
  • Text indexes
  • CREATE TEXT INDEX tix_news_article ON
  • news_article( subject )
  • Text queries
  • SELECT
  • FROM news_article
  • WHERE CONTAINS( subject, ianywhere )

3
Contrast with other text search
  • LIKE predicates
  • Can use non-text index for word
  • Search for patterns rather than words
  • REGEXP, SIMILAR predicates
  • New in version 11
  • Can never use any kind of index
  • Search for patterns rather than words
  • Unlike the three predicates above, CONTAINS
    predicates (or CONTAINS table expression) require
    the existence of a full-text index

4
Outline
  • Querying
  • Index Maintenance
  • Index Creation and Configuration

5
Two forms of CONTAINS
  • Predicate
  • SELECT
  • FROM news_article
  • WHERE CONTAINS( subject, ianywhere )
  • FROM clause
  • SELECT
  • FROM news_article CONTAINS( subject, ianywhere
    )
  • FROM version returns an implicit score column
  • Think of this as a table expression that joins
    the rows of news_article with their corresponding
    relevance scores

6
Scoring results
  • Higher score gt more relevant result
  • To get the most relevant results
  • SELECT TOP 10
  • FROM news_article CONTAINS( subject, ianywhere)
  • ORDER BY score DESC

7
Scoring results (continued)
  • Weighting is with BM25, a commonly-used
    information retrieval formula
  • Probabilistic weighting scheme that uses the
    inverse frequency of the query terms appearing in
    each document
  • terms that appear rarely in a document are more
    meaningful than if they appear more often.
  • The weighting scheme disregards any implied
    relationship between query terms within a single
    row, for example due to their relative proximity.
  • All else being equal, higher scores result from
  • Shorter matches
  • Less common words
  • More repetitions of a word

8
Scoring results (continued)
  • N number of documents (rows) in the collection
  • q query terms
  • nq vocabulary of the collection being searched
  • f(q,D) frequency of q terms in the document D
  • D - length of the document D
  • ADL average document length
  • k1 1.2, b 0.75

9
FROM clause syntax
  • Can use a different correlation name for the
    table columns and the score column
  • SELECT TOP 10 na.
  • FROM news_article AS na
  • CONTAINS( subject, ianywhere ) AS cn
  • ORDER BY cn.score
  • Permits multiple CONTAINS procedures per query,
    each with a different correlation name for
    identifying the score column

10
Multiple term queries
  • Separate by spaces
  • SELECT FROM news_article
  • WHERE CONTAINS( subject, ianywhere solutions
    developer)
  • A row matches if it contains all three words
  • Equivalently, use AND
  • CONTAINS( subject, ianywhere AND solutions AND
    developer )

11
Multiple-column queries
  • Search more than one column at once
  • SELECT FROM news_article
  • CONTAINS( subject, body, ianywhere solutions )
  • Matching is as if the two columns were
    concatenated
  • A matching row may have both words in one column
    or one word in one column and one word in the
    other column
  • Both columns must be covered by a (text) index

12
Prefix search
  • A word in a query can be a prefix
  • SELECT FROM news_article
  • CONTAINS( subject, com )
  • Matches the words compact, command,
    compatibility, comments, etc.

13
Phrase search
  • Use double quotes in the query
  • SELECT FROM news_article
  • CONTAINS( subject, text compression )
  • The prefix operator can be used on words in a
    phrase
  • CONTAINS( subject, text compress )
  • These forms of searching can be executed
    efficiently
  • position information is stored in the text index
  • no need to scan the raw text

14
OR operator
  • Use OR to get some, but not necessarily all, of
    the words
  • SELECT
  • FROM news_article
  • CONTAINS( subject, 8 OR 9 )
  • Score will be higher for matches with more words

15
AND NOT operator
  • Use AND NOT to exclude rows matching a word
  • SELECT
  • FROM news_article
  • WHERE CONTAINS( subject, text AND NOT file )

16
Mixing operators
  • Use brackets to make the meaning clear
  • SELECT
  • FROM news_article
  • WHERE CONTAINS( subject, (full AND text) OR
    fulltext )

17
NEAR operator
  • Use to find words close together
  • SELECT FROM news_article
  • WHERE CONTAINS( subject, dbisql NEAR2 error )
  • Matches dbisql within 2 words of error
  • Distance is optional it defaults to 10

18
Alternate syntax
  • Some operators have equivalent symbols
  • AND
  • OR
  • AND NOT -
  • NEAR

19
Keywords and special characters
  • Keywords AND, OR, NOT, NEAR, FUZZY
  • Need to quote these as a single word phrase to
    search for them AND OR
  • Special characters -
  • Be careful of special characters in queries,
    especially hyphenated words getting treated as
    AND NOT
  • Searching for hyphenated phrases is particularly
    problematic
  • Search for the hyphenated word as a phrase with
    the hyphen omitted, eg.
  • CONTAINS( subject, body, 'lichtner and
    checkpoint only ' )

20
Vocabulary
  • A procedure is available that returns the list of
    all unique words in an index
  • One possible use is to analyze the corpus for the
    occurrence of stop words
  • Another is to construct queries with patterns
    that arent directly supported
  • SELECT LIST( term, OR )
  • FROM sa_text_index_vocab( ti_news_article,
  • news_article, dba )
  • WHERE term LIKE anywhere

21
Fast first matches
  • The OPTIMIZATION_GOAL option setting applies as
    usual
  • If you really want the first few matches quickly,
    then use first-row

22
Fast first matches (continued)
  • Note that TOP N ORDER BY score DESC queries
    currently compute scores for all potential
    matches and sort them
  • This cannot be instantaneous if there are many
    potential matches
  • A sequential scan is not an option, because
    term existence and position is contained within
    the text index
  • Hence the computation may involve joins of very
    large intermediate results
  • A workaround is to try the query first as a
    predicate to find if the number of potential
    matches is too high

23
Querying simple views
  • Can query a simple view with CONTAINS
  • Simple means that the view is a single table with
    a text index
  • The view cant use aggregates, window functions,
    etc.
  • Useful if you create simple views for access
    control

24
Integration with SQL query processing
  • Operations such as AND, OR, AND NOT are merged
    into the main query as joins between the
    referenced table(s) and the text index(es)
  • Complete statement is optimized as a unit on a
    cost basis
  • Can examine the graphical plan to see where the
    time is going

25
Integration with SQL query processing
26
Outline
  • Querying
  • Index Maintenance
  • Index Creation and Configuration

27
Refresh Modes
  • Three types of text index refresh
  • AUTO
  • MANUAL
  • IMMEDIATE (default)
  • Different characteristics with respect to
  • Ease of use
  • Update efficiency
  • Consistency and recency

28
IMMEDIATE REFRESH
  • The default if nothing is specified for a create
    text index
  • Ease of use the simplest choice
  • index updates happen automatically when the table
    is changed
  • Update efficiency the most expensive choice
  • change a row of the table, and every word in the
    row causes a change to the index
  • One implication is that shorter rows are cheaper

29
IMMEDIATE REFRESH (continued)
  • Consistency and freshness
  • Always consistent with the most recent version of
    the data
  • Anomalies possible with concurrent updates and
    isolation levels less than 3
  • These anomalies are identical to those
    encountered in ordinary SQL applications with
    lower isolation levels (eg. phantom rows).
  • Almost full support for snapshot isolation
  • Scores may vary slightly when there are
    concurrent updates from multiple transactions

30
MANUAL REFRESH
  • Specify at creation
  • CREATE TEXT INDEX ti ON
  • news_article( subject)
  • MANUAL REFRESH
  • In the main, MANUAL REFRESH is a performance
    optimization
  • Ease of use
  • More work than immediate
  • Index is not refreshed at creation
  • Every refresh must be done manually
  • REFRESH TEXT INDEX ti ON news_article

31
MANUAL REFRESH (continued)
  • Update efficiency
  • Changes to the table update change tracking
    information in the text index
  • Changes are tracked merely by noting the RIDs of
    the modified base rows
  • No changes are made to the text index itself
  • This is about the same cost as updating a regular
    index on an integer or timestamp column
  • Refreshes are batched and may use much less total
    time than doing many immediate, individual
    updates

32
MANUAL REFRESH (continued)
  • Consistency and Freshness
  • Rows that have been added or changed since the
    last refresh are never returned by queries
  • When a row is changed, it disappears immediately
    from all subsequent queries
  • This is not affected by isolation level
  • It is also not affected by ROLLBACKs or COMMITs
  • On REFRESH, either
  • Entire index is rebuilt from scratch due to
    number of changes, or
  • Individual rows are re-indexed, in a batch,
    using the rows identified in the change list

33
AUTO REFRESH
  • Similar to MANUAL REFRESH
  • CREATE TEXT INDEX ti on
  • news_article( subject)
  • AUTO REFRESH
  • Can optionally specify schedule
  • AUTO REFRESH EVERY integer MINUTES HOURS
  • EVERY 1 HOUR is used if omitted
  • If alternate scheduling is desired, can establish
    an event to perform the REFRESH on an arbitrary
    schedule

34
AUTO REFRESH (continued)
  • Ease of Use
  • Refreshes happen automatically at the specified
    time interval
  • They may also happen more often if a significant
    portion of the table is changed
  • Can use the REFRESH statement to manually refresh
    an AUTO index before the next scheduled refresh
  • Update Efficiency
  • Same as MANUAL REFRESH
  • Consistency and Freshness
  • Same as MANUAL REFRESH

35
Refresh Isolation level
  • Refresh isolation level is not meaningful for
    IMMEDIATE REFRESH text indexes
  • The first refresh happens at creation, and the
    table is locked exclusively during this (same as
    for non-text indexes)
  • By default, all refreshes for MANUAL and AUTO
    indexes are at isolation level 0

36
Refresh Isolation Level (continued)
  • Can specify other isolation levels on a refresh
  • REFRESH TEXT INDEX text-index-name
  • ON owner.table-name
  • WITH ISOLATION LEVEL isolation-level
  • EXCLUSIVE MODE
  • SHARE MODE
  • EXCLUSIVE and SHARE mode are the same as for
    materialized views and may occasionally be useful
  • Other isolation levels will rarely be useful
    given that queries do not return changed rows

37
Incremental refresh
  • Refreshes can be incremental or full
  • By default the choice is automatic, based on the
    number of deleted entries in the index
  • Can force one or the other if you need more
    predictable refresh times
  • REFRESH TEXT INDEX text-index-name
  • ON owner.table-name
  • FORCE BUILD INCREMENTAL

38
Cost of refresh
  • Cost of a full refresh is comparable to creating
    a normal index on the equivalent amount of data
    (counting just the size of the indexed columns)
  • Incremental refresh is cheaper, depending on how
    much of the table has changed

39
Online refresh
  • All refreshes are online
  • This means that you can continue to query or
    update the indexed table during the refresh
  • If a complete rebuild, a clone of the text index
    is constructed
  • Permits uninterrupted access to the existing text
    index by queries while the next index is built
  • Locks created by choosing EXCLUSIVE MODE, SHARE
    MODE, or an isolation level other than 0 can
    interfere with this operation

40
Finding refresh status
  • Use sa_text_index_stats()
  • This gives summary information for every index in
    the database, including
  • Last refresh time
  • Number of values indexed, and their total length
  • Total length of pending and deleted values

41
Outline
  • Querying
  • Index Maintenance
  • Index Creation and Configuration

42
CREATE syntax
  • CREATE TEXT INDEX text-index-name
  • ON owner.table-name( column-name, ... )
  • IN dbspace-name
  • CONFIGURATION owner.text-configuration-name
  • IMMEDIATE REFRESH
  • MANUAL REFRESH
  • AUTO REFRESH
  • EVERY integer MINUTES HOURS

43
Other DDL
  • Can also
  • ALTER TEXT INDEX
  • DROP TEXT INDEX
  • TRUNCATE TEXT INDEX
  • Meanings are mostly obvious. See the
    documentation
  • UNLOAD/RELOAD will cause the reconstruction of
    all text indexes
  • IMMEDIATEly REFRESHed indexes are built during
    CREATE TEXT INDEX
  • MANUAL indexes are not refreshed as part of
    reload unless the -g dbunload option is used
  • AUTO indexes get refreshed on their next refresh
    cycle

44
What can you index?
  • Anything
  • If its not a string, then its converted to one
  • Some options affect conversion to strings
  • DATE_FORMAT
  • TIME_FORMAT
  • TIMESTAMP_FORMAT
  • Settings of these options are saved when the
    configuration object for the index is created

45
Collations
  • Comparisons are done using the database
    nchar_collation if any of the index columns are
    NCHAR
  • Otherwise, they are done using the database
    char_collation
  • This is not currently configurable
  • Therefore, the case sensitivity (for example) is
    the same as for the database

46
Space Requirements
  • A text index typically uses about 30 of the
    space of the indexed text
  • This depends mostly on uncontrollable factors
    such as the number of unique words and their
    frequency distribution
  • Everything is stored in the database and included
    in backups, checkpoints, etc.
  • A text index is represented by several permanent
    tables
  • These appear in the catalog with names like
    4470pos1, 4443stat2, etc.
  • Access is restricted they cant be queried or
    changed
  • Ignore them

47
Text Configuration objects
  • Stores information that is used in the creation
    and querying of text indexes
  • Stoplist
  • Minimum and maximum word length
  • Term breaker
  • Stoplist examples for different Latin languages
    (English, French, Spanish, etc) are installed
    within the Samples directory

48
Specifying a configuration
  • On the CREATE statement
  • CREATE TEXT INDEX ti ON t
  • CONFIGURATION config1
  • When not specified, one of the default
    configurations is used (default_char or
    default_nchar)

49
DDL on Text configurations
  • Create a text configuration by copying a existing
    one
  • CREATE TEXT CONFIGURATION config1 FROM
    default_char
  • Can also DROP or ALTER a configuration as long as
    a text index isnt using it

50
Altering text configurations
  • Change the stoplist
  • ALTER TEXT CONFIGURATION config1 STOPLIST and
    the
  • Change the minimum or maximum word length
  • ALTER TEXT CONFIGURATION MINIMUM TERM LENGTH 2
  • ALTER TEXT CONFIGURATION MAXIMUM TERM LENGTH 40
  • Stoplist words and those shorter/longer than the
    min/max term length are not indexed

51
Term breaking
  • Can use GENERIC (the default) or NGRAM
  • ALTER TEXT CONFIGURATION config1
  • TERM BREAKER NGRAM
  • ALTER TEXT CONFIGURATION config1
  • TERM BREAKER GENERIC

52
NGRAM term breaker
  • Words are possibly-overlapping strings of
    alphanumerics
  • Length is always equal to the configured maximum
    term length (minimum term length is ignored)
  • Example with N3
  • A quick brown fox gt qui, uic, ick, bro, row,
    own, fox

53
Why an NGRAM term breaker?
  • Allows search of languages where words arent
    separated by spaces
  • Allows an additional query operator
  • SELECT TOP 10
  • FROM news_article CONTAINS( subject, FUZZY
    ianywhere )
  • ORDER BY score DESC
  • If N4, the above query is equivalent to iany OR
    anyw OR nywh OR ywhe OR wher OR here
  • Usually want the highest scoring matches for a
    fuzzy query

54
Term breaking in queries
  • Words in queries may be converted to phrases
  • Example with GENERIC
  • isnt gt isn t
  • Example with NGRAM (N3)
  • iany gt ian any

55
Unsupported features
  • For example
  • Stemming
  • Extraction of text from PDF files
  • Case insensitive search in a case sensitive
    database
  • Text indexes over materialized views
  • Workarounds
  • Copy the text into another column, transform it
    (e.g. with case folding, stemming, or plain text
    extraction) and index that
  • Another alternative transform the queries (e.g.
    using a query on the vocabulary procedure)
  • Use a COMPUTED COLUMN with the SOUNDEX built-in
    function to handle potential spelling errors
    within search terms

56
Questions
  • ?
Write a Comment
User Comments (0)
About PowerShow.com