Title: Oracle Database 11g New Search Features and Roadmap
2. Oracle Database 11g New Search Features and Roadmap
- Roger Ford
- Senior Principal Product Manager
3. Contents
- Oracle's Search Products
- Oracle Text 11g New Features
- Oracle Text 11.2.0.2 New Features
  - Entity Extraction
  - Name Search
  - Result Set Interface
- Search Product Roadmap
  - Oracle Text
  - Secure Enterprise Search
4. Oracle's Search Products
- Oracle Text
  - A SQL and PL/SQL based toolkit for creating full-text search applications
  - Free with all database versions
  - Previously known as Context Option, interMedia Text
- Secure Enterprise Search
  - A complete search product based on Oracle Text capabilities
  - Crawlers for data sources such as web, email, document repositories, databases
  - End-user query application and APIs for embedding
5. Oracle Text 11g New Features
- Composite Domain Indexes and SDATA sections
  - Allows storage of structured info (e.g. numbers, dates) within the text index
  - Makes for much faster mixed queries
- Auto Lexer
  - Automatic language recognition
  - Segmentation and stemming for 32 languages
  - Context-sensitive stemming for 23 of these languages
- Off-line and time-limited index creation
  - Enables rebuild of indexes offline in quiet periods for true 24x7 operation
6. Demo: Auto Lexer
7. 11.2.0.2 New Features - Summary
- Entity Extraction
  - Find entities such as people, countries, cities, states, zip codes, phone numbers etc. in the text
  - Use the default dictionary and rules, or define your own dictionary and rules based on regular expressions
- Name Search (NDATA sections)
  - Inexact searches: copes with misspellings, segmentation errors, contractions and word reversal
  - Useful for many searches, but particularly good for names
- ResultSet Interface
  - Query request in XML and results returned as XML
  - Avoids the SQL layer and the requirement to work within SELECT semantics
8. Entity Extraction
- Identify names, places, dates, times, etc.
- Tag each occurrence with type and subtype
- Entities are defined by DICTIONARY and RULES
- Implemented by the CTX_ENTITY package
  - create_extract_policy: create a policy to which you can add extract rules
    - Choose to use/not use the built-in rules and dictionary
  - add_extract_rule: create an XML-based rule to define an entity
  - add_stop_entity: prevent defined entities from being used
  - compile: build the policy with its rules
  - extract: get an XML-based list of entities for a doc
- Can also use ctxload to load a user dictionary
9. Demo: Entity Extraction
10. Entities: built-in types
- building
- city
- company
- country
- currency
- date
- day
- email_address
- geo_political
- holiday
- location_other
- month
- non_profit
- organization_other
- percent
- person_jobtitle
- person_name
- person_other
- phone_number
- postal_address
- product
- region
- ssn
- state
- time_duration
- tod
- url
- zip_code
11. Entity Extraction Example 1: Defaults
- ctx_entity.create_extract_policy('mypolicy')
- ctx_entity.compile('mypolicy')
- ctx_entity.extract('mypolicy', mydoc, mylang, myresults)
- Output in "myresults":
    <entities>
      <entity id="0" offset="75" length="8" source="SuppliedDictionary">
        <text>New York</text>
        <type>city</type>
      </entity>
      <entity id="1" offset="55" length="16" source="SuppliedRule">
        <text>Hupplewhite Inc.</text>
        <type>company</type>
      </entity>
    </entities>
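The XML written to "myresults" is easy to consume from client code. The following is a minimal sketch, assuming the result CLOB has already been fetched into a Python string; the helper name `parse_entities` is hypothetical and the sample is the output shown on this slide.

```python
import xml.etree.ElementTree as ET

# Sample output copied from the slide (the content of "myresults").
SAMPLE = """<entities>
  <entity id="0" offset="75" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="1" offset="55" length="16" source="SuppliedRule">
    <text>Hupplewhite Inc.</text>
    <type>company</type>
  </entity>
</entities>"""


def parse_entities(xml_text):
    """Return one dict per <entity> element (hypothetical helper)."""
    entities = []
    for node in ET.fromstring(xml_text).findall("entity"):
        entities.append({
            "id": int(node.get("id")),
            "offset": int(node.get("offset")),
            "length": int(node.get("length")),
            "source": node.get("source"),
            "text": node.findtext("text"),
            "type": node.findtext("type"),
        })
    return entities


for entity in parse_entities(SAMPLE):
    print(entity["type"], entity["text"], entity["offset"], entity["source"])
```

The `source` attribute tells you whether a hit came from the supplied dictionary or from a supplied rule, which is useful when deciding whether to add stop entities or user rules.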
12. Entity Extraction Example 2: User rule
- ctx_entity.create_extract_policy('mypolicy')
- ctx_entity.add_extract_rule('mypolicy', 5,
    '<rule>
       <expression>((North|South)? America)</expression>
       <type refid="1">xContinent</type>
     </rule>')
- ctx_entity.compile('mypolicy')
- ctx_entity.extract('mypolicy', mydoc, mylang, myresults)
- Note the parentheses around the expression. refid="1" means take the first expression in parentheses, so "North America" or just "America".
- User-defined types must be prefixed with an "x", hence "xContinent"
- Output:
    <entities>
      <entity id="0" offset="75" length="13" source="UserRule">
        <text>North America</text>
        <type>xContinent</type>
      </entity>
    </entities>
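To see what refid="1" selects, the rule's expression can be tried outside the database. This is only a sketch: Python's `re` module stands in for Oracle's rule matcher, whose regular-expression dialect may differ, and `first_group` is a hypothetical helper.

```python
import re

# The expression from the user rule above. Python's `re` is an assumption
# here; it is not the engine Oracle Text uses.
RULE = re.compile(r"((North|South)? America)")


def first_group(text):
    """Return capture group 1 (what refid="1" refers to), or None."""
    match = RULE.search(text)
    return match.group(1) if match else None


print(repr(first_group("Offices across North America and Europe")))
print(repr(first_group("Made in America")))
print(repr(first_group("Based in Europe")))
```

Group 1 is the outer pair of parentheses, so it covers the whole match, while group 2 is only the optional "North" or "South" prefix. In this Python sketch the space before "America" sits inside group 1, so a bare "America" is captured with a leading space.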
13. Entity Extraction: Adding a user dictionary
- Create file ud.xml
    <dictionary>
      <entities>
        <entity> <value>Dow Jones Industrial Average</value> <type>xIndex</type> </entity>
        <entity> <value>S&amp;P 500</value> <type>xIndex</type> </entity>
      </entities>
    </dictionary>
- Create the policy with CTXLOAD (can add rules later)
    ctxload -user scott/tiger -extract -name pol1 -file ud.xml
- Compile the policy
    ctx_entity.compile('pol1')
- Results
    <entity id="69" offset="1010" length="7" source="UserDictionary">
      <text>S&amp;P 500</text>
      <type>xIndex</type>
    </entity>
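Hand-writing ud.xml makes it easy to forget XML escaping (the "&" in "S&P 500" must become "&amp;"). A small sketch of generating the file content programmatically follows; `build_user_dictionary` is a hypothetical helper, and the element names are taken from the slide above.

```python
import xml.etree.ElementTree as ET


def build_user_dictionary(entries):
    """Build dictionary XML from (value, type) pairs (hypothetical helper)."""
    root = ET.Element("dictionary")
    entities = ET.SubElement(root, "entities")
    for value, type_name in entries:
        entity = ET.SubElement(entities, "entity")
        ET.SubElement(entity, "value").text = value
        ET.SubElement(entity, "type").text = type_name
    # ElementTree escapes "&", "<" and ">" in text content automatically.
    return ET.tostring(root, encoding="unicode")


xml_text = build_user_dictionary([
    ("Dow Jones Industrial Average", "xIndex"),
    ("S&P 500", "xIndex"),
])
print(xml_text)
```

The returned string can be written to ud.xml and loaded with ctxload as shown on the slide.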
14. Entity Extraction: other stuff
- Extracting only certain entity types
    ctx_entity.extract('p1', mydoc, null, myresults, 'city,company,xContinent')
15. Name Search
- Searching names has many difficulties
  - Spelling (steven, stephen)
  - Alternate names (fred, alfred; chuck, charles)
  - Transcription (copying from spoken to written form)
  - Transliteration (copying from one writing system to another)
  - Segmentation (Mary Jane, Maryjane)
  - First, middle, and last name classification
- Name search does intelligent matching across all these issues
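To make three of these difficulties concrete (spelling, alternate names, segmentation), here is a toy matcher. It is an illustration only and not the algorithm Oracle Text uses: the alternate-name table, the 0.7 threshold and the function names are all assumptions made for this sketch.

```python
from difflib import SequenceMatcher

# Tiny, made-up alternate-name table; a real system would use a thesaurus.
ALTERNATES = {"fred": "alfred", "chuck": "charles"}


def normalize(name):
    """Lowercase, drop whitespace (segmentation), map known alternates."""
    key = "".join(name.lower().split())
    return ALTERNATES.get(key, key)


def names_match(a, b, threshold=0.7):
    """Return True when the normalized names are similar enough."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


for pair in [("steven", "stephen"), ("chuck", "charles"),
             ("Mary Jane", "Maryjane"), ("steven", "charles")]:
    print(pair, names_match(*pair))
```

Transcription, transliteration and name-part classification need language-specific knowledge and are out of scope for a sketch this small.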
16. Demo: Name Search
17. NDATA section type
- Basic implementation for name search
- Limitations
  - 511 characters
  - 255 whitespace-delimited terms
  - No offset information, therefore no
    - Highlighting / Markup
    - NEAR or phrase search with NDATA
- Uses WORDLIST preference attributes
  - NDATA_ALTERNATE_SPELLING
  - NDATA_BASE_LETTER
  - NDATA_THESAURUS (for alternate names; default thesaurus provided)
  - NDATA_JOIN_PARTICLES (list such as 'de:du:mc:mac')
- Query Syntax
  - NDATA(fieldname, search terms [, order] [, proximity])
18. Result Set Interface
- Some queries are difficult to express in SQL
  - e.g. "Give me the top 5 hits in each category"
- Result set interface uses a simple text query and an XML result set descriptor
- Hitlist is returned in XML according to the result set descriptor
- Uses SDATA sections for
  - Grouping
  - Counting
19. Result Set Example: Query
- ctx_query.result_set('docidx', 'oracle',
    '<ctx_result_set_descriptor>
       <count/>
       <hitlist start_hit_num="1" end_hit_num="2" order="pubDate desc, score desc">
         <score/> <rowid/>
         <sdata name="author"/>
         <sdata name="pubDate"/>
       </hitlist>
       <group sdata="pubDate">
         <count/>
       </group>
       <group sdata="author">
         <count/>
       </group>
     </ctx_result_set_descriptor>', rs)
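When the descriptor is assembled by an application rather than typed by hand, building it with an XML library keeps it well-formed. The sketch below reproduces the descriptor from this slide; `build_descriptor` and its parameters are hypothetical, and only the elements and attributes shown on the slide are used.

```python
import xml.etree.ElementTree as ET


def build_descriptor(start, end, order, sdata_fields, group_fields):
    """Build a result set descriptor string (hypothetical helper)."""
    root = ET.Element("ctx_result_set_descriptor")
    ET.SubElement(root, "count")  # ask for the total hit count
    hitlist = ET.SubElement(root, "hitlist", {
        "start_hit_num": str(start),
        "end_hit_num": str(end),
        "order": order,
    })
    ET.SubElement(hitlist, "score")
    ET.SubElement(hitlist, "rowid")
    for name in sdata_fields:  # SDATA values to return with each hit
        ET.SubElement(hitlist, "sdata", {"name": name})
    for name in group_fields:  # one per-value count group per SDATA section
        group = ET.SubElement(root, "group", {"sdata": name})
        ET.SubElement(group, "count")
    return ET.tostring(root, encoding="unicode")


descriptor = build_descriptor(1, 2, "pubDate desc, score desc",
                              ["author", "pubDate"], ["pubDate", "author"])
print(descriptor)
```

The resulting string would be passed as the third argument of ctx_query.result_set, in place of the literal shown above.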
20. Result Set Output
    <ctx_result_set>
      <hitlist>
        <hit>
          <score>3</score><rowid>AAAPoEAABAAAMWsAAC</rowid>
          <sdata name="AUTHOR">John</sdata>
          <sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
        </hit>
        <hit>
          <score>3</score><rowid>AAAPoEAABAAAMWsAAG</rowid>
          <sdata name="AUTHOR">John</sdata>
          <sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
        </hit>
      </hitlist>

      <count>100</count>

21. Result Set Output - Continued
      <groups sdata="PUBDATE">
        <group value="2001-01-01 00:00:00"><count>25</count></group>
        <group value="2001-01-02 00:00:00"><count>50</count></group>
        <group value="2001-01-03 00:00:00"><count>25</count></group>
      </groups>

      <groups sdata="AUTHOR">
        <group value="John"><count>50</count></group>
        <group value="Mike"><count>25</count></group>
        <group value="Steve"><count>25</count></group>
      </groups>

    </ctx_result_set>
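A client can turn this result set into a hit list plus facet counts in a few lines. The following is a sketch, assuming the result set CLOB has been fetched into a Python string; `summarize` is a hypothetical helper and the sample is the output from the two slides above.

```python
import xml.etree.ElementTree as ET

# Result set output copied from the slides.
SAMPLE = """<ctx_result_set>
  <hitlist>
    <hit>
      <score>3</score><rowid>AAAPoEAABAAAMWsAAC</rowid>
      <sdata name="AUTHOR">John</sdata>
      <sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
    </hit>
    <hit>
      <score>3</score><rowid>AAAPoEAABAAAMWsAAG</rowid>
      <sdata name="AUTHOR">John</sdata>
      <sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
    </hit>
  </hitlist>
  <count>100</count>
  <groups sdata="PUBDATE">
    <group value="2001-01-01 00:00:00"><count>25</count></group>
    <group value="2001-01-02 00:00:00"><count>50</count></group>
    <group value="2001-01-03 00:00:00"><count>25</count></group>
  </groups>
  <groups sdata="AUTHOR">
    <group value="John"><count>50</count></group>
    <group value="Mike"><count>25</count></group>
    <group value="Steve"><count>25</count></group>
  </groups>
</ctx_result_set>"""


def summarize(xml_text):
    """Return (total, hits, groups) from a result set (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    total = int(root.findtext("count"))
    hits = [{
        "score": int(hit.findtext("score")),
        "rowid": hit.findtext("rowid"),
        "sdata": {s.get("name"): s.text for s in hit.findall("sdata")},
    } for hit in root.findall("hitlist/hit")]
    groups = {
        g.get("sdata"): {item.get("value"): int(item.findtext("count"))
                         for item in g.findall("group")}
        for g in root.findall("groups")
    }
    return total, hits, groups


total, hits, groups = summarize(SAMPLE)
print(total, [h["rowid"] for h in hits], groups["AUTHOR"])
```

The per-value counts under each `groups` element are what a faceted navigation panel would display next to the hit list.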
22. Preview
23. Roadmap: merging Text and SES
- Oracle Text: Full Control
  - Fine-grained Index Options
  - Data Storage Options
  - Lexer Options
  - Stoplists
  - Use existing database
  - RAC, Exadata
- Secure Enterprise Search: Full Featured
  - Built in database and mid-tier
  - Crawlers for many sources
  - Simple Query Interface
  - End user GUI / API
  - Embedded security
24. Coming Search Features
- Natural Language Processing enhancements
  - Ontology based classification
  - Question answering
- Automatic Partitioning
- Query load balancing
- Full support for faceted navigation (MVDATA sections)
- Functional completeness for Result Set Interface
- Result Iterator streaming support
- Parallel Query
- Replication Support
  - Golden Gate / Logical Standby / Streams
- Operator improvements
  - NEAR2: best query in one operator
  - MNOT: mild not, e.g. YORK mnot NEW YORK
  - Nested near
- Substring index and query performance improvements
25. Coming Search Features - Continued
- Multiple enhancements to query performance
  - BIGIO: leverages Secure Files CLOBs
  - Automatic optimization of indexes with stage index
  - Two level index: keep common search terms in memory
- Partition maintenance without reindexing
- Off-load filtering from database server
- Section specific index options
  - Choose different options, e.g. language, stopwords, PRINTJOINS for each section
- Regular expression based stopwords
- Forward Index
  - Hugely improved performance for highlighting, snippets
- PDF Native Highlighting
- Unlimited SDATA, MDATA and Field Sections
26. The preceding is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into
any contract. It is not a commitment to deliver
any material, code, or functionality, and should
not be relied upon in making purchasing
decisions. The development, release, and timing
of any features or functionality described for
Oracle's products remains at the sole discretion
of Oracle.