Title: Oracle Text Search saves your time
1Oracle Text Search saves your time
Anna Suwalska European Organization for Nuclear
Research - Geneva
OracleWorld Paris 2003
2Content
CERN Engineering Data Management System at
CERN Oracle Text How we profit from this
technology Conclusion
3Content
CERN
4CERN - European Organization for Nuclear Research
- The worlds largest particle physics research
laboratory - Founded in 1954, CERN has today 20 member states
- 2400 staff
- Over 6500 scientists come here to use research
facilities - 500 universities, over 80 nationalities
- CERN explores what matter is made of, and what
forces hold it together - WWW was born here
5LHC - The Large Hadron Collider Project
6LHC - Cryodipole
7Content
EDMS Engineering Data Management System
8EDMS - Engineering Data Management System
EDMS Portal
EDMS Common layer
Axalant
MP5
Other DBs
Design Data Documents and Drawings
Asset tracking Work management
9EDMS - Engineering Data Management System
Managing
10EDMS - Engineering Data Management System
Managing
Structures
Complete life-cycle for a single/compound
documents.
11EDMS - Engineering Data Management System
Managing
Structures
Complete life-cycle for a single/compound
documents.
Document versioning
12EDMS - Engineering Data Management System
Managing
Structures
Complete life-cycle for a single/compound
documents.
Document versioning
Document approval processes (comments collector)
13EDMS - Engineering Data Management System
Managing
Structures
Complete life-cycle for a single/compound
documents.
Document versioning
Document approval processes (comments collector)
Assemblies
14EDMS - Engineering Data Management System
Managing
Structures
Complete life-cycle for a single/compound
documents.
Document versioning
Document approval processes (comments collector)
Assemblies
Equipment workflow, data
15EDMS - Engineering Data Management System
Managing
Structures
Complete life-cycle for a single/compound
documents.
Document versioning
Document approval processes (comments collector)
Assemblies
Equipment workflow, data
Installation (jobs, locations, etc..)
16EDMS mandate
Manage a full description of the LHC projects
engineering data over its lifetime (25 years)
Installation
Design
Operation
Dismantling
A full description of the machine and its
components through their lifecycle must be
constantly available for all concerned parties
Support and coordinate engineering work /
information / data workflow
Help tracing solutions to all problems occurring
in the machine
Establish a knowledge transfer evolving staff,
many short time visitors
Operation
17Oracle Text our choice
Our needs
18Oracle Text our choice
Our needs
Index metadata files First line search is
done on meta data, however the possibility to
index files is essential
19Oracle Text our choice
Our needs
Index metadata files First line search is
done on meta data, however the possibility to
index files is essential
Bi-lingual Official CERN languages are English
and French. We have to support both
20Oracle Text our choice
Our needs
Index metadata files First line search is
done on meta data, however the possibility to
index files is essential
Bi-lingual Official CERN languages are English
and French. We have to support both
Performance Response time is very important
21Oracle Text our choice
Our needs
Index metadata files First line search is
done on meta data, however the possibility to
index files is essential
Bi-lingual Official CERN languages are English
and French. We have to support both
Performance Response time is very important
Simplicity
Simple for users Simple to develop Simple to
maintain
22Oracle Text our choice
Our needs
Index metadata files First line search is
done on meta data, however the possibility to
index files is essential
?
Oracle Text supports most of the document formats
Bi-lingual Official CERN languages are English
and French. We have to support both
?
Oracle text supports 39 languages
Performance Response time is very important
?
Very efficient for searches within big collection
of data
Simplicity
Simple for users Simple to develop Simple to
maintain
Results with scoring methodology to help navigate
through a result Standard SQL statements Easy to
maintain with ALTER INDEX or CTX_DDL packages
?
23Oracle Text our choice
Our needs
Index metadata files First line search is
done on meta data, however the possibility to
index files is essential
?
Oracle Text supports most of the document formats
Bi-lingual Official CERN languages are English
and French. We have to support both
?
Oracle text supports 39 languages
Performance Response time is very important
?
Very efficient for searches within big collection
of data
Simplicity
Simple for users Simple to develop Simple to
maintain
Results with scoring methodology to help navigate
through a result Standard SQL statements Easy to
maintain with ALTER INDEX or CTX_DDL packages
?
?
Oracle text comes as an option in RDBMS - no
additional costs
24Content
Oracle Text
25Oracle Text
Oracle text
Uses standard SQL
Enables the building of a Text Query Application
and a Document Classification Application
Takes care of
indexing searching word and theme viewing text
26CONTEXT Index Creation
CREATE INDEX index_name ON table_name(column_name)
INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS(parameter
s string)
datastore datastore_pref filter filter_pref
charset column charset_column_name format
column format_column_name lexer lexer_pref
language column language_column_name
wordlist wordlist_pref storage storage_pref
stoplist stoplist section group
section_group memory memsize populate
nopopulate
27Types of indexes
Index
Query Operator
Characteristics
Used for
CONTEXT
CLOB, BLOB, BFILE, CHAR, VARCHAR2, XML On text
column Most complete of all 3 types.
CONTAINS
Large coherent documents
CTXCAT
CHAR, VARCHAR2 Combined index on a text column
and one or more other columns. Transactional
no need for synchronizing when DML. Creating can
be longer because of the sub-indexes. Supports
INDEX SET, LEXER, STOPLIST, STORAGE,
WORDLIST Has its own query language.
CATSEARCH
For indexing small text fragments and related
information. To improve mixed query performance
CTXRULE
MATCHES
Building a document classification application
VARCHAR2, CLOB On column containing a set of
queries. Supports LEXER (only BASIC) Does not
support number of operators.
28Index Maintenance Optimization
ALTER INDEX index_name REBUILD ONLINEPARAMETER
S(parameters string)
ALTER INDEX cdi_text_ctx REBUILD ONLINE
PARAMETERS(optimize fast) ALTER INDEX
cdi_text_ctx REBUILD ONLINE PARAMETERS(optimize
full maxtime10) ALTER INDEX cdi_text_ctx
REBUILD ONLINE PARAMETERS(optimize full)
29Index Maintenance Optimization
ALTER INDEX index_name REBUILD ONLINEPARAMETER
S(parameters string)
CTX_DDL package
- CTX_DDL.OPTIMIZE_INDEX
- CTX_DDL.SYNC_INDEX
30DML processing
INSERT A new row inserted in DRPENDING queue,
not available for query before synchronization
UPDATE Existing ROWID is placed in DRPENDING,
neither new nor old content is available for
query before synchronization
DELETE The row is immediately unavailable for
query(marked as invalid), but only removed when
optimization complete
CTX_USER_PENDING (CTX_PENDING) view To check
records waiting for synchronization
31Scoring
To calculate a relevance score for a returned
document in a word query, Oracle uses an inverse
frequency algorithm based on Salton's formula.
Inverse frequency scoring assumes that
frequently occurring terms in a document set are
noise terms, and so these terms are scored lower.
For a document to score high, the query term must
occur frequently in the document but infrequently
in the document set as a whole. Oracle
Text Reference, Release 9.0.1
Example
In data set M number of occurrences of TERM1, N
number of occurrences of TERM2 M N Document
having equal (n-occurrences) of TERM1 and TERM2
Result
SCORE for querying TERM1 TERM2
32Query Operators
Boolean
OR NOT MINUS AND
lhc AND magnet AND NOT cryogenic
Linguistics
SYNonym ABOUT STEM Translation Term Broader,
Narrower, Preferred, Related Term
SYN (science) ABOUT (particle)
Others
FUZZY NEAR SOUNDEX WITHIN SQE
begin ctx_query.store_sqe ( particle , atom,
molecule proton ) end SQE (particle)
33CTX packages
CTX_ADMIN
Administer servers and the data dictionary (only
ctxsys user)
Create and manage the preferences, section
groups, stoplists, manage indexes
CTX_DDL
CTX_DOC
Document presentation features (only for CONTAINS
indexes)
CTX_OUTPUT
Manage logs for the indexes
Generating query feedback, counting hits, and
creating SQE (stored query expressions)
CTX_QUERY
CTX_THES
Manage and browse thesaurus
34Content
How we profit from this technology
35EDMS metadata index preferences
36EDMS search for both languages
accelerateur lhc méthode
Version 1.5
37EDMS metadata index preferences
38Escaping characters to query them
To be able to query on reserved words or symbols
such as minus, - , near they must be
escaped. There are 2 methods to escape the
character, using or \ . When
using We had to hardcode it for each symbol and
word. A standard dictionary table with the
reserved characters would be useful.
39EDMS metadata index preferences
40EDMS Index Maintenance Optimization
Meta data
Files
Environment
Hardware system Two node cluster based on
two Sun SPARC 450, running Solaris 2.6 Sun
Cluster 2.1 RDBMS 8.1.7.4 500 MB SGA
size 60-80 concurrent users (during working hours)
41EDMS Index Maintenance Optimization
Meta data
Files
Environment
Index synchronization every 10 min, takes a few
seconds
PROCEDURE rebuild_metedata_ctx IS BEGIN EXECUTE
IMMEDIATE ('alter index CDI_TEXT_CTX rebuild
online parameters(' ' sync ' ')') END
Index optimization every weekend, takes 30 min
PROCEDURE optimize_metedata_ctx IS BEGIN
EXECUTE IMMEDIATE ('alter index CDI_TEXT_CTX
rebuild online parameters(' ' optimize full'
')') END
42EDMS Index Maintenance Optimization
Meta data
Files
Environment
Synchronize every 24h ? Optimize (fast, full)
every month?
43Scoring
SQL SELECT c_id,score(10) FROM
compound_doc_info WHERE
CONTAINS(c_text,lhc,10)0 AND
c_id 1738594907 C_ID
SCORE(10) ------------------ ----------------
1738594907 9
SQL SELECT c_id,score(10) FROM
compound_doc_info WHERE
CONTAINS(c_text,evolution,10)0
AND c_id 1738594907 C_ID
SCORE(10) ------------------ ----------------
1738594907 15
44Using the thesaurus
DECLARE xtab ctx_thes.exp_tab . BEGIN
ctxsys.ctx_thes.rt(xtab,p_term,edms_thes)
FOR i IN 1..xtab.COUNT LOOP IF xtab(i).xrel
C_RELETED_TERM THEN htp.anchor (
L_DOC_SEARCH '?cookie'
cookie 'p_search_type'
p_search_type 'p_free_text'
LOWER(xtab(i).xphrase)
,LOWER(xtab(i).xphrase) ) END
IF END LOOP END
Propose the RT (Related Term) if nothing found
with the original term(s).
Would be nice to have a spell checker corrector,
using existing tokens.
45Using the thesaurus - example
46Using the thesaurus - example
47Querying with Oracle Text versus standard SQL
WHERE CONTAINS (c_text, p_free_text) 0
Total 83 ms
48Querying with Oracle Text versus standard SQL
WHERE UPPER(c_text) LIKE 'UPER(p_free_text)
Total 03.98s
49Querying with Oracle Text versus standard SQL
p_free_text is a single word or an exact
sentence
Tool
Oracle Text
Standard SQL
Characteristics
Underperforming.
Fast.
Statement
WHERE UPPER(c_text) LIKE 'UPER(p_free_text)
WHERE CONTAINS (c_text,p_free_text) 0
Time
83 ms (821ms)
03.98s (39.14s)
Tests done with TOra 1.3.8 (in parentheses
repeated 10x)
50Querying with Oracle Text versus standard SQL
p_free_text is an expression with OR operator
Tool
Oracle Text
Standard SQL
Characteristics
Underperforming.
Fast.
Statement
WHERE ( UPPER(c_text) LIKE 'UPPER(p_text_1)
OR UPPER(c_text) LIKE 'UPER(p_text_2)
)
WHERE CONTAINS (c_text,p_free_text) 0
Time
103ms (0103 )
0909 (122.09)
Tests done with TOra 1.3.8 (in parentheses
repeated 10x)
51Querying with Oracle Text
Total 02.36s
52Querying with Oracle Text
Total 02.36s
Total 02.31s
53Querying with Oracle Text
Total 02.36s
Total 02.31s
Total 02.25s
54Mixed queries
LHC-Q-EI-0002 is a document number Search is
done on 1) the document number column using a
standard index 2) the context index
55Indexing various file formats
File_format column stores TEXT or BINARY
value. INSO_FILTER ignores all with TEXT in
the format column.
NULL_FILTER for plain text and HTML formats
Formatted documents such as Microsoft Word, PDF
has to be filtered
56Some indexing problems we have
The creation of an Intermedia Text Index (with
URL_DATASTORE) is failing with ORA-4030 out of
process memory.
In June 2002 this was identified to be a memory
leak fixed in 8.1.7.4.0 We observe now the same
ORA-4030 error with 8.1.7.4.0 OPS
After successful indexing of the PDF files (using
INSO_FILTER), some are indexed only partially
without any error being created in the error
table.
Result very difficult to verify if the document
is correctly indexed.
57Content
Conclusion
58Conclusion
Oracle text is worth using because
Performance
Simplicity of the code (integrated with Oracle,
no external search engine)
Simplicity of the index maintenance
Functional features bi-lingual support, special
query operators, thesaurus
Document presentation features
59Thank you
EDMS SERVICE https//edms.cern.ch This
presentation https//edms.cern.ch/file/402581/1/O
racle_Text_OracleWorld2003.ppt Contact Anna.Suwa
lska_at_cern.ch