Contiguous Connection Model - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Contiguous Connection Model


1
Contiguous Connection Model
by Peter Baker and Mohammad Khan
2
CCM Database
The CCM database model associates data in a
directed graph structure forming a data Net.
Each node on the net is a data element. Each
segment of the net represents the relationship
between the two data elements it connects. From
any node on the net, you can view the data
related to that node.
Database users can access the data net at any
point to explore the network as it relates to
that point. Obtaining information in this way
gives the user more flexibility in analyzing the
data than relying on a database query to produce
an observable association.
3
Data Set
Each unit of data is a data set, which consists
of a data type and a data value, separated by a
colon. The value is a single piece of data, and
the type provides a context that makes the value
meaningful. The value is associated directly with
the type since the two, together, form a unit of
data.
  • Data set: "Title: Night at the Roxbury"
  • Type: "Title"
  • Value: "Night at the Roxbury"
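A minimal sketch of the type/value pairing described on this slide, assuming a data set is written as "Type: Value". The names DataSet and parse_data_set are illustrative and are not part of CCM.

```python
from typing import NamedTuple


class DataSet(NamedTuple):
    """One unit of data: a type that gives context, plus a single value."""
    data_type: str
    value: str


def parse_data_set(text: str) -> DataSet:
    # Split on the first colon only, so a value may itself contain colons.
    data_type, separator, value = text.partition(":")
    if not separator:
        raise ValueError(f"expected 'Type: Value', got {text!r}")
    return DataSet(data_type.strip(), value.strip())


roxbury = parse_data_set("Title: Night at the Roxbury")
print(roxbury.data_type)  # Title
print(roxbury.value)      # Night at the Roxbury
```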
4
Data Array
Related data sets are grouped together into data
arrays. The key data set is the primary data set
of the array.
  • Title: Tortilla Flat (key data set)
  • Author: John Steinbeck (linked data set)
  • Date: 1935 (linked data set)
5
Multiple Levels of Indenture
Items may be linked to other items, creating
other levels of dependency under the key data
set.
  • Title: Tortilla Flat
  • Author: John Steinbeck
  • -- Birth Date: 2/27/1902
  • Date: 1935

Since John Steinbeck's personal attributes are
not directly related to Tortilla Flat, they are
not in the data array but are linked to it
through John Steinbeck.
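The levels of indenture can be sketched as a small tree in which each data set holds the data sets linked beneath it. This is an illustration only; the Node class and the render helper are assumed names, not CCM's storage format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    data_type: str
    value: str
    children: List["Node"] = field(default_factory=list)

    def link(self, data_type: str, value: str) -> "Node":
        """Attach a linked data set one level below this one."""
        child = Node(data_type, value)
        self.children.append(child)
        return child


def render(node: Node, depth: int = 0) -> List[str]:
    # Indent two spaces per level of dependency under the key.
    lines = ["  " * depth + f"{node.data_type}: {node.value}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


book = Node("Title", "Tortilla Flat")           # key data set
author = book.link("Author", "John Steinbeck")  # linked data set
author.link("Birth Date", "2/27/1902")          # linked through the author
book.link("Date", "1935")
print("\n".join(render(book)))
```

The birth date hangs off the author node rather than the title, which mirrors the point that it is reached through John Steinbeck.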
6
Inverse Array
The CCM creates and stores a set of inverse
arrays for every data array in the database.
These inverse arrays are exact reversals of the
original input arrays. The parent of a data set
in the input array becomes a child of that data
set in the inverse array. Arrays created by
inversion are stored in the database and can be
viewed exactly as input arrays.
Each linked data set becomes a key, making it
possible to view data from the perspective of
every data set in the database.
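One way to picture inversion is to treat an input array as a list of parent-to-child links and reverse each link. The sketch below assumes that representation; the function name invert and the tuple layout are hypothetical and do not describe how CCM stores arrays.

```python
from collections import defaultdict

# Each data set is a (type, value) pair; each link is (parent, child).
TITLE = ("Title", "Tortilla Flat")
AUTHOR = ("Author", "John Steinbeck")
BIRTH = ("Birth Date", "2/27/1902")
DATE = ("Date", "1935")

input_links = [
    (TITLE, AUTHOR),
    (AUTHOR, BIRTH),
    (TITLE, DATE),
]


def invert(links):
    """Reverse every link: each child becomes a key over its former parents."""
    inverse = defaultdict(list)
    for parent, child in links:
        inverse[child].append(parent)
    return dict(inverse)


inverse_arrays = invert(input_links)
for key, children in inverse_arrays.items():
    print(key, "->", children)
```

After inversion the author, the birth date, and the date each appear as a key, so the data can be viewed from any of them.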
7
Convergence
As data accumulates, the data sets from different
input arrays are combined under identical parents
in the inverse arrays. The data converges at
these intersection points. For example, if
East of Eden were the key in another input
array, the data would converge on the author,
Steinbeck. No query would be required to see the
books that the author wrote.
Input array:
  • Title: Tortilla Flat
  • Author: John Steinbeck
  • -- Birth Date: 2/27/1902
  • Date: 1935

Input array:
  • Title: East of Eden
  • Author: John Steinbeck
  • -- Birth Date: 2/27/1902
  • Date: 1952

Inverse array (convergence on the author):
  • Author: John Steinbeck
  • Title: Tortilla Flat
  • Title: East of Eden

Convergence points are the main points of
interest in a CCM database, sometimes providing
unexpected information to the viewer. In a CCM
database, whenever you choose a linked data set
as the key, the convergence points are
immediately presented for viewing.
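Convergence can be sketched by collecting, for every linked data set, the keys of all input arrays that contain it. A data set shared by more than one array is a convergence point. The dictionary layout and the function name convergence_points are assumptions made for this illustration.

```python
from collections import defaultdict

input_arrays = [
    {"key": ("Title", "Tortilla Flat"),
     "linked": [("Author", "John Steinbeck"), ("Date", "1935")]},
    {"key": ("Title", "East of Eden"),
     "linked": [("Author", "John Steinbeck"), ("Date", "1952")]},
]


def convergence_points(arrays):
    """Return linked data sets that appear under more than one key."""
    keys_by_linked = defaultdict(list)
    for array in arrays:
        for data_set in array["linked"]:
            keys_by_linked[data_set].append(array["key"])
    return {ds: keys for ds, keys in keys_by_linked.items() if len(keys) > 1}


points = convergence_points(input_arrays)
print(points)
```

Here the only convergence point is the author, under which both titles are listed without running a query.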
8
Unstructured Data: Free Text
CCM establishes each significant word as a data
value, creates data types that express relative
positions of words, and applies the data types
(position expressions) to the data values
(words). CCM structures the free text into
meaningful relationships ready for further
analysis.
9
Concept Data Types
A CCM Concept is defined as a three-word phrase.
CCM creates the following data types to analyze
concepts in unstructured data.
  • W: a key word
  • P1: the word after W (Word Plus 1)
  • P2: the word after P1 (Word Plus 2)
  • M1: the word before W (Word Minus 1)
  • M2: the word before M1 (Word Minus 2)
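The position data types can be sketched as offsets from the key word. The example below assumes the text has already been reduced to a list of significant words; the sample word list and the name concept_array are made up for illustration.

```python
# Offsets of each position data type relative to the key word W.
OFFSETS = {"M2": -2, "M1": -1, "P1": 1, "P2": 2}


def concept_array(words, index):
    """Build the array for the key word at `index`, skipping positions
    that fall outside the text."""
    array = {"W": words[index]}
    for data_type, offset in OFFSETS.items():
        position = index + offset
        if 0 <= position < len(words):
            array[data_type] = words[position]
    return array


words = ["john", "steinbeck", "wrote", "tortilla", "flat"]  # sample input
print(concept_array(words, 2))
# {'W': 'wrote', 'M2': 'john', 'M1': 'steinbeck', 'P1': 'tortilla', 'P2': 'flat'}
print(concept_array(words, 0))
# {'W': 'john', 'P1': 'steinbeck', 'P2': 'wrote'}
```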
10
Significant Words
Data types are applied to each significant word
in the unstructured text. Significant words are
those that remain after the text is passed
through three filtering lists:
  • Stop-word list. Stop words are those that
    appear in all text documents and do not
    contain meaning that adds to the value of a
    search ("the," "and," "for"). Stop words are
    ignored during the process of creating and
    inverting arrays.
  • Compound word list. This list identifies
    certain significant "words" that are really
    multiple words representing a single entity
    ("United States").
  • Domain High Hit list. Normally, CCM
    disregards words less than three characters
    in length. The high hit word list identifies
    one- and two-character words that are to be
    retained as significant words.
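A rough sketch of the three filtering lists follows. The list contents, the order in which the lists are applied, and the tokenizing rule are all assumptions; the slides do not specify them.

```python
import re

STOP_WORDS = {"the", "and", "for", "in", "of"}             # sample entries
COMPOUND_WORDS = {("united", "states"): "united states"}  # sample entry
HIGH_HIT_WORDS = {"dc"}                                    # hypothetical entry


def significant_words(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    result = []
    i = 0
    while i < len(tokens):
        # Compound words: two adjacent tokens treated as a single entity.
        pair = tuple(tokens[i:i + 2])
        if pair in COMPOUND_WORDS:
            result.append(COMPOUND_WORDS[pair])
            i += 2
            continue
        word = tokens[i]
        i += 1
        if word in STOP_WORDS:
            continue
        # Words under three characters are dropped unless on the high hit list.
        if len(word) < 3 and word not in HIGH_HIT_WORDS:
            continue
        result.append(word)
    return result


print(significant_words("The president of the United States is in DC for a week"))
# ['president', 'united states', 'dc', 'week']
```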
11
Related Words
CCM concept analysis is based on the theory that
words used in a similar context and associated
with the same group of words will have a
conceptual similarity to each other.
Example: Birds and Airplanes
birds fly south
airplanes fly south
The two phrases have conceptual similarity.
12
Applying Data Types to Significant Words
The fog comes in on little cat feet.
Insignificant words are ignored (they are shown
crossed out on the slide). The remaining words
are structured into input arrays.
The inverse arrays relate each word to the words
that precede it.
13
Weighted Related Words
For each significant word, CCM also builds a
weighted list of related words. Each
relationship is assigned a weight value up to
1000. The weight value for each related word
provides a relative indicator of how closely its
array matches the base array; that is, how
closely the array with the related word as key
matches the array with the original word as key.
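The slides do not give the weighting formula. As a stand-in, the sketch below scores two words by the overlap of the words that surround them (a Jaccard ratio) and scales the result to the 0 to 1000 range. The formula and the function names are assumptions, not CCM's method.

```python
def context_array(phrases, key):
    """Collect the words that occur alongside `key` in any phrase."""
    context = set()
    for phrase in phrases:
        words = phrase.split()
        if key in words:
            context.update(w for w in words if w != key)
    return context


def weight(phrases, base, related):
    """Overlap of the two context sets, scaled so identical sets score 1000."""
    a = context_array(phrases, base)
    b = context_array(phrases, related)
    if not a or not b:
        return 0
    return round(1000 * len(a & b) / len(a | b))


phrases = ["birds fly south", "airplanes fly south", "birds eat seeds"]
print(weight(phrases, "birds", "airplanes"))  # 500
```

With only the first two phrases, the two context sets are identical and the weight is 1000; the third phrase gives "birds" extra context that "airplanes" lacks, which lowers the score.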
14
Other Word Relationships
CCM uses stem words and Soundex words to
determine similarities in free-text items. Stem
words derive from the same root (or stem) as the
primary word. For example, "hunt" is a stem word
for the words "hunted" and "hunting". Soundex
words are words that sound alike. The Soundex
feature is useful for finding misspelled words or
alternate spellings. For example, "red" would
have a Soundex relationship to "read".
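Both relationships can be sketched briefly. The soundex function below follows the standard American Soundex rules; whether CCM uses exactly this variant is an assumption. The naive_stem function is a crude suffix stripper written for this illustration, not a real stemming algorithm.

```python
# Map each consonant to its Soundex digit.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not break a run of equal codes
        digit = CODES.get(letter, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets the run
    return (code + "000")[:4]


def naive_stem(word):
    """Strip one common suffix, keeping at least three letters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


print(soundex("red"), soundex("read"))                # R300 R300
print(naive_stem("hunted"), naive_stem("hunting"))    # hunt hunt
```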
15
Building a CCM Database
A CCM database created from text files contains
unstructured data. CCM sets up the data types and
assigns them to data values.
Imports are only possible with ASCII text files.
It is recommended that an import consist of at
least 1000 documents in a set.
16
Importing Free Text
Select import wizard
Selecting the import wizard brings up the first
import window.
Select Import text files containing only free text
17
Select Free Text Directory
Accept Import Script Default
Enter the Directory with the text files
Apply filter if needed
18
Specify File Characteristics
Specify minimum and maximum file sizes. For
Aesop's Fables, the default file size must be
reduced to 0 KB. Validate Files reports any
failures in the import process.
Use Validate Files to check the import files.
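Before running the wizard, a directory can be pre-checked against the two constraints mentioned in these slides: ASCII-only content and a file size within the chosen bounds. The script below is a standalone sketch and is not the wizard's Validate Files feature; the function name, default limits, and file pattern are assumptions.

```python
import tempfile
from pathlib import Path


def validate_files(directory, min_bytes=0, max_bytes=1_000_000, pattern="*.txt"):
    """Return {file name: reason} for every file that fails a check."""
    failures = {}
    for path in sorted(Path(directory).glob(pattern)):
        data = path.read_bytes()
        if not min_bytes <= len(data) <= max_bytes:
            failures[path.name] = (
                f"size {len(data)} bytes outside [{min_bytes}, {max_bytes}]"
            )
        elif not data.isascii():
            failures[path.name] = "contains non-ASCII bytes"
    return failures


# Demonstration on a throwaway directory with one good and one bad file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "fable1.txt").write_bytes(b"The Fox and the Grapes")
    Path(tmp, "fable2.txt").write_bytes("caf\u00e9".encode("utf-8"))
    report = validate_files(tmp)

print(report)  # {'fable2.txt': 'contains non-ASCII bytes'}
```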
19
Selecting Stop Word List: Import or Import and
Word Analysis
In this example, we use a standard stop list.
Options: Import Only, or Import and Word Analysis
20
Final Import Screen
Save the CCM database file to an appropriate
destination. The CCM Enterprise Manager presents
you with the Sinai Database window.
21
Concept Queries
Concept query windows. The query builder allows
the user to enter words on which to query and
shows a listing of words with occurrence values.
Results show word associations and the
corresponding text.