Title: Text operations
1. Text Operations
- Prepared by Loay Alayadhi
- 425121605
- Supervised by Dr. Mourad Ykhlef
2. Document Preprocessing
- Lexical analysis
- Elimination of stopwords
- Stemming of the remaining words
- Selection of index terms
- Construction of term categorization structures (thesaurus)
3. Logical View of a Document
[Figure: a document passes through structure recognition (text plus structure), then lexical analysis (accents, spacing, etc.), stopword elimination, noun-group detection, and stemming, ending in automatic or manual indexing; the logical view of the document thus moves from the full text down to a set of index terms.]
4. 1) Lexical Analysis of the Text
- Lexical analysis: converting an input stream of characters into a stream of words.
- The major objective is the identification of the words in the text. How?
- Digits: ignoring numbers is a common way.
- Hyphens: e.g., state-of-the-art.
- Punctuation marks: remove them. Exception: 510 B.C.
- Case of letters: usually folded to a single case (see the sketch below).
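A minimal Python sketch of such a lexical analyser, assuming the conventions above (fold case, keep hyphenated words and dotted abbreviations such as "510 B.C.", ignore pure numbers); the regular expression is illustrative only, not a standard:

```python
import re

def tokenize(text):
    """Convert a character stream into a word stream."""
    text = text.lower()  # case folding
    # words, hyphenated words, and dotted abbreviations; digits matched separately
    tokens = re.findall(r"[a-z]+(?:[-.][a-z]+)*\.?|\d+", text)
    return [t for t in tokens if not t.isdigit()]  # ignoring numbers is a common way

print(tokenize("State-of-the-art indexing, used since 510 B.C.?"))
# ['state-of-the-art', 'indexing', 'used', 'since', 'b.c.']
```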
5. 2) Elimination of Stopwords
- Words that appear too often are not useful for IR.
- Stopwords: words that appear in more than 80% of the documents in the collection are stopwords and are filtered out as potential index words.
- Problem: a search for "to be or not to be"? Every word is a stopword, as the sketch below illustrates.
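A sketch of the filtering step; the stopword list here is illustrative, not a standard list:

```python
STOPWORDS = {"a", "an", "the", "to", "be", "or", "not", "of", "is", "in"}

def remove_stopwords(tokens):
    """Drop words too frequent to discriminate between documents."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("to be or not to be".split()))        # [] -- the query vanishes
print(remove_stopwords("elimination of stopwords".split()))  # ['elimination', 'stopwords']
```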
6. 3) Stemming
- A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes or suffixes).
- Example: connect, connected, connecting, connection, connections.
- Removal strategies (a toy sketch of the first one follows):
- affix removal: intuitive, simple
- table lookup
- successor variety
- n-gram
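A toy affix-removal stemmer, only to make the strategy concrete; the suffix list is illustrative, and a real system would use something like the Porter algorithm:

```python
SUFFIXES = ["ions", "ion", "ing", "ed", "s"]  # longer suffixes tried first

def stem(word):
    """Strip one suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["connect", "connected", "connecting", "connection", "connections"]:
    print(w, "->", stem(w))  # every variant reduces to "connect"
```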
7. 4) Index Term Selection
- Motivation:
- A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
- Most of the semantics is carried by the nouns.
- Identification of noun groups:
- A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold (see the sketch below).
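A sketch of the threshold idea, assuming part-of-speech tags are already available from a tagger; the tag set and threshold value are illustrative:

```python
def noun_groups(tagged, threshold=2):
    """Group nouns whose distance in the token stream is <= threshold."""
    groups, current, last = [], [], None
    for i, (word, tag) in enumerate(tagged):
        if tag != "NOUN":
            continue
        if last is not None and i - last > threshold:
            groups.append(current)  # gap too large: start a new group
            current = []
        current.append(word)
        last = i
    if current:
        groups.append(current)
    return groups

tagged = [("the", "DET"), ("computer", "NOUN"), ("science", "NOUN"),
          ("department", "NOUN"), ("is", "VERB"), ("hiring", "VERB"),
          ("a", "DET"), ("professor", "NOUN")]
print(noun_groups(tagged))  # [['computer', 'science', 'department'], ['professor']]
```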
8. 5) Thesaurus Construction
- Thesaurus: a precompiled list of important words in a given domain of knowledge; for each word in this list, there is a set of related words.
- A controlled vocabulary for indexing and searching. Why?
- Normalization,
- concept indexing,
- reduction of noise,
- identification, etc.
9. The Purpose of a Thesaurus
- To provide a standard vocabulary for indexing and searching
- To assist users with locating terms for proper query formulation
- To provide classified hierarchies that allow the broadening and narrowing of the current query request
10. Thesaurus (cont.)
- Not like a common dictionary:
- A dictionary gives words with their explanations
- It may contain all the words of a language
- Or only the words of a specific domain
- A thesaurus carries a lot of other information, especially the relationships between words:
- Classification of the words in the language
- Word relationships such as synonyms and antonyms
11. Roget's Thesaurus: Example
- cowardly (adjective)
- Ignobly lacking in courage: "cowardly turncoats"
- Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)
- http://www.thesaurus.com
- http://www.dictionary.com/
12. Thesaurus Term Relationships
- BT: broader term
- NT: narrower term
- RT: related term (non-hierarchical, but related)
13. Use of Thesaurus
- Indexing: select the most appropriate thesaurus entries for representing the document.
- Searching: design the most appropriate search strategy.
- If the search does not retrieve enough documents, the thesaurus can be used to expand the query.
- If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary (a sketch of both directions follows).
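A sketch of both reformulation directions, using a tiny hypothetical thesaurus entry with the BT/NT/RT relationships from slide 12:

```python
THESAURUS = {  # hypothetical entries, for illustration only
    "car": {"BT": ["vehicle"], "NT": ["sedan", "coupe"], "RT": ["engine"]},
}

def reformulate(term, relation):
    """Broaden (BT), narrow (NT), or relate (RT) a query term."""
    return [term] + THESAURUS.get(term, {}).get(relation, [])

print(reformulate("car", "BT"))  # too few results  -> broaden: ['car', 'vehicle']
print(reformulate("car", "NT"))  # too many results -> narrow:  ['car', 'sedan', 'coupe']
```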
14. Document clustering
- Document clustering: the operation of grouping together similar documents in classes.
- Global vs. local:
- Global: the whole collection
- At compile time, a one-time operation
- Local:
- Cluster the results of a specific query
- At runtime, with each query (a minimal sketch follows)
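A minimal sketch of local clustering: the documents returned for a query are grouped by word overlap (Jaccard similarity). The greedy strategy, similarity measure, and threshold are illustrative choices, not a method prescribed by these slides:

```python
def jaccard(a, b):
    """Word-overlap similarity between two documents."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.2):
    """Greedily attach each document to the first similar-enough cluster."""
    clusters = []
    for doc in docs:
        for c in clusters:
            if jaccard(doc, c[0]) >= threshold:  # compare against the cluster seed
                c.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters

results = ["text compression methods", "huffman text compression",
           "stemming of words", "affix removal stemming"]
print(cluster(results))
# [['text compression methods', 'huffman text compression'],
#  ['stemming of words', 'affix removal stemming']]
```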
15. Text Compression
- Why is text compression important?
- Less storage space
- Less time for data transmission
- Less time to search (if the compression method allows direct search without decompression)
16. Terminology
- Symbol
- The smallest unit for compression (e.g., a character, a word, or a fixed number of characters)
- Alphabet
- The set of all possible symbols
- Compression ratio
- The size of the compressed file as a fraction of the size of the uncompressed file
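- E.g., compressing a 10 MB file to 3 MB gives a compression ratio of 3/10 = 0.3, i.e., 30% of the original size.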
17. Types of compression models
- Static models
- Assume some data properties in advance (e.g., relative frequencies of symbols) for all input text
- Allow direct (or random) access
- Poor compression ratios when the input text deviates from the assumption
18. Types of compression models
- Semi-static models
- Learn data properties in a first pass
- Compress the input data in a second pass
- Allow direct (or random) access
- Good compression ratio
- Must store the learned data properties for decoding
- Must have the whole data at hand
19. Types of compression models
- Adaptive models
- Start with no information
- Progressively learn the data properties as the compression process goes on
- Need only one pass for compression
- Do not allow random access
- Decompression cannot start in the middle
20. General approaches to text compression
- Dictionary methods
- (Basic) dictionary method
- Ziv-Lempel's adaptive method
- Statistical methods
- Arithmetic coding
- Huffman coding
21. Dictionary methods
- Replace a sequence of symbols with a pointer to a dictionary entry.
- Input: aaababbbaaabaaaaaaabaabb
- A fixed dictionary may be suitable for one text but unsuitable for another.
22. Adaptive Ziv-Lempel coding
- Instead of dictionary entries, pointers point to the previous occurrences of symbols.
[Figure: the string aaababbbaaabaaaaaaabaabb with arrows linking each repeated substring back to its earlier occurrence]
23. Adaptive Ziv-Lempel coding
- Instead of dictionary entries, pointers point to the previous occurrences of symbols.
- Input aaababbbaaabaaaaaaabaabb, parsed into numbered phrases 1-10:
  a | aa | b | ab | bb | aaa | ba | aaaa | aab | aabb
- Output, one (previous-phrase index, new symbol) pair per phrase:
  (0,a) (1,a) (0,b) (1,b) (3,b) (2,a) (3,a) (6,a) (2,b) (9,b)
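A sketch of this parsing in Python; it reproduces the ten pairs above. Index 0 means "no previous phrase", and a trailing pair with an empty symbol flushes an unfinished match at end of input:

```python
def lz78_encode(text):
    """Emit (index of longest previously seen phrase, next symbol) pairs."""
    phrases, output, current = {}, [], ""
    for ch in text:
        if current + ch in phrases:
            current += ch                             # keep extending the match
        else:
            output.append((phrases.get(current, 0), ch))
            phrases[current + ch] = len(phrases) + 1  # number the new phrase
            current = ""
    if current:                                       # input ended inside a match
        output.append((phrases[current], ""))
    return output

print(lz78_encode("aaababbbaaabaaaaaaabaabb"))
# [(0,'a'), (1,'a'), (0,'b'), (1,'b'), (3,'b'), (2,'a'), (3,'a'), (6,'a'), (2,'b'), (9,'b')]
```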
24. Adaptive Ziv-Lempel coding
- Good compression ratio (4 bits/character)
- Suitable for general data compression and widely used (e.g., zip, compress)
- Does not allow decoding to start in the middle of a compressed file → direct access is impossible without decompressing from the beginning
25. Arithmetic coding
- The input text (data) is converted to a real number between 0 and 1, such as 0.328701
- Good compression ratio (2 bits/character)
- Slow
- Cannot start decoding in the middle of a file
26. Symbols and alphabet for textual data
- Words are more appropriate symbols for natural language text.
- Example: "for each rose, a rose is a rose"
- Alphabet:
- {a, each, for, is, rose, ",", "␣"} (␣ denotes the space)
- Always assume a single space after a word unless there is another separator:
- {a, each, for, is, rose, ",␣"}
27. Huffman coding
- Assign shorter codes (bits) to more frequent symbols and longer codes (bits) to less frequent symbols.
- Example: "for each rose, a rose is a rose" (a construction sketch follows)
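A compact sketch of Huffman code construction over word symbols: the heap repeatedly merges the two least frequent subtrees, carrying a tie-break counter so entries stay comparable. This builds an optimal tree, though not necessarily the canonical one shown on slide 31; treating "," as a separate symbol here is a simplification:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Merge the two least frequent subtrees until one tree remains."""
    freq = Counter(symbols)
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}   # 0 = left edge
        merged.update({s: "1" + c for s, c in right.items()})  # 1 = right edge
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

text = "for each rose , a rose is a rose".split()
for sym, code in sorted(huffman_codes(text).items(), key=lambda p: len(p[1])):
    print(sym, code)  # "rose" and "a" receive the shortest codes
```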
28. Example
[Figure: Huffman tree under construction for the example; the leaves are rose, a, ",␣", each, for, and is with their frequencies, merged bottom-up into a single tree]
29. Example
[Figure: the same tree with 0/1 labels on each edge; a symbol's code is read off the path from the root to its leaf]
30. Example
[Figure: the finished code tree; rose and a, the most frequent symbols, end up with the shortest codes, while ",␣", each, for, and is get longer ones]
31. Canonical tree
- The height of the left subtree of any node is never smaller than that of the right subtree.
- All leaves are in increasing order of probabilities (frequencies) from left to right.
[Figure: the canonical version of the example tree, with ",␣", each, for, and is as the deeper left-hand leaves and a and rose as the right-hand leaves]
32. Advantages of canonical tree
- Smaller data for decoding:
- A non-canonical tree needs a mapping table between symbols and codes
- A canonical tree needs only a (sorted) list of symbols plus, for each level, the pair (number of symbols, numerical value of the first code)
- E.g., (0, NA), (2, 2), (4, 0)
- More efficient encoding/decoding (a reconstruction sketch follows)
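A sketch of regenerating every code from exactly this compact description; the symbol list is assumed to be sorted level by level as on the slide:

```python
# (depth, number of symbols, numerical value of the first code) per level,
# matching the slide's example pairs (0, NA), (2, 2), (4, 0)
levels = [(1, 0, None), (2, 2, 2), (3, 4, 0)]
symbols = ["a", "rose", ",_", "each", "for", "is"]  # sorted, level by level

codes, i = {}, 0
for depth, count, first in levels:
    for k in range(count):                      # codes on one level are consecutive
        codes[symbols[i]] = format(first + k, f"0{depth}b")
        i += 1
print(codes)
# {'a': '10', 'rose': '11', ',_': '000', 'each': '001', 'for': '010', 'is': '011'}
```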
33. Byte-oriented Huffman coding
- Use whole bytes instead of binary coding.
[Figure: two byte-oriented code trees over the same symbols; the non-optimal tree leaves 254 empty nodes beside two full levels of 256 symbols, while the optimal tree places 254 symbols next to the 254 empty nodes and pushes only 2 symbols down a level]
34. Comparison of methods
35. Compression of inverted files
- An inverted file is composed of:
- A vector containing all the distinct words in the text collection
- For each word, a list of the documents in which that word occurs
- Types of code:
- Unary
- Elias-γ (gamma)
- Elias-δ (delta)
- Golomb
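A sketch of the first two codes applied to the gaps between document numbers in a posting list (small gaps get short codewords); conventions vary, and this is one common choice:

```python
def unary(n):
    """n encoded as n-1 one-bits followed by a zero."""
    return "1" * (n - 1) + "0"

def elias_gamma(n):
    """Length of n's binary form in unary, then its bits after the leading 1."""
    bits = bin(n)[2:]                  # e.g., 11 -> '1011'
    return unary(len(bits)) + bits[1:]

doc_ids = [3, 7, 11, 22]                         # one inverted-file list
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
print(gaps)                                      # [3, 4, 4, 11]
print(" ".join(elias_gamma(g) for g in gaps))    # 101 11000 11000 1110011
```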
36. Conclusions
- Text transformation: meaning instead of strings
- Lexical analysis
- Stopwords
- Stemming
- Text compression
- Searchable
- Random access
- Models and coding
- Inverted files
37. Thanks.