1
Text operations
  • Prepared By Loay Alayadhi
  • 425121605
  • Supervised by Dr. Mourad Ykhlef

2
Document Preprocessing
  • Lexical analysis
  • Elimination of stopwords
  • Stemming of the remaining words
  • Selection of index terms
  • Construction of term categorization structures (thesaurus)

3
Logical View of a Document
[Figure: logical view of a document. Starting from the full text, the pipeline applies structure recognition, normalization of accents, spacing, etc., stopword elimination, noun-group detection, and stemming, then automatic or manual indexing, producing the index terms; the text structure is recognized along the way.]
4
1)Lexical Analysis of the Text
  • Lexical analysis: convert an input stream of characters into a stream of words.
  • The major objective is the identification of the words in the text. How?
  • Digits: ignoring numbers is a common policy.
  • Hyphens: e.g., state-of-the-art.
  • Punctuation marks: usually removed; exceptions such as 510 B.C.
  • Case: usually folded to a single case (lowercase).
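
A minimal sketch of such a lexical analyzer in Python; the policy choices (case folding, hyphen splitting, dropping pure numbers, stripping punctuation) are common defaults, not the only options. Note how it mangles "510 B.C.", exactly the kind of exception mentioned above:

```python
import re

def tokenize(text, keep_digits=False):
    """Convert a character stream into a stream of words."""
    text = text.lower()                     # case folding
    text = text.replace("-", " ")           # split hyphenated forms: state-of-the-art
    words = re.findall(r"[a-z0-9]+", text)  # strip punctuation marks
    if not keep_digits:
        words = [w for w in words if not w.isdigit()]  # ignore pure numbers
    return words

print(tokenize("State-of-the-art methods, known since 510 B.C.?"))
# ['state', 'of', 'the', 'art', 'methods', 'known', 'since', 'b', 'c']
```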

5
2) Elimination of Stopwords
  • Words that appear too often are not useful for IR.
  • Stopwords: words that appear in more than 80% of the documents in the collection are stopwords and are filtered out as potential index words.
  • Problem:
  • How do you search for "to be or not to be"?
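
A sketch of the 80% rule in Python; the toy collection and threshold are illustrative only, and the final line shows how the famous query is nearly destroyed:

```python
from collections import Counter

def find_stopwords(docs, threshold=0.8):
    """Treat words occurring in more than `threshold` of all documents as stopwords."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))              # document frequency: count each word once per document
    return {w for w, n in df.items() if n / len(docs) > threshold}

docs = [["to", "be", "or", "not", "to", "be"],
        ["to", "see", "or", "not"],
        ["be", "here", "or", "not", "to"]]
stop = find_stopwords(docs)              # {'to', 'or', 'not'}
print([w for w in docs[0] if w not in stop])   # ['be', 'be']
```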

6
3) Stemming
  • A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes or suffixes).
  • Example:
  • connect, connected, connecting, connection, connections
  • Removal strategies:
  • affix removal: intuitive, simple
  • table lookup
  • successor variety
  • n-gram
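
A toy affix-removal stemmer in Python; the suffix list is a made-up illustration (real stemmers such as Porter's use carefully ordered rule sets):

```python
SUFFIXES = ["ions", "ion", "ings", "ing", "ed", "s"]   # toy list, longest variants first

def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["connect", "connected", "connecting", "connection", "connections"]:
    print(w, "->", stem(w))   # all five reduce to "connect"
```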

7
4) Index Terms Selection
  • Motivation:
  • A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
  • Most of the semantics is carried by the nouns.
  • Identification of noun groups:
  • A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold.
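
A sketch of noun-group detection, assuming part-of-speech tags are already available and approximating syntactic distance by token distance; the names and the threshold are illustrative:

```python
def noun_groups(tagged, max_dist=2):
    """Group nouns whose distance in the text does not exceed `max_dist`.
    `tagged` is a list of (word, pos) pairs; tagging is assumed done elsewhere."""
    positions = [i for i, (_, pos) in enumerate(tagged) if pos == "NOUN"]
    groups, current = [], []
    for i in positions:
        if current and i - current[-1] > max_dist:
            groups.append(current)       # gap too large: close the current group
            current = []
        current.append(i)
    if current:
        groups.append(current)
    return [[tagged[i][0] for i in g] for g in groups]

tagged = [("the", "DET"), ("house", "NOUN"), ("of", "ADP"), ("lords", "NOUN"),
          ("voted", "VERB"), ("again", "ADV"), ("on", "ADP"), ("reform", "NOUN")]
print(noun_groups(tagged))   # [['house', 'lords'], ['reform']]
```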

8
5) Thesaurus Construction
  • Thesaurus a precompiled list of important words
    in a given domain of knowledge and for each word
    in this list, there is a set of related words.
  • A controlled vocabulary for the indexing and
    searching. Why?
  • Normalization,
  • indexing concept ,
  • reduction of noise,
  • identification, ..ect

9
The Purpose of a Thesaurus
  • To provide a standard vocabulary for indexing and
    searching
  • To assist users with locating terms for proper
    query formulation
  • To provide classified hierarchies that allow the
    broadening and narrowing of the current query
    request

10
Thesaurus (cont.)
  • Unlike a common dictionary, which lists words with their explanations,
  • a thesaurus may contain all words in a language
  • or only words in a specific domain,
  • together with a lot of other information, especially the relationships between words:
  • classification of words in the language,
  • word relationships such as synonyms and antonyms.

11
Roget's Thesaurus: Example
  • cowardly (adjective)
  • Ignobly lacking in courage: cowardly turncoats.
  • Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)
  • http://www.thesaurus.com
  • http://www.dictionary.com/

12
Thesaurus Term Relationships
  • BT: broader term
  • NT: narrower term
  • RT: related term (non-hierarchical, but related)

13
Use of Thesaurus
  • Indexing: select the most appropriate thesaurus entries for representing the document.
  • Searching: design the most appropriate search strategy.
  • If the search does not retrieve enough documents, the thesaurus can be used to expand the query.
  • If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary.
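
A sketch of thesaurus-based query reformulation; the thesaurus fragment is made up, but the BT/NT/RT relations are the ones defined above:

```python
# Hypothetical thesaurus fragment: term -> {relation: related terms}.
THESAURUS = {
    "cowardly": {"RT": ["craven", "gutless", "faint-hearted"]},
    "vehicle":  {"NT": ["car", "truck"], "BT": ["machine"]},
}

def reformulate(terms, relation):
    """Extend a query with broader (BT), narrower (NT), or related (RT) terms."""
    expanded = list(terms)
    for t in terms:
        expanded += THESAURUS.get(t, {}).get(relation, [])
    return expanded

print(reformulate(["cowardly"], "RT"))   # too few hits: add related terms
print(reformulate(["vehicle"], "NT"))    # too many hits: suggest narrower terms
```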

14
Document clustering
  • Document clustering: the operation of grouping together similar documents in classes.
  • Global vs. local:
  • Global: the whole collection;
  • done at compile time, a one-time operation.
  • Local:
  • cluster the results of a specific query;
  • done at runtime, with each query.
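
A toy single-pass clustering sketch using Jaccard similarity over term sets; real systems use proper clustering algorithms, but the grouping idea is the same whether applied globally (whole collection) or locally (one query's results):

```python
def cluster(docs, threshold=0.5):
    """Greedy clustering: join the first cluster whose representative
    set is similar enough, else start a new cluster."""
    clusters = []                          # list of (representative terms, member ids)
    for i, terms in enumerate(docs):
        for rep, members in clusters:
            if len(terms & rep) / len(terms | rep) >= threshold:   # Jaccard similarity
                members.append(i)
                break
        else:
            clusters.append((set(terms), [i]))
    return [members for _, members in clusters]

docs = [{"rose", "flower"}, {"rose", "flower", "red"}, {"coding", "huffman"}]
print(cluster(docs))   # [[0, 1], [2]]
```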

15
Text Compression
  • Why is text compression important?
  • Less storage space
  • Less time for data transmission
  • Less time to search (if the compression method
    allows direct search without decompression)

16
Terminology
  • Symbol:
  • The smallest unit for compression (e.g., a character, a word, or a fixed number of characters)
  • Alphabet:
  • The set of all possible symbols
  • Compression ratio:
  • The size of the compressed file as a fraction of the size of the uncompressed file
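
For example, under this definition a smaller ratio is better (the sizes below are made up):

```python
def compression_ratio(compressed_size, original_size):
    """Size of the compressed file as a fraction of the uncompressed file."""
    return compressed_size / original_size

print(compression_ratio(250, 1000))   # 0.25: the compressed file is 25% of the original
```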

17
Types of compression models
  • Static models
  • Assume some data properties in advance (e.g.,
    relative frequencies of symbols) for all input
    text
  • Allow direct (or random) access
  • Poor compression ratios when the input text
    deviates from the assumption

18
Types of compression models
  • Semi-static models
  • Learn data properties in a first pass
  • Compress the input data in a second pass
  • Allow direct (or random) access
  • Good compression ratio
  • Must store the learned data properties for
    decoding
  • Must have whole data at hand

19
Types of compression models
  • Adaptive models
  • Start with no information
  • Progressively learn the data properties as the
    compression process goes on
  • Need only one pass for compression
  • Do not allow random access
  • Decompression cannot start in the middle

20
General approaches to text compression
  • Dictionary methods
  • (Basic) dictionary method
  • Ziv-Lempel's adaptive method
  • Statistical methods
  • Arithmetic coding
  • Huffman coding

21
Dictionary methods
  • Replace a sequence of symbols with a pointer to a
    dictionary entry

Input: aaababbbaaabaaaaaaabaabb
A given dictionary may be suitable for one text but unsuitable for another.
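
A sketch of greedy longest-match encoding against a fixed, made-up dictionary; this dictionary happens to fit this particular string well, which is the point of the caveat above:

```python
DICT = ["a", "b", "aa", "ab", "bb"]    # hypothetical static dictionary

def dict_encode(text):
    """Replace symbol sequences by pointers (indices) into the dictionary,
    always taking the longest entry that matches at the current position."""
    out, i = [], 0
    while i < len(text):
        entry = max((e for e in DICT if text.startswith(e, i)), key=len)
        out.append(DICT.index(entry))
        i += len(entry)
    return out

print(dict_encode("aaababbbaaabaaaaaaabaabb"))   # 12 pointers for 24 symbols
```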
22
Adaptive Ziv-Lempel coding
  • Instead of dictionary entries, pointers point to
    the previous occurrences of symbols

Input: aaababbbaaabaaaaaaabaabb
[Figure: the same string with arrows pointing from each phrase back to its previous occurrence; the arrows were lost in transcription]
23
Adaptive Ziv-Lempel coding
  • Instead of dictionary entries, pointers point to
    the previous occurrences of symbols

Input:   aaababbbaaabaaaaaaabaabb
Phrases: a | aa | b | ab | bb | aaa | ba | aaaa | aab | aabb   (numbered 1-10)
Output:  0a 1a 0b 1b 3b 2a 3a 6a 2b 9b   (each pair: number of the longest previous phrase, plus one new symbol)
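
A runnable sketch of this LZ78-style parse; it reproduces the ten (phrase number, new symbol) pairs shown above:

```python
def lz78_encode(text):
    """Each output pair is (index of the longest previously seen phrase,
    next symbol); index 0 is the empty phrase."""
    phrases = {"": 0}                    # phrase -> phrase number
    out, cur = [], ""
    for ch in text:
        if cur + ch in phrases:
            cur += ch                    # keep extending the match
        else:
            out.append((phrases[cur], ch))
            phrases[cur + ch] = len(phrases)   # register the new phrase
            cur = ""
    if cur:                              # flush a trailing match, if any
        out.append((phrases[cur[:-1]], cur[-1]))
    return out

print(lz78_encode("aaababbbaaabaaaaaaabaabb"))
# [(0,'a'), (1,'a'), (0,'b'), (1,'b'), (3,'b'), (2,'a'), (3,'a'), (6,'a'), (2,'b'), (9,'b')]
```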
24
Adaptive Ziv-Lempel coding
  • Good compression ratio (about 4 bits/character)
  • Suitable for general data compression and widely used (e.g., zip, compress)
  • Does not allow decoding to start in the middle of a compressed file → direct access is impossible without decompressing from the beginning

25
Arithmetic coding
  • The input text (data) is converted to a real
    number between 0 and 1, such as 0.328701
  • Good compression ratio (2 bits/character)
  • Slow
  • Cannot start decoding in the middle of a file
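
A toy float-based sketch of the interval narrowing; production coders use integer arithmetic with renormalization to avoid the precision loss floats suffer on long inputs:

```python
def arithmetic_encode(text, probs):
    """Narrow [low, high) once per symbol; any number in the final
    interval identifies the whole text (given its length)."""
    ranges, acc = {}, 0.0
    for sym, p in probs.items():         # cumulative interval per symbol
        ranges[sym] = (acc, acc + p)
        acc += p
    low, high = 0.0, 1.0
    for ch in text:
        lo, hi = ranges[ch]
        width = high - low
        low, high = low + width * lo, low + width * hi
    return (low + high) / 2

print(arithmetic_encode("aab", {"a": 0.5, "b": 0.5}))   # 0.1875, inside [0.125, 0.25)
```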

26
Symbols and alphabetfor textual data
  • Words are more appropriate symbols for natural
    language text
  • Example: "for each rose, a rose is a rose"
  • Alphabet:
  • a, each, for, is, rose, ',', ' ' (space)
  • Always assume a single space after a word unless there is another separator:
  • a, each, for, is, rose, ', ' (comma followed by a space)

27
Huffman coding
  • Assign shorter codes (bits) to more frequent
    symbols and longer codes (bits) to less frequent
    symbols
  • Example: "for each rose, a rose is a rose"

28
Example
[Figure: Huffman tree for the example, over the leaves rose, a, ', ', each, for, is with their frequencies; the two lowest-frequency nodes are merged repeatedly]
29
Example
[Figure: the same tree with each left branch labeled 0 and each right branch labeled 1, assigning a bit string to every leaf]
30
Example
[Figure: the finished tree; reading the branch labels from the root to a leaf gives that symbol's code, so the frequent symbols rose and a get the shortest codes]
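
A sketch of Huffman code construction for the example's word frequencies; the exact 0/1 assignment can differ from the figure, but the code lengths (2 bits for rose and a, 3 bits for the rest) are the same:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Repeatedly merge the two lowest-frequency nodes; each merge
    prepends one more bit to every code in the merged subtrees."""
    tie = count()                        # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

freqs = {"rose": 3, "a": 2, "for": 1, "each": 1, "is": 1, ", ": 1}
print(huffman_codes(freqs))              # rose and a get 2-bit codes, the rest 3 bits
```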
31
Canonical tree
- The height of the left subtree of any node is never smaller than that of the right subtree.
- All leaves are in increasing order of probabilities (frequencies) from left to right.
[Figure: the example tree redrawn in canonical form, with 0/1 branch labels and leaves ordered by frequency]
32
Advantages of canonical tree
  • Smaller data for decoding:
  • a non-canonical tree needs a mapping table between symbols and codes;
  • a canonical tree needs only a (sorted) list of symbols
  • and, for each level, a pair of (number of symbols, numerical value of the first code),
  • e.g., (0, NA), (2, 2), (4, 0).
  • More efficient encoding/decoding.
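
A sketch of canonical code assignment that reproduces the (number of symbols, first code) pairs quoted above, assuming the code lengths from the example (2 bits for rose and a, 3 bits for the other four symbols):

```python
from collections import defaultdict

def canonical_codes(lengths):
    """Assign codes level by level, deepest level first, starting at 0;
    each shorter level starts at half of (first code + count) of the level below."""
    by_len = defaultdict(list)
    for sym, ln in lengths.items():
        by_len[ln].append(sym)
    out, first = {}, 0
    for ln in sorted(by_len, reverse=True):
        for i, sym in enumerate(sorted(by_len[ln])):
            out[sym] = format(first + i, f"0{ln}b")
        first = (first + len(by_len[ln])) >> 1
    return out

codes = canonical_codes({"rose": 2, "a": 2, "each": 3, "for": 3, "is": 3, ", ": 3})
print(codes)   # 3-bit codes start at 000 (4 symbols); 2-bit codes start at 10 (2 symbols)
```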

33
Byte-oriented Huffman coding
  • Use whole bytes instead of binary coding

[Figure: two 256-ary tree shapes. In the non-optimal tree, 254 empty nodes sit at the same depth as the symbols; in the optimal tree, 254 symbols are placed at the shallower level and only 2 symbols, alongside the empty nodes, at the deeper one]
34
Comparison of methods
35
Compression of inverted files
  • An inverted file is composed of:
  • a vector containing all distinct words in the text collection;
  • for each word, a list of the documents in which that word occurs.
  • Types of code:
  • Unary
  • Elias-γ (gamma)
  • Elias-δ (delta)
  • Golomb
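
A sketch of unary and Elias-γ coding applied to the gaps of a made-up inverted list; document ids are stored as gaps because small gaps get short codes:

```python
def unary(n):
    """Unary code for n: n-1 one-bits followed by a zero."""
    return "1" * (n - 1) + "0"

def elias_gamma(n):
    """Elias-gamma: the length of n's binary form in unary,
    followed by that binary form without its leading 1-bit."""
    b = bin(n)[2:]
    return "1" * (len(b) - 1) + "0" + b[1:]

postings = [3, 5, 20, 21, 23, 76]                          # ascending document ids
gaps = [d - p for p, d in zip([0] + postings, postings)]   # [3, 2, 15, 1, 2, 53]
print("".join(elias_gamma(g) for g in gaps))               # the compressed list
```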

36
Conclusions
  • Text transformation: meaning instead of strings
  • Lexical analysis
  • Stopwords
  • Stemming
  • Text compression:
  • searchable
  • random access
  • model + coding
  • inverted files

37
Thanks.
  • Any questions?