1
Text operations
  • Prepared By Loay Alayadhi
  • 425121605
  • Supervised by Dr. Mourad Ykhlef

2
Document Preprocessing
  • Lexical analysis
  • Elimination of stopwords
  • Stemming of the remaining words
  • Selection of index terms
  • Construction of term categorization structures (thesaurus)

3
Logical View of a Document
[Figure: logical view of a document. Starting from the full text, the pipeline applies structure recognition, normalization of accents, spacing, etc., stopword elimination, noun-group detection, and stemming, then automatic or manual indexing, producing the index terms; the text structure is recognized along the way.]
4
1)Lexical Analysis of the Text
  • Lexical analysis: convert an input stream of characters into a stream of words.
  • The major objective is the identification of the words in the text. How?
  • Digits: ignoring numbers is a common policy.
  • Hyphens: e.g., state-of-the-art.
  • Punctuation marks: usually removed; exceptions such as 510 B.C.
  • Case: usually folded to a single case (lowercase).
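
A minimal sketch of such a lexical analyzer in Python; the policy choices (case folding, hyphen splitting, dropping pure numbers, stripping punctuation) are common defaults, not the only options. Note how it mangles "510 B.C.", exactly the kind of exception mentioned above:

```python
import re

def tokenize(text, keep_digits=False):
    """Convert a character stream into a stream of words."""
    text = text.lower()                     # case folding
    text = text.replace("-", " ")           # split hyphenated forms: state-of-the-art
    words = re.findall(r"[a-z0-9]+", text)  # strip punctuation marks
    if not keep_digits:
        words = [w for w in words if not w.isdigit()]  # ignore pure numbers
    return words

print(tokenize("State-of-the-art methods, known since 510 B.C.?"))
# ['state', 'of', 'the', 'art', 'methods', 'known', 'since', 'b', 'c']
```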

5
2) Elimination of Stopwords
  • Words that appear too often are not useful for IR.
  • Stopwords: words that appear in more than 80% of the documents in the collection are stopwords and are filtered out as potential index words.
  • Problem:
  • How do you search for "to be or not to be"?
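
A sketch of the 80% rule in Python; the toy collection and threshold are illustrative only, and the final line shows how the famous query is nearly destroyed:

```python
from collections import Counter

def find_stopwords(docs, threshold=0.8):
    """Treat words occurring in more than `threshold` of all documents as stopwords."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))              # document frequency: count each word once per document
    return {w for w, n in df.items() if n / len(docs) > threshold}

docs = [["to", "be", "or", "not", "to", "be"],
        ["to", "see", "or", "not"],
        ["be", "here", "or", "not", "to"]]
stop = find_stopwords(docs)              # {'to', 'or', 'not'}
print([w for w in docs[0] if w not in stop])   # ['be', 'be']
```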

6
3) Stemming
  • A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes or suffixes).
  • Example:
  • connect, connected, connecting, connection, connections
  • Removal strategies:
  • affix removal: intuitive, simple
  • table lookup
  • successor variety
  • n-gram
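
A toy affix-removal stemmer in Python; the suffix list is a made-up illustration (real stemmers such as Porter's use carefully ordered rule sets):

```python
SUFFIXES = ["ions", "ion", "ings", "ing", "ed", "s"]   # toy list, longest variants first

def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["connect", "connected", "connecting", "connection", "connections"]:
    print(w, "->", stem(w))   # all five reduce to "connect"
```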

7
4) Index Terms Selection
  • Motivation:
  • A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
  • Most of the semantics is carried by the nouns.
  • Identification of noun groups:
  • A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold.
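
A sketch of noun-group detection, assuming part-of-speech tags are already available and approximating syntactic distance by token distance; the names and the threshold are illustrative:

```python
def noun_groups(tagged, max_dist=2):
    """Group nouns whose distance in the text does not exceed `max_dist`.
    `tagged` is a list of (word, pos) pairs; tagging is assumed done elsewhere."""
    positions = [i for i, (_, pos) in enumerate(tagged) if pos == "NOUN"]
    groups, current = [], []
    for i in positions:
        if current and i - current[-1] > max_dist:
            groups.append(current)       # gap too large: close the current group
            current = []
        current.append(i)
    if current:
        groups.append(current)
    return [[tagged[i][0] for i in g] for g in groups]

tagged = [("the", "DET"), ("house", "NOUN"), ("of", "ADP"), ("lords", "NOUN"),
          ("voted", "VERB"), ("again", "ADV"), ("on", "ADP"), ("reform", "NOUN")]
print(noun_groups(tagged))   # [['house', 'lords'], ['reform']]
```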

8
5) Thesaurus Construction
  • Thesaurus a precompiled list of important words
    in a given domain of knowledge and for each word
    in this list, there is a set of related words.
  • A controlled vocabulary for the indexing and
    searching. Why?
  • Normalization,
  • indexing concept ,
  • reduction of noise,
  • identification, ..ect

9
The Purpose of a Thesaurus
  • To provide a standard vocabulary for indexing and
    searching
  • To assist users with locating terms for proper
    query formulation
  • To provide classified hierarchies that allow the
    broadening and narrowing of the current query
    request

10
Thesaurus (cont.)
  • Unlike a common dictionary, which lists words with their explanations,
  • a thesaurus may contain all words in a language
  • or only words in a specific domain,
  • together with a lot of other information, especially the relationships between words:
  • classification of words in the language,
  • word relationships such as synonyms and antonyms.

11
Roget's Thesaurus: Example
  • cowardly (adjective)
  • Ignobly lacking in courage: cowardly turncoats.
  • Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)
  • http://www.thesaurus.com
  • http://www.dictionary.com/

12
Thesaurus Term Relationships
  • BT: broader term
  • NT: narrower term
  • RT: related term (non-hierarchical, but related)

13
Use of Thesaurus
  • Indexing: select the most appropriate thesaurus entries for representing the document.
  • Searching: design the most appropriate search strategy.
  • If the search does not retrieve enough documents, the thesaurus can be used to expand the query.
  • If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary.
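
A sketch of thesaurus-based query reformulation; the thesaurus fragment is made up, but the BT/NT/RT relations are the ones defined above:

```python
# Hypothetical thesaurus fragment: term -> {relation: related terms}.
THESAURUS = {
    "cowardly": {"RT": ["craven", "gutless", "faint-hearted"]},
    "vehicle":  {"NT": ["car", "truck"], "BT": ["machine"]},
}

def reformulate(terms, relation):
    """Extend a query with broader (BT), narrower (NT), or related (RT) terms."""
    expanded = list(terms)
    for t in terms:
        expanded += THESAURUS.get(t, {}).get(relation, [])
    return expanded

print(reformulate(["cowardly"], "RT"))   # too few hits: add related terms
print(reformulate(["vehicle"], "NT"))    # too many hits: suggest narrower terms
```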

14
Document clustering
  • Document clustering: the operation of grouping together similar documents in classes.
  • Global vs. local:
  • Global: the whole collection;
  • done at compile time, a one-time operation.
  • Local:
  • cluster the results of a specific query;
  • done at runtime, with each query.
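
A toy single-pass clustering sketch using Jaccard similarity over term sets; real systems use proper clustering algorithms, but the grouping idea is the same whether applied globally (whole collection) or locally (one query's results):

```python
def cluster(docs, threshold=0.5):
    """Greedy clustering: join the first cluster whose representative
    set is similar enough, else start a new cluster."""
    clusters = []                          # list of (representative terms, member ids)
    for i, terms in enumerate(docs):
        for rep, members in clusters:
            if len(terms & rep) / len(terms | rep) >= threshold:   # Jaccard similarity
                members.append(i)
                break
        else:
            clusters.append((set(terms), [i]))
    return [members for _, members in clusters]

docs = [{"rose", "flower"}, {"rose", "flower", "red"}, {"coding", "huffman"}]
print(cluster(docs))   # [[0, 1], [2]]
```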

15
Text Compression
  • Why is text compression important?
  • Less storage space
  • Less time for data transmission
  • Less time to search (if the compression method
    allows direct search without decompression)

16
Terminology
  • Symbol:
  • The smallest unit for compression (e.g., a character, a word, or a fixed number of characters)
  • Alphabet:
  • The set of all possible symbols
  • Compression ratio:
  • The size of the compressed file as a fraction of the size of the uncompressed file
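
For example, under this definition a smaller ratio is better (the sizes below are made up):

```python
def compression_ratio(compressed_size, original_size):
    """Size of the compressed file as a fraction of the uncompressed file."""
    return compressed_size / original_size

print(compression_ratio(250, 1000))   # 0.25: the compressed file is 25% of the original
```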

17
Types of compression models
  • Static models
  • Assume some data properties in advance (e.g.,
    relative frequencies of symbols) for all input
    text
  • Allow direct (or random) access
  • Poor compression ratios when the input text
    deviates from the assumption

18
Types of compression models
  • Semi-static models
  • Learn data properties in a first pass
  • Compress the input data in a second pass
  • Allow direct (or random) access
  • Good compression ratio
  • Must store the learned data properties for
    decoding
  • Must have whole data at hand

19
Types of compression models
  • Adaptive models
  • Start with no information
  • Progressively learn the data properties as the
    compression process goes on
  • Need only one pass for compression
  • Do not allow random access
  • Decompression cannot start in the middle

20
General approaches to text compression
  • Dictionary methods
  • (Basic) dictionary method
  • Ziv-Lempel's adaptive method
  • Statistical methods
  • Arithmetic coding
  • Huffman coding

21
Dictionary methods
  • Replace a sequence of symbols with a pointer to a
    dictionary entry

Input: aaababbbaaabaaaaaaabaabb
A given dictionary may be suitable for one text but unsuitable for another.
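
A sketch of greedy longest-match encoding against a fixed, made-up dictionary; this dictionary happens to fit this particular string well, which is the point of the caveat above:

```python
DICT = ["a", "b", "aa", "ab", "bb"]    # hypothetical static dictionary

def dict_encode(text):
    """Replace symbol sequences by pointers (indices) into the dictionary,
    always taking the longest entry that matches at the current position."""
    out, i = [], 0
    while i < len(text):
        entry = max((e for e in DICT if text.startswith(e, i)), key=len)
        out.append(DICT.index(entry))
        i += len(entry)
    return out

print(dict_encode("aaababbbaaabaaaaaaabaabb"))   # 12 pointers for 24 symbols
```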
22
Adaptive Ziv-Lempel coding
  • Instead of dictionary entries, pointers point to
    the previous occurrences of symbols

Input: aaababbbaaabaaaaaaabaabb
[Figure: the same string with arrows pointing from each phrase back to its previous occurrence; the arrows were lost in transcription]
23
Adaptive Ziv-Lempel coding
  • Instead of dictionary entries, pointers point to
    the previous occurrences of symbols

Input:   aaababbbaaabaaaaaaabaabb
Phrases: a | aa | b | ab | bb | aaa | ba | aaaa | aab | aabb   (numbered 1-10)
Output:  0a 1a 0b 1b 3b 2a 3a 6a 2b 9b   (each pair: number of the longest previous phrase, plus one new symbol)
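
A runnable sketch of this LZ78-style parse; it reproduces the ten (phrase number, new symbol) pairs shown above:

```python
def lz78_encode(text):
    """Each output pair is (index of the longest previously seen phrase,
    next symbol); index 0 is the empty phrase."""
    phrases = {"": 0}                    # phrase -> phrase number
    out, cur = [], ""
    for ch in text:
        if cur + ch in phrases:
            cur += ch                    # keep extending the match
        else:
            out.append((phrases[cur], ch))
            phrases[cur + ch] = len(phrases)   # register the new phrase
            cur = ""
    if cur:                              # flush a trailing match, if any
        out.append((phrases[cur[:-1]], cur[-1]))
    return out

print(lz78_encode("aaababbbaaabaaaaaaabaabb"))
# [(0,'a'), (1,'a'), (0,'b'), (1,'b'), (3,'b'), (2,'a'), (3,'a'), (6,'a'), (2,'b'), (9,'b')]
```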
24
Adaptive Ziv-Lempel coding
  • Good compression ratio (about 4 bits/character)
  • Suitable for general data compression and widely used (e.g., zip, compress)
  • Does not allow decoding to start in the middle of a compressed file → direct access is impossible without decompressing from the beginning

25
Arithmetic coding
  • The input text (data) is converted to a real
    number between 0 and 1, such as 0.328701
  • Good compression ratio (2 bits/character)
  • Slow
  • Cannot start decoding in the middle of a file
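
A toy float-based sketch of the interval narrowing; production coders use integer arithmetic with renormalization to avoid the precision loss floats suffer on long inputs:

```python
def arithmetic_encode(text, probs):
    """Narrow [low, high) once per symbol; any number in the final
    interval identifies the whole text (given its length)."""
    ranges, acc = {}, 0.0
    for sym, p in probs.items():         # cumulative interval per symbol
        ranges[sym] = (acc, acc + p)
        acc += p
    low, high = 0.0, 1.0
    for ch in text:
        lo, hi = ranges[ch]
        width = high - low
        low, high = low + width * lo, low + width * hi
    return (low + high) / 2

print(arithmetic_encode("aab", {"a": 0.5, "b": 0.5}))   # 0.1875, inside [0.125, 0.25)
```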

26
Symbols and alphabetfor textual data
  • Words are more appropriate symbols for natural
    language text
  • Example: "for each rose, a rose is a rose"
  • Alphabet:
  • a, each, for, is, rose, ',', ' ' (space)
  • Always assume a single space after a word unless there is another separator:
  • a, each, for, is, rose, ', ' (comma followed by a space)

27
Huffman coding
  • Assign shorter codes (bits) to more frequent
    symbols and longer codes (bits) to less frequent
    symbols
  • Example: "for each rose, a rose is a rose"

28
Example
[Figure: Huffman tree for the example, over the leaves rose, a, ', ', each, for, is with their frequencies; the two lowest-frequency nodes are merged repeatedly]
29
Example
[Figure: the same tree with each left branch labeled 0 and each right branch labeled 1, assigning a bit string to every leaf]
30
Example
[Figure: the finished tree; reading the branch labels from the root to a leaf gives that symbol's code, so the frequent symbols rose and a get the shortest codes]
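
A sketch of Huffman code construction for the example's word frequencies; the exact 0/1 assignment can differ from the figure, but the code lengths (2 bits for rose and a, 3 bits for the rest) are the same:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Repeatedly merge the two lowest-frequency nodes; each merge
    prepends one more bit to every code in the merged subtrees."""
    tie = count()                        # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

freqs = {"rose": 3, "a": 2, "for": 1, "each": 1, "is": 1, ", ": 1}
print(huffman_codes(freqs))              # rose and a get 2-bit codes, the rest 3 bits
```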
31
Canonical tree
- The height of the left subtree of any node is never smaller than that of the right subtree.
- All leaves are in increasing order of probabilities (frequencies) from left to right.
[Figure: the example tree redrawn in canonical form, with 0/1 branch labels and leaves ordered by frequency]
32
Advantages of canonical tree
  • Smaller data for decoding:
  • a non-canonical tree needs a mapping table between symbols and codes;
  • a canonical tree needs only a (sorted) list of symbols
  • and, for each level, a pair of (number of symbols, numerical value of the first code),
  • e.g., (0, NA), (2, 2), (4, 0).
  • More efficient encoding/decoding.
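
A sketch of canonical code assignment that reproduces the (number of symbols, first code) pairs quoted above, assuming the code lengths from the example (2 bits for rose and a, 3 bits for the other four symbols):

```python
from collections import defaultdict

def canonical_codes(lengths):
    """Assign codes level by level, deepest level first, starting at 0;
    each shorter level starts at half of (first code + count) of the level below."""
    by_len = defaultdict(list)
    for sym, ln in lengths.items():
        by_len[ln].append(sym)
    out, first = {}, 0
    for ln in sorted(by_len, reverse=True):
        for i, sym in enumerate(sorted(by_len[ln])):
            out[sym] = format(first + i, f"0{ln}b")
        first = (first + len(by_len[ln])) >> 1
    return out

codes = canonical_codes({"rose": 2, "a": 2, "each": 3, "for": 3, "is": 3, ", ": 3})
print(codes)   # 3-bit codes start at 000 (4 symbols); 2-bit codes start at 10 (2 symbols)
```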

33
Byte-oriented Huffman coding
  • Use whole bytes instead of binary coding

[Figure: two 256-ary tree shapes. In the non-optimal tree, 254 empty nodes sit at the same depth as the symbols; in the optimal tree, 254 symbols are placed at the shallower level and only 2 symbols, alongside the empty nodes, at the deeper one]
34
Comparison of methods
35
Compression of inverted files
  • An inverted file is composed of:
  • a vector containing all distinct words in the text collection;
  • for each word, a list of the documents in which that word occurs.
  • Types of code:
  • Unary
  • Elias-γ (gamma)
  • Elias-δ (delta)
  • Golomb
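
A sketch of unary and Elias-γ coding applied to the gaps of a made-up inverted list; document ids are stored as gaps because small gaps get short codes:

```python
def unary(n):
    """Unary code for n: n-1 one-bits followed by a zero."""
    return "1" * (n - 1) + "0"

def elias_gamma(n):
    """Elias-gamma: the length of n's binary form in unary,
    followed by that binary form without its leading 1-bit."""
    b = bin(n)[2:]
    return "1" * (len(b) - 1) + "0" + b[1:]

postings = [3, 5, 20, 21, 23, 76]                          # ascending document ids
gaps = [d - p for p, d in zip([0] + postings, postings)]   # [3, 2, 15, 1, 2, 53]
print("".join(elias_gamma(g) for g in gaps))               # the compressed list
```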

36
Conclusions
  • Text transformation: meaning instead of strings
  • Lexical analysis
  • Stopwords
  • Stemming
  • Text compression:
  • searchable
  • random access
  • model + coding
  • inverted files

37
Thanks.
  • Any questions?