Language Specific Crawler for Myanmar Web Pages - PowerPoint PPT Presentation

Provided by: yoshiki
1
Language Specific Crawler for Myanmar Web Pages
Pann Yu Mon Management and Information System
Engineering Department Nagaoka University of
Technology, Japan
10th July 2008
2
Outlines
  1. Introduction
  2. Design of Crawler
  3. Evaluation
  4. Conclusions
  5. Limitations
  6. Themes for Doctoral Study

3
Introduction
  • Internet users are about 0.1% of the population
  • Few Myanmar language contents found on the Web
  • No search engine is available for Myanmar
    language

Country         Population    Number of Internet users   Internet users (%)
Myanmar (.mm)   52,373,958    63,700                     0.123
4
Challenges for Language Specific Crawler (LSC)
for Myanmar
  • Multiple encodings used
  • Myanmar pages are sparsely scattered over the
    entire Web
  • Collect as many pages as possible with limited
    time and computer resources

Myanmar Pages
Non-Myanmar Pages
5
Language Specific Search Engine: Basic Architecture
[Architecture diagram. Components: WWW; language specific crawler
(Crawler, Language Identification); Page repository; Parser; Indexer;
Ranking engine; Query engine; query in, results out.]
6
Objectives
  • To propose Language Specific Crawler (LSC) which
    enables maximum collection of web pages written
    in target language, independent of domains.
  • To efficiently collect Myanmar web pages which
    then can be indexed and sorted and finally to be
    used in Search Engine.

7
2. Design of Crawler (cont.)
  • Challenges
      • Multiple encodings used
      • Myanmar pages are sparsely scattered over the
        entire Web
      • Collect as many pages as possible with limited
        time and computer resources
  • Design of Crawler
      • Automatic Language Identification (LI) capable of
        handling multiple encodings
      • Language-based tracing of links
      • Choice of seed URLs
      • Multi-threaded crawling
      • Robots.txt exclusion
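
The robots exclusion item can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch: the slides do not say how the crawler implements the check, and the user agent name `lsc-bot`, the `example.com` URLs, and the rules in `EXAMPLE_ROBOTS_TXT` are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this
# file from each host before fetching pages from it.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule object without any network access."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_robot_rules(EXAMPLE_ROBOTS_TXT)
# "lsc-bot" is an assumed user agent name, not one taken from the slides.
allowed = rules.can_fetch("lsc-bot", "http://example.com/index.html")
blocked = rules.can_fetch("lsc-bot", "http://example.com/private/page.html")
print(allowed, blocked)
```

A crawler would call `can_fetch` before every download and skip URLs for which it returns `False`.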

8
Crawling Process
[Diagram: World Wide Web, Get URLs, Language Identifier, Database]
  1. Extract URLs
  2. Language identification
  3. Saving into database
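
The three steps can be sketched as a breadth-first loop. This is an illustration under assumptions, not the presented crawler: `crawl`, `fetch`, `is_target_language`, and `FAKE_WEB` are hypothetical names, the link extraction is a deliberately simple regular expression, and links are followed from every fetched page because the slides do not spell out the link-tracing policy.

```python
import re
from collections import deque
from typing import Callable, Dict, Iterable, Optional

# Simplistic link extractor (assumption): absolute links in double quotes only.
HREF_RE = re.compile(r'href="(http[^"]+)"')


def crawl(seeds: Iterable[str],
          fetch: Callable[[str], Optional[str]],
          is_target_language: Callable[[str], bool],
          max_depth: int = 6) -> Dict[str, str]:
    """Fetch pages breadth-first, keep those in the target language."""
    seeds = list(seeds)
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)
    saved: Dict[str, str] = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:          # download failed or page unknown
            continue
        if depth < max_depth:     # step 1: extract URLs
            for link in HREF_RE.findall(html):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        if is_target_language(html):   # step 2: language identification
            saved[url] = html          # step 3: save (a dict stands in for the database)
    return saved


# Tiny in-memory "Web" so the sketch runs without network access.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> MYANMAR',
    "http://b.example/": '<a href="http://a.example/">a</a> english',
}
saved = crawl(["http://a.example/"], FAKE_WEB.get, lambda html: "MYANMAR" in html)
print(list(saved))
```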
9
Multi-threaded Crawler
  • A single-threaded crawling loop takes a large amount of
    time.
  • Multi-threading can provide a reasonable speed-up
    and efficient use of the available bandwidth.
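
A minimal sketch of the idea, assuming a thread pool from Python's standard library: several downloads wait on the network at the same time instead of one after another. `fetch_all` and `slow_fetch` are hypothetical helpers, and `slow_fetch` only simulates network latency with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fetch_all(urls: List[str], fetch: Callable[[str], str],
              workers: int = 8) -> Dict[str, str]:
    """Download all URLs concurrently and map each URL to its content."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its page.
        return dict(zip(urls, pool.map(fetch, urls)))


def slow_fetch(url: str) -> str:
    """Stand-in for an HTTP download: sleeps instead of using the network."""
    time.sleep(0.05)
    return f"<html>{url}</html>"


urls = [f"http://example.com/page{i}" for i in range(8)]
pages = fetch_all(urls, slow_fetch)
print(len(pages))
```

With eight workers the eight simulated downloads overlap, whereas a single loop would wait for each one in turn.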

10
Language Identification (cont.)
  • G2LI is an n-gram based language identification
    algorithm for Web documents.
  • Advantages
  • Requires small computing resources.
  • Small training set (520 KB. Length is enough).
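
To show the general idea of n-gram based identification, here is a small sketch that compares byte trigram profiles with cosine similarity. It is not the G2LI algorithm itself, whose details are not given in the slides; the function names, the tiny training samples, and the profile labels are assumptions. Working on raw bytes means one profile can be trained per language and encoding pair, which fits the multiple-encoding requirement.

```python
from collections import Counter
from math import sqrt
from typing import Dict


def ngram_profile(data: bytes, n: int = 3) -> Counter:
    """Count every run of n consecutive bytes in the data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def identify(data: bytes, profiles: Dict[str, Counter], n: int = 3) -> str:
    """Return the label of the training profile most similar to the data."""
    target = ngram_profile(data, n)
    return max(profiles, key=lambda label: cosine(target, profiles[label]))


# Toy training data (assumption): the word "Myanmar" in Myanmar script,
# written with escapes, and one English sentence.
MYANMAR_WORD = "\u1019\u103c\u1014\u103a\u1019\u102c"
profiles = {
    "myanmar/utf-8": ngram_profile(((MYANMAR_WORD + " ") * 20).encode("utf-8")),
    "english/ascii": ngram_profile(b"the quick brown fox jumps over the lazy dog " * 20),
}
guess_my = identify((MYANMAR_WORD * 3).encode("utf-8"), profiles)
guess_en = identify(b"the lazy fox", profiles)
print(guess_my, guess_en)
```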

11
Various Myanmar Fonts and Encodings
Font Name     Encoding Scheme
BIT           Partial Unicode
CE Classic    Graphic Encoding
Myanmar1      Unicode
Myanmar2      Unicode
MyaZedi       Partial Unicode
MyMyanmar     Partial Unicode
Popular       Graphic Encoding
Wininwa       Graphic Encoding
Zawgyi-One    Partial Unicode
12
Database Design (cont.)
  • Save URLs in a CSV file
  • Save page contents in a Derby database

URL
ID   URL
1    http://www.google.com

CONTENT
ID   ParentURL                    URL                                  Level   Content
1    http://www.google.com        http://www.google.com                0       xxx
1    http://www.google.com        http://www.google.com/mail           1       xxx
1    http://www.google.com/mail   http://www.google.com/mail/signout   2       xxx
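
The CONTENT table can be sketched as follows. SQLite (via Python's standard `sqlite3` module) is used here only as a stand-in for the database named on the slide, and the column names and types are assumptions derived from the column headings. Unlike the sample rows above, this sketch lets the database assign a distinct ID to each row.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE content (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        parent_url TEXT    NOT NULL,  -- page on which the link was found
        url        TEXT    NOT NULL,  -- page that was downloaded
        level      INTEGER NOT NULL,  -- link depth from the seed URL
        content    TEXT    NOT NULL   -- downloaded page content
    )
    """
)

rows = [
    ("http://www.google.com", "http://www.google.com", 0, "xxx"),
    ("http://www.google.com", "http://www.google.com/mail", 1, "xxx"),
    ("http://www.google.com/mail", "http://www.google.com/mail/signout", 2, "xxx"),
]
conn.executemany(
    "INSERT INTO content (parent_url, url, level, content) VALUES (?, ?, ?, ?)",
    rows,
)

# Example query: the deepest page reached so far.
deepest = conn.execute(
    "SELECT url, level FROM content ORDER BY level DESC LIMIT 1"
).fetchone()
row_count = conn.execute("SELECT COUNT(*) FROM content").fetchone()[0]
print(deepest, row_count)
```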

13
Evaluation
  1. Evaluation of the Language Identification (G2LI)
  2. Evaluation of crawling efficiency by means of
    precision and recall
  3. Evaluation of the crawling coverage.

14
A) Evaluation of Language Identifier
G2LI's guess                 Verified language
                             Myanmar     Myanmar (%)   Non-Myanmar   Total
Identified as Myanmar        763 (92%)   87            37 (8%)       800 (100%)
Identified as Non-Myanmar    106         13            1094          1200
Total                        869         100           1131          2000
15
Accuracy Rate and Error Rate
  • T = number of downloaded pages (2000)

Accuracy rate = (763 + 1094) / 2000 ≈ 93%
Error rate    = (37 + 106) / 2000 ≈ 7%
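
The two rates follow directly from the four cells of the evaluation table. A short check, with variable names chosen for this sketch:

```python
# Cell counts from the language identifier evaluation table.
myanmar_as_myanmar = 763        # correctly identified Myanmar pages
non_myanmar_as_myanmar = 37     # non-Myanmar pages identified as Myanmar
myanmar_as_non_myanmar = 106    # Myanmar pages identified as non-Myanmar
non_myanmar_as_non_myanmar = 1094

total = (myanmar_as_myanmar + non_myanmar_as_myanmar
         + myanmar_as_non_myanmar + non_myanmar_as_non_myanmar)

accuracy = (myanmar_as_myanmar + non_myanmar_as_non_myanmar) / total
error_rate = (non_myanmar_as_myanmar + myanmar_as_non_myanmar) / total

print(f"accuracy = {accuracy:.4f}, error rate = {error_rate:.4f}")
```

The exact values are 0.9285 and 0.0715, which the slide rounds to 93% and 7%.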
16
Misclassified Cases
  • 1) Relevant but not retrieved
      • Bilingual pages written in Myanmar and English.
      • Web pages using numeric character references,
        e.g., &#4156; and &#4153;
  • 2) Retrieved but not relevant
      • The misclassified pages are all English Web pages
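
Pages written with numeric character references contain only ASCII bytes, so a byte-level identifier sees no Myanmar text in them. One possible remedy, shown here as an assumption rather than something the slides propose, is to decode the references with Python's standard `html.unescape` before identification. `count_myanmar_chars` is a hypothetical helper that counts code points in the Unicode Myanmar block (U+1000 to U+109F).

```python
import html


def count_myanmar_chars(text: str) -> int:
    """Count characters in the Unicode Myanmar block (U+1000 to U+109F)."""
    return sum(1 for ch in text if "\u1000" <= ch <= "\u109f")


raw = "&#4156;&#4153;"          # how the page source looks
decoded = html.unescape(raw)    # numeric references become real characters

before = count_myanmar_chars(raw)
after = count_myanmar_chars(decoded)
print(before, after)
```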

17
B) Precision and Recall
  • Precision
      • The ability to retrieve top-ranked documents that
        are mostly relevant.
  • Recall
      • The ability of the search to find all of the
        relevant items in the entire Web space.
  • Where X = relevant documents and Y = retrieved
    documents
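
The formulas on this slide did not survive in the transcript. Using the standard definitions, precision = |X ∩ Y| / |Y| and recall = |X ∩ Y| / |X|, the computation can be sketched with sets. The function name and the sample documents are made up for the example.

```python
from typing import Set, Tuple


def precision_recall(relevant: Set[str], retrieved: Set[str]) -> Tuple[float, float]:
    """Standard precision and recall for a set of retrieved documents."""
    hits = len(relevant & retrieved)   # |X intersect Y|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


X = {"a", "b", "c", "d"}   # relevant documents
Y = {"b", "c", "e"}        # retrieved documents
precision, recall = precision_recall(X, Y)
print(precision, recall)
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were retrieved (recall 1/2).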

18
How to estimate total number of Web pages
19
Total number of URLs returned by Google for each keyword

Keyword                 Number of URLs
??? (Day)               68,500
???????? (But)          41,000
?? (Human Being)        117,000
??? (Now)               31,500
?????? (Myanmar)        56,500
?? (He)                 46,600
Total                   361,100

Experiment period: 25th June 2008 to 27th June 2008.
20

Day       But      68,500   45,200    13,700   205,000
Day       Human    68,500   120,000   14,200   564,401
Day       Now      68,500   35,300    11,800   182,860
...
Now       He       31,500   46,600    10,000   140,805
Myanmar   He       56,500   46,600    11,200   225,496
Total                                          4,905,169
Average of 15 pairs of keyword combinations (estimated X)   327,011
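
The slides do not state the estimation formula, but the last column of the first three rows is reproduced by a capture-recapture style estimate: (URLs for keyword A × URLs for keyword B) / (URLs matching both), using the single-keyword counts 41,000 (But), 117,000 (Human Being), and 31,500 (Now) from the keyword slide and reading the third number in each row as the count for both keywords. That reading is an assumption, and `estimate_total` is a hypothetical name.

```python
def estimate_total(count_a: int, count_b: int, count_both: int) -> float:
    """Capture-recapture style estimate of the total population size."""
    return count_a * count_b / count_both


# (keyword A count, keyword B count, count for both keywords)
rows = [
    (68_500, 41_000, 13_700),    # Day / But
    (68_500, 117_000, 14_200),   # Day / Human Being
    (68_500, 31_500, 11_800),    # Day / Now
]
estimates = [round(estimate_total(*row)) for row in rows]
print(estimates)

# The reported average is the reported total divided by the 15 pairs.
average = round(4_905_169 / 15)
print(average)
```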
21
Precision and Recall of Crawling: Entertainment Site Case
22
Precision and Recall of Crawling: Blog Site Case
23
Precision and Recall of Crawling: News Site Case
24
C) Crawling Coverage
  • Crawling parameters
      • Seed URLs: 35
      • Level of depth: 6
      • Crawling time: 2 weeks
      • CPU: 2.40 GHz
      • Memory: 1 GB
      • Internet connection: 100 Mbit per second

Domains        Number of pages collected   Share (%)
.mm            3,555                       1.1
.com           276,554                     83.2
Other gTLDs    52,245                      15.7
Total          332,354                     100.0
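
The percentage column can be recomputed from the page counts, and collected URLs can be grouped by domain in the same way. This is a sketch: `domain_group` is a hypothetical helper, and it places every host that ends in neither `.mm` nor `.com` into the "Other gTLDs" group without checking that the suffix really is a generic top-level domain.

```python
from urllib.parse import urlparse


def domain_group(url: str) -> str:
    """Assign a URL to one of the three domain groups used in the table."""
    host = urlparse(url).hostname or ""
    if host.endswith(".mm"):
        return ".mm"
    if host.endswith(".com"):
        return ".com"
    return "Other gTLDs"


# Page counts from the coverage table.
counts = {".mm": 3555, ".com": 276554, "Other gTLDs": 52245}
total = sum(counts.values())
shares = {group: round(100 * n / total, 1) for group, n in counts.items()}
print(total, shares)
print(domain_group("http://www.example.com.mm/news"))
```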
25
Distribution of the estimated total number of Myanmar pages
Estimated average: 327,011
Collected: 332,354
26
Conclusion
  • The proposed crawler design proved to work as an
    LSC for the Myanmar language
  • The LSC can download Myanmar pages on the Web at a
    satisfactory level
  • The proposed LSC can be used as part of a Myanmar
    search engine

27
Limitations of LSC
  • How to reach isolated Myanmar pages (choice of
    seed-URLs, etc.)
  • Misidentification by the language identifier (in
    particular, bilingual pages in English and
    Myanmar need to be collected)
  • Improving the speed of the LSC

28
Themes for doctoral study
  1. Lexicon
  2. Indexing
  3. Code conversion (Transcoding)
  4. Stop words removal
  5. Stemming algorithm

29
Language Specific Search Engine: Basic Architecture
[Architecture diagram. Components of the language specific search
engine: WWW; language specific crawler (Crawler, Language
Identification); Page repository; Parser; Indexer; Ranking engine;
Query engine; query in, results out.]
30
1. Lexicon
  • Lexicon is also a synonym for dictionary or
    encyclopedic dictionary.
  • In linguistics, the lexicon of a language is its
    vocabulary, including its words and expressions.

[Diagram labels: Web pages (URLs), daily newspaper, dictionary, lexicon]
31
2. Indexing
Indexing is the process of recording, for each keyword,
which documents of a corpus contain it.
[Diagram labels: Indexer, Database]

Database
ID     Web pages   Lexicon
1      2,3         ????
2      8           ?????????
3      6           ??????????
4      N
5      4
...
N-1    5
N      7
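
A minimal sketch of such an index, mapping each lexicon entry to the IDs of the pages that contain it. `build_index` and the sample data are hypothetical, and English placeholder words are used because the Myanmar entries did not survive in the transcript. Substring matching is used instead of splitting on spaces, on the assumption that the target text is not space-delimited.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(pages: Dict[int, str], lexicon: Set[str]) -> Dict[str, List[int]]:
    """Map each lexicon word to the sorted IDs of the pages containing it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for page_id in sorted(pages):
        for word in lexicon:
            if word in pages[page_id]:   # substring test, no tokenization
                index[word].append(page_id)
    return dict(index)


# Placeholder pages and lexicon entries (assumptions for the example).
pages = {2: "crawlerdesign", 3: "crawlerevaluation", 8: "lexiconbuilding"}
lexicon = {"crawler", "lexicon"}
index = build_index(pages, lexicon)
print(index)
```

As in the table above, one entry points to pages 2 and 3 and another to page 8.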

32
3. Code Conversion
[Diagram: Web page contents are either Unicode or non-Unicode;
transcoding converts non-Unicode contents to Unicode, matching the
lexicon, which is encoded in Unicode. Further labels: Server, Client.]
33
4. Stop Words Removal
  • Stop words are defined as non-information-bearing
    words.
  • Myanmar sentences can be tokenized by eliminating
    stop words.

[Example: a Myanmar sentence (original script not recoverable) is
split at its stop words, leaving the tokens computer (N),
students (N), and useful (Adj).]
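
The tokenization idea can be sketched by splitting unsegmented text wherever a stop word occurs. Everything in this example is hypothetical: `split_on_stop_words` is an assumed name, and the markers `XX` and `YY` stand in for Myanmar stop words that are not recoverable from the transcript.

```python
import re
from typing import Iterable, List


def split_on_stop_words(text: str, stop_words: Iterable[str]) -> List[str]:
    """Split text at every stop word and drop the stop words themselves."""
    words = sorted(stop_words, key=len, reverse=True)  # try longest first
    if not words:
        return [text] if text else []
    pattern = "|".join(re.escape(word) for word in words)
    return [token for token in re.split(pattern, text) if token]


# "XX" and "YY" are placeholder stop words, not real Myanmar words.
tokens = split_on_stop_words("computerXXstudentsYYuseful", ["XX", "YY"])
print(tokens)
```

A real stop word can also occur inside a content word, so a practical version would need extra checks that this sketch leaves out.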
34
Stop-word list: English vs. Myanmar
1. Subject personal pronouns
   English: I, you, he, she, it, we, you, they
   Myanmar: uRefawmf? uRefr? ig? usKyf? uREkfyf? usaemf?
2. Object personal pronouns
3. Reflexive personal pronouns
4. Relative pronouns
5. Possessive pronouns and adjectives
6. Indefinite pronouns and adjectives
7. Demonstrative pronouns and adjectives
8. Conjunctions
9. Questions
10. Other (pronouns, prepositions)
35
5. Stemming
  • A stemming algorithm is a conflation procedure
      • It reduces all words with the same root to a
        single root
  • A stem is the portion of a word which is left
    after the removal of its affixes (i.e., prefixes
    and suffixes)
      • e.g., connect is the stem for the variants
        connected, connecting, and connections
      • e.g., ??? is the stem for the variants
        ????????, ?????? and ??????
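
The English example can be reproduced with a naive suffix-stripping sketch. This is not an established stemming algorithm and not the method planned for Myanmar; the suffix list and the minimum stem length of three characters are assumptions chosen so that the example words work.

```python
# Suffixes are tried in order, longest first (assumed list for the example).
SUFFIXES = ("ions", "ion", "ing", "ed", "s")
MIN_STEM_LENGTH = 3


def stem(word: str) -> str:
    """Strip the first matching suffix if a long enough stem remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


variants = ["connect", "connected", "connecting", "connections"]
stems = [stem(word) for word in variants]
print(stems)
```

All four variants conflate to the single root `connect`.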

36
  • Thank you!
  • Any questions?