Language Specific Crawler for Myanmar Web Pages - PowerPoint PPT Presentation


Slides: 37

Provided by: yoshiki

Transcript and Presenter's Notes

Title: Language Specific Crawler for Myanmar Web Pages

1
Language Specific Crawler for Myanmar Web Pages
Pann Yu Mon Management and Information System
Engineering Department Nagaoka University of
Technology, Japan
10th July 2008
2
Outlines

Introduction
Design of Crawler
Evaluation
Conclusions
Limitations
Themes for Doctoral Study

3
Introduction

Internet users are about 0.1% of the population
Little Myanmar-language content is found on the Web
No search engine is available for the Myanmar
language

Country         Population    Number of internet users   Internet users (%)
Myanmar (.mm)   52,373,958    63,700                     0.123
4
Challenges for Language Specific Crawler (LSC)
for Myanmar

Multiple encodings used
Myanmar pages are sparsely scattered over the
entire Web
Collect as many pages as possible with limited
time and computer resources

Myanmar Pages
Non-Myanmar Pages
5
Language Specific Search Engine: Basic Architecture
Language specific crawler
WWW
Crawler
Language Identification
Page repository
Ranking engine
Query engine
Parser
Indexer
query
results
6
Objectives

To propose a Language Specific Crawler (LSC) that
maximizes the collection of web pages written in
the target language, independent of domain.
To efficiently collect Myanmar web pages, which
can then be indexed and sorted and finally used
in a search engine.

7
2. Design of Crawler (cont.)

Challenges
Multiple encodings used
Myanmar pages are sparsely scattered over the
entire Web
Collect as many pages as possible with limited
time and computer resources

Design of Crawler
Automatic Language Identification (LI) capable of
multiple encodings
Language-based tracing of links
Choice of seed-URLs
Multi-thread crawling
Robots exclusion (robots.txt)
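
The robots exclusion item in the design list can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch: the robots.txt content, the `lsc-bot` user agent, and the example.com URLs are hypothetical, and the slides do not say how the LSC itself implements the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# "lsc-bot" is an assumed user-agent name, not one given in the slides.
allowed = parser.can_fetch("lsc-bot", "http://example.com/index.html")
blocked = parser.can_fetch("lsc-bot", "http://example.com/private/page.html")
print(allowed, blocked)
```

A crawler would run this check before queuing each URL and skip any URL for which `can_fetch` returns False.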

8
Crawling Process
World Wide Web
1. Extract URLs
Get URLs
2. Language Identification
Language Identifier
3. Saving into Database
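
The three steps above can be sketched as a breadth-first loop. The in-memory `FAKE_WEB`, the `is_target_language` stand-in, and the rule that links are followed only from pages identified as Myanmar are assumptions made for illustration; the slides name "language-based tracing of links" without spelling out the rule.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("myanmar text", ["http://a.example/1", "http://b.example/"]),
    "http://a.example/1": ("myanmar text", []),
    "http://b.example/": ("english text", ["http://b.example/1"]),
    "http://b.example/1": ("myanmar text", []),
}


def is_target_language(text: str) -> bool:
    """Stand-in for the language identifier (step 2)."""
    return text.startswith("myanmar")


def crawl(seeds, max_level):
    """Return (parent_url, url, level, content) rows for target-language pages."""
    saved = []
    seen = set(seeds)
    # A seed is recorded as its own parent at level 0.
    queue = deque((url, url, 0) for url in seeds)
    while queue:
        parent, url, level = queue.popleft()
        page = FAKE_WEB.get(url)
        if page is None:
            continue
        text, links = page
        if not is_target_language(text):
            continue  # assumed rule: neither save nor follow non-target pages
        saved.append((parent, url, level, text))  # step 3: save
        if level < max_level:
            for link in links:  # step 1: extract URLs
                if link not in seen:
                    seen.add(link)
                    queue.append((url, link, level + 1))
    return saved


rows = crawl(["http://a.example/"], max_level=6)
for row in rows:
    print(row[:3])
```

Under the assumed rule, the Myanmar page behind the English page is never reached, which is the kind of isolated page a different choice of seed URLs would have to cover.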
9
Multi-threaded Crawler

A single-threaded crawling loop takes a large
amount of time.
Multi-threading can provide reasonable speed-up
and efficient use of available bandwidth.
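
A minimal sketch of the idea, using Python's `concurrent.futures`. The `fetch` function is a stand-in that sleeps instead of downloading, and the worker count and URLs are hypothetical; the slides do not describe the LSC's actual threading model.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str) -> str:
    """Stand-in for an HTTP download; the sleep simulates network latency."""
    time.sleep(0.05)
    return f"<html>content of {url}</html>"


def fetch_all(urls, workers=8):
    """Download pages concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


urls = [f"http://example.com/page{i}" for i in range(16)]
start = time.perf_counter()
pages = fetch_all(urls)
elapsed = time.perf_counter() - start
print(len(pages), "pages in", round(elapsed, 2), "seconds")
```

Because each thread spends most of its time waiting on the network, several downloads overlap and the total time is well below the sum of the individual latencies.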

10
Language Identification (cont.)

G2LI is an n-gram based language identification
algorithm for Web documents.
Advantages
Requires few computing resources.
Needs only a small training set (520 KB in length is enough).
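
To illustrate n-gram based identification in general, here is a toy byte-trigram classifier. It is not the G2LI algorithm, whose details the slides do not give; the training strings, the labels, and the overlap score are all assumptions. Working on raw bytes means one profile can be trained per language and encoding pair.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count overlapping byte n-grams in raw page bytes."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def train(samples: dict) -> dict:
    """Build one n-gram set per label from raw training bytes."""
    return {label: set(byte_ngrams(raw)) for label, raw in samples.items()}


def identify(data: bytes, profiles: dict) -> str:
    """Pick the label whose profile covers the largest share of n-grams."""
    grams = byte_ngrams(data)
    total = sum(grams.values())

    def score(label: str) -> float:
        if total == 0:
            return 0.0
        hits = sum(c for g, c in grams.items() if g in profiles[label])
        return hits / total

    return max(profiles, key=score)


# Tiny illustrative training texts; a real profile would use far more data.
TRAINING = {
    "myanmar": "မြန်မာ မြန်မာ".encode("utf-8"),
    "english": "language specific crawler for web pages".encode("utf-8"),
}
profiles = train(TRAINING)
print(identify("မြန်မာ".encode("utf-8"), profiles))
print(identify(b"web pages", profiles))
```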

11
Various Myanmar Fonts and Encodings
Font Name     Encoding Scheme
BIT           Partial Unicode
CE Classic    Graphic Encoding
Myanmar1      Unicode
Myanmar2      Unicode
MyaZedi       Partial Unicode
MyMyanmar     Partial Unicode
Popular       Graphic Encoding
Wininwa       Graphic Encoding
Zawgyi-One    Partial Unicode
12
Database Design (cont.)

Save URLs in a CSV file
Save page content in a Derby database

URL
ID   URL
1    http://www.google.com

CONTENT
ID   ParentURL                    URL                                  Level   Content
1    http://www.google.com        http://www.google.com                0       xxx
1    http://www.google.com        http://www.google.com/mail           1       xxx
1    http://www.google.com/mail   http://www.google.com/mail/signout   2       xxx
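
The CONTENT table can be sketched with Python's built-in `sqlite3` as a stand-in for the database named on the slide. The column types, the auto-incrementing `id`, and the UNIQUE constraint on `url` are assumptions; only the column names and the three sample rows come from the slide.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE content (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- assumed: unique per page
        parent_url TEXT NOT NULL,
        url TEXT NOT NULL UNIQUE,              -- assumed: one row per URL
        level INTEGER NOT NULL,                -- link depth from the seed
        content TEXT
    )
    """
)

rows = [
    ("http://www.google.com", "http://www.google.com", 0, "xxx"),
    ("http://www.google.com", "http://www.google.com/mail", 1, "xxx"),
    ("http://www.google.com/mail", "http://www.google.com/mail/signout", 2, "xxx"),
]
conn.executemany(
    "INSERT INTO content (parent_url, url, level, content) VALUES (?, ?, ?, ?)",
    rows,
)

# Find the deepest page reached so far.
deepest = conn.execute(
    "SELECT url, level FROM content ORDER BY level DESC LIMIT 1"
).fetchone()
print(deepest)
```

Storing the parent URL and level with each page makes it possible to enforce the depth limit and to trace how a page was reached.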

13
Evaluation

Evaluation of the language identifier (G2LI)
Evaluation of crawling efficiency by means of
precision and recall
Evaluation of the crawling coverage

14
A) Evaluation of Language Identifier
G2LI's guess                  Verified language
                              Myanmar      Myanmar (%)   Non-Myanmar   Total
Identified as Myanmar         763 (92%)    87            37 (8%)       800 (100%)
Identified as non-Myanmar     106          13            1094          1200
Total                         869          100           1131          2000
15
Accuracy Rate and Error Rate

T = number of downloaded pages (2000)

Accuracy rate = (763 + 1094) / 2000 ≈ 93%
Error rate = (37 + 106) / 2000 ≈ 7%
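
The two rates can be recomputed from the four counts in the evaluation table. The variable names are illustrative; the counts are the ones on the slide.

```python
# Counts from the language identifier evaluation table.
myanmar_as_myanmar = 763       # correctly identified Myanmar pages
other_as_myanmar = 37          # non-Myanmar pages identified as Myanmar
myanmar_as_other = 106         # Myanmar pages identified as non-Myanmar
other_as_other = 1094          # correctly identified non-Myanmar pages

total = myanmar_as_myanmar + other_as_myanmar + myanmar_as_other + other_as_other
accuracy = (myanmar_as_myanmar + other_as_other) / total
error = (other_as_myanmar + myanmar_as_other) / total

print(total, round(accuracy * 100), round(error * 100))
```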
16
Misclassified Cases

1) Relevant but not retrieved
Bilingual pages written in Myanmar and English.
Web pages using numeric character references,
e.g., &#4156; and &#4153;
2) Retrieved but not relevant
The misclassified pages are all English Web pages.
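
One possible way to handle the numeric character reference case is to decode the references before language identification. The sketch below uses Python's standard `html.unescape`; the `myanmar_ratio` helper is hypothetical and simply measures the share of characters in the Unicode Myanmar block (U+1000 to U+109F). The slides do not say whether the LSC does anything like this.

```python
import html

# Unicode Myanmar block: U+1000 .. U+109F.
MYANMAR_BLOCK = range(0x1000, 0x10A0)


def decode_ncr(text: str) -> str:
    """Turn references such as &#4156; into the characters they denote."""
    return html.unescape(text)


def myanmar_ratio(text: str) -> float:
    """Share of non-space characters that fall in the Myanmar block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ord(ch) in MYANMAR_BLOCK for ch in chars) / len(chars)


raw = "&#4156;&#4153;"
decoded = decode_ncr(raw)
print(myanmar_ratio(raw), myanmar_ratio(decoded))
```

Before decoding, the page source contains only ASCII characters, which explains why an identifier working on raw text would miss it.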

17
B) Precision and Recall

Precision
The ability to retrieve top-ranked documents that
are mostly relevant.
Precision = |X ∩ Y| / |Y|
Recall
The ability of the search to find all of the
relevant items in the entire Web space.
Recall = |X ∩ Y| / |X|
where X = relevant documents and Y = retrieved
documents
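
With X and Y treated as sets, the standard definitions are precision = |X ∩ Y| / |Y| and recall = |X ∩ Y| / |X|. A small sketch with made-up page identifiers:

```python
def precision_recall(relevant: set, retrieved: set):
    """Return (precision, recall) for sets of relevant and retrieved items."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical page identifiers, for illustration only.
X = {"p1", "p2", "p3", "p4"}   # relevant (Myanmar) pages
Y = {"p1", "p2", "p5"}         # pages the crawler retrieved

precision, recall = precision_recall(X, Y)
print(precision, recall)
```

For a crawler, Y is known exactly, but X (all Myanmar pages on the Web) is not, so recall depends on an estimate of the size of X.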

18
How to estimate the total number of Web pages
19
Total number of URLs returned by Google for each
keyword
Keyword (English gloss of the Myanmar word)   Number of URLs
Day                                           68,500
But                                           41,000
Human Being                                   117,000
Now                                           31,500
Myanmar                                       56,500
He                                            46,600
Total                                         361,100
Experiment period: 25th June 2008 to 27th June
2008.
20

Keyword A   Keyword B   URLs for A   URLs for B   URLs for both   Estimated total
Day         But         68,500       45,200       13,700          205,000
Day         Human       68,500       120,000      14,200          564,401
Day         Now         68,500       35,300       11,800          182,860
...
Now         He          31,500       46,600       10,000          140,805
Myanmar     He          56,500       46,600       11,200          225,496
Total of the 15 keyword-pair estimates                            4,905,169
Average of the 15 keyword-pair estimates (estimated X)            327,011
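
The slides do not state the estimation formula. One reading that reproduces the first three rows is a capture-recapture style estimate, N ≈ (URLs for A × URLs for B) / (URLs for both), taking the single-keyword counts from the per-keyword table (41,000 for But, 117,000 for Human Being, 31,500 for Now). The sketch below checks that reading; the formula is an inference, not something the presentation confirms.

```python
def estimate_total(n_a: int, n_b: int, n_both: int) -> float:
    """Capture-recapture style estimate (assumed formula, see lead-in)."""
    return n_a * n_b / n_both


# (keyword A, keyword B, URLs for A, URLs for B, URLs for both)
# Single-keyword counts are taken from the per-keyword table.
pairs = [
    ("Day", "But", 68_500, 41_000, 13_700),
    ("Day", "Human", 68_500, 117_000, 14_200),
    ("Day", "Now", 68_500, 31_500, 11_800),
]
estimates = [round(estimate_total(a, b, both)) for _, _, a, b, both in pairs]
print(estimates)

# The reported average over all 15 keyword pairs.
average = round(4_905_169 / 15)
print(average)
```

Six keywords give 15 distinct pairs, and averaging the 15 pair estimates yields the figure used as the estimated number of Myanmar pages.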
21
Precision and Recall of Crawling: Entertainment Site Case
22
Precision and Recall of Crawling: Blog Site Case
23
Precision and Recall of Crawling: News Site Case
24
C) Crawling Coverage

Crawling parameters
Seed URLs: 35
Level of depth: 6
Crawling time: 2 weeks
CPU: 2.40 GHz
Memory: 1 GB
Internet connection: 100 Mbit/s

Domain        Number of pages collected   Share (%)
.mm           3,555                       1.1
.com          276,554                     83.2
Other gTLDs   52,245                      15.7
Total         332,354                     100.0
25
Distribution of the estimated total number of Myanmar
pages
Estimated average: 327,011
Collected: 332,354
26
Conclusion

The proposed crawler design proved to work as an
LSC for the Myanmar language
The LSC can download Myanmar pages on the Web at a
satisfactory level
The proposed LSC can be used as part of a Myanmar
search engine

27
Limitations of LSC

How to reach isolated Myanmar pages (choice of
seed URLs, etc.)
Misidentification by the language identifier (in
particular, bilingual pages in English and
Myanmar need to be collected)
The speed of the LSC needs to be improved

28
Themes for doctoral study

Lexicon
Indexing
Code conversion (Transcoding)
Stop words removal
Stemming algorithm

29
Language Specific Search Engine: Basic Architecture
Language specific Search Engine
Language specific crawler
WWW
Crawler
Language Identification
Page repository
Ranking engine
Query engine
Parser
Indexer
query
results
30
1. Lexicon

Lexicon is also a synonym for dictionary or
encyclopedic dictionary.
In linguistics, the lexicon of a language is its
vocabulary, including its words and expressions.

Web pages URLs
Daily News Paper
Lexicon
Dictionary
31
2. Indexing
Indexing is the process of assigning each keyword
to the documents of a corpus in which it occurs
Database
Indexer
Database ID   Web Pages   Lexicon
1             2,3         [Myanmar word]
2             8           [Myanmar word]
3             6           [Myanmar word]
4             N
5             4
...
N-1           5
N             7
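
The table above is in effect an inverted index: each lexicon entry points to the pages that contain it. A minimal sketch, with hypothetical English terms standing in for the Myanmar lexicon entries and page numbers chosen to mirror the first rows:

```python
from collections import defaultdict


def build_index(pages: dict) -> dict:
    """Map each term to the sorted list of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, terms in pages.items():
        for term in terms:
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical pages: page ID -> terms found on that page.
PAGES = {
    2: ["news", "myanmar"],
    3: ["news"],
    8: ["crawler"],
}
index = build_index(PAGES)
print(index["news"], index["crawler"])
```

At query time, the search engine looks up each query term in the index instead of scanning every stored page.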

32
3. Code Conversion
Web Page (contents)
Unicode
Non-Unicode
Unicode
Transcoding
Lexicon encoded in Unicode
????
Server
Client
33
4. Stop Words Removal

Stop words are defined as non-information-bearing
words.
Myanmar sentences can be tokenized by eliminating
stop words.

Example: [Myanmar sentence]
Remaining tokens and parts of speech:
computer (N), students (N), useful (Adj)
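
Because Myanmar text is written without spaces between words, removing stop words can double as a tokenization step. The sketch below shows the idea on an unsegmented English string; the stop-word list and the input are hypothetical, and a plain substring match like this would also split inside any content word that happens to contain a stop word.

```python
import re


def split_on_stop_words(text: str, stop_words: list) -> list:
    """Split unsegmented text wherever a stop word occurs."""
    if not stop_words:
        return [text]
    # Try longer stop words first so they win over their own prefixes.
    ordered = sorted(stop_words, key=len, reverse=True)
    pattern = "|".join(re.escape(word) for word in ordered)
    return [piece for piece in re.split(pattern, text) if piece]


# English stand-in for an unsegmented sentence; "for" and "is" play the
# role of stop words.
tokens = split_on_stop_words("computerforstudentsisuseful", ["for", "is"])
print(tokens)
```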
34
Stop-words list: English vs. Myanmar
(the same categories are used for both languages)
1. Subject personal pronouns
   English: I, you, he, she, it, we, you, they
   Myanmar: uRefawmf? uRefr? ig? usKyf? uREkfyf? usaemf?
2. Object personal pronouns
3. Reflexive personal pronouns
4. Relative pronouns
5. Possessive pronouns and adjectives
6. Indefinite pronouns and adjectives
7. Demonstrative pronouns and adjectives
8. Conjunctions
9. Questions
10. Other (pronouns, prepositions)
35
5. Stemming

A stemming algorithm is a conflation procedure that
reduces all words with the same root to a single
root
A stem is the portion of a word which is left
after the removal of its affixes (i.e., prefixes
and suffixes)
e.g., connect is the stem for the variants
connected, connecting, and connections
e.g., [Myanmar word] is the stem for the variants
[Myanmar word], [Myanmar word], and [Myanmar word]
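
The English example can be reproduced with a naive suffix-stripping sketch. The suffix list and the minimum stem length are assumptions chosen for this one example; this is not a full stemmer such as Porter's, and the slides do not describe the stemming rules planned for Myanmar.

```python
# Illustrative suffix list; real stemmers use many more rules.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def stem(word: str, min_stem: int = 3) -> str:
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


variants = ["connected", "connecting", "connections", "connect"]
stems = [stem(word) for word in variants]
print(stems)
```

Conflating the variants to one stem lets a query for any one form match pages containing the others.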