Title: Language Specific Crawler for Myanmar Web Pages
Slide 1: Language Specific Crawler for Myanmar Web Pages
Pann Yu Mon
Management and Information System Engineering Department
Nagaoka University of Technology, Japan
10th July 2008
Slide 2: Outline
- Introduction
- Design of Crawler
- Evaluation
- Conclusions
- Limitations
- Themes for Doctoral Study
Slide 3: Introduction
- Internet users are about 0.1% of the population
- Little Myanmar-language content is found on the Web
- No search engine is available for the Myanmar language
Country | Population | Number of internet users | Internet users (%)
Myanmar (.mm) | 52,373,958 | 63,700 | 0.123
Slide 4: Challenges for a Language Specific Crawler (LSC) for Myanmar
- Multiple encodings are in use
- Myanmar pages are sparsely scattered over the entire Web
- Collect as many pages as possible with limited time and computer resources
[Figure labels: Myanmar Pages; Non-Myanmar Pages]
Slide 5: Language Specific Search Engine: Basic Architecture
[Diagram labels: Language specific crawler; WWW; Crawler; Language Identification; Page repository; Ranking engine; Query engine; Parser; Indexer; query; results]
Slide 6: Objectives
- To propose a Language Specific Crawler (LSC) that maximizes the collection of web pages written in the target language, independent of domain.
- To efficiently collect Myanmar web pages, which can then be indexed and sorted and finally used in a search engine.
Slide 7: 2. Design of Crawler (cont.)
- Challenges
  - Multiple encodings are in use
  - Myanmar pages are sparsely scattered over the entire Web
  - Collect as many pages as possible with limited time and computer resources
- Design of Crawler
  - Automatic Language Identification (LI) capable of handling multiple encodings
  - Language-based tracing of links
  - Choice of seed URLs
  - Multi-threaded crawling
  - Robots.txt exclusion
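The robots exclusion item in the list above can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch: the robots.txt content, the user-agent name `lsc-sketch`, and the example URLs are hypothetical, and the slides do not say how the original crawler implemented this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, given inline so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

AGENT = "lsc-sketch"  # hypothetical user-agent name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch(AGENT, "http://example.com/index.html")
blocked = parser.can_fetch(AGENT, "http://example.com/private/a.html")
print(allowed, blocked)
```

A crawler would run such a check once per URL before downloading it and skip any URL the site disallows.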
Slide 8: Crawling Process
[Diagram labels: World Wide Web; Get URLs; Language Identifier]
1. Extract URLs
2. Language Identification
3. Saving into Database
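The three steps can be sketched as a breadth-first loop. Everything below is illustrative: `FAKE_WEB` is a hypothetical in-memory stand-in for the Web, `looks_myanmar` is a placeholder for the real language identifier, and the policy of following links only from pages identified as Myanmar is an assumption about what "language-based tracing of links" means.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("myanmar text", ["http://a.example/1", "http://b.example/"]),
    "http://a.example/1": ("myanmar text", ["http://a.example/"]),
    "http://b.example/": ("english text", ["http://b.example/1"]),
    "http://b.example/1": ("myanmar text", []),
}


def fetch(url):
    """Placeholder download: returns (text, links) or None if unknown."""
    return FAKE_WEB.get(url)


def looks_myanmar(text):
    """Placeholder for the language identifier (step 2)."""
    return "myanmar" in text


def crawl(seeds, max_depth):
    """Return (parent_url, url, level) rows for pages kept by the crawler."""
    queue = deque((seed, seed, 0) for seed in seeds)
    seen = set(seeds)
    saved = []
    while queue:
        parent, url, level = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        if not looks_myanmar(text):      # step 2: language identification
            continue                     # assumed policy: do not follow its links
        saved.append((parent, url, level))   # step 3: save the page
        if level < max_depth:
            for link in links:           # step 1: extract URLs
                if link not in seen:
                    seen.add(link)
                    queue.append((url, link, level + 1))
    return saved


rows = crawl(["http://a.example/"], max_depth=6)
print(rows)
```

Under this assumed policy the page `http://b.example/1` is never reached, because its only parent is not a Myanmar page. This matches the difficulty of reaching isolated pages.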
Slide 9: Multi-threaded Crawler
- A single crawling loop takes a large amount of time.
- Multi-threading can provide a reasonable speed-up and efficient use of available bandwidth.
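A minimal sketch of the idea, using a thread pool from the Python standard library. The `fetch` function is a placeholder that sleeps instead of downloading, and the worker count of 8 is an arbitrary assumption. The slides do not describe the original thread design.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str) -> str:
    """Placeholder for a network download; the sleep imitates I/O latency."""
    time.sleep(0.05)
    return f"<html>content of {url}</html>"


def fetch_all(urls, workers=8):
    """Download all URLs concurrently and return a url -> content mapping."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))


urls = [f"http://example.com/page{i}" for i in range(16)]
pages = fetch_all(urls)
print(len(pages))
```

Because each thread spends most of its time waiting on the network, several downloads can overlap and the available bandwidth is used more fully than in a single loop.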
Slide 10: Language Identification (cont.)
- G2LI is an n-gram based language identification algorithm for Web documents.
- Advantages
  - Requires small computing resources.
  - Small training set (520 KB in length is enough).
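To show the general idea of n-gram based identification, here is a small byte-trigram classifier. It is not G2LI itself: the cosine similarity measure, the tiny training strings, and the labels are assumptions made for illustration. Working on raw bytes means one profile can be trained per language and encoding pair.

```python
from collections import Counter
from math import sqrt


def ngram_profile(data: bytes, n: int = 3) -> Counter:
    """Count all overlapping byte n-grams in the data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def identify(data: bytes, profiles: dict) -> str:
    """Return the label whose training profile is most similar to the data."""
    doc = ngram_profile(data)
    return max(profiles, key=lambda label: cosine(doc, profiles[label]))


# Toy training samples (assumptions); real profiles need far more text.
PROFILES = {
    "myanmar-utf8": ngram_profile("မြန်မာ နေ့ လူ သူ".encode("utf-8")),
    "english": ngram_profile(b"the quick brown fox jumps over the lazy dog and the day is now"),
}

guess_my = identify("သူ မြန်မာ လူ".encode("utf-8"), PROFILES)
guess_en = identify(b"the dog is lazy now", PROFILES)
print(guess_my, guess_en)
```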
Slide 11: Various Myanmar Fonts and Encodings
Font Name | Encoding Scheme
BIT | Partial Unicode
CE Classic | Graphic Encoding
Myanmar1 | Unicode
Myanmar2 | Unicode
MyaZedi | Partial Unicode
MyMyanmar | Partial Unicode
Popular | Graphic Encoding
Wininwa | Graphic Encoding
Zawgyi-One | Partial Unicode
Slide 12: Database Design (cont.)
- Save URLs in a CSV file
- Save page content in a Derby database
- URL
ID | URL
1 | http://www.google.com
- CONTENT
ID | ParentURL | URL | Level | Content
1 | http://www.google.com | http://www.google.com | 0 | xxx
1 | http://www.google.com | http://www.google.com/mail | 1 | xxx
1 | http://www.google.com/mail | http://www.google.com/mail/signout | 2 | xxx
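The CONTENT layout above can be sketched as a relational table. The slides name Derby as the database. SQLite is swapped in here only because it ships with Python, and the column types and auto-incrementing ID are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for the Derby database
conn.execute(
    """
    CREATE TABLE content (
        id INTEGER PRIMARY KEY,      -- assumed auto-incrementing key
        parent_url TEXT NOT NULL,
        url TEXT NOT NULL,
        level INTEGER NOT NULL,      -- link depth from the seed URL
        content TEXT
    )
    """
)

rows = [
    ("http://www.google.com", "http://www.google.com", 0, "xxx"),
    ("http://www.google.com", "http://www.google.com/mail", 1, "xxx"),
    ("http://www.google.com/mail", "http://www.google.com/mail/signout", 2, "xxx"),
]
conn.executemany(
    "INSERT INTO content (parent_url, url, level, content) VALUES (?, ?, ?, ?)",
    rows,
)

count = conn.execute("SELECT COUNT(*) FROM content").fetchone()[0]
deepest = conn.execute(
    "SELECT url, level FROM content ORDER BY level DESC LIMIT 1"
).fetchone()
print(count, deepest)
```

Storing the parent URL and level with each page makes it possible to reconstruct the path from a seed URL to any collected page.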
Slide 13: Evaluation
- Evaluation of the Language Identification (G2LI)
- Evaluation of crawling efficiency by means of precision and recall
- Evaluation of the crawling coverage
Slide 14: A) Evaluation of Language Identifier
G2LI's guess | Verified Myanmar | Verified Myanmar (%) | Verified Non-Myanmar | Total
Identified as Myanmar | 763 (92) | 87 | 37 (8) | 800 (100)
Identified as Non-Myanmar | 106 | 13 | 1094 | 1200
Total | 869 | 100 | 1131 | 2000
Slide 15: Accuracy Rate and Error Rate
Accuracy rate = (763 + 1094) / 2000 ≈ 93%
Error rate = (37 + 106) / 2000 ≈ 7%
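The two rates follow directly from the four cells of the evaluation table. The short check below uses the counts from that table; the variable names are mine.

```python
# Counts from the language identifier evaluation (2000 verified pages).
myanmar_as_myanmar = 763          # correctly identified Myanmar pages
non_myanmar_as_myanmar = 37       # non-Myanmar pages identified as Myanmar
myanmar_as_non_myanmar = 106      # Myanmar pages identified as non-Myanmar
non_myanmar_as_non_myanmar = 1094 # correctly identified non-Myanmar pages

total = (myanmar_as_myanmar + non_myanmar_as_myanmar
         + myanmar_as_non_myanmar + non_myanmar_as_non_myanmar)

accuracy = (myanmar_as_myanmar + non_myanmar_as_non_myanmar) / total
error_rate = (non_myanmar_as_myanmar + myanmar_as_non_myanmar) / total

print(f"accuracy = {accuracy:.4f}, error rate = {error_rate:.4f}")
```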
Slide 16: Misclassified Cases
- 1) Relevant but not retrieved
  - Bilingual pages written in Myanmar and English.
  - Web pages using numeric character references.
    - e.g. (&#4156;, &#4153;)
- 2) Retrieved but not relevant
  - The misclassified pages are all English Web pages
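One possible way to handle pages that use numeric character references is to decode them before language identification. This is a suggestion, not something the slides describe. The sketch uses `html.unescape` from the Python standard library on references with the code points 4156 and 4153 from the example above, then checks whether the result falls in the Unicode Myanmar block (U+1000 to U+109F).

```python
import html


def decode_references(text: str) -> str:
    """Replace HTML character references with the characters they denote."""
    return html.unescape(text)


def has_myanmar_chars(text: str) -> bool:
    """True if any character lies in the Unicode Myanmar block."""
    return any("\u1000" <= ch <= "\u109f" for ch in text)


sample = "&#4156;&#4153;"
decoded = decode_references(sample)

print(has_myanmar_chars(sample), has_myanmar_chars(decoded))
```

Before decoding, the page contains only ASCII characters, which is consistent with such pages being missed by an identifier that looks at the raw content.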
Slide 17: B) Precision and Recall
- Precision
  - The ability to retrieve top-ranked documents that are mostly relevant.
  - Precision = |X ∩ Y| / |Y|
- Recall
  - The ability of the search to find all of the relevant items in the entire Web space.
  - Recall = |X ∩ Y| / |X|
- Where X = relevant documents and Y = retrieved documents
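With X as the set of relevant documents and Y as the set of retrieved documents, the standard set-based definitions can be computed as below. The page identifiers are made up for the example.

```python
def precision_recall(relevant: set, retrieved: set):
    """Return (precision, recall) for the given sets of documents."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


X = {"p1", "p2", "p3", "p4"}   # relevant documents (hypothetical)
Y = {"p2", "p3", "p5"}         # retrieved documents (hypothetical)

precision, recall = precision_recall(X, Y)
print(precision, recall)
```

For a crawler, Y is the set of pages collected and X is the set of Myanmar pages that exist, which is why X has to be estimated.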
Slide 18: How to estimate the total number of Web pages
Slide 19: Total number of URLs returned by Google for each keyword
Keyword | Number of URLs
??? (Day) | 68,500
???????? (But) | 41,000
?? (Human Being) | 117,000
??? (Now) | 31,500
?????? (Myanmar) | 56,500
?? (He) | 46,600
Total | 361,100
Experiment period: 25th June 2008 to 27th June 2008.
Slide 20
Keyword 1 | Keyword 2 | URLs for keyword 1 | URLs for keyword 2 | URLs for both | Estimated total
Day | But | 68,500 | 45,200 | 13,700 | 205,000
Day | Human | 68,500 | 120,000 | 14,200 | 564,401
Day | Now | 68,500 | 35,300 | 11,800 | 182,860
Now | He | 31,500 | 46,600 | 10,000 | 140,805
Myanmar | He | 56,500 | 46,600 | 11,200 | 225,496
Total | 4,905,169
Average of 15 keyword pairs (estimated X) | 327,011
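The estimation formula itself is not visible in the extracted slides. The estimated totals are, however, reproduced by a capture-recapture style estimate, N ≈ n1 · n2 / n12, where n1 and n2 are the URL counts for two keywords and n12 is the count for both together. For the Day and But pair this works when the per-keyword counts 68,500 and 41,000 from the Google results are used. The sketch below is offered under that assumption, and the function name is mine.

```python
def estimate_total(n_first: int, n_second: int, n_both: int) -> float:
    """Capture-recapture style estimate of the population size."""
    return n_first * n_second / n_both


# Day (68,500) and But (41,000), 13,700 URLs containing both keywords.
day_but = estimate_total(68_500, 41_000, 13_700)

# Day (68,500) and Human Being (117,000), 14,200 URLs containing both.
day_human = estimate_total(68_500, 117_000, 14_200)

# Average over the 15 keyword pairs, from the reported total.
average = 4_905_169 / 15

print(round(day_but), round(day_human), round(average))
```

The intuition is that if the two keywords occur independently, the share of the first keyword's pages that also contain the second keyword approximates the share of all Myanmar pages that contain the second keyword.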
Slide 21: Precision and recall of crawling: entertainment site case
Slide 22: Precision and recall of crawling: blog site case
Slide 23: Precision and recall of crawling: news site case
Slide 24: C) Crawling Coverage
- Crawling parameters
  - Seed URLs: 35
  - Level of depth: 6
  - Crawling time: 2 weeks
  - CPU: 2.40 GHz
  - Memory: 1 GB
  - Internet connection: 100 Mbit/s
Domain | Number of pages collected | Share (%)
.mm | 3,555 | 1.1
.com | 276,554 | 83.2
Other gTLDs | 52,245 | 15.7
Total | 332,354 | 100.0
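The percentage column can be re-derived from the page counts, and the grouping by domain can be sketched with `urllib.parse`. The `domain_bucket` helper and its example URL are hypothetical. The slides do not say how pages were assigned to domains.

```python
from urllib.parse import urlsplit


def domain_bucket(url: str) -> str:
    """Assign a URL to one of the three domain groups used in the table."""
    host = urlsplit(url).hostname or ""
    if host.endswith(".mm"):
        return ".mm"
    if host.endswith(".com"):
        return ".com"
    return "Other gTLDs"


# Page counts reported in the coverage table.
COLLECTED = {".mm": 3555, ".com": 276554, "Other gTLDs": 52245}

total = sum(COLLECTED.values())
shares = {domain: round(100 * n / total, 1) for domain, n in COLLECTED.items()}

print(total, shares)
print(domain_bucket("http://www.example.com.mm/index.html"))
```

The result shows that only about 1% of the collected Myanmar pages are under the .mm domain, so a crawler restricted to the country domain would miss almost all of them.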
Slide 25: Distribution of the estimated total number of Myanmar pages
Estimated average: 327,011
Collected: 332,354
Slide 26: Conclusion
- The proposed crawler design proved to work as an LSC for the Myanmar language
- The LSC can download Myanmar pages on the Web at a satisfactory level
- The proposed LSC can be used as part of a Myanmar search engine
Slide 27: Limitations of LSC
- How to reach isolated Myanmar pages (choice of seed URLs, etc.)
- Misidentification by the Language Identifier (in particular, bilingual English and Myanmar pages need to be collected)
- The speed of the LSC needs to be improved
Slide 28: Themes for doctoral study
- Lexicon
- Indexing
- Code conversion (Transcoding)
- Stop words removal
- Stemming algorithm
Slide 29: Language Specific Search Engine: Basic Architecture
[Diagram labels: Language specific Search Engine; Language specific crawler; WWW; Crawler; Language Identification; Page repository; Ranking engine; Query engine; Parser; Indexer; query; results]
Slide 30: 1. Lexicon
- Lexicon is also a synonym for dictionary or encyclopedic dictionary.
- In linguistics, the lexicon of a language is its vocabulary, including its words and expressions.
[Diagram labels: Web pages URLs; Daily News Paper; Lexicon; Dictionary]
Slide 31: 2. Indexing
Indexing is the process of recording which documents of a corpus each keyword is assigned to.
[Diagram labels: Database; Indexer; Database]
ID | Web Pages | Lexicon
1 | 2,3 | ????
2 | 8 | ?????????
3 | 6 | ??????????
4 | N |
5 | 4 |
N-1 | 5 |
N | 7 |
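The table above has the shape of an inverted index: each lexicon entry points to the web pages that contain it. A minimal sketch follows. The page IDs and English placeholder words are made up, and the slides do not describe the planned indexer in any more detail.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(pages: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Map every word to the sorted list of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, words in pages.items():
        for word in words:
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical tokenized pages: page ID -> words found on the page.
PAGES = {
    2: ["day", "now"],
    3: ["day"],
    8: ["myanmar"],
}

index = build_index(PAGES)
print(index)
```

A query engine can then answer a keyword query by looking up the word once, instead of scanning every stored page.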
Slide 32: 3. Code Conversion
[Diagram labels: Web Page (contents); Unicode; Non-Unicode; Unicode; Transcoding; Lexicon encoded in Unicode; ????; Server; Client]
Slide 33: 4. Stop Words Removal
- Stop words are defined as non-information-bearing words.
- Myanmar sentences can be tokenized by eliminating stop words.
[Example figure: a Myanmar sentence tokenized into the words for "computer" (N), "students" (N) and "useful" (Adj)]
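The idea of tokenizing by eliminating stop words can be sketched as splitting an unsegmented string wherever a stop word occurs. The input string and the upper-case placeholders `IS` and `FOR`, which stand in for Myanmar stop words, are hypothetical. A real implementation would need a Myanmar stop-word list and would have to deal with stop words that also occur inside content words.

```python
import re


def split_on_stop_words(text: str, stop_words: set) -> list:
    """Split unsegmented text at every stop word and drop the stop words."""
    if not stop_words:
        return [text] if text else []
    # Longer stop words first, so they win over their own prefixes.
    ordered = sorted(stop_words, key=len, reverse=True)
    pattern = "|".join(re.escape(word) for word in ordered)
    return [piece for piece in re.split(pattern, text) if piece]


STOP_WORDS = {"IS", "FOR"}  # placeholders standing in for Myanmar stop words

tokens = split_on_stop_words("computerISusefulFORstudents", STOP_WORDS)
print(tokens)
```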
Slide 34: Stop-words list: English vs. Myanmar
1. Subject personal pronouns
   English: I, you, he, she, it, we, you, they
   Myanmar: uRefawmf? uRefr? ig? usKyf? uREkfyf? usaemf?
2. Object personal pronouns
3. Reflexive personal pronouns
4. Relative pronouns
5. Possessive pronouns and adjectives
6. Indefinite pronouns and adjectives
7. Demonstrative pronouns and adjectives
8. Conjunctions
9. Questions
10. Other (pronouns, prepositions)
Slide 35: 5. Stemming
- A stemming algorithm is a conflation procedure
  - It reduces all words with the same root to a single root
- A stem is the portion of a word that is left after the removal of its affixes (i.e., prefixes and suffixes)
  - e.g., connect is the stem for the variants connected, connecting, and connections
  - e.g., ??? is the stem for the variants ????????, ?????? and ??????
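The English example can be reproduced with a naive suffix-stripping function. This is only an illustration of conflation, not an established stemming algorithm such as Porter's, and not a Myanmar stemmer. The suffix list and the minimum stem length are assumptions chosen for the example.

```python
# Longer suffixes come first so that "ions" is tried before "s".
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


variants = ["connected", "connecting", "connections", "connect"]
stems = [stem(word) for word in variants]
print(stems)
```

With stemming applied to both documents and queries, a search for any one of the variants can match pages that contain the others.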
Slide 36
- Thank you!
- Any questions?