????????????? ------ Web Search and Web Mining - PowerPoint PPT Presentation

-----Web Search and Web Mining cxq_at_ict.ac.cn 06.8.17 SWCL 2006 – PowerPoint PPT presentation

1
????????????? ------ Web Search and Web Mining
  • ???, cxq_at_ict.ac.cn
  • ????????????
  • 06.8.17 SWCL 2006 ??

2
Outline
  • Background and Motivation
  • Some existing work at ICT
  • Some existing research work
  • Some sharable systems
  • Conclusion

3
A Big Problem!
Semantic Web
Information Retrieval
Social Computing
Network Science
Web Computing
Natural Language Processing
Machine Learning
Text Mining
4
?????Web?????
Unified Browsing
Unified Search
Personalized Search
Personalized Space
5
Web Mining????????????
  • ???????????
  • ????100???PB???????10???????????,???????????300
    0???????,?????10??
  • ??????????????
  • ??????????????????
  • ??????????????
  • ?????????
  • ??????????
  • ?????!
  • ??????????,???????
  • ???????????????
  • ??????????
  • ???????

6
Web 2.0????????
  • ???????
  • Architecture: from server-centered to
    peer-distributed
  • ????P2P, Blog
  • ???????
  • Streaming: from INFORMATION to MESSAGE
  • Socialization
  • ???????Rich Content
  • ???????
  • Rich Dimensions

7
Motivation????????,?????????
  • ??????????????????????,??????
  • Assumptions of VSM, PM, LM, etc.
  • How to represent rich-dimensional feature
    spaces?
  • ????????????????,????????
  • Unified RANKING has so many biases! Identity vs.
    Otherness (active computing)
  • Special algorithms for rich-dimensional feature
    spaces
  • Streaming
  • Message vs. text/sentence: dynamic and context
    sensitive
  • Tradeoff between deep understanding and
    performance
  • Shallow and efficient language processing

8
Outline
  • Background and Motivation
  • Some existing work at ICT
  • Some existing research work
  • Some sharable systems
  • Conclusion

9
Organizations of ICT
  • About 80 people in I3S
  • About 25 research faculty
  • More than 40 students
  • Over 20 Ph.D. candidates, over 15 master's candidates

10
Related work in I3S@ICT
  • Research topics
  • ???????????? Dr. ????
  • ???(??)?????? Dr. ????
  • ????Dr. ????? etc
  • ????????????? Dr. ????
  • ???????????Dr. ????
  • P2P ??Dr. ????
  • Sharable Systems
  • ???????????ICTCLAS
  • ????????????FirteX

11
Data Stream Management
???
  • Conditions
  • High-speed streaming (over 10GBps)
  • Large-scale queries (over 100,000)
  • Emergence of temporal unknown patterns
  • Requirement
  • Online responding
  • Emergence prediction
  • Challenges

12
What we are pursuing
???
  • Query Processing
  • Processing multiple filtering queries on a single
    stream
  • Join algorithms on multiple streams
  • Data stream mining
  • Frequent pattern discovery
  • Clustering
  • Emergence prediction

13
Multiple Strings Matching
???
  • Classic algorithms
  • Prefix-based approach: KMP, AC, Shift-And,
    Shift-Or
  • Suffix-based approach: Boyer-Moore, Wu-Manber
  • Factor-based approach: SBDM, SBOM
  • Challenge
  • The number of feature strings increases with the
    rapid growth of information scale (ClamAntiVirus
    library: 26,653).
  • Traditional string matching algorithms cannot
    cope once the number of feature strings is
    over 5,000.
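
To make the prefix-based family concrete, here is a minimal sketch of AC (Aho-Corasick) multi-string matching. It is an illustration only, not the code used in the talk; the function names `build_automaton` and `search` and the sample keywords are assumptions.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton as three parallel lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised on reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: shallower states are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((pos - len(pat) + 1, pat))
    return hits


# Hypothetical keyword set, for illustration only.
automaton = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", automaton))
```

The text is scanned once regardless of how many keywords there are, but the automaton grows with the keyword set, which is one reason very large feature sets become a challenge.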

?????????????????
?????????????
14
Partition Combinatorial Optimization Matching
(ICT-COM)
???
Finding the optimal partition = finding the shortest path in a weighted graph
  • Construct a weighted graph G from the given
    keyword set P as follows:
  • Node: each block of keywords with length i in P
    (plus a source and a sink)
  • Edge: a set of blocks with length greater than
    or equal to i, but less than j
  • Weight: the minimal time taken by the classical
    algorithms to search a training text for the
    keywords in the corresponding subset
  • Objective
  • Find the shortest path from source to sink in G
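
The shortest-path formulation above can be sketched as a small dynamic program over keyword lengths. This is an illustration under stated assumptions, not the ICT-COM implementation: `optimal_partition`, the `cost` callback, and the timing numbers in `measured` are all hypothetical, and in the slide's setting the weights would come from timing the classical algorithms on a training text.

```python
def optimal_partition(lengths, cost):
    """Shortest path from source (node 0) to sink (node n).

    lengths: sorted distinct keyword lengths.
    cost(group): returns (time, algorithm) for the cheapest classical
                 algorithm on the keywords whose lengths are in `group`.
    Edge (i, j) covers lengths[i:j].
    """
    n = len(lengths)
    best = [0.0] + [float("inf")] * n   # best[j]: cheapest path to node j
    prev = [None] * (n + 1)             # back-pointer (i, algorithm)
    for j in range(1, n + 1):
        for i in range(j):
            time, algo = cost(lengths[i:j])
            if best[i] + time < best[j]:
                best[j] = best[i] + time
                prev[j] = (i, algo)
    # Walk the back-pointers from sink to source.
    parts, j = [], n
    while j > 0:
        i, algo = prev[j]
        parts.append((lengths[i], lengths[j - 1], algo))
        j = i
    return best[n], parts[::-1]


# Hypothetical measured times (seconds) per length group; not real data.
measured = {
    (3,): (2.0, "AC"), (5,): (2.0, "SBOM"), (8,): (1.0, "SBOM"),
    (3, 5): (3.0, "AC"), (5, 8): (1.5, "SBOM"), (3, 5, 8): (5.0, "AC"),
}
total, parts = optimal_partition([3, 5, 8], lambda g: measured[tuple(g)])
print(total, parts)   # 3.5 [(3, 3, 'AC'), (5, 8, 'SBOM')]
```

Because every edge goes from a shorter length to a longer one, the graph is acyclic and a single left-to-right pass is enough; no general shortest-path algorithm is needed.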

15
???
Results of ICT-COM
  • Optimization
  • Analysis
  • COM produced 4 subsets, each assigned its own
    algorithm.
  • 3-9 (AC), 10-13 (SBOM), 14-35 (SBOM), 36-210 (SBOM)
  • COM is about 3 times faster than the
    fastest classical algorithm.
  • ICT-COM is an efficient large-scale string
    matching algorithm.

LIU Ping et al., A Partition-Based Efficient
Algorithm for Large Scale Multiple-Strings
Matching, IEEE SPIRE 2005
16
Lexical Processing
????
  • Difficulties in Chinese lexical analysis
  • Segmentation
  • Overlapped ambiguities
  • Combination ambiguities
  • Unknown word recognition
  • Named entities: PER, LOC, ORG, etc.
  • New words
  • POS tagging

17
HHMM Architecture in ICTCLAS III
HHMM-based Chinese lexical analysis
????
HHMM Architecture Trace
18
Class-based segmentation
????
  • Word class definition
  • Class-based segmentation model

Class ci of a word wi:
  ci = wi                          iff wi is listed in the segmentation lexicon
  ci = PER, LOC, ORG, TIME or NUM  iff wi is an unknown named entity
  ci = STR                         iff wi is an unknown symbol string
  ci = BEG                         iff beginning of a sentence
  ci = END                         iff ending of a sentence
  ci = OTHER                       otherwise
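
As a rough illustration of the word-class definition above, the mapping from words to classes might look like the sketch below. It is not ICTCLAS code. The names `word_class` and `class_sequence`, the `entity_types` lookup (standing in for an upstream named-entity recognizer), and the test for "symbol string" (an ASCII alphanumeric run) are all assumptions, and English tokens are used only to keep the example readable.

```python
NE_TYPES = {"PER", "LOC", "ORG", "TIME", "NUM"}


def word_class(token, lexicon, entity_types):
    """Map one token wi to its class ci."""
    if token in lexicon:
        return token                      # listed word: its own class
    if entity_types.get(token) in NE_TYPES:
        return entity_types[token]        # unknown named entity
    # Assumption: a "symbol string" is an ASCII alphanumeric run.
    if token.isascii() and token.isalnum():
        return "STR"
    return "OTHER"


def class_sequence(tokens, lexicon, entity_types):
    """Class sequence for one sentence, with BEG and END markers."""
    middle = [word_class(t, lexicon, entity_types) for t in tokens]
    return ["BEG"] + middle + ["END"]


# Hypothetical lexicon and entity labels, for illustration only.
lexicon = {"visited", "in"}
entity_types = {"Zhang": "PER", "Beijing": "LOC"}
tokens = ["Zhang", "visited", "ICT2006", "in", "Beijing", "!"]
print(class_sequence(tokens, lexicon, entity_types))
```

Collapsing unknown words into a handful of classes is what lets one segmentation model score sentences that contain words absent from the lexicon.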
19
Role-based Unknown word recognition
????
  • Unknown word recognition: role-based HMM
  • ?/Surname ?/Mid_name ?/last_name 1893?/context
    ??/remote_context
  • The probability P(wi|ci) of a recognized unknown
    word can be estimated in the role-based HMM

Huaping Zhang et al., Chinese Named Entity
Recognition Using Role Model, International
Journal of Computational Linguistics and Chinese
Language Processing, 2003, Vol. 8 (2)
20
Chinese New Word Identification
????
  • The number of unknown or new words explodes as
    the Web grows.
  • ????????????????????????
  • We explored character coupling, single-character
    word probability, and position information for
    identifying new words.

21
Chinese New Word Identification
????
  • Character Coupling

??   N1   N2   Coup(cicj)
??   52   50   0.9615
??    8    8   1
??   10    0   0
??   30    1   0.0323
??   31   18   0.5806
??   18    8   0.4444
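
The slide does not spell out how Coup(cicj) is computed, and the meaning of N1 and N2 is lost with the header text. The sketch below assumes Coup = N2 / N1, which is an inference from the numbers and not a stated definition. It reproduces five of the six listed values; the row with N1 = 30, N2 = 1 gives 0.0333 under this assumption, not the listed 0.0323.

```python
# (N1, N2, listed Coup) copied from the table above.
rows = [
    (52, 50, 0.9615),
    (8, 8, 1.0),
    (10, 0, 0.0),
    (30, 1, 0.0323),
    (31, 18, 0.5806),
    (18, 8, 0.4444),
]


def coupling(n1, n2):
    """Assumed definition: the share N2 / N1 (0 when N1 is 0)."""
    return n2 / n1 if n1 else 0.0


# Rows whose listed value differs from N2 / N1 rounded to 4 places.
mismatches = [row for row in rows if round(coupling(row[0], row[1]), 4) != row[2]]
for n1, n2, listed in rows:
    print(n1, n2, listed, round(coupling(n1, n2), 4))
print("rows not matching N2/N1:", mismatches)
```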
22
Recognition Sample
????
????? ????? ?? ????
????????? ,???????????? ?/??/??/??/?/?/,?/?/?/?/?/?/?/?/??./?/? ?? ???????? ?? ??? ????
????????? ??/?/?/?/?/?/?/? ??????? ????
??????? ?/?/?/?/?/?/? ??????? ????
??????????? ????? ??/?/?/?/?/?/??/ ??/?/??/?? ????? ???
Unpublished
23
Text Mining
????
  • Supervised learning: classification
  • Unsupervised learning: clustering
  • New feature detection

24
Text Classification (1): Information Granularity
Based Classification
????
  • From the viewpoint of granularity, clustering
    operates at a uniform granularity, while
    classification operates at different granularities.
  • Illustration

[Figure: granularity illustration; labels A, B, a, b, c and 1 2 3 4 5]
25
????
Average ?5
26
Text Classification (2): DragPushing, a Refinement
Strategy for Text Classifiers
????
  • DragPushing is a refinement strategy that enhances
    the performance of high-speed text
    classifiers such as CB or Rocchio.
  • The main motivation behind this strategy is the
    hypothesis that there is still room for
    performance improvement, because the learning
    algorithm itself may have inductive bias, or the
    text collection may misfit the learning model to
    some degree.
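
The transcript does not give the DragPushing update rule, so the following is only one plausible reading, sketched for a centroid classifier: when a training document is misclassified, its true class centroid is dragged toward it and the wrongly chosen centroid is pushed away. The update formula, the `rate` and `epochs` parameters, the clipping at zero, and the toy vectors are all assumptions and should be checked against the cited paper.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict(centroids, doc):
    """Label whose centroid is most similar to the document."""
    return max(centroids, key=lambda label: cosine(centroids[label], doc))


def train_centroids(docs, labels):
    """Plain centroid classifier: one summed vector per class."""
    sums = {}
    for doc, label in zip(docs, labels):
        acc = sums.setdefault(label, [0.0] * len(doc))
        for k, x in enumerate(doc):
            acc[k] += x
    return sums


def drag_push(centroids, docs, labels, rate=0.5, epochs=5):
    """Refine centroids using misclassified training documents (assumed rule)."""
    for _ in range(epochs):
        errors = 0
        for doc, label in zip(docs, labels):
            guess = predict(centroids, doc)
            if guess == label:
                continue
            errors += 1
            for k, x in enumerate(doc):
                centroids[label][k] += rate * x          # drag true class closer
                pushed = centroids[guess][k] - rate * x  # push wrong class away
                centroids[guess][k] = max(0.0, pushed)
        if errors == 0:
            break
    return centroids


# Toy 2-D term vectors, for illustration only.
docs = [[9, 1], [1, 1], [0, 1], [1, 3]]
labels = ["a", "a", "b", "b"]
centroids = train_centroids(docs, labels)
print([predict(centroids, d) for d in docs])   # before refinement
drag_push(centroids, docs, labels)
print([predict(centroids, d) for d in docs])   # after refinement
```

In this toy set the plain centroids misclassify the second document; one drag-and-push pass corrects it without retraining from scratch, which is the kind of cheap refinement the slide describes.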

27
DragPushing
????
28
????
Dragpushing
  • ???????
  • ?Centroid??

????SVM??
Songbo Tan et al., A Novel Refinement Approach for
Text Categorization, ACM SIGIR 2005, ACM CIKM
2005, etc.
29
Why P2P?
P2P IR
  • ??
  • ??????????????????????????????Google??????????????
    ???? ????????????????,???,??????????????
  • ???????????????????????????????????????,??????????
    ??????
  • ????
  • ??????????????????????????
  • ???????????????????
  • ??????????
  • ????
  • ??????????????????,?????????????
  • ?????????????????????????
  • ????,????????????,?????????SW Effect, PL?

30
P2P?????????????????
P2P IR
  • ??????????????????????????,??????????
  • ?????????????????????????????????????????????
  • ????????????
  • ????????????????????????????
  • ??
  • ??????????????????????????????????????
  • ??????????????????,????Scalability?

31
WonGoo ??P2P???????
P2P IR
  • ??CAN?M????????
  • ?????????????????(????)
  • ???????????????????(????????????????????????)

???????????
???????????
Jianming Lv et al., WonGoo: A Pure Peer-to-Peer Full
Text Information Retrieval System Based on
Semantic Overlay Networks, IEEE NCA 2004
32
WonGoo@WAX Researcher Network
P2P IR
[Figure: researcher network of WAX nodes]
33
Community Identification
Web Mining
  • Current community identification

Link density community: Kleinberg et al., Science 294
(2001)
Edge density community: Palla et al., Nature (2005)
34
Outline
  • Background and Motivation
  • Some of existing works in ICT
  • Some of existing research works
  • Some of sharing systems
  • Conclusion

35
??????????
  • ??????????????????????????????
  • ???????,?????,???????????????????,?????????
  • ???????,????
  • ????????????
  • ?????????ICTCLAS(??????????)
  • ?????????FirteX(????????)
  • ??/?????ICTDRAP(??????????)
  • ???????IceStream (?????)

36
FirteX------???????????????
  • ??1????????(????,??????????)
  • ??????????????(HTML,PDF,WORD?)?????XML???????,???
    ???????????,?????????
  • ??2????(?????????????????)
  • ???TREC?????????????????3???????,??????????????,
    ??? ??
  • ??3?????(????????????????????)
  • ???????????,??????????????,??????????????????????
    ????,??????C/C????COM????

37
FirteX??
Application
Collection
Parser
Analyzer
Index Access Component: Index Reader, Index Searcher, Index Writer
Index Component: Word Indexer, URL Indexer, Other User Indexer
Storage Component: Disk Storage, RAM Storage, Cluster Storage
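
To show how a writer, reader, and searcher can sit on top of a storage layer, here is a minimal inverted-index sketch. The class names echo the component names above, but this is not the FirteX API: every class, method, and the whitespace analyzer are assumptions, and the ranking (summed term frequency) is chosen only to keep the example short.

```python
from collections import defaultdict


class RamStorage:
    """In-memory postings: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)


class IndexWriter:
    def __init__(self, storage, analyzer=str.split):
        self.storage = storage
        self.analyzer = analyzer      # assumed: whitespace tokenizer

    def add_document(self, doc_id, text):
        for term in self.analyzer(text.lower()):
            postings = self.storage.postings[term]
            postings[doc_id] = postings.get(doc_id, 0) + 1


class IndexReader:
    def __init__(self, storage):
        self.storage = storage

    def postings(self, term):
        return self.storage.postings.get(term, {})


class IndexSearcher:
    def __init__(self, reader):
        self.reader = reader

    def search(self, query):
        """AND query, ranked by summed term frequency, then doc id."""
        terms = query.lower().split()
        if not terms:
            return []
        lists = [self.reader.postings(t) for t in terms]
        common = set(lists[0]).intersection(*lists[1:])
        scored = [(sum(p[d] for p in lists), d) for d in common]
        return [d for _, d in sorted(scored, key=lambda s: (-s[0], s[1]))]


storage = RamStorage()
writer = IndexWriter(storage)
writer.add_document(1, "web search and web mining")
writer.add_document(2, "web mining")
writer.add_document(3, "text mining")
searcher = IndexSearcher(IndexReader(storage))
print(searcher.search("web mining"))   # [1, 2]
```

Keeping storage behind its own class is what would let a disk or cluster backend replace the in-memory one without touching the writer or searcher.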
38
Storage Layer
39
FirteX ?????
    Lucene 2.00 Lemur 4.32 Lemur 4.32 Lemur 4.32 Lemur 4.32 FirteX 1.02
    Lucene 2.00 InvFPIndex InvFPIndex Indri Keyfile FirteX 1.02
?? ???? ? ? ? ? ? ?
?? ????? ? ? ? ? ? ?
?? ????(???) ? ?- ?- ?- ?- ?
?? ??????? ? ? ? ? ? ?
?? ??Tb??? ? ? ? ? ? ?
?? ???? 1x 3x 3x 3x 3x 9x
?? ????????? ? ? ? ? ? ?
?? ????????? ? ? ? ? ? ?
?? ???? ? ? ? ? ? ?
?? ???? ?? ?? ?? ?? ?? ??
?? ??????? ? ? ? ? ? ?
?? ????????? ? ? ? ? ? ?
?? ??????? ? ? ? ? ? ?
?? COM???? ? ? ? ? ? ?
?? XML???? ? ? ? ? ? ?
?? ?????? ? ? ? ? ? ?
NOTE????Lucene?Lemur????????????????????.
40
FirteX ????-??
???? Windows 2000 Advance Server,P4
2.8G(2CPU),2G RAM,?5???????,?????CWT100G?????????,
????5k30k
corpus1 Corpus2 corpus4 corpus8 Corpus11
????(M) 1024 1024 1024 1024 1024
?????? 10 10 10 10 10
???(?/??)(?) 961 961 961 961 961
CPU????? 49 49 49 49 49
??????(G) 1.0 2.0 4.0 8.0 11.5
?????(?) 1 2 4 8 5
????(?) 60183 120367 240792 482319 699247
????(G) 0.49(510M) 0.99G 1.97 3.99 5.82
???(s) 247.11 573.15 1277.34 2603.33 3150.74
????(M/min) 248.4 214.2 193.2 189.0 224.4
41
FirteX ????-??
Corpus1 Corpus2 corpus4 corpus8 corpus11
????????(M) 16.61 16.07 14.85 17.0 17.4
????(G) 0.49 0.98 1.97 3.97 5.80
?????? 109641 109641 109641 109641 109641
???????(?) 2.5 2.5 2.5 2.5 2.5
?????(?) 71,345,331 140,474,915 280,155,094 570,921,066 838,464,138
???(s) 75.42 80.234 109.063 171.28 264.45
????(ms/q) 0.69 0.73 0.99 1.56 2.41
??????????????,????????
42
FirteX?????????
  • FirteX????????????????????????????????????????????
    ???????
  • ??????,FirteX????Lucene?Lemur???,???????FirteX????
    ???
  • FirteX????????,?????????????????
  • http://www.firtex.org

43
ICTCLAS Chinese Lexical Analysis
44
Architecture of ICTCLAS
Character String
Class-based WS model
Atom Segment
NSP rough segment
Unknown word recognition
Training
Lexical result
45
Evaluation of ICTCLAS
  • ICTCLAS 3.0 supports GB, BIG5, and Unicode.
  • ICTCLAS integrates all lexical analysis tasks
    into an HHMM-based framework.
  • ICTCLAS 1.0 ranked first in the national evaluation
    held by the 973 fundamental research program in
    China, with 97.58.
  • ICTCLAS 2.0 achieved two first places and one
    second place among 6 tracks in the first
    international SIGHAN word segmentation bakeoff.
  • The free ICTCLAS code has been licensed to over
    30,000 entities worldwide. It is the most popular
    Chinese lexical analyzer.

46
Outline
  • Background and Motivation
  • Some of existing works in ICT
  • Some of existing research works
  • Some of sharing systems
  • Conclusion

47
What's next after Google?
  • ???????????
  • ????????????(?????)?P2P(?????)?IPTV?Blog???????
  • Google?DOCOMO????SKYPE?
  • ????????????
  • ?????????????????????
  • ?????CNGI,3G????????
  • ???????????????,IP????????????
  • What are the characteristics of Google II?
  • Search Invisible and Search Anywhere
  • Self-organized, Activity, Weak-Rule based, P2P and
    Pervasive, Security
  • ????????(????????)

?????????????????
48
??????????!! Thanks!
???,cxq_at_ict.ac.cn ?????? ??????? ?????????????