Title: ????????????? ------ Web Search and Web Mining
1????????????? ------ Web Search and Web Mining
- ???, cxq_at_ict.ac.cn
- ????????????
- 06.8.17 SWCL 2006 ??
2Outline
- Background and Motivation
- Some existing work at ICT
- Some existing research work
- Some shared systems
- Conclusion
3A Big Problem!
Semantic Web
Information Retrieval
Social Computing
Network Science
Web Computing
Natural Language Processing
Machine Learning
Text Mining
4?????Web?????
Unified Browsing
Unified Search
Personalized Search
Personalized Space
5Web Mining????????????
- ???????????
- ????100???PB???????10???????????,???????????300
0???????,?????10?? - ??????????????
- ??????????????????
- ??????????????
- ?????????
- ??????????
- ?????!
- ??????????,???????
- ???????????????
- ??????????
- ???????
6Web 2.0????????
- ???????
- Architecture: from server-centered to peer-distributed
- ????P2P, Blog
- ???????
- Streaming: from INFORMATION to MESSAGE
- Socialization
- ???????Rich Content
- ???????
- Rich Dimensions
7Motivation????????,?????????
- ??????????????????????,??????
- Assumptions of VSM, PM, LM, etc.
- How to represent the rich-dimensional feature spaces?
- ????????????????,????????
- Unified RANKING has so many biases! Identity vs. Otherness (active computing)
- Special algorithms for rich-dimensional feature space
- Streaming
- Message vs. text/sentence: dynamic and context-sensitive
- Tradeoff between deep understanding and performance
- Shallow and efficient language processing
8Outline
- Background and Motivation
- Some existing work at ICT
- Some existing research work
- Some shared systems
- Conclusion
9Organizations of ICT
- About 80 people in I3S
- About 25 research faculty
- More than 40 students
- Over 20 Ph.D. candidates, over 15 master's candidates
10Related work in I3S_at_ICT
- Research topics
- ???????????? Dr. ????
- ???(??)?????? Dr. ????
- ????Dr. ????? etc
- ????????????? Dr. ????
- ???????????Dr. ????
- P2P ??Dr. ????
- Sharable Systems
- ???????????ICTCLAS
- ????????????FirteX
11Data Stream Management
???
- Conditions
- High-speed streaming (over 10GBps)
- Large-scale queries (over 100,000)
- Emergence of temporal unknown patterns
- Requirement
- Online responding
- Emergence prediction
- Challenges
12What we are pursuing
???
- Query Processing
- Processing multiple filtering queries on a single stream
- Join algorithms on multiple streams
- Data Stream mining
- Frequent patterns discovery
- Clustering
- Emergence prediction
-
13Multiple Strings Matching
???
- Classic Algorithms
- Prefix-based approach: KMP, AC, Shift-And, Shift-Or
- Suffix-based approach: Boyer-Moore, Wu-Manber
- Factor-based approach: SBDM, SBOM
- Challenge
- The number of feature strings increases with the rapid growth of information scale (the ClamAntiVirus library has 26653).
- Traditional string matching algorithms cannot solve the problem when the number of features exceeds 5000.
?????????????????
?????????????
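As a concrete instance of the prefix-based family listed above, here is a minimal Aho-Corasick (AC) sketch. It is an illustrative textbook version, not the implementation used at ICT; the function names `build_automaton` and `search` are chosen here for the example.

```python
from collections import deque


def build_automaton(patterns):
    """Build Aho-Corasick goto, failure and output tables for the patterns."""
    goto = [{}]      # goto[state][char] -> next state (a trie)
    fail = [0]       # failure link of each state
    output = [[]]    # patterns recognized when a state is reached
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)

    # Breadth-first traversal: depth-1 states keep failure link 0,
    # deeper states inherit from the failure link of their parent.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            matches.append((pos - len(pattern) + 1, pattern))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", automaton))
```

The text is scanned once regardless of how many patterns there are, but the automaton grows with the total pattern length, which is one reason very large keyword sets become a problem in practice.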
14Partition Combinatorial Optimization Matching (ICT-COM)
???
Find the optimal partition
Find the shortest path in a weighted graph
- Construct a weighted graph G from the given keyword set P as follows:
- Node: each block of keywords with length i in P (the graph also has a source and a sink)
- Edge: the set of blocks with length greater than or equal to i, but less than j
- Weight: the minimal time taken by the classical algorithms to search a training text for the keywords in the corresponding subset
- Objective
- Find the shortest path from source to sink in G
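The graph construction above can be sketched as a small dynamic program. This is a minimal sketch under stated assumptions: keywords are grouped by length, each edge covers the half-open length range [i, j), and `toy_weight` is a hypothetical stand-in for the measured minimal search time of the classical algorithms on a training text. Because every edge runs from a shorter length to a longer one, the graph is acyclic and the shortest path can be found in one pass.

```python
def optimal_partition(lengths, edge_weight):
    """Split keyword lengths into contiguous ranges with minimal total weight.

    lengths: iterable of keyword lengths that occur in the keyword set.
    edge_weight(i, j): cost of handling all keywords with i <= length < j
        as one subset (assumed to be supplied by the caller).
    Returns (total_cost, [(i, j), ...]) with half-open ranges.
    """
    bounds = sorted(set(lengths))
    bounds.append(bounds[-1] + 1)          # sink: one past the longest keyword
    n = len(bounds)
    best = [float("inf")] * n              # best[k]: shortest distance to node k
    prev = [None] * n                      # predecessor on the shortest path
    best[0] = 0.0                          # source: the shortest keyword length
    for j in range(1, n):
        for i in range(j):
            candidate = best[i] + edge_weight(bounds[i], bounds[j])
            if candidate < best[j]:
                best[j], prev[j] = candidate, i

    # Walk back from the sink to recover the chosen ranges.
    parts = []
    j = n - 1
    while j > 0:
        i = prev[j]
        parts.append((bounds[i], bounds[j]))
        j = i
    parts.reverse()
    return best[-1], parts


def toy_weight(lo, hi):
    """Hypothetical cost: fixed overhead per subset plus a penalty for mixing lengths."""
    return 1.0 + 0.5 * (hi - lo - 1) ** 2


print(optimal_partition([3, 4, 5, 6], toy_weight))
```

In a real setting `edge_weight` would time each candidate algorithm on the subset and return the minimum, also remembering which algorithm achieved it.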
15???
Results of ICT-COM
- Optimization
- Analysis
- COM produced 4 subsets and assigned each a different algorithm.
- 3-9 (AC), 10-13 (SBOM), 14-35 (SBOM), 36-210 (SBOM)
- COM is about 3 times faster than the fastest classical algorithm.
- ICT-COM is an efficient large-scale string matching algorithm.
LIU Ping et al., A Partition-Based Efficient Algorithm for Large Scale Multiple-Strings Matching, IEEE SPIRE 2005
16Lexical Processing
????
- Difficulties in Chinese lexical analysis
- Segmentation
- Overlapped ambiguities
- Combination ambiguities
- Unknown words recognition
- Named entities PER, LOC, ORG, etc.
- New words
- POS tagging
17HHMM Architecture in ICTCLAS III
HHMM-based Chinese lexical analysis
????
HHMM Architecture Trace
18Class-based segmentation
????
- Word class definition
- Class-based segmentation model
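$$
c_i =
\begin{cases}
w_i & \text{iff } w_i \text{ is listed in the segmentation lexicon} \\
\text{PER, LOC, ORG, TIME or NUM} & \text{iff } w_i \text{ is an unknown named entity} \\
\text{STR} & \text{iff } w_i \text{ is an unknown symbol string} \\
\text{BEG} & \text{iff beginning of a sentence} \\
\text{END} & \text{iff ending of a sentence} \\
\text{OTHER} & \text{otherwise}
\end{cases}
$$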
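The word class definition above can be sketched as a lookup function. This is only an illustration: the sentence markers `<BEG>` and `<END>`, the `entity_tags` dictionary, and the ASCII test used to detect a "symbol string" are assumptions made for the example, not details of ICTCLAS.

```python
BEG_MARK = "<BEG>"   # hypothetical sentence-start marker
END_MARK = "<END>"   # hypothetical sentence-end marker


def word_class(token, lexicon, entity_tags):
    """Map a token w_i to its class c_i.

    lexicon: set of words listed in the segmentation lexicon.
    entity_tags: dict mapping recognized unknown named entities to one of
        PER, LOC, ORG, TIME or NUM (assumed to come from an earlier step).
    """
    if token in lexicon:
        return token                      # listed words are their own class
    if token in entity_tags:
        return entity_tags[token]         # unknown named entity
    if token == BEG_MARK:
        return "BEG"
    if token == END_MARK:
        return "END"
    if token.isascii() and token.isalnum():
        return "STR"                      # assumed test for a symbol string
    return "OTHER"


lexicon = {"中国", "科学院"}
entity_tags = {"张华平": "PER", "2006年": "TIME"}
tokens = [BEG_MARK, "张华平", "中国", "ICTCLAS", "…", END_MARK]
print([word_class(t, lexicon, entity_tags) for t in tokens])
```

Collapsing all unknown names into a handful of classes keeps the segmentation model small, since it only needs statistics per class and not per individual name.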
19Role-based Unknown word recognition
????
- Unknown word recognition: role-based HMM
- ?/Surname ?/Mid_name ?/last_name 1893?/context ??/remote_context
- The probability P(wi|ci) of recognized unknown words can be estimated in the role-based HMM
Huaping Zhang et al., Chinese Named Entity Recognition Using Role Model, International Journal of Computational Linguistics and Chinese Language Processing, 2003, Vol. 8 (2)
20Chinese New Word Identification
????
- Unknown words and new words are exploding as the Web grows.
- ????????????????????????
- We explored character coupling, single-character word probability, and position information for identifying new words.
21Chinese New Word Identification
????
?? N1 N2 Coup(cicj)
?? 52 50 0.9615
?? 8 8 1
?? 10 0 0
?? 30 1 0.0323
?? 31 18 0.5806
?? 18 8 0.4444
22Recognition Sample
????
????? ????? ?? ????
????????? ,???????????? ?/??/??/??/?/?/,?/?/?/?/?/?/?/?/??./?/? ?? ???????? ?? ??? ????
????????? ??/?/?/?/?/?/?/? ??????? ????
??????? ?/?/?/?/?/?/? ??????? ????
??????????? ????? ??/?/?/?/?/?/??/ ??/?/??/?? ????? ???
Unpublished
23Text Mining
????
- Supervised learning: classification
- Unsupervised learning: clustering
- New Feature Detection
24Text Classification (1): Information Granularity based classification
????
- From the viewpoint of granularity, clustering is a procedure at a uniform granularity, while classification works at different granularities.
- Illustration
[Figure: illustration with labels A, B, a, b, c and points 1 2 3 4 5]
25????
Average ?5
26Text Classification (2): DragPushing, A Refinement Strategy for Text Classifier
????
- DragPushing is a refinement strategy that enhances the performance of high-speed text classifiers such as CB or Rocchio.
- The main motivation behind this strategy is the hypothesis that there is still room for performance improvement, because the learning algorithm itself may have inductive bias, or the text collection may misfit the learning model to some degree.
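The slide does not spell out the update rule. The following is a minimal sketch, assuming the commonly described form of DragPushing for a centroid classifier: for each misclassified training document, the centroid of its true class is dragged toward the document and the centroid of the wrongly predicted class is pushed away. The dot-product similarity, the `weight` step size, the number of rounds, and the toy vectors are all assumptions for illustration, not details taken from the slides or the cited paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_centroids(samples):
    """samples: list of (vector, label). Returns label -> mean vector."""
    sums, counts = {}, {}
    for vec, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, vec):
    """Pick the class whose centroid has the largest dot product with vec."""
    return max(centroids, key=lambda label: dot(centroids[label], vec))


def drag_pushing(centroids, samples, weight=0.5, rounds=5):
    """Refine centroids using misclassified training samples (simplified)."""
    refined = {label: list(c) for label, c in centroids.items()}
    for _ in range(rounds):
        for vec, label in samples:
            guess = predict(refined, vec)
            if guess != label:
                # Drag: move the true class centroid toward the sample.
                refined[label] = [c + weight * x for c, x in zip(refined[label], vec)]
                # Push: move the wrongly chosen centroid away from the sample.
                refined[guess] = [c - weight * x for c, x in zip(refined[guess], vec)]
    return refined


samples = [
    ([1.0, 0.0], "A"), ([1.0, 0.0], "A"), ([0.45, 0.55], "A"),
    ([0.0, 1.0], "B"), ([0.0, 1.0], "B"), ([0.0, 1.0], "B"),
]
base = train_centroids(samples)
refined = drag_pushing(base, samples)
print(predict(base, [0.45, 0.55]), predict(refined, [0.45, 0.55]))
```

The appeal of this kind of refinement is that the base classifier stays cheap: training adds only a few passes over the training set, and prediction cost is unchanged.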
27DragPushing
????
28????
Dragpushing
????SVM??
Songbo Tan et al., A Novel Refinement Approach for Text Categorization, ACM SIGIR 2005, ACM CIKM 2005, etc.
29Why P2P?
P2P IR
- ??
- ??????????????????????????????Google??????????????
???? ????????????????,???,?????????????? - ???????????????????????????????????????,??????????
?????? - ????
- ??????????????????????????
- ???????????????????
- ??????????
- ????
- ??????????????????,?????????????
- ?????????????????????????
- ????,????????????,?????????SW Effect, PL?
30P2P?????????????????
P2P IR
- ??????????????????????????,??????????
- ?????????????????????????????????????????????
- ????????????
- ????????????????????????????
- ??
- ??????????????????????????????????????
- ??????????????????,????Scalability?
31WonGoo ??P2P???????
P2P IR
- ??CAN?M????????
- ?????????????????(????)
- ???????????????????(????????????????????????)
???????????
???????????
Jianming Lv et al., WonGoo: A Pure Peer-to-Peer Full Text Information Retrieval System Based on Semantic Overlay Networks, IEEE NCA 2004
32WonGoo_at_WAX Researcher Network
P2P IR
[Figure: nodes labeled WAX]
33Community Identification
Web Mining
- Current community identification
Link density community: Kleinberg et al., Science 294 (2001)
Edge density community: Palla et al., Nature (2005)
34Outline
- Background and Motivation
- Some existing work at ICT
- Some existing research work
- Some shared systems
- Conclusion
35??????????
- ??????????????????????????????
- ???????,?????,???????????????????,?????????
- ???????,????
- ????????????
- ?????????ICTCLAS(??????????)
- ?????????FirteX(????????)
- ??/?????ICTDRAP(??????????)
- ???????IceStream (?????)
36FirteX------???????????????
- ??1????????(????,??????????)
- ??????????????(HTML,PDF,WORD?)?????XML???????,???
???????????,????????? - ??2????(?????????????????)
- ???TREC?????????????????3???????,??????????????,
??? ?? - ??3?????(????????????????????)
- ???????????,??????????????,??????????????????????
????,??????C/C????COM????
37FirteX??
Application
Collection
Parser
Analyzer
Index Access Component
Index Reader
Index Searcher
Index Writer
Index Component
Word Indexer
URL Indexer
Other User Indexer
Storage Component
Disk Storage
RAM Storage
Cluster Storage
38Storage Layer
39FirteX ?????
  Lucene 2.00 Lemur 4.32 Lemur 4.32 Lemur 4.32 Lemur 4.32 FirteX 1.02
  Lucene 2.00 InvFPIndex InvFPIndex Indri Keyfile FirteX 1.02
?? ???? ? ? ? ? ? ?
?? ????? ? ? ? ? ? ?
?? ????(???) ? ?- ?- ?- ?- ?
?? ??????? ? ? ? ? ? ?
?? ??Tb??? ? ? ? ? ? ?
?? ???? 1x 3x 3x 3x 3x 9x
?? ????????? ? ? ? ? ? ?
?? ????????? ? ? ? ? ? ?
?? ???? ? ? ? ? ? ?
?? ???? ?? ?? ?? ?? ?? ??
?? ??????? ? ? ? ? ? ?
?? ????????? ? ? ? ? ? ?
?? ??????? ? ? ? ? ? ?
?? COM???? ? ? ? ? ? ?
?? XML???? ? ? ? ? ? ?
?? ?????? ? ? ? ? ? ?
NOTE????Lucene?Lemur????????????????????.
40FirteX ????-??
???? Windows 2000 Advance Server,P4
2.8G(2CPU),2G RAM,?5???????,?????CWT100G?????????,
????5k30k
corpus1 Corpus2 corpus4 corpus8 Corpus11
????(M) 1024 1024 1024 1024 1024
?????? 10 10 10 10 10
???(?/??)(?) 961 961 961 961 961
CPU????? 49 49 49 49 49
??????(G) 1.0 2.0 4.0 8.0 11.5
?????(?) 1 2 4 8 5
????(?) 60183 120367 240792 482319 699247
????(G) 0.49(510M) 0.99G 1.97 3.99 5.82
???(s) 247.11 573.15 1277.34 2603.33 3150.74
????(M/min) 248.4 214.2 193.2 189.0 224.4
41FirteX ????-??
Corpus1 Corpus2 corpus4 corpus8 corpus11
????????(M) 16.61 16.07 14.85 17.0 17.4
????(G) 0.49 0.98 1.97 3.97 5.80
?????? 109641 109641 109641 109641 109641
???????(?) 2.5 2.5 2.5 2.5 2.5
?????(?) 71,345,331 140,474,915 280,155,094 570,921,066 838,464,138
???(s) 75.42 80.234 109.063 171.28 264.45
????(ms/q) 0.69 0.73 0.99 1.56 2.41
??????????????,????????
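The row labels of the table above were lost, so the following check rests on an assumption: that the row reading 109641 is the number of queries, the row in seconds is the total search time, and the row in ms/q is the average time per query. Under that reading the last row follows from the other two, as this small sketch shows.

```python
# Assumed meaning of the rows: number of queries and total search time (s).
NUM_QUERIES = 109641
TOTAL_SECONDS = {
    "Corpus1": 75.42,
    "Corpus2": 80.234,
    "corpus4": 109.063,
    "corpus8": 171.28,
    "corpus11": 264.45,
}


def ms_per_query(total_seconds, num_queries):
    """Average time per query in milliseconds."""
    return total_seconds * 1000.0 / num_queries


for corpus, seconds in TOTAL_SECONDS.items():
    print(corpus, round(ms_per_query(seconds, NUM_QUERIES), 2))
```

The rounded values agree with the ms/q row of the table (0.69, 0.73, 0.99, 1.56, 2.41).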
42FirteX?????????
- FirteX????????????????????????????????????????????
??????? - ??????,FirteX????Lucene?Lemur???,???????FirteX????
??? - FirteX????????,?????????????????
- http://www.firtex.org
43ICTCLAS Chinese Lexical Analysis
44Architecture of ICTCLAS
Character String
Class-based WS model
Atom Segment
NSP rough segment
Unknown word recognition
Training
Lexical result
45Evaluation of ICTCLAS
- ICTCLAS 3.0 supports GB, BIG5 and Unicode.
- ICTCLAS integrates all lexical analysis tasks into an HHMM-based framework.
- ICTCLAS 1.0 ranked top in the national evaluation held by the 973 fundamental research program in China, with 97.58.
- ICTCLAS 2.0 achieved two first ranks and one second rank among 6 tracks in the first international SIGHAN word segmentation bakeoff.
- The free ICTCLAS code has been licensed to over 30,000 entities worldwide. It is the most popular Chinese lexical analyzer.
46Outline
- Background and Motivation
- Some existing work at ICT
- Some existing research work
- Some shared systems
- Conclusion
47What's next after Google?
- ???????????
- ????????????(?????)?P2P(?????)?IPTV?Blog???????
- Google?DOCOMO????SKYPE?
- ????????????
- ?????????????????????
- ?????CNGI,3G????????
- ???????????????,IP????????????
- What are the characteristics of Google II?
- Search Invisible and Search Anywhere
- Self-organized, Activity, Weak-Rule based, P2P and Pervasive, Security
- ????????(????????)
?????????????????
48??????????!! Thanks!
???,cxq_at_ict.ac.cn ?????? ??????? ?????????????