Title: Adaptive Document Management
1Adaptive Document Management
- Tsinghua-ITF KE Co-LAB
- Tang Jie
2Agenda
- Background and current work
- Aims
- Framework
- Design
- Two Cases
31.Background
- The world is becoming increasingly information-centric.
- Huge volume
- Traditional technologies are gradually becoming less and less effective.
- Document content is also exploding.
4TIPSI
- The developed work
- The design of the TIPSI framework
- The development of the metadata editor in TIPSI
- The development of a universal parsing interface for XML Schema
- The implementation of the algorithm for constructing the logical view
- Partial implementation of the algorithm for constructing the semantic view
5Example of semantic view
6Problems
- Problems
- Less efficient intelligent processing engine. Currently, the navigator does not provide a powerful way for information discovery, e.g. partial match and fuzzy search.
- Information redundancy
- Less efficient content organization of large volumes for the three-layer view model.
- How to further improve the semantic view based on the logic view, and how to validate the logic view by the semantic view
72. Aims
- Organize documents in a catalog automatically.
- Provide an efficient way for the user to browse related documents on a single screen instead of a list of all related documents.
- Design an efficient indexing and search engine.
- Develop a practical IE (information extraction) component to efficiently extract the key information.
83. Framework
- Components include preprocessing, classifier, rule editor, evaluation tool, GUI interface and processing engine.
- The framework is shown in the following figure.
9(No Transcript)
10(1)Preprocessing
11Preprocessing
- Word segmentation
- Stopword removal
- Word similarity
- Stemmer
- Weighter
12Word segmentation
- Word segmentation is also called tokenization.
- It is easy for English documents: the space is a natural delimiter.
- It is difficult for Chinese or Japanese; here we rely on available tools.
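A minimal sketch of the English case only, assuming that splitting on non-alphanumeric characters and lowercasing is acceptable; the function name `tokenize` is hypothetical, and Chinese or Japanese text would need a dedicated segmenter instead.

```python
import re

def tokenize(text):
    """Split English text into lowercase word tokens.

    Assumption: any run of letters or digits is one token; spaces and
    punctuation act as delimiters.
    """
    return re.findall(r"[a-z0-9]+", text.lower())
```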
13Stop word removal
- Remove common words that are known to contribute little to the semantic meaning of a document.
- For English documents, construct a list containing all stop words.
- For Chinese and Japanese documents, remove all empty (function) words.
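The list-based approach for English could look roughly like this. The stop-word set below is a tiny illustrative sample rather than a complete list, and `remove_stop_words` is a hypothetical name.

```python
# Illustrative sample only; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "is", "in", "to", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]
```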
14Word Similarity
- Word similarity is very useful in information processing.
- Use a thesaurus to compute the similarity.
15Stemmer
- A stemmer normalizes semantically similar words by removing suffixes. It is useful for English documents.
- Do not use it for Chinese documents.
16Weighter
- The weighter computes the weight of each word.
- TFIDF and mutual information are both standard metrics in the information processing community.
- TFIDF: IDF reflects how a word is distributed over the whole corpus; TF is the frequency of the word in a document.
- Mutual information: indicates how much information a word provides about a given document.
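As a rough illustration of the TFIDF weighting, the sketch below uses one common variant (an assumption, since the slides do not fix a formula): TF is the word count divided by the document length, and IDF is log(N / df), where N is the number of documents and df is the number of documents containing the word.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()}
        )
    return weights
```

A word that occurs in every document gets weight 0 under this variant, which matches the intuition that it does not help to tell documents apart.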
17(2)Classifier
18Problems
- How to classify a document.
- How to adapt the model automatically to changing patterns.
19Solutions
- Rule-based classification
- Machine learning based classification
- Combining rule and machine learning.
- Multi-strategy.
20Method 1: rule-based
- Rule-based algorithms rely on, for example, a predefined keyword vocabulary, a specified author, a date, etc.
- A rule can be described as
- Exist(keyword1 or keyword2) -> cat1
- if (author(email-2) = user1) then -> cat2
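The two rules above can be sketched as predicate and category pairs. The document fields (`text`, `author`) and the keywords are hypothetical and chosen only for illustration.

```python
def classify_by_rules(doc, rules):
    """Return the categories of all rules whose predicate matches the document."""
    return [category for predicate, category in rules if predicate(doc)]

# Hypothetical rules: a keyword rule and an author rule.
rules = [
    (lambda d: any(k in d["text"].lower() for k in ("invoice", "price")), "cat1"),
    (lambda d: d["author"] == "user1", "cat2"),
]
```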
21Advantages and disadvantages
- Advantages
- Simple and easy for people to understand.
- Easy to implement.
- Disadvantages
- Rules are difficult to define, especially complex rules.
- Rules need to be dynamic; for example, static rules cannot cover the changing patterns in advertising emails.
22Method 2: Supervised machine learning
- Machine learning can learn from existing categorizations and automatically classify new documents into these categories.
- Machine learning algorithms include Bayes, SVM, SOM, BPNN, etc.
23Advantages and disadvantages
- Advantages
- Adapts automatically to changes in content patterns.
- Learns from user feedback.
- Extracts rules automatically.
- Disadvantages
- Needs a predefined document collection to train the model.
- The one-class problem.
24Answer to the one-class problem
- Fuzzy classification.
- Sometimes a document should be classified into one category; at other times it may need to be assigned to multiple classes. This is a fuzzy classification problem.
- Therefore, the machine learning component should support assigning a document to multiple classes.
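One simple way to allow multiple classes is to threshold per-category scores. This is a sketch under the assumption that the classifier returns a score in [0, 1] for each category; the threshold value and the fallback to the single best category are illustrative choices, not part of the original design.

```python
def assign_categories(scores, threshold=0.5):
    """Return every category whose score reaches the threshold.

    If no category qualifies, fall back to the single best one so that
    each document still receives a class.
    """
    chosen = [c for c, s in scores.items() if s >= threshold]
    if not chosen and scores:
        chosen = [max(scores, key=scores.get)]
    return sorted(chosen)
```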
25Method 3: Combining rule-based and machine learning
- Combine the rule-based algorithm and the supervised machine learning algorithm.
- Use the machine learning algorithm to extract new rules.
- Use the new rules, the user-defined rules and the model prediction to classify new documents.
26Multi-strategy learning
- A single learning algorithm cannot always produce good output (across various applications).
- Combining multiple learning algorithms and classifying according to their vote is more trustworthy.
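Voting can be sketched as a plain majority over the labels predicted by the individual classifiers. The tie-breaking rule (the label seen first wins) is an assumption made for this example.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one predicted label per classifier.

    Returns the most frequent label; on a tie, the label that appears
    first in the list wins.
    """
    counts = Counter(predictions)
    return max(counts, key=counts.get)
```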
27(3)Rule Editor and Parser
28Rule Editor and Parser
- The rule editor helps the user define the classification rules.
- The rule parser parses a rule and generates an FSM (finite state machine) or another structure.
- They are general components, also useful for other applications.
29Rule Definition
- Rule definition is based on JAPE, just as in TIPSI.
- An example
30Rule example
- Rule Agent
- (
- (
- (Token.string Authorization)
- (Token.string delegate)
- (Token.string )
- (
- (SpaceToken)
- (Token.kind space)
- )?
- )agent
- (Token.kind BR )
- )
- agent.Agent kindagent, ruleAgent
- This rule indicates that the value of agent is located between the word sequence "authorization delegate" and ENTER (a line break).
31(4)Evaluation Tool
32Evaluation Tool
- The evaluation tool has two uses.
- Evaluate the precision and recall of the classifier.
- Return the user's feedback to the classifier, so that the classifier can refine the model.
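For the first use, precision and recall for one category can be computed from parallel lists of predicted and actual labels. This is a minimal sketch; returning 0.0 when a denominator is zero is a convention chosen here.

```python
def precision_recall(predicted, actual, category):
    """Precision and recall of one category over parallel label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == category and a == category for p, a in pairs)
    fp = sum(p == category and a != category for p, a in pairs)
    fn = sum(p != category and a == category for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```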
33(5)Efficient Indexing and Querying for
Large-Scale Logic View and Semantic View
34Indexing Engine
- As the number of logic views and semantic views produced by TIPSI grows, accessing them efficiently will become a big problem.
- Therefore, a powerful indexing engine and an optimized querying algorithm are necessary.
- Traditional indexing technology cannot be ported to the semantic view and logic view directly.
- Content (plain text and original formats).
- Semantic information
- Logic information
35The indexing strategy: IR-based
- A strategy based on IR
- Two-level mapping
- The first level stores all the paths under the XML data model, i.e. APEX.
- The second level constructs the inverted table of all the indexed paths in the first level.
36APEX: a popular graph-based data model
- The data model APEX
- Definition: XML data can be represented by a directed, edge-labeled graph Gxml = (V, E, R), where V = Vc ∪ Va.
- E ⊆ Vc × A × V is the set of labeled edges.
- All the nodes in V can be reached from R. Each node in V is assigned an identifier id that represents its order in the XML document.
37Construct Index on the basis of data model (1)
- Construct the index on the basis of the data model
- Construct an XML data graph Gindex
- A Gindex for source XML data s is XML data d such that for every unique label path in s there exists exactly one data path instance in d, every label path in d is a label path in s, and in d every node o has only one incoming label.
- In short, a Gindex is a Gxml with the repeated paths removed.
38Construct Index on the basis of data model(2)
- Construct the path index in accordance with Gindex.
- This step records every unique path that appears in Gindex.
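Recording every unique label path can be sketched with the standard `xml.etree.ElementTree` module. This simplified version walks element tags only (attributes and text nodes are ignored, which is an assumption) and is not the APEX implementation itself.

```python
import xml.etree.ElementTree as ET

def label_paths(xml_text):
    """Return the unique root-to-node label paths, in first-seen order."""
    root = ET.fromstring(xml_text)
    seen = []

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        if path not in seen:  # repeated paths are recorded only once
            seen.append(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return seen
```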
39Construct Index on the basis of data model(3)
- Construct the inverted index on top of the path index.
- This step aims to improve the performance of partial-matching queries.
- This is an inverted table that maintains an inverted scratch label path list for each label in the source XML data.
- Each entry in this inverted table is keyed by the label and records two types of data: the scratch label path that contains the label, and the inverted position of the label in the path expression.
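A possible shape for such an inverted table is sketched below. It assumes that "inverted position" means the position of the label counted from the end of the path (0 for the last label); that reading, and the helper names, are assumptions.

```python
from collections import defaultdict

def build_inverted_index(paths):
    """Map each label to a list of (path, position from the end of the path)."""
    index = defaultdict(list)
    for path in paths:
        labels = path.strip("/").split("/")
        for i, label in enumerate(labels):
            index[label].append((path, len(labels) - 1 - i))
    return dict(index)

def paths_ending_with(index, label):
    """Partial match: all indexed paths whose last label is `label`."""
    return [path for path, pos in index.get(label, []) if pos == 0]
```

With this table, a partial-match query such as "any path ending in name" becomes a single lookup instead of a scan over all paths.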
40Querying Algorithm
- Query interface
- XPath language (regular path expression)
- Cache-based query processing. By caching frequently submitted queries, query processing is accelerated, especially in a distributed environment.
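The caching idea can be illustrated with `functools.lru_cache` in front of a path query. The sample document and the function name `run_query` are hypothetical, and `ElementTree.findall` supports only a subset of XPath, so this is a sketch rather than a full XPath engine.

```python
import xml.etree.ElementTree as ET
from functools import lru_cache

# Hypothetical sample document standing in for the indexed data.
DOC = ET.fromstring(
    "<report><company><name>A</name></company>"
    "<company><name>B</name></company></report>"
)

@lru_cache(maxsize=128)
def run_query(path):
    """Evaluate a path expression; repeated queries are served from the cache."""
    return tuple(node.text for node in DOC.findall(path))
```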
41(6)Processing Engine
42Document representation
- After processing by TIPSI, documents are represented as a logic view and a semantic view.
- The logic view provides a friendly navigator to users.
- The semantic view can answer more meaningful questions, such as: Where is the xxx corporation? How many corporations are in Shanghai?
43Query Analysis, Matcher and Retrieval
- Not simple keyword retrieval.
- Analyze what the user really wants.
- Efficiently match the user query to the semantic view or logic view.
- Locate the specified sections in the documents.
- Retrieve all related documents or related sections.
44Information integration
- Integrate all related information returned by the retrieval component, instead of just listing related documents.
45For example
- Question: how many corporations are in Beijing?
- Retrieve all corporations with location = Beijing.
- Count the returned corporations.
- Other questions: list all corporations whose profit in 2003 was greater than that in 2002. How much is the total profit of all corporations in Beijing?
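Once the semantic view exposes structured records, these questions reduce to filtering and aggregation. The records below are made-up placeholder data, and the choice of 2003 for the total-profit question is an assumption.

```python
# Placeholder records standing in for entries of the semantic view.
corporations = [
    {"name": "A", "location": "Beijing", "profit": {2002: 10, 2003: 15}},
    {"name": "B", "location": "Shanghai", "profit": {2002: 8, 2003: 7}},
    {"name": "C", "location": "Beijing", "profit": {2002: 5, 2003: 4}},
]

# How many corporations are in Beijing?
in_beijing = [c for c in corporations if c["location"] == "Beijing"]
count_beijing = len(in_beijing)

# Which corporations had a greater profit in 2003 than in 2002?
profit_grew = [c["name"] for c in corporations if c["profit"][2003] > c["profit"][2002]]

# Total profit of all corporations in Beijing (year 2003 assumed).
total_profit_beijing = sum(c["profit"][2003] for c in in_beijing)
```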
46Information integration
Company A Company B Company C
47Redundancy removal
- There might be a lot of redundancy in the documents returned by retrieval.
- For example: how to contact KEG at Tsinghua University.
- Only one correct piece of contact information should be shown.
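A very simple form of redundancy removal keeps the first occurrence of each returned snippet after normalizing case and whitespace. Real redundancy is often fuzzier than this, so treat it as a sketch with a hypothetical function name.

```python
def remove_redundant(snippets):
    """Keep the first occurrence of each snippet, ignoring case and spacing."""
    seen = set()
    unique = []
    for snippet in snippets:
        key = " ".join(snippet.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(snippet)
    return unique
```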
48XSL template, ontology and personal profile
- The XSL template provides the user with a friendly and easy-to-use navigator.
- Based on the explicit definition of an ontology, the system could provide more powerful functions.
- The personal profile logs the user's actions on the internet and provides user-oriented search.
49(7)GUI Interface
50Result Display
- After posting a request, the user might retrieve many related documents/sections. It is necessary to reorganize this information.
- Clustering is a good solution.
- For example, when the user selects IT company, the display component clusters the retrieved documents and groups them (software, hardware, etc.).
52Key technologies
53Key technologies
- Key technologies include
- High-dimensional document representation (i.e. sparse data) and efficient indexing and querying algorithms.
- Dynamic model and dynamic rules
- Information redundancy removal, matcher and information integration.
- Classifier refinement (with user feedback).
545.Two Cases
55Case-1
How could I find the IT corporations whose profit was more than 1 million last year?
Document Processing Engine
Too tired to organize
Database (Indexing)
Annual report
56Case-1
How could I find the IT corporations whose profit was more than 1 million last year?
So easy
Smart Document Processing Engine
Content analysis (logic extract Semantic extract)
Database (Indexing)
Classification
Annual report
57GUI Interface
58(2)Email-dispatch
59Email-dispatch
Dept of Tech
How much..
How to install..
Secretary
Intelligent Client Support Center
meeting with general Manager
Dept of Sales
advertisement
spam
60Project plan
- To be completed in five months by three full-time staff.
- Project schedule
- 2004.3 to 2004.4: preprocessing, rule editor, rule parser
- 2004.4 to 2004.5: classifier, evaluation tool
- 2004.5 to 2004.6: indexing for semantic view and logic view
- 2004.6 to 2004.7: search processing engine and user browser
- 2004.7 to 2004.8: integrate all components and debug the whole framework
61Reference
- Fax Internet Technical Resources Electronic Mail (Email) and Postal Mail. http://www.cs.columbia.edu/hgs/internet/email.html
- E-mail Classification in the Haystack Framework. Mark Rosen. DEEC of MIT.
62Thanks!