Adaptive Document Management - PowerPoint PPT Presentation

About This Presentation

Title:

Adaptive Document Management

Slides: 63

Provided by: kegCsTsi

Transcript and Presenter's Notes

Title: Adaptive Document Management

1
Adaptive Document Management

Tsinghua-ITF KE Co-LAB
Tang Jie

2
Agenda

Background and current work
Aims
Framework
Design
Two Cases

3
1. Background

The world is becoming increasingly
information-centric.
Huge volume of information.
Traditional technologies are gradually becoming
less and less effective.
Document content is also exploding.

4
TIPSI

Work completed so far
Design of the TIPSI framework
Development of the metadata editor in TIPSI
Development of a universal parsing interface
for XML Schema
Implementation of the algorithm for constructing
the logical view
Partial implementation of the algorithm for
constructing the semantic view.

5
Example of semantic view
6
Problems

The intelligent processing engine is not
efficient enough.
Currently, the navigator does not provide a
powerful way to discover information, e.g.
partial match and fuzzy search.
Information redundancy.
Content organization of large volumes is
inefficient in the three-layer view model.
How to further improve the semantic view based on
the logic view, and how to validate the logic
view with the semantic view.

7
2. Aims

Organize documents into a catalog automatically.
Provide an efficient way for users to browse
related documents on a single screen instead of
a list of all related documents.
Design an efficient indexing and search engine.
Develop a practical IE (information extraction)
component to efficiently extract key information.

8
3. Framework

Components include preprocessing, a classifier,
a rule editor, an evaluation tool, a GUI and
a processing engine.
The framework is shown in the following figure.

9
(No Transcript)
10
(1)Preprocessing
11
Preprocessing

Word segmentation
Stopword removal
Word similarity
Stemmer
Weighter

12
Word segmentation

Word segmentation is also called tokenization.
It is easy for English documents: the space is a
natural delimiter.
It is difficult for Chinese or Japanese, so it is
based on available tools.
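
For the English case, a minimal sketch of a tokenizer could look like the following. The function name and the regular expression are illustrative assumptions, not part of the original design, and Chinese or Japanese text would still need a dedicated segmentation tool.

```python
import re

def tokenize_english(text):
    """Split English text into lowercase word tokens.

    Whitespace and punctuation act as delimiters; an apostrophe inside a
    word (e.g. "report's") is kept as part of the token.
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

print(tokenize_english("Space is a natural indicator."))
```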

13
Stop word removal

Remove common words that are known to contribute
little to the semantic meaning of a document.
For English documents, construct a list
containing all stop words.
For Chinese and Japanese documents, remove all
empty (function) words.
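
As a rough illustration of the list-based approach for English, the sketch below filters tokens against a stop-word set. The set shown is a tiny made-up sample; a real list would be much longer.

```python
# A tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "to", "and"}

def remove_stop_words(tokens):
    """Return the tokens that are not in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "profit", "of", "the", "company"]))
```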

14
Word Similarity

Word similarity is very useful in information
processing.
Use a thesaurus to compute the similarity.

15
Stemmer

A stemmer normalizes semantically similar words by
removing their suffixes. It is useful for English
documents.
It is not used for Chinese documents.
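
To make the idea of suffix removal concrete, here is a deliberately naive sketch. It is not the Porter algorithm or any stemmer named in the slides; the suffix list and the minimum stem length are assumptions chosen only for illustration.

```python
# Suffixes are tried in order, so longer ones come first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([naive_stem(w) for w in ["indexing", "indexed", "indexes", "index"]])
```

All four forms map to the same stem, which is the normalization effect the slide describes.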

16
Weighter

The weighter computes the weight of each word.
TF-IDF and mutual information are both standard
metrics in the information processing community.
TF-IDF: IDF reflects how a word is distributed
over the whole corpus; TF is the frequency of the
word in a document.
Mutual information: indicates how much information
a word provides about a given document.
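
One common TF-IDF variant can be sketched as follows: TF is the relative frequency of a word in a document and IDF is the logarithm of (number of documents / number of documents containing the word). The slides do not say which variant was used, so this particular formula and the function name are assumptions. Mutual information is not shown here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

docs = [["profit", "report"], ["profit", "email"]]
print(tf_idf(docs))
```

A word that occurs in every document ("profit" above) gets weight 0, while a word specific to one document gets a positive weight.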

17
(2)Classifier
18
Problems

How to classify a document.
How to adapt the model automatically to
changing patterns.

19
Solutions

Rule-based classification
Machine learning based classification
Combining rules and machine learning.
Multi-strategy.

20
Method 1: rule-based

Rule-based algorithms use, for example, a
predefined keyword vocabulary, a specified
author, a date, etc.
A rule can be described as
Exist(keyword1 or keyword2) → cat1
if (author(email-2) = user1) then → cat2
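
The two example rules could be expressed in code roughly as below. This is a sketch under assumptions: the document is modelled as a dict with hypothetical "text" and "author" keys, and the second rule is read as "the author equals user1".

```python
def classify_by_rules(doc):
    """Apply the two example rules; doc is a dict with 'text' and 'author'."""
    categories = []
    text = doc["text"].lower()
    # Rule 1: either keyword is present in the text -> cat1
    if "keyword1" in text or "keyword2" in text:
        categories.append("cat1")
    # Rule 2: the author is user1 -> cat2
    if doc.get("author") == "user1":
        categories.append("cat2")
    return categories

print(classify_by_rules({"text": "This mentions keyword2.", "author": "user1"}))
```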

21
Advantages and disadvantages

Advantages
Simple and easy for people to understand.
Easy to implement.
Disadvantages
It is difficult to define the rules, especially
complex rules.
The patterns to be matched keep changing; for
example, static rules cannot cover the changing
patterns in advertising emails.

22
Method 2: Supervised machine learning

Machine learning can learn from existing
categorizations and automatically classify new
documents into these categories.
Machine learning algorithms include Bayes, SVM,
SOM, BPNN, etc.
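
As an illustration of the Bayes option, the sketch below is a small multinomial naive Bayes classifier with add-one smoothing, written in plain Python. The class name, the smoothing choice and the training data are assumptions made for the example, not details from the slides.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        """docs: list of token lists; labels: one category per document."""
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n / total_docs)  # log prior
            for w in tokens:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((self.word_counts[label][w] + 1)
                                  / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

model = NaiveBayes().fit(
    [["cheap", "offer", "buy"], ["meeting", "schedule", "manager"],
     ["buy", "discount"], ["install", "manual", "meeting"]],
    ["spam", "work", "spam", "work"],
)
print(model.predict(["buy", "offer"]))
```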

23
Advantages and disadvantages

Advantages
Adapts automatically to changes in content
patterns.
Learns from user feedback.
Extracts rules automatically.
Disadvantages
Needs a predefined document collection to train
the model.
The one-class problem.

24
Answer to the one-class problem

Fuzzy classification.
Sometimes a document is classified into one
category. At other times, it may need to be
categorized into multiple classes. This is a
fuzzy classification problem.
Therefore, the machine learning classifier
should be able to assign multiple classes.
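
One simple way to allow several categories per document is to keep every category whose score passes a threshold. The sketch below assumes the classifier already returns a score per category; the threshold value and the fallback to the single best category are illustrative choices.

```python
def assign_categories(scores, threshold=0.3):
    """scores: {category: score}. Return every category at or above threshold.

    If no category reaches the threshold, fall back to the best-scoring one.
    """
    chosen = [c for c, s in scores.items() if s >= threshold]
    if not chosen and scores:
        chosen = [max(scores, key=scores.get)]
    return sorted(chosen)

print(assign_categories({"sales": 0.5, "tech": 0.4, "spam": 0.1}))
```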

25
Method 3: Combining rule-based and machine learning

Combine the rule-based algorithm and the
supervised machine learning algorithm.
Use the machine learning algorithm to extract new
rules.
Use the new rules, the user-defined rules and the
model's predictions to classify new documents.

26
Multi-strategy learning

A single learning algorithm cannot always produce
good output (across various applications).
Combining multiple learning algorithms and
classifying according to their vote is more
trustworthy.
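
Voting among several classifiers can be sketched as a plain majority vote. Here each classifier is assumed to be a callable that maps a document to a label; the three toy classifiers are made up for the example.

```python
from collections import Counter

def vote(classifiers, document):
    """Return the label predicted by the most classifiers."""
    predictions = [clf(document) for clf in classifiers]
    label, _count = Counter(predictions).most_common(1)[0]
    return label

# Three hypothetical classifiers standing in for, e.g., Bayes, SVM and rules.
classifiers = [
    lambda doc: "spam" if "offer" in doc else "work",
    lambda doc: "spam" if "buy" in doc else "work",
    lambda doc: "work",
]
print(vote(classifiers, "buy this special offer"))
```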

27
(3)Rule Editor and Parser
28
Rule Editor and Parser

The rule editor helps users define the
classification rules.
The rule parser parses a rule and generates an FSM
(finite state machine) or another structure.
They are general components, also useful for
other applications.

29
Rule Definition

Rule definition is based on JAPE, just as in
TIPSI.
An example

30
Rule example

Rule: Agent
(
 (
  ({Token.string == "Authorization"})
  ({Token.string == "delegate"})
  ({Token.string == })
  (
   ({SpaceToken})
   ({Token.kind == space})
  )?
 ):agent
 ({Token.kind == BR})
)
-->
:agent.Agent = {kind = "agent", rule = "Agent"}
This rule indicates that the value of agent is
located between the word sequence "Authorization
delegate" and a line break (ENTER).
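
The same extraction idea can be approximated outside JAPE with a regular expression. The sketch below is not JAPE and not the TIPSI rule engine; the optional colon after "delegate" and the sample text are assumptions.

```python
import re

# Capture everything between "Authorization delegate" and the end of the line.
AGENT_PATTERN = re.compile(
    r"Authorization[ \t]+delegate[ \t]*:?[ \t]*(?P<agent>[^\r\n]+)"
)

def extract_agent(text):
    """Return the agent value, or None when the pattern is not found."""
    match = AGENT_PATTERN.search(text)
    return match.group("agent").strip() if match else None

print(extract_agent("Authorization delegate: Zhang San\nDate: 2004-03-01"))
```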

31
(4)Evaluation Tool
32
Evaluation Tool

The evaluation tool has two uses.
Evaluate the precision and recall of the
classifier.
Return the user's feedback to the classifier, so
that the classifier can refine the model.
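
For reference, precision and recall for one category can be computed as in the sketch below, using the usual definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). The function name and the sample labels are illustrative.

```python
def precision_recall(predicted, actual, category):
    """Precision and recall of one category over paired label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == category and a == category for p, a in pairs)
    fp = sum(p == category and a != category for p, a in pairs)
    fn = sum(p != category and a == category for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

predicted = ["spam", "spam", "work", "spam"]
actual = ["spam", "work", "work", "spam"]
print(precision_recall(predicted, actual, "spam"))
```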

33
(5)Efficient Indexing and Querying for
Large-Scale Logic View and Semantic View
34
Indexing Engine

As the number of logic views and semantic views
produced by TIPSI grows, accessing them
efficiently will become a big problem.
Therefore, a powerful indexing engine and an
optimized querying algorithm are necessary.
Traditional indexing technology cannot be ported
to the semantic view and logic view directly.
Content (plain text and original formats).
Semantic information
Logic information

35
Indexing strategy: IR-based

A strategy based on IR (information retrieval)
Two-level mapping
The first level stores all the paths under the
XML data model, i.e. APEX.
The second level constructs the inverted table of
all the indexed paths in the first level.

36
APEX: a popular graph-based data model

The data model APEX
Definition: XML data can be represented by a
directed labeled edge graph $G_{xml} = (V, E, R)$,
where $V = V_c \cup V_a$ and
$E \subseteq V_c \times A \times V$ is the set of
labeled edges.
All the nodes in V can be reached from R. Each
node in V is assigned an identifier id that
represents its order in the XML document.

37
Construct Index on the basis of data model(1)

Construct the index on the basis of the data
model: construct an XML data graph Gindex.
A Gindex for source XML data s is XML data d such
that for every unique label path in s there
exists exactly one data path instance in d,
every label path in d is a label path in s, and
every node o in d has only one incoming label.
In short, a Gindex is a Gxml with the repeated
paths removed.

38
Construct Index on the basis of data model(2)

Construct the path index from the Gindex.
This step records every unique path that appears
in the Gindex.
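
Recording every unique label path can be sketched with the standard library XML parser. This is a simplification under stated assumptions: the XML is treated as a tree (no attributes or ID references), a path is a tuple of element names, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

def unique_label_paths(xml_text):
    """Return each distinct root-to-node label path once, in document order."""
    root = ET.fromstring(xml_text)
    seen, paths = set(), []

    def walk(node, prefix):
        path = prefix + (node.tag,)
        if path not in seen:  # repeated paths are recorded only once
            seen.add(path)
            paths.append(path)
        for child in node:
            walk(child, path)

    walk(root, ())
    return paths

SAMPLE = ("<corp><name>A</name>"
          "<branch><name>B</name></branch>"
          "<branch><name>C</name></branch></corp>")
for path in unique_label_paths(SAMPLE):
    print("/".join(path))
```

The two branch elements share the same label paths, so each path appears only once in the result.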

39
Construct Index on the basis of data model(3)

Construct the inverted index on top of the path
index.
This step aims to improve the performance of
partial-matching queries.
It is an inverted table that maintains an
inverted scratch label path list for each label
in the source XML data.
Each entry in this inverted table records two
types of data for a label: the scratch label path
that contains the label, and the inverted position
of the label in the path expression.
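
A possible shape for such an inverted table is sketched below. It assumes that label paths are tuples of labels and interprets "inverted position" as the position of the label counted from the end of the path; both are assumptions, as are the function names.

```python
from collections import defaultdict

def build_inverted_table(paths):
    """Map each label to a list of (label path, position from the path end)."""
    table = defaultdict(list)
    for path in paths:
        for i, label in enumerate(path):
            table[label].append((path, len(path) - 1 - i))
    return table

def paths_ending_with(table, suffix):
    """Partial match: all label paths that end with the given label sequence."""
    suffix = tuple(suffix)
    candidates = table.get(suffix[-1], [])
    return [p for p, pos in candidates
            if pos == 0 and p[-len(suffix):] == suffix]

paths = [("corp",), ("corp", "name"), ("corp", "branch"),
         ("corp", "branch", "name")]
table = build_inverted_table(paths)
print(paths_ending_with(table, ("branch", "name")))
```

Looking up the last label of the query first means only paths that contain that label are examined, which is the point of the second index level for partial matches.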

40
Querying Algorithm

Query interface
XPath language (regular path expression)
Cache-based query processing. Caching the
queries that are submitted frequently accelerates
query processing, especially in a distributed
environment.
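
The caching idea can be sketched with an in-memory LRU cache in front of a path query. The example uses the limited XPath subset of Python's xml.etree.ElementTree and a made-up document; the cache size and function name are assumptions, and a distributed deployment would need a shared cache instead.

```python
from functools import lru_cache
import xml.etree.ElementTree as ET

DOCUMENT = ET.fromstring(
    "<corps>"
    "<corp><name>Company A</name><location>Beijing</location></corp>"
    "<corp><name>Company B</name><location>Shanghai</location></corp>"
    "</corps>"
)

@lru_cache(maxsize=128)
def run_query(path):
    """Evaluate a path expression; repeated queries are served from the cache."""
    return tuple(el.text for el in DOCUMENT.findall(path))
```

Calling `run_query(".//corp/name")` twice evaluates the path once; the second call is answered from the cache, which `run_query.cache_info()` makes visible.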

41
(6)Processing Engine
42
Document representation

After processing by TIPSI, documents are
represented as a logic view and a semantic view.
The logic view provides a friendly navigator to
users.
The semantic view can answer more meaningful
questions, such as: Where is the xxx corporation?
How many corporations are there in Shanghai?

43
Query Analysis, Matcher and Retrieval

Not simple keyword retrieval.
Analyze what the user really wants.
Efficiently match the user query to the semantic
view or logic view.
Locate the specified sections in the documents.
Retrieve all related documents or related
sections.

44
Information integration

Integrate all related information from the
retrieval component, instead of returning the
related documents.

45
For example

Question: how many corporations are there in
Beijing?
Retrieve all corporations with location = Beijing.
Count the returned corporations.
Other questions: list all corporations whose
profit in 2003 was greater than that in 2002. How
much is the total profit of all corporations in
Beijing?
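
Once the semantic view has been extracted into records, these questions reduce to simple filters and aggregates. The sketch below uses made-up records and hypothetical field names ("name", "location", "profit") purely to illustrate the three example questions.

```python
# Made-up sample records standing in for extracted semantic-view data.
corporations = [
    {"name": "Company A", "location": "Beijing", "profit": {2002: 1.0, 2003: 1.5}},
    {"name": "Company B", "location": "Shanghai", "profit": {2002: 2.0, 2003: 1.0}},
    {"name": "Company C", "location": "Beijing", "profit": {2002: 3.0, 2003: 2.5}},
]

def count_in(location):
    """How many corporations are located in the given city?"""
    return sum(c["location"] == location for c in corporations)

def profit_grew(year, previous_year):
    """Names of corporations whose profit in year exceeds previous_year."""
    return [c["name"] for c in corporations
            if c["profit"][year] > c["profit"][previous_year]]

def total_profit(location, year):
    """Total profit in one year of all corporations in the given city."""
    return sum(c["profit"][year] for c in corporations
               if c["location"] == location)

print(count_in("Beijing"), profit_grew(2003, 2002), total_profit("Beijing", 2003))
```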

46
Information integration
Company A Company B Company C
47
Redundancy removal

There may be a lot of redundancy in the
documents returned by retrieval.
For example, for the question of how to contact
KEG at Tsinghua University, only one correct set
of contact information should be shown.
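
The simplest form of redundancy removal is to drop results that are identical after normalization. The sketch below only handles that case (case and whitespace differences); detecting near-duplicates or choosing the "correct" version would need more than this. The sample strings are made up.

```python
def remove_redundant(snippets):
    """Keep the first occurrence of each snippet, ignoring case and spacing."""
    seen, unique = set(), []
    for snippet in snippets:
        key = " ".join(snippet.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(snippet)
    return unique

results = ["Email: keg@example.org", "email:  KEG@example.org", "Fax: 000"]
print(remove_redundant(results))
```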

48
XSL template, ontology and personal profile

The XSL template provides users a friendly and
easy-to-use navigator.
Based on the explicit definition of an ontology,
the system can provide more powerful functions.
The personal profile logs the user's actions on
the internet and provides user-oriented search.

49
(7)GUI Interface
50
Result Display

After posting a request, a user might retrieve
many related documents/sections. It is necessary
to reorganize this information.
Clustering is a good solution.
For example, when the user selects IT company,
the display component clusters the retrieved
documents and groups them (software, hardware,
etc.).
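
A very small clustering sketch is shown below: each document joins the first cluster whose first member is similar enough, measured by Jaccard similarity of token sets. The slides do not name a clustering algorithm, so this greedy single-pass method, the threshold and the sample documents are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering; returns lists of document indices."""
    clusters = []
    for i, doc in enumerate(docs):
        for members in clusters:
            # Compare against the first document of each existing cluster.
            if jaccard(doc, docs[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

docs = [
    ["software", "compiler", "tools"],
    ["hardware", "chip", "board"],
    ["software", "tools", "editor"],
    ["chip", "board", "memory"],
]
print(cluster(docs))
```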

51
(No Transcript)
52
Key technologies
53
Key technologies

Key technologies include
High-dimensional document representation (sparse
data) and efficient indexing and querying
algorithms.
Dynamic model and dynamic rules.
Information redundancy removal, matcher and
information integration.
Classifier refinement (with user feedback).

54
5.Two Cases
55
Case-1
How could I find the IT corporations whose profit
was more than 1 million last year?
Document Processing Engine
Too tired to organize
Database (Indexing)
Annual report
56
Case-1
How could I find the IT corporations whose profit
was more than 1 million last year?
So easy
Smart Document Processing Engine
Content analysis (logic extraction and semantic
extraction)
Database (Indexing)
Classification
Annual report
57
GUI Interface
58
(2)Email-dispatch
59
Email-dispatch
Dept of Tech
How much..
How to install..
Secretary
Intelligent Client Support Center
meeting with general Manager

Dept of Sales
advertisement
spam
60
Project plan

To be completed in five months by three full-time
staff.
Project schedule
2004.3-2004.4: preprocessing, rule editor, rule
parser
2004.4-2004.5: classifier, evaluation tool
2004.5-2004.6: indexing for the semantic view and
logic view
2004.6-2004.7: search processing engine and user
browser
2004.7-2004.8: integrate all components and
debug the whole framework

61
References

Fax Internet Technical Resources Electronic Mail
(Email) and Postal Mail.
http://www.cs.columbia.edu/hgs/internet/email.html
E-mail Classification in the Haystack Framework.
Mark Rosen. DEEC of MIT.

62
Thanks!
