Adaptive Document Management - PowerPoint PPT Presentation

Slides: 63
Provided by: kegCsTsi

1
Adaptive Document Management
  • Tsinghua-ITF KE Co-LAB
  • Tang Jie

2
Agenda
  • Background and current work
  • Aims
  • Framework
  • Design
  • Two Cases

3
1. Background
  • The world is becoming increasingly
    information-centric.
  • Huge volume
  • Traditional technologies are gradually becoming
    less and less effective.
  • Document content is also exploding

4
TIPSI
  • Work completed so far
  • Design of the TIPSI framework
  • Development of the metadata editor in TIPSI
  • Development of a universal parsing interface
    for XML Schema
  • Implementation of the algorithm for constructing
    the logic view
  • Partial implementation of the algorithm for
    constructing the semantic view

5
Example of semantic view
6
Problems
  • The intelligent processing engine is not efficient
    enough. Currently, the navigator does not provide a
    powerful way to discover information, e.g. partial
    match and fuzzy search.
  • Information redundancy
  • Content organization of large volumes is not
    efficient enough in the three-layer view model.
  • How to further improve the semantic view based on
    the logic view, and how to validate the logic view
    with the semantic view

7
2. Aims
  • Organize documents in a catalog automatically.
  • Provide an efficient way for users to browse
    related documents on a single screen instead of a
    list of all related documents.
  • Design an efficient indexing and search engine.
  • Develop a practical information extraction (IE)
    component to efficiently extract the key
    information.

8
3. Framework
  • Components include preprocessing, classifier,
    rule editor, evaluation tool, GUI and
    processing engine.
  • The framework is shown in the following figure.

9
(No Transcript)
10
(1) Preprocessing
11
Preprocessing
  • Word segmentation
  • Stopword removal
  • Word similarity
  • Stemmer
  • Weighter

12
Word segmentation
  • Word segmentation is also called tokenization.
  • It is easy for English documents: spaces are a
    natural delimiter.
  • It is difficult for Chinese or Japanese documents.
    Rely on available tools.
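
For English text, the space-delimited segmentation described above can be sketched as follows. This is only an illustration: the function name `tokenize` and the regular expression are assumptions, and Chinese or Japanese text would need a dedicated segmentation tool instead.

```python
import re

def tokenize(text):
    """Split English text into lowercase word tokens (hypothetical helper)."""
    # Runs of letters, digits and apostrophes form one token;
    # whitespace and punctuation act as delimiters.
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Document content is also exploding."))
```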

13
Stop word removal
  • Remove common words that are known to contribute
    little to the semantic meaning of a document.
  • For English documents, construct a list
    containing all stop words.
  • For Chinese and Japanese documents, remove all
    function words ("empty words").
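
A minimal sketch of list-based stop word removal for English. The `STOP_WORDS` set is a tiny illustrative subset, not a complete list, and `remove_stop_words` is a hypothetical name.

```python
# Illustrative subset only; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["The", "volume", "of", "documents", "is", "huge"]))
```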

14
Word Similarity
  • Word similarity is very useful in information
    processing.
  • Use a thesaurus to compute the similarity.
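
The slides do not say how similarity is derived from the thesaurus. One possible sketch, assuming the thesaurus maps each word to a set of synonyms and that similarity is the Jaccard overlap of those sets, is shown below. The `THESAURUS` entries and the function name are hypothetical.

```python
# Hypothetical thesaurus: word -> set of synonyms.
THESAURUS = {
    "company": {"corporation", "firm", "enterprise"},
    "corporation": {"company", "firm"},
    "profit": {"earnings", "gain"},
}

def similarity(w1, w2, thesaurus=THESAURUS):
    """Jaccard overlap of the two words' synonym sets (each set includes the word itself)."""
    if w1 == w2:
        return 1.0
    s1 = thesaurus.get(w1, set()) | {w1}
    s2 = thesaurus.get(w2, set()) | {w2}
    return len(s1 & s2) / len(s1 | s2)

print(similarity("company", "corporation"))
print(similarity("company", "profit"))
```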

15
Stemmer
  • A stemmer normalizes semantically similar words by
    removing suffixes. It is useful for English
    documents.
  • Do not use it for Chinese documents.
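
To illustrate suffix removal, here is a deliberately naive stemmer. It is a sketch, not the Porter algorithm or any stemmer the project necessarily used; the suffix list and the minimum stem length of three characters are assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["documents", "indexing", "indexed", "is"]])
```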

16
Weighter
  • The weighter computes the weight of each word.
  • TF-IDF and mutual information are both standard
    metrics in the information processing community.
  • TF-IDF: IDF reflects the distribution of a word
    over the whole corpus; TF is the frequency of the
    word in the document.
  • Mutual information: indicates how much information
    a word provides about a given document.
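
A minimal sketch of TF-IDF weighting, assuming the common formulation weight = TF × log(N / DF), where N is the number of documents and DF is the number of documents containing the word. The slides do not say which variant is intended, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {w: (c / len(doc)) * math.log(n_docs / df[w]) for w, c in tf.items()}
        )
    return weights

docs = [["annual", "report", "profit"], ["annual", "meeting"]]
print(tf_idf(docs))
```

A word that occurs in every document ("annual" here) gets weight 0, which is the intended effect of the IDF factor.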

17
(2) Classifier
18
Problems
  • How to classify a document.
  • How to adapt the model automatically to
    changing patterns.

19
Solutions
  • Rule-based classification
  • Machine learning based classification
  • Combining rules and machine learning.
  • Multi-strategy.

20
Method 1: rule-based
  • Rule-based algorithms use criteria such as a
    predefined keyword vocabulary, a specified author,
    a date, etc.
  • A rule can be described as
  • Exist(keyword1 or keyword2) → cat1
  • if (author(email-2) = user1) then → cat2
21
Advantages and disadvantages
  • Advantages
  • Simple and easy for people to understand.
  • Easy to implement.
  • Disadvantages
  • It is difficult to define the rules, especially
    complex rules.
  • Rules need to change over time; for example,
    static rules cannot cover the changing patterns
    in advertising emails.

22
Method 2: supervised machine learning
  • Machine learning can learn from existing
    categorizations and automatically classify new
    documents into these categories.
  • Machine learning algorithms include Bayes, SVM,
    SOM, BPNN, etc.
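
Of the algorithms listed, a Bayes classifier is the simplest to sketch. Below is a minimal multinomial naive Bayes with add-one smoothing, written from scratch for illustration. The class name `NaiveBayes`, the labels and the training data are assumptions, not part of the original project.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        """docs: list of token lists; labels: one category per document."""
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, tokens):
        """Log prior plus smoothed log likelihood for each category."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n / total_docs)
            for w in tokens:
                score += math.log((counts[w] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

model = NaiveBayes().fit(
    [["cheap", "offer", "buy"], ["meeting", "schedule", "manager"],
     ["buy", "now", "discount"], ["install", "software", "help"]],
    ["spam", "work", "spam", "work"],
)
print(model.predict(["buy", "cheap", "discount"]))
```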

23
Advantages and disadvantages
  • Advantages
  • Adapts automatically to changes in content
    patterns.
  • Learns from user feedback.
  • Extracts rules automatically.
  • Disadvantages
  • Needs a predefined document collection to train
    the model.
  • The one-class problem.

24
Answer to the one-class problem
  • Fuzzy classification.
  • Sometimes a document should be classified into one
    category. At other times, it may need to be
    categorized into multiple classes. This is a
    problem of fuzzy classification.
  • Therefore, the machine learning component should
    provide multi-classification capability.
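
One way to allow a document to fall into several categories is to turn classifier scores into membership degrees and keep every category above a threshold. The sketch below assumes log-scores as input (for example from a Bayes classifier), a softmax normalization and a threshold of 0.3; all three are illustrative choices, not the method stated on the slides.

```python
import math

def assign_categories(log_scores, threshold=0.3):
    """Return all categories whose normalized membership is at least `threshold`."""
    top = max(log_scores.values())
    # Softmax: subtract the maximum first for numerical stability.
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exp.values())
    membership = {c: v / total for c, v in exp.items()}
    return sorted(c for c, m in membership.items() if m >= threshold)

scores = {"tech": math.log(0.5), "sales": math.log(0.4), "spam": math.log(0.1)}
print(assign_categories(scores))
```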

25
Method 3: combining rule-based and machine learning
  • Combine the rule-based algorithm and the supervised
    machine learning algorithm.
  • Use the machine learning algorithm to extract new
    rules.
  • Use the new rules, the user-defined rules and the
    model prediction to classify new documents.

26
Multi-strategy learning
  • A single learning algorithm cannot always produce
    good output across various applications.
  • Combining multiple learning algorithms and
    classifying according to their vote is more
    trustworthy.
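
A minimal sketch of classification by vote, assuming each classifier is a function that maps a document to one label and that the majority label wins. The three toy classifiers are hypothetical stand-ins for real learners.

```python
from collections import Counter

def vote(classifiers, doc):
    """Return the label predicted by the most classifiers (first seen wins ties)."""
    predictions = [clf(doc) for clf in classifiers]
    label, _ = Counter(predictions).most_common(1)[0]
    return label

# Toy stand-ins for independently trained classifiers.
by_keyword = lambda d: "spam" if "discount" in d else "work"
by_length = lambda d: "spam" if len(d) < 20 else "work"
always_work = lambda d: "work"

print(vote([by_keyword, by_length, always_work], "big discount today"))
```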

27
(3) Rule Editor and Parser
28
Rule Editor and Parser
  • The rule editor helps users define the
    classification rules.
  • The rule parser parses a rule and generates an FSM
    (finite state machine) or another structure.
  • They are general components, also useful for
    other applications.

29
Rule Definition
  • Rule definition is based on JAPE, just as in
    TIPSI.
  • An example

30
Rule example
    Rule: Agent
    (
      (
        (Token.string == "Authorization")
        (Token.string == "delegate")
        (Token.string == )
        (
          (SpaceToken)
          (Token.kind == space)
        )?
      ):agent
      (Token.kind == BR)
    )
    -->
    :agent.Agent = {kind = "agent", rule = "Agent"}
  • This rule indicates that the value of agent is
    located between the word sequence "Authorization
    delegate" and a line break (ENTER).
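
The behavior described for this rule can be sketched as a small hand-written finite state machine over tokens. This is not JAPE and not the parser's generated FSM; the state names, the tokenizer regex and the function name `extract_agent` are assumptions, and the third token pattern of the rule is skipped because its value is not shown in the transcript.

```python
import re

def extract_agent(text):
    """Return the text between 'Authorization delegate' and the next line break, or None."""
    # Tokens are line breaks or runs of non-whitespace characters.
    tokens = re.findall(r"\n|\S+", text)
    state, value = "START", []
    for tok in tokens:
        if state == "START":
            if tok == "Authorization":
                state = "AUTH"
        elif state == "AUTH":
            if tok == "delegate":
                state = "COLLECT"
            elif tok != "Authorization":
                state = "START"
        elif state == "COLLECT":
            if tok == "\n":  # the line break ends the value
                return " ".join(value)
            value.append(tok)
    return None

print(extract_agent("Authorization delegate Zhang San\nDate 2004"))
```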

31
(4) Evaluation Tool
32
Evaluation Tool
  • The evaluation tool has two uses.
  • Evaluate the precision and recall of the
    classifier.
  • Return the users' feedback to the classifier, so
    that the classifier can refine the model.
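
Precision and recall for one category can be computed from predicted and actual labels as sketched below. The function name and the sample labels are illustrative.

```python
def precision_recall(predicted, actual, category):
    """Precision and recall of `category`, given parallel label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == category and a == category for p, a in pairs)
    fp = sum(p == category and a != category for p, a in pairs)
    fn = sum(p != category and a == category for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

predicted = ["spam", "spam", "work", "spam"]
actual = ["spam", "work", "work", "spam"]
print(precision_recall(predicted, actual, "spam"))
```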

33
(5) Efficient Indexing and Querying for
Large-Scale Logic View and Semantic View
34
Indexing Engine
  • As the logic views and semantic views produced by
    TIPSI grow, accessing them efficiently will
    become a big problem.
  • Therefore, a powerful indexing engine and an
    optimized querying algorithm are necessary.
  • Traditional indexing technology cannot be ported
    directly to the semantic view and logic view.
  • Content (plain text and original formats).
  • Semantic information
  • Logic information

35
The IR-based indexing strategy
  • A strategy based on IR
  • Two-level mapping
  • The first level stores all the paths under the
    XML data model, i.e. APEX.
  • The second level constructs the inverted table of
    all the indexed paths in the first level.

36
APEX: a popular graph-based data model
  • The data model APEX
  • Definition: XML data can be represented by a
    directed labeled edge graph Gxml, with
    $G_{xml} = (V, E, R)$ and $V = V_c \cup V_a$.
  • $E \subseteq V_c \times A \times V$ is the set of
    labeled edges.
  • All the nodes in V can be reached from R. Each node
    in V is assigned an identifier id that
    represents its order in the XML document.

37
Construct Index on the basis of data model(1)
  • Construct Index on the basis of data model
  • Construct an XML data Gindex.
  • A Gindex for source XML data s is an XML
    data d such that for every unique label path in
    s there exists exactly one data path instance in
    d, every label path in d is a label path in s, and
    every node o in d has only one incoming label.
  • In short, a Gindex is a Gxml with the repeated
    paths removed.

38
Construct Index on the basis of data model(2)
  • Construct the path index from Gindex.
  • This step records every unique path that
    appears in Gindex.
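
A minimal sketch of collecting every unique label path from an XML document, using Python's standard `xml.etree.ElementTree`. It only illustrates the idea of removing repeated paths; it does not build a Gindex graph, and the function name and sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

def unique_label_paths(xml_text):
    """Return each distinct root-to-node label path once, in first-seen order."""
    root = ET.fromstring(xml_text)
    seen, paths = set(), []

    def walk(node, prefix):
        path = prefix + (node.tag,)
        if path not in seen:  # repeated paths are recorded only once
            seen.add(path)
            paths.append(path)
        for child in node:
            walk(child, path)

    walk(root, ())
    return ["/" + "/".join(p) for p in paths]

xml_text = (
    "<corps><corp><name>A</name><location>Beijing</location></corp>"
    "<corp><name>B</name></corp></corps>"
)
print(unique_label_paths(xml_text))
```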

39
Construct Index on the basis of data model(3)
  • Construct the inverted index on top of the path
    index.
  • This step aims to improve the performance of
    partial-matching queries.
  • This is an inverted table which maintains an
    inverted scratch label path list for each label
    in the source XML data.
  • Each entry in this inverted table records two
    types of data: the label, and the scratch label
    path that contains the label together with the
    inverted position of the label in the path
    expression.
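
The inverted table can be sketched as a map from each label to the paths containing it, together with the label's position counted from the end of the path. Reading "inverted position" as "position from the end" is an assumption, as are the function names; the second function shows how such a table supports a partial-match query on the tail of a path.

```python
from collections import defaultdict

def build_inverted_table(label_paths):
    """Map label -> list of (path, position of the label counted from the path's end)."""
    table = defaultdict(list)
    for path in label_paths:
        labels = path.strip("/").split("/")
        for pos_from_end, label in enumerate(reversed(labels)):
            table[label].append((path, pos_from_end))
    return dict(table)

def paths_ending_with(table, *labels):
    """Paths whose last labels equal `labels`, e.g. ('corp', 'name')."""
    result = None
    for pos_from_end, label in enumerate(reversed(labels)):
        hits = {p for p, pos in table.get(label, []) if pos == pos_from_end}
        result = hits if result is None else result & hits
    return sorted(result or [])

table = build_inverted_table(
    ["/corps", "/corps/corp", "/corps/corp/name", "/corps/dept/name"]
)
print(paths_ending_with(table, "corp", "name"))
```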

40
Querying Algorithm
  • Query interface
  • XPath language (regular path expression)
  • Cache-based query processing. Keeping frequently
    submitted queries in a cache accelerates query
    processing, especially in a distributed
    environment.
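
A minimal sketch of caching frequent queries, using the standard `functools.lru_cache` and the limited XPath subset supported by `xml.etree.ElementTree`. The in-memory document `DOC`, the function name `query` and the cache size are assumptions for illustration; a distributed setup would need a shared cache instead.

```python
import xml.etree.ElementTree as ET
from functools import lru_cache

# Hypothetical in-memory document standing in for the indexed views.
DOC = ET.fromstring(
    "<corps><corp><name>A</name></corp><corp><name>B</name></corp></corps>"
)

@lru_cache(maxsize=128)
def query(path):
    """Evaluate a path expression; repeated queries are served from the cache."""
    return tuple(el.text for el in DOC.findall(path))

print(query("./corp/name"))
print(query("./corp/name"))  # second call is a cache hit
print(query.cache_info())
```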

41
(6) Processing Engine
42
Document representation
  • After processing by TIPSI, documents are
    represented as a logic view and a semantic view.
  • The logic view provides a friendly navigator to
    users.
  • The semantic view can answer more meaningful
    questions, such as: Where is the xxx corporation?
    How many corporations are in Shanghai?

43
Query Analysis, Matcher and Retrieval
  • Not simple keyword retrieval.
  • Analyze what the user really wants.
  • Efficiently match the user query to the semantic
    view or logic view.
  • Locate the specified sections in the
    documents.
  • Retrieve all related documents or related
    sections.

44
Information integration
  • Integrate all related information from the
    retrieval component, instead of returning
    related documents.

45
For example
  • Question: how many corporations are in Beijing?
  • Retrieve all corporations with location=Beijing.
  • Count the returned corporations.
  • Other questions: list all corporations whose
    profit in 2003 was greater than that in 2002. How
    much is the total profit of all corporations in
    Beijing?
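
Once information has been extracted into structured records, the example questions reduce to simple filters and aggregates. The records below are invented purely for illustration, and taking the 2003 figure for "total profit" is an assumption.

```python
# Hypothetical extracted records; profits are made-up integers.
CORPS = [
    {"name": "A", "location": "Beijing", "profit": {2002: 10, 2003: 15}},
    {"name": "B", "location": "Shanghai", "profit": {2002: 20, 2003: 18}},
    {"name": "C", "location": "Beijing", "profit": {2002: 5, 2003: 7}},
]

# How many corporations are in Beijing?
in_beijing = [c for c in CORPS if c["location"] == "Beijing"]
count = len(in_beijing)

# Which corporations had a greater profit in 2003 than in 2002?
growing = [c["name"] for c in CORPS if c["profit"][2003] > c["profit"][2002]]

# Total 2003 profit of all corporations in Beijing.
total = sum(c["profit"][2003] for c in in_beijing)

print(count, growing, total)
```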

46
Information integration
Company A Company B Company C
47
Redundancy removal
  • There might be a lot of redundancy in the
    documents returned by retrieval.
  • For example: how to contact KEG at Tsinghua
    University.
  • Only one correct set of contact information
    should be shown.
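
The simplest form of redundancy removal is dropping results that are identical after normalizing case and whitespace, as sketched below. This handles only exact repeats; choosing the one correct answer among conflicting results is a harder problem that this sketch does not address. The function name and sample strings are hypothetical.

```python
def remove_redundancy(snippets):
    """Keep the first occurrence of each snippet, comparing case- and whitespace-insensitively."""
    seen, unique = set(), []
    for s in snippets:
        key = " ".join(s.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

results = [
    "KEG, Tsinghua University",
    "keg,  tsinghua university",
    "Dept of Sales",
]
print(remove_redundancy(results))
```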

48
XSL template, ontology and personal profile
  • XSL templates provide users with a friendly and
    easy-to-use navigator.
  • Based on the explicit definition of an ontology,
    the system could provide more powerful functions.
  • The personal profile logs the user's actions on
    the internet and provides user-oriented search.

49
(7) GUI Interface
50
Result Display
  • After posting a request, a user might retrieve
    many related documents/sections. It is necessary
    to reorganize this information.
  • Clustering is a good solution.
  • For example, when the user selects "IT company",
    the display component clusters the retrieved
    documents and groups them (software, hardware,
    etc.).
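
The slides do not name a clustering algorithm. As one possible sketch, the single-pass method below assigns each document to the most similar existing cluster by Jaccard similarity of term sets, or starts a new cluster when nothing is similar enough. The threshold of 0.25, the function names and the sample documents are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(docs, threshold=0.25):
    """Single-pass clustering. docs: {doc_id: set of terms}. Returns lists of doc ids."""
    clusters = []  # each cluster: {"terms": set, "members": list}
    for doc_id, terms in docs.items():
        best = max(clusters, key=lambda c: jaccard(terms, c["terms"]), default=None)
        if best is not None and jaccard(terms, best["terms"]) >= threshold:
            best["members"].append(doc_id)
            best["terms"] |= terms  # the cluster's term set grows with its members
        else:
            clusters.append({"terms": set(terms), "members": [doc_id]})
    return [c["members"] for c in clusters]

docs = {
    "d1": {"software", "database", "license"},
    "d2": {"hardware", "chip", "server"},
    "d3": {"software", "license", "update"},
    "d4": {"chip", "server", "board"},
}
print(cluster(docs))
```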

51
(No Transcript)
52
Key technologies
53
Key technologies
  • Key technologies include
  • High-dimensional document representation (data
    sparsity), and efficient indexing and querying
    algorithms.
  • Dynamic model and dynamic rules.
  • Information redundancy removal, matcher and
    information integration.
  • Classifier refinement (with user feedback).

54
5. Two Cases
55
Case-1
How could I find the IT corporations whose profit
was more than 1 million last year?
Document Processing Engine
Too tired to organize
Database (Indexing)
Annual report
56
Case-1
How could I find the IT corporations whose profit
was more than 1 million last year?
So easy
Smart Document Processing Engine
Content analysis (logic extract Semantic extract)
Database (Indexing)
Classification
Annual report
57
GUI Interface
58
(2) Email-dispatch
59
Email-dispatch
Dept of Tech
How much..
How to install..
Secretary
Intelligent Client Support Center
meeting with general Manager

Dept of Sales
advertisement
spam
60
Project plan
  • To be completed in five months by three full-time
    staff.
  • Project schedule
  • 2004.3 - 2004.4: preprocessing, rule editor, rule
    parser
  • 2004.4 - 2004.5: classifier, evaluation tool
  • 2004.5 - 2004.6: indexing for semantic view and
    logic view
  • 2004.6 - 2004.7: search processing engine and user
    browser
  • 2004.7 - 2004.8: integrate all components and
    debug the whole framework

61
Reference
  • Fax Internet Technical Resources Electronic Mail
    (Email) and Postal Mail.
    http://www.cs.columbia.edu/hgs/internet/email.html.
  • E-mail Classification in the Haystack Framework.
    Mark Rosen. DEEC of MIT.

62
Thanks!