WIT 2005 Data Mining

Slides: 16
Provided by: guydesj

Transcript and Presenter's Notes

Title: WIT 2005 Data Mining


1
WIT 2005 Data Mining
  • A Genetic Algorithm for Text Mining

Guy Desjardins intellagent_at_vif.com
Robert Godin godin.robert_at_uqam.ca
Robert Proulx proulx.robert_at_uqam.ca
2
Agenda
  • Text Mining and Information Retrieval
  • Challenge
  • Genetic Model
  • Experiment and Results
  • Conclusion - Future

3
Agenda
  • Text Mining and Information Retrieval
  • Challenge
  • Genetic Model: Algorithm, Operators, Fitness
  • Experiment and Results
  • Conclusion - Future

4
Text Mining and Information Retrieval
  • Common representation
  • Indexing documents by corpus terms

5
Text Mining and Information Retrieval
  • Text Mining
  • Classification of texts
  • Within cluster similarity
  • Information Retrieval
  • Query-document matching
  • Similarity between queries and documents
  • Similarity(Di, Dj) = cosine(di, dj)
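
The cosine measure named on this slide can be sketched as follows. This is a minimal illustration, assuming each document (or query) is stored as a sparse mapping from term to weight; the function name `cosine` and the sample vectors are hypothetical and do not come from the presentation.

```python
import math


def cosine(d_i, d_j):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    # Dot product over the terms of d_i; terms missing from d_j contribute 0.
    dot = sum(w * d_j.get(t, 0.0) for t, w in d_i.items())
    norm_i = math.sqrt(sum(w * w for w in d_i.values()))
    norm_j = math.sqrt(sum(w * w for w in d_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_i * norm_j)


# Hypothetical term weights, for illustration only.
doc_a = {"genetic": 2.0, "algorithm": 1.0}
doc_b = {"genetic": 1.0, "mining": 3.0}
sim_ab = cosine(doc_a, doc_b)
```

The same function serves both uses listed on the slide: document-to-document similarity within a cluster, and query-to-document matching when the query is weighted like a short document.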

6
Text Mining and Information Retrieval
  • tf × idf weighting scheme
  • tf: term frequencies
  • idf: inverse document frequencies
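
A small sketch of the weighting scheme follows. The slide does not say which tf and idf variants were used, so this assumes the common textbook form: raw term counts for tf and log(N / df) for idf. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf * idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted


# Toy corpus, for illustration only.
corpus = [
    ["text", "mining", "text"],
    ["text", "retrieval"],
    ["genetic", "algorithm", "text"],
]
weights = tf_idf(corpus)
```

With this variant, a term that appears in every document (here "text") gets weight zero, while a term confined to one document keeps a positive weight.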

7
Challenge
  • Text Mining
  • Improve the within cluster similarities
  • Information Retrieval
  • Improve retrieval and stability
  • Stability and generalization difficult to reach

8
Challenge
  • Approach
  • Add co-occurrent terms to text representations
  • Problem
  • Combinatorial explosion (10^26)
  • Solution
  • Genetic algorithm
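
The scale of the search space can be checked with a short calculation. The slides do not state how the figure of 10^26 was obtained; the sketch below assumes it counts the unordered sets of 2 to 6 terms that can be drawn from the 72 983 index terms reported on the Experiment slide.

```python
import math

n_terms = 72_983  # index terms reported on the Experiment slide

# Number of distinct unordered term sets of each size from 2 to 6.
counts = {k: math.comb(n_terms, k) for k in range(2, 7)}
total = sum(counts.values())

# Order of magnitude: number of decimal digits minus one.
magnitude = len(str(total)) - 1
print(f"candidate term sets: about 10^{magnitude}")
```

Under that assumption the total is dominated by the 6-term sets and lands in the 10^26 range, which is why exhaustive enumeration is not an option and a heuristic search is used instead.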

9
Genetic Model
  • Cycle
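
The transcript keeps only the title of this slide, so the authors' exact cycle is not recoverable. As a point of reference, a generic genetic-algorithm cycle (evaluate, select, recombine, mutate, repeat) might look like the sketch below. Everything in it is an assumption made for illustration: the bit-string chromosomes, the toy fitness (count of ones), truncation selection, elitism, and the name `run_ga`.

```python
import random


def run_ga(fitness, population, generations, rng, mutation_rate=0.05):
    """Generic GA loop over bit-string chromosomes (illustrative only)."""
    for _ in range(generations):
        # Evaluate and rank the current population.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: len(ranked) // 2]  # selection: keep the fitter half
        children = [ranked[0][:]]             # elitism: carry the best over
        while len(children) < len(population):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))    # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(len(child)):       # bit-flip mutation
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            children.append(child)
        population = children
    return max(population, key=fitness)


rng = random.Random(0)
start = [[rng.randint(0, 1) for _ in range(16)] for _ in range(20)]
initial_best = max(sum(c) for c in start)
best = run_ga(sum, start, generations=30, rng=rng)
```

Because the best chromosome is copied unchanged into each new generation, the best fitness found can never decrease from one cycle to the next.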

10
Genetic Model
  • Operators
  • Crossover: exchanges portions of chromosomes
  • Mutation: modification of a random gene
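
The two operators can be illustrated as below. The slides do not describe the chromosome encoding, so this sketch assumes a chromosome is a list of terms (one gene per term), uses one-point crossover, and mutates by replacing one gene with a random vocabulary term. The names `crossover`, `mutate`, and the sample vocabulary are hypothetical.

```python
import random


def crossover(parent_a, parent_b, rng):
    """One-point crossover: swap the tails of two chromosomes at a random cut."""
    # Requires both parents to hold at least two genes.
    cut = rng.randrange(1, min(len(parent_a), len(parent_b)))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


def mutate(chromosome, vocabulary, rng):
    """Return a copy with one randomly chosen gene replaced by a random term."""
    mutated = list(chromosome)
    mutated[rng.randrange(len(mutated))] = rng.choice(vocabulary)
    return mutated


rng = random.Random(42)
vocabulary = ["text", "mining", "genetic", "retrieval", "query", "cluster"]
a = ["text", "mining", "genetic"]
b = ["retrieval", "query", "cluster", "text"]
child_1, child_2 = crossover(a, b, rng)
mutant = mutate(a, vocabulary, rng)
```

Crossover only rearranges genes already present in the two parents, whereas mutation is the operator that can introduce a term neither parent contained.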

11
Genetic Model
  • Fitness: weights of co-occurrent term sets

12
Experiment
  • Collection: TREC ZF109
  • 22 709 documents
  • 19 queries
  • 72 983 index terms
  • GA parameters
  • Pinit: 1 000 sets of 2-6 co-occurrent terms
  • Hyper-mutation rate: 50
  • Averages
  • 6 documents / index term
  • 20 terms / document
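
The two averages can be cross-checked against the collection sizes, since each of them estimates the same quantity: the number of (term, document) pairs in the index. The sketch below assumes "20 terms / document" refers to distinct index terms per document; the variable names are illustrative.

```python
n_documents = 22_709
n_index_terms = 72_983

# Each average, multiplied by its count, estimates the number of
# (term, document) pairs in the index.
pairs_from_terms = n_index_terms * 6   # 6 documents per index term
pairs_from_docs = n_documents * 20     # 20 terms per document

ratio = pairs_from_docs / pairs_from_terms
print(pairs_from_terms, pairs_from_docs, round(ratio, 3))
```

The two estimates agree to within a few percent, which is consistent with both averages being rounded to whole numbers.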

13
Results
  • Text mining

14
Results
  • Information retrieval

15
Conclusion - Future
  • Genetic algorithm focused on major sub-domains
  • Would improve one cluster in text mining
  • Not enough to improve retrieval on large collections
  • Future work
  • Apply Genetic algorithm recursively
  • Explore other Fitness functions
  • Use the correlated terms for query enhancements