Title: Mining Sequential Patterns
1 Course on Data Mining (581550-4): Seminar Meetings
[Figure: course overview diagram with the topics Ass. Rules, Clustering, Episodes, KDD Process, Text Mining, and Home Exam]
2 Course on Data Mining (581550-4): Seminar Meetings
Today 16.11.2001
- R. Feldman, M. Fresko, H. Hirsh, et al. "Knowledge Management: A Text Mining Approach", Proc. of the 2nd Int'l Conf. on Practical Aspects of Knowledge Management (PAKM98), 1998.
- B. Lent, R. Agrawal, R. Srikant "Discovering Trends in Text Databases", Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data Mining, 1997.
3 Course on Data Mining (581550-4): Seminar Meetings
Good to Read as Background
- Both papers refer to the Agrawal and Srikant paper we had last week:
- Rakesh Agrawal and Ramakrishnan Srikant "Mining Sequential Patterns", Int'l Conference on Data Engineering, 1995.
4 Knowledge Management: A Text Mining Approach
- R. Feldman, M. Fresko, H. Hirsh, et al.
- Bar-Ilan University and Instinct Software, Israel; Rutgers University, USA; LIA-EPFL, Switzerland
- Published in PAKM'98 (Int'l Conf. on Practical Aspects of Knowledge Management)
- Data Mining course Autumn 2001/University of Helsinki
- Summary by Mika Klemettinen
5 KM: A Text Mining Approach
- Basic idea (see selected phases on the next slides)
  - 1. Get input data in SGML (or XML) format
    - Select only the contents of desired elements! (title, abstract, etc.)
  - 2. Do linguistic preprocessing
    - 2.1 Term extraction (use linguistic software for this)
    - 2.2 Term generation (combine adjacent terms into morpho-syntactic patterns like "noun-noun", "adj.-noun", etc. by calculating association coefficients)
    - 2.3 Term filtering (select only the top M most frequent ones)
  - 3. Create taxonomies (there is a tool for this)
  - 4. Generate associations (you may constrain the creation)
  - 5. Visualize/explore the results
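Phases 1 to 2.3 can be illustrated with a minimal Python sketch. It is not the authors' system. The element names, the toy documents, the whitespace tokenizer, and the use of the Dice coefficient as the association coefficient are all assumptions made for illustration. A real implementation would use linguistic software and part-of-speech patterns such as "noun-noun".

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical two-document collection; element names are illustrative only.
DOCS = [
    "<doc><title>Data mining tools</title>"
    "<abstract>Data mining finds patterns.</abstract><body>ignored</body></doc>",
    "<doc><title>Text mining</title>"
    "<abstract>Text mining extends data mining.</abstract><body>ignored</body></doc>",
]

def select_elements(xml_text, wanted=("title", "abstract")):
    """Phase 1: keep only the text of the desired elements."""
    root = ET.fromstring(xml_text)
    return [el.text or "" for el in root.iter() if el.tag in wanted]

def tokenize(text):
    """Stand-in for phase 2.1; real term extraction would use linguistic software."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in words if w]

def generate_terms(token_lists, min_dice=0.5):
    """Phase 2.2: join adjacent words whose Dice coefficient is high enough.

    The Dice coefficient is an assumed choice of association coefficient.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    terms = Counter()
    for (a, b), n in bigrams.items():
        dice = 2 * n / (unigrams[a] + unigrams[b])
        if dice >= min_dice:
            terms[f"{a} {b}"] = n
    return terms

def filter_terms(terms, top_m):
    """Phase 2.3: keep only the top M most frequent terms."""
    return terms.most_common(top_m)

token_lists = [tokenize(t) for d in DOCS for t in select_elements(d)]
top_terms = filter_terms(generate_terms(token_lists), top_m=2)
print(top_terms)
```

The surviving terms would then feed the taxonomy and association phases, which this sketch does not cover.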
6 Phase 2.1: Term Extraction
7 Phase 3: Taxonomy Construction
8 Phase 4: Association Rule Generation
9 Phase 4: Association Rule Generation
10 Phase 5.1: Visualization/Exploration
11 Phase 5.2: Visualization/Exploration
12 Discovering Trends in Text Databases
- Brian Lent, Rakesh Agrawal and Ramakrishnan Srikant
- IBM Almaden Research Center, USA
- Published in KDD'97
- Data Mining course Autumn 2001/University of Helsinki
- Summary by Mika Klemettinen
13 Discovering Trends in Text Databases
- Basic ideas
  - Identify frequent phrases using sequential pattern mining (see the slide summaries of the Agrawal et al. paper "Mining Sequential Patterns" (MSP))
  - Generate histories of phrases
  - Find phrases that satisfy a specified trend
- Definitions
  - Phrase: a phrase p is ⟨(w1)(w2)...(wn)⟩, where each w is a word
  - 1-phrase: ⟨ ⟨(IBM)⟩ ⟨(data)(mining)⟩ ⟩
  - 2-phrase: ⟨ ⟨(IBM)⟩ ⟨(data)(mining)⟩ ⟩ ⟨ ⟨(Anderson)(Consulting)⟩ ⟨(decision)(support)⟩ ⟩
  - Itemset, sequence, "is contained", etc. as in the MSP paper
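To make the "is contained" relation concrete for plain word phrases, here is a small Python sketch. It assumes each word is a single-item element, so containment reduces to an ordered subsequence test. The function names and the toy documents are hypothetical, and gap constraints are ignored here.

```python
def is_contained(phrase, words):
    """True if the words of `phrase` occur in `words` in the same order.

    Other words may appear in between; gap limits are not checked here.
    """
    remaining = iter(words)
    # `w in remaining` advances the iterator, which enforces the ordering.
    return all(w in remaining for w in phrase)

def support(phrase, documents):
    """Fraction of documents that contain the phrase."""
    hits = sum(is_contained(phrase, doc) for doc in documents)
    return hits / len(documents)

# Hypothetical toy documents, already split into lowercase words.
documents = [
    ["ibm", "offers", "data", "mining", "tools"],
    ["data", "warehouse", "mining"],
    ["decision", "support", "systems"],
    ["mining", "data"],
]

phrase = ("data", "mining")
print(support(phrase, documents))
```

A phrase counts as frequent when its support reaches a user-chosen minimum, as in the MSP setting.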
14 Discovering Trends in Text Databases
- Gaps: minimum and maximum gaps between adjacent words identify relations of words/phrases inside sentences/paragraphs, between words/phrases in different paragraphs, between words/phrases in different sections, etc.
  - Sentence boundary: 1,000
  - Paragraph boundary: 100,000
  - Section boundary: 10,000,000
- Phases
  - Partition data/documents based on their time stamps, create phrases for each partition (Lent et al. have patent data documents)
  - Select the frequent phrases and save their frequencies
  - Define shape queries using SDL (Shape Definition Language)
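The gap idea and the phases above can be sketched in Python as follows. This is only an illustration under stated assumptions: the nested input layout, the exact way boundary offsets are added to word positions, the function names, and the toy data are hypothetical, and `is_upward` is a plain stand-in for an SDL shape query, not SDL itself.

```python
SENTENCE, PARAGRAPH, SECTION = 1_000, 100_000, 10_000_000

def encode_positions(sections):
    """Give each word a position; crossing a boundary adds a large offset.

    `sections` is a list of sections, each a list of paragraphs, each a
    list of sentences, each a list of words (an assumed input layout).
    """
    out, pos = [], 0
    for section in sections:
        for paragraph in section:
            for sentence in paragraph:
                for word in sentence:
                    pos += 1
                    out.append((word, pos))
                pos += SENTENCE
            pos += PARAGRAPH
        pos += SECTION
    return out

def occurs_with_gap(first, second, encoded, min_gap=1, max_gap=SENTENCE - 1):
    """True if `second` follows `first` with a position gap in [min_gap, max_gap]."""
    return any(
        a == first and b == second and min_gap <= q - p <= max_gap
        for i, (a, p) in enumerate(encoded)
        for b, q in encoded[i + 1:]
    )

def phrase_history(docs_by_year, first, second, max_gap):
    """Per time partition: share of documents in which the word pair occurs."""
    history = {}
    for year in sorted(docs_by_year):
        docs = docs_by_year[year]
        hits = sum(
            occurs_with_gap(first, second, encode_positions(d), max_gap=max_gap)
            for d in docs
        )
        history[year] = hits / len(docs)
    return history

def is_upward(history):
    """Stand-in for a shape query: support strictly increases over time."""
    values = list(history.values())
    return all(a < b for a, b in zip(values, values[1:]))

# Hypothetical documents: section -> paragraph -> sentence -> words.
docs_by_year = {
    1995: [[[[["data", "mining"]]]], [[[["data"], ["mining"]]]]],
    1996: [[[[["text", "data", "mining"]]]], [[[["data", "mining", "tools"]]]]],
}

history = phrase_history(docs_by_year, "data", "mining", max_gap=SENTENCE - 1)
print(history, is_upward(history))
```

With a maximum gap below the sentence offset, the pair only matches inside one sentence. Raising `max_gap` to just below the paragraph offset would also accept matches across sentence boundaries within a paragraph.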
15 Discovering Trends in Text Databases
16 Discovering Trends in Text Databases
17 Discovering Trends in Text Databases