Title: Master of Science
1Data Mining Research at SMU
Margaret H. Dunham, DBGroup
Yu Meng, Jie Huang, Lin Lu, Donya Quick, Michael Pierce
CSE Department, Southern Methodist University
Dallas, Texas 75275
mhd_at_engr.smu.edu
2Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.
3Outline
- What is Data Mining?
- EMM
- Spatio-temporal modeling
- Rare Event Detection
- Bioinformatics
- TCGR DNA/RNA visualization
- miRNA prediction
- Web Usage Mining
4Data Mining Definition
- Finding hidden information in a database
- Fit data to a model
- Similar terms
- Exploratory data analysis
- Data driven discovery
- Deductive learning
5Query Examples
- Find all credit applicants with last name of Smith.
- Identify customers who have purchased more than $10,000 in the last month.
- Find all customers who have purchased milk.
- Find all credit applicants who are poor credit risks. (classification)
- Identify customers with similar buying habits. (clustering)
- Find all items which are frequently purchased with milk. (association rules)
7Spatiotemporal Environment
- Events arriving in a stream
- At any time t, we can view the state of the problem as represented by a vector of n numeric values: $V_t = \langle S_{1t}, S_{2t}, \ldots, S_{nt} \rangle$
[Figure axis: Time]
8Technique
- Spatiotemporal modeling technique based on Markov models.
- However:
- Size of MM depends on size of dataset.
- The required structure of the MM is not known at the model construction time.
- As the real world being modeled by the MM changes, so should the structure of the MM.
9Extensible Markov Model (EMM)
- Time Varying Discrete First Order Markov Model
- Nodes are clusters of real world states.
- Learning continues during application phase.
- Learning
- Transition probabilities between nodes
- Node labels (centroid/medoid of cluster)
- Nodes are added and removed as data arrives
10EMM Learning
<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>
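The learning step can be illustrated with a toy model fed the six input vectors listed on this slide. This is only a minimal sketch under stated assumptions, not the authors' implementation: the class name `EMM`, the Euclidean distance, the running-mean centroid update, and the threshold value 2.5 are all hypothetical choices made for illustration. It adds nodes and counts transitions as data arrives; node removal is not shown.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class EMM:
    """Toy extensible Markov model: nodes are clusters, links carry counts."""

    def __init__(self, threshold):
        self.threshold = threshold      # assumed: max distance to join a node
        self.centroids = []             # node labels (cluster centroids)
        self.sizes = []                 # vectors absorbed by each node
        self.links = defaultdict(int)   # (from_node, to_node) -> count
        self.current = None             # node reached by the previous input

    def learn(self, v):
        """Assign v to the nearest node (or create one) and count the link."""
        node, best = None, math.inf
        for i, c in enumerate(self.centroids):
            d = euclidean(v, c)
            if d < best:
                node, best = i, d
        if node is None or best > self.threshold:
            # No node is close enough: extend the model with a new node.
            self.centroids.append([float(x) for x in v])
            self.sizes.append(1)
            node = len(self.centroids) - 1
        else:
            # Update the centroid as a running mean of its members.
            n = self.sizes[node]
            self.centroids[node] = [
                (c * n + x) / (n + 1) for c, x in zip(self.centroids[node], v)
            ]
            self.sizes[node] = n + 1
        if self.current is not None:
            self.links[(self.current, node)] += 1
        self.current = node
        return node

    def transition_probability(self, a, b):
        """Estimated P(next node = b | current node = a) from link counts."""
        total = sum(n for (src, _), n in self.links.items() if src == a)
        return self.links.get((a, b), 0) / total if total else 0.0


# Input vectors as listed on the slide, in arrival order.
stream = [
    (18, 10, 3, 3, 1, 0, 0),
    (17, 10, 2, 3, 1, 0, 0),
    (16, 9, 2, 3, 1, 0, 0),
    (14, 8, 2, 3, 1, 0, 0),
    (14, 8, 2, 3, 0, 0, 0),
    (18, 10, 3, 3, 1, 1, 0),
]
model = EMM(threshold=2.5)              # hypothetical threshold
path = [model.learn(v) for v in stream]
```

With this particular threshold the sketch ends up with two nodes; a different threshold or distance measure would give a different model, so the node count here should not be read as the result shown in the original figure.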
11Growth of EMM
Servent Data
12EMM Performance Growth Rate
Minnesota Traffic Data
13EMM Water Level Prediction Ouse Data
14Rare Event
- Rare: anomalous, surprising
- Out of the ordinary
- Not outlier detection
- Ex: Snow in upstate New York is not rare
- Snow in upstate New York in June is rare
- Rare events may change over time
- Applications
- Intrusion Detection
- Fraud
- Flooding
- Unusual automobile/network traffic
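One way to read the snow example is that rarity depends on context: the same event can be common overall yet rare under a given condition. The sketch below illustrates that reading with a relative-frequency test. It is an assumption for illustration only, not the detection method used in this work; the function names, the 0.1 threshold, and the observation counts are all invented.

```python
def frequency(observations, event, context=None):
    """Relative frequency of `event`, optionally restricted to one context."""
    pool = [e for c, e in observations if context is None or c == context]
    if not pool:
        return 0.0
    return sum(1 for e in pool if e == event) / len(pool)


def is_rare(observations, event, context=None, threshold=0.1):
    """Flag an event as rare when its frequency falls below the threshold."""
    return frequency(observations, event, context) < threshold


# Invented (month, weather) observations, purely illustrative.
observations = (
    [("January", "snow")] * 8
    + [("January", "clear")] * 2
    + [("June", "snow")] * 1
    + [("June", "clear")] * 19
)

snow_overall_rare = is_rare(observations, "snow")
snow_in_june_rare = is_rare(observations, "snow", context="June")
```

Because the frequencies are recomputed from whatever observations are supplied, the same test applied to a later batch of data can give a different answer, which matches the point that rare events may change over time.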
15Rare Event in Cisco Data
17Chaos Game Representation (CGR)
- 2D technique to visually see the distribution of subpatterns
- Our technique is based on the following:
- Generate totals for each subpattern
- Scale totals to a [0,1] range. (Note: scaling can be a problem)
- Convert range to red/blue
- 0-0.5: White to Blue
- 0.5-1: Blue to Red
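The three steps above (total, scale, colour) can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, scaling by the largest total is only one possible choice (the slide itself notes that scaling can be a problem), the linear colour interpolation is assumed, and the input sequence is a made-up string. Placement of each subpattern in the 2D CGR square is not shown.

```python
from itertools import product


def subpattern_totals(sequence, k, alphabet="ACGU"):
    """Count every length-k subpattern over the alphabet in the sequence."""
    totals = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(sequence) - k + 1):
        pattern = sequence[i:i + k]
        if pattern in totals:           # skip patterns with unknown symbols
            totals[pattern] += 1
    return totals


def scale(totals):
    """Scale totals to [0, 1] by dividing by the largest total (assumed)."""
    peak = max(totals.values())
    return {p: (n / peak if peak else 0.0) for p, n in totals.items()}


def to_rgb(x):
    """Map [0, 0.5] from white to blue and [0.5, 1] from blue to red."""
    if x <= 0.5:
        t = x / 0.5
        return (round(255 * (1 - t)), round(255 * (1 - t)), 255)
    t = (x - 0.5) / 0.5
    return (round(255 * t), 0, round(255 * (1 - t)))


sequence = "GUGUUCGUG"                  # made-up RNA string for illustration
totals = subpattern_totals(sequence, k=3)
scaled = scale(totals)
colours = {p: to_rgb(v) for p, v in scaled.items()}
```

In this toy input GUG occurs twice and UUC once, so GUG maps to the red end of the scale and UUC to pure blue, while patterns that never occur stay white.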
18CGR Example
Homo sapiens, all mature miRNA, patterns of length 3
UUC
GUG
19Temporal CGR (TCGR)
- Temporal version of Frequency CGR
- In our context temporal means the starting location of a window
- 2D Array
- Each Row represents counts for a particular window in sequence
- First row first window
- Last row last window
- We start successive windows at the next character location
- Each Column represents the counts for the associated pattern in that window
- Initially we have assumed order of patterns is alphabetic
- Size of TCGR depends on sequence length and subpattern length
- As sequence lengths vary, we only examine complete windows
- We only count patterns completely contained in each window.
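The array described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: the function name `tcgr`, its parameters, and the input string are hypothetical. Rows are successive complete windows, each starting one character after the previous one; columns are subpatterns in alphabetical order; only patterns fully inside a window are counted.

```python
from itertools import product


def tcgr(sequence, window, k, alphabet="ACGU"):
    """Return (patterns, rows): one row of pattern counts per window."""
    # Columns: all length-k patterns in alphabetical order.
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=k)]
    column = {p: j for j, p in enumerate(patterns)}
    rows = []
    # Only complete windows are examined, so the last start position is
    # len(sequence) - window.
    for start in range(len(sequence) - window + 1):
        chunk = sequence[start:start + window]
        counts = [0] * len(patterns)
        # Only patterns completely contained in the window are counted.
        for i in range(window - k + 1):
            pattern = chunk[i:i + k]
            if pattern in column:
                counts[column[pattern]] += 1
        rows.append(counts)
    return patterns, rows


sequence = "GUGUUCGUG"                  # made-up RNA string for illustration
patterns, rows = tcgr(sequence, window=5, k=2)
```

For a sequence of length n the sketch yields n - window + 1 rows and 4^k columns, which is one way to see why the size of the TCGR depends on both the sequence length and the subpattern length.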
20TCGR Example
21TCGR Mature miRNA (Window=5, Pattern=2)
23The BIG PICTURE
2003-10-05154920050721435700000026210000000000
02652026520000000002003-10-051640
49050832595900000872710001142380
07107071070000000002003-10-0504551005076779990
0000191300000670518
00000000000000000002003-10-0509431005078176610
0000603030000000000
03657004690000000002003-10-0514493605081824200
00007066200000000000811a39
09142071070000000002003-10-0521235705075903160
0000465050002794335
11992071070000000002003-10-0511301605073051260
0000465050000195747
1684600597corduroycoats
CAN'T SEE THE FOREST FOR THE TREES
24Preprocess Web Data Cleanse Sessionize
URL Abstraction
Markov Model per Cluster
Markov Model
User defined beginning/ending Web pages
Significant Usage Pattern
User Preferred Navigation Trail
Cluster Web Sessions
Normalized Probability
25Experimental Result
- On average, purchase sessions are longer than sessions without purchase
- - review the information, compare the price, the quality, etc.
- - fill out the billing and shipping information to commit the purchase
WebKDD05 25
26Experimental Result
SUPs in non-purchase cluster
Interested in gathering information of products
in different categories.
S-C1-C1-C2-C3-C4-C5-C5-I1-E
S-C1-C1-I1-C1-C2-C3-C4-C5-E
S-I1-C1-C2-C3-C4-C5-C6-C7-E
Interested in reviewing general pages (to gather
general information).
Not serious visitors (the average session length
is 3)
27Experimental Result