Title: Research on Data Mining and Knowledge Discovery at WPI
1. Research on Data Mining and Knowledge Discovery at WPI
- Prof. Carolina Ruiz
- Department of Computer Science
- Worcester Polytechnic Institute
2. Outline of this talk
- Short tutorial on Data Mining and Knowledge Discovery in Databases (KDD)
- Sample ongoing KDD research projects at WPI
3. Need for Data Mining
- Data are being gathered and stored extremely fast
  - Currently, the amount of new data stored in digital computer systems every day is roughly equivalent to 3000 pages of text for every person on Earth (estimate based on a projection to 2003 of a study led by Lyman and Varian at UC-Berkeley in 2000).
- Computational tools and techniques are needed to help humans summarize, understand, and take advantage of accumulated data
4. What is Data Mining? Or, more generally, Knowledge Discovery in Databases (KDD)
- "Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad et al., 1996)
- Raw Data -> Data Mining -> Patterns
  - Analytical and Statistical Patterns (rules, decision trees, ...)
  - Visual Patterns
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases." AI Magazine, pp. 37-54. Fall 1996.
5. Data Analysis (KDD) Process
6. KDD is Interdisciplinary: techniques come from multiple fields
- Machine Learning (AI)
  - Contributes (semi-)automatic induction of empirical laws from observations and experimentation
- Statistics
  - Contributes language, framework, and techniques
- Pattern Recognition
  - Contributes pattern extraction and pattern matching techniques
- Databases
  - Contributes efficient data storage, data cleansing, and data access techniques
- Data Visualization
  - Contributes visual data displays and data exploration
- High Performance Comp.
  - Contributes techniques for efficiently handling complexity
- Application Domain
  - Contributes domain knowledge
7. Data Mining Modes
- Confirmatory (verification)
  - Given a hypothesis, verify its validity against the data
- Exploratory (discovery)
  - Prescriptive patterns
    - Patterns for predicting behavior of newly encountered entities
  - Descriptive patterns
    - Patterns for presenting the behavior of observed entities in a human-understandable format
8. Analytical and Visual Data Mining
- Analytical
  - A model that represents the data is constructed using computational methods
- Visual
  - Data are displayed on a computer screen using colors and shapes
  - Patterns in the data are identified by the human (user) eye
9. What do you want to learn from your data? KDD approaches
- regression
- classification
- clustering
- change/deviation detection
- summarization
- dependency/assoc. analysis
10. Commercial Data Mining Systems
11. Closer Look: IBM's Intelligent Miner
12. Data Mining Academic Systems
- CBA: Liu et al., National Univ. of Singapore
- WEKA: Frank et al., University of Waikato, New Zealand
- ARMiner: Cristofor et al., UMass/Boston
- WPI WEKA: Our Temporal/Spatial Association Rules
13. Some Current Analytical Data Mining Research Projects at WPI
- Mining Complex Data: Set and Sequence Mining
  - Systems performance data
  - Sleep data
  - Financial data
  - Web data
- Data Mining for Genetic Analysis
  - Correlating genetic information with diseases
  - Predicting gene expression patterns
- Data Mining for Electronic Commerce
  - Collaborative and Content-Based Filtering
  - Using Association Rules and using Neural Networks
14. Mining Complex Data
     | names/aliases                         | bank account    | age | felonies                                   | gender | iris scan
  P1 | joe smith, greg jones                 | (not shown)     | 27  | <burglary 2/86, fraud 11/93, murder 3/99>  | M      | (not shown)
  P2 | kathy pearls, kathy dow, susan harris | 97,72,67,80,... | 53  | <child abuse 9/98, kidnapping 2/03>        | F      | (not shown)
  P3 | drew harris                           | 10,29,37,16,... | 49  | < >                                        | M      | (not shown)
Based partially on work w/ Norfolk County Sheriff Office
15. Sample Complex Patterns
- Potential temporal/spatial association
- Teenage males from Eastern Massachusetts who are
convicted of burglary are likely (7) to commit
violent crimes when they are adults.
16. Analyzing Sleep Data (WPI, UMass Medical, BC)
- Purpose
  - Associations between sleep patterns and health/pathology
  - Obtain patterns of different sleep stages (4 sleep stages, REM, Wake)
- Data set
  - Clinical (sequential)
    - Electro-encephalogram (EEG)
    - Electro-oculogram (EOG)
    - Electro-myogram (EMG)
    - Probe measuring flow of oxygen in blood, etc.
  - Diagnostic (tabular)
    - Questionnaire responses
    - Patient's demographic info
    - Patient's medical history
  (Source: http://www.blsc.com)
- Potential Rules
  - (A) Association Rules
    - (Sleep latency < 3 min) and (hereditary disorder) => Narcolepsy; confidence 92%, support 13%
  - (B) Classification Rules
    - (snoring = HEAVY) and (AHI > 30/hour) and (severe OSA) => (Race = Caucasian); confidence 70%, support 8%
- AHI = Apnea-Hypopnea Index; OSA = Obstructive Sleep Apnea
17. Input Data
- Each instance: tabular, set, and sequential attributes
     | attr1                     | attr2            | attr3 | attr4               | attr5  | class
     | illnesses                 | heart rate       | age   | oxygen              | gender | Epworth
  P1 | depression, fatigue       | (not shown)      | 27    | (not shown)         | M      | 5
  P2 | stroke, dementia, fatigue | 97,72,67,80,...  | 73    | 90,92,96,89,86,...  | F      | 23
  P3 | arthritis                 | 102,99,87,96,... | 49    | 97,100,82,80,70,... | M      | 14
18. Analyzing Financial Data
- Sequential data: daily stock values
- Normal (tabular/relational) data
  - sector (computers, agricultural, educational, ...), type of government, product releases, companies' awards, ...
- Desired rules
  - If DELL's stock value increases and 1999 < year < 2002 => IBM's stock value decreases
19. Financial Data Analysis
                  | AMD                            | Cisco
  Stock values    | (not shown)                    | (not shown)
  Products        | Athlon XP 2200 (Nov 11, 02)    | Aironet 1100 Series (Oct 2, 02)
  Awards          | Lifetime Achievement (Oct 31, 02) | None
  Neg. Events     | Reduce workforce (Nov 14, 02)  | None
  Expansion/Merge | None                           | None
20. Events: Sleep Data
- 6 basic sleep events/stages: W, S1, S2, S3, S4, REM
- SaO2: the mean oxygen saturation, around 90%
- Heart rate: shown by ECG in beats per minute
- The sleep stages: W or Wake; 1 or Stage 1; 2; 3; 4; and REM or Rapid Eye Movement stage
- Also shown (brown markings) are
  - Epoch (of duration 30 sec) and
  - Clock time (indicating total sleep time)
21. Events: Financial Data
- Basic events: 16 or so financial templates [Little & Rhodes 78]
- Difficult pattern matching: alignments and time warping
- Example templates
  - Panic Reversal
  - Head & Shoulders Reversal
  - Rounding Top Reversal
  - Descending Triangle Reversal
22. Example: Event Identification
- Templates: increase, decrease, sustain
- Confidence 90%, support 15%, class = Epworth
     | illnesses                 | heart rate       | age | oxygen              | gender | Epworth
  P1 | depression, fatigue       | (not shown)      | 27  | (not shown)         | M      | 5
  P2 | stroke, dementia, fatigue | 97,72,67,80,...  | 73  | 90,92,96,89,86,...  | F      | 23
  P3 | arthritis                 | 102,99,87,96,... | 49  | 97,100,82,80,70,... | M      | 14
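A minimal sketch of how a numeric sequence might be turned into increase / decrease / sustain events. The function name `identify_events` and the `tolerance` threshold are assumptions made for illustration; the slides do not say how the WPI tool detects events.

```python
def identify_events(values, tolerance=2):
    """Label each step of a sequence and merge equal neighbours into events.

    Returns a list of (label, start_index, end_index) tuples.
    `tolerance` is a hypothetical threshold: changes no larger than it
    in absolute value count as "sustain".
    """
    events = []
    for i in range(1, len(values)):
        delta = values[i] - values[i - 1]
        if delta > tolerance:
            label = "increase"
        elif delta < -tolerance:
            label = "decrease"
        else:
            label = "sustain"
        if events and events[-1][0] == label:
            # Extend the current event instead of starting a new one.
            events[-1] = (label, events[-1][1], i)
        else:
            events.append((label, i - 1, i))
    return events


# Prefixes of the heart rate and oxygen sequences shown for patient P2.
heart_rate = [97, 72, 67, 80]
oxygen = [90, 92, 96, 89, 86]
print(identify_events(heart_rate))
print(identify_events(oxygen))
```

Each resulting event carries the index interval over which it holds, which is what the temporal relations between events are defined on.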
23. Temporal Relations between two Events (event1, event2)
- meets
- before
- after
- overlaps
- is equal to
- starts
- during
- finishes
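The relations listed on this slide can be sketched for events represented as (start, end) intervals. The tuple representation and the function name `temporal_relation` are assumptions; pairs that match none of the eight listed relations (for example, event1 strictly containing event2) are reported as "other".

```python
def temporal_relation(event1, event2):
    """Return the relation of event1 to event2.

    Each event is a (start, end) tuple with start < end.
    """
    s1, e1 = event1
    s2, e2 = event2
    if (s1, e1) == (s2, e2):
        return "is equal to"
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if s1 == s2 and e1 < e2:
        return "starts"
    if e1 == e2 and s1 > s2:
        return "finishes"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 < e1 < e2:
        return "overlaps"
    return "other"


# Hypothetical time points t0..t3 encoded as 0..3.
oxygen_inc = (0, 1)   # oxygen increases over [t0, t1]
hr_dec = (1, 2)       # heart rate decreases over [t1, t2]
print(temporal_relation(oxygen_inc, hr_dec))
```

In this encoding, "heart rate decreases immediately after oxygen stops increasing" corresponds to the oxygen event meeting the heart rate event.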
24. Example temporal association rules
- Heart rate decreases immediately after oxygen stops increasing and gender = M => epworth = 10 (conf = 95%, supp = 23%)
  - HR-dec[t1,t2] and oxygen-inc[t0,t1] and gender = M => epworth = 10
- Heart rate sustains while oxygen increases and patient suffers from dementia => ethnicity = white (conf = 99%, supp = 16%)
- Patient suffers from dementia and depression and gender = F and REM[t0,t2] => oxygen-inc[t1,t3] (conf = 91%, supp = 17%)
Timeline: t0, t1, t2, t3
25. Closer Look: WPI Weka Tool for mining complex temporal/spatial associations
26. Data Mining for Genetic Analysis
w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC)
- SNP analysis
  - discovering correlations between sequence variations and diseases
- Gene expression
  - discovering patterns that cause a gene to be expressed in a particular cell
27. Correlating Genetics with Diseases
- Utilize data mining techniques with actual genetic data sampled from research
- Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness
28. Genomic Data Resources
  Patient Gender | SMA Type (Severity) | SNP Location | C212 (Father / Mother) | AG1-CA (Father / Mother)
  Female         | Severe              | Y272C        | 31 / 28 29             | 102 / 108 112
  Male           | Mild                | Y272C        | 28 29 / 25             | 108 112 / 114
Wirth, B. et al. Journal of Human Molecular Genetics
29. Data Mining Techniques
- Association Rule Mining
- Metrics for evaluation of mined rules
  - Confidence = P(Consequent | Premise)
  - Support = P(Consequent ∪ Premise), the fraction of records that contain every item of both the premise and the consequent
  - Lift = P(Consequent | Premise) / P(Consequent)
- Example
  - (Ag1-CA, 110 = absent) and (Ag1-CA, 108 = associated) and (Gender = Female) => (SMA Type = Severe)
  - Confidence 100%, Support 9.364%, Lift 2.39
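As a rough illustration of the three metrics, the sketch below computes them over a list of records, each modeled as a set of items. The function name `rule_metrics`, the item strings, and the eight toy records are all made up for this example; they are not taken from the SMA data set.

```python
def rule_metrics(records, premise, consequent):
    """Return (support, confidence, lift) for the rule premise => consequent.

    `records` is a list of sets of items; `premise` and `consequent`
    are sets of items. A record matches an itemset when it contains
    every item in it.
    """
    n = len(records)
    n_premise = sum(1 for r in records if premise <= r)
    n_consequent = sum(1 for r in records if consequent <= r)
    n_both = sum(1 for r in records if (premise | consequent) <= r)

    support = n_both / n
    confidence = n_both / n_premise if n_premise else 0.0
    lift = confidence / (n_consequent / n) if n_consequent else 0.0
    return support, confidence, lift


# Toy data (hypothetical): 2 female/severe, 2 male/severe, 4 male/mild.
records = (
    [{"gender=F", "sma=severe"}] * 2
    + [{"gender=M", "sma=severe"}] * 2
    + [{"gender=M", "sma=mild"}] * 4
)
support, confidence, lift = rule_metrics(records, {"gender=F"}, {"sma=severe"})
print(support, confidence, lift)  # 0.25 1.0 2.0
```

A lift above 1 means the consequent is more frequent among records matching the premise than in the data overall; here it is twice as frequent.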
30. Mining Gene Expression Patterns
- Different cells require different proteins
- DNA uses a four letter alphabet (ATCG)
- Cell expression pattern depends on motifs
31. Gene Expression Analysis
  Promoter | Motifs         | Sequence                                    | Cell type
  PR1      | M1, M2, M4, M5 | CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT  | neural
  PR2      | M1, M4, M5     | AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA  | neural
  PR3      | M1, M4         | CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA  | muscle
  PR4      | M1, M2, M5     | GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA  | neural
  PR5      | M1, M4         | ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA  | muscle
  PR6      | M4, M5, M3     | GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC  | neural
  PR7      | M1, M2, M5, M3 | TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA  | neural
  PR8      | M2, M4, M5     | ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC | neural
  PR9      | M4, M3         | ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA  | muscle
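A small sketch of the kind of pattern one might look for in this slide: motifs whose presence in a promoter coincides exactly with one cell type. It assumes the nine cell-type labels on the slide are listed in promoter order (PR1 to PR9); the function name `discriminating_motifs` is hypothetical, and this is not a description of how CAGE works.

```python
# Motif annotations and cell types as read off the slide
# (pairing by order PR1..PR9 is an assumption).
promoters = {
    "PR1": ({"M1", "M2", "M4", "M5"}, "neural"),
    "PR2": ({"M1", "M4", "M5"}, "neural"),
    "PR3": ({"M1", "M4"}, "muscle"),
    "PR4": ({"M1", "M2", "M5"}, "neural"),
    "PR5": ({"M1", "M4"}, "muscle"),
    "PR6": ({"M4", "M5", "M3"}, "neural"),
    "PR7": ({"M1", "M2", "M5", "M3"}, "neural"),
    "PR8": ({"M2", "M4", "M5"}, "neural"),
    "PR9": ({"M4", "M3"}, "muscle"),
}


def discriminating_motifs(promoters, cell_type):
    """Return motifs present in every promoter of `cell_type` and in no other."""
    all_motifs = set().union(*(motifs for motifs, _ in promoters.values()))
    return sorted(
        motif
        for motif in all_motifs
        if all(
            (motif in motifs) == (label == cell_type)
            for motifs, label in promoters.values()
        )
    )


print(discriminating_motifs(promoters, "neural"))  # ['M5']
print(discriminating_motifs(promoters, "muscle"))  # []
```

Under the stated pairing, M5 appears in all six neural promoters and in none of the three muscle promoters, while no single motif marks the muscle promoters.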
32. Our System: CAGE
- To predict gene expression based on DNA sequences.
[Figure: CAGE shown with three cell types (Muscle Cell, Neural Cell, Seam Cells); for each cell type, Gene 1, Gene 2, and Gene 3 are marked On or Off.]
33. Summary
- KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
- The KDD process includes data collection and pre-processing, data mining, and evaluation and validation of those patterns
- Data mining is the discovery and extraction of patterns from data, not the extraction of data
- Important challenges in data mining: privacy, security, scalability, real-time, and handling non-conventional data
34. Data Mining Resources: Books
- Advances in Knowledge Discovery and Data Mining. Eds. Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy. The MIT Press, 1995.
- Data Mining: Concepts and Techniques. J. Han and M. Kamber. Morgan Kaufmann Publishers, 2001.
- Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. I. Witten and E. Frank. Morgan Kaufmann Publishers, 2000.
- Data Mining: Technologies, Techniques, Tools, and Trends. B. Thuraisingham. CRC, 1998.
- Principles of Data Mining. D. J. Hand, H. Mannila, and P. Smyth. MIT Press, 2000.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction. T. Hastie, R. Tibshirani, J. Friedman. Springer Verlag, 2001.
- Data Mining Cookbook: Modeling Data for Marketing, Risk, and CRM. O. Parr Rud. Wiley, 2001.
- Data Mining: A Hands-On Approach for Business Professionals. R. Groth. Prentice Hall, 1998.
- Data Preparation for Data Mining. Dorian Pyle. Morgan Kaufmann, 1999.
- Data Mining Methods for Knowledge Discovery. Cios, Pedrycz, Swiniarski. Kluwer, 1998.
35. Data Mining Resources: Books (cont.)
- Mastering Data Mining. M. Berry and G. Linoff. John Wiley & Sons, 2000.
- Data Mining Techniques for Marketing, Sales and Customer Support. Berry and Linoff. John Wiley & Sons, 1997.
- Decision Support using Data Mining. S. Anand and A. Buchner. Financial Times Pitman Publishing, 1998.
- Feature Selection for Knowledge Discovery and Data Mining. Liu and Motoda. Kluwer, 1998.
- Feature Extraction, Construction and Selection: A Data Mining Perspective. Eds. Motoda and Liu. Kluwer, 1998.
- Knowledge Acquisition from Databases. Xindong Wu.
- Mining Very Large Databases with Parallel Processing. A. Freitas and S. Lavington. Kluwer, 1998.
- Predictive Data-Mining: A Practical Guide. Weiss and Indurkhya. Morgan Kaufmann, 1998.
- Machine Learning and Data Mining: Methods and Applications. Michalski, Bratko, and Kubat. John Wiley & Sons, 1998.
- Rough Sets and Data Mining: Analysis of Imprecise Data. Eds. Lin and Cercone. Kluwer.
- Seven Methods for Transforming Corporate Data into Business Intelligence. Vasant Dhar and Roger Stein. Prentice-Hall, 1997.
36. Data Mining Resources: Journals
- Data Mining and Knowledge Discovery Journal
- Newsletters
  - ACM SIGKDD Explorations Newsletter
- Related Journals
  - TKDE: IEEE Transactions on Knowledge and Data Engineering
  - TODS: ACM Transactions on Database Systems
  - JACM: Journal of the ACM
  - Data and Knowledge Engineering
  - JIIS: Intl. Journal of Intelligent Information Systems
37. Data Mining Resources: Conferences
- KDD: ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining
- ICDM: IEEE International Conference on Data Mining
- SIAM International Conference on Data Mining
- PKDD: European Conference on Principles and Practice of Knowledge Discovery in Databases
- PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining
- DaWaK: Intl. Conference on Data Warehousing and Knowledge Discovery
- Related Conferences
  - ICML: Intl. Conf. on Machine Learning
  - IDEAL: Intl. Conf. on Intelligent Data Engineering and Automated Learning
  - IJCAI: International Joint Conference on Artificial Intelligence
  - AAAI: American Association for Artificial Intelligence Conference
  - SIGMOD/PODS: ACM Intl. Conference on Data Management
  - ICDE: International Conference on Data Engineering
  - VLDB: International Conference on Very Large Data Bases