Research on Data Mining and Knowledge Discovery at WPI

Learn more at: http://web.cs.wpi.edu


1
Research on Data Mining and Knowledge Discovery
at WPI
  • Prof. Carolina Ruiz
  • Department of Computer Science
  • Worcester Polytechnic Institute

2
Outline of this talk
  • Short tutorial on Data Mining and Knowledge
    Discovery in Databases (KDD)
  • Sample ongoing KDD research projects at WPI

3
Need for Data Mining
  • Data are being gathered and stored extremely fast
  • Currently, the amount of new data stored in
    digital computer systems every day is roughly
    equivalent to 3000 pages of text for every person
    on Earth (estimate based on a projection to 2003
    of a study led by Lyman and Varian at UC-Berkeley
    in 2000).
  • Computational tools and techniques are needed to
    help humans in summarizing, understanding, and
    taking advantage of accumulated data

4

What is Data Mining? Or, more generally, Knowledge
Discovery in Databases (KDD)
  • Non-trivial process of identifying valid, novel,
    potentially useful, and ultimately understandable
    patterns in data [Fayyad et al. 1996]
  • Raw Data -> Data Mining -> Patterns
  • Patterns may be:
    • Analytical and Statistical Patterns (rules,
      decision trees, ...)
    • Visual Patterns

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P.
"From Data Mining to Knowledge Discovery in
Databases." AI Magazine, pp. 37-54, Fall 1996.
5
Data Analysis (KDD) Process
6
KDD is Interdisciplinary: techniques come from
multiple fields
  • Machine Learning (AI)
    • Contributes (semi-)automatic induction of
      empirical laws from observations and
      experimentation
  • Statistics
    • Contributes language, framework, and techniques
  • Pattern Recognition
    • Contributes pattern extraction and pattern
      matching techniques
  • Databases
    • Contributes efficient data storage, data
      cleansing, and data access techniques
  • Data Visualization
    • Contributes visual data displays and data
      exploration
  • High Performance Computing
    • Contributes techniques for efficiently handling
      complexity
  • Application Domain
    • Contributes domain knowledge

7
Data Mining Modes
  • Confirmatory (verification)
    • Given a hypothesis, verify its validity against
      the data
  • Exploratory (discovery)
    • Prescriptive patterns
      • Patterns for predicting behavior of newly
        encountered entities
    • Descriptive patterns
      • Patterns for presenting the behavior of observed
        entities in a human-understandable format

8
Analytical and Visual Data Mining
  • Analytical
    • A model that represents the data is constructed
      using computational methods
  • Visual
    • Data are displayed on a computer screen using
      colors and shapes
    • Patterns in the data are identified by the human
      (user) eye.

9
What do you want to learn from your data? KDD
approaches
  • regression
  • classification
  • clustering
  • change/deviation detection
  • summarization
  • dependency/association analysis
10
Commercial Data Mining Systems
11
Closer Look: IBM's Intelligent Miner
12
Data Mining Academic Systems
  • CBA [Liu et al., National Univ. of Singapore]
  • WEKA [Frank et al., University of Waikato, New
    Zealand]
  • ARMiner [Cristofor et al., UMass/Boston]
  • WPI WEKA - Our Temporal/Spatial Association Rules
13
Some Current Analytical Data Mining Research
Projects at WPI
  • Mining Complex Data: Set and Sequence Mining
    • Systems performance data
    • Sleep data
    • Financial data
    • Web data
  • Data Mining for Genetic Analysis
    • Correlating genetic information with diseases
    • Predicting gene expression patterns
  • Data Mining for Electronic Commerce
    • Collaborative and Content-Based Filtering
    • Using Association Rules and using Neural Networks

14
Mining Complex Data
     | names/aliases                         | bank account        | age | felonies                                  | gender | iris scan
  P1 | joe smith, greg jones                 |                     | 27  | <burglary 2/86, fraud 11/93, murder 3/99> | M      |
  P2 | kathy pearls, kathy dow, susan harris | 97, 72, 67, 80, ... | 53  | <child abuse 9/98, kidnapping 2/03>       | F      |
  P3 | drew harris                           | 10, 29, 37, 16, ... | 49  | < >                                       | M      |

Based partially on work with the Norfolk County Sheriff's
Office
15
Sample Complex Patterns
  • Potential temporal/spatial association
  • Teenage males from Eastern Massachusetts who are
    convicted of burglary are likely (7) to commit
    violent crimes when they are adults.

16
Analyzing Sleep Data
  • Purpose
    • Associations between sleep patterns and
      health/pathology
    • Obtain patterns of different sleep stages (4
      sleep + REM + Wake)
  • Data set
    • Clinical (sequential)
      • Electro-encephalogram (EEG)
      • Electro-oculogram (EOG)
      • Electro-myogram (EMG)
      • Probe measuring flow of oxygen in blood, etc.
    • Diagnostic (tabular)
      • Questionnaire responses
      • Patients' demographic info.
      • Patients' medical history

(Source: http://www.blsc.com)
  • Potential Rules
    • (A) Association Rules
      • (Sleep latency < 3 min) & (hereditary disorder)
        => Narcolepsy [confidence = 92%, support = 13%]
    • (B) Classification Rules
      • (snoring = HEAVY) & (AHI > 30/hour) &
        severe OSA
        => (Race = Caucasian) [confidence = 70%, support =
        8%]
  • AHI = Apnea Hypopnea Index, OSA =
    Obstructive Sleep Apnea

WPI, UMass Medical, BC
17
Input Data
  • Each instance: tabular + set + sequential
    attributes

     | attr1                     | attr2                | attr3 | attr4                    | attr5  | class
     | illnesses                 | heart rate           | age   | oxygen                   | gender | Epworth
  P1 | depression, fatigue       |                      | 27    |                          | M      | 5
  P2 | stroke, dementia, fatigue | 97, 72, 67, 80, ...  | 73    | 90, 92, 96, 89, 86, ...  | F      | 23
  P3 | arthritis                 | 102, 99, 87, 96, ... | 49    | 97, 100, 82, 80, 70, ... | M      | 14
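
A minimal sketch of how one such mixed instance might be held in memory, assuming Python. The class name PatientInstance and its field names are hypothetical and are not taken from the WPI tools; the values are the visible part of row P2 on the slide (the sequences are truncated there).

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class PatientInstance:
    """One instance mixing set, sequential, and plain tabular attributes.

    Hypothetical layout for illustration only.
    """
    illnesses: FrozenSet[str]   # set-valued attribute
    heart_rate: List[int]       # sequential attribute (time series)
    age: int                    # plain tabular attribute
    oxygen: List[int]           # sequential attribute (time series)
    gender: str                 # plain tabular attribute
    epworth: int                # class attribute (Epworth score)


# Row P2 from the slide; only the values shown there are used.
p2 = PatientInstance(
    illnesses=frozenset({"stroke", "dementia", "fatigue"}),
    heart_rate=[97, 72, 67, 80],
    age=73,
    oxygen=[90, 92, 96, 89, 86],
    gender="F",
    epworth=23,
)
```

Keeping the set and sequence attributes as separate typed fields lets a miner treat them differently from the scalar columns, which is the point the slide makes about complex instances.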
18
Analyzing Financial Data
  • Sequential data: daily stock values
  • Normal (tabular/relational) data
    • sector (computers, agricultural, educational, ...),
      type of government, product releases, companies'
      awards, ...
  • Desired rules
    • If DELL's stock value increases & 1999 < year < 2002
      => IBM's stock value decreases

19
Financial Data Analysis
                 | AMD                               | Cisco
Stock values     |                                   |
Products         | Athlon XP 2200 (Nov 11, 02)       | Aironet 1100 Series (Oct 2, 02)
Awards           | Lifetime Achievement (Oct 31, 02) | None
Neg. Events      | Reduce workforce (Nov 14, 02)     | None
Expansion/Merge  | None                              | None
20
Events: Sleep Data. 6 basic sleep events/stages:
W, S1, S2, S3, S4, REM
  • SaO2: the mean oxygen saturation (SaO2), around
    90%
  • heart rate: shown by ECG, in beats per minute
  • the sleep stages: W or Wake, 1 or Stage 1, 2, 3,
    4, and REM or Rapid Eye Movement stage.
  • Also shown (brown markings) are
    • Epoch (of duration 30 sec) and
    • Clock time (indicating total sleep time).

21
Events: Financial Data. Basic events: 16 or so
financial templates [Little & Rhodes 78]; difficult
pattern matching (alignments and time warping)
  • Panic Reversal
  • Head & Shoulders Reversal
  • Rounding Top Reversal
  • Descending Triangle Reversal
22
Example Event Identification
  • Templates: increase, decrease,
    sustain
  • Confidence 90%, support 15%, class: Epworth

     | illnesses                 | heart rate           | age | oxygen                   | gender | Epworth
  P1 | depression, fatigue       |                      | 27  |                          | M      | 5
  P2 | stroke, dementia, fatigue | 97, 72, 67, 80, ...  | 73  | 90, 92, 96, 89, 86, ...  | F      | 23
  P3 | arthritis                 | 102, 99, 87, 96, ... | 49  | 97, 100, 82, 80, 70, ... | M      | 14
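
The increase / decrease / sustain templates can be illustrated with a very simple thresholding pass over a numeric sequence. This is only a sketch under stated assumptions: the function name identify_events and the tolerance parameter are hypothetical, and real template matching (with alignment and time warping) is considerably harder than this. The sample values are the visible oxygen readings of row P3.

```python
def identify_events(values, tolerance=2):
    """Label each step between consecutive samples and merge equal labels.

    Returns a list of (label, start_index, end_index) tuples, where a step
    is "increase" if it rises by more than `tolerance`, "decrease" if it
    falls by more than `tolerance`, and "sustain" otherwise.
    """
    events = []
    for i in range(1, len(values)):
        delta = values[i] - values[i - 1]
        if delta > tolerance:
            label = "increase"
        elif delta < -tolerance:
            label = "decrease"
        else:
            label = "sustain"
        if events and events[-1][0] == label:
            # Extend the current event instead of starting a new one.
            events[-1] = (label, events[-1][1], i)
        else:
            events.append((label, i - 1, i))
    return events


oxygen = [97, 100, 82, 80, 70]  # visible part of row P3
events = identify_events(oxygen)
```

Each returned event carries a start and end index, so it can be treated as a time interval when looking for temporal relations between events.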
23
Temporal Relations between two Events (event1,
event2)
  • meets
  • before
  • after
  • overlaps
  • is equal to
  • starts
  • during
  • finishes
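
As a rough illustration, the relations named on this slide can be computed for two events given as (start, end) intervals. The sketch below assumes closed numeric intervals with start < end; the function name temporal_relation is hypothetical. Relations of this kind are commonly formalized as Allen's interval algebra, which has additional inverse relations that this sketch simply reports as "other".

```python
def temporal_relation(event1, event2):
    """Return how event1 relates to event2, using the slide's relation names.

    Each event is a (start, end) pair with start < end.
    """
    (s1, e1), (s2, e2) = event1, event2
    if s1 == s2 and e1 == e2:
        return "is equal to"
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if s1 == s2 and e1 < e2:
        return "starts"
    if e1 == e2 and s1 > s2:
        return "finishes"
    if s1 > s2 and e1 < e2:
        return "during"
    if s1 < s2 < e1 < e2:
        return "overlaps"
    # Inverse relations (e.g. event1 contains event2) are not named on the slide.
    return "other"


# Example: an event over [t1, t2] starting exactly where one over [t0, t1] ends.
relation = temporal_relation((0, 1), (1, 2))
```

Here the event on (0, 1) meets the event on (1, 2), which is the "immediately after" situation read from the other direction.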
24
Example temporal association rules
  • heart rate decreases immediately after oxygen
    stops increasing & gender = M => epworth = 10
    (conf = 95%, supp = 23%)
    • HR-dec[t1,t2] & oxygen-inc[t0,t1] & gender = M
      => epworth = 10
  • Heart rate sustains while oxygen increases &
    patient suffers from dementia => ethnicity = white
    (conf = 99%, supp = 16%)
  • Patient suffers from dementia and depression &
    gender = F & REM[t0,t2] =>
    oxygen-inc[t1,t3] (conf = 91%, supp = 17%)

Time points: t0, t1, t2, t3
25
Closer Look: WPI Weka Tool for mining complex
temporal/spatial associations
26
Data Mining for Genetic Analysis
w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS,
WPI), and Alvarez (CS, BC)
  • SNP analysis
    • discovering correlations between sequence
      variations and diseases
  • Gene expression
    • discovering patterns that cause a gene to be
      expressed in a particular cell

27
Correlating Genetics with Diseases
  • Utilize Data Mining Techniques with Actual
    Genetic Data Sampled from Research
  • Spinal Muscular Atrophy: inherited disease that
    results in progressive muscle degeneration and
    weakness.

28
Genomic Data Resources
Patient Gender | SMA Type (Severity) | SNP Location | C212 (Father / Mother) | AG1-CA (Father / Mother)
Female         | Severe              | Y272C        | 31 / 28 29             | 102 / 108 112
Male           | Mild                | Y272C        | 28 29 / 25             | 108 112 / 114

Wirth, B. et al. Journal of Human Molecular
Genetics
29
Data Mining Techniques
  • Association Rule Mining
  • Metrics for evaluation of mined rules
    • Confidence = P(Consequent | Premise)
    • Support = P(Consequent ∪ Premise), i.e. the
      fraction of instances containing both
    • Lift = P(Consequent | Premise) /
      P(Consequent)
  • Example
    • (Ag1-CA, 110 = absent) & (Ag1-CA, 108 = associated)
      & (Gender = Female) => (SMA Type = Severe)
    • Confidence = 100%, Support = 9.364%, Lift = 2.39
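
The three metrics can be computed directly from counts. The following is a minimal sketch, assuming each instance is represented as a Python set of attribute=value items; the function name rule_metrics and the five toy instances are made up for illustration and are not the SMA data from the slides.

```python
def rule_metrics(transactions, premise, consequent):
    """Return (support, confidence, lift) for the rule premise => consequent.

    support    = P(premise and consequent both present)
    confidence = P(consequent | premise)
    lift       = confidence / P(consequent)
    """
    premise, consequent = set(premise), set(consequent)
    n = len(transactions)
    n_premise = sum(1 for t in transactions if premise <= t)
    n_consequent = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (premise | consequent) <= t)
    if n_premise == 0 or n_consequent == 0:
        raise ValueError("premise and consequent must each occur at least once")
    support = n_both / n
    confidence = n_both / n_premise
    lift = confidence / (n_consequent / n)
    return support, confidence, lift


# Toy, made-up instances (not real patient data).
toy = [
    {"gender=F", "type=severe"},
    {"gender=F", "type=severe"},
    {"gender=F", "type=mild"},
    {"gender=M", "type=severe"},
    {"gender=M", "type=mild"},
]
support, confidence, lift = rule_metrics(toy, {"gender=F"}, {"type=severe"})
```

A lift above 1 means the consequent is more frequent among instances matching the premise than in the data overall, which is why lift is reported alongside confidence.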
30
Mining Gene Expression Patterns
  • Different cells require different proteins
  • DNA uses a four-letter alphabet (ATCG)
  • Cell expression pattern depends on motifs

31
Gene Expression Analysis

PROMOTER | MOTIFS         | SEQUENCE
PR1      | M1, M2, M4, M5 | CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT
PR2      | M1, M4, M5     | AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA
PR3      | M1, M4         | CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA
PR4      | M1, M2, M5     | GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA
PR5      | M1, M4         | ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA
PR6      | M4, M5, M3     | GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC
PR7      | M1, M2, M5, M3 | TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA
PR8      | M2, M4, M5     | ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC
PR9      | M4, M3         | ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA

CELL TYPES (as listed on the slide): neural, neural, muscle,
neural, muscle, neural, neural, neural, muscle
32
Our System CAGE
  • To predict gene expression based on DNA sequences.

[Figure: for each cell type (Muscle Cell, Neural Cell, Seam Cells),
CAGE labels Gene 1, Gene 2, and Gene 3 as On or Off.]
33
Summary
  • KDD is the non-trivial process of identifying
    valid, novel, potentially useful, and ultimately
    understandable patterns in data
  • The KDD process includes data collection and
    pre-processing, data mining, and evaluation and
    validation of those patterns
  • Data mining is the discovery and extraction of
    patterns from data, not the extraction of data
  • Important challenges in data mining: privacy,
    security, scalability, real-time, and handling
    non-conventional data

34
Data Mining Resources: Books
  • Advances in Knowledge Discovery and Data Mining.
    Eds. Fayyad, Piatetsky-Shapiro, Smyth, and
    Uthurusamy. The MIT Press, 1995.
  • Data Mining: Concepts and Techniques. J. Han and
    M. Kamber. Morgan Kaufmann Publishers, 2001.
  • Data Mining: Practical Machine Learning Tools and
    Techniques with Java Implementations. I. Witten
    and E. Frank. Morgan Kaufmann Publishers, 2000.
  • Data Mining: Technologies, Techniques, Tools, and
    Trends. B. Thuraisingham. CRC, 1998.
  • Principles of Data Mining. D. J. Hand, H.
    Mannila, and P. Smyth. MIT Press, 2000.
  • The Elements of Statistical Learning: Data
    Mining, Inference, and Prediction. T. Hastie, R.
    Tibshirani, and J. Friedman. Springer Verlag, 2001.
  • Data Mining Cookbook: Modeling Data for
    Marketing, Risk, and CRM. O. Parr Rud. Wiley,
    2001.
  • Data Mining: A Hands-On Approach for Business
    Professionals. R. Groth. Prentice Hall, 1998.
  • Data Preparation for Data Mining. Dorian Pyle.
    Morgan Kaufmann, 1999.
  • Data Mining Methods for Knowledge Discovery. Cios,
    Pedrycz, and Swiniarski. Kluwer, 1998.

35
Data Mining Resources: Books (cont.)
  • Mastering Data Mining. M. Berry & G. Linoff. John
    Wiley & Sons, 2000.
  • Data Mining Techniques for Marketing, Sales and
    Customer Support. Berry & Linoff. John Wiley &
    Sons, 1997.
  • Decision Support using Data Mining. S. Anand and
    A. Buchner. Financial Times Pitman Publishing,
    1998.
  • Feature Selection for Knowledge Discovery and
    Data Mining. Liu and Motoda. Kluwer, 1998.
  • Feature Extraction, Construction and Selection: A
    Data Mining Perspective. Eds. Motoda and Liu.
    Kluwer, 1998.
  • Knowledge Acquisition from Databases. Xindong Wu.
  • Mining Very Large Databases with Parallel
    Processing. A. Freitas & S. Lavington. Kluwer,
    1998.
  • Predictive Data-Mining: A Practical Guide. Weiss &
    Indurkhya. Morgan Kaufmann, 1998.
  • Machine Learning and Data Mining: Methods and
    Applications. Michalski, Bratko, and Kubat. John
    Wiley & Sons, 1998.
  • Rough Sets and Data Mining: Analysis of Imprecise
    Data. Eds. Lin and Cercone. Kluwer.
  • Seven Methods for Transforming Corporate Data
    into Business Intelligence. Vasant Dhar and Roger
    Stein. Prentice-Hall, 1997.

36
Data Mining Resources: Journals
  • Data Mining and Knowledge Discovery Journal
  • Newsletters
    • ACM SIGKDD Explorations Newsletter
  • Related Journals
    • TKDE: IEEE Transactions on Knowledge and Data
      Engineering
    • TODS: ACM Transactions on Database Systems
    • JACM: Journal of the ACM
    • Data and Knowledge Engineering
    • JIIS: Intl. Journal of Intelligent Information
      Systems

37
Data Mining Resources: Conferences
  • KDD: ACM SIGKDD Intl. Conf. on Knowledge
    Discovery and Data Mining
  • ICDM: IEEE International Conference on Data
    Mining
  • SIAM International Conference on Data Mining
  • PKDD: European Conference on Principles and
    Practice of Knowledge Discovery in Databases
  • PAKDD: Pacific-Asia Conference on Knowledge
    Discovery and Data Mining
  • DaWaK: Intl. Conference on Data Warehousing and
    Knowledge Discovery
  • Related Conferences
    • ICML: Intl. Conf. on Machine Learning
    • IDEAL: Intl. Conf. on Intelligent Data
      Engineering and Automated Learning
    • IJCAI: International Joint Conference on
      Artificial Intelligence
    • AAAI: American Association for Artificial
      Intelligence Conference
    • SIGMOD/PODS: ACM Intl. Conference on Data
      Management
    • ICDE: International Conference on Data
      Engineering
    • VLDB: International Conference on Very Large Data
      Bases