Data Intelligence and Mining - PowerPoint PPT Presentation

About This Presentation
Slides: 13
Provided by: jiaw192

Transcript and Presenter's Notes

Title: Data Intelligence and Mining


1
Data Intelligence and Mining
  • CSCI 317A

2
Recap
  • Gene vs. DNA?
  • DNA vs. mRNA (Source code vs. Object code)
  • Expressed gene?
  • So what is clustering?
  • Two stipulations?
  • Hierarchical vs. Partitional
  • Applications?

3
Illustrating Document Clustering
  • Clustering Points: 3204 articles of the Los Angeles
    Times.
  • Similarity Measure: How many words are common in
    these documents (after some word filtering).
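The similarity measure above can be sketched in a few lines. This is only an illustration: the stop-word list and the three sample sentences are made up, and the slide does not say which word filtering was actually applied to the articles.

```python
import re

# Hypothetical stop-word list standing in for the slide's "word filtering".
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for"}

def content_words(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def common_word_count(doc_a, doc_b):
    """Similarity = number of distinct filtered words both documents share."""
    return len(content_words(doc_a) & content_words(doc_b))

# Made-up sample documents (not taken from the Los Angeles Times data).
doc1 = "The mayor announced a new budget for the city"
doc2 = "City budget talks stall as the mayor objects"
doc3 = "The Lakers won the game in overtime"

sim_12 = common_word_count(doc1, doc2)  # two articles on the same topic
sim_13 = common_word_count(doc1, doc3)  # unrelated articles
```

A clustering algorithm would then group articles whose pairwise counts are high; raw counts favor long documents, so a normalized variant may be preferable in practice.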

4
Clustering MicroArrays
  • Detect expression patterns among genes
  • Identify groups of possibly co-regulated genes
  • Environmental conditions that are similar
  • Identify similar classes of biological conditions
    (e.g. tumor subtypes)
  • Cancer has so far been classified according to location
  • Not very realistic
  • Compare the genetic profile of a new cancer
    patient with the available clusters of cancer
    profiles which have already been studied and
    analyzed, then give similar treatments

5
Classification
  • Supervised vs. unsupervised learning
  • Class discovery
  • Given a collection of records (training set)
  • Each record contains a set of attributes and a
    class label
  • Discrete class label
  • Find a model for the class attribute as a
    function of the values of other attributes
  • Prediction/regression from Statistics
  • Goal: previously unseen records should be
    assigned a class label as accurately as possible
  • A test set is used to determine the accuracy of
    the model
  • Usually, the given data set is divided into
    training and test sets, with training set used to
    build the model and test set used to validate it
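The train/test procedure described above might look like the following sketch. The records, the 75/25 split, and the choice of a 1-nearest-neighbor classifier are all assumptions made for illustration; the slides do not name a specific model.

```python
import random

def nearest_neighbor_predict(train, record):
    """Predict the class label of `record` from its closest training record."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda row: sq_dist(row[0], record))
    return label

# Hypothetical records: (attributes, discrete class label).
records = [((1.0, 1.2), "A"), ((0.8, 0.9), "A"), ((1.1, 0.7), "A"),
           ((0.9, 1.1), "A"), ((5.0, 5.2), "B"), ((4.8, 5.1), "B"),
           ((5.3, 4.9), "B"), ((5.1, 4.7), "B")]

rng = random.Random(0)            # fixed seed so the split is repeatable
shuffled = records[:]
rng.shuffle(shuffled)
split = int(0.75 * len(shuffled))
train_set, test_set = shuffled[:split], shuffled[split:]

# The training set builds the model; the held-out test set measures accuracy.
correct = sum(nearest_neighbor_predict(train_set, attrs) == label
              for attrs, label in test_set)
accuracy = correct / len(test_set)
```

Because the two made-up classes are well separated, the accuracy here is trivially high; on real data the test-set accuracy is the figure that indicates how the model will handle unseen records.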

6
Classification Example
  • [Figure: training set with attribute columns labeled
    categorical, categorical, continuous, and class;
    the Training Set is used to Learn Classifier]

7
Classification Application 1
  • Direct Marketing
  • Goal: Reduce the cost of mailing by targeting a set
    of consumers likely to buy a new product.
  • Approach:
  • Use the data for similar products introduced
    before
  • We know which customers decided to buy and which
    decided otherwise
  • This {buy, don't buy} decision forms the class
    attribute
  • Collect various demographic, lifestyle, and
    company-interaction related information about all
    such customers.
  • Type of business, where they stay, how much they
    earn, etc.
  • Use this information as input attributes to learn
    a classifier model to be applied on new customers

8
Netflix
  • Netflix's Recommendation System
  • http://news.com.com/2100-1026_3-6121649.html
  • http://www.netflixprize.com
  • Devise a system that is more accurate than the
    company's current recommendation system by at
    least 10 percent
  • Made available to the public: 100 million of its
    customers' movie ratings, a database the company
    says is the largest of its kind ever released
  • Recommendation systems try to predict whether a
    customer will like a movie, book or piece of
    music by comparing his or her past preferences to
    those of other people with similar tastes
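One simple way to compare a customer's past preferences with those of similar people is user-based collaborative filtering. The sketch below is an assumption-laden illustration, not Netflix's system: the users, movies, ratings, and the choice of cosine similarity are all made up for the example.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: user -> {movie: rating}.
ratings = {
    "ann":  {"Alien": 5, "Heat": 4, "Up": 1},
    "bob":  {"Alien": 5, "Heat": 5, "Up": 1, "Jaws": 5},
    "cara": {"Alien": 1, "Heat": 2, "Up": 5, "Jaws": 1},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def predict_rating(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        weight = cosine_similarity(ratings[user], their)
        num += weight * their[movie]
        den += weight
    return num / den if den else None

# ann has not rated "Jaws"; her tastes resemble bob's more than cara's.
predicted = predict_rating("ann", "Jaws")
```

The prediction is pulled toward bob's rating because his past ratings are closer to ann's. Cosine similarity on raw ratings discriminates weakly, so mean-centered ratings are a common refinement.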

9
http://mips.gsf.de/proj/funcatDB/

10
Deviation/Anomaly Detection
  • Detect objects that are significantly different
    from others
  • Detection of credit card fraud, network intrusion
    detection, etc.
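As a minimal sketch of detecting objects that differ significantly from the rest, the code below flags values using a modified z-score based on the median and the median absolute deviation. The transaction amounts are invented, and the 3.5 cut-off is a common rule of thumb rather than anything stated in the slides.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    flagged = []
    for i, v in enumerate(values):
        score = 0.6745 * abs(v - med) / mad
        if score > threshold:
            flagged.append((i, v))
    return flagged

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [42.0, 38.5, 40.1, 41.3, 39.9, 43.2, 980.0, 40.7]
suspicious = flag_outliers(amounts)
```

The function returns the flagged records instead of deleting them, so they can be reviewed by a person before any decision is made.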

11
Importance of Anomaly Detection
  • Ozone Depletion History
  • One of the most infamous outliers in recent
    history!
  • In 1985, data from the British Antarctic Survey
    showed ozone levels for Antarctica had dropped
    10% below normal levels
  • Puzzle: Why didn't the Nimbus 7 satellite, which had
    instruments aboard for recording ozone levels,
    record similarly low ozone concentrations?
  • The ozone concentrations recorded by the
    satellite were so low they were being treated as
    outliers by a computer program and discarded!
  • It had been gathering evidence of low ozone
    levels since 1976.
  • The damage to our atmosphere went undetected and
    untreated for up to nine years because outliers
    were discarded without being examined
  • Moral: Don't just toss out outliers, as they may
    be the most valuable members of a dataset

Sources: http://exploringdata.cqu.edu.au/ozone.htm
http://www.epa.gov/ozone/science/hole/size.html
[Figure label: 25M km² (about 9.65M mi²)]

12
Deviation/Anomaly Detection
  • Genes or experiments highly distinct from others
  • Gene Outliers
  • Could be (solely) responsible for
  • Vital functions inside the body and thus vital
    for life
  • Diseases
  • Experiment Outliers
  • Compare a new cancer patient against available cancer
    patient profiles (already studied); if the patient is
    an outlier, a new case analysis is required
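The experiment-outlier check can be sketched as a distance test against known cluster centroids. Everything concrete here is hypothetical: the subtype names, the three-value expression profiles, and the distance cut-off, which would need tuning on real data.

```python
from math import dist

# Hypothetical centroids of previously studied expression-profile clusters.
centroids = {
    "subtype_1": (2.0, 0.5, 1.0),
    "subtype_2": (0.2, 3.0, 2.5),
}
OUTLIER_DISTANCE = 1.5  # assumed cut-off, not taken from the slides

def triage(profile):
    """Return (nearest cluster, distance, is_outlier) for a new profile."""
    name, d = min(((n, dist(profile, c)) for n, c in centroids.items()),
                  key=lambda pair: pair[1])
    return name, d, d > OUTLIER_DISTANCE

known_like = triage((1.8, 0.7, 1.1))  # close to an existing cluster
novel = triage((6.0, 6.0, 0.1))       # far from every studied profile
```

A profile that lands near a centroid could be handled like the other members of that cluster, while one flagged as an outlier would call for a new case analysis.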