Introduction --- Part2

Slides: 29
Provided by: www2CsUh8
Learn more at: https://www2.cs.uh.edu

1
Introduction --- Part2
  1. Another Introduction to Data Mining
  2. Course Information

2
Knowledge Discovery in Data [and Data Mining]
(KDD)
Let us find something interesting!
  • Definition: KDD is the non-trivial process of
    identifying valid, novel, potentially useful, and
    ultimately understandable patterns in data
    (Fayyad)
  • Frequently, the term data mining is used to refer
    to KDD.
  • Many commercial and experimental tools and tool
    suites are available (see
    http://www.kdnuggets.com/siftware.html)
  • The field is dominated more by industry than by
    research institutions

3
Motivation: "Necessity is the Mother of
Invention"
  • Data explosion problem
  • Automated data collection tools and mature
    database technology lead to tremendous amounts of
    data stored in databases, data warehouses and
    other information repositories
  • We are drowning in data, but starving for
    knowledge!
  • Solution: data warehousing and data mining
  • Data warehousing and on-line analytical
    processing (analyzing and mining the raw data
    rarely works); idea: mine summarized, aggregated
    data
  • Extraction of interesting knowledge (rules,
    regularities, patterns, constraints) from data
    collections

4
YAHOO!'s View of Data Mining
What's New?
What's Interesting?
Predict for me
http://www.sigkdd.org/kdd2008/
5
Data Mining: A KDD Process
  • Data mining: the core of the knowledge discovery
    process.

[Figure: KDD process flow, from bottom to top:
Databases → Data Cleaning / Data Integration →
Data Warehouse → Selection → Task-relevant Data →
Data Mining → Pattern Evaluation → Knowledge]
6
Steps of a KDD Process
  • Learning the application domain: relevant prior
    knowledge and goals of the application
  • Creating a target data set: data selection
  • Data cleaning and preprocessing
  • Data reduction and transformation (the first 4
    steps may take 75% of the effort!): find useful
    features, dimensionality/variable reduction,
    invariant representation
  • Choosing functions of data mining:
    summarization, classification, regression,
    association, clustering
  • Choosing the mining algorithm(s)
  • Data mining: search for patterns of interest
  • Pattern evaluation and knowledge presentation:
    visualization, transformation, removing redundant
    patterns, etc.
  • Use of discovered knowledge
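
The steps listed on this slide can be read as a chain of small functions. The following is only a minimal sketch in Python under stated assumptions: the records, the field names (region, spend), the function names and the threshold are all hypothetical and are not taken from the course material. Summarization (mean per group) stands in for the mining step.

```python
from statistics import mean

# Hypothetical raw records (assumption); None marks a missing value.
raw = [
    {"customer": "a", "region": "north", "spend": 120.0},
    {"customer": "b", "region": "north", "spend": None},
    {"customer": "c", "region": "south", "spend": 80.0},
    {"customer": "d", "region": "south", "spend": 100.0},
    {"customer": "e", "region": "east", "spend": 300.0},
]

def select(records, fields):
    """Create the target data set: keep only task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Data reduction: group the spend values by region."""
    groups = {}
    for r in records:
        groups.setdefault(r["region"], []).append(r["spend"])
    return groups

def mine(groups):
    """Data mining (summarization): mean spend per region."""
    return {region: mean(values) for region, values in groups.items()}

def evaluate(patterns, threshold):
    """Pattern evaluation: keep regions whose mean spend exceeds threshold."""
    return {k: v for k, v in patterns.items() if v > threshold}

def kdd(records, threshold=100.0):
    """Run the hypothetical steps in order."""
    target = select(records, ["region", "spend"])
    return evaluate(mine(transform(clean(target))), threshold)

result = kdd(raw)
print(result)
```

In a real project each function would be far larger, and the preparation steps (selection, cleaning, transformation) would dominate the effort, as the slide notes.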

7
Data Mining and Business Intelligence
[Figure: pyramid of layers, with increasing
potential to support business decisions toward
the top. Layers from top to bottom:
Making Decisions;
Data Presentation (Visualization Techniques);
Data Mining (Information Discovery);
Data Exploration (Statistical Analysis, Querying
and Reporting);
Data Warehouses / Data Marts (OLAP, MDA);
Data Sources (Paper, Files, Information
Providers, Database Systems, OLTP).
Roles shown alongside, from top to bottom:
End User, Business Analyst, Data Analyst, DBA]
8
Are All the Discovered Patterns Interesting?
  • A data mining system/query may generate thousands
    of patterns; not all of them are interesting.
  • Suggested approach: human-centered, query-based,
    focused mining
  • Interestingness measures: a pattern is
    interesting if it is easily understood by humans,
    valid on new or test data with some degree of
    certainty, potentially useful, novel, or
    validates some hypothesis that a user seeks to
    confirm
  • Objective vs. subjective interestingness
    measures
  • Objective: based on statistics and structures of
    patterns, e.g., support, confidence, etc.
  • Subjective: based on the user's belief in the
    data, e.g., unexpectedness, novelty,
    actionability, etc.
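
Support and confidence, the two objective measures named above, can be computed directly from a list of transactions. The sketch below uses the standard definitions: support is the fraction of transactions that contain an itemset, and confidence of a rule A => B is support(A and B) divided by support(A). The basket data and function names are illustrative assumptions, not part of the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0

# Hypothetical market-basket data (assumption).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

s = support(baskets, {"diapers", "beer"})        # 2 of 4 baskets
c = confidence(baskets, {"diapers"}, {"beer"})   # 2 of the 3 diaper baskets
print(s, c)
```

A rule can score well on both measures and still be uninteresting to the user, which is why the slide pairs these with subjective measures.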

9
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by
Machine Learning, Statistics, Pattern
Recognition, Visualization, Applications,
Algorithm, Database Technology, and
High-Performance Computing]
10
KDD Process: A Typical View from ML and Statistics
[Figure: Input Data → Data Pre-Processing →
Data Mining → Post-Processing → Pattern /
Information / Knowledge. Tasks listed with the
figure: Association Analysis, Classification,
Clustering, Outlier Analysis, Summary Generation]
  • This is a view from typical machine learning and
    statistics communities

11
Data Mining Competitions
  • Netflix Prize: http://www.netflixprize.com//index
  • KDD Cup 2009: http://www.kddcup-orange.com/
  • KDD Cup 2011:
    http://www.kdd.org/kdd2011/kddcup.shtml

12
Summary
  • Data mining: discovering interesting patterns
    from large amounts of data
  • A natural evolution of database technology, in
    great demand, with wide applications
  • A KDD process includes data cleaning, data
    integration, data selection, transformation, data
    mining, pattern evaluation, and knowledge
    presentation
  • Mining can be performed in a variety of
    information repositories
  • Data mining functionalities: characterization,
    discrimination, association, classification,
    clustering, outlier and trend analysis, etc.
  • Classification of data mining systems

13
COSC 6335 in a Nutshell
[Figure: course overview in three stages:
Preprocessing → Data Mining → Post Processing.
Data Mining: Association Analysis, Clustering,
Summarization, Classification, Prediction.
Post Processing: Pattern Evaluation,
Visualization]
14
Prerequisites
  • The course is basically self-contained; however,
    the following skills are important to be
    successful in taking this course:
  • Basic knowledge of programming
  • Java or a language of your own choice, together
    with data mining tools, will be used in the
    programming projects; basic knowledge of Java is
    sufficient!
  • Basic knowledge of statistics
  • Basic knowledge of data structures

15
Course Objectives
Students who complete this course
  • will know what the goals and objectives of data
    mining are
  • will have a basic understanding of how to conduct
    a data mining project
  • will obtain practical experience in data analysis
    and making sense out of data
  • will have sound knowledge of popular
    classification techniques, such as decision
    trees, support vector machines and
    nearest-neighbor approaches.
  • will know the most important association analysis
    techniques
  • will have detailed knowledge of popular
    clustering algorithms, such as K-means, DBSCAN,
    grid-based, hierarchical and supervised
    clustering.
  • will have some knowledge of R, an open source
    statistics/data mining environment
  • will obtain practical experience in designing
    data mining algorithms and in applying data
    mining techniques to real world data sets
  • will have some exposure to more advanced topics,
    such as sequence mining, spatial data mining, and
    web page ranking algorithms
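
As a taste of the clustering algorithms named in these objectives, here is a minimal K-means sketch in plain Python. It is an illustration under stated assumptions, not the course's implementation: the first k points are used as initial centroids (real implementations usually pick them randomly), the number of iterations is fixed, and the sample points are made up.

```python
def kmeans(points, k, iterations=10):
    """Minimal K-means: alternate assignment and centroid update."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment

# Hypothetical 2-D points forming two well-separated groups (assumption).
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

The result depends on the initial centroids, which is one reason K-means is usually run several times with different starting points.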

16
Data Mining Course Organization
  • I Introduction to Data Mining and Data Mining
    Basics (Chapter 1 and 2.1)
  • II Exploratory Data Analysis (Chapter 3) moved!
  • III Introduction to Classification --- Basic
    Concepts and Decision Trees (Chapter 4)
  • IV Introduction to Similarity Assessment and
    Clustering (Other material 2.3 and Chapter 8 in
    part)
  • V Introduction to Data Cubes (Section 3.4) moved!
  • VI Association Analysis (Chapter 6)
  • VII Spatial Data Mining
  • VIII More on Classification: Regression,
    Instance-based Learning and Support Vector
    Machines (Chapter 5)
  • IX Data Preprocessing, Data Cubes, and Data
    Warehouses (Chapter 2 and l)
  • X More on Clustering (Chapter 8 and Chapter 9 in
    part)
  • XI Sequence and Graph Mining (Chapter 7 in part)
  • XI PageRank and other Top 10 Data Mining
    Algorithms (Journal Paper)
  • XII Final Words

17
Order of Coverage
  • Introduction → Exploratory Data Analysis →
    Similarity Assessment → Clustering → Association
    Analysis → Classification → Spatial Data Mining →
    More on Classification → OLAP and Data Warehousing
    → Preprocessing → More on Clustering → Sequence
    and Graph Mining → Top 10 Data Mining Algorithms →
    Summary
  • Also: an introductory tutorial on R (2-3
    classes)

18
In particular, R will be used for most course
projects; the exception is the third project,
which will use spatial clustering algorithms that
are part of Cougar2. The bad news is that it is
more challenging to get started with R (compared
to Weka---but Weka is a "dead" language), although
you should be okay after you have used R for some
weeks. The good news about R is that it continues
to grow quickly in popularity. A recent poll at
KDnuggets found that 34% of respondents do at
least half of their data mining in R. Although it
is a domain-specific language, it is versatile.
As we have not used R in the course before, we
expect some startup problems and ask for your
patience; on the positive side, knowing R will be
a plus when conducting research projects and when
looking for jobs after you graduate, due to R's
completeness and rising popularity.
19
Where to Find References?
  • Data mining and KDD
  • Conference proceedings: ICDM, KDD, PKDD, PAKDD,
    SDM, ADMA, etc.
  • Journal: Data Mining and Knowledge Discovery
  • Database field (SIGMOD member CD ROM)
  • Conference proceedings: VLDB, ICDE, ACM-SIGMOD,
    CIKM
  • Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
  • AI and Machine Learning
  • Conference proceedings: ICML, AAAI, IJCAI, ECML,
    etc.
  • Journals: Machine Learning, Artificial
    Intelligence, etc.
  • Statistics
  • Conference proceedings: Joint Stat. Meeting, etc.
  • Journals: Annals of Statistics, etc.
  • Visualization
  • Conference proceedings: CHI, etc.
  • Journals: IEEE Trans. Visualization and Computer
    Graphics, etc.

20
Textbooks
Required Text: P.-N. Tan, M. Steinbach, and V.
Kumar, Introduction to Data Mining, Addison
Wesley. Link to Book HomePage
Mildly Recommended Text: Jiawei Han and Micheline
Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, second edition. Link
to Data Mining Book Home Page
21
Tentative Schedule for
  • Exams: October 25, December 6
  • Reviews
  • Plan: First Half of the Fall 2011 Semester
  • Aug. 23, 25: Introduction to DM
  • August 30: Exploratory Data Analysis (Dr. Chen)
  • September 1, 22: Lab (Zechun Cao)
  • September 6, 8, 15, 20: Clustering I
  • September 27, 29, Oct. 4: Association Analysis
  • October 6, 11, 13: Classification and Prediction
  • October 18, 20: Spatial Data Mining
  • October 27, Nov. 1: More on Classification and
    Prediction
  • October 25: Midterm Exam

22
2011 Course Projects
Project 1: Exploratory Data Analysis
Project 2: Traditional Clustering with K-means and
DBSCAN
Project 3: Spatial Clustering with CLEVER
Project 4: Group Project (different topics, no
programming)
Project 5: TBDL (something with SVMs and/or
regression)

23
TA/Students of my Research Group
  • Duties
  • Grading of programming projects, homework, and
    exams (in part)
  • Run 2/3 labs
  • Help students with homework, programming projects
    and problems with the course material
  • Teach a class (two to three times)
  • Office
  • Office Hours
  • E-mail
  • Meet our TA Thursday

24
Web
  • Course Webpage
    (http://www2.cs.uh.edu/ceick/DM/DM11.html)
  • UH-DMML Webpage
    (http://www2.cs.uh.edu/UH-DMML/index.html)

25
Where to Find References? DBLP, CiteSeer, Google
  • Data mining and KDD (SIGKDD CDROM)
  • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM,
    PKDD, PAKDD, etc.
  • Journals: Data Mining and Knowledge Discovery,
    KDD Explorations, ACM TKDD
  • Database systems (SIGMOD: ACM SIGMOD Anthology
    CD ROM)
  • Conferences: ACM-SIGMOD, ACM-PODS, VLDB,
    IEEE-ICDE, EDBT, ICDT, DASFAA
  • Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM,
    VLDB J., Info. Sys., etc.
  • AI and Machine Learning
  • Conferences: Machine learning (ML), AAAI, IJCAI,
    COLT (Learning Theory), CVPR, NIPS, etc.
  • Journals: Machine Learning, Artificial
    Intelligence, Knowledge and Information Systems,
    IEEE-PAMI, etc.
  • Web and IR
  • Conferences: SIGIR, WWW, CIKM, etc.
  • Journals: WWW: Internet and Web Information
    Systems
  • Statistics
  • Conferences: Joint Stat. Meeting, etc.
  • Journals: Annals of Statistics, etc.
  • Visualization
  • Conference proceedings: CHI, ACM-SIGGraph, etc.
  • Journals: IEEE Trans. Visualization and Computer
    Graphics, etc.

26
Teaching Philosophy and Advice
  • The first 8 weeks will give a basic introduction
    to data mining and will follow the textbook
    somewhat closely.
  • Read the sections of the textbook before you come
    to the lecture; if you work continuously for the
    class, you will do better and lectures will be
    more enjoyable. Starting to review the material
    that is covered in this class 1 week before the
    next exam is not a good idea.
  • Do not be afraid to ask questions! I really like
    interactions with students in the lectures. If
    you do not understand something at all, send me
    an e-mail before the next lecture!
  • If you have a serious problem, talk to me before
    the problem gets out of hand.

28
Course Planning for Research in Data Mining
  • This course: Data Mining
  • I also suggest taking at least one, preferably
    two, of the following courses: Pattern
    Classification (COSC 6343), Artificial
    Intelligence (COSC 6368), and Machine Learning
    (COSC 6342).
  • Moreover, basic knowledge of data structures,
    software design, and databases is important when
    conducting data mining projects; therefore,
    taking COSC 6320, COSC 6318 and COSC 6340 is a
    good choice.
  • Taking a course that teaches high performance
    computing is also a good choice, because data
    mining algorithms are very time consuming.
  • Because a lot of data mining projects have to
    deal with images, I suggest taking at least one
    of the many biomedical image processing courses
    that are offered in our curriculum.
  • Finally, knowledge of evolutionary computing,
    data visualization, statistics, solving
    optimization problems, and GIS (geographical
    information systems) is a plus!