CSE/CBS 572 Data Mining

1
CSE/CBS 572 Data Mining
  • Huan Liu, CSE, CEAS, ASU
  • http://www.public.asu.edu/huanliu/DM07S/cse572.html

2
CSE 572
  • Contents of basic and advanced topics
  • Classification, Clustering, Association, and
    Applications
  • Format: an interactive and hands-on course with
    ample opportunities to work, create, and share
  • Paper reading, discussion, project, presentation,
    or any learning activities you can suggest
  • Assessment in various ways
  • Class participation, assignments, quizzes, a
    course project, presentations, 1 or 2 exams

3
  • You: our future data mining star, a potential
    zillionaire
  • TA: Lei Tang, l.tang_at_asu.edu
  • Me: Huan Liu, huanliu_at_asu.edu
  • Where: Brickyard 566
  • When: see the course website, or by
    appointment
  • "No pain, no gain", or "As you sow, so you shall
    reap". We will also learn the principle of No
    Free Lunch.
  • MyASU will be used; make sure you won't miss
    important announcements and that your email address
    stays current (I sometimes push information to you
    via email)

4
Course Format
  • What is effective teaching of graduate data
    mining? Your feedback is keenly sought.
  • Current research papers - the main categories to
    be found on the course web site.
  • You can choose one of the textbooks listed. It is
    an entry point for you to access related
    subjects. The truth is: it is a fast-changing
    field.
  • Everyone is expected to read research papers and
    participate in class discussion.
  • Paper presentations.
  • Project presentations.
  • Presentations will also be evaluated in class.

5
Point distribution (tentative)
  • Projects (20-25)
  • Reading/presentation assignment (5)
  • Exam(s) (50)
  • Assignments (20-25), and class participation,
    quizzes (up to 10 extra credit)
  • Late penalty, YES, increasing exponentially.
  • Academic integrity (http://www.public.asu.edu/huanliu/conduct.html)

6
Research paper reading
  • We will provide a reading list and you can also
    choose your favorite
  • All are expected to search for and read the
    selected papers as part of the learning process.
  • What is it about (e.g., key idea, basic
    algorithm)?
  • What are points to discuss and improve?
  • What can we do with it?
  • What to submit? (see more on the class website)
  • A brief report that describes the above and 2
    questions suitable for quizzes/tests with
    solutions
  • A set of presentation slides for 20 minutes
  • Due date: TBA; use the digital drop box
  • Grading criteria include (1) quality of
    additional papers you select, (2) slides for
    presentation, (3) the report, and (4) oral
    presentation. Oral presentations will be selected
    from among the best submissions, and presenters
    will be given extra credit based on their
    presentation
  • Presentations can start as early as February,
    if possible.

7
Project
  • Proposal
  • A theme-based project for all in class?
  • More discussion later
  • Proposal presentation, discussion, revision
  • A project worth the effort of a semester's work
  • Progress report
  • Final report
  • Class presentation and/or demo
  • One key goal of this course is to take advantage
    of your intelligence and (limited) experience (so
    you're bold and creative) to expand your
    knowledge in creating something useful and
    interesting

8
Topic Distribution (tentative)

9
Categories of interest (including design and
implementation)
  • Data and application security
  • Data mining and privacy
  • Data reduction and selection
  • Streaming data reduction
  • Dealing with large data (column-wise, row-wise)
  • Search bias, overfitting
  • Learning algorithms
  • Ensemble methods
  • Semi-supervised learning
  • Active learning and co-training
  • Bioinformatics or others
  • A discussion board will be created, if needed

10
Your first assignment is to think
  • Think about what you want to accomplish in this
    course.
  • List 2 of your areas of interest (don't be
    restricted by the previous list; this is one
    of the rare opportunities that allow you to
    daydream and earn grade points), and briefly
    justify your choices.
  • Pick an area of interest and choose a general
    topic for paper presentation.
  • Submission is in hard copy
  • Complete the above and submit it in the 3rd class
    (Monday, January 29, 2007).

11
Your 2nd homework (project related) assignment
  • Find some (2-3) Web 2.0 or HealthCare
    applications
  • Wikipedia, some examples: facebook,
    del.icio.us, myspace, digg, ...
  • Massive data of various types in a hospital
    environment
  • Surf the Web and compile their URLs and functions
  • Choose your favorite one and explain why
  • How can data mining help, from your perspective
    and with your limited knowledge of data mining?
  • Test-drive it (manually; imagine that you have
    what you wish for, ...)
  • Write a summary about your experience, challenges
    you encountered, and suggestions if any
  • Due on Wednesday, February 7, 2007

12
Selecting a paper of your interest
  • First, continue from your 1st (and/or 2nd)
    homework assignments about your category of
    interest
  • Second, you may form presentation groups (e.g.,
    2-3 a group for discussion purposes)
  • Third, each student picks a paper from the given
    category and finds 1 additional high-quality
    relevant paper
  • Submit it through myASU
  • The first student in each category will present
    the given paper (s/he does not need to look for
    another paper)
  • TA will organize and help you and compile a list
    of all papers at the end
  • Write a summary for your selected paper, including:
  • What is it about?
  • Why is it significant and relevant?
  • Where was it published, and when?

13
Introduction
  • The need for data mining in the Internet era
  • Data everywhere, examples?
  • Data mining
  • Text mining
  • Image mining
  • Web mining (log, link, content, blog)
  • Bioinformatics
  • Many products and abundant applications
  • Where do we stand?

14
What is data mining
  • Data mining is
  • extraction of useful patterns from data sources,
    e.g., databases, texts, web, image.
  • the analysis of (often large) observational data
    sets to find unsuspected relationships and to
    summarize the data in novel ways that are both
    understandable and useful to the data owner.

15
Patterns (1)
  • Patterns are the relationships and summaries
    derived through a data mining exercise.
  • Patterns must be
  • valid
  • novel
  • potentially useful, or actionable
  • understandable

16
Patterns (2)
  • Patterns are used for
  • prediction or classification
  • describing the existing data
  • segmenting the data (e.g., the market)
  • profiling the data (e.g., your customers)
  • Detection (e.g., intrusion, fault, anomaly)

17
Data (1)
  • Data mining typically deals with data that have
    already been collected for some purpose other
    than data mining.
  • Data miners usually have no influence on data
    collection strategies.
  • Large bodies of data cause new problems:
    representation, storage, retrieval, analysis, ...

18
Data (2)
  • Even with a very large data set, we are usually
    faced with just a sample from the population.
  • Data exist in many types (continuous, nominal)
    and forms (credit card usage records, supermarket
    transactions, government statistics, text,
    images, medical records, human genome databases,
    molecular databases).

19
Typical DM tasks
  • Classification
  • mining patterns that can classify future data
    into known classes.
  • Association rule mining
  • mining any rule of the form X → Y, where X and Y
    are sets of data items.
  • Clustering
  • identifying a set of similar groups in the data

20
  • Sequential pattern mining
  • A sequential rule A → B says that event A will
    be immediately followed by event B with a certain
    confidence
  • Deviation/anomaly/exception detection
  • discovering the most significant changes in data
  • Data visualization (or visual analytics): using
    graphical methods to show patterns in data.
  • High performance computing
  • Bioinformatics

21
Why data mining
  • Rapid computerization of businesses produces huge
    amounts of data
  • How to make the best use of data to our advantage?
  • A growing realization: knowledge discovered from
    data can be used for competitive advantage and to
    increase business intelligence.
  • There are problems that might not be suitable for
    data mining: Top 10 Statistics Problems for
    CapitalOne (Bill Khan's invited talk at SIGKDD06)

22
  • Make use/sense of your data assets
  • Many interesting things you want to find cannot
    be found using database queries
  • "Find me people likely to buy my products"
  • "Who is likely to respond to my promotion?"
  • Quickly identify underlying relationships and
    respond to emerging opportunities

23
Why now and for the near future
  • The data is abundant.
  • The data is being collected or warehoused.
  • The computing power is affordable.
  • The competitive pressure is increasing.
  • Data mining tools have become available.
  • New challenges
  • New data types evolve
  • New applications emerge

24
DM fields
  • Data mining is an emerging multi-disciplinary
    field
  • Statistics
  • Machine learning
  • Databases
  • Visualization
  • OLAP and data warehousing
  • High-performance computing
  • ...

25
Summary
  • What is data mining?
  • KDD - knowledge discovery in databases:
    non-trivial extraction of implicit, previously
    unknown, and potentially useful/actionable
    information
  • Why do we need data mining?
  • Wide use of computer systems - data explosion -
    knowledge is power, but we're data rich,
    knowledge poor - useful, understandable, and
    actionable knowledge ...
  • Data mining is not plug-and-play, so we are not
    done yet and need to continue this class

26
An Overview of KDD Process (Guess which is which)

27
Web mining: an application
  • The Web is a massive database
  • Semi-structured data
  • XML and RDF
  • Web mining
  • Content
  • Structure
  • Usage
  • Link analysis
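
The slide lists link analysis without naming a method. As one illustrative possibility, the sketch below runs a PageRank-style power iteration over a tiny link graph; PageRank is swapped in here as a well-known example of link analysis, not as the technique this course prescribes. The graph `toy_web`, the damping value 0.85, and the iteration count are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's damped rank evenly among its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: page "c" is linked to by every other page.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the heavily linked page "c" ends up ranked highest, and "d", which nothing links to, ends up lowest. The scores are driven purely by link structure, not by page content.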

28
About project
  • Data collection
  • How, where, what to collect
  • Evaluation methods
  • How and why
  • Analysis and mining
  • What requires our (yet-to-develop) expertise
  • For what
  • Potentially useful methods
  • For example, graph theories, social networks
  • Great applications