1
Special Topics in Database Systems
Martin Ester
Simon Fraser University, School of Computing Science
CMPT 884, Spring 2009
2
Introduction
[Fayyad, Piatetsky-Shapiro & Smyth 96]
  • Knowledge discovery in databases (KDD) is the
    process of (semi-)automatic extraction of
    knowledge from databases which is
    o valid,
    o previously unknown,
    o and potentially useful.
  • Remarks
    o (semi-)automatic: in contrast to manual
      analysis / OLAP. Typically, some user
      interaction is necessary.
    o valid: in the statistical sense.
    o previously unknown: not explicit, not common-sense
      knowledge.
    o potentially useful: for some given application.

3
Introduction
  • Statistics [Hand, Mannila & Smyth 2001]
    o representation of uncertainty
    o model-based inferences
    o focus on numeric data
  • Machine Learning [Mitchell 1997]
    o knowledge representation
    o search strategies
    o focus on symbolic data
  • Database Systems [Han & Kamber 2000]
    o data management
    o integration of data mining with DBS
    o scalability for large databases

4
Introduction
KDD Process [Han & Kamber 2000]
[Figure: KDD process diagram, starting from databases]
KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
[Figure: Database → Focussing → Pre-processing →
Transformation → Data Mining → Patterns → Evaluation →
Knowledge]
5
Data Mining
  • Definition [Fayyad, Piatetsky-Shapiro & Smyth
    1996]
  • Data Mining is the application of efficient
    algorithms to determine the patterns contained in
    some database.
  • Data-Mining Tasks
    o clustering
    o classification
    o association rules, e.g. A and B → C
    o generalisation
    o other tasks: regression, outlier detection, . . .
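An association rule such as "A and B → C" is usually judged by its support and confidence. The following is a minimal sketch of those two measures on a toy transaction list; the function names and the data are illustrative assumptions, not taken from the slides.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = support(set(antecedent) | set(consequent), transactions)
    ante = support(antecedent, transactions)
    return both / ante if ante > 0 else 0.0


# Toy market-basket data (hypothetical).
transactions = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "B", "C"}),
    frozenset({"B", "C"}),
    frozenset({"A", "C"}),
]

# Rule "A and B -> C": 2 of 5 transactions contain A, B and C,
# and 3 of 5 contain A and B.
rule_support = support({"A", "B", "C"}, transactions)
rule_confidence = confidence({"A", "B"}, {"C"}, transactions)
```

A rule is typically reported only if both values exceed user-chosen thresholds.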
6
Trends in KDD Research
  • KDD 2000 Conference
  • New Data Mining Algorithms
  • Efficiency and Scalability of Data Mining
    Algorithms
  • Interactive Data Exploration
  • Visualization
  • Constraints and Evaluation in the KDD Process

7
Trends in KDD Research
  • KDD 2002 Conference
  • Statistical Methods
  • Frequent Patterns
  • Streams and Time Series
  • Visualization
  • Web Search and Navigation
  • Text and Web Page Classification
  • Intrusion and Privacy
  • Applications

8
Trends in KDD Research
  • KDD 2004 Conference
  • Frequent Patterns / Association Rules
  • Clustering
  • Mining Spatio-Temporal Data
  • Mining Data Streams
  • Dimensionality Reduction
  • Privacy-Preserving Data Mining
  • Mining Biological Data
  • Applications (Web, biological data, security,
    . . .)

9
Trends in KDD Research
  • KDD 2006 Conference
  • Clustering
  • Classification / supervised ML
  • Privacy
  • Web / Graph Mining
  • Web / Text Mining
  • Frequent Pattern Mining
  • Structured Data

10
Trends in KDD Research
  • KDD 2008 Conference
  • Text Mining
  • Data Integration
  • Social Networks
  • Graph Mining
  • Distance Functions and Metric Learning
  • Active and Semi-supervised Learning
  • Pattern Mining
  • Collaborative Filtering

11
Trends in KDD Research
  • Some Hot Topics
  • Social networks: THE hot topic of KDD 08 →
    topic of the only panel
  • Graph mining
  • Text mining and information extraction /
    integration
  • Collaborative filtering, or more generally
    recommender systems → $1M Netflix Prize

12
Overview of this Course
  • Prerequisites
  • Foundations of database systems and statistics
  • Introductory graduate data mining course or
    equivalent
  • Objectives
  • Introduction to some hot topics of data mining
    research
  • Training in research methodology
  • Presentation skills
  • start thesis work after this class!

13
Overview of this Course
  • Topics
  • Graph mining: social network analysis and
    analysis of biological networks as driving
    applications
  • Recommender systems: in particular trust-based
    recommendation
  • Information extraction and integration:
    integration with existing databases

14
Overview of this Course
  • Format
  • Tutorial surveys by instructor
  • Written research paper reviews by students
  • Research paper presentations by
    students, with discussions in class
  • Course research projects by students on a
    topic of their choice

15
Overview of this Course
  • Tentative Grading Scheme
  • Paper review (20%)
  • Paper presentation (20%)
  • Course project report (40%), in two steps:
    project proposal, final project report
  • Course project presentation (20%)
  • → marking criteria: originality, technical
    quality, presentation

16
Overview of this Course
  • Types of Course Projects
  • Literature survey: summarize the
    state-of-the-art and identify open research
    problems
  • New problem: introduce and analyze a new problem
  • New algorithm for known problem: implement and
    evaluate algorithm
  • Improvement of existing algorithm: implement and
    compare algorithm
  • Comparison of existing algorithms on a new,
    interesting dataset: identify criteria for choice
    of algorithms / open research problems

17
Graph Mining
  • Motivating Applications
  • Social network analysis
    o What communities exist?
    o How does information about a new product spread?
    o What customers should be targeted to maximize the
      profit of a marketing campaign?
  • Analysis of biological networks
    o What are the functional modules of an organism?
    o How do biological networks evolve in the course
      of time?
    o What protein should be targeted to inhibit some
      virulent bacteria?

18
Graph Mining
  • Methods
  • Frequent subgraph mining: frequent pattern
    mining approach
  • Graph clustering: e.g., normalized cut, i.e.
    minimize the number of edges between graph
    components / clusters, normalized by the size
    of the components
  • Graph generative models: probabilistic models
    that generate graphs similar to real graphs /
    networks
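To make the graph clustering criterion concrete, here is a minimal sketch that evaluates the normalized cut of a given two-way partition of an unweighted, undirected graph. It uses the common definition cut(A,B)/vol(A) + cut(A,B)/vol(B), where vol is the sum of node degrees in a component. It only scores a partition and does not search for the best one; the function name and the example graph are assumptions for illustration.

```python
def normalized_cut(edges, part_a):
    """Normalized cut of the partition (part_a, rest) of an undirected graph.

    edges  : iterable of (u, v) pairs, each undirected edge listed once
    part_a : nodes on one side of the partition
    """
    part_a = set(part_a)
    cut = 0      # number of edges crossing the partition
    vol_a = 0    # sum of degrees of nodes in part_a
    vol_b = 0    # sum of degrees of the remaining nodes
    for u, v in edges:
        for node in (u, v):
            if node in part_a:
                vol_a += 1
            else:
                vol_b += 1
        if (u in part_a) != (v in part_a):
            cut += 1
    if vol_a == 0 or vol_b == 0:
        raise ValueError("both sides of the partition need at least one edge endpoint")
    return cut / vol_a + cut / vol_b


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

good_split = normalized_cut(edges, {0, 1, 2})  # cuts only the bridge edge
bad_split = normalized_cut(edges, {0, 1})      # cuts through a triangle
```

The split along the bridge edge scores lower than the split through a triangle, which matches the intuition of few edges between clusters.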

19
Graph Mining
  • Challenges
  • Complexity of graph algorithms
    o Many graph mining problems are NP-hard.
    o Real graphs tend to be extremely large.
    → need efficient algorithms
  • Attribute data
    o Many graphs have attributes associated with the
      nodes.
    o Transformation into a weighted graph loses a lot
      of information.
    → need new models / algorithms considering
      relationship and attribute data

20
Recommender Systems
  • Motivating Applications
  • Motivation
    o The internet provides a flood of
      information on all kinds of items.
    o There is a great need for personalized
      recommendations.
    o The internet also provides a wealth of item
      ratings / reviews.
  • Typical applications
    o Movie recommendation
    o Product recommendation
    o Keyword recommendation

21
Recommender Systems
  • Methods
  • Collaborative filtering
    o Uses only a database of user-item ratings.
    o Recommendation based on ratings by users with
      similar rating patterns.
  • Content-based recommender systems
    o Use information about the content of items
      and / or the properties of users.
    o Recommend items that have content similar to
      items liked by the user.
  • Trust-based recommender systems
    o Assume a social network / trust network. Trust
      can be defined explicitly or implicitly.
    o Recommendation based on ratings by trusted
      neighbors.
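The collaborative filtering idea above can be sketched as a memory-based, user-based predictor. This is one common variant, assumed here for illustration: Pearson correlation over co-rated items as the similarity, only positively correlated users as neighbors, and a mean-centered weighted average as the prediction. User names, items and ratings are hypothetical.

```python
from math import sqrt


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(ratings_u, ratings_v):
    """Pearson correlation of two users over the items both have rated."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    mean_u = mean(ratings_u[i] for i in common)
    mean_v = mean(ratings_v[i] for i in common)
    dev_u = [ratings_u[i] - mean_u for i in common]
    dev_v = [ratings_v[i] - mean_v for i in common]
    denom = sqrt(sum(x * x for x in dev_u)) * sqrt(sum(y * y for y in dev_v))
    if denom == 0:
        return 0.0
    return sum(x * y for x, y in zip(dev_u, dev_v)) / denom


def predict(ratings, user, item):
    """Predict user's rating of item from users with similar rating patterns."""
    user_mean = mean(ratings[user].values())
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = pearson(ratings[user], other_ratings)
        if sim <= 0:
            continue  # keep only users with similar rating patterns
        num += sim * (other_ratings[item] - mean(other_ratings.values()))
        den += sim
    if den == 0:
        return user_mean  # no similar user has rated the item
    return user_mean + num / den


ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob": {"m1": 5, "m2": 5, "m3": 1, "m4": 5},
    "carol": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}
predicted = predict(ratings, "alice", "m4")
```

Here only bob counts as a neighbor of alice, since carol's ratings are negatively correlated with hers, so the prediction for m4 lands above alice's mean rating.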

22
Recommender Systems
  • Challenges
  • High dimensionality and sparsity of data
    o The overwhelming majority (> 99%) of user-item
      ratings are unknown.
    o Recommendation is especially hard for cold-start
      users and controversial items.
    → dimensionality reduction, model-based methods,
      trust-based approach
  • Fraud
    o Memory-based collaborative filtering can be
      easily manipulated by adding fraudulent
      ratings.
    → trust-based approach is more robust to fraud
  • Privacy issues with trust network data
    o Only very few trust networks are in the public
      domain.
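The trust-based approach named above as a remedy for cold-start users can be sketched as follows. This is a simplified illustration under stated assumptions, not a specific published method: ratings are collected from users reachable over trust edges within a fixed depth, and each rating is weighted by 1/distance. The weighting scheme, the function name and the data are hypothetical.

```python
from collections import deque


def trust_based_prediction(trust, ratings, user, item, max_depth=2):
    """Weighted mean of ratings for `item` by users that `user` trusts.

    trust   : dict mapping a user to the list of users they trust
    ratings : dict mapping a user to a dict {item: rating}
    Users at trust distance d contribute with weight 1/d (assumed scheme).
    Returns None if no trusted user within max_depth has rated the item.
    """
    seen = {user}
    queue = deque([(user, 0)])
    num = den = 0.0
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the maximum trust distance
        for neighbor in trust.get(current, ()):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            distance = depth + 1
            rating = ratings.get(neighbor, {}).get(item)
            if rating is not None:
                weight = 1.0 / distance
                num += weight * rating
                den += weight
            queue.append((neighbor, distance))
    return num / den if den else None


# "newbie" is a cold-start user: no ratings, but two trust statements.
trust = {"newbie": ["ann", "ben"], "ann": ["cat"], "ben": []}
ratings = {"ann": {"m1": 4}, "ben": {}, "cat": {"m1": 2}}

direct_only = trust_based_prediction(trust, ratings, "newbie", "m1", max_depth=1)
two_hops = trust_based_prediction(trust, ratings, "newbie", "m1", max_depth=2)
```

Because only users reachable over trust edges contribute, a fraudulent profile influences the prediction only if someone in the user's trust neighborhood trusts it.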

23
Information Extraction and Integration
  • Motivating Applications
  • Importance of unstructured text data
    o The overwhelming majority (> 80%) of
      human-generated information is not in
      structured form, but in unstructured text.
  • Biomedical literature
    o Contains a wealth of valuable information that
      cannot be processed / searched
      automatically.
    o Extraction of entities and relationships such
      as proteins and their localizations.
  • Online product reviews
    o A lot of product reviews are available online in
      community databases or blogs.
    o Companies want to know what customers think of
      their products.

24
Information Extraction and Integration
  • Methods
  • Basic NLP methods
    o Part-of-speech tagging
    o Lexica, ontologies, . . .
  • Machine learning methods
    o Typically, supervised classification.
    o CRFs and similar methods are state-of-the-art.
  • Bootstrapping approach
    o Using a small labeled training dataset, find
      textual extraction patterns.
    o Using these patterns, extract further entities
      / relationships and continue.
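The bootstrapping loop above can be sketched in a few lines. This is a deliberately naive illustration: a "pattern" is simply the literal text between the two entities of a known pair, and every match is accepted. Practical systems would score patterns and extractions to limit drift. The sentences, protein names and function name are hypothetical.

```python
import re


def bootstrap(sentences, seeds, rounds=2):
    """Alternate between learning patterns from known pairs and
    extracting new pairs with the learned patterns."""
    known = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Step 1: learn patterns, i.e. the text between a known pair.
        for sentence in sentences:
            for first, second in known:
                match = re.search(
                    re.escape(first) + r"(.+?)" + re.escape(second), sentence
                )
                if match:
                    patterns.add(match.group(1))
        # Step 2: apply the patterns to extract further pairs.
        for sentence in sentences:
            for middle in patterns:
                match = re.search(
                    r"(\w+)" + re.escape(middle) + r"(\w+)", sentence
                )
                if match:
                    known.add((match.group(1), match.group(2)))
    return known, patterns


sentences = [
    "ProtA is localized in the nucleus",
    "ProtB is localized in the membrane",
    "ProtB is found in the membrane",
    "ProtC is found in the cytoplasm",
]
seeds = {("ProtA", "nucleus")}  # small labeled training set
pairs, patterns = bootstrap(sentences, seeds)
```

In the first round the seed yields the pattern " is localized in the ", which extracts the ProtB pair; in the second round that pair yields " is found in the ", which extracts the ProtC pair.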

25
Information Extraction and Integration
  • Challenges
  • Text data is hard to understand
    o Many of the NLP problems are still essentially
      unsolved.
    → relatively simple NLP methods are often
      sufficient for information extraction
  • Portability across domains
    o Extraction methods need to be portable from
      one domain to another.
    o Knowledge engineering approach (domain expert
      defines rules) is labor-intensive and
      expensive.
    → machine learning methods
  • Entity mentions need to be resolved
    o Information extraction produces strings
      referencing an entity of a given type.
    o Without mapping to known real-world entities,
      extracted information is of limited
      usefulness.
    → need to integrate extracted
      information with existing databases
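Resolving extracted mention strings against an existing database can be sketched as a lookup over normalized aliases with a fuzzy fallback. This is a minimal sketch under assumptions: the database identifiers are hypothetical, the alias table is hand-made, and the similarity cutoff of 0.8 is an arbitrary choice. The fuzzy step uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib


def normalize(mention):
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in mention.lower() if ch.isalnum())


def resolve(mention, alias_to_id, cutoff=0.8):
    """Map an extracted mention to a database identifier, or None."""
    key = normalize(mention)
    if key in alias_to_id:
        return alias_to_id[key]  # exact match after normalization
    close = difflib.get_close_matches(key, list(alias_to_id), n=1, cutoff=cutoff)
    return alias_to_id[close[0]] if close else None


# Hypothetical database: identifier -> known aliases.
database = {
    "DB:0001": ["p53", "TP53", "tumor protein 53"],
    "DB:0002": ["BRCA1"],
}
alias_to_id = {
    normalize(alias): entity_id
    for entity_id, aliases in database.items()
    for alias in aliases
}

exact = resolve("TP-53", alias_to_id)              # matches after normalization
fuzzy = resolve("Tumour protein 53", alias_to_id)  # matches via string similarity
unknown = resolve("XYZ9", alias_to_id)             # no sufficiently close alias
```

Mentions that resolve to the same identifier can then be merged, and unresolved mentions can be flagged for manual review or treated as candidate new entities.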

26
References
  • Graph mining
    o X. Yan and Karsten Borgwardt, "Graph Mining and
      Graph Kernels", Tutorial, KDD 08
    o Jure Leskovec and Christos Faloutsos, "Mining
      Large Graphs: Models, Diffusion and Case
      Studies", Tutorial, ECML/PKDD 2007
  • Recommender systems
    o Joseph Konstan, "Introduction to Recommender
      Systems", Tutorial, SIGMOD 2008
  • Information extraction and integration
    o Eugene Agichtein and Sunita Sarawagi, "Scalable
      Information Extraction and Integration",
      Tutorial, KDD 06
    o AnHai Doan, Raghu Ramakrishnan and Shiv
      Vaithyanathan, "Managing Information
      Extraction", Tutorial, SIGMOD 2006