1
Special Topics in Database Systems
Martin Ester, Simon Fraser University, School of Computing Science
CMPT 884, Spring 2009
2
Introduction
[Fayyad, Piatetsky-Shapiro & Smyth 96]

Knowledge discovery in databases (KDD) is the process of (semi-)automatic
extraction of knowledge from databases which is
o valid,
o previously unknown,
o and potentially useful.
Remarks
o (semi-)automatic: distinction from manual analysis / OLAP. Typically, some
user interaction is necessary.
o valid: in the statistical sense.
o previously unknown: not explicit, no common-sense knowledge.
o potentially useful: for some given application.

3
Introduction

Statistics [Hand, Mannila & Smyth 2001]
o representation of uncertainty
o model-based inferences
o focus on numeric data
Machine Learning [Mitchell 1997]
o knowledge representation
o search strategies
o focus on symbolic data
Database Systems [Han & Kamber 2000]
o data management
o integration of data mining with DBS
o scalability for large databases

4
Introduction
KDD Process [Han & Kamber 2000]
[Figure: KDD process, starting from databases]
KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
[Figure: Database → Focussing → Pre-processing → Transformation → Data Mining → Pattern → Evaluation → Knowledge]
5
Data Mining

Definition [Fayyad, Piatetsky-Shapiro & Smyth 1996]
Data Mining is the application of efficient
algorithms to determine the patterns contained in
some database.
Data-Mining Tasks

o clustering
o classification
o association rules (e.g., A and B → C)
o generalisation
o other tasks: regression, outlier detection, ...
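
To make the association rule task concrete, here is a minimal sketch of the two standard quality measures for a rule such as A and B → C: support (fraction of transactions containing all items of the rule) and confidence (fraction of transactions containing the antecedent that also contain the consequent). The toy transactions and the function names are assumptions for illustration, not part of the original slides.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    ante = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante if ante > 0 else 0.0


# Hypothetical toy database of five transactions.
transactions = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B"}),
]

# Rule "A and B -> C": 2 of 5 transactions contain A, B and C,
# and 3 of 5 contain A and B.
rule_support = support(transactions, {"A", "B", "C"})
rule_confidence = confidence(transactions, {"A", "B"}, {"C"})
```

Algorithms such as Apriori avoid this brute-force counting on large databases; the sketch only shows what is being measured.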
6
Trends in KDD Research

KDD 2000 Conference
New Data Mining Algorithms
Efficiency and Scalability of Data Mining
Algorithms
Interactive Data Exploration
Visualization
Constraints and Evaluation in the KDD Process

7
Trends in KDD Research

KDD 2002 Conference
Statistical Methods
Frequent Patterns
Streams and Time Series
Visualization
Web Search and Navigation
Text and Web Page Classification
Intrusion and Privacy
Applications

8
Trends in KDD Research

KDD 2004 Conference
Frequent Patterns / Association Rules
Clustering
Mining Spatio-Temporal Data
Mining Data Streams
Dimensionality Reduction
Privacy-Preserving Data Mining
Mining Biological Data
Applications (Web, biological data, security, ...)

9
Trends in KDD Research

KDD 2006 Conference
Clustering
Classification / supervised ML
Privacy
Web / Graph Mining
Web / Text Mining
Frequent Pattern Mining
Structured Data

10
Trends in KDD Research

KDD 2008 Conference
Text Mining
Data Integration
Social Networks
Graph Mining
Distance Functions and Metric Learning
Active and Semi-supervised Learning
Pattern Mining
Collaborative Filtering

11
Trends in KDD Research

Some Hot Topics
Social Networks: THE hot topic of KDD 08
→ topic of the only panel
Graph mining
Text mining and information extraction / integration
Collaborative Filtering: more generally, recommender systems
→ $1M Netflix Prize

12
Overview of this Course

Prerequisites
Foundations of database systems and statistics
Introductory graduate data mining course or
equivalent
Objectives
Introduction to some hot topics of data mining
research
Training in research methodology
Presentation skills
start thesis work after this class!

13
Overview of this Course

Topics
Graph mining: social network analysis and analysis of biological networks as
driving applications
Recommender systems: in particular, trust-based recommendation
Information extraction and integration: integration with existing databases

14
Overview of this Course

Format
Tutorial surveys by instructor
Written research paper reviews by students
Research paper presentations by students, discussions in class
Course research projects by students on a topic of their choice

15
Overview of this Course

Tentative Grading Scheme
Paper review (20%)
Paper presentation (20%)
Course project report (40%), two steps: project proposal, final project report
Course project presentation (20%)
→ Marking criteria: originality, technical quality, presentation

16
Overview of this Course

Types of Course Projects
Literature survey: summarize the state of the art and identify open research
problems
New problem: introduce and analyze a new problem
New algorithm for known problem: implement and evaluate algorithm
Improvement of existing algorithm: implement and compare algorithm
Comparison of existing algorithms on a new, interesting dataset: identify
criteria for choice of algorithms / open research problems

17
Graph Mining

Motivating Applications
Social network analysis
o What communities exist?
o How does information about a new product spread?
o What customers should be targeted to maximize the profit of a marketing
campaign?
Analysis of biological networks
o What are the functional modules of an organism?
o How do biological networks evolve in the course of time?
o What protein should be targeted to inhibit some virulent bacteria?

18
Graph Mining

Methods
Frequent subgraph mining: frequent pattern mining approach
Graph clustering: e.g., normalized cut, i.e., minimize the number of edges
between graph components / clusters, normalized by the size of each component
Graph generative models: probabilistic models that generate graphs similar to
real graphs / networks
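
The normalized cut criterion mentioned above can be illustrated with a small sketch. It assumes the common two-way definition Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B), where cut(A, B) counts the edges between the two components and vol is the sum of node degrees in a component. The example graph (two triangles joined by one edge) and the function name are hypothetical.

```python
from typing import Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def normalized_cut(edges: List[Edge], part_a: Iterable[Hashable]) -> float:
    """Ncut of the bipartition (part_a, rest) of an undirected, unweighted graph."""
    a = set(part_a)
    cut = 0      # number of edges crossing the partition
    vol_a = 0    # sum of degrees of nodes in part_a
    vol_b = 0    # sum of degrees of the remaining nodes
    for u, v in edges:
        for node in (u, v):
            if node in a:
                vol_a += 1
            else:
                vol_b += 1
        if (u in a) != (v in a):
            cut += 1
    if vol_a == 0 or vol_b == 0:
        return float("inf")  # degenerate partition: one side has no edges
    return cut / vol_a + cut / vol_b


# Two triangles {1, 2, 3} and {4, 5, 6} connected by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]

balanced = normalized_cut(edges, {1, 2, 3})  # cuts one edge, balanced volumes
lopsided = normalized_cut(edges, {1})        # cuts two edges, isolates node 1
```

The balanced split scores lower than the lopsided one, which is the behavior the normalization is meant to produce. Finding the minimizing partition is the hard part and is not attempted here.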

19
Graph Mining

Challenges
Complexity of graph algorithms
o Many graph mining problems are NP-hard.
o Real graphs tend to be extremely large.
→ need efficient algorithms
Attribute data
o Many graphs have attributes associated with the nodes.
o Transformation into a weighted graph loses a lot of information.
→ need new models / algorithms considering relationship and attribute data

20
Recommender Systems

Motivating Applications
Motivation
o The internet provides a flood of information on all kinds of items.
o There is a great need for personalized recommendations.
o The internet also provides a wealth of item ratings / reviews.
Typical applications
o Movie recommendation
o Product recommendation
o Keyword recommendation

21
Recommender Systems

Methods
Collaborative filtering
o Uses only a database of user-item ratings.
o Recommendation based on ratings by users with similar rating patterns.
Content-based recommender systems
o Uses information about the content of items and / or the properties of
users.
o Recommends items that have content similar to items liked by the user.
Trust-based recommender systems
o Assume a social network / trust network. Trust can be defined explicitly or
implicitly.
o Recommendation based on ratings by trusted neighbors.
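
The difference between collaborative filtering and trust-based recommendation can be sketched with one weighted-average predictor that is fed two kinds of weights: rating similarity in the first case, trust values in the second. This is a minimal illustration under assumed choices (cosine similarity over co-rated items, a hypothetical explicit trust table, toy ratings), not a method prescribed by the slides.

```python
import math
from typing import Dict, Optional

Ratings = Dict[str, Dict[str, float]]  # user -> item -> rating


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity restricted to the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_weights(ratings: Ratings, user: str) -> Dict[str, float]:
    """Collaborative filtering: weight every other user by rating similarity."""
    return {other: cosine_similarity(ratings[user], r)
            for other, r in ratings.items() if other != user}


def predict(ratings: Ratings, weights: Dict[str, float], item: str) -> Optional[float]:
    """Weighted average of the neighbors' ratings for `item`, or None if unknown."""
    num = den = 0.0
    for other, w in weights.items():
        if w > 0 and item in ratings.get(other, {}):
            num += w * ratings[other][item]
            den += w
    return num / den if den else None


# Hypothetical toy data: alice has not rated movie "m3".
ratings: Ratings = {
    "alice": {"m1": 5, "m2": 1},
    "bob":   {"m1": 5, "m2": 1, "m3": 4},
    "carol": {"m1": 1, "m2": 5, "m3": 2},
}
trust = {"alice": {"carol": 1.0}}  # alice explicitly trusts only carol

cf_prediction = predict(ratings, similarity_weights(ratings, "alice"), "m3")
trust_prediction = predict(ratings, trust["alice"], "m3")
```

With these toy values the similarity-based prediction is dominated by bob, whose ratings match alice's, while the trust-based prediction follows carol alone. A cold start user with no ratings gets no similarity weights at all, but can still receive trust-based predictions.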

22
Recommender Systems

Challenges
High dimensionality and sparsity of data
o The overwhelming majority (> 99%) of user-item ratings is unknown.
o Recommendation is especially hard for cold start users and controversial
items.
→ dimensionality reduction, model-based methods, trust-based approach
Fraud
o Memory-based collaborative filtering can be easily manipulated by adding
fraudulent ratings.
→ trust-based approach more robust to fraud
Privacy issues with trust network data
o Only very few trust networks are in the public domain.

23
Information Extraction and Integration

Motivating Applications
Importance of unstructured text data
o The overwhelming majority (> 80%) of human-generated information is not in
structured form, but in unstructured text.
Biomedical literature
o Contains a wealth of valuable information that cannot be processed /
searched automatically.
o Extraction of entities and relationships such as proteins and their
localizations.
Online product reviews
o A lot of product reviews are available online in community databases or
blogs.
o Companies want to know what customers think of their products.

24
Information Extraction and Integration

Methods
Basic NLP methods
o Part-of-speech tagging
o Lexica, ontologies, ...
Machine learning methods
o Typically, supervised classification.
o CRFs and similar methods are state-of-the-art.
Bootstrapping approach
o Using a small labeled training dataset, find textual extraction patterns.
o Using these patterns, extract further entities / relationships and continue.
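
The bootstrapping loop described above can be sketched in a few lines. The sketch assumes a deliberately naive pattern representation (the literal text between two entity mentions) and hypothetical toy sentences about protein localization; real systems score and filter patterns to limit the drift that this version would suffer from.

```python
import re
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (entity, related entity), e.g. (protein, localization)


def learn_patterns(sentences: List[str], pairs: Iterable[Pair]) -> Set[str]:
    """Collect the text found between the two members of each known pair."""
    patterns: Set[str] = set()
    for sentence in sentences:
        for left, right in pairs:
            match = re.search(re.escape(left) + r"(.+?)" + re.escape(right), sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(sentences: List[str], patterns: Iterable[str]) -> Set[Pair]:
    """Extract every (word, word) pair connected by one of the patterns."""
    found: Set[Pair] = set()
    for sentence in sentences:
        for pattern in patterns:
            for match in re.finditer(r"(\w+)" + re.escape(pattern) + r"(\w+)", sentence):
                found.add((match.group(1), match.group(2)))
    return found


def bootstrap(sentences: List[str], seeds: Iterable[Pair], rounds: int = 2) -> Set[Pair]:
    """Alternate between learning patterns and extracting new pairs."""
    known = set(seeds)
    for _ in range(rounds):
        new_pairs = apply_patterns(sentences, learn_patterns(sentences, known)) - known
        if not new_pairs:
            break  # nothing new was extracted, so further rounds cannot help
        known |= new_pairs
    return known


# Hypothetical toy corpus and a single labeled seed pair.
sentences = [
    "ProtA is localized in the nucleus.",
    "ProtB is localized in the membrane.",
    "ProtB resides in the membrane.",
    "ProtC resides in the cytoplasm.",
]
extracted = bootstrap(sentences, {("ProtA", "nucleus")})
```

In round one the seed yields the pattern " is localized in the ", which extracts (ProtB, membrane). In round two that new pair yields " resides in the ", which extracts (ProtC, cytoplasm).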

25
Information Extraction and Integration

Challenges
Text data is hard to understand
o Many of the NLP problems are still essentially unsolved.
→ relatively simple NLP methods are often sufficient for information
extraction
Portability across domains
o Extraction methods need to be portable from one domain to another.
o Knowledge engineering approach (domain expert defines rules) is
labor-intensive and expensive.
→ machine learning methods
Entity mentions need to be resolved
o Information extraction produces strings referencing an entity of a given
type.
o Without mapping to known real-world entities, extracted information is of
limited usefulness.
→ need to integrate extracted information with existing databases
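
As a rough illustration of resolving entity mentions against an existing database, the sketch below maps extracted strings to entity identifiers through a normalized alias index. The entity IDs, the alias table, and the normalization rule (lowercase, alphanumeric characters only) are assumptions chosen for the example; practical entity resolution usually needs approximate matching and context as well.

```python
from typing import Dict, List, Optional


def normalize(mention: str) -> str:
    """Lowercase the mention and drop everything except letters and digits."""
    return "".join(ch for ch in mention.lower() if ch.isalnum())


def build_index(entities: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every normalized alias to the ID of the entity it names."""
    index: Dict[str, str] = {}
    for entity_id, names in entities.items():
        for name in names:
            index[normalize(name)] = entity_id
    return index


def resolve(mention: str, index: Dict[str, str]) -> Optional[str]:
    """Return the entity ID for an extracted string, or None if it is unknown."""
    return index.get(normalize(mention))


# Hypothetical database excerpt: entity ID -> known names and aliases.
entities = {
    "E1": ["Tumor protein p53", "TP53", "p53"],
    "E2": ["Green fluorescent protein", "GFP"],
}
index = build_index(entities)

resolved = [resolve(m, index) for m in ["P-53", "tp53", "gfp", "unknown protein"]]
```

Mentions that differ only in case or punctuation resolve to the same entity, while a string with no matching alias stays unresolved and would need manual review or a fuzzier matcher.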

26
References

Graph mining
o X. Yan and Karsten Borgwardt, "Graph Mining and Graph Kernels", Tutorial
KDD 08
o Jure Leskovec and Christos Faloutsos, "Mining Large Graphs: Models,
Diffusion and Case Studies", Tutorial ECML/PKDD 2007
Recommender systems
o Joseph Konstan, "Introduction to Recommender Systems", Tutorial SIGMOD 2008
Information extraction and integration
o Eugene Agichtein and Sunita Sarawagi, "Scalable Information Extraction and
Integration", Tutorial KDD 06
o AnHai Doan, Raghu Ramakrishnan, and Shiv Vaithyanathan, "Managing
Information Extraction", Tutorial SIGMOD 2006