Title: Themes in this session
1Lecture 2
- Themes in this session
- Knowledge discovery in databases
- Data mining
- Multidimensional analysis and OLAP
2Knowledge discovery in databases
3What is Knowledge?
- Data
- symbols representing properties of events and their environments
- Information
- is contained in descriptions; provides the answers to a number of basic questions
- Knowledge
- basic know-how that facilitates action
- Understanding
- achieved through diagnosis and prescription
- Wisdom
- judgement of what is efficient and effective
4Characteristics of discovered knowledge
- non-trivial
- valid
- novel
- potentially useful
- understandable
- An aggregated measure is interestingness
- validity
- novelty
- usefulness
- simplicity
5A more formal definition of knowledge
- Pattern
- A pattern is an expression $E$ in a language $L$ describing facts in a subset $F_E$ of $F$. $E$ is called a pattern if it is simpler than the enumeration of all the facts in $F_E$
- Knowledge
- A pattern $E \in L$ is called knowledge if, for some user-specified threshold $i \in M_I$, $I(E, F, C, N, U, S) > i$
- where $C$ = validity, $N$ = novelty, $U$ = usefulness, $S$ = simplicity
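The threshold test in this definition can be sketched in a few lines of Python. The slide does not say how the interestingness function I combines its arguments, so the weighted sum below is purely an assumption for illustration, as are the class name `PatternScores`, the equal weights and the example scores. The pattern E and the data F are left out of the sketch; only the four component scores are modelled.

```python
from dataclasses import dataclass


@dataclass
class PatternScores:
    """Hypothetical scores in [0, 1] for one candidate pattern E."""
    validity: float    # C
    novelty: float     # N
    usefulness: float  # U
    simplicity: float  # S


def interestingness(scores, weights=(0.25, 0.25, 0.25, 0.25)):
    """Assumed form of I: a weighted sum of C, N, U and S."""
    parts = (scores.validity, scores.novelty, scores.usefulness, scores.simplicity)
    return sum(w * p for w, p in zip(weights, parts))


def is_knowledge(scores, threshold):
    """A pattern counts as knowledge when I exceeds the user-specified threshold i."""
    return interestingness(scores) > threshold


candidate = PatternScores(validity=0.9, novelty=0.6, usefulness=0.8, simplicity=0.5)
print(interestingness(candidate))
print(is_knowledge(candidate, threshold=0.6))
```

Raising the threshold makes the same pattern fail the test, which is the sense in which the threshold is user-specified.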
6What is KDD?
- Knowledge Discovery in Databases involves the extraction of implicit, previously unknown and potentially useful information from data.
- KDD is a process
- involves the extraction, organisation and presentation of discovered information
- KDD is effected by a human-centred system
- is in itself a knowledge-intensive task consisting of complex interactions between a human and a (large) database.
7Overview of the analyst's tasks
[Figure: cycle of the analyst's tasks. Node labels: Goals, Queries, DB, Dataset, Analyses, Output, Insight. Arrow labels: formulates, generates, gains, enriches.]
8Characteristics of the KDD process
- highly iterative
- protracted over time
- numerous sub-tasks
- highly complex
- numerous input systems
9A description of the KDD process
[Figure: the KDD process. Stages shown: Goal formulation, Data discovery, Task discovery, Data cleaning, Model development, Data analysis, Output generation.]
10Goal formulation
- Based on a means-ends chain extending into the workings of the organisation
- Formulate a goal for improving the operations of the business
- Decide what one needs to know in order to fulfil this goal and perform the business activity in a better manner
- On the basis of what one needs to know, formulate goals for how to discover this information by using the KDD process
- Revise all of the goals above, if needed, on the basis of iterative discovery
11Data discovery
- Try to understand the domain in order to determine which entities are relevant to the discovery process
- Check the coverage and content of the data
- sift through the source data to see what is available
- sift through the source data to see what is not available
- Determine the quality of the data
- Determine the structure of the data
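Checking coverage, content, quality and structure can start with a simple per-field profile. The sketch below is only one possible approach, using hypothetical records and a hypothetical helper `profile`. It counts missing values, distinct values and the Python types seen for each field.

```python
def profile(records):
    """Per-field report: missing count, distinct values, value types seen."""
    fields = sorted({field for rec in records for field in rec})
    report = {}
    for field in fields:
        values = [rec.get(field) for rec in records]
        # Treat None and the empty string as "not available".
        present = [v for v in values if v not in (None, "")]
        report[field] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "types": sorted({type(v).__name__ for v in present}),
        }
    return report


# Hypothetical source records with gaps and inconsistent spelling.
records = [
    {"id": 1, "age": 34, "city": "Lund"},
    {"id": 2, "age": None, "city": "lund"},
    {"id": 3, "city": ""},
]
report = profile(records)
for field, stats in report.items():
    print(field, stats)
```

A field with many missing values, or with two "distinct" values that differ only in spelling, is a pointer to work for the later data cleaning step.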
12Task discovery
- Find means stipulated by the ends contained in the knowledge discovery goals
- Find out what the real requirements on the tasks and the performance of these tasks are
- Refine the requirements and choice of tasks until you're sure you're setting about answering the correct questions
13Data cleaning
- Ensure the quality of the data that will be used in the KDD process
- Eliminate data quality problems in the data such as
- inconsistencies due to differences between various data sources
- missing data
- different forms of data representation
- data incompatibility
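A minimal sketch of two of these cleaning steps follows, assuming hypothetical records merged from two sources. The field names, the mapping table `COUNTRY_CODES` and the choice of mean imputation are illustrative assumptions, not a prescribed method.

```python
from statistics import mean

# Hypothetical records: the sources disagree on how to write the country,
# revenue arrives as text, and one revenue value is missing.
raw = [
    {"customer": "A01", "country": "SE", "revenue": "1200"},
    {"customer": "A02", "country": "Sweden", "revenue": "950.5"},
    {"customer": "A03", "country": "se", "revenue": None},
]

COUNTRY_CODES = {"sweden": "SE", "se": "SE"}  # assumed mapping table


def clean(records):
    """Unify country codes, convert revenue to float, impute missing revenue."""
    cleaned = []
    for rec in records:
        key = rec["country"].strip().lower()
        country = COUNTRY_CODES.get(key, rec["country"])
        revenue = float(rec["revenue"]) if rec["revenue"] is not None else None
        cleaned.append({"customer": rec["customer"], "country": country, "revenue": revenue})
    # Mean imputation is only one of several possible treatments of missing data.
    known = [r["revenue"] for r in cleaned if r["revenue"] is not None]
    fill = mean(known)
    for r in cleaned:
        if r["revenue"] is None:
            r["revenue"] = fill
    return cleaned


cleaned = clean(raw)
for row in cleaned:
    print(row)
```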
14Model development
- Involves activities concerned with forming a basic hypothesis which can satisfy the knowledge discovery goals
- Select the parameters for the model
- formulate measures that can be used to quantify achievement of the goal (outcome variable or dependent variable)
- select a set of independent variables which are deemed to have relevance to the outcome variables
- Segment the data
- find possible relevant subsets in the population
- Choose an analysis model which fits the problem domain
- NOTE: This whole phase demands background knowledge of the domain
15Data analysis
- Involves activities aimed at determining the rules/reasons governing the behaviour of those entities focused on by the knowledge discovery goal
- specify the chosen model
- use some form of formal expression
- fit the model to the data
- perform initial adjustments to some of the parameters
- evaluate the model
- check the soundness of the model against the data
- refine the model
- modify the model on the basis of its discrepancies with the evidence presented by the data
16Output generation
- Reports of findings in the analysis
- Action suggestions on the basis of the findings
- Models for use in similar analysis scenarios
- Monitoring mechanisms which observe the variables
covered in the analysis and trigger
notifications when certain conditions are noted
in the data.
17Developing KDD applications
- Purpose: an application to answer a key business question
- a labour-intensive initial discovery of knowledge by someone who understands the domain as well as the specific data analysis techniques needed
- encoding of the discovered knowledge within a specific problem-solving architecture
- application of the knowledge in the context of a real-world task by a well-understood class of end-users
- Installation of analysis, monitoring, and reporting mechanisms as a base for continual evaluation of data
18Data mining
19What is data mining?
- Rather formal definition
- Data mining involves fitting models to, and observing patterns from, observed data through the application of specific algorithms.
- Less formally
- Data analysis in order to explain an aspect of a complex reality by expressing it as an understandable simplification
20Goals for data mining
- Prediction
- involves using some variables or fields in the database to predict unknown or future values of other variables of interest
- Description
- focuses on finding human-interpretable patterns describing the data
21Rationale for data mining
- Dramatic increase in the amount of data available (the data explosion)
- Increasing competition in the world's markets
- The low relative value of easily discovered information
- Increasing cleverness
- Emergence of new enabling technology
22Enabling factors for data mining
- Increased data storage ability
- Increased data gathering ability
- Increased processing power
- The introduction of new computationally intensive
methods of machine learning
23Background to data mining
- Inductive learning
- supervised learning
- unsupervised learning
- Statistics
- Machine learning
- Differences between DM and ML
- DM finds understandable knowledge, ML improves the performance of an agent
- DM is concerned with large, real-world databases, ML with smaller data sets
- ML is a broader field, not only learning by example
24Data mining algorithms
- Specific mix of three components
- The model
- function
- representational form
- parameters from the data
- The model evaluation (preference) criterion
- preference of one set of models or set of parameters over another
- based on a goodness-of-fit function
- The search method
- a method for finding particular models and parameters
- given the data, a family of models and a preference criterion
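The three components can be made concrete with a deliberately tiny example. Everything in it is an assumption chosen for illustration: the model family is a single threshold on one variable, the evaluation criterion is plain accuracy, and the search method is exhaustive search over a fixed list of candidate thresholds.

```python
# Hypothetical data: (x, label) pairs, label is True when x is "large".
data = [(1.0, False), (2.0, False), (3.5, True), (4.0, True), (2.5, False), (5.0, True)]


def model(threshold, x):
    """The model: predict True when x exceeds the threshold parameter."""
    return x > threshold


def goodness_of_fit(threshold, data):
    """The evaluation criterion: fraction of items labelled correctly."""
    hits = sum(model(threshold, x) == label for x, label in data)
    return hits / len(data)


def search(data, candidates):
    """The search method: try every candidate parameter, keep the best one."""
    return max(candidates, key=lambda t: goodness_of_fit(t, data))


candidates = [0.5 * k for k in range(1, 11)]  # 0.5, 1.0, ..., 5.0
best = search(data, candidates)
print(best, goodness_of_fit(best, data))
```

Swapping any one component (a different model family, a different criterion, a smarter search) gives a different data mining algorithm, which is the point of describing algorithms as a mix of the three.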
25Primary operations in data mining
- A number of basic operations can be used for prediction and description
- Classification
- Regression
- Clustering
- Summarisation
- Dependency modelling
- Change and deviation detection
26Classification
- Learning a function that maps (classifies) a data item into one of several predefined classes
- In supervised learning it is the user that defines the classes.
- The classification is applied in the form of one or more attributes that denote the class of the data item.
- These classifying attributes are known as predicted attributes. A combination of values for the predicted attributes defines a class
- Other attributes of the data item are known as predicting attributes
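As an illustration of mapping predicting attributes to a predicted attribute, here is a minimal 1-nearest-neighbour sketch. The slide does not name a particular classifier, so the choice of nearest neighbour is an assumption, and the attribute names and training items are hypothetical.

```python
import math

# Hypothetical training items: predicting attributes (age, income)
# paired with one predicted attribute (a user-defined class).
training = [
    ((25, 20), "low_value"),
    ((30, 25), "low_value"),
    ((45, 60), "high_value"),
    ((50, 70), "high_value"),
]


def classify(item, training):
    """Return the class of the closest training item (1-nearest neighbour)."""
    nearest = min(training, key=lambda pair: math.dist(item, pair[0]))
    return nearest[1]


print(classify((28, 22), training))
print(classify((48, 65), training))
```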
27Regression
- A common statistical technique for modelling the relationship between two or more variables
- Learning a function which maps a data item to a real-valued prediction variable
- Simple linear regression uses the straight-line model $Y = \beta_0 + \beta_1 X + \varepsilon$, where $Y$ is the prediction variable (dependent variable) and $X$ is the predictive variable (independent variable)
- Multiple regression involves more than two variables and uses the model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$, where $Y$ is the prediction variable and $X_1, \dots, X_n$ are the predictive variables
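For the simple (one-predictor) case, the straight-line model can be fitted with ordinary least squares. The sketch below uses the standard closed-form estimates; the function name `fit_simple_linear` and the sample data are hypothetical, and `b0` and `b1` stand for the estimated intercept and slope.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates (b0, b1) for the line Y = b0 + b1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of X and Y divided by variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
b0, b1 = fit_simple_linear(xs, ys)
print(b0, b1)

# Predict the dependent variable for a new value of X.
print(b0 + b1 * 5.0)
```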
28Clustering
- A common descriptive task for determining a finite set of categories or clusters to describe the data
- Categories may be mutually exclusive and exhaustive, or consist of richer representations such as hierarchical or overlapping categories
- A cluster is a group of objects grouped together because of their similarity or proximity. Data units in a cluster are homogeneous and differ significantly from those in other groups
- Correlations and functions of distance between elements are used in defining the clusters
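One distance-based way to form mutually exclusive clusters is k-means, sketched below. The slide does not name an algorithm, so k-means is an assumed example; the points and starting centroids are hypothetical, and the starting centroids are passed in explicitly to keep the run deterministic.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then re-average."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, [(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

Each point ends up in exactly one cluster, so this corresponds to the mutually exclusive and exhaustive case; hierarchical or overlapping categories need other methods.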
29Summarisation
- Methods for finding a compact description for a subset of data
- Often relies on statistical methods such as the calculation of means and standard deviations
- Is often applied to interactive exploratory data analysis and automated report generation.
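A compact description of subsets of data can be as simple as a count, mean and standard deviation per group. The sketch below uses Python's standard `statistics` module; the sales figures and the helper `summarise` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (region, sales) rows.
sales = [
    ("north", 120.0), ("north", 80.0),
    ("south", 200.0), ("south", 220.0), ("south", 180.0),
]


def summarise(rows):
    """Count, mean and sample standard deviation for each group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {
        key: {"n": len(vals), "mean": mean(vals), "stdev": stdev(vals)}
        for key, vals in groups.items()
    }


summary = summarise(sales)
for region, stats in summary.items():
    print(region, stats)
```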
30Dependency modelling
- Consists of finding a model which describes significant dependencies between variables
- There are two levels of dependency in dependency models
- The structural level specifies which variables are locally dependent on each other
- The quantitative level specifies the strengths of the dependencies using some numerical scale
- Often in the form: x% of all records containing items A and B also contain items D and E
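The strength of a dependency of this kind is commonly expressed as the confidence of a rule, alongside its support. The sketch below computes both for the rule "A and B imply D and E" over a hypothetical set of records; the terms support and confidence come from association-rule mining and are not used on the slide itself.

```python
def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical records, each a set of items.
transactions = [
    {"A", "B", "D", "E"},
    {"A", "B", "D", "E", "F"},
    {"A", "B", "C"},
    {"A", "D"},
    {"B", "E"},
]
support, confidence = rule_stats(transactions, {"A", "B"}, {"D", "E"})
print(f"{confidence:.0%} of records containing A and B also contain D and E")
```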
31Change and deviation detection
- Focuses on discovering the most significant changes in the data from previously measured or normative values
- Often used on a long time series of records in order to discover trends
- Often used to discover sequential patterns occurring over extended time periods
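A simple way to flag deviations from previously measured values is to compare each new observation with the mean of a baseline period, measured in standard deviations. This is only a sketch of one possible criterion; the three-sigma limit, the helper `deviations` and the readings are assumptions.

```python
from statistics import mean, stdev


def deviations(baseline, new_values, z_limit=3.0):
    """Return (index, value) for new values far from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [
        (i, v) for i, v in enumerate(new_values)
        if abs(v - mu) > z_limit * sigma
    ]


baseline = [100, 102, 98, 101, 99]   # previously measured values
incoming = [101, 97, 110, 100, 93]   # new measurements to check
flagged = deviations(baseline, incoming)
print(flagged)
```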
32Problems and issues in data mining
- Limited information
- Noise and missing values
- Uncertainty
- Size of databases
- Irrelevance of certain fields
- Updates to databases
33Multidimensional analysis and OLAP
34OLAP vs OLTP
- OLTP servers handle mission-critical production data accessed through simple queries
- usually handles queries of an automated nature
- OLTP applications consist of a large number of relatively simple transactions.
- Most often contains data organised on the basis of logical relations between normalised tables
- OLAP servers handle management-critical data accessed through an iterative analytical investigation
- usually handles queries of an ad-hoc nature
- supports more complex and demanding transactions
- contains logically organised data in multiple dimensions
35What is OLAP?
- Definition: The dynamic synthesis, analysis and consolidation of large volumes of multidimensional data.
- Flexible information synthesis
- Multiple data dimensions/consolidation paths
- Dynamic data analysis
36Codd's four data models for data analysis
- Categorical data models
- Exegetical data models
- Contemplative data models
- Formulaic data models
37Dimensionality revisited
38OLAP Tool evaluation criteria (1-6)
- Multidimensional conceptual view
- Transparency
- Accessibility
- Consistent reporting performance
- Client-Server architecture
- Generic dimensionality
39OLAP Tool evaluation criteria (7-12)
- Dynamic Sparse Matrix handling
- Multi-user support
- Unrestricted cross-dimensional analysis
- Intuitive data manipulation
- Flexible reporting
- Unlimited dimensions and aggregation levels
40Functionality of OLAP tools
- Drill-down
- Drill-up
- Roll-up or consolidation
- Slicing and dicing by pivoting
- Drill-through
- Drill-across
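Three of these operations (roll-up, drill-down and slicing) can be illustrated on a tiny in-memory fact table. The sketch below is not how an OLAP server is implemented; the dimensions, the figures and the helper names `aggregate` and `slice_` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, sales).
DIMS = ("year", "quarter", "region", "product")
facts = [
    (2023, "Q1", "North", "Widgets", 100),
    (2023, "Q1", "South", "Widgets", 150),
    (2023, "Q2", "North", "Gadgets", 200),
    (2023, "Q2", "South", "Widgets", 50),
    (2024, "Q1", "North", "Gadgets", 300),
]


def aggregate(rows, by):
    """Sum the sales measure over every dimension not listed in `by`."""
    idx = [DIMS.index(d) for d in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    i = DIMS.index(dim)
    return [r for r in rows if r[i] == value]


by_year = aggregate(facts, ["year"])                     # roll-up to year level
by_year_quarter = aggregate(facts, ["year", "quarter"])  # drill-down to quarters
north_by_product = aggregate(slice_(facts, "region", "North"), ["product"])

print(by_year)
print(by_year_quarter)
print(north_by_product)
```

Drill-down and roll-up are the same aggregation at finer or coarser levels of a dimension hierarchy (here year and quarter), while a slice fixes one dimension before aggregating.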
41An OLAP answer set
42Different forms of OLAP
- True OLAP
- ROLAP (relational OLAP)
- MOLAP (multidimensional OLAP)