Title: Data Mining Query Languages
1Data Mining Query Languages
- Kristen LeFevre
- April 19, 2004
- With Thanks to Zheng Huang and Lei Chen
2Outline
- Introduce the problem of querying data mining
models - Overview of three different solutions and their
contributions - Topic for Discussion What would an ideal
solution support?
3Problem Description
- You guys are armed with two powerful tools
- Database management systems
- Efficient and effective data mining algorithms
and frameworks - Generally, this work asks
- How can we merge the two?
- How can we integrate data mining more closely
with traditional database systems, particularly
querying?
4Three Different Answers
- DMQL A Data Mining Query Language for Relational
Databases (Han et al, Simon Fraser University) - Integrating Data Mining with SQL Databases OLE
DB for Data Mining (Netz et al, Microsoft) - MSQL A Query Language for Database Mining
(Imielinski Virmani, Rutgers University)
5Some Common Ground
- Create and manipulate data mining models through
a SQL-based interface (Command-driven data
mining) - Abstract away the data mining particulars
- Data mining should be performed on data in the
database (should not need to export to a
special-purpose environment) - Approaches differ on what kinds of models should
be created, and what operations we should be able
to perform
6DMQL
- Commands specify the following
- The set of data relevant to the data mining task
(the training set) - The kinds of knowledge to be discovered
- Generalized relation
- Characteristic rules
- Discriminant rules
- Classification rules
- Association rules
7DMQL
- Commands Specify the following
- Background knowledge
- Concept hierarchies based on attribute
relationships, etc. - Various thresholds
- Minimum support, confidence, etc.
8DMQL
- Syntax
- use database ltdatabase_namegt
- use hierarchy lthierarchy_namegt for ltattributegt
- ltrule_specgt
- related to ltattr_or_agg_listgt
- from ltrelation(s)gt
- where ltconditionsgt
- order by ltorder listgt
- with ltkinds ofgt threshold ltthreshold_valuegt
for ltattribute(s)gt
9DMQL
- Syntax ltrule_specgt
- find classification rules as ltrule_namegt
- according to ltattributesgt
- Find association rules as ltrule_namegt
- generalize data into ltrelation_namegt
- others
10DMQL
- use database Hospital
- find association rules as Heart_Health
- related to Salary, Age, Smoker, Heart_Disease
- from Patient_Financial f, Patient_Medical m
- where f.ID m.ID and m.age gt 18
- with support threshold .05
- with confidence threshold .7
11DMQL
- DMQL provides a display in command to view
resulting rules, but no advanced way to query
them - Suggests that a GUI interface might aid in the
presentation of these results in different forms
(charts, graphs, etc.)
12MSQL
- Focus on Association Rules
- Seeks to provide a language both to selectively
generate rules, and separately to query the rule
base - Expressive rule generation language, and
techniques for optimizing some commands
13MSQL
- Get-Rules and Select-Rules Queries
- Get-Rules operator generates rules over elements
of argument class C, which satisfy conditions
described in the where clause - Project Body, Consequent, confidence, support
- GetRules(C) as R1
- into ltrulebase_namegt
- where ltcondsgt
- sql-group-by clause
- using-clause
14MSQL
- ltcondsgt may contain a number of conditions,
including - restrictions on the attributes in the body or
consequent - rule.body HAS (Job Doctor
- rule1.consequent IN rule2.body
- rule.consequent IS Age
- pruning conditions (restrict by support,
confidence, or size) - Stratified or correlated subqueries
in, has, and is are rule subset, superset, and
equality respectively
15MSQL
- GetRules(Patients)
- where Body has Age
- and Support gt .05 and Confidence gt .7
- and not exists ( GetRules(Patients)
- Support gt .05 and Confidence gt .7
- and R2.Body HAS R1.Body)
Retrieve all rules with descriptors of the form
Age x in the body, except when there is a
rule with equal or greater support and confidence
with a rule containing a superset of the
descriptors in the body
16MSQL
- GetRules(C) R1
- where ltpruning-condsgt
- and not exists ( GetRules(C) R2
- where ltsame pruning-condsgt
- and R2.Body HAS R1.Body)
correlated
GetRules(C) R1 where ltpruning-condsgt and
consequent is (X) and consequent in
(SelectRules(R2) where consequent is
(X)
stratified
17MSQL
- Nested Get-Rules Queries and their optimization
- Stratified (non-corrolated) queries are evaluated
bottom-up. The subquery is evaluated first,
and replaced with its results in the outer query. - Correlated queries are evaluated either top-down
or bottom-up (like loop-unfolding), and there
are rules for choosing between the two options
18MSQL
GetRules(Patients) where Body has Age and
Support gt .05 and Confidence gt .7 and not exists
( GetRules(Patients) Support gt .05 and
Confidence gt .7 and R2.Body HAS R1.Body)
19MSQL
Top-Down Evaluation
GetRules(Patients) where Body has Age and
Support gt .05 and Confidence gt .7
For each rule produced by the outer, evaluate the
inner
not exists ( GetRules(Patients) Support gt .05
and Confidence gt .7 and R2.Body HAS
R1.Body)
20MSQL
Bottom-Up Evaluation
not exists ( GetRules(Patients) Support gt .05
and Confidence gt .7 and R2.Body HAS
R1.Body)
For each rule produced by the inner, evaluate the
outer
GetRules(Patients) where Body has Age and
Support gt .05 and Confidence gt .7
21MSQL
- Choosing between the two
- In general, evaluate the expression with more
restrictive conditions first - Heuristic rules
- Evaluate the query with higher support threshold
first - Next consider confidence threshold
- A (length x) expression is in general more
restrictive than (length gt x), which is more
restrictive than (length lt x) - Body IS (constant expression) is more
restrictive than Body HAS, which is more
restrictive than Body IN - Next consider Consequent IN expressions
- Descriptors of for (A a) are more restrictive
than wildcards such as (A )
22OLE DB for DM
- An extension to the OLE DB interface for
Microsoft SQL Server - Seeks to support the following ideas
- Define a model by specifying the set of
attributes to be predicted, the attributes used
for the prediction, and the algorithm - Populate the model using the training data
- Predict attributes for new data using the
populated model - Browse the mining model (not fully addressed
because it varies a lot by model type)
None of the others seemed to support this
23OLE DB for DM
- Defining a Mining Model
- Identify the set of data attributes to be
predicted, the set of attributes to be used for
prediction, and the algorithm to be used for
building the model - Populating the Model
- Pull the information into a single rowset using
views, and train the model using the data and
algorithm specified - Supports complex objects, so rowset may be
hierarchical (see paper for more complex examples)
24OLE DB for DM
- Using the mining model to predict
- Defines a new operator prediction join. A model
may be used to make predictions on datasets by
taking the prediction join of the mining model
and the data set.
25OLE DB for DM
- CREATE MINING MODEL Heart_Health Prediction
- ID Int Key,
- Age Int,
- Smoker Int,
- Salary Double discretized,
- HeartAttack Int PREDICT, Prediction column
- USING Decision_Trees_101
Identifies the source columns for the training
data, the column to be predicted, and the data
mining algorithm.
26OLE DB for DM
- INSERT INTO Heart_Health Prediction
- (ID, Age, Smoker, Salary)
- SELECT ID, Age, Smoker, Salary FROM
Patient_Medical M, Patient_Financial F - WHERE M.ID F.ID
The INSERT represents using a tuple for training
the model (not actually inserting it into the
rowset).
27OLE DB for DM
- SELECT t.ID,
- Heart_Health Prediction.HeartAttack
- FROM Heart_Health Prediction
- PREDICTION JOIN (
- SELECT ID, Age, Smoker, Salary
- FROM Patient_Medical M, Patient_Financial F
- WHERE M.ID F.ID) as t
- ON Heart_Health Prediction.Age t.Age AND
Heath_Health Prediction.Smoker t.Smoker AND
Heart_Health Prediction.Salary t.Salary
Prediction join connects the model and an actual
data table to make predictions
28Key Ideas
- Important to have an API for creating and
manipulating data mining models - The data is already in the DBMS, so it makes
sense to do the data mining where the data is - Applications already use SQL, so a SQL extension
seems logical
29Key Ideas
- Need a method for defining data mining models,
including algorithm specification, specification
of various parameters, and training set
specification (DMQL, MSQL, ODBDM) - Need a method of querying the models (MSQL)
- Need a way of using the data mining model to
interact with other data in the database, for
purposes such as prediction (ODBDM)
30Discussion Topic What Functionality would and
Ideal Solution Support?