Title: Chapter 22: Advanced Querying and Information Retrieval
1Chapter 22 Advanced Querying and Information
Retrieval
- Decision-Support Systems
- Data Analysis and OLAP
- Data Mining
- Data Warehousing
- Information-Retrieval Systems
2Decision Support Systems
- Decision-Support systems are used to make
business decisions often based on data collected
by on-line transaction-processing systems. - Examples of business decisions
- what items to stock?
- What insurance premium to change?
- Who to send advertisements to?
- Examples of data used for making decisions
- Retail sales transaction details
- Customer profiles (income, age, sex, etc.)
3Decision-Support Systems Overview
- Data analysis tasks are simplified by specialized
tools and SQL extensions - Example tasks
- For each product category and each region, what
were the total sales in the last quarter and how
do they compare with the same quarter last year - As above, for each product category and each
customer category - Statistical analysis packages (e.g., S) can
be interfaced with databases - Statistical analysis is a large field will not
study it here - Data mining seeks to discover knowledge
automatically in the form of statistical rules
and patterns from Large databases. - A data warehouse archives information gathered
from multiple sources, and stores it under a
unified schema, at a single site. - Important for large businesses which generate
data from multiple divisions, possibly at
multiple sites - Data may also be purchased externally
4Data Analysis and OLAP
- Aggregate functions summarize large volumes of
data - Online Analytical Processing (OLAP)
- Tools to support interactive analysis of data,
allowing data to be summarized and viewed in
different ways - A histogram partitions the values taken by an
attribute into ranges, and computes an aggregate
over the values in each range cumbersome to use
standard SQL to construct a histogram. Extension
proposed by Red Brick - select percentile, avg (balance)
- from account
- group by N_tile (balance, 10) as percentile
5Cross Tabulation of sales by item-name and color
- The table above is an example of a
cross-tabulation(or cross-tab) also referred to
as a pivot-table. In general, a cross-table is a
table where values for one attribute form the row
headers, values for another attribute form the
column headers, and the values in an individual
cell are derived as follows. - A cross tab with summary rows/columns can be
represented by introducing a special value all to
represent subtotals.
6Relational Representation of the Data in Figure
22.1
7Three-Dimensional Data Cube
8Online Analytical Processing
- The operation of changing the dimensions used in
a cross-tab is called pivoting. - An OLAP system provides other functionality as
well. For instance, the analyst may wish to see a
cross-tab on item-name and color for a fixed
value of size, for example, large, instead of the
sum across all sizes. Such an operation is
referred to as slicing. The operation is
sometimes called dicing, particularly when values
for multiple dimensions are fixed. - The operation of moving from finer-granularity
data to a coarser granularity is called a rollup. - The opposite operation - that of moving from
coarser-granularity data to finer-granularity
data is called a drill down.
9OLAP Implementation
- The earliest OLAP systems used multidimensional
arrays in memory to store data cubes, and are
referred to as mutidimensional OLAP (MOLAP)
systems. - Hybrid systems, which store some summaries in
memory and store the base data and other
summaries in a relational database, are called
hybrid OLAP (HOLAP) systems.
10Hierarchies on Dimensions
11Cross Tabulation of sales With Hierarchy on
item-name
12Data Analysis (Cont.)
Total
Medium
Small
Large
8 20
35 10
310 5
Light Dark
53 35
28
45
15
88
Total
- Cross-tabulation of number by size and color of
sample relation sales with the schema
Sales(color, size, number).
13Data Analysis (Cont.)
Color
Size
Number
Light Light Light Light Dark Dark Dark Dark all al
l all all
8 35 10 53 20 10 5 35 28 45 15 88
Small Medium Large all Small Medium Large all Smal
l Medium Large all
- Can represent subtotals in relational form by
using the value all - E.g. obtain (Light, all, 53) and (Dark, all,
35) by aggregating individual tuples with
different values for size for each color.
14Data Analysis (Cont.)
- Rollup Moving from finer-granularity data to a
coarser granularity by means of aggregation. - Drill down Moving from coarser-granularity data
finer-granularity data. - Proposed extensions to SQL, such as the cube
operation help to support generation of summary
data - The following query generates the previous table.
- select color, size, sum (number)
- from sales
- groupby color, size with cube
15Data Analysis (Cont.)
- Figure shows the combinations of dimensions size,
color, price - In general computing cube operation with n
groupby columns gives 2nd different groupby
combinations.
16Data Mining
- Like knowledge discovery in artificial
intelligence data mining discovers statistical
rules and patterns it differs from machine
learning in that it deals with large volumes of
data stored primarily on disk. - Knowledge discovered from a database can be
represented by a set of rules. - e.g., Young women with annual incomes greater
than 50,000 are most likely to buy sports cars - Discover rules using one of two models
- 1. The user is involved directly in the process
of knowledge discovery. - 2. The system is responsible for automatically
discovering knowledge from the database by
detecting patterns and correlation's in the data.
17Knowledge Representation Using Rules
- General form of rules ? X antecedent ? consequent
- X is a list of one or more variables with
associated ranges. - The rule ? transactions T, buys (T, bread) ?
buys(T, milk) states if there is a tuple (t1,
bread) in the relation buys, there must also be a
tuple (t1, milk) in the relation buys. - Population Cross-product of the ranges of the
variables in the rule. - In the above example, the set of all
transactions. - Support Measure of what fraction of the
population satisfies both the antecedent and the
consequent of the rule. - e.g., 10 of transactions buy bread and milk.
- Confidence Measure of how often the consequent
is true when the antecedent is true. - e.g., 80 of transactions that buy bread also buy
milk
18Some Classes of Data-Mining Problems
- Classification Finding rules that partition the
given data into disjoint groups (classes) that
are relevant for making a decision - (e.g., which of several factors help classify a
persons credit worthiness). - Associations Useful to determine associations
between different items - (e.g., someone who buys bread is quite likely
also to buy milk). - Sequence correlations determine correlations
between related sequence data. - (e.g., when bond rates go up stock prices go
down within two days.)
19User-Guided Data Mining
- In user-guided data mining, primary
responsibility for discovering rules is with the
user. - User may runs tests on the database to verify or
refute a hypothesis. Confidence and support for
rules expressing a hypothesis are derived from
the database. - An iterative process of forming and refining
rules is used. Example Test the hypothesis
People who hold masters degrees are the most
likely to have an excellent credit rating. If
confidence of rule is low, may refine it into the
rule - ? people P, P.degree Masters and C.income ?
75, 000 - ? C.credit excellent
- Data-visualization though graphical
representations like maps, charts, and
color-coding, helps detect patterns in data
20Classification Rules
- Classification rules help assign new objects to a
set of classes. E.g., given a new automobile
insurance applicant, should he or she be
classified as low risk, medium risk or high risk? - Classification rules for above example could use
a variety of knowledge, such as educational level
of applicant, salary of applicant, age of
applicant, etc. - Classification rules can be compactly shown as a
Classification tree.
21Part of Credit Risk Classification Tree
22Discovery of Classification Rules
- Training set a data sample in which the grouping
for each tuple is already known. - Top down generation of classification tree.
- Each internal node of the tree partitions the
data into groups based on the attribute. - The data at a node is not partitioned further if
either all (or most) of the items at the node
belong to the same class, or all attributes have
been considered. Such a node is a leaf node. - Otherwise the data at the node is partitioned
further by picking an attribute for partitioning
data at the node.
23Discovery of Classification Rules (Cont.)
- Consider credit risk example Suppose degree is
chosen to partition the data at the root. Since
degree has a small number of possible values,
one child is created for each value. - At each child node of the root, further
classification is done tuple if required. Here,
partitions are defined by income. Since income is
a continuous attribute, some number of intervals
are chosen, and one child created for each
interval. - Different classification algorithms use different
ways of choosing which attribute to partition on
at each node, and what the intervals, if any,
are. - In general, different branches of the tree could
grow to different levels. Different nodes at the
same level may use difficult partitioning
attributes.
24Discovery of Association Rules
- Example ? transactions T, buys (T, bread) ? buys
(T, milk) - In general notion of transaction , and its
intemset, the set of items contained in the
transaction - General form of rule
- ? transactions T, c(T, i1) and . . . and c(T,
io) ? c(T, i0) - where c(T, ik) denotes that transaction T
contains item ik. - Above can be represented as A ? b where A i1,
i2, . . . , in - and b io.
- Support of rule number of transactions whose
itemsets contain A ? b - Usually desire rules with strong support, which
will involve only items purchased in a
significant percentage of the transactions.
25Discovery of Association Rules (Cont.)
- Consider all possible sets of relevant items.
- For each set find its support (i.e. , how many
transactions purchase all items in the set). - Use sets with sufficiently high support to
generate association rules. - From set A generate the rules A - b ?b for
each b ? A. - Support of each of the rules is support of A.
- Confidence of a rule is support of A divided by
support of A - b.
26Finding Support
- Few sets Determine level of support via a single
pass. - A count is maintained for each set, initially set
to 0. - When a transaction is fetched, the count is
incremented for each set of items which contained
in the itemset of the transaction. - Sets with a high count at the end of the pass
correspond to items with a high degree of
association. - Many sets If memory not enough to hold all
counts for all sets Use multiple passes,
considering only some sets in each pass. - Optimization Once a set is eliminated because it
occurs in too small a fraction of the
transactions, none of its supersets needs to be
considered.
27Data Warehousing
- A data warehouse is a repository of information
gathered from multiple sources.
28Data Warehousing (Cont.)
- Provides a single consolidated interface to data
- Data stored for an extended period, providing
access to historical data - Data/updates are periodically downloaded form
online transaction processing (OLTP) systems. - Typically, download happens each night.
- Data may not be completely up-to-date, but is
recent enough for analysis. - Running large queries at the warehouse ensures
that OLTP systems are not affected by the
decision-support workload.
29Issues in Building a Warehouse
- When and how to gather data.
- Source driven data source initiates data
transfer - Destination driven warehouse initiates data
transfer - What schema to use.
- Schema integration
- Cleaning and conversion of incoming data
- What data to summarize.
- Raw data may be too large to store on-line
- Aggregate values (totals/subtotals) often suffice
- Queries on raw data can often be transformed by
query optimizer to use aggregate values - How to propagate updates.
- Date at warehouse is a view on source data
- Efficient view maintenance techniques required
30Information Retrieval Systems
- Information retrieval (IR) systems use a simpler
data model than database systems, but provide
more powerful querying capabilities within the
restricted model. - Queries attempt to locate documents that are of
interest by specifying, for example, sets of
keywords. - e.g., find documents containing the words
database systems - Information retrieval systems order answers based
on their estimated relevance. - e.g., user may really only want documents about
database systems, but the system may retrieve all
documents that mention the phrase database
systems. - Documents may be ordered by, for example, how
many times the phrase appears in the document.
31Queries
- Combinations of keywords
- motorcycle and maintenance
- computer or micro-processor
- computer but not database
- Closeness of keyword s in the and case affects
ranking. Some systems allow user to specify that
the keywords must occur close to each other. - Synonyms
- To retrieve document title motorcycle repair for
the query motorcycle and maintenance, need to
realize that maintenance and repair are synonyms - Similarity based retrieval - retrieve documents
similar to a given document. Similarity may be
defined based on metrics such as number of common
keywords.
32Differences From Database Systems
- Information retrieval systems, unlike traditional
database systems, handle - Unstructured documents
- Searching using keywords and relevance ranking
- Most information retrieval systems do not handle
- High update rates
- Concurrency control
- Data structured using more complex data models
(e.g., relational or object oriented data models) - Complex queries written in, e.g., SQL
33Indexing of Documents
- Documents that contain a specified keyword can be
located using an inverted index which maps each
keyword Ki to the set Si of identifiers of
documents that contain Ki . - IR systems save space by using index structures
that support only approximate retrieval. May
result in - false drop - some relevant documents may not be
retrieved. - false positive - some irrelevant documents may be
retrieved. - For many applications a good index should not
permit any false drops, but may permit a few
false positives. - Relevant performance metrics
- Precision - what percentage of the retrieved
documents are relevant to the query. - Recall - what percentage of the documents
relevant to the query were retrieved.
34Indexing of Documents (Cont.)
- and operation Finds documents that contain all
of a set of keywords K1, K2, ..., Kn. - Retrieve the corresponding sets of identifiers of
documents S1, S2, ... Sn. - The intersection, S1? S2 ?..... ? Sn, gives the
identifiers of the desired set of documents. - or operation Gives the set of all documents that
contain at least one of the keywords K1, K2, ,
Kn - Found by computing the union, S1? S2 ?..... ? Sn,
of the sets.
35Indexing of Documents (Cont.)
- not operation Finds documents that do not
contain a specified keyword Ki - Let Si be the set of identifiers of documents
that contain the keyword Ki. - Given a set of document identifies S, eliminate
documents that contain the specified keyword Ki
by taking the difference S-Si - A full-text index uses every work in the document
as a keyword. - Stop words are very commonly occurring words that
are useless as key words, e.g, a, an, the, it
etc. These are eliminated from the index.
36Browsing
- Storing related documents together facilitates
browsing, where a user can see not only requested
document but also related ones. - Browsing in a library facilitated by
classification system that organizes logically
related books together.
37A Classification Tree
38Classification DAG
- Documents can reside in multiple places in a
hierarchy in an information retrieval system,
since physical location is not important. - Classification hierarchy is thus Directed Acyclic
Graph (DAG).
39A Classification DAG for a library information
retrieval system
40Locating of Information on the Web
- The Archie system automatically follows Gopher
links to locate information, and creates a
centralized index of information found various
sites. - Web indexing systems (Web crawlers) follow the
hypertext links in documents to find other
documents, and build an index on the documents. - The indices are often full-text indices, and are
stored locally at the indexing system. - These systems run a background process to
- find new sites.
- obtain updated information from known sites.
- discard defunct sites.
41Locating of Information on the Web (Cont.)
- Web indexing systems permit documents to be
located even though they are not registered with
any central authority. - Drawback poor precision of recall
- Full text indexing retrieves unrelated documents
that just happen to mention the requested keyword - HTML extensions now allow documents to be tagged
with keywords to be used by search engines
unfortunately, many documents do not provide such
keywords. - The extremely large number of documents on the
Web often leads to far more results than a human
can handle. - Alternative approach a cataloging system for the
Web, such as that provide by Yahoo - Combinations of catalogs and indexing are quite
useful Provided by, e.g., Yahoo (www.yahoo.com).
42Classification Tree
43Data-Warehouse Architecture
44Star Schema For A Data Warehouse
45A Classification Hierarchy For A Library System
46A Classification DAG For A Library Information
Retrieval System
47Data Analysis and OLAP
- Online Analytical Processing
- Data that can be modeled as dimension attributes
and measure attributes are called
multidimensional data. - Given a relation used for data analysis, we can
identify some of its attributes as measure
attributes, since they measure some value, and
can be aggregated upon. For instance, the
attribute number of the sales relation is a
measure attribute, since it measures the number
of units sold. - Some of the other attributes of the relation are
identified as dimension attributes, since they
define the dimensions on which measure
attributes, and summaries of measure attributes,
are viewed.
48Extended Aggregation
- SQL1999 also supports generalizations of the
group by constructs, using the cube and rollup
constructs. A representative use of the cube
construct is - select item-name, color, size,
sum(number) from sales group by cube(item-name,
color, size) - This query computes the union of eight different
groupings of the - sales relation
- (item-name, color, size), (item-name, color),
(item-name, size), (color, size), (item-name),
(color), (size), ( ) - Where ( ) denotes an empty group by list.
- For each grouping, the result contains the null
value for attributes not present in the grouping.
For instance, with occurrences of all replaced by
null, can be computed by the query - select item-name, color, sum(number) from
sales group by cube(item-name, color)
49Extended Aggregation (Cont.)
- A representative rollup construct is
- select item-name, color, size,
sum(number) from sales group by
rollup(item-name, color, size) - Here only four grouping are generated
- (item-name, color, size), (item-name, color),
(item-name), ( ) - Rollup can be used to generate aggregates at
multiple levels of ahierarchy on a column. For
instance, we have a table itemcategory(item-name,
category) giving the category of each item. Then
the query - select category, item-name, sum(number)from
sales, categorywhere sales.item-name
itemcategory.item-namegroup by rollup(category,
item-name) - would give a hierarchical summary by item-name
and by category.
50Extended Aggregation (Cont.)
- Multiple rollups and cubes can be used in a
single group by clause.For instances, the
following query - select item-name, color, size, sum(number)from
salesgroup by rollup(item-name), rollup(color,
size) - generates the groupings
- (item-name, color, size), (item-name, color),
(item-name), (color, size), (color), ( ) - The function grouping can be applied on an
attribute it returns 1 if the value is a null
value representing all, and returns 0 in all
other cases. Consider the following query - select item-name, color, size,
sum(number), grouping(item-name) as
item-name-flag, grouping(color) as
color-flag, grouping(size) as size-flag,from
salesgroup by cube(item-name, color, size)
51Ranking
- Ranking is done in conjunction with an order by
specification. Suppose we are given a relation
student-marks(student-id, marks) which stores the
marks obtained by each student. The following
query gives the rank of each student. - select student-id, rank( ) over (order by
(marks) desc as s-rankfrom student-marks - An extra order by clause is needed to get them in
sorted order, as shown below. - select student-id, rank ( ) (order by (marks)
desc) as s-rankfrom student-marks order by
s-rank - Ranking can be done within partition of the data.
The following query then gives the rank of
students within each section. - select student-id, section, rank ( ) over
(partition by section order by marks desc) as
sec-rankfrom student-marks, student-sectionwhere
student-marks.student-id student-section.studen
t-idorder by section, sec-rank
52Ranking (Cont.)
- For a given constant n, the ranking the function
ntile(n) takes the tuples in each partition in
the specified order, and divides them into n
buckets with qual numbers of tuples. For
instance, we an sort employees by salary, and use
ntile(3) to find which range (bottom third,
middle third, or top third) each employee is in,
and compute the total salary earned by employees
in each range - select threetile, sum(salary)from ( select
salary, ntile(3) over (order by (salary) as
threetile from employee) as sgroup by threetile - SQL1999 permits the user to specify where they
should occur by using nulls first or nulls last,
for instance - select student-id, rank ( ) over (order by marks
desc nulls last) as s-rankfrom student-marks
53Windowing
- An example of window query is that, given sales
values for each date, calculates for each date
the average of the sales on that day, the
previous day, and the next day such moving
average queries are used to smooth out random
variations. - In contrast to group by, the same tuple can exist
in multiple windows. Suppose we are given a
relation transaction(account-number, date-time,
value), where value is positive for a deposit and
negative for a withdrawal, consider a query - select account-number, date-time, sum(value)
over (partition by account-number order by
date-time rows unbounded preceding) as
balancefrom transactionorder by account-number,
date-time
54Decision - Tree Construction
- The main idea of decision tree construction is to
evaluate different attributes and different
partitioning conditions and pick the attribute
and partitioning condition that results in the
maximum information gain ratio. - Procedure Grow.Tree(S) Partition(S)Procedure
Partition (S) if (purity(S) gt ?p or S lt ?s)
then return for each attribute A
evaluate splits on attribute A Use best split
found (across all attributes) to partition
S into S1, S2, ., Sr, for i 1, 2, .., r
Partition(Si) - We can generate classification rules from a
decision tree, if we so desire. For each leaf we
generate a rule as follows The left hand side is
the conjunction of all the split conditions on
the path to the leaf, and the class is the class
of the majority of the training instances at the
leaf. An example of such a classification rule is
- degree masters and income gt 75,000 ? excellent
55Other Types of Classifiers
- There are two types of classifiers
- Neural net classifiers
- Bayesian classifiers
- Neural net classifiers use the training data to
train artificial neural nets. - To find the probability p(cjd) of instance d
being in class cj, Bayesian classifiers use
Bayes theorem, which says - p(cjd) p(dcj)p(cj)
- p(d)where p(dcj) is the
probability of generating instance d given class
cj, p(cj) is the probability of occurrence of
class cj, and p(d) is the probability of
instanced occuring. - To simplify the task, naïve Bayesian classifiers
assume attributes have independent distributions,
and thereby estimate - p(dcj) p(d1cj) p(d2cj) . (p(dncj)
- That is, the probability of the instance d
occuring is the product of the probability of
occurrence of each of the attribute values di of
d, given the class cj.
56Regression
- Regression deals with the prediction of a value,
rather than a class. Given values for a set of
variables, X1, X2, , Xn, we wish to predict the
value of a variable Y. - One way is to infer coefficients a0, a1, a1, ,
an such that Y a0 a1 X1 a2 X2 an
Xn - Finding such a linear polynomial is called linear
regression. The process to find a curve that fits
the data is called curve fitting. - The fit may only be approximate, because of noise
in the data or because the relationship is not
exactly a polynomial, so regression aims to find
coefficients that give the best possible fit.
57Association Rules
- Retail shops are often interested in associations
between different items that people buy. Examples
of such associations are - Someone who buys bread is quite likely also to
buy milk - A person who bought the book Database System
Concepts is quite likely also to buy the book
Operating System Concepts. - Associations information can be used in several
ways. When a customer buys a particular book, an
online shop may suggest associated books. - Association Rules
- An example of Association Rules is
- bread ? milk
- An association rule must have an associated
population the population consists of a set of
instances.
58Association Rules (Cont.)
- Association Rules An example of Association
Rules is - bread ? milk
- An association rule must have an associated
population the population consists of a set of
instances. - Rules have an associated support, as well as an
associated confidence. These are defined in the
context of population - Support is a measure of what fraction of the
population satisfies both the antecedent and the
consequent of the rule.For instance, suppose only
0.001 percent of all purchases include milk and
screwdrivers. The support for the rule is milk ?
screwdrivers is low. - Confidence is a measure of how often the
consequent is true when the antecedent is true.
For instance, the rule bread ? milk has a
confidence of 80 percent if 80 percent of the
purchases that include bread also include milk.A
rule with a low confidence is not meaningful. - Note that the confidence of bread ? milk may be
very different from the confidence of milk ?
bread, although both have the same supports.
59Clustering
60Other Types of Mining
61Data Warehousing
62Components of Data Warehouse
- When and how to gather data.
- What schema to use.
- Data cleansing
- How to propagate updates.
- What data to summarize
63Warehouse Schemas
64Information-Retrieval Systems
65Keyword Search
- Information-retrieval systems typically allow
query expressions formed using keywords and the
logical connectives and, or, and not. - A query containing keywords without any of the
above connectives is assumed to have ands
implicitly connecting the keywords - In full text retrieval, all the words in each
document are considered to be keywords. We shall
use the word term to refer to the words in a
document, since all words are keywords.
66Relevance Ranking Using Terms
- One way of measuring r(d, t), the relevance of a
document d to a term t, is - Where n(d) denotes the number of terms in the
document and n(d, t) denotes the number of
occurrences of term t in the document d.
67Relevance Using Hyperlinks
68Similarity-Based Retrieval
69Synonyms and Homonyms
70Indexing of Documents
71Measuring Retrieval Effectiveness
72Web Search Engines
73Directories
74(No Transcript)
75Applications of Data Mining
- ? person P, P.degree masters and P.income gt
75,000 ? P.credit excellent?person P,
P.degree bachelors or (p.income ? 25,000 and
P.income ? 75,000) ? P.credit good
76Best Splits
- The purity of a set S of training instances can
be measured quantitatively in several ways.
Suppose there are k classes, and of the instances
in S the fraction of instances in class I is pi.
One measure of purity, the Gini measure is
defined as -
- Gini (S) 1 - ?
- When all instances are in a single class, the
Gini value is 0, while it reaches its maximum (of
1 1 /k) if each class the same number of
instances. Another measure of purity is the
entropy measure, which is defined as - Entropy (S) - ?
- When a set S is split into multiple sets Si, I1,
2, , r, we can measure the purity of the
resultant set of sets as -
- Purity(S1, S2, .., Sr) ?
-
- The above formula can be used with both the Gini
measure and the entropy measure of purity.
77Best Splits (Cont.)
- The information gain due to particular split of S
into Si, I 1, 2, ., r is then - Information-gain (S,S1, S2, ., Sr) purity(S)
purity (S1, S2, Sr) - The information content of a particular split can
be defined in terms of entropy as
Information-content(S, S1, S2, .., Sr)) - ? - All of this leads to a definition The best split
for an attribute is the one that gives the
maximum information gain ratio, defined as - Information-gain (SS1, S2, , Sr)
- Information-content (S, S1, S2, .., Sr)