Title: Topics Related to Data Mining
1. Topics Related to Data Mining
2. Information Retrieval
- Relevance Ranking Using Terms
- Relevance Using Hyperlinks
- Synonyms, Homonyms, and Ontologies
- Indexing of Documents
- Measuring Retrieval Effectiveness
- Information Retrieval and Structured Data
3. Information Retrieval Systems
- Information retrieval (IR) systems use a simpler data model than database systems
  - Information organized as a collection of documents
  - Documents are unstructured, no schema
- Information retrieval locates relevant documents, on the basis of user input such as keywords or example documents
  - e.g., find documents containing the words "database systems"
- Can be used even on textual descriptions provided with non-textual data such as images
4. Keyword Search
- In full-text retrieval, all the words in each document are considered to be keywords
  - We use the word term to refer to the words in a document
- Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not
  - Ands are implicit, even if not explicitly specified
- Ranking of documents on the basis of estimated relevance to a query is critical
- Relevance ranking is based on factors such as
  - Term frequency: frequency of occurrence of a query keyword in the document
  - Inverse document frequency: how many documents the query keyword occurs in
    - Fewer → give more importance to the keyword
  - Hyperlinks to documents
    - More links to a document → the document is more important
5. Relevance Ranking Using Terms
- TF-IDF (term frequency / inverse document frequency) ranking
  - Let n(d) = number of terms in the document d
  - n(d, t) = number of occurrences of term t in the document d
- Relevance of a document d to a term t:
  TF(d, t) = log( 1 + n(d, t) / n(d) )
  - The log factor is to avoid excessive weight to frequent terms
- Relevance of document d to query Q:
  r(d, Q) = Σ_{t ∈ Q} TF(d, t) / n(t)
  - IDF = 1 / n(t), where n(t) is the number of documents that contain the term t
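As a concrete illustration, here is a minimal Python sketch of exactly this ranking scheme; the toy corpus, the query, and all function names are assumptions made for the example, not part of the original slides.

```python
import math

def tf(doc, term):
    """TF(d, t) = log(1 + n(d, t) / n(d)), as defined above."""
    return math.log(1 + doc.count(term) / len(doc))

def relevance(doc, query, corpus):
    """r(d, Q) = sum over t in Q of TF(d, t) / n(t),
    where n(t) = number of documents that contain t."""
    score = 0.0
    for t in set(query):
        n_t = sum(1 for d in corpus if t in d)  # document frequency of t
        if n_t:
            score += tf(doc, t) / n_t
    return score

# Toy corpus: each document is a list of terms.
corpus = [["database", "systems", "database"],
          ["data", "mining", "systems"],
          ["information", "retrieval"]]
query = ["database", "systems"]
for doc in corpus:
    print(doc, round(relevance(doc, query, corpus), 3))
```

Running it scores the first document highest, since it contains both query terms and "database" twice.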
6. Relevance Ranking Using Terms (Cont.)
- Most systems add to the above model
  - Words that occur in the title, author list, section headings, etc. are given greater importance
  - Words whose first occurrence is late in the document are given lower importance
  - Very common words such as "a", "an", "the", "it", etc. are eliminated
    - Called stop words
  - Proximity: if the query keywords occur close together in the document, the document has higher importance than if they occur far apart
- Documents are returned in decreasing order of relevance score
  - Usually only the top few documents are returned, not all
7. Synonyms and Homonyms
- Synonyms
  - E.g., document: "motorcycle repair"; query: "motorcycle maintenance"
    - Need to realize that "maintenance" and "repair" are synonyms
  - System can extend the query as "motorcycle and (repair or maintenance)"
- Homonyms
  - E.g., "object" has different meanings as noun/verb
  - Can disambiguate meanings (to some extent) from the context
- Extending queries automatically using synonyms can be problematic
  - Need to understand the intended meaning in order to infer synonyms
    - Or verify synonyms with the user
  - Synonyms may have other meanings as well
8. Indexing of Documents
- An inverted index maps each keyword Ki to the set Si of documents that contain the keyword
  - Documents identified by identifiers
- Inverted index may record
  - Keyword locations within the document, to allow proximity-based ranking
  - Counts of the number of occurrences of the keyword, to compute TF
- and operation: finds documents that contain all of K1, K2, ..., Kn
  - Intersection S1 ∩ S2 ∩ ... ∩ Sn
- or operation: finds documents that contain at least one of K1, K2, ..., Kn
  - Union S1 ∪ S2 ∪ ... ∪ Sn
- Each Si is kept sorted to allow efficient intersection/union by merging (see the sketch below)
  - not can also be efficiently implemented by merging of sorted lists
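To make the merge concrete, here is a minimal sketch of such an index in Python, with sorted posting lists and an and operation implemented as a sorted-list intersection; the documents, identifiers, and function names are all illustrative.

```python
from collections import defaultdict

def build_index(docs):
    """Map each keyword Ki to a sorted list Si of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def intersect(s1, s2):
    """'and': merge two sorted posting lists, keeping ids present in both."""
    result, i, j = [], 0, 0
    while i < len(s1) and j < len(s2):
        if s1[i] == s2[j]:
            result.append(s1[i]); i += 1; j += 1
        elif s1[i] < s2[j]:
            i += 1
        else:
            j += 1
    return result

def union(s1, s2):
    """'or': all ids appearing in either list (via sets, for brevity)."""
    return sorted(set(s1) | set(s2))

docs = ["database systems", "database design", "operating systems"]
index = build_index(docs)
print(intersect(index["database"], index["systems"]))  # -> [0]
print(union(index["database"], index["systems"]))      # -> [0, 1, 2]
```

A production system would merge the sorted lists for union the same way intersect does; the set version is only a shortcut for the sketch.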
9. Word-Level Inverted File
(Figure: a word-level inverted file, with the lexicon of terms on the left and the posting lists they point to on the right.)
10. Measuring Retrieval Effectiveness
- Information-retrieval systems save space by using index structures that support only approximate retrieval. May result in:
  - False negative (false drop): some relevant documents may not be retrieved
  - False positive: some irrelevant documents may be retrieved
- For many applications a good index should not permit any false drops, but may permit a few false positives
- Relevant performance metrics (computed in the sketch below):
  - Precision: what percentage of the retrieved documents are relevant to the query
  - Recall: what percentage of the documents relevant to the query were retrieved
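A small sketch of the two metrics in Python; the document identifiers are made up, chosen so the result matches the 75%/50% example on the next slide.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant; 6 documents are relevant overall.
print(precision_recall(retrieved=[1, 2, 3, 4], relevant=[1, 2, 3, 5, 6, 7]))
# -> (0.75, 0.5), i.e., precision 75% at recall 50%
```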
11. Measuring Retrieval Effectiveness (Cont.)
- Recall vs. precision tradeoff
  - Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision
- Measures of retrieval effectiveness:
  - Recall as a function of the number of documents fetched, or
  - Precision as a function of recall
    - Equivalently, as a function of the number of documents fetched
  - E.g., precision of 75% at a recall of 50%, and 60% at a recall of 75%
- Problem: which documents are actually relevant, and which are not
12. Information Retrieval and Structured Data
- Information retrieval systems originally treated documents as a collection of words
- Information extraction systems infer structure from documents, e.g.:
  - Extraction of house attributes (size, address, number of bedrooms, etc.) from a text advertisement
  - Extraction of the topic and the people named from a news article
- Relations or XML structures used to store extracted data
- System seeks connections among data to answer queries
  - Question answering systems
13. Probabilities and Statistics
14. Probabilities
1. Event E is defined as any subset of the sample space
2. f(x) is called a probability distribution function (pdf)
15. Conditional Probabilities
- Conditional probability of E, given that G occurred, is
  P(E | G) = P(E ∩ G) / P(G)
- E and G are independent if and only if
  P(E ∩ G) = P(E) P(G)
Expected Value
- Expected value of X is E(X) = Σ_x x · P(X = x)
- For a continuous random variable with pdf f(x), E(X) = ∫ x f(x) dx
- E(X + Y) = E(X) + E(Y), E(aX + b) = aE(X) + b
16. Variance
- Var(X) = E[ (X − E(X))² ]
- It indicates how the values of a random variable are distributed around its expected value
- Standard deviation of X is defined as σ(X) = √Var(X)
- Var(X + Y) = Var(X) + Var(Y) (for independent X and Y)
- Var(aX + b) = a² Var(X)
- P( |S − E(S)| ≥ r ) ≤ Var(S) / r² (Chebyshev's inequality)
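As a worked instance of Chebyshev's inequality, with numbers chosen purely for illustration:

```latex
% Suppose Var(S) = 4 and take r = 4: the probability that S deviates
% from its mean by at least 4 is at most 4 / 4^2 = 0.25.
\[
P\bigl(\lvert S - \mathrm{E}(S)\rvert \ge 4\bigr)
  \;\le\; \frac{\mathrm{Var}(S)}{4^{2}} \;=\; \frac{4}{16} \;=\; 0.25 .
\]
```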
17. Random Distributions
- Normal: E(X) = μ, Var(X) = σ²
- Binomial (n Bernoulli trials with success probability p): E(X) = np, Var(X) = np(1 − p)
18. Normal Distributions
(Figure: normal density curves centered at E(X).)
19. Random Distributions
- Geometric: E(X) = 1/p, Var(X) = (1 − p)/p²
- Poisson: E(X) = Var(X) = m, where m is the Poisson parameter
- Uniform on (a, b): f(x) = 1/(b − a), E(X) = (a + b)/2, Var(X) = (b − a)²/12
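These formulas are easy to sanity-check by simulation; the sketch below does so for the uniform distribution (a, b, and the sample size are arbitrary choices).

```python
import random

a, b, n = 2.0, 8.0, 100_000
xs = [random.uniform(a, b) for _ in range(n)]
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n

print(round(mean, 3), (a + b) / 2)       # sample mean vs. (a + b)/2 = 5.0
print(round(var, 3), (b - a) ** 2 / 12)  # sample variance vs. (b - a)^2/12 = 3.0
```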
20. Data and Their Characteristics
21. Types of Attributes
- There are different types of attributes:
  - Nominal
    - Examples: ID numbers, eye color, zip codes
  - Ordinal
    - Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  - Interval
    - Examples: calendar dates, temperatures in Celsius or Fahrenheit
  - Ratio
    - Examples: temperature in Kelvin, length, time, counts
22. Properties of Attribute Values
- The type of an attribute depends on which of the following properties it possesses:
  - Distinctness: = ≠
  - Order: < >
  - Addition: + −
  - Multiplication: * /
- Nominal attribute: distinctness
- Ordinal attribute: distinctness, order
- Interval attribute: distinctness, order, addition
- Ratio attribute: all 4 properties
24. Discrete and Continuous Attributes
- Discrete attribute
  - Has only a finite or countably infinite set of values
  - Examples: zip codes, counts, or the set of words in a collection of documents
  - Often represented as integer variables
  - Note: binary attributes are a special case of discrete attributes
- Continuous attribute
  - Has real numbers as attribute values
  - Examples: temperature, height, or weight
  - Practically, real values can only be measured and represented using a finite number of digits
  - Continuous attributes are typically represented as floating-point variables
25. Data Matrix
- If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
- Such a data set can be represented by an m-by-n matrix, where there are m rows, one for each object, and n columns, one for each attribute (see the example below)
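For instance (with made-up values), four objects measured on three numeric attributes form a 4-by-3 data matrix:

```python
# m = 4 objects (rows), n = 3 numeric attributes (columns):
# height (m), weight (kg), age (years)
data_matrix = [
    [1.70, 65.0, 31],
    [1.82, 80.5, 45],
    [1.65, 54.2, 23],
    [1.77, 71.3, 38],
]
m, n = len(data_matrix), len(data_matrix[0])
print(f"{m} objects with {n} attributes each")
print(data_matrix[1])  # one object = one point in n-dimensional space
```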
26. Data Quality
- What kinds of data quality problems?
- How can we detect problems with the data?
- What can we do about these problems?
- Examples of data quality problems:
  - Noise and outliers
  - Missing values
  - Duplicate data
27. Noise
- Noise refers to modification of original values
  - Examples: distortion of a person's voice when talking on a poor phone, and "snow" on a television screen
(Figures: two sine waves, and the same two sine waves with noise.)
28. Outliers
- Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set
29. Data Preprocessing
- Aggregation
- Sampling
- Dimensionality Reduction
- Feature subset selection
- Feature creation
- Discretization and Binarization
- Attribute Transformation
30. Aggregation
- Combining two or more attributes (or objects) into a single attribute (or object)
- Purpose:
  - Data reduction: reduce the number of attributes or objects
  - Change of scale: cities aggregated into regions, states, countries, etc.
  - More stable data: aggregated data tends to have less variability
31. Sampling
- Sampling is the main technique employed for data selection
  - It is often used for both the preliminary investigation of the data and the final data analysis
- Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming
- Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming
32. Sampling
- The key principle for effective sampling is the following:
  - Using a sample will work almost as well as using the entire data set, if the sample is representative
  - A sample is representative if it has approximately the same property (of interest) as the original set of data
33. Types of Sampling
- Simple random sampling
  - There is an equal probability of selecting any particular item
- Sampling without replacement
  - As each item is selected, it is removed from the population
- Sampling with replacement
  - Objects are not removed from the population as they are selected for the sample
  - In sampling with replacement, the same object can be picked more than once
- Stratified sampling
  - Split the data into several partitions, then draw random samples from each partition
(All three variants are sketched below.)
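A minimal Python sketch of the three variants; the population, the stratum key, and the sample sizes are arbitrary choices for illustration.

```python
import random
from collections import defaultdict

population = list(range(100))

# Simple random sampling without replacement: selected items are not reused.
no_replacement = random.sample(population, 10)

# Sampling with replacement: the same object can be picked more than once.
with_replacement = [random.choice(population) for _ in range(10)]

# Stratified sampling: partition the data, then sample from each partition.
def stratified_sample(items, key, per_stratum):
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    return [x for group in strata.values()
            for x in random.sample(group, min(per_stratum, len(group)))]

# Four strata (remainder mod 4), two items drawn from each.
print(stratified_sample(population, key=lambda x: x % 4, per_stratum=2))
```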
34. Curse of Dimensionality
- When dimensionality increases, data becomes increasingly sparse in the space that it occupies
- Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
- Illustration (see the sketch below): randomly generate 500 points, then compute the difference between the max and min distance between any pair of points
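A sketch of that experiment in Python, scaled down to 100 points so the pairwise loop stays quick (all parameters are illustrative). The relative contrast between the farthest and nearest pair shrinks as the dimensionality grows:

```python
import math, random

def relative_contrast(n_points, dims):
    """(max pairwise distance - min pairwise distance) / min distance."""
    pts = [[random.random() for _ in range(dims)] for _ in range(n_points)]
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    d = [dist(pts[i], pts[j])
         for i in range(n_points) for j in range(i + 1, n_points)]
    return (max(d) - min(d)) / min(d)

random.seed(0)
for dims in (2, 10, 50):
    print(dims, round(relative_contrast(100, dims), 2))
```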
35. Discretization Using Class Labels
(Figures: the same data discretized into 3 categories for both x and y, and into 5 categories for both x and y.)
36. Discretization Without Using Class Labels
(Figures: the original data, and its discretization by equal interval width, equal frequency, and K-means.)
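A minimal sketch of the first two unsupervised schemes; the K-means variant is omitted for brevity, and the data values are made up.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign (roughly) the same number of values to each bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

data = [1, 2, 2, 3, 4, 9, 10, 11, 40, 41]
print(equal_width_bins(data, 3))      # -> [0, 0, 0, 0, 0, 0, 0, 0, 2, 2]
print(equal_frequency_bins(data, 3))  # -> roughly len(data)/k values per bin
```

Note how the two outlying values dominate the equal-width ranges, while equal frequency keeps the bins balanced.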
37. Similarity and Dissimilarity
- Similarity
  - Numerical measure of how alike two data objects are
  - Is higher when objects are more alike
  - Often falls in the range [0, 1]
- Dissimilarity
  - Numerical measure of how different two data objects are
  - Lower when objects are more alike
  - Minimum dissimilarity is often 0
  - Upper limit varies
- Proximity refers to a similarity or dissimilarity
38. Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
(Table: similarity and dissimilarity definitions for each attribute type.)
39. Euclidean Distance
- Euclidean distance:
  d(p, q) = √( Σ_k (p_k − q_k)² ), summing over k = 1, ..., n
- where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q
40. Euclidean Distance
(Figure: example points and their pairwise distance matrix.)
41. Minkowski Distance
- Minkowski distance is a generalization of Euclidean distance:
  d(p, q) = ( Σ_k |p_k − q_k|^r )^(1/r), summing over k = 1, ..., n
- where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q (see the sketch below)
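A minimal sketch of the formula, with r = 1 (city block) and r = 2 (Euclidean) as the usual special cases; the points are made up.

```python
def minkowski(p, q, r):
    """d(p, q) = (sum over k of |p_k - q_k|^r)^(1/r)."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

p, q = (0, 2), (3, 6)      # illustrative 2-D points
print(minkowski(p, q, 1))  # 7.0  (r = 1: city-block distance)
print(minkowski(p, q, 2))  # 5.0  (r = 2: Euclidean distance)
```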
42. Minkowski Distance
(Figure: the same example points with distance matrices for different values of r.)
43. Common Properties of a Distance
- Distances, such as the Euclidean distance, have some well-known properties:
  - d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q (positive definiteness)
  - d(p, q) = d(q, p) for all p and q (symmetry)
  - d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r (triangle inequality)
- where d(p, q) is the distance (dissimilarity) between points (data objects) p and q
- A distance that satisfies these properties is called a metric
44. Common Properties of a Similarity
- Similarities also have some well-known properties:
  - s(p, q) = 1 (or maximum similarity) only if p = q
  - s(p, q) = s(q, p) for all p and q (symmetry)
- where s(p, q) is the similarity between points (data objects) p and q
45. Similarity Between Binary Vectors
- A common situation is that objects, p and q, have only binary attributes
- Compute similarities using the following quantities:
  - M01 = the number of attributes where p was 0 and q was 1
  - M10 = the number of attributes where p was 1 and q was 0
  - M00 = the number of attributes where p was 0 and q was 0
  - M11 = the number of attributes where p was 1 and q was 1
- Simple Matching and Jaccard coefficients:
  - SMC = number of matches / number of attributes
        = (M11 + M00) / (M01 + M10 + M11 + M00)
  - J = number of 1-1 matches / number of not-both-zero attribute values
      = M11 / (M01 + M10 + M11)
46. SMC versus Jaccard: Example
- p = 1 0 0 0 0 0 0 0 0 0
- q = 0 0 0 0 0 0 1 0 0 1
- M01 = 2 (the number of attributes where p was 0 and q was 1)
- M10 = 1 (the number of attributes where p was 1 and q was 0)
- M00 = 7 (the number of attributes where p was 0 and q was 0)
- M11 = 0 (the number of attributes where p was 1 and q was 1)
- SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
- J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
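A short Python sketch that recomputes this example; the function name is an assumption.

```python
def binary_similarities(p, q):
    """Return (SMC, Jaccard) for two equal-length binary vectors."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jac = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jac

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_similarities(p, q))  # -> (0.7, 0.0), matching the slide
```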