Data Engineering - PowerPoint PPT Presentation

Slides: 29

Provided by: aaa23

1
Data Engineering
2
A Story of Data-Related Issues

You receive an email from a medical researcher
concerning a project that you are eager to work
on.
Hi,
I've attached the data file that I mentioned in
my previous email.
Each line contains the information for a single
patient and consists of five fields.
We want to predict the last field using the
other fields.
I don't have time to provide any more
information about the data since I'm going out of
town for a couple of days, but hopefully that
won't slow you down too much.
Thanks and see you in a couple of days.

3
Continued

Despite some misgivings, you proceed to analyze
the data. The first few rows of the file are as
follows

Nothing looks strange. You put your doubts
aside and start the analysis.
Two days later you arrive for the meeting,
and while waiting for others to arrive, you
strike up a conversation with a statistician who
is working on the project.
4
Continued

Statistician: So, you got the data for all the
patients?
Data Miner: Yes. I haven't had much time for
analysis, but I do have a few interesting
results.
Statistician: Amazing. There were so many data
issues with this set of patients that I couldn't
do much.
Data Miner: Oh? I didn't hear about any possible
problems.
Statistician: Well, first there is field 5, the
variable we want to predict. It's common
knowledge among people who analyze this type of
data that results are better if you work with the
log of the values, but I didn't discover this
until later. Was it mentioned to you?
Data Miner: No.
Statistician: But surely you heard about what
happened to field 4? It's supposed to be measured
on a scale from 1 to 10, with 0 indicating a
missing value, but because of a data entry error,
all 10's were changed into 0's.
Data Miner: Interesting. Were there any other
problems?
Statistician: Yes, fields 2 and 3 are basically
the same, but I assume that you probably noticed
that.
Data Miner: Yes, but these fields were only weak
predictors of field 5.

5
Continued

Statistician: Anyway, given all those problems,
I'm surprised you were able to accomplish
anything.
Data Miner: True, but my results are really quite
good. Field 1 is a very strong predictor of field
5. I'm surprised that this wasn't noticed before.
Statistician: What? Field 1 is just an
identification number.
Data Miner: Nonetheless, my results speak for
themselves.
Statistician: Oh, no! I just remembered. We
assigned ID numbers after we sorted the records
based on field 5. There is a strong connection,
but it's meaningless. Sorry.

Lesson: Get to know your data!
6
Formally: What is Data?

Collection of data objects and their attributes
An attribute is a property or characteristic of
an object
Examples: eye color of a person, temperature,
etc.
An attribute is also known as a variable, field,
characteristic, or feature
A collection of attributes describes an object
An object is also known as a record, point, case,
sample, entity, or instance

(Figure: a data table with attributes as columns and objects as rows.)
7
Employee Age and ID Number

Two attributes of an employee are ID and age.
Both can be represented as integers.
However, while it is reasonable to talk about the
average age of an employee, it makes no sense to
talk about the average employee ID.
The only valid operation for employee IDs is to
test whether they are equal.
There is no hint of this limitation, however,
when integers are used to represent the employee
ID attribute.
Knowing the type of an attribute is important
because it tells us which properties of the
measured values are consistent with the
underlying properties of the attribute, and
therefore, it allows us to avoid foolish actions,
such as computing the average employee ID.

8
Types of Attributes

There are different types of attributes:
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips
on a scale from 1-10), grades, height in tall,
medium, short
Interval
Examples: calendar dates, temperatures in Celsius
or Fahrenheit
Ratio
Examples: temperature in Kelvin, length, time,
counts

9
Properties of Attribute Values

The type of an attribute depends on which of the
following properties it possesses:
Distinctness: =, ≠
Order: <, >
Addition: +, -
Multiplication: *, /
Nominal attribute: distinctness
Ordinal attribute: distinctness and order
Interval attribute: distinctness, order, and
addition
Ratio attribute: all four properties
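
The hierarchy above can be sketched in code. This is only an illustration, assuming a hypothetical `AttributeType` enum and a hand-written table of which operation needs which type; the names `REQUIRED_TYPE`, `is_meaningful`, and `safe_mean` are not part of the slides.

```python
from enum import IntEnum
from statistics import mean


class AttributeType(IntEnum):
    """Attribute types, ordered by how many properties they possess."""
    NOMINAL = 1   # distinctness
    ORDINAL = 2   # distinctness and order
    INTERVAL = 3  # distinctness, order, and addition
    RATIO = 4     # all four properties


# Hypothetical table: the weakest attribute type that supports each operation.
REQUIRED_TYPE = {
    "equals": AttributeType.NOMINAL,
    "less_than": AttributeType.ORDINAL,
    "difference": AttributeType.INTERVAL,
    "mean": AttributeType.INTERVAL,
    "ratio": AttributeType.RATIO,
}


def is_meaningful(operation, attr_type):
    """Return True if the operation respects the attribute's properties."""
    return attr_type >= REQUIRED_TYPE[operation]


def safe_mean(values, attr_type):
    """Compute the mean only when the attribute type permits addition."""
    if not is_meaningful("mean", attr_type):
        raise TypeError(f"mean is not meaningful for {attr_type.name} attributes")
    return mean(values)


ages = [25, 35, 45]                 # ratio attribute
employee_ids = [1001, 1002, 1003]   # nominal attribute stored as integers

average_age = safe_mean(ages, AttributeType.RATIO)
id_mean_allowed = is_meaningful("mean", AttributeType.NOMINAL)
```

Calling `safe_mean(employee_ids, AttributeType.NOMINAL)` raises a `TypeError`, which is the kind of guard that plain integers do not provide on their own.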

10
Discrete and Continuous Attributes

Discrete Attribute
Has only a finite or countably infinite set of
values
Examples: zip codes, counts, or the set of words
in a collection of documents
Often represented as integer variables
Note: binary attributes are a special case of
discrete attributes
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight
Practically, real values can only be measured and
represented using a finite number of digits.
Continuous attributes are typically represented
as floating-point variables.
Typically, nominal and ordinal attributes are
binary or discrete, while interval and ratio
attributes are continuous. However, count
attributes, which are discrete, are also ratio
attributes.

11
Asymmetric Attributes

For asymmetric attributes, only presence (a
non-zero attribute value) is regarded as
important.
Example: transaction data
Items such as Bread and Coke are in fact
(asymmetric) attributes, and only their presence
(i.e., value 1 or true) is important.

12
Types of data sets

Record: data matrix, document data, transaction
data
Graph: World Wide Web, molecular structures
Ordered: spatial data, temporal data, sequential
data, genetic sequence data

13
Record Data

Data that consists of a collection of records,
each of which consists of a fixed set of
attributes

14
Document Data

Each document becomes a term (word) vector,
each term is a component (attribute) of the
vector,
the value of each component is the number of
times the corresponding term occurs in the
document.
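
A minimal sketch of this representation, assuming whitespace tokenization and lowercasing (both simplifications); the function name `term_vectors` and the two example documents are made up for illustration.

```python
from collections import Counter


def term_vectors(documents):
    """Turn each document into a vector of term counts over a shared vocabulary."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    # One component (attribute) per distinct term, in a fixed order.
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


docs = ["the team won the game", "the game was lost"]
vocabulary, vectors = term_vectors(docs)
```

Each row of `vectors` lines up with `vocabulary`, so a term that does not occur in a document gets the value 0.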

15
Transaction Data

A special type of record data, where
each record (transaction) involves a set of
items.
For example, consider a grocery store. The set
of products purchased by a customer during one
shopping trip constitutes a transaction, while the
individual products that were purchased are the
items.
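
One way to hold such data, sketched here under the assumption that each transaction is a Python set of item names, is to expand it into a table of binary attributes, one per item. The helper `to_binary_matrix` and the example baskets are hypothetical.

```python
def to_binary_matrix(transactions):
    """Represent each transaction as a row of 0/1 values, one column per item."""
    items = sorted(set().union(*transactions))
    rows = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, rows


# Hypothetical shopping trips; each set is one transaction.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
]
items, rows = to_binary_matrix(transactions)
```

Only the 1 entries carry information here, since the absence of an item from a basket is the common case.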

16
Data with Relationships among Objects

Examples: generic graph and HTML links

Web search engines collect and process Web pages
to extract their contents. It is well known,
however, that the links to and from each page
provide a great deal of information about the
relevance of a Web page to a query, and thus,
must also be taken into consideration.
17
Data with Objects That Are Graphs, E.g. Chemical Data

Benzene molecule: C6H6

Substructure mining: Which substructures occur
frequently in a set of compounds? Ascertain
whether the presence of any of these
substructures is associated with the presence or
absence of certain chemical properties, such as
melting point or heat of formation.
18
Data Quality

What are the kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
Data collection errors
Noise
Outliers
Missing values
Duplicate data

We will study (2) and (3) in detail later, after
classification.
19
Outliers

Outliers are data objects with characteristics
that are considerably different from those of
most of the other data objects in the data set.

20
Missing Values

Reasons for missing values:
Information is not collected (e.g., people
decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to
children)
Handling missing values:
Eliminate data objects
Estimate missing values
Ignore the missing value during analysis
Replace with all possible values (weighted by
their probabilities)
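
The first three strategies can be sketched as follows. This assumes records are dictionaries and that a missing value is encoded as `None`; estimating by the mean of the observed values is just one possible estimator, and all names are illustrative.

```python
from statistics import mean

MISSING = None  # assumption: missing values are encoded as None


def drop_incomplete(records):
    """Eliminate data objects that have at least one missing value."""
    return [r for r in records if MISSING not in r.values()]


def impute_mean(records, attribute):
    """Estimate missing values of one attribute by the mean of the observed values."""
    observed = [r[attribute] for r in records if r[attribute] is not MISSING]
    fill = mean(observed)
    return [
        {**r, attribute: fill if r[attribute] is MISSING else r[attribute]}
        for r in records
    ]


def mean_ignoring_missing(records, attribute):
    """Ignore missing values during analysis (here: when computing a mean)."""
    return mean(r[attribute] for r in records if r[attribute] is not MISSING)


# Hypothetical records in which two people declined to give one value each.
people = [
    {"age": 30, "weight": 70.0},
    {"age": None, "weight": 82.0},
    {"age": 50, "weight": None},
]
complete = drop_incomplete(people)
imputed_ages = [r["age"] for r in impute_mean(people, "age")]
average_weight = mean_ignoring_missing(people, "weight")
```

Dropping objects here discards two of the three records, which shows why elimination can be costly when many objects have a missing value.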

21
Data Preprocessing

Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation

22
Aggregation

Sometimes "less is more" and this is the case
with aggregation, the combining of two or more
objects into a single object.

One way to aggregate transactions recorded at
several stores is to replace all the transactions
of a single store with a single storewide
transaction. This reduces the number of data
objects to the number of stores.

How are the values of each attribute combined
across all the records of a group (a store, for
instance)?
Some quantitative attributes, e.g. price, are
typically aggregated by taking a sum or an
average.
Other attributes, e.g. item or date, are omitted.
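
A small sketch of this storewide aggregation, assuming each transaction is a dictionary with hypothetical `store`, `date`, `item`, and `price` fields and that price is combined by summing:

```python
from collections import defaultdict


def aggregate_by_store(transactions):
    """Combine all transactions of a store into one storewide total price."""
    totals = defaultdict(float)
    for t in transactions:
        # Price is summed; item and date are dropped from the aggregate.
        totals[t["store"]] += t["price"]
    return dict(totals)


# Hypothetical transactions from two stores.
transactions = [
    {"store": "S1", "date": "day 1", "item": "watch", "price": 25.0},
    {"store": "S1", "date": "day 2", "item": "shoes", "price": 75.0},
    {"store": "S2", "date": "day 1", "item": "battery", "price": 20.0},
]
storewide = aggregate_by_store(transactions)
```

Three data objects become two, one per store. Replacing the sum with an average would only change the body of the loop.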

23
Sampling

Sampling is the main technique employed for data
selection.
It is often used for both the preliminary
investigation of the data and the final data
analysis.
Sampling is used in data mining because
processing the entire set of data of interest is
too expensive or time consuming.
The key principle for effective sampling is the
following:
Using a sample will work almost as well as using
the entire data set, if the sample is
representative.
A sample is representative if it has
approximately the same property (of interest) as
the original set of data.
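
The sketch below shows simple random sampling without replacement, plus stratified sampling as one way (not named in the slides) to keep a property of interest, here the class proportions, the same in the sample as in the original data. The data set and function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def simple_random_sample(data, n, seed=0):
    """Draw n objects without replacement, each with equal probability."""
    return random.Random(seed).sample(data, n)


def stratified_sample(data, key, fraction, seed=0):
    """Draw the same fraction from every group so group proportions are preserved."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for obj in data:
        groups[key(obj)].append(obj)
    sample = []
    for members in groups.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 900 objects of class "a" and 100 of class "b".
data = [("a", i) for i in range(900)] + [("b", i) for i in range(100)]

random_sample = simple_random_sample(data, 50)
sample = stratified_sample(data, key=lambda obj: obj[0], fraction=0.1)
class_counts = Counter(label for label, _ in sample)
```

With a 10% fraction the stratified sample keeps the 9-to-1 class ratio exactly, whereas a simple random sample only matches it approximately.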

24
Sample Size

(Figure: the same data set shown with 8000 points, 2000 points, and 500 points.)
25
Curse of Dimensionality

When dimensionality increases, data becomes
increasingly sparse in the space that it occupies.
For classification, this can mean that there are
not enough data objects to allow the creation of
a model that reliably assigns a class to all
possible objects.
Definitions of density and distance between
points, which are critical for clustering and
outlier detection, become less meaningful.
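
A rough numerical illustration of the distance effect, assuming points drawn uniformly at random from the unit hypercube and Euclidean distance. The measure used, (max distance - min distance) / min distance from one query point, is a common way to show the effect and is an assumption here, as are the function name and parameters.

```python
import math
import random


def relative_contrast(dim, n_points=500, seed=0):
    """Spread of distances from a random query point, relative to the nearest one."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


low_dim = relative_contrast(2)
high_dim = relative_contrast(100)

for dim in (2, 10, 100):
    print(dim, round(relative_contrast(dim), 2))
```

The printed contrast shrinks as the dimension grows: the nearest and farthest points end up at nearly the same distance, so "near" and "far" lose much of their meaning.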

26
Feature Subset Selection

Redundant features
duplicate much or all of the information
contained in one or more other attributes
Example: purchase price of a product and the
amount of sales tax paid
Irrelevant features
contain no information that is useful for the
data mining task at hand
Example: a student's ID is often irrelevant to the
task of predicting the student's GPA

27
Feature Subset Selection

Techniques
Brute-force approach
Try all possible feature subsets as input to the
data mining algorithm
Embedded approaches
Feature selection occurs naturally as part of
the data mining algorithm
Filter approaches
Features are selected before the data mining
algorithm is run
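
The brute-force approach can be sketched as an enumeration of every non-empty subset. The scoring function below is a made-up stand-in for running the data mining algorithm on each subset, and the feature names are hypothetical.

```python
from itertools import combinations


def all_feature_subsets(features):
    """Yield every non-empty subset of the features (2**n - 1 subsets)."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset


def brute_force_select(features, evaluate):
    """Return the subset with the highest score under the given evaluation."""
    return max(all_feature_subsets(features), key=evaluate)


# Toy stand-in for "train and evaluate a model": reward useful features,
# lightly penalize everything else.
USEFUL = {"hours_studied", "attendance"}


def toy_score(subset):
    chosen = set(subset)
    return len(chosen & USEFUL) - 0.1 * len(chosen - USEFUL)


features = ["student_id", "hours_studied", "attendance"]
subset_count = len(list(all_feature_subsets(features)))
best = brute_force_select(features, toy_score)
```

Three features already give 7 subsets to evaluate, and the count doubles with each added feature, which is why this approach is rarely practical beyond small feature sets.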

28
Discretization and Binarization

Some data mining algorithms, especially certain
classification algorithms, require that the data
be in the form of categorical attributes.
Algorithms that find association patterns require
that the data be in the form of binary
attributes.
Thus it is often necessary to transform a
continuous attribute into a categorical attribute
(discretization), and both continuous and
discrete attributes may need to be transformed
into one or more binary attributes
(binarization).
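
A minimal sketch of both steps, assuming equal-width binning for the discretization (one of several possible schemes) and one binary attribute per category for the binarization. The function names and the height values are illustrative only.

```python
def equal_width_bins(values, n_bins):
    """Discretization: map each continuous value to a bin index 0..n_bins-1.

    Assumes the values are not all identical (otherwise the width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def one_hot(categories, n_categories):
    """Binarization: one binary attribute per category value."""
    return [[1 if c == k else 0 for k in range(n_categories)] for c in categories]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
bins = equal_width_bins(heights_cm, 4)
binary = one_hot(bins, 4)
```

Each row of `binary` has exactly one 1, marking the bin the original value fell into. Algorithms that need binary attributes can then treat each column as a separate attribute.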