Data Engineering - PowerPoint PPT Presentation

Slides: 29
Provided by: aaa23
Transcript and Presenter's Notes
1
Data Engineering
2
A Story of Data-Related Issues
  • You receive an email from a medical researcher
    concerning a project that you are eager to work
    on.
  • Hi,
  • I've attached the data file that I mentioned in
    my previous email.
  • Each line contains the information for a single
    patient and consists of five fields.
  • We want to predict the last field using the
    other fields.
  • I don't have time to provide any more
    information about the data since I'm going out of
    town for a couple of days, but hopefully that
    won't slow you down too much.
  • Thanks and see you in a couple of days.

3
Continued
  • Despite some misgivings, you proceed to analyze
    the data. The first few rows of the file are as
    follows

Nothing looks strange. You put your doubts
aside and start the analysis.
Two days later you arrive for the meeting,
and while waiting for others to arrive, you
strike up a conversation with a statistician who
is working on the project.
4
Continued
  • Statistician: So, you got the data for all the
    patients?
  • Data Miner: Yes. I haven't had much time for
    analysis, but I do have a few interesting
    results.
  • Statistician: Amazing. There were so many data
    issues with this set of patients that I couldn't
    do much.
  • Data Miner: Oh? I didn't hear about any possible
    problems.
  • Statistician: Well, first there is field 5, the
    variable we want to predict. It's common
    knowledge among people who analyze this type of
    data that results are better if you work with the
    log of the values, but I didn't discover this
    until later. Was it mentioned to you?
  • Data Miner: No.
  • Statistician: But surely you heard about what
    happened to field 4? It's supposed to be measured
    on a scale from 1 to 10, with 0 indicating a
    missing value, but because of a data entry error,
    all 10's were changed into 0's.
  • Data Miner: Interesting. Were there any other
    problems?
  • Statistician: Yes, fields 2 and 3 are basically
    the same, but I assume that you probably noticed
    that.
  • Data Miner: Yes, but these fields were only weak
    predictors of field 5.

5
Continued
  • Statistician: Anyway, given all those problems,
    I'm surprised you were able to accomplish
    anything.
  • Data Miner: True, but my results are really quite
    good. Field 1 is a very strong predictor of field
    5. I'm surprised that this wasn't noticed before.
  • Statistician: What? Field 1 is just an
    identification number.
  • Data Miner: Nonetheless, my results speak for
    themselves.
  • Statistician: Oh, no! I just remembered. We
    assigned ID numbers after we sorted the records
    based on field 5. There is a strong connection,
    but it's meaningless. Sorry.

Lesson: Get to know your data!
6
Formally: What is Data?
  • Collection of data objects and their attributes
  • An attribute is a property or characteristic of
    an object
  • Examples: eye color of a person, temperature,
    etc.
  • An attribute is also known as a variable, field,
    characteristic, or feature
  • A collection of attributes describes an object
  • An object is also known as a record, point, case,
    sample, entity, or instance

[Figure: a data table with attributes as columns and objects as rows]
7
Employee Age and ID Number
  • Two attributes of an employee are ID and age.
  • Both can be represented as integers.
  • However, while it is reasonable to talk about the
    average age of an employee, it makes no sense to
    talk about the average employee ID.
  • The only valid operation for employee IDs is to
    test whether they are equal.
  • There is no hint of this limitation, however,
    when integers are used to represent the employee
    ID attribute.
  • Knowing the type of an attribute is important
    because it tells us which properties of the
    measured values are consistent with the
    underlying properties of the attribute, and
    therefore, it allows us to avoid foolish actions,
    such as computing the average employee ID.

8
Types of Attributes
  • There are different types of attributes:
  • Nominal
  • Examples: ID numbers, eye color, zip codes
  • Ordinal
  • Examples: rankings (e.g., taste of potato chips
    on a scale from 1-10), grades, height in {tall,
    medium, short}
  • Interval
  • Examples: calendar dates, temperatures in Celsius
    or Fahrenheit
  • Ratio
  • Examples: temperature in Kelvin, length, time,
    counts

9
Properties of Attribute Values
  • The type of an attribute depends on which of the
    following properties it possesses:
  • Distinctness: = and ≠
  • Order: < and >
  • Addition: + and -
  • Multiplication: * and /
  • Nominal attribute: distinctness
  • Ordinal attribute: distinctness and order
  • Interval attribute: distinctness, order, and
    addition
  • Ratio attribute: all 4 properties
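
As a rough illustration of why the attribute type matters, the sketch below maps each of the four types to the summary statistics that its properties support. The names AttrType and allowed_summaries are hypothetical, and the pairing of statistics with types (mode, median, mean, geometric mean) is the usual textbook convention rather than something stated on these slides.

```python
from enum import IntEnum


class AttrType(IntEnum):
    """Attribute types, ordered by how many properties they possess."""
    NOMINAL = 1    # distinctness only
    ORDINAL = 2    # distinctness and order
    INTERVAL = 3   # distinctness, order, addition
    RATIO = 4      # all four properties


def allowed_summaries(attr_type):
    """Return the summary statistics that make sense for an attribute type."""
    summaries = ["mode"]                      # needs only = and !=
    if attr_type >= AttrType.ORDINAL:
        summaries.append("median")            # needs an ordering
    if attr_type >= AttrType.INTERVAL:
        summaries.append("mean")              # needs meaningful differences
    if attr_type >= AttrType.RATIO:
        summaries.append("geometric mean")    # needs meaningful ratios
    return summaries


# Employee ID is nominal, so "mean" is not offered; age is ratio, so it is.
print(allowed_summaries(AttrType.NOMINAL))
print(allowed_summaries(AttrType.RATIO))
```

Storing the type next to the column is one way to prevent the "average employee ID" mistake, since the integer representation alone gives no hint of the limitation.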

10
Discrete and Continuous Attributes
  • Discrete Attribute
  • Has only a finite or countably infinite set of
    values
  • Examples: zip codes, counts, or the set of words
    in a collection of documents
  • Often represented as integer variables.
  • Note: binary attributes are a special case of
    discrete attributes
  • Continuous Attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight.
  • Practically, real values can only be measured and
    represented using a finite number of digits.
  • Continuous attributes are typically represented
    as floating-point variables.
  • Typically, nominal and ordinal attributes are
    binary or discrete, while interval and ratio
    attributes are continuous. However, count
    attributes, which are discrete, are also ratio
    attributes.

11
Asymmetric Attributes
  • For asymmetric attributes, only presence -- a
    non-zero attribute value -- is regarded as
    important.
  • E.g., transaction data
  • Bread, Coke, etc. are in fact (asymmetric)
    attributes, and only their presence (i.e., value 1
    or true) is important.

12
Types of data sets
  • Record: Data Matrix, Document Data, Transaction
    Data
  • Graph: World Wide Web, Molecular Structures
  • Ordered: Spatial Data, Temporal Data, Sequential
    Data, Genetic Sequence Data

13
Record Data
  • Data that consists of a collection of records,
    each of which consists of a fixed set of
    attributes

14
Document Data
  • Each document becomes a term (word) vector,
  • each term is a component (attribute) of the
    vector,
  • the value of each component is the number of
    times the corresponding term occurs in the
    document.
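
A minimal sketch of the term-vector idea, assuming whitespace tokenization and lowercasing (real pipelines usually also handle punctuation, stop words, and stemming). The function name term_vectors and the two sample documents are made up for illustration.

```python
from collections import Counter


def term_vectors(documents):
    """Turn each document into a vector of term counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary fixes the order of the vector components (attributes).
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["the team won the game", "the coach lost the ball"]
vocabulary, vectors = term_vectors(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```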

15
Transaction Data
  • A special type of record data, where
  • each record (transaction) involves a set of
    items.
  • For example, consider a grocery store. The set
    of products purchased by a customer during one
    shopping trip constitutes a transaction, while the
    individual products that were purchased are the
    items.
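
Transaction data is often stored as one binary (asymmetric) attribute per item, where 1 means the item is present in the transaction. The sketch below shows that conversion; the function name to_binary_matrix and the sample baskets are hypothetical.

```python
def to_binary_matrix(transactions):
    """Represent each transaction as a row of 0/1 flags, one column per item."""
    items = sorted({item for basket in transactions for item in basket})
    rows = [[1 if item in basket else 0 for item in items]
            for basket in transactions]
    return items, rows


# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
]
items, rows = to_binary_matrix(transactions)
print(items)
for row in rows:
    print(row)
```

Most entries in such a matrix are 0 for a realistic store, which is why only the 1s (presence) are treated as informative.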

16
Data with Relationships among Objects
  • Examples: generic graph and HTML links

Web search engines collect and process Web pages
to extract their contents. It is well known,
however, that the links to and from each page
provide a great deal of information about the
relevance of a Web page to a query, and thus,
must also be taken into consideration.
17
Data with Objects That Are Graphs, e.g., Chemical
Data
  • Benzene molecule: C6H6

Substructure mining: Which substructures occur
frequently in a set of compounds? Ascertain
whether the presence of any of these
substructures is associated with the presence or
absence of certain chemical properties, such as
melting point or heat of formation.
18
Data Quality
  • What are the kinds of data quality problems?
  • How can we detect problems with the data?
  • What can we do about these problems?
  • Examples of data quality problems:
  • Data collection errors
  • Noise
  • Outliers
  • Missing values
  • Duplicate data

We will study the second and third questions above
(detecting problems and dealing with them) in
detail later, after classification.
19
Outliers
  • Outliers are data objects with characteristics
    that are considerably different from those of
    most other data objects in the data set
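
The slide does not prescribe a detection method. As one common rule of thumb (an assumption here, not the presenter's method), a value can be flagged when it lies more than 1.5 interquartile ranges outside the quartiles. The function name iqr_outliers and the sample values are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)   # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical single-attribute data with one clearly unusual value.
ages = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(ages)
print(outliers)
```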

20
Missing Values
  • Reasons for missing values:
  • Information is not collected (e.g., people
    decline to give their age and weight)
  • Attributes may not be applicable to all cases
    (e.g., annual income is not applicable to
    children)
  • Handling missing values:
  • Eliminate data objects
  • Estimate missing values
  • Ignore the missing value during analysis
  • Replace with all possible values (weighted by
    their probabilities)
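
The first three handling strategies can be sketched as follows. The records are made up, None stands in for a missing value, and mean imputation is only one of many ways to estimate a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 41, "income": None},
    {"age": 33, "income": 58000},
]

# Strategy 1: eliminate data objects that have any missing value.
complete = [r for r in records if None not in r.values()]


# Strategy 2: estimate the missing value, here with the mean of observed values.
def impute_mean(rows, field):
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


imputed = impute_mean(records, "age")

# Strategy 3: ignore the missing value during analysis of that attribute.
mean_income = mean(r["income"] for r in records if r["income"] is not None)

print(len(complete), imputed[1]["age"], round(mean_income))
```

Eliminating objects is the simplest option but discards the observed fields of every incomplete record, which can be costly when many records have at least one gap.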

21
Data Preprocessing
  • Aggregation
  • Sampling
  • Dimensionality Reduction
  • Feature subset selection
  • Feature creation
  • Discretization and Binarization
  • Attribute Transformation

22
Aggregation
  • Sometimes "less is more", and this is the case
    with aggregation, the combining of two or more
    objects into a single object.
  • One way to aggregate transactions for this data
    set is to replace all the transactions of a
    single store with a single storewide transaction.
  • This reduces the number of data objects, which is
    now equal to the number of stores.
  • How are the values of each attribute combined
    across all the records of a group (a store, for
    instance)?
  • Some quantitative attributes, e.g., price, are
    typically aggregated by taking a sum or an
    average.
  • Other attributes, e.g., item or date, are omitted.
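
A minimal sketch of storewide aggregation, assuming each transaction carries store, item, date, and price fields. The field names and values are hypothetical, since the original data set is not reproduced in this transcript.

```python
from collections import defaultdict

# Hypothetical transactions; item and date are dropped by the aggregation.
transactions = [
    {"store": "A", "item": "Watch", "date": "2004-09-06", "price": 20},
    {"store": "A", "item": "Battery", "date": "2004-09-06", "price": 30},
    {"store": "B", "item": "Shoes", "date": "2004-09-06", "price": 50},
]


def aggregate_by_store(rows):
    """Collapse all transactions of a store into one storewide record."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["store"]] += row["price"]
        counts[row["store"]] += 1
    return {
        store: {
            "total_price": totals[store],
            "avg_price": totals[store] / counts[store],
            "n_transactions": counts[store],
        }
        for store in totals
    }


by_store = aggregate_by_store(transactions)
print(by_store)
```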

23
Sampling
  • Sampling is the main technique employed for data
    selection.
  • It is often used for both the preliminary
    investigation of the data and the final data
    analysis.
  • Sampling is used in data mining because
    processing the entire set of data of interest is
    too expensive or time consuming.
  • The key principle for effective sampling: using
    a sample will work almost as well as using the
    entire data set, if the sample is
    representative.
  • A sample is representative if it has
    approximately the same property (of interest) as
    the original set of data.
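
The representativeness idea can be checked numerically. The sketch below draws a simple random sample without replacement from synthetic data and compares one property of interest, the mean. The population, the sample size, and the choice of the mean as the property are all assumptions made for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic population: 8000 values drawn from a normal distribution.
population = [rng.gauss(50, 10) for _ in range(8000)]

# Simple random sampling without replacement.
sample = rng.sample(population, 500)

pop_mean = mean(population)
sample_mean = mean(sample)
print(f"population mean = {pop_mean:.2f}, sample mean = {sample_mean:.2f}")
```

If the property of interest were something rarer, such as the presence of a small class, a sample of this size might miss it, and a stratified scheme would be a more suitable choice.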

24
Sample Size

[Figure: the same data set shown with 8000 points, 2000 points, and 500 points]
25
Curse of Dimensionality
  • When dimensionality increases, data becomes
    increasingly sparse in the space that it occupies
  • For classification, this can mean that there are
    not enough data objects to allow the creation of
    a model that reliably assigns a class to all
    possible objects.
  • Definitions of density and distance between
    points, which are critical for clustering and
    outlier detection, become less meaningful.
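
One way to see distances becoming less meaningful is to measure how much the farthest and nearest neighbors of a point differ, relative to the nearest distance. The sketch below does this for uniformly random points. The function name relative_contrast, the point counts, and the dimensions tried are arbitrary choices for illustration.

```python
import math
import random


def relative_contrast(n_points, dim, rng):
    """Return (max_dist - min_dist) / min_dist from one point to all others."""
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so the run is repeatable
contrast = {dim: relative_contrast(200, dim, rng) for dim in (2, 20, 200)}
for dim, value in contrast.items():
    print(f"dim={dim:>3}  relative contrast={value:.2f}")
```

As the dimension grows, the nearest and farthest points end up at nearly the same distance, so the contrast shrinks toward zero.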

26
Feature Subset Selection
  • Redundant features
  • duplicate much or all of the information
    contained in one or more other attributes
  • Example: purchase price of a product and the
    amount of sales tax paid
  • Irrelevant features
  • contain no information that is useful for the
    data mining task at hand
  • Example: a student's ID is often irrelevant to the
    task of predicting the student's GPA

27
Feature Subset Selection
  • Techniques:
  • Brute-force approach
  • Try all possible feature subsets as input to the
    data mining algorithm
  • Embedded approaches
  • Feature selection occurs naturally as part of
    the data mining algorithm
  • Filter approaches
  • Features are selected before the data mining
    algorithm is run
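
Two of these techniques can be sketched briefly. The brute-force part only enumerates the candidate subsets (there are 2^n - 1 non-empty ones for n features). The filter part ranks features by absolute Pearson correlation with the target, which is one possible filter criterion and an assumption here. All names and data values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def all_subsets(features):
    """Brute force: yield every non-empty subset of the feature names."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def filter_select(columns, target, k):
    """Filter approach: keep the k features most correlated with the target."""
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], target)),
                    reverse=True)
    return ranked[:k]


# Hypothetical student data; student_id should carry little information.
columns = {
    "hours_studied": [2, 4, 6, 8, 10],
    "classes_missed": [9, 7, 6, 3, 1],
    "student_id": [5, 1, 4, 2, 3],
}
gpa = [2.0, 2.5, 3.0, 3.4, 3.9]

n_subsets = len(list(all_subsets(list(columns))))
selected = filter_select(columns, gpa, k=2)
print(n_subsets, selected)
```

The subset count doubles with every added feature, which is why the brute-force approach is impractical beyond a small number of features.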

28
Discretization and Binarization
  • Some data mining algorithms, especially certain
    classification algorithms, require that the data
    be in the form of categorical attributes.
  • Algorithms that find association patterns require
    that the data be in the form of binary
    attributes.
  • Thus it is often necessary to transform a
    continuous attribute into a categorical attribute
    (discretization), and both continuous and
    discrete attributes may need to be transformed
    into one or more binary attributes
    (binarization).
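
A minimal sketch of both transformations, assuming equal-width binning for discretization and one binary attribute per category for binarization. Other schemes, such as equal-frequency bins, are equally valid. The function names and the height values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Discretize continuous values into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:                      # all values identical: a single bin
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def one_hot(categories):
    """Binarize a categorical attribute: one 0/1 attribute per category."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, rows


heights = [150, 160, 170, 180, 190, 200]   # hypothetical heights in cm
bins = equal_width_bins(heights, 3)        # e.g. short / medium / tall
levels, binary_rows = one_hot(bins)
print(bins)
print(levels, binary_rows)
```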