CS4323 Data Mining: Data and Data Preprocessing

Provided by: arifbij

1
CS4323 Data Mining: Data and Data Preprocessing
2
DATA and DATA PREPROCESSING
3
What is Data?
Attributes
  • An attribute is also known as a variable, field,
    characteristic, or feature
  • An object is also known as a record, point, case,
    sample, entity, or instance

Objects
4
What's in an attribute?
  • Each instance is described by a fixed, predefined
    set of features: its attributes
  • Possible attribute types (levels of
    measurement):
  • Nominal, ordinal, interval, and ratio

witten & eibe
5
Nominal quantities
  • Values are distinct symbols
  • Values themselves serve only as labels or names
  • "Nominal" comes from the Latin word for name
  • Example: attribute outlook from weather data
  • Values: sunny, overcast, and rainy
  • No relation is implied among nominal values (no
    ordering or distance measure)
  • Only equality tests can be performed

6
Ordinal quantities
  • Impose order on values
  • But no distance between values defined
  • Example: attribute temperature in weather data
  • Values: hot > mild > cool
  • Note: addition and subtraction don't make sense
  • Example rule: temperature < hot => play = yes

7
Interval quantities (Numeric)
  • Interval quantities are not only ordered but
    measured in fixed and equal units
  • Example 1: attribute temperature expressed in
    degrees Fahrenheit
  • Example 2: attribute year
  • Difference of two values makes sense
  • Sum or product doesn't make sense

8
Ratio quantities
  • Example: attribute distance
  • Distance between an object and itself is zero
  • Ratio quantities are treated as real numbers
  • All mathematical operations are allowed

9
Attribute types used in practice
  • Most schemes accommodate just two levels of
    measurement: nominal and ordinal
  • Nominal attributes are also called "categorical",
    "enumerated", or "discrete"
  • But "enumerated" and "discrete" imply order
  • Special case: dichotomy ("boolean" attribute)
  • Ordinal attributes are called "numeric" or
    "continuous"

10
Attribute types: Summary
  • Nominal, e.g. eye color = brown, blue, ...
      • only equality tests
      • important special case: boolean (True/False)
  • Ordinal, e.g. grade = k, 1, 2, ..., 12
  • Continuous (numeric), e.g. year
      • interval quantities: integer
      • ratio quantities: real

11
Why specify attribute types?
  • Q: Why do machine learning algorithms need to know
    about attribute types?
  • A: To be able to make the right comparisons and learn
    correct concepts, e.g.
      • Outlook > sunny does not make sense, while
      • Temperature > cool or
      • Humidity > 70 does
  • Additional uses of attribute type: checking for
    valid values, dealing with missing values, etc.
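
As a minimal sketch of the answer above, the following Python snippet lets an attribute's declared type decide which comparisons are allowed. The names ALLOWED_OPS, OPS, and compare, and the exact set of operators per type, are hypothetical and chosen only for illustration; they are not taken from Weka or any other tool.

```python
import operator

# Which comparison operators each attribute type supports (illustrative assumption).
ALLOWED_OPS = {
    "nominal": {"==", "!="},
    "ordinal": {"==", "!=", "<", ">"},
    "numeric": {"==", "!=", "<", ">"},
}
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}


def compare(attr_type, left, op, right, order=None):
    """Evaluate `left op right`, rejecting comparisons the type does not support."""
    if op not in ALLOWED_OPS[attr_type]:
        raise TypeError(f"'{op}' does not make sense for a {attr_type} attribute")
    if attr_type == "ordinal":
        # Ordinal values are compared by their rank in the declared order.
        left, right = order.index(left), order.index(right)
    return OPS[op](left, right)


temperature_order = ["cool", "mild", "hot"]  # assumed ordering: cool < mild < hot
print(compare("ordinal", "mild", ">", "cool", order=temperature_order))  # True
print(compare("numeric", 85, ">", 70))                                    # True
try:
    compare("nominal", "overcast", ">", "sunny")
except TypeError as err:
    print(err)  # the nominal comparison is rejected
```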

12
Transforming ordinal to boolean
  • Simple transformation allows an ordinal attribute
    with n values to be coded using n-1 boolean
    attributes
  • Example: attribute temperature
  • Better than coding it as a nominal attribute

[Figure: original data => transformed data]
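
One way to sketch this transformation in Python is shown below. The function name ordinal_to_booleans and the ordering cool < mild < hot are assumptions made for illustration. Each of the n-1 output booleans answers the question "is the value greater than the i-th level?".

```python
def ordinal_to_booleans(value, order):
    """Encode an ordinal value as n-1 booleans: 'value > order[i]' for each i < n-1."""
    rank = order.index(value)
    return [rank > i for i in range(len(order) - 1)]


order = ["cool", "mild", "hot"]  # assumed ordering: cool < mild < hot
for v in order:
    print(v, ordinal_to_booleans(v, order))
# cool [False, False]
# mild [True, False]
# hot [True, True]
```

Unlike a nominal coding, this keeps the order: a rule that tests only the first boolean is equivalent to testing temperature > cool.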
13
The ARFF format
14
Attribute types in Weka
  • ARFF supports numeric and nominal attributes
  • Interpretation depends on learning scheme
  • Numeric attributes are interpreted as:
      • ordinal scales if less-than and greater-than are
        used
      • ratio scales if distance calculations are
        performed (normalization/standardization may be
        required)
  • Instance-based schemes define distance between
    nominal values (0 if values are equal, 1
    otherwise)
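
The distance rules above can be sketched as follows. This is an illustrative Python sketch, not Weka's implementation: the helper names normalize and distance, the choice of min-max scaling, and the sample values are all assumptions.

```python
import math


def normalize(column):
    """Min-max scale a numeric column to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def distance(a, b, nominal):
    """Euclidean-style distance over mixed attributes.

    nominal[i] is True when attribute i is nominal: its contribution
    is 0 if the two values are equal and 1 otherwise.
    """
    total = 0.0
    for x, y, is_nominal in zip(a, b, nominal):
        diff = (0.0 if x == y else 1.0) if is_nominal else (x - y)
        total += diff * diff
    return math.sqrt(total)


# Made-up instances: (outlook, humidity)
outlook = ["sunny", "rainy", "sunny"]
humidity = normalize([70, 90, 80])  # scaled to 0.0, 1.0, 0.5
rows = list(zip(outlook, humidity))
print(distance(rows[0], rows[2], nominal=[True, False]))  # 0.5
print(distance(rows[0], rows[1], nominal=[True, False]))  # sqrt(2)
```

Without the scaling step, the raw humidity difference (up to 20) would swamp the 0/1 contribution of the nominal attribute.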

15
Nominal vs. ordinal
  • Attribute age treated as nominal
  • Attribute age treated as ordinal
  • (e.g. young < pre-presbyopic <
    presbyopic)

16
  • WHICH ATTRIBUTE TYPES ARE AVAILABLE
    IN WEKA AND IN CLEMENTINE?

17
Types of data sets
  • Record
      • Data Matrix
      • Document Data
      • Transaction Data
  • Graph
      • World Wide Web
      • Molecular Structures
  • Ordered
      • Spatial Data
      • Temporal Data
      • Sequential Data

18
Record Data
  • Data that consists of a collection of records,
    each of which consists of a fixed set of
    attributes

19
Document Data
  • Each document becomes a "term" vector
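
A minimal Python sketch of building term vectors, assuming whitespace tokenization and raw term counts as the vector components. The function name term_vectors and the two sample documents are made up for illustration.

```python
from collections import Counter


def term_vectors(documents):
    """Return (vocabulary, vectors): one term-count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["team play ball", "ball score score"]  # made-up documents
vocab, vecs = term_vectors(docs)
print(vocab)  # ['ball', 'play', 'score', 'team']
print(vecs)   # [[1, 1, 0, 1], [1, 0, 2, 0]]
```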

20
Transaction Data
  • A special type of record data, where each record
    (transaction) involves a set of items.
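
As a sketch of how transaction data relates to ordinary record data, the snippet below turns item sets into rows of 0/1 flags, one column per distinct item. The function name to_binary_matrix and the sample transactions are hypothetical.

```python
def to_binary_matrix(transactions):
    """Turn item sets into rows of 0/1 flags, one column per distinct item."""
    items = sorted(set().union(*transactions))
    matrix = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, matrix


transactions = [{"bread", "milk"}, {"beer", "bread"}, {"milk"}]  # made-up data
items, matrix = to_binary_matrix(transactions)
print(items)   # ['beer', 'bread', 'milk']
print(matrix)  # [[0, 1, 1], [1, 1, 0], [0, 0, 1]]
```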

21
Graph Data
  • Examples: generic graph and HTML links

22
Ordered Data
  • Sequences of transactions

[Figure labels: items/events; an element of the sequence]
23
Ordered Data
  • Genomic sequence data

24
Data Preprocessing
  • Aggregation
  • Sampling (including oversampling and undersampling)
  • Dimensionality Reduction
  • Feature subset selection
  • Feature creation
  • Discretization and Binarization
  • Attribute Transformation

25
Curse of Dimensionality
  • When dimensionality increases, data becomes
    increasingly sparse in the space that it occupies
  • Definitions of density and distance between
    points, which are critical for clustering and
    outlier detection, become less meaningful
  • Randomly generate 500 points
  • Compute difference between max and min distance
    between any pair of points
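
The experiment in the last two bullets can be sketched in plain Python. Two details are assumptions not stated on the slide: points are drawn uniformly from the unit hypercube, and the difference is divided by the minimum distance and reported on a log10 scale so that results are comparable across dimensions. The function name distance_contrast is hypothetical.

```python
import itertools
import math
import random


def distance_contrast(n_points, n_dims, rng):
    """Return (max_dist - min_dist) / min_dist over all pairs of random points."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
for dims in (2, 10, 50):
    contrast = distance_contrast(500, dims, rng)
    print(dims, round(math.log10(contrast), 2))
```

The printed value should shrink as the number of dimensions grows: the nearest and farthest pairs become almost equally far apart, which is why distance-based notions lose meaning.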

26
  • WHAT ARE THE VISUALIZATION CAPABILITIES OF WEKA
    AND OF CLEMENTINE?

27
DATA EXPLORATION
28
What is data exploration?
A preliminary exploration of the data to better
understand its characteristics.
  • Key motivations of data exploration include
  • Helping to select the right tool for
    preprocessing or analysis
  • Making use of humans' abilities to recognize
    patterns
  • People can recognize patterns not captured by
    data analysis tools
  • Related to the area of Exploratory Data Analysis
    (EDA)
  • How we dissect a data set; what we look for; how
    we look; and how we interpret
  • EDA heavily uses the collection of techniques
    that we call "statistical graphics"

29
Visualization
  • Visualization is the conversion of data into a
    visual or tabular format so that the
    characteristics of the data and the relationships
    among data items or attributes can be analyzed or
    reported.
  • Visualization of data is one of the most powerful
    and appealing techniques for data exploration.
  • Can detect general patterns and trends
  • Can detect outliers and unusual patterns

30
Visualization Techniques: Histograms
  • Histogram
  • Usually shows the distribution of values of a
    single variable
  • Example: petal width (10 and 20 bins,
    respectively)
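
A minimal sketch of the counting behind a histogram, assuming equal-width bins that span the observed range. The function name histogram is hypothetical, and the widths listed are made-up values, not the Iris measurements.

```python
def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    counts = [0] * n_bins
    if hi == lo:
        # All values identical: put everything in the first bin.
        counts[0] = len(values)
        return counts
    width = (hi - lo) / n_bins
    for v in values:
        # The maximum value falls into the last bin rather than a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


widths = [0.2, 0.2, 0.4, 1.3, 1.5, 1.4, 2.1, 2.5, 1.8, 0.3]  # made-up values
for bins in (10, 20):
    print(bins, histogram(widths, bins))
```

Changing the number of bins changes the picture: fewer bins smooth the shape, while more bins expose gaps and individual values.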

31
Two-Dimensional Histograms
  • Show the joint distribution of the values of two
    attributes
  • Example: petal width and petal length
  • What does this tell us?

32
Visualization Techniques: Box Plots
  • Box Plots

33
Example of Box Plots
  • Box plots can be used to compare attributes
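
The numbers behind such a comparison can be sketched with Python's standard library. This assumes the common min/quartiles/max form of the box plot; other conventions (for example whiskers at the 10th and 90th percentiles) exist, and the slide does not say which one it uses. The function name five_number_summary and the two attributes are made up.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a basic box plot draws."""
    # statistics.quantiles with n=4 returns the three quartile cut points,
    # using the library's default ("exclusive") interpolation method.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


# Made-up measurements for two attributes
attributes = {
    "attr_a": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "attr_b": [2, 2, 3, 3, 4, 4, 5, 5, 12],
}
for name, values in attributes.items():
    print(name, five_number_summary(values))
```

Printing the summaries side by side is the textual equivalent of drawing the boxes next to each other.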

34
Scatter Plot Array of Iris Attributes
35
Contour Plot Example: SST Dec, 1998
36
Visualization of the Iris Correlation Matrix
37
Parallel Coordinates Plots for Iris Data
38
OLAP
  • Relational databases put data into tables, while
    OLAP uses a multidimensional array
    representation.
  • Such representations of data previously existed
    in statistics and other fields
  • There are a number of data analysis and data
    exploration operations that are easier with such
    a data representation.

39
Example: Iris data (continued)
  • Each unique tuple of petal width, petal length,
    and species type identifies one element of the
    array.
  • This element is assigned the corresponding count
    value.
  • The figure illustrates the result.
  • All non-specified tuples are 0.
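
The array of counts described above can be sketched with a Counter keyed by tuples. The records below are made up and already discretized into low/medium/high; they are not the real Iris counts, and the variable names cube and by_species are hypothetical.

```python
from collections import Counter

# Made-up records: (petal_width, petal_length, species), already discretized
records = [
    ("low", "low", "setosa"),
    ("low", "low", "setosa"),
    ("medium", "medium", "versicolor"),
    ("high", "high", "virginica"),
    ("high", "medium", "virginica"),
]

# Each unique tuple indexes one cell of the array; the cell stores its count.
cube = Counter(records)

print(cube[("low", "low", "setosa")])   # 2
print(cube[("high", "low", "setosa")])  # 0: unspecified tuples count as 0

# Summing over petal width and petal length gives per-species totals.
by_species = Counter()
for (_, _, species), count in cube.items():
    by_species[species] += count
print(by_species["virginica"])          # 2
```

Summing out dimensions in this way is one of the operations that a multidimensional representation makes easy.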