Title: CS4323 Data Mining Data and Data Preprocessing
1CS4323 Data Mining: Data and Data Preprocessing
2DATA and DATA PREPROCESSING
3What is Data?
Attributes
- An attribute is also known as a variable, field, characteristic, or feature
- An object is also known as a record, point, case, sample, entity, or instance
Objects
4What's in an attribute?
- Each instance is described by a fixed, predefined set of features, its attributes
- Possible attribute types (levels of measurement):
- Nominal, ordinal, interval, and ratio
witteneibe
5Nominal quantities
- Values are distinct symbols
- Values themselves serve only as labels or names
- Nominal comes from the Latin word for name
- Example: attribute outlook from weather data
- Values: sunny, overcast, and rainy
- No relation is implied among nominal values (no ordering or distance measure)
- Only equality tests can be performed
6Ordinal quantities
- Impose order on values
- But no distance between values defined
- Example: attribute temperature in weather data
- Values: hot > mild > cool
- Note: addition and subtraction don't make sense
- Example rule: temperature < hot => play = yes
7Interval quantities (Numeric)
- Interval quantities are not only ordered but measured in fixed and equal units
- Example 1: attribute temperature expressed in degrees Fahrenheit
- Example 2: attribute year
- Difference of two values makes sense
- Sum or product doesn't make sense
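As a small worked illustration of why differences are meaningful on an interval scale while ratios are not, the sketch below converts two Fahrenheit readings to Celsius. The readings 40 and 80 are made-up values chosen for the arithmetic, not taken from the weather data.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

# Two made-up temperature readings (illustrative values only).
a_f, b_f = 40.0, 80.0
a_c, b_c = fahrenheit_to_celsius(a_f), fahrenheit_to_celsius(b_f)

# Differences survive a change of unit up to a constant factor (5/9).
diff_f = b_f - a_f
diff_c = b_c - a_c
print(diff_c / diff_f)

# Ratios do not: the ratio of the two readings depends on the unit,
# because the zero point of each scale is arbitrary.
ratio_f = b_f / a_f
ratio_c = b_c / a_c
print(ratio_f, ratio_c)
```

The ratio changes with the unit, which is why a statement such as "twice as hot" is not meaningful for an interval quantity.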
8Ratio quantities
- Example: attribute distance
- Distance between an object and itself is zero
- Ratio quantities are treated as real numbers
- All mathematical operations are allowed
9Attribute types used in practice
- Most schemes accommodate just two levels of measurement: nominal and ordinal
- Nominal attributes are also called "categorical", "enumerated", or "discrete"
- But "enumerated" and "discrete" imply order
- Special case: dichotomy ("boolean" attribute)
- Ordinal attributes are called "numeric" or "continuous"
10Attribute types: Summary
- Nominal, e.g. eye color = brown, blue, ...
- only equality tests
- important special case: boolean (True/False)
- Ordinal, e.g. grade = k, 1, 2, ..., 12
- Continuous (numeric), e.g. year
- interval quantities: integer
- ratio quantities: real
11Why specify attribute types?
- Q: Why do machine learning algorithms need to know about attribute types?
- A: To be able to make the right comparisons and learn correct concepts, e.g.
- Outlook > "sunny" does not make sense, while
- Temperature > "cool" or
- Humidity > 70 does
- Additional uses of attribute type: check for valid values, deal with missing values, etc.
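The point about making the right comparisons can be sketched in code. The `Attribute` class below is a hypothetical helper written for this illustration (it is not part of Weka or any other tool), and the ordering cool < mild < hot is assumed from the weather example.

```python
NOMINAL, ORDINAL, NUMERIC = "nominal", "ordinal", "numeric"

class Attribute:
    """Hypothetical attribute descriptor: a name, a type, and (for
    nominal/ordinal attributes) the legal values, listed in order."""

    def __init__(self, name, kind, values=None):
        self.name = name
        self.kind = kind
        self.values = list(values) if values is not None else None

    def is_valid(self, value):
        # Numeric attributes accept any number; others need a declared value.
        if self.kind == NUMERIC:
            return isinstance(value, (int, float))
        return value in self.values

    def greater_than(self, value, threshold):
        # Ordering tests are undefined for nominal attributes.
        if self.kind == NOMINAL:
            raise TypeError(f"'{self.name} > {threshold}' does not make sense "
                            "for a nominal attribute")
        if self.kind == ORDINAL:
            # Compare by position in the declared order.
            return self.values.index(value) > self.values.index(threshold)
        return value > threshold


outlook = Attribute("outlook", NOMINAL, ["sunny", "overcast", "rainy"])
temperature = Attribute("temperature", ORDINAL, ["cool", "mild", "hot"])
humidity = Attribute("humidity", NUMERIC)

print(temperature.greater_than("hot", "cool"))   # ordinal comparison by rank
print(humidity.greater_than(85, 70))             # ordinary numeric comparison
```

Calling `outlook.greater_than("rainy", "sunny")` raises a `TypeError`, and `is_valid` shows how the same type information supports checking for valid values.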
12Transforming ordinal to boolean
- Simple transformation allows an ordinal attribute with n values to be coded using n-1 boolean attributes
- Example: attribute temperature
- Better than coding it as a nominal attribute
[Figure: original data and transformed data]
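One way to read the transformation is as a set of threshold tests, one per level except the highest. The sketch below is a minimal illustration under that reading; the function name `ordinal_to_boolean` and the order cool < mild < hot are assumptions made for the example.

```python
def ordinal_to_boolean(value, ordered_values):
    """Code an ordinal value as n-1 booleans, one per threshold.

    The i-th boolean answers "is value > ordered_values[i]?" for every
    level except the highest one.
    """
    rank = ordered_values.index(value)
    return [rank > i for i in range(len(ordered_values) - 1)]

levels = ["cool", "mild", "hot"]   # assumed order: cool < mild < hot
for v in levels:
    print(v, ordinal_to_boolean(v, levels))
```

With three levels the two booleans stand for "temperature > cool" and "temperature > mild". Unlike a plain nominal coding, values that are close in the order share more booleans, so the order is preserved.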
13The ARFF format
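The slide's example file did not survive extraction. As a rough sketch of the general layout of an ARFF file (a relation name, attribute declarations, then comma-separated data rows), the helper below renders a tiny file as text. The function `to_arff`, the relation name, and the two data rows are made up for illustration.

```python
def to_arff(relation, attributes, rows):
    """Render a relation as ARFF text.

    attributes: list of (name, spec) pairs, where spec is either the string
    "numeric" or a list of nominal values.
    rows: list of value tuples; None is written as "?" (missing value).
    """
    lines = [f"@relation {relation}", ""]
    for name, spec in attributes:
        if spec == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attributes list their legal values in braces.
            lines.append(f"@attribute {name} {{{', '.join(spec)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"

attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("play", ["yes", "no"]),
]
rows = [("sunny", 85, "no"), ("overcast", None, "yes")]   # made-up rows
arff_text = to_arff("weather", attributes, rows)
print(arff_text)
```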
14Attribute types in Weka
- ARFF supports numeric and nominal attributes
- Interpretation depends on learning scheme
- Numeric attributes are interpreted as
- ordinal scales if less-than and greater-than are used
- ratio scales if distance calculations are performed (normalization/standardization may be required)
- Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)
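The last two points can be combined into one distance function over mixed attributes. This is a minimal sketch, not Weka's actual implementation: min-max normalization and the Euclidean combination are assumptions, and the humidity range (60, 100) is a hypothetical observed minimum and maximum.

```python
import math

def normalize(value, lo, hi):
    """Min-max scale a numeric value to [0, 1] (assumes hi > lo)."""
    return (value - lo) / (hi - lo)

def distance(a, b, kinds, ranges):
    """Euclidean-style distance over mixed attributes.

    kinds[i] is "nominal" or "numeric"; ranges[i] is (lo, hi) for numeric
    attributes and ignored for nominal ones.
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "nominal":
            d = 0.0 if x == y else 1.0      # 0 if equal, 1 otherwise
        else:
            d = normalize(x, *rng) - normalize(y, *rng)
        total += d * d
    return math.sqrt(total)

kinds = ["nominal", "numeric"]
ranges = [None, (60, 100)]   # hypothetical observed min/max for humidity
print(distance(("sunny", 70), ("rainy", 90), kinds, ranges))
```

Without the normalization step, the numeric attribute (differences of tens of units) would dominate the nominal one (difference of at most 1).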
15Nominal vs. ordinal
- Attribute age: nominal
- Attribute age: ordinal
- (e.g. young < pre-presbyopic < presbyopic)
16- WHAT ARE THE ATTRIBUTE TYPES IN WEKA AND IN CLEMENTINE?
17Types of data sets
- Record
- Data Matrix
- Document Data
- Transaction Data
- Graph
- World Wide Web
- Molecular Structures
- Ordered
- Spatial Data
- Temporal Data
- Sequential Data
18Record Data
- Data that consists of a collection of records,
each of which consists of a fixed set of
attributes
19Document Data
- Each document becomes a "term" vector
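A term vector can be sketched as a list of word counts over a shared vocabulary. The example below is a minimal illustration; splitting on whitespace and lowercasing are simplifying assumptions, and the two documents are made up.

```python
from collections import Counter

def term_vectors(documents):
    """Turn each document into a vector of term counts over a shared vocabulary."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors

docs = ["team play score", "play play win"]   # made-up toy documents
vocabulary, vectors = term_vectors(docs)
print(vocabulary)
for vec in vectors:
    print(vec)
```

Each position in a vector corresponds to one term of the vocabulary, and the value is the number of times that term occurs in the document.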
20Transaction Data
- A special type of record data, where each record
(transaction) involves a set of items.
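Transaction data is often converted to a record-style table with one 0/1 column per item. The sketch below assumes that representation; the function name `to_binary_matrix` and the two baskets are made up for illustration.

```python
def to_binary_matrix(transactions):
    """Represent item sets as rows of 0/1 flags, one column per distinct item."""
    items = sorted(set().union(*transactions))
    matrix = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, matrix

transactions = [{"bread", "coke", "milk"}, {"beer", "bread"}]   # toy baskets
items, matrix = to_binary_matrix(transactions)
print(items)
for row in matrix:
    print(row)
```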
21Graph Data
- Examples: generic graph and HTML links
22Ordered Data
- Sequences of transactions
[Figure labels: items/events; an element of the sequence]
23Ordered Data
24Data Preprocessing
- Aggregation
- Sampling (e.g. oversampling, undersampling)
- Dimensionality Reduction
- Feature subset selection
- Feature creation
- Discretization and Binarization
- Attribute Transformation
25Curse of Dimensionality
- When dimensionality increases, data becomes increasingly sparse in the space that it occupies
- Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
- Randomly generate 500 points
- Compute difference between max and min distance between any pair of points
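The experiment described in the last two bullets can be sketched as follows. Several details are assumptions not stated on the slide: points are drawn uniformly from the unit hypercube, distance is Euclidean, and the gap is divided by the minimum distance so that results are comparable across dimensions.

```python
import itertools
import math
import random

def distance_contrast(dim, n_points=500, seed=0):
    """Relative gap (max - min) / min over all pairwise Euclidean distances
    of n_points drawn uniformly from the unit hypercube in `dim` dimensions."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 50):
    print(dim, round(distance_contrast(dim), 2))
```

The relative gap shrinks as the dimension grows: the nearest and farthest pairs become almost equally far apart, which is why distance-based notions such as density lose meaning.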
26- WHAT ARE THE VISUALIZATION CAPABILITIES OF WEKA AND OF CLEMENTINE?
27DATA EXPLORATION
28What is data exploration?
A preliminary exploration of the data to better
understand its characteristics.
- Key motivations of data exploration include
- Helping to select the right tool for preprocessing or analysis
- Making use of humans' abilities to recognize patterns
- People can recognize patterns not captured by data analysis tools
- Related to the area of Exploratory Data Analysis (EDA)
- How we dissect a data set; what we look for; how we look; and how we interpret
- EDA heavily uses the collection of techniques that we call "statistical graphics"
29Visualization
- Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
- Visualization of data is one of the most powerful and appealing techniques for data exploration.
- Can detect general patterns and trends
- Can detect outliers and unusual patterns
30Visualization Techniques: Histograms
- Histogram
- Usually shows the distribution of values of a single variable
- Example: Petal Width (10 and 20 bins, respectively)
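The counts behind a histogram can be computed with equal-width binning, as in the sketch below. The measurements are made-up values, not the actual Iris petal widths, and equal-width bins spanning the observed range are an assumption.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width intervals spanning [min, max].

    Assumes the values are not all identical (so the bin width is nonzero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls into the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

widths = [0.2, 0.2, 0.4, 1.3, 1.5, 1.8, 2.3, 2.5]   # made-up measurements
for bins in (10, 20):
    print(bins, histogram(widths, bins))
```

Changing the number of bins changes the apparent shape of the distribution, which is why the same attribute is shown with two different bin counts.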
31Two-Dimensional Histograms
- Show the joint distribution of the values of two attributes
- Example: petal width and petal length
- What does this tell us?
32Visualization Techniques: Box Plots
33Example of Box Plots
- Box plots can be used to compare attributes
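The numbers a box plot summarizes can be computed directly. The sketch below uses one common convention (minimum, quartiles, maximum); box plots that draw whiskers at other percentiles would need different cut points. The two attributes are made-up values for illustration.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), one common set of box plot values."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

# Made-up measurements for two attributes, for illustration only.
attribute_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
attribute_b = [2, 2, 3, 3, 3, 4, 4, 10, 12]

for name, values in (("a", attribute_a), ("b", attribute_b)):
    print(name, five_number_summary(values))
```

Putting the summaries side by side is what makes the comparison of attributes possible: differences in median, spread, and skew are visible at a glance.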
34Scatter Plot Array of Iris Attributes
35Contour Plot Example: SST Dec, 1998
36Visualization of the Iris Correlation Matrix
37Parallel Coordinates Plots for Iris Data
38OLAP
- Relational databases put data into tables, while OLAP uses a multidimensional array representation.
- Such representations of data previously existed in statistics and other fields
- There are a number of data analysis and data exploration operations that are easier with such a data representation.
39Example Iris data (continued)
- Each unique tuple of petal width, petal length, and species type identifies one element of the array.
- This element is assigned the corresponding count value.
- The figure illustrates the result.
- All non-specified tuples are 0.
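The count array described above can be sketched with a sparse mapping from tuples to counts. The records below are made-up, already-discretized values, not the actual Iris counts, and using a `Counter` instead of a dense array is a design choice for the example.

```python
from collections import Counter

def count_array(records):
    """Count how often each (petal width, petal length, species) tuple occurs.

    A Counter behaves like a sparse multidimensional array: any tuple that
    was never seen reads back as 0.
    """
    return Counter(records)

# Made-up, already-discretized records (not the actual Iris counts).
records = [
    ("low", "low", "setosa"),
    ("low", "low", "setosa"),
    ("high", "high", "virginica"),
]
cube = count_array(records)
print(cube[("low", "low", "setosa")])
print(cube[("medium", "low", "setosa")])   # unseen tuple

# Summing over species collapses the array to two dimensions.
by_size = Counter()
for (width, length, _species), n in cube.items():
    by_size[(width, length)] += n
print(by_size)
```

Summing out one dimension, as in the last loop, is an example of the kind of operation that is easier on an array representation than on a table.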