Title: Data Mining in Macroeconomic Data Sets
1Data Mining in Macroeconomic Data Sets
- Advised by Christos Faloutsos
- 2006. 04. 27
- Ping Chen
2Outline
- Research Background
- Research Questions
- Task 1: Exploration of Economy Network Property
- Task 2: Temporal Evolution Patterns
3Motivation
- Economic Supply Chain Connections
- Hidden sector connections
- Economic Input Output (EIO) Account: supply-demand connections
4Approach
(Figure: Power Supply Sector, Construction Sector, Manufacturing Sector)
5Research Questions
- RQ1: Can we describe the properties of the economy network?
- RQ2: Can we characterize the changes in the transactions over years and explain why?
- RQ3: Can we spot anti-correlated and correlated sectors effectively?
- RQ4: Can we detect outlier sectors effectively?
6Data Preparation
- EIO Table Structure
- Row: Supply Sector
- Column: Demand Sector
- Sector Pair
- Pair Transaction
- Yearly Transaction Set
- Pair Transaction Sequence
(Figure: Power Sector and Construction Sector in the EIO tables for Year 1947, Year 1958, and Year 1982)
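The two views of the data named on this slide can be sketched in a few lines. This is a minimal illustration, assuming each yearly EIO table is a square matrix with supply sectors as rows and demand sectors as columns; the sector list, the random numbers, and the helper names `yearly_transaction_set` and `pair_transaction_sequence` are hypothetical and not taken from the slides.

```python
import numpy as np

# Hypothetical toy EIO tables: rows are supply sectors, columns are demand sectors.
# Sector names and all numbers are made up for illustration only.
sectors = ["Power", "Construction", "Manufacturing"]
years = [1947, 1958, 1982]
rng = np.random.default_rng(0)
tables = {year: rng.uniform(1.0, 100.0, size=(3, 3)) for year in years}


def yearly_transaction_set(tables, year):
    """All pair transactions of one year, flattened row by row."""
    return tables[year].ravel()


def pair_transaction_sequence(tables, supply, demand, years):
    """The transaction of one (supply, demand) pair traced over the years."""
    return np.array([tables[y][supply, demand] for y in years])


# Stack every pair's sequence into a (pairs x years) matrix.
n = len(sectors)
X = np.array([
    pair_transaction_sequence(tables, i, j, years)
    for i in range(n)
    for j in range(n)
])
```

Each column of `X` is then one yearly transaction set and each row is one pair transaction sequence, which is the layout the later PCA steps would operate on.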
7Part I. Economy Network Property
8Network Topology Weight Distribution
- What does the transaction distribution look like, and how should we model it (Gaussian? Uniform?)
9Network Topology Weight Distribution
1982 Inter-Transaction Distribution
10Power Laws
- Power Laws
- Special case: Pareto's Law
(Figure, Pareto's Law panel: negative cumulative distribution function, cumulative number of sites with > x visitors (log) vs. number of visitors (log); slope -1.07)
(Figure, Power Law panel: probability density function, proportion of sites (log) vs. number of visitors (log); slope -2.07)
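The two slopes on this slide differ by exactly one, which follows from integrating a power-law density. A short derivation, assuming a pure power law above some cutoff (the symbols C, alpha, and x_min are introduced here for illustration and do not appear on the slide):

```latex
p(x) = C\,x^{-\alpha}, \quad x \ge x_{\min},\ \alpha > 1
\quad\Longrightarrow\quad
P(X > x) = \int_x^{\infty} C\,t^{-\alpha}\,dt = \frac{C}{\alpha - 1}\,x^{-(\alpha - 1)}

\log p(x) = \log C - \alpha \log x,
\qquad
\log P(X > x) = \log\frac{C}{\alpha - 1} - (\alpha - 1)\log x
```

On log-log axes the density therefore has slope -alpha and the negative cumulative distribution has slope -(alpha - 1), so alpha = 2.07 is consistent with the slopes -2.07 and -1.07.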
11Examples of double Pareto logNormal (dPlN) Distribution
(Figure: log(density) vs. log(income) for household income data in the United States, Canada, and Sri Lanka; Reed, 2002, 2003)
12Double Pareto LogNormal (dPlN) Distribution
- Double Pareto LogNormal Distribution (Reed et al., 2003)
CDF
NCDF
13dPlN Parameter Interpretation
- PDF, log-normal: mean, variance
- CDF, log-log: slope
- NCDF, log-log: slope
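One way to see how the parameters map onto the curves is to simulate the distribution. The sketch below uses the standard construction of a dPlN variable as the exponential of a normal variable plus the difference of two scaled exponential variables. The parameter names `alpha`, `beta`, `nu`, `tau` follow the usual dPlN notation and are assumptions here, since the slide's symbols were not transcribed; the helper names and the chosen values are hypothetical.

```python
import numpy as np


def sample_dpln(alpha, beta, nu, tau, size, rng):
    """Draw dPlN samples as exp(normal + E1/alpha - E2/beta)."""
    z = rng.normal(nu, tau, size)            # log-normal body: mean nu, std tau on the log scale
    e1 = rng.exponential(1.0, size) / alpha  # drives the upper (NCDF) tail
    e2 = rng.exponential(1.0, size) / beta   # drives the lower (CDF) tail
    return np.exp(z + e1 - e2)


def upper_tail_slope(x, top_fraction=0.01):
    """Least-squares slope of log NCDF vs. log x over the largest values."""
    xs = np.sort(x)
    n = len(xs)
    k = int(n * top_fraction)
    tail = xs[-k:]                           # the k largest values, ascending
    ncdf = np.arange(k, 0, -1) / n           # empirical P(X >= x) for each tail value
    slope, _ = np.polyfit(np.log(tail), np.log(ncdf), 1)
    return slope


rng = np.random.default_rng(0)
x = sample_dpln(alpha=2.0, beta=1.5, nu=0.0, tau=0.5, size=200_000, rng=rng)
slope = upper_tail_slope(x)  # should be roughly -alpha for a large sample
```

Under this construction the NCDF slope on log-log axes approaches -alpha for large values and the CDF slope approaches beta for small values, while `nu` and `tau` set the location and spread of the log-normal body.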
14Weight (Transaction) Distribution
15Weight (Transaction) Distribution
(Figure annotations: 4.3 and 5.5; 1.2 and 1.5; 7.1 and 7.9; 0.5 and 1.1)
16RQ1 How should we describe the web property of the economy network?
- Highly skewed transaction data sets (network weight)
- Transaction distribution is well fitted by double Pareto logNormal (dPlN) distribution
17Part II. Economic Dependency Evolution Pattern
18Research Questions
- RQ1: How should we describe the web structure of the economy network?
- RQ2: Can we characterize the changes in the transactions over years and explain why?
- RQ3: Can we spot correlated sectors effectively?
- RQ4: Can we detect outlier sectors effectively?
19Clustering Methods Survey
- K-means (MacQueen, J. B., 1967)
- Singular Value Decomposition (Maltseva, E., Pizzuti, C., Talia, D., 2001)
s_k: kth singular value
v_k: kth principal component of X^T X
s_k^2: variance of kth principal component of X^T X
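The relationship between singular values and principal components can be checked numerically. This is a minimal sketch with NumPy on hypothetical random data; the matrix size and the column centering are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))      # hypothetical data: 50 rows, 5 columns
X = X - X.mean(axis=0)            # center columns so X^T X is a scatter matrix

# Thin SVD: X = U diag(s) Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rows of Vt are the v_k; they are eigenvectors of X^T X with eigenvalues s_k^2.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigh returns ascending order
eigvals = eigvals[::-1]

# Projection of every row onto the principal components.
scores = X @ Vt.T
```

The sum of squared scores along component k equals `s[k] ** 2`, which is the variance of that component up to the 1/(n - 1) factor.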
20(Figure: Power Sector and Construction Sector across Year 1947, Year 1958, and Year 1982, showing the pair transaction (Construction Sector -> Power Sector) and the Yearly Transaction Set (Year 1947))
21 PCA Projection
22 PCA Projection
- Advantage: dimension reduction, visualization
- Disadvantage: suffers from data skewness
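The skewness problem can be illustrated on synthetic data: when row magnitudes span several orders of magnitude, a handful of very large rows account for almost all of the first principal component. The data below is hypothetical (log-normal scales with small noise), chosen only to mimic a highly skewed data set.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical skewed data: 300 rows over 7 time points, row scale is log-normal.
scale = np.exp(rng.normal(0.0, 3.0, size=(300, 1)))
X = scale * (1.0 + 0.1 * rng.normal(size=(300, 7)))

centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = U[:, 0] * s[0]                     # coordinates on the first principal component

# Share of PC1 energy carried by the 10% of rows with the largest average.
avg = X.mean(axis=1)
top = avg >= np.quantile(avg, 0.9)
share = (scores[top] ** 2).sum() / (scores ** 2).sum()
```

A `share` close to 1 means the projection mostly separates large rows from small ones and hides the shape of the individual sequences.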
23Data Normalization and Redo PCA
24Sub-Questions
- How to handle data skewness?
- How to normalize data?
- How to interpret the PCA outcomes using normalized data?
- How to bring back information that is lost in the data normalization process, e.g., transaction scale?
Normalization
25Solution: Multiple Steps of Pattern REcognition in skewed DAta (M-SPREAD)
- Step 1: Data normalization
- Step 2: Principal Component Analysis
- Step 3: Data Visualization and Pattern Identification (M-Plane)
- Step 4: Data Bucket Generation
- Step 5: Sub data set Pattern Identification (M-Slice)
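Steps 1 to 3 above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the slides do not say which normalization is used, so dividing each pair's sequence by its own mean is an assumed choice, and the function names and synthetic data are hypothetical.

```python
import numpy as np


def normalize_rows(X):
    """Step 1 (assumed choice): divide each pair's sequence by its own mean."""
    return X / X.mean(axis=1, keepdims=True)


def pca_plane(Xn, n_components=2):
    """Steps 2 and 3: PCA via SVD, then coordinates on the leading components."""
    centered = Xn - Xn.mean(axis=0)
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    return centered @ components.T, components, s


# Hypothetical data: 200 sector pairs over 7 years, magnitudes heavily skewed.
rng = np.random.default_rng(2)
years = np.arange(7)
scale = np.exp(rng.normal(0.0, 3.0, size=(200, 1)))
growth = rng.normal(0.05, 0.03, size=(200, 1))
X = scale * np.exp(growth * years)

Xn = normalize_rows(X)
plane, components, s = pca_plane(Xn)   # plane[:, 0] and plane[:, 1] are the M-Plane axes
```

After normalization every row has mean 1, so the projection reflects the shape of each sequence over time rather than its magnitude; the magnitude information is what Steps 4 and 5 are meant to bring back.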
26Principal Components and Interpretation
- Observations
- Reversed PC1: Continuous intensified inter-transactions
- Reversed PC2: Interrupted inter-transaction growth in 1970s
- Possible Reason: Oil Crisis in 1970s
27(Oil price chronology)
28(Figure: PC1 vs. PC2 plane with regions labeled Growing, Shrinking, Oil Benefiting, and Oil Suffering)
29M-Plane
30Observations from M-Plane
- Four Regions
- Growing (Majority)
- Shrinking
- Oil Suffering (Another Cluster)
- Oil Benefiting
- Two clusters (C1, C2)
- C1: Inter-transaction amounts grow
- C2: Inter-transaction amounts suffered from Oil Crisis in 1970s
31Data Bucket Generation
(Figure: transactions before normalization, split by average into data buckets 1 to 4)
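Bucket generation and the per-bucket views can be sketched as below. The slide indicates that buckets are formed from the pre-normalization average; using quartile edges to get four equally populated buckets is an assumption, and the function name, the synthetic transactions, and the placeholder M-Plane coordinates are hypothetical.

```python
import numpy as np


def make_buckets(X, n_buckets=4):
    """Label each row 1..n_buckets by the quantile of its pre-normalization average."""
    avg = X.mean(axis=1)
    edges = np.quantile(avg, np.linspace(0, 1, n_buckets + 1)[1:-1])
    return np.searchsorted(edges, avg, side="right") + 1


rng = np.random.default_rng(3)
X = np.exp(rng.normal(0.0, 3.0, size=(400, 7)))   # hypothetical skewed transactions
labels = make_buckets(X)

# Hypothetical M-Plane coordinates, one (PC1, PC2) point per row of X.
plane = rng.normal(size=(400, 2))

# One M-Slice per bucket: the M-Plane points restricted to that magnitude range.
slices = {b: plane[labels == b] for b in range(1, 5)}
```

Plotting each entry of `slices` separately gives one view per magnitude range, so patterns that differ between large and small transactions are not mixed together.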
32M-Slices
M-Plane
33Observations from M-Slices
- Major patterns change over data buckets having different data magnitude
- Very Large-size Transaction pairs: Most growing, very few oil sensitive
- Large and Medium-size Transaction pairs: Both growing, a few oil sensitive
- Small-size Transaction pairs: mixed growth patterns
34M-SubSettings
Demand Sector
Supply Sector
35M-SubSetting Examples
(Figure: M-Plane views by supply sector and demand sector for Motor Vehicle and Equipment Industry, Aircraft and Parts, Petroleum Refining and related Industry, and Transportation and Warehousing)
36Observations from M-Sub Settings
- Individual industry-related dependence evolution pattern
- Motor, aircraft parts, etc. industry: oil suffering
- Domestic petroleum industry, warehousing industry related: oil benefiting
- Correlated sectors: motor, aircraft parts
- Observation of substitution phenomena: transportation approach vs. warehousing facility, etc.
37Discussion of M-SPREAD Procedure
- Effects of the choice of normalization method
38Discussion of M-SPREAD Procedure
- Effects of removing small transactions
39Summary of Patterns
- Correlated and anti-correlated Sectors (auto vs. aircraft part, auto vs. warehousing)
- Time Evolution Patterns (growing, oil suffering)
- Effect of magnitude (large: growing)
- Outlier Sector, Correlations, Substitution effect
- Outlier Time Stamp (1977)
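The slides read correlated and anti-correlated sectors off the M-Plane views. As a simpler stand-in, the same idea can be illustrated with the Pearson correlation between pair sequences; this is a swapped-in technique, and the three sequences below are synthetic shapes, not EIO data. The sector names are reused from the summary only as labels, and the `label` helper and its threshold are hypothetical.

```python
import numpy as np

# Synthetic, made-up sequences over 8 time points (not taken from the EIO tables).
t = np.linspace(0.0, 1.0, 8)
sequences = {
    "auto": 1.0 + t,
    "aircraft parts": 2.0 + 1.5 * t + 0.05 * np.cos(6 * t),
    "warehousing": 3.0 - t,
}
names = list(sequences)

# Pearson correlation matrix; rows and columns follow the order of `names`.
corr = np.corrcoef(np.vstack([sequences[n] for n in names]))


def label(r, threshold=0.8):
    """Classify a correlation coefficient using a hypothetical cutoff."""
    if r >= threshold:
        return "correlated"
    if r <= -threshold:
        return "anti-correlated"
    return "neither"
```

Here `label(corr[0, 1])` classifies the auto and aircraft parts pair, and `label(corr[0, 2])` classifies auto against warehousing.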
40Contributions
- Discovery of dPlN of transaction distribution
- M-SPREAD: handles normalization, handles various magnitudes
- Effective visualization methods (M-Plane, M-Slice, M-Sub Setting)
- Discovery of patterns
- Time evolution pattern
- Effect of magnitude
- Correlated, anti-correlated sectors
- Outliers
41Complete Principal Components
42Plot of Singular Values
47Selection of Feature Values