Distributed Data Mining

Transcript and Presenter's Notes

Title: Distributed Data Mining

1
Distributed Data Mining
ACAI05/SEKT05 ADVANCED COURSE ON KNOWLEDGE
DISCOVERY
  • Dr. Giuseppe Di Fatta
  • University of Konstanz (Germany)
  • and ICAR-CNR, Palermo (Italy)
  • 5 July, 2005
  • Email: fatta@inf.uni-konstanz.de,
    difatta@pa.icar.cnr.it

2
Tutorial Outline
  • Part 1 Overview of High-Performance Computing
  • Technology trends
  • Parallel and Distributed Computing architectures
  • Programming paradigms
  • Part 2 Distributed Data Mining
  • Classification
  • Clustering
  • Association Rules
  • Graph Mining
  • Conclusions

3
Tutorial Outline
  • Part 1 Overview of High-Performance Computing
  • Technology trends
  • Moore's law
  • Processing
  • Memory
  • Communication
  • Supercomputers

4
Units of HPC
  • Processing
  • 1 Mflop/s = 1 Megaflop/s = 10^6 Flop/sec
  • 1 Gflop/s = 1 Gigaflop/s = 10^9 Flop/sec
  • 1 Tflop/s = 1 Teraflop/s = 10^12 Flop/sec
  • 1 Pflop/s = 1 Petaflop/s = 10^15 Flop/sec
  • Memory
  • 1 MB = 1 Megabyte = 10^6 Bytes
  • 1 GB = 1 Gigabyte = 10^9 Bytes
  • 1 TB = 1 Terabyte = 10^12 Bytes
  • 1 PB = 1 Petabyte = 10^15 Bytes

5
How far did we go?
6
Technology Limits
  • Consider a 1 Tflop/s, 1 TB sequential machine
  • data must travel some distance, r, to get from
    memory to CPU
  • to get 1 data element per cycle, this means 10^12
    times per second at the speed of light, c = 3x10^8
    m/s
  • so r < c/10^12 = 0.3 mm
  • Now put 1 TB of storage in a 0.3 mm^2 area
  • each word occupies about 3 Angstroms^2, the size
    of a small atom
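A minimal back-of-the-envelope check of the argument above (not part of the original slides), reading the storage area as a 0.3 mm x 0.3 mm square:

```python
# Speed-of-light limit for a 1 Tflop/s, 1 TB sequential machine.
C = 3e8        # speed of light, m/s
RATE = 1e12    # one data element per cycle, 10^12 times per second

r = C / RATE                                   # max memory-CPU distance, metres
print(f"r = {r * 1e3:.2f} mm")                 # ~0.30 mm

words = 1e12                                   # 1 TB of storage, one word per byte
area_per_word = (r * r) / words                # share of the 0.3 mm x 0.3 mm die
side_angstrom = area_per_word ** 0.5 / 1e-10   # 1 Angstrom = 1e-10 m
print(f"word size ~ {side_angstrom:.1f} Angstrom per side")   # ~3 Angstrom: atomic scale
```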

7
Moore's Law (1965)
  • Gordon Moore
  • (co-founder of Intel)

The complexity for minimum component costs has
increased at a rate of roughly a factor of two
per year. Certainly over the short term this rate
can be expected to continue, if not to increase.
Over the longer term, the rate of increase is a
bit more uncertain, although there is no reason
to believe it will not remain nearly constant for
at least 10 years. That means by 1975, the number
of components per integrated circuit for minimum
cost will be 65,000.
8
Moore's Law (1975)
  • In 1975, Moore refined his law:
  • circuit complexity doubles every 18 months.
  • So far it holds for CPUs and DRAMs!
  • Extrapolation for computing power at a given
    cost and semiconductor revenues.

9
Technology Trend
10
Technology Trend
11
Technology Trend
  • Processors issue instructions roughly every
    nanosecond.
  • DRAM can be accessed roughly every 100
    nanoseconds.
  • DRAM cannot keep processors busy! And the gap is
    growing:
  • processors are getting faster by 60% per year
  • DRAM is getting faster by 7% per year

12
Memory Hierarchy
  • Most programs have a high degree of locality in
    their accesses
  • spatial locality: accessing things near
    previous accesses
  • temporal locality: reusing an item that was
    previously accessed
  • Memory hierarchy tries to exploit locality.

13
Memory Latency
  • Hiding memory latency
  • temporal and spatial locality (caching)
  • multithreading
  • prefetching

14
Communication
  • Topology
  • The manner in which the nodes are connected.
  • Best choice would be a fully connected network
    (every processor to every other).
  • Unfeasible for cost and scaling reasons. Instead,
    processors are arranged in some variation of a
    bus, grid, torus, or hypercube.
  • Latency
  • How long does it take to start sending a
    "message"? Measured in microseconds.
  • (Also in processors: how long does it take to
    output the result of some operation, such as a
    pipelined floating-point add or divide?)
  • Bandwidth
  • What data rate can be sustained once the message
    is started? Measured in Mbytes/sec.

15
Networking Trend
  • System interconnection network
  • bus, crossbar, array, mesh, tree
  • static, dynamic
  • LAN/WAN

16
LAN/WAN
  • 1st network connection in 1969: 50 kbps
  • At about 10:30 PM on October 29th, 1969, the
    first ARPANET connection was established between
    UCLA and SRI over a 50 kbps line provided by the
    AT&T telephone company.
  • At the UCLA end, they typed in the 'l' and asked
    SRI if they received it 'got the l' came the
    voice reply. UCLA typed in the 'o', asked if they
    got it, and received 'got the o'. UCLA then typed
    in the 'g' and the darned system CRASHED! Quite a
    beginning. On the second attempt, it worked
    fine! (Leonard Kleinrock)
  • 10Base5 Ethernet in 1976 by Bob Metcalfe and
    David Boggs
  • end of the 90s: 100 Mbps (Fast Ethernet) and 1 Gbps
  • Bandwidth is not the whole story!
  • Do not forget to consider delay and latency.

17
Delay in packet-switched networks
  • (1) Nodal processing
  • check bit errors
  • determine output link
  • (2) Queuing
  • time waiting at output link for transmission
  • depends on congestion level of router
  • (3) Transmission delay
  • R = link bandwidth (bps)
  • L = packet length (bits)
  • time to send bits into link = L/R
  • (4) Propagation delay
  • d = length of physical link
  • s = propagation speed in medium (~2x10^8 m/sec)
  • propagation delay = d/s

Note s and R are very different quantities!
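A small worked example (not from the slides) applying the definitions above; the link parameters are illustrative values only:

```python
# Transmission vs. propagation delay for one packet on one link.
L = 1500 * 8     # packet length in bits (a 1500-byte frame)
R = 100e6        # link bandwidth in bps (100 Mbps)
d = 1000e3       # physical link length in metres (1000 km)
s = 2e8          # propagation speed in the medium, m/s

transmission_delay = L / R     # time to push all bits onto the link
propagation_delay = d / s      # time for one bit to cross the link

print(f"transmission delay = {transmission_delay * 1e6:.0f} us")   # 120 us
print(f"propagation delay  = {propagation_delay * 1e3:.0f} ms")    # 5 ms
```

Note how the two delays depend on completely different parameters (R vs. s), which is exactly the point of the slide above.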
18
Latency
  • How long does it take to start sending a
    "message"?
  • Latency may be critical for parallel computing.
    Some LAN technologies provide high bandwidth and
    low latency.

19
HPC Trend
  • 20 years ago: 1 Mflop/s (1x10^6 Floating Point Ops/sec)
  • scalar based
  • 10 years ago: 1 Gflop/s (1x10^9 Floating Point Ops/sec)
  • vector and shared memory computing, bandwidth aware,
    block partitioned, latency tolerant
  • Today: 1 Tflop/s (1x10^12 Floating Point Ops/sec)
  • highly parallel, distributed processing, message
    passing, network based, data decomposition,
    communication/computation overlap
  • 5 years away: 1 Pflop/s (1x10^15 Floating Point Ops/sec)
  • many more levels of memory hierarchy, combination of
    grids and HPC, more adaptive, latency and bandwidth
    aware, fault tolerant, extended precision, attention
    to SMP nodes
20
TOP500 SuperComputers
21
TOP500 SuperComputers
22
IBM BlueGene/L
23
Tutorial Outline
  • Part 1 Overview of High-Performance Computing
  • Technology trends
  • Parallel and Distributed Computing architectures
  • Programming paradigms

24
Parallel and Distributed Systems
25
Different Architectures
  • Parallel computing
  • single systems with many processors working on
    same problem
  • Distributed computing
  • many systems loosely coupled by a scheduler to
    work on related problems
  • Grid Computing (MetaComputing)
  • many systems tightly coupled by software, perhaps
    geographically distributed, to work together on
    single problems or on related problems
  • Massively Parallel Processors (MPPs) continue to
    account for more than half of all installed
    high-performance computers worldwide (Top500
    list).
  • Microprocessor based supercomputers have brought
    a major change in accessibility and
    affordability.
  • Nowadays, cluster systems are the fastest-growing
    segment.

26
Classification Control Model
  • Flynn's Classical Taxonomy (1966)
  • Flynn's taxonomy distinguishes multi-processor
    computer architectures according to how they can
    be classified along the two independent
    dimensions of Instruction and Data. Each of these
    dimensions can have only one of two possible
    states: Single or Multiple.

27
SISD
Von Neumann Machine
  • Single Instruction, Single Data
  • A serial (non-parallel) computer
  • Single instruction: only one instruction stream
    is being acted on by the CPU during any one clock
    cycle
  • Single data: only one data stream is being used
    as input during any one clock cycle
  • Deterministic execution
  • This is the oldest and, until recently, the most
    prevalent form of computer
  • Examples: most PCs, single-CPU workstations and
    mainframes

28
SIMD
  • Single Instruction, Multiple Data
  • Single instruction: all processing units execute
    the same instruction at any given clock cycle.
  • Multiple data: each processing unit can operate
    on a different data element.
  • This type of machine typically has an instruction
    dispatcher, a very high-bandwidth internal
    network, and a very large array of very
    small-capacity instruction units.
  • Best suited for specialized problems
    characterized by a high degree of regularity,
    such as image processing.
  • Synchronous (lockstep) and deterministic
    execution
  • Two varieties: Processor Arrays and Vector
    Pipelines
  • Examples
  • Processor Arrays: Connection Machine CM-2, MasPar
    MP-1, MP-2
  • Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP,
    NEC SX-2, Hitachi S820

29
MISD
  • Multiple Instruction, Single Data
  • Few actual examples of this class of parallel
    computer have ever existed.
  • Some conceivable examples might be
  • multiple frequency filters operating on a single
    signal stream
  • multiple cryptography algorithms attempting to
    crack a single coded message.

30
MIMD
  • Multiple Instruction, Multiple Data
  • Currently, the most common type of parallel
    computer
  • Multiple Instruction: every processor may be
    executing a different instruction stream.
  • Multiple Data: every processor may be working
    with a different data stream.
  • Execution can be synchronous or asynchronous,
    deterministic or non-deterministic.
  • Examples: most current supercomputers, networked
    parallel computer "grids" and multi-processor SMP
    computers, including some types of PCs.

31
Classification Communication Model
  • Shared vs. Distributed Memory systems

32
Shared Memory UMA vs. NUMA
33
Distributed Memory MPPs vs. Clusters
  • Processors-memory nodes are connected by some
    type of interconnect network
  • Massively Parallel Processor (MPP): tightly
    integrated, single system image.
  • Cluster: individual computers connected by SW
[Figure: processor-memory (P-M) nodes connected by an interconnect network]
34
Distributed Shared-Memory
  • Virtual shared memory (shared address space)
  • on hardware level
  • on software level
  • Global address space spanning all of the memory
    in the system.
  • E.g., HPF, TreadMarks, software for NoWs
    (JavaParty, Manta, Jackal)

35
Parallel vs. Distributed Computing
  • Parallel computing usually considers dedicated
    homogeneous HPC systems to solve parallel
    problems.
  • Distributed computing extends the parallel
    approach to heterogeneous general-purpose
    systems.
  • Both look at the parallel formulation of a
    problem.
  • Reliability, security and heterogeneity are
    usually not considered in parallel computing,
    but they are considered in Grid computing.
  • "A distributed system is one in which the failure
    of a computer you didn't even know existed can
    render your own computer unusable." (Leslie
    Lamport)

36
Parallel and Distributed Computing
  • Parallel computing
  • Shared-Memory SIMD
  • Distributed-Memory SIMD
  • Shared-Memory MIMD
  • Distributed-Memory MIMD
  • Beyond DM-MIMD:
  • Distributed computing and Clusters
  • Beyond parallel and distributed computing:
  • Metacomputing

SCALABILITY
37
Tutorial Outline
  • Part 1 Overview of High-Performance Computing
  • Technology trends
  • Parallel and Distributed Computing architectures
  • Programming paradigms
  • Programming models
  • Problem decomposition
  • Parallel programming issues

38
Programming Paradigms
  • Parallel Programming Models
  • Control
  • how is parallelism created?
  • what orderings exist between operations?
  • how do different threads of control synchronize?
  • Naming
  • what data is private vs. shared?
  • how is logically shared data accessed or
    communicated?
  • Set of operations
  • what are the basic operations?
  • what operations are considered to be atomic?
  • Cost
  • how do we account for the cost of each of the
    above?

39
Model 1 Shared Address Space
  • Program consists of a collection of threads of
    control,
  • Each with a set of private variables
  • e.g., local variables on the stack
  • Collectively with a set of shared variables
  • e.g., static variables, shared common blocks,
    global heap
  • Threads communicate implicitly by writing and
    reading shared variables
  • Threads coordinate explicitly by synchronization
    operations on shared variables
  • writing and reading flags
  • locks, semaphores
  • Like concurrent programming on a uniprocessor
    (see the sketch below)
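A minimal sketch of the shared-address-space model in Python threads (an assumed illustration, not from the slides): threads communicate implicitly through a shared variable and coordinate explicitly with a lock.

```python
import threading

total = 0                    # logically shared variable
lock = threading.Lock()      # explicit synchronization object

def worker(private_chunk):
    partial = sum(private_chunk)      # private variable on this thread's stack
    global total
    with lock:                        # explicit coordination on the shared variable
        total += partial              # implicit communication: just write shared memory

data = list(range(1000))
threads = [threading.Thread(target=worker, args=(data[i::4],)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)   # 499500
```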

40
Model 2 Message Passing
  • Program consists of a collection of named
    processes
  • thread of control plus local address space
  • local variables, static variables, common blocks,
    heap
  • Processes communicate by explicit data transfers
  • matching pair of send/receive by the source and
    destination processes
  • Coordination is implicit in every communication
    event
  • Logically shared data is partitioned over local
    processes
  • Like distributed programming
  • Program with standard libraries: MPI, PVM
  • aka shared-nothing architecture, or a
    multicomputer (see the MPI sketch below)
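A minimal message-passing sketch using mpi4py, one possible Python binding for MPI (the slides name MPI/PVM generically, so the specific library and names here are assumptions). Run with, e.g., mpiexec -n 4 python sum.py:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # this process's id
size = comm.Get_size()      # number of processes

# Each process owns a private partition of the logically shared data.
local_data = range(rank * 100, (rank + 1) * 100)
local_sum = sum(local_data)

if rank != 0:
    # explicit data transfer; coordination is implicit in the communication
    comm.send(local_sum, dest=0, tag=7)
else:
    total = local_sum
    for src in range(1, size):
        total += comm.recv(source=src, tag=7)   # matching receive
    print("global sum =", total)
```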

41
Model 3 Data Parallel
  • Single sequential thread of control consisting of
    parallel operations
  • Parallel operations applied to all (or defined
    subset) of a data structure
  • Communication is implicit in parallel operators
    and shifted data structures
  • Elegant and easy to understand
  • Not all problems fit this model
  • Vector computing

42
SIMD Machine
  • An SIMD (Single Instruction Multiple Data)
    machine
  • A large number of small processors
  • A single control processor issues each
    instruction
  • each processor executes the same instruction
  • some processors may be turned off on any
    instruction
  • Machines not popular (CM-2), but the programming
    model is
  • implemented by mapping n-fold parallelism to p
    processors
  • mostly done in the compilers (HPF = High
    Performance Fortran)

43
Model 4 Hybrid
  • Shared memory machines (SMPs) are the fastest
    commodity machines. Why not build a larger machine
    by connecting many of them with a network?
  • CLUMP = Cluster of SMPs
  • Shared memory within one SMP, message passing
    outside
  • Clusters, ASCI Red (Intel), ...
  • Programming model?
  • Treat the machine as flat and always use message
    passing, even within an SMP (simple, but it ignores
    an important part of the memory hierarchy)
  • Expose two layers: shared memory (OpenMP) and
    message passing (MPI); higher performance, but
    harder to program.

44
Hybrid Systems
45
Model 5 BSP
  • Bulk Synchronous Parallel (BSP) model (L. Valiant,
    1990)
  • Used within the message passing or shared memory
    models as a programming convention
  • Phases separated by global barriers
  • Compute phases: all operate on local data (in
    distributed memory)
  • or read access to global data (in shared memory)
  • Communication phases: all participate in a
    rearrangement or reduction of global data
  • Generally all doing the same thing in a phase
  • all do f, but may all do different things within
    f
  • Simplicity of data parallelism without its
    restrictions

BSP superstep
46
Problem Decomposition
  • Domain decomposition → data parallel
  • Functional decomposition → task parallel

47
Parallel Programming
  • Directives-based data-parallel languages
  • such as High Performance Fortran (HPF) or OpenMP
  • Serial code is made parallel by adding directives
    (which appear as comments in the serial code)
    that tell the compiler how to distribute data and
    work across the processors.
  • The details of how data distribution,
    computation, and communications are to be done
    are left to the compiler.
  • Usually implemented on shared-memory
    architectures.
  • Message Passing (e.g. MPI, PVM)
  • very flexible approach based on explicit message
    passing via library calls from standard
    programming languages
  • It is left up to the programmer to explicitly
    divide data and work across the processors as
    well as manage the communications among them.
  • Multi-threading in distributed environments
  • Parallelism is transparent to the programmer
  • Shared-memory or distributed shared-memory systems

48
Parallel Programming Issues
  • The main goal of a parallel program is to get
    better performance over the serial version.
  • Performance evaluation
  • Important issues to take into account
  • Load balancing
  • Minimizing communication
  • Overlapping communication and computation

49
Speedup
  • Serial fraction: fs = Ts / T
  • Parallel fraction: fp = 1 - fs
  • Speedup: S(p) = T(1) / T(p)
  • Superlinear speedup is, in general, impossible,
    but it may arise in two cases:
  • memory hierarchy phenomena
  • search algorithms

50
Maximum Speedup
  • Amdahl's Law states that the potential program
    speedup is determined by the fraction of code (fp)
    that can be parallelized:
    S(p) = 1 / (fs + fp / p)  (see the sketch below)
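A small sketch (not from the slides) evaluating Amdahl's bound for a few parallel fractions and processor counts:

```python
def amdahl_speedup(fp, p):
    """Upper bound on speedup with parallel fraction fp on p processors."""
    fs = 1.0 - fp
    return 1.0 / (fs + fp / p)

for fp in (0.50, 0.90, 0.99):
    bounds = [round(amdahl_speedup(fp, p), 2) for p in (2, 10, 100, 10_000)]
    # As p grows, the speedup saturates at 1/fs.
    print(f"fp={fp}: speedup on 2/10/100/10000 procs = {bounds}, limit = {1/(1-fp):.0f}")
```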

51
Maximum Speedup
  • There are limits to the scalability of
    parallelism.
  • For example, at fp = 0.50, 0.90 and 0.99,
  • 50%, 90% and 99% of the code is parallelizable,
    and the speedup is bounded by 2, 10 and 100
    respectively.
  • However, certain problems demonstrate increased
    performance by increasing the problem size.
    Problems which increase the percentage of
    parallel time with their size are more "scalable"
    than problems with a fixed percentage of parallel
    time.
  • fs and fp may not be static

52
Efficiency
  • Given the parallel cost C = p Tp
  • Efficiency: E = S / p = Ts / (p Tp)
  • In general, the total overhead To is an
    increasing function of p, at least linear when
    fs > 0:
  • communication,
  • extra computation,
  • idle periods due to sequential components,
  • idle periods due to load imbalance.

53
Cost-optimality of Parallel Systems
  • A parallel system is composed by a parallel
    algorithm and a parallel computational platform.
  • A parallel system is cost-optimal if the cost of
    solving a problem has the same asymptotic growth
    (in Θ terms, as a function of the input size W)
    as the fastest known sequential algorithm.
  • As a consequence, a cost-optimal system has
    E = Θ(1).

54
Isoefficiency
  • For a given problem size, when we increase the
    number of PEs, the speedup and the efficiency
    decrease.
  • How much do we need to increase the problem size
    to keep the efficiency constant?
  • Isoefficiency is a metric for scalability
  • In general, as the problem size increases
    (keeping the number of processors constant), the
    efficiency increases.
  • Isoefficiency
  • In a scalable parallel system, when increasing the
    number of PEs, the efficiency can be kept
    constant by increasing the problem size.
  • Of course, for different problems, the rate at
    which W must be increased may vary. This rate
    determines the degree of scalability of the
    system.

55
Sources of Parallel Overhead
  • Total parallel overhead
  • INTERPROCESSOR COMMUNICATION
  • If each PE spends Tcomm time for communications,
    then the overhead will increase by pTcomm.
  • LOAD IMBALANCE
  • if it exists, some PEs will be idle while others
    are busy. Idle time of any PE contributes to the
    overhead time.
  • Load imbalance always occurs if there is a
    strictly sequential component of the algorithm.
  • Load imbalance often occurs at the end of the
    run due to asynchronous termination (e.g. in
    coarse-grain parallelism).
  • EXTRA COMPUTATION
  • Parallel version of the fastest sequential
    algorithm may not be straightforward. Additional
    computation may be needed in the parallel
    algorithm. This contributes to the Overhead Time.

To = p Tp - Ts
56
Load Balancing
  • Load balancing is the task of equally dividing
    the work among the available processes.
  • A range of load balancing problems is determined
    by
  • Task costs
  • Task dependencies
  • Locality needs
  • Spectrum of solutions from static to dynamic

A closely related problem is scheduling, which
determines the order in which tasks run.
57
Different Load Balancing Problems
  • Load balancing problems differ in
  • Task costs
  • Do all tasks have equal costs?
  • If not, when are the costs known?
  • Before starting, when the task is created, or only
    when the task ends
  • Task dependencies
  • Can all tasks be run in any order (including in
    parallel)?
  • If not, when are the dependencies known?
  • Before starting, when the task is created, or only
    when the task ends
  • Locality
  • Is it important for some tasks to be scheduled on
    the same processor (or nearby) to reduce
    communication cost?
  • When is the information about communication
    between tasks known?

58
Task cost
59
Task Dependency
(e.g. data/control dependencies at end/beginning
of task executions)
60
Task Locality
(e.g. data/control dependencies during task
executions)
61
Spectrum of Solutions
  • Static scheduling. All information is available
    to scheduling algorithm, which runs before any
    real computation starts. (offline algorithms)
  • Semi-static scheduling. Information may be known
    at program startup, or the beginning of each
    timestep, or at other well-defined points.
    Offline algorithms may be used even though the
    problem is dynamic.
  • Dynamic scheduling. Information is not known
    until mid-execution.
  • (online algorithms)

62
LB Approaches
  • Static load balancing
  • Semi-static load balancing
  • Self-scheduling (manager-workers)
  • Distributed task queues
  • Diffusion-based load balancing
  • DAG scheduling (graph partitioning is
    NP-complete)
  • Mixed Parallelism

63
Distributed and Dynamic LB
  • Dynamic load balancing algorithms, aka work
    stealing/donating
  • Basic idea, when applied to search trees
  • Each processor performs search on disjoint part
    of tree
  • When finished, get work from a processor that is
    still busy
  • Requires asynchronous communication

[State diagram: a busy processor does a fixed amount of work and services
pending messages; when it has finished its available work it becomes idle,
selects a processor and requests work, servicing pending messages until
work is found.]
64
Selecting a Donor
  • Basic distributed algorithms
  • Asynchronous Round Robin (ARR)
  • Each processor k keeps a variable target_k
  • When a processor runs out of work, it requests
    work from target_k
  • Set target_k = (target_k + 1) mod p
  • Nearest Neighbor (NN)
  • Round robin over neighbors
  • Takes topology into account (as diffusive
    techniques do)
  • Load balancing somewhat slower than randomized
    schemes
  • Global Round Robin (GRR)
  • Processor 0 keeps a single variable target
  • When a processor needs work, it reads target and
    requests work from that processor
  • P0 increments target (mod p) with each access
  • Random polling/stealing
  • When a processor needs work, select a random
    processor and request work from it
    (see the sketch below)
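A sequential sketch (assumed, not from the slides) of the ARR and random-polling donor-selection rules above; a real implementation would issue asynchronous work requests over the network instead of calling the donor directly.

```python
import random

class Worker:
    def __init__(self, wid, nprocs, work=0):
        self.wid = wid
        self.nprocs = nprocs
        self.work = work                  # abstract units of remaining work
        self.target = (wid + 1) % nprocs  # ARR: private pointer to the next donor

    def steal_from(self, donor):
        """The donor gives away half of its remaining work, if it has enough."""
        if donor is self or donor.work <= 1:
            return False
        stolen = donor.work // 2
        donor.work -= stolen
        self.work += stolen
        return True

    def request_work_arr(self, workers):
        """Asynchronous Round Robin: ask target_k, then advance the pointer."""
        donor = workers[self.target]
        self.target = (self.target + 1) % self.nprocs
        return self.steal_from(donor)

    def request_work_random(self, workers):
        """Random polling: pick a random processor and request work from it."""
        return self.steal_from(workers[random.randrange(self.nprocs)])

workers = [Worker(i, 4, work=100 if i == 0 else 0) for i in range(4)]
while not workers[2].request_work_arr(workers):   # idle worker 2 polls until it gets work
    pass
print([w.work for w in workers])    # e.g. [50, 0, 50, 0]
```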

65
Tutorial Outline
  • Part 2 Distributed Data Mining
  • Classification
  • Clustering
  • Association Rules
  • Graph Mining

66
Knowledge Discovery in Databases
  • Knowledge Discovery in Databases (KDD) is a
    non-trivial process of identifying valid, novel,
    potentially useful, and ultimately understandable
    patterns in data.

[KDD process diagram: operational databases → clean, collect, summarize →
data warehouse → data preparation → training data → data mining →
model/patterns → verification and evaluation]
67
Origins of Data Mining
  • KDD draws ideas from machine learning/AI, pattern
    recognition, statistics, database systems, and
    data visualization.
  • Prediction Methods
  • Use some variables to predict unknown or future
    values of other variables.
  • Description Methods
  • Find human-interpretable patterns that describe
    the data.
  • Traditional techniques may be unsuitable
  • enormity of data
  • high dimensionality of data
  • heterogeneous, distributed nature of data

68
Speeding up Data Mining
  • Data oriented approach
  • Discretization
  • Feature selection
  • Feature construction (PCA)
  • Sampling
  • Methods oriented approach
  • Efficient and scalable algorithms

69
Speeding up Data Mining
  • Methods oriented approach (contd.)
  • Distributed and parallel data-mining
  • Task or control parallelism
  • Data parallelism
  • Hybrid parallelism
  • Distributed-data mining
  • Voting
  • Meta-learning, etc.

70
Tutorial Outline
  • Part 2 Distributed Data Mining
  • Classification
  • Clustering
  • Association Rules
  • Graph Mining

71
What is Classification?
  • Classification is the process of assigning new
    objects to predefined categories or classes
  • Given a set of labeled records
  • Build a model (e.g. a decision tree)
  • Predict labels for future unlabeled records

72
Classification learning
  • Supervised learning (labels are known)
  • Examples described in terms of attributes
  • Categorical (unordered symbolic values)
  • Numeric (integers, reals)
  • Class (output/predicted attribute)
  • categorical for classification
  • numeric for regression

73
Classification learning
  • Training set
  • set of examples, where each example is a feature
    vector (i.e., a set of <attribute, value> pairs)
    with its associated class. The model is built on
    this set.
  • Test set
  • a set of examples disjoint from the training set,
    used for testing the accuracy of a model.

74
Classification Example
[Figure: a training set with categorical, categorical and continuous
attributes plus a class label is used to learn a classifier.]
75
Classification Models
  • Some models are better than others
  • Accuracy
  • Understandability
  • Models range from easy to understand to
    incomprehensible (listed here from easier to
    harder)
  • Decision trees
  • Rule induction
  • Regression models
  • Genetic Algorithms
  • Bayesian Networks
  • Neural networks
76
Decision Trees
  • Decision tree models are better suited for data
    mining
  • Inexpensive to construct
  • Easy to Interpret
  • Easy to integrate with database systems
  • Comparable or better accuracy in many
    applications

77
Decision Trees Example
[Decision tree example on the categorical attributes Refund and MarSt and
the continuous attribute TaxInc: Refund = Yes → NO; Refund = No and
MarSt = Married → NO; Refund = No, MarSt in {Single, Divorced} and
TaxInc < 80K → NO; TaxInc > 80K → YES.]
The splitting attribute at a node is determined
based on the Gini index.
78
From Tree to Rules
1) Refund = Yes → NO
2) Refund = No and MarSt in {Single, Divorced} and TaxInc < 80K → NO
3) Refund = No and MarSt in {Single, Divorced} and TaxInc > 80K → YES
4) Refund = No and MarSt in {Married} → NO
[Same decision tree as on the previous slide.]
79
Decision Trees Sequential Algorithms
  • Many algorithms
  • Hunt's algorithm (one of the earliest)
  • CART
  • ID3, C4.5
  • SLIQ, SPRINT
  • General structure
  • Tree induction
  • Tree pruning

80
Classification algorithm
  • Build tree
  • Start with data at root node
  • Select an attribute and formulate a logical test
    on attribute
  • Branch on each outcome of the test, and move
    subset of examples satisfying that outcome to
    corresponding child node
  • Recurse on each child node
  • Repeat until leaves are pure, i.e., have
    examples from a single class, or nearly pure,
    i.e., the majority of examples are from the same
    class (see the induction sketch after this list)
  • Prune tree
  • Remove subtrees that do not improve
    classification accuracy
  • Avoid over-fitting, i.e., training set specific
    artifacts
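A compact sketch (assumed, not from the slides) of the recursive build procedure just described, using the Gini index defined a few slides later as the selection measure; attributes are treated as categorical equality tests and pruning is omitted.

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_i p(i)^2 over the class frequencies in the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Pick the (attribute, value) equality test with minimum weighted Gini."""
    best = None
    for a in range(len(rows[0])):
        for v in set(r[a] for r in rows):
            left = [i for i, r in enumerate(rows) if r[a] == v]
            right = [i for i, r in enumerate(rows) if r[a] != v]
            if not left or not right:
                continue
            w = (len(left) * gini([labels[i] for i in left]) +
                 len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or w < best[0]:
                best = (w, a, v, left, right)
    return best

def build_tree(rows, labels):
    if len(set(labels)) == 1:                       # pure leaf
        return labels[0]
    split = best_split(rows, labels)
    if split is None:                               # no useful split: majority leaf
        return Counter(labels).most_common(1)[0][0]
    _, a, v, left, right = split
    return {"test": (a, v),
            "yes": build_tree([rows[i] for i in left], [labels[i] for i in left]),
            "no":  build_tree([rows[i] for i in right], [labels[i] for i in right])}

# Toy usage: (Refund, Marital Status) -> class
rows = [("Yes", "Single"), ("No", "Married"), ("No", "Single"), ("No", "Divorced")]
labels = ["No", "No", "Yes", "Yes"]
print(build_tree(rows, labels))
```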

81
Build tree
  • Evaluate split-points for all attributes
  • Select the best point and the winning
    attribute
  • Split the data into two
  • Breadth/depth-first construction
  • CRITICAL STEPS
  • Formulation of good split tests
  • Selection measure for attributes

82
How to capture good splits?
  • Occam's razor: prefer the simplest hypothesis
    that fits the data
  • Minimum message/description length
  • dataset D
  • hypotheses H1, H2, ..., Hx describing D
  • MML(Hi) = Mlength(Hi) + Mlength(D|Hi)
  • pick the Hk with minimum MML
  • Mlength given by Gini index, Gain, etc.

83
Tree pruning using MDL
  • Data encoding: sum of classification errors
  • Model encoding
  • Encode the tree structure
  • Encode the split points
  • Pruning: choose the smallest-length option
  • Convert to leaf
  • Prune left or right child
  • Do nothing

84
Hunt's Method
  • Attributes: Refund (Yes, No), Marital Status
    (Single, Married, Divorced), Taxable Income
  • Class: Cheat, Don't Cheat

[Figure: the tree is grown step by step, starting from a single leaf
labeled Don't Cheat.]
85
What's really happening?
[Figure: the tree partitions the (Marital Status, Income) attribute space
into regions labeled Cheat / Don't Cheat, with splits at Married and
Income < 80K.]
86
Finding good split points
  • Use the Gini index for partition purity:
    Gini(S) = 1 - sum_i p(i)^2
  • where p(i) = frequency of class i in the node.
  • If S is pure, Gini(S) = 1 - 1 = 0
  • Find the split-point with minimum Gini
  • Only need class distributions (see the sketch
    below)
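A small sketch (assumed) showing that split quality can be computed from class-count distributions alone, as the last bullet states; the counts used are illustrative.

```python
def gini_from_counts(counts):
    """Gini = 1 - sum_i p(i)^2, computed from a class-count vector."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_split(count_matrix):
    """Weighted Gini of a split; count_matrix[j][i] = records of class i in partition j."""
    total = sum(sum(row) for row in count_matrix)
    return sum(sum(row) / total * gini_from_counts(row) for row in count_matrix)

# Example: a two-way split with classes (Cheat, Don't Cheat)
left, right = [0, 3], [3, 4]         # class counts in each partition
print(gini_from_counts([3, 7]))      # Gini of the node before the split: 0.42
print(gini_split([left, right]))     # weighted Gini after the split: ~0.34
```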

87
Finding good split points
[Figure: two candidate splits of the (Marital Status, Income) data;
Gini(split) = 0.34 vs. Gini(split) = 0.31]
88
Categorical Attributes Computing Gini Index
  • For each distinct value, gather counts for each
    class in the dataset
  • Use the count matrix to make decisions

Two-way split (find best partition of values)
Multi-way split
89
Decision Trees Parallel Algorithms
  • Approaches for Categorical Attributes
  • Synchronous Tree Construction (data parallel)
  • no data movement required
  • high communication cost as tree becomes bushy
  • Partitioned Tree Construction (task parallel)
  • processors work independently once partitioned
    completely
  • load imbalance and high cost of data movement
  • Hybrid Algorithm
  • combines good features of two approaches
  • adapts dynamically according to the size and
    shape of trees

90
Synchronous Tree Construction
  • Partitioning of data only
  • global reduction per node is required
  • large number of classification tree nodes gives
    high communication cost

[Figure: n records with m categorical attributes, partitioned across the
processors.]
91
Partitioned Tree Construction
  • Partitioning of classification tree nodes
  • natural concurrency
  • load imbalance as the amount of work associated
    with each node varies
  • child nodes use the same data as used by parent
    node
  • loss of locality
  • high data movement cost

92
Synchronous Tree Construction
Partition Data Across Processors
  • No data movement is required
  • Load imbalance
  • can be eliminated by breadth-first expansion
  • High communication cost
  • becomes too high in lower parts of the tree

93
Partitioned Tree Construction
Partition Data and Nodes
  • Highly concurrent
  • High communication cost due to excessive data
    movements
  • Load imbalance

94
Hybrid Parallel Formulation
switch
95
Load Balancing
96
Switch Criterion
  • Switch to Partitioned Tree Construction when
  • Splitting criterion ensures
  • Parallel Formulations of Decision-Tree
    Classification Algorithms. A. Srivastava, E.
    Han, V. Kumar, and V. Singh, Data Mining and
    Knowledge Discovery: An International Journal,
    vol. 3, no. 3, pp. 237-261, September 1999.

97
Speedup Comparison
[Speedup plots for 0.8 million and 1.6 million training examples, comparing
the hybrid, partitioned and synchronous formulations against linear speedup;
the hybrid algorithm tracks linear speedup most closely.]
98
Speedup of the Hybrid Algorithm with Different
Size Data Sets
99
Scaleup of the Hybrid Algorithm
100
Summary of Algorithms for Categorical Attributes
  • Synchronous Tree Construction Approach
  • no data movement required
  • high communication cost as tree becomes bushy
  • Partitioned Tree Construction Approach
  • processors work independently once partitioned
    completely
  • load imbalance and high cost of data movement
  • Hybrid Algorithm
  • combines good features of two approaches
  • adapts dynamically according to the size and
    shape of trees

101
Tutorial Outline
  • Part 2 Distributed Data Mining
  • Classification
  • Clustering
  • Association Rules
  • Graph Mining

102
Clustering Definition
  • Given a set of data points, each having a set of
    attributes, and a similarity measure among them,
    find clusters such that
  • Data points in one cluster are more similar to
    one another
  • Data points in separate clusters are less similar
    to one another

103
Clustering
  • Given N k-dimensional feature vectors, find a
    meaningful partition of the N examples into c
    subsets or groups.
  • Discover the labels automatically
  • c may be given, or discovered
  • Much more difficult than classification, since
    in the latter the groups are given, and we seek a
    compact description.

104
Clustering Illustration
k = 3 → Euclidean-distance-based clustering in 3-D
space
Intracluster distances are minimized
Intercluster distances are maximized
105
Clustering
  • Have to define some notion of similarity
    between examples
  • Similarity Measures
  • Euclidean distance, if attributes are continuous
  • Other problem-specific measures
  • Goal: maximize intra-cluster similarity and
    minimize inter-cluster similarity
  • The feature vector may be
  • all numeric (well-defined distances)
  • all categorical or mixed (harder to define
    similarity; geometric notions don't work)

106
Clustering schemes
  • Distance-based
  • Numeric
  • Euclidean distance (root of sum of squared
    differences along each dimension)
  • Angle between two vectors
  • Categorical
  • Number of common features (categorical)
  • Partition-based
  • Enumerate partitions and score each

107
Clustering schemes
  • Model-based
  • Estimate a density (e.g., a mixture of Gaussians)
  • Go bump-hunting
  • Compute P(feature vector i | cluster j)
  • Finds overlapping clusters too
  • Example: Bayesian clustering

108
Before clustering
  • Normalization
  • Given three attributes
  • A in micro-seconds
  • B in milli-seconds
  • C in seconds
  • Can't treat differences as the same in all
    dimensions or attributes
  • Need to scale or normalize for comparison
  • Can assign weights to give some attributes more
    importance (see the sketch below)
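A minimal sketch (assumed) of z-score normalization with optional per-attribute weights; the attribute units match the example above, and the data values are illustrative.

```python
import math

def zscore_normalize(vectors, weights=None):
    """Scale each attribute to zero mean, unit variance; then apply optional weights."""
    dims = len(vectors[0])
    weights = weights or [1.0] * dims
    means = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    stds = [math.sqrt(sum((v[d] - means[d]) ** 2 for v in vectors) / len(vectors)) or 1.0
            for d in range(dims)]
    return [[weights[d] * (v[d] - means[d]) / stds[d] for d in range(dims)]
            for v in vectors]

# A in microseconds, B in milliseconds, C in seconds: raw differences are not comparable.
data = [[120.0, 3.5, 0.9], [80.0, 4.1, 1.2], [200.0, 2.8, 0.7]]
print(zscore_normalize(data, weights=[1.0, 1.0, 2.0]))   # weight attribute C more
```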

109
The k-means algorithm
  • Specify k, the number of clusters
  • Guess k seed cluster centers
  • 1) Look at each example and assign it to the
    center that is closest
  • 2) Recalculate the centers
  • Iterate on steps 1 and 2 until the centers
    converge, or for a fixed number of iterations

110
K-means algorithm
Initial seeds
111
K-means algorithm
New centers
112
K-means algorithm
Final centers
113
Operations in k-means
  • Main operation: calculate the distance to all k
    means or centroids
  • Other operations
  • Find the closest centroid for each point
  • Calculate mean squared error (MSE) for all points
  • Recalculate centroids

114
Parallel k-means
  • Divide N points among P processors
  • Replicate the k centroids
  • Each processor computes distance of each local
    point to the centroids
  • Assign points to closest centroid and compute
    local MSE
  • Perform a reduction to obtain the global centroids
    and the global MSE value (see the sketch below)
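A sketch of the parallel k-means iteration described above, written with mpi4py and NumPy as an assumed environment (the slides do not prescribe a library): each process assigns its N/p local points, and a global reduction combines the per-cluster sums, counts and MSE.

```python
# Run with, e.g.: mpiexec -n 4 python parallel_kmeans.py   (file name is illustrative)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

k, dim, n_local, iters = 3, 2, 1000, 10
rng = np.random.default_rng(rank)          # each process owns N/p local points
points = rng.random((n_local, dim))

# Replicate the k centroids on every process (broadcast from rank 0).
centroids = comm.bcast(rng.random((k, dim)) if rank == 0 else None, root=0)

for _ in range(iters):
    # Distance of each local point to every centroid, and closest assignment.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    local_mse = (dists[np.arange(n_local), assign] ** 2).sum()

    # Local per-cluster sums and counts.
    local_sum = np.zeros((k, dim)); local_cnt = np.zeros(k)
    for j in range(k):
        members = points[assign == j]
        local_sum[j] = members.sum(axis=0)
        local_cnt[j] = len(members)

    # Global reduction: every process gets the global sums, counts and MSE.
    glob_sum = np.empty_like(local_sum); glob_cnt = np.empty_like(local_cnt)
    comm.Allreduce(local_sum, glob_sum, op=MPI.SUM)
    comm.Allreduce(local_cnt, glob_cnt, op=MPI.SUM)
    glob_mse = comm.allreduce(local_mse, op=MPI.SUM)

    # Recompute centroids (guard against globally empty clusters).
    centroids = glob_sum / np.maximum(glob_cnt, 1)[:, None]

if rank == 0:
    print("final MSE:", glob_mse)
```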

115
Serial and Parallel k-means
Group communication
116
Serial k-means Complexity
117
Parallel k-means Complexity
where the cost of the group communication depends on the physical
communication topology (e.g. logarithmic in p on a hypercube).
118
Speedup and Scaleup
Condition for linear speedup
Condition for linear scaleup (w.r.t. n)
119
Tutorial Outline
  • Part 2 Distributed Data Mining
  • Classification
  • Clustering
  • Association Rules
  • Frequent Itemset Mining
  • Graph Mining

120
ARM Definition
  • Given a set of records each of which contain some
    number of items from a given collection
  • Produce dependency rules which will predict
    occurrence of an item based on occurrences of
    other items.

121
ARM Definition
  • Given a set of items/attributes, and a set of
    objects containing a subset of the items
  • Find rules: if I1 then I2 (sup, conf)
  • I1, I2 are sets of items
  • I1, I2 have sufficient support: P(I1 ∪ I2)
  • The rule has sufficient confidence: P(I2 | I1)
    (see the sketch below)
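A small sketch (assumed) that computes support and confidence for a rule I1 → I2 directly from the definitions above; the transactions are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, i1, i2):
    """P(I2 | I1) = support(I1 ∪ I2) / support(I1)."""
    return support(transactions, set(i1) | set(i2)) / support(transactions, i1)

T = [{"milk", "bread"}, {"milk", "diaper", "beer"},
     {"bread", "diaper", "beer"}, {"milk", "bread", "diaper", "beer"}]
print(support(T, {"milk", "beer"}))        # 0.5
print(confidence(T, {"milk"}, {"beer"}))   # 0.666...
```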

122
Association Mining
  • User specifies interestingness
  • Minimum support (minsup)
  • Minimum confidence (minconf)
  • Find all frequent itemsets (support > minsup)
  • Exponential search space
  • Computation and I/O intensive
  • Generate strong rules (confidence > minconf)
  • Relatively cheap

123
Association Rule Discovery: Support and Confidence
[Example: an association rule derived from a small transaction table,
together with its support and confidence values.]
124
Handling Exponential Complexity
  • Given n transactions and m different items
  • number of possible association rules
  • computation complexity
  • Systematic search for all patterns, based on
    support constraint
  • If {A,B} has support at least a, then both {A} and
    {B} have support at least a.
  • If either {A} or {B} has support less than a, then
    {A,B} has support less than a.
  • Use patterns of k-1 items to find patterns of k
    items.

125
Apriori Principle
  • Collect single-item counts. Find large items.
  • Find candidate pairs, count them → large pairs
    of items.
  • Find candidate triplets, count them → large
    triplets of items, and so on...
  • Guiding principle: every subset of a frequent
    itemset has to be frequent.
  • Used for pruning many candidates (see the sketch
    below).
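A compact level-wise sketch of the Apriori principle (assumed, not from the slides): count single items, then repeatedly join frequent (k-1)-itemsets into candidate k-itemsets, pruning any candidate with an infrequent subset before counting.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all itemsets (as frozensets) with support count >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # L1: frequent single items
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s for s, c in counts.items() if c >= minsup}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Join step: candidate k-itemsets from frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count step: one scan of the transactions.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= minsup}
        all_frequent |= frequent
        k += 1
    return all_frequent

T = [{"bread", "milk"}, {"bread", "diaper", "beer"},
     {"milk", "diaper", "beer"}, {"bread", "milk", "diaper", "beer"}]
print(sorted(map(sorted, apriori(T, minsup=3))))
```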

126
Illustrating Apriori Principle
Items (1-itemsets)
Pairs (2-itemsets)
Minimum Support = 3
Triplets (3-itemsets)
If every subset is considered, C(6,1) + C(6,2) + C(6,3) = 41.
With support-based pruning, 6 + 6 + 2 = 14.
127
Counting Candidates
  • Frequent Itemsets are found by counting
    candidates.
  • Simple way
  • Search for each candidate in each
    transaction. Expensive!!!

[Figure: N transactions matched against M candidates.]
128
Association Rule Discovery: hash tree for fast
access.
[Figure: candidate hash tree; the hash function sends items 1,4,7 / 2,5,8 /
3,6,9 to the three branches at each level.]
129
Association Rule Discovery: Subset Operation
[Figure: the subsets of a transaction are matched against the candidate
hash tree.]
130
Association Rule Discovery: Subset Operation
(contd.)
[Figure: hashing continues recursively on the transaction's item prefixes,
e.g. 1 3 6, 3 4 5, 1 5 9, down to the candidate leaves.]
131
Parallel Formulation of Association Rules
  • Large-scale problems have
  • Huge Transaction Datasets (10s of TB)
  • Large Number of Candidates.
  • Parallel Approaches
  • Partition the Transaction Database, or
  • Partition the Candidates, or
  • Both

132
Parallel Association Rules Count Distribution
(CD)
  • Each Processor has complete candidate hash tree.
  • Each Processor updates its hash tree with local
    data.
  • Each Processor participates in global reduction
    to get global counts of candidates in the hash
    tree.
  • Multiple database scans per iteration are
    required if the hash tree is too big for memory
    (see the sketch below).
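A sketch of one Count Distribution counting pass using mpi4py (an assumed MPI binding; the candidate hash tree is simplified to a plain dictionary): each process counts the full candidate set over its local transactions, then a global reduction gives every process identical global counts.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

# Every process holds the complete candidate set (here: 2-itemsets).
candidates = [frozenset(c) for c in
              [("bread", "milk"), ("diaper", "beer"), ("milk", "beer")]]

# Each process scans only its own N/p share of the transactions.
all_transactions = [{"bread", "milk"}, {"bread", "diaper", "beer"},
                    {"milk", "diaper", "beer"}, {"bread", "milk", "diaper", "beer"}]
local_transactions = all_transactions[rank::p]

# Local counting pass (stands in for the candidate hash-tree traversal).
local_counts = {c: sum(c <= t for t in local_transactions) for c in candidates}

# Global reduction: sum the per-candidate counts across all processes.
global_counts = {c: comm.allreduce(n, op=MPI.SUM) for c, n in local_counts.items()}

if rank == 0:
    minsup = 2
    print("frequent:", [sorted(c) for c, n in global_counts.items() if n >= minsup])
```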

133
CD Illustration
[Figure: processors P0, P1, P2 each hold N/p transactions and a complete
candidate hash tree; the candidate counts are combined by a global
reduction.]
134
Parallel Association Rules Data Distribution
(DD)
  • Candidate set is partitioned among the
    processors.
  • Once local data has been partitioned, it is
    broadcast to all other processors.
  • High Communication Cost due to data movement.
  • Redundant work due to multiple traversals of the
    hash trees.

135
DD Illustration
[Figure: the candidates are partitioned among P0, P1, P2; local data is
broadcast to all other processors (all-to-all broadcast), and each
processor counts its own candidate subset over local and remote data.]
136
Parallel Association Rules Intelligent Data
Distribution (IDD)
  • Data Distribution using point-to-point
    communication.
  • Intelligent partitioning of candidate sets.
  • Partitioning based on the first item of
    candidates.
  • Bitmap to keep track of local candidate items.
  • Pruning at the root of candidate hash tree using
    the bitmap.
  • Suitable for single data source such as database
    server.
  • With smaller candidate set, load balancing is
    difficult.

137
IDD Illustration
[Figure: the candidates are partitioned by their first item, tracked with a
bitmask on each processor; data is shifted among P0, P1, P2 in a ring
(point-to-point) while each processor counts only its own candidates.]
138
Filtering Transactions in IDD
[Figure: the bitmask of local candidate items filters transaction items at
the root of the candidate hash tree.]
139
Parallel Association Rules Hybrid Distribution
(HD)
  • The candidate set is partitioned into G groups,
    each just fitting in main memory
  • Ensures good load balance with a smaller candidate
    set.
  • Logical processor mesh G x P/G is formed.
  • Perform IDD along the column processors
  • Data movement among processors is minimized.
  • Perform CD along the row processors
  • Smaller number of processors in global reduction
    operation.

140
HD Illustration
[Figure: a logical mesh of G groups of processors with P/G processors per
group; each processor holds N/P transactions. IDD is performed along the
columns and CD along the rows.]
141
Parallel Association Rules Comments
  • HD has shown the same linear speedup and sizeup
    behavior as that of CD.
  • HD exploits total aggregate main memory, while CD
    does not.
  • IDD has much better scaleup behavior than DD.

142
Tutorial Outline
  • Part 2 Distributed Data Mining
  • Classification
  • Clustering
  • Association Rules
  • Graph Mining
  • Frequent Subgraph Mining

143
Graph Mining
  • Market Basket Analysis
  • Association Rule Mining (ARM) → find frequent
    itemsets
  • Search space
  • Unstructured data: only the item type is important
  • item set I, |I| = n → power set P(I), |P(I)| = 2^n
  • pruning technique to make the search feasible
  • Subset test
  • for each user transaction t and each candidate
    frequent itemset s we need a subset test in order
    to compute the support (frequency).
  • Molecular Compound Analysis
  • Frequent Subgraph Mining (FSM) → find frequent
    subgraphs
  • Bigger search space
  • Structured data: atom types are not sufficient;
    atoms have bonds with other atoms.
  • Subgraph isomorphism test
  • for each graph and each candidate
    frequent subgraph we need a subgraph isomorphism
    test.
  • N.B. for general graphs the subgraph isomorphism
    test is NP-complete

144
Molecular Fragment Lattice

[Figure: lattice of molecular fragments ordered by extension, from single
atoms (C, O, S, N) through fragments such as C-C, S-O, C-S, S-N, C-S-N,
C-C-S, C-S-O, N-S-O, up to C-C-S-N (minSupp = 50%).]
145
Mining Molecular Fragments
  • Frequent Molecular Fragments
  • Frequent Subgraph Mining (FSM)
  • Discriminative Molecular Fragments
  • Molecular compounds are classified into
  • active compounds → focus subset
  • inactive compounds → complement subset
  • Problem definition
  • find all discriminative molecular fragments,
    which are frequent in the set of the active
    compounds and not frequent among the inactive
    compounds, i.e. contrast substructures.
  • User parameters
  • minSupp (minimum support in the focus dataset)
  • maxSupp (maximum support in the complement
    dataset)

146
Molecular Fragment Search Tree
A search tree node represents a molecular
fragment.
Successor nodes are generated by extending the
fragment by one bond and, possibly, one atom.
147
Large-Scale Issue
  • Need for scalability in terms of
  • the number of molecules
  • larger main and secondary memory to store
    molecules
  • fragments with longer lists of instances in
    molecules (embeddings)
  • the size of molecules
  • larger memory to store larger molecules
  • fragments with longer lists of longer embeddings
  • more fragments (bigger search space)
  • the minimum support
  • with lower support threshold the mining algorithm
    produces
  • more embeddings for each fragment
  • more fragments and longer search tree branches

148
High-Performance Distributed Approach
  • Sequential algorithms cannot handle large-scale
    problems, nor the small user-parameter values
    needed for better-quality results.
  • Search Space Partitioning
  • Distributed implementation of backtracking
  • external representation (enhanced SMILES)
  • DB selection and projection
  • Tree-based reduction
  • Dynamic load balancing for irregular problems
  • donor selection and work splitting mechanism
  • Peer-to-Peer computing

149
Search Space Partitioning
150
Search Space Partitioning
  • A 4th kind of search-tree pruning: distributed
    computing pruning
  • prune a search node, generate a new job and
    assign it to an idle processor
  • asynchronous communication and low overhead
  • backtracking is particularly suitable to parallel
    processing because a subtree rooted at any node
    can be searched independently

151
Tree-based Reduction
3D hypercube (p = 8)

Star reduction (master-slave): O(p)
Tree reduction: O(log(p))
152
Job Assignment and Parallel Overheads
[Figure: job assignment from a donor to an idle receiver: the donor splits
its search stack, embeds the donated node and sends it in an external
representation (enhanced SMILES), subject to network latency and delay.]
  • Parallel computing overheads: communication,
    excess computation, idling periods
  • 1st job assignment
  • job assignment
  • termination detection
  • Overlapped with useful computation
  • DB selection and projection
  • DLB for irregular problems
153
Parallel Execution Analysis
Worker execution is divided into the following phases:
  • setup: configuration message, DB loading
  • idle1: wait for the first job assignment, due to the
    initial sequential part
  • idle2: processor starvation
  • idle3: idle period due to load imbalance
  • jobs: job processing time (including computational
    overhead)
A single job execution is divided into:
  • data: data preprocessing
  • prep: prepare the root search node (embed the core
    fragment)
  • mining: data mining processing (useful work)
154
Load Balancing
  • The search space is not known a priori and is very
    irregular
  • Dynamic load balancing
  • receiver-initiated approach
  • donor selection
  • work splitting mechanism
  • The DLB determines the overall performance and
    efficiency.

155
Highly Irregular Problem
Search-tree node visit-time (subtree visit) and expand-time (node
extension) follow a power-law distribution.
156
Work Splitting
A search tree node n can be donated only if:
1) stackSize() > minStackSize,
2) support(n) > (1 + α) · minSupp,
3) lxa(n) < β · atomCount(n)
157
Dynamic Load Balancing
  • Receiver-initiated approaches
  • Random Polling (RP)
  • Scheduler-based (MS)

[Quasi-Random Polling (QRP) combines excellent scalability with a
near-optimal donor selection.]
  • Quasi-Random Polling (QRP) policy
  • Global list of potential donors (sorted w. r. t.
    the running time)
  • server collects job statistics
  • receiver periodically gets updated donor-list
  • P2P Computing framework
  • receiver selects a random donor according to a
    probability distribution decreasing with the
    donor rank in the list
  • high probability to choose long running jobs

158
Issues
  • Penalty for global synchronization
  • Adaptive application
  • Asynchronous communication
  • Highly irregular problem
  • Difficulty in predicting work loads
  • Heavy work loads may delay message processing
  • Large-scale multi-domain heterogeneous computing
    environment
  • Network latency and delay tolerant

159
Tutorial Outline
  • Part 1 Overview of High-Performance Computing
  • Technology trends
  • Parallel and Distributed Computing architectures
  • Programming paradigms
  • Part 2 Distributed Data Mining
  • Classification
  • Clustering
  • Association Rules
  • Graph Mining
  • Conclusions

160
Large-scale Parallel KDD Systems
  • Data
  • Terabyte-sized datasets
  • Centralized or distributed datasets
  • Incremental changes (refine knowledge as data
    changes)
  • Heterogeneous data sources

161
Large-scale Parallel KDD Systems
  • Software
  • Pre-processing, mining, post-processing
  • Interactive (anytime mining)
  • Modular (rapid development)
  • Web services
  • Workflow management tool integration
  • Fault and latency tolerant
  • Highly scalable

162
Large-scale Parallel KDD Systems
  • Computing Infrastructure
  • Clusters already widespread
  • Multi-domain heterogeneous
  • Data and computational Grids
  • Dynamic resource aggregation (P2P)
  • Self-managing

163
Research Directions
  • Fast algorithms for different mining tasks
  • Classification, clustering, associations, etc.
  • Incorporating concept hierarchies
  • Parallelism and scalability
  • Millions of records
  • Thousands of attributes/dimensions
  • Single pass algorithms
  • Sampling
  • Parallel I/O and file systems

164
Research Directions (contd.)
  • Parallel Ensemble Learning
  • parallel execution of different data mining
    algorithms and techniques that can be integrated
    to obtain a better model.
  • Not just high performance but also high accuracy

165
Research Directions (contd.)
  • Tight database integration
  • Push common primitives inside DBMS
  • Use multiple tables
  • Use efficient indexing techniques
  • Caching strategies for sequence of data mining
    operations
  • Data mining query language and parallel query
    optimization

166
Research Directions (contd.)
  • Understandability: too many patterns
  • Incorporate background knowledge
  • Integrate constraints
  • Meta-level mining
  • Visualization, exploration
  • Usability: build a complete system
  • Pre-processing, mining, post-processing,
    persistent management of mined results

167
Conclusions
  • Data mining is a rapidly growing field
  • Fueled by enormous data collection rates and the
    need for intelligent analysis for business and
    scientific gains.
  • Large and high-dimensional data requires new
    analysis techniques and algorithms.
  • High Performance Distributed Computing is
    becoming an essential component in data mining
    and data exploration.
  • Many research and commercial opportunities.

168
Resources
  • Workshops
  • IEEE IPDPS Workshop on Parallel and Distributed
    Data Mining
  • HiPC Special Session on Large-Scale Data Mining
  • ACM SIGKDD Workshop on Distributed Data Mining
  • IEEE IPDPS Workshop on High Performance Data
    Mining
  • ACM SIGKDD Workshop on Large-Scale Parallel KDD
    Systems
  • IEEE IPPS Workshop on High Performance Data
    Mining
  • LifeDDM, Distributed Data Mining in Life Science
  • Books
  • A. Freitas and S. Lavington. Mining very large
    databases with parallel processing. Kluwer
    Academic Pub., Boston, MA, 1998.
  • M. J. Zaki and C.-T. Ho (eds). Large-Scale
    Parallel Data Mining. LNAI State-of-the-Art
    Survey, Volume 1759, Springer-Verlag, 2000.
  • H. Kargupta and P. Chan (eds). Advance in
    Distributed and Parallel Knowledge Discovery,
    AAAI Press, Summer 2000.

169
References
  • Journal Special Issues
  • P. Stolorz and R. Musick (eds.). Scalable
    High-Performance Computing for KDD, Data Mining
    and Knowledge Discovery: An International
    Journal, Vol. 1, No. 4, December 1997.
  • Y. Guo and R. Grossman (eds.). Scalable Parallel
    and Distributed Data Mining, Data Mining and
    Knowledge Discovery: An International Journal.