Title: VIRTUAL PRESENCE
1 VIRTUAL PRESENCE
Authors
Voislav Galic, vgalic_at_bitsyu.net
Dušan Zečević, zdusan_at_softhome.net
Đorđe Đurđević, madcat_at_tesla.rcub.bg.ac.yu
Veljko Milutinovic, vm_at_etf.bg.ac.yu
http://galeb.etf.bg.ac.yu/vm/tutorial
2 DEFINITION
Virtual presence is a term with various shades of meaning in different industries, but its essence remains constant: it is a new tool that enables a form of telecommunication in which the individual may substitute their physical presence with an alternate, typically electronic, presence
3 SUMMARY
- Introduction to Virtual Presence
- Data Mining for Virtual Presence
- A New Software Paradigm
- Selected Case Studies
4 INTRODUCTION TO VP
- Definitions
- VP applications
- Psychological aspects
5 DATA MINING FOR VP
- Why Data Mining?
- What can Data Mining do?
- Growing popularity of Data Mining
- Algorithms
6 SOFTWARE AGENTS
- A new software paradigm
- Standardization
- FIPA specifications
- Agent management
- Agent Communication Language
7 GoodNews (CMU)
- Categorization of financial news articles
- Co-located phrases
- Domain Experts
- Implementation and results
Carnegie Mellon University, Pittsburgh, USA
8 iMatch (MIT)
- The idea
- associate MIT students and staff in order to ease their cooperation
- help students find resources they need
- Implementation
- advanced, agent-based system architecture
- Tomorrow?
Massachusetts Institute of Technology, USA
9 Tourist City (ETF)
- A qualitative step forward in the domain of maximization of customer satisfaction
- Technologies
- Data Mining
- Software Agents (mobile)
Faculty of Electrical Engineering, University
of Belgrade, Serbia and Montenegro
10 CONCLUSION
- This tutorial will attempt to familiarize you with
- The concept of VP (Virtual Presence) as a new technological challenge
- The new paradigms and technologies that will bring VP to everyday life
- Data Mining
- Software Agents
11 INTRODUCTION
- Virtual presence will arguably be one of the
most important aspects of personal communication
in the twenty-first century
12 Essence of VP
- The usefulness and reliability of virtual presence
- The ability to conduct everyday tasks by being virtually or electronically present
13 How to Accomplish It?
- The presence is accomplished through the Internet, video, or other communications, perhaps even psychically one day
- Technological advances will make virtual presence more sophisticated, altering the very meaning of the word "presence"
14 VP Applications
- VP in government
- Sunshine laws
- Voting
15 VP Applications
- VP in business
- Online board meetings
- Shareholder voting online
16 VP Applications
- VP in education
- interactive lectures and courses
17 VP Applications
- VP in medicine
- Telemedicine
- Diagnostics
- Remote surgery
- Risks
- Privacy
18 VP Applications
- VP in everyday life
- Telecommuting/Telework
- Software agents as our virtual shadows
19 Psychological Aspects
- Cyberspace and Mind
- Presence in Virtual Space
- Communal Mind and Virtual Community
20 DATA MINING
- Knowledge discovery is a non-trivial process of
identifying valid, novel, potentially useful, and
ultimately understandable patterns in data
21 Many Definitions
- Data mining is also called data or knowledge discovery
- It is a process of inferring knowledge from large oceans of data
- Search for valuable information in large volumes of data
- Analyzing data from different perspectives and summarizing it into useful information
22 Why Data Mining?
- DM allows you to extract knowledge from historical data and predict outcomes of future situations
- Optimize business decisions and improve customer satisfaction with your services
- Analyze data from many different angles, categorize it, and summarize the relationships identified
- Reveal knowledge hidden in data and turn this knowledge into a crucial competitive advantage
23 What Can Data Mining Do?
- Identify your best prospects and then retain them as customers
- Predict cross-sell opportunities and make recommendations
- Learn parameters influencing trends in sales and margins
- Segment markets and personalize communications
- etc.
24 The Power of Data Mining
- Having a database is one thing; making sense of it is quite another
- It does not rely on narrow human queries to produce results, but instead uses AI-related technology and algorithms
- Inductive reasoning
- Using more than one type of algorithm to search for patterns in data
- Data mining usually produces more general (more powerful) results than those obtained by traditional techniques
- Relational DB storage and management technology is adequate for data mining applications smaller than 50 gigabytes
25 Reasons for the Growing Popularity of Data Mining
- Growing Data Volume
- Low Cost of Machine Learning
- Limitations of Human Analysis
26 Tasks Solved by Data Mining
- Predicting
- Classification
- Detection of relations
- Explicit modeling
- Clustering
- Market basket analysis
- Deviation detection
27 Algorithms
- Generally, their complexity is around n log(n), where n is the number of records
- Data mining includes three major components, with corresponding algorithms:
- Clustering (Classification)
- Association Rules
- Sequential Analysis
28 Classification Algorithms
- The aim is to develop a description or model for each class in a database, based on the features present in a set of class-labeled training data
- Data Classification Methods
- Statistical algorithms
- Neural networks
- Genetic algorithms
- Nearest neighbor method
- Rule induction
- Data visualization
29 Classification-rule Learning
- Data abstraction
- Classification-rule learning: finding rules or decision trees that partition given data into predefined classes
- Hunt's method
- Decision tree building algorithms (a split-selection sketch follows below)
- ID3 / C4.5 algorithm
- SLIQ / SPRINT algorithm (IBM)
- Other algorithms
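To make the decision-tree approach concrete, here is a minimal, illustrative Python sketch of the entropy-based split selection used by ID3-style algorithms; the toy dataset, attribute names, and labels are invented for the example and are not from the tutorial.

```python
# Illustrative sketch of ID3-style split selection (information gain).
# Dataset, attributes, and labels are made up for demonstration.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(records, attribute, label_key="class"):
    """Reduction in entropy obtained by splitting records on `attribute`."""
    base = entropy([r[label_key] for r in records])
    remainder = 0.0
    for v in {r[attribute] for r in records}:
        subset = [r[label_key] for r in records if r[attribute] == v]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

if __name__ == "__main__":
    data = [
        {"outlook": "sunny", "windy": "no",  "class": "play"},
        {"outlook": "sunny", "windy": "yes", "class": "stay"},
        {"outlook": "rain",  "windy": "yes", "class": "stay"},
        {"outlook": "rain",  "windy": "no",  "class": "play"},
    ]
    # An ID3-style learner would pick the attribute with the highest gain as the split
    for attr in ("outlook", "windy"):
        print(attr, round(information_gain(data, attr), 3))
```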
30 Parallel Algorithms
- Basic idea: N training data items are randomly distributed to P processors; all the processors cooperate to expand the root node of the decision tree
- There are two approaches for further progress (the remaining nodes):
- Synchronous approach
- Partitioned approach
31 Association Rule Algorithms
- An association rule implies a certain association relationship among a set of objects in a database
- These objects occur together, or one implies the other
- Formally: X ⇒ Y, where X and Y are sets of items (itemsets)
- Key terms
- Confidence
- Support
- The goal: to find all association rules that satisfy user-specified minimum support and minimum confidence constraints (a small computation is sketched below)
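As a quick illustration of support and confidence, here is a minimal Python sketch; the toy transaction database below is invented for the example.

```python
# Illustrative computation of support and confidence for a rule X => Y.
# The toy transaction database is invented for this example.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """support(X ∪ Y) / support(X): how often Y appears when X does."""
    return support(x | y, db) / support(x, db)

x, y = {"bread"}, {"milk"}
print("support   :", support(x | y, transactions))    # 0.5
print("confidence:", confidence(x, y, transactions))  # 0.666...
```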
32 Association Rule Algorithms
- Apriori algorithm and its variations
- AprioriTid
- AprioriHybrid
- FT (Fault-tolerant) Apriori
- Distributed / Parallel algorithms (FDM, ...)
33 Sequential Analysis
- Sequential Patterns
- The problem: finding all sequential patterns with user-specified minimum support
- Elements of a sequential pattern need not be
- consecutive
- simple items
- Algorithms for finding sequential patterns (a containment check is sketched below)
- count-all algorithms
- count-some algorithms
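Below is a minimal, illustrative sketch (data and names invented, not from any specific algorithm in the tutorial) of the containment test behind support counting: each pattern element is an itemset that must be contained, in order, in some later transaction of the customer sequence.

```python
# Does a customer sequence support a sequential pattern?
# A pattern element need not be consecutive in the sequence, and it may be
# an itemset rather than a single item. Data below is invented.
def supports(sequence, pattern):
    """sequence, pattern: lists of frozensets of items."""
    i = 0  # index into the customer sequence
    for element in pattern:
        # find the next transaction that contains this pattern element
        while i < len(sequence) and not element <= sequence[i]:
            i += 1
        if i == len(sequence):
            return False
        i += 1  # subsequent elements must come from later transactions
    return True

customer = [frozenset({"tv"}), frozenset({"dvd", "cables"}), frozenset({"popcorn"})]
pattern  = [frozenset({"tv"}), frozenset({"popcorn"})]
print(supports(customer, pattern))  # True: "tv", then later "popcorn"
```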
34 Conclusion
- Drawbacks of existing algorithms
- Data size
- Data noise
- There are two critical technological drivers
- Size of the database
- Query complexity
- The infrastructure has to be significantly enhanced to support larger applications
- Solutions
- Adding extensive indexing capabilities
- Using new HW architectures to achieve improvements in query time
35 THE NEW SOFTWARE PARADIGM
- All software agents are programs, but not all
programs are agents
36 Many Definitions
- Computational systems that inhabit some dynamic environment, sense and act autonomously, and realize a set of goals or tasks for which they are designed
- Hardware or (more usually) software-based computer systems that enjoy the following properties:
- Reactive (sensing and acting)
- Autonomous
- Goal-oriented (pro-active, purposeful)
- Temporally continuous
- Communicative (socially able)
- Learning (adaptive)
- Mobile
- Flexible
- Character
37 Interesting Topic of Study
- They draw on and integrate many diverse disciplines of computer science and other areas:
- objects and distributed object architectures
- adaptive learning systems
- artificial intelligence and expert systems
- collaborative online social environments
- security
- knowledge based systems, databases
- communications networks
- cognitive science and psychology
38 What Problems do Agents Solve?
- Client/server network bandwidth problem
- In the design of a client/server architecture
- The problems created by intermittent or unreliable network connections
- Attempts to get computers to do real thinking for us
39 The New Software Paradigm
- Unless special care has been taken in the design of the code, two software programs cannot interoperate
- The promise of agent technology is to move the burden of interoperability from software programmers to the programs themselves
- This can happen if two conditions are met:
- A common language (Agent Communication Language, ACL)
- An appropriate architecture
40 The Need for Standards
- Anywhere, anytime consumer access to the universal bouquet of information and services is the new goal of the information revolution
- The scope of Internet standards makes the scope of choices extreme
- The Foundation for Intelligent Physical Agents (FIPA), established in 1996 in Geneva
- an international non-profit association of companies and organizations
- produces specifications of generic agent technologies
41 FIPA Specifications
- Agent Management
- Agent Communication Language
- Agent/Software Integration
- Agent Management Support for Mobility
- Human-Agent Interaction
- Agent Security Management
- Agent Naming
- FIPA Architecture
- Agent Message Transport
- etc.
42 Agent Management
- Provides the normative framework within which FIPA agents exist and operate
- Establishes the logical reference model for the creation, registration, location, communication, migration and retirement of agents
- The entities contained in the reference model are logical capability sets and do not imply any physical configuration
- Additionally, the implementation details of individual APs and agents are the design choices of the individual agent system developers
43 Components of the Model
- Agent
- a computational process, the fundamental actor on an AP
- as a physical software process, it has a life cycle that has to be managed by the AP
- Directory Facilitator (DF) (a minimal registry sketch follows after this list)
- provides yellow pages to other agents
- supported functions are
- register
- deregister
- modify
- search
- Agent Management System (AMS)
- provides white pages services to other agents
- maintains a directory of AIDs, which contain transport addresses
- supported functions are
- register
- deregister
- modify
- search
- get-description
- operations for the underlying AP
- Message Transport Service
- communication method between agents
- Agent Platform (AP)
- physical infrastructure in which agents can be deployed
- Software
- all non-agent, executable collections of instructions accessible through an agent
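To make the directory-facilitator role concrete, here is a minimal, illustrative Python sketch of a yellow-pages registry supporting register/deregister/modify/search; the class and method shapes are our own simplification, not the FIPA API.

```python
# Illustrative yellow-pages registry in the spirit of a FIPA Directory
# Facilitator. Names and structure are our own simplification, not FIPA's API.
class DirectoryFacilitator:
    def __init__(self):
        self._entries = {}  # agent name -> description (e.g., offered services)

    def register(self, name, description):
        if name in self._entries:
            raise ValueError(f"{name} already registered")
        self._entries[name] = dict(description)

    def deregister(self, name):
        self._entries.pop(name, None)

    def modify(self, name, description):
        if name not in self._entries:
            raise KeyError(name)
        self._entries[name].update(description)

    def search(self, **criteria):
        """Return names of agents whose description matches all criteria."""
        return [n for n, d in self._entries.items()
                if all(d.get(k) == v for k, v in criteria.items())]

df = DirectoryFacilitator()
df.register("booker@platform1", {"service": "ticket-reservation"})
df.register("miner@platform1", {"service": "data-mining"})
print(df.search(service="ticket-reservation"))  # ['booker@platform1']
```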
44 Agent Life Cycle
- FIPA agents exist physically on an AP and utilize the facilities offered by the AP for realising their functionalities
- In this context, an agent, as a physical software process, has a physical life cycle that has to be managed by the AP
- The state transitions of agents can be described as (a simplified state-machine sketch follows):
- create, invoke, destroy, quit, suspend, resume, wait, wake up, move, execute
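As a rough illustration only (our own simplification, not the FIPA state chart), the life cycle can be modeled as a small state machine that permits the named transitions; states and the exact transition mapping below are assumptions for the example.

```python
# Simplified agent life-cycle state machine. The states and the transition
# mapping here are an illustrative subset, not the normative FIPA chart.
ALLOWED = {
    ("initiated", "invoke"):  "active",
    ("active",    "suspend"): "suspended",
    ("suspended", "resume"):  "active",
    ("active",    "wait"):    "waiting",
    ("waiting",   "wake up"): "active",
    ("active",    "move"):    "transit",
    ("transit",   "execute"): "active",
    ("active",    "quit"):    "terminated",
}

class AgentLifeCycle:
    def __init__(self):
        self.state = "initiated"  # result of the 'create' transition

    def apply(self, transition):
        try:
            self.state = ALLOWED[(self.state, transition)]
        except KeyError:
            raise ValueError(f"'{transition}' not allowed in state '{self.state}'")
        return self.state

a = AgentLifeCycle()
for t in ("invoke", "suspend", "resume", "quit"):
    print(t, "->", a.apply(t))
```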
45 Agent Communication Language
- The specification consists of a set of message types and the description of their meanings
- Requirements
- Implementing a subset of the pre-defined message types and protocols
- Sending and receiving the not-understood message
- Correct implementation of communicative acts defined in the specification
- Freedom to use communicative acts with other names, not defined in the specification
- Obligation of correctly generating messages in the transport form
- The language must be able to express propositions, objects and actions
- The use of an Agent Management Content Language and ontology
46 ACL Syntax Elements
- Pre-defined message parameters
  sender, receiver, content, reply-with, in-reply-to, envelope, language, ontology, reply-by, protocol, conversation-id
- Pre-defined communicative acts
  accept-proposal, agree, cancel, cfp, confirm, disconfirm, failure, inform, inform-if, inform-ref, not-understood, propose, query-if, query-ref, refuse, reject-proposal, request, request-when, request-whenever, subscribe
47 Communication Examples
- Agent i confirms to agent j that it is, in fact, true that it is snowing today
  (confirm
    :sender i
    :receiver j
    :content "weather( today, snowing )"
    :language Prolog)
- Agent i asks agent j if j is registered with domain server d1
  (query-if
    :sender i
    :receiver j
    :content (registered (server d1) (agent j))
    :reply-with r09)
  ...
  (inform
    :sender j
    :receiver i
    :content (not (registered (server d1) (agent j)))
    :in-reply-to r09)
- Agent j replies that it can reserve trains, planes and automobiles
  (inform
    :sender j
    :receiver i
    :content (= (iota ?x (available-services j ?x))
                ((reserve-ticket train)
                 (reserve-ticket plane)
                 (reserve automobile))))
- Agent i, believing that agent j thinks that a shark is a mammal, attempts to change j's belief
  (disconfirm
    :sender i
    :receiver j
    :content (mammal shark))
- Agent j refuses to reserve a ticket for i, since there are insufficient funds in i's account
  (refuse
    :sender j
    :receiver i
    :content ((action j (reserve-ticket LHR MUC 27-sept-97))
              (insufficient-funds ac12345))
    :language sl)
- Auction bid
  (inform
    :sender agent_X
    :receiver auction_server_Y
    :content (price (bid good02) 150)
    :in-reply-to round-4
    :reply-with bid04
    :language sl
    :ontology auction)
- Agent i did not understand a query-if message because it did not recognize the ontology
  (not-understood
    :sender i
    :receiver j
    :content ((query-if :sender j :receiver i) (unknown (ontology www)))
    :language sl)
- Agent i asks agent j for its available services
  (query-ref
    :sender i
    :receiver j
    :content (iota ?x (available-services j ?x)))
48 Agent/Software Integration
- Integration of services provided by non-agent software into a multi-agent community
- Definition of the relationship between agents and software systems
- Allowing agents to describe, broker and negotiate over software systems
- Allowing new software services to be dynamically introduced into an agent community
- Defining how software resources can be described, shared and dynamically controlled in an agent community
49 New Agent Roles
- To support this specification, two new agent roles have been identified
- Agent Resource Broker (ARB)
- WRAPPER Agent
50 GoodNews
- A system that automatically categorizes news reports that reflect positively or negatively on a company's financial outlook
51 Introduction
- Correlation between news reports on a company's financial outlook and its attractiveness as an investment
- The volume of such reports is huge
- A new text classification algorithm: Domain Experts with a self-confident sampling technique
- Two types of data
- (Human-)labeled
- Unlabeled
- The algorithm classifies financial news into five predefined categories:
- (good), (good, uncertain), (neutral), (bad, uncertain), (bad)
52 Introduction
- Text categorization task
- FCP (Frequently Co-located Phrase): the building element for the categorization algorithm
- Text categorization is a very difficult domain for the use of machine learning
- Very large number of input features
- High level of attribute and class noise
- Large percentage of irrelevant features
- Very expensive labeled data, while unlabeled data are cheaply available
53 Categorization
- The algorithm categorizes each given news article into the predefined categories in terms of the referred company's financial well-being
- GOOD: strong and explicit evidence of the company's financial status
- "shares of ABC company rose 2 percent to 24-15/16"
- GOOD, UNCERTAIN: predictions and forecasts of future profitability
- "ABC company predicts fourth-quarter earnings will be high"
54 Categorization
- NEUTRAL: nothing is mentioned about the financial well-being of the company
- "ABC announced plans to focus on products based on recycled materials"
- BAD, UNCERTAIN: predictions of future losses
- "ABC announced today that fourth-quarter results could fall short of expectations"
- BAD: explicitly bad evidence
- "shares of ABC fell 0.57 to 44.65 in early NY trading"
- Problems with construction of the training (i.e., labeled) data set: inter-indexer inconsistency
55 Co-located Phrase
- The proposed algorithm labels the unlabeled news articles through a voting process among experts, which are FCPs
- Definition: a co-located phrase is a sequence of nearby, but not necessarily consecutive, words (see the extraction sketch below)
- "shares of ABC rose 8.5": (shares, rose), GOOD
- "ABC presented its new product": (present, product), NEUTRAL
- Contextual information
- The use of heuristics to cope with the enormous phrase space (the number of possible phrases)
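A minimal, illustrative sketch (our own, not the GoodNews implementation) of extracting co-located word pairs within a small window; the sentence and window size are only for demonstration.

```python
# Illustrative extraction of co-located word pairs: nearby, but not
# necessarily consecutive, words within a fixed window. Not the actual
# GoodNews code; the sentence and window size are just for demonstration.
from itertools import combinations

def colocated_pairs(text, window=4):
    words = text.lower().split()
    pairs = set()
    for start in range(len(words)):
        for a, b in combinations(words[start:start + window], 2):
            pairs.add((a, b))  # order preserved: a occurs before b
    return pairs

print(colocated_pairs("shares of ABC rose 8.5 percent", window=4))
# contains ('shares', 'rose') even though the words are not consecutive
```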
56 Naive Bayes vs. Domain Experts
- Naive Bayes with EM (Expectation Maximization)
- Problems with small sets of labeled (training) data
- EM (Expectation Maximization): a class of iterative algorithms for maximum likelihood estimation in problems with incomplete data
- The Domain Experts algorithm is able to deal with inconsistent hypotheses
- Iterative building of the training set
57 Implementation and Results
- The experiment focused on two performance criteria
- Using unlabeled data for improving categorization accuracy
- The categorization itself
- The accuracy is around 75% (on a total of 2000 news articles)
- Comparison of a few different methods (figure in the original slides)
58 Conclusions
- Domain Experts with SC sampling outperforms naive Bayes with EM
- The collocation property and vote entropy are well suited to such a domain
- An accuracy of around 75% is the limit with the techniques used
- Better performance could be achieved by using some natural language processing techniques
- Such techniques are still rather rudimentary today
59 iMatch
- The vision: each MIT student having a personal software agent, which helps to manage its owner's academic life
60 Introduction
- The aim: bring together MIT students and staff who may usefully collaborate with each other
- This collaboration can have several goals
- completing final projects
- studying for exams
- tutoring one another
- iMATCH agents are supposed to facilitate student and faculty matching for
- Research
- Teaching
- Internship
- opportunities within and across campuses
61 iMatch Agent Architecture
- iMatch agents are situated within an environment
- Sensors of the agent convert environmental inputs into representations that can be manipulated within the agent
- Effectors translate actions planned by the agent into executable statements for the environment
- The action planner selects the action with the highest utility according to the owner's preference specification
62 Impacts and Benefits
- MIT
- Benefit MIT students by matching them to appropriate resources
- Aid the recruitment of student researchers
- Help students manage their lives
- Use iMATCH in Medical Computing
- GLOBAL
- Facilitate Cross-Community Collaboration
63 Research Topics
- Knowledge representation
- preference specification
- Multi-agent systems
- reputation management system
- static interest matching
- dynamic interest matching
- Infrastructure
- distributed security infrastructure
64 Ceteris Paribus Preference
- Ceteris paribus relations express a preference over sets of possible outcomes
- All possible outcomes are considered to be describable by some (large) set of binary features (true or false)
- The specified features are instantiated to either true or false
- Other features are ignored (see the sketch below)
65 CPP Agent Configuration
- Specify a domain for preference
- Agent methods of communication and notification
- Different security settings of different servers
- Preference statements themselves
- How to get users to easily adjust C.P. rules (graphical interface)
- Pose hypothetical preference questions to the user to help complete the preferences of an ambivalent user
- People will only put down their true profile if they know that the system is secure
66 Static Interest Matching
- Group together similar users for a specific context
- This enables viewing a human user as a resource for dynamic resource discovery
- (locate experts, enthusiasts, ...)
- The approach
- Keyword matching
- Ontological matching using Kullback-Leibler (KL) distance (sketched below)
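For reference, a minimal Python sketch of the KL distance between two users' interest distributions; the topic weights are invented, and a real system would need smoothing (or a symmetric variant) so the second distribution never assigns zero probability where the first does not.

```python
# Illustrative Kullback-Leibler distance between two users' interest
# distributions over topics. Topic weights are invented; real use would need
# smoothing so that q never assigns zero probability where p does not.
from math import log2

def kl_distance(p, q):
    """D(p || q) = sum over topics of p_i * log2(p_i / q_i), for p_i > 0."""
    return sum(p[t] * log2(p[t] / q[t]) for t in p if p[t] > 0)

user_a = {"robotics": 0.5, "databases": 0.3, "networks": 0.2}
user_b = {"robotics": 0.4, "databases": 0.4, "networks": 0.2}
print(round(kl_distance(user_a, user_b), 4))  # small value => similar interests
```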
67 Dynamic Interest Matching
- Location- and/or time-specific resource matching
- As students and their agents move from one physical location to another, iMatch services for matching the closest resources can be offered
- The idea: anything worthwhile is locatable
- The approach
- Intentional naming scheme
- Reputation-based resource discovery
68 Technology
- Components
- Distributed Multi-Agent Infrastructures
- Ceteris Paribus preference-based Interest Matching
- Reputation Management Infrastructure
- Technology
- Microsoft .NET
- Bluetooth
- IEEE 802.11
- Smartcards (PC/SC)
- INS (Intentional Naming System)
69 Conclusion
- Benefit MIT students by matching them to appropriate resources
- Static interest matching
- Group together similar users for a specific context
- This enables viewing a human user as a resource for dynamic resource discovery (locate experts, enthusiasts, ...)
- Dynamic interest matching
- Location- and/or time-specific resource matching: as students and their agents move from one physical location to another, iMatch services for matching the closest resources can be offered
- Help students manage their lives
70 The Near Future
- The focus of the research is on e-tourism after the year 2005, but the applications of the proposed infrastructure are manifold
71 Introduction
- The assumptions
- after the year 2005, each tourist in Europe will be equipped with a cell phone with power the same as or better than a Pentium IV
- whenever a tourism-based service or product is purchased, a mobile agent is assigned to that cell phone PC, to monitor the behaviour of the customer
- all tourist cell phone PCs create an ad-hoc network around the points of tourist attraction, and link to a data mine that collects all information of interest
72 How to Accomplish It?
- The information of interest is not collected by asking the customer to fill out forms, but by monitoring the behaviour of the customer
- The collected information, stored in the data mine, is made available to other tourists as an on-line, owner-independent source of information about the given services and/or products
73 What Can Be Done
- If a tourist would like to know, at that very moment, which restaurant has good food/atmosphere and happy customers, he/she can access the data mine (via the Internet) and obtain information that is linked to that very moment, and is not created by the owner of the business, but by the customers themselves
- Accessing the given restaurant's website has two drawbacks
- the information is not fresh (only periodically updated)
- the information is made by the owner of the restaurant, and therefore not completely objective
74 Conclusion
- Consequently, the proposed approach works much better, and represents a qualitative step forward in the domain of maximization of customer satisfaction
- This may mean that the privacy of the person is jeopardized; however, if the monitored behaviour is non-personalized, and if the customer obtains a discount based on the fact that mobile agents are welcome, privacy ceases to be an issue, and people will sign up voluntarily
75 Appendix
- A Survey of the Data Mining Algorithms
76 Apriori Algorithm
- The task: mining association rules by finding large itemsets and translating them into the corresponding association rules
- A ⇒ B, or A1 ∧ A2 ∧ … ∧ Am ⇒ B1 ∧ B2 ∧ … ∧ Bn, where A ∩ B = ∅
- The terminology
- Confidence
- Support
- k-itemset: a set of k items
- Large itemsets: the large itemset {A, B} corresponds to the following rules (implications): A ⇒ B and B ⇒ A
77 Apriori Algorithm
- The ⊕ operator definition (a code sketch of the join follows this list)
- n = 1: S2 = S1 ⊕ S1; {A, B, C} ⊕ {A, B, C} = {AB, AC, BC}
- n = k: Sk+1 = Sk ⊕ Sk = { X ∪ Y | X, Y ∈ Sk, |X ∩ Y| = k-1 }
- X and Y must have the same number of elements, and must have exactly k-1 identical elements
- Every k-element subset of any resulting set element (an element is actually a (k+1)-element set) has to belong to the original set of itemsets
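A minimal Python sketch (our own illustration) of this join: two k-itemsets are merged into a (k+1)-itemset when they share exactly k-1 items, mirroring the ⊕ operator above.

```python
# Illustrative candidate join (the ⊕ operator): merge two k-itemsets that
# share exactly k-1 items into one (k+1)-itemset.
def join(itemsets):
    """itemsets: set of frozensets, all of the same size k."""
    out = set()
    for x in itemsets:
        for y in itemsets:
            if x != y and len(x & y) == len(x) - 1:
                out.add(x | y)
    return out

s1 = {frozenset({"A"}), frozenset({"B"}), frozenset({"C"})}
print(sorted("".join(sorted(s)) for s in join(s1)))  # ['AB', 'AC', 'BC']
```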
78 Apriori Algorithm
79 Apriori Algorithm
- Step 1: generate a candidate set of 1-itemsets, C1
- Every possible 1-element set from the database is potentially a large itemset, because we don't know the number of its appearances in the database in advance (a priori)
- The task adds up to identifying (counting) all the different elements in the database; every such element forms a 1-element candidate set
- C1 = {A, B, C, D, E}
- Now we are going to scan the entire database, to count the number of appearances of each one of these elements (i.e., one-element sets)
80 Apriori Algorithm
- Now we are going to scan the entire database, to count the number of appearances of each one of these elements (i.e., one-element sets)
81 Apriori Algorithm
- Step 2: generate a set of large 1-itemsets, L1
- Each element in C1 whose support exceeds some adopted minimum support (for example, 50%) becomes a member of L1
- L1 = {A, B, C, E}, and we can omit D in further steps (if D doesn't have enough support alone, there is no way it could satisfy the requested support in combination with some other element(s))
82 Apriori Algorithm
- Step 3: generate a candidate set of large 2-itemsets, C2
- C2 = L1 ⊕ L1 = {AB, AC, AE, BC, BE, CE}
- Count the corresponding appearances
- Step 4: generate a set of large 2-itemsets, L2
- Eliminate the candidates without minimum support
- L2 = {AC, BC, BE, CE}
83 Apriori Algorithm
- Step 5 (C3)
- C3 = L2 ⊕ L2 = {BCE}
- Why not ABC and ACE? Because their 2-element subsets AB and AE are not elements of the large 2-itemset set L2 (the calculation is made according to the ⊕ operator definition)
- Step 6 (L3)
- L3 = {BCE}, since BCE satisfies the required support of 50% (two appearances)
- There can be no further steps in this particular case, because L3 ⊕ L3 = ∅
- Answer: L1 ∪ L2 ∪ L3
84 Apriori Algorithm
- L1 = {large 1-itemsets}
- for (k = 2; Lk-1 ≠ ∅; k++) do begin
-   Ck = apriori-gen(Lk-1)
-   forall transactions t ∈ D do begin
-     Ct = subset(Ck, t)
-     forall candidates c ∈ Ct do
-       c.count++
-   end
-   Lk = {c ∈ Ck | c.count ≥ minsup}
- end
- Answer = ∪k Lk
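For readers who prefer running code, here is a compact, illustrative Python version of the same loop. The transaction database below is assumed (chosen so that it reproduces the L1, L2, L3 of the worked example above, whose table is only in the original slides), and 50% minimum support is used; this is a sketch, not an optimized implementation.

```python
# Illustrative Apriori: find all large (frequent) itemsets.
# The transactions are assumed to match the worked example; 50% support.
from itertools import combinations

def apriori(transactions, minsup=0.5):
    n = len(transactions)
    support = lambda s: sum(s <= t for t in transactions) / n
    # L1: large 1-itemsets
    items = {i for t in transactions for i in t}
    lk = {frozenset({i}) for i in items if support(frozenset({i})) >= minsup}
    all_large, k = set(lk), 1
    while lk:
        # candidate generation (join step) followed by subset-based pruning
        ck = {x | y for x in lk for y in lk if len(x | y) == k + 1}
        ck = {c for c in ck
              if all(frozenset(s) in lk for s in combinations(c, k))}
        # counting pass over the database
        lk = {c for c in ck if support(c) >= minsup}
        all_large |= lk
        k += 1
    return all_large

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for s in sorted(apriori(db), key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(s)))   # A, B, C, E, AC, BC, BE, CE, BCE
```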
85 Apriori Algorithm
- Enhancements to the basic algorithm
- Scan reduction
- The most time-consuming operation in the Apriori algorithm is the database scan; it is originally performed after each candidate set generation, to determine the frequency of each candidate in the database
- Scan number reduction: counting candidates of multiple sizes in one pass
- Rather than counting only candidates of size k in the k-th pass, we can also count the candidates C'k+1, where C'k+1 is generated from Ck (instead of Lk), using the ⊕ operator
86 Apriori Algorithm
- Compare: C'k+1 = Ck ⊕ Ck vs. Ck+1 = Lk ⊕ Lk
- Note that Ck+1 ⊆ C'k+1
- This variation can pay off in later passes, when the cost of counting and keeping in memory the additional C'k+1 − Ck+1 candidates becomes less than the cost of scanning the database
- There has to be enough space in main memory for both Ck and C'k+1
- Following this idea, we can make a further scan reduction
- C'k+1 is calculated from C'k for k > 1
- There must be enough memory space for all C'k's (k > 1)
- Consequently, only two database scans need to be performed (the first to determine L1, and the second to determine all the other Lk's)
87 Apriori Algorithm
- Abstraction levels
- Higher-level associations are stronger (more powerful), but also less certain
- A good practice would be adopting different thresholds for different abstraction levels (higher thresholds for higher levels of abstraction)
88 DHP Algorithm
- DHP (Direct Hashing and Pruning): another algorithm for mining association rules
- Based on the Apriori algorithm (Ck/Lk generation in the k-th step)
- Empirical analysis of the Apriori algorithm shows that candidate sets (Ck) are much larger than the corresponding sets of large itemsets (Lk), especially in the first few iterations
- DHP introduces a more efficient candidate set generation method
- The idea is to insert into Ck only those candidate sets that are likely to become large itemsets
89 DHP Algorithm
- Additional improvement is accomplished through two-dimensional search base reduction: length (number of records in the search base) and width (number of relevant attributes in a record)
- Large itemset characteristics
- Every non-empty subset of a large itemset is a large itemset as well; for example, BCD ∈ L3 ⇒ BC, CD, BD ∈ L2
- It implies that a record is relevant for discovering large (k+1)-itemsets only if it contains at least k+1 large k-itemsets
90 DHP Algorithm
- During the Ck → Lk phase, we can count the large k-itemsets in each record; if their number in a particular record is less than k+1, we omit that record during the Ck+1 generation
- Similarly, if a record contains one or more large (k+1)-itemsets, each element (item) of these itemsets appears in at least k candidates from Ck
- Hashing
- Hashing boosts the performance of the DHP algorithm
- The algorithm does not specify any hash function in particular; it depends on the application
- Likewise, it does not specify the size of the hash table (number of groups/addresses)
91 DHP Algorithm
92 DHP Algorithm
- Step 1: generate a candidate set of 1-itemsets, C1
- C1 = {A, B, C, D, E}
- Simultaneously with counting each element's support, a hash tree is generated that contains all the elements from the database, in order to improve the counting performance
- For each new element, DHP checks whether the element is already in the tree or not
- If yes, DHP increments the current number of appearances for that element; otherwise, the element is added to the hash tree, and the number of its appearances is set to 1
93 DHP Algorithm
- Having counted each C1 element's appearances, all possible 2-element subsets are generated and inserted into the H2 hash table
- The address of a particular subset can be calculated with respect to the position of its elements in the C1 candidate set, using a chosen hash function h(x, y)
94 DHP Algorithm
- For example, let's adopt the following hash function: h(x, y) = (posC1(x) · 10 + posC1(y)) mod 7
- The corresponding H2 hash table is shown in the original slides (a code sketch of its construction follows)
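A minimal, illustrative Python sketch (our own, using the same transaction database assumed for the Apriori example) of building the H2 bucket weights with this hash function and then pruning C2 to pairs whose bucket meets the minimum support count.

```python
# Illustrative DHP step: hash every 2-element subset of every transaction into
# H2 buckets, then keep only the 2-itemsets whose bucket weight reaches the
# minimum support count. The database and item order are assumptions.
from itertools import combinations

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
pos = {item: i + 1 for i, item in enumerate("ABCDE")}  # position in C1

def h(x, y):
    """h(x, y) = (pos(x)*10 + pos(y)) mod 7, with x before y in C1 order."""
    return (pos[x] * 10 + pos[y]) % 7

buckets = [0] * 7
for t in db:
    for x, y in combinations(sorted(t), 2):
        buckets[h(x, y)] += 1

min_count = 2                     # 50% of 4 transactions
large_1 = {"A", "B", "C", "E"}    # L1 from the Apriori example
c2 = [(x, y) for x, y in combinations(sorted(large_1), 2)
      if buckets[h(x, y)] >= min_count]
print(buckets)
print(c2)  # candidate 2-itemsets surviving the hash-based pruning: AC, BC, BE, CE
```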
95 DHP Algorithm
- Whenever a new element is added to the hash table, the weight of the corresponding address is increased by one
- C2 is generated out of L1 (just as in the Apriori case)
- Besides that, only those elements that map to addresses whose weight is greater than or equal to the specified minimum support (let the minimum support be 50%) are taken into consideration during the C2 generation
- C2 = {AC, BC, BE, CE}
- It contains two elements fewer (!) than the C2 set generated by the Apriori algorithm for the same example database
96 DHP Algorithm
- In general, the Hk hash table is used for the Ck candidate set generation in the k-th step of the algorithm; Hk is created in the previous, (k-1)-th, step
- Each address of the Hk hash table contains a number of k-element subsets as elements; its weight denotes the number of elements
- The fact that an address doesn't satisfy the minimum support requirement means that no element (set) mapped to that address can satisfy the requirement alone ⇒ all the elements (sets) at such Hk addresses are omitted from the Ck generation
- During the k-th step, Ck is generated starting from Lk-1, with the restrictions described above
97 DHP Algorithm
- Conclusions
- DHP outperforms Apriori for the same input data
- The time spent on hash table generation (especially H2) is more than compensated for by the greatly reduced candidate sets (C2, ...)
- The same improvements applied to Apriori may be applied here as well (scan reduction, abstraction levels, ...)
98 References
- http://www.marconi.com
- http://www.blueyed.com
- http://www.fipa.org
- http://www.rpi.edu
- http://research.microsoft.com
- http://imatch.lcs.mit.edu
99 THE END
- Quatenus nobis denegatum diu vivere, relinquamus aliquid, quo nos vixisse testemur
  ("Since we are denied a long life, let us leave something behind to bear witness that we have lived")
Authors
Voislav Galic, vgalic_at_bitsyu.net
Dušan Zečević, zdusan_at_softhome.net
Đorđe Đurđević, madcat_at_tesla.rcub.bg.ac.yu
Veljko Milutinovic, vm_at_etf.bg.ac.yu
http://galeb.etf.bg.ac.yu/vm/tutorial