Title: Applications of Slow Intelligence Systems
1 Applications of Slow Intelligence Systems
2 Outline
- Application: Social Influence Analysis
- Application: Product/Service Optimization
- Application: Topic/Trend Detection
- Application: High-Dimensional Feature Selection
- Discussion
3 Outline
- Application: Social Influence Analysis
- Application: Product/Service Optimization
- Application: Topic/Trend Detection
- Application: High-Dimensional Feature Selection
- Discussion
4 Application to Social Influence Analysis
- In large social networks, nodes (users, entities) are influenced by others for many different reasons. How to model diffusion processes over a social network, and how to predict which nodes will influence which other nodes in the network, have recently been active research topics, and many researchers have proposed various algorithms. Our objective in applying the SIS technology is to utilize these algorithms and evolutionarily select the best one, with the most appropriate parameters, for social influence analysis.
5 The Social Influence Analysis SIS System
- The input data stream is first processed by the Pre-Processor. The Enumerator then invokes the super-component that creates the various social influence analysis algorithms, such as Linear Threshold (LIM), Susceptible-Infective-Susceptible (SIS), Susceptible-Infective-Recovered (SIR), and Independent Cascading. The Tester collects and presents the test results.
6 LIM results of concept 1 and concept 3 with two combinations of parameters in the Plurk dataset
7 LIM results of concept 1 and concept 3 with two combinations of parameters in the Facebook dataset
8 The SIA/SIS System
- The Timing Controller restarts the social influence analysis cycle with a different SIA super-component, such as the Heat Diffusion algorithms, or with a different Pre-Processor. The Eliminator eliminates the inferior SIA algorithms, and the Concentrator selects the optimal SIA algorithm. A sketch of this cycle follows.
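Below is a minimal Python sketch of how this enumerate-test-eliminate-concentrate cycle could be organized. The `sis_cycle` function, the candidate model dictionary, and the `score` callable are hypothetical placeholders for illustration, not the actual implementation of the SIA/SIS system.

```python
# Hypothetical sketch of the SIA decision cycle: enumerate candidate
# diffusion models, test them, eliminate the weak ones, and concentrate
# on the best survivor over a number of timed cycles.
from typing import Callable, Dict

def sis_cycle(models: Dict[str, Callable], data, score: Callable, rounds: int = 3) -> str:
    candidates = dict(models)                      # Enumerator: all candidate SIA algorithms
    for _ in range(rounds):                        # Timing Controller: restart the cycle
        results = {name: score(model, data)        # Tester: evaluate each candidate
                   for name, model in candidates.items()}
        cutoff = sorted(results.values())[len(results) // 2]
        candidates = {name: model                  # Eliminator: drop the inferior half
                      for name, model in candidates.items()
                      if results[name] >= cutoff}
        if len(candidates) == 1:
            break
    return max(candidates, key=lambda n: score(candidates[n], data))  # Concentrator
```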
9 Outline
- Application: Social Influence Analysis
- Application: Product/Service Optimization
- Application: Topic/Trend Detection
- Application: High-Dimensional Feature Selection
- Discussion
10 SIS Application to Product Configuration
Production of personalized or custom-tailored
goods or services to meet consumers' diverse and
changing needs
11 Ontological Filter and Slow Intelligence System
12 A Scenario
- A customer would like to buy a Personal Computer in order to play video games and surf the internet.
- He knows that he needs an operating system, a web browser, and an antivirus package.
- In particular, the user prefers a Microsoft Windows operating system. He lives in the United States and prefers to have a desktop. He also prefers low-cost components. A toy filtering sketch follows.
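To make the scenario concrete, here is a small Python sketch of preference-based filtering over a component catalog. The catalog entries, preference fields, and scoring rule are all invented for illustration; they are not the ontological filter used by the actual configurator.

```python
# Toy preference filter for the PC-configuration scenario above.
# Catalog, preferences, and scoring rule are illustrative only.
catalog = [
    {"type": "os", "name": "Windows 11", "family": "Microsoft Windows", "price": 139},
    {"type": "os", "name": "Ubuntu 24.04", "family": "Linux", "price": 0},
    {"type": "form", "name": "Desktop tower", "family": None, "price": 450},
    {"type": "form", "name": "Laptop", "family": None, "price": 900},
]
preferences = {"family": "Microsoft Windows", "form": "Desktop tower"}

def score(item):
    """Lower is better: prefer cheap components that match stated preferences."""
    bonus = -500 if item.get("family") == preferences["family"] else 0
    bonus += -500 if item["name"] == preferences["form"] else 0
    return item["price"] + bonus

best_per_type = {}
for item in catalog:
    kind = item["type"]
    if kind not in best_per_type or score(item) < score(best_per_type[kind]):
        best_per_type[kind] = item
print(best_per_type)   # picks the Windows OS and the desktop form factor
```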
13 Ontological Transform for Product Configurator
14 Outline
- Application: Social Influence Analysis
- Application: Product/Service Optimization
- Application: Topic/Trend Detection
- Application: High-Dimensional Feature Selection
- Discussion
15 Topic Detection and Tracking (TDT) System Overview
- Detect current hot topics and predict future hot topics based on data collected from the internet
- The TDT System is composed of:
  - Crawler Extractor
    - Collects the latest data from the Internet for users' needs
    - Restricts the range of data collection from web data (focused crawler)
  - Topic Extractor
    - Discovers current hot topics from a set of text documents
  - Topic Detector
    - Predicts hot topics
16 Topic/Trend Detection System
- [System diagram: the users' keywords of interest drive a Web Crawler that collects HTML documents from social media into a web data DB; an Information Extractor extracts articles and metadata (title, author, content, etc.) from the semi-structured web content and passes text documents to the Topic Extractor. The Web Crawler and Information Extractor together form the Crawler Extractor.]
17 Focused Crawler: Classification
- Taxonomy Creation: start from an existing taxonomy such as Yahoo! or the Open Directory Project.
- Example Collection
- Taxonomy Selection and Refinement:
  - The system proposes the most common classes.
  - The user marks classes as GOOD.
  - The user changes the trees.
- Interactive Exploration:
  - The system proposes URLs found in a small neighborhood of the examples.
  - The user examines and includes some of these examples.
- Training:
  - Integrate the refinements into the statistical class model (classifier-specific action).
18 Focused Crawler: Distillation
- Distillation:
  - Identify relevant hubs by running a topic distillation algorithm.
  - Raise visit priorities of hubs and immediate neighbors (sketched below).
- Feedback:
  - Report the most popular sites and resources.
  - Mark results as useful/useless.
  - Send feedback to the classifier and distiller.
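The priority-raising step can be pictured with a small sketch: the crawl frontier is a priority queue, and hub pages plus their immediate neighbors get their visit priority boosted. The data structures and the boost value here are assumptions for illustration, not the distiller's actual mechanics.

```python
# Sketch of raising visit priorities of hubs and their immediate neighbors.
# frontier: list of (priority, url); lower priority means visited sooner.
import heapq

def reprioritize(frontier, hubs, neighbors, boost=10.0):
    favored = set(hubs)
    for hub in hubs:
        favored.update(neighbors.get(hub, []))   # include immediate neighbors
    boosted = [(prio - boost if url in favored else prio, url)
               for prio, url in frontier]
    heapq.heapify(boosted)                       # restore the queue invariant
    return boosted
```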
19 Extractor
- Given a Web page:
  - Build the HTML tag tree (sketched below)
  - Mine data regions
    - Mining data records directly is hard
  - Identify data records from each data region
  - Learn the structure of a general data record
    - A data record can contain optional fields
  - Extract the data
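A minimal sketch of the first step, building the HTML tag tree, using only Python's standard-library parser; data-region mining and record extraction are left out.

```python
# Build a simple HTML tag tree with the standard-library parser.
from html.parser import HTMLParser

class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)   # attach to current parent
        self.stack.append(node)                   # descend into the new node

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()                      # climb back up the tree

builder = TagTreeBuilder()
builder.feed("<html><body><table><tr><td>record</td></tr></table></body></html>")
print(builder.root)
```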
20 TDT Petri Net Simulation
- Topic Detection and Tracking
22 Crawler
23 Initial State
24 Accept user input
25 Validate user input
26 Refine user input
27 Train the system
28 Detect most popular topic
29 Extractor
30 Extractor activated
31 Generate HTML tag trees
32 Detect important data
33 Train the system with record
34 Extract data
35 Save data into knowledge base
36 Topic Detection and Tracking
37 Slow Intelligence Steps (in blue color; sketched below)
- Accept user request
- Send request data to TDT
- Enumerator generates combinations
- Eliminator selects the best method to fit our need
- Evaluate combinations
- Use concentrator to highlight the selected results
- Send the result to TDT
- Generate the instructions to the server
- Dispatcher gets the instruction
- Decide where we are going to send the instructions
- Send the instructions to the server
- End of simulation run
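As a toy stand-in for the simulation run, the steps listed above can be fired in order as transitions that pass a single token along a linear chain of places; the actual Petri net model may of course have richer structure (branching, guards, concurrent transitions).

```python
# Toy token-passing run of the listed Slow Intelligence steps. Each
# transition consumes the token from the current place and deposits it
# in the next one, mimicking a linear Petri net firing sequence.
TRANSITIONS = [
    "Accept user request",
    "Send request data to TDT",
    "Enumerator generates combinations",
    "Eliminator selects the best method to fit our need",
    "Evaluate combinations",
    "Use concentrator to highlight the selected results",
    "Send the result to TDT",
    "Generate the instructions to the server",
    "Dispatcher gets the instruction",
    "Decide where we are going to send the instructions",
    "Send the instructions to the server",
    "End of simulation run",
]

place = "start"                      # single token in the start place
for transition in TRANSITIONS:
    print(f"token at '{place}' -> fire '{transition}'")
    place = f"after: {transition}"   # token moves to the next place
```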
38 Outline
- Application: Social Influence Analysis
- Application: Product/Service Optimization
- Application: Topic/Trend Detection
- Application: High-Dimensional Feature Selection
- Discussion
39 Introduction
- High-dimensional feature selection is a hot topic in statistics and machine learning.
- Model the relationship between one response and p associated features, based on a sample of size n.
40 Math Formulation
- Let $y = (y_1, \ldots, y_n)^{\top}$ be a vector of responses and $x_1, \ldots, x_n \in \mathbb{R}^p$ be their associated covariate vectors.
- When $y_i \in \{-1, +1\}$ for the classification problem, we assume a logistic model
  $$\Pr(y_i = 1 \mid x_i) = \frac{1}{1 + \exp\{-(\beta_0 + x_i^{\top}\beta)\}}.$$
- We estimate the regression coefficient $\beta$ and the bias $\beta_0$ by minimizing the loss function
  $$\ell(\beta_0, \beta) = \frac{1}{n} \sum_{i=1}^{n} \log\bigl(1 + \exp\{-y_i(\beta_0 + x_i^{\top}\beta)\}\bigr).$$
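A small NumPy sketch of the loss above, assuming the labels are coded as -1/+1; `X` is the n x p design matrix.

```python
# Logistic loss for labels y_i in {-1, +1}; X has shape (n, p).
import numpy as np

def logistic_loss(beta0: float, beta: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    margins = y * (beta0 + X @ beta)
    # log(1 + exp(-m)) per sample, computed stably with logaddexp
    return float(np.mean(np.logaddexp(0.0, -margins)))
```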
41 Application
- Supervised learning: the gene selection problem in bioinformatics
  - One wants to eliminate the irrelevant genes (features) to obtain a robust classifier.
  - One wants to know which genes are the most critical factors for the disease.
- [Figure: a data matrix of n samples (patients or healthy subjects), each with p gene expression levels, from which the important genes are selected.]
42 Challenges
- Dimensionality grows rapidly with interactions of the features:
  - In portfolio selection and network modeling, 2000 stocks involve over 2 million unknown parameters in the covariance matrix (2000 x 2001 / 2 = 2,001,000).
  - In protein-protein interaction studies, the sample size may be on the order of thousands, but the number of features can be on the order of millions.
- The goal is to construct effective methods to learn relationships between features and responses in high dimensions for scientific purposes.
43 Feature Selection Approach
- Main SIS procedure:
  - main_Enumerator
  - main_Eliminator
  - main_Adaptator
  - main_Propagator
  - main_Concentrator
  - time controller
- Sub procedure:
  - sub_enumerator
  - sub_concentrator
  - knowledge base
44 Main Enumerator
- Enumerate the p features.
- Among these features, some are relevant to the responses while others are not.
45 Main Eliminator
- Apply the Pearson correlation between each feature and the response, rank the correlation values from high to low, and eliminate the lowest-ranked features (see the sketch below).
- The number of features to retain is a pre-defined constant; the retained features form the selected top feature set.
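A sketch of this step in NumPy. The number of features to keep, called `d` here, stands in for the pre-defined constant mentioned above, and the ranking uses the absolute correlation, a common variant.

```python
# Main Eliminator sketch: keep the d features most correlated with y.
import numpy as np

def pearson_filter(X: np.ndarray, y: np.ndarray, d: int) -> np.ndarray:
    """Return the column indices of the d features with largest |Pearson corr| to y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(corr))[:d]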
46 Sub Enumerator
- Enumerate all feature selection algorithms in the knowledge base by applying them to the retained feature set, and select the top features chosen by each algorithm as that algorithm's candidate set (see the sketch below).
- The Knowledge Base stores the existing candidate algorithms.
  - We add L1-regularized regression, elastic-net regularized regression, and forward stepwise regression.
  - In principle, any feature selection algorithm can be put into the knowledge base.
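A sketch of the Sub Enumerator using scikit-learn, assuming it is available: each knowledge-base entry is fit on the pre-filtered features and its top-k features by absolute coefficient are kept. Only the L1 and elastic-net candidates are shown; forward stepwise regression could be added in the same spirit (for example via scikit-learn's SequentialFeatureSelector), and the hyperparameters shown are placeholders.

```python
# Sub Enumerator sketch: run each candidate algorithm on the retained
# feature set and record its top-k features. Hyperparameters are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

KNOWLEDGE_BASE = {
    "l1": LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
    "elastic_net": LogisticRegression(penalty="elasticnet", l1_ratio=0.5,
                                      C=0.1, solver="saga", max_iter=5000),
}

def sub_enumerate(X, y, k):
    """Return {algorithm name: indices of its top-k features by |coefficient|}."""
    selections = {}
    for name, model in KNOWLEDGE_BASE.items():
        model.fit(X, y)
        coef = np.abs(model.coef_).ravel()
        selections[name] = np.argsort(-coef)[:k]
    return selections
```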
47 Sub Concentrator
- For each algorithm's selected feature set, we compute the loss function and choose the best algorithm with the minimum loss.
- The sub system then keeps the features selected by that best algorithm; we denote this as the current feature set.
48 Main Adaptor
- For every other feature among the total p features, we add it to the current feature set and compute the loss function.
49 Main Concentrator
- Rank all candidate features by the resulting loss from low to high, and select the top features with the smallest loss.
50 Main Propagator
- Add these top features to the current feature set to form the new feature set.
51 Timing Controller
- The timing controller controls the termination of the whole process. It sets a threshold K on the number of cycles.
- If the cycle count reaches K, the process stops after the sub concentration step and outputs the selected features.
- If the cycle count is below K, the process continues to the main adaptation step.
- The larger K is, the more accurate the feature selection result, but it needs more time to compute. Thus slow decision cycles can yield better performance in the long run.
52 General algorithm
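Since the slide's algorithm figure is not reproduced here, the following Python sketch pieces together the main and sub procedures as described on the preceding slides (eliminator, sub enumerator, sub concentrator, adaptor, concentrator, propagator, timing controller). It reuses the earlier sketches, the parameters d, k, m, and K are illustrative, and the authors' actual control flow may differ in its details.

```python
# Schematic end-to-end loop for the SIS feature selection approach.
# pearson_filter and sub_enumerate are the sketches given earlier;
# loss(X_subset, y) is any model-fit criterion (e.g. the logistic loss).
import numpy as np

def sis_feature_selection(X, y, d, k, m, K, loss):
    survivors = pearson_filter(X, y, d)                    # main eliminator
    selected = np.array([], dtype=int)
    for cycle in range(K):                                 # timing controller
        candidates = sub_enumerate(X[:, survivors], y, k)  # sub enumerator
        best = min(candidates,                             # sub concentrator
                   key=lambda a: loss(X[:, survivors[candidates[a]]], y))
        selected = np.union1d(selected, survivors[candidates[best]])
        if cycle == K - 1:                                 # stop after sub concentration
            break
        rest = np.setdiff1d(np.arange(X.shape[1]), selected)
        scores = [loss(X[:, np.append(selected, j)], y)    # main adaptor
                  for j in rest]
        top = rest[np.argsort(scores)[:m]]                 # main concentrator
        survivors = np.union1d(selected, top)              # main propagator
        selected = survivors
    return selected
```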
53 Experimental Results: Dataset description
- Leukemia dataset
  - Leukemia is a type of cancer of the blood. This dataset consists of 72 samples, including 47 acute lymphoblastic leukemia patients and 25 acute myeloid leukemia patients, with expression levels of 7129 human genes. The data is split into 38 training samples and 34 test samples.
- Colon cancer dataset
  - This dataset consists of 62 samples, including 40 tumor colon tissues and 22 normal colon tissues, with expression levels of 2000 human genes. The data is split into 32 training samples and 30 test samples.
54 Experimental protocol
- We compare our system with the three individual feature selection algorithms in the knowledge base.
- We report the number of errors and the balanced error rate (BER), as sketched below.
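The slide's original formula is not reproduced here; assuming the standard definition, the balanced error rate is the average of the two per-class error rates. A small sketch:

```python
# Number of errors and balanced error rate (BER) for binary labels 0/1.
# BER = (FP/(FP+TN) + FN/(FN+TP)) / 2, the mean of the two class error rates.
import numpy as np

def error_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n_errors = int(np.sum(y_true != y_pred))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    ber = 0.5 * (fp / (fp + tn) + fn / (fn + tp))
    return n_errors, float(ber)
```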
55 Experimental results
- Our method outperforms the individual algorithms.
- When we increase K, the number of cycles defined by the timing controller, the accuracy of our system improves. It is a tradeoff between running time and performance.
56 Experimental results
- Based on the biological background, these genes are critical for the leukemia disease:
  - Zyxin is known to interact with leukemogenic bHLH proteins. It is selected by both SIS (K=5) and SIS (K=10).
  - Cystatin C (CST3) and Cystatin A are two very important genes selected by SIS (K=10) but not by SIS (K=5), which indicates that a larger K leads to a more accurate result.
57 Outline
- Application: Social Influence Analysis
- Application: Product/Service Optimization
- Application: Topic/Trend Detection
- Application: High-Dimensional Feature Selection
- Discussion
58 Discussion
- Implemented Social Influence Analysis algorithms to find the best model based upon Slow Intelligence principles.
- Applied the Slow Intelligence principle to ontological filtering for Product and Service Selection.
- Modeled and simulated the Trend and Topic Detection system using Petri nets within the framework of the Slow Intelligence System.
- Studied a new feature selection application within the framework of the Slow Intelligence System; it leads to superior performance and can handle high-dimensional data.
59 Further Work
- Design a mechanism to dynamically update the knowledge base by applying the SIS approach to itself.
- Design a user-friendly interface to develop and manage an application system.
60 Q&A