Exploring Blog Networks - PowerPoint PPT Presentation

About This Presentation

Title:

Exploring Blog Networks

Description:

Blogs cite one another, creating a record of how information and ideas spread ... Understanding how the blog network works is important for: ... – PowerPoint PPT presentation

Slides: 112

Provided by: csC76

Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

Title: Exploring Blog Networks

1
Exploring Blog Networks

Patterns and a Model for Information Propagation

(As seen at SIAM Data Mining 2007)
Mary McGlohon
In collaboration with Jure Leskovec, Christos Faloutsos, Natalie Glance, Matthew Hurst
Sandia National Labs, July 6, 2007
2
Long-term Goals

How does information on the Web propagate?
With what pattern do ideas catch on, diffuse, and
decrease in popularity?
Can we build a model for this propagation?

3
Why blogs?

Blogs are a widely used medium of information for
many topics and have become an important mode of
communication.
Blogs cite one another, creating a record of how
information and ideas spread through a social
network.
This record is publicly available.

4
Why do we care?

Understanding how the blog network works is
important for:
Social issues: political mapping, social trends
and change, reactions to mass media.
Economic issues: marketing, predicting commercial
success, discovering links between companies.

Example: blogs in the 2004 election [Adamic,
Glance 2005].
5
Immediate Goals

Temporal questions: Does popularity have a
half-life? Is there periodicity?
Topological questions: What topological patterns
do posts and blogs follow? What shapes do
cascades take on? Stars? Chains? Something
else?
Generative model: Can we build a generative model
that mimics properties of cascades?

6
Outline

Motivation
Preliminaries
Concepts and terminology
Data
Temporal Observations
Topological Observations
Cascade Generation Model
Discussion and Conclusions

7
What is a blog?

A blog is a frequently updated webpage.
A blog's author updates the blog using posts.
Each post has a permanent hyperlink, and may
contain links to other blog posts.

[Figure: example blogs slashdot and boingboing.]
8
What is a blog?

A blog is a frequently updated webpage.
A blog's author updates the blog using posts.
Each post has a permanent hyperlink, and may
contain links to other blog posts.

Example post: "The iPhone is here, hooray!"
[Figure: example blogs slashdot and boingboing.]
9
What is a blog?

A blog is a frequently updated webpage.
A blog's author updates the blog using posts.
Each post has a permanent hyperlink, and may
contain links to other blog posts.

Example posts:
"The iPhone is here, hooray!"
"At this link, Slashdot says the iPhone has
arrived. But I'm not buying one, because ..."
[Figure: example blogs slashdot and boingboing.]
10
What is a blog?

A blog is a frequently updated webpage.
A blog's author updates the blog using posts.
Each post has a permanent hyperlink, and may
contain links to other blog posts.

Example posts:
"The iPhone is here, hooray!"
"At this link, Slashdot says the iPhone has
arrived. But I'm not buying one, because ..."
"Here Boingboing says they're not buying an
iPhone. They're just jealous."
[Figure: example blogs slashdot and boingboing.]
11
From blogs to networks
[Figure: the blogosphere network (slashdot, boingboing, MichelleMalkin, Dlisted) with the blog network, post network, and cascades derived from it.]
Non-trivial vs. trivial cascades.
Stars vs. chains.
Nodes a, b, c, d are cascade initiators; e is
a connector.
12
From networks to cascades
[Figure: the blogosphere network (slashdot, boingboing, MichelleMalkin, Dlisted) and its cascades.]
Non-trivial vs. trivial cascades.
13
From networks to cascades
[Figure: the blogosphere network (slashdot, boingboing, MichelleMalkin, Dlisted) and its cascades.]
Non-trivial vs. trivial cascades.
Cascade initiators are first sources of information.
We also have stars and chains.
14
Dataset (Nielsen Buzzmetrics)

Gathered during August-September 2005.
Used a set of 44,362 blogs, traced cascades.
2.4 million posts, 5 million out-links, 245,404
blog-to-blog links.

[Figure: number of posts over time, at 1-day resolution.]
15
Outline

Motivation
Preliminaries
Concepts and terminology
Data
Temporal Observations
Does blog traffic behave periodically?
How does popularity change over time?
Topological Observations
Cascade Generation Model
Discussion and Conclusions
Future Work

16
Temporal Observations

Does blog traffic behave periodically?
Posts show a weekend effect: less traffic on
Saturday and Sunday.

17
Temporal Observations

Does blog traffic behave periodically?
Monday appears to compensate for the weekend dip,
but this is not actually the case.
We normalize the data: $\mathrm{count}_{\mathrm{norm}} = \mathrm{count} / p_d$,
where $p_d$ is the percentage of links on that day.
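The normalization above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the toy data are hypothetical, and p_d is taken to be the fraction of all links in the data set that fall on a given weekday.

```python
from collections import Counter

def weekday_shares(link_weekdays):
    """Return p_d: the fraction of all links that fall on each weekday."""
    totals = Counter(link_weekdays)
    n = sum(totals.values())
    return {day: c / n for day, c in totals.items()}

def normalize_counts(counts_by_day, shares):
    """Apply count_norm = count / p_d to each (weekday, count) pair."""
    return [(day, count / shares[day]) for day, count in counts_by_day]

# Hypothetical toy data: the weekday of every link in the data set.
all_links = ["Mon"] * 4 + ["Tue"] * 4 + ["Sat"] * 2
shares = weekday_shares(all_links)

# Hypothetical in-link counts one post received on successive days.
observed = [("Mon", 8), ("Tue", 4), ("Sat", 1)]
normalized = normalize_counts(observed, shares)
```

Dividing by p_d inflates counts on quiet days (here Saturday), so a dip that is only due to the weekend effect no longer looks like a loss of popularity.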

[Figure: number of in-links (log scale) vs. days after post, for posts made on a Monday. Panels: Monday post drop-off; same data, normalized.]
18
Temporal Observations

How does post popularity change over time?
Post popularity drop-off follows a power law
identical to that found in communication response
times in [Vazquez2006].

Observation 1: The probability that a post
written at time $t_p$ acquires a link at time
$t_p + \Delta$ is $p(t_p + \Delta) \propto \Delta^{-1.5}$.
[Figure: number of in-links vs. days after post.]
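One simple way to estimate such an exponent is a least-squares line through the log-log plot. The sketch below is an assumption about method, not the procedure used in the talk: the function name is hypothetical, the data are synthetic, and log-log regression is known to be a rough estimator for heavy-tailed data.

```python
import math

def fit_power_law_exponent(deltas, counts):
    """Least-squares slope of log(count) against log(delta)."""
    xs = [math.log(d) for d in deltas]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic in-link counts that decay exactly as delta^-1.5 (days 1..30).
deltas = list(range(1, 31))
counts = [1000.0 * d ** -1.5 for d in deltas]
exponent = fit_power_law_exponent(deltas, counts)
```

On real in-link counts the points scatter around the line, so the fitted slope is only an estimate of the exponent.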
19
Outline

Motivation
Preliminaries
Temporal Observations
Does blog traffic behave periodically?
How does post popularity change over time?
Topological Observations
What are graph properties for blog networks?
What shapes do cascades take on? Stars, chains,
or something else?
Cascade Generation Model
Discussion and Conclusions
Future Work

20
Topological Observations

What graph properties does the blog network
exhibit?

21
Topological Observations

What graph properties does the blog network
exhibit? How connected?
44,356 nodes, 122,153 edges
Half of blogs belong to largest connected
component.

22
Topological Observations

What power laws does the blog network exhibit?

[Figure: count vs. number of blog in-links; count vs. number of blog out-links (all axes log scale).]
Both in- and out-degree follow a power-law
distribution: in-degree PL exponent -1.7,
out-degree PL exponent near -3. This suggests
a strong rich-get-richer phenomenon.
23
Topological Observations

How are blog in- and out-degree related?

In-links and out-links are only weakly correlated
(correlation coefficient 0.16).
[Figure: scatter plot of number of blog out-links and number of blog in-links (both log scale).]
24
Topological Observations
What graph properties does the post network
exhibit?
25
Topological Observations
What graph properties does the post network
exhibit?

Very sparsely connected: 98% of posts are
isolated.

26
Topological Observations
What power laws does the post network exhibit?

Both in- and out-degree follow power laws:
in-degree has PL exponent -2.15, out-degree has
PL exponent -2.95.

[Figure: count vs. post in-degree; count vs. post out-degree.]
27
Topological Observations
How do we measure how information flows through
the network?

We gather cascades using the following procedure:
Find all initiators (out-degree 0).

[Figure: example post network on nodes a, b, c, d, e.]
28
Topological Observations
How do we measure how information flows through
the network?

We gather cascades using the following procedure:
Find all initiators (out-degree 0).
Follow in-links.

[Figure: example post network on nodes a, b, c, d, e, with in-links followed from the initiators.]
29
Topological Observations
How do we measure how information flows through
the network?

We gather cascades using the following procedure:
Find all initiators (out-degree 0).
Follow in-links.
Produces directed acyclic graph.

[Figure: cascades extracted from the example post network.]
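The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the (citing, cited) edge format, and the toy network are hypothetical, and isolated posts (trivial cascades) never appear because they have no links.

```python
from collections import defaultdict, deque

def extract_cascades(links):
    """Group links into cascades, one per initiator.

    links: iterable of (citing_post, cited_post) pairs.
    Returns a dict mapping each initiator to the set of edges reached by
    following in-links from it.
    """
    in_links = defaultdict(list)   # cited post -> posts that cite it
    has_out_link = set()
    posts = set()
    for citing, cited in links:
        in_links[cited].append(citing)
        has_out_link.add(citing)
        posts.update((citing, cited))

    cascades = {}
    # Step 1: initiators are posts with out-degree 0.
    for initiator in sorted(posts - has_out_link):
        edges, seen, queue = set(), {initiator}, deque([initiator])
        # Step 2: follow in-links breadth-first.
        while queue:
            node = queue.popleft()
            for citing in in_links.get(node, ()):
                edges.add((citing, node))
                if citing not in seen:
                    seen.add(citing)
                    queue.append(citing)
        cascades[initiator] = edges
    return cascades

# Hypothetical toy post network: e cites a and b, f cites e.
toy_links = [("e", "a"), ("e", "b"), ("f", "e")]
cascades = extract_cascades(toy_links)
```

Because a post can only link to posts that already exist, the edges collected for each initiator form a directed acyclic graph; a connector such as e here ends up in more than one cascade.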
30
Topological Observations
How do we measure how information flows through
the network?

Common cascade shapes are extracted using
algorithms in [Leskovec2006].

31
Topological Observations
How do we measure how information flows through
the network?

Number of edges increases linearly with
cascade size, while effective diameter increases
logarithmically, suggesting tree-like structures.

[Figure: number of edges vs. cascade size (# nodes); effective diameter vs. cascade size.]
32
Topological Observations
How do we measure how information flows through
the network?

We work with a bag of cascades: each cascade is
a disconnected subgraph.
We now explore some graph properties of cascades.

33
Topological Observations
What graph properties do cascades exhibit?

As before, in- and out-degree in the bag of cascades
follow power laws.

[Figure: count vs. cascade node out-degree; count vs. cascade node in-degree.]
34
Topological Observations
What graph properties do cascades exhibit?

Cascade size distributions also follow power law.

35
Topological Observations
What graph properties do cascades exhibit?

Cascade size distributions also follow power law.

Observation 2: The probability of observing a
cascade on n nodes follows a Zipf
distribution, $p(n) \propto n^{-2}$.
[Figure: count vs. cascade size (# of nodes).]
36
Topological Observations
What graph properties do cascades exhibit?
Stars and chains also follow a power law, with
different exponents (star -3.1, chain -8.5).
37
Topological Observations
What graph properties do cascades exhibit?
Stars and chains also follow a power law, with
different exponents (star -3.1, chain -8.5).
[Figure: count vs. size of chain (# nodes); count vs. size of star (# nodes).]
38
Outline

Motivation
Preliminaries
Temporal Observations
Topological Observations
What are graph properties for blog networks?
What shapes and patterns do cascades take on?
Cascade Generation Model
Epidemiological Background
Proposed Model
Experimental Validation
Discussion and Conclusions
Future Work

39
Epidemiological models

We consider modeling cascade generation as an
epidemic, with ideas as viruses.
We use the SIS model [Hethcote2000]:
At any time, an entity is in one of two states:
susceptible or infected.
One parameter, β, determines how easily
conversations spread.

47
Cascade Generation Model
0. Begin with Blog Net.
[Figure: blog network on blogs B1, B2, B3, B4 with weighted edges.]
48
Cascade Generation Model
0. Begin with Blog Net, but ignore edge weights.
Example: B1 links to B2, B2 links to B1, B4 links
to B2 and B1, as well as itself; B3 is isolated,
linking to itself.
[Figure: unweighted blog network on blogs B1, B2, B3, B4.]
49
Cascade Generation Model
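1. Randomly pick a blog to infect, add node to
cascade.
[Figure: blog network B1 to B4; cascade so far: B1.]
50
Cascade Generation Model
2. Infect each in-linked neighbor with
probability β.
[Figure: blog network B1 to B4; cascade so far: B1.]
51
Cascade Generation Model
2. Infect each in-linked neighbor with
probability β.
[Figure: blog network B1 to B4, with one neighbor marked INFECT and another marked DO NOT INFECT; cascade so far: B1.]
52
Cascade Generation Model
3. Add infected neighbors to cascade.
[Figure: blog network B1 to B4; cascade so far: B1, B4.]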
53
Cascade Generation Model
4. Set old infected nodes to uninfected.
[Figure: blog network B1 to B4; cascade so far: B1, B4.]
54
Cascade Generation Model
4. Set old infected nodes to uninfected.
Repeat steps 2-4 until no nodes are infected.
[Figure: blog network B1 to B4; cascade so far: B1, B4.]
55
Cascade Generation Model
4. Set old infected nodes to uninfected.
Repeat steps 2-4 until no nodes are infected.
[Figure: blog network B1 to B4, with a neighbor marked DO NOT INFECT; cascade so far: B1, B4.]
56
Cascade Generation Model
4. Set old infected nodes to uninfected.
Repeat steps 2-4 until no nodes are infected.
Completed cascade!
[Figure: blog network B1 to B4; completed cascade: B1, B4.]
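Steps 0 to 4 can be sketched as a short simulation. This is a minimal reading of the slides, not the authors' implementation: the function name, the `max_nodes` safety cap, and the choice to record each infection as a new cascade node are assumptions. The network is the unweighted four-blog example from step 0.

```python
import random

def generate_cascade(in_neighbors, beta, rng, max_nodes=10_000):
    """Generate one cascade.

    in_neighbors: dict mapping each blog to the blogs that link to it.
    Returns (blogs, edges): blogs[i] is the blog of cascade node i, and each
    edge (child, parent) says node `child` was infected by node `parent`.
    """
    # Step 1: pick a random blog to infect and add it to the cascade.
    blogs = [rng.choice(sorted(in_neighbors))]
    edges = []
    infected = [0]                      # cascade nodes infected this round
    # max_nodes is a hypothetical guard against runaway growth at high beta.
    while infected and len(blogs) < max_nodes:
        newly_infected = []
        for parent in infected:
            # Step 2: infect each in-linked neighbor with probability beta.
            for neighbor in in_neighbors[blogs[parent]]:
                if rng.random() < beta:
                    # Step 3: add the infected neighbor to the cascade.
                    blogs.append(neighbor)
                    edges.append((len(blogs) - 1, parent))
                    newly_infected.append(len(blogs) - 1)
        # Step 4: old infected nodes become uninfected (susceptible again).
        infected = newly_infected
    return blogs, edges

# Example network from the slides, edge weights ignored:
# B1 -> B2, B2 -> B1, B4 -> B2, B4 -> B1, B4 -> B4, B3 -> B3.
in_neighbors = {
    "B1": ["B2", "B4"],
    "B2": ["B1", "B4"],
    "B3": ["B3"],
    "B4": ["B4"],
}
rng = random.Random(0)
cascade_blogs, cascade_edges = generate_cascade(in_neighbors, beta=0.025, rng=rng)
```

With a small infection probability most runs stop after the first node, which corresponds to a trivial cascade; larger cascades are rare, in line with a heavy-tailed size distribution.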
57
CGM matches observations

After trying several values, we settle on β = 0.025.
10 simulations, 2 million cascades each.
Most frequent cascades: 7 of 10 matched exactly.

[Figure: most frequent cascade shapes, model vs. data.]
58
CGM matches observations
Cascade size in this model also follows a power
law; the model distribution is shown with the
real data points.
[Figure: count vs. cascade size (number of nodes).]
59
CGM matches observations

Stars and chains both follow power laws, close to
those observed in real data.

[Figure: count vs. star size; count vs. chain size.]
60
Results in brief

Analyzed one of the largest available collections of
blog information.
Two networks: post network and blog network.
Discovered several properties of the networks.
Also analyzed properties of cascades.
Presented generative model for cascades.

61
Immediate questions answered

Temporal questions: Does popularity have a
half-life? Is there periodicity?
Popularity drop-off follows a power-law
distribution, exactly as found in response times
in other work. We do find that posts follow
weekly periodicity.

[Figure: number of in-links vs. days after post.]
62
Immediate questions answered

Topology: What topological patterns do posts and
blogs follow? What shapes do cascades take on?
Stars? Chains? Something else?
We find power-law distributions in almost every
topological property. In cascade shapes, stars
are more common than chains, and the size of cascades
follows a power law. Cascades are tree-like.

[Figure: count vs. size of chain (# nodes); count vs. size of star (# nodes).]
63
Immediate questions answered

Can a simple model replicate this behavior?
Yes. We developed a model based on the SIS model
in epidemiology. It is a simple model with only
one parameter, and it produces behavior
remarkably similar to that found in the dataset.

[Figure: count vs. star size; count vs. chain size.]
64
Future work and applications

This work suggested that ideas may behave like
viruses under an SIS model.
This may be useful for mapping social/political
trends.
Further investigation into these properties may
also allow us early detection of changes in
social or economic structure.

65
Related work

For explanation of SIS model:
[Hethcote2000] H. W. Hethcote. The mathematics
of infectious diseases. SIAM Rev., 42(4):599-653,
2000.
For algorithms for extracting cascade shapes:
[Leskovec2006] J. Leskovec, A. Singh, and J.
Kleinberg. Patterns of influence in a
recommendation network. PAKDD 2006.
For some modeling of power laws:
[Vazquez2006] A. Vazquez, J. G. Oliveira, Z.
Dezso, K. I. Goh, I. Kondor, and A. L. Barabasi.
Modeling bursts and heavy tails in human
dynamics. Physical Review E, 73:036127, 2006.

66
Additional Info

Mary McGlohon
www.cs.cmu.edu/mmcgloho
mcglohon_at_cmu.edu

67
Acknowledgments

Mary McGlohon was partially supported by an NSF
Graduate Fellowship.
Jure Leskovec was partially supported by a
Microsoft Fellowship.

68
Questions?
69

EXTRA SLIDES BEGIN HERE!

70
Preliminaries- PCA

We will work with very high-dimensional data
(9,000 dimensions).
Principal Component Analysis is a method of
dimensionality reduction.

Hypothetically, for each blog...
[Figure: hypothetical scatter plot with axes depth upwards and conversation mass upwards.]
73
Preliminaries- PCA
We can represent any real N x M matrix X as
$X = U \Sigma V^T$,
where U is size N x r, r is the rank of matrix
X, $\Sigma$ is a diagonal r x r matrix, and V is M x r.
[Figure: block diagram of the factorization of X into U, $\Sigma$, and $V^T$.]
74
Preliminaries- PCA

Reduce dimensionality by setting all but the
largest components of $\Sigma$ to zero.
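As a worked statement of this truncation, in assumed notation: write $\Sigma$ for the diagonal matrix of the factorization, $\sigma_i$ for its diagonal entries in decreasing order, $u_i$ and $v_i$ for the columns of U and V, and k for a hypothetical number of components to keep.

```latex
\begin{align*}
X &= U \Sigma V^{T} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{T},
  \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 \\
X_k &= \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}
  \qquad (\sigma_{k+1}, \dots, \sigma_r \text{ set to zero},\ k \le r)
\end{align*}
```

Each row of X is then described by k coordinates instead of M. Projecting onto two principal components corresponds to k = 2.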


75
Preliminaries- PCA
Reference: Fukunaga, K. (1990). Introduction to
Statistical Pattern Recognition. Academic Press.

76
Preliminaries- Regularizing data

Not everything in life is normally distributed.

[Figure: blog properties, linear-linear scale. Axes: total in-links, total conversation mass downwards.]
77
Preliminaries- Regularizing data

Not everything in life is normally distributed.

[Figure: blog properties, linear-linear scale, with a region annotated "99.4% of points!". Axes: total in-links, total conversation mass downwards.]
78
Preliminaries- Regularizing data

Not everything in life is normally distributed.

[Figure: blog properties, linear-linear scale. Axes: total in-links, total conversation mass downwards.]
Try to fit a line...
79
Preliminaries- Regularizing data

Not everything in life is normally distributed.

[Figure: blog properties, linear-linear scale. Axes: total in-links, total conversation mass downwards.]
Try to fit a line... Outliers dramatically
affect fit.
80
Preliminaries- Regularizing data

Not everything in life is normally distributed.
Therefore, we propose to take log(count + 1).

[Figure: blog properties, log-log scale. Axes: total in-links, total conversation mass downwards.]
81
Preliminaries- Regularizing data

Not everything in life is normally distributed.
Therefore, we propose to take log(count + 1).

[Figure: blog properties, log-log scale. Axes: total in-links, total conversation mass downwards.]
Outlier effects are minimized.
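A minimal sketch of the log(count + 1) transform, assuming the natural logarithm (the slides do not state the base). The function name and the sample counts are hypothetical.

```python
import math

def log_transform(counts):
    """Map each count c to log(c + 1), so that a count of zero maps to 0."""
    return [math.log1p(c) for c in counts]

# Hypothetical in-link counts for five blogs, including one extreme outlier.
in_links = [0, 1, 9, 99, 9999]
transformed = log_transform(in_links)
```

The +1 keeps zero counts defined, and the logarithm compresses the outlier: a raw range of 0 to 9999 becomes roughly 0 to 9.2.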
82

Suppose we want to cluster blogs based on
content. What features do we use per blog?

83
CascadeType

Perform PCA on sparse matrix.
Use log(count + 1).
Project onto 2 PCs.

[Figure: sparse matrix of 44,000 blogs by 9,000 cascade types.]
84
CascadeType Results

Observation: Content of blogs and cascade
behavior are often related.

Distinct clusters for conservative and
humorous blogs (hand-labeling).

86

Suppose we want to cluster blog posts. What
features do we use?

87
Preliminaries- Blogs

There are several terms we use to describe
cascades:
In-link, out-link:
Green node has one out-link.
Yellow node has one in-link.
Depth downwards/upwards:
Pink node has an upward depth of 1,
downward depth of 2.
Conversation mass upwards/downwards:
Pink node has upward CM 1,
downward CM 3.
88
PostFeatures
Features per post: in-links, out-links, CM up, CM down,
depth up, depth down.
Run PCA.
[Figure: feature matrix over 2,400,000 posts.]
89
PostFeatures Results

Observation: Posts within a blog tend to retain
similar network characteristics.

90
PostFeatures Results

Observation: Posts within a blog tend to retain
similar network characteristics.

PC1: CM upward
PC2: CM downward
We show this scatter plot instead.

[Figure: scatter plot of posts from MichelleMalkin and Dlisted.]
91
Ranking blogs by PostFeatures

Conversation mass up/down gives a better
understanding of the blog posts than in-links and
out-links.
Therefore, we may choose to rank blogs based on
these attributes.

92
Blogs ranked by CM vs in-links
Top blogs by conversation mass:

1. michellemalkin.com
2. boingboing.net
3. imao.us (75)
4. captainsquartersblog.com/mt
5. instapundit.com
6. radioequalizer.blogspot.com (53)
7. powerlineblog.com
8. waxy.org/links
9. washingtonmonthly.com
10. kottke.org/reminder

Top blogs by in-links:

1. boingboing.net
2. michellemalkin.com
3. instapundit.com
4. waxy.org/links
5. kottke.com/reminder
6. patriotdaily.com (11)
7. captainsquartersblog.com/mt
8. powerlineblog.com
9. washingtonmonthly.com
10. petashon.com (30)
93
Blogs ranked by CM vs in-links
Top blogs by conversation mass
Top blogs by in-links

michellemalkin.com
boingboing.net
imao.us (75)
captainsquartersblog.com/mt

boingboing.net
michellemalkin.com
instapundit.com
waxy.org/links

..... 10 petashon.com (30)
in-links 2 CM 6
in-links 5 CM 5

Perhaps IMAO has longer cascades, just fewer
in-links, while petashun has stars.
94
BlogTimeFractal some time series

Problem: time series data is nonuniform and
difficult to analyze.
Any patterns?
Any measures?

[Figure: in-links over time.]
95
BlogTimeFractal Definitions

Any patterns?
Self similarity!
The 80-20 law describes self-similarity.
For any sequence, we divide it into two
equal-length subsequences. 80% of traffic is in
one, 20% in the other.
Repeat recursively.
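The recursive split can be sketched as a small generator of synthetic traffic. This is an illustration only: the function name and parameters are hypothetical, and for simplicity the larger share always goes to the first half, whereas the slide only says it goes to one of the two.

```python
def b_model(total, levels, bias=0.8):
    """Spread `total` over 2**levels intervals by recursive biased splits."""
    series = [float(total)]
    for _ in range(levels):
        # Split every interval in two: `bias` of its traffic goes to the
        # first half, the rest to the second half.
        series = [mass * share for mass in series for share in (bias, 1.0 - bias)]
    return series

# 1000 units of traffic spread over 2**3 = 8 intervals with the 80-20 law.
traffic = b_model(total=1000, levels=3)
```

The first half of the series carries 80% of the traffic, the first quarter carries 80% of that, and so on at every scale, which is the self-similarity the slide describes.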

96
Self-similarity

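The bias factor for the 80-20 law is b = 0.8.

[Figure: a sequence split into two halves carrying 20% and 80% of the traffic.]
97
Self-similarity

The bias factor for the 80-20 law is b = 0.8.

[Figure: a sequence split into two halves carrying 20% and 80% of the traffic.]
Q: How do we estimate b?
98
Self-similarity

The bias factor for the 80-20 law is b = 0.8.

[Figure: a sequence split into two halves carrying 20% and 80% of the traffic.]
Q: How do we estimate b?
A: Entropy plots!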
99
BlogTimeFractal

An entropy plot plots entropy vs. resolution.
From time series data, begin with resolution R =
T/2.
Record entropy $H_R$.

100
BlogTimeFractal

An entropy plot plots entropy vs. resolution.
From time series data, begin with resolution R =
T/2.
Record entropy $H_R$.
Recursively take finer resolutions.

101
BlogTimeFractal

An entropy plot plots entropy vs. resolution.
From time series data, begin with resolution R =
T/2.
Record entropy $H_R$.
Recursively take finer resolutions.

102
BlogTimeFractal Definitions

Entropy measures the non-uniformity of the histogram
at a given resolution.
We define the entropy of our sequence at a given R as
$H_R = -\sum_{t} p(t) \log_2 p(t)$,
where p(t) is the percentage of posts from a blog on
interval t, R is the resolution, and $2^R$ is the number of
intervals.

103
BlogTimeFractal

For a b-model (and self-similar cases), the entropy
plot is linear. The slope s will tell us the
bias factor.
Lemma: For traffic generated by a b-model, the
bias factor b obeys the equation
$s = -b \log_2 b - (1-b) \log_2 (1-b)$
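The lemma can be checked numerically with a short sketch. Everything here is illustrative: the function names are hypothetical, entropy is taken to be Shannon entropy in bits over 2**R equal intervals, the slope is read off the two end points of the entropy plot, and the equation is inverted by bisection on [0.5, 1], where the right-hand side is decreasing in b.

```python
import math

def entropy_at_resolution(series, resolution):
    """Shannon entropy (bits) of the series summed into 2**resolution bins."""
    bins = 2 ** resolution
    width = len(series) // bins
    total = sum(series)
    entropy = 0.0
    for i in range(bins):
        p = sum(series[i * width:(i + 1) * width]) / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

def binary_entropy(b):
    """Right-hand side of the lemma: -b log2 b - (1-b) log2 (1-b)."""
    if b <= 0.0 or b >= 1.0:
        return 0.0
    return -b * math.log2(b) - (1 - b) * math.log2(1 - b)

def bias_from_slope(slope, tol=1e-9):
    """Solve slope = binary_entropy(b) for b in [0.5, 1] by bisection."""
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_entropy(mid) > slope:
            lo = mid      # entropy still too high: b must be larger
        else:
            hi = mid
    return (lo + hi) / 2

# Synthetic 80-20 traffic over 2**6 intervals (larger share always first).
series = [1.0]
for _ in range(6):
    series = [mass * share for mass in series for share in (0.8, 0.2)]

entropies = [entropy_at_resolution(series, r) for r in range(1, 7)]
slope = (entropies[-1] - entropies[0]) / (len(entropies) - 1)
estimated_bias = bias_from_slope(slope)
```

For this synthetic series the entropy plot is exactly linear and the recovered bias is 0.8. Applied to a measured slope of 0.85, the same inversion gives a bias close to 0.72.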

104
Entropy Plots

Linear plot ⇒ Self-similarity

[Figure: entropy vs. resolution.]
105
Entropy Plots

Linear plot ⇒ Self-similarity
Uniform: slope s = 1, bias = 0.5
Point mass: s = 0, bias = 1

[Figure: entropy vs. resolution.]
106
Entropy Plots

Linear plot ⇒ Self-similarity
Uniform: slope s = 1, bias = 0.5
Point mass: s = 0, bias = 1

Michelle Malkin in-links: s = 0.85. By Lemma 1, b =
0.72.
[Figure: entropy vs. resolution.]
107
BlogTimeFractal Results

Observation: Most time series of interest are
self-similar.
Observation: Bias factor is approximately 0.7;
that is, more bursty than uniform (70/30 law).

Entropy plots for MichelleMalkin:
in-links, b = 0.72; conversation mass, b = 0.76; number
of posts, b = 0.70
108

Other related work

109
Ali-Hasan, Adamic 2007
Expressing Social Relationships on the Blog
through Links and Comments
Analyzed three blog communities:
Dallas-Fort Worth: most links are external to
community (91%); low centralization; low
reciprocity.
UAE: fewer links external to community; more
centralization; obvious hub structure.
Kuwait: fewest links external to community
(53%); highly centralized; much reciprocity.
110
Duarte et al. 2007